Hi all,

I have a use case where I need to chunk a stream of values, i.e. turn a 
`Producer a m ()` into a `Producer (Chunk a) m ()`, where `Chunk` is something 
like an in-memory vector.

This is a small complete example of what I was trying to do: 

import           Control.Lens        (view)
import           Control.Monad
import qualified Data.List           as L
import           Data.Vector         (Vector)
import qualified Data.Vector         as V
import qualified Data.Vector.Mutable as VM
import           Pipes
import           Pipes.Group

main :: IO ()
main
  = let nums   = each $ L.replicate 1000000 (1 :: Int)
        chunks = chunkify 4096 nums
    in  runEffect $ chunks >-> inspectChunk
  where
    -- eval each element in a chunk to whnf
    inspectChunk
      = forever
      $ do Chunk v <- await
           V.mapM_ (\x -> x `seq` return ()) v

newtype Chunk a = Chunk { chunkVec :: Vector a }

chunkify :: Int -> Producer a IO () -> Producer (Chunk a) IO ()
chunkify size p
  = let chunks = {-# SCC view_chunksOf #-} view (chunksOf size) p
    in  foldsM addToChunk (newChunk size) freezeChunk chunks

newChunk :: Int -> IO (Int, VM.IOVector a)
newChunk size
  = (,) 0 <$> VM.new size

freezeChunk :: (Int, VM.IOVector a) -> IO (Chunk a)
freezeChunk (i, v)
  = Chunk <$> V.unsafeFreeze (VM.slice 0 i v)

addToChunk :: (Int, VM.IOVector a) -> a -> IO (Int, VM.IOVector a)
addToChunk (i, v) x
  = do VM.write v i x
       return (i + 1, v)


This works, but it is very inefficient: the majority of the time is spent 
chunking. In an ideal world, I would expect chunking the input to have 
minimal cost compared to the actual computation (in this case, 
`inspectChunk`). Here is a profile of the above:

COST CENTRE       MODULE    %time %alloc

view_chunksOf     Main       35.4   39.6
chunkify          Main       23.3   19.2
main.inspectChunk Main       19.1   25.6
addToChunk        Main       16.6   10.2
main.nums         Main        5.3    5.1
                                                              individual      inherited
COST CENTRE          MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                 MAIN                    156           0    0.0    0.0   100.0  100.0
 CAF                 Main                    311           0    0.0    0.0    99.7  100.0
  main               Main                    312           1    0.0    0.0    99.7  100.0
   main.nums         Main                    318           1    4.6    5.1     4.6    5.1
   main.chunks       Main                    314           1    0.0    0.0    76.0   69.3
    chunkify         Main                    315           1   23.3   19.2    76.0   69.3
     freezeChunk     Main                    322         245    0.0    0.0     0.0    0.0
     addToChunk      Main                    320     1000000   16.6   10.2    16.6   10.2
     newChunk        Main                    319           1    0.0    0.3     0.0    0.3
     chunkify.chunks Main                    316           1    0.0    0.0    36.1   39.6
      view_chunksOf  Main                    317           1   35.4   39.6    36.1   39.6
       main.nums     Main                    321           0    0.7    0.0     0.7    0.0
   main.inspectChunk Main                    313           1   19.1   25.6    19.1   25.6


In the full program I was working on, which does more computation than just 
inspecting each element and throwing it away, chunking is still a huge 
bottleneck.
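For a baseline: the plain-list analogue of what I want is just repeated 
`splitAt`, whose cost is a single pass over the input. This is roughly the 
cost I'd hope the streaming version could approach (`chunksOfList` is my 
name for it, not a library function):

```haskell
import Data.List (unfoldr)

-- Split a list into chunks of at most n elements (n > 0),
-- in one traversal, via repeated splitAt.
chunksOfList :: Int -> [a] -> [[a]]
chunksOfList n = unfoldr step
  where
    step [] = Nothing
    step xs = Just (splitAt n xs)

main :: IO ()
main = do
  print (chunksOfList 3 [1 .. 8 :: Int])
  -- prints [[1,2,3],[4,5,6],[7,8]]
  print (length (chunksOfList 4096 (replicate 1000000 ())))
  -- prints 245, matching the 245 freezeChunk entries in the profile above
```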

I expected chunking to spend some time looking at each element and counting 
down from the chunk size (like what `splitAt` does), but I didn't expect it 
to be so expensive. Could I improve this somehow? Or maybe I'm approaching 
this wrong?
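One direction I've been wondering about (an untested sketch, and 
`chunkifyNext` is my own name, not a pipes function): drive the producer 
directly with `next` and fill the buffer imperatively, skipping the `FreeT` 
layer that `chunksOf`/`foldsM` introduce. The `Chunk` newtype is repeated 
here so the snippet stands alone:

```haskell
import           Control.Monad       (when)
import           Data.Vector         (Vector)
import qualified Data.Vector         as V
import qualified Data.Vector.Mutable as VM
import           Pipes

newtype Chunk a = Chunk { chunkVec :: Vector a }

-- Fill a mutable buffer by pulling elements with `next`, yielding a
-- frozen chunk whenever the buffer fills, plus a final short chunk
-- (if non-empty) when the upstream producer is exhausted.
chunkifyNext :: Int -> Producer a IO () -> Producer (Chunk a) IO ()
chunkifyNext size = go
  where
    go p = do
      v         <- lift (VM.new size)
      (i, rest) <- lift (fill v 0 p)
      when (i > 0) $ do
        frozen <- lift (V.unsafeFreeze (VM.slice 0 i v))
        yield (Chunk frozen)
      case rest of
        Nothing -> return ()   -- upstream finished
        Just p' -> go p'
    fill v i p
      | i >= size = return (i, Just p)
      | otherwise = do
          r <- next p
          case r of
            Left ()       -> return (i, Nothing)
            Right (x, p') -> do VM.write v i x
                                fill v (i + 1) p'
```

If the `chunksOf`/`foldsM` overhead really is the problem, this should bring 
chunking close to the cost of `addToChunk` alone, but I haven't profiled it.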

Background: this is part of a program that splits/merges/etc. streams of 
inputs, where "stream" could be actual streams or just Data.List lists. 
Embarrassingly, the stream version is much slower than the list version =( 

Thanks,

-- 
You received this message because you are subscribed to the Google Groups 
"Haskell Pipes" group.