It's in chunks of about 32 KB. Here's the relevant code from `pipes-bytestring`:

    -- | Convert a 'IO.Handle' into a byte stream using a default chunk size
    fromHandle :: MonadIO m => IO.Handle -> Producer' ByteString m ()
    fromHandle = hGetSome defaultChunkSize

    {-| Convert a handle into a byte stream using a maximum chunk size

        'hGetSome' forwards input immediately as it becomes available, splitting
        the input into multiple chunks if it exceeds the maximum chunk size.
    -}
    hGetSome :: MonadIO m => Int -> IO.Handle -> Producer' ByteString m ()
    hGetSome size h = go
     where
        go = do
            bs <- liftIO (BS.hGetSome h size)
            if (BS.null bs)
                then return ()
                else do
                    yield bs
                    go

... and `pipes-attoparsec` likewise feeds input to the parser in chunks.

What will happen on your first parsed element is something like this:

* Your attoparsec `Parser` requests a chunk
* `pipes-attoparsec` feeds the parser the file's first 32 KB chunk
* The parser consumes a few bytes of the chunk, returning the parsed result and the remainder of the chunk (still about 32 KB, because the parser consumed so little)
* `pipes-attoparsec` stores the remainder and feeds it to the next `attoparsec` parser
* After parsing several elements the chunk will eventually be nearly exhausted. Let's say we have 2 bytes of the chunk left.
* `pipes-attoparsec` feeds these 2 bytes to the `attoparsec` parser, but let's assume it needs 4 bytes
* `pipes-attoparsec` requests another 32 KB chunk from the handle and feeds that to the parser
* The parser then returns the remainder of the fresh 32 KB chunk when it's done
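You can see this chunk-boundary behavior directly with `attoparsec`'s incremental interface (a minimal sketch using `parse` from `Data.Attoparsec.ByteString`, not `pipes-attoparsec` itself; the 2-byte and 4-byte figures mirror the example above):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString as A

-- A parser that needs exactly 4 bytes, fed a 2-byte "chunk" first
main :: IO ()
main =
    case A.parse (A.take 4) "ab" of
        -- Not enough input yet: attoparsec returns a continuation
        -- instead of failing, just like the 2-leftover-bytes case above
        A.Partial k ->
            case k "cdef" of
                -- The parser completes and hands back the unconsumed
                -- remainder of the fresh chunk
                A.Done rest result -> do
                    print result  -- "abcd"
                    print rest    -- "ef"
                r -> print r
        r -> print r
```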

So generally everything will be very fast because:

* `pipes-bytestring` keeps the chunk size large (32 KB)
* `pipes-bytestring`, `attoparsec`, and `pipes-attoparsec` all avoid allocating new bytestrings in this scenario.

At most they will slice up the 32 KB chunk, but this slicing takes O(1) time and does not allocate any new memory because it reuses the original chunk's buffer. This reuse is safe because of purity (there's no danger that the underlying buffer will mutate).
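For example, this is what the slicing looks like with `Data.ByteString.splitAt` (a minimal illustration; the buffer sharing itself isn't observable from pure code, but this is the operation performed on leftovers):

```haskell
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main = do
    let chunk = BC.pack "hello world"
        -- Both halves are slices into chunk's original buffer:
        -- splitAt only adjusts an offset and a length, copying no bytes
        (consumed, leftover) = BC.splitAt 6 chunk
    print consumed  -- "hello "
    print leftover  -- "world"
```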

This also means that the entire program will run in roughly double the chunk size of memory (64 KB), mainly because the handle keeps a 32 KB buffer of its own, if I remember correctly. With some unsafe tricks you can eliminate that extra memory overhead and save one copy, but I think the current approach is fast enough for most purposes.

If you want to change the chunk size, just use `hGetSome` instead of `fromHandle`, since `hGetSome` takes an additional argument specifying the desired chunk size.
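For example (a sketch using the `Pipes.ByteString` `hGetSome` shown above; the file name, file size, and 64 KB chunk size are arbitrary choices, and the sample file is written first just to keep the example self-contained):

```haskell
import qualified Data.ByteString as BS
import qualified Pipes.ByteString as PB
import qualified Pipes.Prelude as P
import qualified System.IO as IO

main :: IO ()
main = do
    -- Write a 200000-byte sample file so the example is self-contained
    BS.writeFile "sample.bin" (BS.replicate 200000 0)
    IO.withFile "sample.bin" IO.ReadMode $ \h -> do
        -- Stream the file in 64 KB chunks instead of the default 32 KB
        n <- P.length (PB.hGetSome 65536 h)
        -- For a regular file this should be 4: three full 65536-byte
        -- chunks plus the 3392-byte tail
        print n
```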

On 6/13/14, 1:02 PM, Daniel Hlynskyi wrote:
Cool!

One more question. How can I reason about file reading in this situation? Is it read byte-by-byte and fed to parser byte-by-byte, or there are some chunks? If there are chunks, then what is the size and am I able to change it? If not, is it hard to make this code read-and-parse file chunks?

It is very cool to see the differences between approaches (lazy io VS pipes) in code. Such examples are really helpful.


2014-06-13 17:53 GMT+03:00 Gabriel Gonzalez <[email protected] <mailto:[email protected]>>:

    Here you go:

        import Control.Monad (void)
        import Data.Char (ord)
        import Data.Word (Word8)

        import Control.Applicative
        import qualified Data.ByteString as B
        import Data.Attoparsec.ByteString
        import Pipes.Attoparsec (parsed)
        import Pipes.ByteString (fromHandle)
        import Pipes
        import qualified Pipes.Prelude as Pipes
        import qualified System.IO as IO

        main = IO.withFile "euler42.txt" IO.ReadMode $ \handle -> do
            n <- Pipes.length $ void $  -- Ignore errors for simplicity
                parsed parser (fromHandle handle) >-> Pipes.filter isTriangle
            print n
          where
            isTriangle _ = True

        parser :: Parser Word8
        parser =
                fmap wordValue
            $   word8 quote
            *>  takeWhile1 (/= quote)
            <*  word8 quote
            <*  optional (word8 comma)

          where
            wordValue = B.foldl' (+) 0 . B.map (subtract 64)
            quote = fromIntegral (ord '"')
            comma = fromIntegral (ord ',')

    The trick is to only use `attoparsec` to parse a single value, so
    that it never backtracks further than one word.  The reason that
    `attoparsec` chokes on large files is that it retains all the input
    it has seen so that it can backtrack.

    The way `pipes-attoparsec` is designed to work is that you use
    `attoparsec` to define a backtracking parser for a small unit of
    input (i.e. a line or word), but then after each successful parse
    you commit and don't look back.  The first argument to `parsed` is
    the parser for the backtrackable single element, and
    `pipes-attoparsec` makes sure to commit and not backtrack every
    time the parser succeeds.  That's why it runs in constant space
    compared to `attoparsec`.

    The rest is almost identical to what you proposed.  The only
    difference is that I moved the `map` directly into the `Parser`
    logic instead of making it a separate pipe.


    On 06/12/2014 09:13 PM, Daniel Hlynskyi wrote:
    There is simple problem on Project Euler -
    http://projecteuler.net/problem=42 , which boils down to

    import Data.Char
    import Data.Word
    import Control.Applicative
    import qualified Data.ByteString as B
    import Data.Attoparsec.ByteString

    main = do
        result <- solution <$> parseFile <$> B.readFile "euler42.txt"
        print result

    solution :: [B.ByteString] -> Int
    solution =
        length . filter isTriangle . fmap wordValue
      where
        wordValue = B.foldl' (+) 0 . B.map (subtract 64)
        isTriangle = const True {- some secret function -}

    parseFile :: B.ByteString -> [B.ByteString]
    parseFile = either (const []) id .
        parseOnly (wordParser `sepBy1` word8 (char ','))
      where
        wordParser = word8 (char '"') *> takeWhile1 (/= char '"') <* word8 (char '"')
        char = fromIntegral . ord

    Now the problem is: given large input file (150 Mb) this solution
    leads to OutOfMemory.

    Seems like pipes are designed to solve this problem in the same
    compositional way, so the question is: what is the correct way to
    "pipify" this small program?

    As I understand, there must be:
    1) Producer of ByteString's, based on pipes-parse and
    pipes-attoparsec
    2) Some analogues of "map", "filter" and "length" on pipe streams
    3) extract result (unpipify)

    The 2) seems easy. We should get something like

    solution :: Producer B.ByteString IO () -> Producer Int IO ()
    solution =
        P.length <-< P.filter isTriangle <-< P.map wordValue

    but streaming from parsing and extracting is not that obvious.
    Could someone help me?
    --
    You received this message because you are subscribed to the
    Google Groups "Haskell Pipes" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to [email protected]
    <mailto:[email protected]>.
    To post to this group, send email to
    [email protected]
    <mailto:[email protected]>.


