It's read in chunks of about 32 KB. Here's the relevant code from
`pipes-bytestring`:
-- | Convert a 'IO.Handle' into a byte stream using a default chunk size
fromHandle :: MonadIO m => IO.Handle -> Producer' ByteString m ()
fromHandle = hGetSome defaultChunkSize

{-| Convert a handle into a byte stream using a maximum chunk size

    'hGetSome' forwards input immediately as it becomes available,
    splitting the input into multiple chunks if it exceeds the maximum
    chunk size.
-}
hGetSome :: MonadIO m => Int -> IO.Handle -> Producer' ByteString m ()
hGetSome size h = go
  where
    go = do
        bs <- liftIO (BS.hGetSome h size)
        if (BS.null bs)
            then return ()
            else do
                yield bs
                go
... and `pipes-attoparsec` also feeds input to the parser in chunks.
What will happen on your first parsed element is something like this:
* Your attoparsec `Parser` requests a chunk
* `pipes-attoparsec` feeds the parser the file's first 32 KB chunk
* The parser consumes a few bytes of the chunk, returning the parsed
result along with the remainder of the chunk (still roughly 32 KB,
because the parser consumed so little)
* `pipes-attoparsec` stores the remainder and feeds it to the next
`attoparsec` parser
* After parsing several elements, the chunk will eventually be nearly
exhausted. Let's say we have 2 bytes of the chunk left.
* `pipes-attoparsec` feeds these 2 bytes to the `attoparsec` parser, but
let's assume that it needs 4 bytes.
* `pipes-attoparsec` will get another 32 KB chunk from the handle and
then feed that to the parser
* The parser will then return the remainder of the fresh 32 KB chunk
when it's done
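The steps above can be sketched with plain `attoparsec`: handing a chunk to `parse` gives back the parsed value together with the unconsumed remainder of that same chunk. (The comma-separated-numbers format here is just a hypothetical stand-in for illustration; it is not the format from the thread.)

```haskell
import Data.Attoparsec.ByteString.Char8 (IResult (..), char, decimal, parse)
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main =
    -- Hand the parser one chunk; `Done` carries both the parsed value
    -- and the leftover bytes, which a driver like `pipes-attoparsec`
    -- would store and feed to the next parse.
    case parse (decimal <* char ',') (BC.pack "42,17,99") of
        Done rest n -> print (n :: Int, rest)  -- prints (42,"17,99")
        _           -> putStrLn "needs more input or failed"
```
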
So generally everything will be very fast because:
* `pipes-bytestring` keeps the chunk size large (32 KB)
* `pipes-bytestring`, `attoparsec`, and `pipes-attoparsec` all avoid
allocating new bytestrings in this scenario.
At most they will slice up the 32 KB chunk, but this slicing takes O(1)
time and does not allocate any new memory because it reuses the original
chunk's buffer. This reuse is safe because of purity (there's no danger
that the underlying buffer will mutate).
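A minimal sketch of that O(1) slicing, using `Data.ByteString.splitAt` (the names `sliceChunk`, `front`, and `rest` are mine, just for the demo): both halves are views into the original buffer, so no bytes are copied.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Slice a chunk in O(1): both results share the original buffer,
-- which is safe because ByteStrings are immutable.
sliceChunk :: Int -> B.ByteString -> (B.ByteString, B.ByteString)
sliceChunk = B.splitAt

main :: IO ()
main = do
    let (front, rest) = sliceChunk 5 (BC.pack "hello, world")
    print front  -- "hello"
    print rest   -- ", world"
```
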
This also means that the entire program will run in roughly double the
chunk size of memory (64 KB), mainly because the handle also keeps a
32 KB buffer of its own, if I remember correctly. With some unsafe
tricks you can even get rid of that extra memory overhead and save a
copy, but I think the current approach is fast enough for most purposes.
If you want to change the chunk size, just use `hGetSome` instead of
`fromHandle`, since `hGetSome` takes an additional argument specifying
the desired chunk size.
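For reference, the underlying `Data.ByteString.hGetSome` is what enforces the chunk size. This small sketch is a plain-IO analogue of the producer's loop above (not the pipes version), reading a handle as a list of chunks of at most `size` bytes; the file name and contents are assumptions for the demo.

```haskell
import qualified Data.ByteString as BS
import System.IO

-- Drain a handle as a list of chunks of at most `size` bytes each,
-- mirroring the loop inside `hGetSome` above but collecting the
-- chunks in a list instead of yielding them downstream.
readChunks :: Int -> Handle -> IO [BS.ByteString]
readChunks size h = do
    bs <- BS.hGetSome h size
    if BS.null bs
        then return []
        else (bs :) <$> readChunks size h

main :: IO ()
main = do
    writeFile "sample.txt" "abcdefghij"  -- hypothetical 10-byte input
    chunks <- withFile "sample.txt" ReadMode (readChunks 4)
    print (map BS.length chunks)  -- [4,4,2]
```
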
On 6/13/14, 1:02 PM, Daniel Hlynskyi wrote:
Cool!
One more question. How can I reason about file reading in this
situation? Is the file read byte by byte and fed to the parser byte by
byte, or is it read in chunks? If there are chunks, what is their size,
and can I change it? If not, is it hard to make this code read and
parse the file in chunks?
It is very cool to see the differences between the approaches (lazy IO
vs. pipes) in code. Such examples are really helpful.
2014-06-13 17:53 GMT+03:00 Gabriel Gonzalez <[email protected]>:
Here you go:
import Control.Monad (void)
import Data.Char (ord)
import Data.Word (Word8)
import Control.Applicative
import qualified Data.ByteString as B
import Data.Attoparsec.ByteString
import Pipes.Attoparsec (parsed)
import Pipes.ByteString (fromHandle)
import Pipes
import qualified Pipes.Prelude as Pipes
import qualified System.IO as IO
main = IO.withFile "euler42.txt" IO.ReadMode $ \handle -> do
    n <- Pipes.length $ void $  -- Ignore errors for simplicity
        parsed parser (fromHandle handle) >-> Pipes.filter isTriangle
    print n
  where
    isTriangle _ = True
parser :: Parser Word8
parser =
    fmap wordValue
        (  word8 quote
        *> takeWhile1 (/= quote)
        <* word8 quote
        <* optional (word8 comma) )
  where
    wordValue = B.foldl' (+) 0 . B.map (subtract 64)
    quote = fromIntegral (ord '"')
    comma = fromIntegral (ord ',')
The trick is to only use `attoparsec` to parse a single value, so
that it never backtracks further than one word. The reason that
`attoparsec` chokes on large files is that it has to backtrack.
The way `pipes-attoparsec` is designed to work is that you use
`attoparsec` to define a backtracking parser for a small unit of
input (i.e. a line or word), but then after each successful parse
you commit and don't look back. The first argument to `parsed` is
the parser for the backtrackable single element, and
`pipes-attoparsec` makes sure to commit and not backtrack every
time the parser succeeds. That's why it runs in constant space
compared to `attoparsec`.
The rest is almost identical to what you proposed. The only
difference is that I moved the `map` directly into the `Parser`
logic instead of making it a separate pipe.
On 06/12/2014 09:13 PM, Daniel Hlynskyi wrote:
There is a simple problem on Project Euler -
http://projecteuler.net/problem=42 - which boils down to
import Data.Char
import Data.Word
import Control.Applicative
import qualified Data.ByteString as B
import Data.Attoparsec.ByteString
main = do
    result <- solution <$> parseFile <$> B.readFile "euler42.txt"
    print result
solution :: [B.ByteString] -> Int
solution =
    length . filter isTriangle . fmap wordValue
  where
    wordValue = B.foldl' (+) 0 . B.map (subtract 64)
    isTriangle = const True {- some secret function -}
parseFile :: B.ByteString -> [B.ByteString]
parseFile = either (const []) id .
    parseOnly (wordParser `sepBy1` word8 (char ','))
  where
    wordParser =
        word8 (char '"') *> takeWhile1 (/= (char '"')) <* word8 (char '"')
    char = fromIntegral . ord
Now the problem is: given a large input file (150 MB), this solution
leads to OutOfMemory.
It seems like pipes are designed to solve this problem in the same
compositional way, so the question is: what is the correct way to
"pipify" this small program?
As I understand it, there must be:
1) a Producer of ByteStrings, based on pipes-parse and pipes-attoparsec
2) some analogues of "map", "filter" and "length" on pipe streams
3) a way to extract the result (unpipify)
Step 2) seems easy. We should get something like

solution :: Producer B.ByteString IO () -> Producer Int IO ()
solution =
    P.length <-< P.filter isTriangle <-< P.map wordValue

but the streaming from parsing, and extracting the result, are not
that obvious.
Could someone help me?
--
You received this message because you are subscribed to the
Google Groups "Haskell Pipes" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to [email protected].
To post to this group, send email to [email protected].