To make an analogy to `ByteString` operations, what you did was
essentially equivalent to:

    ByteString.concat . ByteString.lines

In other words, your code just deleted all the newlines.
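
As a quick check of that analogy, here is a minimal sketch using strict `ByteString`s from the `bytestring` package. The input string is the one from the original question; nothing here involves `pipes` yet.

```haskell
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  let input = B.pack "foo,bar\nfizz,buzz\n"
  -- Splitting on newlines gives one element per line.
  print (B.lines input)             -- ["foo,bar","fizz,buzz"]
  -- Concatenating the pieces again loses the newlines for good.
  print (B.concat (B.lines input))  -- "foo,barfizz,buzz"
```
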
The intuition going into the `pipes-bytestring`/`pipes-text` libraries
is that each element of the stream is one chunk of unspecified size that
doesn't necessarily align to line boundaries.
`pipes-bytestring`/`pipes-text` will sometimes slice these chunks into
finer chunks but it will (almost) never combine them into larger chunks,
in order to guarantee that all operations use a bounded amount of
memory. The exception to this rule is the `chunksOf'` function.
To illustrate what happened in your code, I will use a "list of lists"
notation, where the outer list is the `FreeT` and the inner list is each
`Producer` group within that `FreeT` and each element is one value
emitted by a `Producer`.
So let's imagine that you had a text file that looked like this:
```
ABCDEF
GHIJKLMNO
PQR
```
However, when you read it in as a `Producer` of unaligned chunks, you
might get a stream of chunks like this:

    ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

Note that the chunk boundaries don't necessarily correspond to newline
boundaries.
Now, when you use `view Pipes.ByteString.lines`, you transform it into this:

    [["ABC", "DEF"], ["GHI", "JKL", "MNO"], ["PQR"]]

Each inner list corresponds to one line (represented as a `Producer`),
possibly emitting a stream of multiple chunks. The outer list is the
`FreeT`.
When you follow up with `Pipes.Group.concats` you just concatenate the
inner lists together again:

    ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"]

This is probably not what you wanted. The elements of the concatenated
stream don't represent lines. They are just the chunks of the original
stream with every newline deleted and replaced by a chunk boundary.
This is why you counted more elements than lines: you introduced one
new chunk boundary per newline, on top of whatever natural chunk
boundaries existed beforehand.
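
The "list of lists" picture can be reproduced with ordinary lists. The following is only a pure model of the notation, not how `pipes-bytestring` is implemented; `linesModel` and `exampleChunks` are hypothetical names made up for this sketch.

```haskell
-- A pure model of the "list of lists" notation (not the pipes implementation):
-- a chunked stream is a list of Strings, and a line is a list of chunks.

-- Split a chunked stream into lines without merging any chunks.
linesModel :: [String] -> [[String]]
linesModel = finish . foldl step ([], [])
  where
    -- State: (finished lines, newest first; chunks of the open line, newest first)
    step (done, open) chunk =
      case break (== '\n') chunk of
        (piece, [])       -> (done, push piece open)
        (piece, _ : rest) -> step (reverse (push piece open) : done, []) rest
    -- Drop empty pieces so they do not show up as chunks.
    push piece open = if null piece then open else piece : open
    finish (done, open) =
      reverse (if null open then done else reverse open : done)

exampleChunks :: [String]
exampleChunks = ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

main :: IO ()
main = do
  let grouped = linesModel exampleChunks
  print grouped                    -- [["ABC","DEF"],["GHI","JKL","MNO"],["PQR"]]
  print (length (concat grouped))  -- 6: flattening the groups, as concats does
  print (map concat grouped)       -- ["ABCDEF","GHIJKLMNO","PQR"]
```

Four chunks and three lines go in, six elements come out of the flattening step, and only joining the chunks inside each group gives one element per line.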
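So if you actually want each element of the `Producer` to be exactly one
line, the trick is:

    -- Qualified, because Control.Foldl.mconcat clashes with the Prelude's
    import qualified Control.Foldl as Foldl
    import Lens.Family (view)  -- or Control.Lens (view)
    import qualified Pipes.ByteString
    import Pipes.Group (folds)

    Foldl.purely folds Foldl.mconcat . view Pipes.ByteString.lines
        :: Monad m => Producer ByteString m r -> Producer ByteString m r
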
I make this deliberately hard to discover in order to encourage people
to do things in a proper streaming fashion. The reason why is that
there is no upper bound on how long a line may be, so if you do this
then you risk unbounded space usage.
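
A complete program exercising this fold might look like the sketch below. It assumes the `pipes`, `pipes-bytestring`, `pipes-group`, `foldl`, and `lens-family-core` packages are installed, takes `view` from `Lens.Family` (the `lens` package's `view` should work as well), and uses a hypothetical `wholeLines` name plus an in-memory chunk list in place of a file handle. `Control.Foldl` is imported qualified because its `mconcat` shares a name with the Prelude's.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Control.Foldl as Foldl
import Data.ByteString (ByteString)
import Lens.Family (view)
import Pipes (Producer, each)
import qualified Pipes.ByteString as PB
import Pipes.Group (concats, folds)
import qualified Pipes.Prelude as P

-- Stand-in for a file read in unaligned chunks.
exampleChunks :: Monad m => Producer ByteString m ()
exampleChunks = each ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

-- Fold every line's chunks into one ByteString (hypothetical helper name).
-- Each line is held in memory in full, so space usage is not bounded.
wholeLines :: Monad m => Producer ByteString m r -> Producer ByteString m r
wholeLines = Foldl.purely folds Foldl.mconcat . view PB.lines

main :: IO ()
main = do
  -- One element per line.
  print (P.toList (wholeLines exampleChunks))
  -- For comparison: how many elements concats yields for the same input.
  print (length (P.toList (concats (view PB.lines exampleChunks))))
```
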
The more idiomatic approach is to preserve streaming by using
`pipes-group` idioms. For example, if you wanted to map a function over
each line, you would write something like this:

    -- Append an exclamation mark to the end of each line
    over (Pipes.ByteString.lines . Pipes.Group.individually) (<* yield "!")

That would run in constant space no matter how long each line is. The
general type (which I've simplified a bit) would be:

    over (Pipes.ByteString.lines . Pipes.Group.individually)
        :: Monad m
        => (forall x . Producer ByteString m x -> Producer ByteString m x)
        -- ^ Function to process each line
        -> Producer ByteString m r -> Producer ByteString m r

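
Here is a small self-contained sketch of the streaming version. It assumes the `pipes`, `pipes-bytestring`, `pipes-group`, and `lens-family-core` packages, takes `over` from `Lens.Family`, and uses a hypothetical `exclaim` name with an in-memory chunk list standing in for a real input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import Data.ByteString (ByteString)
import Lens.Family (over)
import Pipes (Producer, each, yield)
import qualified Pipes.ByteString as PB
import Pipes.Group (individually)
import qualified Pipes.Prelude as P

-- Stand-in for a file read in unaligned chunks.
exampleChunks :: Monad m => Producer ByteString m ()
exampleChunks = each ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

-- Append "!" to every line without ever collecting a whole line in memory
-- (hypothetical helper name).
exclaim :: Monad m => Producer ByteString m r -> Producer ByteString m r
exclaim = over (PB.lines . individually) (<* yield "!")

main :: IO ()
main =
  -- The output is still a stream of chunks, so join them only for display.
  print (B.concat (P.toList (exclaim exampleChunks)))
```

The result is again a stream of unaligned chunks. The lens splits the stream into lines on the way in and restores the newlines on the way out, so only the per-line function changes what is emitted.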
If you provide more specific details about what you want to do with each
line I can direct you to the appropriate `pipes-group` or
`pipes-bytestring` utility that preserves streaming.
On 8/12/2015 12:13 AM, Tran Ma wrote:
Hi all,
I'm delimiting a bytestream like this:

    mkLines :: Monad m => Producer ByteString m () -> Producer ByteString m ()
    mkLines = PipesGroup.concats . view PipesByteString.lines

I thought this should behave like a regular `lines` function, e.g.
delimiting "foo,bar\nfizz,buzz\n" into [ "foo,bar", "fizz,buzz" ],
regardless of how `hGetSome` chunks the original stream, but this
isn't the case. For example, running `fmap length . PP.toListM .
mkLines . PB.fromHandle` on a 25000-lines file gives 25569.
Should I be peeking at each ByteString to break on a "\n" character
myself? That is already what `lines` in pipes-bytestring is doing though.
Cheers,
--
You received this message because you are subscribed to the Google
Groups "Haskell Pipes" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to haskell-pipes+unsubscr...@googlegroups.com.
To post to this group, send email to haskell-pipes@googlegroups.com.