Re: [haskell-pipes] bytestream lines

Gabriel Gonzalez Sun, 16 Aug 2015 08:13:13 -0700

To make an analogy to `ByteString` operations, what you did was essentially equivalent to:


    ByteString.concat . ByteString.lines


In other words, your code just deleted all the newlines.
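
To make the analogy concrete, here is a minimal sketch using the strict `Data.ByteString.Char8` module from the `bytestring` package (the `Char8` variant is assumed here because that is where `lines` lives). The names `deleteNewlines` and `example` are made up for this illustration:

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper: split on newlines, then glue the pieces back
-- together without any separator, which simply drops every '\n'.
deleteNewlines :: B.ByteString -> B.ByteString
deleteNewlines = B.concat . B.lines

-- The input string from the original question.
example :: B.ByteString
example = deleteNewlines (B.pack "foo,bar\nfizz,buzz\n")
```

Evaluating `example` in GHCi gives `"foo,barfizz,buzz"`: the line structure is gone.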

The intuition going into the `pipes-bytestring`/`pipes-text` libraries is that each element of the stream is one chunk of unspecified size that doesn't necessarily align to line boundaries. `pipes-bytestring`/`pipes-text` will sometimes slice these chunks into finer chunks but will (almost) never combine them into larger chunks, in order to guarantee that all operations use a bounded amount of memory. The exception to this rule is the `chunksOf'` function.

To illustrate what happened in your code, I will use a "list of lists" notation, where the outer list is the `FreeT`, each inner list is one `Producer` group within that `FreeT`, and each element is one value emitted by a `Producer`.


So let's imagine that you had a text file that looked like this:

```
ABCDEF
GHIJKLMNO
PQR
```

However, when you read it in as a `Producer` of unaligned chunks, your stream of chunks might look something like this:


    [ "ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

Note that the chunk boundaries don't necessarily correspond to newline boundaries.


Now, when you use `view Pipes.ByteString.lines`, you transform it into this:

    [["ABC", "DEF"], ["GHI", "JKL", "MNO"], ["PQR"]]

Each inner list corresponds to one line (represented as a `Producer`), possibly emitting a stream of multiple chunks. The outer list is the `FreeT`.

When you follow up with `Pipes.Group.concats` you just concatenate the inner lists together again:


    ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"]

This is probably not what you wanted. The elements of the concatenated stream don't represent lines. They just represent chunks from the original stream with all newlines deleted and turned into chunk boundaries.

This is why, when you counted the elements, you got more elements than lines. You introduced one new chunk boundary per newline, plus whatever natural chunk boundaries existed beforehand.
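
The walk-through above can be replayed with plain lists. This is only a model of the behaviour, not how `pipes-bytestring` is implemented: a chunked stream is modelled as `[String]`, the `FreeT` of groups as `[[String]]`, and `groupLines` is a hypothetical stand-in for `view Pipes.ByteString.lines`.

```haskell
-- Model only: [String] stands in for a Producer of chunks and
-- [[String]] stands in for the FreeT of per-line Producers.

-- Hypothetical stand-in for `view Pipes.ByteString.lines`: split the
-- chunks at newlines without ever merging two chunks into one.
groupLines :: [String] -> [[String]]
groupLines = go []
  where
    -- `acc` holds the chunks of the line in progress, newest first.
    go acc [] = [reverse acc | not (null acc)]
    go acc (c : cs) =
      case break (== '\n') c of
        (piece, [])       -> go (push piece acc) cs
        (piece, _ : rest) -> reverse (push piece acc) : go [] (rest : cs)

    -- Skip empty pieces so the model never emits an empty chunk.
    push piece acc = if null piece then acc else piece : acc

-- The example stream of unaligned chunks.
chunks :: [String]
chunks = ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

-- Stand-in for `Pipes.Group.concats`: flatten the groups again.
flattened :: [String]
flattened = concat (groupLines chunks)
```

In this model `length (groupLines chunks)` is 3, one group per line, while `length flattened` is 6. That is the same kind of discrepancy as 25000 lines versus 25569 elements.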

So if you actually want each element of the `Producer` to be one line long, the trick to do this is:


    import qualified Control.Foldl as Foldl
    import Pipes.Group (folds)

    -- Qualified so that Foldl.mconcat does not clash with Prelude's mconcat
    Foldl.purely folds Foldl.mconcat . view Pipes.ByteString.lines
        :: Monad m => Producer ByteString m r -> Producer ByteString m r

I make this deliberately hard to discover in order to encourage people to do things in a proper streaming fashion. The reason is that there is no upper bound on how long a line may be, so if you do this then you risk unbounded space usage.
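
For comparison, here is a small self-contained sketch that runs both versions on the example chunks from earlier. It assumes the `pipes`, `pipes-bytestring`, `pipes-group`, `foldl` and `lens` packages are installed (`view` from `lens-family-core` should work as well). The names `chunks`, `asChunks`, `wholeLines` and `asLines` are made up for this illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Control.Foldl as Foldl
import Control.Lens (view)
import Data.ByteString (ByteString)
import Pipes (Producer, each)
import qualified Pipes.ByteString as PB
import Pipes.Group (concats, folds)
import qualified Pipes.Prelude as P

-- The example stream, as a pure Producer of unaligned chunks.
chunks :: Monad m => Producer ByteString m ()
chunks = each ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

-- What the original code did: group by line, then flatten again.
-- The resulting elements are chunks, not lines.
asChunks :: [ByteString]
asChunks = P.toList (concats (view PB.lines chunks))

-- The trick: fold each group into a single ByteString.
-- Every line is buffered in full, so memory use is not bounded.
wholeLines :: Monad m => Producer ByteString m r -> Producer ByteString m r
wholeLines = Foldl.purely folds Foldl.mconcat . view PB.lines

-- One element per line, collected into a list for inspection.
asLines :: [ByteString]
asLines = P.toList (wholeLines chunks)
```

Comparing `length asChunks` with `length asLines` in GHCi shows the difference: only `asLines` has one element per line.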

The more idiomatic approach is to preserve streaming by using `pipes-group` idioms. For example, if you wanted to map a function over each line, you would write something like this:


    -- Append an exclamation mark to the end of each line
    over (Pipes.ByteString.lines . Pipes.Group.individually) (<* yield "!")

That would run in constant space no matter how large each line is. The general type (which I've simplified a bit) would be:


    over (Pipes.ByteString.lines . Pipes.Group.individually)
        :: Monad m
        => (forall x . Producer ByteString m x -> Producer ByteString m x)
        -- ^ Function to process each line
        -> Producer ByteString m r -> Producer ByteString m r
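
As a runnable sketch of the exclamation-mark example, assuming the `pipes`, `pipes-bytestring`, `pipes-group` and `lens` packages (`over` from `lens-family-core` should work as well). The names `exclaim`, `chunks` and `result` are made up for this illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Lens (over)
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Pipes (Producer, each, yield)
import qualified Pipes.ByteString as PB
import Pipes.Group (individually)
import qualified Pipes.Prelude as P

-- Append an exclamation mark to the end of each line without ever
-- assembling a whole line in memory.
exclaim :: Monad m => Producer ByteString m r -> Producer ByteString m r
exclaim = over (PB.lines . individually) (<* yield "!")

-- Sample input: chunk boundaries deliberately ignore line boundaries.
chunks :: Monad m => Producer ByteString m ()
chunks = each ["ABC", "DEF\nGHI", "JKL", "MNO\nPQR\n"]

-- Collected only for inspection; a real pipeline would keep streaming.
result :: ByteString
result = BS.concat (P.toList (exclaim chunks))
```

The output of `exclaim` is still a stream of unaligned chunks, with the newlines put back by the `lines` lens, so downstream stages keep working in a streaming fashion.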

If you provide more specific details about what you want to do with each line, I can direct you to the appropriate `pipes-group` or `pipes-bytestring` utility that preserves streaming.


On 8/12/2015 12:13 AM, Tran Ma wrote:

Hi all,

I'm delimiting a bytestream like this:

    mkLines :: Monad m => Producer ByteString m () -> Producer ByteString m ()
    mkLines = PipesGroup.concats . view PipesByteString.lines

I thought this should behave like a regular `lines` function, e.g. delimiting "foo,bar\nfizz,buzz\n" into [ "foo,bar", "fizz,buzz" ], regardless of how `hGetSome` chunks the original stream, but this isn't the case. For example, running `fmap length . PP.toListM . mkLines . PB.fromHandle` on a 25000-line file gives 25569.
Should I be peeking at each ByteString to break on a "\n" character myself? That is already what `lines` in pipes-bytestring is doing, though.
Cheers,
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group. To unsubscribe from this group and stop receiving emails from it, send an email to haskell-pipes+unsubscr...@googlegroups.com. To post to this group, send email to haskell-pipes@googlegroups.com.

