Re: [haskell-pipes] What is the ideomatic way to combine pipes-binary, pipes-bytestring, pipes-parse?

Torgeir Strand Henriksen Thu, 22 May 2014 11:04:35 -0700

Let me explain what I mean by the parser keeping on after the error:

parser :: Monad m => Parser ByteString m (String, Maybe Word8)
parser = do
    str <- zoom (PB.span (/= 0) . PT.utf8 . from PT.packChars) drawAll
    -- For simplicity; the real code would use a more complicated parser here.
    a <- PB.drawByte
    return (str, a)


test :: Monad m
     => [Word8] -> m ((String, Maybe Word8), Producer P.ByteString m ())
test = runStateT parser . yield . BS.pack

ghci> fst <$> test [65,66,67,0]
("ABC",Just 0)

ghci> fst <$> test [65,255,66,67,0] -- invalid utf8
("A",Just 255)

As you can see, the parser function keeps going with PB.drawByte after 
PT.utf8 fails. Unless I misunderstand, zoom even undraws the leftovers 
returned by PT.utf8, so I don't see a way to detect the error and report it 
to the user. Hopefully I'm missing something. :)
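
The only workaround I can think of is to check the byte that follows the name: PB.span (/= 0) only ends at a 0 or at end of input, so anything else in that position should mean the decoding stopped early. Below is a rough sketch of that idea as a pure model rather than with the real pipes types. The names Input, drawName and parseName are made up for illustration, and treating every byte >= 128 as undecodable is an ASCII-only simplification standing in for PT.utf8:

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical pure model: the parser state is just the list of unread bytes.
type Input = [Word8]

-- Stand-in for `zoom (PB.span (/= 0) . PT.utf8 . ...) drawAll`: decode up to
-- the first 0, stop at the first byte that does not decode (assumption: any
-- byte >= 128), and leave everything else in the input.
drawName :: Input -> (String, Input)
drawName bytes = (map (chr . fromIntegral) good, rest)
  where
    (good, rest) = span (\b -> b /= 0 && b < 128) bytes

-- Stand-in for PB.drawByte.
drawByte :: Input -> (Maybe Word8, Input)
drawByte []       = (Nothing, [])
drawByte (b : bs) = (Just b, bs)

-- Report a failure when the byte after the name is not the terminating 0.
parseName :: Input -> Either String (String, Input)
parseName bytes =
  case drawByte rest of
    (Just 0, rest') -> Right (name, rest')
    (Just b, _)     -> Left ("undecodable byte " ++ show b ++ " after " ++ show name)
    (Nothing, _)    -> Left "unexpected end of input"
  where
    (name, rest) = drawName bytes

main :: IO ()
main = do
  print (parseName [65, 66, 67, 0])
  print (parseName [65, 255, 66, 67, 0])
```

In this model parseName [65,66,67,0] gives Right ("ABC",[]), while the invalid input [65,255,66,67,0] gives a Left that mentions byte 255. It only works because the name has a known terminator, so it is not a general answer to error reporting from inside a lens.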

On Wednesday, 21 May 2014 at 04:48:26 UTC+2, Gabriel Gonzalez wrote:
>
>  Returning the unused input on error is the idiomatic way for a lens to 
> handle errors.  The parser won't keep going on after the error because the 
> `Producer` containing any unused input is stashed inside the return value 
> of the outer `Producer`, so the unused input is totally inaccessible to the 
> `Parser`.  The `Parser` type enforces this behavior:
>
>     type Parser a m r = forall x . StateT (Producer a m x) m r
>
> The `forall x` enforces in the types that the `Parser` cannot use whatever 
> is stored in the `x` in any meaningful way.  Since the unused input is 
> stored in that `x`, the `Parser` can't access it.
>
> On 05/16/2014 02:31 AM, Torgeir Strand Henriksen wrote:
>  
> I can see that it would be more elegant to zoom rather than use StateT, 
> but what options are there for error handling inside an encode/decode lens? 
> Wrapping the Text and ByteString chunks in Either sounds like a mess, and 
> returning the unused bytes on error like decodeIso8859_1 means the zoom has 
> to be runStated in isolation to prevent the parser from keeping on after 
> the error. Throwing an exception is possible of course, but would be nice 
> to avoid.
>
> On Tuesday, 13 May 2014 at 19:18:51 UTC+2, Gabriel Gonzalez wrote:
>>
>>  It is perfectly acceptable to poke around in the underlying `StateT`.  
>> Generally, it is more idiomatic to encode your error-handling logic into 
>> the lens itself, but manual state passing is definitely an approved thing 
>> to do if you are more comfortable with it.  It really comes down to 
>> whatever is more readable for you.
>>
>> One of the reasons that I chose `StateT` as the substrate for 
>> `pipes-parse` rather than an opaque `Parser` type is that I wanted people 
>> to reuse their existing knowledge for how `StateT` works so that they could 
>> do things like what you are doing.
>>
>> On 5/13/14, 10:02 AM, Torgeir Strand Henriksen wrote:
>>  
>> Great! I'm starting to get a firmer understanding of parsers. I ended up 
>> with this:
>>
>> decodeFilename = StateT $ \p -> do
>>     (fileName, p') <- runStateT drawAll
>>         . view (PB.span (/= 0)
>>                . to (PT.decodeAscii . (PB.map (`rotateR` 3) <-<))
>>                . from PT.packChars)
>>         $ p
>>     Left p'' <-  next p'
>>     return (fileName, PB.drop 1 <-< join p'')
>>
>> entryParser tableStart = do
>>     fileName <- decodeFilename
>>     P.decodeGet $ (,,,) fileName <$> fmap (tableStart +) getInt32
>>         <*> getInt32 <*> getInt32
>>
>> Using next instead of drain, decode errors can be handled (pattern match 
>> failure for now). Because of drawAll, p'' (result of span) is empty when 
>> decode succeeds, so it can simply be joined, and then the terminating 0 
>> dropped. Ignoring that the composition chains are a bit on the lengthy 
>> side, do you consider it "good style" to poke around in Parser's underlying 
>> StateT like that, or is it going against how the libraries are meant to be 
>> used?
>>
>> On Tuesday, 13 May 2014 at 03:14:37 UTC+2, Gabriel Gonzalez wrote:
>>>
>>>  
>>> On 5/10/14, 7:59 AM, Torgeir Strand Henriksen wrote:
>>>  
>>> Thanks for the reply! The rotated lens is no problem (rotateR is from 
>>> Data.Bits), but I'm afraid the data won't decode as UTF-8. Just to make 
>>> sure I understand correctly: when you talk about re-encoding unused values, 
>>> do you mean the values that would be left over if the parser being zoomed 
>>> into were something other than drawAll and didn't consume all the data 
>>> provided by the span lens?
>>>
>>>
>>> Yes, that's correct.  If you write:
>>>
>>>     example = do
>>>         a <- zoom someLens parser1
>>>         parser2
>>>
>>> ... then `someLens` needs to know how to re-encode leftovers from 
>>> `parser1` in the format that `parser2` understands.
>>>
>>>  I understand why it would be a problem if those leftovers weren't 
>>> propagated back, but I'm not sure I understand why that decision can't be 
>>> made before the data is rotated and decoded as text. Does it have to do 
>>> with the data being bytestrings that get transformed in blocks rather than 
>>> per byte?
>>>  
>>>
>>> Remember that the parser is totally oblivious about where the `Text` 
>>> came from.  It doesn't know that the text originated from bytes or rotated 
>>> data.  All it understands is "I am undrawing some text" and if you want it 
>>> to undraw bytes then you need to translate the "undraw text" command to an 
>>> "undraw bytes" command.  That's what the lens is doing.
>>>
>>> Note that you can still get a lens if you specify a way to handle 
>>> errors.  Right now the `pipes-text` package provides a one-way decoding 
>>> function for latin1 of type:
>>>
>>>     decodeIso8859_1 :: Monad m
>>>         => Producer ByteString m r
>>>         -> Producer Text m (Producer ByteString m r)
>>>
>>> If you supplement that with a reverse function of type:
>>>
>>>     encoder :: Monad m
>>>         => Producer Text m (Producer ByteString m r)
>>>         -> Producer ByteString m r
>>>
>>> ... then you can create a latin1 lens that you can pass to `zoom`:
>>>
>>>     latin1 :: Monad m
>>>         => Lens' (Producer ByteString m r)
>>>                  (Producer Text m (Producer ByteString m r))
>>>     -- I might have these arguments backwards; I didn't type-check this
>>>     latin1 = iso decodeIso8859_1 encoder
>>>
>>> The reason that `pipes-text` doesn't already do this for you is because 
>>> Latin1 does not specify how to encode multibyte characters.  In other 
>>> words, you need to figure out how to convert these exotic characters to 
>>> bytes, even if that means just discarding them (i.e. not undrawing the 
>>> character at all).
>>>
>>> So if you really want to use latin1 as a lens, you definitely can!  It 
>>> just requires that you decide how you want to encode multibyte characters, 
>>> since there's no obvious right way to do that.  If you don't expect your 
>>> input to have multibyte characters then you can just slightly modify 
>>> `encodeIso8859_1` to do what you want:
>>>
>>>     encoder pText = do
>>>         pBytes <- encodeIso8859_1 pText
>>>         runEffect (runEffect (pBytes >-> drain) >-> drain)
>>>
>>> That basically keeps encoding until it hits a character that 
>>> `encodeIso8859_1` does not know how to encode, then gives up and drains 
>>> the rest of the stream.
>>>
>>>
>>>  
>>> Anyway I'll have to go with your second option. Instead of breaking the 
>>> parser into multiple code blocks (that have to be runStateTed individually) 
>>> in order to get at the bytestring producer, is it reasonable to use get and 
>>> put from Control.Monad.State? That way I can keep everything a single 
>>> Parser, view the bytestring producer from "get" through the PB.span lens 
>>> composed with the transformations, and "put" back the producer returned by 
>>> span.
>>>
>>> Bonus question: If the rotated lens was simply Bits a => Int -> Lens' a 
>>> a, could it be mapped/zoomed/something over a ByteString producer instead 
>>> of including PB.map in the lens? That way rotated would be more reusable.
>>>
>>> On Saturday, May 10, 2014 1:45:32 AM UTC+2, Gabriel Gonzalez wrote: 
>>>>
>>>>  This works much better if you can make two small changes.
>>>>
>>>> First, I'm guessing that your `rotateR` function has some sort of 
>>>> inverse named `rotateL`.  If it does, then you can make a rotation lens:
>>>>
>>>>     rotated :: Int
>>>>             -> Lens' (Producer ByteString m x) (Producer ByteString m x)
>>>>     rotated n = iso (PB.map (`rotateR` n)) (PB.map (`rotateL` n))
>>>>
>>>> Second, if you can use utf8 instead of latin1, then you can just write:
>>>>
>>>>     decodeFileName :: Monad m => Parser ByteString m String
>>>>     decodeFileName =
>>>>         zoom (PB.span (/= 0) . rotated 3 . PT.utf8 . from PT.packChars)
>>>>              PP.drawAll
>>>>
>>>> The reason this works is that `rotated` and `utf8` contain extra 
>>>> information for how to propagate unused bytes back to the original input 
>>>> source.  In the case of `rotated` it reverses the original rotation and in 
>>>> the case of `utf8` it re-encodes them.
>>>>
>>>> If you don't have information for how to re-encode unused values, then 
>>>> you must apply the rotation and decoding to the producer before feeding it 
>>>> to the parser:
>>>>
>>>>     yourProducer :: Producer ByteString IO ()
>>>>
>>>>     runStateT PP.drawAll
>>>>         (yourProducer ^. PB.span (/= 0) ^. to (PB.map (`rotateR` n))
>>>>                       ^. PT.utf8 ^. from PT.packChars)
>>>>         :: IO (String, Producer String IO (... {- more nested producers -}))
>>>>
>>>> `pipes-parse` doesn't let you merge logic into the parser unless you 
>>>> also include logic for how to propagate unused bytes to the input source.  
>>>> Without that guarantee you get bugs related to silently dropping input 
>>>> values.
>>>>
>>>> On 5/9/14, 11:06 AM, Torgeir Strand Henriksen wrote:
>>>>  
>>>> While working with a binary file format, I started out with this naive 
>>>> code:
>>>>
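>>>> import Data.Binary.Get (getWord8, getWord32le)
>>>> import Data.Bits (rotateR)
>>>> import Data.Text.Encoding (decodeLatin1)
>>>> import qualified Pipes.Parse as P
>>>> import qualified Pipes.Binary as P
>>>> import qualified Pipes.ByteString as PB
>>>> import qualified Data.Text as T
>>>> import qualified Data.ByteString as BS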
>>>>  
>>>> entryParser tableStart = P.decodeGet $ (,,,) <$> decodeFilename
>>>>     <*> fmap (tableStart +) getWord32le <*> getWord32le <*> getWord32le
>>>>
>>>> decodeFilename = T.unpack . decodeLatin1 . BS.pack <$> go where
>>>>     go = do
>>>>         c <- (`rotateR` 3) <$> getWord8
>>>>         -- terminate on (and consume) the 0
>>>>         if c /= 0 then (c :) <$> go else pure []
>>>>  
>>>> While it does work, I'm unhappy with decodeFilename as it basically 
>>>> implements a combination of map and span/fold with explicit recursion. But 
>>>> the underlying ByteString isn't available inside the Get monad without 
>>>> consuming it, so using e.g. BS.span seems out of the question. Let's see 
>>>> if lenses can come to the rescue:
>>>>
>>>> entryParser tableStart = do
>>>>     nameChunks <- zoom (PB.span (/= 0)) P.drawAll
>>>>     PB.drawByte -- draw the terminating 0
>>>>     let fileName = T.unpack . decodeLatin1 . BS.map (flip rotateR 3)
>>>>                  . BS.concat $ nameChunks
>>>>     P.decodeGet $ (,,,) fileName <$> fmap (tableStart +) getWord32le
>>>>         <*> getWord32le <*> getWord32le
>>>>  
>>>> I like this better - map and span aren't implemented manually anymore - 
>>>> but at the same time I was hoping for more. It doesn't seem right to work 
>>>> directly on ByteStrings (i.e. BS.map instead of PB.map, and text instead 
>>>> of pipes-text), and the combination of drawAll and concat is a bit 
>>>> awkward, especially since drawAll is only for testing (even though all 
>>>> the tutorials use it :) ). The latter point might be addressed by giving 
>>>> pipes-bytestring a folding function similar to P.foldAll, but even so I 
>>>> wonder if there's a more idiomatic way to do this?
>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "Haskell Pipes" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].

