Re: [haskell-pipes] What is the ideomatic way to combine pipes-binary, pipes-bytestring, pipes-parse?

Torgeir Strand Henriksen Sun, 25 May 2014 06:55:39 -0700

zoom undraws the remaining data from both the failing utf8 lens and the 
span, so isEndOfBytes returns False for both valid and invalid UTF-8. I 
guess the transparency of zoom makes it difficult to detect errors that 
way. :) I'll stick to viewing the StateT's Producer for now.
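
For the archive, here is a minimal sketch of why checking for leftover input 
after zoom cannot tell the two cases apart, while keeping the pieces separate 
(as viewing the Producer does) can. It is a toy model on plain lists, not the 
pipes API: decodePrefix, viewStyle and zoomStyle are hypothetical names, and 
an ASCII-only check stands in for PT.utf8.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical stand-in for a decoding lens such as PT.utf8: decode the
-- longest valid prefix and hand back the bytes that could not be decoded.
-- Only ASCII (< 128) counts as valid here, to keep the sketch base-only.
decodePrefix :: [Word8] -> (String, [Word8])
decodePrefix bytes = (map (chr . fromIntegral) ok, bad)
  where
    (ok, bad) = span (< 128) bytes

-- "view" style: the decoded text, the undecodable leftovers, and the input
-- after the span all stay separate, so a decode error is visible as a
-- non-empty second component.
viewStyle :: [Word8] -> (String, [Word8], [Word8])
viewStyle input = (str, undecoded, rest)
  where
    (name, rest)     = span (/= 0) input
    (str, undecoded) = decodePrefix name

-- "zoom" style: leftovers are pushed back in front of the remaining input,
-- so the caller only ever sees one merged stream.
zoomStyle :: [Word8] -> (String, [Word8])
zoomStyle input = (str, undecoded ++ rest)
  where
    (str, undecoded, rest) = viewStyle input

main :: IO ()
main = do
  print (zoomStyle [65,66,67,0])      -- ("ABC",[0])
  print (zoomStyle [65,255,66,67,0])  -- ("A",[255,66,67,0])
  print (viewStyle [65,255,66,67,0])  -- ("A",[255,66,67],[0])
```

In both zoomStyle results the remaining input is non-empty (the terminating 0 
is still there), so an end-of-input check says the same thing for valid and 
invalid input. Only viewStyle exposes the undecoded bytes on their own.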


On Saturday, May 24, 2014 1:44:23 AM UTC+2, Gabriel Gonzalez wrote:
>
>  The simplest solution is to use `Pipes.ByteString.isEndOfBytes` after the 
> `zoom` to check if it failed or not.  If there are residual bytes then the 
> parse failed.
>
> Another solution is to apply the lens on the `Producer` end, using 
> `view`.  This ensures that no information is lost.
>
> On 5/22/14, 11:03 AM, Torgeir Strand Henriksen wrote:
>  
> Let me explain what I mean by the parser keeping on after the error:
>
> parser :: Monad m => Parser ByteString m (String, Maybe Word8)
> parser = do
>     str <- zoom (PB.span (/= 0) . PT.utf8 . from PT.packChars) drawAll
>     -- for simplicity; it would be a more complicated parser in actual code
>     a <- PB.drawByte
>     return (str, a)
>
> test :: Monad m
>      => [Word8] -> m ((String, Maybe Word8), Producer P.ByteString m ())
> test = runStateT parser . yield . BS.pack
>
> ghci> fst <$> test [65,66,67,0]
> ("ABC",Just 0)
>
> ghci> fst <$> test [65,255,66,67,0] -- invalid utf8
> ("A",Just 255)
>
> As you can see, the parser function keeps going with PB.drawByte after 
> PT.utf8 fails. Unless I misunderstand, zoom even undraws the leftovers 
> returned by PT.utf8, so I don't see a way to detect the error and report it 
> to the user. Hopefully I'm missing something. :)
>
> On Wednesday, May 21, 2014 4:48:26 AM UTC+2, Gabriel Gonzalez wrote:
>>
>>  Returning the unused input on error is the idiomatic way for a lens to 
>> handle errors.  The parser won't keep going on after the error because the 
>> `Producer` containing any unused input is stashed inside the return value 
>> of the outer `Producer`, so the unused input is totally inaccessible to the 
>> `Parser`.  The `Parser` type enforces this behavior:
>>
>>     type Parser a m r = forall x . StateT (Producer a m x) m r
>>
>> The `forall x` enforces in the types that the `Parser` cannot use 
>> whatever is stored in the `x` in any meaningful way.  Since the unused 
>> input is stored in that `x`, the `Parser` can't access it.
>>
>> On 05/16/2014 02:31 AM, Torgeir Strand Henriksen wrote:
>>  
>> I can see that it would be more elegant to zoom rather than use StateT, 
>> but what options are there for error handling inside an encode/decode lens? 
>> Wrapping the Text and ByteString chunks in Either sounds like a mess, and 
>> returning the unused bytes on error like decodeIso8859_1 means the zoom has 
>> to be runStated in isolation to prevent the parser from keeping on after 
>> the error. Throwing an exception is possible of course, but would be nice 
>> to avoid.
>>
>> On Tuesday, May 13, 2014 7:18:51 PM UTC+2, Gabriel Gonzalez wrote:
>>>
>>>  It is perfectly acceptable to poke around in the underlying `StateT`.  
>>> Generally, it is more idiomatic to encode your error-handling logic into 
>>> the lens itself, but manual state passing is definitely an approved thing 
>>> to do if you are more comfortable with it.  It really comes down to 
>>> whatever is more readable for you.
>>>
>>> One of the reasons that I chose `StateT` as the substrate for 
>>> `pipes-parse` rather than an opaque `Parser` type is that I wanted people 
>>> to reuse their existing knowledge for how `StateT` works so that they could 
>>> do things like what you are doing.
>>>
>>> On 5/13/14, 10:02 AM, Torgeir Strand Henriksen wrote:
>>>  
>>> Great! I'm starting to get a firmer understanding of parsers. I ended up 
>>> with this:
>>>
>>> decodeFilename = StateT $ \p -> do
>>>     (fileName, p') <- runStateT drawAll
>>>         . view ( PB.span (/= 0)
>>>                . to (PT.decodeAscii . (PB.map (`rotateR` 3) <-<))
>>>                . from PT.packChars )
>>>         $ p
>>>     Left p'' <- next p'
>>>     return (fileName, PB.drop 1 <-< join p'')
>>>
>>> entryParser tableStart = do
>>>     fileName <- decodeFilename
>>>     P.decodeGet $ (,,,) fileName <$> fmap (tableStart +) getInt32
>>>                                  <*> getInt32 <*> getInt32
>>>
>>> Using next instead of drain, decode errors can be handled (pattern match 
>>> failure for now). Because of drawAll, p'' (result of span) is empty when 
>>> decode succeeds, so it can simply be joined, and then the terminating 0 
>>> dropped. Ignoring that the composition chains are a bit on the lengthy 
>>> side, do you consider it "good style" to poke around in Parser's underlying 
>>> StateT like that, or is it going against how the libraries are meant to be 
>>> used?
>>>
>>> On Tuesday, May 13, 2014 3:14:37 AM UTC+2, Gabriel Gonzalez wrote:
>>>>
>>>>  
>>>> On 5/10/14, 7:59 AM, Torgeir Strand Henriksen wrote:
>>>>  
>>>> Thanks for the reply! The rotated lens is no problem (rotateR is from 
>>>> Data.Bits), but I'm afraid the data won't decode as UTF-8. Just to make 
>>>> sure I understand correctly: when you talk about re-encoding unused 
>>>> values, do you mean the values that would be left if the parser zoomed 
>>>> into were a different one than drawAll and didn't consume all the data 
>>>> provided by the span lens?
>>>>
>>>>
>>>> Yes, that's correct.  If you write:
>>>>
>>>>     example = do
>>>>         a <- zoom someLens parser1
>>>>         parser2
>>>>
>>>> ... then `someLens` needs to know how to re-encode leftovers from 
>>>> `parser1` in the format that `parser2` understands.
>>>>
>>>>  I understand why it would be a problem if those leftovers weren't 
>>>> propagated back, but I'm not sure I understand why that decision can't be 
>>>> made before the data is rotated and decoded as text. Does it have to do 
>>>> with the data being bytestrings that get transformed in blocks rather than 
>>>> per byte?
>>>>  
>>>>
>>>> Remember that the parser is totally oblivious about where the `Text` 
>>>> came from.  It doesn't know that the text originated from bytes or rotated 
>>>> data.  All it understands is "I am undrawing some text" and if you want it 
>>>> to undraw bytes then you need to translate the "undraw text" command to an 
>>>> "undraw bytes" command.  That's what the lens is doing.
>>>>
>>>> Note that you can still get a lens if you specify a way to handle 
>>>> errors.  Right now the `pipes-text` package provides a one-way decoding 
>>>> function for latin1 of type:
>>>>
>>>>     decodeIso8859_1 :: Monad m
>>>>                     => Producer ByteString m r
>>>>                     -> Producer Text m (Producer ByteString m r)
>>>>
>>>> If you supplement that with a reverse function of type:
>>>>
>>>>     encoder :: Monad m
>>>>             => Producer Text m (Producer ByteString m r)
>>>>             -> Producer ByteString m r
>>>>
>>>> ... then you can create a latin1 lens that you can pass to `zoom`:
>>>>
>>>>     latin1 :: Monad m
>>>>            => Lens' (Producer ByteString m r)
>>>>                     (Producer Text m (Producer ByteString m r))
>>>>     -- I might have these arguments backwards; I didn't type-check this
>>>>     latin1 = iso decodeIso8859_1 encoder
>>>>
>>>> The reason that `pipes-text` doesn't already do this for you is because 
>>>> Latin1 does not specify how to encode multibyte characters.  In other 
>>>> words, you need to figure out how to convert these exotic characters to 
>>>> bytes, even if that means just discarding them (i.e. not undrawing the 
>>>> character at all).
>>>>
>>>> So if you really want to use latin1 as a lens, you definitely can!  It 
>>>> just requires that you decide how you want to encode multibyte 
>>>> characters, since there's no obvious right way to do that.  If you don't 
>>>> expect your input to have multibyte characters then you can just 
>>>> slightly modify `encodeIso8859_1` to do what you want:
>>>>
>>>>     encoder pText = do
>>>>         pBytes <- encodeIso8859_1 pText
>>>>         runEffect (runEffect (pBytes >-> drain) >-> drain)
>>>>
>>>> That basically keeps encoding until it hits a character that 
>>>> `encodeIso8859_1` does not know how to encode, then gives up and drains 
>>>> the rest of the stream.
>>>>
>>>>
>>>>  
>>>> Anyway I'll have to go with your second option. Instead of breaking the 
>>>> parser into multiple code blocks (that have to be runStateTed 
>>>> individually) in order to get at the bytestring producer, is it 
>>>> reasonable to use get and put from Control.Monad.State? That way I can 
>>>> keep everything a single Parser, view the bytestring producer from 
>>>> "get" through the PB.span lens composed with the transformations, and 
>>>> "put" back the producer returned by span.
>>>>
>>>> Bonus question: If the rotated lens was simply Bits a => Int -> Lens' a 
>>>> a, could it be mapped/zoomed/something over a ByteString producer instead 
>>>> of including PB.map in the lens? That way rotated would be more reusable.
>>>>
>>>> On Saturday, May 10, 2014 1:45:32 AM UTC+2, Gabriel Gonzalez wrote: 
>>>>>
>>>>>  This works much better if you can make two small changes.
>>>>>
>>>>> First, I'm guessing that your `rotateR` function has some sort of 
>>>>> inverse named `rotateL`.  If it does, then you can make a rotation lens:
>>>>>
>>>>>     rotated :: Monad m
>>>>>             => Int
>>>>>             -> Lens' (Producer ByteString m x) (Producer ByteString m x)
>>>>>     rotated n = iso (>-> PB.map (`rotateR` n)) (>-> PB.map (`rotateL` n))
>>>>>
>>>>> Second, if you can use utf8 instead of latin1, then you can just write:
>>>>>
>>>>>     decodeFileName :: Monad m => Parser ByteString m String
>>>>>     decodeFileName =
>>>>>         zoom (PB.span (/= 0) . rotated 3 . PT.utf8 . from PT.packChars)
>>>>>              PP.drawAll
>>>>>
>>>>> The reason this works is that `rotated` and `utf8` contain extra 
>>>>> information for how to propagate unused bytes back to the original input 
>>>>> source.  In the case of `rotated` it reverses the original rotation and in 
>>>>> the case of `utf8` it re-encodes them.
>>>>>
>>>>> If you don't have information for how to re-encode unused values, then 
>>>>> you must apply the rotation and encoding to the producer before feeding 
>>>>> it to the parser:
>>>>>
>>>>>     yourProducer :: Producer ByteString IO ()
>>>>>
>>>>>     runStateT PP.drawAll
>>>>>         (yourProducer ^. PB.span (/= 0) ^. to (>-> PB.map (`rotateR` n))
>>>>>                       ^. PT.utf8 ^. from PT.packChars)
>>>>>         :: IO (String, Producer String IO (... {- more nested producers -}))
>>>>>
>>>>> `pipes-parse` doesn't let you merge logic into the parser unless you 
>>>>> also include logic for how to propagate unused bytes to the input 
>>>>> source.  Without that guarantee you get bugs related to silently 
>>>>> dropping input values.
>>>>>
>>>>> On 5/9/14, 11:06 AM, Torgeir Strand Henriksen wrote:
>>>>>  
>>>>> While working with a binary file format, I started out with this naive 
>>>>> code:
>>>>>
>>>>> -- The first four imports are inferred from the names used below.
>>>>> import Control.Applicative (pure, (<$>), (<*>))
>>>>> import Data.Binary.Get (getWord8, getWord32le)
>>>>> import Data.Bits (rotateR)
>>>>> import Data.Text.Encoding (decodeLatin1)
>>>>> import qualified Pipes.Parse as P
>>>>> import qualified Pipes.Binary as P
>>>>> import qualified Pipes.ByteString as PB
>>>>> import qualified Data.Text as T
>>>>> import qualified Data.ByteString as BS
>>>>>
>>>>> entryParser tableStart = P.decodeGet $
>>>>>     (,,,) <$> decodeFilename
>>>>>           <*> fmap (tableStart +) getWord32le
>>>>>           <*> getWord32le
>>>>>           <*> getWord32le
>>>>>
>>>>> decodeFilename = T.unpack . decodeLatin1 . BS.pack <$> go where
>>>>>     go = do
>>>>>         c <- (`rotateR` 3) <$> getWord8
>>>>>         -- terminate on (and consume) the 0
>>>>>         if c /= 0 then (c :) <$> go else pure []
>>>>>  
>>>>> While it does work, I'm unhappy with decodeFilename as it basically 
>>>>> implements a combination of map and span/fold with explicit recursion. 
>>>>> But the underlying ByteString isn't available inside the Get monad 
>>>>> without consuming it, so using e.g. BS.span seems out of the question. 
>>>>> Let's see if lenses can come to the rescue:
>>>>>
>>>>> entryParser tableStart = do
>>>>>     nameChunks <- zoom (PB.span (/= 0)) P.drawAll
>>>>>     PB.drawByte -- draw the terminating 0
>>>>>     let fileName = T.unpack . decodeLatin1 . BS.map (flip rotateR 3)
>>>>>                  . BS.concat $ nameChunks
>>>>>     P.decodeGet $ (,,,) fileName <$> fmap (tableStart +) getWord32le
>>>>>                                  <*> getWord32le <*> getWord32le
>>>>>  
>>>>> I like this better - map and span aren't implemented manually anymore 
>>>>> - but at the same time I was hoping for more. It doesn't seem right to 
>>>>> work directly on ByteStrings (i.e. BS.map instead of PB.map, and text 
>>>>> instead of pipes-text), and the combination of drawAll and concat is a 
>>>>> bit awkward, especially since drawAll is only for testing (even though 
>>>>> all the tutorials use it :) ). The latter point might be addressed by 
>>>>> giving pipes-bytestring a folding function similar to P.foldAll, but 
>>>>> even so I wonder if there's a more idiomatic way to do this?
>>>>>  -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "Haskell Pipes" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>>
>>>>>