ok, got it, i'll stare at this when i'm more awake. :)
On Fri, Feb 21, 2014 at 12:40 AM, Gabriel Gonzalez <[email protected]> wrote:

> This is not related to the quadratic issue from the naive free monad
> bind. Switching to the codensity transformation would not fix the
> problem at all.
>
> The tl;dr way to convince yourself of this is to just compare the
> performance of the following two parsers:
>
>     example1 = forever $ zoom id draw
>
>     example2 = forever draw
>
> The former will trigger the quadratic blowup whereas the second one
> will not. The issue is `zoom`, not `forever`.
>
> In more detail, the issue is that `zoom` currently works like this:
>
> * Decode all input elements (lazily)
> * Run the parser on that
> * Encode all input elements (lazily)
>
> So let's say that your input has 1000 elements and your parser only
> consumes 2 elements. After it's done parsing those two elements, the
> remaining 998 elements will all still be decoded and re-encoded. This
> decoding/re-encoding is lazy, but it's not free, and it triggers when
> you demand each element, adding an O(1) overhead per element.
>
> Now, imagine that you call `zoom` N times. Every subsequent element
> will then be decoded and re-encoded N times. This means that if you
> call `zoom` N times the total running time is O(N^2)
> (i.e. N * (N + 1) / 2).
>
> On 02/21/2014 12:31 PM, Carter Schonwald wrote:
>
> Gabriel, isn't Edward K's Machines.Process an example of another
> strategy that evades that quadratic blowup?
>
> http://hackage.haskell.org/package/machines-0.2.3.1/docs/Data-Machine-Process.html
> (or is this a totally unrelated problem?)
>
> On Thu, Feb 20, 2014 at 11:59 PM, Gabriel Gonzalez <[email protected]> wrote:
>
>> So the one part of Michael's post that was correct was that each use
>> of `zoom` adds a constant O(1) overhead to every subsequent command.
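To make that per-element overhead concrete, here is a toy model of the mechanism (my own invention for illustration; this is not the pipes-parse API): a "parser" is just a state-passing function over a list of leftovers, and `zoomish` plays the role of `zoom` for an invertible decode/encode pair.

```haskell
-- A "parser" is modeled as a state-passing function over a list of
-- leftovers (a stand-in for the real pipes-parse `Parser`):
type ToyParser a r = [a] -> (r, [a])

-- Draw a single element, if any:
draw1 :: ToyParser a (Maybe a)
draw1 []     = (Nothing, [])
draw1 (x:xs) = (Just x,  xs)

-- `zoomish` models `zoom`: decode the whole stream (lazily), run the
-- parser, then re-encode the unconsumed remainder.  Even if the parser
-- consumes only 2 of 1000 elements, the other 998 come back wrapped in
-- (map enc . map dec), which adds O(1) per element when they are later
-- demanded:
zoomish :: (a -> b) -> (b -> a) -> ToyParser b r -> ToyParser a r
zoomish dec enc p input =
    let (r, leftover) = p (map dec input)
    in  (r, map enc leftover)

-- Two sequential zoomed draws: the second call's input has already
-- passed through one decode/re-encode layer, the third would see two,
-- and so on -- after N calls the leftovers sit under N layers, which
-- is the O(N^2) total described above:
twoDraws :: ToyParser Int (Maybe Int, Maybe Int)
twoDraws input0 =
    let (x, input1) = zoomish id id draw1 input0
        (y, input2) = zoomish id id draw1 input1
    in  ((x, y), input2)
```

Nothing here forces the layers early; as in the real library, the cost only shows up when the wrapped leftovers are demanded.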
>> The worst case scenario is a quadratic blowup for code like this:
>>
>>     forever $ do
>>         a <- zoom lens parser
>>         doSomethingWith a
>>
>> Every iteration of `forever` will accumulate an additional O(1)
>> overhead, so by the Nth iteration you get O(N) overhead per
>> iteration, and the total cost of N iterations is O(N^2). This is what
>> was happening in your code, since your code is basically of the form:
>>
>>     replicateM_ n $ do
>>         a <- zoom lens parser
>>         doSomethingWith a
>>
>> ... so that triggers O(N^2) space and time complexity, where `n` in
>> your example was `10000`.
>>
>> I'm working on a solution to fix this and I will summarize it in
>> broad strokes here. This solution is backwards-compatible with the
>> current API and in most use cases will compile to code that is more
>> efficient (both in time complexity and constant factors). This would
>> solve the problem that Michael described.
>>
>> There would be a new `Pipes.Parse.Fast` internal module (for
>> implementers only) which will have two new types. One of them will be
>> a more efficient analog of `Lens'`. Let's call it `LensFast'` for
>> lack of a better name.
>>
>> I will omit the implementation of `LensFast'` since I'm still playing
>> with different ideas, but I can describe its interface. First, it
>> would form a category, like `Lens'`es:
>>
>>     (.) :: LensFast' a b -> LensFast' b c -> LensFast' a c
>>
>>     id :: LensFast' a a
>>
>> ... and there would be a functor from `LensFast'` to `Lens'`:
>>
>>     optimal :: LensFast' a b -> Lens' a b
>>
>> ... that obeys these functor laws:
>>
>>     optimal (f . g) = optimal f . optimal g
>>
>>     optimal id = id
>>
>> There would also be a more efficient parser type, which I will
>> temporarily call `ParserFast`.
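One way that interface *could* be realized (purely a guess on my part; the thread explicitly leaves the representation open, and all names below are invented) is as a decode/encode pair, in the spirit of the `decodeUtf8` lens:

```haskell
{-# LANGUAGE RankNTypes #-}

import Prelude hiding ((.), id)
import qualified Prelude

-- Hypothetical representation of `LensFast'`: a coupled decode/encode
-- pair.  This cannot express everything a real lens can, but it is
-- enough to give the category and functor structure described above.
data LensFast' a b = LensFast'
    { decode :: a -> b
    , encode :: b -> a
    }

-- Composition and identity with the signatures given above (note the
-- lens-style composition order, the reverse of Control.Category):
(.) :: LensFast' a b -> LensFast' b c -> LensFast' a c
LensFast' dec enc . LensFast' dec' enc' =
    LensFast' (dec' Prelude.. dec) (enc Prelude.. enc')

id :: LensFast' a a
id = LensFast' Prelude.id Prelude.id

-- A van Laarhoven lens, as in lens-family:
type Lens' a b = forall f . Functor f => (b -> f b) -> (a -> f a)

-- The functor into ordinary lenses.  The laws
-- `optimal (f . g) = optimal f . optimal g` and `optimal id = id`
-- follow by unfolding these definitions.
optimal :: LensFast' a b -> Lens' a b
optimal (LensFast' dec enc) k a = fmap enc (k (dec a))
```

Under this representation `optimal` produces an isomorphism-style lens, which is exactly the kind of lens (like `decodeUtf8`) that the rest of the thread says triggers the quadratic behavior.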
>> This representation I've already nailed down and it would be
>> equivalent to a free monad transformer over the `draw`/`unDraw`
>> primitive operations, with an optional bind in the base monad for
>> efficiency:
>>
>>     data ParserFast a m r
>>         = M (m (ParserFast a m r))
>>         | Draw (Maybe a -> ParserFast a m r)
>>         | UnDraw a (ParserFast a m r)
>>
>>     instance Monad m => Monad (ParserFast a m) where ...
>>
>>     instance MonadTrans (ParserFast a) where ...
>>
>> This would come with an `interpret` function that would be a monad
>> morphism from `ParserFast` to `Parser`:
>>
>>     interpret :: Monad m => ParserFast a m r -> Parser a m r
>>
>> This would obey the following monad morphism laws:
>>
>>     interpret (do { x <- m; f x }) = do { x <- interpret m; interpret (f x) }
>>
>>     interpret (return x) = return x
>>
>> ... and monad morphism laws are also functor laws if you view them
>> from this perspective:
>>
>>     interpret . (f >=> g) = interpret . f >=> interpret . g
>>
>>     interpret . return = return
>>
>> The last piece of the puzzle is a `zoom_` function designed to work
>> with optimized `LensFast'`s and `ParserFast`s. It would have this
>> type:
>>
>>     zoom_ :: Monad m => LensFast' a b -> ParserFast b m r -> ParserFast a m r
>>
>> ... and it would obey this equation:
>>
>>     zoom (optimal lensfast) (interpret parserfast)
>>         = interpret (zoom_ lensfast parserfast)
>>
>> That equation would be a rewrite rule that turns the inefficient
>> `zoom` into the much more efficient `zoom_`, which doesn't do
>> unnecessary round-trips or trigger quadratic time complexity.
>>
>> The way you put it all together is that the default user-facing API
>> would define all lenses and parsers in terms of `optimal` and
>> `interpret`, i.e.:
>>
>>     exampleParser = interpret internalParserFast
>>
>>     exampleLens = optimal internalLensFast
>>
>> Then if the user writes something like:
>>
>>     zoom exampleLens exampleParser
>>
>> ...
>> then `exampleParser` and `exampleLens` would be written to force
>> inlining early, exposing the `interpret` and `optimal` functions:
>>
>>     zoom (optimal internalLensFast) (interpret internalParserFast)
>>
>> ... then that would trigger the rewrite rule:
>>
>>     interpret (zoom_ internalLensFast internalParserFast)
>>
>> ... and now the code would be efficient and avoid the problems
>> Michael described.
>>
>> You can also optimize some other cases automatically, too. For
>> example, if the user composes two lenses, which are each internally
>> optimized:
>>
>>     lens1 = optimal internalLens1
>>     lens2 = optimal internalLens2
>>
>>     parser = interpret internalParser
>>
>>     example = zoom (lens1 . lens2) parser
>>
>> ... and that is inlined:
>>
>>     example = zoom (optimal internalLens1 . optimal internalLens2)
>>                    (interpret internalParser)
>>
>> ... then you can provide a rewrite rule using the first functor law
>> for `optimal` to rewrite the two lenses to a single optimized lens:
>>
>>     example = zoom (optimal (internalLens1 . internalLens2))
>>                    (interpret internalParser)
>>
>> That is now in the right form for the `zoom_` rewrite rule and
>> everything just works:
>>
>>     example = interpret (zoom_ (internalLens1 . internalLens2)
>>                                internalParser)
>>
>> So far so good! We can expect those rewrite rules to fire reliably
>> and generate efficient code.
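As a sketch, those two equations could be transcribed into GHC RULES pragmas roughly as follows (hypothetical: `zoom_`, `optimal`, and `interpret` are the proposed functions from this thread, the rule names are invented, and a real implementation would also need INLINE and phase control so the left-hand sides reach the simplifier in exactly this syntactic shape):

```haskell
-- Hypothetical transcription of the two equations into rewrite rules:
{-# RULES
    "zoom/optimal/interpret" forall l p.
        zoom (optimal l) (interpret p) = interpret (zoom_ l p)
#-}

{-# RULES
    "optimal/compose" forall l1 l2.
        optimal l1 . optimal l2 = optimal (l1 . l2)
#-}
```

The second rule applies the first functor law for `optimal` to collapse composed lenses into the single-`optimal` form that the first rule can match, which is exactly the `lens1 . lens2` case walked through above.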
>> The problematic part is where you `zoom` over multiple parsers like
>> this:
>>
>>     lens = optimal internalLens
>>     parser1 = interpret internalParser1
>>     parser2 = interpret internalParser2
>>
>>     example = zoom lens $ do
>>         x <- parser1
>>         y <- parser2
>>         return (x, y)
>>
>> If you inline those you get:
>>
>>     example = zoom (optimal internalLens) $ do
>>         x <- interpret internalParser1
>>         y <- interpret internalParser2
>>         return (x, y)
>>
>> In theory you could use the monad morphism laws for `interpret` to
>> factor out the `interpret`:
>>
>>     example = zoom (optimal internalLens) $ interpret $ do
>>         x <- internalParser1
>>         y <- internalParser2
>>         return (x, y)
>>
>> Now that's in a form that can be optimized by the rewrite rule for
>> `zoom_`:
>>
>>     example = interpret $ zoom_ internalLens $ do
>>         x <- internalParser1
>>         y <- internalParser2
>>         return (x, y)
>>
>> **HOWEVER** in practice I don't yet know a way to get the monad
>> morphism rewrite rules to factor out the `interpret` in all cases,
>> which is necessary to trigger the `zoom_` rewrite rule. This is where
>> Haskell's rewrite rule system falls flat on its face. If anybody
>> wants a longer explanation of why, I can elaborate more on this.
>>
>> In practice, I'm willing to put up with this inefficiency and just
>> explain to users how to work around this latter issue by either:
>>
>> a) distributing the `zoom` over the parsers, like this:
>>
>>     example = do
>>         x <- zoom lens parser1
>>         y <- zoom lens parser2
>>         return (x, y)
>>
>> Note that this distribution only works if the lens does not violate
>> the monad morphism laws for `zoom`. Fortunately, the lenses that
>> trigger this performance issue are precisely the same lenses that
>> satisfy the monad morphism laws for `zoom` (example: the
>> `decodeUtf8` lens). Vice versa, the lenses that violate the `zoom`
>> monad morphism laws are the ones that give excellent performance in
>> conjunction with `zoom` (i.e. delimiting lenses like `splitAt`).
>> b) exposing the internal lenses and parsers so that users can use
>> them directly rather than relying on rewrite rules.
>>
>> I prefer approach (a) because then we can keep the internals hidden.
>>
>> Note that this solution requires that `pipes-parse` provides its own
>> `zoom` function (which is really easy), or alternatively defines the
>> rewrite rule in terms of the `zoom` function from the `lens-family`
>> package. A rewrite rule in terms of the `zoom` function from `lens`
>> is out of the question since I don't want a `lens` dependency.
>>
>> Does that answer your question?
>>
>> On 02/21/2014 03:37 AM, Michael Thompson wrote:
>>
>>> I was looking at Michael Snoyman's post on pipes parsers,
>>>
>>> http://www.yesodweb.com/blog/2014/02/ideas-pipes-parse
>>>
>>> and working through some of the reddit controversy,
>>>
>>> http://www.reddit.com/r/haskell/comments/1xmmtn/some_ideas_for_pipesparse/
>>>
>>> trying to think whether it might matter to `pipes-text`.
>>>
>>> I worked from the attempted pipes variant of his conduit parser for
>>> a special archival format for a 'File' type. Here is the result I
>>> will discuss below:
>>>
>>> https://gist.github.com/michaelt/9122379
>>>
>>> The first form of the parser (`fileParser0`) was not so bad when I
>>> used it on simple files. Then I increased the length of the file to
>>> be parsed into a succession of `Files`, resulting in a 1M file.
>>> (You can write a suitable file by uncommenting the first line in
>>> `main`.)
>>>
>>> On 1M of data, the obvious ways of formulating the parser in the
>>> pipes-parse style seem to go wildly wrong. The extravagance of my
>>> computer's response -- "swap space full", etc. -- may have something
>>> to do with OS X vs. ghc-7.6, but they are obviously bad programs. I
>>> couldn't get them to complete.
>>>
>>> That is the result if I use either of `fileParser0` or
>>> `fileParser1`.
>>> If however I use `parse2`, where I limit each use of `zoom utf8` by
>>> a prior application of `PB.splitAt` (or break), the 1M file is done
>>> in half a second and uses no significant memory.
>>>
>>> Is there a mistake I am making, a mistake in the implementation of
>>> something, or some simple rule that can be stated to avoid such
>>> things?
>>>
>>> Also I wonder if the result of using e.g. `fileParser1` is as
>>> pathological on other OSes.
>>>
>>> yours, Michael
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Haskell Pipes" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to [email protected].
>>> To post to this group, send email to [email protected].
