So the one part of Michael's post that was correct was that each use of `zoom` adds a constant O(1) overhead to every subsequent command. The worst case scenario is a quadratic blowup for code like this:

    forever $ do
        a <- zoom lens parser
        doSomethingWith a

Every iteration of `forever` accumulates an additional O(1) overhead, so by the Nth iteration you get O(N) overhead per iteration, and the total cost of N iterations is O(N^2). This is what was happening in your code, which is basically of the form:

    replicateM_ n $ do
        a <- zoom lens parser
        doSomethingWith a

... so that triggers O(N^2) space and time complexity, where `n` in your example was `10000`.
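
To make the arithmetic concrete: if iteration i carries overhead proportional to i, then N iterations cost on the order of 1 + 2 + ... + N = N(N+1)/2, which is O(N^2). A trivial back-of-the-envelope model (plain Haskell arithmetic, not pipes code):

```haskell
-- Model of the accumulated cost: iteration i pays one unit of work for
-- each of the i layers of `zoom` overhead built up so far.
totalOverhead :: Int -> Int
totalOverhead n = sum [1 .. n]  -- equals n * (n + 1) `div` 2

main :: IO ()
main = print (totalOverhead 10000)
```

For n = 10000 that is 50005000 units of modeled work, i.e. roughly n^2/2, which matches the blowup you observed.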

I'm working on a solution to fix this and I will summarize it in broad strokes here. This solution is backwards-compatible with the current API and in most use cases will compile to code that is more efficient (both in time complexity and constant factors). It would solve the problem that Michael described.

There would be a new `Pipes.Parse.Fast` internal module (for implementers only) which would have two new types. One of them would be a more efficient analog of `Lens'`. Let's call it `LensFast'` for lack of a better name.

I will omit the implementation of `LensFast'` since I'm still playing with different ideas, but I can describe its interface. First, it would form a category, like `Lens'`es:

    (.) :: LensFast' a b -> LensFast' b c -> LensFast' a c

    id :: LensFast' a a

... and there would be a functor from `LensFast'` to `Lens'`:

    optimal :: LensFast' a b -> Lens' a b

... that obeys these functor laws:

    optimal (f . g) = optimal f . optimal g

    optimal id = id

There would also be a more efficient parser type, which I will temporarily call `ParserFast`. I've already nailed down this representation: it would be equivalent to a free monad transformer over the `draw`/`unDraw` primitive operations, with an optional bind in the base monad for efficiency:

    data ParserFast a m r
        = Pure r
        | M (m (ParserFast a m r))
        | Draw (Maybe a -> ParserFast a m r)
        | UnDraw a (ParserFast a m r)

    instance Monad m => Monad (ParserFast a m) where ...

    instance MonadTrans (ParserFast a) where ...
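
The elided instances are the usual free-monad-transformer boilerplate. Here is a sketch of one way they might look, repeating the data type (with the `Pure` leaf that `return` requires) so it stands alone, plus a small list-based runner that is purely for trying it out and not part of the proposed API:

```haskell
import Control.Monad.Trans.Class (MonadTrans (..))

data ParserFast a m r
    = Pure r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)

instance Monad m => Functor (ParserFast a m) where
    fmap f p = p >>= (Pure . f)

instance Monad m => Applicative (ParserFast a m) where
    pure = Pure
    pf <*> px = pf >>= \f -> fmap f px

instance Monad m => Monad (ParserFast a m) where
    return = pure
    Pure r     >>= f = f r
    M m        >>= f = M (fmap (>>= f) m)         -- push the bind under the base monad
    Draw k     >>= f = Draw (\ma -> k ma >>= f)   -- extend the continuation
    UnDraw a p >>= f = UnDraw a (p >>= f)

instance MonadTrans (ParserFast a) where
    lift m = M (fmap Pure m)

-- Illustration only: feed a list of inputs to a ParserFast and return
-- the result plus the leftover inputs.
feed :: Monad m => [a] -> ParserFast a m r -> m (r, [a])
feed xs       (Pure r)     = return (r, xs)
feed xs       (M m)        = m >>= feed xs
feed []       (Draw k)     = feed [] (k Nothing)
feed (x : xs) (Draw k)     = feed xs (k (Just x))
feed xs       (UnDraw a p) = feed (a : xs) p
```

For example, `feed [1,2,3] (Draw Pure)` draws one element and returns `(Just 1, [2,3])`.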

This would come with an `interpret` function that would be a monad morphism from `ParserFast` to `Parser`:

    interpret :: Monad m => ParserFast a m r -> Parser a m r

This would obey the following monad morphism laws:

    interpret (m >>= f) = interpret m >>= (interpret . f)

    interpret (return x) = return x

... and monad morphism laws are also functor laws if you view them from this perspective:

    interpret . (f >=> g) = interpret . f >=> interpret . g

    interpret . return = return
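
To make the shape of `interpret` concrete, here is a sketch. Everything in it is illustrative: I use a plain list of inputs as a stand-in for pipes-parse's real `Parser` state (which is a `StateT` over a `Producer`), with stand-in `draw`/`unDraw` to match, and I repeat the `ParserFast` type (with a `Pure` leaf) so the sketch compiles on its own:

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State (StateT, get, put, modify, runStateT)

data ParserFast a m r
    = Pure r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)

-- Stand-in for pipes-parse's Parser: the state is the remaining input.
type Parser a m r = StateT [a] m r

draw :: Monad m => Parser a m (Maybe a)
draw = do
    xs <- get
    case xs of
        []      -> return Nothing
        x : xs' -> put xs' >> return (Just x)

unDraw :: Monad m => a -> Parser a m ()
unDraw a = modify (a :)

-- The monad morphism: each ParserFast constructor maps onto the
-- corresponding Parser primitive.
interpret :: Monad m => ParserFast a m r -> Parser a m r
interpret (Pure r)     = return r
interpret (M m)        = lift m >>= interpret
interpret (Draw k)     = draw >>= interpret . k
interpret (UnDraw a p) = unDraw a >> interpret p
```

For example, `runStateT (interpret (Draw Pure)) [1,2,3]` returns `(Just 1, [2,3])`. The morphism laws then follow by structural induction over the four constructors.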

The last piece of the puzzle is a `zoom_` function designed to work with optimized `LensFast'`s and `ParserFast`s. It would have this type:

    zoom_ :: Monad m => LensFast' a b -> ParserFast b m r -> ParserFast a m r

... and it would obey this equation:

    zoom (optimal lensfast) (interpret parserfast) = interpret (zoom_ lensfast parserfast)

That equation would be a rewrite rule that turns the inefficient `zoom` into the much more efficient `zoom_` that doesn't do unnecessary round-trips or trigger quadratic time complexity.
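
In GHC terms, that equation would become a RULES pragma along these lines (just a sketch: the rule name is made up, and in practice the phase control and INLINE/NOINLINE annotations would need careful tuning so the rule fires before `zoom`, `optimal`, and `interpret` themselves inline away):

```haskell
{-# RULES
"zoom/optimal/interpret" forall lensFast parserFast.
    zoom (optimal lensFast) (interpret parserFast)
        = interpret (zoom_ lensFast parserFast)
  #-}
```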

The way you put it all together is that the default user-facing API would define all lenses and parsers in terms of `optimal` and `interpret`, i.e.:

    exampleParser = interpret internalParserFast

    exampleLens = optimal internalLensFast

Then if the user writes something like:

    zoom exampleLens exampleParser

... then `exampleParser` and `exampleLens` would be written to force inlining early, exposing the `interpret` and `optimal` functions:

    zoom (optimal internalLensFast) (interpret internalParserFast)

... then that would trigger the rewrite rule:

    interpret (zoom_ internalLensFast internalParserFast)

... and now the code would be efficient and avoid the problems Michael described.

You can also optimize some other cases automatically. For example, if the user composes two lenses, each of which is internally optimized:

    lens1 = optimal internalLens1
    lens2 = optimal internalLens2

    parser = interpret internalParser

    example = zoom (lens1 . lens2) parser

... and that is inlined:

    example = zoom (optimal internalLens1 . optimal internalLens2) (interpret internalParser)

Then you can provide a rewrite rule using the first functor law for `optimal` to rewrite the two lenses to a single optimized lens:

    example = zoom (optimal (internalLens1 . internalLens2)) (interpret internalParser)

Then that is now in the right form for the `zoom_` rewrite rule and everything just works!

    example = interpret (zoom_ (internalLens1 . internalLens2) internalParser)

So far so good! We can expect those rewrite rules to fire reliably and generate efficient code.
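
The first functor law for `optimal` would likewise be expressed as a rewrite rule, so that composed lenses collapse into a single optimized lens before the `zoom_` rule fires. A sketch of the pragma (the rule name is made up; note that matching on Prelude's `(.)` before GHC inlines it is itself delicate):

```haskell
{-# RULES
"optimal/compose" forall f g.
    optimal f . optimal g = optimal (f . g)
  #-}
```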

The problematic part is where you `zoom` over multiple parsers like this:

    lens = optimal internalLens
    parser1 = interpret internalParser1
    parser2 = interpret internalParser2

    example = zoom lens $ do
        x <- parser1
        y <- parser2
        return (x, y)

If you inline those you get:

    example = zoom (optimal internalLens) $ do
        x <- interpret internalParser1
        y <- interpret internalParser2
        return (x, y)

In theory you could use the monad morphism laws for `interpret` to factor out the `interpret`:

    example = zoom (optimal internalLens) $ interpret $ do
        x <- internalParser1
        y <- internalParser2
        return (x, y)

Now that's in a form that can be optimized by the rewrite rule for `zoom_`:

    example = interpret $ zoom_ internalLens $ do
        x <- internalParser1
        y <- internalParser2
        return (x, y)

**HOWEVER** in practice I don't yet know a way to get the monad morphism rewrite rules to factor out the `interpret` in all cases, which is necessary to trigger the `zoom_` rewrite rule. This is where Haskell's rewrite rule system falls flat on its face. If anybody wants a longer explanation of why, I can elaborate more on this.

In practice, I'm willing to put up with this inefficiency and just explain to users how to work around this latter issue by either:

a) distributing the `zoom` over the parsers, like this:

    example = do
        x <- zoom lens parser1
        y <- zoom lens parser2
        return (x, y)

Note that this distribution only works if the lens does not violate the monad morphism laws for `zoom`. Fortunately, the lenses that trigger this performance issue are precisely the lenses that satisfy the monad morphism laws for `zoom` (example: the `decodeUtf8` lens). Conversely, the lenses that violate the `zoom` monad morphism laws are the ones that give excellent performance in conjunction with `zoom` (i.e. delimiting lenses like `splitAt`).

b) exposing the internal lenses and parsers so that users can use them directly rather than relying on rewrite rules.

I prefer approach (a) because then we can keep the internals hidden.

Note that this solution requires that `pipes-parse` provides its own `zoom` function (which is really easy), or alternatively defines the rewrite rule in terms of the `zoom` function from the `lens-family` package. A rewrite rule in terms of the `zoom` function from `lens` is out of the question since I don't want a `lens` dependency.

Does that answer your question?


On 02/21/2014 03:37 AM, Michael Thompson wrote:
I was looking at Michael Snoyman's post on pipes parsers,

http://www.yesodweb.com/blog/2014/02/ideas-pipes-parse

and working through some of the reddit controversy

http://www.reddit.com/r/haskell/comments/1xmmtn/some_ideas_for_pipesparse/

trying to think if it might matter to `pipes-text`

I worked from the attempted pipes variant of his conduit
parser for a special archival format for a 'File' type.
Here is the result I will discuss below

https://gist.github.com/michaelt/9122379

The first form of the parser (`fileParser0`) was
not so bad when I used it on simple files.
Then I increased the length of the file to be parsed
into a succession of `Files` resulting in a 1M file.
(You can write a suitable file by uncommenting
the first line in `main`)

On 1M of data, the obvious ways of
formulating the parser in the pipes-parse style
seem to go wildly wrong. The extravagance of
my computer's response -- "swap space full", etc --
may have something to do with os x vs. ghc-7.6
but they are obviously bad programs. I couldn't
get them to complete.

That is the result if I use either of `fileParser0` or
`fileParser1`.

If however I use `parse2`, where I limit
each use of `zoom utf8` by a prior
application of `PB.splitAt` (or break)
the 1M file is done in a half-second and
uses no significant memory.

Is there a mistake I am making, a mistake
in the implementation of something,
or some simple rule that can be stated
to avoid such things?

Also I wonder if the result of using e.g `fileParser1` is as pathological on other os's.

yours Michael


--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
