So the one part of Michael's post that was correct was that each use of `zoom` adds a constant O(1) overhead to every subsequent command. The worst case scenario is a quadratic blowup for code like this:

    forever $ do
        a <- zoom lens parser
        doSomethingWith a

Every iteration of `forever` accumulates an additional O(1) overhead, so by the Nth iteration you get O(N) overhead per iteration, and the total cost of N iterations is O(N^2). This is what was happening in your code, which is basically of the form:

    replicateM_ n $ do
        a <- zoom lens parser
        doSomethingWith a

... so that triggers O(N^2) space and time complexity, where `n` in your example was `10000`.
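
To make the arithmetic concrete: if iteration i carries overhead proportional to i, then N iterations cost on the order of 1 + 2 + ... + N = N(N+1)/2, which is O(N^2). A trivial back-of-the-envelope model (plain Haskell arithmetic, not pipes code):

```haskell
-- Model of the accumulated cost: iteration i pays one unit of work for
-- each of the i layers of `zoom` overhead built up so far.
totalOverhead :: Int -> Int
totalOverhead n = sum [1 .. n]  -- equals n * (n + 1) `div` 2

main :: IO ()
main = print (totalOverhead 10000)
```

For n = 10000 that is 50005000 units of modeled work, i.e. roughly n^2/2, which matches the blowup you observed.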

I'm working on a solution to fix this and I will summarize it in broad strokes here. This solution is backwards-compatible with the current API and in most use cases will compile to code that is more efficient (both in time complexity and constant factors). It would solve the problem that Michael described.

There would be a new `Pipes.Parse.Fast` internal module (for implementers only) which would have two new types. One of them would be a more efficient analog of `Lens'`. Let's call it `LensFast'` for lack of a better name.

I will omit the implementation of `LensFast'` since I'm still playing with different ideas, but I can describe its interface. First, it would form a category, like `Lens'`es:

    (.) :: LensFast' a b -> LensFast' b c -> LensFast' a c

    id :: LensFast' a a

... and there would be a functor from `LensFast'` to `Lens'`:

    optimal :: LensFast' a b -> Lens' a b

... that obeys these functor laws:

    optimal (f . g) = optimal f . optimal g

    optimal id = id

There would also be a more efficient parser type, which I will temporarily call `ParserFast`. I've already nailed down this representation: it would be equivalent to a free monad transformer over the `draw`/`unDraw` primitive operations, with an optional bind in the base monad for efficiency:

    data ParserFast a m r
        = Pure r
        | M (m (ParserFast a m r))
        | Draw (Maybe a -> ParserFast a m r)
        | UnDraw a (ParserFast a m r)

    instance Monad m => Monad (ParserFast a m) where ...

    instance MonadTrans (ParserFast a) where ...
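
The elided instances are the usual free-monad-transformer boilerplate. Here is a sketch of one way they might look, repeating the data type (with the `Pure` leaf that `return` requires) so it stands alone, plus a small list-based runner that is purely for trying it out and not part of the proposed API:

```haskell
import Control.Monad.Trans.Class (MonadTrans (..))

data ParserFast a m r
    = Pure r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)

instance Monad m => Functor (ParserFast a m) where
    fmap f p = p >>= (Pure . f)

instance Monad m => Applicative (ParserFast a m) where
    pure = Pure
    pf <*> px = pf >>= \f -> fmap f px

instance Monad m => Monad (ParserFast a m) where
    return = pure
    Pure r     >>= f = f r
    M m        >>= f = M (fmap (>>= f) m)         -- push the bind under the base monad
    Draw k     >>= f = Draw (\ma -> k ma >>= f)   -- extend the continuation
    UnDraw a p >>= f = UnDraw a (p >>= f)

instance MonadTrans (ParserFast a) where
    lift m = M (fmap Pure m)

-- Illustration only: feed a list of inputs to a ParserFast and return
-- the result plus the leftover inputs.
feed :: Monad m => [a] -> ParserFast a m r -> m (r, [a])
feed xs       (Pure r)     = return (r, xs)
feed xs       (M m)        = m >>= feed xs
feed []       (Draw k)     = feed [] (k Nothing)
feed (x : xs) (Draw k)     = feed xs (k (Just x))
feed xs       (UnDraw a p) = feed (a : xs) p
```

For example, `feed [1,2,3] (Draw Pure)` draws one element and returns `(Just 1, [2,3])`.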

This would come with an `interpret` function that would be a monad morphism from `ParserFast` to `Parser`:

    interpret :: Monad m => ParserFast a m r -> Parser a m r

This would obey the following monad morphism laws:

    interpret (m >>= f) = interpret m >>= (interpret . f)

    interpret (return x) = return x

... and monad morphism laws are also functor laws if you view them from this perspective:

    interpret . (f >=> g) = interpret . f >=> interpret . g

    interpret . return = return
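
To make the shape of `interpret` concrete, here is a sketch. Everything in it is illustrative: I use a plain list of inputs as a stand-in for pipes-parse's real `Parser` state (which is a `StateT` over a `Producer`), with stand-in `draw`/`unDraw` to match, and I repeat the `ParserFast` type (with a `Pure` leaf) so the sketch compiles on its own:

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State (StateT, get, put, modify, runStateT)

data ParserFast a m r
    = Pure r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)

-- Stand-in for pipes-parse's Parser: the state is the remaining input.
type Parser a m r = StateT [a] m r

draw :: Monad m => Parser a m (Maybe a)
draw = do
    xs <- get
    case xs of
        []      -> return Nothing
        x : xs' -> put xs' >> return (Just x)

unDraw :: Monad m => a -> Parser a m ()
unDraw a = modify (a :)

-- The monad morphism: each ParserFast constructor maps onto the
-- corresponding Parser primitive.
interpret :: Monad m => ParserFast a m r -> Parser a m r
interpret (Pure r)     = return r
interpret (M m)        = lift m >>= interpret
interpret (Draw k)     = draw >>= interpret . k
interpret (UnDraw a p) = unDraw a >> interpret p
```

For example, `runStateT (interpret (Draw Pure)) [1,2,3]` returns `(Just 1, [2,3])`. The morphism laws then follow by structural induction over the four constructors.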

The last piece of the puzzle is a `zoom_` function designed to work with optimized `LensFast'`s and `ParserFast`s. It would have this type:

    zoom_ :: Monad m => LensFast' a b -> ParserFast b m r -> ParserFast a m r

... and it would obey this equation:

    zoom (optimal lensfast) (interpret parserfast) = interpret (zoom_ lensfast parserfast)

That equation would be a rewrite rule that turns the inefficient `zoom` into the much more efficient `zoom_` that doesn't do unnecessary round-trips or trigger quadratic time complexity.
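
In GHC terms, that equation would become a RULES pragma along these lines (just a sketch: the rule name is made up, and in practice the phase control and INLINE/NOINLINE annotations would need careful tuning so the rule fires before `zoom`, `optimal`, and `interpret` themselves inline away):

```haskell
{-# RULES
"zoom/optimal/interpret" forall lensFast parserFast.
    zoom (optimal lensFast) (interpret parserFast)
        = interpret (zoom_ lensFast parserFast)
  #-}
```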

The way you put it all together is that the default user-facing API would define all lenses and parsers in terms of `optimal` and `interpret`, i.e.:

    exampleParser = interpret internalParserFast

    exampleLens = optimal internalLensFast

Then if the user writes something like:

    zoom exampleLens exampleParser

... then `exampleParser` and `exampleLens` would be written to force inlining early, exposing the `interpret` and `optimal` functions:

    zoom (optimal internalLensFast) (interpret internalParserFast)

... then that would trigger the rewrite rule:

    interpret (zoom_ internalLensFast internalParserFast)

... and now the code would be efficient and avoid the problems Michael described.

You can also optimize some other cases automatically. For example, if the user composes two lenses, each of which is internally optimized:

    lens1 = optimal internalLens1
    lens2 = optimal internalLens2

    parser = interpret internalParser

    example = zoom (lens1 . lens2) parser

... and that is inlined:

    example = zoom (optimal internalLens1 . optimal internalLens2) (interpret internalParser)

Then you can provide a rewrite rule using the first functor law for `optimal` to rewrite the two lenses to a single optimized lens:

    example = zoom (optimal (internalLens1 . internalLens2)) (interpret internalParser)

Then that is now in the right form for the `zoom_` rewrite rule and everything just works!

    example = interpret (zoom_ (internalLens1 . internalLens2) internalParser)

So far so good! We can expect those rewrite rules to fire reliably and generate efficient code.
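
The first functor law for `optimal` would likewise be expressed as a rewrite rule, so that composed lenses collapse into a single optimized lens before the `zoom_` rule fires. A sketch of the pragma (the rule name is made up; note that matching on Prelude's `(.)` before GHC inlines it is itself delicate):

```haskell
{-# RULES
"optimal/compose" forall f g.
    optimal f . optimal g = optimal (f . g)
  #-}
```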

The problematic part is where you `zoom` over multiple parsers like this:

    lens = optimal internalLens
    parser1 = interpret internalParser1
    parser2 = interpret internalParser2

    example = zoom lens $ do
        x <- parser1
        y <- parser2
        return (x, y)

If you inline those you get:

    example = zoom (optimal internalLens) $ do
        x <- interpret internalParser1
        y <- interpret internalParser2
        return (x, y)

In theory you could use the monad morphism laws for `interpret` to factor out the `interpret`:

    example = zoom (optimal internalLens) $ interpret $ do
        x <- internalParser1
        y <- internalParser2
        return (x, y)

Now that's in a form that can be optimized by the rewrite rule for `zoom_`:

    example = interpret $ zoom_ internalLens $ do
        x <- internalParser1
        y <- internalParser2
        return (x, y)

**HOWEVER** in practice I don't yet know a way to get the monad morphism rewrite rules to factor out the `interpret` in all cases, which is necessary to trigger the `zoom_` rewrite rule. This is where Haskell's rewrite rule system falls flat on its face. If anybody wants a longer explanation of why, I can elaborate more on this.

In practice, I'm willing to put up with this inefficiency and just explain to users how to work around this latter issue by either:

a) distributing the `zoom` over the parsers, like this:

    example = do
        x <- zoom lens parser1
        y <- zoom lens parser2
        return (x, y)

Note that this distribution only works if the lens does not violate the monad morphism laws for `zoom`. Fortunately, the lenses that trigger this performance issue are precisely the lenses that satisfy the monad morphism laws for `zoom` (example: the `decodeUtf8` lens). Conversely, the lenses that violate the `zoom` monad morphism laws are the ones that give excellent performance in conjunction with `zoom` (i.e. delimiting lenses like `splitAt`).

b) exposing the internal lenses and parsers so that users can use them directly rather than relying on rewrite rules.

I prefer approach (a) because then we can keep the internals hidden.

Note that this solution requires that `pipes-parse` provides its own `zoom` function (which is really easy), or alternatively defines the rewrite rule in terms of the `zoom` function from the `lens-family` package. A rewrite rule in terms of the `zoom` function from `lens` is out of the question since I don't want a `lens` dependency.

Does that answer your question?


On 02/21/2014 03:37 AM, Michael Thompson wrote:
I was looking at Michael Snoyman's post on pipes parsers,

http://www.yesodweb.com/blog/2014/02/ideas-pipes-parse

and working through some of the reddit controversy

http://www.reddit.com/r/haskell/comments/1xmmtn/some_ideas_for_pipesparse/

trying to think if it might matter to `pipes-text`

I worked from the attempted pipes variant of his conduit
parser for a special archival format for a 'File' type.
Here is the result I will discuss below

https://gist.github.com/michaelt/9122379

The first form of the parser (`fileParser0`) was
not so bad when I used it on simple files.
Then I increased the length of the file to be parsed
into a succession of `Files` resulting in a 1M file.
(You can write a suitable file by uncommenting
the first line in `main`)

On 1M of data, the obvious ways of
formulating the parser in the pipes-parse style
seem to go wildly wrong. The extravagance of
my computer's response -- "swap space full", etc --
may have something to do with os x vs. ghc-7.6
but they are obviously bad programs. I couldn't
get them to complete.

That is the result if I use either of `fileParser0` or
`fileParser1`.

If however I use `parse2`, where I limit
each use of `zoom utf8` by a prior
application of `PB.splitAt` (or break)
the 1M file is done in a half-second and
uses no significant memory.

Is there a mistake I am making, a mistake
in the implementation of something,
or some simple rule that can be stated
to avoid such things?

Also I wonder if the result of using e.g `fileParser1` is as pathological on other os's.

yours Michael


--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
