This is not related to the quadratic issue from the naive free monad
bind. Switching to the codensity transformation would not fix the
problem at all.
The tl;dr way to convince yourself of this is to just compare the
performance of the following two parsers:
example1 = forever $ zoom id draw
example2 = forever draw
The former will trigger the quadratic blowup whereas the second one will
not. The issue is `zoom`, not `forever`.
In more detail, the issue is that `zoom` currently works like this:
* Decode all input elements (lazily)
* Run the parser on that
* Encode all input elements (lazily)
So let's say that your input has 1000 elements and your parser only
consumes 2 elements. After it's done parsing those two elements, the
remaining 998 elements will all still be decoded and re-encoded. This
decoding/re-encoding is lazy, but it's not free: it triggers when you
demand each element, adding an O(1) overhead per element.
Now imagine that you call `zoom` N times: every subsequent element will
be decoded and re-encoded N times, so the total running time over N
calls is O(N^2) (i.e. N * (N + 1) / 2).
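To double-check that arithmetic, here is a toy cost model (my own
illustration, not pipes-parse code): if the i-th `zoom` call pays i
units of decode/re-encode overhead, then N calls cost the triangular
number N * (N + 1) / 2 in total:

```haskell
-- Toy cost model, not real pipes code: the i-th `zoom` call pays i
-- units of decode/re-encode overhead, so N calls cost the triangular
-- number N * (N + 1) / 2 in total.
totalOverhead :: Int -> Int
totalOverhead n = sum [1 .. n]

main :: IO ()
main = do
    print (totalOverhead 1000)         -- 500500
    print (1000 * (1000 + 1) `div` 2)  -- 500500, the closed form agrees
```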
On 02/21/2014 12:31 PM, Carter Schonwald wrote:
Gabriel, isn't edward k's Machines.Process an example of another
strategy that evades that quadratic blowup?
http://hackage.haskell.org/package/machines-0.2.3.1/docs/Data-Machine-Process.html
(or is this a totally unrelated problem?)
On Thu, Feb 20, 2014 at 11:59 PM, Gabriel Gonzalez
<[email protected]> wrote:
So the one part of Michael's post that was correct was that each
use of `zoom` adds a constant O(1) overhead to every subsequent
command. The worst case scenario is a quadratic blowup for code
like this:
forever $ do
    a <- zoom lens parser
    doSomethingWith a
Every iteration of `forever` will accumulate an additional O(1)
overhead, so by the Nth iteration, you get O(N) overhead per
iteration, and the total cost of N iterations is O(N^2). This is
what was happening in your code since your code is basically of
the form:
replicateM_ n $ do
    a <- zoom lens parser
    doSomethingWith a
... so that triggers O(N^2) space and time complexity, where `n`
in your example was `10000`.
I'm working on a solution to fix this and I will summarize it in
broad strokes here. This solution is backwards-compatible with
the current API and in most use cases will compile to code that is
more efficient (both in time complexity and constant factors).
This would solve the problem that Michael described.
There would be a new `Pipes.Parse.Fast` internal module (for
implementers only) with two new types. One of them
would be a more efficient analog of `Lens'`. Let's call it
`LensFast'` for lack of a better name.
I will omit the implementation of `LensFast'` since I'm still
playing with different ideas, but I can describe its interface.
First, it would form a category, like `Lens'`es:
(.) :: LensFast' a b -> LensFast' b c -> LensFast' a c
id :: LensFast' a a
... and there would be a functor from `LensFast'` to `Lens'`:
optimal :: LensFast' a b -> Lens' a b
... that obeys these functor laws:
optimal (f . g) = optimal f . optimal g
optimal id = id
There would also be a more efficient parser type, which I will
temporarily call `ParserFast`. This representation I've already
nailed down and it would be equivalent to a free monad transformer
over the `draw/unDraw` primitive operations, with an optional bind
in the base monad for efficiency:
data ParserFast a m r
    = Pure r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)
instance Monad m => Monad (ParserFast a m) where ...
instance MonadTrans (ParserFast a) where ...
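To make the sketch concrete, here is one way those instances could
look, together with a toy interpreter for testing. This is a hedged
sketch, not the eventual pipes-parse implementation: the `Pure`
constructor is an assumption of mine (a free monad transformer needs a
constructor to represent `return`), and `drawF` and `feed` are
hypothetical names introduced only for this example:

```haskell
import Data.Functor.Identity (Identity (..))

-- Sketch only: `Pure` is an assumed extra constructor, needed so that
-- the type can represent `return`.
data ParserFast a m r
    = Pure r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)

instance Monad m => Functor (ParserFast a m) where
    fmap f = go
      where
        go (Pure r)     = Pure (f r)
        go (M m)        = M (fmap go m)
        go (Draw k)     = Draw (go . k)
        go (UnDraw a p) = UnDraw a (go p)

instance Monad m => Applicative (ParserFast a m) where
    pure = Pure
    pf <*> px = pf >>= \f -> fmap f px

instance Monad m => Monad (ParserFast a m) where
    return = pure
    Pure r     >>= f = f r
    M m        >>= f = M (fmap (>>= f) m)
    Draw k     >>= f = Draw ((>>= f) . k)
    UnDraw a p >>= f = UnDraw a (p >>= f)

-- The MonadTrans lift would be:  lift = M . fmap Pure

-- Hypothetical primitive: request one element (Nothing = end of input).
drawF :: ParserFast a m (Maybe a)
drawF = Draw Pure

-- Toy interpreter for testing the sketch: run a parser against a list
-- of inputs, returning the result and the leftover elements.
feed :: Monad m => [a] -> ParserFast a m r -> m (r, [a])
feed xs     (Pure r)     = return (r, xs)
feed xs     (M m)        = m >>= feed xs
feed (y:ys) (Draw k)     = feed ys (k (Just y))
feed []     (Draw k)     = feed [] (k Nothing)
feed xs     (UnDraw a p) = feed (a:xs) p

main :: IO ()
main = do
    let p = do { x <- drawF; y <- drawF; return (x, y) }
        Identity (r, rest) = feed [1 :: Int, 2, 3] p
    print r     -- (Just 1,Just 2)
    print rest  -- [3]
```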
This would come with an `interpret` function that would be a monad
morphism from `ParserFast` to `Parser`:
interpret :: Monad m => ParserFast a m r -> Parser a m r
This would obey the following monad morphism laws:
interpret (do { x <- m; f x }) = do { x <- interpret m; interpret (f x) }

interpret (return x) = return x
... and monad morphism laws are also functor laws if you view
them from this perspective:
interpret . (f >=> g) = interpret . f >=> interpret . g
interpret . return = return
The last piece of the puzzle is a `zoom_` function designed to
work with optimized `LensFast'`s and `ParserFast`s. It would have
this type:
zoom_ :: Monad m => LensFast' a b -> ParserFast b m r -> ParserFast a m r
... and it would obey this equation:
zoom (optimal lensfast) (interpret parserfast)
    = interpret (zoom_ lensfast parserfast)
That equation would be a rewrite rule that turns the inefficient
`zoom` into the much more efficient `zoom_` that doesn't do
unnecessary round-trips or trigger quadratic time complexity.
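In GHC terms, that equation would become a rewrite rule along these
lines (a sketch only: the rule name is a placeholder, and the actual
rule would live in the internal module next to `zoom_`):

```haskell
{-# RULES
"zoom/optimal/interpret" forall l p.
    zoom (optimal l) (interpret p) = interpret (zoom_ l p)
  #-}
```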
The way you put it all together is that the default user-facing
API would define all lenses and parsers in terms of `optimal` and
`interpret`. i.e.:
exampleParser = interpret internalParserFast
exampleLens = optimal internalLensFast
Then if the user writes something like:
zoom exampleLens exampleParser
... then `exampleParser` and `exampleLens` would be written to
force inlining early, exposing the `interpret` and `optimal`
functions:
zoom (optimal internalLensFast) (interpret internalParserFast)
... then that would trigger the rewrite rule:
interpret (zoom_ internalLensFast internalParserFast)
... and now the code would be efficient and avoid the problems
Michael described.
You can optimize some other cases automatically, too. For
example, if the user connects two lenses, which are each
internally optimized:
lens1 = optimal internalLens1
lens2 = optimal internalLens2
parser = interpret internalParser
example = zoom (lens1 . lens2) parser
... and that is inlined:
example = zoom (optimal internalLens1 . optimal internalLens2)
               (interpret internalParser)
Then you can provide a rewrite rule using the first functor law
for `optimal` to rewrite the two lenses to a single optimized lens:
example = zoom (optimal (internalLens1 . internalLens2))
               (interpret internalParser)
Then that is now in the right form for the `zoom_` rewrite rule
and everything just works!
example = interpret (zoom_ (internalLens1 . internalLens2)
                           internalParser)
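Stated as a GHC rewrite rule, that first functor law for `optimal`
would look roughly like this sketch (placeholder rule name again):

```haskell
{-# RULES
"optimal/compose" forall f g.
    optimal f . optimal g = optimal (f . g)
  #-}
```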
So far so good! We can expect those rewrite rules to fire
reliably and generate efficient code.
The problematic part is where you `zoom` over multiple parsers
like this:
lens = optimal internalLens
parser1 = interpret internalParser1
parser2 = interpret internalParser2
example = zoom lens $ do
    x <- parser1
    y <- parser2
    return (x, y)
If you inline those you get:
example = zoom (optimal internalLens) $ do
    x <- interpret internalParser1
    y <- interpret internalParser2
    return (x, y)
In theory you could use the monad morphism laws for `interpret` to
factor out the `interpret`:
example = zoom (optimal internalLens) $ interpret $ do
    x <- internalParser1
    y <- internalParser2
    return (x, y)
Now that's in a form that can be optimized by the rewrite rule for
`zoom_`:
example = interpret $ zoom_ internalLens $ do
    x <- internalParser1
    y <- internalParser2
    return (x, y)
**HOWEVER** in practice I don't yet know a way to get the monad
morphism rewrite rules to factor out the `interpret` in all cases,
which is necessary to trigger the `zoom_` rewrite rule. This is
where Haskell's rewrite rule system falls flat on its face. If
anybody wants a longer explanation of why, I can elaborate more on
this.
In practice, I'm willing to put up with this inefficiency and just
explain to users how to work around this latter issue by either:
a) distributing the `zoom` over the parsers, like this:
example = do
    x <- zoom lens parser1
    y <- zoom lens parser2
    return (x, y)
Note that this distribution only works if the lens does not
violate the monad morphism laws for `zoom`. Fortunately, the
lenses that trigger this performance issue are precisely the same
lenses that satisfy the monad morphism laws for `zoom` (example:
the `decodeUtf8` lens). Conversely, the lenses that violate the
`zoom` monad morphism laws are the ones that give excellent
performance in conjunction with `zoom` (i.e. delimiting lenses
like `splitAt`).
b) Expose the internal lenses and parsers so that users can use
them directly rather than relying on rewrite rules.
I prefer approach (a) because then we can keep the internals hidden.
Note that this solution requires that `pipes-parse` provides its
own `zoom` function (which is really easy), or alternatively
defines the rewrite rule in terms of the `zoom` function from the
`lens-family` package. A rewrite rule in terms of the `zoom`
function from `lens` is out of the question since I don't want a
`lens` dependency.
Does that answer your question?
On 02/21/2014 03:37 AM, Michael Thompson wrote:
I was looking at Michael Snoyman's post on pipes parsers,
http://www.yesodweb.com/blog/2014/02/ideas-pipes-parse
and working through some of the reddit controversy
http://www.reddit.com/r/haskell/comments/1xmmtn/some_ideas_for_pipesparse/
trying to think whether it might matter to `pipes-text`.
I worked from the attempted pipes variant of his conduit
parser for a special archival format for a 'File' type.
Here is the result I will discuss below
https://gist.github.com/michaelt/9122379
The first form of the parser (`fileParser0`) was
not so bad when I used it on simple files.
Then I increased the length of the file to be parsed
into a succession of `Files` resulting in a 1M file.
(You can write a suitable file by uncommenting
the first line in `main`)
On 1M of data, the obvious ways of
formulating the parser in the pipes-parse style
seem to go wildly wrong. The extravagance of
my computer's response -- "swap space full", etc. --
may have something to do with OS X vs. ghc-7.6,
but they are obviously bad programs. I couldn't
get them to complete.
That is the result if I use either of `fileParser0` or
`fileParser1`.
If however I use `parse2`, where I limit
each use of `zoom utf8` by a prior
application of `PB.splitAt` (or break),
the 1M file is done in a half-second and
uses no significant memory.
Is there a mistake I am making, a mistake
in the implementation of something,
or some simple rule that can be stated
to avoid such things?
Also I wonder if the result of using e.g. `fileParser1` is as
pathological on other os's.
yours Michael
--
You received this message because you are subscribed to the
Google Groups "Haskell Pipes" group.
To unsubscribe from this group and stop receiving emails from
it, send an email to
[email protected]
<mailto:haskell-pipes%[email protected]>.
To post to this group, send email to
[email protected]
<mailto:[email protected]>.