So the one part of Michael's post that was correct was that each use of
`zoom` adds a constant O(1) overhead to every subsequent command. The
worst case scenario is a quadratic blowup for code like this:
    forever $ do
        a <- zoom lens parser
        doSomethingWith a
Every iteration of `forever` will accumulate an additional O(1)
overhead, so by the Nth iteration, you get O(N) overhead per iteration,
and the total cost of N iterations is O(N^2). This is what was
happening in your code since your code is basically of the form:
    replicateM_ n $ do
        a <- zoom lens parser
        doSomethingWith a
... so that triggers O(N^2) space and time complexity, where `n` in your
example was `10000`.
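The arithmetic behind that claim can be sanity-checked with a toy cost model (purely illustrative, not the actual pipes-parse internals; it assumes each accumulated `zoom` layer costs exactly one unit per step):

```haskell
-- Toy cost model of the quadratic blowup (an assumption for
-- illustration only).  By iteration k, k layers of `zoom` overhead
-- have accumulated, so iteration k pays k units.
costOfIteration :: Int -> Int
costOfIteration k = k

-- Total cost of n iterations: 1 + 2 + ... + n = n*(n+1)/2, i.e. O(n^2).
totalCost :: Int -> Int
totalCost n = sum (map costOfIteration [1 .. n])

main :: IO ()
main = print (totalCost 10000)  -- the n = 10000 case from the example
```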
I'm working on a solution to fix this and I will summarize it in broad
strokes here. This solution is backwards-compatible with the current
API and in most use cases will compile to code that is more efficient
(both in time complexity and constant factors). This would solve the
problem that Michael described.
There would be a new `Pipes.Parse.Fast` internal module (for
implementers only) containing two new types. One of them would be a
more efficient analog of `Lens'`. Let's call it `LensFast'` for lack of
a better name.
I will omit the implementation of `LensFast'` since I'm still playing
with different ideas, but I can describe its interface. First, it would
form a category, like `Lens'`es:
    (.) :: LensFast' a b -> LensFast' b c -> LensFast' a c
    id  :: LensFast' a a
... and there would be a functor from `LensFast'` to `Lens'`:
    optimal :: LensFast' a b -> Lens' a b
... that obeys these functor laws:
    optimal (f . g) = optimal f . optimal g
    optimal id      = id
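For concreteness, here is one plausible sketch of such a type (an assumption on my part, since the real representation is still undecided): a plain getter/setter pair, with `optimal` building a van Laarhoven lens from it. The names `compFast`, `idFast`, `lfst`, and `over` are mine, for illustration only.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Functor.Identity (Identity (..))

-- Hypothetical representation: a getter paired with a setter.
data LensFast' a b = LensFast' (a -> b) (a -> b -> a)

-- Category-style composition, matching the (.) signature above.
compFast :: LensFast' a b -> LensFast' b c -> LensFast' a c
compFast (LensFast' g1 s1) (LensFast' g2 s2) =
    LensFast' (g2 . g1) (\a c -> s1 a (s2 (g1 a) c))

idFast :: LensFast' a a
idFast = LensFast' id (\_ b -> b)

-- van Laarhoven lens, as in lens-family
type Lens' a b = forall f. Functor f => (b -> f b) -> a -> f a

-- The functor from LensFast' to Lens'
optimal :: LensFast' a b -> Lens' a b
optimal (LensFast' g s) k a = fmap (s a) (k (g a))

-- A sample lens and a helper to exercise the result
lfst :: LensFast' (a, b) a
lfst = LensFast' fst (\(_, y) x -> (x, y))

over :: Lens' a b -> (b -> b) -> a -> a
over l f = runIdentity . l (Identity . f)

main :: IO ()
main = print
    (over (optimal (compFast lfst lfst)) (+ 1) ((1, 2), 3)
        :: ((Int, Int), Int))
```

With this representation the functor laws hold definitionally, since `optimal (compFast f g)` and `optimal f . optimal g` unfold to the same getter and setter.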
There would also be a more efficient parser type, which I will
temporarily call `ParserFast`. I've already nailed down this
representation: it would be equivalent to a free monad transformer over
the `draw`/`unDraw` primitive operations, with an optional bind in the
base monad for efficiency:
    data ParserFast a m r
        = Return r
        | M (m (ParserFast a m r))
        | Draw (Maybe a -> ParserFast a m r)
        | UnDraw a (ParserFast a m r)

    instance Monad m => Monad (ParserFast a m) where ...

    instance MonadTrans (ParserFast a) where ...
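As a self-contained sketch, the instances would look like this (the `feed` interpreter is a toy of my own invention so the block can run; it is not part of the proposed API):

```haskell
import Control.Monad (ap, liftM)
import Control.Monad.Trans.Class (MonadTrans (..))
import Data.Functor.Identity (Identity (..))

-- Free monad transformer over the draw/unDraw primitives,
-- with an optional bind in the base monad (the M constructor).
data ParserFast a m r
    = Return r
    | M (m (ParserFast a m r))
    | Draw (Maybe a -> ParserFast a m r)
    | UnDraw a (ParserFast a m r)

instance Monad m => Functor (ParserFast a m) where
    fmap = liftM

instance Monad m => Applicative (ParserFast a m) where
    pure  = Return
    (<*>) = ap

instance Monad m => Monad (ParserFast a m) where
    Return r   >>= f = f r
    M m        >>= f = M (fmap (>>= f) m)
    Draw k     >>= f = Draw (\ma -> k ma >>= f)
    UnDraw a p >>= f = UnDraw a (p >>= f)

instance MonadTrans (ParserFast a) where
    lift m = M (fmap Return m)

-- Smart constructors for the primitives
draw :: ParserFast a m (Maybe a)
draw = Draw Return

unDraw :: a -> ParserFast a m ()
unDraw a = UnDraw a (Return ())

-- Toy interpreter: run a parser against a pure list of inputs.
feed :: Monad m => [a] -> ParserFast a m r -> m r
feed _        (Return r)   = return r
feed xs       (M m)        = m >>= feed xs
feed []       (Draw k)     = feed [] (k Nothing)
feed (x : xs) (Draw k)     = feed xs (k (Just x))
feed xs       (UnDraw a p) = feed (a : xs) p

main :: IO ()
main = print (runIdentity (feed "ab" ((,) <$> draw <*> draw)))
```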
This would come with an `interpret` function that would be a monad
morphism from `ParserFast` to `Parser`:
    interpret :: Monad m => ParserFast a m r -> Parser a m r
This would obey the following monad morphism laws:
    interpret (m >>= f)  = interpret m >>= (interpret . f)
    interpret (return x) = return x
... and monad morphism laws are also functor laws if you view them from
this perspective:
    interpret . (f >=> g) = interpret . f >=> interpret . g
    interpret . return    = return
The last piece of the puzzle is a `zoom_` function designed to work with
optimized `LensFast'`s and `ParserFast`s. It would have this type:
    zoom_ :: Monad m => LensFast' a b -> ParserFast b m r -> ParserFast a m r
... and it would obey this equation:
    zoom (optimal lensfast) (interpret parserfast)
        = interpret (zoom_ lensfast parserfast)
That equation would be a rewrite rule that turns the inefficient `zoom`
into the much more efficient `zoom_` that doesn't do unnecessary
round-trips or trigger quadratic time complexity.
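In GHC terms, that equation would presumably be written as a RULES pragma along these lines (a sketch; the rule name and placement are my guesses):

```haskell
{-# RULES
"pipes-parse zoom/optimal/interpret" forall l p.
    zoom (optimal l) (interpret p) = interpret (zoom_ l p)
  #-}
```

GHC only fires a rule like this when `zoom`, `optimal`, and `interpret` are still syntactically visible at the call site, which is why the user-facing definitions below must inline early.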
The way you put it all together is that the default user-facing API
would define all lenses and parsers in terms of `optimal` and
`interpret`. i.e.:
    exampleParser = interpret internalParserFast
    exampleLens   = optimal internalLensFast
Then if the user writes something like:
    zoom exampleLens exampleParser
... then `exampleParser` and `exampleLens` would be written to force
inlining early, exposing the `interpret` and `optimal` functions:
    zoom (optimal internalLensFast) (interpret internalParserFast)
... then that would trigger the rewrite rule:
    interpret (zoom_ internalLensFast internalParserFast)
... and now the code would be efficient and avoid the problems Michael
described.
You can optimize some other cases automatically, too. For example,
suppose the user composes two lenses, each internally optimized:
    lens1  = optimal internalLens1
    lens2  = optimal internalLens2
    parser = interpret internalParser

    example = zoom (lens1 . lens2) parser
... and that is inlined:
    example = zoom (optimal internalLens1 . optimal internalLens2)
                   (interpret internalParser)
Then you can provide a rewrite rule using the first functor law for
`optimal` to rewrite the two lenses to a single optimized lens:
    example = zoom (optimal (internalLens1 . internalLens2))
                   (interpret internalParser)
Then that is now in the right form for the `zoom_` rewrite rule and
everything just works!
    example = interpret (zoom_ (internalLens1 . internalLens2) internalParser)
So far so good! We can expect those rewrite rules to fire reliably and
generate efficient code.
The problematic part is where you `zoom` over multiple parsers like this:
    lens    = optimal internalLens
    parser1 = interpret internalParser1
    parser2 = interpret internalParser2

    example = zoom lens $ do
        x <- parser1
        y <- parser2
        return (x, y)
If you inline those you get:
    example = zoom (optimal internalLens) $ do
        x <- interpret internalParser1
        y <- interpret internalParser2
        return (x, y)
In theory you could use the monad morphism laws for `interpret` to
factor out the `interpret`:
    example = zoom (optimal internalLens) $ interpret $ do
        x <- internalParser1
        y <- internalParser2
        return (x, y)
Now that's in a form that can be optimized by the rewrite rule for `zoom_`:
    example = interpret $ zoom_ internalLens $ do
        x <- internalParser1
        y <- internalParser2
        return (x, y)
**HOWEVER** in practice I don't yet know a way to get the monad morphism
rewrite rules to factor out the `interpret` in all cases, which is
necessary to trigger the `zoom_` rewrite rule. This is where Haskell's
rewrite rule system falls flat on its face. If anybody wants a longer
explanation of why, I can elaborate more on this.
In practice, I'm willing to put up with this inefficiency and just
explain to users how to work around this latter issue by either:
a) distributing the `zoom` over the parsers, like this:
    example = do
        x <- zoom lens parser1
        y <- zoom lens parser2
        return (x, y)
Note that this distribution only works if the lens does not violate the
monad morphism laws for `zoom`. Fortunately, the lenses that trigger
this performance issue are precisely the same lenses that satisfy the
monad morphism laws for `zoom` (for example, the `decodeUtf8` lens).
Conversely, the lenses that violate the `zoom` monad morphism laws are
the ones that give excellent performance in conjunction with `zoom`
(i.e. delimiting lenses like `splitAt`).
b) exposing the internal lenses and parsers so that users can use them
directly rather than relying on rewrite rules.
I prefer approach (a) because then we can keep the internals hidden.
Note that this solution requires that `pipes-parse` provides its own
`zoom` function (which is really easy), or alternatively defines the
rewrite rule in terms of the `zoom` function from the `lens-family`
package. A rewrite rule in terms of the `zoom` function from `lens` is
out of the question since I don't want a `lens` dependency.
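For what it's worth, here is roughly what a self-provided `zoom` for van Laarhoven lenses looks like (a sketch modeled on the lens-family-state approach; `Focusing` and `_1` are illustrative names, and I'm using `StateT` directly in place of pipes-parse's `Parser`):

```haskell
{-# LANGUAGE RankNTypes #-}

import Control.Monad.Trans.State (StateT (..), modify, runStateT)
import Data.Functor.Identity (Identity (..))

type Lens' s t = forall f. Functor f => (t -> f t) -> s -> f s

-- Pairs the action's result r with the focused state slot; Functor
-- maps over the state slot, which is what lets a lens thread it.
newtype Focusing m r t = Focusing { unfocusing :: m (r, t) }

instance Functor m => Functor (Focusing m r) where
    fmap f (Focusing m) = Focusing (fmap (\(r, t) -> (r, f t)) m)

-- Run a stateful action on the sub-state selected by the lens.
zoom :: Monad m => Lens' s t -> StateT t m r -> StateT s m r
zoom l (StateT f) = StateT (unfocusing . l (Focusing . f))

-- A sample lens onto the first component of a pair
_1 :: Lens' (a, b) a
_1 k (x, y) = fmap (\x' -> (x', y)) (k x)

main :: IO ()
main = print
    (runIdentity (runStateT (zoom _1 (modify (+ 1))) (1 :: Int, "s")))
```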
Does that answer your question?
On 02/21/2014 03:37 AM, Michael Thompson wrote:
I was looking at Michael Snoyman's post on pipes parsers,
http://www.yesodweb.com/blog/2014/02/ideas-pipes-parse
and working through some of the reddit controversy
http://www.reddit.com/r/haskell/comments/1xmmtn/some_ideas_for_pipesparse/
trying to think if it might matter to `pipes-text`.
I worked from the attempted pipes variant of his conduit
parser for a special archival format for a 'File' type.
Here is the result I will discuss below
https://gist.github.com/michaelt/9122379
The first form of the parser (`fileParser0`) was
not so bad when I used it on simple files.
Then I increased the length of the file to be parsed
into a succession of `Files` resulting in a 1M file.
(You can write a suitable file by uncommenting
the first line in `main`)
On 1M of data, the obvious ways of
formulating the parser in the pipes-parse style
seem to go wildly wrong. The extravagance of
my computer's response -- "swap space full", etc --
may have something to do with os x vs. ghc-7.6
but they are obviously bad programs. I couldn't
get them to complete.
That is the result if I use either of `fileParser0` or
`fileParser1`.
If however I use `parse2`, where I limit
each use of `zoom utf8` by a prior
application of `PB.splitAt` (or break)
the 1M file is done in a half-second and
uses no significant memory.
Is there a mistake I am making, a mistake
in the implementation of something,
or some simple rule that can be stated
to avoid such things?
Also I wonder if the result of using e.g `fileParser1` is as
pathological on other os's.
yours Michael
--
You received this message because you are subscribed to the Google
Groups "Haskell Pipes" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To post to this group, send email to [email protected].