----- Forwarded message from Kragen Javier Sitaker <[EMAIL PROTECTED]> -----
From: Kragen Javier Sitaker <[EMAIL PROTECTED]>
To: Amit Patel <[EMAIL PROTECTED]>
Subject: Re: PEGs

On Mon, Dec 01, 2008 at 12:06:09PM -0800, Amit Patel wrote:
> Hi Kragen,

Hi!  By the way, would you mind if I forwarded this to kragen-discuss?
If you would mind, I think I will edit some of my paragraphs out into
kragen-tol posts.

> Neat article!

Thanks!  I think it could use some work.

> On http://lambda-the-ultimate.org/node/3039, Josh Haberman thinks
> PEGs aren't as cool as I and everyone else think they are.  I don't
> have enough experience with them to know.

Thanks for the pointer --- that's a really interesting discussion.

> For Yapps, I wrote parts of the parser by hand, and it was enough to
> parse the grammar for Yapps, so after a few iterations I had Yapps
> producing its own parser.  That was fun :)

It makes it a lot more concise, doesn't it? ;)

> For parsers in general, I'm so happy that ANTLR and packrat got
> people interested again.  It seemed like the field was dead, even
> though there is so much more to do.

Yeah, for a long time we were sort of stuck in this world of
context-free grammars.

> Three areas I'm interested in, but don't plan to work on anytime
> soon:
>
> - Abstraction.  When I write a rule and code to handle a
>   comma-separated list of numbers, and then I write it all again for
>   a semicolon-separated list of CSS rules, and then I write it all
>   again for a period-separated list of names, I want some
>   abstraction rule for List(separator, element) that not only can
>   parse those lists, but produce the output value.

In the notation used in the DLS '07 OMeta paper, you could write this
as follows:

    list separator element ::= <apply element> <apply separator>
                               <list separator element>
                             | <apply element>;

Then you could apply it as e.g. <list ("," spacing) expr>.  The two
actual implementations of OMeta I have handy (the Cola one and the
JavaScript one) use different syntax; I assume they have `apply` but I
don't know where to find it.
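To make the List(separator, element) abstraction concrete outside of OMeta notation, here is a minimal sketch in Python using hand-rolled parser combinators (this is my own illustration, not Yapps or OMeta code; all names are made up).  Each parser takes (text, pos) and returns (value, newpos), or None on failure:

```python
def lit(s):
    """Match the literal string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def number(text, pos):
    """Match one or more digits and return them as an int."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    if end == pos:
        return None
    return int(text[pos:end]), end

def list_of(element, separator):
    """The List(separator, element) abstraction: parse
    element (separator element)* and return the values as a list."""
    def parse(text, pos):
        result = element(text, pos)
        if result is None:
            return None
        value, pos = result
        values = [value]
        while True:
            sep = separator(text, pos)
            if sep is None:
                break
            result = element(text, sep[1])
            if result is None:
                break
            value, pos = result
            values.append(value)
        return values, pos
    return parse

csv_numbers = list_of(number, lit(","))
print(csv_numbers("1,2,3", 0))  # -> ([1, 2, 3], 5)
```

The same list_of can then be reused with a ";"-separated CSS-rule parser or a "."-separated name parser, producing the output value each time, which is the point of the abstraction.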
> - Inheritance.  (A different form of abstraction)  It might be
>   useful to take a grammar and add to it and override existing rules
>   by a mechanism similar to inheritance in classes.  For example,
>   Vax C let you use $ in identifiers.  It'd be neat to take an
>   existing grammar for C, inherit from it, and redefine the Token
>   rule to also allow $.

OMeta does this too.  It does seem like this would be nice in general.
In Bicicleta, inheritance takes the place of function parameter
passing.  You could imagine using inheritance in OMeta if you only
needed one kind of list:

    meta VectorLang <: ListUser {
        separator ::= "," <spacing>;
        list_element ::= <expr>;
        expr ::= <list> | <term> ("+" | "-") <term>;
        ...
    }

OMeta also has a "foreign-call" mechanism that you can use to call
rules in another grammar, and that would let you have several kinds of
lists:

    meta OCamlTuple <: List {
        separator ::= "," <foreign OCaml spacing>;
        list_element ::= <foreign OCaml expr>;
    }
    meta OCamlList <: OCamlTuple {
        separator ::= ";" <foreign OCaml spacing>;
    }
    meta OCaml {
        expr ::= "[" OCamlList "]" | ...
    }

But without nesting of grammars for namespace management, that's going
to be hopelessly verbose.  LPEG seems like it might be a step in the
right direction, in that its primary concept is the expression, not
the grammar.

> - Libraries.  (This goes along with abstraction)  There are lots of
>   patterns that are pretty common and could be put into a library.
>   The List() rule is one of course, but nesting, whether () or []
>   or {}, or strings (single, double, triple quotes, escaping, etc.)
>   would be nice to put into a library.

OMeta doesn't seem to have much of a library yet, but the "foreign"
mechanism does allow for one.  Here's bs-ometa-js-compiler.txt from
the JavaScript implementation of OMeta; it inherits from their
JavaScript parser to support embedded expressions of a different type
(as it happens, OMeta grammars).
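The grammar-inheritance idea maps directly onto ordinary class inheritance when rules are methods.  A sketch in plain Python (hypothetical names, not OMeta): a subclass overrides only the identifier-character rule to allow $, as VAX C did, and every rule that calls it picks up the change through normal method dispatch:

```python
class CGrammar:
    def is_ident_char(self, ch):
        return ch.isalnum() or ch == "_"

    def token(self, text, pos):
        """Match an identifier: one or more identifier characters."""
        end = pos
        while end < len(text) and self.is_ident_char(text[end]):
            end += 1
        if end == pos:
            return None
        return text[pos:end], end

class VaxCGrammar(CGrammar):
    def is_ident_char(self, ch):       # the only rule we redefine
        return super().is_ident_char(ch) or ch == "$"

print(CGrammar().token("sys$open", 0))     # -> ('sys', 3)
print(VaxCGrammar().token("sys$open", 0))  # -> ('sys$open', 8)
```

OMeta's `<:` does essentially this for grammar rules, with the added wrinkle that overridden rules can also fall back to the parent via super-application.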
    ometa BSOMetaJSParser <: BSJSParser {
        srcElem = spaces foreign(BSOMetaParser, #grammar):r sc -> r
                | super(#srcElem)
    }
    ometa BSOMetaJSTranslator <: BSJSTranslator {
        Grammar = foreign(BSOMetaTranslator, #Grammar)
    }

(The recursion here is a bit confusing: it's a couple of OMeta
grammars, intended to be executed by an OMeta implementation written
in JavaScript, which extend a compiler intended to parse JavaScript,
so that it can compile JavaScript extended with embedded OMeta
grammars.)

You could imagine `foreign(stdlib, #dqstring)` or even
`foreign(stdlib, #string, '"')`.  But it would be a lot more useful if
you had a more convenient syntax:

    ometa Foo {
        from stdlib import string, dqstring;
        ... string ... dqstring ... stdlib.list("," spacing, expr) ...
    }

> - Better theory.  In its purest form, parser theory tells you
>   whether a string matched.  In practice, we want to extract parts
>   of it and compute functions of those parts.  This means the
>   grammars aa* and a*a, although equivalent in theory (because they
>   match the same set of strings), are actually different once you
>   take into account extracting data.  So I think the theorists
>   started with the wrong problem, and I'd like to see theory that
>   addresses the actual problem that parsers are used to solve.
>   Maybe it won't make any difference; I don't know.  But I have this
>   bad feeling about the foundations.

I found that aspect of the theory pretty off-putting when I first
learned it!  The upside, I guess, is that it's easier to prove
theorems and write algorithms about the simpler form, in which it
doesn't matter which parts match what.  But I'd like to see a more
comprehensive theory myself.  I doubt I'm up to inventing it --- I
haven't proved any theorems in quite a while.

> - Error handling.  (This goes along with better theory)  Parsers
>   tend to be fine when things go right, but I haven't seen any good
>   theory around handling errors.  In practice it seems that people
>   use heuristics and hacks.
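The aa*-versus-a*a point is easy to see concretely.  A quick illustration with Python regexes (backtracking rather than PEG semantics, but the point about captures is the same): both patterns accept exactly the strings a+, yet once you attach capture groups they decompose the input differently:

```python
import re

s = "aaaa"
m1 = re.fullmatch(r"(a)(a*)", s)  # aa*: first char, then the rest
m2 = re.fullmatch(r"(a*)(a)", s)  # a*a: the rest, then the last char
print(m1.groups())  # -> ('a', 'aaa')
print(m2.groups())  # -> ('aaa', 'a')
```

Language-equivalence theorems say nothing about this difference, which is exactly the complaint: the theory classifies grammars by the set of strings matched, while practice cares about the decomposition.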
This is kind of a problem --- it goes from mathematics into
psychology, because the mathematical aspect of the problem is then,
"How do you design an algorithm to do something unspecified when run
on some input string that can be anything at all?", which fails to
capture the important aspects of the problem.

The Smalltalk approach --- making it easier for the human to fix
errors rather than trying to make the computer smarter about handling
them --- seems to be fairly helpful here, as manifested in Eclipse and
the like.  If your human-machine feedback loop is tight enough, the
parser may not have to be very smart about what it does with parsing
errors.

> - Diff parsing.  (This is related to errors)  In theory, a parser is
>   run on a string, and then you stop.  But in practice, you run a
>   parser on a string and then run the parser on a modified string.
>   For example, if my input is "(if (foo) bar)" and I change it to
>   "(if (foo bar)", then by looking at the change I made in the
>   editor, you can tell that the error is likely to be that foo
>   should have ')' after it.  But the parser, having forgotten the
>   previous version of the string, will tell me the error is
>   "missing )" at the very end of the string.  The history can tell
>   you something about where the error occurred.  Looking at it
>   another way, suppose you have the last successful parse, and you
>   know which parts of the parse tree correspond to the parts of the
>   input string.  When the input string changes, try to preserve the
>   parse tree to the left of and to the right of the input string
>   changes.  That will give you a subtree that you expect errors to
>   occur within, and the missing ')' likely should be there, not at
>   the very end (rightmost leaf of the tree).

A FOAF is working on a PEG parser for use in interactive editors with
this approach in mind: he wants to invalidate parse tree nodes that
span whatever edit you're making, then attempt to recompute them.
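A sketch of just the invalidation step (my own illustration, not the FOAF's parser): suppose the packrat memo table is keyed by (rule, start) and each entry records how far the parse reached, so an edit can discard exactly the results whose spans overlap it:

```python
def invalidate(memo, edit_start, edit_end):
    """Drop memoized results whose [start, end) span touches the edit."""
    for (rule, start), (value, end) in list(memo.items()):
        if start < edit_end and end > edit_start:
            del memo[(rule, start)]

memo = {("expr", 0): ("tree-a", 5),
        ("expr", 5): ("tree-b", 9),
        ("atom", 7): ("leaf", 8)}
invalidate(memo, 5, 6)          # an edit between positions 5 and 6
print(sorted(memo))             # -> [('atom', 7), ('expr', 0)]
```

A real implementation would also have to rebase the positions of surviving entries to the right of the edit by the size delta, and then reparse only the invalidated region; that bookkeeping is omitted here.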
The idea seems like it should fit really well with packrat parsers.

Something that's occurred to me about packrat parsers and their space
usage is that you can actually throw away a lot of information safely.

Observation 1: if you're not working with an interactive parser like
the one you're describing above, but rather the traditional kind that
parses from the beginning of an input string to its end, you can throw
away any memoized results that begin further to the left in the input
string than the deepest backtrack point on the backtracking stack; and
if you implement constructs like (...)* using only a single element on
the backtracking stack (rather than the naive approach of transforming
them into a recursive rule), that should get you quite far.

Observation 2: the vast majority of parse tree nodes are very small,
and so keeping them memoized doesn't save very much time.  Memoizing
all of them may save you a fair bit of time over memoizing none of
them, but as they get older they become increasingly unlikely to be
used in a backtrack, since any backtrack will use the memoized result
of a larger expression they're a member of, rather than the small
atomic expression itself.  It's probably not possible to guarantee
that a memoized result won't be needed in practical cases, but
heuristically you could probably throw away small memoized results
unless they're pretty recent.

One approach to implementing this, which I've also been thinking of
for maintaining Bicicleta intermediate results, would be to use a
value-weak hashtable for the memoization table, along with a
generational garbage collector.  Any value that is needed again while
it's still in the nursery will not need to be recomputed, but most
values will be reaped, because at the time of the next minor
collection, the only reference to them will be in the value-weak
hashtable.  A few won't be --- the ones whose values are currently
referenced from some parsing rule currently being tried --- and those
will be promoted.
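The value-weak memo table is easy to try in Python, though CPython's reference counting reclaims entries immediately rather than at the next minor collection, so the nursery-based retention bias described above doesn't apply; this sketch shows only the basic mechanism:

```python
import weakref

class ParseResult:                 # results must be weakly referenceable
    def __init__(self, value, end):
        self.value, self.end = value, end

memo = weakref.WeakValueDictionary()

def memoized(rule, pos, compute):
    """Return the memoized result for (rule, pos), recomputing if the
    weak table has already dropped it."""
    r = memo.get((rule, pos))
    if r is None:
        r = compute()
        memo[(rule, pos)] = r
    return r

held = memoized("expr", 0, lambda: ParseResult("tree", 5))
print(("expr", 0) in memo)  # -> True: a strong reference is still live
del held                    # drop the last strong reference
print(("expr", 0) in memo)  # -> False in CPython: the entry is reaped
```

With a generational collector instead, the entry would survive until the next minor collection, which is what gives recently computed results their grace period.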
That sounds like a pretty random approach to deciding which results to
keep around longer, but I think the randomness might be biased in a
useful way:

- some results are referenced much more frequently than others; they
  will be proportionally more likely to be on the stack at GC time,
  and they are also proportionally more likely to save time in the
  future.

- the results that have been returned to higher levels of the stack
  (covering, presumably, a larger portion of the source text) stick
  around on the stack exponentially longer than the results that have
  been returned to a rule that encompasses just a few leaf nodes,
  giving them exponentially more chance of being retained.

- results that are too old to be useful in the future (as in
  Observation 1) are guaranteed not to be referenced by the stack.

I think this design may be useful for laziness in general, but I
haven't tried it yet, despite having come up with it a year or two
ago.

Kragen

----- End forwarded message -----
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss
