----- Forwarded message from Kragen Javier Sitaker <[EMAIL PROTECTED]> -----
From: Kragen Javier Sitaker <[EMAIL PROTECTED]>
To: Amit Patel <[EMAIL PROTECTED]>
Subject: Re: PEGs

On Mon, Dec 01, 2008 at 12:06:09PM -0800, Amit Patel wrote:
> Hi Kragen,

Hi!  By the way, would you mind if I forwarded this to kragen-discuss?
If you would mind, I think I will edit some of my paragraphs out into
kragen-tol posts.

> Neat article!

Thanks!  I think it could use some work.

> On http://lambda-the-ultimate.org/node/3039, Josh Haberman thinks
> PEGs aren't as cool as I and everyone else think they are.  I don't
> have enough experience with them to know.

Thanks for the pointer --- that's a really interesting discussion.

> For Yapps, I wrote parts of the parser by hand, and it was enough to
> parse the grammar for Yapps, so after a few iterations I had Yapps
> producing its own parser.  That was fun :)

It makes it a lot more concise, doesn't it? ;)

> For parsers in general, I'm so happy that ANTLR and packrat got
> people interested again.  It seemed like the field was dead, even
> though there is so much more to do.

Yeah, for a long time we were sort of stuck in this world of
context-free grammars.

> Three areas I'm interested in, but don't plan to work on anytime
> soon:
>
> - Abstraction.  When I write a rule and code to handle a
>   comma-separated list of numbers, and then I write it all again for
>   a semicolon-separated list of CSS rules, and then I write it all
>   again for a period-separated list of names, I want some
>   abstraction rule for List(separator, element) that not only can
>   parse those lists, but produce the output value.

In the notation used in the DLS '07 OMeta paper, you could write this
as follows:

    list separator element ::= <apply element> <apply separator>
                               <list separator element>
                             | <apply element>;

Then you could apply it as e.g. <list ("," spacing) expr>.  The two
actual implementations of OMeta I have handy (the Cola one and the
JavaScript one) use different syntax; I assume they have `apply` but I
don't know where to find it.
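To make the List(separator, element) abstraction concrete outside of OMeta notation, here is a minimal sketch in Python using hand-rolled parser combinators (this is my own illustration, not Yapps or OMeta code; all names are made up).  Each parser takes (text, pos) and returns (value, newpos), or None on failure:

```python
def lit(s):
    """Match the literal string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def number(text, pos):
    """Match one or more digits and return them as an int."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    if end == pos:
        return None
    return int(text[pos:end]), end

def list_of(element, separator):
    """The List(separator, element) abstraction: parse
    element (separator element)* and return the values as a list."""
    def parse(text, pos):
        result = element(text, pos)
        if result is None:
            return None
        value, pos = result
        values = [value]
        while True:
            sep = separator(text, pos)
            if sep is None:
                break
            result = element(text, sep[1])
            if result is None:
                break
            value, pos = result
            values.append(value)
        return values, pos
    return parse

csv_numbers = list_of(number, lit(","))
print(csv_numbers("1,2,3", 0))  # -> ([1, 2, 3], 5)
```

The same list_of can then be reused with a ";"-separated CSS-rule parser or a "."-separated name parser, producing the output value each time, which is the point of the abstraction.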
> - Inheritance.  (A different form of abstraction)  It might be
>   useful to take a grammar and add to it and override existing rules
>   by a mechanism similar to inheritance in classes.  For example,
>   Vax C let you use $ in identifiers.  It'd be neat to take an
>   existing grammar for C, inherit from it, and redefine the Token
>   rule to also allow $.

OMeta does this too.  It does seem like this would be nice in general.
In Bicicleta, inheritance takes the place of function parameter
passing.  You could imagine using inheritance in OMeta if you only
needed one kind of list:

    meta VectorLang <: ListUser {
        separator ::= "," <spacing>;
        list_element ::= <expr>;
        expr ::= <list> | <term> ("+" | "-") <term>;
        ...
    }

OMeta also has a "foreign-call" mechanism that you can use to call
rules in another grammar, and that would let you have several kinds of
lists:

    meta OCamlTuple <: List {
        separator ::= "," <foreign OCaml spacing>;
        list_element ::= <foreign OCaml expr>;
    }
    meta OCamlList <: OCamlTuple {
        separator ::= ";" <foreign OCaml spacing>;
    }
    meta OCaml {
        expr ::= "[" OCamlList "]" | ...
    }

But without nesting of grammars for namespace management, that's going
to be hopelessly verbose.  LPEG seems like it might be a step in the
right direction, in that its primary concept is the expression, not
the grammar.

> - Libraries.  (This goes along with abstraction)  There are lots of
>   patterns that are pretty common and could be put into a library.
>   The List() rule is one of course, but nesting, whether () or []
>   or {}, or strings (single, double, triple quotes, escaping, etc.)
>   would be nice to put into a library.

OMeta doesn't seem to have much of a library yet, but the "foreign"
mechanism does allow for one.  Here's bs-ometa-js-compiler.txt from
the JavaScript implementation of OMeta; it inherits from their
JavaScript parser to support embedded expressions of a different type
(as it happens, OMeta grammars).
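The grammar-inheritance idea maps directly onto ordinary class inheritance when rules are methods.  A sketch in plain Python (hypothetical names, not OMeta): a subclass overrides only the identifier-character rule to allow $, as VAX C did, and every rule that calls it picks up the change through normal method dispatch:

```python
class CGrammar:
    def is_ident_char(self, ch):
        return ch.isalnum() or ch == "_"

    def token(self, text, pos):
        """Match an identifier: one or more identifier characters."""
        end = pos
        while end < len(text) and self.is_ident_char(text[end]):
            end += 1
        if end == pos:
            return None
        return text[pos:end], end

class VaxCGrammar(CGrammar):
    def is_ident_char(self, ch):       # the only rule we redefine
        return super().is_ident_char(ch) or ch == "$"

print(CGrammar().token("sys$open", 0))     # -> ('sys', 3)
print(VaxCGrammar().token("sys$open", 0))  # -> ('sys$open', 8)
```

OMeta's `<:` does essentially this for grammar rules, with the added wrinkle that overridden rules can also fall back to the parent via super-application.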
    ometa BSOMetaJSParser <: BSJSParser {
        srcElem = spaces foreign(BSOMetaParser, #grammar):r sc -> r
                | super(#srcElem)
    }
    ometa BSOMetaJSTranslator <: BSJSTranslator {
        Grammar = foreign(BSOMetaTranslator, #Grammar)
    }

(The recursion here is a bit confusing: it's a couple of OMeta
grammars, intended to be executed by an OMeta implementation written
in JavaScript, which extend a compiler intended to parse JavaScript,
so that it can compile JavaScript extended with embedded OMeta
grammars.)

You could imagine `foreign(stdlib, #dqstring)` or even
`foreign(stdlib, #string, '"')`.  But it would be a lot more useful if
you had a more convenient syntax:

    ometa Foo {
        from stdlib import string, dqstring;
        ... string ... dqstring ... stdlib.list("," spacing, expr) ...
    }

> - Better theory.  In its purest form, parser theory tells you
>   whether a string matched.  In practice, we want to extract parts
>   of it and compute functions of those parts.  This means the
>   grammars aa* and a*a, although equivalent in theory (because they
>   match the same set of strings), are actually different once you
>   take into account extracting data.  So I think the theorists
>   started with the wrong problem, and I'd like to see theory that
>   addresses the actual problem that parsers are used to solve.
>   Maybe it won't make any difference; I don't know.  But I have this
>   bad feeling about the foundations.

I found that aspect of the theory pretty off-putting when I first
learned it!  The upside, I guess, is that it's easier to prove
theorems and write algorithms about the simpler form, in which it
doesn't matter which parts match what.  But I'd like to see a more
comprehensive theory myself.  I doubt I'm up to inventing it --- I
haven't proved any theorems in quite a while.

> - Error handling.  (This goes along with better theory)  Parsers
>   tend to be fine when things go right, but I haven't seen any good
>   theory around handling errors.  In practice it seems that people
>   use heuristics and hacks.
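The aa*-versus-a*a point is easy to see concretely.  A quick illustration with Python regexes (backtracking rather than PEG semantics, but the point about captures is the same): both patterns accept exactly the strings a+, yet once you attach capture groups they decompose the input differently:

```python
import re

s = "aaaa"
m1 = re.fullmatch(r"(a)(a*)", s)  # aa*: first char, then the rest
m2 = re.fullmatch(r"(a*)(a)", s)  # a*a: the rest, then the last char
print(m1.groups())  # -> ('a', 'aaa')
print(m2.groups())  # -> ('aaa', 'a')
```

Language-equivalence theorems say nothing about this difference, which is exactly the complaint: the theory classifies grammars by the set of strings matched, while practice cares about the decomposition.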
This is kind of a problem --- it goes from mathematics into
psychology, because the mathematical aspect of the problem is then,
"How do you design an algorithm to do something unspecified when run
on some input string that can be anything at all?", which fails to
capture the important aspects of the problem.

The Smalltalk approach --- making it easier for the human to fix
errors rather than trying to make the computer smarter about handling
them --- seems to be fairly helpful here, as manifested in Eclipse and
the like.  If your human-machine feedback loop is tight enough, the
parser may not have to be very smart about what it does with parsing
errors.

> - Diff parsing.  (This is related to errors)  In theory, a parser is
>   run on a string, and then you stop.  But in practice, you run a
>   parser on a string and then run the parser on a modified string.
>   For example, if my input is "(if (foo) bar)" and I change it to
>   "(if (foo bar)", then by looking at the change I made in the
>   editor, you can tell that the error is likely to be that foo
>   should have ')' after it.  But the parser, having forgotten the
>   previous version of the string, will tell me the error is
>   "missing )" at the very end of the string.  The history can tell
>   you something about where the error occurred.  Looking at it
>   another way, suppose you have the last successful parse, and you
>   know which parts of the parse tree correspond to the parts of the
>   input string.  When the input string changes, try to preserve the
>   parse tree to the left of and to the right of the input string
>   changes.  That will give you a subtree that you expect errors to
>   occur within, and the missing ')' likely should be there, not at
>   the very end (rightmost leaf of the tree).

A FOAF is working on a PEG parser for use in interactive editors with
this approach in mind: he wants to invalidate parse tree nodes that
span whatever edit you're making, then attempt to recompute them.
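A sketch of just the invalidation step (my own illustration, not the FOAF's parser): suppose the packrat memo table is keyed by (rule, start) and each entry records how far the parse reached, so an edit can discard exactly the results whose spans overlap it:

```python
def invalidate(memo, edit_start, edit_end):
    """Drop memoized results whose [start, end) span touches the edit."""
    for (rule, start), (value, end) in list(memo.items()):
        if start < edit_end and end > edit_start:
            del memo[(rule, start)]

memo = {("expr", 0): ("tree-a", 5),
        ("expr", 5): ("tree-b", 9),
        ("atom", 7): ("leaf", 8)}
invalidate(memo, 5, 6)          # an edit between positions 5 and 6
print(sorted(memo))             # -> [('atom', 7), ('expr', 0)]
```

A real implementation would also have to rebase the positions of surviving entries to the right of the edit by the size delta, and then reparse only the invalidated region; that bookkeeping is omitted here.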
The idea seems like it should fit really well with packrat parsers.

Something that's occurred to me about packrat parsers and their space
usage is that you can actually throw away a lot of information safely.

Observation 1: if you're not working with an interactive parser like
the one you're describing above, but rather the traditional kind that
parses from the beginning of an input string to its end, you can throw
away any memoized results that begin further to the left in the input
string than the deepest backtrack point on the backtracking stack; and
if you implement constructs like (...)* using only a single element on
the backtracking stack (rather than the naive approach of transforming
them into a recursive rule), that should get you quite far.

Observation 2: the vast majority of parse tree nodes are very small,
and so keeping them memoized doesn't save very much time.  Memoizing
all of them may save you a fair bit of time over memoizing none of
them, but as they get older they become increasingly unlikely to be
used in a backtrack, since any backtrack will use the memoized result
of a larger expression they're a member of, rather than the small
atomic expression itself.  It's probably not possible to guarantee
that a memoized result won't be needed in practical cases, but
heuristically you could probably throw away small memoized results
unless they're pretty recent.

One approach to implementing this, which I've also been thinking of
for maintaining Bicicleta intermediate results, would be to use a
value-weak hashtable for the memoization table, along with a
generational garbage collector.  Any value that is needed again while
it's still in the nursery will not need to be recomputed, but most
values will be reaped, because at the time of the next minor
collection, the only reference to them will be in the value-weak
hashtable.  A few won't be --- the ones whose values are currently
referenced from some parsing rule currently being tried --- and those
will be promoted.
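The value-weak memo table is easy to try in Python, though CPython's reference counting reclaims entries immediately rather than at the next minor collection, so the nursery-based retention bias described above doesn't apply; this sketch shows only the basic mechanism:

```python
import weakref

class ParseResult:                 # results must be weakly referenceable
    def __init__(self, value, end):
        self.value, self.end = value, end

memo = weakref.WeakValueDictionary()

def memoized(rule, pos, compute):
    """Return the memoized result for (rule, pos), recomputing if the
    weak table has already dropped it."""
    r = memo.get((rule, pos))
    if r is None:
        r = compute()
        memo[(rule, pos)] = r
    return r

held = memoized("expr", 0, lambda: ParseResult("tree", 5))
print(("expr", 0) in memo)  # -> True: a strong reference is still live
del held                    # drop the last strong reference
print(("expr", 0) in memo)  # -> False in CPython: the entry is reaped
```

With a generational collector instead, the entry would survive until the next minor collection, which is what gives recently computed results their grace period.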
That sounds like a pretty random approach to deciding which results to
keep around longer, but I think the randomness might be biased in a
useful way:

- some results are referenced much more frequently than others; they
  will be proportionally more likely to be on the stack at GC time,
  and they are also proportionally more likely to save time in the
  future.

- the results that have been returned to higher levels of the stack
  (covering, presumably, a larger portion of the source text) stick
  around on the stack exponentially longer than the results that have
  been returned to a rule that encompasses just a few leaf nodes,
  giving them exponentially more chance of being retained.

- results that are too old to be useful in the future (as in
  Observation 1) are guaranteed not to be referenced by the stack.

I think this design may be useful for laziness in general, but I
haven't tried it yet, despite having come up with it a year or two
ago.

Kragen

----- End forwarded message -----
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss
