Peter B. West wrote:
> With my naive understanding of parsing as a two-stage process (lexemes
> -> higher level constructs) I have been curious about earlier comments
> of yours about multi-stage parsing. Can ANTLR do this sort of thing?
I'm not quite sure whether you mean the same by "parsing as a two-stage process" as I do. In language specs, the formal description is usually divided into a grammar level, representing a Chomsky level 2 (context-free) grammar, and a lexical level, described by simple regular expressions (Chomsky level 3 IIRC). This is done both to keep the spec readable and for efficient implementation: a context-free parser needs a stack, and while the common identifier token could be described by a context-free grammar, it's more efficient to use a regular expression and implement the recognizer as a DFA, which doesn't shuffle characters to and from the stack top like mad.
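For illustration, this is what a lexical-level recognizer implemented as a DFA boils down to in Java; a hand-written sketch, not generated code:

    public class IdentScanner {

        // Returns the length of the identifier starting at pos, or 0.
        // Two states: "expect first character", then "consume the rest".
        static int scanIdentifier(CharSequence in, int pos) {
            int i = pos;
            if (i >= in.length()
                    || !(Character.isLetter(in.charAt(i)) || in.charAt(i) == '_')) {
                return 0;
            }
            i++;
            while (i < in.length()
                    && (Character.isLetterOrDigit(in.charAt(i)) || in.charAt(i) == '_')) {
                i++;
            }
            return i - pos;
        }

        public static void main(String[] args) {
            System.out.println(scanIdentifier("serif", 0));  // 5
            System.out.println(scanIdentifier("12pt", 0));   // 0
        }
    }

No stack, no backtracking, just a cursor moving forward: that's the efficiency gain over pushing the same characters through a context-free parser.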
ANTLR provides for defining both the grammar and the lexical level in one file, and it will generate appropriate Java classes for the grammar parser as well as the token recognizer. It's not as efficient as the famous lex+yacc utilities, but this is partly due to Java using Unicode, which would make the lookup tables much, much larger if they were generated the way lex does it. Oh well: while yacc is LALR(1), ANTLR generates LL(k) parsers, with some predicate stuff mixed in. Not that this matters much in practice, except for the number of concepts one has to understand while writing a parser. And don't ask me right now what the acronyms mean in detail, it's been 15 years since I really had to know this.
> Given the amount of hacking I had to do to parse everything that could
> legally be thrown at me, I am very surprised that these are the only
> issues in HEAD parsing.
Well, one of the problems with the FO spec is that section 5.9 defines a grammar for property expressions, but this doesn't give the whole picture for all XML attribute values in FO files. There are also (mostly) whitespace-separated lists for shorthands, and the comma-separated font family name list, where a) whitespace is allowed around the commas and b) quotes around the names may be omitted basically as long as there are no commas or whitespace in the name. The latter means there may be unquoted sequences of characters which have to be interpreted as a single token but are not NCNames. It also means that in the "font" shorthand there may be whitespace which is not a list element delimiter. I think this is valid:

    font="bold 12pt 'Times Roman' , serif"

and it should be parsed as

    font-weight="bold" font-size="12pt" font-family="'Times Roman' , serif"

after which the font family can be split. This is easy for humans but can be quite tricky to get right for computers, given that the shorthand list has a bunch of optional elements. Specifically,

    font="bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif"

should be valid too. At least the font family is the last entry. Note that suddenly a slash appears as the delimiter between font size and line height...
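Just to illustrate the splitting, here's a quick Java sketch (made-up code, not anything from FOP) which assumes the font-size is the first whitespace-separated token starting with a digit, optionally carrying the "/line-height" part, and that everything after it belongs to the family list:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class FontShorthandSketch {

        // Split a family list on commas that are outside quotes.
        static List<String> splitFamilies(String families) {
            List<String> result = new ArrayList<String>();
            StringBuilder current = new StringBuilder();
            char quote = 0;
            for (char c : families.toCharArray()) {
                if (quote != 0) {
                    if (c == quote) quote = 0;
                    current.append(c);
                } else if (c == '\'' || c == '"') {
                    quote = c;
                    current.append(c);
                } else if (c == ',') {
                    result.add(current.toString().trim());
                    current.setLength(0);
                } else {
                    current.append(c);
                }
            }
            result.add(current.toString().trim());
            return result;
        }

        public static void main(String[] args) {
            String value = "bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif";
            String[] parts = value.split("\\s+");
            int sizeIdx = 0;
            while (sizeIdx < parts.length
                    && !Character.isDigit(parts[sizeIdx].charAt(0))) {
                sizeIdx++;                    // skip weight/style/variant keywords
            }
            String size = parts[sizeIdx];
            String lineHeight = null;
            int slash = size.indexOf('/');    // the "12pt/14pt" form
            if (slash >= 0) {
                lineHeight = size.substring(slash + 1);
                size = size.substring(0, slash);
            }
            // Everything after the size belongs to the font family list.
            String family = String.join(" ",
                    Arrays.copyOfRange(parts, sizeIdx + 1, parts.length));
            System.out.println("keywords: "
                    + Arrays.toString(Arrays.copyOfRange(parts, 0, sizeIdx)));
            System.out.println("font-size: " + size + ", line-height: " + lineHeight);
            System.out.println("families: " + splitFamilies(family));
        }
    }

A real implementation would of course have to validate the keywords and handle a missing size, but once quotes are respected, the comma splitting itself is a simple one-pass scan.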
Another set of problems is token typing, the implicit type conversions, and the very implicit type specifications for the properties. While often harmless, this shows up for the "format" property: the spec says the expected type is a string, which means it should be written as format="'01'". Of course, people tend to write format="01". While the parsed number could be cast back into a string, unfortunately the leading zero is lost. The errata amended 5.9 specifically for this use case: in case of an error, the original string representation of the property value expression should be used to recover. Which tempts me to use initial-page-number="auto+1". Another famous case is hyphenation-char="-", which is by no means a valid property expression. Additionally, the restriction to a string of length 1 (a "char") isn't spelled out explicitly anywhere.
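The recovery rule could look like the following sketch, where evaluate() is a toy stand-in for the real expression evaluator (not FOP's API):

    public class FormatFallbackSketch {

        // Toy evaluator: a quoted string yields its content,
        // anything else is parsed as a number (or throws).
        static Object evaluate(String expr) {
            String s = expr.trim();
            if (s.length() >= 2 && s.charAt(0) == '\'' && s.endsWith("'")) {
                return s.substring(1, s.length() - 1);
            }
            return Double.parseDouble(s);
        }

        static String formatValue(String attr) {
            try {
                Object v = evaluate(attr);
                if (v instanceof String) {
                    return (String) v;
                }
                // A number cast back to a string has lost the leading
                // zero: evaluate("01") yields 1.0, not "01" ...
            } catch (NumberFormatException e) {
                // ... and values like "-" don't even parse.
            }
            return attr.trim();  // errata: recover with the original string
        }

        public static void main(String[] args) {
            System.out.println(formatValue("'01'"));  // 01, the proper spelling
            System.out.println(formatValue("01"));    // 01, saved by the fallback
            System.out.println(formatValue("-"));     // -, ditto
        }
    }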
All in all I have the feeling the spec tried to provide a property specification system which would be powerful but still easy to manage by hand, and they ended up with a system containing at least as many unintended consequences as the C preprocessor. Which, as everybody knows, led to weirdness like macro argument prescanning and 0xE-0x1 being a syntax error. Well, the C preprocessor at least had a simple first implementation.
The maintenance branch tried to unify all cases into a single framework, which quite predictably resulted in complex and somewhat messy code. It's also less efficient than it could be: format="01" is (or would be) indeed parsed as an expression, while an optimized parser could take advantage of the lack of any string operations, look for quoted strings and function calls only, and return the trimmed XML attribute value otherwise.
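For a string-valued property like "format", that short-circuit could look like this sketch, where parseExpression() stands in for the full expression parser:

    public class ShortCircuitSketch {

        static String resolve(String attr) {
            for (int i = 0; i < attr.length(); i++) {
                char c = attr.charAt(i);
                if (c == '\'' || c == '"' || c == '(') {
                    return parseExpression(attr);  // rare, expensive path
                }
            }
            return attr.trim();  // common path: no expression machinery at all
        }

        // Placeholder; a real implementation would evaluate the expression.
        static String parseExpression(String attr) {
            return attr;
        }

        public static void main(String[] args) {
            System.out.println(resolve(" 01 "));   // "01" without any parsing
            System.out.println(resolve("'01'"));   // goes through the parser
        }
    }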
Finally, bless the Mozilla and MySpell folks for the spell checker... :-)
J.Pietschmann