Peter B. West wrote:
> With my naive understanding of parsing as a two-stage process (lexemes
> -> higher level constructs) I have been curious about earlier comments
> of yours about multi-stage parsing. Can ANTLR do this sort of thing?
I'm not quite sure whether you mean the same thing by "parsing as a two-stage process" as I do. In language specs, the formal description is usually divided into a grammar level, representing a context-free grammar (Chomsky type 2), and a lexical level, described by simple regular expressions (Chomsky type 3, IIRC). This is done both to keep the spec readable and to allow an efficient implementation.
This is basically what I meant - I see (and have experienced in FOP) the difficulty of trying to parse "multiple" grammars out of a single stream of lexical objects.
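To make the two levels concrete, here is a minimal sketch in Python, not anything taken from FOP or ANTLR. The names `TOKEN_RE`, `lex` and `parse_sum` are made up for illustration, and the toy "grammar" (integers joined by + and -) is an assumption chosen only to keep the example short. The lexical level is a single regular expression; the grammar level only ever sees the resulting token stream.

```python
import re

# Lexical level: one regular expression describes every token.
# (Hypothetical toy token set: unsigned integers and the operators + and -.)
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<OP>[+-]))")


def lex(text):
    """Yield (kind, text) pairs; raise ValueError on an unknown character."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()


def parse_sum(tokens):
    """Grammar level: sum := NUMBER (OP NUMBER)*, evaluated on the fly."""
    tokens = list(tokens)
    if not tokens or tokens[0][0] != "NUMBER":
        raise ValueError("expected a number")
    total = int(tokens[0][1])
    i = 1
    while i < len(tokens):
        # Each further step needs an operator followed by a number.
        if i + 1 >= len(tokens) or tokens[i][0] != "OP" or tokens[i + 1][0] != "NUMBER":
            raise ValueError("expected an operator followed by a number")
        operand = int(tokens[i + 1][1])
        total = total + operand if tokens[i][1] == "+" else total - operand
        i += 2
    return total
```

The trouble described above starts when a single token set like this no longer fits every attribute value, because the parser has no say in how the lexer cuts up the input.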
> Given the amount of hacking I had to do to parse everything that could
> legally be thrown at me, I am very surprised that these are the only
> issues in HEAD parsing.
Well, one of the problems with the FO spec is that section 5.9 defines a grammar for property expressions, but this doesn't give the whole picture for all XML attribute values in FO files. There are also (mostly) whitespace-separated lists for shorthands, and the comma-separated font family name list, where a) whitespace is allowed around the commas and b) quotes around the names may be omitted, basically as long as there are no commas or whitespace in the name. The latter means there may be unquoted sequences of characters which have to be interpreted as a single token but are not NCNames. It also means that in the "font" shorthand there may be whitespace which is not a list element delimiter. I think this is valid:

  font="bold 12pt 'Times Roman' , serif"

and it should be parsed as

  font-weight="bold"
  font-size="12pt"
  font-family="'Times Roman' , serif"

then the font family can be split. This is easy for humans but can be quite tricky to get right for computers, given that the shorthand list has a bunch of optional elements. Specifically

  font="bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif"

should be valid too. At least the font family is the last entry. Note that suddenly a slash appears as a delimiter between font size and line height...
This usage, AFAICT, is the reason that division is specified by the token 'div'. All a matter of CSS compatibility.
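For illustration, the "font" shorthand splitting discussed above could be sketched roughly as follows. This is a simplified Python sketch, not the FOP implementation: the function names `split_font_shorthand` and `split_font_family` are hypothetical, the keyword sets are deliberately incomplete, font-size keywords such as "large" are not handled, and quoted family names containing commas would be split incorrectly. The idea is to consume optional keywords until something that looks like a size (with an optional /line-height) turns up, and to treat everything after that as the font family.

```python
import re

# Assumption: reduced keyword sets, just enough for the examples above.
STYLE = {"italic", "oblique"}
VARIANT = {"small-caps"}
WEIGHT = {"bold", "bolder", "lighter"} | {str(n) for n in range(100, 1000, 100)}

# A number with a unit or percent sign, optionally followed by /line-height.
SIZE_RE = re.compile(r"^(\d*\.?\d+(?:[a-z]+|%))(?:/(\S+))?$")


def split_font_shorthand(value):
    """Split a font shorthand into a dict of individual property strings."""
    props = {}
    rest = value.strip()
    while rest:
        parts = rest.split(None, 1)
        token = parts[0]
        remainder = parts[1] if len(parts) > 1 else ""
        size = SIZE_RE.match(token)
        if size:
            props["font-size"] = size.group(1)
            if size.group(2):
                props["line-height"] = size.group(2)
            # Everything after the size is the family list, whitespace included.
            props["font-family"] = remainder
            return props
        if token in STYLE:
            props["font-style"] = token
        elif token in VARIANT:
            props["font-variant"] = token
        elif token in WEIGHT:
            props["font-weight"] = token
        else:
            raise ValueError(f"unexpected token in font shorthand: {token!r}")
        rest = remainder
    raise ValueError("font shorthand has no font-size")


def split_font_family(value):
    """Split a comma-separated family list, trimming whitespace and quotes."""
    names = []
    for part in value.split(","):
        name = part.strip()
        if len(name) >= 2 and name[0] == name[-1] and name[0] in "'\"":
            name = name[1:-1]
        names.append(name)
    return names
```

Note that this only works because the font family is the last entry: the first stage never has to decide whether a blank inside the family list is a delimiter.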
Another set of problems is token typing, the implicit type conversion and the very implicit type specification for the properties. While often harmless, it shows itself for the "format" property: the spec says the expected type is a string, which means it should be written as format="'01'". Of course, people tend to write format="01". While the parsed number could be cast back into a string, unfortunately the leading zero is lost. The errata amended 5.9 specifically for this use case: in case of an error, the original string representation of the property value expression should be used to recover. Which tempts me to use initial-page-number="auto+1".
This is one of the disgraces of the spec - this time for compatibility with XSLT usage. XSL-FO just cops it sweet whenever someone else's problem (SEP) extrudes into the XSL namespace.
Another famous case is hyphenation-char="-", which is by no means a valid property expression. Additionally the restriction to a string of length 1 (a "char") isn't spelled out explicitly anywhere.
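The recovery rule for string-typed properties such as "format" and "hyphenation-char" might be sketched like this in Python. Both functions are hypothetical: `evaluate` is a toy stand-in for a real property expression evaluator and understands only quoted strings and plain integers, and `string_property` is just one reading of the errata (fall back to the raw attribute text on any error, including a type mismatch).

```python
def evaluate(expr):
    """Toy evaluator (assumption): quoted strings and integers only."""
    expr = expr.strip()
    if len(expr) >= 2 and expr[0] == expr[-1] and expr[0] in "'\"":
        return expr[1:-1]
    return int(expr)  # raises ValueError for anything else, e.g. "-"


def string_property(raw):
    """Return the value of a string-typed property, recovering on errors."""
    try:
        result = evaluate(raw)
    except ValueError:
        # Not a valid expression at all: use the original text.
        return raw
    # A number where a string was expected is also an error: use the
    # original text, so that the leading zero of "01" survives.
    return result if isinstance(result, str) else raw
```

With this, format="'01'" and format="01" both end up as the string 01, whereas converting the parsed number back with str() would give 1, and hyphenation-char="-" comes through unchanged.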
Peter
-- 
Peter B. West <http://www.powerup.com.au/~pbwest/resume.html>