Peter B. West wrote:
> With my naive understanding of parsing as a two-stage process (lexemes
> -> higher level constructs) I have been curious about earlier comments
> of yours about multi-stage parsing.  Can ANTLR do this sort of thing?

I'm not quite sure whether you mean by "parsing as a two-stage
process" the same as I do. In language specs, the formal description
is usually divided into a grammar level representing a Chomsky
type 2 (context-free) grammar and a lexical level, described by
simple regular expressions (Chomsky type 3 IIRC). This is done both
for keeping the spec readable and for efficient implementation: a
context-free parser needs a stack, and while the common identifier
token can be described by a context-free grammar, it's more efficient
to use a regular expression and implement the recognizer as a DFA,
which doesn't shuffle characters to and from the stack top like mad.
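
To make the difference concrete, here is a little sketch of mine:
recognizing an identifier of the form letter (letter|digit)* with a
two-state DFA in Java needs no stack at all, just the current state
and the next input character:

  static boolean isIdentifier(String s) {
      int state = 0;  // 0: expect first letter, 1: inside identifier
      for (int i = 0; i < s.length(); i++) {
          char c = s.charAt(i);
          if (state == 0) {
              if (Character.isLetter(c) || c == '_') state = 1;
              else return false;
          } else {
              if (!Character.isLetterOrDigit(c) && c != '_') return false;
          }
      }
      return state == 1;  // accept only if at least one char was seen
  }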

ANTLR provides for defining both the grammar and the lexical
level in one file, and it will generate appropriate Java
classes for the grammar parser as well as the token recognizer.
It's not as efficient as the famous lex+yacc utilities, but this
is partly due to Java using Unicode, which would make the lookup
tables much, much larger if they were generated the same way lex
does. Oh well: while yacc is LALR(1), ANTLR generates LL(k)
parsers, with some predicate machinery mixed in for the harder
spots. Not that this matters much in practice, except for the
number of concepts one has to understand while writing a parser.
And don't ask me right now what the acronyms mean in detail, it's
been 15 years since I really had to know this.
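
For the curious: given a combined grammar file defining, say, an
ExprLexer and an ExprParser (names made up, the API sketched from
memory of ANTLR 2), the generated classes are driven roughly like
this:

  ExprLexer lexer = new ExprLexer(new java.io.StringReader("1+2*3"));
  ExprParser parser = new ExprParser(lexer);  // lexer is a TokenStream
  parser.expr();  // invoke the start rule; throws on syntax errors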

> Given the amount of hacking I had to do to parse everything that could
> legally be thrown at me, I am very surprised that these are the only
> issues in HEAD parsing.

Well, one of the problems with the FO spec is that section 5.9
defines a grammar for property expressions, but this doesn't
give the whole picture for all XML attribute values in FO files.
There are also (mostly) whitespace-separated lists for shorthands,
and the comma-separated font family name list, where
a) whitespace is allowed around the commas and
b) quotes around the names may be omitted basically as long
 as there are no commas or whitespace in the name.
The latter means there may be unquoted sequences of characters
which have to be interpreted as a single token but are not NCNames.
It also means that in the "font" shorthand there may be whitespace
which is not a list element delimiter. I think this is valid:
 font="bold 12pt 'Times Roman' , serif"
and it should be parsed as
 font-weight="bold"
 font-size="12pt"
 font-family="'Times Roman' , serif"
then the font family list can be split further. This is easy for
humans but can be quite tricky to get right for computers, given
that the shorthand list has a bunch of optional elements. Specifically
 font="bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif"
should be valid too. At least the font family is always the last entry.
Note that suddenly a slash appears as a delimiter between font size
and line height...
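
Just to make the comma handling concrete, here is a quick sketch
(mine, not FOP code) of splitting a font family list on commas which
may be surrounded by whitespace, with optionally quoted names:

  static java.util.List splitFontFamily(String value) {
      java.util.List names = new java.util.ArrayList();
      StringBuffer current = new StringBuffer();
      char quote = 0;                    // active quote char, or 0
      for (int i = 0; i < value.length(); i++) {
          char c = value.charAt(i);
          if (quote != 0) {              // inside a quoted name
              if (c == quote) quote = 0; else current.append(c);
          } else if (c == '\'' || c == '"') {
              quote = c;                 // opening quote
          } else if (c == ',') {         // list element delimiter
              names.add(current.toString().trim());
              current.setLength(0);
          } else {
              current.append(c);
          }
      }
      names.add(current.toString().trim());
      return names;
  }

so that "'Times Roman' , A+B,serif" comes out as the three names
Times Roman, A+B and serif.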

Another set of problems is token typing, the implicit type
conversions and the very implicit type specifications for the
properties. While often harmless, this shows itself for the "format"
property: the spec says the expected type is a string, which means
it should be written as format="'01'". Of course, people tend to
write format="01". While the parsed number could be cast back into a
string, unfortunately the leading zero is lost. The errata amended
5.9 specifically for this use case: in case of an error, the
original string representation of the property value expression
should be used to recover. Which tempts me to use
initial-page-number="auto+1".
Another famous case is hyphenation-char="-", which is by no
means a valid property expression. Additionally the restriction
to a string of length 1 (a "char") isn't spelled out explicitly
anywhere.
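
In code, the errata's recovery rule boils down to something like the
following sketch, where the number parsing merely stands in for full
expression evaluation:

  static String stringProperty(String attr) {
      String s = attr.trim();
      if (s.length() >= 2 && s.startsWith("'") && s.endsWith("'")) {
          return s.substring(1, s.length() - 1);  // format="'01'"
      }
      try {
          double n = Double.parseDouble(s);  // format="01" -> 1.0
          // casting back, e.g. String.valueOf((int) n), would yield
          // "1" and lose the leading zero, so per the errata recover
          // with the original string representation:
          return s;
      } catch (NumberFormatException e) {
          return s;  // no valid expression at all, e.g. "-": as-is
      }
  }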

All in all I have the feeling the spec tried to provide a
property specification system which would be powerful but still
easy to manage by hand, and they ended up with a system
containing as many or more unintended consequences as the C
preprocessor. Which, as everybody knows, leads to weirdness like
macro argument prescanning and 0xE-0x1 being a syntax error.
Well, the C preprocessor at least had a simple first
implementation.

The maintenance branch tried to unify all cases into a single
framework, which quite predictably resulted in complex and
somewhat messy code. It's also less efficient than it could be:
format="01" is (or would be) indeed parsed as an expression, while
an optimized parser can take advantage of the lack of any string
operations and look for quoted strings and function calls only,
returning the trimmed XML attribute value otherwise.
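
Such an optimized parser could be as simple as the following sketch,
where parseExpression() stands in for the full expression parser:

  static String fastStringProperty(String attr) {
      String s = attr.trim();
      // Without quotes or parentheses there can be neither string
      // operations nor function calls, so the trimmed value is the
      // result; format="01" stays "01".
      if (s.indexOf('\'') < 0 && s.indexOf('"') < 0
              && s.indexOf('(') < 0) {
          return s;
      }
      return parseExpression(s);  // rare case: run the full parser
  }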

Finally, bless the Mozilla and MySpell folks for the spell
checker... :-)

J.Pietschmann
