J.Pietschmann wrote:

Well, one of the problems with the FO spec is that section 5.9 defines a grammar for property expressions, but this doesn't give the whole picture for all XML attribute values in FO files. There are also (mostly) whitespace-separated lists for shorthands, and the comma-separated font family name list, where a) whitespace is allowed around the commas and b) quotes around the names may be omitted, basically as long as there are no commas or whitespace in the name. The latter means there may be unquoted sequences of characters which have to be interpreted as a single token but are not NCNames. It also means that in the "font" shorthand there may be whitespace which is not a list element delimiter. I think this is valid: font="bold 12pt 'Times Roman' , serif" and it should be parsed as font-weight="bold" font-size="12pt" font-family="'Times Roman' , serif"; then the font family can be split. This is easy for humans but can be quite tricky to get right for computers, given that the shorthand list has a bunch of optional elements. Specifically, font="bold small-caps italic 12pt/14pt 'Times Roman' , A+B,serif" should be valid too. At least the font family is the last entry. Note that suddenly a slash appears as a delimiter between font size and line height...
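To make the shape of the problem concrete, here is a rough Java sketch of splitting the "font" shorthand along the lines described above. It is not code from either branch: the class and field names are made up, and it assumes the font size is a numeric length (starting with a digit or a dot), so keyword sizes such as "small" are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not FOP code. Assumes font-size starts with a digit or '.'.
public class FontShorthandSketch {
    /** Result holder; all names here are illustrative only. */
    public static final class Parts {
        public final List<String> leading = new ArrayList<>();  // style/variant/weight keywords, source order
        public String size;
        public String lineHeight;                               // null when there is no "/line-height"
        public final List<String> families = new ArrayList<>();
    }

    public static Parts split(String value) {
        Parts p = new Parts();
        String s = value.trim();
        int pos = 0;
        while (pos < s.length()) {
            int end = pos;
            while (end < s.length() && !Character.isWhitespace(s.charAt(end))) end++;
            String tok = s.substring(pos, end);
            int next = end;
            while (next < s.length() && Character.isWhitespace(s.charAt(next))) next++;
            char c = tok.charAt(0);
            if (Character.isDigit(c) || c == '.') {
                // The size token may carry the line height after a slash.
                int slash = tok.indexOf('/');
                if (slash >= 0) {
                    p.size = tok.substring(0, slash);
                    p.lineHeight = tok.substring(slash + 1);
                } else {
                    p.size = tok;
                }
                // Everything after the size is the font family list.
                splitFamilies(s.substring(next), p.families);
                return p;
            }
            p.leading.add(tok);
            pos = next;
        }
        throw new IllegalArgumentException("no font-size found in: " + value);
    }

    /** Splits on commas outside quotes, trims each name and drops the quote characters. */
    static void splitFamilies(String s, List<String> out) {
        StringBuilder cur = new StringBuilder();
        char quote = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0; else cur.append(c);
            } else if (c == '\'' || c == '"') {
                quote = c;
            } else if (c == ',') {
                out.add(cur.toString().trim());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        out.add(cur.toString().trim());
    }
}
```

The sketch leans on the one fixed point mentioned above: the font family is the last entry, so once the size token is found the rest of the string can be handed to the comma splitter untouched.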


Alt-design takes a two-stage approach to parsing. In the first stage the basic datatypes are detected. Where there are nasty constructs hung over from CSS, as in 'font', the elements are collected into PropertyValueLists, in a manner dependent on whether the components were space- or comma-separated. From the javadoc comment to the 'parse' method in ...fo.expr.PropertyParser:

* Parse the property expression described in the instance variables.
* <p>The <tt>PropertyValue</tt> returned by this function has the
* following characteristics:
* If the expression resolves to a single element that object is returned
* directly in an object which implements <tt>PropertyValue</tt>.
* <p>If the expression cannot be resolved into a single object, the set
* to which it resolves is returned in a <tt>PropertyValueList</tt> object
* (which itself implements <tt>PropertyValue</tt>).
* <p>The <tt>PropertyValueList</tt> contains objects whose corresponding
* elements in the original expression were separated by <em>commas</em>.
* <p>Objects whose corresponding elements in the original expression
* were separated by spaces are composed into a sublist contained in
* another <tt>PropertyValueList</tt>.  If all of the elements in the
* expression were separated by spaces, the returned
* <tt>PropertyValueList</tt> will contain one element, a
* <tt>PropertyValueList</tt> containing objects representing each of
* the space-separated elements in the original expression.
* <p>E.g., if a <b>font-family</b> property is assigned the string
* <em>Palatino, New Century Schoolbook, serif</em>, the returned value
* will look like this:
* <pre>
* PropertyValueList(NCName('Palatino')
*                   PropertyValueList(NCName('New')
*                                     NCName('Century')
*                                     NCName('Schoolbook') )
*                   NCName('serif') )
* </pre>
* <p>If the property had been assigned the string
* <em>Palatino, "New Century Schoolbook", serif</em>, the returned value
* would look like this:
* <pre>
* PropertyValueList(NCName('Palatino')
*                   NCName('New Century Schoolbook')
*                   NCName('serif') )
* </pre>
* <p>If a <b>background-position</b> property is assigned the string
* <em>top center</em>, the returned value will look like this:
* <pre>
* PropertyValueList(PropertyValueList(NCName('top')
*                                     NCName('center') ) )
* </pre>
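The grouping rule in that comment can be illustrated with a small stand-alone sketch. This is not the alt-design PropertyParser: it uses plain String and List in place of NCName and PropertyValueList, the class name is hypothetical, and it only handles bare words and quoted strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the comma/space grouping rule; String stands in for
// NCName and List for PropertyValueList.
public class ListGroupingSketch {
    /** Returns a String for a single element, otherwise a List of Strings and sublists. */
    public static Object group(String expr) {
        List<Object> top = new ArrayList<>();   // comma-separated elements
        List<String> run = new ArrayList<>();   // current space-separated run
        StringBuilder tok = new StringBuilder();
        char quote = 0;
        for (int i = 0; i <= expr.length(); i++) {
            boolean atEnd = i == expr.length();
            char c = atEnd ? ',' : expr.charAt(i);  // sentinel comma flushes the last run
            if (quote != 0 && !atEnd) {
                // Inside quotes, whitespace and commas are ordinary characters.
                if (c == quote) quote = 0; else tok.append(c);
                continue;
            }
            if (c == '\'' || c == '"') {
                quote = c;
                continue;
            }
            if (c == ',' || Character.isWhitespace(c)) {
                if (tok.length() > 0) {
                    run.add(tok.toString());
                    tok.setLength(0);
                }
                if (c == ',') {
                    if (run.size() == 1) {
                        top.add(run.get(0));                  // lone element, no sublist
                    } else if (run.size() > 1) {
                        top.add(new ArrayList<String>(run));  // space-separated sublist
                    }
                    run.clear();
                }
                continue;
            }
            tok.append(c);
        }
        // A single non-list element is returned directly, as the javadoc describes.
        if (top.size() == 1 && top.get(0) instanceof String) {
            return top.get(0);
        }
        return top;
    }
}
```

With this sketch, "Palatino, New Century Schoolbook, serif" yields a three-element list whose middle element is the sublist (New, Century, Schoolbook), and "top center" yields a list containing one two-element sublist, matching the shapes shown in the javadoc.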

In the second stage (refineParsing) the lists are analysed in their context (e.g. 'font') and the appropriate final values are developed.

The maintenance branch tried to unify all cases into a single framework, which quite predictably resulted in complex and somewhat messy code. It's also less efficient than it could be: format="01" is (or would be) indeed parsed as an expression, while an optimized parser can take advantage of the lack of any string operations and look for quoted strings and function calls only, returning the trimmed XML attribute value otherwise.

This sounds promising.
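A minimal sketch of what such a shortcut might look like, for properties like "format" whose values involve no arithmetic. The class and method names are invented for illustration, and treating any '(' as a possible function call is a deliberately crude assumption.

```java
import java.util.Optional;

// Hypothetical sketch of a fast path for string-valued properties.
public class AttributeFastPathSketch {
    /** True when the value contains a quote or an opening parenthesis (a possible function call). */
    public static boolean needsExpressionParser(String raw) {
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\'' || c == '"' || c == '(') {
                return true;
            }
        }
        return false;
    }

    /** Returns the trimmed value, or empty when the full expression parser is required. */
    public static Optional<String> fastPath(String raw) {
        return needsExpressionParser(raw) ? Optional.empty() : Optional.of(raw.trim());
    }
}
```

Here format="01" would come back as the trimmed string without touching the expression parser, while a value containing quotes or a function call would fall through to the full parse.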

Peter B. West <http://www.powerup.com.au/~pbwest/resume.html>
