Hi Lukas,

2009/5/4 Lukas Theussl <ltheu...@apache.org>:
>
> Vincent,
>
> I'm trying to understand some of the issues we have with entities in the
> XmlParser. Is there a special reason why entities are emitted as rawText and
> not text?

The text handled by XhtmlBaseParser#handleEntity() can contain predefined
entities [1] and numeric character references (i.e. &#198; is turned into Æ
by XmlPullParser). XhtmlBaseSink#text() escapes characters and
XhtmlBaseSink#rawText() does not, so rawText() is used to make sure that text
containing entities is not escaped a second time.

> I think they should be emitted as text:
>
> First, custom entities can be used to simply define some replacement text
> inside documents (eg <!ENTITY version "1.0">).
>
> Second, the resulting events should be consumable by all sinks, not just
> x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
> AElig is defined as <!ENTITY AElig "&#198;">). Currently it is emitted by
> the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
> This means that eg the Latex Sink will produce wrong output (the AElig
> should be converted to "\AE" in latex).
>
> IMO the resolved entity should be emitted in a format-independent way, eg as
> one (unicode?) character, just like &amp; is emitted as one character above.
> The consuming sink then has to transform that into a format-specific
> representation.

That could be another implementation: XhtmlBaseParser#handleEntity() could
unescape the XML and call only sink.text().

Cheers,

Vincent

[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-predefined-ent
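
To make the "unescape and call only sink.text()" idea concrete, here is a
minimal sketch in Java. It is an illustration under assumptions, not the Doxia
code: EntitySketch and TextSink are hypothetical names standing in for the
parser and the sink, and only the five predefined entities and numeric
character references are resolved. Any other reference is passed through
unchanged, on the assumption that custom entities have already been resolved
by the pull parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EntitySketch {
    /** Hypothetical stand-in for a sink; this is NOT the Doxia Sink interface. */
    interface TextSink {
        void text(String text);
    }

    // The five predefined XML entities, see [1].
    private static final Map<String, String> PREDEFINED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    /**
     * Resolves a single reference such as "&amp;", "&#198;" or "&#xC6;".
     * Anything that cannot be resolved is returned unchanged.
     */
    static String unescape(String entity) {
        if (entity.length() < 3 || !entity.startsWith("&") || !entity.endsWith(";")) {
            return entity;
        }
        String name = entity.substring(1, entity.length() - 1);
        if (name.startsWith("#")) {
            try {
                boolean hex = name.startsWith("#x");
                int codePoint = Integer.parseInt(name.substring(hex ? 2 : 1), hex ? 16 : 10);
                return new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                // Malformed number or invalid code point: leave the text as it is.
                return entity;
            }
        }
        return PREDEFINED.getOrDefault(name, entity);
    }

    /** Emits the resolved entity as plain text, so every sink can escape it in its own way. */
    static void handleEntity(String entity, TextSink sink) {
        sink.text(unescape(entity));
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        handleEntity("&amp;", events::add);
        handleEntity("&#198;", events::add);
        System.out.println(events);
    }
}
```

With this approach an XHTML sink would escape the character again on output,
while a LaTeX sink would be free to map it to "\AE".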