--- Dennis Sosnoski <[EMAIL PROTECTED]> wrote:

> That's a very nice approach for generating and persisting what amounts
> to pre-parsed documents. If you were to transmit the token descriptor
> data structures along with the XML document it'd pretty much eliminate
> parsing for the receiver, but would apparently about double the document
> size on average. The really nice use for this type of approach is as an
> alternative to an object model composed of separate objects per
> component, though.
True. However, I think they are overstating the degree to which String objects (one of the bigger performance bottlenecks in parsing) are created on a temporary basis. All modern XML parsers use fully shared Strings for names (prefix, local name), and at least locally shared Strings for namespace URIs. When done the right way, these can also be shared between documents produced by parsers that share the same string pool (or use intern()). With both SAX and StAX, one can easily access the char[] representation of textual content (with obviously slightly more cumbersome processing compared to Strings -- CharSequence might or might not solve this, the problem being the mutability of the underlying char array). I don't see the need to call 3 accessor methods to get the raw char array as a significant performance block -- it certainly does not even register on profiles I have taken of parsing.

The only remaining area where non-shared Strings are used is attribute values; and here DTD/schema-based handling might allow sharing too (for enumerated types). Or, for minor improvements, type-based accessors could be used as well. If there's interest, I could experiment with the Woodstox StAX parser -- adding low-level typed accessors would be quite easy to do, and would avoid String creation.

> Unfortunately it doesn't fit very well with either a StAX or DOM-like
> API, because those use Strings (almost) everywhere. It'd work much
> better in combination with a CharSequence-based API.

I haven't fully read the proposal, but it might be possible to make use of such annotation, as long as the parser can keep byte-accurate offsets during parsing. This is one problem I have not yet solved, since it does require merging the byte->char conversion and tokenization parts; something that makes the design less clean (and adds code size, since generic decoders cannot be used), albeit potentially more performant (a single scan instead of a double scan).
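For reference, the "3 accessor methods" I mean are the standard StAX ones on javax.xml.stream.XMLStreamReader (getTextCharacters / getTextStart / getTextLength). A minimal sketch, with a made-up XML snippet for illustration:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class CharArrayAccess {
    // Collect all character content without materializing a String per
    // CHARACTERS event: the three accessors expose the parser's internal
    // buffer (which may be reused between events) directly.
    static String collectText(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.CHARACTERS) {
                char[] buf = r.getTextCharacters(); // raw shared buffer
                int off = r.getTextStart();         // offset of this event's text
                int len = r.getTextLength();        // length of this event's text
                sb.append(buf, off, len);           // no intermediate String
            }
        }
        r.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<msg>hello world</msg>"));
    }
}
```

Compared to calling getText(), this avoids one String allocation (and copy) per text event; the trade-off is that the caller must copy out of the buffer before the next call to next().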
-+ Tatu +-
