Tatu Saloranta wrote:
> ...
> I don't see the need to call 3 accessor methods to get
> the raw char array as a significant performance blocker
> -- it certainly does not even register on profiles I
> have taken for parsing.
I didn't mean to suggest that the 3 method calls caused a performance
problem - it's just somewhat awkward, and exposing the underlying parser
buffer is not very clean from a structural standpoint.
The big potential advantage I see with a CharSequence-type approach is
that it would allow the parser to avoid translating data to a char[] in
the first place, instead returning characters directly from the byte
stream input (or internal byte[]). For UTF-8 and UTF-16 this would be
very easy to implement - it's not so easy for some other character
encodings, but those could be handled by the current approach of
translating everything to chars up front.
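
To sketch what I mean (illustrative only, not real Woodstox code - and
to keep charAt() cheap this version only handles the all-ASCII case
lazily, falling back to an up-front decode as soon as a multi-byte
sequence shows up):

import java.nio.charset.StandardCharsets;

// CharSequence view over the parser's raw UTF-8 input buffer. For the
// common all-ASCII case no char[] is ever allocated; a region containing
// multi-byte sequences gets decoded up front, as parsers do today.
public final class Utf8CharSequence implements CharSequence
{
    private final byte[] buf;
    private final int start, len;
    private char[] decoded; // non-null only for the multi-byte fallback

    public Utf8CharSequence(byte[] buf, int start, int len)
    {
        this.buf = buf;
        this.start = start;
        this.len = len;
        for (int i = start, end = start + len; i < end; ++i) {
            if (buf[i] < 0) { // high bit set: multi-byte sequence
                decoded = new String(buf, start, len,
                        StandardCharsets.UTF_8).toCharArray();
                break;
            }
        }
    }

    public int length() {
        return (decoded != null) ? decoded.length : len;
    }

    public char charAt(int index) {
        return (decoded != null) ? decoded[index] : (char) buf[start + index];
    }

    public CharSequence subSequence(int from, int to) {
        return toString().subSequence(from, to);
    }

    public String toString() {
        return (decoded != null) ? new String(decoded)
                : new String(buf, start, len, StandardCharsets.UTF_8);
    }
}

UTF-16 would be even more mechanical (two bytes per char, with surrogates
passing straight through); the harder encodings would just keep taking the
existing decode-everything-up-front path.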
> The only remaining area where non-shared Strings are
> used is attribute values; and here DTD/schema-based
> handling might allow sharing too (for enumerated
> types). Or, for minor improvements, type-based
> accessors could be used too. If there's interest, I
> could experiment with the Woodstox StAX parser -- adding
> low-level typed accessors would be quite easy to do,
> and would avoid String creation.
It'd be interesting to see how much parsing speeds up if you disable the
creation of Strings for attributes. Maybe that's an easy test you could try?
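
For the timing itself, a dumb loop over the standard javax.xml.stream API
should be enough to see the end-to-end effect - run it once against a
stock build and once against a build with attribute String creation
disabled. (The actual change has to happen inside the parser; nothing
below is Woodstox-specific, and the input file name is just a
placeholder.)

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Minimal timing loop over the standard StAX API; counts attributes so
// the parse work can't be optimized away.
public class ParseTimer
{
    public static void main(String[] args) throws Exception
    {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        long start = System.currentTimeMillis();
        int attrs = 0;
        for (int run = 0; run < 10; ++run) {
            FileInputStream in = new FileInputStream(args[0]);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    attrs += reader.getAttributeCount();
                }
            }
            reader.close();
            in.close();
        }
        System.out.println("Attributes seen: " + attrs);
        System.out.println("Elapsed ms: "
                + (System.currentTimeMillis() - start));
    }
}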
As to typed accessors, I think they'd be somewhat useful but I expect
they'd also be a lot of trouble. The main benefit I see is that they
would make it simpler to substitute a binary data decoder for the
parser, and I'm not all that thrilled by the idea of pure binary data
streams. The binary formats would be based on schemas, so in theory
different implementations should translate the same schema to compatible
formats - but we can't even get a reasonable level of compatibility in
the use of schemas for web services with *text* documents, so how much
more difficult would it be to do this with binary formats?
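
That said, the low-level accessor core itself would presumably look
something like the following - decoding straight out of the parser's char
buffer with no intermediate String. The names are hypothetical, and
overflow checking is omitted:

// Hypothetical typed-accessor core: parse a signed decimal int directly
// from the parser's internal char buffer, so no String is ever created.
public static int parseInt(char[] buf, int start, int len)
{
    if (len == 0) {
        throw new NumberFormatException("Empty value");
    }
    int i = start;
    int end = start + len;
    boolean neg = (buf[i] == '-');
    if (neg || buf[i] == '+') {
        ++i;
    }
    int value = 0;
    for (; i < end; ++i) {
        char c = buf[i];
        if (c < '0' || c > '9') {
            throw new NumberFormatException("Not a digit: " + c);
        }
        value = value * 10 + (c - '0');
    }
    return neg ? -value : value;
}

A getAttributeAsInt(index) style method (again, a made-up name) could then
run this over the same buffer region the current code turns into a String.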
- Dennis