--- Dennis Sosnoski <[EMAIL PROTECTED]> wrote:
> Tatu Saloranta wrote:
>
> >...
> > I don't see the need to call 3 accessor methods to get
> > the raw char array as a significant performance block
> I didn't mean to suggest that the 3 method calls caused a
> performance problem - it's just somewhat awkward, and giving
> access to the underlying parser buffer is not very clean from
> a structure standpoint.
Ah, yes, I agree with that.
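(For reference, the 3-call access we are talking about is the
standard XMLStreamReader text accessor trio; roughly like the
sketch below, where handleText() is just a placeholder consumer.)

import javax.xml.stream.XMLStreamReader;

public class RawTextAccess {
    // The three accessor calls in question: they expose a region of the
    // parser's (possibly internal) char buffer without building a String.
    static void handleText(XMLStreamReader reader) {
        char[] buf = reader.getTextCharacters(); // may be the parser's own buffer
        int start = reader.getTextStart();       // offset of this event's text
        int len = reader.getTextLength();        // number of chars of text
        // consume buf[start .. start+len-1] directly:
        System.out.println(new String(buf, start, len));
    }
}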
> The big potential advantage I see with a CharSequence-type
> approach is that it would allow the parser to avoid translating
> data to a char[] in the first place, instead returning
> characters directly from the byte stream input (or internal
> byte[]). For UTF-8 and UTF-16 this would be very easy to
> implement - it's not so easy for some other character
> encodings, but those could be handled by the current approach
> of translating everything to chars up front.
For UTF-16 it's easier, but for UTF-8 I'm not so sure. Or
rather, I suspect it would still lead to the need to scan the
same data at least twice (unless it's just passed through as
is): first when tokenizing (which can be done without char
conversion), and then again when some form of character data is
actually needed.
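Just to make the idea concrete, here's a minimal sketch (my own
names, not from any existing parser) of a CharSequence view
directly over a byte buffer. It only works trivially when the
content is known to be in the single-byte (ASCII) range of
UTF-8; multi-byte sequences would make charAt() non-trivial,
which is part of my worry above:

// Sketch only: a CharSequence backed directly by the parser's byte
// buffer, valid only for 7-bit (ASCII-range) UTF-8 content.
final class AsciiByteCharSequence implements CharSequence {
    private final byte[] buf;
    private final int start;
    private final int length;

    AsciiByteCharSequence(byte[] buf, int start, int length) {
        this.buf = buf;
        this.start = start;
        this.length = length;
    }

    public int length() { return length; }

    public char charAt(int index) {
        // a real implementation would need to detect and decode
        // multi-byte UTF-8 sequences here
        return (char) (buf[start + index] & 0x7F);
    }

    public CharSequence subSequence(int from, int to) {
        return new AsciiByteCharSequence(buf, start + from, to - from);
    }

    public String toString() {
        char[] chars = new char[length];
        for (int i = 0; i < length; ++i) {
            chars[i] = charAt(i);
        }
        return new String(chars);
    }
}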
What would also be interesting to know is whether CharSequence
usage might itself be a performance overhead, it being an
interface. I know that HotSpot can not always (or even often)
optimize calls via interfaces as efficiently as calls via
(final?) classes. For example, I tested access both through a
HashMap reference and through a generic Map reference pointing
to a HashMap (with JDK 1.4.2), and the latter was significantly
slower (yes, just a micro-benchmark... but I think it's
applicable). This matters because String.charAt() can, I think,
nowadays be as fast as cbuf[i] (so claims JavaSoft), so any
dispatch overhead on charAt() would be relatively noticeable.
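The test was roughly of the following shape (a reconstructed
sketch, not the exact code, and with all the usual
micro-benchmark caveats about warm-up and so on); the only
difference between the two methods is the declared type of the
reference:

import java.util.HashMap;
import java.util.Map;

public class InterfaceCallBench {
    static long viaConcrete(HashMap map, Object key, int reps) {
        long start = System.currentTimeMillis();
        int found = 0;
        for (int i = 0; i < reps; ++i) {
            if (map.get(key) != null) { // call site bound to HashMap
                ++found;
            }
        }
        System.out.println("(found " + found + ")"); // keeps the loop live
        return System.currentTimeMillis() - start;
    }

    static long viaInterface(Map map, Object key, int reps) {
        long start = System.currentTimeMillis();
        int found = 0;
        for (int i = 0; i < reps; ++i) {
            if (map.get(key) != null) { // call site goes through the Map interface
                ++found;
            }
        }
        System.out.println("(found " + found + ")");
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        HashMap map = new HashMap();
        map.put("key", "value");
        int reps = 10 * 1000 * 1000;
        System.out.println("HashMap reference: " + viaConcrete(map, "key", reps) + " ms");
        System.out.println("Map reference:     " + viaInterface(map, "key", reps) + " ms");
    }
}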
Lots of open questions, obviously. From an API cleanness
viewpoint I do like the CharSequence idea, though.
...
> It'd be interesting to see how much parsing speeds up if you
> disable the creation of Strings for attributes. Maybe that's
> an easy test you could try?
Yes, it's actually something I was thinking of adding to the
"StAX2" API, and it should be easy to implement.
The current implementation is actually somewhat optimized, at
least for elements that have multiple attributes: under the
hood, one long String is created (the first time a value is
needed), and substrings of that one built String are returned
(except when there is only a single value, in which case it's
returned as is). This is an opportunistic optimization that I
probably should benchmark a bit more.
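In (simplified) code the idea is something like the following
sketch -- not the actual implementation, and the names are made
up:

// Sketch of the attribute value strategy described above: all values
// share one lazily-built String, and individual values are returned
// as substrings of it.
final class AttributeValues {
    private final char[] valueBuffer; // raw attribute value chars from the tokenizer
    private final int[] valueEnds;    // end offset of each value within the buffer
    private String allValues;         // built lazily, on first access

    AttributeValues(char[] valueBuffer, int[] valueEnds) {
        this.valueBuffer = valueBuffer;
        this.valueEnds = valueEnds;
    }

    String getValue(int index) {
        int start = (index == 0) ? 0 : valueEnds[index - 1];
        int end = valueEnds[index];
        if (valueEnds.length == 1) {
            // just one attribute: build its value directly
            return new String(valueBuffer, start, end - start);
        }
        if (allValues == null) {
            // first access: build one String covering all attribute values
            allValues = new String(valueBuffer, 0, valueEnds[valueEnds.length - 1]);
        }
        // substring() shares the underlying char[] (at least on current
        // JDKs), so each value avoids its own copy
        return allValues.substring(start, end);
    }
}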
But perhaps I should see how attribute access performance is --
most testing I have done has been based on just raw parsing, not
attribute access.
In the case of StAX, anyway, attribute values are (or should
be...) generally lazily constructed, only when needed.
> As to typed accessors, I think they'd be somewhat useful but
> I expect they'd also be a lot of trouble. The main benefit I
Yes, depending on where they are implemented. The most efficient
(and trickiest) approach would be to couple them with
tokenization... but that'd be a mess, both because of the added
low-level code and because it requires a priori knowledge of the
types.
On the other hand, a simpler solution would be to essentially
wrap the raw char array (or CharSequence) and do the conversion
from there, bypassing String construction. This need not even be
in the parser proper.
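For example, an int accessor of that kind could convert straight
from the buffer, along these lines (just a sketch: no overflow
checking, and the names are mine):

// Sketch of a typed accessor that converts directly from the raw char
// buffer, so no intermediate String ever gets constructed.
final class TypedDecoder {
    static int parseInt(char[] buf, int start, int len) {
        if (len == 0) {
            throw new NumberFormatException("empty value");
        }
        int i = start;
        int end = start + len;
        boolean negative = false;
        if (buf[i] == '-') {
            negative = true;
            if (++i == end) {
                throw new NumberFormatException("missing digits");
            }
        }
        int value = 0;
        for (; i < end; ++i) {
            char c = buf[i];
            if (c < '0' || c > '9') {
                throw new NumberFormatException("not a digit: '" + c + "'");
            }
            value = value * 10 + (c - '0'); // no overflow check in this sketch
        }
        return negative ? -value : value;
    }
}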
> see is that they would make it simpler to substitute a binary
> data decoder for the parser, and I'm not all that thrilled by
> the idea of pure binary data streams. The binary formats would
> be based on schemas, so in theory different implementations
> should translate the same schema to compatible formats - but
> we can't even get a reasonable level of compatibility in the
> use of schemas for web services with *text* documents, so how
> much more difficult would it be to do this with binary formats?
I definitely agree with these concerns.
And in the end, I'm not 100% sure all the performance concerns
are valid. Last time I tested, almost half (40%) of the time
spent parsing XML docs went to raw I/O, so the parsing overhead
is "only" about half of the total time. While not perfect, I
think many people assume there'd be an order-of-magnitude
discrepancy or so.
And in the case of binary indexing, the proposal was to force
storing the whole document in memory all at once. That may well
offset some of the benefits of the approach: although it does
allow a more pass-through (raw copy) style for things like
routing, it also increases memory usage, and index offsets need
to be recalculated when things are changed. In the end the
speedup may well be somewhere in the range of 10 - 25% (as in,
not phenomenal).
-+ Tatu +-