--- Dennis Sosnoski <[EMAIL PROTECTED]> wrote:

> That's a very nice approach for generating and persisting what amounts
> to pre-parsed documents. If you were to transmit the token descriptor
> data structures along with the XML document it'd pretty much eliminate
> parsing for the receiver, but would apparently about double the document
> size on average. The really nice use for this type of approach is as an
> alternative to an object model composed of separate objects per
> component, though.
True. However, I think they are overstating the degree to which String objects (one of the bigger performance bottlenecks in parsing) are created on a temporary basis. All modern XML parsers use fully shared Strings for names (prefix, local name), and at least locally shared Strings for namespace URIs. When done the right way, these can also be shared between documents produced by parsers that share the same string pool (or use intern()). With both SAX and StAX, one can easily access the char[] representation of textual content (with obviously slightly more cumbersome processing compared to Strings -- CharSequence might or might not solve this, the problem being the mutability of the underlying char array). I don't see the need to call 3 accessor methods to get the raw char array as a significant performance block -- it certainly does not even register on profiles I have taken of parsing.

The only remaining area where non-shared Strings are used is attribute values; and here DTD/schema-based handling might allow sharing too (for enumerated types). Or, for minor improvements, type-based accessors could be used as well. If there's interest, I could experiment with the Woodstox StAX parser -- adding low-level typed accessors would be quite easy to do, and would avoid String creation.

> Unfortunately it doesn't fit very well with either a StAX or DOM-like
> API, because those use Strings (almost) everywhere. It'd work much
> better in combination with a CharSequence-based API.

I haven't fully read the proposal, but it might be possible to make use of such annotation, as long as the parser can keep byte-accurate offsets during parsing. This is one problem I have not yet solved, since it does require merging the byte->char conversion and tokenization parts; something that makes the design less clean (and adds code size, since generic decoders cannot be used), albeit potentially more performant (a single scan instead of a double scan).
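For reference, the "3 accessor methods" I mean are the standard StAX ones on javax.xml.stream.XMLStreamReader (getTextCharacters / getTextStart / getTextLength). A minimal sketch, with a made-up XML snippet for illustration:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class CharArrayAccess {
    // Collect all character content without materializing a String per
    // CHARACTERS event: the three accessors expose the parser's internal
    // buffer (which may be reused between events) directly.
    static String collectText(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.CHARACTERS) {
                char[] buf = r.getTextCharacters(); // raw shared buffer
                int off = r.getTextStart();         // offset of this event's text
                int len = r.getTextLength();        // length of this event's text
                sb.append(buf, off, len);           // no intermediate String
            }
        }
        r.close();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<msg>hello world</msg>"));
    }
}
```

Compared to calling getText(), this avoids one String allocation (and copy) per text event; the trade-off is that the caller must copy out of the buffer before the next call to next().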
-+ Tatu +-
