DOM and String.intern(), making a re-usable cache of DOM object trees

Dave Crooke Wed, 07 Sep 2011 16:42:23 -0700

Hi folks

We are using Xerces 2.8.1 from Oct 2006 .... no reason not to upgrade, more
along the lines of "if it ain't broke ...."

We have a Java app that collects data and assembles it into XML documents,
and has a variety of modules that consume the data in that format, using tree
traversals and XPath over the DOM representations. The master
documents are stored in a database as blobs in XML, and change a few times
per day, but are used 10-20 times per hour.

To avoid the overhead of repeatedly fetching and parsing the XML string, the
product caches DOM representations (DeferredDocumentImpl) using a
SoftReference cache.
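
For readers unfamiliar with the pattern, here is a minimal sketch of a
SoftReference cache of this general kind. It is for illustration only: the
class name, method names and the loader callback are hypothetical, not the
product's actual code, and the value type would be the parsed Document.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical soft-reference cache; all names here are illustrative. */
final class SoftCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    /**
     * Returns the cached value, or loads and caches it when the entry is
     * missing or the garbage collector has already cleared it.
     * Two threads may both load the same key concurrently; the last put wins.
     */
    V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key); // e.g. fetch the blob and parse it
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Drops an entry, e.g. when the master document changes in the database. */
    void invalidate(K key) {
        map.remove(key);
    }
}
```

The JVM is free to clear soft references under memory pressure, so the
footprint of each cached tree directly determines how often entries get
re-fetched and re-parsed.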

There are a couple of limitations to this approach that constrain scale and
performance that I'd like to get the forum's collective input on. I'm not a
purist, so I'm perfectly happy to have a modest amount of code that is
dependent on proprietary Xerces APIs or even the internals of a particular
Xerces version .... we have all the basic XML operations wrapped in a
utility class where I can do it in one spot, and standards compliance to
support theoretical instant portability to another parser is simply not a
requirement.

1. We have a lot of content that re-occurs across different documents (not
just element and attribute names, but also text nodes and attribute values)
and being able to de-duplicate it via String.intern() would save us about
50% of the memory footprint on the cache. AFAICT it seems that interning
within the parser is only offered by SAX. I tried some naive code to walk
the document structure via the official DOM API and do things like
Attr.setValue(Attr.getValue().intern()) but this of course causes the
DeferredDocumentImpl to transform itself internally.

- I've read about LSParserFilter ... would it be appropriate (or indeed,
effective) for a filter to try to intern the content by calling setValue()
etc. on the presented Nodes?

- I considered the "hack" approach of creating a class in the
org.apache.xerces.dom package in order to manipulate the String objects
stored within DeferredDocumentImpl directly, but this is obviously rather
unclean and I'd prefer not to.

- It seems based on a cursory inspection that newer versions of Xerces DOM
use internal string pooling - compare the two JavaDoc links below ....

http://www.oxygenxml.com/apidoc/xerces-2_8_0/org/apache/xerces/dom/DeferredDocumentImpl.html
http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/dom/DeferredDocumentImpl.html

Is or can that string pool be shared across multiple documents / parsers,
and is it thread safe (or easily sub-classed to become so)?

2. The fact that the DOM implementation is not thread-safe for reads
requires it to be cloned for each consumer in our current model, which is a
fair bit of overhead. I'm wondering if there is a way to circumvent this
cost.

- I was looking around the web to see if there was a naive DOM
implementation from another project that is thread-safe for reads, that we
could perhaps convert the document trees into after parsing by Xerces, but I
couldn't find anything. Does such a creature exist?

- It seems that the cost of parsing the XML from string format is similar to
or slightly less than the overhead to clone documents, so as a short term
tweak we've switched the cache to be XML ... this is about 1/3 of the size
in memory of the DeferredDocumentImpl so it's a no-brainer improvement.

- We have a finite and fixed set of XML schemas for these documents that
ship with the product, and so we've been considering the idea of caching
JAXB-generated trees instead, on the presumption that these are (or can
facilely be made to be) thread safe, but for that to make sense we'd need to
convert at least our major consumer module to use that format. One sticking
point here is that a number of our modules allow user configurable XPath
expressions, including XPath 2.0 .... JXPath only supports XPath 1.0

Has anyone used the JAXB-style object trees widely with some war stories to
relate?
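
The short-term tweak from item 2 (cache the XML text and give each consumer
its own freshly parsed tree instead of a clone) can be sketched roughly as
follows. XmlTextCache and its methods are hypothetical names, and the sketch
assumes the standard JAXP DocumentBuilder API rather than anything
Xerces-specific.

```java
import java.io.StringReader;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical cache of XML text; every consumer parses a private DOM tree. */
final class XmlTextCache {
    // Strings are immutable, so the cached text is safe to share between threads.
    private final ConcurrentHashMap<String, String> xmlById = new ConcurrentHashMap<>();

    // DocumentBuilder is not guaranteed to be thread-safe, so keep one per thread.
    private static final ThreadLocal<DocumentBuilder> BUILDER = ThreadLocal.withInitial(() -> {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    });

    void put(String id, String xml) {
        xmlById.put(id, xml);
    }

    /** Returns a new tree owned by the caller, or null if the id is unknown. */
    Document privateCopy(String id) throws Exception {
        String xml = xmlById.get(id);
        if (xml == null) return null;
        return BUILDER.get().parse(new InputSource(new StringReader(xml)));
    }
}
```

Since no tree is ever shared, the read thread-safety question goes away, at
the price of one parse per consumer per use.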

All comments, advice and anecdotal experience most welcome. Thanks in
advance.

Cheers
Dave
