Re: DOM and String.intern(), making a re-usable cache of DOM object trees

Michael Glavassevich Thu, 08 Sep 2011 14:15:58 -0700

Hi Dave,

Dave Crooke <dcro...@gmail.com> wrote on 09/07/2011 07:41:52 PM:


> Hi folks
>
> We are using Xerces 2.8.1 from Oct 2006 .... no reason not to
> upgrade, more along the lines of "if it ain't broke ...."
>
> We have a Java app that collects data and assembles it into XML
> documents, and has a variety of modules that consume the data in
> that format using tree traversals and XPath, consuming the DOM
> representations. The master documents are stored in a database as
> blobs in XML, and change a few times per day, but are used 10-20
> times per hour.
>
> To avoid the overhead of repeatedly fetching and parsing the XML
> string, the product caches DOM representations
> (DeferredDocumentImpl) using a SoftReference cache.
>
> There are a couple of limitations to this approach that constrain
> scale and performance that I'd like to get the forum's collective
> input on. I'm not a purist, so I'm perfectly happy to have a modest
> amount of code that is dependent on proprietary Xerces APIs or even
> the internals of a particular Xerces version .... we have all the
> basic XML operations wrapped in a utility class where I can do it in
> one spot, and standards compliance to support theoretical instant
> portability to another parser is simply not a requirement.
>
>
> 1. We have a lot of content that re-occurs across different
> documents (not just element and attribute names, but also text nodes
> and attribute values) and being able to de-duplicate it via
> String.intern() would save us about 50% of the memory footprint on
> the cache. AFAICT it seems that interning within the parser is only
> offered by SAX. I tried some naive code to walk the document
> structure via the official DOM API and do things like Attr.setValue
> (Attr.getValue().intern()) but this of course causes the
> DeferredDocumentImpl to transform itself internally.

The SAX and DOM parsers within Xerces are just a thin layer on top of the
same set of components. Element and attribute names are always interned by
Xerces. What SAX offers is a feature you can check to confirm whether such
strings are interned. DOM has no such feature. Even though Xerces will
construct a DOM with interned strings, it's unsafe to rely on that,
particularly once you start mutating the DOM.
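
For reference, a minimal sketch of checking that SAX feature. It assumes
whatever JAXP parser is on the classpath (the JDK's built-in one is derived
from Xerces); the class and method names are only illustrative. The feature
URI is the standard SAX2 one.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

class InternCheck {
    // Standard SAX2 feature: true means element/attribute names and
    // namespace URIs are reported as String.intern()'ed strings.
    static final String STRING_INTERNING =
            "http://xml.org/sax/features/string-interning";

    // Asks the default JAXP SAX parser whether it interns names.
    // A parser that does not recognize the feature throws, which is
    // rethrown here as an unchecked exception.
    static boolean namesAreInterned() {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return reader.getFeature(STRING_INTERNING);
        } catch (Exception e) {
            throw new IllegalStateException("could not query feature", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("string-interning: " + namesAreInterned());
    }
}
```

This only covers names reported through SAX. It says nothing about text
nodes or attribute values, and there is no equivalent query on the DOM side.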

> - I've read about LSParserFilter ... would it be appropriate (or
> indeed, effective) for a filter to try to intern the content by
> calling setValue() etc. on the presented Nodes?

I would never do that with content unless it was for values of an
enumeration or some other bounded set of possible values.

Otherwise you're probably flooding the perm gen space and may get:

java.lang.OutOfMemoryError: PermGen space
        at java.lang.String.intern(Native Method)

if you intern() too many strings.
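
A rough sketch of the bounded-set idea, under assumptions: the set of
allowed values is known up front, and an ordinary heap map is used in place
of String.intern() (that substitution is an illustration, not something
Xerces provides). BoundedCanonicalizer and its methods are hypothetical
names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class BoundedCanonicalizer {
    // Maps each allowed value to the single shared instance of it.
    // Filled once in the constructor and never modified afterwards.
    private final Map<String, String> canonical = new HashMap<>();

    BoundedCanonicalizer(Set<String> allowedValues) {
        for (String v : allowedValues) {
            canonical.put(v, v);
        }
    }

    // Returns the shared instance when the value belongs to the bounded
    // set; any other value is returned untouched, so the pool never grows.
    String canonicalize(String value) {
        String shared = canonical.get(value);
        return shared != null ? shared : value;
    }
}
```

Free-form text passes through unchanged, so only the enumerated values are
de-duplicated and the pool size stays fixed regardless of document content.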

> - I considered the "hack" approach of creating a class in the
> org.apache.xerces.dom package in order to manipulate the String
> objects stored within DeferredDocumentImpl directly, but this is
> obviously rather unclean, I'd prefer not to
>
> - It seems based on a cursory inspection that newer versions of
> Xerces DOM use internal string pooling - compare the two JavaDoc
> links below ....
>
> http://www.oxygenxml.com/apidoc/xerces-2_8_0/org/apache/xerces/dom/DeferredDocumentImpl.html
> http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/dom/DeferredDocumentImpl.html
>
> Is or can that string pool be shared across multiple documents /
> parsers, and is it thread safe (or easily sub-classed to become so)?

There is no string pool in Xerces-J 2.x. The second link points to the
Xerces-J 1.x website, so you're looking 10 years into the past.

The current docs for Xerces-J 2.x are here:
http://xerces.apache.org/xerces2-j/api.html

> 2. The fact that the DOM implementation is not thread-safe for reads
> requires it to be cloned for each consumer in our current model,
> which is a fair bit of overhead. I'm wondering if there is a way to
> circumvent this cost.
>
> - I was looking around the web to see if there was a naive DOM
> implementation from another project that is thread-safe for reads,
> that we could perhaps convert the document trees into after parsing
> by Xerces, but I couldn't find anything. Does such a creature exist?
>
> - It seems that the cost of parsing the XML from string format is
> similar to or slightly less than the overhead to clone documents, so
> as a short term tweak we've switched the cache to be XML ... this is
> about 1/3 of the size in memory of the DeferredDocumentImpl so
> it's a no-brainer improvement
>
> - We have a finite and fixed set of XML schemas for these documents
> that ship with the product, and so we've been considering the idea
> of caching JAXB-generated trees instead, on the presumption that
> these are (or can facilely be made to be) thread safe, but for that
> to make sense we'd need to convert at least our major consumer
> module to use that format. One sticking point here is that a number
> of our modules allow user configurable XPath expressions, including
> XPath 2.0 .... JXPath only supports XPath 1.0
>
> Has anyone used the JAXB-style object trees widely with some war
> stories to relate?

Regarding DOM and thread-safety I would suggest that you read this thread
[1] and others like it in the archives. In general JAXB isn't thread-safe
either since DOM is its default representation for skipped wildcard
content.
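
As an illustration of application-level locking around a shared DOM, here
is a minimal sketch. It assumes every access, reads included, goes through
one lock; LockedDocument and its methods are hypothetical names, not part of
Xerces or JAXP.

```java
import java.io.StringReader;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LockedDocument {
    private final Document doc;

    LockedDocument(Document doc) {
        this.doc = doc;
    }

    // Runs the callback while holding this object's monitor, so only one
    // thread touches the tree at a time. Callers should return plain
    // values (strings, numbers), not Node references, from the callback.
    synchronized <T> T read(Function<Document, T> reader) {
        return reader.apply(doc);
    }

    // Convenience factory: parses XML text with the default JAXP parser.
    static LockedDocument parse(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return new LockedDocument(d);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Usage would look something like
shared.read(d -> d.getDocumentElement().getNodeName()). This trades the
per-consumer clone for lock contention, so whether it helps depends on how
long each consumer holds the document.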

> All comments, advice and anecdotal experience most welcome. Thanks in
> advance.
>
> Cheers
> Dave

Thanks.

[1] http://markmail.org/thread/mivj2rtk2gs6d6so

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
