Hi folks,

We are using Xerces 2.8.1 from Oct 2006. There's no reason not to upgrade; it's more along the lines of "if it ain't broke ...."

We have a Java app that collects data and assembles it into XML documents, and a variety of modules that consume the data in that format using tree traversals and XPath over the DOM representations. The master documents are stored in a database as XML blobs. They change a few times per day, but are used 10-20 times per hour. To avoid the overhead of repeatedly fetching and parsing the XML string, the product caches the DOM representations (DeferredDocumentImpl) in a SoftReference cache.

There are a couple of limitations to this approach that constrain scale and performance, and I'd like to get the forum's collective input on them. I'm not a purist, so I'm perfectly happy to have a modest amount of code that depends on proprietary Xerces APIs or even the internals of a particular Xerces version. We have all the basic XML operations wrapped in a utility class where I can do it in one spot, and standards compliance to support theoretical instant portability to another parser is simply not a requirement.

1. We have a lot of content that recurs across different documents (not just element and attribute names, but also text nodes and attribute values), and being able to de-duplicate it via String.intern() would save us about 50% of the memory footprint of the cache. AFAICT interning within the parser is only offered by SAX. I tried some naive code to walk the document structure via the official DOM API and do things like Attr.setValue(Attr.getValue().intern()), but this of course causes the DeferredDocumentImpl to transform itself internally.

- I've read about LSParserFilter. Would it be appropriate (or indeed, effective) for a filter to try to intern the content by calling setValue() etc. on the presented Nodes?

- I considered the "hack" approach of creating a class in the org.apache.xerces.dom package in order to manipulate the String objects stored within DeferredDocumentImpl directly, but this is obviously rather unclean and I'd prefer not to.

- Based on a cursory inspection, it seems that newer versions of the Xerces DOM use internal string pooling. Compare the two JavaDoc links below:

  http://www.oxygenxml.com/apidoc/xerces-2_8_0/org/apache/xerces/dom/DeferredDocumentImpl.html
  http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/dom/DeferredDocumentImpl.html

  Is that string pool shared across multiple documents / parsers, or can it be? And is it thread-safe (or easily sub-classed to become so)?

2. The fact that the DOM implementation is not thread-safe for reads means that, in our current model, the document has to be cloned for each consumer, which is a fair bit of overhead. I'm wondering if there is a way to circumvent this cost.

- I was looking around the web to see if there was a naive DOM implementation from another project that is thread-safe for reads, one that we could perhaps convert the document trees into after parsing by Xerces, but I couldn't find anything. Does such a creature exist?

- It seems that the cost of parsing the XML from string format is similar to, or slightly less than, the overhead of cloning documents, so as a short-term tweak we've switched the cache to hold the XML itself. This is about 1/3 of the in-memory size of the DeferredDocumentImpl, so it's a no-brainer improvement.

- We have a finite and fixed set of XML schemas for these documents that ship with the product, so we've been considering the idea of caching JAXB-generated trees instead, on the presumption that these are (or can easily be made to be) thread-safe. For that to make sense, though, we'd need to convert at least our major consumer module to use that format. One sticking point here is that a number of our modules allow user-configurable XPath expressions, including XPath 2.0, and JXPath only supports XPath 1.0. Has anyone used JAXB-style object trees widely, with some war stories to relate?

All comments, advice and anecdotal experience most welcome.

Thanks in advance.

Cheers
Dave
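
P.S. For reference, the naive interning walk mentioned in point 1 looks roughly like the following. This is a simplified sketch rather than our actual code: the class name DomInterner and its methods are made up for illustration, and it uses only the standard JAXP / org.w3c.dom APIs, so it runs against whichever parser is on the classpath. With a DeferredDocumentImpl, touching every node this way is exactly what triggers the internal transformation described above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Xerces.
class DomInterner {

    // Parse an XML string with the default JAXP parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    // Walk the subtree and re-set every attribute value, text node and
    // CDATA section to its interned copy. Returns the number of values touched.
    static int internValues(Node node) {
        int touched = 0;

        NamedNodeMap attrs = node.getAttributes(); // null unless node is an Element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr attr = (Attr) attrs.item(i);
                attr.setValue(attr.getValue().intern());
                touched++;
            }
        }

        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            node.setNodeValue(node.getNodeValue().intern());
            touched++;
        }

        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            touched += internValues(child);
        }
        return touched;
    }

    public static void main(String[] args) {
        Document doc = parse("<r a=\"x\"><c a=\"x\">hello</c><c>hello</c></r>");
        System.out.println("values touched: " + internValues(doc));
    }
}
```

Whether the DOM implementation actually keeps the interned String instance after setValue() / setNodeValue() is implementation-dependent, which is part of what I'm asking about.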