DOM footprint

Ken Geis Fri, 11 Nov 2005 08:05:32 -0800

Earlier this year, I was working at a company where we were working with some large XML documents. Parsing and transforming a 40M XML document was using up all of the memory we had. I thought that it would be good to look into how Xerces' footprint could be improved.

Just the other day, I started writing a memory profiling tool that I had envisioned. I looked at what is in the DOM objects, and I found that one thing I couldn't justify was


        StringBuffer fBufferStr;

defined in org.apache.xerces.dom.ChildNode. It is documented simply here:


http://svn.apache.org/viewcvs.cgi?rev=319759&view=rev

The reference takes up 4 bytes (in a 32-bit JVM) which ends up being about 7% of the footprint of a class like ElementNSImpl or 13% of the footprint of CDATASectionImpl.

I've found this attribute used only in two places to implement DOM Level 3 functionality, so it seems to me that it punishes everyone who doesn't use that. I've done a little benchmarking using XMLBench (http://www.sosnoski.com/opensrc/xmlbench/) and found that if I revert the patch, it saves somewhere between 1.7% and 3.4% on memory, mostly around 2.5%. Not a lot, but a few percent here and there helps.

It gets more interesting though. Hanging on to a StringBuffer like this leads to problems that can be illustrated by a pathological case. Imagine an XML file with a 1M text node that's 1000 nodes deep in the tree. Though this file may only be a little bigger than 1M, the referenced StringBuffers would use a gigabyte of memory if you were to traverse the tree and call getTextContent() at each node.
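
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The class and method names are hypothetical, and it assumes each ancestor's retained buffer ends up holding the full text of the deep node. It also assumes 2 bytes per char for the byte figure, which depends on the JVM's string representation.

```java
public class RetainedBufferEstimate {

    /**
     * Characters retained if each of `depth` ancestors keeps its own
     * buffer holding a copy of a text node of `textLength` characters.
     * (Hypothetical helper; ignores any unused buffer capacity.)
     */
    public static long retainedChars(long textLength, int depth) {
        return textLength * depth;
    }

    public static void main(String[] args) {
        long textLength = 1L << 20; // the 1M text node
        int depth = 1000;           // nodes above it in the tree

        long chars = retainedChars(textLength, depth);
        System.out.println(chars + " chars retained across ancestors");
        // Assumption: 2 bytes per char in the backing array.
        System.out.println((chars * 2) + " bytes at 2 bytes per char");
    }
}
```

So the gigabyte figure counts characters. Measured in bytes it could be about twice that, and more if the buffers have grown past their contents.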

I recommend that this change be reverted. If someone wants to send me some cases that illustrate the performance improvement from reusing the StringBuffer, I would try to implement some compromise between memory and CPU usage. At the least, these StringBuffers should be held by soft references to keep them from using up all of the memory.
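
A minimal sketch of what holding the buffer by a soft reference could look like. This is not the Xerces code. SoftBufferHolder and acquire() are names made up for illustration, and only java.lang.ref.SoftReference and StringBuffer are real APIs here.

```java
import java.lang.ref.SoftReference;

public class SoftBufferHolder {

    // The garbage collector may clear this reference under memory
    // pressure, so a large buffer is not pinned for the node's lifetime.
    private SoftReference<StringBuffer> bufferRef;

    /** Returns an empty buffer, reusing the previous one if it survived. */
    public StringBuffer acquire() {
        StringBuffer buf = (bufferRef != null) ? bufferRef.get() : null;
        if (buf == null) {
            // Never created, or reclaimed by the collector: start over.
            buf = new StringBuffer();
            bufferRef = new SoftReference<StringBuffer>(buf);
        } else {
            buf.setLength(0); // keep the capacity, drop the old contents
        }
        return buf;
    }

    public static void main(String[] args) {
        SoftBufferHolder holder = new SoftBufferHolder();
        StringBuffer first = holder.acquire();
        first.append("some text content");
        StringBuffer second = holder.acquire();
        System.out.println("reused: " + (first == second));
        System.out.println("length after reuse: " + second.length());
    }
}
```

A per-node field like this would still cost one reference per node, so it only addresses the retention problem, not the base footprint.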

I found it quite amusing that in running XMLBench, it required 211M of heap in order to benchmark a 3M log file without getting an OutOfMemoryError. So there are clearly some inefficiencies not only in DOM representation but in parsing. So I have some other memory issues to deal with, but let's start here.



Ken Geis


