Hi Jason,

I am looking at modifying the content handlers and/or document factories to allow some of the nodes to be re-used. I have run some tests on some large DOMs that I have, and have seen a reduction of 50% - 95% in the size of the DOM. The restrictions are that the DOM is read-only, which I believe is not a problem for you, and that the flyweight pattern is used.
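To illustrate the flyweight idea mentioned above, here is a minimal, self-contained sketch. It is not dom4j code and not the implementation described in this message: the names `FlyweightDemo`, `Node`, and `NodePool` are hypothetical, and it assumes that nodes are immutable and hold no parent reference, so that equal nodes can be shared safely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of flyweight sharing for read-only tree nodes (not dom4j API). */
final class FlyweightDemo {

    /** Immutable node: safe to share because it can never be modified. */
    static final class Node {
        final String name;
        final Map<String, String> attributes;
        final String text;
        final List<Node> children;
        private final int hash; // cached, since the node never changes

        Node(String name, Map<String, String> attributes, String text, List<Node> children) {
            this.name = name;
            this.attributes = Map.copyOf(attributes); // defensive, unmodifiable copies
            this.text = text;
            this.children = List.copyOf(children);
            this.hash = Objects.hash(name, this.attributes, text, this.children);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Node)) return false;
            Node n = (Node) o;
            return hash == n.hash
                    && name.equals(n.name)
                    && attributes.equals(n.attributes)
                    && text.equals(n.text)
                    && children.equals(n.children);
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    /** Pool that hands back one canonical instance per distinct node value. */
    static final class NodePool {
        private final Map<Node, Node> canonical = new HashMap<>();
        int requested = 0; // how many nodes callers asked for

        Node intern(String name, Map<String, String> attributes, String text, List<Node> children) {
            requested++;
            Node candidate = new Node(name, attributes, text, children);
            Node existing = canonical.putIfAbsent(candidate, candidate);
            return existing != null ? existing : candidate;
        }

        int distinct() {
            return canonical.size();
        }
    }

    public static void main(String[] args) {
        NodePool pool = new NodePool();
        List<Node> rows = new ArrayList<>();
        // Simulate a "database dump": many rows, few distinct values.
        for (int i = 0; i < 1000; i++) {
            Node status = pool.intern("status", Map.of("code", "OK"), "", List.of());
            rows.add(pool.intern("row", Map.of("type", String.valueOf(i % 4)), "", List.of(status)));
        }
        System.out.println(pool.requested + " nodes requested, "
                + pool.distinct() + " distinct instances kept");
    }
}
```

The saving in a sketch like this depends entirely on how repetitive the document is. A node that is shared between several parents cannot point back to a single parent, which is one reason the approach only suits a read-only tree.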
Once I have a version that I am happy with I can send you some code to try on your XML file. I will be looking to contribute this code once it has stabilised and I have removed the few minor proprietary classes.

Mike

> -----Original Message-----
> From: Jason Horman [mailto:[EMAIL PROTECTED]
> Sent: Tuesday 25 February 2003 23:40
> To: Mike Skells; Jason Horman
> Cc: [EMAIL PROTECTED]
> Subject: RE: [dom4j-dev] huge dom
>
> I can't really send my DOM since it is proprietary company
> information. There are plenty of HUGE xml docs on the web
> though, such as:
>
> http://www.cs.washington.edu/research/xmldatasets/www/repository.html/
> http://www.cs.washington.edu/research/xmldatasets/www/data/pir/psd7003.xml.gz
>
> 21,305,818 elements
> 103 MB's
>
> I basically was just doing this:
>
> SAXReader reader = new SAXReader();
> reader.setStringInternEnabled(true);
> reader.setMergeAdjacentText(true);
> reader.setStripWhitespaceText(true);
>
> Document oldArtistDoc = reader.read(inputStream);
>
> Thanks,
> Jason Horman
> [EMAIL PROTECTED]
>
> -----Original Message-----
> From: Mike Skells [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 25, 2003 10:00 AM
> To: Jason Horman
> Cc: [EMAIL PROTECTED]
> Subject: RE: [dom4j-dev] huge dom
>
> Hi, Jason
> I have a few other ideas that I am testing. Can you send me
> the (zipped) xml, and a bit of test code so that I can check
> if my ideas work?
>
> Mike Skells
>
> > -----Original Message-----
> > From: Jason Horman [mailto:[EMAIL PROTECTED]
> > Sent: Thursday 20 February 2003 00:04
> > To: 'James Strachan'; [EMAIL PROTECTED]
> > Subject: RE: [dom4j-dev] huge dom
> >
> > Thanks, that trimmed off about 150 mb's from memory. Still
> > seems large to me, but I suppose the tree is quite large.
> >
> > I cannot use the "row by row" technique since I need to have
> > a dom available for the massive number of xpath statements
> > and sorts that I need to do across the entire document.
> > The document is essentially a database dump. I may look into the
> > new BDB XML db instead of in-memory in the future.
> >
> > -jason
> >
> > -----Original Message-----
> > From: James Strachan [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 19, 2003 12:26 AM
> > To: Jason Horman; [EMAIL PROTECTED]
> > Subject: Re: [dom4j-dev] huge dom
> >
> > First off there's an FAQ entry
> >
> > http://dom4j.org/faq.html
> >
> > on How does dom4j handle very large XML documents?
> >
> > http://dom4j.org/faq.html#How%20does%20dom4j%20handle%20very%20large%20XML%20documents?
> >
> > which essentially means you can process the document in a
> > 'row by row' kinda way rather than waiting to load the whole
> > thing in one go.
> >
> > Other flags that might help reduce the overall memory
> > footprint are these, which avoid storing unnecessary String
> > or whitespace objects...
> >
> > SAXReader reader = new SAXReader();
> > reader.setMergeAdjacentText(true);
> > reader.setStringInternEnabled(true);
> > reader.setStripWhitespaceText(true);
> >
> > James
> > -------
> > http://radio.weblogs.com/0112098/
> >
> > ----- Original Message -----
> > From: Jason Horman
> > To: '[EMAIL PROTECTED]'
> > Sent: Friday, February 14, 2003 1:11 AM
> > Subject: [dom4j-dev] huge dom
> >
> > I am using dom4j-1.4-dev-8.jar, the version that came with my
> > last maven build of jelly.
> >
> > My xml document:
> >
> > 159 mbs
> > 2,438,791 lines/tags -> 1 tag per line, all attributes
> > ~6 attributes per tag
> > 4 out of 6 attributes are numeric values, so they are not
> > huge strings. Attributes 5 and 6 could probably be interned
> > as well, but this would require additional api support.
> >
> > This document expands to 1100mb's in memory. Could this be
> > right? Seems high to me. I assume all element names and
> > attribute names are interned.
> > I tried to force interning by doing this:
> >
> > SAXReader reader = new SAXReader();
> > reader.setFeature("http://xml.org/sax/features/string-interning", true);
> >
> > Which I think is the default anyway. I am using
> > xerces-2.0.2.jar for SAXReader via the system property.
> >
> > Are things being interned? Are there any other tricks to
> > reducing memory consumption?
> >
> > -jason horman
> > [EMAIL PROTECTED]
> >
> > This email message and any attachments are for the sole use
> > of the intended recipient(s) and may contain confidential and
> > privileged information. Any unauthorized review, use, disclosure or
> > distribution is prohibited. If you are not the intended
> > recipient or his/her representative, please contact the
> > sender by reply email and destroy all copies of the original message.
> >
> > _______________________________________________
> > dom4j-dev mailing list
> > [EMAIL PROTECTED]
> > https://lists.sourceforge.net/lists/listinfo/dom4j-dev