Hi Jason,

Just a note to say that I haven't forgotten about this issue. The work is
just going a little slower than I would have hoped, as I have to do some
paying work first.
Hopefully I should have finished a test build for you by the end of next
week.

Mike

> -----Original Message-----
> From: Mike Skells
> Sent: Friday 28 February 2003 08:37
> To: Jason Horman
> Cc: [EMAIL PROTECTED]
> Subject: RE: [dom4j-dev] huge dom
>
> Hi,
> I would use the Flyweight if it was not broken (see the thread on
> equals and hashCode), so I have subclassed from that.
>
> The values are part of the node, and the interning process looks for
> nodes which are identical. I have just about finished the code. There
> are document factories which co-ordinate the interning of the leaf
> nodes. The element classes are written, and are constructed by the
> Element handler, which co-ordinates the interning of the attribute
> lists, the content list, and the element itself. There are a number of
> support classes for the custom lists (to reduce size), and a basic
> interner.
>
> I have a couple of bugs to track down this morning, and I have
> finished separating the code from my commercial dependencies, so I
> should ship you a demo jar this pm. I will run some tests on that
> 800Mb XML file you referred to so that I can get some stats. I haven't
> checked that the tree is any good for use yet! But I guess that you
> could try it in your app to see if anything breaks.
>
> > -----Original Message-----
> > From: Jason Horman [mailto:[EMAIL PROTECTED]
> > Sent: Thursday 27 February 2003 23:33
> > To: Mike Skells
> > Subject: RE: [dom4j-dev] huge dom
> >
> > Excellent, that would be great. How do you plan on using
> > flyweight/factories? The nodes I have aren't exact duplicates. The
> > actual attribute names and element names are obviously duplicated,
> > but the values of the attributes will differ. I assumed though that
> > string interning would fix the issue of duplicate names.
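For readers following the interning discussion, here is a minimal sketch of the general idea. It is not Mike's code, which is not included in this thread. It assumes leaf nodes are immutable and define value-based equals and hashCode; the Attr and Interner names are hypothetical and are not dom4j classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical immutable leaf node: an attribute name/value pair. */
final class Attr {
    final String name;
    final String value;

    Attr(String name, String value) {
        this.name = name;
        this.value = value;
    }

    // Value-based equality is what allows identical nodes to be shared.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Attr)) return false;
        Attr other = (Attr) o;
        return name.equals(other.name) && value.equals(other.value);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, value);
    }
}

/** Hypothetical interner: keeps one canonical instance per distinct value. */
final class Interner<T> {
    private final Map<T, T> pool = new HashMap<>();

    /** Returns the pooled instance equal to candidate, adding it if absent. */
    T intern(T candidate) {
        T existing = pool.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    /** Number of distinct instances currently pooled. */
    int size() {
        return pool.size();
    }
}
```

Two attributes with the same name and value then resolve to one shared object, while attributes whose values differ stay separate. Sharing like this is only safe if the pooled nodes are never modified, which fits the read-only restriction mentioned elsewhere in this thread.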
> >
> > -jason
> >
> > -----Original Message-----
> > From: Mike Skells [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 26, 2003 4:45 AM
> > To: Jason Horman
> > Cc: [EMAIL PROTECTED]
> > Subject: RE: [dom4j-dev] huge dom
> >
> > Hi Jason,
> > I am looking at using content handlers and/or document factory
> > modifications to allow for the re-use of some of the nodes. I have
> > run some tests on some large DOMs that I have, and have seen a
> > reduction of 50% - 95% in the size of the DOM. The restrictions are
> > that the DOM is read only, which is not a problem for you I believe,
> > and that the flyweight pattern is used.
> >
> > Once I have a version that I am happy with I can send you some code
> > to try on your XML file. I will be looking to contribute this code
> > once it has stabilised, and I have removed the few minor proprietary
> > classes.
> >
> > Mike
> >
> > > -----Original Message-----
> > > From: Jason Horman [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday 25 February 2003 23:40
> > > To: Mike Skells; Jason Horman
> > > Cc: [EMAIL PROTECTED]
> > > Subject: RE: [dom4j-dev] huge dom
> > >
> > > I can't really send my DOM since it is proprietary company
> > > information.
> > > There are plenty of HUGE XML docs on the web though, such as:
> > >
> > > http://www.cs.washington.edu/research/xmldatasets/www/repository.html/
> > > http://www.cs.washington.edu/research/xmldatasets/www/data/pir/psd7003.xml.gz
> > >
> > > 21,305,818 elements
> > > 103 MB
> > >
> > > I basically was just doing this:
> > >
> > >     SAXReader reader = new SAXReader();
> > >     reader.setStringInternEnabled(true);
> > >     reader.setMergeAdjacentText(true);
> > >     reader.setStripWhitespaceText(true);
> > >
> > >     Document oldArtistDoc = reader.read(inputStream);
> > >
> > > Thanks,
> > > Jason Horman
> > > [EMAIL PROTECTED]
> > >
> > > -----Original Message-----
> > > From: Mike Skells [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, February 25, 2003 10:00 AM
> > > To: Jason Horman
> > > Cc: [EMAIL PROTECTED]
> > > Subject: RE: [dom4j-dev] huge dom
> > >
> > > Hi Jason,
> > > I have a few other ideas that I am testing. Can you send me the
> > > (zipped) XML, and a bit of test code, so that I can check if my
> > > ideas work?
> > >
> > > Mike Skells
> > >
> > > > -----Original Message-----
> > > > From: Jason Horman [mailto:[EMAIL PROTECTED]
> > > > Sent: Thursday 20 February 2003 00:04
> > > > To: 'James Strachan'; [EMAIL PROTECTED]
> > > > Subject: RE: [dom4j-dev] huge dom
> > > >
> > > > Thanks, that trimmed off about 150 MB from memory. Still seems
> > > > large to me, but I suppose the tree is quite large.
> > > >
> > > > I cannot use the "row by row" technique since I need to have a
> > > > DOM available for the massive number of XPath statements and
> > > > sorts that I need to do across the entire document. The document
> > > > is essentially a database dump. I may look into the new BDB XML
> > > > db instead of in-memory in the future.
> > > >
> > > > -jason
> > > >
> > > > -----Original Message-----
> > > > From: James Strachan [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, February 19, 2003 12:26 AM
> > > > To: Jason Horman; [EMAIL PROTECTED]
> > > > Subject: Re: [dom4j-dev] huge dom
> > > >
> > > > First off there's an FAQ entry
> > > >
> > > > http://dom4j.org/faq.html
> > > >
> > > > on "How does dom4j handle very large XML documents?"
> > > >
> > > > http://dom4j.org/faq.html#How%20does%20dom4j%20handle%20very%20large%20XML%20documents?
> > > >
> > > > which essentially means you can process the document in a 'row by
> > > > row' kinda way rather than waiting to load the whole thing in one
> > > > go.
> > > >
> > > > Other flags that might help reduce the overall memory footprint
> > > > are these, which avoid storing unnecessary String or whitespace
> > > > objects...
> > > >
> > > >     SAXReader reader = new SAXReader();
> > > >     reader.setMergeAdjacentText(true);
> > > >     reader.setStringInternEnabled(true);
> > > >     reader.setStripWhitespaceText(true);
> > > >
> > > > James
> > > > -------
> > > > http://radio.weblogs.com/0112098/
> > > >
> > > > ----- Original Message -----
> > > > From: Jason Horman
> > > > To: '[EMAIL PROTECTED]'
> > > > Sent: Friday, February 14, 2003 1:11 AM
> > > > Subject: [dom4j-dev] huge dom
> > > >
> > > > I am using dom4j-1.4-dev-8.jar, the version that came with my
> > > > last maven build of jelly.
> > > >
> > > > My XML document:
> > > >
> > > > 159 MB
> > > > 2,438,791 lines/tags -> 1 tag per line, all attributes
> > > > ~6 attributes per tag
> > > > 4 out of 6 attributes are numeric values, so they are not huge
> > > > strings. Attributes 5 and 6 could probably be interned as well,
> > > > but this would require additional API support.
> > > >
> > > > This document expands to 1100 MB in memory. Could this be right?
> > > > Seems high to me.
> > > > I assume all element names and attribute names are interned. I
> > > > tried to force interning by doing this:
> > > >
> > > >     SAXReader reader = new SAXReader();
> > > >     reader.setFeature("http://xml.org/sax/features/string-interning", true);
> > > >
> > > > Which I think is the default anyway. I am using xerces-2.0.2.jar
> > > > for SAXReader via the system property.
> > > >
> > > > Are things being interned? Are there any other tricks to reducing
> > > > memory consumption?
> > > >
> > > > -jason horman
> > > > [EMAIL PROTECTED]
> > > >
> > > > This email message and any attachments are for the sole use of
> > > > the intended recipient(s) and may contain confidential and
> > > > privileged information. Any unauthorized review, use, disclosure
> > > > or distribution is prohibited. If you are not the intended
> > > > recipient or his/her representative, please contact the sender by
> > > > reply email and destroy all copies of the original message.
_______________________________________________
dom4j-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dom4j-dev