Hi Jason,
I am looking at modifying the content handlers and/or document factories
to allow some of the nodes to be re-used. I have run some tests on some
large DOMs that I have, and have seen a reduction of 50% - 95% in the
size of the DOM. The restrictions are that the DOM is read-only, which I
believe is not a problem for you, and that the flyweight pattern is
used.
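
As a rough illustration of the flyweight idea, here is a minimal sketch in
plain Java. It is not dom4j code and not the implementation described
above: SharedNode and FlyweightNodePool are hypothetical names, and the
sketch assumes nodes are immutable and hold no parent pointer, which is
what makes it safe for one instance to appear in many places in a
read-only tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical immutable node (not a dom4j class). No parent pointer, so it can be shared. */
final class SharedNode {
    final String name;
    final Map<String, String> attributes;
    final String text;
    final List<SharedNode> children;
    private final int hash; // cached, since a node never changes after construction

    SharedNode(String name, Map<String, String> attributes, String text, List<SharedNode> children) {
        this.name = name;
        // Defensive copies keep the node immutable even if the caller mutates its arguments.
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
        this.text = text;
        this.children = Collections.unmodifiableList(new ArrayList<>(children));
        this.hash = Objects.hash(name, this.attributes, text, this.children);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true; // common case when children are already pooled
        if (!(o instanceof SharedNode)) return false;
        SharedNode other = (SharedNode) o;
        return hash == other.hash
                && name.equals(other.name)
                && Objects.equals(text, other.text)
                && attributes.equals(other.attributes)
                && children.equals(other.children);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}

/** Flyweight pool: hands back one canonical instance per distinct node value. */
final class FlyweightNodePool {
    private final Map<SharedNode, SharedNode> pool = new HashMap<>();

    /** Build nodes bottom-up through this method so that equal subtrees collapse into one object. */
    SharedNode intern(String name, Map<String, String> attributes, String text, List<SharedNode> children) {
        SharedNode candidate = new SharedNode(name, attributes, text, children);
        SharedNode existing = pool.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    /** Number of distinct nodes actually held in memory. */
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        FlyweightNodePool pool = new FlyweightNodePool();
        Map<String, String> attrs = Collections.singletonMap("type", "album");
        List<SharedNode> none = Collections.emptyList();
        SharedNode a = pool.intern("item", attrs, "42", none);
        SharedNode b = pool.intern("item", attrs, "42", none);
        System.out.println("same instance reused: " + (a == b));
        System.out.println("distinct nodes stored: " + pool.size());
    }
}
```

How much this saves depends entirely on how repetitive the document is.
The sketch also gives up anything that needs a parent reference on the
node itself, so upward navigation would have to be handled some other way.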

Once I have a version that I am happy with I can send you some code to
try on your XML file. I will be looking to contribute this code once it
has stabilised and I have removed the few minor proprietary classes.

Mike
> -----Original Message-----
> From: Jason Horman [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday 25 February 2003 23:40
> To: Mike Skells; Jason Horman
> Cc: [EMAIL PROTECTED]
> Subject: RE: [dom4j-dev] huge dom
> 
> 
> I can't really send my DOM since it is proprietary company 
> information. There are plenty of HUGE xml docs on the web 
> though, such as:
> 
> http://www.cs.washington.edu/research/xmldatasets/www/repository.html/
> http://www.cs.washington.edu/research/xmldatasets/www/data/pir/psd7003.xml.gz
> 
> 21,305,818 elements
> 103 MB
> 
> I basically was just doing this:
> 
>     SAXReader reader = new SAXReader();
>     reader.setStringInternEnabled(true);
>     reader.setMergeAdjacentText(true);
>     reader.setStripWhitespaceText(true);
>
>     Document oldArtistDoc = reader.read(inputStream);
> 
> 
> Thanks,
> Jason Horman
> [EMAIL PROTECTED]
> 
> -----Original Message-----
> From: Mike Skells [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 25, 2003 10:00 AM
> To: Jason Horman
> Cc: [EMAIL PROTECTED]
> Subject: RE: [dom4j-dev] huge dom
> 
> 
> Hi, Jason
> I have a few other ideas that I am testing. Can you send me 
> the (zipped) XML and a bit of test code so that I can check 
> if my ideas work?
> 
> Mike Skells
> 
> > -----Original Message-----
> > From: Jason Horman [mailto:[EMAIL PROTECTED]
> > Sent: Thursday 20 February 2003 00:04
> > To: 'James Strachan'; [EMAIL PROTECTED]
> > Subject: RE: [dom4j-dev] huge dom
> > 
> > 
> > Thanks, that trimmed off about 150 MB from memory. Still
> > seems large to me, but I suppose the tree is quite large.
> > 
> > I cannot use the "row by row" technique since I need to have
> > a dom available for the massive number of xpath statements 
> > and sorts that I need to do across the entire document. The 
> > document is essentially a database dump. I may look into the 
> > new BDB XML db instead of in-memory in the future.
> > 
> > -jason
> > 
> > -----Original Message-----
> > From: James Strachan [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 19, 2003 12:26 AM
> > To: Jason Horman; [EMAIL PROTECTED]
> > Subject: Re: [dom4j-dev] huge dom
> > 
> > 
> > 
> > First off there's an FAQ entry
> > 
> > http://dom4j.org/faq.html
> > 
> > on How does dom4j handle very large XML documents?
> > 
> > http://dom4j.org/faq.html#How%20does%20dom4j%20handle%20very%20large%20XML%20documents?
> > 
> > which essentially means you can process the document in a
> > 'row by row' kinda way rather than waiting to load the whole 
> > thing in one go.
> > 
> > 
> > Other flags that might help reduce the overall memory
> > footprint are these, which avoid storing unnecessary String 
> > or whitespace objects...
> > 
> > SAXReader reader = new SAXReader();
> > reader.setMergeAdjacentText(true);
> > reader.setStringInternEnabled(true);
> > reader.setStripWhitespaceText(true);
> > 
> > James
> > -------
> > http://radio.weblogs.com/0112098/
> > ----- Original Message -----
> > From: Jason Horman
> > To: '[EMAIL PROTECTED]'
> > Sent: Friday, February 14, 2003 1:11 AM
> > Subject: [dom4j-dev] huge dom
> > 
> > 
> > I am using dom4j-1.4-dev-8.jar, the version that came with my
> > last maven build of jelly.
> > 
> > My xml document:
> > 
> > 159 MB
> > 2,438,791 lines/tags -> 1 tag per line, all attributes
> > ~6 attributes per tag
> > 4 out of 6 attributes are numeric values, so they are not
> > huge strings. Attributes 5 and 6 could probably be interned 
> > as well, but this would require additional api support.
> > 
> > This document expands to 1100 MB in memory. Could this be
> > right? Seems high to me. I assume all element names and 
> > attribute names are interned. I tried to force interning by 
> > doing this:
> > 
> >         SAXReader reader = new SAXReader();
> >         reader.setFeature("http://xml.org/sax/features/string-interning", true);
> > 
> > Which I think is the default anyway. I am using
> > xerces-2.0.2.jar for SAXReader via the system property.
> > 
> > Are things being interned? Are there any other tricks to
> > reducing memory consumption?
> > 
> > -jason horman
> >  [EMAIL PROTECTED]
> > This email message and any attachments are for the sole use
> > of the intended
> > recipient(s) and may contain confidential and privileged 
> > information. Any unauthorized review, use, disclosure or 
> > distribution is prohibited. If you are not the intended 
> > recipient or his/her representative, please contact the 
> > sender by reply email and destroy all copies of the 
> > original message.


_______________________________________________
dom4j-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dom4j-dev