Oh... Can your code work with just SAX? If you don't need the Document 
object, I'd think it'd be much faster to just use SAX directly.
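
For what it's worth, here is a rough sketch of what "just SAX" could look like for this job, using only the JDK's built-in SAX parser. The class name GuidSaxSketch, the extract() method and the Consumer-based sink are made up for illustration; the element names (guid-info, guid, size, metadatasize) are taken from the code quoted below. It assumes a reasonably recent Java and treats unparsable sizes as 0, which differs slightly from the original try/catch.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class GuidSaxSketch {

    /** Emits one delimited line per guid-info element without building a tree. */
    static class GuidHandler extends DefaultHandler {
        private final String colId;
        private final String delimiter;
        private final Consumer<String> sink;
        private final StringBuilder text = new StringBuilder();
        private String guid = "";
        private int docSize;
        private int metaSize;

        GuidHandler(String colId, String delimiter, Consumer<String> sink) {
            this.colId = colId;
            this.delimiter = delimiter;
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            if ("guid-info".equals(qName)) {
                // Reset per-record state at the start of each record.
                guid = "";
                docSize = 0;
                metaSize = 0;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so accumulate.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            switch (qName) {
                case "guid": guid = value; break;
                case "size": docSize = parseOrZero(value); break;
                case "metadatasize": metaSize = parseOrZero(value); break;
                case "guid-info":
                    sink.accept(colId + delimiter + guid + delimiter + (docSize + metaSize));
                    break;
                default: break;
            }
            text.setLength(0);
        }

        private static int parseOrZero(String s) {
            try {
                return Integer.parseInt(s);
            } catch (NumberFormatException nfe) {
                return 0; // assumption: a missing or bad number counts as 0
            }
        }
    }

    /** Parses the source and passes each output line to the sink as it is produced. */
    public static void extract(InputSource source, String colId, String delimiter,
                               Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(source, new GuidHandler(colId, delimiter, sink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<n-extract-response>"
                + "<guid-info><guid>abc</guid><size>10</size><metadatasize>5</metadatasize></guid-info>"
                + "<guid-info><guid>def</guid><size>7</size></guid-info>"
                + "</n-extract-response>";
        extract(new InputSource(new StringReader(xml)), "col1", "|", System.out::println);
    }
}
```

For a real file you would pass something like writer::println as the sink and an InputSource opened on the file, so nothing is held in memory beyond the current record.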

Wali Ansary wrote:
> I've actually tweaked it to the extent that I left the onEnd() method 
> totally empty, except for the detach() method. As far as I understand, 
> the detach() method is preventing the 'detached' element from being a 
> part of the resulting Document object. That's why this code consumes 
> hardly any memory (I didn't have to change the default HotSpot 
> settings to process a 676 MB file).
>
> I will rerun the code with your suggestions and follow up.
>
> Thanks
> -Wali
>
>
>
> ----Original Message Follows----
> From: Evan Kirkconnell <[EMAIL PROTECTED]>
> To: dom4j-user@lists.sourceforge.net
> Subject: Re: [dom4j-user] Huge file, ElementHandler, and performance woes
> Date: Mon, 12 Feb 2007 08:50:13 -0600
>
> It'd be good to make sure the slowness isn't due to memory issues and
> garbage collection. Here's some code that I've used in some speed
> tests (got it off the Java forum or Google Groups, I think). I'd recommend
> stopping your timer after each 10000, running the gc, showing the used
> memory, Thread.sleep() for a bit, then starting the timer, and moving on
> to the next series. Might also want to play around with how much memory
> is allocated to the VM, and see if it shifts your numbers. Seems like it
> was 6 milliseconds for a while, then started to jump up. Wasn't a smooth
> curve, which makes me suspicious.
>
> Also, have you tried tweaking it at all? I'd recommend trying some stuff
> like taking the int declarations out of the loop, doing writer.println
> for each string separately instead of appending them, maybe removing the
> .detach() (I don't really know much about what that means in SAX, though).
>
> // Assumed declaration: the snippet uses s_runtime without defining it.
> private static final Runtime s_runtime = Runtime.getRuntime();
>
> private static void runGC() throws Exception {
>     // It helps to call Runtime.gc()
>     // using several method calls:
>     for (int r = 0; r < 4; ++r) _runGC();
> }
>
> private static void _runGC() throws Exception {
>     long usedMem1 = usedMemory(), usedMem2 = Long.MAX_VALUE;
>     // Repeat until used memory stops shrinking, or 500 passes at most.
>     for (int i = 0; (usedMem1 < usedMem2) && (i < 500); ++i) {
>         s_runtime.runFinalization();
>         s_runtime.gc();
>         Thread.yield(); // yield() is static, so call it on the class
>
>         usedMem2 = usedMem1;
>         usedMem1 = usedMemory();
>     }
> }
>
> private static void showUsedMemory() {
>     long l = usedMemory();
>     System.out.println("Used memory: " + l);
> }
>
> // Heap currently in use, in bytes.
> private static long usedMemory() {
>     return s_runtime.totalMemory() - s_runtime.freeMemory();
> }
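
A minimal sketch of the measuring routine described above: stop the timer at each batch boundary, collect garbage, report used memory, pause briefly, then restart the timer. BatchTimer and tick() are hypothetical names, and a single Runtime.gc() call stands in for the more thorough runGC() helper from the message.

```java
public class BatchTimer {
    private static final Runtime RUNTIME = Runtime.getRuntime();

    private final int batchSize;
    private final long pauseMillis;
    private long count;
    private long batchStartNanos;

    public BatchTimer(int batchSize, long pauseMillis) {
        this.batchSize = batchSize;
        this.pauseMillis = pauseMillis;
        this.batchStartNanos = System.nanoTime();
    }

    /**
     * Call once per processed element. Returns the elapsed milliseconds for
     * the batch that just completed, or -1 if the batch is still in progress.
     */
    public long tick() throws InterruptedException {
        count++;
        if (count % batchSize != 0) {
            return -1;
        }
        // Stop the clock first so the GC and the pause are not measured.
        long elapsedMillis = (System.nanoTime() - batchStartNanos) / 1_000_000;

        RUNTIME.gc(); // only a hint to the JVM
        long used = RUNTIME.totalMemory() - RUNTIME.freeMemory();
        System.out.println(count + ": " + elapsedMillis + " ms, used memory: " + used + " bytes");

        Thread.sleep(pauseMillis);
        batchStartNanos = System.nanoTime(); // restart the clock for the next batch
        return elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        BatchTimer timer = new BatchTimer(10000, 100);
        StringBuilder scratch = new StringBuilder();
        for (int i = 0; i < 30000; i++) {
            // Stand-in for the real per-element work.
            scratch.setLength(0);
            scratch.append("col").append('|').append(i);
            timer.tick();
        }
    }
}
```

In the handler, the count/start/end bookkeeping in onEnd() would be replaced by a single timer.tick() call. If used memory stays flat between batches while the batch time still climbs, garbage collection is probably not the culprit.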
>
> Wali Ansary wrote:
> > Folks,
> >
> > I'm baffled.
> >
> > I've been an avid user of dom4j for a while, and have used the
> > below-mentioned strategy successfully ever since to process/transform
> > huge XML files without consuming much memory. However, this new code I
> > have appears to be getting gradually slower. I'm not sure if I'm
> > missing anything.
> >
> > Here are the timings (in seconds) per 10000 elements processed. Note that I want to
> > process documents that have over 5 million of these elements:
> >
> > 10000: 3
> > 20000: 9
> > 30000: 15
> > 40000: 21
> > 50000: 27
> > 60000: 33
> > 70000: 39
> > 80000: 45
> > 90000: 56
> > 100000: 87
> > 110000: 158
> >
> >
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >
> > // Create a reader and add an Element handler to efficiently iterate
> > // through the large document
> > SAXReader reader = new SAXReader();
> > reader.addHandler("/n-extract-response/guid-info",
> >     new ElementHandler() {
> >
> >         public void onStart(ElementPath path) {
> >             // do nothing
> >         }
> >
> >         public void onEnd(ElementPath path) {
> >             // Get the guid of the document
> >             Element guidInfoElement = path.getCurrent();
> >             String guid = guidInfoElement.valueOf("guid");
> >
> >             // Get the document size, and possibly metadata size
> >             int totalSize = 0, docSize = 0, metaSize = 0;
> >             try {
> >                 docSize = Integer.parseInt(guidInfoElement
> >                         .valueOf("size"));
> >                 metaSize = Integer.parseInt(guidInfoElement
> >                         .valueOf("metadatasize"));
> >             } catch (NumberFormatException nfe) {
> >                 // do nothing
> >             }
> >
> >             // print as line
> >             totalSize = docSize + metaSize;
> >             writer.println(colID + delimiter + guid + delimiter
> >                     + totalSize);
> >
> >             // for debugging purposes, track how long it takes per
> >             // 10000 guid-info elements
> >             count++;
> >             if (count % 10000 == 0) {
> >                 end = System.currentTimeMillis();
> >                 System.out.println(count + ": "
> >                         + ((end - start) / 1000));
> >                 start = System.currentTimeMillis();
> >             }
> >
> >             // make sure to detach to save memory
> >             guidInfoElement.detach();
> >         }
> >     });
> >
> > // Set the start time, and begin reading
> > start = System.currentTimeMillis();
> > reader.read(nxoGuidsFile);
> > writer.close();
> >
> >>>>>>>>>>>>>>>>>>>
> >
> > You can argue that the string concatenation and/or Integer parsing is
> > taking up time, but that doesn't explain the gradual increase in the
> > timings.
> >
> > I've tried compiling and running in both Java 1.4 and 1.5 with various
> > compilation settings, but to no avail.
> >
> > Help!
> >
> > Thanks
> > -Wali
> >
> > _______________________________________________
> > dom4j-user mailing list
> > dom4j-user@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dom4j-user
> >
>

