Oh... Can your code work with just SAX? If you don't need the Document object, I'd think it'd be much faster to just use SAX directly.
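To make the SAX suggestion concrete, here is a rough sketch of what a SAX-only version of the guid-info handler might look like. It is not tested against your data. The class name GuidInfoSax and the method names extract and parseOrZero are made up for illustration, and colID, delimiter and the writer are passed in as parameters because I don't know how they are declared in your code. It matches elements by name only (it does not check the full /n-extract-response/guid-info path), and it treats an unparsable size as 0 independently for each field, which is slightly different from your single try block.

```java
import java.io.InputStream;
import java.io.PrintWriter;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX-only replacement for the dom4j ElementHandler; names are illustrative. */
final class GuidInfoSax {

    /** Writes one "colID + delimiter + guid + delimiter + totalSize" line per guid-info element. */
    static void extract(InputStream xml, final String colID, final String delimiter,
                        final PrintWriter out) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String guid = "";
            private int docSize;
            private int metaSize;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // begin collecting character data for this element
                if ("guid-info".equals(qName)) {
                    // reset per-record state; nothing from earlier records is retained
                    guid = "";
                    docSize = 0;
                    metaSize = 0;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // may be called several times for one text node, so append
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if ("guid".equals(qName)) {
                    guid = value;
                } else if ("size".equals(qName)) {
                    docSize = parseOrZero(value);
                } else if ("metadatasize".equals(qName)) {
                    metaSize = parseOrZero(value);
                } else if ("guid-info".equals(qName)) {
                    out.println(colID + delimiter + guid + delimiter + (docSize + metaSize));
                }
                text.setLength(0);
            }
        };
        // The default factory is not namespace-aware, so qName carries the element name.
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
    }

    /** Returns 0 when the text is not a valid integer. */
    private static int parseOrZero(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException nfe) {
            return 0;
        }
    }
}
```

Since no Document or Element objects are built, there is nothing to detach, and the only state kept between records is a few fields. You would call it with something like GuidInfoSax.extract(new FileInputStream(nxoGuidsFile), colID, delimiter, writer), assuming nxoGuidsFile is a File or a path.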

Wali Ansary wrote:
> I've actually tweaked it to the extent that I left the onEnd() method
> totally empty, except for the detach() method. As far as I understand,
> the detach() method is preventing the 'detached' element from being a
> part of the resulting Document object. That's why this code consumes
> hardly any memory (I didn't have to change the default HotSpot
> settings to process a 676 MB file).
>
> I will rerun the code with your suggestions and follow up.
>
> Thanks
> -Wali
>
>
>
> ----Original Message Follows----
> From: Evan Kirkconnell <[EMAIL PROTECTED]>
> To: dom4j-user@lists.sourceforge.net
> Subject: Re: [dom4j-user] Huge file, ElementHandler, and performance woes
> Date: Mon, 12 Feb 2007 08:50:13 -0600
>
> It'd be good to make sure the slowness isn't due to memory issues and
> garbage collection. Here's some code that I've used in some speed
> tests.(got it off the Java forum or google groups I think) I'd recommend
> stopping your timer after each 10000, running the gc, showing the used
> memory, Thread.sleep() for a bit, then starting the timer, and moving on
> to the next series. Might also want to play around with how much memory
> is allocated to the VM, and see if it shifts your numbers. Seems like it
> was 6 milliseconds for a while, then started to jump up. Wasn't a smooth
> curve, which makes me suspicious.
>
> Also, have you tried tweaking it at all? I'd recommend trying some stuff
> like taking the int declarations out of the loop, doing writer.println
> for each string separately instead of appending them, maybe removing the
> .detach()(I don't really know much about what that means in SAX though).
>
> // Shared Runtime handle used by the helpers below
> private static final Runtime s_runtime = Runtime.getRuntime ();
>
> private static void runGC () throws Exception{
>     // It helps to call Runtime.gc()
>     // using several method calls:
>     for (int r = 0; r < 4; ++ r) _runGC ();
> }
>
> private static void _runGC () throws Exception{
>     long usedMem1 = usedMemory (), usedMem2 = Long.MAX_VALUE;
>     // Repeat until used memory stops shrinking (at most 500 passes)
>     for (int i = 0; (usedMem1 < usedMem2) && (i < 500); ++ i){
>         s_runtime.runFinalization ();
>         s_runtime.gc ();
>         Thread.yield ();
>
>         usedMem2 = usedMem1;
>         usedMem1 = usedMemory ();
>     }
> }
>
> private static void showUsedMemory(){
>     long l = usedMemory();
>     System.out.println("Used memory: "+l);
> }
>
> private static long usedMemory (){
>     return s_runtime.totalMemory () - s_runtime.freeMemory ();
> }
>
> Wali Ansary wrote:
> > Folks,
> >
> > I'm baffled.
> >
> > I've been an avid user of dom4j for a while, and have used the
> > below-mentioned strategy successfully ever since to process/transform
> > huge XML files without consuming much memory. However, this new code I
> > have appears to be getting gradually slower. I'm not sure if I'm
> > missing anything.
> >
> > Here are the timings per 10000 elements processed. Note that I want to
> > process documents that have over 5 million of these elements:
> >
> > 10000: 3
> > 20000: 9
> > 30000: 15
> > 40000: 21
> > 50000: 27
> > 60000: 33
> > 70000: 39
> > 80000: 45
> > 90000: 56
> > 100000: 87
> > 110000: 158
> >
> >
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> >
> > // Create a reader and add an Element handler to efficiently iterate
> > // through the large document
> > SAXReader reader = new SAXReader();
> > reader.addHandler("/n-extract-response/guid-info",
> >     new ElementHandler() {
> >
> >         public void onStart(ElementPath path) {
> >             // do nothing
> >         }
> >
> >         public void onEnd(ElementPath path) {
> >             // Get the guid of the document
> >             Element guidInfoElement = path.getCurrent();
> >             String guid = guidInfoElement.valueOf("guid");
> >
> >             // Get the document size, and possibly metadata size
> >             int totalSize = 0, docSize = 0, metaSize = 0;
> >             try {
> >                 docSize = Integer.parseInt(guidInfoElement
> >                         .valueOf("size"));
> >                 metaSize = Integer.parseInt(guidInfoElement
> >                         .valueOf("metadatasize"));
> >             } catch (NumberFormatException nfe) {
> >                 // do nothing
> >             }
> >
> >             // print as line
> >             totalSize = docSize + metaSize;
> >             writer.println(colID + delimiter + guid + delimiter
> >                     + totalSize);
> >
> >             // for debugging purposes, track how long it takes per
> >             // 10000 guid-info elements
> >             count++;
> >             if (count % 10000 == 0) {
> >                 end = System.currentTimeMillis();
> >                 System.out.println(count + ": "
> >                         + ((end - start) / 1000));
> >                 start = System.currentTimeMillis();
> >             }
> >
> >             // make sure to detach to save memory
> >             guidInfoElement.detach();
> >         }
> >     });
> >
> > // Set the start time, and begin reading
> > start = System.currentTimeMillis();
> > reader.read(nxoGuidsFile);
> > writer.close();
> >
> > >>>>>>>>>>>>>>>>>>>
> >
> > You can argue that the string-concatenation and/or Integer parsing is
> > taking up time, but it doesn't explain the gradual increase in the
> > timings.
> >
> > I've tried compiling and running in both Java 1.4, 1.5 with various
> > compilation settings, but to no avail.
> >
> > Help!
> >
> > Thanks
> > -Wali

_______________________________________________
dom4j-user mailing list
dom4j-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dom4j-user