Thanks for the reply, Christian.
I have now had a chance to profile my program. It seems that the entry.detach() call is the problem. At the beginning of the run it accounted for approximately 40% of the CPU time, increasing steadily to reach 80% within an hour. My 'own' code accounts for less than 10%.
I must say that the individual entries (/root/entry) can be quite large - up to 4,000 lines each. The average size does not increase, though, so the 'bottom' entries are no larger than the 'top' entries. Having large entries of course means that dom4j builds quite a big subtree for each entry, but I can't understand why it takes so much time to prune it.
For now it seems my only option is to chop my input file into pieces and parse them individually.
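One thing I might try first - just a guess on my part, and I haven't verified it against the dom4j internals - is whether the whitespace text nodes between the entries stay attached to the root even though the entries themselves are detached; if so, the root's content list keeps growing and every detach() has to search a longer and longer list. A minimal sketch of clearing everything under the parent instead of only detaching the entry:

    saxReader.addHandler("/root/entry", new ElementHandler() {
        public void onStart(ElementPath path) {}

        public void onEnd(ElementPath path) {
            Element entry = path.getCurrent();
            processEntry(entry);
            // Clear everything that has accumulated under the parent (the
            // root element), including the text nodes between entries,
            // rather than detaching only the entry itself.
            entry.getParent().clearContent();
        }
    });

If that makes no difference, I'll fall back to splitting the file.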
Cheers
Peter
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 16 December 2005 10:09
To: Peter Venø; dom4j-user@lists.sourceforge.net
Subject: SV: [dom4j-user] Parsing large files
Hi
One of the systems I'm working on parses files of about the same size you mention, using exactly the technique you describe. This works without any problem.
The only problems we have had with this were in code outside the "parsing part", e.g. creating too many objects so that the garbage collector becomes a bottleneck, and so on.
We greatly increased our performance by optimizing the parsing of the XML and only retrieving the elements we really needed.
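Very roughly, the pattern looks like this (a simplified illustration rather than our actual code; the field name is just an example):

    SAXReader saxReader = new SAXReader();
    saxReader.addHandler("/root/entry", new ElementHandler() {
        public void onStart(ElementPath path) {}

        public void onEnd(ElementPath path) {
            Element entry = path.getCurrent();
            // Read only the child elements that are actually used
            // downstream instead of walking the whole subtree.
            String name = entry.elementText("name");
            // ... hand the extracted values on ...
            entry.detach();
        }
    });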
Good luck!
Cheers
Christian
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Peter Venø
Sent: 16 December 2005 09:59
To: dom4j-user@lists.sourceforge.net
Subject: [dom4j-user] Parsing large files

Hello all,

I have just joined the list, so forgive me if these questions have been asked before.

I'm parsing large XML files using dom4j's event system:

private void addEntryHandler(SAXReader saxReader) {
saxReader.addHandler( "/root/entry",
new ElementHandler() {
public void onStart(ElementPath path) {}
public void onEnd(ElementPath path) {
Element entry = path.getCurrent();
processEntry(entry);
entry.detach();
}
}
);
}

This, I believe, is the standard way. The processEntry(entry) method extracts info by means of

    entry.element("name").getText()

Now, this works well, but the time used to parse the records increases linearly as the parsing progresses. The memory consumption of the parser increases, but only slightly. I can parse the first 10,000 records in approximately 10 seconds, whereas parsing entries 2,160,000 to 2,170,000 takes more than 5 minutes.

According to this article, http://www.devx.com/Java/Article/29161/0/page/2, parsing of 'extremely large files' should not be a problem. However, my file is significantly larger than the 'extremely large' file used in the article (14 MB). It is ludicrously large: approximately 850 MB, gzipped.

Have any of you experienced similar problems with parsing large files? Any input is appreciated.

Thanks
Peter
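P.S. To be concrete, processEntry is essentially of this shape (simplified - the real method just extracts a few values this way):

    private void processEntry(Element entry) {
        // Read the text of the child elements needed downstream.
        String name = entry.element("name").getText();
        // ... store or forward the extracted values ...
    }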