I would really consider SAX.  In MarcEdit, I had originally used an XSLT 
process for handling MARCXML translations (using both the SAXON and MSXML 
parsers), but as you noticed, there ends up being an upper limit to what you 
can process.  The breaking point for me came while working with some 
researchers experimenting with HathiTrust data: they had a 32 GB XML file of 
MARCXML that needed to be processed.  Using the DOM model, the process was 
untenable.  Reworking the code so that it was SAX-based required building, to 
some degree, the same kind of templating to react to specific elements and 
nested elements, but it shifted the processing time so that translating those 
32 GB of MARCXML into MARC took about 8 minutes (and it let me include code 
that handled some common issues, such as field length, at the point of 
translation).

Not knowing what your XML files look like, my guess is that if you do it right, 
you can template your SAX code in such a way that the actual processing code is 
smaller and much more efficient than anything you could create using a DOM 
method.
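
For what it's worth, the templating idea looks roughly like this in PHP's 
expat-based SAX interface (I'm sketching it in PHP since you mention SimpleXML 
and LibXML).  This is a rough, untested outline only: the file name is a 
placeholder, the element names assume MARCXML-style 
<record>/<datafield>/<subfield> markup, and real MARC handling needs subfield 
codes, indicators, the leader, and so on.

<?php
$record = null;   // record currently being built
$field  = null;   // datafield currently open
$text   = '';     // character-data buffer

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler($parser,
    // start-of-element handler: open a new record or datafield
    function ($p, $name, $attrs) use (&$record, &$field, &$text) {
        $text = '';
        if ($name === 'record')    $record = array('fields' => array());
        if ($name === 'datafield') $field  = array(
            'tag'       => isset($attrs['tag']) ? $attrs['tag'] : '',
            'subfields' => array());
    },
    // end-of-element handler: close out subfields, fields, and records
    function ($p, $name) use (&$record, &$field, &$text) {
        if ($name === 'subfield' && $field !== null)
            $field['subfields'][] = $text;
        if ($name === 'datafield' && $record !== null) {
            $record['fields'][] = $field;
            $field = null;
        }
        if ($name === 'record') {
            // translate/emit $record here, then throw it away
            $record = null;
        }
    });

xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
    $text .= $data;
});

$fh = fopen('marcxml-records.xml', 'rb');
while (!feof($fh)) {
    xml_parse($parser, fread($fh, 8192), feof($fh));  // feed 8 KB chunks
}
fclose($fh);
xml_parser_free($parser);

Only the record currently being built is ever in memory; everything else is 
discarded as soon as the end-of-record handler fires, which is why the memory 
footprint stays flat no matter how large the input file gets.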

--tr

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a couple 
hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to bounce this 
off the list since I'm sure people here deal with this problem all the time. My 
goal is to make something that's easy to read/maintain without pegging the CPU 
and consuming too much memory.

The performance and load I'm seeing from running the large files through 
LibXML and SimpleXML are completely unacceptable. SAX is not out of the 
question, but I'm trying to avoid it if possible to keep the code more compact 
and easier to read.


I'm tempted to stream-edit the files into a temp file, stripping out all the 
line breaks (since they occur in unpredictable places) and adding a new one at 
the end of each record. Then I can read the temp file one line at a time and 
process each record with SimpleXML. That way, there's no need to load giant 
files into memory, create huge arrays, etc., and the code would be easy enough 
for a 6th grader to follow. My proposed method doesn't sound very efficient to 
me, but it should consume predictable resources that don't increase with file 
size.
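
Another thought: it looks like XMLReader can pull one record at a time off the 
stream and hand just that fragment to SimpleXML, so memory stays flat no 
matter how big the file gets. Something like this (untested sketch; I'm 
assuming the records sit in <record> elements directly under the root, so the 
element name would need to match the real schema):

<?php
// untested sketch -- 'big-file.xml' and <record> are placeholders
$reader = new XMLReader();
$reader->open('big-file.xml');

// skip ahead to the first <record>
while ($reader->read() && $reader->name !== 'record');

while ($reader->name === 'record') {
    // only this one record gets materialized as a SimpleXML object
    $rec = simplexml_load_string($reader->readOuterXml());
    // ... crosswalk $rec with the usual SimpleXML accessors ...
    $reader->next('record');  // jump to the next sibling <record>
}

$reader->close();

That would keep the convenience of SimpleXML for the per-record logic without 
ever holding more than one record in memory, though I have no idea how it 
compares to SAX speed-wise.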

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element, particularly since 
large files usually consist of a large number of records/documents? This makes 
it absolutely impossible to process a file of any size without resorting to SAX 
or string parsing -- which takes away many of the advantages you'd normally 
have with an XML structure. </rant>

--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@orbiscascade.org / 503.999.9787
