I create 50GB files of MARCXML all the time.  We do NOT put a wrapper
element around them, but do put a line feed at the end of each record.
Then a trivial line-reading loop in Java/Perl/whatever can read those
records individually and process them appropriately.
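
A minimal sketch of such a line-reading loop, written in Python rather
than Java or Perl. It assumes one complete record per line with no
wrapper element; the function name iter_records and the un-namespaced
sample records are made up for illustration (real MARCXML normally
carries a namespace, which would change the lookup in the last line).

```python
import io
import xml.etree.ElementTree as ET


def iter_records(lines):
    """Yield one parsed element per non-empty line.

    Assumes each line holds exactly one complete XML record, so only
    a single record is ever held in memory at a time.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield ET.fromstring(line)


# Illustrative input: two tiny records, one per line, no root element.
sample = io.StringIO(
    "<record><leader>first</leader></record>\n"
    "<record><leader>second</leader></record>\n"
)

for record in iter_records(sample):
    print(record.findtext("leader"))
```

For a real file, the same function can be handed an open file object
instead of the StringIO sample, since iterating over a file also yields
one line at a time.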

That turns out to be the right way to do things in Hadoop too.

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Kyle Banerjee
Sent: Friday, June 08, 2012 2:36 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to bounce
this off the list since I'm sure people here deal with this problem all the
time. My goal is to make something that's easy to read/maintain without
pegging the CPU and consuming too much memory.

The performance and load I'm seeing from running the large files through
LibXML and SimpleXML are completely unacceptable. SAX is not out of the
question, but I'm trying to avoid it if possible to keep the code more
compact and easier to read.

I'm tempted to stream-edit out all line breaks, since they occur in
unpredictable places, and put new ones at the end of each record in a temp
file. Then I can read the temp file one line at a time and process each
line using SimpleXML. That way, there's no need to load giant files into
memory, create huge arrays, etc., and the code would be easy enough for a
6th grader to follow. My proposed method doesn't sound very efficient to
me, but it should consume predictable resources which don't increase with
file size.
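
The normalization step described above could be sketched roughly as
follows, in Python rather than a stream editor. The function name
one_record_per_line, the "</record>" end tag, and the chunk size are all
assumptions made for illustration. Line breaks are replaced with spaces
rather than deleted outright, so that a break between two attributes does
not glue them together. The sketch does not handle a wrapper element: an
opening root tag would end up attached to the first record, and a closing
one would be written as a trailing line of its own.

```python
import io


def one_record_per_line(src, dst, end_tag="</record>", chunk_size=65536):
    """Copy src to dst with exactly one record per output line.

    Reads src in fixed-size chunks, so memory use is bounded by the
    chunk size plus the largest single record, not by the file size.
    Returns the number of records written.
    """
    buffer = ""
    count = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Replace line breaks with spaces instead of removing them.
        buffer += chunk.replace("\r", " ").replace("\n", " ")
        # Emit every complete record currently in the buffer.
        while True:
            end = buffer.find(end_tag)
            if end == -1:
                break
            end += len(end_tag)
            dst.write(buffer[:end].strip() + "\n")
            count += 1
            buffer = buffer[end:]
    # Anything left over (e.g. a closing wrapper tag) goes on its own line.
    if buffer.strip():
        dst.write(buffer.strip() + "\n")
    return count


# Illustrative input with line breaks in unpredictable places.
source = io.StringIO("<record>\n<a>1</a>\n</record>\n<record><a>2\n</a></record>")
target = io.StringIO()
written = one_record_per_line(source, target, chunk_size=5)
print(written)
print(target.getvalue(), end="")
```

The tiny chunk size in the usage example is only there to show that an
end tag split across two reads is still found, because the search runs
over the accumulated buffer rather than over each chunk in isolation.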

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element,
particularly since large files usually consist of a large number of
records/documents? This makes it absolutely impossible to process a file
of any size without resorting to SAX or string parsing -- which takes away
many of the advantages you'd normally have with an XML structure.
</rant>

-- 
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
<baner...@uoregon.edu>baner...@orbiscascade.org / 503.999.9787
