Re: [CODE4LIB] Best way to process large XML files

2012-06-11 Thread Teemu Nuutinen
On 09.06.2012 00:00, Kyle Banerjee wrote:

 Since you mentioned SimpleXML, Kyle, I assume you're using PHP?

 
 Actually I'm using perl. For reasons not related to XML parsing, it is the
 preferred (but not mandatory) language.
 
 Based on a few tests and manual inspection, it looks like the ticket for me
 is going to be a two-stage process where the first stage converts the file
 to valid XML and the second cuts through it with SAX.
 
 Originally, I was trying to avoid SAX, but the process has been prettier
 than expected so far. The XML has not been prettier than expected --
 it contains a number of issues including outright invalid XML, invalid
 characters, and hand coded HTML within some elements (i.e. string data not
 encoded as such). Gotta love library data. But screwed up stuff is
 employment security. If things actually worked, I'd be redundant...
 
 kyle


Since you're using Perl, I think you mean XML::Simple, which is a
DOM parser. You also mentioned LibXML and are considering SAX parsing, so
I assume you've only used DOM parsing then? How about using an XML
reader, kind of like SAX but a whole lot cleaner and easier - something
like:

use XML::LibXML::Reader;

my $reader = XML::LibXML::Reader->new( location => $filename_or_uri );
while ( $reader->read ) {
    next unless $reader->name eq 'record'
        && $reader->nodeType == XML_READER_TYPE_ELEMENT;
    my $dom = XML::LibXML->load_xml( string => $reader->readOuterXml );
    # ...do something with the record element's DOM tree...
}

Documentation [https://metacpan.org/module/XML::LibXML::Reader]


HTH
-- 
Teemu Nuutinen, Digital Services, Helsinki University Library


Re: [CODE4LIB] Best way to process large XML files

2012-06-11 Thread Ron Gilmour
When I need to deal with huge XML files, I use Perl's XML::Parser in
stream mode. It's blazing fast, but I have to admit, the code isn't very
pretty.
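
For reference, a minimal sketch of what event-driven XML::Parser code looks
like; this is illustrative rather than Ron's actual code, and the 'record'
element name and file name are placeholders:

use strict;
use warnings;
use XML::Parser;

my ($in_record, $text) = (0, '');
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my (undef, $el) = @_; $in_record = 1 if $el eq 'record'; },
        Char  => sub { my (undef, $str) = @_; $text .= $str if $in_record; },
        End   => sub {
            my (undef, $el) = @_;
            return unless $el eq 'record';
            # ...do something with the accumulated record text...
            ($in_record, $text) = (0, '');
        },
    },
);
$parser->parsefile('big.xml');    # placeholder file name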

There's also XML::LibXML::SAX
(http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/SAX.pod),
but I can't seem to find any substantive documentation on how this works.
(If anyone has any sample code that uses this, I'd love to see it. Please
e-mail me off-list as I don't want to de-rail this thread.)
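
For anyone hitting the same documentation gap, a rough sketch of the shape an
XML::LibXML::SAX handler takes, following the generic Perl SAX2 handler
interface; the element name, counter logic, and file name are placeholders:

package RecordCounter;
use strict;
use warnings;
use base 'XML::SAX::Base';

# start_element receives a hashref describing the element (Name, Attributes, ...)
sub start_element {
    my ($self, $el) = @_;
    $self->{count}++ if $el->{Name} eq 'record';
}

package main;
use XML::LibXML::SAX;

my $handler = RecordCounter->new;
XML::LibXML::SAX->new( Handler => $handler )->parse_uri('big.xml');
print "records seen: $handler->{count}\n";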

Teemu's suggestion about XML::LibXML::Reader is definitely worth
considering. I've never clocked it against XML::Parser, but it seems like
it *should* be fast. And as Teemu demonstrated, it allows you to write nice
compact code.

Ron




On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee baner...@orbiscascade.org wrote:

 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.

 I've thought of a number of ways to go about this, but I wanted to bounce
 this off the list since I'm sure people here deal with this problem all the
 time. My goal is to make something that's easy to read/maintain without
 pegging the CPU and consuming too much memory.

 The performance and load I'm seeing from running the files through LibXML
 and SimpleXML on the large files is completely unacceptable. SAX is not out
 of the question, but I'm trying to avoid it if possible to keep the code
 more compact and easier to read.

 I'm tempted to streamedit out all line breaks since they occur in
 unpredictable places and put new ones at the end of each record into a temp
 file. Then I can read the temp file one line at a time and process using
 SimpleXML. That way, there's no need to load giant files into memory,
 create huge arrays, etc and the code would be easy enough for a 6th grader
 to follow. My proposed method doesn't sound very efficient to me, but it
 should consume predictable resources which don't increase with file size.

 How do you guys deal with large XML files? Thanks,

 kyle

 <rant>Why the heck does the XML spec require a root element,
 particularly since large files usually consist of a large number of
 records/documents? This makes it absolutely impossible to process a file of
 any size without resorting to SAX or string parsing -- which takes away
 many of the advantages you'd normally have with an XML structure. </rant>

 --
 --
 Kyle Banerjee
 Digital Services Program Manager
 Orbis Cascade Alliance
 baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787



Re: [CODE4LIB] Best way to process large XML files

2012-06-10 Thread Ross Singer
Steve, I'm not sure if you were hoping for a ruby-related answer to
your question (since you mentioned Nokogiri), but if you are, take a
look at ruby-marc's GenericPullParser [1] as an example of using a SAX
parser for this sort of thing.  It doesn't quite answer your question,
but I think it might provide some guidance.

Basically, I think you're still going to have to use the SAX parser to
create record objects where you can build up your hierarchy logic and
then simply move onto the next record if the conditions aren't met.
Even though you'd still need to build your objects, I think streaming
over the XML (and constructed objects) will still be pretty fast and
efficient.

-Ross.
1. 
https://github.com/ruby-marc/ruby-marc/blob/master/lib/marc/xml_parsers.rb#L27

On Fri, Jun 8, 2012 at 8:07 PM, Steve Meyer steve.e.me...@gmail.com wrote:
 It is also worth noting that you can usually do SAX-style parsing in
 most XML parsing libraries that are normally associated with DOM style
 parsing and conveniences like XPath selectors. For example, Nokogiri
 does SAX and it is *very* fast:

 http://nokogiri.org/Nokogiri/XML/SAX/Document.html

 As a related question, when folks do SAX-style parsing and need to
 select highly conditional and deeply nested elements (think getting
 MODS title data only when a parent element's attribute matches a
 condition and it is all nested in a big METS wrapper), how are you
 keeping track of those nesting and conditional rules? I have relied on
 using a few booleans that get set and unset to track state, but it
 often feels sloppy.

 -steve

 On Fri, Jun 8, 2012 at 2:41 PM, Ethan Gruber ewg4x...@gmail.com wrote:
 but I have gotten noticeably better
 performance from Saxon/XSLT2 than PHP with DOMDocument or SimpleXML or
 nokogiri and hpricot in Ruby.


Re: [CODE4LIB] Best way to process large XML files

2012-06-10 Thread stuart yeates

On 09/06/12 06:36, Kyle Banerjee wrote:


How do you guys deal with large XML files?


There have been a number of excellent suggestions from other people, but 
it's worth pointing out that sometimes low tech is all you need.


I frequently use sed to do things such as replace one domain name with 
another when a website changes their URL.


Short for Stream EDitor, sed is a core part of POSIX and should be 
available on pretty much every UNIX-like platform imaginable. For 
non-trivial files it works faster than disk access (i.e. works as fast 
as a naive file copy). Full regexp support is available.


sed 's/www.example.net/example.com/gI' < IN_FILE > OUT_FILE

Will stream IN_FILE to OUT_FILE replacing all instances of 
www.example.net with example.com


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] Best way to process large XML files

2012-06-10 Thread Edward M Corrado
FWIW: I use sed all the time to edit XML files. I wouldn't say I have any 
really large files (which is why I didn't respond earlier), but it works great 
for me. Regular expressions are your friend.

--
Edward M. Corrado

On Jun 10, 2012, at 19:25, stuart yeates stuart.yea...@vuw.ac.nz wrote:

 On 09/06/12 06:36, Kyle Banerjee wrote:
 
 How do you guys deal with large XML files?
 
 There have been a number of excellent suggestions from other people, but it's 
 worth pointing out that sometimes low tech is all you need.
 
 I frequently use sed to do things such as replace one domain name with 
 another when a website changes their URL.
 
 Short for Stream EDitor, sed is a core part of POSIX and should be available 
 on pretty much every UNIX-like platform imaginable. For non-trivial files it 
 works faster than disk access (i.e. works as fast as a naive file copy). Full 
 regexp support is available.
 
 sed 's/www.example.net/example.com/gI' < IN_FILE > OUT_FILE
 
 Will stream IN_FILE to OUT_FILE replacing all instances of www.example.net 
 with example.com
 
 cheers
 stuart
 -- 
 Stuart Yeates
 Library Technology Services http://www.victoria.ac.nz/library/


[CODE4LIB] Best way to process large XML files

2012-06-08 Thread Kyle Banerjee
I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to bounce
this off the list since I'm sure people here deal with this problem all the
time. My goal is to make something that's easy to read/maintain without
pegging the CPU and consuming too much memory.

The performance and load I'm seeing from running the files through LibXML
and SimpleXML on the large files is completely unacceptable. SAX is not out
of the question, but I'm trying to avoid it if possible to keep the code
more compact and easier to read.

I'm tempted to streamedit out all line breaks since they occur in
unpredictable places and put new ones at the end of each record into a temp
file. Then I can read the temp file one line at a time and process using
SimpleXML. That way, there's no need to load giant files into memory,
create huge arrays, etc and the code would be easy enough for a 6th grader
to follow. My proposed method doesn't sound very efficient to me, but it
should consume predictable resources which don't increase with file size.
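
A rough sketch of that idea, done as a stream by changing Perl's input record
separator rather than editing line breaks; the <record> element name and the
file names are placeholders:

use strict;
use warnings;

$/ = '</record>';                            # read one record at a time
open my $in,  '<', 'big.xml'     or die $!;
open my $out, '>', 'records.txt' or die $!;
while ( my $chunk = <$in> ) {
    next unless $chunk =~ /<record[\s>]/;    # skip the prolog/closing wrapper
    $chunk =~ s/^.*?(?=<record[\s>])//s;     # drop anything before the record
    $chunk =~ s/\n//g;                       # remove embedded line breaks
    print {$out} $chunk, "\n";               # one record per output line
}
close $in;
close $out;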

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element,
particularly since large files usually consist of a large number of
records/documents? This makes it absolutely impossible to process a file of
any size without resorting to SAX or string parsing -- which takes away
many of the advantages you'd normally have with an XML structure. </rant>

-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Ethan Gruber
Saxon is really, really efficient with large files.  I don't really have
any benchmarks stats available, but I have gotten noticeably better
performance from Saxon/XSLT2 than PHP with DOMDocument or SimpleXML or
nokogiri and hpricot in Ruby.

Ethan

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee baner...@orbiscascade.org wrote:

 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.

 I've thought of a number of ways to go about this, but I wanted to bounce
 this off the list since I'm sure people here deal with this problem all the
 time. My goal is to make something that's easy to read/maintain without
 pegging the CPU and consuming too much memory.

 The performance and load I'm seeing from running the files through LibXML
 and SimpleXML on the large files is completely unacceptable. SAX is not out
 of the question, but I'm trying to avoid it if possible to keep the code
 more compact and easier to read.

 I'm tempted to streamedit out all line breaks since they occur in
 unpredictable places and put new ones at the end of each record into a temp
 file. Then I can read the temp file one line at a time and process using
 SimpleXML. That way, there's no need to load giant files into memory,
 create huge arrays, etc and the code would be easy enough for a 6th grader
 to follow. My proposed method doesn't sound very efficient to me, but it
 should consume predictable resources which don't increase with file size.

 How do you guys deal with large XML files? Thanks,

 kyle

 <rant>Why the heck does the XML spec require a root element,
 particularly since large files usually consist of a large number of
 records/documents? This makes it absolutely impossible to process a file of
 any size without resorting to SAX or string parsing -- which takes away
 many of the advantages you'd normally have with an XML structure. </rant>

 --
 --
 Kyle Banerjee
 Digital Services Program Manager
 Orbis Cascade Alliance
 baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787



Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Reese, Terry
I would really consider SAX.  In MarcEdit, I had originally utilized an XSLT 
process for handling MARCXML translations (using both SAXON and MSXML parsers) 
-- but as you noticed -- there ends up being an upper limit to what you can 
process.  The break point for me came while working with some researchers 
experimenting with HathiTrust data who had a 32 GB XML file of MARCXML that 
needed to be processed.  Using the DOM model, the process was untenable.  
Re-working the code so that it was SAX based required building, to some degree, 
the same type of templating to react to specific and nested elements, but it 
shifted processing time so that translating those 32 GB of MARCXML into MARC 
took ~8 minutes (and allowed me to include code that handled some common issues, 
such as field length, at the point of translation).

Not knowing what your XML files look like, my guess is that if you do it right, 
you can template your SAX code in such a way that the actual processing code is 
smaller and much more efficient than anything you could create using a DOM 
method.

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a couple 
hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to bounce this 
off the list since I'm sure people here deal with this problem all the time. My 
goal is to make something that's easy to read/maintain without pegging the CPU 
and consuming too much memory.

The performance and load I'm seeing from running the files through LibXML and 
SimpleXML on the large files is completely unacceptable. SAX is not out of the 
question, but I'm trying to avoid it if possible to keep the code more compact 
and easier to read.


I'm tempted to streamedit out all line breaks since they occur in unpredictable 
places and put new ones at the end of each record into a temp file. Then I can 
read the temp file one line at a time and process using SimpleXML. That way, 
there's no need to load giant files into memory, create huge arrays, etc and 
the code would be easy enough for a 6th grader to follow. My proposed method 
doesn't sound very efficient to me, but it should consume predictable resources 
which don't increase with file size.

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element, particularly since 
large files usually consist of a large number of records/documents? This makes 
it absolutely impossible to process a file of any size without resorting to SAX 
or string parsing -- which takes away many of the advantages you'd normally 
have with an XML structure. </rant>

--
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread LeVan,Ralph
I create 50GB files of marcxml all the time.  We do NOT put a wrapper
element around them, but do put a line feed at the end of each record.
Then a trivial line reading loop in java/perl/whatever can read those
records individually and process them appropriately.

That turns out to be the right way to do things in Hadoop too.
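
A minimal sketch of that line-reading loop in Perl (the file name is a
placeholder, and each non-blank line is assumed to be one complete MARCXML
record):

use strict;
use warnings;
use XML::LibXML;

open my $fh, '<', 'records.marcxml' or die $!;   # placeholder file name
while ( my $line = <$fh> ) {
    next unless $line =~ /\S/;                   # skip blank lines
    my $record = XML::LibXML->load_xml( string => $line );
    # ...crosswalk/process one record's DOM here...
}
close $fh;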

Ralph

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Kyle Banerjee
Sent: Friday, June 08, 2012 2:36 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to
bounce
this off the list since I'm sure people here deal with this problem all
the
time. My goal is to make something that's easy to read/maintain without
pegging the CPU and consuming too much memory.

The performance and load I'm seeing from running the files through
LibXML
and SimpleXML on the large files is completely unacceptable. SAX is not
out
of the question, but I'm trying to avoid it if possible to keep the code
more compact and easier to read.

I'm tempted to streamedit out all line breaks since they occur in
unpredictable places and put new ones at the end of each record into a
temp
file. Then I can read the temp file one line at a time and process using
SimpleXML. That way, there's no need to load giant files into memory,
create huge arrays, etc and the code would be easy enough for a 6th
grader
to follow. My proposed method doesn't sound very efficient to me, but it
should consume predictable resources which don't increase with file
size.

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element,
particularly since large files usually consist of a large number of
records/documents? This makes it absolutely impossible to process a file
of
any size without resorting to SAX or string parsing -- which takes away
many of the advantages you'd normally have with an XML structure.
</rant>

-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Devon
This is something I've dealt with. And for a variety of reasons, we
went with the streaming parser. I'm not sure about the quality of your
data, but we have to be prepared for seriously messed up data. There
was no way I was going to develop a process that would try to load a
15-million-record file only to fail at record 14 million because of a
syntax or encoding error. Nope, nope, nope. So not only do we use a
streaming parser, it's a two-stage streaming parser. First, one stage
finds record boundaries and creates a well-formed version of each
record. Then the parser for the actual record is called
to extract the data for crosswalking.

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee baner...@orbiscascade.org wrote:
 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.

 I've thought of a number of ways to go about this, but I wanted to bounce
 this off the list since I'm sure people here deal with this problem all the
 time. My goal is to make something that's easy to read/maintain without
 pegging the CPU and consuming too much memory.

 The performance and load I'm seeing from running the files through LibXML
 and SimpleXML on the large files is completely unacceptable. SAX is not out
 of the question, but I'm trying to avoid it if possible to keep the code
 more compact and easier to read.

 I'm tempted to streamedit out all line breaks since they occur in
 unpredictable places and put new ones at the end of each record into a temp
 file. Then I can read the temp file one line at a time and process using
 SimpleXML. That way, there's no need to load giant files into memory,
 create huge arrays, etc and the code would be easy enough for a 6th grader
 to follow. My proposed method doesn't sound very efficient to me, but it
 should consume predictable resources which don't increase with file size.

 How do you guys deal with large XML files? Thanks,

 kyle

 <rant>Why the heck does the XML spec require a root element,
 particularly since large files usually consist of a large number of
 records/documents? This makes it absolutely impossible to process a file of
 any size without resorting to SAX or string parsing -- which takes away
 many of the advantages you'd normally have with an XML structure. </rant>

 --
 --
 Kyle Banerjee
 Digital Services Program Manager
 Orbis Cascade Alliance
 baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787



-- 
Sent from my GMail account.


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Joe Hourcle
On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:

 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.

[trimmed]

 How do you guys deal with large XML files? Thanks,

um ... I return ASCII tab-delim records, because IDL's XML processing
routines have some massive issue with garbage collection if you walk
down the DOM tree.  However, no one in their right mind uses IDL
for XML, as it's basically Fortran w/ multi-dimensional arrays.

...

Everyone else is going to tell you to use SAX, and they're probably
right, but since you sound as reluctant to use SAX as I am,
another alternative may be Perl's XML::Twig:

http://search.cpan.org/perldoc?XML::Twig
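
A minimal XML::Twig sketch (illustrative only; the 'record' element name and
file name are placeholders). A twig_handler fires per matching element, and
purge() discards everything already handled, which is what keeps memory flat
on large files:

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ($t, $rec) = @_;
            # ...do something with $rec (an XML::Twig::Elt)...
            $t->purge;    # free everything parsed so far
        },
    },
);
$twig->parsefile('big.xml');    # placeholder file name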

-Joe


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Steve Meyer
It is also worth noting that you can usually do SAX-style parsing in
most XML parsing libraries that are normally associated with DOM style
parsing and conveniences like XPath selectors. For example, Nokogiri
does SAX and it is *very* fast:

http://nokogiri.org/Nokogiri/XML/SAX/Document.html

As a related question, when folks do SAX-style parsing and need to
select highly conditional and deeply nested elements (think getting
MODS title data only when a parent element's attribute matches a
condition and it is all nested in a big METS wrapper), how are you
keeping track of those nesting and conditional rules? I have relied on
using a few booleans that get set and unset to track state, but it
often feels sloppy.
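
One common alternative to juggling booleans (not something proposed in this
thread) is to keep a stack of currently open element names and test the joined
path. A Perl sketch using XML::Parser; the element names and the path test are
invented placeholders:

use strict;
use warnings;
use XML::Parser;

my @path;    # names of all currently open elements, outermost first
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my (undef, $el) = @_; push @path, $el; },
        End   => sub { pop @path; },
        Char  => sub {
            my (undef, $str) = @_;
            # placeholder path test: act only on title text nested inside a record
            print $str if join('/', @path) =~ m{record/title$};
        },
    },
);
$parser->parsefile('big.xml');    # placeholder file name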

-steve

On Fri, Jun 8, 2012 at 2:41 PM, Ethan Gruber ewg4x...@gmail.com wrote:
 but I have gotten noticeably better
 performance from Saxon/XSLT2 than PHP with DOMDocument or SimpleXML or
 nokogiri and hpricot in Ruby.


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Esmé Cowles
One way to get the best of both worlds (scalability of a streaming parser, but 
convenience of DOM) is to use DOM4J's ElementHandler interface[1].  You parse 
the XML file using a SAXReader, and register a class to handle callbacks, based 
on an XPath expression.  I used this approach to break up giant MARCXML files 
with hundreds of thousands of records.

Though this approach does require the XML to be well-formed.  I had some 
problems with that, and wound up pre-processing the MARCXML to strip out 
illegal characters so they wouldn't cause parsing errors.
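
As an illustration of that kind of cleanup (an assumption about the approach,
not the actual preprocessing described above), a small Perl filter that keeps
only the characters the XML 1.0 spec allows:

use strict;
use warnings;

binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while ( my $line = <STDIN> ) {
    # keep only code points permitted by the XML 1.0 spec
    $line =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    print $line;
}

Run it as a filter, e.g. perl strip_chars.pl < raw.xml > clean.xml (the script
name is made up).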


1. 
http://dom4j.sourceforge.net/dom4j-1.6.1/apidocs/org/dom4j/ElementHandler.html


-Esme
--
Esme Cowles <escow...@ucsd.edu>

The wages of sin is death but so is the salary of virtue, and at least the
 evil get to go home early on Fridays. -- Terry Pratchett, Witches Abroad

On 06/8/2012, at 2:36 PM, Kyle Banerjee wrote:

 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.
 
 I've thought of a number of ways to go about this, but I wanted to bounce
 this off the list since I'm sure people here deal with this problem all the
 time. My goal is to make something that's easy to read/maintain without
 pegging the CPU and consuming too much memory.
 
 The performance and load I'm seeing from running the files through LibXML
 and SimpleXML on the large files is completely unacceptable. SAX is not out
 of the question, but I'm trying to avoid it if possible to keep the code
 more compact and easier to read.
 
 I'm tempted to streamedit out all line breaks since they occur in
 unpredictable places and put new ones at the end of each record into a temp
 file. Then I can read the temp file one line at a time and process using
 SimpleXML. That way, there's no need to load giant files into memory,
 create huge arrays, etc and the code would be easy enough for a 6th grader
 to follow. My proposed method doesn't sound very efficient to me, but it
 should consume predictable resources which don't increase with file size.
 
 How do you guys deal with large XML files? Thanks,
 
 kyle
 
 <rant>Why the heck does the XML spec require a root element,
 particularly since large files usually consist of a large number of
 records/documents? This makes it absolutely impossible to process a file of
 any size without resorting to SAX or string parsing -- which takes away
 many of the advantages you'd normally have with an XML structure. </rant>
 
 -- 
 --
 Kyle Banerjee
 Digital Services Program Manager
 Orbis Cascade Alliance
 baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Kevin S. Clarke
If you're not averse to Java, the XOM XML library has a nice
NodeFactory class that you can override to control the processing of
the XML document.  For instance, it will let you parse a very large
XML document like

<root>
  <rec></rec>
  <rec></rec>
  ...
</root>

only keeping one <rec> at a time in memory.  You control the node
building process, so you can throw away the ones you're done with.  It's
friendlier than SAX and it's what I use for processing very large
documents.

Cf. http://www.xom.nu/apidocs/nu/xom/NodeFactory.html

Kevin



On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee baner...@orbiscascade.org wrote:
 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.

 I've thought of a number of ways to go about this, but I wanted to bounce
 this off the list since I'm sure people here deal with this problem all the
 time. My goal is to make something that's easy to read/maintain without
 pegging the CPU and consuming too much memory.

 The performance and load I'm seeing from running the files through LibXML
 and SimpleXML on the large files is completely unacceptable. SAX is not out
 of the question, but I'm trying to avoid it if possible to keep the code
 more compact and easier to read.

 I'm tempted to streamedit out all line breaks since they occur in
 unpredictable places and put new ones at the end of each record into a temp
 file. Then I can read the temp file one line at a time and process using
 SimpleXML. That way, there's no need to load giant files into memory,
 create huge arrays, etc and the code would be easy enough for a 6th grader
 to follow. My proposed method doesn't sound very efficient to me, but it
 should consume predictable resources which don't increase with file size.

 How do you guys deal with large XML files? Thanks,

 kyle

 <rant>Why the heck does the XML spec require a root element,
 particularly since large files usually consist of a large number of
 records/documents? This makes it absolutely impossible to process a file of
 any size without resorting to SAX or string parsing -- which takes away
 many of the advantages you'd normally have with an XML structure. </rant>

 --
 --
 Kyle Banerjee
 Digital Services Program Manager
 Orbis Cascade Alliance
 baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Walker, David
Since you mentioned SimpleXML, Kyle, I assume you're using PHP?

If so, you might look at XMLReader [1], which is a pull parser, and should give 
you better performance on large files than SimpleXML.

It is still based on libxml, though, so if that is still not fast enough for 
you, you can toss out my suggestion. :-)

--Dave

[1] http://php.net/manual/en/book.xmlreader.php

-
David Walker
Interim Director, Systemwide Digital Library Services
California State University
562-355-4845


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle 
Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a couple 
hundred XML files regularly, some of which are quite large.

I've thought of a number of ways to go about this, but I wanted to bounce this 
off the list since I'm sure people here deal with this problem all the time. My 
goal is to make something that's easy to read/maintain without pegging the CPU 
and consuming too much memory.

The performance and load I'm seeing from running the files through LibXML and 
SimpleXML on the large files is completely unacceptable. SAX is not out of the 
question, but I'm trying to avoid it if possible to keep the code more compact 
and easier to read.


I'm tempted to streamedit out all line breaks since they occur in unpredictable 
places and put new ones at the end of each record into a temp file. Then I can 
read the temp file one line at a time and process using SimpleXML. That way, 
there's no need to load giant files into memory, create huge arrays, etc and 
the code would be easy enough for a 6th grader to follow. My proposed method 
doesn't sound very efficient to me, but it should consume predictable resources 
which don't increase with file size.

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element, particularly since 
large files usually consist of a large number of records/documents? This makes 
it absolutely impossible to process a file of any size without resorting to SAX 
or string parsing -- which takes away many of the advantages you'd normally 
have with an XML structure. </rant>

--
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu <baner...@orbiscascade.org> / 503.999.9787


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Kyle Banerjee

 Since you mentioned SimpleXML, Kyle, I assume you're using PHP?


Actually I'm using perl. For reasons not related to XML parsing, it is the
preferred (but not mandatory) language.

Based on a few tests and manual inspection, it looks like the ticket for me
is going to be a two-stage process where the first stage converts the file
to valid XML and the second cuts through it with SAX.

Originally, I was trying to avoid SAX, but the process has been prettier
than expected so far. The XML has not been prettier than expected --
it contains a number of issues including outright invalid XML, invalid
characters, and hand coded HTML within some elements (i.e. string data not
encoded as such). Gotta love library data. But screwed up stuff is
employment security. If things actually worked, I'd be redundant...

kyle


Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Peter Murray
*sigh* -- I kinda wish this whole discussion got captured in 
http://libraries.stackexchange.com/ ...


Peter

On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:
 I'm working on a script that needs to be able to crosswalk at least a
 couple hundred XML files regularly, some of which are quite large.
 
 I've thought of a number of ways to go about this, but I wanted to bounce
 this off the list since I'm sure people here deal with this problem all the
 time. My goal is to make something that's easy to read/maintain without
 pegging the CPU and consuming too much memory.
 
 The performance and load I'm seeing from running the files through LibXML
 and SimpleXML on the large files is completely unacceptable. SAX is not out
 of the question, but I'm trying to avoid it if possible to keep the code
 more compact and easier to read.
 
 I'm tempted to streamedit out all line breaks since they occur in
 unpredictable places and put new ones at the end of each record into a temp
 file. Then I can read the temp file one line at a time and process using
 SimpleXML. That way, there's no need to load giant files into memory,
 create huge arrays, etc and the code would be easy enough for a 6th grader
 to follow. My proposed method doesn't sound very efficient to me, but it
 should consume predictable resources which don't increase with file size.
 
 How do you guys deal with large XML files? Thanks,
 
 kyle
 
 <rant>Why the heck does the XML spec require a root element,
 particularly since large files usually consist of a large number of
 records/documents? This makes it absolutely impossible to process a file of
 any size without resorting to SAX or string parsing -- which takes away
 many of the advantages you'd normally have with an XML structure. </rant>



-- 
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
peter.mur...@lyrasis.org
+1 678-235-2955
 
1438 West Peachtree Street NW
Suite 200
Atlanta, GA 30309
Toll Free: 800.999.8558
Fax: 404.892.7879 
www.lyrasis.org
 
LYRASIS: Great Libraries. Strong Communities. Innovative Answers.