Re: [CODE4LIB] Best way to process large XML files

2012-06-11 Thread Teemu Nuutinen
On 09.06.2012 00:00, Kyle Banerjee wrote: Since you mentioned SimpleXML, Kyle, I assume you're using PHP? Actually I'm using perl. For reasons not related to XML parsing, it is the preferred (but not mandatory) language. Based on a few tests and manual inspection, it looks like the

Re: [CODE4LIB] Best way to process large XML files

2012-06-11 Thread Ron Gilmour
When I need to deal with huge XML files, I use Perl's XML::Parser in stream mode. It's blazing fast, but I have to admit, the code isn't very pretty. There's also XML::LibXML::SAX (http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/SAX.pod), but I can't seem to find any substantive documentation
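The event-driven style Ron describes isn't Perl-specific; most languages ship a SAX parser with the same shape. A minimal sketch in Python's stdlib `xml.sax`, counting records without ever building a tree (the element name `record` and the sample document are made up for illustration):

```python
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Count <record> elements and collect their text, streaming, no DOM."""
    def __init__(self):
        super().__init__()
        self.count = 0
        self.in_record = False
        self.buf = []
        self.records = []

    def startElement(self, name, attrs):
        if name == "record":
            self.in_record = True
            self.buf = []

    def characters(self, content):
        # May fire several times per text node; accumulate.
        if self.in_record:
            self.buf.append(content)

    def endElement(self, name):
        if name == "record":
            self.records.append("".join(self.buf).strip())
            self.in_record = False
            self.count += 1

handler = RecordCounter()
xml.sax.parseString(b"<root><record>alpha</record><record>beta</record></root>", handler)
print(handler.count)    # 2
print(handler.records)  # ['alpha', 'beta']
```

The trade-off is exactly as described: fast and constant-memory, but the handler state-machine is less pretty than XPath over a DOM.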

Re: [CODE4LIB] Best way to process large XML files

2012-06-10 Thread Ross Singer
Steve, I'm not sure if you were hoping for a ruby-related answer to your question (since you mentioned Nokogiri), but if you are, take a look at ruby-marc's GenericPullParser [1] as an example of using a SAX parser for this sort of thing. It doesn't quite answer your question, but I think it might

Re: [CODE4LIB] Best way to process large XML files

2012-06-10 Thread stuart yeates
On 09/06/12 06:36, Kyle Banerjee wrote: How do you guys deal with large XML files? There have been a number of excellent suggestions from other people, but it's worth pointing out that sometimes low tech is all you need. I frequently use sed to do things such as replace one domain name

Re: [CODE4LIB] Best way to process large XML files

2012-06-10 Thread Edward M Corrado
FWIW: I use sed all the time to edit XML files. I wouldn't say I have any really large files (which is why I didn't respond earlier) but it works great for me. Regular expressions are your friend. -- Edward M. Corrado On Jun 10, 2012, at 19:25, stuart yeates stuart.yea...@vuw.ac.nz wrote:
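For simple textual fixes like the domain-name swap Stuart mentions, the low-tech sed approach is just a line-by-line regex pass, which translates directly to any scripting language. A sketch in Python (the host names are placeholders, not from the thread):

```python
import re

def replace_domain(lines, old="old.example.org", new="new.example.org"):
    """Line-by-line literal substitution, the moral equivalent of
    `sed 's/old.example.org/new.example.org/g'` on an XML file."""
    pattern = re.compile(re.escape(old))
    return [pattern.sub(new, line) for line in lines]

sample = ["<identifier>http://old.example.org/item/1</identifier>\n"]
print(replace_domain(sample))
# ['<identifier>http://new.example.org/item/1</identifier>\n']
```

Because it never parses the XML, this works on files of any size, with the usual caveat that it is blind to markup structure.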

[CODE4LIB] Best way to process large XML files

2012-06-08 Thread Kyle Banerjee
I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large. I've thought of a number of ways to go about this, but I wanted to bounce this off the list since I'm sure people here deal with this problem all the time. My

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Ethan Gruber
Saxon is really, really efficient with large files. I don't really have any benchmark stats available, but I have gotten noticeably better performance from Saxon/XSLT2 than PHP with DOMDocument or SimpleXML or nokogiri and hpricot in Ruby. Ethan On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Reese, Terry
@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee Sent: Friday, June 08, 2012 11:36 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Best way to process large XML files I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread LeVan,Ralph
I create 50GB files of marcxml all the time. We do NOT put a wrapper element around them, but do put a line feed at the end of each record. Then a trivial line reading loop in java/perl/whatever can read those records individually and process them appropriately. That turns out to be the right
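Ralph's one-record-per-line layout makes the reading loop trivial in any language, since each line is a small, complete XML document. A Python sketch (the `record`/`title` element names are illustrative, not real MARCXML):

```python
import io
import xml.etree.ElementTree as ET

def iter_records(fileobj):
    """Yield one parsed element per line; memory use stays constant
    because no wrapper element forces parsing the whole file at once."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield ET.fromstring(line)

data = io.StringIO(
    "<record><title>First</title></record>\n"
    "<record><title>Second</title></record>\n"
)
titles = [rec.findtext("title") for rec in iter_records(data)]
print(titles)  # ['First', 'Second']
```

Each record gets full DOM convenience (XPath-ish lookups, attribute access) while the file as a whole is never held in memory.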

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Devon
This is something I've dealt with. And for a variety of reasons, we went with the streaming parser. I'm not sure about the quality of your data, but we have to be prepared for seriously messed up data. There was no way I was going to develop a process that would try to load a 15 million record

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Joe Hourcle
On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote: I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large. [trimmed] How do you guys deal with large XML files? Thanks, um ... I return ASCII tab-delim records,

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Steve Meyer
It is also worth noting that you can usually do SAX-style parsing in most XML parsing libraries that are normally associated with DOM style parsing and conveniences like XPath selectors. For example, Nokogiri does SAX and it is *very* fast: http://nokogiri.org/Nokogiri/XML/SAX/Document.html As a

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Esmé Cowles
One way to get the best of both worlds (scalability of a streaming parser, but convenience of DOM) is to use DOM4J's ElementHandler interface[1]. You parse the XML file using a SAXReader, and register a class to handle callbacks, based on an XPath expression. I used this approach to break up
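The same streaming-plus-DOM hybrid Esmé describes for DOM4J exists in Python's stdlib as ElementTree's `iterparse`: each matching element arrives as a fully built subtree, and clearing it afterwards keeps memory bounded. A sketch under the assumption that records are `<rec>` elements (a placeholder name):

```python
import io
import xml.etree.ElementTree as ET

def process_large(fileobj, tag="rec"):
    results = []
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == tag:
            results.append(elem.text)  # full subtree is available here
            elem.clear()               # discard it so memory stays bounded
    return results

data = io.StringIO("<root><rec>a</rec><rec>b</rec><rec>c</rec></root>")
print(process_large(data))  # ['a', 'b', 'c']
```

This is also effectively what XOM's NodeFactory override (mentioned elsewhere in the thread) buys you in Java: one record's worth of tree at a time.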

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Kevin S. Clarke
If you're not averse to Java, the XOM XML library has a nice NodeFactory class that you can override and control the processing of the XML document. For instance, it will let you parse a very large XML document like <root><rec>…</rec><rec>…</rec>…</root> while only keeping one rec at a time in memory.

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Walker, David
@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee Sent: Friday, June 08, 2012 11:36 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Best way to process large XML files I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large. I've

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Kyle Banerjee
Since you mentioned SimpleXML, Kyle, I assume you're using PHP? Actually I'm using perl. For reasons not related to XML parsing, it is the preferred (but not mandatory) language. Based on a few tests and manual inspection, it looks like the ticket for me is going have a two stage process

Re: [CODE4LIB] Best way to process large XML files

2012-06-08 Thread Peter Murray
*sigh* -- I kinda wish this whole discussion got captured in http://libraries.stackexchange.com/ ... Peter On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote: I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large.