On 09.06.2012 00:00, Kyle Banerjee wrote:
Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
Actually I'm using Perl. For reasons not related to XML parsing, it is the
preferred (but not mandatory) language.
When I need to deal with huge XML files, I use Perl's XML::Parser in
stream mode. It's blazing fast, but I have to admit, the code isn't very
pretty.
There's also XML::LibXML::SAX
(http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/SAX.pod),
but I can't seem to find any substantive documentation.
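For a concrete sense of what stream-mode parsing looks like, here is a minimal sketch using Ruby's stdlib REXML stream parser (shown in Ruby so it needs no extra modules; Perl's XML::Parser handlers follow the same start/text/end event shape, and the <rec> element name is made up):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Stream parsing fires an event per tag as it is read, so memory use
# stays flat no matter how big the input file is.
class RecCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'rec'
  end
end

xml = '<root><rec>a</rec><rec>b</rec><rec>c</rec></root>'
listener = RecCounter.new
REXML::Parsers::StreamParser.new(xml, listener).parse
```

As with XML::Parser's stream mode, the price of the flat memory profile is that you manage all state yourself in the callbacks.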
Steve, I'm not sure if you were hoping for a ruby-related answer to
your question (since you mentioned Nokogiri), but if you are, take a
look at ruby-marc's GenericPullParser [1] as an example of using a SAX
parser for this sort of thing. It doesn't quite answer your question,
but I think it might [trimmed]
On 09/06/12 06:36, Kyle Banerjee wrote:
How do you guys deal with large XML files?
There have been a number of excellent suggestions from other people, but
it's worth pointing out that sometimes low tech is all you need.
I frequently use sed to do things such as replace one domain name with another.
FWIW: I use sed all the time to edit XML files. I wouldn't say I have any
really large files (which is why I didn't respond earlier), but it works great
for me. Regular expressions are your friend.
--
Edward M. Corrado
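The sed idiom being described is a plain global substitution over the file's text. A minimal Ruby sketch of the same idea (the domain names are placeholders, not real endpoints):

```ruby
# Equivalent in spirit to:
#   sed 's|http://old.example.org|http://new.example.org|g' file.xml
# Like sed, this treats the XML as plain text.
line  = '<identifier>http://old.example.org/rec/42</identifier>'
fixed = line.gsub('http://old.example.org', 'http://new.example.org')
```

Because it ignores XML structure, this can miss entity-encoded or line-wrapped values; it is fine for mechanical swaps on well-behaved data, which is exactly the low-tech case being made above.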
On Jun 10, 2012, at 19:25, stuart yeates stuart.yea...@vuw.ac.nz wrote:
I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.
I've thought of a number of ways to go about this, but I wanted to bounce
this off the list since I'm sure people here deal with this problem all the
time. [trimmed]
Saxon is really, really efficient with large files. I don't have any
benchmark stats available, but I have gotten noticeably better performance
from Saxon/XSLT2 than from PHP with DOMDocument or SimpleXML, or from
Nokogiri and Hpricot in Ruby.
Ethan
On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee wrote:
[mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files
I'm working on a script that needs to be able to crosswalk at least a couple
hundred XML files regularly, some of which are quite large.
I create 50GB files of marcxml all the time. We do NOT put a wrapper
element around them, but do put a line feed at the end of each record.
Then a trivial line-reading loop in Java/Perl/whatever can read those
records individually and process them appropriately.
That turns out to be the right [trimmed]
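The layout described above (no wrapper element, a line feed after each record) makes the reading loop trivial. A sketch in Ruby, with stand-in <record> markup rather than real MARCXML:

```ruby
require 'rexml/document'

# Each line is one complete, self-contained XML record -- no wrapper element.
input = <<~RECORDS
  <record><title>First</title></record>
  <record><title>Second</title></record>
RECORDS

titles = []
input.each_line do |line|
  doc = REXML::Document.new(line)           # parse one small record at a time
  titles << doc.root.elements['title'].text
end
```

Only one record's worth of tree exists at any moment, which is why this scales to the 50GB files mentioned above.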
This is something I've dealt with. And for a variety of reasons, we
went with the streaming parser. I'm not sure about the quality of your
data, but we have to be prepared for seriously messed up data. There
was no way I was going to develop a process that would try to load a
15 million record [trimmed]
On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:
I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.
[trimmed]
How do you guys deal with large XML files? Thanks,
um ... I return ASCII tab-delimited records, [trimmed]
It is also worth noting that you can usually do SAX-style parsing in
most XML parsing libraries that are normally associated with DOM style
parsing and conveniences like XPath selectors. For example, Nokogiri
does SAX and it is *very* fast:
http://nokogiri.org/Nokogiri/XML/SAX/Document.html
[trimmed]
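Nokogiri's SAX mode works by subclassing Nokogiri::XML::SAX::Document and overriding callbacks such as start_element and characters. The sketch below shows the same callback pattern with Ruby's stdlib SAX2 parser so it runs without the gem; the element and attribute names are invented:

```ruby
require 'rexml/parsers/sax2parser'

xml = '<root><rec id="1"></rec><rec id="2"></rec><other></other></root>'
ids = []

parser = REXML::Parsers::SAX2Parser.new(xml)
# A block fires for every start tag; no tree is ever built.
parser.listen(:start_element) do |_uri, localname, _qname, attrs|
  ids << attrs['id'] if localname == 'rec'
end
parser.parse
```

Swapping in Nokogiri's SAX classes keeps the same shape but is considerably faster, per the post above.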
One way to get the best of both worlds (scalability of a streaming parser, but
convenience of DOM) is to use DOM4J's ElementHandler interface[1]. You parse
the XML file using a SAXReader, and register a class to handle callbacks, based
on an XPath expression. I used this approach to break up [trimmed]
If you're not averse to Java, the XOM XML library has a nice
NodeFactory class that you can override to control the processing of
the XML document. For instance, it will let you parse a very large
XML document like

<root>
  <rec>...</rec>
  <rec>...</rec>
  ...
</root>

only keeping one <rec> at a time in memory.
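XOM itself is Java, but the hybrid it enables (stream the file, yet hand each <rec> to tree-style code) can be sketched in Ruby by buffering each record's markup during a stream parse and re-parsing it as a tiny document. This illustrates the pattern, not XOM's API; attributes are dropped for brevity:

```ruby
require 'rexml/document'
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Streams the input, but buffers the raw markup of each <rec>...</rec>
# and re-parses that snippet as a small document, so only one record's
# worth of tree is ever in memory.
class RecSlicer
  include REXML::StreamListener
  attr_reader :docs

  def initialize
    @docs = []
    @buf = nil
  end

  def tag_start(name, _attrs)
    if name == 'rec'
      @buf = +'<rec>'
    elsif @buf
      @buf << "<#{name}>"
    end
  end

  def text(data)
    @buf << data if @buf
  end

  def tag_end(name)
    return unless @buf
    @buf << "</#{name}>"
    if name == 'rec'
      @docs << REXML::Document.new(@buf)
      @buf = nil
    end
  end
end

xml = '<root><rec><id>1</id></rec><rec><id>2</id></rec></root>'
slicer = RecSlicer.new
REXML::Parsers::StreamParser.new(xml, slicer).parse
```

Each entry in docs is a normal DOM-style document for one record, which the crosswalk code can then query with XPath as usual.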
Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
Actually I'm using Perl. For reasons not related to XML parsing, it is the
preferred (but not mandatory) language.
Based on a few tests and manual inspection, it looks like the ticket for me
is going to be a two-stage process. [trimmed]
*sigh* -- I kinda wish this whole discussion got captured in
http://libraries.stackexchange.com/ ...
Peter