Re: [CODE4LIB] Best way to process large XML files
On 09.06.2012 00:00, Kyle Banerjee wrote:
>> Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
>
> Actually I'm using perl. For reasons not related to XML parsing, it is
> the preferred (but not mandatory) language. Based on a few tests and
> manual inspection, it looks like the ticket for me is going to be a two
> stage process where the first stage converts the file to valid XML and
> the second cuts through it with SAX. [...]

Since you're using perl, I think you mean XML::Simple, which is a DOM parser. You also mentioned LibXML, and you're considering SAX parsing, so I assume you've only used DOM parsing so far? How about using an XML reader, kind of like SAX but a whole lot cleaner and easier; something like:

    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new( location => $filename_or_uri );
    while ( $reader->read ) {
        next unless $reader->name eq 'record'
                and $reader->nodeType == XML_READER_TYPE_ELEMENT;
        my $dom = XML::LibXML->load_xml( string => $reader->readOuterXml );
        # ...do something with the record element's DOM tree...
    }

Documentation: https://metacpan.org/module/XML::LibXML::Reader

HTH

--
Teemu Nuutinen, Digital Services, Helsinki University Library
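A small variant of Teemu's loop, sketched under the same assumption that the records are sibling <record> elements: XML::LibXML::Reader's nextElement can jump from one named element to the next instead of testing every node.

    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new( location => $filename_or_uri );
    while ( $reader->nextElement('record') ) {
        # readOuterXml returns the serialized <record> subtree, which we
        # re-parse into a small standalone DOM for convenient XPath work.
        my $dom = XML::LibXML->load_xml( string => $reader->readOuterXml );
        # ...do something with the record element's DOM tree...
    }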
Re: [CODE4LIB] Best way to process large XML files
When I need to deal with huge XML files, I use Perl's XML::Parser in stream mode. It's blazing fast, but I have to admit the code isn't very pretty.

There's also XML::LibXML::SAX [http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/SAX.pod], but I can't seem to find any substantive documentation on how it works. (If anyone has any sample code that uses this, I'd love to see it. Please e-mail me off-list as I don't want to derail this thread.)

Teemu's suggestion about XML::LibXML::Reader is definitely worth considering. I've never clocked it against XML::Parser, but it seems like it *should* be fast. And as Teemu demonstrated, it allows you to write nice compact code.

Ron

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <baner...@orbiscascade.org> wrote:
> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large. [...]
> How do you guys deal with large XML files?
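A minimal sketch of the kind of XML::LibXML::SAX sample Ron asks about, assuming the standard Perl SAX2 handler interface and a file of <record> elements (the element name is an assumption):

    use strict;
    use warnings;

    package RecordHandler;
    use base 'XML::SAX::Base';

    sub start_element {
        my ($self, $el) = @_;
        # $el->{Name} is the (possibly prefixed) element name
        $self->{in_record} = 1 if $el->{Name} eq 'record';
    }

    sub characters {
        my ($self, $data) = @_;
        $self->{text} .= $data->{Data} if $self->{in_record};
    }

    sub end_element {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'record') {
            # ...crosswalk the accumulated record text here...
            $self->{in_record} = 0;
            $self->{text}      = '';
        }
    }

    package main;
    use XML::LibXML::SAX;

    my $parser = XML::LibXML::SAX->new( Handler => RecordHandler->new );
    $parser->parse_uri($ARGV[0]);

The same handler class should work unchanged with any other XML::SAX-compliant parser, which makes it easy to benchmark them against each other.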
Re: [CODE4LIB] Best way to process large XML files
Steve,

I'm not sure if you were hoping for a ruby-related answer to your question (since you mentioned Nokogiri), but if you are, take a look at ruby-marc's GenericPullParser [1] as an example of using a SAX parser for this sort of thing. It doesn't quite answer your question, but I think it might provide some guidance.

Basically, I think you're still going to have to use the SAX parser to create record objects where you can build up your hierarchy logic and then simply move on to the next record if the conditions aren't met. Even though you'd still need to build your objects, I think streaming over the XML (and the constructed objects) will still be pretty fast and efficient.

-Ross.

1. https://github.com/ruby-marc/ruby-marc/blob/master/lib/marc/xml_parsers.rb#L27

On Fri, Jun 8, 2012 at 8:07 PM, Steve Meyer <steve.e.me...@gmail.com> wrote:
> It is also worth noting that you can usually do SAX-style parsing in most
> XML parsing libraries that are normally associated with DOM-style parsing
> and conveniences like XPath selectors. [...] how are you keeping track of
> those nesting and conditional rules?
Re: [CODE4LIB] Best way to process large XML files
On 09/06/12 06:36, Kyle Banerjee wrote:
> How do you guys deal with large XML files?

There have been a number of excellent suggestions from other people, but it's worth pointing out that sometimes low tech is all you need. I frequently use sed to do things such as replace one domain name with another when a website changes its URL.

Short for Stream EDitor, sed is a core part of POSIX and should be available on pretty much every UNIX-like platform imaginable. Because it processes its input as a stream, on non-trivial files it runs at essentially disk speed (i.e. about as fast as a naive file copy), and full regexp support is available.

    sed 's/www.example.net/example.com/gI' IN_FILE > OUT_FILE

will stream IN_FILE to OUT_FILE, replacing all instances of www.example.net with example.com.

cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] Best way to process large XML files
FWIW: I use sed all the time to edit XML files. I wouldn't say I have any really large files (which is why I didn't respond earlier), but it works great for me. Regular expressions are your friend.

--
Edward M. Corrado

On Jun 10, 2012, at 19:25, stuart yeates <stuart.yea...@vuw.ac.nz> wrote:
> There have been a number of excellent suggestions from other people, but
> it's worth pointing out that sometimes low tech is all you need. [...]
[CODE4LIB] Best way to process large XML files
I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large. I've thought of a number of ways to go about this, but I wanted to bounce this off the list since I'm sure people here deal with this problem all the time.

My goal is to make something that's easy to read/maintain without pegging the CPU and consuming too much memory. The performance and load I'm seeing from running the large files through LibXML and SimpleXML is completely unacceptable. SAX is not out of the question, but I'm trying to avoid it if possible to keep the code more compact and easier to read.

I'm tempted to stream-edit out all line breaks, since they occur in unpredictable places, and put new ones at the end of each record into a temp file. Then I can read the temp file one line at a time and process using SimpleXML. That way, there's no need to load giant files into memory, create huge arrays, etc., and the code would be easy enough for a 6th grader to follow. My proposed method doesn't sound very efficient to me, but it should consume predictable resources which don't increase with file size.

How do you guys deal with large XML files? Thanks,

kyle

<rant>Why the heck does the XML spec require a root element, particularly since large files usually consist of a large number of records/documents? This makes it absolutely impossible to process a file of any size without resorting to SAX or string parsing -- which takes away many of the advantages you'd normally have with an XML structure.</rant>

--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@uoregon.edu / baner...@orbiscascade.org / 503.999.9787
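Kyle's proposed pre-pass can be done as a true stream. A minimal sketch in Perl, assuming the records are <record> elements and that a closing tag never straddles an input line break (both assumptions):

    # Read line by line, drop the original newlines, and emit one
    # newline after each closing record tag. Note the root wrapper
    # ends up glued to the first and last records and needs trimming.
    perl -pe 'chomp; s{</record>}{</record>\n}g' big.xml > one_per_line.txt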
Re: [CODE4LIB] Best way to process large XML files
Saxon is really, really efficient with large files. I don't have any benchmark stats available, but I have gotten noticeably better performance from Saxon/XSLT2 than from PHP with DOMDocument or SimpleXML, or from nokogiri and hpricot in Ruby.

Ethan

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <baner...@orbiscascade.org> wrote:
> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large. [...]
> How do you guys deal with large XML files?
Re: [CODE4LIB] Best way to process large XML files
I would really consider SAX. In MarcEdit, I had originally utilized an XSLT process for handling MARCXML translations (using both SAXON and MSXML parsers), but as you noticed, there ends up being an upper limit to what you can process.

The break point for me came while working with some researchers experimenting with data from the HathiTrust: they had a 32 GB XML file of MARCXML that needed to be processed. Using the DOM model, the process was untenable. Re-working the code so that it was SAX based required building, to some degree, the same type of templating to react to specific elements and nested elements, but it shifted processing time so that it took ~8 minutes to translate those 32 GBs of MARCXML data into MARC (and allowed me to include code that handled some common issues related to field length, etc. at the point of translation).

Not knowing what your XML files look like, my guess is that if you do it right, you can template your SAX code in such a way that the actual processing code is smaller and much more efficient than anything you could create using a DOM method.

--tr

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large. [...] How do you guys deal with large XML files?
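One way to express the templating Terry describes in Perl terms (a sketch assuming an XML::SAX-style handler; the element and attribute names are invented for illustration) is a dispatch table mapping element names to small handler subs, so the event methods themselves stay tiny:

    package TemplatedHandler;
    use strict;
    use warnings;
    use base 'XML::SAX::Base';

    # What to do when a given element opens or closes.
    my %on_start = (
        record    => sub { my ($self) = @_; $self->{rec} = {} },
        datafield => sub { my ($self, $el) = @_;
                           $self->{tag} = $el->{Attributes}{'{}tag'}{Value} },
    );
    my %on_end = (
        record => sub { my ($self) = @_; crosswalk($self->{rec}) },
    );

    sub start_element {
        my ($self, $el) = @_;
        my $h = $on_start{ $el->{Name} };
        $h->($self, $el) if $h;
    }

    sub end_element {
        my ($self, $el) = @_;
        my $h = $on_end{ $el->{Name} };
        $h->($self, $el) if $h;
    }

    sub crosswalk { }    # stub for the actual translation logic

    1;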
Re: [CODE4LIB] Best way to process large XML files
I create 50GB files of marcxml all the time. We do NOT put a wrapper element around them, but we do put a line feed at the end of each record. Then a trivial line-reading loop in java/perl/whatever can read those records individually and process them appropriately. That turns out to be the right way to do things in Hadoop too.

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Friday, June 08, 2012 2:36 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Best way to process large XML files

[...] <rant>Why the heck does the XML spec require a root element, particularly since large files usually consist of a large number of records/documents? This makes it absolutely impossible to process a file of any size without resorting to SAX or string parsing -- which takes away many of the advantages you'd normally have with an XML structure.</rant>
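The trivial line-reading loop Ralph describes, sketched in Perl (assuming one well-formed record per line, as he sets up; the filename is an example):

    use strict;
    use warnings;
    use XML::LibXML;

    open my $fh, '<:encoding(UTF-8)', 'records.marcxml'
        or die "can't open: $!";

    while (my $line = <$fh>) {
        next unless $line =~ /\S/;    # skip blank lines
        # Each line is a complete record document, so memory use
        # stays flat no matter how big the file gets.
        my $doc = XML::LibXML->load_xml( string => $line );
        # ...process $doc...
    }
    close $fh;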
Re: [CODE4LIB] Best way to process large XML files
This is something I've dealt with, and for a variety of reasons we went with the streaming parser. I'm not sure about the quality of your data, but we have to be prepared for seriously messed up data. There was no way I was going to develop a process that would try to load a 15 million record file where the whole process could fail at record 14 million due to a syntax or encoding error. Nope, nope, nope.

So, not only do we use a streaming parser, it's a two stage streaming parser. First, we have a stage that finds record boundaries and creates a well formed version of each record. Then the parser for the actual record is called to extract the data for crosswalking.

/dev
--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <baner...@orbiscascade.org> wrote:
> How do you guys deal with large XML files?

--
Sent from my GMail account.
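A sketch of the failure isolation Devon is after, assuming stage one has already carved out the raw text of a single record: wrap the per-record parse in an eval so one bad record gets logged and skipped instead of killing a 15 million record run.

    use strict;
    use warnings;
    use XML::LibXML;

    # $chunk holds the raw text of one record, however it was carved out.
    sub parse_record {
        my ($chunk, $recno) = @_;
        my $doc = eval { XML::LibXML->load_xml( string => $chunk ) };
        unless ($doc) {
            warn "record $recno failed to parse: $@";
            return;    # skip it and keep going
        }
        # ...crosswalk $doc...
        return $doc;
    }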
Re: [CODE4LIB] Best way to process large XML files
On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:
> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large.
> [trimmed]
> How do you guys deal with large XML files? Thanks,

um ... I return ASCII tab-delim records, because IDL's XML processing routines have some massive issue with garbage collection if you walk down the DOM tree. However, no one in their right mind uses IDL for XML, as it's basically Fortran w/ multi-dimensional arrays.

...

Everyone else is going to tell you to use SAX, and they're probably right, but since you sound as reluctant as I am about using SAX, another alternative may be Perl's XML::Twig: http://search.cpan.org/perldoc?XML::Twig

-Joe
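A minimal sketch of the usual XML::Twig pattern (the 'record' element name is an assumption): register a handler per element and purge what you have already processed, so memory stays flat on arbitrarily large files.

    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once per <record>, as soon as it is fully parsed.
            record => sub {
                my ($t, $rec) = @_;
                # ...crosswalk $rec, e.g. $rec->field('title')...
                $t->purge;    # free everything parsed so far
            },
        },
    );
    $twig->parsefile('big.xml');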
Re: [CODE4LIB] Best way to process large XML files
It is also worth noting that you can usually do SAX-style parsing in most XML parsing libraries that are normally associated with DOM-style parsing and conveniences like XPath selectors. For example, Nokogiri does SAX and it is *very* fast:

http://nokogiri.org/Nokogiri/XML/SAX/Document.html

As a related question: when folks do SAX-style parsing and need to select highly conditional and deeply nested elements (think getting MODS title data only when a parent element's attribute matches a condition and it is all nested in a big METS wrapper), how are you keeping track of those nesting and conditional rules? I have relied on using a few booleans that get set and unset to track state, but it often feels sloppy.

-steve

On Fri, Jun 8, 2012 at 2:41 PM, Ethan Gruber <ewg4x...@gmail.com> wrote:
> but I have gotten noticeably better performance from Saxon/XSLT2 than
> PHP with DOMDocument or SimpleXML or nokogiri and hpricot in Ruby.
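One common alternative to ad-hoc booleans (a sketch, not something proposed in the thread): keep a stack of open element names, so the current nesting is always available as a single path string you can match against. The MODS-in-METS path below is hypothetical.

    package PathHandler;
    use strict;
    use warnings;
    use base 'XML::SAX::Base';

    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{path} }, $el->{Name};
        my $where = join '/', @{ $self->{path} };
        # Deeply nested conditional selection becomes one pattern match.
        if ($where =~ m{mods:mods/mods:titleInfo/mods:title$}) {
            $self->{grab_title} = 1;
        }
    }

    sub characters {
        my ($self, $data) = @_;
        $self->{title} .= $data->{Data} if $self->{grab_title};
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->{grab_title} = 0 if $el->{Name} eq 'mods:title';
        pop @{ $self->{path} };
    }

    1;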
Re: [CODE4LIB] Best way to process large XML files
One way to get the best of both worlds (the scalability of a streaming parser, but the convenience of DOM) is to use DOM4J's ElementHandler interface [1]. You parse the XML file using a SAXReader and register a class to handle callbacks, based on an XPath expression. I used this approach to break up giant MARCXML files with hundreds of thousands of records.

This approach does require the XML to be well-formed, though. I had some problems with that, and wound up pre-processing the MARCXML to strip out illegal characters so they wouldn't cause parsing errors.

1. http://dom4j.sourceforge.net/dom4j-1.6.1/apidocs/org/dom4j/ElementHandler.html

-Esme
--
Esme Cowles <escow...@ucsd.edu>

The wages of sin is death but so is the salary of virtue, and at least the evil get to go home early on Fridays. -- Terry Pratchett, Witches Abroad

On 06/8/2012, at 2:36 PM, Kyle Banerjee wrote:
> How do you guys deal with large XML files?
Re: [CODE4LIB] Best way to process large XML files
If you're not averse to Java, the XOM XML library has a nice NodeFactory class that you can override to control the processing of the XML document. For instance, it will let you parse a very large XML document like

    <root>
      <rec>...</rec>
      <rec>...</rec>
      ...
    </root>

while keeping only one rec at a time in memory. You control the node building process, so you can throw away the ones you're done with. It's friendlier than SAX, and it's what I use for processing very large documents.

Cf. http://www.xom.nu/apidocs/nu/xom/NodeFactory.html

Kevin

On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee <baner...@orbiscascade.org> wrote:
> How do you guys deal with large XML files?
Re: [CODE4LIB] Best way to process large XML files
Since you mentioned SimpleXML, Kyle, I assume you're using PHP? If so, you might look at XMLReader [1], which is a pull parser and should give you better performance on large files than SimpleXML. It is still based on libxml, though, so if that is still not fast enough for you, you can toss out my suggestion. :-)

--Dave

[1] http://php.net/manual/en/book.xmlreader.php

-
David Walker
Interim Director, Systemwide Digital Library Services
California State University
562-355-4845

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Kyle Banerjee
Sent: Friday, June 08, 2012 11:36 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Best way to process large XML files

I'm working on a script that needs to be able to crosswalk at least a couple hundred XML files regularly, some of which are quite large. [...] How do you guys deal with large XML files?
Re: [CODE4LIB] Best way to process large XML files
> Since you mentioned SimpleXML, Kyle, I assume you're using PHP?

Actually I'm using perl. For reasons not related to XML parsing, it is the preferred (but not mandatory) language.

Based on a few tests and manual inspection, it looks like the ticket for me is going to be a two stage process where the first stage converts the file to valid XML and the second cuts through it with SAX. Originally I was trying to avoid SAX, but the process has been prettier than expected so far. The XML has not been prettier than expected -- it contains a number of issues including outright invalid XML, invalid characters, and hand coded HTML within some elements (i.e. string data not encoded as such). Gotta love library data.

But screwed up stuff is employment security. If things actually worked, I'd be redundant...

kyle
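One common piece of that first "convert to valid XML" stage (an illustration, not Kyle's actual code) is stripping characters that are illegal in XML 1.0 before the parser ever sees them:

    use strict;
    use warnings;

    # Keep only characters in the XML 1.0 Char production:
    # tab, LF, CR, and the legal Unicode ranges.
    sub strip_illegal_xml_chars {
        my ($text) = @_;
        $text =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
        return $text;
    }

Run over each decoded chunk of input, this takes care of the invalid-character class of errors; structural problems (unclosed tags, raw HTML in element content) still need their own handling.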
Re: [CODE4LIB] Best way to process large XML files
*sigh* -- I kinda wish this whole discussion got captured in http://libraries.stackexchange.com/ ...

Peter

On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:
> I'm working on a script that needs to be able to crosswalk at least a
> couple hundred XML files regularly, some of which are quite large. [...]
> How do you guys deal with large XML files?

--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
peter.mur...@lyrasis.org
+1 678-235-2955
1438 West Peachtree Street NW, Suite 200
Atlanta, GA 30309
Toll Free: 800.999.8558
Fax: 404.892.7879
www.lyrasis.org

LYRASIS: Great Libraries. Strong Communities. Innovative Answers.