Innovation in Libraries 2012
A free post-conference event after the LITA Forum
Invitation and Call for Proposals
Do you love exploring new ideas? Always secretly wished you knew more
about how to create an app? Wonder what the next wave of library
innovation might be?
If you answered yes, then
Hello Silicon Sorcerers,
I was just wondering if there have been any efforts from Code4Lib into
MediaWiki development? I know that there have been some Wikipedia
templates and bots designed to interface with library services. Yet what
about cold hard MediaWiki extensions? Has there been any
I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.
I've thought of a number of ways to go about this, but I wanted to bounce
this off the list since I'm sure people here deal with this problem all the
time. My
Saxon is really, really efficient with large files. I don't really have
any benchmark stats available, but I have gotten noticeably better
performance from Saxon/XSLT2 than from PHP with DOMDocument or SimpleXML, or
nokogiri and hpricot in Ruby.
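For what it's worth, driving Saxon from Java through the s9api interface is
only a handful of lines; roughly like this (untested sketch, the file names
are placeholders):

import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class RunCrosswalk {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(false);   // false = Saxon-HE
        XsltCompiler compiler = processor.newXsltCompiler();
        // Compile the stylesheet once and reuse it for every input file.
        XsltExecutable stylesheet =
            compiler.compile(new StreamSource(new File("crosswalk.xsl")));
        XsltTransformer transformer = stylesheet.load();
        transformer.setSource(new StreamSource(new File("input.xml")));
        transformer.setDestination(processor.newSerializer(new File("output.xml")));
        transformer.transform();
    }
}

Compiling the stylesheet once and reusing it across a couple hundred files
saves a lot of the per-file overhead.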
Ethan
On Fri, Jun 8, 2012 at 2:36 PM, Kyle Banerjee
I would really consider SAX. In MarcEdit, I had originally utilized an XSLT
process for handling MARCXML translations (using both SAXON and MSXML parsers)
-- but as you noticed -- there ends up being an upper limit to what you can
process. The break point for me was when working with some
*** Apologies for the cross-posting ***
At the IG meeting we will also be looking for an IG chair volunteer and
discussing any programs the IG members are interested in organizing/offering
at the next ALA Annual Conference. More details:
http://connect.ala.org/node/176080
LITA Mobile Computing
I create 50 GB files of MARCXML all the time. We do NOT put a wrapper
element around them, but we do put a line feed at the end of each record.
Then a trivial line-reading loop in Java/Perl/whatever can read those
records individually and process them appropriately.
That turns out to be the right
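A rough Java version of that loop might look like this (sketch; the file
name and the process() call are placeholders for whatever crosswalk you run
per record):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineDelimitedMarcXml {
    public static void main(String[] args) throws IOException {
        // Each line is one complete <record>...</record>; there is no wrapper element.
        try (BufferedReader in = new BufferedReader(new FileReader("records.marcxml"))) {
            String line;
            long count = 0;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                process(line);   // hand the single-record XML string to any parser you like
                count++;
            }
            System.err.println(count + " records processed");
        }
    }

    private static void process(String recordXml) {
        // placeholder: parse recordXml, crosswalk it, write the output
    }
}

Memory use stays flat no matter how big the file gets, since only one record
is ever in memory.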
This is something I've dealt with. And for a variety of reasons, we
went with the streaming parser. I'm not sure about the quality of your
data, but we have to be prepared for seriously messed up data. There
was no way I was going to develop a process that would try to load a
15 million record
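For anyone who hasn't written one before, the Java SAX skeleton for that
kind of streaming pass is roughly the following (sketch; the "record"
element name is an assumption about the input, not from the original post):

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.File;

public class RecordCounter extends DefaultHandler {
    private long records = 0;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("record".equals(local)) {
            text.setLength(0);          // start collecting a new record
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("record".equals(local)) {
            records++;                  // crosswalk/emit one record here, then forget it
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        RecordCounter handler = new RecordCounter();
        factory.newSAXParser().parse(new File(args[0]), handler);
        System.out.println(handler.records + " records");
    }
}

Nothing is kept in memory beyond the record currently being handled.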
On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:
I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.
[trimmed]
How do you guys deal with large XML files? Thanks,
um ... I return ASCII tab-delim records,
It is also worth noting that you can usually do SAX-style parsing in
most XML parsing libraries that are normally associated with DOM-style
parsing and conveniences like XPath selectors. For example, Nokogiri
does SAX and it is *very* fast:
http://nokogiri.org/Nokogiri/XML/SAX/Document.html
As a
One way to get the best of both worlds (the scalability of a streaming parser with
the convenience of DOM) is to use DOM4J's ElementHandler interface[1]. You parse
the XML file using a SAXReader, and register a class to handle callbacks, based
on an XPath expression. I used this approach to break up
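The basic shape is something like this (sketch; the /collection/record path
is just an example, not from the original post):

import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;
import java.io.File;

public class RecordSplitter {
    public static void main(String[] args) throws Exception {
        SAXReader reader = new SAXReader();
        // Fire a callback every time a /collection/record element is completed.
        reader.addHandler("/collection/record", new ElementHandler() {
            public void onStart(ElementPath path) {
                // nothing needed at the start tag
            }
            public void onEnd(ElementPath path) {
                Element record = path.getCurrent();
                // ... crosswalk the record here ...
                record.detach();   // prune it so the in-memory tree stays small
            }
        });
        reader.read(new File(args[0]));
    }
}

The detach() call is the important part; without it the tree keeps growing
as the file is read.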
If you're not averse to Java, the XOM XML library has a nice
NodeFactory class that you can override to control the processing of
the XML document. For instance, it will let you parse a very large
XML document like

  <root>
    <rec>...</rec>
    <rec>...</rec>
    ...
  </root>

while keeping only one rec at a time in memory.
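Something along these lines (sketch; assumes the rec elements above):

import nu.xom.Builder;
import nu.xom.Element;
import nu.xom.NodeFactory;
import nu.xom.Nodes;
import java.io.File;

public class StreamingRecordFactory extends NodeFactory {
    private final Nodes empty = new Nodes();

    @Override
    public Nodes finishMakingElement(Element element) {
        if ("rec".equals(element.getLocalName())) {
            // ... crosswalk this one rec here ...
            return empty;   // discard it so only one rec is ever held in memory
        }
        return super.finishMakingElement(element);
    }

    public static void main(String[] args) throws Exception {
        Builder builder = new Builder(new StreamingRecordFactory());
        builder.build(new File(args[0]));
    }
}

Returning an empty Nodes from finishMakingElement() is what keeps the tree
from growing.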
Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
If so, you might look at XMLReader [1], which is a pull parser, and should give
you better performance on large files than SimpleXML.
It is still based on libxml, though, so if that is still not fast enough for
you, you can
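For the Java-inclined, the same pull-parsing pattern looks roughly like this
with StAX (not PHP's XMLReader, just the analogous idea; the "record"
element name is assumed):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class PullParseDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(args[0]));
        long records = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) {
                records++;   // pull out whatever fields you need here
            }
        }
        reader.close();
        System.out.println(records + " records");
    }
}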
The Digital Services Librarian provides expertise in creating and managing
library digital collections, such as digital special collections, electronic
theses and dissertations, and other born-digital or retrospectively digitized
materials. This position assumes primary responsibility for the
Since you mentioned SimpleXML, Kyle, I assume you're using PHP?
Actually I'm using Perl. For reasons not related to XML parsing, it is the
preferred (but not mandatory) language.
Based on a few tests and manual inspection, it looks like the ticket for me
is going to be a two-stage process
*sigh* -- I kinda wish this whole discussion got captured in
http://libraries.stackexchange.com/ ...
Peter
On Jun 8, 2012, at 2:36 PM, Kyle Banerjee wrote:
I'm working on a script that needs to be able to crosswalk at least a
couple hundred XML files regularly, some of which are quite large.
One tangent that I know about is the Memento work:
https://www.mediawiki.org/wiki/Extension:Memento
Peter
On Jun 8, 2012, at 2:18 PM, Klein, Max wrote:
Hello Silicon Sorcerers,
I was just wondering if there have been any efforts from Code4Lib into
MediaWiki development? I know that