[CODE4LIB] XML split and transform in Java

Tod Olson Sun, 08 Sep 2013 09:30:32 -0700

code4lib,

I'm looking for some advice on splitting and transforming XML data using Java. 
The context is writing a mixin for SolrMARC to enhance our bib data, bringing 
in table of contents and summary data. The data is in XML, isomorphic to 
MARCXML. I need to split it up, transform it, and store it for use at import 
time. I expect the input XML to be up to a few GB, so slurping the whole thing 
into a DOM seems questionable. I've done one implementation for a split-only 
version of the problem, but the transform requirement is causing me to 
re-think.


And maybe someone out there has already done this exact thing.

I think the basic approach is to read a record from start tag to end tag, and 
create a reader or stream that hands exactly that record to the transform 
API. There are lots of options for this: SAX, StAX events, and so on. Any 
thoughts on which seems the most straightforward for this split-and-transform 
scenario would be welcome.
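
For concreteness, here is a minimal sketch of the approach described above, 
using StAX events to copy one record at a time into a small buffer and then 
handing that buffer to a JAXP Transformer. The names RecordSplitter, 
RecordSink, and splitAndTransform are hypothetical, made up for illustration. 
It assumes records are matched by local element name and that each record 
fits comfortably in memory. Treat it as a starting point, not a tested 
SolrMARC mixin.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Namespace;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: streams a large XML file and transforms one record at a time.
class RecordSplitter {

    // Hypothetical callback that receives each transformed record as a string.
    interface RecordSink {
        void accept(String transformed) throws Exception;
    }

    static int splitAndTransform(Reader in, String recordName,
                                 Transformer transformer, RecordSink sink) throws Exception {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // The data is assumed not to need DTDs or external entities.
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        inputFactory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLEventReader reader = inputFactory.createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();

        // Namespace declarations seen on elements outside any record (e.g. the root).
        // Simplification: declarations are never dropped when those elements close.
        Map<String, Namespace> inherited = new LinkedHashMap<>();
        int count = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!event.isStartElement()) {
                continue;
            }
            StartElement start = event.asStartElement();
            if (!start.getName().getLocalPart().equals(recordName)) {
                collectNamespaces(start, inherited);
                continue;
            }

            // Re-declare inherited namespaces on the record so it stands alone.
            Map<String, Namespace> scope = new LinkedHashMap<>(inherited);
            collectNamespaces(start, scope);
            StartElement standalone = eventFactory.createStartElement(
                    start.getName(), start.getAttributes(), scope.values().iterator());

            // Copy events until the record's matching end tag.
            StringWriter buffer = new StringWriter();
            XMLEventWriter writer = outputFactory.createXMLEventWriter(buffer);
            writer.add(standalone);
            int depth = 1;
            while (depth > 0) {
                XMLEvent inner = reader.nextEvent();
                if (inner.isStartElement()) {
                    depth++;
                } else if (inner.isEndElement()) {
                    depth--;
                }
                writer.add(inner);
            }
            writer.close();

            // Hand exactly this record to the transform API.
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(buffer.toString())),
                                  new StreamResult(result));
            sink.accept(result.toString());
            count++;
        }
        reader.close();
        return count;
    }

    private static void collectNamespaces(StartElement element, Map<String, Namespace> into) {
        Iterator<?> it = element.getNamespaces();
        while (it.hasNext()) {
            Namespace ns = (Namespace) it.next();
            into.put(ns.getPrefix(), ns);
        }
    }
}
```

The Transformer would typically come from 
TransformerFactory.newInstance().newTransformer(new StreamSource(stylesheet)), 
and the sink is where each transformed record would be written to whatever 
store is chosen. Only one record is buffered at a time, so memory use should 
stay roughly constant regardless of input size.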

On a related note, recommendations for your favorite lightweight persistent 
key/value store for Java would be welcome. I expect the data to be a little 
large for a serialized HashMap.

Best,

-Tod


Tod Olson <[email protected]>
Systems Librarian     
University of Chicago Library
