Hello,

I've got big XML files that I'd like to process with Hadoop. Each file is basically just a collection of records, and I'd like to apply a map to each record. It's something like:

<file>
  <record>
    <foo>bar</foo>
    <foo2>bar2</foo2>
  </record>
  <record>
  </record>
  ....
</file>

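To make the question concrete, here is a rough sketch of the per-record processing I have in mind. It is plain Python and not tied to any Hadoop API. It assumes records are delimited by literal <record> and </record> tags, with no attributes on the opening tag and no nested records. The names iter_records and map_record are hypothetical, made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Assumed record delimiters: literal tags, no attributes, no nesting.
START, END = "<record>", "</record>"


def iter_records(stream, chunk_size=4096):
    """Yield each <record>...</record> fragment found in a text stream."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        while True:
            start = buf.find(START)
            if start == -1:
                # No start tag yet: keep only a tail that could still
                # contain the beginning of a partially read start tag.
                buf = buf[-(len(START) - 1):]
                break
            end = buf.find(END, start)
            if end == -1:
                # Record is incomplete: drop text before it, read more.
                buf = buf[start:]
                break
            end += len(END)
            yield buf[start:end]
            buf = buf[end:]
        if not chunk:
            # End of input; any unfinished record is discarded.
            break


def map_record(fragment):
    """Parse one record fragment and return (tag, text) for each child."""
    elem = ET.fromstring(fragment)
    return [(child.tag, child.text) for child in elem]


if __name__ == "__main__":
    sample = ("<file> <record> <foo>bar</foo> <foo2>bar2</foo2> </record> "
              "<record> </record> </file>")
    for fragment in iter_records(io.StringIO(sample)):
        print(map_record(fragment))
```

Splitting on fixed tags keeps memory use bounded by the size of one record, which is the property I care about for big files. An XPath-driven reader would be more general, but I don't know whether one exists.
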
Is Hadoop streaming what I want? For my simple case, it seems like it would work. It doesn't really bother me whether my map task is an external program or a Java class that implements Mapper (though I imagine implementing Mapper is faster than firing up the JVM or perl binary for each record).

Is there another RecordReader that reads XML records on given boundaries? I think the ideal API for me would let me specify an XPath expression to define the records I want to read; my map class would then use an XML parser to process each <record></record> fragment.

Thanks!
-Erik
