Hello,

I've got big XML files that I'd like to process with Hadoop. Each
file is basically just a collection of records, and I'd like to apply
a map to each record. A file looks something like this:
<file>
  <record>
   <foo>bar</foo>
   <foo2>bar2</foo2>
  </record>
  <record>
  </record>
  ....
</file>



Is Hadoop Streaming what I want? For my simple case, it seems like it
would work. It doesn't really bother me whether my map task is an
external program or a Java class that implements Mapper (though I
imagine implementing Mapper is faster than firing up a JVM or Perl
binary for each record).
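
To make the streaming option concrete, here is a rough sketch of the
kind of external map program I have in mind. It assumes the file
contents arrive on stdin as plain text and that every record starts
with exactly <record> and ends with </record> (no attributes on the
tag, no nesting). The names iter_records and main are just mine, not
part of any Hadoop API.

```python
import sys
from typing import Iterable, Iterator

# Assumed record delimiters, taken from the sample file above.
START_TAG = "<record>"
END_TAG = "</record>"


def iter_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each <record>...</record> fragment found in a stream of text chunks."""
    buffer = ""
    for line in lines:
        buffer += line
        while True:
            start = buffer.find(START_TAG)
            if start == -1:
                # No start tag yet: keep only a tail that could still be
                # the beginning of a start tag split across two chunks.
                buffer = buffer[-(len(START_TAG) - 1):]
                break
            end = buffer.find(END_TAG, start)
            if end == -1:
                # Record is incomplete: drop the text before it and wait for more input.
                buffer = buffer[start:]
                break
            end += len(END_TAG)
            yield buffer[start:end]
            buffer = buffer[end:]


def main(stream_in=sys.stdin, stream_out=sys.stdout) -> None:
    """Write one flattened record per line as key<TAB>value (the key is a placeholder)."""
    for fragment in iter_records(stream_in):
        stream_out.write("record\t" + " ".join(fragment.split()) + "\n")
```

A script built from this would simply call main() at the bottom. The
sketch does nothing about a record that straddles two input splits,
which is part of what I'm unsure about.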

Is there another RecordReader that reads XML records on given boundaries?
I think the ideal API for me is to be able to specify an XPath expression
to define the records I want to read, and then I guess in my map class
have an XML parser to process each <record></record> fragment.
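
Roughly, the API I'm picturing would behave like the sketch below. It
uses Python's xml.etree.ElementTree, which only supports a limited
subset of XPath, and it parses the whole document in memory, so it
only illustrates the shape of the interface and would not work as-is
on my big files. read_records and map_record are hypothetical names
of my own.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def read_records(xml_text: str, record_path: str = "./record") -> Iterator[str]:
    """Yield the serialized fragment for each element matching record_path."""
    root = ET.fromstring(xml_text)
    for element in root.findall(record_path):
        yield ET.tostring(element, encoding="unicode")


def map_record(fragment: str) -> Dict[str, str]:
    """Parse one record fragment and return its child tags mapped to their text."""
    record = ET.fromstring(fragment)
    return {child.tag: (child.text or "").strip() for child in record}


# Example input mirroring the sample file above.
SAMPLE = """<file>
  <record>
   <foo>bar</foo>
   <foo2>bar2</foo2>
  </record>
  <record>
  </record>
</file>"""

mapped = [map_record(fragment) for fragment in read_records(SAMPLE)]
```

Here the "record reader" part only decides where records begin and
end, and the map step gets a small, self-contained XML fragment to
parse however it likes.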

Thanks!

-Erik

