Alan,

On Mon, Dec 10, 2007 at 01:12:28AM -0800, Alan Ho wrote:
>I've written a xml input splitter based on a Stax parser. Its much better than
>StreamXMLRecordReader
>
We'd definitely like to see something like this in Hadoop. Would you mind contributing it?

Details: http://wiki.apache.org/lucene-hadoop/HowToContribute

thanks,
Arun

>----- Original Message ----
>From: Peter Thygesen <[EMAIL PROTECTED]>
>To: [email protected]
>Sent: Monday, November 26, 2007 8:49:52 AM
>Subject: MapReduce Job on XML input
>
>I would like to run some mapReduce jobs on some xml files I got (aprox.
>100000 compressed files).
>The XML files are not that big about 1 Mb compressed, each containing
>about 1000 records.
>
>Do I have to write my own InputSplitter? Should I use
>MultiFileInputFormat or StreamInputFormat? Can I use the
>StreamXmlRecordReader, and how? By sub-classing some input class?
>
>The tutorials and examples I've read are all very straight forward
>reading simple text files, but I miss a more complex example,
>especially one that reads xml files ;)
>
>thx.
>Peter
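
For anyone reading this thread later: Alan's splitter was not posted here, so the following is only a rough sketch of the general StAX idea, not his code and not a Hadoop API. It pulls the text of each record element out of an XML stream using the standard javax.xml.stream classes. The class name StaxRecordSplitter, the method splitRecords, and the record tag name are all made up for illustration, and the sketch assumes that record elements do not nest inside each other. Wiring something like this into a Hadoop RecordReader is left out.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Hadoop: splits an XML stream into records.
public class StaxRecordSplitter {

    // Returns the concatenated, trimmed text content of every <recordTag> element.
    // Assumption: <recordTag> elements are not nested inside one another.
    public static List<String> splitRecords(Reader in, String recordTag)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null only while inside a record

        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && recordTag.equals(reader.getLocalName())) {
                    // A new record begins: start collecting its text.
                    current = new StringBuilder();
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)
                        && current != null) {
                    // Text inside the record, including text of child elements.
                    current.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && recordTag.equals(reader.getLocalName())
                        && current != null) {
                    // The record ends: emit it and stop collecting.
                    records.add(current.toString().trim());
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
        return records;
    }
}
```

Calling StaxRecordSplitter.splitRecords(new java.io.StringReader(xml), "record") on a document whose records are wrapped in <record> elements gives one string per record. Because StAX is a pull parser, the same loop could hand records out one at a time instead of collecting them in a list, which is closer to what a record reader would need.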
