We tried using the Hadoop streaming XML format a while ago and it didn't quite go as expected. I don't remember why, but it gave some weird results: it would miss some records, get to 98% complete and then stop, and so on.
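
For context, my understanding is that the streaming XML record reader is given a begin string and an end string, and everything between them is treated as one record for the map task. Here is a rough sketch of that idea in plain Python. It is not the Hadoop code; the function name `split_records` and the `<doc>` tags are made up for illustration.

```python
def split_records(data, begin, end):
    """Yield each chunk of data running from a begin marker through its end marker."""
    pos = 0
    while True:
        start = data.find(begin, pos)
        if start == -1:
            return  # no more records
        stop = data.find(end, start + len(begin))
        if stop == -1:
            return  # unterminated trailing record is dropped in this sketch
        stop += len(end)
        yield data[start:stop]
        pos = stop


# Hypothetical input: two <doc> records, the second spread over several lines.
sample = "<docs><doc><id>1</id></doc>\n<doc>\n<id>2</id>\n</doc></docs>"
records = list(split_records(sample, "<doc>", "</doc>"))
```

Each element of `records` is one whole XML fragment, however many lines it spans. The sketch ignores the fact that a real record reader also has to cope with records that straddle an input split boundary.
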
The Mahout project also has an XmlInputFormat [1] that we ended up using. I also posted something on my blog about it all [2], and a little about my understanding (so far) of input formats and record readers etc.

Hope that helps,
Paul

1. http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
2. http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On 13 Jul 2010, at 12:26, Shuja Rehman wrote:

> Hi Khaled,
> XML files can be processed using hadoop streaming. check out the following
> link.
>
> http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>
> Regards
> Shuja
>
> On Tue, Jul 13, 2010 at 2:24 PM, edward choi <[email protected]> wrote:
>
>> Khaled,
>>
>> Hadoop mapreduce innately takes in file line by line.
>> XML files are not comprised of single lines.
>> So you will have to pack a single xml document into a single line.
>> Or you can make your own input format, which you need to refer to a guide
>> book.
>>
>> 2010/7/13 Khaled BEN BAHRI <[email protected]>
>>
>>> Hello to all
>>>
>>> I'm novice in working with mapreduce and i'm developping a mapreduce
>>> function that take xml documents as inputs.
>>>
>>> How can i make input files and precise it to the map function
>>>
>>> Thanks for help
>>>
>>> Best regards
>>> Khaled
>>>
>>>
>>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> _________________________________
> MS CS - School of Science and Engineering
> Lahore University of Management Sciences (LUMS)
> Sector U, DHA, Lahore, 54792, Pakistan
> Cell: +92 3214207445
