We tried using the Hadoop Streaming XML format a while ago and it didn't quite 
go as expected. I don't remember exactly why, but it gave some weird results: 
leaving some records out, getting to 98% complete and then stopping, etc.

The Mahout project also has an XmlInputFormat [1] that we ended up using. I 
also posted something on my blog about it all [2], and a little about my 
understanding (so far) of input formats and record readers etc.
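
As a rough illustration of what a tag-based XML record reader does, here is a
minimal sketch in plain Python. It is not the Mahout code and does not use the
Hadoop API; the function name, its parameters, and the <record> tag in the
usage note are made up for the example. It scans a byte stream for a start tag
and the matching end tag and hands back everything in between as one record,
so a record can span many lines. It ignores input splits, which a real input
format has to handle.

```python
import io
from typing import BinaryIO, Iterator


def iter_xml_records(stream: BinaryIO, start_tag: bytes, end_tag: bytes,
                     chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield each start_tag ... end_tag span found in stream, in order.

    Hypothetical helper for illustration only. Tags are matched as raw byte
    strings, so nested records with the same tag are not supported.
    """
    if not start_tag or not end_tag:
        raise ValueError("start_tag and end_tag must be non-empty")

    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk

        # Emit every complete record currently sitting in the buffer.
        while True:
            start = buf.find(start_tag)
            if start < 0:
                # No start tag yet: keep only a tail that could be the
                # beginning of a start tag split across two chunks.
                keep = len(start_tag) - 1
                buf = buf[-keep:] if keep > 0 else b""
                break
            end = buf.find(end_tag, start + len(start_tag))
            if end < 0:
                # Record has started but not finished: drop the bytes
                # before it and wait for more data.
                buf = buf[start:]
                break
            stop = end + len(end_tag)
            yield buf[start:stop]
            buf = buf[stop:]

        if not chunk:
            # End of stream. A record that was opened but never closed
            # is silently dropped.
            return
```

Called as iter_xml_records(io.BytesIO(data), b"<record>", b"</record>"), it
yields one bytes object per <record> element. The same loop could in principle
be fed sys.stdin.buffer, but only if each task sees whole records.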

Hope that helps,
Paul

1. http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
2. http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On 13 Jul 2010, at 12:26, Shuja Rehman wrote:

> Hi Khaled,
> XML files can be processed using Hadoop Streaming. Check out the following
> link.
> 
> http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
> 
> Regards
> Shuja
> 
> On Tue, Jul 13, 2010 at 2:24 PM, edward choi <[email protected]> wrote:
> 
>> Khaled,
>> 
>> By default, Hadoop MapReduce reads input files line by line.
>> XML documents usually span more than one line.
>> So you will have to pack each XML document into a single line.
>> Or you can write your own input format, for which you will need to refer
>> to a guide book.
>> 
>> 2010/7/13 Khaled BEN BAHRI <[email protected]>
>> 
>>> Hello to all
>>> 
>>> I'm a novice at working with MapReduce and I'm developing a MapReduce
>>> function that takes XML documents as input.
>>> 
>>> How can I prepare the input files and specify them to the map function?
>>> 
>>> Thanks for help
>>> 
>>> Best regards
>>> Khaled
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> _________________________________
> MS CS - School of Science and Engineering
> Lahore University of Management Sciences (LUMS)
> Sector U, DHA, Lahore, 54792, Pakistan
> Cell: +92 3214207445
