MapReduce doesn't know anything about your application logic. As long as
you can split the big XML into a lot of small XML files, Hadoop can help
you.

1. Split the big XML file into many small XML files, say 10,000.
2. Make each small XML file one key/value pair in a SequenceFile.
3. Use MapReduce to read the SequenceFile and parse the records; for
example, you could run 10 map and reduce tasks (see the sketch after
this list).
4. Finally you get 10 output files, which contain the format you want.
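
Something like the map task below is what I have in mind. It's a rough,
untested sketch using the org.apache.hadoop.mapred API; the class name
BookToCsvMapper is made up, and it assumes each SequenceFile record is
the small file's name as a Text key and one whole <book> document as a
Text value:

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BookToCsvMapper extends MapReduceBase
        implements Mapper<Text, Text, NullWritable, Text> {

    public void map(Text key, Text value,
                    OutputCollector<NullWritable, Text> output,
                    Reporter reporter) throws IOException {
        try {
            // Each record value carries one complete <book>...</book>
            // document, so a plain DOM parse is enough.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(value.toString())));

            // Pull the fields out of the small document and emit one CSV line.
            String title =
                    doc.getElementsByTagName("title").item(0).getTextContent();
            String author =
                    doc.getElementsByTagName("author").item(0).getTextContent();
            output.collect(NullWritable.get(), new Text(title + "," + author));
        } catch (Exception e) {
            reporter.incrCounter("xml", "parse-errors", 1); // skip bad records
        }
    }
}

For step 2 you can build the input with
SequenceFile.createWriter(fs, conf, path, Text.class, Text.class) and
append one (file name, file contents) pair per small XML file. In the
job driver, set SequenceFileInputFormat as the input format and 10
reduce tasks, and you end up with the 10 output files.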

On 9/24/06, howard chen <[EMAIL PROTECTED]> wrote:

Hello,

I have a large XML file, over 10GB, with a simple format like

<book>
     <title></title>
     <author></author>
     ...
</book>

I parse the XML and convert it into another format, e.g. CSV.
Currently the parsing is performed on a single server, and it is
slow (it takes a few hours).

Is Hadoop a good solution for splitting the XML file and spreading
the XML parsing across several cluster machines?

Thanks for any comments.
