MapReduce doesn't know anything about your application logic. As long as you can split the big XML into a lot of small XML files, Hadoop can help you:
1. Split the big XML file into many small XML files, say 10,000 of them.
2. Store each small XML file as one key/value pair in a SequenceFile.
3. Use MapReduce to read the SequenceFile and parse the records, with, say, 10 map and reduce tasks.
4. You end up with 10 output files in the format you want.

A rough sketch of steps 2-4 follows.
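Something like the code below, written against the old org.apache.hadoop.mapred API. This is only a sketch: the class names XmlToCsv and BookMapper, the title,author column layout, and the assumption that each SequenceFile entry is a (Text id, Text xml-document) pair are mine, not anything Hadoop prescribes. For step 2 you would build the SequenceFile with SequenceFile.createWriter(fs, conf, path, Text.class, Text.class) and append one pair per small document.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class XmlToCsv {

  // Each input record is one small <book>...</book> document stored as
  // the value of a SequenceFile entry (key = any record id).
  public static class BookMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String xml = value.toString();
      // Naive tag extraction, enough for flat <book> records; a real job
      // would use an XML parser, and would quote/escape CSV fields.
      String title = extract(xml, "title");
      String author = extract(xml, "author");
      output.collect(new Text(title), new Text(author));
    }

    private static String extract(String xml, String tag) {
      int from = xml.indexOf("<" + tag + ">");
      int to = xml.indexOf("</" + tag + ">");
      if (from < 0 || to < from) return "";
      return xml.substring(from + tag.length() + 2, to).trim();
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(XmlToCsv.class);
    conf.setJobName("xml-to-csv");
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setMapperClass(BookMapper.class);
    // Identity reduce; 10 reduce tasks give the 10 output files above.
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(10);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // Emit "title,author" lines instead of the default tab separator.
    conf.set("mapred.textoutputformat.separator", ",");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The identity reduce step is only there to control the number of output files: 10 reduce tasks give 10 files. With conf.setNumReduceTasks(0) the maps would write their output directly, one file per input split.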
On 9/24/06, howard chen <[EMAIL PROTECTED]> wrote:

Hello,

I have a large XML file, over 10GB, with a simple format like:

<book>
  <title></title>
  <author></author>
  ...
</book>

I parse the XML and convert it into another format, i.e. CSV. Currently the parsing runs on a single server only, and it is slow (a few hours). Is Hadoop a good solution for splitting the XML file and spreading the parsing across several machines?

Thanks for any comment.