Hi Mark,

A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs. It would let you use a different InputFormat and Mapper depending on the path the split comes from. It uses special delegating classes (DelegatingMapper and DelegatingInputFormat) to resolve this. I understand backporting this is more or less out of the question, but the ideas there might provide pointers to help you solve your current problem.

Just a thought :)
Amogh

On 11/3/09 8:44 PM, "Mark Vigeant" <[email protected]> wrote:

Hey Vipul,

No, I haven't concatenated my files yet; I was just thinking over how to approach the issue of multiple input paths.

I actually did what Amandeep hinted at: we wrote our own XMLInputFormat and XMLRecordReader. When configuring the job in my driver I set job.setInputFormatClass(XMLFileInputFormat.class), and what it does is send chunks of XML to the mapper, as opposed to lines of text or whole files. So I specified the line delimiter in the XMLRecordReader (i.e. <startTag>), and everything between the tags <startTag> and </startTag> is sent to the mapper. The map function is where I parse the data and write it to the table.

What I have to do now is figure out how to set the line delimiter to something common to both XML files I'm reading. Currently I have 2 mapper classes and thus 2 submitted jobs, which is really inefficient and time-consuming.

Make sense at all? Sorry if it doesn't; feel free to ask more questions.

Mark

-----Original Message-----
From: Vipul Sharma [mailto:[email protected]]
Sent: Monday, November 02, 2009 7:48 PM
To: [email protected]
Subject: RE: Multiple Input Paths

Mark,

Were you able to concatenate both the XML files together? What did you do to keep the resulting XML well formed?

Regards,
Vipul Sharma,
Cell: 281-217-0761
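
For anyone following the thread, the tag-delimited chunking Mark describes can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the idea, not Mark's actual XMLRecordReader: the class name TagChunker and its method are hypothetical, the whole input is held in memory as a String, start tags with attributes and nested tags of the same name are not handled, and a real RecordReader would stream its input and respect split boundaries. Accepting several tag names is one possible way to make a single reader work for two differently structured XML files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts every <tag>...</tag> chunk for any of the given tag names.
final class TagChunker {

    private TagChunker() {}

    // Scans the text left to right and returns each complete record, tags included.
    static List<String> chunks(String text, String... tagNames) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            // Find the earliest start tag among all candidate tag names.
            int bestStart = -1;
            String bestTag = null;
            for (String tag : tagNames) {
                int s = text.indexOf("<" + tag + ">", pos);
                if (s != -1 && (bestStart == -1 || s < bestStart)) {
                    bestStart = s;
                    bestTag = tag;
                }
            }
            if (bestStart == -1) {
                break; // no more records
            }
            String close = "</" + bestTag + ">";
            int e = text.indexOf(close, bestStart);
            if (e == -1) {
                break; // unterminated record: stop rather than emit a partial chunk
            }
            int end = e + close.length();
            out.add(text.substring(bestStart, end));
            pos = end; // continue after the record just emitted
        }
        return out;
    }
}
```

Calling TagChunker.chunks(xml, "order", "user") (tag names made up for the example) would return the order and user records in document order, so one map function could receive chunks from both kinds of file and branch on the leading tag.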

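
The delegation idea Amogh mentions (pick the mapper based on the path a split came from, so one job can cover both inputs) can also be sketched in plain Java. This is a rough illustration under stated assumptions, not Hadoop's implementation or API: PathDelegator, addInputPath and map are hypothetical names, a "mapper" is reduced to a function from a record String to an output String, and matching is a simple first-registered path-prefix check.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: routes each record to a handler chosen by the input path prefix.
final class PathDelegator {

    // Insertion-ordered so the first registered matching prefix wins.
    private final Map<String, Function<String, String>> handlers = new LinkedHashMap<>();

    // Registers a handler for every split whose path starts with pathPrefix.
    void addInputPath(String pathPrefix, Function<String, String> handler) {
        handlers.put(pathPrefix, handler);
    }

    // Looks up the handler for the split's path and applies it to the record.
    String map(String splitPath, String record) {
        for (Map.Entry<String, Function<String, String>> e : handlers.entrySet()) {
            if (splitPath.startsWith(e.getKey())) {
                return e.getValue().apply(record);
            }
        }
        throw new IllegalArgumentException("No handler registered for " + splitPath);
    }
}
```

With one handler registered per input directory, a single driver could process both XML inputs in one pass instead of submitting two jobs, which is the inefficiency Mark describes.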