Hi Mark,
A future release of Hadoop will have a MultipleInputs class, akin to 
MultipleOutputs. It would let you use a different InputFormat and mapper 
depending on the path the split comes from; it uses special delegating 
mapper/input classes to resolve this. I understand backporting it is more or 
less out of the question, but the ideas there might provide pointers to help 
you solve your current problem.
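For reference, here is a rough sketch of what driver code using that class looks like in later Hadoop releases. The paths and the FormatAMapper/FormatBMapper names are made up for illustration; the point is just that each input path gets its own InputFormat and Mapper in a single job:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One job, two input paths, each with its own format/mapper pairing.
// The delegating classes route each split to the right mapper.
MultipleInputs.addInputPath(job, new Path("/data/format-a"),
    TextInputFormat.class, FormatAMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/format-b"),
    TextInputFormat.class, FormatBMapper.class);
```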
Just a thought :)

Amogh


On 11/3/09 8:44 PM, "Mark Vigeant" <[email protected]> wrote:

Hey Vipul

No I haven't concatenated my files yet, and I was just thinking over how to 
approach the issue of multiple input paths.

I actually did what Amandeep hinted at: we wrote our own XMLInputFormat and 
XMLRecordReader. When configuring the job in my driver I call 
job.setInputFormatClass(XMLFileInputFormat.class), and what it does is send 
chunks of XML to the mapper as opposed to lines of text or whole files. I 
specified the record delimiter in the XMLRecordReader (i.e. <startTag>), and 
everything between the tags <startTag> and </startTag> is sent to the mapper. 
Inside the map function is where the data gets parsed and written to the 
table.
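The core record-splitting idea can be sketched in plain Java. This is a standalone illustration, not the actual RecordReader code (a real RecordReader works on byte streams and must handle split boundaries); the class, method, and tag names here are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the record-splitting idea behind the XMLRecordReader:
// everything from <startTag> through </startTag> becomes one record
// that gets handed to the mapper as a value.
public class XmlRecordSplitter {

    public static List<String> extractRecords(String xml,
                                              String startTag,
                                              String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = xml.indexOf(startTag, pos);
            if (start < 0) break;                      // no more records
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) break;                        // unclosed record
            // Keep the tags so the mapper sees a well-formed fragment.
            records.add(xml.substring(start, end + endTag.length()));
            pos = end + endTag.length();
        }
        return records;
    }
}
```

So for input like `<doc><r>1</r>junk<r>2</r></doc>` with tags `<r>`/`</r>`, the mapper would be called once with `<r>1</r>` and once with `<r>2</r>`.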

What I have to do now is figure out how to set the record delimiter to 
something common to both XML files I'm reading. Currently I have 2 mapper 
classes and thus 2 submitted jobs, which is really inefficient and 
time-consuming.

Make sense at all? Sorry if it doesn't, feel free to ask more questions

Mark

-----Original Message-----
From: Vipul Sharma [mailto:[email protected]]
Sent: Monday, November 02, 2009 7:48 PM
To: [email protected]
Subject: RE: Multiple Input Paths

Mark,

Were you able to concatenate both of the XML files together? What did you do 
to keep the resulting XML well formed?

Regards,
Vipul Sharma,
Cell: 281-217-0761
