You can use the data_join library in contrib (org.apache.hadoop.contrib.utils.join) for this.
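
Roughly: you subclass DataJoinMapperBase to tag each record with the file it
came from and to assign it a group key, and subclass DataJoinReducerBase,
whose combine() is called (as I recall) once for every combination of records
that share a group key but carry different tags. If you emit one constant
group key for every record, combine() should then see the full cross product
of file 1 x file 2. A sketch from memory -- the class names TaggedLine,
CrossMapper, and CrossReducer and the constant key "x" are just placeholders,
and you should double-check the base-class method names against the contrib
source (each public class would go in its own file):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Carries one line of text plus a tag naming the file it came from.
public class TaggedLine extends TaggedMapOutput implements Writable {
    private Text data = new Text();

    public TaggedLine() {
        this.tag = new Text();
    }

    public Writable getData() {
        return data;
    }

    public void write(DataOutput out) throws IOException {
        tag.write(out);
        data.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        tag.readFields(in);
        data.readFields(in);
    }
}

public class CrossMapper extends DataJoinMapperBase {
    // Tag each record with the name of its input file.
    protected Text generateInputTag(String inputFile) {
        return new Text(inputFile);
    }

    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
        TaggedLine record = new TaggedLine();
        ((Text) record.getData()).set(value.toString());
        record.setTag(this.inputTag);
        return record;
    }

    // Constant key: every record lands in the same reduce group, so the
    // reducer enumerates all pairs of file 1 x file 2.
    protected Text generateGroupKey(TaggedMapOutput aRecord) {
        return new Text("x");
    }
}

public class CrossReducer extends DataJoinReducerBase {
    // Called once per combination holding one record from each tag.
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
        if (tags.length < 2) {
            return null; // record from only one file; nothing to pair with
        }
        TaggedLine a = (TaggedLine) values[0];
        TaggedLine b = (TaggedLine) values[1];
        TaggedLine out = new TaggedLine();
        ((Text) out.getData()).set(a.getData().toString()
                                   + b.getData().toString());
        out.setTag((Text) tags[0]);
        return out;
    }
}

In your driver you would register these on the JobConf as usual (mapper,
reducer, and TaggedLine as the map output value class) and add both files as
input paths. One caveat: with a single constant key the whole cross product
still flows through one reduce group, so this parallelizes the map side only.
If your items have any natural key to join on, use that as the group key
instead and data_join will distribute the reduce side as well.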

Runping
 


> -----Original Message-----
> From: Alexandre Rochette [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 20, 2007 5:44 PM
> To: [email protected]
> Subject: 'Combining' input files for maps
> 
> Hello Hadoop users,
> 
> I've been scratching my head over this one and wondered if anybody had
> ever encountered something similar:
> 
> I have 2 output files from a MapReduce job.
> 
> Now I want to use these files as the input for a second MapReduce
> job, but not by randomly taking lines from each one. I want to combine
> each item in the first file with every item in the second.
> 
> For instance, if I have file 1 with:
> a
> b
> c
> and file 2 with:
> 1
> 2
> 3
> I want the input of my new MapReduce job to be a1, a2, a3, b1 ... c3.
> 
> To do it, I first thought about accumulating all the data in a single
> reducer and outputting it correctly in the close() method, but that
> requires too much memory, as I would have to keep every item in my
> working set to be able to iterate over them on close().
> 
> So right now I'm stuck at sequentially going through the files by
> reading them with the FileSystem API and constructing a new file that
> combines both. I do that in my driver's main() before running my
> second MapReduce job.
> 
> Is there anything already available that would let me do what I want in
> a distributed fashion? Like maybe a way to generate InputSplits by
> reading from two files at once?
> 
> Thank you,
> alex.r.

