Hello Hadoop users,
I've been scratching my head over this one and wondered if anybody had
ever encountered something similar:
I have 2 output files from a MapReduce job.
Now I want to use these files as input for a second MapReduce job,
but not by just taking lines from each one at random. I want to combine
each item in the first file with every item in the second.
For instance, if I have file 1 with:
a
b
c
and file 2 with:
1
2
3
I want the input of my new MapReduce job to be a1, a2, a3, b1 ... c3.
To do it, I first thought about accumulating all the data in a single
reducer and outputting it correctly in the close() method, but that
requires too much memory, since I have to keep every item in my working
set to be able to iterate over them in close().
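Roughly what I had in mind, in case that's unclear (a quick, untested
sketch against the old mapred API; the class name is made up and the
mapper is assumed to tag each record with "1" or "2" depending on which
file it came from):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Everything ends up buffered in the two lists below, which is exactly
// the memory problem: both files have to fit in the reducer's heap.
public class CrossProductReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, NullWritable> {

  private final List<String> left = new ArrayList<String>();
  private final List<String> right = new ArrayList<String>();
  private OutputCollector<Text, NullWritable> out;

  public void reduce(Text fileTag, Iterator<Text> values,
                     OutputCollector<Text, NullWritable> output,
                     Reporter reporter) throws IOException {
    out = output;  // keep a handle so close() can still emit pairs
    while (values.hasNext()) {
      String v = values.next().toString();
      if ("1".equals(fileTag.toString())) {
        left.add(v);
      } else {
        right.add(v);
      }
    }
  }

  @Override
  public void close() throws IOException {
    // Only once both files are fully buffered can the pairs be written.
    for (String a : left) {
      for (String b : right) {
        out.collect(new Text(a + b), NullWritable.get());
      }
    }
  }
}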
So right now I'm stuck sequentially going through the files, reading
them with the FileSystem API and constructing a new file that combines
both. I do that in my driver's main() before running my second
MapReduce job.
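Simplified, the driver code looks more or less like this (untested
sketch; the paths and class name are just placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BuildCrossProductInput {

  // Read a whole (small) file from HDFS into memory, one line per entry.
  private static List<String> readLines(FileSystem fs, Path p) throws IOException {
    List<String> lines = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(p)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    } finally {
      reader.close();
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    List<String> left = readLines(fs, new Path("job1/part-00000"));
    List<String> right = readLines(fs, new Path("job1/part-00001"));

    // Write every combination as one line of the new input file.
    FSDataOutputStream out = fs.create(new Path("job2-input/combined"));
    try {
      for (String a : left) {
        for (String b : right) {
          out.writeBytes(a + b + "\n");
        }
      }
    } finally {
      out.close();
    }
    // ...then submit the second job with job2-input/combined as its input.
  }
}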
Is there anything already available that would let me do this in a
distributed fashion? Maybe a way to generate InputSplits by reading
from two files at once?
Thank you,
alex.r.