Hello Hadoop users,

I've been scratching my head over this one and wondered if anybody has encountered something similar:

I have 2 output files from a MapReduce job.

Now I want to use these two files as the input for a second MapReduce job, but not by simply taking lines from each one at random: I want to combine each item in the first file with every item in the second.

For instance, if I have file 1 with:
a
b
c
and file 2 with:
1
2
3
I want the input of my second MapReduce job to be a1, a2, a3, b1 ... c3 (the cross product of the two files).

To do it, I first thought about accumulating all the data in a single reducer and outputting the combinations in its close() method, but that requires too much memory, since I have to keep every item in my working set in order to iterate over them in close().
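
Here's roughly what I had in mind, in case it isn't clear (old mapred API; the "L"/"R" tagging and the class name are just placeholders I made up for the example):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CrossJoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // the mapper tags each value with "L" or "R" so I know which file it came from
  private final List<String> left = new ArrayList<String>();
  private final List<String> right = new ArrayList<String>();
  private OutputCollector<Text, Text> out;

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    out = output; // keep the collector around so close() can still emit
    List<String> side = key.toString().equals("L") ? left : right;
    while (values.hasNext()) {
      side.add(values.next().toString());
    }
  }

  @Override
  public void close() throws IOException {
    // only here, once both files are fully buffered, can I emit every pair
    for (String l : left) {
      for (String r : right) {
        out.collect(new Text(l), new Text(r));
      }
    }
  }
}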

So right now I'm stuck with sequentially going through the files, reading them with the FileSystem API and constructing a new file that combines both. I do that in my driver's main() before running the second MapReduce job.
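
This is more or less what that looks like today (paths and the class name are made up for the example, it's just the shape of it), and it's entirely sequential on the client:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BuildCrossInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path first = new Path(args[0]);    // first output file of job 1
    Path second = new Path(args[1]);   // second output file of job 1
    Path combined = new Path(args[2]); // input I build for job 2

    BufferedWriter writer = new BufferedWriter(
        new OutputStreamWriter(fs.create(combined, true)));
    BufferedReader a = new BufferedReader(
        new InputStreamReader(fs.open(first)));
    try {
      String left;
      while ((left = a.readLine()) != null) {
        // re-open and re-scan file 2 for every single line of file 1
        BufferedReader b = new BufferedReader(
            new InputStreamReader(fs.open(second)));
        try {
          String right;
          while ((right = b.readLine()) != null) {
            writer.write(left + right);
            writer.newLine();
          }
        } finally {
          b.close();
        }
      }
    } finally {
      a.close();
      writer.close();
    }
  }
}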

Is there anything already available that would let me do this in a distributed fashion? Maybe a way to generate InputSplits by reading from two files at once?

Thank you,
alex.r.