Hello Hadoop users,
I've been scratching my head over this one and wondered if anybody had
ever encountered something similar:
I have 2 output files from a MapReduce job.
Now I want to use these files as input for a second MapReduce job,
but not by just taking lines from each one at random. I want to combine
each item in the first file with every item in the second.
For instance, if I have file 1 with:
a
b
c
and file 2 with:
1
2
3
I want the input of my new MapReduce job to be a1, a2, a3, b1 ... c3.
To do it, I first thought about accumulating all the data in a single
reducer and outputting it correctly in the close() method, but that
requires too much memory, since I have to keep every item in my working
set to be able to iterate over them in close().
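Roughly what I had in mind, in case that's unclear (a quick, untested
sketch against the old mapred API; the class name is made up and the
mapper is assumed to tag each record with "1" or "2" depending on which
file it came from):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Everything ends up buffered in the two lists below, which is exactly
// the memory problem: both files have to fit in the reducer's heap.
public class CrossProductReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, NullWritable> {

  private final List<String> left = new ArrayList<String>();
  private final List<String> right = new ArrayList<String>();
  private OutputCollector<Text, NullWritable> out;

  public void reduce(Text fileTag, Iterator<Text> values,
                     OutputCollector<Text, NullWritable> output,
                     Reporter reporter) throws IOException {
    out = output;  // keep a handle so close() can still emit pairs
    while (values.hasNext()) {
      String v = values.next().toString();
      if ("1".equals(fileTag.toString())) {
        left.add(v);
      } else {
        right.add(v);
      }
    }
  }

  @Override
  public void close() throws IOException {
    // Only once both files are fully buffered can the pairs be written.
    for (String a : left) {
      for (String b : right) {
        out.collect(new Text(a + b), NullWritable.get());
      }
    }
  }
}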
So right now I'm stuck sequentially going through the files, reading
them with the FileSystem API and constructing a new file that combines
both. I do that in my driver's main() before running my second
MapReduce job.
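Simplified, the driver code looks more or less like this (untested
sketch; the paths and class name are just placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BuildCrossProductInput {

  // Read a whole (small) file from HDFS into memory, one line per entry.
  private static List<String> readLines(FileSystem fs, Path p) throws IOException {
    List<String> lines = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(p)));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    } finally {
      reader.close();
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    List<String> left = readLines(fs, new Path("job1/part-00000"));
    List<String> right = readLines(fs, new Path("job1/part-00001"));

    // Write every combination as one line of the new input file.
    FSDataOutputStream out = fs.create(new Path("job2-input/combined"));
    try {
      for (String a : left) {
        for (String b : right) {
          out.writeBytes(a + b + "\n");
        }
      }
    } finally {
      out.close();
    }
    // ...then submit the second job with job2-input/combined as its input.
  }
}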
Is there anything already available that would let me do this in a
distributed fashion? Maybe a way to generate InputSplits by reading
from two files at once?
Thank you,
alex.r.