You can use the data_join lib in contrib to do this.

Runping
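The base classes live in org.apache.hadoop.contrib.utils.join. Here is a
rough sketch of how I would wire it up for a full cross product; I am
going from memory of the contrib API, so treat the exact signatures as
approximate, and the names TaggedLine, MapClass and Reduce are my own.
The idea: tag each record with its source file, emit a constant group key
so every record from both files lands in the same reduce group, and let
the reducer's combine() run once per combination of one record per tag,
that is, once per (file1 line, file2 line) pair.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
    import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
    import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class CrossJoin {

      // Wraps one input line together with the tag naming its source file.
      public static class TaggedLine extends TaggedMapOutput {
        private Text line = new Text();

        public TaggedLine() {}
        public TaggedLine(Text line) { this.line.set(line); }

        public Writable getData() { return line; }

        public void write(DataOutput out) throws IOException {
          this.tag.write(out);
          this.line.write(out);
        }

        public void readFields(DataInput in) throws IOException {
          this.tag.readFields(in);
          this.line.readFields(in);
        }
      }

      public static class MapClass extends DataJoinMapperBase {
        // Tag each record with the file it came from.
        protected Text generateInputTag(String inputFile) {
          return new Text(inputFile);
        }

        protected TaggedMapOutput generateTaggedMapOutput(Object value) {
          TaggedLine ret = new TaggedLine((Text) value);
          ret.setTag(this.inputTag);
          return ret;
        }

        // A constant group key sends every record from both files
        // into the same reduce group.
        protected Text generateGroupKey(TaggedMapOutput aRecord) {
          return new Text("x");
        }
      }

      public static class Reduce extends DataJoinReducerBase {
        // combine() is invoked once per combination of one record per
        // tag. Note: I have not verified which tag comes first in the
        // arrays, so check the ordering before relying on a+b vs. b+a.
        protected TaggedMapOutput combine(Object[] tags, Object[] values) {
          if (tags.length < 2) return null; // only one side present; skip
          String a = ((TaggedLine) values[0]).getData().toString();
          String b = ((TaggedLine) values[1]).getData().toString();
          TaggedLine ret = new TaggedLine(new Text(a + b));
          ret.setTag((Text) tags[0]);
          return ret;
        }
      }
    }

One caveat: the constant group key funnels both files through a single
reduce group, so this only scales as far as the framework's per-group
buffering allows. For large inputs you would want to salt the key, e.g.
bucket file 1 records by a hash and replicate each file 2 record into
every bucket, so the pairs spread across reducers.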
> -----Original Message-----
> From: Alexandre Rochette [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 20, 2007 5:44 PM
> To: [email protected]
> Subject: 'Combining' input files for maps
>
> Hello Hadoop users,
>
> I've been scratching my head over this one and wondered if anybody had
> ever encountered something similar:
>
> I have 2 output files from a MapReduce job.
>
> Now I want to use these files as input for a second MapReduce job, but
> not by randomly taking lines from each one. I want to combine each item
> in the first file with every item in the second.
>
> For instance, if I have file 1 with:
> a
> b
> c
> and file 2 with:
> 1
> 2
> 3
> I want the input of my new MapReduce job to be a1, a2, a3, b1 ... c3.
>
> To do it, I first thought about accumulating all the data in a single
> reducer and outputting it correctly in the close() method, but that
> requires too much memory, as I have to keep every item in my working
> set to be able to iterate over them in close().
>
> So right now I'm stuck at sequentially going through the files by
> reading them with the filesystem API and constructing a new file that
> combines both. I do that in my driver's main() before running my
> second MapReduce job.
>
> Is there anything already available that would let me do what I want
> in a distributed fashion? Like maybe a way to generate InputSplits by
> reading from two files at once?
>
> Thank you,
> alex.r.
