Yes, I did look at CompositeInputFormat. That is why I remarked that I supposed I should be looking under org.apache.hadoop.mapreduce.*, and why I sent the earlier question about why CompositeInputFormat is not under org.apache.hadoop.mapreduce.* in Hadoop 1.0.0. But I have gotten no answers yet.
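For concreteness, here is roughly how I understand the old-API (org.apache.hadoop.mapred) wiring would look. This is an untested sketch; the class name, input/output paths, SequenceFile inputs, and Text join keys are all just assumptions on my part, not anything the thread has confirmed.

// Untested sketch of a map-side join with the old-API CompositeInputFormat.
// Both inputs are assumed to be SequenceFiles keyed by Text, already sorted
// on the key and split into the same number of identically partitioned files.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class CompositeJoinSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CompositeJoinSketch.class);
    conf.setJobName("map-side join sketch");

    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" keeps only keys present in both inputs; the mapper then sees
    // (join key, TupleWritable) with one slot per joined source.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        "/data/constant",            // hypothetical: the large constant dataset
        "/data/prev-iteration"));    // hypothetical: the previous iteration's output

    // No mapper/reducer set here: the identity mapper just passes through
    // (join key, TupleWritable) pairs; real per-record work would go in a
    // custom mapper with TupleWritable values.
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);           // assumes Text join keys
    conf.setOutputValueClass(TupleWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path("/out/joined"));  // hypothetical

    JobClient.runJob(conf);
  }
}

Note that the two inputs have to be sorted on the same key and split into the same number of identically partitioned files for this to work at all, which is exactly the placement problem I worry about below.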
And yes, I very much want to do the joins as 'early' as possible (i.e., in the mapper, not the reducer); I do not want to waste work sending copies of the large constant dataset around if I do not have to.

BTW, my outline below was written too hastily; I see no obvious way to get the large constant dataset and the previous iteration's output placed consistently, which is going to be needed for top performance.

Thanks,
Mike

From: Bejoy Ks <bejoy.had...@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: 01/15/2012 07:49 AM
Subject: Re: What is the right way to do map-side joins in Hadoop 1.0?

Hi Mark

Have a look at CompositeInputFormat. I guess it is what you are looking for to achieve map-side joins. If you are fine with a reduce-side join, go with MultipleInputFormat.

I have tried the same sort of joins using MultipleInputFormat and have scribbled something on the same. Check out whether it'd be useful for you. (A very crude implementation :), you may have better ways.)
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html

Hope it helps!...

Regards
Bejoy.K.S

On Sun, Jan 15, 2012 at 4:34 PM, Mike Spreitzer <mspre...@us.ibm.com> wrote:
BTW, each key appears exactly once in the large constant dataset, and exactly once in each MR job's output. I am thinking the right approach is to consistently partition the job output and the large constant dataset, with the number of partitions being the number of reduce tasks; each part goes into its own file. Make an InputFormat whose number of splits equals the number of reduce tasks; reading a split will consist of reading the corresponding pair of files, stepping through each. Seems like something that should already be provided by something in org.apache.hadoop.mapreduce.*.

Thanks,
Mike
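For what it is worth, the per-split "stepping through" I have in mind would look roughly like the following. This is an untested sketch; the class name, the Text/Text SequenceFile layout, and the part-file names are all my own assumptions, not an existing Hadoop facility.

// Untested sketch of the per-partition merge that a custom RecordReader
// (or a map task reading a paired split) would perform. It relies on the
// facts stated above: each key appears at most once in each input, the two
// same-numbered part files cover the same key range, and each file is
// sorted by key, so one forward pass over both joins them.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PairedPartMerge {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Same partition number from each side, e.g. constant/part-r-00003 and
    // prev-iter/part-r-00003 (hypothetical names).
    SequenceFile.Reader constant = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    SequenceFile.Reader previous = new SequenceFile.Reader(fs, new Path(args[1]), conf);
    Text ck = new Text(), cv = new Text();   // constant-side key/value
    Text pk = new Text(), pv = new Text();   // previous-side key/value
    boolean haveC = constant.next(ck, cv);
    boolean haveP = previous.next(pk, pv);
    while (haveC && haveP) {
      int cmp = ck.compareTo(pk);
      if (cmp == 0) {
        // Matching key: this is where the real map-side work would happen.
        System.out.println(ck + "\t" + cv + "\t" + pv);
        haveC = constant.next(ck, cv);
        haveP = previous.next(pk, pv);
      } else if (cmp < 0) {
        haveC = constant.next(ck, cv);   // key only in the constant dataset
      } else {
        haveP = previous.next(pk, pv);   // key only in the previous output
      }
    }
    constant.close();
    previous.close();
  }
}

A real InputFormat would wrap this loop in a RecordReader and generate one split per part number; that wrapper is the piece I have not found ready-made under org.apache.hadoop.mapreduce.*.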