I have a problem that needs to be solved by an iteration of MapReduce jobs. In each iteration I need to start by doing an equijoin between a large constant dataset and the output of the previous iteration; the remainder of my map function works on a joined tuple in a way whose details are not important here. I am happy to describe the reduce output as (key, value) pairs, and the large constant dataset can be described that way as well, so the join condition is simply equality of keys. What is the best way to get those equijoins done in the maps?
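To make the shape of the map concrete, here is roughly the mapper I imagine writing, assuming some input format hands each map() call the pair of matching records as a TupleWritable. I am borrowing the types from org.apache.hadoop.mapreduce.lib.join, but I do not know whether that package is actually the right tool here, which is part of what I am asking; the class name and the positions of the two sides in the tuple are just my guesses.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;

// Sketch only: assumes the join machinery delivers, for each key, a TupleWritable
// holding one record from the constant dataset and one from the previous iteration.
public class IterationJoinMapper extends Mapper<Text, TupleWritable, Text, Text> {

    @Override
    protected void map(Text key, TupleWritable joined, Context context)
            throws IOException, InterruptedException {
        // My guess: position 0 holds the record from the large constant dataset,
        // position 1 the record from the previous iteration's output
        // (presumably following the order the inputs were composed in the driver).
        Writable constantSide = joined.get(0);
        Writable iterationSide = joined.get(1);

        // ... the rest of my map logic works on (key, constantSide, iterationSide) ...
        context.write(key, new Text(constantSide + "\t" + iterationSide));
    }
}
```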
I presume I should be looking for a solution using org.apache.hadoop.mapreduce.* rather than org.apache.hadoop.mapred.*.

I do not want to cache the entire large constant dataset in memory during the setup method of my Mapper; that would require far too much memory. I do not even want to copy the entire dataset to the local filesystem of every node in my cluster.

What I want is to have the large constant dataset partitioned among my nodes, and it is OK (even preferable) if there is a bit of replication. So storing it in HDFS, either as one file or as a collection of files, would be fine.

Thanks, Mike
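P.S. For concreteness, here is the kind of driver configuration I was picturing, using CompositeInputFormat from the new API's lib.join package together with the mapper sketched above. The paths and iteration numbers are made up, and I am assuming (perhaps wrongly) that both inputs would need to be sorted by key and split into the same number of identically partitioned files for this to work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterationJoinDriver {

    public static void main(String[] args) throws Exception {
        // Hypothetical HDFS paths: the large constant dataset and the previous
        // iteration's reduce output, each stored as a collection of files that
        // are sorted by key and partitioned the same way.
        Path constantData = new Path("/data/constant");
        Path previousIteration = new Path("/data/iteration-7");

        Configuration conf = new Configuration();
        // Inner join of the two sorted, co-partitioned inputs on their keys.
        conf.set(CompositeInputFormat.JOIN_EXPR,
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                        constantData, previousIteration));

        Job job = Job.getInstance(conf, "iteration equijoin");
        job.setJarByClass(IterationJoinDriver.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        job.setMapperClass(IterationJoinMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/iteration-8"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```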