I have a problem that needs to be solved by an iteration of MapReduce jobs. In each iteration I need to start by doing an equijoin between a large constant dataset and the output of the previous iteration; the remainder of my map function works on a joined tuple in a way whose details are not important here. I am happy to describe the reduce output as (key, value) pairs, and the large constant dataset can be described that way as well, so the join condition is simply equality of keys. What is the best way to get those equijoins done in the maps?
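To make the shape of the map concrete, here is roughly the mapper I imagine writing, assuming some input format hands each map() call the pair of matching records as a TupleWritable. I am borrowing the types from org.apache.hadoop.mapreduce.lib.join, but I do not know whether that package is actually the right tool here, which is part of what I am asking; the class name and the positions of the two sides in the tuple are just my guesses.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;

// Sketch only: assumes the join machinery delivers, for each key, a TupleWritable
// holding one record from the constant dataset and one from the previous iteration.
public class IterationJoinMapper extends Mapper<Text, TupleWritable, Text, Text> {

    @Override
    protected void map(Text key, TupleWritable joined, Context context)
            throws IOException, InterruptedException {
        // My guess: position 0 holds the record from the large constant dataset,
        // position 1 the record from the previous iteration's output
        // (presumably following the order the inputs were composed in the driver).
        Writable constantSide = joined.get(0);
        Writable iterationSide = joined.get(1);

        // ... the rest of my map logic works on (key, constantSide, iterationSide) ...
        context.write(key, new Text(constantSide + "\t" + iterationSide));
    }
}
```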
I presume I should be looking for a solution using org.apache.hadoop.mapreduce.* rather than org.apache.hadoop.mapred.*.

I do not want to cache the entire large constant dataset in memory during the setup method of my Mapper; that would require far too much memory. I do not even want to copy the entire dataset to the local filesystem of every node in my cluster.

What I want is to have the large constant dataset partitioned among my nodes, and it is OK (even preferable) if there is a bit of replication. So storing it in HDFS, either as one file or as a collection of files, would be fine.

Thanks, Mike
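P.S. For concreteness, here is the kind of driver configuration I was picturing, using CompositeInputFormat from the new API's lib.join package together with the mapper sketched above. The paths and iteration numbers are made up, and I am assuming (perhaps wrongly) that both inputs would need to be sorted by key and split into the same number of identically partitioned files for this to work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterationJoinDriver {

    public static void main(String[] args) throws Exception {
        // Hypothetical HDFS paths: the large constant dataset and the previous
        // iteration's reduce output, each stored as a collection of files that
        // are sorted by key and partitioned the same way.
        Path constantData = new Path("/data/constant");
        Path previousIteration = new Path("/data/iteration-7");

        Configuration conf = new Configuration();
        // Inner join of the two sorted, co-partitioned inputs on their keys.
        conf.set(CompositeInputFormat.JOIN_EXPR,
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                        constantData, previousIteration));

        Job job = Job.getInstance(conf, "iteration equijoin");
        job.setJarByClass(IterationJoinDriver.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        job.setMapperClass(IterationJoinMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/iteration-8"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```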