It doesn't require separating the data at the file level; it only requires that each record carry enough information to identify which source it came from.
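For concreteness, here is a minimal sketch of such a tagging mapper (my illustration, not code from the book): one mapper class reads both inputs and prefixes each value with a source tag taken from the input file name. The file names ("customers" for the primary side, "orders" for the foreign side), the tab-separated layout, and the tag values "0"/"1" are all assumptions made up for the example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String tag;  // "0" = primary side (customers), "1" = foreign side (orders)

  @Override
  protected void setup(Context context) {
    // Same mapper class for both inputs: pick the tag from the file being read.
    String file = ((FileSplit) context.getInputSplit()).getPath().getName();
    tag = file.startsWith("customers") ? "0" : "1";
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t", 2);
    // key = join key (first field), value = source tag + rest of the record
    context.write(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
  }
}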
You don't need "separate mappers"; it is just the same mapper with logic that distinguishes the record type (as the sketch above does by checking the input file name).

To reduce the memory footprint, the data source with fewer records per key should arrive at the reducer first. What you describe is a special case (1-to-M), so the data source with no duplicate keys (i.e., the primary-key side) should arrive first. The reducer then only needs to hold the first set of records (the smaller one) in memory and can stream out the joined output. Compared to a non-optimized version, this reduces the memory footprint from O(M + N) to O(min(M, N)), where M and N are the numbers of records sharing a key in each data source. In your special 1-to-M case, it reduces O(N) to O(1). (A reducer along these lines is sketched after the quoted message below.)

This technique will work even with one reducer. Of course, you should always challenge your design and ask why there is only one reducer.

Reduce-side join is the most straightforward technique and requires the least organization of your data, but it is not the most efficient one. There are other join techniques that are more efficient (map-side partition join, map-side partition merge join, semi-join, memcache join, etc.). I wrote a blog post on Map/Reduce algorithms that includes various join techniques, here:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

Rgds,
Ricky

-----Original Message-----
From: Matthew John [mailto:[email protected]]
Sent: Monday, October 18, 2010 1:17 AM
To: [email protected]
Subject: Reduce side join

Hi all,

I am working on a join operation using Hadoop. I came across the reduce-side join in Hadoop: The Definitive Guide. As far as I understand, this technique is all about:

1) Read the two inputs using separate mappers and tag the two inputs with different values, such that in the sort/shuffle phase the primary-key record (the one with only a single instance of the key) comes before the records with the same foreign key.

2) In the reduce phase, read the required portion of the first record into a variable and keep appending it to the records that follow.

My doubt is: is it fine if I have more than one set of input records (a primary record followed by its foreign records) in the same reduce phase? For example, will this technique work if I have just one reducer running?

Regards,
Matthew John
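For Matthew's 1-to-M question, here is a minimal reducer sketch matching the description in my reply above; it pairs with the mapper sketch earlier in this message. It is an illustration under one important assumption: a secondary sort (composite key of join key + tag, with partitioning and grouping on the join key alone; that wiring is not shown here) has been set up so that the "0"-tagged primary values reach the reducer before the "1"-tagged foreign values within each key.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StreamingJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Buffer only the smaller, primary-tagged side. In the 1-to-M case
    // this list holds exactly one record, so memory stays O(1) per key.
    List<String> primary = new ArrayList<String>();
    for (Text value : values) {
      String v = value.toString();
      String record = v.substring(2);      // strip the "tag\t" prefix
      if (v.charAt(0) == '0') {
        primary.add(record);               // primary side arrives first
      } else {
        // Foreign side: stream each joined row out immediately,
        // without ever buffering the larger side.
        for (String p : primary) {
          context.write(key, new Text(p + "\t" + record));
        }
      }
    }
  }
}

With a single reducer, every key's group simply passes through this same reduce() call one after another, which is why the technique still works: the buffer is re-created per call, so one key's primary records never mix with the next key's foreign records.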
