It doesn't require separating the data at the file level; it only requires 
that each record carry enough information to identify which source it came from.

You don't need "separate mappers"; it is just the same mapper with logic that 
distinguishes the record type.
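
As a minimal sketch of such a mapper (the class name, tag values, file names, 
and field layout below are my own hypothetical assumptions; it also assumes a 
plain FileInputFormat so the source is recognizable from the input file name):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Assumption: the source is recognizable from the input file name.
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = file.startsWith("customers") ? "0" : "1";  // "0" = smaller side

        // Assumption: the join key is the first comma-separated field.
        String[] fields = line.toString().split(",", 2);
        if (fields.length < 2) return;  // skip malformed lines

        // The tag goes into the key (so the framework sorts "0" before "1",
        // see below) and into the value (so the reducer can still tell the
        // sources apart).
        context.write(new Text(fields[0] + "\t" + tag),
                      new Text(tag + "\t" + fields[1]));
      }
    }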

To reduce the memory footprint, the data source with fewer records per key 
should arrive at the reducer first.  What you describe is a special case 
(1-to-M): there, the data source with non-duplicated keys (i.e., the primary 
key side) should arrive first.
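
A common way to get that arrival order (a sketch continuing the same 
hypothetical key layout above, where the tag is appended to the join key) is 
to let the framework sort on the composite key while partitioning and 
grouping on the join key alone:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class ReduceJoinHelpers {

      // Partition on the join key only, so both tagged copies of a key
      // land on the same reducer.
      public static class JoinKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
          String joinKey = key.toString().split("\t", 2)[0];
          return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group on the join key only, ignoring the tag, so all records of a
      // key reach one reduce() call -- already sorted with tag "0" first.
      public static class JoinKeyGroupingComparator extends WritableComparator {
        public JoinKeyGroupingComparator() {
          super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
          String ka = a.toString().split("\t", 2)[0];
          String kb = b.toString().split("\t", 2)[0];
          return ka.compareTo(kb);
        }
      }
    }

They would be wired in with job.setPartitionerClass(...) and 
job.setGroupingComparatorClass(...).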

At the reducer, it just needs to hold in memory the first set of data records 
(which is smaller in size) and stream out the joined output.
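
Continuing the same hypothetical sketch, the reducer buffers only the tag-"0" 
records and streams the join for everything else:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StreamingJoinReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        String joinKey = key.toString().split("\t", 2)[0];
        List<String> smallSide = new ArrayList<String>();

        for (Text v : values) {
          String[] parts = v.toString().split("\t", 2);
          if (parts[0].equals("0")) {
            // The smaller side arrives first; buffer it in memory.
            smallSide.add(parts[1]);
          } else {
            // The larger side: join against the buffer and emit immediately,
            // without ever holding these records in memory.
            for (String s : smallSide) {
              context.write(new Text(joinKey), new Text(s + "\t" + parts[1]));
            }
          }
        }
      }
    }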

Compared to a non-optimized version, it reduces the memory footprint from 
O(M + N) to O(min(M, N)), where M and N are the numbers of records sharing a 
key in each data source.  In your special 1-to-M case, it reduces O(M) to O(1).

This technique will work even with one reducer.  Of course, you should always 
challenge your design and ask why there is only one reducer.

Reduce-side join is the most straightforward one and requires less 
organization of your data, but it is not the most efficient one.  There are 
other joining techniques that are more efficient (map-side-partition-join, 
map-side-partition-merge-join, semi-join, memcache-join, etc.).

I wrote a blog post on Map/Reduce algorithms that includes various joining 
techniques, here:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

Rgds,
Ricky


-----Original Message-----
From: Matthew John [mailto:[email protected]] 
Sent: Monday, October 18, 2010 1:17 AM
To: [email protected]
Subject: Reduce side join
 
Hi all,
 
   I am working on a join operation using Hadoop. I came across reduce-side
join in "Hadoop: The Definitive Guide". As far as I understand, this technique
is all about:
 
1) Read the two inputs using separate mappers and tag the two inputs using
different values such that in the sort/shuffle phase the primary key record
(with only one instance of a record with the key) comes before the records
with the same foreign key.
 
2) In the reduce phase, read the required portion of the first record into a
variable and keep appending it to the rest of the records that follow.
 
My doubt is:
Is it fine if I have more than one set of input records (a primary record
followed by the foreign records) in the same reduce phase?
For example, will this technique work if I have just one reducer running?
 
Regards,
 
Matthew John
