I'm not sure what you want.

If you want to do the join on the reduce side, the MapReduce framework already 
enables this by grouping all the matching tuples together. Why bother building a 
hash table that buffers the entire partition in memory? That will probably give 
you an out-of-memory error. The default reduce-side join should be your choice 
in this case. 
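
For illustration, here is a minimal sketch of that default reduce-side join in 
plain Python rather than Hadoop code. It assumes each mapper tags its records 
with the source they came from ("L" or "R"); the tags, the sample data, and all 
function names are made up for this example, and shuffle() only stands in for 
the grouping the framework does for you.

```python
from collections import defaultdict


def map_tagged(records, tag):
    """Map step: emit (join_key, (tag, value)) so the reducer can tell sources apart."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: collect all tagged values per key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Reduce step: buffers only the values of ONE key, never a whole partition."""
    left = [value for tag, value in tagged_values if tag == "L"]
    right = [value for tag, value in tagged_values if tag == "R"]
    return [(key, l, r) for l in left for r in right]


def reduce_side_join(left_records, right_records):
    pairs = list(map_tagged(left_records, "L")) + list(map_tagged(right_records, "R"))
    output = []
    for key, values in sorted(shuffle(pairs).items()):
        output.extend(reduce_join(key, values))
    return output


# Hypothetical sample data: (join_key, value) pairs from two inputs.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = reduce_side_join(users, orders)
```

Keys that appear in only one input (2 and 4 here) produce no output, which is 
the inner-join behaviour; memory use per reduce call is bounded by the largest 
single key group rather than by the partition size.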

-Gang



----- Original Message ----
From: abc xyz <[email protected]>
To: [email protected]
Sent: 2010/7/3 (Sat) 2:10:14 AM
Subject: Hashing two relations

Hey Folks,

I have to mess around with hashing. I want to take two input sources, partition 
them using a hash function, then build an in-memory hash table for each 
partition of one source, and compare the hash of each record in the same 
partition of the other table against it to join the two. 
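
A rough, framework-free sketch of the scheme described above, in plain Python 
rather than Hadoop code. The function names, the sample data, and the partition 
count of 4 are assumptions made only for illustration.

```python
def partition(records, num_partitions):
    """Split (key, value) records into buckets by hash(key) modulo num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def hash_join_partition(build_part, probe_part):
    """Build an in-memory hash table on one side, then probe it with the other side."""
    table = {}
    for key, value in build_part:
        table.setdefault(key, []).append(value)
    return [(key, b, p) for key, p in probe_part for b in table.get(key, [])]


def partitioned_hash_join(build, probe, num_partitions=4):
    """Join partition i of one input only against partition i of the other input."""
    output = []
    build_parts = partition(build, num_partitions)
    probe_parts = partition(probe, num_partitions)
    for build_part, probe_part in zip(build_parts, probe_parts):
        output.extend(hash_join_partition(build_part, probe_part))
    return output


# Hypothetical sample data: (join_key, value) pairs from the two sources.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
result = partitioned_hash_join(users, orders)
```

Because both inputs go through the same hash function and partition count, 
records with equal keys always land in the same partition index, so each 
partition pair can be joined independently. Note that the hash table holds a 
whole partition of the build side in memory at once.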


I know that a map-side join does this (on pre-partitioned data), but I want to 
do it on the reduce side. Using job chaining, I can output (hash(key), value) 
from two map tasks on the two input files, but when it comes to the reduce 
stage, I have to take the same partition from both hash tables. I am not sure 
how I can accomplish this. Any guidance in this regard would be appreciated.

Thanks


