Re: Hashing two relations

Gang Luo Mon, 05 Jul 2010 06:55:23 -0700

Actually I mean hash join when I said default reduce join. The hash partitioner 
will shuffle records from mappers to reducers. Each reducer receive a hash 
partition which involves many keys. the reduce method will process one key (and 
all the associative records) at one time. What you can do is build hash table 
on one table in the reduce method and probe from the other side. But this 
doesn't make much sense since the result is a cross product.


-Gang




----- 原始邮件 ----
发件人： abc xyz <[email protected]>
收件人： [email protected]
发送日期： 2010/7/5 (周一) 3:55:22 上午
主   题： Re: Hashing two relations

The default reduce join is the sort-merge join. I want to have a hash-join on 
reduce-side for some experimenting. I want to get a partition from each 
hash-table and build an in-memory hash table for one and probing the partition 
from other table against it (like grace-join algorithm). Any suggestions 
would be highly appreciated.



________________________________
From: Gang Luo <[email protected]>
To: [email protected]
Sent: Sun, July 4, 2010 10:04:03 PM
Subject: Re: Hashing two relations

not sure what you want.

If you want to do the join in reduce side, MapReduce framework enable this by 
grouping all the matching tuples together. Why bother to build hash table to 
buffer the entire partition in memory? This probably brings you a out-of-memory 
error. The default reduce join should be your choice in this case. 


-Gang



----- 原始邮件 ----
发件人： abc xyz <[email protected]>
收件人： [email protected]
发送日期： 2010/7/3 (周六) 2:10:14 上午
主  题： Hashing two relations

Hey Folks,

I have to mess around with hashing. I want to take two input sources, partition 
them using hash function, then make the in-memory hash table for each partition 
of one sources, and compare the hash of each record of the same partition of 
the 

other table against it for joining these two. 


I know that map-side join does this (on pre-partitioned data), but I want to do 
it on reduce side. Using job-chaining, I can output (hash(key), value) by two 
map tasks on the two input files, but when it comes to the reduce stage, i have 
to take the same partition from both the hash tables. I am not sure how can I 
accomplish this. Any guidance in this regards would be appreciated.

Thanks

Re: Hashing two relations

Reply via email to