Actually I mean hash join when I said default reduce join. The hash partitioner will shuffle records from mappers to reducers. Each reducer receive a hash partition which involves many keys. the reduce method will process one key (and all the associative records) at one time. What you can do is build hash table on one table in the reduce method and probe from the other side. But this doesn't make much sense since the result is a cross product.
-Gang ----- 原始邮件 ---- 发件人: abc xyz <[email protected]> 收件人: [email protected] 发送日期: 2010/7/5 (周一) 3:55:22 上午 主 题: Re: Hashing two relations The default reduce join is the sort-merge join. I want to have a hash-join on reduce-side for some experimenting. I want to get a partition from each hash-table and build an in-memory hash table for one and probing the partition from other table against it (like grace-join algorithm). Any suggestions would be highly appreciated. ________________________________ From: Gang Luo <[email protected]> To: [email protected] Sent: Sun, July 4, 2010 10:04:03 PM Subject: Re: Hashing two relations not sure what you want. If you want to do the join in reduce side, MapReduce framework enable this by grouping all the matching tuples together. Why bother to build hash table to buffer the entire partition in memory? This probably brings you a out-of-memory error. The default reduce join should be your choice in this case. -Gang ----- 原始邮件 ---- 发件人: abc xyz <[email protected]> 收件人: [email protected] 发送日期: 2010/7/3 (周六) 2:10:14 上午 主 题: Hashing two relations Hey Folks, I have to mess around with hashing. I want to take two input sources, partition them using hash function, then make the in-memory hash table for each partition of one sources, and compare the hash of each record of the same partition of the other table against it for joining these two. I know that map-side join does this (on pre-partitioned data), but I want to do it on reduce side. Using job-chaining, I can output (hash(key), value) by two map tasks on the two input files, but when it comes to the reduce stage, i have to take the same partition from both the hash tables. I am not sure how can I accomplish this. Any guidance in this regards would be appreciated. Thanks
