Hey Folks,
I need to experiment with hash-based joins. I want to take two input sources,
partition both of them with the same hash function, build an in-memory hash
table for each partition of one source, and then probe that table with each
record from the matching partition of the other source in order to join the
two.
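To make the intent concrete, here is a minimal single-process sketch in Python
of the partition, build, and probe steps. It uses no Hadoop APIs at all. The
names partition_of, partition, hash_join and NUM_PARTITIONS, and the choice of
CRC32 as the hash function, are assumptions made purely for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value; in a real job this might be the reducer count


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic hash, so both inputs send a given key to the same partition.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition(records, num_partitions=NUM_PARTITIONS):
    # Split (key, value) records into buckets by the hash of the key.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, num_partitions)].append((key, value))
    return buckets


def hash_join(left, right, num_partitions=NUM_PARTITIONS):
    left_parts = partition(left, num_partitions)
    right_parts = partition(right, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Build phase: in-memory hash table for this partition of one source.
        table = defaultdict(list)
        for key, value in left_parts.get(p, []):
            table[key].append(value)
        # Probe phase: look up each record from the same partition of the other source.
        for key, right_value in right_parts.get(p, []):
            for left_value in table.get(key, []):
                joined.append((key, left_value, right_value))
    return joined
```

For example, hash_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")]) yields the
single joined record (2, "b", "x"). Only one partition's hash table has to fit
in memory at a time, which is the reason for partitioning first.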
I know that a map-side join does this (on pre-partitioned data), but I want to
do it on the reduce side. Using job chaining, I can emit (hash(key), value)
from two map tasks, one per input file, but in the reduce stage I have to
bring together the same partition from both inputs. I am not sure how to
accomplish this. Any guidance in this regard would be appreciated.
Thanks