one-to-many Map Side Join without reducer

Shi Yu Thu, 09 Jun 2011 14:10:08 -0700

Hi,

I have two datasets: dataset 1 has the format:


MasterKey1    SubKey1    SubKey2    SubKey3
MasterKey2    SubKey4    SubKey5    SubKey6
....


dataset 2 has the format:

SubKey1    Value1
SubKey2    Value2
...

I want to do a one-to-many join on the SubKey, and the final goal is to have an output like:


MasterKey1    Value1    Value2    Value3
MasterKey2    Value4    Value5    Value6
...

After studying and experimenting with some example code, I understand that it is doable if I transform the first dataset as


SubKey1    MasterKey1
SubKey2    MasterKey1
SubKey3    MasterKey1
SubKey4    MasterKey2
SubKey5    MasterKey2
SubKey6    MasterKey2

and then use an inner join with dataset 2 on SubKey. Then I probably need a reducer to perform a secondary sort on MasterKey to get the result. However, the bottleneck is still on the reducer if each MasterKey has lots of SubKeys.

My question is: suppose that dataset 2 contains all the SubKeys and is never split. Is it possible to join the key of dataset 2 with the multiple values of dataset 1 on the mapper side? Any hint is highly appreciated.
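To make the mapper-side idea concrete, here is a minimal sketch in plain Python. It is not Hadoop API code. It assumes dataset 2 is small enough to be held in memory by every mapper, and that fields are whitespace-separated. The function names `load_lookup` and `map_record` and the `NULL` placeholder for a SubKey that is missing from dataset 2 are hypothetical, chosen only for illustration.

```python
def load_lookup(lines):
    """Build an in-memory SubKey -> Value table from dataset 2 lines."""
    lookup = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            lookup[parts[0]] = parts[1]
    return lookup


def map_record(line, lookup, missing="NULL"):
    """Join one dataset 1 record against the lookup table.

    Input:  'MasterKey SubKey SubKey ...'
    Output: 'MasterKey<TAB>Value<TAB>Value ...', or None for an empty line.
    The `missing` placeholder is an assumption of this sketch.
    """
    fields = line.split()
    if not fields:
        return None
    master_key, sub_keys = fields[0], fields[1:]
    values = [lookup.get(sub_key, missing) for sub_key in sub_keys]
    return "\t".join([master_key] + values)


if __name__ == "__main__":
    dataset2 = ["SubKey1    Value1", "SubKey2    Value2", "SubKey3    Value3"]
    dataset1 = ["MasterKey1    SubKey1    SubKey2    SubKey3"]
    table = load_lookup(dataset2)
    for record in dataset1:
        print(map_record(record, table))
```

Because each dataset 1 record already carries its MasterKey together with all of its SubKeys, the whole output line can be produced inside the map step, with no flattening and no reducer. In a Hadoop job the lookup table would be built once per mapper, for example from a copy of dataset 2 shipped to each node through the DistributedCache; that only works if dataset 2 fits in the mapper's memory.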

Shi
