Hi,

I have two datasets: dataset 1 has the format:

MasterKey1    SubKey1    SubKey2    SubKey3
MasterKey2    SubKey4    SubKey5    SubKey6
...


dataset 2 has the format:

SubKey1    Value1
SubKey2    Value2
...

I want to do a one-to-many join on SubKey, and the final goal is to have an output like:

MasterKey1    Value1    Value2    Value3
MasterKey2    Value4    Value5    Value6
...


After studying and experimenting with some example code, I understand that this is doable if I transform the first dataset into

SubKey1    MasterKey1
SubKey2    MasterKey1
SubKey3    MasterKey1
SubKey4    MasterKey2
SubKey5    MasterKey2
SubKey6    MasterKey2

and then do an inner join with dataset 2 on SubKey. After that I would probably need a reducer that performs a secondary sort on MasterKey to assemble the result. However, the reducer is still the bottleneck if each MasterKey has many SubKeys. My question is: supposing that dataset 2 contains all the SubKeys and is never split, is it possible to join the keys of dataset 2 against the multiple SubKey values of each dataset 1 record on the mapper side? Any hint is highly appreciated.
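
To make the mapper-side idea concrete, here is a rough sketch in plain Python (not Hadoop API code) of what I have in mind. It assumes dataset 2 is small enough to be held in memory by every mapper; in Hadoop I assume this would mean shipping the file to each task, for example through the distributed cache, and loading it once during mapper setup. The function names load_lookup and map_record, and the "NULL" placeholder for a SubKey that is missing from dataset 2, are my own inventions for illustration.

```python
# Plain-Python sketch of a map-side (replicated) join.
# Assumption: dataset 2 fits in memory on every mapper.
# load_lookup / map_record are hypothetical names, not Hadoop APIs.

def load_lookup(lines):
    """Build an in-memory SubKey -> Value table from dataset 2 lines."""
    lookup = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        subkey, value = fields
        lookup[subkey] = value
    return lookup


def map_record(line, lookup, missing="NULL"):
    """Turn one dataset 1 line into 'MasterKey<TAB>Value<TAB>Value...'.

    Each record is joined entirely inside the mapper, so no reducer
    (and no secondary sort) is involved. The `missing` placeholder is
    an assumption; a strict inner join would drop such SubKeys instead.
    """
    fields = line.split()
    if not fields:
        return None  # nothing to emit for an empty line
    master_key, subkeys = fields[0], fields[1:]
    values = [lookup.get(subkey, missing) for subkey in subkeys]
    return "\t".join([master_key] + values)


# Tiny in-memory stand-ins for the two datasets.
dataset2 = ["SubKey1    Value1", "SubKey2    Value2", "SubKey3    Value3"]
dataset1 = ["MasterKey1    SubKey1    SubKey2    SubKey3"]

table = load_lookup(dataset2)
for record in dataset1:
    print(map_record(record, table))
```

Since every dataset 1 line already carries all of its SubKeys, the output line for a MasterKey can be produced from that single record, which is why I hope the reducer can be avoided altogether.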

Shi