Hi,
I have two datasets. Dataset 1 has the format:
MasterKey1 SubKey1 SubKey2 SubKey3
MasterKey2 SubKey4 SubKey5 SubKey6
....
Dataset 2 has the format:
SubKey1 Value1
SubKey2 Value2
...
I want to do a one-to-many join on SubKey, and the final goal is to
produce output like:
MasterKey1 Value1 Value2 Value3
MasterKey2 Value4 Value5 Value6
...
After studying and experimenting with some example code, I understand
that this is doable if I transform the first dataset into
SubKey1 MasterKey1
SubKey2 MasterKey1
SubKey3 MasterKey1
SubKey4 MasterKey2
SubKey5 MasterKey2
SubKey6 MasterKey2
and then do an inner join with dataset 2 on SubKey. After that I would
probably need a reducer that performs a secondary sort on MasterKey to
get the result. However, the reducer is still the bottleneck if each
MasterKey has a lot of SubKeys.
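To make the plan above concrete, here is a rough sketch of it in plain Python (not Hadoop code). The sample rows, the function names, and the extra "position" field I carry along to keep the SubKey order are all my own assumptions for illustration. The dict lookup stands in for the inner join, and the sort stands in for the shuffle plus secondary sort.

```python
from itertools import groupby

# Hypothetical sample data in the two formats described above.
dataset1 = [
    ["MasterKey1", "SubKey1", "SubKey2", "SubKey3"],
    ["MasterKey2", "SubKey4", "SubKey5", "SubKey6"],
]
dataset2 = [("SubKey%d" % i, "Value%d" % i) for i in range(1, 7)]


def flatten(records):
    """Turn each (MasterKey, SubKey...) row into (SubKey, MasterKey, position) rows."""
    for master, *subkeys in records:
        for pos, sub in enumerate(subkeys):
            yield sub, master, pos


def reduce_side_join(records, values):
    """Inner join on SubKey, then regroup the joined values by MasterKey."""
    lookup = dict(values)  # stands in for the inner join on SubKey
    joined = [
        (master, pos, lookup[sub])
        for sub, master, pos in flatten(records)
        if sub in lookup  # inner join: SubKeys without a value are dropped
    ]
    joined.sort()  # stands in for the shuffle and secondary sort
    return [
        [master] + [value for _, _, value in group]
        for master, group in groupby(joined, key=lambda row: row[0])
    ]


if __name__ == "__main__":
    for row in reduce_side_join(dataset1, dataset2):
        print(" ".join(row))
```

All joined rows for one MasterKey end up in a single group, which is exactly the step I expect to be slow when a MasterKey has many SubKeys.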
My question is: suppose that dataset 2 contains all the SubKeys and is
never split. Is it then possible to join the keys of dataset 2 against
the multiple SubKeys in each record of dataset 1 on the mapper side? Any
hint is highly appreciated.
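What I have in mind is roughly the following, again sketched in plain Python rather than real Hadoop code. It assumes that dataset 2 is small enough for every mapper to hold it in memory as a lookup table, which may not be true in practice. The function names and the "NULL" placeholder for a SubKey with no value are hypothetical.

```python
def load_lookup(lines):
    """Build the SubKey -> Value table that each mapper would hold in memory."""
    lookup = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            lookup[parts[0]] = parts[1]
    return lookup


def map_record(line, lookup, missing="NULL"):
    """Map one dataset 1 line to one output line, with no reducer involved."""
    master, *subkeys = line.split()
    values = [lookup.get(sub, missing) for sub in subkeys]
    return " ".join([master] + values)


def map_side_join(dataset1_lines, dataset2_lines):
    """Run the mapper logic over every non-blank line of dataset 1."""
    lookup = load_lookup(dataset2_lines)
    return [map_record(line, lookup) for line in dataset1_lines if line.strip()]


if __name__ == "__main__":
    d1 = ["MasterKey1 SubKey1 SubKey2 SubKey3",
          "MasterKey2 SubKey4 SubKey5 SubKey6"]
    d2 = ["SubKey%d Value%d" % (i, i) for i in range(1, 7)]
    for out in map_side_join(d1, d2):
        print(out)
```

Each dataset 1 record is handled independently here, so there would be no shuffle and no per-MasterKey bottleneck. The cost is that the whole of dataset 2 has to be available to every mapper.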
Shi