I think Hive is best suited to your use case: it gives you a SQL-based interface to Hadoop, so you can express this kind of join as a query instead of writing the MapReduce job by hand.

On Fri, Jun 10, 2011 at 2:39 AM, Shi Yu <[email protected]> wrote:
> Hi,
>
> I have two datasets: dataset 1 has the format:
>
> MasterKey1 SubKey1 SubKey2 SubKey3
> MasterKey2 Subkey4 Subkey5 Subkey6
> ....
>
> dataset 2 has the format:
>
> SubKey1 Value1
> SubKey2 Value2
> ...
>
> I want to have one-to-many join based on the SubKey, and the final goal is
> to have an output like:
>
> MasterKey1 Value1 Value2 Value3
> MasterKey2 Value4 Value5 Value6
> ...
>
> After studying and experimenting some example code, I understand that it is
> doable if I transform the first data set as
>
> SubKey1 MasterKey1
> SubKey2 MasterKey1
> SubKey3 MasterKey1
> SubKey4 MasterKey2
> SubKey5 MasterKey2
> SubKey6 MasterKey2
>
> then using the inner join with the dataset 2 on SubKey. Then I probably
> need a reducer to perform secondary sort on MasterKey to get the result.
> However, the bottleneck is still on the reducer if each MasterKey has lots
> of SubKey.
> My question is, suppose that dataset2 contains all the Subkeys and never
> split, is it possible to join the key of dataset 2 with multiple values of
> dataset 1 at the Mapper Side? Any hint is highly appreciated.
>
> Shi
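
On the mapper-side question: if dataset 2 is small enough to fit in each mapper's memory, the join can in principle be done without a reducer by loading dataset 2 as a lookup table and resolving every SubKey of a dataset 1 record in the map step. Below is a rough sketch of that idea in plain Python. It is not Hadoop API code; the function names load_lookup and map_record are made up for illustration, and I am assuming whitespace-separated fields and inner-join behaviour (SubKeys missing from dataset 2 are dropped).

```python
def load_lookup(lines):
    """Build a SubKey -> Value table from dataset 2 lines ("SubKey Value")."""
    lookup = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            lookup[parts[0]] = parts[1]
    return lookup


def map_record(line, lookup):
    """Join one dataset 1 record ("MasterKey SubKey SubKey ...") in the mapper."""
    parts = line.split()
    if not parts:
        return None
    master, subkeys = parts[0], parts[1:]
    # Inner-join semantics (an assumption): unknown SubKeys are skipped.
    values = [lookup[k] for k in subkeys if k in lookup]
    return " ".join([master] + values)


# Tiny sample in the formats described in the quoted message.
dataset2 = [
    "SubKey1 Value1", "SubKey2 Value2", "SubKey3 Value3",
    "SubKey4 Value4", "SubKey5 Value5", "SubKey6 Value6",
]
dataset1 = [
    "MasterKey1 SubKey1 SubKey2 SubKey3",
    "MasterKey2 SubKey4 SubKey5 SubKey6",
]

lookup = load_lookup(dataset2)
joined = [map_record(line, lookup) for line in dataset1]
for row in joined:
    print(row)
```

Since each dataset 1 record already carries all of its SubKeys, this avoids both the transformation to SubKey/MasterKey pairs and the secondary sort. It only works while the lookup table fits in memory; otherwise the reduce-side join is still needed.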
