Re: one-to-many Map Side Join without reducer

madhu phatak Tue, 21 Jun 2011 04:05:38 -0700

I think Hive is best suited to your use case: it gives you a SQL-based
interface to Hadoop for exactly this kind of join.


On Fri, Jun 10, 2011 at 2:39 AM, Shi Yu <[email protected]> wrote:

> Hi,
>
> I have two datasets: dataset 1 has the format:
>
> MasterKey1    SubKey1    SubKey2    SubKey3
> MasterKey2    SubKey4    SubKey5    SubKey6
> ....
>
>
> dataset 2 has the format:
>
> SubKey1    Value1
> SubKey2    Value2
> ...
>
> I want a one-to-many join on SubKey, and the final goal is
> to produce output like:
>
> MasterKey1    Value1    Value2    Value3
> MasterKey2    Value4    Value5    Value6
> ...
>
>
> After studying and experimenting with some example code, I understand that
> this is doable if I transform the first dataset into
>
> SubKey1    MasterKey1
> SubKey2    MasterKey1
> SubKey3    MasterKey1
> SubKey4    MasterKey2
> SubKey5    MasterKey2
> SubKey6    MasterKey2
>
> and then inner-join it with dataset 2 on SubKey. After that I would probably
> need a reducer to perform a secondary sort on MasterKey to get the result.
> However, the reducer remains the bottleneck if each MasterKey has many
> SubKeys.
> My question is: suppose dataset 2 contains all the SubKeys and is never
> split. Is it possible to join the keys of dataset 2 with the multiple values
> of dataset 1 on the mapper side? Any hint is highly appreciated.
>
> Shi
>
>
>
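
The map-side join asked about in the quoted message can be sketched roughly
as follows. This is only an illustration of the idea in plain Python, not
Hadoop code and not anything posted in the thread. It assumes dataset 2 is
small enough to be held in memory by every mapper (in Hadoop it would
typically be shipped to the mappers through the distributed cache, with the
job run as map-only). The function names load_lookup and map_join and the
"NULL" placeholder for unmatched SubKeys are hypothetical.

```python
def load_lookup(lines):
    """Build a SubKey -> Value table from dataset 2.

    Assumption: dataset 2 fits in memory on each mapper.
    """
    lookup = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            lookup[parts[0]] = parts[1]
    return lookup


def map_join(record, lookup, missing="NULL"):
    """Join one dataset 1 record against the in-memory table.

    Each input record already carries its MasterKey and all of its
    SubKeys, so the full output row is produced in the mapper and
    no reducer or secondary sort is needed.
    """
    fields = record.split()
    if not fields:
        return None
    master, subkeys = fields[0], fields[1:]
    # Hypothetical choice: emit a placeholder when a SubKey has no match.
    values = [lookup.get(key, missing) for key in subkeys]
    return "\t".join([master] + values)


# Sample data in the formats described in the question.
dataset2 = [
    "SubKey1    Value1", "SubKey2    Value2", "SubKey3    Value3",
    "SubKey4    Value4", "SubKey5    Value5", "SubKey6    Value6",
]
dataset1 = [
    "MasterKey1    SubKey1    SubKey2    SubKey3",
    "MasterKey2    SubKey4    SubKey5    SubKey6",
]

lookup = load_lookup(dataset2)
output = [map_join(record, lookup) for record in dataset1]
for row in output:
    print(row)
```

Because dataset 1 is left in its original one-row-per-MasterKey layout, the
transformation into (SubKey, MasterKey) pairs is not needed in this variant.
It only works while the lookup table fits in each mapper's memory.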
