Re: Behavior of JOIN

hc busy Thu, 10 Jun 2010 11:10:12 -0700

Isn't that kind of annoying, since JOIN in SQL is implicitly an inner join?
It would have been great if


C = JOIN A by id, B by id;

were an alias for
C1 = COGROUP A by id, B by id;
C2 = filter C1 by NOT (IsEmpty(A) OR IsEmpty(B));
C = foreach C2 generate FLATTEN(A), FLATTEN(B);


On Tue, Jun 8, 2010 at 12:03 PM, Alan Gates wrote:

> Historically
>
> C = JOIN A by a, B by a;
>
> was defined in Pig Latin as shorthand for:
>
> C1 = COGROUP A by a, B by a;
> C = FOREACH C1 GENERATE flatten(A), flatten(B);
>
> which produces the doubling of keys.
>
> Also, given that Pig Latin does not require that key names be the same (as
> USING or NATURAL do in SQL) there would be issues if it did not have both
> keys in the output.  (For the same reason ON in SQL duplicates the keys in
> the results.)
>
> Alan.
>
>
> On Jun 8, 2010, at 4:45 AM, Alexander Schätzle wrote:
>
>  Hi all,
>>
>> the JOIN operator of Pig produces duplicate columns in its output.
>> Let's say the statement is like this:
>>
>> C = JOIN A BY (var1, var2), B BY (var1, var2);
>>
>> Then C contains var1 and var2 two times (one for each input relation), of
>> course with the same content.
>> This is somehow not what a user "usually" expects from a Join.
>> Why does Pig produce such redundant entries?
>> If you want to get rid of these entries you have to do a FOREACH for
>> projection.
>> Otherwise you shuffle unnecessary data through MR-phases.
>> In my opinion this is somehow really unnecessary.
>> I just wonder why Pig produces the output of a Join the way it does?
>>
>> Cheers,
>> Alex
>>
>>
>>
>
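
To make the COGROUP-plus-FLATTEN reading of JOIN discussed in this thread concrete, here is a minimal Python sketch of those semantics. It is only a model for illustration, not Pig's implementation: the helper names cogroup and flatten_both and the sample relations A and B are hypothetical. It assumes, as in Pig, that flattening an empty bag produces no rows, which is why keys present on only one side drop out and the result behaves like an inner join, with the key appearing once per input relation.

```python
from collections import defaultdict
from itertools import product


def cogroup(a, b, key_a, key_b):
    """Model of COGROUP: one group per key, holding one bag per input."""
    groups = defaultdict(lambda: ([], []))
    for row in a:
        groups[key_a(row)][0].append(row)
    for row in b:
        groups[key_b(row)][1].append(row)
    return dict(groups)


def flatten_both(groups):
    """Model of FOREACH ... GENERATE FLATTEN(A), FLATTEN(B)."""
    out = []
    for bag_a, bag_b in groups.values():
        # The cross product over an empty bag is empty, so a key that is
        # missing from either input contributes no output rows.
        for row_a, row_b in product(bag_a, bag_b):
            # Whole rows are concatenated, so the key shows up twice.
            out.append(row_a + row_b)
    return out


# Hypothetical sample relations: (id, value)
A = [(1, "a1"), (2, "a2"), (3, "a3")]
B = [(1, "b1"), (1, "b1x"), (4, "b4")]

C1 = cogroup(A, B, key_a=lambda r: r[0], key_b=lambda r: r[0])
C = flatten_both(C1)

# The extra projection step mentioned in the thread: keep one copy of the key.
D = [(key, a_val, b_val) for (key, a_val, _dup_key, b_val) in C]

print(C)
print(D)
```

In this model no explicit IsEmpty filter is needed to get inner-join behavior; such a filter would only avoid carrying unmatched groups into the flatten step.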
