Re: Behavior of JOIN

Alan Gates Tue, 08 Jun 2010 12:04:48 -0700

Historically

C = JOIN A by a, B by a


was defined in Pig Latin as shorthand for:

C1 = COGROUP A by a, B by a;
C = FOREACH C1 GENERATE flatten(A), flatten(B)

which produces the doubling of keys.
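
A minimal sketch of that expansion, assuming hypothetical inputs (the file names a.txt and b.txt and the non-key fields x and y are made up for illustration and are not from the original message). It shows how flattening both bags carries the key along once per input relation:

```pig
-- Hypothetical inputs; file names and schemas are assumptions.
A = LOAD 'a.txt' AS (a:int, x:chararray);
B = LOAD 'b.txt' AS (a:int, y:chararray);

-- COGROUP collects the tuples of A and B that share a key into two bags:
-- (group, A: {(a, x)}, B: {(a, y)})
C1 = COGROUP A BY a, B BY a;

-- Flattening both bags crosses them within each group. Each bag still
-- holds its own copy of the key, so the key shows up twice per row,
-- as A::a and B::a.
C = FOREACH C1 GENERATE FLATTEN(A), FLATTEN(B);

DESCRIBE C;
```

Because FLATTEN of an empty bag yields no rows, keys present in only one input drop out, which matches inner-join behaviour.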

Also, given that Pig Latin does not require that key names be the same (as USING or NATURAL do in SQL), there would be issues if it did not have both keys in the output. (For the same reason, ON in SQL duplicates the keys in the results.)


Alan.

On Jun 8, 2010, at 4:45 AM, Alexander Schätzle wrote:

Hi all,

the JOIN operator of Pig produces duplicate columns in its output.
Let's say the statement is like this:

C = JOIN A BY (var1, var2), B BY (var1, var2);

Then C contains var1 and var2 two times (one for each input relation), of course with the same content.

This is somehow not what a user "usually" expects from a Join.
Why does Pig produce such redundant entries?

If you want to get rid of these entries you have to do a FOREACH for projection.

Otherwise you shuffle unnecessary data through MR-phases.
In my opinion this is somehow really unnecessary.
I just wonder why Pig produces the output of a Join the way it does?
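
For reference, the FOREACH projection mentioned above could look roughly like this. It is only a sketch: the non-key fields x and y and the input file names are hypothetical and not part of the original question.

```pig
-- Hypothetical inputs; file names and the fields x and y are assumptions.
A = LOAD 'a.txt' AS (var1:int, var2:int, x:chararray);
B = LOAD 'b.txt' AS (var1:int, var2:int, y:chararray);

C = JOIN A BY (var1, var2), B BY (var1, var2);

-- Keep one copy of the join keys (taken from A here) plus the non-key
-- fields. The A:: / B:: prefixes disambiguate columns that share a name.
D = FOREACH C GENERATE A::var1 AS var1, A::var2 AS var2, A::x AS x, B::y AS y;
```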

Cheers,
Alex
