Hi all,

the JOIN operator in Pig produces duplicate columns in its output.
Say the statement looks like this:

C = JOIN A BY (var1, var2), B BY (var1, var2);

Then C contains var1 and var2 twice (once for each input relation), with
identical content in both copies.
This is not what a user usually expects from a join.
Why does Pig produce such redundant columns?
To get rid of them you have to add a FOREACH that projects them away;
otherwise you shuffle unnecessary data through the MR phases.
In my opinion this extra step should not be necessary.
Is there a reason why Pig structures the output of a JOIN this way?
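
To make the workaround concrete, here is a minimal sketch of the projection I mean. The file names and the extra fields x and y are only assumptions for illustration; the point is that the join keys have to be picked from one side using the :: prefix and the other copy dropped by hand.

```pig
-- Hypothetical inputs: A(var1, var2, x) and B(var1, var2, y).
-- File names and the fields x and y are made up for this example.
A = LOAD 'a.txt' AS (var1:chararray, var2:chararray, x:int);
B = LOAD 'b.txt' AS (var1:chararray, var2:chararray, y:int);

-- After the join, C carries both copies of the keys:
-- (A::var1, A::var2, A::x, B::var1, B::var2, B::y)
C = JOIN A BY (var1, var2), B BY (var1, var2);

-- Keep the keys from A only and drop the duplicates coming from B.
D = FOREACH C GENERATE A::var1 AS var1, A::var2 AS var2, A::x AS x, B::y AS y;
```

With this, DESCRIBE D should show only the four columns var1, var2, x and y.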

Cheers,
Alex

