Hi all,

the JOIN operator of Pig produces duplicate columns in its output. Let's say the statement looks like this:

C = JOIN A BY (var1, var2), B BY (var1, var2);

Then C contains var1 and var2 twice (once for each input relation), of course with the same content. This is not what a user "usually" expects from a join. Why does Pig produce such redundant entries? If you want to get rid of them, you have to add a FOREACH for projection; otherwise you shuffle unnecessary data through the MR phases. In my opinion this is really unnecessary. I just wonder why Pig produces the output of a join the way it does?

Cheers,
Alex
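P.S. To make the workaround concrete, here is a minimal sketch of the projection I mean. The input paths and the extra fields x and y are made up for illustration; only the JOIN line is the statement from above.

```pig
-- Hypothetical inputs: paths and the fields x, y are assumptions for this example.
A = LOAD 'a.txt' AS (var1:chararray, var2:chararray, x:int);
B = LOAD 'b.txt' AS (var1:chararray, var2:chararray, y:int);

-- The join from the question. C carries the key columns from both sides:
-- A::var1, A::var2, A::x, B::var1, B::var2, B::y
C = JOIN A BY (var1, var2), B BY (var1, var2);

-- Extra projection step that keeps only one copy of each key column.
D = FOREACH C GENERATE A::var1 AS var1, A::var2 AS var2, A::x AS x, B::y AS y;

DESCRIBE D;
```

The A:: and B:: prefixes are needed because var1 and var2 exist on both sides of the join, so the plain names would be ambiguous in C.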
