Curious to know the answer too.
To add to the duplicate-columns issue: when I do a FOREACH for projection
after the join, it errors out if the join fields have the same name in both
relations, because Pig doesn't know which field to pick.
E.g. C = JOIN A BY (var1), B BY (var1);
D = FOREACH C GENERATE var1, var2, var3;
You get the following error:
2010-06-08 11:19:49,396 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1025: Found more than one match: A::var1, B::var1
A workaround is to give the join field a different name in one of the
relations (here the field in B is named var4 instead of var1):
C = JOIN A BY (var1), B BY (var4);
D = FOREACH C GENERATE var1, var2, var3;
That works fine, but it doesn't seem like an efficient way to do it.
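Another option that avoids renaming is Pig's disambiguate operator (::),
which is the same qualified form the error message prints. A minimal sketch,
assuming var2 and var3 each exist in only one of A and B (the schemas are
not shown in this thread, so that part is a guess):

    C = JOIN A BY (var1), B BY (var1);
    -- A::var1 selects the copy of the join key that came from relation A
    D = FOREACH C GENERATE A::var1 AS var1, var2, var3;

If var2 or var3 also appeared in both relations, they would need the same
A:: or B:: prefix.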
On 6/8/10 4:45 AM, "Alexander Schätzle" <[email protected]>
wrote:
> Hi all,
>
> the JOIN operator of Pig produces duplicate columns in its output.
> Let's say the statement is like this:
>
> C = JOIN A BY (var1, var2), B BY (var1, var2);
>
> Then C contains var1 and var2 two times (one for each input relation), of
> course with the same content.
> This is somehow not what a user "usually" expects from a Join.
> Why does Pig produce such redundant entries?
> If you want to get rid of these entries you have to do a FOREACH for
> projection.
> Otherwise you shuffle unnecessary data through MR-phases.
> In my opinion this is somehow really unnecessary.
> I just wonder why Pig produces the output of a Join the way it does?
>
> Cheers,
> Alex
>
>