Re: Behavior of JOIN

Syed Wasti Tue, 08 Jun 2010 11:34:50 -0700

I'm curious to know the answer too.
To add to the duplicate-column issue: after the join, a FOREACH projection
errors out if the join key fields have the same name in both relations,
because Pig doesn't know which field to pick.


E.g. C = JOIN A BY (var1), B BY (var1);
     D = FOREACH C GENERATE var1, var2, var3;
You get the error below:
2010-06-08 11:19:49,396 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1025: Found more than one match: A::var1, B::var1

A workaround is to give the join key a different name in B (here var4):
C = JOIN A BY (var1), B BY (var4);
D = FOREACH C GENERATE var1, var2, var3;
That works fine.
It just doesn't seem like an efficient way to do it.
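
A possibly cleaner alternative, sketched under assumptions: Pig prefixes
joined fields with the relation alias (as the error message itself shows
with A::var1 and B::var1), so the ambiguous key can be referenced by its
qualified name in the projection. The file names and schemas below are
hypothetical; the thread does not say which relation holds var2 and var3.

```pig
-- Hypothetical inputs: both relations name their join key var1.
A = LOAD 'a.txt' AS (var1:chararray, var2:chararray);
B = LOAD 'b.txt' AS (var1:chararray, var3:chararray);

-- Join on the identically named key; C now contains A::var1 and B::var1.
C = JOIN A BY var1, B BY var1;

-- Qualify the ambiguous key with its relation alias and rename it.
-- var2 and var3 each occur in only one input, so they need no prefix.
D = FOREACH C GENERATE A::var1 AS var1, var2, var3;
```

This keeps the original field names in both inputs, at the cost of still
needing the FOREACH to drop the duplicate key column.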

On 6/8/10 4:45 AM, "Alexander Schätzle" <[email protected]>
wrote:

> Hi all,
> 
> the JOIN operator of Pig produces duplicate columns in its output.
> Let's say the statement is like this:
> 
> C = JOIN A BY (var1, var2), B BY (var1, var2);
> 
> Then C contains var1 and var2 two times (one for each input relation), of
> course with the same content.
> This is somehow not what a user "usually" expects from a Join.
> Why does Pig produce such redundant entries?
> If you want to get rid of these entries you have to do a FOREACH for
> projection.
> Otherwise you shuffle unnecessary data through MR-phases.
> In my opinion this is somehow really unnecessary.
> I just wonder why Pig produces the output of a Join the way it does?
> 
> Cheers,
> Alex
> 
>
