That's already what happens, because flattening a bag that is empty
results in 0 rows, regardless of how many rows came out of the other
bag.
Alan.
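A minimal sketch of the point above, modelled in plain Python rather than in Pig itself. The helper names `cogroup` and `flatten_pair` are hypothetical and only illustrate the semantics: flattening two bags side by side yields their cross product, so an empty bag on either side contributes zero rows and the unmatched key drops out.

```python
from collections import defaultdict
from itertools import product


def cogroup(a_rows, b_rows, key_a, key_b):
    """Model of COGROUP: one (key, bag_a, bag_b) record per distinct key."""
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[key_a]][0].append(row)
    for row in b_rows:
        groups[row[key_b]][1].append(row)
    return [(key, bag_a, bag_b) for key, (bag_a, bag_b) in groups.items()]


def flatten_pair(bag_a, bag_b):
    """Model of GENERATE FLATTEN(A), FLATTEN(B): cross product of the bags."""
    return [row_a + row_b for row_a, row_b in product(bag_a, bag_b)]


# Hypothetical sample data: key 2 exists only in A, key 3 only in B.
A = [(1, "x"), (2, "y")]
B = [(1, "p"), (1, "q"), (3, "r")]

joined = []
for _key, bag_a, bag_b in cogroup(A, B, 0, 0):
    # An empty bag makes the cross product empty, so no filter is needed.
    joined.extend(flatten_pair(bag_a, bag_b))

print(joined)
```

Only key 1 survives, with both key columns present in each output row. Keys 2 and 3 vanish without any explicit IsEmpty filter.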
On Jun 10, 2010, at 11:09 AM, hc busy wrote:
Isn't that kind of annoying? JOIN in SQL is implicitly an
inner join.
Would have been great if
C = JOIN A by id, B by id;
is alias for
C1 = COGROUP A by id, B by id;
C2 = filter C1 by NOT IsEmpty(A) AND NOT IsEmpty(B);
C = foreach C2 generate FLATTEN(A), FLATTEN(B);
On Tue, Jun 8, 2010 at 12:03 PM, Alan Gates wrote:
Historically
C = JOIN A by a, B by a;
was defined in Pig Latin as shorthand for:
C1 = COGROUP A by a, B by a;
C = FOREACH C1 GENERATE flatten(A), flatten(B);
which produces the doubling of keys.
Also, given that Pig Latin does not require that key names be the
same (as USING or NATURAL do in SQL), there would be issues if it
did not have both keys in the output. (For the same reason, ON in
SQL duplicates the keys in the results.)
Alan.
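A rough Python sketch of why both key columns are kept, assuming hypothetical field names (`id`, `user_id`, `name`, `city`) and a simple nested-loop model of the join result rather than Pig's actual implementation. When the two relations name their keys differently, neither name can be dropped automatically, so each output row carries both, prefixed with the relation alias in the `A::field` style.

```python
def join(a_rows, b_rows, key_a, key_b):
    """Nested-loop model of an inner JOIN that keeps every field of both inputs."""
    out = []
    for row_a in a_rows:
        for row_b in b_rows:
            if row_a[key_a] == row_b[key_b]:
                # Prefix each field with its relation alias to keep names unique.
                row = {f"A::{name}": value for name, value in row_a.items()}
                row.update({f"B::{name}": value for name, value in row_b.items()})
                out.append(row)
    return out


# Hypothetical relations whose join keys have different names.
A = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
B = [{"user_id": 1, "city": "Oslo"}]

C = join(A, B, "id", "user_id")

# A projection step, like a FOREACH after the JOIN, drops the redundant key.
D = [{k: v for k, v in row.items() if k != "B::user_id"} for row in C]

print(C)
print(D)
```

The projection is a separate, explicit step because only the script author knows which of the two key columns is the redundant one.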
On Jun 8, 2010, at 4:45 AM, Alexander Schätzle wrote:
Hi all,
the JOIN operator of Pig produces duplicate columns in its output.
Let's say the statement is like this:
C = JOIN A BY (var1, var2), B BY (var1, var2);
Then C contains var1 and var2 twice (once for each input relation),
of course with the same content.
This is somehow not what a user "usually" expects from a Join.
Why does Pig produce such redundant entries?
If you want to get rid of these entries you have to do a FOREACH for
projection.
Otherwise you shuffle unnecessary data through MR-phases.
In my opinion this is somehow really unnecessary.
I just wonder why Pig produces the output of a Join the way it
does.
Cheers,
Alex