Isn't that kind of annoying? JOIN in SQL is implicitly an inner join. It would have been great if

C = JOIN A by id, B by id;

were an alias for

C1 = COGROUP A by id, B by id;
C2 = filter C1 by NOT (IsEmpty(A) OR IsEmpty(B));
C = foreach C2 generate FLATTEN(A), FLATTEN(B);

On Tue, Jun 8, 2010 at 12:03 PM, Alan Gates wrote:

> Historically
>
> C = JOIN A by a, B by a
>
> was defined in Pig Latin as shorthand for:
>
> C1 = COGROUP A by a, B by a;
> C = FOREACH C1 GENERATE flatten(A), flatten(B)
>
> which produces the doubling of keys.
>
> Also, given that Pig Latin does not require that key names be the same (as
> USING or NATURAL do in SQL) there would be issues if it did not have both
> keys in the output. (For the same reason ON in SQL duplicates the keys in
> the results.)
>
> Alan.
>
> On Jun 8, 2010, at 4:45 AM, Alexander Schätzle wrote:
>
>> Hi all,
>>
>> the JOIN operator of Pig produces duplicate columns in its output.
>> Let's say the statement is like this:
>>
>> C = JOIN A BY (var1, var2), B BY (var1, var2);
>>
>> Then C contains var1 and var2 twice (once for each input relation), of
>> course with the same content.
>> This is not what a user "usually" expects from a join.
>> Why does Pig produce such redundant entries?
>> If you want to get rid of these entries you have to do a FOREACH for
>> projection.
>> Otherwise you shuffle unnecessary data through the MR phases.
>> In my opinion this is really unnecessary.
>> I just wonder why Pig produces the output of a join the way it does?
>>
>> Cheers,
>> Alex
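
For reference, the FOREACH projection mentioned in the original question can be sketched roughly as below. This is only an illustration: the input paths and the extra columns x and y are made up, and only the var1/var2 join keys come from the thread.

```pig
-- Hypothetical inputs: the file names and the x / y columns are assumptions.
A = LOAD 'a.tsv' AS (var1:chararray, var2:chararray, x:int);
B = LOAD 'b.tsv' AS (var1:chararray, var2:chararray, y:int);

-- The join output carries the key columns from both inputs:
-- (A::var1, A::var2, A::x, B::var1, B::var2, B::y)
C = JOIN A BY (var1, var2), B BY (var1, var2);

-- Keep a single copy of the keys; '::' disambiguates the field names.
D = FOREACH C GENERATE A::var1 AS var1, A::var2 AS var2, A::x AS x, B::y AS y;

-- Print the resulting schema to confirm the duplicate keys are gone.
DESCRIBE D;
```

Since the join keys need not share a name in Pig Latin, the projection is where you decide which side's key names survive.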
