Hi,
Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want
to select the tuples in A whose field a2 has a value that appears in B.
My current approach, a JOIN (below), seems very expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A BY a2, B BY b1;
It would be very useful if such an operator were available for use in FILTER
and SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, there is
currently no easy way to do it, as far as I can tell.
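The closest workaround I can come up with is a left outer join followed by a
null check. This is only a rough sketch: B_keys, J and the output alias are
names I made up for illustration, and it assumes a Pig version that supports
LEFT OUTER together with USING 'replicated' (which should hold the small
relation in memory rather than doing a full reduce-side join).

```pig
-- Same inputs as in the JOIN above.
A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);

-- De-duplicate B so the join cannot multiply tuples of A.
B_keys = DISTINCT B;

-- LEFT OUTER keeps every tuple of A; B_keys::b1 is null where a2 has no match.
-- USING 'replicated' is an assumption about the Pig version in use; the
-- small relation must be listed last and must fit in memory.
J = JOIN A BY a2 LEFT OUTER, B_keys BY b1 USING 'replicated';

-- Emulate "a2 NOT IN B ? '0' : a2" with a null check on the joined column.
C = FOREACH J GENERATE A::a1 AS a1, (B_keys::b1 IS NULL ? '0' : A::a2) AS a2;
```

It still goes through a JOIN, though, so it doesn't help inside FILTER or SPLIT
expressions, which is why a real IN operator would be nicer.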
Thanks,
Sundar (a Pig n00b)
"That language is an instrument of human reason, and not merely a medium for
the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture