If the data represented by relation B fits in memory, then you can simply use a 
"replicated" join, which is an inexpensive map-side join. The small relation 
must be listed last, as B is here.

 C = JOIN A by a2, B by b1 USING "replicated";
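
For the NOT IN case from the question (substituting '0' when a2 is not in B), a minimal sketch follows. It assumes your Pig version allows LEFT OUTER together with a replicated join, so please check that against your release. The aliases J and C are only illustrative. Newer releases write the join type in single quotes; the double-quoted form above is the syntax from the time of this thread.

```pig
-- Same inputs as in the question: a.txt has two columns, b.txt has one.
A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);

-- Keep every row of A; rows whose a2 has no match in B get a null b1.
-- B is listed last because the last relation is the one held in memory.
J = JOIN A BY a2 LEFT OUTER, B BY b1 USING 'replicated';

-- Substitute '0' wherever a2 was not found in B.
C = FOREACH J GENERATE A::a1 AS a1, ((B::b1 IS NULL) ? '0' : A::a2) AS a2;
```

If B can contain duplicate values, each matching row of A is repeated once per duplicate, so running DISTINCT on B first may be needed. Rows of A with a null a2 never match and so also end up with '0'.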

-...@nkur


On 5/31/10 3:32 PM, "BalaSundaraRaman" <[email protected]> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want 
to select the tuples in A whose field a2 has its value in B.
My current method of doing it, using a JOIN (below), seems very expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;

It'll be very useful if such an operator is available for use in FILTER and 
SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, I don't think 
there is currently an easy way to do it.


Thanks,
Sundar (a Pig n00b)

"That language is an instrument of human reason, and not merely a medium for 
the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture

