If the data represented by relation B can fit in memory, then you can simply use a "replicated" join, which is inexpensive since it runs as a map-side join.
C = JOIN A by a2, B by b1 USING "replicated";

-...@nkur

On 5/31/10 3:32 PM, "BalaSundaraRaman" <[email protected]> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN operator of SQL? Specifically, given a large bag A and a very small single-column bag B, I want to select tuples in A with a field a1 that has its value in B. My current method of doing it using a JOIN (below) seems very expensive.

grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;

It'll be very useful if such an operator is available for use in FILTER and SPLIT as well. For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently, there's no easy way, I guess.

Thanks,
Sundar (a Pig n00b)

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture
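
On the NOT IN substitution raised in the quoted question: one possible approach, offered only as a sketch, is a left outer replicated join followed by a null check. It assumes the same input files and schemas as in the question, a Pig release that accepts LEFT OUTER together with a replicated join, and that B holds no duplicate values (otherwise apply DISTINCT to B first, since matching tuples of A would be repeated). The aliases J and C are illustrative, and the join type is written with single quotes as in the current Pig documentation.

```
-- Same inputs as in the quoted question (paths and schemas assumed unchanged).
A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);

-- Left outer join keeps every tuple of A; B::b1 is null where a2 has no match.
-- The small relation (B) is listed last so that it is the one held in memory.
J = JOIN A BY a2 LEFT OUTER, B BY b1 USING 'replicated';

-- Substitute '0' when a2 is not present in B; otherwise keep a2 as is.
C = FOREACH J GENERATE A::a1 AS a1, (B::b1 IS NULL ? '0' : A::a2) AS a2;
```

Filtering on B::b1 IS NULL (or IS NOT NULL) over J instead of the FOREACH would give the plain NOT IN (or IN) selection.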
