Thanks Ankur. But in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any workaround?
- Sundar
"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture
----- Original Message ----
From: Ankur C. Goel <[email protected]>
To: "pig-[email protected]" <[email protected]>
Sent: Tue, June 1, 2010 12:39:56 PM
Subject: Re: Pig facility analogous to SQL's IN?
If the data represented by relation B can fit in memory, then you can simply use a "replicated" join, which is inexpensive because it is a map-side join.
C = JOIN A by a2, B by b1 USING "replicated";
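
Put together with the relations from the original question, the suggestion might look roughly like the sketch below. It assumes B fits in memory and holds distinct values (duplicates in B would duplicate the matching rows of A). The final projection and the alias D are additions for illustration, not part of the suggestion above. Newer Pig releases write the keyword with single quotes ('replicated'), so the quoting may need adjusting for your version.

```pig
-- Same LOAD statements as in the original question.
A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);

-- The large relation goes first; the small relation listed last is the one
-- held in memory on each map task.
C = JOIN A BY a2, B BY b1 USING "replicated";

-- Keep only A's columns so the result reads like a filtered copy of A
-- (D is a hypothetical alias introduced for this sketch).
D = FOREACH C GENERATE A::a1 AS a1, A::a2 AS a2;
```
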
-...@nkur
On 5/31/10 3:32 PM, "BalaSundaraRaman" <[email protected]> wrote:
Hi,
Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want to select tuples in A with a field a1 that has its value in B.
My current method of doing it using a JOIN (below) seems very expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;
It'll be very useful if such an operator is available for use in FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently, there's no easy way, I guess.
Thanks,
Sundar (a Pig n00b)