The semantics of join are that all records from input 1 with a key value k will be joined with all records from input 2 with that same key value. With one large input and one small input, this can be accomplished by loading the small input into memory on every mapper, regardless of how Map Reduce splits the large input into maps. That is, of all the records with key value k in the large input, some may be assigned to map 1 and some to map 2, and the join will still work.
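
To illustrate why the split does not matter for join, here is a small Python simulation. It is not Pig code, and the function and variable names are made up for this sketch. It assumes records are (key, value) pairs and that each "map" sees one split of the large input plus the entire small input.

```python
from collections import defaultdict


def replicated_join(large_split, small_input):
    """Join one split of the large input against the whole small input."""
    # Build an in-memory hash table from the small input: key -> list of values.
    table = defaultdict(list)
    for key, value in small_input:
        table[key].append(value)
    # Stream the split and emit one row per matching (large, small) pair.
    return [
        (key, large_value, small_value)
        for key, large_value in large_split
        for small_value in table.get(key, [])
    ]


def run(splits, small_input):
    """Run the join independently on every split and concatenate the output."""
    rows = []
    for split in splits:
        rows.extend(replicated_join(split, small_input))
    return sorted(rows)


large = [("k", 1), ("j", 2), ("k", 3), ("k", 4)]
small = [("k", "x"), ("k", "y")]

# Two arbitrary ways of cutting the large input into maps.
splits_a = [large[:2], large[2:]]
splits_b = [large[:1], large[1:3], large[3:]]

# The joined output is the same no matter where the cuts fall.
assert run(splits_a, small) == run(splits_b, small) == run([large], small)
```

Each map can produce its share of the output without knowing what the other maps saw, which is what makes the map-side join safe here.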

The semantics of cogroup are that at the end of the cogroup statement all records with a given key from both inputs will be collected together into bags (one for each input). The only way to do this in the map is to guarantee that all records with key value k are in the same map. That means that the InputFormat used to split the data across maps must be aware of the values of the keys and produce splits accordingly. Zebra is the only storage format I'm aware of that can do this.
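
The same kind of Python simulation shows where cogroup breaks down. Again this is only a sketch with made-up names, not Pig code. It assumes (key, value) records and assumes the second input is small enough to be handed to every map in full.

```python
from collections import defaultdict


def cogroup(input1, input2):
    """Return {key: (bag1, bag2)} for the records visible to a single map."""
    groups = defaultdict(lambda: ([], []))
    for key, value in input1:
        groups[key][0].append(value)
    for key, value in input2:
        groups[key][1].append(value)
    return {key: (sorted(b1), sorted(b2)) for key, (b1, b2) in groups.items()}


input1 = [("k", 1), ("j", 2), ("k", 3)]
input2 = [("k", "x"), ("j", "y")]

# What cogroup is supposed to produce: one complete pair of bags per key.
expected = cogroup(input1, input2)

# Arbitrary split: the two "k" records of input1 land in different maps.
arbitrary = [cogroup(input1[:2], input2), cogroup(input1[2:], input2)]
k_groups = [out["k"] for out in arbitrary if "k" in out]

# Key "k" now shows up twice, each time with only part of its first bag.
assert len(k_groups) == 2
assert expected["k"] not in k_groups

# Key-aware split: every record with the same key goes to the same map.
merged = {}
for k in ("j", "k"):
    part1 = [r for r in input1 if r[0] == k]
    part2 = [r for r in input2 if r[0] == k]
    merged.update(cogroup(part1, part2))

assert merged == expected
```

With the arbitrary split, no single map can emit the complete bag for "k", so the partial groups would have to be merged in a later step. That merge is what the reduce phase normally does.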

All this said, it would obviously be nice if Pig could analyze the script and figure out whether the user truly needs the stronger semantics of cogroup or is just using cogroup as a join, and rewrite it where possible. But Pig's optimizer isn't there yet.

Alan.


On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman wrote:

Thanks Alan. I'm definitely interested in knowing why it won't work in cogroup the same way.

Will try to implement the IN UDF, though I've only written simple eval UDFs so far.

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
From: Alan Gates <[email protected]>
To: [email protected]
Sent: Tue, June 1, 2010 11:02:31 PM
Subject: Re: Pig facility analogous to SQL's IN?

In general mapside cogroups are not possible unless the underlying storage mechanism can guarantee that all instances of the key you are cogrouping on are in a single map instance. At this point only Zebra can guarantee that. If you're interested I can give more details on why join works and cogroup doesn't.

You can do IN for filter without needing a full mapside cogroup. You could implement this via a UDF that loads the small bag into a hash table and probes the table for each record it is passed.
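
The load-once, probe-per-record idea can be sketched in Python as below. This is not a Pig UDF and does not use Pig's UDF API. The helper name make_in_filter and the sample data are hypothetical, and the small bag is assumed to fit in memory.

```python
def make_in_filter(small_values):
    """Load the small bag into a hash-based set once and return a probe function."""
    lookup = set(small_values)

    def contains(value):
        # Constant-time membership test, called once per record.
        return value in lookup

    return contains


# Hypothetical stand-ins for relations A (a1, a2) and B (b1).
A = [("r1", "apple"), ("r2", "kiwi"), ("r3", "pear")]
B = ["apple", "pear"]

in_b = make_in_filter(B)

# FILTER-style use: keep the tuples of A whose a2 appears in B.
selected = [t for t in A if in_b(t[1])]

# Substitution-style use: replace a2 with '0' when it is NOT IN B.
substituted = [(a1, a2 if in_b(a2) else "0") for a1, a2 in A]

assert selected == [("r1", "apple"), ("r3", "pear")]
assert substituted == [("r1", "apple"), ("r2", "0"), ("r3", "pear")]
```

Because each record is tested on its own, this works however the large input is split across maps.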

Alan.

On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman wrote:

Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
"replicated" can't be used with COGROUP, no?
Any workaround?

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
From: Ankur C. Goel <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tue, June 1, 2010 12:39:56 PM
Subject: Re: Pig facility analogous to SQL's IN?

If the data represented by relation B can fit in memory, then you can simply use a "replicated" join, which is inexpensive and is a map-side join.

C = JOIN A by a2, B by b1 USING "replicated";


-...@nkur


On 5/31/10 3:32 PM, "BalaSundaraRaman" <[email protected]> wrote:

Hi,

Is there any operator or UDF in Pig similar to the IN operator of SQL?
Specifically, given a large bag A and a very small single-column bag B, I want to select tuples in A with a field a1 that has its value in B.
My current method of doing it using a JOIN (below) seems very expensive.

grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
grunt> C = JOIN A by a2, B by b1;

It'll be very useful if such an operator is available for use in FILTER and SPLIT as well.
For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently, there's no easy way, I guess.



Thanks,
Sundar (a Pig n00b)

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture
