Re: Pig facility analogous to SQL's IN?

Alan Gates Wed, 02 Jun 2010 09:06:28 -0700

The semantic of join is that all records from input 1 with a key valuek will be joined with all records from input 2 with that same keyvalue. With one large input and one small input, this can beaccomplished by loading the small input into memory on every mapperregardless of how the large input is split into maps by Map Reduce.That is, for all the keys with value k in the large input, some may beassigned to map 1 and some to map 2, and join will still work.

The semantic of cogroup is that at the end of the cogroup statementall keys from both inputs will be collected together into bags (onefor each input). The only way to do this is in the map is toguarantee that all keys with value k are in the same map. That meansthat the InputFormat used to split the data across maps must be awareof the values of the keys and produce splits accordingly. Zebra isthe only storage format I'm aware of that can do this.

All this said it would obviously be nice if Pig could analyze thescript and figure out whether the user truly needs this strongersemantic of cogroup or whether he is just using cogroup as a join, andwhere possible rewrite it. But Pig's optimizer isn't there yet.


Alan.


On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman wrote:

Thanks Alan. I'm definitely interested in knowing why it won't workin cogroup the same way.

Will try to implement the IN UDF, though, I've only written simpleeval udf's only so far.


- Sundar

"That language is an instrument of human reason, and not merely amedium for the expression of thought, is a truth generally admitted."

- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----

From: Alan Gates <[email protected]>
To: [email protected]
Sent: Tue, June 1, 2010 11:02:31 PM
Subject: Re: Pig facility analogous to SQL's IN?
In general mapside cogroups are not possible unless the underlyingstoragemechanism can guarantee that all instances of a the key you arecogrouping on
are in a single map instance.  At this point only Zebra can guarantee
that. If you're interested I can give more details on why joinworks and
cogroup doesn't.


You can do IN for filter without needing a full mapside

cogroup. You could implement this via a UDF that loads the smallbag into
a hash table and probes the table for each record it is
passed.


Alan.

On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman

wrote:

Thanks Ankur. But, in my actual case, it's a COGROUP and not
a join.
"replicated" can't be used with COGROUP, no?
Any work
around?

- Sundar

"That language is an

instrument of human reason, and not merely a medium for theexpression of

thought, is a truth generally admitted."
- George Boole, quoted in
Iverson's Turing Award Lecture



----- Original
Message ----

From: Ankur C. Goel <

ymailto="mailto:[email protected]";
href="mailto:[email protected]";>[email protected]>

To:

href="mailto:[email protected]";>pig-[email protected]" <

ymailto="mailto:[email protected]";
href="mailto:[email protected]";>[email protected]>

Sent: Tue, June 1, 2010 12:39:56 PM

Subject: Re: Pig facility

analogous to SQL's IN?


If data represented by relation

B can fit in memory than you can simply use a

"replicated" join

which is inexpensive and is a map-side join.

C =
JOIN

A by a2, B by b1 USING "replicated";



-...@nkur


On 5/31/10 3:32

PM,

"BalaSundaraRaman" <

href="mailto:

ymailto="mailto:[email protected]";
href="mailto:[email protected]";>[email protected]">
ymailto="mailto:[email protected]";
href="mailto:[email protected]";>[email protected]>

wrote:

Hi,

Is there any operator or UDF in Pig
similar to the IN

operator of SQL?

Specifically, given a
large bag A and a very small

single-column bag B, I want to select

tuples in A with a field a1 that has its

value in B.

My
current method of doing it using a JOIN (below) seems very

expensive.
grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',')
AS

(a1:chararray,a2:chararray);

grunt> B = LOAD
'/tmp/b.txt' USING

PigStorage(',') AS (b1:chararray);


grunt> C = JOIN A by a2, B by

b1;


It'll be very
useful if such an operator is available for use in

FILTER and SPLIT

as well.
For example, if I need to substitute '0' when a2 is

NOT IN B::b1, currently, there's no easy way, I

guess.




Thanks,
Sundar (a Pig n00b)

"That
language is an

instrument of human reason, and not merely a medium

for the expression of

thought, is a truth generally

admitted."
- George Boole, quoted in Iverson's

Turing Award

Lecture

Re: Pig facility analogous to SQL's IN?

Reply via email to