Thanks for the explanation, Alan. Got it.

- Sundar

 "That language is an instrument of human reason, and not merely a medium for 
the expression of thought, is a truth generally admitted."
- George Boole, quoted in Iverson's Turing Award Lecture



----- Original Message ----
> From: Alan Gates <[email protected]>
> To: [email protected]
> Sent: Wed, June 2, 2010 9:35:24 PM
> Subject: Re: Pig facility analogous to SQL's IN?
> 
> The semantic of join is that all records from input 1 with a key value k will
> be joined with all records from input 2 with that same key value.  With one
> large input and one small input, this can be accomplished by loading the
> small input into memory on every mapper regardless of how the large input is
> split into maps by Map Reduce.  That is, for all the keys with value k in the
> large input, some may be assigned to map 1 and some to map 2, and join will
> still work.

> The semantic of cogroup is that at the end of the cogroup statement all keys
> from both inputs will be collected together into bags (one for each input).
> The only way to do this in the map is to guarantee that all keys with
> value k are in the same map.  That means that the InputFormat used to split
> the data across maps must be aware of the values of the keys and produce
> splits accordingly.  Zebra is the only storage format I'm aware of that can
> do this.

> All this said it would obviously be nice if Pig could analyze the script and
> figure out whether the user truly needs this stronger semantic of cogroup or
> whether he is just using cogroup as a join, and where possible rewrite it.
> But Pig's optimizer isn't there yet.

Alan.
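
The join/cogroup distinction described above can be illustrated with a small simulation. This is only a sketch in plain Python, not Pig or Map Reduce code: the function names, the sample records, and the two "splits" are all hypothetical, and the cogroup here is deliberately naive to show what goes wrong when a key is spread over several maps.

```python
from collections import defaultdict

def map_side_join(split, small):
    """Replicated join on one split: probe an in-memory index of the small input."""
    index = defaultdict(list)
    for key, value in small:
        index[key].append(value)
    return [(key, big_value, small_value)
            for key, big_value in split
            for small_value in index.get(key, [])]

def map_side_cogroup(split, small):
    """Naive map-side cogroup on one split: one (key, big_bag, small_bag) per key seen."""
    big_bags = defaultdict(list)
    for key, value in split:
        big_bags[key].append(value)
    small_bags = defaultdict(list)
    for key, value in small:
        small_bags[key].append(value)
    return [(key, bag, small_bags.get(key, [])) for key, bag in big_bags.items()]

# Hypothetical data: key "k" in the large input is spread over two splits.
small = [("k", "s1")]
split_1 = [("k", "a"), ("j", "b")]
split_2 = [("k", "c")]

# Each matching pair is produced by exactly one map, so concatenating the
# per-split outputs gives the full join no matter how the input was split.
joined = map_side_join(split_1, small) + map_side_join(split_2, small)

# Key "k" comes out as two partial groups, one per split, instead of a single
# group holding all of its records. That violates the cogroup semantic.
cogrouped = map_side_cogroup(split_1, small) + map_side_cogroup(split_2, small)
k_groups = [group for group in cogrouped if group[0] == "k"]
```

In this sketch the join output is correct for any split boundary, while the cogroup output is only correct if every record with key "k" happens to land in the same split, which is the guarantee the storage format would have to provide.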


> On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman wrote:

> Thanks Alan. I'm definitely interested in knowing why it won't work in
> cogroup the same way.
>
> Will try to implement the IN UDF, though I've only written simple eval UDFs
> so far.
> 
> 
> - Sundar
>
> "That language is an instrument of human reason, and not merely a medium for
> the expression of thought, is a truth generally admitted."
> - George Boole, quoted in Iverson's Turing Award Lecture
> 
> 
> 
> ----- Original Message ----
>> From: Alan Gates <[email protected]>
>> To: [email protected]
>> Sent: Tue, June 1, 2010 11:02:31 PM
>> Subject: Re: Pig facility analogous to SQL's IN?
>>
>> In general mapside cogroups are not possible unless the underlying storage
>> mechanism can guarantee that all instances of the key you are cogrouping on
>> are in a single map instance.  At this point only Zebra can guarantee
>> that.  If you're interested I can give more details on why join works and
>> cogroup doesn't.
>>
>> You can do IN for filter without needing a full mapside cogroup.  You could
>> implement this via a UDF that loads the small bag into a hash table and
>> probes the table for each record it is passed.
>>
>> Alan.
> 
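
The hash-table probe suggested above can be sketched as follows. This is plain Python rather than a Pig UDF, so it shows only the lookup logic and not the Pig UDF API; `make_in_filter` and the sample rows are hypothetical names and data chosen for illustration.

```python
def make_in_filter(small_values):
    """Load the small bag into a hash-based set once and return a probe function."""
    lookup = set(small_values)

    def is_in(value):
        # Average constant-time membership test per record passed in.
        return value in lookup

    return is_in

# Hypothetical stand-ins for the small bag B (b1) and the large bag A (a1, a2).
b = ["x", "y"]
a = [("1", "x"), ("2", "z"), ("3", "y")]

is_in_b = make_in_filter(b)

# FILTER-style use: keep only the tuples whose a2 appears in B.
kept = [row for row in a if is_in_b(row[1])]

# Substitution use: replace a2 with '0' when it is NOT IN B.
substituted = [(a1, a2 if is_in_b(a2) else "0") for a1, a2 in a]
```

Because the probe only looks at one record at a time, it works on any split of the large input, which is why it does not need the guarantee that a map-side cogroup needs.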
>> On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman wrote:
>>
>> Thanks Ankur. But, in my actual case, it's a COGROUP and not a join.
>> "replicated" can't be used with COGROUP, no?
>> Any workaround?
>>
>> - Sundar
>>
>> "That language is an instrument of human reason, and not merely a medium for
>> the expression of thought, is a truth generally admitted."
>> - George Boole, quoted in Iverson's Turing Award Lecture
>>
>> ----- Original Message ----
>>> From: Ankur C. Goel <[email protected]>
>>> To: "[email protected]" <[email protected]>
>>> Sent: Tue, June 1, 2010 12:39:56 PM
>>> Subject: Re: Pig facility analogous to SQL's IN?
>>>
>>> If data represented by relation B can fit in memory then you can simply use
>>> a "replicated" join which is inexpensive and is a map-side join.
>>>
>>> C = JOIN A by a2, B by b1 USING "replicated";
>>>
>>> -...@nkur
>>>
>>> On 5/31/10 3:32 PM, "BalaSundaraRaman" <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Is there any operator or UDF in Pig similar to the IN operator of SQL?
>>> Specifically, given a large bag A and a very small single-column bag B, I
>>> want to select tuples in A with a field a1 that has its value in B.
>>> My current method of doing it using a JOIN (below) seems very expensive.
>>>
>>> grunt> A = LOAD '/tmp/a.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray);
>>> grunt> B = LOAD '/tmp/b.txt' USING PigStorage(',') AS (b1:chararray);
>>> grunt> C = JOIN A by a2, B by b1;
>>>
>>> It'll be very useful if such an operator is available for use in FILTER and
>>> SPLIT as well.
>>> For example, if I need to substitute '0' when a2 is NOT IN B::b1, currently,
>>> there's no easy way, I guess.
>>>
>>> Thanks,
>>> Sundar (a Pig n00b)
>>>
>>> "That language is an instrument of human reason, and not merely a medium for
>>> the expression of thought, is a truth generally admitted."
>>> - George Boole, quoted in Iverson's Turing Award Lecture
