Thanks for the explanation, Alan. Got it. - Sundar
"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture ----- Original Message ---- > From: Alan Gates <[email protected]> > To: [email protected] > Sent: Wed, June 2, 2010 9:35:24 PM > Subject: Re: Pig facility analogous to SQL's IN? > > The semantic of join is that all records from input 1 with a key value k will > be > joined with all records from input 2 with that same key value. With one > large input and one small input, this can be accomplished by loading the > small > input into memory on every mapper regardless of how the large input is split > into maps by Map Reduce. That is, for all the keys with value k in the > large input, some may be assigned to map 1 and some to map 2, and join will > still work. The semantic of cogroup is that at the end of the cogroup > statement all keys from both inputs will be collected together into bags (one > for each input). The only way to do this is in the map is to guarantee > that all keys with value k are in the same map. That means that the > InputFormat used to split the data across maps must be aware of the values of > the keys and produce splits accordingly. Zebra is the only storage format > I'm aware of that can do this. All this said it would obviously be nice > if Pig could analyze the script and figure out whether the user truly needs > this > stronger semantic of cogroup or whether he is just using cogroup as a join, > and > where possible rewrite it. But Pig's optimizer isn't there > yet. Alan. On Jun 1, 2010, at 11:13 PM, BalaSundaraRaman > wrote: > Thanks Alan. I'm definitely interested in knowing why it > won't work in cogroup the same way. > > Will try to implement the > IN UDF, though, I've only written simple eval udf's only so far. > > > - Sundar > > "That language is an instrument of human > reason, and not merely a medium for the expression of thought, is a truth > generally admitted." > - George Boole, quoted in Iverson's Turing Award > Lecture > > > > ----- Original Message > ---- >> From: Alan Gates < > href="mailto:[email protected]">[email protected]> >> To: > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected] >> > Sent: Tue, June 1, 2010 11:02:31 PM >> Subject: Re: Pig facility > analogous to SQL's IN? >> >> In general mapside cogroups are > not possible unless the underlying storage >> mechanism can guarantee > that all instances of a the key you are cogrouping on >> are in a > single map instance. At this point only Zebra can guarantee >> > that. If you're interested I can give more details on why join works > and >> cogroup doesn't. > > You can do IN for filter > without needing a full mapside >> cogroup. You could implement > this via a UDF that loads the small bag into >> a hash table and probes > the table for each record it is >> passed. > > > Alan. > > On Jun 1, 2010, at 12:45 AM, BalaSundaraRaman >> > wrote: > >> Thanks Ankur. But, in my actual case, it's a COGROUP > and not >> a join. >> "replicated" can't be used with COGROUP, > no? >> Any work >> around? >> >> - > Sundar >> >> "That language is an >> instrument of > human reason, and not merely a medium for the expression of >> thought, > is a truth generally admitted." >> - George Boole, quoted > in >> Iverson's Turing Award Lecture >> >> > >> >> ----- Original >> Message > ---- >>> From: Ankur C. Goel < >> ymailto="mailto: > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]" >> > href="mailto: > href="mailto:[email protected]">[email protected]"> > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]> >>> > To: >> " >> href="mailto: > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]"> > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]" > < >> ymailto="mailto: > href="mailto:[email protected]">[email protected]" >> > href="mailto: > href="mailto:[email protected]">[email protected]"> > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]> >>> > >> Sent: Tue, June 1, 2010 12:39:56 PM >>> Subject: Re: > Pig facility >> analogous to SQL's IN? >>> >>> > If data represented by relation >> B can fit in memory than you can > simply use a >>> "replicated" join >> which is inexpensive > and is a map-side join. >> >> C = >> > JOIN >>> A by a2, B by b1 USING "replicated"; >> > >> >> -...@nkur >> >> >> On > 5/31/10 3:32 >>> PM, >> "BalaSundaraRaman" > < >>> href="mailto: >> ymailto="mailto: > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]" >> > href="mailto: > href="mailto:[email protected]">[email protected]"> > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]"> >> > ymailto="mailto: > href="mailto:[email protected]">[email protected]" >> > href="mailto: > href="mailto:[email protected]">[email protected]"> > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]> >>> > >> wrote: >> >> Hi, >> >> Is > there any operator or UDF in Pig >> similar to the IN >>> > operator of SQL? >> Specifically, given a >> large bag A and a > very small >>> single-column bag B, I want to select >> > tuples in A with a field a1 that has its >>> value in B. >> > My >> current method of doing it using a JOIN (below) seems > very >>> >> expensive. >> grunt> A = LOAD > '/tmp/a.txt' USING PigStorage(',') >> AS >>> > (a1:chararray,a2:chararray); >> grunt> B = LOAD >> > '/tmp/b.txt' USING >>> PigStorage(',') AS > (b1:chararray); >> >> grunt> C = JOIN A by a2, B > by >>> b1; >> >> It'll be very >> useful > if such an operator is available for use in >>> FILTER and > SPLIT >> as well. >> For example, if I need to substitute '0' > when a2 is >>> >> NOT IN B::b1, currently, there's no easy > way, I >>> guess. >> >> >> >> > Thanks, >> Sundar (a Pig n00b) >> >> > "That >> language is an >>> instrument of human reason, and > not merely a medium >> for the expression of >>> thought, > is a truth generally >> admitted." >> - George Boole, quoted > in Iverson's >>> Turing Award >> Lecture
