Oh, and of course you can go crazy with this:

fract = limit (foreach (load 'tmp/numbers' as (letter:chararray, x:int, y:int)) generate letter) 1;

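The same nesting might also apply to Vincent's dedup question quoted further down. The following is an untested sketch: it assumes the four fields from his schema (sid, infoid, imei, start), a hypothetical input path, that the parenthesised inline relation also works with a nested foreach block, and that AS on the FLATTEN renames the columns so no first:: prefixes are left.

```pig
-- Untested sketch; field names and types come from the schema quoted below.
-- 'tmp/sessions' is a hypothetical path used only for illustration.
sessions = load 'tmp/sessions' as (sid:chararray, infoid:chararray, imei:chararray, start:long);

deduped = foreach (group sessions by sid) {
    -- keep one arbitrary row per sid
    first = limit sessions 1;
    -- naming the flattened columns here should avoid a separate renaming foreach
    generate flatten(first) as (sid, infoid, imei, start);
};

-- print the resulting schema to check the column names
describe deduped;
```

If the inline form is rejected, splitting the group into its own statement, as in the original script, should behave the same way.
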
On Thu, Jul 15, 2010 at 3:56 PM, Dmitriy Ryaboy wrote:

> Um.
>
> grunt> nums = limit (load 'tmp/numbers' as (letter:chararray, x:int, y:int)) 1;
> grunt> dump nums
> (a,1,2)
>
> grunt> nums = load 'tmp/numbers' as (letter:chararray, x:int, y:int);
> grunt> fract = limit (foreach nums generate letter) 1;
> grunt> dump fract
> (a)
>
> Note that you can do the same for a number of operators, including, most
> handily, foreach:
>
> foo = foreach (group data by id) generate group as id, COUNT(data) as num_rows;
>
>
> On Thu, Jul 15, 2010 at 3:39 PM, hc busy wrote:
>
>> But, to be clear, PigLatin is easy to read tho, so far, even with a 2k
>> line script...
>>
>> On Thu, Jul 15, 2010 at 3:33 PM, hc busy wrote:
>>
>> > LIMIT is an extra line to type. But I guess if we're using pig, we don't
>> > really care for elegance and concision huh?
>> >
>> >
>> > On Wed, Jul 14, 2010 at 12:25 PM, Dmitriy Ryaboy wrote:
>> >
>> >> hc, two things about that approach:
>> >>
>> >> 1) if you use the accumulator interface, the bag won't be materialized
>> >> 2) am I missing something? Why can't you just use LIMIT 1?
>> >>
>> >> -D
>> >>
>> >> On Wed, Jul 14, 2010 at 10:39 AM, hc busy wrote:
>> >>
>> >> > Write a UDF called
>> >> >
>> >> > takeOne()
>> >> >
>> >> > that takes the first thing from the bag and returns it. The only
>> >> > problem that I'm having is that this UDF cannot signal to pig that
>> >> > it is done. So that whole bag is always created in its entirety.
>> >> >
>> >> >
>> >> > Btw, this UDF will be able to accomplish the same task (picking out
>> >> > one item out of a bag)
>> >> >
>> >> > https://issues.apache.org/jira/browse/PIG-1386
>> >> >
>> >> > because MaxTupleByNthField extends the original MaxTupleBy1stField by
>> >> > allowing you to specify any column in the tuple as the comparison
>> >> > key. And because it handles typing correctly, your schema will be as
>> >> > you expect automatically.
>> >> >
>> >> > sessions = GROUP sessions BY sid;
>> >> > sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
>> >> > sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...
>> >> >
>> >> >
>> >> > is replaced with
>> >> >
>> >> > session = GROUP session by sid;
>> >> > session = FOREACH session generate MaxTupleByNthField(session);
>> >> >
>> >> > That's it. It'll have the right schema, all columns from before, but
>> >> > chooses one of the data points.
>> >> >
>> >> >
>> >> > On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey wrote:
>> >> >
>> >> > > I run into this situation all the time. You have to do a foreach
>> >> > > ... generate projection at the end to rename everything.
>> >> > >
>> >> > > The way aliases work in pig, you quite often have to do 'renaming
>> >> > > only' projections if you don't want to make other bits of code
>> >> > > later change:
>> >> > > After the group and limit:
>> >> > >
>> >> > > sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2, field3 as field3 . . .
>> >> > >
>> >> > > That will get rid of the :: prefixes and make the alias shareable
>> >> > > with later pig code and not dependent on what you do in the group
>> >> > > to filter data.
>> >> > >
>> >> > >
>> >> > > On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>> >> > >
>> >> > > > Actually you are right: the schema is the same. Nevertheless,
>> >> > > > the "naming" of the various columns in the schema is modified,
>> >> > > > and thus my subsequent operations fail:
>> >> > > >
>> >> > > > original schema:
>> >> > > > sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
>> >> > > >
>> >> > > > modified schema:
>> >> > > > sessions: {first::sid: chararray,first::infoid: chararray,first::imei: chararray,first::start: long}
>> >> > > >
>> >> > > > Do you know a workaround?
>> >> > > >
>> >> > > > On 13/07/10 10:13, Mridul Muralidharan wrote:
>> >> > > >>
>> >> > > >> The flatten will return the same schema as before (in 'first'):
>> >> > > >> so unless you are modifying the fields or the order in which
>> >> > > >> they are generated (which I don't think you are in view of your
>> >> > > >> comment that it should work with and without this), you can
>> >> > > >> simply go with:
>> >> > > >>
>> >> > > >> -- Or whatever works for you.
>> >> > > >> %default PARALLELISM '10'
>> >> > > >>
>> >> > > >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>> >> > > >>
>> >> > > >> OR
>> >> > > >>
>> >> > > >> sessions = GROUP sessions BY sid PARALLEL $PARALLELISM;
>> >> > > >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
>> >> > > >>
>> >> > > >>
>> >> > > >> The schema at the end would be exactly the same as at the start
>> >> > > >> of the code snippet for 'sessions'.
>> >> > > >>
>> >> > > >>
>> >> > > >> Regards,
>> >> > > >> Mridul
>> >> > > >>
>> >> > > >>
>> >> > > >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>> >> > > >>>
>> >> > > >>> On 12/07/10 16:56, Mridul Muralidharan wrote:
>> >> > > >>>>
>> >> > > >>>> I am not sure what you mean here exactly.
>> >> > > >>>> Will a sid row have multiple (different) values for the other
>> >> > > >>>> fields?
>> >> > > >>> Yes.
>> >> > > >>>>
>> >> > > >>>> But if you want to pick any one row for a given sid, then I
>> >> > > >>>> think what you have below might be good enough (you can omit
>> >> > > >>>> the last line though).
>> >> > > >>> OK. Thanks. The last line is used to retrieve the exact same
>> >> > > >>> data structure and naming as the original table. This way, I
>> >> > > >>> can optionally perform this treatment without modifying my
>> >> > > >>> code. If you know a better way...
>> >> > > >>>
>> >> > > >>> Cheers,
>> >> > > >>>
>> >> > > >>>>
>> >> > > >>>> Regards,
>> >> > > >>>> Mridul
>> >> > > >>>>
>> >> > > >>>>
>> >> > > >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>> >> > > >>>>> Hello everybody,
>> >> > > >>>>>
>> >> > > >>>>> I have a simple table containing sessions. Each session has
>> >> > > >>>>> a unique key (the sid, which is actually a uuid).
>> >> > > >>>>> But a session can be present several times in my input table.
>> >> > > >>>>>
>> >> > > >>>>> I want to ensure that I only have 1 record for each sid
>> >> > > >>>>> (because I perform subsequent JOINs based on this sid).
>> >> > > >>>>>
>> >> > > >>>>> Currently I use the following script, but I wonder if there
>> >> > > >>>>> is something more efficient:
>> >> > > >>>>>
>> >> > > >>>>> sessions = GROUP sessions BY sid;
>> >> > > >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
>> >> > > >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...
>> >> > > >>>>>
>> >> > > >>>>> Do you see any optimization I can do, especially on the
>> >> > > >>>>> FLATTEN / GENERATE part?
>> >> > > >>>>>
>> >> > > >>>>> Thank you very much for your help.
