hc, two things about that approach:
1) If you use the Accumulator interface, the bag won't be materialized.
2) Am I missing something? Why can't you just use LIMIT 1?
-D

On Wed, Jul 14, 2010 at 10:39 AM, hc busy <[email protected]> wrote:

> Write a UDF called
>
> takeOne()
>
> that takes the first thing from the bag and returns it. The only problem
> that I'm having is that this UDF cannot signal to Pig that it is done, so
> the whole bag is always created in its entirety.
>
> Btw, this UDF will be able to accomplish the same task (picking one item
> out of a bag):
>
> https://issues.apache.org/jira/browse/PIG-1386
>
> because MaxTupleByNthField extends the original MaxTupleBy1stField by
> allowing you to specify any column in the tuple as the comparison key. And
> because it handles typing correctly, your schema will be as you expect
> automatically.
>
> sessions = GROUP sessions BY sid;
> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...
>
> is replaced with
>
> session = GROUP session by sid;
> session = FOREACH session generate MaxTupleByNthField(session);
>
> That's it. It'll have the right schema and all the columns from before, but
> it chooses one of the data points.
>
> On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <[email protected]> wrote:
>
> > I run into this situation all the time. You have to do a foreach ...
> > generate projection at the end to rename everything.
> >
> > The way aliases work in Pig, you quite often have to do 'renaming only'
> > projections if you don't want to make other bits of code change later.
> > After the group and limit:
> >
> > sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2, field3 as field3 . . .
> >
> > That will get rid of the :: prefixes and make the alias shareable with
> > later Pig code and not dependent on what you do in the group to filter
> > data.
> >
> > On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
> >
> > > Actually you are right: the schema is the same. Nevertheless, the
> > > "naming" of the various columns in the schema is modified, and thus
> > > my subsequent operations fail:
> > >
> > > original schema:
> > > sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
> > >
> > > modified schema:
> > > sessions: {first::sid: chararray,first::infoid: chararray,first::imei: chararray,first::start: long}
> > >
> > > Do you know a workaround?
> > >
> > > On 13/07/10 10:13, Mridul Muralidharan wrote:
> > >>
> > >> The flatten will return the same schema as before (in 'first'),
> > >> so unless you are modifying the fields or the order in which they
> > >> are generated (which I don't think you are, in view of your comment
> > >> that it should work with and without this), you can simply go with:
> > >>
> > >> -- Or whatever works for you.
> > >> %define PARALLELISM '10'
> > >>
> > >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
> > >>
> > >> OR
> > >>
> > >> sessions = GROUP sessions BY sid PARALLEL $PARALLELISM;
> > >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
> > >>
> > >> The schema at the end would be exactly the same as at the start of
> > >> the code snippet for 'sessions'.
> > >>
> > >> Regards,
> > >> Mridul
> > >>
> > >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
> > >>>
> > >>> On 12/07/10 16:56, Mridul Muralidharan wrote:
> > >>>>
> > >>>> I am not sure what you mean here exactly.
> > >>>> Will a sid row have multiple (different) values for the other
> > >>>> fields?
> > >>> Yes.
> > >>>>
> > >>>> But if you want to pick any one row for a given sid, then I think
> > >>>> what you have below might be good enough (you can omit the last
> > >>>> line though).
> > >>> OK. Thanks.
> > >>> The last line is used to retrieve the exact same data
> > >>> structure and naming as the original table. This way, I can
> > >>> optionally perform this treatment without modifying my code. If you
> > >>> know a better way...
> > >>>
> > >>> Cheers,
> > >>>
> > >>>>
> > >>>> Regards,
> > >>>> Mridul
> > >>>>
> > >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
> > >>>>> Hello everybody,
> > >>>>>
> > >>>>> I have a simple table containing sessions. Each session has a
> > >>>>> unique key (the sid, which is actually a uuid).
> > >>>>> But a session can be present several times in my input table.
> > >>>>>
> > >>>>> I want to ensure that I only have 1 record for each sid (because I
> > >>>>> perform subsequent JOINs based on this sid).
> > >>>>>
> > >>>>> Currently I use the following script, but I wonder if there is
> > >>>>> something more efficient:
> > >>>>>
> > >>>>> sessions = GROUP sessions BY sid;
> > >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
> > >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> > >>>>> have in the session table...
> > >>>>>
> > >>>>> Do you see any optimization I can do, especially on the FLATTEN /
> > >>>>> GENERATE part?
> > >>>>>
> > >>>>> Thank you very much for your help.
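
For reference, the GROUP / LIMIT 1 / FLATTEN approach discussed in this thread can be combined with the renaming step in a single statement. This is only a sketch under assumptions: the load path 'sessions_input' and the alias by_sid are placeholders that do not appear in the thread, the schema is the one Vincent quoted, and the AS clause on FLATTEN is a suggested shortcut for the 'renaming only' projection rather than something proposed by any of the posters. It has not been run against the posters' data.

```pig
-- Placeholder input path (assumption); schema taken from the thread.
sessions = LOAD 'sessions_input'
           AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- by_sid is a hypothetical alias; each group holds a bag named 'sessions'.
by_sid = GROUP sessions BY sid;

-- Keep one arbitrary row per sid. The AS clause names the flattened
-- fields directly, which should avoid the first:: prefixes.
sessions = FOREACH by_sid {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first) AS (sid, infoid, imei, start);
};

-- Inspect the resulting schema before relying on it in later JOINs.
DESCRIBE sessions;
```

If the AS clause does not behave as expected on a given Pig version, the separate FOREACH ... GENERATE field AS field projection described by Scott Carey achieves the same renaming.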
