Hmmm.... I didn't know you could go crazy like that. I take back what I said about Pig being inelegant and not concise.
Oh, there isn't a way to extend the language. I mean, unless those additional "#define" syntaxes have been implemented while I wasn't looking. And I still think recursive functions are a must-have.

On Thu, Jul 15, 2010 at 8:32 PM, Mridul Muralidharan wrote:

> It is more about maintaining yet another UDF which duplicates functionality
> that the base language already provides ...
> So the tradeoff is between using a language construct (which might be
> optimized internally) and writing extension code.
>
> Mridul
>
> On Friday 16 July 2010 04:03 AM, hc busy wrote:
>
>> LIMIT is an extra line to type. But I guess if we're using Pig, we don't
>> really care for elegance and concision, huh?
>>
>> On Wed, Jul 14, 2010 at 12:25 PM, Dmitriy Ryaboy wrote:
>>
>>> hc, two things about that approach:
>>>
>>> 1) if you use the accumulator interface, the bag won't be materialized
>>> 2) am I missing something? Why can't you just use LIMIT 1?
>>>
>>> -D
>>>
>>> On Wed, Jul 14, 2010 at 10:39 AM, hc busy wrote:
>>>
>>>> Write a UDF called
>>>>
>>>> takeOne()
>>>>
>>>> that takes the first thing from the bag and returns it. The only problem
>>>> I'm having is that this UDF cannot signal to Pig that it is done, so
>>>> the whole bag is always created in its entirety.
>>>>
>>>> Btw, this UDF will be able to accomplish the same task (picking one
>>>> item out of a bag):
>>>>
>>>> https://issues.apache.org/jira/browse/PIG-1386
>>>>
>>>> because MaxTupleByNthField extends the original MaxTupleBy1stField by
>>>> allowing you to specify any column in the tuple as the comparison key.
>>>> And because it handles typing correctly, your schema will be as you
>>>> expect automatically.
>>>>
>>>> sessions = GROUP sessions BY sid;
>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...
>>>>
>>>> is replaced with
>>>>
>>>> session = GROUP session by sid;
>>>> session = FOREACH session generate MaxTupleByNthField(session);
>>>>
>>>> That's it. It'll have the right schema and all the columns from before,
>>>> but it chooses one of the data points.
>>>>
>>>> On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey wrote:
>>>>
>>>>> I run into this situation all the time. You have to do a foreach ...
>>>>> generate projection at the end to rename everything.
>>>>>
>>>>> The way aliases work in Pig, you quite often have to do 'renaming only'
>>>>> projections if you don't want to have to change other bits of code later.
>>>>> After the group and limit:
>>>>>
>>>>> sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2, field3 as field3 . . .
>>>>>
>>>>> That will get rid of the :: prefixes and make the alias shareable with
>>>>> later Pig code and not dependent on what you do in the group to filter
>>>>> data.
>>>>>
>>>>> On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>>>>>
>>>>>> Actually you are right: the schema is the same. Nevertheless, the
>>>>>> "naming" of the various columns in the schema is modified, and thus
>>>>>> my subsequent operations fail:
>>>>>>
>>>>>> original schema:
>>>>>> sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
>>>>>>
>>>>>> modified schema:
>>>>>> sessions: {first::sid: chararray,first::infoid: chararray,first::imei: chararray,first::start: long}
>>>>>>
>>>>>> Do you know a workaround?
>>>>>>
>>>>>> On 13/07/10 10:13, Mridul Muralidharan wrote:
>>>>>>
>>>>>>> The flatten will return the same schema as before (in 'first'):
>>>>>>> so unless you are modifying the fields or the order in which they
>>>>>>> are generated (which I don't think you are, in view of your comment
>>>>>>> that it should work with and without this), you can simply go with:
>>>>>>>
>>>>>>> -- Or whatever works for you.
>>>>>>> %define PARALLELISM '10'
>>>>>>>
>>>>>>> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> sessions = GROUP sessions BY sid PARALLEL $PARALLELISM;
>>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
>>>>>>>
>>>>>>> The schema at the end would be exactly the same as at the start of
>>>>>>> the code snippet for 'sessions'.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>>>>>>>
>>>>>>>> On 12/07/10 16:56, Mridul Muralidharan wrote:
>>>>>>>>
>>>>>>>>> I am not sure what you mean here exactly.
>>>>>>>>> Will a sid row have multiple (different) values for the other
>>>>>>>>> fields?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>> But if you want to pick any one row for a given sid, then I think
>>>>>>>>> what you have below might be good enough (you can omit the last
>>>>>>>>> line though).
>>>>>>>>
>>>>>>>> OK. Thanks. The last line is used to retrieve the exact same data
>>>>>>>> structure and naming as the original table. This way, I can
>>>>>>>> optionally perform this treatment without modifying my code. If you
>>>>>>>> know a better way...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mridul
>>>>>>>>>
>>>>>>>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>>>>>>>>>
>>>>>>>>>> Hello everybody,
>>>>>>>>>>
>>>>>>>>>> I have a simple table containing sessions. Each session has a
>>>>>>>>>> unique key (the sid, which is actually a uuid).
>>>>>>>>>> But a session can be present several times in my input table.
>>>>>>>>>>
>>>>>>>>>> I want to ensure that I only have 1 record for each sid (because
>>>>>>>>>> I perform subsequent JOINs based on this sid).
>>>>>>>>>>
>>>>>>>>>> Currently I use the following script, but I wonder if there is
>>>>>>>>>> something more efficient:
>>>>>>>>>>
>>>>>>>>>> sessions = GROUP sessions BY sid;
>>>>>>>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first);};
>>>>>>>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...
>>>>>>>>>>
>>>>>>>>>> Do you see any optimization I can do, especially on the FLATTEN /
>>>>>>>>>> GENERATE part?
>>>>>>>>>>
>>>>>>>>>> Thank you very much for your help.
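
For anyone reading this thread later: the result the GROUP / LIMIT 1 / FLATTEN scripts above are after (one arbitrary record per sid, with the original field names intact) can be modelled outside Pig. The sketch below is plain Python, not Pig, and says nothing about how Pig executes the query; the helper name one_per_key and the sample rows are made up for illustration, and only the field names come from the schema quoted in the thread.

```python
def one_per_key(rows, key="sid"):
    """Keep the first row seen for each key value (hypothetical helper).

    Models the intent of GROUP BY sid + LIMIT 1 + FLATTEN: one record per
    sid, with the same field names as the input rows.
    """
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if this key has not been seen yet
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


# Made-up sample data using the field names from the thread's schema.
sessions = [
    {"sid": "a", "infoid": "i1", "imei": "m1", "start": 10},
    {"sid": "a", "infoid": "i2", "imei": "m1", "start": 20},
    {"sid": "b", "infoid": "i3", "imei": "m2", "start": 30},
]

deduped = one_per_key(sessions)
```

Note that this sketch always keeps the first row it encounters, whereas the thread only asks for any one row per sid.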
