Re: Any better way to ensure unicity ?

hc busy Wed, 14 Jul 2010 10:48:20 -0700

Write a UDF called

takeOne()


that takes the first thing from the bag and returns it. The only problem
that I'm having is that this UDF cannot signal to Pig that it is done, so
the whole bag is always created in its entirety.
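
To make that limitation concrete, here is a minimal plain-Python sketch (not Pig code; `take_one` and `counting_source` are hypothetical names used only for illustration). It contrasts a bag that is fully built before the function sees it, which is the behaviour described above, with a lazy source where taking one item stops consumption early.

```python
def take_one(bag):
    """Return the first item of an iterable 'bag', or None if it is empty."""
    return next(iter(bag), None)


def counting_source(items, counter):
    """Yield items one by one, recording how many were actually produced."""
    for item in items:
        counter["produced"] += 1
        yield item


rows = [("s1", "a"), ("s1", "b"), ("s1", "c")]

# Case 1: the bag is built in its entirety before the function is called.
eager = {"produced": 0}
first_eager = take_one(list(counting_source(rows, eager)))

# Case 2: a lazy source lets the consumer stop after the first item.
lazy = {"produced": 0}
first_lazy = take_one(counting_source(rows, lazy))
```

Both calls return the same row; the difference is only in how many rows had to be produced to get it, which is the cost the UDF cannot avoid.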


Btw, this UDF will be able to accomplish the same task (picking one item
out of a bag)

https://issues.apache.org/jira/browse/PIG-1386

because MaxTupleByNthField extends the original MaxTupleBy1stField by
allowing you to specify any column in the tuple as the comparison key. And
because it handles typing correctly, your schema will be as you expect
automatically.

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
FLATTEN(first);};
sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
the session table...


is replaced with

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions GENERATE MaxTupleByNthField(sessions);

That's it. It'll have the right schema, with all columns from before, but
chooses one of the data points.
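
As a rough illustration of the intended result (one row per sid, same columns as the input), here is a small plain-Python sketch. It is not Pig code: `one_row_per_key` is a hypothetical helper, the sample rows are made up, and it keeps the first row seen for each key, which is only one possible way of choosing.

```python
def one_row_per_key(rows, key_index=0):
    """Keep the first row seen for each key; later duplicates are dropped."""
    chosen = {}
    for row in rows:
        # setdefault only stores the row if the key has not been seen yet.
        chosen.setdefault(row[key_index], row)
    return list(chosen.values())


# Made-up rows shaped like the sessions table: (sid, infoid, imei, start).
sessions = [
    ("sid-1", "info-a", "imei-1", 100),
    ("sid-2", "info-b", "imei-2", 200),
    ("sid-1", "info-c", "imei-1", 300),
]

unique_sessions = one_row_per_key(sessions)
```

Each surviving row keeps all of its original fields in their original order, so code that reads the rows afterwards does not need to change.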




On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <[email protected]> wrote:

> I run into this situation all the time.  You have to do a foreach ...
> generate projection at the end to rename everything.
>
> The way aliases work in pig, you quite often have to do 'renaming only'
> projections if you don't want to make other bits of code later change:
> After the group and limit:
>
> sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
> field3 as field3 . . .
>
> That will get rid of the :: prefixes and make the alias shareable with
> later pig code and not dependent on what you do in the group to filter data.
>
>
> On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>
> >  Actually you are right: the schema is the same, nevertheless, the
> > "naming" of the various columns in the schema is modified, and thus
> > my subsequent operations fail:
> >
> > original schema:
> > sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
> >
> > modified schema:
> > sessions: {first::sid: chararray,first::infoid:
> > chararray,first::imei: chararray,first::start: long}
> >
> > Do you know a workaround ?
> >
> > On 13/07/10 10:13, Mridul Muralidharan wrote:
> >>
> >> The flatten will return the same schema as before (in 'first') :
> >> so unless you are modifying the fields or the order in which they
> >> are generated (which I don't think you are in view of your comment
> >> that it should work with and without this), you can simply go with :
> >>
> >> -- Or whatever works for you.
> >> %define PARALLELISM        '10'
> >>
> >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
> >>
> >> OR
> >>
> >> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
> >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> >> FLATTEN(first);};
> >>
> >>
> >>
> >>
> >> The schema at the end would be exactly same as start of the code
> >> snippet for 'sessions'.
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >>
> >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
> >>>
> >>>
> >>> On 12/07/10 16:56, Mridul Muralidharan wrote:
> >>>>
> >>>> I am not sure what you mean here exactly.
> >>>> Will a sid row have multiple (different) values for the other
> >>>> fields ?
> >>> Yes.
> >>>>
> >>>> But if you want to pick any one row for a given sid, then I think
> >>>> what you have below might be good enough (you can omit the last
> >>>> line though).
> >>> OK. Thanks. The last line is used to retrieve the exact same data
> >>> structure and naming as the original table. This way, I can
> >>> optionally perform this treatment without modifying my code. If you
> >>> know a better way...
> >>>
> >>> Cheers,
> >>>
> >>>>
> >>>> Regards,
> >>>> Mridul
> >>>>
> >>>>
> >>>>
> >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
> >>>>>    Hello everybody,
> >>>>>
> >>>>> I have a simple table containing sessions. Each session has a
> >>>>> unique key (the sid, which is actually a uuid).
> >>>>> But a session can be present several times in my input table.
> >>>>>
> >>>>> I want to ensure that I only have 1 record for each sid (because I
> >>>>> perform subsequent JOIN based on this sid).
> >>>>>
> >>>>> Currently I use the following script, but I wonder if there is
> >>>>> something more efficient:
> >>>>>
> >>>>> sessions = GROUP sessions BY sid;
> >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> >>>>> FLATTEN(first);};
> >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> >>>>> have in the session table...
> >>>>>
> >>>>> Do you see any optimization I can do, especially on the FLATTEN /
> >>>>> GENERATE part ?
> >>>>>
> >>>>> Thank you very much for your help.
> >>>>
> >>>>
> >>
> >>
>
>
