Re: Any better way to ensure unicity ?

Dmitriy Ryaboy Wed, 14 Jul 2010 12:26:01 -0700

hc, two things about that approach:

1) If you use the Accumulator interface, the bag won't be materialized.
2) Am I missing something? Why can't you just use LIMIT 1?
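
For reference, a minimal sketch of the LIMIT 1 route combined with the renaming projection discussed further down the thread. The schema is the one Vincent quotes; the input path 'sessions' and the intermediate aliases (grouped, one_per_sid, deduped) are made up for illustration, and this assumes a Pig version that accepts LIMIT inside a nested FOREACH block:

```pig
-- Hypothetical input path; schema copied from the thread.
sessions = LOAD 'sessions'
    AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- One group per sid; each group holds a bag of all rows sharing that sid.
grouped = GROUP sessions BY sid;

-- Keep an arbitrary single row from each bag.
one_per_sid = FOREACH grouped {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};

-- Renaming-only projection: strips the "first::" prefixes so the
-- result has the same field names as the original relation.
deduped = FOREACH one_per_sid GENERATE
    first::sid    AS sid,
    first::infoid AS infoid,
    first::imei   AS imei,
    first::start  AS start;
```

Which row survives for a given sid is not defined here; if that matters, an ORDER inside the nested block before the LIMIT would be needed.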


-D

On Wed, Jul 14, 2010 at 10:39 AM, hc busy <[email protected]> wrote:

> Write a UDF called
>
> takeOne()
>
> that takes the first thing from the bag and returns it. The only problem
> that I'm having is that this UDF cannot signal to pig that it is done. So
> the whole bag is always created in its entirety.
>
>
> Btw, this UDF will be able to accomplish the same task (picking one item
> out of a bag)
>
> https://issues.apache.org/jira/browse/PIG-1386
>
> because MaxTupleByNthField extends the original MaxTupleBy1stField by
> allowing you to specify any column in the tuple as the comparison key. And
> because it handles typing correctly, your schema will be as you expect
> automatically.
>
> sessions = GROUP sessions BY sid;
> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> FLATTEN(first);};
> sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in
> the session table...
>
>
> is replaced with
>
> session = GROUP session by sid;
> session = FOREACH session generate MaxTupleByNthField(session);
>
> That's it. It'll have the right schema, with all columns from before, but
> chooses one of the data points.
>
>
>
>
> On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <[email protected]> wrote:
>
> > I run into this situation all the time.  You have to do a foreach ...
> > generate projection at the end to rename everything.
> >
> > The way aliases work in pig, you quite often have to do 'renaming only'
> > projections if you don't want to make other bits of code later change:
> > After the group and limit:
> >
> > sessions = FOREACH sessions GENERATE field1 as field1, field2 as field2,
> > field3 as field3 . . .
> >
> > That will get rid of the :: prefixes and make the alias shareable with
> > later pig code and not dependent on what you do in the group to filter
> > data.
> >
> >
> > On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
> >
> > >  Actually you are right: the schema is the same, nevertheless, the
> > > "naming" of the various columns in the schema is modified, and thus
> > > my subsequent operations fail:
> > >
> > > original schema:
> > > sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
> > >
> > > modified schema:
> > > sessions: {first::sid: chararray,first::infoid:
> > > chararray,first::imei: chararray,first::start: long}
> > >
> > > Do you know a workaround ?
> > >
> > > On 13/07/10 10:13, Mridul Muralidharan wrote:
> > >>
> > >> The flatten will return the same schema as before (in 'first') :
> > >> so unless you are modifying the fields or the order in which they
> > >> are generated (which I don't think you are in view of your comment
> > >> that it should work with and without this), you can simply go with :
> > >>
> > >> -- Or whatever works for you.
> > >> %define PARALLELISM        '10'
> > >>
> > >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
> > >>
> > >> OR
> > >>
> > >> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
> > >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > >> FLATTEN(first);};
> > >>
> > >>
> > >>
> > >>
> > >> The schema at the end would be exactly the same as at the start of
> > >> the code snippet for 'sessions'.
> > >>
> > >>
> > >> Regards,
> > >> Mridul
> > >>
> > >>
> > >>
> > >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
> > >>>
> > >>>
> > >>> On 12/07/10 16:56, Mridul Muralidharan wrote:
> > >>>>
> > >>>> I am not sure what you mean here exactly.
> > >>>> Will a sid row have multiple (different) values for the other
> > >>>> fields ?
> > >>> Yes.
> > >>>>
> > >>>> But if you want to pick any one row for a given sid, then I think
> > >>>> what you have below might be good enough (you can omit the last
> > >>>> line though).
> > >>> OK. Thanks. The last line is used to retrieve the exact same data
> > >>> structure and naming as the original table. This way, I can
> > >>> optionally perform this treatment without modifying my code. If you
> > >>> know a better way...
> > >>>
> > >>> Cheers,
> > >>>
> > >>>>
> > >>>> Regards,
> > >>>> Mridul
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
> > >>>>>    Hello everybody,
> > >>>>>
> > >>>>> I have a simple table containing sessions. Each session has a
> > >>>>> unique key (the sid, which is actually a uuid).
> > >>>>> But a session can be present several times in my input table.
> > >>>>>
> > >>>>> I want to ensure that I only have 1 record for each sid (because I
> > >>>>> perform subsequent JOIN based on this sid).
> > >>>>>
> > >>>>> Currently I use the following script, but I wonder if there is
> > >>>>> something more efficient:
> > >>>>>
> > >>>>> sessions = GROUP sessions BY sid;
> > >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
> > >>>>> FLATTEN(first);};
> > >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the fields I
> > >>>>> have in the session table...
> > >>>>>
> > >>>>> Do you see any optimization I can do, especially on the FLATTEN /
> > >>>>> GENERATE part ?
> > >>>>>
> > >>>>> Thank you very much for your help.
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>
