Re: Any better way to ensure unicity ?

Dmitriy Ryaboy Thu, 15 Jul 2010 16:12:24 -0700

oh and of course you can go crazy with this:

fract = limit (foreach (load 'tmp/numbers' as (letter:chararray, x:int,
y:int)) generate letter) 1;



On Thu, Jul 15, 2010 at 3:56 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Um.
>
> grunt> nums = limit (load 'tmp/numbers' as (letter:chararray, x:int,
> y:int)) 1;
> grunt> dump nums
> (a,1,2)
>
> grunt> nums = load 'tmp/numbers' as (letter:chararray, x:int,
> y:int);
> grunt> fract = limit (foreach nums generate letter) 1;
> grunt> dump fract
> (a)
>
> Note that you can do the same for a number of operators, including, most
> handily, foreach:
>
> foo = foreach (group data by id) generate group as id, COUNT(data) as
> num_rows;
>
>
> On Thu, Jul 15, 2010 at 3:39 PM, hc busy <[email protected]> wrote:
>
>> But, to be clear, PigLatin is easy to read tho, so far, even with a
>> 2k line script...
>>
>> On Thu, Jul 15, 2010 at 3:33 PM, hc busy <[email protected]> wrote:
>>
>> > LIMIT is an extra line to type. But I guess if we're using pig, we don't
>> > really care for elegance and concision huh?
>> >
>> >
>> > On Wed, Jul 14, 2010 at 12:25 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> >
>> >> hc, two things about that approach :
>> >>
>> >> 1) if you use the accumulator interface, the bag won't be materialized
>> >> 2) am I missing something? Why can't you just use LIMIT 1?
>> >>
>> >> -D
>> >>
>> >> On Wed, Jul 14, 2010 at 10:39 AM, hc busy <[email protected]> wrote:
>> >>
>> >> > Write a UDF called
>> >> >
>> >> > takeOne()
>> >> >
>> >> > that takes the first thing from the bag and returns it. The only
>> >> > problem that I'm having is that this UDF cannot signal to pig that
>> >> > it is done. So that whole bag is always created in its entirety.
>> >> >
>> >> >
>> >> > Btw, this UDF will be able to accomplish the same task (picking
>> >> > one item out of a bag)
>> >> >
>> >> > https://issues.apache.org/jira/browse/PIG-1386
>> >> >
>> >> > because MaxTupleByNthField extends the original MaxTupleBy1stField by
>> >> > allowing you to specify any column in the tuple as the comparison
>> >> > key. And because it handles typing correctly, your schema will be as
>> >> > you expect automatically.
>> >> >
>> >> > sessions = GROUP sessions BY sid;
>> >> > sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>> >> > FLATTEN(first);};
>> >> > sessions = FOREACH sessions GENERATE sid, .. and all the fields I
>> >> > have in the session table...
>> >> >
>> >> >
>> >> > is replaced with
>> >> >
>> >> > session = GROUP session by sid;
>> >> > session = FOREACH session generate MaxTupleByNthField(session);
>> >> >
>> >> > That's it. It'll have the right schema and all the columns from
>> >> > before, but chooses one of the data points.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jul 14, 2010 at 9:39 AM, Scott Carey <[email protected]> wrote:
>> >> >
>> >> > > I run into this situation all the time.  You have to do a foreach
>> >> > > ... generate projection at the end to rename everything.
>> >> > >
>> >> > > The way aliases work in pig, you quite often have to do 'renaming
>> >> > > only' projections if you don't want to make other bits of code
>> >> > > later change:
>> >> > > After the group and limit:
>> >> > >
>> >> > > sessions = FOREACH sessions GENERATE field1 as field1, field2 as
>> >> > > field2, field3 as field3 . . .
>> >> > >
>> >> > > That will get rid of the :: prefixes and make the alias shareable
>> >> > > with later pig code and not dependent on what you do in the group
>> >> > > to filter data.
>> >> > >
>> >> > >
>> >> > > On Jul 13, 2010, at 1:48 AM, Vincent Barat wrote:
>> >> > >
>> >> > > > Actually you are right: the schema is the same. Nevertheless,
>> >> > > > the "naming" of the various columns in the schema is modified,
>> >> > > > and thus my subsequent operations fail:
>> >> > > >
>> >> > > > original schema:
>> >> > > > sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}
>> >> > > >
>> >> > > > modified schema:
>> >> > > > sessions: {first::sid: chararray,first::infoid:
>> >> > > > chararray,first::imei: chararray,first::start: long}
>> >> > > >
>> >> > > > Do you know a workaround ?
>> >> > > >
>> >> > > > On 13/07/10 10:13, Mridul Muralidharan wrote:
>> >> > > >>
>> >> > > >> The flatten will return the same schema as before (in 'first'),
>> >> > > >> so unless you are modifying the fields or the order in which
>> >> > > >> they are generated (which I don't think you are, in view of your
>> >> > > >> comment that it should work with and without this), you can
>> >> > > >> simply go with:
>> >> > > >>
>> >> > > >> -- Or whatever works for you.
>> >> > > >> %default PARALLELISM        '10'
>> >> > > >>
>> >> > > >> sessions = DISTINCT sessions PARALLEL $PARALLELISM;
>> >> > > >>
>> >> > > >> OR
>> >> > > >>
>> >> > > >> sessions = GROUP sessions BY sid  PARALLEL $PARALLELISM;
>> >> > > >> sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE
>> >> > > >> FLATTEN(first);};
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >> The schema at the end would be exactly same as start of the code
>> >> > > >> snippet for 'sessions'.
>> >> > > >>
>> >> > > >>
>> >> > > >> Regards,
>> >> > > >> Mridul
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >> On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:
>> >> > > >>>
>> >> > > >>>
>> >> > > >>> On 12/07/10 16:56, Mridul Muralidharan wrote:
>> >> > > >>>>
>> >> > > >>>> I am not sure what you mean here exactly.
>> >> > > >>>> Will a sid row have multiple (different) values for the other
>> >> > > >>>> fields ?
>> >> > > >>> Yes.
>> >> > > >>>>
>> >> > > >>>> But if you want to pick any one row for a given sid, then I
>> >> > > >>>> think what you have below might be good enough (you can omit
>> >> > > >>>> the last line though).
>> >> > > >>> OK. Thanks. The last line is used to retrieve the exact same
>> >> > > >>> data structure and naming as the original table. This way, I can
>> >> > > >>> optionally perform this treatment without modifying my code. If
>> >> > > >>> you know a better way...
>> >> > > >>>
>> >> > > >>> Cheers,
>> >> > > >>>
>> >> > > >>>>
>> >> > > >>>> Regards,
>> >> > > >>>> Mridul
>> >> > > >>>>
>> >> > > >>>>
>> >> > > >>>>
>> >> > > >>>> On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
>> >> > > >>>>>    Hello everybody,
>> >> > > >>>>>
>> >> > > >>>>> I have a simple table containing sessions. Each session has
>> >> > > >>>>> a unique key (the sid, which is actually a uuid).
>> >> > > >>>>> But a session can be present several times in my input table.
>> >> > > >>>>>
>> >> > > >>>>> I want to ensure that I only have 1 record for each sid
>> >> > > >>>>> (because I perform subsequent JOIN based on this sid).
>> >> > > >>>>>
>> >> > > >>>>> Currently I use the following script, but I wonder if there
>> >> > > >>>>> is something more efficient:
>> >> > > >>>>>
>> >> > > >>>>> sessions = GROUP sessions BY sid;
>> >> > > >>>>> sessions = FOREACH sessions { first = LIMIT sessions 1;
>> >> > > >>>>> GENERATE FLATTEN(first);};
>> >> > > >>>>> sessions = FOREACH sessions GENERATE sid, .. and all the
>> >> > > >>>>> fields I have in the session table...
>> >> > > >>>>>
>> >> > > >>>>> Do you see any optimization I can do, especially on the
>> >> > > >>>>> FLATTEN / GENERATE part ?
>> >> > > >>>>>
>> >> > > >>>>> Thank you very much for your help.
>> >> > > >>>>
>> >> > > >>>>
>> >> > > >>
>> >> > > >>
>> >> > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>
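
For reference, here is a minimal sketch of the pattern this thread converges on: group by sid, keep one row per group with a nested LIMIT, then do a renaming-only projection so the first:: prefixes disappear. The input path 'tmp/sessions' and the aliases by_sid, one_per_sid and deduped are assumptions made for illustration; the column names and types are taken from the schema Vincent posted. It has not been run against any particular Pig version, so treat it as a starting point rather than a tested script.

```pig
-- Assumed input location and loader schema (illustrative only).
sessions = LOAD 'tmp/sessions'
           AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- One bag of rows per sid.
by_sid = GROUP sessions BY sid;

-- Keep an arbitrary single row from each bag.
one_per_sid = FOREACH by_sid {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};

-- Renaming-only projection: restores the original column names,
-- so later code does not need to know about the first:: prefix.
deduped = FOREACH one_per_sid GENERATE
    first::sid    AS sid,
    first::infoid AS infoid,
    first::imei   AS imei,
    first::start  AS start;

DESCRIBE deduped;
```

Using distinct aliases for each step, instead of reassigning sessions as in the original script, is only a readability choice here; which row LIMIT keeps within each group is not specified.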
