Re: Any better way to ensure unicity ?

Mridul Muralidharan Tue, 13 Jul 2010 17:31:41 -0700


Then project it out and then do distinct ?


like
sessions = FOREACH sessions GENERATE required_fields;
sessions = DISTINCT sessions PARALLEL $PARALLELISM;


assuming you don't need the timestamp, of course.
If you do, then the group route might be the only option ...
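
A minimal sketch of both routes, assuming the four-column schema quoted further down the thread and assuming 'start' is the timestamp field. The input path, the intermediate alias names, and the choice of keeping the earliest row are illustrative assumptions, not something stated in the thread:

```pig
-- Hypothetical input path; the schema is the one quoted in the thread.
%default PARALLELISM '10'
sessions = LOAD 'sessions.tsv'
           AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- Route 1: drop the timestamp, then deduplicate whole rows.
projected = FOREACH sessions GENERATE sid, infoid, imei;
deduped   = DISTINCT projected PARALLEL $PARALLELISM;

-- Route 2: keep the timestamp and pick one row per sid
-- (here the row with the smallest start, an assumed choice).
grouped  = GROUP sessions BY sid PARALLEL $PARALLELISM;
earliest = FOREACH grouped {
    ordered = ORDER sessions BY start ASC;
    first   = LIMIT ordered 1;
    GENERATE FLATTEN(first);
};
```

Route 1 only guarantees one row per sid if the remaining fields never differ for a given sid; otherwise route 2 is the safer choice.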

Regards,
Mridul

On Tuesday 13 July 2010 05:57 PM, Vincent Barat wrote:

   Yes. I would have used DISTINCT too, but I cannot, since some of
the other fields can be different (the timestamp actually).

Thanks for your help.

On 13/07/10 11:06, Mridul Muralidharan wrote:


I am not sure why the prefix 'first' is coming in ... someone from
the Pig team can comment better.
Personally, though, I would use DISTINCT over the
GROUP/FOREACH/LIMIT/FLATTEN combination.



Regards,
Mridul


On Tuesday 13 July 2010 02:18 PM, Vincent Barat wrote:

Actually you are right: the schema is the same. Nevertheless, the
naming of the columns in the schema is modified, and thus my
subsequent operations fail:

original schema:
sessions: {sid: chararray,infoid: chararray,imei: chararray,start: long}

modified schema:
sessions: {first::sid: chararray,first::infoid: chararray,first::imei: chararray,first::start: long}

Do you know a workaround ?
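
One possible workaround, sketched here under assumptions and not confirmed anywhere in this thread, is to rename each prefixed column back to its original name with AS in a final projection. The input path is hypothetical; the field names come from the two schemas above:

```pig
-- Hypothetical input path; the schema is the original one shown above.
sessions = LOAD 'sessions.tsv'
           AS (sid:chararray, infoid:chararray, imei:chararray, start:long);

-- Keep one row per sid; FLATTEN yields the first::-prefixed names.
sessions = GROUP sessions BY sid;
sessions = FOREACH sessions {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};

-- Rename the prefixed columns back to the original names.
sessions = FOREACH sessions GENERATE
    first::sid    AS sid,
    first::infoid AS infoid,
    first::imei   AS imei,
    first::start  AS start;

-- Print the resulting schema to check the column names.
DESCRIBE sessions;
```

The explicit AS clauses are the part doing the renaming, so the list has to be kept in sync with the table's columns.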

On 13/07/10 10:13, Mridul Muralidharan wrote:


The FLATTEN will return the same schema as before (in 'first'),
so unless you are modifying the fields or the order in which they
are generated (which I don't think you are, given your comment
that it should work with and without this), you can simply go
with:

-- Or whatever works for you.
%default PARALLELISM '10'

sessions = DISTINCT sessions PARALLEL $PARALLELISM;

OR

sessions = GROUP sessions BY sid PARALLEL $PARALLELISM;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first); };




The schema at the end would be exactly the same as at the start of
the code snippet for 'sessions'.


Regards,
Mridul



On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote:



On 12/07/10 16:56, Mridul Muralidharan wrote:


I am not sure what you mean here exactly.
Will a sid row have multiple (different) values for the other
fields ?

Yes.


But if you want to pick any one row for a given sid, then I think
what you have below might be good enough (you can omit the last
line though).

OK. Thanks. The last line is used to retrieve the exact same data
structure and naming as the original table. This way, I can
optionally perform this treatment without modifying my code. If
you know a better way...

Cheers,


Regards,
Mridul



On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:

      Hello everybody,

I have a simple table containing sessions. Each session has a
unique key (the sid, which is actually a UUID).
But a session can be present several times in my input table.

I want to ensure that I only have 1 record for each sid (because I
perform subsequent JOINs based on this sid).

Currently I use the following script, but I wonder if there is
something more efficient:

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first); };
sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...

Do you see any optimization I can do, especially on the
FLATTEN / GENERATE part ?

Thank you very much for your help.
