Re: Any better way to ensure unicity ?

Mridul Muralidharan Mon, 12 Jul 2010 07:57:25 -0700


I am not sure what you mean here exactly.
Can rows with the same sid have different values for the other fields?

If not, that is, if the repeated rows are exact duplicates of each other, you can use DISTINCT to achieve what you require:


sessions = DISTINCT sessions PARALLEL $PARALLELISM;
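
As a rough illustration (the file path, field names and sample rows here are made up, not taken from the original script): DISTINCT compares whole tuples, so it only collapses rows that are identical in every field.

```pig
-- Hypothetical path and schema, for illustration only.
sessions = LOAD 'sessions.tsv' AS (sid:chararray, user_id:chararray, start_time:long);

-- Suppose the input holds these rows:
--   (a1, bob, 100)
--   (a1, bob, 100)   exact duplicate of the first row: removed by DISTINCT
--   (a1, bob, 250)   same sid but a different start_time: kept by DISTINCT
sessions = DISTINCT sessions PARALLEL $PARALLELISM;
```

$PARALLELISM is filled in by Pig's parameter substitution, for example with -param PARALLELISM=10 on the command line (the value 10 is arbitrary).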

But if you want to pick any one row for a given sid, then I think what you have below might be good enough (you can omit the last line though).
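
A sketch of what the script might look like with the last line dropped. The aliases grouped and unique_sessions are my own naming, and the PARALLEL clause is optional; nothing else is meant to differ from the quoted script.

```pig
-- Assumes sessions is already loaded and has a sid field.
grouped = GROUP sessions BY sid PARALLEL $PARALLELISM;

-- Keep one arbitrary row per sid; which row survives is not defined.
unique_sessions = FOREACH grouped {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};
```

As far as I know, the flattened fields keep their original names (DESCRIBE may show them with a first:: prefix), which is why a final FOREACH that only re-lists every field should not be needed.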



Regards,
Mridul



On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:

Hello everybody,

I have a simple table containing sessions. Each session has a
unique key (the sid, which is actually a UUID).
But a session can be present several times in my input table.

I want to ensure that I only have 1 record for each sid (because I
perform subsequent JOINs based on this sid).

Currently I use the following script, but I wonder if there is
something more efficient:

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first); };
sessions = FOREACH sessions GENERATE sid, .. and all the fields I have in the session table...

Do you see any optimization I can do, especially on the FLATTEN /
GENERATE part ?

Thank you very much for your help.
