I am not sure exactly what you mean here.
Can a given sid have multiple rows with different values in the other fields?


If not, that is, if the repeated rows for a sid are exact duplicates of each other, you can simply use DISTINCT to achieve what you require:

sessions = DISTINCT sessions PARALLEL $PARALLELISM;



But if you want to pick any one row for a given sid, then I think what you have below is good enough. You can omit the last line, though: the final FOREACH only re-projects the fields that FLATTEN(first) already produces.
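
As a rough sketch of both options side by side (the input path and the three-column schema are made up for illustration, and $PARALLELISM is assumed to be supplied on the command line with -param):

-- Hypothetical input and schema; replace with your own.
sessions = LOAD 'sessions.tsv' AS (sid:chararray, user_id:chararray, started_at:long);

-- Option 1: rows repeated for a sid are exact copies, so whole-row
-- de-duplication is enough.
unique_rows = DISTINCT sessions PARALLEL $PARALLELISM;

-- Option 2: rows for a sid may differ, and any one of them will do.
by_sid = GROUP sessions BY sid PARALLEL $PARALLELISM;
one_per_sid = FOREACH by_sid {
    first = LIMIT sessions 1;   -- keep an arbitrary single row per sid
    GENERATE FLATTEN(first);    -- unnest it back into a flat record
};

Note that option 1 still leaves several rows for a sid whenever those rows differ in any field, so only option 2 guarantees a single row per sid before the JOIN.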


Regards,
Mridul



On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
   Hello everybody,

I have a simple table containing sessions. Each session has a
unique key (the sid, which is actually a UUID).
But a session can be present several times in my input table.

I want to ensure that I only have 1 record for each sid (because I
perform subsequent JOIN based on this sid).

Currently I use the following script, but I wonder if there is
something more efficient:

sessions = GROUP sessions BY sid;
sessions = FOREACH sessions {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};
sessions = FOREACH sessions GENERATE sid, .. and all the fields I
have in the session table...

Do you see any optimization I can make, especially in the FLATTEN /
GENERATE part?

Thank you very much for your help.
