I am not sure what you mean here exactly.
Can a given sid have rows with different values in the other fields?
If not, that is, if the duplicates are exact copies of whole rows, you can use
DISTINCT to achieve what you require:
sessions = DISTINCT sessions PARALLEL $PARALLELISM;
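To illustrate the difference, here is a small sketch. The file name and the schema are made up for the example, not taken from your script, and $PARALLELISM is assumed to be passed in as a parameter. DISTINCT compares whole tuples, so it only collapses rows that are identical in every field:

```pig
-- Hypothetical input: a tab-separated file with made-up fields.
sessions = LOAD 'sessions.tsv' AS (sid:chararray, uname:chararray, start_ts:long);

-- Two rows (s1, bob, 100) and (s1, bob, 100) collapse into one row.
-- Rows (s1, bob, 100) and (s1, bob, 200) are both kept, because they
-- differ in start_ts even though the sid is the same.
sessions = DISTINCT sessions PARALLEL $PARALLELISM;
```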
But if you want to pick any one row for a given sid, then I think what
you have below might be good enough (you can omit the last line though).
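Something along these lines, as a sketch only (the separate alias names are just for readability, they are not required):

```pig
-- Group all rows sharing the same sid.
grouped = GROUP sessions BY sid;

-- Keep one arbitrary row from each group.
uniq_sessions = FOREACH grouped {
    first = LIMIT sessions 1;
    GENERATE FLATTEN(first);
};
```

As far as I can tell, FLATTEN(first) already yields the original fields, so the extra FOREACH ... GENERATE should only be needed if you want to reorder or drop fields.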
Regards,
Mridul
On Monday 12 July 2010 06:53 PM, Vincent Barat wrote:
Hello everybody,
I have a simple table containing sessions. Each session has a
unique key (the sid, which is actually a UUID).
But a session can be present several times in my input table.
I want to ensure that I have only one record for each sid (because I
perform subsequent JOINs based on this sid).
Currently I use the following script, but I wonder if there is
something more efficient:
sessions = GROUP sessions BY sid;
sessions = FOREACH sessions { first = LIMIT sessions 1; GENERATE FLATTEN(first); };
sessions = FOREACH sessions GENERATE sid, .. and all the fields I
have in the session table...
Do you see any optimization I could make, especially on the FLATTEN /
GENERATE part?
Thank you very much for your help.