Hi all,
   I ran into a problem where the GROUP operator returns its results in a
different order on different execution engines, such as Spark and MapReduce
(PIG-4282 <https://issues.apache.org/jira/browse/PIG-4282>).

groupdistinct.pig is:
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
    D = A.gpa;
    E = distinct D;
    generate group, MIN(E);
};
dump C;
input1.txt is:
10 89
20 78
10 68
10 89
20 92
The MapReduce output is:
(10,68),(20,78)
The Spark output is:
(20,78),(10,68)
The two results contain the same tuples, but they differ in the order in which
the 'group' field values appear.

Is there any way to guarantee that the GROUP operator in Pig emits the 'group'
field values in the same order as they appear in the input?
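
For reference, one possible workaround is sketched below. It is an assumption
on my side and not something confirmed in PIG-4282: it sorts the result by the
group key rather than preserving input order, and the alias F is only
illustrative. It relies on the documented behaviour that a relation is in
sorted order when ORDER BY is applied immediately before DUMP or STORE.

```pig
-- Same script as groupdistinct.pig, with an explicit sort before the dump.
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
    D = A.gpa;
    E = distinct D;
    generate group, MIN(E);
};
-- $0 is the 'group' field of C; sorting makes the output order explicit
-- instead of depending on how each engine happens to emit the groups.
F = order C by $0;
dump F;
```

This costs an extra sort, and it does not answer whether GROUP itself can
preserve input order.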


Best regards
Zhang,Liyun
