liyunzhang_intel created PIG-4345:
-------------------------------------
Summary: e2e test "RubyUDFs_13" fails because of the different
result of "group a all" in different engines like "spark", "mapreduce"
Key: PIG-4345
URL: https://issues.apache.org/jira/browse/PIG-4345
Project: Pig
Issue Type: Bug
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
The RubyUDFs e2e script is at line 3818 of nightly.conf:
{code}
'num' => 13,
'java_params' => ['-Dpig.accumulative.batchsize=5'],
'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(myfuncs.AppendIndex(a));
store b into ':OUTPATH:';\,
'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
age:int, gpa:double);
b = foreach (group a all) generate
FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a));
store b into ':OUTPATH:';\,
},
]
},
{code}
RubyUDFs_13.pig tests the Ruby UDF "AppendIndex" in "morerubyudfs.rb". Its output
is compared with that of the verification script, which uses the Java UDF
"org.apache.pig.test.udf.evalfunc.AppendIndex".
For example, if the test file "studenttab10k" contains:
tom thompson 42 0.53
nick johnson 34 0.47
priscilla falkner 55 1.16
the result in the Spark engine will be:
tom thompson 42 0.53 1
nick johnson 34 0.47 2
priscilla falkner 55 1.16 3
while the result in the MapReduce engine, which the verification script uses, will be:
priscilla falkner 55 1.16 1
nick johnson 34 0.47 2
tom thompson 42 0.53 3
This difference between the Spark and MapReduce results causes the
RubyUDFs_13 e2e test to fail.
The root cause is that "group a all" returns the tuples of the bag in a
different order in each engine, and AppendIndex numbers them in that order.
In the Spark engine, "group a all" produces:
all {(tom thompson 42 0.53), (nick johnson 34 0.47), (priscilla falkner 55 1.16)}
In the MapReduce engine, "group a all" produces:
all {(priscilla falkner 55 1.16), (nick johnson 34 0.47), (tom thompson 42 0.53)}
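To illustrate why the bag order matters, here is a minimal Python sketch. The function append_index is a hypothetical stand-in for the AppendIndex UDF (it is not the code in morerubyudfs.rb or testudf.jar); it is only assumed to append a 1-based position to each tuple, which matches the outputs shown above. The two input lists reproduce the orderings reported for each engine.

```python
# Hypothetical stand-in for the AppendIndex UDF (an assumption, not the real UDF):
# append a 1-based position to each tuple in the bag.
def append_index(bag):
    return [row + (i,) for i, row in enumerate(bag, start=1)]

rows = [
    ("tom thompson", 42, 0.53),
    ("nick johnson", 34, 0.47),
    ("priscilla falkner", 55, 1.16),
]

spark_bag = list(rows)                 # bag order reported for the Spark engine
mapreduce_bag = list(reversed(rows))   # bag order reported for the MapReduce engine

spark_out = append_index(spark_bag)
mapreduce_out = append_index(mapreduce_bag)

# Even an order-insensitive comparison fails, because the index is
# attached to different rows in each engine.
outputs_match = sorted(spark_out) == sorted(mapreduce_out)

# Sorting the bag before indexing removes the dependence on engine order.
stable_match = append_index(sorted(spark_bag)) == append_index(sorted(mapreduce_bag))
```

In this sketch outputs_match is false and stable_match is true, which suggests that any approach making the bag order deterministic before the UDF runs would let both engines agree.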
If the test script is modified as follows, the RubyUDFs_13 e2e test passes:
{code}
{
'num' => 13,
'java_params' => ['-Dpig.accumulative.batchsize=5'],
'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b = foreach (group a2 all) generate FLATTEN(myfuncs.AppendIndex(a2));
store b into ':OUTPATH:';\,
'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b = foreach (group a2 all) generate
FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a2));
store b into ':OUTPATH:';\,
},
]
},
{code}
With the modified test script, the result in both the Spark and MapReduce engines will be:
nick johnson 34 0.47 1
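The reason the filtered script passes can be sketched in Python as well. As before, append_index is a hypothetical stand-in for the AppendIndex UDF, and the three sample rows are the ones used in the example above; the sketch assumes the two filters leave a single tuple in the bag.

```python
from itertools import permutations

# Hypothetical stand-in for the AppendIndex UDF (an assumption, not the real UDF):
# append a 1-based position to each tuple in the bag.
def append_index(bag):
    return [row + (i,) for i, row in enumerate(bag, start=1)]

rows = [
    ("tom thompson", 42, 0.53),
    ("nick johnson", 34, 0.47),
    ("priscilla falkner", 55, 1.16),
]

# Mirror the two filters added to the test script.
filtered = [r for r in rows if r[0] == "nick johnson" and r[1] == 34]

# Collect the indexed output for every possible ordering of the filtered bag.
results = {tuple(append_index(list(p))) for p in permutations(filtered)}
```

With one tuple left there is only one possible ordering, so results holds a single output and the engines cannot disagree. This holds only while the filters leave exactly one tuple in the bag.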