[
https://issues.apache.org/jira/browse/PIG-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199664#comment-14199664
]
liyunzhang_intel commented on PIG-4265:
---------------------------------------
Thanks for the comment, [~xuefuz].
I agree that 15.9999999999998 and 16.0000000001 can be treated as equal when
compared within double precision tolerance.
The problem is in the cast:
when $2=15.9999999999998
(double)((int)$2*100)/100 = (double)((int)15.9999999999998*100)/100 =
(double)(15*100)/100 = 1500.0/100=15.0
when $2=16.0000000001
(double)((int)$2*100)/100 = (double)((int)16.0000000001*100)/100 =
(double)(16*100)/100 = 1600.0/100 = 16.0
So if the expression is changed as follows, I think the double precision
problem can be avoided:
(double)(ROUND($2*100))/100
when $2 = 15.9999999999998
(double)(ROUND($2*100))/100 = (double)(ROUND(1599.99999999998))/100 =
(double)(1600)/100 = 16.0
when $2 = 16.0000000001
(double)(ROUND($2*100))/100 = (double)(ROUND(1600.00000001))/100 =
(double)(1600)/100 = 16.0
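The two calculations above can be reproduced outside Pig with a minimal Java sketch. It assumes that Java's (int) cast and Math.round behave like Pig's (int) cast and ROUND for these inputs; the class and method names are hypothetical and are not part of the patch.

```java
class CastVsRound {
    // Mirrors (double)((int)$2*100)/100: the (int) cast applies to the value
    // first, so the fractional part is dropped before multiplying by 100.
    static double truncateFirst(double x) {
        return (double) ((int) x * 100) / 100;
    }

    // Mirrors (double)(ROUND($2*100))/100: scale by 100 first, then round
    // to the nearest whole number (Math.round stands in for Pig's ROUND).
    static double roundScaled(double x) {
        return (double) Math.round(x * 100) / 100;
    }

    public static void main(String[] args) {
        // The two sample sums discussed in this comment.
        double[] sums = {15.9999999999998, 16.0000000001};
        for (double s : sums) {
            System.out.println(s + " -> cast: " + truncateFirst(s)
                    + ", round: " + roundScaled(s));
        }
    }
}
```

With the truncating version the two inputs land on 15.0 and 16.0, while the rounding version gives 16.0 for both, which matches the hand calculation above.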
RubyUDFs_10.pig is generated from test/e2e/pig/tests/nightly.conf, lines
3776~3793:
{code}
3776 {
3777 # test accumulator functions
3778 'num' => 10,
3779 'java_params' => ['-Dpig.accumulative.batchsize=5'],
3780 'pig' => q\
3781 register ':SCRIPTHOMEPATH:/ruby/scriptingudfs.rb' using jruby as myfuncs;
3782 a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
3783 b = group a by name;
3784 c = foreach b generate group, myfuncs.Sum(a.age), myfuncs.Sum(a.gpa);
3785 d = foreach c generate $0, $1, (double)((int)$2*100)/100;
3786 store d into ':OUTPATH:';\,
3787 'verify_pig_script' => q\
3788 a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
3789 b = group a by name;
3790 c = foreach b generate group, SUM(a.age), SUM(a.gpa);
3791 d = foreach c generate $0, $1, (double)((int)$2*100)/100 ;
3792 store d into ':OUTPATH:';\,
3793 },
{code}
Currently, RubyUDFs_10.pig in the e2e tests succeeds in mapreduce mode but
fails in spark mode.
I found that if we change
"d = foreach c generate $0, $1, (double)((int)$2*100)/100"
to
"d = foreach c generate $0, $1, (double)(ROUND($2*100))/100",
the problem can be avoided (patch available). Can you help review?
> SUM functions returns different value in spark and mapreduce engine
> -------------------------------------------------------------------
>
> Key: PIG-4265
> URL: https://issues.apache.org/jira/browse/PIG-4265
> Project: Pig
> Issue Type: Bug
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Attachments: PIG-4265.patch
>
>
> $PIG_HOME/bin/pig -x local RubyUDFs_10.pig
> #RubyUDFs_10.pig
> a = load 'studenttab10k' using PigStorage() as (name, age:int, gpa:double);
> b = group a by name;
> c = foreach b generate group, SUM(a.age), SUM(a.gpa);
> d = foreach c generate $0, $1, (double)((int)$2*100)/100;
> store d into 'local.output/RubyUDFs_10_benchmark.out';
> the result in RubyUDFs_10.out/part
> #grep "david s" RubyUDFs_10.out/part-r-00000
> david steinbeck 266 15.0
> #grep "david s" studenttab10k
> david steinbeck 21 2.44
> david steinbeck 33 1.17
> david steinbeck 42 1.94
> david steinbeck 42 1.35
> david steinbeck 31 2.77
> david steinbeck 40 2.42
> david steinbeck 57 3.91
> When running Ruby_UDFs.pig in spark mode, sum(a.gpa) is 16.0 and
> (double)((int)$2*100)/100 gives "david steinbeck 266 16.0".
> When running Ruby_UDFs.pig in mapreduce mode, sum(a.gpa) is
> 15.999999999999998 and (double)((int)$2*100)/100 gives "david steinbeck
> 266 15.0".
> I don't know why the same code, run by different execution engines (spark
> and mapreduce) on the same OS, returns different results.