[jira] [Commented] (PIG-4265) SUM functions returns different value in spark and mapreduce engine

liyunzhang_intel (JIRA) Wed, 05 Nov 2014 18:35:41 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-4265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199664#comment-14199664
 ]


liyunzhang_intel commented on PIG-4265:
---------------------------------------

Thanks for the comment, [~xuefuz].
I agree that 15.9999999999998 and 16.0000000001 are equal when compared 
within double-precision tolerance.
The problem is in the cast. The (int) cast applies to $2 alone, before the 
multiplication, so the fractional part is truncated first.
When $2 = 15.9999999999998:
(double)((int)$2*100)/100 = (double)((int)15.9999999999998*100)/100 = 
(double)(15*100)/100 = 1500.0/100 = 15.0

When $2 = 16.0000000001:
(double)((int)$2*100)/100 = (double)((int)16.0000000001*100)/100 = 
(double)(16*100)/100 = 1600.0/100 = 16.0

So if the cast is changed as follows, I think the double precision problem 
can be avoided:
(double)(ROUND($2*100))/100 
When $2 = 15.9999999999998:
(double)(ROUND($2*100))/100 = (double)(ROUND(1599.99999999998))/100 = 
(double)(1600)/100 = 16.0
When $2 = 16.0000000001:
(double)(ROUND($2*100))/100 = (double)(ROUND(1600.00000001))/100 = 
(double)(1600)/100 = 16.0
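
The arithmetic for the truncating cast and the ROUND variant can be checked 
with a small sketch. It is written in Python rather than Pig, and it assumes 
that Pig's (int) cast truncates toward zero and that ROUND rounds half up 
like Java's Math.round. The helper names (java_style_round, truncate_cast, 
round_cast) are made up for this illustration.

```python
import math


def java_style_round(x: float) -> int:
    # Round half up; assumed here to match what Pig's ROUND does.
    return math.floor(x + 0.5)


def truncate_cast(x: float) -> float:
    # Mirrors (double)((int)$2*100)/100: the cast truncates $2 before the multiply.
    return float(int(x) * 100) / 100


def round_cast(x: float) -> float:
    # Mirrors (double)(ROUND($2*100))/100: scale first, then round.
    return float(java_style_round(x * 100)) / 100


for value in (15.9999999999998, 16.0000000001):
    print(value, truncate_cast(value), round_cast(value))
```

With these two inputs, truncate_cast returns 15.0 and 16.0 while round_cast 
returns 16.0 for both. The two expressions are not equivalent for other 
inputs: the truncating form drops the whole fractional part, while the ROUND 
form keeps two decimal places.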

RubyUDFs_10.pig is generated from test/e2e/pig/tests/nightly.conf, lines 
3776~3793:
{code}
3776                   {
3777                     # test accumulator functions
3778                     'num' => 10,
3779                     'java_params' => ['-Dpig.accumulative.batchsize=5'],
3780                     'pig' => q\
3781 register ':SCRIPTHOMEPATH:/ruby/scriptingudfs.rb' using jruby as myfuncs;
3782 a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
3783 b = group a by name;
3784 c = foreach b generate group, myfuncs.Sum(a.age), myfuncs.Sum(a.gpa);
3785 d = foreach c generate $0, $1, (double)((int)$2*100)/100;
3786 store d into ':OUTPATH:';\,
3787                                     'verify_pig_script' => q\
3788 a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
3789 b = group a by name;
3790 c = foreach b generate group, SUM(a.age), SUM(a.gpa);
3791 d = foreach c generate $0, $1,  (double)((int)$2*100)/100 ;
3792 store d into ':OUTPATH:';\,
3793                     },
{code}

Currently the RubyUDFs_10.pig e2e test succeeds in mapreduce mode but fails 
in spark mode. 
I found that changing
"d = foreach c generate $0, $1, (double)((int)$2*100)/100"
to
"d = foreach c generate $0, $1, (double)(ROUND($2*100))/100"
avoids the problem (patch available). Can you help review?

> SUM functions returns different value in spark and mapreduce engine
> -------------------------------------------------------------------
>
>                 Key: PIG-4265
>                 URL: https://issues.apache.org/jira/browse/PIG-4265
>             Project: Pig
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: PIG-4265.patch
>
>
> $PIG_HOME/bin/pig -x local RubyUDFs_10.pig
> #RubyUDFs_10.pig
> a = load 'studenttab10k' using PigStorage() as (name, age:int, gpa:double);
> b = group a by name;
> c = foreach b generate group, SUM(a.age), SUM(a.gpa);
> d = foreach c generate $0, $1, (double)((int)$2*100)/100;
> store d into 'local.output/RubyUDFs_10_benchmark.out';
> the result in RubyUDFs_10.out/part
> #grep "david s" RubyUDFs_10.out/part-r-00000 
> david steinbeck       266     15.0
> #grep "david s" studenttab10k
> david steinbeck       21      2.44
> david steinbeck       33      1.17
> david steinbeck       42      1.94
> david steinbeck       42      1.35
> david steinbeck       31      2.77
> david steinbeck       40      2.42
> david steinbeck       57      3.91
> When running Ruby_UDFs.pig in spark mode, sum(a.gpa) is 16.0 and 
> (double)((int)$2*100)/100 yields "david steinbeck     266     16.0".
> When running Ruby_UDFs.pig in mapreduce mode, sum(a.gpa) is 
> 15.999999999999998 and (double)((int)$2*100)/100 yields "david steinbeck     
> 266     15.0".
> I don't know why the same code run by different execution engines (spark and 
> mapreduce) on the same OS returns different results. 
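
One possible explanation, offered as an assumption and not confirmed in this 
issue: floating-point addition is not associative, so two engines that 
accumulate the same GPA values in a different order can end up with sums 
that differ in the last bits. The Python sketch below adds the seven 
"david steinbeck" GPA values in every possible order and prints each 
distinct total together with the result of the truncating cast. The names 
sequential_sum and distinct_sums are hypothetical, and the sketch says 
nothing about the order either engine actually uses.

```python
import itertools

# GPA values for "david steinbeck" taken from the issue description.
gpas = [2.44, 1.17, 1.94, 1.35, 2.77, 2.42, 3.91]


def sequential_sum(values):
    # Plain left-to-right accumulation in double precision.
    total = 0.0
    for v in values:
        total += v
    return total


# Collect every distinct total that some ordering of the inputs produces.
distinct_sums = sorted({sequential_sum(p) for p in itertools.permutations(gpas)})

for s in distinct_sums:
    # Second column mirrors (double)((int)$2*100)/100 from the test script.
    print(repr(s), float(int(s) * 100) / 100)
```

Any total that lands just below 16 is truncated to 15.0 by the cast, while a 
total of exactly 16.0 or slightly above stays at 16.0.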



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
