[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results

XiaodongCui (JIRA) Fri, 06 Jan 2017 01:41:39 -0800

     [ https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


XiaodongCui updated SPARK-19102:
--------------------------------
    Description: 
The second column of cube6, sumprice, is 10000 times larger than the second
column of cube5, also named sumprice, although the two should be equal. The bug
reproduces only with queries of the form SUM(a * b), COUNT(DISTINCT c).

code:
================================================================================
        // Assumes an existing SQLContext named sqlContext (Spark 1.6 Java API).
        DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
        df1.registerTempTable("hd_salesflat");

        // cube5: SUM over the decimal product only.
        DataFrame cube5 = sqlContext.sql(
            "SELECT areacode1, SUM(quantity*unitprice) AS sumprice "
            + "FROM hd_salesflat GROUP BY areacode1");

        // cube6: the same SUM plus COUNT(DISTINCT transno).
        DataFrame cube6 = sqlContext.sql(
            "SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) "
            + "FROM hd_salesflat GROUP BY areacode1");

        cube5.show(50);
        cube6.show(50);
================================================================================
My data:
transno  | quantity | unitprice | areacode1
76317828 | 1.0000   | 25.0000   | HDCN

Data schema:
 |-- areacode1: string (nullable = true)
 |-- quantity: decimal(20,4) (nullable = true)
 |-- unitprice: decimal(20,4) (nullable = true)
 |-- transno: string (nullable = true)
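
For the sample row above, the expected sumprice is 1.0000 * 25.0000 = 25. A
factor of exactly 10000 is 10^4, which matches the scale of 4 in decimal(20,4),
so one possible but unconfirmed reading is that the decimal scale is off by four
digits somewhere in the query that includes COUNT(DISTINCT). The plain-Java
sketch below only illustrates that arithmetic with java.math.BigDecimal. The
class and method names are hypothetical, and it does not use or reproduce
Spark's internals.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ScaleMismatchSketch {

    // Exact product of two decimal strings, e.g. the sample row's
    // quantity and unitprice. The result scale is the sum of both scales.
    public static BigDecimal exactProduct(String quantity, String unitPrice) {
        return new BigDecimal(quantity).multiply(new BigDecimal(unitPrice));
    }

    // Reinterpret the unscaled digits of a decimal under a different scale.
    // This models a hypothetical scale mix-up; it is not Spark code.
    public static BigDecimal reinterpret(BigDecimal value, int assumedScale) {
        BigInteger unscaled = value.unscaledValue();
        return new BigDecimal(unscaled, assumedScale);
    }

    public static void main(String[] args) {
        // Sample row from the report: quantity 1.0000, unitprice 25.0000.
        BigDecimal product = exactProduct("1.0000", "25.0000");
        System.out.println("product = " + product.toPlainString()
                + " (scale " + product.scale() + ")");

        // Same digits read with a scale that is four smaller.
        BigDecimal misread = reinterpret(product, product.scale() - 4);
        System.out.println("misread = " + misread.toPlainString());

        // Ratio between the misread value and the exact value.
        System.out.println("ratio   = " + misread.divide(product).toPlainString());
    }
}
```

If cube6 reports 250000 for this row while cube5 reports 25, the discrepancy
has the same shape as the one modelled here. That would narrow the search but
would not by itself confirm the cause.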

  was:
The second column of cube6, sumprice, is 10000 times larger than the second
column of cube5, also named sumprice, although the two should be equal. The bug
reproduces only with queries of the form SUM(a * b), COUNT(DISTINCT c).

        // Assumes an existing SQLContext named sqlContext (Spark 1.6 Java API).
        DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
        df1.registerTempTable("hd_salesflat");

        // cube5: SUM over the decimal product only.
        DataFrame cube5 = sqlContext.sql(
            "SELECT areacode1, SUM(quantity*unitprice) AS sumprice "
            + "FROM hd_salesflat GROUP BY areacode1");

        // cube6: the same SUM plus COUNT(DISTINCT transno).
        DataFrame cube6 = sqlContext.sql(
            "SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) "
            + "FROM hd_salesflat GROUP BY areacode1");

        cube5.show(50);
        cube6.show(50);

My data:
transno  | quantity | unitprice | areacode1
76317828 | 1.0000   | 25.0000   | HDCN

Data schema:
 |-- areacode1: string (nullable = true)
 |-- quantity: decimal(20,4) (nullable = true)
 |-- unitprice: decimal(20,4) (nullable = true)
 |-- transno: string (nullable = true)


> Accuracy error of spark SQL results
> -----------------------------------
>
>                 Key: SPARK-19102
>                 URL: https://issues.apache.org/jira/browse/SPARK-19102
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.0, 1.6.1
>         Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6
>            Reporter: XiaodongCui
>         Attachments: a.zip


