[ 
https://issues.apache.org/jira/browse/SPARK-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197386#comment-14197386
 ] 

shengli commented on SPARK-4217:
--------------------------------

I also tested the script on both pure Hive and spark-sql.
Versions: Hive-0.11.0  Spark-1.1.0
I also got different results from the two:
In Hive:
```
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 14.1 sec   HDFS Read: 12903826 HDFS Write: 8286554 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 14.72 sec   HDFS Read: 8463043 HDFS Write: 278 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 2.71 sec   HDFS Read: 642 HDFS Write: 92 SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 530 msec
OK
2004    1403018
2005    5557850
2006    7203061
2007    11300432
2008    12109328
2009    5365447
2010    188944
Time taken: 108.757 seconds, Fetched: 7 row(s)
```

In Spark SQL:
```
spark-sql> select c.theyear, sum(b.amount)
         > from tblstock a
         > join tblStockDetail b on a.ordernumber = b.ordernumber
         > join tbldate c on a.dateid = c.dateid
         > group by c.theyear;
2010    210924
2004    3265696
2005    13247234
2006    13670416
2007    16711974
2008    14670698
2009    6322137
```
This looks like a bug in Spark SQL (HiveContext) queries that contain multiple joins.
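
To make the size of the discrepancy easier to see, here is a rough sketch that lines up the two result sets year by year. The numbers are copied from the Hive and Spark SQL outputs above; the names `hive`, `spark_sql` and `compare` are just illustrative, and the script assumes both engines returned the same set of years. It only quantifies the gap, it does not say which join is responsible.

```python
# Per-year sum(b.amount) as printed by each engine in this comment.
hive = {
    2004: 1403018,
    2005: 5557850,
    2006: 7203061,
    2007: 11300432,
    2008: 12109328,
    2009: 5365447,
    2010: 188944,
}
spark_sql = {
    2010: 210924,
    2004: 3265696,
    2005: 13247234,
    2006: 13670416,
    2007: 16711974,
    2008: 14670698,
    2009: 6322137,
}


def compare(expected, actual):
    """Return {year: (expected, actual, actual - expected, actual / expected)}.

    Assumes both dicts contain exactly the same years.
    """
    if expected.keys() != actual.keys():
        raise ValueError("result sets cover different years")
    report = {}
    for year in sorted(expected):
        e, a = expected[year], actual[year]
        report[year] = (e, a, a - e, a / e)
    return report


if __name__ == "__main__":
    for year, (e, a, diff, ratio) in compare(hive, spark_sql).items():
        print(f"{year}  hive={e:>9}  spark_sql={a:>9}  diff={diff:>9}  ratio={ratio:.3f}")
```

For the rows above, the Spark SQL sum is larger than the Hive sum for every year, and the ratio is not the same from year to year.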

> Result of SparkSQL is incorrect after a table join and group by operation
> -------------------------------------------------------------------------
>
>                 Key: SPARK-4217
>                 URL: https://issues.apache.org/jira/browse/SPARK-4217
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>         Environment: Hadoop 2.2.0
> Spark1.1
>            Reporter: peter.zhang
>            Priority: Critical
>         Attachments: TestScript.sql, saledata.zip
>
>
> I ran a test using the same SQL script in the SparkSQL, Shark and Hive 
> environments (pure Hive application rather than Spark HiveContext), as below
> ---------------------------------------------------------------
> select c.theyear, sum(b.amount)
> from tblstock a
> join tblStockDetail b on a.ordernumber = b.ordernumber
> join tbldate c on a.dateid = c.dateid
> group by c.theyear;
> result of hive/shark:
> theyear       _c1
> 2004  1403018
> 2005  5557850
> 2006  7203061
> 2007  11300432
> 2008  12109328
> 2009  5365447
> 2010  188944
> result of SparkSQL:
> 2010  210924
> 2004  3265696
> 2005  13247234
> 2006  13670416
> 2007  16711974
> 2008  14670698
> 2009  6322137



