[ https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-11275: ---------------------------- Summary: [SQL] Incorrect results when using rollup/cube (was: [SQL] Regression in rollup/cube ) > [SQL] Incorrect results when using rollup/cube > ----------------------------------------------- > > Key: SPARK-11275 > URL: https://issues.apache.org/jira/browse/SPARK-11275 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.3.0, 1.4.0, 1.5.1 > Reporter: Xiao Li > > Spark SQL is unable to generate a correct result when the following query > using rollup. > "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, > b with rollup" > Spark SQL generates a wrong result: > [2,4,6,3] > [2,null,null,1] > [1,null,null,1] > [null,null,null,0] > [1,2,3,3] > The table mytable is super simple, containing two rows and two columns: > testData = Seq((1, 2), (2, 4)).toDF("a", "b") > After turning off codegen, the query plan is like > == Parsed Logical Plan == > 'Rollup ['a,'b], > [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS > sumAB#20),unresolvedalias('GROUPING__ID)] > 'UnresolvedRelation `mytable`, None > == Analyzed Logical Plan == > a: int, b: int, sumAB: bigint, GROUPING__ID: int > Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as > bigint)) AS sumAB#20L,GROUPING__ID#23] > Expand [0,1,3], [a#2,b#3], grouping__id#23 > Subquery mytable > Project [_1#0 AS a#2,_2#1 AS b#3] > LocalRelation [_1#0,_2#1], [[1,2],[2,4]] > == Optimized Logical Plan == > Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as > bigint)) AS sumAB#20L,GROUPING__ID#23] > Expand [0,1,3], [a#2,b#3], grouping__id#23 > LocalRelation [a#2,b#3], [[1,2],[2,4]] > == Physical Plan == > Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS > sumAB#20L,grouping__id#23] > Exchange hashpartitioning(a#2,b#3,grouping__id#23,5) > Aggregate true, [a#2,b#3,grouping__id#23], > [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L] > Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], > [a#2,b#3,grouping__id#23] > LocalTableScan [a#2,b#3], [[1,2],[2,4]] > Below are my observations: > 1. Generation of GROUP__ID looks OK. > 2. The problem still exists no matter whether turning on/off CODEGEN > 3. Rollup still works in a simple query when group-by columns have only one > column. For example, "select b, sum(a), GROUPING__ID from mytable group by b > with rollup" > 4. The buckets in "HiveDataFrameAnalytcisSuite" are misleading. > Unfortunately, they hide the bugs. Although the buckets passed, they just > compare the results of SQL and Dataframe. This way is unable to capture the > regression when both return the same wrong results. > 5. The same problem also exists in cube. I have not started the investigation > in cube, but I believe the root causes should be the same. > 6. It looks like all the logical plans are correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org