Yin Huai created SPARK-10169:
--------------------------------
Summary: Evaluating AggregateFunction1 (old code path) may return
wrong answers when grouping expressions are used as arguments of aggregate
functions
Key: SPARK-10169
URL: https://issues.apache.org/jira/browse/SPARK-10169
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1, 1.3.1, 1.2.2, 1.1.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
Before Spark 1.5, if an aggregate function use an grouping expression as input
argument, the result of the query can be wrong. The reason is we are using
transformUp when we do aggregate results rewriting (see
https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).
To reproduce the problem, you can use
{code}
import org.apache.spark.sql.functions._
sc.parallelize((1 to 1000), 50).map(i =>
Tuple1(i)).toDF("i").registerTempTable("t")
sqlContext.sql("""
select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
from t
where i % 10 = 5
group by i % 10""").explain()
== Physical Plan ==
Aggregate false, [PartialGroup#234], [PartialGroup#234 AS
_c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234
= 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS
_c2#227L]
Exchange (HashPartitioning [PartialGroup#234], 200)
Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS
PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191
% 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
Project [_1#190 AS i#191]
Filter ((_1#190 % 10) = 5)
PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at
ExistingRDD.scala:37
sqlContext.sql("""
select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
from t
where i % 10 = 5
group by i % 10""").show
_c0 _c1 _c2
5 50 100
{code}
In Spark 1.5, new aggregation code path does not have the problem. The old code
path is fixed by
https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]