[
https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheng Lian updated SPARK-1915:
------------------------------
Description:
Average values are difference between the calculation is done partially or not
partially.
Because {{AverageFunction}} (in not-partially calculation) counts even if the
evaluated value is null.
To reproduce this bug, run the following in {{sbt/sbt hive/console}}:
{code}
scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) /
CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
Exchange SinglePartition
Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS
PartialSum#648]
HiveTableScan [key#646], (MetastoreRelation default, src1, None), None),
which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from
Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) /
CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
Exchange SinglePartition
Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS
PartialSum#648]
HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
...
[237.06666666666666]
scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM
src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
Exchange SinglePartition
HiveTableScan [key#672], (MetastoreRelation default, src1, None), None),
which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from
Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
Exchange SinglePartition
HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]
{code}
In the first query, {{AVG}} is broke into partial aggregation, and gives the
right answer (null values ignored). In the second query, since {{COUNT(DISTINCT
key)}} can't be turned into partial aggregation, {{AVG}} isn't either, and the
bug is triggered.
was:
Average values are difference between the calculation is done partially or not
partially.
Because {{AverageFunction}} (in not-partially calculation) counts even if the
evaluated value is null.
> AverageFunction should not count if the evaluated value is null.
> ----------------------------------------------------------------
>
> Key: SPARK-1915
> URL: https://issues.apache.org/jira/browse/SPARK-1915
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Takuya Ueshin
> Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.1
>
>
> Average values are difference between the calculation is done partially or
> not partially.
> Because {{AverageFunction}} (in not-partially calculation) counts even if the
> evaluated value is null.
> To reproduce this bug, run the following in {{sbt/sbt hive/console}}:
> {code}
> scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
> ...
> == Query Plan ==
> Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) /
> CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
> Exchange SinglePartition
> Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS
> PartialSum#648]
> HiveTableScan [key#646], (MetastoreRelation default, src1, None), None),
> which is now runnable
> 14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks
> from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
> == Query Plan ==
> Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) /
> CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
> Exchange SinglePartition
> Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS
> PartialSum#648]
> HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
> ...
> [237.06666666666666]
> scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM
> src1").collect().foreach(println)
> ...
> == Query Plan ==
> Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS
> c1#669]
> Exchange SinglePartition
> HiveTableScan [key#672], (MetastoreRelation default, src1, None), None),
> which is now runnable
> 14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks
> from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
> == Query Plan ==
> Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS
> c1#669]
> Exchange SinglePartition
> HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
> ...
> [142.24,15]
> {code}
> In the first query, {{AVG}} is broke into partial aggregation, and gives the
> right answer (null values ignored). In the second query, since
> {{COUNT(DISTINCT key)}} can't be turned into partial aggregation, {{AVG}}
> isn't either, and the bug is triggered.
--
This message was sent by Atlassian JIRA
(v6.2#6252)