[ 
https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1915:
------------------------------

    Description: 
Average values differ depending on whether the calculation is done partially 
or not.

This is because {{AverageFunction}} (used in the non-partial calculation) 
increments its count even when the evaluated value is null.

To reproduce this bug, run the following in {{sbt/sbt hive/console}}:

{code}
scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), 
which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / 
CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS 
PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)
...
[237.06666666666666]

scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), 
which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]
{code}

In the first query, {{AVG}} is broken into partial aggregations and gives the 
right answer (null values are ignored). In the second query, since 
{{COUNT(DISTINCT key)}} can't be turned into a partial aggregation, {{AVG}} 
isn't either, and the bug is triggered.
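
The difference between the two code paths can be sketched in plain Scala. This is only an illustration of the behaviour described above, not Spark's actual implementation: {{AvgNullSketch}}, {{avgCountingNulls}}, {{partial}} and {{avgFromPartials}} are hypothetical names, and null keys are modelled as {{None}}.

```scala
// Hypothetical stand-ins for illustration; these are not Spark classes.
object AvgNullSketch {
  // Mimics the reported bug: the count is incremented for every row,
  // including rows whose value is null (None).
  def avgCountingNulls(values: Seq[Option[Int]]): Double = {
    var sum = 0L
    var count = 0L
    values.foreach { v =>
      count += 1                 // counted even when v is None
      v.foreach(x => sum += x)   // only non-null values contribute to the sum
    }
    sum.toDouble / count
  }

  // Mimics the partial path: each partition yields (SUM, COUNT) over
  // non-null values only, like SUM(key) and COUNT(key) in the first plan.
  def partial(values: Seq[Option[Int]]): (Long, Long) = {
    val nonNull = values.flatten
    (nonNull.map(_.toLong).sum, nonNull.size.toLong)
  }

  // Merges the per-partition results and divides, like
  // SUM(PartialSum) / SUM(PartialCount) in the first plan.
  def avgFromPartials(partitions: Seq[Seq[Option[Int]]]): Double = {
    val (sum, count) = partitions.map(partial).foldLeft((0L, 0L)) {
      case ((s, c), (ps, pc)) => (s + ps, c + pc)
    }
    sum.toDouble / count
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Option[Int]] = Seq(Some(10), None, Some(20), None, Some(30))
    println(avgCountingNulls(rows))                            // 12.0 (60 / 5)
    println(avgFromPartials(Seq(rows.take(2), rows.drop(2))))  // 20.0 (60 / 3)
  }
}
```

The same pattern would explain the console output above: 237.0666... * 15 and 142.24 * 25 both equal 3556, which is consistent with a sum of 3556 over 15 non-null keys out of 25 rows. The row counts are not stated in this report, so treat that as an inference.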

  was:
Average values differ depending on whether the calculation is done partially 
or not.

This is because {{AverageFunction}} (used in the non-partial calculation) 
increments its count even when the evaluated value is null.


> AverageFunction should not count if the evaluated value is null.
> ----------------------------------------------------------------
>
>                 Key: SPARK-1915
>                 URL: https://issues.apache.org/jira/browse/SPARK-1915
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>             Fix For: 1.1.0, 1.0.1
>
>



