[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...

maryannxue Fri, 13 Oct 2017 14:03:13 -0700

Github user maryannxue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19488#discussion_r144656360
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
    @@ -205,14 +205,17 @@ object PhysicalAggregation {
         case logical.Aggregate(groupingExpressions, resultExpressions, child) =>
           // A single aggregate expression might appear multiple times in resultExpressions.
           // In order to avoid evaluating an individual aggregate function multiple times, we'll
    -      // build a set of the distinct aggregate expressions and build a function which can
    +      // build a map of the distinct aggregate expressions and build a function which can
           // be used to re-write expressions so that they reference the single copy of the
    -      // aggregate function which actually gets computed.
    -      val aggregateExpressions = resultExpressions.flatMap { expr =>
    +      // aggregate function which actually gets computed. Note that aggregate expressions
    +      // should always be deterministic, so we can use its canonicalized expression as its
    --- End diff --
    
    So we are talking about two kinds of "non-deterministic" here:
    1. Non-deterministic across queries but deterministic within a query: the same expression can produce different results over the same input in different runs, but should always give the same result within a single run. sum/avg over floating-point numbers could be an example. Should we make sure that "select sum(f) - sum(f) from t" always returns 0? And similarly for "first()", maybe: should "select first_value(c) = first_value(c) over ..." always return true?
    It is important to define the behavior first, because the answer leads to opposite approaches to handling the "deterministic" field here.
    2. Non-deterministic both across queries and within a query, which I don't think is allowed anyway.
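
To make case 1 concrete, here is a minimal sketch in plain Scala (no Spark dependency; `FloatSumOrder` and every name in it are hypothetical and are not part of this PR). It assumes the only source of variation is the order in which rows reach the aggregate. It shows that a floating-point sum depends on that order, so "sum(f) - sum(f)" is exactly 0 when both references share one evaluation, but may be non-zero if the two references are evaluated independently over differently ordered input.

```scala
// Illustrative only: plain Scala, not Spark code. All names are hypothetical.
object FloatSumOrder {
  // Left-to-right sum, standing in for one evaluation of sum(f) over rows
  // that arrive in a particular order.
  def sumInOrder(rows: Seq[Double]): Double = rows.foldLeft(0.0)(_ + _)

  // The same three values in two arrival orders, as might happen when
  // partitions are processed in a different order between evaluations.
  val orderA: Seq[Double] = Seq(1e17, 1.0, -1e17)
  val orderB: Seq[Double] = Seq(1e17, -1e17, 1.0)

  // One shared evaluation referenced twice: the difference is exactly 0.0.
  def sharedDifference: Double = {
    val s = sumInOrder(orderA)
    s - s
  }

  // Two independent evaluations that see different orders: the difference
  // need not be 0.0, because floating-point addition is not associative.
  def independentDifference: Double =
    sumInOrder(orderA) - sumInOrder(orderB)

  def main(args: Array[String]): Unit = {
    println(s"sum over orderA        = ${sumInOrder(orderA)}")
    println(s"sum over orderB        = ${sumInOrder(orderB)}")
    println(s"shared difference      = $sharedDifference")
    println(s"independent difference = $independentDifference")
  }
}
```

In this sketch the 1.0 is absorbed when it is added to 1e17 but survives when it is added after the two large values cancel, so the two orders give different sums. Whether Spark should guarantee the shared-evaluation behavior is exactly the question raised above; the sketch only illustrates why the answer matters.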


