[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

cloud-fan Thu, 21 Sep 2017 22:50:27 -0700

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19301#discussion_r140416279
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala
 ---
    @@ -72,11 +74,19 @@ object AggregateExpression {
           aggregateFunction: AggregateFunction,
           mode: AggregateMode,
           isDistinct: Boolean): AggregateExpression = {
    +    val state = if (aggregateFunction.resolved) {
    +      Seq(aggregateFunction.toString, aggregateFunction.dataType,
    +        aggregateFunction.nullable, mode, isDistinct)
    +    } else {
    +      Seq(aggregateFunction.toString, mode, isDistinct)
    +    }
    +    val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * 
a + b)
    +
         AggregateExpression(
           aggregateFunction,
           mode,
           isDistinct,
    -      NamedExpression.newExprId)
    +      ExprId(hashCode))
    --- End diff --
    
    I don't think this is the right fix. Semantically the `b0` and `b1` in 
`SELECT SUM(b) AS b0, SUM(b) AS b1 ` are different aggregate functions, so they 
should have different `resultId`.
    
    It's kind of an optimization in aggregate planner, we should detect these 
semantically different but duplicated aggregate functions and only plan one 
aggrega function.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

Reply via email to