[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

stanzhai Sun, 24 Sep 2017 23:53:13 -0700

Github user stanzhai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19301#discussion_r140699522
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala
 ---
    @@ -72,11 +74,19 @@ object AggregateExpression {
           aggregateFunction: AggregateFunction,
           mode: AggregateMode,
           isDistinct: Boolean): AggregateExpression = {
    +    val state = if (aggregateFunction.resolved) {
    +      Seq(aggregateFunction.toString, aggregateFunction.dataType,
    +        aggregateFunction.nullable, mode, isDistinct)
    +    } else {
    +      Seq(aggregateFunction.toString, mode, isDistinct)
    +    }
    +    val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * 
a + b)
    +
         AggregateExpression(
           aggregateFunction,
           mode,
           isDistinct,
    -      NamedExpression.newExprId)
    +      ExprId(hashCode))
    --- End diff --
    
    I've tried to optimize in aggregate planner 
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211).
    
    ```scala
          // A single aggregate expression might appear multiple times in 
resultExpressions.
          // In order to avoid evaluating an individual aggregate function 
multiple times, we'll
          // build a set of the distinct aggregate expressions and build a 
function which can
          // be used to re-write expressions so that they reference the single 
copy of the
          // aggregate function which actually gets computed.
          val aggregateExpressions = resultExpressions.flatMap { expr =>
            expr.collect {
              case agg: AggregateExpression =>
                val aggregateFunction = agg.aggregateFunction
                val state = if (aggregateFunction.resolved) {
                  Seq(aggregateFunction.toString, aggregateFunction.dataType,
                    aggregateFunction.nullable, agg.mode, agg.isDistinct)
                } else {
                  Seq(aggregateFunction.toString, agg.mode, agg.isDistinct)
                }
                val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) 
=> 31 * a + b)
                (hashCode, agg)
            }
          }.groupBy(_._1).map { case (_, values) =>
            values.head._2
          }.toSeq
    ```
    
    But it's difficult to distinguish between different typed aggregators 
without expr id. Current solution can work well for all of aggregate functions.
    
    I'm not familiar with typed aggregators, any suggestions will be 
appreciated.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...

Reply via email to