[
https://issues.apache.org/jira/browse/SPARK-22266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203105#comment-16203105
]
Maryann Xue commented on SPARK-22266:
-------------------------------------
Thank you for the comment, [~maropu]! I agree that this is a common
subexpression elimination (CSE) problem, but I feel this kind of CSE is better
performed at a higher level than in code-gen. PhysicalAggregation is designed
to pull the aggregate functions out of the result expressions, so that at later
stages (code-gen and non-code-gen alike) HashAggregate can assume that its
aggregate expression list contains only aggregate functions and no other
expressions. If CSE in code-gen were to handle this, we would end up with extra
non-aggregate expressions in the "aggregate expression" list.
I am not sure whether we should have a logical/physical optimization dedicated
to CSE, but I think it would be nice. I applied a simple, straightforward fix
in the PR. Could you please review? Thank you in advance!
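
To illustrate what I mean by doing the deduplication at a higher level, here is
a minimal, self-contained sketch. It is not the code in the PR and it does not
use Catalyst: Expr, Col, Lit, Add, Max, AggRef and DedupAggregates are
hypothetical names made up for this example, and plain case-class equality
stands in for whatever notion of expression equality the planner uses.

```scala
// Toy expression tree (hypothetical; these are NOT Spark's Catalyst classes).
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Max(child: Expr) extends Expr    // stands in for an aggregate function
case class AggRef(index: Int) extends Expr  // reference to the single computed aggregate

object DedupAggregates {
  // Collect the aggregate calls inside one result expression.
  // Nested aggregates are not considered in this sketch.
  def collectAggregates(e: Expr): Vector[Max] = e match {
    case m: Max    => Vector(m)
    case Add(l, r) => collectAggregates(l) ++ collectAggregates(r)
    case _         => Vector.empty
  }

  // Replace every aggregate call with a reference to its single computed copy.
  def rewrite(e: Expr, slots: Map[Max, Int]): Expr = e match {
    case m: Max    => AggRef(slots(m))
    case Add(l, r) => Add(rewrite(l, slots), rewrite(r, slots))
    case other     => other
  }

  // Returns the distinct aggregates to compute, plus the result expressions
  // rewritten to reference them by position.
  def plan(resultExprs: Seq[Expr]): (Vector[Max], Seq[Expr]) = {
    val aggs  = resultExprs.flatMap(collectAggregates).distinct.toVector
    val slots = aggs.zipWithIndex.toMap
    (aggs, resultExprs.map(rewrite(_, slots)))
  }
}
```

For the result expressions a, max(b + 1) and max(b + 1) + 1, plan returns a
single max(b + 1) to compute, and both result expressions reference that one
copy, so the aggregate list still contains nothing but aggregate functions.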
> The same aggregate function was evaluated multiple times
> --------------------------------------------------------
>
> Key: SPARK-22266
> URL: https://issues.apache.org/jira/browse/SPARK-22266
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Maryann Xue
> Priority: Minor
>
> We should avoid evaluating the same aggregate function more than once, which
> is what the code comment below (patterns.scala:206) states. However, this did
> not work as expected.
> {code}
> // A single aggregate expression might appear multiple times in resultExpressions.
> // In order to avoid evaluating an individual aggregate function multiple times, we'll
> // build a set of the distinct aggregate expressions and build a function which can
> // be used to re-write expressions so that they reference the single copy of the
> // aggregate function which actually gets computed.
> {code}
> For example, the physical plan of
> {code}
> SELECT a, max(b+1), max(b+1) + 1 FROM testData2 GROUP BY a
> {code}
> was
> {code}
> HashAggregate(keys=[a#23], functions=[max((b#24 + 1)), max((b#24 + 1))], output=[a#23, max((b + 1))#223, (max((b + 1)) + 1)#224])
> +- HashAggregate(keys=[a#23], functions=[partial_max((b#24 + 1)), partial_max((b#24 + 1))], output=[a#23, max#231, max#232])
>    +- SerializeFromObject [assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true]).a AS a#23, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true]).b AS b#24]
>       +- Scan ExternalRDDScan[obj#22]
> {code}
> where each HashAggregate contained two identical aggregate functions,
> "max(b#24 + 1)".