[
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Li reassigned SPARK-25914:
-------------------------------
Assignee: (was: Dilip Biswal)
> Separate projection from grouping and aggregate in logical Aggregate
> --------------------------------------------------------------------
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Maryann Xue
> Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields:
> {{groupingExpressions}} and {{aggregateExpressions}}, in which
> {{aggregateExpressions}} is actually the result expressions, or in other
> words, the project list in the SELECT clause.
>
> This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
> After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by
> flattening "concat" and causes the query to fail since the expression
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a
> grouping expression nor a aggregate expression.
>
> The problem is that we try to mix two operations in one operator, and worse,
> in one field: the group-and-aggregate operation and the project operation.
> There are two ways to solve this problem:
> 1. Break the two operations into two logical operators, which means a
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
> 2. Break the two operations into multiple fields in the Aggregate operator,
> the same way we do for physical aggregate classes (e.g.,
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus,
> {{groupingExpressions}} would still be the expressions from the GROUP BY
> clause (as before), but {{aggregateExpressions}} would contain aggregate
> functions only, and {{resultExpressions}} would be the project list in the
> SELECT clause holding references to either {{groupingExpressions}} or
> {{aggregateExpressions}}.
>
> I would say option 1 is even clearer, but it would be more likely to break
> the pattern matching in existing optimization rules and thus require more
> changes in the compiler. So we'd probably wanna go with option 2. That said,
> I suggest we achieve this goal through two iterative steps:
>
> Phase 1: Keep the current fields of logical Aggregate as
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the
> semantics of {{aggregateExpressions}} by replacing the grouping expressions
> with corresponding references to expressions in {{groupingExpressions}}. The
> aggregate expressions in {{aggregateExpressions}} will remain the same.
>
> Phase 2: Add {{resultExpressions}} for the project list, and keep only
> aggregate expressions in {{aggregateExpressions}}.
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]