mikhailnik-db commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2803857272
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala:
##########
@@ -117,7 +118,16 @@ class AggregateExpressionResolver(
aggregateExpression match {
case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
if agg.isDistinct && listAgg.needSaveOrderValue =>
- throwFunctionAndOrderExpressionMismatchError(listAgg)
+ // Allow when the mismatch is only because child was cast
+ val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+ (listAgg.child match {
+ case Cast(castChild, _, _, _) =>
+ listAgg.orderExpressions.head.child.semanticEquals(castChild)
+ case _ => false
+ })
+ if (!mismatchDueToCast) {
+ throwFunctionAndOrderExpressionMismatchError(listAgg)
+ }
Review Comment:
For context: the purpose of this check is to prevent problems with a
distinct framework. Simply speaking, all aggregation functions with `distinct`
arguments are rewritten by moving those arguments to `GROUP BY`.
You can imagine it as
```
SELECT agg(distinct col)
FROM table
~
SELECT agg(col)
FROM table
GROUP BY col
```
listagg with distinct treats the argument **and** the order expression as
keys.
```
SELECT listagg(distinct col) WITHIN GROUP (ORDER BY col')
FROM table
~
SELECT listagg(col) WITHIN GROUP (ORDER BY col')
FROM table
GROUP BY col, col'
```
Before this change, there was a simple invariant: if `col` semantically
equals `col'` then `GROUP BY col, col'` is equivalent to `GROUP BY col`, which
is the expected behavior for a user.
Now we want to relax this check, assuming that for the column of any type
`GROUP BY CAST(col AS STRING), col` ~ `GROUP BY CAST(col AS STRING)` ~ `GROUP
BY col` (and the same with `CAST(col AS BINARY)`). I do not have any
counterexamples. It seems a reasonable assumption, but it should be
double-checked.
@helioshe4, I'm afraid the only way to prove correctness is to go through
all existing types and check the logic of the cast to string and binary. My
three main concerns:
1. There could be some normalisation or absence of it when needed, e.g.,
floating‑point numbers usually have 2 encodings for zero: `0` and `-0`. They
are equal, but will we normalize them when casting to string or binary?
2. loss of precision or some other information. e.g. when converting
timestamps or floating‑point numbers
3. Collations. They were created to control the equality relation of
strings. Can we change them by casting?
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -450,8 +450,17 @@ trait CheckAnalysis extends LookupCatalog with
QueryErrorsBase with PlanToString
case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
if agg.isDistinct && listAgg.needSaveOrderValue =>
- throw
QueryCompilationErrors.functionAndOrderExpressionMismatchError(
- listAgg.prettyName, listAgg.child, listAgg.orderExpressions)
+ // Allow when the mismatch is only because child was cast
+ val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+ (listAgg.child match {
+ case Cast(castChild, _, _, _) =>
+
listAgg.orderExpressions.head.child.semanticEquals(castChild)
+ case _ => false
+ })
+ if (!mismatchDueToCast) {
+ throw
QueryCompilationErrors.functionAndOrderExpressionMismatchError(
+ listAgg.prettyName, listAgg.child, listAgg.orderExpressions)
+ }
Review Comment:
nit: now there is more logic, so it's worth abstracting into `ListAgg`'s
method
##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -687,6 +687,110 @@ class DataFrameAggregateSuite extends QueryTest
)
}
+ test("SPARK-55501: listagg with distinct and within group order by") {
Review Comment:
Tests that are not testing something dataframe-specific should be moved into
the golden files. Then we validate not only the result but also the logical
plan. See `SQLQueryTestSuite` for documentation on how to run and `listagg.sql`
and `listagg-collations.sql` for existing tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]