mikhailnik-db commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2803857272
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala:
##########
@@ -117,7 +118,16 @@ class AggregateExpressionResolver(
aggregateExpression match {
case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
if agg.isDistinct && listAgg.needSaveOrderValue =>
- throwFunctionAndOrderExpressionMismatchError(listAgg)
+ // Allow when the mismatch is only because child was cast
+ val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+ (listAgg.child match {
+ case Cast(castChild, _, _, _) =>
+ listAgg.orderExpressions.head.child.semanticEquals(castChild)
+ case _ => false
+ })
+ if (!mismatchDueToCast) {
+ throwFunctionAndOrderExpressionMismatchError(listAgg)
+ }
Review Comment:
For context: the purpose of this check is to prevent problems with the
distinct-rewrite framework. Simply put, all aggregate functions with `distinct`
arguments are rewritten by moving those arguments into `GROUP BY`.
You can imagine it as
```
SELECT agg(distinct col)
FROM table
~
SELECT agg(col)
FROM (
SELECT col FROM table
GROUP BY col
)
```
`listagg` with `distinct` treats both the argument **and** the order expression
as grouping keys.
```
SELECT listagg(distinct col) WITHIN GROUP (ORDER BY col')
FROM table
~
SELECT listagg(col) WITHIN GROUP (ORDER BY col')
FROM (SELECT col, col'
FROM table
GROUP BY col, col'
)
```
Before this change, there was a simple invariant: if `col` semantically
equals `col'`, then `GROUP BY col, col'` is equivalent to `GROUP BY col`, which
is the behavior a user expects.
Now we want to relax this check, assuming that for a column of any type
`GROUP BY CAST(col AS STRING), col` ~ `GROUP BY CAST(col AS STRING)` ~ `GROUP
BY col` (and likewise with `CAST(col AS BINARY)`). I do not have any
counterexamples; it seems a reasonable assumption, but it should be
double-checked.
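For illustration, the relaxed check is meant to admit queries of this shape,
where the aggregate's child is just a cast of the single order expression
(`col` and `table` are placeholders, not names from the PR):
```
SELECT listagg(DISTINCT CAST(col AS STRING)) WITHIN GROUP (ORDER BY col)
FROM table
```
Previously this tripped the mismatch error because `CAST(col AS STRING)` is not
semantically equal to `col`, even though they group the same rows under the
assumption above.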
@helioshe4, I'm afraid the only way to prove correctness is to go through
all existing types and check the cast-to-string and cast-to-binary logic for
each. My three main concerns:
1. Normalization could be missing (or applied) where needed. For example,
floating-point numbers have two encodings of zero, `0` and `-0`. They compare
equal, but do we normalize them when casting to string or binary?
2. Loss of precision or other information, e.g. when converting timestamps
or floating-point numbers.
3. Collations. They exist precisely to control the equality relation on
strings. Does casting change that relation?
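Concern 1 can be seen in plain Scala (this sketches the JVM's default double
encoding, not Spark's `Cast` expression, which may normalize differently): the
two zeros are numerically equal, yet their string encodings differ, so grouping
on the string form could split one numeric group in two unless the cast
normalizes the sign.
```scala
object ZeroEncodings extends App {
  val posZero: Double = 0.0
  val negZero: Double = -0.0

  // Numerically equal per IEEE 754 comparison semantics.
  println(posZero == negZero)   // true

  // Distinct default string encodings for the same numeric value.
  println(posZero.toString)     // 0.0
  println(negZero.toString)     // -0.0
}
```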
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]