helioshe4 commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2810253006
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala:
##########
@@ -117,7 +118,16 @@ class AggregateExpressionResolver(
aggregateExpression match {
case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
if agg.isDistinct && listAgg.needSaveOrderValue =>
- throwFunctionAndOrderExpressionMismatchError(listAgg)
+ // Allow when the mismatch is only because child was cast
+ val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+ (listAgg.child match {
+ case Cast(castChild, _, _, _) =>
+ listAgg.orderExpressions.head.child.semanticEquals(castChild)
+ case _ => false
+ })
+ if (!mismatchDueToCast) {
+ throwFunctionAndOrderExpressionMismatchError(listAgg)
+ }
Review Comment:
@mikhailnik-db thanks for the detailed explanation!
To address your concerns
1. you're right about the normalization of fp numbers.
The implicit cast to STRING doesn't preserve GROUP BY equality for
float/double types because ListAgg.child is **not** normalized before casting
to string (leading to "0.0" and "-0.0"), while GROUP BY keys are normalized.
The root cause is that the implicit cast is applied **before** DISTINCT
deduplication rather than after, so DISTINCT operates on string values (where
"-0.0" != "0.0") instead of double values (where -0.0 = 0.0). I feel like a
cleaner solution would be to normalize `col` before casting to string (to keep
operations between the order expression col and child col consistent), but this
may cause some other side effects or go against user expectations.
For now, I've added a whitelist of types where the cast preserves equality
semantics, and a specific error message for unsafe types (float, double)
explaining why they're rejected.
2. no loss in precision
For DecimalType, `toPlainString` is used (preserves scale/precision), and
`toString()` is used for other numeric types. The Date types and Interval types
all used precise conversion with no loss.
3. yes, explicit casting with collation could be an issue if child col's
collation isn't the same as order by col's collation
implicit casting doesn't change collation, but we block explicit casting
with collation. I'm taking a conservative approach where we can only
explicitly cast FROM StringType with UTF8_binary and cast TO StringType with
UTF8_binary.
I've updated my PR comment to explain the logic/safety of casting.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]