helioshe4 commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2810253006


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala:
##########
@@ -117,7 +118,16 @@ class AggregateExpressionResolver(
     aggregateExpression match {
       case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
           if agg.isDistinct && listAgg.needSaveOrderValue =>
-        throwFunctionAndOrderExpressionMismatchError(listAgg)
+        // Allow the mismatch when it is only because the child was cast
+        val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+          (listAgg.child match {
+            case Cast(castChild, _, _, _) =>
+              listAgg.orderExpressions.head.child.semanticEquals(castChild)
+            case _ => false
+          })
+        if (!mismatchDueToCast) {
+          throwFunctionAndOrderExpressionMismatchError(listAgg)
+        }

Review Comment:
   @mikhailnik-db thanks for the detailed explanation!
   
   To address your concerns:
   
   1. You're right about the normalization of floating-point numbers.
   
   The implicit cast to STRING doesn't preserve GROUP BY equality for
float/double types because ListAgg.child is **not** normalized before casting
to string, so 0.0 and -0.0 produce the distinct strings "0.0" and "-0.0",
while GROUP BY keys are normalized.
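
   For a concrete illustration of the mismatch (plain Scala, nothing
   Spark-specific assumed):

   ```scala
   // Plain Scala, no Spark needed: 0.0 and -0.0 compare equal as doubles, but
   // their string forms differ, so deduplicating the strings keeps both.
   object NegativeZeroDemo extends App {
     val a = 0.0
     val b = -0.0

     println(a == b)                       // true  -- equal as doubles
     println(a.toString == b.toString)     // false -- "0.0" vs "-0.0"
     println(Set(a.toString, b.toString))  // Set(0.0, -0.0): two "distinct" values
   }
   ```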
   
   The root cause is that the implicit cast is applied **before** DISTINCT
deduplication rather than after, so DISTINCT operates on string values (where
"-0.0" != "0.0") instead of double values (where -0.0 = 0.0). I feel like a
cleaner solution would be to normalize `col` before casting to string (keeping
the order-expression column and the child column consistent), but this may
cause other side effects or go against user expectations.
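
   A rough sketch of that alternative, just to make it concrete (this is not
   what the PR does; it assumes Spark's existing
   `NormalizeFloatingNumbers.normalize` helper is visible from the resolver
   package, and the helper name here is hypothetical):

   ```scala
   package org.apache.spark.sql.catalyst.analysis.resolver

   import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
   import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers
   import org.apache.spark.sql.types.{DoubleType, FloatType, StringType}

   // Hypothetical helper (not part of the PR): normalize -0.0 -> 0.0 (and NaN
   // bit patterns) on float/double children before producing the string that
   // DISTINCT deduplicates on, mirroring what GROUP BY keys already get.
   object NormalizedCastSketch {
     def castToString(child: Expression): Expression = child.dataType match {
       case FloatType | DoubleType =>
         Cast(NormalizeFloatingNumbers.normalize(child), StringType)
       case _ =>
         Cast(child, StringType)
     }
   }
   ```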
   
   For now, I've added a whitelist of types where the cast preserves equality 
semantics, and a specific error message for unsafe types (float, double) 
explaining why they're rejected.
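
   Roughly, the whitelist check could look like the sketch below (illustrative
   only; the helper name and the exact type list are mine, not the PR's):

   ```scala
   import org.apache.spark.sql.types._

   // Illustrative only: accept only types whose string form preserves equality,
   // so DISTINCT over the strings matches DISTINCT over the original values.
   object CastEqualitySketch {
     def castToStringPreservesEquality(dt: DataType): Boolean = dt match {
       case ByteType | ShortType | IntegerType | LongType | BooleanType => true
       case _: DecimalType => true
       case DateType | TimestampType | TimestampNTZType => true
       case _: YearMonthIntervalType | _: DayTimeIntervalType => true
       // float/double are rejected: "0.0" vs "-0.0" (and NaN bit patterns)
       // break the equivalence, as discussed above.
       case FloatType | DoubleType => false
       case _ => false
     }
   }
   ```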
   
   
   2. No loss of precision.
   
   For DecimalType, `toPlainString` is used (it preserves scale and precision),
and `toString()` is used for the other numeric types. Date and interval types
all use exact conversions with no loss.
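
   As a quick `java.math.BigDecimal` example of why `toPlainString` is the
   safer choice here:

   ```scala
   // java.math.BigDecimal: toString may switch to scientific notation, while
   // toPlainString always writes the digits out at the declared scale.
   object DecimalToStringDemo extends App {
     val d = new java.math.BigDecimal("1.20E+7")
     println(d.toString)       // 1.20E+7
     println(d.toPlainString)  // 12000000

     val s = new java.math.BigDecimal("1.10")
     println(s.toPlainString)  // 1.10 -- trailing zero (scale) is preserved
   }
   ```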
   
   
   3. Yes, explicit casting with collation could be an issue if the child
column's collation isn't the same as the ORDER BY column's collation.
   
   Implicit casting doesn't change collation, but we block explicit casting
with non-default collations. I'm taking a conservative approach where we only
allow an explicit cast FROM StringType with UTF8_BINARY collation and TO
StringType with UTF8_BINARY collation.
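
   A sketch of that conservative check (the helper names are hypothetical, and
   it assumes `StringType.isUTF8BinaryCollation` from Spark's collation
   support):

   ```scala
   import org.apache.spark.sql.types.{DataType, StringType}

   // Hypothetical helpers: a cast that involves a string on either side is
   // only treated as safe when that string side uses UTF8_BINARY collation.
   object CollationCastSketch {
     private def isDefaultCollatedString(dt: DataType): Boolean = dt match {
       case st: StringType => st.isUTF8BinaryCollation
       case _ => false
     }

     def explicitCastIsSafe(from: DataType, to: DataType): Boolean =
       (!from.isInstanceOf[StringType] || isDefaultCollatedString(from)) &&
         (!to.isInstanceOf[StringType] || isDefaultCollatedString(to))
   }
   ```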
   
   I've updated my PR comment to explain the logic/safety of casting.


