Re: [PR] [SPARK-55501][SQL] Fix listagg distinct + within group order by bug [spark]

via GitHub Fri, 13 Feb 2026 04:18:32 -0800


mikhailnik-db commented on code in PR #54297:
URL: https://github.com/apache/spark/pull/54297#discussion_r2803857272



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/resolver/AggregateExpressionResolver.scala:
##########
@@ -117,7 +118,16 @@ class AggregateExpressionResolver(
     aggregateExpression match {
       case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
           if agg.isDistinct && listAgg.needSaveOrderValue =>
-        throwFunctionAndOrderExpressionMismatchError(listAgg)
+            // Allow when the mismatch is only because child was cast
+            val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+              (listAgg.child match {
+                case Cast(castChild, _, _, _) =>
+                  listAgg.orderExpressions.head.child.semanticEquals(castChild)
+                case _ => false
+              })
+            if (!mismatchDueToCast) {
+              throwFunctionAndOrderExpressionMismatchError(listAgg)
+            }

Review Comment:
   For context: the purpose of this check is to prevent problems with a 
distinct framework. Simply speaking, all aggregation functions with `distinct` 
arguments are rewritten by moving those arguments to `GROUP BY`.
   You can imagine it as 
   ```
   SELECT agg(distinct col)
   FROM table
   ~
   SELECT agg(col)
   FROM table
   GROUP BY col
   ```
   
   listagg with distinct treats the argument **and** the order expression as 
keys. 
   ```
   SELECT listagg(distinct col) WITHIN GROUP (ORDER BY col')
   FROM table
   ~
   SELECT listagg(col) WITHIN GROUP (ORDER BY col')
   FROM table
   GROUP BY col, col'
   ```
   
   Before this change, there was a simple invariant: if `col` semantically 
equals `col'` then `GROUP BY col, col'` is equivalent to `GROUP BY col`, which 
is the expected behavior for a user.
   
   Now we want to relax this check, assuming that for the column of any type 
`GROUP BY CAST(col AS STRING), col` ~ `GROUP BY CAST(col AS STRING)` ~ `GROUP 
BY col` (and the same with `CAST(col AS BINARY)`). I do not have any 
counterexamples. It seems a reasonable assumption, but it should be 
double-checked.
   
   @helioshe4, I'm afraid the only way to prove correctness is to go through 
all existing types and check the logic of the cast to string and binary. My 
three main concerns:
   1. There could be some normalisation or absence of it when needed, e.g., 
floating‑point numbers usually have 2 encodings for zero: `0` and `-0`. They 
are equal, but will we normalize them when casting to string or binary?
   2. loss of precision or some other information. e.g. when converting 
timestamps or floating‑point numbers
   3. Collations. They were created to control the equality relation of 
strings. Can we change them by casting?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -450,8 +450,17 @@ trait CheckAnalysis extends LookupCatalog with 
QueryErrorsBase with PlanToString
 
           case agg @ AggregateExpression(listAgg: ListAgg, _, _, _, _)
             if agg.isDistinct && listAgg.needSaveOrderValue =>
-            throw 
QueryCompilationErrors.functionAndOrderExpressionMismatchError(
-              listAgg.prettyName, listAgg.child, listAgg.orderExpressions)
+              // Allow when the mismatch is only because child was cast
+              val mismatchDueToCast = listAgg.orderExpressions.size == 1 &&
+                (listAgg.child match {
+                  case Cast(castChild, _, _, _) =>
+                    
listAgg.orderExpressions.head.child.semanticEquals(castChild)
+                  case _ => false
+                })
+              if (!mismatchDueToCast) {
+                throw 
QueryCompilationErrors.functionAndOrderExpressionMismatchError(
+                  listAgg.prettyName, listAgg.child, listAgg.orderExpressions)
+              }

Review Comment:
   nit: now there is more logic, so it's worth abstracting into `ListAgg`'s 
method



##########
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala:
##########
@@ -687,6 +687,110 @@ class DataFrameAggregateSuite extends QueryTest
     )
   }
 
+  test("SPARK-55501: listagg with distinct and within group order by") {

Review Comment:
   Tests that are not testing something dataframe-specific should be moved into 
the golden files. Then we validate not only the result but also the logical 
plan. See `SQLQueryTestSuite` for documentation on how to run and `listagg.sql` 
and `listagg-collations.sql` for existing tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-55501][SQL] Fix listagg distinct + within group order by bug [spark]

Reply via email to