Re: [PR] chore: Add more tests with `GROUP BY` to test spark `collect_set` [datafusion]

via GitHub Thu, 16 Apr 2026 03:13:38 -0700


martin-g commented on code in PR #21659:
URL: https://github.com/apache/datafusion/pull/21659#discussion_r3092419877



##########
datafusion/sqllogictest/test_files/spark/aggregate/collect.slt:
##########
@@ -91,3 +91,34 @@ ORDER BY g;
 ----
 1 [10]
 2 [20]
+
+# collect_set with GROUP BY: group where all values are NULL returns empty list
+query I?
+SELECT g, array_sort(collect_set(a))
+FROM (VALUES (1, 10), (1, 20), (2, NULL), (2, NULL)) AS t(g, a)
+GROUP BY g
+ORDER BY g;
+----
+1 [10, 20]
+2 []
+
+# collect_set with GROUP BY: string values with duplicates
+query T?
+SELECT g, array_sort(collect_set(v))
+FROM (VALUES ('a', 'x'), ('a', 'y'), ('a', 'x'), ('b', 'z'), ('b', 'z')) AS 
t(g, v)
+GROUP BY g
+ORDER BY g;
+----
+a [x, y]
+b [z]
+
+# collect_set with GROUP BY: multiple groups with mixed NULLs and duplicates
+query I?
+SELECT g, array_sort(collect_set(a))
+FROM (VALUES (1, 5), (1, 5), (1, NULL), (2, 10), (2, 20), (2, 10), (3, NULL), 
(3, NULL)) AS t(g, a)

Review Comment:
   ```suggestion
   FROM (VALUES (1::INT, 5::INT), (1::INT, 5::INT), (1::INT, NULL::INT), 
(2::INT, 10::INT), (2::INT, 20::INT), (2::INT, 10::INT), (3::INT, NULL::INT), 
(3::INT, NULL::INT)) AS t(g, a)
   ```



##########
datafusion/sqllogictest/test_files/spark/aggregate/collect.slt:
##########
@@ -91,3 +91,34 @@ ORDER BY g;
 ----
 1 [10]
 2 [20]
+
+# collect_set with GROUP BY: group where all values are NULL returns empty list
+query I?
+SELECT g, array_sort(collect_set(a))
+FROM (VALUES (1, 10), (1, 20), (2, NULL), (2, NULL)) AS t(g, a)

Review Comment:
   
https://github.com/comphead/arrow-datafusion/blob/76e09a58bcc7a78f92ba0a881520d3f5723df94e/datafusion/sqllogictest/test_files/spark/README.md
 says:
   ```
   Add explicit casts to input values to ensure the correct data type is used 
(e.g., 0::INT)
   Explicit casting is necessary because DataFusion and Spark do not infer data 
types in the same way
   ```
   
   ```suggestion
   FROM (VALUES (1::INT, 10::INT), (1::INT, 20::INT), (2::INT, NULL::INT), 
(2::INT, NULL::INT)) AS t(g, a)
   ```



##########
datafusion/sqllogictest/test_files/spark/aggregate/collect.slt:
##########
@@ -91,3 +91,34 @@ ORDER BY g;
 ----
 1 [10]
 2 [20]
+
+# collect_set with GROUP BY: group where all values are NULL returns empty list
+query I?
+SELECT g, array_sort(collect_set(a))
+FROM (VALUES (1, 10), (1, 20), (2, NULL), (2, NULL)) AS t(g, a)
+GROUP BY g
+ORDER BY g;
+----
+1 [10, 20]
+2 []
+
+# collect_set with GROUP BY: string values with duplicates
+query T?
+SELECT g, array_sort(collect_set(v))
+FROM (VALUES ('a', 'x'), ('a', 'y'), ('a', 'x'), ('b', 'z'), ('b', 'z')) AS 
t(g, v)

Review Comment:
   ```suggestion
   FROM (VALUES ('a'::TEXT, 'x'::TEXT), ('a'::TEXT, 'y'::TEXT), ('a'::TEXT, 
'x'::TEXT), ('b'::TEXT, 'z'::TEXT), ('b'::TEXT, 'z'::TEXT)) AS t(g, v)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] chore: Add more tests with `GROUP BY` to test spark `collect_set` [datafusion]

Reply via email to