This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 9626776b4e fix: null in array_agg with DISTINCT and IGNORE (#19736)
9626776b4e is described below
commit 9626776b4e93619fe3d64d9d63ce1c17962e7b67
Author: David López <[email protected]>
AuthorDate: Wed Jan 21 04:40:18 2026 +0100
fix: null in array_agg with DISTINCT and IGNORE (#19736)
## Which issue does this PR close?
- Closes https://github.com/apache/datafusion/issues/19735.
## Rationale for this change
The `SingleDistinctToGroupBy` optimizer rewrites aggregate functions
with `DISTINCT` into a `GROUP BY` operation for better performance.
However, during this rewrite, it was discarding important aggregate
function parameters: `null_treatment`, `filter`, and `order_by`.
This caused queries like `ARRAY_AGG(DISTINCT x IGNORE NULLS)` to include
NULL values in the result because the `IGNORE NULLS` clause (stored as
null_treatment) was being lost during optimization.
## What changes are included in this PR?
Preserve aggregate parameters in optimizer: Modified
`SingleDistinctToGroupBy` to extract and preserve `null_treatment`,
`filter`, and `order_by` from the original aggregate function when
creating the rewritten version.
Add regression test: Added SQL logic test to verify that
`ARRAY_AGG(DISTINCT x IGNORE NULLS)` correctly filters out NULL values.
Files changed:
**datafusion/optimizer/src/single_distinct_to_groupby.rs**: Extract and
pass through filter, order_by, and null_treatment parameters
**datafusion/sqllogictest/test_files/aggregate.slt**: Add test case for
ARRAY_AGG(DISTINCT ... IGNORE NULLS)
## Are these changes tested?
Yes:
New SQL logic test in aggregate.slt verifies the fix works end-to-end
Existing optimizer tests continue to pass (19 tests in
`single_distinct_to_groupby`)
Existing aggregate tests continue to pass (20 tests in `array_agg`)
## Are there any user-facing changes?
Bug fix - Users can now correctly use `IGNORE NULLS` (and `FILTER` /
`ORDER BY`) with `DISTINCT` aggregates:
Before (broken):
```sql
SELECT ARRAY_AGG(DISTINCT x IGNORE NULLS)
FROM (VALUES (1), (2), (NULL), (2), (1)) AS t(x);
-- Result: [2, NULL, 1] ❌ NULL incorrectly included
```
After (fixed):
```sql
SELECT ARRAY_AGG(DISTINCT x IGNORE NULLS)
FROM (VALUES (1), (2), (NULL), (2), (1)) AS t(x);
-- Result: [1, 2] ✅ NULLs correctly filtered
```
---
.../optimizer/src/single_distinct_to_groupby.rs | 18 +++++---
datafusion/sqllogictest/test_files/aggregate.slt | 53 ++++++++++++++++++++++
2 files changed, 64 insertions(+), 7 deletions(-)
diff --git a/datafusion/optimizer/src/single_distinct_to_groupby.rs
b/datafusion/optimizer/src/single_distinct_to_groupby.rs
index 05edd230da..00c8fab228 100644
--- a/datafusion/optimizer/src/single_distinct_to_groupby.rs
+++ b/datafusion/optimizer/src/single_distinct_to_groupby.rs
@@ -184,7 +184,11 @@ impl OptimizerRule for SingleDistinctToGroupBy {
func,
params:
AggregateFunctionParams {
- mut args, distinct, ..
+ mut args,
+ distinct,
+ filter,
+ order_by,
+ null_treatment,
},
}) => {
if distinct {
@@ -204,9 +208,9 @@ impl OptimizerRule for SingleDistinctToGroupBy {
func,
vec![col(SINGLE_DISTINCT_ALIAS)],
false, // intentional to remove distinct
here
- None,
- vec![],
- None,
+ filter,
+ order_by,
+ null_treatment,
)))
// if the aggregate function is not distinct,
we need to rewrite it like two phase aggregation
} else {
@@ -217,9 +221,9 @@ impl OptimizerRule for SingleDistinctToGroupBy {
Arc::clone(&func),
args,
false,
- None,
- vec![],
- None,
+ filter,
+ order_by,
+ null_treatment,
))
.alias(&alias_str),
);
diff --git a/datafusion/sqllogictest/test_files/aggregate.slt
b/datafusion/sqllogictest/test_files/aggregate.slt
index be24178637..e911a16be7 100644
--- a/datafusion/sqllogictest/test_files/aggregate.slt
+++ b/datafusion/sqllogictest/test_files/aggregate.slt
@@ -379,6 +379,59 @@ select array_sort(c1), array_sort(c2) from (
statement ok
drop table array_agg_distinct_list_table;
+# Test array_agg with DISTINCT and IGNORE NULLS (regression test for issue
#19735)
+query ?
+SELECT array_sort(ARRAY_AGG(DISTINCT x IGNORE NULLS)) as result
+FROM (VALUES (1), (2), (NULL), (2), (NULL), (1)) AS t(x);
+----
+[1, 2]
+
+# Test that non-DISTINCT aggregates also preserve IGNORE NULLS when mixed with
DISTINCT
+# This tests the two-phase aggregation rewrite in SingleDistinctToGroupBy
+query I?
+SELECT
+ COUNT(DISTINCT x) as distinct_count,
+ array_sort(ARRAY_AGG(y IGNORE NULLS)) as y_agg
+FROM (VALUES
+ (1, 10),
+ (1, 20),
+ (2, 30),
+ (3, NULL),
+ (3, 40),
+ (NULL, 50)
+) AS t(x, y)
+----
+3 [10, 20, 30, 40, 50]
+
+# Test that FILTER clause is preserved in two-phase aggregation rewrite
+query II
+SELECT
+ COUNT(DISTINCT x) as distinct_count,
+ SUM(y) FILTER (WHERE y > 15) as filtered_sum
+FROM (VALUES
+ (1, 10),
+ (1, 20),
+ (2, 5),
+ (2, 30),
+ (3, 25)
+) AS t(x, y)
+----
+3 75
+
+# Test that ORDER BY is preserved in two-phase aggregation rewrite
+query I?
+SELECT
+ COUNT(DISTINCT x) as distinct_count,
+ ARRAY_AGG(y ORDER BY y DESC) as ordered_agg
+FROM (VALUES
+ (1, 10),
+ (1, 30),
+ (2, 20),
+ (2, 40)
+) AS t(x, y)
+----
+2 [40, 30, 20, 10]
+
statement error This feature is not implemented: Calling array_agg: LIMIT not
supported in function arguments: 1
SELECT array_agg(c13 LIMIT 1) FROM aggregate_test_100
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]