This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git


The following commit(s) were added to refs/heads/main by this push:
     new 9626776b4e fix: null in array_agg with DISTINCT and IGNORE (#19736)
9626776b4e is described below

commit 9626776b4e93619fe3d64d9d63ce1c17962e7b67
Author: David López <[email protected]>
AuthorDate: Wed Jan 21 04:40:18 2026 +0100

    fix: null in array_agg with DISTINCT and IGNORE (#19736)
    
    ## Which issue does this PR close?
    
    - Closes https://github.com/apache/datafusion/issues/19735.
    
    ## Rationale for this change
    
    The `SingleDistinctToGroupBy` optimizer rewrites aggregate functions
    with `DISTINCT` into a `GROUP BY` operation for better performance.
    However, during this rewrite, it was discarding important aggregate
    function parameters: `null_treatment`, `filter`, and `order_by`.
    
    This caused queries like `ARRAY_AGG(DISTINCT x IGNORE NULLS)` to include
    NULL values in the result because the `IGNORE NULLS` clause (stored as
    null_treatment) was being lost during optimization.
    
    ## What changes are included in this PR?
    
    Preserve aggregate parameters in optimizer: Modified
    `SingleDistinctToGroupBy` to extract and preserve `null_treatment`,
    `filter`, and `order_by` from the original aggregate function when
    creating the rewritten version.
    
    Add regression test: Added SQL logic test to verify that
    `ARRAY_AGG(DISTINCT x IGNORE NULLS)` correctly filters out NULL values.
    
    Files changed:
    
    **datafusion/optimizer/src/single_distinct_to_groupby.rs**: Extract and
    pass through filter, order_by, and null_treatment parameters
    **datafusion/sqllogictest/test_files/aggregate.slt**: Add test case for
    ARRAY_AGG(DISTINCT ... IGNORE NULLS)
    
    ## Are these changes tested?
    
    Yes:
    New SQL logic test in aggregate.slt verifies the fix works end-to-end
    Existing optimizer tests continue to pass (19 tests in
    `single_distinct_to_groupby`)
    Existing aggregate tests continue to pass (20 tests in `array_agg`)
    
    ## Are there any user-facing changes?
    
    Bug fix - Users can now correctly use `IGNORE NULLS` (and `FILTER` /
    `ORDER BY`) with `DISTINCT` aggregates:
    
    Before (broken):
    
    ```sql
    SELECT ARRAY_AGG(DISTINCT x IGNORE NULLS)
    FROM (VALUES (1), (2), (NULL), (2), (1)) AS t(x);
    -- Result: [2, NULL, 1]  ❌ NULL incorrectly included
    ```
    
    After (fixed):
    ```sql
    SELECT ARRAY_AGG(DISTINCT x IGNORE NULLS)
    FROM (VALUES (1), (2), (NULL), (2), (1)) AS t(x);
    -- Result: [1, 2]  ✅ NULLs correctly filtered
    ```
---
 .../optimizer/src/single_distinct_to_groupby.rs    | 18 +++++---
 datafusion/sqllogictest/test_files/aggregate.slt   | 53 ++++++++++++++++++++++
 2 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/datafusion/optimizer/src/single_distinct_to_groupby.rs 
b/datafusion/optimizer/src/single_distinct_to_groupby.rs
index 05edd230da..00c8fab228 100644
--- a/datafusion/optimizer/src/single_distinct_to_groupby.rs
+++ b/datafusion/optimizer/src/single_distinct_to_groupby.rs
@@ -184,7 +184,11 @@ impl OptimizerRule for SingleDistinctToGroupBy {
                             func,
                             params:
                                 AggregateFunctionParams {
-                                    mut args, distinct, ..
+                                    mut args,
+                                    distinct,
+                                    filter,
+                                    order_by,
+                                    null_treatment,
                                 },
                         }) => {
                             if distinct {
@@ -204,9 +208,9 @@ impl OptimizerRule for SingleDistinctToGroupBy {
                                     func,
                                     vec![col(SINGLE_DISTINCT_ALIAS)],
                                     false, // intentional to remove distinct 
here
-                                    None,
-                                    vec![],
-                                    None,
+                                    filter,
+                                    order_by,
+                                    null_treatment,
                                 )))
                                 // if the aggregate function is not distinct, 
we need to rewrite it like two phase aggregation
                             } else {
@@ -217,9 +221,9 @@ impl OptimizerRule for SingleDistinctToGroupBy {
                                         Arc::clone(&func),
                                         args,
                                         false,
-                                        None,
-                                        vec![],
-                                        None,
+                                        filter,
+                                        order_by,
+                                        null_treatment,
                                     ))
                                     .alias(&alias_str),
                                 );
diff --git a/datafusion/sqllogictest/test_files/aggregate.slt 
b/datafusion/sqllogictest/test_files/aggregate.slt
index be24178637..e911a16be7 100644
--- a/datafusion/sqllogictest/test_files/aggregate.slt
+++ b/datafusion/sqllogictest/test_files/aggregate.slt
@@ -379,6 +379,59 @@ select array_sort(c1), array_sort(c2) from (
 statement ok
 drop table array_agg_distinct_list_table;
 
+# Test array_agg with DISTINCT and IGNORE NULLS (regression test for issue 
#19735)
+query ?
+SELECT array_sort(ARRAY_AGG(DISTINCT x IGNORE NULLS)) as result
+FROM (VALUES (1), (2), (NULL), (2), (NULL), (1)) AS t(x);
+----
+[1, 2]
+
+# Test that non-DISTINCT aggregates also preserve IGNORE NULLS when mixed with 
DISTINCT
+# This tests the two-phase aggregation rewrite in SingleDistinctToGroupBy
+query I?
+SELECT
+  COUNT(DISTINCT x) as distinct_count,
+  array_sort(ARRAY_AGG(y IGNORE NULLS)) as y_agg
+FROM (VALUES
+  (1, 10),
+  (1, 20),
+  (2, 30),
+  (3, NULL),
+  (3, 40),
+  (NULL, 50)
+) AS t(x, y)
+----
+3 [10, 20, 30, 40, 50]
+
+# Test that FILTER clause is preserved in two-phase aggregation rewrite
+query II
+SELECT
+  COUNT(DISTINCT x) as distinct_count,
+  SUM(y) FILTER (WHERE y > 15) as filtered_sum
+FROM (VALUES
+  (1, 10),
+  (1, 20),
+  (2, 5),
+  (2, 30),
+  (3, 25)
+) AS t(x, y)
+----
+3 75
+
+# Test that ORDER BY is preserved in two-phase aggregation rewrite
+query I?
+SELECT
+  COUNT(DISTINCT x) as distinct_count,
+  ARRAY_AGG(y ORDER BY y DESC) as ordered_agg
+FROM (VALUES
+  (1, 10),
+  (1, 30),
+  (2, 20),
+  (2, 40)
+) AS t(x, y)
+----
+2 [40, 30, 20, 10]
+
 statement error This feature is not implemented: Calling array_agg: LIMIT not 
supported in function arguments: 1
 SELECT array_agg(c13 LIMIT 1) FROM aggregate_test_100
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to