(datafusion) branch main updated: Fix/support duplicate column names #6543 (#21126)

github-bot Sun, 29 Mar 2026 12:52:08 -0700

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new aadae6bda3 Fix/support duplicate column names #6543 (#21126)
aadae6bda3 is described below

commit aadae6bda34f8b5bb33ef7d70b44404e5730e2ea
Author: Rafael Herrero <[email protected]>
AuthorDate: Sun Mar 29 21:51:27 2026 +0200

    Fix/support duplicate column names #6543 (#21126)
    
    ## Which issue does this PR close?
    
    - Closes #6543.
    
    ## Rationale for this change
    
    We're building a SQL engine on top of DataFusion and hit this while
    running TPC-DS benchmarks — Q39 fails during planning with:
    
    ```
    Projections require unique expression names but the expression
    "CAST(inv1.cov AS Decimal128(30, 10))" at position 4 and "inv1.cov"
    at position 10 have the same name.
    ```
    
    The underlying issue is that `CAST` is transparent to `schema_name()`,
    so both expressions resolve to `inv1.cov`. But this also affects simpler
    cases like `SELECT 1, 1` or `SELECT x, x FROM t` — all of which
    PostgreSQL, Trino, and SQLite handle without errors.
    
    Looking at the issue discussion, @alamb suggested adding auto-aliases in
    the SQL planner:
    
    > "I think that is actually a pretty neat idea -- specifically add the
    aliases in the SQL planner. I would be happy to review such a PR"
    
    That's what this PR does.
    
    ### TPC-DS Q39 reproduction
    
    The query joins two CTEs that both produce columns named `cov`, `mean`,
    etc. When the planner applies implicit casts during type coercion, the
    cast-wrapped and original expressions end up with the same schema name:
    
    ```sql
    WITH inv1 AS (
        SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
               stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS 
cov
        FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
                     stddev_samp(inv_quantity_on_hand) AS stdev,
                     avg(inv_quantity_on_hand) AS mean
              FROM inventory, item, warehouse, date_dim
              WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = 
w_warehouse_sk
                AND inv_date_sk = d_date_sk AND d_year = 2001
              GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
        WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
    ),
    inv2 AS (
        SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
               stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS 
cov
        FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
                     stddev_samp(inv_quantity_on_hand) AS stdev,
                     avg(inv_quantity_on_hand) AS mean
              FROM inventory, item, warehouse, date_dim
              WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = 
w_warehouse_sk
                AND inv_date_sk = d_date_sk AND d_year = 2001 AND d_moy = 2
              GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
        WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
    )
    SELECT inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
           inv2.w_warehouse_sk, inv2.i_item_sk, inv2.d_moy, inv2.mean, inv2.cov
    FROM inv1 JOIN inv2
      ON inv1.i_item_sk = inv2.i_item_sk AND inv1.w_warehouse_sk = 
inv2.w_warehouse_sk
    ORDER BY inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, 
inv1.cov,
             inv2.d_moy, inv2.mean, inv2.cov;
    ```
    
    ## What changes are included in this PR?
    
    A dedup pass in `SqlToRel` that runs right after
    `prepare_select_exprs()` and before `self.project()`. It detects
    duplicate `schema_name()` values and wraps the second (and subsequent)
    occurrences in an `Alias` with a `:{N}` suffix:
    
    ```sql
    SELECT x AS c1, y AS c1 FROM t;
    -- produces columns: c1, c1:1
    ```
    
    The actual code is small (~45 lines of logic across 2 files):
    
    - `datafusion/sql/src/utils.rs` — new `deduplicate_select_expr_names()`
    function
    - `datafusion/sql/src/select.rs` — one call site between
    `prepare_select_exprs()` and `self.project()`
    
    I intentionally kept this scoped to the SQL planner only:
    
    - `validate_unique_names("Projections")` in `builder.rs` is untouched,
    so the Rust API (`LogicalPlanBuilder::project`) still rejects duplicates
    - No changes to the optimizer, physical planner, or DFSchema
    - `validate_unique_names("Windows")` is unchanged
    
    **Known limitation:** `SELECT *, x FROM t` still errors when `x`
    overlaps with `*`, because wildcard expansion happens after this dedup
    pass (inside `project_with_validation`). Happy to address that in a
    follow-up if desired.
    
    ## Are these changes tested?
    
    New sqllogictest file (`duplicate_column_alias.slt`) with 13 test cases
    covering:
    
    - Basic duplicate aliases, literals, and same-column-twice
    - Subquery with duplicate names
    - ORDER BY resolving to first occurrence
    - CTE join (TPC-DS Q39 pattern)
    - Three-way duplicates
    - CAST producing same schema_name as original column
    - GROUP BY and aggregates with duplicates
    - ORDER BY positional reference to the renamed column
    - `iszero(0.0), iszero(-0.0)` (reported in the issue by @jatin510)
    - UNION with duplicate column names
    - Wildcard limitation documented as explicit `query error` test
    
    Updated existing tests in `sql_integration.rs` (5 tests),
    `aggregate.slt`, and `unnest.slt` that previously asserted the
    "Projections require unique" error.
    
    ## Are there any user-facing changes?
    
    Yes, this is a behavior change:
    
    - SQL queries with duplicate expression names now succeed instead of
    erroring
    - Duplicate columns get a `:{N}` suffix in the output (e.g., `cov`,
    `cov:1`)
    - First occurrence keeps its original name, so ORDER BY / HAVING
    references still work
    - The programmatic Rust API is unchanged
---
 datafusion/sql/src/select.rs                       |   9 +-
 datafusion/sql/src/utils.rs                        |  33 ++++
 datafusion/sql/tests/sql_integration.rs            |  54 ++++---
 datafusion/sqllogictest/test_files/aggregate.slt   |  19 +--
 .../test_files/duplicate_column_alias.slt          | 175 +++++++++++++++++++++
 datafusion/sqllogictest/test_files/unnest.slt      |  10 +-
 6 files changed, 265 insertions(+), 35 deletions(-)

diff --git a/datafusion/sql/src/select.rs b/datafusion/sql/src/select.rs
index 6aa7ff599d..52aabff925 100644
--- a/datafusion/sql/src/select.rs
+++ b/datafusion/sql/src/select.rs
@@ -23,8 +23,9 @@ use crate::planner::{ContextProvider, PlannerContext, 
SqlToRel};
 use crate::query::to_order_by_exprs_with_select;
 use crate::utils::{
     CheckColumnsMustReferenceAggregatePurpose, CheckColumnsSatisfyExprsPurpose,
-    check_columns_satisfy_exprs, extract_aliases, rebase_expr, 
resolve_aliases_to_exprs,
-    resolve_columns, resolve_positions_to_exprs, 
rewrite_recursive_unnests_bottom_up,
+    check_columns_satisfy_exprs, deduplicate_select_expr_names, 
extract_aliases,
+    rebase_expr, resolve_aliases_to_exprs, resolve_columns, 
resolve_positions_to_exprs,
+    rewrite_recursive_unnests_bottom_up,
 };
 
 use datafusion_common::error::DataFusionErrorBuilder;
@@ -109,6 +110,10 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
             planner_context,
         )?;
 
+        // Auto-suffix duplicate expression names (e.g. cov, cov → cov, cov:1)
+        // before projection so that the unique-name constraint is satisfied.
+        let select_exprs = deduplicate_select_expr_names(select_exprs);
+
         // Having and group by clause may reference aliases defined in select 
projection
         let projected_plan = self.project(base_plan.clone(), select_exprs)?;
         let select_exprs = projected_plan.expressions();
diff --git a/datafusion/sql/src/utils.rs b/datafusion/sql/src/utils.rs
index 1a76dd69f4..d31bfc65ae 100644
--- a/datafusion/sql/src/utils.rs
+++ b/datafusion/sql/src/utils.rs
@@ -33,6 +33,7 @@ use datafusion_expr::builder::get_struct_unnested_columns;
 use datafusion_expr::expr::{
     Alias, GroupingSet, Unnest, WindowFunction, WindowFunctionParams,
 };
+use datafusion_expr::select_expr::SelectExpr;
 use datafusion_expr::utils::{expr_as_column_expr, find_column_exprs};
 use datafusion_expr::{
     ColumnUnnestList, Expr, ExprSchemable, LogicalPlan, col, expr_vec_fmt,
@@ -633,6 +634,38 @@ fn push_projection_dedupl(projection: &mut Vec<Expr>, 
expr: Expr) {
         projection.push(expr);
     }
 }
+
+/// Auto-suffix duplicate SELECT expression names with `:{count}`.
+///
+/// The first occurrence keeps its original name so that ORDER BY / HAVING
+/// references resolve correctly. Wildcards are left untouched because they
+/// are expanded later in `project_with_validation`.
+///
+/// Duplicates are detected by the schema name of each expression, which
+/// identifies logically identical expressions before column normalization.
+pub(crate) fn deduplicate_select_expr_names(exprs: Vec<SelectExpr>) -> 
Vec<SelectExpr> {
+    let mut seen: HashMap<String, usize> = HashMap::new();
+    exprs
+        .into_iter()
+        .map(|select_expr| match select_expr {
+            SelectExpr::Expression(expr) => {
+                let name = expr.schema_name().to_string();
+                let count = seen.entry(name.clone()).or_insert(0);
+                let result = if *count > 0 {
+                    let (_qualifier, field_name) = expr.qualified_name();
+                    
SelectExpr::Expression(expr.alias(format!("{field_name}:{count}")))
+                } else {
+                    SelectExpr::Expression(expr)
+                };
+                *count += 1;
+                result
+            }
+            // Leave wildcards alone — they are expanded later
+            other => other,
+        })
+        .collect()
+}
+
 /// The context is we want to rewrite unnest() into 
InnerProjection->Unnest->OuterProjection
 /// Given an expression which contains unnest expr as one of its children,
 /// Try transform depends on unnest type
diff --git a/datafusion/sql/tests/sql_integration.rs 
b/datafusion/sql/tests/sql_integration.rs
index 950a79ddb0..2581a0028a 100644
--- a/datafusion/sql/tests/sql_integration.rs
+++ b/datafusion/sql/tests/sql_integration.rs
@@ -789,11 +789,13 @@ fn select_column_does_not_exist() {
 #[test]
 fn select_repeated_column() {
     let sql = "SELECT age, age FROM person";
-    let err = logical_plan(sql).expect_err("query should have failed");
-
+    let plan = logical_plan(sql).unwrap();
     assert_snapshot!(
-        err.strip_backtrace(),
-        @r#"Error during planning: Projections require unique expression names 
but the expression "person.age" at position 0 and "person.age" at position 1 
have the same name. Consider aliasing ("AS") one of them."#
+        plan,
+        @r"
+    Projection: person.age, person.age AS age:1
+      TableScan: person
+    "
     );
 }
 
@@ -1531,11 +1533,14 @@ fn select_simple_aggregate_column_does_not_exist() {
 #[test]
 fn select_simple_aggregate_repeated_aggregate() {
     let sql = "SELECT MIN(age), MIN(age) FROM person";
-    let err = logical_plan(sql).expect_err("query should have failed");
-
+    let plan = logical_plan(sql).unwrap();
     assert_snapshot!(
-        err.strip_backtrace(),
-        @r#"Error during planning: Projections require unique expression names 
but the expression "min(person.age)" at position 0 and "min(person.age)" at 
position 1 have the same name. Consider aliasing ("AS") one of them."#
+        plan,
+        @r"
+    Projection: min(person.age), min(person.age) AS min(person.age):1
+      Aggregate: groupBy=[[]], aggr=[[min(person.age)]]
+        TableScan: person
+    "
     );
 }
 
@@ -1584,11 +1589,14 @@ fn select_from_typed_string_values() {
 #[test]
 fn select_simple_aggregate_repeated_aggregate_with_repeated_aliases() {
     let sql = "SELECT MIN(age) AS a, MIN(age) AS a FROM person";
-    let err = logical_plan(sql).expect_err("query should have failed");
-
+    let plan = logical_plan(sql).unwrap();
     assert_snapshot!(
-        err.strip_backtrace(),
-        @r#"Error during planning: Projections require unique expression names 
but the expression "min(person.age) AS a" at position 0 and "min(person.age) AS 
a" at position 1 have the same name. Consider aliasing ("AS") one of them."#
+        plan,
+        @r"
+    Projection: min(person.age) AS a, min(person.age) AS a AS a:1
+      Aggregate: groupBy=[[]], aggr=[[min(person.age)]]
+        TableScan: person
+    "
     );
 }
 
@@ -1625,11 +1633,14 @@ fn select_simple_aggregate_with_groupby_with_aliases() {
 #[test]
 fn select_simple_aggregate_with_groupby_with_aliases_repeated() {
     let sql = "SELECT state AS a, MIN(age) AS a FROM person GROUP BY state";
-    let err = logical_plan(sql).expect_err("query should have failed");
-
+    let plan = logical_plan(sql).unwrap();
     assert_snapshot!(
-        err.strip_backtrace(),
-        @r#"Error during planning: Projections require unique expression names 
but the expression "person.state AS a" at position 0 and "min(person.age) AS a" 
at position 1 have the same name. Consider aliasing ("AS") one of them."#
+        plan,
+        @r"
+    Projection: person.state AS a, min(person.age) AS a AS a:1
+      Aggregate: groupBy=[[person.state]], aggr=[[min(person.age)]]
+        TableScan: person
+    "
     );
 }
 
@@ -1750,11 +1761,14 @@ fn select_simple_aggregate_with_groupby_can_use_alias() 
{
 #[test]
 fn select_simple_aggregate_with_groupby_aggregate_repeated() {
     let sql = "SELECT state, MIN(age), MIN(age) FROM person GROUP BY state";
-    let err = logical_plan(sql).expect_err("query should have failed");
-
+    let plan = logical_plan(sql).unwrap();
     assert_snapshot!(
-        err.strip_backtrace(),
-        @r#"Error during planning: Projections require unique expression names 
but the expression "min(person.age)" at position 1 and "min(person.age)" at 
position 2 have the same name. Consider aliasing ("AS") one of them."#
+        plan,
+        @r"
+    Projection: person.state, min(person.age), min(person.age) AS 
min(person.age):1
+      Aggregate: groupBy=[[person.state]], aggr=[[min(person.age)]]
+        TableScan: person
+    "
     );
 }
 
diff --git a/datafusion/sqllogictest/test_files/aggregate.slt 
b/datafusion/sqllogictest/test_files/aggregate.slt
index e42ebd4ce7..cfca0de74b 100644
--- a/datafusion/sqllogictest/test_files/aggregate.slt
+++ b/datafusion/sqllogictest/test_files/aggregate.slt
@@ -7941,21 +7941,16 @@ select count(), count() * count() from t;
 ----
 2 4
 
-# DataFusion error: Error during planning: Projections require unique 
expression names but the expression "count\(Int64\(1\)\)" at position 0 and 
"count\(Int64\(1\)\)" at position 1 have the same name\. Consider aliasing 
\("AS"\) one of them\.
-query error
+# Duplicate aggregate expressions are now auto-suffixed
+query II
 select count(1), count(1) from t;
+----
+2 2
 
-# DataFusion error: Error during planning: Projections require unique 
expression names but the expression "count\(Int64\(1\)\)" at position 0 and 
"count\(Int64\(1\)\)" at position 1 have the same name\. Consider aliasing 
\("AS"\) one of them\.
-query error
-explain select count(1), count(1) from t;
-
-# DataFusion error: Error during planning: Projections require unique 
expression names but the expression "count\(Int64\(1\) AS \)" at position 0 and 
"count\(Int64\(1\) AS \)" at position 1 have the same name\. Consider aliasing 
\("AS"\) one of them\.
-query error
+query II
 select count(), count() from t;
-
-# DataFusion error: Error during planning: Projections require unique 
expression names but the expression "count\(Int64\(1\) AS \)" at position 0 and 
"count\(Int64\(1\) AS \)" at position 1 have the same name\. Consider aliasing 
\("AS"\) one of them\.
-query error
-explain select count(), count() from t;
+----
+2 2
 
 query II
 select count(1), count(2) from t;
diff --git a/datafusion/sqllogictest/test_files/duplicate_column_alias.slt 
b/datafusion/sqllogictest/test_files/duplicate_column_alias.slt
new file mode 100644
index 0000000000..e7e2d14976
--- /dev/null
+++ b/datafusion/sqllogictest/test_files/duplicate_column_alias.slt
@@ -0,0 +1,175 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Tests for duplicate column names/aliases in projections.
+# DataFusion auto-suffixes duplicates with :{count} (e.g. cov, cov:1).
+
+# Setup
+statement ok
+CREATE TABLE t(x INT, y INT) AS VALUES (1, 2), (3, 4);
+
+#
+# Basic duplicate alias
+#
+query II
+SELECT x AS c1, y AS c1 FROM t;
+----
+1 2
+3 4
+
+#
+# Duplicate literal expressions
+#
+query II
+SELECT 1, 1;
+----
+1 1
+
+#
+# Same column selected twice
+#
+query II
+SELECT x, x FROM t;
+----
+1 1
+3 3
+
+#
+# Subquery with duplicate column names
+#
+query II
+SELECT * FROM (SELECT x AS c1, y AS c1 FROM t);
+----
+1 2
+3 4
+
+#
+# ORDER BY referencing a duplicated alias resolves to first occurrence
+#
+query II
+SELECT x AS c1, y AS c1 FROM t ORDER BY c1;
+----
+1 2
+3 4
+
+#
+# CTE join producing duplicate column names (TPC-DS Q39 pattern)
+#
+statement ok
+CREATE TABLE inv(warehouse_sk INT, item_sk INT, moy INT, cov DOUBLE) AS VALUES
+  (1, 10, 1, 1.5),
+  (1, 10, 2, 2.0),
+  (2, 20, 1, 0.8),
+  (2, 20, 2, 1.2);
+
+query IIIRIIIR
+WITH inv1 AS (
+    SELECT warehouse_sk, item_sk, moy, cov FROM inv WHERE moy = 1
+),
+inv2 AS (
+    SELECT warehouse_sk, item_sk, moy, cov FROM inv WHERE moy = 2
+)
+SELECT inv1.warehouse_sk, inv1.item_sk, inv1.moy, inv1.cov,
+       inv2.warehouse_sk, inv2.item_sk, inv2.moy, inv2.cov
+FROM inv1 JOIN inv2
+  ON inv1.item_sk = inv2.item_sk AND inv1.warehouse_sk = inv2.warehouse_sk
+ORDER BY inv1.warehouse_sk, inv1.item_sk;
+----
+1 10 1 1.5 1 10 2 2
+2 20 1 0.8 2 20 2 1.2
+
+#
+# Three-way duplicate
+#
+query III
+SELECT x AS a, y AS a, x + y AS a FROM t;
+----
+1 2 3
+3 4 7
+
+#
+# CAST produces same schema_name as original column (TPC-DS Q39 pattern).
+# CAST is transparent to schema_name, so CAST(x AS DOUBLE) and x
+# both have schema_name "x" — this must be deduped.
+#
+query RI
+SELECT CAST(x AS DOUBLE), x FROM t;
+----
+1 1
+3 3
+
+#
+# GROUP BY with duplicate expressions in SELECT
+#
+query II
+SELECT x, x FROM t GROUP BY x;
+----
+1 1
+3 3
+
+#
+# Aggregate with GROUP BY producing duplicate column names
+#
+query III
+SELECT x, SUM(y) AS total, SUM(y) AS total FROM t GROUP BY x ORDER BY x;
+----
+1 2 2
+3 4 4
+
+#
+# ORDER BY referencing the second (renamed) column by position
+#
+query II
+SELECT y AS c1, x AS c1 FROM t ORDER BY 2;
+----
+2 1
+4 3
+
+#
+# Function calls that produce the same schema_name after argument
+# normalization (reported in issue #6543 for iszero).
+#
+query BB
+SELECT iszero(0.0), iszero(-0.0);
+----
+true true
+
+#
+# Duplicate expressions inside a UNION subquery
+#
+query II
+SELECT * FROM (SELECT x AS a, y AS a FROM t UNION ALL SELECT y AS a, x AS a 
FROM t) ORDER BY 1, 2;
+----
+1 2
+2 1
+3 4
+4 3
+
+#
+# Known limitation: wildcard expansion happens after dedup, so
+# SELECT *, col FROM t still errors when col overlaps with *.
+# This will be addressed in a follow-up PR.
+#
+query error DataFusion error: Error during planning: Projections require 
unique expression names but the expression "t\.x" at position 0 and "t\.x" at 
position 2 have the same name\. Consider aliasing \("AS"\) one of them\.
+SELECT *, x FROM t;
+
+# Cleanup
+statement ok
+DROP TABLE t;
+
+statement ok
+DROP TABLE inv;
diff --git a/datafusion/sqllogictest/test_files/unnest.slt 
b/datafusion/sqllogictest/test_files/unnest.slt
index 8cfc01380d..4b3f3e4f38 100644
--- a/datafusion/sqllogictest/test_files/unnest.slt
+++ b/datafusion/sqllogictest/test_files/unnest.slt
@@ -547,8 +547,16 @@ select unnest(column1) from (select * from 
(values([1,2,3]), ([4,5,6])) limit 1
 5
 6
 
-query error DataFusion error: Error during planning: Projections require 
unique expression names but the expression "UNNEST\(unnest_table.column1\)" at 
position 0 and "UNNEST\(unnest_table.column1\)" at position 1 have the same 
name. Consider aliasing \("AS"\) one of them.
+query II
 select unnest(column1), unnest(column1) from unnest_table;
+----
+1 1
+2 2
+3 3
+4 4
+5 5
+6 6
+12 12
 
 query II
 select unnest(column1), unnest(column1) u1 from unnest_table;


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datafusion) branch main updated: Fix/support duplicate column names #6543 (#21126)

Reply via email to