kosiew opened a new pull request, #18814:
URL: https://github.com/apache/datafusion/pull/18814

   
   ## Which issue does this PR close?
   
   * Closes #16295.
   
   ## Rationale for this change
   
   Self-referential INTERSECT and EXCEPT queries (where both sides originate 
from the same table) failed during Substrait round‑trip consumption with the 
error:
   
   > "Schema contains duplicate qualified field name"
   
   This happened because the join-based implementation of set operations 
attempted to merge two identical schemas without requalification, resulting in 
duplicate or ambiguous field names. By ensuring both sides are requalified when 
needed, DataFusion can correctly construct valid logical plans for these 
operations.
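
   The failure and the fix can be illustrated with a minimal, self-contained sketch. It does not use DataFusion's actual types: `Field`, `merge`, and `requalify` are hypothetical stand-ins that only model how a join-style schema merge rejects a repeated qualified name, and how aliasing each side (here as `left` and `right`) avoids the clash.

   ```rust
   use std::collections::HashSet;

   /// Hypothetical stand-in for a qualified schema field (not a DataFusion type).
   #[derive(Clone, Debug, PartialEq, Eq, Hash)]
   struct Field {
       qualifier: String,
       name: String,
   }

   /// Concatenate two schemas the way a join would, rejecting any
   /// qualified name that appears more than once.
   fn merge(left: &[Field], right: &[Field]) -> Result<Vec<Field>, String> {
       let mut seen = HashSet::new();
       let mut merged = Vec::new();
       for field in left.iter().chain(right.iter()) {
           if !seen.insert(field.clone()) {
               return Err(format!(
                   "Schema contains duplicate qualified field name {}.{}",
                   field.qualifier, field.name
               ));
           }
           merged.push(field.clone());
       }
       Ok(merged)
   }

   /// Replace the qualifier of every field with `alias`.
   fn requalify(fields: &[Field], alias: &str) -> Vec<Field> {
       fields
           .iter()
           .map(|f| Field {
               qualifier: alias.to_string(),
               name: f.name.clone(),
           })
           .collect()
   }
   ```

   With a single-field schema `alltypes_plain.int_col` on both sides, `merge` returns the error quoted above, while merging `requalify(&scan, "left")` with `requalify(&scan, "right")` succeeds and yields two distinct fields.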
   
   ### Before
   ```
   ❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
        Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
   Completed 1 test files in 0 seconds
    External error: 1 errors in file /Users/kosiew/GitHub/datafusion/datafusion/sqllogictest/test_files/intersection.slt
   
   1. query failed: DataFusion error: Schema error: Schema contains duplicate qualified field name alltypes_plain.int_col
   ...
   ```
   
   ### After
   ```
   ❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.64s
        Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
   Completed 1 test files in 0 seconds
   ```
   
   ## What changes are included in this PR?
   
   * Added a requalification step (`requalify_sides_if_needed`) inside 
`intersect_or_except` to avoid duplicate or ambiguous field names.
   * Improved conflict detection logic in `requalify_sides_if_needed` to handle:
   
     1. Duplicate qualified fields
     2. Duplicate unqualified fields
     3. Ambiguous references (qualified vs. unqualified collisions)
   * Updated optimizer tests to reflect correct aliasing (`left`, `right`).
   * Added new Substrait round‑trip tests for:
   
     * INTERSECT and EXCEPT (both DISTINCT and ALL variants)
     * Self-referential queries that previously failed
   * Minor formatting and consistency improvements in Substrait consumer code.
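
   The three conflict cases listed above can be sketched as follows. This is an illustrative model only, not the code in `requalify_sides_if_needed`: the `Column` struct and the `col` and `has_conflict` helpers are hypothetical names introduced for this example, and the real implementation may differ in structure and detail.

   ```rust
   /// Hypothetical column descriptor: optional qualifier plus name
   /// (not a DataFusion type).
   #[derive(Clone, Debug, PartialEq, Eq)]
   struct Column {
       qualifier: Option<String>,
       name: String,
   }

   /// Convenience constructor for the sketch.
   fn col(qualifier: Option<&str>, name: &str) -> Column {
       Column {
           qualifier: qualifier.map(str::to_string),
           name: name.to_string(),
       }
   }

   /// Returns true when combining `left` and `right` into one schema
   /// would produce a duplicate or ambiguous field name.
   fn has_conflict(left: &[Column], right: &[Column]) -> bool {
       for l in left {
           for r in right {
               if l.name != r.name {
                   continue;
               }
               let conflict = match (&l.qualifier, &r.qualifier) {
                   // 1. Duplicate qualified fields: same qualifier, same name.
                   (Some(lq), Some(rq)) => lq == rq,
                   // 2. Duplicate unqualified fields: same bare name.
                   (None, None) => true,
                   // 3. Ambiguous reference: a qualified and an
                   //    unqualified field share a name.
                   (Some(_), None) | (None, Some(_)) => true,
               };
               if conflict {
                   return true;
               }
           }
       }
       false
   }
   ```

   In this model, a caller would alias the two inputs (for example as `left` and `right`) only when `has_conflict` returns true, so inputs whose qualifiers already differ keep their original names.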
   
   ## Are these changes tested?
   
   Yes. The PR includes comprehensive tests that:
   
   * Reproduce the original failure modes.
   * Validate that requalification produces stable and correct logical plans.
   * Confirm correct behavior across INTERSECT, EXCEPT, ALL, and DISTINCT cases.
   
   ## Are there any user-facing changes?
   
   No user-facing behavior changes.
   This is a correctness fix: valid SQL queries that previously failed only in 
Substrait round‑trip mode now run without error.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and validated.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

