kosiew opened a new pull request, #18814:
URL: https://github.com/apache/datafusion/pull/18814
## Which issue does this PR close?
* Closes #16295.
## Rationale for this change
Self-referential INTERSECT and EXCEPT queries (where both sides originate
from the same table) failed during Substrait round‑trip consumption with the
error:
> "Schema contains duplicate qualified field name"
This happened because the join-based implementation of set operations
attempted to merge two identical schemas without requalification, resulting in
duplicate or ambiguous field names. By ensuring both sides are requalified when
needed, DataFusion can correctly construct valid logical plans for these
operations.
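The failure mode can be illustrated with a small standalone model. This is only a sketch: `QualifiedField`, `field`, `merge_schemas`, and `requalify` are hypothetical names invented for illustration and are not DataFusion's `DFSchema` API. It assumes the join output schema is the concatenation of both sides and that a repeated (qualifier, name) pair is rejected.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a qualified column (not DataFusion's real type).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct QualifiedField {
    qualifier: Option<String>,
    name: String,
}

/// Convenience constructor for the illustrative field type.
fn field(qualifier: Option<&str>, name: &str) -> QualifiedField {
    QualifiedField {
        qualifier: qualifier.map(str::to_string),
        name: name.to_string(),
    }
}

/// Concatenate both sides, as a join's output schema would,
/// and reject any repeated (qualifier, name) pair.
fn merge_schemas(
    left: &[QualifiedField],
    right: &[QualifiedField],
) -> Result<Vec<QualifiedField>, String> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for f in left.iter().chain(right.iter()) {
        if !seen.insert(f.clone()) {
            let qualifier = f.qualifier.as_deref().unwrap_or("<unqualified>");
            return Err(format!(
                "duplicate qualified field name {}.{}",
                qualifier, f.name
            ));
        }
        merged.push(f.clone());
    }
    Ok(merged)
}

/// Re-qualify every field of one side under a new alias.
fn requalify(side: &[QualifiedField], alias: &str) -> Vec<QualifiedField> {
    side.iter().map(|f| field(Some(alias), &f.name)).collect()
}
```

In this model, merging a side containing `alltypes_plain.int_col` with itself returns an error, while merging `requalify(&side, "left")` with `requalify(&side, "right")` succeeds with two distinct fields.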
### Before
```
❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
Completed 1 test files in 0 seconds
External error: 1 errors in file /Users/kosiew/GitHub/datafusion/datafusion/sqllogictest/test_files/intersection.slt
1. query failed: DataFusion error: Schema error: Schema contains duplicate qualified field name alltypes_plain.int_col
...
```
### After
```
❯ cargo test --test sqllogictests -- --substrait-round-trip intersection.slt:33
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.64s
Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-917e139464eeea33)
Completed 1 test files in 0 seconds
```
## What changes are included in this PR?
* Added a requalification step (`requalify_sides_if_needed`) inside
`intersect_or_except` to avoid duplicate or ambiguous field names.
* Improved conflict detection logic in `requalify_sides_if_needed` to handle:
  1. Duplicate qualified fields
  2. Duplicate unqualified fields
  3. Ambiguous references (qualified vs. unqualified collisions)
* Updated optimizer tests to reflect correct aliasing (`left`, `right`).
* Added new Substrait round‑trip tests for:
  * INTERSECT and EXCEPT (both DISTINCT and ALL variants)
  * Self-referential queries that previously failed
* Minor formatting and consistency improvements in Substrait consumer code.
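
The three conflict kinds listed above could be detected along the following lines. This is a minimal sketch under stated assumptions, not the code in this PR: `Conflict`, `Col`, and `find_conflict` are hypothetical names, columns are modelled as plain `(qualifier, name)` tuples, and the reading of each conflict kind is an interpretation of the list above.

```rust
/// Hypothetical classification of the three conflict kinds named in the list.
#[derive(Debug, PartialEq)]
enum Conflict {
    DuplicateQualified,
    DuplicateUnqualified,
    Ambiguous,
}

/// Illustrative column model: (optional table qualifier, column name).
type Col<'a> = (Option<&'a str>, &'a str);

/// Return the first conflict found between the two sides, if any.
/// Assumption: same-named columns with *different* qualifiers do not conflict.
fn find_conflict(left: &[Col<'_>], right: &[Col<'_>]) -> Option<Conflict> {
    for (lq, lname) in left {
        for (rq, rname) in right {
            if lname != rname {
                continue;
            }
            match (lq, rq) {
                // Same qualifier and same name on both sides.
                (Some(a), Some(b)) if a == b => {
                    return Some(Conflict::DuplicateQualified)
                }
                // Same name, neither side qualified.
                (None, None) => return Some(Conflict::DuplicateUnqualified),
                // Same name, qualified on one side only.
                (Some(_), None) | (None, Some(_)) => {
                    return Some(Conflict::Ambiguous)
                }
                // Different qualifiers: the names stay distinguishable.
                _ => {}
            }
        }
    }
    None
}
```

In this sketch, a caller would alias the two inputs (for example as `left` and `right`, the aliases the updated optimizer tests expect) only when `find_conflict` returns `Some(_)`, leaving already-distinct schemas untouched.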
## Are these changes tested?
Yes. The PR includes comprehensive tests that:
* Reproduce the original failure modes.
* Validate that requalification produces stable and correct logical plans.
* Confirm correct behavior across INTERSECT, EXCEPT, ALL, and DISTINCT cases.
## Are there any user-facing changes?
No user-facing behavior changes.
This is a correctness improvement ensuring that valid SQL queries—previously
failing only in Substrait round‑trip mode—now work without error.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and validated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]