kosiew opened a new issue, #21616:
URL: https://github.com/apache/datafusion/issues/21616
## Summary
DataFusion's SQL planner currently validates duplicate relation
names/aliases in three separate, inconsistent places:
| Location | Path covered | Diagnostic quality |
|---|---|---|
| `datafusion/sql/src/cte.rs` – `plan_with_clause` | Duplicate CTE names in
a `WITH` block | Good – span-aware `Diagnostic` with "previously defined here"
note |
| `datafusion/sql/src/select.rs` – `plan_from_tables` (multi-entry branch
only) | Duplicate aliases in comma-separated `FROM` lists (`FROM a, b, a`) |
Good – span-aware `Diagnostic` with "first defined here" note |
| `datafusion/common/src/dfschema.rs` – `DFSchema::check_names` /
`DFSchema::join` | All other duplicates (explicit `JOIN`, sub-selects, etc.) |
Poor – generic `SchemaError::DuplicateQualifiedField` /
`DuplicateUnqualifiedField` with no source span |
There is no single, authoritative place where the planner records "which
relation identity has been bound, at which source span". This causes three
concrete problems described below.
## Problem 1 – Explicit `JOIN` duplicate aliases are not caught by the new
diagnostic
`plan_from_tables` only applies the new check when `from.len() > 1` (the
comma-join path). A single `TableWithJoins` goes straight to
`plan_table_with_joins` without any alias-deduplication step:
```rust
// datafusion/sql/src/select.rs lines 691-694
match from.len() {
0 => Ok(LogicalPlanBuilder::empty(true).build()?),
1 => {
let input = from.remove(0);
self.plan_table_with_joins(input, planner_context) // ← no
duplicate check
}
_ => { /* new duplicate check lives here only */ }
}
```
`plan_table_with_joins` in `datafusion/sql/src/relation/join.rs` calls
`create_relation` for each join operand and then `parse_relation_join`, neither
of which validates alias uniqueness. This means:
```sql
-- Explicit JOIN – skips the new check entirely
SELECT * FROM (SELECT 1 AS a) AS t JOIN (SELECT 2 AS b) AS t ON true
-- When column names overlap, falls through to the generic DFSchema error
-- When column names do not overlap, may plan successfully with two `t`
aliases
```
Both outcomes differ from the comma-join behaviour, which now returns a
proper positioned diagnostic.
## Problem 2 – Unaliased fully-qualified table names collapse to the wrong
key
The `extract_table_name` closure uses only the **last** identifier of a
multi-part table name:
```rust
// datafusion/sql/src/select.rs lines 714-722
TableFactor::Table { name, alias: None, .. } => {
let span = Span::try_from_sqlparser_span(t.relation.span());
let table_name = name
.0
.iter()
.filter_map(|p| p.as_ident())
.map(|id| self.ident_normalizer.normalize(id.clone()))
.next_back()?; // ← only the final component
Some((table_name, span))
}
```
So `catalog1.public.orders` and `catalog2.public.orders` both produce the
key `"orders"` and are rejected as duplicates, even though DataFusion's scan
qualifier tracks the full `TableReference` (`catalog1.public.orders` vs
`catalog2.public.orders`). Conversely, the same physical table accessed via two
different schema paths could be double-scanned without triggering the guard.
The correct key for an unaliased table is
`object_name_to_table_reference(name.clone())?.to_string()`, which produces
`"catalog1.public.orders"` vs `"catalog2.public.orders"`.
## Problem 3 – DFSchema duplicate errors carry no source location
When a duplicate alias escapes planner-level checks (e.g. via an explicit
JOIN), the error propagates from `DFSchema::check_names` / `DFSchema::join` in
`datafusion/common/src/dfschema.rs` (lines 240–246). These paths emit
`SchemaError::DuplicateQualifiedField { qualifier, name }` and
`SchemaError::DuplicateUnqualifiedField { name }` with no attached `Diagnostic`
and no source span. The user sees a schema-layer error message with no
indication of which SQL token caused the conflict.
## Proposed solution
Introduce a **planner-level relation binding table** (`RelationScope` or
similar) that records normalized relation identity, source alias/name, and the
associated `Span` for every relation introduced into scope. This table would be
populated and consulted during relation and join planning (not only in
`plan_from_tables`), making duplicate detection consistent across all paths.
High-level sketch:
```rust
/// Maintained inside PlannerContext during planning of a single FROM clause.
struct RelationScope {
/// Maps normalized relation key -> (display name, source span)
bindings: HashMap<String, (String, Option<Span>)>,
}
impl RelationScope {
/// Register a relation; returns an error with Diagnostic if the key is
already bound.
fn bind(
&mut self,
key: String,
display: String,
span: Option<Span>,
) -> Result<()> { ... }
}
```
`RelationScope::bind` would be called from:
- `plan_from_tables` (current comma-join path), replacing the local
`alias_spans` map
- `plan_table_with_joins`, after resolving each join operand's alias
- Potentially `plan_with_clause` (CTE names), to have one unified binding
registry
Files that would need to change:
| File | Change |
|---|---|
| `datafusion/sql/src/select.rs` | Replace local `alias_spans` HashMap with
`RelationScope`; add binding call for the single-item FROM case |
| `datafusion/sql/src/relation/join.rs` | Call `RelationScope::bind` for
each join operand alias in `plan_table_with_joins` |
| `datafusion/sql/src/planner.rs` or a new `relation/scope.rs` | Define
`RelationScope` / `RelationBinding` |
| `datafusion/sql/src/cte.rs` | Optionally migrate CTE duplicate check to
use the same registry |
## Acceptance criteria
- [ ] `SELECT * FROM t JOIN t ON true` returns a span-aware "duplicate table
alias" diagnostic (blocked by the missing JOIN path)
- [ ] `SELECT * FROM t JOIN t ON t.x = t.x` (non-overlapping-column case) is
also rejected with a clear diagnostic
- [ ] `SELECT * FROM catalog1.public.orders, catalog2.public.orders`
succeeds (different full identities) instead of producing a false-positive
duplicate error
- [ ] `SELECT * FROM public.orders, public.orders` is still rejected (same
full identity)
- [ ] Regression tests added under
`datafusion/sql/tests/cases/diagnostic.rs` and/or
`datafusion/sqllogictest/test_files/` covering explicit JOIN duplicates,
cross-catalog same-table-name, and unaliased same-table-name
## Affected files (for reference)
- [datafusion/sql/src/select.rs](datafusion/sql/src/select.rs) –
`plan_from_tables`, `extract_table_name` closure
- [datafusion/sql/src/relation/join.rs](datafusion/sql/src/relation/join.rs)
– `plan_table_with_joins`, `parse_relation_join`
- [datafusion/sql/src/cte.rs](datafusion/sql/src/cte.rs) –
`plan_with_clause` (reference implementation of span-aware duplicate check)
- [datafusion/common/src/dfschema.rs](datafusion/common/src/dfschema.rs) –
`DFSchema::check_names`, `DFSchema::join` (downstream fallback path)
- [datafusion/common/src/error.rs](datafusion/common/src/error.rs) –
`SchemaError::DuplicateQualifiedField`, `SchemaError::DuplicateUnqualifiedField`
- [datafusion/sql/src/planner.rs](datafusion/sql/src/planner.rs) –
`object_name_to_table_reference` (canonical normalization helper)
This is a follow-up refactor surfaced during review of PR #20720
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]