songkant-aws commented on issue #23007:
URL: https://github.com/apache/datafusion/issues/23007#issuecomment-4739160847
## Additional repro: equivalent SQL that executes fine in DataFusion but
fails after a Substrait round-trip
Reproduced on **`main` @ `0fc55d067` (2026-06-18)** as well as `54.0.0` —
identical error.
This is the SQL form of the streamstats plan (the original report used
`LogicalPlanBuilder`). It's a single hand-written query against the existing
`data` table:
```sql
SELECT a, b, avg1, avg2 FROM (
SELECT a, b, avg1,
row_number() OVER () AS seq2,
avg(b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW) AS avg2
FROM (
SELECT a, b, avg1 FROM (
SELECT a, b,
row_number() OVER () AS seq1,
avg(b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW) AS avg1
FROM data
) t1 ORDER BY seq1
) t2
) t3 ORDER BY seq2
```
### The strange part
This SQL is **valid** and **executes successfully** in DataFusion with no
Substrait involved:
```rust
// PASSES: planning + optimization + execution all succeed
let plan = ctx.sql(SQL).await?.into_optimized_plan()?;
let _ = ctx.sql(SQL).await?.collect().await?; // 1 batch, ok
// FAILS: the exact same SQL, once round-tripped through Substrait
roundtrip(SQL).await?;
// Error: SchemaError(DuplicateUnqualifiedField {
// name: "avg(data.b) PARTITION BY [data.a] ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW" }, Some(""))
```
So a legal, executable plan becomes unconsumable purely because it went
through Substrait.
### Why this is counterintuitive — and what actually collides
At first glance you'd expect the duplicate to involve the two `row_number()`
columns. It does **not**. Here is the DataFusion-optimized plan (no Substrait),
which is what gets serialized:
```
Projection: t3.a, t3.b, t3.avg1, t3.avg2
Sort: t3.seq2 ASC NULLS LAST
Projection: t3.a, t3.b, t3.avg1, t3.avg2, t3.seq2
SubqueryAlias: t3
Projection: t2.a, t2.b, t2.avg1,
row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING AS seq2,
avg(t2.b) PARTITION BY [t2.a] ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW AS avg2
WindowAggr: windowExpr=[[row_number() ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING]]
WindowAggr: windowExpr=[[avg(t2.b) PARTITION BY [t2.a] ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
SubqueryAlias: t2
SubqueryAlias: t1
Projection: data.a, data.b,
avg(data.b) PARTITION BY [data.a] ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW AS avg1
WindowAggr: windowExpr=[[avg(data.b) PARTITION BY
[data.a] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
TableScan: data projection=[a, b]
```
Two observations from this plan:
1. **The inner `row_number()` (`seq1`) is gone.** The optimizer pruned the
entire inner `seq1` WindowAggr — it was only referenced by an inner `ORDER BY`
that doesn't affect the final result, and the outer queries never select it. So
`row_number` is not what collides. (`seq2` survives only because it's used by
the top-level `ORDER BY`.)
2. **The collision is between the two `avg(...)` columns**, which carry
*different aliases* `avg1` and `avg2`:
- inner WindowAggr → `AS avg1`, and `avg1` is then carried up through `t1
→ t2 → t3` into the final SELECT;
- middle WindowAggr → `AS avg2`.
In the DataFusion plan they are unambiguous because of those distinct
aliases — that's why it executes fine. But their **internal schema names are
byte-for-byte identical**: both are `avg(data.b) PARTITION BY [data.a] ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. They are distinguished *only* by
the `AS avg1` / `AS avg2` aliases.
Substrait intermediate rels are positional (`output_mapping` by index); the
`avg1`/`avg2` aliases are **not** preserved in the IR. On the consumer side,
when `from_project_rel` rebuilds the second window via `window_plan` →
`Window::try_new`, the inherited `avg1` column (already in the input schema)
and the freshly-built `avg2` both fall back to that same default schema name,
and `Window::try_new`'s `DFSchema::new_with_metadata` → `check_names()` raises
`DuplicateUnqualifiedField`.
So the precise "strange point": **the two columns are never duplicates
semantically** — they have different aliases and different origins, and the
column that the outer query selects is `avg1`, not the pruned `row_number`. The
duplicate is *manufactured* by Substrait's positional representation dropping
the aliases that disambiguated them. The existing `NameTracker` (added in
#15211) doesn't catch it because it never seeds itself with the input schema's
field names before `window_plan`, so an inherited window column and a new
identically-named one are not deduplicated against each other.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]