songkant-aws commented on issue #23007:
URL: https://github.com/apache/datafusion/issues/23007#issuecomment-4739160847

   ## Additional repro: equivalent SQL that executes fine in DataFusion but 
fails after a Substrait round-trip
   
   Reproduced on **`main` @ `0fc55d067` (2026-06-18)** as well as `54.0.0` — 
identical error.
   
   This is the SQL form of the streamstats plan (the original report used 
`LogicalPlanBuilder`). It's a single hand-written query against the existing 
`data` table:
   
   ```sql
   SELECT a, b, avg1, avg2 FROM (
       SELECT a, b, avg1,
              row_number() OVER () AS seq2,
              avg(b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 
CURRENT ROW) AS avg2
       FROM (
           SELECT a, b, avg1 FROM (
               SELECT a, b,
                      row_number() OVER () AS seq1,
                      avg(b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW) AS avg1
               FROM data
           ) t1 ORDER BY seq1
       ) t2
   ) t3 ORDER BY seq2
   ```
   
   ### The strange part
   
   This SQL is **valid** and **executes successfully** in DataFusion with no 
Substrait involved:
   
   ```rust
   // PASSES: planning + optimization + execution all succeed
   let plan = ctx.sql(SQL).await?.into_optimized_plan()?;
   let _ = ctx.sql(SQL).await?.collect().await?;   // 1 batch, ok
   
   // FAILS: the exact same SQL, once round-tripped through Substrait
   roundtrip(SQL).await?;
   // Error: SchemaError(DuplicateUnqualifiedField {
   //   name: "avg(data.b) PARTITION BY [data.a] ROWS BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW" }, Some(""))
   ```
   
   So a legal, executable plan becomes unconsumable purely because it went 
through Substrait.
   
   ### Why this is counterintuitive — and what actually collides
   
   At first glance you'd expect the duplicate to involve the two `row_number()` 
columns. It does **not**. Here is the DataFusion-optimized plan (no Substrait), 
which is what gets serialized:
   
   ```
   Projection: t3.a, t3.b, t3.avg1, t3.avg2
     Sort: t3.seq2 ASC NULLS LAST
       Projection: t3.a, t3.b, t3.avg1, t3.avg2, t3.seq2
         SubqueryAlias: t3
           Projection: t2.a, t2.b, t2.avg1,
                       row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND 
UNBOUNDED FOLLOWING AS seq2,
                       avg(t2.b) PARTITION BY [t2.a] ROWS BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW AS avg2
             WindowAggr: windowExpr=[[row_number() ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING]]
               WindowAggr: windowExpr=[[avg(t2.b) PARTITION BY [t2.a] ROWS 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
                 SubqueryAlias: t2
                   SubqueryAlias: t1
                     Projection: data.a, data.b,
                                 avg(data.b) PARTITION BY [data.a] ROWS BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW AS avg1
                       WindowAggr: windowExpr=[[avg(data.b) PARTITION BY 
[data.a] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
                         TableScan: data projection=[a, b]
   ```
   
   Two observations from this plan:
   
   1. **The inner `row_number()` (`seq1`) is gone.** The optimizer pruned the 
entire inner `seq1` WindowAggr — it was only referenced by an inner `ORDER BY` 
that doesn't affect the final result, and the outer queries never select it. So 
`row_number` is not what collides. (`seq2` survives only because it's used by 
the top-level `ORDER BY`.)
   
   2. **The collision is between the two `avg(...)` columns**, which carry 
*different aliases* `avg1` and `avg2`:
      - inner WindowAggr → `AS avg1`, and `avg1` is then carried up through `t1 
→ t2 → t3` into the final SELECT;
      - middle WindowAggr → `AS avg2`.
   
      In the DataFusion plan they are unambiguous because of those distinct 
aliases — that's why it executes fine. But their **internal schema names are 
byte-for-byte identical**: both are `avg(data.b) PARTITION BY [data.a] ROWS 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. They are distinguished *only* by 
the `AS avg1` / `AS avg2` aliases.
   
   Substrait intermediate rels are positional (`output_mapping` by index); the 
`avg1`/`avg2` aliases are **not** preserved in the IR. On the consumer side, 
when `from_project_rel` rebuilds the second window via `window_plan` → 
`Window::try_new`, the inherited `avg1` column (already in the input schema) 
and the freshly-built `avg2` both fall back to that same default schema name, 
and `Window::try_new`'s `DFSchema::new_with_metadata` → `check_names()` raises 
`DuplicateUnqualifiedField`.
   
   So the precise "strange point": **the two columns are never duplicates 
semantically** — they have different aliases and different origins, and the 
column that the outer query selects is `avg1`, not the pruned `row_number`. The 
duplicate is *manufactured* by Substrait's positional representation dropping 
the aliases that disambiguated them. The existing `NameTracker` (added in 
#15211) doesn't catch it because it never seeds itself with the input schema's 
field names before `window_plan`, so an inherited window column and a new 
identically-named one are not deduplicated against each other.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to