adriangb opened a new issue, #22895:
URL: https://github.com/apache/datafusion/issues/22895

   ### Describe the bug
   
   When a join key contains a `MoveTowardsLeafNodes` expression (e.g. 
`get_field`), `ExtractLeafExpressions` rewrites the key into a 
`__datafusion_extracted_N` column, and a subsequent `optimize_projections` pass 
fails with `Schema error: No field named <column>` for an unrelated column of 
the same table. Reproduces on current main (tested at `883c38ee`) with default 
settings.
   
   ### To Reproduce
   
   ```sql
   CREATE TABLE rt AS SELECT * FROM (VALUES
     (named_struct('uid','u1','t','t1'), TIMESTAMP '2026-06-08T10:00:00', 'a'),
     (named_struct('uid','u2','t','t2'), TIMESTAMP '2026-06-08T11:00:00', 'b')) 
v(attributes, start_timestamp, span_name);
   
   SELECT r.start_timestamp
   FROM rt r JOIN (SELECT attributes['uid'] AS uid FROM rt) f ON f.uid = 
r.attributes['uid']
   WHERE r.attributes['t'] IN (SELECT attributes['t'] FROM rt);
   ```
   
   ```
   Optimizer rule 'optimize_projections' failed
   caused by
   Schema error: No field named r.start_timestamp.
   ```
   
   A larger variant of the same query (extra columns, mark-join from the `IN` 
subquery) shows what the schema has degraded to:
   
   ```
   Schema error: No field named r.start_timestamp.
   Valid fields are r.__datafusion_extracted_3, r.attributes, f.uid, 
f.first_time, __correlated_sq_1.mark.
   ```
   
   The subquery is not required — a self-join on `get_field` keys plus any 
`get_field` in `WHERE` also fails:
   
   ```sql
   CREATE TABLE rt2 AS SELECT * FROM (VALUES
     (named_struct('msg','user auth failed','sid','a'), 1, 'svc1'),
     (named_struct('msg','login token','sid','b'), 2, 'svc2')) v(attributes, 
id, name);
   
   SELECT a.id, b.name
   FROM rt2 a JOIN rt2 b ON a.attributes['sid'] = b.attributes['sid']
   WHERE a.attributes['msg'] LIKE '%auth%';
   ```
   
   ```
   Optimizer rule 'optimize_projections' failed
   caused by
   Schema error: No field named b.name.
   Valid fields are a.id, a.__datafusion_extracted_1, 
b.__datafusion_extracted_2, b.id.
   ```
   
   Either ingredient alone (join key extraction without the `WHERE`, or the 
`WHERE` with a plain join key) works fine.
   
   Workaround (confirmed for all variants):
   
   ```sql
   SET datafusion.optimizer.enable_leaf_expression_pushdown = false;
   ```
   
   ### Expected behavior
   
   The queries should run; with leaf expression pushdown disabled they return 
the expected rows.
   
   ### Additional context
   
   Plan progression from `EXPLAIN VERBOSE` for the self-join variant: 
`extract_leaf_expressions` rewrites the join to `Inner Join: 
__datafusion_extracted_1 = __datafusion_extracted_2` with per-input extraction 
projections and a positional recovery projection above the join 
(`build_recovery_projection`) that restores `a.attributes, a.id, a.name, 
b.attributes, b.id, b.name`. After `push_down_leaf_projections` merges the 
extraction projections downward, `optimize_projections` prunes the join inputs 
and drops columns that the recovery projection still references. The surviving 
field set looks index-shifted rather than name-driven — note `b.id` survives 
while `b.name` (the column actually referenced above) vanishes — which suggests 
the pruning mis-accounts for the extra extraction columns when computing 
required indices for the join inputs.
   
   We also see a related error family in production from the same machinery, 
e.g. `Optimizer rule 'push_down_leaf_projections' failed ... Schema contains 
qualified field name m.__datafusion_extracted_11 and unqualified field name 
__datafusion_extracted_11 which would be ambiguous`, suggesting qualifier 
handling of the generated aliases is fragile in multi-input plans.
   
   Related fixed issues in this area, for context — this appears to be a 
distinct remaining bug: #20432 (alias collision with user-provided 
`__datafusion_extracted` names), #22615 / #22620 (`PushDownLeafProjections` vs 
`Unnest`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to