yashrb24 opened a new pull request, #21247:
URL: https://github.com/apache/datafusion/pull/21247

   ## Which issue does this PR close?
   
   - Closes #21246
   
   ## Rationale for this change
   
   `ProjectionExec::gather_filters_for_pushdown` silently rewrites filter 
predicates to the wrong source column when the output schema contains duplicate 
column names — a structure that arises above joins where both sides share a 
column name. Two functions use name-only schema lookups (`column_with_name` and 
`index_of`) that always return the first match, which is incorrect when 
duplicate names exist:
   
   1. `collect_reverse_alias`: `column_with_name` resolves both duplicates to 
the same `(name, index)` key, so the second duplicate's HashMap entry overwrites 
the first.
   2. `FilterRemapper::try_remap`: `index_of` silently rewrites column indices 
from non-first duplicates to the index of the first match.
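
The two failure modes above can be reproduced in a small standalone model. This is a sketch only, not DataFusion code: the schema is modeled as a slice of names, `first_index_of` is a hypothetical stand-in for a name-only lookup that returns the first match, and the source indices are illustrative.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a name-only schema lookup:
/// returns the position of the FIRST field with this name.
fn first_index_of(schema: &[&str], name: &str) -> Option<usize> {
    schema.iter().position(|field| *field == name)
}

/// Builds a map keyed by (output name, output index), deriving the index
/// from the name-only lookup. `sources` holds the source column index for
/// each output column (illustrative values, not taken from the PR).
fn reverse_alias_by_name(
    schema: &[&str],
    sources: &[usize],
) -> HashMap<(String, usize), usize> {
    let mut map = HashMap::new();
    for (name, source) in schema.iter().zip(sources) {
        // Duplicates both resolve to the first match, so their keys collide
        // and the later insert replaces the earlier one.
        let idx = first_index_of(schema, name).expect("name comes from the schema");
        map.insert((name.to_string(), idx), *source);
    }
    map
}
```

With an output schema of `["id", "name", "id"]` and sources `[0, 1, 2]`, the map ends up with two entries instead of three: the key `("id", 0)` points at source column 2, and there is no entry for `("id", 2)`.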
   
   This code path is not exercised through normal SQL because the logical 
optimizer's `PushDownFilter` resolves qualified column references and pushes 
filters below projections before the physical plan is created. However, it 
affects any direct construction of physical plans (custom planners, external 
systems, the DataFrame API with manual projections).
   
   ## What changes are included in this PR?
   
   1. **`collect_reverse_alias`**: Use `enumerate()` index instead of 
`column_with_name()`. Projection expressions are positionally aligned with the 
output schema, so `idx` is the correct output column index.
   
   2. **`gather_filters_for_pushdown`**: Replace `FilterRemapper::try_remap` 
(which uses `index_of`) with direct validation against the alias map's exact 
`(name, index)` keys. The `PhysicalColumnRewriter` already does an exact-key 
lookup, so `try_remap` was both redundant and wrong for this case.
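
A minimal sketch of the positional approach described in these two changes, again as a standalone model rather than the actual patch. The names `collect_reverse_alias_sketch` and `resolve_exact` are hypothetical, each projection entry is simplified to an (output name, source column index) pair, and returning `None` on a miss is an assumption about how an unmatched column is treated.

```rust
use std::collections::HashMap;

/// Exact (output name, output index) key.
type AliasKey = (String, usize);

/// Builds the reverse alias map positionally: the i-th projection
/// expression produces the i-th output column, so the `enumerate()`
/// index is used directly as the output column index.
fn collect_reverse_alias_sketch(
    projection: &[(&str, usize)],
) -> HashMap<AliasKey, usize> {
    projection
        .iter()
        .enumerate()
        .map(|(idx, (name, source))| ((name.to_string(), idx), *source))
        .collect()
}

/// Resolves a filter column only when its exact (name, index) key exists.
/// No name-only fallback is attempted; a miss yields `None` (assumed here
/// to mean the column cannot be remapped).
fn resolve_exact(
    map: &HashMap<AliasKey, usize>,
    name: &str,
    index: usize,
) -> Option<usize> {
    map.get(&(name.to_string(), index)).copied()
}
```

For the projection `[("id", 0), ("name", 1), ("id", 2)]`, all three entries survive, and a filter on `id` at output index 2 resolves to source column 2 rather than being redirected to the first `id`.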
   
   ## Are these changes tested?
   
   Yes. A regression test is added that constructs the exact physical plan 
structure triggering the bug (FilterExec → ProjectionExec with duplicate column 
names → HashJoinExec), runs the FilterPushdown optimizer, and verifies the 
optimized plan returns correct results (3 rows instead of the previous 0).
   
   ## Are there any user-facing changes?
   
   No API changes. Fixes incorrect query results for physical plans with 
duplicate column names in projections.

