Dandandan opened a new pull request, #21333:
URL: https://github.com/apache/datafusion/pull/21333

   ## Which issue does this PR close?
   
   N/A - optimization improvement discovered during TPC-H plan analysis.
   
   ## Rationale for this change
   
   When `try_embed_projection` embeds a projection into a `HashJoinExec` (or 
other operators supporting embedded projections), `collect_column_indices` 
previously collected column indices into a `HashSet` and sorted them, losing 
the original column ordering from the projection. This meant that when a 
projection simply reordered columns, the embedded projection would output 
columns in sorted index order instead of the desired order, requiring a 
residual `ProjectionExec` just to reorder them.
   
   For example, in TPC-H Q2, two `ProjectionExec` nodes existed purely to 
reorder columns after joins — adding unnecessary operators to the plan.
   
   ## What changes are included in this PR?
   
   Changed `collect_column_indices` in `projection.rs` to preserve insertion 
order of column indices:
   - For simple `Column` expressions: the index is taken directly in projection 
order
   - For complex expressions referencing multiple columns: indices are sorted 
for determinism (since `collect_columns` returns a `HashSet`)
   - A `HashSet` is still used for deduplication, but only for the `.insert()` 
membership check — not for iteration
   
   ## How are these changes tested?
   
   - All existing sqllogictests pass (419 files) including TPC-H
   - Updated test expectations to reflect eliminated `ProjectionExec` nodes
   - Net reduction of 65 lines across test files (fewer plan operators)
   
   ## Are these changes safe?
   
   Yes. The embedded projection indices don't need to be sorted — 
`with_projection`, `ProjectionMapping::from_indices`, and `project_schema` all 
accept indices in any order. The only behavioral change is that the embedded 
projection now outputs columns in the order the parent projection requested, 
allowing `is_projection_removable` to detect the residual as an identity and 
remove it.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to