SubhamSinghal opened a new pull request, #21735:
URL: https://github.com/apache/datafusion/pull/21735

   ## Which issue does this PR close?
   
     ## Rationale for this change
   In `CollectLeft` hash joins, `concat_batches` copies **all** columns from 
the build side into a single `RecordBatch`, even when only a subset is needed 
for the join output, filter evaluation, and key computation. For wide tables 
(20+ columns), this wastes significant memory and CPU. The fraction of the 
copy that can be skipped is 
`(total_columns - needed_columns) / total_columns`.

     For example:

     - A 20-column table needing 3 columns skips 85% of the copy
     - A 10-column table needing 8 columns skips 20% of the copy
     - A 5-column table needing 5 columns skips 0% (short-circuit)

     This PR projects build-side batches to only the needed columns before 
`concat_batches`, reducing both peak memory and copy time.
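
     The savings estimate above can be checked with a small helper. This is 
an illustrative sketch only: `copy_savings` is a hypothetical name that does 
not appear in the PR, and it assumes every column costs the same to copy, 
which real tables with mixed column widths will not satisfy exactly.

```rust
/// Fraction of the build-side copy that is skipped when only
/// `needed_columns` out of `total_columns` are concatenated.
/// Hypothetical helper; assumes all columns are equally expensive to copy.
fn copy_savings(total_columns: usize, needed_columns: usize) -> f64 {
    // Short-circuit: nothing can be skipped if every column is needed
    // (or if the table has no columns at all).
    if total_columns == 0 || needed_columns >= total_columns {
        return 0.0;
    }
    (total_columns - needed_columns) as f64 / total_columns as f64
}
```

     Calling `copy_savings(20, 3)` gives the 85% figure, and 
`copy_savings(5, 5)` takes the short-circuit branch.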
   
     ## What changes are included in this PR?
   
     **`datafusion/physical-plan/src/joins/hash_join/exec.rs`:**
     - `compute_build_side_projection()` — determines which build-side columns 
are actually needed (union of output columns, filter columns, and join key 
expression columns)
     - `remap_column_indices()` — translates original column indices to 
projected positions
     - `evaluate_and_concat_per_batch()` — evaluates join key expressions 
per-batch before projection, then concatenates result arrays (only used when 
projection is active)
     - Modified `collect_left_input()` and `try_create_array_map()` to project 
batches before `concat_batches` when a column subset suffices
     - Added `build_column_remap` field to `JoinLeftData` to carry the remap 
table downstream
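
     The idea behind the projection and remap steps listed above can be 
sketched with plain index vectors. This is not the PR's code: the function 
names `needed_build_columns` and `build_remap`, their signatures, and the use 
of `Option` to signal the short-circuit are all assumptions made for 
illustration, and the sketch assumes every index is smaller than 
`total_columns`.

```rust
use std::collections::BTreeSet;

/// Union of the build-side columns referenced by the output, the join
/// filter, and the join key expressions, in ascending order.
/// Returns `None` when every column is needed (short-circuit, no projection).
fn needed_build_columns(
    total_columns: usize,
    output_cols: &[usize],
    filter_cols: &[usize],
    key_cols: &[usize],
) -> Option<Vec<usize>> {
    let needed: BTreeSet<usize> = output_cols
        .iter()
        .chain(filter_cols)
        .chain(key_cols)
        .copied()
        .collect();
    if needed.len() >= total_columns {
        None
    } else {
        Some(needed.into_iter().collect())
    }
}

/// Remap table: entry `i` holds the position of original column `i` in the
/// projected batch, or `None` if that column was dropped.
fn build_remap(total_columns: usize, projection: &[usize]) -> Vec<Option<usize>> {
    let mut remap = vec![None; total_columns];
    for (new_idx, &old_idx) in projection.iter().enumerate() {
        remap[old_idx] = Some(new_idx);
    }
    remap
}
```

     For a 5-column build side where the output uses columns 0 and 3, the 
filter uses column 3, and the keys use column 1, the projection is 
`[0, 1, 3]` and original column 3 lands at projected position 2.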
   
     **`datafusion/physical-plan/src/joins/hash_join/stream.rs`:**
     - In `collect_build_side()`, remaps `column_indices` and filter 
`column_indices` when build-side projection is active
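
     A rough sketch of what remapping `column_indices` involves is shown 
below. The `Side` and `ColIndex` types are simplified stand-ins defined here 
for the example rather than DataFusion's own types, and 
`remap_build_indices` is a hypothetical name. Only build-side entries are 
rewritten; probe-side entries pass through unchanged.

```rust
/// Simplified stand-in for a join side (illustrative, not DataFusion's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Side {
    Build,
    Probe,
}

/// Simplified stand-in for an output column reference.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColIndex {
    index: usize,
    side: Side,
}

/// Rewrites build-side indices through `remap` (original index to projected
/// position). Returns `None` if a referenced build column was projected
/// away, which would mean the projection missed a needed column.
fn remap_build_indices(
    column_indices: &[ColIndex],
    remap: &[Option<usize>],
) -> Option<Vec<ColIndex>> {
    column_indices
        .iter()
        .map(|c| match c.side {
            Side::Probe => Some(*c),
            Side::Build => remap
                .get(c.index)
                .copied()
                .flatten()
                .map(|index| ColIndex { index, side: Side::Build }),
        })
        .collect()
}
```

     Returning `None` for a dropped column is a design choice of this 
sketch: it surfaces a projection that omitted a needed column instead of 
silently reading the wrong array.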
   
     ## Are these changes tested?
   
     Yes — covered by existing tests:
     - 791 hash join unit tests pass
     - 415 symmetric hash join tests pass (not affected, but verified)
     - 14 join-related SLT files pass
     - TPC-H SF1 benchmark: all 22 queries produce correct results with no 
regressions (narrow TPC-H tables hit the short-circuit path)
   
     No new tests were added because this is an internal optimization that 
does not change observable behavior. The existing test suite covers all join 
types, partition modes, filter combinations, empty build sides, and outer join 
unmatched-row handling.
   
     ## Are there any user-facing changes?
   
     No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

