SubhamSinghal opened a new pull request, #21735:
URL: https://github.com/apache/datafusion/pull/21735
## Which issue does this PR close?
## Rationale for this change
In `CollectLeft` hash joins, `concat_batches` copies **all** columns from
the build side into a single `RecordBatch`, even when only a subset is needed
for the join output, filter evaluation, and key computation. For wide tables
(20+ columns), this wastes significant memory and CPU. The fraction of the copy
that is skipped is `(total_columns - needed_columns) / total_columns`.

For example:
- A 20-column table needing 3 columns skips 85% of the copy
- A 10-column table needing 8 columns skips 20% of the copy
- A 5-column table needing 5 columns skips 0% (short-circuit)

This PR projects build-side batches to only the needed columns before
`concat_batches`, reducing both peak memory and copy time.
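
The savings formula above can be written as a small helper. This is only an illustrative sketch: `copy_savings` is a hypothetical name that does not appear in the PR, and it counts columns rather than bytes, so it implicitly assumes columns of comparable width.

```rust
/// Fraction of build-side column copies skipped by projecting before
/// concatenation. Hypothetical helper for illustration; not part of the PR.
fn copy_savings(total_columns: usize, needed_columns: usize) -> f64 {
    // Short-circuit: nothing is skipped when every column is needed
    // (or when there are no columns at all).
    if total_columns == 0 || needed_columns >= total_columns {
        return 0.0;
    }
    (total_columns - needed_columns) as f64 / total_columns as f64
}
```

For instance, `copy_savings(20, 3)` corresponds to the 20-column case above, and `copy_savings(5, 5)` to the short-circuit case.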
## What changes are included in this PR?
**`datafusion/physical-plan/src/joins/hash_join/exec.rs`:**
- `compute_build_side_projection()` — determines which build-side columns
are actually needed (union of output columns, filter columns, and join key
expression columns)
- `remap_column_indices()` — translates original column indices to
projected positions
- `evaluate_and_concat_per_batch()` — evaluates join key expressions
per-batch before projection, then concatenates result arrays (only used when
projection is active)
- Modified `collect_left_input()` and `try_create_array_map()` to project
batches before `concat_batches` when a column subset suffices
- Added `build_column_remap` field to `JoinLeftData` to carry the remap
table downstream
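
The projection and remap steps listed above could look roughly like the following. This is a simplified sketch over plain index slices, not the PR's actual code: the function names, signatures, and the `Vec<Option<usize>>` remap representation are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Returns the sorted, deduplicated build-side column indices that are needed,
/// or `None` when every column is needed and projection can be skipped.
/// Hypothetical signature; the real function works on DataFusion plan types.
fn compute_projection(
    num_columns: usize,
    output_cols: &[usize],
    filter_cols: &[usize],
    key_cols: &[usize],
) -> Option<Vec<usize>> {
    // Union of output, filter, and join-key columns.
    let needed: BTreeSet<usize> = output_cols
        .iter()
        .chain(filter_cols)
        .chain(key_cols)
        .copied()
        .collect();
    if needed.len() >= num_columns {
        None // short-circuit: nothing to drop
    } else {
        Some(needed.into_iter().collect())
    }
}

/// Builds a table where `remap[original_index]` is the column's position in
/// the projected batch, or `None` if the column was dropped.
fn build_remap(num_columns: usize, projection: &[usize]) -> Vec<Option<usize>> {
    let mut remap = vec![None; num_columns];
    for (new_idx, &old_idx) in projection.iter().enumerate() {
        remap[old_idx] = Some(new_idx);
    }
    remap
}
```

With six build-side columns where the output needs columns 4 and 1, the filter needs column 1, and the key needs column 0, the projection is `[0, 1, 4]` and original column 4 maps to projected position 2.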
**`datafusion/physical-plan/src/joins/hash_join/stream.rs`:**
- In `collect_build_side()`, remaps `column_indices` and filter
`column_indices` when build-side projection is active
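
As a rough illustration of the remapping in `collect_build_side()`, the sketch below rewrites only build-side entries and leaves probe-side entries untouched. The `Side` and `ColIdx` types are local stand-ins defined here for the example, and `remap_indices` is a hypothetical name; none of them are taken from the PR.

```rust
/// Local stand-in for a join side; defined here only for the example.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Side {
    Build,
    Probe,
}

/// Local stand-in for a (column index, side) pair.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColIdx {
    index: usize,
    side: Side,
}

/// Rewrites build-side indices through `remap` (original index -> projected
/// position). Returns `None` if a build-side column was projected away,
/// which would indicate an inconsistent projection.
fn remap_indices(cols: &[ColIdx], remap: &[Option<usize>]) -> Option<Vec<ColIdx>> {
    cols.iter()
        .map(|c| match c.side {
            // Probe-side columns are not affected by build-side projection.
            Side::Probe => Some(*c),
            Side::Build => remap
                .get(c.index)
                .copied()
                .flatten()
                .map(|index| ColIdx { index, side: Side::Build }),
        })
        .collect()
}
```

Returning `None` for a missing column is a design choice of this sketch; it makes a mismatch between the projection and the referenced columns visible instead of silently reading the wrong column.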
## Are these changes tested?
Yes — covered by existing tests:
- 791 hash join unit tests pass
- 415 symmetric hash join tests pass (not affected, but verified)
- 14 join-related SLT files pass
- TPC-H SF1 benchmark: all 22 queries produce correct results with no
regressions (narrow TPC-H tables hit the short-circuit path)
No new tests were added, since this is an internal optimization that doesn't
change observable behavior. The existing test suite covers all join types,
partition modes, filter combinations, empty build sides, and outer join
unmatched-row handling.
## Are there any user-facing changes?
No.