kosiew opened a new pull request, #21068:
URL: https://github.com/apache/datafusion/pull/21068
## Which issue does this PR close?
* Closes #20492.
## Rationale for this change
`HashJoinExec` currently continues polling and consuming the probe side even
after the build side has completed with zero rows.
For join types whose output is guaranteed to be empty when the build side is
empty, this work is unnecessary. In practice, it can trigger large avoidable
scans and extra compute despite producing no output. This is especially costly
for cases such as INNER, LEFT, LEFT SEMI, LEFT ANTI, LEFT MARK, and RIGHT SEMI
joins.
This change makes the stream state machine aware of that condition so
execution can terminate as soon as the build side is known to be empty and no
probe rows are needed to determine the final result.
The change also preserves the existing behavior for join types that still
require probe-side rows even when the build side is empty, such as RIGHT, FULL,
RIGHT ANTI, and RIGHT MARK joins.
## What changes are included in this PR?
This PR introduces a small refactor and an execution-path optimization for
hash join empty-build handling:
* adds `empty_build_side_produces_empty_result(join_type)` in
`joins/utils.rs` as a shared source of truth for join types that can
short-circuit when the build side is empty
* updates `HashJoinStream` to use a new `next_state_after_build_ready(...)`
helper so that, once build-side collection and any required coordination
finish, the stream can transition directly to `Completed` instead of always
entering `FetchProbeBatch`
* applies the same shared helper inside `build_batch_empty_build_side(...)`
to keep output semantics aligned with the stream short-circuit logic
* factors dynamic-filter join setup in tests into a reusable helper,
reducing duplicated test code
* adds regression tests covering:
* join types that should not consume the probe side when the build side is
empty
* join types that must still consume the probe side because probe rows are
needed for correct output
* dynamic-filter completion behavior, including the path that waits for
partition-bounds reporting before deciding whether probe polling is necessary
## Are these changes tested?
Yes.
This PR adds targeted async tests covering both the optimized and
non-optimized cases:
* `join_does_not_consume_probe_when_empty_build_fixes_output`
* `join_still_consumes_probe_when_empty_build_needs_probe_rows`
* `test_hash_join_skips_probe_on_empty_build_after_partition_bounds_report`
The tests also verify that errors from the probe side are not observed when
the join can correctly short-circuit, and that they are still surfaced for join
types that must continue consuming probe input.
Existing dynamic-filter tests were also cleaned up to use a shared helper
while preserving the completion checks.
## Are there any user-facing changes?
Yes, in execution behavior and performance.
For affected hash join types, queries can now stop earlier when the build
side is empty, avoiding unnecessary probe-side scans and reducing wasted I/O
and compute. There are no intended API changes.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]