korowa opened a new pull request, #8020:
URL: https://github.com/apache/arrow-datafusion/pull/8020
## Which issue does this PR close?
Closes #7848.
## Rationale for this change
Letting `HashJoinStream` produce intermediate output, instead of accumulating
the "inner" batch until the complete probe-side batch has been joined, helps
avoid unpredictable memory consumption (e.g. in the case of implicit cross
joins) and makes it possible to process such queries.
## What changes are included in this PR?
- `HashJoinStream` now manages the current probe batch and fetches the next
batch from the probe-side input only when required (once the previous one has
been fully processed). It also stores the last pair of matched indices, so the
join result for a single probe batch can be produced over multiple iterations.
- `build_equal_condition_join_indices` can skip previously joined pairs. It
stops matching indices once the potential output batch size (measured before
join filters are applied) reaches `configuration.batch_size`, and returns the
intermediate result along with the last pair of matched indices.
- `adjust_indices_by_join_type` and the functions it subsequently calls now
accept a `Range` argument in order to perform partial adjustment.
- For now, the behaviour of `SymmetricHashJoin` remains untouched: it still
waits for the whole probe batch to be processed before producing the result.
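
The resumable matching described in the list above can be illustrated with a
minimal, standalone sketch. It is not the code from this PR: the names
`Cursor`, `build_map`, `next_chunk` and `drain_all` are hypothetical, the
build side is assumed to be a plain `HashMap` keyed by `i64` values rather
than DataFusion's actual hash table and Arrow arrays, and join filters and
join-type adjustment are ignored. It only shows how a "last matched pair"
cursor lets one probe batch be joined in several bounded steps.

```rust
use std::collections::HashMap;

/// Resume position: (probe row, offset into that row's build-side matches).
/// Hypothetical stand-in for the "last pair of matched indices".
type Cursor = (usize, usize);

/// Map each build-side key to the build rows that carry it.
fn build_map(build_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut map: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in build_keys.iter().enumerate() {
        map.entry(*key).or_default().push(row);
    }
    map
}

/// Emit at most `batch_size` (build row, probe row) pairs, starting at `start`.
/// Returns the pairs plus the cursor to resume from, or `None` once the
/// probe batch is exhausted.
fn next_chunk(
    map: &HashMap<i64, Vec<usize>>,
    probe_keys: &[i64],
    batch_size: usize,
    start: Cursor,
) -> (Vec<(usize, usize)>, Option<Cursor>) {
    let mut pairs = Vec::with_capacity(batch_size);
    let (mut probe_row, mut offset) = start;
    while probe_row < probe_keys.len() {
        let matches: &[usize] = match map.get(&probe_keys[probe_row]) {
            Some(rows) => rows.as_slice(),
            None => &[],
        };
        while offset < matches.len() {
            if pairs.len() == batch_size {
                // Output is full: remember where to continue next time.
                return (pairs, Some((probe_row, offset)));
            }
            pairs.push((matches[offset], probe_row));
            offset += 1;
        }
        probe_row += 1;
        offset = 0;
    }
    (pairs, None)
}

/// Join one probe batch in as many bounded chunks as needed.
fn drain_all(
    build_keys: &[i64],
    probe_keys: &[i64],
    batch_size: usize,
) -> Vec<Vec<(usize, usize)>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let map = build_map(build_keys);
    let mut chunks = Vec::new();
    let mut cursor = Some((0, 0));
    while let Some(start) = cursor {
        let (pairs, next) = next_chunk(&map, probe_keys, batch_size, start);
        if !pairs.is_empty() {
            chunks.push(pairs);
        }
        cursor = next;
    }
    chunks
}
```

With three build rows and two probe rows that all share one key (a
cross-join-like case), `drain_all(&[1, 1, 1], &[1, 1], 4)` yields two chunks
of 4 and 2 pairs rather than a single chunk of 6, so no individual output
exceeds the configured size.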
## Are these changes tested?
## Are there any user-facing changes?
`HashJoinExec` default output now maintains only probe-side ordering
(instead of preserving sort order for both build and probe sides). This
complies with the behaviour of `calculate_join_output_ordering`, but it is
still a change.