UBarney commented on code in PR #16443:
URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173634499
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices(
probe_indices: UInt32Array,
filter: &JoinFilter,
build_side: JoinSide,
+ max_intermediate_size: Option<usize>,
) -> Result<(UInt64Array, UInt32Array)> {
if build_indices.is_empty() && probe_indices.is_empty() {
return Ok((build_indices, probe_indices));
};
- let intermediate_batch = build_batch_from_indices(
- filter.schema(),
- build_input_buffer,
- probe_batch,
- &build_indices,
- &probe_indices,
- filter.column_indices(),
- build_side,
- )?;
- let filter_result = filter
- .expression()
- .evaluate(&intermediate_batch)?
- .into_array(intermediate_batch.num_rows())?;
+ let filter_result = if let Some(max_size) = max_intermediate_size {
Review Comment:
> Can we enforce it before filtering, while calculating build/probe_indices
> args for this function (in NestedLoopJoinExec::build_join_indices)?

I'll do that in a follow-up PR:
https://github.com/apache/datafusion/issues/16364#issuecomment-2975520489

> Why batch_size enforcement should take place during filtering
1. Although the "Process the Cartesian Product Incrementally" step is
designed to limit the input size for `apply_join_filter_to_indices`, a
single batch can still be very large (up to `left_table.num_rows() * N`).
When the left table itself is large, this can lead to the creation of a
very large intermediate `record_batch`.
2. Benchmarks indicate that joins execute faster with this enforcement
in place:
https://github.com/apache/datafusion/pull/16443#issuecomment-2993893069
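
To make point 1 concrete, here is a minimal sketch of capping the intermediate size during filtering. It is not the code in this PR: it uses plain slices instead of Arrow arrays, and `filter_indices_chunked` and its `predicate` closure are hypothetical stand-ins for building the intermediate batch and evaluating the join filter. It assumes a `max_intermediate_size` of `None` means "evaluate everything in one pass".

```rust
/// Hypothetical sketch: evaluate a join filter over (build, probe) index
/// pairs in chunks, so that no single intermediate "batch" holds more than
/// `max_intermediate_size` rows.
fn filter_indices_chunked<F>(
    build_indices: &[u64],
    probe_indices: &[u32],
    max_intermediate_size: Option<usize>,
    mut predicate: F,
) -> (Vec<u64>, Vec<u32>)
where
    // Stand-in for "materialize the chunk and evaluate the filter":
    // returns one boolean per index pair in the chunk.
    F: FnMut(&[u64], &[u32]) -> Vec<bool>,
{
    assert_eq!(build_indices.len(), probe_indices.len());

    // Without a limit, use a single chunk covering all pairs. Clamp to at
    // least 1 because `chunks(0)` panics.
    let chunk_size = max_intermediate_size
        .unwrap_or(build_indices.len())
        .max(1);

    let mut kept_build = Vec::new();
    let mut kept_probe = Vec::new();

    for (build_chunk, probe_chunk) in build_indices
        .chunks(chunk_size)
        .zip(probe_indices.chunks(chunk_size))
    {
        // Only `chunk_size` pairs are alive in the intermediate step.
        let mask = predicate(build_chunk, probe_chunk);
        assert_eq!(mask.len(), build_chunk.len());

        for ((&b, &p), keep) in build_chunk.iter().zip(probe_chunk).zip(mask) {
            if keep {
                kept_build.push(b);
                kept_probe.push(p);
            }
        }
    }

    (kept_build, kept_probe)
}
```

The surviving pairs are the same with or without a limit; only the peak size of the intermediate data passed to `predicate` changes.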
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]