alamb commented on code in PR #18161:
URL: https://github.com/apache/datafusion/pull/18161#discussion_r2445113736
##########
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##########
@@ -1661,11 +1665,30 @@ fn build_row_join_batch(
// Broadcast the single build-side row to match the filtered
// probe-side batch length
let original_left_array =
build_side_batch.column(column_index.index);
- let scalar_value = ScalarValue::try_from_array(
- original_left_array.as_ref(),
- build_side_index,
- )?;
- scalar_value.to_array_of_size(filtered_probe_batch.num_rows())?
+ // Avoid using `ScalarValue::to_array_of_size()` for
`List(Utf8View)` to avoid
+ // deep copies for buffers inside `Utf8View` array. See below for
details.
+ // https://github.com/apache/datafusion/issues/18159
+ //
+ // In other cases, `to_array_of_size()` is faster.
+ match original_left_array.data_type() {
+ DataType::List(field) | DataType::LargeList(field)
+ if field.data_type() == &DataType::Utf8View =>
+ {
+ let indices_iter = std::iter::repeat_n(
+ build_side_index as u64,
+ filtered_probe_batch.num_rows(),
+ );
+ let indices_array =
UInt64Array::from_iter_values(indices_iter);
+ take(original_left_array.as_ref(), &indices_array, None)?
+ }
Review Comment:
I think this approach could be ported into `ScalarValue::to_array_of_size`
itself rather than special cased here -- which would improve performance in
potentially other places
https://github.com/apache/datafusion/blob/556eb9b1efa7d57b95538e2178475b2a216fa497/datafusion/common/src/scalar/mod.rs#L3238-L3245
That being said, I think this is a nice point fix that we can safely
backport to the datafusion 50 branch, so I think we should merge this PR /
backport it and I will file a follow on PR to further improve the code
##########
datafusion/physical-plan/src/joins/nested_loop_join.rs:
##########
@@ -1661,11 +1665,30 @@ fn build_row_join_batch(
// Broadcast the single build-side row to match the filtered
// probe-side batch length
let original_left_array =
build_side_batch.column(column_index.index);
- let scalar_value = ScalarValue::try_from_array(
- original_left_array.as_ref(),
- build_side_index,
- )?;
- scalar_value.to_array_of_size(filtered_probe_batch.num_rows())?
+ // Avoid using `ScalarValue::to_array_of_size()` for
`List(Utf8View)` to avoid
+ // deep copies for buffers inside `Utf8View` array. See below for
details.
+ // https://github.com/apache/datafusion/issues/18159
Review Comment:
The root cause is tracked in
- https://github.com/apache/arrow-rs/issues/6408
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]