[
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256945#comment-17256945
]
Daniël Heres commented on ARROW-11030:
--------------------------------------
I think it would be nice to further experiment with concatenating into a single
batch for the join.
Thoughts:
* copying is very fast per element (compared to hashing, random access
indexing, etc) so it seems a small penalty to pay upfront
* indexing can use arrows take, which is relatively fast as it can avoid the
overhead of the extra indirection, the less efficient MutableDataArray
structure and the current remaining exponential behavior. take also has the
potential for further optimization I think (e.g. SIMD), I think that is
hard/impossible to do for something like MutableArrayData.
* Makes code for left / right side more similar
* Allows other simplifications / speed ups / avoiding extra copies utilizing
the simplified structure and arrows kernels more.
I added a PR for extending take to support u64 here
https://github.com/apache/arrow/pull/9057
> [Rust] [DataFusion] HashJoinExec slow with many batches
> -------------------------------------------------------
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableDataArray::new() when
> passed a high number of batches. This happens when passing in all of the
> batches from the build side of the join and this happens once per build-side
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {}
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> {code}
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)