alamb commented on code in PR #8020:
URL: https://github.com/apache/arrow-datafusion/pull/8020#discussion_r1453363639
##########
datafusion/physical-plan/src/joins/hash_join.rs:
##########
@@ -1039,76 +1058,32 @@ pub fn build_equal_condition_join_indices<T: JoinHashMapType>(
.into_array(build_input_buffer.num_rows())
})
.collect::<Result<Vec<_>>>()?;
- hashes_buffer.clear();
- hashes_buffer.resize(probe_batch.num_rows(), 0);
- let hash_values = create_hashes(&keys_values, random_state, hashes_buffer)?;
- // In case the build-side input has not been inverted during JoinHashMap creation, the chained list algorithm
- // will return build indices for each probe row in a reverse order as such:
- // Build Indices: [5, 4, 3]
- // Probe Indices: [1, 1, 1]
- //
- // This affects the output sequence. Hypothetically, it's possible to preserve the lexicographic order on the build side.
- // Let's consider probe rows [0,1] as an example:
- //
- // When the probe iteration sequence is reversed, the following pairings can be derived:
- //
- // For probe row 1:
- // (5, 1)
- // (4, 1)
- // (3, 1)
- //
- // For probe row 0:
- // (5, 0)
- // (4, 0)
- // (3, 0)
- //
- // After reversing both sets of indices, we obtain reversed indices:
- //
- // (3,0)
- // (4,0)
- // (5,0)
- // (3,1)
- // (4,1)
- // (5,1)
- //
- // With this approach, the lexicographic order on both the probe side and the build side is preserved.
- let (mut probe_indices, mut build_indices) = if fifo_hashmap {
- build_hashmap.get_matched_indices(hash_values.iter().enumerate(), deleted_offset)
- } else {
- let (mut matched_probe, mut matched_build) = build_hashmap
- .get_matched_indices(hash_values.iter().enumerate().rev(), deleted_offset);
-
- matched_probe.as_slice_mut().reverse();
- matched_build.as_slice_mut().reverse();
+ let mut hashes_buffer = vec![0; probe_batch.num_rows()];
Review Comment:
I spent some time profiling, and it looks like the additional time is spent
in `get_matched_indices_with_limit_offset` (likely unsurprising). I wonder if
there are some obvious wins there (maybe use/reuse `Vec`s rather than a buffer
builder, to avoid so many reallocations?)
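To illustrate the suggestion, here is a minimal standalone sketch of the "reuse a scratch `Vec`" pattern (the `ProbeState` struct is hypothetical, not DataFusion's actual type): `clear()` keeps the existing allocation and `resize()` only reallocates when a batch is larger than any batch seen so far, so steady-state processing does no allocation at all.

```rust
// Hypothetical sketch of reusing one scratch Vec across probe batches,
// instead of building a fresh buffer per batch.
struct ProbeState {
    hashes_buffer: Vec<u64>, // scratch space reused across batches
}

impl ProbeState {
    fn new() -> Self {
        Self { hashes_buffer: Vec::new() }
    }

    // Prepare the scratch buffer for a batch of `num_rows` probe rows.
    // `clear()` retains capacity; `resize()` reallocates only when this
    // batch is larger than any previously seen batch.
    fn prepare(&mut self, num_rows: usize) -> &mut [u64] {
        self.hashes_buffer.clear();
        self.hashes_buffer.resize(num_rows, 0);
        &mut self.hashes_buffer
    }
}

fn main() {
    let mut state = ProbeState::new();
    state.prepare(1024);
    let cap_after_first = state.hashes_buffer.capacity();

    // A smaller follow-up batch reuses the existing allocation.
    state.prepare(512);
    assert!(state.hashes_buffer.capacity() >= cap_after_first);
    assert_eq!(state.hashes_buffer.len(), 512);
}
```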
<details><summary>Methodology</summary>
<p>
Comparing be361fdd8079a2f44da70f6af6e9d8eb3f7d0020 (the merge-base) to b631c1e.

Build like this:
```
git checkout `git merge-base HEAD apache/main` # be361fdd8079a2f44da70f6af6e9d8eb3f7d0020
cargo build --profile release-nonlto --bin dfbench
cp target/release-nonlto/dfbench ~/Downloads/dfbench-be361fd
```
```
gh pr checkout https://github.com/apache/arrow-datafusion/pull/8020
cargo build --profile release-nonlto --bin dfbench
cp target/release-nonlto/dfbench ~/Downloads/dfbench-hash_join_batch_size
```
Link the data:
```shell
~/Downloads$ ln -s /Users/andrewlamb/Software/arrow-datafusion/benchmarks
```
Run query 18:
```shell
./dfbench-hash_join_batch_size tpch --iterations 5 \
  --path /Users/andrewlamb/Software/arrow-datafusion/benchmarks/data/tpch_sf1 \
  -m --format parquet --query 18
```
</p>
</details>
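As an aside, the ordering argument in the removed comment (iterate probe rows in reverse, then reverse both index buffers) can be sanity-checked with a small standalone sketch. This is not DataFusion's `JoinHashMap`; the chained-list traversal is simulated by per-row build matches given newest-first, as in the `[5, 4, 3]` example above.

```rust
// Sketch of the removed comment's ordering argument: a LIFO chain yields
// build indices newest-first per probe row, so iterating probe rows in
// reverse and then reversing both index vectors restores lexicographic
// order on both sides.
fn matched_indices_rev(
    matches: &[(u32, Vec<u32>)], // (probe_idx, build matches in chain order)
) -> (Vec<u32>, Vec<u32>) {
    let mut probe = Vec::new();
    let mut build = Vec::new();
    // Iterate probe rows in reverse, as the non-FIFO code path does.
    for (p, builds) in matches.iter().rev() {
        for &b in builds {
            probe.push(*p);
            build.push(b);
        }
    }
    // Reverse both buffers to restore ascending order on both sides.
    probe.reverse();
    build.reverse();
    (probe, build)
}

fn main() {
    // Chain traversal returns build indices newest-first: [5, 4, 3].
    let matches = vec![(0u32, vec![5u32, 4, 3]), (1, vec![5, 4, 3])];
    let (probe, build) = matched_indices_rev(&matches);
    // Matches the pairs in the removed comment:
    // (3,0) (4,0) (5,0) (3,1) (4,1) (5,1)
    assert_eq!(probe, vec![0, 0, 0, 1, 1, 1]);
    assert_eq!(build, vec![3, 4, 5, 3, 4, 5]);
}
```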
## This PR
<img width="1726" alt="Screenshot 2024-01-16 at 7 21 58 AM"
src="https://github.com/apache/arrow-datafusion/assets/490673/ee1eafe6-742e-4cbf-8854-a00b4a7a58a1">
## `main`
<img width="1727" alt="Screenshot 2024-01-16 at 7 25 16 AM"
src="https://github.com/apache/arrow-datafusion/assets/490673/121bff76-c0af-4a2d-a7d3-dbf2b9e2efb8">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]