teh-cmc commented on PR #6300:
URL: https://github.com/apache/arrow-rs/pull/6300#issuecomment-2323002688
> I am concerned that this change will decrease performance by forcing an extra copy in all situations, though I may not understand the implications

I wouldn't expect this to have a noticeable impact on performance, for two reasons:
1. `MutableBuffer` already reallocates memory and copies data around as it grows in any case (that's the root of this issue to begin with, after all) -- see the sketch right after this list.
2. I don't think the extra copy is a guarantee -- see my answer to the next question.
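Purely as an illustration of point 1 (using `Vec<u8>` as a stand-in for `MutableBuffer` -- this is not code from this PR), growth alone already triggers a whole series of reallocations, each of which may copy the existing contents:

```rust
fn main() {
    // Push one byte at a time and count how often the backing allocation moves.
    // Every move implies the existing contents were copied into the new allocation.
    let mut v: Vec<u8> = Vec::new();
    let mut moves = 0usize;
    let mut last_ptr = v.as_ptr();

    for i in 0..1_000_000u32 {
        v.push(i as u8);
        if v.as_ptr() != last_ptr {
            moves += 1;
            last_ptr = v.as_ptr();
        }
    }

    println!("{} bytes pushed, allocation moved {} times", v.len(), moves);
}
```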
> Does this force a copy in the case where the capacity is larger than the actual need?

As far as I understand from reading the code, all of this machinery ultimately bottoms out in `realloc()`.
Whether that actually reallocates memory and copies data around is therefore unspecified: it is left as an implementation detail of whichever `GlobalAlloc` implementation is currently in use.
I'd expect any decent allocator to do whatever is most efficient for the exact situation at hand.
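To make that concrete, here's a minimal standalone sketch (again, not arrow code) showing that a shrinking `realloc` may or may not move the data -- the global allocator decides:

```rust
use std::alloc::{alloc, dealloc, realloc, Layout};

fn main() {
    // Over-allocate 1 MiB, then ask the global allocator to shrink it to 1 KiB.
    // Whether this shrinks the block in place or allocates a new one and copies
    // the data over is entirely an implementation detail of the allocator.
    let big = Layout::array::<u8>(1 << 20).unwrap();
    let small = Layout::array::<u8>(1 << 10).unwrap();

    unsafe {
        let ptr = alloc(big);
        assert!(!ptr.is_null());

        let shrunk = realloc(ptr, big, small.size());
        assert!(!shrunk.is_null());
        println!("allocation moved during shrink: {}", shrunk != ptr);

        dealloc(shrunk, small);
    }
}
```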
> Have you considered making the initial capacity calculations more accurate (I am not sure this is possible) so the delta between capacity and actual is lower

That doesn't seem possible given the nature of `MutableBuffer` and how it is used across the different layers of the codebase.
> This PR currently seems to have some CI failures

I've updated the tests to reflect the new expected capacity values, so they should now pass.
Beware though that I am changing these values with absolutely zero knowledge of the context in which these tests were written: it would be fair to say that I have effectively no idea what I'm doing.
---
That's it for the theory... now for the practice.
I found two benchmark suites relevant to this change:
* `--bench concatenate_kernel`
* `--bench mutable_array`
### `--bench concatenate_kernel`
Performance looks roughly on par with `master` for this benchmark suite. In fact, it even looks like a slight overall improvement, for some reason.
```
taskset -c 7 cargo bench -p arrow --bench concatenate_kernel --features="test_utils"

concat i32 1024         time:   [453.90 ns 454.31 ns 454.75 ns]
                        change: [-3.7808% -3.5907% -3.4059%] (p = 0.00 < 0.05)
                        Performance has improved.

concat i32 nulls 1024   time:   [652.64 ns 652.88 ns 653.18 ns]
                        change: [+0.0226% +0.2489% +0.4726%] (p = 0.03 < 0.05)
                        Change within noise threshold.

concat 1024 arrays i32 4
                        time:   [109.22 µs 109.38 µs 109.57 µs]
                        change: [-7.2900% -6.8379% -6.4195%] (p = 0.00 < 0.05)
                        Performance has improved.

concat str 1024         time:   [9.3471 µs 9.3609 µs 9.3764 µs]
                        change: [-0.1653% +0.4566% +1.0820%] (p = 0.16 > 0.05)
                        No change in performance detected.

concat str nulls 1024   time:   [5.7129 µs 5.7328 µs 5.7480 µs]
                        change: [-4.0256% -3.3280% -2.6276%] (p = 0.00 < 0.05)
                        Performance has improved.

concat str_dict 1024    time:   [2.4110 µs 2.4143 µs 2.4177 µs]
                        change: [+2.5143% +2.6708% +2.8211%] (p = 0.00 < 0.05)
                        Performance has regressed.

concat str_dict_sparse 1024
                        time:   [4.7248 µs 4.7301 µs 4.7371 µs]
                        change: [+1.7168% +1.8403% +1.9509%] (p = 0.00 < 0.05)
                        Performance has regressed.

concat str nulls 1024 #2
                        time:   [5.7316 µs 5.7364 µs 5.7415 µs]
                        change: [-3.8916% -3.0466% -2.2002%] (p = 0.00 < 0.05)
                        Performance has improved.

concat fixed size lists time:   [136.77 µs 136.97 µs 137.18 µs]
                        change: [-0.5076% -0.1102% +0.2847%] (p = 0.63 > 0.05)
                        No change in performance detected.
```
### `--bench mutable_array`
This one is a different beast -- it is _much_ slower on this branch:
```
mutable str 1024        time:   [31.213 ms 31.302 ms 31.420 ms]
                        change: [+73.378% +74.185% +74.973%] (p = 0.00 < 0.05)
                        Performance has regressed.

mutable str nulls 1024  time:   [11.210 ms 11.273 ms 11.376 ms]
                        change: [-13.933% -13.406% -12.652%] (p = 0.00 < 0.05)
                        Performance has improved.
```
which means my first assumption above was wrong: under the right conditions you can definitely feel the extra copy, even though `MutableBuffer` already does a bunch of copying of its own.
That's with the default system allocator though, which is notoriously not the best choice for allocation-heavy workloads... Running these benchmarks with `mimalloc` tells a very different story:
```diff
diff --git a/arrow/benches/mutable_array.rs b/arrow/benches/mutable_array.rs
index b04e5cd84..d9b388e68 100644
--- a/arrow/benches/mutable_array.rs
+++ b/arrow/benches/mutable_array.rs
@@ -15,6 +15,11 @@
// specific language governing permissions and limitations
// under the License.
+use mimalloc::MiMalloc;
+
+#[global_allocator]
+static GLOBAL: MiMalloc = MiMalloc;
+
#[macro_use]
extern crate criterion;
use criterion::Criterion;
```
Results:
```
mutable str 1024        time:   [11.755 ms 11.790 ms 11.825 ms]
                        change: [+0.1332% +0.5256% +0.9202%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mutable str nulls 1024  time:   [5.8176 ms 5.8273 ms 5.8385 ms]
                        change: [-1.0864% -0.7649% -0.4585%] (p = 0.00 < 0.05)
                        Change within noise threshold.
```
which means my second assumption was correct: it is left to the allocator
implementation to do the right thing.