alamb commented on PR #7650:
URL: https://github.com/apache/arrow-rs/pull/7650#issuecomment-2984471796
Ok, I figured out what is going on and why the `mixed_utf8view` test is
slowing down. The issue is that the new utf8view code is triggering garbage
collection (string copying) when the old one did not. I added some `println!`
calls, and on main they show
```
ideal_buffer_size: 370022, actual_buffer_size: 614032
```
This is just clear of the cutoff load factor (0.5) that would force a copy
of the strings into new buffers: 370022 / 614032 is about 0.60.
However, on this branch, because the GC happens *after* the input is sliced,
the overall load factor is smaller, which triggers the GC in some cases
```
ideal_buffer_size: 246034, actual_buffer_size: 614032
ideal_buffer_size: 123988, actual_buffer_size: 614032
ideal_buffer_size: 155553, actual_buffer_size: 614032
Need GC
```
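To make the arithmetic concrete, here is a minimal sketch that applies the same `actual > ideal * factor` comparison to the sizes printed above, each pair in isolation. The helper names `needs_gc` and `load_factor` are hypothetical (they are not in arrow-rs), and this does not model how the real code accumulates slices, so it will not necessarily line up one-to-one with the `Need GC` output.

```rust
/// Hypothetical helper: same comparison as the GC heuristic, with the
/// multiplier pulled out as a parameter (2 corresponds to a 0.5 load factor).
fn needs_gc(ideal_buffer_size: usize, actual_buffer_size: usize, factor: usize) -> bool {
    actual_buffer_size > ideal_buffer_size * factor
}

/// Hypothetical helper: fraction of the underlying buffers that the views
/// actually reference.
fn load_factor(ideal_buffer_size: usize, actual_buffer_size: usize) -> f64 {
    ideal_buffer_size as f64 / actual_buffer_size as f64
}

fn main() {
    let actual = 614_032;
    // First value is from main, the other three are from this branch.
    for ideal in [370_022, 246_034, 123_988, 155_553] {
        println!(
            "ideal: {ideal}, load factor: {:.2}, needs gc: {}",
            load_factor(ideal, actual),
            needs_gc(ideal, actual, 2)
        );
    }
}
```

Under these assumptions only the unsliced input from main stays above the 0.5 load factor; each of the sliced sizes falls below it when compared against the full 614032 byte buffer.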
If I hard code the gc heuristic to be different
```diff
index 0be8702c1b..5e4695dd7e 100644
--- a/arrow-select/src/coalesce/byte_view.rs
+++ b/arrow-select/src/coalesce/byte_view.rs
@@ -290,7 +290,7 @@ impl<B: ByteViewType> InProgressArray for InProgressByteViewArray<B> {
             // Copying the strings into a buffer can be time-consuming so
             // only do it if the array is sparse
-            if actual_buffer_size > (ideal_buffer_size * 2) {
+            if actual_buffer_size > (ideal_buffer_size * 100) {
                 self.append_views_and_copy_strings(s.views(), ideal_buffer_size, buffers);
             } else {
                 self.append_views_and_update_buffer_index(s.views(), buffers);
```
The performance for this benchmark is then the same as on main.
I am thinking about how best to fix this.