cetra3 opened a new pull request, #9393:
URL: https://github.com/apache/arrow-rs/pull/9393

   # Which issue does this PR close?
   
   None at the moment
   
   # Rationale for this change
   
   There are a number of places within the code that go from `MutableBuffer` to 
`Vec` and then to `Buffer`.  This causes extra allocations and is ripe for a 
performance refactor.
   
   
   # What changes are included in this PR?
   
   This PR adjusts `MutableBuffer` to be closer to `Vec` in representation, and 
also adjusts some of the functions/kernels to use `MutableBuffer` directly 
instead of going through `Vec`.
   
   1. **MutableBuffer**: Replace `layout: Layout` (16 bytes) with `capacity: 
usize` + `align: usize` (16 bytes), keeping struct at 32 bytes. `reserve()` now 
compares against a cached `capacity` field instead of calling `Layout::size()`. 
Layout is reconstructed via `from_size_align_unchecked` only on cold 
alloc/dealloc/realloc paths.
   
   2. **BufferBuilder**: Cache `buffer.capacity()` in a local before extend 
loops. Replace `std::mem::size_of::<T>()` with `const` locals. Add 
`#[inline(always)]` to `append` and `advance`.
   
   3. **Kernel Vec-to-Buffer replacements**: In `sort.rs`, `concat.rs`, and 
`interleave.rs`, replace `Vec<T>` + `Buffer::from(vec)` with direct 
`MutableBuffer`/`BufferBuilder` usage, eliminating memcpy on conversion.
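
The representation change in item 1 can be sketched with the standard library alone. This is a minimal illustration, not the arrow-rs implementation: `SketchBuffer`, `grow`, and the doubling growth policy are hypothetical, and the sketch always allocates at least one byte to sidestep zero-sized allocations. It shows the hot `reserve()` check comparing against a cached `capacity` field, with `Layout` rebuilt via `from_size_align_unchecked` only on the cold grow and drop paths.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, realloc, Layout};
use std::ptr::NonNull;

/// Hypothetical stand-in for the buffer described above (not the arrow-rs type).
#[repr(C)]
pub struct SketchBuffer {
    data: NonNull<u8>,
    len: usize,
    capacity: usize, // cached allocation size in bytes
    align: usize,    // power-of-two alignment of the allocation
}

impl SketchBuffer {
    pub fn with_capacity(capacity: usize, align: usize) -> Self {
        // Assumption of this sketch: never create a zero-sized allocation.
        let capacity = capacity.max(1);
        // Validate once with the checked constructor.
        let layout = Layout::from_size_align(capacity, align).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc(layout) };
        let data = NonNull::new(ptr).unwrap_or_else(|| handle_alloc_error(layout));
        Self { data, len: 0, capacity, align }
    }

    /// Rebuilds the `Layout` from the cached fields; used only on cold paths.
    #[inline]
    fn layout(&self) -> Layout {
        // SAFETY: `capacity` and `align` were validated by `Layout::from_size_align`
        // when the allocation was created or last grown.
        unsafe { Layout::from_size_align_unchecked(self.capacity, self.align) }
    }

    /// Hot path: a plain integer comparison against the cached capacity.
    #[inline]
    pub fn reserve(&mut self, additional: usize) {
        let required = self.len.checked_add(additional).expect("capacity overflow");
        if required > self.capacity {
            self.grow(required);
        }
    }

    /// Cold path: the only place (besides `Drop`) that needs a `Layout`.
    #[cold]
    fn grow(&mut self, required: usize) {
        let new_capacity = required.max(self.capacity.saturating_mul(2));
        // Validate the new size before touching the allocation.
        let new_layout =
            Layout::from_size_align(new_capacity, self.align).expect("capacity overflow");
        let old_layout = self.layout();
        // SAFETY: `data` was allocated with `old_layout`; `new_capacity` is non-zero
        // and was validated above.
        let ptr = unsafe { realloc(self.data.as_ptr(), old_layout, new_capacity) };
        self.data = NonNull::new(ptr).unwrap_or_else(|| handle_alloc_error(new_layout));
        self.capacity = new_capacity;
    }

    pub fn extend_from_slice(&mut self, bytes: &[u8]) {
        self.reserve(bytes.len());
        // SAFETY: `reserve` guarantees room for `bytes.len()` more bytes.
        unsafe {
            std::ptr::copy_nonoverlapping(
                bytes.as_ptr(),
                self.data.as_ptr().add(self.len),
                bytes.len(),
            );
        }
        self.len += bytes.len();
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: the first `len` bytes have been initialized by `extend_from_slice`.
        unsafe { std::slice::from_raw_parts(self.data.as_ptr(), self.len) }
    }

    pub fn capacity(&self) -> usize {
        self.capacity
    }
}

impl Drop for SketchBuffer {
    fn drop(&mut self) {
        // SAFETY: `data` was allocated with exactly this layout.
        unsafe { dealloc(self.data.as_ptr(), self.layout()) }
    }
}
```

Because `capacity` and `align` are validated with the checked `Layout::from_size_align` whenever they change, the unchecked rebuild in `layout()` stays sound in this sketch.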
   
   # Are these changes tested?
   
   - `cargo test -p arrow-buffer` -- 244 tests passed
   - `cargo test -p arrow-select -p arrow-ord -p arrow-array` -- all tests 
passed
   - `cargo +nightly miri test -p arrow-buffer -- buffer::mutable` -- 25 tests 
+ 13 doc-tests passed, no UB detected
   - Struct size: `MutableBuffer` is 32 bytes (data:8 + len:8 + capacity:8 + 
align:8), identical to original
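
The struct-size claim can be checked with `std::mem::size_of`. The sketch below uses two hypothetical `#[repr(C)]` stand-ins (`WithLayout` and `WithFields`) rather than the real `MutableBuffer`, and assumes a 64-bit target for the 32-byte figure. The size of `Layout` itself is an implementation detail of the standard library, so the comparison is an observation rather than a guarantee.

```rust
use std::alloc::Layout;
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical model of the old representation: pointer + length + `Layout`.
#[repr(C)]
pub struct WithLayout {
    data: NonNull<u8>,
    len: usize,
    layout: Layout,
}

/// Hypothetical model of the new representation: pointer + length + two `usize`s.
#[repr(C)]
pub struct WithFields {
    data: NonNull<u8>,
    len: usize,
    capacity: usize,
    align: usize,
}

/// Returns (size of the old model, size of the new model) in bytes.
pub fn struct_sizes() -> (usize, usize) {
    (size_of::<WithLayout>(), size_of::<WithFields>())
}
```

On a 64-bit target the second value is 4 × 8 = 32 bytes, matching the breakdown given above.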
   
   # Are there any user-facing changes?
   
   Nope
   
   # Benchmark changes
   
   This was compared against `main`, after applying 
https://github.com/apache/arrow-rs/pull/9392
   
   ## Summary
   
   | Category | Count |
   |---|---|
   | Improved (>1% faster) | **117** |
   | Within noise | **34** |
   | Regressed (>1% slower) | **18** |
   
   ## Top Improvements
   
   | Benchmark | Change | Time |
   |---|---|---|
   | take stringview 1024 | **-28.9%** | 346 ns |
   | take fsb(12) 1024 | **-28.8%** | 1.66 µs |
   | take fsb(12) nulls 1024 | **-23.5%** | 2.16 µs |
   | sort primitive run 2^12 | **-22.1%** | 2.60 µs |
   | take primitive run 1024/512 | **-16.8%** | 7.21 µs |
   | take stringview nulls 1024 | **-16.4%** | 825 ns |
   | sort string[100] indices 2^12 | **-13.9%** | 30.9 µs |
   | interleave dict(20) 1024 | **-13.1%** | 880 ns |
   | concat str nulls 1024 | **-12.8%** | 1.56 µs |
   | interleave list\<i64\> 400 | **-12.4%** | 4.41 µs |
   | interleave dict(20) nulls 1024 | **-12.2%** | 921 ns |
   | sort string[100] indices+values 2^12 | **-12.0%** | 30.5 µs |
   | take str nulls values 1024 | **-11.6%** | 2.37 µs |
   | interleave list\<i64\> nulls 400 | **-11.4%** | 11.0 µs |
   | interleave list\<i64\> 4000 | **-11.0%** | 11.0 µs |
   | sort string[10] indices 2^12 | **-10.7%** | 30.8 µs |
   | interleave dict(20) 4000 | **-10.6%** | 1.10 µs |
   | sort string[10] to indices 2^12 | **-10.1%** | 32.6 µs |
   | sort string[10] indices+values 2^12 | **-10.4%** | 16.6 µs |
   
   ## Results by Category
   
   ### sort_kernel
   
   | Benchmark | Change | Time |
   |---|---|---|
   | sort primitive run 2^12 | **-22.1%** | 2.60 µs |
   | sort string[100] indices 2^12 | **-13.9%** | 30.9 µs |
   | sort string[100] indices+values 2^12 | **-12.0%** | 30.5 µs |
   | sort string[10] indices 2^12 | **-10.7%** | 30.8 µs |
   | sort string[10] to indices 2^12 | **-10.1%** | 32.6 µs |
   | sort string[10] indices+values 2^12 | **-10.4%** | 16.6 µs |
   | sort string[100] to indices+values 2^12 | **-9.4%** | 16.7 µs |
   | sort string[100] to indices 2^12 | **-9.1%** | 16.6 µs |
   | sort string[10] to indices+values 2^12 | **-8.3%** | 17.4 µs |
   | sort string[10] to values 2^12 | **-7.7%** | 16.3 µs |
   | sort string[100] to values 2^12 | **-7.5%** | 16.2 µs |
   | lexsort (f64, f64) nulls 2^12 | **-4.5%** | 19.3 µs |
   | lexsort (f64, f32) 2^12 | **-3.6%** | 43.7 µs |
   | rank f32 dict 2^12 | **-3.0%** | 54.6 µs |
   | lexsort (f32, f32) 2^12 | **-2.8%** | 100 µs |
   | sort f32 2^12 | **-1.2%** | 25.0 µs |
   | sort i32 2^10 | **-2.0%** | 3.52 µs |
   | sort i32 2^12 | **-1.4%** | 16.7 µs |
   | sort f32 nulls 2^12 | **-1.6%** | 12.6 µs |
   | lexsort (f32, f32) 2^10 | **-2.0%** | 21.4 µs |
   | sort primitive run 2^12 (with values) | **-1.8%** | 3.77 µs |
   | rank f32 2^12 | **-1.2%** | 29.9 µs |
   | rank f32 nulls 2^12 | **-1.4%** | 15.5 µs |
   
   Sort-with-indices benchmarks (Vec->Buffer replacement) improve by roughly 9% 
to 14%, and the primitive run benchmark improves by 22%. Sort-with-values 
benchmarks (still using Vec internally) regress by 2-4%, which is expected since 
those paths were not changed.
   
   ### interleave_kernels
   
   | Benchmark | Change | Time |
   |---|---|---|
   | interleave dict(20) 1024 | **-13.1%** | 880 ns |
   | interleave dict(20) nulls 1024 | **-12.2%** | 921 ns |
   | interleave list\<i64\> 400 | **-12.4%** | 4.41 µs |
   | interleave list\<i64\> nulls 400 | **-11.4%** | 11.0 µs |
   | interleave list\<i64\> 4000 | **-11.0%** | 11.0 µs |
   | interleave list\<i64\> nulls 4000 | **-11.6%** | 11.1 µs |
   | interleave dict(20) 4000 | **-10.6%** | 1.10 µs |
   | interleave dict(20) nulls 4000 | **-10.6%** | 1.01 µs |
   | interleave i32 4000 | **-8.3%** | 18.7 µs |
   | interleave i32 nulls 4000 | **-8.1%** | 18.6 µs |
   | interleave i32 nulls 400 | **-7.4%** | 1.46 µs |
   | interleave i32 400 | **-7.9%** | 1.64 µs |
   | interleave str_dict 4000 | **-5.9%** | 2.27 µs |
   | interleave str 400 | **-3.5%** | 671 ns |
   | interleave str 4000 | **-5.2%** | 1.57 µs |
   | interleave str nulls 4000 | **-1.8%** | 1.58 µs |
   | interleave str_dict 400 | **-3.2%** | 325 ns |
   | interleave str_dict nulls 400 | **-1.2%** | 327 ns |
   | interleave str_dict nulls 4000 | **-2.4%** | 3.39 µs |
   | interleave str_dict 4000 (nulls) | **-1.8%** | 3.43 µs |
   | interleave bool 400 | **-1.7%** | 119 ns |
   | interleave bool 4000 | **-1.6%** | 289 ns |
   
   ### take_kernels
   
   | Benchmark | Change | Time |
   |---|---|---|
   | take stringview 1024 | **-28.9%** | 346 ns |
   | take fsb(12) 1024 | **-28.8%** | 1.66 µs |
   | take fsb(12) nulls 1024 | **-23.5%** | 2.16 µs |
   | take primitive run 1024/512 | **-16.8%** | 7.21 µs |
   | take stringview nulls 1024 | **-16.4%** | 825 ns |
   | take str nulls values 1024 | **-11.6%** | 2.37 µs |
   | take bool nulls 512 | **-9.2%** | 701 ns |
   | take i32 nulls 1024 | **-8.8%** | 703 ns |
   | take stringview 512 | **-8.1%** | 221 ns |
   | take i32 nulls 512 | **-7.7%** | 323 ns |
   | take bool nulls 1024 | **-5.1%** | 379 ns |
   | take str 1024 | **-4.5%** | 4.77 µs |
   | take str nulls 1024 | **-4.7%** | 2.83 µs |
   | take stringview nulls values 1024 | **-4.4%** | 851 ns |
   | take str 512 | **-2.4%** | 2.23 µs |
   | take str nulls values 512 | **-3.3%** | 3.20 µs |
   | take bool 1024 | **-1.2%** | 506 ns |
   | take bool 512 | **-1.0%** | 279 ns |
   
   ### buffer_create
   
   | Benchmark | Change | Time |
   |---|---|---|
   | from_slice | **-6.5%** | 184 µs |
   | from_iter (u32) | **-5.0%** | 21.5 ms |
   | mutable prepared | **-3.5%** | 136 µs |
   | mutable | **-2.5%** | 199 µs |
   | mutable extend | **-2.3%** | 371 µs |
   | from_iter (u8) | **-1.9%** | 1.87 ms |
   | buffer_create (overhead) | **-1.6%** | 819 µs |
   | Buffer::from_iter bool | **-1.6%** | 1.95 ms |
   | from_slice prepared | +1.8% | 182 µs |
   
   ### concatenate_kernel
   
   | Benchmark | Change | Time |
   |---|---|---|
   | concat str nulls 1024 | **-12.8%** | 1.56 µs |
   | concat i32 8192/100 arrays | **-7.4%** | 45.3 µs |
   | concat str 8192/100 arrays | **-4.9%** | 34.8 µs |
   | concat str nulls 8192/100 arrays | **-4.9%** | 34.8 µs |
   | concat boolean 1024 | **-4.5%** | 110 ns |
   | concat boolean nulls 1024 | **-4.1%** | 158 ns |
   | concat i32 1024 | **-3.4%** | 150 ns |
   | concat fixed size lists | **-2.9%** | 99.7 µs |
   | concat str 8192/10 arrays | **-2.8%** | 15.0 µs |
   | concat str_dict nulls 1024 | **-5.1%** | 3.32 µs |
   | concat str_dict 1024 | +2.5% | 1.12 µs |
   | concat str 1024 | +12.3% | 3.92 µs |
   | concat i32 nulls 1024 | +23.4% | 247 ns |
   
   ### builder
   
   | Benchmark | Change | Time |
   |---|---|---|
   | bench_decimal64_builder | **-5.0%** | 29.8 µs |
   | bench_primitive | **-4.0%** | 1.25 ms |
   | bench_decimal128_builder | **-2.6%** | 59.5 µs |
   | bench_bool | **-1.8%** | 133 µs |
   | bench_decimal32_builder | +2.0% | 31.6 µs |
   | bench_primitive_nulls | +14.6% | 721 µs |
   
   ## Notable Regressions
   
   | Benchmark | Change | Time | Analysis |
   |---|---|---|---|
   | concat i32 nulls 1024 | +23.4% | 247 ns | Very fast benchmark (~247 ns); absolute delta is ~47 ns. May be measurement noise or code layout sensitivity. |
   | bench_primitive_nulls | +14.6% | 721 µs | Builder benchmark with null handling. Needs investigation. |
   | concat str 1024 | +12.3% | 3.92 µs | Small concat. Large concat (8192/100) improved by ~5%. |
   | take i32 512 | +6.3% | 182 ns | Small take. The 1024 variant is neutral (+1.3%). |
   | sort i32 (with values) | +2-4% | ~24 µs | Expected: these paths still use Vec internally, only sort-to-indices was changed. |
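
The absolute delta quoted for `concat i32 nulls 1024` can be reproduced from the table values. The sketch below assumes the Time column is the measurement taken after the change, which is consistent with the ~47 ns figure. The helper name `baseline_and_delta` is hypothetical.

```rust
/// Given the post-change time and the relative change (e.g. +23.4% => 0.234),
/// recovers the baseline time and the absolute delta, both in the input unit.
pub fn baseline_and_delta(new_time: f64, relative_change: f64) -> (f64, f64) {
    // new_time = baseline * (1 + relative_change)
    let baseline = new_time / (1.0 + relative_change);
    (baseline, new_time - baseline)
}
```

For 247 ns at +23.4% this gives a baseline of about 200 ns and a delta of about 47 ns.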
   
   


