Weijun-H commented on PR #8716:
URL: https://github.com/apache/arrow-rs/pull/8716#issuecomment-3455136174
After several rounds of optimization, the current version is measurably
faster than the previous one: the benchmarks below show time reductions of
roughly 7% to 35%.
- Type-specialized dispatch:
`compute_run_boundaries` now routes each physical layout (boolean, primitive
scalars, binary/string, etc.) to a dedicated helper, allowing most arrays to
bypass the slow, generic `ArrayData` comparison path.
- Chunked primitive scanning:
The no-null primitive path uses `scan_run_end`, which compares 16 bytes at a
time via `u128` loads. When a chunk differs, it falls back to scalar
iteration. Comparing whole chunks reduces branches and bounds checks in the
hot loop.
- Targeted use of `unsafe` for performance:
Tight loops leverage `get_unchecked`, `from_raw_parts`, and `read_unaligned`
to eliminate redundant bounds and alignment checks. Each `unsafe` block includes
detailed safety comments describing the invariants upheld.
- `RunBoundaryAccumulator`:
A lightweight helper that pre-allocates capacity using a `len / 64 + 2`
heuristic and expands as needed. All run-detection routines share this
consistent and efficient allocation strategy.
- Integrated null handling:
Boolean, primitive, and binary paths now detect value and validity
transitions in a single scan, avoiding temporary bitmap construction for null
detection.
- Generic fallback:
Less common types still rely on `ArrayData` equality but reuse the shared
accumulator, so they produce consistent run and value outputs without
special-casing memory management.
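
To make the chunked-scanning and pre-allocation points concrete, here is a minimal sketch for a no-null `i32` buffer. It is an illustration under stated assumptions, not the code in this PR: the signature of `scan_run_end` is guessed, `run_ends_i32` is a hypothetical driver, and a plain `Vec` stands in for `RunBoundaryAccumulator`.

```rust
/// Returns the exclusive end index of the run that begins at `start`.
/// Assumed signature for illustration; requires `start < values.len()`.
fn scan_run_end(values: &[i32], start: usize) -> usize {
    let current = values[start];
    let mut i = start + 1;

    // Pack four copies of the run value into one 128-bit pattern.
    // All four lanes are identical, so the pattern is the same on
    // little-endian and big-endian targets.
    let lane = current as u32 as u128;
    let pattern = lane | (lane << 32) | (lane << 64) | (lane << 96);

    // Chunked phase: compare four i32 values (16 bytes) per iteration.
    while i + 4 <= values.len() {
        // SAFETY: `i + 4 <= values.len()`, so the 16 bytes starting at
        // `values[i]` lie inside the slice. `read_unaligned` places no
        // alignment requirement on the source pointer.
        let chunk =
            unsafe { std::ptr::read_unaligned(values.as_ptr().add(i) as *const u128) };
        if chunk != pattern {
            break; // the run ends somewhere inside this chunk
        }
        i += 4;
    }

    // Scalar phase: locate the exact boundary, or consume the tail that is
    // shorter than one chunk.
    while i < values.len() && values[i] == current {
        i += 1;
    }
    i
}

/// Hypothetical driver: collects the exclusive end index of every run.
fn run_ends_i32(values: &[i32]) -> Vec<usize> {
    // Capacity heuristic described above: len / 64 + 2, growing on demand.
    let mut run_ends = Vec::with_capacity(values.len() / 64 + 2);
    let mut start = 0;
    while start < values.len() {
        let end = scan_run_end(values, start);
        run_ends.push(end);
        start = end;
    }
    run_ends
}
```

For the input `[7, 7, 7, 7, 7, 7, 3, 3, 9]` this sketch yields run ends `[6, 8, 9]`. Long runs are consumed 16 bytes per comparison, while inputs with no runs spend most of their time in the scalar phase.
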
```
cast string single run to ree<int32>
time: [23.143 µs 23.180 µs 23.224 µs]
change: [−8.5926% −6.6138% −5.2622%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low mild
3 (3.00%) high mild
9 (9.00%) high severe
cast runs of 10 string to ree<int32>
time: [4.4857 µs 4.4924 µs 4.4999 µs]
change: [−35.582% −32.807% −30.598%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
cast runs of 1000 int32s to ree<int32>
time: [1.9651 µs 1.9923 µs 2.0449 µs]
change: [−35.958% −34.582% −33.095%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe
cast no runs of int32s to ree<int32>
time: [27.745 µs 28.013 µs 28.291 µs]
change: [−27.957% −27.305% −26.645%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
14 (14.00%) high mild
```