RyanJamesStewart commented on PR #9967:
URL: https://github.com/apache/arrow-rs/pull/9967#issuecomment-4440494341
Thanks — applied both. Switched the fast path to `nulls.valid_indices()` to
drop the unsafe, and fixed the reserve to `valid_in_range`.
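For readers following along, a minimal sketch of the safe iteration pattern: instead of unsafe per-bit probing, iterate only the valid (set) bit positions, in the spirit of `NullBuffer::valid_indices()` from `arrow-buffer`. The `u64` mask and the `valid_indices` helper here are toy stand-ins for the real validity buffer, not the actual arrow-rs code.

```rust
// Toy model of a validity bitmap: bit i set => slot i is valid (non-null).
// Stands in for arrow-buffer's NullBuffer::valid_indices(); assumption, not
// the real implementation.
fn valid_indices(mask: u64) -> impl Iterator<Item = usize> {
    (0..64).filter(move |i| mask & (1u64 << i) != 0)
}

fn main() {
    // Bits 0, 2 and 5 are set, so only those slots are visited.
    let idxs: Vec<usize> = valid_indices(0b100101).collect();
    println!("{:?}", idxs); // prints [0, 2, 5]
    assert_eq!(idxs, vec![0, 2, 5]);
}
```

The point of the pattern is that no index ever addresses a null slot, so the value reads need no unsafe bounds/validity assumptions.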
The benchmark bot also surfaced a regression I had missed in my own
measurements: I hadn't benchmarked the list paths. `list_primitive` and
`list_primitive_sparse_99pct_null` were ~6–12% slower because the per-range
`count_set_bits_offset` and the under-allocated `reserve(len - valid_in_range)`
were both paid on every `write_leaf` call from `write_list` →
`write_non_null_slice`, where call counts are high (~10K) and per-call ranges
are tiny (~5 elements on average). The bulk-fill payoff doesn't apply at that
range size.
Added a length gate on entering the new path: `len >= 64 &&
nulls.null_count() * 2 >= nulls.len()`. The `null_count()` check uses the
cached field (O(1)) so there's no per-range popcount when the global density is
low. I swept `T = {0, 16, 32, 64, 128, 256}` on
`list_primitive_sparse_99pct_null` to justify the choice:
| T | list_primitive_sparse_99pct_null |
|-----|---------------------------------:|
| 0 | +7.8% (reproduces the bot's original measurement) |
| 16 | +2.8% |
| 32 | +1.7% |
| 64 | +1.7% ← chosen |
| 128 | +2.4% |
| 256 | +2.7% |
Breakeven for the list-sparse case is between T=0 and T=32. The +1.7% floor
at T≥32 is the structural cost of evaluating the gate across ~10K calls, not
the fast-path execution; reducing it further would require hoisting the
decision into `write_list`. T=64 matches T=32 on every shape with 12x margin
over the avg list length of ~5 and keeps the wins intact: −1.5% on `primitive`,
**−35.1% on `primitive_sparse_99pct_null`**, **−66.4% on `primitive_all_null`**
vs main on Ryzen 9 9950X.
Re-triggering the bench.