adriangb opened a new pull request, #10068: URL: https://github.com/apache/arrow-rs/pull/10068
# Which issue does this PR close?

Follow-up to benchmark noise observed on the criterion bench bot (e.g. on #9972), where `string/parquet_2` reported a ~1.75x "regression" that was not reproducible and not present in instruction counts.

# Rationale for this change

The `arrow_writer` benchmarks build a fresh `ArrowWriter` every criterion iteration, so the writer's internal encode buffers are allocated and freed on each iteration. With a page-decaying allocator (glibc default, jemalloc default), those buffers are served from fresh, un-faulted pages whenever earlier benchmarks in the same process have churned the heap, so each iteration pays a **minor page fault** for every page it writes. That fault tax roughly doubles the measured time for the byte-array writers and makes the result **depend on benchmark order**.

On the same hardware as the bench bot (Neoverse-V2), the *same* `main` binary produces:

| `string/parquet_2` | time |
|---|---|
| run in isolation | ~106 ms |
| run after the `primitive` group | ~187 ms |

This is the source of the spurious bench-bot deltas: a `main`-vs-`main` control run (identical code on both sides) reproduced an **18%** difference on `string/parquet_2`, and a larger draw produced the original ~1.75x. The work done is identical (instruction count differs by ~0.25% for the change that triggered the investigation); only the page-fault state differs.

Diagnosis details: the slow basin shows ~5M minor faults vs ~763K in the fast basin; forcing every buffer onto fresh pages (`MALLOC_MMAP_THRESHOLD_` low) pins it slow, and disabling page decay pins it fast.

# What changes are included in this PR?

Use jemalloc as the `arrow_writer` bench's global allocator with page decay disabled (`dirty_decay_ms:-1,muzzy_decay_ms:-1`), so freed pages stay mapped and are reused warm instead of being returned to the OS.
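
One way to tell whether a run sits in the slow or the fast state is to count minor page faults around the measured region. The following is a minimal, Linux-only sketch that reads the `minflt` field from `/proc/self/stat`. It is an assumption for illustration: the PR does not say which tool produced the fault counts quoted above, and the helper names (`parse_minflt`, `minor_faults`, `faults_during`) are hypothetical and not part of the bench.

```rust
use std::fs;

/// Extract `minflt` (field 10 of `/proc/<pid>/stat`) from one stat line.
fn parse_minflt(stat: &str) -> Option<u64> {
    // Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    // so start parsing after the last ')'.
    let idx = stat.rfind(')')?;
    let rest = &stat[idx + 1..];
    // Fields after comm: state, ppid, pgrp, session, tty_nr, tpgid, flags, minflt.
    rest.split_whitespace().nth(7)?.parse().ok()
}

/// Minor faults taken so far by this process.
/// Returns `None` where `/proc/self/stat` is unavailable (e.g. off Linux).
fn minor_faults() -> Option<u64> {
    parse_minflt(&fs::read_to_string("/proc/self/stat").ok()?)
}

/// Run `f` and return how many minor faults the process took meanwhile.
fn faults_during(f: impl FnOnce()) -> Option<u64> {
    let before = minor_faults()?;
    f();
    Some(minor_faults()?.saturating_sub(before))
}

fn main() {
    let delta = faults_during(|| {
        // Stand-in for one benchmark iteration: allocate a fresh buffer
        // and touch one byte in every 4 KiB page.
        let mut buf = vec![0u8; 64 << 20];
        for b in buf.iter_mut().step_by(4096) {
            *b = 1;
        }
        std::hint::black_box(&buf);
    });
    match delta {
        Some(n) => println!("minor faults during iteration: {n}"),
        None => println!("minor-fault counter not available on this platform"),
    }
}
```

Running the same closure twice in one process, or under different allocator settings, gives a rough view of how much of an iteration's cost comes from first-touch faults rather than from the work itself.
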
Disabling page decay removes the per-iteration fault tax and collapses the order-dependent bimodality:

| `string/parquet_2` | isolated | after `primitive` | after `string` group |
|---|---|---|---|
| before (system alloc) | 106 ms | 187 ms | 106 ms |
| after (this PR) | ~106 ms | ~107 ms | ~106 ms |

Notes on robustness (this came up in review):

- The decay policy is **pinned by the benchmark** via a compiled-in `malloc_conf` symbol rather than left to an allocator default, so it does not silently change if the allocator updates its defaults.
- jemalloc only reads the *unprefixed* `malloc_conf` symbol when built with `unprefixed_malloc_on_supported_platforms`; without it the symbol is silently ignored. To make that failure mode loud, `assert_page_decay_disabled()` reads `opt.dirty_decay_ms` / `opt.muzzy_decay_ms` at startup (via `tikv-jemalloc-ctl`) and panics with a hint if the policy is not actually `-1`. This was verified to fire when the feature is removed.

Scope: the allocator only affects the `arrow_writer` benchmark binary; no library code changes.

# Are there any user-facing changes?

No. Benchmark-only change (dev-dependencies + the `arrow_writer` bench).
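
The startup check described in the robustness notes can be sketched roughly as follows. This is not the PR's `assert_page_decay_disabled()`: to stay dependency-free, the allocator lookup is replaced by an injected closure (`read_opt`, hypothetical), whereas the real bench reads the values through `tikv-jemalloc-ctl`. The function name `check_page_decay_disabled` and the wording of the hint are likewise assumptions.

```rust
/// Verify that both decay options report -1 (decay disabled).
/// `read_opt` is a hypothetical stand-in for an allocator option lookup;
/// it returns `None` when the option cannot be read.
fn check_page_decay_disabled(
    read_opt: impl Fn(&str) -> Option<isize>,
) -> Result<(), String> {
    for name in ["opt.dirty_decay_ms", "opt.muzzy_decay_ms"] {
        match read_opt(name) {
            Some(-1) => {}
            other => {
                return Err(format!(
                    "{name} is {other:?}, expected Some(-1): the compiled-in \
                     `malloc_conf` was probably ignored; check that the \
                     `unprefixed_malloc_on_supported_platforms` feature is enabled"
                ));
            }
        }
    }
    Ok(())
}

fn main() {
    // Policy pinned as intended: both options report -1.
    let pinned = |_: &str| -> Option<isize> { Some(-1) };
    assert!(check_page_decay_disabled(pinned).is_ok());

    // Symbol ignored: the allocator reports some finite decay time instead.
    // 10_000 ms is only an illustrative value.
    let ignored = |name: &str| -> Option<isize> {
        if name == "opt.dirty_decay_ms" { Some(10_000) } else { Some(-1) }
    };
    let err = check_page_decay_disabled(ignored).unwrap_err();
    println!("{err}");
}
```

Returning a `Result` keeps the sketch easy to exercise. A benchmark binary would more likely panic on the error at startup, as the PR describes, so that a silently ignored `malloc_conf` cannot produce misleading numbers.
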
