zeroshade commented on PR #864:
URL: https://github.com/apache/arrow-go/pull/864#issuecomment-4848598340

   Thanks for the extra benchmark @Patzifist. Before we iterate on the 
implementation, I want to make sure we're measuring the right thing — I don't 
think these numbers show what they appear to.
   
   **What the benchmark is actually measuring**
   
   Both rows ran at `b.N = 1` (the `1` immediately after the benchmark name). 
The entire 65 × 100 = 6,500-call concurrent stress loop runs *inside a single 
benchmark iteration*, so the testing framework divides the totals by `N = 1`. 
That means `B/op` and `allocs/op` are not "per `GetColumnBloomFilter` call" — 
they're the totals for one 6,500-call batch.
   
   I confirmed this with the `ReproduceProductionLeak` benchmark in the PR by 
varying `-benchtime`:
   
   ```
   -benchtime=1x    1    631 MB/op    209135 allocs/op
   -benchtime=3x    3    318 MB/op    208677 allocs/op
   ```
   
   `allocs/op` stays at ~209k regardless of `N`, which confirms the "op" is one 
whole batch. The real per-filter cost is ~209,000 / 6,500 ≈ **~32 allocs per 
actual filter read**, not 209k.
   
   Two related points:
   - `B/op` is derived from `TotalAlloc`, which is *cumulative* allocation 
(churn), not *retained* memory. It only ever increases, even as the GC frees 
everything, so by itself it can't demonstrate a leak or an OOM.
   - Taking the two rows at face value, the patched path is **~24% slower** 
(1.461s vs 1.177s) for **~10% fewer allocations** (209k vs 233k) — the opposite 
of the trade-off we'd want.
   
   **What I do agree is real**
   
   There is something worth fixing, and it's what we converged on earlier in 
the thread: today the bitset buffer is only returned to the pool via 
`runtime.AddCleanup`, which is GC-timing dependent. Under a burst those returns 
lag and cumulative churn climbs — I measure ~3.1 GB of churn on the 
cleanup-only path vs ~75 MB when the same buffer is returned to the pool 
promptly. The pool pollution point (small headers and MB-sized bitsets sharing 
one pool) is also legitimate. So "return the buffer to the per-reader pool more 
promptly" is a good goal.
   
   **What would help: a proper reproducer**
   
   To justify a change of this size, could you put together a benchmark that 
isolates the claim? Specifically:
   
   1. **One op = one `GetColumnBloomFilter` call** — use `b.RunParallel` with 
`b.ResetTimer()` / `b.ReportAllocs()` so `B/op` and `allocs/op` are genuinely 
per-call. The scenario B/C harness I posted above is a good starting template.
   2. **Use a non-zero `BloomFilterOffset`** so the read path actually runs — 
assert `Size()` returns the expected byte count. The original repro used 
`offset = 0`, which returns `nil, nil` before the pool, header, or bitset are 
ever touched.
   3. **Keep allocations that aren't under test out of the measured loop** — in 
the original harness the `make([]byte, 2 MiB)` per iteration was itself ~13 GB 
of the churn (6,500 × 2 MiB), unrelated to the pool.
   4. **If the claim is OOM / unbounded growth, measure *retained* memory** — 
`HeapInuse`/`HeapAlloc` sampled over time (or RSS), not `TotalAlloc`. A genuine 
leak shows up as live memory that doesn't drop after a `runtime.GC()`.
   5. **Compare against current `main` on equal footing** — same allocator, and 
the same buffer-return discipline on both sides (either both rely on cleanup, 
or both return promptly).
   
   If you can share a repro that shows per-call allocations and/or 
retained-memory growth on current `main`, that gives us a concrete target to 
optimize against and to verify any fix. Happy to review the harness once it's 
posted.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to