zeroshade commented on PR #864: URL: https://github.com/apache/arrow-go/pull/864#issuecomment-4848598340
Thanks for the extra benchmark @Patzifist. Before we iterate on the implementation, I want to make sure we're measuring the right thing — I don't think these numbers show what they appear to. **What the benchmark is actually measuring** Both rows ran at `b.N = 1` (the `1` immediately after the benchmark name). The entire 65 × 100 = 6,500-call concurrent stress loop runs *inside a single benchmark iteration*, so the testing framework divides the totals by `N = 1`. That means `B/op` and `allocs/op` are not "per `GetColumnBloomFilter` call" — they're the totals for one 6,500-call batch. I confirmed this with the `ReproduceProductionLeak` benchmark in the PR by varying `-benchtime`: ``` -benchtime=1x 1 631 MB/op 209135 allocs/op -benchtime=3x 3 318 MB/op 208677 allocs/op ``` `allocs/op` stays at ~209k regardless of `N`, which confirms the "op" is one whole batch. The real per-filter cost is ~209,000 / 6,500 ≈ **~32 allocs per actual filter read**, not 209k. Two related points: - `B/op` is derived from `TotalAlloc`, which is *cumulative* allocation (churn), not *retained* memory. It only ever increases, even as the GC frees everything, so by itself it can't demonstrate a leak or an OOM. - Taking the two rows at face value, the patched path is **~24% slower** (1.461s vs 1.177s) for **~10% fewer allocations** (209k vs 233k) — the opposite of the trade-off we'd want. **What I do agree is real** There is something worth fixing, and it's what we converged on earlier in the thread: today the bitset buffer is only returned to the pool via `runtime.AddCleanup`, which is GC-timing dependent. Under a burst those returns lag and cumulative churn climbs — I measure ~3.1 GB of churn on the cleanup-only path vs ~75 MB when the same buffer is returned to the pool promptly. The pool pollution point (small headers and MB-sized bitsets sharing one pool) is also legitimate. So "return the buffer to the per-reader pool more promptly" is a good goal. **What would help: a proper reproducer** To justify a change of this size, could you put together a benchmark that isolates the claim? Specifically: 1. **One op = one `GetColumnBloomFilter` call** — use `b.RunParallel` with `b.ResetTimer()` / `b.ReportAllocs()` so `B/op` and `allocs/op` are genuinely per-call. The scenario B/C harness I posted above is a good starting template. 2. **Use a non-zero `BloomFilterOffset`** so the read path actually runs — assert `Size()` returns the expected byte count. The original repro used `offset = 0`, which returns `nil, nil` before the pool, header, or bitset are ever touched. 3. **Keep allocations that aren't under test out of the measured loop** — in the original harness the `make([]byte, 2 MiB)` per iteration was itself ~13 GB of the churn (6,500 × 2 MiB), unrelated to the pool. 4. **If the claim is OOM / unbounded growth, measure *retained* memory** — `HeapInuse`/`HeapAlloc` sampled over time (or RSS), not `TotalAlloc`. A genuine leak shows up as live memory that doesn't drop after a `runtime.GC()`. 5. **Compare against current `main` on equal footing** — same allocator, and the same buffer-return discipline on both sides (either both rely on cleanup, or both return promptly). If you can share a repro that shows per-call allocations and/or retained-memory growth on current `main`, that gives us a concrete target to optimize against and to verify any fix. Happy to review the harness once it's posted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
