yashmayya opened a new pull request, #18905: URL: https://github.com/apache/pinot/pull/18905
### Problem #15335 introduced `PooledByteBufAllocatorWithLimits` to reduce the number of direct arenas in the Netty pooled allocators used by the broker <-> server query transport. However, `ServerChannels.ServerChannel`'s constructor calls `PooledByteBufAllocatorWithLimits.getBufferAllocatorWithLimits(...)`, which creates a brand-new `PooledByteBufAllocator` on every call — and one `ServerChannel` exists per `ServerRoutingInstance` (per server, per table type). Prior to #15335, all server channels shared the single `PooledByteBufAllocator.DEFAULT`. A broker connected to N servers therefore creates N independent pooled allocators, each with up to `min(2 * cores, maxDirectMemory / chunkSize / 5)` direct arenas: - **Direct memory retention is amplified up to N times.** Netty pooled arenas retain chunk memory after the buffers allocated from them are released, and free space in one allocator's pool can never serve another allocator's allocations. Large scatter-gather responses spike several channels' pools, each pool's retained chunk memory ratchets up independently, and the broker's total pinned direct memory grows toward N times the intended per-process footprint. We have seen production brokers repeatedly hit `io.netty.util.internal.OutOfDirectMemoryError` at the `-XX:MaxDirectMemorySize` cap this way: long after the original response spike, even small request writes (`LengthFieldPrepender.encode` -> `PoolArena.newChunk`) fail intermittently because most of the direct memory budget is pinned across dozens of mostly-idle per-channel pools, and only a broker restart (which throws away the allocators) recovers. Related recovery-path fixes: #17684, #17861. - **The `NETTY_POOLED_*` broker gauges are wrong.** Every new `ServerChannel` re-registers `BrokerGauge.NETTY_POOLED_USED_DIRECT_MEMORY` (and the other pooled-allocator gauges) against its own allocator's metric, overwriting the previous registration, so the gauge only ever reports the last-created allocator's usage and hides the true total. This made the incidents above much harder to diagnose. The server side has a milder variant of the same problem: each `QueryServer` (plaintext and TLS listeners) creates its own limited allocator in `start()`. ### Fix - Add `PooledByteBufAllocatorWithLimits.getSharedBufferAllocatorWithLimits()`, which lazily creates a single process-wide arena-limited allocator; the per-call factory is now private so the shared instance is the only way to obtain one. - `ServerChannels` obtains it once in its constructor, uses it for every `ServerChannel` bootstrap, and registers the `NETTY_POOLED_*` gauges once against that shared allocator's metric (the gauges now also appear at broker startup instead of on the first query). - `QueryServer` uses the same shared allocator, so servers running plaintext + TLS listeners (and combined-role JVMs like quickstarts) share one pool, restoring the pre-#15335 sharing behavior while keeping the reduced arena count — which now applies once per process instead of once per connection. - Since the allocator is now created once and lives for the process lifetime, the computed arena-count default is floored at 1 so that a depleted direct-memory snapshot at creation time cannot permanently disable pooling (an explicit `io.netty.allocator.numDirectArenas=0` is still honored), and the chosen sizing is logged at creation for diagnosability. This only affects the unshaded Netty transports; the gRPC transports use shaded Netty classes with their own per-component (not per-connection) allocators and are unaffected. ### Behavior notes for operators - `BrokerGauge.NETTY_POOLED_*` values will step up after this change: they previously reported a single (the most recently created) per-channel allocator and now report the whole shared pool. Alerts calibrated against the old values should be re-baselined — the new values are the accurate ones. - In combined-role JVMs (e.g. quickstarts), `BrokerGauge.NETTY_POOLED_*` and `ServerGauge.NETTY_POOLED_*` now report the same underlying allocator, so summing them across roles double-counts. ### Testing - New `ServerChannelsTest.testChannelsShareBufferAllocator` asserts that all server channels — including ones created by different `ServerChannels` instances (e.g. the TLS one) — are bootstrapped with the same allocator instance, and that it is the shared allocator. - `QueryServerTest` now asserts both the server socket channel and the accepted child channels (which allocate the request/response buffers) are configured with the shared allocator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
