yashmayya opened a new pull request, #18905:
URL: https://github.com/apache/pinot/pull/18905

   ### Problem
   
   #15335 introduced `PooledByteBufAllocatorWithLimits` to reduce the number of 
direct arenas in the Netty pooled allocators used by the broker <-> server 
query transport. However, `ServerChannels.ServerChannel`'s constructor calls 
`PooledByteBufAllocatorWithLimits.getBufferAllocatorWithLimits(...)`, which 
creates a brand-new `PooledByteBufAllocator` on every call — and one 
`ServerChannel` exists per `ServerRoutingInstance` (per server, per table 
type). Prior to #15335, all server channels shared the single 
`PooledByteBufAllocator.DEFAULT`.
   
   A broker connected to N servers therefore creates N independent pooled 
allocators, each with up to `min(2 * cores, maxDirectMemory / chunkSize / 5)` 
direct arenas:
   
   - **Direct memory retention is amplified up to N times.** Netty pooled 
arenas retain chunk memory after the buffers allocated from them are released, 
and free space in one allocator's pool can never serve another allocator's 
allocations. Large scatter-gather responses spike several channels' pools, each 
pool's retained chunk memory ratchets up independently, and the broker's total 
pinned direct memory grows toward N times the intended per-process footprint. 
We have seen production brokers repeatedly hit 
`io.netty.util.internal.OutOfDirectMemoryError` at the 
`-XX:MaxDirectMemorySize` cap this way: long after the original response spike, 
even small request writes (`LengthFieldPrepender.encode` -> 
`PoolArena.newChunk`) fail intermittently because most of the direct memory 
budget is pinned across dozens of mostly-idle per-channel pools, and only a 
broker restart (which throws away the allocators) recovers. Related 
recovery-path fixes: #17684, #17861.
   - **The `NETTY_POOLED_*` broker gauges are wrong.** Every new 
`ServerChannel` re-registers `BrokerGauge.NETTY_POOLED_USED_DIRECT_MEMORY` (and 
the other pooled-allocator gauges) against its own allocator's metric, 
overwriting the previous registration, so the gauge only ever reports the 
last-created allocator's usage and hides the true total. This made the 
incidents above much harder to diagnose.
   
   The server side has a milder variant of the same problem: each `QueryServer` 
(plaintext and TLS listeners) creates its own limited allocator in `start()`.
   
   ### Fix
   
   - Add 
`PooledByteBufAllocatorWithLimits.getSharedBufferAllocatorWithLimits()`, which 
lazily creates a single process-wide arena-limited allocator; the per-call 
factory is now private so the shared instance is the only way to obtain one.
   - `ServerChannels` obtains it once in its constructor, uses it for every 
`ServerChannel` bootstrap, and registers the `NETTY_POOLED_*` gauges once 
against that shared allocator's metric (the gauges now also appear at broker 
startup instead of on the first query).
   - `QueryServer` uses the same shared allocator, so servers running plaintext 
+ TLS listeners (and combined-role JVMs like quickstarts) share one pool, 
restoring the pre-#15335 sharing behavior while keeping the reduced arena count 
— which now applies once per process instead of once per connection.
   - Since the allocator is now created once and lives for the process 
lifetime, the computed arena-count default is floored at 1 so that a depleted 
direct-memory snapshot at creation time cannot permanently disable pooling (an 
explicit `io.netty.allocator.numDirectArenas=0` is still honored), and the 
chosen sizing is logged at creation for diagnosability.
   
   This only affects the unshaded Netty transports; the gRPC transports use 
shaded Netty classes with their own per-component (not per-connection) 
allocators and are unaffected.
   
   ### Behavior notes for operators
   
   - `BrokerGauge.NETTY_POOLED_*` values will step up after this change: they 
previously reported a single (the most recently created) per-channel allocator 
and now report the whole shared pool. Alerts calibrated against the old values 
should be re-baselined — the new values are the accurate ones.
   - In combined-role JVMs (e.g. quickstarts), `BrokerGauge.NETTY_POOLED_*` and 
`ServerGauge.NETTY_POOLED_*` now report the same underlying allocator, so 
summing them across roles double-counts.
   
   ### Testing
   
   - New `ServerChannelsTest.testChannelsShareBufferAllocator` asserts that all 
server channels — including ones created by different `ServerChannels` 
instances (e.g. the TLS one) — are bootstrapped with the same allocator 
instance, and that it is the shared allocator.
   - `QueryServerTest` now asserts both the server socket channel and the 
accepted child channels (which allocate the request/response buffers) are 
configured with the shared allocator.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to