[ 
https://issues.apache.org/jira/browse/RATIS-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai reassigned RATIS-2529:
---------------------------------------

    Assignee: Yongzao Dan

> gRPC worker threads permanently inflate to availableProcessors×2 after 
> follower restart catch-up
> ------------------------------------------------------------------------------------------------
>
>                 Key: RATIS-2529
>                 URL: https://issues.apache.org/jira/browse/RATIS-2529
>             Project: Ratis
>          Issue Type: Improvement
>          Components: gRPC
>    Affects Versions: 3.2.2
>            Reporter: Yongzao Dan
>            Assignee: Yongzao Dan
>            Priority: Critical
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a Raft follower node restarts and rejoins the cluster, the catch-up 
> phase creates a burst of concurrent gRPC streams (log replication, snapshot
> transfer, voting, etc.). This burst activates nearly all Netty EventLoop 
> threads in the EventLoopGroup (default size: availableProcessors * 2). After
> catch-up completes, these threads never exit — they remain in 
> EpollEventLoop.run() doing empty epollWait indefinitely, even though only 2–3 
> of them carry any real I/O.
>  
> In contrast, a follower that has been running since initial cluster startup 
> only has 2–6 EventLoop threads active, because connections were established 
> gradually over time and only activated a few EventLoops.
> *Environment*
>  - Ratis version: 3.2.2
>  - Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
>  - Transport: gRPC (Netty-based, using EpollEventLoopGroup)
> *Steps to Reproduce*
> 1. Start a 5-node Raft cluster. Wait for it to stabilize.
> 2. Observe grpc-default-worker-ELG-* threads on each follower via jstack — 
> typically 2–6 threads.
> 3. Kill 2 follower nodes. Wait a few minutes.
> 4. Restart them. They will rejoin the cluster and go through the catch-up 
> phase.
> 5. After catch-up is done and the cluster is stable again, observe 
> grpc-default-worker-ELG-* threads on the restarted followers.
> *Observed Behavior*
> Thread counts from a production cluster (collected via jstack):
>  
> ||Node||Role||ELG Threads||Actively Used||
> |node-3|Follower (never restarted)|6|2|
> |node-4|Follower (never restarted)|6|2|
> |node-0|Follower (restarted)|36|2|
> |node-1|Follower (restarted)|35|2|
> |node-2|Leader|36|majority|
>  
> Key evidence — thread creation timestamps (from jstack elapsed times on 
> node-0):
> grpc-default-worker-ELG-3-1 elapsed=7192.55s
> grpc-default-worker-ELG-3-2 elapsed=7192.55s
> ...
> grpc-default-worker-ELG-3-36 elapsed=7189.04s ← all 36 created within 3.5 
> seconds at startup
> CPU usage confirms most threads are idle (over 7192 seconds of life):
> ELG-3-1 cpu=1487.26ms ← actually processing I/O
> ELG-3-2 cpu=1012.09ms ← actually processing I/O
> ELG-3-3 cpu=0.18ms ← idle (epollWait only)
> ELG-3-4 cpu=0.18ms
> ...
> ELG-3-36 cpu=35.02ms
> Normal followers grew their threads gradually over their entire lifetime (2 → 
> 4 → 6), corresponding to individual peer connection events.
> *Expected Behavior*
> After catch-up completes, a restarted follower should have a similar number 
> of active ELG threads as a follower that was never restarted (~2–6), rather 
> than permanently retaining the full availableProcessors * 2 threads.
> *Root Cause Analysis*
> The issue stems from how Netty EventLoopGroup works combined with Ratis's 
> gRPC usage:
> 1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2 
> EventLoop objects, but threads are lazily started — an EventLoop's thread 
> only begins running when a channel is first registered to it.
> 2. During normal operation, new connections trickle in one at a time, and 
> gRPC's round-robin registration only touches a few EventLoops. Most 
> EventLoops never get a channel registered, so their threads are never started.
> 3. During restart catch-up, the leader sends a burst of log entries / 
> snapshot chunks via GrpcLogAppender, and the restarted node simultaneously 
> opens client connections to multiple peers. This burst distributes channel 
> registrations across all
> EventLoops, starting all 36 threads at once.
> 4. EventLoop threads never exit. Once started, a Netty EventLoop runs an 
> infinite loop (EpollEventLoop.run()), and only terminates when 
> EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or 
> resize the EventLoopGroup after catch-up.
> *Suggestion*
> Expose a configuration key in GrpcConfigKeys to allow users to control the 
> EventLoopGroup thread count, rather than relying on the Netty default 
> (availableProcessors * 2). For example:
> {code:java}
> // In GrpcConfigKeys.Server
> String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
> int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default{code}
> Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
>  
> {code:java}
> NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
>     .withChildOption(ChannelOption.SO_REUSEADDR, true)
>     .maxInboundMessageSize(messageSizeMax.getSizeInt())
>     .flowControlWindow(flowControlWindow.getSizeInt());
> int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
> if (workerThreads > 0) {
>     nettyServerBuilder.workerEventLoopGroup(new 
> EpollEventLoopGroup(workerThreads));
> }
> {code}
>  
> Similarly, the client-side NettyChannelBuilder in 
> GrpcServerProtocolClient.buildChannel() and 
> GrpcClientProtocolClient.buildChannel() could accept a configurable 
> EventLoopGroup.
> This would allow downstream projects (e.g., Apache IoTDB) to cap the thread 
> count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the 
> permanent thread inflation after restart.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to