Yongzao Dan created RATIS-2529:
----------------------------------
Summary: gRPC worker threads permanently inflate to
availableProcessors×2 after follower restart catch-up
Key: RATIS-2529
URL: https://issues.apache.org/jira/browse/RATIS-2529
Project: Ratis
Issue Type: Improvement
Components: gRPC
Affects Versions: 3.2.2
Reporter: Yongzao Dan
When a Raft follower node restarts and rejoins the cluster, the catch-up phase
creates a burst of concurrent gRPC streams (log replication, snapshot transfer,
voting, etc.). The burst activates nearly all Netty EventLoop threads in the
EventLoopGroup (default size: availableProcessors * 2). After catch-up
completes, these threads never exit: they remain in EpollEventLoop.run(),
performing empty epollWait calls indefinitely, even though only 2–3 of them
carry any real I/O.
In contrast, a follower that has been running since the initial cluster startup
keeps only 2–6 EventLoop threads active, because its connections were
established gradually and touched only a few EventLoops.
Environment
- Ratis version: 3.2.2
- Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
- Transport: gRPC (Netty-based, using EpollEventLoopGroup)
Steps to Reproduce
1. Start a 5-node Raft cluster. Wait for it to stabilize.
2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
typically 2–6 threads.
3. Kill 2 follower nodes. Wait a few minutes.
4. Restart them. They will rejoin the cluster and go through the catch-up
phase.
5. After catch-up is done and the cluster is stable again, observe
grpc-default-worker-ELG-* threads on the restarted followers.
Observed Behavior
Thread counts from a production cluster (collected via jstack):
┌────────┬────────────────────────────┬─────────────┬───────────────┐
│ Node │ Role │ ELG Threads │ Actively Used │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-3 │ Follower (never restarted) │ 6 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-4 │ Follower (never restarted) │ 6 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-0 │ Follower (restarted) │ 36 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-1 │ Follower (restarted) │ 35 │ 2 │
├────────┼────────────────────────────┼─────────────┼───────────────┤
│ node-2 │ Leader │ 36 │ majority │
└────────┴────────────────────────────┴─────────────┴───────────────┘
Key evidence — thread creation timestamps (from jstack elapsed times on
node-0):
grpc-default-worker-ELG-3-1 elapsed=7192.55s
grpc-default-worker-ELG-3-2 elapsed=7192.55s
...
grpc-default-worker-ELG-3-36 elapsed=7189.04s  ← all 36 created within 3.5 seconds at startup
CPU usage confirms most threads are idle (over 7192 seconds of life):
ELG-3-1 cpu=1487.26ms ← actually processing I/O
ELG-3-2 cpu=1012.09ms ← actually processing I/O
ELG-3-3 cpu=0.18ms ← idle (epollWait only)
ELG-3-4 cpu=0.18ms
...
ELG-3-36 cpu=35.02ms
The followers that were never restarted grew their thread counts gradually over
their entire lifetime (2 → 4 → 6), each increase corresponding to an individual
peer connection event.
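For reference, the per-thread cpu= and elapsed= figures above come from jstack
output on a recent JDK. An equivalent measurement can also be taken from inside
the server JVM with the standard ThreadMXBean API. The class below is an
illustrative sketch only (not part of Ratis); it prints the CPU time consumed
by each gRPC worker thread, which is how "Actively Used" was judged:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class GrpcWorkerCpuDump {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
      if (info != null && info.getThreadName().startsWith("grpc-default-worker-ELG")) {
        // getThreadCpuTime returns nanoseconds, or -1 if unsupported or the thread is gone.
        long cpuNanos = mx.getThreadCpuTime(info.getThreadId());
        System.out.printf("%s cpu=%.2fms%n", info.getThreadName(), cpuNanos / 1_000_000.0);
      }
    }
  }
}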
Expected Behavior
After catch-up completes, a restarted follower should have a similar number
of active ELG threads as a follower that was never restarted (~2–6), rather
than permanently retaining the full availableProcessors * 2 threads.
Root Cause Analysis
The issue stems from how Netty's EventLoopGroup works in combination with how
Ratis uses gRPC (a minimal sketch of the lazy-start behavior follows the list):
1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2
EventLoop objects, but threads are lazily started — an EventLoop's thread only
begins running when a channel is first registered to it.
2. During normal operation, new connections trickle in one at a time, and
gRPC's round-robin registration only touches a few EventLoops. Most EventLoops
never get a channel registered, so their threads are never started.
3. During restart catch-up, the leader sends a burst of log entries /
snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
opens client connections to multiple peers. This burst distributes channel
registrations across all
EventLoops, starting all 36 threads at once.
4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
infinite loop (EpollEventLoop.run()), and only terminates when
EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or resize
the EventLoopGroup after catch-up.
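The lazy-start behavior in points 1 and 4 can be demonstrated in isolation. The
class below is an illustrative sketch only, not Ratis code; it uses
NioEventLoopGroup so it runs on any platform, but the EpollEventLoopGroup used
by Ratis's gRPC transport behaves the same way:

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;

public class EventLoopLazyStartDemo {
  public static void main(String[] args) throws Exception {
    // Constructing the group allocates availableProcessors * 2 EventLoop objects,
    // but no threads are started yet.
    EventLoopGroup group = new NioEventLoopGroup();
    System.out.println("threads after construction: " + countGroupThreads());

    // The first task submitted to an EventLoop (analogous to registering a channel)
    // starts that one EventLoop's thread; the remaining EventLoops stay thread-less.
    group.next().execute(() -> { });
    Thread.sleep(200);
    System.out.println("threads after first use: " + countGroupThreads());

    // A started EventLoop thread only exits when the whole group is shut down.
    group.shutdownGracefully().sync();
  }

  private static long countGroupThreads() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().contains("nioEventLoopGroup"))
        .count();
  }
}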
Suggestion
Expose a configuration key in GrpcConfigKeys to allow users to control the
EventLoopGroup thread count, rather than relying on the Netty default
(availableProcessors * 2). For example:
// In GrpcConfigKeys.Server
String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default
Then in GrpcServicesImpl.Builder.newNettyServerBuilder():
NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
    .withChildOption(ChannelOption.SO_REUSEADDR, true)
    .maxInboundMessageSize(messageSizeMax.getSizeInt())
    .flowControlWindow(flowControlWindow.getSizeInt());
int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
if (workerThreads > 0) {
  // grpc-netty requires boss group, worker group, and channel type to be
  // provided together (or none of them), so all three are set here.
  nettyServerBuilder
      .bossEventLoopGroup(new EpollEventLoopGroup(1))
      .workerEventLoopGroup(new EpollEventLoopGroup(workerThreads))
      .channelType(EpollServerSocketChannel.class);
}
Similarly, the client-side NettyChannelBuilder in
GrpcServerProtocolClient.buildChannel() and
GrpcClientProtocolClient.buildChannel() could accept a configurable
EventLoopGroup.
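A sketch of what that client-side wiring could look like (hypothetical helper
and parameter names, not existing Ratis code); when a custom EpollEventLoopGroup
is supplied, the channel type must be set to match, and the caller owns the
group's lifecycle since gRPC will not shut down a user-supplied group:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;

final class CappedWorkerChannelFactory {
  // Builds a channel whose worker EventLoopGroup is capped at workerThreads;
  // 0 keeps the grpc-netty default shared group (availableProcessors * 2).
  static ManagedChannel buildChannel(String target, int flowControlWindow,
      int maxMessageSize, int workerThreads) {
    NettyChannelBuilder builder = NettyChannelBuilder.forTarget(target)
        .usePlaintext()
        .maxInboundMessageSize(maxMessageSize)
        .flowControlWindow(flowControlWindow);
    if (workerThreads > 0) {
      builder.eventLoopGroup(new EpollEventLoopGroup(workerThreads))
          .channelType(EpollSocketChannel.class);
    }
    return builder.build();
  }
}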
This would allow downstream projects (e.g., Apache IoTDB) to cap the thread
count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the
permanent thread inflation after restart.
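For illustration, a hypothetical downstream configuration with the proposed key
(the key does not exist yet; the literal string assumes
GrpcConfigKeys.Server.PREFIX resolves to "raft.grpc.server"):

import org.apache.ratis.conf.RaftProperties;

public class CapGrpcWorkers {
  public static void main(String[] args) {
    RaftProperties properties = new RaftProperties();
    // Hypothetical: cap the gRPC worker EventLoopGroup at 4 threads.
    properties.setInt("raft.grpc.server.worker.event-loop.threads", 4);
    // ... then pass properties to RaftServer.newBuilder().setProperties(properties).
  }
}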
--
This message was sent by Atlassian Jira
(v8.20.10#820010)