Yongzao Dan created RATIS-2529:
----------------------------------

             Summary: gRPC worker threads permanently inflate to 
availableProcessors×2 after follower restart catch-up
                 Key: RATIS-2529
                 URL: https://issues.apache.org/jira/browse/RATIS-2529
             Project: Ratis
          Issue Type: Improvement
          Components: gRPC
    Affects Versions: 3.2.2
            Reporter: Yongzao Dan


When a Raft follower node restarts and rejoins the cluster, the catch-up phase
creates a burst of concurrent gRPC streams (log replication, snapshot transfer,
voting, etc.). This burst activates nearly all Netty EventLoop threads in the
EventLoopGroup (default size: availableProcessors * 2). After catch-up
completes, these threads never exit — they remain in EpollEventLoop.run(),
performing empty epollWait calls indefinitely, even though only 2–3 of them
carry any real I/O.

  In contrast, a follower that has been running since initial cluster startup
has only 2–6 EventLoop threads active, because its connections were established
gradually over time and activated only a few EventLoops.

  Environment

  - Ratis version: 3.2.2
  - Cluster: 5 Raft nodes, 18-core machines (so availableProcessors * 2 = 36)
  - Transport: gRPC (Netty-based, using EpollEventLoopGroup)

  Steps to Reproduce

  1. Start a 5-node Raft cluster. Wait for it to stabilize.
  2. Observe grpc-default-worker-ELG-* threads on each follower via jstack —
typically 2–6 threads.
  3. Kill 2 follower nodes. Wait a few minutes.
  4. Restart them. They will rejoin the cluster and go through the catch-up
phase.
  5. After catch-up is done and the cluster is stable again, observe
grpc-default-worker-ELG-* threads on the restarted followers (see the counting
sketch after this list).
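
  As a rough in-process alternative to jstack, the ELG thread count can be
sampled with plain JDK APIs. This is only an illustration; the thread-name
prefix is the default grpc-netty naming seen in the dumps below, not a Ratis
API:

  // Count live gRPC worker EventLoop threads. The name prefix
  // "grpc-default-worker-ELG" is the default grpc-netty naming observed in
  // the jstack output; adjust it if a custom thread factory is in use.
  long elgThreads = Thread.getAllStackTraces().keySet().stream()
      .filter(t -> t.getName().startsWith("grpc-default-worker-ELG"))
      .count();
  System.out.println("grpc-default-worker-ELG threads: " + elgThreads);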

  Observed Behavior

  Thread counts from a production cluster (collected via jstack):

  ┌────────┬────────────────────────────┬─────────────┬───────────────┐
  │  Node  │            Role            │ ELG Threads │ Actively Used │
  ├────────┼────────────────────────────┼─────────────┼───────────────┤
  │ node-3 │ Follower (never restarted) │ 6           │ 2             │
  ├────────┼────────────────────────────┼─────────────┼───────────────┤
  │ node-4 │ Follower (never restarted) │ 6           │ 2             │
  ├────────┼────────────────────────────┼─────────────┼───────────────┤
  │ node-0 │ Follower (restarted)       │ 36          │ 2             │
  ├────────┼────────────────────────────┼─────────────┼───────────────┤
  │ node-1 │ Follower (restarted)       │ 35          │ 2             │
  ├────────┼────────────────────────────┼─────────────┼───────────────┤
  │ node-2 │ Leader                     │ 36          │ majority      │
  └────────┴────────────────────────────┴─────────────┴───────────────┘

  Key evidence — thread creation timestamps (from jstack elapsed times on node-0):

  grpc-default-worker-ELG-3-1   elapsed=7192.55s
  grpc-default-worker-ELG-3-2   elapsed=7192.55s
  ...
  grpc-default-worker-ELG-3-36  elapsed=7189.04s   ← all 36 created within 3.5 seconds at startup

  CPU usage confirms most threads are idle (over 7192 seconds of life):

  ELG-3-1   cpu=1487.26ms   ← actually processing I/O
  ELG-3-2   cpu=1012.09ms   ← actually processing I/O
  ELG-3-3   cpu=0.18ms      ← idle (epollWait only)
  ELG-3-4   cpu=0.18ms
  ...
  ELG-3-36  cpu=35.02ms

  Normal followers grew their threads gradually over their entire lifetime (2 → 
4 → 6), corresponding to individual peer connection events.

  Expected Behavior

  After catch-up completes, a restarted follower should have a similar number 
of active ELG threads as a follower that was never restarted (~2–6), rather 
than permanently retaining the full availableProcessors * 2 threads.

  Root Cause Analysis

  The issue stems from how Netty EventLoopGroup works combined with Ratis's 
gRPC usage:

  1. EventLoopGroup is created with a fixed capacity of availableProcessors * 2 
EventLoop objects, but threads are lazily started — an EventLoop's thread only 
begins running when a channel is first registered to it.
  2. During normal operation, new connections trickle in one at a time, and 
gRPC's round-robin registration only touches a few EventLoops. Most EventLoops 
never get a channel registered, so their threads are never started.
  3. During restart catch-up, the leader sends a burst of log entries /
snapshot chunks via GrpcLogAppender, and the restarted node simultaneously
opens client connections to multiple peers. This burst distributes channel
registrations across all EventLoops, starting all 36 threads at once.
  4. EventLoop threads never exit. Once started, a Netty EventLoop runs an
infinite loop (EpollEventLoop.run()) and terminates only when
EventLoopGroup.shutdownGracefully() is called. Ratis does not rebuild or resize
the EventLoopGroup after catch-up (see the sketch below).
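
  The lazy-start and never-exit behavior can be reproduced with plain Netty,
independently of Ratis. The following is a minimal demonstration only; it
assumes netty-transport-native-epoll on a Linux host, and the thread-name
filter relies on Netty's default pool naming:

  import io.netty.channel.epoll.EpollEventLoopGroup;
  import io.netty.channel.epoll.EpollSocketChannel;

  public class EventLoopLazyStartDemo {
    private static long epollThreads() {
      // Netty's DefaultThreadFactory names these threads "epollEventLoopGroup-N-M".
      return Thread.getAllStackTraces().keySet().stream()
          .filter(t -> t.getName().contains("epollEventLoopGroup"))
          .count();
    }

    public static void main(String[] args) throws Exception {
      // Fixed capacity of 36 EventLoops, but no threads are started yet (lazy start).
      EpollEventLoopGroup group = new EpollEventLoopGroup(36);
      System.out.println("after construction: " + epollThreads());       // 0

      // Registering a channel starts the EventLoop thread it lands on.
      group.register(new EpollSocketChannel()).sync();
      System.out.println("after first registration: " + epollThreads()); // 1

      // Started threads loop in EpollEventLoop.run() until the whole group is
      // shut down; Ratis never does this after catch-up, so they persist.
      group.shutdownGracefully().sync();
    }
  }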

  Suggestion

  Expose a configuration key in GrpcConfigKeys to allow users to control the 
EventLoopGroup thread count, rather than relying on the Netty default 
(availableProcessors * 2). For example:

  // In GrpcConfigKeys.Server
  String WORKER_EVENT_LOOP_THREADS_KEY = PREFIX + ".worker.event-loop.threads";
  int WORKER_EVENT_LOOP_THREADS_DEFAULT = 0; // 0 = use Netty default
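
  A matching accessor would also be needed. The sketch below is only an
assumption about its shape; the real patch should follow the ConfUtils getter
pattern used by the existing GrpcConfigKeys entries:

  // Hypothetical accessor (shape is an assumption, not existing Ratis code).
  static int workerEventLoopThreads(RaftProperties properties) {
    return properties.getInt(WORKER_EVENT_LOOP_THREADS_KEY,
        WORKER_EVENT_LOOP_THREADS_DEFAULT);
  }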

  Then in GrpcServicesImpl.Builder.newNettyServerBuilder():

  NettyServerBuilder nettyServerBuilder = NettyServerBuilder.forAddress(address)
      .withChildOption(ChannelOption.SO_REUSEADDR, true)
      .maxInboundMessageSize(messageSizeMax.getSizeInt())
      .flowControlWindow(flowControlWindow.getSizeInt());

  int workerThreads = GrpcConfigKeys.Server.workerEventLoopThreads(properties);
  if (workerThreads > 0) {
      // Note: depending on the grpc-netty version, supplying a custom worker
      // group may also require a matching channelType (e.g.
      // EpollServerSocketChannel.class) and boss group to be set explicitly.
      nettyServerBuilder.workerEventLoopGroup(new EpollEventLoopGroup(workerThreads));
  }

  Similarly, the client-side NettyChannelBuilder in 
GrpcServerProtocolClient.buildChannel() and 
GrpcClientProtocolClient.buildChannel() could accept a configurable 
EventLoopGroup.
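
  A rough client-side sketch, under the same caveat (peerAddress and
clientWorkerThreads are placeholders for values Ratis would supply, not
existing fields):

  // Hypothetical client-side wiring; eventLoopGroup() and channelType() are
  // existing grpc-netty NettyChannelBuilder methods, while the surrounding
  // plumbing is an assumption.
  NettyChannelBuilder channelBuilder = NettyChannelBuilder.forTarget(peerAddress);
  if (clientWorkerThreads > 0) {
      channelBuilder
          .eventLoopGroup(new EpollEventLoopGroup(clientWorkerThreads))
          .channelType(EpollSocketChannel.class);
  }
  ManagedChannel channel = channelBuilder.build();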

  This would allow downstream projects (e.g., Apache IoTDB) to cap the thread 
count at a reasonable value (e.g., 4–8) for follower nodes, avoiding the 
permanent thread inflation after restart.
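
  A downstream project could then cap the pool when building its
RaftProperties, for example (the key string assumes the proposed name above
under the GrpcConfigKeys.Server "raft.grpc.server" prefix; it does not exist
in current Ratis releases):

  // Hypothetical usage of the proposed key on a follower node.
  RaftProperties properties = new RaftProperties();
  properties.set("raft.grpc.server.worker.event-loop.threads", "4");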



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
