Hi,

We have new findings on https://issues.apache.org/jira/browse/RATIS-1674. We observe a large number of inconsistent AppendEntries and, eventually, an OutOfDirectMemory error on the leader. For previous discussions, please refer to https://lists.apache.org/[email protected]:2022-7:DirectOOM and https://lists.apache.org/[email protected]:2022-8:DirectOOM.
This time, after carefully analyzing the logs and code, we strongly suspect that the problem is rooted in the gRPC Log Appender's mechanism for sending AppendEntries and updating NextIndex. When a leader switch happens in a cluster containing a slow follower, the previous leader's pending AppendEntries, still queued in the slow follower, cause it to reply with a wrong NextIndex to the new leader. This starts a storm of inconsistent AppendEntries and finally leads to OOM. A detailed description is provided in https://issues.apache.org/jira/browse/RATIS-1674.

Please help me confirm this problem. Thanks in advance!

Regards,
William
