Hi William,

Thanks a lot for the follow-up. Will check the detailed description in RATIS-1674.

Tsz-Wo

On Thu, Sep 22, 2022 at 10:51 PM William Song <[email protected]> wrote:
> Hi,
>
> We have new discoveries on
> https://issues.apache.org/jira/browse/RATIS-1674. We observe a lot of
> inconsistent AppendEntries and, finally, an OutOfDirectMemory error on the
> leader. For previous discussions, please refer to
> https://lists.apache.org/[email protected]:2022-7:DirectOOM and
> https://lists.apache.org/[email protected]:2022-8:DirectOOM.
>
> This time, after analyzing the logs and code carefully, we highly suspect
> that the problem is rooted in the gRPC Log Appender's AppendEntries sending
> and NextIndex updating mechanism.
>
> When a leader switch happens in a cluster containing a slow follower, the
> previous leader's pending AppendEntries queued in the slow follower will
> cause it to reply with a wrong NextIndex to the new leader, which starts an
> Inconsistent AE storm and finally leads to OOM.
>
> A detailed description is provided in
> https://issues.apache.org/jira/browse/RATIS-1674. Please help me to
> confirm this problem. Thanks in advance!
>
> Regards,
> William
