Hi William,

Thanks a lot for the follow-up. I will check the detailed description
in RATIS-1674.

Tsz-Wo

On Thu, Sep 22, 2022 at 10:51 PM William Song <[email protected]> wrote:

> Hi,
>
> We have new findings on
> https://issues.apache.org/jira/browse/RATIS-1674. We observe a lot of
> inconsistent AppendEntries and, eventually, an OutOfDirectMemory error on
> the leader. For the previous discussions, please refer to
> https://lists.apache.org/[email protected]:2022-7:DirectOOM and
> https://lists.apache.org/[email protected]:2022-8:DirectOOM.
>
> This time, after analyzing the logs and code carefully, we strongly
> suspect that the problem is rooted in the gRPC Log Appender's
> AppendEntries sending and NextIndex updating mechanism.
>
> When a leader switch happens in a cluster containing a slow follower, the
> previous leader’s pending AppendEntries queued in the slow follower cause
> it to reply with a wrong NextIndex to the new leader, which starts an
> inconsistent AppendEntries (AE) storm and finally leads to OOM.
>
> A detailed description is provided in
> https://issues.apache.org/jira/browse/RATIS-1674. Please help me confirm
> this problem. Thanks in advance!
>
> Regards,
> William
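
The scenario in the quoted message can be illustrated with a toy model. The
sketch below is not Ratis code and does not claim to reproduce the root cause
recorded in RATIS-1674. Every name in it (StaleQueueSketch, Follower,
AppendEntries, simulate) is hypothetical, and the leader's policy of backing
off by one index after a failed round is an assumption made only to keep the
example small. It shows how stale requests that a slow follower drains after a
leader switch can inflate the NextIndex it reports, and how a pipelined sender
then collects one inconsistency reply per in-flight request.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not the Ratis API.
class StaleQueueSketch {
    /** Simplified AppendEntries: every entry carries the sender's term. */
    static final class AppendEntries {
        final int term, prevIndex, prevTerm, count;
        AppendEntries(int term, int prevIndex, int prevTerm, int count) {
            this.term = term; this.prevIndex = prevIndex;
            this.prevTerm = prevTerm; this.count = count;
        }
    }

    /** Toy follower: log.get(i) holds the term of the entry at index i + 1. */
    static final class Follower {
        final List<Integer> log = new ArrayList<>();

        int nextIndex() { return log.size() + 1; }

        /** Returns true if appended, false for an "inconsistent" reply. */
        boolean process(AppendEntries ae) {
            boolean prevMatches = ae.prevIndex == 0
                || (ae.prevIndex <= log.size()
                    && log.get(ae.prevIndex - 1) == ae.prevTerm);
            if (!prevMatches) {
                return false;
            }
            // Drop any conflicting suffix, then append the new entries.
            while (log.size() > ae.prevIndex) {
                log.remove(log.size() - 1);
            }
            for (int i = 0; i < ae.count; i++) {
                log.add(ae.term);
            }
            return true;
        }
    }

    /** Counts inconsistent replies seen by the new leader (term 2). */
    static int simulate(int staleRequests, int window) {
        final int shared = 4; // entries 1..4 (term 1) exist on both sides
        Follower f = new Follower();
        for (int i = 0; i < shared; i++) f.log.add(1);

        // Assumption: the slow follower drains the old leader's (term 1)
        // queued requests, two entries each, before it sees term 2.
        for (int i = 0; i < staleRequests; i++) {
            f.process(new AppendEntries(1, shared + 2 * i, 1, 2));
        }

        // New leader's log: the shared prefix, then its own term-2 entries.
        List<Integer> leaderLog = new ArrayList<>();
        for (int i = 0; i < shared; i++) leaderLog.add(1);
        for (int i = 0; i < 2 * staleRequests + window; i++) leaderLog.add(2);

        int next = f.nextIndex(); // inflated by the stale entries
        int inconsistent = 0;
        while (true) {
            boolean roundOk = true;
            // Pipelining: send `window` single-entry requests back to back.
            for (int i = 0; i < window; i++) {
                int prev = next - 1 + i;
                int prevTerm = leaderLog.get(prev - 1);
                if (!f.process(new AppendEntries(2, prev, prevTerm, 1))) {
                    inconsistent++;
                    roundOk = false;
                }
            }
            if (roundOk) {
                return inconsistent;
            }
            next--; // assumed back-off policy: retry one index earlier
        }
    }

    public static void main(String[] args) {
        System.out.println("no stale requests:  " + simulate(0, 8));
        System.out.println("two stale requests: " + simulate(2, 8));
    }
}
```

In this model a follower with no stale requests produces no inconsistent
replies, while two stale requests (four stale entries) with a window of 8
produce 32, since each of the four failed rounds rejects all eight in-flight
requests. In a real system each rejected request also holds buffers while it
is in flight, which is one plausible way such a storm could put pressure on
direct memory.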
