Hi, 

We have new findings on https://issues.apache.org/jira/browse/RATIS-1674.
We observe a large number of inconsistent AppendEntries, followed eventually
by an OutOfDirectMemory error on the leader.
For previous discussions, please refer to 
https://lists.apache.org/[email protected]:2022-7:DirectOOM and 
https://lists.apache.org/[email protected]:2022-8:DirectOOM.

This time, after analyzing the logs and code carefully, we strongly suspect 
that the problem is rooted in the gRPC Log Appender's mechanism for sending 
AppendEntries and updating NextIndex. 

When a leader switch happens in a cluster containing a slow follower, the 
previous leader’s pending AppendEntries, still queued in the slow follower, 
cause it to reply with a wrong NextIndex to the new leader. This starts a 
storm of inconsistent AppendEntries and finally leads to OOM.
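
To make the suspected feedback loop concrete, here is a toy model in Java. 
It is not Ratis code: the class, method, and parameter names are all made up 
for illustration, and representing the wrong NextIndex as a fixed offset 
(hintError) is only a simplifying assumption. It just counts how many 
INCONSISTENCY replies a pipelining leader would receive when the hint it 
gets back is correct versus wrong.

```java
/** Toy model only; none of these names exist in Ratis. */
final class ToyAeStorm {

    /**
     * Counts the INCONSISTENCY replies a leader receives.
     *
     * @param followerNext index the follower really expects next
     * @param hintError    assumed error in the NextIndex the follower reports
     *                     (0 means the hint is correct)
     * @param leaderNext   NextIndex the new leader starts with for this follower
     * @param window       requests the leader pipelines per round without waiting
     * @param maxRounds    rounds to simulate before giving up
     */
    static int countInconsistentReplies(long followerNext, long hintError,
                                        long leaderNext, int window, int maxRounds) {
        int inconsistent = 0;
        long next = leaderNext;
        for (int round = 0; round < maxRounds; round++) {
            boolean rejected = false;
            for (int i = 0; i < window; i++) {
                long firstIndex = next + i;   // first entry index of request i
                if (firstIndex == followerNext) {
                    followerNext++;           // entry accepted and appended
                } else {
                    inconsistent++;           // follower replies INCONSISTENCY
                    rejected = true;
                }
            }
            if (!rejected) {
                return inconsistent;          // leader and follower agree again
            }
            next = followerNext + hintError;  // leader resets to the reported hint
        }
        return inconsistent;
    }

    public static void main(String[] args) {
        // Correct hint: one rejected round, then the leader recovers.
        System.out.println(countInconsistentReplies(5, 0, 10, 4, 10)); // prints 4
        // Wrong hint (off by 3): every round is rejected again.
        System.out.println(countInconsistentReplies(5, 3, 10, 4, 10)); // prints 40
    }
}
```

With a correct hint the toy leader recovers after a single rejected round; 
with a wrong hint the rejections keep growing with every round. The model 
does not cover memory usage at all, only the retry loop.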

A detailed description is provided in 
https://issues.apache.org/jira/browse/RATIS-1674. Please help me confirm 
this problem. Thanks in advance!

Regards,
William
