[ 
https://issues.apache.org/jira/browse/RATIS-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071329#comment-18071329
 ] 

Ivan Andika commented on RATIS-2498:
------------------------------------

One possible cause, found while analyzing testAppendEntriesTimeout: a
pre-existing race between leader step-down and the client's sliding window.

  Here's the timeline:
  1. s2 becomes leader (term 1). The test blocks WRITE_STATE_MACHINE_DATA on
     both followers (s0 and s1); the filter at line 436 keeps every server
     whose ID != leader.
  2. The client sends a dummy Watch (seq=1); sendDummyRequest is enabled by
     default (line 123 of OrderedAsync). The request goes to s0, gets a
     NotLeaderException, is retried to s2 (the leader), and completes there.
     s2's sliding window processes seq=1. s0's sliding window never sees
     seq=1, because s0 rejected the request before it reached the sliding
     window.
  3. The client sends the write "abc" (seq=2) to s2. But the followers can't
     respond to AppendEntries: writeStateMachineData is blocked, which
     prevents them from acknowledging the log entry.
  4. s2 loses leadership after 544ms (LOST_MAJORITY_HEARTBEATS). The election
     timeout is 300ms, and both followers are blocked. s2 cannot become
     leader again for 10s (lostMajorityHeartbeatsRecently).
  5. At +5s, the test unblocks `WRITE_STATE_MACHINE_DATA`. At this moment
     there is no leader, so the unblock filter actually works correctly
     (both s0 and s1 are unblocked).
  6. s0 becomes leader (term 2). The "abc" entry from term 1 gets committed.
     The cluster is healthy (all servers at commit index 3).
  7. The client retries seq=2 to s0 (the new leader). But s0's server-side
     sliding window has never processed seq=1 for this client (the dummy
     Watch was only completed on s2). The sliding window queues seq=2 and
     waits for seq=1, which never arrives. The request hangs until the Netty
     RPC timeout (3s), is retried, and the cycle repeats forever.

  In short: the dummy Watch (seq=1) was processed on the old leader (s2), but
  the new leader (s0) never saw it. The server-side sliding window on s0
  blocks seq=2 while waiting for a seq=1 that will never come. This is a
  pre-existing race condition: whenever the leader steps down during this
  test (which is likely, given that both followers are blocked and the
  election timeout is only 300ms), the client gets stuck in this loop.
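
  To make the suspected mechanism concrete, here is a minimal sketch of an
  in-order, per-client window. It is a hypothetical model, not Ratis's actual
  SlidingWindow implementation: the class name SlidingWindowSketch, the
  Window type, and its methods are made up for illustration, and it assumes
  the new leader's window starts out expecting seq=1, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical, simplified model of a per-client server-side sliding window. */
public class SlidingWindowSketch {

  /** Delivers requests strictly in sequence-number order. */
  static final class Window {
    private long nextToProcess;
    private final TreeMap<Long, String> pending = new TreeMap<>();

    Window(long firstExpectedSeq) {
      this.nextToProcess = firstExpectedSeq;
    }

    /** Accepts one request and returns the requests that became deliverable. */
    List<String> receive(long seq, String request) {
      List<String> delivered = new ArrayList<>();
      if (seq < nextToProcess) {
        return delivered; // already processed; ignore the duplicate
      }
      pending.put(seq, request);
      // Drain only while the smallest queued seq is exactly the expected one.
      while (!pending.isEmpty() && pending.firstKey() == nextToProcess) {
        delivered.add(pending.pollFirstEntry().getValue());
        nextToProcess++;
      }
      return delivered;
    }

    int pendingCount() {
      return pending.size();
    }
  }

  public static void main(String[] args) {
    // Old leader (s2): sees the dummy Watch and then the write, in order.
    Window s2 = new Window(1);
    System.out.println("s2 delivers " + s2.receive(1, "watch"));
    System.out.println("s2 delivers " + s2.receive(2, "abc"));

    // New leader (s0): only ever receives the retried seq=2.
    Window s0 = new Window(1);
    System.out.println("s0 delivers " + s0.receive(2, "abc")
        + ", still queued: " + s0.pendingCount());
  }
}
```

  In this model s0 delivers nothing for seq=2 and keeps it queued, and every
  retry of seq=2 lands in the same state, which matches the hang described in
  step 7. Whether the real server-side window is initialized this way after a
  leader change is an assumption that still needs to be checked against the
  code.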

> Fix flaky TestRaftAsyncWithNetty
> --------------------------------
>
>                 Key: RATIS-2498
>                 URL: https://issues.apache.org/jira/browse/RATIS-2498
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Ivan Andika
>            Priority: Major
>
> TestRaftAsyncWithNetty has recently become flaky. The failures span
> multiple tests under RaftAsyncWithNetty.
>  
> [https://github.com/apache/ratis/actions/runs/23957268626/job/69878546984]
> [https://github.com/apache/ratis/actions/runs/23956477384/job/69875932130]
> [https://github.com/apache/ratis/actions/runs/23739689985/job/69153331550]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
