Hi Maros,

Thanks for your reply. The following is my understanding.

The reason is that controller.quorum.fetch.timeout.ms is used here (1):
<https://github.com/apache/kafka/blob/2c9180e8d181798bf55e37801cb16d79c36f9ff4/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L553>

and the leader sets the BeginQuorum request timeout here (2):
<https://github.com/apache/kafka/blob/6b187e9ff9374711a10452cc3aaa903837d937ba/raft/src/main/java/org/apache/kafka/raft/LeaderState.java#L156>

(2) is derived from (1), and the Raft max fetch wait time (RAFT_MAX_FETCH_WAIT_MS) is 500ms, so we need to set controller.quorum.fetch.timeout.ms to at least 1000ms.

Another point: controller.quorum.fetch.timeout.ms must leave enough time for a full round trip, even when the network latency is as large as the Raft max fetch wait time.

To refine your example, assume the network latency is always 100ms:

T=0ms: Follower sends a fetch request and starts a 500ms *fetchTimer*
T=500ms: Leader has not received a fetch request (RAFT_MAX_FETCH_WAIT_MS, e.g. 500), so two things happen:
  1) Leader sends a BeginQuorum request to the follower
  2) Follower sends a fetch request to the leader and starts its *fetchTimer*
T=600ms: four cases
  BeginQuorum request:
  1) the follower receives the BeginQuorum request and responds to it (100ms network latency) -> leader still works
  2) the follower does not receive the BeginQuorum request -> leader MAY not be working
  Fetch request:
  1) the leader receives the fetch request and responds to it (100ms network latency) -> follower still works
  2) the leader does not receive the fetch request -> follower MAY not be working
T=800ms: Follower's fetch timer expires.

Then the follower transitions to the *Prospective state* and triggers an election... (wasted resources even though the leader is working fine).

From the above flow, we give the leader and the follower two chances to check each other's state, so the algorithm works (one is BeginQuorum, the other is fetch, but their starting points differ).
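
The arithmetic behind the 1000ms lower bound can be sketched in Java. This is only an illustration, not Kafka code: the class and method names are hypothetical, and the sketch assumes the BeginQuorum timeout is half of the fetch timeout, which is one reading of the LeaderState line linked in (2) and should be checked against the source.

```java
// Hypothetical sketch of the timeout arithmetic; not part of Kafka.
final class FetchTimeoutSketch {
    // Max fetch wait discussed in this thread (500ms).
    static final int MAX_FETCH_WAIT_MS = 500;

    // Assumption: the BeginQuorum timeout is derived as half the fetch timeout.
    static int beginQuorumTimeoutMs(int fetchTimeoutMs) {
        return fetchTimeoutMs / 2;
    }

    // True when the derived BeginQuorum timeout is not shorter than the max fetch wait.
    static boolean coversMaxFetchWait(int fetchTimeoutMs) {
        return beginQuorumTimeoutMs(fetchTimeoutMs) >= MAX_FETCH_WAIT_MS;
    }

    // Request latency + full wait on the leader + response latency.
    static int worstCaseFetchRoundTripMs(int oneWayLatencyMs) {
        return oneWayLatencyMs + MAX_FETCH_WAIT_MS + oneWayLatencyMs;
    }

    // Time left on the fetch timer when the response arrives (negative means expired).
    static int slackMs(int fetchTimeoutMs, int oneWayLatencyMs) {
        return fetchTimeoutMs - worstCaseFetchRoundTripMs(oneWayLatencyMs);
    }

    public static void main(String[] args) {
        int latencyMs = 100; // same latency as in the timeline above
        for (int fetchTimeoutMs : new int[] {800, 1000}) {
            System.out.printf(
                "fetchTimeout=%dms beginQuorumTimeout=%dms coversMaxFetchWait=%b slack=%dms%n",
                fetchTimeoutMs,
                beginQuorumTimeoutMs(fetchTimeoutMs),
                coversMaxFetchWait(fetchTimeoutMs),
                slackMs(fetchTimeoutMs, latencyMs));
        }
    }
}
```

Under these assumptions, 800ms yields a derived BeginQuorum timeout below the 500ms max fetch wait, while 1000ms is the smallest value that matches it.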

Best,
TaiJu Wu

Maroš Orsák <[email protected]> wrote on Wed, Nov 12, 2025 at 10:54 PM:

> Hi,
>
> So basically, one of the examples of how we could violate such an invariant
> is that when, f.e., `controller.quorum.fetch.timeout.ms = 800ms`:
>
> T=0ms: Follower sends fetch request and starts 800ms *fetchTimer*
> T=100ms: Leader receives request (100ms network latency)
> ...
> T=600ms: Leader waits 500ms for new data (MAX_FETCH_WAIT_MS)
> T=600ms: Leader sends response
> T=700ms: Response arrives at follower (100ms network latency)
> ...
> T=800ms: Follower *fetchTimer* expires.
>
> then the follower transitions to the *Prospective state*, and it triggers
> election... (wasted resources even leader is working fine).
>
> Is my understanding correct? If so, maybe adding such an example would be
> great to add to KIP?
>
> Maros.
>
