cmccabe commented on PR #16900:
URL: https://github.com/apache/kafka/pull/16900#issuecomment-2299868510

   > I don't want raft to forget or expose forgotten state. At one point raft 
knew that the leader was Y at epoch X. Later for the same epoch raft would 
report that the leader is unknown for epoch X. I don't want raft to expose 
that it forgot that the leader was Y at epoch X. You can think of this as 
loss of data. Raft needs to stay at epoch X because it is resigning. The 
expectation is that some other voter will increase the epoch to X + 1 and try 
to win the election.
   
   If you read the JavaDoc of `RaftClient.Listener.handleLeaderChange` it 
states that:
   
   > the implementation of method should expect this method will be called at 
most twice for each epoch. Once if the epoch changed but the leader is not 
known and once when the leader is known for the current epoch.
   
   To me this implies that the best thing to do would be to immediately fire 
off a callback with epoch = X + 1, leader = empty to let everyone know that we 
lost leadership. Then, later on once we figure out who the leader is, fire off 
another callback with epoch = X + 1, leader = new-leader.
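
   A minimal sketch of that callback ordering, for illustration only. The 
class and method names below are hypothetical stand-ins, not the real 
`RaftClient.Listener` API, and the sketch assumes the resigning node bumps 
the epoch itself, which is exactly the point under discussion:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical stand-in for the (leader, epoch) pair handed to the listener.
final class LeaderAndEpochSketch {
    final OptionalInt leaderId;
    final int epoch;

    LeaderAndEpochSketch(OptionalInt leaderId, int epoch) {
        this.leaderId = leaderId;
        this.epoch = epoch;
    }

    @Override
    public String toString() {
        String leader = leaderId.isPresent()
            ? Integer.toString(leaderId.getAsInt())
            : "unknown";
        return "epoch=" + epoch + ", leader=" + leader;
    }
}

// Hypothetical listener, loosely modelled on handleLeaderChange.
interface LeaderChangeListenerSketch {
    void handleLeaderChange(LeaderAndEpochSketch leaderAndEpoch);
}

// Hypothetical notifier showing the proposed order of the two callbacks.
final class ResignNotifierSketch {
    private final LeaderChangeListenerSketch listener;
    private int epoch;

    ResignNotifierSketch(int currentEpoch, LeaderChangeListenerSketch listener) {
        this.epoch = currentEpoch;
        this.listener = listener;
    }

    // First callback: leadership is lost, so report epoch X + 1 with no leader.
    void onResign() {
        epoch += 1;
        listener.handleLeaderChange(
            new LeaderAndEpochSketch(OptionalInt.empty(), epoch));
    }

    // Second callback: same epoch, now with the newly discovered leader.
    void onLeaderElected(int leaderId) {
        listener.handleLeaderChange(
            new LeaderAndEpochSketch(OptionalInt.of(leaderId), epoch));
    }
}

final class LeaderChangeDemo {
    // Runs the two-step sequence starting from epoch 7 and records each callback.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        ResignNotifierSketch notifier =
            new ResignNotifierSketch(7, e -> seen.add(e.toString()));
        notifier.onResign();
        notifier.onLeaderElected(2);
        return seen;
    }

    public static void main(String[] args) {
        // Prints "epoch=8, leader=unknown" and then "epoch=8, leader=2".
        run().forEach(System.out::println);
    }
}
```

   The listener sees at most two callbacks for epoch 8, which matches the 
"at most twice for each epoch" wording quoted above.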
   
   Now, the JavaDoc also states that "if this node is the leader, then the 
notification of leadership will be delayed" (why?)
   
   From a correctness point of view, we can certainly wait before delivering 
this callback, but I think it's suboptimal. You are keeping everyone outside 
the raft layer "in the dark" while you wait for the new leader to settle, which 
will result in them doing wasted work. For example, if you're doing ZK 
migration, the controller may keep trying to send out UpdateLeaderAndIsrRequest 
even after the rest of the cluster has moved on. The controller epoch should 
protect us, but... it feels suboptimal.
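
   For reference, the kind of epoch check I have in mind looks roughly like 
the sketch below. `EpochFenceSketch` and `accept` are hypothetical names for 
illustration, not the broker's actual code; the only assumption is that a 
receiver remembers the highest controller epoch it has seen and rejects 
anything older:

```java
// Hypothetical receiver-side fence keyed on the controller epoch.
final class EpochFenceSketch {
    private int highestSeenEpoch = -1;

    // Returns true if the request is accepted, false if it is fenced as stale.
    synchronized boolean accept(int requestControllerEpoch) {
        if (requestControllerEpoch < highestSeenEpoch) {
            return false; // sent by a controller that has since been superseded
        }
        highestSeenEpoch = requestControllerEpoch;
        return true;
    }
}
```

   So the stale requests are rejected, but the old controller still spends 
time building and sending them.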
   
   Anyway, let's get this bugfix in and discuss more next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
