jsancio commented on PR #13765: URL: https://github.com/apache/kafka/pull/13765#issuecomment-1572275711
Thanks for your feedback @divijvaidya @dajac. I am replying to both of your comments in this message.

> With this change, we are changing the semantics of what a leadership epoch means. Prior to this change, leadership epoch is a version number representing membership of an ISR. As soon as membership changes, this version changes.

@divijvaidya, the old code increased the leader epoch when the ISR shrank but not when it expanded. My understanding is that we were doing this because the old replica manager used the leader epoch bump to invalidate old fetchers. During shutdown the fetchers needed to be invalidated to avoid having them rejoin the ISR. With KIP-841, this is no longer necessary because we can reject brokers that are shutting down from joining the ISR and modifying the HWM. Part of the code for doing this already exists; what was missing, and what part of this PR fixes, is considering this state when advancing the HWM. The partition leader should not include shutting-down replicas when determining the HWM.

> After this change, the definition has changed to - leadership epoch is a version number that represents member of an ISR "in some cases". As you can see, the new definition has added ifs and buts to the simple definition above. Hence, I am not in favour of changing this.

@divijvaidya, for correctness, the main requirement is that the leader epoch is increased whenever the leader changes. This is needed for log truncation and reconciliation. For log consistency, log truncation and reconciliation assume that the tuples (offset, epoch) are unique per topic partition and that if the tuple (offset, epoch) matches in two replicas then their logs up to that offset also match. In my opinion, for correctness Kafka doesn't require that the leader epoch is increased when the ISR changes.

> As you can see there, we only reset the followers' states when the leader epoch is bumped. I suppose that this is why you stumbled upon this issue with having shutting down replicas holding back advancing the HWM. The issue is that the shutting down replica's state is not reset so it remains caught-up for replica.lag.time.max.ms. I think that we need to update Partition.makeLeader to always update the followers' states. Obviously, we also need your changes to not consider fenced and shutting down replicas in the HWM computation.

@dajac, yes, I thought about this when I implemented this feature. I decided against it because the follower (the shutting-down replica) is technically "caught up" to the leader; we simply don't want the leader to wait for that replica when computing the HWM, since we know it will soon shut down its fetchers.

> I wonder if we really need this if we change Partition.makeLeader as explained. It seems to me that the change in Partition.makeLeader and Partition.maybeIncrementLeaderHW together should be backward compatible. What do you think?

We need the MV check in the controller even with your suggestion. The question is: "When is it beneficial for the controller to not increase the leader epoch when a replica is removed from the ISR because of shutdown?" This is only the case when the controller knows that the brokers have the replica manager fixes in this PR, which is guaranteed if the MV is greater than the MV introduced in this PR. If the brokers don't contain the fixes in this PR and the controller doesn't bump the leader epoch, PRODUCE requests will time out because the HWM increase will be delayed.
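
To make the HWM point above concrete, here is a minimal, self-contained sketch of the idea that only eligible replicas hold back the high watermark. This is not Kafka's `Partition.maybeIncrementLeaderHW`; `ReplicaView` and its fields are hypothetical names used purely for illustration.

```scala
// Minimal sketch, not Kafka's Partition.maybeIncrementLeaderHW.
// ReplicaView and its fields are hypothetical names for illustration only.
final case class ReplicaView(
  brokerId: Int,
  logEndOffset: Long,
  isCaughtUp: Boolean,     // fetched within replica.lag.time.max.ms of the leader
  isFenced: Boolean,
  isShuttingDown: Boolean  // broker is in controlled shutdown (KIP-841 state)
)

object HighWatermarkSketch {
  // The HWM can only advance to an offset that every eligible replica has
  // replicated. Fenced and shutting-down replicas are excluded so they do
  // not hold back the HWM while their fetchers are being torn down.
  def compute(leaderLogEndOffset: Long, followers: Seq[ReplicaView]): Long = {
    val eligibleOffsets = followers
      .filter(r => r.isCaughtUp && !r.isFenced && !r.isShuttingDown)
      .map(_.logEndOffset)
    (leaderLogEndOffset +: eligibleOffsets).min
  }
}
```

Note that a shutting-down follower can still be "caught up" in this sketch; it is simply left out of the minimum, which is the behavior argued for above.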
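For the (offset, epoch) invariant mentioned above, here is a small sketch assuming well-formed epoch histories (each epoch range starts where the previous one ends). It is not the OffsetsForLeaderEpoch protocol, and `EpochEntry` is a hypothetical name; it only illustrates that a replica can keep the longest log prefix on which its epoch history agrees with the leader's and must truncate everything after it.

```scala
// Illustration of the reconciliation invariant, not Kafka's fetcher logic.
// EpochEntry(epoch, endOffset) is a hypothetical, simplified history entry.
final case class EpochEntry(epoch: Int, endOffset: Long)

object ReconciliationSketch {
  // Returns the end of the longest log prefix on which both replicas agree,
  // given each replica's epoch history sorted by epoch.
  def agreedPrefixEnd(leader: List[EpochEntry], follower: List[EpochEntry]): Long =
    (leader, follower) match {
      case (l :: leaderTail, f :: followerTail)
          if l.epoch == f.epoch && l.endOffset == f.endOffset =>
        // Identical epoch range on both replicas: the logs match at least this far.
        math.max(l.endOffset, agreedPrefixEnd(leaderTail, followerTail))
      case (l :: _, f :: _) if l.epoch == f.epoch =>
        // Same epoch but different end offsets: agreement stops at the shorter range.
        math.min(l.endOffset, f.endOffset)
      case _ =>
        // The histories diverge at this epoch (or one side has no more entries).
        0L
    }
}
```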
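Finally, a sketch of the MV gate described in the last paragraph. This is not the KRaft controller's code and the parameter names are hypothetical; the point is only that the controller may skip the leader epoch bump on a shutdown-driven ISR shrink exclusively when the metadata version guarantees every broker carries the replica-manager fix in this PR.

```scala
// Sketch of the controller-side gate; not the actual KRaft controller code.
// Both metadata-version parameters are hypothetical names for illustration.
object LeaderEpochBumpSketch {
  // Whether removing a shutting-down replica from the ISR must also bump the
  // leader epoch. Brokers on an older MV rely on the bump to reset follower
  // state; skipping it there would stall the HWM and time out PRODUCE requests.
  def mustBumpEpochOnShutdownShrink(clusterMv: Int, fixIntroducedAtMv: Int): Boolean =
    clusterMv < fixIntroducedAtMv
}
```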