jsancio commented on PR #13765:
URL: https://github.com/apache/kafka/pull/13765#issuecomment-1572275711

   Thanks for your feedback @divijvaidya @dajac. I am replying to both of your 
comments in this message.
   > With this change, we are changing the semantics of what a leadership epoch 
means. Prior to this change, leadership epoch is a version number representing 
membership of an ISR. As soon as membership changes, this version changes.
   
   @divijvaidya, the old code increased the leader epoch when the ISR shrank 
but not when the ISR expanded. My understanding is that we did this because the 
old replica manager used the leader epoch bump to invalidate old fetchers. 
During shutdown the fetchers needed to be invalidated to avoid having them 
rejoin the ISR. With KIP-841, this is no longer necessary because we can reject 
brokers that are shutting down from joining the ISR and from modifying the HWM.
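
   As a minimal, self-contained sketch of the KIP-841 behavior referenced above 
(this is not the controller's actual code and the names are illustrative): an 
ISR expansion that includes a fenced or shutting-down broker is simply 
rejected, so a leader epoch bump is no longer needed to keep that broker out of 
the ISR.
   
   ```scala
   // Hypothetical sketch, not Kafka's controller API.
   object IsrExpansionCheck {
     sealed trait BrokerState
     case object Active extends BrokerState
     case object Fenced extends BrokerState
     case object ControlledShutdown extends BrokerState
   
     /** Accept the proposed ISR only if every replica in it is an active broker. */
     def isValidIsrExpansion(proposedIsr: Set[Int], brokerStates: Map[Int, BrokerState]): Boolean =
       proposedIsr.forall(brokerId => brokerStates.get(brokerId).contains(Active))
   }
   ```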
   
   Part of the code for doing this already exists. What was missing, and what 
part of this PR fixes, is taking this state into account when advancing the 
HWM: the partition leader should not include shutting-down replicas when 
determining the HWM.
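
   For illustration, here is a toy version of the HWM rule argued for above 
(the types and names are hypothetical, not Kafka's API): only ISR members that 
are neither fenced nor shutting down contribute to the minimum.
   
   ```scala
   // Hypothetical sketch of the HWM eligibility filter, not Partition.maybeIncrementLeaderHW.
   object HighWatermark {
     final case class ReplicaState(
       brokerId: Int,
       logEndOffset: Long,
       isrMember: Boolean,
       shuttingDown: Boolean,
       fenced: Boolean
     )
   
     def compute(leaderLogEndOffset: Long, followers: Seq[ReplicaState]): Long = {
       val eligibleOffsets = followers
         .filter(r => r.isrMember && !r.shuttingDown && !r.fenced)
         .map(_.logEndOffset)
       // The leader's own log end offset always bounds the HWM.
       (leaderLogEndOffset +: eligibleOffsets).min
     }
   }
   ```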
   
   > After this change, the definition has changed to - leadership epoch is a 
version number that represents member of an ISR "in some cases". As you can 
see, the new definition has added ifs and buts to the simple definition above. 
Hence, I am not in favour of changing this.
   
   @divijvaidya, for correctness, the main requirement is that the leader epoch 
is increased whenever the leader changes. This is needed for log truncation and 
reconciliation. For log consistency, log truncation and reconciliation assume 
that the (offset, epoch) tuples are unique per topic partition and that, if the 
tuple (offset, epoch) matches in two replicas, then their logs up to that 
offset also match. In my opinion, for correctness Kafka doesn't require that 
the leader epoch be increased when the ISR changes.
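
   To make the reconciliation argument concrete, here is a simplified sketch 
(not the actual LeaderEpochFileCache/OffsetsForLeaderEpoch implementation) of 
how a follower can use per-epoch end offsets to find the point where its log 
diverges from the leader's, relying only on the (offset, epoch) uniqueness 
property above.
   
   ```scala
   // Hypothetical sketch of epoch-based truncation.
   object LogReconciliation {
     // epochEndOffsets: leader epoch -> end offset (exclusive) of that epoch in the log.
     def truncationOffset(
       followerEpochEndOffsets: Map[Int, Long],
       leaderEpochEndOffsets: Map[Int, Long],
       followerLogEndOffset: Long
     ): Long = {
       // Walk the follower's epochs from newest to oldest until one also exists on the leader.
       followerEpochEndOffsets.keys.toSeq.sorted.reverse
         .find(leaderEpochEndOffsets.contains) match {
         case Some(epoch) =>
           // Keep only the prefix both logs agree on for that shared epoch.
           math.min(
             math.min(leaderEpochEndOffsets(epoch), followerEpochEndOffsets(epoch)),
             followerLogEndOffset
           )
         case None => 0L // No common epoch: truncate the whole log.
       }
     }
   }
   ```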
   
   > As you can see there, we only reset the followers' states when the leader 
epoch is bumped. I suppose that this is why you stumbled upon this issue with 
having shutting down replicas holding back advancing the HWM. The issue is that 
the shutting down replica's state is not reset so it remains caught-up for 
replica.lag.time.max.ms. I think that we need to update Partition.makeLeader to 
always update the followers' states. Obviously, we also need your changes to 
not consider fenced and shutting down replicas in the HWM computation.
   
   @dajac, yes, I thought about this when I was implementing this feature. I 
decided against it because the follower (the shutting-down replica) is 
technically "caught up" to the leader; we simply don't want the leader to wait 
for that replica when computing the HWM, since we know it will soon shut down 
its fetchers.
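
   As a toy illustration of that trade-off (the names are hypothetical, not 
Partition.scala), the shutting-down follower still looks caught up until 
replica.lag.time.max.ms expires, so the HWM computation skips it explicitly 
rather than pretending its fetch state was reset.
   
   ```scala
   // Hypothetical sketch of the "caught up but ineligible" case.
   final case class FollowerView(lastCaughtUpTimeMs: Long, shuttingDown: Boolean)
   
   object HwmEligibility {
     def countsTowardHwm(follower: FollowerView, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
       // A recently caught-up follower normally counts, unless it is shutting down.
       !follower.shuttingDown &&
         (nowMs - follower.lastCaughtUpTimeMs) <= replicaLagTimeMaxMs
   }
   ```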
   
   > I wonder if we really need this if we change Partition.makeLeader as 
explained. It seems to me that the change in Partition.makeLeader and 
Partition.maybeIncrementLeaderHW together should be backward compatible. What 
do you think?
   
   We need the MV check in the controller even with your suggestion. The 
question is: when is it beneficial for the controller to not increase the 
leader epoch when a replica is removed from the ISR because of a shutdown? This 
is only the case when the controller knows that the brokers have the replica 
manager fixes in this PR. That is guaranteed to be the case if the MV is 
greater than the MV introduced in this PR.
   
   If the brokers don't contain the fixes in this PR and the controller doesn't 
bump the leader epoch, PRODUCE requests will time out because advancement of 
the HWM will be delayed.
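
   A hedged sketch of the controller-side gate described above (the names and 
the exact MV comparison are assumptions, not the real feature-level constants): 
the epoch bump is only skipped once the MV tells the controller that every 
broker carries the replica manager fix.
   
   ```scala
   // Hypothetical sketch; whether the boundary is inclusive is an assumption here.
   object LeaderEpochBumpPolicy {
     def bumpLeaderEpochOnShutdownShrink(clusterMetadataVersion: Int, mvWithFix: Int): Boolean =
       // Older brokers still rely on the epoch bump to drop the shutting-down
       // replica's state; skipping it there would delay HWM advancement.
       clusterMetadataVersion < mvWithFix
   }
   ```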

