jsancio commented on code in PR #19454:
URL: https://github.com/apache/kafka/pull/19454#discussion_r2044695337
##########
metadata/src/main/java/org/apache/kafka/controller/ClusterControlManager.java:
##########
@@ -309,8 +309,10 @@ public void activate() {
         long nowNs = time.nanoseconds();
         for (BrokerRegistration registration : brokerRegistrations.values()) {
             heartbeatManager.register(registration.id(), registration.fenced());
-            heartbeatManager.tracker().updateContactTime(
-                new BrokerIdAndEpoch(registration.id(), registration.epoch()), nowNs);
+            if (!registration.fenced()) {
+                heartbeatManager.tracker().updateContactTime(
+                    new BrokerIdAndEpoch(registration.id(), registration.epoch()), nowNs);

Review Comment:
Interesting. This is not a new issue, but it means that a cluster whose controller fails over more often than the heartbeat timeout will be unable to fence brokers that have stopped sending heartbeats: each newly active controller resets the contact time of every unfenced broker to `nowNs`, so the timeout never elapses.

Have you considered persisting some of the session state? What is the scalability impact of persisting session state? For example, my understanding is that ZooKeeper persists session state. That is how it implements ephemeral nodes.
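
To make the failover concern concrete, here is a minimal, self-contained sketch. It is not Kafka's implementation: `SessionTracker` and `silentBrokerEverExpires` are hypothetical names, the timings are illustrative, and the model assumes only what the diff shows, namely that a newly active controller starts with fresh in-memory session state and sets the contact time of each unfenced broker to the activation time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

final class FailoverFencingSketch {

    /** Hypothetical stand-in for in-memory session state; not Kafka's heartbeat tracker. */
    static final class SessionTracker {
        private final long timeoutNs;
        private final Map<Integer, Long> lastContactNs = new HashMap<>();

        SessionTracker(long timeoutNs) {
            this.timeoutNs = timeoutNs;
        }

        void updateContactTime(int brokerId, long nowNs) {
            lastContactNs.put(brokerId, nowNs);
        }

        boolean hasExpired(int brokerId, long nowNs) {
            Long last = lastContactNs.get(brokerId);
            return last != null && nowNs - last >= timeoutNs;
        }
    }

    /**
     * Simulates one unfenced broker that never sends a heartbeat while the
     * active controller changes every failoverIntervalNs. Returns true if any
     * controller ever observes the broker's session as expired.
     */
    static boolean silentBrokerEverExpires(
            long timeoutNs, long failoverIntervalNs, long totalNs, long checkIntervalNs) {
        int brokerId = 1;
        SessionTracker tracker = new SessionTracker(timeoutNs);
        tracker.updateContactTime(brokerId, 0L); // first activation at t = 0
        long nextFailoverNs = failoverIntervalNs;

        for (long nowNs = checkIntervalNs; nowNs <= totalNs; nowNs += checkIntervalNs) {
            if (nowNs >= nextFailoverNs) {
                // Assumption: the new controller has no persisted session state,
                // so it rebuilds the tracker and treats "now" as the last contact.
                tracker = new SessionTracker(timeoutNs);
                tracker.updateContactTime(brokerId, nowNs);
                nextFailoverNs += failoverIntervalNs;
            }
            if (tracker.hasExpired(brokerId, nowNs)) {
                return true; // this controller would be able to fence the broker
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long timeoutNs = TimeUnit.SECONDS.toNanos(9);
        long totalNs = TimeUnit.SECONDS.toNanos(60);
        long checkNs = TimeUnit.SECONDS.toNanos(1);

        // Failover every 5 s, which is shorter than the 9 s timeout.
        System.out.println(silentBrokerEverExpires(
            timeoutNs, TimeUnit.SECONDS.toNanos(5), totalNs, checkNs));
        // Failover every 20 s, which is longer than the 9 s timeout.
        System.out.println(silentBrokerEverExpires(
            timeoutNs, TimeUnit.SECONDS.toNanos(20), totalNs, checkNs));
    }
}
```

In this model the first call returns `false` (the silent broker is never fenced) and the second returns `true`. Persisting the last contact time across failovers, as the question above suggests, would remove the reset on activation, at the cost of extra writes whose scalability impact is the open question.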