[ https://issues.apache.org/jira/browse/KAFKA-13191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441106#comment-17441106 ]
Edoardo Comar commented on KAFKA-13191:
---------------------------------------

[~acldstkusr] can you take a look at https://issues.apache.org/jira/browse/KAFKA-13407 ? It might be a symptom of the same underlying issue. Would you be able to try to reproduce this issue with the fix we propose in https://issues.apache.org/jira/browse/KAFKA-13407 ?

> Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken cluster
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-13191
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13191
>             Project: Kafka
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 2.8.0, 3.0.0
>            Reporter: CS User
>            Priority: Major
>
> We're using Confluent Platform 6.2, running in a Kubernetes environment. The cluster has been running for a couple of years with zero issues, starting from version 1.1, then 2.5, and now 2.8.
> We've very recently upgraded to kafka 2.8 from kafka 2.5.
> Since upgrading, we have seen issues when kafka and zookeeper pods restart concurrently.
> We can replicate the issue by restarting either the zookeeper statefulset first or the kafka statefulset first; either way appears to result in the same failure scenario.
> We've attempted to mitigate by preventing the kafka pods from stopping if any zookeeper pods are being restarted, or a rolling restart of the zookeeper cluster is underway.
> We've also added a check to stop the kafka pods from starting until all zookeeper pods are ready; however, under the following scenario we still see the issue:
> In a 3-node kafka cluster with 5 zookeeper servers:
> # kafka-2 starts to terminate - all zookeeper pods are running, so it proceeds
> # zookeeper-4 terminates
> # kafka-2 starts up and waits until the zookeeper rollout completes
> # kafka-2 eventually fully starts, kafka comes up, and we see the errors below on other pods in the cluster
> Without mitigation, in the above scenario we see errors on pod kafka-0 (repeatedly spamming the log):
> {noformat}
> [2021-08-11 11:45:57,625] WARN Broker had a stale broker epoch (670014914375), retrying. (kafka.server.DefaultAlterIsrManager)
> {noformat}
> Kafka-1 seems OK.
> When kafka-2 starts, it has this log entry with regard to its own broker epoch:
> {noformat}
> [2021-08-11 11:44:48,116] INFO Registered broker 2 at path /brokers/ids/2 with addresses: INTERNAL://kafka-2.kafka.svc.cluster.local:9092,INTERNAL_SECURE://kafka-2.kafka.svc.cluster.local:9094, czxid (broker epoch): 674309865493 (kafka.zk.KafkaZkClient)
> {noformat}
> This is despite kafka-2 appearing to start fine. This is what you see in kafka-2's logs; nothing else is added after this, it just seems to hang here:
> {noformat}
> [2021-08-11 11:44:48,911] INFO [SocketServer listenerType=ZK_BROKER, nodeId=2] Started socket server acceptors and processors (kafka.network.SocketServer)
> [2021-08-11 11:44:48,913] INFO Kafka version: 6.2.0-ccs (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,913] INFO Kafka commitId: 1a5755cf9401c84f (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,913] INFO Kafka startTimeMs: 1628682288911 (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,914] INFO [KafkaServer id=2] started (kafka.server.KafkaServer)
> {noformat}
> This never appears to recover.
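For illustration, a minimal sketch of checking these broker epochs directly: Kafka logs the czxid of the /brokers/ids/<id> registration znode as the broker epoch, so reading that czxid with the standard ZooKeeper Java client shows which epoch each live registration carries. The connect string zookeeper:2181 and the broker id range 0-2 below are assumptions for this environment, not values from the report.

{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class BrokerEpochCheck {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper connect string and broker id range for this cluster
        ZooKeeper zk = new ZooKeeper("zookeeper:2181", 30000, event -> { });
        try {
            for (int id = 0; id < 3; id++) {
                Stat stat = zk.exists("/brokers/ids/" + id, false);
                if (stat == null) {
                    System.out.println("broker " + id + ": not registered");
                } else {
                    // czxid = zxid of the transaction that created the znode; this is
                    // the value Kafka logs as "czxid (broker epoch)" on registration
                    System.out.println("broker " + id + ": broker epoch (czxid) = " + stat.getCzxid());
                }
            }
        } finally {
            zk.close();
        }
    }
}
{code}

Comparing the epoch rejected in the WARN line above with the czxid reported here may indicate whether the controller is still holding an older registration for that broker.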
> If you then restart kafka-2, you'll see these errors:
> {noformat}
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
> {noformat}
> This seems to completely break the cluster; partitions do not fail over as expected.
>
> Checking zookeeper and getting the values of the brokers, all looks fine:
> {noformat}
> get /brokers/ids/0
> {noformat}
> etc. - each broker is present.
>
> This error message appears to have been added to kafka in the last 11 months:
> {noformat}
> Broker had a stale broker epoch
> {noformat}
> via this PR:
> [https://github.com/apache/kafka/pull/9100]
> I also see this comment around the leader getting stuck:
> [https://github.com/apache/kafka/pull/9100/files#r494480847]
>
> Recovery is possible by continuing to restart the remaining brokers in the cluster. Once all have been restarted, everything looks fine.
> Has anyone else come across this? It seems very simple to replicate in our environment: simply start a simultaneous rolling restart of both kafka and zookeeper.
> I appreciate that Zookeeper and Kafka would not normally be restarted concurrently in this way. However, there are going to be scenarios where this can happen, such as simultaneous Kubernetes node failures resulting in the loss of both a zookeeper and a kafka pod at the same time. This could result in the issue above.
> This is not something that we have seen previously with versions 1.1 or 2.5. Just to be clear, rolling restarting only kafka or only zookeeper is absolutely fine.
> After some additional testing, it appears this can be recreated simply by restarting a broker pod and then restarting the zookeeper leader as the broker is shutting down.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)