[
https://issues.apache.org/jira/browse/KAFKA-13191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519434#comment-17519434
]
Fabio Hecht commented on KAFKA-13191:
-------------------------------------
According to my tests with Confluent 7.1.0 / Kafka 3.1.0, the issue has been
fixed.
> Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken
> cluster
> ---------------------------------------------------------------------------------
>
> Key: KAFKA-13191
> URL: https://issues.apache.org/jira/browse/KAFKA-13191
> Project: Kafka
> Issue Type: Bug
> Components: protocol
> Affects Versions: 2.8.0, 3.0.0
> Reporter: CS User
> Priority: Major
>
> We're using Confluent Platform 6.2, running in a Kubernetes environment. The
> cluster has been running for a couple of years with zero issues, starting at
> version 1.1, then 2.5, and now 2.8.
> We very recently upgraded from Kafka 2.5 to Kafka 2.8.
> Since upgrading, we have seen issues when Kafka and ZooKeeper pods restart
> concurrently.
> We can replicate the issue by restarting either the ZooKeeper StatefulSet or
> the Kafka StatefulSet first; either way appears to result in the same failure
> scenario.
> We've attempted to mitigate this by preventing the Kafka pods from stopping if
> any ZooKeeper pods are being restarted, or if a rolling restart of the
> ZooKeeper cluster is underway.
> We've also added a check to stop the Kafka pods from starting until all
> ZooKeeper pods are ready (a sketch of such a startup gate is shown after the
> scenario below); however, we still see the issue under the following scenario:
> In a 3-node Kafka cluster with 5 ZooKeeper servers:
> # kafka-2 starts to terminate - all ZooKeeper pods are running, so it
> proceeds
> # zookeeper-4 terminates
> # kafka-2 starts up, and waits until the ZooKeeper rollout completes
> # kafka-2 eventually fully starts; Kafka comes up and we see the errors
> below on other pods in the cluster.
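> A minimal sketch of the startup gate mentioned above. This is our own wrapper
> around the broker entrypoint, not anything that ships with Kafka; the host
> names and ensemble size are illustrative, and "ruok" must be whitelisted via
> 4lw.commands.whitelist on ZooKeeper 3.5+:
> {noformat}
> #!/usr/bin/env bash
> # Block broker start-up until every ZooKeeper server answers "imok" to "ruok".
> # Host names below are assumptions based on our environment.
> for i in 0 1 2 3 4; do
>   zk="zookeeper-${i}.zookeeper.svc.cluster.local"
>   until [ "$(echo ruok | nc -w 2 "$zk" 2181)" = "imok" ]; do
>     echo "waiting for $zk"
>     sleep 5
>   done
> done
> exec "$@"   # hand over to the normal Kafka entrypoint
> {noformat}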
> Without mitigation, and in the above scenario, we see errors on pod kafka-0
> (repeatedly spamming the log):
> {noformat}
> [2021-08-11 11:45:57,625] WARN Broker had a stale broker epoch
> (670014914375), retrying. (kafka.server.DefaultAlterIsrManager){noformat}
> kafka-1 seems OK.
> When kafka-2 starts, it logs this entry regarding its own broker epoch:
> {noformat}
> [2021-08-11 11:44:48,116] INFO Registered broker 2 at path /brokers/ids/2
> with addresses:
> INTERNAL://kafka-2.kafka.svc.cluster.local:9092,INTERNAL_SECURE://kafka-2.kafka.svc.cluster.local:9094,
> czxid (broker epoch): 674309865493 (kafka.zk.KafkaZkClient) {noformat}
> This is despite kafka-2 appearing to start fine. This is all you see in
> kafka-2's logs; nothing else is added, and it just seems to hang here:
> {noformat}
> [2021-08-11 11:44:48,911] INFO [SocketServer listenerType=ZK_BROKER,
> nodeId=2] Started socket server acceptors and processors
> (kafka.network.SocketServer)
> [2021-08-11 11:44:48,913] INFO Kafka version: 6.2.0-ccs
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,913] INFO Kafka commitId: 1a5755cf9401c84f
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,913] INFO Kafka startTimeMs: 1628682288911
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,914] INFO [KafkaServer id=2] started
> (kafka.server.KafkaServer) {noformat}
> This never appears to recover.
> If you then restart kafka-2, you'll see these errors:
> {noformat}
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication
> factor: 3 larger than available brokers: 0. {noformat}
> This seems to completely break the cluster; partitions do not fail over as
> expected.
>
> Checking ZooKeeper and getting the values of the broker znodes looks fine:
> {noformat}
> get /brokers/ids/0{noformat}
> and so on; all looks fine there, each broker is present.
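> For completeness, this is roughly how those checks can be run with the
> zookeeper-shell that ships with Kafka (the ZooKeeper address is an
> assumption). The cZxid that "stat" prints, shown in hex, is the same value the
> broker logs in decimal as its broker epoch:
> {noformat}
> bin/zookeeper-shell.sh zookeeper:2181 ls /brokers/ids
> bin/zookeeper-shell.sh zookeeper:2181 get /brokers/ids/2
> bin/zookeeper-shell.sh zookeeper:2181 stat /brokers/ids/2
> {noformat}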
>
> This error message appears to have been added to Kafka in the last 11 months:
> {noformat}
> Broker had a stale broker epoch {noformat}
> Via this PR:
> [https://github.com/apache/kafka/pull/9100]
> I also see this comment about the leader getting stuck:
> [https://github.com/apache/kafka/pull/9100/files#r494480847]
>
> Recovery is possible by continuing to restart the remaining brokers in the
> cluster. Once all have been restarted, everything looks fine.
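> In Kubernetes, one way to drive that recovery is a rolling restart of the
> broker StatefulSet; the StatefulSet and namespace names below are assumptions
> based on our environment:
> {noformat}
> kubectl -n kafka rollout restart statefulset/kafka
> kubectl -n kafka rollout status statefulset/kafka
> {noformat}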
> Has anyone else come across this? It seems very simple to replicate in our
> environment: simply start a simultaneous rolling restart of both Kafka and
> ZooKeeper.
> I appreciate that ZooKeeper and Kafka would not normally be restarted
> concurrently in this way. However, there are going to be scenarios where this
> can happen, such as simultaneous Kubernetes node failures resulting in the
> loss of both a ZooKeeper and a Kafka pod at the same time. This could result
> in the issue above.
> This is not something that we have seen previously with versions 1.1 or 2.5.
> Just to be clear, rolling-restarting only Kafka or only ZooKeeper is
> absolutely fine.
> After some additional testing, it appears this can be recreated simply by
> restarting a broker pod and then restarting the ZooKeeper leader while the
> broker is shutting down.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)