[ https://issues.apache.org/jira/browse/KAFKA-13191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441106#comment-17441106 ]

Edoardo Comar commented on KAFKA-13191:
---------------------------------------

[~acldstkusr] can you take a look at 
https://issues.apache.org/jira/browse/KAFKA-13407 ?
It might be a symptom of the same underlying issue.

Would you be able to try to reproduce this issue with the fix we propose in 
https://issues.apache.org/jira/browse/KAFKA-13407 ?

> Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken 
> cluster
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-13191
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13191
>             Project: Kafka
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 2.8.0, 3.0.0
>            Reporter: CS User
>            Priority: Major
>
> We're using Confluent Platform 6.2, running in a Kubernetes environment. The 
> cluster has been running for a couple of years with zero issues, starting on 
> version 1.1, then 2.5 and now 2.8. 
> We very recently upgraded from kafka 2.5 to kafka 2.8. 
> Since upgrading, we have seen issues when kafka and zookeeper pods restart 
> concurrently. 
> We can replicate the issue by restarting either the zookeeper statefulset or 
> the kafka statefulset first; either way appears to result in the same failure 
> scenario. 
> We've attempted to mitigate by preventing the kafka pods from stopping if any 
> zookeeper pods are being restarted, or a rolling restart of the zookeeper 
> cluster is underway. 
> We've also added a check to stop the kafka pods from starting until all 
> zookeeper pods are ready; however, under the following scenario we still see 
> the issue:
> In a 3-node kafka cluster with 5 zookeeper servers:
>  # kafka-2 starts to terminate - all zookeeper pods are running, so it 
> proceeds
>  # zookeeper-4 terminates
>  # kafka-2 starts-up, and waits until the zookeeper rollout completes
>  # kafka-2 eventually fully starts, kafka comes up and we see the errors 
> below on other pods in the cluster. 
> Without mitigation, and in the above scenario, we see errors on pod kafka-0 
> (repeatedly spamming the log):
> {noformat}
> [2021-08-11 11:45:57,625] WARN Broker had a stale broker epoch 
> (670014914375), retrying. (kafka.server.DefaultAlterIsrManager){noformat}
> Kafka-1 seems OK.
> When kafka-2 starts, it has this log entry regarding its own broker 
> epoch:
> {noformat}
> [2021-08-11 11:44:48,116] INFO Registered broker 2 at path /brokers/ids/2 
> with addresses: 
> INTERNAL://kafka-2.kafka.svc.cluster.local:9092,INTERNAL_SECURE://kafka-2.kafka.svc.cluster.local:9094,
>  czxid (broker epoch): 674309865493 (kafka.zk.KafkaZkClient) {noformat}
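> For reference, a rough sketch (using the plain ZooKeeper Java client; the 
> connection string, session timeout and broker id are placeholders for 
> illustration) of reading the czxid that ZooKeeper currently holds for a 
> broker's registration znode, to compare against the epoch logged above:
> {noformat}
> import java.util.concurrent.CountDownLatch;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
> 
> public class BrokerEpochCheck {
>     public static void main(String[] args) throws Exception {
>         CountDownLatch connected = new CountDownLatch(1);
>         // Placeholder connection string; point this at the zookeeper ensemble
>         ZooKeeper zk = new ZooKeeper("zookeeper:2181", 30000, event -> {
>             if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>                 connected.countDown();
>             }
>         });
>         connected.await();
>         // The czxid of the registration znode is the value Kafka logs as
>         // "czxid (broker epoch)" on startup
>         Stat stat = zk.exists("/brokers/ids/2", false);
>         if (stat == null) {
>             System.out.println("broker 2 has no registration znode");
>         } else {
>             System.out.println("czxid for /brokers/ids/2: " + stat.getCzxid());
>         }
>         zk.close();
>     }
> } {noformat}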
> This is despite kafka-2 appearing to start fine. This is what you see in 
> kafka-2's logs; nothing else seems to be added to the log, it just seems to 
> hang here:
> {noformat}
> [2021-08-11 11:44:48,911] INFO [SocketServer listenerType=ZK_BROKER, 
> nodeId=2] Started socket server acceptors and processors 
> (kafka.network.SocketServer)
> [2021-08-11 11:44:48,913] INFO Kafka version: 6.2.0-ccs 
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,913] INFO Kafka commitId: 1a5755cf9401c84f 
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,913] INFO Kafka startTimeMs: 1628682288911 
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-08-11 11:44:48,914] INFO [KafkaServer id=2] started 
> (kafka.server.KafkaServer) {noformat}
> This never appears to recover. 
> If you then restart kafka-2, you'll see these errors:
> {noformat}
> org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
> factor: 3 larger than available brokers: 0. {noformat}
> This seems to completely break the cluster; partitions do not fail over as 
> expected. 
>  
> Checking zookeeper and getting the values of the brokers looks fine 
> {noformat}
>  get /brokers/ids/0{noformat}
> etc. All looks fine there; each broker is present. 
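> Since zookeeper shows every broker registered, a complementary check (again 
> just a rough sketch; the bootstrap address is a placeholder) is to ask a 
> broker over the Kafka protocol which nodes it currently considers live, e.g. 
> with the Java AdminClient:
> {noformat}
> import java.util.Properties;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.DescribeClusterResult;
> import org.apache.kafka.common.Node;
> 
> public class LiveBrokersCheck {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Placeholder bootstrap address; point this at any broker in the cluster
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-0:9092");
>         try (Admin admin = Admin.create(props)) {
>             DescribeClusterResult cluster = admin.describeCluster();
>             // Which broker is currently the controller, as seen by this broker
>             System.out.println("controller: " + cluster.controller().get());
>             // The set of brokers this broker currently considers live
>             for (Node node : cluster.nodes().get()) {
>                 System.out.println("live broker: " + node.id()
>                         + " at " + node.host() + ":" + node.port());
>             }
>         }
>     }
> } {noformat}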
>  
> This error message appears to have been added to kafka in the last 11 months: 
> {noformat}
> Broker had a stale broker epoch {noformat}
> Via this PR:
> [https://github.com/apache/kafka/pull/9100]
> I also see this comment about the leader getting stuck:
> [https://github.com/apache/kafka/pull/9100/files#r494480847]
>  
> Recovery is possible by continuing to restart the remaining brokers in the 
> cluster. Once all have been restarted, everything looks fine.
> Has anyone else come across this? It seems very simple to replicate in our 
> environment: simply start a simultaneous rolling restart of both kafka and 
> zookeeper. 
> I appreciate that ZooKeeper and Kafka would not normally be restarted 
> concurrently in this way. However, there are going to be scenarios where this 
> can happen, such as simultaneous Kubernetes node failures resulting in the 
> loss of both a zookeeper and a kafka pod at the same time. This could result 
> in the issue above. 
> This is not something that we have seen previously with versions 1.1 or 2.5. 
> Just to be clear, rolling restarting only kafka or zookeeper is absolutely 
> fine. 
> After some additional testing, it appears this can be recreated simply by 
> restarting a broker pod and then restarting the zookeeper leader as the 
> broker is shutting down. 
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
