[jira] [Created] (KAFKA-13191) Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken cluster

2021-08-11 Thread CS User (Jira)
CS User created KAFKA-13191:
---

 Summary: Kafka 2.8 - simultaneous restarts of Kafka and zookeeper 
result in broken cluster
 Key: KAFKA-13191
 URL: https://issues.apache.org/jira/browse/KAFKA-13191
 Project: Kafka
  Issue Type: Bug
  Components: protocol
Affects Versions: 2.8.0
Reporter: CS User


We're using Confluent Platform 6.2, running in a Kubernetes environment. The 
cluster has been running for a couple of years with zero issues, moving from 
version 1.1 to 2.5 and now 2.8. 

We've very recently upgraded from Kafka 2.5 to Kafka 2.8. 

Since upgrading, we have seen issues when Kafka and ZooKeeper pods restart 
concurrently. 

We can replicate the issue by restarting either the ZooKeeper statefulset or 
the Kafka statefulset first; either way appears to result in the same 
failure scenario. 
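
For reproduction, the trigger is nothing more exotic than kicking off both rolling 
restarts at once; a minimal sketch (assuming the statefulsets are simply named 
kafka and zookeeper - adjust for your deployment) is:
{noformat}
# statefulset names are placeholders; start rolling restarts of both at
# (roughly) the same time
kubectl rollout restart statefulset/zookeeper
kubectl rollout restart statefulset/kafka

# watch both rollouts proceed concurrently
kubectl rollout status statefulset/zookeeper &
kubectl rollout status statefulset/kafka
{noformat}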

We've attempted to mitigate this by preventing the Kafka pods from stopping if 
any ZooKeeper pods are being restarted or a rolling restart of the ZooKeeper 
cluster is underway. 

We've also added a check to stop the Kafka pods from starting until all 
ZooKeeper pods are ready (a sketch of such a check follows the scenario below); 
however, under the following scenario we still see the issue:

In a 3-node Kafka cluster with 5 ZooKeeper servers:
 # kafka-2 starts to terminate - all ZooKeeper pods are running, so it proceeds
 # zookeeper-4 terminates
 # kafka-2 starts up, and waits until the ZooKeeper rollout completes
 # kafka-2 eventually fully starts; Kafka comes up and we see the errors below 
on other pods in the cluster. 
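
For reference, the startup check mentioned above is essentially a loop like the 
one below (a minimal sketch only; the pod names, headless-service name and port 
are placeholders for whatever your deployment uses):
{noformat}
#!/bin/bash
# Block Kafka startup until every ZooKeeper pod answers "imok" to the
# four-letter-word health check "ruok".
set -e

ZK_PODS="zookeeper-0 zookeeper-1 zookeeper-2 zookeeper-3 zookeeper-4"
ZK_SVC="zookeeper-headless"   # placeholder headless service name
ZK_PORT=2181

for pod in $ZK_PODS; do
  until [ "$(echo ruok | nc -w 2 "$pod.$ZK_SVC" "$ZK_PORT")" = "imok" ]; do
    echo "waiting for $pod to answer imok"
    sleep 5
  done
done
echo "all ZooKeeper pods are ready"
{noformat}
(On ZooKeeper 3.5+ the ruok command has to be allowed via 4lw.commands.whitelist 
for this to work.)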

Both without the mitigation and in the above scenario, we see these errors on 
pods kafka-0/kafka-1:
{noformat}
Broker had a stale broker epoch (635655171205), retrying. 
(kafka.server.DefaultAlterIsrManager) {noformat}
This never appears to recover. 

If you then restart kafka-2, you'll see these errors:
{noformat}
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication 
factor: 3 larger than available brokers: 0. {noformat}
This seems to completely break the cluster; partitions do not fail over as 
expected. 

 

Checking ZooKeeper and getting the values of the brokers looks fine:
{noformat}
 get /brokers/ids/0{noformat}
and so on; each broker is present and looks fine there. 
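
For anyone wanting to repeat that check, something along these lines works (a 
sketch using the zookeeper-shell.sh that ships with Kafka; the host, port and 
broker ids are placeholders):
{noformat}
# list the registered broker ids
bin/zookeeper-shell.sh zookeeper-0:2181 ls /brokers/ids

# dump each registration (endpoints, rack, timestamp, ...)
for id in 0 1 2; do
  bin/zookeeper-shell.sh zookeeper-0:2181 get /brokers/ids/$id
done
{noformat}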

 

This error message appears to have been added to Kafka in the last 11 months:
{noformat}
Broker had a stale broker epoch {noformat}
via this PR:

[https://github.com/apache/kafka/pull/9100]

I also see this comment about the leader getting stuck:

https://github.com/apache/kafka/pull/9100/files#r494480847

 

Recovery is possible by continuing to restart the remaining brokers in the 
cluster. Once all have been restarted, everything looks fine.
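
Concretely, that recovery is just a one-at-a-time restart of the remaining broker 
pods; a sketch (pod names are placeholders) looks like:
{noformat}
# restart the remaining brokers one at a time, waiting for each replacement
# pod (recreated by the statefulset controller) to become ready
for pod in kafka-0 kafka-1; do
  kubectl delete pod "$pod"
  sleep 10
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=600s
done
{noformat}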

Has anyone else come across this? It seems very simple to replicate in our 
environment: simply start a simultaneous rolling restart of both Kafka and 
ZooKeeper. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Kafka broker crash - broker id then changed

2016-05-26 Thread cs user
Hi All,

We are running Kafka version 0.9.0.1. At the time the brokers crashed
yesterday we were running a 2-node cluster; this has now been increased
to 3.

We are not specifying a broker id, relying instead on Kafka generating one.

After the brokers crashed (at exactly the same time), we left Kafka stopped
for a while. After Kafka was started back up, the broker ids on both
servers had been incremented: they were 1001/1002 and they flipped to
1003/1004. This seemed to cause some problems, as partitions were assigned
to broker ids which the cluster believed had disappeared and so were not
recoverable.

We noticed that the broker ids are actually stored in:

/tmp/kafka-logs/meta.properties

So we set these back to what they were and restarted. Is there a reason why
these would change?
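
For reference, the file looks roughly like this (a sketch; the broker.id value is
only an example), and setting broker.id back to the old value was the only change
we made before restarting:

$ cat /tmp/kafka-logs/meta.properties
version=0
broker.id=1001

As far as we understand it, with broker.id.generation.enable left at its default
the generated ids start above reserved.broker.max.id (default 1000), which is
where the 1001/1002 and then 1003/1004 come from.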

Below are the error logs from each server:

Server 1

[2016-05-25 09:05:52,827] INFO [ReplicaFetcherManager on broker 1002]
Removed fetcher for partitions [Topic1Heartbeat,1]
(kafka.server.ReplicaFetcherManager)
[2016-05-25 09:05:52,831] INFO Completed load of log Topic1Heartbeat-1 with
log end offset 0 (kafka.log.Log)
[2016-05-25 09:05:52,831] INFO Created log for partition
[Topic1Heartbeat,1] in /tmp/kafka-logs with properties {compression.type ->
producer, file.delete.delay.ms -> 60000, max.message.bytes -> 1000012,
min.insync.replicas -> 1, segment.jitter.ms -> 0, preallocate -> false,
min.cleanable.dirty.ratio -> 0.5, index.interval.bytes -> 4096,
unclean.leader.election.enable -> true, retention.bytes -> -1,
delete.retention.ms -> 86400000, cleanup.policy -> delete,
flush.ms -> 9223372036854775807, segment.ms -> 604800000,
segment.bytes -> 1073741824, retention.ms -> 604800000,
segment.index.bytes -> 10485760, flush.messages -> 9223372036854775807}.
(kafka.log.LogManager)
[2016-05-25 09:05:52,831] INFO Partition [Topic1Heartbeat,1] on broker
1002: No checkpointed highwatermark is found for partition
[Topic1Heartbeat,1] (kafka.cluster.Partition)
[2016-05-25 09:14:12,189] INFO [GroupCoordinator 1002]: Preparing to
restabilize group Topic1 with old generation 0
(kafka.coordinator.GroupCoordinator)
[2016-05-25 09:14:12,190] INFO [GroupCoordinator 1002]: Stabilized group
Topic1 generation 1 (kafka.coordinator.GroupCoordinator)
[2016-05-25 09:14:12,195] INFO [GroupCoordinator 1002]: Assignment received
from leader for group Topic1 for generation 1
(kafka.coordinator.GroupCoordinator)
[2016-05-25 09:14:12,749] FATAL [Replica Manager on Broker 1002]: Halting
due to unrecoverable I/O error while handling produce request:
 (kafka.server.ReplicaManager)
kafka.common.KafkaStorageException: I/O exception in append to log '__consumer_offsets-0'
at kafka.log.Log.append(Log.scala:318)
at kafka.cluster.Partition$$anonfun$9.apply(Partition.scala:442)
at kafka.cluster.Partition$$anonfun$9.apply(Partition.scala:428)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:268)
at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:428)
at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:401)
at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:386)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:386)
at kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:322)
at kafka.coordinator.GroupMetadataManager.store(GroupMetadataManager.scala:228)
at kafka.coordinator.GroupCoordinator$$anonfun$handleCommitOffsets$9.apply(GroupCoordinator.scala:429)
at kafka.coordinator.GroupCoordinator$$anonfun$handleCommitOffsets$9.apply(GroupCoordinator.scala:429)
at scala.Option.foreach(Option.scala:257)
at kafka.coordinator.GroupCoordinator.handleCommitOffsets(GroupCoordinator.scala:429)
at kafka.server.KafkaApis.handleOffsetCommitRequest(KafkaApis.scala:280)
at kafka.server.KafkaApis.handle(KafkaApis.scala:76)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException:
/tmp/kafka-logs/__consumer_offsets-0/00000000000000000000.index (No such
file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:277)
at