Swathi Mocharla created KAFKA-14449:
---------------------------------------
Summary: Brokers not re-joining the ISR list and stuck at started
until all the brokers restart
Key: KAFKA-14449
URL: https://issues.apache.org/jira/browse/KAFKA-14449
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.2.0
Reporter: Swathi Mocharla
hi,
We are upgrading a 3 broker cluster (1001,1002,1003) from 3.1.0 to 3.2.0.
During upgrade, it is noticed that when 1003 is restarted, it doesn't join back
the ISR list and the broker is stuck. Same is the case with 1002.
Only when 1001 is restrarted, 1003,1002 re-join the ISR list and start
replicating data.
{code:java}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.386", "timezone":"UTC", "log":{"message":"main -
kafka.server.KafkaServer - [KafkaServer id=1003] started"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.442", "timezone":"UTC",
"log":{"message":"data-plane-kafka-request-handler-1 - state.change.logger -
[Broker id=1003] Add 397 partitions and deleted 0 partitions from metadata
cache in response to UpdateMetadata request sent by controller 1002 epoch 18
with correlation id 0"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.448", "timezone":"UTC",
"log":{"message":"BrokerToControllerChannelManager broker=1003 name=alterIsr -
kafka.server.BrokerToControllerRequestThread -
[BrokerToControllerChannelManager broker=1003 name=alterIsr]: Recorded new
controller, from now on will use broker
kf-pl47-me8-1.kf-pl47-me8-headless.nc0968-admin-ns.svc.cluster.local:9092 (id:
1002 rack: null)"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"ERROR",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:07:30.451", "timezone":"UTC",
"log":{"message":"data-plane-kafka-network-thread-1003-ListenerName(PLAINTEXT)-PLAINTEXT-1
- kafka.network.Processor - Closing socket for
192.168.216.11:9092-192.168.199.100:53778-0 because of error"}}
org.apache.kafka.common.errors.InvalidRequestException: Error getting request
for apiKey: LEADER_AND_ISR, apiVersion: 6, connectionId:
192.168.216.11:9092-192.168.199.100:53778-0, listenerName:
ListenerName(PLAINTEXT), principal: User:ANONYMOUS
org.apache.kafka.common.errors.InvalidRequestException: Error getting request
for apiKey: LEADER_AND_ISR, apiVersion: 6, connectionId:
192.168.216.11:9092-192.168.235.153:46282-461, listenerName:
ListenerName(PLAINTEXT), principal: User:ANONYMOUS
Caused by: org.apache.kafka.common.errors.UnsupportedVersionException: Can't
read version 6 of LeaderAndIsrTopicState
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:12:50.916", "timezone":"UTC",
"log":{"message":"controller-event-thread - kafka.controller.KafkaController -
[Controller id=1003] 1003 successfully elected as the controller. Epoch
incremented to 20 and epoch zk version is now 20"}}
{"type":"log", "host":"kf-pl47-me8-2", "level":"INFO",
"neid":"kafka-b352b4f8cf4447e9a73d9e7ef3ec746c", "system":"kafka",
"time":"2022-12-06T10:12:50.917", "timezone":"UTC",
"log":{"message":"controller-event-thread - kafka.controller.KafkaController -
[Controller id=1003] Registering handlers"}}
{code}
This possibly was introduced by KAFKA-13587.
In the below snapshot during the upgrade, at 16:05:15 UTC 2022, 1001 was
restarting and both 1002 and 1003 were already up and running (after the
upgrade from 3.1.0 to 3.2.0), but did not manage to re-join the ISRs.
{code:java}
Wed Dec 7 16:05:15 UTC 2022
Topic: test TopicId: L6Yj_Nf9RrirNhFQzvXODw PartitionCount: 2
ReplicationFactor: 3 Configs:
compression.type=producer,min.insync.replicas=1,cleanup.policy=delete,flush.ms=1000,segment.bytes=100000000,flush.messages=10000,max.message.bytes=1000012,index.interval.bytes=4096,unclean.leader.election.enable=false,retention.bytes=1000000000,segment.index.bytes=10485760
Topic: test Partition: 0 Leader: none Replicas:
1002,1003,1001 Isr: 1001
Topic: test Partition: 1 Leader: none Replicas:
1001,1002,1003 Isr: 1001
Wed Dec 7 16:05:33 UTC 2022
Topic: test TopicId: L6Yj_Nf9RrirNhFQzvXODw PartitionCount: 2
ReplicationFactor: 3 Configs:
compression.type=producer,min.insync.replicas=1,cleanup.policy=delete,flush.ms=1000,segment.bytes=100000000,flush.messages=10000,max.message.bytes=1000012,index.interval.bytes=4096,unclean.leader.election.enable=false,retention.bytes=1000000000,segment.index.bytes=10485760
Topic: test Partition: 0 Leader: 1001 Replicas:
1002,1003,1001 Isr: 1001,1002,1003
Topic: test Partition: 1 Leader: 1001 Replicas:
1001,1002,1003 Isr: 1001,1002,1003{code}
Is there anything the user needs to do explicitly to work around this issue?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)