[ https://issues.apache.org/jira/browse/KAFKA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Chen resolved KAFKA-16247.
-------------------------------
    Resolution: Fixed

Fixed in 3.7.0 RC4

> replica stays out-of-sync after migrating broker to KRaft
> ---------------------------------------------------------
>
>                 Key: KAFKA-16247
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16247
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Luke Chen
>            Priority: Major
>         Attachments: KAFKA-16247.zip
>
>
> We are deploying 3 controllers and 3 brokers, following the steps in the 
> [doc|https://kafka.apache.org/documentation/#kraft_zk_migration]. When we move 
> from the "Enabling the migration on the brokers" state to the "Migrating 
> brokers to KRaft" state, the first rolled broker becomes out-of-sync and 
> never becomes in-sync again. 
> From the logs, we can see some "reject alterPartition" errors, but they only 
> happen 2 times. In theory, the leader should add the follower back into the 
> ISR as long as the follower is fetching, since no client is writing data, 
> but we can't figure out why it didn't fetch. 
> Logs: https://gist.github.com/showuon/64c4dcecb238a317bdbdec8db17fd494
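> For reference, a minimal sketch of watching the ISR from the client side with 
> the Java AdminClient while the broker is rolled (the topic name and bootstrap 
> servers below are placeholders, not taken from this cluster):
> {code:java}
> import java.util.List;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.TopicDescription;
>
> public class CheckIsr {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Placeholder bootstrap servers; point this at the cluster under test.
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         try (Admin admin = Admin.create(props)) {
>             // Placeholder topic name; describe the topic whose replica stays out-of-sync.
>             TopicDescription desc = admin.describeTopics(List.of("test-topic"))
>                     .allTopicNames().get().get("test-topic");
>             desc.partitions().forEach(p ->
>                     System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
>                             p.partition(), p.leader(), p.replicas(), p.isr()));
>         }
>     }
> }
> {code}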
> ===
> Update Feb. 14:
> After further investigating the logs, I think the reason the replica is not 
> added back into the ISR is that the alterPartition request got a non-retriable 
> error from the controller:
> {code:java}
> Failed to alter partition to PendingExpandIsr(newInSyncReplicaId=0, 
> sentLeaderAndIsr=LeaderAndIsr(leader=1, leaderEpoch=4, 
> isrWithBrokerEpoch=List(BrokerState(brokerId=1, brokerEpoch=-1), 
> BrokerState(brokerId=2, brokerEpoch=-1), BrokerState(brokerId=0, 
> brokerEpoch=-1)), leaderRecoveryState=RECOVERED, partitionEpoch=7), 
> leaderRecoveryState=RECOVERED, 
> lastCommittedState=CommittedPartitionState(isr=Set(1, 2), 
> leaderRecoveryState=RECOVERED)) because the partition epoch is invalid. 
> Partition state may be out of sync, awaiting new the latest metadata. 
> (kafka.cluster.Partition) 
> [zk-broker-1-to-controller-alter-partition-channel-manager]
> {code}
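> To make the error easier to follow, here is a rough sketch of the kind of 
> partition-epoch check the controller applies to an AlterPartition request; this 
> is my own simplification for illustration, not the actual controller code:
> {code:java}
> import org.apache.kafka.common.protocol.Errors;
>
> // Simplified stand-in for the controller-side validation (illustration only).
> // A stale partition epoch in the AlterPartition request is answered with
> // INVALID_UPDATE_VERSION, which the broker treats as non-retriable, so it
> // keeps the ISR expansion pending until it sees newer partition metadata.
> final class AlterPartitionEpochCheck {
>     static Errors validate(int requestPartitionEpoch, int controllerPartitionEpoch) {
>         if (requestPartitionEpoch != controllerPartitionEpoch) {
>             return Errors.INVALID_UPDATE_VERSION;
>         }
>         return Errors.NONE;
>     }
> }
> {code}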
> Since it's a non-retriable error, we keep the state as pending and wait for a 
> later LeaderAndIsr update, as described 
> [here|https://github.com/apache/kafka/blob/d24abe0edebad37e554adea47408c3063037f744/core/src/main/scala/kafka/cluster/Partition.scala#L1876C1-L1876C41].
> Log analysis: https://gist.github.com/showuon/5514cbb995fc2ae6acd5858f69c137bb
> So the questions become:
> 1. Why does the controller increase the partition epoch?
> 2. When the leader receives the LeaderAndIsr request from the controller, it 
> ignores the request because the leader epoch is identical, even though the 
> partition epoch has been updated (see the sketch after this list). Is this 
> behavior expected? Will it impact later alterPartition requests?
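> As a sketch of the condition behind question 2 (my own simplification of the 
> broker-side handling, not the actual Kafka code): the update is applied only 
> when the leader epoch strictly increases, so a request that only bumps the 
> partition epoch is dropped, the broker keeps its stale partition epoch, and 
> the next AlterPartition attempt then fails the epoch check:
> {code:java}
> // Illustration only: simplified epoch comparison, not the real broker code.
> final class LeaderAndIsrEpochCheck {
>     static boolean shouldApply(int requestLeaderEpoch, int currentLeaderEpoch) {
>         // Requests whose leader epoch is not strictly newer are ignored,
>         // even if they carry a newer partition epoch.
>         return requestLeaderEpoch > currentLeaderEpoch;
>     }
>
>     public static void main(String[] args) {
>         // leaderEpoch=4 matches the rejected request in the log above; the
>         // broker's current value being equal is the scenario in question 2.
>         System.out.println(shouldApply(4, 4)); // false -> request ignored
>     }
> }
> {code}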



