[ https://issues.apache.org/jira/browse/KAFKA-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen resolved KAFKA-18911.
-------------------------------
    Resolution: Invalid

> alterPartition gets stuck when getting out-of-date errors
> ---------------------------------------------------------
>
>                 Key: KAFKA-18911
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18911
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.9.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> When the leader node sends an AlterPartition request to the controller, the
> controller performs [some validation|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1231]
> before processing it. On the leader side, when an error comes back, we decide
> whether the request should be retried
> [here|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/core/src/main/scala/kafka/cluster/Partition.scala#L1868].
> However, in some non-retriable cases, we return false directly without
> changing the partition state:
>
> {code:java}
> case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic or partition. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.UNKNOWN_TOPIC_ID =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.FENCED_LEADER_EPOCH =>
>   info(s"Failed to alter partition to $proposedIsrState since the leader epoch is old. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_UPDATE_VERSION =>
>   info(s"Failed to alter partition to $proposedIsrState because the partition epoch is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_REQUEST =>
>   info(s"Failed to alter partition to $proposedIsrState because the request is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.NEW_LEADER_ELECTED =>
>   // The operation completed successfully but this replica got removed from the replica set by the controller
>   // while completing a ongoing reassignment. This replica is no longer the leader but it does not know it
>   // yet. It should remain in the current pending state until the metadata overrides it.
>   // This is only raised in KRaft mode.
>   info(s"The alter partition request successfully updated the partition state to $proposedIsrState but " +
>     "this replica got removed from the replica set while completing a reassignment. " +
>     "Waiting on new metadata to clean up this replica.")
>   false{code}
> As the log message says, "Partition state may be out of sync, awaiting new the
> latest metadata". But because the partition state is not updated, it stays in
> the `PendingExpandIsr` or `PendingShrinkIsr` state, which keeps `isInflight`
> set to true. While in this state, the partition state will never be updated
> again.
>
> The impact of this issue is that the ISR state remains stale (wrong) until
> the leadership changes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)