[ 
https://issues.apache.org/jira/browse/KAFKA-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen resolved KAFKA-18911.
-------------------------------
    Resolution: Invalid

> alterPartition gets stuck when getting out-of-date errors
> ---------------------------------------------------------
>
>                 Key: KAFKA-18911
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18911
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.9.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> When the leader sends an AlterPartition request to the controller, the 
> controller performs [some 
> validation|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1231]
>  before processing it. On the leader side, when an error is returned, we 
> decide whether the request should be retried 
> [here|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/core/src/main/scala/kafka/cluster/Partition.scala#L1868].
>  However, in several non-retriable cases we return false directly without 
> changing the partition state:
>  
> {code:java}
> case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic or partition. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.UNKNOWN_TOPIC_ID =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.FENCED_LEADER_EPOCH =>
>   info(s"Failed to alter partition to $proposedIsrState since the leader epoch is old. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_UPDATE_VERSION =>
>   info(s"Failed to alter partition to $proposedIsrState because the partition epoch is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_REQUEST =>
>   info(s"Failed to alter partition to $proposedIsrState because the request is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.NEW_LEADER_ELECTED =>
>   // The operation completed successfully but this replica got removed from the replica set by the controller
>   // while completing a ongoing reassignment. This replica is no longer the leader but it does not know it
>   // yet. It should remain in the current pending state until the metadata overrides it.
>   // This is only raised in KRaft mode.
>   info(s"The alter partition request successfully updated the partition state to $proposedIsrState but " +
>     "this replica got removed from the replica set while completing a reassignment. " +
>     "Waiting on new metadata to clean up this replica.")
>   false
> {code}
> As the log message says, "Partition state may be out of sync, awaiting new the 
> latest metadata". But because the partition state is not updated, it 
> stays in the `PendingExpandIsr` or `PendingShrinkIsr` state, which keeps 
> `isInflight` set to true. While in this state, the partition state will never 
> be updated again.
>  
> The impact of this issue is that the ISR stays in a stale (wrong) state 
> until a leadership change.
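
The scenario described in the report can be sketched with a toy model. This is only an illustration of the reporter's description, not Kafka's actual Partition code, and since the issue was resolved as Invalid it should not be read as confirmed broker behavior. All names here (IsrStateSketch, ToyPartition, maybeExpandIsr, onNewMetadata) are hypothetical, and the assumption that fresh metadata resets the pending state is taken from the log message and code comment quoted above.

```java
import java.util.List;

public class IsrStateSketch {
    // Hypothetical stand-ins for the committed and pending partition states.
    enum StateKind { COMMITTED, PENDING_EXPAND_ISR, PENDING_SHRINK_ISR }

    // Hypothetical subset of the error codes listed in the quoted handler.
    enum ErrorCode { NONE, UNKNOWN_TOPIC_OR_PARTITION, FENCED_LEADER_EPOCH, INVALID_UPDATE_VERSION }

    static final class ToyPartition {
        StateKind state = StateKind.COMMITTED;
        List<Integer> isr = List.of(1, 2);

        // A request counts as in flight while the state is one of the pending states.
        boolean isInflight() {
            return state != StateKind.COMMITTED;
        }

        // Returns true if a request was "sent", false if skipped because one is pending.
        boolean maybeExpandIsr() {
            if (isInflight()) {
                return false;
            }
            state = StateKind.PENDING_EXPAND_ISR;
            return true;
        }

        // Same shape as the quoted handler: the return value means "retry?".
        boolean handleResponse(ErrorCode error, List<Integer> proposedIsr) {
            if (error == ErrorCode.NONE) {
                isr = proposedIsr;
                state = StateKind.COMMITTED;
                return false;
            }
            // Non-retriable error: log and return false, leaving the pending state untouched.
            return false;
        }

        // Assumption: newer metadata overrides the pending state.
        void onNewMetadata(List<Integer> isrFromMetadata) {
            isr = isrFromMetadata;
            state = StateKind.COMMITTED;
        }
    }

    public static void main(String[] args) {
        ToyPartition p = new ToyPartition();
        System.out.println("sent: " + p.maybeExpandIsr());
        p.handleResponse(ErrorCode.FENCED_LEADER_EPOCH, List.of(1, 2, 3));
        System.out.println("inflight after error: " + p.isInflight());
        System.out.println("sent again: " + p.maybeExpandIsr());
        p.onNewMetadata(List.of(1, 2, 3));
        System.out.println("inflight after metadata: " + p.isInflight());
    }
}
```

In this model the second maybeExpandIsr call is skipped because the pending state was never cleared, which is the "stuck" condition the report describes. The state only returns to committed once onNewMetadata is called.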



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
