[ 
https://issues.apache.org/jira/browse/KAFKA-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931923#comment-17931923
 ] 

Luke Chen commented on KAFKA-18911:
-----------------------------------

Had another look: it looks like we're OK to stay in this state until we see new 
metadata from LeaderAndIsr (or an update to the KRaft metadata log). Closing it 
now.

> alterPartition gets stuck when getting out-of-date errors
> ---------------------------------------------------------
>
>                 Key: KAFKA-18911
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18911
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.9.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> When the leader sends an AlterPartition request to the controller, the 
> controller does [some 
> validation|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1231]
>  before processing it. On the leader side, when an error comes back, we 
> decide whether the request should be retried 
> [here|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/core/src/main/scala/kafka/cluster/Partition.scala#L1868].
>  However, in some non-retriable cases, we return false directly without 
> changing the state:
>  
> {code:java}
> case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic or partition. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.UNKNOWN_TOPIC_ID =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.FENCED_LEADER_EPOCH =>
>   info(s"Failed to alter partition to $proposedIsrState since the leader epoch is old. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_UPDATE_VERSION =>
>   info(s"Failed to alter partition to $proposedIsrState because the partition epoch is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_REQUEST =>
>   info(s"Failed to alter partition to $proposedIsrState because the request is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.NEW_LEADER_ELECTED =>
>   // The operation completed successfully but this replica got removed from the replica set by the controller
>   // while completing a ongoing reassignment. This replica is no longer the leader but it does not know it
>   // yet. It should remain in the current pending state until the metadata overrides it.
>   // This is only raised in KRaft mode.
>   info(s"The alter partition request successfully updated the partition state to $proposedIsrState but " +
>     "this replica got removed from the replica set while completing a reassignment. " +
>     "Waiting on new metadata to clean up this replica.")
>   false
> {code}
> As the log message says, "Partition state may be out of sync, awaiting new the 
> latest metadata". But because the partition state is not updated, it 
> stays in the `PendingExpandIsr` or `PendingShrinkIsr` state, which keeps 
> `isInflight` set to true. While in this state, the partition state is never 
> updated again.
>  
> The impact of this issue is that the ISR stays in a stale (wrong) state 
> until a leadership change.
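
The pending-state behaviour described above can be illustrated with a small,
simplified model. This is only a sketch in Java under stated assumptions, not
the code in Partition.scala: the class IsrStateSketch and all of its methods
are hypothetical names, replicas are plain integers, and it assumes that newer
metadata is the only thing that replaces a pending state.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of the state handling discussed in this issue.
// It does not reproduce Partition.scala; every name here is illustrative.
final class IsrStateSketch {
    enum State { COMMITTED, PENDING_EXPAND, PENDING_SHRINK }

    private State state = State.COMMITTED;
    private Set<Integer> isr;

    IsrStateSketch(Set<Integer> initialIsr) {
        this.isr = new TreeSet<>(initialIsr);
    }

    State state() { return state; }

    Set<Integer> isr() { return isr; }

    // A change is "in flight" while the state is one of the pending states.
    boolean isInflight() { return state != State.COMMITTED; }

    // Proposes adding a replica; refused while an earlier change is in flight.
    boolean maybeExpandIsr(int replicaId) {
        if (isInflight() || isr.contains(replicaId)) return false;
        state = State.PENDING_EXPAND;
        return true;
    }

    // Proposes removing a replica; refused while an earlier change is in flight.
    boolean maybeShrinkIsr(int replicaId) {
        if (isInflight() || !isr.contains(replicaId)) return false;
        state = State.PENDING_SHRINK;
        return true;
    }

    // Models the non-retriable error branches (for example FENCED_LEADER_EPOCH):
    // the result is "do not retry" and the pending state is left untouched.
    boolean handleNonRetriableError() {
        return false;
    }

    // Models newer metadata arriving: it overwrites the ISR and the pending state.
    void applyNewMetadata(Set<Integer> newIsr) {
        isr = new TreeSet<>(newIsr);
        state = State.COMMITTED;
    }
}
```

In this model, after maybeExpandIsr(3) followed by handleNonRetriableError(),
isInflight() stays true and further proposals are refused until
applyNewMetadata(...) is called.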



--
This message was sent by Atlassian Jira
(v8.20.10#820010)