[ 
https://issues.apache.org/jira/browse/KAFKA-14824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hudeqi updated KAFKA-14824:
---------------------------
    Reviewer: Chia-Ping Tsai

> ReplicaAlterLogDirsThread may cause serious disk usage in case of unknown 
> exception
> -----------------------------------------------------------------------------------
>
>                 Key: KAFKA-14824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14824
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: hudeqi
>            Priority: Blocker
>
> For ReplicaAlterLogDirsThread, if a partition is marked as failed due to an 
> unknown exception and fetching for that partition is suspended, the paused 
> log cleanup for the partition needs to be resumed; otherwise it will lead to 
> serious, unexpected disk usage growth.
>  
> For example, in our production environment (running Kafka 2.5.1) we saw the 
> following case. A log dir rebalance was performed on the broker that leads 
> this partition. Once the future log had been created and fetching had 
> started, the fetcher hit an out-of-range error and, for the first time, 
> reset and truncated to the leader's log start offset. At the same time, the 
> partition leader was processing a leaderAndIsrRequest and the leader epoch 
> was updated, so ReplicaAlterLogDirsThread hit FENCED_LEADER_EPOCH and the 
> 'partitionStates' of the partition were cleaned up. Concurrently, the thread 
> processing the leaderAndIsrRequest was running the logic that adds the 
> partition to ReplicaAlterLogDirsThread. Here, the offset set in 
> InitialFetchState is the leader's hw. When ReplicaAlterLogDirsThread then 
> ran processFetchRequest, it threw 
> "java.lang.IllegalStateException : Offset mismatch for the future replica 
> anti_fraud.data_collector.anticrawler_live-54: fetched offset = 4979659327, 
> log end offset = 4918576434." As a result, ReplicaAlterLogDirsThread no 
> longer fetched the partition, and because cleanup for the partition was 
> still paused, disk usage on the corresponding broker kept growing, causing 
> serious problems.
>  
> I found that trunk already fixed this particular bug in KAFKA-9087 (the 
> "Offset mismatch" error that causes ReplicaAlterLogDirsThread to stop 
> fetching). However, I don't know whether other unknown exceptions could 
> occur that, with the current logic, would lead to the same disk cleanup 
> failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
