hudeqi created KAFKA-14824:
------------------------------

             Summary: ReplicaAlterLogDirsThread may cause serious disk usage in 
case of unknown exception
                 Key: KAFKA-14824
                 URL: https://issues.apache.org/jira/browse/KAFKA-14824
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.3.2
            Reporter: hudeqi


In ReplicaAlterLogDirsThread, if a partition is marked as failed because of an 
unknown exception and fetching for that partition is suspended, the log cleanup 
that was paused for the partition needs to be resumed. Otherwise, disk usage 
on the broker grows unexpectedly and without bound.
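
A minimal sketch of the behavior described above, in Java. It is only an 
illustration: the class, fields and methods (AlterLogDirsSketch, startMove, 
fetchOnce, and so on) are hypothetical and are not Kafka's actual classes or 
API. The point is that the failure path both marks the partition as failed and 
releases the cleanup pause.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a log-dir mover; NOT Kafka's real implementation.
final class AlterLogDirsSketch {
    // Partitions whose log cleanup is currently paused (assumed stand-in for cleaner state).
    private final Set<String> cleaningPaused = new HashSet<>();
    // Partitions that the mover has given up fetching.
    private final Set<String> failedPartitions = new HashSet<>();

    // Cleanup is paused while the future log is being filled.
    public void startMove(String partition) {
        cleaningPaused.add(partition);
    }

    // Runs one fetch attempt; on an unexpected exception the partition is
    // marked failed AND its cleanup pause is released.
    public void fetchOnce(String partition, Runnable fetch) {
        try {
            fetch.run();
        } catch (RuntimeException e) {
            failedPartitions.add(partition);
            cleaningPaused.remove(partition); // without this, the log is never cleaned again
        }
    }

    public boolean isCleaningPaused(String partition) {
        return cleaningPaused.contains(partition);
    }

    public boolean isFailed(String partition) {
        return failedPartitions.contains(partition);
    }

    public static void main(String[] args) {
        AlterLogDirsSketch mover = new AlterLogDirsSketch();
        mover.startMove("topic-0");
        mover.fetchOnce("topic-0", () -> {
            throw new IllegalStateException("Offset mismatch");
        });
        System.out.println("failed=" + mover.isFailed("topic-0")
                + ", cleaningPaused=" + mover.isCleaningPaused("topic-0"));
    }
}
```

If the remove call in the catch block is dropped, the partition stays in the 
paused set after the failure, which corresponds to the disk growth reported 
here.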

 

For example, we hit the following case in a production environment (running 
Kafka 2.5.1). A log dir rebalance was performed on the broker that leads the 
partition. Once the future log had been created, ReplicaAlterLogDirsThread 
started fetching. The first fetch was out of range, so the future log was reset 
and truncated to the leader's log start offset. At the same time, the partition 
leader was processing a LeaderAndIsr request and the leader epoch was updated, 
so ReplicaAlterLogDirsThread got FENCED_LEADER_EPOCH and the partition was 
removed from 'partitionStates'. Meanwhile, the thread processing the 
LeaderAndIsr request was adding the partition to ReplicaAlterLogDirsThread 
again, and the offset set in InitialFetchState at that point was the leader's 
high watermark (hw). When ReplicaAlterLogDirsThread then ran 
processFetchRequest, it threw 
"java.lang.IllegalStateException: Offset mismatch for the future replica 
anti_fraud.data_collector.anticrawler_live-54: fetched offset = 4979659327, log 
end offset = 4918576434.". As a result, ReplicaAlterLogDirsThread stopped 
fetching the partition, and because cleanup for the partition was still paused, 
the disk usage of the broker grew without bound, causing serious problems.
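
To make the failing condition concrete, here is a small Java sketch of the kind 
of check that produces the message quoted above. It is an assumption-based 
illustration, not Kafka's source: the class and method names are hypothetical, 
and only the message text and the two offsets are taken from this report.

```java
// Hypothetical illustration of the offset check; NOT Kafka's real code.
final class FutureReplicaOffsetCheck {

    // The future replica may only append data that starts exactly at its
    // current log end offset; anything else is treated as an illegal state.
    public static void checkFetchOffset(String topicPartition, long fetchOffset, long logEndOffset) {
        if (fetchOffset != logEndOffset) {
            throw new IllegalStateException(String.format(
                "Offset mismatch for the future replica %s: fetched offset = %d, log end offset = %d.",
                topicPartition, fetchOffset, logEndOffset));
        }
    }

    public static void main(String[] args) {
        try {
            // Offsets from the report: the fetch started at the leader's hw,
            // while the future log had been truncated to an earlier offset.
            checkFetchOffset("anti_fraud.data_collector.anticrawler_live-54",
                    4979659327L, 4918576434L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the scenario above the fetch offset came from the leader's hw while the 
future log ended at an earlier offset, so the two values could not match.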

 

I found that trunk already fixes this particular case in KAFKA-9087, which 
addressed the "Offset mismatch" error that made ReplicaAlterLogDirsThread stop 
fetching. However, I don't know whether other unknown exceptions could still 
occur and, with the current logic, lead to the same disk cleanup failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
