[jira] [Resolved] (KAFKA-19646) CLONE - Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel

Haozhong Ma (Jira) Tue, 26 Aug 2025 01:55:06 -0700


     [ https://issues.apache.org/jira/browse/KAFKA-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Haozhong Ma resolved KAFKA-19646.
---------------------------------
    Resolution: Not A Problem

> CLONE - Broker Startup: Handle Checkpoint Creation Failure via 
> logDirFailureChannel
> -----------------------------------------------------------------------------------
>
>                 Key: KAFKA-19646
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19646
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haozhong Ma
>            Assignee: Haozhong Ma
>            Priority: Major
>
> In our production environment, a broker failed to start because checkpoint 
> creation failed on a single disk (caused by disk corruption or filesystem 
> errors). By Kafka's design, such disk-level failures should be isolated via 
> {{logDirFailureChannel}}, allowing the other healthy disks to continue 
> serving traffic. However, on reviewing the 
> {{CheckpointFileWithFailureHandler}} implementation, we observed that 
> methods such as {{write}}, {{read}}, and {{writeIfDirExists}} handle 
> {{IOException}} by routing the affected {{log.dir}} to 
> {{logDirFailureChannel}}, whereas checkpoint initialization lacks this 
> fault-tolerant behavior. Should checkpoint creation adopt the same 
> failure-handling logic? If the current behavior is not intentional, I will 
> submit a PR to fix it.
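
The behavior the description asks about can be sketched roughly as follows. This is a minimal, self-contained illustration and not Kafka's actual code: FailureChannel, RecordingChannel, markOffline and initCheckpoint are hypothetical stand-ins invented for this example, and the checkpoint file name is used only for illustration. The sketch assumes the goal is to report a directory whose checkpoint file cannot be created and then continue with the remaining directories. Note that the issue itself was resolved as "Not A Problem".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class CheckpointInitSketch {

    /** Hypothetical stand-in for a log-dir failure channel; not Kafka's class. */
    interface FailureChannel {
        void markOffline(String logDir, String message, IOException cause);
    }

    /** Simple implementation that records which directories were reported. */
    static final class RecordingChannel implements FailureChannel {
        final List<String> offlineDirs = new ArrayList<>();

        @Override
        public void markOffline(String logDir, String message, IOException cause) {
            offlineDirs.add(logDir);
        }
    }

    /**
     * Creates the checkpoint file if it does not exist yet. On an IOException
     * the directory is reported to the channel and an empty Optional is
     * returned, so the caller can keep going with the other directories.
     */
    static Optional<Path> initCheckpoint(Path logDir, String fileName, FailureChannel channel) {
        Path file = logDir.resolve(fileName);
        try {
            if (!Files.exists(file)) {
                Files.createFile(file);
            }
            return Optional.of(file);
        } catch (IOException e) {
            channel.markOffline(logDir.toString(),
                    "Error while creating checkpoint file " + file, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        Path healthy = Files.createTempDirectory("log-dir-ok");
        // A directory that does not exist, to force an IOException on creation.
        Path broken = healthy.resolve("missing-dir");

        RecordingChannel channel = new RecordingChannel();
        for (Path dir : List.of(healthy, broken)) {
            boolean ok = initCheckpoint(dir, "recovery-point-offset-checkpoint", channel).isPresent();
            System.out.println(dir + " initialized: " + ok);
        }
        System.out.println("offline dirs: " + channel.offlineDirs);
    }
}
```

In this sketch the failed directory is only recorded; how a broker should react once a directory is reported (for example, by taking it offline) is outside what the example shows.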



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
