Haozhong Ma created KAFKA-19548:
-----------------------------------

             Summary: Broker Startup: Handle Checkpoint Creation Failure via 
logDirFailureChannel
                 Key: KAFKA-19548
                 URL: https://issues.apache.org/jira/browse/KAFKA-19548
             Project: Kafka
          Issue Type: Improvement
          Components: core
            Reporter: Haozhong Ma
            Assignee: Haozhong Ma


In our production environment, we encountered a scenario where a broker failed 
to start due to checkpoint creation failure on a single disk (caused by disk 
corruption or filesystem errors). According to Kafka's design, such disk-level 
failures should be isolated via {{logDirFailureChannel}}, allowing other 
healthy disks to continue serving traffic. However, upon reviewing the 
{{CheckpointFileWithFailureHandler}} implementation, we observed that while 
methods like {{write}}, {{read}}, and {{writeIfDirExists}} handle 
{{IOException}} by routing the affected {{log.dir}} to 
{{logDirFailureChannel}}, the checkpoint initialization process lacks this 
fault-tolerant behavior. Is this an oversight? Should checkpoint creation adopt 
the same failure-handling logic?
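
To make the question concrete, here is a minimal, self-contained sketch of the 
behavior being asked about. It is not Kafka's actual code: 
{{CheckpointInitSketch}}, {{FailureChannel}}, {{markOffline}} and 
{{createCheckpoint}} are hypothetical stand-ins invented for illustration. The 
only assumption is that checkpoint creation can throw {{IOException}} and that 
the failure could be reported for the affected log dir instead of propagating 
and aborting startup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CheckpointInitSketch {

    // Hypothetical stand-in for a log dir failure channel: it only records
    // which log dirs have been reported as failed.
    static final class FailureChannel {
        private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();

        void markOffline(String logDir, String message, IOException cause) {
            // Report each failed dir once; later reports for it are ignored.
            if (offlineDirs.add(logDir)) {
                System.err.println(message + ": " + cause);
            }
        }

        boolean isOffline(String logDir) {
            return offlineDirs.contains(logDir);
        }
    }

    // Hypothetical creation step: an IOException is routed to the channel for
    // this log dir rather than thrown to the caller.
    static Optional<Path> createCheckpoint(Path logDir, String fileName, FailureChannel channel) {
        Path checkpoint = logDir.resolve(fileName);
        try {
            if (!Files.exists(checkpoint)) {
                Files.createFile(checkpoint);
            }
            return Optional.of(checkpoint);
        } catch (IOException e) {
            channel.markOffline(logDir.toString(),
                "Error while creating checkpoint file " + checkpoint, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        FailureChannel channel = new FailureChannel();
        Path healthyDir = Files.createTempDirectory("healthy-log-dir");
        // A directory that does not exist simulates an unusable disk.
        Path brokenDir = healthyDir.resolve("missing-log-dir");

        for (Path dir : new Path[] {healthyDir, brokenDir}) {
            Optional<Path> checkpoint =
                createCheckpoint(dir, "recovery-point-offset-checkpoint", channel);
            System.out.println(dir + " usable=" + checkpoint.isPresent()
                + " offline=" + channel.isOffline(dir.toString()));
        }
    }
}
```

In this sketch the failing directory is marked offline while the healthy one 
still gets its checkpoint, which is the isolation we would expect at broker 
startup. Whether the real constructor should catch the exception itself or 
leave that to its caller is part of the open question.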

!image-2025-07-25-15-07-18-919.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
