Haozhong Ma created KAFKA-19548:
-----------------------------------

             Summary: Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
                 Key: KAFKA-19548
                 URL: https://issues.apache.org/jira/browse/KAFKA-19548
             Project: Kafka
          Issue Type: Improvement
          Components: core
            Reporter: Haozhong Ma
            Assignee: Haozhong Ma

In our production environment, a broker failed to start because checkpoint creation failed on a single disk (caused by disk corruption or filesystem errors). According to Kafka's design, such disk-level failures should be isolated via {{logDirFailureChannel}}, so that the other, healthy disks can continue serving traffic.

However, on reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed that methods such as {{write}}, {{read}}, and {{writeIfDirExists}} handle {{IOException}} by routing the affected {{log.dir}} to {{logDirFailureChannel}}, whereas the checkpoint initialization path has no such fault-tolerant handling.

Is this an oversight? Should checkpoint creation adopt the same failure-handling logic?

!image-2025-07-25-15-07-18-919.png!
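
To make the proposal concrete, here is a minimal, self-contained sketch of the behavior being asked for. It is not Kafka's code: {{SketchFailureChannel}} and {{SketchCheckpoint}} are hypothetical stand-ins for {{logDirFailureChannel}} and the checkpoint initialization path, and the only assumption is that creation failures surface as {{IOException}}. The idea is that a failed creation marks the one affected directory offline instead of propagating and aborting startup.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for logDirFailureChannel: it only records which dirs went offline.
final class SketchFailureChannel {
    private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();

    void markOffline(String logDir, String reason, IOException cause) {
        offlineDirs.add(logDir);
    }

    boolean isOffline(String logDir) {
        return offlineDirs.contains(logDir);
    }
}

// Hypothetical stand-in for the checkpoint initialization path.
final class SketchCheckpoint {
    /**
     * Creates the checkpoint file if it does not exist yet.
     * Returns true when the file is usable. On any other IOException the
     * log dir is reported to the failure channel and false is returned,
     * so the caller can skip this dir and keep starting up.
     */
    static boolean createOrReportFailure(Path logDir, String fileName,
                                         SketchFailureChannel channel) {
        Path checkpoint = logDir.resolve(fileName);
        try {
            Files.createFile(checkpoint);
            return true;
        } catch (FileAlreadyExistsException e) {
            // An existing checkpoint from a previous run is fine.
            return true;
        } catch (IOException e) {
            // Same routing that write/read are described as using above.
            channel.markOffline(logDir.toString(),
                "Error while creating checkpoint file " + checkpoint, e);
            return false;
        }
    }
}
```

With this shape, a broker iterating over its configured log dirs would drop only the dir for which {{createOrReportFailure}} returns false, which matches the isolation that the existing {{write}} and {{read}} handling is described as providing.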