Haozhong Ma created KAFKA-19646:
-----------------------------------

             Summary: CLONE - Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
                 Key: KAFKA-19646
                 URL: https://issues.apache.org/jira/browse/KAFKA-19646
             Project: Kafka
          Issue Type: Improvement
          Components: core
            Reporter: Haozhong Ma
            Assignee: Haozhong Ma

In our production environment, a broker failed to start because checkpoint creation failed on a single disk (caused by disk corruption or filesystem errors). According to Kafka's design, such disk-level failures should be isolated via {{logDirFailureChannel}}, so that the other, healthy disks can continue serving traffic.

However, on reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed that methods such as {{write}}, {{read}}, and {{writeIfDirExists}} handle {{IOException}} by routing the affected {{log.dir}} to {{logDirFailureChannel}}, whereas the checkpoint initialization path has no such fault-tolerant handling.

Should checkpoint creation adopt the same failure-handling logic? If the current behavior is not intentional, I will submit a PR to fix it.
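
To make the proposal concrete, here is a minimal, self-contained sketch of the behavior in question. It is not the actual Kafka code: {{FailureChannel}}, {{createCheckpoint}}, and {{CheckpointInitSketch}} are hypothetical stand-ins, loosely modeled on the way the existing {{write}}/{{read}} paths report an {{IOException}}. The idea is only that a failure while creating the checkpoint file marks that one log directory offline instead of propagating and aborting startup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CheckpointInitSketch {

    /** Hypothetical stand-in for a log-dir failure channel; NOT Kafka's class. */
    static class FailureChannel {
        final List<String> offlineDirs = new ArrayList<>();

        void maybeAddOfflineLogDir(String logDir, String message, IOException cause) {
            // Record each failed directory once; a real channel would also notify a handler.
            if (!offlineDirs.contains(logDir)) {
                offlineDirs.add(logDir);
            }
        }
    }

    /**
     * Creates the checkpoint file if it is missing. On IOException the directory is
     * reported to the channel and false is returned, rather than rethrowing.
     */
    static boolean createCheckpoint(Path logDir, String fileName, FailureChannel channel) {
        try {
            Path file = logDir.resolve(fileName);
            if (!Files.exists(file)) {
                Files.createFile(file);
            }
            return true;
        } catch (IOException e) {
            channel.maybeAddOfflineLogDir(
                logDir.toString(), "Error while creating checkpoint file in " + logDir, e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        FailureChannel channel = new FailureChannel();
        Path healthy = Files.createTempDirectory("log-dir-ok");
        // A directory that does not exist stands in for a corrupted disk (assumption).
        Path broken = healthy.resolve("missing-dir");

        for (Path dir : List.of(healthy, broken)) {
            boolean ok = createCheckpoint(dir, "recovery-point-offset-checkpoint", channel);
            System.out.println(dir + " -> " + (ok ? "online" : "offline"));
        }
        System.out.println("Offline dirs: " + channel.offlineDirs);
    }
}
```

In this sketch the healthy directory still gets its checkpoint file and only the broken one ends up in {{offlineDirs}}, which is the isolation I would expect at broker startup.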