[ https://issues.apache.org/jira/browse/KAFKA-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haozhong Ma resolved KAFKA-19646.
---------------------------------
    Resolution: Not A Problem

> CLONE - Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> -----------------------------------------------------------------------------------
>
>                 Key: KAFKA-19646
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19646
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Haozhong Ma
>            Assignee: Haozhong Ma
>            Priority: Major
>
> In our production environment, we encountered a scenario where a broker
> failed to start due to checkpoint creation failure on a single disk (caused
> by disk corruption or filesystem errors). According to Kafka's design, such
> disk-level failures should be isolated via {{logDirFailureChannel}},
> allowing other healthy disks to continue serving traffic. However, upon
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we
> observed that while methods like {{write}}, {{read}}, and
> {{writeIfDirExists}} handle {{IOException}} by routing the affected
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization
> process lacks this fault-tolerant behavior. Should checkpoint creation adopt
> the same failure-handling logic? If this is not an intentional design, I will
> submit a PR to fix this issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
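
For readers unfamiliar with the pattern the report describes, the following is a minimal, self-contained sketch of routing a checkpoint-creation IOException to a failure channel instead of letting it abort startup. It is not Kafka's code: FailureChannel is a hypothetical stand-in for logDirFailureChannel, and initCheckpoint, the class name, and the checkpoint file name are illustrative assumptions. The issue was resolved as Not A Problem, so this only illustrates the behavior being asked about, not a change that was made.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CheckpointInitSketch {

    /** Hypothetical stand-in for a log-dir failure channel: it only records offline dirs. */
    static final class FailureChannel {
        private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();

        void markOffline(String logDir, String message, IOException cause) {
            // A real broker would also notify a handler thread; here we just record the dir.
            offlineDirs.add(logDir);
        }

        boolean isOffline(String logDir) {
            return offlineDirs.contains(logDir);
        }
    }

    /**
     * Tries to create the checkpoint file in logDir (assumed helper, not a Kafka API).
     * On IOException the directory is reported to the channel and false is returned,
     * so the caller can skip this directory and keep using the healthy ones.
     */
    static boolean initCheckpoint(Path logDir, String fileName, FailureChannel channel) {
        try {
            Path file = logDir.resolve(fileName);
            if (!Files.exists(file)) {
                Files.createFile(file);
            }
            return true;
        } catch (IOException e) {
            channel.markOffline(logDir.toString(),
                    "Error while creating checkpoint file in " + logDir, e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        FailureChannel channel = new FailureChannel();
        Path healthy = Files.createTempDirectory("logdir-ok");
        // This directory is never created, so file creation inside it fails.
        Path broken = healthy.resolve("missing-dir");

        System.out.println(initCheckpoint(healthy, "sketch-checkpoint", channel));
        System.out.println(initCheckpoint(broken, "sketch-checkpoint", channel));
        System.out.println(channel.isOffline(broken.toString()));
    }
}
```

In this sketch the failing directory is marked offline while the healthy directory stays usable, which mirrors the isolation the report attributes to write, read, and writeIfDirExists.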