Divij Vaidya created KAFKA-15391:
------------------------------------

             Summary: Delete topic may lead to directory offline
                 Key: KAFKA-15391
                 URL: https://issues.apache.org/jira/browse/KAFKA-15391
             Project: Kafka
          Issue Type: Bug
          Components: core
            Reporter: Divij Vaidya
             Fix For: 3.6.0

This is an edge case in which the entire log directory is marked offline when a topic is deleted. The symptoms of this scenario are characterised by the following logs:

{noformat}
[2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)
org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 (exclusive) and recovery point 221
Caused by: java.nio.file.NoSuchFileException: /tmp/kafka-15093588566723278510/test-0
{noformat}

The above log is followed by logs such as:

{noformat}
[2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)
org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-15093588566723278510 is already offline due to a previous IO exception.
{noformat}

The sequence of events below demonstrates the scenario in which this bug manifests:

1. On the broker, the partition lock is acquired and UnifiedLog.roll() is called, which schedules an async call to flushUptoOffsetExclusive(). The roll may be triggered by the segment rotation time or size.
2. The admin client calls deleteTopic.
3. On the broker, LogManager.asyncDelete() is called, which calls UnifiedLog.renameDir().
4. The directory for the partition is successfully renamed with a "delete" suffix.
5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts executing. It tries to call localLog.flush() without acquiring the partition lock.
6. LocalLog calls Utils.flushDir(), which fails with an IOException (the NoSuchFileException in the first log above) because the directory no longer exists under its original name.
7. On the IOException, the log directory is added to logDirFailureChannel.
8. Any new interaction with this logDir fails, and a log line is printed such as "The log dir $logDir is already offline due to a previous IO exception".
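
Steps 3 to 6 can be replayed outside the broker with plain java.nio calls. The following is only a minimal sketch of the failure mode, not Kafka code. The class name FlushAfterRenameSketch, the helper flushDirIfPresent, the temporary directory prefix and the "test-0.sketch-delete" name are all hypothetical. The guard shown is one possible way to tolerate the rename and is not necessarily how the issue was fixed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FlushAfterRenameSketch {

    // Fsync a directory by opening it read-only and forcing it to disk
    // (opening a directory this way is assumed to work on POSIX systems such as Linux).
    static void flushDir(Path dir) throws IOException {
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        }
    }

    // Hypothetical guard: treat an already-renamed directory as "nothing to flush".
    static boolean flushDirIfPresent(Path dir) throws IOException {
        try {
            flushDir(dir);
            return true;
        } catch (NoSuchFileException e) {
            return false;
        }
    }

    // Replays steps 3-6 sequentially and returns a label for what the stale flush did.
    static String replay(boolean guarded) {
        try {
            Path logDir = Files.createTempDirectory("kafka-sketch");
            Path partitionDir = Files.createDirectory(logDir.resolve("test-0"));
            // Steps 3-4: the delete path renames the partition directory.
            Path renamed = Files.move(partitionDir, logDir.resolve("test-0.sketch-delete"));
            try {
                // Steps 5-6: the flush scheduled before the rename still uses the old path.
                if (guarded) {
                    return flushDirIfPresent(partitionDir) ? "flushed" : "skipped";
                }
                flushDir(partitionDir);
                return "flushed";
            } catch (NoSuchFileException e) {
                return "NoSuchFileException";
            } finally {
                // Clean up the temporary directories created for the replay.
                Files.deleteIfExists(renamed);
                Files.deleteIfExists(logDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + replay(false));
        System.out.println("guarded:   " + replay(true));
    }
}
```

In the unguarded replay the stale flush hits NoSuchFileException, which is the exception the broker wraps in a KafkaStorageException before marking the whole log directory offline. The guarded replay skips the flush instead. Whether silently skipping is acceptable depends on whether the partition is really being deleted, which this sketch does not check.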