ccding opened a new pull request #11351: URL: https://github.com/apache/kafka/pull/11351
We have seen an exception caused by shutting down the scheduler before shutting down LogManager. While LogManager was closing partitions one by one, the scheduler invoked retention-based deletion of old segments. Those segments may already have been closed by LogManager, which caused an exception and subsequently marked the log directory as offline. As a result, the broker didn't flush the remaining partitions and didn't write the clean shutdown marker. Ultimately, the broker took hours to recover the log during restart.

This PR essentially reverts https://github.com/apache/kafka/pull/10538

I believe the exception that https://github.com/apache/kafka/pull/10538 saw is at https://github.com/apache/kafka/blob/5a6f19b2a1ff72c52ad627230ffdf464456104ee/core/src/main/scala/kafka/log/LocalLog.scala#L895-L903 which called the scheduler and crashed the compaction thread. The effect of this exception has been mitigated by https://github.com/apache/kafka/pull/10763

cc @rondagostino @ijuma @cmccabe @junrao @dhruvilshah3 as authors/reviewers of the PRs mentioned above to make sure this change looks okay.

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
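For readers following the failure chain in the description, here is a minimal, deterministic sketch of it in Java. It is not Kafka code: `ShutdownRaceSketch`, `Segment`, `Partition`, `Result`, and the `retentionFiresMidShutdown` flag are all hypothetical names invented for illustration, and the retention task is simulated by a direct call rather than a real scheduler thread. It only models the sequence described above: a retention delete hits an already closed segment, the resulting exception marks the log directory offline, the remaining partitions are not flushed, and no clean shutdown marker is written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the failure chain; none of these types are Kafka classes.
class ShutdownRaceSketch {

    // Stand-in for a log segment that rejects deletion once it has been closed.
    static final class Segment {
        private boolean closed = false;

        void close() { closed = true; }

        void delete() {
            if (closed) {
                throw new IllegalStateException("segment already closed");
            }
        }
    }

    // Stand-in for a partition's log with a single old segment.
    static final class Partition {
        final Segment oldSegment = new Segment();
        boolean flushed = false;

        void flushAndClose() {
            flushed = true;
            oldSegment.close();
        }

        // What a retention task would do (assumed, simplified).
        void deleteOldSegments() { oldSegment.delete(); }
    }

    // Outcome of one simulated shutdown.
    static final class Result {
        final boolean logDirOffline;
        final int flushedPartitions;
        final boolean cleanShutdownMarkerWritten;

        Result(boolean logDirOffline, int flushedPartitions, boolean marker) {
            this.logDirOffline = logDirOffline;
            this.flushedPartitions = flushedPartitions;
            this.cleanShutdownMarkerWritten = marker;
        }
    }

    static List<Partition> newPartitions(int n) {
        List<Partition> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partitions.add(new Partition());
        }
        return partitions;
    }

    // Closes partitions one by one. If retentionFiresMidShutdown is true, a
    // simulated retention task runs against the first partition right after
    // that partition has been closed.
    static Result shutdown(List<Partition> partitions, boolean retentionFiresMidShutdown) {
        boolean offline = false;
        int flushed = 0;
        for (int i = 0; i < partitions.size(); i++) {
            if (offline) {
                break; // an offline log directory stops further flushing
            }
            partitions.get(i).flushAndClose();
            flushed++;
            if (retentionFiresMidShutdown && i == 0) {
                try {
                    partitions.get(0).deleteOldSegments();
                } catch (IllegalStateException e) {
                    offline = true; // the exception marks the directory offline
                }
            }
        }
        // The marker is written only if the directory stayed online throughout.
        return new Result(offline, flushed, !offline);
    }

    public static void main(String[] args) {
        Result raced = shutdown(newPartitions(3), true);
        Result quiet = shutdown(newPartitions(3), false);
        System.out.println("raced: offline=" + raced.logDirOffline
                + " flushed=" + raced.flushedPartitions
                + " marker=" + raced.cleanShutdownMarkerWritten);
        System.out.println("quiet: offline=" + quiet.logDirOffline
                + " flushed=" + quiet.flushedPartitions
                + " marker=" + quiet.cleanShutdownMarkerWritten);
    }
}
```

In this model, the run where the retention task fires flushes only one of three partitions and leaves no marker, while the run without it flushes all three and writes the marker. A real broker involves concurrent threads and per-directory state that this sketch deliberately omits.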