[ https://issues.apache.org/jira/browse/KAFKA-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286884#comment-16286884 ]
Guozhang Wang commented on KAFKA-2319: -------------------------------------- I think this issue still exists, but probably in a slight different manner. Here is what I observed while shutting down a server: {code} [2017-12-11 16:37:32,620] ERROR [kafka-log-cleaner-thread-0]: Error due to (kafka.log.LogCleaner$CleanerThread:73) java.lang.IllegalStateException: Kafka scheduler is not running. at kafka.utils.KafkaScheduler.ensureRunning(KafkaScheduler.scala:132) at kafka.utils.KafkaScheduler.schedule(KafkaScheduler.scala:106) at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1637) at kafka.log.Log$$anonfun$replaceSegments$1.apply(Log.scala:1682) at kafka.log.Log$$anonfun$replaceSegments$1.apply(Log.scala:1677) at scala.collection.immutable.List.foreach(List.scala:392) at kafka.log.Log.replaceSegments(Log.scala:1677) at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:489) at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:400) at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:399) at scala.collection.immutable.List.foreach(List.scala:392) at kafka.log.Cleaner.doClean(LogCleaner.scala:399) at kafka.log.Cleaner.clean(LogCleaner.scala:376) at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:267) at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:247) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:68) {code} The root cause is that we call {{Log#asyncDeleteSegment}} to asynchronously delete the segment, AND we shutdown scheduler first then cleaner thread. Hence the following race condition can happen (very likely): 1. Kafka server shutdown, {{KafkaScheduler#shutdown}} called first. 2. {{cachedExecutor.shutdown()}} called, which executed all the already-scheduled tasks. 3. {{KafkaScheduler#shutdown}} completed, {{executor}} is set to {{null}}. 4. The {{LogCleaner.doWork()}} is called then before it is shutting down, and then calling {{scheduler.schedule("delete-file", deleteSeg _, delay = config.fileDeleteDelayMs)}}. BOOM. An easy fix would be re-ordering the shutdown sequence, to make sure callers of the scheduler like the cleaner thread is shutdown first. > After controlled shutdown: IllegalStateException: Kafka scheduler has not > been started > -------------------------------------------------------------------------------------- > > Key: KAFKA-2319 > URL: https://issues.apache.org/jira/browse/KAFKA-2319 > Project: Kafka > Issue Type: Bug > Affects Versions: 1.0.1 > Reporter: Jason Rosenberg > > Running 0.8.2.1, just saw this today at the end of a controlled shutdown. It > doesn't happen every time, but I've seen it several times: > {code} > 2015-07-07 18:54:28,424 INFO [Thread-4] server.KafkaServer - [Kafka Server > 99], Controlled shutdown succeeded > 2015-07-07 18:54:28,425 INFO [Thread-4] network.SocketServer - [Socket > Server on Broker 99], Shutting down > 2015-07-07 18:54:28,435 INFO [Thread-4] network.SocketServer - [Socket > Server on Broker 99], Shutdown completed > 2015-07-07 18:54:28,435 INFO [Thread-4] server.KafkaRequestHandlerPool - > [Kafka Request Handler on Broker 99], shutting down > 2015-07-07 18:54:28,444 INFO [Thread-4] server.KafkaRequestHandlerPool - > [Kafka Request Handler on Broker 99], shut down completely > 2015-07-07 18:54:28,649 INFO [Thread-4] server.ReplicaManager - [Replica > Manager on Broker 99]: Shut down > 2015-07-07 18:54:28,649 INFO [Thread-4] server.ReplicaFetcherManager - > [ReplicaFetcherManager on broker 99] shutting down > 2015-07-07 18:54:28,650 INFO [Thread-4] server.ReplicaFetcherThread - > [ReplicaFetcherThread-0-95], Shutting down > 2015-07-07 18:54:28,750 INFO [Thread-4] server.ReplicaFetcherThread - > [ReplicaFetcherThread-0-95], Shutdown completed > 2015-07-07 18:54:28,750 INFO [ReplicaFetcherThread-0-95] > server.ReplicaFetcherThread - [ReplicaFetcherThread-0-95], Stopped > 2015-07-07 18:54:28,750 INFO [Thread-4] server.ReplicaFetcherThread - > [ReplicaFetcherThread-0-98], Shutting down > 2015-07-07 18:54:28,791 INFO [Thread-4] server.ReplicaFetcherThread - > [ReplicaFetcherThread-0-98], Shutdown completed > 2015-07-07 18:54:28,791 INFO [ReplicaFetcherThread-0-98] > server.ReplicaFetcherThread - [ReplicaFetcherThread-0-98], Stopped > 2015-07-07 18:54:28,791 INFO [Thread-4] server.ReplicaFetcherManager - > [ReplicaFetcherManager on broker 99] shutdown completed > 2015-07-07 18:54:28,819 INFO [Thread-4] server.ReplicaManager - [Replica > Manager on Broker 99]: Shut down completely > 2015-07-07 18:54:28,826 INFO [Thread-4] log.LogManager - Shutting down. > 2015-07-07 18:54:30,459 INFO [Thread-4] log.LogManager - Shutdown complete. > 2015-07-07 18:54:30,463 WARN [Thread-4] utils.Utils$ - Kafka scheduler has > not been started > java.lang.IllegalStateException: Kafka scheduler has not been started > at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114) > at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86) > at > kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350) > at > kafka.controller.KafkaController.shutdown(KafkaController.scala:664) > at > kafka.server.KafkaServer$$anonfun$shutdown$8.apply$mcV$sp(KafkaServer.scala:285) > at kafka.utils.Utils$.swallow(Utils.scala:172) > at kafka.utils.Logging$class.swallowWarn(Logging.scala:92) > at kafka.utils.Utils$.swallowWarn(Utils.scala:45) > at kafka.utils.Logging$class.swallow(Logging.scala:94) > at kafka.utils.Utils$.swallow(Utils.scala:45) > at kafka.server.KafkaServer.shutdown(KafkaServer.scala:285) > ... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)