Hi Spark devs,

Occasionally, when hitting Ctrl-C in the Scala Spark shell on 0.9.0, one of my workers goes dead in the Spark master UI. I'm using the standalone cluster and never saw this with 0.8.0, so I think it may be a regression.
When I prod at the hung CoarseGrainedExecutorBackend JVM with jstack and jmap -heap, it doesn't respond unless I add the -F force flag. The heap isn't full, but there are some interesting bits in the jstack output. Poking around a little, I think there may be some kind of deadlock in the shutdown hooks. Below are the threads I think are most interesting:

Thread 14308: (state = BLOCKED)
 - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
 - java.lang.Runtime.exit(int) @bci=14, line=109 (Interpreted frame)
 - java.lang.System.exit(int) @bci=4, line=962 (Interpreted frame)
 - org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(java.lang.Object, scala.Function1) @bci=352, line=81 (Interpreted frame)
 - akka.actor.ActorCell.receiveMessage(java.lang.Object) @bci=25, line=498 (Interpreted frame)
 - akka.actor.ActorCell.invoke(akka.dispatch.Envelope) @bci=39, line=456 (Interpreted frame)
 - akka.dispatch.Mailbox.processMailbox(int, long) @bci=24, line=237 (Interpreted frame)
 - akka.dispatch.Mailbox.run() @bci=20, line=219 (Interpreted frame)
 - akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec() @bci=4, line=386 (Interpreted frame)
 - scala.concurrent.forkjoin.ForkJoinTask.doExec() @bci=10, line=260 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(scala.concurrent.forkjoin.ForkJoinTask) @bci=10, line=1339 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinPool.runWorker(scala.concurrent.forkjoin.ForkJoinPool$WorkQueue) @bci=11, line=1979 (Compiled frame)
 - scala.concurrent.forkjoin.ForkJoinWorkerThread.run() @bci=14, line=107 (Interpreted frame)

Thread 3865: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Thread.join(long) @bci=38, line=1280 (Interpreted frame)
 - java.lang.Thread.join() @bci=2, line=1354 (Interpreted frame)
 - java.lang.ApplicationShutdownHooks.runHooks() @bci=87, line=106 (Interpreted frame)
 - java.lang.ApplicationShutdownHooks$1.run() @bci=0, line=46 (Interpreted frame)
 - java.lang.Shutdown.runHooks() @bci=39, line=123 (Interpreted frame)
 - java.lang.Shutdown.sequence() @bci=26, line=167 (Interpreted frame)
 - java.lang.Shutdown.exit(int) @bci=96, line=212 (Interpreted frame)
 - java.lang.Terminator$1.handle(sun.misc.Signal) @bci=8, line=52 (Interpreted frame)
 - sun.misc.Signal$1.run() @bci=8, line=212 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)

Thread 3987: (state = BLOCKED)
 - java.io.UnixFileSystem.list(java.io.File) @bci=0 (Interpreted frame)
 - java.io.File.list() @bci=29, line=1116 (Interpreted frame)
 - java.io.File.listFiles() @bci=1, line=1201 (Compiled frame)
 - org.apache.spark.util.Utils$.listFilesSafely(java.io.File) @bci=1, line=466 (Interpreted frame)
 - org.apache.spark.util.Utils$.deleteRecursively(java.io.File) @bci=9, line=478 (Compiled frame)
 - org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(java.io.File) @bci=4, line=479 (Compiled frame)
 - org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(java.lang.Object) @bci=5, line=478 (Compiled frame)
 - scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized, scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.WrappedArray.foreach(scala.Function1) @bci=2, line=34 (Compiled frame)
 - org.apache.spark.util.Utils$.deleteRecursively(java.io.File) @bci=19, line=478 (Interpreted frame)
 - org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(java.io.File) @bci=14, line=141 (Interpreted frame)
 - org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(java.lang.Object) @bci=5, line=139 (Interpreted frame)
 - scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized, scala.Function1) @bci=22, line=33 (Compiled frame)
 - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1) @bci=2, line=108 (Interpreted frame)
 - org.apache.spark.storage.DiskBlockManager$$anon$1.run() @bci=39, line=139 (Interpreted frame)

I think what happened here is this: the Ctrl-C delivered a signal that started thread 3865 (note the sun.misc.Signal and java.lang.Terminator frames at the bottom of its stack), which is the JVM running its normal shutdown sequence. Meanwhile, thread 14308 received the akka "shutdown" message and called System.exit(), which blocks in Shutdown.exit() because the shutdown sequence is already in progress on thread 3865. Part of that sequence is running the application shutdown hooks, so thread 3865 started thread 3987 and is now sitting in runHooks() join()ing it. Thread 3987 is the shutdown hook registered by addShutdownHook() in DiskBlockManager.scala, which looks like this:

  private def addShutdownHook() {
    localDirs.foreach(localDir => Utils.registerShutdownDeleteDir(localDir))
    Runtime.getRuntime.addShutdownHook(new Thread("delete Spark local dirs") {
      override def run() {
        logDebug("Shutdown hook called")
        localDirs.foreach { localDir =>
          try {
            if (!Utils.hasRootAsShutdownDeleteDir(localDir)) Utils.deleteRecursively(localDir)
          } catch {
            case t: Throwable =>
              logError("Exception while deleting local spark dir: " + localDir, t)
          }
        }
        if (shuffleSender != null) {
          shuffleSender.stop()
        }
      }
    })
  }

It goes through and deletes the local directories recursively. I was thinking there might be some issue with concurrently-running shutdown hooks deleting things out from under each other (the shutdown hook javadocs say all registered hooks are started in parallel when the JVM shuts down), which could cause the File.list() in that last thread to take quite some time. While I was looking through the stack traces, the JVM finally exited (after at least 15-20 minutes), so I won't be able to debug more until this bug strikes again.

Any ideas on what might be going on here?

Thanks!
Andrew
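In case it helps to see the race I'm imagining, here's a minimal plain-JDK Java sketch (not Spark code; the class name and directory layout are made up for illustration) of two hook-like threads recursively deleting the same tree. The null check on listFiles() is the important part: if a sibling deleter removes the directory between the isDirectory() test and the listing, listFiles() returns null rather than an array.

```java
import java.io.File;
import java.nio.file.Files;

public class DeleteRace {

    // Recursive delete that tolerates a concurrent deleter working on the
    // same tree. listFiles() returns null if the directory disappears (or an
    // I/O error occurs) between the isDirectory() check and the listing, so
    // the null case must be handled rather than assumed impossible.
    public static void deleteRecursively(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
        }
        // delete() returns false if the other thread got there first; that is
        // fine here, so the result is deliberately ignored.
        f.delete();
    }

    public static void main(String[] args) throws Exception {
        File root = Files.createTempDirectory("spark-local-test").toFile();
        File sub = new File(root, "shuffle_0");
        sub.mkdirs();
        new File(sub, "block_0").createNewFile();

        // Two hook-like threads racing to delete the same tree, as the JVM
        // does when it starts all registered shutdown hooks in parallel.
        Thread a = new Thread(() -> deleteRecursively(root));
        Thread b = new Thread(() -> deleteRecursively(root));
        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println("root exists: " + root.exists());
    }
}
```

The losing thread's delete() calls just return false, so nothing here explains a File.list() taking 15+ minutes on its own; it only shows the interleavings the hook code has to tolerate when several hooks tear down overlapping directories at once.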