[jira] [Updated] (SPARK-1769) Executor loss can cause race condition in Pool

Aaron Davidson (JIRA) Tue, 13 May 2014 10:57:14 -0700

     [ https://issues.apache.org/jira/browse/SPARK-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Aaron Davidson updated SPARK-1769:
----------------------------------

    Assignee: Andrew Or  (was: Aaron Davidson)

> Executor loss can cause race condition in Pool
> ----------------------------------------------
>
>                 Key: SPARK-1769
>                 URL: https://issues.apache.org/jira/browse/SPARK-1769
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Aaron Davidson
>            Assignee: Andrew Or
>
> Loss of executors (in this case due to OOMs) exposes a race condition in 
> Pool.scala, evident from this stack trace:
> {code}
> 14/05/08 22:41:48 ERROR OneForOneStrategy:
> java.lang.NullPointerException
>         at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
>         at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
>         at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
>         at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:385)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.removeExecutor(CoarseGrainedSchedulerBackend.scala:160)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1$$anonfun$applyOrElse$5.apply(CoarseGrainedSchedulerBackend.scala:123)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> Note that the line of code that throws this exception is here:
> {code}
> schedulableQueue.foreach(_.executorLost(executorId, host))
> {code}
> Judging by the stack trace, it is not schedulableQueue that is null, but an 
> element in it. As far as I can tell, we never add a null element to this 
> queue. Rather, the log messages show that removeSchedulable() and 
> executorLost() were called at about the same time, and I suspect that, since 
> this ArrayBuffer is not synchronized in any way, we iterate through the list 
> while it is in an incomplete state.
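
One way to avoid iterating over a half-updated buffer is to keep the children in a thread-safe collection. The following is only a minimal sketch under that assumption, not the actual Pool.scala code and not necessarily the fix that was applied: SketchSchedulable, SketchTaskSet and SketchPool are hypothetical stand-ins, and only the names schedulableQueue, removeSchedulable and executorLost are taken from the report above.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical stand-in for Spark's Schedulable; only what the sketch needs.
trait SketchSchedulable {
  def name: String
  def executorLost(executorId: String, host: String): Unit
}

// Hypothetical leaf that records the executors it was told were lost.
class SketchTaskSet(val name: String) extends SketchSchedulable {
  val lost = new ConcurrentLinkedQueue[String]
  def executorLost(executorId: String, host: String): Unit = {
    lost.add(executorId)
    ()
  }
}

// Hypothetical pool: children live in a ConcurrentLinkedQueue
// (an assumption of this sketch) instead of an unsynchronized ArrayBuffer.
class SketchPool(val name: String) extends SketchSchedulable {
  private val schedulableQueue = new ConcurrentLinkedQueue[SketchSchedulable]

  def addSchedulable(s: SketchSchedulable): Unit = {
    schedulableQueue.add(s)
    ()
  }

  def removeSchedulable(s: SketchSchedulable): Unit = {
    schedulableQueue.remove(s)
    ()
  }

  // The queue's iterator is weakly consistent: it tolerates concurrent
  // add/remove calls and never yields a null element.
  def executorLost(executorId: String, host: String): Unit = {
    val it = schedulableQueue.iterator()
    while (it.hasNext) it.next().executorLost(executorId, host)
  }
}
```

The trade-off is that a schedulable removed while executorLost() is iterating may or may not still receive the notification. If that matters, guarding both removeSchedulable() and executorLost() with a common lock would be an alternative.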



--
This message was sent by Atlassian JIRA
(v6.2#6252)
