[
https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-4609:
-----------------------------
Target Version/s: (was: 1.3.0)
> Job can not finish if there is one bad slave in clusters
> --------------------------------------------------------
>
> Key: SPARK-4609
> URL: https://issues.apache.org/jira/browse/SPARK-4609
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Davies Liu
>
> If there is one bad machine in the cluster, its executors will keep dying (for
> example, because the disk is out of space). A task may be scheduled to this
> machine multiple times, and the job will then fail after several failures of
> that one task.
> {code}
> 14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 lost)
> 14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 lost)
> 14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 lost)
> 14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> A task should not be scheduled to the same machine more than once. Also, if a
> machine fails with an executor lost, it should be put on a blacklist for some
> time and then tried again.
> cc [~kayousterhout] [~matei]
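
The two rules proposed in the description (never retry a task on a host where it already failed, and blacklist a host for a while after an executor is lost there) can be sketched roughly as below. This is only an illustration in plain Python, not Spark's scheduler code: HostBlacklist, pick_host, the 60-second timeout, and the host names worker-a / worker-b are all hypothetical names made up for this sketch.

```python
from typing import Dict, Iterable, Optional, Set


class HostBlacklist:
    """Hypothetical helper: remembers hosts that recently lost an executor."""

    def __init__(self, timeout_s: float = 60.0) -> None:
        # Assumed timeout; the issue only says "for some time".
        self.timeout_s = timeout_s
        self._expires_at: Dict[str, float] = {}

    def report_executor_lost(self, host: str, now: float) -> None:
        # (Re)start the blacklist window for this host.
        self._expires_at[host] = now + self.timeout_s

    def is_blacklisted(self, host: str, now: float) -> bool:
        expiry = self._expires_at.get(host)
        if expiry is None:
            return False
        if now >= expiry:
            # Window elapsed: the host may be tried again.
            del self._expires_at[host]
            return False
        return True


def pick_host(
    failed_hosts: Set[str],
    candidates: Iterable[str],
    blacklist: HostBlacklist,
    now: float,
) -> Optional[str]:
    """Return the first candidate host that is acceptable for a task.

    A host is skipped if this task already failed there, or if the host
    is currently blacklisted. Returns None when no host qualifies.
    """
    for host in candidates:
        if host in failed_hosts:
            continue
        if blacklist.is_blacklisted(host, now):
            continue
        return host
    return None


if __name__ == "__main__":
    hosts = ["worker-a", "worker-b"]  # hypothetical host names
    blacklist = HostBlacklist(timeout_s=60.0)
    task_failed_on: Set[str] = set()

    # First attempt lands on worker-a, whose executor is then lost.
    first = pick_host(task_failed_on, hosts, blacklist, now=0.0)
    blacklist.report_executor_lost(first, now=0.0)
    task_failed_on.add(first)

    # The retry avoids worker-a instead of failing there again.
    print(pick_host(task_failed_on, hosts, blacklist, now=5.0))
```

With timestamps passed in explicitly, the sketch needs no real clock. A task that has not yet failed anywhere would also skip worker-a until the 60 seconds have elapsed, which is the "then try again" part of the proposal.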