[ https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-604.
-----------------------------
Resolution: Cannot Reproduce
This issue is stale at this point, and there have been no similar reports recently.
> reconnect if mesos slave dies
> -----------------------------
>
> Key: SPARK-604
> URL: https://issues.apache.org/jira/browse/SPARK-604
> Project: Spark
> Issue Type: Bug
> Components: Mesos
>
> When running on Mesos, if a slave goes down, Spark doesn't try to reassign
> the work to another machine. Even if the slave comes back up, the job is
> doomed.
> Currently when this happens, we just see this in the driver logs:
> 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: 201210312057-1560611338-5050-24091-52
> Exception in thread "Thread-346" java.util.NoSuchElementException: key not found: value: "201210312057-1560611338-5050-24091-52"
>     at scala.collection.MapLike$class.default(MapLike.scala:224)
>     at scala.collection.mutable.HashMap.default(HashMap.scala:43)
>     at scala.collection.MapLike$class.apply(MapLike.scala:135)
>     at scala.collection.mutable.HashMap.apply(HashMap.scala:43)
>     at spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255)
>     at spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275)
> 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned with code DRIVER_ABORTED
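
For anyone reading this later: the trace above is consistent with a `HashMap.apply` lookup on a slave ID that is not in the map, which throws `NoSuchElementException`. The sketch below only illustrates that failure mode and one defensive alternative. It is not the actual ClusterScheduler code, and the names `SlaveLostSketch`, `slaveLostUnsafe`, `slaveLostSafe`, and the sample IDs are hypothetical.

```scala
import scala.collection.mutable

object SlaveLostSketch {
  // Hypothetical handler: apply() throws NoSuchElementException when the
  // slave ID is unknown, which matches the exception type in the trace.
  def slaveLostUnsafe(hosts: mutable.HashMap[String, String], slaveId: String): String = {
    val host = hosts(slaveId) // throws if slaveId is not a key
    hosts -= slaveId
    host
  }

  // Defensive variant (an assumption, not the project's fix): remove()
  // returns an Option, so an unknown slave ID is a no-op instead of a crash.
  def slaveLostSafe(hosts: mutable.HashMap[String, String], slaveId: String): Option[String] =
    hosts.remove(slaveId)

  def main(args: Array[String]): Unit = {
    // Hypothetical bookkeeping: slave ID -> host name.
    val hosts = mutable.HashMap("slave-1" -> "host-a")
    println(slaveLostSafe(hosts, "unknown-slave")) // prints None
    println(slaveLostSafe(hosts, "slave-1"))       // prints Some(host-a)
  }
}
```

With the `Option`-returning form, the caller decides what to do about an unknown slave ID (for example, log and ignore it) rather than letting the exception propagate out of the scheduler callback.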