[
https://issues.apache.org/jira/browse/SPARK-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-1848.
------------------------------
Resolution: Cannot Reproduce
I think this is at least stale at this point.
> Executors are mysteriously dying when using Spark on Mesos
> ----------------------------------------------------------
>
> Key: SPARK-1848
> URL: https://issues.apache.org/jira/browse/SPARK-1848
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Spark Core
> Affects Versions: 1.0.0
> Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mesos 0.18.0
> Spark Master
> Reporter: Bouke van der Bijl
>
> Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a
> We have 47 machines running Mesos on which we're trying to run Spark jobs,
> but the jobs eventually fail because tasks get rescheduled too often: Spark
> kills the tasks because it believes their executors have died. When I look at
> the stderr or stdout of the Mesos slaves, there seems to be no indication of
> an error, and sometimes I see a "14/05/15 17:38:54 INFO DAGScheduler:
> Ignoring possibly bogus ShuffleMapTask completion from <id>", which suggests
> that the executor just keeps going and hasn't actually died. If I add a
> Thread.dumpStack() at the location where the job is killed, this is the trace
> it returns:
> at java.lang.Thread.dumpStack(Thread.java:1364)
> at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
> at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
> at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
> at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
> at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
> at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
> at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
> at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
> at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)
> What could cause this? Is this a setup problem with our cluster or a bug in
> Spark?
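For context, the reporter's instrumentation amounts to dumping the current thread's stack at the point where a task is marked failed. A minimal, self-contained sketch of that technique is below; the object and method names here are hypothetical stand-ins, not Spark's actual TaskSetManager code:

```scala
object StackDumpSketch {
  // Capture the current thread's stack as a string, in the same spirit as the
  // Thread.dumpStack() call the reporter added to TaskSetManager.handleFailedTask
  // (dumpStack itself just prints to System.err and returns nothing).
  def currentStack(): String =
    Thread.currentThread().getStackTrace.map("  at " + _).mkString("\n")

  // Hypothetical stand-in for a task-failure handler: log the stack so that
  // the caller responsible for killing the task can be identified.
  def handleFailedTask(taskId: Long): Unit = {
    System.err.println(s"Task $taskId marked failed; called from:")
    System.err.println(currentStack())
  }

  def main(args: Array[String]): Unit =
    handleFailedTask(42L)
}
```

Capturing the trace as a string (rather than calling Thread.dumpStack() directly) makes it easy to route the diagnostic through a logger instead of raw stderr.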
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]