Bouke van der Bijl created SPARK-1848:
-----------------------------------------
Summary: Executors are mysteriously dying when using Spark on Mesos
Key: SPARK-1848
URL: https://issues.apache.org/jira/browse/SPARK-1848
Project: Spark
Issue Type: Bug
Components: Mesos, Spark Core
Affects Versions: 1.0.0
Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
Mesos 0.18.0
Spark Master
Reporter: Bouke van der Bijl
Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a
We have 47 machines running Mesos that we're trying to run Spark jobs on, but
the jobs eventually fail because tasks have to be rescheduled too often, which
is caused by Spark killing the tasks because it thinks their executors have
died. When I look at the stderr or stdout of the Mesos slaves, there seems to
be no indication of an error happening, and sometimes I can see a "14/05/15
17:38:54 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion
from <id>", which would indicate that the executor just keeps going and hasn't
actually died. If I add a Thread.dumpStack() at the location where the job is
killed, this is the trace it prints:
at java.lang.Thread.dumpStack(Thread.java:1364)
at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)
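For clarity, the instrumentation is just a one-line stack dump in the task
failure handler. A minimal standalone Scala sketch of what I mean (not the
actual Spark source; handleFailedTask here is only a stand-in for
org.apache.spark.scheduler.TaskSetManager.handleFailedTask, and the task id
and reason are made up):

// Illustration only: Thread.dumpStack() prints the current call stack to
// stderr, which is how the trace above was obtained inside the failure path.
object DumpStackExample {
  def handleFailedTask(taskId: Long, reason: String): Unit = {
    Thread.dumpStack()  // emit the call stack at the point where the task is marked failed
    System.err.println(s"Task $taskId marked as failed: $reason")
  }

  def main(args: Array[String]): Unit = {
    handleFailedTask(42L, "executor lost")  // hypothetical task id and reason
  }
}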
What could cause this? Is this a setup problem with our cluster or a bug in
Spark?
--
This message was sent by Atlassian JIRA
(v6.2#6252)