[
https://issues.apache.org/jira/browse/SPARK-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-1848.
------------------------------
Resolution: Cannot Reproduce
I think this is at least stale at this point.
> Executors are mysteriously dying when using Spark on Mesos
> ----------------------------------------------------------
>
> Key: SPARK-1848
> URL: https://issues.apache.org/jira/browse/SPARK-1848
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Spark Core
> Affects Versions: 1.0.0
> Environment: Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mesos 0.18.0
> Spark Master
> Reporter: Bouke van der Bijl
>
> Here's a logfile: https://gist.github.com/bouk/b4647e7ba62eb169a40a
> We have 47 machines running Mesos on which we're trying to run Spark jobs,
> but the jobs eventually fail because tasks get rescheduled too often: Spark
> kills the tasks because it believes their executors have died. When I look at
> the stderr or stdout of the Mesos slaves, there seems to be no indication of
> an error, and sometimes I see a "14/05/15 17:38:54 INFO DAGScheduler:
> Ignoring possibly bogus ShuffleMapTask completion from <id>", which suggests
> that the executor just keeps going and hasn't actually died. If I add a
> Thread.dumpStack() at the location where the job is killed, this is the trace
> it returns:
> at java.lang.Thread.dumpStack(Thread.java:1364)
> at org.apache.spark.scheduler.TaskSetManager.handleFailedTask(TaskSetManager.scala:588)
> at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:665)
> at org.apache.spark.scheduler.TaskSetManager$$anonfun$executorLost$9.apply(TaskSetManager.scala:664)
> at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
> at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.scheduler.TaskSetManager.executorLost(TaskSetManager.scala:664)
> at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
> at org.apache.spark.scheduler.Pool$$anonfun$executorLost$1.apply(Pool.scala:87)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.Pool.executorLost(Pool.scala:87)
> at org.apache.spark.scheduler.TaskSchedulerImpl.removeExecutor(TaskSchedulerImpl.scala:412)
> at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:271)
> at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:266)
> at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.statusUpdate(MesosSchedulerBackend.scala:287)
> What could cause this? Is this a setup problem with our cluster or a bug in
> Spark?
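For context, the reporter's instrumentation amounts to dumping the current thread's stack at the point where a task is marked failed. A minimal, self-contained sketch of that technique is below; the object and method names here are hypothetical stand-ins, not Spark's actual TaskSetManager code:

```scala
object StackDumpSketch {
  // Capture the current thread's stack as a string, in the same spirit as the
  // Thread.dumpStack() call the reporter added to TaskSetManager.handleFailedTask
  // (dumpStack itself just prints to System.err and returns nothing).
  def currentStack(): String =
    Thread.currentThread().getStackTrace.map("  at " + _).mkString("\n")

  // Hypothetical stand-in for a task-failure handler: log the stack so that
  // the caller responsible for killing the task can be identified.
  def handleFailedTask(taskId: Long): Unit = {
    System.err.println(s"Task $taskId marked failed; called from:")
    System.err.println(currentStack())
  }

  def main(args: Array[String]): Unit =
    handleFailedTask(42L)
}
```

Capturing the trace as a string (rather than calling Thread.dumpStack() directly) makes it easy to route the diagnostic through a logger instead of raw stderr.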
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]