Hi Martin,

Tim suggested that you pastebin the Mesos logs -- can you share those for the list?
Cheers,
Andrew

On Thu, May 15, 2014 at 5:02 PM, Martin Weindel <martin.wein...@gmail.com> wrote:

> Andrew,
>
> thanks for your response. When using the coarse mode, the jobs run fine.
>
> My problem is the fine-grained mode. Here the parallel jobs nearly always
> end in a deadlock. It seems to have something to do with resource
> allocation, as Mesos shows neither used nor idle CPU resources in this
> state. I do not understand what this means.
> Any ideas on how to analyze this problem are welcome.
>
> Martin
>
> On 13.05.2014 08:48, Andrew Ash wrote:
>
> > Are you setting a core limit with spark.cores.max? If you don't, in
> > coarse mode each Spark job uses all available cores on Mesos and doesn't
> > let them go until the job is terminated, at which point the other job can
> > access the cores.
> >
> > https://spark.apache.org/docs/latest/running-on-mesos.html -- "Mesos Run
> > Modes" section
> >
> > The quick fix should be to set spark.cores.max to half of your cluster's
> > cores to support running two jobs concurrently. Alternatively, switching
> > to fine-grained mode would help here too, at the expense of higher latency
> > on startup.
> >
> >
> > On Mon, May 12, 2014 at 12:37 PM, Martin Weindel <martin.wein...@gmail.com> wrote:
> >
> >> I'm using a current Spark 1.0.0-SNAPSHOT for Hadoop 2.2.0 on Mesos
> >> 0.17.0.
> >>
> >> If I run a single Spark job, the job runs fine on Mesos. Running multiple
> >> Spark jobs also works if I'm using the coarse-grained mode
> >> ("spark.mesos.coarse" = true).
> >>
> >> But if I run two Spark jobs in parallel using the fine-grained mode, the
> >> jobs seem to block each other after a few seconds.
> >> And the Mesos UI reports no idle but also no used CPUs in this state.
> >>
> >> As soon as I kill one job, the other continues normally. See below for
> >> some log output.
> >> It looks to me as if something strange happens with the CPU resources.
> >>
> >> Can anybody give me a hint about the cause? The jobs read some HDFS
> >> files, but have no other communication with external processes.
> >> Or any other suggestions on how to analyze this problem?
> >>
> >> Thanks,
> >>
> >> Martin
> >>
> >> -----
> >> Here is the relevant log output of the driver of job 1:
> >>
> >> INFO 17:53:09,247 Missing parents for Stage 2: List()
> >> INFO 17:53:09,250 Submitting Stage 2 (MapPartitionsRDD[9] at
> >> mapPartitions at HighTemperatureSpansPerLogfile.java:92), which is now
> >> runnable
> >> INFO 17:53:09,269 Submitting 1 missing tasks from Stage 2
> >> (MapPartitionsRDD[9] at mapPartitions at
> >> HighTemperatureSpansPerLogfile.java:92)
> >> INFO 17:53:09,269 Adding task set 2.0 with 1 tasks
> >>
> >> ................................................................................
> >>
> >> *** at this point the job was killed ***
> >>
> >>
> >> Log output of the driver of job 2:
> >>
> >> INFO 17:53:04,874 Missing parents for Stage 6: List()
> >> INFO 17:53:04,875 Submitting Stage 6 (MappedRDD[23] at values at
> >> ComputeLogFileTimespan.java:71), which is now runnable
> >> INFO 17:53:04,881 Submitting 1 missing tasks from Stage 6 (MappedRDD[23]
> >> at values at ComputeLogFileTimespan.java:71)
> >> INFO 17:53:04,882 Adding task set 6.0 with 1 tasks
> >>
> >> ................................................................................
> >>
> >> *** at this point job 1 was killed ***
> >>
> >> INFO 18:01:39,307 Starting task 6.0:0 as TID 7 on executor
> >> 20140501-141732-308511242-5050-2657-1:myclusternode (PROCESS_LOCAL)
> >> INFO 18:01:39,307 Serialized task 6.0:0 as 3052 bytes in 0 ms
> >> INFO 18:01:39,328 Asked to send map output locations for shuffle 2 to
> >> spark@myclusternode:40542
> >> INFO 18:01:39,328 Size of output statuses for shuffle 2 is 178 bytes
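For anyone hitting the same issue, here is a minimal sketch of the core-capping quick fix described above. The class name, master URL, and core count are placeholders; only the spark.mesos.coarse and spark.cores.max properties come from the thread:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CappedJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("CappedJob")                  // hypothetical application name
                    .setMaster("mesos://mesos-master:5050")   // hypothetical Mesos master URL
                    .set("spark.mesos.coarse", "true")        // coarse-grained run mode
                    .set("spark.cores.max", "8");             // e.g. half of a 16-core cluster
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job logic (read HDFS files, transformations, actions) ...
            sc.stop();
        }
    }

With spark.cores.max set to roughly half the cluster's cores, two such jobs can hold executors side by side even in coarse-grained mode instead of one job grabbing every core until it terminates.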