Hi,

I'm having trouble running Spark on Mesos in fine-grained mode. I'm running
Spark 1.0.0 and Mesos 0.18.0. Tasks fail randomly, which most of the time,
but not always, causes the job to fail. The same code runs fine in
coarse-grained mode. I see the following exceptions in the logs of the
Spark driver:

W0617 10:57:36.774382  8735 sched.cpp:901] Attempting to launch task 21
with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
W0617 10:57:36.774433  8735 sched.cpp:901] Attempting to launch task 22
with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
14/06/17 10:57:36 INFO TaskSetManager: Re-queueing tasks for
201311011608-1369465866-5050-9189-46 from TaskSet 0.0
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 22 (task 0.0:2)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 19 (task 0.0:0)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 21 (task 0.0:1)
14/06/17 10:57:36 INFO DAGScheduler: Executor lost:
201311011608-1369465866-5050-9189-46 (epoch 0)
14/06/17 10:57:36 INFO BlockManagerMasterActor: Trying to remove executor
201311011608-1369465866-5050-9189-46 from BlockManagerMaster.
14/06/17 10:57:36 INFO BlockManagerMaster: Removed
201311011608-1369465866-5050-9189-46 successfully in removeExecutor
14/06/17 10:57:36 DEBUG MapOutputTrackerMaster: Increasing epoch to 1
14/06/17 10:57:36 INFO DAGScheduler: Host added was in lost list earlier:
ca1-dcc1-0065.lab.mtl

I don't see any exceptions in the Spark executor logs. The only error
messages I found in Mesos itself are the following warnings in the Mesos
master log:

W0617 10:57:36.816748 26100 master.cpp:1615] Failed to validate task 21 :
Task 21 attempted to use cpus(*):1 combined with already used cpus(*):1;
mem(*):2048 is greater than offered mem(*):3216; disk(*):98304;
ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.819807 26100 master.cpp:1615] Failed to validate task 22 :
Task 22 attempted to use cpus(*):1 combined with already used cpus(*):1;
mem(*):2048 is greater than offered mem(*):3216; disk(*):98304;
ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.932287 26102 master.cpp:1615] Failed to validate task 28 :
Task 28 attempted to use cpus(*):1 combined with already used cpus(*):1;
mem(*):2048 is greater than offered cpus(*):1; mem(*):3216; disk(*):98304;
ports(*):[11900-11960, 11962-11978, 11980-11999]
W0617 11:05:52.783133 26098 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-46 on slave
201311011608-1369465866-5050-9189-46 (ca1-dcc1-0065.lab.mtl)
W0617 11:05:52.787739 26103 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-34 on slave
201311011608-1369465866-5050-9189-34 (ca1-dcc1-0053.lab.mtl)
W0617 11:05:52.790292 26102 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-59 on slave
201311011608-1369465866-5050-9189-59 (ca1-dcc1-0079.lab.mtl)
W0617 11:05:52.800649 26099 master.cpp:2106] Ignoring unknown exited
executor 201311011608-1369465866-5050-9189-18 on slave
201311011608-1369465866-5050-9189-18 (ca1-dcc1-0027.lab.mtl)
... (more of those "Ignoring unknown exited executor")


I compared the execution of the same job in coarse-grained mode and in
fine-grained mode, and I noticed that in fine-grained mode the tasks get
executed on different executors than the ones reported in Spark, as if Spark
and Mesos get out of sync about which executor is responsible for which
task. See the following:


Coarse-grained mode:

Spark                                    | Mesos
Task Index | Task ID | Executor | Status  | Task ID (UI) | Task Name | Task ID (logs) | Executor | State
0          | 0       | 66       | SUCCESS | 4            | "Task 4"  | 0              | 66       | RUNNING
1          | 1       | 59       | SUCCESS | 0            | "Task 0"  | 1              | 59       | RUNNING
2          | 2       | 54       | SUCCESS | 10           | "Task 10" | 2              | 54       | RUNNING
3          | 3       | 128      | SUCCESS | 6            | "Task 6"  | 3              | 128      | RUNNING
...

Fine-grained mode:

Spark                                    | Mesos
Task Index | Task ID | Executor | Status  | Task ID (UI) | Task Name    | Task ID (logs) | Executor | State
0          | 23      | 108      | SUCCESS | 23           | "task 0.0:0" | 23             | 27       | FINISHED
0          | 19      | 65       | FAILED  | 19           | "task 0.0:0" | 19             | 86       | FINISHED
1          | 21      | 65       | FAILED  | Mesos executor was never created
1          | 24      | 92       | SUCCESS | 24           | "task 0.0:1" | 24             | 129      | FINISHED
2          | 22      | 65       | FAILED  | Mesos executor was never created
2          | 25      | 100      | SUCCESS | 25           | "task 0.0:2" | 25             | 84       | FINISHED
3          | 26      | 80       | SUCCESS | 26           | "task 0.0:3" | 26             | 124      | FINISHED
4          | 27      | 65       | FAILED  | 27           | "task 0.0:4" | 27             | 108      | FINISHED
4          | 29      | 92       | SUCCESS | 29           | "task 0.0:4" | 29             | 65       | FINISHED
5          | 28      | 65       | FAILED  | Mesos executor was never created
5          | 30      | 77       | SUCCESS | 30           | "task 0.0:5" | 30             | 62       | FINISHED
6          | 0       | 53       | SUCCESS | 0            | "task 0.0:6" | 0              | 41       | FINISHED
7          | 1       | 77       | SUCCESS | 1            | "task 0.0:7" | 1              | 114      | FINISHED
...


Is it normal for the executor reported by Spark and the one reported by
Mesos to be different when running in fine-grained mode?

Please note that in this particular example the job actually succeeded, but
most of the time it fails after 4 failed attempts of a given task. This job
never fails in coarse-grained mode; every job works in coarse-grained mode
and fails the same way in fine-grained mode.
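
For reference, here is roughly how I switch between the two modes; only
spark.mesos.coarse changes between the runs. This is a minimal sketch: the
master URL, app name, and memory value are placeholders rather than my exact
configuration (memory is shown as 2g only to match the mem(*):2048 in the
master warnings above).

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the job setup; only spark.mesos.coarse differs between runs.
val conf = new SparkConf()
  .setMaster("mesos://master-host:5050")  // placeholder Mesos master URL
  .setAppName("example-job")              // placeholder application name
  .set("spark.executor.memory", "2g")     // placeholder; matches mem(*):2048 above
  .set("spark.mesos.coarse", "false")     // false (the default) = fine-grained: fails
  // .set("spark.mesos.coarse", "true")   // true = coarse-grained: works

val sc = new SparkContext(conf)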

Does anybody have an idea what the problem could be?

Thanks,

- Sebastien
