Hi, I'm having trouble running Spark on Mesos in fine-grained mode. I'm running Spark 1.0.0 and Mesos 0.18.0. Tasks are failing randomly, which most of the time, but not always, causes the job to fail. The same code runs fine in coarse-grained mode. I see the following exceptions in the logs of the Spark driver:
W0617 10:57:36.774382 8735 sched.cpp:901] Attempting to launch task 21 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
W0617 10:57:36.774433 8735 sched.cpp:901] Attempting to launch task 22 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
14/06/17 10:57:36 INFO TaskSetManager: Re-queueing tasks for 201311011608-1369465866-5050-9189-46 from TaskSet 0.0
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 22 (task 0.0:2)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 19 (task 0.0:0)
14/06/17 10:57:36 WARN TaskSetManager: Lost TID 21 (task 0.0:1)
14/06/17 10:57:36 INFO DAGScheduler: Executor lost: 201311011608-1369465866-5050-9189-46 (epoch 0)
14/06/17 10:57:36 INFO BlockManagerMasterActor: Trying to remove executor 201311011608-1369465866-5050-9189-46 from BlockManagerMaster.
14/06/17 10:57:36 INFO BlockManagerMaster: Removed 201311011608-1369465866-5050-9189-46 successfully in removeExecutor
14/06/17 10:57:36 DEBUG MapOutputTrackerMaster: Increasing epoch to 1
14/06/17 10:57:36 INFO DAGScheduler: Host added was in lost list earlier: ca1-dcc1-0065.lab.mtl

I don't see any exceptions in the Spark executor logs.
The only error messages I found in Mesos itself are warnings in the Mesos master:

W0617 10:57:36.816748 26100 master.cpp:1615] Failed to validate task 21 : Task 21 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered mem(*):3216; disk(*):98304; ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.819807 26100 master.cpp:1615] Failed to validate task 22 : Task 22 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered mem(*):3216; disk(*):98304; ports(*):[11900-11919, 11921-11995, 11997-11999]; cpus(*):1
W0617 10:57:36.932287 26102 master.cpp:1615] Failed to validate task 28 : Task 28 attempted to use cpus(*):1 combined with already used cpus(*):1; mem(*):2048 is greater than offered cpus(*):1; mem(*):3216; disk(*):98304; ports(*):[11900-11960, 11962-11978, 11980-11999]
W0617 11:05:52.783133 26098 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-46 on slave 201311011608-1369465866-5050-9189-46 (ca1-dcc1-0065.lab.mtl)
W0617 11:05:52.787739 26103 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-34 on slave 201311011608-1369465866-5050-9189-34 (ca1-dcc1-0053.lab.mtl)
W0617 11:05:52.790292 26102 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-59 on slave 201311011608-1369465866-5050-9189-59 (ca1-dcc1-0079.lab.mtl)
W0617 11:05:52.800649 26099 master.cpp:2106] Ignoring unknown exited executor 201311011608-1369465866-5050-9189-18 on slave 201311011608-1369465866-5050-9189-18 (ca1-dcc1-0027.lab.mtl)
...
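For what it's worth, the "Failed to validate task" warnings above boil down to per-resource arithmetic: the task's request plus what is already in use must fit within the offer. A minimal Python sketch of that check, using the numbers from task 21's warning (my own illustration, not actual Mesos code; `fits_offer` is a hypothetical helper):

```python
def fits_offer(requested, used, offered):
    """Return True if requested + already-used resources fit within the offer."""
    return all(requested.get(r, 0) + used.get(r, 0) <= offered.get(r, 0)
               for r in requested)

# Task 21 asked for 1 cpu and 2048 MB mem, while 1 cpu of the offer was
# already in use and only 1 cpu was offered in total -> validation fails.
print(fits_offer({"cpus": 1, "mem": 2048},
                 {"cpus": 1},
                 {"cpus": 1, "mem": 3216}))  # False: 1 + 1 cpus > 1 cpu offered
```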
(more of those "Ignoring unknown exited executor" warnings)

I analyzed the difference between the execution of the same job in coarse-grained mode and in fine-grained mode, and I noticed that in fine-grained mode the tasks get executed on different executors than the ones reported in Spark, as if Spark and Mesos get out of sync as to which executor is responsible for which task. See the following:

Coarse-grained mode:

  Spark                                    | Mesos
  Task Index  Task ID  Executor  Status    | Task ID (UI)  Task Name  Task ID (logs)  Executor  State
  0           0        66        SUCCESS   | 4             "Task 4"   0               66        RUNNING
  1           1        59        SUCCESS   | 0             "Task 0"   1               59        RUNNING
  2           2        54        SUCCESS   | 10            "Task 10"  2               54        RUNNING
  3           3        128       SUCCESS   | 6             "Task 6"   3               128       RUNNING
  ...

Fine-grained mode:

  Spark                                    | Mesos
  Task Index  Task ID  Executor  Status    | Task ID (UI)  Task Name     Task ID (logs)  Executor  State
  0           23       108       SUCCESS   | 23            "task 0.0:0"  23              27        FINISHED
  0           19       65        FAILED    | 19            "task 0.0:0"  19              86        FINISHED
  1           21       65        FAILED    | (Mesos executor was never created)
  1           24       92        SUCCESS   | 24            "task 0.0:1"  24              129       FINISHED
  2           22       65        FAILED    | (Mesos executor was never created)
  2           25       100       SUCCESS   | 25            "task 0.0:2"  25              84        FINISHED
  3           26       80        SUCCESS   | 26            "task 0.0:3"  26              124       FINISHED
  4           27       65        FAILED    | 27            "task 0.0:4"  27              108       FINISHED
  4           29       92        SUCCESS   | 29            "task 0.0:4"  29              65        FINISHED
  5           28       65        FAILED    | (Mesos executor was never created)
  5           30       77        SUCCESS   | 30            "task 0.0:5"  30              62        FINISHED
  6           0        53        SUCCESS   | 0             "task 0.0:6"  0               41        FINISHED
  7           1        77        SUCCESS   | 1             "task 0.0:7"  1               114       FINISHED
  ...

Is it normal for the executor reported in Spark and the one reported in Mesos to be different when running in fine-grained mode? Please note that in this particular example the job actually succeeded, but most of the time a job fails after 4 failed attempts of a given task. This job never fails in coarse-grained mode. Every job works in coarse-grained mode and fails in the same way in fine-grained mode. Does anybody have an idea what the problem could be?

Thanks,
- Sebastien
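P.S. To make the executor mismatch concrete, here is a small Python sketch (my own illustration) that compares, for each Mesos task ID in the fine-grained run above, the executor Spark reported against the executor Mesos reported:

```python
# Executor reported by Spark vs. executor Mesos actually ran the task on,
# keyed by Mesos task ID (rows taken from the fine-grained table above;
# tasks whose Mesos executor was never created are omitted).
spark_executor = {23: 108, 19: 65, 24: 92, 25: 100, 26: 80,
                  27: 65, 29: 92, 30: 77, 0: 53, 1: 77}
mesos_executor = {23: 27, 19: 86, 24: 129, 25: 84, 26: 124,
                  27: 108, 29: 65, 30: 62, 0: 41, 1: 114}

mismatched = sorted(t for t in spark_executor
                    if spark_executor[t] != mesos_executor[t])
print(mismatched)  # -> [0, 1, 19, 23, 24, 25, 26, 27, 29, 30]
```

Every one of the ten tasks ran on a different executor than the one Spark reported.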