Hi all:
Has anyone run mesos-0.12.0 successfully on a real cluster?
In order to run the Hadoop JobTracker on our real Mesos cluster (3 nodes:
192.168.1.1: master, 192.168.1.2: slave1, 192.168.1.3: slave2), I've uploaded
the Mesos executor bundle hadoop.tar.gz to hdfs://master:54310/, following the
code in Mesos_Scheduler.java.
However, tasks are continuously lost because the executor process exits. From
the code in process_based_isolation_module.cpp and the logs below, it appears
the executor process was forked successfully but exited soon afterwards,
without slave.cpp ever producing the expected "Got registration for executor
'executorId' of framework" line.
In addition, I'm wondering how ExecutorLauncher::fetchExecutors() in
launcher.cpp gets called and executed, so that the executor bundle is fetched
from HDFS according to the config below.
================= mapred-site.xml ===================
<property>
  <name>mapred.mesos.executor</name>
  <!-- <value>hdfs://hdfs.name.node:port/hadoop.zip</value> -->
  <value>hdfs://master:54310/hadoop.tar.gz</value>
</property>
==================== Mesos_Scheduler.java ======================
if (master.equals("local")) {
  try {
    commandInfo = CommandInfo.newBuilder()
        .setEnvironment(envBuilder)
        .setValue(new File("bin/mesos-executor").getCanonicalPath())
        .build();
  } catch (IOException e) {
    LOG.fatal("Failed to find Mesos executor ", e);
    System.exit(1);
  }
} else {
  String uri = conf.get("mapred.mesos.executor");
  commandInfo = CommandInfo.newBuilder()
      .setEnvironment(envBuilder)
      .setValue("cd hadoop && ./bin/mesos-executor")
      .addUris(CommandInfo.URI.newBuilder().setValue(uri))
      .build();
}
===================== slave2 log =====================
I0703 19:44:58.547559 4648 slave.cpp:169] Slave started on 1)@192.168.1.3:50842
I0703 19:44:58.547617 4648 slave.cpp:170] Slave resources: cpus=24; mem=63362;
ports=[31000-32000]; disk=882530
I0703 19:44:58.550875 4648 slave.cpp:378] New master detected at
[email protected]:5050
I0703 19:44:58.552201 4632 slave.cpp:361] Successfully attached file
'/home/mesos/log/mesos-slave.INFO'
I0703 19:44:58.553238 4649 slave.cpp:398] Registered with master; given slave
ID 201307031945-16885952-5050-13298-1
I0703 19:45:58.550350 4629 slave.cpp:1233] Current disk usage 0.36%. Max
allowed age: 6.97days
I0703 19:46:58.552299 4650 slave.cpp:1233] Current disk usage 0.36%. Max
allowed age: 6.97days
I0703 19:47:58.553762 4638 slave.cpp:1233] Current disk usage 0.36%. Max
allowed age: 6.97days
I0703 19:48:16.771438 4647 slave.cpp:475] Got assigned task Task_Tracker_0 for
framework 201307031945-16885952-5050-13298-0000
I0703 19:48:16.774071 4647 paths.hpp:235] Created executor directory
'/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.774608 4647 process_based_isolation_module.cpp:108] Launching
executor_Task_Tracker_0 (cd hadoop && ./bin/mesos-executor) in
/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7
with resources cpus=1; mem=1280 for framework
201307031945-16885952-5050-13298-0000
I0703 19:48:16.775753 4635 slave.cpp:361] Successfully attached file
'/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.776356 4647 process_based_isolation_module.cpp:153] Forked
executor at 5262
I0703 19:48:18.789098 4634 process_based_isolation_module.cpp:344] Telling
slave of lost executor executor_Task_Tracker_0 of framework
201307031945-16885952-5050-13298-0000
I0703 19:48:18.789445 4650 slave.cpp:1053] Executor 'executor_Task_Tracker_0'
of framework 201307031945-16885952-5050-13298-0000 has terminated with signal
Aborted
I0703 19:48:18.792886 4650 slave.cpp:830] Status update: task Task_Tracker_0
of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
I0703 19:48:18.793547 4646 gc.cpp:97] Scheduling
/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7
for removal
I0703 19:48:18.797713 4636 slave.cpp:727] Got acknowledgement of status update
for task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
I0703 19:48:18.789301 4634 process_utils.hpp:64] Stopping … 5262
============== process_based_isolation_module.cpp ==========================
LOG(INFO) << "Forked executor at " << pid;
// ...
// Tell the slave this executor has started.
dispatch(slave, &Slave::executorStarted, frameworkId, executorId, pid);

// ...
void ProcessBasedIsolationModule::processExited(pid_t pid, int status)
// ...
LOG(INFO) << "Telling slave of lost executor " << executorId
          << " of framework " << frameworkId;
dispatch(slave,
         &Slave::executorTerminated,
         frameworkId,
         executorId,
         status,
         false,
         "Executor exited");
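To convince myself that a missing hadoop/ directory would reproduce exactly
this forked-then-lost pattern, I ran the same launch command in an empty
directory (a sketch of my own, not Mesos code; the shell exits as soon as
cd fails, just like the executor above):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class LaunchProbe {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Simulate the slave's launch in an empty sandbox: if the fetcher did
        // not leave a hadoop/ directory, "cd hadoop" fails and the shell
        // exits immediately -- matching "Forked executor" followed at once
        // by "Telling slave of lost executor".
        File sandbox = Files.createTempDirectory("sandbox").toFile();
        Process p = new ProcessBuilder("/bin/sh", "-c",
                "cd hadoop && ./bin/mesos-executor")
                .directory(sandbox)
                .start();
        int status = p.waitFor();
        System.out.println("exit status: " + status);  // nonzero when hadoop/ is missing
    }
}
```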
====================== master log =====================================
I0703 19:48:22.194702 13320 master.hpp:285] Adding task with resources
cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave
201307031945-16885952-5050-13298-1
I0703 19:48:22.194816 13320 master.cpp:1586] Launching task Task_Tracker_0 of
framework 201307031945-16885952-5050-13298-0000 with resources cpus=0.8;
mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave
201307031945-16885952-5050-13298-1 (slave2)
I0703 19:48:22.195260 13320 master.hpp:314] Removing offer with resources
cpus=24; mem=63362; ports=[31000-32000]; disk=882530 on slave
201307031945-16885952-5050-13298-1
I0703 19:48:22.195427 13320 master.cpp:1469] Processing reply for offer
201307031945-16885952-5050-13298-5 on slave 201307031945-16885952-5050-13298-0
(slave1) for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:22.195549 13314 hierarchical_allocator_process.hpp:495] Framework
201307031945-16885952-5050-13298-0000 filtered slave
201307031945-16885952-5050-13298-1 for 5.00secs
I0703 19:48:22.195596 13320 master.hpp:314] Removing offer with resources
cpus=24; mem=63362; ports=[31000-32000]; disk=882382 on slave
201307031945-16885952-5050-13298-0
I0703 19:48:22.195886 13314 hierarchical_allocator_process.hpp:495] Framework
201307031945-16885952-5050-13298-0000 filtered slave
201307031945-16885952-5050-13298-0 for 5.00secs
I0703 19:48:23.180142 13302 hierarchical_allocator_process.hpp:597] Performed
allocation for 2 slaves in 102.71us
I0703 19:48:24.181670 13317 hierarchical_allocator_process.hpp:597] Performed
allocation for 2 slaves in 127.42us
I0703 19:48:24.218245 13321 master.cpp:1022] Status update from
slave(1)@192.168.1.3:50842: task Task_Tracker_0 of framework
201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
Thanks a lot for your help!
Kai