Hi Kai,

Are there any logs in the executor work directory? e.g.,
/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7
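In particular, the slave redirects the forked executor's output into stdout and stderr files in that run directory, and those usually show why "cd hadoop && ./bin/mesos-executor" died before it could register. If you want to grab them programmatically, here is a minimal sketch (assuming the default /tmp/mesos work directory; the slave/framework/run IDs are copied from your log and will differ on every launch):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DumpSandbox {
    public static void main(String[] args) throws IOException {
        // Sandbox path taken from the slave2 log below; the run UUID
        // changes on every launch, so substitute the latest one.
        Path runDir = Paths.get(
            "/tmp/mesos/slaves/201307031945-16885952-5050-13298-1"
            + "/frameworks/201307031945-16885952-5050-13298-0000"
            + "/executors/executor_Task_Tracker_0"
            + "/runs/11a79052-7eb7-4530-8315-153fef22cee7");
        for (String name : new String[] {"stdout", "stderr"}) {
            Path file = runDir.resolve(name);
            System.out.println("==== " + file + " ====");
            if (Files.exists(file)) {
                // Print whatever the forked executor wrote before exiting.
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    System.out.println(line);
                }
            } else {
                // The gc.cpp line in your log schedules this directory for
                // removal, so grab the files before they disappear.
                System.out.println("(not found)");
            }
        }
    }
}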
Also, are you capturing the stderr of the slave?

On Wed, Jul 3, 2013 at 8:58 AM, Kai Wang <[email protected]> wrote:

> Hi all,
>
> Has anyone run mesos-0.12.0 successfully on a real cluster?
>
> In order to run the Hadoop JobTracker on our real Mesos cluster (3 nodes:
> 192.168.1.1: master, 192.168.1.2: slave1, 192.168.1.3: slave2), I've
> uploaded the Mesos executor bundle hadoop.tar.gz to hdfs://master:54310/,
> following the code in Mesos_Scheduler.java.
>
> However, the tasks are continuously lost because the executor process
> exits. From the code in process_based_isolation_module.cpp and the logs
> below, I believe the executor process was forked successfully but exited
> almost immediately, without the expected "Got registration for executor
> 'executorId' of framework" line ever being logged by slave.cpp.
>
> In addition, I'm wondering how ExecutorLauncher::fetchExecutors() in
> launcher.cpp is called and executed, so that the resource is fetched from
> HDFS according to the config below.
>
> ================= mapred-site.xml ===================
> <property>
>   <name>mapred.mesos.executor</name>
>   <!-- <value>hdfs://hdfs.name.node:port/hadoop.zip</value> -->
>   <value>hdfs://master:54310/hadoop.tar.gz</value>
> </property>
>
> ==================== Mesos_Scheduler.java ======================
> 459   if (master.equals("local")) {
>         try {
>           commandInfo = CommandInfo.newBuilder()
>               .setEnvironment(envBuilder)
>               .setValue(new File("bin/mesos-executor").getCanonicalPath())
>               .build();
>         } catch (IOException e) {
>           LOG.fatal("Failed to find Mesos executor ", e);
>           System.exit(1);
>         }
>       } else {
>         String uri = conf.get("mapred.mesos.executor");
>         commandInfo = CommandInfo.newBuilder()
>             .setEnvironment(envBuilder)
>             .setValue("cd hadoop && ./bin/mesos-executor")
>             .addUris(CommandInfo.URI.newBuilder().setValue(uri)).build();
>       }
>
> ===================== slave2 log =======================================
> I0703 19:44:58.547559 4648 slave.cpp:169] Slave started on 1)@192.168.1.3:50842
> I0703 19:44:58.547617 4648 slave.cpp:170] Slave resources: cpus=24; mem=63362; ports=[31000-32000]; disk=882530
> I0703 19:44:58.550875 4648 slave.cpp:378] New master detected at [email protected]:5050
> I0703 19:44:58.552201 4632 slave.cpp:361] Successfully attached file '/home/mesos/log/mesos-slave.INFO'
> I0703 19:44:58.553238 4649 slave.cpp:398] Registered with master; given slave ID 201307031945-16885952-5050-13298-1
> I0703 19:45:58.550350 4629 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
> I0703 19:46:58.552299 4650 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
> I0703 19:47:58.553762 4638 slave.cpp:1233] Current disk usage 0.36%. Max allowed age: 6.97days
> I0703 19:48:16.771438 4647 slave.cpp:475] Got assigned task Task_Tracker_0 for framework 201307031945-16885952-5050-13298-0000
> I0703 19:48:16.774071 4647 paths.hpp:235] Created executor directory '/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
> I0703 19:48:16.774608 4647 process_based_isolation_module.cpp:108] Launching executor_Task_Tracker_0 (cd hadoop && ./bin/mesos-executor) in /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7 with resources cpus=1; mem=1280' for framework 201307031945-16885952-5050-13298-0000
> I0703 19:48:16.775753 4635 slave.cpp:361] Successfully attached file '/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
> I0703 19:48:16.776356 4647 process_based_isolation_module.cpp:153] Forked executor at 5262
> I0703 19:48:18.789098 4634 process_based_isolation_module.cpp:344] Telling slave of lost executor executor_Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
> I0703 19:48:18.789445 4650 slave.cpp:1053] Executor 'executor_Task_Tracker_0' of framework 201307031945-16885952-5050-13298-0000 has terminated with signal Aborted
> I0703 19:48:18.792886 4650 slave.cpp:830] Status update: task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
> I0703 19:48:18.793547 4646 gc.cpp:97] Scheduling /tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7 for removal
> I0703 19:48:18.797713 4636 slave.cpp:727] Got acknowledgement of status update for task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
> I0703 19:48:18.789301 4634 process_utils.hpp:64] Stopping … 5262
>
> ============== process_based_isolation_module.cpp ==========================
> 153   LOG(INFO) << "Forked executor at " << pid;
>       // Tell the slave this executor has started.
> 159   dispatch(slave, &Slave::executorStarted, frameworkId, executorId, pid);
>
> 338   void ProcessBasedIsolationModule::processExited(pid_t pid, int status)
> 344   LOG(INFO) << "Telling slave of lost executor " << executorId
>                 << " of framework " << frameworkId;
> 347   dispatch(slave,
>                &Slave::executorTerminated,
>                frameworkId,
>                executorId,
>                status,
>                false,
>                "Executor exited");
>
> ====================== master log =====================================
> I0703 19:48:22.194702 13320 master.hpp:285] Adding task with resources cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 201307031945-16885952-5050-13298-1
> I0703 19:48:22.194816 13320 master.cpp:1586] Launching task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 with resources cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 201307031945-16885952-5050-13298-1 (slave2)
> I0703 19:48:22.195260 13320 master.hpp:314] Removing offer with resources cpus=24; mem=63362; ports=[31000-32000]; disk=882530 on slave 201307031945-16885952-5050-13298-1
> I0703 19:48:22.195427 13320 master.cpp:1469] Processing reply for offer 201307031945-16885952-5050-13298-5 on slave 201307031945-16885952-5050-13298-0 (slave1) for framework 201307031945-16885952-5050-13298-0000
> I0703 19:48:22.195549 13314 hierarchical_allocator_process.hpp:495] Framework 201307031945-16885952-5050-13298-0000 filtered slave 201307031945-16885952-5050-13298-1 for 5.00secs
> I0703 19:48:22.195596 13320 master.hpp:314] Removing offer with resources cpus=24; mem=63362; ports=[31000-32000]; disk=882382 on slave 201307031945-16885952-5050-13298-0
> I0703 19:48:22.195886 13314 hierarchical_allocator_process.hpp:495] Framework 201307031945-16885952-5050-13298-0000 filtered slave 201307031945-16885952-5050-13298-0 for 5.00secs
> I0703 19:48:23.180142 13302 hierarchical_allocator_process.hpp:597] Performed allocation for 2 slaves in 102.71us
> I0703 19:48:24.181670 13317 hierarchical_allocator_process.hpp:597] Performed allocation for 2 slaves in 127.42us
> I0703 19:48:24.218245 13321 master.cpp:1022] Status update from slave(1)@192.168.1.3:50842: task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
>
> Thanks a lot for your help!
> Kai
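Regarding your ExecutorLauncher::fetchExecutors() question: I haven't traced the 0.12 code line by line, but roughly, the slave forks a launcher before your command ever runs; the launcher walks the CommandInfo URIs, fetches each one (for an hdfs:// URI it shells out to the Hadoop client, so the hadoop binary must work on every slave as the slave's user), extracts recognized archives like .tar.gz into the work directory, and only then executes the command value. If any of those steps fails, the executor dies before it can register, which would match the missing "Got registration" line in your slave log. A rough Java sketch of that sequence, just to make the ordering concrete (the class and helper names are mine, purely illustrative; the real code is C++ in launcher.cpp):

import java.util.Arrays;

public class FetchFlowSketch {
    public static void main(String[] args) throws Exception {
        // The URI from mapred.mesos.executor and the executor work directory.
        String uri = "hdfs://master:54310/hadoop.tar.gz";
        String workDir = System.getProperty("user.dir");

        // 1. hdfs:// URIs are fetched via the Hadoop client, which therefore
        //    has to be runnable on every slave.
        run("hadoop", "fs", "-copyToLocal", uri, workDir + "/hadoop.tar.gz");

        // 2. Recognized archives are unpacked into the work directory.
        run("tar", "xzf", workDir + "/hadoop.tar.gz", "-C", workDir);

        // 3. Only after fetching and extracting does the command value run:
        //    sh -c "cd hadoop && ./bin/mesos-executor"
        //    which assumes the tarball contains a top-level hadoop/ directory.
        run("sh", "-c", "cd hadoop && ./bin/mesos-executor");
    }

    // Run a command, inheriting stdout/stderr; abort on the first failure,
    // which is roughly why a bad fetch kills the executor before it registers.
    static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("failed: " + Arrays.toString(cmd));
        }
    }
}

So two things worth checking: that "hadoop fs -copyToLocal hdfs://master:54310/hadoop.tar.gz ." works on slave1 and slave2 as the user running the slave, and that hadoop.tar.gz actually unpacks to a top-level hadoop/ directory containing bin/mesos-executor; otherwise the "cd hadoop" in your CommandInfo value fails immediately.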
