Hi all:

Is anyone running mesos-0.12.0 successfully on a real cluster?

In order to run the Hadoop JobTracker on our real Mesos cluster (3 nodes:
192.168.1.1: master, 192.168.1.2: slave1, 192.168.1.3: slave2), I've uploaded
the Mesos executor bundle hadoop.tar.gz to hdfs://master:54310/, as expected by
the code in Mesos_Scheduler.java (snippet below).
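
(For completeness, the upload itself is just a plain HDFS copy; the sketch below
is roughly the Java-API equivalent, with a placeholder local path rather than
our actual script.)

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadExecutorBundle {
      public static void main(String[] args) throws Exception {
        // Copy the executor bundle up to the namenode; the local path is
        // only a placeholder for wherever the bundle was built.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:54310"),
                                       new Configuration());
        fs.copyFromLocalFile(new Path("/tmp/hadoop.tar.gz"),
                             new Path("/hadoop.tar.gz"));
        fs.close();
      }
    }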

However, the task is lost over and over because the executor process exits. From
the code in process_based_isolation_module.cpp and the logs below, it looks like
the executor process was forked successfully but exited almost immediately,
without the expected "Got registration for executor 'executorId' of framework"
message from slave.cpp.
In addition, I'm wondering how ExecutorLauncher::fetchExecutors() in
launcher.cpp gets called and executed, so that the executor bundle is actually
pulled from HDFS according to the mapred.mesos.executor setting below. My
current understanding, plus a manual check I can run on a slave, is sketched
right after this paragraph -- please correct me if it's off.
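
If I read launcher.cpp correctly, fetchExecutors() runs inside the forked
executor process before the CommandInfo value is executed: it copies each URI
into the executor's work directory (using the hadoop client for hdfs:// URIs)
and extracts .tar.gz bundles, which is why the command can "cd hadoop". The
sketch below is hypothetical and only mirrors that fetch-and-extract step, so I
can verify the bundle is reachable from a slave node:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchCheck {
      public static void main(String[] args) throws Exception {
        // Copy the bundle from HDFS into the current directory, the way I
        // understand fetchExecutors() handles each CommandInfo URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:54310"),
                                       new Configuration());
        fs.copyToLocalFile(new Path("/hadoop.tar.gz"), new Path("hadoop.tar.gz"));
        fs.close();
        // The launcher then extracts the archive so that "cd hadoop" works;
        // here I simply shell out to tar.
        Process tar = new ProcessBuilder("tar", "xzf", "hadoop.tar.gz")
            .inheritIO().start();
        System.exit(tar.waitFor());
      }
    }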
 
================= mapred-site.xml ===================
<property>
  <name>mapred.mesos.executor</name>
  <!-- <value>hdfs://hdfs.name.node:port/hadoop.zip</value> -->
  <value>hdfs://master:54310/hadoop.tar.gz</value>
</property>
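
(A quick way to double-check what the scheduler will actually see for this
property -- the config path below is just an example, use whatever your
installation points at:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class PrintExecutorUri {
      public static void main(String[] args) {
        // Load mapred-site.xml explicitly and print the executor URI;
        // the path below is only an example location.
        Configuration conf = new Configuration();
        conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));
        System.out.println(conf.get("mapred.mesos.executor"));
      }
    }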

==================== Mesos_Scheduler.java ======================
      if (master.equals("local")) {
        try {
          commandInfo = CommandInfo.newBuilder()
              .setEnvironment(envBuilder)
              .setValue(new File("bin/mesos-executor").getCanonicalPath())
              .build();
        } catch (IOException e) {
          LOG.fatal("Failed to find Mesos executor ", e);
          System.exit(1);
        }
      } else {
        String uri = conf.get("mapred.mesos.executor");
        commandInfo = CommandInfo.newBuilder()
            .setEnvironment(envBuilder)
            .setValue("cd hadoop && ./bin/mesos-executor")
            .addUris(CommandInfo.URI.newBuilder().setValue(uri)).build();
      }
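
As far as I can tell (please correct me if not), this commandInfo is then
attached to the ExecutorInfo for each TaskTracker, which is where the slave gets
the "cd hadoop && ./bin/mesos-executor" command shown in the log below.
Continuing the snippet above as a rough sketch (the executor id is taken from
the log, not from the actual scheduler code):

        ExecutorInfo executorInfo = ExecutorInfo.newBuilder()
            .setExecutorId(ExecutorID.newBuilder().setValue("executor_Task_Tracker_0"))
            .setCommand(commandInfo)
            .build();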

===================== slave2 log =====================
I0703 19:44:58.547559  4648 slave.cpp:169] Slave started on 1)@192.168.1.3:50842
I0703 19:44:58.547617  4648 slave.cpp:170] Slave resources: cpus=24; mem=63362; 
ports=[31000-32000]; disk=882530
I0703 19:44:58.550875  4648 slave.cpp:378] New master detected at 
[email protected]:5050
I0703 19:44:58.552201  4632 slave.cpp:361] Successfully attached file 
'/home/mesos/log/mesos-slave.INFO'
I0703 19:44:58.553238  4649 slave.cpp:398] Registered with master; given slave 
ID 201307031945-16885952-5050-13298-1
I0703 19:45:58.550350  4629 slave.cpp:1233] Current disk usage 0.36%. Max 
allowed age: 6.97days
I0703 19:46:58.552299  4650 slave.cpp:1233] Current disk usage 0.36%. Max 
allowed age: 6.97days
I0703 19:47:58.553762  4638 slave.cpp:1233] Current disk usage 0.36%. Max 
allowed age: 6.97days
I0703 19:48:16.771438  4647 slave.cpp:475] Got assigned task Task_Tracker_0 for 
framework 201307031945-16885952-5050-13298-0000
I0703 19:48:16.774071  4647 paths.hpp:235] Created executor directory 
'/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.774608  4647 process_based_isolation_module.cpp:108] Launching 
executor_Task_Tracker_0 (cd hadoop && ./bin/mesos-executor) in 
/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7
 with resources cpus=1; mem=1280' for framework 
201307031945-16885952-5050-13298-0000
I0703 19:48:16.775753  4635 slave.cpp:361] Successfully attached file 
'/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7'
I0703 19:48:16.776356  4647 process_based_isolation_module.cpp:153] Forked 
executor at 5262
I0703 19:48:18.789098  4634 process_based_isolation_module.cpp:344] Telling 
slave of lost executor executor_Task_Tracker_0 of framework 
201307031945-16885952-5050-13298-0000
I0703 19:48:18.789445  4650 slave.cpp:1053] Executor 'executor_Task_Tracker_0' 
of framework 201307031945-16885952-5050-13298-0000 has terminated with signal 
Aborted
I0703 19:48:18.792886  4650 slave.cpp:830] Status update: task Task_Tracker_0 
of framework 201307031945-16885952-5050-13298-0000 is now in state TASK_LOST
I0703 19:48:18.793547  4646 gc.cpp:97] Scheduling 
/tmp/mesos/slaves/201307031945-16885952-5050-13298-1/frameworks/201307031945-16885952-5050-13298-0000/executors/executor_Task_Tracker_0/runs/11a79052-7eb7-4530-8315-153fef22cee7
 for removal
I0703 19:48:18.797713  4636 slave.cpp:727] Got acknowledgement of status update 
for task Task_Tracker_0 of framework 201307031945-16885952-5050-13298-0000
I0703 19:48:18.789301  4634 process_utils.hpp:64] Stopping … 5262

============== process_based_isolation_module.cpp ==========================
153   LOG(INFO) << "Forked executor at " << pid;
    // Tell the slave this executor has started.
159    dispatch(slave, &Slave::executorStarted, frameworkId, executorId, pid);


338 void ProcessBasedIsolationModule::processExited(pid_t pid, int status)
344        LOG(INFO) << "Telling slave of lost executor " << executorId
                  << " of framework " << frameworkId;

347     dispatch(slave,
                 &Slave::executorTerminated,
                 frameworkId,
                 executorId,
                 status,
                 false,
                 "Executor exited");
                                                            
====================== master log =====================================
I0703 19:48:22.194702 13320 master.hpp:285] Adding task with resources 
cpus=0.8; mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 
201307031945-16885952-5050-13298-1
I0703 19:48:22.194816 13320 master.cpp:1586] Launching task Task_Tracker_0 of 
framework 201307031945-16885952-5050-13298-0000 with resources cpus=0.8; 
mem=2048; disk=4096; ports=[31000-31000, 32000-32000] on slave 
201307031945-16885952-5050-13298-1 (slave2)
I0703 19:48:22.195260 13320 master.hpp:314] Removing offer with resources 
cpus=24; mem=63362; ports=[31000-32000]; disk=882530 on slave 
201307031945-16885952-5050-13298-1
I0703 19:48:22.195427 13320 master.cpp:1469] Processing reply for offer 
201307031945-16885952-5050-13298-5 on slave 201307031945-16885952-5050-13298-0 
(slave1) for framework 201307031945-16885952-5050-13298-0000
I0703 19:48:22.195549 13314 hierarchical_allocator_process.hpp:495] Framework 
201307031945-16885952-5050-13298-0000 filtered slave 
201307031945-16885952-5050-13298-1 for 5.00secs
I0703 19:48:22.195596 13320 master.hpp:314] Removing offer with resources 
cpus=24; mem=63362; ports=[31000-32000]; disk=882382 on slave 
201307031945-16885952-5050-13298-0
I0703 19:48:22.195886 13314 hierarchical_allocator_process.hpp:495] Framework 
201307031945-16885952-5050-13298-0000 filtered slave 
201307031945-16885952-5050-13298-0 for 5.00secs
I0703 19:48:23.180142 13302 hierarchical_allocator_process.hpp:597] Performed 
allocation for 2 slaves in 102.71us
I0703 19:48:24.181670 13317 hierarchical_allocator_process.hpp:597] Performed 
allocation for 2 slaves in 127.42us
I0703 19:48:24.218245 13321 master.cpp:1022] Status update from 
slave(1)@192.168.1.3:50842: task Task_Tracker_0 of framework 
201307031945-16885952-5050-13298-0000 is now in state TASK_LOST

Thanks a lot for your help!
Kai
