[
https://issues.apache.org/jira/browse/MESOS-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinod Kone resolved MESOS-215.
------------------------------
Resolution: Fixed
> In slave, a framework won't be shutdown if no executor in it.
> -------------------------------------------------------------
>
> Key: MESOS-215
> URL: https://issues.apache.org/jira/browse/MESOS-215
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.9.0
> Environment: All platforms.
> Reporter: Jie Yu
> Assignee: Vinod Kone
> Priority: Minor
>
> In the slave, a framework is not shut down if it has no executors. In some
> cases, this can cause the slave to keep resending status updates to the
> master if the user scheduler terminates before the corresponding status
> update acknowledgement is sent.
> void Slave::shutdownFramework(const FrameworkID& frameworkId)
> {
>   LOG(INFO) << "Asked to shut down framework " << frameworkId;
>   Framework* framework = getFramework(frameworkId);
>   if (framework != NULL) {
>     LOG(INFO) << "Shutting down framework " << framework->id;
>     // Shut down all executors of this framework.
>     foreachvalue (Executor* executor, framework->executors) {
>       shutdownExecutor(framework, executor);
>     }
>   }
> }
> If the framework has no executors (e.g. they were killed due to an unexpected
> process exit), shutdownExecutor will never be called. As a result, the
> framework will not be removed from the slave. If the slave then does not
> receive an acknowledgement for a status update (e.g. the user scheduler
> terminates before it is sent), the slave will keep resending the status
> update message to the master.
> Here is the output from my test:
> ======= start of master =======
> jyu@jyu-vm-ubuntu:~/workspace/mesos/build$ sudo src/mesos-master --port=5432
> I0621 17:18:07.984211 31857 logging.cpp:86] Logging to STDERR
> I0621 17:18:07.990334 31857 main.cpp:104] Build: 2012-05-31 09:05:54 by jyu
> I0621 17:18:07.990653 31857 main.cpp:105] Starting Mesos master
> I0621 17:18:07.991225 31872 master.cpp:262] Master started on 127.0.1.1:5432
> I0621 17:18:07.991291 31872 master.cpp:277] Master ID:
> 201206211718-16842879-5432-31857
> I0621 17:18:07.993168 31872 master.cpp:493] Elected as master!
> I0621 17:18:08.011967 31874 webui_utils.cpp:49] Loading webui script at
> '/home/jyu/workspace/mesos/install/share/mesos/webui/master/webui.py'
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8080/
> Use Ctrl-C to quit.
> I0621 17:18:09.480581 31871 master.cpp:858] Attempting to register slave on
> jyu-vm-ubuntu at slave(1)@127.0.1.1:46234
> I0621 17:18:09.480648 31871 master.cpp:1075] Master now considering a slave
> at jyu-vm-ubuntu:46234 as active
> I0621 17:18:09.480692 31871 master.cpp:1611] Adding slave
> 201206211718-16842879-5432-31857-0 at jyu-vm-ubuntu with cpus=1; mem=96
> I0621 17:18:09.481227 31871 simple_allocator.cpp:69] Added slave
> 201206211718-16842879-5432-31857-0 with cpus=1; mem=96
> I0621 17:18:10.850127 31871 master.cpp:536] Registering framework
> 201206211718-16842879-5432-31857-0000 at scheduler(1)@127.0.1.1:39471
> I0621 17:18:10.850338 31871 simple_allocator.cpp:46] Added framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.850414 31871 master.cpp:1166] Sending 1 offers to framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.851227 31871 master.cpp:704] Received reply for offer
> 201206211718-16842879-5432-31857-0
> I0621 17:18:10.851323 31871 master.cpp:1473] Launching task 1 with resources
> mem=32 on slave 201206211718-16842879-5432-31857-0 (jyu-vm-ubuntu)
> I0621 17:18:34.898843 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> I0621 17:18:34.899086 31871 master.cpp:1055] Executor default of framework
> 201206211718-16842879-5432-31857-0000 on slave
> 201206211718-16842879-5432-31857-0 (jyu-vm-ubuntu) exited with status 0
> I0621 17:18:34.902322 31871 master.cpp:435] Framework
> 201206211718-16842879-5432-31857-0000 disconnected
> I0621 17:18:34.902359 31871 master.cpp:444] Giving framework
> 201206211718-16842879-5432-31857-0000 0 seconds to failover
> I0621 17:18:34.902570 31871 master.cpp:1125] Framework failover timeout,
> removing framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.902668 31871 simple_allocator.cpp:59] Removed framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:44.899116 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:18:44.899209 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:54.901684 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:18:54.901762 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:04.904207 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:04.904311 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:14.908376 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:14.908475 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:24.910850 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:24.910948 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:34.914938 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:34.915150 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:44.917757 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:44.917917 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:54.921114 31871 master.cpp:956] Status update from
> slave(1)@127.0.1.1:46234: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:54.921288 31871 master.cpp:994] Status update from
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework
> 201206211718-16842879-5432-31857-0000
> ======= end of master =======
> ======= start of slave =======
> jyu@jyu-vm-ubuntu:~/workspace/mesos/build$ sudo src/mesos-slave
> --master=localhost:5432 --resources="cpus:1;mem:96" --isolation=cgroups
> I0621 17:18:09.466815 31877 logging.cpp:86] Logging to STDERR
> I0621 17:18:09.473896 31877 main.cpp:111] Creating "cgroups" isolation module
> I0621 17:18:09.474149 31877 main.cpp:119] Build: 2012-05-31 09:05:54 by jyu
> I0621 17:18:09.474203 31877 main.cpp:120] Starting Mesos slave
> I0621 17:18:09.476152 31877 slave.cpp:209] Slave started on 1)@127.0.1.1:46234
> I0621 17:18:09.476459 31877 slave.cpp:210] Slave resources: cpus=1; mem=96
> I0621 17:18:09.477195 31877 slave.cpp:376] New master detected at
> master@127.0.1.1:5432
> I0621 17:18:09.481650 31891 slave.cpp:396] Registered with master; given
> slave ID 201206211718-16842879-5432-31857-0
> I0621 17:18:09.496419 31894 webui_utils.cpp:49] Loading webui script at
> '/home/jyu/workspace/mesos/install/share/mesos/webui/slave/webui.py'
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8081/
> Use Ctrl-C to quit.
> I0621 17:18:10.851681 31891 slave.cpp:457] Got assigned task 1 for framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.851780 31891 slave.cpp:1559] Generating a unique work
> directory for executor 'default' of framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.852123 31891 slave.cpp:522] Using
> '/tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0'
> as work directory for executor 'default' of framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.852707 31891 cgroups_isolation_module.cpp:149] Launching
> default (/home/jyu/workspace/mesos/build/src/.libs/balloon-executor) in
> /tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0
> with resources mem=64' for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.853230 31891 cgroups_isolation_module.cpp:323] Changing cgroup
> controls in
> /cgroups/mesos_cgroup_executor_default_framework_201206211718-16842879-5432-31857-0000
> to mem=64
> I0621 17:18:10.853401 31891 cgroups_isolation_module.cpp:339] Write
> cpu.shares = 10
> I0621 17:18:10.853543 31891 cgroups_isolation_module.cpp:353] Write
> memory.limit_in_bytes = 67108864
> I0621 17:18:10.853701 31891 cgroups_isolation_module.cpp:371] Start listen on
> OOM events
> I0621 17:18:10.854008 31891 cgroups_isolation_module.cpp:187] Forked executor
> at = 31913
> I0621 17:18:10.897469 31891 slave.cpp:789] Got registration for executor
> 'default' of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.897694 31892 cgroups_isolation_module.cpp:323] Changing cgroup
> controls in
> /cgroups/mesos_cgroup_executor_default_framework_201206211718-16842879-5432-31857-0000
> to mem=96
> I0621 17:18:10.897886 31891 slave.cpp:847] Flushing queued tasks for
> framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.898143 31892 cgroups_isolation_module.cpp:339] Write
> cpu.shares = 10
> I0621 17:18:10.898398 31892 cgroups_isolation_module.cpp:353] Write
> memory.limit_in_bytes = 100663296
> I0621 17:18:34.892823 31892 cgroups_isolation_module.cpp:389] OOM notifier is
> triggered
> I0621 17:18:34.892909 31892 cgroups_isolation_module.cpp:434] OOM detected in
> executor default of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.892930 31892 cgroups_isolation_module.cpp:229] Killing
> executor default for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.894033 31892 slave.cpp:1383] Executor 'default' of framework
> 201206211718-16842879-5432-31857-0000 has exited with status 0
> I0621 17:18:34.894765 31892 slave.cpp:989] Status update: task 1 of framework
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> I0621 17:18:34.895120 31892 slave.cpp:1507] Scheduling executor directory
> /tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0
> for deletion
> I0621 17:18:34.902997 31891 slave.cpp:625] Asked to shut down framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.903031 31891 slave.cpp:629] Shutting down framework
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:44.897243 31891 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:54.900529 31891 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:04.903036 31892 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:14.905769 31892 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:24.909895 31891 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:34.912976 31891 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:44.916512 31891 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:54.920135 31891 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:20:04.922044 31892 slave.cpp:1083] Resending status update for task
> 1 of framework 201206211718-16842879-5432-31857-0000
> ====== end of slave =======
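
The behaviour described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToySlave, Framework, pendingUpdates, resendTick and the two shutdownFramework variants are hypothetical names introduced here, they do not mirror the real slave code, and the cleanup variant is one possible remedy, not necessarily the change that was committed for this issue.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model only: these types are hypothetical and do not mirror Mesos.
struct Framework {
  std::set<std::string> executors;  // IDs of executors still running.
  int pendingUpdates = 0;           // Status updates awaiting an ack.
};

class ToySlave {
 public:
  std::map<std::string, Framework> frameworks;

  // Same shape as the quoted function: the framework is only cleaned up
  // as a side effect of shutting down one of its executors.
  void shutdownFrameworkAsReported(const std::string& id) {
    auto it = frameworks.find(id);
    if (it == frameworks.end()) return;
    // Iterate over a copy because shutdownExecutor mutates the set.
    const std::set<std::string> executors = it->second.executors;
    for (const std::string& executorId : executors) {
      shutdownExecutor(id, executorId);
    }
  }

  // Hypothetical variant: also remove a framework that has no executors,
  // which drops its unacknowledged status updates as well.
  void shutdownFrameworkWithCleanup(const std::string& id) {
    shutdownFrameworkAsReported(id);
    auto it = frameworks.find(id);
    if (it != frameworks.end() && it->second.executors.empty()) {
      frameworks.erase(it);
    }
  }

  // Number of status updates that one retry tick would resend.
  int resendTick() const {
    int resent = 0;
    for (const auto& entry : frameworks) {
      resent += entry.second.pendingUpdates;
    }
    return resent;
  }

 private:
  // Removes the executor; removes the framework once its last executor
  // is gone.
  void shutdownExecutor(const std::string& frameworkId,
                        const std::string& executorId) {
    Framework& framework = frameworks.at(frameworkId);
    framework.executors.erase(executorId);
    if (framework.executors.empty()) {
      frameworks.erase(frameworkId);
    }
  }
};
```

In this model, a framework with one unacknowledged update and no executors survives shutdownFrameworkAsReported, so every resendTick keeps resending that update, which matches the repeated "Resending status update" lines in the slave log. shutdownFrameworkWithCleanup removes the framework, and resendTick then has nothing left to resend.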