[ 
https://issues.apache.org/jira/browse/MESOS-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone resolved MESOS-215.
------------------------------

    Resolution: Fixed
    
> In slave, a framework won't be shutdown if no executor in it.
> -------------------------------------------------------------
>
>                 Key: MESOS-215
>                 URL: https://issues.apache.org/jira/browse/MESOS-215
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.9.0
>         Environment: All platforms.
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Minor
>
> In the slave, a framework is not shut down if it has no executors. In some 
> cases, this causes the slave to keep resending status updates to the master, 
> for example when the user scheduler terminates before the corresponding 
> status update acknowledgement is sent.
> void Slave::shutdownFramework(const FrameworkID& frameworkId)
> {
>   LOG(INFO) << "Asked to shut down framework " << frameworkId;
>   Framework* framework = getFramework(frameworkId);
>   if (framework != NULL) {
>     LOG(INFO) << "Shutting down framework " << framework->id;
>     // Shut down all executors of this framework.
>     foreachvalue (Executor* executor, framework->executors) {
>       shutdownExecutor(framework, executor);
>     }    
>   }
> }
> If the framework has no executors (e.g. the executor was killed after an 
> unexpected process exit), shutdownExecutor is never executed. As a result, 
> the framework is not removed from the slave. If the slave then never 
> receives an acknowledgment for a status update (e.g. the user scheduler 
> terminates before it is sent), it keeps resending the status update message 
> to the master.
> Here is the output from my test:
> ======= start of master =======
> jyu@jyu-vm-ubuntu:~/workspace/mesos/build$ sudo src/mesos-master --port=5432
> I0621 17:18:07.984211 31857 logging.cpp:86] Logging to STDERR
> I0621 17:18:07.990334 31857 main.cpp:104] Build: 2012-05-31 09:05:54 by jyu
> I0621 17:18:07.990653 31857 main.cpp:105] Starting Mesos master
> I0621 17:18:07.991225 31872 master.cpp:262] Master started on 127.0.1.1:5432
> I0621 17:18:07.991291 31872 master.cpp:277] Master ID: 
> 201206211718-16842879-5432-31857
> I0621 17:18:07.993168 31872 master.cpp:493] Elected as master!
> I0621 17:18:08.011967 31874 webui_utils.cpp:49] Loading webui script at 
> '/home/jyu/workspace/mesos/install/share/mesos/webui/master/webui.py'
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8080/
> Use Ctrl-C to quit.
> I0621 17:18:09.480581 31871 master.cpp:858] Attempting to register slave on 
> jyu-vm-ubuntu at slave(1)@127.0.1.1:46234
> I0621 17:18:09.480648 31871 master.cpp:1075] Master now considering a slave 
> at jyu-vm-ubuntu:46234 as active
> I0621 17:18:09.480692 31871 master.cpp:1611] Adding slave 
> 201206211718-16842879-5432-31857-0 at jyu-vm-ubuntu with cpus=1; mem=96
> I0621 17:18:09.481227 31871 simple_allocator.cpp:69] Added slave 
> 201206211718-16842879-5432-31857-0 with cpus=1; mem=96
> I0621 17:18:10.850127 31871 master.cpp:536] Registering framework 
> 201206211718-16842879-5432-31857-0000 at scheduler(1)@127.0.1.1:39471
> I0621 17:18:10.850338 31871 simple_allocator.cpp:46] Added framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.850414 31871 master.cpp:1166] Sending 1 offers to framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.851227 31871 master.cpp:704] Received reply for offer 
> 201206211718-16842879-5432-31857-0
> I0621 17:18:10.851323 31871 master.cpp:1473] Launching task 1 with resources 
> mem=32 on slave 201206211718-16842879-5432-31857-0 (jyu-vm-ubuntu)
> I0621 17:18:34.898843 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> I0621 17:18:34.899086 31871 master.cpp:1055] Executor default of framework 
> 201206211718-16842879-5432-31857-0000 on slave 
> 201206211718-16842879-5432-31857-0 (jyu-vm-ubuntu) exited with status 0
> I0621 17:18:34.902322 31871 master.cpp:435] Framework 
> 201206211718-16842879-5432-31857-0000 disconnected
> I0621 17:18:34.902359 31871 master.cpp:444] Giving framework 
> 201206211718-16842879-5432-31857-0000 0 seconds to failover
> I0621 17:18:34.902570 31871 master.cpp:1125] Framework failover timeout, 
> removing framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.902668 31871 simple_allocator.cpp:59] Removed framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:44.899116 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:18:44.899209 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:54.901684 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:18:54.901762 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:04.904207 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:04.904311 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:14.908376 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:14.908475 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:24.910850 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:24.910948 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:34.914938 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:34.915150 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:44.917757 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:44.917917 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:19:54.921114 31871 master.cpp:956] Status update from 
> slave(1)@127.0.1.1:46234: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:54.921288 31871 master.cpp:994] Status update from 
> slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 
> 201206211718-16842879-5432-31857-0000
> ======= end of master =======
> ======= start of slave =======
> jyu@jyu-vm-ubuntu:~/workspace/mesos/build$ sudo src/mesos-slave 
> --master=localhost:5432 --resources="cpus:1;mem:96" --isolation=cgroups
> I0621 17:18:09.466815 31877 logging.cpp:86] Logging to STDERR
> I0621 17:18:09.473896 31877 main.cpp:111] Creating "cgroups" isolation module
> I0621 17:18:09.474149 31877 main.cpp:119] Build: 2012-05-31 09:05:54 by jyu
> I0621 17:18:09.474203 31877 main.cpp:120] Starting Mesos slave
> I0621 17:18:09.476152 31877 slave.cpp:209] Slave started on 1)@127.0.1.1:46234
> I0621 17:18:09.476459 31877 slave.cpp:210] Slave resources: cpus=1; mem=96
> I0621 17:18:09.477195 31877 slave.cpp:376] New master detected at 
> master@127.0.1.1:5432
> I0621 17:18:09.481650 31891 slave.cpp:396] Registered with master; given 
> slave ID 201206211718-16842879-5432-31857-0
> I0621 17:18:09.496419 31894 webui_utils.cpp:49] Loading webui script at 
> '/home/jyu/workspace/mesos/install/share/mesos/webui/slave/webui.py'
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8081/
> Use Ctrl-C to quit.
> I0621 17:18:10.851681 31891 slave.cpp:457] Got assigned task 1 for framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.851780 31891 slave.cpp:1559] Generating a unique work 
> directory for executor 'default' of framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.852123 31891 slave.cpp:522] Using 
> '/tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0'
>  as work directory for executor 'default' of framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.852707 31891 cgroups_isolation_module.cpp:149] Launching 
> default (/home/jyu/workspace/mesos/build/src/.libs/balloon-executor) in 
> /tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0
>  with resources mem=64' for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.853230 31891 cgroups_isolation_module.cpp:323] Changing cgroup 
> controls in 
> /cgroups/mesos_cgroup_executor_default_framework_201206211718-16842879-5432-31857-0000
>  to mem=64
> I0621 17:18:10.853401 31891 cgroups_isolation_module.cpp:339] Write 
> cpu.shares = 10
> I0621 17:18:10.853543 31891 cgroups_isolation_module.cpp:353] Write 
> memory.limit_in_bytes = 67108864
> I0621 17:18:10.853701 31891 cgroups_isolation_module.cpp:371] Start listen on 
> OOM events
> I0621 17:18:10.854008 31891 cgroups_isolation_module.cpp:187] Forked executor 
> at = 31913
> I0621 17:18:10.897469 31891 slave.cpp:789] Got registration for executor 
> 'default' of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.897694 31892 cgroups_isolation_module.cpp:323] Changing cgroup 
> controls in 
> /cgroups/mesos_cgroup_executor_default_framework_201206211718-16842879-5432-31857-0000
>  to mem=96
> I0621 17:18:10.897886 31891 slave.cpp:847] Flushing queued tasks for 
> framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.898143 31892 cgroups_isolation_module.cpp:339] Write 
> cpu.shares = 10
> I0621 17:18:10.898398 31892 cgroups_isolation_module.cpp:353] Write 
> memory.limit_in_bytes = 100663296
> I0621 17:18:34.892823 31892 cgroups_isolation_module.cpp:389] OOM notifier is 
> triggered
> I0621 17:18:34.892909 31892 cgroups_isolation_module.cpp:434] OOM detected in 
> executor default of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.892930 31892 cgroups_isolation_module.cpp:229] Killing 
> executor default for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.894033 31892 slave.cpp:1383] Executor 'default' of framework 
> 201206211718-16842879-5432-31857-0000 has exited with status 0
> I0621 17:18:34.894765 31892 slave.cpp:989] Status update: task 1 of framework 
> 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> I0621 17:18:34.895120 31892 slave.cpp:1507] Scheduling executor directory 
> /tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0
>  for deletion
> I0621 17:18:34.902997 31891 slave.cpp:625] Asked to shut down framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.903031 31891 slave.cpp:629] Shutting down framework 
> 201206211718-16842879-5432-31857-0000
> I0621 17:18:44.897243 31891 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:54.900529 31891 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:04.903036 31892 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:14.905769 31892 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:24.909895 31891 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:34.912976 31891 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:44.916512 31891 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:54.920135 31891 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:20:04.922044 31892 slave.cpp:1083] Resending status update for task 
> 1 of framework 201206211718-16842879-5432-31857-0000
> ====== end of slave =======
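
The failure mode in the report can be modelled in a few lines. The following is
a minimal standalone sketch, not the actual slave code and not necessarily the
fix committed for MESOS-215. ToySlave, Framework, shutdownFrameworkBuggy,
shutdownFrameworkFixed and updatesToResend are hypothetical names invented for
illustration, and executor shutdown is treated as synchronous to keep the model
small.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for the slave's per-framework
// state. It does not mirror the real Mesos classes.
struct Framework {
  std::string id;
  std::vector<std::string> executors;       // IDs of executors still running.
  std::vector<std::string> pendingUpdates;  // Unacknowledged status updates.
};

class ToySlave {
public:
  void addFramework(const Framework& framework) {
    frameworks[framework.id] = framework;
  }

  // Models the reported behavior: only the executors are shut down. With no
  // executors the loop body never runs and the framework stays registered.
  void shutdownFrameworkBuggy(const std::string& frameworkId) {
    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) {
      return;
    }
    it->second.executors.clear();
  }

  // One possible remedy (an assumption for illustration): shut down any
  // executors, then always remove the framework and its pending updates.
  void shutdownFrameworkFixed(const std::string& frameworkId) {
    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) {
      return;
    }
    it->second.executors.clear();
    frameworks.erase(it);
  }

  // Number of status updates a periodic retry timer would still resend.
  std::size_t updatesToResend() const {
    std::size_t count = 0;
    for (const auto& entry : frameworks) {
      count += entry.second.pendingUpdates.size();
    }
    return count;
  }

private:
  std::map<std::string, Framework> frameworks;
};
```

With a framework that has no executors and one unacknowledged TASK_LOST
update, updatesToResend() still reports one update after
shutdownFrameworkBuggy, which corresponds to the repeated "Resending status
update" lines in the slave log above. After shutdownFrameworkFixed it reports
zero.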

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
