-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7887/#review13203
-----------------------------------------------------------

Ship it!


src/slave/cgroups_isolation_module.hpp
<https://reviews.apache.org/r/7887/#comment28405>

    It's a bit odd to have:
    killed // whether killExecutor() called
    destroyed // whether destroyed by module

    Maybe rename to something more indicative?
    bool killAttempted; // Have we tried to kill it via killExecutor()?


src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/7887/#comment28407>

    This message comes out a bit rough in the log:
    I1106 01:53:54.852854 61941 cgroups_isolation_module.cpp:689] MEMORY LIMIT: 100663296 bytes
    MEMORY USAGE: 100663296 bytes
    MEMORY STATISTICS:
    cache 245760
    rss 100417536
    mapped_file 24576
    pgpgin 7320
    pgpgout 6250
    inactive_anon 0
    active_anon 1826816
    inactive_file 192512
    active_file 53248
    unevictable 98590720
    hierarchical_memory_limit 100663296
    total_cache 245760
    total_rss 100417536
    total_mapped_file 24576
    total_pgpgin 7320
    total_pgpgout 6250
    total_inactive_anon 0
    total_active_anon 1826816
    total_inactive_file 192512
    total_active_file 53248
    total_unevictable 98590720

    vs having the oom + data in 1 log message + indentation
    I1106 01:53:54.852854 61941 cgroups_isolation_module.cpp:689] OOM detected for executor default of framework 201211060153-2081170186-5432-61885-0000 with tag bf7fc2e7-a9c4-4240-8300-18acb99490dc
      MEMORY LIMIT: 100663296 bytes
      MEMORY USAGE: 100663296 bytes
      MEMORY STATISTICS:
      cache 245760
      rss 100417536
      mapped_file 24576
      pgpgin 7320
      pgpgout 6250
      inactive_anon 0
      active_anon 1826816
      inactive_file 192512
      active_file 53248
      unevictable 98590720
      hierarchical_memory_limit 100663296
      total_cache 245760
      total_rss 100417536
      total_mapped_file 24576
      total_pgpgin 7320
      total_pgpgout 6250
      total_inactive_anon 0
      total_active_anon 1826816
      total_inactive_file 192512
      total_active_file 53248
      total_unevictable 98590720

    Also, for the reason, can you prepend the fact that an OOM happened?
    like:
    I1106 01:54:00.542150 61984 sched.cpp:326] Status update: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
    Task in state TASK_FAILED
    Reason: OOM Detected // <-- Here
    MEMORY LIMIT: 100663296 bytes
    MEMORY USAGE: 100663296 bytes
    MEMORY STATISTICS:


src/slave/slave.cpp
<https://reviews.apache.org/r/7887/#comment28406>

    Just curious, why the check for command executor? More specifically, why is a terminated non-destroyed command executor failed instead of lost?


- Ben Mahler


On Nov. 6, 2012, 8:33 p.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7887/
> -----------------------------------------------------------
>
> (Updated Nov. 6, 2012, 8:33 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Description
> -------
>
> See summary
>
>
> Diffs
> -----
>
>   src/common/protobuf_utils.hpp 77b300d7c1a02a836100d3365e205889c48ae99a
>   src/examples/balloon_framework.cpp e9b60de0c7d3a96381aff37340e0f5ac499850dd
>   src/slave/cgroups_isolation_module.hpp dd4703a1ca584d2347efac95bcdfae9a84544d4a
>   src/slave/cgroups_isolation_module.cpp 3d10ee568b8f194543707374f34f21bd3a927958
>   src/slave/lxc_isolation_module.cpp 36d86e08f7b511371a9a2053ddf43477063a79f1
>   src/slave/process_based_isolation_module.cpp b0b6a81c93acc68d1f4acbdda5ab2f9f96b5fb5a
>   src/slave/slave.hpp be0d7cc239e51636bb07e12c3046e0751a958787
>   src/slave/slave.cpp 2bd2dbce538a6108dd9fe607829cfbdab33e0778
>   src/tests/fault_tolerance_tests.cpp a01d1aef012b636f2ced64d4d2ffabfb6ce42644
>   src/tests/gc_tests.cpp b61b2de621e227f327ce546b62f8dfc528f3894e
>   src/tests/master_tests.cpp d9cd09c5650234351f570f0a035f4b61cd2d00f5
>
> Diff: https://reviews.apache.org/r/7887/diff/
>
>
> Testing
> -------
>
> make check (CentOs)
>
> [vinod@smfd-aki-27-sr1:~/mesos/build] $ sudo GLOG_v=1 ./bin/mesos-tests.sh --gtest_filter="*CgroupsIsolationTest*" --verbose
> ...
> ...
> I1106 01:53:54.852120 61941 cgroups_isolation_module.cpp:617] OOM notifier is triggered for executor default of framework 201211060153-2081170186-5432-61885-0000 with tag bf7fc2e7-a9c4-4240-8300-18acb99490dc
> I1106 01:53:54.852165 61941 cgroups_isolation_module.cpp:662] OOM detected for executor default of framework 201211060153-2081170186-5432-61885-0000 with tag bf7fc2e7-a9c4-4240-8300-18acb99490dc
> I1106 01:53:54.852854 61941 cgroups_isolation_module.cpp:689] MEMORY LIMIT: 100663296 bytes
> MEMORY USAGE: 100663296 bytes
> MEMORY STATISTICS:
> cache 245760
> rss 100417536
> mapped_file 24576
> pgpgin 7320
> pgpgout 6250
> inactive_anon 0
> active_anon 1826816
> inactive_file 192512
> active_file 53248
> unevictable 98590720
> hierarchical_memory_limit 100663296
> total_cache 245760
> total_rss 100417536
> total_mapped_file 24576
> total_pgpgin 7320
> total_pgpgout 6250
> total_inactive_anon 0
> total_active_anon 1826816
> total_inactive_file 192512
> total_active_file 53248
> total_unevictable 98590720
> I1106 01:53:54.852898 61941 cgroups_isolation_module.cpp:408] Killing executor default of framework 201211060153-2081170186-5432-61885-0000
> I1106 01:53:54.855185 61937 cgroups.cpp:1116] Attempting to freeze cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc'
> I1106 01:53:55.536480 61907 hierarchical_allocator_process.hpp:608] No resources available to allocate!
> I1106 01:53:55.536576 61907 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 130.08us
> I1106 01:53:56.537866 61903 hierarchical_allocator_process.hpp:608] No resources available to allocate!
> I1106 01:53:56.537951 61903 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 103.18us
> I1106 01:53:57.538408 61912 hierarchical_allocator_process.hpp:608] No resources available to allocate!
> I1106 01:53:57.538483 61912 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 93.44us
> I1106 01:53:58.539499 61908 hierarchical_allocator_process.hpp:608] No resources available to allocate!
> I1106 01:53:58.539593 61908 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 113.75us
> W1106 01:53:59.532685 61903 master.cpp:79] No whitelist given. Advertising offers for all slaves
> I1106 01:53:59.540832 61912 hierarchical_allocator_process.hpp:608] No resources available to allocate!
> I1106 01:53:59.540907 61912 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 91.56us
> W1106 01:54:00.020642 61941 cgroups.cpp:1201] Unable to freeze cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc' within 51 attempts
> I1106 01:54:00.022102 61937 cgroups.cpp:1131] Attempting to thaw cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc'
> I1106 01:54:00.022274 61937 cgroups.cpp:1237] Successfully thawed cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc'
> I1106 01:54:00.030532 61948 process.cpp:872] Socket closed while receiving
> I1106 01:54:00.129642 61936 cgroups_isolation_module.cpp:705] Successfully destroyed the cgroup mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc
> I1106 01:54:00.539801 61944 cgroups_isolation_module.cpp:468] Telling slave of terminated executor default of framework 201211060153-2081170186-5432-61885-0000
> I1106 01:54:00.539939 61934 slave.cpp:1008] Executor 'default' of framework 201211060153-2081170186-5432-61885-0000 has terminated with signal Killed
> I1106 01:54:00.541018 61934 slave.cpp:833] Status update: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
> I1106 01:54:00.541290 61944 cgroups_isolation_module.cpp:441] Asked to update resources for an unknown/terminated executor
> I1106 01:54:00.541384 61904 hierarchical_allocator_process.hpp:608] No resources available to allocate!
> I1106 01:54:00.541460 61904 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 87.63us
> I1106 01:54:00.541471 61936 gc.cpp:97] Scheduling /tmp/mesos/slaves/201211060153-2081170186-5432-61885-0/frameworks/201211060153-2081170186-5432-61885-0000/executors/default/runs/c842b51d-d962-4b20-a80a-bfe484f6dc95 for removal
> I1106 01:54:00.541610 61907 master.cpp:1024] Status update from slave(1)@10.35.12.124:36146: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
> I1106 01:54:00.541759 61907 master.hpp:288] Removing task with resources mem=32 on slave 201211060153-2081170186-5432-61885-0
> I1106 01:54:00.541872 61907 master.cpp:1125] Executor default of framework 201211060153-2081170186-5432-61885-0000 on slave 201211060153-2081170186-5432-61885-0 (smfd-aki-27-sr1.devel.twitter.com) exited with status 9
> I1106 01:54:00.541872 61912 hierarchical_allocator_process.hpp:491] Recovered mem=32 on slave 201211060153-2081170186-5432-61885-0 from framework 201211060153-2081170186-5432-61885-0000
> I1106 01:54:00.541967 61912 hierarchical_allocator_process.hpp:491] Recovered mem=64 on slave 201211060153-2081170186-5432-61885-0 from framework 201211060153-2081170186-5432-61885-0000
> I1106 01:54:00.542150 61984 sched.cpp:326] Status update: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
> Task in state TASK_FAILED
> Reason: MEMORY LIMIT: 100663296 bytes
> MEMORY USAGE: 100663296 bytes
> MEMORY STATISTICS:
> cache 245760
> rss 100417536
> mapped_file 24576
> pgpgin 7320
> pgpgout 6250
> inactive_anon 0
> active_anon 1826816
> inactive_file 192512
> active_file 53248
> unevictable 98590720
> hierarchical_memory_limit 100663296
> total_cache 245760
> total_rss 100417536
> total_mapped_file 24576
> total_pgpgin 7320
> total_pgpgout 6250
> total_inactive_anon 0
> total_active_anon 1826816
> total_inactive_file 192512
> total_active_file 53248
> total_unevictable 98590720
>
>
> Thanks,
>
> Vinod Kone
>
>
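
For reference, the single-message format suggested in the cgroups_isolation_module.cpp comment could be sketched roughly as follows. This is only an illustration, not the code under review: `formatOomMessage` and its parameters are hypothetical names, the executor tag is left out for brevity, and the two-space indentation is an assumption. The idea is to build the whole text first, so the OOM headline and the memory data go out in one log entry, and the same string can be reused as the task's failure reason so that the reason starts with the OOM notice.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Mesos): builds one multi-line message
// with the OOM headline first and the memory data indented below it.
std::string formatOomMessage(
    const std::string& executorId,
    const std::string& frameworkId,
    uint64_t limitBytes,
    uint64_t usageBytes,
    const std::vector<std::pair<std::string, uint64_t>>& stats)
{
  std::ostringstream out;

  // Headline: states up front that an OOM happened.
  out << "OOM detected for executor " << executorId
      << " of framework " << frameworkId << "\n";

  // Limit and usage, indented under the headline.
  out << "  MEMORY LIMIT: " << limitBytes << " bytes\n"
      << "  MEMORY USAGE: " << usageBytes << " bytes\n"
      << "  MEMORY STATISTICS:\n";

  // One "name value" pair per line, in the order given by the caller.
  for (const auto& stat : stats) {
    out << "  " << stat.first << " " << stat.second << "\n";
  }

  return out.str();
}
```

The returned string could then be passed to a single `LOG(INFO)` call and also attached to the status update message.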
