-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/#review20194
-----------------------------------------------------------

src/master/master.cpp
<https://reviews.apache.org/r/10951/#comment41437>

    Thank you.


src/master/master.cpp
<https://reviews.apache.org/r/10951/#comment41442>

    Wow. This is really a bug. Thanks for catching this!

    I think a better way to do this is to change the foreach loop (#1776)
    to loop through the slave's tasks instead of the framework's tasks
    (which can be huge!). Inside the loop we can check whether the task
    belongs to the framework being removed. Makes sense?

    Also, we always use braces around if/for statements.


- Vinod Kone


On May 6, 2013, 4:59 p.m., Brenden Matthews wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10951/
> -----------------------------------------------------------
>
> (Updated May 6, 2013, 4:59 p.m.)
>
>
> Review request for mesos.
>
>
> Description
> -------
>
> From d5576303ecaaf3c02eba082c8d5b6cf483e36dae Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <[email protected]>
> Date: Mon, 6 May 2013 09:54:03 -0700
> Subject: [PATCH] Terminate correct tasks when a slave disconnects.
>
> Previously, when a slave disconnected all tasks for that framework would
> be removed and it would result in a bad state for a given framework. In
> the case of Hadoop, it would result in a bunch of zombie tasks running
> on the slaves which never terminate.
> ---
> src/master/master.cpp | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>
> Below is a sample of what the Mesos master log looks like:
>
>
> I0506 03:01:21.188874 2639 master.cpp:445] Slave
> 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
> I0506 03:01:21.189184 2639 master.cpp:464] Removing non-checkpointing
> framework 201305040040-4196536586-5050-1124-0000 from disconnected
> slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
> I0506 03:01:21.190471 2639 master.hpp:295] Removing task Task_Tracker_46
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000,
> 32000-32000] on slave 201305040040-4196536586-5050-1124-3
> I0506 03:01:21.190891 2632 hierarchical_allocator_process.hpp:544] Recovered
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave
> 201305040040-4196536586-5050-1124-3 from framework
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.191614 2639 master.hpp:295] Removing task Task_Tracker_154
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000,
> 32000-32000] on slave 201305040040-3141079306-5050-1068-38
> I0506 03:01:21.192049 2634 hierarchical_allocator_process.hpp:544] Recovered
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave
> 201305040040-3141079306-5050-1068-38 from framework
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.192828 2639 master.hpp:295] Removing task Task_Tracker_195
> with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999,
> 31001-31001] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.193270 2640 hierarchical_allocator_process.hpp:544] Recovered
> cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total
> allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave
> 201305040040-3141079306-5050-1068-85 from framework
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.194039 2639 master.hpp:295] Removing task Task_Tracker_182
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000,
> 32000-32000] on slave 201305040040-3141079306-5050-1068-45
> I0506 03:01:21.194425 2638 hierarchical_allocator_process.hpp:544] Recovered
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave
> 201305040040-3141079306-5050-1068-45 from framework
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.195190 2639 master.hpp:295] Removing task Task_Tracker_58
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000,
> 32000-32000] on slave 201305040040-3141079306-5050-1068-76
> I0506 03:01:21.195636 2636 hierarchical_allocator_process.hpp:544] Recovered
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave
> 201305040040-3141079306-5050-1068-76 from framework
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.196455 2639 master.hpp:295] Removing task Task_Tracker_160
> with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000,
> 32000-32000] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.196883 2631 hierarchical_allocator_process.hpp:544] Recovered
> cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total
> allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave
> 201305040040-3141079306-5050-1068-85 from framework
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.197710 2639 master.hpp:295] Removing task Task_Tracker_96
> with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000,
> 32000-32000] on slave 201305040040-3141079306-5050-1068-80
> <...log continues...>
>
>
> Diffs
> -----
>
> src/master/master.cpp 3207157
>
> Diff: https://reviews.apache.org/r/10951/diff/
>
>
> Testing
> -------
>
> Used in production at airbnb.
>
>
> Thanks,
>
> Brenden Matthews
>
>
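
For illustration, the review suggestion above (iterate over the disconnected slave's tasks and keep only those owned by the framework being removed) might look roughly like the sketch below. The `Task` and `Slave` structs and the `tasksToRemove` helper are hypothetical stand-ins invented for this example, not the actual types or functions in src/master/master.cpp, and a plain range-based for loop is used in place of the codebase's foreach macro.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
// These are NOT the real Mesos types.
struct Task {
  std::string id;
  std::string frameworkId;
};

struct Slave {
  std::string id;
  std::vector<Task> tasks;  // Only the tasks running on this slave.
};

// Collects the IDs of tasks on `slave` that belong to `frameworkId`.
// The loop is bounded by the slave's task count rather than by the
// framework's (possibly much larger) cluster-wide task count.
std::vector<std::string> tasksToRemove(
    const Slave& slave,
    const std::string& frameworkId)
{
  std::vector<std::string> result;
  for (const Task& task : slave.tasks) {
    // Skip tasks owned by other frameworks on the same slave.
    if (task.frameworkId == frameworkId) {
      result.push_back(task.id);
    }
  }
  return result;
}
```

Under these assumptions, tasks of the same framework running on other slaves are never visited, so they cannot be removed by mistake, which is the failure the log excerpt shows.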
