> On May 6, 2013, 5:20 p.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 1784-1786
> > <https://reviews.apache.org/r/10951/diff/1/?file=288131#file288131line1784>
> >
> >     Wow. This is really a bug. Thanks for catching this!
> >     
> >     I think a better way to do this is to change the foreach loop (#1776) 
> > to loop through the slave's tasks instead of the framework's tasks (which 
> > can be huge!). Inside the loop we can check whether the task belongs to 
> > the framework being removed. Makes sense?
> >     
> >     Also, we always use braces around if/for statements.
> >
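
For reference, here is a minimal C++ sketch of the loop suggested in the comment above. It is not the actual master.cpp code. The FrameworkID, Task, and Slave structs and the removeFrameworkTasks function are hypothetical stand-ins invented for illustration, and the comparison operators only approximate the `operator !=' utilities that the patch description mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos ID and task types; the real
// definitions are protobuf-generated and are not reproduced here.
struct FrameworkID { std::string value; };

struct Task {
  std::string name;
  FrameworkID frameworkId;
};

struct Slave { std::vector<Task> tasks; };

// Equality helpers in the spirit of the `operator !=' utilities from the
// patch description (the actual signatures may differ).
inline bool operator==(const FrameworkID& left, const FrameworkID& right)
{
  return left.value == right.value;
}

inline bool operator!=(const FrameworkID& left, const FrameworkID& right)
{
  return !(left == right);
}

// Removes from `slave` only the tasks owned by `frameworkId` and returns
// them. The loop walks the slave's tasks, not the framework's, so tasks
// that the same framework runs on other slaves are never touched.
std::vector<Task> removeFrameworkTasks(
    Slave& slave,
    const FrameworkID& frameworkId)
{
  const std::size_t before = slave.tasks.size();
  std::vector<Task> kept;
  std::vector<Task> removed;

  for (const Task& task : slave.tasks) {
    if (task.frameworkId != frameworkId) {
      kept.push_back(task);     // Owned by another framework; leave it.
    } else {
      removed.push_back(task);  // Owned by the framework being removed.
    }
  }

  slave.tasks = kept;

  // Sanity check: every task was either kept or removed.
  assert(slave.tasks.size() + removed.size() == before);
  return removed;
}
```

Because the loop is bounded by the number of tasks on the one disconnected slave, its cost should not grow with the total number of tasks the framework has across the cluster.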

The other ones are blockers too.  This one is actually less of a blocker than 
some of the others, since the MapReduce jobs will still finish.  The one where 
Mesos kills TaskTrackers before jobs finish is a bigger problem (fixed with 
https://reviews.apache.org/r/10920/).


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10951/#review20194
-----------------------------------------------------------


On May 6, 2013, 6:04 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10951/
> -----------------------------------------------------------
> 
> (Updated May 6, 2013, 6:04 p.m.)
> 
> 
> Review request for mesos.
> 
> 
> Description
> -------
> 
> From d01482457f02acc1e19195995db7a14dfc2a89b9 Mon Sep 17 00:00:00 2001
> From: Brenden Matthews <[email protected]>
> Date: Mon, 6 May 2013 09:54:03 -0700
> Subject: [PATCH] Terminate correct tasks when a slave disconnects.
> 
> Previously, when a slave disconnected, all tasks for a framework running
> on it would be removed, including tasks on other slaves, which left the
> framework in a bad state.  In the case of Hadoop, it would result in a
> bunch of zombie tasks running on the slaves which never terminate.
> 
> Added some `operator !=' type utilities.
> ---
>  src/common/type_utils.hpp |   66 +++++++++++++++++++++++++++++++++++++++++++++
>  src/master/master.cpp     |    8 ++++--
>  2 files changed, 72 insertions(+), 2 deletions(-)
> 
> 
> Below is a sample of what the Mesos master log looks like:
> 
> 
> I0506 03:01:21.188874  2639 master.cpp:445] Slave 
> 201305040040-3141079306-5050-1068-21(i-ced4aba2) disconnected
> I0506 03:01:21.189184  2639 master.cpp:464] Removing non-checkpointing 
> framework 201305040040-4196536586-5050-1124-0000 from disconnected 
> slave 201305040040-3141079306-5050-1068-21(i-ced4aba2)
> I0506 03:01:21.190471  2639 master.hpp:295] Removing task Task_Tracker_46 
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 
> 32000-32000] on slave 201305040040-4196536586-5050-1124-3
> I0506 03:01:21.190891  2632 hierarchical_allocator_process.hpp:544] Recovered 
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total 
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=763224) on slave 
> 201305040040-4196536586-5050-1124-3 from framework 
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.191614  2639 master.hpp:295] Removing task Task_Tracker_154 
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 
> 32000-32000] on slave 201305040040-3141079306-5050-1068-38
> I0506 03:01:21.192049  2634 hierarchical_allocator_process.hpp:544] Recovered 
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total 
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761189) on slave 
> 201305040040-3141079306-5050-1068-38 from framework 
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.192828  2639 master.hpp:295] Removing task Task_Tracker_195 
> with resources cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 
> 31001-31001] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.193270  2640 hierarchical_allocator_process.hpp:544] Recovered 
> cpus=6.5; mem=13312; disk=53248; ports=[31999-31999, 31001-31001] (total 
> allocatable: cpus=10; mem=13408.8; ports=[31001-31999]; disk=596893) on slave 
> 201305040040-3141079306-5050-1068-85 from framework 
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.194039  2639 master.hpp:295] Removing task Task_Tracker_182 
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 
> 32000-32000] on slave 201305040040-3141079306-5050-1068-45
> I0506 03:01:21.194425  2638 hierarchical_allocator_process.hpp:544] Recovered 
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total 
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=760196) on slave 
> 201305040040-3141079306-5050-1068-45 from framework 
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.195190  2639 master.hpp:295] Removing task Task_Tracker_58 
> with resources cpus=9; mem=18432; disk=73728; ports=[31000-31000, 
> 32000-32000] on slave 201305040040-3141079306-5050-1068-76
> I0506 03:01:21.195636  2636 hierarchical_allocator_process.hpp:544] Recovered 
> cpus=9; mem=18432; disk=73728; ports=[31000-31000, 32000-32000] (total 
> allocatable: cpus=15; mem=19180.2; ports=[31000-32000]; disk=761175) on slave 
> 201305040040-3141079306-5050-1068-76 from framework 
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.196455  2639 master.hpp:295] Removing task Task_Tracker_160 
> with resources cpus=20; mem=40960; disk=163840; ports=[31000-31000, 
> 32000-32000] on slave 201305040040-3141079306-5050-1068-85
> I0506 03:01:21.196883  2631 hierarchical_allocator_process.hpp:544] Recovered 
> cpus=20; mem=40960; disk=163840; ports=[31000-31000, 32000-32000] (total 
> allocatable: cpus=30; mem=54368.8; ports=[31000-32000]; disk=760733) on slave 
> 201305040040-3141079306-5050-1068-85 from framework 
> 201305040040-4196536586-5050-1124-0000
> I0506 03:01:21.197710  2639 master.hpp:295] Removing task Task_Tracker_96 
> with resources cpus=3.5; mem=7168; disk=28672; ports=[31000-31000, 
> 32000-32000] on slave 201305040040-3141079306-5050-1068-80
> <...log continues...>
> 
> 
> Diffs
> -----
> 
>   src/common/type_utils.hpp 377b65f 
>   src/master/master.cpp 3207157 
> 
> Diff: https://reviews.apache.org/r/10951/diff/
> 
> 
> Testing
> -------
> 
> Used in production at Airbnb.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>
