----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/50705/#review146438 -----------------------------------------------------------
src/master/master.hpp (line 679) <https://reviews.apache.org/r/50705/#comment212873> s/SlaveInfo &/SlaveInfo&/ src/master/master.hpp (line 1698) <https://reviews.apache.org/r/50705/#comment212902> Should this still be called `removed` ? src/master/master.cpp (line 1735) <https://reviews.apache.org/r/50705/#comment212879> s/removal of agent/transition of agent to UNREACHABLE/ src/master/master.cpp (line 1746) <https://reviews.apache.org/r/50705/#comment212878> s/removing it from the registrar/marking it unreachable in the registrar/ src/master/master.cpp (line 1753) <https://reviews.apache.org/r/50705/#comment212892> do you want to remove the code related to "flags.registry_strict == true" (in a separate review)? src/master/master.cpp (line 1769) <https://reviews.apache.org/r/50705/#comment212885> s/remove agent/mark the agent UNREACHABLE/ src/master/master.cpp (line 4815) <https://reviews.apache.org/r/50705/#comment212894> new line. src/master/master.cpp (lines 4882 - 4884) <https://reviews.apache.org/r/50705/#comment212898> this can be done once outside the foreach loop? src/master/master.cpp (lines 4894 - 4896) <https://reviews.apache.org/r/50705/#comment212899> If the slave was removed, `slave->tasks` should be empty? src/master/master.cpp (line 5330) <https://reviews.apache.org/r/50705/#comment212900> "unknown agent << slaveId << unreachable" src/master/master.cpp (line 5347) <https://reviews.apache.org/r/50705/#comment212901> s/UNREACHABLE/TASK_UNREACHABLE/ src/master/master.cpp (lines 5412 - 5413) <https://reviews.apache.org/r/50705/#comment212903> so you put a slave in `removed` here and `markingunreachable` !? previously the slave was only in one of the lists (`registered`, `removed` etc), but not anymore? src/master/master.cpp (line 5428) <https://reviews.apache.org/r/50705/#comment212905> finally you call them "admitted" :) src/master/master.cpp (line 5430) <https://reviews.apache.org/r/50705/#comment212904> TASK_UNREACHABLE. src/master/master.cpp (line 5444) <https://reviews.apache.org/r/50705/#comment212908> s/SlaveInfo &/SlaveInfo& / src/master/master.cpp (line 5446) <https://reviews.apache.org/r/50705/#comment212909> s/unreachableCause/message/ src/master/master.cpp (line 5447) <https://reviews.apache.org/r/50705/#comment212907> s/registrarResult/future/ src/master/master.cpp (line 5460) <https://reviews.apache.org/r/50705/#comment212906> new line. src/master/master.cpp (lines 5467 - 5468) <https://reviews.apache.org/r/50705/#comment212910> Is this for backwards compatibility? If so, can you add a NOTE. Will there be new metrics for unreachability? src/master/master.cpp (line 5470) <https://reviews.apache.org/r/50705/#comment212911> TASK_UNREACHABLE src/tests/master_tests.cpp (line 1764) <https://reviews.apache.org/r/50705/#comment212912> it is not removed from the registry, but marked as unreachable right? src/tests/partition_tests.cpp (line 169) <https://reviews.apache.org/r/50705/#comment212915> This test is huge. Can you split PARTITION_AWARE and non-PARTITION_AWARE into separate tests? src/tests/partition_tests.cpp (line 480) <https://reviews.apache.org/r/50705/#comment212916> what's the guarantee that the slave will not try to re-register again? src/tests/partition_tests.cpp (lines 512 - 513) <https://reviews.apache.org/r/50705/#comment212917> no need for this? src/tests/partition_tests.cpp (lines 520 - 522) <https://reviews.apache.org/r/50705/#comment212919> what does ping timeouts have to do with slave re-registering? src/tests/partition_tests.cpp (line 525) <https://reviews.apache.org/r/50705/#comment212918> no need for this? - Vinod Kone On Aug. 13, 2016, 11:56 p.m., Neil Conway wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/50705/ > ----------------------------------------------------------- > > (Updated Aug. 13, 2016, 11:56 p.m.) > > > Review request for mesos and Vinod Kone. > > > Bugs: MESOS-4049 > https://issues.apache.org/jira/browse/MESOS-4049 > > > Repository: mesos > > > Description > ------- > > The previous behavior was to shutdown partitioned agents that attempt to > reregister---unless the master has failed over, in which case the > reregistration is allowed (when running in "non-strict" mode). > > The new behavior is always to allow partitioned agents to reregister. > This is part of a longer-term project to allow frameworks to define > their own policies for handling tasks running on partitioned agents. > > In particular, if a framework has the PARTITION_AWARE capability, any > tasks running on the partitioned agent will continue to run after > reregistration. If the framework is not PARTITION_AWARE, any tasks that > were running on such an agent will be killed after the agent reregisters > (unless the master has failed over). This is for backward compatibility > with the previous ("non-strict") behavior. Note that regardless of the > PARTITION_AWARE capability, the agent will not be shutdown, which is a > change from the previous Mesos behavior. > > This commit also changes the master so that if an agent is removed and > then the master receives a message from that agent, the master will no > longer attempt to shutdown the agent. This is consistent with the goal > of getting the master out of the business of shutting down agents that > we suspect are unhealthy. Such an agent will eventually realize it is > not registered with the master (e.g., because it won't receive any pings > from the master), which will cause it to reregister. > > > Diffs > ----- > > src/master/master.hpp 6decff6f4b9c3434de030fd5c06df4c683a7abad > src/master/master.cpp 0bd1a3490a86fede86a3f5f62ce4745b65aae258 > src/tests/master_tests.cpp 398164d09b8ef14f916122ed8780023c4a3cd0f6 > src/tests/partition_tests.cpp 0a72b345538ca3b9510fccf38ceb68ac71c2b473 > > Diff: https://reviews.apache.org/r/50705/diff/ > > > Testing > ------- > > make check > > > Thanks, > > Neil Conway > >
