Hi folks,

There are two "safety limits" in place that control the master's agent removal behavior:

(1) "--agent_removal_rate_limit" controls the rate at which agents can be removed from the cluster when they fail health checks.

(2) "--recovery_agent_removal_limit" controls the fraction of agents in the cluster that can be removed if they fail to reregister within "--agent_reregister_timeout" after a master failover. If this fraction is exceeded, the master aborts without removing any agents. If the fraction is *not* exceeded, any agents that have not reregistered will be removed at a rate controlled by "--agent_removal_rate_limit", if one is specified.

In the PARTITION_AWARE world [1], what kind of limits are appropriate? To begin with, let's assume that all frameworks have opted in to the PARTITION_AWARE capability.

(1) I'd argue that we no longer want this rate limit in the master: it should be up to frameworks to decide how to deal with tasks running on unreachable agents. If a given framework wants to use a rate limit in its logic for handling unreachable tasks, that is up to the framework. If we applied a rate limit "upstream" of the frameworks, we would be restricting their ability to define their own partition-handling policies. It would also mean applying the same rate limit to all tasks and all frameworks, which is undesirable.

(2) This seems less clear to me, but I think you can make a case for removing this safety limit as well: in the PARTITION_AWARE world, removing an agent from the cluster just means that frameworks will be notified that the master can't communicate with the agent. If the master fails over and, say, 60% of the agents in the cluster are not reachable after the "--agent_reregister_timeout" expires, you could argue that the master should just propagate that information to frameworks. Typically you'd want an operator to take action in this situation, but operator involvement can/should be triggered via orthogonal means (e.g., monitoring the number of removed agents).

I'm curious to hear what people think about this behavior.
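
As a rough illustration of the current failover behavior described in (2) above, here is a minimal Python sketch. It is not the master's actual implementation: the function and parameter names are hypothetical, the limit is modeled as a plain fraction and the rate limit as a fixed number of seconds between removals, and starting removals immediately once the timeout expires is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RemovalPlan:
    abort: bool                 # True if the master would abort instead of removing agents
    removal_times: List[float]  # seconds after the reregister timeout for each removal


def plan_recovery_removals(
    total_agents: int,
    unreregistered: int,
    recovery_limit: float,                 # hypothetical stand-in for --recovery_agent_removal_limit, in [0, 1]
    seconds_per_removal: Optional[float],  # hypothetical stand-in for --agent_removal_rate_limit; None = unset
) -> RemovalPlan:
    """Decide what happens once the reregister timeout expires after a failover."""
    if total_agents <= 0:
        raise ValueError("total_agents must be positive")

    fraction = unreregistered / total_agents
    if fraction > recovery_limit:
        # Safety limit exceeded: abort without removing any agents.
        return RemovalPlan(abort=True, removal_times=[])

    # Limit not exceeded: remove the missing agents, spaced out by the
    # rate limit if one is configured (assumption: first removal at t=0).
    interval = seconds_per_removal or 0.0
    times = [i * interval for i in range(unreregistered)]
    return RemovalPlan(abort=False, removal_times=times)
```

With 10 agents and a limit of 0.5, six missing agents would trigger the abort path, while two missing agents would be removed one after another at the configured spacing. Under PARTITION_AWARE, both branches are what this thread proposes reconsidering.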

In any case, for Mesos < 2, we can't assume that all frameworks will be PARTITION_AWARE. When an agent is removed and then reregisters, non-PARTITION_AWARE tasks running on the agent will be shut down (but the agent itself will continue running). In principle, it would be nice to limit the rate at which *tasks* are killed: this would mean the rate limit would be ignored by PARTITION_AWARE frameworks, while still having an effect for old-style frameworks. However, this seems fairly complex.

I'm inclined to say that for Mesos 1.1, we should document that PARTITION_AWARE frameworks probably don't want "--agent_removal_rate_limit" to be configured; we can then consider removing "--agent_removal_rate_limit" (and maybe "--recovery_agent_removal_limit") for Mesos 2.0.

Comments welcome.

Neil

[1] https://docs.google.com/document/d/1AYoF5HZPRdQN2TsRpPOliGC6oHen6aHVc0FBOo30rLQ/edit?usp=sharing