It seems to me that these particular flags are not applicable to PARTITION_AWARE frameworks, since no removal occurs for them. Old frameworks still behave as if removal is occurring, so these flags provide backwards compatibility by rate limiting the shutdown of non-PARTITION_AWARE frameworks' tasks.
For safety limits, the equivalent in the PARTITION_AWARE world seems to be: should we limit the rate at which we notify frameworks that agents are unreachable? Should we provide other safety mechanisms? If we expect schedulers that opt in to have their own rate limiting / SLA-aware handling when reacting to agents becoming unreachable, then it seems ok to omit any rate limiting of the notifications. If we want to support schedulers that react poorly, we can add per-framework rate limits for unreachable notifications. Operators could turn these on to deal with specific frameworks that react poorly.

The last thing that comes to mind is that the health checking parameters still play a role here (e.g. 15 second ping timeout / 5 retries). If one were to set these aggressively low, it seems possible for us to trigger false notifications (e.g. we conclude an agent is unreachable because it didn't answer within 10ms, when it actually has a 100ms application-level delay). Since the defaults are pretty conservative, it seems unlikely that this will be a problem.

In situations where the agent is considered unreachable, we won't offer its resources, correct? This would mitigate situations where we think all agents are unreachable due to network or application-level congestion on the master (since the scheduler can't re-schedule anything). Operators can increase the health checking parameters if they find that they are running into issues in production and need a safety hatch.

On Wed, Jul 27, 2016 at 6:29 AM, Neil Conway <neil.con...@gmail.com> wrote:

> Hi folks,
>
> There are two "safety limits" in place that control the master's agent
> removal behavior:
>
> (1) "--agent_removal_rate_limit" controls the rate at which agents can
> be removed from the cluster when they fail health checks.
>
> (2) "--recovery_agent_removal_limit" controls the fraction of agents
> in the cluster that can be removed if they fail to reregister within
> "--agent_reregister_timeout" after a master failover. If this fraction
> is exceeded, the master aborts without removing any agents. If the
> fraction is *not* exceeded, any agents that have not reregistered will
> be removed at a rate controlled by "--agent_removal_rate_limit", if
> one is specified.
>
> In the PARTITION_AWARE world [1], what kind of limits are appropriate?
> To begin with, let's assume that all frameworks have opted in to the
> PARTITION_AWARE capability.
>
> (1) I'd argue that we no longer want this rate limit in the master: it
> should be up to frameworks to decide how to deal with tasks running on
> unreachable agents. If a given framework wants to use a rate-limit in
> their logic for handling unreachable tasks, that is up to them. If we
> applied a rate-limit "upstream" of the frameworks, we are restricting
> their ability to define their own partition-handling policies. This
> also means applying the same rate-limit to all tasks and all
> frameworks, which is undesirable.
>
> (2) This seems less clear to me, but I think you can also make a case
> for removing this safety limit as well: in the PARTITION_AWARE world,
> removing an agent from the cluster just means that frameworks will be
> notified that the master can't communicate with the agent. If the
> master fails over and, say, 60% of the agents in the cluster are not
> reachable after the "--agent_reregister_timeout" expires, you could
> argue that the master should just propagate that information to
> frameworks. Typically you'd want an operator to take action in this
> situation, but operator involvement can/should be triggered via
> orthogonal means (e.g., monitoring the # of removed agents).
>
> I'm curious to hear what people think about this behavior.
>
> In any case, for Mesos < 2, we can't assume that all frameworks will
> be PARTITION_AWARE. When an agent is removed and then reregisters,
> non-PARTITION_AWARE tasks running on the agent will be shutdown (but
> the agent itself will continue running). In principle, it would be
> nice to rate-limit the rate at which *tasks* are killed: this would
> mean the rate-limit would be ignored by PARTITION_AWARE frameworks,
> while still having an effect for old-style frameworks. However, this
> seems fairly complex.
>
> I'm inclined to say that for Mesos 1.1, we should document that
> PARTITION_AWARE frameworks probably don't want
> "--agent_removal_rate_limit" to be configured; we can then consider
> removing "--agent_removal_rate_limit" (and maybe
> "--recovery_agent_removal_limit") for Mesos 2.0.
>
> Comments welcome.
>
> Neil
>
> [1]
> https://docs.google.com/document/d/1AYoF5HZPRdQN2TsRpPOliGC6oHen6aHVc0FBOo30rLQ/edit?usp=sharing
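
For anyone following along, here is a rough Python model of how I read the two existing safety limits from Neil's description above. It is only a sketch, not the master's actual code: `plan_recovery`, `RecoveryPlan`, and the "removals per second" parameter are names I made up for illustration, the real flag's value format is not modeled, and the assumption that the first removal happens immediately once the timeout expires is mine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecoveryPlan:
    # True if the master would abort without removing any agents.
    abort: bool
    # Seconds after the reregistration timeout at which each agent is removed.
    removal_times: List[float]


def plan_recovery(total_agents: int,
                  unregistered: int,
                  recovery_limit: float,
                  removals_per_sec: Optional[float]) -> RecoveryPlan:
    """Hypothetical model of the two limits described in the quoted message."""
    # Limit (2): if too large a fraction of agents failed to reregister,
    # abort rather than remove anything.
    if total_agents > 0 and unregistered / total_agents > recovery_limit:
        return RecoveryPlan(abort=True, removal_times=[])

    # No rate limit configured: all unregistered agents are removed at once.
    if removals_per_sec is None:
        return RecoveryPlan(abort=False, removal_times=[0.0] * unregistered)

    # Limit (1): space the removals out at the configured rate
    # (assumption: the first removal is not delayed).
    interval = 1.0 / removals_per_sec
    times = [i * interval for i in range(unregistered)]
    return RecoveryPlan(abort=False, removal_times=times)
```

With Neil's example of 60% of agents missing after a failover and a limit of, say, 0.5, this model aborts. With 3 missing agents out of 100 and one removal every two seconds, it schedules removals at 0, 2, and 4 seconds. In the PARTITION_AWARE world those "removals" would just be unreachable notifications, which is why the question becomes whether to rate limit the notifications at all.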