Hi folks,

There are two "safety limits" in place that control the master's agent
removal behavior:

(1) "--agent_removal_rate_limit" controls the rate at which agents can
be removed from the cluster when they fail health checks.

(2) "--recovery_agent_removal_limit" controls the fraction of agents
in the cluster that can be removed if they fail to reregister within
"--agent_reregister_timeout" after a master failover. If this fraction
is exceeded, the master aborts without removing any agents. If the
fraction is *not* exceeded, any agents that have not reregistered will
be removed at a rate controlled by "--agent_removal_rate_limit", if
one is specified.
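
To make (2) concrete, here is a minimal Python sketch of the decision
as I've described it above. It is not the master's actual code: the
function and type names are hypothetical, and the limit is assumed to
be given as a fraction between 0 and 1.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryDecision:
    abort: bool
    agents_to_remove: list = field(default_factory=list)


def decide_after_failover(known_agents, reregistered, removal_limit):
    """Model the check made once --agent_reregister_timeout has expired.

    known_agents:  agent IDs the master knew about before the failover
    reregistered:  set of agent IDs that reregistered in time
    removal_limit: fraction in [0, 1] (assumed representation of
                   --recovery_agent_removal_limit)
    """
    missing = [a for a in known_agents if a not in reregistered]
    if known_agents and len(missing) / len(known_agents) > removal_limit:
        # Fraction exceeded: abort without removing any agents.
        return RecoveryDecision(abort=True)
    # Fraction not exceeded: these agents are removed, paced by
    # --agent_removal_rate_limit if one is specified (not modeled here).
    return RecoveryDecision(abort=False, agents_to_remove=missing)
```

With five known agents and a limit of 0.5, three missing agents cause
an abort, while one missing agent is simply queued for removal.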

In the PARTITION_AWARE world [1], what kind of limits are appropriate?
To begin with, let's assume that all frameworks have opted in to the
PARTITION_AWARE capability.

(1) I'd argue that we no longer want this rate limit in the master: it
should be up to frameworks to decide how to deal with tasks running on
unreachable agents. If a given framework wants to use a rate limit in
its logic for handling unreachable tasks, that is up to the framework.
If we apply a rate limit "upstream" of the frameworks, we restrict
their ability to define their own partition-handling policies. It
also means applying the same rate limit to all tasks and all
frameworks, which is undesirable.

(2) This seems less clear to me, but I think you can make a case for
removing this safety limit as well: in the PARTITION_AWARE world,
removing an agent from the cluster just means that frameworks will be
notified that the master can't communicate with the agent. If the
master fails over and, say, 60% of the agents in the cluster are not
reachable after the "--agent_reregister_timeout" expires, you could
argue that the master should just propagate that information to
frameworks. Typically you'd want an operator to take action in this
situation, but operator involvement can/should be triggered via
orthogonal means (e.g., monitoring the # of removed agents).

I'm curious to hear what people think about this behavior.

In any case, for Mesos < 2, we can't assume that all frameworks will
be PARTITION_AWARE. When an agent is removed and then reregisters,
non-PARTITION_AWARE tasks running on the agent will be shut down (but
the agent itself will continue running). In principle, it would be
nice to limit the rate at which *tasks* are killed: this would mean
the rate limit would not apply to PARTITION_AWARE frameworks, while
still having an effect for old-style frameworks. However, this seems
fairly complex.
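
For illustration only, a per-task limit could look roughly like the
token bucket below. This is a sketch of the idea, not a proposed
implementation: the class, its parameters, and the (task_id,
partition_aware) tuple shape are all hypothetical, and it ignores the
hard parts (persistence across failover, per-framework fairness).

```python
import time


class TaskKillLimiter:
    """Token bucket: on average at most `permits` kills per `per_seconds`."""

    def __init__(self, permits, per_seconds, clock=time.monotonic):
        self.capacity = float(permits)
        self.refill_per_sec = permits / per_seconds
        self.tokens = float(permits)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.last) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def select_kills(self, tasks):
        """Return the IDs of tasks that may be killed right now.

        tasks: iterable of (task_id, partition_aware) pairs.
        """
        self._refill()
        kills = []
        for task_id, partition_aware in tasks:
            if partition_aware:
                # Left to the framework; consumes no token.
                continue
            if self.tokens < 1.0:
                break  # budget exhausted; remaining tasks wait
            self.tokens -= 1.0
            kills.append(task_id)
        return kills
```

Tasks that are not selected would have to be retried later, which is
part of the bookkeeping that makes this approach complex.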

I'm inclined to say that for Mesos 1.1, we should document that
PARTITION_AWARE frameworks probably don't want
"--agent_removal_rate_limit" to be configured; we can then consider
removing "--agent_removal_rate_limit" (and maybe
"--recovery_agent_removal_limit") for Mesos 2.0.

Comments welcome.

Neil

[1] https://docs.google.com/document/d/1AYoF5HZPRdQN2TsRpPOliGC6oHen6aHVc0FBOo30rLQ/edit?usp=sharing
