[
https://issues.apache.org/jira/browse/MESOS-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522623#comment-16522623
]
Greg Mann commented on MESOS-8987:
----------------------------------
I agree with [~gkleiman]: since it's quite possible that authorization could
return a {{false}} result due to some operator or tooling error, it makes sense
to me that we would not take the drastic step of killing tasks on an agent in
that case.
One metric that an operator could use to alert that an agent is not
authenticating successfully is the number of registered agents; we could also
consider adding metrics for agent/scheduler authentication failures to expose
this and similar scenarios.
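As a rough illustration of the failure metrics suggested above, here is a minimal sketch of a pair of counters. The class and the metric names are hypothetical and are not part of the existing Mesos metrics; a real change would presumably go through the master's existing metrics machinery instead.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical counters for illustration only; these names are not
// existing Mesos metrics.
class AuthFailureMetrics {
public:
  // Called whenever an agent fails to authenticate.
  void agentAuthenticationFailed() { ++agentFailures_; }

  // Called whenever a scheduler fails to authenticate.
  void schedulerAuthenticationFailed() { ++schedulerFailures_; }

  // Returns a name -> value view that a metrics endpoint could expose.
  std::map<std::string, std::uint64_t> snapshot() const {
    return {
      {"agent_authentication_failures", agentFailures_},
      {"scheduler_authentication_failures", schedulerFailures_},
    };
  }

private:
  std::uint64_t agentFailures_ = 0;
  std::uint64_t schedulerFailures_ = 0;
};
```

An operator could then alert on a rising failure count directly, rather than inferring the problem from a drop in the number of registered agents.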
> Master asks agent to shutdown upon auth errors
> ----------------------------------------------
>
> Key: MESOS-8987
> URL: https://issues.apache.org/jira/browse/MESOS-8987
> Project: Mesos
> Issue Type: Bug
> Components: master, security
> Affects Versions: 1.4.1, 1.5.1, 1.6.0, 1.7.0
> Reporter: Gastón Kleiman
> Assignee: Gastón Kleiman
> Priority: Blocker
> Labels: mesosphere
>
> The Mesos master sends a {{ShutdownMessage}} to an agent if there is an
> [authentication|https://github.com/apache/mesos/blob/d733b1031350e03bce443aa287044eb4eee1053a/src/master/master.cpp#L6532-L6543]
> or an
> [authorization|https://github.com/apache/mesos/blob/d733b1031350e03bce443aa287044eb4eee1053a/src/master/master.cpp#L6622-L6633]
> error during agent registration.
>
> Upon receipt of this message, the agent kills all its tasks and commits
> suicide. This means that transient auth errors can lead to whole agents being
> killed along with their tasks.
> I think the master should stop sending the {{ShutdownMessage}}s in these
> cases, or at least let the agent retry the registration a few times before
> asking it to shut down.
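The retry idea in the description could be sketched roughly as follows. This is not the code in {{master.cpp}}; the tracker class, the enum, and the failure threshold are all assumptions made up for illustration. It only shows the bookkeeping: count consecutive auth failures per agent and escalate to a shutdown once a limit is reached.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical decision the master could take after an auth failure.
enum class AuthFailureAction { AllowRetry, Shutdown };

// Hypothetical helper: tracks consecutive registration auth failures
// per agent. Not part of Mesos.
class RegistrationFailureTracker {
public:
  explicit RegistrationFailureTracker(std::size_t maxFailures)
    : maxFailures_(maxFailures) {}

  // Records one failure and reports whether the agent may retry or
  // whether the failure limit has been reached.
  AuthFailureAction recordFailure(const std::string& agentId) {
    std::size_t& count = failures_[agentId];
    ++count;
    return count >= maxFailures_ ? AuthFailureAction::Shutdown
                                 : AuthFailureAction::AllowRetry;
  }

  // A successful registration clears the agent's failure history.
  void recordSuccess(const std::string& agentId) {
    failures_.erase(agentId);
  }

private:
  std::size_t maxFailures_;
  std::unordered_map<std::string, std::size_t> failures_;
};
```

With a limit of 3, the first two failures for an agent would yield {{AllowRetry}} and the third {{Shutdown}}; a transient error followed by a successful registration resets the count.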
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)