[ 
https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796150#comment-15796150
 ] 

Neil Conway commented on MESOS-6286:
------------------------------------

This problem has also been observed when the agent is stuck in recovery for an 
extended period of time, e.g., because a container has gotten into some weird 
state.

The simplest fix here might be to change the agent so that it does not respond 
to {{PingSlaveMessage}} if the agent is not in the {{RUNNING}} state. With that 
change, an agent that is stuck in recovery indefinitely will eventually fail 
health checks; the framework will then receive {{TASK_LOST}} / 
{{TASK_UNREACHABLE}} status updates for any tasks on the agent, and can decide 
if/when to relaunch that work elsewhere. If the agent later finishes recovery, 
it will be allowed to re-register. As usual, non-partition-aware tasks on 
the agent will be terminated and partition-aware tasks will be allowed to keep 
running. Any failed containers will be reported as terminal to the framework.
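
To make the proposed behavior concrete, here is a minimal toy model in Python (not Mesos code). The class names, the {{handle_ping}} method, and the {{max_missed_pings}} threshold are all hypothetical stand-ins; the real master health check is configured separately and may behave differently in detail.

```python
from enum import Enum, auto


class AgentState(Enum):
    # Illustrative states only; RECOVERING and RUNNING are the two that matter here.
    RECOVERING = auto()
    RUNNING = auto()


class Agent:
    """Toy agent that answers pings only while RUNNING."""

    def __init__(self, state=AgentState.RECOVERING):
        self.state = state

    def handle_ping(self):
        # Proposed change: stay silent unless recovery has finished.
        return self.state is AgentState.RUNNING


class MasterHealthChecker:
    """Toy master-side checker that counts consecutive missed pings."""

    def __init__(self, max_missed_pings=5):
        # Hypothetical threshold, chosen only for illustration.
        self.max_missed_pings = max_missed_pings
        self.missed = 0
        self.unreachable = False

    def ping(self, agent):
        if agent.handle_ping():
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed_pings:
                # At this point the framework would be told that the
                # agent's tasks are lost or unreachable.
                self.unreachable = True
        return self.unreachable


def pings_until_unreachable(agent, checker, limit=100):
    """Return how many pings it takes to mark the agent unreachable, or None."""
    for n in range(1, limit + 1):
        if checker.ping(agent):
            return n
    return None
```

In this model, an agent that never leaves {{RECOVERING}} is marked unreachable once the threshold of consecutive missed pings is hit, while an agent in {{RUNNING}} is never marked unreachable.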

There should probably also be a mechanism to detect situations in which the 
agent fails to start up for an extended period, so that the operator can 
investigate the state of the agent. But that seems orthogonal to this issue.
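
One possible shape for such a detector, again as a hypothetical Python sketch and not an existing Mesos facility: track how long the agent has gone without reaching {{RUNNING}}, across restarts, and flag it once a threshold is exceeded. The class name, the string states, and the 600-second default are assumptions made for illustration.

```python
class StartupStallDetector:
    """Flags an agent that has not reached RUNNING within a threshold.

    Timestamps are passed in by the caller (seconds), so the detector
    keeps counting across agent restarts as long as it lives outside
    the agent process.
    """

    def __init__(self, threshold_secs=600.0):
        # Hypothetical default; an operator would tune this.
        self.threshold_secs = threshold_secs
        self.not_running_since = None

    def observe(self, state, now):
        """Record an observed state; return True if the agent looks stalled."""
        if state == "RUNNING":
            # Startup completed, so reset the clock.
            self.not_running_since = None
            return False
        if self.not_running_since is None:
            self.not_running_since = now
        return (now - self.not_running_since) >= self.threshold_secs
```

Because the clock resets only when {{RUNNING}} is observed, a crash loop in which the agent is killed during every recovery attempt would still trip the detector.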

> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet.  The agent needs to 
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is 
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes.  Repeat 
> (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks 
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to 
> register.
> We should consider adding some timeout/intervention in the master for 
> responsive, but non-recoverable agents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
