[ https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway updated MESOS-6286:
-------------------------------
    Description: 
As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
agent would do the following in a loop:
# Systemd starts the agent.
# The agent detects the master, but does not connect yet.  The agent needs to 
recover first.
# The agent responds to {{PingSlaveMessage}} from the master, but it is stalled 
in recovery.
# The agent is OOM-killed by the kernel before recovery finishes.  Repeat (1-4).

The consequences of this:
* Frameworks will never get a TASK_LOST or terminal status update for tasks on 
this agent.
* Executors on the agent can connect to the agent, but will not be able to 
register.

We should consider adding a timeout or other intervention in the master for 
agents that are responsive but never finish recovery.

  was:
As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
agent would do the following in a loop:
1) Systemd starts the agent.
2) The agent detects the master, but does not connect yet.  The agent needs to 
recover first.
3) The agent is responsive to {{PingSlaveMessage}}s from the master.  But is 
stalled in recovery.
4) The agent is OOM-killed by the kernel before recovery finishes.  Repeat 
(1-4).

The consequences of this:
* Frameworks will never get a TASK_LOST or terminal status update for tasks on 
this agent.
* Executors on the agent can connect to the agent, but will not be able to 
register.

We should consider adding some timeout/intervention in the master for 
responsive, but non-recoverable agents.


> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet.  The agent needs to 
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is 
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes.  Repeat 
> (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks 
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to 
> register.
> We should consider adding a timeout or other intervention in the master for 
> agents that are responsive but never finish recovery.
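
One way to picture the timeout suggested above is the sketch below. It is only an illustration, not Mesos code: the class name, its methods, and the timeout parameter are all hypothetical, and it assumes the master can tell, for each ping reply, whether the replying agent is currently registered. The point it shows is that the clock starts at the first unregistered reply and is not reset when the agent restarts, so an agent caught in the OOM-kill loop would still time out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UnregisteredAgentTracker:
    """Hypothetical master-side bookkeeping; not part of Mesos."""

    # Assumed knob: how long an agent may answer pings without registering.
    reregister_timeout_secs: float
    # Agent id -> time the master first saw a ping reply while unregistered.
    first_seen: Dict[str, float] = field(default_factory=dict)

    def on_pong(self, agent_id: str, registered: bool, now: float) -> None:
        """Record a ping reply from an agent."""
        if registered:
            # The agent finished recovery and registered; stop the clock.
            self.first_seen.pop(agent_id, None)
        else:
            # Keep the earliest timestamp so agent restarts do not reset it.
            self.first_seen.setdefault(agent_id, now)

    def expired(self, now: float) -> List[str]:
        """Return agents that have been responsive but unregistered too long."""
        return sorted(
            agent_id
            for agent_id, since in self.first_seen.items()
            if now - since >= self.reregister_timeout_secs
        )
```

In this sketch the master would call `on_pong` for every ping reply and periodically act on `expired(now)`, for example by removing those agents so that frameworks receive a terminal status update for their tasks. What the actual intervention should be is left open here, as it is in the description.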



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
