[ https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Harutyunyan updated MESOS-6286:
-------------------------------------
    Sprint: Mesosphere Sprint 48, Mesosphere Sprint 49  (was: Mesosphere Sprint 48)

> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>            Priority: Blocker
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet.  The agent needs to 
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is 
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes.  Repeat 
> (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks 
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to 
> register.
> We should consider adding a timeout or other intervention in the master for 
> agents that are responsive but never finish recovery.
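
One way to picture the timeout suggested above is a small watchdog on the master side. This is only a sketch under stated assumptions, not Mesos code: the names RegistrationWatchdog, on_pong, on_registered, expired and timeout_secs are all hypothetical, and it is written in Python for brevity even though Mesos itself is C++. The idea is that the clock starts at the first ping reply from an agent and is not reset when the agent restarts, so an agent caught in the loop described above eventually expires.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentState:
    first_pong: float         # when the master first saw a ping reply
    registered: bool = False  # set once the agent (re)registers


@dataclass
class RegistrationWatchdog:
    """Hypothetical master-side tracker; not an actual Mesos class."""
    timeout_secs: float  # assumed knob, not a real Mesos flag
    agents: Dict[str, AgentState] = field(default_factory=dict)

    def on_pong(self, agent_id: str, now: float) -> None:
        # Record only the first reply, so agent restarts do not reset the clock.
        self.agents.setdefault(agent_id, AgentState(first_pong=now))

    def on_registered(self, agent_id: str) -> None:
        # A registered agent is healthy and never expires here.
        if agent_id in self.agents:
            self.agents[agent_id].registered = True

    def expired(self, now: float) -> List[str]:
        # Agents that answer pings but have stayed unregistered too long.
        return sorted(
            agent_id
            for agent_id, state in self.agents.items()
            if not state.registered
            and now - state.first_pong >= self.timeout_secs
        )
```

What the master does with an expired agent (for example, removing it so that frameworks receive TASK_LOST) is left to the caller; the sketch only identifies candidates.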



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
