[ 
https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6286:
-------------------------------
            Shepherd: Vinod Kone
              Sprint: Mesosphere Sprint 48
        Story Points: 5
    Target Version/s: 1.2.0
            Priority: Blocker  (was: Major)

> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>            Priority: Blocker
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet.  The agent needs to 
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is 
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes, and the 
> cycle repeats from step 1.
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks 
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to 
> register.
> We should consider adding a timeout or other intervention in the master for 
> agents that are responsive but never finish recovery.
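
A minimal sketch of the kind of master-side timeout suggested above, written in Python for brevity rather than as Mesos code. All names here (`UnregisteredAgentWatchdog`, `on_pong`, `on_registered`, `expired`, `timeout_secs`) are hypothetical and do not correspond to anything in the Mesos code base. The sketch assumes the master can tell that an agent has not (re)registered. The key point is that a ping response from an unregistered agent starts a clock but never resets it; only a completed registration clears it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AgentState:
    registered: bool = False
    # Time of the first ping response seen while the agent was unregistered.
    unregistered_since: Optional[float] = None


@dataclass
class UnregisteredAgentWatchdog:
    """Hypothetical tracker for agents that answer pings but never register."""

    timeout_secs: float
    agents: Dict[str, AgentState] = field(default_factory=dict)

    def on_pong(self, agent_id: str, now: float) -> None:
        state = self.agents.setdefault(agent_id, AgentState())
        # Responsiveness alone must not reset the clock, so only the
        # first pong from an unregistered agent records a timestamp.
        if not state.registered and state.unregistered_since is None:
            state.unregistered_since = now

    def on_registered(self, agent_id: str) -> None:
        state = self.agents.setdefault(agent_id, AgentState())
        state.registered = True
        state.unregistered_since = None

    def expired(self, now: float) -> List[str]:
        """Return agents that have been responsive but unregistered too long."""
        return sorted(
            agent_id
            for agent_id, state in self.agents.items()
            if state.unregistered_since is not None
            and now - state.unregistered_since >= self.timeout_secs
        )


if __name__ == "__main__":
    watchdog = UnregisteredAgentWatchdog(timeout_secs=600)
    watchdog.on_pong("agent-1", now=0)
    watchdog.on_pong("agent-1", now=300)  # still stuck in recovery
    print(watchdog.expired(now=600))
```

In this sketch the master would periodically call `expired()` and treat the returned agents as removed, which is what would let frameworks receive a TASK_LOST or other terminal update for their tasks. How removal and the resulting status updates are actually carried out is outside the scope of the sketch.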



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
