[
https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Artem Harutyunyan updated MESOS-6286:
-------------------------------------
Sprint: Mesosphere Sprint 48, Mesosphere Sprint 49 (was: Mesosphere Sprint
48)
> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
> Key: MESOS-6286
> URL: https://issues.apache.org/jira/browse/MESOS-6286
> Project: Mesos
> Issue Type: Bug
> Reporter: Joseph Wu
> Assignee: Neil Conway
> Priority: Blocker
> Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase. The
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet. The agent needs to
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes. Repeat
> (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to
> register.
> We should consider adding some timeout/intervention in the master for
> responsive, but non-recoverable agents.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)