[
https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812656#comment-15812656
]
Neil Conway commented on MESOS-6286:
------------------------------------
https://reviews.apache.org/r/55354/
> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
> Key: MESOS-6286
> URL: https://issues.apache.org/jira/browse/MESOS-6286
> Project: Mesos
> Issue Type: Bug
> Reporter: Joseph Wu
> Assignee: Neil Conway
> Priority: Blocker
> Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase. The
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet. The agent needs to
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes. Repeat
> (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to
> register.
> We should consider adding some timeout/intervention in the master for
> responsive, but non-recoverable agents.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)