[ https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil Conway updated MESOS-6286:
-------------------------------
    Description:
As part of MESOS-6285, we observed an agent stuck in the recovery phase. The agent would do the following in a loop:

# Systemd starts the agent.
# The agent detects the master, but does not connect yet. The agent needs to recover first.
# The agent responds to {{PingSlaveMessage}} from the master, but it is stalled in recovery.
# The agent is OOM-killed by the kernel before recovery finishes.

Repeat (1-4).

The consequences of this:
* Frameworks will never get a TASK_LOST or terminal status update for tasks on this agent.
* Executors on the agent can connect to the agent, but will not be able to register.

We should consider adding some timeout/intervention in the master for responsive, but non-recoverable agents.

  was:
As part of MESOS-6285, we observed an agent stuck in the recovery phase. The agent would do the following in a loop:

1) Systemd starts the agent.
2) The agent detects the master, but does not connect yet. The agent needs to recover first.
3) The agent is responsive to {{PingSlaveMessage}}s from the master. But is stalled in recovery.
4) The agent is OOM-killed by the kernel before recovery finishes.

Repeat (1-4).

The consequences of this:
* Frameworks will never get a TASK_LOST or terminal status update for tasks on this agent.
* Executors on the agent can connect to the agent, but will not be able to register.

We should consider adding some timeout/intervention in the master for responsive, but non-recoverable agents.


> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase. The
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet. The agent needs to
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes.
> Repeat (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to
> register.
> We should consider adding some timeout/intervention in the master for
> responsive, but non-recoverable agents.
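
For illustration only, here is one way the proposed master-side timeout could work, as a minimal Python sketch. It is not Mesos code: the class name, method names, and timeout value are all hypothetical, and the real master is written in C++. The idea is that the master starts a clock the first time an unregistered agent answers a ping and does not reset it when the agent restarts, so an agent caught in the OOM-kill loop described above would eventually be flagged.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AgentState:
    # Hypothetical per-agent bookkeeping kept by the master.
    registered: bool = False
    last_pong: Optional[float] = None
    # When the master first saw this agent answer a ping while unregistered.
    unregistered_since: Optional[float] = None


class ReregistrationWatchdog:
    """Flags agents that answer pings but never (re-)register in time."""

    def __init__(self, reregister_timeout: float) -> None:
        self.reregister_timeout = reregister_timeout
        self.agents: Dict[str, AgentState] = {}

    def on_pong(self, agent_id: str, now: float) -> None:
        # The agent is responsive. Start the clock only once, so that
        # repeated agent restarts do not reset it.
        state = self.agents.setdefault(agent_id, AgentState())
        state.last_pong = now
        if not state.registered and state.unregistered_since is None:
            state.unregistered_since = now

    def on_registered(self, agent_id: str) -> None:
        # Recovery finished and the agent registered: stop the clock.
        state = self.agents.setdefault(agent_id, AgentState())
        state.registered = True
        state.unregistered_since = None

    def expired(self, now: float) -> List[str]:
        # Agents that have been responsive but unregistered for too long.
        return sorted(
            agent_id
            for agent_id, state in self.agents.items()
            if not state.registered
            and state.unregistered_since is not None
            and now - state.unregistered_since >= self.reregister_timeout
        )
```

For each agent ID returned by `expired()`, the master could then apply whatever intervention is chosen, for example treating the agent as lost so that frameworks receive a terminal status update for its tasks.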