[ 
https://issues.apache.org/jira/browse/MESOS-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796343#comment-15796343
 ] 

Neil Conway commented on MESOS-6286:
------------------------------------

After talking with [~vinodkone]: the scheme described above is simple, but it 
has a drawback. If recovering the agent state takes longer than the master's 
health check timeout (75 seconds by default), the agent will be marked LOST. 
Extended agent recovery times can be more common than you'd think because this 
code path hasn't been optimized: e.g., the agent re-reads the metadata for all 
executors (including completed ones), of which there might be thousands.

An alternative would be to reuse the {{agent_reregister_timeout}} flag. If the 
master notices that the master -> agent socket has broken, it would start a 
timer with duration {{agent_reregister_timeout}}; if the timer expires before 
the agent has re-registered, the master would mark the agent lost/unreachable. 
That would give the agent 10 minutes (by default) to finish recovery before 
being marked unreachable.
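
To make the difference between the two deadlines concrete, here is a minimal 
sketch of the timer logic in Python. It is not the Mesos implementation: the 
{{AgentRecord}} class, the function names, and the constant names are all 
hypothetical, and for simplicity both schemes are modeled as a single deadline 
measured from the moment the master notices the broken socket. Only the two 
default durations (75 seconds and 10 minutes) come from the discussion above.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults quoted in the comment above; the constant names are hypothetical.
HEALTH_CHECK_TIMEOUT_SECS = 75.0
AGENT_REREGISTER_TIMEOUT_SECS = 10 * 60.0


@dataclass
class AgentRecord:
    """Hypothetical master-side bookkeeping for a single agent."""
    socket_broken_at: Optional[float] = None  # time the broken socket was noticed
    reregistered: bool = False


def on_socket_broken(agent: AgentRecord, now: float) -> None:
    """Start the timer the first time the master -> agent socket breaks."""
    if agent.socket_broken_at is None:
        agent.socket_broken_at = now
    agent.reregistered = False


def on_reregistered(agent: AgentRecord) -> None:
    """Cancel the timer once the agent has re-registered."""
    agent.reregistered = True
    agent.socket_broken_at = None


def should_mark_unreachable(
    agent: AgentRecord,
    now: float,
    timeout: float = AGENT_REREGISTER_TIMEOUT_SECS,
) -> bool:
    """Return True if the timer expired before the agent re-registered."""
    if agent.reregistered or agent.socket_broken_at is None:
        return False
    return now - agent.socket_broken_at >= timeout


if __name__ == "__main__":
    agent = AgentRecord()
    on_socket_broken(agent, now=0.0)

    # Suppose recovery is still running 120 seconds after the socket broke.
    print(should_mark_unreachable(agent, 120.0, HEALTH_CHECK_TIMEOUT_SECS))
    print(should_mark_unreachable(agent, 120.0))
```

Under these assumptions, an agent that needs 120 seconds to recover would 
already be past the 75-second deadline but well within the 10-minute one.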

> Master does not remove an agent if it is responsive but not registered
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6286
>                 URL: https://issues.apache.org/jira/browse/MESOS-6286
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Joseph Wu
>            Assignee: Neil Conway
>            Priority: Blocker
>              Labels: mesosphere
>
> As part of MESOS-6285, we observed an agent stuck in the recovery phase.  The 
> agent would do the following in a loop:
> # Systemd starts the agent.
> # The agent detects the master, but does not connect yet.  The agent needs to 
> recover first.
> # The agent responds to {{PingSlaveMessage}} from the master, but it is 
> stalled in recovery.
> # The agent is OOM-killed by the kernel before recovery finishes.  Repeat 
> (1-4).
> The consequences of this:
> * Frameworks will never get a TASK_LOST or terminal status update for tasks 
> on this agent.
> * Executors on the agent can connect to the agent, but will not be able to 
> register.
> We should consider adding some timeout/intervention in the master for 
> responsive, but non-recoverable agents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
