[ 
https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250166#comment-16250166
 ] 

Vinod Kone commented on MESOS-8111:
-----------------------------------

What framework are you using? I'm assuming Marathon, since you are using DC/OS.
 
DC/OS sets a default rate limit of 1 agent per 20 minutes for the master to 
mark a disconnected agent as unreachable. If more than one agent disconnects 
or is scaled down at the same time, it can take quite a while for the master 
to mark all of them as unreachable.
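
As a rough illustration of that arithmetic (a back-of-the-envelope sketch, not 
how the master computes anything), the delay grows linearly with the number of 
agents that disconnect together. The function name and parameters below are 
hypothetical; the sketch assumes every agent, including the first, waits one 
full rate-limit interval, and it ignores the agent health-check timeout.

```python
from datetime import timedelta


def worst_case_unreachable_delay(
    disconnected_agents: int,
    permits: int = 1,
    per: timedelta = timedelta(minutes=20),
) -> timedelta:
    """Estimate the time until the last disconnected agent is marked unreachable.

    Assumption (not taken from the Mesos source): agents are marked one permit
    at a time, and each one waits a full permit interval.
    """
    if disconnected_agents <= 0:
        return timedelta(0)
    spacing = per / permits  # time between consecutive permits
    return spacing * disconnected_agents


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, "agents ->", worst_case_unreachable_delay(n))
```

Under these assumptions, 10 agents scaled down together at the default limit 
could take on the order of 200 minutes before the last one is marked 
unreachable.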

Also, can you share the master, scheduler, and agent logs covering the 
specific task around the time of the disconnection? That would help us 
diagnose this better.


> Mesos sees task as running, but cannot kill it because the agent is offline
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-8111
>                 URL: https://issues.apache.org/jira/browse/MESOS-8111
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.3
>         Environment: DC/OS 1.9.4
>            Reporter: Cosmin Lehene
>
> After scaling down a cluster, the master reports a task as running 
> although the slave is long gone.
> At the same time, it reports that it cannot kill the task because the agent 
> is offline:
> {noformat}
> I1018 16:55:22.000000  6976 master.cpp:4913] Processing KILL call for task 
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> [email protected]:15101
> W1018 16:55:22.000000  6976 master.cpp:5000] Cannot kill task 
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at 
> [email protected]:15101 because the 
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline, the task is not running either. I am also 
> not sure that waiting indefinitely for an agent to recover is a good strategy.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
