[ https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250193#comment-16250193 ]
Cosmin Lehene commented on MESOS-8111:
--------------------------------------
[~vinodkone] Yes, Marathon.
I could capture the logs next time (the cluster is long gone).
I think this was happening after scaling down from 100 nodes to 5 or 10.
I'm trying to understand what prompted the default. Is it meant to avoid churn
when the master is in a network partition?
Perhaps we should adjust the rate limit for these use-cases.
> Mesos sees task as running, but cannot kill it because the agent is offline
> ---------------------------------------------------------------------------
>
> Key: MESOS-8111
> URL: https://issues.apache.org/jira/browse/MESOS-8111
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.2.3
> Environment: DC/OS 1.9.4
> Reporter: Cosmin Lehene
> Assignee: Vinod Kone
>
> After scaling down a cluster, the master reports a task as running
> even though the agent is long gone.
> At the same time, the master reports that it cannot kill the task because
> the agent is offline:
> {noformat}
> I1018 16:55:22.000000 6976 master.cpp:4913] Processing KILL call for task
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at
> [email protected]:15101
> W1018 16:55:22.000000 6976 master.cpp:5000] Cannot kill task
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at
> [email protected]:15101 because the
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline the task is also not running. I am also not
> sure that waiting indefinitely for an agent to recover is a good strategy.