[
https://issues.apache.org/jira/browse/MESOS-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157937#comment-16157937
]
Chun-Hung Hsiao edited comment on MESOS-5482 at 9/8/17 1:01 AM:
----------------------------------------------------------------
My superficial guess is that someone (Marathon?) asked to shut down framework
{{f853458f-b07b-4b79-8192-24953f474369-0000}}, but the framework itself didn't
know that and asked to launch task
{{metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac}} (possibly due to some
race condition?). From the agent's perspective, the framework had already been
shut down, so the task never existed; but from the framework's perspective, the
task had been launched, so it kept asking to kill it and got no response.
[~gengmao] Can you attach the complete master/agent logs, and if possible the
Marathon sandbox log related to launching/shutting down
{{f853458f-b07b-4b79-8192-24953f474369-0000}}, so we can confirm or reject the
above guess?
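In the meantime, one way to check the master's side of this guess is to see
whether the framework ID shows up under {{completed_frameworks}} rather than
{{frameworks}} in the master state endpoint. A minimal sketch (the master
address below is a placeholder; adjust for your cluster and Mesos version):
{code:python}
# Sketch: ask the Mesos master whether a framework is still active or has
# been moved to completed_frameworks (i.e. shut down / torn down).
# The master address is a placeholder; the framework ID is the one from
# this ticket.
import json
from urllib.request import urlopen

MASTER = "http://mesos-master.example.com:5050"  # placeholder address
FRAMEWORK_ID = "f853458f-b07b-4b79-8192-24953f474369-0000"

with urlopen(MASTER + "/master/state") as response:
    state = json.loads(response.read().decode("utf-8"))

active = {f["id"] for f in state.get("frameworks", [])}
completed = {f["id"] for f in state.get("completed_frameworks", [])}

if FRAMEWORK_ID in completed:
    print("master already considers the framework shut down")
elif FRAMEWORK_ID in active:
    print("framework is still registered with the master")
else:
    print("master has no record of this framework id")
{code}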
> mesos/marathon task stuck in staging after slave reboot
> -------------------------------------------------------
>
> Key: MESOS-5482
> URL: https://issues.apache.org/jira/browse/MESOS-5482
> Project: Mesos
> Issue Type: Bug
> Reporter: lutful karim
> Labels: tech-debt
> Attachments: marathon-mesos-masters_after-reboot.log,
> mesos-masters_mesos.log, mesos_slaves_after_reboot.log,
> tasks_running_before_rebooot.marathon
>
>
> The main promise of mesos/marathon is that operators can sleep well, but after
> a node reboot the mesos task gets stuck in staging for about 4 hours.
> To reproduce the issue:
> - set up a mesos cluster in HA mode with systemd-enabled mesos-master and
> mesos-slave services.
> - run the docker registry (https://hub.docker.com/_/registry/) with the mesos
> constraint (hostname:LIKE:mesos-slave-1) on one node (a sample launch request
> is sketched below). Reboot the node and notice that the task gets stuck in
> staging.
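> For reference, a hedged sketch of the kind of Marathon request used to launch
> the registry with that constraint (the Marathon address, app id, and resource
> sizes below are illustrative placeholders, not values from this cluster):
> {code:python}
> # Sketch: submit a Docker registry app pinned to one agent via a hostname
> # constraint, by POSTing an app definition to Marathon's /v2/apps endpoint.
> # All names and sizes here are placeholders.
> import json
> from urllib.request import Request, urlopen
>
> MARATHON = "http://marathon.example.com:8080"  # placeholder address
>
> app = {
>     "id": "/docker-registry",
>     "cpus": 0.5,
>     "mem": 256,
>     "instances": 1,
>     # Pin the task to a single agent, as in the steps above.
>     "constraints": [["hostname", "LIKE", "mesos-slave-1"]],
>     "container": {
>         "type": "DOCKER",
>         "docker": {"image": "registry", "network": "BRIDGE"},
>     },
> }
>
> request = Request(
>     MARATHON + "/v2/apps",
>     data=json.dumps(app).encode("utf-8"),
>     headers={"Content-Type": "application/json"},
> )
> print(urlopen(request).read().decode("utf-8"))
> {code}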
> Possible workaround: service mesos-slave restart fixes the issue.
> OS: CentOS 7.2
> Mesos version: 0.28.1
> Marathon: 1.1.1
> ZooKeeper: 3.4.8
> Docker: 1.9.1, Docker API version: 1.21
> Error message:
> May 30 08:38:24 euca-10-254-237-140 mesos-slave[832]: W0530 08:38:24.120013
> 909 slave.cpp:2018] Ignoring kill task
> docker-registry.066fb448-2628-11e6-bedd-d00d0ef81dc3 because the executor
> 'docker-registry.066fb448-2628-11e6-bedd-d00d0ef81dc3' of framework
> 8517fcb7-f2d0-47ad-ae02-837570bef929-0000 is terminating/terminated