[
https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422638#comment-15422638
]
Rob Johnson commented on MESOS-4084:
------------------------------------
We've been hit by this bug in the last couple of days: specifically, Chronos
was receiving status updates about a task owned by Marathon (note that the
task had already finished). To resolve the situation, we had to clear the Mesos
agent's work directory and restart the agent.
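In case it's useful to anyone else hitting this, our recovery was roughly the
following (assuming the default /tmp/mesos work directory and an init service
named mesos-slave, i.e. the same commands as in the description at the bottom
of this ticket):
{noformat}
# Stop the agent, wipe its checkpointed state, and start it again.
# Wiping the work directory discards the agent's checkpoint, so it will
# re-register with a new agent ID and any tasks still running on the box
# will be lost.
service mesos-slave stop
rm -rf /tmp/mesos   # default --work_dir; adjust if configured differently
service mesos-slave start
{noformat}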
Here are some of the relevant entries, picked out of the agent log and the sandbox's stderr.
Timeline:
{noformat}
2016-08-14 20:58:20,820 the application dies
{noformat}
The agent notices its death:
{noformat}
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:23.803419 29833 slave.cpp:3012] Handling status update TASK_FAILED (UUID:
7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 from
executor(1)@1.2.3.4:32870
...
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:24.812420 29858 slave.cpp:3871] Executor
'myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c' of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 exited with status 0
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:24.813383 29840 status_update_manager.cpp:320] Received status update
TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:24.813478 29840 status_update_manager.cpp:824] Checkpointing UPDATE for
status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:24.841326 29849 slave.cpp:3410] Forwarding the update TASK_FAILED (UUID:
7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 to [email protected]:5050
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:24.841840 29849 slave.cpp:3320] Sending acknowledgement for status update
TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 to executor(1)@1.2.3.4:32870
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:30.629101 29864 status_update_manager.cpp:392] Received status update
acknowledgement (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:30.629396 29864 status_update_manager.cpp:824] Checkpointing ACK for
status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
20:58:30.651621 29838 slave.cpp:3975] Cleaning up executor
'myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c' of
framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 at executor(1)@1.2.3.4:32870
{noformat}
However, in the meantime the health check is still within its grace period, so
the slave keeps running it. But the Docker container has already been
killed:
{noformat}
I0814 20:58:29.862861 30341 main.cpp:98] Ignoring failure as health check still
in grace period
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
I0814 20:58:39.930883 30328 main.cpp:98] Ignoring failure as health check still
in grace period
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
I0814 20:58:49.996784 30332 main.cpp:98] Ignoring failure as health check still
in grace period
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
I0814 20:59:00.061403 30346 main.cpp:98] Ignoring failure as health check still
in grace period
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
I0814 20:59:10.126847 30325 main.cpp:98] Ignoring failure as health check still
in grace period
{noformat}
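The "Error response from daemon" lines above come straight from the Docker CLI,
so the health command appears to be running a docker command against the (now
stopped) container. As a minimal sketch, with a hypothetical health command,
this is what reproduces that error once the container has exited:
{noformat}
# `docker exec` against a container that has already exited fails with the same
# "is not running" error seen in the sandbox stderr. The curl is a placeholder
# for whatever the real health command is.
docker exec \
  mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 \
  sh -c 'curl --fail http://localhost:8080/health'
{noformat}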
From there, the sandbox's stderr contains a number of entries that look like
this:
{noformat}
E0814 20:59:20.191401 30349 process.cpp:1958] Failed to shutdown socket with fd
11: Transport endpoint is not connected
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
E0814 20:59:30.353951 30349 process.cpp:1958] Failed to shutdown socket with fd
11: Transport endpoint is not connected
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
E0814 20:59:40.413733 30349 process.cpp:1958] Failed to shutdown socket with fd
11: Transport endpoint is not connected
Error response from daemon: Container
mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1
is not running
{noformat}
This continues for about five minutes; these are the last entries in the stderr log:
{noformat}
E0814 21:04:13.749197 30349 process.cpp:1958] Failed to shutdown socket with fd
11: Transport endpoint is not connected
W0814 21:04:14.749591 30309 main.cpp:387] Health check failed Health command
check exited with status 1
{noformat}
However, there are then a number of status updates reporting that the task is
still running:
{noformat}
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
21:00:20.850172 29860 slave.cpp:3012] Handling status update TASK_RUNNING
(UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
from executor(1)@1.2.3.4:32870
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:W0814
21:00:20.850363 29860 slave.cpp:3092] Could not find the executor for status
update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
21:00:20.850623 29852 status_update_manager.cpp:320] Received status update
TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
21:00:20.851411 29849 slave.cpp:3410] Forwarding the update TASK_RUNNING (UUID:
6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
to [email protected]:5050
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
21:00:20.851662 29849 slave.cpp:3320] Sending acknowledgement for status update
TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
to executor(1)@1.2.3.4:32870
{noformat}
From here, the agent continues to handle and resend TASK_RUNNING status updates
until we moved the state aside and restarted the agent:
{noformat}
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:W0814
21:00:50.853273 29860 status_update_manager.cpp:475] Resending status update
TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814
21:00:50.853715 29838 slave.cpp:3410] Forwarding the update TASK_RUNNING (UUID:
6c0e3a29-6323-4de1-8f82-7472bd390252) for task
myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
to [email protected]:5050
{noformat}
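For what it's worth, the resent updates appear to come from the status-update
streams checkpointed under the agent's work directory, which would explain why
wiping that directory and restarting the agent was the only thing that stopped
them. A rough sketch of where to look, assuming the default /tmp/mesos work
directory and the usual checkpoint layout (exact paths may differ between
versions):
{noformat}
# List the checkpointed status-update streams for every task the agent knows
# about; the stuck TASK_RUNNING stream for the failed task should be in here.
find /tmp/mesos/meta/slaves -path '*/frameworks/*' -name 'task.updates'
{noformat}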
> mesos-slave assigned marathon task wrongly to chronos framework after task
> failure
> ----------------------------------------------------------------------------------
>
> Key: MESOS-4084
> URL: https://issues.apache.org/jira/browse/MESOS-4084
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.22.2
> Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
> Reporter: Erhan Kesken
> Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share my
> logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec 7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED
> (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> of framework 20151113-112010-100670892-5050-7957-0001 from
> executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128
> slave.cpp:2662] Failed to update resources for container
> ed5f4f67-464d-4786-9628-cd48732de6b7 of executor
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> running task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> on status update for terminal task, destroying container: Failed to
> determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup:
> Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework ID above; then, 5 minutes later, we got the following logs:
> {noformat}
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207
> 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001 to [email protected]:5050
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec 7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207
> 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001
> {noformat}
> This caused Chronos to be deactivated immediately, as seen in the mesos-master log:
> {noformat}
> Dec 7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207
> 09:03:27.654770 40948 master.cpp:1964] Deactivating framework
> 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at
> [email protected]:11893
> Dec 7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207
> 09:03:27.654989 40948 hierarchical.hpp:400] Deactivated framework
> 20150624-210230-117448108-5050-3678-0001
> {noformat}
> There was no clue in the Chronos log at that time, but when we restarted the
> Chronos master, we got these logs:
> {noformat}
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: Exception in thread
> "Thread-62550" scala.MatchError:
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> (of class java.lang.String)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> org.apache.mesos.chronos.scheduler.jobs.TaskUtils$.parseTaskId(TaskUtils.scala:132)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:215)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> java.lang.reflect.Method.invoke(Method.java:497)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238782
> 15035 sched.cpp:1623] Asked to abort the driver
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238836
> 15035 sched.cpp:856] Aborting framework
> '20150624-210230-117448108-5050-3678-0001'
> {noformat}
> The problem was solved by running the following commands on the problematic slave:
> {noformat}
> service mesos-slave stop
> rm -fr /tmp/mesos
> service mesos-slave start
> {noformat}
> I needed to restart the Chronos master to reactivate the framework; I don't know
> whether there is a way to reactivate a framework manually rather than restarting
> the master node.
> I guess the issue https://github.com/mesos/chronos/issues/569 opened in the
> Chronos project is related to this bug too.