[
https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046390#comment-15046390
]
Erhan Kesken commented on MESOS-4084:
-------------------------------------
I don't know if it's related, but about half an hour before this problem there was a
26-second connection loss to ZooKeeper, and the leading mesos-master had changed.
> mesos-slave wrongly assigned a Marathon task to the Chronos framework after task
> failure
> ----------------------------------------------------------------------------------
>
> Key: MESOS-4084
> URL: https://issues.apache.org/jira/browse/MESOS-4084
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.22.2
> Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
> Reporter: Erhan Kesken
> Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share my
> logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec 7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED
> (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> of framework 20151113-112010-100670892-5050-7957-0001 from
> executor(1)@172.29.1.12:1651
> Dec 7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128
> slave.cpp:2662] Failed to update resources for container
> ed5f4f67-464d-4786-9628-cd48732de6b7 of executor
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> running task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> on status update for terminal task, destroying container: Failed to
> determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup:
> Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Note the framework id above; then, 5 minutes later, we got the following logs:
> {noformat}
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207
> 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001 to [email protected]:5050
> Dec 7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207
> 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec 7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207
> 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> in health state unhealthy of framework
> 20150624-210230-117448108-5050-3678-0001
> {noformat}
> This immediately caused the Chronos framework to be deactivated, as seen in the mesos-master log:
> {noformat}
> Dec 7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207
> 09:03:27.654770 40948 master.cpp:1964] Deactivating framework
> 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at
> [email protected]:11893
> Dec 7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207
> 09:03:27.654989 40948 hierarchical.hpp:400] Deactivated framework
> 20150624-210230-117448108-5050-3678-0001
> {noformat}
> There was no clue in the Chronos log at that time, but when we restarted the Chronos
> master, we got these logs:
> {noformat}
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: Exception in thread
> "Thread-62550" scala.MatchError:
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
> (of class java.lang.String)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> org.apache.mesos.chronos.scheduler.jobs.TaskUtils$.parseTaskId(TaskUtils.scala:132)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:215)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> java.lang.reflect.Method.invoke(Method.java:497)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: #011at
> com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238782
> 15035 sched.cpp:1623] Asked to abort the driver
> Dec 7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238836
> 15035 sched.cpp:856] Aborting framework
> '20150624-210230-117448108-5050-3678-0001'
> {noformat}
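> The stack trace above suggests the failure mode: Chronos' TaskUtils.parseTaskId pattern-matches task ids against Chronos' own naming scheme, so a Marathon-style task id falls through the match and raises scala.MatchError, which in turn aborts the scheduler driver. A minimal Scala sketch of that behavior (the ct:<due>:<attempt>:<job> format and the return type here are assumptions for illustration, not Chronos' exact implementation):
> {noformat}
> // Sketch: a regex-based pattern match with no catch-all case throws
> // scala.MatchError for any id that does not follow the expected scheme.
> object ParseTaskIdSketch {
>   // Assumed Chronos-style task-id format: "ct:<dueTimeMillis>:<attempt>:<jobName>"
>   private val TaskIdPattern = """ct:(\d+):(\d+):(.+)""".r
>
>   def parseTaskId(taskId: String): (String, Long, Int) = taskId match {
>     case TaskIdPattern(due, attempt, jobName) => (jobName, due.toLong, attempt.toInt)
>     // no default case: a Marathon-style id falls through and throws scala.MatchError
>   }
>
>   def main(args: Array[String]): Unit = {
>     println(parseTaskId("ct:1449478707000:0:somejob")) // parses fine
>     // throws scala.MatchError, mirroring the exception in the log above:
>     println(parseTaskId("collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128"))
>   }
> }
> {noformat}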
> The problem was solved by running the following commands on the problematic slave:
> {noformat}
> service mesos-slave stop
> rm -fr /tmp/mesos
> service mesos-slave start
> {noformat}
> I needed to restart the Chronos master to reactivate the framework; I don't know
> whether there is a way to reactivate a framework manually rather than restarting the
> master node.
> I guess the issue https://github.com/mesos/chronos/issues/569 opened in the Chronos
> project is related to this bug as well.