[ 
https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534342#comment-15534342
 ] 

Huadong Liu commented on MESOS-4084:
------------------------------------

While the marathon task executor was still hanging around, a chronos job was 
launched on the SAME port 40844. Before the chronos job executor terminated, a 
status update for the marathon task was sent at 13:11:18, in which the chronos 
framework ID was used instead of the correct marathon framework ID.

{noformat}
I0927 13:11:15.117260 28954 slave.cpp:2828] Got registration for executor 
'my_chronos_service' of framework 20151218-172703-1745234954-5050-16740-0000 
from executor(1)@10.46.1.35:40844
I0927 13:11:15.120493 28966 slave.cpp:2005] Sending queued task 
'my_chronos_service' to executor 'my_chronos_service' of framework 
20151218-172703-1745234954-5050-16740-0000 at executor(1)@10.46.1.35:40844
I0927 13:11:16.948460 28968 slave.cpp:3211] Handling status update TASK_RUNNING 
(UUID: e994358d-8e77-4d5e-8c1a-a64de13b60bc) for task my_chronos_service of 
framework 20151218-172703-1745234954-5050-16740-0000 from 
executor(1)@10.46.1.35:40844
I0927 13:11:17.211659 28966 slave.cpp:3514] Sending acknowledgement for status 
update TASK_RUNNING (UUID: e994358d-8e77-4d5e-8c1a-a64de13b60bc) for task 
my_chronos_service of framework 20151218-172703-1745234954-5050-16740-0000 to 
executor(1)@10.46.1.35:40844

I0927 13:11:18.588488 28953 slave.cpp:3211] Handling status update TASK_RUNNING 
(UUID: e85041ac-e8bb-4832-991e-e9b23cf0c547) for task 
myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc in 
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000 
from executor(1)@10.46.1.35:40844
I0927 13:11:18.590958 28960 slave.cpp:3514] Sending acknowledgement for status 
update TASK_RUNNING (UUID: e85041ac-e8bb-4832-991e-e9b23cf0c547) for task 
myservice.gitaefa9d79.config5e0a0677.e9416045-84ed-11e6-8d4d-02423c827edc in 
health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000 
to executor(1)@10.46.1.35:40844

I0927 13:11:21.788275 28956 slave.cpp:3211] Handling status update 
TASK_FINISHED (UUID: c38d2220-7f20-4674-bd06-e59185e4bccc) for task 
my_chronos_service of framework 20151218-172703-1745234954-5050-16740-0000 from 
executor(1)@10.46.1.35:40844
I0927 13:11:21.961676 28968 slave.cpp:3514] Sending acknowledgement for status 
update TASK_FINISHED (UUID: c38d2220-7f20-4674-bd06-e59185e4bccc) for task 
my_chronos_service of framework 20151218-172703-1745234954-5050-16740-0000 to 
executor(1)@10.46.1.35:40844
I0927 13:11:24.793031 28958 slave.cpp:3732] executor(1)@10.46.1.35:40844 exited
{noformat}
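
One way to picture the misattribution, purely as a hypothesis: if the agent 
ties an executor to its libprocess address (ip:port), then once that port is 
reused by another framework's executor, a late message from the old executor 
can be charged to the new one. Below is a minimal, self-contained sketch of 
that failure mode; it is NOT Mesos code, and every name and ID in it is made 
up.

{noformat}
// Hypothetical sketch (not Mesos internals): the framework is resolved from
// the sender's ip:port, so a reused port misattributes a marathon task update.
#include <iostream>
#include <map>
#include <string>

struct ExecutorInfo {
  std::string executorId;
  std::string frameworkId;
};

int main() {
  // Key: sender address "ip:port"; value: the executor believed to live there.
  std::map<std::string, ExecutorInfo> executorsByPid;

  // The marathon executor originally lived at 10.46.1.35:40844 ...
  executorsByPid["10.46.1.35:40844"] =
      {"myservice.gitaefa9d79.config5e0a0677", "marathon-framework-id"};

  // ... but the chronos executor later registered from the SAME port,
  // silently replacing the mapping.
  executorsByPid["10.46.1.35:40844"] =
      {"my_chronos_service", "chronos-framework-id"};

  // A status update for the marathon task now arrives from that ip:port.
  // The task ID comes from the update payload, but the framework is taken
  // from whatever the map currently associates with the sender.
  std::string taskFromUpdate = "myservice.gitaefa9d79.config5e0a0677";
  const ExecutorInfo& sender = executorsByPid.at("10.46.1.35:40844");
  std::cout << "TASK_RUNNING for task " << taskFromUpdate
            << " of framework " << sender.frameworkId << std::endl;
  // Prints the chronos framework ID for a marathon task, mirroring the
  // 13:11:18 log lines above.
  return 0;
}
{noformat}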

Questions that need to be answered:
1. Why did the marathon job fail? There was no clue in the logs.
2. Why did the agent say the executor exited while the executor log disagreed? 
Answering this will probably explain why the same port number was reused (see 
the sketch after this list).
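
On question 2: if the agent observed the old executor's libprocess socket 
close, the kernel is then free to hand the same ephemeral port to the next 
process that asks for one, which would explain the reuse. A small, 
self-contained illustration of that kernel behaviour (plain POSIX sockets, 
nothing Mesos-specific, port numbers chosen by the OS):

{noformat}
// Illustration only: a port freed by close() can immediately be bound again.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <iostream>

// Bind a TCP socket to an OS-chosen (ephemeral) port and return that port.
static int bindEphemeral(int* fd) {
  *fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick the port
  bind(*fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  socklen_t len = sizeof(addr);
  getsockname(*fd, reinterpret_cast<sockaddr*>(&addr), &len);
  return ntohs(addr.sin_port);
}

int main() {
  int fd1;
  int port = bindEphemeral(&fd1);
  std::cout << "first 'executor' bound to port " << port << std::endl;
  close(fd1);  // the agent would now consider this executor gone

  // A second socket (the next executor) can bind the very same port.
  int fd2 = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  int rc = bind(fd2, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  std::cout << "rebinding port " << port
            << (rc == 0 ? " succeeded" : " failed") << std::endl;
  close(fd2);
  return 0;
}
{noformat}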


> mesos-slave assigned marathon task wrongly to chronos framework after task 
> failure
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-4084
>                 URL: https://issues.apache.org/jira/browse/MESOS-4084
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.2
>         Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
>            Reporter: Erhan Kesken
>            Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share 
> my logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec  7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED 
> (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  of framework 20151113-112010-100670892-5050-7957-0001 from 
> executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128 
> slave.cpp:2662] Failed to update resources for container 
> ed5f4f67-464d-4786-9628-cd48732de6b7 of executor 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  running task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  on status update for terminal task, destroying container: Failed to 
> determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup: 
> Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework id above; 5 minutes later we got the following logs:
> {noformat}
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING 
> (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to [email protected]:5050
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 
> 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status 
> update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec  7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207 
> 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update 
> TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  in health state unhealthy of framework 
> 20150624-210230-117448108-5050-3678-0001
> {noformat} 
> This immediately deactivated chronos, as seen in the mesos-master log:
> {noformat}
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 
> 09:03:27.654770 40948 master.cpp:1964] Deactivating framework 
> 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at 
> [email protected]:11893
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 
> 09:03:27.654989 40948 hierarchical.hpp:400] Deactivated framework 
> 20150624-210230-117448108-5050-3678-0001
> {noformat}
> There was no clue in the chronos log at that time, but when we restarted the 
> chronos master, we got these logs:
> {noformat}
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: Exception in thread 
> "Thread-62550" scala.MatchError: 
> collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128
>  (of class java.lang.String)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> org.apache.mesos.chronos.scheduler.jobs.TaskUtils$.parseTaskId(TaskUtils.scala:132)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:215)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> java.lang.reflect.Method.invoke(Method.java:497)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at 
> com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238782 
> 15035 sched.cpp:1623] Asked to abort the driver
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238836 
> 15035 sched.cpp:856] Aborting framework 
> '20150624-210230-117448108-5050-3678-0001'
> {noformat}
> The problem was solved by running the following commands on the problematic 
> slave:
> {noformat}
> service mesos-slave stop
> rm -fr /tmp/mesos
> service mesos-slave start
> {noformat}
> I needed to restart the chronos master to reactivate the framework; I don't 
> know if there is a way to reactivate the framework manually rather than 
> restarting the master node.
> I guess this issue https://github.com/mesos/chronos/issues/569 opened in the 
> chronos project is related to this bug too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
