[ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
----------------------------------
    Attachment: staging_tasks.png

> A large number of tasks stuck in Staging state.
> -----------------------------------------------
>
>                 Key: MESOS-5349
>                 URL: https://issues.apache.org/jira/browse/MESOS-5349
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.29.0
>            Reporter: Anand Mazumdar
>              Labels: mesosphere
>         Attachments: agent-state.log, master-state.log, mesos-master.WARNING, 
> mesos-slave.WARNING, staging_tasks.png
>
>
> We saw an odd issue on one of our test clusters over the weekend: a large 
> number of tasks from the example {{long running framework}} were stuck in 
> Staging. The executor was duly sending status updates for all the tasks, and 
> the logs show the slave successfully received them, but for some reason it 
> never got around to checkpointing them.
> From the agent logs, it seems the agent kept retrying some backlogged status 
> updates for task IDs in the 4xxx/6xxx range, while the current tasks were 
> launched in the 8xxx range.
> The issue resolved itself after a few hours, once the agent lost its ZK 
> session and re-registered with the master.
> Let's walk through the timeline of a particular task, 8142.
> Agent logs before the restart:
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
> slave.cpp:1522] Got assigned task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
> slave.cpp:1641] Launching task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
> slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
> cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
> slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
> slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
> 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
> slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
> 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
> cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task upon restart:
> {code}
> May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 
> state.cpp:606] Failed to find status updates file 
> '/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates'
> {code}
> Things that need to be investigated:
> - Why couldn't the agent get around to handling the status updates from the 
> executor, i.e., even checkpointing them?
> - What made the agent get _so_ backlogged on the status updates, i.e., why 
> did it keep resending the old status updates for the 4xxx/6xxx tasks without 
> getting around to the newer ones?
> Please find attached the agent/master logs. This is running against Mesos 
> HEAD (557cab591f35a6c3d2248d7af7f06cdf99726e92).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)