[
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anand Mazumdar updated MESOS-5349:
----------------------------------
Attachment: log_agent_before_zk_disconnect.gz
> A large number of tasks stuck in Staging state.
> -----------------------------------------------
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.29.0
> Reporter: Anand Mazumdar
> Labels: mesosphere
> Attachments: agent-state.log, log_agent_after_zk_disconnect.gz,
> log_agent_before_zk_disconnect.gz, master-state.log, mesos-master.WARNING,
> mesos-slave.WARNING, staging_tasks.png
>
>
> We saw a strange issue on one of our test clusters over the weekend. A large
> number of tasks from the example {{long running framework}} were stuck in
> staging. The executor was duly sending status updates for all of these tasks,
> and the logs show the slave successfully receiving them, but for some reason
> the slave never got around to checkpointing them.
> From the agent logs, it appears the slave kept retrying backlogged status
> updates for older tasks (task IDs in the 4xxx/6xxx range) while the tasks
> currently being launched were in the 8xxx range.
> The issue resolved itself after a few hours, when the agent lost its ZK
> session and (re-)registered with the master.
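> For reference, the expected flow on the agent is to checkpoint each status
> update to the task's {{task.updates}} file before forwarding it to the master
> and eventually acknowledging it back to the executor. A minimal,
> self-contained sketch of that ordering (the names, types and file layout here
> are illustrative only, not the actual Mesos implementation):
> {code}
> #include <fstream>
> #include <iostream>
> #include <string>
> 
> // Illustrative stand-in for a status update; not the Mesos protobuf.
> struct StatusUpdate {
>   std::string taskId;
>   std::string state;  // e.g. "TASK_RUNNING", "TASK_FINISHED"
>   std::string uuid;
> };
> 
> // Append the update to the task's checkpoint file so it survives an agent
> // restart, and only then hand it off for forwarding to the master. An update
> // that never reaches this write is exactly the "never got to checkpointing
> // them" symptom described above.
> bool handleStatusUpdate(const StatusUpdate& update,
>                         const std::string& checkpointPath) {
>   std::ofstream out(checkpointPath, std::ios::app);
>   if (!out) {
>     std::cerr << "Failed to open " << checkpointPath << std::endl;
>     return false;
>   }
>   out << update.uuid << " " << update.taskId << " " << update.state << "\n";
>   out.flush();
> 
>   // Only after the write succeeds would the update be forwarded to the
>   // master and, later, acknowledged back to the executor.
>   std::cout << "Forwarding " << update.state << " for task "
>             << update.taskId << std::endl;
>   return true;
> }
> 
> int main() {
>   StatusUpdate running{"8142", "TASK_RUNNING",
>                        "85323c4f-e523-495e-9b49-39b0a7792303"};
>   handleStatusUpdate(running, "task.updates");
>   return 0;
> }
> {code}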
> Let's walk through the timeline of one particular task, 8142.
> Agent logs for this task before the restart:
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 slave.cpp:1522] Got assigned task 8142 for framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 slave.cpp:1641] Launching task 8142 for framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 slave.cpp:1880] Queuing task '8142' for executor 'default' of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 disk.cpp:169] Updating the disk resources for container f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 disk.cpp:169] Updating the disk resources for container f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task upon restart:
> {code}
> May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 state.cpp:606] Failed to find status updates file '/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates'
> {code}
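> One way to gauge how widespread this was is to scan the agent's meta directory
> for task directories that are missing a {{task.updates}} file entirely. A rough
> C++17 sketch (the default root below is taken from the path in the log line
> above; the only layout assumption is that a {{tasks/<taskId>}} directory should
> contain {{task.updates}}):
> {code}
> #include <filesystem>
> #include <iostream>
> 
> namespace fs = std::filesystem;
> 
> int main(int argc, char** argv) {
>   // Root of the agent's checkpointed state; override via argv[1] if needed.
>   const fs::path root = argc > 1
>       ? fs::path(argv[1])
>       : fs::path("/var/lib/mesos/slave/meta/slaves");
> 
>   // Walk every ".../tasks/<taskId>" directory and report the ones that have
>   // no checkpointed task.updates file.
>   for (const auto& entry : fs::recursive_directory_iterator(
>            root, fs::directory_options::skip_permission_denied)) {
>     if (!entry.is_directory()) {
>       continue;
>     }
>     const fs::path dir = entry.path();
>     if (dir.parent_path().filename() == "tasks" &&
>         !fs::exists(dir / "task.updates")) {
>       std::cout << "No task.updates for " << dir << std::endl;
>     }
>   }
>   return 0;
> }
> {code}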
> Things that need to be investigated:
> - Why couldn't the agent get around to handling the status updates from the
> executor, i.e. not even checkpointing them?
> - What made the agent get _so_ backlogged on status updates, i.e. why did it
> keep resending the old status updates for the 4xxx/6xxx tasks without ever
> getting around to the newer ones? (See the illustrative retry sketch below.)
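> To put the resend behaviour in perspective: an unacknowledged status update is
> retried on a backoff schedule, so an update that is never acked keeps being
> resent indefinitely. The doubling schedule and the 10s/600s bounds below are
> assumptions chosen for illustration, not the agent's actual configuration:
> {code}
> #include <algorithm>
> #include <chrono>
> #include <iostream>
> 
> int main() {
>   using std::chrono::seconds;
> 
>   // Assumed policy: double the retry interval on every resend, capped at a
>   // maximum. An update that is never acknowledged is resent forever, which
>   // would match the constant stream of retries for the old 4xxx/6xxx tasks.
>   seconds interval(10);
>   const seconds max(600);
>   seconds elapsed(0);
> 
>   for (int attempt = 1; attempt <= 10; ++attempt) {
>     std::cout << "resend attempt " << attempt << " at t+"
>               << elapsed.count() << "s" << std::endl;
>     elapsed += interval;
>     interval = std::min(interval * 2, max);
>   }
>   return 0;
> }
> {code}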
> The agent/master logs are attached. This is running against Mesos HEAD at
> 557cab591f35a6c3d2248d7af7f06cdf99726e92.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)