[
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anand Mazumdar updated MESOS-5349:
----------------------------------
Description:
We saw a weird issue on one of our test clusters over the weekend. A large
number of tasks from the example {{long running framework}} were stuck in
staging. The executor was duly sending status updates for all the tasks, and
the logs show that the slave received them, but for some reason it never got
around to checkpointing them.
From the agent logs, it seems that the agent kept retrying some backlogged
status updates for task IDs in the 4xxx/6xxx range, while the tasks currently
being launched had IDs in the 8xxx range.
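To quantify that backlog, one can bucket the retried updates by task ID. A
minimal sketch, assuming the retry lines contain a phrase like {{Resending
status update ... for task <ID> of framework ...}} (the exact wording may
differ across versions):
{code}
#!/usr/bin/env python
# Count retried status updates per task-ID bucket (4xxx, 6xxx, 8xxx, ...)
# in an agent log. The retry-line phrasing is an assumption; adjust the
# regex to whatever this Mesos build actually logs.
import re
import sys
from collections import Counter

RETRY = re.compile(r"Resending (?:the )?status update .* for task (\d+) of framework")

buckets = Counter()
with open(sys.argv[1]) as log:  # e.g. the attached agent log
    for line in log:
        m = RETRY.search(line)
        if m:
            buckets[int(m.group(1)) // 1000] += 1  # 4123 -> bucket 4 ("4xxx")

for bucket, count in sorted(buckets.items()):
    print("%dxxx: %d retries" % (bucket, count))
{code}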
The issue resolved itself after a few hours, when the agent lost its ZK
session and (re-)registered with the master.
Let's walk through the timeline of a particular task, 8142.
Agent logs before the restart:
{code}
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941 2820
slave.cpp:1522] Got assigned task 8142 for framework
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142 2820
slave.cpp:1641] Launching task 8142 for framework
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656 2820
slave.cpp:1880] Queuing task '8142' for executor 'default' of framework
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092 2818
disk.cpp:169] Updating the disk resources for container
f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093 2816
mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293 2821
cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742 2821
cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us'
to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902 2818
slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290 2821
http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357 2821
slave.cpp:3221] Handling status update TASK_RUNNING (UUID:
85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770 2817
http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882 2817
slave.cpp:3221] Handling status update TASK_FINISHED (UUID:
285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787 2821
disk.cpp:169] Updating the disk resources for container
f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365 2823
cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641 2820
mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878 2823
cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us'
to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
{code}
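For reference, a per-task timeline like the one above can be pulled out of the
agent log by matching on the task ID (quoted or unquoted, since the slave logs
both forms). A trivial sketch:
{code}
#!/usr/bin/env python
# Print every agent log line mentioning the given task ID, in order,
# to reconstruct a per-task timeline like the one above.
import re
import sys

log_path = sys.argv[1]                                  # e.g. the attached agent log
task_id = sys.argv[2] if len(sys.argv) > 2 else "8142"  # task from this report
pattern = re.compile(r"task '?%s\b" % re.escape(task_id))

with open(log_path) as log:
    for line in log:
        if pattern.search(line):
            sys.stdout.write(line)
{code}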
Agent logs for this task after the restart:
{code}
May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318
state.cpp:606] Failed to find status updates file
'/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates'
{code}
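To see how widespread the missing checkpoints are, one can walk the agent's
checkpoint layout using the path structure from the warning above (the meta
root below is this cluster's; adjust for the agent's {{--work_dir}}):
{code}
#!/usr/bin/env python
# Walk the agent's checkpoint layout:
#   <meta>/slaves/<S>/frameworks/<F>/executors/<E>/runs/<R>/tasks/<T>/task.updates
# and report every task directory that lacks a task.updates file.
import os

META = "/var/lib/mesos/slave/meta"  # from the warning above; adjust as needed

for dirpath, dirnames, filenames in os.walk(META):
    # A task directory is one whose parent directory is named 'tasks'.
    if os.path.basename(os.path.dirname(dirpath)) == "tasks":
        if "task.updates" not in filenames:
            print("missing checkpoint: %s" % dirpath)
{code}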
Things that need to be investigated:
- Why couldn't the agent get around to handling the status updates from the
executor, i.e., even checkpointing them? (See the sketch after this list.)
- What made the agent get _so_ backlogged on status updates, i.e., why did it
keep resending the old updates for the 4xxx/6xxx tasks without getting around
to the newer ones?
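One way to attack the first question is to diff the update UUIDs the slave
handled against those it checkpointed. A minimal sketch, assuming the status
update manager logs a line containing {{Checkpointing UPDATE for status update
... (UUID: ...)}} (verify the exact phrasing against this build):
{code}
#!/usr/bin/env python
# List status updates that were handled by the slave ("Handling status
# update" at slave.cpp:3221) but for which no checkpoint line was logged.
# The "Checkpointing" phrasing is an assumption; match it against what the
# status update manager in this build actually logs.
import re
import sys

UUID = re.compile(r"UUID: ([0-9a-f-]+)\) for task (\S+)")

handled = {}        # update UUID -> task ID
checkpointed = set()

with open(sys.argv[1]) as log:  # e.g. the attached agent log
    for line in log:
        m = UUID.search(line)
        if not m:
            continue
        if "Handling status update" in line:
            handled[m.group(1)] = m.group(2)
        elif "Checkpointing" in line:
            checkpointed.add(m.group(1))

for uuid, task in handled.items():
    if uuid not in checkpointed:
        print("task %s: update %s handled but never checkpointed" % (task, uuid))
{code}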
Please find attached the agent/master logs. This is running against Mesos
{{HEAD -> 557cab591f35a6c3d2248d7af7f06cdf99726e92}}.
> A large number of tasks stuck in Staging state.
> -----------------------------------------------
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.29.0
> Reporter: Anand Mazumdar
> Labels: mesosphere
> Attachments: agent-state.log, log_agent_after_zk_disconnect.gz,
> log_agent_before_zk_disconnect.gz, master-state.log, mesos-master.WARNING,
> mesos-slave.WARNING, staging_tasks.png
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)