Kevin Sweeney created AURORA-788:
------------------------------------
Summary: gc executor updates ignored by slave on loaded machines
Key: AURORA-788
URL: https://issues.apache.org/jira/browse/AURORA-788
Project: Aurora
Issue Type: Bug
Reporter: Kevin Sweeney
It looks like the root cause is that the GC executor doesn't sleep long enough
for the slave to reliably receive its final status update.
From production logs at Twitter:
{noformat}
mesos-slave.log:
I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework
201205082337-0000000003-0000
I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework
201205082337-0000000003-0000
I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to
'/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task
'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of
framework 201205082337-0000000003-0000
I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 (cpus
0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to
100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container
f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container
'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container
'f97928c4-4214-439c-8d9f-c8493070b1db'
I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after
1.140736ms
I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 1.0752ms
I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters
with ports [33792,34815] for container with pid 19217
I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports
[33792,34816) for container with pid 19217
I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed
cleanup for pid 19217
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown
container: f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing UPDATE
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000 to [email protected]:5050
I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status
update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
{noformat}
Note the race: the slave notices the dead GC executor process
(23:40:17.646457) roughly 2ms before it handles the TASK_LOST status update
for its task (23:40:17.648576):
{noformat}
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown
container: f97928c4-4214-439c-8d9f-c8493070b1db
{noformat}
{noformat}
gc_executor_main.log:
I1002 23:35:11.663753 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: - Runner is dead but under LOST
threshold.
I1002 23:36:18.483422 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id:
system-gc-04810a63-e8b1-42a4-9b08-03b099517856
I1002 23:36:18.483834 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: => Adding
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
I1002 23:40:11.664382 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: Updating
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
I1002 23:40:11.664607 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: Reason: Garbage collection
skipped - GC executor shutting down
I1002 23:40:11.664880 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate
https://issues.apache.org/jira/browse/MESOS-243
I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run()
has finished.
{noformat}
Given that the GC executor is already in maintenance mode and we already have a
sleep in place to mitigate https://issues.apache.org/jira/browse/MESOS-243, I
think we can just raise the sleep timeout here.
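For illustration, here is a minimal sketch of the shape of that fix against the
old Mesos Python bindings. The helper name {{send_terminal_update}} and the
{{grace_secs}} value are hypothetical, not the actual gc_executor source; the
only load-bearing idea is sending the terminal update and then lingering before
MesosExecutorDriver.run() returns:
{noformat}
import time

from mesos.interface import mesos_pb2


def send_terminal_update(driver, task_id, grace_secs=15.0):
    """Send a terminal status update, then linger before exiting.

    grace_secs is a hypothetical knob: the existing sleep is ~5s (see the
    23:40:11.66 -> 23:40:16.67 gap in gc_executor_main.log above) and
    mitigates MESOS-243; this ticket proposes raising it so a loaded slave
    reliably receives the update before the executor process exits.
    """
    update = mesos_pb2.TaskStatus()
    update.task_id.value = task_id
    update.state = mesos_pb2.TASK_FINISHED
    driver.sendStatusUpdate(update)
    # Give the slave time to read the update off the socket. If the executor
    # process dies first, the slave generates TASK_LOST instead, as in the
    # slave log above.
    time.sleep(grace_secs)
{noformat}
The trade-off is that the executor holds its small allocation (0.2 CPUs /
128MB per the slave log above) for the extra grace period, which seems
acceptable for a maintenance-mode component.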