Kevin Sweeney created AURORA-788:
------------------------------------

             Summary: gc executor updates ignored by slave on loaded machines
                 Key: AURORA-788
                 URL: https://issues.apache.org/jira/browse/AURORA-788
             Project: Aurora
          Issue Type: Bug
            Reporter: Kevin Sweeney


It looks like the root cause is that the GC executor doesn't sleep long enough 
before exiting for the slave to reliably receive its final status update.
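
For context, the shutdown path in question looks roughly like the sketch below. This is a simplified illustration rather than the actual Aurora GC executor code; the function and constant names (finish_and_exit, GC_SHUTDOWN_WAIT_SECS) are invented, and the import path assumes the Mesos 0.20-era Python bindings.
{noformat}
# Simplified sketch of the shutdown sequence described above (illustrative
# only -- not the real Aurora source).
import time

from mesos.interface import mesos_pb2

GC_SHUTDOWN_WAIT_SECS = 5.0  # roughly the window the executor log shows

def finish_and_exit(driver, task_id):
  # Send the terminal update for the queued GC task.
  update = mesos_pb2.TaskStatus()
  update.task_id.value = task_id
  update.state = mesos_pb2.TASK_FINISHED
  update.message = 'Garbage collection skipped - GC executor shutting down'
  driver.sendStatusUpdate(update)

  # Existing mitigation for https://issues.apache.org/jira/browse/MESOS-243:
  # pause so the slave can receive the update before the process goes away.
  time.sleep(GC_SHUTDOWN_WAIT_SECS)

  # After this, MesosExecutorDriver.run() returns and the process exits; on a
  # loaded slave the update may not have been processed yet, so the slave only
  # sees a dead executor (the TASK_LOST in the slave log below).
  driver.stop()
{noformat}
The slave and executor logs below show roughly this sequence losing the race on a loaded machine.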

From production logs at Twitter:
{noformat}
mesos-slave.log:
I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 
201205082337-0000000003-0000
I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 
201205082337-0000000003-0000
I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to 
'/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task 
'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of 
framework 201205082337-0000000003-0000
I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 (cpus 
0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 
100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 
f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container 
'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container 
'f97928c4-4214-439c-8d9f-c8493070b1db'
I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 
1.140736ms
I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 1.0752ms
I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters 
with ports [33792,34815] for container with pid 19217
I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports 
[33792,34816) for container with pid 19217
I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed 
cleanup for pid 19217
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST 
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown 
container: f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status 
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing UPDATE 
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for 
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status 
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000 to [email protected]:5050
I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status 
update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK 
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for 
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
{noformat}

Note the race - the slave notices the dead GC executor process 300µs before it 
sees that status update from the executor.
{noformat}
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST 
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown 
container: f97928c4-4214-439c-8d9f-c8493070b1db
{noformat}

{noformat}
gc_executor_main.log:
I1002 23:35:11.663753 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]:   - Runner is dead but under LOST 
threshold.
I1002 23:36:18.483422 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id: 
system-gc-04810a63-e8b1-42a4-9b08-03b099517856
I1002 23:36:18.483834 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: => Adding 
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
I1002 23:40:11.664382 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: Updating 
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
I1002 23:40:11.664607 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]:    Reason: Garbage collection 
skipped - GC executor shutting down
I1002 23:40:11.664880 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate 
https://issues.apache.org/jira/browse/MESOS-243
I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run() 
has finished.
{noformat}

Given that the GC executor is already in maintenance mode and we already have a 
sleep in place to mitigate https://issues.apache.org/jira/browse/MESOS-243, I 
think we can just raise the sleep timeout here.
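
Concretely, that would amount to widening the window in the sketch above, e.g. (value purely illustrative, constant name invented):
{noformat}
# Hypothetical change: give a loaded slave more time to process the final
# update before MesosExecutorDriver.run() is allowed to return.
GC_SHUTDOWN_WAIT_SECS = 15.0  # was ~5s per the executor log; exact value TBD
{noformat}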



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
