[ 
https://issues.apache.org/jira/browse/AURORA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Sweeney updated AURORA-788:
---------------------------------
    Description: The GC Executor is currently configured to exit after a 
hardcoded period of inactivity since it last processed a launchTask request. 
This causes needless task churn in the cluster, especially given that the 
scheduler's default launch interval will cause the executor to be relaunched 
on average every 15 minutes.  (was: It looks like the root cause is that the 
GC executor doesn't sleep long enough for the slave to reliably receive its 
final status update.

From production logs at Twitter:
{noformat}
mesos-slave.log:
I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 
201205082337-0000000003-0000
I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 
201205082337-0000000003-0000
I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to 
'/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task 
'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of 
framework 201205082337-0000000003-0000
I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 (cpus 
0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 
100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 
f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container 
'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container 
'f97928c4-4214-439c-8d9f-c8493070b1db'
I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 
1.140736ms
I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 1.0752ms
I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters 
with ports [33792,34815] for container with pid 19217
I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports 
[33792,34816) for container with pid 19217
I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed 
cleanup for pid 19217
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST 
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown 
container: f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status 
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing UPDATE 
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for 
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status 
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000 to [email protected]:5050
I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status 
update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK 
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for 
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000
{noformat}

Note the race - the slave notices the dead GC executor process roughly 2ms 
before it handles that TASK_LOST status update; the executor's own final 
update was never received in time:
{noformat}
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST 
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown 
container: f97928c4-4214-439c-8d9f-c8493070b1db
{noformat}

{noformat}
gc_executor_main.log:
I1002 23:35:11.663753 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]:   - Runner is dead but under LOST 
threshold.
I1002 23:36:18.483422 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id: 
system-gc-04810a63-e8b1-42a4-9b08-03b099517856
I1002 23:36:18.483834 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: => Adding 
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
I1002 23:40:11.664382 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: Updating 
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
I1002 23:40:11.664607 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]:    Reason: Garbage collection 
skipped - GC executor shutting down
I1002 23:40:11.664880 19217 executor_base.py:46] Executor 
[20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate 
https://issues.apache.org/jira/browse/MESOS-243
I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run() 
has finished.
{noformat}

Given that the GC executor is already in maintenance mode and we already have a 
sleep in place to mitigate https://issues.apache.org/jira/browse/MESOS-243, I 
think we can just raise the sleep timeout here.)
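
For illustration, a minimal, self-contained Python sketch of the shutdown 
ordering the old description argues for: send the terminal status update, then 
linger long enough for the slave to actually receive it before stopping the 
driver. This is not the real gc_executor code; StubDriver, shutdown() and 
PERSISTENCE_WAIT_SECS are hypothetical stand-ins (the stub mimics the 
sendStatusUpdate/stop shape of a MesosExecutorDriver).
{noformat}
import time

# Hypothetical value; the point of the old description is that the existing
# MESOS-243 sleep (~5s in the executor log above) can simply be raised.
PERSISTENCE_WAIT_SECS = 60


class StubDriver(object):
    """Stand-in for an executor driver so this sketch runs standalone."""

    def sendStatusUpdate(self, status):
        print('sendStatusUpdate(%s)' % status)

    def stop(self):
        print('driver.stop()')


def shutdown(driver, final_status):
    # Send the terminal update first...
    driver.sendStatusUpdate(final_status)
    # ...then sleep so the slave can receive it before the process exits.
    # Exiting too soon is what produces the TASK_LOST race shown in the
    # slave log above (MESOS-243).
    time.sleep(PERSISTENCE_WAIT_SECS)
    driver.stop()


if __name__ == '__main__':
    shutdown(StubDriver(), 'TASK_FINISHED')
{noformat}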

> GC executor doesn't need to exit after period of inactivity
> -----------------------------------------------------------
>
>                 Key: AURORA-788
>                 URL: https://issues.apache.org/jira/browse/AURORA-788
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Kevin Sweeney
>            Assignee: Kevin Sweeney
>             Fix For: 0.6.0
>
>
> The GC Executor is currently configured to exit after a hardcoded period of 
> inactivity since it last processed a launchTask request. This causes needless 
> task churn in the cluster, especially given that the scheduler's default 
> launch interval will cause the executor to be relaunched on average every 15 
> minutes.
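
As a rough illustration of the churn described above (again a standalone 
sketch with hypothetical names - GcExecutorLoop, MAX_IDLE_SECS, tick() - not 
Aurora's implementation), the difference is whether the executor's main loop 
bails out after an idle period or simply stays resident waiting for the next 
launchTask:
{noformat}
import time

# Hypothetical knob standing in for the hardcoded inactivity limit.
MAX_IDLE_SECS = 15 * 60


class GcExecutorLoop(object):
    """Toy main loop contrasting idle-exit with staying resident."""

    def __init__(self, exit_when_idle):
        self._exit_when_idle = exit_when_idle
        self._last_launch = time.time()
        self._stopped = False

    def on_launch_task(self, task_id):
        # Each launchTask from the scheduler resets the idle clock.
        self._last_launch = time.time()

    def tick(self):
        idle = time.time() - self._last_launch
        if self._exit_when_idle and idle > MAX_IDLE_SECS:
            # Old behaviour: exit, forcing the scheduler to relaunch the
            # executor (on average every ~15 minutes with the default
            # launch interval): the needless churn described above.
            self._stopped = True
        # New behaviour (exit_when_idle=False): never trip the idle exit;
        # the executor stays resident until explicitly shut down.
        return not self._stopped
{noformat}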



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
