[
https://issues.apache.org/jira/browse/AURORA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kevin Sweeney updated AURORA-788:
---------------------------------
Description: The GC Executor is currently configured to exit after a
hardcoded period of inactivity since it last processed a launchTask request.
This causes needless task churn in the cluster, especially given that the
scheduler's default launch interval means the executor must be relaunched on
average every 15 minutes. (was: It looks like the root cause is
that the GC executor doesn't sleep long enough for the slave to reliably
receive its final status update.
From production logs at Twitter:
{noformat}
mesos-slave.log:
I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework
201205082337-0000000003-0000
I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework
201205082337-0000000003-0000
I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to
'/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task
'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of
framework 201205082337-0000000003-0000
I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 (cpus
0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to
100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container
f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container
'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container
'f97928c4-4214-439c-8d9f-c8493070b1db'
I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after
1.140736ms
I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup
/sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 1.0752ms
I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters
with ports [33792,34815] for container with pid 19217
I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports
[33792,34816) for container with pid 19217
I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed
cleanup for pid 19217
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown
container: f97928c4-4214-439c-8d9f-c8493070b1db
I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing UPDATE
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status
update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000 to [email protected]:5050
I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status
update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK
for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for
task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000
{noformat}
Note the race: the slave notices the dead GC executor process roughly 2ms
(23:40:17.646457 vs. 23:40:17.648576) before it handles that TASK_LOST status
update. A sketch of the sleep-based mitigation follows the excerpt below.
{noformat}
I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework
201205082337-0000000003-0000 exited with status 0
I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST
(UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
201205082337-0000000003-0000 from @0.0.0.0:0
W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for unknown
container: f97928c4-4214-439c-8d9f-c8493070b1db
{noformat}
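The mitigation the executor already attempts (visible in the
gc_executor_main.log excerpt that follows) is to keep the process alive for a
while after sending its last status update, so the slave can receive it before
observing the process exit. Here is a minimal sketch of that idea; the helper
name send_update and the constant POST_UPDATE_WAIT_SECS are illustrative
stand-ins, not the actual gc_executor API:
{code:python}
# Sketch only: send_update and POST_UPDATE_WAIT_SECS are hypothetical names,
# not the real ThermosGCExecutor attributes.
import time

POST_UPDATE_WAIT_SECS = 15.0  # assumed value; long enough for the slave to
                              # receive and checkpoint the final update

def finish_and_shutdown(driver, task_id, send_update):
    # Send the terminal update for the queued GC task before tearing down.
    send_update(driver, task_id, 'FINISHED',
                'Garbage collection skipped - GC executor shutting down')
    # If the process exits immediately, the slave can observe the executor
    # death first and synthesize TASK_LOST, as in the slave.log excerpt above
    # (see also MESOS-243). Sleeping here gives the update time to land.
    time.sleep(POST_UPDATE_WAIT_SECS)
    # Only now stop the driver so MesosExecutorDriver.run() returns and the
    # process can exit.
    driver.stop()
{code}
The executor log below shows the current, much shorter sleep in action.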
{noformat}
gc_executor_main.log:
I1002 23:35:11.663753 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: - Runner is dead but under LOST
threshold.
I1002 23:36:18.483422 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id:
system-gc-04810a63-e8b1-42a4-9b08-03b099517856
I1002 23:36:18.483834 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: => Adding
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
I1002 23:40:11.664382 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: Updating
system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
I1002 23:40:11.664607 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: Reason: Garbage collection
skipped - GC executor shutting down
I1002 23:40:11.664880 19217 executor_base.py:46] Executor
[20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate
https://issues.apache.org/jira/browse/MESOS-243
I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run()
has finished.
{noformat}
Given that the GC executor is already in maintenance mode and we already have a
sleep in place to mitigate https://issues.apache.org/jira/browse/MESOS-243, I
think we can simply raise that sleep timeout here.)
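For illustration, the churn described in the new description comes from an
inactivity-driven exit check of roughly the following shape; the names below
(IdleExit, MAX_IDLE_SECS, _last_launch_task) are hypothetical stand-ins, not
the actual ThermosGCExecutor attributes:
{code:python}
# Sketch of an inactivity-based exit check; purely illustrative.
import time

MAX_IDLE_SECS = 15 * 60  # illustrative value; the actual hardcoded limit is
                         # not stated in this ticket

class IdleExit(object):
    def __init__(self, clock=time):
        self._clock = clock
        self._last_launch_task = clock.time()

    def on_launch_task(self):
        # Every launchTask() resets the idle timer.
        self._last_launch_task = self._clock.time()

    def should_exit(self):
        # When the scheduler's launch interval is comparable to the idle
        # limit, the executor regularly idles out and has to be relaunched.
        return self._clock.time() - self._last_launch_task > MAX_IDLE_SECS
{code}
Dropping the should_exit() check (or making the limit effectively unbounded)
would keep a single long-lived GC executor per slave instead of relaunching it
on every garbage-collection round.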
> GC executor doesn't need to exit after period of inactivity
> -----------------------------------------------------------
>
> Key: AURORA-788
> URL: https://issues.apache.org/jira/browse/AURORA-788
> Project: Aurora
> Issue Type: Bug
> Reporter: Kevin Sweeney
> Assignee: Kevin Sweeney
> Fix For: 0.6.0
>
>
> The GC Executor is currently configured to exit after a hardcoded period of
> inactivity since it last processed a launchTask request. This causes needless
> task churn in the cluster, especially given that the scheduler's default
> launch interval means the executor must be relaunched on average every 15
> minutes.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)