[
https://issues.apache.org/jira/browse/AURORA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157479#comment-14157479
]
Kevin Sweeney commented on AURORA-788:
--------------------------------------
https://reviews.apache.org/r/26300/
> gc executor updates ignored by slave on loaded machines
> -------------------------------------------------------
>
> Key: AURORA-788
> URL: https://issues.apache.org/jira/browse/AURORA-788
> Project: Aurora
> Issue Type: Bug
> Reporter: Kevin Sweeney
> Assignee: Kevin Sweeney
> Fix For: 0.6.0
>
>
> It looks like the root cause is that the GC executor doesn't sleep long
> enough for the slave to reliably receive its final status update.
> From production logs at Twitter:
> {noformat}
> mesos-slave.log:
> I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework
> 201205082337-0000000003-0000
> I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework
> 201205082337-0000000003-0000
> I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to
> '/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
> I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task
> 'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of
> framework 201205082337-0000000003-0000
> I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204
> (cpus 0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
> to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to
> 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container
> f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container
> 'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
> I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container
> 'f97928c4-4214-439c-8d9f-c8493070b1db'
> I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after
> 1.140736ms
> I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after
> 1.0752ms
> I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters
> with ports [33792,34815] for container with pid 19217
> I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports
> [33792,34816) for container with pid 19217
> I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed
> cleanup for pid 19217
> I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework
> 201205082337-0000000003-0000 exited with status 0
> I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST
> (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000 from @0.0.0.0:0
> W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for
> unknown container: f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status
> update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000
> I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing
> UPDATE for status update TASK_LOST (UUID:
> 125cd0b5-0660-44b2-ac96-047787be717f) for task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000
> I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status
> update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000 to [email protected]:5050
> I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status
> update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000
> I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK
> for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for
> task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000
> {noformat}
> Note the race: the slave notices the dead GC executor process roughly 2ms
> before it sees that status update from the executor.
> {noformat}
> I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework
> 201205082337-0000000003-0000 exited with status 0
> I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST
> (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework
> 201205082337-0000000003-0000 from @0.0.0.0:0
> W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for
> unknown container: f97928c4-4214-439c-8d9f-c8493070b1db
> {noformat}
> {noformat}
> gc_executor_main.log:
> I1002 23:35:11.663753 19217 executor_base.py:46] Executor
> [20140916-223909-2215521290-5050-52027-3819]: - Runner is dead but under
> LOST threshold.
> I1002 23:36:18.483422 19217 executor_base.py:46] Executor
> [20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id:
> system-gc-04810a63-e8b1-42a4-9b08-03b099517856
> I1002 23:36:18.483834 19217 executor_base.py:46] Executor
> [20140916-223909-2215521290-5050-52027-3819]: => Adding
> system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
> I1002 23:40:11.664382 19217 executor_base.py:46] Executor
> [20140916-223909-2215521290-5050-52027-3819]: Updating
> system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
> I1002 23:40:11.664607 19217 executor_base.py:46] Executor
> [20140916-223909-2215521290-5050-52027-3819]: Reason: Garbage collection
> skipped - GC executor shutting down
> I1002 23:40:11.664880 19217 executor_base.py:46] Executor
> [20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate
> https://issues.apache.org/jira/browse/MESOS-243
> I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run()
> has finished.
> {noformat}
> Given that the GC executor is already in maintenance mode, and we already
> have a sleep in place to mitigate
> https://issues.apache.org/jira/browse/MESOS-243, I think we can just raise
> the sleep timeout here.
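The proposal above amounts to lengthening the delay between the executor's final status update and driver shutdown. A minimal sketch of that pattern, where the helper name, the constant, and its value are illustrative assumptions rather than Aurora's actual code:

```python
import time

# Illustrative value only: the point of the proposed fix is that the
# previous ~5s sleep was too short for a loaded slave to consume the
# executor's final status update before the process exited.
STATUS_UPDATE_WAIT_SECS = 15.0

def graceful_stop(driver, wait_secs=STATUS_UPDATE_WAIT_SECS):
    """Stop the executor driver only after sleeping long enough for the
    slave to receive the final status update (mitigates MESOS-243)."""
    time.sleep(wait_secs)
    driver.stop()
```

In the real executor this sleep sits between sending the FINISHED update and letting MesosExecutorDriver.run() return, which corresponds to the "Sleeping briefly to mitigate ... MESOS-243" line in the gc_executor_main.log excerpt above.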
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)