[ https://issues.apache.org/jira/browse/AURORA-788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157479#comment-14157479 ]

Kevin Sweeney commented on AURORA-788:
--------------------------------------

https://reviews.apache.org/r/26300/

> gc executor updates ignored by slave on loaded machines
> -------------------------------------------------------
>
>                 Key: AURORA-788
>                 URL: https://issues.apache.org/jira/browse/AURORA-788
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Kevin Sweeney
>            Assignee: Kevin Sweeney
>             Fix For: 0.6.0
>
>
> It looks like the root cause is that the GC executor doesn't sleep long 
> enough for the slave to reliably receive its final status update.
> From production logs at Twitter:
> {noformat}
> mesos-slave.log:
> I1002 23:40:17.372160 10628 slave.cpp:1025] Got assigned task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 
> 201205082337-0000000003-0000
> I1002 23:40:17.372648 10628 slave.cpp:1135] Launching task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d for framework 
> 201205082337-0000000003-0000
> I1002 23:40:17.373164 10628 slave.cpp:4002] Checkpointing TaskInfo to 
> '/var/lib/mesos/meta/slaves/20140916-223909-2215521290-5050-52027-3819/frameworks/201205082337-0000000003-0000/executors/aurora.gc/runs/f97928c4-4214-439c-8d9f-c8493070b1db/tasks/system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d/task.info'
> I1002 23:40:17.373951 10628 slave.cpp:1270] Sending task 
> 'system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d' to executor 'aurora.gc' of 
> framework 201205082337-0000000003-0000
> I1002 23:40:17.374651 10623 cpushare.cpp:338] Updated 'cpu.shares' to 204 
> (cpus 0.2) for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.374681 10630 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
> to 128MB for container f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.375869 10623 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 20ms (cpus 0.2) for container 
> f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.614163 10629 containerizer.cpp:997] Executor for container 
> 'f97928c4-4214-439c-8d9f-c8493070b1db' has exited
> I1002 23:40:17.614673 10629 containerizer.cpp:882] Destroying container 
> 'f97928c4-4214-439c-8d9f-c8493070b1db'
> I1002 23:40:17.616425 10627 cgroups.cpp:2208] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.618352 10629 cgroups.cpp:1375] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 
> 1.140736ms
> I1002 23:40:17.619982 10629 cgroups.cpp:2225] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.621541 10629 cgroups.cpp:1404] Successfullly thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/f97928c4-4214-439c-8d9f-c8493070b1db after 
> 1.0752ms
> I1002 23:40:17.623524 10613 port_mapping.cpp:1949] Removing IP packet filters 
> with ports [33792,34815] for container with pid 19217
> I1002 23:40:17.626447 10613 port_mapping.cpp:1968] Freed ephemeral ports 
> [33792,34816) for container with pid 19217
> I1002 23:40:17.646090 10613 port_mapping.cpp:2096] Successfully performed 
> cleanup for pid 19217
> I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 
> 201205082337-0000000003-0000 exited with status 0
> I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST 
> (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000 from @0.0.0.0:0
> W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for 
> unknown container: f97928c4-4214-439c-8d9f-c8493070b1db
> I1002 23:40:17.649251 10607 status_update_manager.cpp:320] Received status 
> update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000
> I1002 23:40:17.649842 10607 status_update_manager.hpp:342] Checkpointing 
> UPDATE for status update TASK_LOST (UUID: 
> 125cd0b5-0660-44b2-ac96-047787be717f) for task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000
> I1002 23:40:17.656316 10607 status_update_manager.cpp:373] Forwarding status 
> update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000 to [email protected]:5050
> I1002 23:40:17.658787 10626 status_update_manager.cpp:398] Received status 
> update acknowledgement (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000
> I1002 23:40:17.658941 10626 status_update_manager.hpp:342] Checkpointing ACK 
> for status update TASK_LOST (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for 
> task system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000
> {noformat}
> Note the race - the slave notices the dead GC executor process about 2ms 
> before it handles the resulting TASK_LOST status update for the task; the 
> executor's own final update was never received.
> {noformat}
> I1002 23:40:17.646457 10612 slave.cpp:2618] Executor 'aurora.gc' of framework 
> 201205082337-0000000003-0000 exited with status 0
> I1002 23:40:17.648576 10612 slave.cpp:2111] Handling status update TASK_LOST 
> (UUID: 125cd0b5-0660-44b2-ac96-047787be717f) for task 
> system-gc-9c9febe5-6002-4163-8999-d1ede3004c2d of framework 
> 201205082337-0000000003-0000 from @0.0.0.0:0
> W1002 23:40:17.648859 10626 containerizer.cpp:788] Ignoring update for 
> unknown container: f97928c4-4214-439c-8d9f-c8493070b1db
> {noformat}
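> To make the failure mode concrete, the shutdown path suggested by the logs 
> boils down to the following sketch (Python, with purely illustrative names 
> and structure - this is not the actual Aurora GC executor code):
> {noformat}
> import time
>
> def shutdown(driver, final_update, wait_secs=5):
>     """Illustrative sketch of the GC executor shutdown path, not the real code."""
>     # sendStatusUpdate() on the Mesos executor driver is asynchronous, so
>     # returning from it does not mean the slave has received the update.
>     driver.sendStatusUpdate(final_update)
>     # Sleep briefly to mitigate MESOS-243 (roughly the 5s gap visible in
>     # gc_executor_main.log below).
>     time.sleep(wait_secs)
>     # On a loaded slave the update can still be queued at this point; once the
>     # process exits, the slave observes the dead executor first and records
>     # TASK_LOST, as in the excerpt above.
>     driver.stop()
> {noformat}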
> {noformat}
> gc_executor_main.log:
> I1002 23:35:11.663753 19217 executor_base.py:46] Executor 
> [20140916-223909-2215521290-5050-52027-3819]:   - Runner is dead but under 
> LOST threshold.
> I1002 23:36:18.483422 19217 executor_base.py:46] Executor 
> [20140916-223909-2215521290-5050-52027-3819]: launchTask() got task_id: 
> system-gc-04810a63-e8b1-42a4-9b08-03b099517856
> I1002 23:36:18.483834 19217 executor_base.py:46] Executor 
> [20140916-223909-2215521290-5050-52027-3819]: => Adding 
> system-gc-04810a63-e8b1-42a4-9b08-03b099517856 to GC queue
> I1002 23:40:11.664382 19217 executor_base.py:46] Executor 
> [20140916-223909-2215521290-5050-52027-3819]: Updating 
> system-gc-04810a63-e8b1-42a4-9b08-03b099517856 => FINISHED
> I1002 23:40:11.664607 19217 executor_base.py:46] Executor 
> [20140916-223909-2215521290-5050-52027-3819]:    Reason: Garbage collection 
> skipped - GC executor shutting down
> I1002 23:40:11.664880 19217 executor_base.py:46] Executor 
> [20140916-223909-2215521290-5050-52027-3819]: Sleeping briefly to mitigate 
> https://issues.apache.org/jira/browse/MESOS-243
> I1002 23:40:16.670403 19217 gc_executor_main.py:41] MesosExecutorDriver.run() 
> has finished.
> {noformat}
> Given that the GC executor is already in maintenance mode and we already have 
> a sleep in place to mitigate https://issues.apache.org/jira/browse/MESOS-243, 
> I think we can simply raise that sleep timeout here.
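> Relative to the sketch above, the proposal amounts to widening that wait; the 
> constant name and value below are placeholders only, not the value chosen in 
> the change under review (https://reviews.apache.org/r/26300/):
> {noformat}
> import time
>
> # Placeholder value; see https://reviews.apache.org/r/26300/ for the sleep
> # actually chosen in the fix.
> MESOS_243_WAIT_SECS = 60   # was ~5s; give a loaded slave time to receive the
>                            # final status update before the executor exits
>
> time.sleep(MESOS_243_WAIT_SECS)   # then driver.stop() as before
> {noformat}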



