cf-natali opened a new pull request #362:
URL: https://github.com/apache/mesos/pull/362


   This is a follow-up to MESOS-10107, which introduced retries when
   calling rmdir on a seemingly empty cgroup fails with EBUSY because of
   various kernel bugs.
   At the time, the fix used a bounded number of retries with an
   exponential backoff summing to slightly over 1s. This mirrored what
   Docker does, and it worked during testing.
   However, after a month without seeing this error in our cluster at
   work, we finally hit a case where the 1s timeout wasn't enough.
   It could be because the machine was busy at the time, or some other
   random factor.
   So instead of only retrying for 1s, I think it makes more sense to
   keep retrying until the top-level container destruction timeout - set
   at 1 minute - kicks in.
   This also avoids having a magic timeout in the cgroup code.
   We just need to ensure that when the destroyer is finalised, it discards
   the future in charge of doing the periodic remove.
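   For illustration, here is a minimal standalone sketch of the idea (not
   the actual Mesos/libprocess code - `retryUntilDeadline` and its
   parameters are hypothetical names): keep retrying an rmdir-like
   operation that transiently fails with EBUSY, backing off exponentially,
   with no internal retry cap; the caller's overall destruction timeout is
   what ultimately bounds the loop.
   ```cpp
   #include <cerrno>
   #include <chrono>
   #include <thread>

   // Hypothetical sketch: retry `op` while it fails with EBUSY, doubling
   // the delay between attempts. Unlike a fixed ~1s retry budget, the
   // only bound is the caller-provided deadline (e.g. the 1 minute
   // container destruction timeout).
   template <typename Op>
   int retryUntilDeadline(
       Op op,  // Returns 0 on success, or an errno value on failure.
       std::chrono::milliseconds initialDelay,
       std::chrono::steady_clock::time_point deadline)
   {
     std::chrono::milliseconds delay = initialDelay;

     while (true) {
       int error = op();
       if (error != EBUSY) {
         return error;  // Success (0), or a hard failure worth reporting.
       }
       if (std::chrono::steady_clock::now() + delay > deadline) {
         return EBUSY;  // Give up: the overall deadline has been reached.
       }
       std::this_thread::sleep_for(delay);
       delay *= 2;  // Exponential backoff between attempts.
     }
   }
   ```
   In the actual patch the loop is not bounded by its own deadline at all;
   the destroyer's finalization discards the pending future, which is why
   the periodic remove must be discardable.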
   
   
   Here are the logs of the problem we've seen in our cluster, for reference - 
you can see that the cgroup destruction fails even after 1s:
   ```
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523308 4942 
containerizer.cpp:3179] Container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 has 
reached its limit for resource 
[{"name":"mem","scalar":{"value":16320.0},"type":"SCALAR"}] and will be 
terminated
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523384 4942 
containerizer.cpp:2623] Destroying container 
34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 in RUNNING state
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523396 4942 
containerizer.cpp:3321] Transitioning the state of container 
34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 from RUNNING to DESTROYING after 
56.8612528682667mins
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523541 4942 
linux_launcher.cpp:564] Asked to destroy container 
34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523587 4942 
linux_launcher.cpp:606] Destroying cgroup 
'/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244'
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523954 4942 
cgroups.cpp:2887] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524153 4954 
cgroups.cpp:1275] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 after 189184ns
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524430 4954 
cgroups.cpp:2905] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524525 4954 
cgroups.cpp:1304] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 after 87808ns
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.533007 4928 
slave.cpp:6616] Got exited event for executor(1)@172.16.20.99:36313
   May  1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.557977 4950 
containerizer.cpp:3159] Container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 has 
exited
   May  1 11:52:02 host033 mesos-slave[2256]: E0501 11:52:02.583150 4956 
slave.cpp:6994] Termination of executor 
'secure_executor:13954144-cbcd-6bd4-8e37-af2301ec510d' of framework 
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed: Failed to kill all processes 
in the container: Failed to remove cgroup 
'mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244': Failed to remove cgroup 
'/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244': Device or 
resource busy
   May  1 11:52:02 host033 mesos-slave[2256]: I0501 11:52:02.583232 4956 
slave.cpp:5890] Handling status update TASK_FAILED (Status UUID: 
cc2899ab-c534-4eeb-a1a4-f28102fc3ca4) for task 
13954144-cbcd-6bd4-8e37-af2301ec510d of framework 
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 from @0.0.0.0:0
   ```
   
   @abudnik @qianzhangxa 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
