This is an automated email from the ASF dual-hosted git repository.

abudnik pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git


The following commit(s) were added to refs/heads/master by this push:
     new 0cb1591  Keep retrying to remove cgroup on EBUSY.
0cb1591 is described below

commit 0cb1591b709e3c9f32093d943b8e2ddcdcf7999f
Author: Charles-Francois Natali <cf.nat...@gmail.com>
AuthorDate: Sat May 2 01:41:09 2020 +0100

    Keep retrying to remove cgroup on EBUSY.
    
    This is a follow-up to MESOS-10107, which introduced retries when
    calling `rmdir` on a seemingly empty cgroup fails with `EBUSY`
    because of various kernel bugs.
    At the time, the fix introduced a bounded number of retries, using an
    exponential backoff summing up to slightly over 1s. This was done
    because it was similar to what Docker does, and worked during testing.
    However, after a month without seeing this error in our cluster at
    work, we eventually hit a case where the 1s budget wasn't enough,
    possibly because the machine was busy at the time, or some other
    random factor.
    So instead of giving up after roughly 1s, it makes more sense to keep
    retrying until the top-level container destruction timeout - set at
    1 minute - kicks in.
    This also avoids a magic timeout in the cgroup code.
    We just need to ensure that when the destroyer is finalized, it discards
    the future in charge of doing the periodic remove.
    
    This closes #362
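
    The retry pattern described above can be sketched as a standalone
    program (a hypothetical illustration, not the Mesos implementation:
    `removeWithRetry` and the backoff constants are made up here; Mesos
    uses a libprocess `loop` with `after(delay)` instead of a blocking
    sleep, and bounds the overall wait via the destruction timeout):

    ```cpp
    #include <algorithm>
    #include <cerrno>
    #include <chrono>
    #include <string>
    #include <thread>

    #include <unistd.h>

    // Keep calling rmdir() until it succeeds or fails with an error
    // other than EBUSY. The caller is expected to bound the overall
    // wait (in Mesos, discarding the future when the 1-minute
    // container destruction timeout fires).
    bool removeWithRetry(const std::string& path)
    {
      auto delay = std::chrono::milliseconds(0);

      while (true) {
        if (::rmdir(path.c_str()) == 0) {
          return true;  // Removal succeeded.
        } else if (errno == EBUSY) {
          // Seemingly empty cgroup still busy (kernel bug): back off
          // and retry, doubling the delay up to an arbitrary cap.
          std::this_thread::sleep_for(delay);
          delay = (delay == std::chrono::milliseconds(0))
              ? std::chrono::milliseconds(1)
              : std::min(delay * 2, std::chrono::milliseconds(100));
        } else {
          return false;  // Unrecoverable error (e.g. ENOENT).
        }
      }
    }
    ```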
---
 src/linux/cgroups.cpp | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/src/linux/cgroups.cpp b/src/linux/cgroups.cpp
index 48be19e..43d8ac0 100644
--- a/src/linux/cgroups.cpp
+++ b/src/linux/cgroups.cpp
@@ -280,7 +280,6 @@ Future<Nothing> remove(const string& hierarchy, const string& cgroup)
   // with EBUSY even though the cgroup appears empty.
 
   Duration delay = Duration::zero();
-  int retry = 10;
 
   return loop(
       [=]() mutable {
@@ -291,8 +290,10 @@ Future<Nothing> remove(const string& hierarchy, const string& cgroup)
       [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
         if (::rmdir(path.c_str()) == 0) {
           return process::Break();
-        } else if ((errno == EBUSY) && (retry > 0)) {
-          --retry;
+        } else if (errno == EBUSY) {
+          LOG(WARNING) << "Removal of cgroup " << path
+                       << " failed with EBUSY, will try again";
+
           return process::Continue();
         } else {
           // If the `cgroup` still exists in the hierarchy, treat this as
@@ -1572,6 +1573,7 @@ protected:
 
   void finalize() override
   {
+    remover.discard();
     discard(killers);
     promise.discard();
   }
@@ -1580,8 +1582,8 @@ private:
   void killed(const Future<vector<Nothing>>& kill)
   {
     if (kill.isReady()) {
-      internal::remove(hierarchy, cgroups)
-        .onAny(defer(self(), &Destroyer::removed, lambda::_1));
+      remover = internal::remove(hierarchy, cgroups);
+      remover.onAny(defer(self(), &Destroyer::removed, lambda::_1));
     } else if (kill.isDiscarded()) {
       promise.discard();
       terminate(self());
@@ -1611,6 +1613,9 @@ private:
 
   // The killer processes used to atomically kill tasks in each cgroup.
   vector<Future<Nothing>> killers;
+
+  // Future used to destroy the cgroups once the tasks have been killed.
+  Future<Nothing> remover;
 };
 
 } // namespace internal {