cf-natali opened a new pull request #362:
URL: https://github.com/apache/mesos/pull/362
This is a follow-up to MESOS-10107, which introduced retries when
calling rmdir on a seemingly empty cgroup fails with EBUSY because of
various kernel bugs.
At the time, the fix introduced a bounded number of retries, using an
exponential backoff summing up to slightly over 1s. This mirrored what
Docker does, and it worked during testing.
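For reference, the pre-existing behaviour roughly amounts to the following -
a minimal sketch only, not the actual Mesos code; the helper name and backoff
constants are illustrative:
```cpp
// Sketch of a bounded retry of rmdir() on EBUSY with exponential backoff,
// giving up once the cumulative delay exceeds roughly 1s (illustrative only).
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

#include <unistd.h>

bool removeCgroupBounded(const std::string& path)
{
  std::chrono::milliseconds delay{1};
  std::chrono::milliseconds total{0};
  const std::chrono::milliseconds limit{1000};

  while (true) {
    if (::rmdir(path.c_str()) == 0) {
      return true;
    }

    // Only EBUSY is worth retrying; anything else is a real failure,
    // and once the retry budget is spent we give up as well.
    if (errno != EBUSY || total >= limit) {
      return false;
    }

    std::this_thread::sleep_for(delay);
    total += delay;
    delay *= 2;
  }
}
```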
However, after a month without seeing this error in our cluster at work,
we finally hit a case where the 1s timeout wasn't enough, possibly because
the machine was busy at the time, or some other random factor.
So instead of giving up after 1s, I think it makes more sense to keep
retrying until the top-level container destruction timeout - set at
1 minute - kicks in. This also avoids having a magic timeout in the
cgroup code.
We just need to ensure that when the destroyer is finalised, it discards
the future responsible for the periodic removal.
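To illustrate the intent - a minimal sketch, assuming an atomic flag stands
in for discarding the libprocess future, and with made-up names:
```cpp
// Sketch: retry rmdir() on EBUSY indefinitely, until the caller discards the
// operation (modelled here by an atomic flag). In the real code the outer
// 1-minute container destruction timeout bounds how long this can run.
#include <algorithm>
#include <atomic>
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

#include <unistd.h>

bool removeCgroupUntilDiscarded(
    const std::string& path,
    const std::atomic<bool>& discarded)
{
  std::chrono::milliseconds delay{1};

  while (!discarded.load()) {
    if (::rmdir(path.c_str()) == 0) {
      return true;
    }

    if (errno != EBUSY) {
      return false; // Non-retryable error.
    }

    std::this_thread::sleep_for(delay);
    delay = std::min(delay * 2, std::chrono::milliseconds{1000});
  }

  return false; // Discarded before the cgroup could be removed.
}
```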
Here are the logs of the problem we've seen in our cluster, for reference -
you can see that the cgroup destruction fails even after 1s:
```
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523308 4942
containerizer.cpp:3179] Container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 has
reached its limit for resource
[{"name":"mem","scalar":{"value":16320.0},"type":"SCALAR"}] and will be
terminated
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523384 4942
containerizer.cpp:2623] Destroying container
34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 in RUNNING state
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523396 4942
containerizer.cpp:3321] Transitioning the state of container
34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 from RUNNING to DESTROYING after
56.8612528682667mins
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523541 4942
linux_launcher.cpp:564] Asked to destroy container
34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523587 4942
linux_launcher.cpp:606] Destroying cgroup
'/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244'
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.523954 4942
cgroups.cpp:2887] Freezing cgroup
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524153 4954
cgroups.cpp:1275] Successfully froze cgroup
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 after 189184ns
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524430 4954
cgroups.cpp:2905] Thawing cgroup
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.524525 4954
cgroups.cpp:1304] Successfully thawed cgroup
/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 after 87808ns
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.533007 4928
slave.cpp:6616] Got exited event for executor(1)@172.16.20.99:36313
May 1 11:52:01 host033 mesos-slave[2256]: I0501 11:52:01.557977 4950
containerizer.cpp:3159] Container 34edf43b-fc1f-4eb4-b70b-de7c3c2a3244 has
exited
May 1 11:52:02 host033 mesos-slave[2256]: E0501 11:52:02.583150 4956
slave.cpp:6994] Termination of executor
'secure_executor:13954144-cbcd-6bd4-8e37-af2301ec510d' of framework
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 failed: Failed to kill all processes
in the container: Failed to remove cgroup
'mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244': Failed to remove cgroup
'/sys/fs/cgroup/freezer/mesos/34edf43b-fc1f-4eb4-b70b-de7c3c2a3244': Device or
resource busy
May 1 11:52:02 host033 mesos-slave[2256]: I0501 11:52:02.583232 4956
slave.cpp:5890] Handling status update TASK_FAILED (Status UUID:
cc2899ab-c534-4eeb-a1a4-f28102fc3ca4) for task
13954144-cbcd-6bd4-8e37-af2301ec510d of framework
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b-0000 from @0.0.0.0:0
```
@abudnik @qianzhangxa