> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
> /foo/bar tasks from being scheduled? i.e., might that be the root cause of
> why the kernel is ignoring these tasks?

Was curious why you said the above. CPU scheduling shares are a function of their parent’s CPU bandwidth.

-Jojy

> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]>
> wrote:
>
> I'm trying to figure out a situation where we see tasks in a mesos
> container no longer being scheduled by the Linux kernel. None of the tasks
> in the container are zombies, nor are they stuck in "Disk sleep" state.
> They are all in Running state. But if I try to strace the processes the
> strace cmd just hangs. I've also noticed that none of the RIPs (64-bit
> instruction pointers) are changing at all in these tasks, and they're not
> accumulating any cputime. So the kernel is just not scheduling them.
>
> Despite the behavior described above, these non-running tasks *are* listed
> in the run queues of /proc/sched_debug. Notably, I have observed that on
> hosts without this problem that there exist "cfs_rq[N]:/mesos" run queues,
> but on the hosts that have the broken scheduling, these run queues don't
> exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>" in
> /proc/sched_debug. That is mighty suspicious to me.
>
> I'm curious about:
>
>    - Has anyone seen similar behavior?
>    - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>    /foo/bar tasks from being scheduled? i.e., might that be the root cause of
>    why the kernel is ignoring these tasks?
>    - What creates the /mesos cfs run queue, and why would that cease to
>    exist without the subordinate cgroups being cleaned up?
>       - I'm assuming the creation of the "cpu" cgroup with the path
>       "/mesos" done by mesos-slave creates this run queue.
>       - But I'm not sure how/why it would be removed, since I still see a
>       mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos exists).
>
> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has patched
> fixes into newer kernel versions that we are running on other hosts (e.g.,
> 2.6.32-573.7.1.el6).
>
> Setup info:
>
>    Kernel version: 2.6.32-431.el6.x86_64
>    Mesos version: 0.22.1
>    Containerizer: Mesos
>    Isolators: Have seen this behavior with both of these configs:
>       cgroups/cpu,cgroups/mem
>       cgroups/cpu,cgroups/mem,namespaces/pid
>
> Thanks for any insight or help!
>
> - Erik
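
For anyone wanting to check other hosts for the symptom Erik describes (a "cfs_rq[N]:/mesos/<cgroup-UUID>" entry with no "cfs_rq[N]:/mesos" entry on the same CPU), here is a rough Python sketch. The function name orphaned_run_queues is made up for illustration, and the parsing assumes only the "cfs_rq[N]:/path" line format quoted in the message above; the rest of the /proc/sched_debug layout varies between kernel versions. A missing parent entry is a hint worth investigating, not proof of a kernel bug.

```python
import re
from collections import defaultdict

# Matches lines such as "cfs_rq[3]:/mesos/<cgroup-UUID>" (format as quoted
# in the thread; an assumption about this kernel's sched_debug output).
CFS_RQ = re.compile(r"^cfs_rq\[(\d+)\]:(/\S*)")


def orphaned_run_queues(sched_debug_text):
    """Return sorted (cpu, path) pairs whose parent cgroup path has no
    cfs_rq entry listed for the same CPU.

    Hypothetical helper, not part of Mesos or the kernel tooling.
    """
    paths_by_cpu = defaultdict(set)
    for line in sched_debug_text.splitlines():
        match = CFS_RQ.match(line.strip())
        if match:
            paths_by_cpu[int(match.group(1))].add(match.group(2))

    orphans = []
    for cpu, paths in sorted(paths_by_cpu.items()):
        for path in sorted(paths):
            parent = path.rsplit("/", 1)[0]
            # Top-level groups ("/mesos") and the root ("/") yield an empty
            # parent here; they have no intermediate parent to check.
            if parent and parent not in paths:
                orphans.append((cpu, path))
    return orphans
```

On an affected host, something like orphaned_run_queues(open("/proc/sched_debug").read()) would be expected to list the per-container queues under /mesos, while a healthy host should return an empty list.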
