Hi Erik,

Happy to work on this with you. Thanks for the details. As you might know, in cfs_rq[N]:/<name> (from /proc/sched_debug), <name> is the path of the cgroup in the CPU cgroup hierarchy. I am curious about the cgroup hierarchy and its contents when this happens. Could you send the "mesos" hierarchy (directory tree) and the contents of files like 'tasks', 'cpu.cfs_quota_us', 'cpu.cfs_period_us', 'cpu.shares', and 'cpu.stat'?
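If it is easier, something like the following could gather all of that in one go. It is only a rough sketch: it assumes Python 3 is available on the host and that the cpu hierarchy is mounted at /cgroup/cpu, as in the paths you mentioned. The name dump_cgroup_tree is made up for illustration.

```python
import os

# Files requested above. The default mount point used below is an assumption
# taken from the paths in this thread; adjust it for your host.
CGROUP_FILES = ["tasks", "cpu.cfs_quota_us", "cpu.cfs_period_us",
                "cpu.shares", "cpu.stat"]


def dump_cgroup_tree(root, names=CGROUP_FILES):
    """Return a text report of each directory under root and the named files in it."""
    lines = []
    for dirpath, dirnames, _ in os.walk(root):
        dirnames.sort()  # visit child cgroups in a stable order
        lines.append("== %s" % dirpath)
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    content = f.read().strip()
            except (IOError, OSError) as err:
                # Missing or unreadable files are reported, not fatal.
                content = "<unreadable: %s>" % err
            # Keep multi-line files (tasks, cpu.stat) on one report line.
            lines.append("%s: %s" % (name, content.replace("\n", " | ")))
    return "\n".join(lines)


if __name__ == "__main__":
    print(dump_cgroup_tree("/cgroup/cpu/mesos"))
```

Running it as root and redirecting the output to a file should be enough to attach to a reply.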
It does look strange that the parent cgroup is missing when the child is present. Also, are you able to see the same issue with the latest Mesos and/or kernel?

-Jojy

> On Jan 2, 2016, at 9:43 PM, Erik Weathers <[email protected]> wrote:
>
> hey Jojy,
>
> Thanks for your reply. Response inline.
>
> On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <[email protected]> wrote:
>
>>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>> /foo/bar tasks from being scheduled? i.e., might that be the root cause of
>>> why the kernel is ignoring these tasks?
>>
>> Was curious why you said the above. CPU scheduling shares are a function
>> of their parent’s CPU bandwidth.
>>
>
> This question arose from an earlier observation in my initial email:
>
> In my initial email I pointed out that the contents of /proc/sched_debug
> list all of the CFS run queues, but it seems like some of those run queues
> are missing on the affected hosts. i.e., usually they look like this (only
> including output for the 1st CPU's CFS run queues):
>
> % grep 'cfs_rq\[0\]' /proc/sched_debug
> cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
> cfs_rq[0]:/mesos
> cfs_rq[0]:/
>
> But on the problematic hosts, they look like this:
>
> % grep 'cfs_rq\[0\]' /proc/sched_debug
> cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
> cfs_rq[0]:/
>
> Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
>
> I'm not sure how that is possible, given my understanding that these
> cfs_rq's are created from the special cgroups filesystem having directories
> added to it, and since the /cgroup/cpu/mesos dir exists (as well as
> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
> the CFS run queues for "/mesos" could have been deleted. I've been trying
> to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
>
> Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not suspicious.
> i.e., it's not that the cgroup settings of the "parent" /mesos cgroup are
> preventing the tasks from being scheduled. It seems to be that the cgroup
> settings of the parent are simply gone from the kernel. Poof.
>
> At this point I'm assuming that the above observation is indeed the root
> cause of the problem, and I'm simply hoping that whatever logic deleted the
> "/mesos" run queue is fixed in either a newer kernel or newer mesos version.
>
> Thanks!
>
> - Erik
>
>>
>> -Jojy
>>
>>
>>> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]> wrote:
>>>
>>> I'm trying to figure out a situation where we see tasks in a mesos
>>> container no longer being scheduled by the Linux kernel. None of the tasks
>>> in the container are zombies, nor are they stuck in "Disk sleep" state.
>>> They are all in Running state. But if I try to strace the processes the
>>> strace cmd just hangs. I've also noticed that none of the RIPs (64-bit
>>> instruction pointers) are changing at all in these tasks, and they're not
>>> accumulating any cputime. So the kernel is just not scheduling them.
>>>
>>> Despite the behavior described above, these non-running tasks *are* listed
>>> in the run queues of /proc/sched_debug. Notably, I have observed that on
>>> hosts without this problem that there exist "cfs_rq[N]:/mesos" run queues,
>>> but on the hosts that have the broken scheduling, these run queues don't
>>> exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>" in
>>> /proc/sched_debug. That is mighty suspicious to me.
>>>
>>> I'm curious about:
>>>
>>> - Has anyone seen similar behavior?
>>> - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>> /foo/bar tasks from being scheduled? i.e., might that be the root cause of
>>> why the kernel is ignoring these tasks?
>>> - What creates the /mesos cfs run queue, and why would that cease to
>>> exist without the subordinate cgroups being cleaned up?
>>>    - I'm assuming the creation of the "cpu" cgroup with the path
>>>    "/mesos" done by mesos-slave creates this run queue.
>>>    - But I'm not sure how/why it would be removed, since I still see a
>>>    mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos exists).
>>>
>>> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has patched
>>> fixes into newer kernel versions that we are running on other hosts (e.g.,
>>> 2.6.32-573.7.1.el6).
>>>
>>> Setup info:
>>>
>>> Kernel version: 2.6.32-431.el6.x86_64
>>> Mesos version: 0.22.1
>>> Containerizer: Mesos
>>> Isolators: Have seen this behavior with both of these configs:
>>> cgroups/cpu,cgroups/mem
>>> cgroups/cpu,cgroups/mem,namespaces/pid
>>>
>>> Thanks for any insight or help!
>>>
>>> - Erik
>>
>>
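The symptom described in the quoted thread (a child "cfs_rq[N]:/mesos/<cgroup-UUID>" entry with no "cfs_rq[N]:/mesos" entry on the same CPU) could be checked across hosts with a sketch like the one below. It assumes that a healthy host lists a run queue for every ancestor cgroup, as in the healthy sample quoted above; that may not hold on other kernel versions. The function name missing_parents is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "cfs_rq[0]:/mesos/<cgroup-UUID>" from /proc/sched_debug.
CFS_RQ = re.compile(r"^cfs_rq\[(\d+)\]:(/\S*)")


def missing_parents(sched_debug_text):
    """Map CPU number -> sorted ancestor paths that have no cfs_rq entry on that CPU."""
    per_cpu = defaultdict(set)
    for line in sched_debug_text.splitlines():
        m = CFS_RQ.match(line.strip())
        if m:
            per_cpu[int(m.group(1))].add(m.group(2))

    missing = {}
    for cpu, paths in per_cpu.items():
        absent = set()
        for path in paths:
            # Walk up the hierarchy: /a/b/c -> /a/b -> /a (the root "/" is skipped).
            parent = path.rsplit("/", 1)[0]
            while parent:
                if parent not in paths:
                    absent.add(parent)
                parent = parent.rsplit("/", 1)[0]
        if absent:
            missing[cpu] = sorted(absent)
    return missing


if __name__ == "__main__":
    with open("/proc/sched_debug") as f:
        for cpu, paths in sorted(missing_parents(f.read()).items()):
            print("cpu %d: missing %s" % (cpu, ", ".join(paths)))
```

An empty result only means every listed run queue has its ancestors listed too; it does not by itself prove that scheduling is healthy.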
