Hi Jojy,

Unfortunately, I haven't been able to reproduce this issue on demand; it has only happened spontaneously a few times, so I can't say for sure whether it would happen on a newer Mesos/kernel version. I'm thinking of trying to force reproduction by creating and destroying a large number of cgroups, since the issue does *seem* to correlate with some badly behaved Storm topologies that are constantly crashing, causing their cgroups to be created and destroyed frequently.
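Roughly what I have in mind for forcing that churn is something like the sketch below. (Just a sketch: the /cgroup/cpu mount point matches our RHEL6 hosts, but the "churn-test" name, iteration count, and quota values are arbitrary placeholders I made up.)

#!/bin/bash
# Sketch: hammer the cpu cgroup hierarchy with create/destroy cycles to
# try to trigger the missing-runqueue state. Assumes cgroup v1 with the
# cpu subsystem mounted at /cgroup/cpu (as on our hosts); run as root.
CG=/cgroup/cpu/churn-test   # hypothetical scratch cgroup
mkdir -p "$CG"
for i in $(seq 1 100000); do
  child="$CG/child-$i"
  mkdir "$child"
  # Give the child quota/shares roughly like mesos-slave would.
  echo 100000 > "$child/cpu.cfs_period_us"
  echo 50000  > "$child/cpu.cfs_quota_us"
  # Briefly run a task inside the cgroup, like a crashing executor:
  # $BASHPID in the subshell is the subshell's own PID.
  ( echo $BASHPID > "$child/tasks"; true ) &
  wait $!
  rmdir "$child"
done
rmdir "$CG"

No idea yet whether plain churn is enough, or whether the tasks need to actually get throttled before destruction, so I may have to vary the load inside the loop.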
I have a couple of test hosts that are in this bad state right now, so I'm trying to get as much info out of them as I can. I'm thinking of trying SystemTap to introspect the kernel's run queue state and see what is happening.

Here is the info you requested:

*/cgroup/cpu files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do echo ----$f:---- ; cat /cgroup/cpu/$f; done | head -n20
----cpu.cfs_quota_us:----
0
----cpu.cfs_period_us:----
0
----cpu.shares:----
1024
----cpu.stat:----
nr_periods 0
nr_throttled 0
throttled_time 0
----tasks:----
1
...

*/cgroup/cpu/mesos files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do echo ----$f:---- ; cat /cgroup/cpu/mesos/$f; done
----cpu.cfs_quota_us:----
-1
----cpu.cfs_period_us:----
100000
----cpu.shares:----
1024
----cpu.stat:----
nr_periods 0
nr_throttled 0
throttled_time 0
----tasks:----

NOTE: there are no tasks, and cpu.cfs_quota_us is -1. But both of those are consistent with other hosts that aren't exhibiting this problem.

*/cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do echo ----$f:---- ; cat /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/$f; done
----cpu.cfs_quota_us:----
1800000
----cpu.cfs_period_us:----
100000
----cpu.shares:----
18432
----cpu.stat:----
nr_periods 680868
nr_throttled 254025
throttled_time 55400010353125
----tasks:----
6473
...

- Erik

On Sun, Jan 3, 2016 at 10:38 AM, Jojy Varghese <[email protected]> wrote:

> Hi Erik
>   Happy to work on this with you. Thanks for the details.
>
> As you might know, in cfs_rq:/<name> (from /proc/sched_debug), <name> is
> the CPU cgroup hierarchy name. I am curious about the contents and cgroups
> hierarchy when this happens. Could you send the “mesos” hierarchy
> (directory tree) and the contents of files like ‘tasks’,
> ‘cpu.cfs_quota_us’, ‘cpu.cfs_period_us’, ‘cpu.shares’, and ‘cpu.stat’?
>
> It does look strange that the parent cgroup is missing when the child is
> present.
>
> Also, wondering if you are able to see the same issue with the latest
> Mesos and/or kernel?
>
> -Jojy
>
> > On Jan 2, 2016, at 9:43 PM, Erik Weathers <[email protected]> wrote:
> >
> > Hey Jojy, thanks for your reply. Response inline.
> >
> > On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <[email protected]> wrote:
> >
> >>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
> >>> /foo/bar tasks from being scheduled? i.e., might that be the root cause
> >>> of why the kernel is ignoring these tasks?
> >>
> >> Was curious why you said the above. CPU scheduling shares are a function
> >> of their parent's CPU bandwidth.
> >
> > This question arose from an earlier observation in my initial email: I
> > pointed out that /proc/sched_debug lists all of the CFS run queues, but
> > it seems like some of those run queues are missing on the affected hosts.
> > i.e., usually they look like this (only including output for the 1st
> > CPU's CFS run queues):
> >
> > % grep 'cfs_rq\[0\]' /proc/sched_debug
> > cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
> > cfs_rq[0]:/mesos
> > cfs_rq[0]:/
> >
> > But on the problematic hosts, they look like this:
> >
> > % grep 'cfs_rq\[0\]' /proc/sched_debug
> > cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
> > cfs_rq[0]:/
> >
> > Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
> >
> > I'm not sure how that is possible: my understanding is that these
> > cfs_rq's are created when directories are added to the special cgroups
> > filesystem, and since the /cgroup/cpu/mesos dir exists (as well as
> > /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
> > the CFS run queues for "/mesos" could have been deleted. I've been trying
> > to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
> >
> > Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
> > /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not
> > suspicious. i.e., it's not that the cgroup settings of the "parent"
> > /mesos cgroup are preventing the tasks from being scheduled. It seems
> > that the cgroup settings of the parent are simply gone from the kernel.
> > Poof.
> >
> > At this point I'm assuming that the above observation is indeed the root
> > cause of the problem, and I'm simply hoping that whatever logic deleted
> > the "/mesos" run queue is fixed in either a newer kernel or a newer Mesos
> > version.
> >
> > Thanks!
> >
> > - Erik
> >
> >> -Jojy
> >>
> >>> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]> wrote:
> >>>
> >>> I'm trying to figure out a situation where we see tasks in a Mesos
> >>> container no longer being scheduled by the Linux kernel. None of the
> >>> tasks in the container are zombies, nor are they stuck in "Disk sleep"
> >>> state. They are all in Running state. But if I try to strace the
> >>> processes, the strace cmd just hangs. I've also noticed that none of
> >>> the RIPs (64-bit instruction pointers) are changing at all in these
> >>> tasks, and they're not accumulating any cputime. So the kernel is just
> >>> not scheduling them.
> >>>
> >>> Despite the behavior described above, these non-running tasks *are*
> >>> listed in the run queues of /proc/sched_debug. Notably, I have observed
> >>> that on hosts without this problem there exist "cfs_rq[N]:/mesos" run
> >>> queues, but on the hosts that have the broken scheduling, these run
> >>> queues don't exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>"
> >>> in /proc/sched_debug. That is mighty suspicious to me.
> >>>
> >>> I'm curious about:
> >>>
> >>> - Has anyone seen similar behavior?
> >>> - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
> >>>   /foo/bar tasks from being scheduled? i.e., might that be the root
> >>>   cause of why the kernel is ignoring these tasks?
> >>> - What creates the /mesos cfs run queue, and why would that cease to
> >>>   exist without the subordinate cgroups being cleaned up?
> >>>   - I'm assuming the creation of the "cpu" cgroup with the path "/mesos"
> >>>     done by mesos-slave creates this run queue.
> >>>   - But I'm not sure how/why it would be removed, since I still see a
> >>>     mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos
> >>>     exists).
> >>>
> >>> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has
> >>> patched fixes into the newer kernel versions that we are running on
> >>> other hosts (e.g., 2.6.32-573.7.1.el6).
> >>>
> >>> Setup info:
> >>>
> >>> Kernel version: 2.6.32-431.el6.x86_64
> >>> Mesos version: 0.22.1
> >>> Containerizer: Mesos
> >>> Isolators: Have seen this behavior with both of these configs:
> >>>   cgroups/cpu,cgroups/mem
> >>>   cgroups/cpu,cgroups/mem,namespaces/pid
> >>>
> >>> Thanks for any insight or help!
> >>>
> >>> - Erik
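PS: for anyone else who wants to check their hosts for this state, here's a rough detection script based on the comparison above. (Also just a sketch: it assumes cgroup v1 with the cpu subsystem mounted at /cgroup/cpu, and the /proc/sched_debug format shown in the grep output; both are specific to our RHEL6-era setup.)

#!/bin/bash
# Sketch: flag cpu cgroups that exist in cgroupfs but have no cfs_rq
# entry in /proc/sched_debug -- the symptom above, where "/mesos"
# vanished from the run queues while /cgroup/cpu/mesos still exists.
CPUROOT=/cgroup/cpu
# Run queue names for CPU 0, e.g. "/", "/mesos", "/mesos/<uuid>".
rqs=$(grep -o 'cfs_rq\[0\]:[^ ]*' /proc/sched_debug | cut -d: -f2)
find "$CPUROOT" -mindepth 1 -type d | while read -r dir; do
  cg="${dir#$CPUROOT}"   # e.g. /mesos/5cf9a444-...
  if ! echo "$rqs" | grep -qxF "$cg"; then
    echo "no cfs_rq for cgroup: $cg"
  fi
done

One caveat: as far as I can tell, a cgroup with no runnable tasks may legitimately be absent from /proc/sched_debug on these kernels, so a hit from this script is a hint to dig further (e.g., check whether the cgroup's tasks are accumulating cputime), not proof of the bug.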
