Hi Erik,

Happy to work on this with you. Thanks for the details.

As you might know, in cfs_rq[N]:/<name> (from /proc/sched_debug), <name> is the
path of the CPU cgroup in the hierarchy. I am curious about the cgroup
hierarchy and its contents when this happens. Could you send the "mesos"
hierarchy (directory tree) and the contents of files like 'tasks',
'cpu.cfs_quota_us', 'cpu.cfs_period_us', 'cpu.shares', and 'cpu.stat'?
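
If it saves you some typing, here is a minimal sketch (Python 3 assumed) that
walks the hierarchy and collects those files for every cgroup under it. The
root path /cgroup/cpu/mesos is taken from your mail; the function names
dump_cgroup_tree and format_report are just illustrative, so adjust as needed.

```python
import os

# The files asked for above (cgroup "cpu" controller).
FILES = ("tasks", "cpu.cfs_quota_us", "cpu.cfs_period_us", "cpu.shares", "cpu.stat")


def dump_cgroup_tree(root, files=FILES):
    """Return {cgroup_dir: {file_name: contents}} for root and every cgroup below it."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        entry = {}
        for name in files:
            if name in filenames:
                try:
                    with open(os.path.join(dirpath, name)) as handle:
                        entry[name] = handle.read().strip()
                except OSError as err:
                    # Keep going if a single file cannot be read.
                    entry[name] = "<unreadable: %s>" % err
        report[dirpath] = entry
    return report


def format_report(report):
    """Render the report as indented plain text, one cgroup directory per block."""
    lines = []
    for path in sorted(report):
        lines.append(path)
        for name, contents in sorted(report[path].items()):
            lines.append("  %s: %s" % (name, contents.replace("\n", " ")))
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed mount point, taken from the paths in your mail.
    print(format_report(dump_cgroup_tree("/cgroup/cpu/mesos")))
```

Running it on one healthy host and one problematic host and diffing the two
outputs would probably be the most useful comparison.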

It does look strange that the parent cgroup's run queue is missing when the
child's is present.
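
To check other hosts for the same pattern, something like the following
sketch might help (Python 3 assumed). It scans the text of /proc/sched_debug
for "cfs_rq[N]:/<path>" headers and reports every ancestor path that has no
entry on a CPU where a descendant does. The name missing_ancestors is made up
for illustration, and I am assuming a missing entry is only a hint, since I am
not certain every kernel lists a run queue for every group.

```python
import re

# Matches header lines such as "cfs_rq[0]:/mesos/<cgroup-UUID>".
CFS_RQ = re.compile(r"^cfs_rq\[(\d+)\]:(/\S*)", re.MULTILINE)


def missing_ancestors(sched_debug_text):
    """Return sorted (cpu, ancestor_path) pairs that lack a cfs_rq entry
    on a CPU where one of their descendants has one."""
    seen = {}
    for cpu, path in CFS_RQ.findall(sched_debug_text):
        seen.setdefault(int(cpu), set()).add(path)

    missing = set()
    for cpu, paths in seen.items():
        for path in paths:
            parent = path.rsplit("/", 1)[0]
            while parent:  # stops before the root group "/"
                if parent not in paths:
                    missing.add((cpu, parent))
                parent = parent.rsplit("/", 1)[0]
    return sorted(missing)
```

Usage would be along the lines of
missing_ancestors(open("/proc/sched_debug").read()); an empty list means every
listed child run queue also has its parents listed.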

Also, are you able to reproduce the same issue with the latest Mesos and/or
kernel?

-Jojy
  

> On Jan 2, 2016, at 9:43 PM, Erik Weathers <[email protected]> wrote:
> 
> hey Jojy,  Thanks for your reply.  Response inline.
> 
> On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <[email protected]> wrote:
> 
>>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>  /foo/bar tasks from being scheduled?  i.e., might that be the root cause of
>>>  why the kernel is ignoring these tasks?
>> 
>> Was curious why you said the above. CPU scheduling shares are a function
>> of their parent’s CPU bandwidth.
>> 
> 
> This question arose from an observation in my initial email.  The contents
> of /proc/sched_debug list all of the CFS run queues, but some of those run
> queues seem to be missing on the affected hosts.  Usually they look like
> this (only including output for the 1st CPU's CFS run queues):
> 
> % grep 'cfs_rq\[0\]' /proc/sched_debug
> cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
> cfs_rq[0]:/mesos
> cfs_rq[0]:/
> 
> But on the problematic hosts, they look like this:
> 
> % grep 'cfs_rq\[0\]' /proc/sched_debug
> cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
> cfs_rq[0]:/
> 
> Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
> 
> I'm not sure how that is possible, given my understanding that these
> cfs_rq's are created from the special cgroups filesystem having directories
> added to it, and since the /cgroup/cpu/mesos dir exists (as well as
> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
> the CFS run queues for "/mesos" could have been deleted.   I've been trying
> to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
> 
> Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not suspicious.
> i.e., it's not that the cgroup settings of the "parent" /mesos cgroup are
> preventing the tasks from being scheduled.  It seems to be that the cgroup
> settings of the parent are simply gone from the kernel.  Poof.
> 
> At this point I'm assuming that the above observation is indeed the root
> cause of the problem, and I'm simply hoping that whatever logic deleted the
> "/mesos" run queue is fixed in either a newer kernel or newer mesos version.
> 
> Thanks!
> 
> - Erik
> 
> 
> 
>> 
>> -Jojy
>> 
>> 
>>> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]> wrote:
>>> 
>>> I'm trying to figure out a situation where we see tasks in a mesos
>>> container no longer being scheduled by the Linux kernel.  None of the tasks
>>> in the container are zombies, nor are they stuck in "Disk sleep" state.
>>> They are all in Running state.  But if I try to strace the processes the
>>> strace cmd just hangs.  I've also noticed that none of the RIPs (64-bit
>>> instruction pointers) are changing at all in these tasks, and they're not
>>> accumulating any cputime.   So the kernel is just not scheduling them.
>>> 
>>> Despite the behavior described above, these non-running tasks *are* listed
>>> in the run queues of /proc/sched_debug.  Notably, I have observed that on
>>> hosts without this problem there exist "cfs_rq[N]:/mesos" run queues,
>>> but on the hosts that have the broken scheduling, these run queues don't
>>> exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>" in
>>> /proc/sched_debug.  That is mighty suspicious to me.
>>> 
>>> I'm curious about:
>>> 
>>>  - Has anyone seen similar behavior?
>>>  - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>  /foo/bar tasks from being scheduled?  i.e., might that be the root cause of
>>>  why the kernel is ignoring these tasks?
>>>  - What creates the /mesos cfs run queue, and why would that cease to
>>>  exist without the subordinate cgroups being cleaned up?
>>>     - I'm assuming the creation of the "cpu" cgroup with the path
>>>     "/mesos" done by mesos-slave creates this run queue.
>>>     - But I'm not sure how/why it would be removed, since I still see a
>>>     mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos exists).
>>> 
>>> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has patched
>>> fixes into newer kernel versions that we are running on other hosts (e.g.,
>>> 2.6.32-573.7.1.el6).
>>> 
>>> Setup info:
>>> 
>>> Kernel version:  2.6.32-431.el6.x86_64
>>> Mesos version:  0.22.1
>>> Containerizer: Mesos
>>> Isolators: Have seen this behavior with both of these configs:
>>>  cgroups/cpu,cgroups/mem
>>>  cgroups/cpu,cgroups/mem,namespaces/pid
>>> 
>>> Thanks for any insight or help!
>>> 
>>> - Erik
>> 
>> 
