Hi Eric,
Thanks again for the information. The value of 0 in the
/cgroup/cpu/cpu.cfs_* files looks suspicious.
Since you are motivated to look into the kernel, here is what I think is
interesting:
- The sched_debug print
(http://lxr.free-electrons.com/source/kernel/sched_debug.c?v=2.6.34#L163)
finds the cgroup_subsys_state
(http://lxr.free-electrons.com/source/include/linux/cgroup.h?v=2.6.34#L60)
to be NULL for the task group; see the excerpt below this list.
- So what is it that makes css NULL for a task_group? Most likely it is
because css_put
(http://lxr.free-electrons.com/source/kernel/cgroup.c?v=2.6.34#L4312) was
called. It would be interesting to use kernel tracing/debugging/SystemTap to
see if this happens; a sketch follows below. My bet is that an "rmdir" on the
cgroup is what caused this.
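
For reference, here is roughly the code path behind that sched_debug print,
reproduced from memory of the 2.6.34 source linked above (please double-check
it against the lxr link):

  /* kernel/sched_debug.c: resolve a task group to its cgroup path */
  static void task_group_path(struct task_group *tg, char *buf, int buflen)
  {
          /* may be NULL if the underlying cgroup isn't fully-created yet */
          if (!tg->css.cgroup) {
                  buf[0] = '\0';
                  return;
          }
          cgroup_path(tg->css.cgroup, buf, buflen);
  }

So if tg->css.cgroup is NULL, sched_debug has no way to print the group's
path.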
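
And for the rmdir theory, a minimal SystemTap sketch (untested; it assumes the
kernel debuginfo package is installed and that cgroup_rmdir is the function
name in your 2.6.32 build -- verify against your sources):

  % stap -e 'probe kernel.function("cgroup_rmdir") {
      printf("%s (pid %d) rmdir on cgroup \"%s\"\n",
             execname(), pid(), kernel_string($dentry->d_name->name))
    }'

If the /mesos run queue vanishes right after this fires on the "mesos"
directory, that would confirm the bet.
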
-Jojy
> On Jan 4, 2016, at 7:54 PM, Erik Weathers <[email protected]>
> wrote:
>
> hi Jojy,
>
> Unfortunately, I haven't been able to reproduce this issue on demand, it
> has just happened spontaneously a few times. So I cannot say for sure if
> it would happen on a newer mesos/kernel version. I'm thinking of trying to
> force reproduction by creating and destroying a ton of cgroups, since the
> issue does *seem* to correlate with some badly behaved Storm
> topologies that are constantly crashing and causing the cgroups to be
> created and destroyed often.
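>
> Something like this (a hypothetical sketch -- the mount point and "mesos"
> hierarchy match my setup described below) is the kind of churn I have in
> mind:
>
> % for i in $(seq 1 100000); do mkdir /cgroup/cpu/mesos/repro-$i; \
>     echo $$ > /cgroup/cpu/mesos/repro-$i/tasks; \
>     echo $$ > /cgroup/cpu/mesos/tasks; \
>     rmdir /cgroup/cpu/mesos/repro-$i; done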
>
> I have a couple test hosts that are in this bad state right now, so I'm
> trying to get as much info out of them as I can. I'm thinking of trying
> SystemTap to introspect the kernel's run queue state and see what is
> happening.
>
> Here is the info you requested:
>
> */cgroup/cpu files:*
>
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/$f; done | head -n20
> ----cpu.cfs_quota_us:----
> 0
> ----cpu.cfs_period_us:----
> 0
> ----cpu.shares:----
> 1024
> ----cpu.stat:----
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> ----tasks:----
> 1
> ...
>
> */cgroup/cpu/mesos files:*
>
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/mesos/$f; done
> ----cpu.cfs_quota_us:----
> -1
> ----cpu.cfs_period_us:----
> 100000
> ----cpu.shares:----
> 1024
> ----cpu.stat:----
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> ----tasks:----
>
> NOTE: there are no tasks, and cpu.cfs_quota_us is -1. But both of those are
> consistent with other hosts that aren't exhibiting this problem.
>
> */cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e files:*
>
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat
> /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/$f; done
> ----cpu.cfs_quota_us:----
> 1800000
> ----cpu.cfs_period_us:----
> 100000
> ----cpu.shares:----
> 18432
> ----cpu.stat:----
> nr_periods 680868
> nr_throttled 254025
> throttled_time 55400010353125
> ----tasks:----
> 6473
> ...
>
> - Erik
>
> On Sun, Jan 3, 2016 at 10:38 AM, Jojy Varghese <[email protected]> wrote:
>
>> Hi Erik
>> Happy to work on this with you. Thanks for the details.
>>
>> As you might know, in cfs_rq:/<name> (from /proc/sched_debug), <name> is
>> the CPU cgroup hierarchy name. I am curious about the contents and cgroup
>> hierarchy when this happens. Could you send the "mesos" hierarchy
>> (directory tree) and the contents of files like 'tasks',
>> 'cpu.cfs_quota_us', 'cpu.cfs_period_us', 'cpu.shares', and 'cpu.stat'?
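>>
>> For example, something along these lines (an untested sketch; adjust the
>> /cgroup/cpu mount point to match your hosts) would capture all of it:
>>
>> % find /cgroup/cpu/mesos -type d | while read d; do echo "== $d =="; \
>>     for f in tasks cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat; \
>>     do echo "-- $f:"; cat "$d/$f"; done; done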
>>
>> It does look strange that the parent cgroup is missing while a child is
>> present.
>>
>> Also, I am wondering whether you are able to see the same issue with the
>> latest Mesos and/or kernel?
>>
>> -Jojy
>>
>>
>>> On Jan 2, 2016, at 9:43 PM, Erik Weathers <[email protected]>
>>> wrote:
>>>
>>> hey Jojy, Thanks for your reply. Response inline.
>>>
>>> On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <[email protected]>
>>> wrote:
>>>
>>>>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>>> /foo/bar tasks from being scheduled? i.e., might that be the root
>>>>> cause of why the kernel is ignoring these tasks?
>>>>
>>>> I was curious why you said the above. CPU scheduling shares are a
>>>> function of their parent’s CPU bandwidth.
>>>>
>>>
>>> This question arose from an observation in my initial email: the contents
>>> of /proc/sched_debug list all of the CFS run queues, but it seems like
>>> some of those run queues are missing on the affected hosts. i.e., usually
>>> they look like this (only including output for the 1st CPU's CFS run
>>> queues):
>>>
>>> % grep 'cfs_rq\[0\]' /proc/sched_debug
>>> cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
>>> cfs_rq[0]:/mesos
>>> cfs_rq[0]:/
>>>
>>> But on the problematic hosts, they look like this:
>>>
>>> % grep 'cfs_rq\[0\]' /proc/sched_debug
>>> cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
>>> cfs_rq[0]:/
>>>
>>> Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
>>>
>>> I'm not sure how that is possible, given my understanding that these
>>> cfs_rq's are created when directories are added to the special cgroups
>>> filesystem. Since the /cgroup/cpu/mesos dir exists (as does
>>> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
>>> the CFS run queues for "/mesos" could have been deleted. I've been trying
>>> to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
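>>>
>>> For what it's worth, the mkdir-to-run-queue relationship is easy to see
>>> on a healthy box (an untested sketch with a throwaway cgroup name; if I
>>> read the code right, the group shows up once it has run a task):
>>>
>>> % mkdir /cgroup/cpu/demo
>>> % echo $$ > /cgroup/cpu/demo/tasks   # give the new group a running task
>>> % grep ':/demo' /proc/sched_debug    # cfs_rq[N]:/demo lines should appear
>>> % echo $$ > /cgroup/cpu/tasks        # move the shell back out
>>> % rmdir /cgroup/cpu/demo
>>>
>>> So something appears to have torn down the /mesos run queue state without
>>> removing its directory.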
>>>
>>> Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
>>> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not suspicious.
>>> i.e., it's not that the cgroup settings of the "parent" /mesos cgroup are
>>> preventing the tasks from being scheduled. It seems to be that the cgroup
>>> settings of the parent are simply gone from the kernel. Poof.
>>>
>>> At this point I'm assuming that the above observation is indeed the root
>>> cause of the problem, and I'm simply hoping that whatever logic deleted
>>> the "/mesos" run queue is fixed in either a newer kernel or newer mesos
>>> version.
>>>
>>> Thanks!
>>>
>>> - Erik
>>>
>>>>
>>>> -Jojy
>>>>
>>>>
>>>>> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]>
>>>>> wrote:
>>>>>
>>>>> I'm trying to figure out a situation where we see tasks in a mesos
>>>>> container no longer being scheduled by the Linux kernel. None of the
>>>>> tasks in the container are zombies, nor are they stuck in "Disk sleep"
>>>>> state. They are all in Running state. But if I try to strace the
>>>>> processes, the strace cmd just hangs. I've also noticed that none of
>>>>> the RIPs (64-bit instruction pointers) are changing at all in these
>>>>> tasks, and they're not accumulating any cputime. So the kernel is just
>>>>> not scheduling them.
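>>>>>
>>>>> (One way to double-check the cputime claim -- a sketch, where <pid> is
>>>>> one of the stuck tasks; fields 14 and 15 of /proc/<pid>/stat are utime
>>>>> and stime in clock ticks:
>>>>>
>>>>> % awk '{print $14, $15}' /proc/<pid>/stat; sleep 10; \
>>>>>   awk '{print $14, $15}' /proc/<pid>/stat
>>>>>
>>>>> The two samples come back identical for these tasks.)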
>>>>>
>>>>> Despite the behavior described above, these non-running tasks *are*
>>>>> listed in the run queues of /proc/sched_debug. Notably, I have observed
>>>>> that on hosts without this problem there exist "cfs_rq[N]:/mesos" run
>>>>> queues, but on the hosts that have the broken scheduling, these run
>>>>> queues don't exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>"
>>>>> in /proc/sched_debug. That is mighty suspicious to me.
>>>>>
>>>>> I'm curious about:
>>>>>
>>>>> - Has anyone seen similar behavior?
>>>>> - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>>>   /foo/bar tasks from being scheduled? i.e., might that be the root
>>>>>   cause of why the kernel is ignoring these tasks?
>>>>> - What creates the /mesos cfs run queue, and why would that cease to
>>>>>   exist without the subordinate cgroups being cleaned up?
>>>>>   - I'm assuming the creation of the "cpu" cgroup with the path
>>>>>     "/mesos" done by mesos-slave creates this run queue.
>>>>>   - But I'm not sure how/why it would be removed, since I still see a
>>>>>     mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos
>>>>>     exists).
>>>>>
>>>>> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has
>>>>> patched fixes into newer kernel versions that we are running on other
>>>>> hosts (e.g., 2.6.32-573.7.1.el6).
>>>>>
>>>>> Setup info:
>>>>>
>>>>> Kernel version: 2.6.32-431.el6.x86_64
>>>>> Mesos version: 0.22.1
>>>>> Containerizer: Mesos
>>>>> Isolators: Have seen this behavior with both of these configs:
>>>>>   cgroups/cpu,cgroups/mem
>>>>>   cgroups/cpu,cgroups/mem,namespaces/pid
>>>>>
>>>>> Thanks for any insight or help!
>>>>>
>>>>> - Erik
>>>>
>>>>
>>
>>