Hi Eric,
Thanks again for the information. The value of 0 in the
/cgroup/cpu/cpu.cfs_* files looks suspicious.
Since you are motivated to look into the kernel, here is what I think is
interesting:
- The sched_debug print
(http://lxr.free-electrons.com/source/kernel/sched_debug.c?v=2.6.34#L163)
finds the cgroup_subsys_state
(http://lxr.free-electrons.com/source/include/linux/cgroup.h?v=2.6.34#L60)
to be NULL for the task group; see the excerpt below this list.
- So what is it that makes css NULL for a task_group? Most likely it is
because css_put
(http://lxr.free-electrons.com/source/kernel/cgroup.c?v=2.6.34#L4312) was
called. It would be interesting to use kernel tracing/debugging/SystemTap to
see if this happens; a sketch follows below. My bet is that an "rmdir" on the
cgroup is what caused this.
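
For reference, here is roughly the code path behind that sched_debug print,
reproduced from memory of the 2.6.34 source linked above (please double-check
it against the lxr link):

  /* kernel/sched_debug.c: resolve a task group to its cgroup path */
  static void task_group_path(struct task_group *tg, char *buf, int buflen)
  {
          /* may be NULL if the underlying cgroup isn't fully-created yet */
          if (!tg->css.cgroup) {
                  buf[0] = '\0';
                  return;
          }
          cgroup_path(tg->css.cgroup, buf, buflen);
  }

So if tg->css.cgroup is NULL, sched_debug has no way to print the group's
path.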
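
And for the rmdir theory, a minimal SystemTap sketch (untested; it assumes the
kernel debuginfo package is installed and that cgroup_rmdir is the function
name in your 2.6.32 build -- verify against your sources):

  % stap -e 'probe kernel.function("cgroup_rmdir") {
      printf("%s (pid %d) rmdir on cgroup \"%s\"\n",
             execname(), pid(), kernel_string($dentry->d_name->name))
    }'

If the /mesos run queue vanishes right after this fires on the "mesos"
directory, that would confirm the bet.
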
-Jojy
> On Jan 4, 2016, at 7:54 PM, Erik Weathers <[email protected]>
> wrote:
>
> hi Jojy,
>
> Unfortunately, I haven't been able to reproduce this issue on demand, it
> has just happened spontaneously a few times. So I cannot say for sure if
> it would happen on a newer mesos/kernel version. I'm thinking of trying to
> force reproduction by creating and destroying a ton of cgroups, since the
> issue does *seem* to correlate with some badly behaved Storm
> topologies that are constantly crashing and causing the cgroups to be
> created and destroyed often.
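>
> Something like this (a hypothetical sketch -- the mount point and "mesos"
> hierarchy match my setup described below) is the kind of churn I have in
> mind:
>
> % for i in $(seq 1 100000); do mkdir /cgroup/cpu/mesos/repro-$i; \
>     echo $$ > /cgroup/cpu/mesos/repro-$i/tasks; \
>     echo $$ > /cgroup/cpu/mesos/tasks; \
>     rmdir /cgroup/cpu/mesos/repro-$i; done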
>
> I have a couple test hosts that are in this bad state right now, so I'm
> trying to get as much info out of them as I can. I'm thinking of trying
> SystemTap to introspect the kernel's run queue state and see what is
> happening.
>
> Here is the info you requested:
>
> */cgroup/cpu files:*
>
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/$f; done | head -n20
> ----cpu.cfs_quota_us:----
> 0
> ----cpu.cfs_period_us:----
> 0
> ----cpu.shares:----
> 1024
> ----cpu.stat:----
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> ----tasks:----
> 1
> ...
>
> */cgroup/cpu/mesos files:*
>
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat /cgroup/cpu/mesos/$f; done
> ----cpu.cfs_quota_us:----
> -1
> ----cpu.cfs_period_us:----
> 100000
> ----cpu.shares:----
> 1024
> ----cpu.stat:----
> nr_periods 0
> nr_throttled 0
> throttled_time 0
> ----tasks:----
>
> NOTE: there are no tasks, and cpu.cfs_quota_us is -1. But both of those are
> consistent with other hosts that aren't exhibiting this problem.
>
> */cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e files:*
>
> % for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do
> echo ----$f:---- ; cat
> /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/$f; done
> ----cpu.cfs_quota_us:----
> 1800000
> ----cpu.cfs_period_us:----
> 100000
> ----cpu.shares:----
> 18432
> ----cpu.stat:----
> nr_periods 680868
> nr_throttled 254025
> throttled_time 55400010353125
> ----tasks:----
> 6473
> ...
>
> - Erik
>
> On Sun, Jan 3, 2016 at 10:38 AM, Jojy Varghese <[email protected]> wrote:
>
>> Hi Erik
>> Happy to work on this with you. Thanks for the details.
>>
>> As you might know, in cfs_rq:/<name> (from /proc/sched_debug), <name> is
>> the CPU cgroup hierarchy name. I am curious about the contents and cgroup
>> hierarchy when this happens. Could you send the "mesos" hierarchy
>> (directory tree) and the contents of files like 'tasks',
>> 'cpu.cfs_quota_us', 'cpu.cfs_period_us', 'cpu.shares', and 'cpu.stat'?
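>>
>> For example, something along these lines (an untested sketch; adjust the
>> /cgroup/cpu mount point to match your hosts) would capture all of it:
>>
>> % find /cgroup/cpu/mesos -type d | while read d; do echo "== $d =="; \
>>     for f in tasks cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat; \
>>     do echo "-- $f:"; cat "$d/$f"; done; done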
>>
>> It does look strange that the parent cgroup is missing while a child is
>> present.
>>
>> Also, I am wondering whether you are able to see the same issue with the
>> latest Mesos and/or kernel?
>>
>> -Jojy
>>
>>
>>> On Jan 2, 2016, at 9:43 PM, Erik Weathers <[email protected]>
>>> wrote:
>>>
>>> hey Jojy, Thanks for your reply. Response inline.
>>>
>>> On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <[email protected]>
>>> wrote:
>>>
>>>>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>>> /foo/bar tasks from being scheduled? i.e., might that be the root
>>>>> cause of why the kernel is ignoring these tasks?
>>>>
>>>> I was curious why you said the above. CPU scheduling shares are a
>>>> function of their parent’s CPU bandwidth.
>>>>
>>>
>>> This question arose from an observation in my initial email: the contents
>>> of /proc/sched_debug list all of the CFS run queues, but it seems like
>>> some of those run queues are missing on the affected hosts. i.e., usually
>>> they look like this (only including output for the 1st CPU's CFS run
>>> queues):
>>>
>>> % grep 'cfs_rq\[0\]' /proc/sched_debug
>>> cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
>>> cfs_rq[0]:/mesos
>>> cfs_rq[0]:/
>>>
>>> But on the problematic hosts, they look like this:
>>>
>>> % grep 'cfs_rq\[0\]' /proc/sched_debug
>>> cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
>>> cfs_rq[0]:/
>>>
>>> Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
>>>
>>> I'm not sure how that is possible, given my understanding that these
>>> cfs_rq's are created when directories are added to the special cgroups
>>> filesystem. Since the /cgroup/cpu/mesos dir exists (as does
>>> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
>>> the CFS run queues for "/mesos" could have been deleted. I've been trying
>>> to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
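>>>
>>> For what it's worth, the mkdir-to-run-queue relationship is easy to see
>>> on a healthy box (an untested sketch with a throwaway cgroup name; if I
>>> read the code right, the group shows up once it has run a task):
>>>
>>> % mkdir /cgroup/cpu/demo
>>> % echo $$ > /cgroup/cpu/demo/tasks   # give the new group a running task
>>> % grep ':/demo' /proc/sched_debug    # cfs_rq[N]:/demo lines should appear
>>> % echo $$ > /cgroup/cpu/tasks        # move the shell back out
>>> % rmdir /cgroup/cpu/demo
>>>
>>> So something appears to have torn down the /mesos run queue state without
>>> removing its directory.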
>>>
>>> Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
>>> /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not suspicious.
>>> i.e., it's not that the cgroup settings of the "parent" /mesos cgroup are
>>> preventing the tasks from being scheduled. It seems to be that the cgroup
>>> settings of the parent are simply gone from the kernel. Poof.
>>>
>>> At this point I'm assuming that the above observation is indeed the root
>>> cause of the problem, and I'm simply hoping that whatever logic deleted
>>> the "/mesos" run queue is fixed in either a newer kernel or newer mesos
>>> version.
>>>
>>> Thanks!
>>>
>>> - Erik
>>>
>>>>
>>>> -Jojy
>>>>
>>>>
>>>>> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]>
>>>>> wrote:
>>>>>
>>>>> I'm trying to figure out a situation where we see tasks in a mesos
>>>>> container no longer being scheduled by the Linux kernel. None of the
>>>>> tasks in the container are zombies, nor are they stuck in "Disk sleep"
>>>>> state. They are all in Running state. But if I try to strace the
>>>>> processes, the strace cmd just hangs. I've also noticed that none of
>>>>> the RIPs (64-bit instruction pointers) are changing at all in these
>>>>> tasks, and they're not accumulating any cputime. So the kernel is just
>>>>> not scheduling them.
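>>>>>
>>>>> (One way to double-check the cputime claim -- a sketch, where <pid> is
>>>>> one of the stuck tasks; fields 14 and 15 of /proc/<pid>/stat are utime
>>>>> and stime in clock ticks:
>>>>>
>>>>> % awk '{print $14, $15}' /proc/<pid>/stat; sleep 10; \
>>>>>   awk '{print $14, $15}' /proc/<pid>/stat
>>>>>
>>>>> The two samples come back identical for these tasks.)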
>>>>>
>>>>> Despite the behavior described above, these non-running tasks *are*
>>>>> listed in the run queues of /proc/sched_debug. Notably, I have observed
>>>>> that on hosts without this problem there exist "cfs_rq[N]:/mesos" run
>>>>> queues, but on the hosts that have the broken scheduling, these run
>>>>> queues don't exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>"
>>>>> in /proc/sched_debug. That is mighty suspicious to me.
>>>>>
>>>>> I'm curious about:
>>>>>
>>>>> - Has anyone seen similar behavior?
>>>>> - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>>>>>   /foo/bar tasks from being scheduled? i.e., might that be the root
>>>>>   cause of why the kernel is ignoring these tasks?
>>>>> - What creates the /mesos cfs run queue, and why would that cease to
>>>>>   exist without the subordinate cgroups being cleaned up?
>>>>>   - I'm assuming the creation of the "cpu" cgroup with the path
>>>>>     "/mesos" done by mesos-slave creates this run queue.
>>>>>   - But I'm not sure how/why it would be removed, since I still see a
>>>>>     mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos
>>>>>     exists).
>>>>>
>>>>> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has
>>>>> patched fixes into newer kernel versions that we are running on other
>>>>> hosts (e.g., 2.6.32-573.7.1.el6).
>>>>>
>>>>> Setup info:
>>>>>
>>>>> Kernel version: 2.6.32-431.el6.x86_64
>>>>> Mesos version: 0.22.1
>>>>> Containerizer: Mesos
>>>>> Isolators: Have seen this behavior with both of these configs:
>>>>>   cgroups/cpu,cgroups/mem
>>>>>   cgroups/cpu,cgroups/mem,namespaces/pid
>>>>>
>>>>> Thanks for any insight or help!
>>>>>
>>>>> - Erik
>>>>
>>>>
>>
>>