Hi Jojy,

Unfortunately, I haven't been able to reproduce this issue on demand; it has only happened spontaneously a few times, so I can't say for sure whether it would happen on a newer Mesos/kernel version. I'm thinking of trying to force reproduction by creating and destroying a large number of cgroups, since the issue does *seem* to correlate with some badly behaved Storm topologies that are constantly crashing, causing their cgroups to be created and destroyed frequently.
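Roughly what I have in mind for forcing that churn is something like the sketch below. (Just a sketch: the /cgroup/cpu mount point matches our RHEL6 hosts, but the "churn-test" name, iteration count, and quota values are arbitrary placeholders I made up.)

#!/bin/bash
# Sketch: hammer the cpu cgroup hierarchy with create/destroy cycles to
# try to trigger the missing-runqueue state. Assumes cgroup v1 with the
# cpu subsystem mounted at /cgroup/cpu (as on our hosts); run as root.
CG=/cgroup/cpu/churn-test   # hypothetical scratch cgroup
mkdir -p "$CG"
for i in $(seq 1 100000); do
  child="$CG/child-$i"
  mkdir "$child"
  # Give the child quota/shares roughly like mesos-slave would.
  echo 100000 > "$child/cpu.cfs_period_us"
  echo 50000  > "$child/cpu.cfs_quota_us"
  # Briefly run a task inside the cgroup, like a crashing executor:
  # $BASHPID in the subshell is the subshell's own PID.
  ( echo $BASHPID > "$child/tasks"; true ) &
  wait $!
  rmdir "$child"
done
rmdir "$CG"

No idea yet whether plain churn is enough, or whether the tasks need to actually get throttled before destruction, so I may have to vary the load inside the loop.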
I have a couple of test hosts that are in this bad state right now, so I'm trying to get as much info out of them as I can. I'm thinking of trying SystemTap to introspect the kernel's run queue state and see what is happening.

Here is the info you requested:

*/cgroup/cpu files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do echo ----$f:---- ; cat /cgroup/cpu/$f; done | head -n20
----cpu.cfs_quota_us:----
0
----cpu.cfs_period_us:----
0
----cpu.shares:----
1024
----cpu.stat:----
nr_periods 0
nr_throttled 0
throttled_time 0
----tasks:----
1
...

*/cgroup/cpu/mesos files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do echo ----$f:---- ; cat /cgroup/cpu/mesos/$f; done
----cpu.cfs_quota_us:----
-1
----cpu.cfs_period_us:----
100000
----cpu.shares:----
1024
----cpu.stat:----
nr_periods 0
nr_throttled 0
throttled_time 0
----tasks:----

NOTE: there are no tasks, and cpu.cfs_quota_us is -1. But both of those are consistent with other hosts that aren't exhibiting this problem.

*/cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e files:*

% for f in cpu.cfs_quota_us cpu.cfs_period_us cpu.shares cpu.stat tasks; do echo ----$f:---- ; cat /cgroup/cpu/mesos/08610169-76d5-4fd2-86bc-d3ef4d163e3e/$f; done
----cpu.cfs_quota_us:----
1800000
----cpu.cfs_period_us:----
100000
----cpu.shares:----
18432
----cpu.stat:----
nr_periods 680868
nr_throttled 254025
throttled_time 55400010353125
----tasks:----
6473
...

- Erik

On Sun, Jan 3, 2016 at 10:38 AM, Jojy Varghese <[email protected]> wrote:

> Hi Erik
>   Happy to work on this with you. Thanks for the details.
>
> As you might know, in cfs_rq:/<name> (from /proc/sched_debug), <name> is
> the CPU cgroup hierarchy name. I am curious about the contents and cgroups
> hierarchy when this happens. Could you send the “mesos” hierarchy
> (directory tree) and the contents of files like ‘tasks’,
> ‘cpu.cfs_quota_us’, ‘cpu.cfs_period_us’, ‘cpu.shares’, and ‘cpu.stat’?
>
> It does look strange that the parent cgroup is missing when the child is
> present.
>
> Also, wondering if you are able to see the same issue with the latest
> Mesos and/or kernel?
>
> -Jojy
>
> > On Jan 2, 2016, at 9:43 PM, Erik Weathers <[email protected]> wrote:
> >
> > Hey Jojy, thanks for your reply. Response inline.
> >
> > On Thu, Dec 31, 2015 at 11:31 AM, Jojy Varghese <[email protected]> wrote:
> >
> >>> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
> >>> /foo/bar tasks from being scheduled? i.e., might that be the root cause
> >>> of why the kernel is ignoring these tasks?
> >>
> >> Was curious why you said the above. CPU scheduling shares are a function
> >> of their parent's CPU bandwidth.
> >
> > This question arose from an earlier observation in my initial email: I
> > pointed out that /proc/sched_debug lists all of the CFS run queues, but
> > it seems like some of those run queues are missing on the affected hosts.
> > i.e., usually they look like this (only including output for the 1st
> > CPU's CFS run queues):
> >
> > % grep 'cfs_rq\[0\]' /proc/sched_debug
> > cfs_rq[0]:/mesos/e8aa3b46-8004-466a-9a5e-249d6d19993f
> > cfs_rq[0]:/mesos
> > cfs_rq[0]:/
> >
> > But on the problematic hosts, they look like this:
> >
> > % grep 'cfs_rq\[0\]' /proc/sched_debug
> > cfs_rq[0]:/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352
> > cfs_rq[0]:/
> >
> > Notably, "cfs_rq[0]:/mesos" is missing on the problematic hosts.
> >
> > I'm not sure how that is possible: my understanding is that these
> > cfs_rq's are created when directories are added to the special cgroups
> > filesystem, and since the /cgroup/cpu/mesos dir exists (as well as
> > /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352/), I don't see how
> > the CFS run queues for "/mesos" could have been deleted. I've been trying
> > to read the kernel cgroup CFS scheduling code, but it's tough for a newb.
> >
> > Notably, the cgroup settings that I see in /cgroup/cpu/mesos and
> > /cgroup/cpu/mesos/5cf9a444-e612-4d5b-b8bb-7ee93e44b352 are not
> > suspicious. i.e., it's not that the cgroup settings of the "parent"
> > /mesos cgroup are preventing the tasks from being scheduled. It seems
> > that the cgroup settings of the parent are simply gone from the kernel.
> > Poof.
> >
> > At this point I'm assuming that the above observation is indeed the root
> > cause of the problem, and I'm simply hoping that whatever logic deleted
> > the "/mesos" run queue is fixed in either a newer kernel or a newer Mesos
> > version.
> >
> > Thanks!
> >
> > - Erik
> >
> >> -Jojy
> >>
> >>> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]> wrote:
> >>>
> >>> I'm trying to figure out a situation where we see tasks in a Mesos
> >>> container no longer being scheduled by the Linux kernel. None of the
> >>> tasks in the container are zombies, nor are they stuck in "Disk sleep"
> >>> state. They are all in Running state. But if I try to strace the
> >>> processes, the strace cmd just hangs. I've also noticed that none of
> >>> the RIPs (64-bit instruction pointers) are changing at all in these
> >>> tasks, and they're not accumulating any cputime. So the kernel is just
> >>> not scheduling them.
> >>>
> >>> Despite the behavior described above, these non-running tasks *are*
> >>> listed in the run queues of /proc/sched_debug. Notably, I have observed
> >>> that on hosts without this problem there exist "cfs_rq[N]:/mesos" run
> >>> queues, but on the hosts that have the broken scheduling, these run
> >>> queues don't exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>"
> >>> in /proc/sched_debug. That is mighty suspicious to me.
> >>>
> >>> I'm curious about:
> >>>
> >>> - Has anyone seen similar behavior?
> >>> - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
> >>>   /foo/bar tasks from being scheduled? i.e., might that be the root
> >>>   cause of why the kernel is ignoring these tasks?
> >>> - What creates the /mesos cfs run queue, and why would that cease to
> >>>   exist without the subordinate cgroups being cleaned up?
> >>>   - I'm assuming the creation of the "cpu" cgroup with the path "/mesos"
> >>>     done by mesos-slave creates this run queue.
> >>>   - But I'm not sure how/why it would be removed, since I still see a
> >>>     mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos
> >>>     exists).
> >>>
> >>> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has
> >>> patched fixes into the newer kernel versions that we are running on
> >>> other hosts (e.g., 2.6.32-573.7.1.el6).
> >>>
> >>> Setup info:
> >>>
> >>> Kernel version: 2.6.32-431.el6.x86_64
> >>> Mesos version: 0.22.1
> >>> Containerizer: Mesos
> >>> Isolators: Have seen this behavior with both of these configs:
> >>>   cgroups/cpu,cgroups/mem
> >>>   cgroups/cpu,cgroups/mem,namespaces/pid
> >>>
> >>> Thanks for any insight or help!
> >>>
> >>> - Erik
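PS: for anyone else who wants to check their hosts for this state, here's a rough detection script based on the comparison above. (Also just a sketch: it assumes cgroup v1 with the cpu subsystem mounted at /cgroup/cpu, and the /proc/sched_debug format shown in the grep output; both are specific to our RHEL6-era setup.)

#!/bin/bash
# Sketch: flag cpu cgroups that exist in cgroupfs but have no cfs_rq
# entry in /proc/sched_debug -- the symptom above, where "/mesos"
# vanished from the run queues while /cgroup/cpu/mesos still exists.
CPUROOT=/cgroup/cpu
# Run queue names for CPU 0, e.g. "/", "/mesos", "/mesos/<uuid>".
rqs=$(grep -o 'cfs_rq\[0\]:[^ ]*' /proc/sched_debug | cut -d: -f2)
find "$CPUROOT" -mindepth 1 -type d | while read -r dir; do
  cg="${dir#$CPUROOT}"   # e.g. /mesos/5cf9a444-...
  if ! echo "$rqs" | grep -qxF "$cg"; then
    echo "no cfs_rq for cgroup: $cg"
  fi
done

One caveat: as far as I can tell, a cgroup with no runnable tasks may legitimately be absent from /proc/sched_debug on these kernels, so a hit from this script is a hint to dig further (e.g., check whether the cgroup's tasks are accumulating cputime), not proof of the bug.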
