> Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>   /foo/bar tasks from being scheduled?  i.e., might that be the root cause of
>   why the kernel is ignoring these tasks?

I was curious why you asked the above. CPU scheduling shares are
hierarchical: a cgroup's share is a fraction of its parent's CPU bandwidth.
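
As a rough illustration of shares being relative to the parent, here is a
minimal sketch (a simplified model, not kernel code). It assumes every cgroup
is busy and ignores tasks running directly in a parent group. The cgroup
paths and cpu.shares values are hypothetical, chosen only for the example.

```python
from fractions import Fraction


def parent_of(path):
    """Return the parent cgroup path, or None for the root."""
    if path == "/":
        return None
    head = path.rsplit("/", 1)[0]
    return head or "/"


def effective_fraction(path, shares):
    """Fraction of total CPU for `path` when every cgroup is busy.

    Simplified model: at each level a group receives
    shares / sum(shares of all siblings) of its parent's bandwidth.
    """
    fraction = Fraction(1)
    while path != "/":
        parent = parent_of(path)
        # Total shares of all groups under the same parent (including `path`).
        siblings_total = sum(
            s for p, s in shares.items() if parent_of(p) == parent
        )
        fraction *= Fraction(shares[path], siblings_total)
        path = parent
    return fraction


# Hypothetical cpu.shares values, for illustration only.
shares = {
    "/mesos": 1024,
    "/system": 1024,
    "/mesos/a": 512,
    "/mesos/b": 1536,
}

print(effective_fraction("/mesos/a", shares))  # 1/8
print(effective_fraction("/mesos/b", shares))  # 3/8
```

In this model /mesos/a gets 1/4 of whatever /mesos gets, and /mesos gets 1/2
of the machine, so the child's bandwidth is always bounded by the parent's.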

-Jojy


> On Dec 30, 2015, at 6:55 PM, Erik Weathers <[email protected]> wrote:
> 
> I'm trying to figure out a situation where we see tasks in a mesos
> container no longer being scheduled by the Linux kernel.  None of the tasks
> in the container are zombies, nor are they stuck in "Disk sleep" state.
> They are all in Running state.  But if I try to strace the processes the
> strace cmd just hangs.  I've also noticed that none of the RIPs (64-bit
> instruction pointers) are changing at all in these tasks, and they're not
> accumulating any cputime.   So the kernel is just not scheduling them.
> 
> Despite the behavior described above, these non-running tasks *are* listed
> in the run queues of /proc/sched_debug.  Notably, I have observed that on
> hosts without this problem there exist "cfs_rq[N]:/mesos" run queues,
> but on the hosts that have the broken scheduling, these run queues don't
> exist, though we still have "cfs_rq[N]:/mesos/<cgroup-UUID>" in
> /proc/sched_debug.  That is mighty suspicious to me.
> 
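
The comparison described above could be automated with something like the
following sketch. It assumes run-queue headers appear at the start of a line
in the form "cfs_rq[N]:/path", as quoted above; the sample text and the
function name are made up for illustration. Whether a missing parent entry
is abnormal on a given kernel is exactly the open question here, so treat
the output as a hint, not a diagnosis.

```python
import re
from collections import defaultdict

# Matches headers such as "cfs_rq[3]:/mesos/<cgroup-UUID>".
CFS_RQ = re.compile(r"^cfs_rq\[(\d+)\]:(/\S*)", re.MULTILINE)


def missing_parents(sched_debug_text):
    """Return (cpu, cgroup_path, missing_parent_path) tuples.

    A tuple is reported when a nested cgroup has a cfs_rq entry on a CPU
    but its immediate parent cgroup has none on that same CPU.
    """
    per_cpu = defaultdict(set)
    for cpu, path in CFS_RQ.findall(sched_debug_text):
        per_cpu[int(cpu)].add(path)

    missing = []
    for cpu, paths in sorted(per_cpu.items()):
        for path in sorted(paths):
            parent = path.rsplit("/", 1)[0]
            # An empty parent means `path` is "/" or a top-level cgroup.
            if parent and parent not in paths:
                missing.append((cpu, path, parent))
    return missing


# Made-up excerpt in the shape described above, not real output.
sample = """\
cfs_rq[0]:/mesos/1111
  .nr_running                    : 2
cfs_rq[0]:/
cfs_rq[1]:/mesos/2222
cfs_rq[1]:/mesos
cfs_rq[1]:/
"""

for cpu, path, parent in missing_parents(sample):
    print(f"cpu {cpu}: {path} present but {parent} missing")
```

On a real host the input would be the contents of /proc/sched_debug read
into a string.
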
> I'm curious about:
> 
>   - Has anyone seen similar behavior?
>   - Are /foo/bar cgroups hierarchical such that /foo missing would prevent
>   /foo/bar tasks from being scheduled?  i.e., might that be the root cause of
>   why the kernel is ignoring these tasks?
>   - What creates the /mesos cfs run queue, and why would that cease to
>   exist without the subordinate cgroups being cleaned up?
>      - I'm assuming the creation of the "cpu" cgroup with the path
>      "/mesos" done by mesos-slave creates this run queue.
>      - But I'm not sure how/why it would be removed, since I still see a
>      mesos cgroup in my cgroupfs cpu path (i.e., /cgroup/cpu/mesos exists).
> 
> I'm assuming that this is a kernel bug, and I'm hopeful RedHat has patched
> fixes into newer kernel versions that we are running on other hosts (e.g.,
> 2.6.32-573.7.1.el6).
> 
> Setup info:
> 
> Kernel version:  2.6.32-431.el6.x86_64
> Mesos version:  0.22.1
> Containerizer: Mesos
> Isolators: Have seen this behavior with both of these configs:
>   cgroups/cpu,cgroups/mem
>   cgroups/cpu,cgroups/mem,namespaces/pid
> 
> Thanks for any insight or help!
> 
> - Erik
