Re: [lxc-devel] cgroup V2 and LXC

Christian Brauner Tue, 23 Feb 2016 02:24:07 -0800

On Mon, Feb 15, 2016 at 07:48:05PM +0000, Serge Hallyn wrote:
> Quoting Christian Brauner (christian.brau...@mailbox.org):
> > On Wed, Feb 10, 2016 at 05:45:48PM +0000, Serge Hallyn wrote:
> > > Quoting Christian Brauner (christian.brau...@mailbox.org):
> > > > On Mon, Feb 01, 2016 at 04:56:08AM +0000, Serge Hallyn wrote:
> > > > > Quoting Kevin Wilson (wkev...@gmail.com):
> > > > > > Hi, LXC developers,
> > > > > > 
> > > > > > The latest kernel release (4.4) includes initial support to cgroup v2
> > > > > > with 2 controllers (memory and io). Also it seems that the PIDs
> > > > > > controller works in cgroup v2, but I do not know if it is officially
> > > > > > supported in v2.
> > > > > > 
> > > > > > Is there any intention to replace the existing cgroup v1 usage in LXC
> > > > > > by cgroup v2 ? or at least to enable working with both of them ?
> > > > > > 
> > > > > > Regards,
> > > > > > Kevin
> > > > > 
> > > > > Replace, no, support, yes.  I've added support for it to cgmanager, and have
> > > > > used lxc with the unified hierarchy through cgmanager.  Without cgmanager
> > > > > it will currently definitely not work.  It's worth discussing how we should
> > > > > handle it - and how init wants us to handle it.   With cgmanager I actually
> > > > > built in the support so that you could treat it as a legacy hierarchy, and
> > > > > upstart was happy with that since it used cgmanager.  Systemd will not be
> > > > > happy with that, and it will be a problem.  The only exception to the "no
> > > > > tasks in a non-leaf node" rule is for the / cgroup.  So lxc would need to
> > > > > place init in say /lxc/c1/.leaf, and systemd would have to accept that
> > > > > /lxc/c1 is the container's cgroup.  A few possibilities:
> > > > > 
> > > > > 1. maybe if we place systemd in /lxc/c1/init.scope it will be happy
> > > > Well, here is how I thought it could go (sticking to systemd specifics here):
> > > >         - create a slice for all lxc "lxc.slice" (similar to "machine.slice" of
> > > >           systemd-nspawn backed containers)
> > > >         - "lxc.slice" contains a scope for each container (e.g. "c1.scope")
> > > >         - "c1.scope" contains an "init.scope"
> > > >         - "init.scope" only contains the PID of "/sbin/init" as seen from the
> > > >           host (obviously)
> > > 
> > > So if we are creating container c1, are you talking about
> > > 
> > > /lxc/c1/lxc.slice/c1.scope/init.scope
> > > 
> > > or are you talking about a host-global
> > > 
> > > /lxc.slice
> > Yes, you have lxc.slice then you have all your machines under this. This is what
> > systemd-nspawn does if I'm not mistaken.
> > > with container-specific
> > > 
> > > /lxc.slice/c1.scope
> > > 
> > > per container?
> > > 
> > > ?
> > Yes.
> 
> This doesn't seem to address the problem.  Where we put these on the host doesn't
> matter.  The question is, we create container c1, in which cgroup do we put the
> init process?
> 
> Assume we create /lxc/c1 on the host as we do now.  This becomes / in the container's
> cgroup namespace.  Where do we put init?  If we put it into (namespaced) /, then
> systemd will not be able to create any cgroups.  So we should probably put it into
> /init.scope.  This is fine with cgroup namespaces since it can see it is in '/init.scope'
> (or '/' if an unprivileged container couldn't create a cgroup for some controllers).
> But if we do not have cgroup namespaces, systemd sees it is running in perhaps
> /user.slice/user-1000.slice/session-c6.scope/lxc/lxdvm1/lxc/c1/init.scope.  In that
> case we want systemd to recognize init.scope and create services under
> /user.slice/user-1000.slice/session-c6.scope/lxc/lxdvm1/lxc/c1.
> 
> > > >         - All other processes are put in another slice "c1-something.slice"
> > > 
> > > Which other processes?
> > Well, all processes systemd starts are put in either system.slice or
> > user.slice. Everything else we start in the container (e.g. vim) is
> > put in a session.slice (e.g. session-0.slice, session-1000.slice).
> 
> wc -l /sys/fs/cgroup/memory/tasks
> 548
That output is from a legacy cgroup hierarchy. (The tasks file no longer
exists in the unified hierarchy, no?) I was talking about unified cgroups.
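
To make the legacy/unified distinction concrete, here is a minimal Python
sketch that guesses which kind of hierarchy is mounted at a given path. It
assumes that the root of a unified mount exposes a cgroup.controllers file and
that a legacy controller mount exposes a tasks file; the function name
hierarchy_kind is made up for this example and is not part of lxc or systemd.

```python
import os


def hierarchy_kind(mountpoint):
    """Guess whether `mountpoint` is a legacy (v1) or unified (v2) cgroup mount.

    Assumption: the unified root has a `cgroup.controllers` file, while a
    legacy controller mount (e.g. /sys/fs/cgroup/memory) has a `tasks` file.
    """
    if os.path.exists(os.path.join(mountpoint, "cgroup.controllers")):
        return "unified"
    if os.path.exists(os.path.join(mountpoint, "tasks")):
        return "legacy"
    return "unknown"


# Example call (the result depends on how the host mounts its cgroups):
#   hierarchy_kind("/sys/fs/cgroup/memory")
```

The unified check comes first so that a directory carrying both files is
still reported as unified.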


Here is a typical layout for a systemd-nspawn container BB that runs a unified
cgroup hierarchy inside, on a host that also runs a unified cgroup hierarchy:

/sys/fs/cgroup/machine.slice/:
        - non-leaf node --> cgroup.procs empty

/sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/:
        - non-leaf node --> cgroup.procs empty

The following are on the same level: 
(/sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/)

- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/init.scope/:
        - leaf node --> cgroup.procs contains PID of init

- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/system.slice/:
        - non-leaf node --> cgroup.procs empty
        - contains leaf nodes for system setup stuff (journald, logind etc.)

- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/user.slice/user-0.slice/session-c1.scope
  and
- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/user.slice/user-0.slice/user@0.service:
        - filled with leaf-nodes for e.g. processes started by the user
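
The layout above can be checked mechanically. The following Python sketch
walks a cgroup directory tree, reports for each node whether it is a leaf and
which PIDs its cgroup.procs lists, and flags any non-leaf node (other than the
root) that still holds processes. The helper names read_procs, describe and
violations are invented for this illustration; the only thing assumed about
the kernel interface is that each cgroup directory has a cgroup.procs file
with one PID per line.

```python
import os


def read_procs(cgroup_dir):
    """Return the PIDs listed in cgroup_dir/cgroup.procs (empty if missing)."""
    path = os.path.join(cgroup_dir, "cgroup.procs")
    try:
        with open(path) as f:
            return [int(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []


def describe(root):
    """Return (relative path, is_leaf, pids) for every cgroup below root."""
    rows = []
    for dirpath, dirnames, _files in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        rel = os.path.relpath(dirpath, root)
        rows.append((rel, not dirnames, read_procs(dirpath)))
    return rows


def violations(root):
    """Non-leaf cgroups, other than the root itself, that contain processes."""
    return [rel for rel, is_leaf, pids in describe(root)
            if rel != "." and not is_leaf and pids]
```

Run against /sys/fs/cgroup/machine.slice on a host like the one described, one
would expect violations() to come back empty, with init.scope showing up as a
leaf holding the PID of init.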

> 
> > > AFAIK all other processes will be created by systemd.  The q is what will it
> > > do.  If we put systemd in /lxc.slice/c1.scope/init.scope, will it take that
> > > as its cgroup root and try to create and move itself into
> > > /lxc.slice/c1.scope/init.scope ?  If so it will fail since it cannot create a
> > > cgroup while it is in it.
> > I don't think so but I need to test that again. Time to boot unified.
> > 
> > > 
> > > So I think I've convinced myself that we need to collaborate with systemd
> > > on this.  Perhaps we can agree with it on a default cgroup in which it should
> > > be started to tell it "this is the leaf cgroup for your init".  So if it sees
> > > it is in /a/b/c/.cg_leaf, then it will know that /a/b/c is its root.
> > I thought the same that's why I started to read some of the code.
> > fwiw, systemd-nspawn already works with the unified cgroup hierarchy and I think
> > nesting works as well. But I'm not completely sure how nspawn handles nesting.
> 
> Looks like it puts systemd into '/supervisor' and the container into '/payload'?
> (nspawn-cgroup.c)
I don't think so. This seems to be a special case when systemd-nspawn is run
from a service unit. Otherwise the layout seems to be as I sketched above.
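
Coming back to the leaf-cgroup convention quoted above (init started in
/a/b/c/.cg_leaf or in .../c1/init.scope, with the parent treated as the
container's root), the rule itself is easy to state in code. This is only a
sketch of the proposed convention, not of what systemd actually does; the
function container_root and the set of marker names are hypothetical.

```python
import posixpath

# Hypothetical marker names taken from the discussion; nothing here is an
# agreed-upon interface between lxc and systemd.
LEAF_MARKERS = ("init.scope", ".cg_leaf")


def container_root(init_cgroup, markers=LEAF_MARKERS):
    """Derive the container's cgroup root from the cgroup init was started in.

    If init sits in a recognized leaf cgroup, the parent is the root;
    otherwise the cgroup itself is treated as the root.
    """
    path = init_cgroup.rstrip("/") or "/"
    if posixpath.basename(path) in markers:
        return posixpath.dirname(path)
    return path
```

With cgroup namespaces, init sees /init.scope and the root comes out as /;
without them, the long host path ending in /c1/init.scope yields the /c1
cgroup, which is where services would then be created.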
_______________________________________________
lxc-devel mailing list
lxc-devel@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel
