Re: [PATCH v4 0/2] cgroup: allow management of subtrees by new cgroup namespaces

2016-05-20 Thread Aditya Kali
On Fri, May 20, 2016 at 9:25 AM, James Bottomley
 wrote:
>
> On Fri, 2016-05-20 at 09:17 -0700, Tejun Heo wrote:
> > Hello, James.
> >
> > On Fri, May 20, 2016 at 12:09:10PM -0400, James Bottomley wrote:
> > > I think it's just different definitions.  If you take on our
> > > definition of being able to set up a container without any admin
> > > intervention, do you see our problem: we can't get the initial
> > > delegation of the hierarchy.
> >
> > Yeah, I can see the difference but we can't solve that by special
> > casing NS case.
>
> Great, we agree on the problem definition ... as I said, I'm not saying
> this patch is the solution, but it gives us a starting point for
> exploring whether there is a solution.
>
> >   This is stemming from the fact that an unpriv application can't
> > create its sub-cgroups without explicit delegation from the root and
> > that has always been an explicit design choice.
> > It's tied to who's responsible for cleanup afterwards and what
> > happens when the process gets migrated to a different cgroup.  The
> > latter is an important issue on v1 hierarchies because migrating
> > tasks sometimes is used as a way to control resource distribution.
>
> OK, so is the only problem cleanup?  If so, what if I proposed that a
> cgroup directory could only be created by the owner of the userns
> (which would be any old unprivileged user) iff they create a cgroup ns
> and the cgroup ns would be responsible for removing it again, so the
> cgroup subdirectory would be tied to the cgroup namespace as its holder
> and we'd use release of the cgroup to remove all the directories?
>

The cgroup namespace doesn't own the resources in the cgroupns-root, so
I am not sure how it would be able to do the cleanup either. I.e., even
if all the processes in the cgroup ns die, that doesn't mean the
cgroupns-root they belonged to is available for cleanup. For this
reason, one of the implicit design choices in cgroupns was that the
cgroupns-root should already exist, and the target process should
already have been moved into it (presumably by some admin process),
before the cgroupns is created.

Moreover, the subsystem controllers (cpu, memory, etc.) are oblivious
to cgroup namespaces. So, for example, creating a new cgroup namespace
doesn't affect the reclaim behavior, but allowing
creation/modification of sub-cgroups does. So I think allowing
any unprivileged process to do that cannot be considered safe for now.
Explicit approval from some admin process will still be needed (which
can be given by chmod/chown today).
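
A minimal sketch of what "approval by chown" could look like from an
admin process, written in Python. The helper name delegate_subtree and
the list of control files are assumptions for illustration (the file
list follows the cgroup v2 delegation convention; the thread itself only
says "chmod/chown"):

```python
import os

# Files a delegatee typically needs write access to, besides the directory
# itself. ASSUMPTION: this list follows the cgroup v2 delegation convention;
# it is not taken from the thread.
DELEGATE_FILES = ("cgroup.procs", "cgroup.subtree_control", "cgroup.threads")


def delegate_subtree(cgroup_dir, uid, gid):
    """Chown a cgroup directory and its delegation files; return paths changed.

    Hypothetical helper: pass -1 for uid or gid to leave that id unchanged.
    """
    changed = []
    targets = [cgroup_dir]
    targets += [os.path.join(cgroup_dir, name) for name in DELEGATE_FILES]
    for path in targets:
        # Not every kernel or hierarchy provides every file, so skip
        # whatever is missing instead of failing.
        if os.path.exists(path):
            os.chown(path, uid, gid)
            changed.append(path)
    return changed
```

The point of the sketch is that the hand-off is an explicit act by a
privileged process on an already existing directory, which is the
"approval" step that creating a namespace alone does not provide.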


>
> James
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thanks,

-- 
Aditya


Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field

2016-04-15 Thread Aditya Kali
On Thu, Apr 14, 2016 at 8:27 AM, Serge E. Hallyn  wrote:
> Quoting Eric W. Biederman (ebied...@xmission.com):
>> "Serge E. Hallyn"  writes:
>>
>> > This is so that userspace can distinguish a mount made in a cgroup
>> > namespace from a bind mount from a cgroup subdirectory.
>>
>> To do that do you need to print the path, or is an extra option that
>> reveals nothing except that it was a cgroup mount sufficient?
>>
>> Is there any practical difference between a mount in a namespace and a
>> bind mount?
>>
>> Given the way the conversation has been going I think it would be good
>> to see the answers to these questions.  Perhaps I missed it but I
>> haven't seen the answers to those questions.
>
> Yup, I tried to answer those in my last email, let me try again.
>
> Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
> freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
> and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
> that container, I start another container x1, not using cgroup namespaces.
> It also wants a cgroup mount, and a common way to handle that (to prevent
> container rewriting its limits) is to mount a tmpfs at /sys/fs/cgroup,
> create /sys/fs/cgroup/lxc/x1, and bind mount /sys/fs/cgroup/lxc/x1 from
> the parent container onto /sys/fs/cgroup/lxc/x1 in the child container.
> Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
> with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
> in that container will show '/lxc/x1'.  Unless it has been moved into
> /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
> Every time I've thought "maybe we can just..." I've found a case where it
> wouldn't work.
>
> At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
> the cgroupfs mounts are not bind mounts.  However, old userspace (and
> container drivers) on new kernels is certainly possible, especially an
> older distro in a container on a newer distro on the host.  That completely
> breaks with this approach.
>

My main concern with making this a new kernel API is that it's too
generic and exposes information about all system cgroups to every
process on the system, not just the container or the process inside it
that needs it. Not all containers need this information, and not all
processes running inside a container need it. I haven't put too
much thought into it, but it seems you would still need to update the
container userspace to read this extra mount option. So it seems a
simpler approach, where the host "cgroup manager" provides this
information to the specific container's cgroup manager via other
user-space channels (a config file, command-line args, environment
vars, proper container mounts, etc.), may also work, right?
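
For reference, the mountinfo "root" value that the quoted example turns
on can be read in user space roughly as follows. This is a sketch in
Python, not code from lxc; the names Mount, parse_mountinfo and
cgroup_mounts are made up for illustration, and octal escapes such as
\040 in paths are deliberately not decoded:

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    root: str         # path inside the filesystem that forms this mount's root
    mount_point: str  # where the mount is attached
    fstype: str
    super_opts: str   # per-superblock options (last column)


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of /proc/<pid>/mountinfo (layout as in proc(5))."""
    mounts = []
    for line in text.splitlines():
        if " - " not in line:
            continue
        # A lone "-" ends the variable-length optional fields; the columns
        # on either side of it have fixed positions.
        left, right = line.split(" - ", 1)
        pre, post = left.split(), right.split()
        if len(pre) < 6 or len(post) < 3:
            continue
        # pre[3] is the root (index 3 counting from zero), pre[4] the
        # mount point; post[0] is the fs type, post[2] the super options.
        mounts.append(Mount(pre[3], pre[4], post[0], post[2]))
    return mounts


def cgroup_mounts(text: str) -> List[Mount]:
    """Keep only cgroup (v1) and cgroup2 mounts."""
    return [m for m in parse_mountinfo(text) if m.fstype in ("cgroup", "cgroup2")]
```

In practice the input would be the text of /proc/self/mountinfo; the
root value alone does not say whether a mount was made inside a cgroup
namespace or is a bind mount of a subdirectory, which is the ambiguity
under discussion.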

> I also personally think there *is* value in letting a task know its
> place on the system, so hiding the full cgroup path is imo not only not
> a valid goal, it's counter-productive.  Part of making for better
> virtualization is to give userspace all the info it needs about its
> current limits.  Consider that with the unified hierarchy, you cannot
> have tasks in a cgroup that also has child cgroups - except for the
> root.  Cgroup namespaces do not make an exception for this, so knowing
> that you are not in the absolute cgroup root actually can prevent you
> from trying something that cannot work.  Or, I suppose, at least
> understanding why you're unable to do what you're trying to do (namely
> your container manager messed up).  I point this out because finding
> a way to only show the namespaced root in field 3 of mountinfo would
> fix the base problem, but at the cost of hiding useful information
> from a container.
>
>> Eric
>>
>>
>> >
>> > Signed-off-by: Serge Hallyn 
>> > ---
>> > Changelog: 2016-04-13: pass kernfs_node rather than dentry to show_options
>> > ---
>> >  fs/kernfs/mount.c  |  2 +-
>> >  include/linux/kernfs.h |  3 ++-
>> >  kernel/cgroup.c| 28 +++-
>> >  3 files changed, 30 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> > index f73541f..58e8a86 100644
>> > --- a/fs/kernfs/mount.c
>> > +++ b/fs/kernfs/mount.c
>> > @@ -36,7 +36,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
>> >         struct kernfs_syscall_ops *scops = root->syscall_ops;
>> >
>> >         if (scops && scops->show_options)
>> > -               return scops->show_options(sf, root);
>> > +               return scops->show_options(sf, dentry->d_fsdata, root);
>> >         return 0;
>> >  }
>> >
>> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> > index c06c442..72b4081 100644
>> > --- a/include/linux/kernfs.h
>> > +++ b/include/linux/kernfs.h
>> > @@ -145,7 +145,8 @@ struct kernfs_node {
>> >   */
>> >  struct kernfs_syscall_ops {
>> > int 

Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field

2016-04-13 Thread Aditya Kali
On Wed, Apr 13, 2016 at 12:01 PM, Serge E. Hallyn  wrote:
> Quoting Tejun Heo (t...@kernel.org):
>> Hello, Serge.
>>
>> On Wed, Apr 13, 2016 at 01:46:39PM -0500, Serge E. Hallyn wrote:
>> > It's not a leak of any information we're trying to hide.  I realize
>> > something like 8 years have passed, but I still basically go by the
>> > ksummit guidance that containers are ok but the kernel's first priority
>> > is to facilitate containers but not trick containers into thinking
>> > they're not containerized.  So long as the container is properly set
>> > up, I don't think there's anything the workload could do with the
>> > nsroot= info other than *know* that it is in a ns cgroup.
>> >
>> > If we did change that guidance, there's a slew of proc info that we
>> > could better virtualize :)
>>
>> I see.  I'm just wondering because the information here seems a bit
>> gratuitous.  Isn't the only thing necessary telling whether the root
>> is bind mounted or namescoped?  Wouldn't simple "nsroot" work for that
>> purpose?
>
> I don't think so - we could be in a cgroup namespace but still have
> access only to bind-mounted cgroups.  So we need to compare the
> superblock dentry root field to the nsroot= value.

Umm, I don't think this is such a good idea. The main purpose of the
cgroup namespace was to prevent the exposure of the system cgroup
hierarchy that used to happen through /proc/self/cgroup. Wouldn't
showing that information in /proc/self/mountinfo defeat the purpose?

> One practical problem I've found with cgroup namespaces is that there
> is no way to disambiguate between a cgroupfs mount which was done in
> a cgroup namespace, and a bind mount of a cgroupfs directory.

That's actually by design, no? Namespaced apps should not know/care if
they are running inside a namespace. If they can find that out today, it's
just because of certain side effects. I fear that adding an explicit
"nsroot" or similar to /proc/self/mountinfo turns it into an API, making
it hard to virtualize user apps again.
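
To make the API concern concrete, here is a sketch in Python of the
check that userspace would end up depending on if the option were added.
It assumes the proposed nsroot= value shows up among the comma-separated
superblock options of a cgroup mount, as in the RFC; option_value and
classify_cgroup_mount are hypothetical names, and nothing here describes
an interface that is known to exist in a released kernel:

```python
def option_value(super_opts, key):
    """Return the value of 'key=value' in a comma-separated option string."""
    for opt in super_opts.split(","):
        name, sep, value = opt.partition("=")
        if sep and name == key:
            return value
    return None


def classify_cgroup_mount(root, super_opts):
    """Compare a mount's root field with the proposed nsroot= value.

    ASSUMPTION: 'nsroot' is the option name from the RFC in this thread.
    """
    nsroot = option_value(super_opts, "nsroot")
    if nsroot is None:
        return "unknown"          # option absent, nothing to compare
    if root == nsroot:
        return "namespace-root"   # mount root is the cgroup namespace root
    return "subdirectory"         # e.g. a bind mount of a cgroup subdirectory
```

Once container runtimes ship code like this, the option's presence and
format can no longer be changed freely, which is the cost being weighed
against the disambiguation it buys.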

-- 
Aditya


Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2015-01-07 Thread Aditya Kali
On Wed, Jan 7, 2015 at 1:28 AM, Richard Weinberger  wrote:
> Am 07.01.2015 um 00:20 schrieb Aditya Kali:
>> I understand your point. But it will add some complexity to the code.
>>
>> Before trying to make it work for non-unified hierarchy cases, I would
>> like to get a clearer idea.
>> What do you expect to be mounted when you run:
>>   container:/ # mount -t cgroup none /sys/fs/cgroup/
>> from inside the container?
>>
>> Note that cgroup-namespace won't be able to change the way cgroups are
>> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
>> together at a single mount-point, then we cannot mount them any other
>> way (inside a container or outside). This restriction exists today and
>> cgroup-namespaces won't change that.
>
> I wondered why cgroup namespaces won't change that and looked at your patches
> in more detail.
> What you propose as cgroup namespace is much more a cgroup chroot() than
> a namespace.
> As you pass relative paths into the namespace you depend on the mount
> structure of the host side.
> Hence, the abstraction between namespaces happens on the mount paths of the
> initial cgroupfs. But we really want a new cgroupfs instance within a
> container and not just a cut out of the initial cgroupfs mount.
>

What you describe would be useful at Google too; I just found it
difficult/infeasible to include in the scope of cgroup namespaces.
The scope of the cgroup namespace was deliberately limited to virtualizing
the /proc/<pid>/cgroup file, and to doing so in a way that doesn't need
major changes to the cgroup code itself. (It was also limited to the
unified hierarchy to keep things simple, but that can be changed.)
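
For readers unfamiliar with the file being virtualized: each line of
/proc/<pid>/cgroup has the form hierarchy-ID:controller-list:path. A
small Python sketch of a parser follows; CgroupEntry and
parse_proc_cgroup are illustrative names, not an existing library API:

```python
from typing import List, NamedTuple, Tuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: Tuple[str, ...]
    path: str  # the part a cgroup namespace makes relative to cgroupns-root


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of /proc/<pid>/cgroup."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path is everything after.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = tuple(c for c in controllers.split(",") if c)
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries
```

Only the last field changes with the namespace: the same process that
is reported under /batchjobs/container_id1 from the initial namespace
is reported under / from inside a namespace rooted at that cgroup.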

Many of the cgroup subsystems (memory, cpu, etc.) rely on the fact that
they can see the entire cgroup tree. For example, in a memcg-OOM scenario,
the memory controller needs to look at all sub-cgroups inside the
OOMing cgroup. A per-namespace cgroupfs instance (if I understand
correctly) would mean that sub-cgroups created inside the namespace
won't be visible outside. I expect this would break the functionality
of the subsystem.

Illustration: memcg A is under OOM; [B] and [C] are cgroup namespace
roots with possibly namespace-private sub-cgroups.
  -- [B]
A |
  -- [C]
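
As a user-space analogy for the tree above (this is not kernel code, and
descendant_cgroups is a made-up helper), a handler for A has to be able
to enumerate everything below A, including whatever was created under
[B] and [C]. A sketch in Python, assuming cgroups are visible as plain
directories:

```python
import os


def descendant_cgroups(cgroup_dir):
    """Return every cgroup directory below cgroup_dir, relative to it."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(cgroup_dir):
        dirnames.sort()  # visit children in a stable order
        for name in dirnames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, cgroup_dir))
    return found
```

If sub-cgroups under [B] lived in a private cgroupfs instance, a walk
like this starting at A would never reach them, which is the breakage
described above.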

Cgroups are heavily used inside the kernel for various purposes that
need a namespace-agnostic view. An inherent limitation of
containers running on a machine is that they share the same kernel.
Perhaps what you need is something like kexec to be supported inside a
container.

> I fear your approach is oversimplified and won't work for all cases. It may
> work for your specific use case at Google but we really want something generic.
> Eric, what do you think?
> Eric, what do you think?
>
> Thanks,
> //richard


-- 
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2015-01-06 Thread Aditya Kali
I understand your point. But it will add some complexity to the code.

Before trying to make it work for non-unified hierarchy cases, I would
like to get a clearer idea.
What do you expect to be mounted when you run:
  container:/ # mount -t cgroup none /sys/fs/cgroup/
from inside the container?

Note that cgroup-namespace won't be able to change the way cgroups are
mounted, i.e., if say the cpu and cpuacct subsystems are mounted
together at a single mount-point, then we cannot mount them any other
way (inside a container or outside). This restriction exists today and
cgroup-namespaces won't change that.

So, if on the host we have:
root@adityakali-vm2:/sys/fs/cgroup# cat /proc/mounts | grep cgroup
tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpuset,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/mem cgroup rw,relatime,memory,hugetlb 0 0
cgroup /sys/fs/cgroup/rest cgroup rw,relatime,devices,freezer,net_cls,blkio,perf_event,net_prio 0 0

And inside the container we want each subsystem to be on its own
mount-point, then it will fail. Do you think it's still useful to
support virtualizing paths for non-unified hierarchies?
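
The restriction can be sketched in Python as a check against the
existing layout. The rule encoded here is my reading of the v1
behaviour (a mount succeeds if the requested controller set exactly
matches an existing hierarchy, or if none of the requested controllers
is attached anywhere yet); V1_CONTROLLERS, hierarchies and can_mount
are names made up for this example:

```python
# Controller names used to tell controllers apart from generic mount
# options such as "rw" or "relatime". ASSUMPTION: limited to the names
# that appear in the host listing above.
V1_CONTROLLERS = frozenset({
    "cpuset", "cpu", "cpuacct", "memory", "hugetlb", "devices",
    "freezer", "net_cls", "blkio", "perf_event", "net_prio",
})


def hierarchies(mounts_text):
    """Map each cgroup mount point in /proc/mounts text to its controllers."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        result[fields[1]] = frozenset(fields[3].split(",")) & V1_CONTROLLERS
    return result


def can_mount(requested, existing):
    """Would mounting exactly this controller set be expected to succeed?"""
    requested = frozenset(requested)
    if requested in existing.values():
        return True  # same set as an existing hierarchy: it is reused
    bound = frozenset().union(*existing.values())
    return requested.isdisjoint(bound)  # otherwise all must be unattached
```

With the host layout above, asking for cpu alone is rejected because
cpu is already co-mounted with cpuset and cpuacct, while asking for all
three together matches the existing hierarchy.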

Thanks,


On Mon, Jan 5, 2015 at 4:17 PM, Richard Weinberger  wrote:
> Am 06.01.2015 um 01:10 schrieb Aditya Kali:
>> Since the old/default behavior is on its way out, I didn't invest time
>> in fixing that. Also, some of the properties that make
>> cgroup-namespace simpler are only provided by unified hierarchy (for
>> example: a single root-cgroup per container).
>
> Does the new sane cgroupfs behavior even have a single real world user?
> I always thought it isn't stable yet.
>
> Linux distros currently use systemd v210. They don't dare to use a newer one.
> Even *if* systemd would support the sane cgroupfs behavior in the most
> recent version it will take 1-2 years until it would hit a recent distro.
>
> So please support also the old and nasty behavior such that one day we can
> run current systemd distros in Linux containers.
>
> Thanks,
> //richard



-- 
Aditya


Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2015-01-05 Thread Aditya Kali
On Mon, Jan 5, 2015 at 3:53 PM, Eric W. Biederman  wrote:
> Richard Weinberger  writes:
>
>> Am 05.01.2015 um 23:48 schrieb Aditya Kali:
>>> On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger  wrote:
>>>> Aditya,
>>>>
>>>> I gave your patch set a try but it does not work for me.
>>>> Maybe you can bring some light into the issues I'm facing.
>>>> Sadly I still had no time to dig into your code.
>>>>
>>>> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>>>>> Signed-off-by: Aditya Kali 
>>>>> ---
>>>>>  Documentation/cgroups/namespace.txt | 147 
>>>>> 
>>>>>  1 file changed, 147 insertions(+)
>>>>>  create mode 100644 Documentation/cgroups/namespace.txt
>>>>>
>>>>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>>>>> new file mode 100644
>>>>> index 000..6480379
>>>>> --- /dev/null
>>>>> +++ b/Documentation/cgroups/namespace.txt
>>>>> @@ -0,0 +1,147 @@
>>>>> + CGroup Namespaces
>>>>> +
>>>>> +CGroup Namespace provides a mechanism to virtualize the view of the
>>>>> +/proc//cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>>>>> +clone() and unshare() syscalls to create a new cgroup namespace.
>>>>> +The process running inside the cgroup namespace will have its 
>>>>> /proc//cgroup
>>>>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the 
>>>>> process
>>>>> +at the time of creation of the cgroup namespace.
>>>>> +
>>>>> +Prior to CGroup Namespace, the /proc//cgroup file used to show 
>>>>> complete
>>>>> +path of the cgroup of a process. In a container setup (where a set of 
>>>>> cgroups
>>>>> +and namespaces are intended to isolate processes), the 
>>>>> /proc//cgroup file
>>>>> +may leak potential system level information to the isolated processes.
>>>>> +
>>>>> +For Example:
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  
>>>>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +The path '/batchjobs/container_id1' can generally be considered as 
>>>>> system-data
>>>>> +and its desirable to not expose it to the isolated process.
>>>>> +
>>>>> +CGroup Namespaces can be used to restrict visibility of this path.
>>>>> +For Example:
>>>>> +  # Before creating cgroup namespace
>>>>> +  $ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> 
>>>>> cgroup:[4026531835]
>>>>> +  $ cat /proc/self/cgroup
>>>>> +  
>>>>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>>>>> +  $ ~/unshare -c
>>>>> +  [ns]$ ls -l /proc/self/ns/cgroup
>>>>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> 
>>>>> cgroup:[4026532183]
>>>>> +  # From within new cgroupns, process sees that its in the root cgroup
>>>>> +  [ns]$ cat /proc/self/cgroup
>>>>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>>>> +
>>>>> +  # From global cgroupns:
>>>>> +  $ cat /proc//cgroup
>>>>> +  
>>>>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>>>>> +
>>>>> +  # Unshare cgroupns along with userns and mountns
>>>>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), 
>>>>> then
>>>>> +  # sets up uid/gid map and execs /bin/bash
>>>>> +  $ ~/unshare -c -u -m
>>>>
>>>> This command does not issue CLONE_NEWUSER, -U does.
>>>>
>>> I was using a custom unshare binary. But I will update the command
>>> line to be similar to the one in util-linux.
>>>
>>>>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>>>>> +  # hierarchy.
>>>>> +  [ns]$ mount -t cg

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2015-01-05 Thread Aditya Kali
Thanks for the review. I have made the suggested fixes. Regarding
relative path, please see inline.

On Fri, Dec 12, 2014 at 12:54 AM, Zefan Li  wrote:
>> +In its current form, the cgroup namespaces patcheset provides following
>> +behavior:
>> +
>> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
>> +the process calling unshare is running.
>> +For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
>> +cgroup /batchjobs/container_id1 becomes the cgroupns-root.
>> +For the init_cgroup_ns, this is the real root ('/') cgroup
>> +(identified in code as cgrp_dfl_root.cgrp).
>> +
>> +(2) The cgroupns-root cgroup does not change even if the namespace
>> +creator process later moves to a different cgroup.
>> +$ ~/unshare -c # unshare cgroupns in some cgroup
>> +[ns]$ cat /proc/self/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +[ns]$ mkdir sub_cgrp_1
>> +[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> +[ns]$ cat /proc/self/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
>> +(a) Processes running inside the cgroup namespace will be able to see
>> +cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> +[ns]$ sleep 10 &  # From within unshared cgroupns
>> +[1] 7353
>> +[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> +[ns]$ cat /proc/7353/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +
>> +(b) From global cgroupns, the real cgroup path will be visible:
>> +$ cat /proc/7353/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
>> +
>> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
>> +path relative to its own cgroupns-root will be shown:
>> +# ns2's cgroupns-root is at '/batchjobs/container_id2'
>> +[ns2]$ cat /proc/7353/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
>
> Should be ../container_id1/sub_cgrp_1 ?
>

Starting with '/' was deliberate.

>> +
>> +Note that the relative path always starts with '/' to indicate that its
>> +relative to the cgroupns-root of the caller.
>
> If a path doesn't start with '/', then it's a relative path, so why make it 
> start with '/'?
>

This is so as not to surprise apps that parse /proc/<pid>/cgroup
files and use the path in them as an absolute path. The existing paths
there always start with '/' right now. Retaining the '/' means that paths
generated by userspace continue to work. Does this make sense?
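
To make the convention concrete, here is a minimal userspace sketch of how
such a path could be derived. It is not the kernfs code from patch 1/8; the
helper names cgroupns_relative_path() and buf_append() are hypothetical, and
the sketch assumes both inputs are absolute, normalized cgroup paths with no
trailing '/'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to out, keeping it NUL-terminated; -1 if it won't fit. */
static int buf_append(char *out, size_t outlen, size_t *used,
                      const char *s, size_t n)
{
    if (*used + n + 1 > outlen)
        return -1;
    memcpy(out + *used, s, n);
    *used += n;
    out[*used] = '\0';
    return 0;
}

/*
 * Hypothetical helper: express the absolute cgroup path `target` relative
 * to the reader's cgroupns-root `root`. The result always starts with '/'.
 * Returns 0 on success, -1 if `out` is too small.
 */
int cgroupns_relative_path(const char *root, const char *target,
                           char *out, size_t outlen)
{
    assert(root && target && out);
    assert(root[0] == '/' && target[0] == '/');

    /* Treat "/" as the empty prefix so each path is a list of "/name" parts. */
    size_t rlen = strcmp(root, "/") == 0 ? 0 : strlen(root);
    size_t tlen = strcmp(target, "/") == 0 ? 0 : strlen(target);

    /* Longest shared prefix that ends on a component boundary in both paths. */
    size_t common = 0;
    for (size_t i = 0; ; i++) {
        int r_end = (i == rlen) || root[i] == '/';
        int t_end = (i == tlen) || target[i] == '/';
        if (r_end && t_end)
            common = i;
        if (i >= rlen || i >= tlen || root[i] != target[i])
            break;
    }

    size_t used = 0;
    if (outlen == 0)
        return -1;
    out[0] = '\0';

    /* One "/.." for each component of root below the shared prefix. */
    for (size_t j = common; j < rlen; j++)
        if (root[j] == '/' && buf_append(out, outlen, &used, "/..", 3) != 0)
            return -1;

    /* Then descend into what is left of the target. */
    if (buf_append(out, outlen, &used, target + common, tlen - common) != 0)
        return -1;

    /* Target is the cgroupns-root itself: report "/". */
    if (used == 0 && buf_append(out, outlen, &used, "/", 1) != 0)
        return -1;
    return 0;
}
```

With root "/batchjobs/container_id1" and target
"/batchjobs/container_id1/sub_cgrp_1" this yields "/sub_cgrp_1"; with root
"/batchjobs/container_id2" and the same target it yields
"/../container_id1/sub_cgrp_1". In both cases the leading '/' is kept, so a
parser that expects an absolute-looking path still sees one.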

>> +
>> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
>> +(if they have proper access to external cgroups).
>> +# From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
>> +# assuming that the global hierarchy is still accessible inside cgroupns:
>> +$ cat /proc/7353/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>> +$ echo 7353 > batchjobs/container_id2/cgroup.procs
>> +$ cat /proc/7353/cgroup
>> +0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
>> +
>> +Note that this kind of setup is not encouraged. A task inside cgroupns
>> +should only be exposed to its own cgroupns hierarchy. Otherwise it makes
>> +the virtualization of /proc/<pid>/cgroup less useful.
>> +
>> +(5) Setns to another cgroup namespace is allowed when:
>> +(a) the process has CAP_SYS_ADMIN in its current userns
>> +(b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
>> +No implicit cgroup changes happen with attaching to another cgroupns. It
>> +is expected that the somone moves the attaching process under the target
>> +cgroupns-root.
>> +
>
> s/the somone/someone
>
fixed.

>> +(6) When some thread from a multi-threaded process unshares its
>> +cgroup-namespace, the new cgroupns gets applied to the entire
>> +process (all the threads). This should be OK since
>> +unified-hierarchy only allows process-level containerization. So
>> +all the threads in the process will have the same cgroup.
>> +
>> +(7) The cgroup namespace is alive as long as there is atleast 1
>
> s/atelast/at least
>
fixed.

>> +process inside it. When the last process exits, the cgroup
>> +namespace is destroyed. The cgroupns-root and the actual cgroups
>> +remain though.
>> +
>> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
>> +inside cgroupns:
>> +$ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
>> +
>> +This will mount the unified cgroup hierarchy with cgroupns-root as the
>> +filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
>> +
>>
>

Thanks!
-- 
Aditya

Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2015-01-05 Thread Aditya Kali
On Sun, Dec 14, 2014 at 3:05 PM, Richard Weinberger  wrote:
> Aditya,
>
> I gave your patch set a try but it does not work for me.
> Maybe you can bring some light into the issues I'm facing.
> Sadly I still had no time to dig into your code.
>
> Am 05.12.2014 um 02:55 schrieb Aditya Kali:
>> Signed-off-by: Aditya Kali 
>> ---
>>  Documentation/cgroups/namespace.txt | 147
>>  1 file changed, 147 insertions(+)
>>  create mode 100644 Documentation/cgroups/namespace.txt
>>
>> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
>> new file mode 100644
>> index 000..6480379
>> --- /dev/null
>> +++ b/Documentation/cgroups/namespace.txt
>> @@ -0,0 +1,147 @@
>> + CGroup Namespaces
>> +
>> +CGroup Namespace provides a mechanism to virtualize the view of the
>> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
>> +clone() and unshare() syscalls to create a new cgroup namespace.
>> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
>> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
>> +at the time of creation of the cgroup namespace.
>> +
>> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
>> +path of the cgroup of a process. In a container setup (where a set of cgroups
>> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
>> +may leak potential system level information to the isolated processes.
>> +
>> +For Example:
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +The path '/batchjobs/container_id1' can generally be considered as system-data
>> +and its desirable to not expose it to the isolated process.
>> +
>> +CGroup Namespaces can be used to restrict visibility of this path.
>> +For Example:
>> +  # Before creating cgroup namespace
>> +  $ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> +  $ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
>> +  $ ~/unshare -c
>> +  [ns]$ ls -l /proc/self/ns/cgroup
>> +  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> +  # From within new cgroupns, process sees that its in the root cgroup
>> +  [ns]$ cat /proc/self/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> +
>> +  # From global cgroupns:
>> +  $ cat /proc/<pid>/cgroup
>> +  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
>> +
>> +  # Unshare cgroupns along with userns and mountns
>> +  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> +  # sets up uid/gid map and execs /bin/bash
>> +  $ ~/unshare -c -u -m
>
> This command does not issue CLONE_NEWUSER, -U does.
>
I was using a custom unshare binary. But I will update the command
line to be similar to the one in util-linux.
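
For reference, a small sketch of the option-to-flag mapping at issue. The
option letters follow my understanding of util-linux unshare (-m, -u, -U,
and -C for cgroup namespaces in later releases); the function names are
hypothetical, and the fallback flag values are only there in case the libc
headers do not define them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Fallbacks for headers that predate or hide these flags. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS     0x00020000
#endif
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS    0x04000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER   0x10000000
#endif

/* Hypothetical helper: map one option letter to its unshare(2) flag. */
int unshare_flag_for_option(char opt)
{
    switch (opt) {
    case 'm': return CLONE_NEWNS;      /* mount namespace */
    case 'u': return CLONE_NEWUTS;     /* UTS namespace, not user */
    case 'U': return CLONE_NEWUSER;    /* user namespace */
    case 'C': return CLONE_NEWCGROUP;  /* cgroup namespace */
    default:  return 0;                /* unknown letters add nothing */
    }
}

/* OR together the flags for a string of option letters, e.g. "CUm". */
int unshare_flags_for_options(const char *opts)
{
    assert(opts != NULL);
    int flags = 0;
    for (; *opts; opts++)
        flags |= unshare_flag_for_option(*opts);
    return flags;
}
```

Under these assumptions "CUm" gives the combination the documentation
describes, CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS, while a lowercase 'u'
only requests a new UTS namespace, which is the point Richard raised.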

>> +  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
>> +  # hierarchy.
>> +  [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> +  [ns]$ ls -l /tmp/cgroup
>> +  total 0
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> +  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> +  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs 
> into a container.
> But I'm unable to mount cgroupfs within the container, mount(2) is failing 
> with EINVAL.
> And /proc/self/cgroup still shows the cgroup from outside.
>
> ---cut---
> container:/ # ls /sys/fs/cgroup/
> container:/ # mount -t cgroup none /sys/fs/cgroup/

You need to provide the "-o __DEVEL__sane_behavior" flag. Inside the
container, only unified hierarchy can be mounted. So, for now, that
flag is needed. I will fix the documentation too.
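
As a rough illustration, the equivalent mount(2) call would look something
like the sketch below. The wrapper name mount_cgroupns_root() is
hypothetical; the option string is the one from the documentation patch,
spelled with two underscores after DEVEL. It only has a chance of working on
a kernel carrying this patch set and with CAP_SYS_ADMIN in the right
namespaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical wrapper: mount the unified cgroup hierarchy at mount_point,
 * mirroring
 *   mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
 * Returns 0 on success or a negative errno value on failure.
 */
int mount_cgroupns_root(const char *mount_point)
{
    assert(mount_point != NULL);

    if (mount("cgroup", mount_point, "cgroup", 0,
              "__DEVEL__sane_behavior") != 0)
        return -errno;
    return 0;
}
```

Returning the negated errno lets a caller distinguish the EINVAL seen above
from, say, EPERM when CAP_SYS_ADMIN is missing.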

> mount: wrong fs type, bad option, bad superblock on none,
>missing codepage or helper program, or other error
>
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
> container:/ # cat /proc/self/cgroup
> 8:memory:/machine/test00.libvirt-lxc
> 7:devices:/machine/test00.libvirt-lxc
> 6:hugetlb:/
> 5:cpuset:/machine/test00.libvirt-lxc
> 4:blkio:/machine/test00.libvirt-lxc
> 3:cpu,cpuacct:/machine/test00.libvirt-lxc
> 2:freezer:/machine/test00.libvirt-lxc
> 1:name=systemd:/user.slice/user-0.slice/session-c2.scope
> container:/ # ls -la /proc/self/ns
> total 0
> dr-x--x--x 2 root root 0 Dec 14 23:02 .
> dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
> lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid

Re: [PATCHv3 0/8] CGroup Namespaces

2014-12-04 Thread Aditya Kali
These patches are now also hosted on github at
https://github.com/adityakali/linux/tree/cgroupns_v3.

Thanks,

On Thu, Dec 4, 2014 at 5:55 PM, Aditya Kali  wrote:
> Another spin for CGroup Namespaces feature.
>
> Changes from V2:
> 1. Added documentation in Documentation/cgroups/namespace.txt
> 2. Fixed a bug that caused crash
> 3. Incorporated some other suggestions from last patchset:
>- removed use of threadgroup_lock() while creating new cgroupns
>- use task_lock() instead of rcu_read_lock() while accessing
>  task->nsproxy
>- optimized setns() to own cgroupns
>- simplified code around sane-behavior mount option parsing
> 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
>not changed since then.
>
> Changes from V1:
> 1. No pinning of processes within cgroupns. Tasks can be freely moved
>across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
>apply as before.
> 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
>    cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
>    of the reader and cgroup of <pid>.
> 3. setns() does not require the process to first move under target
>cgroupns-root.
>
> Changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> ---
>  Documentation/cgroups/namespace.txt | 147 +++
>  fs/kernfs/dir.c                     | 195
>  fs/kernfs/mount.c                   |  48 +
>  fs/proc/namespaces.c                |   1 +
>  include/linux/cgroup.h              |  52 +-
>  include/linux/cgroup_namespace.h    |  36 +++
>  include/linux/kernfs.h              |   5 +
>  include/linux/nsproxy.h             |   2 +
>  include/linux/proc_ns.h             |   4 +
>  include/uapi/linux/sched.h          |   3 +-
>  kernel/Makefile                     |   2 +-
>  kernel/cgroup.c                     | 106 +++-
>  kernel/cgroup_namespace.c           | 140 ++
>  kernel/fork.c                       |   2 +-
>  kernel/nsproxy.c                    |  19 +++-
>  15 files changed, 711 insertions(+), 51 deletions(-)
>  create mode 100644 Documentation/cgroups/namespace.txt
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv3 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv3 5/8] cgroup: introduce cgroup namespaces
> [PATCHv3 6/8] cgroup: cgroup namespace setns support
> [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces



-- 
Aditya


[PATCHv3 3/8] cgroup: add function to get task's cgroup on default hierarchy

2014-12-04 Thread Aditya Kali
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Acked-by: Serge Hallyn 
Signed-off-by: Aditya Kali 
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c| 25 +
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9fd99f5..d6930de 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0..5d8fc84 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1966,6 +1966,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+   struct cgroup *cgrp;
+
+   mutex_lock(&cgroup_mutex);
+   down_read(&css_set_rwsem);
+
+   cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+   cgroup_get(cgrp);
+
+   up_read(&css_set_rwsem);
+   mutex_unlock(&cgroup_mutex);
+   return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
-- 
2.2.0.rc0.207.ga3a616c



[PATCHv3 6/8] cgroup: cgroup namespace setns support

2014-12-04 Thread Aditya Kali
setns on a cgroup namespace is allowed only if the task has
CAP_SYS_ADMIN in its current user namespace and over the user
namespace associated with the target cgroupns. Attaching to another
cgroupns causes no implicit cgroup changes. It is expected that
someone moves the attaching process under the target cgroupns-root.

Signed-off-by: Aditya Kali 
---
 kernel/cgroup_namespace.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 0e0ef3a..ee0cc51 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -79,8 +79,21 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   struct cgroup_namespace *cgroup_ns = ns;
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Don't need to do anything if we are attaching to our own cgroupns. */
+   if (cgroup_ns == nsproxy->cgroup_ns)
+   return 0;
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.2.0.rc0.207.ga3a616c
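
For readers following the setns() path from userspace, here is a minimal
sketch of how a process might attach to another task's cgroup namespace once
this series is applied. It assumes a Linux system that exposes
/proc/<pid>/ns/cgroup; join_cgroup_ns() is a hypothetical helper name, not
something this patch adds, and the fallback CLONE_NEWCGROUP value is the one
proposed in patch 2/8.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed: value proposed in patch 2/8 */
#endif

/*
 * join_cgroup_ns - hypothetical helper: enter the cgroup namespace of @pid.
 * Returns 0 on success or a negative errno value on failure.
 */
int join_cgroup_ns(pid_t pid)
{
	char path[64];
	int fd, n, rc = 0;

	n = snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", (long)pid);
	assert(n > 0 && (size_t)n < sizeof(path));

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* The kernel side of this call is cgroupns_install() above. */
	if (setns(fd, CLONE_NEWCGROUP) < 0)
		rc = -errno;

	close(fd);
	return rc;
}
```

As the commit message says, a successful setns() does not move the caller
into the target cgroupns-root; that remains a separate step. Without
CAP_SYS_ADMIN in both user namespaces the call is expected to fail with
EPERM.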



[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2014-12-04 Thread Aditya Kali
Signed-off-by: Aditya Kali 
---
 Documentation/cgroups/namespace.txt | 147 
 1 file changed, 147 insertions(+)
 create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 000..6480379
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,147 @@
+   CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered system data,
+and it is desirable not to expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that it's in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+Note that CGroup Namespaces virtualize the path on the unified hierarchy only. If
+other hierarchies are mounted, /proc/<pid>/cgroup will continue to show the full
+cgroup path for those.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+the process calling unshare is running.
+For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+For the init_cgroup_ns, this is the real root ('/') cgroup
+(identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+creator process later moves to a different cgroup.
+$ ~/unshare -c # unshare cgroupns in some cgroup
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+[ns]$ mkdir sub_cgrp_1
+[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+[ns]$ sleep 10 &  # From within unshared cgroupns
+[1] 7353
+[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a 
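
The `~/unshare -c` helper used in the transcripts above is not part of this
posting. As a rough illustration of what such a helper would do before
exec'ing a shell, here is a minimal userspace sketch. It assumes a Linux
system with /proc mounted; unshare_cgroup_ns() and read_self_cgroup() are
hypothetical names introduced only for this example, and the fallback
CLONE_NEWCGROUP value is the one proposed in patch 2/8.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed: value proposed in patch 2/8 */
#endif

/*
 * Hypothetical helper: detach the caller into a new cgroup namespace.
 * Returns 0 on success or a negative errno value.
 */
int unshare_cgroup_ns(void)
{
	return unshare(CLONE_NEWCGROUP) < 0 ? -errno : 0;
}

/*
 * Hypothetical helper: copy the first line of /proc/self/cgroup into @buf,
 * without the trailing newline. Returns 0 on success or a negative errno.
 */
int read_self_cgroup(char *buf, size_t buflen)
{
	FILE *f;
	int rc = 0;

	assert(buf != NULL && buflen > 1);

	f = fopen("/proc/self/cgroup", "r");
	if (!f)
		return -errno;

	if (fgets(buf, (int)buflen, f))
		buf[strcspn(buf, "\n")] = '\0';
	else
		rc = -EIO;

	fclose(f);
	return rc;
}
```

Calling read_self_cgroup() before and after a successful unshare_cgroup_ns()
from a task in /batchjobs/container_id1 would be expected to show the path
change from the full cgroup path to '/', matching the transcript, while a
reader in the initial namespace keeps seeing the full path.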

[PATCHv3 5/8] cgroup: introduce cgroup namespaces

2014-12-04 Thread Aditya Kali
Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
---
 fs/proc/namespaces.c |   1 +
 include/linux/cgroup.h   |  29 -
 include/linux/cgroup_namespace.h |  36 +++
 include/linux/nsproxy.h  |   2 +
 include/linux/proc_ns.h  |   4 ++
 kernel/Makefile  |   2 +-
 kernel/cgroup.c  |  13 
 kernel/cgroup_namespace.c| 127 +++
 kernel/fork.c|   2 +-
 kernel/nsproxy.c |  19 +-
 10 files changed, 230 insertions(+), 5 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
    &userns_operations,
 #endif
    &mntns_operations,
+   &cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6e7533b..94a5a0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+   atomic_t                count;
+   unsigned int            proc_inum;
+   struct user_namespace   *user_ns;
+   struct cgroup           *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+struct cgroup *cgrp, char *buf,
+size_t buflen)
+{
+   if (ns) {
+   BUG_ON(!cgroup_on_dfl(cgrp));
+   return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
+buflen);
+   } else {
+   return kernfs_path(cgrp->kn, buf, buflen);
+   }
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   if (cgroup_on_dfl(cgrp)) {
+   return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
+ buflen);
+   } else {
+   return cgroup_path_ns(NULL, cgrp, buf, buflen);
+   }
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include 
+#include 
+#include 
+#include 
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+   return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+   return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+  struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@
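
The get_cgroup_ns()/put_cgroup_ns() pair in this patch follows the usual
"count references, free on last put" pattern. The following is only a
userspace model of that pattern using C11 atomics, meant to show the
contract; struct ns_model and the ns_model_*() names are hypothetical and
the real code uses atomic_t and free_cgroup_ns().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace stand-in for struct cgroup_namespace; only the counter is modeled. */
struct ns_model {
	atomic_int count;
};

/* Allocate a model namespace holding one reference for its creator. */
struct ns_model *ns_model_new(void)
{
	struct ns_model *ns = malloc(sizeof(*ns));

	if (ns)
		atomic_init(&ns->count, 1);
	return ns;
}

/* Like get_cgroup_ns(): tolerate NULL, take a reference, return the pointer. */
struct ns_model *ns_model_get(struct ns_model *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	return ns;
}

/*
 * Like put_cgroup_ns(): drop one reference and free the object when the
 * last one goes away. Returns 1 if the object was freed, 0 otherwise.
 */
int ns_model_put(struct ns_model *ns)
{
	int old;

	if (!ns)
		return 0;

	old = atomic_fetch_sub(&ns->count, 1);
	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1) {
		free(ns);
		return 1;
	}
	return 0;
}
```

When one namespace pointer is swapped for another, taking the reference on
the new namespace before dropping the old one (the order used by
cgroupns_install() in patch 6/8) avoids ever holding a pointer without a
reference.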

[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()

2014-12-04 Thread Aditya Kali
Move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Acked-by: Serge Hallyn 
Signed-off-by: Aditya Kali 
---
 include/linux/cgroup.h | 22 ++
 kernel/cgroup.c| 22 --
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d6930de..6e7533b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
    return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+   return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+   WARN_ON_ONCE(cgroup_is_dead(cgrp));
+   css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+   return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+   css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5d8fc84..e12d36e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -321,12 +321,6 @@ out_unlock:
return css;
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-   return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1039,22 +1033,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-   WARN_ON_ONCE(cgroup_is_dead(cgrp));
-   css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-   return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-   css_put(&cgrp->self);
-}
-
 /**
  * cgroup_calc_child_subsys_mask - calculate child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.2.0.rc0.207.ga3a616c



[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-12-04 Thread Aditya Kali
This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali 
---
 fs/kernfs/mount.c  | 48 
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 46 +-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = ilookup(sb, kn->ino);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* If this is a new dentry, set it up. We need kernfs_mutex because this
+    * may be called by callers other than kernfs_fill_super. */
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+           kernfs_get(kn);
+           dentry->d_fsdata = kn;
+   } else {
+           WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b1ae6d9..e779890 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
return -ENOENT;
}
 
+   /* If inside a non-init cgroup namespace, only allow default hierarchy
+    * to be mounted.
+    */
+   if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+       !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+           return -EINVAL;
+   }
+
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
@@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+struct cgroup_namespace *ns)
+{
+   struct dentry *nsdentry;
+
+   nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+   return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
LIST_HEAD(tmp_links);
@@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+   struct cgroup_namespace *ns =
+   get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+   /* Check if the caller has permission to mount. */
+   

[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2014-12-04 Thread Aditya Kali
CLONE_NEWCGROUP will be used to create new cgroup namespace.

Acked-by: Serge Hallyn 
Signed-off-by: Aditya Kali 
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED         0x00400000      /* Unused, ignored */
 #define CLONE_UNTRACED         0x00800000      /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID     0x01000000      /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP        0x02000000      /* New cgroup namespace */
 #define CLONE_NEWUTS           0x04000000      /* New utsname group? */
 #define CLONE_NEWIPC           0x08000000      /* New ipcs */
 #define CLONE_NEWUSER          0x10000000      /* New user namespace */
-- 
2.2.0.rc0.207.ga3a616c



[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path

2014-12-04 Thread Aditya Kali
The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali 
---
 fs/kernfs/dir.c| 195 +++--
 include/linux/kernfs.h |   3 +
 2 files changed, 177 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..cb225a7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,159 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-   char *p = buf + buflen;
+   size_t depth = 0;
+
+   BUG_ON(!kn);
+   while (kn->parent) {
+   depth++;
+   kn = kn->parent;
+   }
+   return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+   struct kernfs_node *kn_from,
+   struct kernfs_node *kn_to,
+   char *buf,
+   size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn;
+   size_t depth_from = 0, depth_to, d;
int len;
 
-   *--p = '\0';
+   /* We at least need 2 bytes to write "/\0". */
+   BUG_ON(buflen < 2);
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
+   /* Short-circuit the easy case - kn_to is the root node. */
+   if ((kn_from == kn_to) || (!kn_from && !kn_to->parent)) {
+   *p = '/';
+   *(p + 1) = '\0';
+   return p;
+   }
+
+   /* We can find the relative path only if both the nodes belong to the
+* same kernfs root.
+*/
+   if (kn_from) {
+   BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+   depth_from = kernfs_node_depth(kn_from);
+   }
+
+   depth_to = kernfs_node_depth(kn_to);
+
+   /* We compose path from left to right. So first write out all possible
+* "/.." strings needed to reach from 'kn_from' to the common ancestor.
+*/
+   if (kn_from) {
+   while (depth_from > depth_to) {
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
}
+
+   d = depth_to;
+   kn = kn_to;
+   while (depth_from < d) {
+   kn = kn->parent;
+   d--;
+   }
+
+   /* Now we have 'depth_from == depth_to' at this point. Add more
+* "/.."s until we reach common ancestor. In the worst case,
+* root node will be the common ancestor.
+*/
+   while (depth_from > 0) {
+   /* If we reached common ancestor, stop. */
+   if (kn_from == kn)
+   break;
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
+   kn = kn->parent;
+   }
+   }
+
+   /* Figure out
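
Since the archived diff is cut off above, here is a small userspace model of
the relative-path walk described in the kernfs_path_from_node_locked()
comment: climb from the reference node writing "/..", then append the names
leading down to the target. It is a sketch under stated assumptions, not the
patch's implementation: struct node, path_from_node() and the MAX_DEPTH limit
are hypothetical, both nodes are assumed to be in the same tree, and the
NULL kn_from case handled by the patch is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64	/* arbitrary limit for this sketch only */

struct node {
	const char *name;
	const struct node *parent;	/* NULL for the root */
};

/* Number of edges between @n and the root; the root is at depth 0. */
static size_t node_depth(const struct node *n)
{
	size_t depth = 0;

	while (n->parent) {
		depth++;
		n = n->parent;
	}
	return depth;
}

/* Append @s at offset *@pos; return 0 on success, -1 if @buf is too small. */
static int append(char *buf, size_t buflen, size_t *pos, const char *s)
{
	size_t len = strlen(s);

	if (buflen - *pos < len + 1)
		return -1;
	memcpy(buf + *pos, s, len + 1);
	*pos += len;
	return 0;
}

/* Write the path of @to relative to @from into @buf; NULL if it does not fit. */
char *path_from_node(const struct node *from, const struct node *to,
		     char *buf, size_t buflen)
{
	const struct node *chain[MAX_DEPTH];
	size_t d_from = node_depth(from), d_to = node_depth(to);
	size_t pos = 0, n = 0;

	assert(buflen >= 2);	/* room for "/" and the terminator */
	buf[0] = '\0';

	/* Climb from the deeper reference node, one "/.." per level. */
	while (d_from > d_to) {
		if (append(buf, buflen, &pos, "/.."))
			goto too_small;
		from = from->parent;
		d_from--;
	}
	/* Climb the target side silently, remembering names to print later. */
	while (d_to > d_from) {
		if (n == MAX_DEPTH)
			goto too_small;
		chain[n++] = to;
		to = to->parent;
		d_to--;
	}
	/* Both sides are at equal depth: climb together to the common ancestor. */
	while (from != to) {
		if (n == MAX_DEPTH || append(buf, buflen, &pos, "/.."))
			goto too_small;
		chain[n++] = to;
		from = from->parent;
		to = to->parent;
	}
	/* Emit the names from the common ancestor down to the target. */
	while (n > 0) {
		n--;
		if (append(buf, buflen, &pos, "/") ||
		    append(buf, buflen, &pos, chain[n]->name))
			goto too_small;
	}
	if (pos == 0)	/* @from and @to are the same node */
		append(buf, buflen, &pos, "/");
	return buf;

too_small:
	buf[0] = '\0';
	return NULL;
}
```

Run against the three scenarios listed in the kernel-doc comment, the model
yields "/n4/n5", "/../../n5" and "/../..", and "/" when both nodes are the
same.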

[PATCHv3 0/8] CGroup Namespaces

2014-12-04 Thread Aditya Kali
Another spin for CGroup Namespaces feature.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
 task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

---
 Documentation/cgroups/namespace.txt | 147 +++
 fs/kernfs/dir.c | 195 
 fs/kernfs/mount.c   |  48 +
 fs/proc/namespaces.c|   1 +
 include/linux/cgroup.h  |  52 +-
 include/linux/cgroup_namespace.h|  36 +++
 include/linux/kernfs.h  |   5 +
 include/linux/nsproxy.h |   2 +
 include/linux/proc_ns.h |   4 +
 include/uapi/linux/sched.h  |   3 +-
 kernel/Makefile |   2 +-
 kernel/cgroup.c | 106 +++-
 kernel/cgroup_namespace.c   | 140 ++
 kernel/fork.c   |   2 +-
 kernel/nsproxy.c|  19 +++-
 15 files changed, 711 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/cgroups/namespace.txt
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv3 3/8] cgroup: add function to get task's cgroup on default
[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv3 5/8] cgroup: introduce cgroup namespaces
[PATCHv3 6/8] cgroup: cgroup namespace setns support
[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces


[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path

2014-12-04 Thread Aditya Kali
The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali adityak...@google.com
---
 fs/kernfs/dir.c| 195 +++--
 include/linux/kernfs.h |   3 +
 2 files changed, 177 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..cb225a7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,159 @@ static int kernfs_name_locked(struct kernfs_node *kn, char 
*buf, size_t buflen)
return strlcpy(buf, kn-parent ? kn-name : /, buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char 
*buf,
- size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-   char *p = buf + buflen;
+   size_t depth = 0;
+
+   BUG_ON(!kn);
+   while (kn-parent) {
+   depth++;
+   kn = kn-parent;
+   }
+   return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+   struct kernfs_node *kn_from,
+   struct kernfs_node *kn_to,
+   char *buf,
+   size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn;
+   size_t depth_from = 0, depth_to, d;
int len;
 
-   *--p = '\0';
+   /* We atleast need 2 bytes to write /\0. */
+   BUG_ON(buflen  2);
 
-   do {
-   len = strlen(kn-name);
-   if (p - buf  len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
+   /* Short-circuit the easy case - kn_to is the root node. */
+   if ((kn_from == kn_to) || (!kn_from  !kn_to-parent)) {
+   *p = '/';
+   *(p + 1) = '\0';
+   return p;
+   }
+
+   /* We can find the relative path only if both the nodes belong to the
+* same kernfs root.
+*/
+   if (kn_from) {
+   BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+   depth_from = kernfs_node_depth(kn_from);
+   }
+
+   depth_to = kernfs_node_depth(kn_to);
+
+   /* We compose path from left to right. So first write out all possible
+* "/.." strings needed to reach from 'kn_from' to the common ancestor.
+*/
+   if (kn_from) {
+   while (depth_from > depth_to) {
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
}
+
+   d = depth_to;
+   kn = kn_to;
+   while (depth_from < d) {
+   kn = kn->parent;
+   d--;
+   }
+
+   /* Now we have 'depth_from == depth_to' at this point. Add more
+* "/.."s until we reach common ancestor. In the worst case,
+* root node will be the common ancestor.
+*/
+   while (depth_from > 0) {
+   /* If we reached common ancestor, stop. */
+   if (kn_from == kn)
+   break;
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
+   kn = kn->parent;
+   }
+   }
+
+   /* Figure out

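[Editorial sketch, not part of the patch.] The relative-path rule described in
the kernfs_path_from_node_locked() comment can be tried out in userspace on
plain path strings. The sketch below is only an illustration under stated
assumptions: rel_path(), common_prefix_len() and count_components() are
hypothetical names, inputs are assumed to be normalized absolute paths, and it
works on strings rather than walking kernfs_node parents as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest common prefix of a and b that ends on a path
 * component boundary (both strings are at '/' or at their end). */
static size_t common_prefix_len(const char *a, const char *b)
{
	size_t i = 0, last = 0;

	while (a[i] && b[i] && a[i] == b[i]) {
		i++;
		if ((a[i] == '/' || a[i] == '\0') &&
		    (b[i] == '/' || b[i] == '\0'))
			last = i;
	}
	return last;
}

/* Number of non-empty '/'-separated components in p. */
static size_t count_components(const char *p)
{
	size_t n = 0;

	while (*p) {
		while (*p == '/')
			p++;
		if (*p)
			n++;
		while (*p && *p != '/')
			p++;
	}
	return n;
}

/* Hypothetical helper: write the path of 'to' relative to 'from' into buf,
 * using one "/.." per level that must be climbed. Returns buf, or NULL if
 * buf is too small (mirroring the patch's error convention). */
static char *rel_path(const char *from, const char *to, char *buf,
		      size_t buflen)
{
	size_t k, ups, restlen, need, i;
	const char *rest;
	char *p = buf;

	assert(from[0] == '/' && to[0] == '/');

	k = common_prefix_len(from, to);
	ups = count_components(from + k);	/* levels to climb */
	rest = to + k;				/* part of 'to' below the ancestor */
	if (strcmp(rest, "/") == 0)
		rest = "";
	restlen = strlen(rest);

	need = ups * 3 + restlen;
	if (need == 0)
		need = 1;			/* result is just "/" */
	if (buflen < need + 1) {
		if (buflen)
			buf[0] = '\0';
		return NULL;
	}

	for (i = 0; i < ups; i++) {
		memcpy(p, "/..", 3);
		p += 3;
	}
	memcpy(p, rest, restlen);
	p += restlen;
	if (p == buf)
		*p++ = '/';
	*p = '\0';
	return buf;
}
```

Called with the pairs from the comment above, rel_path("/n1/n2/n3/n4",
"/n1/n2/n5", ...) yields "/../../n5", matching scenario [2].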
[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2014-12-04 Thread Aditya Kali
CLONE_NEWCGROUP will be used to create new cgroup namespace.

Acked-by: Serge Hallyn <serge.hal...@canonical.com>
Signed-off-by: Aditya Kali <adityak...@google.com>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED 0x00400000  /* Unused, ignored */
 #define CLONE_UNTRACED 0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID 0x01000000  /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000  /* New cgroup namespace */
 #define CLONE_NEWUTS   0x04000000  /* New utsname group? */
 #define CLONE_NEWIPC   0x08000000  /* New ipcs */
 #define CLONE_NEWUSER  0x10000000  /* New user namespace */
-- 
2.2.0.rc0.207.ga3a616c
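
[Editorial sketch, not part of the patch.] From userspace, the new flag would
be passed to unshare(2) or clone(2) like the other CLONE_NEW* flags. The
snippet below is a minimal sketch under assumptions: enter_new_cgroupns() is a
hypothetical helper name, the fallback #define uses the flag value from this
patch for C libraries that do not define it, and the call fails (for example
with EPERM or EINVAL) on kernels without the feature or without sufficient
privilege.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for unshare() in <sched.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value introduced by this patch */
#endif

/* Hypothetical helper: move the calling process into a new cgroup
 * namespace. Returns 0 on success or a negative errno value. */
static int enter_new_cgroupns(void)
{
	if (unshare(CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

A caller would typically check the return value and fall back to running
without a private cgroup namespace when the kernel rejects the flag.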

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-12-04 Thread Aditya Kali
This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 fs/kernfs/mount.c  | 48 
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 46 +-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = ilookup(sb, kn->ino);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* If this is a new dentry, set it up. We need kernfs_mutex because this
+* may be called by callers other than kernfs_fill_super. */
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+   kernfs_get(kn);
+   dentry->d_fsdata = kn;
+   } else {
+   WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b1ae6d9..e779890 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
return -ENOENT;
}
 
+   /* If inside a non-init cgroup namespace, only allow default hierarchy
+* to be mounted.
+*/
+   if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+   !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+   return -EINVAL;
+   }
+
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
@@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+struct cgroup_namespace *ns)
+{
+   struct dentry *nsdentry;
+
+   nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+   return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
LIST_HEAD(tmp_links);
@@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+   struct cgroup_namespace *ns =
+   get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns

[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()

2014-12-04 Thread Aditya Kali
move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Acked-by: Serge Hallyn <serge.hal...@canonical.com>
Signed-off-by: Aditya Kali <adityak...@google.com>
---
 include/linux/cgroup.h | 22 ++
 kernel/cgroup.c| 22 --
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d6930de..6e7533b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+   return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+   WARN_ON_ONCE(cgroup_is_dead(cgrp));
+   css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+   return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+   css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5d8fc84..e12d36e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -321,12 +321,6 @@ out_unlock:
return css;
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-   return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1039,22 +1033,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-   WARN_ON_ONCE(cgroup_is_dead(cgrp));
-   css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-   return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-   css_put(&cgrp->self);
-}
-
 /**
  * cgroup_calc_child_subsys_mask - calculate child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.2.0.rc0.207.ga3a616c



[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces

2014-12-04 Thread Aditya Kali
Signed-off-by: Aditya Kali <adityak...@google.com>
---
 Documentation/cgroups/namespace.txt | 147 
 1 file changed, 147 insertions(+)
 create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 0000000..6480379
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,147 @@
+   CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and it's desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+  # Before creating cgroup namespace
+  $ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  $ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+  $ ~/unshare -c
+  [ns]$ ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # From within new cgroupns, process sees that it's in the root cgroup
+  [ns]$ cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+  # From global cgroupns:
+  $ cat /proc/<pid>/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+  # Unshare cgroupns along with userns and mountns
+  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+  # sets up uid/gid map and execs /bin/bash
+  $ ~/unshare -c -u -m
+  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+  # hierarchy.
+  [ns]$ mount -t cgroup cgroup /tmp/cgroup
+  [ns]$ ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+Note that CGroup Namespaces virtualizes the path on unified hierarchy only. If
+other hierarchies are mounted, /proc/<pid>/cgroup will continue to show the full
+cgroup path for those.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+the process calling unshare is running.
+For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+For the init_cgroup_ns, this is the real root ('/') cgroup
+(identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+creator process later moves to a different cgroup.
+$ ~/unshare -c # unshare cgroupns in some cgroup
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+[ns]$ mkdir sub_cgrp_1
+[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/self/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+[ns]$ sleep 100000 &  # From within unshared cgroupns
+[1] 7353
+[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+[ns]$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+$ cat /proc/7353/cgroup
+0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
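
[Editorial sketch, not part of the patch.] Each line in the examples above has
the form "hierarchy-ID:controller-list:cgroup-path", and the namespace only
changes the last field. A tool that wants the path as seen from its own
cgroupns could extract it as sketched below. cgroup_line_path() is a
hypothetical helper name, and the sketch assumes the line has already been
read and stripped of its trailing newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the cgroup path inside one
 * "id:controllers:path" line, or NULL if the line does not contain two
 * ':' separators. The returned pointer aliases 'line'. */
static const char *cgroup_line_path(const char *line)
{
	const char *first = strchr(line, ':');
	const char *second = first ? strchr(first + 1, ':') : NULL;

	return second ? second + 1 : NULL;
}
```

With the sample lines from the documentation, the helper returns
"/batchjobs/container_id1" for the global view and "/" for the view from
inside the new cgroupns.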

[PATCHv3 5/8] cgroup: introduce cgroup namespaces

2014-12-04 Thread Aditya Kali
Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 fs/proc/namespaces.c |   1 +
 include/linux/cgroup.h   |  29 -
 include/linux/cgroup_namespace.h |  36 +++
 include/linux/nsproxy.h  |   2 +
 include/linux/proc_ns.h  |   4 ++
 kernel/Makefile  |   2 +-
 kernel/cgroup.c  |  13 
 kernel/cgroup_namespace.c| 127 +++
 kernel/fork.c|   2 +-
 kernel/nsproxy.c |  19 +-
 10 files changed, 230 insertions(+), 5 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+   &cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6e7533b..94a5a0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+   atomic_t count;
+   unsigned int proc_inum;
+   struct user_namespace *user_ns;
+   struct cgroup *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+struct cgroup *cgrp, char *buf,
+size_t buflen)
+{
+   if (ns) {
+   BUG_ON(!cgroup_on_dfl(cgrp));
+   return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
+buflen);
+   } else {
+   return kernfs_path(cgrp->kn, buf, buflen);
+   }
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   if (cgroup_on_dfl(cgrp)) {
+   return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
+ buflen);
+   } else {
+   return cgroup_path_ns(NULL, cgrp, buf, buflen);
+   }
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+   return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+   return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+  struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct

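[Editorial sketch, not part of the patch.] get_cgroup_ns() and put_cgroup_ns()
in the header above follow the usual get/put reference-counting pattern: take
a reference by incrementing the counter, and free the object when the last
reference is dropped. The userspace model below illustrates that pattern with
C11 atomics. struct ns_sketch, ns_alloc(), ns_get(), ns_put() and freed_count
are hypothetical names for illustration only, and free() stands in for
free_cgroup_ns().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for struct cgroup_namespace: only the reference count. */
struct ns_sketch {
	atomic_int count;
};

static int freed_count;	/* demonstration only: how many objects were freed */

/* Allocate an object holding one initial reference. */
static struct ns_sketch *ns_alloc(void)
{
	struct ns_sketch *ns = malloc(sizeof(*ns));

	if (ns)
		atomic_init(&ns->count, 1);
	return ns;
}

/* Take an extra reference; NULL is passed through, as in get_cgroup_ns(). */
static struct ns_sketch *ns_get(struct ns_sketch *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
	return ns;
}

/* Drop a reference and free the object when the count reaches zero. */
static void ns_put(struct ns_sketch *ns)
{
	/* atomic_fetch_sub() returns the previous value, so 1 means this
	 * caller dropped the last reference. */
	if (ns && atomic_fetch_sub(&ns->count, 1) == 1) {
		free(ns);
		freed_count++;
	}
}
```

The NULL checks mirror the header: both helpers tolerate a NULL namespace
pointer, so callers do not need to test before calling them.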
[PATCHv3 3/8] cgroup: add function to get task's cgroup on default hierarchy

2014-12-04 Thread Aditya Kali
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Acked-by: Serge Hallyn <serge.hal...@canonical.com>
Signed-off-by: Aditya Kali <adityak...@google.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c| 25 +
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9fd99f5..d6930de 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0..5d8fc84 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1966,6 +1966,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+   struct cgroup *cgrp;
+
+   mutex_lock(&cgroup_mutex);
+   down_read(&css_set_rwsem);
+
+   cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+   cgroup_get(cgrp);
+
+   up_read(&css_set_rwsem);
+   mutex_unlock(&cgroup_mutex);
+   return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
-- 
2.2.0.rc0.207.ga3a616c



[PATCHv3 6/8] cgroup: cgroup namespace setns support

2014-12-04 Thread Aditya Kali
setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 kernel/cgroup_namespace.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 0e0ef3a..ee0cc51 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -79,8 +79,21 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   struct cgroup_namespace *cgroup_ns = ns;
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Don't need to do anything if we are attaching to our own cgroupns. */
+   if (cgroup_ns == nsproxy->cgroup_ns)
+   return 0;
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.2.0.rc0.207.ga3a616c
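
[Editorial sketch, not part of the patch.] From userspace, attaching to
another task's cgroup namespace would go through setns(2) on that task's
/proc/<pid>/ns/cgroup file, the same file shown in the documentation patch of
this series. The snippet below is a sketch under assumptions: cgroupns_path()
and join_cgroupns() are hypothetical helper names, the CLONE_NEWCGROUP
fallback uses the value from patch 2/8, and the call returns a negative errno
value (for example -EPERM or -ENOENT) when privileges or kernel support are
missing.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for setns() in <sched.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value from patch 2/8 */
#endif

/* Hypothetical helper: format the namespace file path for 'pid'.
 * Returns 0 on success, -1 if buf is too small. */
static int cgroupns_path(pid_t pid, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "/proc/%ld/ns/cgroup", (long)pid);

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper: attach the caller to the cgroup namespace of 'pid'.
 * Returns 0 on success or a negative errno value. As the commit message
 * says, this does not move the caller into the target cgroupns-root. */
static int join_cgroupns(pid_t pid)
{
	char path[64];
	int fd, ret = 0;

	if (cgroupns_path(pid, path, sizeof(path)) != 0)
		return -ENAMETOOLONG;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd == -1)
		return -errno;

	if (setns(fd, CLONE_NEWCGROUP) == -1)
		ret = -errno;	/* save errno before close() can change it */

	close(fd);
	return ret;
}
```

After a successful call the process still sits in its old cgroup, so a
manager would normally also write the pid into cgroup.procs under the target
cgroupns-root.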



Re: [PATCHv3 0/8] CGroup Namespaces

2014-12-04 Thread Aditya Kali
These patches are now also hosted on github at
https://github.com/adityakali/linux/tree/cgroupns_v3.

Thanks,

On Thu, Dec 4, 2014 at 5:55 PM, Aditya Kali <adityak...@google.com> wrote:
> Another spin for CGroup Namespaces feature.
>
> Changes from V2:
> 1. Added documentation in Documentation/cgroups/namespace.txt
> 2. Fixed a bug that caused crash
> 3. Incorporated some other suggestions from last patchset:
>    - removed use of threadgroup_lock() while creating new cgroupns
>    - use task_lock() instead of rcu_read_lock() while accessing
>      task->nsproxy
>    - optimized setns() to own cgroupns
>    - simplified code around sane-behavior mount option parsing
> 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
>    not changed since then.
>
> Changes from V1:
> 1. No pinning of processes within cgroupns. Tasks can be freely moved
>    across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
>    apply as before.
> 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
>    cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
>    of the reader and cgroup of <pid>.
> 3. setns() does not require the process to first move under target
>    cgroupns-root.
>
> Changes from RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>    anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>    your cgroupns-root.
>
> ---
>  Documentation/cgroups/namespace.txt | 147 +++
>  fs/kernfs/dir.c                     | 195
>  fs/kernfs/mount.c                   |  48 +
>  fs/proc/namespaces.c                |   1 +
>  include/linux/cgroup.h              |  52 +-
>  include/linux/cgroup_namespace.h    |  36 +++
>  include/linux/kernfs.h              |   5 +
>  include/linux/nsproxy.h             |   2 +
>  include/linux/proc_ns.h             |   4 +
>  include/uapi/linux/sched.h          |   3 +-
>  kernel/Makefile                     |   2 +-
>  kernel/cgroup.c                     | 106 +++-
>  kernel/cgroup_namespace.c           | 140 ++
>  kernel/fork.c                       |   2 +-
>  kernel/nsproxy.c                    |  19 +++-
>  15 files changed, 711 insertions(+), 51 deletions(-)
>  create mode 100644 Documentation/cgroups/namespace.txt
>  create mode 100644 include/linux/cgroup_namespace.h
>  create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv3 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv3 5/8] cgroup: introduce cgroup namespaces
> [PATCHv3 6/8] cgroup: cgroup namespace setns support
> [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces


-- 
Aditya


Re: [PATCHv2 0/7] CGroup Namespaces

2014-12-02 Thread Aditya Kali
On Wed, Nov 26, 2014 at 2:58 PM, Richard Weinberger wrote:
>
> On Thu, Nov 6, 2014 at 6:33 PM, Aditya Kali  wrote:
> > On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal  wrote:
> >> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> >> [..]
> >>>  fs/kernfs/dir.c  | 194 ++-
> >>>  fs/kernfs/mount.c|  48 ++
> >>>  fs/proc/namespaces.c |   1 +
> >>>  include/linux/cgroup.h   |  41 -
> >>>  include/linux/cgroup_namespace.h |  36 
> >>>  include/linux/kernfs.h   |   5 +
> >>>  include/linux/nsproxy.h  |   2 +
> >>>  include/linux/proc_ns.h  |   4 +
> >>>  include/uapi/linux/sched.h   |   3 +-
> >>>  kernel/Makefile  |   2 +-
> >>>  kernel/cgroup.c  | 108 +-
> >>>  kernel/cgroup_namespace.c| 148 +
> >>>  kernel/fork.c|   2 +-
> >>>  kernel/nsproxy.c |  19 +++-
> >>
> >> Hi Aditya,
> >>
> >> Can we provide a documentation file for cgroup namespace behavior. Say,
> >> Documentation/namespaces/cgroup-namespace.txt.
> >>
> > Yes, definitely. I will add it as soon as we have a consensus on the
> > overall series.
>
> Do you have a public git repository which contains your patches?
>

Hi, Sorry for late reply. I don't have these in a public git repo yet.
But I will try to post it on github or somewhere.
Also, I found a bug in this patchset that crashes the kernel in some
cases (when both unified and split hierarchies are mounted). I have a
fix and will send out the patches (with documentation) soon.

>
> --
> Thanks,
> //richard

Thanks,
-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-12 Thread Aditya Kali
I agree with what Andy and Serge have to say. The ability to mount
cgroupfs inside userns also seems consistent with other kernel
interfaces like sysfs, procfs, etc.

Though it would be great if we can at least merge the rest of the
patches first while we address the mounting part.

Thanks for your feedback.

On Tue, Nov 4, 2014 at 7:50 AM, Serge E. Hallyn  wrote:
>
> Quoting Andy Lutomirski (l...@amacapital.net):
> > On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo  wrote:
> > > Hello, Aditya.
> > >
> > > On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> > >> I agree that this is effectively bind-mounting, but doing this in kernel
> > >> makes it really convenient for the userspace. The process that sets up the
> > >> container doesn't need to care whether it should bind-mount cgroupfs inside
> > >> the container or not. The tasks inside the container can mount cgroupfs on
> > >> as-needed basis. The root container manager can simply unshare cgroupns and
> > >> forget about the internal setup. I think this is useful just for the reason
> > >> that it makes life much simpler for userspace.
> > >
> > > If it's okay to require userland to just do bind mounting, I'd be far
> > > happier with that.  cgroup mount code is already overcomplicated
> > > because of the dynamic matching of supers to mounts when it could just
> > > have told userland to use bind mounting.  Doesn't the host side have
> > > to set up some of the filesystem layouts anyway?  Does it really
> > > matter that we require the host to set up cgroup hierarchy too?
> > >
> >
> > Sort of, but only sort of.
> >
> > You can create a container by unsharing namespaces, mounting
> > everything, and then calling pivot_root.  But this is unpleasant
> > because of the strange way that pid namespaces work -- you generally
> > have to fork first, so this gets tedious.  And it doesn't integrate
> > well with things like fstab or other container-side configuration
> > mechanisms.
> >
> > It's nicer if you can unshare namespaces, mount the bare minimum,
> > pivot_root, and let the contained software do as much setup as
> > possible.
>
> Also, the bind-mount requires the container manager to know where
> the guest distro will want the cgroups mounted.
>
> -serge
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers




-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-12 Thread Aditya Kali
I agree with what Andy and Serge have to say. The ability to mount
cgroupfs inside userns also seems consistent with other kernel
interfaces like sysfs, procfs, etc.

Though it would be great if we can at least merge the rest of the
patches first while we address the mounting part.

Thanks for your feedback.

On Tue, Nov 4, 2014 at 7:50 AM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Andy Lutomirski (l...@amacapital.net):
  On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo t...@kernel.org wrote:
   Hello, Aditya.
  
   On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
   I agree that this is effectively bind-mounting, but doing this in kernel
   makes it really convenient for the userspace. The process that sets up 
   the
   container doesn't need to care whether it should bind-mount cgroupfs 
   inside
   the container or not. The tasks inside the container can mount cgroupfs 
   on
   as-needed basis. The root container manager can simply unshare cgroupns 
   and
   forget about the internal setup. I think this is useful just for the 
   reason
   that it makes life much simpler for userspace.
  
   If it's okay to require userland to just do bind mounting, I'd be far
   happier with that.  cgroup mount code is already overcomplicated
   because of the dynamic matching of supers to mounts when it could just
   have told userland to use bind mounting.  Doesn't the host side have
   to set up some of the filesystem layouts anyway?  Does it really
   matter that we require the host to set up cgroup hierarchy too?
  
 
  Sort of, but only sort of.
 
  You can create a container by unsharing namespaces, mounting
  everything, and then calling pivot_root.  But this is unpleasant
  because of the strange way that pid namespaces work -- you generally
  have to fork first, so this gets tedious.  And it doesn't integrate
  well with things like fstab or other container-side configuration
  mechanisms.
 
  It's nicer if you can unshare namespaces, mount the bare minimum,
  pivot_root, and let the contained software do as much setup as
  possible.

 Also, the bind-mount requires the container manager to know where
 the guest distro will want the cgroups mounted.

 -serge
 ___
 Containers mailing list
 contain...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/containers




-- 
Aditya
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2 0/7] CGroup Namespaces

2014-11-06 Thread Aditya Kali
On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal  wrote:
> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> [..]
>>  fs/kernfs/dir.c                  | 194 ++-
>>  fs/kernfs/mount.c                |  48 ++
>>  fs/proc/namespaces.c             |   1 +
>>  include/linux/cgroup.h           |  41 -
>>  include/linux/cgroup_namespace.h |  36
>>  include/linux/kernfs.h           |   5 +
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 +
>>  include/uapi/linux/sched.h       |   3 +-
>>  kernel/Makefile                  |   2 +-
>>  kernel/cgroup.c                  | 108 +-
>>  kernel/cgroup_namespace.c        | 148 +
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +++-
>
> Hi Aditya,
>
> Can we provide a documentation file for cgroup namespace behavior? Say,
> Documentation/namespaces/cgroup-namespace.txt.
>
Yes, definitely. I will add it as soon as we have a consensus on the
overall series.

> Namespaces are complicated and it might be a good idea to keep one .txt
> file for each namespace.
>
> Thanks
> Vivek


Thanks,
-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-06 Thread Aditya Kali
On Tue, Nov 4, 2014 at 5:57 AM, Tejun Heo  wrote:
> Hello, Aditya.
>
> On Mon, Nov 03, 2014 at 03:12:28PM -0800, Aditya Kali wrote:
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>
> I'm not sure whether supporting mounting from inside a ns is even
> necessary but, if it is, can't you just test against cgrp_dfl_root?
> There's no reason to do anything differently for ns mounting.
>

I am not sure I fully understand what you mean. But we don't have a
way to test against cgrp_dfl_root while parsing mount options. The
only way we know that the user is trying to mount the default hierarchy
is via the sane_behavior flag. So I need to test against this flag if
we want to restrict processes inside a cgroupns to mounting the default
hierarchy only.
Or are you suggesting that it's OK for nsown_capable(CAP_SYS_ADMIN)
processes to mount any cgroup hierarchy (irrespective of their
cgroupns)? I assumed that this would be undesirable.

> Thanks.
>
> --
> tejun


Thanks,
-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali 
---
 fs/kernfs/mount.c      | 48
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c        | 46 +-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (!dentry) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..8008c4c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1389,6 +1389,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 			return -ENOENT;
 	}
 
+	/* If inside a non-init cgroup namespace, only allow default hierarchy
+	 * to be mounted.
+	 */
+	if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+	    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+		return -EINVAL;
+	}
+
 	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
 		if (nr_opts != 1) {
@@ -1581,6 +1589,15 @@ static void init_cgroup_root(struct cgroup_root *root,
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
 	LIST_HEAD(tmp_links);
@@ -1685,6 +1702,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret;
 	int i;
 	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (

Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-11-03 Thread Aditya Kali


Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as the cgroupns-root).
The main purpose of the cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
---
 fs/proc/namespaces.c             |   1 +
 include/linux/cgroup.h           |  18 +-
 include/linux/cgroup_namespace.h |  36 +++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   4 ++
 kernel/Makefile                  |   2 +-
 kernel/cgroup.c                  |  14 +
 kernel/cgroup_namespace.c        | 127 +++
 kernel/fork.c                    |   2 +-
 kernel/nsproxy.c                 |  19 +-
 10 files changed, 220 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c
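
To make the "paths relative to their namespace root" behaviour concrete,
here is a small userspace model, not kernel code. It assumes both paths
are absolute and normalized, and it assumes that a cgroup outside the
cgroupns-root is rendered with "/.." components. The function name
path_from_ns_root() is made up for this sketch; in the patch the real
work is done by kernfs_path_from_node().

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model (hypothetical helper, not part of the patch) of how a
 * cgroup path could be rendered relative to a cgroupns-root.
 * 'root' and 'cgrp' are absolute, normalized paths such as "/batch/job1";
 * "/" denotes the top of the hierarchy.
 * Returns 0 on success, -1 if 'buf' is too small.
 */
static int path_from_ns_root(const char *root, const char *cgrp,
			     char *buf, size_t buflen)
{
	size_t i = 0, common = 0, used = 0, rest;
	const char *p;

	/* Treat "/" as the empty path so every component starts with '/'. */
	if (strcmp(root, "/") == 0)
		root = "";
	if (strcmp(cgrp, "/") == 0)
		cgrp = "";

	/* Longest common prefix that ends on a component boundary. */
	while (root[i] && root[i] == cgrp[i]) {
		i++;
		if ((root[i] == '/' || root[i] == '\0') &&
		    (cgrp[i] == '/' || cgrp[i] == '\0'))
			common = i;
	}

	/* One "/.." for every component of root below the common ancestor. */
	for (p = root + common; *p; p++) {
		if (*p != '/')
			continue;
		if (used + 3 >= buflen)
			return -1;
		memcpy(buf + used, "/..", 3);
		used += 3;
	}

	/* Then descend into whatever is left of cgrp. */
	rest = strlen(cgrp + common);
	if (used + rest >= buflen)
		return -1;
	memcpy(buf + used, cgrp + common, rest);
	used += rest;

	/* The cgroupns-root itself is shown as "/". */
	if (used == 0) {
		if (buflen < 2)
			return -1;
		buf[used++] = '/';
	}
	buf[used] = '\0';
	return 0;
}
```

With a cgroupns-root of "/batch/job1", a task in "/batch/job1/sub" would
be shown as "/sub", and a task that was moved to "/batch/job2" would be
shown as "/../job2" under the assumption above.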

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+	&cgroupns_operations,
 };

 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };

+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
 					      size_t buflen)
 {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }

 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include 
+#include 
+#include 
+#include 
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+   return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;

 /*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns;
+   struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/in

Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Mon, Nov 3, 2014 at 4:17 PM, Andy Lutomirski  wrote:
> On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali  wrote:
>> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski  wrote:
>>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali  wrote:
>>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski  wrote:
>>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali  wrote:
>>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski  wrote:
>>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali  wrote:
>>>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>>> -   if (nr_opts != 1) {
>>>>>>>> +   if (nr_opts > 1) {
>>>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>>> return -EINVAL;
>>>>>>>
>>>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>>>
>>>>>>
>>>>>> It would have been nice if simple 'mount -t cgroup cgroup ' from
>>>>>> cgroupns does the right thing automatically.
>>>>>>
>>>>>
>>>>> This is a debatable point, but it's not what I meant.  Won't your code
>>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>>
>>>>
>>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>>> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>>> here.
>>>
>>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>>
>>
>> Yes. Hence this change makes sure that we don't return EINVAL when
>> nr_opts == 0 or nr_opts == 1 :)
>> That way, both of the following are equivalent when inside non-init cgroupns:
>>
>> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
>> (2) $ mount -t cgroup cgroup mountpoint
>>
>> Any other mount option will trigger the error here.
>
> I still don't get it.  Can you walk me through why mount -o
> some_other_option -t cgroup cgroup mountpoint causes -EINVAL?
>

Argh! You are right. I was totally convinced that this works. But it
clearly doesn't if you specify one legitimate mount option. I wanted to
make it work for both cases (1) and (2) above. But then this check will
have to be changed :(
Sorry about the back and forth. I am just going to make it return
EINVAL if __DEVEL__sane_behavior is not specified, as suggested in the
beginning.
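
The hole can be reproduced with a small userspace model of the option
counting. This is only a sketch under stated assumptions: count_opts(),
check_v1() and check_v2() are made-up names, the model ignores every
other check that parse_cgroupfs_options() performs (an unknown option
such as Andy's placeholder 'one_evil_flag' would in reality also be
looked up as a subsystem name), and check_v2() models the "return EINVAL
unless the flag is given" rule described above.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define SANE_OPT "__DEVEL__sane_behavior"

/*
 * Count the comma-separated options in 'opts' ("" means none) and note
 * whether the sane_behavior option was given explicitly.
 */
static int count_opts(const char *opts, bool *sane)
{
	int nr_opts = 0;

	while (*opts) {
		size_t len = strcspn(opts, ",");

		nr_opts++;
		if (len == strlen(SANE_OPT) && strncmp(opts, SANE_OPT, len) == 0)
			*sane = true;
		opts += len;
		if (*opts == ',')
			opts++;
	}
	return nr_opts;
}

/* Model of the first version: flag implied in a cgroupns, "nr_opts > 1" rejected. */
static int check_v1(const char *opts, bool in_cgroupns)
{
	bool sane = in_cgroupns;	/* implicit flag does not bump nr_opts */
	int nr_opts = count_opts(opts, &sane);

	if (sane && nr_opts > 1)
		return -EINVAL;
	return 0;
}

/* Model of the fix: flag must be explicit and must be the only option. */
static int check_v2(const char *opts, bool in_cgroupns)
{
	bool sane = false;
	int nr_opts = count_opts(opts, &sane);

	if (in_cgroupns && !sane)
		return -EINVAL;
	if (sane && nr_opts != 1)
		return -EINVAL;
	return 0;
}
```

In this model check_v1("one_evil_flag", true) returns 0, which is the
case Andy was asking about, while check_v2("one_evil_flag", true)
returns -EINVAL.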

> --Andy

-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski  wrote:
> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali  wrote:
>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski  wrote:
>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali  wrote:
>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski  wrote:
>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali  wrote:
>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>> -   if (nr_opts != 1) {
>>>>>> +   if (nr_opts > 1) {
>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>> return -EINVAL;
>>>>>
>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>
>>>>
>>>> It would have been nice if simple 'mount -t cgroup cgroup ' from
>>>> cgroupns does the right thing automatically.
>>>>
>>>
>>> This is a debatable point, but it's not what I meant.  Won't your code
>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>
>>
>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>> here.
>
> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>

Yes. Hence this change makes sure that we don't return EINVAL when
nr_opts == 0 or nr_opts == 1 :)
That way, both of the following are equivalent when inside non-init cgroupns:

(1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
(2) $ mount -t cgroup cgroup mountpoint

Any other mount option will trigger the error here.


> --Andy

-- 
Aditya


Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-11-03 Thread Aditya Kali
On Fri, Oct 31, 2014 at 5:58 PM, Eric W. Biederman wrote:
> Andy Lutomirski  writes:
>
>> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali  wrote:
>
> 
>
>>> +static void *cgroupns_get(struct task_struct *task)
>>> +{
>>> +   struct cgroup_namespace *ns = NULL;
>>> +   struct nsproxy *nsproxy;
>>> +
>>> +   rcu_read_lock();
>>> +   nsproxy = task->nsproxy;
>>> +   if (nsproxy) {
>>> +   ns = nsproxy->cgroup_ns;
>>> +   get_cgroup_ns(ns);
>>> +   }
>>> +   rcu_read_unlock();
>>
>> How is this correct?  Other namespaces do it too, so it Must Be
>> Correct (tm), but I don't understand.  What is RCU protecting?
>
> The code is not correct.  The code needs to use task_lock.
>
> RCU used to protect nsproxy, and now task_lock protects nsproxy.
> For the reasons for all of this I refer you to the commit
> that changed this, and the comment in nsproxy.h
>

My bad. This should be under task_lock. I will fix it.
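
In the kernel this means wrapping the nsproxy read in task_lock(task) /
task_unlock(task) instead of rcu_read_lock() / rcu_read_unlock(). The
pattern itself, taking the reference while the owner's lock is held, can
be shown with a userspace model. This is only a sketch: the *_model
types and ns_get()/ns_put() are made up, a pthread mutex stands in for
task_lock, and a C11 atomic stands in for the atomic_t count.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct ns_model {
	atomic_int count;		/* stands in for cgroup_namespace.count */
};

struct nsproxy_model {
	struct ns_model *cgroup_ns;
};

struct task_model {
	pthread_mutex_t lock;		/* stands in for task_lock(task) */
	struct nsproxy_model *nsproxy;	/* NULL once the task has exited */
};

/* Take a reference on the task's namespace while holding the task's lock. */
static struct ns_model *ns_get(struct task_model *task)
{
	struct ns_model *ns = NULL;

	pthread_mutex_lock(&task->lock);
	if (task->nsproxy) {
		ns = task->nsproxy->cgroup_ns;
		if (ns)
			atomic_fetch_add(&ns->count, 1);
	}
	pthread_mutex_unlock(&task->lock);
	return ns;
}

/* Drop a reference; returns 1 when the caller released the last one. */
static int ns_put(struct ns_model *ns)
{
	return ns && atomic_fetch_sub(&ns->count, 1) == 1;
}
```

Because the pointer is read and the count is raised inside the same
critical section, a concurrent writer that swaps task->nsproxy under the
same lock cannot free the namespace between the two steps.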

> commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
> Author: Eric W. Biederman 
> Date:   Mon Feb 3 19:13:49 2014 -0800
>
> namespaces: Use task_lock and not rcu to protect nsproxy
>
> The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
> a sufficiently expensive system call that people have complained.
>
> Upon inspect nsproxy no longer needs rcu protection for remote reads.
> remote reads are rare.  So optimize for same process reads and write
> by switching using rask_lock instead.
>
> This yields a simpler to understand lock, and a faster setns system call.
>
> In particular this fixes a performance regression observed
> by Rafael David Tinoco .
>
> This is effectively a revert of Pavel Emelyanov's commit
> cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
> from 2007.  The race this originialy fixed no longer exists as
> do_notify_parent uses task_active_pid_ns(parent) instead of
> parent->nsproxy.
>
> Signed-off-by: "Eric W. Biederman" 
>
> Eric



-- 
Aditya


Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-11-03 Thread Aditya Kali
On Fri, Oct 31, 2014 at 5:02 PM, Andy Lutomirski  wrote:
> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali  wrote:
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the cgroup of the process at the point
>> of creation of the cgroup namespace (referred as cgroupns-root).
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root
>> (unless they are moved outside of their cgroupns-root, at which point
>>  they will see a relative path from their cgroupns-root).
>> For a correctly setup container this enables container-tools
>> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
>> containers without leaking system level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>
>> +   /* Prevent cgroup changes for this task. */
>> +   threadgroup_lock(current);
>
> This could just be me being dense, but what is the lock for?
>

threadgroup_lock() is there to prevent the task from changing cgroups
while we are unsharing cgroupns.
But it seems that this might be unnecessary now because we have
removed the pinning restriction. Without pinning, we don't care if the
task cgroup changes underneath us. I will remove it from here as well
as from cgroupns_install().

>> +
>> +   /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
>> +    */
>> +   cgrp = get_task_cgroup(current);
>> +
>> +   err = -ENOMEM;
>> +   new_ns = alloc_cgroup_ns();
>> +   if (!new_ns)
>> +           goto err_out_unlock;
>> +
>> +   err = proc_alloc_inum(&new_ns->proc_inum);
>> +   if (err)
>> +   goto err_out_unlock;
>> +
>> +   new_ns->user_ns = get_user_ns(user_ns);
>> +   new_ns->root_cgrp = cgrp;
>> +
>> +   threadgroup_unlock(current);
>> +
>> +   return new_ns;
>> +
>> +err_out_unlock:
>> +   threadgroup_unlock(current);
>> +err_out:
>> +   if (cgrp)
>> +   cgroup_put(cgrp);
>> +   kfree(new_ns);
>> +   return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> +   pr_info("setns not supported for cgroup namespace");
>> +   return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> +   struct cgroup_namespace *ns = NULL;
>> +   struct nsproxy *nsproxy;
>> +
>> +   rcu_read_lock();
>> +   nsproxy = task->nsproxy;
>> +   if (nsproxy) {
>> +   ns = nsproxy->cgroup_ns;
>> +   get_cgroup_ns(ns);
>> +   }
>> +   rcu_read_unlock();
>
> How is this correct?  Other namespaces do it too, so it Must Be
> Correct (tm), but I don't understand.  What is RCU protecting?
>
> --Andy



-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski  wrote:
> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali  wrote:
>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski  wrote:
>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali  wrote:
>>>> This patch enables cgroup mounting inside userns when a process
>>>> has appropriate privileges. The cgroup filesystem mounted is
>>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>>> This allows container management tools to run inside the containers
>>>> without depending on any global state.
>>>> In order to support this, a new kernfs api is added to lookup the
>>>> dentry for the cgroupns-root.
>>>>
>>>> Signed-off-by: Aditya Kali 
>>>> ---
>>>>  fs/kernfs/mount.c      | 48
>>>>  include/linux/kernfs.h |  2 ++
>>>>  kernel/cgroup.c        | 47 +--
>>>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>>> index f973ae9..e334f45 100644
>>>> --- a/fs/kernfs/mount.c
>>>> +++ b/fs/kernfs/mount.c
>>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>>> return NULL;
>>>>  }
>>>>
>>>> +/**
>>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>>> + * @sb: the kernfs super_block
>>>> + * @kn: kernfs_node for which a dentry is needed
>>>> + *
>>>> + * This can be used by callers which want to mount only a part of the kernfs
>>>> + * as root of the filesystem.
>>>> + */
>>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>>> + struct kernfs_node *kn)
>>>> +{
>>>
>>> I can't usefully review this, but kernfs_make_root and
>>> kernfs_obtain_root aren't the same string...
>>>
>>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>>> index 7e5d597..250aaec 100644
>>>> --- a/kernel/cgroup.c
>>>> +++ b/kernel/cgroup.c
>>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>> memset(opts, 0, sizeof(*opts));
>>>>
>>>> +   /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>>> +    * namespace.
>>>> +    */
>>>> +   if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>>> +           opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>>> +   }
>>>> +
>>>
>>> I don't like this implicit stuff.  Can you just return -EINVAL if sane
>>> behavior isn't requested?
>>>
>>
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>>
>>>> while ((token = strsep(&o, ",")) != NULL) {
>>>> nr_opts++;
>>>>
>>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>> -   if (nr_opts != 1) {
>>>> +   if (nr_opts > 1) {
>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>> return -EINVAL;
>>>
>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>
>>
>> It would have been nice if simple 'mount -t cgroup cgroup ' from
>> cgroupns does the right thing automatically.
>>
>
> This is a debatable point, but it's not what I meant.  Won't your code
> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>

I don't t

Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski  wrote:
> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali  wrote:
>> This patch enables cgroup mounting inside userns when a process
>> has appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>>
>> Signed-off-by: Aditya Kali 
>> ---
>>  fs/kernfs/mount.c      | 48
>>  include/linux/kernfs.h |  2 ++
>>  kernel/cgroup.c        | 47 +--
>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct 
>> super_block *sb)
>> return NULL;
>>  }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the 
>> kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn)
>> +{
>
> I can't usefully review this, but kernfs_make_root and
> kernfs_obtain_root aren't the same string...
>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct 
>> cgroup_sb_opts *opts)
>>
>> memset(opts, 0, sizeof(*opts));
>>
>> +   /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> +* namespace.
>> +*/
>> +   if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> +   opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> +   }
>> +
>
> I don't like this implicit stuff.  Can you just return -EINVAL if sane
> behavior isn't requested?
>

I think the sane-behavior flag is only temporary and will be removed
anyway, right? So I didn't bother asking the user to supply it. But I can
make the change as you suggested. We just have to make sure that tasks
inside cgroupns cannot mount non-default hierarchies as it would be a
regression.

>> while ((token = strsep(&o, ",")) != NULL) {
>> nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct 
>> cgroup_sb_opts *opts)
>>
>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>> pr_warn("sane_behavior: this is still under development and 
>> its behaviors will change, proceed at your own risk\n");
>> -   if (nr_opts != 1) {
>> +   if (nr_opts > 1) {
>> pr_err("sane_behavior: no other mount options 
>> allowed\n");
>> return -EINVAL;
>
> This looks wrong.  But, if you make the change above, then it'll be right.
>

It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
cgroupns does the right thing automatically.


>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct 
>> file_system_type *fs_type,
>> int ret;
>> int i;
>> bool new_sb;
>> +   struct cgroup_namespace *ns =
>> +   get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> +   /* Check if the caller has permission to mount. */
>> +   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> +   put_cgroup_ns(ns);
>> +   return ERR_PTR(-EPERM);
>> +   }
>
> Why is this necessary?
>

Without this, if I unshare userns and mntns (but not cgroupns), I will
be able to mount my parent's cgroupfs hierarchy. This is a deviation
from what's allowed today (i.e., today I can't mount cgroupfs even
after unsharing userns & mntns). This check is there to prevent that
unintended side effect of the cgroupns feature.

>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>> .name = "cgroup",
>> .mount = cgroup_mount,
>> .kill_sb = cgroup_kill_sb,
>> +   .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one!  Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change.  Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.
>
> --Andy



-- 
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
(sorry for accidental non-plain-text response earlier).

On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman
 wrote:
> Aditya Kali  writes:
>
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>
> There is a misdesign in this.  Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted.  Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it.  cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>

I think simply mounting the cgroupfs doesn't give you any more
information than what you already know about the system,
especially if the visibility is restricted to the process's
cgroupns-root. The cgroups still won't be writable by the user, so I
think it should be fine to allow mounting?

> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe).
>
> I don't see the point of this change.
>
> If we could change the set of cgroups or visible in cgroupfs I could
> probably see the point.  But as it is this change seems to be pointless.
>

I agree that this is effectively bind-mounting, but doing this in the
kernel makes it really convenient for userspace. The process that
sets up the container doesn't need to care whether it should
bind-mount cgroupfs inside the container or not. The tasks inside the
container can mount cgroupfs on an as-needed basis. The root container
manager can simply unshare cgroupns and forget about the internal
setup. I think this is useful just for the reason that it makes life
much simpler for userspace.

> Eric
>
>
>> Signed-off-by: Aditya Kali 
>> ---
>>  fs/kernfs/mount.c  | 48 
>>  include/linux/kernfs.h |  2 ++
>>  kernel/cgroup.c| 47 +--
>>  3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct 
>> super_block *sb)
>>   return NULL;
>>  }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the 
>> kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +   struct kernfs_node *kn)
>> +{
>> + struct dentry *dentry;
>> + struct inode *inode;
>> +
>> + BUG_ON(sb->s_op != &kernfs_sops);
>> +
>> + /* inode for the given kernfs_node should already exist. */
>> + inode = ilookup(sb, kn->ino);
>> + if (!inode) {
>> + pr_debug("kernfs: could not get inode for '");
>> + pr_cont_kernfs_path(kn);
>> + pr_cont("'.\n");
>> + return ERR_PTR(-EINVAL);
>> + }
>> +
>> + /* instantiate and link root dentry */
>> + dentry = d_obtain_root(inode);
>> + if (!dentry) {
>> + pr_debug("kernfs: could not get dentry for '");
>> + pr_cont_kernfs_path(kn);
>> + pr_cont("'.\n");
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + /* If this is a new dentry, set it up. We need kernfs_mutex because 
>> this
>> +  * may be called by callers other than kernfs_fill_super. */
>> + mutex_lock(&kernfs_mutex);
>> + if (!dentry->d_fsdata) {
>> + kernfs_get(kn);
>> + dentry->d_fsdata = kn;
>> + } else {
>> + WARN_ON(dentry->d_fsdata != kn);
>> + }
>> + mutex_unlock(&kernfs_mutex);
>> +
>> + return dentry;
>> +}
>> +
>>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>>  {
>>   struct kernfs_super_info *info = kernfs_info(sb);
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 3c2be75..b9538e0 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>>
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> +   struct kernfs_node *kn);
>>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops

Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali adityak...@google.com wrote:
 On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali adityak...@google.com wrote:
 This patch enables cgroup mounting inside userns when a process
 as appropriate privileges. The cgroup filesystem mounted is
 rooted at the cgroupns-root. Thus, in a container-setup, only
 the hierarchy under the cgroupns-root is exposed inside the container.
 This allows container management tools to run inside the containers
 without depending on any global state.
 In order to support this, a new kernfs api is added to lookup the
 dentry for the cgroupns-root.

 Signed-off-by: Aditya Kali adityak...@google.com
 ---
  fs/kernfs/mount.c  | 48 
  include/linux/kernfs.h |  2 ++
  kernel/cgroup.c| 47 +--
  3 files changed, 95 insertions(+), 2 deletions(-)

 diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
 index f973ae9..e334f45 100644
 --- a/fs/kernfs/mount.c
 +++ b/fs/kernfs/mount.c
 @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct 
 super_block *sb)
 return NULL;
  }

 +/**
 + * kernfs_make_root - create new root dentry for the given kernfs_node.
 + * @sb: the kernfs super_block
 + * @kn: kernfs_node for which a dentry is needed
 + *
 + * This can used used by callers which want to mount only a part of the 
 kernfs
 + * as root of the filesystem.
 + */
 +struct dentry *kernfs_obtain_root(struct super_block *sb,
 + struct kernfs_node *kn)
 +{

 I can't usefully review this, but kernfs_make_root and
 kernfs_obtain_root aren't the same string...

 diff --git a/kernel/cgroup.c b/kernel/cgroup.c
 index 7e5d597..250aaec 100644
 --- a/kernel/cgroup.c
 +++ b/kernel/cgroup.c
 @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, 
 struct cgroup_sb_opts *opts)

 memset(opts, 0, sizeof(*opts));

 +   /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
 +* namespace.
 +*/
 +   if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
 +   opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
 +   }
 +

 I don't like this implicit stuff.  Can you just return -EINVAL if sane
 behavior isn't requested?


 I think the sane-behavior flag is only temporary and will be removed
 anyways, right? So I didn't bother asking user to supply it. But I can
 make the change as you suggested. We just have to make sure that tasks
 inside cgroupns cannot mount non-default hierarchies as it would be a
 regression.

 while ((token = strsep(&o, ",")) != NULL) {
 nr_opts++;

 @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct 
 cgroup_sb_opts *opts)

 if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 pr_warn("sane_behavior: this is still under development 
 and its behaviors will change, proceed at your own risk\n");
 -   if (nr_opts != 1) {
 +   if (nr_opts > 1) {
 pr_err("sane_behavior: no other mount options 
 allowed\n");
 return -EINVAL;

 This looks wrong.  But, if you make the change above, then it'll be right.


 It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
 cgroupns does the right thing automatically.


 This is a debatable point, but it's not what I meant.  Won't your code
 let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?


I don't think so. This check, if (nr_opts > 1), is nested under
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR). So we know that there is
at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
here.


 @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct 
 file_system_type *fs_type,
 int ret;
 int i;
 bool new_sb;
 +   struct cgroup_namespace *ns =
 +   get_cgroup_ns(current->nsproxy->cgroup_ns);
 +
 +   /* Check if the caller has permission to mount. */
 +   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
 +   put_cgroup_ns(ns);
 +   return ERR_PTR(-EPERM);
 +   }

 Why is this necessary?


 Without this, if I unshare userns and mntns (but no cgroupns), I will
 be able to mount my parent's cgroupfs hierarchy. This is deviation
 from whats allowed today (i.e., today I can't mount cgroupfs even
 after unsharing userns & mntns). This check is there to prevent the
 unintended effect of cgroupns feature.

 Oh, I get it.  I misunderstood the code.

 I guess this is reasonable.  If it annoys anyone, it can be reverted
 or weakened.

 --Andy



-- 
Aditya

Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-11-03 Thread Aditya Kali
On Fri, Oct 31, 2014 at 5:02 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali adityak...@google.com wrote:
 Introduce the ability to create new cgroup namespace. The newly created
 cgroup namespace remembers the cgroup of the process at the point
 of creation of the cgroup namespace (referred as cgroupns-root).
 The main purpose of cgroup namespace is to virtualize the contents
 of /proc/self/cgroup file. Processes inside a cgroup namespace
 are only able to see paths relative to their namespace root
 (unless they are moved outside of their cgroupns-root, at which point
  they will see a relative path from their cgroupns-root).
 For a correctly setup container this enables container-tools
 (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
 containers without leaking system level cgroup hierarchy to the task.
 This patch only implements the 'unshare' part of the cgroupns.


 +   /* Prevent cgroup changes for this task. */
 +   threadgroup_lock(current);

 This could just be me being dense, but what is the lock for?


threadgroup_lock() is there to prevent the task from changing cgroups
while we are unsharing cgroupns.
But it seems that this might be unnecessary now because we have
removed the pinning restriction. Without pinning, we don't care if the
task cgroup changes underneath us. I will remove it from here as well
as from cgroupns_install().

 +
 +   /* CGROUPNS only virtualizes the cgroup path on the unified 
 hierarchy.
 +*/
 +   cgrp = get_task_cgroup(current);
 +
 +   err = -ENOMEM;
 +   new_ns = alloc_cgroup_ns();
 +   if (!new_ns)
 +   goto err_out_unlock;
 +
 +   err = proc_alloc_inum(&new_ns->proc_inum);
 +   if (err)
 +   goto err_out_unlock;
 +
 +   new_ns->user_ns = get_user_ns(user_ns);
 +   new_ns->root_cgrp = cgrp;
 +
 +   threadgroup_unlock(current);
 +
 +   return new_ns;
 +
 +err_out_unlock:
 +   threadgroup_unlock(current);
 +err_out:
 +   if (cgrp)
 +   cgroup_put(cgrp);
 +   kfree(new_ns);
 +   return ERR_PTR(err);
 +}
 +
 +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 +{
 +   pr_info("setns not supported for cgroup namespace");
 +   return -EINVAL;
 +}
 +
 +static void *cgroupns_get(struct task_struct *task)
 +{
 +   struct cgroup_namespace *ns = NULL;
 +   struct nsproxy *nsproxy;
 +
 +   rcu_read_lock();
 +   nsproxy = task->nsproxy;
 +   if (nsproxy) {
 +   ns = nsproxy->cgroup_ns;
 +   get_cgroup_ns(ns);
 +   }
 +   rcu_read_unlock();

 How is this correct?  Other namespaces do it too, so it Must Be
 Correct (tm), but I don't understand.  What is RCU protecting?

 --Andy



-- 
Aditya


Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-11-03 Thread Aditya Kali
On Fri, Oct 31, 2014 at 5:58 PM, Eric W. Biederman
ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali adityak...@google.com wrote:

 <snip>

 +static void *cgroupns_get(struct task_struct *task)
 +{
 +   struct cgroup_namespace *ns = NULL;
 +   struct nsproxy *nsproxy;
 +
 +   rcu_read_lock();
 +   nsproxy = task->nsproxy;
 +   if (nsproxy) {
 +   ns = nsproxy->cgroup_ns;
 +   get_cgroup_ns(ns);
 +   }
 +   rcu_read_unlock();

 How is this correct?  Other namespaces do it too, so it Must Be
 Correct (tm), but I don't understand.  What is RCU protecting?

 The code is not correct.  The code needs to use task_lock.

 RCU used to protect nsproxy, and now task_lock protects nsproxy.
 For the reasons behind all of this I refer you to the commit
 that changed this, and the comment in nsproxy.h


My bad. This should be under task_lock. I will fix it.
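
For reference, the locking pattern being asked for can be sketched in
user space. This is only a rough model and not the kernel fix: a pthread
mutex stands in for task_lock(), and every model_-prefixed name is
hypothetical. What it illustrates is that the nsproxy pointer is read,
and the reference taken, while the per-task lock is held.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct model_ns {
    int refcount;
};

struct model_nsproxy {
    struct model_ns *cgroup_ns;
};

struct model_task {
    pthread_mutex_t lock;          /* plays the role of task_lock() */
    struct model_nsproxy *nsproxy; /* NULL once the task has exited */
};

/*
 * Take a reference on the task's namespace while holding the task lock,
 * so that nsproxy cannot be swapped or cleared underneath the reader.
 * Returns NULL if the task no longer has an nsproxy.
 */
static struct model_ns *model_cgroupns_get(struct model_task *task)
{
    struct model_ns *ns = NULL;

    pthread_mutex_lock(&task->lock);
    if (task->nsproxy) {
        ns = task->nsproxy->cgroup_ns;
        ns->refcount++;
    }
    pthread_mutex_unlock(&task->lock);

    return ns;
}
```

With nsproxy set, the call returns the namespace and bumps its count;
once nsproxy is NULL it returns NULL and touches nothing.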

 commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
 Author: Eric W. Biederman ebied...@xmission.com
 Date:   Mon Feb 3 19:13:49 2014 -0800

 namespaces: Use task_lock and not rcu to protect nsproxy

 The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
 a sufficiently expensive system call that people have complained.

 Upon inspect nsproxy no longer needs rcu protection for remote reads.
 remote reads are rare.  So optimize for same process reads and write
 by switching using rask_lock instead.

 This yields a simpler to understand lock, and a faster setns system call.

 In particular this fixes a performance regression observed
 by Rafael David Tinoco rafael.tin...@canonical.com.

 This is effectively a revert of Pavel Emelyanov's commit
 cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy 
 lighter
 from 2007.  The race this originialy fixed no longer exists as
 do_notify_parent uses task_active_pid_ns(parent) instead of
 parent->nsproxy.

 Signed-off-by: Eric W. Biederman ebied...@xmission.com

 Eric



-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali adityak...@google.com wrote:
 On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali adityak...@google.com wrote:
 On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski l...@amacapital.net 
 wrote:
 On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali adityak...@google.com 
 wrote:
 if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 pr_warn("sane_behavior: this is still under development 
 and its behaviors will change, proceed at your own risk\n");
 -   if (nr_opts != 1) {
 +   if (nr_opts > 1) {
 pr_err("sane_behavior: no other mount options 
 allowed\n");
 return -EINVAL;

 This looks wrong.  But, if you make the change above, then it'll be right.


 It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
 cgroupns does the right thing automatically.


 This is a debatable point, but it's not what I meant.  Won't your code
 let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?


 I don't think so. This check, if (nr_opts > 1), is nested under
 if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR). So we know that there is
 at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
 Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
 here.

 But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?


Yes. Hence this change makes sure that we don't return EINVAL when
nr_opts == 0 or nr_opts == 1 :)
That way, both of the following are equivalent when inside non-init cgroupns:

(1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
(2) $ mount -t cgroup cgroup mountpoint

Any other mount option will trigger the error here.


 --Andy

-- 
Aditya


Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali
On Mon, Nov 3, 2014 at 4:17 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali adityak...@google.com wrote:
 On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali adityak...@google.com wrote:
 On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net 
 wrote:
 On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali adityak...@google.com wrote:
 On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski l...@amacapital.net 
 wrote:
 On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali adityak...@google.com 
 wrote:
 if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 pr_warn("sane_behavior: this is still under 
 development and its behaviors will change, proceed at your own 
 risk\n");
 -   if (nr_opts != 1) {
 +   if (nr_opts > 1) {
 pr_err("sane_behavior: no other mount options 
 allowed\n");
 return -EINVAL;

 This looks wrong.  But, if you make the change above, then it'll be 
 right.


 It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
 cgroupns does the right thing automatically.


 This is a debatable point, but it's not what I meant.  Won't your code
 let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?


 I don't think so. This check, if (nr_opts > 1), is nested under
 if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR). So we know that there is
 at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
 Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
 here.

 But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?


 Yes. Hence this change makes sure that we don't return EINVAL when
 nr_opts == 0 or nr_opts == 1 :)
 That way, both of the following are equivalent when inside non-init cgroupns:

 (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
 (2) $ mount -t cgroup cgroup mountpoint

 Any other mount option will trigger the error here.

 I still don't get it.  Can you walk me through why mount -o
 some_other_option -t cgroup cgroup mountpoint causes -EINVAL?


Argh! You are right. I was totally convinced that this works. But it
clearly doesn't if you specify 1 legit mount option. I wanted to make
it work for both cases (1) and (2) above. But then this check will
have to be changed :(
Sorry about the back and forth. I am just going to make it return
EINVAL if __DEVEL__sane_behavior is not specified, as suggested in the
beginning.
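
To make the hole concrete, here is a small user-space model of the
check. It is a sketch under assumptions, not the code in kernel/cgroup.c:
model_parse_options, the check_kind enum and the implicit_sane parameter
are made up for illustration, tokens are split with strcspn rather than
strsep, and any option other than __DEVEL__sane_behavior is simply
treated as a legitimate one.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define SANE_OPT "__DEVEL__sane_behavior"

/* Which form of the option-count check to model. */
enum check_kind { CHECK_NE_1, CHECK_GT_1 };

/*
 * Simplified model of the sane_behavior check discussed in this thread.
 * data is the comma-separated mount option string (may be empty) and
 * implicit_sane models the flag being set for a non-init cgroupns.
 * Returns 0 if the options are accepted, -EINVAL otherwise.
 */
static int model_parse_options(const char *data, bool implicit_sane,
                               enum check_kind kind)
{
    const char *p = data;
    bool sane = implicit_sane;
    int nr_opts = 0;

    /* Count the non-empty tokens and look for the explicit flag. */
    while (*p) {
        size_t len = strcspn(p, ",");

        if (len > 0) {
            nr_opts++;
            if (len == strlen(SANE_OPT) && !strncmp(p, SANE_OPT, len))
                sane = true;
        }
        p += len;
        if (*p == ',')
            p++;
    }

    if (sane) {
        bool reject = (kind == CHECK_NE_1) ? (nr_opts != 1)
                                           : (nr_opts > 1);
        if (reject)
            return -EINVAL;
    }
    return 0;
}
```

With the flag set implicitly, the empty option string and
"__DEVEL__sane_behavior" are both accepted under CHECK_GT_1 as intended,
but so is "some_other_option", because nr_opts is only 1. The original
CHECK_NE_1 form would instead reject the empty option string.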

 --Andy

-- 
Aditya


Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-11-03 Thread Aditya Kali


Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali adityak...@google.com
---
 fs/proc/namespaces.c |   1 +
 include/linux/cgroup.h   |  18 +-
 include/linux/cgroup_namespace.h |  36 +++
 include/linux/nsproxy.h  |   2 +
 include/linux/proc_ns.h  |   4 ++
 kernel/Makefile  |   2 +-
 kernel/cgroup.c  |  14 +
 kernel/cgroup_namespace.c| 127 +++
 kernel/fork.c|   2 +-
 kernel/nsproxy.c |  19 +-
 10 files changed, 220 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+   &cgroupns_operations,
 };

 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

 #ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };

+struct cgroup_namespace {
+   atomic_t                count;
+   unsigned int            proc_inum;
+   struct user_namespace   *user_ns;
+   struct cgroup           *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }

+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+struct cgroup *cgrp, char *buf,
+size_t buflen)
+{
+   return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }

 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+   return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+   return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+  struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;

 /*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns

Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-11-03 Thread Aditya Kali

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali adityak...@google.com
---
 fs/kernfs/mount.c  | 48
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 46 +-
 3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }

+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = ilookup(sb, kn->ino);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* If this is a new dentry, set it up. We need kernfs_mutex because this
+* may be called by callers other than kernfs_fill_super. */
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+   kernfs_get(kn);
+   dentry->d_fsdata = kn;
+   } else {
+   WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..8008c4c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1389,6 +1389,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
return -ENOENT;
}

+   /* If inside a non-init cgroup namespace, only allow default hierarchy
+* to be mounted.
+*/
+   if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+   !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+   return -EINVAL;
+   }
+
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
 		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
@@ -1581,6 +1589,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }

+struct dentry *cgroupns_get_root(struct super_block *sb,
+struct cgroup_namespace *ns)
+{
+   struct dentry *nsdentry;
+
+   nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+   return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
LIST_HEAD(tmp_links);
@@ -1685,6 +1702,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+   struct cgroup_namespace *ns =
+   get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+   /* Check if the caller has permission to mount. */
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+   put_cgroup_ns(ns

[PATCHv2 0/7] CGroup Namespaces

2014-10-31 Thread Aditya Kali
Another attempt at Cgroup Namespace patch-set. This incorporates
suggestions on previous patch-set.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mountpoint>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and Namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task is always able to see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
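
  Each line of that file has the shape hierarchy-id:controllers:path.
  As a minimal sketch of how a tool might pick the path field apart
  (parse_cgroup_line() and struct cgroup_line are hypothetical names,
  not part of this patch-set or of any library):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one "id:controllers:path" line. */
struct cgroup_line {
	int hierarchy_id;
	char controllers[128];
	char path[256];
};

/* Split one /proc/<pid>/cgroup line; returns 0 on success, -1 otherwise. */
int parse_cgroup_line(const char *line, struct cgroup_line *out)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	size_t clen, plen;

	if (!c2 || sscanf(line, "%d", &out->hierarchy_id) != 1)
		return -1;

	clen = (size_t)(c2 - c1 - 1);
	plen = strcspn(c2 + 1, "\n");	/* drop the trailing newline */
	if (clen >= sizeof(out->controllers) || plen >= sizeof(out->path))
		return -1;

	memcpy(out->controllers, c1 + 1, clen);
	out->controllers[clen] = '\0';
	memcpy(out->path, c2 + 1, plen);
	out->path[plen] = '\0';
	return 0;
}
```

  For the line shown above, the path field comes out as
  /batchjobs/c_job_id1, which is exactly the host-level name that the
  container gets to see today.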

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically host-container-management-agent
  (systemd, docker/libcontainer, etc.) data and leaking its name (or
  leaking the hierarchy) reveals too much information about the host
  system.
  (2) It makes the container migration across machines (CRIU) more
  difficult as the container names need to be unique across the
  machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
  docker/libcontainer, lmctfy, etc.) within virtual containers
  without adding dependency on some state/agent present outside the
  container.

  Note that the feature proposed here is completely different than the
  “ns cgroup” feature which existed in the linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), the containers can now
  have a much more coherent cgroup view and it's easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it is currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within new cgroupns, process sees that it's in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
  filesystem root for the namespace specific cgroupfs mount.

  The virtualization of /proc/self/cgroup file combined with restricting
  the view of cgroup hierarchy by namespace-private cgroupfs mount
  should provide a completely isolated cgroup view inside the container.
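
  For reference, the unshare step used in the session above boils down
  to roughly the following. This is only a sketch: enter_new_cgroupns()
  is a made-up helper name, the fallback define assumes the flag value
  0x02000000 (the former CLONE_STOPPED bit), and the call only succeeds
  on a kernel carrying cgroup namespace support and with sufficient
  privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed: former CLONE_STOPPED bit */
#endif

/* Detach into a new cgroup namespace; returns 0 or a negative errno. */
int enter_new_cgroupns(void)
{
	/* Raw syscall so the sketch does not depend on a libc wrapper. */
	if (syscall(SYS_unshare, CLONE_NEWCGROUP) < 0)
		return -errno;
	return 0;
}
```

  On success, a subsequent read of /proc/self/cgroup from the same
  process is expected to report "/" as its path; on failure the helper
  returns the negated errno (for example -EPERM or -EINVAL).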

  In its current form, the cgroup namespaces patchset provides the following
  behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
  the process calling unshare is running.
  For ex. if a process in 

[PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()

2014-10-31 Thread Aditya Kali
Move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali 
---
 include/linux/cgroup.h | 22 ++
 kernel/cgroup.c| 22 --
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+   return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+   WARN_ON_ONCE(cgroup_is_dead(cgrp));
+   css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+   return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+   css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 50fa8e3..9c622b9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-   return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-   WARN_ON_ONCE(cgroup_is_dead(cgrp));
-   css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-   return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-   css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.1.0.rc2.206.gedb03e5



[PATCHv2 3/7] cgroup: add function to get task's cgroup on default hierarchy

2014-10-31 Thread Aditya Kali
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.
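
The get/put contract is the usual reference-counting one: whoever
obtains the object owns one reference and must drop it. A userspace
sketch of that pattern using C11 atomics follows; struct obj, obj_new(),
obj_get() and obj_put() are illustrative names only and do not mirror
the kernel's css refcounting implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative refcounted object (stand-in for a cgroup). */
struct obj {
	atomic_int refcnt;
};

/* Allocate an object holding one reference for the caller. */
struct obj *obj_new(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o)
		atomic_init(&o->refcnt, 1);
	return o;
}

/* Take an extra reference; the caller must balance it with obj_put(). */
struct obj *obj_get(struct obj *o)
{
	if (o)
		atomic_fetch_add(&o->refcnt, 1);
	return o;
}

/* Drop a reference; frees the object and returns true on the last put. */
bool obj_put(struct obj *o)
{
	if (o && atomic_fetch_sub(&o->refcnt, 1) == 1) {
		free(o);
		return true;
	}
	return false;
}
```

A caller of get_task_cgroup() is in the position of someone who just
called obj_get(): it must issue the matching put when done, otherwise
the object can never be freed.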

Signed-off-by: Aditya Kali 
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c| 25 +
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 136ecea..50fa8e3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1917,6 +1917,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+   struct cgroup *cgrp;
+
+   mutex_lock(&cgroup_mutex);
+   down_read(&css_set_rwsem);
+
+   cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+   cgroup_get(cgrp);
+
+   up_read(&css_set_rwsem);
+   mutex_unlock(&cgroup_mutex);
+   return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
-- 
2.1.0.rc2.206.gedb03e5



[PATCHv2 1/7] kernfs: Add API to generate relative kernfs path

2014-10-31 Thread Aditya Kali
The new function kernfs_path_from_node() generates and returns the
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.
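
As a rough illustration of the intended result (not the kernel code in
this patch), the same idea can be modelled in userspace on a tree with
parent pointers. struct node, rel_path() and MAX_DEPTH are hypothetical
names, and the sketch assumes both nodes belong to the same tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64	/* arbitrary limit for this sketch */

struct node {
	const char *name;
	struct node *parent;	/* NULL for the root */
};

static size_t node_depth(const struct node *n)
{
	size_t d = 0;

	for (; n->parent; n = n->parent)
		d++;
	return d;
}

/* Append s at *pos, keeping buf NUL-terminated; -1 if it does not fit. */
static int put_str(char *buf, size_t buflen, size_t *pos, const char *s)
{
	size_t len = strlen(s);

	if (*pos + len + 1 > buflen)
		return -1;
	memcpy(buf + *pos, s, len + 1);
	*pos += len;
	return 0;
}

/* Write the path of 'to' relative to 'from' into buf; NULL on overflow. */
char *rel_path(const struct node *from, const struct node *to,
	       char *buf, size_t buflen)
{
	const struct node *down[MAX_DEPTH];
	size_t df = node_depth(from), dt = node_depth(to), pos = 0, n = 0;

	if (buflen < 2)
		return NULL;
	buf[0] = '\0';

	/* Climb from 'from' until it is level with 'to'. */
	for (; df > dt; df--, from = from->parent)
		if (put_str(buf, buflen, &pos, "/.."))
			goto fail;
	/* Climb from 'to', remembering the nodes to descend through. */
	for (; dt > df; dt--, to = to->parent) {
		if (n == MAX_DEPTH)
			goto fail;
		down[n++] = to;
	}
	/* Climb both sides in lockstep up to the common ancestor. */
	for (; from != to; from = from->parent, to = to->parent) {
		if (n == MAX_DEPTH || put_str(buf, buflen, &pos, "/.."))
			goto fail;
		down[n++] = to;
	}
	/* Descend from the common ancestor to the target. */
	while (n > 0) {
		const struct node *c = down[--n];

		if (put_str(buf, buflen, &pos, "/") ||
		    put_str(buf, buflen, &pos, c->name))
			goto fail;
	}
	if (pos == 0 && put_str(buf, buflen, &pos, "/"))
		goto fail;
	return buf;
fail:
	buf[0] = '\0';
	return NULL;
}
```

With from = /n1/n2/n3/n4 and to = /n1/n2/n5 this yields "/../../n5",
and identical nodes yield "/", matching the cases the patch documents.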

Signed-off-by: Aditya Kali 
---
 fs/kernfs/dir.c| 194 +++--
 include/linux/kernfs.h |   3 +
 2 files changed, 176 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..e49c365 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,158 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-   char *p = buf + buflen;
+   size_t depth = 0;
+
+   BUG_ON(!kn);
+   while (kn->parent) {
+   depth++;
+   kn = kn->parent;
+   }
+   return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+   struct kernfs_node *kn_from,
+   struct kernfs_node *kn_to,
+   char *buf,
+   size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn;
+   size_t depth_from = 0, depth_to, d;
int len;
 
-   *--p = '\0';
+   /* We need at least 2 bytes to write "/\0". */
+   BUG_ON(buflen < 2);
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
+   if (kn_from == kn_to) {
+   *p = '/';
+   *(p + 1) = '\0';
+   return p;
+   }
+
+   /* We can find the relative path only if both the nodes belong to the
+* same kernfs root.
+*/
+   if (kn_from) {
+   BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+   depth_from = kernfs_node_depth(kn_from);
+   }
+
+   depth_to = kernfs_node_depth(kn_to);
+
+   /* We compose path from left to right. So first write out all possible
+* "/.." strings needed to reach from 'kn_from' to the common ancestor.
+*/
+   if (kn_from) {
+   while (depth_from > depth_to) {
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
}
+
+   d = depth_to;
+   kn = kn_to;
+   while (depth_from < d) {
+   kn = kn->parent;
+   d--;
+   }
+
+   /* Now we have 'depth_from == depth_to' at this point. Add more
+* "/.."s until we reach common ancestor. In the worst case,
+* root node will be the common ancestor.
+*/
+   while (depth_from > 0) {
+   /* If we reached common ancestor, stop. */
+   if (kn_from == kn)
+   break;
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
+   kn = kn->parent;
+ 

[PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-10-31 Thread Aditya Kali
Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
---
 fs/proc/namespaces.c |   1 +
 include/linux/cgroup.h   |  18 +-
 include/linux/cgroup_namespace.h |  36 +++
 include/linux/nsproxy.h  |   2 +
 include/linux/proc_ns.h  |   4 ++
 kernel/Makefile  |   2 +-
 kernel/cgroup.c  |  14 
 kernel/cgroup_namespace.c| 134 +++
 kernel/fork.c|   2 +-
 kernel/nsproxy.c |  19 +-
 10 files changed, 227 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+   &cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+   atomic_t                count;
+   unsigned int            proc_inum;
+   struct user_namespace   *user_ns;
+   struct cgroup           *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+struct cgroup *cgrp, char *buf,
+size_t buflen)
+{
+   return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+   return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+   return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+  struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns;
+   struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/in

[PATCHv2 6/7] cgroup: cgroup namespace setns support

2014-10-31 Thread Aditya Kali
setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.
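
From userspace, attaching would look roughly like the sketch below.
join_cgroupns() is a hypothetical helper name, the fallback define
assumes the flag value 0x02000000, and the raw syscall is used only so
the example does not depend on a libc wrapper. It does not move the
caller into the target cgroupns-root, in line with the semantics above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed: former CLONE_STOPPED bit */
#endif

/* Join the cgroup namespace of @pid; returns 0 or a negative errno. */
int join_cgroupns(pid_t pid)
{
	char path[64];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", (long)pid);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* The kernel applies the CAP_SYS_ADMIN checks described above. */
	if (syscall(SYS_setns, fd, CLONE_NEWCGROUP) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

A caller lacking the required capabilities should see -EPERM, and a pid
without a /proc/<pid>/ns/cgroup entry fails at the open() step.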

Signed-off-by: Aditya Kali 
---
 kernel/cgroup_namespace.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 7e9bda0..0803575 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -86,8 +86,22 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   struct cgroup_namespace *cgroup_ns = ns;
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Prevent cgroup changes for this task. */
+   threadgroup_lock(current);
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   threadgroup_unlock(current);
+
+   return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.1.0.rc2.206.gedb03e5



[PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-10-31 Thread Aditya Kali
This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.
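
A container tool would trigger this path with an ordinary mount(2)
call, roughly as sketched here. mount_cgroupns_root() is a made-up
helper name; the option string is passed explicitly on the assumption
that the kernel accepts __DEVEL__sane_behavior as the only option, as
discussed for this series.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mount.h>

/*
 * Mount the cgroup filesystem at @target from inside a cgroupns.
 * Returns 0 on success or a negative errno.
 */
int mount_cgroupns_root(const char *target)
{
	if (mount("cgroup", target, "cgroup", 0,
		  "__DEVEL__sane_behavior") < 0)
		return -errno;
	return 0;
}
```

If the mount succeeds inside a non-init cgroupns, the directory at
@target is expected to show the cgroupns-root as its top level.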

Signed-off-by: Aditya Kali 
---
 fs/kernfs/mount.c  | 48 
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 47 +--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/**
+ * kernfs_obtain_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = ilookup(sb, kn->ino);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* If this is a new dentry, set it up. We need kernfs_mutex because this
+* may be called by callers other than kernfs_fill_super. */
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+   kernfs_get(kn);
+   dentry->d_fsdata = kn;
+   } else {
+   WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..250aaec 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
 
memset(opts, 0, sizeof(*opts));
 
+   /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+* namespace.
+*/
+   if (current->nsproxy->cgroup_ns != _cgroup_ns) {
+   opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+   }
+
while ((token = strsep(&o, ",")) != NULL) {
nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
 
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its 
behaviors will change, proceed at your own risk\n");
-   if (nr_opts != 1) {
+   if (nr_opts > 1) {
pr_err("sane_behavior: no other mount options 
allowed\n");
return -EINVAL;
}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+struct cgroup_namespace *ns)
+{
+   struct dentry *nsdentry;
+
+   nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+   return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {

[PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2014-10-31 Thread Aditya Kali
CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED 0x0040  /* Unused, ignored */
 #define CLONE_UNTRACED 0x0080  /* set if the tracing process 
can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID 0x0100  /* set the TID in the child */
-/* 0x0200 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP    0x0200  /* New cgroup namespace */
 #define CLONE_NEWUTS   0x0400  /* New utsname group? */
 #define CLONE_NEWIPC   0x0800  /* New ipcs */
 #define CLONE_NEWUSER  0x1000  /* New user namespace */
-- 
2.1.0.rc2.206.gedb03e5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
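
For readers who want to try the flag from userspace, here is a minimal sketch of a helper that unshares the cgroup namespace and of a parser for the /proc/self/ns/cgroup link target. The names enter_new_cgroup_ns() and parse_ns_link() are hypothetical and not part of the patch. The fallback value 0x02000000 is the one later mainline kernels define for CLONE_NEWCGROUP; treat it as an assumption with respect to this posting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>

/* Assumption: value used by later mainline kernels; older libcs lack it. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Parse a namespace link target such as "cgroup:[4026531835]".
 * Returns 0 and stores the inode number in *id, or -1 if malformed.
 */
int parse_ns_link(const char *target, unsigned long *id)
{
	unsigned long v;
	char close;

	assert(target && id);
	if (sscanf(target, "cgroup:[%lu%c", &v, &close) != 2 || close != ']')
		return -1;
	*id = v;
	return 0;
}

/*
 * Move the calling process into a new cgroup namespace.
 * Returns 0 on success or a negative errno value (for example -EPERM
 * without the needed privileges, -EINVAL on kernels lacking the flag).
 */
int enter_new_cgroup_ns(void)
{
	if (unshare(CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

A caller would typically readlink() /proc/self/ns/cgroup before and after enter_new_cgroup_ns() and compare the two parsed inode numbers to confirm that the namespace changed.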


[PATCHv2 6/7] cgroup: cgroup namespace setns support

2014-10-31 Thread Aditya Kali
setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user namespace and
over the user namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 kernel/cgroup_namespace.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 7e9bda0..0803575 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -86,8 +86,22 @@ err_out:
 
 static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
 {
-   pr_info("setns not supported for cgroup namespace");
-   return -EINVAL;
+   struct cgroup_namespace *cgroup_ns = ns;
+
+   if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+   !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+   return -EPERM;
+
+   /* Prevent cgroup changes for this task. */
+   threadgroup_lock(current);
+
+   get_cgroup_ns(cgroup_ns);
+   put_cgroup_ns(nsproxy->cgroup_ns);
+   nsproxy->cgroup_ns = cgroup_ns;
+
+   threadgroup_unlock(current);
+
+   return 0;
 }
 
 static void *cgroupns_get(struct task_struct *task)
-- 
2.1.0.rc2.206.gedb03e5
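
As a rough illustration of how userspace would exercise this path, the sketch below opens the cgroup namespace file of a target pid and calls setns() on it. The helper names cgroupns_path() and join_cgroup_ns() are hypothetical, and the CLONE_NEWCGROUP fallback value is an assumption taken from later mainline kernels. As the commit message says, no cgroup migration happens implicitly; moving the task under the target cgroupns-root is left to the caller.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: value used by later mainline kernels. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/* Build "/proc/<pid>/ns/cgroup" in buf; returns 0, or -1 if it does not fit. */
int cgroupns_path(pid_t pid, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "/proc/%ld/ns/cgroup", (long)pid);

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/*
 * Attach the calling thread to the cgroup namespace of @pid.
 * Returns 0 on success or a negative errno value.
 */
int join_cgroup_ns(pid_t pid)
{
	char path[64];
	int fd, ret = 0;

	if (cgroupns_path(pid, path, sizeof(path)) != 0)
		return -ENAMETOOLONG;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd == -1)
		return -errno;

	/* The kernel side checks CAP_SYS_ADMIN in both user namespaces. */
	if (setns(fd, CLONE_NEWCGROUP) == -1)
		ret = -errno;

	close(fd);
	return ret;
}
```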



[PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-10-31 Thread Aditya Kali
This patch enables cgroup mounting inside a userns when a process
has the appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside containers
without depending on any global state.
To support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 fs/kernfs/mount.c  | 48 
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 47 +--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block 
*sb)
return NULL;
 }
 
+/**
+ * kernfs_obtain_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = ilookup(sb, kn->ino);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* If this is a new dentry, set it up. We need kernfs_mutex because this
+* may be called by callers other than kernfs_fill_super. */
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+   kernfs_get(kn);
+   dentry->d_fsdata = kn;
+   } else {
+   WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..250aaec 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
 
memset(opts, 0, sizeof(*opts));
 
+   /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+* namespace.
+*/
+   if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+   opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+   }
+
while ((token = strsep(&o, ",")) != NULL) {
nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
 
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its 
behaviors will change, proceed at your own risk\n");
-   if (nr_opts != 1) {
+   if (nr_opts > 1) {
pr_err("sane_behavior: no other mount options 
allowed\n");
return -EINVAL;
}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+struct cgroup_namespace *ns)
+{
+   struct dentry *nsdentry;
+
+   nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+   return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
LIST_HEAD(tmp_links);
@@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct

[PATCHv2 1/7] kernfs: Add API to generate relative kernfs path

2014-10-31 Thread Aditya Kali
The new function kernfs_path_from_node() generates and returns
the kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 fs/kernfs/dir.c| 194 +++--
 include/linux/kernfs.h |   3 +
 2 files changed, 176 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..e49c365 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,158 @@ static int kernfs_name_locked(struct kernfs_node *kn, char 
*buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char 
*buf,
- size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
 {
-   char *p = buf + buflen;
+   size_t depth = 0;
+
+   BUG_ON(!kn);
+   while (kn->parent) {
+   depth++;
+   kn = kn->parent;
+   }
+   return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3 [depth=3]
+ * result:  /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+   struct kernfs_node *kn_from,
+   struct kernfs_node *kn_to,
+   char *buf,
+   size_t buflen)
+{
+   char *p = buf;
+   struct kernfs_node *kn;
+   size_t depth_from = 0, depth_to, d;
int len;
 
-   *--p = '\0';
+   /* We at least need 2 bytes to write "/\0". */
+   BUG_ON(buflen < 2);
 
-   do {
-   len = strlen(kn->name);
-   if (p - buf < len + 1) {
-   buf[0] = '\0';
-   p = NULL;
-   break;
+   if (kn_from == kn_to) {
+   *p = '/';
+   *(p + 1) = '\0';
+   return p;
+   }
+
+   /* We can find the relative path only if both the nodes belong to the
+* same kernfs root.
+*/
+   if (kn_from) {
+   BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+   depth_from = kernfs_node_depth(kn_from);
+   }
+
+   depth_to = kernfs_node_depth(kn_to);
+
+   /* We compose path from left to right. So first write out all possible
+* "/.." strings needed to reach from 'kn_from' to the common ancestor.
+*/
+   if (kn_from) {
+   while (depth_from > depth_to) {
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
}
+
+   d = depth_to;
+   kn = kn_to;
+   while (depth_from < d) {
+   kn = kn->parent;
+   d--;
+   }
+
+   /* Now we have 'depth_from == depth_to' at this point. Add more
+* "/.."s until we reach common ancestor. In the worst case,
+* root node will be the common ancestor.
+*/
+   while (depth_from > 0) {
+   /* If we reached common ancestor, stop. */
+   if (kn_from == kn)
+   break;
+   len = strlen("/..");
+   if ((buflen - (p - buf)) < len + 1) {
+   /* buffer not big enough. */
+   buf[0] = '\0';
+   return NULL;
+   }
+   memcpy(p, "/..", len);
+   p += len;
+   *p = '\0';
+   --depth_from;
+   kn_from = kn_from->parent;
+   kn = kn->parent;
+   }
+   }
+
+   /* Figure out how many bytes we need to write the path.
+*/
+   d = depth_to;
+   kn = kn_to
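
Since the diff above is cut off in the archive, the following is a small userspace sketch of the same idea: compose the relative path from left to right, first emitting "/.." components until the common ancestor is reached, then the names leading down to the target. It is not the kernel code. struct node, path_from_node() and the depth limit of 64 are assumptions made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent link. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the root */
};

static size_t node_depth(const struct node *n)
{
	size_t d = 0;

	while (n->parent) {
		d++;
		n = n->parent;
	}
	return d;
}

/* Append @s at offset *pos; returns 0, or -1 if it does not fit with its NUL. */
static int append(char *buf, size_t buflen, size_t *pos, const char *s)
{
	size_t len = strlen(s);

	if (buflen - *pos < len + 1)
		return -1;
	memcpy(buf + *pos, s, len + 1);
	*pos += len;
	return 0;
}

/* Write the path of @to relative to @from into @buf; NULL if it does not fit. */
char *path_from_node(const struct node *from, const struct node *to,
		     char *buf, size_t buflen)
{
	const struct node *chain[64];	/* names to emit on the way down */
	const struct node *f = from, *t = to;
	size_t df, dt, pos = 0, n = 0;

	assert(from && to && buf && buflen >= 2);
	buf[0] = '\0';
	df = node_depth(from);
	dt = node_depth(to);

	/* Climb from @from until it is level with @to. */
	while (df > dt) {
		if (append(buf, buflen, &pos, "/.."))
			goto fail;
		f = f->parent;
		df--;
	}
	/* Lift @to to the same level, remembering the nodes passed. */
	while (dt > df) {
		if (n == 64)
			goto fail;
		chain[n++] = t;
		t = t->parent;
		dt--;
	}
	/* Climb both sides together until they meet at the common ancestor. */
	while (f != t) {
		if (n == 64 || append(buf, buflen, &pos, "/.."))
			goto fail;
		chain[n++] = t;
		f = f->parent;
		t = t->parent;
	}
	/* Descend from the common ancestor to @to. */
	while (n > 0) {
		n--;
		if (append(buf, buflen, &pos, "/") ||
		    append(buf, buflen, &pos, chain[n]->name))
			goto fail;
	}
	/* @from == @to: the path is just "/". */
	if (pos == 0 && append(buf, buflen, &pos, "/"))
		goto fail;
	return buf;
fail:
	buf[0] = '\0';
	return NULL;
}
```

With the trees from the comment block above, a from-node of /n1/n2/n3 and a to-node of /n1/n2/n3/n4/n5 give "/n4/n5", while /n1/n2/n3/n4 to /n1/n2/n5 gives "/../../n5".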

[PATCHv2 5/7] cgroup: introduce cgroup namespaces

2014-10-31 Thread Aditya Kali
Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as the cgroupns-root).
The main purpose of the cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly set up container, this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 fs/proc/namespaces.c |   1 +
 include/linux/cgroup.h   |  18 +-
 include/linux/cgroup_namespace.h |  36 +++
 include/linux/nsproxy.h  |   2 +
 include/linux/proc_ns.h  |   4 ++
 kernel/Makefile  |   2 +-
 kernel/cgroup.c  |  14 
 kernel/cgroup_namespace.c| 134 +++
 kernel/fork.c|   2 +-
 kernel/nsproxy.c |  19 +-
 10 files changed, 227 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/cgroup_namespace.h
 create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+   &cgroupns_operations,
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+   atomic_t                count;
+   unsigned int            proc_inum;
+   struct user_namespace   *user_ns;
+   struct cgroup           *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char 
*buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+struct cgroup *cgrp, char *buf,
+size_t buflen)
+{
+   return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
   size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+   return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+   return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+  struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns

[PATCHv2 3/7] cgroup: add function to get task's cgroup on default hierarchy

2014-10-31 Thread Aditya Kali
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup.c| 25 +
 2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 }
 
 char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
 
 int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 136ecea..50fa8e3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1917,6 +1917,31 @@ char *task_cgroup_path(struct task_struct *task, char 
*buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. 
The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+   struct cgroup *cgrp;
+
+   mutex_lock(&cgroup_mutex);
+   down_read(&css_set_rwsem);
+
+   cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+   cgroup_get(cgrp);
+
+   up_read(&css_set_rwsem);
+   mutex_unlock(&cgroup_mutex);
+   return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
 /* used to track tasks and other necessary states during migration */
 struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
-- 
2.1.0.rc2.206.gedb03e5
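
The get/put contract described in the comment for get_task_cgroup() (the function takes a reference, the caller drops it) can be sketched in userspace with C11 atomics. This is only an analogy: struct ns_obj, ns_obj_alloc(), ns_obj_get(), ns_obj_put() and the ns_obj_freed counter are hypothetical names, and the kernel uses css_get()/css_put() on percpu refcounts instead.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted object. */
struct ns_obj {
	atomic_int count;
};

/* Counts releases; exists only so the behaviour can be observed. */
int ns_obj_freed;

/* Allocate an object holding one reference owned by the caller. */
struct ns_obj *ns_obj_alloc(void)
{
	struct ns_obj *o = malloc(sizeof(*o));

	if (o)
		atomic_init(&o->count, 1);
	return o;
}

/* Take an extra reference; NULL is passed through unchanged. */
struct ns_obj *ns_obj_get(struct ns_obj *o)
{
	if (o)
		atomic_fetch_add(&o->count, 1);
	return o;
}

/* Drop a reference and free the object when the last one goes away. */
void ns_obj_put(struct ns_obj *o)
{
	/* atomic_fetch_sub() returns the previous value. */
	if (o && atomic_fetch_sub(&o->count, 1) == 1) {
		ns_obj_freed++;
		free(o);
	}
}
```

Every successful ns_obj_get() must be paired with exactly one ns_obj_put(), which is the same rule the patch states for get_task_cgroup() and cgroup_put().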



[PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()

2014-10-31 Thread Aditya Kali
Move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali <adityak...@google.com>
---
 include/linux/cgroup.h | 22 ++
 kernel/cgroup.c| 22 --
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+   return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+   WARN_ON_ONCE(cgroup_is_dead(cgrp));
+   css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+   return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+   css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 50fa8e3..9c622b9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct 
cgroup *cgrp,
return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-   return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-   WARN_ON_ONCE(cgroup_is_dead(cgrp));
-   css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-   return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-   css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.1.0.rc2.206.gedb03e5



[PATCHv2 0/7] CGroup Namespaces

2014-10-31 Thread Aditya Kali
Another attempt at the Cgroup Namespace patch-set. This incorporates
suggestions on the previous patch-set.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So the path can contain '/..' strings depending on the
   cgroupns-root of the reader and the cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup mntpt' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and Namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task is always able to see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically host-container-management-agent
  (systemd, docker/libcontainer, etc.) data and leaking its name (or
  leaking the hierarchy) reveals too much information about the host
  system.
  (2) It makes the container migration across machines (CRIU) more
  difficult as the container names need to be unique across the
  machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
  docker/libcontainer, lmctfy, etc.) within virtual containers
  without adding dependency on some state/agent present outside the
  container.

  Note that the feature proposed here is completely different than the
  “ns cgroup” feature which existed in the linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), the containers can now
  have a much more coherent cgroup view and it's easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it is currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within new cgroupns, process sees that it is in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
  filesystem root for the namespace specific cgroupfs mount.

  The virtualization of /proc/self/cgroup file combined with restricting
  the view of cgroup hierarchy by namespace-private cgroupfs mount
  should provide a completely isolated cgroup view inside the container.
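
  To see what a task observes, one can read /proc/self/cgroup directly.
  The sketch below splits each line of the form
  "hierarchy-ID:controller-list:path" and prints the path field. The
  helper names cgroup_line_path() and print_self_cgroup_paths() are
  hypothetical and not part of the patch-set.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return a pointer to the path field of one /proc/<pid>/cgroup line
 * ("hierarchy-ID:controller-list:path"), or NULL if the line does not
 * contain two ':' separators.
 */
const char *cgroup_line_path(const char *line)
{
	const char *p = strchr(line, ':');

	if (!p)
		return NULL;
	p = strchr(p + 1, ':');
	return p ? p + 1 : NULL;
}

/* Print the cgroup path of the calling task for every hierarchy. */
int print_self_cgroup_paths(void)
{
	char line[4096];
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		const char *path;

		line[strcspn(line, "\n")] = '\0';	/* strip the newline */
		path = cgroup_line_path(line);
		if (path)
			printf("%s\n", path);
	}
	fclose(f);
	return 0;
}
```

  Run from the initial namespace in the example above, this would print
  /batchjobs/c_job_id1; run after unsharing the cgroup namespace, it
  would print / instead.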

  In its current form, the cgroup namespaces patchset provides the following
  behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
  the process calling unshare is running.
  For ex. if a 

Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

2014-10-23 Thread Aditya Kali
I will include the suggested changes in the new patchset. Some comments inline.

On Thu, Oct 16, 2014 at 9:37 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Aditya Kali (adityak...@google.com):
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>> Signed-off-by: Aditya Kali <adityak...@google.com>
>
> I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
> have cgroups in the kernel this won't add much in the way of memory
> usage, right?  And I think the 'experimental' argument has long since
> been squashed.  So I'd argue for simplifying this patch by removing
> CONFIG_CGROUP_NS.
>

With no pinning involved, I think it's safe to enable the feature
without needing a config option. I have removed it from the next version.
This feature is now implicitly available with CONFIG_CGROUPS.

> (more below)
>
>> ---
>>  fs/proc/namespaces.c |   3 +
>>  include/linux/cgroup.h   |  18 +-
>>  include/linux/cgroup_namespace.h |  62 +++
>>  include/linux/nsproxy.h  |   2 +
>>  include/linux/proc_ns.h  |   4 ++
>>  init/Kconfig |   9 +++
>>  kernel/Makefile  |   1 +
>>  kernel/cgroup.c  |  11 
>>  kernel/cgroup_namespace.c| 128 
>> +++
>>  kernel/fork.c|   2 +-
>>  kernel/nsproxy.c |  19 +-
>>  11 files changed, 255 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 8902609..e04ed4b 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>>   &userns_operations,
>>  #endif
>>   &mntns_operations,
>> +#ifdef CONFIG_CGROUP_NS
>> + &cgroupns_operations,
>> +#endif
>>  };
>>
>>  static const struct file_operations ns_file_operations = {
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 4a0eb2d..aa86495 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -22,6 +22,8 @@
>>  #include <linux/seq_file.h>
>>  #include <linux/kernfs.h>
>>  #include <linux/wait.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/types.h>
>>
>>  #ifdef CONFIG_CGROUPS
>>
>> @@ -460,6 +462,13 @@ struct cftype {
>>  #endif
>>  };
>>
>> +struct cgroup_namespace {
>> + atomic_t                count;
>> + unsigned int            proc_inum;
>> + struct user_namespace   *user_ns;
>> + struct cgroup           *root_cgrp;
>> +};
>> +
>>  extern struct cgroup_root cgrp_dfl_root;
>>  extern struct css_set init_css_set;
>>
>> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, 
>> char *buf, size_t buflen)
>>   return kernfs_name(cgrp->kn, buf, buflen);
>>  }
>>
>> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace 
>> *ns,
>> +  struct cgroup *cgrp, char 
>> *buf,
>> +  size_t buflen)
>> +{
>> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
>> +}
>> +
>>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char 
>> *buf,
>> size_t buflen)
>>  {
>> - return kernfs_path(cgrp->kn, buf, buflen);
>> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>>  }
>>
>>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
>> diff --git a/include/linux/cgroup_namespace.h 
>> b/include/linux/cgroup_namespace.h
>> new file mode 100644
>> index 000..9f637fe
>> --- /dev/null
>> +++ b/include/linux/cgroup_namespace.h
>> @@ -0,0 +1,62 @@
>> +#ifndef _LINUX_CGROUP_

Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

2014-10-23 Thread Aditya Kali
I will include the suggested changes in the new patchset. Some comments inline.

On Thu, Oct 16, 2014 at 9:37 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Aditya Kali (adityak...@google.com):
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>> Signed-off-by: Aditya Kali <adityak...@google.com>
>
> I'm not sure that the CONFIG_CGROUP_NS is worthwhile.  If you already
> have cgroups in the kernel this won't add much in the way of memory
> usage, right?  And I think the 'experimental' argument has long since
> been squashed.  So I'd argue for simplifying this patch by removing
> CONFIG_CGROUP_NS.
>

With no pinning involved, I think it's safe to enable the feature
without needing a config option. I have removed it in the next version.
This feature is now implicitly available with CONFIG_CGROUPS.

> (more below)
>
>> ---
>>  fs/proc/namespaces.c             |   3 +
>>  include/linux/cgroup.h           |  18 +-
>>  include/linux/cgroup_namespace.h |  62 +++
>>  include/linux/nsproxy.h          |   2 +
>>  include/linux/proc_ns.h          |   4 ++
>>  init/Kconfig                     |   9 +++
>>  kernel/Makefile                  |   1 +
>>  kernel/cgroup.c                  |  11
>>  kernel/cgroup_namespace.c        | 128 +++
>>  kernel/fork.c                    |   2 +-
>>  kernel/nsproxy.c                 |  19 +-
>>  11 files changed, 255 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 8902609..e04ed4b 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>>   &userns_operations,
>>  #endif
>>   &mntns_operations,
>> +#ifdef CONFIG_CGROUP_NS
>> + &cgroupns_operations,
>> +#endif
>>  };
>>
>>  static const struct file_operations ns_file_operations = {
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 4a0eb2d..aa86495 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -22,6 +22,8 @@
>>  #include <linux/seq_file.h>
>>  #include <linux/kernfs.h>
>>  #include <linux/wait.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/types.h>
>>
>>  #ifdef CONFIG_CGROUPS
>>
>> @@ -460,6 +462,13 @@ struct cftype {
>>  #endif
>>  };
>>
>> +struct cgroup_namespace {
>> + atomic_t                count;
>> + unsigned int            proc_inum;
>> + struct user_namespace   *user_ns;
>> + struct cgroup           *root_cgrp;
>> +};
>> +
>>  extern struct cgroup_root cgrp_dfl_root;
>>  extern struct css_set init_css_set;
>>
>> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp,
>> char *buf, size_t buflen)
>>   return kernfs_name(cgrp->kn, buf, buflen);
>>  }
>>
>> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace
>> *ns,
>> +  struct cgroup *cgrp, char
>> *buf,
>> +  size_t buflen)
>> +{
>> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
>> +}
>> +
>>  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char
>> *buf,
>> size_t buflen)
>>  {
>> - return kernfs_path(cgrp->kn, buf, buflen);
>> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>>  }
>>
>>  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
>> diff --git a/include/linux/cgroup_namespace.h
>> b/include/linux/cgroup_namespace.h
>> new file mode 100644
>> index 000..9f637fe
>> --- /dev/null
>> +++ b/include/linux/cgroup_namespace.h
>> @@ -0,0 +1,62 @@
>> +#ifndef _LINUX_CGROUP_NAMESPACE_H
>> +#define _LINUX_CGROUP_NAMESPACE_H
>> +
>> +#include <linux/nsproxy.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/types.h>
>> +#include <linux/user_namespace.h>
>> +
>> +extern struct cgroup_namespace init_cgroup_ns;
>> +
>> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
>> +{
>> + return tsk->nsproxy->cgroup_ns->root_cgrp;
>
> Per the rules in nsproxy.h, you should be taking the task_lock here.
>
> (If you are making assumptions about tsk then you need to state them
> here - I only looked quickly enough that you pass in 'leader')
>

In the new version of the patch, we call this function only for the
'current' task. As per nsproxy.h, no special precautions are needed when
reading the current task's nsproxy. So I just remodeled this function

Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

2014-10-22 Thread Aditya Kali
On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Aditya Kali (adityak...@google.com):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <adityak...@google.com>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed.  (And really I can fchdir to a
> directory not in my ns, so the cgroup-attach restriction is
> any more justified).
>

As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:

- if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
          return -EPERM;

i.e., we don't care about the cgroup of the process being moved. We only
check whether the writer has access to the dst_cgrp.

So I will just drop this patch in the next version and merge the
cgroup_attach_task() change into another patch.
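
For readers following along, the proposed rule can be modelled outside the kernel. The sketch below is a user-space toy, not kernel code: struct toy_cgroup, struct toy_task and both helpers are hypothetical stand-ins invented for illustration. The only behaviour taken from the thread is that the destination must lie under the writer's cgroupns root; treating a cgroup as a descendant of itself is an assumption based on how cgroup_is_descendant() is used in the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct cgroup; only the parent link is modelled. */
struct toy_cgroup {
	const struct toy_cgroup *parent;	/* NULL for the hierarchy root */
};

/* Toy stand-in for a task; only its cgroupns root is modelled. */
struct toy_task {
	const struct toy_cgroup *cgroupns_root;
};

/* True if 'cgrp' is 'ancestor' itself or lies anywhere below it. */
static bool toy_is_descendant(const struct toy_cgroup *cgrp,
			      const struct toy_cgroup *ancestor)
{
	for (; cgrp; cgrp = cgrp->parent)
		if (cgrp == ancestor)
			return true;
	return false;
}

/*
 * Proposed rule: only the writer's namespace root is consulted.
 * The task being moved is deliberately not looked at.
 */
static int toy_attach_allowed(const struct toy_task *writer,
			      const struct toy_cgroup *dst_cgrp)
{
	if (!toy_is_descendant(dst_cgrp, writer->cgroupns_root))
		return -EPERM;
	return 0;
}
```

With a writer confined to /container1, toy_attach_allowed() returns 0 for /container1/a and -EPERM for /container2, while a writer whose cgroupns root is the top of the hierarchy is allowed both.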

> Still I'm not strictly opposed to this, so
>
> Acked-by: Serge Hallyn <serge.hal...@canonical.com>
>
> just wanted to point this out.
>
>> ---
>>  kernel/cgroup.c | 34 +-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>>   struct task_struct *task;
>>   int ret;
>>
>> + /* Only allow changing cgroups accessible within task's cgroup
>> +  * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> +  * cgroupns->root_cgrp. */
>> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> + return -EPERM;
>> +
>>   /* look up all src csets */
>>   down_read(&css_set_rwsem);
>>   rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct 
>> kernfs_open_file *of, char *buf,
>>   struct cgroup_subsys_state *css;
>>   int ret;
>>
>> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + return -EINVAL;
>> +
>>   if (cft->write)
>>   return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node 
>> *parent_kn, const char *name,
>>   parent = cgroup_kn_lock_live(parent_kn);
>>   if (!parent)
>>   return -ENODEV;
>> +
>> + /* Allow mkdir only within process's cgroup namespace root. */
>> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> + ret = -EPERM;
>> + goto out_unlock;
>> + }
>> +
>>   root = parent->root;
>>
>>   /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>>   if (!cgrp)
>>   return 0;
>>
>> + /* Allow rmdir only within process's cgroup namespace root.
>> +  * The process can't delete its own root anyways. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> + cgroup_kn_unlock(kn);
>> + return -EPERM;
>> + }
>> +
>>   ret = cgroup_destroy_locked(cgrp);
>>
>>   cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct 
>> pid_namespace *ns,
>>   if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>>   continue;
>>
>> + cgrp = task_cgroup_from_root(tsk, root);
>> +
>> + /* The cgroup path on default hierarc

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-22 Thread Aditya Kali
On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityak...@google.com> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityak...@google.com> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <l...@amacapital.net>
>>>> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityak...@google.com>
>>>>> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <l...@amacapital.net>
>>>>>> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <ebied...@xmission.com> wrote:
>>>>>>>>
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>>
>>>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>>>
>>>>>>
>>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>>
>>>>> Why?  Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>>>
>>>>
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. Thats another reason why it might not be good idea to
>>>> tie cgroups with mount namespace.
>>>>
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>>
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>>
>>>>> So don't let them out.  None of the other namespaces have this kind of
>>>>> constraint:
>>>>>
>>>>>  - If you're in a mntns, you can still use fds from outside.
>>>>>  - If you're in a netns, you can still use sockets from outside the 
>>>>> namespace.
>>>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>>
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>>
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
>>> like that?  You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>>>
>>
>> I am not sure if we should only depend on restricted visibility for this
>> though. More details below.
>>
>>>
>>>>>
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>>
>>>>> I actually think it makes the semantics more complex.  The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>>>
>>>>
>>>

>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>>
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>>
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>>>
>>
>> So there are 2 scenarios to consider:
>>
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>>
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
>
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>
>> For (1), I think its ok to reject

Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

2014-10-22 Thread Aditya Kali
On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Aditya Kali (adityak...@google.com):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <adityak...@google.com>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed.  (And really I can fchdir to a
> directory not in my ns, so the cgroup-attach restriction is
> any more justified).
>

As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:

- if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
          return -EPERM;

i.e., we don't care about the cgroup of the process being moved. We only
check whether the writer has access to the dst_cgrp.

So I will just drop this patch in the next version and merge the
cgroup_attach_task() change into another patch.

> Still I'm not strictly opposed to this, so
>
> Acked-by: Serge Hallyn <serge.hal...@canonical.com>
>
> just wanted to point this out.

>> ---
>>  kernel/cgroup.c | 34 +-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>>   struct task_struct *task;
>>   int ret;
>>
>> + /* Only allow changing cgroups accessible within task's cgroup
>> +  * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> +  * cgroupns->root_cgrp. */
>> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> + return -EPERM;
>> +
>>   /* look up all src csets */
>>   down_read(&css_set_rwsem);
>>   rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct
>> kernfs_open_file *of, char *buf,
>>   struct cgroup_subsys_state *css;
>>   int ret;
>>
>> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + return -EINVAL;
>> +
>>   if (cft->write)
>>   return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node
>> *parent_kn, const char *name,
>>   parent = cgroup_kn_lock_live(parent_kn);
>>   if (!parent)
>>   return -ENODEV;
>> +
>> + /* Allow mkdir only within process's cgroup namespace root. */
>> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> + ret = -EPERM;
>> + goto out_unlock;
>> + }
>> +
>>   root = parent->root;
>>
>>   /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>>   if (!cgrp)
>>   return 0;
>>
>> + /* Allow rmdir only within process's cgroup namespace root.
>> +  * The process can't delete its own root anyways. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> + cgroup_kn_unlock(kn);
>> + return -EPERM;
>> + }
>> +
>>   ret = cgroup_destroy_locked(cgrp);
>>
>>   cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct
>> pid_namespace *ns,
>>   if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>>   continue;
>>
>> + cgrp = task_cgroup_from_root(tsk, root);
>> +
>> + /* The cgroup path on default hierarchy is shown only if it
>> +  * falls under current task's cgroupns-root.
>> +  */
>> + if (root == &cgrp_dfl_root &&
>> + !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + continue;
>> +
>>   seq_printf(m, "%d:", root->hierarchy_id);
>>   for_each_subsys(ss, ssid)
>>   if (root->subsys_mask & (1 << ssid))
>> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct
>> pid_namespace *ns,
>>   seq_printf(m, "%sname=%s", count ? "," : "",
>>  root->name);
>>   seq_putc(m, ':');
>> - cgrp = task_cgroup_from_root(tsk

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-21 Thread Aditya Kali
On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityak...@google.com> wrote:
>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <l...@amacapital.net>
>> wrote:
>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityak...@google.com> wrote:
>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <l...@amacapital.net>
>>>> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>> <ebied...@xmission.com> wrote:
>>>>>>
>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>> implementation.
>>>>>
>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>
>>>>
>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>> addition to restricting the process to a cgroup-root, new processes
>>>> entering the container should also be implicitly contained within the
>>>> cgroup-root of that container.
>>>
>>> Why?  Concretely, why should this be in the kernel namespace code
>>> instead of in userspace?
>>>
>>
>> Userspace can do it too. Though then there will be possibility of
>> having processes in the same mount namespace with different
>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>> more complex. Thats another reason why it might not be good idea to
>> tie cgroups with mount namespace.
>>
>>>> Implementing pivot_cgroup_root would
>>>> probably involve overloading mount-namespace to now understand cgroup
>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>> earlier (not via a new syscall though), but came to the conclusion
>>>> that its just simpler to have a separate cgroup namespace and get
>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>
>>>> About pinning: I really feel that it should be OK to pin processes
>>>> within cgroupns-root. I think thats one of the most important feature
>>>> of cgroup-namespace since its most common usecase is to containerize
>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>> to remain inside their container.
>>>
>>> So don't let them out.  None of the other namespaces have this kind of
>>> constraint:
>>>
>>>  - If you're in a mntns, you can still use fds from outside.
>>>  - If you're in a netns, you can still use sockets from outside the 
>>> namespace.
>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>
>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>> handles in the outside namespace. I think moving a process outside of
>> cgroupns-root is like allocating a resource outside of your namespace.
>
> In a pidns, you can see outside tasks if you have an outside procfs
> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
> like that?  You wouldn't be able to escape your cgroup as long as you
> don't have an inappropriate cgroupfs mounted.
>

I am not sure if we should only depend on restricted visibility for this
though. More details below.

>
>>>
>>>> And with explicit permission from
>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>> suggested previously), we can make sure that unprivileged processes
>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>> semantics simple.
>>>
>>> I actually think it makes the semantics more complex.  The less policy
>>> you stick in the kernel, the easier it is to understand the impact of
>>> that policy.
>>>
>>
>> My inclination is towards keeping things simpler - both in code as
>> well as in configuration. I agree that cgroupns might seem
>> "less-flexible", but in its current form, it encourages consistent
>> container configuration. If you have a process that needs to move
>> around between cgroups belonging to different containers, then that
>> process should probably not be inside any container's cgroup
>> namespace. Allowing that will just make the cgroup namespace
>> pretty-much meaningless.
>
> The problem with pinning is that preventing it causes problems
> (specifically, either something potentially complex and incompatible
> needs to be added or unprivileged processes will be able to pin
> themselves

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-21 Thread Aditya Kali
On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityak...@google.com> wrote:
>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <l...@amacapital.net>
>> wrote:
>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>> <ebied...@xmission.com> wrote:
>>>>
>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>> implementation.
>>>
>>> Could be.  I'll defer to Aditya for that one.
>>>
>>
>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>> addition to restricting the process to a cgroup-root, new processes
>> entering the container should also be implicitly contained within the
>> cgroup-root of that container.
>
> Why?  Concretely, why should this be in the kernel namespace code
> instead of in userspace?
>

Userspace can do it too. Though then there will be a possibility of
having processes in the same mount namespace with different
cgroup-roots. Deriving the contents of /proc/<pid>/cgroup becomes even
more complex. That's another reason why it might not be a good idea to
tie cgroups to the mount namespace.
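
To make the path-derivation problem concrete, here is a rough user-space sketch of computing a cgroup path as seen from a given cgroupns root. It is not the kernel implementation: toy_path_len() and toy_cgroup_path_ns() are hypothetical names, paths are plain normalized strings rather than kernfs nodes, and rendering an outside cgroup with leading "/.." components is a choice made here to mirror the "../foo" idea floated in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Length of a normalized cgroup path, with the root "/" counted as empty. */
static size_t toy_path_len(const char *path)
{
	return strcmp(path, "/") == 0 ? 0 : strlen(path);
}

/*
 * Write the path of 'cgrp' as seen from 'ns_root' into 'buf'.
 * Both inputs must be absolute and normalized ("/" or "/a/b").
 * Returns 0 on success, -1 if 'buf' is too small.
 */
static int toy_cgroup_path_ns(const char *ns_root, const char *cgrp,
			      char *buf, size_t buflen)
{
	size_t rlen = toy_path_len(ns_root);
	size_t clen = toy_path_len(cgrp);
	size_t i = 0, common = 0, ups = 0, need;
	char *p = buf;

	/* Find the longest shared prefix that ends on a component boundary. */
	while (i < rlen && i < clen && ns_root[i] == cgrp[i]) {
		i++;
		if ((i == rlen || ns_root[i] == '/') &&
		    (i == clen || cgrp[i] == '/'))
			common = i;
	}

	/* Each ns_root component below the shared prefix costs one "..". */
	for (i = common; i < rlen; i++)
		if (ns_root[i] == '/')
			ups++;

	need = 3 * ups + (clen - common);
	if (need == 0)
		need = 1;	/* the namespace root itself is shown as "/" */
	if (need + 1 > buflen)
		return -1;

	for (i = 0; i < ups; i++) {
		memcpy(p, "/..", 3);
		p += 3;
	}
	memcpy(p, cgrp + common, clen - common);
	p += clen - common;
	if (p == buf)
		*p++ = '/';
	*p = '\0';
	return 0;
}
```

Under these assumptions a task rooted at /container1 sees /container1/a as "/a" and its own root as "/", while a cgroup outside the root, such as /container2, comes out as "/../container2". The last case is exactly what would leak information about sibling containers.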

>> Implementing pivot_cgroup_root would
>> probably involve overloading mount-namespace to now understand cgroup
>> filesystem too. I did attempt combining cgroupns-root with mntns
>> earlier (not via a new syscall though), but came to the conclusion
>> that its just simpler to have a separate cgroup namespace and get
>> clear semantics. One of the issues was that implicitly changing cgroup
>> on setns to mntns seemed like a huge undesirable side-effect.
>>
>> About pinning: I really feel that it should be OK to pin processes
>> within cgroupns-root. I think thats one of the most important feature
>> of cgroup-namespace since its most common usecase is to containerize
>> un-trusted processes - processes that, for their entire lifetime, need
>> to remain inside their container.
>
> So don't let them out.  None of the other namespaces have this kind of
> constraint:
>
>  - If you're in a mntns, you can still use fds from outside.
>  - If you're in a netns, you can still use sockets from outside the namespace.
>  - If you're in an ipcns, you can still use ipc handles from outside.

But none of the namespaces allow you to allocate new fds/sockets/ipc
handles in the outside namespace. I think moving a process outside of
cgroupns-root is like allocating a resource outside of your namespace.

>
> etc.

>
>> And with explicit permission from
>> cgroup subsystem (something like cgroup.may_unshare as you had
>> suggested previously), we can make sure that unprivileged processes
>> cannot pin themselves. Also, maintaining this invariant (your current
>> cgroup is always under your cgroupns-root) keeps the code and the
>> semantics simple.
>
> I actually think it makes the semantics more complex.  The less policy
> you stick in the kernel, the easier it is to understand the impact of
> that policy.
>

My inclination is towards keeping things simpler - both in code as
well as in configuration. I agree that cgroupns might seem
"less-flexible", but in its current form, it encourages consistent
container configuration. If you have a process that needs to move
around between cgroups belonging to different containers, then that
process should probably not be inside any container's cgroup
namespace. Allowing that will just make the cgroup namespace
pretty-much meaningless.

>>
>> If we ditch the pinning requirement and allow the containerized
>> process to move outside of its cgroupns-root, we will have to address
>> at least the following:
>> * what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
>> look like? We might need to just not show anything in
>> /proc/<pid>/cgroup in such case (for default hierarchy).
>
> The process should see the cgroup path relative to its cgroup ns.
> Whether this requires a new /proc mount or happens automatically is an
> open question.  (I *hate* procfs for reasons like this.)
>
>> * how should future setns() and unshare() by such process behave?
>
> Open question.
>
>> * 'mount -t cgroup cgroup ' by such a process will yield unexpected 
>> result
>
> You could disallow that and instead require 'mount -t cgroup -o
> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
> relative to the caller's cgroupns.
>
>> * container will not remain migratable
>
> Why not?
>

Well, the processes running outside of cgroupns root will be exposed
to information outside of the container (i.e., its /proc/self/cgroup
will show paths involving other containers and potentially system
level information). So unless you even restore them, it wil

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-21 Thread Aditya Kali
On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
> <ebied...@xmission.com> wrote:
>> Andy Lutomirski <l...@amacapital.net> writes:
>>
>>> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
>>> <ebied...@xmission.com> wrote:
>>>> Andy Lutomirski <l...@amacapital.net> writes:
>>>>> Possible solution:
>>>>>
>>>>> Ditch the pinning.  That is, if you're outside a cgroupns (or you have
>>>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>>>> cgroupns outside of its root cgroup.  If you do this, then the task
>>>>> thinks its cgroup is something like "../foo" or "../../foo".
>>>>
>>>> Of the possible solutions that seems attractive to me, simply because
>>>> we sometimes want to allow clever things to occur.
>>>>
>>>> Does anyone know of a reason (beyond pretty printing) why we need
>>>> cgroupns to restrict the subset of cgroups processes can be in?
>>>>
>>>> I would expect permissions on the cgroup directories themselves, and
>>>> limited visibility would be (in general) enough to achieve the desired
>>>> visibility.
>>>
>>> This makes the security impact of cgroupns very easy to understand,
>>> right?  Because there really won't be any -- cgroupns only affects
>>> reads from /proc and what cgroupfs shows, but it doesn't change any
>>> actual cgroups, nor does it affect any cgroup *changes*.
>>
>> It seems like what we have described is chcgrouproot aka chroot for
>> cgroups.  At which point I think there are potentially similar security
>> issues as for chroot.  Can we confuse a setuid root process if we make
>> its cgroup names look different?
>>
>> Of course the confusing root concern is handled by the usual namespace
>> security checks that are already present.
>
> I think that the chroot issues are mostly in two categories: setuid
> confusion (not an issue here as you described) and chroot escapes.
> cgroupns escapes aren't a big deal, I think -- admins should deny the
> confined task the right to write to cgroupfs outside its hierarchy, by
> setting cgroupfs permissions appropriately and/or avoiding mounting
> cgroupfs outside the hierarchy.
>
>>
>> I do wonder if we think of this as chcgrouproot if there is a simpler
>> implementation.
>
> Could be.  I'll defer to Aditya for that one.
>

More than chcgrouproot, it's probably closer to pivot_cgroup_root. In
addition to restricting the process to a cgroup-root, new processes
entering the container should also be implicitly contained within the
cgroup-root of that container. Implementing pivot_cgroup_root would
probably involve overloading mount-namespace to now understand cgroup
filesystem too. I did attempt combining cgroupns-root with mntns
earlier (not via a new syscall though), but came to the conclusion
that it's just simpler to have a separate cgroup namespace and get
clear semantics. One of the issues was that implicitly changing cgroup
on setns to mntns seemed like a huge undesirable side-effect.

About pinning: I really feel that it should be OK to pin processes
within cgroupns-root. I think that's one of the most important features
of cgroup-namespace, since its most common use case is to containerize
untrusted processes - processes that, for their entire lifetime, need
to remain inside their container. And with explicit permission from
cgroup subsystem (something like cgroup.may_unshare as you had
suggested previously), we can make sure that unprivileged processes
cannot pin themselves. Also, maintaining this invariant (your current
cgroup is always under your cgroupns-root) keeps the code and the
semantics simple.

If we ditch the pinning requirement and allow the containerized
process to move outside of its cgroupns-root, we will have to address
at least the following:
* what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
  look like? We might need to just not show anything in
  /proc/<pid>/cgroup in such case (for default hierarchy).
* how should future setns() and unshare() by such process behave?
* 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
* container will not remain migratable
* added code complexity to handle above scenarios
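
To make the first bullet concrete, here is a minimal userspace sketch in C of how a cgroup path could be rendered relative to a cgroupns-root if unpinned tasks were allowed, following the "../foo" idea quoted above. It works on plain path strings rather than kernel structures. The name cgroup_path_in_ns(), the helper append_str(), and the leading "/" on the result are assumptions made for illustration; none of them come from the patchset.

```c
#include <assert.h>
#include <string.h>

/* Append n bytes of s to out, keeping it NUL-terminated; -1 if it won't fit. */
static int append_str(char *out, size_t outlen, size_t *used,
                      const char *s, size_t n)
{
    if (*used + n + 1 > outlen)
        return -1;
    memcpy(out + *used, s, n);
    *used += n;
    out[*used] = '\0';
    return 0;
}

/*
 * Hypothetical helper: render the absolute cgroup path `cgrp` as seen from
 * a namespace rooted at `ns_root`. Components shared by both paths are
 * dropped, every leftover component of ns_root becomes "/..", and the
 * leftover components of cgrp are appended. Returns 0, or -1 if `out` is
 * too small.
 */
int cgroup_path_in_ns(const char *ns_root, const char *cgrp,
                      char *out, size_t outlen)
{
    const char *r = ns_root, *c = cgrp;
    size_t used = 0;

    assert(ns_root && cgrp && out && outlen > 0);
    out[0] = '\0';

    /* Skip the components that both paths have in common. */
    for (;;) {
        size_t rl, cl;

        while (*r == '/')
            r++;
        while (*c == '/')
            c++;
        rl = strcspn(r, "/");
        cl = strcspn(c, "/");
        if (rl == 0 || rl != cl || strncmp(r, c, rl) != 0)
            break;
        r += rl;
        c += cl;
    }

    /* One "/.." for each component of ns_root that cgrp is not under. */
    while (*r) {
        r += strcspn(r, "/");
        while (*r == '/')
            r++;
        if (append_str(out, outlen, &used, "/..", 3))
            return -1;
    }

    /* Then the part of cgrp that lies below the common ancestor. */
    while (*c) {
        size_t cl = strcspn(c, "/");

        if (append_str(out, outlen, &used, "/", 1) ||
            append_str(out, outlen, &used, c, cl))
            return -1;
        c += cl;
        while (*c == '/')
            c++;
    }

    /* A task sitting exactly at the namespace root sees "/". */
    if (used == 0)
        return append_str(out, outlen, &used, "/", 1);
    return 0;
}
```

With the pinning invariant in place only the "/" and "/sub" style results can occur; the "/.." results are exactly the cases the bullets above would have to define behaviour for.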

I understand that having a process pinned to a cgroup hierarchy might
seem inconvenient. But even today (without cgroup namespaces), moving
a task from one cgroup to another can fail for reasons outside the
control of the task attempting the move (even if it's privileged). So
userspace should already handle this scenario. I feel it's not
worth adding complexity to the kernel for this.

>>>>> While we're at it, consider making setns for a cgroupns *not* change
>>>>> the caller's cgroup.  Is there any reason it really needs to?
>>>>
>>>> setns doesn't but nsenter is going to need to change the cgroup
>>>> if the pinning requirement is kept.  nsenter is going to want to
>>>> change the cgroup if the pinning requirement is dropped.
>>>>
>>>
>>> It seems easy enough for nsenter to change the cgroup all by itself.
>>
>> Again.  I don't think anyone has suggested or implemented anything
>> different.

> The current patchset seems to punt on this decision by just failing
> the

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-21 Thread Aditya Kali
On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityak...@google.com> wrote:
>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <l...@amacapital.net>
>> wrote:
>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>> <ebied...@xmission.com> wrote:
>>>>
>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>> implementation.
>>>
>>> Could be.  I'll defer to Aditya for that one.
>>>
>>
>> More than chcgrouproot, it's probably closer to pivot_cgroup_root. In
>> addition to restricting the process to a cgroup-root, new processes
>> entering the container should also be implicitly contained within the
>> cgroup-root of that container.
>
> Why?  Concretely, why should this be in the kernel namespace code
> instead of in userspace?
>

Userspace can do it too, though then there will be the possibility of
having processes in the same mount namespace with different
cgroup-roots. Deriving the contents of /proc/<pid>/cgroup becomes even
more complex. That's another reason why it might not be a good idea to
tie cgroups to the mount namespace.

>> Implementing pivot_cgroup_root would
>> probably involve overloading mount-namespace to now understand cgroup
>> filesystem too. I did attempt combining cgroupns-root with mntns
>> earlier (not via a new syscall though), but came to the conclusion
>> that it's just simpler to have a separate cgroup namespace and get
>> clear semantics. One of the issues was that implicitly changing cgroup
>> on setns to mntns seemed like a huge undesirable side-effect.
>>
>> About pinning: I really feel that it should be OK to pin processes
>> within cgroupns-root. I think that's one of the most important features
>> of cgroup-namespace, since its most common use case is to containerize
>> untrusted processes - processes that, for their entire lifetime, need
>> to remain inside their container.
>
> So don't let them out.  None of the other namespaces have this kind of
> constraint:
>
>  - If you're in a mntns, you can still use fds from outside.
>  - If you're in a netns, you can still use sockets from outside the namespace.
>  - If you're in an ipcns, you can still use ipc handles from outside.
>

But none of the namespaces allow you to allocate new fds/sockets/ipc
handles in the outside namespace. I think moving a process outside of
cgroupns-root is like allocating a resource outside of your namespace.

> etc.
>
>> And with explicit permission from
>> cgroup subsystem (something like cgroup.may_unshare as you had
>> suggested previously), we can make sure that unprivileged processes
>> cannot pin themselves. Also, maintaining this invariant (your current
>> cgroup is always under your cgroupns-root) keeps the code and the
>> semantics simple.
>
> I actually think it makes the semantics more complex.  The less policy
> you stick in the kernel, the easier it is to understand the impact of
> that policy.
>

My inclination is towards keeping things simpler - both in code as
well as in configuration. I agree that cgroupns might seem
less flexible, but in its current form, it encourages consistent
container configuration. If you have a process that needs to move
around between cgroups belonging to different containers, then that
process should probably not be inside any container's cgroup
namespace. Allowing that will just make the cgroup namespace
pretty much meaningless.

>> If we ditch the pinning requirement and allow the containerized
>> process to move outside of its cgroupns-root, we will have to address
>> at least the following:
>> * what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
>>   look like? We might need to just not show anything in
>>   /proc/<pid>/cgroup in such case (for default hierarchy).
>
> The process should see the cgroup path relative to its cgroup ns.
> Whether this requires a new /proc mount or happens automatically is an
> open question.  (I *hate* procfs for reasons like this.)
>
>> * how should future setns() and unshare() by such process behave?
>
> Open question.
>
>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected
>>   result
>
> You could disallow that and instead require 'mount -t cgroup -o
> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
> relative to the caller's cgroupns.
>
>> * container will not remain migratable
>
> Why not?
>

Well, the processes running outside of cgroupns root will be exposed
to information outside of the container (i.e., its /proc/self/cgroup
will show paths involving other containers and potentially system
level information). So unless you even restore them, it will be
difficult to restore these processes. The whole point of virtualizing
the /proc/self/cgroup view was so that the processes don't see outside
cgroups.

>> * added code complexity to handle above scenarios
>>
>> I understand that having process pinned to a cgroup hierarchy might
>> seem inconvenient. But even today (without cgroup namespaces), moving
>> a task from one cgroup to another can fail for reasons outside of
>> control of the task attempting the move (even if its

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-21 Thread Aditya Kali
On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityak...@google.com> wrote:
>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <l...@amacapital.net>
>> wrote:
>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityak...@google.com> wrote:
>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <l...@amacapital.net>
>>>> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>> <ebied...@xmission.com> wrote:
>>>>>>
>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>> implementation.
>>>>>
>>>>> Could be.  I'll defer to Aditya for that one.
>>>>>
>>>>
>>>> More than chcgrouproot, it's probably closer to pivot_cgroup_root. In
>>>> addition to restricting the process to a cgroup-root, new processes
>>>> entering the container should also be implicitly contained within the
>>>> cgroup-root of that container.
>>>
>>> Why?  Concretely, why should this be in the kernel namespace code
>>> instead of in userspace?
>>>
>>
>> Userspace can do it too, though then there will be the possibility of
>> having processes in the same mount namespace with different
>> cgroup-roots. Deriving the contents of /proc/<pid>/cgroup becomes even
>> more complex. That's another reason why it might not be a good idea to
>> tie cgroups to the mount namespace.
>>
>>>> Implementing pivot_cgroup_root would
>>>> probably involve overloading mount-namespace to now understand cgroup
>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>> earlier (not via a new syscall though), but came to the conclusion
>>>> that it's just simpler to have a separate cgroup namespace and get
>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>
>>>> About pinning: I really feel that it should be OK to pin processes
>>>> within cgroupns-root. I think that's one of the most important features
>>>> of cgroup-namespace, since its most common use case is to containerize
>>>> untrusted processes - processes that, for their entire lifetime, need
>>>> to remain inside their container.
>>>
>>> So don't let them out.  None of the other namespaces have this kind of
>>> constraint:
>>>
>>>  - If you're in a mntns, you can still use fds from outside.
>>>  - If you're in a netns, you can still use sockets from outside the
>>>    namespace.
>>>  - If you're in an ipcns, you can still use ipc handles from outside.
>>>
>>
>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>> handles in the outside namespace. I think moving a process outside of
>> cgroupns-root is like allocating a resource outside of your namespace.
>
> In a pidns, you can see outside tasks if you have an outside procfs
> mounted, but, if you don't, then you can't.  Wouldn't cgroupns be just
> like that?  You wouldn't be able to escape your cgroup as long as you
> don't have an inappropriate cgroupfs mounted.
>

I am not sure if we should only depend on restricted visibility for this,
though. More details below.

>>>>
>>>> And with explicit permission from
>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>> suggested previously), we can make sure that unprivileged processes
>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>> semantics simple.
>>>
>>> I actually think it makes the semantics more complex.  The less policy
>>> you stick in the kernel, the easier it is to understand the impact of
>>> that policy.
>>>
>>
>> My inclination is towards keeping things simpler - both in code as
>> well as in configuration. I agree that cgroupns might seem
>> less flexible, but in its current form, it encourages consistent
>> container configuration. If you have a process that needs to move
>> around between cgroups belonging to different containers, then that
>> process should probably not be inside any container's cgroup
>> namespace. Allowing that will just make the cgroup namespace
>> pretty much meaningless.
>
> The problem with pinning is that preventing it causes problems
> (specifically, either something potentially complex and incompatible
> needs to be added or unprivileged processes will be able to pin
> themselves).
>
> Unless I'm missing something, a normal cgroupns user doesn't actually
> need kernel pinning support to effectively constrain its members'
> cgroups.
>

So there are 2 scenarios to consider:

We have 2 containers with cgroups: /container1 and /container2
Assume process P is running under cgroupns-root '/container1'

(1) process P wants to 'write' to cgroup.procs outside its
cgroupns-root (say to /container2/cgroup.procs)
(2) An admin process running in init_cgroup_ns (or any parent cgroupns
with cgroupns-root above /container1) wants to write pid of process P
to /container2/cgroup.procs (which lies outside of P's cgroupns-root)

For (1), I think it's OK to reject such a write. This is consistent
with the restriction in cgroup_file_write added in 'Patch 6' of this
set. I believe this should be independent of visibility of the cgroup
hierarchy for P.
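
As a rough illustration of rule (1), the check amounts to asking whether the target cgroup lies at or below the writer's cgroupns-root. The sketch below models that on path strings in userspace C. path_is_descendant() and cgroupns_may_write() are hypothetical names used only here; the actual restriction lives in cgroup_file_write (Patch 6), which is not reproduced in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * True if `path` is `ancestor` itself or lies below it. Both are absolute
 * cgroup paths such as "/container1/sub"; "/" is the top of the hierarchy.
 */
static bool path_is_descendant(const char *path, const char *ancestor)
{
    size_t n;

    assert(path && ancestor);
    n = strlen(ancestor);
    /* Drop trailing '/' so "/" compares as the empty prefix. */
    while (n > 0 && ancestor[n - 1] == '/')
        n--;
    if (strncmp(path, ancestor, n) != 0)
        return false;
    /* Require a component boundary: "/container10" is not under "/container1". */
    return path[n] == '\0' || path[n] == '/';
}

/*
 * Hypothetical model of scenario (1): a task whose cgroupns is rooted at
 * `writer_ns_root` may write to cgroup files only at or below that root.
 */
bool cgroupns_may_write(const char *writer_ns_root, const char *target_cgrp)
{
    return path_is_descendant(target_cgrp, writer_ns_root);
}
```

In this model P (rooted at /container1) is refused when it targets /container2/cgroup.procs regardless of what it can see, which is the point of keeping the rule independent of visibility.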

For (2), we may allow the write to succeed if we make sure that the
process

Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

2014-10-16 Thread Aditya Kali
On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> Quoting Aditya Kali (adityak...@google.com):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>>   over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>>   cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition) but my cgroup is not a descendent of the
> container's cgroup.
>
This condition is there because we don't want to do implicit cgroup
changes when a process attaches to another cgroupns. cgroupns tries to
preserve the invariant that at any point, your current cgroup is
always under the cgroupns-root of your cgroup namespace. But in your
example, if we allow a process in the "session-c12.scope" container to
attach to a cgroupns rooted at the "session-c12.scope/x1" container
(without implicitly moving its cgroup), then this invariant won't
hold.
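
The two cgroup conditions from the patch description can be sketched in userspace C on path strings. This only models the descendant checks; the CAP_SYS_ADMIN checks, the default-hierarchy check and the locking are left out. cgroupns_setns_check() and path_is_descendant() are illustrative names, not kernel functions, and the shortened paths in the note below stand in for the longer ones in the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* True if `path` is `ancestor` itself or lies below it (absolute paths). */
static bool path_is_descendant(const char *path, const char *ancestor)
{
    size_t n;

    assert(path && ancestor);
    n = strlen(ancestor);
    /* Drop trailing '/' so "/" compares as the empty prefix. */
    while (n > 0 && ancestor[n - 1] == '/')
        n--;
    if (strncmp(path, ancestor, n) != 0)
        return false;
    return path[n] == '\0' || path[n] == '/';
}

/*
 * Model of the two path conditions for setns() on a cgroupns:
 *  - the task's current cgroup must be at or below the target root, so the
 *    "current cgroup is under cgroupns-root" invariant holds afterwards
 *    without moving the task;
 *  - the target root must be at or below the task's current root, so the
 *    task cannot widen its view by attaching to a parent namespace.
 * Returns 0 if both hold, -EINVAL otherwise.
 */
int cgroupns_setns_check(const char *task_cgrp, const char *task_ns_root,
                         const char *target_ns_root)
{
    if (!path_is_descendant(task_cgrp, target_ns_root))
        return -EINVAL;
    if (!path_is_descendant(target_ns_root, task_ns_root))
        return -EINVAL;
    return 0;
}
```

With a task still in /lxc/c1/session-c12.scope asking to join a namespace rooted at /lxc/c1/session-c12.scope/x1, the first check fails and the call returns -EINVAL; once the task has been moved into x1 by whoever is entering the container, the same call returns 0.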

>
>> * target cgroupns-root is same as or deeper than task's current
>>   cgroupns-root. This is so that the task cannot escape out of its
>>   cgroupns-root. This also ensures that setns() only makes the task
>>   get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <adityak...@google.com>
>> ---
>>  kernel/cgroup_namespace.c | 44 ++--
>>  1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>>  static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>>  {
>> -	pr_info("setns not supported for cgroup namespace");
>> -	return -EINVAL;
>> +	struct cgroup_namespace *cgroup_ns = ns;
>> +	struct task_struct *task = current;
>> +	struct cgroup *cgrp = NULL;
>> +	int err = 0;
>> +
>> +	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> +	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	/* Prevent cgroup changes for this task. */
>> +	threadgroup_lock(task);
>> +
>> +	cgrp = get_task_cgroup(task);
>> +
>> +	err = -EINVAL;
>> +	if (!cgroup_on_dfl(cgrp))
>> +		goto out_unlock;
>> +
>> +	/* Allow switch only if the task's current cgroup is descendant of the
>> +	 * target cgroup_ns->root_cgrp.
>> +	 */
>> +	if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> +		goto out_unlock;
>> +
>> +	/* Only allow setns to a cgroupns root-ed deeper than task's current
>> +	 * cgroupns-root. This will make sure that tasks cannot escape their
>> +	 * cgroupns by attaching to parent cgroupns.
>> +	 */
>> +	if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> +				  task_cgroupns_root(task)))
>> +		goto out_unlock;
>> +
>> +	err = 0;
>> +	get_cgroup_ns(cgroup_ns);
>> +	put_cgroup_ns(nsproxy->cgroup_ns);
>> +	nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> +	threadgroup_unlock(current);
>> +	if (cgrp)
>> +		cgroup_put(cgrp);
>> +	return err;
>>  }
>>
>>  static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Aditya


Re: [PATCHv1 0/8] CGroup Namespaces

2014-10-14 Thread Aditya Kali
On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityak...@google.com> wrote:
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>>    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>>    anything if the <pid> is in a sibling cgroupns and its cgroup falls
>>    outside your cgroupns-root.
>>
>> More details in the writeup below.
>>
>> Background
>>   Cgroups and Namespaces are used together to create “virtual”
>>   containers that isolate the host environment from the processes
>>   running in the container. But since cgroups themselves are not
>>   “virtualized”, the task is always able to see the global cgroups view
>>   through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   This exposure of cgroup names to the processes running inside a
>>   container results in some problems:
>>   (1) The container names are typically host-container-management-agent
>>   (systemd, docker/libcontainer, etc.) data and leaking its name (or
>>   leaking the hierarchy) reveals too much information about the host
>>   system.
>>   (2) It makes the container migration across machines (CRIU) more
>>   difficult as the container names need to be unique across the
>>   machines in the migration domain.
>>   (3) It makes it difficult to run container management tools (like
>>   docker/libcontainer, lmctfy, etc.) within virtual containers
>>   without adding dependency on some state/agent present outside the
>>   container.
>>
>>   Note that the feature proposed here is completely different than the
>>   “ns cgroup” feature which existed in the linux kernel until recently.
>>   The ns cgroup also attempted to connect cgroups and namespaces by
>>   creating a new cgroup every time a new namespace was created. It did
>>   not solve any of the above mentioned problems and was later dropped
>>   from the kernel. Incidentally though, it used the same config option
>>   name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>>   With unified cgroup hierarchy
>>   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>>   have a much more coherent cgroup view and it's easy to associate a
>>   container with a single cgroup. This also allows us to virtualize the
>>   cgroup view for tasks inside the container.
>>
>>   The new CGroup Namespace allows a process to “unshare” its cgroup
>>   hierarchy starting from the cgroup it's currently in.
>>   For example:
>>   $ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>   $ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup ->
>>   cgroup:[4026531835]
>>   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>>   [ns]$ ls -l /proc/self/ns/cgroup
>>   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>>   cgroup:[4026532183]
>>   # From within new cgroupns, process sees that its in the root cgroup
>>   [ns]$ cat /proc/self/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>>   # From global cgroupns:
>>   $ cat /proc/<pid>/cgroup
>>   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>>   # Unshare cgroupns along with userns and mountns
>>   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>>   # sets up uid/gid map and exec’s /bin/bash
>>   $ ~/unshare -c -u -m
>>
>>   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>>   # hierarchy.
>>   [ns]$ mount -t cgroup cgroup /tmp/cgroup
>>   [ns]$ ls -l /tmp/cgroup
>>   total 0
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>>   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>>   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>>   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>>   filesystem root for the namespace specific cgroupfs mount.
>>
>>   The virtu

Re: [PATCHv1 0/8] CGroup Namespaces

2014-10-14 Thread Aditya Kali
On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <l...@amacapital.net> wrote:
 On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <adityak...@google.com> wrote:
 Second take at the Cgroup Namespace patch-set.

 Major changes from RFC (V0):
 1. setns support for cgroupns
 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
    mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
 3. writes to cgroup files outside of cgroupns-root are not allowed
 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
    anything if the <pid> is in a sibling cgroupns and its cgroup falls
    outside your cgroupns-root.

 More details in the writeup below.

 Background
   Cgroups and Namespaces are used together to create “virtual”
   containers that isolates the host environment from the processes
   running in container. But since cgroups themselves are not
   “virtualized”, the task is always able to see global cgroups view
   through cgroupfs mount and via /proc/self/cgroup file.

   $ cat /proc/self/cgroup
   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

   This exposure of cgroup names to the processes running inside a
   container results in some problems:
   (1) The container names are typically host-container-management-agent
   (systemd, docker/libcontainer, etc.) data and leaking its name (or
   leaking the hierarchy) reveals too much information about the host
   system.
   (2) It makes the container migration across machines (CRIU) more
   difficult as the container names need to be unique across the
   machines in the migration domain.
   (3) It makes it difficult to run container management tools (like
   docker/libcontainer, lmctfy, etc.) within virtual containers
   without adding dependency on some state/agent present outside the
   container.

   Note that the feature proposed here is completely different than the
   “ns cgroup” feature which existed in the linux kernel until recently.
   The ns cgroup also attempted to connect cgroups and namespaces by
   creating a new cgroup every time a new namespace was created. It did
   not solve any of the above mentioned problems and was later dropped
   from the kernel. Incidentally though, it used the same config option
   name CONFIG_CGROUP_NS as used in my prototype!

 Introducing CGroup Namespaces
   With unified cgroup hierarchy
   (Documentation/cgroups/unified-hierarchy.txt), the containers can now
   have a much more coherent cgroup view and its easy to associate a
   container with a single cgroup. This also allows us to virtualize the
   cgroup view for tasks inside the container.

   The new CGroup Namespace allows a process to “unshare” its cgroup
   hierarchy starting from the cgroup its currently in.
   For Ex:
   $ cat /proc/self/cgroup
   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
   $ ls -l /proc/self/ns/cgroup
   lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup - 
 cgroup:[4026531835]
   $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
   [ns]$ ls -l /proc/self/ns/cgroup
   lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -
   cgroup:[4026532183]
   # From within new cgroupns, process sees that its in the root cgroup
   [ns]$ cat /proc/self/cgroup
   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

   # From global cgroupns:
   $ cat /proc/pid/cgroup
   0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

   # Unshare cgroupns along with userns and mountns
   # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
   # sets up uid/gid map and exec’s /bin/bash
   $ ~/unshare -c -u -m

   # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
   # hierarchy.
   [ns]$ mount -t cgroup cgroup /tmp/cgroup
   [ns]$ ls -l /tmp/cgroup
   total 0
   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
   -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
   -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
   -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

   The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
   filesystem root for the namespace specific cgroupfs mount.

   The virtualization of /proc/self/cgroup file combined with restricting
   the view of cgroup hierarchy by namespace-private cgroupfs mount
   should provide a completely isolated cgroup view inside the container.

   In its current form, the cgroup namespaces patcheset provides following
   behavior:

   (1) The “root” cgroup for a cgroup namespace is the cgroup in which
   the process calling unshare is running.
   For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
   cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
   For the init_cgroup_ns, this is the real root (“/”) cgroup
   (identified in code as cgrp_dfl_root.cgrp).

   (2) The cgroupns-root

[PATCHv1 0/8] CGroup Namespaces

2014-10-13 Thread Aditya Kali
Second take at the Cgroup Namespace patch-set.

Major changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.

More details in the writeup below.

Background
  Cgroups and namespaces are used together to create “virtual”
  containers that isolate the host environment from the processes
  running in the container. But since cgroups themselves are not
  “virtualized”, a task can always see the global cgroup view
  through the cgroupfs mount and via the /proc/self/cgroup file.

  $ cat /proc/self/cgroup 
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  This exposure of cgroup names to the processes running inside a
  container results in some problems:
  (1) The container names are typically host-container-management-agent
  (systemd, docker/libcontainer, etc.) data, and leaking these names (or
  the hierarchy) reveals too much information about the host
  system.
  (2) It makes container migration across machines (CRIU) more
  difficult, as the container names need to be unique across the
  machines in the migration domain.
  (3) It makes it difficult to run container management tools (like
  docker/libcontainer, lmctfy, etc.) within virtual containers
  without adding a dependency on some state/agent present outside the
  container.

  Note that the feature proposed here is completely different from the
  “ns cgroup” feature which existed in the Linux kernel until recently.
  The ns cgroup also attempted to connect cgroups and namespaces by
  creating a new cgroup every time a new namespace was created. It did
  not solve any of the above-mentioned problems and was later dropped
  from the kernel. Incidentally though, it used the same config option
  name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
  With the unified cgroup hierarchy
  (Documentation/cgroups/unified-hierarchy.txt), containers can now
  have a much more coherent cgroup view and it is easy to associate a
  container with a single cgroup. This also allows us to virtualize the
  cgroup view for tasks inside the container.

  The new CGroup Namespace allows a process to “unshare” its cgroup
  hierarchy starting from the cgroup it is currently in.
  For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup ->
  cgroup:[4026531835]
  $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
  cgroup:[4026532183]
  # From within new cgroupns, process sees that it is in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and exec’s /bin/bash
  $ ~/unshare -c -u -m

  # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

  The cgroupns-root (/batchjobs/c_job_id1 in the above example) becomes the
  filesystem root for the namespace-specific cgroupfs mount.

  The virtualization of the /proc/self/cgroup file, combined with restricting
  the view of the cgroup hierarchy through a namespace-private cgroupfs mount,
  should provide a completely isolated cgroup view inside the container.

  In its current form, the cgroup namespaces patchset provides the following
  behavior:

  (1) The “root” cgroup for a cgroup namespace is the cgroup in which
  the process calling unshare is running.
  For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
  cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
  For the init_cgroup_ns, this is the real root (“/”) cgroup
  (identified in code as cgrp_dfl_root.cgrp).

  (2) The cgroupns-root cgroup does not change even if the namespace
  creator process later moves to a different cgroup.
  $ ~/unshare -c # unshare cgroupns in some cgroup
  [ns]$ cat /proc/self/cgroup 
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/ 
 

[PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

2014-10-13 Thread Aditya Kali
Restrict the following operations to the calling task's cgroupns-root:
* cgroup_mkdir & cgroup_rmdir
* cgroup_attach_task
* writes to cgroup files (rejected outside of the task's cgroupns-root)

Also, reading the /proc/<pid>/cgroup file is now restricted to
tasks under the same cgroupns-root. If a task tries to look
at the cgroup of another task outside of its cgroupns-root, then
it won't be able to see anything for the default hierarchy.
This is the same as if the cgroups are not mounted.
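
As a rough userspace illustration of the descendant check described above (not the kernel code itself), the sketch below models a cgroup as a node with a parent link. The names toy_cgroup, toy_is_descendant and toy_may_modify are hypothetical stand-ins introduced here; only the idea of "walk up the parents until the cgroupns-root is found, otherwise refuse" comes from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct cgroup: only the parent link. */
struct toy_cgroup {
    const char *name;
    struct toy_cgroup *parent;   /* NULL for the hierarchy root */
};

/* True if cgrp is the ancestor itself or lies somewhere below it. */
static bool toy_is_descendant(const struct toy_cgroup *cgrp,
                              const struct toy_cgroup *ancestor)
{
    assert(ancestor != NULL);
    while (cgrp) {
        if (cgrp == ancestor)
            return true;
        cgrp = cgrp->parent;     /* walk one level up the hierarchy */
    }
    return false;
}

/*
 * Same shape as the mkdir/rmdir/attach checks in the patch:
 * 0 if the target is inside the namespace root, -EPERM otherwise.
 */
static int toy_may_modify(const struct toy_cgroup *target,
                          const struct toy_cgroup *ns_root)
{
    return toy_is_descendant(target, ns_root) ? 0 : -EPERM;
}
```

With a hierarchy /batchjobs/c_job_id1 and a namespace rooted at /batchjobs, toy_may_modify() accepts c_job_id1 and /batchjobs itself, and refuses a sibling of /batchjobs or the real root.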

Signed-off-by: Aditya Kali 
---
 kernel/cgroup.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f8099b4..2fc0dfa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
struct task_struct *task;
int ret;
 
+   /* Only allow changing cgroups accessible within task's cgroup
+* namespace. i.e. 'dst_cgrp' should be a descendant of task's
+* cgroupns->root_cgrp. */
+   if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+   return -EPERM;
+
/* look up all src csets */
down_read(&css_set_rwsem);
rcu_read_lock();
@@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
struct cgroup_subsys_state *css;
int ret;
 
+   /* Reject writes to cgroup files outside of task's cgroupns-root. */
+   if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+   return -EINVAL;
+
if (cft->write)
return cft->write(of, buf, nbytes, off);
 
@@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
parent = cgroup_kn_lock_live(parent_kn);
if (!parent)
return -ENODEV;
+
+   /* Allow mkdir only within process's cgroup namespace root. */
+   if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+
root = parent->root;
 
/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
if (!cgrp)
return 0;
 
+   /* Allow rmdir only within process's cgroup namespace root.
+* The process can't delete its own root anyways. */
+   if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+   cgroup_kn_unlock(kn);
+   return -EPERM;
+   }
+
ret = cgroup_destroy_locked(cgrp);
 
cgroup_kn_unlock(kn);
@@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
continue;
 
+   cgrp = task_cgroup_from_root(tsk, root);
+
+   /* The cgroup path on default hierarchy is shown only if it
+* falls under current task's cgroupns-root.
+*/
+   if (root == &cgrp_dfl_root &&
+   !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+   continue;
+
seq_printf(m, "%d:", root->hierarchy_id);
for_each_subsys(ss, ssid)
if (root->subsys_mask & (1 << ssid))
@@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
seq_printf(m, "%sname=%s", count ? "," : "",
   root->name);
seq_putc(m, ':');
-   cgrp = task_cgroup_from_root(tsk, root);
path = cgroup_path(cgrp, buf, PATH_MAX);
if (!path) {
retval = -ENAMETOOLONG;
-- 
2.1.0.rc2.206.gedb03e5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()

2014-10-13 Thread Aditya Kali
Move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali 
---
 include/linux/cgroup.h | 22 ++
 kernel/cgroup.c| 22 --
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
 }
 
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+   return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+   WARN_ON_ONCE(cgroup_is_dead(cgrp));
+   css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+   return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+   css_put(&cgrp->self);
+}
+
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_has_tasks(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56d507b..2b3e9f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
return cgroup_css(cgrp, ss);
 }
 
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
-   return !(cgrp->self.flags & CSS_ONLINE);
-}
-
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-   WARN_ON_ONCE(cgroup_is_dead(cgrp));
-   css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-   return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-   css_put(&cgrp->self);
-}
-
 /**
  * cgroup_refresh_child_subsys_mask - update child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.1.0.rc2.206.gedb03e5



[PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

2014-10-13 Thread Aditya Kali
CLONE_NEWCGROUP will be used to create a new cgroup namespace.
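
For readers who want to try the flag from userspace, here is a minimal sketch, assuming a kernel and libc where unshare(2) accepts CLONE_NEWCGROUP (mainline Linux defines it as 0x02000000; the fallback #define covers older headers). The helper name try_unshare_cgroupns is hypothetical and not part of this patch series.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* needed for unshare() in <sched.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Fallback for older libc headers; this is the mainline value. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Try to move the calling process into a new cgroup namespace.
 * Returns 0 on success or a negative errno, for example -EPERM when
 * the caller lacks CAP_SYS_ADMIN, or -EINVAL on kernels without
 * cgroup namespace support.
 */
static int try_unshare_cgroupns(void)
{
    if (unshare(CLONE_NEWCGROUP) == -1)
        return -errno;
    return 0;
}
```

On success, /proc/self/cgroup read by the caller shows paths relative to the cgroup it was in at the time of the call.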

Signed-off-by: Aditya Kali 
---
 include/uapi/linux/sched.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED         0x00400000  /* Unused, ignored */
 #define CLONE_UNTRACED         0x00800000  /* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID     0x01000000  /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP        0x02000000  /* New cgroup namespace */
 #define CLONE_NEWUTS           0x04000000  /* New utsname group? */
 #define CLONE_NEWIPC           0x08000000  /* New ipcs */
 #define CLONE_NEWUSER          0x10000000  /* New user namespace */
-- 
2.1.0.rc2.206.gedb03e5



[PATCHv1 5/8] cgroup: introduce cgroup namespaces

2014-10-13 Thread Aditya Kali
Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
of creation of the cgroup namespace. The task that creates the new
cgroup namespace and all its future children will now be restricted only
to the cgroup hierarchy under this root_cgrp.
The main purpose of the cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root.
This allows container tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking the
system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali 
---
 fs/proc/namespaces.c |   3 +
 include/linux/cgroup.h   |  18 +-
 include/linux/cgroup_namespace.h |  62 +++
 include/linux/nsproxy.h  |   2 +
 include/linux/proc_ns.h  |   4 ++
 init/Kconfig |   9 +++
 kernel/Makefile  |   1 +
 kernel/cgroup.c  |  11 
 kernel/cgroup_namespace.c| 128 +++
 kernel/fork.c|   2 +-
 kernel/nsproxy.c |  19 +-
 11 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
 #endif
&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+   &cgroupns_operations,
+#endif
 };
 
 static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -460,6 +462,13 @@ struct cftype {
 #endif
 };
 
+struct cgroup_namespace {
+   atomic_t                count;
+   unsigned int            proc_inum;
+   struct user_namespace   *user_ns;
+   struct cgroup           *root_cgrp;
+};
+
 extern struct cgroup_root cgrp_dfl_root;
 extern struct css_set init_css_set;
 
@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
 }
 
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+struct cgroup *cgrp, char *buf,
+size_t buflen)
+{
+   return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
 static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  size_t buflen)
 {
-   return kernfs_path(cgrp->kn, buf, buflen);
+   return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
 }
 
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include 
+#include 
+#include 
+#include 
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+   return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   if (ns)
+   atomic_inc(&ns->count);
+   return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+   if (ns && atomic_dec_and_test(&ns->count))
+   free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+  struct user_namespace *user_ns,
+  struct cgroup_namespace *old_ns);
+
+#else  /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+   struct cgroup_namespace *ns)
+{
+   return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+   unsigned long flags,
+   struct user_namespace *user_ns,
+   struct cgroup_namespace *old_ns) {
+   if (flags & CLONE_NEWCGROUP)
+   return ERR_PTR(-EINVAL);
+
+   return old_ns;
+}
+
+#endif  /* CONFIG_CGROUP_NS */
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h 

[PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns

2014-10-13 Thread Aditya Kali
This patch enables cgroup mounting inside a userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs API is added to look up the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali 
---
 fs/kernfs/mount.c  | 48 
 include/linux/kernfs.h |  2 ++
 kernel/cgroup.c| 47 +--
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
 }
 
+/**
+ * kernfs_obtain_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+
+   BUG_ON(sb->s_op != &kernfs_sops);
+
+   /* inode for the given kernfs_node should already exist. */
+   inode = ilookup(sb, kn->ino);
+   if (!inode) {
+   pr_debug("kernfs: could not get inode for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-EINVAL);
+   }
+
+   /* instantiate and link root dentry */
+   dentry = d_obtain_root(inode);
+   if (!dentry) {
+   pr_debug("kernfs: could not get dentry for '");
+   pr_cont_kernfs_path(kn);
+   pr_cont("'.\n");
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* If this is a new dentry, set it up. We need kernfs_mutex because this
+* may be called by callers other than kernfs_fill_super. */
+   mutex_lock(&kernfs_mutex);
+   if (!dentry->d_fsdata) {
+   kernfs_get(kn);
+   dentry->d_fsdata = kn;
+   } else {
+   WARN_ON(dentry->d_fsdata != kn);
+   }
+   mutex_unlock(&kernfs_mutex);
+
+   return dentry;
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
   unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2fc0dfa..ef27dc4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
memset(opts, 0, sizeof(*opts));
 
+   /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+* namespace.
+*/
+   if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+   opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+   }
+
while ((token = strsep(&data, ",")) != NULL) {
nr_opts++;
 
@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
-   if (nr_opts != 1) {
+   if (nr_opts > 1) {
pr_err("sane_behavior: no other mount options allowed\n");
return -EINVAL;
}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
+struct dentry *cgroupns_get_root(struct super_block *sb,
+struct cgroup_namespace *ns)
+{
+   struct dentry *nsdentry;
+
+   nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+   return nsdentry;
+}
+
 static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
 {
