[Devel] Re: [PATCH 20/38] netns ct: NOTRACK in netns

2008-08-22 Thread adobriyan
On Thu, Aug 21, 2008 at 07:06:37PM -0400, Jan Engelhardt wrote:
 On Thursday 2008-08-21 18:04, [EMAIL PROTECTED] wrote:
 
 Make untracked conntrack per-netns.
 
 Why? It does not store any useful information per se, it is
 merely used to add a third type of ct, iow:
 
 (a) ct==NULL
 (b) ct!=NULL
 (c) ct==untracked
 
 mmap(2)'s return value for example has something similar:
 
 (a) mmap(...)==NULL
 (b) mmap(...)==MMAP_FAILED
 (c) otherwise
 
 The untracked ct is a singleton, and should stay one, unless
 there are further reasons not to do so.

We wait for untracked ct refcount to drop to 1 back:

/* wait until all references to nf_conntrack_untracked are dropped */
while (atomic_read(nf_conntrack_untracked.ct_general.use)  1)
schedule();

Consequently it should be one per netns, otherwise netns A can prevent
netns B from stopping.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: SNMP statistics in namespaces

2008-08-22 Thread Dave Hansen
On Fri, 2008-08-22 at 13:57 +0200, Eelco Chaudron wrote:
 I was looking at the 2.6.26 nw namespace code, and noticed that the
 SNMP counters are still system wide. Looking at the status page there
 is not separate item on SNMP. Was this overlooked, skipped on purpose,
 or is it in the pipeline?
 
 Hope some one can answer this...

I'm an idiot.  I know that SNMP has something to do with network
statistics, and I'm too lazy to google it.

Could you give us a quick idea why SNMP is important here?  You're
welcome to add it to the list, but it's unlikely to get implemented
until someone thinks it is important.  That said, I'd be happy to help
you put together some patches to implement it.

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: SNMP statistics in namespaces

2008-08-22 Thread Balbir Singh
Dave Hansen wrote:
 On Fri, 2008-08-22 at 13:57 +0200, Eelco Chaudron wrote:
 I was looking at the 2.6.26 nw namespace code, and noticed that the
 SNMP counters are still system wide. Looking at the status page there
 is not separate item on SNMP. Was this overlooked, skipped on purpose,
 or is it in the pipeline?

 Hope some one can answer this...
 
 I'm an idiot.  I know that SNMP has something to do with network
 statistics, and I'm too lazy to google it.
 
 Could you give us a quick idea why SNMP is important here?  You're
 welcome to add it to the list, but it's unlikely to get implemented
 until someone thinks it is important.  That said, I'd be happy to help
 you put together some patches to implement it.

SNMP is typically the way that most network administration (statistics and
control) is done. I have not looked at the network namespace implementation, but
I am quite confident that it would be a useful thing to do, if not done already.

-- 
Balbir
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: Attaching threads to cgroups is OK?

2008-08-22 Thread Vivek Goyal
On Thu, Aug 21, 2008 at 02:25:06PM +0900, Fernando Luis Vázquez Cao wrote:
 Hi Balbir,
 
 On Thu, 2008-08-21 at 09:02 +0530, Balbir Singh wrote:
  Fernando Luis Vázquez Cao wrote:
   On Wed, 2008-08-20 at 20:48 +0900, Hirokazu Takahashi wrote:
   Hi,
  
   Tsuruta-san, how about your bio-cgroup's tracking concerning this?
   If we want to use your tracking functions for each threads seperately, 
   there seems to be a problem.
   ===cf. mm_get_bio_cgroup()===
  owner
   mm_struct  task_struct  bio_cgroup
   =
   In my understanding, the mm_struct of a thread is same as its parent's.
   So, even if we attach the TIDs of some threads to different cgroups 
   the 
   tracking always returns the same bio_cgroup -- its parent's group.
   Do you have some policy about in which case we can use your tracking?
  
   It's will be resitriction when io-controller reuse information of the 
   owner
   of memory. But if it's very clear who issues I/O (by tracking read/write
   syscall), we may have chance to record the issuer of I/O to page_cgroup
   struct. 
   This might be slightly different topic though,
   I've been thinking where we should add hooks to track I/O reqeust.
   I think the following set of hooks is enough whether we are going to
   support thread based cgroup or not.
  
 Hook-1: called when allocating a page, where the memory controller
  already have a hoook.
 Hook-2: called when making a page in page-cache dirty.
  
   For anonymous pages, Hook-1 is enough to track any type of I/O request.
   For pages in page-cache, Hook-1 is also enough for read I/O because
   the I/O is issued just once right after allocting the page.
   For write I/O requests to pages in page-cache, Hook-1 will be okay
   in most cases but sometimes process in another cgroup may write
   the pages. In this case, Hook-2 is needed to keep accurate to track
   I/O requests.
   
   This relative simplicity is what prompted me to say that we probably
   should try to disentangle the io tracking functionality from the memory
   controller a bit more (of course we still should reuse as much as we can
   from it). The rationale for this is that the existing I/O scheduler
   would benefit from proper io tracking capabilities too, so it'd be nice
   if we could have them even in non-cgroup-capable kernels.
   
  
  Hook 2 referred to in the mail above exist today in the form of task IO 
  accounting.
 Yup.
 
   As an aside, when the IO context of a certain IO operation is known
   (synchronous IO comes to mind) I think it should be cashed in the
   resulting bio so that we can do without the expensive accesses to
   bio_cgroup once it enters the block layer.
  
  Will this give you everything you need for accounting and control (from the
  block layer?)
 
 Well, it depends on what you are trying to achieve.
 
 Current IO schedulers such as CFQ only care about the io_context when
 scheduling requests. When a new request comes in CFQ assumes that it was
 originated in the context of the current task, which obviously does not
 hold true for buffered IO and aio. This problem could be solved by using
 bio-cgroup for IO tracking, but accessing the io context information is
 somewhat expensive: 
 
 page-page_cgroup-bio_cgroup-io_context.
 
 If at the time of building a bio we know its io context (i.e. the
 context of the task or cgroup that generated that bio) I think we should
 store it in the bio itself, too. With this scheme, whenever the kernel
 needs to know the io_context of a particular block IO operation the
 kernel would first try to retrieve its io_context directly from the bio,
 and, if not available there, would resort to the slow path (accessing it
 through bio_cgroup). My gut feeling is that elevator-based IO resource
 controllers would benefit from such an approach, too.
 

Hi Fernando,

Had a question.

IIUC, at the time of submtting the bio, io_context will be known only for 
synchronous request. For asynchronous request it will not be known
(ex. writing the dirty pages back to disk) and one shall have to take
the longer path (bio-cgroup thing) to ascertain the io_context associated
with a request.

If that's the case, than it looks like we shall have to always traverse the
longer path in case of asynchronous IO. By putting the io_context pointer
in bio, we will just shift the time of pointer traversal. (From CFQ to higher
layers).

So probably it is not worth while to put io_context pointer in bio? Am I
missing something?

Thanks
Vivek
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [0/10] User namespaces: introduction

2008-08-22 Thread Serge E. Hallyn
Hi Eric,

so here is a start to a userns patchset trying to follow your ideas
about how to have user namespaces and filesystems interact.  Ignore
the bookkeeping crap or you'll pull your hair out.  Lots of stuff
remains unimplemented - i.e. chown (setattr) and proper handling of
capabilities.  But you can do some fun things with this patchset.
I.e.

(log in as root)
setcap cap_sys_admin=ep ns_exec
setcap cap_sys_admin=ep usernsmount
ns_exec -U /bin/sh
ls /root (fails)
ls / (succeeds)
(log in as hallyn)
ns_exec -U /bin/sh
id
(uid=0, gid=0)
ls (fails, can't descend /home/hallyn)
usernsmount / nsid=4
ls (succeeds)
touch ab
ls -l ab
(ab is owned by root)
exit
(we're logged in as hallyn in the init_user_ns again)
ls -l ab
(ab is owned by hallyn)

The only supported fs is ext3.  Only a few operations are supported.
So if, above, when we are hallyn in the init_user_ns but root in
the child user ns,
when we create a file, it is properly handled, so
inode-i_uid=500, but an xattr (nsid=4,uid=0) is added
when we chown the file to root, it is not properly handled,
so inode-i_uid = 0
it's just a matter of hooking all the places at this point.

Capabilities remain a problem.  Right now I think capabilities will
need to be split up into system-wide caps, and container-safe caps.
So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
CAP_REBOOT may become container-safe one day, but for now is very
much system-wide.

So if I'm uid 500 on the host and create a user namespace where I'm
uid=0, I should be able to acquire container-safe caps (perhaps
contingent on whether I unshared all other namespaces), but not
system-wide ones.  Or, whether I can acquire them would depend
on whether the suid bit was set in a user_ns or not.  sigh.

thanks,
-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 03/10] user namespaces: reset task's credentials on CLONE_NEWUSER

2008-08-22 Thread Serge E. Hallyn
Currently, creating a new user namespace does not reset
the task's uid or gid.  Since generally that is done as
root because it requires CAP_SYS_ADMIN, and since the
first uid in the new namespace is 0, one usually doesn't
notice.  However, if one does

capset cap_sys_admin=ep ns_exec
su - hallyn
  ns_exec -U /bin/sh
  id

then one will see hallyn's userid, and all preexisting
groups.

With this patch, cloning a new user namespace will set
the task's uid and gid to 0, and reset the group_info to
the empty set assigned to init.

Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED]
---
 kernel/user_namespace.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0045dd0..39aea7b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,6 +11,9 @@
 #include linux/slab.h
 #include linux/user_namespace.h
 
+/* defined in kernel/sys.c */
+extern struct group_info init_groups;
+
 /*
  * Clone a new ns copying an original user ns, setting refcount to 1
  * @old_ns: namespace to clone
@@ -48,6 +51,17 @@ int create_new_userns(int flags, struct task_struct *tsk)
put_user_ns(ns);
 
task_switch_uid(tsk, ns-root_user);
+   tsk-uid = tsk-euid = tsk-suid = tsk-fsuid = 0;
+   tsk-gid = tsk-egid = tsk-sgid = tsk-fsgid = 0;
+
+   /* this can't be safe for unshare, can it?  it's safe
+* for fork, though.  I'm tempted to limit clone_newuser to
+* fork only */
+   task_lock(tsk);
+   put_group_info(tsk-group_info);
+   tsk-group_info = init_groups;
+   get_group_info(tsk-group_info);
+   task_unlock(tsk);
 
return 0;
 }
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 01/10] user namespaces: introduce user_struct-user_namespace relationship

2008-08-22 Thread Serge E. Hallyn
When a task does clone(CLONE_NEWNS), the task's user is the 'creator' of the
new user_namespace, and the user_namespace is tacked onto a list of those
created by this user.

Changelog:
Aug 1: renamed user-user_namespace to user_ns, as the next
patch did anyway.
Aug 1: move put_user_ns call in one free_user() definition
to move it outside the lock in free_user.  put_user_ns
calls free_user on the user_ns-creator, which in
turn would grab the lock again.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 include/linux/sched.h  |1 +
 include/linux/user_namespace.h |1 +
 kernel/user.c  |7 +++
 kernel/user_namespace.c|   20 +++-
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5850bfb..b0fe15a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -600,6 +600,7 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+   struct user_namespace *user_ns;
 
 #ifdef CONFIG_USER_SCHED
struct task_group *tg;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index b5f41d4..f9477c3 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -13,6 +13,7 @@ struct user_namespace {
struct kref kref;
struct hlist_head   uidhash_table[UIDHASH_SZ];
struct user_struct  *root_user;
+   struct user_struct  *creator;
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/user.c b/kernel/user.c
index 865ecf5..aedb3a1 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -22,6 +22,7 @@ struct user_namespace init_user_ns = {
.refcount   = ATOMIC_INIT(2),
},
.root_user = root_user,
+   .creator = root_user,
 };
 EXPORT_SYMBOL_GPL(init_user_ns);
 
@@ -53,6 +54,7 @@ struct user_struct root_user = {
.files  = ATOMIC_INIT(0),
.sigpending = ATOMIC_INIT(0),
.locked_shm = 0,
+   .user_ns= init_user_ns,
 #ifdef CONFIG_USER_SCHED
.tg = init_task_group,
 #endif
@@ -325,6 +327,7 @@ static inline void free_user(struct user_struct *up, 
unsigned long flags)
atomic_inc(up-__count);
spin_unlock_irqrestore(uidhash_lock, flags);
 
+   put_user_ns(up-user_ns);
INIT_WORK(up-work, remove_user_sysfs_dir);
schedule_work(up-work);
 }
@@ -347,6 +350,7 @@ static inline void free_user(struct user_struct *up, 
unsigned long flags)
sched_destroy_user(up);
key_put(up-uid_keyring);
key_put(up-session_keyring);
+   put_user_ns(up-user_ns);
kmem_cache_free(uid_cachep, up);
 }
 
@@ -409,6 +413,8 @@ struct user_struct *alloc_uid(struct user_namespace *ns, 
uid_t uid)
if (sched_create_user(new)  0)
goto out_free_user;
 
+   new-user_ns = get_user_ns(ns);
+
if (uids_user_create(new))
goto out_destoy_sched;
 
@@ -441,6 +447,7 @@ struct user_struct *alloc_uid(struct user_namespace *ns, 
uid_t uid)
 
 out_destoy_sched:
sched_destroy_user(new);
+   put_user_ns(new-user_ns);
 out_free_user:
kmem_cache_free(uid_cachep, new);
 out_unlock:
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index a9ab059..e8db443 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -19,7 +19,6 @@
 static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
 {
struct user_namespace *ns;
-   struct user_struct *new_user;
int n;
 
ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
@@ -38,15 +37,17 @@ static struct user_namespace *clone_user_ns(struct 
user_namespace *old_ns)
return ERR_PTR(-ENOMEM);
}
 
-   /* Reset current-user with a new one */
-   new_user = alloc_uid(ns, current-uid);
-   if (!new_user) {
-   free_uid(ns-root_user);
-   kfree(ns);
-   return ERR_PTR(-ENOMEM);
-   }
+   /* pin the creating user */
+   ns-creator = current-user;
+   atomic_inc(ns-creator-__count);
+
+   /*
+* The alloc_uid() incremented the userns refcount,
+* so drop it again 
+*/
+   put_user_ns(ns);
 
-   switch_uid(new_user);
+   switch_uid(ns-root_user);
return ns;
 }
 
@@ -72,6 +73,7 @@ void free_user_ns(struct kref *kref)
 
ns = container_of(kref, struct user_namespace, kref);
release_uids(ns);
+   free_uid(ns-creator);
kfree(ns);
 }
 EXPORT_SYMBOL(free_user_ns);
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers


[Devel] [PATCH 05/10] user namespaces: Allow registering new usernamespaces using mount

2008-08-22 Thread Serge E. Hallyn
Allow registering new user namespaces using mount(MS_ADD_USERNS).
Define lib/fsuserns.c which will contain functions which filesystems
can hook into to support user namespaces.

Since fsuserns.c currently supports neither reading policy nor
storing userns info using xattrs, the support is really bogus
for now.

The following program shows how to use the mount syscall to
register a new user namespace with a filesystem:

===
 #include stdio.h
 #include sys/mount.h
 #include errno.h

 #define MS_SET_USERNS   (125) /* Add current's user_ns as valid for sb */

/*
 * path is a path to be remounted
 * userid is a string 'user=X' where X is an integer
 */

int main(int argc, char *argv[])
{
char *path, *userid;
int ret;

path = argv[1];
userid = argv[2];

ret = mount(path, path, NULL, MS_SET_USERNS, userid);
if (ret)
perror(mount);
return ret;
}
===

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/ext3/super.c|   14 +++
 fs/namespace.c |   11 +++
 include/linux/fs.h |3 +
 lib/Makefile   |2 +
 lib/fsuserns.c |  226 
 5 files changed, 256 insertions(+), 0 deletions(-)
 create mode 100644 lib/fsuserns.c

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index f38a5af..3458d25 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -719,6 +719,15 @@ static struct quotactl_ops ext3_qctl_operations = {
 };
 #endif
 
+#ifdef CONFIG_USER_NS
+extern int fsuserns_add_userns(struct super_block *sb,
+struct user *user, void *data);
+extern int fsuserns_convert_uid_gid(struct user_namespace *ns,
+   struct inode *inode, uid_t *retuid, gid_t *retgid);
+extern int fsuserns_is_capable(struct user_namespace *ns,
+   struct inode *inode, int cap);
+#endif
+
 static const struct super_operations ext3_sops = {
.alloc_inode= ext3_alloc_inode,
.destroy_inode  = ext3_destroy_inode,
@@ -738,6 +747,11 @@ static const struct super_operations ext3_sops = {
.quota_read = ext3_quota_read,
.quota_write= ext3_quota_write,
 #endif
+#ifdef CONFIG_USER_NS
+   .add_userns = fsuserns_add_userns,
+   .is_capable = fsuserns_is_capable,
+   .convert_uid_gid = fsuserns_convert_uid_gid,
+#endif
 };
 
 static const struct export_operations ext3_export_ops = {
diff --git a/fs/namespace.c b/fs/namespace.c
index 6e283c9..4bb8c61 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1885,6 +1885,15 @@ int copy_mount_options(const void __user * data, 
unsigned long *where)
return 0;
 }
 
+int do_add_userns(struct nameidata *nd, struct user_struct *user,
+   void *data_page)
+{
+   struct super_block *sb = nd-path.mnt-mnt_sb;
+   if (sb-s_op-add_userns)
+   return sb-s_op-add_userns(sb, user, data_page);
+   return -EINVAL;
+}
+
 /*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
@@ -1958,6 +1967,8 @@ long do_mount(char *dev_name, char *dir_name, char 
*type_page,
retval = do_change_type(nd, flags);
else if (flags  MS_MOVE)
retval = do_move_mount(nd, dev_name);
+   else if (flags  MS_SET_USERNS)
+   retval = do_add_userns(nd, current-user, data_page);
else
retval = do_new_mount(nd, type_page, flags, mnt_flags,
  dev_name, data_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb58c2e..492abef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -128,6 +128,7 @@ extern int dir_notify_enable;
 #define MS_RELATIME(121) /* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT   (122) /* this is a kern_mount call */
 #define MS_I_VERSION   (123) /* Update inode I_version field */
+#define MS_SET_USERNS  (125) /* Add current's user_ns as valid for sb */
 #define MS_ACTIVE  (130)
 #define MS_NOUSER  (131)
 
@@ -1308,6 +1309,7 @@ extern ssize_t vfs_readv(struct file *, const struct 
iovec __user *,
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
 
+struct user_struct;
 struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
@@ -1325,6 +1327,7 @@ struct super_operations {
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
+   int (*add_userns) (struct super_block *, struct user_struct *, void *);
int (*is_capable) (struct user_namespace *, struct inode *, int);
uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
   

[Devel] [PATCH 04/10] user namespaces: enforce user namespaces for file permission

2008-08-22 Thread Serge E. Hallyn
Add a user_ns to the sb.  It is always set to the user_ns of the task which
mounted the sb.

Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
or gid from an inode on the sb's fs, and attempt to convert them into ids
meaningful in the user namespace passed in, which presumably is current's.
is_capable() checks whether current has the requested capability with respect
to the sb's fs, which is dependent upon the relationship between current's
user_ns and those which the sb knows about.

If the sb isn't user ns aware - which none are right now - then the new
super_operations won't be defined.  If in that case current and sb have
different user namespaces, then the userids can't be compared.

If current's and sb's userids can't be compared, then current will get
'user other' (we usually say 'nobody') permissions to the inode.

Changelog:
Aug 5: send the whole inode to the super_operations.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/namei.c |   29 +++---
 fs/super.c |3 ++
 include/linux/fs.h |   55 
 3 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4ea63ed..6bf5446 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask,
int (*check_acl)(struct inode *inode, int mask))
 {
umode_t mode = inode-i_mode;
+   uid_t   iuid;
+   gid_t   igid;
+   int ret;
+   int good_userns = 1;
+
+   ret = s_convert_uid_gid(inode, current-user-user_ns,
+   iuid, igid);
+   if (!ret)
+   good_userns = 0;
 
mask = MAY_READ | MAY_WRITE | MAY_EXEC;
 
-   if (current-fsuid == inode-i_uid)
+   /*
+* If we're not in the inode's user namespace, we get
+* user nobody permissions, and we ignore acls
+* (bc serge doesn't know how to handle acls in this case)
+*/
+   if (!good_userns)
+   goto check;
+
+   if (current-fsuid == iuid)
mode = 6;
else {
if (IS_POSIXACL(inode)  (mode  S_IRWXG)  check_acl) {
@@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask,
return error;
}
 
-   if (in_group_p(inode-i_gid))
+   if (in_group_p(igid))
mode = 3;
}
 
+check:
/*
 * If the DACs are ok we don't need any capability check.
 */
if ((mask  ~mode) == 0)
return 0;
+   if (!good_userns)
+   return -EACCES;
 
  check_capabilities:
/*
@@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask,
 */
if (!(mask  MAY_EXEC) ||
(inode-i_mode  S_IXUGO) || S_ISDIR(inode-i_mode))
-   if (capable(CAP_DAC_OVERRIDE))
+   if (s_is_capable(inode, current-user-user_ns,
+   CAP_DAC_OVERRIDE))
return 0;
 
/*
 * Searching includes executable on directories, else just read.
 */
if (mask == MAY_READ || (S_ISDIR(inode-i_mode)  !(mask  MAY_WRITE)))
-   if (capable(CAP_DAC_READ_SEARCH))
+   if (s_is_capable(inode, current-user-user_ns, 
CAP_DAC_READ_SEARCH))
return 0;
 
return -EACCES;
diff --git a/fs/super.c b/fs/super.c
index e931ae9..3559637 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -38,6 +38,7 @@
 #include linux/kobject.h
 #include linux/mutex.h
 #include linux/file.h
+#include linux/user_namespace.h
 #include asm/uaccess.h
 #include internal.h
 
@@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct 
file_system_type *type)
s-s_qcop = sb_quotactl_ops;
s-s_op = default_op;
s-s_time_gran = 10;
+   s-user_ns = get_user_ns(current-user-user_ns);
}
 out:
return s;
@@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s)
security_sb_free(s);
kfree(s-s_subtype);
kfree(s-s_options);
+   put_user_ns(s-user_ns);
kfree(s);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 580b513..bb58c2e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1128,6 +1128,11 @@ struct super_block {
 * generic_show_options()
 */
char *s_options;
+
+   /*
+* namespace of the user which mounted the sb
+*/
+   struct user_namespace *user_ns;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
@@ -1320,6 +1325,9 @@ struct super_operations {
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) 

[Devel] [PATCH 02/10] user namespaces: move user_ns from nsproxy into user struct

2008-08-22 Thread Serge E. Hallyn
When we get the sysfs support needed to support fair user scheduling
along with user namespaces, then we will need to be able to get the
user namespace from the user struct.

So we need the user_ns to be a part of struct user.  Once we can
access it from tsk-user, we no longer have a use for
tsk-nsproxy-user_ns.

When a user_namespace is created, the user which created it is
marked as its 'creator'.  The user_namespace pins the creator.
Each userid in a user_ns pins the user_ns.  This keeps refcounting
nice and simple.

At the same time, this patch simplifies the refcounting.  The current
user and userns locking works as follows:

The task pins the user struct.

The task pins the nsproxy, the nsproxy pins the user_ns.

When a new user_ns is created, it creates a root user for
it, and pins it.  When the nsproxy releases the user_ns,
the userns tries to release all its user structs.

So you see that the refcounting works for now, but only
because the nsproxy (and therefore usr_ns) and user structs
will be freed at the same time - when the last task using
them is released.

Now we need to put the user_ns in the struct user.  You can
see that will mess up the refcounting.

Fortunately, once the user_ns is available from tsk-user,
we don't need it in nsproxy.

So here is how the refcounting *should* be done:

The task pins the user struct.

The user struct pins its user namespace.

The user namespace pins the struct user which created it.

A user namespace now doesn't need to release its userids,
because it is only released when its last user disappears.

This patch makes those changes.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 include/linux/init_task.h  |1 -
 include/linux/key.h|3 ++
 include/linux/nsproxy.h|1 -
 include/linux/sched.h  |1 +
 include/linux/user_namespace.h |   10 +++-
 kernel/fork.c  |3 +-
 kernel/nsproxy.c   |   10 +---
 kernel/sys.c   |4 +-
 kernel/user.c  |   42 +--
 kernel/user_namespace.c|   36 +
 security/keys/process_keys.c   |7 +-
 11 files changed, 39 insertions(+), 79 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 021d8e7..550058b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -57,7 +57,6 @@ extern struct nsproxy init_nsproxy;
.mnt_ns = NULL, \
INIT_NET_NS(net_ns) \
INIT_IPC_NS(ipc_ns) \
-   .user_ns= init_user_ns,\
 }
 
 #define INIT_SIGHAND(sighand) {
\
diff --git a/include/linux/key.h b/include/linux/key.h
index c45c962..ba53aef 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -277,6 +277,8 @@ extern ctl_table key_sysctls[];
  * the userspace interface
  */
 extern void switch_uid_keyring(struct user_struct *new_user);
+extern void switch_uid_keyring_task(struct task_struct *tsk,
+   struct user_struct *new_user);
 extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
 extern int copy_thread_group_keys(struct task_struct *tsk);
 extern void exit_keys(struct task_struct *tsk);
@@ -305,6 +307,7 @@ extern void key_init(void);
 #define key_ref_to_ptr(k)  ({ NULL; })
 #define is_key_possessed(k)0
 #define switch_uid_keyring(u)  do { } while(0)
+#define switch_uid_keyring_task(t,u)   do { } while(0)
 #define __install_session_keyring(t, k)({ NULL; })
 #define copy_keys(f,t) 0
 #define copy_thread_group_keys(t)  0
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index c8a768e..afad7de 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -27,7 +27,6 @@ struct nsproxy {
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns;
-   struct user_namespace *user_ns;
struct net   *net_ns;
 };
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0fe15a..a2f1356 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,7 @@ static inline struct user_struct *get_uid(struct 
user_struct *u)
 }
 extern void free_uid(struct user_struct *);
 extern void switch_uid(struct user_struct *);
+extern void task_switch_uid(struct task_struct *tsk, struct user_struct *);
 extern void release_uids(struct user_namespace *ns);
 
 #include asm/current.h
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 

[Devel] [PATCH 10/10] userns: add support for readdir

2008-08-22 Thread Serge E. Hallyn
Now ls works correctly inside a userns!

(but don't go doing some sort of setattr like 'chown'  :)

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/ext3/file.c  |4 
 fs/ext3/inode.c |   22 ++
 fs/ext3/namei.c |3 +++
 fs/ext3/xattr.c |6 ++
 lib/fsuserns.c  |   42 +-
 5 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index acc4913..b259061 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -106,6 +106,9 @@ force_commit:
return ret;
 }
 
+extern int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry,
+   struct kstat *stat);
+
 const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read   = do_sync_read,
@@ -134,5 +137,6 @@ const struct inode_operations ext3_file_inode_operations = {
.removexattr= generic_removexattr,
 #endif
.permission = ext3_permission,
+   .getattr= ext3_getattr,
 };
 
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 507d868..b252490 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -36,6 +36,7 @@
 #include linux/mpage.h
 #include linux/uio.h
 #include linux/bio.h
+#include linux/security.h
 #include xattr.h
 #include acl.h
 
@@ -3088,6 +3089,27 @@ err_out:
return error;
 }
 
+int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry,
+   struct kstat *stat)
+{
+   struct inode *inode = dentry-d_inode;
+   int retval;
+   uid_t uid;
+   gid_t gid;
+
+   retval = security_inode_getattr(mnt, dentry);
+   if (retval)
+   return retval;
+
+   generic_fillattr(inode, stat);
+
+   retval = s_convert_uid_gid(inode, current-user-user_ns, uid, gid);
+   if (retval == 1) {
+   stat-uid = uid;
+   stat-gid = gid;
+   }
+   return 0;
+}
 
 /*
  * How many blocks doth make a writepage()?
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index e5be4bc..fe7350b 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2410,6 +2410,8 @@ end_rename:
return retval;
 }
 
+extern int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry,
+   struct kstat *stat);
 /*
  * directories can handle most operations...
  */
@@ -2431,6 +2433,7 @@ const struct inode_operations ext3_dir_inode_operations = 
{
.removexattr= generic_removexattr,
 #endif
.permission = ext3_permission,
+   .getattr= ext3_getattr,
 };
 
 const struct inode_operations ext3_special_inode_operations = {
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index 500fec7..cf7dc63 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -345,6 +345,7 @@ ext3_xattr_get_userns(struct inode *inode, void *value, 
size_t value_size)
error = ext3_xattr_block_get(inode, name_index, name, value,
 value_size);
up_read(EXT3_I(inode)-xattr_sem);
+   printk(KERN_NOTICE %s: returning %d for %lu\n, __func__, error, 
inode-i_ino);
return error;
 }
 
@@ -1102,7 +1103,12 @@ retry:
error = PTR_ERR(handle);
} else {
int error2;
+   char *buf;
+   int i;
 
+   printk(KERN_NOTICE %s: writing %d bytes:\n, __func__, 
value_len);
+   for (i=0, buf = (char *)value; ivalue_len; i++,buf++)
+   printk(KERN_NOTICE %s: %d %x\n, __func__, i, 
(int)*buf);
error = ext3_xattr_set_handle(handle, inode, name_index, name,
  value, value_len, 0);
error2 = ext3_journal_stop(handle);
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index ac0ca99..f0be780 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -179,7 +179,7 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, 
struct inode *inode,
struct fsuserns_conversion_table *t;
struct fsuserns_table_entries *ep;
size_t valuelen;
-   struct unsstore *unsstore;
+   struct unsstore *unsstore = NULL;
int i, ret;
 
t = find_table(sb);
@@ -196,23 +196,33 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, 
struct inode *inode,
 convert:
mutex_unlock(fsuserns_table_mutex);
 
-   /* look for an xattr */
-   /* yes, 3 needs to be made adjustable */
-   valuelen = 3;
-   unsstore = kzalloc(3*sizeof(struct unsstore), GFP_KERNEL);
+   ret = xattrget(inode, NULL, 0);
+   if (ret = 0)
+   goto notfound;
+   valuelen = ret;
+   unsstore = kzalloc(ret, GFP_KERNEL);
ret = xattrget(inode, unsstore, valuelen);
-   if (ret  0)
-   return ret;
-   for (i=0; i3; i++)
+   if (ret = 0)
+   goto notfound;
+   for (i=0; i(ret/sizeof(*unsstore)); i++) {
+   

[Devel] [PATCH 06/10] user namespaces: hook fs/attr.c

2008-08-22 Thread Serge E. Hallyn
Hook fs/attr.c so things like chown are properly handled.  Note this is only
for permission checks.  We'll need to hook ext3_setattr to get the right
uids updated.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/attr.c |   28 +++-
 include/linux/sched.h |1 +
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 26c71ba..072d367 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -22,6 +22,11 @@ int inode_change_ok(struct inode *inode, struct iattr *attr)
 {
int retval = -EPERM;
unsigned int ia_valid = attr-ia_valid;
+   uid_t iuid;
+   gid_t igid;
+
+   if (!s_convert_uid_gid(inode, current_userns(), iuid, igid))
+   return -EPERM;
 
/* If force is set do it anyway. */
if (ia_valid  ATTR_FORCE)
@@ -29,30 +34,30 @@ int inode_change_ok(struct inode *inode, struct iattr *attr)
 
/* Make sure a caller can chown. */
if ((ia_valid  ATTR_UID) 
-   (current-fsuid != inode-i_uid ||
-attr-ia_uid != inode-i_uid)  !capable(CAP_CHOWN))
+   (current-fsuid != iuid ||
+attr-ia_uid != iuid)  !s_is_capable(inode, current_userns(), 
CAP_CHOWN))
goto error;
 
/* Make sure caller can chgrp. */
if ((ia_valid  ATTR_GID) 
-   (current-fsuid != inode-i_uid ||
-   (!in_group_p(attr-ia_gid)  attr-ia_gid != inode-i_gid)) 
-   !capable(CAP_CHOWN))
+   (current-fsuid != iuid ||
+   (!in_group_p(attr-ia_gid)  attr-ia_gid != igid)) 
+   !s_is_capable(inode, current_userns(), CAP_CHOWN))
goto error;
 
/* Make sure a caller can chmod. */
if (ia_valid  ATTR_MODE) {
-   if (!is_owner_or_cap(inode))
+   if (current-fsuid != iuid || !s_is_capable(inode, 
current_userns(), CAP_FOWNER))
goto error;
/* Also check the setgid bit! */
if (!in_group_p((ia_valid  ATTR_GID) ? attr-ia_gid :
-   inode-i_gid)  !capable(CAP_FSETID))
+   igid)  !s_is_capable(inode, current_userns(), 
CAP_FSETID))
attr-ia_mode = ~S_ISGID;
}
 
/* Check for setting the inode time. */
if (ia_valid  (ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET)) {
-   if (!is_owner_or_cap(inode))
+   if (current-fsuid != iuid || !s_is_capable(inode, 
current_userns(), CAP_FOWNER))
goto error;
}
 fine:
@@ -66,6 +71,11 @@ EXPORT_SYMBOL(inode_change_ok);
 int inode_setattr(struct inode * inode, struct iattr * attr)
 {
unsigned int ia_valid = attr-ia_valid;
+   uid_t iuid;
+   gid_t igid;
+
+   if (!s_convert_uid_gid(inode, current_userns(), iuid, igid))
+   return -EPERM;
 
if (ia_valid  ATTR_SIZE 
attr-ia_size != i_size_read(inode)) {
@@ -90,7 +100,7 @@ int inode_setattr(struct inode * inode, struct iattr * attr)
if (ia_valid  ATTR_MODE) {
umode_t mode = attr-ia_mode;
 
-   if (!in_group_p(inode-i_gid)  !capable(CAP_FSETID))
+   if (!in_group_p(igid)  !s_is_capable(inode, current_userns(), 
CAP_FSETID))
mode = ~S_ISGID;
inode-i_mode = mode;
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2f1356..f557535 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -611,6 +611,7 @@ struct user_struct {
 #endif
 };
 
+#define current_userns() (current-user-user_ns)
 extern int uids_sysfs_init(void);
 
 extern struct user_struct *find_user(uid_t);
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 09/10] userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns

2008-08-22 Thread Serge E. Hallyn
userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/ext3/super.c|   11 +--
 fs/ext3/xattr.c|   19 ++-
 fs/ext3/xattr.h|3 ++-
 include/linux/fs.h |2 +-
 lib/fsuserns.c |   48 +---
 5 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 3458d25..37c8404 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -723,11 +723,18 @@ static struct quotactl_ops ext3_qctl_operations = {
 extern int fsuserns_add_userns(struct super_block *sb,
 struct user *user, void *data);
 extern int fsuserns_convert_uid_gid(struct user_namespace *ns,
-   struct inode *inode, uid_t *retuid, gid_t *retgid);
+   struct inode *inode, uid_t *retuid, gid_t *retgid,
+   int (*xattrget)(struct inode *, const void *, size_t));
 extern int fsuserns_is_capable(struct user_namespace *ns,
struct inode *inode, int cap);
 #endif
 
+int ext3_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
+uid_t *retuid, gid_t *retgid)
+{
+   return fsuserns_convert_uid_gid(ns, inode, retuid, retgid, 
ext3_xattr_get_userns);
+}
+
 static const struct super_operations ext3_sops = {
.alloc_inode= ext3_alloc_inode,
.destroy_inode  = ext3_destroy_inode,
@@ -750,7 +757,7 @@ static const struct super_operations ext3_sops = {
 #ifdef CONFIG_USER_NS
.add_userns = fsuserns_add_userns,
.is_capable = fsuserns_is_capable,
-   .convert_uid_gid = fsuserns_convert_uid_gid,
+   .convert_uid_gid = ext3_convert_uid_gid,
 #endif
 };
 
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index da47c35..500fec7 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -331,6 +331,23 @@ ext3_xattr_get(struct inode *inode, int name_index, const 
char *name,
return error;
 }
 
+int
+ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_size)
+{
+   int error;
+   int name_index = EXT3_XATTR_INDEX_SECURITY;
+   char *name = userns;
+
+   down_read(EXT3_I(inode)-xattr_sem);
+   error = ext3_xattr_ibody_get(inode, name_index, name, value,
+value_size);
+   if (error == -ENODATA)
+   error = ext3_xattr_block_get(inode, name_index, name, value,
+value_size);
+   up_read(EXT3_I(inode)-xattr_sem);
+   return error;
+}
+
 static int
 ext3_xattr_list_entries(struct inode *inode, struct ext3_xattr_entry *entry,
char *buffer, size_t buffer_size)
@@ -1087,7 +1104,7 @@ retry:
int error2;
 
error = ext3_xattr_set_handle(handle, inode, name_index, name,
- value, value_len, flags);
+ value, value_len, 0);
error2 = ext3_journal_stop(handle);
if (error == -ENOSPC 
ext3_should_retry_alloc(inode-i_sb, retries))
diff --git a/fs/ext3/xattr.h b/fs/ext3/xattr.h
index 8a523de..8c5b982 100644
--- a/fs/ext3/xattr.h
+++ b/fs/ext3/xattr.h
@@ -70,7 +70,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, 
void *, size_t);
 extern int ext3_xattr_set(struct inode *, int, const char *, const void *, 
size_t, int);
 extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char 
*, const void *, size_t, int);
 
-extern int ext3_xattr_set_userns(struct inode *inode, const void *value, 
size_t value_len, int flags);
+extern int ext3_xattr_get_userns(struct inode *inode, void *value, size_t 
value_len);
+extern int ext3_xattr_set_userns(struct inode *inode, const void *value, 
size_t value_len);
 
 extern void ext3_xattr_delete_inode(handle_t *, struct inode *);
 extern void ext3_xattr_put_super(struct super_block *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 492abef..9ec6dac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1329,7 +1329,7 @@ struct super_operations {
void (*umount_begin) (struct super_block *);
int (*add_userns) (struct super_block *, struct user_struct *, void *);
int (*is_capable) (struct user_namespace *, struct inode *, int);
-   uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
+   int (*convert_uid_gid)(struct user_namespace *, struct inode *,
uid_t *, gid_t *);
 
int (*show_options)(struct seq_file *, struct vfsmount *);
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index db70970..ac0ca99 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -59,6 +59,12 @@ struct fsuserns_conversion_table {
 
 LIST_HEAD(fsuserns_tables);
 
+struct unsstore {
+   int ns;
+   uid_t uid;
+   gid_t gid;
+};
+
 struct fsuserns_conversion_table 

[Devel] [PATCH 07/10] user namespaces: bad bad bad but test code

2008-08-22 Thread Serge E. Hallyn
Let uid 0 in a child namespace whose creator owns a file,
access that file.

This of course means that user hallyn (if he is allowed to
remount / for his userns, i.e. through
capset cap_sys_admin=ep usernsremount
can create files owned by root.

So this is only so we can play.  This code will be removed
in favor of code doing the right thing using extended
attributes.  Then, when the above user creates a file,
the inode-iuid will be set to 500 (hallyn), and an
xattr named fs.userns=(nsid,0) will store the fact that
in the given nsid (might be 1 for instance) uid 0 owns
the file.

Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED]
---
 lib/fsuserns.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index 0a9f52d..c237d1d 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -185,6 +185,15 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, 
struct inode *inode,
 convert:
mutex_unlock(fsuserns_table_mutex);
 
+   /* The following is BAD CODE.  IT's for testing only */
+   if (current-uid == 0) {
+   if (inode-i_uid == ns-creator-uid) {
+   *retuid = 0;
+   *retgid = 0;
+   return 1;
+   }
+   }
+
/*
 * ok now we would look through the xattrs for the
 * inode to find a stored uid in this namespace.
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 08/10] userns: store child userns uids as xattrs in ext3 using lib/fsuserns

2008-08-22 Thread Serge E. Hallyn
userns: store child userns uids as xattrs in ext3 using lib/fsuserns

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/ext3/namei.c|7 +++-
 fs/ext3/xattr.c|   29 +++
 fs/ext3/xattr.h|2 +
 include/linux/user_namespace.h |1 +
 kernel/user_namespace.c|1 +
 lib/fsuserns.c |   74 
 6 files changed, 113 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index de13e91..e5be4bc 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1676,6 +1676,9 @@ static int ext3_add_nondir(handle_t *handle,
return err;
 }
 
+int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
+   int (*)(struct inode *inode, const void *value, size_t value_len));
+
 /*
  * By the time this is called, we already have created
  * the directory cache entry for the new file, but it
@@ -1707,7 +1710,9 @@ retry:
inode-i_op = ext3_file_inode_operations;
inode-i_fop = ext3_file_operations;
ext3_set_aops(inode);
-   err = ext3_add_nondir(handle, dentry, inode);
+   err = fsuserns_store_creds(inode, current-user, 
ext3_xattr_set_userns);
+   if (!err)
+   err = ext3_add_nondir(handle, dentry, inode);
}
ext3_journal_stop(handle);
if (err == -ENOSPC  ext3_should_retry_alloc(dir-i_sb, retries))
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index 175414a..da47c35 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -1070,6 +1070,35 @@ retry:
return error;
 }
 
+int
+ext3_xattr_set_userns(struct inode *inode,
+  const void *value, size_t value_len)
+{
+   int name_index = EXT3_XATTR_INDEX_SECURITY;
+   handle_t *handle;
+   int error, retries = 0;
+   char *name = userns;
+
+retry:
+   handle = ext3_journal_start(inode, EXT3_DATA_TRANS_BLOCKS(inode-i_sb));
+   if (IS_ERR(handle)) {
+   error = PTR_ERR(handle);
+   } else {
+   int error2;
+
+   error = ext3_xattr_set_handle(handle, inode, name_index, name,
+ value, value_len, flags);
+   error2 = ext3_journal_stop(handle);
+   if (error == -ENOSPC 
+   ext3_should_retry_alloc(inode-i_sb, retries))
+   goto retry;
+   if (error == 0)
+   error = error2;
+   }
+
+   return error;
+}
+
 /*
  * ext3_xattr_delete_inode()
  *
diff --git a/fs/ext3/xattr.h b/fs/ext3/xattr.h
index 148a4df..8a523de 100644
--- a/fs/ext3/xattr.h
+++ b/fs/ext3/xattr.h
@@ -70,6 +70,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, 
void *, size_t);
 extern int ext3_xattr_set(struct inode *, int, const char *, const void *, 
size_t, int);
 extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char 
*, const void *, size_t, int);
 
+extern int ext3_xattr_set_userns(struct inode *inode, const void *value, 
size_t value_len, int flags);
+
 extern void ext3_xattr_delete_inode(handle_t *, struct inode *);
 extern void ext3_xattr_put_super(struct super_block *);
 
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 1b4959d..a793263 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -14,6 +14,7 @@ struct user_namespace {
struct hlist_head   uidhash_table[UIDHASH_SZ];
struct user_struct  *root_user;
struct user_struct  *creator;
+   gid_t   creator_grp;
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 39aea7b..879693a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -51,6 +51,7 @@ int create_new_userns(int flags, struct task_struct *tsk)
put_user_ns(ns);
 
task_switch_uid(tsk, ns-root_user);
+   ns-creator_grp = tsk-gid;
tsk-uid = tsk-euid = tsk-suid = tsk-fsuid = 0;
tsk-gid = tsk-egid = tsk-sgid = tsk-fsgid = 0;
 
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index c237d1d..db70970 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -5,6 +5,7 @@
 #include linux/fs.h
 #include linux/user.h
 #include linux/user_namespace.h
+#include linux/xattr.h
 
 /*
  * Ok, eventually I'll want some policy loaded which looks as follows:
@@ -233,3 +234,76 @@ convert:
printk(KERN_NOTICE %s: oh, but I wasn't capable(%d)\n, __func__, cap);
return 0;
 }
+
+/*
+ * when a user_namespace is registered with an fs, we store the
+ * nsid.  This next fn will need to retreive an nsid for a
+ * given fs (inode-i_sb, that is) and user_namespace
+ *
+ * I don't want to do that bookkeeping yet, so i just return 1 :)
+ */
+int find_ns_id(struct inode *inode, struct user_namespace *ns)
+{
+   struct fsuserns_conversion_table *t;
+  

[Devel] Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls

2008-08-22 Thread Dave Hansen
On Fri, 2008-08-22 at 12:32 -0700, Dave Hansen wrote:
 On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote:
  6/unistd_32.h
  index d739467..88bdec4 100644
  --- a/include/asm-x86/unistd_32.h
  +++ b/include/asm-x86/unistd_32.h
  @@ -338,6 +338,8 @@
#define __NR_dup3 330
#define __NR_pipe2331
#define __NR_inotify_init1332
  +#define __NR_checkpoint333
  +#define __NR_restart   334
  
#ifdef __KERNEL__
 
 Uh oh.  These got whitespace mangled somehow.  I'll look through them,
 but probably can't produce patches on top of them for now.

This patch also has to go *AFTER* you actually define the syscall
functions.  Otherwise, this won't be build-bisectable when it gets
committed.

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 Add a user_ns to the sb.  It is always set to the user_ns of the task which
 mounted the sb.

 Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
 or gid from an inode on the sb's fs, and attempt to convert them into ids
 meaningful in the user namespace passed in, which presumably is current's.
 is_capable() checks whether current has the requested capability with respect
 to the sb's fs, which is dependent upon the relationship between current's
 user_ns and those which the sb knows about.

 If the sb isn't user ns aware - which none are right now - then the new
 super_operations won't be defined.  If in that case current and sb have
 different user namespaces, then the userids can't be compared.

 If current's and sb's userids can't be compared, then current will get
 'user other' (we usually say 'nobody') permissions to the inode.

 Changelog:
   Aug 5: send the whole inode to the super_operations.

 Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
 ---
  fs/namei.c |   29 +++---
  fs/super.c |3 ++
  include/linux/fs.h |   55 
 
  3 files changed, 83 insertions(+), 4 deletions(-)

 diff --git a/fs/namei.c b/fs/namei.c
 index 4ea63ed..6bf5446 100644
 --- a/fs/namei.c
 +++ b/fs/namei.c
 @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask,
   int (*check_acl)(struct inode *inode, int mask))
  {
   umode_t mode = inode-i_mode;
 + uid_t   iuid;
 + gid_t   igid;
 + int ret;
 + int good_userns = 1;
 +
 + ret = s_convert_uid_gid(inode, current-user-user_ns,
 + iuid, igid);
 + if (!ret)
 + good_userns = 0;
  
   mask = MAY_READ | MAY_WRITE | MAY_EXEC;
  
 - if (current-fsuid == inode-i_uid)
 + /*
 +  * If we're not in the inode's user namespace, we get
 +  * user nobody permissions, and we ignore acls
 +  * (bc serge doesn't know how to handle acls in this case)
 +  */
 + if (!good_userns)
 + goto check;
 +
 + if (current-fsuid == iuid)
   mode = 6;
   else {
   if (IS_POSIXACL(inode)  (mode  S_IRWXG)  check_acl) {
 @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask,
   return error;
   }
  
 - if (in_group_p(inode-i_gid))
 + if (in_group_p(igid))
   mode = 3;
   }
  
 +check:
   /*
* If the DACs are ok we don't need any capability check.
*/
   if ((mask  ~mode) == 0)
   return 0;
 + if (!good_userns)
 + return -EACCES;
  
   check_capabilities:
   /*
 @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask,
*/
   if (!(mask  MAY_EXEC) ||
   (inode-i_mode  S_IXUGO) || S_ISDIR(inode-i_mode))
 - if (capable(CAP_DAC_OVERRIDE))
 + if (s_is_capable(inode, current-user-user_ns,
 + CAP_DAC_OVERRIDE))
   return 0;
  
   /*
* Searching includes executable on directories, else just read.
*/
   if (mask == MAY_READ || (S_ISDIR(inode-i_mode)  !(mask  MAY_WRITE)))
 - if (capable(CAP_DAC_READ_SEARCH))
 + if (s_is_capable(inode, current-user-user_ns, CAP_DAC_READ_SEARCH))
   return 0;
  
   return -EACCES;
 diff --git a/fs/super.c b/fs/super.c
 index e931ae9..3559637 100644
 --- a/fs/super.c
 +++ b/fs/super.c
 @@ -38,6 +38,7 @@
  #include linux/kobject.h
  #include linux/mutex.h
  #include linux/file.h
 +#include linux/user_namespace.h
  #include asm/uaccess.h
  #include internal.h
  
 @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct 
 file_system_type
 *type)
   s-s_qcop = sb_quotactl_ops;
   s-s_op = default_op;
   s-s_time_gran = 10;
 + s-user_ns = get_user_ns(current-user-user_ns);
   }
  out:
   return s;
 @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s)
   security_sb_free(s);
   kfree(s-s_subtype);
   kfree(s-s_options);
 + put_user_ns(s-user_ns);
   kfree(s);
  }
  
 diff --git a/include/linux/fs.h b/include/linux/fs.h
 index 580b513..bb58c2e 100644
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -1128,6 +1128,11 @@ struct super_block {
* generic_show_options()
*/
   char *s_options;
 +
 + /*
 +  * namespace of the user which mounted the sb
 +  */
 + struct user_namespace *user_ns;

We should make it clear that this is a default, and not something
a filesystem must use, just something a filesystem may use.
  };
  
  extern struct timespec current_fs_time(struct 

[Devel] Re: [0/10] User namespaces: introduction

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 Hi Eric,

 so here is a start to a userns patchset trying to follow your ideas
 about how to have user namespaces and filesystems interact.  Ignore
 the bookkeeping crap or you'll pull your hair out.  Lots of stuff
 remains unimplemented - i.e. chown (setattr) and proper handling of
 capabilities.  But you can do some fun things with this patchset.
 I.e.

   (log in as root)
   setcap cap_sys_admin=ep ns_exec
   setcap cap_sys_admin=ep usernsmount
   ns_exec -U /bin/sh
   ls /root (fails)
   ls / (succeeds)
   (log in as hallyn)
   ns_exec -U /bin/sh
   id
   (uid=0, gid=0)
   ls (fails, can't descend /home/hallyn)
   usernsmount / nsid=4
   ls (succeeds)
   touch ab
   ls -l ab
   (ab is owned by root)
   exit
   (we're logged in as hallyn in the init_user_ns again)
   ls -l ab
   (ab is owned by hallyn)

 The only supported fs is ext3.  Only a few operations are supported.
 So if, above, when we are hallyn in the init_user_ns but root in
 the child user ns,
   when we create a file, it is properly handled, so
   inode-i_uid=500, but an xattr (nsid=4,uid=0) is added
   when we chown the file to root, it is not properly handled,
   so inode-i_uid = 0
 it's just a matter of hooking all the places at this point.

 Capabilities remain a problem.  Right now I think capabilities will
 need to be split up into system-wide caps, and container-safe caps.
 So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
 CAP_REBOOT may become container-safe one day, but for now is very
 much system-wide.

 So if I'm uid 500 on the host and create a user namespace where I'm
 uid=0, I should be able to acquire container-safe caps (perhaps
 contingent on whether I unshared all other namespaces), but not
 system-wide ones.  Or, whether I can acquire them would depend
 on whether the suid bit was set in a user_ns or not.  sigh.

Serge at first glance this looks like a good start, especially for thinking
through how things will work.

It has just occurred to me that from a dependency point of view it
makes an enormous amount of sense to sort out capable with
respect to namespaces before we get to the filesystems.

There is no one else working in the area of capabilities so there won't
be conflicts, and we need a firm understanding of how capabilities are
going to work with respect to namespaces before we start embedding
the logic in filesystems.

With respect to your separation of capabilities in namespaces I don't think
you have quite grasped the simple idea that is sitting in my head and makes
all of this clear.  Let me see if I can explain it better.

A fully qualified capability name would be of the form:
userns:capability_name

For each operation we will check for one specific capability.
For the network namespace in particular we will check for:
userns_of_network_namespace_creator:CAP_NET_ADMIN

The check for a capability will succeed if:
- We have the exact fully qualified capability. 
- We are outside the user namespace but are the owner of
  the user namespace.
- We are outside the user namespace but have the appropriate
  capability over the owner of the user namespace  CAP_PTRACE?

  This last test would recurses.

I'm less certain than I like about which permissions we allow someone outside
of a container to posses and still control the container.

This has two very useful implications.
- We can have all capabilities in a new user namespace and be completely
  impotent.
- Allowing the capabilities of a user namespace to do something useful
  can come gradually.

Which means we need to extend the classic capable check to become.
capable(userns, capability).  Or possibly we extend the capability
parameter to be a structure that can hold both userns and the capability,
whichever turns out to be more maintainable.

Once we have done that we can allow something to be under the power
of creator_user_ns:capability instead of init_user_ns:capability.

So the CAP_SYS_REBOOT test will be init_user_ns:capability for the 
foreseeable future.  While the CAP_NET_ADMIN test will shortly
become creator_of_netns:CAP_NET_ADMIN.

Of course none of that will happen until we relax the test to create a
new namespace from init_user_ns:CAP_SYS_ADMIN to
current_user_ns:CAP_SYS_ADMIN.

Eric
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls

2008-08-22 Thread Oren Laadan


Dave Hansen wrote:
 On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote:
 6/unistd_32.h
 index d739467..88bdec4 100644
 --- a/include/asm-x86/unistd_32.h
 +++ b/include/asm-x86/unistd_32.h
 @@ -338,6 +338,8 @@
   #define __NR_dup3 330
   #define __NR_pipe2331
   #define __NR_inotify_init1332
 +#define __NR_checkpoint333
 +#define __NR_restart   334

   #ifdef __KERNEL__
 
 Uh oh.  These got whitespace mangled somehow.  I'll look through them,
 but probably can't produce patches on top of them for now.
 

Oops .. not sure what happened.
I'll re-send after addressing the recent comments.

Oren.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 Add a user_ns to the sb.  It is always set to the user_ns of the task which
 mounted the sb.

 Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
 or gid from an inode on the sb's fs, and attempt to convert them into ids
 meaningful in the user namespace passed in, which presumably is current's.
 is_capable() checks whether current has the requested capability with respect
 to the sb's fs, which is dependent upon the relationship between current's
 user_ns and those which the sb knows about.

 If the sb isn't user ns aware - which none are right now - then the new
 super_operations won't be defined.  If in that case current and sb have
 different user namespaces, then the userids can't be compared.

 If current's and sb's userids can't be compared, then current will get
 'user other' (we usually say 'nobody') permissions to the inode.

 Changelog:
   Aug 5: send the whole inode to the super_operations.

Serge while I am dubious about adding the methods you have added to
super_block_operations there is one very important question you are
using them for that we do need to ask.

Does this filesystem map to user namespace X.

Can we separate that concept out as it's own individual function and initially
have a noop implementation that returns false for everything except for
the initial network namespace.

By doing just that we should be able to capture and update all of the tests 
without
the rest of the hassle.  Possibly this is just your s_convert_uid_gid function
without calling the superblock option.

I would really like to get to the point where we can create a user namespace
without privilege and have it be safe (if impotent).  Once we do that we can
extend out what the root of a user namespace is allowed to do, much as
we have done with the network namespace. 

Eric
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC v2][PATCH 4/9] Memory management - dump state

2008-08-22 Thread Oren Laadan

Thanks Louis for all the comments. Will fix in v3.

Oren.

Louis Rilling wrote:
 On Wed, Aug 20, 2008 at 11:05:15PM -0400, Oren Laadan wrote:
 For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
 it will be followed by the file name.  The cr_vma-npages will tell
 how many pages were dumped for this VMA.  Then it will be followed
 by the actual data: first a dump of the addresses of all dumped
 pages (npages entries) followed by a dump of the contents of all
 dumped pages (npages pages). Then will come the next VMA and so on.
 
 [...]
 
 diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
 new file mode 100644
 index 000..a23aa29
 --- /dev/null
 +++ b/checkpoint/ckpt_mem.c
 
 [...]
 
 +/**
 + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
 + * @ctx - checkpoint context
 + * @pgarr - page-array to fill
 + * @vma - vma to scan
 + * @start - start address (updated)
 + */
 +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
 + struct vm_area_struct *vma, unsigned long *start)
 +{
 +unsigned long end = vma-vm_end;
 +unsigned long addr = *start;
 +struct page **pagep;
 +unsigned long *addrp;
 +int cow, nr, ret = 0;
 +
 +nr = pgarr-nleft;
 +pagep = pgarr-pages[pgarr-nused];
 +addrp = pgarr-addrs[pgarr-nused];
 +cow = !!vma-vm_file;
 +
 +while (addr  end) {
 +struct page *page;
 +
 +/* simplified version of get_user_pages(): already have vma,
 +* only need FOLL_TOUCH, and (for now) ignore fault stats */
 +
 +cond_resched();
 +while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
 +ret = handle_mm_fault(vma-vm_mm, vma, addr, 0);
 +if (ret  VM_FAULT_ERROR) {
 +if (ret  VM_FAULT_OOM)
 +ret = -ENOMEM;
 +else if (ret  VM_FAULT_SIGBUS)
 +ret = -EFAULT;
 +else
 +BUG();
 +break;
 +}
 
 + ret = 0;
 
 +cond_resched();
 +}
 +
 +if (IS_ERR(page)) {
 +ret = PTR_ERR(page);
 +break;
 +}
 
 Need to check ret here:
 
 + if (ret)
   break;
 
 +
 +if (page == ZERO_PAGE(0))
 +page = NULL;/* zero page: ignore */
 +else if (cow  page_mapping(page) != NULL)
 +page = NULL;/* clean cow: ignore */
 +else {
 +get_page(page);
 +*(addrp++) = addr;
 +*(pagep++) = page;
 +if (--nr == 0) {
 +addr += PAGE_SIZE;
 +break;
 +}
 +}
 +
 +addr += PAGE_SIZE;
 +}
 +
 +if (unlikely(ret  0)) {
 +nr = pgarr-nleft - nr;
 +while (nr--)
 +page_cache_release(*(--pagep));
 +return ret;
 +}
 +
 +*start = addr;
 +return (pgarr-nleft - nr);
 +}
 
 [...]
 
 Thanks,
 
 Louis
 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission

2008-08-22 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  Add a user_ns to the sb.  It is always set to the user_ns of the task which
  mounted the sb.
 
  Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
  or gid from an inode on the sb's fs, and attempt to convert them into ids
  meaningful in the user namespace passed in, which presumably is current's.
  is_capable() checks whether current has the requested capability with 
  respect
  to the sb's fs, which is dependent upon the relationship between current's
  user_ns and those which the sb knows about.
 
  If the sb isn't user ns aware - which none are right now - then the new
  super_operations won't be defined.  If in that case current and sb have
  different user namespaces, then the userids can't be compared.
 
  If current's and sb's userids can't be compared, then current will get
  'user other' (we usually say 'nobody') permissions to the inode.
 
  Changelog:
  Aug 5: send the whole inode to the super_operations.
 
 Serge while I am dubious about adding the methods you have added to
 super_block_operations there is one very important question you are
 using them for that we do need to ask.
 
 Does this filesystem map to user namespace X.

By itself that is not sufficient.  We need to support two inodes on the
same fs where both have i_uid=500 on the host fs, while in user
namespace X one is owned by uid 0, and another by uid 1000.

So we need to be able to pass the filesystem an inode and a user
namespace, and ask for the owning uid and gids.

Or am I (I likely am) misunderstanding?

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission

2008-08-22 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  Add a user_ns to the sb.  It is always set to the user_ns of the task which
  mounted the sb.
 
  Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
  or gid from an inode on the sb's fs, and attempt to convert them into ids
  meaningful in the user namespace passed in, which presumably is current's.
  is_capable() checks whether current has the requested capability with 
  respect
  to the sb's fs, which is dependent upon the relationship between current's
  user_ns and those which the sb knows about.
 
  If the sb isn't user ns aware - which none are right now - then the new
  super_operations won't be defined.  If in that case current and sb have
  different user namespaces, then the userids can't be compared.
 
  If current's and sb's userids can't be compared, then current will get
  'user other' (we usually say 'nobody') permissions to the inode.
 
  Changelog:
  Aug 5: send the whole inode to the super_operations.
 
  Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
  ---
   fs/namei.c |   29 +++---
   fs/super.c |3 ++
   include/linux/fs.h |   55 
  
   3 files changed, 83 insertions(+), 4 deletions(-)
 
  diff --git a/fs/namei.c b/fs/namei.c
  index 4ea63ed..6bf5446 100644
  --- a/fs/namei.c
  +++ b/fs/namei.c
  @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask,
  int (*check_acl)(struct inode *inode, int mask))
   {
  umode_t mode = inode-i_mode;
  +   uid_t   iuid;
  +   gid_t   igid;
  +   int ret;
  +   int good_userns = 1;
  +
  +   ret = s_convert_uid_gid(inode, current-user-user_ns,
  +   iuid, igid);
  +   if (!ret)
  +   good_userns = 0;
   
  mask = MAY_READ | MAY_WRITE | MAY_EXEC;
   
  -   if (current-fsuid == inode-i_uid)
  +   /*
  +* If we're not in the inode's user namespace, we get
  +* user nobody permissions, and we ignore acls
  +* (bc serge doesn't know how to handle acls in this case)
  +*/
  +   if (!good_userns)
  +   goto check;
  +
  +   if (current-fsuid == iuid)
  mode = 6;
  else {
  if (IS_POSIXACL(inode)  (mode  S_IRWXG)  check_acl) {
  @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask,
  return error;
  }
   
  -   if (in_group_p(inode-i_gid))
  +   if (in_group_p(igid))
  mode = 3;
  }
   
  +check:
  /*
   * If the DACs are ok we don't need any capability check.
   */
  if ((mask  ~mode) == 0)
  return 0;
  +   if (!good_userns)
  +   return -EACCES;
   
check_capabilities:
  /*
  @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask,
   */
  if (!(mask  MAY_EXEC) ||
  (inode-i_mode  S_IXUGO) || S_ISDIR(inode-i_mode))
  -   if (capable(CAP_DAC_OVERRIDE))
  +   if (s_is_capable(inode, current-user-user_ns,
  +   CAP_DAC_OVERRIDE))
  return 0;
   
  /*
   * Searching includes executable on directories, else just read.
   */
  if (mask == MAY_READ || (S_ISDIR(inode-i_mode)  !(mask  MAY_WRITE)))
  -   if (capable(CAP_DAC_READ_SEARCH))
  + if (s_is_capable(inode, current-user-user_ns, CAP_DAC_READ_SEARCH))
  return 0;
   
  return -EACCES;
  diff --git a/fs/super.c b/fs/super.c
  index e931ae9..3559637 100644
  --- a/fs/super.c
  +++ b/fs/super.c
  @@ -38,6 +38,7 @@
   #include linux/kobject.h
   #include linux/mutex.h
   #include linux/file.h
  +#include linux/user_namespace.h
   #include asm/uaccess.h
   #include internal.h
   
  @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct 
  file_system_type
  *type)
  s-s_qcop = sb_quotactl_ops;
  s-s_op = default_op;
  s-s_time_gran = 10;
  +   s-user_ns = get_user_ns(current-user-user_ns);
  }
   out:
  return s;
  @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s)
  security_sb_free(s);
  kfree(s-s_subtype);
  kfree(s-s_options);
  +   put_user_ns(s-user_ns);
  kfree(s);
   }
   
  diff --git a/include/linux/fs.h b/include/linux/fs.h
  index 580b513..bb58c2e 100644
  --- a/include/linux/fs.h
  +++ b/include/linux/fs.h
  @@ -1128,6 +1128,11 @@ struct super_block {
   * generic_show_options()
   */
  char *s_options;
  +
  +   /*
  +* namespace of the user which mounted the sb
  +*/
  +   struct user_namespace *user_ns;
 
 We should make it clear that this is a default, and not something
 a filesystem must use, just something a filesystem may use.
   

[Devel] Re: [0/10] User namespaces: introduction

2008-08-22 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  Hi Eric,
 
  so here is a start to a userns patchset trying to follow your ideas
  about how to have user namespaces and filesystems interact.  Ignore
  the bookkeeping crap or you'll pull your hair out.  Lots of stuff
  remains unimplemented - i.e. chown (setattr) and proper handling of
  capabilities.  But you can do some fun things with this patchset.
  I.e.
 
  (log in as root)
  setcap cap_sys_admin=ep ns_exec
  setcap cap_sys_admin=ep usernsmount
  ns_exec -U /bin/sh
  ls /root (fails)
  ls / (succeeds)
  (log in as hallyn)
  ns_exec -U /bin/sh
  id
  (uid=0, gid=0)
  ls (fails, can't descend /home/hallyn)
  usernsmount / nsid=4
  ls (succeeds)
  touch ab
  ls -l ab
  (ab is owned by root)
  exit
  (we're logged in as hallyn in the init_user_ns again)
  ls -l ab
  (ab is owned by hallyn)
 
  The only supported fs is ext3.  Only a few operations are supported.
  So if, above, when we are hallyn in the init_user_ns but root in
  the child user ns,
  when we create a file, it is properly handled, so
  inode-i_uid=500, but an xattr (nsid=4,uid=0) is added
  when we chown the file to root, it is not properly handled,
  so inode-i_uid = 0
  it's just a matter of hooking all the places at this point.
 
  Capabilities remain a problem.  Right now I think capabilities will
  need to be split up into system-wide caps, and container-safe caps.
  So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
  CAP_REBOOT may become container-safe one day, but for now is very
  much system-wide.
 
  So if I'm uid 500 on the host and create a user namespace where I'm
  uid=0, I should be able to acquire container-safe caps (perhaps
  contingent on whether I unshared all other namespaces), but not
  system-wide ones.  Or, whether I can acquire them would depend
  on whether the suid bit was set in a user_ns or not.  sigh.
 
 Serge at first glance this looks like a good start, especially for thinking
 through how things will work.
 
 It has just occurred to me that from a dependency point of view it
 makes an enormous amount of sense to sort out capable with
 respect to namespaces before we get to the filesystems.
 
 There is no one else working in the area of capabilities so there won't
 be conflicts, and we need a firm understanding of how capabilities are
 going to work with respect to namespaces before we start embedding
 the logic in filesystems.
 
 With respect to your separation of capabilities in namespaces I don't think
 you have quite grasped the simple idea that is sitting in my head and makes
 all of this clear.  Let me see if I can explain it better.
 
 A fully qualified capability name would be of the form:
 userns:capability_name
 
 For each operation we will check for one specific capability.
 For the network namespace in particular we will check for:
 userns_of_network_namespace_creator:CAP_NET_ADMIN
 
 The check for a capability will succeed if:
 - We have the exact fully qualified capability. 
 - We are outside the user namespace but are the owner of
   the user namespace.
 - We are outside the user namespace but have the appropriate
   capability over the owner of the user namespace  CAP_PTRACE?
 
   This last test would recurses.
 
 I'm less certain than I like about which permissions we allow someone outside
 of a container to posses and still control the container.
 
 This has two very useful implications.
 - We can have all capabilities in a new user namespace and be completely
   impotent.
 - Allowing the capabilities of a user namespace to do something useful
   can come gradually.
 
 Which means we need to extend the classic capable check to become.
 capable(userns, capability).  Or possibly we extend the capability
 parameter to be a structure that can hold both userns and the capability,
 whichever turns out to be more maintainable.
 
 Once we have done that we can allow something to be under the power
 of creator_user_ns:capability instead of init_user_ns:capability.
 
 So the CAP_SYS_REBOOT test will be init_user_ns:capability for the 
 foreseeable future.  While the CAP_NET_ADMIN test will shortly
 become creator_of_netns:CAP_NET_ADMIN.
 
 Of course none of that will happen until we relax the test to create a
 new namespace from init_user_ns:CAP_SYS_ADMIN to
 current_user_ns:CAP_SYS_ADMIN.
 
 Eric

It definately seems to make sense in terms of the security
implications.  And solving this before the filesystem handlers seems
to make sense too.  Although I would like to get the first 3 patches upstream
pretty soon, as I believe they are proper fixes.

But wrt userns:capability, the problem that brings to mind is that of
referring to the userns.  Do we use the userspace-exported id, or do we
use the actual in-kernel user_ns?  If we use the in-kernel 

[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 By itself that is not sufficient.  We need to support two inodes on the
 same fs where both have i_uid=500 on the host fs, while in user
 namespace X one is owned by uid 0, and another by uid 1000.

 So we need to be able to pass the filesystem an inode and a user
 namespace, and ask for the owning uid and gids.

 Or am I (I likely am) misunderstanding?

There are two questions.
Does this filesystem provide mappings to user namespace X?
What is the mapping from this filesystem to user namespace X?

I think we may be able to separate those two questions.

The important idea is that we don't need to implement filesystem changes
in the first pass.  Just have the permission check fail unconditionally
if we are not in the init_user_ns.

Eric
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission

2008-08-22 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  By itself that is not sufficient.  We need to support two inodes on the
  same fs where both have i_uid=500 on the host fs, while in user
  namespace X one is owned by uid 0, and another by uid 1000.
 
  So we need to be able to pass the filesystem an inode and a user
  namespace, and ask for the owning uid and gids.
 
  Or am I (I likely am) misunderstanding?
 
 There are two questions.
 Does this filesystem provide mappings to user namespace X?
 What is the mapping from this filesystem to user namespace X?

That is where you and I still disagree:  I don't believe that a mapping
as such makes sense.

A mapping implies uid 500 becomes uid 0, uid 501 becomes uid 1000,
etc.  That simply is not sufficient.  If I am going to be able to create
a new userns without privilege, then all files belonging to all uids in
the child userns must be owned by uid 500 in my parent uidns.

So I do think we will have to have something more of the format what is
the owning uid and gid of this inode, in the context of user namespace X.
Which was the s_convert_uid_gid() in my helpers.

If we follow your userns:capability idea and then we probably can get
rid of the s_is_capable you hated so much :)

But,

 I think we may be able to separate those two questions.
 
 The important idea is that we don't need to implement filesystem changes
 in the first pass.  Just have the permission check fail unconditionally
 if we are not in the init_user_ns.
 
 Eric

There I definately agree.

However, given this general direction, do you feel that marking an sb
with a userns the way I'm doing in patch 4/10 makes sense, using it
only to facilitate a basic is this non-userns-aware fs in the same
userns as the task attempting to access it check?  If not, what
alternatives do we have, given that marking vfsmounts as we've
discussed before won't be feasible?

Of course by the time we get capabilities straightened out I
suppose tagging vfsmounts might in fact be feasible  :-)

Thanks, Eric.

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 That might make sense.  I guess the problem is I started trying to
 handle generic permission.  But I don't need to...  the fs can
 provide its own permission, else we do the simple userns check.

Sounds right.

We really need that simple starting place where we can turn it all
on and you aren't allowed to do anything, then we can enable one
feature at a time from there.

Plus I can think of some uses for a user who isn't allowed to do
anything on a machine. ;)

 The user ns comes from the current task.  The fs must decide whether
 current-user-uid in current-user-user_ns has capabilities over
 the target file.  Only the fs can erify that the user is really
 uid=0 or has CAP_DAC_OVERRIDE, for instance, with respect to the file.

In the general case only the filesystem can make the call about what
powers we have over a specific file and the filesystem will make that
call when we pass it our kerberos ticket to the filesystem server, and
ask the filesystem to perform a specific operation.

In the common case the filesystem is just a reflection of our in memory
set of namespaces.  Getting the capability rules straight are going
to be a little interesting.  Essentially however for filesystems
I believe the rule is if I have CAP_DAC_OVERRIDE in the user namespace
of a user.  And that user creates a namespace.  Then I should have
CAP_DAC_OVERRIDE capability over that namespace as well.  Allowing
me to back up files in the unprivileged users namespace.

The common case is the only case that needs to be implemented
in generic kernel code.

Eric
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [0/10] User namespaces: introduction

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 It definately seems to make sense in terms of the security
 implications.  And solving this before the filesystem handlers seems
 to make sense too.  Although I would like to get the first 3 patches upstream
 pretty soon, as I believe they are proper fixes.

Reasonable.  I'm not certain about free_user continuing to be an inline
function as it seems a bit non-trivial, but otherwise that sounds correct.

 But wrt userns:capability, the problem that brings to mind is that of
 referring to the userns.  Do we use the userspace-exported id, or do we
 use the actual in-kernel user_ns?  If we use the in-kernel user_ns,
 then we'd have to take a ref for each cap, yuck.  But you had wanted to
 use 'mount' to only have filesystems associate userspace ids with the
 in-kernel struct user_ns, so that complicates the idea of having
 capabilities refer to those.

I don't think so.  In the standard security model there are only 2
intersections between the filesystem and the capabilities.

- CAP_DAC_OVERRIDE.
- The capabilities xattr on a filesystem.

With a filesystem in exactly one user namespace at a time this
is straight forward.  With a filesystem in user namespaces at
a time this is slightly more interesting.

I believe the authentication algorithm becomes:
Map the credentials on the filesystem inode into (fs_user_ns, fs_uid, fs_gid, 
fs_mode)

Then to see if we have power over the file we test:
capable(fs_user_ns, CAP_DAC_OVERRIDE).

Then if current-user-user_ns != fs_user_ns we can do something like:
uid = 0, gid = 0 mode clear except for the other bits.  We want either
0 or another uid we have reserved for the purpose.

I don't see why the mapping rules should not be universal so we can probably
do all of the mapping foreign uid's and gid's in generic code and just
place a unser_ns pointer into struct inode. 

Which makes things very close to how they are now and it means we can
do the lookup of the user_ns when we cache the struct inode.

 Anyway I like the overall approach, and will think a bit about
 any other actual implementation issues.

Thanks.

It adds more complications then I like not having a view of the filesystem
with a single user_namespace.  However that appears to be necessary to
deal with Al's inode_permission changes, and it seems to be where we
are ultimately heading so it seems more honest.  So I guess I have
to bite the bullet and accept it. ;)

For the case of a shared /usr just having other permission access
should work fairly well.  I just looked on my ubuntu system and I
found only 36 suid executables and only one executable (fusermount)
that was not world executable.  And a shared usr is the only reasonable
case I could think of where I would want a file to at least appear
to have multiple owners.

Eric
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission

2008-08-22 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 There are two questions.
 Does this filesystem provide mappings to user namespace X?
 What is the mapping from this filesystem to user namespace X?

 That is where you and I still disagree:  I don't believe that a mapping
 as such makes sense.

 A mapping implies uid 500 becomes uid 0, uid 501 becomes uid 1000,
 etc.  That simply is not sufficient.  If I am going to be able to create
 a new userns without privilege, then all files belonging to all uids in
 the child userns must be owned by uid 500 in my parent uidns.

 So I do think we will have to have something more of the format what is
 the owning uid and gid of this inode, in the context of user namespace X.
 Which was the s_convert_uid_gid() in my helpers.

That is what I meant by mapping.  ;)

Given my thought in my last message that we should cache the user_ns
in the inode along with the rest of the credentials.  Making all of
the child files appear and act as if they are owned by uid 500 in the
parent namespace should be straight forward.

I think we may want a 1-1 mapping onto the filesystem.  However I
don't care very much how we store uids on the filesystem as long
as we do something that works.

 If we follow your userns:capability idea and then we probably can get
 rid of the s_is_capable you hated so much :)

yep.

 But,

 I think we may be able to separate those two questions.
 
 The important idea is that we don't need to implement filesystem changes
 in the first pass.  Just have the permission check fail unconditionally
 if we are not in the init_user_ns.
 
 Eric

 There I definately agree.

 However, given this general direction, do you feel that marking an sb
 with a userns the way I'm doing in patch 4/10 makes sense, using it
 only to facilitate a basic is this non-userns-aware fs in the same
 userns as the task attempting to access it check?  If not, what
 alternatives do we have, given that marking vfsmounts as we've
 discussed before won't be feasible?

 Of course by the time we get capabilities straightened out I
 suppose tagging vfsmounts might in fact be feasible  :-)

LOL

Well I tell you what.  Unless we can think of a reason for a
file to live in more than one leaf user namespace.  We should
tag the inode with the leaf user namespace.

We could tag the superblock with a default for the inode user
namespace but that is less interesting.

Eric

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel