[Devel] Re: [PATCH 20/38] netns ct: NOTRACK in netns
On Thu, Aug 21, 2008 at 07:06:37PM -0400, Jan Engelhardt wrote: On Thursday 2008-08-21 18:04, [EMAIL PROTECTED] wrote: Make untracked conntrack per-netns. Why? It does not store any useful information per se, it is merely used to add a third type of ct, iow: (a) ct==NULL (b) ct!=NULL (c) ct==untracked mmap(2)'s return value for example has something similar: (a) mmap(...)==NULL (b) mmap(...)==MMAP_FAILED (c) otherwise The untracked ct is a singleton, and should stay one, unless there are further reasons not to do so. We wait for untracked ct refcount to drop to 1 back: /* wait until all references to nf_conntrack_untracked are dropped */ while (atomic_read(nf_conntrack_untracked.ct_general.use) 1) schedule(); Consequently it should be one per netns, otherwise netns A can prevent netns B from stopping. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: SNMP statistics in namespaces
On Fri, 2008-08-22 at 13:57 +0200, Eelco Chaudron wrote: I was looking at the 2.6.26 nw namespace code, and noticed that the SNMP counters are still system wide. Looking at the status page there is not separate item on SNMP. Was this overlooked, skipped on purpose, or is it in the pipeline? Hope some one can answer this... I'm an idiot. I know that SNMP has something to do with network statistics, and I'm too lazy to google it. Could you give us a quick idea why SNMP is important here? You're welcome to add it to the list, but it's unlikely to get implemented until someone thinks it is important. That said, I'd be happy to help you put together some patches to implement it. -- Dave ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: SNMP statistics in namespaces
Dave Hansen wrote: On Fri, 2008-08-22 at 13:57 +0200, Eelco Chaudron wrote: I was looking at the 2.6.26 nw namespace code, and noticed that the SNMP counters are still system wide. Looking at the status page there is not separate item on SNMP. Was this overlooked, skipped on purpose, or is it in the pipeline? Hope some one can answer this... I'm an idiot. I know that SNMP has something to do with network statistics, and I'm too lazy to google it. Could you give us a quick idea why SNMP is important here? You're welcome to add it to the list, but it's unlikely to get implemented until someone thinks it is important. That said, I'd be happy to help you put together some patches to implement it. SNMP is typically the way that most network administration (statistics and control) is done. I have not looked at the network namespace implementation, but I am quite confident that it would be a useful thing to do, if not done already. -- Balbir ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: RFC: Attaching threads to cgroups is OK?
On Thu, Aug 21, 2008 at 02:25:06PM +0900, Fernando Luis Vázquez Cao wrote: Hi Balbir, On Thu, 2008-08-21 at 09:02 +0530, Balbir Singh wrote: Fernando Luis Vázquez Cao wrote: On Wed, 2008-08-20 at 20:48 +0900, Hirokazu Takahashi wrote: Hi, Tsuruta-san, how about your bio-cgroup's tracking concerning this? If we want to use your tracking functions for each threads seperately, there seems to be a problem. ===cf. mm_get_bio_cgroup()=== owner mm_struct task_struct bio_cgroup = In my understanding, the mm_struct of a thread is same as its parent's. So, even if we attach the TIDs of some threads to different cgroups the tracking always returns the same bio_cgroup -- its parent's group. Do you have some policy about in which case we can use your tracking? It's will be resitriction when io-controller reuse information of the owner of memory. But if it's very clear who issues I/O (by tracking read/write syscall), we may have chance to record the issuer of I/O to page_cgroup struct. This might be slightly different topic though, I've been thinking where we should add hooks to track I/O reqeust. I think the following set of hooks is enough whether we are going to support thread based cgroup or not. Hook-1: called when allocating a page, where the memory controller already have a hoook. Hook-2: called when making a page in page-cache dirty. For anonymous pages, Hook-1 is enough to track any type of I/O request. For pages in page-cache, Hook-1 is also enough for read I/O because the I/O is issued just once right after allocting the page. For write I/O requests to pages in page-cache, Hook-1 will be okay in most cases but sometimes process in another cgroup may write the pages. In this case, Hook-2 is needed to keep accurate to track I/O requests. This relative simplicity is what prompted me to say that we probably should try to disentangle the io tracking functionality from the memory controller a bit more (of course we still should reuse as much as we can from it). The rationale for this is that the existing I/O scheduler would benefit from proper io tracking capabilities too, so it'd be nice if we could have them even in non-cgroup-capable kernels. Hook 2 referred to in the mail above exist today in the form of task IO accounting. Yup. As an aside, when the IO context of a certain IO operation is known (synchronous IO comes to mind) I think it should be cashed in the resulting bio so that we can do without the expensive accesses to bio_cgroup once it enters the block layer. Will this give you everything you need for accounting and control (from the block layer?) Well, it depends on what you are trying to achieve. Current IO schedulers such as CFQ only care about the io_context when scheduling requests. When a new request comes in CFQ assumes that it was originated in the context of the current task, which obviously does not hold true for buffered IO and aio. This problem could be solved by using bio-cgroup for IO tracking, but accessing the io context information is somewhat expensive: page-page_cgroup-bio_cgroup-io_context. If at the time of building a bio we know its io context (i.e. the context of the task or cgroup that generated that bio) I think we should store it in the bio itself, too. With this scheme, whenever the kernel needs to know the io_context of a particular block IO operation the kernel would first try to retrieve its io_context directly from the bio, and, if not available there, would resort to the slow path (accessing it through bio_cgroup). My gut feeling is that elevator-based IO resource controllers would benefit from such an approach, too. Hi Fernando, Had a question. IIUC, at the time of submtting the bio, io_context will be known only for synchronous request. For asynchronous request it will not be known (ex. writing the dirty pages back to disk) and one shall have to take the longer path (bio-cgroup thing) to ascertain the io_context associated with a request. If that's the case, than it looks like we shall have to always traverse the longer path in case of asynchronous IO. By putting the io_context pointer in bio, we will just shift the time of pointer traversal. (From CFQ to higher layers). So probably it is not worth while to put io_context pointer in bio? Am I missing something? Thanks Vivek ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [0/10] User namespaces: introduction
Hi Eric, so here is a start to a userns patchset trying to follow your ideas about how to have user namespaces and filesystems interact. Ignore the bookkeeping crap or you'll pull your hair out. Lots of stuff remains unimplemented - i.e. chown (setattr) and proper handling of capabilities. But you can do some fun things with this patchset. I.e. (log in as root) setcap cap_sys_admin=ep ns_exec setcap cap_sys_admin=ep usernsmount ns_exec -U /bin/sh ls /root (fails) ls / (succeeds) (log in as hallyn) ns_exec -U /bin/sh id (uid=0, gid=0) ls (fails, can't descend /home/hallyn) usernsmount / nsid=4 ls (succeeds) touch ab ls -l ab (ab is owned by root) exit (we're logged in as hallyn in the init_user_ns again) ls -l ab (ab is owned by hallyn) The only supported fs is ext3. Only a few operations are supported. So if, above, when we are hallyn in the init_user_ns but root in the child user ns, when we create a file, it is properly handled, so inode-i_uid=500, but an xattr (nsid=4,uid=0) is added when we chown the file to root, it is not properly handled, so inode-i_uid = 0 it's just a matter of hooking all the places at this point. Capabilities remain a problem. Right now I think capabilities will need to be split up into system-wide caps, and container-safe caps. So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe. CAP_REBOOT may become container-safe one day, but for now is very much system-wide. So if I'm uid 500 on the host and create a user namespace where I'm uid=0, I should be able to acquire container-safe caps (perhaps contingent on whether I unshared all other namespaces), but not system-wide ones. Or, whether I can acquire them would depend on whether the suid bit was set in a user_ns or not. sigh. thanks, -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 03/10] user namespaces: reset task's credentials on CLONE_NEWUSER
Currently, creating a new user namespace does not reset the task's uid or gid. Since generally that is done as root because it requires CAP_SYS_ADMIN, and since the first uid in the new namespace is 0, one usually doesn't notice. However, if one does capset cap_sys_admin=ep ns_exec su - hallyn ns_exec -U /bin/sh id then one will see hallyn's userid, and all preexisting groups. With this patch, cloning a new user namespace will set the task's uid and gid to 0, and reset the group_info to the empty set assigned to init. Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED] --- kernel/user_namespace.c | 14 ++ 1 files changed, 14 insertions(+), 0 deletions(-) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 0045dd0..39aea7b 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -11,6 +11,9 @@ #include linux/slab.h #include linux/user_namespace.h +/* defined in kernel/sys.c */ +extern struct group_info init_groups; + /* * Clone a new ns copying an original user ns, setting refcount to 1 * @old_ns: namespace to clone @@ -48,6 +51,17 @@ int create_new_userns(int flags, struct task_struct *tsk) put_user_ns(ns); task_switch_uid(tsk, ns-root_user); + tsk-uid = tsk-euid = tsk-suid = tsk-fsuid = 0; + tsk-gid = tsk-egid = tsk-sgid = tsk-fsgid = 0; + + /* this can't be safe for unshare, can it? it's safe +* for fork, though. I'm tempted to limit clone_newuser to +* fork only */ + task_lock(tsk); + put_group_info(tsk-group_info); + tsk-group_info = init_groups; + get_group_info(tsk-group_info); + task_unlock(tsk); return 0; } -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 01/10] user namespaces: introduce user_struct-user_namespace relationship
When a task does clone(CLONE_NEWNS), the task's user is the 'creator' of the new user_namespace, and the user_namespace is tacked onto a list of those created by this user. Changelog: Aug 1: renamed user-user_namespace to user_ns, as the next patch did anyway. Aug 1: move put_user_ns call in one free_user() definition to move it outside the lock in free_user. put_user_ns calls free_user on the user_ns-creator, which in turn would grab the lock again. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- include/linux/sched.h |1 + include/linux/user_namespace.h |1 + kernel/user.c |7 +++ kernel/user_namespace.c| 20 +++- 4 files changed, 20 insertions(+), 9 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5850bfb..b0fe15a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -600,6 +600,7 @@ struct user_struct { /* Hash table maintenance information */ struct hlist_node uidhash_node; uid_t uid; + struct user_namespace *user_ns; #ifdef CONFIG_USER_SCHED struct task_group *tg; diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index b5f41d4..f9477c3 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -13,6 +13,7 @@ struct user_namespace { struct kref kref; struct hlist_head uidhash_table[UIDHASH_SZ]; struct user_struct *root_user; + struct user_struct *creator; }; extern struct user_namespace init_user_ns; diff --git a/kernel/user.c b/kernel/user.c index 865ecf5..aedb3a1 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -22,6 +22,7 @@ struct user_namespace init_user_ns = { .refcount = ATOMIC_INIT(2), }, .root_user = root_user, + .creator = root_user, }; EXPORT_SYMBOL_GPL(init_user_ns); @@ -53,6 +54,7 @@ struct user_struct root_user = { .files = ATOMIC_INIT(0), .sigpending = ATOMIC_INIT(0), .locked_shm = 0, + .user_ns= init_user_ns, #ifdef CONFIG_USER_SCHED .tg = init_task_group, #endif @@ -325,6 +327,7 @@ static inline void free_user(struct user_struct *up, unsigned long flags) atomic_inc(up-__count); spin_unlock_irqrestore(uidhash_lock, flags); + put_user_ns(up-user_ns); INIT_WORK(up-work, remove_user_sysfs_dir); schedule_work(up-work); } @@ -347,6 +350,7 @@ static inline void free_user(struct user_struct *up, unsigned long flags) sched_destroy_user(up); key_put(up-uid_keyring); key_put(up-session_keyring); + put_user_ns(up-user_ns); kmem_cache_free(uid_cachep, up); } @@ -409,6 +413,8 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) if (sched_create_user(new) 0) goto out_free_user; + new-user_ns = get_user_ns(ns); + if (uids_user_create(new)) goto out_destoy_sched; @@ -441,6 +447,7 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) out_destoy_sched: sched_destroy_user(new); + put_user_ns(new-user_ns); out_free_user: kmem_cache_free(uid_cachep, new); out_unlock: diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index a9ab059..e8db443 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -19,7 +19,6 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns) { struct user_namespace *ns; - struct user_struct *new_user; int n; ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL); @@ -38,15 +37,17 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns) return ERR_PTR(-ENOMEM); } - /* Reset current-user with a new one */ - new_user = alloc_uid(ns, current-uid); - if (!new_user) { - free_uid(ns-root_user); - kfree(ns); - return ERR_PTR(-ENOMEM); - } + /* pin the creating user */ + ns-creator = current-user; + atomic_inc(ns-creator-__count); + + /* +* The alloc_uid() incremented the userns refcount, +* so drop it again +*/ + put_user_ns(ns); - switch_uid(new_user); + switch_uid(ns-root_user); return ns; } @@ -72,6 +73,7 @@ void free_user_ns(struct kref *kref) ns = container_of(kref, struct user_namespace, kref); release_uids(ns); + free_uid(ns-creator); kfree(ns); } EXPORT_SYMBOL(free_user_ns); -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers
[Devel] [PATCH 05/10] user namespaces: Allow registering new usernamespaces using mount
Allow registering new user namespaces using mount(MS_ADD_USERNS). Define lib/fsuserns.c which will contain functions which filesystems can hook into to support user namespaces. Since fsuserns.c currently supports neither reading policy nor storing userns info using xattrs, the support is really bogus for now. The following program shows how to use the mount syscall to register a new user namespace with a filesystem: === #include stdio.h #include sys/mount.h #include errno.h #define MS_SET_USERNS (125) /* Add current's user_ns as valid for sb */ /* * path is a path to be remounted * userid is a string 'user=X' where X is an integer */ int main(int argc, char *argv[]) { char *path, *userid; int ret; path = argv[1]; userid = argv[2]; ret = mount(path, path, NULL, MS_SET_USERNS, userid); if (ret) perror(mount); return ret; } === Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/ext3/super.c| 14 +++ fs/namespace.c | 11 +++ include/linux/fs.h |3 + lib/Makefile |2 + lib/fsuserns.c | 226 5 files changed, 256 insertions(+), 0 deletions(-) create mode 100644 lib/fsuserns.c diff --git a/fs/ext3/super.c b/fs/ext3/super.c index f38a5af..3458d25 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -719,6 +719,15 @@ static struct quotactl_ops ext3_qctl_operations = { }; #endif +#ifdef CONFIG_USER_NS +extern int fsuserns_add_userns(struct super_block *sb, +struct user *user, void *data); +extern int fsuserns_convert_uid_gid(struct user_namespace *ns, + struct inode *inode, uid_t *retuid, gid_t *retgid); +extern int fsuserns_is_capable(struct user_namespace *ns, + struct inode *inode, int cap); +#endif + static const struct super_operations ext3_sops = { .alloc_inode= ext3_alloc_inode, .destroy_inode = ext3_destroy_inode, @@ -738,6 +747,11 @@ static const struct super_operations ext3_sops = { .quota_read = ext3_quota_read, .quota_write= ext3_quota_write, #endif +#ifdef CONFIG_USER_NS + .add_userns = fsuserns_add_userns, + .is_capable = fsuserns_is_capable, + .convert_uid_gid = fsuserns_convert_uid_gid, +#endif }; static const struct export_operations ext3_export_ops = { diff --git a/fs/namespace.c b/fs/namespace.c index 6e283c9..4bb8c61 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1885,6 +1885,15 @@ int copy_mount_options(const void __user * data, unsigned long *where) return 0; } +int do_add_userns(struct nameidata *nd, struct user_struct *user, + void *data_page) +{ + struct super_block *sb = nd-path.mnt-mnt_sb; + if (sb-s_op-add_userns) + return sb-s_op-add_userns(sb, user, data_page); + return -EINVAL; +} + /* * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to * be given to the mount() call (ie: read-only, no-dev, no-suid etc). @@ -1958,6 +1967,8 @@ long do_mount(char *dev_name, char *dir_name, char *type_page, retval = do_change_type(nd, flags); else if (flags MS_MOVE) retval = do_move_mount(nd, dev_name); + else if (flags MS_SET_USERNS) + retval = do_add_userns(nd, current-user, data_page); else retval = do_new_mount(nd, type_page, flags, mnt_flags, dev_name, data_page); diff --git a/include/linux/fs.h b/include/linux/fs.h index bb58c2e..492abef 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -128,6 +128,7 @@ extern int dir_notify_enable; #define MS_RELATIME(121) /* Update atime relative to mtime/ctime. */ #define MS_KERNMOUNT (122) /* this is a kern_mount call */ #define MS_I_VERSION (123) /* Update inode I_version field */ +#define MS_SET_USERNS (125) /* Add current's user_ns as valid for sb */ #define MS_ACTIVE (130) #define MS_NOUSER (131) @@ -1308,6 +1309,7 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *, extern ssize_t vfs_writev(struct file *, const struct iovec __user *, unsigned long, loff_t *); +struct user_struct; struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); @@ -1325,6 +1327,7 @@ struct super_operations { int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode) (struct inode *); void (*umount_begin) (struct super_block *); + int (*add_userns) (struct super_block *, struct user_struct *, void *); int (*is_capable) (struct user_namespace *, struct inode *, int); uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
[Devel] [PATCH 04/10] user namespaces: enforce user namespaces for file permission
Add a user_ns to the sb. It is always set to the user_ns of the task which mounted the sb. Define 3 new super_operations. convert_uid() and convert_gid() take a uid or gid from an inode on the sb's fs, and attempt to convert them into ids meaningful in the user namespace passed in, which presumably is current's. is_capable() checks whether current has the requested capability with respect to the sb's fs, which is dependent upon the relationship between current's user_ns and those which the sb knows about. If the sb isn't user ns aware - which none are right now - then the new super_operations won't be defined. If in that case current and sb have different user namespaces, then the userids can't be compared. If current's and sb's userids can't be compared, then current will get 'user other' (we usually say 'nobody') permissions to the inode. Changelog: Aug 5: send the whole inode to the super_operations. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/namei.c | 29 +++--- fs/super.c |3 ++ include/linux/fs.h | 55 3 files changed, 83 insertions(+), 4 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 4ea63ed..6bf5446 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask, int (*check_acl)(struct inode *inode, int mask)) { umode_t mode = inode-i_mode; + uid_t iuid; + gid_t igid; + int ret; + int good_userns = 1; + + ret = s_convert_uid_gid(inode, current-user-user_ns, + iuid, igid); + if (!ret) + good_userns = 0; mask = MAY_READ | MAY_WRITE | MAY_EXEC; - if (current-fsuid == inode-i_uid) + /* +* If we're not in the inode's user namespace, we get +* user nobody permissions, and we ignore acls +* (bc serge doesn't know how to handle acls in this case) +*/ + if (!good_userns) + goto check; + + if (current-fsuid == iuid) mode = 6; else { if (IS_POSIXACL(inode) (mode S_IRWXG) check_acl) { @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask, return error; } - if (in_group_p(inode-i_gid)) + if (in_group_p(igid)) mode = 3; } +check: /* * If the DACs are ok we don't need any capability check. */ if ((mask ~mode) == 0) return 0; + if (!good_userns) + return -EACCES; check_capabilities: /* @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask, */ if (!(mask MAY_EXEC) || (inode-i_mode S_IXUGO) || S_ISDIR(inode-i_mode)) - if (capable(CAP_DAC_OVERRIDE)) + if (s_is_capable(inode, current-user-user_ns, + CAP_DAC_OVERRIDE)) return 0; /* * Searching includes executable on directories, else just read. */ if (mask == MAY_READ || (S_ISDIR(inode-i_mode) !(mask MAY_WRITE))) - if (capable(CAP_DAC_READ_SEARCH)) + if (s_is_capable(inode, current-user-user_ns, CAP_DAC_READ_SEARCH)) return 0; return -EACCES; diff --git a/fs/super.c b/fs/super.c index e931ae9..3559637 100644 --- a/fs/super.c +++ b/fs/super.c @@ -38,6 +38,7 @@ #include linux/kobject.h #include linux/mutex.h #include linux/file.h +#include linux/user_namespace.h #include asm/uaccess.h #include internal.h @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct file_system_type *type) s-s_qcop = sb_quotactl_ops; s-s_op = default_op; s-s_time_gran = 10; + s-user_ns = get_user_ns(current-user-user_ns); } out: return s; @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s) security_sb_free(s); kfree(s-s_subtype); kfree(s-s_options); + put_user_ns(s-user_ns); kfree(s); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..bb58c2e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1128,6 +1128,11 @@ struct super_block { * generic_show_options() */ char *s_options; + + /* +* namespace of the user which mounted the sb +*/ + struct user_namespace *user_ns; }; extern struct timespec current_fs_time(struct super_block *sb); @@ -1320,6 +1325,9 @@ struct super_operations { int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode)
[Devel] [PATCH 02/10] user namespaces: move user_ns from nsproxy into user struct
When we get the sysfs support needed to support fair user scheduling along with user namespaces, then we will need to be able to get the user namespace from the user struct. So we need the user_ns to be a part of struct user. Once we can access it from tsk-user, we no longer have a use for tsk-nsproxy-user_ns. When a user_namespace is created, the user which created it is marked as its 'creator'. The user_namespace pins the creator. Each userid in a user_ns pins the user_ns. This keeps refcounting nice and simple. At the same time, this patch simplifies the refcounting. The current user and userns locking works as follows: The task pins the user struct. The task pins the nsproxy, the nsproxy pins the user_ns. When a new user_ns is created, it creates a root user for it, and pins it. When the nsproxy releases the user_ns, the userns tries to release all its user structs. So you see that the refcounting works for now, but only because the nsproxy (and therefore usr_ns) and user structs will be freed at the same time - when the last task using them is released. Now we need to put the user_ns in the struct user. You can see that will mess up the refcounting. Fortunately, once the user_ns is available from tsk-user, we don't need it in nsproxy. So here is how the refcounting *should* be done: The task pins the user struct. The user struct pins its user namespace. The user namespace pins the struct user which created it. A user namespace now doesn't need to release its userids, because it is only released when its last user disappears. This patch makes those changes. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- include/linux/init_task.h |1 - include/linux/key.h|3 ++ include/linux/nsproxy.h|1 - include/linux/sched.h |1 + include/linux/user_namespace.h | 10 +++- kernel/fork.c |3 +- kernel/nsproxy.c | 10 +--- kernel/sys.c |4 +- kernel/user.c | 42 +-- kernel/user_namespace.c| 36 + security/keys/process_keys.c |7 +- 11 files changed, 39 insertions(+), 79 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 021d8e7..550058b 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -57,7 +57,6 @@ extern struct nsproxy init_nsproxy; .mnt_ns = NULL, \ INIT_NET_NS(net_ns) \ INIT_IPC_NS(ipc_ns) \ - .user_ns= init_user_ns,\ } #define INIT_SIGHAND(sighand) { \ diff --git a/include/linux/key.h b/include/linux/key.h index c45c962..ba53aef 100644 --- a/include/linux/key.h +++ b/include/linux/key.h @@ -277,6 +277,8 @@ extern ctl_table key_sysctls[]; * the userspace interface */ extern void switch_uid_keyring(struct user_struct *new_user); +extern void switch_uid_keyring_task(struct task_struct *tsk, + struct user_struct *new_user); extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk); extern int copy_thread_group_keys(struct task_struct *tsk); extern void exit_keys(struct task_struct *tsk); @@ -305,6 +307,7 @@ extern void key_init(void); #define key_ref_to_ptr(k) ({ NULL; }) #define is_key_possessed(k)0 #define switch_uid_keyring(u) do { } while(0) +#define switch_uid_keyring_task(t,u) do { } while(0) #define __install_session_keyring(t, k)({ NULL; }) #define copy_keys(f,t) 0 #define copy_thread_group_keys(t) 0 diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h index c8a768e..afad7de 100644 --- a/include/linux/nsproxy.h +++ b/include/linux/nsproxy.h @@ -27,7 +27,6 @@ struct nsproxy { struct ipc_namespace *ipc_ns; struct mnt_namespace *mnt_ns; struct pid_namespace *pid_ns; - struct user_namespace *user_ns; struct net *net_ns; }; extern struct nsproxy init_nsproxy; diff --git a/include/linux/sched.h b/include/linux/sched.h index b0fe15a..a2f1356 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1715,6 +1715,7 @@ static inline struct user_struct *get_uid(struct user_struct *u) } extern void free_uid(struct user_struct *); extern void switch_uid(struct user_struct *); +extern void task_switch_uid(struct task_struct *tsk, struct user_struct *); extern void release_uids(struct user_namespace *ns); #include asm/current.h diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index
[Devel] [PATCH 10/10] userns: add support for readdir
Now ls works correctly inside a userns! (but don't go doing some sort of setattr like 'chown' :) Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/ext3/file.c |4 fs/ext3/inode.c | 22 ++ fs/ext3/namei.c |3 +++ fs/ext3/xattr.c |6 ++ lib/fsuserns.c | 42 +- 5 files changed, 64 insertions(+), 13 deletions(-) diff --git a/fs/ext3/file.c b/fs/ext3/file.c index acc4913..b259061 100644 --- a/fs/ext3/file.c +++ b/fs/ext3/file.c @@ -106,6 +106,9 @@ force_commit: return ret; } +extern int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry, + struct kstat *stat); + const struct file_operations ext3_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -134,5 +137,6 @@ const struct inode_operations ext3_file_inode_operations = { .removexattr= generic_removexattr, #endif .permission = ext3_permission, + .getattr= ext3_getattr, }; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 507d868..b252490 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -36,6 +36,7 @@ #include linux/mpage.h #include linux/uio.h #include linux/bio.h +#include linux/security.h #include xattr.h #include acl.h @@ -3088,6 +3089,27 @@ err_out: return error; } +int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry, + struct kstat *stat) +{ + struct inode *inode = dentry-d_inode; + int retval; + uid_t uid; + gid_t gid; + + retval = security_inode_getattr(mnt, dentry); + if (retval) + return retval; + + generic_fillattr(inode, stat); + + retval = s_convert_uid_gid(inode, current-user-user_ns, uid, gid); + if (retval == 1) { + stat-uid = uid; + stat-gid = gid; + } + return 0; +} /* * How many blocks doth make a writepage()? diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c index e5be4bc..fe7350b 100644 --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -2410,6 +2410,8 @@ end_rename: return retval; } +extern int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry, + struct kstat *stat); /* * directories can handle most operations... */ @@ -2431,6 +2433,7 @@ const struct inode_operations ext3_dir_inode_operations = { .removexattr= generic_removexattr, #endif .permission = ext3_permission, + .getattr= ext3_getattr, }; const struct inode_operations ext3_special_inode_operations = { diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c index 500fec7..cf7dc63 100644 --- a/fs/ext3/xattr.c +++ b/fs/ext3/xattr.c @@ -345,6 +345,7 @@ ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_size) error = ext3_xattr_block_get(inode, name_index, name, value, value_size); up_read(EXT3_I(inode)-xattr_sem); + printk(KERN_NOTICE %s: returning %d for %lu\n, __func__, error, inode-i_ino); return error; } @@ -1102,7 +1103,12 @@ retry: error = PTR_ERR(handle); } else { int error2; + char *buf; + int i; + printk(KERN_NOTICE %s: writing %d bytes:\n, __func__, value_len); + for (i=0, buf = (char *)value; ivalue_len; i++,buf++) + printk(KERN_NOTICE %s: %d %x\n, __func__, i, (int)*buf); error = ext3_xattr_set_handle(handle, inode, name_index, name, value, value_len, 0); error2 = ext3_journal_stop(handle); diff --git a/lib/fsuserns.c b/lib/fsuserns.c index ac0ca99..f0be780 100644 --- a/lib/fsuserns.c +++ b/lib/fsuserns.c @@ -179,7 +179,7 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode, struct fsuserns_conversion_table *t; struct fsuserns_table_entries *ep; size_t valuelen; - struct unsstore *unsstore; + struct unsstore *unsstore = NULL; int i, ret; t = find_table(sb); @@ -196,23 +196,33 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode, convert: mutex_unlock(fsuserns_table_mutex); - /* look for an xattr */ - /* yes, 3 needs to be made adjustable */ - valuelen = 3; - unsstore = kzalloc(3*sizeof(struct unsstore), GFP_KERNEL); + ret = xattrget(inode, NULL, 0); + if (ret = 0) + goto notfound; + valuelen = ret; + unsstore = kzalloc(ret, GFP_KERNEL); ret = xattrget(inode, unsstore, valuelen); - if (ret 0) - return ret; - for (i=0; i3; i++) + if (ret = 0) + goto notfound; + for (i=0; i(ret/sizeof(*unsstore)); i++) { +
[Devel] [PATCH 06/10] user namespaces: hook fs/attr.c
Hook fs/attr.c so things like chown are properly handled. Note this is only for permission checks. We'll need to hook ext3_setattr to get the right uids updated. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/attr.c | 28 +++- include/linux/sched.h |1 + 2 files changed, 20 insertions(+), 9 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 26c71ba..072d367 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -22,6 +22,11 @@ int inode_change_ok(struct inode *inode, struct iattr *attr) { int retval = -EPERM; unsigned int ia_valid = attr-ia_valid; + uid_t iuid; + gid_t igid; + + if (!s_convert_uid_gid(inode, current_userns(), iuid, igid)) + return -EPERM; /* If force is set do it anyway. */ if (ia_valid ATTR_FORCE) @@ -29,30 +34,30 @@ int inode_change_ok(struct inode *inode, struct iattr *attr) /* Make sure a caller can chown. */ if ((ia_valid ATTR_UID) - (current-fsuid != inode-i_uid || -attr-ia_uid != inode-i_uid) !capable(CAP_CHOWN)) + (current-fsuid != iuid || +attr-ia_uid != iuid) !s_is_capable(inode, current_userns(), CAP_CHOWN)) goto error; /* Make sure caller can chgrp. */ if ((ia_valid ATTR_GID) - (current-fsuid != inode-i_uid || - (!in_group_p(attr-ia_gid) attr-ia_gid != inode-i_gid)) - !capable(CAP_CHOWN)) + (current-fsuid != iuid || + (!in_group_p(attr-ia_gid) attr-ia_gid != igid)) + !s_is_capable(inode, current_userns(), CAP_CHOWN)) goto error; /* Make sure a caller can chmod. */ if (ia_valid ATTR_MODE) { - if (!is_owner_or_cap(inode)) + if (current-fsuid != iuid || !s_is_capable(inode, current_userns(), CAP_FOWNER)) goto error; /* Also check the setgid bit! */ if (!in_group_p((ia_valid ATTR_GID) ? attr-ia_gid : - inode-i_gid) !capable(CAP_FSETID)) + igid) !s_is_capable(inode, current_userns(), CAP_FSETID)) attr-ia_mode = ~S_ISGID; } /* Check for setting the inode time. */ if (ia_valid (ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET)) { - if (!is_owner_or_cap(inode)) + if (current-fsuid != iuid || !s_is_capable(inode, current_userns(), CAP_FOWNER)) goto error; } fine: @@ -66,6 +71,11 @@ EXPORT_SYMBOL(inode_change_ok); int inode_setattr(struct inode * inode, struct iattr * attr) { unsigned int ia_valid = attr-ia_valid; + uid_t iuid; + gid_t igid; + + if (!s_convert_uid_gid(inode, current_userns(), iuid, igid)) + return -EPERM; if (ia_valid ATTR_SIZE attr-ia_size != i_size_read(inode)) { @@ -90,7 +100,7 @@ int inode_setattr(struct inode * inode, struct iattr * attr) if (ia_valid ATTR_MODE) { umode_t mode = attr-ia_mode; - if (!in_group_p(inode-i_gid) !capable(CAP_FSETID)) + if (!in_group_p(igid) !s_is_capable(inode, current_userns(), CAP_FSETID)) mode = ~S_ISGID; inode-i_mode = mode; } diff --git a/include/linux/sched.h b/include/linux/sched.h index a2f1356..f557535 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -611,6 +611,7 @@ struct user_struct { #endif }; +#define current_userns() (current-user-user_ns) extern int uids_sysfs_init(void); extern struct user_struct *find_user(uid_t); -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 09/10] userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns
userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/ext3/super.c| 11 +-- fs/ext3/xattr.c| 19 ++- fs/ext3/xattr.h|3 ++- include/linux/fs.h |2 +- lib/fsuserns.c | 48 +--- 5 files changed, 71 insertions(+), 12 deletions(-) diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 3458d25..37c8404 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -723,11 +723,18 @@ static struct quotactl_ops ext3_qctl_operations = { extern int fsuserns_add_userns(struct super_block *sb, struct user *user, void *data); extern int fsuserns_convert_uid_gid(struct user_namespace *ns, - struct inode *inode, uid_t *retuid, gid_t *retgid); + struct inode *inode, uid_t *retuid, gid_t *retgid, + int (*xattrget)(struct inode *, const void *, size_t)); extern int fsuserns_is_capable(struct user_namespace *ns, struct inode *inode, int cap); #endif +int ext3_convert_uid_gid(struct user_namespace *ns, struct inode *inode, +uid_t *retuid, gid_t *retgid) +{ + return fsuserns_convert_uid_gid(ns, inode, retuid, retgid, ext3_xattr_get_userns); +} + static const struct super_operations ext3_sops = { .alloc_inode= ext3_alloc_inode, .destroy_inode = ext3_destroy_inode, @@ -750,7 +757,7 @@ static const struct super_operations ext3_sops = { #ifdef CONFIG_USER_NS .add_userns = fsuserns_add_userns, .is_capable = fsuserns_is_capable, - .convert_uid_gid = fsuserns_convert_uid_gid, + .convert_uid_gid = ext3_convert_uid_gid, #endif }; diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c index da47c35..500fec7 100644 --- a/fs/ext3/xattr.c +++ b/fs/ext3/xattr.c @@ -331,6 +331,23 @@ ext3_xattr_get(struct inode *inode, int name_index, const char *name, return error; } +int +ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_size) +{ + int error; + int name_index = EXT3_XATTR_INDEX_SECURITY; + char *name = userns; + + down_read(EXT3_I(inode)-xattr_sem); + error = ext3_xattr_ibody_get(inode, name_index, name, value, +value_size); + if (error == -ENODATA) + error = ext3_xattr_block_get(inode, name_index, name, value, +value_size); + up_read(EXT3_I(inode)-xattr_sem); + return error; +} + static int ext3_xattr_list_entries(struct inode *inode, struct ext3_xattr_entry *entry, char *buffer, size_t buffer_size) @@ -1087,7 +1104,7 @@ retry: int error2; error = ext3_xattr_set_handle(handle, inode, name_index, name, - value, value_len, flags); + value, value_len, 0); error2 = ext3_journal_stop(handle); if (error == -ENOSPC ext3_should_retry_alloc(inode-i_sb, retries)) diff --git a/fs/ext3/xattr.h b/fs/ext3/xattr.h index 8a523de..8c5b982 100644 --- a/fs/ext3/xattr.h +++ b/fs/ext3/xattr.h @@ -70,7 +70,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, void *, size_t); extern int ext3_xattr_set(struct inode *, int, const char *, const void *, size_t, int); extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int); -extern int ext3_xattr_set_userns(struct inode *inode, const void *value, size_t value_len, int flags); +extern int ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_len); +extern int ext3_xattr_set_userns(struct inode *inode, const void *value, size_t value_len); extern void ext3_xattr_delete_inode(handle_t *, struct inode *); extern void ext3_xattr_put_super(struct super_block *); diff --git a/include/linux/fs.h b/include/linux/fs.h index 492abef..9ec6dac 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1329,7 +1329,7 @@ struct super_operations { void (*umount_begin) (struct super_block *); int (*add_userns) (struct super_block *, struct user_struct *, void *); int (*is_capable) (struct user_namespace *, struct inode *, int); - uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *, + int (*convert_uid_gid)(struct user_namespace *, struct inode *, uid_t *, gid_t *); int (*show_options)(struct seq_file *, struct vfsmount *); diff --git a/lib/fsuserns.c b/lib/fsuserns.c index db70970..ac0ca99 100644 --- a/lib/fsuserns.c +++ b/lib/fsuserns.c @@ -59,6 +59,12 @@ struct fsuserns_conversion_table { LIST_HEAD(fsuserns_tables); +struct unsstore { + int ns; + uid_t uid; + gid_t gid; +}; + struct fsuserns_conversion_table
[Devel] [PATCH 07/10] user namespaces: bad bad bad but test code
Let uid 0 in a child namespace whose creator owns a file, access that file. This of course means that user hallyn (if he is allowed to remount / for his userns, i.e. through capset cap_sys_admin=ep usernsremount can create files owned by root. So this is only so we can play. This code will be removed in favor of code doing the right thing using extended attributes. Then, when the above user creates a file, the inode-iuid will be set to 500 (hallyn), and an xattr named fs.userns=(nsid,0) will store the fact that in the given nsid (might be 1 for instance) uid 0 owns the file. Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED] --- lib/fsuserns.c |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/lib/fsuserns.c b/lib/fsuserns.c index 0a9f52d..c237d1d 100644 --- a/lib/fsuserns.c +++ b/lib/fsuserns.c @@ -185,6 +185,15 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode, convert: mutex_unlock(fsuserns_table_mutex); + /* The following is BAD CODE. IT's for testing only */ + if (current-uid == 0) { + if (inode-i_uid == ns-creator-uid) { + *retuid = 0; + *retgid = 0; + return 1; + } + } + /* * ok now we would look through the xattrs for the * inode to find a stored uid in this namespace. -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 08/10] userns: store child userns uids as xattrs in ext3 using lib/fsuserns
userns: store child userns uids as xattrs in ext3 using lib/fsuserns Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/ext3/namei.c|7 +++- fs/ext3/xattr.c| 29 +++ fs/ext3/xattr.h|2 + include/linux/user_namespace.h |1 + kernel/user_namespace.c|1 + lib/fsuserns.c | 74 6 files changed, 113 insertions(+), 1 deletions(-) diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c index de13e91..e5be4bc 100644 --- a/fs/ext3/namei.c +++ b/fs/ext3/namei.c @@ -1676,6 +1676,9 @@ static int ext3_add_nondir(handle_t *handle, return err; } +int fsuserns_store_creds(struct inode *inode, struct user_struct *user, + int (*)(struct inode *inode, const void *value, size_t value_len)); + /* * By the time this is called, we already have created * the directory cache entry for the new file, but it @@ -1707,7 +1710,9 @@ retry: inode-i_op = ext3_file_inode_operations; inode-i_fop = ext3_file_operations; ext3_set_aops(inode); - err = ext3_add_nondir(handle, dentry, inode); + err = fsuserns_store_creds(inode, current-user, ext3_xattr_set_userns); + if (!err) + err = ext3_add_nondir(handle, dentry, inode); } ext3_journal_stop(handle); if (err == -ENOSPC ext3_should_retry_alloc(dir-i_sb, retries)) diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c index 175414a..da47c35 100644 --- a/fs/ext3/xattr.c +++ b/fs/ext3/xattr.c @@ -1070,6 +1070,35 @@ retry: return error; } +int +ext3_xattr_set_userns(struct inode *inode, + const void *value, size_t value_len) +{ + int name_index = EXT3_XATTR_INDEX_SECURITY; + handle_t *handle; + int error, retries = 0; + char *name = userns; + +retry: + handle = ext3_journal_start(inode, EXT3_DATA_TRANS_BLOCKS(inode-i_sb)); + if (IS_ERR(handle)) { + error = PTR_ERR(handle); + } else { + int error2; + + error = ext3_xattr_set_handle(handle, inode, name_index, name, + value, value_len, flags); + error2 = ext3_journal_stop(handle); + if (error == -ENOSPC + ext3_should_retry_alloc(inode-i_sb, retries)) + goto retry; + if (error == 0) + error = error2; + } + + return error; +} + /* * ext3_xattr_delete_inode() * diff --git a/fs/ext3/xattr.h b/fs/ext3/xattr.h index 148a4df..8a523de 100644 --- a/fs/ext3/xattr.h +++ b/fs/ext3/xattr.h @@ -70,6 +70,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, void *, size_t); extern int ext3_xattr_set(struct inode *, int, const char *, const void *, size_t, int); extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int); +extern int ext3_xattr_set_userns(struct inode *inode, const void *value, size_t value_len, int flags); + extern void ext3_xattr_delete_inode(handle_t *, struct inode *); extern void ext3_xattr_put_super(struct super_block *); diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 1b4959d..a793263 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -14,6 +14,7 @@ struct user_namespace { struct hlist_head uidhash_table[UIDHASH_SZ]; struct user_struct *root_user; struct user_struct *creator; + gid_t creator_grp; }; extern struct user_namespace init_user_ns; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 39aea7b..879693a 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -51,6 +51,7 @@ int create_new_userns(int flags, struct task_struct *tsk) put_user_ns(ns); task_switch_uid(tsk, ns-root_user); + ns-creator_grp = tsk-gid; tsk-uid = tsk-euid = tsk-suid = tsk-fsuid = 0; tsk-gid = tsk-egid = tsk-sgid = tsk-fsgid = 0; diff --git a/lib/fsuserns.c b/lib/fsuserns.c index c237d1d..db70970 100644 --- a/lib/fsuserns.c +++ b/lib/fsuserns.c @@ -5,6 +5,7 @@ #include linux/fs.h #include linux/user.h #include linux/user_namespace.h +#include linux/xattr.h /* * Ok, eventually I'll want some policy loaded which looks as follows: @@ -233,3 +234,76 @@ convert: printk(KERN_NOTICE %s: oh, but I wasn't capable(%d)\n, __func__, cap); return 0; } + +/* + * when a user_namespace is registered with an fs, we store the + * nsid. This next fn will need to retreive an nsid for a + * given fs (inode-i_sb, that is) and user_namespace + * + * I don't want to do that bookkeeping yet, so i just return 1 :) + */ +int find_ns_id(struct inode *inode, struct user_namespace *ns) +{ + struct fsuserns_conversion_table *t; +
[Devel] Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls
On Fri, 2008-08-22 at 12:32 -0700, Dave Hansen wrote: On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote: 6/unistd_32.h index d739467..88bdec4 100644 --- a/include/asm-x86/unistd_32.h +++ b/include/asm-x86/unistd_32.h @@ -338,6 +338,8 @@ #define __NR_dup3 330 #define __NR_pipe2331 #define __NR_inotify_init1332 +#define __NR_checkpoint333 +#define __NR_restart 334 #ifdef __KERNEL__ Uh oh. These got whitespace mangled somehow. I'll look through them, but probably can't produce patches on top of them for now. This patch also has to go *AFTER* you actually define the syscall functions. Otherwise, this won't be build-bisectable when it gets committed. -- Dave ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
Serge E. Hallyn [EMAIL PROTECTED] writes: Add a user_ns to the sb. It is always set to the user_ns of the task which mounted the sb. Define 3 new super_operations. convert_uid() and convert_gid() take a uid or gid from an inode on the sb's fs, and attempt to convert them into ids meaningful in the user namespace passed in, which presumably is current's. is_capable() checks whether current has the requested capability with respect to the sb's fs, which is dependent upon the relationship between current's user_ns and those which the sb knows about. If the sb isn't user ns aware - which none are right now - then the new super_operations won't be defined. If in that case current and sb have different user namespaces, then the userids can't be compared. If current's and sb's userids can't be compared, then current will get 'user other' (we usually say 'nobody') permissions to the inode. Changelog: Aug 5: send the whole inode to the super_operations. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/namei.c | 29 +++--- fs/super.c |3 ++ include/linux/fs.h | 55 3 files changed, 83 insertions(+), 4 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 4ea63ed..6bf5446 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask, int (*check_acl)(struct inode *inode, int mask)) { umode_t mode = inode-i_mode; + uid_t iuid; + gid_t igid; + int ret; + int good_userns = 1; + + ret = s_convert_uid_gid(inode, current-user-user_ns, + iuid, igid); + if (!ret) + good_userns = 0; mask = MAY_READ | MAY_WRITE | MAY_EXEC; - if (current-fsuid == inode-i_uid) + /* + * If we're not in the inode's user namespace, we get + * user nobody permissions, and we ignore acls + * (bc serge doesn't know how to handle acls in this case) + */ + if (!good_userns) + goto check; + + if (current-fsuid == iuid) mode = 6; else { if (IS_POSIXACL(inode) (mode S_IRWXG) check_acl) { @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask, return error; } - if (in_group_p(inode-i_gid)) + if (in_group_p(igid)) mode = 3; } +check: /* * If the DACs are ok we don't need any capability check. */ if ((mask ~mode) == 0) return 0; + if (!good_userns) + return -EACCES; check_capabilities: /* @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask, */ if (!(mask MAY_EXEC) || (inode-i_mode S_IXUGO) || S_ISDIR(inode-i_mode)) - if (capable(CAP_DAC_OVERRIDE)) + if (s_is_capable(inode, current-user-user_ns, + CAP_DAC_OVERRIDE)) return 0; /* * Searching includes executable on directories, else just read. */ if (mask == MAY_READ || (S_ISDIR(inode-i_mode) !(mask MAY_WRITE))) - if (capable(CAP_DAC_READ_SEARCH)) + if (s_is_capable(inode, current-user-user_ns, CAP_DAC_READ_SEARCH)) return 0; return -EACCES; diff --git a/fs/super.c b/fs/super.c index e931ae9..3559637 100644 --- a/fs/super.c +++ b/fs/super.c @@ -38,6 +38,7 @@ #include linux/kobject.h #include linux/mutex.h #include linux/file.h +#include linux/user_namespace.h #include asm/uaccess.h #include internal.h @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct file_system_type *type) s-s_qcop = sb_quotactl_ops; s-s_op = default_op; s-s_time_gran = 10; + s-user_ns = get_user_ns(current-user-user_ns); } out: return s; @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s) security_sb_free(s); kfree(s-s_subtype); kfree(s-s_options); + put_user_ns(s-user_ns); kfree(s); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..bb58c2e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1128,6 +1128,11 @@ struct super_block { * generic_show_options() */ char *s_options; + + /* + * namespace of the user which mounted the sb + */ + struct user_namespace *user_ns; We should make it clear that this is a default, and not something a filesystem must use, just something a filesystem may use. }; extern struct timespec current_fs_time(struct
[Devel] Re: [0/10] User namespaces: introduction
Serge E. Hallyn [EMAIL PROTECTED] writes: Hi Eric, so here is a start to a userns patchset trying to follow your ideas about how to have user namespaces and filesystems interact. Ignore the bookkeeping crap or you'll pull your hair out. Lots of stuff remains unimplemented - i.e. chown (setattr) and proper handling of capabilities. But you can do some fun things with this patchset. I.e. (log in as root) setcap cap_sys_admin=ep ns_exec setcap cap_sys_admin=ep usernsmount ns_exec -U /bin/sh ls /root (fails) ls / (succeeds) (log in as hallyn) ns_exec -U /bin/sh id (uid=0, gid=0) ls (fails, can't descend /home/hallyn) usernsmount / nsid=4 ls (succeeds) touch ab ls -l ab (ab is owned by root) exit (we're logged in as hallyn in the init_user_ns again) ls -l ab (ab is owned by hallyn) The only supported fs is ext3. Only a few operations are supported. So if, above, when we are hallyn in the init_user_ns but root in the child user ns, when we create a file, it is properly handled, so inode-i_uid=500, but an xattr (nsid=4,uid=0) is added when we chown the file to root, it is not properly handled, so inode-i_uid = 0 it's just a matter of hooking all the places at this point. Capabilities remain a problem. Right now I think capabilities will need to be split up into system-wide caps, and container-safe caps. So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe. CAP_REBOOT may become container-safe one day, but for now is very much system-wide. So if I'm uid 500 on the host and create a user namespace where I'm uid=0, I should be able to acquire container-safe caps (perhaps contingent on whether I unshared all other namespaces), but not system-wide ones. Or, whether I can acquire them would depend on whether the suid bit was set in a user_ns or not. sigh. Serge at first glance this looks like a good start, especially for thinking through how things will work. It has just occurred to me that from a dependency point of view it makes an enormous amount of sense to sort out capable with respect to namespaces before we get to the filesystems. There is no one else working in the area of capabilities so there won't be conflicts, and we need a firm understanding of how capabilities are going to work with respect to namespaces before we start embedding the logic in filesystems. With respect to your separation of capabilities in namespaces I don't think you have quite grasped the simple idea that is sitting in my head and makes all of this clear. Let me see if I can explain it better. A fully qualified capability name would be of the form: userns:capability_name For each operation we will check for one specific capability. For the network namespace in particular we will check for: userns_of_network_namespace_creator:CAP_NET_ADMIN The check for a capability will succeed if: - We have the exact fully qualified capability. - We are outside the user namespace but are the owner of the user namespace. - We are outside the user namespace but have the appropriate capability over the owner of the user namespace CAP_PTRACE? This last test would recurses. I'm less certain than I like about which permissions we allow someone outside of a container to posses and still control the container. This has two very useful implications. - We can have all capabilities in a new user namespace and be completely impotent. - Allowing the capabilities of a user namespace to do something useful can come gradually. Which means we need to extend the classic capable check to become. capable(userns, capability). Or possibly we extend the capability parameter to be a structure that can hold both userns and the capability, whichever turns out to be more maintainable. Once we have done that we can allow something to be under the power of creator_user_ns:capability instead of init_user_ns:capability. So the CAP_SYS_REBOOT test will be init_user_ns:capability for the foreseeable future. While the CAP_NET_ADMIN test will shortly become creator_of_netns:CAP_NET_ADMIN. Of course none of that will happen until we relax the test to create a new namespace from init_user_ns:CAP_SYS_ADMIN to current_user_ns:CAP_SYS_ADMIN. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls
Dave Hansen wrote: On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote: 6/unistd_32.h index d739467..88bdec4 100644 --- a/include/asm-x86/unistd_32.h +++ b/include/asm-x86/unistd_32.h @@ -338,6 +338,8 @@ #define __NR_dup3 330 #define __NR_pipe2331 #define __NR_inotify_init1332 +#define __NR_checkpoint333 +#define __NR_restart 334 #ifdef __KERNEL__ Uh oh. These got whitespace mangled somehow. I'll look through them, but probably can't produce patches on top of them for now. Oops .. not sure what happened. I'll re-send after addressing the recent comments. Oren. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
Serge E. Hallyn [EMAIL PROTECTED] writes: Add a user_ns to the sb. It is always set to the user_ns of the task which mounted the sb. Define 3 new super_operations. convert_uid() and convert_gid() take a uid or gid from an inode on the sb's fs, and attempt to convert them into ids meaningful in the user namespace passed in, which presumably is current's. is_capable() checks whether current has the requested capability with respect to the sb's fs, which is dependent upon the relationship between current's user_ns and those which the sb knows about. If the sb isn't user ns aware - which none are right now - then the new super_operations won't be defined. If in that case current and sb have different user namespaces, then the userids can't be compared. If current's and sb's userids can't be compared, then current will get 'user other' (we usually say 'nobody') permissions to the inode. Changelog: Aug 5: send the whole inode to the super_operations. Serge while I am dubious about adding the methods you have added to super_block_operations there is one very important question you are using them for that we do need to ask. Does this filesystem map to user namespace X. Can we separate that concept out as it's own individual function and initially have a noop implementation that returns false for everything except for the initial network namespace. By doing just that we should be able to capture and update all of the tests without the rest of the hassle. Possibly this is just your s_convert_uid_gid function without calling the superblock option. I would really like to get to the point where we can create a user namespace without privilege and have it be safe (if impotent). Once we do that we can extend out what the root of a user namespace is allowed to do, much as we have done with the network namespace. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [RFC v2][PATCH 4/9] Memory management - dump state
Thanks Louis for all the comments. Will fix in v3. Oren. Louis Rilling wrote: On Wed, Aug 20, 2008 at 11:05:15PM -0400, Oren Laadan wrote: For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped, it will be followed by the file name. The cr_vma-npages will tell how many pages were dumped for this VMA. Then it will be followed by the actual data: first a dump of the addresses of all dumped pages (npages entries) followed by a dump of the contents of all dumped pages (npages pages). Then will come the next VMA and so on. [...] diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c new file mode 100644 index 000..a23aa29 --- /dev/null +++ b/checkpoint/ckpt_mem.c [...] +/** + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma + * @ctx - checkpoint context + * @pgarr - page-array to fill + * @vma - vma to scan + * @start - start address (updated) + */ +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr, + struct vm_area_struct *vma, unsigned long *start) +{ +unsigned long end = vma-vm_end; +unsigned long addr = *start; +struct page **pagep; +unsigned long *addrp; +int cow, nr, ret = 0; + +nr = pgarr-nleft; +pagep = pgarr-pages[pgarr-nused]; +addrp = pgarr-addrs[pgarr-nused]; +cow = !!vma-vm_file; + +while (addr end) { +struct page *page; + +/* simplified version of get_user_pages(): already have vma, +* only need FOLL_TOUCH, and (for now) ignore fault stats */ + +cond_resched(); +while (!(page = follow_page(vma, addr, FOLL_TOUCH))) { +ret = handle_mm_fault(vma-vm_mm, vma, addr, 0); +if (ret VM_FAULT_ERROR) { +if (ret VM_FAULT_OOM) +ret = -ENOMEM; +else if (ret VM_FAULT_SIGBUS) +ret = -EFAULT; +else +BUG(); +break; +} + ret = 0; +cond_resched(); +} + +if (IS_ERR(page)) { +ret = PTR_ERR(page); +break; +} Need to check ret here: + if (ret) break; + +if (page == ZERO_PAGE(0)) +page = NULL;/* zero page: ignore */ +else if (cow page_mapping(page) != NULL) +page = NULL;/* clean cow: ignore */ +else { +get_page(page); +*(addrp++) = addr; +*(pagep++) = page; +if (--nr == 0) { +addr += PAGE_SIZE; +break; +} +} + +addr += PAGE_SIZE; +} + +if (unlikely(ret 0)) { +nr = pgarr-nleft - nr; +while (nr--) +page_cache_release(*(--pagep)); +return ret; +} + +*start = addr; +return (pgarr-nleft - nr); +} [...] Thanks, Louis ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Serge E. Hallyn [EMAIL PROTECTED] writes: Add a user_ns to the sb. It is always set to the user_ns of the task which mounted the sb. Define 3 new super_operations. convert_uid() and convert_gid() take a uid or gid from an inode on the sb's fs, and attempt to convert them into ids meaningful in the user namespace passed in, which presumably is current's. is_capable() checks whether current has the requested capability with respect to the sb's fs, which is dependent upon the relationship between current's user_ns and those which the sb knows about. If the sb isn't user ns aware - which none are right now - then the new super_operations won't be defined. If in that case current and sb have different user namespaces, then the userids can't be compared. If current's and sb's userids can't be compared, then current will get 'user other' (we usually say 'nobody') permissions to the inode. Changelog: Aug 5: send the whole inode to the super_operations. Serge while I am dubious about adding the methods you have added to super_block_operations there is one very important question you are using them for that we do need to ask. Does this filesystem map to user namespace X. By itself that is not sufficient. We need to support two inodes on the same fs where both have i_uid=500 on the host fs, while in user namespace X one is owned by uid 0, and another by uid 1000. So we need to be able to pass the filesystem an inode and a user namespace, and ask for the owning uid and gids. Or am I (I likely am) misunderstanding? -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Serge E. Hallyn [EMAIL PROTECTED] writes: Add a user_ns to the sb. It is always set to the user_ns of the task which mounted the sb. Define 3 new super_operations. convert_uid() and convert_gid() take a uid or gid from an inode on the sb's fs, and attempt to convert them into ids meaningful in the user namespace passed in, which presumably is current's. is_capable() checks whether current has the requested capability with respect to the sb's fs, which is dependent upon the relationship between current's user_ns and those which the sb knows about. If the sb isn't user ns aware - which none are right now - then the new super_operations won't be defined. If in that case current and sb have different user namespaces, then the userids can't be compared. If current's and sb's userids can't be compared, then current will get 'user other' (we usually say 'nobody') permissions to the inode. Changelog: Aug 5: send the whole inode to the super_operations. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/namei.c | 29 +++--- fs/super.c |3 ++ include/linux/fs.h | 55 3 files changed, 83 insertions(+), 4 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 4ea63ed..6bf5446 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask, int (*check_acl)(struct inode *inode, int mask)) { umode_t mode = inode-i_mode; + uid_t iuid; + gid_t igid; + int ret; + int good_userns = 1; + + ret = s_convert_uid_gid(inode, current-user-user_ns, + iuid, igid); + if (!ret) + good_userns = 0; mask = MAY_READ | MAY_WRITE | MAY_EXEC; - if (current-fsuid == inode-i_uid) + /* +* If we're not in the inode's user namespace, we get +* user nobody permissions, and we ignore acls +* (bc serge doesn't know how to handle acls in this case) +*/ + if (!good_userns) + goto check; + + if (current-fsuid == iuid) mode = 6; else { if (IS_POSIXACL(inode) (mode S_IRWXG) check_acl) { @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask, return error; } - if (in_group_p(inode-i_gid)) + if (in_group_p(igid)) mode = 3; } +check: /* * If the DACs are ok we don't need any capability check. */ if ((mask ~mode) == 0) return 0; + if (!good_userns) + return -EACCES; check_capabilities: /* @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask, */ if (!(mask MAY_EXEC) || (inode-i_mode S_IXUGO) || S_ISDIR(inode-i_mode)) - if (capable(CAP_DAC_OVERRIDE)) + if (s_is_capable(inode, current-user-user_ns, + CAP_DAC_OVERRIDE)) return 0; /* * Searching includes executable on directories, else just read. */ if (mask == MAY_READ || (S_ISDIR(inode-i_mode) !(mask MAY_WRITE))) - if (capable(CAP_DAC_READ_SEARCH)) + if (s_is_capable(inode, current-user-user_ns, CAP_DAC_READ_SEARCH)) return 0; return -EACCES; diff --git a/fs/super.c b/fs/super.c index e931ae9..3559637 100644 --- a/fs/super.c +++ b/fs/super.c @@ -38,6 +38,7 @@ #include linux/kobject.h #include linux/mutex.h #include linux/file.h +#include linux/user_namespace.h #include asm/uaccess.h #include internal.h @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct file_system_type *type) s-s_qcop = sb_quotactl_ops; s-s_op = default_op; s-s_time_gran = 10; + s-user_ns = get_user_ns(current-user-user_ns); } out: return s; @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s) security_sb_free(s); kfree(s-s_subtype); kfree(s-s_options); + put_user_ns(s-user_ns); kfree(s); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..bb58c2e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1128,6 +1128,11 @@ struct super_block { * generic_show_options() */ char *s_options; + + /* +* namespace of the user which mounted the sb +*/ + struct user_namespace *user_ns; We should make it clear that this is a default, and not something a filesystem must use, just something a filesystem may use.
[Devel] Re: [0/10] User namespaces: introduction
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Serge E. Hallyn [EMAIL PROTECTED] writes: Hi Eric, so here is a start to a userns patchset trying to follow your ideas about how to have user namespaces and filesystems interact. Ignore the bookkeeping crap or you'll pull your hair out. Lots of stuff remains unimplemented - i.e. chown (setattr) and proper handling of capabilities. But you can do some fun things with this patchset. I.e. (log in as root) setcap cap_sys_admin=ep ns_exec setcap cap_sys_admin=ep usernsmount ns_exec -U /bin/sh ls /root (fails) ls / (succeeds) (log in as hallyn) ns_exec -U /bin/sh id (uid=0, gid=0) ls (fails, can't descend /home/hallyn) usernsmount / nsid=4 ls (succeeds) touch ab ls -l ab (ab is owned by root) exit (we're logged in as hallyn in the init_user_ns again) ls -l ab (ab is owned by hallyn) The only supported fs is ext3. Only a few operations are supported. So if, above, when we are hallyn in the init_user_ns but root in the child user ns, when we create a file, it is properly handled, so inode-i_uid=500, but an xattr (nsid=4,uid=0) is added when we chown the file to root, it is not properly handled, so inode-i_uid = 0 it's just a matter of hooking all the places at this point. Capabilities remain a problem. Right now I think capabilities will need to be split up into system-wide caps, and container-safe caps. So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe. CAP_REBOOT may become container-safe one day, but for now is very much system-wide. So if I'm uid 500 on the host and create a user namespace where I'm uid=0, I should be able to acquire container-safe caps (perhaps contingent on whether I unshared all other namespaces), but not system-wide ones. Or, whether I can acquire them would depend on whether the suid bit was set in a user_ns or not. sigh. Serge at first glance this looks like a good start, especially for thinking through how things will work. It has just occurred to me that from a dependency point of view it makes an enormous amount of sense to sort out capable with respect to namespaces before we get to the filesystems. There is no one else working in the area of capabilities so there won't be conflicts, and we need a firm understanding of how capabilities are going to work with respect to namespaces before we start embedding the logic in filesystems. With respect to your separation of capabilities in namespaces I don't think you have quite grasped the simple idea that is sitting in my head and makes all of this clear. Let me see if I can explain it better. A fully qualified capability name would be of the form: userns:capability_name For each operation we will check for one specific capability. For the network namespace in particular we will check for: userns_of_network_namespace_creator:CAP_NET_ADMIN The check for a capability will succeed if: - We have the exact fully qualified capability. - We are outside the user namespace but are the owner of the user namespace. - We are outside the user namespace but have the appropriate capability over the owner of the user namespace CAP_PTRACE? This last test would recurses. I'm less certain than I like about which permissions we allow someone outside of a container to posses and still control the container. This has two very useful implications. - We can have all capabilities in a new user namespace and be completely impotent. - Allowing the capabilities of a user namespace to do something useful can come gradually. Which means we need to extend the classic capable check to become. capable(userns, capability). Or possibly we extend the capability parameter to be a structure that can hold both userns and the capability, whichever turns out to be more maintainable. Once we have done that we can allow something to be under the power of creator_user_ns:capability instead of init_user_ns:capability. So the CAP_SYS_REBOOT test will be init_user_ns:capability for the foreseeable future. While the CAP_NET_ADMIN test will shortly become creator_of_netns:CAP_NET_ADMIN. Of course none of that will happen until we relax the test to create a new namespace from init_user_ns:CAP_SYS_ADMIN to current_user_ns:CAP_SYS_ADMIN. Eric It definately seems to make sense in terms of the security implications. And solving this before the filesystem handlers seems to make sense too. Although I would like to get the first 3 patches upstream pretty soon, as I believe they are proper fixes. But wrt userns:capability, the problem that brings to mind is that of referring to the userns. Do we use the userspace-exported id, or do we use the actual in-kernel user_ns? If we use the in-kernel
[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
Serge E. Hallyn [EMAIL PROTECTED] writes: By itself that is not sufficient. We need to support two inodes on the same fs where both have i_uid=500 on the host fs, while in user namespace X one is owned by uid 0, and another by uid 1000. So we need to be able to pass the filesystem an inode and a user namespace, and ask for the owning uid and gids. Or am I (I likely am) misunderstanding? There are two questions. Does this filesystem provide mappings to user namespace X? What is the mapping from this filesystem to user namespace X? I think we may be able to separate those two questions. The important idea is that we don't need to implement filesystem changes in the first pass. Just have the permission check fail unconditionally if we are not in the init_user_ns. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Serge E. Hallyn [EMAIL PROTECTED] writes: By itself that is not sufficient. We need to support two inodes on the same fs where both have i_uid=500 on the host fs, while in user namespace X one is owned by uid 0, and another by uid 1000. So we need to be able to pass the filesystem an inode and a user namespace, and ask for the owning uid and gids. Or am I (I likely am) misunderstanding? There are two questions. Does this filesystem provide mappings to user namespace X? What is the mapping from this filesystem to user namespace X? That is where you and I still disagree: I don't believe that a mapping as such makes sense. A mapping implies uid 500 becomes uid 0, uid 501 becomes uid 1000, etc. That simply is not sufficient. If I am going to be able to create a new userns without privilege, then all files belonging to all uids in the child userns must be owned by uid 500 in my parent uidns. So I do think we will have to have something more of the format what is the owning uid and gid of this inode, in the context of user namespace X. Which was the s_convert_uid_gid() in my helpers. If we follow your userns:capability idea and then we probably can get rid of the s_is_capable you hated so much :) But, I think we may be able to separate those two questions. The important idea is that we don't need to implement filesystem changes in the first pass. Just have the permission check fail unconditionally if we are not in the init_user_ns. Eric There I definately agree. However, given this general direction, do you feel that marking an sb with a userns the way I'm doing in patch 4/10 makes sense, using it only to facilitate a basic is this non-userns-aware fs in the same userns as the task attempting to access it check? If not, what alternatives do we have, given that marking vfsmounts as we've discussed before won't be feasible? Of course by the time we get capabilities straightened out I suppose tagging vfsmounts might in fact be feasible :-) Thanks, Eric. -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
Serge E. Hallyn [EMAIL PROTECTED] writes: That might make sense. I guess the problem is I started trying to handle generic permission. But I don't need to... the fs can provide its own permission, else we do the simple userns check. Sounds right. We really need that simple starting place where we can turn it all on and you aren't allowed to do anything, then we can enable one feature at a time from there. Plus I can think of some uses for a user who isn't allowed to do anything on a machine. ;) The user ns comes from the current task. The fs must decide whether current-user-uid in current-user-user_ns has capabilities over the target file. Only the fs can erify that the user is really uid=0 or has CAP_DAC_OVERRIDE, for instance, with respect to the file. In the general case only the filesystem can make the call about what powers we have over a specific file and the filesystem will make that call when we pass it our kerberos ticket to the filesystem server, and ask the filesystem to perform a specific operation. In the common case the filesystem is just a reflection of our in memory set of namespaces. Getting the capability rules straight are going to be a little interesting. Essentially however for filesystems I believe the rule is if I have CAP_DAC_OVERRIDE in the user namespace of a user. And that user creates a namespace. Then I should have CAP_DAC_OVERRIDE capability over that namespace as well. Allowing me to back up files in the unprivileged users namespace. The common case is the only case that needs to be implemented in generic kernel code. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [0/10] User namespaces: introduction
Serge E. Hallyn [EMAIL PROTECTED] writes: It definately seems to make sense in terms of the security implications. And solving this before the filesystem handlers seems to make sense too. Although I would like to get the first 3 patches upstream pretty soon, as I believe they are proper fixes. Reasonable. I'm not certain about free_user continuing to be an inline function as it seems a bit non-trivial, but otherwise that sounds correct. But wrt userns:capability, the problem that brings to mind is that of referring to the userns. Do we use the userspace-exported id, or do we use the actual in-kernel user_ns? If we use the in-kernel user_ns, then we'd have to take a ref for each cap, yuck. But you had wanted to use 'mount' to only have filesystems associate userspace ids with the in-kernel struct user_ns, so that complicates the idea of having capabilities refer to those. I don't think so. In the standard security model there are only 2 intersections between the filesystem and the capabilities. - CAP_DAC_OVERRIDE. - The capabilities xattr on a filesystem. With a filesystem in exactly one user namespace at a time this is straight forward. With a filesystem in user namespaces at a time this is slightly more interesting. I believe the authentication algorithm becomes: Map the credentials on the filesystem inode into (fs_user_ns, fs_uid, fs_gid, fs_mode) Then to see if we have power over the file we test: capable(fs_user_ns, CAP_DAC_OVERRIDE). Then if current-user-user_ns != fs_user_ns we can do something like: uid = 0, gid = 0 mode clear except for the other bits. We want either 0 or another uid we have reserved for the purpose. I don't see why the mapping rules should not be universal so we can probably do all of the mapping foreign uid's and gid's in generic code and just place a unser_ns pointer into struct inode. Which makes things very close to how they are now and it means we can do the lookup of the user_ns when we cache the struct inode. Anyway I like the overall approach, and will think a bit about any other actual implementation issues. Thanks. It adds more complications then I like not having a view of the filesystem with a single user_namespace. However that appears to be necessary to deal with Al's inode_permission changes, and it seems to be where we are ultimately heading so it seems more honest. So I guess I have to bite the bullet and accept it. ;) For the case of a shared /usr just having other permission access should work fairly well. I just looked on my ubuntu system and I found only 36 suid executables and only one executable (fusermount) that was not world executable. And a shared usr is the only reasonable case I could think of where I would want a file to at least appear to have multiple owners. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
Serge E. Hallyn [EMAIL PROTECTED] writes: There are two questions. Does this filesystem provide mappings to user namespace X? What is the mapping from this filesystem to user namespace X? That is where you and I still disagree: I don't believe that a mapping as such makes sense. A mapping implies uid 500 becomes uid 0, uid 501 becomes uid 1000, etc. That simply is not sufficient. If I am going to be able to create a new userns without privilege, then all files belonging to all uids in the child userns must be owned by uid 500 in my parent uidns. So I do think we will have to have something more of the format what is the owning uid and gid of this inode, in the context of user namespace X. Which was the s_convert_uid_gid() in my helpers. That is what I meant by mapping. ;) Given my thought in my last message that we should cache the user_ns in the inode along with the rest of the credentials. Making all of the child files appear and act as if they are owned by uid 500 in the parent namespace should be straight forward. I think we may want a 1-1 mapping onto the filesystem. However I don't care very much how we store uids on the filesystem as long as we do something that works. If we follow your userns:capability idea and then we probably can get rid of the s_is_capable you hated so much :) yep. But, I think we may be able to separate those two questions. The important idea is that we don't need to implement filesystem changes in the first pass. Just have the permission check fail unconditionally if we are not in the init_user_ns. Eric There I definately agree. However, given this general direction, do you feel that marking an sb with a userns the way I'm doing in patch 4/10 makes sense, using it only to facilitate a basic is this non-userns-aware fs in the same userns as the task attempting to access it check? If not, what alternatives do we have, given that marking vfsmounts as we've discussed before won't be feasible? Of course by the time we get capabilities straightened out I suppose tagging vfsmounts might in fact be feasible :-) LOL Well I tell you what. Unless we can think of a reason for a file to live in more than one leaf user namespace. We should tag the inode with the leaf user namespace. We could tag the superblock with a default for the inode user namespace but that is less interesting. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel