Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
> > > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > > think that would make a lot of sense anyway.
> > >
> > > Would it be as simple as tagging the inodes with capability sets?  One
> > > set for writing, or one each for reading and writing?
> >
> > Yes, or something even simpler, like mapping the owner permission bits
> > to CAP_SYS_ADMIN.  There seem to be very few different permissions
> > under /proc/sys:
> >
> > --w-------
> > -r--r--r--
> > -rw-------
> > -rw-r--r--
> >
> > As long as the group and other bits are always the same, and we accept
> > that the owner bits really mean CAP_SYS_ADMIN and not something else,
>
> But I would assume some things under /proc/sys/net/ipv4 or
> /proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?

I guess so.  I'm not very familiar with the different capabilities :)

How about this patch then: a hybrid solution between just relying on
permission bits, and specifying separate capability sets for read and
write in addition to the permission bits.

Untested, the 'cap' field obviously still needs to be filled in where
appropriate.

Miklos

Index: linux/include/linux/sysctl.h
===================================================================
--- linux.orig/include/linux/sysctl.h	2008-02-04 12:29:01.0 +0100
+++ linux/include/linux/sysctl.h	2008-02-07 15:19:06.0 +0100
@@ -1041,6 +1041,7 @@ struct ctl_table
 	void *data;
 	int maxlen;
 	mode_t mode;
+	int cap;	/* Capability needed to read/write */
 	struct ctl_table *child;
 	struct ctl_table *parent;	/* Automatically set */
 	proc_handler *proc_handler;	/* Callback for text formatting */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2008-02-05 22:17:05.0 +0100
+++ linux/kernel/sysctl.c	2008-02-07 15:30:45.0 +0100
@@ -1527,14 +1527,26 @@ out:
  * some sysctl variables are readonly even to root.
  */
 
-static int test_perm(int mode, int op)
+static int test_perm(struct ctl_table *table, int op)
 {
-	if (!current->euid)
-		mode >>= 6;
-	else if (in_egroup_p(0))
-		mode >>= 3;
+	int cap = table->cap;
+	mode_t mode = table->mode;
+
+	if (!cap)
+		cap = CAP_SYS_ADMIN;
+
+	if ((op & MAY_READ) && !(mode & S_IRUGO))
+		return -EACCES;
+
+	if ((op & MAY_WRITE) && !(mode & S_IWUGO))
+		return -EACCES;
+
+	if (capable(cap))
+		return 0;
+
 	if ((mode & op & 0007) == op)
 		return 0;
+
 	return -EACCES;
 }
 
@@ -1544,7 +1556,7 @@ int sysctl_perm(struct ctl_table *table,
 	error = security_sysctl(table, op);
 	if (error)
 		return error;
-	return test_perm(table->mode, op);
+	return test_perm(table, op);
 }
 
 #ifdef CONFIG_SYSCTL_SYSCALL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
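The order of checks in the proposed test_perm() can be tried out in user space. The following is only a minimal model under stated assumptions, not kernel code: `struct ctl_entry`, `struct task`, `has_cap()`, `model_test_perm()` and the `MODEL_*` constants are hypothetical stand-ins for `struct ctl_table`, `current`, `capable()`, `test_perm()` and the kernel's MAY_READ/MAY_WRITE flags and capability numbers.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for MAY_WRITE/MAY_READ (same values as the rwx bits). */
#define MODEL_MAY_WRITE 0002
#define MODEL_MAY_READ  0004

/* Capability numbers used by this model only. */
enum { MODEL_CAP_NET_ADMIN = 12, MODEL_CAP_SYS_ADMIN = 21 };

struct ctl_entry {
	unsigned int mode;	/* permission bits, e.g. 0644 */
	int cap;		/* 0 means "fall back to CAP_SYS_ADMIN", as in the patch */
};

struct task {
	unsigned long long caps;	/* bitmask of capabilities held */
};

static int has_cap(const struct task *t, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (int)((t->caps >> cap) & 1ULL);
}

static int model_test_perm(const struct ctl_entry *e, const struct task *t, int op)
{
	int cap = e->cap ? e->cap : MODEL_CAP_SYS_ADMIN;

	/* No read bit set anywhere: unreadable even for a capable task. */
	if ((op & MODEL_MAY_READ) && !(e->mode & 0444))
		return -EACCES;

	/* No write bit set anywhere: read-only even for a capable task. */
	if ((op & MODEL_MAY_WRITE) && !(e->mode & 0222))
		return -EACCES;

	/* The capability replaces the old "euid == 0" test. */
	if (has_cap(t, cap))
		return 0;

	/* Everyone else is judged by the "other" bits only. */
	if ((e->mode & op & 0007) == op)
		return 0;

	return -EACCES;
}
```

In this model a 0644 entry is writable only by a task holding the entry's capability, a 0444 entry is not writable by anyone, and an entry tagged with the network capability is not writable by a task that holds only the default one.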
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > think that would make a lot of sense anyway.
> >
> > Would it be as simple as tagging the inodes with capability sets?  One
> > set for writing, or one each for reading and writing?
>
> Yes, or something even simpler, like mapping the owner permission bits
> to CAP_SYS_ADMIN.  There seem to be very few different permissions
> under /proc/sys:
>
> --w-------
> -r--r--r--
> -rw-------
> -rw-r--r--
>
> As long as the group and other bits are always the same, and we accept
> that the owner bits really mean CAP_SYS_ADMIN and not something else,

But I would assume some things under /proc/sys/net/ipv4 or
/proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?

> then the permission check would not need to look at uids or gids at
> all.
>
> Miklos
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > > Maybe sysctls just need to check capabilities, instead of uids.  I
> > > > > think that would make a lot of sense anyway.
> > > >
> > > > Would it be as simple as tagging the inodes with capability sets?  One
> > > > set for writing, or one each for reading and writing?
> > >
> > > Yes, or something even simpler, like mapping the owner permission bits
> > > to CAP_SYS_ADMIN.  There seem to be very few different permissions
> > > under /proc/sys:
> > >
> > > --w-------
> > > -r--r--r--
> > > -rw-------
> > > -rw-r--r--
> > >
> > > As long as the group and other bits are always the same, and we accept
> > > that the owner bits really mean CAP_SYS_ADMIN and not something else,
> >
> > But I would assume some things under /proc/sys/net/ipv4 or
> > /proc/sys/net/ath0 require CAP_NET_ADMIN rather than CAP_SYS_ADMIN?
>
> I guess so.  I'm not very familiar with the different capabilities :)
>
> How about this patch then: a hybrid solution between just relying on
> permission bits, and specifying separate capability sets for read and
> write in addition to the permission bits.
>
> Untested, the 'cap' field obviously still needs to be filled in where
> appropriate.
>
> Miklos
>
> Index: linux/include/linux/sysctl.h
> ===================================================================
> --- linux.orig/include/linux/sysctl.h	2008-02-04 12:29:01.0 +0100
> +++ linux/include/linux/sysctl.h	2008-02-07 15:19:06.0 +0100
> @@ -1041,6 +1041,7 @@ struct ctl_table
>  	void *data;
>  	int maxlen;
>  	mode_t mode;
> +	int cap;	/* Capability needed to read/write */
>  	struct ctl_table *child;
>  	struct ctl_table *parent;	/* Automatically set */
>  	proc_handler *proc_handler;	/* Callback for text formatting */
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c	2008-02-05 22:17:05.0 +0100
> +++ linux/kernel/sysctl.c	2008-02-07 15:30:45.0 +0100
> @@ -1527,14 +1527,26 @@ out:
>   * some sysctl variables are readonly even to root.
>   */
>
> -static int test_perm(int mode, int op)
> +static int test_perm(struct ctl_table *table, int op)
>  {
> -	if (!current->euid)
> -		mode >>= 6;
> -	else if (in_egroup_p(0))
> -		mode >>= 3;
> +	int cap = table->cap;
> +	mode_t mode = table->mode;
> +
> +	if (!cap)
> +		cap = CAP_SYS_ADMIN;
> +
> +	if ((op & MAY_READ) && !(mode & S_IRUGO))
> +		return -EACCES;
> +
> +	if ((op & MAY_WRITE) && !(mode & S_IWUGO))
> +		return -EACCES;
> +
> +	if (capable(cap))
> +		return 0;
> +
>  	if ((mode & op & 0007) == op)
>  		return 0;
> +
>  	return -EACCES;

I like how simple it appears to be :)

At first I missed the fact that owning uid is always 0 so I thought
the uid processing wasn't quite enough.  But since it's always 0, the
only question is whether there are any /proc/sys files whose users
currently depend on being setgid 0 and setgid non-0 with no
capabilities.  On my laptop, 'find /proc/sys -type f -perm -020' gives
me no results, so that is promising.

So this certainly seems like a good first step.  In fact, combined
with /proc/sys/ being partially remounted per container like
/proc/sys/net is doing, we may not even need to do anything with
CAP_NS_OVERRIDE.

thanks,
-serge

>  }
>
> @@ -1544,7 +1556,7 @@ int sysctl_perm(struct ctl_table *table,
>  	error = security_sysctl(table, op);
>  	if (error)
>  		return error;
> -	return test_perm(table->mode, op);
> +	return test_perm(table, op);
>  }
>
>  #ifdef CONFIG_SYSCTL_SYSCALL
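The compatibility question raised in this reply (who relied on the group bits before?) can be illustrated by modelling the old and the proposed check side by side. This is a rough user-space sketch, assuming the two versions of test_perm() quoted above; `old_test_perm()` and `new_test_perm()` and their parameters are hypothetical names, and the task's identity and capability are passed in as plain arguments instead of being read from `current`.

```c
#include <assert.h>
#include <errno.h>

/*
 * Model of the check before the patch: euid 0 is judged by the owner
 * bits, membership in group 0 by the group bits, everyone else by the
 * "other" bits.  op uses the rwx bit values (read 0004, write 0002).
 */
static int old_test_perm(unsigned int mode, int op, unsigned int euid,
			 int in_group0)
{
	assert((op & ~0007) == 0);

	if (euid == 0)
		mode >>= 6;
	else if (in_group0)
		mode >>= 3;

	return ((mode & op & 0007) == op) ? 0 : -EACCES;
}

/*
 * Model of the proposed check: uid and gid are ignored.  A capable
 * task may do whatever the mode allows anybody; all other tasks are
 * judged by the "other" bits.
 */
static int new_test_perm(unsigned int mode, int op, int is_capable)
{
	if ((op & 0004) && !(mode & 0444))
		return -EACCES;
	if ((op & 0002) && !(mode & 0222))
		return -EACCES;
	if (is_capable)
		return 0;

	return ((mode & op & 0007) == op) ? 0 : -EACCES;
}
```

With a hypothetical group-writable mode such as 0660, a non-root member of group 0 may write under the old model but not under the new one, which is the case the find command above looks for. With 0644 the two models agree for that task. They differ for euid 0 without the capability, which is the intended change.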
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> From: Miklos Szeredi [EMAIL PROTECTED]
>
> Add the following:
>
>   /proc/sys/fs/types/${FS_TYPE}/usermount_safe
>
> Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Thanks, Miklos, good explanations in the docs.

Acked-by: Serge Hallyn [EMAIL PROTECTED]

One comment inline, but not imo your problem :)

> ---
>
> Index: linux/fs/filesystems.c
> ===================================================================
> --- linux.orig/fs/filesystems.c	2008-02-04 23:47:46.0 +0100
> +++ linux/fs/filesystems.c	2008-02-04 23:48:04.0 +0100
> @@ -12,6 +12,7 @@
>  #include <linux/kmod.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> +#include <linux/sysctl.h>
>  #include <asm/uaccess.h>
>
>  /*
> @@ -51,6 +52,57 @@ static struct file_system_type **find_fi
>  	return p;
>  }
>
> +#define MAX_FILESYSTEM_VARS 1
> +
> +struct filesystem_sysctl_table {
> +	struct ctl_table_header *header;
> +	struct ctl_table table[MAX_FILESYSTEM_VARS + 1];
> +};
> +
> +/*
> + * Create /sys/fs/types/${FSNAME} directory with per fs-type tunables.
> + */
> +static int filesystem_sysctl_register(struct file_system_type *fs)
> +{
> +	struct filesystem_sysctl_table *t;
> +	struct ctl_path path[] = {
> +		{ .procname = "fs", .ctl_name = CTL_FS },
> +		{ .procname = "types", .ctl_name = CTL_UNNUMBERED },
> +		{ .procname = fs->name, .ctl_name = CTL_UNNUMBERED },
> +		{ }
> +	};
> +
> +	t = kzalloc(sizeof(*t), GFP_KERNEL);
> +	if (!t)
> +		return -ENOMEM;
> +
> +
> +	t->table[0].ctl_name = CTL_UNNUMBERED;
> +	t->table[0].procname = "usermount_safe";
> +	t->table[0].maxlen = sizeof(int);
> +	t->table[0].data = &fs->fs_safe;
> +	t->table[0].mode = 0644;

Yikes, this could be a problem for containers, as it's simply tied to
uid 0, whereas tying it to a capability would let us solve it with
capability bounds.

This might mean more urgency to get user namespaces working at least
with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
out of a container's capability bounding set.

> +	t->table[0].proc_handler = proc_dointvec;
> +
> +	t->header = register_sysctl_paths(path, t->table);
> +	if (!t->header) {
> +		kfree(t);
> +		return -ENOMEM;
> +	}
> +
> +	fs->sysctl_table = t;
> +
> +	return 0;
> +}
> +
> +static void filesystem_sysctl_unregister(struct file_system_type *fs)
> +{
> +	struct filesystem_sysctl_table *t = fs->sysctl_table;
> +
> +	unregister_sysctl_table(t->header);
> +	kfree(t);
> +}
> +
>  /**
>   *	register_filesystem - register a new filesystem
>   *	@fs: the file system structure
> @@ -80,6 +132,13 @@ int register_filesystem(struct file_syst
>  	else
>  		*p = fs;
>  	write_unlock(&file_systems_lock);
> +
> +	if (res == 0) {
> +		res = filesystem_sysctl_register(fs);
> +		if (res != 0)
> +			unregister_filesystem(fs);
> +	}
> +
>  	return res;
>  }
>
> @@ -108,6 +167,7 @@ int unregister_filesystem(struct file_sy
>  			*tmp = fs->next;
>  			fs->next = NULL;
>  			write_unlock(&file_systems_lock);
> +			filesystem_sysctl_unregister(fs);
>  			return 0;
>  		}
>  		tmp = &(*tmp)->next;
> Index: linux/include/linux/fs.h
> ===================================================================
> --- linux.orig/include/linux/fs.h	2008-02-04 23:48:02.0 +0100
> +++ linux/include/linux/fs.h	2008-02-04 23:48:04.0 +0100
> @@ -1444,6 +1444,7 @@ struct file_system_type {
>  	struct module *owner;
>  	struct file_system_type * next;
>  	struct list_head fs_supers;
> +	struct filesystem_sysctl_table *sysctl_table;
>
>  	struct lock_class_key s_lock_key;
>  	struct lock_class_key s_umount_key;
> Index: linux/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux.orig/Documentation/filesystems/proc.txt	2008-02-04 23:47:58.0 +0100
> +++ linux/Documentation/filesystems/proc.txt	2008-02-04 23:48:04.0 +0100
> @@ -44,6 +44,7 @@ Table of Contents
>    2.14	/proc/<pid>/io - Display the IO accounting fields
>    2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
>    2.16	/proc/<pid>/mountinfo - Information about mounts
> +  2.17	/proc/sys/fs/types - File system type specific parameters
>
>  ------------------------------------------------------------------------------
>  Preface
> @@ -2392,4 +2393,34 @@ For more information see:
>
>  	Documentation/filesystems/sharedsubtree.txt
>
> +2.17	/proc/sys/fs/types/ - File system type specific parameters
> +----------------------------------------------------------------
> +
> +There's a separate directory /proc/sys/fs/types/<type>/ for each
> +filesystem type, containing the following files:
> +
> +usermount_safe
> +--------------
> +
> +Setting this to non-zero will allow
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
> > +	t->table[0].mode = 0644;
>
> Yikes, this could be a problem for containers, as it's simply tied to
> uid 0, whereas tying it to a capability would let us solve it with
> capability bounds.
>
> This might mean more urgency to get user namespaces working at least
> with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
> out of a container's capability bounding set.

I think I understand the problem, but not the solution.  How are user
namespaces going to help?

Maybe sysctls just need to check capabilities, instead of uids.  I
think that would make a lot of sense anyway.

Thanks,
Miklos
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > +	t->table[0].mode = 0644;
> >
> > Yikes, this could be a problem for containers, as it's simply tied to
> > uid 0, whereas tying it to a capability would let us solve it with
> > capability bounds.
> >
> > This might mean more urgency to get user namespaces working at least
> > with sysfs, else this is a quick way around having CAP_SYS_ADMIN taken
> > out of a container's capability bounding set.
>
> I think I understand the problem, but not the solution.  How are user
> namespaces going to help?

Well it somewhat depends on how we implement userns for filesystems in
the first place, and whether we end up splitting sysfs into
sub-filesystems as I think Eric Biederman has been advocating.

My thoughts had been running along the lines of just tagging vfsmounts
with the userns of the mounting process.  A task from outside the
mounting process' namespace would get 'user other' permissions whether
or not its uid was the owning uid or uid 0 (unless the task had
CAP_NS_OVERRIDE).

But really it gets more complicated for sysfs than something like ext2
since we really want to be able to filter files and directories for
different namespaces...

Handling sysfs user namespaces before we sort out the rest of the
sysfs stuff (being hashed out with network namespaces) seems like
jumping the gun a bit.

> Maybe sysctls just need to check capabilities, instead of uids.  I
> think that would make a lot of sense anyway.

Would it be as simple as tagging the inodes with capability sets?  One
set for writing, or one each for reading and writing?

thanks,
-serge
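The vfsmount-tagging idea described in this message can be sketched as a small user-space model. Everything here is an assumption made for illustration: `struct model_mount`, `struct model_inode`, `struct model_task` and `model_perm_bits()` are hypothetical names, namespaces are reduced to integer ids, and group bits as well as root's usual override are left out to keep the sketch short.

```c
#include <assert.h>

/* Hypothetical types; none of these exist in the kernel. */
struct model_mount {
	int userns;		/* namespace of the process that mounted it */
};

struct model_inode {
	unsigned int uid;	/* owning uid */
	unsigned int mode;	/* permission bits, e.g. 0644 */
};

struct model_task {
	int userns;		/* namespace the task runs in */
	unsigned int uid;
	int cap_ns_override;	/* nonzero if the task holds CAP_NS_OVERRIDE */
};

/* Return the rwx triplet (0..7) that would apply to this task. */
static unsigned int model_perm_bits(const struct model_mount *m,
				    const struct model_inode *i,
				    const struct model_task *t)
{
	int inside;

	assert(m && i && t);
	inside = (t->userns == m->userns) || t->cap_ns_override;

	/* A uid match only counts inside the mount's namespace. */
	if (inside && t->uid == i->uid)
		return (i->mode >> 6) & 7;

	/* Everyone else, including a foreign uid 0, gets the "other" bits. */
	return i->mode & 7;
}
```

For a file owned by uid 0 with mode 0644, uid 0 inside the mounting namespace gets rw- in this model, while uid 0 from another namespace gets r-- unless it carries the override capability.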
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > What do you think about doing this only if FS_SAFE is also set,
> > > > so for instance at first only FUSE would allow itself to be
> > > > made user-mountable?
> > > >
> > > > A safe thing to do, or overly intrusive?
> > >
> > > It goes somewhat against the "no policy in kernel" policy ;).  I think
> > > the warning in the documentation should be enough to make sysadmins
> > > think twice before doing anything foolish:
> >
> > Warning in which documentation?  A sysadmin considering setting
> > fs_safe for ext2 or xfs isn't going to be looking at fuse docs, which
> > I think is what you're talking about.  Are you going to add a file
> > under Documentation/filesystems?
>
> Yes, I meant documentation of the new sysctl tunable in
> Documentation/filesystems/proc.txt:

Argh, sorry.

> Index: linux/Documentation/filesystems/proc.txt
> ===================================================================
> --- linux.orig/Documentation/filesystems/proc.txt	2008-01-16 13:25:07.0 +0100
> +++ linux/Documentation/filesystems/proc.txt	2008-01-16 13:25:09.0 +0100
> @@ -43,6 +43,7 @@ Table of Contents
>    2.13	/proc/<pid>/oom_score - Display current oom-killer score
>    2.14	/proc/<pid>/io - Display the IO accounting fields
>    2.15	/proc/<pid>/coredump_filter - Core dump filtering settings
> +  2.16	/proc/sys/fs/types - File system type specific parameters
>
>  ------------------------------------------------------------------------------
>  Preface
> @@ -2283,4 +2284,21 @@ For example:
>    $ echo 0x7 > /proc/self/coredump_filter
>    $ ./some_program
>
> +2.16	/proc/sys/fs/types/ - File system type specific parameters
> +----------------------------------------------------------------
> +
> +There's a separate directory /proc/sys/fs/types/<type>/ for each
> +filesystem type, containing the following files:
> +
> +usermount_safe
> +--------------
> +
> +Setting this to non-zero will allow filesystems of this type to be
> +mounted by unprivileged users (note, that there are other
> +prerequisites as well).
> +
> +Care should be taken when enabling this, since most
> +filesystems haven't been designed with unprivileged mounting
> +in mind.
> +
>  ------------------------------------------------------------------------------
>
> Do you think this is enough?  Or do we need something more, to prevent
> sysadmin inadvertently setting this for an unsafe filesystem?

I would think something more would be good.  First explaining that
fuse should be safe modulo warnings in the fuse documentation, procfs
and sysfs may be safe, while other filesystems are not known safe at
all.  Then explaining the dangers with not-known-safe filesystems and
what is needed to make them safe.  Clearly making sure input
validation is properly done so for instance getsb() doesn't turn into
a buffer overflow, etc.

Such a checklist also would be useful for holding a meaningful
discussion about the other filesystems and maybe turning some people
loose on an audit of other filesystems.

thanks,
-serge
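For reference, this is roughly how a user-space tool might inspect the tunable documented in the quoted proc.txt text. It is a hedged sketch: the path layout /proc/sys/fs/types/<type>/usermount_safe is taken from the patch under discussion and only exists on a kernel carrying that patch, and the helper names `usermount_safe_path()` and `read_usermount_safe()` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the path of the tunable for one filesystem type.
 * Returns 0 on success, -1 if the name is unusable or buf is too small.
 */
static int usermount_safe_path(char *buf, size_t len, const char *fstype)
{
	int n;

	assert(buf && fstype);

	/* Refuse empty names and anything that could escape the directory. */
	if (fstype[0] == '\0' || strchr(fstype, '/'))
		return -1;

	n = snprintf(buf, len, "/proc/sys/fs/types/%s/usermount_safe", fstype);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the tunable.  Returns its value, or -1 if the file does not
 * exist (for example on a kernel without the patch) or cannot be parsed.
 */
static int read_usermount_safe(const char *fstype)
{
	char path[256];
	int val = -1;
	FILE *f;

	if (usermount_safe_path(path, sizeof(path), fstype) != 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);

	return val;
}
```

A caller would treat any nonzero result from read_usermount_safe("fuse") as "marked safe by the administrator" and -1 as "unknown or unsupported".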
Re: [patch 07/10] unprivileged mounts: add sysctl tunable for safe property
> What do you think about doing this only if FS_SAFE is also set,
> so for instance at first only FUSE would allow itself to be
> made user-mountable?
>
> A safe thing to do, or overly intrusive?

It goes somewhat against the "no policy in kernel" policy ;).  I think
the warning in the documentation should be enough to make sysadmins
think twice before doing anything foolish:

+Care should be taken when enabling this, since most
+filesystems haven't been designed with unprivileged mounting
+in mind.
+

BTW, filesystems like 'proc' and 'sysfs' should also be safe, although
the only use for them being marked safe is if the users are allowed to
umount them from their private namespace (otherwise a 'mount --bind'
has the same effect as a new mount).

Thanks,
Miklos