[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote: I think we should accept to have an inode granularity. We could redesign the writeback code to work per-cgroup / per-page, etc. but that would add a huge overhead. The limit of inode granularity could be an acceptable tradeoff, cgroups are supposed to work to different files usually, well.. except when databases come into play (ouch!). Agreed. Granularity of per inode level might be accetable in many cases. Again, I am worried faster group getting stuck behind slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at wrong layer. Should ASYNC IO be throttled before we allow task to write to page cache. The way we throttle the process based on dirty ratio, can we just check for throttle limits also there or something like that.(I think that's what you had done in your initial throttling controller implementation?) Right. This is exactly the same approach I've used in my old throttling controller: throttle sync READs and WRITEs at the block layer and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of faster group getting blocked by slower group, but the controller will be a little bit more leaky, because the writeback IO will be never throttled and we'll see some limited IO spikes during the writeback. However, this is always a better solution IMHO respect to the current implementation that is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth to test it. -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages
On Tue, Feb 22, 2011 at 07:07:19PM -0500, Vivek Goyal wrote: On Wed, Feb 23, 2011 at 12:05:34AM +0100, Andrea Righi wrote: On Tue, Feb 22, 2011 at 04:00:30PM -0500, Vivek Goyal wrote: On Tue, Feb 22, 2011 at 06:12:55PM +0100, Andrea Righi wrote: Add the tracking of buffered (writeback) and anonymous pages. Dirty pages in the page cache can be processed asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context respect to the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicate respect to the synchronous IO from the blkio controller's point of view. The idea is to save the cgroup owner of each anonymous page and dirty page in page cache. A page is associated to a cgroup the first time it is dirtied in memory (for file cache pages) or when it is set as swap-backed (for anonymous pages). This information is stored using the page_cgroup functionality. Then, at the block layer, it is possible to retrieve the throttle group looking at the bio_page(bio). If the page was not explicitly associated to any cgroup the IO operation is charged to the current task/cgroup, as it was done by the previous implementation. Signed-off-by: Andrea Righi ari...@develer.com --- block/blk-throttle.c | 87 +++- include/linux/blkdev.h | 26 ++- 2 files changed, 111 insertions(+), 2 deletions(-) diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 9ad3d1e..a50ee04 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -8,6 +8,10 @@ #include linux/slab.h #include linux/blkdev.h #include linux/bio.h +#include linux/memcontrol.h +#include linux/mm_inline.h +#include linux/pagemap.h +#include linux/page_cgroup.h #include linux/blktrace_api.h #include linux/blk-cgroup.h @@ -221,6 +225,85 @@ done: return tg; } +static inline bool is_kernel_io(void) +{ + return !!(current-flags (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC)); +} + +static int throtl_set_page_owner(struct page *page, struct mm_struct *mm) +{ + struct blkio_cgroup *blkcg; + unsigned short id = 0; + + if (blkio_cgroup_disabled()) + return 0; + if (!mm) + goto out; + rcu_read_lock(); + blkcg = task_to_blkio_cgroup(rcu_dereference(mm-owner)); + if (likely(blkcg)) + id = css_id(blkcg-css); + rcu_read_unlock(); +out: + return page_cgroup_set_owner(page, id); +} + +int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm) +{ + return throtl_set_page_owner(page, mm); +} +EXPORT_SYMBOL(blk_throtl_set_anonpage_owner); + +int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm) +{ + if (is_kernel_io() || !page_is_file_cache(page)) + return 0; + return throtl_set_page_owner(page, mm); +} +EXPORT_SYMBOL(blk_throtl_set_filepage_owner); Why are we exporting all these symbols? Right. Probably a single one is enough: int blk_throtl_set_page_owner(struct page *page, struct mm_struct *mm, bool anon); Who is going to use this single export? Which module? I was actually thinking at some filesystem modules, but I was wrong, because at the moment no one needs the export. I'll remove it in the next version of the patch. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Wed, Feb 23, 2011 at 01:49:10PM +0900, KAMEZAWA Hiroyuki wrote: On Wed, 23 Feb 2011 00:37:18 +0100 Andrea Righi ari...@develer.com wrote: On Tue, Feb 22, 2011 at 06:06:30PM -0500, Vivek Goyal wrote: On Wed, Feb 23, 2011 at 12:01:47AM +0100, Andrea Righi wrote: On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote: On Tue, 22 Feb 2011 18:12:54 +0100 Andrea Righi ari...@develer.com wrote: The page_cgroup infrastructure, currently available only for the memory cgroup controller, can be used to store the owner of each page and opportunely track the writeback IO. This information is encoded in the upper 16-bits of the page_cgroup-flags. A owner can be identified using a generic ID number and the following interfaces are provided to store a retrieve this information: unsigned long page_cgroup_get_owner(struct page *page); int page_cgroup_set_owner(struct page *page, unsigned long id); int page_cgroup_copy_owner(struct page *npage, struct page *opage); My immediate observation is that you're not really tracking the owner here - you're tracking an opaque 16-bit token known only to the block controller in a field which - if changed by anybody other than the block controller - will lead to mayhem in the block controller. I think it might be clearer - and safer - to say blkcg or some such instead of owner here. Basically the idea here was to be as generic as possible and make this feature potentially available also to other subsystems, so that cgroup subsystems may represent whatever they want with the 16-bit token. However, no more than a single subsystem may be able to use this feature at the same time. I'm tempted to say it might be better to just add a pointer to your throtl_grp structure into struct page_cgroup. Or maybe replace the mem_cgroup pointer with a single pointer to struct css_set. Both of those ideas, though, probably just add unwanted extra overhead now to gain generality which may or may not be wanted in the future. The pointer to css_set sounds good, but it would add additional space to the page_cgroup struct. Now, page_cgroup is 40 bytes (in 64-bit arch) and all of them are allocated at boot time. Using unused bits in page_cgroup-flags is a choice with no overhead from this point of view. I think John suggested replacing mem_cgroup pointer with css_set so that size of the strcuture does not increase but it leads extra level of indirection. OK, got it sorry. So, IIUC we save css_set pointer and get a struct cgroup as following: struct cgroup *cgrp = css_set-subsys[subsys_id]-cgroup; Then, for example to get the mem_cgroup reference: struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); It seems a lot of indirections, but I may have done something wrong or there could be a simpler way to do it. Then, page_cgroup should have reference count on css_set and make tons of atomic ops. BTW, bits of pc-flags are used for storing sectionID or nodeID. Please clarify your 16bit never breaks that information. And please keep more 4-5 flags for dirty_ratio support of memcg. OK, I didn't see the recent work about section and node id encoded in the pc-flags, thanks. So, it'd be probably better to rebase the patch to the latest mmotm to check all this stuff. I wonder I can make pc-mem_cgroup to be pc-memid(16bit), then, == static inline struct mem_cgroup *get_memcg_from_pc(struct page_cgroup *pc) { struct cgroup_subsys_state *css = css_lookup(mem_cgroup_subsys, pc-memid); return container_of(css, struct mem_cgroup, css); } == Overhead will be seen at updating file statistics and LRU management. But, hmm, can't you do that tracking without page_cgroup ? Because the number of dirty/writeback pages are far smaller than total pages, chasing I/O with dynamic structure is not very bad.. prepareing [pfn - blkio] record table and move that information to struct bio in dynamic way is very difficult ? This would be ok for dirty pages, but consider that we're also tracking anonymous pages. So, if we want to control the swap IO we actually need to save this information for a lot of pages and at the end I think we'll basically duplicate the page_cgroup code. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.
David Howells dhowe...@redhat.com wrote: int (*capable) (struct task_struct *tsk, const struct cred *cred, - int cap, int audit); + struct user_namespace *ns, int cap, int audit); Hmmm... A chunk of the contents of the cred struct are user-namespaced. Could you add the user_namespace pointer to the cred struct and thus avoid passing it as an argument to other things. Ah, no... Ignore that, I think I see that you do need it. +int cap_capable(struct task_struct *tsk, const struct cred *cred, + struct user_namespace *targ_ns, int cap, int audit) { - return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; + for (;;) { + /* The creator of the user namespace has all caps. */ + if (targ_ns != init_user_ns targ_ns-creator == cred-user) + return 0; Why is that last comment so? Why should the creating namespace sport all possible capabilities? Do you have to have all capabilities available to you to be permitted create a new user namespace? Also, would it be worth having a separate cap_ns_capable()? Wouldn't most calls to cap_capable() only be checking the caps granted in the current user namespace? David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] User namespaces and keys
I guess we need to look at how to mix keys and namespaces again. Possibly the trickiest problem with keys is how to upcall key construction to /sbin/request-key when the keys may be of a different user namespace. David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.
Quoting David Howells (dhowe...@redhat.com): David Howells dhowe...@redhat.com wrote: int (*capable) (struct task_struct *tsk, const struct cred *cred, - int cap, int audit); + struct user_namespace *ns, int cap, int audit); Hmmm... A chunk of the contents of the cred struct are user-namespaced. Could you add the user_namespace pointer to the cred struct and thus avoid passing it as an argument to other things. Ah, no... Ignore that, I think I see that you do need it. Thanks for reviewing, David. +int cap_capable(struct task_struct *tsk, const struct cred *cred, + struct user_namespace *targ_ns, int cap, int audit) { - return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; + for (;;) { + /* The creator of the user namespace has all caps. */ + if (targ_ns != init_user_ns targ_ns-creator == cred-user) + return 0; Why is that last comment so? Why should the creating namespace sport all possible capabilities? Do you have to have all capabilities available to you to be permitted create a new user namespace? It's not the creating namespace, but the creating user, which has all caps. So for instance, if uid 500 in init_user_ns creates a namespace, then: a. uid 500 in init_user_ns has all caps to the child user_ns, so it can kill the tasks in the userns, clean up, etc. b. uid X in any other child user_ns has no caps to the child user_ns. c. root in init_user_ns has whatever capabilities are in his pE to the child user_ns. Again, this is so that the admin in any user_ns can clean up any messes made by users in his user_ns. One of the goals of the user namespaces it to make it safe to allow unprivileged users to create child user namespaces in which they have targeted privilege. Anything which happens in that child user namespace should be: a. constrained to resources which the user can control anyway b. able to be cleaned up by the user c. (especially) able to be cleaned up by the privileged user in the parent user_ns. Also, would it be worth having a separate cap_ns_capable()? Wouldn't most calls to cap_capable() only be checking the caps granted in the current user namespace? Hm. There is nsown_capable() which is targeted to current_userns(), but that still needs to enable the caps for the privileged ancestors as described above. thanks, -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
Quoting David Howells (dhowe...@redhat.com): I guess we need to look at how to mix keys and namespaces again. From strictly kernel pov, at the moment, keys are strictly usable only by the user in your own user namespace. We may want to look at this again, but for now I think that would be a safe enough default. Later, we'll probably want the user creating a child_user_ns to allow his keys to be inherited by the child user_ns. Though, as I type that, it seems to me that that'll just become a maintenance pain, and it's just plain safer to have the user re-enter his keys, sharing them over a file if needed. I'm going to not consider the TPM at the moment :) Possibly the trickiest problem with keys is how to upcall key construction to /sbin/request-key when the keys may be of a different user namespace. Hm, jinkeys, yes. -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
Serge E. Hallyn se...@hallyn.com writes: Quoting David Howells (dhowe...@redhat.com): I guess we need to look at how to mix keys and namespaces again. From strictly kernel pov, at the moment, keys are strictly usable only by the user in your own user namespace. We may want to look at this again, but for now I think that would be a safe enough default. Later, we'll probably want the user creating a child_user_ns to allow his keys to be inherited by the child user_ns. Though, as I type that, it seems to me that that'll just become a maintenance pain, and it's just plain safer to have the user re-enter his keys, sharing them over a file if needed. I'm going to not consider the TPM at the moment :) Possibly the trickiest problem with keys is how to upcall key construction to /sbin/request-key when the keys may be of a different user namespace. Hm, jinkeys, yes. Serge short term this is where I think we need to add a check or two so that keys only work in the init_user_ns. When the rest of the details are sorted out we can open up the use of keys some more. As the first stage of converting the network stack we added patches that turned off everything in non init namespaces. Those patches were trivial and easy to review, and it made the conversion process a lot easier. I suspect for keys and possibly security modules and anything else that does is related we want to turn off by default. Then once the core of user namespaces is safe to unshare without privilege we can come back and get the more difficult bits. Eric ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
Serge E. Hallyn se...@hallyn.com wrote: I guess we need to look at how to mix keys and namespaces again. From strictly kernel pov, at the moment, keys are strictly usable only by the user in your own user namespace. I'm not sure that's currently completely true. Key quota maintenance is namespaced, and the key's owner UID/GID belong to that namespace, so that's okay, but: (*) key_task_permission() does not distinguish UIDs and GIDs from different namespaces. (*) A key can be referred to by its serial number, no matter whose namespace it is in, and will yield up its given UID/GID, even if these aren't actually meaningful in your namespace. This means request_key() can successfully upcall at the moment. I wonder if I should make the following changes: (1) If the key and the accessor are in different user namespaces, then skip the UID and GID comparisons in key_task_permission(). That means that to be able to access the key you'd have to possess the key and the key would have to grant you Possessor access, or the key would have to grant you Other access. (2) If the key and someone viewing the key description are in different namespaces, then indicate that the UID and the GID are -1, irrespective of the actual values. (3) When an upcall is attempting to instantiate a key, it is allowed to access the keys of requestor using the requestor's credentials (UID, GID, groups, security label). Ensure that this will be done in the requestor's user namespace. Nothing should need to be done here, since search_process_keyrings() switches to the requestor's creds. Oh, and are security labels user-namespaced? We may want to look at this again, but for now I think that would be a safe enough default. Later, we'll probably want the user creating a child_user_ns to allow his keys to be inherited by the child user_ns. That depends what you mean by 'allow his keys to be inherited'. Do you mean copying all the creator's keys en mass? You may find all you need to do is to provide the new intended user with a new session keyring with a link back to the creator's session keyring. Though, as I type that, it seems to me that that'll just become a maintenance pain, and it's just plain safer to have the user re-enter his keys, That would certainly be simpler. sharing them over a file if needed. I'm not sure what you mean by that. Do you mean some sort of cred passing similar to that that can be done over AF_UNIX sockets with fds? I'm going to not consider the TPM at the moment :) I'm not sure the TPM is that much of a problem, assuming you're just referring to its keystore capability... David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
Agreed. Granularity of per inode level might be accetable in many cases. Again, I am worried faster group getting stuck behind slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at wrong layer. Should ASYNC IO be throttled before we allow task to write to page cache. The way we throttle the process based on dirty ratio, can we just check for throttle limits also there or something like that.(I think that's what you had done in your initial throttling controller implementation?) Right. This is exactly the same approach I've used in my old throttling controller: throttle sync READs and WRITEs at the block layer and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of faster group getting blocked by slower group, but the controller will be a little bit more leaky, because the writeback IO will be never throttled and we'll see some limited IO spikes during the writeback. Yes writeback will not be throttled. Not sure how big a problem that is. - We have controlled the input rate. So that should help a bit. - May be one can put some high limit on root cgroup to in blkio throttle controller to limit overall WRITE rate of the system. - For SATA disks, try to use CFQ which can try to minimize the impact of WRITE. It will atleast provide consistent bandwindth experience to application. However, this is always a better solution IMHO respect to the current implementation that is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth to test it. At this point of time I have few concerns with this approach. - Configuration issues. Asking user to plan for SYNC ans ASYNC IO separately is inconvenient. One has to know the nature of workload. - Most likely we will come up with global limits (atleast to begin with), and not per device limit. That can lead to contention on one single lock and scalability issues on big systems. Having said that, this approach should reduce the kernel complexity a lot. So if we can do some intelligent locking to limit the overhead then it will boil down to reduced complexity in kernel vs ease of use to user. I guess at this point of time I am inclined towards keeping it simple in kernel. Couple of people have asked me that we have backup jobs running at night and we want to reduce the IO bandwidth of these jobs to limit the impact on latency of other jobs, I guess this approach will definitely solve that issue. IMHO, it might be worth trying this approach and see how well does it work. It might not solve all the problems but can be helpful in many situations. I feel that for proportional bandwidth division, implementing ASYNC control at CFQ will make sense because even if things get serialized in higher layers, consequences are not very bad as it is work conserving algorithm. But for throttling serialization will lead to bad consequences. May be one can think of new files in blkio controller to limit async IO per group during page dirty time. blkio.throttle.async.write_bps_limit blkio.throttle.async.write_iops_limit Thanks Vivek ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
David Howells dhowe...@redhat.com writes: Serge E. Hallyn se...@hallyn.com wrote: I guess we need to look at how to mix keys and namespaces again. From strictly kernel pov, at the moment, keys are strictly usable only by the user in your own user namespace. I'm not sure that's currently completely true. Key quota maintenance is namespaced, and the key's owner UID/GID belong to that namespace, so that's okay, but: (*) key_task_permission() does not distinguish UIDs and GIDs from different namespaces. (*) A key can be referred to by its serial number, no matter whose namespace it is in, and will yield up its given UID/GID, even if these aren't actually meaningful in your namespace. This means request_key() can successfully upcall at the moment. I wonder if I should make the following changes: (1) If the key and the accessor are in different user namespaces, then skip the UID and GID comparisons in key_task_permission(). That means that to be able to access the key you'd have to possess the key and the key would have to grant you Possessor access, or the key would have to grant you Other access. (2) If the key and someone viewing the key description are in different namespaces, then indicate that the UID and the GID are -1, irrespective of the actual values. (3) When an upcall is attempting to instantiate a key, it is allowed to access the keys of requestor using the requestor's credentials (UID, GID, groups, security label). Ensure that this will be done in the requestor's user namespace. Nothing should need to be done here, since search_process_keyrings() switches to the requestor's creds. Oh, and are security labels user-namespaced? Not at this time. The user namespace as currently merged is little more than a place holder for a proper implementation. Serge is busily fleshing out that proper implementation. Until we reach the point where all checks that have historically been if (uid1 == uid2) become if ((uidns1 == uidns2) (uid1 == uid2)) there will be problems. The security labels and probably lsm's in general need to be per user namespace but we simply have not gotten that far. For the short term I will be happy when we get a minimally usable user namespace. Eric ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
Quoting Eric W. Biederman (ebied...@xmission.com): David Howells dhowe...@redhat.com writes: Serge E. Hallyn se...@hallyn.com wrote: I guess we need to look at how to mix keys and namespaces again. From strictly kernel pov, at the moment, keys are strictly usable only by the user in your own user namespace. I'm not sure that's currently completely true. Key quota maintenance is namespaced, and the key's owner UID/GID belong to that namespace, so that's okay, but: (*) key_task_permission() does not distinguish UIDs and GIDs from different namespaces. (*) A key can be referred to by its serial number, no matter whose namespace it is in, and will yield up its given UID/GID, even if these aren't actually meaningful in your namespace. This means request_key() can successfully upcall at the moment. I wonder if I should make the following changes: (1) If the key and the accessor are in different user namespaces, then skip the UID and GID comparisons in key_task_permission(). That means that to be able to access the key you'd have to possess the key and the key would have to grant you Possessor access, or the key would have to grant you Other access. (2) If the key and someone viewing the key description are in different namespaces, then indicate that the UID and the GID are -1, irrespective of the actual values. (3) When an upcall is attempting to instantiate a key, it is allowed to access the keys of requestor using the requestor's credentials (UID, GID, groups, security label). Ensure that this will be done in the requestor's user namespace. Nothing should need to be done here, since search_process_keyrings() switches to the requestor's creds. Oh, and are security labels user-namespaced? Not at this time. The user namespace as currently merged is little more than a place holder for a proper implementation. Serge is busily fleshing out that proper implementation. Until we reach the point where all checks that have historically been if (uid1 == uid2) become if ((uidns1 == uidns2) (uid1 == uid2)) there will be problems. The security labels and probably lsm's in general need to be per user namespace but we simply have not gotten that far. For the short term I will be happy when we get a minimally usable user namespace. Note also that when Eric brought this up at the LSM mini-conf two or three years ago, there was pretty general, strong objection to the idea. Like Eric says, I think that'll have to wait. In the meantime, isolating user namespace sandboxes (or containers) using simple LSM configurations is a very good idea. -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.
Serge E. Hallyn se...@hallyn.com wrote: + /* If you have the capability in a parent user ns you have it + * in the over all children user namespaces as well, so see + * if this process has the capability in the parent user + * namespace. + */ ... in the over all children user namespaces ... I think need a couple of words dropping. David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces
Serge E. Hallyn se...@hallyn.com wrote: ptrace is allowed to tasks in the same user namespace according to There's a verb missing after 'allowed to' - presumably 'trace'. David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces
Serge E. Hallyn se...@hallyn.com wrote: +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim) +{ + struct user_namespace *u1 = task_cred_xxx(task, user)-user_ns; + struct user_namespace *u2 = task_cred_xxx(victim, user)-user_ns; Hmmm. task_cred_xxx() uses task-real_cred, which is correct for victim (the object), but normally you'd use task-cred for task (the subject). However, in this case, I think it's probably okay. David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace
Serge E. Hallyn se...@hallyn.com wrote: struct uts_namespace { struct kref kref; struct new_utsname name; + struct user_namespace *user_ns; }; If a uts_namespace belongs to a user_namespace, should CLONE_NEWUSER then imply CLONE_NEWUTS? Or is uts_namespace::user_ns more an implication that the set of users in user_namespace are the only ones authorised to alter the uts data. I presume that the uts_namespace of a process must be owned by one of the user_namespaces in the alternating inheritance chain of namespaces and their creators leading from current_user_ns() to init_user_ns. With that in mind, looking at patch 3: - if (!capable(CAP_SYS_ADMIN)) + if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_SYS_ADMIN)) what is it you're actually asking? I presume it's 'does this user have CAP_SYS_ADMIN capability over objects belonging to the uts_namespace's user_namespace?' So, to look at the important bit of patch 2: -int cap_capable(struct task_struct *tsk, const struct cred *cred, int cap, - int audit) +int cap_capable(struct task_struct *tsk, const struct cred *cred, + struct user_namespace *targ_ns, int cap, int audit) { - return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; + for (;;) { + /* The creator of the user namespace has all caps. */ + if (targ_ns != init_user_ns targ_ns-creator == cred-user) + return 0; + + /* Do we have the necessary capabilities? */ + if (targ_ns == cred-user-user_ns) + return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; + + /* Have we tried all of the parent namespaces? */ + if (targ_ns == init_user_ns) + return -EPERM; + + /* If you have the capability in a parent user ns you have it +* in the over all children user namespaces as well, so see +* if this process has the capability in the parent user +* namespace. +*/ + targ_ns = targ_ns-creator-user_ns; + } + + /* We never get here */ + return -EPERM; } On entry, as we're called from ns_capable(), cred-user is the user that the current process is running as, and, as such, may be in a separate namespace from uts_namespace - which may itself be in a separate namespace from init_user_ns. So, assume for the sake of argument that there are three user_namespaces along the chain from the calling process to the root, and that the uts_namespace belongs to the middle one. if (targ_ns != init_user_ns targ_ns-creator == cred-user) return 0; Can never match because targ_ns-creator cannot be cred-user; even if the uts_namespace belongs to our namespace, given that the creator lies outside our namespace. if (targ_ns == cred-user-user_ns) return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; Can only match if we are in the target user_namespace (ie. the one to which uts_namespace belongs), whether or not we have CAP_SYS_ADMIN. Which means that unless the uts_namespace belongs to our user_namespace, we cannot change it. Is that correct? So ns_capable() restricts you to only doing interesting things to objects that belong to a user_namespace if they are in your own user_namespace. Is that correct? If that is so, is the loop required for ns_capable()? Looking further at patch 2: #define nsown_capable(cap) (ns_capable(current_user_ns(), (cap))) Given what I've said above, I presume the loop isn't necessary here either. I think you're using ns_capable() in two different ways: (1) You're using it to see if a process has power over its descendents in a user_namespace that can be traced back to a clone() that it did with CLONE_NEWUSER. For example, automatically granting a process permission to kill descendents in a namespace it created. (2) You're using it to see if a process can access objects that might be outside its own user_namespace. For example, setting the hostname. Is it worth giving two different interfaces to make this clearer (even if they actually do the same thing)? Sorry if this seems rambly, but I'm trying to get my head round your code. David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
Casey Schaufler ca...@schaufler-ca.com writes: I confess that I remain less well educated on namespaces than I probably should be, but with what I do know it seems that the relationships between user namespaces and LSMs are bound to be strained from the beginning. Some LSMs (SELinux and Smack) are providing similar sandbox capabilities to what you get from user namespaces, but from different directions and with different use cases. Casey I won't argue about the possibility of things being strained, but I think if we focus on the semantics and not on the end goal of exactly how the pieces are to be used there can be some reasonable dialog. From 10,000 feet an user namespace represents one administrative domain. uids, gids and other external tokens are all controlled by a single group with a single security policy. In that single administrative domain things like nfs are expected to work without translating uids and gids. Before the implementation of user namespaces a single kernel could not even express the ability of dealing with multiple user namespaces simultaneously (nfs has usually punted and said despite being multiple machines you still have to be in the same administrative domain so the same user identifiers can be used throughout). From that principle we have the concept that use assigned/visible identifiers uid, gid, capabilities, security keys?, security labels? logically belong to the user namespace. Which means in implementing there are two pieces of work in implementing the user namespace. - Find all of the interesting comparisons and change them from if (id == id) to if (userns == userns) (id == id). - Potentially define and handle what happens when you mix user namespaces. I think for the first round of this we simply want to define lsms and the user namespace as not mixing or the lsm interfaces only being visible if you are in the initial user namespace. Thus reserving the definition of what happens when you have lsms and multiple user namespaces as something we can define later. I expect the proper definition is something that would allow nested lsm policy. Regardless. The namespaces are all about making the identifiers that processes deal with local to the namespace, while the underlying object manipulations should not care. The big advantage of the user namespace is that it is the only way I can see to get us out of the bind we are in with suid root executables, where we can not enable new features to untrusted applications that can change the environment for a suid root executable, because it might confuse those suid root executables to misusing their power. Once inside a user namespace nothing has dangerous powers because even a root owned process with a full set of capabilities is less powerful than a guest user outside of the namespace. No process having dangerous powers allows us to enable unsharing of other namespaces and potentially other things that today are restricted with capabilities only because they can be used to confuse a privileged executable to do something malicious. Eric ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace
David Howells dhowe...@redhat.com writes: Serge E. Hallyn se...@hallyn.com wrote: struct uts_namespace { struct kref kref; struct new_utsname name; +struct user_namespace *user_ns; }; If a uts_namespace belongs to a user_namespace, should CLONE_NEWUSER then imply CLONE_NEWUTS? Or is uts_namespace::user_ns more an implication that the set of users in user_namespace are the only ones authorised to alter the uts data. The later. I presume that the uts_namespace of a process must be owned by one of the user_namespaces in the alternating inheritance chain of namespaces and their creators leading from current_user_ns() to init_user_ns. With that in mind, looking at patch 3: - if (!capable(CAP_SYS_ADMIN)) + if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_SYS_ADMIN)) what is it you're actually asking? I presume it's 'does this user have CAP_SYS_ADMIN capability over objects belonging to the uts_namespace's user_namespace?' Yes. So, to look at the important bit of patch 2: -int cap_capable(struct task_struct *tsk, const struct cred *cred, int cap, - int audit) +int cap_capable(struct task_struct *tsk, const struct cred *cred, + struct user_namespace *targ_ns, int cap, int audit) { - return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; + for (;;) { + /* The creator of the user namespace has all caps. */ + if (targ_ns != init_user_ns targ_ns-creator == cred-user) + return 0; + + /* Do we have the necessary capabilities? */ + if (targ_ns == cred-user-user_ns) + return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; + + /* Have we tried all of the parent namespaces? */ + if (targ_ns == init_user_ns) + return -EPERM; + + /* If you have the capability in a parent user ns you have it + * in the over all children user namespaces as well, so see + * if this process has the capability in the parent user + * namespace. + */ + targ_ns = targ_ns-creator-user_ns; + } + + /* We never get here */ + return -EPERM; } On entry, as we're called from ns_capable(), cred-user is the user that the current process is running as, and, as such, may be in a separate namespace from uts_namespace - which may itself be in a separate namespace from init_user_ns. So, assume for the sake of argument that there are three user_namespaces along the chain from the calling process to the root, and that the uts_namespace belongs to the middle one. So we have the nested stack of: user_ns3-creator-user_ns == user_ns2 user_ns2-creator-user_ns == init_user_ns uts_ns2-user_ns == user_ns2 if (targ_ns != init_user_ns targ_ns-creator == cred-user) return 0; Can never match because targ_ns-creator cannot be cred-user; even if the uts_namespace belongs to our namespace, given that the creator lies outside our namespace. Initially we come in with targ_ns == user_ns2 and cred-user-user_ns in one of (user_ns3, user_ns2, or init_user_ns). targ_ns takes on values user_ns2 and init_user_ns. So when targ_ns becomes init_user_ns. If the user in question is uts_ns2-user_ns-creator. This check will indeed match. if (targ_ns == cred-user-user_ns) return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM; Can only match if we are in the target user_namespace (ie. the one to which uts_namespace belongs), whether or not we have CAP_SYS_ADMIN. As before targ_ns takes on values of user_ns2 and init_user_ns. Which means this check will match if we have CAP_SYS_ADMIN in init_user_ns or in user_ns2. Which means that unless the uts_namespace belongs to our user_namespace, we cannot change it. Is that correct? No. If you are root in a parent namespace you can also change it. So ns_capable() restricts you to only doing interesting things to objects that belong to a user_namespace if they are in your own user_namespace. Is that correct? No. Root outside your user namespace is also allowed to do interesting things. If that is so, is the loop required for ns_capable()? Yes. Looking further at patch 2: #define nsown_capable(cap) (ns_capable(current_user_ns(), (cap))) Given what I've said above, I presume the loop isn't necessary here either. I think you're using ns_capable() in two different ways: (1) You're using it to see if a process has power over its descendents in a user_namespace that can be traced back to a clone() that it did with CLONE_NEWUSER. For example, automatically granting a process permission to kill descendents in a namespace it created. (2) You're using it to see if a process can access objects that might be outside its own user_namespace. For
[Devel] Re: User namespaces and keys
On 2/23/2011 12:55 PM, Eric W. Biederman wrote: Casey Schaufler ca...@schaufler-ca.com writes: I confess that I remain less well educated on namespaces than I probably should be, but with what I do know it seems that the relationships between user namespaces and LSMs are bound to be strained from the beginning. Some LSMs (SELinux and Smack) are providing similar sandbox capabilities to what you get from user namespaces, but from different directions and with different use cases. Casey I won't argue about the possibility of things being strained, but I think if we focus on the semantics and not on the end goal of exactly how the pieces are to be used there can be some reasonable dialog. I'm sure that there will be cases where they will work together like horses in a troika. Making sensible semantics for the interactions is key, and it is entirely possible that in some cases a comparison of semantics and behaviors will lead an end user to chose either an LSM or namespaces over the combination. Just like I expect that even when we allow multiple LSMs the SELinux and Smack combination will be rare among the sane. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote: Agreed. Granularity of per inode level might be accetable in many cases. Again, I am worried faster group getting stuck behind slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at wrong layer. Should ASYNC IO be throttled before we allow task to write to page cache. The way we throttle the process based on dirty ratio, can we just check for throttle limits also there or something like that.(I think that's what you had done in your initial throttling controller implementation?) Right. This is exactly the same approach I've used in my old throttling controller: throttle sync READs and WRITEs at the block layer and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of faster group getting blocked by slower group, but the controller will be a little bit more leaky, because the writeback IO will be never throttled and we'll see some limited IO spikes during the writeback. Yes writeback will not be throttled. Not sure how big a problem that is. - We have controlled the input rate. So that should help a bit. - May be one can put some high limit on root cgroup to in blkio throttle controller to limit overall WRITE rate of the system. - For SATA disks, try to use CFQ which can try to minimize the impact of WRITE. It will atleast provide consistent bandwindth experience to application. Right. However, this is always a better solution IMHO respect to the current implementation that is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth to test it. At this point of time I have few concerns with this approach. - Configuration issues. Asking user to plan for SYNC ans ASYNC IO separately is inconvenient. One has to know the nature of workload. - Most likely we will come up with global limits (atleast to begin with), and not per device limit. That can lead to contention on one single lock and scalability issues on big systems. Having said that, this approach should reduce the kernel complexity a lot. So if we can do some intelligent locking to limit the overhead then it will boil down to reduced complexity in kernel vs ease of use to user. I guess at this point of time I am inclined towards keeping it simple in kernel. BTW, with this approach probably we can even get rid of the page tracking stuff for now. If we don't consider the swap IO, any other IO operation from our point of view will happen directly from process context (writes in memory + sync reads from the block device). However, I'm sure we'll need the page tracking also for the blkio controller soon or later. This is an important information and also the proportional bandwidth controller can take advantage of it. Couple of people have asked me that we have backup jobs running at night and we want to reduce the IO bandwidth of these jobs to limit the impact on latency of other jobs, I guess this approach will definitely solve that issue. IMHO, it might be worth trying this approach and see how well does it work. It might not solve all the problems but can be helpful in many situations. Agreed. This could be a good tradeoff for a lot of common cases. I feel that for proportional bandwidth division, implementing ASYNC control at CFQ will make sense because even if things get serialized in higher layers, consequences are not very bad as it is work conserving algorithm. But for throttling serialization will lead to bad consequences. Agreed. May be one can think of new files in blkio controller to limit async IO per group during page dirty time. blkio.throttle.async.write_bps_limit blkio.throttle.async.write_iops_limit OK, I'll try to add the async throttling logic and use this interface. -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace
Eric W. Biederman ebied...@xmission.com wrote: Which means that unless the uts_namespace belongs to our user_namespace, we cannot change it. Is that correct? No. If you are root in a parent namespace you can also change it. But surely, by definition, if you're a user in this namespace, you can't also be root in a parent namespace... For the case I worked through current_user() is a member of current_user_ns() and can't also be a member of its parent, grandparent, etc. - or can it? David ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Wed, 23 Feb 2011 09:59:11 +0100 Andrea Righi ari...@develer.com wrote: I wonder I can make pc-mem_cgroup to be pc-memid(16bit), then, == static inline struct mem_cgroup *get_memcg_from_pc(struct page_cgroup *pc) { struct cgroup_subsys_state *css = css_lookup(mem_cgroup_subsys, pc-memid); return container_of(css, struct mem_cgroup, css); } == Overhead will be seen at updating file statistics and LRU management. But, hmm, can't you do that tracking without page_cgroup ? Because the number of dirty/writeback pages are far smaller than total pages, chasing I/O with dynamic structure is not very bad.. prepareing [pfn - blkio] record table and move that information to struct bio in dynamic way is very difficult ? This would be ok for dirty pages, but consider that we're also tracking anonymous pages. So, if we want to control the swap IO we actually need to save this information for a lot of pages and at the end I think we'll basically duplicate the page_cgroup code. swap io is always started with bio and the task/mm_struct. So, if we can record information in bio, no page tracking is required. You can record information to bio just by reading mm-owner. Thanks, -Kame ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote: On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote: Agreed. Granularity of per inode level might be accetable in many cases. Again, I am worried faster group getting stuck behind slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at wrong layer. Should ASYNC IO be throttled before we allow task to write to page cache. The way we throttle the process based on dirty ratio, can we just check for throttle limits also there or something like that.(I think that's what you had done in your initial throttling controller implementation?) Right. This is exactly the same approach I've used in my old throttling controller: throttle sync READs and WRITEs at the block layer and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of faster group getting blocked by slower group, but the controller will be a little bit more leaky, because the writeback IO will be never throttled and we'll see some limited IO spikes during the writeback. Yes writeback will not be throttled. Not sure how big a problem that is. - We have controlled the input rate. So that should help a bit. - May be one can put some high limit on root cgroup to in blkio throttle controller to limit overall WRITE rate of the system. - For SATA disks, try to use CFQ which can try to minimize the impact of WRITE. It will atleast provide consistent bandwindth experience to application. Right. However, this is always a better solution IMHO respect to the current implementation that is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth to test it. At this point of time I have few concerns with this approach. - Configuration issues. Asking user to plan for SYNC ans ASYNC IO separately is inconvenient. One has to know the nature of workload. - Most likely we will come up with global limits (atleast to begin with), and not per device limit. That can lead to contention on one single lock and scalability issues on big systems. Having said that, this approach should reduce the kernel complexity a lot. So if we can do some intelligent locking to limit the overhead then it will boil down to reduced complexity in kernel vs ease of use to user. I guess at this point of time I am inclined towards keeping it simple in kernel. BTW, with this approach probably we can even get rid of the page tracking stuff for now. Agreed. If we don't consider the swap IO, any other IO operation from our point of view will happen directly from process context (writes in memory + sync reads from the block device). Why do we need to account for swap IO? Application never asked for swap IO. It is kernel's decision to move soem pages to swap to free up some memory. What's the point in charging those pages to application group and throttle accordingly? However, I'm sure we'll need the page tracking also for the blkio controller soon or later. This is an important information and also the proportional bandwidth controller can take advantage of it. Yes page tracking will be needed for CFQ proportional bandwidth ASYNC write support. But until and unless we implement memory cgroup dirty ratio and figure a way out to make writeback logic cgroup aware, till then I think page tracking stuff is not really useful. Couple of people have asked me that we have backup jobs running at night and we want to reduce the IO bandwidth of these jobs to limit the impact on latency of other jobs, I guess this approach will definitely solve that issue. IMHO, it might be worth trying this approach and see how well does it work. It might not solve all the problems but can be helpful in many situations. Agreed. This could be a good tradeoff for a lot of common cases. I feel that for proportional bandwidth division, implementing ASYNC control at CFQ will make sense because even if things get serialized in higher layers, consequences are not very bad as it is work conserving algorithm. But for throttling serialization will lead to bad consequences. Agreed. May be one can think of new files in blkio controller to limit async IO per group during page dirty time. blkio.throttle.async.write_bps_limit blkio.throttle.async.write_iops_limit OK, I'll try to add the async throttling logic and use this interface. Cool, I would like to play with it a bit once patches are ready. Thanks Vivek ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel
[Devel] [PATCH 1/4] userns: let clone_uts_ns() handle setting uts-user_ns
To do so we need to pass in the task_struct who'll get the utsname, so we can get its user_ns. Changelog: Feb 23: As per Oleg's coment, just pass in tsk. Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com --- include/linux/utsname.h |6 +++--- kernel/nsproxy.c|7 +-- kernel/utsname.c| 12 +++- 3 files changed, 11 insertions(+), 14 deletions(-) diff --git a/include/linux/utsname.h b/include/linux/utsname.h index 85171be..21b4566 100644 --- a/include/linux/utsname.h +++ b/include/linux/utsname.h @@ -53,7 +53,7 @@ static inline void get_uts_ns(struct uts_namespace *ns) } extern struct uts_namespace *copy_utsname(unsigned long flags, - struct uts_namespace *ns); + struct task_struct *tsk); extern void free_uts_ns(struct kref *kref); static inline void put_uts_ns(struct uts_namespace *ns) @@ -70,12 +70,12 @@ static inline void put_uts_ns(struct uts_namespace *ns) } static inline struct uts_namespace *copy_utsname(unsigned long flags, - struct uts_namespace *ns) +struct task_struct *tsk) { if (flags CLONE_NEWUTS) return ERR_PTR(-EINVAL); - return ns; + return tsk-nsproxy-uts_ns; } #endif diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index b6dbff2..ac8a56e 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -69,16 +69,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags, goto out_ns; } - new_nsp-uts_ns = copy_utsname(flags, tsk-nsproxy-uts_ns); + new_nsp-uts_ns = copy_utsname(flags, tsk); if (IS_ERR(new_nsp-uts_ns)) { err = PTR_ERR(new_nsp-uts_ns); goto out_uts; } - if (new_nsp-uts_ns != tsk-nsproxy-uts_ns) { - put_user_ns(new_nsp-uts_ns-user_ns); - new_nsp-uts_ns-user_ns = task_cred_xxx(tsk, user)-user_ns; - get_user_ns(new_nsp-uts_ns-user_ns); - } new_nsp-ipc_ns = copy_ipcs(flags, tsk-nsproxy-ipc_ns); if (IS_ERR(new_nsp-ipc_ns)) { diff --git a/kernel/utsname.c b/kernel/utsname.c index a7b3a8d..4464617 100644 --- a/kernel/utsname.c +++ b/kernel/utsname.c @@ -31,7 +31,8 @@ static struct uts_namespace *create_uts_ns(void) * @old_ns: namespace to clone * Return NULL on error (failure to kmalloc), new ns otherwise */ -static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns) +static struct uts_namespace *clone_uts_ns(struct task_struct *tsk, + struct uts_namespace *old_ns) { struct uts_namespace *ns; @@ -41,8 +42,7 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns) down_read(uts_sem); memcpy(ns-name, old_ns-name, sizeof(ns-name)); - ns-user_ns = old_ns-user_ns; - get_user_ns(ns-user_ns); + ns-user_ns = get_user_ns(task_cred_xxx(tsk, user)-user_ns); up_read(uts_sem); return ns; } @@ -53,8 +53,10 @@ static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns) * utsname of this process won't be seen by parent, and vice * versa. */ -struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace *old_ns) +struct uts_namespace *copy_utsname(unsigned long flags, + struct task_struct *tsk) { + struct uts_namespace *old_ns = tsk-nsproxy-uts_ns; struct uts_namespace *new_ns; BUG_ON(!old_ns); @@ -63,7 +65,7 @@ struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace *ol if (!(flags CLONE_NEWUTS)) return old_ns; - new_ns = clone_uts_ns(old_ns); + new_ns = clone_uts_ns(tsk, old_ns); put_uts_ns(old_ns); return new_ns; -- 1.7.0.4 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/4] userns: let copy_ipcs handle setting ipc_ns-user_ns
To do that, we have to pass in the task_struct of the task which will own the ipc_ns, so we can assign its user_ns. Changelog: Feb 23: As per Oleg comment, just pass in tsk. To get the ipc_ns from the nsproxy we need to include nsproxy.h Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com --- include/linux/ipc_namespace.h |7 --- ipc/namespace.c | 13 - kernel/nsproxy.c |7 +-- 3 files changed, 13 insertions(+), 14 deletions(-) diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 46d2eb4..c079d09 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -5,6 +5,7 @@ #include linux/idr.h #include linux/rwsem.h #include linux/notifier.h +#include linux/nsproxy.h /* * ipc namespace events @@ -93,7 +94,7 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { return 0; } #if defined(CONFIG_IPC_NS) extern struct ipc_namespace *copy_ipcs(unsigned long flags, - struct ipc_namespace *ns); + struct task_struct *tsk); static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns) { if (ns) @@ -104,12 +105,12 @@ static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns) extern void put_ipc_ns(struct ipc_namespace *ns); #else static inline struct ipc_namespace *copy_ipcs(unsigned long flags, - struct ipc_namespace *ns) + struct task_struct *tsk) { if (flags CLONE_NEWIPC) return ERR_PTR(-EINVAL); - return ns; + return tsk-nsproxy-ipc_ns; } static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns) diff --git a/ipc/namespace.c b/ipc/namespace.c index aa18899..3c3e522 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -15,7 +15,8 @@ #include util.h -static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns) +static struct ipc_namespace *create_ipc_ns(struct task_struct *tsk, + struct ipc_namespace *old_ns) { struct ipc_namespace *ns; int err; @@ -44,17 +45,19 @@ static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns) ipcns_notify(IPCNS_CREATED); register_ipcns_notifier(ns); - ns-user_ns = old_ns-user_ns; - get_user_ns(ns-user_ns); + ns-user_ns = get_user_ns(task_cred_xxx(tsk, user)-user_ns); return ns; } -struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns) +struct ipc_namespace *copy_ipcs(unsigned long flags, + struct task_struct *tsk) { + struct ipc_namespace *ns = tsk-nsproxy-ipc_ns; + if (!(flags CLONE_NEWIPC)) return get_ipc_ns(ns); - return create_ipc_ns(ns); + return create_ipc_ns(tsk, ns); } /* diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index ac8a56e..a05d191 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -75,16 +75,11 @@ static struct nsproxy *create_new_namespaces(unsigned long flags, goto out_uts; } - new_nsp-ipc_ns = copy_ipcs(flags, tsk-nsproxy-ipc_ns); + new_nsp-ipc_ns = copy_ipcs(flags, tsk); if (IS_ERR(new_nsp-ipc_ns)) { err = PTR_ERR(new_nsp-ipc_ns); goto out_ipc; } - if (new_nsp-ipc_ns != tsk-nsproxy-ipc_ns) { - put_user_ns(new_nsp-ipc_ns-user_ns); - new_nsp-ipc_ns-user_ns = task_cred_xxx(tsk, user)-user_ns; - get_user_ns(new_nsp-ipc_ns-user_ns); - } new_nsp-pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk)); if (IS_ERR(new_nsp-pid_ns)) { -- 1.7.0.4 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 5/4] Clean up capability.h and capability.c
Convert macros to functions to let type safety do its thing. Switch some functions from ints to more appropriate bool. Move all forward declarations together to top of the #ifdef __KERNEL__ section. Use kernel-doc format for comments. Some macros couldn't be converted because they use functions from security.h which sometimes are extern and sometimes static inline, and we don't want to #include security.h in capability.h. Also add a real current_user_ns function (and convert the existing macro to _current_user_ns() so we can use it in capability.h without #including cred.h. Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com --- include/linux/capability.h | 38 ++ include/linux/cred.h |4 +++- kernel/capability.c| 20 kernel/cred.c |5 + 4 files changed, 46 insertions(+), 21 deletions(-) diff --git a/include/linux/capability.h b/include/linux/capability.h index bc0f262..688462f 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -368,6 +368,17 @@ struct cpu_vfs_cap_data { #ifdef __KERNEL__ +struct dentry; +struct user_namespace; + +extern struct user_namespace init_user_ns; + +struct user_namespace *current_user_ns(void); + +extern const kernel_cap_t __cap_empty_set; +extern const kernel_cap_t __cap_full_set; +extern const kernel_cap_t __cap_init_eff_set; + /* * Internal kernel functions only */ @@ -530,10 +541,6 @@ static inline kernel_cap_t cap_raise_nfsd_set(const kernel_cap_t a, cap_intersect(permitted, __cap_nfsd_set)); } -extern const kernel_cap_t __cap_empty_set; -extern const kernel_cap_t __cap_full_set; -extern const kernel_cap_t __cap_init_eff_set; - /** * has_capability - Determine if a task has a superior capability available * @t: The task in question @@ -560,18 +567,25 @@ extern const kernel_cap_t __cap_init_eff_set; * Note that this does not set PF_SUPERPRIV on the task. */ #define has_capability_noaudit(t, cap) \ - (security_real_capable_noaudit((t), init_user_ns, (cap)) == 0) + (security_real_capable_noaudit((t), init_user_ns, (cap)) == 0) -struct user_namespace; -extern struct user_namespace init_user_ns; -extern int capable(int cap); -extern int ns_capable(struct user_namespace *ns, int cap); -extern int task_ns_capable(struct task_struct *t, int cap); +extern bool capable(int cap); +extern bool ns_capable(struct user_namespace *ns, int cap); +extern bool task_ns_capable(struct task_struct *t, int cap); -#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap))) +/** + * nsown_capable - Check superior capability to one's own user_ns + * @cap: The capability in question + * + * Return true if the current task has the given superior capability + * targeted at its own user namespace. + */ +static inline bool nsown_capable(int cap) +{ + return ns_capable(current_user_ns(), cap); +} /* audit system wants to get cap info from files as well */ -struct dentry; extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); #endif /* __KERNEL__ */ diff --git a/include/linux/cred.h b/include/linux/cred.h index 4aaeab3..9aeeb0b 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -354,9 +354,11 @@ static inline void put_cred(const struct cred *_cred) #define current_fsgid()(current_cred_xxx(fsgid)) #define current_cap() (current_cred_xxx(cap_effective)) #define current_user() (current_cred_xxx(user)) -#define current_user_ns() (current_cred_xxx(user)-user_ns) +#define _current_user_ns() (current_cred_xxx(user)-user_ns) #define current_security() (current_cred_xxx(security)) +extern struct user_namespace *current_user_ns(void); + #define current_uid_gid(_uid, _gid)\ do { \ const struct cred *__cred; \ diff --git a/kernel/capability.c b/kernel/capability.c index 916658c..0a3d2c8 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -300,7 +300,7 @@ error: * This sets PF_SUPERPRIV on the task if the capability is available on the * assumption that it's about to be used. */ -int capable(int cap) +bool capable(int cap) { return ns_capable(init_user_ns, cap); } @@ -317,7 +317,7 @@ EXPORT_SYMBOL(capable); * This sets PF_SUPERPRIV on the task if the capability is available on the * assumption that it's about to be used. */ -int ns_capable(struct user_namespace *ns, int cap) +bool ns_capable(struct user_namespace *ns, int cap) { if (unlikely(!cap_valid(cap))) { printk(KERN_CRIT capable() called with invalid cap=%u\n, cap); @@ -326,17 +326,21 @@ int ns_capable(struct user_namespace *ns, int cap) if (security_capable(ns, current_cred(), cap) == 0) { current-flags |= PF_SUPERPRIV; - return 1; + return true; }
[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces
Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:03:33 + Serge E. Hallyn se...@hallyn.com wrote: ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task the has CAP_SYS_PTRACE capability. ... --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set; */ #define has_capability(t, cap) (security_real_capable((t), init_user_ns, (cap)) == 0) +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0) macroitis. Thanks for the review, Andrew. Unfortunately this one is hard to turn into a function beecause it uses security_real_capable(), which is sometimes defined in security/security.c as a real function, and other times as a static inline in include/linux/security.h. So I'd have to #include security.h in capability.h, but security.h already #includes capability.h. All the other comments affect same_or_ancestor_user_ns(), which following Eric's feedback is going away. /** * has_capability_noaudit - Determine if a task has a superior capability available (unaudited) * @t: The task in question diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index faf4679..862fc59 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns) uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid); gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid); +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim); bool. #else static inline struct user_namespace *get_user_ns(struct user_namespace *ns) ... --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -129,6 +129,22 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t return overflowgid; } +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim) +{ + struct user_namespace *u1 = task_cred_xxx(task, user)-user_ns; + struct user_namespace *u2 = task_cred_xxx(victim, user)-user_ns; + for (;;) { + if (u1 == u2) + return 1; + if (u1 == init_user_ns) + return 0; + u1 = u1-creator-user_ns; + } + /* We never get here */ + return 0; Remove? +} + static __init int user_namespaces_init(void) { user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); ... int cap_ptrace_access_check(struct task_struct *child, unsigned int mode) { int ret = 0; + const struct cred *cred, *tcred; rcu_read_lock(); - if (!cap_issubset(__task_cred(child)-cap_permitted, - current_cred()-cap_permitted) - !capable(CAP_SYS_PTRACE)) - ret = -EPERM; + cred = current_cred(); + tcred = __task_cred(child); + /* +* The ancestor user_ns check may be gratuitous, as I think +* we've already guaranteed that in kernel/ptrace.c. +*/ ? + if (same_or_ancestor_user_ns(current, child) + cap_issubset(tcred-cap_permitted, cred-cap_permitted)) + goto out; + if (ns_capable(tcred-user-user_ns, CAP_SYS_PTRACE)) + goto out; + ret = -EPERM; +out: rcu_read_unlock(); return ret; } ... ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, 23 Feb 2011 19:10:33 -0500 Vivek Goyal vgo...@redhat.com wrote: On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote: On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote: Agreed. Granularity of per inode level might be accetable in many cases. Again, I am worried faster group getting stuck behind slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at wrong layer. Should ASYNC IO be throttled before we allow task to write to page cache. The way we throttle the process based on dirty ratio, can we just check for throttle limits also there or something like that.(I think that's what you had done in your initial throttling controller implementation?) Right. This is exactly the same approach I've used in my old throttling controller: throttle sync READs and WRITEs at the block layer and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of faster group getting blocked by slower group, but the controller will be a little bit more leaky, because the writeback IO will be never throttled and we'll see some limited IO spikes during the writeback. Yes writeback will not be throttled. Not sure how big a problem that is. - We have controlled the input rate. So that should help a bit. - May be one can put some high limit on root cgroup to in blkio throttle controller to limit overall WRITE rate of the system. - For SATA disks, try to use CFQ which can try to minimize the impact of WRITE. It will atleast provide consistent bandwindth experience to application. Right. However, this is always a better solution IMHO respect to the current implementation that is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth to test it. At this point of time I have few concerns with this approach. - Configuration issues. Asking user to plan for SYNC ans ASYNC IO separately is inconvenient. One has to know the nature of workload. - Most likely we will come up with global limits (atleast to begin with), and not per device limit. That can lead to contention on one single lock and scalability issues on big systems. Having said that, this approach should reduce the kernel complexity a lot. So if we can do some intelligent locking to limit the overhead then it will boil down to reduced complexity in kernel vs ease of use to user. I guess at this point of time I am inclined towards keeping it simple in kernel. BTW, with this approach probably we can even get rid of the page tracking stuff for now. Agreed. If we don't consider the swap IO, any other IO operation from our point of view will happen directly from process context (writes in memory + sync reads from the block device). Why do we need to account for swap IO? Application never asked for swap IO. It is kernel's decision to move soem pages to swap to free up some memory. What's the point in charging those pages to application group and throttle accordingly? I think swap I/O should be controlled by memcg's dirty_ratio. But, IIRC, NEC guy had a requirement for this... I think some enterprise cusotmer may want to throttle the whole speed of swapout I/O (not swapin)...so, they may be glad if they can limit throttle the I/O against a disk partition or all I/O tagged as 'swapio' rather than some cgroup name. But I'm afraid slow swapout may consume much dirty_ratio and make things worse ;) However, I'm sure we'll need the page tracking also for the blkio controller soon or later. This is an important information and also the proportional bandwidth controller can take advantage of it. Yes page tracking will be needed for CFQ proportional bandwidth ASYNC write support. But until and unless we implement memory cgroup dirty ratio and figure a way out to make writeback logic cgroup aware, till then I think page tracking stuff is not really useful. I think Greg Thelen is now preparing patches for dirty_ratio. Thanks, -Kame ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns
Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:03:25 + Serge E. Hallyn se...@hallyn.com wrote: /* + * called with RCU read lock from check_kill_permission() + */ +static inline int kill_ok_by_cred(struct task_struct *t) +{ + const struct cred *cred = current_cred(); + const struct cred *tcred = __task_cred(t); + + if (cred-user-user_ns == tcred-user-user_ns + (cred-euid == tcred-suid || +cred-euid == tcred-uid || +cred-uid == tcred-suid || +cred-uid == tcred-uid)) + return 1; + + if (ns_capable(tcred-user-user_ns, CAP_KILL)) + return 1; + + return 0; +} The compiler will inline this for us. Is that simply true with everything (worth inlining) nowadays, or is there a particular implicit hint to the compiler that'll make that happen? Not that I guess it's even particularly important in this case. From: Serge E. Hallyn serge.hal...@canonical.com Date: Thu, 24 Feb 2011 00:26:02 + Subject: [PATCH 1/2] userns: let compiler inline kill_ok_by_cred (per akpm) Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com --- kernel/signal.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index ffe4bdf..12702b4 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -638,7 +638,7 @@ static inline bool si_fromuser(const struct siginfo *info) /* * called with RCU read lock from check_kill_permission() */ -static inline int kill_ok_by_cred(struct task_struct *t) +static int kill_ok_by_cred(struct task_struct *t) { const struct cred *cred = current_cred(); const struct cred *tcred = __task_cred(t); -- 1.7.0.4 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH] userns: ptrace: incorporate feedback from Eric
same_or_ancestore_user_ns() was not an appropriate check to constrain cap_issubset. Rather, cap_issubset() only is meaningful when both capsets are in the same user_ns. Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com --- include/linux/user_namespace.h |9 - kernel/user_namespace.c| 16 security/commoncap.c | 28 ++-- 3 files changed, 10 insertions(+), 43 deletions(-) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 862fc59..faf4679 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -39,9 +39,6 @@ static inline void put_user_ns(struct user_namespace *ns) uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid); gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid); -int same_or_ancestor_user_ns(struct task_struct *task, - struct task_struct *victim); - #else static inline struct user_namespace *get_user_ns(struct user_namespace *ns) @@ -69,12 +66,6 @@ static inline gid_t user_ns_map_gid(struct user_namespace *to, return gid; } -static inline int same_or_ancestor_user_ns(struct task_struct *task, - struct task_struct *victim) -{ - return 1; -} - #endif #endif /* _LINUX_USER_H */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 0ef2258..9da289c 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -129,22 +129,6 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t return overflowgid; } -int same_or_ancestor_user_ns(struct task_struct *task, - struct task_struct *victim) -{ - struct user_namespace *u1 = task_cred_xxx(task, user)-user_ns; - struct user_namespace *u2 = task_cred_xxx(victim, user)-user_ns; - for (;;) { - if (u1 == u2) - return 1; - if (u1 == init_user_ns) - return 0; - u1 = u1-creator-user_ns; - } - /* We never get here */ - return 0; -} - static __init int user_namespaces_init(void) { user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); diff --git a/security/commoncap.c b/security/commoncap.c index 12ff65c..526106f 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -142,19 +142,15 @@ int cap_settime(struct timespec *ts, struct timezone *tz) int cap_ptrace_access_check(struct task_struct *child, unsigned int mode) { int ret = 0; - const struct cred *cred, *tcred; + const struct cred *cred, *child_cred; rcu_read_lock(); cred = current_cred(); - tcred = __task_cred(child); - /* -* The ancestor user_ns check may be gratuitous, as I think -* we've already guaranteed that in kernel/ptrace.c. -*/ - if (same_or_ancestor_user_ns(current, child) - cap_issubset(tcred-cap_permitted, cred-cap_permitted)) + child_cred = __task_cred(child); + if (cred-user-user_ns == child_cred-user-user_ns + cap_issubset(child_cred-cap_permitted, cred-cap_permitted)) goto out; - if (ns_capable(tcred-user-user_ns, CAP_SYS_PTRACE)) + if (ns_capable(child_cred-user-user_ns, CAP_SYS_PTRACE)) goto out; ret = -EPERM; out: @@ -178,19 +174,15 @@ out: int cap_ptrace_traceme(struct task_struct *parent) { int ret = 0; - const struct cred *cred, *tcred; + const struct cred *cred, *child_cred; rcu_read_lock(); cred = __task_cred(parent); - tcred = current_cred(); - /* -* The ancestor user_ns check may be gratuitous, as I think -* we've already guaranteed that in kernel/ptrace.c. -*/ - if (same_or_ancestor_user_ns(parent, current) - cap_issubset(tcred-cap_permitted, cred-cap_permitted)) + child_cred = current_cred(); + if (cred-user-user_ns == child_cred-user-user_ns + cap_issubset(child_cred-cap_permitted, cred-cap_permitted)) goto out; - if (has_ns_capability(parent, tcred-user-user_ns, CAP_SYS_PTRACE)) + if (has_ns_capability(parent, child_cred-user-user_ns, CAP_SYS_PTRACE)) goto out; ret = -EPERM; out: -- 1.7.0.4 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns
On Thu, 24 Feb 2011 00:48:18 + Serge E. Hallyn se...@hallyn.com wrote: Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:03:25 + Serge E. Hallyn se...@hallyn.com wrote: /* + * called with RCU read lock from check_kill_permission() + */ +static inline int kill_ok_by_cred(struct task_struct *t) +{ + const struct cred *cred = current_cred(); + const struct cred *tcred = __task_cred(t); + + if (cred-user-user_ns == tcred-user-user_ns + (cred-euid == tcred-suid || + cred-euid == tcred-uid || + cred-uid == tcred-suid || + cred-uid == tcred-uid)) + return 1; + + if (ns_capable(tcred-user-user_ns, CAP_KILL)) + return 1; + + return 0; +} The compiler will inline this for us. Is that simply true with everything (worth inlining) nowadays, or is there a particular implicit hint to the compiler that'll make that happen? We've basically stopped inlining things nowadays. gcc inlines aggressively and sometimes we have to use noinline to stop it. Also, modern gcc's like to ignore the inline directive anwyay, so we have to resort to __always_inline when we disagree. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, Feb 23, 2011 at 4:40 PM, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Wed, 23 Feb 2011 19:10:33 -0500 Vivek Goyal vgo...@redhat.com wrote: On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote: On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote: Agreed. Granularity of per inode level might be accetable in many cases. Again, I am worried faster group getting stuck behind slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at wrong layer. Should ASYNC IO be throttled before we allow task to write to page cache. The way we throttle the process based on dirty ratio, can we just check for throttle limits also there or something like that.(I think that's what you had done in your initial throttling controller implementation?) Right. This is exactly the same approach I've used in my old throttling controller: throttle sync READs and WRITEs at the block layer and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of faster group getting blocked by slower group, but the controller will be a little bit more leaky, because the writeback IO will be never throttled and we'll see some limited IO spikes during the writeback. Yes writeback will not be throttled. Not sure how big a problem that is. - We have controlled the input rate. So that should help a bit. - May be one can put some high limit on root cgroup to in blkio throttle controller to limit overall WRITE rate of the system. - For SATA disks, try to use CFQ which can try to minimize the impact of WRITE. It will atleast provide consistent bandwindth experience to application. Right. However, this is always a better solution IMHO respect to the current implementation that is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth to test it. At this point of time I have few concerns with this approach. - Configuration issues. Asking user to plan for SYNC ans ASYNC IO separately is inconvenient. One has to know the nature of workload. - Most likely we will come up with global limits (atleast to begin with), and not per device limit. That can lead to contention on one single lock and scalability issues on big systems. Having said that, this approach should reduce the kernel complexity a lot. So if we can do some intelligent locking to limit the overhead then it will boil down to reduced complexity in kernel vs ease of use to user. I guess at this point of time I am inclined towards keeping it simple in kernel. BTW, with this approach probably we can even get rid of the page tracking stuff for now. Agreed. If we don't consider the swap IO, any other IO operation from our point of view will happen directly from process context (writes in memory + sync reads from the block device). Why do we need to account for swap IO? Application never asked for swap IO. It is kernel's decision to move soem pages to swap to free up some memory. What's the point in charging those pages to application group and throttle accordingly? I think swap I/O should be controlled by memcg's dirty_ratio. But, IIRC, NEC guy had a requirement for this... I think some enterprise cusotmer may want to throttle the whole speed of swapout I/O (not swapin)...so, they may be glad if they can limit throttle the I/O against a disk partition or all I/O tagged as 'swapio' rather than some cgroup name. But I'm afraid slow swapout may consume much dirty_ratio and make things worse ;) However, I'm sure we'll need the page tracking also for the blkio controller soon or later. This is an important information and also the proportional bandwidth controller can take advantage of it. Yes page tracking will be needed for CFQ proportional bandwidth ASYNC write support. But until and unless we implement memory cgroup dirty ratio and figure a way out to make writeback logic cgroup aware, till then I think page tracking stuff is not really useful. I think Greg Thelen is now preparing patches for dirty_ratio. Thanks, -Kame Correct. I am working on the memcg dirty_ratio patches with latest mmotm memcg. I am running some test cases which should be complete tomorrow. Once testing is complete, I will sent the patches for review. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH] userns: ptrace: incorporate feedback from Eric
Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 24 Feb 2011 00:49:01 + Serge E. Hallyn se...@hallyn.com wrote: same_or_ancestore_user_ns() was not an appropriate check to constrain cap_issubset. Rather, cap_issubset() only is meaningful when both capsets are in the same user_ns. I queued this as a fix against userns-allow-ptrace-from-non-init-user-namespaces.patch, but I get the feeling that it would be better to just drop everything and start again? Sure, I'll rebase and resend. I wonder if I should trim the Cc list for the next round. thanks, -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 9/9] userns: check user namespace for task-file uid equivalence checks
On Thu, 24 Feb 2011 03:24:16 + Serge E. Hallyn se...@hallyn.com wrote: Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:04:07 + Serge E. Hallyn se...@hallyn.com wrote: There's a fairly well adhered to convention that global symbols (and often static symbols) have a prefix which identifies the subsystem to which they belong. This patchset rather scorns that convention. Most of these identifiers are pretty obviously from the capability subsystem, but still... Would 'inode_owner_or_capable' be better and and make sense? I suppose so. We've totally screwed that pooch in the VFS (grep EXPORT fs/inode.c). But it's never to late to start. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
* Andrea Righi ari...@develer.com [2011-02-22 18:12:51]: Currently the blkio.throttle controller only support synchronous IO requests. This means that we always look at the current task to identify the owner of each IO request. However dirty pages in the page cache can be wrote to disk asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context respect to the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicate respect to the synchronous IO from the blkio controller's perspective. The same concept is also valid for anonymous pages involed in IO operations (swap). This patch allow to track the cgroup that originally dirtied each page in page cache and each anonymous page and pass these informations to the blk-throttle controller. These informations can be used to provide a better service level differentiation of buffered writes swap IO between different cgroups. Testcase - create a cgroup with 1MiB/s write limit: # mount -t cgroup -o blkio none /mnt/cgroup # mkdir /mnt/cgroup/foo # echo 8:0 $((1024 * 1024)) /mnt/cgroup/foo/blkio.throttle.write_bps_device - move a task into the cgroup and run a dd to generate some writeback IO Results: - 2.6.38-rc6 vanilla: $ cat /proc/$$/cgroup 1:blkio:/foo $ dd if=/dev/zero of=zero bs=1M count=1024 $ dstat -df --dsk/sda-- read writ 019M 019M 0 0 0 0 019M ... - 2.6.38-rc6 + blk-throttle writeback IO control: $ cat /proc/$$/cgroup 1:blkio:/foo $ dd if=/dev/zero of=zero bs=1M count=1024 $ dstat -df --dsk/sda-- read writ 0 1024 0 1024 0 1024 0 1024 0 1024 ... Thanks for looking into this, further review follows. -- Three Cheers, Balbir ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: User namespaces and keys
Casey Schaufler ca...@schaufler-ca.com writes: On 2/23/2011 12:55 PM, Eric W. Biederman wrote: Casey Schaufler ca...@schaufler-ca.com writes: I confess that I remain less well educated on namespaces than I probably should be, but with what I do know it seems that the relationships between user namespaces and LSMs are bound to be strained from the beginning. Some LSMs (SELinux and Smack) are providing similar sandbox capabilities to what you get from user namespaces, but from different directions and with different use cases. Casey I won't argue about the possibility of things being strained, but I think if we focus on the semantics and not on the end goal of exactly how the pieces are to be used there can be some reasonable dialog. I'm sure that there will be cases where they will work together like horses in a troika. Making sensible semantics for the interactions is key, and it is entirely possible that in some cases a comparison of semantics and behaviors will lead an end user to chose either an LSM or namespaces over the combination. Just like I expect that even when we allow multiple LSMs the SELinux and Smack combination will be rare among the sane. That sounds about right. Eric ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel