[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Andrea Righi
On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote:
  I think we should accept to have an inode granularity. We could redesign
  the writeback code to work per-cgroup / per-page, etc. but that would
  add a huge overhead. The limit of inode granularity could be an
  acceptable tradeoff, cgroups are supposed to work to different files
  usually, well.. except when databases come into play (ouch!).
 
 Agreed. Granularity of per inode level might be accetable in many 
 cases. Again, I am worried faster group getting stuck behind slower
 group.
 
 I am wondering if we are trying to solve the problem of ASYNC write throttling
 at wrong layer. Should ASYNC IO be throttled before we allow task to write to
 page cache. The way we throttle the process based on dirty ratio, can we
 just check for throttle limits also there or something like that.(I think
 that's what you had done in your initial throttling controller 
 implementation?)

Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.

This is probably the simplest way to resolve the problem of faster group
getting blocked by slower group, but the controller will be a little bit
more leaky, because the writeback IO will be never throttled and we'll
see some limited IO spikes during the writeback. However, this is always
a better solution IMHO respect to the current implementation that is
affected by that kind of priority inversion problem.

I can try to add this logic to the current blk-throttle controller if
you think it is worth to test it.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages

2011-02-23 Thread Andrea Righi
On Tue, Feb 22, 2011 at 07:07:19PM -0500, Vivek Goyal wrote:
 On Wed, Feb 23, 2011 at 12:05:34AM +0100, Andrea Righi wrote:
  On Tue, Feb 22, 2011 at 04:00:30PM -0500, Vivek Goyal wrote:
   On Tue, Feb 22, 2011 at 06:12:55PM +0100, Andrea Righi wrote:
Add the tracking of buffered (writeback) and anonymous pages.

Dirty pages in the page cache can be processed asynchronously by the
per-bdi flusher kernel threads or by any other thread in the system,
according to the writeback policy.

For this reason the real writes to the underlying block devices may
occur in a different IO context respect to the task that originally
generated the dirty pages involved in the IO operation. This makes
the tracking and throttling of writeback IO more complicate respect to
the synchronous IO from the blkio controller's point of view.

The idea is to save the cgroup owner of each anonymous page and dirty
page in page cache. A page is associated to a cgroup the first time it
is dirtied in memory (for file cache pages) or when it is set as
swap-backed (for anonymous pages). This information is stored using the
page_cgroup functionality.

Then, at the block layer, it is possible to retrieve the throttle group
looking at the bio_page(bio). If the page was not explicitly associated
to any cgroup the IO operation is charged to the current task/cgroup, as
it was done by the previous implementation.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-throttle.c   |   87 
+++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -8,6 +8,10 @@
 #include linux/slab.h
 #include linux/blkdev.h
 #include linux/bio.h
+#include linux/memcontrol.h
+#include linux/mm_inline.h
+#include linux/pagemap.h
+#include linux/page_cgroup.h
 #include linux/blktrace_api.h
 #include linux/blk-cgroup.h
 
@@ -221,6 +225,85 @@ done:
return tg;
 }
 
+static inline bool is_kernel_io(void)
+{
+   return !!(current-flags  (PF_KTHREAD | PF_KSWAPD | 
PF_MEMALLOC));
+}
+
+static int throtl_set_page_owner(struct page *page, struct mm_struct 
*mm)
+{
+   struct blkio_cgroup *blkcg;
+   unsigned short id = 0;
+
+   if (blkio_cgroup_disabled())
+   return 0;
+   if (!mm)
+   goto out;
+   rcu_read_lock();
+   blkcg = task_to_blkio_cgroup(rcu_dereference(mm-owner));
+   if (likely(blkcg))
+   id = css_id(blkcg-css);
+   rcu_read_unlock();
+out:
+   return page_cgroup_set_owner(page, id);
+}
+
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct 
*mm)
+{
+   return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
+
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct 
*mm)
+{
+   if (is_kernel_io() || !page_is_file_cache(page))
+   return 0;
+   return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_filepage_owner);
   
   Why are we exporting all these symbols?
  
  Right. Probably a single one is enough:
  
   int blk_throtl_set_page_owner(struct page *page,
  struct mm_struct *mm, bool anon);
 
 Who is going to use this single export? Which module?
 

I was actually thinking at some filesystem modules, but I was wrong,
because at the moment no one needs the export. I'll remove it in the
next version of the patch.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-23 Thread Andrea Righi
On Wed, Feb 23, 2011 at 01:49:10PM +0900, KAMEZAWA Hiroyuki wrote:
 On Wed, 23 Feb 2011 00:37:18 +0100
 Andrea Righi ari...@develer.com wrote:
 
  On Tue, Feb 22, 2011 at 06:06:30PM -0500, Vivek Goyal wrote:
   On Wed, Feb 23, 2011 at 12:01:47AM +0100, Andrea Righi wrote:
On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote:
 On Tue, 22 Feb 2011 18:12:54 +0100
 Andrea Righi ari...@develer.com wrote:
 
  The page_cgroup infrastructure, currently available only for the 
  memory
  cgroup controller, can be used to store the owner of each page and
  opportunely track the writeback IO. This information is encoded in
  the upper 16-bits of the page_cgroup-flags.
  
  A owner can be identified using a generic ID number and the 
  following
  interfaces are provided to store a retrieve this information:
  
unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
int page_cgroup_copy_owner(struct page *npage, struct page 
  *opage);
 
 My immediate observation is that you're not really tracking the 
 owner
 here - you're tracking an opaque 16-bit token known only to the block
 controller in a field which - if changed by anybody other than the 
 block
 controller - will lead to mayhem in the block controller.  I think it
 might be clearer - and safer - to say blkcg or some such instead of
 owner here.
 

Basically the idea here was to be as generic as possible and make this
feature potentially available also to other subsystems, so that cgroup
subsystems may represent whatever they want with the 16-bit token.
However, no more than a single subsystem may be able to use this feature
at the same time.

 I'm tempted to say it might be better to just add a pointer to your
 throtl_grp structure into struct page_cgroup.  Or maybe replace the
 mem_cgroup pointer with a single pointer to struct css_set.  Both of
 those ideas, though, probably just add unwanted extra overhead now to 
 gain
 generality which may or may not be wanted in the future.

The pointer to css_set sounds good, but it would add additional space to
the page_cgroup struct. Now, page_cgroup is 40 bytes (in 64-bit arch)
and all of them are allocated at boot time. Using unused bits in
page_cgroup-flags is a choice with no overhead from this point of view.
   
   I think John suggested replacing mem_cgroup pointer with css_set so that
   size of the strcuture does not increase but it leads extra level of 
   indirection.
  
  OK, got it sorry.
  
  So, IIUC we save css_set pointer and get a struct cgroup as following:
  
struct cgroup *cgrp = css_set-subsys[subsys_id]-cgroup;
  
  Then, for example to get the mem_cgroup reference:
  
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
  
  It seems a lot of indirections, but I may have done something wrong or
  there could be a simpler way to do it.
  
 
 
 Then, page_cgroup should have reference count on css_set and make tons of
 atomic ops.
 
 BTW, bits of pc-flags are used for storing sectionID or nodeID.
 Please clarify your 16bit never breaks that information. And please keep
 more 4-5 flags for dirty_ratio support of memcg.

OK, I didn't see the recent work about section and node id encoded in
the pc-flags, thanks. So, it'd be probably better to rebase the patch to
the latest mmotm to check all this stuff.

 
 I wonder I can make pc-mem_cgroup to be pc-memid(16bit), then, 
 ==
 static inline struct mem_cgroup *get_memcg_from_pc(struct page_cgroup *pc)
 {
 struct cgroup_subsys_state *css = css_lookup(mem_cgroup_subsys, 
 pc-memid);
 return container_of(css, struct mem_cgroup, css);
 }
 ==
 Overhead will be seen at updating file statistics and LRU management.
 
 But, hmm, can't you do that tracking without page_cgroup ?
 Because the number of dirty/writeback pages are far smaller than total pages,
 chasing I/O with dynamic structure is not very bad..
 
 prepareing [pfn - blkio] record table and move that information to struct bio
 in dynamic way is very difficult ?

This would be ok for dirty pages, but consider that we're also tracking
anonymous pages. So, if we want to control the swap IO we actually need
to save this information for a lot of pages and at the end I think we'll
basically duplicate the page_cgroup code.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.

2011-02-23 Thread David Howells
David Howells dhowe...@redhat.com wrote:

  int (*capable) (struct task_struct *tsk, const struct cred *cred,
  -   int cap, int audit);
  +   struct user_namespace *ns, int cap, int audit);
 
 Hmmm...  A chunk of the contents of the cred struct are user-namespaced.
 Could you add the user_namespace pointer to the cred struct and thus avoid
 passing it as an argument to other things.

Ah, no...  Ignore that, I think I see that you do need it.

 +int cap_capable(struct task_struct *tsk, const struct cred *cred,
 + struct user_namespace *targ_ns, int cap, int audit)
  {
 - return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM;
 + for (;;) {
 + /* The creator of the user namespace has all caps. */
 + if (targ_ns != init_user_ns  targ_ns-creator == cred-user)
 + return 0;

Why is that last comment so?  Why should the creating namespace sport all
possible capabilities?  Do you have to have all capabilities available to you
to be permitted create a new user namespace?

Also, would it be worth having a separate cap_ns_capable()?  Wouldn't most
calls to cap_capable() only be checking the caps granted in the current user
namespace?

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] User namespaces and keys

2011-02-23 Thread David Howells

I guess we need to look at how to mix keys and namespaces again.

Possibly the trickiest problem with keys is how to upcall key construction to
/sbin/request-key when the keys may be of a different user namespace.

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.

2011-02-23 Thread Serge E. Hallyn
Quoting David Howells (dhowe...@redhat.com):
 David Howells dhowe...@redhat.com wrote:
 
 int (*capable) (struct task_struct *tsk, const struct cred *cred,
   - int cap, int audit);
   + struct user_namespace *ns, int cap, int audit);
  
  Hmmm...  A chunk of the contents of the cred struct are user-namespaced.
  Could you add the user_namespace pointer to the cred struct and thus avoid
  passing it as an argument to other things.
 
 Ah, no...  Ignore that, I think I see that you do need it.

Thanks for reviewing, David.

  +int cap_capable(struct task_struct *tsk, const struct cred *cred,
  +   struct user_namespace *targ_ns, int cap, int audit)
   {
  -   return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM;
  +   for (;;) {
  +   /* The creator of the user namespace has all caps. */
  +   if (targ_ns != init_user_ns  targ_ns-creator == cred-user)
  +   return 0;
 
 Why is that last comment so?  Why should the creating namespace sport all
 possible capabilities?  Do you have to have all capabilities available to you
 to be permitted create a new user namespace?

It's not the creating namespace, but the creating user, which has all caps.

So for instance, if uid 500 in init_user_ns creates a namespace, then:
  a. uid 500 in init_user_ns has all caps to the child user_ns, so it can
 kill the tasks in the userns, clean up, etc.
  b. uid X in any other child user_ns has no caps to the child user_ns.
  c. root in init_user_ns has whatever capabilities are in his pE to the
 child user_ns.  Again, this is so that the admin in any user_ns can
 clean up any messes made by users in his user_ns.

One of the goals of the user namespaces it to make it safe to allow
unprivileged users to create child user namespaces in which they have
targeted privilege.  Anything which happens in that child user namespace
should be:

  a. constrained to resources which the user can control anyway
  b. able to be cleaned up by the user
  c. (especially) able to be cleaned up by the privileged user in the
 parent user_ns.

 Also, would it be worth having a separate cap_ns_capable()?  Wouldn't most
 calls to cap_capable() only be checking the caps granted in the current user
 namespace?

Hm.  There is nsown_capable() which is targeted to current_userns(), but
that still needs to enable the caps for the privileged ancestors as
described above.

thanks,
-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread Serge E. Hallyn
Quoting David Howells (dhowe...@redhat.com):
 
 I guess we need to look at how to mix keys and namespaces again.

From strictly kernel pov, at the moment, keys are strictly usable only
by the user in your own user namespace.

We may want to look at this again, but for now I think that would be a
safe enough default.  Later, we'll probably want the user creating a
child_user_ns to allow his keys to be inherited by the child user_ns.
Though, as I type that, it seems to me that that'll just become a
maintenance pain, and it's just plain safer to have the user re-enter
his keys, sharing them over a file if needed.

I'm going to not consider the TPM at the moment :)

 Possibly the trickiest problem with keys is how to upcall key construction to
 /sbin/request-key when the keys may be of a different user namespace.

Hm, jinkeys, yes.

-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread Eric W. Biederman
Serge E. Hallyn se...@hallyn.com writes:

 Quoting David Howells (dhowe...@redhat.com):
 
 I guess we need to look at how to mix keys and namespaces again.

From strictly kernel pov, at the moment, keys are strictly usable only
 by the user in your own user namespace.

 We may want to look at this again, but for now I think that would be a
 safe enough default.  Later, we'll probably want the user creating a
 child_user_ns to allow his keys to be inherited by the child user_ns.
 Though, as I type that, it seems to me that that'll just become a
 maintenance pain, and it's just plain safer to have the user re-enter
 his keys, sharing them over a file if needed.

 I'm going to not consider the TPM at the moment :)

 Possibly the trickiest problem with keys is how to upcall key construction to
 /sbin/request-key when the keys may be of a different user namespace.

 Hm, jinkeys, yes.

Serge short term this is where I think we need to add a check or two so
that keys only work in the init_user_ns.  When the rest of the details
are sorted out we can open up the use of keys some more.

As the first stage of converting the network stack we added patches
that turned off everything in non init namespaces.   Those patches
were trivial and easy to review, and it made the conversion process
a lot easier.  I suspect for keys and possibly security modules and
anything else that does is related we want to turn off by default.

Then once the core of user namespaces is safe to unshare without
privilege we can come back and get the more difficult bits.

Eric
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread David Howells
Serge E. Hallyn se...@hallyn.com wrote:

  I guess we need to look at how to mix keys and namespaces again.
 
 From strictly kernel pov, at the moment, keys are strictly usable only
 by the user in your own user namespace.

I'm not sure that's currently completely true.  Key quota maintenance is
namespaced, and the key's owner UID/GID belong to that namespace, so that's
okay, but:

 (*) key_task_permission() does not distinguish UIDs and GIDs from different
 namespaces.

 (*) A key can be referred to by its serial number, no matter whose namespace
 it is in, and will yield up its given UID/GID, even if these aren't
 actually meaningful in your namespace.

 This means request_key() can successfully upcall at the moment.

I wonder if I should make the following changes:

 (1) If the key and the accessor are in different user namespaces, then skip
 the UID and GID comparisons in key_task_permission().  That means that to
 be able to access the key you'd have to possess the key and the key would
 have to grant you Possessor access, or the key would have to grant you
 Other access.

 (2) If the key and someone viewing the key description are in different
 namespaces, then indicate that the UID and the GID are -1, irrespective of
 the actual values.

 (3) When an upcall is attempting to instantiate a key, it is allowed to access
 the keys of requestor using the requestor's credentials (UID, GID, groups,
 security label).  Ensure that this will be done in the requestor's user
 namespace.

 Nothing should need to be done here, since search_process_keyrings()
 switches to the requestor's creds.

Oh, and are security labels user-namespaced?

 We may want to look at this again, but for now I think that would be a
 safe enough default.  Later, we'll probably want the user creating a
 child_user_ns to allow his keys to be inherited by the child user_ns.

That depends what you mean by 'allow his keys to be inherited'.  Do you mean
copying all the creator's keys en mass?  You may find all you need to do is to
provide the new intended user with a new session keyring with a link back to
the creator's session keyring.

 Though, as I type that, it seems to me that that'll just become a
 maintenance pain, and it's just plain safer to have the user re-enter
 his keys,

That would certainly be simpler.

 sharing them over a file if needed.

I'm not sure what you mean by that.  Do you mean some sort of cred passing
similar to that that can be done over AF_UNIX sockets with fds?

 I'm going to not consider the TPM at the moment :)

I'm not sure the TPM is that much of a problem, assuming you're just referring
to its keystore capability...

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Vivek Goyal
  Agreed. Granularity of per inode level might be accetable in many 
  cases. Again, I am worried faster group getting stuck behind slower
  group.
  
  I am wondering if we are trying to solve the problem of ASYNC write 
  throttling
  at wrong layer. Should ASYNC IO be throttled before we allow task to write 
  to
  page cache. The way we throttle the process based on dirty ratio, can we
  just check for throttle limits also there or something like that.(I think
  that's what you had done in your initial throttling controller 
  implementation?)
 
 Right. This is exactly the same approach I've used in my old throttling
 controller: throttle sync READs and WRITEs at the block layer and async
 WRITEs when the task is dirtying memory pages.
 
 This is probably the simplest way to resolve the problem of faster group
 getting blocked by slower group, but the controller will be a little bit
 more leaky, because the writeback IO will be never throttled and we'll
 see some limited IO spikes during the writeback.

Yes writeback will not be throttled. Not sure how big a problem that is.

- We have controlled the input rate. So that should help a bit.
- May be one can put some high limit on root cgroup to in blkio throttle
  controller to limit overall WRITE rate of the system.
- For SATA disks, try to use CFQ which can try to minimize the impact of
  WRITE.

It will atleast provide consistent bandwindth experience to application.

However, this is always
 a better solution IMHO respect to the current implementation that is
 affected by that kind of priority inversion problem.
 
 I can try to add this logic to the current blk-throttle controller if
 you think it is worth to test it.

At this point of time I have few concerns with this approach.

- Configuration issues. Asking user to plan for SYNC ans ASYNC IO
  separately is inconvenient. One has to know the nature of workload.

- Most likely we will come up with global limits (atleast to begin with),
  and not per device limit. That can lead to contention on one single
  lock and scalability issues on big systems.

Having said that, this approach should reduce the kernel complexity a lot.
So if we can do some intelligent locking to limit the overhead then it
will boil down to reduced complexity in kernel vs ease of use to user. I 
guess at this point of time I am inclined towards keeping it simple in
kernel.

Couple of people have asked me that we have backup jobs running at night
and we want to reduce the IO bandwidth of these jobs to limit the impact
on latency of other jobs, I guess this approach will definitely solve
that issue.

IMHO, it might be worth trying this approach and see how well does it work. It
might not solve all the problems but can be helpful in many situations.

I feel that for proportional bandwidth division, implementing ASYNC
control at CFQ will make sense because even if things get serialized in
higher layers, consequences are not very bad as it is work conserving
algorithm. But for throttling serialization will lead to bad consequences.

May be one can think of new files in blkio controller to limit async IO
per group during page dirty time.

blkio.throttle.async.write_bps_limit
blkio.throttle.async.write_iops_limit

Thanks
Vivek
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread Eric W. Biederman
David Howells dhowe...@redhat.com writes:

 Serge E. Hallyn se...@hallyn.com wrote:

  I guess we need to look at how to mix keys and namespaces again.
 
 From strictly kernel pov, at the moment, keys are strictly usable only
 by the user in your own user namespace.

 I'm not sure that's currently completely true.  Key quota maintenance is
 namespaced, and the key's owner UID/GID belong to that namespace, so that's
 okay, but:

  (*) key_task_permission() does not distinguish UIDs and GIDs from different
  namespaces.

  (*) A key can be referred to by its serial number, no matter whose namespace
  it is in, and will yield up its given UID/GID, even if these aren't
  actually meaningful in your namespace.

  This means request_key() can successfully upcall at the moment.

 I wonder if I should make the following changes:

  (1) If the key and the accessor are in different user namespaces, then skip
  the UID and GID comparisons in key_task_permission().  That means that to
  be able to access the key you'd have to possess the key and the key would
  have to grant you Possessor access, or the key would have to grant you
  Other access.

  (2) If the key and someone viewing the key description are in different
  namespaces, then indicate that the UID and the GID are -1, irrespective 
 of
  the actual values.

  (3) When an upcall is attempting to instantiate a key, it is allowed to 
 access
  the keys of requestor using the requestor's credentials (UID, GID, 
 groups,
  security label).  Ensure that this will be done in the requestor's user
  namespace.

  Nothing should need to be done here, since search_process_keyrings()
  switches to the requestor's creds.

 Oh, and are security labels user-namespaced?

Not at this time.  The user namespace as currently merged is little more
than a place holder for a proper implementation.  Serge is busily
fleshing out that proper implementation.

Until we reach the point where all checks that have historically been
if (uid1 == uid2) become if ((uidns1 == uidns2)  (uid1 == uid2))
there will be problems.

The security labels and probably lsm's in general need to be per user
namespace but we simply have not gotten that far.  For the short term I
will be happy when we get a minimally usable user namespace.

Eric
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
 David Howells dhowe...@redhat.com writes:
 
  Serge E. Hallyn se...@hallyn.com wrote:
 
   I guess we need to look at how to mix keys and namespaces again.
  
  From strictly kernel pov, at the moment, keys are strictly usable only
  by the user in your own user namespace.
 
  I'm not sure that's currently completely true.  Key quota maintenance is
  namespaced, and the key's owner UID/GID belong to that namespace, so that's
  okay, but:
 
   (*) key_task_permission() does not distinguish UIDs and GIDs from different
   namespaces.
 
   (*) A key can be referred to by its serial number, no matter whose 
  namespace
   it is in, and will yield up its given UID/GID, even if these aren't
   actually meaningful in your namespace.
 
   This means request_key() can successfully upcall at the moment.
 
  I wonder if I should make the following changes:
 
   (1) If the key and the accessor are in different user namespaces, then skip
   the UID and GID comparisons in key_task_permission().  That means that 
  to
   be able to access the key you'd have to possess the key and the key 
  would
   have to grant you Possessor access, or the key would have to grant you
   Other access.
 
   (2) If the key and someone viewing the key description are in different
   namespaces, then indicate that the UID and the GID are -1, 
  irrespective of
   the actual values.
 
   (3) When an upcall is attempting to instantiate a key, it is allowed to 
  access
   the keys of requestor using the requestor's credentials (UID, GID, 
  groups,
   security label).  Ensure that this will be done in the requestor's user
   namespace.
 
   Nothing should need to be done here, since search_process_keyrings()
   switches to the requestor's creds.
 
  Oh, and are security labels user-namespaced?
 
 Not at this time.  The user namespace as currently merged is little more
 than a place holder for a proper implementation.  Serge is busily
 fleshing out that proper implementation.
 
 Until we reach the point where all checks that have historically been
 if (uid1 == uid2) become if ((uidns1 == uidns2)  (uid1 == uid2))
 there will be problems.
 
 The security labels and probably lsm's in general need to be per user
 namespace but we simply have not gotten that far.  For the short term I
 will be happy when we get a minimally usable user namespace.

Note also that when Eric brought this up at the LSM mini-conf two or three
years ago, there was pretty general, strong objection to the idea.

Like Eric says, I think that'll have to wait.  In the meantime, isolating
user namespace sandboxes (or containers) using simple LSM configurations
is a very good idea.

-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.

2011-02-23 Thread David Howells
Serge E. Hallyn se...@hallyn.com wrote:

 + /* If you have the capability in a parent user ns you have it
 +  * in the over all children user namespaces as well, so see
 +  * if this process has the capability in the parent user
 +  * namespace.
 +  */

... in the over all children user namespaces ... I think need a couple of
words dropping.

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-23 Thread David Howells
Serge E. Hallyn se...@hallyn.com wrote:

 ptrace is allowed to tasks in the same user namespace according to

There's a verb missing after 'allowed to' - presumably 'trace'.

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-23 Thread David Howells
Serge E. Hallyn se...@hallyn.com wrote:

 +int same_or_ancestor_user_ns(struct task_struct *task,
 + struct task_struct *victim)
 +{
 + struct user_namespace *u1 = task_cred_xxx(task, user)-user_ns;
 + struct user_namespace *u2 = task_cred_xxx(victim, user)-user_ns;

Hmmm.  task_cred_xxx() uses task-real_cred, which is correct for victim (the
object), but normally you'd use task-cred for task (the subject).  However,
in this case, I think it's probably okay.

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace

2011-02-23 Thread David Howells
Serge E. Hallyn se...@hallyn.com wrote:

  struct uts_namespace {
   struct kref kref;
   struct new_utsname name;
 + struct user_namespace *user_ns;
  };

If a uts_namespace belongs to a user_namespace, should CLONE_NEWUSER then
imply CLONE_NEWUTS?

Or is uts_namespace::user_ns more an implication that the set of users in
user_namespace are the only ones authorised to alter the uts data.

I presume that the uts_namespace of a process must be owned by one of the
user_namespaces in the alternating inheritance chain of namespaces and their
creators leading from current_user_ns() to init_user_ns.

With that in mind, looking at patch 3:

-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_SYS_ADMIN))

what is it you're actually asking?  I presume it's 'does this user have
CAP_SYS_ADMIN capability over objects belonging to the uts_namespace's
user_namespace?'

So, to look at the important bit of patch 2:

-int cap_capable(struct task_struct *tsk, const struct cred *cred, int cap,
-   int audit)
+int cap_capable(struct task_struct *tsk, const struct cred *cred,
+   struct user_namespace *targ_ns, int cap, int audit)
 {
-   return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM;
+   for (;;) {
+   /* The creator of the user namespace has all caps. */
+   if (targ_ns != init_user_ns  targ_ns-creator == cred-user)
+   return 0;
+
+   /* Do we have the necessary capabilities? */
+   if (targ_ns == cred-user-user_ns)
+   return cap_raised(cred-cap_effective, cap) ? 0 : 
-EPERM;
+
+   /* Have we tried all of the parent namespaces? */
+   if (targ_ns == init_user_ns)
+   return -EPERM;
+
+   /* If you have the capability in a parent user ns you have it
+* in the over all children user namespaces as well, so see
+* if this process has the capability in the parent user
+* namespace.
+*/
+   targ_ns = targ_ns-creator-user_ns;
+   }
+
+   /* We never get here */
+   return -EPERM;
 }

On entry, as we're called from ns_capable(), cred-user is the user that the
current process is running as, and, as such, may be in a separate namespace
from uts_namespace - which may itself be in a separate namespace from
init_user_ns.

So, assume for the sake of argument that there are three user_namespaces along
the chain from the calling process to the root, and that the uts_namespace
belongs to the middle one.

if (targ_ns != init_user_ns  targ_ns-creator == cred-user)
return 0;

Can never match because targ_ns-creator cannot be cred-user; even if the
uts_namespace belongs to our namespace, given that the creator lies outside
our namespace.

if (targ_ns == cred-user-user_ns)
return cap_raised(cred-cap_effective, cap) ? 0 : 
-EPERM;

Can only match if we are in the target user_namespace (ie. the one to which
uts_namespace belongs), whether or not we have CAP_SYS_ADMIN.

Which means that unless the uts_namespace belongs to our user_namespace, we
cannot change it.  Is that correct?

So ns_capable() restricts you to only doing interesting things to objects that
belong to a user_namespace if they are in your own user_namespace.  Is that
correct?

If that is so, is the loop required for ns_capable()?


Looking further at patch 2:

#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap)))

Given what I've said above, I presume the loop isn't necessary here either.


I think you're using ns_capable() in two different ways:

 (1) You're using it to see if a process has power over its descendents in a
 user_namespace that can be traced back to a clone() that it did with
 CLONE_NEWUSER.

 For example, automatically granting a process permission to kill
 descendents in a namespace it created.

 (2) You're using it to see if a process can access objects that might be
 outside its own user_namespace.

 For example, setting the hostname.

Is it worth giving two different interfaces to make this clearer (even if they
actually do the same thing)?


Sorry if this seems rambly, but I'm trying to get my head round your code.

David
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread Eric W. Biederman
Casey Schaufler ca...@schaufler-ca.com writes:

 I confess that I remain less well educated on namespaces than
 I probably should be, but with what I do know it seems that the
 relationships between user namespaces and LSMs are bound to be
 strained from the beginning. Some LSMs (SELinux and Smack) are
 providing similar sandbox capabilities to what you get from user
 namespaces, but from different directions and with different
 use cases.

Casey I won't argue about the possibility of things being strained, but
I think if we focus on the semantics and not on the end goal of exactly
how the pieces are to be used there can be some reasonable dialog.

From 10,000 feet an user namespace represents one administrative domain.
uids, gids and other external tokens are all controlled by a single
group with a single security policy.  In that single administrative
domain things like nfs are expected to work without translating uids and
gids.  Before the implementation of user namespaces a single kernel
could not even express the ability of dealing with multiple user
namespaces simultaneously (nfs has usually punted and said despite being
multiple machines you still have to be in the same administrative domain
so the same user identifiers can be used throughout).

From that principle we have the concept that use assigned/visible
identifiers uid, gid, capabilities, security keys?, security labels?
logically belong to the user namespace.

Which means in implementing there are two pieces of work in implementing
the user namespace.

- Find all of the interesting comparisons and change them from
  if (id == id) to if (userns == userns)  (id == id).

- Potentially define and handle what happens when you mix user
  namespaces. 

I think for the first round of this we simply want to define lsms
and the user namespace as not mixing or the lsm interfaces only being
visible if you are in the initial user namespace.  Thus reserving the
definition of what happens when you have lsms and multiple user
namespaces as something we can define later.  I expect the proper
definition is something that would allow nested lsm policy.

Regardless.  The namespaces are all about making the identifiers that
processes deal with local to the namespace, while the underlying object
manipulations should not care.

The big advantage of the user namespace is that it is the only way I can
see to get us out of the bind we are in with suid root executables,
where we can not enable new features to untrusted applications that can
change the environment for a suid root executable, because it might
confuse those suid root executables to misusing their power.  Once
inside a user namespace nothing has dangerous powers because even a
root owned process with a full set of capabilities is less powerful than
a guest user outside of the namespace.   No process having dangerous
powers allows us to enable unsharing of other namespaces and potentially
other things that today are restricted with capabilities only because
they can be used to confuse a privileged executable to do something
malicious.

Eric

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace

2011-02-23 Thread Eric W. Biederman
David Howells dhowe...@redhat.com writes:

 Serge E. Hallyn se...@hallyn.com wrote:

  struct uts_namespace {
  struct kref kref;
  struct new_utsname name;
 +struct user_namespace *user_ns;
  };

 If a uts_namespace belongs to a user_namespace, should CLONE_NEWUSER then
 imply CLONE_NEWUTS?

 Or is uts_namespace::user_ns more an implication that the set of users in
 user_namespace are the only ones authorised to alter the uts data.

The later.

 I presume that the uts_namespace of a process must be owned by one of the
 user_namespaces in the alternating inheritance chain of namespaces and their
 creators leading from current_user_ns() to init_user_ns.

 With that in mind, looking at patch 3:

 - if (!capable(CAP_SYS_ADMIN))
 + if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_SYS_ADMIN))

 what is it you're actually asking?  I presume it's 'does this user have
 CAP_SYS_ADMIN capability over objects belonging to the uts_namespace's
 user_namespace?'

Yes.

 So, to look at the important bit of patch 2:

 -int cap_capable(struct task_struct *tsk, const struct cred *cred, int cap,
 - int audit)
 +int cap_capable(struct task_struct *tsk, const struct cred *cred,
 + struct user_namespace *targ_ns, int cap, int audit)
  {
 - return cap_raised(cred-cap_effective, cap) ? 0 : -EPERM;
 + for (;;) {
 + /* The creator of the user namespace has all caps. */
 + if (targ_ns != init_user_ns  targ_ns-creator == cred-user)
 + return 0;
 +
 + /* Do we have the necessary capabilities? */
 + if (targ_ns == cred-user-user_ns)
 + return cap_raised(cred-cap_effective, cap) ? 0 : 
 -EPERM;
 +
 + /* Have we tried all of the parent namespaces? */
 + if (targ_ns == init_user_ns)
 + return -EPERM;
 +
 + /* If you have the capability in a parent user ns you have it
 +  * in the over all children user namespaces as well, so see
 +  * if this process has the capability in the parent user
 +  * namespace.
 +  */
 + targ_ns = targ_ns-creator-user_ns;
 + }
 +
 + /* We never get here */
 + return -EPERM;
  }

 On entry, as we're called from ns_capable(), cred-user is the user that the
 current process is running as, and, as such, may be in a separate namespace
 from uts_namespace - which may itself be in a separate namespace from
 init_user_ns.

 So, assume for the sake of argument that there are three user_namespaces along
 the chain from the calling process to the root, and that the uts_namespace
 belongs to the middle one.

So we have the nested stack of:
user_ns3-creator-user_ns == user_ns2
user_ns2-creator-user_ns == init_user_ns
uts_ns2-user_ns == user_ns2


   if (targ_ns != init_user_ns  targ_ns-creator == cred-user)
   return 0;

 Can never match because targ_ns-creator cannot be cred-user; even if the
 uts_namespace belongs to our namespace, given that the creator lies outside
 our namespace.

Initially we come in with targ_ns == user_ns2 and cred-user-user_ns in
one of (user_ns3, user_ns2, or init_user_ns).

targ_ns takes on values user_ns2 and init_user_ns.

So when targ_ns becomes init_user_ns.  If the user in question is
uts_ns2-user_ns-creator.  This check will indeed match.

   if (targ_ns == cred-user-user_ns)
   return cap_raised(cred-cap_effective, cap) ? 0 : 
 -EPERM;

 Can only match if we are in the target user_namespace (ie. the one to which
 uts_namespace belongs), whether or not we have CAP_SYS_ADMIN.

As before targ_ns takes on values of user_ns2 and init_user_ns.

Which means this check will match if we have CAP_SYS_ADMIN in
init_user_ns or in user_ns2.

 Which means that unless the uts_namespace belongs to our user_namespace, we
 cannot change it.  Is that correct?

No.  If you are root in a parent namespace you can also change it.

 So ns_capable() restricts you to only doing interesting things to objects that
 belong to a user_namespace if they are in your own user_namespace.  Is that
 correct?

No.  Root outside your user namespace is also allowed to do interesting
things.

 If that is so, is the loop required for ns_capable()?

Yes.


 Looking further at patch 2:

   #define nsown_capable(cap) (ns_capable(current_user_ns(), (cap)))

 Given what I've said above, I presume the loop isn't necessary here either.


 I think you're using ns_capable() in two different ways:

  (1) You're using it to see if a process has power over its descendents in a
  user_namespace that can be traced back to a clone() that it did with
  CLONE_NEWUSER.

  For example, automatically granting a process permission to kill
  descendents in a namespace it created.

  (2) You're using it to see if a process can access objects that might be
  outside its own user_namespace.

  For 

[Devel] Re: User namespaces and keys

2011-02-23 Thread Casey Schaufler
On 2/23/2011 12:55 PM, Eric W. Biederman wrote:
 Casey Schaufler ca...@schaufler-ca.com writes:

 I confess that I remain less well educated on namespaces than
 I probably should be, but with what I do know it seems that the
 relationships between user namespaces and LSMs are bound to be
 strained from the beginning. Some LSMs (SELinux and Smack) are
 providing similar sandbox capabilities to what you get from user
 namespaces, but from different directions and with different
 use cases.
 Casey I won't argue about the possibility of things being strained, but
 I think if we focus on the semantics and not on the end goal of exactly
 how the pieces are to be used there can be some reasonable dialog.

I'm sure that there will be cases where they will work together
like horses in a troika. Making sensible semantics for the interactions
is key, and it is entirely possible that in some cases a comparison
of semantics and behaviors will lead an end user to chose either an
LSM or namespaces over the combination. Just like I expect that even
when we allow multiple LSMs the SELinux and Smack combination will be
rare among the sane.

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Andrea Righi
On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
   Agreed. Granularity of per inode level might be accetable in many 
   cases. Again, I am worried faster group getting stuck behind slower
   group.
   
   I am wondering if we are trying to solve the problem of ASYNC write 
   throttling
   at wrong layer. Should ASYNC IO be throttled before we allow task to 
   write to
   page cache. The way we throttle the process based on dirty ratio, can we
   just check for throttle limits also there or something like that.(I think
   that's what you had done in your initial throttling controller 
   implementation?)
  
  Right. This is exactly the same approach I've used in my old throttling
  controller: throttle sync READs and WRITEs at the block layer and async
  WRITEs when the task is dirtying memory pages.
  
  This is probably the simplest way to resolve the problem of faster group
  getting blocked by slower group, but the controller will be a little bit
  more leaky, because the writeback IO will be never throttled and we'll
  see some limited IO spikes during the writeback.
 
 Yes writeback will not be throttled. Not sure how big a problem that is.
 
 - We have controlled the input rate. So that should help a bit.
 - May be one can put some high limit on root cgroup to in blkio throttle
   controller to limit overall WRITE rate of the system.
 - For SATA disks, try to use CFQ which can try to minimize the impact of
   WRITE.
 
 It will atleast provide consistent bandwindth experience to application.

Right.

 
 However, this is always
  a better solution IMHO respect to the current implementation that is
  affected by that kind of priority inversion problem.
  
  I can try to add this logic to the current blk-throttle controller if
  you think it is worth to test it.
 
 At this point of time I have few concerns with this approach.
 
 - Configuration issues. Asking user to plan for SYNC ans ASYNC IO
   separately is inconvenient. One has to know the nature of workload.
 
 - Most likely we will come up with global limits (atleast to begin with),
   and not per device limit. That can lead to contention on one single
   lock and scalability issues on big systems.
 
 Having said that, this approach should reduce the kernel complexity a lot.
 So if we can do some intelligent locking to limit the overhead then it
 will boil down to reduced complexity in kernel vs ease of use to user. I 
 guess at this point of time I am inclined towards keeping it simple in
 kernel.
 

BTW, with this approach probably we can even get rid of the page
tracking stuff for now. If we don't consider the swap IO, any other IO
operation from our point of view will happen directly from process
context (writes in memory + sync reads from the block device).

However, I'm sure we'll need the page tracking also for the blkio
controller soon or later. This is an important information and also the
proportional bandwidth controller can take advantage of it.

 
 Couple of people have asked me that we have backup jobs running at night
 and we want to reduce the IO bandwidth of these jobs to limit the impact
 on latency of other jobs, I guess this approach will definitely solve
 that issue.
 
 IMHO, it might be worth trying this approach and see how well does it work. It
 might not solve all the problems but can be helpful in many situations.

Agreed. This could be a good tradeoff for a lot of common cases.

 
 I feel that for proportional bandwidth division, implementing ASYNC
 control at CFQ will make sense because even if things get serialized in
 higher layers, consequences are not very bad as it is work conserving
 algorithm. But for throttling serialization will lead to bad consequences.

Agreed.

 
 May be one can think of new files in blkio controller to limit async IO
 per group during page dirty time.
 
 blkio.throttle.async.write_bps_limit
 blkio.throttle.async.write_iops_limit

OK, I'll try to add the async throttling logic and use this interface.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace

2011-02-23 Thread David Howells
Eric W. Biederman ebied...@xmission.com wrote:

  Which means that unless the uts_namespace belongs to our user_namespace, we
  cannot change it.  Is that correct?
 
 No.  If you are root in a parent namespace you can also change it.

But surely, by definition, if you're a user in this namespace, you can't also
be root in a parent namespace...

For the case I worked through current_user() is a member of current_user_ns()
and can't also be a member of its parent, grandparent, etc. - or can it?

David
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-23 Thread KAMEZAWA Hiroyuki
On Wed, 23 Feb 2011 09:59:11 +0100
Andrea Righi ari...@develer.com wrote:

  
  I wonder I can make pc-mem_cgroup to be pc-memid(16bit), then, 
  ==
  static inline struct mem_cgroup *get_memcg_from_pc(struct page_cgroup *pc)
  {
  struct cgroup_subsys_state *css = css_lookup(mem_cgroup_subsys, 
  pc-memid);
  return container_of(css, struct mem_cgroup, css);
  }
  ==
  Overhead will be seen at updating file statistics and LRU management.
  
  But, hmm, can't you do that tracking without page_cgroup ?
  Because the number of dirty/writeback pages are far smaller than total 
  pages,
  chasing I/O with dynamic structure is not very bad..
  
  prepareing [pfn - blkio] record table and move that information to struct 
  bio
  in dynamic way is very difficult ?
 
 This would be ok for dirty pages, but consider that we're also tracking
 anonymous pages. So, if we want to control the swap IO we actually need
 to save this information for a lot of pages and at the end I think we'll
 basically duplicate the page_cgroup code.
 

swap io is always started with bio and the task/mm_struct.
So, if we can record information in bio, no page tracking is required.
You can record information to bio just by reading mm-owner.

Thanks,
-Kame

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Vivek Goyal
On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
 On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
Agreed. Granularity of per inode level might be accetable in many 
cases. Again, I am worried faster group getting stuck behind slower
group.

I am wondering if we are trying to solve the problem of ASYNC write 
throttling
at wrong layer. Should ASYNC IO be throttled before we allow task to 
write to
page cache. The way we throttle the process based on dirty ratio, can we
just check for throttle limits also there or something like that.(I 
think
that's what you had done in your initial throttling controller 
implementation?)
   
   Right. This is exactly the same approach I've used in my old throttling
   controller: throttle sync READs and WRITEs at the block layer and async
   WRITEs when the task is dirtying memory pages.
   
   This is probably the simplest way to resolve the problem of faster group
   getting blocked by slower group, but the controller will be a little bit
   more leaky, because the writeback IO will be never throttled and we'll
   see some limited IO spikes during the writeback.
  
  Yes writeback will not be throttled. Not sure how big a problem that is.
  
  - We have controlled the input rate. So that should help a bit.
  - May be one can put some high limit on root cgroup to in blkio throttle
controller to limit overall WRITE rate of the system.
  - For SATA disks, try to use CFQ which can try to minimize the impact of
WRITE.
  
  It will atleast provide consistent bandwindth experience to application.
 
 Right.
 
  
  However, this is always
   a better solution IMHO respect to the current implementation that is
   affected by that kind of priority inversion problem.
   
   I can try to add this logic to the current blk-throttle controller if
   you think it is worth to test it.
  
  At this point of time I have few concerns with this approach.
  
  - Configuration issues. Asking user to plan for SYNC ans ASYNC IO
separately is inconvenient. One has to know the nature of workload.
  
  - Most likely we will come up with global limits (atleast to begin with),
and not per device limit. That can lead to contention on one single
lock and scalability issues on big systems.
  
  Having said that, this approach should reduce the kernel complexity a lot.
  So if we can do some intelligent locking to limit the overhead then it
  will boil down to reduced complexity in kernel vs ease of use to user. I 
  guess at this point of time I am inclined towards keeping it simple in
  kernel.
  
 
 BTW, with this approach probably we can even get rid of the page
 tracking stuff for now.

Agreed.

 If we don't consider the swap IO, any other IO
 operation from our point of view will happen directly from process
 context (writes in memory + sync reads from the block device).

Why do we need to account for swap IO? Application never asked for swap
IO. It is kernel's decision to move soem pages to swap to free up some
memory. What's the point in charging those pages to application group
and throttle accordingly?

 
 However, I'm sure we'll need the page tracking also for the blkio
 controller soon or later. This is an important information and also the
 proportional bandwidth controller can take advantage of it.

Yes page tracking will be needed for CFQ proportional bandwidth ASYNC
write support. But until and unless we implement memory cgroup dirty
ratio and figure a way out to make writeback logic cgroup aware, till
then I think page tracking stuff is not really useful.

  
  Couple of people have asked me that we have backup jobs running at night
  and we want to reduce the IO bandwidth of these jobs to limit the impact
  on latency of other jobs, I guess this approach will definitely solve
  that issue.
  
  IMHO, it might be worth trying this approach and see how well does it work. 
  It
  might not solve all the problems but can be helpful in many situations.
 
 Agreed. This could be a good tradeoff for a lot of common cases.
 
  
  I feel that for proportional bandwidth division, implementing ASYNC
  control at CFQ will make sense because even if things get serialized in
  higher layers, consequences are not very bad as it is work conserving
  algorithm. But for throttling serialization will lead to bad consequences.
 
 Agreed.
 
  
  May be one can think of new files in blkio controller to limit async IO
  per group during page dirty time.
  
  blkio.throttle.async.write_bps_limit
  blkio.throttle.async.write_iops_limit
 
 OK, I'll try to add the async throttling logic and use this interface.

Cool, I would like to play with it a bit once patches are ready.

Thanks
Vivek
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel 

[Devel] [PATCH 1/4] userns: let clone_uts_ns() handle setting uts-user_ns

2011-02-23 Thread Serge E. Hallyn
To do so we need to pass in the task_struct who'll get the utsname,
so we can get its user_ns.

Changelog:
Feb 23: As per Oleg's coment, just pass in tsk.

Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
---
 include/linux/utsname.h |6 +++---
 kernel/nsproxy.c|7 +--
 kernel/utsname.c|   12 +++-
 3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 85171be..21b4566 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -53,7 +53,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
 }
 
 extern struct uts_namespace *copy_utsname(unsigned long flags,
-   struct uts_namespace *ns);
+ struct task_struct *tsk);
 extern void free_uts_ns(struct kref *kref);
 
 static inline void put_uts_ns(struct uts_namespace *ns)
@@ -70,12 +70,12 @@ static inline void put_uts_ns(struct uts_namespace *ns)
 }
 
 static inline struct uts_namespace *copy_utsname(unsigned long flags,
-   struct uts_namespace *ns)
+struct task_struct *tsk)
 {
if (flags  CLONE_NEWUTS)
return ERR_PTR(-EINVAL);
 
-   return ns;
+   return tsk-nsproxy-uts_ns;
 }
 #endif
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b6dbff2..ac8a56e 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -69,16 +69,11 @@ static struct nsproxy *create_new_namespaces(unsigned long 
flags,
goto out_ns;
}
 
-   new_nsp-uts_ns = copy_utsname(flags, tsk-nsproxy-uts_ns);
+   new_nsp-uts_ns = copy_utsname(flags, tsk);
if (IS_ERR(new_nsp-uts_ns)) {
err = PTR_ERR(new_nsp-uts_ns);
goto out_uts;
}
-   if (new_nsp-uts_ns != tsk-nsproxy-uts_ns) {
-   put_user_ns(new_nsp-uts_ns-user_ns);
-   new_nsp-uts_ns-user_ns = task_cred_xxx(tsk, user)-user_ns;
-   get_user_ns(new_nsp-uts_ns-user_ns);
-   }
 
new_nsp-ipc_ns = copy_ipcs(flags, tsk-nsproxy-ipc_ns);
if (IS_ERR(new_nsp-ipc_ns)) {
diff --git a/kernel/utsname.c b/kernel/utsname.c
index a7b3a8d..4464617 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -31,7 +31,8 @@ static struct uts_namespace *create_uts_ns(void)
  * @old_ns: namespace to clone
  * Return NULL on error (failure to kmalloc), new ns otherwise
  */
-static struct uts_namespace *clone_uts_ns(struct uts_namespace *old_ns)
+static struct uts_namespace *clone_uts_ns(struct task_struct *tsk,
+ struct uts_namespace *old_ns)
 {
struct uts_namespace *ns;
 
@@ -41,8 +42,7 @@ static struct uts_namespace *clone_uts_ns(struct 
uts_namespace *old_ns)
 
down_read(uts_sem);
memcpy(ns-name, old_ns-name, sizeof(ns-name));
-   ns-user_ns = old_ns-user_ns;
-   get_user_ns(ns-user_ns);
+   ns-user_ns = get_user_ns(task_cred_xxx(tsk, user)-user_ns);
up_read(uts_sem);
return ns;
 }
@@ -53,8 +53,10 @@ static struct uts_namespace *clone_uts_ns(struct 
uts_namespace *old_ns)
  * utsname of this process won't be seen by parent, and vice
  * versa.
  */
-struct uts_namespace *copy_utsname(unsigned long flags, struct uts_namespace 
*old_ns)
+struct uts_namespace *copy_utsname(unsigned long flags,
+  struct task_struct *tsk)
 {
+   struct uts_namespace *old_ns = tsk-nsproxy-uts_ns;
struct uts_namespace *new_ns;
 
BUG_ON(!old_ns);
@@ -63,7 +65,7 @@ struct uts_namespace *copy_utsname(unsigned long flags, 
struct uts_namespace *ol
if (!(flags  CLONE_NEWUTS))
return old_ns;
 
-   new_ns = clone_uts_ns(old_ns);
+   new_ns = clone_uts_ns(tsk, old_ns);
 
put_uts_ns(old_ns);
return new_ns;
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/4] userns: let copy_ipcs handle setting ipc_ns-user_ns

2011-02-23 Thread Serge E. Hallyn
To do that, we have to pass in the task_struct of the task which
will own the ipc_ns, so we can assign its user_ns.

Changelog:
Feb 23: As per Oleg comment, just pass in tsk.  To get the
ipc_ns from the nsproxy we need to include nsproxy.h

Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
---
 include/linux/ipc_namespace.h |7 ---
 ipc/namespace.c   |   13 -
 kernel/nsproxy.c  |7 +--
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 46d2eb4..c079d09 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -5,6 +5,7 @@
 #include linux/idr.h
 #include linux/rwsem.h
 #include linux/notifier.h
+#include linux/nsproxy.h
 
 /*
  * ipc namespace events
@@ -93,7 +94,7 @@ static inline int mq_init_ns(struct ipc_namespace *ns) { 
return 0; }
 
 #if defined(CONFIG_IPC_NS)
 extern struct ipc_namespace *copy_ipcs(unsigned long flags,
-  struct ipc_namespace *ns);
+  struct task_struct *tsk);
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
 {
if (ns)
@@ -104,12 +105,12 @@ static inline struct ipc_namespace *get_ipc_ns(struct 
ipc_namespace *ns)
 extern void put_ipc_ns(struct ipc_namespace *ns);
 #else
 static inline struct ipc_namespace *copy_ipcs(unsigned long flags,
-   struct ipc_namespace *ns)
+ struct task_struct *tsk)
 {
if (flags  CLONE_NEWIPC)
return ERR_PTR(-EINVAL);
 
-   return ns;
+   return tsk-nsproxy-ipc_ns;
 }
 
 static inline struct ipc_namespace *get_ipc_ns(struct ipc_namespace *ns)
diff --git a/ipc/namespace.c b/ipc/namespace.c
index aa18899..3c3e522 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -15,7 +15,8 @@
 
 #include util.h
 
-static struct ipc_namespace *create_ipc_ns(struct ipc_namespace *old_ns)
+static struct ipc_namespace *create_ipc_ns(struct task_struct *tsk,
+  struct ipc_namespace *old_ns)
 {
struct ipc_namespace *ns;
int err;
@@ -44,17 +45,19 @@ static struct ipc_namespace *create_ipc_ns(struct 
ipc_namespace *old_ns)
ipcns_notify(IPCNS_CREATED);
register_ipcns_notifier(ns);
 
-   ns-user_ns = old_ns-user_ns;
-   get_user_ns(ns-user_ns);
+   ns-user_ns = get_user_ns(task_cred_xxx(tsk, user)-user_ns);
 
return ns;
 }
 
-struct ipc_namespace *copy_ipcs(unsigned long flags, struct ipc_namespace *ns)
+struct ipc_namespace *copy_ipcs(unsigned long flags,
+   struct task_struct *tsk)
 {
+   struct ipc_namespace *ns = tsk-nsproxy-ipc_ns;
+
if (!(flags  CLONE_NEWIPC))
return get_ipc_ns(ns);
-   return create_ipc_ns(ns);
+   return create_ipc_ns(tsk, ns);
 }
 
 /*
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ac8a56e..a05d191 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -75,16 +75,11 @@ static struct nsproxy *create_new_namespaces(unsigned long 
flags,
goto out_uts;
}
 
-   new_nsp-ipc_ns = copy_ipcs(flags, tsk-nsproxy-ipc_ns);
+   new_nsp-ipc_ns = copy_ipcs(flags, tsk);
if (IS_ERR(new_nsp-ipc_ns)) {
err = PTR_ERR(new_nsp-ipc_ns);
goto out_ipc;
}
-   if (new_nsp-ipc_ns != tsk-nsproxy-ipc_ns) {
-   put_user_ns(new_nsp-ipc_ns-user_ns);
-   new_nsp-ipc_ns-user_ns = task_cred_xxx(tsk, user)-user_ns;
-   get_user_ns(new_nsp-ipc_ns-user_ns);
-   }
 
new_nsp-pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
if (IS_ERR(new_nsp-pid_ns)) {
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 5/4] Clean up capability.h and capability.c

2011-02-23 Thread Serge E. Hallyn
Convert macros to functions to let type safety do its thing.  Switch
some functions from ints to more appropriate bool.  Move all forward
declarations together to top of the #ifdef __KERNEL__ section.  Use
kernel-doc format for comments.

Some macros couldn't be converted because they use functions from
security.h which sometimes are extern and sometimes static inline,
and we don't want to #include security.h in capability.h.

Also add a real current_user_ns function (and convert the existing
macro to _current_user_ns() so we can use it in capability.h
without #including cred.h.

Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
---
 include/linux/capability.h |   38 ++
 include/linux/cred.h   |4 +++-
 kernel/capability.c|   20 
 kernel/cred.c  |5 +
 4 files changed, 46 insertions(+), 21 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index bc0f262..688462f 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -368,6 +368,17 @@ struct cpu_vfs_cap_data {
 
 #ifdef __KERNEL__
 
+struct dentry;
+struct user_namespace;
+
+extern struct user_namespace init_user_ns;
+
+struct user_namespace *current_user_ns(void);
+
+extern const kernel_cap_t __cap_empty_set;
+extern const kernel_cap_t __cap_full_set;
+extern const kernel_cap_t __cap_init_eff_set;
+
 /*
  * Internal kernel functions only
  */
@@ -530,10 +541,6 @@ static inline kernel_cap_t cap_raise_nfsd_set(const 
kernel_cap_t a,
   cap_intersect(permitted, __cap_nfsd_set));
 }
 
-extern const kernel_cap_t __cap_empty_set;
-extern const kernel_cap_t __cap_full_set;
-extern const kernel_cap_t __cap_init_eff_set;
-
 /**
  * has_capability - Determine if a task has a superior capability available
  * @t: The task in question
@@ -560,18 +567,25 @@ extern const kernel_cap_t __cap_init_eff_set;
  * Note that this does not set PF_SUPERPRIV on the task.
  */
 #define has_capability_noaudit(t, cap) \
-   (security_real_capable_noaudit((t), init_user_ns, (cap)) == 0)
+   (security_real_capable_noaudit((t), init_user_ns, (cap)) == 0)
 
-struct user_namespace;
-extern struct user_namespace init_user_ns;
-extern int capable(int cap);
-extern int ns_capable(struct user_namespace *ns, int cap);
-extern int task_ns_capable(struct task_struct *t, int cap);
+extern bool capable(int cap);
+extern bool ns_capable(struct user_namespace *ns, int cap);
+extern bool task_ns_capable(struct task_struct *t, int cap);
 
-#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap)))
+/**
+ * nsown_capable - Check superior capability to one's own user_ns
+ * @cap: The capability in question
+ *
+ * Return true if the current task has the given superior capability
+ * targeted at its own user namespace.
+ */
+static inline bool nsown_capable(int cap)
+{
+   return ns_capable(current_user_ns(), cap);
+}
 
 /* audit system wants to get cap info from files as well */
-struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
cpu_vfs_cap_data *cpu_caps);
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4aaeab3..9aeeb0b 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -354,9 +354,11 @@ static inline void put_cred(const struct cred *_cred)
 #define current_fsgid()(current_cred_xxx(fsgid))
 #define current_cap()  (current_cred_xxx(cap_effective))
 #define current_user() (current_cred_xxx(user))
-#define current_user_ns()  (current_cred_xxx(user)-user_ns)
+#define _current_user_ns() (current_cred_xxx(user)-user_ns)
 #define current_security() (current_cred_xxx(security))
 
+extern struct user_namespace *current_user_ns(void);
+
 #define current_uid_gid(_uid, _gid)\
 do {   \
const struct cred *__cred;  \
diff --git a/kernel/capability.c b/kernel/capability.c
index 916658c..0a3d2c8 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -300,7 +300,7 @@ error:
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
-int capable(int cap)
+bool capable(int cap)
 {
return ns_capable(init_user_ns, cap);
 }
@@ -317,7 +317,7 @@ EXPORT_SYMBOL(capable);
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
-int ns_capable(struct user_namespace *ns, int cap)
+bool ns_capable(struct user_namespace *ns, int cap)
 {
if (unlikely(!cap_valid(cap))) {
printk(KERN_CRIT capable() called with invalid cap=%u\n, cap);
@@ -326,17 +326,21 @@ int ns_capable(struct user_namespace *ns, int cap)
 
if (security_capable(ns, current_cred(), cap) == 0) {
current-flags |= PF_SUPERPRIV;
-   return 1;
+   return true;
}

[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
 On Thu, 17 Feb 2011 15:03:33 +
 Serge E. Hallyn se...@hallyn.com wrote:
 
  ptrace is allowed to tasks in the same user namespace according to
  the usual rules (i.e. the same rules as for two tasks in the init
  user namespace).  ptrace is also allowed to a user namespace to
  which the current task the has CAP_SYS_PTRACE capability.
  
 
  ...
 
  --- a/include/linux/capability.h
  +++ b/include/linux/capability.h
  @@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set;
*/
   #define has_capability(t, cap) (security_real_capable((t), init_user_ns, 
  (cap)) == 0)
   
  +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), 
  (cap)) == 0)
 
 macroitis.

Thanks for the review, Andrew.  Unfortunately this one is hard to turn
into a function beecause it uses security_real_capable(), which is
sometimes defined in security/security.c as a real function, and
other times as a static inline in include/linux/security.h.  So
I'd have to #include security.h in capability.h, but security.h
already #includes capability.h.

All the other comments affect same_or_ancestor_user_ns(), which
following Eric's feedback is going away.

   /**
* has_capability_noaudit - Determine if a task has a superior capability 
  available (unaudited)
* @t: The task in question
  diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
  index faf4679..862fc59 100644
  --- a/include/linux/user_namespace.h
  +++ b/include/linux/user_namespace.h
  @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns)
   uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, 
  uid_t uid);
   gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, 
  gid_t gid);
   
  +int same_or_ancestor_user_ns(struct task_struct *task,
  +   struct task_struct *victim);
 
 bool.
 
   #else
   
   static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 
  ...
 
  --- a/kernel/user_namespace.c
  +++ b/kernel/user_namespace.c
  @@ -129,6 +129,22 @@ gid_t user_ns_map_gid(struct user_namespace *to, const 
  struct cred *cred, gid_t
  return overflowgid;
   }
   
  +int same_or_ancestor_user_ns(struct task_struct *task,
  +   struct task_struct *victim)
  +{
  +   struct user_namespace *u1 = task_cred_xxx(task, user)-user_ns;
  +   struct user_namespace *u2 = task_cred_xxx(victim, user)-user_ns;
  +   for (;;) {
  +   if (u1 == u2)
  +   return 1;
  +   if (u1 == init_user_ns)
  +   return 0;
  +   u1 = u1-creator-user_ns;
  +   }
  +   /* We never get here */
  +   return 0;
 
 Remove?
 
  +}
  +
   static __init int user_namespaces_init(void)
   {
  user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
 
  ...
 
   int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
   {
  int ret = 0;
  +   const struct cred *cred, *tcred;
   
  rcu_read_lock();
  -   if (!cap_issubset(__task_cred(child)-cap_permitted,
  - current_cred()-cap_permitted) 
  -   !capable(CAP_SYS_PTRACE))
  -   ret = -EPERM;
  +   cred = current_cred();
  +   tcred = __task_cred(child);
  +   /*
  +* The ancestor user_ns check may be gratuitous, as I think
  +* we've already guaranteed that in kernel/ptrace.c.
  +*/
 
 ?
 
  +   if (same_or_ancestor_user_ns(current, child) 
  +   cap_issubset(tcred-cap_permitted, cred-cap_permitted))
  +   goto out;
  +   if (ns_capable(tcred-user-user_ns, CAP_SYS_PTRACE))
  +   goto out;
  +   ret = -EPERM;
  +out:
  rcu_read_unlock();
  return ret;
   }
 
  ...
 
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel



[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread KAMEZAWA Hiroyuki
On Wed, 23 Feb 2011 19:10:33 -0500
Vivek Goyal vgo...@redhat.com wrote:

 On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
  On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
 Agreed. Granularity of per inode level might be accetable in many 
 cases. Again, I am worried faster group getting stuck behind slower
 group.
 
 I am wondering if we are trying to solve the problem of ASYNC write 
 throttling
 at wrong layer. Should ASYNC IO be throttled before we allow task to 
 write to
 page cache. The way we throttle the process based on dirty ratio, can 
 we
 just check for throttle limits also there or something like that.(I 
 think
 that's what you had done in your initial throttling controller 
 implementation?)

Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.

This is probably the simplest way to resolve the problem of faster group
getting blocked by slower group, but the controller will be a little bit
more leaky, because the writeback IO will be never throttled and we'll
see some limited IO spikes during the writeback.
   
   Yes writeback will not be throttled. Not sure how big a problem that is.
   
   - We have controlled the input rate. So that should help a bit.
   - May be one can put some high limit on root cgroup to in blkio throttle
 controller to limit overall WRITE rate of the system.
   - For SATA disks, try to use CFQ which can try to minimize the impact of
 WRITE.
   
   It will atleast provide consistent bandwindth experience to application.
  
  Right.
  
   
   However, this is always
a better solution IMHO respect to the current implementation that is
affected by that kind of priority inversion problem.

I can try to add this logic to the current blk-throttle controller if
you think it is worth to test it.
   
   At this point of time I have few concerns with this approach.
   
   - Configuration issues. Asking user to plan for SYNC ans ASYNC IO
 separately is inconvenient. One has to know the nature of workload.
   
   - Most likely we will come up with global limits (atleast to begin with),
 and not per device limit. That can lead to contention on one single
 lock and scalability issues on big systems.
   
   Having said that, this approach should reduce the kernel complexity a lot.
   So if we can do some intelligent locking to limit the overhead then it
   will boil down to reduced complexity in kernel vs ease of use to user. I 
   guess at this point of time I am inclined towards keeping it simple in
   kernel.
   
  
  BTW, with this approach probably we can even get rid of the page
  tracking stuff for now.
 
 Agreed.
 
  If we don't consider the swap IO, any other IO
  operation from our point of view will happen directly from process
  context (writes in memory + sync reads from the block device).
 
 Why do we need to account for swap IO? Application never asked for swap
 IO. It is kernel's decision to move soem pages to swap to free up some
 memory. What's the point in charging those pages to application group
 and throttle accordingly?
 

I think swap I/O should be controlled by memcg's dirty_ratio.
But, IIRC, NEC guy had a requirement for this...

I think some enterprise cusotmer may want to throttle the whole speed of
swapout I/O (not swapin)...so, they may be glad if they can limit throttle
the I/O against a disk partition or all I/O tagged as 'swapio' rather than
some cgroup name.

But I'm afraid slow swapout may consume much dirty_ratio and make things
worse ;)



  
  However, I'm sure we'll need the page tracking also for the blkio
  controller soon or later. This is an important information and also the
  proportional bandwidth controller can take advantage of it.
 
 Yes page tracking will be needed for CFQ proportional bandwidth ASYNC
 write support. But until and unless we implement memory cgroup dirty
 ratio and figure a way out to make writeback logic cgroup aware, till
 then I think page tracking stuff is not really useful.
 

I think Greg Thelen is now preparing patches for dirty_ratio.

Thanks,
-Kame

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
 On Thu, 17 Feb 2011 15:03:25 +
 Serge E. Hallyn se...@hallyn.com wrote:
 
   /*
  + * called with RCU read lock from check_kill_permission()
  + */
  +static inline int kill_ok_by_cred(struct task_struct *t)
  +{
  +   const struct cred *cred = current_cred();
  +   const struct cred *tcred = __task_cred(t);
  +
  +   if (cred-user-user_ns == tcred-user-user_ns 
  +   (cred-euid == tcred-suid ||
  +cred-euid == tcred-uid ||
  +cred-uid  == tcred-suid ||
  +cred-uid  == tcred-uid))
  +   return 1;
  +
  +   if (ns_capable(tcred-user-user_ns, CAP_KILL))
  +   return 1;
  +
  +   return 0;
  +}
 
 The compiler will inline this for us.

Is that simply true with everything (worth inlining) nowadays, or is
there a particular implicit hint to the compiler that'll make that
happen?

Not that I guess it's even particularly important in this case.

From: Serge E. Hallyn serge.hal...@canonical.com
Date: Thu, 24 Feb 2011 00:26:02 +
Subject: [PATCH 1/2] userns: let compiler inline kill_ok_by_cred (per akpm)

Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
---
 kernel/signal.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index ffe4bdf..12702b4 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -638,7 +638,7 @@ static inline bool si_fromuser(const struct siginfo *info)
 /*
  * called with RCU read lock from check_kill_permission()
  */
-static inline int kill_ok_by_cred(struct task_struct *t)
+static int kill_ok_by_cred(struct task_struct *t)
 {
const struct cred *cred = current_cred();
const struct cred *tcred = __task_cred(t);
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH] userns: ptrace: incorporate feedback from Eric

2011-02-23 Thread Serge E. Hallyn
same_or_ancestore_user_ns() was not an appropriate check to
constrain cap_issubset.  Rather, cap_issubset() only is
meaningful when both capsets are in the same user_ns.

Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
---
 include/linux/user_namespace.h |9 -
 kernel/user_namespace.c|   16 
 security/commoncap.c   |   28 ++--
 3 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 862fc59..faf4679 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -39,9 +39,6 @@ static inline void put_user_ns(struct user_namespace *ns)
 uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, 
uid_t uid);
 gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, 
gid_t gid);
 
-int same_or_ancestor_user_ns(struct task_struct *task,
-   struct task_struct *victim);
-
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -69,12 +66,6 @@ static inline gid_t user_ns_map_gid(struct user_namespace 
*to,
return gid;
 }
 
-static inline int same_or_ancestor_user_ns(struct task_struct *task,
-   struct task_struct *victim)
-{
-   return 1;
-}
-
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0ef2258..9da289c 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -129,22 +129,6 @@ gid_t user_ns_map_gid(struct user_namespace *to, const 
struct cred *cred, gid_t
return overflowgid;
 }
 
-int same_or_ancestor_user_ns(struct task_struct *task,
-   struct task_struct *victim)
-{
-   struct user_namespace *u1 = task_cred_xxx(task, user)-user_ns;
-   struct user_namespace *u2 = task_cred_xxx(victim, user)-user_ns;
-   for (;;) {
-   if (u1 == u2)
-   return 1;
-   if (u1 == init_user_ns)
-   return 0;
-   u1 = u1-creator-user_ns;
-   }
-   /* We never get here */
-   return 0;
-}
-
 static __init int user_namespaces_init(void)
 {
user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
diff --git a/security/commoncap.c b/security/commoncap.c
index 12ff65c..526106f 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -142,19 +142,15 @@ int cap_settime(struct timespec *ts, struct timezone *tz)
 int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
 {
int ret = 0;
-   const struct cred *cred, *tcred;
+   const struct cred *cred, *child_cred;
 
rcu_read_lock();
cred = current_cred();
-   tcred = __task_cred(child);
-   /*
-* The ancestor user_ns check may be gratuitous, as I think
-* we've already guaranteed that in kernel/ptrace.c.
-*/
-   if (same_or_ancestor_user_ns(current, child) 
-   cap_issubset(tcred-cap_permitted, cred-cap_permitted))
+   child_cred = __task_cred(child);
+   if (cred-user-user_ns == child_cred-user-user_ns 
+   cap_issubset(child_cred-cap_permitted, cred-cap_permitted))
goto out;
-   if (ns_capable(tcred-user-user_ns, CAP_SYS_PTRACE))
+   if (ns_capable(child_cred-user-user_ns, CAP_SYS_PTRACE))
goto out;
ret = -EPERM;
 out:
@@ -178,19 +174,15 @@ out:
 int cap_ptrace_traceme(struct task_struct *parent)
 {
int ret = 0;
-   const struct cred *cred, *tcred;
+   const struct cred *cred, *child_cred;
 
rcu_read_lock();
cred = __task_cred(parent);
-   tcred = current_cred();
-   /*
-* The ancestor user_ns check may be gratuitous, as I think
-* we've already guaranteed that in kernel/ptrace.c.
-*/
-   if (same_or_ancestor_user_ns(parent, current) 
-   cap_issubset(tcred-cap_permitted, cred-cap_permitted))
+   child_cred = current_cred();
+   if (cred-user-user_ns == child_cred-user-user_ns 
+   cap_issubset(child_cred-cap_permitted, cred-cap_permitted))
goto out;
-   if (has_ns_capability(parent, tcred-user-user_ns, CAP_SYS_PTRACE))
+   if (has_ns_capability(parent, child_cred-user-user_ns, 
CAP_SYS_PTRACE))
goto out;
ret = -EPERM;
 out:
-- 
1.7.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns

2011-02-23 Thread Andrew Morton
On Thu, 24 Feb 2011 00:48:18 +
Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Andrew Morton (a...@linux-foundation.org):
  On Thu, 17 Feb 2011 15:03:25 +
  Serge E. Hallyn se...@hallyn.com wrote:
  
/*
   + * called with RCU read lock from check_kill_permission()
   + */
   +static inline int kill_ok_by_cred(struct task_struct *t)
   +{
   + const struct cred *cred = current_cred();
   + const struct cred *tcred = __task_cred(t);
   +
   + if (cred-user-user_ns == tcred-user-user_ns 
   + (cred-euid == tcred-suid ||
   +  cred-euid == tcred-uid ||
   +  cred-uid  == tcred-suid ||
   +  cred-uid  == tcred-uid))
   + return 1;
   +
   + if (ns_capable(tcred-user-user_ns, CAP_KILL))
   + return 1;
   +
   + return 0;
   +}
  
  The compiler will inline this for us.
 
 Is that simply true with everything (worth inlining) nowadays, or is
 there a particular implicit hint to the compiler that'll make that
 happen?

We've basically stopped inlining things nowadays.  gcc inlines
aggressively and sometimes we have to use noinline to stop it.  Also,
modern gcc's like to ignore the inline directive anwyay, so we have to
resort to __always_inline when we disagree.


___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Greg Thelen
On Wed, Feb 23, 2011 at 4:40 PM, KAMEZAWA Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
 On Wed, 23 Feb 2011 19:10:33 -0500
 Vivek Goyal vgo...@redhat.com wrote:

 On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
  On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
 Agreed. Granularity of per inode level might be accetable in many
 cases. Again, I am worried faster group getting stuck behind slower
 group.

 I am wondering if we are trying to solve the problem of ASYNC write 
 throttling
 at wrong layer. Should ASYNC IO be throttled before we allow task to 
 write to
 page cache. The way we throttle the process based on dirty ratio, 
 can we
 just check for throttle limits also there or something like that.(I 
 think
 that's what you had done in your initial throttling controller 
 implementation?)
   
Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.
   
This is probably the simplest way to resolve the problem of faster 
group
getting blocked by slower group, but the controller will be a little 
bit
more leaky, because the writeback IO will be never throttled and we'll
see some limited IO spikes during the writeback.
  
   Yes writeback will not be throttled. Not sure how big a problem that is.
  
   - We have controlled the input rate. So that should help a bit.
   - May be one can put some high limit on root cgroup to in blkio throttle
     controller to limit overall WRITE rate of the system.
   - For SATA disks, try to use CFQ which can try to minimize the impact of
     WRITE.
  
   It will atleast provide consistent bandwindth experience to application.
 
  Right.
 
  
   However, this is always
a better solution IMHO respect to the current implementation that is
affected by that kind of priority inversion problem.
   
I can try to add this logic to the current blk-throttle controller if
you think it is worth to test it.
  
   At this point of time I have few concerns with this approach.
  
   - Configuration issues. Asking user to plan for SYNC ans ASYNC IO
     separately is inconvenient. One has to know the nature of workload.
  
   - Most likely we will come up with global limits (atleast to begin with),
     and not per device limit. That can lead to contention on one single
     lock and scalability issues on big systems.
  
   Having said that, this approach should reduce the kernel complexity a 
   lot.
   So if we can do some intelligent locking to limit the overhead then it
   will boil down to reduced complexity in kernel vs ease of use to user. I
   guess at this point of time I am inclined towards keeping it simple in
   kernel.
  
 
  BTW, with this approach probably we can even get rid of the page
  tracking stuff for now.

 Agreed.

  If we don't consider the swap IO, any other IO
  operation from our point of view will happen directly from process
  context (writes in memory + sync reads from the block device).

 Why do we need to account for swap IO? Application never asked for swap
 IO. It is kernel's decision to move soem pages to swap to free up some
 memory. What's the point in charging those pages to application group
 and throttle accordingly?


 I think swap I/O should be controlled by memcg's dirty_ratio.
 But, IIRC, NEC guy had a requirement for this...

 I think some enterprise cusotmer may want to throttle the whole speed of
 swapout I/O (not swapin)...so, they may be glad if they can limit throttle
 the I/O against a disk partition or all I/O tagged as 'swapio' rather than
 some cgroup name.

 But I'm afraid slow swapout may consume much dirty_ratio and make things
 worse ;)



 
  However, I'm sure we'll need the page tracking also for the blkio
  controller soon or later. This is an important information and also the
  proportional bandwidth controller can take advantage of it.

 Yes page tracking will be needed for CFQ proportional bandwidth ASYNC
 write support. But until and unless we implement memory cgroup dirty
 ratio and figure a way out to make writeback logic cgroup aware, till
 then I think page tracking stuff is not really useful.


 I think Greg Thelen is now preparing patches for dirty_ratio.

 Thanks,
 -Kame



Correct.  I am working on the memcg dirty_ratio patches with latest
mmotm memcg.  I am running some test cases which should be complete
tomorrow.  Once testing is complete, I will sent  the patches for
review.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] userns: ptrace: incorporate feedback from Eric

2011-02-23 Thread Serge E. Hallyn
Quoting Andrew Morton (a...@linux-foundation.org):
 On Thu, 24 Feb 2011 00:49:01 +
 Serge E. Hallyn se...@hallyn.com wrote:
 
  same_or_ancestore_user_ns() was not an appropriate check to
  constrain cap_issubset.  Rather, cap_issubset() only is
  meaningful when both capsets are in the same user_ns.
 
 I queued this as a fix against
 userns-allow-ptrace-from-non-init-user-namespaces.patch, but I get the
 feeling that it would be better to just drop everything and start
 again?

Sure, I'll rebase and resend.  I wonder if I should trim the Cc list
for the next round.

thanks,
-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 9/9] userns: check user namespace for task-file uid equivalence checks

2011-02-23 Thread Andrew Morton
On Thu, 24 Feb 2011 03:24:16 + Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Andrew Morton (a...@linux-foundation.org):
  On Thu, 17 Feb 2011 15:04:07 +
  Serge E. Hallyn se...@hallyn.com wrote:
  
  There's a fairly well adhered to convention that global symbols (and
  often static symbols) have a prefix which identifies the subsystem to
  which they belong.  This patchset rather scorns that convention.
  
  Most of these identifiers are pretty obviously from the capability
  subsystem, but still...
 
 Would 'inode_owner_or_capable' be better and and make sense?
 

I suppose so.  We've totally screwed that pooch in the VFS (grep EXPORT
fs/inode.c).  But it's never to late to start.

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Balbir Singh
* Andrea Righi ari...@develer.com [2011-02-22 18:12:51]:

 Currently the blkio.throttle controller only support synchronous IO requests.
 This means that we always look at the current task to identify the owner of
 each IO request.
 
 However dirty pages in the page cache can be wrote to disk asynchronously by
 the per-bdi flusher kernel threads or by any other thread in the system,
 according to the writeback policy.
 
 For this reason the real writes to the underlying block devices may
 occur in a different IO context respect to the task that originally
 generated the dirty pages involved in the IO operation. This makes the
 tracking and throttling of writeback IO more complicate respect to the
 synchronous IO from the blkio controller's perspective.
 
 The same concept is also valid for anonymous pages involed in IO operations
 (swap).
 
 This patch allow to track the cgroup that originally dirtied each page in page
 cache and each anonymous page and pass these informations to the blk-throttle
 controller. These informations can be used to provide a better service level
 differentiation of buffered writes swap IO between different cgroups.
 
 Testcase
 
 - create a cgroup with 1MiB/s write limit:
   # mount -t cgroup -o blkio none /mnt/cgroup
   # mkdir /mnt/cgroup/foo
   # echo 8:0 $((1024 * 1024))  
 /mnt/cgroup/foo/blkio.throttle.write_bps_device
 
 - move a task into the cgroup and run a dd to generate some writeback IO
 
 Results:
   - 2.6.38-rc6 vanilla:
   $ cat /proc/$$/cgroup
   1:blkio:/foo
   $ dd if=/dev/zero of=zero bs=1M count=1024 
   $ dstat -df
   --dsk/sda--
read  writ
  019M
  019M
  0 0
  0 0
  019M
   ...
 
   - 2.6.38-rc6 + blk-throttle writeback IO control:
   $ cat /proc/$$/cgroup
   1:blkio:/foo
   $ dd if=/dev/zero of=zero bs=1M count=1024 
   $ dstat -df
   --dsk/sda--
read  writ
  0  1024
  0  1024
  0  1024
  0  1024
  0  1024
   ...
 

Thanks for looking into this, further review follows.

-- 
Three Cheers,
Balbir
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: User namespaces and keys

2011-02-23 Thread Eric W. Biederman
Casey Schaufler ca...@schaufler-ca.com writes:

 On 2/23/2011 12:55 PM, Eric W. Biederman wrote:
 Casey Schaufler ca...@schaufler-ca.com writes:

 I confess that I remain less well educated on namespaces than
 I probably should be, but with what I do know it seems that the
 relationships between user namespaces and LSMs are bound to be
 strained from the beginning. Some LSMs (SELinux and Smack) are
 providing similar sandbox capabilities to what you get from user
 namespaces, but from different directions and with different
 use cases.
 Casey I won't argue about the possibility of things being strained, but
 I think if we focus on the semantics and not on the end goal of exactly
 how the pieces are to be used there can be some reasonable dialog.

 I'm sure that there will be cases where they will work together
 like horses in a troika. Making sensible semantics for the interactions
 is key, and it is entirely possible that in some cases a comparison
 of semantics and behaviors will lead an end user to chose either an
 LSM or namespaces over the combination. Just like I expect that even
 when we allow multiple LSMs the SELinux and Smack combination will be
 rare among the sane.

That sounds about right.

Eric

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel