Re: [Devel] [PATCH -mm v2 1/2] mem-hotplug: implement get/put_online_mems
On Mon, 7 Apr 2014 13:45:34 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> {un}lock_memory_hotplug, which is used to synchronize against memory
> hotplug, is currently backed by a mutex, which makes it a bit of a
> hammer - threads that only want to get a stable value of the online
> nodes mask won't be able to proceed concurrently. Also, it imposes some
> strong locking ordering rules on it, which narrows down the set of its
> usage scenarios.
>
> This patch introduces get/put_online_mems, which are the same as
> get/put_online_cpus, but for memory hotplug, i.e. executing code inside
> a get/put_online_mems section will guarantee a stable value of online
> nodes, present pages, etc.

Well that seems a nice change. I added the important paragraph:

  lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

I'll Cc a large number of people who have recently worked on the memory
hotplug code. Hopefully some of them will have time to review and test
these patches, thanks.
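A minimal sketch of the usage pattern this enables (the loop body is
illustrative, not taken from the patch):

	get_online_mems();
	for_each_online_node(nid) {
		/*
		 * The online node mask and per-node present pages are
		 * stable until the matching put_online_mems().
		 */
		pr_info("node %d: %lu present pages\n",
			nid, node_present_pages(nid));
	}
	put_online_mems();

Unlike the old lock_memory_hotplug() mutex, any number of such reader
sections can run concurrently; only the hotplug writer excludes them.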
Re: [Devel] [PATCH -mm v2.2] mm: get rid of __GFP_KMEMCG
On Thu, 3 Apr 2014 19:05:59 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Currently, to allocate a page that should be charged to kmemcg (e.g.
> threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
> page allocated is then to be freed by free_memcg_kmem_pages. Apart from
> looking asymmetrical, this also requires intrusion into the general
> allocation path. So let's introduce separate functions that will
> alloc/free pages charged to kmemcg.
>
> The new functions are called alloc_kmem_pages and free_kmem_pages. They
> should be used when the caller actually would like to use kmalloc, but
> has to fall back to the page allocator because the allocation is large.
> They only differ from alloc_pages and free_pages in that, besides
> allocating or freeing pages, they also charge them to the kmem resource
> counter of the current memory cgroup.
>
> ...
>
> +void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
> +{
> +	void *ret;
> +	struct page *page;
> +
> +	flags |= __GFP_COMP;
> +	page = alloc_kmem_pages(flags, order);
> +	ret = page ? page_address(page) : NULL;
> +	kmemleak_alloc(ret, size, 1, flags);
> +	return ret;
> +}

While we're in there it wouldn't hurt to document this: why it exists,
what it does, etc. And why it sets __GFP_COMP.
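For what it's worth, a hedged sketch of what such documentation could
say; the __GFP_COMP rationale below is my reading of the allocator, not
wording taken from the patch:

	/*
	 * kmalloc_order: kmalloc's fallback to the page allocator, used
	 * when the requested size is too large for any kmalloc slab
	 * cache.
	 *
	 * __GFP_COMP makes the allocation a compound page, so the free
	 * path (kfree()/ksize()) can recover the allocation order from
	 * the head page instead of having the caller pass it back in.
	 * kmemleak is told about the object explicitly because large
	 * allocations bypass the slab hooks that would normally
	 * register it.
	 */
	void *kmalloc_order(size_t size, gfp_t flags, unsigned int order);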
Re: [Devel] [PATCH 1/2] kobject: don't block for each kobject_uevent
On Sun, 9 Feb 2014 14:56:15 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Currently kobject_uevent has somewhat unpredictable semantics. The point
> is, since it may call a usermode helper and wait for it to execute
> (UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
> it will introduce for the caller - strictly speaking, it depends on what
> fs the binary is located on and the set of locks fork may take. There
> are quite a few users of kobject_uevent that do not take this into
> account and call it with various mutexes taken, e.g. rtnl_mutex,
> net_mutex, which might potentially lead to a deadlock.
>
> Since there is actually no reason to wait for the usermode helper to
> execute there, let's make kobject_uevent start the helper asynchronously
> with the aid of the UMH_NO_WAIT flag.
>
> Personally, I'm interested in this because I really want kobject_uevent
> to be called under the slab_mutex in the slub implementation, as it used
> to be some time ago, because it greatly simplifies synchronization and
> automatically fixes a kmemcg-related race. However, a deadlock was
> detected on an attempt to call kobject_uevent under the slab_mutex (see
> https://lkml.org/lkml/2012/1/14/45), which was reported to be fixed by
> releasing the slab_mutex for kobject_uevent. Unfortunately, there was no
> information about who exactly blocked on the slab_mutex causing the
> usermode helper to stall, nor have I managed to find this out or
> reproduce the issue.
>
> BTW, this is not the first attempt to make kobject_uevent use
> UMH_NO_WAIT. A previous one was made by commit f520360d93c, but it was
> wrong (it passed arguments allocated on the stack to an async thread),
> so it was reverted (commit 05f54c13cd0c). It was targeted at speeding up
> the boot process, though.

Am not a huge fan of this patch. My test box gets an early oops in
initcalls:

	...
	register_pernet_operations
	rtnetlink_net_init
	__netlink_kernel_create
	sock_create_lite
	sock_alloc
	new_inode_pseudo
	alloc_inode+0xe

I expect that sock_mnt->mnt_sb is null. Or perhaps sb->s_op. Perhaps
sockfs hasn't mounted yet.

The oops doesn't happen on mainline - it only happens on linux-next. So
there may be some interaction there, but it may only be timing related.

config: http://ozlabs.org/~akpm/stuff/config-akpm2
Re: [Devel] [PATCH 1/2] kobject: don't block for each kobject_uevent
On Sun, 9 Feb 2014 14:56:15 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Currently kobject_uevent has somewhat unpredictable semantics. The point
> is, since it may call a usermode helper and wait for it to execute
> (UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
> it will introduce for the caller - strictly speaking, it depends on what
> fs the binary is located on and the set of locks fork may take. There
> are quite a few users of kobject_uevent that do not take this into
> account and call it with various mutexes taken, e.g. rtnl_mutex,
> net_mutex, which might potentially lead to a deadlock.
>
> Since there is actually no reason to wait for the usermode helper to
> execute there, let's make kobject_uevent start the helper asynchronously
> with the aid of the UMH_NO_WAIT flag.
>
> Personally, I'm interested in this because I really want kobject_uevent
> to be called under the slab_mutex in the slub implementation, as it used
> to be some time ago, because it greatly simplifies synchronization and
> automatically fixes a kmemcg-related race. However, a deadlock was
> detected on an attempt to call kobject_uevent under the slab_mutex (see
> https://lkml.org/lkml/2012/1/14/45), which was reported to be fixed by
> releasing the slab_mutex for kobject_uevent. Unfortunately, there was no
> information about who exactly blocked on the slab_mutex causing the
> usermode helper to stall, nor have I managed to find this out or
> reproduce the issue.
>
> BTW, this is not the first attempt to make kobject_uevent use
> UMH_NO_WAIT. A previous one was made by commit f520360d93c, but it was
> wrong (it passed arguments allocated on the stack to an async thread),
> so it was reverted (commit 05f54c13cd0c). It was targeted at speeding up
> the boot process, though.

The patches look good to me. One is kobject (Greg) and the other is slub
(Pekka), so I grabbed them ;) Reviews-and-acks, please?

btw, when referring to commits, please use the form

  f520360d93c ("kobject: don't block for each kobject_uevent")

because the same commit can have different hashes in different trees.
(Although I suspect the amount of convenience this provides others
doesn't match the amount of time I spend fixing changelogs!)
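The stack-argument bug that sank f520360d93c generalizes beyond usermode
helpers. A hypothetical userspace analogue, just to illustrate the
lifetime rule that UMH_NO_WAIT imposes on its arguments:

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	static void *helper(void *arg)
	{
		printf("env: %s\n", (char *)arg);
		free(arg);		/* the worker owns the copy */
		return NULL;
	}

	static void spawn_async(pthread_t *t, const char *env)
	{
		/*
		 * Passing a pointer into the caller's stack frame would
		 * be the f520360d93c bug: with a no-wait helper the frame
		 * may be gone before the helper runs.  Hence the heap
		 * copy, owned by the helper.
		 */
		pthread_create(t, NULL, helper, strdup(env));
	}

	int main(void)
	{
		pthread_t t;

		spawn_async(&t, "ACTION=add");
		pthread_join(t, NULL);
		return 0;
	}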
Re: [Devel] [PATCH 2/3] mm: vmscan: get rid of DEFAULT_SEEKS and document shrink_slab logic
On Wed, 5 Feb 2014 11:16:49 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> > So why did I originally make DEFAULT_SEEKS=2? Because I figured that
> > to recreate (say) an inode would require a seek to the inode data then
> > a seek back. Is it legitimate to include the
> > seek-back-to-what-you-were-doing-before seek in the cost of an inode
> > reclaim? I guess so...
>
> Hmm, that explains this 2. Since we typically don't need to seek back
> when recreating a cache page, as they are usually read in bunches by
> readahead, the number of seeks to bring back a user page is 1, while the
> number of seeks to recreate an average inode is 2, right?

Sounds right to me.

> Then to scan inodes and user pages so that they would generate
> approximately the same number of seeks, we should calculate the number
> of objects to scan as follows:
>
>   nr_objects_to_scan = nr_pages_scanned / lru_pages *
>                        nr_freeable_objects / shrinker->seeks
>
> where shrinker->seeks = DEFAULT_SEEKS = 2 for inodes.

hm, I wonder if we should take the size of the object into account.
Should we be maximizing (memory-reclaimed / seeks-to-reestablish-it)?

> But currently we have four times that. I can explain why we should
> multiply this by 2 - we do not count pages moving from the active to the
> inactive lru in nr_pages_scanned, and 2*nr_pages_scanned can be a good
> approximation for that - but I have no idea why we multiply it by 4...

I don't understand this code at all:

	total_scan = nr;
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= freeable;
	do_div(delta, lru_pages + 1);
	total_scan += delta;

If it actually makes any sense, it sorely, sorely needs documentation.

David, you touched it last. Any hints?
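A userspace model of the arithmetic under discussion, with made-up
inputs, showing the factor-of-4 gap between the code and the "ideal"
one-seek-per-page formula from the thread:

	#include <stdio.h>

	int main(void)
	{
		unsigned long nr_pages_scanned = 1000;
		unsigned long lru_pages = 100000;
		unsigned long freeable = 50000;	/* shrinker->count() */
		unsigned long seeks = 2;	/* DEFAULT_SEEKS */

		/* what shrink_slab_node() computes */
		unsigned long delta = (4 * nr_pages_scanned) / seeks;
		delta = delta * freeable / (lru_pages + 1);

		/* the "ideal" formula quoted above */
		unsigned long ideal = nr_pages_scanned * freeable
					/ (lru_pages * seeks);

		printf("code scans %lu, formula says %lu (~4x)\n",
		       delta, ideal);
		return 0;
	}

With these inputs the code arrives at ~999 objects versus 250 for the
formula - the unexplained multiply-by-4 in one picture.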
Re: [Devel] [PATCH 2/3] mm: vmscan: get rid of DEFAULT_SEEKS and document shrink_slab logic
On Fri, 17 Jan 2014 23:25:30 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Each shrinker must define the number of seeks it takes to recreate a
> shrinkable cache object. It is used to balance slab reclaim vs page
> reclaim: assuming it costs one seek to replace an LRU page, we age equal
> percentages of the LRU and ageable caches. So far, everything sounds
> clear, but the code implementing this behavior is rather confusing.
>
> First, there is the DEFAULT_SEEKS constant, which equals 2 for some
> reason:
>
>   #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>
> Most shrinkers define `seeks' to be equal to DEFAULT_SEEKS, some use
> DEFAULT_SEEKS*N, and there are a few that totally ignore it. What is
> peculiar, the dcache and icache shrinkers have seeks=DEFAULT_SEEKS
> although recreating an inode typically requires one seek. Does this mean
> that we scan twice as many inodes as we should? Actually, no. The point
> is that vmscan handles DEFAULT_SEEKS as if it were 1 (`delta' is the
> number of objects we are going to scan):
>
>   shrink_slab_node():
>
>	delta = (4 * nr_pages_scanned) / shrinker->seeks;
>	delta *= freeable;
>	do_div(delta, lru_pages + 1);
>
> i.e.
>
>	    2 * nr_pages_scanned    DEFAULT_SEEKS
>   delta = -------------------- * --------------- * freeable;
>		  lru_pages	   shrinker->seeks
>
> Here we double the number of pages scanned in order to take into account
> moves of pages between the lru lists, which we do not count in
> nr_pages_scanned.
>
> That said, shrinker->seeks=DEFAULT_SEEKS*N is equivalent to N seeks, so
> why on the hell do we need it? IMO, the existence of the DEFAULT_SEEKS
> constant only causes confusion for both users of the shrinker interface
> and those trying to understand how slab shrinking works. The meaning of
> `seeks' is perfectly explained by the comment to it and there is no need
> in any obscure constants for using it.
>
> That's why I'm sending this patch, which completely removes
> DEFAULT_SEEKS and makes all shrinkers use N instead of N*DEFAULT_SEEKS,
> documenting the idea lying behind shrink_slab() in the meanwhile.
>
> Unfortunately, there are a few shrinkers that define seeks=1, which is
> impossible to transfer to the new interface intact, namely:
>
>   nfsd_reply_cache_shrinker
>   ttm_pool_manager::mm_shrink
>   ttm_dma_pool_manager::mm_shrink
>   dm_bufio_client::shrinker
>
> It seems to me their authors were simply deceived by this mysterious
> DEFAULT_SEEKS constant, because I've found no documentation why these
> particular caches should be scanned more aggressively than the page and
> other slab caches. For them, this patch leaves seeks=1. Thus, it DOES
> introduce a functional change: the shrinkers enumerated above will be
> scanned half as intensively as they are now. I do not think that this
> will cause any problems though.

um, yes. DEFAULT_SEEKS is supposed to be the number of seeks if you don't
know any better. Using DEFAULT_SEEKS*n is just daft.

So why did I originally make DEFAULT_SEEKS=2? Because I figured that to
recreate (say) an inode would require a seek to the inode data then a
seek back. Is it legitimate to include the
seek-back-to-what-you-were-doing-before seek in the cost of an inode
reclaim? I guess so...

If a filesystem were to require a seek to the superblock for every inode
read (ok, bad example) then the cost of reestablishing that inode would
be 3.

All that being said, why did you go through and halve everything? The
cost of reestablishing an ext2 inode should be 2 seeks, but the patch
makes it 1.
Re: [Devel] [PATCH v2 2/7] memcg, slab: cleanup memcg cache name creation
On Mon, 3 Feb 2014 19:54:37 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> The way memcg_create_kmem_cache() creates the name for a memcg cache
> looks rather strange: it first formats the name in the static buffer
> tmp_name protected by a mutex, then passes the pointer to the buffer to
> kmem_cache_create_memcg(), which finally duplicates it to the cache
> name.
>
> Let's clean this up by moving memcg cache name creation to a separate
> function to be called by kmem_cache_create_memcg(), and estimating the
> length of the name string before copying anything to it so that we won't
> need a temporary buffer.
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3193,6 +3193,37 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
>  	return 0;
>  }
>  
> +static int memcg_print_cache_name(char *buf, size_t size,
> +		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
> +{
> +	int ret;
> +
> +	rcu_read_lock();
> +	ret = snprintf(buf, size, "%s(%d:%s)", root_cache->name,
> +		       memcg_cache_id(memcg), cgroup_name(memcg->css.cgroup));
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +char *memcg_create_cache_name(struct mem_cgroup *memcg,
> +		struct kmem_cache *root_cache)
> +{
> +	int len;
> +	char *name;
> +
> +	/*
> +	 * We cannot use kasprintf() here, because cgroup_name() must be
> +	 * called under RCU protection.
> +	 */
> +	len = memcg_print_cache_name(NULL, 0, memcg, root_cache);
> +
> +	name = kmalloc(len + 1, GFP_KERNEL);
> +	if (name)
> +		memcg_print_cache_name(name, len + 1, memcg, root_cache);

but but but this assumes that cgroup_name(memcg->css.cgroup) did not
change between the two calls to memcg_print_cache_name(). If that is the
case then the locking was unneeded anyway.

> +	return name;
> +}
> +
>  int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
>  			     struct kmem_cache *root_cache)
>  {
> @@ -3397,44 +3428,6 @@ void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
>  	schedule_work(&cachep->memcg_params->destroy);
>  }
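A userspace demo of the measure-then-format pattern being discussed; C99
guarantees that snprintf(NULL, 0, ...) returns the would-be length. Note
that the second pass is still size-bounded, so a cgroup name that
changes between the two calls yields a truncated result rather than an
overrun - which is the race Andrew is pointing at:

	#include <stdio.h>
	#include <stdlib.h>

	static char *make_name(const char *root, int id, const char *cg)
	{
		int len = snprintf(NULL, 0, "%s(%d:%s)", root, id, cg);
		char *name = malloc(len + 1);

		if (name)
			snprintf(name, len + 1, "%s(%d:%s)", root, id, cg);
		return name;
	}

	int main(void)
	{
		char *n = make_name("dentry", 3, "container1");

		puts(n ? n : "(alloc failed)");
		free(n);
		return 0;
	}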
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Wed, 15 Jan 2014 12:47:35 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> On 01/15/2014 02:14 AM, Andrew Morton wrote:
> > On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov wrote:
> >> On 01/14/2014 03:05 AM, Andrew Morton wrote:
> >>> That being said, I think I'll schedule this patch as-is for 3.14.
> >>> Can you please take a look at implementing the simpler approach,
> >>> send me something for 3.15-rc1?
> >>
> >> IMHO the simpler approach (Glauber's patch) is not suitable as is,
> >> because it, in fact, neglects the notion of batch_size when doing low
> >> prio scans, because it calls ->scan() for batch_size objects even if
> >> the slab has < batch_size objects, while AFAIU it should accumulate a
> >> sufficient number of objects to scan in nr_deferred instead.
> >
> > Well. If you mean that when nr-objects=large and batch_size=32 and
> > total_scan=33, the patched code will scan 32 objects and then 1 object
> > then yes, that should be fixed.
>
> I mean if nr_objects=large and batch_size=32 and shrink_slab() is called
> 8 times with total_scan=4, we can either call ->scan() 8 times with
> nr_to_scan=4 (Glauber's patch) or call it only once with nr_to_scan=32
> (that's how it works now).
>
> Frankly, after a bit of thinking I am starting to doubt that this can
> affect performance at all provided the shrinker is implemented in a sane
> way, because as you've mentioned shrink_slab() is already a slow path.
> It seems I misunderstood the purpose of batch_size initially: I thought
> we needed it to limit the number of calls to ->scan(), but now I guess
> the only purpose of it is limiting the number of objects scanned in one
> pass to avoid latency issues.

Actually, the intent of batching is to limit the number of calls to
->scan(). At least, that was the intent when I wrote it!

This is a good principle and we should keep doing it. If we're going to
send the CPU away to tread on a pile of cold cachelines, we should make
sure that it does a good amount of work while it's there.

> But then another question arises - why do you think the behavior you
> described above (scanning 32 and then 1 object if total_scan=33,
> batch_size=32) is bad?

Yes, it's a bit inefficient, but it won't be too bad. What would be bad
would be to scan a very small number of objects and then to advance to
the next shrinker.

> In other words, why can't we make the scan loop look like this:
>
>	while (total_scan > 0) {
>		unsigned long ret;
>		unsigned long nr_to_scan = min(total_scan, batch_size);
>
>		shrinkctl->nr_to_scan = nr_to_scan;
>		ret = shrinker->scan_objects(shrinker, shrinkctl);
>		if (ret == SHRINK_STOP)
>			break;
>		freed += ret;
>
>		count_vm_events(SLABS_SCANNED, nr_to_scan);
>		total_scan -= nr_to_scan;
>
>		cond_resched();
>	}

Well, if we come in here with total_scan=1 then we defeat the original
intent of the batching, don't we? We end up doing a lot of work just to
scan one object.

So perhaps add something like

	if (total_scan < batch_size && max_pass > batch_size)
		skip the while loop

If we do this, total_scan will be accumulated into nr_deferred, up to the
point where the threshold is exceeded, yes? All the arithmetic in there
hurts my brain and I don't know what values total_scan typically ends up
with.

btw, all that trickery with delta and lru_pages desperately needs
documenting. What the heck is it intended to do??

We could avoid the scan-32-then-scan-just-1 issue with something like

	if (total_scan > batch_size)
		total_scan %= batch_size;

before the loop. But I expect the effects of that will be unmeasurable -
on average the number of objects which are scanned in the final pass of
the loop will be batch_size/2, yes? That's still a decent amount.
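For concreteness, a userspace model of the min()-based loop quoted
above: with total_scan=33 and batch_size=32 the final pass scans a
single object, which is exactly the inefficiency being weighed here:

	#include <stdio.h>

	int main(void)
	{
		unsigned long total_scan = 33, batch_size = 32;

		while (total_scan > 0) {
			unsigned long nr_to_scan =
				total_scan < batch_size ? total_scan
							: batch_size;
			printf("scan_objects(nr_to_scan=%lu)\n", nr_to_scan);
			total_scan -= nr_to_scan;
		}
		return 0;
	}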
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Wed, 15 Jan 2014 19:55:11 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> > We could avoid the scan-32-then-scan-just-1 issue with something like
> >
> >	if (total_scan > batch_size)
> >		total_scan %= batch_size;
> >
> > before the loop. But I expect the effects of that will be
> > unmeasurable - on average the number of objects which are scanned in
> > the final pass of the loop will be batch_size/2, yes? That's still a
> > decent amount.
>
> Let me try to summarize. We want to scan batch_size objects in one pass,
> not more (to keep latency low) and not less (to avoid cpu cache
> pollution due to too frequent calls); if the calculated value of
> nr_to_scan is less than the batch_size we should accumulate it in
> nr_deferred instead of calling ->scan() and add nr_deferred to
> nr_to_scan on the next pass, i.e. in pseudo-code:
>
>	/* calculate current nr_to_scan */
>	max_pass = shrinker->count();
>	delta = max_pass * nr_user_pages_scanned / nr_user_pages;
>
>	/* add nr_deferred */
>	total_scan = delta + nr_deferred;
>
>	while (total_scan >= batch_size) {
>		shrinker->scan(batch_size);
>		total_scan -= batch_size;
>	}
>
>	/* save the remainder to nr_deferred */
>	nr_deferred = total_scan;
>
> That would work, but if max_pass is < batch_size, it would not scan the
> objects immediately even if prio is high (we want to scan all objects).

Yes, that's a problem.

> For example, dropping caches would not work on the first attempt - the
> user would have to call it batch_size / max_pass times.

And we do want drop_caches to work immediately.

> This could be fixed by making the code proceed to ->scan() not only if
> total_scan is >= batch_size, but also if max_pass is < batch_size and
> total_scan is >= max_pass, i.e.
>
>	while (total_scan >= batch_size ||
>	       (max_pass < batch_size && total_scan >= max_pass))
>		...
>
> which is equivalent to
>
>	while (total_scan >= batch_size || total_scan >= max_pass)
>		...
>
> The latter is the loop condition from the current patch, i.e. this patch
> would do the trick if shrink_slab() followed the pseudo-code above. In
> real life, it does not actually - we have to bias total_scan before the
> while loop in order to avoid dropping fs meta caches on light memory
> pressure due to a large number being built in nr_deferred:
>
>	if (delta < max_pass / 4)
>		total_scan = min(total_scan, max_pass / 2);

Oh, is that what it's for. Where did you discover this gem?

>	while (total_scan >= batch_size)
>		...
>
> With this biasing, it is impossible to achieve the ideal behavior I've
> described above, because we will never accumulate max_pass objects in
> nr_deferred if memory pressure is low. So, if applied to the real code,
> this patch takes on a slightly different sense, which I tried to reflect
> in the comment to the code: it will call ->scan() with nr_to_scan <
> batch_size only if:
>
> 1) max_pass < batch_size && total_scan >= max_pass
>
> and
>
> 2) we're tight on memory, i.e. the current delta is high (otherwise
>    total_scan will be biased as max_pass / 2 and condition 1 won't be
>    satisfied).

(is max_pass misnamed?)

> From our discussion it seems condition 2 is not necessary at all, but it
> follows directly from the biasing rule. So I propose to tweak the
> biasing a bit so that total_scan won't be lowered below batch_size:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eea668d..78ddd5e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -267,7 +267,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  	 * a large delta change is calculated directly.
>  	 */
>  	if (delta < max_pass / 4)
> -		total_scan = min(total_scan, max_pass / 2);
> +		total_scan = min(total_scan, max(max_pass / 2, batch_size));
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
> @@ -281,7 +281,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  			nr_pages_scanned, lru_pages,
>  			max_pass, delta, total_scan);
>  
> -	while (total_scan >= batch_size) {
> +	while (total_scan >= batch_size || total_scan >= max_pass) {
>  		unsigned long ret;
>  		shrinkctl->nr_to_scan = batch_size;
>
> The first hunk guarantees that total_scan will always accumulate at
> least batch_size objects no matter how small max_pass is. That means
> that when max_pass is < batch_size we will eventually get >= max_pass
> objects to scan and shrink the slab to 0 as we need. What do you think
> about that?

I'm a bit lost :(
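A userspace model of the accumulate-then-scan behaviour in the
pseudo-code above (all numbers made up): under light pressure each call
defers its small delta, and ->scan() only fires once a full batch has
built up:

	#include <stdio.h>

	#define BATCH_SIZE 32UL

	static unsigned long nr_deferred;

	static void shrink_once(unsigned long delta, unsigned long max_pass)
	{
		unsigned long total_scan = delta + nr_deferred;

		while (total_scan >= BATCH_SIZE ||
		       (max_pass < BATCH_SIZE && total_scan >= max_pass)) {
			unsigned long nr = total_scan < BATCH_SIZE ?
						total_scan : BATCH_SIZE;
			printf("scan(%lu)\n", nr);
			total_scan -= nr;
		}
		nr_deferred = total_scan;  /* remainder carried forward */
	}

	int main(void)
	{
		/* eight light-pressure calls: one scan(32) fires, on the
		 * last call, once the deferred deltas add up to a batch */
		for (int i = 0; i < 8; i++)
			shrink_once(4, 1000);
		return 0;
	}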
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> On 01/14/2014 03:05 AM, Andrew Morton wrote:
> > On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov wrote:
> >> When reclaiming kmem, we currently don't scan slabs that have less
> >> than batch_size objects (see shrink_slab_node()):
> >>
> >>	while (total_scan >= batch_size) {
> >>		shrinkctl->nr_to_scan = batch_size;
> >>		shrinker->scan_objects(shrinker, shrinkctl);
> >>		total_scan -= batch_size;
> >>	}
> >>
> >> If there are only a few shrinkers available, such a behavior won't
> >> cause any problems, because the batch_size is usually small, but if
> >> we have a lot of slab shrinkers, which is perfectly possible since FS
> >> shrinkers are now per-superblock, we can end up with hundreds of
> >> megabytes of practically unreclaimable kmem objects. For instance,
> >> mounting a thousand ext2 FS images with a hundred files in each and
> >> iterating over all the files using du(1) will result in about 200 Mb
> >> of FS caches that cannot be dropped even with the aid of the
> >> vm.drop_caches sysctl!
> >
> > True. I suspect this was an accidental consequence of the chosen
> > implementation. As you mentioned, I was thinking that the caches would
> > all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just
> > didn't matter.
> >
> >> This problem was initially pointed out by Glauber Costa [*]. Glauber
> >> proposed to fix it by making shrink_slab() always take at least one
> >> pass, to put it simply, turning the scan loop above into a do{}while()
> >> loop. However, this proposal was rejected, because it could result in
> >> more aggressive and frequent slab shrinking even under low memory
> >> pressure when total_scan is naturally very small.
> >
> > Well, it wasn't rejected - Mel pointed out that Glauber's change could
> > potentially trigger problems which already exist in shrinkers. The
> > potential issues seem pretty unlikely to me, and they're things we can
> > fix up if they eventuate.
>
> When preparing this patch, I considered not the problems that
> potentially exist in some shrinkers, but the issues that an
> unconditional scan of batch_size objects might trigger for any shrinker:
>
> 1) We would call shrinkers more frequently,

hm, why?

> which could possibly increase contention on shrinker-internal locks. The
> point is that under very light memory pressure, when we can fulfill the
> allocation request after a few low-prio scans, we would not call slab
> shrinkers at all; instead we would only add the delta to nr_deferred in
> order to keep slab-vs-pagecache reclaim balanced. Glauber's original
> patch changes this behavior - it makes shrink_slab() always call the
> shrinker at least once, even if the current delta is negligible. I'm
> afraid this might affect performance. Note, this is irrespective of how
> many objects the shrinker has to reclaim (< or >= batch_size).

I doubt if it affects performance much at all - memory reclaim in general
is a slow path.

> 2) As Mel Gorman pointed out
>    (http://thread.gmane.org/gmane.linux.kernel.mm/99059):
>
>    It's possible for caches to shrink to zero where before the last
>    SHRINK_SLAB objects would often be protected for any slab. If this is
>    an inode or dentry cache and there are very few objects then it's
>    possible that objects will be reclaimed before they can be used by
>    the process allocating them.

I don't understand that one. It appears to assume that vfs code will
allocate and initialise a dentry or inode, will put it in cache and
release all references to it and then will look it up again and start
using it. vfs doesn't work like that - it would be crazy to do so. It
will instead hold references to those objects while using them. Mainly to
protect them from reclaim.

And if there were any problems of this nature, they would already be
demonstrable with a sufficiently large number of threads/cpus.

> > So I'm thinking we should at least try Glauber's approach - it's a bit
> > weird that we should treat the final 0 .. batch_size-1 objects in a
> > different manner from all the others.
>
> It's not exactly that we treat the final 0 .. batch_size-1 objects
> differently from others. We rather try to accumulate at least batch_size
> objects before calling ->scan().

And if there are < batch_size objects, they never get scanned. That's
different treatment.

> > That being said, I think I'll schedule this patch as-is for 3.14. Can
> > you please take a look at implementing the simpler approach, send me
> > something for 3.15-rc1?
>
> IMHO the simpler approach (Glauber's patch) is not suitable as is,
> because it, in fact, neglects the notion of batch_size when doing low
> prio scans, because it calls ->scan() for batch_size objects even if the
> slab has < batch_size objects, while AFAIU it should accumulate a
> sufficient number of objects to scan in nr_deferred instead.

Well. If you mean that when nr-objects=large and batch_size=32 and
total_scan=33, the patched code will scan 32 objects and then 1 object
then yes, that should be fixed.
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> When reclaiming kmem, we currently don't scan slabs that have less than
> batch_size objects (see shrink_slab_node()):
>
>	while (total_scan >= batch_size) {
>		shrinkctl->nr_to_scan = batch_size;
>		shrinker->scan_objects(shrinker, shrinkctl);
>		total_scan -= batch_size;
>	}
>
> If there are only a few shrinkers available, such a behavior won't cause
> any problems, because the batch_size is usually small, but if we have a
> lot of slab shrinkers, which is perfectly possible since FS shrinkers
> are now per-superblock, we can end up with hundreds of megabytes of
> practically unreclaimable kmem objects. For instance, mounting a
> thousand ext2 FS images with a hundred files in each and iterating over
> all the files using du(1) will result in about 200 Mb of FS caches that
> cannot be dropped even with the aid of the vm.drop_caches sysctl!

True. I suspect this was an accidental consequence of the chosen
implementation. As you mentioned, I was thinking that the caches would
all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just didn't
matter.

> This problem was initially pointed out by Glauber Costa [*]. Glauber
> proposed to fix it by making shrink_slab() always take at least one
> pass, to put it simply, turning the scan loop above into a do{}while()
> loop. However, this proposal was rejected, because it could result in
> more aggressive and frequent slab shrinking even under low memory
> pressure when total_scan is naturally very small.

Well, it wasn't rejected - Mel pointed out that Glauber's change could
potentially trigger problems which already exist in shrinkers. The
potential issues seem pretty unlikely to me, and they're things we can
fix up if they eventuate.

So I'm thinking we should at least try Glauber's approach - it's a bit
weird that we should treat the final 0 .. batch_size-1 objects in a
different manner from all the others.

That being said, I think I'll schedule this patch as-is for 3.14. Can you
please take a look at implementing the simpler approach, send me
something for 3.15-rc1?
Re: [Devel] [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
On Sat, 11 Jan 2014 16:36:33 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> When direct reclaim is executed by a process bound to a set of NUMA
> nodes, we should scan only those nodes when possible, but currently we
> will scan kmem from all online nodes even if the kmem shrinker is NUMA
> aware. That said, binding a process to a particular NUMA node won't
> prevent it from shrinking inode/dentry caches from other nodes, which is
> not good. Fix this.

Seems right. I worry that reducing the amount of shrinking which
node-bound processes perform might affect workloads in unexpected ways.

I think I'll save this one for 3.15-rc1, OK?
Re: [Devel] [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones()
On Sat, 11 Jan 2014 16:36:34 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> This reduces the indentation level of do_try_to_free_pages() and removes
> the extra loop over all eligible zones counting the number of on-LRU
> pages.

So this should cause no functional change, yes?
Re: [Devel] [PATCH v11 00/15] kmemcg shrinkers
On Tue, 26 Nov 2013 16:55:43 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> What do you think about splitting this set into two main series as
> follows:
>
> 1) Prepare vmscan for kmemcg-aware shrinkers; would include patches 1-7
>    of this set.
> 2) Make fs shrinkers memcg-aware; would include patches 9-11 of this set

Please just resend everything.

> and leave other patches, which are rather for optimization/extending
> functionality, for the future?

It will be helpful to describe such splitting opportunities in the [0/n]
changelog.
Re: [Devel] [PATCH] mm: strictlimit feature -v4
On Wed, 21 Aug 2013 17:56:32 +0400 Maxim Patlasov mpatla...@parallels.com wrote:

> The feature prevents mistrusted filesystems from growing a large number
> of dirty pages before throttling. For such filesystems
> balance_dirty_pages always checks bdi counters against bdi limits. I.e.
> even if global nr_dirty is under freerun, it's not allowed to skip bdi
> checks. The only use case for now is fuse: it sets bdi max_ratio to 1%
> by default and system administrators are supposed to expect that this
> limit won't be exceeded.
>
> The feature is on if a BDI is marked by the BDI_CAP_STRICTLIMIT flag. A
> filesystem may set the flag when it initializes its BDI.

Now I think about it, I don't really understand the need for this
feature. Can you please go into some detail about the problematic
scenarios and why they need fixing? Including an expanded description of
the term "mistrusted filesystem"?

Is this some theoretical happens-in-the-lab thing, or are real-world
users actually hurting due to the lack of this feature?

I think I'll apply it to -mm for now to get a bit of testing, but would
very much like it if Fengguang could find time to review the
implementation, please.
Re: [Devel] [PATCH] mqueue: sys_mq_open: do not call mnt_drop_write() if read-only
On Tue, 19 Mar 2013 13:31:18 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> mnt_drop_write() must be called only if mnt_want_write() succeeded,
> otherwise the mnt_writers counter will diverge.
>
> ...
>
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -840,7 +840,8 @@ out_putfd:
>  		fd = error;
>  	}
>  	mutex_unlock(&root->d_inode->i_mutex);
> -	mnt_drop_write(mnt);
> +	if (!ro)
> +		mnt_drop_write(mnt);
>  out_putname:
>  	putname(name);
>  	return fd;

huh, that's been there for a while. What were the runtime-visible effects
of the bug?
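A hedged kernel-style sketch of the invariant the fix restores (the
function name is made up): mnt_drop_write() is only valid on the path
where mnt_want_write() returned 0.

	static int frob_mnt(struct vfsmount *mnt)
	{
		int ro;

		/* 0 on success; -EROFS etc. on a read-only mount */
		ro = mnt_want_write(mnt);

		/* ... work that may be legitimate even when ro ... */

		if (!ro)
			mnt_drop_write(mnt);	/* drop only what was taken */
		return ro;
	}

Calling mnt_drop_write() unconditionally, as the old code did, decrements
mnt_writers for a reference that was never taken.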
Re: [Devel] [RFC PATCH v8 0/5] IPC: checkpoint/restore in userspace enhancements
On Thu, 20 Dec 2012 08:06:32 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> 19.12.2012 00:36, Andrew Morton wrote:
> > On Wed, 24 Oct 2012 19:34:51 +0400 Stanislav Kinsbursky wrote:
> >> This respin of the patch set was significantly reworked. Most of the
> >> new API was replaced by sysctls (one each for messages, semaphores
> >> and shared memory), allowing to preset the desired id for the next
> >> new IPC object.
> >>
> >> This patch set is aimed to provide additional functionality for all
> >> IPC objects, which is required for migration of these objects by
> >> user-space checkpoint/restore utils (CRIU).
> >>
> >> The main problem here was the impossibility to set up an object id.
> >> This patch set solves the problem by adding new sysctls for presetting
> >> the desired id for a new IPC object.
> >>
> >> Another problem was how to peek messages from queues without deleting
> >> them. This was achieved by introducing a new MSG_COPY flag for
> >> sys_msgrcv(). If the MSG_COPY flag is set, then msgtyp is interpreted
> >> as a message number.
> >
> > According to my extensive records, Sasha hit a bug in
> > ipc-message-queue-copy-feature-introduced.patch and Fengguang found a
> > bug in
> > ipc-message-queue-copy-feature-introduced-cleanup-do_msgrcv-aroung-msg_copy-feature.patch
> >
> > It's not obvious (to me) that these things have been identified and
> > fixed. What's the status, please?
>
> Hello, Andrew.
> Fengguang's issue was solved by "ipc: simplify message copying" I sent
> you. But I can't find Sasha's issue. As I remember, there was some
> problem in an early version of the patch set. But I believe it's fixed
> now.

http://lkml.indiana.edu/hypermail/linux/kernel/1210.3/01710.html
Subject: ipc, msgqueue: NULL ptr deref in msgrcv
Re: [Devel] [RFC PATCH v8 0/5] IPC: checkpoint/restore in userspace enhancements
On Wed, 24 Oct 2012 19:34:51 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> This respin of the patch set was significantly reworked. Most of the new
> API was replaced by sysctls (one each for messages, semaphores and
> shared memory), allowing to preset the desired id for the next new IPC
> object.
>
> This patch set is aimed to provide additional functionality for all IPC
> objects, which is required for migration of these objects by user-space
> checkpoint/restore utils (CRIU).
>
> The main problem here was the impossibility to set up an object id. This
> patch set solves the problem by adding new sysctls for presetting the
> desired id for a new IPC object.
>
> Another problem was how to peek messages from queues without deleting
> them. This was achieved by introducing a new MSG_COPY flag for
> sys_msgrcv(). If the MSG_COPY flag is set, then msgtyp is interpreted as
> a message number.

According to my extensive records, Sasha hit a bug in
ipc-message-queue-copy-feature-introduced.patch and Fengguang found a bug
in
ipc-message-queue-copy-feature-introduced-cleanup-do_msgrcv-aroung-msg_copy-feature.patch

It's not obvious (to me) that these things have been identified and
fixed. What's the status, please?
[Devel] Re: [PATCH v8 2/5] ipc: add sysctl to specify desired next object id
On Wed, 24 Oct 2012 19:35:09 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> This patch adds 3 new variables and sysctls to tune them (one next_id
> variable each for messages, semaphores and shared memory respectively).
> This variable can be used to set the desired id for the next allocated
> IPC object. By default it's equal to -1 and the old behaviour is
> preserved. If this variable is non-negative, then the desired id will be
> extracted from it and used as a start value to search for a free IDR
> slot.
>
> Notes:
> 1) this patch doesn't guarantee that the new object will have the
>    desired id. So it's up to user space how to handle a new object with
>    the wrong id.
> 2) After a successful id allocation attempt, next_id will be set back to
>    -1 (if it was non-negative).
>
> --- a/ipc/ipc_sysctl.c
> +++ b/ipc/ipc_sysctl.c
> @@ -158,6 +158,7 @@ static int proc_ipcauto_dointvec_minmax(ctl_table *table, int write,
>  
>  static int zero;
>  static int one = 1;
> +static int int_max = INT_MAX;
>  
>  static struct ctl_table ipc_kern_table[] = {
>  	{
> @@ -227,6 +228,33 @@ static struct ctl_table ipc_kern_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> +	{
> +		.procname	= "sem_next_id",
> +		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
> +		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
> +		.mode		= 0644,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max,
> +	},
> +	{
> +		.procname	= "msg_next_id",
> +		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
> +		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
> +		.mode		= 0644,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max,
> +	},
> +	{
> +		.procname	= "shm_next_id",
> +		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
> +		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
> +		.mode		= 0644,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max,
> +	},
>  	{}
>  };

ipc_kern_table[] is (badly) documented in Documentation/sysctl/kernel.txt.
Can we at least mention these controls in there?

Better, create a new way of properly documenting each control and
document these three in that manner? Better still, document all the other
ones as well ;)

The patch adds these controls to CONFIG_CHECKPOINT_RESTORE=n kernels. Why
is this?
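A userspace demo of how the interface is meant to be used (assuming the
sysctls land under /proc/sys/kernel/; as the notes above say, the id is
desired, not guaranteed, so the caller must check the result):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	int main(void)
	{
		int fd = open("/proc/sys/kernel/sem_next_id", O_WRONLY);

		if (fd < 0 || write(fd, "42", 2) != 2) {
			perror("sem_next_id");
			return 1;
		}
		close(fd);

		/* the next semaphore set should (but need not) get id 42 */
		int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

		printf("semget returned id %d\n", id);
		return 0;
	}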
[Devel] Re: [PATCH v8 4/5] ipc: message queue copy feature introduced
On Wed, 24 Oct 2012 19:35:20 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> This patch is required for checkpoint/restore in userspace. IOW, c/r
> requires some way to get all pending IPC messages without deleting them
> from the queue (checkpoint can fail and in this case tasks will be
> resumed, so the queue has to be valid). To achieve this, a new operation
> flag MSG_COPY for the sys_msgrcv() system call was introduced. If this
> flag is specified, then mtype is interpreted as the number of the
> message to copy. If MSG_COPY is set, then the kernel will allocate a
> dummy message with the passed size, and then use the new copy_msg()
> helper function to copy the desired message (instead of unlinking it
> from the queue).
>
> ...
>
> @@ -777,19 +777,48 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>  	struct msg_msg *msg;
>  	int mode;
>  	struct ipc_namespace *ns;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	struct msg_msg *copy = NULL;
> +	unsigned long copy_number = 0;
> +#endif
>  
>  	if (msqid < 0 || (long) bufsz < 0)
>  		return -EINVAL;
> +	if (msgflg & MSG_COPY) {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +
> +		if (msgflg & MSG_COPY) {

This test isn't needed.

> +			copy_number = msgtyp;
> +			msgtyp = 0;
> +		}
> +
> +		/*
> +		 * Create dummy message to copy real message to.
> +		 */
> +		copy = load_msg(buf, bufsz);
> +		if (IS_ERR(copy))
> +			return PTR_ERR(copy);
> +		copy->m_ts = bufsz;
> +#else
> +		return -ENOSYS;
> +#endif
> +	}
>  	mode = convert_mode(&msgtyp, msgflg);
>  	ns = current->nsproxy->ipc_ns;
>  
>  	msq = msg_lock_check(ns, msqid);
> -	if (IS_ERR(msq))
> +	if (IS_ERR(msq)) {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +		if (msgflg & MSG_COPY)
> +			free_msg(copy);
> +#endif
>  		return PTR_ERR(msq);
> +	}
>  
>  	for (;;) {
>  		struct msg_receiver msr_d;
>  		struct list_head *tmp;
> +		long msg_counter = 0;
>  
>  		msg = ERR_PTR(-EACCES);
>  		if (ipcperms(ns, &msq->q_perm, S_IRUGO))
> @@ -809,8 +838,16 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>  			if (mode == SEARCH_LESSEQUAL &&
>  					walk_msg->m_type != 1) {
>  				msgtyp = walk_msg->m_type - 1;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +			} else if (msgflg & MSG_COPY) {
> +				if (copy_number == msg_counter) {
> +					msg = copy_msg(walk_msg, copy);
> +					break;
> +				}
> +#endif
>  			} else
>  				break;
> +			msg_counter++;
>  		}
>  		tmp = tmp->next;
>  	}
> @@ -823,6 +860,10 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>  			msg = ERR_PTR(-E2BIG);
>  			goto out_unlock;
>  		}
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +		if (msgflg & MSG_COPY)
> +			goto out_unlock;
> +#endif
>  		list_del(&msg->m_list);
>  		msq->q_qnum--;
>  		msq->q_rtime = get_seconds();
> @@ -906,8 +947,13 @@ out_unlock:
>  			break;
>  		}
>  	}
> -	if (IS_ERR(msg))
> +	if (IS_ERR(msg)) {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +		if (msgflg & MSG_COPY)
> +			free_msg(copy);
> +#endif
>  		return PTR_ERR(msg);
> +	}
>  
>  	bufsz = msg_handler(buf, msg, bufsz);
>  	free_msg(msg);

It's all a bit ugly, but I don't really see much we can practically do
about that. You could add something like

	#ifdef CONFIG_CHECKPOINT_RESTORE
	static inline void free_copy(void *p, int msgflg, struct msg_msg *copy)
	{
		if (IS_ERR(p) && (msgflg & MSG_COPY))
			free_msg(copy);
	}
	#else
	/* As a macro because `copy' will be undefined */
	#define free_copy(p, msgflg, copy) do {} while (0)
	#endif

and use that in a couple of places. But that won't help much.
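A hedged userspace sketch of peeking with the new flag (requires a
CONFIG_CHECKPOINT_RESTORE kernel; the MSG_COPY value is copied from
<linux/msg.h> in case libc headers lack it, and pairing it with
IPC_NOWAIT avoids blocking when the requested message number doesn't
exist):

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/msg.h>

	#ifndef MSG_COPY
	#define MSG_COPY 040000		/* from <linux/msg.h> */
	#endif

	struct mbuf {
		long mtype;
		char mtext[128];
	};

	/* peek message number n without removing it from the queue */
	static int peek(int msqid, long n)
	{
		struct mbuf m;
		/* with MSG_COPY, the msgtyp argument is the message index */
		ssize_t len = msgrcv(msqid, &m, sizeof(m.mtext), n,
				     MSG_COPY | IPC_NOWAIT);

		if (len < 0) {
			perror("msgrcv");
			return -1;
		}
		printf("msg %ld: type %ld, %zd bytes\n", n, m.mtype, len);
		return 0;
	}

	int main(void)
	{
		int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
		struct mbuf m = { .mtype = 7, .mtext = "hello" };

		msgsnd(id, &m, sizeof("hello"), 0);
		peek(id, 0);	/* message 0 stays on the queue... */
		peek(id, 0);	/* ...so peeking again succeeds too */
		msgctl(id, IPC_RMID, NULL);
		return 0;
	}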
[Devel] Re: [PATCH v5 00/14] kmem controller for memcg.
On Thu, 18 Oct 2012 20:51:05 +0400 Glauber Costa glom...@parallels.com wrote:

> On 10/18/2012 02:11 AM, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 14:16:37 +0400 Glauber Costa wrote:
> >> ...
> >>
> >> A general explanation of what this is all about follows:
> >>
> >> The kernel memory limitation mechanism for memcg concerns itself with
> >> disallowing potentially non-reclaimable allocations to happen in
> >> exaggerate quantities by a particular set of processes (cgroup). Those
> >> allocations could create pressure that affects the behavior of a
> >> different and unrelated set of processes.
> >>
> >> Its basic working mechanism is to annotate some allocations with the
> >> _GFP_KMEMCG flag. When this flag is set, the current process
> >> allocating will have its memcg identified and charged against. When
> >> reaching a specific limit, further allocations will be denied.
> >
> > The need to set _GFP_KMEMCG is rather unpleasing, and makes one wonder
> > why didn't it just track all allocations.
>
> This was raised as well by Peter Zijlstra during the memcg summit.

Firstly: please treat any question from a reviewer as an indication that
information was missing from the changelog or from code comments. Ideally
all such queries are addressed in later versions of the patch and
changelog.

> The answer I gave to him still stands: there is a cost associated with
> it. We believe it comes down to a trade-off situation. How much does
> tracking a particular kind of allocation help vs how much does it cost?
> The free path is especially more expensive, since it will always incur a
> page_cgroup lookup.

OK. But that is a quantitative argument, without any quantities! Do we
have even an estimate of what this cost will be? Perhaps it's the case
that, if well implemented, that cost will be acceptable. How do we tell?

> > Does this mean that over time we can expect more sites to get the
> > _GFP_KMEMCG tagging?
>
> We have been doing kernel memory limitation for OpenVZ for a long time,
> using a quite different mechanism. What we do in this work (with slab
> included) allows us to achieve feature parity with that. It means it is
> good enough for production environments.

That's really good info.

> Whether or not more people will want other allocations to be tracked, I
> can't predict. What I can say is that stack + slab is a very significant
> part of the memory one potentially cares about, and if anyone else ever
> has the need for more, it will come down to a trade-off calculation.

OK.

> > If so, are there any special implications, or do we just go in, do the
> > one-line patch and expect everything to work?
>
> With the infrastructure in place, it shouldn't be hard. But it's not
> necessarily a one-liner either. It depends on what the practical
> considerations are for having that specific kind of allocation tied to a
> memcg. The slab, for instance, that follows this series, is far away
> from a one-liner: it is in fact a 19-patch series.
>
> > And how *accurate* is the proposed code? What percentage of kernel
> > memory allocations are unaccounted, typical case and worst case?
>
> With both patchsets applied, all memory used for the stack and most of
> the memory used for slab objects allocated in userspace process contexts
> are accounted. I honestly don't know what percentage of the total kernel
> memory this represents.

It sounds like the coverage will be good. What's left over? Random
get_free_pages() calls and interrupt-time slab allocations? I suppose
that there are situations in which network rx could consume significant
amounts of unaccounted memory?

> The accuracy for stack pages is very high: in this series, we don't move
> stack pages around when moving a task to other cgroups (for stack, it
> could be done), but other than that, all processes that pop up in a
> cgroup and stay there will have their memory accurately accounted.
>
> The slab is more complicated, and depends on the workload. It will be
> more accurate in workloads in which the level of object-sharing among
> cgroups is low. A container, for instance, is the perfect example of
> where this happens.
>
> > All sorts of questions come to mind over this decision, but it was
> > unexplained. It should be, please. A lot!
> >
> >> ...
> >>
> >> Limits lower than the user limit effectively mean there is a separate
> >> kernel memory limit that may be reached independently of the user
> >> limit. Values equal or greater than the user limit imply only that
> >> kernel memory is tracked. This provides a unified vision of maximum
> >> memory, be it kernel or user memory.
> >
> > I'm struggling to understand that text much at all. Reading the
> > Documentation/cgroups/memory.txt patch helped.
>
> Great. If you have any specific suggestions I can change that. Maybe I
> should just paste the documentation bit in here...

That's not a bad idea.
[Devel] Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
On Thu, 18 Oct 2012 13:24:47 +0400 Glauber Costa glom...@parallels.com wrote:

> On 10/18/2012 02:12 AM, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 14:16:44 +0400 Glauber Costa wrote:
> >> When a process tries to allocate a page with the __GFP_KMEMCG flag,
> >> the page allocator will call the corresponding memcg functions to
> >> validate the allocation. Tasks in the root memcg can always proceed.
> >>
> >> To avoid adding markers to the page - and a kmem flag that would
> >> necessarily follow, as much as doing page_cgroup lookups for no
> >> reason, whoever is marking its allocations with the __GFP_KMEMCG flag
> >> is responsible for telling the page allocator that this is such an
> >> allocation at free_pages() time.
> >
> > Well, why? Was that the correct decision?
>
> I don't fully understand your question. Is this the same question you
> posed in patch 0, about marking some versus marking all? If so, I
> believe I should have answered it there.

Yes, it's the same question. The one which has not yet been fully
answered ;)
[Devel] Re: [PATCH v5] slab: Ignore internal flags in cache creation
On Wed, 17 Oct 2012 15:36:51 +0400 Glauber Costa glom...@parallels.com wrote:

> Some flags are used internally by the allocators for management
> purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses
> to mark that the metadata for that cache is stored outside of the slab.
> No cache should ever pass those in as creation flags. We can just ignore
> this bit if it happens to be passed (such as when duplicating a cache in
> the kmem memcg patches).

I may be misunderstanding this, but...

If some caller to kmem_cache_create() is passing in bogus flags then
that's a bug, and it is undesirable to hide such a bug in this fashion?

> Because such flags can vary from allocator to allocator, we allow them
> to make their own decisions on that, defining SLAB_AVAILABLE_FLAGS with
> all flags that are valid at creation time. Allocators that don't have
> any specific flag requirement should define that to mean all flags.
> Common code will mask out all flags not belonging to that set.
[Devel] Re: [PATCH v5 00/14] kmem controller for memcg.
On Tue, 16 Oct 2012 14:16:37 +0400 Glauber Costa glom...@parallels.com wrote:

> ...
>
> A general explanation of what this is all about follows:
>
> The kernel memory limitation mechanism for memcg concerns itself with
> disallowing potentially non-reclaimable allocations to happen in
> exaggerate quantities by a particular set of processes (cgroup). Those
> allocations could create pressure that affects the behavior of a
> different and unrelated set of processes.
>
> Its basic working mechanism is to annotate some allocations with the
> _GFP_KMEMCG flag. When this flag is set, the current process allocating
> will have its memcg identified and charged against. When reaching a
> specific limit, further allocations will be denied.

The need to set _GFP_KMEMCG is rather unpleasing, and makes one wonder
why didn't it just track all allocations.

Does this mean that over time we can expect more sites to get the
_GFP_KMEMCG tagging? If so, are there any special implications, or do we
just go in, do the one-line patch and expect everything to work? If so,
why don't we go in and do that tagging right now?

And how *accurate* is the proposed code? What percentage of kernel memory
allocations are unaccounted, typical case and worst case?

All sorts of questions come to mind over this decision, but it was
unexplained. It should be, please. A lot!

> ...
>
> Limits lower than the user limit effectively mean there is a separate
> kernel memory limit that may be reached independently of the user limit.
> Values equal or greater than the user limit imply only that kernel
> memory is tracked. This provides a unified vision of maximum memory, be
> it kernel or user memory.

I'm struggling to understand that text much at all. Reading the
Documentation/cgroups/memory.txt patch helped.
[Devel] Re: [PATCH v5 01/14] memcg: Make it possible to use the stock for more than one page.
On Tue, 16 Oct 2012 14:16:38 +0400 Glauber Costa glom...@parallels.com wrote:

> From: Suleiman Souhlal ssouh...@freebsd.org
>
> We currently have a percpu stock cache scheme that charges one page at a
> time from memcg->res, the user counter. When the kernel memory
> controller comes into play, we'll need to charge more than that. This is
> because kernel memory allocations will also draw from the user counter,
> and can be bigger than a single page, as is the case with the stack
> (usually 2 pages) or some higher order slabs.
>
> ...
>
> -/*
> - * Try to consume stocked charge on this cpu. If success, one page is consumed
> - * from local stock and true is returned. If the stock is 0 or charges from a
> - * cgroup which is not current target, returns false. This stock will be
> - * refilled.
> +/**
> + * consume_stock: Try to consume stocked charge on this cpu.
> + * @memcg: memcg to consume from.
> + * @nr_pages: how many pages to charge.
> + *
> + * The charges will only happen if @memcg matches the current cpu's memcg
> + * stock, and at least @nr_pages are available in that stock. Failure to
> + * service an allocation will refill the stock.
> + *
> + * returns true if succesfull, false otherwise.

spello

>   */
> -static bool consume_stock(struct mem_cgroup *memcg)
> +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)

I don't believe there is a case for nr_pages < 0 here? If not then I
suggest that it would be clearer to use an unsigned type, like
memcg_stock_pcp.stock.

> ...
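Roughly, I'd expect the patched helper to end up looking like this (a
sketch, not the patch itself, and using the unsigned type suggested
above):

	static bool consume_stock(struct mem_cgroup *memcg,
				  unsigned int nr_pages)
	{
		struct memcg_stock_pcp *stock;
		bool ret = true;

		stock = &get_cpu_var(memcg_stock);
		if (memcg == stock->cached && stock->nr_pages >= nr_pages)
			stock->nr_pages -= nr_pages; /* charge from stock */
		else
			ret = false;	/* caller falls back and refills */
		put_cpu_var(memcg_stock);
		return ret;
	}

The only behavioural change versus the one-page version is the
stock->nr_pages >= nr_pages depth check.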
[Devel] Re: [PATCH v5 04/14] kmem accounting basic infrastructure
On Tue, 16 Oct 2012 14:16:41 +0400 Glauber Costa glom...@parallels.com wrote:

> This patch adds the basic infrastructure for the accounting of kernel
> memory. To control that, the following files are created:
>
>  * memory.kmem.usage_in_bytes
>  * memory.kmem.limit_in_bytes
>  * memory.kmem.failcnt

gargh. "failcnt" is not a word. Who was it who first thought that
omitting vowels from words improves anything? Sigh. That pooch is already
screwed and there's nothing we can do about it now.

>  * memory.kmem.max_usage_in_bytes
>
> They have the same meaning as their user memory counterparts. They
> reflect the state of the kmem res_counter.
>
> Per-cgroup kmem memory accounting is not enabled until a limit is set
> for the group. Once the limit is set, the accounting cannot be disabled
> for that group. This means that after the patch is applied, no
> behavioral change exists for whoever is still using memcg to control
> their memory usage, until memory.kmem.limit_in_bytes is set for the
> first time.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. An equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
>
> People who want to track kernel memory but not limit it can set this
> limit to a very high number (like RESOURCE_MAX - 1 page - that no one
> will ever hit, or equal to the user memory).
>
> ...
>
> +/* internal only representation about the status of kmem accounting. */
> +enum {
> +	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
> +};
> +
> +#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void memcg_kmem_set_active(struct mem_cgroup *memcg)
> +{
> +	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> +}
> +#endif

I don't think memcg_kmem_set_active() really needs to exist. It has a
single caller and is unlikely to get any additional callers, so just
open-code it there?
[Devel] Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
On Tue, 16 Oct 2012 14:16:43 +0400 Glauber Costa glom...@parallels.com wrote:

> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the
> __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors. The
> idea is to later on patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> Users of this functionality shall interact with the memcg core code
> through the following functions:
>
> memcg_kmem_newpage_charge: will return true if the group can handle the
>                            allocation. At this point, struct page is not
>                            yet allocated.
> memcg_kmem_commit_charge: will either revert the charge, if struct page
>                           allocation failed, or embed memcg information
>                           into page_cgroup.
> memcg_kmem_uncharge_page: called at free time, will revert the charge.
>
> ...
>
> +static __always_inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> +	if (!memcg_kmem_enabled())
> +		return true;
> +
> +	/*
> +	 * __GFP_NOFAIL allocations will move on even if charging is not
> +	 * possible. Therefore we don't even try, and have this allocation
> +	 * unaccounted. We could in theory charge it with
> +	 * res_counter_charge_nofail, but we hope those allocations are rare,
> +	 * and won't be worth the trouble.
> +	 */
> +	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> +		return true;
> +	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> +		return true;
> +
> +	/* If the test is dying, just let it go. */
> +	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> +		     fatal_signal_pending(current)))
> +		return true;
> +
> +	return __memcg_kmem_newpage_charge(gfp, memcg, order);
> +}

That's a big function! Why was it __always_inline? I'd have thought it
would be better to move the code after memcg_kmem_enabled() out of line.

Do we actually need to test PF_KTHREAD when current->mm == NULL? Perhaps
because of aio threads which temporarily adopt a userspace mm?

> +/**
> + * memcg_kmem_uncharge_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_uncharge_page(struct page *page, int order)
> +{
> +	if (memcg_kmem_enabled())
> +		__memcg_kmem_uncharge_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_charge: embeds correct memcg in a page
> + * @page: pointer to struct page recently allocated
> + * @memcg: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @memcg to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
> +{
> +	if (memcg_kmem_enabled() && memcg)
> +		__memcg_kmem_commit_charge(page, memcg, order);
> +}

I suspect the __always_inline's here are to do with static branch
trickery. A code comment is warranted if so?
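Presumably the static branch trickery in question is something like this
(a sketch of the pattern, assuming the jump-label API of that era):

	#include <linux/jump_label.h>

	extern struct static_key memcg_kmem_enabled_key;

	static inline bool memcg_kmem_enabled(void)
	{
		/*
		 * Compiles to a no-op branch until the first kmem limit
		 * is set, at which point static_key_slow_inc() patches
		 * the branch in.  Keeping the callers __always_inline
		 * keeps the disabled-case fast path at a single patched
		 * instruction, with no function call at all.
		 */
		return static_key_false(&memcg_kmem_enabled_key);
	}

Which, if accurate, would indeed be worth a comment at the call sites.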
[Devel] Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
On Tue, 16 Oct 2012 14:16:44 +0400 Glauber Costa glom...@parallels.com wrote: When a process tries to allocate a page with the __GFP_KMEMCG flag, the page allocator will call the corresponding memcg functions to validate the allocation. Tasks in the root memcg can always proceed. To avoid adding markers to the page - and a kmem flag that would necessarily follow - as well as doing page_cgroup lookups for no reason, whoever marks its allocations with the __GFP_KMEMCG flag is responsible for telling the page allocator that this is such an allocation at free_pages() time. Well, why? Was that the correct decision? This is done by the invocation of __free_accounted_pages() and free_accounted_pages(). These are very general-sounding names. I'd expect the identifiers to contain memcg and/or kmem, to identify what's going on. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
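The caller-side contract being questioned pairs the flag with the new free functions; a minimal sketch (the function names come from the patch, the usage itself is invented for illustration):

    /* allocation marked for kmemcg accounting ... */
    unsigned long addr = __get_free_pages(GFP_KERNEL | __GFP_KMEMCG, order);

    /* ... must be released through the accounted variant, not free_pages() */
    free_accounted_pages(addr, order);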
[Devel] Re: [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed.
On Tue, 16 Oct 2012 14:16:48 +0400 Glauber Costa glom...@parallels.com wrote: Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, It is? For a major patchset such as this, it's pretty important to discuss such long-term plans in the top-level discussion. Covering things such as expected complexity, expected performance hit, how these plans affected the current implementation, etc. The main reason for this is that if the future plans appear to be of doubtful feasibility and the current implementation isn't sufficiently useful without the future stuff, we shouldn't merge the current implementation. It's a big issue! we can't guarantee that we'll always be able to point a page to a particular process, and migrate the charges along with it - since in the common case, a page will contain data belonging to multiple processes. Because of that, when we destroy a memcg, we only make sure the destruction will succeed by discounting the kmem charges from the user charges when we try to empty the cgroup. ... ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
On Tue, 16 Oct 2012 14:16:50 +0400 Glauber Costa glom...@parallels.com wrote: @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { - struct page *page = alloc_pages_node(node, THREADINFO_GFP, + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED, THREAD_SIZE_ORDER); yay, we actually used all this code for something ;) I don't think we really saw a comprehensive list of what else the kmem controller will be used for, but I believe that all other envisaged applications will require slab accounting, yes? So it appears that all we have at present is a yet-another-fork-bomb-preventer, but one which requires that the culprit be in a container? That's reasonable, given your hosted-environment scenario. It's unclear (to me) that we should merge all this code for only this feature. Again, it would be good to have a clear listing of and plan for other applications of this code. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 14/14] Add documentation about the kmem controller
On Tue, 16 Oct 2012 14:16:51 +0400 Glauber Costa glom...@parallels.com wrote: +Kernel memory won't be accounted at all until a limit on a group is set. This +allows for existing setups to continue working without disruption. The limit +cannot be set if the cgroup has children, or if there are already tasks in the +cgroup. What behaviour will userspace see if The limit cannot be set? write() returns -EINVAL, something like that? ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
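For concreteness, the failure being discussed would surface to userspace roughly like this (the path is the conventional cgroupfs mount; the errno shown is an assumption - which one it actually is, is precisely the question above):

    # the cgroup already contains tasks, so the write must fail
    $ echo 100M > /sys/fs/cgroup/memory/grp/memory.kmem.limit_in_bytes
    sh: write error: Device or resource busy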
[Devel] Re: [PATCH] proc: check vma->vm_file before dereferencing
On Mon, 15 Oct 2012 19:30:03 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote: It can be equal to NULL. Please write better changelogs, so people do not have to ask questions such as: - Under what conditions does this bug trigger? - In which kernel version(s)? - Is it a post-3.6 regression? Thanks. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH] proc: check vma->vm_file before dereferencing
On Tue, 16 Oct 2012 01:52:30 +0400 Cyrill Gorcunov gorcu...@openvz.org wrote: On Mon, Oct 15, 2012 at 02:40:48PM -0700, Andrew Morton wrote: On Mon, 15 Oct 2012 19:30:03 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote: It can be equal to NULL. Please write better changelogs, so people do not have to ask questions such as: - Under what conditions does this bug trigger? - In which kernel version(s)? - Is it a post-3.6 regression? Andrew, would the following changelog be enough? Commit 7b540d0646ce122f0ba4520412be91e530719742 switched proc_map_files_readdir to use @f_mode directly instead of grabbing a @file reference, but at the same time the test for @vm_file presence was lost, leading to a nil dereference. The patch brings the test back. The whole proc_map_files feature is wrapped in CONFIG_CHECKPOINT_RESTORE (which is set to 'n' by default), so the bug doesn't affect regular kernels. The regression is 3.7-rc1 only as far as I can tell. Ah, I see, great, thanks. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
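The restored test itself is a one-liner; its shape, assuming the surrounding proc_map_files_readdir() vma loop (which is not quoted in the thread):

    /* anonymous mappings have no backing file - skip, don't dereference */
    if (!vma->vm_file)
            continue;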
[Devel] Re: [PATCH 06/11] memcg: kmem controller infrastructure
On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 03:17 AM, Andrew Morton wrote: + memcg_uncharge_kmem(memcg, size); + mem_cgroup_put(memcg); +} +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page); #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s) __setup("swapaccount=", enable_swap_account); #endif + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM gargh. CONFIG_MEMCG_KMEM, please! Here too. I like it as much as you do. But that is consistent with the rest of the file, and I'd rather have it this way. There's not much point in being consistent with something which is so unpleasant. I'm on a little campaign to rename CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my bait yet. Be first! ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 06/11] memcg: kmem controller infrastructure
On Tue, 26 Jun 2012 22:14:51 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 10:01 PM, Andrew Morton wrote: On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 03:17 AM, Andrew Morton wrote: +memcg_uncharge_kmem(memcg, size); + mem_cgroup_put(memcg); +} +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page); #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s) __setup("swapaccount=", enable_swap_account); #endif + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM gargh. CONFIG_MEMCG_KMEM, please! Here too. I like it as much as you do. But that is consistent with the rest of the file, and I'd rather have it this way. There's not much point in being consistent with something which is so unpleasant. I'm on a little campaign to rename CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my bait yet. Be first! If you are okay with a preparation mechanical patch to convert the whole file, I can change mine too. But you'll be responsible for arguing with whoever steps up opposing this =p From: Andrew Morton a...@linux-foundation.org Subject: memcg: rename config variables Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM Cc: Glauber Costa glom...@parallels.com Cc: Michal Hocko mho...@suse.cz Cc: Johannes Weiner han...@cmpxchg.org Cc: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: Hugh Dickins hu...@google.com Cc: Tejun Heo t...@kernel.org Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Cc: David Rientjes rient...@google.com Cc: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/cgroup_subsys.h | 2 +- include/linux/memcontrol.h | 14 +++--- include/linux/mmzone.h | 8 include/linux/page_cgroup.h | 10 +- include/linux/sched.h | 2 +- init/Kconfig | 14 +++--- kernel/fork.c | 2 +- mm/hwpoison-inject.c | 2 +- mm/memcontrol.c | 20 ++-- mm/memory-failure.c | 2 +- mm/mmzone.c | 2 +- mm/oom_kill.c | 2 +- mm/page_cgroup.c | 2 +- mm/vmscan.c | 4 ++-- 14 files changed, 43 insertions(+), 43 deletions(-) diff -puN kernel/fork.c~a kernel/fork.c --- a/kernel/fork.c~a +++ a/kernel/fork.c @@ -1302,7 +1302,7 @@ static struct task_struct *copy_process( #ifdef CONFIG_DEBUG_MUTEXES p->blocked_on = NULL; /* not blocked yet */ #endif -#ifdef CONFIG_CGROUP_MEM_RES_CTLR +#ifdef CONFIG_MEMCG p->memcg_batch.do_batch = 0; p->memcg_batch.memcg = NULL; #endif diff -puN mm/hwpoison-inject.c~a mm/hwpoison-inject.c --- a/mm/hwpoison-inject.c~a +++ a/mm/hwpoison-inject.c @@ -123,7 +123,7 @@ static int pfn_inject_init(void) if (!dentry) goto fail; -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP +#ifdef CONFIG_MEMCG_SWAP dentry = debugfs_create_u64("corrupt-filter-memcg", 0600, hwpoison_dir, &hwpoison_filter_memcg); if (!dentry) diff -puN mm/memcontrol.c~a mm/memcontrol.c --- a/mm/memcontrol.c~a +++ a/mm/memcontrol.c @@ -61,12 +61,12 @@ struct cgroup_subsys mem_cgroup_subsys _ #define MEM_CGROUP_RECLAIM_RETRIES 5 static struct mem_cgroup *root_mem_cgroup __read_mostly; -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP +#ifdef CONFIG_MEMCG_SWAP /* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */ int do_swap_account __read_mostly; /* for remember boot option*/ -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED +#ifdef
CONFIG_MEMCG_SWAP_ENABLED static int really_do_swap_account __initdata = 1; #else static int really_do_swap_account __initdata = 0; @@ -407,7 +407,7 @@ static void mem_cgroup_get(struct mem_cg static void mem_cgroup_put(struct mem_cgroup *memcg); /* Writing them here to avoid exposing memcg's inner layout */ -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +#ifdef CONFIG_MEMCG_KMEM #include <net/sock.h> #include <net/ip.h> @@ -466,9 +466,9 @@ struct cg_proto *tcp_proto_cgroup(struct } EXPORT_SYMBOL(tcp_proto_cgroup); #endif /* CONFIG_INET */ -#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ +#endif /* CONFIG_MEMCG_KMEM */ -#if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) static void disarm_sock_keys(struct mem_cgroup *memcg) { if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto)) @@ -3085,7 +3085,7 @@ mem_cgroup_uncharge_swapcache(struct pag } #endif -#ifdef
[Devel] Re: [PATCH 00/11] kmem controller for memcg: stripped down version
On Tue, 26 Jun 2012 11:17:49 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 03:27 AM, Andrew Morton wrote: On Mon, 25 Jun 2012 18:15:17 +0400 Glauber Costa glom...@parallels.com wrote: What I am proposing with this series is a stripped down version of the kmem controller for memcg that would allow us to merge significant parts of the infrastructure, while leaving out, for now, the polemic bits about the slab while it is being reworked by Christoph. My reasoning for that is that after the last change to introduce a gfp flag to mark kernel allocations, it became clear to me that tracking other resources like the stack would then follow extremely naturally. I figured that at some point we'd have to solve the issue pointed out by David, and avoid testing the Slab flag in the page allocator, since it would soon be made more generic. I do that by having the callers explicitly mark it. So to demonstrate how it would work, I am introducing a stack tracker here, which is already a functionality per se: it successfully stops fork bombs from happening. (Sorry for doing all your work, Frederic =p ). Note that after all the memcg infrastructure is deployed, it becomes very easy to track anything. The last patch of this series is extremely simple. The infrastructure is exactly the same we had in memcg, but stripped down of the slab parts. And because what we have after those patches is a feature per se, I think it could be considered for merging. hm. None of this new code makes the kernel smaller, faster, easier to understand or more fun to read! Not sure if this is a general comment - in which case I agree - or if targeted at my statement that this is stripped down. If so, it is of course smaller relative to my previous slab accounting patches. It's a general comment. The patch adds overhead: runtime costs and maintenance costs. Do its benefits justify that cost? The infrastructure is largely common, but I realized that a future user, tracking the stack, would be a lot simpler and could be done first. Presumably we're getting some benefit for all the downside. When the time is appropriate, please do put some time into explaining that benefit, so that others can agree that it is a worthwhile tradeoff. Well, for one thing, we stop fork bombs for processes inside cgroups. inside cgroups is a significant limitation! Is this capability important enough to justify adding the new code? That's unobvious (to me). Are there any other user-facing things which we can do with this feature? Present, or planned? I can't speak for everybody here, but AFAIK, tracking the stack through the memory it uses, therefore using my proposed kmem controller, was an idea that got quite a bit of traction with the memcg/memory people. So here you have something that people already asked a lot for, in a shape and interface that seem to be acceptable. mm, maybe. Kernel developers tend to look at code from the point of view does it work as designed, is it clean, is it efficient, do I understand it, etc. We often forget to step back and really consider whether or not it should be merged at all. I mean, unless the code is an explicit simplification, we should have a very strong bias towards don't merge. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 06/11] memcg: kmem controller infrastructure
On Mon, 25 Jun 2012 18:15:23 +0400 Glauber Costa glom...@parallels.com wrote: This patch introduces infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the __GFP_KMEMCG flag, and the task belongs to a memcg other than the root. In memcontrol.h those functions are wrapped in inline accessors. The idea is to later on patch those with jump labels, so we don't incur any overhead when no mem cgroups are being used. ... @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk) static inline void sock_release_memcg(struct sock *sk) { } + +#define mem_cgroup_kmem_on 0 +#define __mem_cgroup_new_kmem_page(a, b, c) false +#define __mem_cgroup_free_kmem_page(a, b) +#define __mem_cgroup_commit_kmem_page(a, b, c) I suggest that the naming consistently follow the model mem_cgroup_kmem_foo. So mem_cgroup_kmem_ becomes the well-known identifier for this subsystem. Then, s/mem_cgroup/memcg/g - show us some mercy here! +#define is_kmem_tracked_alloc (false) memcg_kmem_tracked_alloc, perhaps. But what does this actually do? looks eww, ick. +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG) What Tejun said. This: /* * Nice comment goes here */ static inline bool memcg_kmem_tracked_alloc(gfp_t gfp) { return gfp & __GFP_KMEMCG; } #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ + +static __always_inline +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order) memcg_kmem_new_page(). +{ + if (!mem_cgroup_kmem_on) + return true; + if (!is_kmem_tracked_alloc) + return true; + if (!current->mm) + return true; + if (in_interrupt()) + return true; + if (gfp & __GFP_NOFAIL) + return true; + return __mem_cgroup_new_kmem_page(gfp, handle, order); +} Add documentation, please. The semantics of the return value are inscrutable. +static __always_inline +void mem_cgroup_free_kmem_page(struct page *page, int order) +{ + if (mem_cgroup_kmem_on) + __mem_cgroup_free_kmem_page(page, order); +} memcg_kmem_free_page(). +static __always_inline +void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle, + int order) +{ + if (mem_cgroup_kmem_on) + __mem_cgroup_commit_kmem_page(page, handle, order); +} memcg_kmem_commit_page(). #endif /* _LINUX_MEMCONTROL_H */ ... +static inline bool mem_cgroup_kmem_enabled(struct mem_cgroup *memcg) +{ + return !mem_cgroup_disabled() && memcg && +!mem_cgroup_is_root(memcg) && memcg->kmem_accounted; +} Does this really need to handle a memcg==NULL? +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *_handle, int order) +{ + struct mem_cgroup *memcg; + struct mem_cgroup **handle = (struct mem_cgroup **)_handle; + bool ret = true; + size_t size; + struct task_struct *p; + + *handle = NULL; + rcu_read_lock(); + p = rcu_dereference(current->mm->owner); + memcg = mem_cgroup_from_task(p); + if (!mem_cgroup_kmem_enabled(memcg)) + goto out; + + mem_cgroup_get(memcg); + + size = (1 << order) << PAGE_SHIFT; size = PAGE_SIZE << order; is simpler and more typical. + ret = memcg_charge_kmem(memcg, gfp, size) == 0; Odd. memcg_charge_kmem() returns a nice errno, but this conversion just drops that information on the floor. If the mem_cgroup_new_kmem_page() return value had been documented, I might have been able to understand the thinking here. But it wasn't, so I couldn't.
+ if (!ret) { + mem_cgroup_put(memcg); + goto out; + } + + *handle = memcg; +out: + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(__mem_cgroup_new_kmem_page); + +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order) +{ + struct page_cgroup *pc; + struct mem_cgroup *memcg = handle; + size_t size; + + if (!memcg) + return; + + WARN_ON(mem_cgroup_is_root(memcg)); + /* The page allocation must have failed. Revert */ + if (!page) { + size = (1 << order) << PAGE_SHIFT; PAGE_SIZE << order + memcg_uncharge_kmem(memcg, size); + mem_cgroup_put(memcg); + return; + } This all looks a bit odd. After a failure you run a commit which undoes the speculative charging. I guess it makes sense. It's the sort of thing which can be expounded upon in the documentation which isn't there, sigh. + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + pc->mem_cgroup = memcg; + SetPageCgroupUsed(pc); + unlock_page_cgroup(pc); +} missed a newline there. +void __mem_cgroup_free_kmem_page(struct page *page, int order) +{ + struct mem_cgroup *memcg; + size_t size; + struct page_cgroup *pc; +
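For the return-value documentation being requested, one possible wording (a sketch by way of illustration, not text from a later revision of the patch):

    /**
     * mem_cgroup_new_kmem_page - attempt to charge a new kernel page to a memcg
     * @gfp:    the gfp allocation flags
     * @handle: out parameter; the memcg to pass to the commit function
     * @order:  allocation order
     *
     * Returns true if the caller may proceed with the allocation: either the
     * charge succeeded, or this allocation is simply not accounted.  Returns
     * false if the memcg is over its kmem limit and the allocation must fail.
     */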
[Devel] Re: [PATCH 09/11] memcg: propagate kmem limiting information to children
On Tue, 26 Jun 2012 02:36:27 +0400 Glauber Costa glom...@parallels.com wrote: On 06/25/2012 10:29 PM, Tejun Heo wrote: Feeling like a nit pervert but.. On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote: @@ -287,7 +287,11 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; - bool kmem_accounted; + /* + * bit0: accounted by this cgroup + * bit1: accounted by a parent. + */ + volatile unsigned long kmem_accounted; Is the volatile declaration really necessary? Why is it necessary? Why no comment explaining it? Seems to be required by set_bit and friends. gcc will complain if it is not volatile (take a look at the bit function headers) That would be a broken gcc. We run test_bit()/set_bit() and friends against plain old `unsigned long' in thousands of places. There's nothing special about this one! ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 09/11] memcg: propagate kmem limiting information to children
On Mon, 25 Jun 2012 18:15:26 +0400 Glauber Costa glom...@parallels.com wrote: --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -287,7 +287,11 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; - bool kmem_accounted; + /* + * bit0: accounted by this cgroup + * bit1: accounted by a parent. + */ + volatile unsigned long kmem_accounted; I suggest unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; atomic_t under_oom; @@ -340,6 +344,9 @@ struct mem_cgroup { #endif }; +#define KMEM_ACCOUNTED_THIS 0 +#define KMEM_ACCOUNTED_PARENT 1 And then document the fields here. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 00/11] kmem controller for memcg: stripped down version
On Mon, 25 Jun 2012 18:15:17 +0400 Glauber Costa glom...@parallels.com wrote: What I am proposing with this series is a stripped down version of the kmem controller for memcg that would allow us to merge significant parts of the infrastructure, while leaving out, for now, the polemic bits about the slab while it is being reworked by Christoph. My reasoning for that is that after the last change to introduce a gfp flag to mark kernel allocations, it became clear to me that tracking other resources like the stack would then follow extremely naturally. I figured that at some point we'd have to solve the issue pointed out by David, and avoid testing the Slab flag in the page allocator, since it would soon be made more generic. I do that by having the callers explicitly mark it. So to demonstrate how it would work, I am introducing a stack tracker here, which is already a functionality per se: it successfully stops fork bombs from happening. (Sorry for doing all your work, Frederic =p ). Note that after all the memcg infrastructure is deployed, it becomes very easy to track anything. The last patch of this series is extremely simple. The infrastructure is exactly the same we had in memcg, but stripped down of the slab parts. And because what we have after those patches is a feature per se, I think it could be considered for merging. hm. None of this new code makes the kernel smaller, faster, easier to understand or more fun to read! Presumably we're getting some benefit for all the downside. When the time is appropriate, please do put some time into explaining that benefit, so that others can agree that it is a worthwhile tradeoff. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 09/11] memcg: propagate kmem limiting information to children
On Mon, 25 Jun 2012 22:24:44 -0700 (PDT) David Rientjes rient...@google.com wrote: +#define KMEM_ACCOUNTED_THIS 0 +#define KMEM_ACCOUNTED_PARENT 1 And then document the fields here. In hex, please? Well, they're bit numbers, not masks. Decimal 0-31 is OK, or an enum. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
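Folding both review comments together, the definitions might end up as (a sketch, not from a later revision):

    /* bit numbers in mem_cgroup.kmem_accounted */
    enum {
            KMEM_ACCOUNTED_THIS,    /* this cgroup itself had its kmem limit set */
            KMEM_ACCOUNTED_PARENT,  /* accounting inherited from a limited parent */
    };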
[Devel] Re: [PATCH v6 2/2] decrement static keys on real destroy time
On Wed, 23 May 2012 13:16:36 +0400 Glauber Costa glom...@parallels.com wrote: On 05/23/2012 02:46 AM, Andrew Morton wrote: Here, we're open-coding kinda-test_bit(). Why do that? These flags are modified with set_bit() and friends, so we should read them with the matching test_bit()? My reasoning was to be as cheap as possible, as you noted yourself two paragraphs below. These aren't on any fast path, are they? Plus: you failed in that objective! The C compiler's internal scalar-bool conversion makes these functions no more efficient than test_bit(). So here are suggested changes from *some* of the above discussion. Please consider, incorporate, retest and send us a v7? How do you want me to do it? Should I add your patch on top of mine, and then another one that tweaks whatever else is left, or should I just merge those changes into the patches I have? A brand new patch, I guess. I can sort out the what-did-he-change view at this end. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v6 2/2] decrement static keys on real destroy time
(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) + static_key_slow_inc(&memcg_socket_limit_enabled); + set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); + } So here are suggested changes from *some* of the above discussion. Please consider, incorporate, retest and send us a v7? From: Andrew Morton a...@linux-foundation.org Subject: memcg-decrement-static-keys-at-real-destroy-time-v6-fix - move enum sock_flag_bits into sock.h - document enum sock_flag_bits - convert memcg_proto_active() and memcg_proto_activated() to test_bit() - redo tcp_update_limit() comment to 80 cols Cc: Glauber Costa glom...@parallels.com Cc: Johannes Weiner han...@cmpxchg.org Cc: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: Li Zefan lize...@huawei.com Cc: Michal Hocko mho...@suse.cz Cc: Tejun Heo t...@kernel.org Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/memcontrol.h | 5 - include/net/sock.h | 15 +-- net/ipv4/tcp_memcontrol.c | 30 +++--- 3 files changed, 28 insertions(+), 22 deletions(-) diff -puN include/linux/memcontrol.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix include/linux/memcontrol.h --- a/include/linux/memcontrol.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix +++ a/include/linux/memcontrol.h @@ -405,11 +405,6 @@ enum { OVER_LIMIT, }; -enum sock_flag_bits { - MEMCG_SOCK_ACTIVE, - MEMCG_SOCK_ACTIVATED, -}; - struct sock; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM void sock_update_memcg(struct sock *sk); diff -puN include/net/sock.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix include/net/sock.h --- a/include/net/sock.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix +++ a/include/net/sock.h @@ -46,6 +46,7 @@ #include <linux/list_nulls.h> #include <linux/timer.h> #include <linux/cache.h> +#include <linux/bitops.h> #include <linux/lockdep.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ @@ -921,6 +922,16 @@ struct proto { #endif }; +/* + * Bits in struct cg_proto.flags + */ +enum sock_flag_bits { + /* Currently active and new sockets should be assigned to cgroups */ + MEMCG_SOCK_ACTIVE, + /* It was ever activated; we must disarm static keys on destruction */ + MEMCG_SOCK_ACTIVATED, +}; + struct cg_proto { void (*enter_memory_pressure)(struct sock *sk); struct res_counter *memory_allocated; /* Current allocated memory. */ @@ -945,12 +956,12 @@ extern void proto_unregister(struct prot static inline bool memcg_proto_active(struct cg_proto *cg_proto) { - return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE); + return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); } static inline bool memcg_proto_activated(struct cg_proto *cg_proto) { - return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED); + return test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags); } #ifdef SOCK_REFCNT_DEBUG diff -puN net/ipv4/tcp_memcontrol.c~memcg-decrement-static-keys-at-real-destroy-time-v6-fix net/ipv4/tcp_memcontrol.c --- a/net/ipv4/tcp_memcontrol.c~memcg-decrement-static-keys-at-real-destroy-time-v6-fix +++ a/net/ipv4/tcp_memcontrol.c @@ -108,24 +108,24 @@ static int tcp_update_limit(struct mem_c clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); else if (val != RESOURCE_MAX) { /* -* The active bit needs to be written after the static_key update. -* This is what guarantees that the socket activation function -* is the last one to run. See sock_update_memcg() for details, -* and note that we don't mark any socket as belonging to this -* memcg until that flag is up. +* The active bit needs to be written after the static_key +* update.
This is what guarantees that the socket activation +* function is the last one to run. See sock_update_memcg() for +* details, and note that we don't mark any socket as belonging +* to this memcg until that flag is up. * -* We need to do this, because static_keys will span multiple -* sites, but we can't control their order. If we mark a socket -* as accounted, but the accounting functions are not patched in -* yet, we'll lose accounting. +* We need to do this, because static_keys will span multiple +* sites, but we can't control their order. If we mark a socket +* as accounted, but the accounting functions are not patched in +* yet, we'll lose accounting. * -* We never race with the readers in sock_update_memcg(), because -* when this value change, the code to process it is not patched in -* yet
[Devel] Re: [PATCH v6 2/2] decrement static keys on real destroy time
On Tue, 22 May 2012 15:46:10 -0700 Andrew Morton a...@linux-foundation.org wrote: +static inline bool memcg_proto_active(struct cg_proto *cg_proto) +{ + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE); +} + +static inline bool memcg_proto_activated(struct cg_proto *cg_proto) +{ + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED); +} Here, we're open-coding kinda-test_bit(). Why do that? These flags are modified with set_bit() and friends, so we should read them with the matching test_bit()? Also, these bool-returning functions will return values other than 0 and 1. That probably works OK and I don't know what the C standards and implementations do about this. But it seems unclean and slightly risky to have a bool value of 32! Converting these functions to use test_bit() fixes this - test_bit() returns only 0 or 1. test_bit() is slightly more expensive than the above. If this is considered to be an issue then I guess we could continue to use this approach. But I do think a code comment is needed, explaining and justifying the unusual decision to bypass the bitops API. Also these functions should tell the truth and return an int type. Joe corrected (and informed) me: : 6.3.1.2p1: : : When any scalar value is converted to _Bool, the result is 0 : if the value compares equal to 0; otherwise, the result is 1. So the above functions will be given compiler-generated scalar-to-boolean conversion. test_bit() already does internal scalar-to-boolean conversion. The compiler doesn't know that, so if we convert the above functions to use test_bit(), we'll end up performing scalar-to-boolean-to-boolean conversion, which is dumb. I assume that a way of fixing this is to change test_bit() to return bool type. That's a bit scary. A less scary way would be to add a new bool test_bit_bool(int nr, const unsigned long *addr); which internally calls test_bit() but somehow avoids the compiler-generated conversion of the test_bit() return value into a bool. I haven't actually thought of a way of doing this ;) ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Thu, 17 May 2012 13:52:13 +0400 Glauber Costa glom...@parallels.com wrote: Andrew is right. It seems we will need that mutex after all. Just this is not a race, and neither something that should belong in the static_branch interface. Well, a mutex is one way. Or you could do something like if (!test_and_set_bit(CGPROTO_ACTIVATED, &cg_proto->flags)) static_key_slow_inc(&memcg_socket_limit_enabled); ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
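The suggestion works because test_and_set_bit() is a single atomic read-modify-write: of any number of racing callers, exactly one observes the bit as previously clear, so the static key is bumped once and only once. In context (CGPROTO_ACTIVATED is the placeholder name from the mail above):

    /* only the winner of the atomic RMW bumps the static key */
    if (!test_and_set_bit(CGPROTO_ACTIVATED, &cg_proto->flags))
            static_key_slow_inc(&memcg_socket_limit_enabled);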
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Fri, 11 May 2012 17:11:17 -0300 Glauber Costa glom...@parallels.com wrote: We call the destroy function when a cgroup starts to be removed, such as by a rmdir event. However, because of our reference counters, some objects are still inflight. Right now, we are decrementing the static_keys at destroy() time, meaning that if we get rid of the last static_key reference, some objects will still have charges, but the code to properly uncharge them won't be run. This becomes a problem especially if it is ever enabled again, because now new charges will be added to the stale charges, making it pretty much impossible to keep track of them. We just need to be careful with the static branch activation: since there is no particular preferred order of their activation, we need to make sure that we only start using it after all call sites are active. This is achieved by having a per-memcg flag that is only updated after static_key_slow_inc() returns. At this time, we are sure all sites are active. This is made per-memcg, not global, for a reason: it also has the effect of making socket accounting more consistent. The first memcg to be limited will trigger static_key() activation, therefore, accounting. But all the others will then be accounted no matter what. After this patch, only limited memcgs will have their sockets accounted. ... @@ -107,10 +104,31 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT, net->ipv4.sysctl_tcp_mem[i]); - if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX) - static_key_slow_dec(&memcg_socket_limit_enabled); - else if (old_lim == RESOURCE_MAX && val != RESOURCE_MAX) - static_key_slow_inc(&memcg_socket_limit_enabled); + if (val == RESOURCE_MAX) + cg_proto->active = false; + else if (val != RESOURCE_MAX) { + /* + * ->activated needs to be written after the static_key update. + * This is what guarantees that the socket activation function + * is the last one to run. See sock_update_memcg() for details, + * and note that we don't mark any socket as belonging to this + * memcg until that flag is up. + * + * We need to do this, because static_keys will span multiple + * sites, but we can't control their order. If we mark a socket + * as accounted, but the accounting functions are not patched in + * yet, we'll lose accounting. + * + * We never race with the readers in sock_update_memcg(), because + * when this value change, the code to process it is not patched in + * yet. + */ + if (!cg_proto->activated) { + static_key_slow_inc(&memcg_socket_limit_enabled); + cg_proto->activated = true; + } If two threads run this code concurrently, they can both see cg_proto->activated==false and they will both run static_key_slow_inc(). Hopefully there's some locking somewhere which prevents this, but it is unobvious. We should comment this, probably at the cg_proto.activated definition site. Or we should fix the bug ;) + cg_proto->active = true; + } return 0; } ... ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Fri, 11 May 2012 17:11:17 -0300 Glauber Costa glom...@parallels.com wrote: We call the destroy function when a cgroup starts to be removed, such as by a rmdir event. However, because of our reference counters, some objects are still inflight. Right now, we are decrementing the static_keys at destroy() time, meaning that if we get rid of the last static_key reference, some objects will still have charges, but the code to properly uncharge them won't be run. This becomes a problem especially if it is ever enabled again, because now new charges will be added to the stale charges, making it pretty much impossible to keep track of them. We just need to be careful with the static branch activation: since there is no particular preferred order of their activation, we need to make sure that we only start using it after all call sites are active. This is achieved by having a per-memcg flag that is only updated after static_key_slow_inc() returns. At this time, we are sure all sites are active. This is made per-memcg, not global, for a reason: it also has the effect of making socket accounting more consistent. The first memcg to be limited will trigger static_key() activation, therefore, accounting. But all the others will then be accounted no matter what. After this patch, only limited memcgs will have their sockets accounted. So I'm scratching my head over what the actual bug is, and how important it is. AFAICT it will cause charging stats to exhibit some inaccuracy when memcg's are being torn down? I don't know how serious this is in the real world and so can't decide which kernel version(s) we should fix. When fixing bugs, please always fully describe the bug's end-user impact, so that I and others can make these sorts of decisions. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Thu, 17 May 2012 07:06:52 +0400 Glauber Costa glom...@parallels.com wrote: ... + else if (val != RESOURCE_MAX) { + /* + * ->activated needs to be written after the static_key update. + * This is what guarantees that the socket activation function + * is the last one to run. See sock_update_memcg() for details, + * and note that we don't mark any socket as belonging to this + * memcg until that flag is up. + * + * We need to do this, because static_keys will span multiple + * sites, but we can't control their order. If we mark a socket + * as accounted, but the accounting functions are not patched in + * yet, we'll lose accounting. + * + * We never race with the readers in sock_update_memcg(), because + * when this value change, the code to process it is not patched in + * yet. + */ + if (!cg_proto->activated) { + static_key_slow_inc(&memcg_socket_limit_enabled); + cg_proto->activated = true; + } If two threads run this code concurrently, they can both see cg_proto->activated==false and they will both run static_key_slow_inc(). Hopefully there's some locking somewhere which prevents this, but it is unobvious. We should comment this, probably at the cg_proto.activated definition site. Or we should fix the bug ;) If that happens, locking in static_key_slow_inc will prevent any damage. My previous version had explicit code to prevent that, but it was pointed out to us that this is already part of the static_key expectations, so that was dropped. This makes no sense. If two threads run that code concurrently, key->enabled gets incremented twice. Nobody anywhere has a record that this happened so it cannot be undone. key->enabled is now in an unknown state. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
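Spelled out, the interleaving being described (key is the static key behind memcg_socket_limit_enabled):

    /*
     * T0: reads cg_proto->activated == false
     * T1: reads cg_proto->activated == false
     * T0: static_key_slow_inc()  ->  key->enabled becomes 1
     * T1: static_key_slow_inc()  ->  key->enabled becomes 2
     * T0, T1: cg_proto->activated = true
     *
     * Nothing records that the increment happened twice, so one of the
     * two counts can never be paired with a static_key_slow_dec().
     */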
[Devel] Re: [PATCH] remove BUG() in possible but rare condition
On Wed, 11 Apr 2012 15:10:24 -0300 Glauber Costa glom...@parallels.com wrote: While stressing the kernel with failing allocations today, I hit the following chain of events: alloc_page_buffers(): bh = alloc_buffer_head(GFP_NOFS); if (!bh) goto no_grow; <= path taken grow_dev_page(): bh = alloc_page_buffers(page, size, 0); if (!bh) goto failed; <= taken, consequence of the above and then the failed path BUG()s the kernel. The failure is inserted a little bit artificially, but even then, I see no reason why it should be deemed impossible in a real box. Even though this is not a condition that we expect to see around every time, failed allocations are expected to be handled, and BUG() sounds just too much. As a matter of fact, grow_dev_page() can return NULL just fine in other circumstances, so I propose we just remove it, then. Signed-off-by: Glauber Costa glom...@parallels.com CC: Linus Torvalds torva...@linux-foundation.org CC: Andrew Morton a...@linux-foundation.org --- fs/buffer.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 36d6665..351e18e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -985,7 +985,6 @@ grow_dev_page(struct block_device *bdev, sector_t block, return page; failed: - BUG(); unlock_page(page); page_cache_release(page); return NULL; Cute. AFAICT what happened was that in my April 2002 rewrite of this code I put a non-fatal buffer_error() warning in that case to tell us that something bad happened. Years later we removed the temporary buffer_error() and mistakenly replaced that warning with a BUG(). Only it *can* happen. We can remove the BUG() and fix up callers, or we can pass retry=1 into alloc_page_buffers(), so grow_dev_page() cannot fail. Immortal functions are a silly fiction, so we should remove the BUG() and fix up callers. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH] remove BUG() in possible but rare condition
On Wed, 11 Apr 2012 17:51:57 -0300 Glauber Costa glom...@parallels.com wrote: On 04/11/2012 05:26 PM, Andrew Morton wrote: failed: - BUG(); unlock_page(page); page_cache_release(page); return NULL; Cute. AFAICT what happened was that in my April 2002 rewrite of this code I put a non-fatal buffer_error() warning in that case to tell us that something bad happened. Years later we removed the temporary buffer_error() and mistakenly replaced that warning with a BUG(). Only it *can* happen. We can remove the BUG() and fix up callers, or we can pass retry=1 into alloc_page_buffers(), so grow_dev_page() cannot fail. Immortal functions are a silly fiction, so we should remove the BUG() and fix up callers. Any particular caller you are concerned with? Didn't someone see a buggy caller in btrfs? I'm thinking that we should retain some sort of assertion (a WARN_ON) if the try_to_free_buffers() failed. This is a weird case which I assume handles the situation where a blockdev's blocksize has changed. The code tries to throw away the old wrongly-sized buffer_heads and to then add new correctly-sized ones. If that discarding of buffers fails then the kernel is in rather a mess. It's quite possible that this code is never executed - we _should_ have invalidated all the pagecache for that device when changing blocksize. Or maybe it *is* executed, I dunno. It's one of those things which has hung around for decades as code in other places has vastly changed. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
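One shape the retained assertion could take inside grow_dev_page(), at the spot where wrongly-sized buffers fail to be discarded (placement is an assumption; the full function body is not quoted in this thread):

    if (page_has_buffers(page)) {
            bh = page_buffers(page);
            /* wrong-sized buffers we cannot discard: worth shouting about */
            if (bh->b_size != size && !try_to_free_buffers(page)) {
                    WARN_ON(1);
                    goto failed;
            }
    }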
[Devel] Re: [PATCH] memcg: Do not open code accesses to res_counter members
On Tue, 20 Mar 2012 20:53:44 +0400 Glauber Costa glom...@parallels.com wrote: We should use the accessor res_counter_read_u64 for that. Although a purely cosmetic change is sometimes better off delayed, to avoid conflicting with other people's work, we are starting to have people touching this code as well, and reproducing the open-coded behavior because that's the standard =) ... --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3708,7 +3708,7 @@ move_account: goto try_to_free; cond_resched(); /* ret should also be checked to ensure all lists are empty. */ - } while (memcg->res.usage > 0 || ret); + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); out: css_put(&memcg->css); return ret; @@ -3723,7 +3723,7 @@ try_to_free: lru_add_drain_all(); /* try to free all pages in this cgroup */ shrink = 1; - while (nr_retries && memcg->res.usage > 0) { + while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) { int progress; if (signal_pending(current)) { Actually this fixes bugs on 32-bit machines. Good luck trying to demonstrate them at runtime though ;) ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
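The 32-bit bug being alluded to is a torn read: res.usage is a 64-bit res_counter member, so an open-coded memcg->res.usage read compiles to two 32-bit loads on those machines and can observe a half-updated value. The accessor reads under the counter's spinlock; roughly (recalled from kernel/res_counter.c of that era, so treat as a sketch):

    #if BITS_PER_LONG == 32
    u64 res_counter_read_u64(struct res_counter *counter, int member)
    {
            unsigned long flags;
            u64 ret;

            spin_lock_irqsave(&counter->lock, flags);
            ret = *res_counter_member(counter, member);
            spin_unlock_irqrestore(&counter->lock, flags);
            return ret;
    }
    #endif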
[Devel] Re: [PATCH 1/2] SYSCTL: root unregister routine introduced
On Mon, 12 Dec 2011 21:50:00 +0300 Stanislav Kinsbursky skinsbur...@parallels.com wrote: This routine is required for SUNRPC sysctls, which are going to be allocated, processed and destroyed per network namespace context. IOW, a new sysctl root will be registered on network namespace creation and thus has to be unregistered before network namespace destruction. It's a bit suspicious that such a mature subsystem as sysctl newly needs its internals exported like this. Either a) the net namespaces work is doing something which hasn't been done before or b) it is doing something wrong. So, please explain further so we can confirm that it is a) and not b). --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1701,6 +1701,13 @@ void register_sysctl_root(struct ctl_table_root *root) spin_unlock(&sysctl_lock); } +void unregister_sysctl_root(struct ctl_table_root *root) +{ + spin_lock(&sysctl_lock); + list_del(&root->root_list); + spin_unlock(&sysctl_lock); +} + This requires the addition of a declaration to include/linux/sysctl.h. Once that is done and review is complete, I'd suggest that these two patches be joined into a single patch, and that patch become part of whatever patch series it is which needs them. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns
On Thu, 24 Feb 2011 00:48:18 +0000 Serge E. Hallyn se...@hallyn.com wrote: Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:03:25 +0000 Serge E. Hallyn se...@hallyn.com wrote: /* + * called with RCU read lock from check_kill_permission() + */ +static inline int kill_ok_by_cred(struct task_struct *t) +{ + const struct cred *cred = current_cred(); + const struct cred *tcred = __task_cred(t); + + if (cred->user->user_ns == tcred->user->user_ns && + (cred->euid == tcred->suid || + cred->euid == tcred->uid || + cred->uid == tcred->suid || + cred->uid == tcred->uid)) + return 1; + + if (ns_capable(tcred->user->user_ns, CAP_KILL)) + return 1; + + return 0; +} The compiler will inline this for us. Is that simply true with everything (worth inlining) nowadays, or is there a particular implicit hint to the compiler that'll make that happen? We've basically stopped inlining things nowadays. gcc inlines aggressively and sometimes we have to use noinline to stop it. Also, modern gcc's like to ignore the inline directive anyway, so we have to resort to __always_inline when we disagree. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
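The annotations being contrasted, for reference (a toy sketch using the kernel's compiler.h macros; the function names are invented):

    /* gcc usually decides this by itself - "inline" is only a hint... */
    static int add_one(int x) { return x + 1; }

    /* ...so when its heuristics pick wrongly, we override them: */
    static noinline int add_one_out_of_line(int x) { return x + 1; }
    static __always_inline int add_one_forced(int x) { return x + 1; }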
[Devel] Re: [PATCH 9/9] userns: check user namespace for task-file uid equivalence checks
On Thu, 24 Feb 2011 03:24:16 +0000 Serge E. Hallyn se...@hallyn.com wrote: Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:04:07 +0000 Serge E. Hallyn se...@hallyn.com wrote: There's a fairly well adhered to convention that global symbols (and often static symbols) have a prefix which identifies the subsystem to which they belong. This patchset rather scorns that convention. Most of these identifiers are pretty obviously from the capability subsystem, but still... Would 'inode_owner_or_capable' be better and make sense? I suppose so. We've totally screwed that pooch in the VFS (grep EXPORT fs/inode.c). But it's never too late to start. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace
On Thu, 17 Feb 2011 15:02:57 +0000 Serge E. Hallyn se...@hallyn.com wrote: +/* + * userns count is 1 for root user, 1 for init_uts_ns, + * and 1 for... ? + */ ? ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.
On Thu, 17 Feb 2011 15:03:06 +0000 Serge E. Hallyn se...@hallyn.com wrote: - Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. ... --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -544,7 +544,7 @@ extern const kernel_cap_t __cap_init_eff_set; * * Note that this does not set PF_SUPERPRIV on the task. */ -#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0) +#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0) /** * has_capability_noaudit - Determine if a task has a superior capability available (unaudited) @@ -558,9 +558,15 @@ extern const kernel_cap_t __cap_init_eff_set; * Note that this does not set PF_SUPERPRIV on the task. */ #define has_capability_noaudit(t, cap) \ - (security_real_capable_noaudit((t), (cap)) == 0) + (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0) +struct user_namespace; +extern struct user_namespace init_user_ns; Two icky-should-be-written-in-C macros which reference init_user_ns, followed by the declaration of init_user_ns and its type. Declarations which duplicate those in other header files. It's ripe for some upcleaning, methinks? Also, please ensure that the forward struct declarations are all at top-of-file (as in include/linux/security.h). Otherwise we can end up accumulating multiple forward declarations of the same thing in the one file. extern int capable(int cap); +extern int ns_capable(struct user_namespace *ns, int cap); +extern int task_ns_capable(struct task_struct *t, int cap); + +#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap))) macroitis! @@ -301,15 +302,42 @@ error: */ int capable(int cap) { + return ns_capable(&init_user_ns, cap); +} +EXPORT_SYMBOL(capable); + +/** + * ns_capable - Determine if the current task has a superior capability in effect + * @ns: The usernamespace we want the capability in + * @cap: The capability to be tested for + * + * Return true if the current task has the given superior capability currently + * available for use, false if not. Actually it doesn't return true or false - it returns 1 or 0. Using a `bool' return type would fix the comment :) + * This sets PF_SUPERPRIV on the task if the capability is available on the + * assumption that it's about to be used. + */ +int ns_capable(struct user_namespace *ns, int cap) +{ if (unlikely(!cap_valid(cap))) { printk(KERN_CRIT "capable() called with invalid cap=%u\n", cap); BUG(); } - if (security_capable(current_cred(), cap) == 0) { + if (security_capable(ns, current_cred(), cap) == 0) { current->flags |= PF_SUPERPRIV; return 1; } return 0; } -EXPORT_SYMBOL(capable); +EXPORT_SYMBOL(ns_capable); + +/* + * does current have capability 'cap' to the user namespace of task + * 't'. Return true if it does, false otherwise. + */ Other comments were kerneldocified. +int task_ns_capable(struct task_struct *t, int cap) +{ + return ns_capable(task_cred_xxx(t, user)->user_ns, cap); +} +EXPORT_SYMBOL(task_ns_capable); Could return bool. ...
+int cap_capable(struct task_struct *tsk, const struct cred *cred, + struct user_namespace *targ_ns, int cap, int audit) { - return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM; + for (;;) { + /* The creator of the user namespace has all caps. */ + if (targ_ns != &init_user_ns && targ_ns->creator == cred->user) + return 0; + + /* Do we have the necessary capabilities? */ + if (targ_ns == cred->user->user_ns) + return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM; + + /* Have we tried all of the parent namespaces? */ + if (targ_ns == &init_user_ns) + return -EPERM; + + /* If you have the capability in a parent user ns you have it + * over all children user namespaces as well, so see + * if this process has the capability in the parent user + * namespace. + */ + targ_ns = targ_ns->creator->user_ns; + } + + /* We never get here */ + return -EPERM; So delete the code? Or does the compiler warn? If so, it's pretty busted. } /** ... ___ Containers mailing list contain...@lists.linux-foundation.org
[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns
On Thu, 17 Feb 2011 15:03:25 +0000 Serge E. Hallyn se...@hallyn.com wrote: /* + * called with RCU read lock from check_kill_permission() + */ +static inline int kill_ok_by_cred(struct task_struct *t) +{ + const struct cred *cred = current_cred(); + const struct cred *tcred = __task_cred(t); + + if (cred->user->user_ns == tcred->user->user_ns && + (cred->euid == tcred->suid || + cred->euid == tcred->uid || + cred->uid == tcred->suid || + cred->uid == tcred->uid)) + return 1; + + if (ns_capable(tcred->user->user_ns, CAP_KILL)) + return 1; + + return 0; +} The compiler will inline this for us. +/* * Bad permissions for sending the signal * - the caller must hold the RCU read lock */ static int check_kill_permission(int sig, struct siginfo *info, struct task_struct *t) { - const struct cred *cred, *tcred; struct pid *sid; int error; @@ -656,14 +676,8 @@ static int check_kill_permission(int sig, struct siginfo *info, if (error) return error; - cred = current_cred(); - tcred = __task_cred(t); if (!same_thread_group(current, t) && - (cred->euid ^ tcred->suid) && - (cred->euid ^ tcred->uid) && - (cred->uid ^ tcred->suid) && - (cred->uid ^ tcred->uid) && - !capable(CAP_KILL)) { + !kill_ok_by_cred(t)) { switch (sig) { case SIGCONT: sid = task_session(t); ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces
On Thu, 17 Feb 2011 15:03:33 +0000 Serge E. Hallyn se...@hallyn.com wrote: ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task has the CAP_SYS_PTRACE capability. ... --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set; */ #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0) +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0) macroitis. /** * has_capability_noaudit - Determine if a task has a superior capability available (unaudited) * @t: The task in question diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index faf4679..862fc59 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns) uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid); gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid); +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim); bool. #else static inline struct user_namespace *get_user_ns(struct user_namespace *ns) ... --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -129,6 +129,22 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t return overflowgid; } +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim) +{ + struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns; + struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns; + for (;;) { + if (u1 == u2) + return 1; + if (u1 == &init_user_ns) + return 0; + u1 = u1->creator->user_ns; + } + /* We never get here */ + return 0; Remove? +} + static __init int user_namespaces_init(void) { user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); ... int cap_ptrace_access_check(struct task_struct *child, unsigned int mode) { int ret = 0; + const struct cred *cred, *tcred; rcu_read_lock(); - if (!cap_issubset(__task_cred(child)->cap_permitted, - current_cred()->cap_permitted) && - !capable(CAP_SYS_PTRACE)) - ret = -EPERM; + cred = current_cred(); + tcred = __task_cred(child); + /* + * The ancestor user_ns check may be gratuitous, as I think + * we've already guaranteed that in kernel/ptrace.c. + */ ? + if (same_or_ancestor_user_ns(current, child) && + cap_issubset(tcred->cap_permitted, cred->cap_permitted)) + goto out; + if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE)) + goto out; + ret = -EPERM; +out: rcu_read_unlock(); return ret; } ... ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 7/9] add a user namespace owner of ipc ns
On Thu, 17 Feb 2011 15:03:49 +0000 Serge E. Hallyn se...@hallyn.com wrote: ... --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -24,6 +24,7 @@ struct ipc_ids { struct idr ipcs_idr; }; +struct user_namespace; Move to top of file. struct ipc_namespace { atomic_t count; struct ipc_ids ids[3]; @@ -56,6 +57,8 @@ struct ipc_namespace { unsigned int mq_msg_max; /* initialized to DFLT_MSGMAX */ unsigned int mq_msgsize_max; /* initialized to DFLT_MSGSIZEMAX */ + /* user_ns which owns the ipc ns */ + struct user_namespace *user_ns; }; extern struct ipc_namespace init_ipc_ns; diff --git a/ipc/msgutil.c b/ipc/msgutil.c index f095ee2..d91ff4b 100644 --- a/ipc/msgutil.c +++ b/ipc/msgutil.c @@ -20,6 +20,8 @@ DEFINE_SPINLOCK(mq_lock); +extern struct user_namespace init_user_ns; Should be declared in .h, not in .c. /* * The next 2 defines are here bc this is the only file * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE ... ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 6/9] user namespaces: convert all capable checks in kernel/sys.c
On Thu, 17 Feb 2011 15:03:42 +0000 Serge E. Hallyn se...@hallyn.com wrote:

This allows setuid/setgid in containers. It also fixes some corner cases where kernel logic foregoes capability checks when uids are equivalent. The latter will need to be done throughout the whole kernel.

...

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -118,17 +118,29 @@ EXPORT_SYMBOL(cad_pid);

 void (*pm_power_off_prepare)(void);

+/* called with rcu_read_lock, creds are safe */
+static inline int set_one_prio_perm(struct task_struct *p)
+{
+	const struct cred *cred = current_cred(), *pcred = __task_cred(p);
+
+	if (pcred->user->user_ns == cred->user->user_ns &&
+	    (pcred->uid == cred->euid ||
+	     pcred->euid == cred->euid))
+		return 1;
+	if (ns_capable(pcred->user->user_ns, CAP_SYS_NICE))
+		return 1;
+	return 0;
+}

uninline. Document return value?

...
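Folding in both comments (out-of-line, documented return value), the helper might become something like this sketch:

	/*
	 * set_one_prio_perm - may current change @p's priority?
	 *
	 * Returns 1 if permission is granted, 0 otherwise.  Must be
	 * called with rcu_read_lock() held so the credentials stay valid.
	 */
	static int set_one_prio_perm(struct task_struct *p)
	{
		const struct cred *cred = current_cred(), *pcred = __task_cred(p);

		if (pcred->user->user_ns == cred->user->user_ns &&
		    (pcred->uid == cred->euid || pcred->euid == cred->euid))
			return 1;
		return ns_capable(pcred->user->user_ns, CAP_SYS_NICE);
	}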
[Devel] Re: [PATCH 9/9] userns: check user namespace for task-file uid equivalence checks
On Thu, 17 Feb 2011 15:04:07 +0000 Serge E. Hallyn se...@hallyn.com wrote:

Cheat for now and say all files belong to init_user_ns. Next step will be to let superblocks belong to a user_ns, and derive inode_userns(inode) from inode->i_sb->s_user_ns. Finally we'll introduce more flexible arrangements.

...

+
+/*
+ * return 1 if current either has CAP_FOWNER to the
+ * file, or owns the file.
+ */
+int is_owner_or_cap(const struct inode *inode)
+{
+	struct user_namespace *ns = inode_userns(inode);
+
+	if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
+		return 1;
+	if (ns_capable(ns, CAP_FOWNER))
+		return 1;
+	return 0;
+}

bool?

+EXPORT_SYMBOL(is_owner_or_cap);

There's a fairly well adhered to convention that global symbols (and often static symbols) have a prefix which identifies the subsystem to which they belong. This patchset rather scorns that convention. Most of these identifiers are pretty obviously from the capability subsystem, but still...

...
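Both remarks could be addressed at once: a bool return type plus a subsystem-style name. The name below is one plausible choice (it is the one the check eventually took upstream), but treat the whole block as a sketch rather than the posted patch:

	/* Sketch: same logic, bool return, prefixed name. */
	bool inode_owner_or_capable(const struct inode *inode)
	{
		struct user_namespace *ns = inode_userns(inode);

		if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
			return true;
		return ns_capable(ns, CAP_FOWNER);
	}
	EXPORT_SYMBOL(inode_owner_or_capable);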
[Devel] Re: [PATCH 6/9] user namespaces: convert all capable checks in kernel/sys.c
On Thu, 17 Feb 2011 15:03:42 +0000 Serge E. Hallyn se...@hallyn.com wrote:

@@ -1177,8 +1189,11 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
 	int errno;
 	char tmp[__NEW_UTS_LEN];

-	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
+	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN)) {
+		printk(KERN_NOTICE "%s: did not have CAP_SYS_ADMIN\n", __func__);
 		return -EPERM;
+	}
+	printk(KERN_NOTICE "%s: did have CAP_SYS_ADMIN\n", __func__);
 	if (len < 0 || len > __NEW_UTS_LEN)
 		return -EINVAL;
 	down_write(&uts_sem);

Left over debugging printks?
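With the stray debugging printks dropped, the hunk presumably reduces to the plain permission check:

	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
		return -EPERM;
	if (len < 0 || len > __NEW_UTS_LEN)
		return -EINVAL;
	down_write(&uts_sem);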
[Devel] Re: userns: targeted capabilities v5
On Thu, 17 Feb 2011 15:02:24 +0000 Serge E. Hallyn se...@hallyn.com wrote:

Here is a repost of my previous user namespace patch, ported onto last night's git head. It fixes several things I was doing wrong in the last (v4) posting, in particular:

1. don't set uts_ns->user_ns to current's when !CLONE_NEWUTS
2. add an ipc_ns->user_ns which owns the ipc_ns, and use that to decide CAP_IPC_OWNER
3. fix logic flaw caused by bad parentheses
4. allow do_prlimit to current
5. don't always give root full privs to init_user_ns

The expected course of development for user namespaces is laid out at https://wiki.ubuntu.com/UserNamespace.

Seems like a nice feature to be developing. I worry about the maturity of it all at this stage. How far along is it *really*? Is anyone else working with you on developing and reviewing this work?
[Devel] Re: [PATCH 1/1, v7] cgroup/freezer: add per freezer duty ratio control
On Sun, 13 Feb 2011 19:23:10 -0800 Arjan van de Ven ar...@linux.intel.com wrote:

On 2/13/2011 4:44 PM, KAMEZAWA Hiroyuki wrote:
On Sat, 12 Feb 2011 15:29:07 -0800 Matt Helsley matth...@us.ibm.com wrote:
On Fri, Feb 11, 2011 at 11:10:44AM -0800, jacob.jun@linux.intel.com wrote:

From: Jacob Pan jacob.jun@linux.intel.com

The freezer subsystem is used to manage batch jobs which can start and stop at the same time. However, sometimes it is desirable to let the kernel manage the freezer state automatically with a given duty ratio. For example, if we want to reduce the time that background apps are allowed to run, we can put them into a freezer subsystem and set the kernel to turn them THAWED/FROZEN at a given duty ratio. This patch introduces two file nodes under cgroup: freezer.duty_ratio_pct and freezer.period_sec

Again: I don't think this is the right approach in the long term. It would be better not to add this interface and instead enable the cpu cgroup subsystem for non-rt tasks using a similar duty ratio concept. Nevertheless, I've added some feedback on the code for you here :).

AFAIK, there was work on bandwidth control in CFS. http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-10/msg04335.html I tested this and it worked fine. This scheduler approach seems better suited to my purpose of limiting the bandwidth of applications than the freezer.

for our purpose, it's not about bandwidth. it's about making sure the class of apps don't run for a long period (30-second range) of time.

The discussion about this patchset seems to have been upside-down: lots of talk about a particular implementation, with people walking back from the implementation trying to work out what the requirements were, then seeing if other implementations might suit those requirements. Whatever they were.

I think it would be helpful to start again, ignoring (for now) any implementation. What are the requirements here, guys? What effects are we actually trying to achieve? Once that is understood and agreed to, we can think about implementations.

And maybe you people _are_ clear about the requirements. But I'm not, and I'm sure many others aren't too. A clear statement of them would help things along and would doubtless lead to better code. This is pretty basic stuff!
[Devel] Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
On Mon, 7 Feb 2011 20:35:42 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:

This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .

This patch series implements a write function for the 'cgroup.procs' per-cgroup file, which enables atomic movement of multithreaded applications between cgroups. Writing the thread-ID of any thread in a threadgroup to a cgroup's procs file causes all threads in the group to be moved to that cgroup safely with respect to threads forking/exiting. (Possible usage scenario: If running a multithreaded build system that sucks up system resources, this lets you restrict it all at once into a new cgroup to keep it under control.)

Example: Suppose pid 31337 clones new threads 31338 and 31339.

	# cat /dev/cgroup/tasks
	...
	31337
	31338
	31339
	# mkdir /dev/cgroup/foo
	# echo 31337 > /dev/cgroup/foo/cgroup.procs
	# cat /dev/cgroup/foo/tasks
	31337
	31338
	31339

A new lock, called threadgroup_fork_lock and living in signal_struct, is introduced to ensure atomicity when moving threads between cgroups. It's taken for writing during the operation, and taken for reading in fork() around the calls to cgroup_fork() and cgroup_post_fork().

The above six month old text is the best (and almost the only) explanation of the rationale for the entire patch series. Is it still correct and complete?

Assuming yes, then... how do we determine whether the feature is sufficiently useful to justify merging and maintaining it? Will people actually use it?

Was there some particular operational situation which led you to think that the kernel should have this capability? If so, please help us out here and lavishly describe it.
[Devel] Re: [PATCH, v4 2/2] cgroups: introduce timer slack subsystem
On Thu, 3 Feb 2011 16:34:14 +0200 Kirill A. Shutemov kir...@shutemov.name wrote:

From: Kirill A. Shutemov kir...@shutemov.name

Provides a way of grouping tasks by timer slack value. Introduces per-cgroup max and min timer slack values. When a task attaches to a cgroup, its timer slack value is adjusted (if needed) to fit the min-max range. It also provides a way to set the timer slack value for all tasks in the cgroup at once.

This functionality is useful in mobile devices where certain background apps are attached to a cgroup and minimum wakeups are desired.

This description is quite scanty. It doesn't explain what the benefit is, when one might want to use it, how to use it, etc. A nice description under Documentation/cgroups/ would be appropriate. I read this and come away with no sense of how useful or desirable this code is.

...

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -124,6 +124,8 @@ extern struct cred init_cred;
 # define INIT_PERF_EVENTS(tsk)
 #endif

+#define TIMER_SLACK_NS_DEFAULT 50000

Seems to be an unrelated cleanup. And given that this #define is only used in a single place, its cleanliness is questionable.

 /*
  * INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1f (=2MB)
@@ -177,7 +179,7 @@ extern struct cred init_cred;
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock),	\
-	.timer_slack_ns = 50000, /* 50 usec default slack */		\
+	.timer_slack_ns = TIMER_SLACK_NS_DEFAULT,			\
 	.pids = {							\
 		[PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),		\
 		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),		\

...

+extern int (*timer_slack_check)(struct task_struct *task,
+			unsigned long slack_ns);

Nope. Please put this in a header file so we can be sure that callers see the same prototype as does the definition.

...

+/*
+ * Adjust ->timer_slack_ns and ->default_max_slack_ns of the task to fit
+ * limits of the cgroup.
+ */
+static void tslack_adjust_task(struct timer_slack_cgroup *tslack_cgroup,
+		struct task_struct *tsk)
+{
+	if (tslack_cgroup->min_slack_ns > tsk->timer_slack_ns)
+		tsk->timer_slack_ns = tslack_cgroup->min_slack_ns;
+	else if (tslack_cgroup->max_slack_ns < tsk->timer_slack_ns)
+		tsk->timer_slack_ns = tslack_cgroup->max_slack_ns;
+
+	if (tslack_cgroup->min_slack_ns > tsk->default_timer_slack_ns)
+		tsk->default_timer_slack_ns = tslack_cgroup->min_slack_ns;
+	else if (tslack_cgroup->max_slack_ns < tsk->default_timer_slack_ns)
+		tsk->default_timer_slack_ns = tslack_cgroup->max_slack_ns;
+}
+
+static void tslack_cgroup_attach(struct cgroup_subsys *subsys,
+		struct cgroup *cgroup, struct cgroup *prev,
+		struct task_struct *tsk, bool threadgroup)
+{
+	tslack_adjust_task(cgroup_to_tslack_cgroup(cgroup), tsk);
+}
+
+static int tslack_write_set_slack_ns(struct cgroup *cgroup, struct cftype *cft,
+		u64 val)
+{
+	struct timer_slack_cgroup *tslack_cgroup;
+	struct cgroup_iter it;
+	struct task_struct *task;
+
+	tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
+	if (!is_timer_slack_allowed(cgroup_to_tslack_cgroup(cgroup), val))
+		return -EPERM;
+
+	/* Change timer slack value for all tasks in the cgroup */
+	cgroup_iter_start(cgroup, &it);
+	while ((task = cgroup_iter_next(cgroup, &it)))
+		task->timer_slack_ns = val;
+	cgroup_iter_end(cgroup, &it);
+
+	return 0;
+}
+
+static u64 tslack_read_range(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct timer_slack_cgroup *tslack_cgroup;
+
+	tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
+	switch (cft->private) {
+	case TIMER_SLACK_MIN:
+		return tslack_cgroup->min_slack_ns;
+	case TIMER_SLACK_MAX:
+		return tslack_cgroup->max_slack_ns;
+	default:
+		BUG();
+	};

Extraneous ";".

+}

These proposed new userspace interfaces should be documented somewhere, please. They are the most important part of the patch.

...

+struct cgroup_subsys timer_slack_subsys = {
+	.name = "timer_slack",
+	.module = THIS_MODULE,
+#ifdef CONFIG_CGROUP_TIMER_SLACK

Confused. This file won't even be compiled if CONFIG_CGROUP_TIMER_SLACK=n?

+	.subsys_id = timer_slack_subsys_id,
+#endif
+	.create = tslack_cgroup_create,
+	.destroy = tslack_cgroup_destroy,
+	.attach = tslack_cgroup_attach,
+	.populate = tslack_cgroup_populate,
+};
+

...

@@ -1694,12 +1698,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
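For the "Extraneous ;" remark, the fix is simply to drop the semicolon after the switch's closing brace; a sketch of the corrected tail:

	switch (cft->private) {
	case TIMER_SLACK_MIN:
		return tslack_cgroup->min_slack_ns;
	case TIMER_SLACK_MAX:
		return tslack_cgroup->max_slack_ns;
	default:
		BUG();
	}	/* no trailing ';' after a switch block */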
[Devel] Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
On Fri, 4 Feb 2011 16:25:15 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

On Mon, Jan 24, 2011 at 01:05:29PM -0800, Andrew Morton wrote:
On Sun, 26 Dec 2010 07:09:51 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum bb...@andrew.cmu.edu

This patch adds an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

...

+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else

Risky. sched.h doesn't include rwsem.h. We could make it do so, but almost every compilation unit in the kernel includes sched.h. It would be nicer to make the kernel build finer-grained, rather than blunter-grained. Don't be afraid to add new header files if that is one way of doing this!

Hmm, good point. But there's also:

+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore threadgroup_fork_lock;
+#endif

in the signal_struct, also in sched.h, which needs to be there. Or I could change it to a struct pointer with a forward incomplete declaration above, and kmalloc/kfree it? I don't like adding more alloc/free calls but don't know if it's more or less important than header granularity.

What about adding a new header file which includes rwsem.h and sched.h and then defines the new interfaces?
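Andrew's suggestion amounts to a small dedicated header; the name and layout below are illustrative only, not from the series:

	/* Hypothetical include/linux/threadgroup_fork.h: pulls in both
	 * sched.h (for signal_struct) and rwsem.h itself, so sched.h need
	 * not grow a new dependency for its zillion includers. */
	#ifndef _LINUX_THREADGROUP_FORK_H
	#define _LINUX_THREADGROUP_FORK_H

	#include <linux/sched.h>
	#include <linux/rwsem.h>

	#ifdef CONFIG_CGROUPS
	static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
	{
		down_read(&tsk->signal->threadgroup_fork_lock);
	}
	static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
	{
		up_read(&tsk->signal->threadgroup_fork_lock);
	}
	#else
	static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
	static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
	#endif

	#endif /* _LINUX_THREADGROUP_FORK_H */

Note that this only solves the accessor side; as Ben points out, the rw_semaphore member embedded in signal_struct still forces sched.h to know the rwsem type, which is why the kmalloc'ed-pointer alternative comes up at all.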
[Devel] Re: [PATCH v2] cgroup/freezer: add per freezer duty ratio control
On Wed, 2 Feb 2011 16:42:20 -0800 jacob.jun@linux.intel.com wrote:

The freezer subsystem is used to manage batch jobs which can start and stop at the same time. However, sometimes it is desirable to let the kernel manage the freezer state automatically with a given duty ratio. For example, if we want to reduce the time that background apps are allowed to run, we can put them into a freezer subsystem and set the kernel to turn them THAWED/FROZEN at a given duty ratio.

This patch introduces two file nodes under cgroup: freezer.duty_ratio_pct and freezer.period_sec

Usage example: set the period to 5 seconds and the frozen duty ratio to 90%:

	[root@localhost aoa]# echo 90 > freezer.duty_ratio_pct
	[root@localhost aoa]# echo 5 > freezer.period_sec

Signed-off-by: Jacob Pan jacob.jun@linux.intel.com
---
 kernel/cgroup_freezer.c | 109 ++-
 1 files changed, 107 insertions(+), 2 deletions(-)

Update the documentation please. Documentation/cgroups/
[Devel] Re: [PATCH] cgroup : remove the ns_cgroup
On Tue, 25 Jan 2011 10:39:48 +0100 Daniel Lezcano daniel.lezc...@free.fr wrote:

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems:

* cgroup creation is out-of-control
* cgroup names can conflict when pids are looping
* it is not possible to have a single process handling a lot of namespaces without falling into an exponential creation time
* we may want to create a namespace without creating a cgroup

The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
Cc: Eric W. Biederman ebied...@xmission.com
Cc: Jamal Hadi Salim h...@cyberus.ca
Reviewed-by: Li Zefan l...@cn.fujitsu.com
Acked-by: Paul Menage men...@google.com
Acked-by: Matt Helsley matth...@us.ibm.com

...

 22 files changed, 4 insertions(+), 287 deletions(-)

I didn't see that one coming. This change is userspace-visible, is it not? What are the implications of this?

There's some discussion in that nearly-two-year-old thread regarding making provision for back-compatibility, but I'm not seeing such things in this patch?
[Devel] Re: [PATCH] cgroup : remove the ns_cgroup
On Tue, 25 Jan 2011 10:39:48 +0100 Daniel Lezcano daniel.lezc...@free.fr wrote:

This patch removes the ns_cgroup as suggested in the following thread:

I had this patch queued up in September last year, but dropped it. Why did I do that?
[Devel] Re: [PATCH] cgroup : remove the ns_cgroup
On Thu, 27 Jan 2011 09:08:51 +0800 Li Zefan l...@cn.fujitsu.com wrote:

Andrew Morton wrote:
On Tue, 25 Jan 2011 10:39:48 +0100 Daniel Lezcano daniel.lezc...@free.fr wrote:
This patch removes the ns_cgroup as suggested in the following thread:
I had this patch queued up in September last year, but dropped it. Why did I do that?

Because you wanted to wait for some time for users (if any) to notice this coming change.

Author: Daniel Lezcano daniel.lezc...@free.fr
Date: Wed Oct 27 15:33:38 2010 -0700

    cgroup: notify ns_cgroup deprecated

    The ns_cgroup will be removed very soon. Let's warn, for this version, that ns_cgroup is deprecated.

    Make ns_cgroup and clone_children exclusive. If clone_children is set and the ns_cgroup is mounted, let's fail with EINVAL when the ns_cgroup subsys is created (a printk will help the user to understand why the creation fails).

    Update the feature-removal-schedule file with the deprecated ns_cgroup.

    Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
    Acked-by: Paul Menage men...@google.com
    Signed-off-by: Andrew Morton a...@linux-foundation.org
    Signed-off-by: Linus Torvalds torva...@linux-foundation.org

ooh, that was clever of me. Here is the text which was missing from the changelog:

This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.37) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this.

Please provide XXX.

How do we know that 2.6.37->2.6.38 is long enough? Will any major distros be released containing this warning in that timeframe? I doubt it.
[Devel] Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
On Sun, 26 Dec 2010 07:09:51 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum bb...@andrew.cmu.edu

This patch adds an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

...

+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else

Risky. sched.h doesn't include rwsem.h. We could make it do so, but almost every compilation unit in the kernel includes sched.h. It would be nicer to make the kernel build finer-grained, rather than blunter-grained. Don't be afraid to add new header files if that is one way of doing this!
[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable
On Fri, 24 Dec 2010 06:45:00 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

Ah - kmalloc() while holding a spinlock is only a problem with GFP_KERNEL and friends, and NODEMASK_ALLOC() takes a gfp_flags argument for that reason. So changing the uses in cpuset_can_attach to GFP_ATOMIC would solve this concern, then?

It would introduce an additional concern. GFP_ATOMIC is bad, for a number of reasons. The main one of which is that it is vastly less reliable than GFP_KERNEL. And making the cpuset code less reliable is a regression, no?

Please try to find a better solution.
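The usual better solution is to move the allocation outside the locked region so GFP_KERNEL remains usable. Purely an illustrative pattern - "my_lock" is a stand-in for whatever lock actually applies:

	/* Allocate with GFP_KERNEL *before* taking the spinlock, instead
	 * of allocating under the lock with GFP_ATOMIC. */
	NODEMASK_ALLOC(nodemask_t, nodes, GFP_KERNEL);	/* may sleep - fine here */

	if (!nodes)
		return -ENOMEM;
	spin_lock(&my_lock);
	/* ... use *nodes; no allocations needed while the lock is held ... */
	spin_unlock(&my_lock);
	NODEMASK_FREE(nodes);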
[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable
On Wed, 15 Dec 2010 22:34:39 -0800 Paul Menage men...@google.com wrote:

Ping akpm?

Patches have gone a bit stale, sorry. Refactoring in kernel/cgroup_freezer.c necessitates a refresh and retest please.

On Fri, Oct 8, 2010 at 2:57 PM, Paul Menage men...@google.com wrote:
Hi Andrew, do you see any road-blockers for including this patch set in -mm? Thanks, Paul

On Tue, Aug 24, 2010 at 11:08 AM, Paul Menage men...@google.com wrote:
On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum bb...@andrew.cmu.edu wrote:

Makes procs file writable to move all threads by tgid at once

From: Ben Blum bb...@andrew.cmu.edu

This patch adds functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from escaping while the move is in progress.

Signed-off-by: Ben Blum bb...@andrew.cmu.edu
Reviewed-by: Paul Menage men...@google.com
[Devel] Re: [PATCH] user_ns: Improve the user_ns on-the-slab packaging
On Tue, 07 Dec 2010 17:12:33 +0300 Pavel Emelyanov xe...@parallels.com wrote:

@@ -126,3 +128,11 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t
 	/* No useful relationship so no mapping */
 	return overflowgid;
 }
+
+static __init int user_namespaces_init(void)
+{
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	return 0;
+}
+
+__initcall(user_namespaces_init);

checkpatch (which you apparently didn't use) says

WARNING: please use device_initcall() instead of __initcall()
#81: FILE: kernel/user_namespace.c:138:
+__initcall(user_namespaces_init);

which was a somewhat random recommendation. I think I'll switch it to plain old module_init(). Presumably user namespaces don't get used prior to initcalls being run.
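So the tail of the file would presumably end up as:

	static __init int user_namespaces_init(void)
	{
		user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
		return 0;
	}
	module_init(user_namespaces_init);	/* was __initcall() */

For built-in code, module_init() maps to the same initcall level, so the behaviour is unchanged; the spelling is just more conventional.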
[Devel] Re: [PATCH v4 05/11] writeback: create dirty_info structure
On Fri, 29 Oct 2010 00:09:08 -0700 Greg Thelen gthe...@google.com wrote:

Bundle the dirty limits and dirty memory usage metrics into a dirty_info structure to simplify the interfaces of routines that need them all.

Problems... These patches interact pretty badly with Fengguang's IO-less dirty throttling v2 patches. I fixed up writeback-create-dirty_info-structure.patch pretty mechanically, but when it got to memcg-check-memcg-dirty-limits-in-page-writeback.patch things got sticky and I gave up.

As your stuff was merged first, I'd normally send the bad news to Fengguang, but the memcg code is logically built upon the core writeback code, so I do think these patches should be staged after the changes to core writeback.

Also, while I was there it seemed that the chosen members of the dirty_info structure were a bit random. Perhaps we should be putting nr_dirty in there as well, perhaps other things. Please have a think about that. Also, in ratelimit_pages() we call global_dirty_info() to return four items, but that caller only actually uses two of them. Wasted effort?

So I'm afraid I'm going to have to request that you redo and retest these patches:

writeback-create-dirty_info-structure.patch
memcg-add-dirty-page-accounting-infrastructure.patch
memcg-add-kernel-calls-for-memcg-dirty-page-stats.patch
memcg-add-dirty-limits-to-mem_cgroup.patch
memcg-add-dirty-limits-to-mem_cgroup-use-native-word-to-represent-dirtyable-pages.patch
memcg-add-dirty-limits-to-mem_cgroup-catch-negative-per-cpu-sums-in-dirty-info.patch
memcg-add-dirty-limits-to-mem_cgroup-avoid-overflow-in-memcg_hierarchical_free_pages.patch
memcg-add-dirty-limits-to-mem_cgroup-correct-memcg_hierarchical_free_pages-return-type.patch
memcg-add-dirty-limits-to-mem_cgroup-avoid-free-overflow-in-memcg_hierarchical_free_pages.patch
memcg-cpu-hotplug-lockdep-warning-fix.patch
memcg-add-cgroupfs-interface-to-memcg-dirty-limits.patch
memcg-break-out-event-counters-from-other-stats.patch
memcg-check-memcg-dirty-limits-in-page-writeback.patch
memcg-use-native-word-page-statistics-counters.patch
memcg-use-native-word-page-statistics-counters-fix.patch
# memcg-add-mem_cgroup-parameter-to-mem_cgroup_page_stat.patch
memcg-pass-mem_cgroup-to-mem_cgroup_dirty_info.patch
#memcg-make-throttle_vm_writeout-memcg-aware.patch: troublesome: Kamezawa
memcg-make-throttle_vm_writeout-memcg-aware.patch
memcg-make-throttle_vm_writeout-memcg-aware-fix.patch
memcg-simplify-mem_cgroup_page_stat.patch
memcg-simplify-mem_cgroup_dirty_info.patch
memcg-make-mem_cgroup_page_stat-return-value-unsigned.patch

against the http://userweb.kernel.org/~akpm/mmotm/ which I just uploaded, sorry. I've uploaded my copy of all the above to http://userweb.kernel.org/~akpm/stuff/gthelen.tar.gz. I think only the two patches need fixing and retesting.

Also, while wrangling the above patches, I stumbled across rejects such as:

***************
*** 99,106 ****
 		   "state:		%8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
--- 98,106 ----
 		   "state:		%8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
+		   K(bdi_thresh), K(dirty_info.dirty_thresh),
+		   K(dirty_info.background_thresh),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);

Please, if you discover crud like this, just fix it up. One item per line:

		   "state:		%8lx\n",
		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
		   K(bdi_thresh),
		   K(dirty_info.dirty_thresh),
		   K(dirty_info.background_thresh),
		   nr_dirty,
		   nr_io,
		   nr_more_io,
		   !list_empty(&bdi->bdi_list), bdi->state);

all very simple. And while you're there, fix up the tab-tab-space-space-space indenting - just use tabs.

The other area where code maintenance is harder than it needs to be is in definitions of locals:

	long nr_reclaimable;
	long nr_dirty, bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
	long bdi_prev_dirty = 0;

again, that's just dopey. Change it to

	long nr_reclaimable;
	long nr_dirty;
	long bdi_dirty;		/* = file_dirty + writeback + unstable_nfs */
	long bdi_prev_dirty = 0;

All very simple. Thanks.
[Devel] Re: [PATCH v4 05/11] writeback: create dirty_info structure
On Wed, 17 Nov 2010 16:49:24 -0800 Andrew Morton a...@linux-foundation.org wrote:

against the http://userweb.kernel.org/~akpm/mmotm/ which I just uploaded

err, will upload Real Soon Now.
[Devel] Re: [PATCH v4 00/11] memcg: per cgroup dirty page accounting
On Fri, 29 Oct 2010 00:09:03 -0700 Greg Thelen gthe...@google.com wrote:

This is cool stuff - it's been a long haul. One day we'll be nearly-finished and someone will write a book telling people how to use it all, and lots of people will go "holy crap". I hope.

Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim) page cache used by a cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit.

The patches are based on a series proposed by Andrea Righi in Mar 2010.

Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs unstable.
- Extend mem_cgroup to record the total number of pages in each of the interesting dirty states (dirty, writeback, unstable_nfs).
- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs control files.

Curious minds will want to know what the default values are set to and how they were determined.

- Consider both system and per-memcg dirty limits in page writeback when deciding to queue background writeback or block for foreground writeback.

Known shortcomings:
- When a cgroup dirty limit is exceeded, then bdi writeback is employed to writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not just inodes contributing dirty pages to the cgroup exceeding its limit.

yup. Some broader discussion of the implications of this shortcoming is needed. I'm not sure where it would be placed, though. Documentation/ for now, until you write that book.

- When memory.use_hierarchy is set, then dirty limits are disabled. This is an implementation detail.

So this is unintentional, and forced upon us by the present implementation?

An enhanced implementation is needed to check the chain of parents to ensure that no dirty limit is exceeded.

How important is it that this be fixed? And how feasible would that fix be? A linear walk up the hierarchy list? More than that?

Performance data:
- A page fault microbenchmark workload was used to measure performance, which can be called in read or write mode (see the table after this list):

	f = open("foo." $cpu)
	truncate(f, 4096)
	alarm(60)
	while (1) {
		p = mmap(f, 4096)
		if (write)
			*p = 1
		else
			x = *p
		munmap(p)
	}

- The workload was called for several points in the patch series in different modes:
  - s_read is a single threaded reader
  - s_write is a single threaded writer
  - p_read is a 16 thread reader, each operating on a different file
  - p_write is a 16 thread writer, each operating on a different file
- Measurements were collected on a 16 core non-numa system using "perf stat --repeat 3". The -a option was used for parallel (p_*) runs.
- All numbers are page fault rate (M/sec). Higher is better.
- To compare the performance of a kernel without memcg, compare the first and last rows; neither has memcg configured. The first row does not include any of these memcg patches.
- To compare the performance of using memcg dirty limits, compare the baseline (2nd row, titled "w/ memcg") with the code and memcg enabled (2nd to last row, titled "all patches").
                        root_cgroup                        child_cgroup
                 s_read s_write p_read p_write      s_read s_write p_read p_write
mmotm w/o memcg  0.428  0.390   0.429  0.388
mmotm w/ memcg   0.411  0.378   0.391  0.362        0.412  0.377   0.385  0.363
all patches      0.384  0.360   0.370  0.348        0.381  0.363   0.368  0.347
all patches
 w/o memcg       0.431  0.402   0.427  0.395

afaict this benchmark has demonstrated that the changes do not cause an appreciable performance regression in terms of CPU loading, yes?

Can we come up with any tests which demonstrate the _benefits_ of the feature?
[Devel] Re: [PATCH v4 09/11] memcg: CPU hotplug lockdep warning fix
On Fri, 29 Oct 2010 00:09:12 -0700 Greg Thelen gthe...@google.com wrote:

From: Balbir Singh bal...@linux.vnet.ibm.com

memcg has lockdep warnings (sleep inside rcu lock)

...

Acked-by: Greg Thelen gthe...@google.com

You were on the patch delivery path, so this should be Signed-off-by:. I made that change to my copy.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
[Devel] Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
On Fri, 29 Oct 2010 00:09:05 -0700 Greg Thelen gthe...@google.com wrote:

Document cgroup dirty memory interfaces and statistics.

...

+When use_hierarchy=0, each cgroup has dirty memory usage and limits.
+System-wide dirty limits are also consulted. Dirty memory consumption is
+checked against both system-wide and per-cgroup dirty limits.
+
+The current implementation does enforce per-cgroup dirty limits when

does not, I trust.

+use_hierarchy=1. System-wide dirty limits are used for processes in such
+cgroups. Attempts to read memory.dirty_* files return the system-wide values.
+Writes to the memory.dirty_* files return error. An enhanced implementation is
+needed to check the chain of parents to ensure that no dirty limit is exceeded.
+
 6. Hierarchy support

 The memory controller supports a deep hierarchy and hierarchical accounting.

--
1.7.3.1
[Devel] Re: [PATCH 2/2] Kconfig : default all the namespaces to 'yes'
On Wed, 13 Oct 2010 09:44:30 -0500 Serge E. Hallyn se...@hallyn.com wrote:

Quoting Daniel Lezcano (daniel.lezc...@free.fr):
On 10/12/2010 07:16 PM, Serge E. Hallyn wrote:
Quoting Matt Helsley (matth...@us.ibm.com):
On Thu, Oct 07, 2010 at 03:15:33PM +0200, Daniel Lezcano wrote:

As the different namespaces depend on 'CONFIG_NAMESPACES', it is logical to enable all the namespaces when we enable NAMESPACES.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr

Subject of the patch email is a little confusing as it's not quite what happens. I'm mostly OK with it but I'm not sure we should enable user-ns by default just yet.

Acked-By: Matt Helsley matth...@us.ibm.com

In fact, perhaps we should keep the experimental tag on user namespaces.

The experimental tag is kept on the user namespace. This one is defaulting to yes when the namespaces and experimental are selected.

Oh, sounds good

My attention flagged. Can we please confirm that the current patch is still good?

From: Daniel Lezcano daniel.lezc...@free.fr

As the different namespaces depend on 'CONFIG_NAMESPACES', it is logical to enable all the namespaces when we enable NAMESPACES.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Cc: Eric W. Biederman ebied...@xmission.com
Cc: David Miller da...@davemloft.net
Acked-By: Matt Helsley matth...@us.ibm.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 init/Kconfig | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff -puN init/Kconfig~namespaces-default-all-the-namespaces-to-yes-when-config_namespaces-is-selected init/Kconfig
--- a/init/Kconfig~namespaces-default-all-the-namespaces-to-yes-when-config_namespaces-is-selected
+++ a/init/Kconfig
@@ -739,6 +739,7 @@ config NAMESPACES
 config UTS_NS
 	bool "UTS namespace"
 	depends on NAMESPACES
+	default y
 	help
 	  In this namespace tasks see different info provided with the
 	  uname() system call
@@ -746,6 +747,7 @@ config UTS_NS
 config IPC_NS
 	bool "IPC namespace"
 	depends on NAMESPACES && (SYSVIPC || POSIX_MQUEUE)
+	default y
 	help
 	  In this namespace tasks work with IPC ids which correspond to
 	  different IPC objects in different namespaces.
@@ -753,6 +755,7 @@ config IPC_NS
 config USER_NS
 	bool "User namespace (EXPERIMENTAL)"
 	depends on NAMESPACES && EXPERIMENTAL
+	default y
 	help
 	  This allows containers, i.e. vservers, to use user namespaces
 	  to provide different user info for different servers.
@@ -760,8 +763,8 @@ config USER_NS
 config PID_NS
 	bool "PID Namespaces"
-	default n
 	depends on NAMESPACES
+	default y
 	help
 	  Support process id namespaces. This allows having multiple
 	  processes with the same pid as long as they are in different
@@ -769,8 +772,8 @@ config PID_NS
 config NET_NS
 	bool "Network namespace"
-	default n
 	depends on NAMESPACES && NET
+	default y
 	help
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
_
[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
On Fri, 8 Oct 2010 13:37:12 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Thu, 7 Oct 2010 16:14:54 -0700 Andrew Morton a...@linux-foundation.org wrote:
On Thu, 7 Oct 2010 17:04:05 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

Now, at task migration among cgroups, memory cgroup scans the page table and moves the accounting if flags are properly set. The core code, mem_cgroup_move_charge_pte_range(), does

	pte_offset_map_lock();
	for all ptes in a page table:
		1. look into the page table, find_and_get a page
		2. remove it from LRU
		3. move charge
		4. putback to LRU, put_page()
	pte_offset_map_unlock();

for pte entries on a 3rd-level page table?

This pte_offset_map_lock seems a bit long. This patch modifies the routine as:

	for each 32 pages:
		pte_offset_map_lock()
		find_and_get a page
		record it
		pte_offset_map_unlock()
	for all recorded pages:
		isolate it from LRU
		move charge
		putback to LRU
	for all recorded pages:
		put_page()

The patch makes the code larger, more complex and slower!

Slower?

Sure. It walks the same data three times, potentially causing thrashing in the L1 cache. It takes and releases locks at a higher frequency. It increases the text size.
[Devel] Re: [PATCH 1/1] cgroups: strcpy destination string overflow
On Tue, 5 Oct 2010 12:38:05 +0400 Evgeny Kuznetsov ext-eugeny.kuznet...@nokia.com wrote:

From: Evgeny Kuznetsov ext-eugeny.kuznet...@nokia.com

The strcpy function is used without a check for the maximum allowed source string length and could cause a destination string overflow. A check for the string length is added before using strcpy. The function now returns an error if the source string length is more than the maximum.

Signed-off-by: Evgeny Kuznetsov ext-eugeny.kuznet...@nokia.com
---
 kernel/cgroup.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c9483d8..82bbede 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1883,6 +1883,8 @@ static int cgroup_release_agent_write(struct cgroup *cgrp, struct cftype *cft,
 				      const char *buffer)
 {
 	BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
+	if (strlen(buffer) >= PATH_MAX)
+		return -EINVAL;
 	if (!cgroup_lock_live_group(cgrp))
 		return -ENODEV;
 	strcpy(cgrp->root->release_agent_path, buffer);

I don't think this can happen, because cftype.max_write_len is PATH_MAX. But it's pretty unobvious if this is actually true, and the code is fragile against future changes.
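A more future-proof variant - an alternative sketch, not the posted patch - is to let the copy truncate instead of trusting the caller's length guarantee:

	if (!cgroup_lock_live_group(cgrp))
		return -ENODEV;
	/* strlcpy() truncates at the destination size instead of
	 * overflowing, so this stays safe even if max_write_len
	 * changes later. */
	strlcpy(cgrp->root->release_agent_path, buffer,
		sizeof(cgrp->root->release_agent_path));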
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Tue, 28 Sep 2010 15:50:17 +0200 Daniel Lezcano daniel.lezc...@free.fr wrote:

On 09/27/2010 10:46 PM, Andrew Morton wrote:
On Mon, 27 Sep 2010 15:36:58 -0500 Serge E. Hallyn serge.hal...@canonical.com wrote:

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change?

Yes, it is. Patch 1 is needed to let lxc and libvirt both control containers with the same cgroup setup. Patch 3 however isn't *necessary* for that. Daniel, what do you think about holding off on patch 3?

One way of handling this would be to merge patches 1 and 2, which add the new interface, and also arrange for usage of the old interface(s) to emit a printk telling people that they're using a feature which is scheduled for removal.

Right, that makes sense. Will you take patches #1 and #2 and drop patch #3, and I send a new patch with the printk warning? Or shall I resend all?

I dropped #3. Please send the printk-warning patch. I'd suggest a printk_once(), nice and verbose.
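A sketch of the requested warning - the wording is an assumption, emitted from wherever the old interface is first used:

	/* Illustrative deprecation notice, printed once per boot: */
	printk_once(KERN_WARNING
		    "cgroup: the ns_cgroup subsystem is deprecated and "
		    "scheduled for removal; use the \"clone_children\" "
		    "cgroup flag instead\n");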
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Mon, 27 Sep 2010 12:14:10 +0200 Daniel Lezcano daniel.lezc...@free.fr wrote:

The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process from escaping the control group and uses the post_clone callback, so the child cgroup inherits the values of the parent cgroup.

Unfortunately, the more we use this cgroup the more we are facing problems with it:

(1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST

(2) the cgroup creation is out of control because there may be an application creating several namespaces, where the system will automatically create several cgroups behind its back and leave them on the cgroupfs (eg. a vrf based on the network namespace).

(3) the mix of (1) and (2) forces an administrator to regularly check and clean these cgroups.

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change? What are the implications of this?
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Mon, 27 Sep 2010 11:18:47 +0200 Michael Holzheu holz...@linux.vnet.ibm.com wrote:

Hello Andrew,

On Fri, 2010-09-24 at 11:50 -0700, Andrew Morton wrote:

This is a big change! If this is done right then we're heading in the direction of deprecating the longstanding way in which userspace observes the state of Linux processes and we're recommending that the whole world migrate to taskstats. I think?

Or it can be used as an alternative. Since procfs has its drawbacks (e.g. performance), an alternative could be helpful.

And it can be harmful. More kernel code to maintain and test, more userspace code to develop, maintain, etc. Less user testing than if there was a single interface.

Sure, the value has to be big enough to justify the effort. But as I said, with taskstats and procfs we already have two interfaces for getting task information.

That doesn't mean it was the right thing to do! For the reasons I outline above, it can be the wrong thing to do, and strengthening one of the alternatives worsens the problem.

Currently in procfs there is information that you can't find in taskstats. But also the other way round: in the taskstats structure there is very useful information that you can't get under proc, e.g. the task delay times, IO accounting, etc.

Sounds like a big screwup ;) Look at it this way: if you were going to sit down and start to design a new operating system from scratch, would you design the task status reporting system as it currently stands in Linux? Don't think so!

So currently tools have to use both interfaces to get all information, which is not optimal.

I worry that there's a dependency on CONFIG_NET? If so then that's a big problem because in N years time, 99% of the world will be using taskstats, but a few embedded losers will be stuck using (and having to support) the old tools.

Sure, but if we could add the /proc/taskstats approach, this dependency would not be there.

So why do we need to present the same info over netlink?

Good point. It is not really necessary. I started development using the netlink code, therefore I first added the new command in the netlink code. I also thought it would be a good idea to provide all netlink commands over the procfs interface to be consistent.

Maybe we should have delivered taskstats over procfs from day one.
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Mon, 27 Sep 2010 15:36:58 -0500 Serge E. Hallyn serge.hal...@canonical.com wrote:

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change?

Yes, it is. Patch 1 is needed to let lxc and libvirt both control containers with the same cgroup setup. Patch 3 however isn't *necessary* for that. Daniel, what do you think about holding off on patch 3?

One way of handling this would be to merge patches 1 and 2, which add the new interface, and also arrange for usage of the old interface(s) to emit a printk telling people that they're using a feature which is scheduled for removal. Later, we remove the old interface.
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Mon, 27 Sep 2010 13:45:26 -0700 ebied...@xmission.com (Eric W. Biederman) wrote:

Serge E. Hallyn serge.hal...@canonical.com writes:
Quoting Andrew Morton (a...@linux-foundation.org):
On Mon, 27 Sep 2010 12:14:10 +0200 Daniel Lezcano daniel.lezc...@free.fr wrote:

The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process from escaping the control group and uses the post_clone callback, so the child cgroup inherits the values of the parent cgroup.

Unfortunately, the more we use this cgroup the more we are facing problems with it:

(1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST

(2) the cgroup creation is out of control because there may be an application creating several namespaces, where the system will automatically create several cgroups behind its back and leave them on the cgroupfs (eg. a vrf based on the network namespace).

(3) the mix of (1) and (2) forces an administrator to regularly check and clean these cgroups.

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change?

Yes, it is.

We have always been able to compile out the ns cgroup, right? In which case this is not a backwards-incompatible change so much as the permanent removal of a borked kernel feature.

That's just spin ;) People whose code is dependent on the old feature get screwed over if we remove it. And sure, it's unlikely that many people at all are using this feature. But it's amazing what goes on out there, and if the cost to us of providing people with a migration period isn't too high then why not do it?
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Fri, 24 Sep 2010 11:10:15 +0200 Michael Holzheu holz...@linux.vnet.ibm.com wrote:

Hello Andrew,

On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:

GOALS OF THIS PATCH SET
-----------------------
The intention of this patch set is to provide better support for tools like top. The goal is to:

* provide a task snapshot mechanism where we can get a consistent view of all running tasks.
* provide a transport mechanism that does not require a lot of system calls and that allows implementing low CPU overhead task monitoring.
* provide microsecond CPU time granularity.

This is a big change! If this is done right then we're heading in the direction of deprecating the longstanding way in which userspace observes the state of Linux processes and we're recommending that the whole world migrate to taskstats. I think?

Or it can be used as an alternative. Since procfs has its drawbacks (e.g. performance), an alternative could be helpful.

And it can be harmful. More kernel code to maintain and test, more userspace code to develop, maintain, etc. Less user testing than if there was a single interface.

I worry that there's a dependency on CONFIG_NET? If so then that's a big problem because in N years time, 99% of the world will be using taskstats, but a few embedded losers will be stuck using (and having to support) the old tools.

Sure, but if we could add the /proc/taskstats approach, this dependency would not be there.

So why do we need to present the same info over netlink? If the info is available via procfs then userspace code should use that and not netlink, because that userspace code would also be applicable to CONFIG_NET=n systems.

Does this have the potential to save us from the CONFIG_NET=n problem?

Yes

Let's say that when it's all tested ;)

Are PIDs over all namespaces unique?

Nope. The same pid can be present in different namespaces at the same time.
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Thu, 23 Sep 2010 15:48:01 +0200 Michael Holzheu holz...@linux.vnet.ibm.com wrote: Currently tools like top gather the task information by reading procfs files. This has several disadvantages: * It is very CPU intensive, because a lot of system calls (readdir, open, read, close) are necessary. * No real task snapshot can be provided, because while the procfs files are read the system continues running. * The procfs times granularity is restricted to jiffies. In parallel to procfs there exists the taskstats binary interface that uses netlink sockets as transport mechanism to deliver task information to user space. There exists a taskstats command TASKSTATS_CMD_ATTR_PID to get task information for a given PID. This command can already be used for tools like top, but has also several disadvantages: * You first have to find out which PIDs are available in the system. Currently we have to use procfs again to do this. * For each task two system calls have to be issued (First send the command and then receive the reply). * No snapshot mechanism is available. GOALS OF THIS PATCH SET --- The intention of this patch set is to provide better support for tools like top. The goal is to: * provide a task snapshot mechanism where we can get a consistent view of all running tasks. * provide a transport mechanism that does not require a lot of system calls and that allows implementing low CPU overhead task monitoring. * provide microsecond CPU time granularity. This is a big change! If this is done right then we're heading in the direction of deprecating the longstanding way in which userspace observes the state of Linux processes and we're recommending that the whole world migrate to taskstats. I think? If so, much chin-scratching will be needed, coordination with util-linux people, etc. We'd need to think about the implications of taskstats versioning. It _is_ a versioned interface, so people can't just go and toss random new stuff in there at will - it's not like adding a new procfs file, or adding a new line to an existing one. I don't know if that's likely to be a significant problem. I worry that there's a dependency on CONFIG_NET? If so then that's a big problem because in N years time, 99% of the world will be using taskstats, but a few embedded losers will be stuck using (and having to support) the old tools. FIRST RESULTS - Together with this kernel patch set also user space code for a new top utility (ptop) is provided that exploits the new kernel infrastructure. See patch 10 for more details. TEST1: System with many sleeping tasks for ((i=0; i 1000; i++)) do sleep 100 done # ptop_new_proc pid user sys ste total Name (#)(%) (%) (%)(%) (str) 541 0.37 2.39 0.10 2.87 top 3743 0.03 0.05 0.00 0.07 ptop_new_proc Compared to the old top command that has to scan more than 1000 proc directories the new ptop consumes much less CPU time (0.05% system time on my s390 system). How many CPUs does that system have? What's the `top' update period? One second? So we're saying that a `top -d 1' consumes 2.4% of this mystery-number-of-CPUs machine? That's quite a lot. PATCHSET OVERVIEW - The code is not final and still has a few TODOs. But it is good enough for a first round of review. The following kernel patches are provided: [01] Prepare-0: Use real microsecond granularity for taskstats CPU times. [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands more easily. [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from filling the taskstats. 
[04] Add new command TASKSTATS_CMD_ATTR_PIDS to get a snapshot of multiple tasks.
[05] Add procfs interface for taskstats commands. This makes it possible to get a complete and consistent snapshot with all tasks using two system calls (ioctl and read). Transferring a snapshot of all running tasks is not possible using the existing netlink interface, because there the socket buffer size is the restricting factor.

So this is a binary interface which uses an ioctl. People don't like ioctls. Could we have triggered it with a write() instead? Does this have the potential to save us from the CONFIG_NET=n problem?

[06] Add TGID to taskstats.
[07] Add steal time per task accounting.
[08] Add cumulative CPU time (user, system and steal) to taskstats.

These didn't update the taskstats version number. Should they have?

[09] Fix exit CPU time accounting.
[10] Besides the kernel patches, user space code is provided that exploits the new kernel infrastructure. The user space code provides the following:
1. A proposal for a taskstats user space library:
1.1 Based on netlink (requires libnl-devel-1.1-5)
2.1 Based on the
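For reference, below is a hedged sketch of the existing per-PID round trip from user space, modeled on the kernel's Documentation/accounting/getdelays.c (error handling trimmed, attribute parsing simplified; the snapshot commands proposed above are not assumed to exist). It makes the two-system-calls-per-task cost concrete: one send() and one recv() per task, after an initial pair to resolve the genetlink family id.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

#define GENLMSG_DATA(glh)  ((void *)((char *)NLMSG_DATA(glh) + GENL_HDRLEN))
#define NLA_DATA(na)       ((void *)((char *)(na) + NLA_HDRLEN))

struct msgtemplate {
	struct nlmsghdr n;
	struct genlmsghdr g;
	char buf[1024];
};

/* build a genetlink request carrying one attribute and send it */
static void send_cmd(int fd, __u16 nl_type, __u8 genl_cmd,
		     __u16 nla_type, void *data, int len)
{
	struct msgtemplate msg;
	struct nlattr *na;

	memset(&msg, 0, sizeof(msg));
	msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
	msg.n.nlmsg_type = nl_type;
	msg.n.nlmsg_flags = NLM_F_REQUEST;
	msg.g.cmd = genl_cmd;
	msg.g.version = 1;

	na = (struct nlattr *)GENLMSG_DATA(&msg);
	na->nla_type = nla_type;
	na->nla_len = len + NLA_HDRLEN;
	memcpy(NLA_DATA(na), data, len);
	msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);

	if (send(fd, &msg, msg.n.nlmsg_len, 0) < 0) {
		perror("send");
		exit(1);
	}
}

int main(int argc, char **argv)
{
	struct sockaddr_nl local = { .nl_family = AF_NETLINK };
	struct msgtemplate msg;
	struct taskstats *ts;
	struct nlattr *na;
	__u16 family_id;
	__u32 pid;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
	if (fd < 0 || bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
		perror("socket/bind");
		return 1;
	}

	/* 1st round trip: resolve the dynamic taskstats genetlink family id */
	send_cmd(fd, GENL_ID_CTRL, CTRL_CMD_GETFAMILY, CTRL_ATTR_FAMILY_NAME,
		 TASKSTATS_GENL_NAME, strlen(TASKSTATS_GENL_NAME) + 1);
	if (recv(fd, &msg, sizeof(msg), 0) < 0 || msg.n.nlmsg_type == NLMSG_ERROR)
		return 1;
	na = (struct nlattr *)GENLMSG_DATA(&msg);	/* CTRL_ATTR_FAMILY_NAME */
	na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len));
	family_id = *(__u16 *)NLA_DATA(na);		/* CTRL_ATTR_FAMILY_ID */

	/* 2nd round trip: the send+receive pair the text complains about,
	 * needed once per task when sampling the whole system */
	send_cmd(fd, family_id, TASKSTATS_CMD_GET, TASKSTATS_CMD_ATTR_PID,
		 &pid, sizeof(pid));
	if (recv(fd, &msg, sizeof(msg), 0) < 0 || msg.n.nlmsg_type == NLMSG_ERROR)
		return 1;
	na = (struct nlattr *)GENLMSG_DATA(&msg);	/* TASKSTATS_TYPE_AGGR_PID */
	na = (struct nlattr *)NLA_DATA(na);		/* nested: TASKSTATS_TYPE_PID */
	na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len)); /* TYPE_STATS */
	ts = NLA_DATA(na);

	printf("pid %u: utime %llu us, stime %llu us\n", pid,
	       (unsigned long long)ts->ac_utime,
	       (unsigned long long)ts->ac_stime);
	close(fd);
	return 0;
}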
[Devel] Re: [PATCH] cgroups: fix API thinko
On Fri, 06 Aug 2010 10:38:24 -0600 Alex Williamson alex.william...@redhat.com wrote:

On Fri, 2010-08-06 at 09:34 -0700, Sridhar Samudrala wrote:

On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote:

The cgroup_attach_task_current_cg API that we have upstream is backwards: we really need an API to attach the current task to the cgroups of another process A. In our case (vhost), a privileged user wants to attach its task to the cgroups of a less privileged one; the API makes us run it in the other task's context, and this fails. So let's make the API generic and just pass in 'from' and 'to' tasks. Add an inline wrapper for cgroup_attach_task_current_cg to avoid breaking bisect.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Paul, Li, Sridhar, could you please review the following patch? I only compile-tested it due to travel, but it looks straightforward to me. Alex Williamson volunteered to test and report the results. Sending out now for review as I might be offline for a bit. Will only try to merge when done, obviously. If OK, I would like to merge this through the -net tree, together with the patch fixing vhost-net. Let me know if that sounds ok. Thanks!

This patch is on top of net-next; it is needed to fix a vhost-net regression in net-next, where a non-privileged process can't enable the device anymore: when qemu uses vhost, inside the ioctl call it creates a thread and tries to add this thread to the groups of current, and it fails. But we control the thread, so to solve the problem we really should tell it 'connect to our cgroups'.

So am I correct to assume that this change is now needed in 2.6.36, and unneeded in 2.6.35?

Can it affect the userspace-kernel API in any manner? If so, it should be backported into earlier kernels to reduce the number of incompatible kernels out there.

Paul, did you have any comments? I didn't see any update in response to the minor review comments, so...

 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff -puN include/linux/cgroup.h~cgroups-fix-api-thinko-fix include/linux/cgroup.h
--- a/include/linux/cgroup.h~cgroups-fix-api-thinko-fix
+++ a/include/linux/cgroup.h
@@ -579,6 +579,7 @@ void cgroup_iter_end(struct cgroup *cgrp
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
+
 static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
 {
 	return cgroup_attach_task_all(current, tsk);
diff -puN kernel/cgroup.c~cgroups-fix-api-thinko-fix kernel/cgroup.c
--- a/kernel/cgroup.c~cgroups-fix-api-thinko-fix
+++ a/kernel/cgroup.c
@@ -1798,13 +1798,13 @@ out:
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 {
 	struct cgroupfs_root *root;
-	struct cgroup *cur_cg;
 	int retval = 0;
 
 	cgroup_lock();
 	for_each_active_root(root) {
-		cur_cg = task_cgroup_from_root(from, root);
-		retval = cgroup_attach_task(cur_cg, tsk);
+		struct cgroup *from_cg = task_cgroup_from_root(from, root);
+
+		retval = cgroup_attach_task(from_cg, tsk);
 		if (retval)
 			break;
 	}
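To make the direction of the new API concrete, here is a hedged sketch of the caller side in a vhost-like driver: the ioctl runs in the owner's context, creates a worker thread, and pulls that worker into the owner's cgroups. The worker function and creation details are illustrative assumptions, not the real vhost code; only cgroup_attach_task_all(from, to) comes from the patch above.

#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* hypothetical worker loop, stands in for the real vhost worker */
static int vhost_worker_fn(void *data)
{
	while (!kthread_should_stop())
		schedule();
	return 0;
}

static int start_worker_for(struct task_struct *owner)
{
	struct task_struct *worker;
	int err;

	worker = kthread_create(vhost_worker_fn, NULL, "vhost-%d", owner->pid);
	if (IS_ERR(worker))
		return PTR_ERR(worker);

	/* Old API: had to run in the worker's context to join 'current'.
	 * New API: the privileged owner (current, here 'owner') pushes the
	 * worker into the owner's cgroups directly: from 'owner' to 'worker'. */
	err = cgroup_attach_task_all(owner, worker);
	if (err) {
		kthread_stop(worker);
		return err;
	}
	wake_up_process(worker);
	return 0;
}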
[Devel] Re: [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
On Fri, 30 Jul 2010 19:56:49 -0400 Ben Blum bb...@andrew.cmu.edu wrote:

This patch series implements a write function for the 'cgroup.procs' per-cgroup file, which enables atomic movement of multithreaded applications between cgroups. Writing the thread-ID of any thread in a threadgroup to a cgroup's procs file causes all threads in the group to be moved to that cgroup safely with respect to threads forking/exiting. (Possible usage scenario: if a multithreaded build system is sucking up system resources, this lets you restrict it all at once into a new cgroup to keep it under control.)

I can see how that would be useful. No comments from anyone else?

patch 1/2 makes me cry with all those ifdefs. Maybe helper functions would help, but not a lot.

patch 2/2 looks very complicated.
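To make the usage scenario concrete, a minimal user-space sketch: writing any one thread id to cgroup.procs is meant to move the entire threadgroup in one step, whereas writing to the per-thread 'tasks' file moves only that thread. The mount point and group path below are assumptions for illustration.

#include <stdio.h>
#include <sys/types.h>

/* move the whole threadgroup that any_tid belongs to into cgroup_dir */
static int move_threadgroup(const char *cgroup_dir, pid_t any_tid)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)any_tid);	/* one write, all threads move */
	return fclose(f);			/* flush reports EPERM etc. */
}

int main(void)
{
	/* e.g. corral a runaway multithreaded build by any of its tids */
	return move_threadgroup("/dev/cgroup/build", 12345) ? 1 : 0;
}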
[Devel] Re: [Bugme-new] [Bug 16417] New: Slow context switches with SMP and CONFIG_FAIR_GROUP_SCHED
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

sched suckage! Do we have a linear search in there?

On Mon, 19 Jul 2010 14:38:09 GMT bugzilla-dae...@bugzilla.kernel.org wrote:

https://bugzilla.kernel.org/show_bug.cgi?id=16417

Summary: Slow context switches with SMP and CONFIG_FAIR_GROUP_SCHED
Product: Process Management
Version: 2.5
Kernel Version: 2.6.34.1
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Scheduler
AssignedTo: mi...@elte.hu
ReportedBy: pbour...@excellency.fr
Regression: No

Hello,

We have been experiencing slow context switches when using a large number of cgroups (around 600 groups) and CONFIG_FAIR_GROUP_SCHED. This causes an increase in system time usage for context-switch-heavy processes (measured with pidstat -w) and a drop in timer interrupt handling. This problem only appears on SMP: when booting with nosmp, the issue does not appear. From maxprocs=2 to maxprocs=8 we were able to reproduce it accurately.

Steps to reproduce:
- mount the cgroup filesystem in /dev/cgroup
- cd /dev/cgroup
- for i in $(seq 1 5000); do mkdir test_group_$i; done
- launch lat_ctx from lmbench, for instance ./lat_ctx -N 200 100

The results from lat_ctx were the following:
- SMP enabled, no cgroups: 2.65
- SMP enabled, 1000 cgroups: 3.40
- SMP enabled, 6000 cgroups: 3957.36
- SMP disabled, 6000 cgroups: 1.58

We can see that from a certain number of cgroups onwards, context switching starts taking a lot of time.

Another way to reproduce this problem:
- launch cat /dev/zero | pv -L 1G > /dev/null
- look at the CPU usage (about 40% here)
- cd /dev/cgroup; for i in $(seq 1 5000); do mkdir test_group_$i; done
- look at the CPU usage (about 80% here)

Also note that when a lot of cgroups are present, the system spends a lot of time in softirqs, and fewer timer interrupts are handled than normal (according to our graphs).
[Devel] Re: [PATCH v2] wait: use wrapper functions
On Fri, 7 May 2010 14:33:26 +0800 Changli Gao xiao...@gmail.com wrote:

epoll should not touch flags in wait_queue_t. This patch introduces a new function, __add_wait_queue_excl(), for users who use the wait queue as a LIFO queue.

__add_wait_queue_exclusive(), please. Let's be consistent.
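For context, the wrapper's shape, roughly as it later appeared in include/linux/wait.h under the name Andrew asks for. The point is that callers such as epoll set WQ_FLAG_EXCLUSIVE through a helper instead of poking wait_queue_t flags directly:

/* same as __add_wait_queue() but marks the waiter exclusive, so wake-ups
 * that wake "one exclusive waiter" stop at it */
static inline void __add_wait_queue_exclusive(wait_queue_head_t *q,
					      wait_queue_t *wait)
{
	wait->flags |= WQ_FLAG_EXCLUSIVE;
	__add_wait_queue(q, wait);
}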
[Devel] Re: Linux Checkpoint-Restart - v19
On Mon, 22 Feb 2010 18:17:00 -0500 Oren Laadan or...@cs.columbia.edu wrote:

Hi Andrew,

We've put a stake in the ground for our next set of checkpoint/restart patches, v19. It has some great new stuff, and we put extra effort into addressing your concerns. We would like to have the code included in -mm for wider feedback and testing. This one is able to checkpoint/restart screen and vnc sessions, and live-migrate network servers between hosts. It also adds support for x86-64 (in addition to x86-32, s390x and powerpc). It is rebased to kernel 2.6.33-rc8.

Since one of your main concerns was about what is not yet implemented and how complicated or ugly it will be to support that, we've put up a wiki page to address it. On it there is a simple table that lists what is not implemented and the anticipated solution impact, and for some entries a link to more details. The page is here: http://ckpt.wiki.kernel.org/index.php/Checklist

Does "Refuses to Checkpoint" mean that an attempt to checkpoint will fail, return the failure to userspace, and the system continues as before?

We want to stress that the patchset is already very useful as-is. We will keep working to implement more features cleanly. Some features we are working on include network namespaces and device configurations, mounts and mount namespaces, and file locks. Should a complicated feature prove hard to implement, users have alternative systems, like kvm, until we manage to come up with a clean solution.

We believe that maintenance is best addressed through testing. We now have a comprehensive test suite to automatically find regressions. In addition, we ran LTP and the results are the same with CHECKPOINT=n and =y.

If desired we'll send the whole patchset to lkml, but the git trees can be seen at:

kernel: http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
user tools: http://www.linux-cr.org/git/?p=user-cr.git;a=summary
tests suite: http://www.linux-cr.org/git/?p=tests-cr.git;a=summary

I'd suggest waiting until very shortly after 2.6.34-rc1, then please send all the patches onto the list and let's get to work.
[Devel] Re: [PATCH v4 0/4] cgroups: support for module-loadable subsystems
On Thu, 31 Dec 2009 00:10:50 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

This patch series implements support for building, loading, and unloading subsystems as modules, both within and outside the kernel source tree. It provides an interface, cgroup_load_subsys() and cgroup_unload_subsys(), which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example of a subsystem which can be converted into a module using these changes.

What is the value in this? What are the usage scenarios? Why does the benefit of this change exceed the cost/risk/etc of merging it?
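For a feel of the proposed interface, a hedged sketch of a minimal modular subsystem. The callback signatures follow the era's cgroup_subsys, but the exact registration details (e.g. the .module field, subsys id handling) may differ from the final patches, so treat this as illustrative only:

#include <linux/module.h>
#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/slab.h>

struct my_state {
	struct cgroup_subsys_state css;
};

static struct cgroup_subsys_state *my_create(struct cgroup_subsys *ss,
					     struct cgroup *cgrp)
{
	struct my_state *st = kzalloc(sizeof(*st), GFP_KERNEL);

	return st ? &st->css : ERR_PTR(-ENOMEM);
}

static void my_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	kfree(container_of(cgrp->subsys[ss->subsys_id],
			   struct my_state, css));
}

struct cgroup_subsys my_subsys = {
	.name		= "mymod",
	.create		= my_create,
	.destroy	= my_destroy,
	.module		= THIS_MODULE,	/* assumed registration hook */
};

static int __init my_init(void)
{
	/* register with the cgroup core at module load time */
	return cgroup_load_subsys(&my_subsys);
}

static void __exit my_exit(void)
{
	cgroup_unload_subsys(&my_subsys);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");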
[Devel] Re: [BUG] cpu controller can't provide fair CPU time for each group
(cc contain...@lists.linux-foundation.org)

On Thu, 05 Nov 2009 11:56:00 +0900 Miao Xie mi...@cn.fujitsu.com wrote:

Hi, Ingo

Could you look at the following problems?

Regards
Miao

on 2009-11-3 11:26, Miao Xie wrote:

Hi, Peter.

I found two problems with the cpu controller:
1) The cpu controller doesn't provide fair CPU time to groups when the tasks attached to those groups are bound to the same logical CPU.
2) The cpu controller doesn't provide fair CPU time to groups when the shares of each group are <= 2 * nr_cpus.

The details follow.

1) The first problem is that the cpu controller doesn't provide fair CPU time to groups when the tasks attached to those groups are bound to the same logical CPU. The reason is that there is something wrong with the computation of the per-cpu shares.

On my test box with 16 logical CPUs, I did the following:
a. create 2 cpu controller groups.
b. attach a task into one group and 2 tasks into the other.
c. bind the three tasks to the same logical cpu.

   +--------+    +--------+
   | group1 |    | group2 |
   +--------+    +--------+
       |              |
              CPU0
     Task A     Task B  Task C

The following are the reproduction steps:
# mkdir /dev/cpuctl
# mount -t cgroup -o cpu,noprefix cpuctl /dev/cpuctl
# mkdir /dev/cpuctl/1
# mkdir /dev/cpuctl/2
# cat /dev/zero > /dev/null &
# pid1=$!
# echo $pid1 > /dev/cpuctl/1/tasks
# taskset -p -c 0 $pid1
# cat /dev/zero > /dev/null &
# pid2=$!
# echo $pid2 > /dev/cpuctl/2/tasks
# taskset -p -c 0 $pid2
# cat /dev/zero > /dev/null &
# pid3=$!
# echo $pid3 > /dev/cpuctl/2/tasks
# taskset -p -c 0 $pid3

Some time later, I found that the task in group1 got 35% CPU time, not 50%. It was very strange that this result went against expectations. This problem was caused by the wrong computation of the per-cpu shares. According to the design of the cpu controller, the shares of each cpu controller group are divided among the CPUs according to the workload of each logical CPU:

cpu[i] shares = group shares * CPU[i] workload / sum(CPU workload)

But if a CPU has no task, the cpu controller pretends there is one of average load, usually 1024, the load of a task whose nice value is zero. So in the test, the shares of group1 on CPU0 are:

1024 * (1 * 1024) / (1 * 1024 + 15 * 1024) = 64

and the shares of group2 on CPU0 are:

1024 * (2 * 1024) / (2 * 1024 + 15 * 1024) = 120

The scheduler on CPU0 provided CPU time to each group according to the shares above. The bug occurred.

2) The second problem is that the cpu controller doesn't provide fair CPU time to groups when the shares of each group are <= 2 * nr_cpus. The reason is that the per-cpu shares are set to MIN_SHARES (= 2) if the shares of each group are <= 2 * nr_cpus.

On the test box with 16 logical CPUs, we do the following test:
a. create two cpu controller groups
b. attach 32 tasks into each group
c. set the shares of the first group to 16, the other to 32

   +--------+    +--------+
   | group1 |    | group2 |
   +--------+    +--------+
    shares=16     shares=32
       |              |
    16 Tasks      32 Tasks

Some time later, the first group got 50% CPU time, not 33%. This result also went against expectations. It is because the shares of the cpuctl group were small and there are many logical CPUs, so the computed per-cpu shares came out less than MIN_SHARES and were then set to MIN_SHARES.

Maybe 16 and 32 are not commonly used values, and we can avoid this problem on my box by setting a usual number (such as 1024). But the number of CPUs in a machine will keep growing in the future. If the number of CPUs is greater than 512, this bug will occur even if we set the shares of a group to 1024, which is a usual number. At that point, ordinary users will find the behaviour strange.
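A toy user-space model of the division Miao describes (not kernel code; the function and constant names here are illustrative) reproduces both reported numbers - 64 vs 120 for the pinned-task case, and the MIN_SHARES clamp that makes shares=16 and shares=32 indistinguishable:

#include <stdio.h>

#define MIN_SHARES	2
#define NICE_0_LOAD	1024

/* distribute group_shares over CPUs by load, idle CPUs pretend 1024 */
static unsigned long cpu_shares(unsigned long group_shares,
				unsigned long cpu_load[], int ncpus, int i)
{
	unsigned long sum = 0, load, shares;
	int c;

	for (c = 0; c < ncpus; c++)
		sum += cpu_load[c] ? cpu_load[c] : NICE_0_LOAD;
	load = cpu_load[i] ? cpu_load[i] : NICE_0_LOAD;
	shares = group_shares * load / sum;
	return shares < MIN_SHARES ? MIN_SHARES : shares;	/* the clamp */
}

int main(void)
{
	/* Problem 1: 16 CPUs, all tasks pinned to CPU0 */
	unsigned long g1[16] = { 1 * NICE_0_LOAD };	/* one task on CPU0 */
	unsigned long g2[16] = { 2 * NICE_0_LOAD };	/* two tasks on CPU0 */
	unsigned long even[16];
	int c;

	printf("group1 CPU0 shares: %lu\n", cpu_shares(1024, g1, 16, 0)); /* 64 */
	printf("group2 CPU0 shares: %lu\n", cpu_shares(1024, g2, 16, 0)); /* 120 */

	/* Problem 2: load spread evenly, tiny group shares: both groups
	 * clamp to MIN_SHARES, so 16 vs 32 yields 50%/50%, not 33%/67% */
	for (c = 0; c < 16; c++)
		even[c] = 2 * NICE_0_LOAD;
	printf("shares=16 per-cpu: %lu\n", cpu_shares(16, even, 16, 0)); /* 2 */
	printf("shares=32 per-cpu: %lu\n", cpu_shares(32, even, 16, 0)); /* 2 */
	return 0;
}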
[Devel] Re: pidns memory leak
On Tue, 13 Oct 2009 23:15:33 -0700 Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:

Daniel Lezcano [dlezc...@fr.ibm.com] wrote:

Sukadev Bhattiprolu wrote:

Ccing Andrea's new email id:

Daniel Lezcano [dlezc...@fr.ibm.com] wrote:

Following your explanation I was able to reproduce the problem with a simple program (added in attachment). But there is something I do not understand: why doesn't the leak appear if I do the 'lstat' (cf. test program) in the pid 2 context?

Hmm, are you sure there is no leak with this test program? If I put back the commit (7766755a2f249e7), I do see a leak in all three data structures (pid_2, proc_inode, pid_namespace).

Let me clarify :) The program leaks with commit 7766755a2f249e7 and does not leak without it. This is the expected behaviour, and this simple program spots the problem. I tried to modify the program and moved the lstat to process 2 in the child namespace. Following your analysis, I was expecting to see a leak too, but it didn't occur. I was wondering why; maybe there is something I didn't understand in the analysis.

Hmm. There are two separate dentries associated with the processes, one in each mount of /proc. The proc dentries in the child container are freed when the child container unmounts its /proc, so you don't see the leak when the lstat() is inside the container. When the lstat() is in the root container, it is accessing proc dentries from the _root container_. They are supposed to be flushed when the task exits (but the above commit prevents that flush). They should be freed when the /proc in the root container is unmounted - and leak until then?

This bug hasn't been fixed yet, has it?
[Devel] Re: pidns memory leak
On Mon, 2 Nov 2009 16:38:18 -0600 Serge E. Hallyn se...@us.ibm.com wrote:

This bug hasn't been fixed yet, has it?

Well, Suka did trace the bug to commit 7766755a2f249e7, and posted a patch to revert that, acked by Eric on Oct 20. Suka, were you going to repost that patch?

Ah. OK. Thanks. Found it in the backlog pile.
[Devel] Re: IO scheduler based IO controller V10
On Thu, 24 Sep 2009 15:25:04 -0400 Vivek Goyal vgo...@redhat.com wrote:

Hi All,

Here is V10 of the IO controller patches, generated on top of 2.6.31.

Thanks for the writeup. It really helps and is most worthwhile for a project of this importance, size and complexity.

What problem are we trying to solve
===================================
Provide a group IO scheduling feature in Linux along the lines of other resource controllers like cpu. IOW, provide a facility so that a user can group applications using cgroups and control the amount of disk time/bandwidth received by a group based on its weight.

How to solve the problem
========================
Different people have solved the issue differently. So far it looks like we have the following two core requirements when it comes to fairness at the group level:

- Control the bandwidth seen by groups.
- Control latencies when a request gets backlogged in a group.

At least three patchsets are now available (including this one):

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group and throttles the processes in the group if they exceed the user-specified limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device-mapper driver, providing fair access in terms of the amount of IO done (not in terms of disk time, as CFQ does). So one will set up one or more dm-ioband devices on top of a physical/logical block device, configure the ioband device, and pass in information like grouping etc. The device then keeps track of bios flowing through it and controls the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of the IO controller as a hierarchical group scheduling issue (along the lines of CFS group scheduling). Currently one can view Linux IO schedulers as flat, where there is one root group and all the IO belongs to that group. This patchset basically modifies IO schedulers to also support hierarchical group scheduling. CFQ already provides fairness among different processes; I have extended it to support group IO scheduling. I also took some of the code out of CFQ and put it in a common layer, so that the same group scheduling code can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each approach. Following are some of the thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller, not a proportional one. Additionally, it provides fairness in terms of the amount of IO done (and not in terms of disk time, as CFQ does).

Personally, I think that a proportional weight controller is useful to more people than just a max bandwidth controller. In addition, the IO scheduler based controller can also be enhanced to do max bandwidth control, so it can satisfy a wider set of requirements.

Fairness in terms of disk time vs size of IO
--------------------------------------------
A higher-level controller will most likely be limited to providing fairness in terms of the size/number of IOs done, and will find it hard to provide fairness in terms of disk time used (as CFQ provides between various prio levels). This is because only the IO scheduler knows how much disk time a queue has used, and information about queues and disk time used is not exported to higher layers. So a seeky application will still run away with a lot of disk time and bring down the overall throughput of the disk.

But that's only true if the thing is poorly implemented. A high-level controller will need some view of the busyness of the underlying device(s).
That could be the proportion of idle time, or average queue length, or average request latency, or some mix of these, or something else altogether. But these things are simple to calculate, and are simple to feed back to the higher-level controller, and probably don't require any changes to the IO scheduler at all, which is a great advantage.

And I must say that high-level throttling based upon feedback from lower layers seems like a much better model to me than hacking away in the IO scheduler layer. Both from an implementation point of view and from a "we can get it to work on things other than block devices" point of view.

Currently dm-ioband provides fairness in terms of the number/size of IOs.

Latencies and isolation between groups
--------------------------------------
A higher-level controller generally implements a bandwidth throttling solution: if a group exceeds either the max bandwidth or its proportional share, then throttle that group. This kind of approach will probably not help in controlling latencies, as that will depend on the underlying IO scheduler. Consider the following scenario. Assume there are two groups. One group is running multiple sequential
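Picking up the feedback idea above: a hedged user-space sketch of estimating device busyness from counters the block layer already exports - the io_ticks field of /sys/block/<dev>/stat - which is the kind of signal a high-level controller could throttle against. The device name, sampling interval and policy are illustrative assumptions, not part of any of the patchsets discussed here.

#include <stdio.h>
#include <unistd.h>

/* read the "time spent doing I/Os" counter: field 10 of the stat file */
static long long io_ticks_ms(const char *dev)
{
	char path[128];
	long long f[11] = { 0 };
	FILE *fp;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	fscanf(fp, "%lld %lld %lld %lld %lld %lld %lld %lld %lld %lld %lld",
	       &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7],
	       &f[8], &f[9], &f[10]);
	fclose(fp);
	return f[9];	/* io_ticks: ms the device was busy */
}

int main(void)
{
	long long prev = io_ticks_ms("sda"), cur;

	for (;;) {
		sleep(1);
		cur = io_ticks_ms("sda");
		/* busyness over the last second, in percent */
		printf("device busy: %lld%%\n", (cur - prev) / 10);
		/* a real controller would throttle over-share groups
		 * here, but only while busyness approaches 100% */
		prev = cur;
	}
	return 0;
}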