Re: [Devel] [PATCH -mm v2 1/2] mem-hotplug: implement get/put_online_mems
On Mon, 7 Apr 2014 13:45:34 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> {un}lock_memory_hotplug, which is used to synchronize against memory
> hotplug, is currently backed by a mutex, which makes it a bit of a
> hammer - threads that only want to get a stable value of the online
> nodes mask won't be able to proceed concurrently. Also, it imposes some
> strong locking ordering rules on it, which narrows down the set of its
> usage scenarios.
>
> This patch introduces get/put_online_mems, which are the same as
> get/put_online_cpus, but for memory hotplug, i.e. executing code inside
> a get/put_online_mems section will guarantee a stable value of online
> nodes, present pages, etc.

Well that seems a nice change. I added the important paragraph:

  lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.

I'll Cc a large number of people who have recently worked on the memory
hotplug code. Hopefully some of them will have time to review and test
these patches, thanks.
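A minimal sketch of the usage pattern this enables (the loop body is
illustrative, not taken from the patch):

	get_online_mems();
	for_each_online_node(nid) {
		/*
		 * The online node mask and per-node present pages are
		 * stable until the matching put_online_mems().
		 */
		pr_info("node %d: %lu present pages\n",
			nid, node_present_pages(nid));
	}
	put_online_mems();

Unlike the old lock_memory_hotplug() mutex, any number of such reader
sections can run concurrently; only the hotplug writer excludes them.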
Re: [Devel] [PATCH -mm v2.2] mm: get rid of __GFP_KMEMCG
On Thu, 3 Apr 2014 19:05:59 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Currently, to allocate a page that should be charged to kmemcg (e.g.
> threadinfo), we pass the __GFP_KMEMCG flag to the page allocator. The
> page allocated is then to be freed by free_memcg_kmem_pages. Apart from
> looking asymmetrical, this also requires intrusion into the general
> allocation path. So let's introduce separate functions that will
> alloc/free pages charged to kmemcg.
>
> The new functions are called alloc_kmem_pages and free_kmem_pages. They
> should be used when the caller actually would like to use kmalloc, but
> has to fall back to the page allocator because the allocation is large.
> They only differ from alloc_pages and free_pages in that, besides
> allocating or freeing pages, they also charge them to the kmem resource
> counter of the current memory cgroup.
>
> ...
>
> +void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
> +{
> +	void *ret;
> +	struct page *page;
> +
> +	flags |= __GFP_COMP;
> +	page = alloc_kmem_pages(flags, order);
> +	ret = page ? page_address(page) : NULL;
> +	kmemleak_alloc(ret, size, 1, flags);
> +	return ret;
> +}

While we're in there it wouldn't hurt to document this: why it exists,
what it does, etc. And why it sets __GFP_COMP.
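For what it's worth, a hedged sketch of what such documentation could
say; the __GFP_COMP rationale below is my reading of the allocator, not
wording taken from the patch:

	/*
	 * kmalloc_order: kmalloc's fallback to the page allocator, used
	 * when the requested size is too large for any kmalloc slab
	 * cache.
	 *
	 * __GFP_COMP makes the allocation a compound page, so the free
	 * path (kfree()/ksize()) can recover the allocation order from
	 * the head page instead of having the caller pass it back in.
	 * kmemleak is told about the object explicitly because large
	 * allocations bypass the slab hooks that would normally
	 * register it.
	 */
	void *kmalloc_order(size_t size, gfp_t flags, unsigned int order);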
Re: [Devel] [PATCH 1/2] kobject: don't block for each kobject_uevent
On Sun, 9 Feb 2014 14:56:15 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Currently kobject_uevent has somewhat unpredictable semantics. The point
> is, since it may call a usermode helper and wait for it to execute
> (UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
> it will introduce for the caller - strictly speaking, it depends on what
> fs the binary is located on and the set of locks fork may take. There
> are quite a few users of kobject_uevent that do not take this into
> account and call it with various mutexes taken, e.g. rtnl_mutex,
> net_mutex, which might potentially lead to a deadlock.
>
> Since there is actually no reason to wait for the usermode helper to
> execute there, let's make kobject_uevent start the helper asynchronously
> with the aid of the UMH_NO_WAIT flag.
>
> Personally, I'm interested in this because I really want kobject_uevent
> to be called under the slab_mutex in the slub implementation, as it used
> to be some time ago, because it greatly simplifies synchronization and
> automatically fixes a kmemcg-related race. However, a deadlock was
> detected on an attempt to call kobject_uevent under the slab_mutex (see
> https://lkml.org/lkml/2012/1/14/45), which was reported to be fixed by
> releasing the slab_mutex for kobject_uevent. Unfortunately, there was no
> information about who exactly blocked on the slab_mutex causing the
> usermode helper to stall, nor have I managed to find this out or
> reproduce the issue.
>
> BTW, this is not the first attempt to make kobject_uevent use
> UMH_NO_WAIT. A previous one was made by commit f520360d93c, but it was
> wrong (it passed arguments allocated on the stack to an async thread),
> so it was reverted (commit 05f54c13cd0c). It was targeted at speeding up
> the boot process, though.

Am not a huge fan of this patch. My test box gets an early oops in
initcalls:

	...
	register_pernet_operations
	rtnetlink_net_init
	__netlink_kernel_create
	sock_create_lite
	sock_alloc
	new_inode_pseudo
	alloc_inode+0xe

I expect that sock_mnt->mnt_sb is null. Or perhaps sb->s_op. Perhaps
sockfs hasn't mounted yet.

The oops doesn't happen on mainline - it only happens on linux-next. So
there may be some interaction there, but it may only be timing related.

config: http://ozlabs.org/~akpm/stuff/config-akpm2
Re: [Devel] [PATCH 1/2] kobject: don't block for each kobject_uevent
On Sun, 9 Feb 2014 14:56:15 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Currently kobject_uevent has somewhat unpredictable semantics. The point
> is, since it may call a usermode helper and wait for it to execute
> (UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
> it will introduce for the caller - strictly speaking, it depends on what
> fs the binary is located on and the set of locks fork may take. There
> are quite a few users of kobject_uevent that do not take this into
> account and call it with various mutexes taken, e.g. rtnl_mutex,
> net_mutex, which might potentially lead to a deadlock.
>
> Since there is actually no reason to wait for the usermode helper to
> execute there, let's make kobject_uevent start the helper asynchronously
> with the aid of the UMH_NO_WAIT flag.
>
> Personally, I'm interested in this because I really want kobject_uevent
> to be called under the slab_mutex in the slub implementation, as it used
> to be some time ago, because it greatly simplifies synchronization and
> automatically fixes a kmemcg-related race. However, a deadlock was
> detected on an attempt to call kobject_uevent under the slab_mutex (see
> https://lkml.org/lkml/2012/1/14/45), which was reported to be fixed by
> releasing the slab_mutex for kobject_uevent. Unfortunately, there was no
> information about who exactly blocked on the slab_mutex causing the
> usermode helper to stall, nor have I managed to find this out or
> reproduce the issue.
>
> BTW, this is not the first attempt to make kobject_uevent use
> UMH_NO_WAIT. A previous one was made by commit f520360d93c, but it was
> wrong (it passed arguments allocated on the stack to an async thread),
> so it was reverted (commit 05f54c13cd0c). It was targeted at speeding up
> the boot process, though.

The patches look good to me. One is kobject (Greg) and the other is slub
(Pekka), so I grabbed them ;) Reviews-and-acks, please?

btw, when referring to commits, please use the form

  f520360d93c ("kobject: don't block for each kobject_uevent")

because the same commit can have different hashes in different trees.
(Although I suspect the amount of convenience this provides others
doesn't match the amount of time I spend fixing changelogs!)
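The stack-argument bug that sank f520360d93c generalizes beyond usermode
helpers. A hypothetical userspace analogue, just to illustrate the
lifetime rule that UMH_NO_WAIT imposes on its arguments:

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	static void *helper(void *arg)
	{
		printf("env: %s\n", (char *)arg);
		free(arg);		/* the worker owns the copy */
		return NULL;
	}

	static void spawn_async(pthread_t *t, const char *env)
	{
		/*
		 * Passing a pointer into the caller's stack frame would
		 * be the f520360d93c bug: with a no-wait helper the frame
		 * may be gone before the helper runs.  Hence the heap
		 * copy, owned by the helper.
		 */
		pthread_create(t, NULL, helper, strdup(env));
	}

	int main(void)
	{
		pthread_t t;

		spawn_async(&t, "ACTION=add");
		pthread_join(t, NULL);
		return 0;
	}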
Re: [Devel] [PATCH 2/3] mm: vmscan: get rid of DEFAULT_SEEKS and document shrink_slab logic
On Wed, 5 Feb 2014 11:16:49 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> > So why did I originally make DEFAULT_SEEKS=2? Because I figured that
> > to recreate (say) an inode would require a seek to the inode data then
> > a seek back. Is it legitimate to include the
> > seek-back-to-what-you-were-doing-before seek in the cost of an inode
> > reclaim? I guess so...
>
> Hmm, that explains this 2. Since we typically don't need to seek back
> when recreating a cache page, as they are usually read in bunches by
> readahead, the number of seeks to bring back a user page is 1, while the
> number of seeks to recreate an average inode is 2, right?

Sounds right to me.

> Then to scan inodes and user pages so that they would generate
> approximately the same number of seeks, we should calculate the number
> of objects to scan as follows:
>
>   nr_objects_to_scan = nr_pages_scanned / lru_pages *
>                        nr_freeable_objects / shrinker->seeks
>
> where shrinker->seeks = DEFAULT_SEEKS = 2 for inodes.

hm, I wonder if we should take the size of the object into account.
Should we be maximizing (memory-reclaimed / seeks-to-reestablish-it)?

> But currently we have four times that. I can explain why we should
> multiply this by 2 - we do not count pages moving from the active to the
> inactive lru in nr_pages_scanned, and 2*nr_pages_scanned can be a good
> approximation for that - but I have no idea why we multiply it by 4...

I don't understand this code at all:

	total_scan = nr;
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= freeable;
	do_div(delta, lru_pages + 1);
	total_scan += delta;

If it actually makes any sense, it sorely, sorely needs documentation.

David, you touched it last. Any hints?
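A userspace model of the arithmetic under discussion, with made-up
inputs, showing the factor-of-4 gap between the code and the "ideal"
one-seek-per-page formula from the thread:

	#include <stdio.h>

	int main(void)
	{
		unsigned long nr_pages_scanned = 1000;
		unsigned long lru_pages = 100000;
		unsigned long freeable = 50000;	/* shrinker->count() */
		unsigned long seeks = 2;	/* DEFAULT_SEEKS */

		/* what shrink_slab_node() computes */
		unsigned long delta = (4 * nr_pages_scanned) / seeks;
		delta = delta * freeable / (lru_pages + 1);

		/* the "ideal" formula quoted above */
		unsigned long ideal = nr_pages_scanned * freeable
					/ (lru_pages * seeks);

		printf("code scans %lu, formula says %lu (~4x)\n",
		       delta, ideal);
		return 0;
	}

With these inputs the code arrives at ~999 objects versus 250 for the
formula - the unexplained multiply-by-4 in one picture.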
Re: [Devel] [PATCH 2/3] mm: vmscan: get rid of DEFAULT_SEEKS and document shrink_slab logic
On Fri, 17 Jan 2014 23:25:30 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> Each shrinker must define the number of seeks it takes to recreate a
> shrinkable cache object. It is used to balance slab reclaim vs page
> reclaim: assuming it costs one seek to replace an LRU page, we age equal
> percentages of the LRU and ageable caches. So far, everything sounds
> clear, but the code implementing this behavior is rather confusing.
>
> First, there is the DEFAULT_SEEKS constant, which equals 2 for some
> reason:
>
>   #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>
> Most shrinkers define `seeks' to be equal to DEFAULT_SEEKS, some use
> DEFAULT_SEEKS*N, and there are a few that totally ignore it. What is
> peculiar, the dcache and icache shrinkers have seeks=DEFAULT_SEEKS
> although recreating an inode typically requires one seek. Does this mean
> that we scan twice as many inodes as we should? Actually, no. The point
> is that vmscan handles DEFAULT_SEEKS as if it were 1 (`delta' is the
> number of objects we are going to scan):
>
>   shrink_slab_node():
>
>	delta = (4 * nr_pages_scanned) / shrinker->seeks;
>	delta *= freeable;
>	do_div(delta, lru_pages + 1);
>
> i.e.
>
>	    2 * nr_pages_scanned    DEFAULT_SEEKS
>   delta = -------------------- * --------------- * freeable;
>		  lru_pages	   shrinker->seeks
>
> Here we double the number of pages scanned in order to take into account
> moves of pages between the lru lists, which we do not count in
> nr_pages_scanned.
>
> That said, shrinker->seeks=DEFAULT_SEEKS*N is equivalent to N seeks, so
> why on the hell do we need it? IMO, the existence of the DEFAULT_SEEKS
> constant only causes confusion for both users of the shrinker interface
> and those trying to understand how slab shrinking works. The meaning of
> `seeks' is perfectly explained by the comment to it and there is no need
> in any obscure constants for using it.
>
> That's why I'm sending this patch, which completely removes
> DEFAULT_SEEKS and makes all shrinkers use N instead of N*DEFAULT_SEEKS,
> documenting the idea lying behind shrink_slab() in the meanwhile.
>
> Unfortunately, there are a few shrinkers that define seeks=1, which is
> impossible to transfer to the new interface intact, namely:
>
>   nfsd_reply_cache_shrinker
>   ttm_pool_manager::mm_shrink
>   ttm_dma_pool_manager::mm_shrink
>   dm_bufio_client::shrinker
>
> It seems to me their authors were simply deceived by this mysterious
> DEFAULT_SEEKS constant, because I've found no documentation why these
> particular caches should be scanned more aggressively than the page and
> other slab caches. For them, this patch leaves seeks=1. Thus, it DOES
> introduce a functional change: the shrinkers enumerated above will be
> scanned half as intensively as they are now. I do not think that this
> will cause any problems though.

um, yes. DEFAULT_SEEKS is supposed to be the number of seeks if you don't
know any better. Using DEFAULT_SEEKS*n is just daft.

So why did I originally make DEFAULT_SEEKS=2? Because I figured that to
recreate (say) an inode would require a seek to the inode data then a
seek back. Is it legitimate to include the
seek-back-to-what-you-were-doing-before seek in the cost of an inode
reclaim? I guess so...

If a filesystem were to require a seek to the superblock for every inode
read (ok, bad example) then the cost of reestablishing that inode would
be 3.

All that being said, why did you go through and halve everything? The
cost of reestablishing an ext2 inode should be 2 seeks, but the patch
makes it 1.
Re: [Devel] [PATCH v2 2/7] memcg, slab: cleanup memcg cache name creation
On Mon, 3 Feb 2014 19:54:37 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> The way memcg_create_kmem_cache() creates the name for a memcg cache
> looks rather strange: it first formats the name in the static buffer
> tmp_name protected by a mutex, then passes the pointer to the buffer to
> kmem_cache_create_memcg(), which finally duplicates it to the cache
> name.
>
> Let's clean this up by moving memcg cache name creation to a separate
> function to be called by kmem_cache_create_memcg(), and estimating the
> length of the name string before copying anything to it so that we won't
> need a temporary buffer.
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3193,6 +3193,37 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
>  	return 0;
>  }
>  
> +static int memcg_print_cache_name(char *buf, size_t size,
> +		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
> +{
> +	int ret;
> +
> +	rcu_read_lock();
> +	ret = snprintf(buf, size, "%s(%d:%s)", root_cache->name,
> +		       memcg_cache_id(memcg), cgroup_name(memcg->css.cgroup));
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +char *memcg_create_cache_name(struct mem_cgroup *memcg,
> +		struct kmem_cache *root_cache)
> +{
> +	int len;
> +	char *name;
> +
> +	/*
> +	 * We cannot use kasprintf() here, because cgroup_name() must be
> +	 * called under RCU protection.
> +	 */
> +	len = memcg_print_cache_name(NULL, 0, memcg, root_cache);
> +
> +	name = kmalloc(len + 1, GFP_KERNEL);
> +	if (name)
> +		memcg_print_cache_name(name, len + 1, memcg, root_cache);

but but but this assumes that cgroup_name(memcg->css.cgroup) did not
change between the two calls to memcg_print_cache_name(). If that is the
case then the locking was unneeded anyway.

> +	return name;
> +}
> +
>  int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
>  			     struct kmem_cache *root_cache)
>  {
> @@ -3397,44 +3428,6 @@ void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
>  	schedule_work(&cachep->memcg_params->destroy);
>  }
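A userspace demo of the measure-then-format pattern being discussed; C99
guarantees that snprintf(NULL, 0, ...) returns the would-be length. Note
that the second pass is still size-bounded, so a cgroup name that
changes between the two calls yields a truncated result rather than an
overrun - which is the race Andrew is pointing at:

	#include <stdio.h>
	#include <stdlib.h>

	static char *make_name(const char *root, int id, const char *cg)
	{
		int len = snprintf(NULL, 0, "%s(%d:%s)", root, id, cg);
		char *name = malloc(len + 1);

		if (name)
			snprintf(name, len + 1, "%s(%d:%s)", root, id, cg);
		return name;
	}

	int main(void)
	{
		char *n = make_name("dentry", 3, "container1");

		puts(n ? n : "(alloc failed)");
		free(n);
		return 0;
	}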
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Wed, 15 Jan 2014 12:47:35 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> On 01/15/2014 02:14 AM, Andrew Morton wrote:
> > On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov wrote:
> >> On 01/14/2014 03:05 AM, Andrew Morton wrote:
> >>> That being said, I think I'll schedule this patch as-is for 3.14.
> >>> Can you please take a look at implementing the simpler approach,
> >>> send me something for 3.15-rc1?
> >>
> >> IMHO the simpler approach (Glauber's patch) is not suitable as is,
> >> because it, in fact, neglects the notion of batch_size when doing low
> >> prio scans, because it calls ->scan() for batch_size objects even if
> >> the slab has < batch_size objects, while AFAIU it should accumulate a
> >> sufficient number of objects to scan in nr_deferred instead.
> >
> > Well. If you mean that when nr-objects=large and batch_size=32 and
> > total_scan=33, the patched code will scan 32 objects and then 1 object
> > then yes, that should be fixed.
>
> I mean if nr_objects=large and batch_size=32 and shrink_slab() is called
> 8 times with total_scan=4, we can either call ->scan() 8 times with
> nr_to_scan=4 (Glauber's patch) or call it only once with nr_to_scan=32
> (that's how it works now).
>
> Frankly, after a bit of thinking I am starting to doubt that this can
> affect performance at all provided the shrinker is implemented in a sane
> way, because as you've mentioned shrink_slab() is already a slow path.
> It seems I misunderstood the purpose of batch_size initially: I thought
> we needed it to limit the number of calls to ->scan(), but now I guess
> the only purpose of it is limiting the number of objects scanned in one
> pass to avoid latency issues.

Actually, the intent of batching is to limit the number of calls to
->scan(). At least, that was the intent when I wrote it!

This is a good principle and we should keep doing it. If we're going to
send the CPU away to tread on a pile of cold cachelines, we should make
sure that it does a good amount of work while it's there.

> But then another question arises - why do you think the behavior you
> described above (scanning 32 and then 1 object if total_scan=33,
> batch_size=32) is bad?

Yes, it's a bit inefficient, but it won't be too bad. What would be bad
would be to scan a very small number of objects and then to advance to
the next shrinker.

> In other words, why can't we make the scan loop look like this:
>
>	while (total_scan > 0) {
>		unsigned long ret;
>		unsigned long nr_to_scan = min(total_scan, batch_size);
>
>		shrinkctl->nr_to_scan = nr_to_scan;
>		ret = shrinker->scan_objects(shrinker, shrinkctl);
>		if (ret == SHRINK_STOP)
>			break;
>		freed += ret;
>
>		count_vm_events(SLABS_SCANNED, nr_to_scan);
>		total_scan -= nr_to_scan;
>
>		cond_resched();
>	}

Well, if we come in here with total_scan=1 then we defeat the original
intent of the batching, don't we? We end up doing a lot of work just to
scan one object.

So perhaps add something like

	if (total_scan < batch_size && max_pass > batch_size)
		skip the while loop

If we do this, total_scan will be accumulated into nr_deferred, up to the
point where the threshold is exceeded, yes? All the arithmetic in there
hurts my brain and I don't know what values total_scan typically ends up
with.

btw, all that trickery with delta and lru_pages desperately needs
documenting. What the heck is it intended to do??

We could avoid the scan-32-then-scan-just-1 issue with something like

	if (total_scan > batch_size)
		total_scan %= batch_size;

before the loop. But I expect the effects of that will be unmeasurable -
on average the number of objects which are scanned in the final pass of
the loop will be batch_size/2, yes? That's still a decent amount.
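For concreteness, a userspace model of the min()-based loop quoted
above: with total_scan=33 and batch_size=32 the final pass scans a
single object, which is exactly the inefficiency being weighed here:

	#include <stdio.h>

	int main(void)
	{
		unsigned long total_scan = 33, batch_size = 32;

		while (total_scan > 0) {
			unsigned long nr_to_scan =
				total_scan < batch_size ? total_scan
							: batch_size;
			printf("scan_objects(nr_to_scan=%lu)\n", nr_to_scan);
			total_scan -= nr_to_scan;
		}
		return 0;
	}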
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Wed, 15 Jan 2014 19:55:11 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> > We could avoid the scan-32-then-scan-just-1 issue with something like
> >
> >	if (total_scan > batch_size)
> >		total_scan %= batch_size;
> >
> > before the loop. But I expect the effects of that will be
> > unmeasurable - on average the number of objects which are scanned in
> > the final pass of the loop will be batch_size/2, yes? That's still a
> > decent amount.
>
> Let me try to summarize. We want to scan batch_size objects in one pass,
> not more (to keep latency low) and not less (to avoid cpu cache
> pollution due to too frequent calls); if the calculated value of
> nr_to_scan is less than the batch_size we should accumulate it in
> nr_deferred instead of calling ->scan() and add nr_deferred to
> nr_to_scan on the next pass, i.e. in pseudo-code:
>
>	/* calculate current nr_to_scan */
>	max_pass = shrinker->count();
>	delta = max_pass * nr_user_pages_scanned / nr_user_pages;
>
>	/* add nr_deferred */
>	total_scan = delta + nr_deferred;
>
>	while (total_scan >= batch_size) {
>		shrinker->scan(batch_size);
>		total_scan -= batch_size;
>	}
>
>	/* save the remainder to nr_deferred */
>	nr_deferred = total_scan;
>
> That would work, but if max_pass is < batch_size, it would not scan the
> objects immediately even if prio is high (we want to scan all objects).

Yes, that's a problem.

> For example, dropping caches would not work on the first attempt - the
> user would have to call it batch_size / max_pass times.

And we do want drop_caches to work immediately.

> This could be fixed by making the code proceed to ->scan() not only if
> total_scan is >= batch_size, but also if max_pass is < batch_size and
> total_scan is >= max_pass, i.e.
>
>	while (total_scan >= batch_size ||
>	       (max_pass < batch_size && total_scan >= max_pass))
>		...
>
> which is equivalent to
>
>	while (total_scan >= batch_size || total_scan >= max_pass)
>		...
>
> The latter is the loop condition from the current patch, i.e. this patch
> would do the trick if shrink_slab() followed the pseudo-code above. In
> real life, it does not actually - we have to bias total_scan before the
> while loop in order to avoid dropping fs meta caches on light memory
> pressure due to a large number being built in nr_deferred:
>
>	if (delta < max_pass / 4)
>		total_scan = min(total_scan, max_pass / 2);

Oh, is that what it's for. Where did you discover this gem?

>	while (total_scan >= batch_size)
>		...
>
> With this biasing, it is impossible to achieve the ideal behavior I've
> described above, because we will never accumulate max_pass objects in
> nr_deferred if memory pressure is low. So, if applied to the real code,
> this patch takes on a slightly different sense, which I tried to reflect
> in the comment to the code: it will call ->scan() with nr_to_scan <
> batch_size only if:
>
> 1) max_pass < batch_size && total_scan >= max_pass
>
> and
>
> 2) we're tight on memory, i.e. the current delta is high (otherwise
>    total_scan will be biased as max_pass / 2 and condition 1 won't be
>    satisfied).

(is max_pass misnamed?)

> From our discussion it seems condition 2 is not necessary at all, but it
> follows directly from the biasing rule. So I propose to tweak the
> biasing a bit so that total_scan won't be lowered below batch_size:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eea668d..78ddd5e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -267,7 +267,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  	 * a large delta change is calculated directly.
>  	 */
>  	if (delta < max_pass / 4)
> -		total_scan = min(total_scan, max_pass / 2);
> +		total_scan = min(total_scan, max(max_pass / 2, batch_size));
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
> @@ -281,7 +281,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  			nr_pages_scanned, lru_pages,
>  			max_pass, delta, total_scan);
>  
> -	while (total_scan >= batch_size) {
> +	while (total_scan >= batch_size || total_scan >= max_pass) {
>  		unsigned long ret;
>  		shrinkctl->nr_to_scan = batch_size;
>
> The first hunk guarantees that total_scan will always accumulate at
> least batch_size objects no matter how small max_pass is. That means
> that when max_pass is < batch_size we will eventually get >= max_pass
> objects to scan and shrink the slab to 0 as we need. What do you think
> about that?

I'm a bit lost :(
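A userspace model of the accumulate-then-scan behaviour in the
pseudo-code above (all numbers made up): under light pressure each call
defers its small delta, and ->scan() only fires once a full batch has
built up:

	#include <stdio.h>

	#define BATCH_SIZE 32UL

	static unsigned long nr_deferred;

	static void shrink_once(unsigned long delta, unsigned long max_pass)
	{
		unsigned long total_scan = delta + nr_deferred;

		while (total_scan >= BATCH_SIZE ||
		       (max_pass < BATCH_SIZE && total_scan >= max_pass)) {
			unsigned long nr = total_scan < BATCH_SIZE ?
						total_scan : BATCH_SIZE;
			printf("scan(%lu)\n", nr);
			total_scan -= nr;
		}
		nr_deferred = total_scan;  /* remainder carried forward */
	}

	int main(void)
	{
		/* eight light-pressure calls: one scan(32) fires, on the
		 * last call, once the deferred deltas add up to a batch */
		for (int i = 0; i < 8; i++)
			shrink_once(4, 1000);
		return 0;
	}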
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Tue, 14 Jan 2014 11:23:30 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> On 01/14/2014 03:05 AM, Andrew Morton wrote:
> > On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov wrote:
> >> When reclaiming kmem, we currently don't scan slabs that have less
> >> than batch_size objects (see shrink_slab_node()):
> >>
> >>	while (total_scan >= batch_size) {
> >>		shrinkctl->nr_to_scan = batch_size;
> >>		shrinker->scan_objects(shrinker, shrinkctl);
> >>		total_scan -= batch_size;
> >>	}
> >>
> >> If there are only a few shrinkers available, such a behavior won't
> >> cause any problems, because the batch_size is usually small, but if
> >> we have a lot of slab shrinkers, which is perfectly possible since FS
> >> shrinkers are now per-superblock, we can end up with hundreds of
> >> megabytes of practically unreclaimable kmem objects. For instance,
> >> mounting a thousand ext2 FS images with a hundred files in each and
> >> iterating over all the files using du(1) will result in about 200 Mb
> >> of FS caches that cannot be dropped even with the aid of the
> >> vm.drop_caches sysctl!
> >
> > True. I suspect this was an accidental consequence of the chosen
> > implementation. As you mentioned, I was thinking that the caches would
> > all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just
> > didn't matter.
> >
> >> This problem was initially pointed out by Glauber Costa [*]. Glauber
> >> proposed to fix it by making shrink_slab() always take at least one
> >> pass, to put it simply, turning the scan loop above into a do{}while()
> >> loop. However, this proposal was rejected, because it could result in
> >> more aggressive and frequent slab shrinking even under low memory
> >> pressure when total_scan is naturally very small.
> >
> > Well, it wasn't rejected - Mel pointed out that Glauber's change could
> > potentially trigger problems which already exist in shrinkers. The
> > potential issues seem pretty unlikely to me, and they're things we can
> > fix up if they eventuate.
>
> When preparing this patch, I considered not the problems that
> potentially exist in some shrinkers, but the issues that an
> unconditional scan of batch_size objects might trigger for any shrinker:
>
> 1) We would call shrinkers more frequently,

hm, why?

> which could possibly increase contention on shrinker-internal locks. The
> point is that under very light memory pressure, when we can fulfill the
> allocation request after a few low-prio scans, we would not call slab
> shrinkers at all; instead we would only add the delta to nr_deferred in
> order to keep slab-vs-pagecache reclaim balanced. Glauber's original
> patch changes this behavior - it makes shrink_slab() always call the
> shrinker at least once, even if the current delta is negligible. I'm
> afraid this might affect performance. Note, this is irrespective of how
> many objects the shrinker has to reclaim (< or >= batch_size).

I doubt if it affects performance much at all - memory reclaim in general
is a slow path.

> 2) As Mel Gorman pointed out
>    (http://thread.gmane.org/gmane.linux.kernel.mm/99059):
>
>    It's possible for caches to shrink to zero where before the last
>    SHRINK_SLAB objects would often be protected for any slab. If this is
>    an inode or dentry cache and there are very few objects then it's
>    possible that objects will be reclaimed before they can be used by
>    the process allocating them.

I don't understand that one. It appears to assume that vfs code will
allocate and initialise a dentry or inode, will put it in cache and
release all references to it and then will look it up again and start
using it. vfs doesn't work like that - it would be crazy to do so. It
will instead hold references to those objects while using them. Mainly to
protect them from reclaim.

And if there were any problems of this nature, they would already be
demonstrable with a sufficiently large number of threads/cpus.

> > So I'm thinking we should at least try Glauber's approach - it's a bit
> > weird that we should treat the final 0 .. batch_size-1 objects in a
> > different manner from all the others.
>
> It's not exactly that we treat the final 0 .. batch_size-1 objects
> differently from others. We rather try to accumulate at least batch_size
> objects before calling ->scan().

And if there are < batch_size objects, they never get scanned. That's
different treatment.

> > That being said, I think I'll schedule this patch as-is for 3.14. Can
> > you please take a look at implementing the simpler approach, send me
> > something for 3.15-rc1?
>
> IMHO the simpler approach (Glauber's patch) is not suitable as is,
> because it, in fact, neglects the notion of batch_size when doing low
> prio scans, because it calls ->scan() for batch_size objects even if the
> slab has < batch_size objects, while AFAIU it should accumulate a
> sufficient number of objects to scan in nr_deferred instead.

Well. If you mean that when nr-objects=large and batch_size=32 and
total_scan=33, the patched code will scan 32 objects and then 1 object
then yes, that should be fixed.
Re: [Devel] [PATCH 1/5] mm: vmscan: shrink all slab objects if tight on memory
On Sat, 11 Jan 2014 16:36:31 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> When reclaiming kmem, we currently don't scan slabs that have less than
> batch_size objects (see shrink_slab_node()):
>
>	while (total_scan >= batch_size) {
>		shrinkctl->nr_to_scan = batch_size;
>		shrinker->scan_objects(shrinker, shrinkctl);
>		total_scan -= batch_size;
>	}
>
> If there are only a few shrinkers available, such a behavior won't cause
> any problems, because the batch_size is usually small, but if we have a
> lot of slab shrinkers, which is perfectly possible since FS shrinkers
> are now per-superblock, we can end up with hundreds of megabytes of
> practically unreclaimable kmem objects. For instance, mounting a
> thousand ext2 FS images with a hundred files in each and iterating over
> all the files using du(1) will result in about 200 Mb of FS caches that
> cannot be dropped even with the aid of the vm.drop_caches sysctl!

True. I suspect this was an accidental consequence of the chosen
implementation. As you mentioned, I was thinking that the caches would
all be large, and the remaining 1 .. SHRINK_BATCH-1 objects just didn't
matter.

> This problem was initially pointed out by Glauber Costa [*]. Glauber
> proposed to fix it by making shrink_slab() always take at least one
> pass, to put it simply, turning the scan loop above into a do{}while()
> loop. However, this proposal was rejected, because it could result in
> more aggressive and frequent slab shrinking even under low memory
> pressure when total_scan is naturally very small.

Well, it wasn't rejected - Mel pointed out that Glauber's change could
potentially trigger problems which already exist in shrinkers. The
potential issues seem pretty unlikely to me, and they're things we can
fix up if they eventuate.

So I'm thinking we should at least try Glauber's approach - it's a bit
weird that we should treat the final 0 .. batch_size-1 objects in a
different manner from all the others.

That being said, I think I'll schedule this patch as-is for 3.14. Can you
please take a look at implementing the simpler approach, send me
something for 3.15-rc1?
Re: [Devel] [PATCH 3/5] mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
On Sat, 11 Jan 2014 16:36:33 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> When direct reclaim is executed by a process bound to a set of NUMA
> nodes, we should scan only those nodes when possible, but currently we
> will scan kmem from all online nodes even if the kmem shrinker is NUMA
> aware. That said, binding a process to a particular NUMA node won't
> prevent it from shrinking inode/dentry caches from other nodes, which is
> not good. Fix this.

Seems right. I worry that reducing the amount of shrinking which
node-bound processes perform might affect workloads in unexpected ways.

I think I'll save this one for 3.15-rc1, OK?
Re: [Devel] [PATCH 4/5] mm: vmscan: move call to shrink_slab() to shrink_zones()
On Sat, 11 Jan 2014 16:36:34 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> This reduces the indentation level of do_try_to_free_pages() and removes
> the extra loop over all eligible zones counting the number of on-LRU
> pages.

So this should cause no functional change, yes?
Re: [Devel] [PATCH v11 00/15] kmemcg shrinkers
On Tue, 26 Nov 2013 16:55:43 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> What do you think about splitting this set into two main series as
> follows:
>
> 1) Prepare vmscan for kmemcg-aware shrinkers; would include patches 1-7
>    of this set.
> 2) Make fs shrinkers memcg-aware; would include patches 9-11 of this set

Please just resend everything.

> and leave other patches, which are rather for optimization/extending
> functionality, for the future?

It will be helpful to describe such splitting opportunities in the [0/n]
changelog.
Re: [Devel] [PATCH] mm: strictlimit feature -v4
On Wed, 21 Aug 2013 17:56:32 +0400 Maxim Patlasov mpatla...@parallels.com wrote:

> The feature prevents mistrusted filesystems from growing a large number
> of dirty pages before throttling. For such filesystems
> balance_dirty_pages always checks bdi counters against bdi limits. I.e.
> even if global nr_dirty is under freerun, it's not allowed to skip bdi
> checks. The only use case for now is fuse: it sets bdi max_ratio to 1%
> by default and system administrators are supposed to expect that this
> limit won't be exceeded.
>
> The feature is on if a BDI is marked by the BDI_CAP_STRICTLIMIT flag. A
> filesystem may set the flag when it initializes its BDI.

Now I think about it, I don't really understand the need for this
feature. Can you please go into some detail about the problematic
scenarios and why they need fixing? Including an expanded description of
the term "mistrusted filesystem"?

Is this some theoretical happens-in-the-lab thing, or are real-world
users actually hurting due to the lack of this feature?

I think I'll apply it to -mm for now to get a bit of testing, but would
very much like it if Fengguang could find time to review the
implementation, please.
Re: [Devel] [PATCH] mqueue: sys_mq_open: do not call mnt_drop_write() if read-only
On Tue, 19 Mar 2013 13:31:18 +0400 Vladimir Davydov vdavy...@parallels.com wrote:

> mnt_drop_write() must be called only if mnt_want_write() succeeded,
> otherwise the mnt_writers counter will diverge.
>
> ...
>
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -840,7 +840,8 @@ out_putfd:
>  		fd = error;
>  	}
>  	mutex_unlock(&root->d_inode->i_mutex);
> -	mnt_drop_write(mnt);
> +	if (!ro)
> +		mnt_drop_write(mnt);
>  out_putname:
>  	putname(name);
>  	return fd;

huh, that's been there for a while. What were the runtime-visible effects
of the bug?
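A hedged kernel-style sketch of the invariant the fix restores (the
function name is made up): mnt_drop_write() is only valid on the path
where mnt_want_write() returned 0.

	static int frob_mnt(struct vfsmount *mnt)
	{
		int ro;

		/* 0 on success; -EROFS etc. on a read-only mount */
		ro = mnt_want_write(mnt);

		/* ... work that may be legitimate even when ro ... */

		if (!ro)
			mnt_drop_write(mnt);	/* drop only what was taken */
		return ro;
	}

Calling mnt_drop_write() unconditionally, as the old code did, decrements
mnt_writers for a reference that was never taken.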
Re: [Devel] [RFC PATCH v8 0/5] IPC: checkpoint/restore in userspace enhancements
On Thu, 20 Dec 2012 08:06:32 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> 19.12.2012 00:36, Andrew Morton wrote:
> > On Wed, 24 Oct 2012 19:34:51 +0400 Stanislav Kinsbursky wrote:
> >> This respin of the patch set was significantly reworked. Most of the
> >> new API was replaced by sysctls (one each for messages, semaphores
> >> and shared memory), allowing to preset the desired id for the next
> >> new IPC object.
> >>
> >> This patch set is aimed to provide additional functionality for all
> >> IPC objects, which is required for migration of these objects by
> >> user-space checkpoint/restore utils (CRIU).
> >>
> >> The main problem here was the impossibility to set up an object id.
> >> This patch set solves the problem by adding new sysctls for presetting
> >> the desired id for a new IPC object.
> >>
> >> Another problem was how to peek messages from queues without deleting
> >> them. This was achieved by introducing a new MSG_COPY flag for
> >> sys_msgrcv(). If the MSG_COPY flag is set, then msgtyp is interpreted
> >> as a message number.
> >
> > According to my extensive records, Sasha hit a bug in
> > ipc-message-queue-copy-feature-introduced.patch and Fengguang found a
> > bug in
> > ipc-message-queue-copy-feature-introduced-cleanup-do_msgrcv-aroung-msg_copy-feature.patch
> >
> > It's not obvious (to me) that these things have been identified and
> > fixed. What's the status, please?
>
> Hello, Andrew.
> Fengguang's issue was solved by "ipc: simplify message copying" I sent
> you. But I can't find Sasha's issue. As I remember, there was some
> problem in an early version of the patch set. But I believe it's fixed
> now.

http://lkml.indiana.edu/hypermail/linux/kernel/1210.3/01710.html
Subject: ipc, msgqueue: NULL ptr deref in msgrcv
Re: [Devel] [RFC PATCH v8 0/5] IPC: checkpoint/restore in userspace enhancements
On Wed, 24 Oct 2012 19:34:51 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> This respin of the patch set was significantly reworked. Most of the new
> API was replaced by sysctls (one each for messages, semaphores and
> shared memory), allowing to preset the desired id for the next new IPC
> object.
>
> This patch set is aimed to provide additional functionality for all IPC
> objects, which is required for migration of these objects by user-space
> checkpoint/restore utils (CRIU).
>
> The main problem here was the impossibility to set up an object id. This
> patch set solves the problem by adding new sysctls for presetting the
> desired id for a new IPC object.
>
> Another problem was how to peek messages from queues without deleting
> them. This was achieved by introducing a new MSG_COPY flag for
> sys_msgrcv(). If the MSG_COPY flag is set, then msgtyp is interpreted as
> a message number.

According to my extensive records, Sasha hit a bug in
ipc-message-queue-copy-feature-introduced.patch and Fengguang found a bug
in
ipc-message-queue-copy-feature-introduced-cleanup-do_msgrcv-aroung-msg_copy-feature.patch

It's not obvious (to me) that these things have been identified and
fixed. What's the status, please?
[Devel] Re: [PATCH v8 2/5] ipc: add sysctl to specify desired next object id
On Wed, 24 Oct 2012 19:35:09 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> This patch adds 3 new variables and sysctls to tune them (one next_id
> variable each for messages, semaphores and shared memory respectively).
> This variable can be used to set the desired id for the next allocated
> IPC object. By default it's equal to -1 and the old behaviour is
> preserved. If this variable is non-negative, then the desired id will be
> extracted from it and used as a start value to search for a free IDR
> slot.
>
> Notes:
> 1) this patch doesn't guarantee that the new object will have the
>    desired id. So it's up to user space how to handle a new object with
>    the wrong id.
> 2) After a successful id allocation attempt, next_id will be set back to
>    -1 (if it was non-negative).
>
> --- a/ipc/ipc_sysctl.c
> +++ b/ipc/ipc_sysctl.c
> @@ -158,6 +158,7 @@ static int proc_ipcauto_dointvec_minmax(ctl_table *table, int write,
>  
>  static int zero;
>  static int one = 1;
> +static int int_max = INT_MAX;
>  
>  static struct ctl_table ipc_kern_table[] = {
>  	{
> @@ -227,6 +228,33 @@ static struct ctl_table ipc_kern_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> +	{
> +		.procname	= "sem_next_id",
> +		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
> +		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
> +		.mode		= 0644,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max,
> +	},
> +	{
> +		.procname	= "msg_next_id",
> +		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
> +		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
> +		.mode		= 0644,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max,
> +	},
> +	{
> +		.procname	= "shm_next_id",
> +		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
> +		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
> +		.mode		= 0644,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max,
> +	},
>  	{}
>  };

ipc_kern_table[] is (badly) documented in Documentation/sysctl/kernel.txt.
Can we at least mention these controls in there?

Better, create a new way of properly documenting each control and
document these three in that manner? Better still, document all the other
ones as well ;)

The patch adds these controls to CONFIG_CHECKPOINT_RESTORE=n kernels. Why
is this?
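A userspace demo of how the interface is meant to be used (assuming the
sysctls land under /proc/sys/kernel/; as the notes above say, the id is
desired, not guaranteed, so the caller must check the result):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	int main(void)
	{
		int fd = open("/proc/sys/kernel/sem_next_id", O_WRONLY);

		if (fd < 0 || write(fd, "42", 2) != 2) {
			perror("sem_next_id");
			return 1;
		}
		close(fd);

		/* the next semaphore set should (but need not) get id 42 */
		int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

		printf("semget returned id %d\n", id);
		return 0;
	}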
[Devel] Re: [PATCH v8 4/5] ipc: message queue copy feature introduced
On Wed, 24 Oct 2012 19:35:20 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote:

> This patch is required for checkpoint/restore in userspace. IOW, c/r
> requires some way to get all pending IPC messages without deleting them
> from the queue (checkpoint can fail and in this case tasks will be
> resumed, so the queue has to be valid). To achieve this, a new operation
> flag MSG_COPY for the sys_msgrcv() system call was introduced. If this
> flag is specified, then mtype is interpreted as the number of the
> message to copy. If MSG_COPY is set, then the kernel will allocate a
> dummy message with the passed size, and then use the new copy_msg()
> helper function to copy the desired message (instead of unlinking it
> from the queue).
>
> ...
>
> @@ -777,19 +777,48 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>  	struct msg_msg *msg;
>  	int mode;
>  	struct ipc_namespace *ns;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	struct msg_msg *copy = NULL;
> +	unsigned long copy_number = 0;
> +#endif
>  
>  	if (msqid < 0 || (long) bufsz < 0)
>  		return -EINVAL;
> +	if (msgflg & MSG_COPY) {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +
> +		if (msgflg & MSG_COPY) {

This test isn't needed.

> +			copy_number = msgtyp;
> +			msgtyp = 0;
> +		}
> +
> +		/*
> +		 * Create dummy message to copy real message to.
> +		 */
> +		copy = load_msg(buf, bufsz);
> +		if (IS_ERR(copy))
> +			return PTR_ERR(copy);
> +		copy->m_ts = bufsz;
> +#else
> +		return -ENOSYS;
> +#endif
> +	}
>  	mode = convert_mode(&msgtyp, msgflg);
>  	ns = current->nsproxy->ipc_ns;
>  
>  	msq = msg_lock_check(ns, msqid);
> -	if (IS_ERR(msq))
> +	if (IS_ERR(msq)) {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +		if (msgflg & MSG_COPY)
> +			free_msg(copy);
> +#endif
>  		return PTR_ERR(msq);
> +	}
>  
>  	for (;;) {
>  		struct msg_receiver msr_d;
>  		struct list_head *tmp;
> +		long msg_counter = 0;
>  
>  		msg = ERR_PTR(-EACCES);
>  		if (ipcperms(ns, &msq->q_perm, S_IRUGO))
> @@ -809,8 +838,16 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>  			if (mode == SEARCH_LESSEQUAL &&
>  					walk_msg->m_type != 1) {
>  				msgtyp = walk_msg->m_type - 1;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +			} else if (msgflg & MSG_COPY) {
> +				if (copy_number == msg_counter) {
> +					msg = copy_msg(walk_msg, copy);
> +					break;
> +				}
> +#endif
>  			} else
>  				break;
> +			msg_counter++;
>  		}
>  		tmp = tmp->next;
>  	}
> @@ -823,6 +860,10 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>  			msg = ERR_PTR(-E2BIG);
>  			goto out_unlock;
>  		}
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +		if (msgflg & MSG_COPY)
> +			goto out_unlock;
> +#endif
>  		list_del(&msg->m_list);
>  		msq->q_qnum--;
>  		msq->q_rtime = get_seconds();
> @@ -906,8 +947,13 @@ out_unlock:
>  			break;
>  		}
>  	}
> -	if (IS_ERR(msg))
> +	if (IS_ERR(msg)) {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +		if (msgflg & MSG_COPY)
> +			free_msg(copy);
> +#endif
>  		return PTR_ERR(msg);
> +	}
>  
>  	bufsz = msg_handler(buf, msg, bufsz);
>  	free_msg(msg);

It's all a bit ugly, but I don't really see much we can practically do
about that. You could add something like

	#ifdef CONFIG_CHECKPOINT_RESTORE
	static inline void free_copy(void *p, int msgflg, struct msg_msg *copy)
	{
		if (IS_ERR(p) && (msgflg & MSG_COPY))
			free_msg(copy);
	}
	#else
	/* As a macro because `copy' will be undefined */
	#define free_copy(p, msgflg, copy) do {} while (0)
	#endif

and use that in a couple of places. But that won't help much.
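A hedged userspace sketch of peeking with the new flag (requires a
CONFIG_CHECKPOINT_RESTORE kernel; the MSG_COPY value is copied from
<linux/msg.h> in case libc headers lack it, and pairing it with
IPC_NOWAIT avoids blocking when the requested message number doesn't
exist):

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/msg.h>

	#ifndef MSG_COPY
	#define MSG_COPY 040000		/* from <linux/msg.h> */
	#endif

	struct mbuf {
		long mtype;
		char mtext[128];
	};

	/* peek message number n without removing it from the queue */
	static int peek(int msqid, long n)
	{
		struct mbuf m;
		/* with MSG_COPY, the msgtyp argument is the message index */
		ssize_t len = msgrcv(msqid, &m, sizeof(m.mtext), n,
				     MSG_COPY | IPC_NOWAIT);

		if (len < 0) {
			perror("msgrcv");
			return -1;
		}
		printf("msg %ld: type %ld, %zd bytes\n", n, m.mtype, len);
		return 0;
	}

	int main(void)
	{
		int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
		struct mbuf m = { .mtype = 7, .mtext = "hello" };

		msgsnd(id, &m, sizeof("hello"), 0);
		peek(id, 0);	/* message 0 stays on the queue... */
		peek(id, 0);	/* ...so peeking again succeeds too */
		msgctl(id, IPC_RMID, NULL);
		return 0;
	}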
[Devel] Re: [PATCH v5 00/14] kmem controller for memcg.
On Thu, 18 Oct 2012 20:51:05 +0400 Glauber Costa glom...@parallels.com wrote:

> On 10/18/2012 02:11 AM, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 14:16:37 +0400 Glauber Costa wrote:
> >> ...
> >>
> >> A general explanation of what this is all about follows:
> >>
> >> The kernel memory limitation mechanism for memcg concerns itself with
> >> disallowing potentially non-reclaimable allocations to happen in
> >> exaggerate quantities by a particular set of processes (cgroup). Those
> >> allocations could create pressure that affects the behavior of a
> >> different and unrelated set of processes.
> >>
> >> Its basic working mechanism is to annotate some allocations with the
> >> _GFP_KMEMCG flag. When this flag is set, the current process
> >> allocating will have its memcg identified and charged against. When
> >> reaching a specific limit, further allocations will be denied.
> >
> > The need to set _GFP_KMEMCG is rather unpleasing, and makes one wonder
> > why didn't it just track all allocations.
>
> This was raised as well by Peter Zijlstra during the memcg summit.

Firstly: please treat any question from a reviewer as an indication that
information was missing from the changelog or from code comments. Ideally
all such queries are addressed in later versions of the patch and
changelog.

> The answer I gave to him still stands: there is a cost associated with
> it. We believe it comes down to a trade-off situation. How much does
> tracking a particular kind of allocation help vs how much does it cost?
> The free path is especially more expensive, since it will always incur a
> page_cgroup lookup.

OK. But that is a quantitative argument, without any quantities! Do we
have even an estimate of what this cost will be? Perhaps it's the case
that, if well implemented, that cost will be acceptable. How do we tell?

> > Does this mean that over time we can expect more sites to get the
> > _GFP_KMEMCG tagging?
>
> We have been doing kernel memory limitation for OpenVZ for a long time,
> using a quite different mechanism. What we do in this work (with slab
> included) allows us to achieve feature parity with that. It means it is
> good enough for production environments.

That's really good info.

> Whether or not more people will want other allocations to be tracked, I
> can't predict. What I can say is that stack + slab is a very significant
> part of the memory one potentially cares about, and if anyone else ever
> has the need for more, it will come down to a trade-off calculation.

OK.

> > If so, are there any special implications, or do we just go in, do the
> > one-line patch and expect everything to work?
>
> With the infrastructure in place, it shouldn't be hard. But it's not
> necessarily a one-liner either. It depends on what the practical
> considerations are for having that specific kind of allocation tied to a
> memcg. The slab, for instance, that follows this series, is far away
> from a one-liner: it is in fact a 19-patch series.
>
> > And how *accurate* is the proposed code? What percentage of kernel
> > memory allocations are unaccounted, typical case and worst case?
>
> With both patchsets applied, all memory used for the stack and most of
> the memory used for slab objects allocated in userspace process contexts
> are accounted. I honestly don't know what percentage of the total kernel
> memory this represents.

It sounds like the coverage will be good. What's left over? Random
get_free_pages() calls and interrupt-time slab allocations? I suppose
that there are situations in which network rx could consume significant
amounts of unaccounted memory?

> The accuracy for stack pages is very high: in this series, we don't move
> stack pages around when moving a task to other cgroups (for stack, it
> could be done), but other than that, all processes that pop up in a
> cgroup and stay there will have their memory accurately accounted.
>
> The slab is more complicated, and depends on the workload. It will be
> more accurate in workloads in which the level of object-sharing among
> cgroups is low. A container, for instance, is the perfect example of
> where this happens.
>
> > All sorts of questions come to mind over this decision, but it was
> > unexplained. It should be, please. A lot!
> >
> >> ...
> >>
> >> Limits lower than the user limit effectively mean there is a separate
> >> kernel memory limit that may be reached independently of the user
> >> limit. Values equal or greater than the user limit imply only that
> >> kernel memory is tracked. This provides a unified vision of maximum
> >> memory, be it kernel or user memory.
> >
> > I'm struggling to understand that text much at all. Reading the
> > Documentation/cgroups/memory.txt patch helped.
>
> Great. If you have any specific suggestions I can change that. Maybe I
> should just paste the documentation bit in here...

That's not a bad idea.
[Devel] Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
On Thu, 18 Oct 2012 13:24:47 +0400 Glauber Costa glom...@parallels.com wrote:

> On 10/18/2012 02:12 AM, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 14:16:44 +0400 Glauber Costa wrote:
> >> When a process tries to allocate a page with the __GFP_KMEMCG flag,
> >> the page allocator will call the corresponding memcg functions to
> >> validate the allocation. Tasks in the root memcg can always proceed.
> >>
> >> To avoid adding markers to the page - and a kmem flag that would
> >> necessarily follow, as much as doing page_cgroup lookups for no
> >> reason, whoever is marking its allocations with the __GFP_KMEMCG flag
> >> is responsible for telling the page allocator that this is such an
> >> allocation at free_pages() time.
> >
> > Well, why? Was that the correct decision?
>
> I don't fully understand your question. Is this the same question you
> posed in patch 0, about marking some versus marking all? If so, I
> believe I should have answered it there.

Yes, it's the same question. The one which has not yet been fully
answered ;)
[Devel] Re: [PATCH v5] slab: Ignore internal flags in cache creation
On Wed, 17 Oct 2012 15:36:51 +0400 Glauber Costa glom...@parallels.com wrote:

> Some flags are used internally by the allocators for management
> purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses
> to mark that the metadata for that cache is stored outside of the slab.
> No cache should ever pass those in as creation flags. We can just ignore
> this bit if it happens to be passed (such as when duplicating a cache in
> the kmem memcg patches).

I may be misunderstanding this, but...

If some caller to kmem_cache_create() is passing in bogus flags then
that's a bug, and it is undesirable to hide such a bug in this fashion?

> Because such flags can vary from allocator to allocator, we allow them
> to make their own decisions on that, defining SLAB_AVAILABLE_FLAGS with
> all flags that are valid at creation time. Allocators that don't have
> any specific flag requirement should define that to mean all flags.
> Common code will mask out all flags not belonging to that set.
[Devel] Re: [PATCH v5 00/14] kmem controller for memcg.
On Tue, 16 Oct 2012 14:16:37 +0400 Glauber Costa glom...@parallels.com wrote:

> ...
>
> A general explanation of what this is all about follows:
>
> The kernel memory limitation mechanism for memcg concerns itself with
> disallowing potentially non-reclaimable allocations to happen in
> exaggerate quantities by a particular set of processes (cgroup). Those
> allocations could create pressure that affects the behavior of a
> different and unrelated set of processes.
>
> Its basic working mechanism is to annotate some allocations with the
> _GFP_KMEMCG flag. When this flag is set, the current process allocating
> will have its memcg identified and charged against. When reaching a
> specific limit, further allocations will be denied.

The need to set _GFP_KMEMCG is rather unpleasing, and makes one wonder
why didn't it just track all allocations.

Does this mean that over time we can expect more sites to get the
_GFP_KMEMCG tagging? If so, are there any special implications, or do we
just go in, do the one-line patch and expect everything to work? If so,
why don't we go in and do that tagging right now?

And how *accurate* is the proposed code? What percentage of kernel memory
allocations are unaccounted, typical case and worst case?

All sorts of questions come to mind over this decision, but it was
unexplained. It should be, please. A lot!

> ...
>
> Limits lower than the user limit effectively mean there is a separate
> kernel memory limit that may be reached independently of the user limit.
> Values equal or greater than the user limit imply only that kernel
> memory is tracked. This provides a unified vision of maximum memory, be
> it kernel or user memory.

I'm struggling to understand that text much at all. Reading the
Documentation/cgroups/memory.txt patch helped.
[Devel] Re: [PATCH v5 01/14] memcg: Make it possible to use the stock for more than one page.
On Tue, 16 Oct 2012 14:16:38 +0400 Glauber Costa glom...@parallels.com wrote:

> From: Suleiman Souhlal ssouh...@freebsd.org
>
> We currently have a percpu stock cache scheme that charges one page at a
> time from memcg->res, the user counter. When the kernel memory
> controller comes into play, we'll need to charge more than that. This is
> because kernel memory allocations will also draw from the user counter,
> and can be bigger than a single page, as is the case with the stack
> (usually 2 pages) or some higher order slabs.
>
> ...
>
> -/*
> - * Try to consume stocked charge on this cpu. If success, one page is consumed
> - * from local stock and true is returned. If the stock is 0 or charges from a
> - * cgroup which is not current target, returns false. This stock will be
> - * refilled.
> +/**
> + * consume_stock: Try to consume stocked charge on this cpu.
> + * @memcg: memcg to consume from.
> + * @nr_pages: how many pages to charge.
> + *
> + * The charges will only happen if @memcg matches the current cpu's memcg
> + * stock, and at least @nr_pages are available in that stock. Failure to
> + * service an allocation will refill the stock.
> + *
> + * returns true if succesfull, false otherwise.

spello

>   */
> -static bool consume_stock(struct mem_cgroup *memcg)
> +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)

I don't believe there is a case for nr_pages < 0 here? If not then I
suggest that it would be clearer to use an unsigned type, like
memcg_stock_pcp.stock.

> ...
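Roughly, I'd expect the patched helper to end up looking like this (a
sketch, not the patch itself, and using the unsigned type suggested
above):

	static bool consume_stock(struct mem_cgroup *memcg,
				  unsigned int nr_pages)
	{
		struct memcg_stock_pcp *stock;
		bool ret = true;

		stock = &get_cpu_var(memcg_stock);
		if (memcg == stock->cached && stock->nr_pages >= nr_pages)
			stock->nr_pages -= nr_pages; /* charge from stock */
		else
			ret = false;	/* caller falls back and refills */
		put_cpu_var(memcg_stock);
		return ret;
	}

The only behavioural change versus the one-page version is the
stock->nr_pages >= nr_pages depth check.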
[Devel] Re: [PATCH v5 04/14] kmem accounting basic infrastructure
On Tue, 16 Oct 2012 14:16:41 +0400 Glauber Costa glom...@parallels.com wrote:

> This patch adds the basic infrastructure for the accounting of kernel
> memory. To control that, the following files are created:
>
>  * memory.kmem.usage_in_bytes
>  * memory.kmem.limit_in_bytes
>  * memory.kmem.failcnt

gargh. "failcnt" is not a word. Who was it who first thought that
omitting vowels from words improves anything? Sigh. That pooch is already
screwed and there's nothing we can do about it now.

>  * memory.kmem.max_usage_in_bytes
>
> They have the same meaning as their user memory counterparts. They
> reflect the state of the kmem res_counter.
>
> Per-cgroup kmem memory accounting is not enabled until a limit is set
> for the group. Once the limit is set, the accounting cannot be disabled
> for that group. This means that after the patch is applied, no
> behavioral change exists for whoever is still using memcg to control
> their memory usage, until memory.kmem.limit_in_bytes is set for the
> first time.
>
> We always account to both user and kernel resource_counters. This
> effectively means that an independent kernel limit is in place when the
> limit is set to a lower value than the user memory. An equal or higher
> value means that the user limit will always hit first, meaning that kmem
> is effectively unlimited.
>
> People who want to track kernel memory but not limit it can set this
> limit to a very high number (like RESOURCE_MAX - 1 page - that no one
> will ever hit, or equal to the user memory).
>
> ...
>
> +/* internal only representation about the status of kmem accounting. */
> +enum {
> +	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
> +};
> +
> +#define KMEM_ACCOUNTED_MASK (1 << KMEM_ACCOUNTED_ACTIVE)
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static void memcg_kmem_set_active(struct mem_cgroup *memcg)
> +{
> +	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_accounted);
> +}
> +#endif

I don't think memcg_kmem_set_active() really needs to exist. It has a
single caller and is unlikely to get any additional callers, so just
open-code it there?
[Devel] Re: [PATCH v5 06/14] memcg: kmem controller infrastructure
On Tue, 16 Oct 2012 14:16:43 +0400 Glauber Costa glom...@parallels.com wrote:

> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the
> __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors. The
> idea is to later on patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> Users of this functionality shall interact with the memcg core code
> through the following functions:
>
> memcg_kmem_newpage_charge: will return true if the group can handle the
>                            allocation. At this point, struct page is not
>                            yet allocated.
> memcg_kmem_commit_charge: will either revert the charge, if struct page
>                           allocation failed, or embed memcg information
>                           into page_cgroup.
> memcg_kmem_uncharge_page: called at free time, will revert the charge.
>
> ...
>
> +static __always_inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> +	if (!memcg_kmem_enabled())
> +		return true;
> +
> +	/*
> +	 * __GFP_NOFAIL allocations will move on even if charging is not
> +	 * possible. Therefore we don't even try, and have this allocation
> +	 * unaccounted. We could in theory charge it with
> +	 * res_counter_charge_nofail, but we hope those allocations are rare,
> +	 * and won't be worth the trouble.
> +	 */
> +	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> +		return true;
> +	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> +		return true;
> +
> +	/* If the test is dying, just let it go. */
> +	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> +		     fatal_signal_pending(current)))
> +		return true;
> +
> +	return __memcg_kmem_newpage_charge(gfp, memcg, order);
> +}

That's a big function! Why was it __always_inline? I'd have thought it
would be better to move the code after memcg_kmem_enabled() out of line.

Do we actually need to test PF_KTHREAD when current->mm == NULL? Perhaps
because of aio threads which temporarily adopt a userspace mm?

> +/**
> + * memcg_kmem_uncharge_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_uncharge_page(struct page *page, int order)
> +{
> +	if (memcg_kmem_enabled())
> +		__memcg_kmem_uncharge_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_charge: embeds correct memcg in a page
> + * @page: pointer to struct page recently allocated
> + * @memcg: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @memcg to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
> +{
> +	if (memcg_kmem_enabled() && memcg)
> +		__memcg_kmem_commit_charge(page, memcg, order);
> +}

I suspect the __always_inline's here are to do with static branch
trickery. A code comment is warranted if so?
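Presumably the static branch trickery in question is something like this
(a sketch of the pattern, assuming the jump-label API of that era):

	#include <linux/jump_label.h>

	extern struct static_key memcg_kmem_enabled_key;

	static inline bool memcg_kmem_enabled(void)
	{
		/*
		 * Compiles to a no-op branch until the first kmem limit
		 * is set, at which point static_key_slow_inc() patches
		 * the branch in.  Keeping the callers __always_inline
		 * keeps the disabled-case fast path at a single patched
		 * instruction, with no function call at all.
		 */
		return static_key_false(&memcg_kmem_enabled_key);
	}

Which, if accurate, would indeed be worth a comment at the call sites.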
[Devel] Re: [PATCH v5 07/14] mm: Allocate kernel pages to the right memcg
On Tue, 16 Oct 2012 14:16:44 +0400 Glauber Costa glom...@parallels.com wrote: When a process tries to allocate a page with the __GFP_KMEMCG flag, the page allocator will call the corresponding memcg functions to validate the allocation. Tasks in the root memcg can always proceed. To avoid adding markers to the page - and a kmem flag that would necessarily follow - as well as doing page_cgroup lookups for no reason, whoever marks its allocations with the __GFP_KMEMCG flag is responsible for telling the page allocator that this is such an allocation at free_pages() time. Well, why? Was that the correct decision? This is done by the invocation of __free_accounted_pages() and free_accounted_pages(). These are very general-sounding names. I'd expect the identifiers to contain memcg and/or kmem, to identify what's going on. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
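The caller-side contract being questioned pairs the flag with the new free functions; a minimal sketch (the function names come from the patch, the usage itself is invented for illustration):

    /* allocation marked for kmemcg accounting ... */
    unsigned long addr = __get_free_pages(GFP_KERNEL | __GFP_KMEMCG, order);

    /* ... must be released through the accounted variant, not free_pages() */
    free_accounted_pages(addr, order);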
[Devel] Re: [PATCH v5 11/14] memcg: allow a memcg with kmem charges to be destructed.
On Tue, 16 Oct 2012 14:16:48 +0400 Glauber Costa glom...@parallels.com wrote: Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, It is? For a major patchset such as this, it's pretty important to discuss such long-term plans in the top-level discussion. Covering things such as expected complexity, expected performance hit, how these plans affected the current implementation, etc. The main reason for this is that if the future plans appear to be of doubtful feasibility and the current implementation isn't sufficiently useful without the future stuff, we shouldn't merge the current implementation. It's a big issue! we can't guarantee that we'll always be able to point a page to a particular process, and migrate the charges along with it - since in the common case, a page will contain data belonging to multiple processes. Because of that, when we destroy a memcg, we only make sure the destruction will succeed by discounting the kmem charges from the user charges when we try to empty the cgroup. ... ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 13/14] protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs
On Tue, 16 Oct 2012 14:16:50 +0400 Glauber Costa glom...@parallels.com wrote: @@ -146,7 +146,7 @@ void __weak arch_release_thread_info(struct thread_info *ti) static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node) { - struct page *page = alloc_pages_node(node, THREADINFO_GFP, + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED, THREAD_SIZE_ORDER); yay, we actually used all this code for something ;) I don't think we really saw a comprehensive list of what else the kmem controller will be used for, but I believe that all other envisaged applications will require slab accounting, yes? So it appears that all we have at present is a yet-another-fork-bomb-preventer, but one which requires that the culprit be in a container? That's reasonable, given your hosted-environment scenario. It's unclear (to me) that we should merge all this code for only this feature. Again, it would be good to have a clear listing of and plan for other applications of this code. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 14/14] Add documentation about the kmem controller
On Tue, 16 Oct 2012 14:16:51 +0400 Glauber Costa glom...@parallels.com wrote: +Kernel memory won't be accounted at all until a limit on a group is set. This +allows for existing setups to continue working without disruption. The limit +cannot be set if the cgroup has children, or if there are already tasks in the +cgroup. What behaviour will userspace see if The limit cannot be set? write() returns -EINVAL, something like that? ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
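For concreteness, the failure being discussed would surface to userspace roughly like this (the path is the conventional cgroupfs mount; the errno shown is an assumption - which one it actually is, is precisely the question above):

    # the cgroup already contains tasks, so the write must fail
    $ echo 100M > /sys/fs/cgroup/memory/grp/memory.kmem.limit_in_bytes
    sh: write error: Device or resource busy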
[Devel] Re: [PATCH] proc: check vma->vm_file before dereferencing
On Mon, 15 Oct 2012 19:30:03 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote: It can be equal to NULL. Please write better changelogs, so people do not have to ask questions such as: - Under what conditions does this bug trigger? - In which kernel version(s)? - Is it a post-3.6 regression? Thanks. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH] proc: check vma->vm_file before dereferencing
On Tue, 16 Oct 2012 01:52:30 +0400 Cyrill Gorcunov gorcu...@openvz.org wrote: On Mon, Oct 15, 2012 at 02:40:48PM -0700, Andrew Morton wrote: On Mon, 15 Oct 2012 19:30:03 +0400 Stanislav Kinsbursky skinsbur...@parallels.com wrote: It can be equal to NULL. Please write better changelogs, so people do not have to ask questions such as: - Under what conditions does this bug trigger? - In which kernel version(s)? - Is it a post-3.6 regression? Andrew, would the following changelog be enough? Commit 7b540d0646ce122f0ba4520412be91e530719742 switched proc_map_files_readdir to use @f_mode directly instead of grabbing a @file reference, but at the same time the test for @vm_file presence was lost, leading to a nil dereference. The patch brings the test back. The whole proc_map_files feature is wrapped in CONFIG_CHECKPOINT_RESTORE (which is set to 'n' by default), so the bug doesn't affect regular kernels. The regression is 3.7-rc1 only as far as I can tell. Ah, I see, great, thanks. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
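The restored test itself is a one-liner; its shape, assuming the surrounding proc_map_files_readdir() vma loop (which is not quoted in the thread):

    /* anonymous mappings have no backing file - skip, don't dereference */
    if (!vma->vm_file)
            continue;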
[Devel] Re: [PATCH 06/11] memcg: kmem controller infrastructure
On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 03:17 AM, Andrew Morton wrote: + memcg_uncharge_kmem(memcg, size); + mem_cgroup_put(memcg); +} +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page); #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s) __setup("swapaccount=", enable_swap_account); #endif + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM gargh. CONFIG_MEMCG_KMEM, please! Here too. I like it as much as you do. But that is consistent with the rest of the file, and I'd rather have it this way. There's not much point in being consistent with something which is so unpleasant. I'm on a little campaign to rename CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my bait yet. Be first! ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 06/11] memcg: kmem controller infrastructure
On Tue, 26 Jun 2012 22:14:51 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 10:01 PM, Andrew Morton wrote: On Tue, 26 Jun 2012 19:01:15 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 03:17 AM, Andrew Morton wrote: +memcg_uncharge_kmem(memcg, size); + mem_cgroup_put(memcg); +} +EXPORT_SYMBOL(__mem_cgroup_free_kmem_page); #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) @@ -5645,3 +5751,69 @@ static int __init enable_swap_account(char *s) __setup("swapaccount=", enable_swap_account); #endif + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM gargh. CONFIG_MEMCG_KMEM, please! Here too. I like it as much as you do. But that is consistent with the rest of the file, and I'd rather have it this way. There's not much point in being consistent with something which is so unpleasant. I'm on a little campaign to rename CONFIG_CGROUP_MEM_RES_CTLR to CONFIG_MEMCG, only nobody has taken my bait yet. Be first! If you are okay with a preparation mechanical patch to convert the whole file, I can change mine too. But you'll be responsible for arguing with whoever steps up opposing this =p From: Andrew Morton a...@linux-foundation.org Subject: memcg: rename config variables Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM Cc: Glauber Costa glom...@parallels.com Cc: Michal Hocko mho...@suse.cz Cc: Johannes Weiner han...@cmpxchg.org Cc: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: Hugh Dickins hu...@google.com Cc: Tejun Heo t...@kernel.org Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Cc: David Rientjes rient...@google.com Cc: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/cgroup_subsys.h | 2 +- include/linux/memcontrol.h | 14 +++--- include/linux/mmzone.h | 8 include/linux/page_cgroup.h | 10 +- include/linux/sched.h | 2 +- init/Kconfig | 14 +++--- kernel/fork.c | 2 +- mm/hwpoison-inject.c | 2 +- mm/memcontrol.c | 20 ++-- mm/memory-failure.c | 2 +- mm/mmzone.c | 2 +- mm/oom_kill.c | 2 +- mm/page_cgroup.c | 2 +- mm/vmscan.c | 4 ++-- 14 files changed, 43 insertions(+), 43 deletions(-) diff -puN kernel/fork.c~a kernel/fork.c --- a/kernel/fork.c~a +++ a/kernel/fork.c @@ -1302,7 +1302,7 @@ static struct task_struct *copy_process( #ifdef CONFIG_DEBUG_MUTEXES p->blocked_on = NULL; /* not blocked yet */ #endif -#ifdef CONFIG_CGROUP_MEM_RES_CTLR +#ifdef CONFIG_MEMCG p->memcg_batch.do_batch = 0; p->memcg_batch.memcg = NULL; #endif diff -puN mm/hwpoison-inject.c~a mm/hwpoison-inject.c --- a/mm/hwpoison-inject.c~a +++ a/mm/hwpoison-inject.c @@ -123,7 +123,7 @@ static int pfn_inject_init(void) if (!dentry) goto fail; -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP +#ifdef CONFIG_MEMCG_SWAP dentry = debugfs_create_u64("corrupt-filter-memcg", 0600, hwpoison_dir, &hwpoison_filter_memcg); if (!dentry) diff -puN mm/memcontrol.c~a mm/memcontrol.c --- a/mm/memcontrol.c~a +++ a/mm/memcontrol.c @@ -61,12 +61,12 @@ struct cgroup_subsys mem_cgroup_subsys _ #define MEM_CGROUP_RECLAIM_RETRIES 5 static struct mem_cgroup *root_mem_cgroup __read_mostly; -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP +#ifdef CONFIG_MEMCG_SWAP /* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */ int do_swap_account __read_mostly; /* for remember boot option*/ -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED +#ifdef
CONFIG_MEMCG_SWAP_ENABLED static int really_do_swap_account __initdata = 1; #else static int really_do_swap_account __initdata = 0; @@ -407,7 +407,7 @@ static void mem_cgroup_get(struct mem_cg static void mem_cgroup_put(struct mem_cgroup *memcg); /* Writing them here to avoid exposing memcg's inner layout */ -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +#ifdef CONFIG_MEMCG_KMEM #include <net/sock.h> #include <net/ip.h> @@ -466,9 +466,9 @@ struct cg_proto *tcp_proto_cgroup(struct } EXPORT_SYMBOL(tcp_proto_cgroup); #endif /* CONFIG_INET */ -#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ +#endif /* CONFIG_MEMCG_KMEM */ -#if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) static void disarm_sock_keys(struct mem_cgroup *memcg) { if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto)) @@ -3085,7 +3085,7 @@ mem_cgroup_uncharge_swapcache(struct pag } #endif -#ifdef
[Devel] Re: [PATCH 00/11] kmem controller for memcg: stripped down version
On Tue, 26 Jun 2012 11:17:49 +0400 Glauber Costa glom...@parallels.com wrote: On 06/26/2012 03:27 AM, Andrew Morton wrote: On Mon, 25 Jun 2012 18:15:17 +0400 Glauber Costa glom...@parallels.com wrote: What I am proposing with this series is a stripped down version of the kmem controller for memcg that would allow us to merge significant parts of the infrastructure, while leaving out, for now, the polemic bits about the slab while it is being reworked by Christoph. My reasoning for that is that after the last change to introduce a gfp flag to mark kernel allocations, it became clear to me that tracking other resources like the stack would then follow extremely naturally. I figured that at some point we'd have to solve the issue pointed out by David, and avoid testing the Slab flag in the page allocator, since it would soon be made more generic. I do that by having the callers explicitly mark it. So to demonstrate how it would work, I am introducing a stack tracker here, which is already a functionality per se: it successfully stops fork bombs from happening. (Sorry for doing all your work, Frederic =p ). Note that after all the memcg infrastructure is deployed, it becomes very easy to track anything. The last patch of this series is extremely simple. The infrastructure is exactly the same we had in memcg, but stripped down of the slab parts. And because what we have after those patches is a feature per se, I think it could be considered for merging. hm. None of this new code makes the kernel smaller, faster, easier to understand or more fun to read! Not sure if this is a general comment - in which case I agree - or if targeted at my statement that this is stripped down. If so, it is of course smaller relative to my previous slab accounting patches. It's a general comment. The patch adds overhead: runtime costs and maintenance costs. Do its benefits justify that cost? The infrastructure is largely common, but I realized that a future user, tracking the stack, would be a lot simpler and could be done first. Presumably we're getting some benefit for all the downside. When the time is appropriate, please do put some time into explaining that benefit, so that others can agree that it is a worthwhile tradeoff. Well, for one thing, we stop fork bombs for processes inside cgroups. inside cgroups is a significant limitation! Is this capability important enough to justify adding the new code? That's unobvious (to me). Are there any other user-facing things which we can do with this feature? Present, or planned? I can't speak for everybody here, but AFAIK, tracking the stack through the memory it uses, therefore using my proposed kmem controller, was an idea that got quite a bit of traction with the memcg/memory people. So here you have something that people already asked a lot for, in a shape and interface that seem to be acceptable. mm, maybe. Kernel developers tend to look at code from the point of view does it work as designed, is it clean, is it efficient, do I understand it, etc. We often forget to step back and really consider whether or not it should be merged at all. I mean, unless the code is an explicit simplification, we should have a very strong bias towards don't merge. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 06/11] memcg: kmem controller infrastructure
On Mon, 25 Jun 2012 18:15:23 +0400 Glauber Costa glom...@parallels.com wrote: This patch introduces infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the __GFP_KMEMCG flag, and the task belongs to a memcg other than the root. In memcontrol.h those functions are wrapped in inline accessors. The idea is to later on patch those with jump labels, so we don't incur any overhead when no mem cgroups are being used. ... @@ -416,6 +423,43 @@ static inline void sock_update_memcg(struct sock *sk) static inline void sock_release_memcg(struct sock *sk) { } + +#define mem_cgroup_kmem_on 0 +#define __mem_cgroup_new_kmem_page(a, b, c) false +#define __mem_cgroup_free_kmem_page(a, b) +#define __mem_cgroup_commit_kmem_page(a, b, c) I suggest that the naming consistently follow the model mem_cgroup_kmem_foo. So mem_cgroup_kmem_ becomes the well-known identifier for this subsystem. Then, s/mem_cgroup/memcg/g - show us some mercy here! +#define is_kmem_tracked_alloc (false) memcg_kmem_tracked_alloc, perhaps. But what does this actually do? looks eww, ick. +#define is_kmem_tracked_alloc (gfp & __GFP_KMEMCG) What Tejun said. This: /* * Nice comment goes here */ static inline bool memcg_kmem_tracked_alloc(gfp_t gfp) { return gfp & __GFP_KMEMCG; } #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ + +static __always_inline +bool mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, int order) memcg_kmem_new_page(). +{ + if (!mem_cgroup_kmem_on) + return true; + if (!is_kmem_tracked_alloc) + return true; + if (!current->mm) + return true; + if (in_interrupt()) + return true; + if (gfp & __GFP_NOFAIL) + return true; + return __mem_cgroup_new_kmem_page(gfp, handle, order); +} Add documentation, please. The semantics of the return value are inscrutable. +static __always_inline +void mem_cgroup_free_kmem_page(struct page *page, int order) +{ + if (mem_cgroup_kmem_on) + __mem_cgroup_free_kmem_page(page, order); +} memcg_kmem_free_page(). +static __always_inline +void mem_cgroup_commit_kmem_page(struct page *page, struct mem_cgroup *handle, + int order) +{ + if (mem_cgroup_kmem_on) + __mem_cgroup_commit_kmem_page(page, handle, order); +} memcg_kmem_commit_page(). #endif /* _LINUX_MEMCONTROL_H */ ... +static inline bool mem_cgroup_kmem_enabled(struct mem_cgroup *memcg) +{ + return !mem_cgroup_disabled() && memcg && +!mem_cgroup_is_root(memcg) && memcg->kmem_accounted; +} Does this really need to handle a memcg==NULL? +bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *_handle, int order) +{ + struct mem_cgroup *memcg; + struct mem_cgroup **handle = (struct mem_cgroup **)_handle; + bool ret = true; + size_t size; + struct task_struct *p; + + *handle = NULL; + rcu_read_lock(); + p = rcu_dereference(current->mm->owner); + memcg = mem_cgroup_from_task(p); + if (!mem_cgroup_kmem_enabled(memcg)) + goto out; + + mem_cgroup_get(memcg); + + size = (1 << order) << PAGE_SHIFT; size = PAGE_SIZE << order; is simpler and more typical. + ret = memcg_charge_kmem(memcg, gfp, size) == 0; Odd. memcg_charge_kmem() returns a nice errno, but this conversion just drops that information on the floor. If the mem_cgroup_new_kmem_page() return value had been documented, I might have been able to understand the thinking here. But it wasn't, so I couldn't.
+ if (!ret) { + mem_cgroup_put(memcg); + goto out; + } + + *handle = memcg; +out: + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL(__mem_cgroup_new_kmem_page); + +void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int order) +{ + struct page_cgroup *pc; + struct mem_cgroup *memcg = handle; + size_t size; + + if (!memcg) + return; + + WARN_ON(mem_cgroup_is_root(memcg)); + /* The page allocation must have failed. Revert */ + if (!page) { + size = (1 << order) << PAGE_SHIFT; PAGE_SIZE << order + memcg_uncharge_kmem(memcg, size); + mem_cgroup_put(memcg); + return; + } This all looks a bit odd. After a failure you run a commit which undoes the speculative charging. I guess it makes sense. It's the sort of thing which can be expounded upon in the documentation which isn't there, sigh. + pc = lookup_page_cgroup(page); + lock_page_cgroup(pc); + pc->mem_cgroup = memcg; + SetPageCgroupUsed(pc); + unlock_page_cgroup(pc); +} missed a newline there. +void __mem_cgroup_free_kmem_page(struct page *page, int order) +{ + struct mem_cgroup *memcg; + size_t size; + struct page_cgroup *pc; +
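For the return-value documentation being requested, one possible wording (a sketch by way of illustration, not text from a later revision of the patch):

    /**
     * mem_cgroup_new_kmem_page - attempt to charge a new kernel page to a memcg
     * @gfp:    the gfp allocation flags
     * @handle: out parameter; the memcg to pass to the commit function
     * @order:  allocation order
     *
     * Returns true if the caller may proceed with the allocation: either the
     * charge succeeded, or this allocation is simply not accounted.  Returns
     * false if the memcg is over its kmem limit and the allocation must fail.
     */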
[Devel] Re: [PATCH 09/11] memcg: propagate kmem limiting information to children
On Tue, 26 Jun 2012 02:36:27 +0400 Glauber Costa glom...@parallels.com wrote: On 06/25/2012 10:29 PM, Tejun Heo wrote: Feeling like a nit pervert but.. On Mon, Jun 25, 2012 at 06:15:26PM +0400, Glauber Costa wrote: @@ -287,7 +287,11 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; - bool kmem_accounted; + /* + * bit0: accounted by this cgroup + * bit1: accounted by a parent. + */ + volatile unsigned long kmem_accounted; Is the volatile declaration really necessary? Why is it necessary? Why no comment explaining it? Seems to be required by set_bit and friends. gcc will complain if it is not volatile (take a look at the bit function headers) That would be a broken gcc. We run test_bit()/set_bit() and friends against plain old `unsigned long' in thousands of places. There's nothing special about this one! ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 09/11] memcg: propagate kmem limiting information to children
On Mon, 25 Jun 2012 18:15:26 +0400 Glauber Costa glom...@parallels.com wrote: --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -287,7 +287,11 @@ struct mem_cgroup { * Should the accounting and control be hierarchical, per subtree? */ bool use_hierarchy; - bool kmem_accounted; + /* + * bit0: accounted by this cgroup + * bit1: accounted by a parent. + */ + volatile unsigned long kmem_accounted; I suggest unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */ bool oom_lock; atomic_t under_oom; @@ -340,6 +344,9 @@ struct mem_cgroup { #endif }; +#define KMEM_ACCOUNTED_THIS 0 +#define KMEM_ACCOUNTED_PARENT 1 And then document the fields here. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 00/11] kmem controller for memcg: stripped down version
On Mon, 25 Jun 2012 18:15:17 +0400 Glauber Costa glom...@parallels.com wrote: What I am proposing with this series is a stripped down version of the kmem controller for memcg that would allow us to merge significant parts of the infrastructure, while leaving out, for now, the polemic bits about the slab while it is being reworked by Christoph. My reasoning for that is that after the last change to introduce a gfp flag to mark kernel allocations, it became clear to me that tracking other resources like the stack would then follow extremely naturally. I figured that at some point we'd have to solve the issue pointed out by David, and avoid testing the Slab flag in the page allocator, since it would soon be made more generic. I do that by having the callers explicitly mark it. So to demonstrate how it would work, I am introducing a stack tracker here, which is already a functionality per se: it successfully stops fork bombs from happening. (Sorry for doing all your work, Frederic =p ). Note that after all the memcg infrastructure is deployed, it becomes very easy to track anything. The last patch of this series is extremely simple. The infrastructure is exactly the same we had in memcg, but stripped down of the slab parts. And because what we have after those patches is a feature per se, I think it could be considered for merging. hm. None of this new code makes the kernel smaller, faster, easier to understand or more fun to read! Presumably we're getting some benefit for all the downside. When the time is appropriate, please do put some time into explaining that benefit, so that others can agree that it is a worthwhile tradeoff. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 09/11] memcg: propagate kmem limiting information to children
On Mon, 25 Jun 2012 22:24:44 -0700 (PDT) David Rientjes rient...@google.com wrote: +#define KMEM_ACCOUNTED_THIS 0 +#define KMEM_ACCOUNTED_PARENT 1 And then document the fields here. In hex, please? Well, they're bit numbers, not masks. Decimal 0-31 is OK, or an enum. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
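Folding both review comments together, the definitions might end up as (a sketch, not from a later revision):

    /* bit numbers in mem_cgroup.kmem_accounted */
    enum {
            KMEM_ACCOUNTED_THIS,    /* this cgroup itself had its kmem limit set */
            KMEM_ACCOUNTED_PARENT,  /* accounting inherited from a limited parent */
    };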
[Devel] Re: [PATCH v6 2/2] decrement static keys on real destroy time
On Wed, 23 May 2012 13:16:36 +0400 Glauber Costa glom...@parallels.com wrote: On 05/23/2012 02:46 AM, Andrew Morton wrote: Here, we're open-coding kinda-test_bit(). Why do that? These flags are modified with set_bit() and friends, so we should read them with the matching test_bit()? My reasoning was to be as cheap as possible, as you noted yourself two paragraphs below. These aren't on any fast path, are they? Plus: you failed in that objective! The C compiler's internal scalar-bool conversion makes these functions no more efficient than test_bit(). So here are suggested changes from *some* of the above discussion. Please consider, incorporate, retest and send us a v7? How do you want me to do it? Should I add your patch on top of mine, and then another one that tweaks whatever else is left, or should I just merge those changes into the patches I have? A brand new patch, I guess. I can sort out the what-did-he-change view at this end. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v6 2/2] decrement static keys on real destroy time
(MEMCG_SOCK_ACTIVATED, &cg_proto->flags)) + static_key_slow_inc(&memcg_socket_limit_enabled); + set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); + } So here are suggested changes from *some* of the above discussion. Please consider, incorporate, retest and send us a v7? From: Andrew Morton a...@linux-foundation.org Subject: memcg-decrement-static-keys-at-real-destroy-time-v6-fix - move enum sock_flag_bits into sock.h - document enum sock_flag_bits - convert memcg_proto_active() and memcg_proto_activated() to test_bit() - redo tcp_update_limit() comment to 80 cols Cc: Glauber Costa glom...@parallels.com Cc: Johannes Weiner han...@cmpxchg.org Cc: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: Li Zefan lize...@huawei.com Cc: Michal Hocko mho...@suse.cz Cc: Tejun Heo t...@kernel.org Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/memcontrol.h | 5 - include/net/sock.h | 15 +-- net/ipv4/tcp_memcontrol.c | 30 +++--- 3 files changed, 28 insertions(+), 22 deletions(-) diff -puN include/linux/memcontrol.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix include/linux/memcontrol.h --- a/include/linux/memcontrol.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix +++ a/include/linux/memcontrol.h @@ -405,11 +405,6 @@ enum { OVER_LIMIT, }; -enum sock_flag_bits { - MEMCG_SOCK_ACTIVE, - MEMCG_SOCK_ACTIVATED, -}; - struct sock; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM void sock_update_memcg(struct sock *sk); diff -puN include/net/sock.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix include/net/sock.h --- a/include/net/sock.h~memcg-decrement-static-keys-at-real-destroy-time-v6-fix +++ a/include/net/sock.h @@ -46,6 +46,7 @@ #include <linux/list_nulls.h> #include <linux/timer.h> #include <linux/cache.h> +#include <linux/bitops.h> #include <linux/lockdep.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ @@ -921,6 +922,16 @@ struct proto { #endif }; +/* + * Bits in struct cg_proto.flags + */ +enum sock_flag_bits { + /* Currently active and new sockets should be assigned to cgroups */ + MEMCG_SOCK_ACTIVE, + /* It was ever activated; we must disarm static keys on destruction */ + MEMCG_SOCK_ACTIVATED, +}; + struct cg_proto { void (*enter_memory_pressure)(struct sock *sk); struct res_counter *memory_allocated; /* Current allocated memory. */ @@ -945,12 +956,12 @@ extern void proto_unregister(struct prot static inline bool memcg_proto_active(struct cg_proto *cg_proto) { - return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE); + return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); } static inline bool memcg_proto_activated(struct cg_proto *cg_proto) { - return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED); + return test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags); } #ifdef SOCK_REFCNT_DEBUG diff -puN net/ipv4/tcp_memcontrol.c~memcg-decrement-static-keys-at-real-destroy-time-v6-fix net/ipv4/tcp_memcontrol.c --- a/net/ipv4/tcp_memcontrol.c~memcg-decrement-static-keys-at-real-destroy-time-v6-fix +++ a/net/ipv4/tcp_memcontrol.c @@ -108,24 +108,24 @@ static int tcp_update_limit(struct mem_c clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); else if (val != RESOURCE_MAX) { /* -* The active bit needs to be written after the static_key update. -* This is what guarantees that the socket activation function -* is the last one to run. See sock_update_memcg() for details, -* and note that we don't mark any socket as belonging to this -* memcg until that flag is up. +* The active bit needs to be written after the static_key +* update.
This is what guarantees that the socket activation +* function is the last one to run. See sock_update_memcg() for +* details, and note that we don't mark any socket as belonging +* to this memcg until that flag is up. * -* We need to do this, because static_keys will span multiple -* sites, but we can't control their order. If we mark a socket -* as accounted, but the accounting functions are not patched in -* yet, we'll lose accounting. +* We need to do this, because static_keys will span multiple +* sites, but we can't control their order. If we mark a socket +* as accounted, but the accounting functions are not patched in +* yet, we'll lose accounting. * -* We never race with the readers in sock_update_memcg(), because -* when this value change, the code to process it is not patched in -* yet
[Devel] Re: [PATCH v6 2/2] decrement static keys on real destroy time
On Tue, 22 May 2012 15:46:10 -0700 Andrew Morton a...@linux-foundation.org wrote: +static inline bool memcg_proto_active(struct cg_proto *cg_proto) +{ + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVE); +} + +static inline bool memcg_proto_activated(struct cg_proto *cg_proto) +{ + return cg_proto->flags & (1 << MEMCG_SOCK_ACTIVATED); +} Here, we're open-coding kinda-test_bit(). Why do that? These flags are modified with set_bit() and friends, so we should read them with the matching test_bit()? Also, these bool-returning functions will return values other than 0 and 1. That probably works OK and I don't know what the C standards and implementations do about this. But it seems unclean and slightly risky to have a bool value of 32! Converting these functions to use test_bit() fixes this - test_bit() returns only 0 or 1. test_bit() is slightly more expensive than the above. If this is considered to be an issue then I guess we could continue to use this approach. But I do think a code comment is needed, explaining and justifying the unusual decision to bypass the bitops API. Also these functions should tell the truth and return an int type. Joe corrected (and informed) me: : 6.3.1.2p1: : : When any scalar value is converted to _Bool, the result is 0 : if the value compares equal to 0; otherwise, the result is 1. So the above functions will be given compiler-generated scalar-to-boolean conversion. test_bit() already does internal scalar-to-boolean conversion. The compiler doesn't know that, so if we convert the above functions to use test_bit(), we'll end up performing scalar-to-boolean-to-boolean conversion, which is dumb. I assume that a way of fixing this is to change test_bit() to return bool type. That's a bit scary. A less scary way would be to add a new bool test_bit_bool(int nr, const unsigned long *addr); which internally calls test_bit() but somehow avoids the compiler-generated conversion of the test_bit() return value into a bool. I haven't actually thought of a way of doing this ;) ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Thu, 17 May 2012 13:52:13 +0400 Glauber Costa glom...@parallels.com wrote: Andrew is right. It seems we will need that mutex after all. Just this is not a race, and neither something that should belong in the static_branch interface. Well, a mutex is one way. Or you could do something like if (!test_and_set_bit(CGPROTO_ACTIVATED, &cg_proto->flags)) static_key_slow_inc(&memcg_socket_limit_enabled); ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
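The suggestion works because test_and_set_bit() is a single atomic read-modify-write: of any number of racing callers, exactly one observes the bit as previously clear, so the static key is bumped once and only once. In context (CGPROTO_ACTIVATED is the placeholder name from the mail above):

    /* only the winner of the atomic RMW bumps the static key */
    if (!test_and_set_bit(CGPROTO_ACTIVATED, &cg_proto->flags))
            static_key_slow_inc(&memcg_socket_limit_enabled);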
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Fri, 11 May 2012 17:11:17 -0300 Glauber Costa glom...@parallels.com wrote: We call the destroy function when a cgroup starts to be removed, such as by a rmdir event. However, because of our reference counters, some objects are still inflight. Right now, we are decrementing the static_keys at destroy() time, meaning that if we get rid of the last static_key reference, some objects will still have charges, but the code to properly uncharge them won't be run. This becomes a problem especially if it is ever enabled again, because now new charges will be added to the stale charges, making it pretty much impossible to keep track of them. We just need to be careful with the static branch activation: since there is no particular preferred order of their activation, we need to make sure that we only start using it after all call sites are active. This is achieved by having a per-memcg flag that is only updated after static_key_slow_inc() returns. At this time, we are sure all sites are active. This is made per-memcg, not global, for a reason: it also has the effect of making socket accounting more consistent. The first memcg to be limited will trigger static_key() activation, therefore, accounting. But all the others will then be accounted no matter what. After this patch, only limited memcgs will have their sockets accounted. ... @@ -107,10 +104,31 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT, net->ipv4.sysctl_tcp_mem[i]); - if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX) - static_key_slow_dec(&memcg_socket_limit_enabled); - else if (old_lim == RESOURCE_MAX && val != RESOURCE_MAX) - static_key_slow_inc(&memcg_socket_limit_enabled); + if (val == RESOURCE_MAX) + cg_proto->active = false; + else if (val != RESOURCE_MAX) { + /* + * ->activated needs to be written after the static_key update. + * This is what guarantees that the socket activation function + * is the last one to run. See sock_update_memcg() for details, + * and note that we don't mark any socket as belonging to this + * memcg until that flag is up. + * + * We need to do this, because static_keys will span multiple + * sites, but we can't control their order. If we mark a socket + * as accounted, but the accounting functions are not patched in + * yet, we'll lose accounting. + * + * We never race with the readers in sock_update_memcg(), because + * when this value change, the code to process it is not patched in + * yet. + */ + if (!cg_proto->activated) { + static_key_slow_inc(&memcg_socket_limit_enabled); + cg_proto->activated = true; + } If two threads run this code concurrently, they can both see cg_proto->activated==false and they will both run static_key_slow_inc(). Hopefully there's some locking somewhere which prevents this, but it is unobvious. We should comment this, probably at the cg_proto.activated definition site. Or we should fix the bug ;) + cg_proto->active = true; + } return 0; } ... ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Fri, 11 May 2012 17:11:17 -0300 Glauber Costa glom...@parallels.com wrote: We call the destroy function when a cgroup starts to be removed, such as by a rmdir event. However, because of our reference counters, some objects are still inflight. Right now, we are decrementing the static_keys at destroy() time, meaning that if we get rid of the last static_key reference, some objects will still have charges, but the code to properly uncharge them won't be run. This becomes a problem especially if it is ever enabled again, because now new charges will be added to the stale charges, making it pretty much impossible to keep track of them. We just need to be careful with the static branch activation: since there is no particular preferred order of their activation, we need to make sure that we only start using it after all call sites are active. This is achieved by having a per-memcg flag that is only updated after static_key_slow_inc() returns. At this time, we are sure all sites are active. This is made per-memcg, not global, for a reason: it also has the effect of making socket accounting more consistent. The first memcg to be limited will trigger static_key() activation, therefore, accounting. But all the others will then be accounted no matter what. After this patch, only limited memcgs will have their sockets accounted. So I'm scratching my head over what the actual bug is, and how important it is. AFAICT it will cause charging stats to exhibit some inaccuracy when memcg's are being torn down? I don't know how serious this is in the real world and so can't decide which kernel version(s) we should fix. When fixing bugs, please always fully describe the bug's end-user impact, so that I and others can make these sorts of decisions. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time
On Thu, 17 May 2012 07:06:52 +0400 Glauber Costa glom...@parallels.com wrote: ... + else if (val != RESOURCE_MAX) { + /* + * ->activated needs to be written after the static_key update. + * This is what guarantees that the socket activation function + * is the last one to run. See sock_update_memcg() for details, + * and note that we don't mark any socket as belonging to this + * memcg until that flag is up. + * + * We need to do this, because static_keys will span multiple + * sites, but we can't control their order. If we mark a socket + * as accounted, but the accounting functions are not patched in + * yet, we'll lose accounting. + * + * We never race with the readers in sock_update_memcg(), because + * when this value change, the code to process it is not patched in + * yet. + */ + if (!cg_proto->activated) { + static_key_slow_inc(&memcg_socket_limit_enabled); + cg_proto->activated = true; + } If two threads run this code concurrently, they can both see cg_proto->activated==false and they will both run static_key_slow_inc(). Hopefully there's some locking somewhere which prevents this, but it is unobvious. We should comment this, probably at the cg_proto.activated definition site. Or we should fix the bug ;) If that happens, locking in static_key_slow_inc will prevent any damage. My previous version had explicit code to prevent that, but it was pointed out to us that this is already part of the static_key expectations, so that was dropped. This makes no sense. If two threads run that code concurrently, key->enabled gets incremented twice. Nobody anywhere has a record that this happened so it cannot be undone. key->enabled is now in an unknown state. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
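Spelled out, the interleaving being described (key is the static key behind memcg_socket_limit_enabled):

    /*
     * T0: reads cg_proto->activated == false
     * T1: reads cg_proto->activated == false
     * T0: static_key_slow_inc()  ->  key->enabled becomes 1
     * T1: static_key_slow_inc()  ->  key->enabled becomes 2
     * T0, T1: cg_proto->activated = true
     *
     * Nothing records that the increment happened twice, so one of the
     * two counts can never be paired with a static_key_slow_dec().
     */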
[Devel] Re: [PATCH] remove BUG() in possible but rare condition
On Wed, 11 Apr 2012 15:10:24 -0300 Glauber Costa glom...@parallels.com wrote: While stressing the kernel with failing allocations today, I hit the following chain of events: alloc_page_buffers(): bh = alloc_buffer_head(GFP_NOFS); if (!bh) goto no_grow; <= path taken grow_dev_page(): bh = alloc_page_buffers(page, size, 0); if (!bh) goto failed; <= taken, consequence of the above and then the failed path BUG()s the kernel. The failure is inserted a little bit artificially, but even then, I see no reason why it should be deemed impossible in a real box. Even though this is not a condition that we expect to see around every time, failed allocations are expected to be handled, and BUG() sounds just too much. As a matter of fact, grow_dev_page() can return NULL just fine in other circumstances, so I propose we just remove it, then. Signed-off-by: Glauber Costa glom...@parallels.com CC: Linus Torvalds torva...@linux-foundation.org CC: Andrew Morton a...@linux-foundation.org --- fs/buffer.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 36d6665..351e18e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -985,7 +985,6 @@ grow_dev_page(struct block_device *bdev, sector_t block, return page; failed: - BUG(); unlock_page(page); page_cache_release(page); return NULL; Cute. AFAICT what happened was that in my April 2002 rewrite of this code I put a non-fatal buffer_error() warning in that case to tell us that something bad happened. Years later we removed the temporary buffer_error() and mistakenly replaced that warning with a BUG(). Only it *can* happen. We can remove the BUG() and fix up callers, or we can pass retry=1 into alloc_page_buffers(), so grow_dev_page() cannot fail. Immortal functions are a silly fiction, so we should remove the BUG() and fix up callers. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH] remove BUG() in possible but rare condition
On Wed, 11 Apr 2012 17:51:57 -0300 Glauber Costa glom...@parallels.com wrote: On 04/11/2012 05:26 PM, Andrew Morton wrote: failed: - BUG(); unlock_page(page); page_cache_release(page); return NULL; Cute. AFAICT what happened was that in my April 2002 rewrite of this code I put a non-fatal buffer_error() warning in that case to tell us that something bad happened. Years later we removed the temporary buffer_error() and mistakenly replaced that warning with a BUG(). Only it *can* happen. We can remove the BUG() and fix up callers, or we can pass retry=1 into alloc_page_buffers(), so grow_dev_page() cannot fail. Immortal functions are a silly fiction, so we should remove the BUG() and fix up callers. Any particular caller you are concerned with? Didn't someone see a buggy caller in btrfs? I'm thinking that we should retain some sort of assertion (a WARN_ON) if the try_to_free_buffers() failed. This is a weird case which I assume handles the situation where a blockdev's blocksize has changed. The code tries to throw away the old wrongly-sized buffer_heads and to then add new correctly-sized ones. If that discarding of buffers fails then the kernel is in rather a mess. It's quite possible that this code is never executed - we _should_ have invalidated all the pagecache for that device when changing blocksize. Or maybe it *is* executed, I dunno. It's one of those things which has hung around for decades as code in other places has vastly changed. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
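One shape the retained assertion could take inside grow_dev_page(), at the spot where wrongly-sized buffers fail to be discarded (placement is an assumption; the full function body is not quoted in this thread):

    if (page_has_buffers(page)) {
            bh = page_buffers(page);
            /* wrong-sized buffers we cannot discard: worth shouting about */
            if (bh->b_size != size && !try_to_free_buffers(page)) {
                    WARN_ON(1);
                    goto failed;
            }
    }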
[Devel] Re: [PATCH] memcg: Do not open code accesses to res_counter members
On Tue, 20 Mar 2012 20:53:44 +0400 Glauber Costa glom...@parallels.com wrote: We should use the accessor res_counter_read_u64 for that. Although a purely cosmetic change is sometimes better off delayed, to avoid conflicting with other people's work, we are starting to have people touching this code as well, and reproducing the open-coded behavior because that's the standard =) ... --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3708,7 +3708,7 @@ move_account: goto try_to_free; cond_resched(); /* ret should also be checked to ensure all lists are empty. */ - } while (memcg->res.usage > 0 || ret); + } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); out: css_put(&memcg->css); return ret; @@ -3723,7 +3723,7 @@ try_to_free: lru_add_drain_all(); /* try to free all pages in this cgroup */ shrink = 1; - while (nr_retries && memcg->res.usage > 0) { + while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) { int progress; if (signal_pending(current)) { Actually this fixes bugs on 32-bit machines. Good luck trying to demonstrate them at runtime though ;) ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
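The 32-bit bug being alluded to is a torn read: res.usage is a 64-bit res_counter member, so an open-coded memcg->res.usage read compiles to two 32-bit loads on those machines and can observe a half-updated value. The accessor reads under the counter's spinlock; roughly (recalled from kernel/res_counter.c of that era, so treat as a sketch):

    #if BITS_PER_LONG == 32
    u64 res_counter_read_u64(struct res_counter *counter, int member)
    {
            unsigned long flags;
            u64 ret;

            spin_lock_irqsave(&counter->lock, flags);
            ret = *res_counter_member(counter, member);
            spin_unlock_irqrestore(&counter->lock, flags);
            return ret;
    }
    #endif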
[Devel] Re: [PATCH 1/2] SYSCTL: root unregister routine introduced
On Mon, 12 Dec 2011 21:50:00 +0300 Stanislav Kinsbursky skinsbur...@parallels.com wrote: This routine is required for SUNRPC sysctls, which are going to be allocated, processed and destroyed per network namespace context. IOW, a new sysctl root will be registered on network namespace creation and thus has to be unregistered before network namespace destruction. It's a bit suspicious that such a mature subsystem as sysctl newly needs its internals exported like this. Either a) the net namespaces work is doing something which hasn't been done before or b) it is doing something wrong. So, please explain further so we can confirm that it is a) and not b). --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1701,6 +1701,13 @@ void register_sysctl_root(struct ctl_table_root *root) spin_unlock(&sysctl_lock); } +void unregister_sysctl_root(struct ctl_table_root *root) +{ + spin_lock(&sysctl_lock); + list_del(&root->root_list); + spin_unlock(&sysctl_lock); +} + This requires the addition of a declaration to include/linux/sysctl.h. Once that is done and review is complete, I'd suggest that these two patches be joined into a single patch, and that patch become part of whatever patch series it is which needs them. ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns
On Thu, 24 Feb 2011 00:48:18 +0000 Serge E. Hallyn se...@hallyn.com wrote: Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:03:25 +0000 Serge E. Hallyn se...@hallyn.com wrote: /* + * called with RCU read lock from check_kill_permission() + */ +static inline int kill_ok_by_cred(struct task_struct *t) +{ + const struct cred *cred = current_cred(); + const struct cred *tcred = __task_cred(t); + + if (cred->user->user_ns == tcred->user->user_ns && + (cred->euid == tcred->suid || + cred->euid == tcred->uid || + cred->uid == tcred->suid || + cred->uid == tcred->uid)) + return 1; + + if (ns_capable(tcred->user->user_ns, CAP_KILL)) + return 1; + + return 0; +} The compiler will inline this for us. Is that simply true with everything (worth inlining) nowadays, or is there a particular implicit hint to the compiler that'll make that happen? We've basically stopped inlining things nowadays. gcc inlines aggressively and sometimes we have to use noinline to stop it. Also, modern gcc's like to ignore the inline directive anyway, so we have to resort to __always_inline when we disagree. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
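The annotations being contrasted, for reference (a toy sketch using the kernel's compiler.h macros; the function names are invented):

    /* gcc usually decides this by itself - "inline" is only a hint... */
    static int add_one(int x) { return x + 1; }

    /* ...so when its heuristics pick wrongly, we override them: */
    static noinline int add_one_out_of_line(int x) { return x + 1; }
    static __always_inline int add_one_forced(int x) { return x + 1; }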
[Devel] Re: [PATCH 9/9] userns: check user namespace for task-file uid equivalence checks
On Thu, 24 Feb 2011 03:24:16 +0000 Serge E. Hallyn se...@hallyn.com wrote: Quoting Andrew Morton (a...@linux-foundation.org): On Thu, 17 Feb 2011 15:04:07 +0000 Serge E. Hallyn se...@hallyn.com wrote: There's a fairly well adhered to convention that global symbols (and often static symbols) have a prefix which identifies the subsystem to which they belong. This patchset rather scorns that convention. Most of these identifiers are pretty obviously from the capability subsystem, but still... Would 'inode_owner_or_capable' be better and make sense? I suppose so. We've totally screwed that pooch in the VFS (grep EXPORT fs/inode.c). But it's never too late to start. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 1/9] Add a user_namespace as creator/owner of uts_namespace
On Thu, 17 Feb 2011 15:02:57 +0000 Serge E. Hallyn se...@hallyn.com wrote: +/* + * userns count is 1 for root user, 1 for init_uts_ns, + * and 1 for... ? + */ ? ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 2/9] security: Make capabilities relative to the user namespace.
On Thu, 17 Feb 2011 15:03:06 +0000 Serge E. Hallyn se...@hallyn.com wrote: - Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. ... --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -544,7 +544,7 @@ extern const kernel_cap_t __cap_init_eff_set; * * Note that this does not set PF_SUPERPRIV on the task. */ -#define has_capability(t, cap) (security_real_capable((t), (cap)) == 0) +#define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0) /** * has_capability_noaudit - Determine if a task has a superior capability available (unaudited) @@ -558,9 +558,15 @@ extern const kernel_cap_t __cap_init_eff_set; * Note that this does not set PF_SUPERPRIV on the task. */ #define has_capability_noaudit(t, cap) \ - (security_real_capable_noaudit((t), (cap)) == 0) + (security_real_capable_noaudit((t), &init_user_ns, (cap)) == 0) +struct user_namespace; +extern struct user_namespace init_user_ns; Two icky-should-be-written-in-C macros which reference init_user_ns, followed by the declaration of init_user_ns and its type. Declarations which duplicate those in other header files. It's ripe for some upcleaning, methinks? Also, please ensure that the forward struct declarations are all at top-of-file (as in include/linux/security.h). Otherwise we can end up accumulating multiple forward declarations of the same thing in the one file. extern int capable(int cap); +extern int ns_capable(struct user_namespace *ns, int cap); +extern int task_ns_capable(struct task_struct *t, int cap); + +#define nsown_capable(cap) (ns_capable(current_user_ns(), (cap))) macroitis! @@ -301,15 +302,42 @@ error: */ int capable(int cap) { + return ns_capable(&init_user_ns, cap); +} +EXPORT_SYMBOL(capable); + +/** + * ns_capable - Determine if the current task has a superior capability in effect + * @ns: The usernamespace we want the capability in + * @cap: The capability to be tested for + * + * Return true if the current task has the given superior capability currently + * available for use, false if not. Actually it doesn't return true or false - it returns 1 or 0. Using a `bool' return type would fix the comment :) + * This sets PF_SUPERPRIV on the task if the capability is available on the + * assumption that it's about to be used. + */ +int ns_capable(struct user_namespace *ns, int cap) +{ if (unlikely(!cap_valid(cap))) { printk(KERN_CRIT "capable() called with invalid cap=%u\n", cap); BUG(); } - if (security_capable(current_cred(), cap) == 0) { + if (security_capable(ns, current_cred(), cap) == 0) { current->flags |= PF_SUPERPRIV; return 1; } return 0; } -EXPORT_SYMBOL(capable); +EXPORT_SYMBOL(ns_capable); + +/* + * does current have capability 'cap' to the user namespace of task + * 't'. Return true if it does, false otherwise. + */ Other comments were kerneldocified. +int task_ns_capable(struct task_struct *t, int cap) +{ + return ns_capable(task_cred_xxx(t, user)->user_ns, cap); +} +EXPORT_SYMBOL(task_ns_capable); Could return bool. ...
+int cap_capable(struct task_struct *tsk, const struct cred *cred, + struct user_namespace *targ_ns, int cap, int audit) { - return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM; + for (;;) { + /* The creator of the user namespace has all caps. */ + if (targ_ns != &init_user_ns && targ_ns->creator == cred->user) + return 0; + + /* Do we have the necessary capabilities? */ + if (targ_ns == cred->user->user_ns) + return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM; + + /* Have we tried all of the parent namespaces? */ + if (targ_ns == &init_user_ns) + return -EPERM; + + /* If you have the capability in a parent user ns you have it + * over all children user namespaces as well, so see + * if this process has the capability in the parent user + * namespace. + */ + targ_ns = targ_ns->creator->user_ns; + } + + /* We never get here */ + return -EPERM; So delete the code? Or does the compiler warn? If so, it's pretty busted. } /** ... ___ Containers mailing list contain...@lists.linux-foundation.org
[Devel] Re: [PATCH 4/9] allow killing tasks in your own or child userns
On Thu, 17 Feb 2011 15:03:25 +0000 Serge E. Hallyn se...@hallyn.com wrote: /* + * called with RCU read lock from check_kill_permission() + */ +static inline int kill_ok_by_cred(struct task_struct *t) +{ + const struct cred *cred = current_cred(); + const struct cred *tcred = __task_cred(t); + + if (cred->user->user_ns == tcred->user->user_ns && + (cred->euid == tcred->suid || + cred->euid == tcred->uid || + cred->uid == tcred->suid || + cred->uid == tcred->uid)) + return 1; + + if (ns_capable(tcred->user->user_ns, CAP_KILL)) + return 1; + + return 0; +} The compiler will inline this for us. +/* * Bad permissions for sending the signal * - the caller must hold the RCU read lock */ static int check_kill_permission(int sig, struct siginfo *info, struct task_struct *t) { - const struct cred *cred, *tcred; struct pid *sid; int error; @@ -656,14 +676,8 @@ static int check_kill_permission(int sig, struct siginfo *info, if (error) return error; - cred = current_cred(); - tcred = __task_cred(t); if (!same_thread_group(current, t) && - (cred->euid ^ tcred->suid) && - (cred->euid ^ tcred->uid) && - (cred->uid ^ tcred->suid) && - (cred->uid ^ tcred->uid) && - !capable(CAP_KILL)) { + !kill_ok_by_cred(t)) { switch (sig) { case SIGCONT: sid = task_session(t); ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 5/9] Allow ptrace from non-init user namespaces
On Thu, 17 Feb 2011 15:03:33 +0000 Serge E. Hallyn se...@hallyn.com wrote: ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task has the CAP_SYS_PTRACE capability. ... --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -546,6 +546,8 @@ extern const kernel_cap_t __cap_init_eff_set; */ #define has_capability(t, cap) (security_real_capable((t), &init_user_ns, (cap)) == 0) +#define has_ns_capability(t, ns, cap) (security_real_capable((t), (ns), (cap)) == 0) macroitis. /** * has_capability_noaudit - Determine if a task has a superior capability available (unaudited) * @t: The task in question diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index faf4679..862fc59 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -39,6 +39,9 @@ static inline void put_user_ns(struct user_namespace *ns) uid_t user_ns_map_uid(struct user_namespace *to, const struct cred *cred, uid_t uid); gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t gid); +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim); bool. #else static inline struct user_namespace *get_user_ns(struct user_namespace *ns) ... --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -129,6 +129,22 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t return overflowgid; } +int same_or_ancestor_user_ns(struct task_struct *task, + struct task_struct *victim) +{ + struct user_namespace *u1 = task_cred_xxx(task, user)->user_ns; + struct user_namespace *u2 = task_cred_xxx(victim, user)->user_ns; + for (;;) { + if (u1 == u2) + return 1; + if (u1 == &init_user_ns) + return 0; + u1 = u1->creator->user_ns; + } + /* We never get here */ + return 0; Remove? +} + static __init int user_namespaces_init(void) { user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); ... int cap_ptrace_access_check(struct task_struct *child, unsigned int mode) { int ret = 0; + const struct cred *cred, *tcred; rcu_read_lock(); - if (!cap_issubset(__task_cred(child)->cap_permitted, - current_cred()->cap_permitted) && - !capable(CAP_SYS_PTRACE)) - ret = -EPERM; + cred = current_cred(); + tcred = __task_cred(child); + /* + * The ancestor user_ns check may be gratuitous, as I think + * we've already guaranteed that in kernel/ptrace.c. + */ ? + if (same_or_ancestor_user_ns(current, child) && + cap_issubset(tcred->cap_permitted, cred->cap_permitted)) + goto out; + if (ns_capable(tcred->user->user_ns, CAP_SYS_PTRACE)) + goto out; + ret = -EPERM; +out: rcu_read_unlock(); return ret; } ... ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 7/9] add a user namespace owner of ipc ns
On Thu, 17 Feb 2011 15:03:49 +0000 Serge E. Hallyn se...@hallyn.com wrote: ... --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -24,6 +24,7 @@ struct ipc_ids { struct idr ipcs_idr; }; +struct user_namespace; Move to top of file. struct ipc_namespace { atomic_t count; struct ipc_ids ids[3]; @@ -56,6 +57,8 @@ struct ipc_namespace { unsigned int mq_msg_max; /* initialized to DFLT_MSGMAX */ unsigned int mq_msgsize_max; /* initialized to DFLT_MSGSIZEMAX */ + /* user_ns which owns the ipc ns */ + struct user_namespace *user_ns; }; extern struct ipc_namespace init_ipc_ns; diff --git a/ipc/msgutil.c b/ipc/msgutil.c index f095ee2..d91ff4b 100644 --- a/ipc/msgutil.c +++ b/ipc/msgutil.c @@ -20,6 +20,8 @@ DEFINE_SPINLOCK(mq_lock); +extern struct user_namespace init_user_ns; Should be declared in .h, not in .c. /* * The next 2 defines are here bc this is the only file * compiled when either CONFIG_SYSVIPC and CONFIG_POSIX_MQUEUE ... ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 6/9] user namespaces: convert all capable checks in kernel/sys.c
On Thu, 17 Feb 2011 15:03:42 +0000 Serge E. Hallyn se...@hallyn.com wrote:

This allows setuid/setgid in containers. It also fixes some corner cases where kernel logic foregoes capability checks when uids are equivalent. The latter will need to be done throughout the whole kernel.

...

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -118,17 +118,29 @@ EXPORT_SYMBOL(cad_pid);

 void (*pm_power_off_prepare)(void);

+/* called with rcu_read_lock, creds are safe */
+static inline int set_one_prio_perm(struct task_struct *p)
+{
+	const struct cred *cred = current_cred(), *pcred = __task_cred(p);
+
+	if (pcred->user->user_ns == cred->user->user_ns &&
+	    (pcred->uid == cred->euid ||
+	     pcred->euid == cred->euid))
+		return 1;
+	if (ns_capable(pcred->user->user_ns, CAP_SYS_NICE))
+		return 1;
+	return 0;
+}

uninline. Document return value?

...
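Folding in both comments (out-of-line, documented return value), the helper might become something like this sketch:

	/*
	 * set_one_prio_perm - may current change @p's priority?
	 *
	 * Returns 1 if permission is granted, 0 otherwise.  Must be
	 * called with rcu_read_lock() held so the credentials stay valid.
	 */
	static int set_one_prio_perm(struct task_struct *p)
	{
		const struct cred *cred = current_cred(), *pcred = __task_cred(p);

		if (pcred->user->user_ns == cred->user->user_ns &&
		    (pcred->uid == cred->euid || pcred->euid == cred->euid))
			return 1;
		return ns_capable(pcred->user->user_ns, CAP_SYS_NICE);
	}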
[Devel] Re: [PATCH 9/9] userns: check user namespace for task-file uid equivalence checks
On Thu, 17 Feb 2011 15:04:07 +0000 Serge E. Hallyn se...@hallyn.com wrote:

Cheat for now and say all files belong to init_user_ns. Next step will be to let superblocks belong to a user_ns, and derive inode_userns(inode) from inode->i_sb->s_user_ns. Finally we'll introduce more flexible arrangements.

...

+
+/*
+ * return 1 if current either has CAP_FOWNER to the
+ * file, or owns the file.
+ */
+int is_owner_or_cap(const struct inode *inode)
+{
+	struct user_namespace *ns = inode_userns(inode);
+
+	if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
+		return 1;
+	if (ns_capable(ns, CAP_FOWNER))
+		return 1;
+	return 0;
+}

bool?

+EXPORT_SYMBOL(is_owner_or_cap);

There's a fairly well adhered to convention that global symbols (and often static symbols) have a prefix which identifies the subsystem to which they belong. This patchset rather scorns that convention. Most of these identifiers are pretty obviously from the capability subsystem, but still...

...
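Both remarks could be addressed at once: a bool return type plus a subsystem-style name. The name below is one plausible choice (it is the one the check eventually took upstream), but treat the whole block as a sketch rather than the posted patch:

	/* Sketch: same logic, bool return, prefixed name. */
	bool inode_owner_or_capable(const struct inode *inode)
	{
		struct user_namespace *ns = inode_userns(inode);

		if (current_user_ns() == ns && current_fsuid() == inode->i_uid)
			return true;
		return ns_capable(ns, CAP_FOWNER);
	}
	EXPORT_SYMBOL(inode_owner_or_capable);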
[Devel] Re: [PATCH 6/9] user namespaces: convert all capable checks in kernel/sys.c
On Thu, 17 Feb 2011 15:03:42 +0000 Serge E. Hallyn se...@hallyn.com wrote:

@@ -1177,8 +1189,11 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
 	int errno;
 	char tmp[__NEW_UTS_LEN];

-	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
+	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN)) {
+		printk(KERN_NOTICE "%s: did not have CAP_SYS_ADMIN\n", __func__);
 		return -EPERM;
+	}
+	printk(KERN_NOTICE "%s: did have CAP_SYS_ADMIN\n", __func__);
 	if (len < 0 || len > __NEW_UTS_LEN)
 		return -EINVAL;
 	down_write(&uts_sem);

Left over debugging printks?
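With the stray debugging printks dropped, the hunk presumably reduces to the plain permission check:

	if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
		return -EPERM;
	if (len < 0 || len > __NEW_UTS_LEN)
		return -EINVAL;
	down_write(&uts_sem);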
[Devel] Re: userns: targeted capabilities v5
On Thu, 17 Feb 2011 15:02:24 +0000 Serge E. Hallyn se...@hallyn.com wrote:

Here is a repost of my previous user namespace patch, ported onto last night's git head. It fixes several things I was doing wrong in the last (v4) posting, in particular:

1. don't set uts_ns->user_ns to current's when !CLONE_NEWUTS
2. add an ipc_ns->user_ns which owns the ipc_ns, and use that to decide CAP_IPC_OWNER
3. fix logic flaw caused by bad parentheses
4. allow do_prlimit to current
5. don't always give root full privs to init_user_ns

The expected course of development for user namespaces is laid out at https://wiki.ubuntu.com/UserNamespace.

Seems like a nice feature to be developing. I worry about the maturity of it all at this stage. How far along is it *really*? Is anyone else working with you on developing and reviewing this work?
[Devel] Re: [PATCH 1/1, v7] cgroup/freezer: add per freezer duty ratio control
On Sun, 13 Feb 2011 19:23:10 -0800 Arjan van de Ven ar...@linux.intel.com wrote:

On 2/13/2011 4:44 PM, KAMEZAWA Hiroyuki wrote:
On Sat, 12 Feb 2011 15:29:07 -0800 Matt Helsley matth...@us.ibm.com wrote:
On Fri, Feb 11, 2011 at 11:10:44AM -0800, jacob.jun@linux.intel.com wrote:

From: Jacob Pan jacob.jun@linux.intel.com

The freezer subsystem is used to manage batch jobs which can start and stop at the same time. However, sometimes it is desirable to let the kernel manage the freezer state automatically with a given duty ratio. For example, if we want to reduce the time that background apps are allowed to run, we can put them into a freezer subsystem and set the kernel to turn them THAWED/FROZEN at a given duty ratio. This patch introduces two file nodes under cgroup: freezer.duty_ratio_pct and freezer.period_sec

Again: I don't think this is the right approach in the long term. It would be better not to add this interface and instead enable the cpu cgroup subsystem for non-rt tasks using a similar duty ratio concept. Nevertheless, I've added some feedback on the code for you here :).

AFAIK, there was work on bandwidth control in CFS. http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-10/msg04335.html I tested this and it worked fine. This scheduler approach seems better suited to my purpose of limiting the bandwidth of applications than the freezer.

for our purpose, it's not about bandwidth. it's about making sure the class of apps don't run for a long period (30-second range) of time.

The discussion about this patchset seems to have been upside-down: lots of talk about a particular implementation, with people walking back from the implementation trying to work out what the requirements were, then seeing if other implementations might suit those requirements. Whatever they were.

I think it would be helpful to start again, ignoring (for now) any implementation. What are the requirements here, guys? What effects are we actually trying to achieve? Once that is understood and agreed to, we can think about implementations.

And maybe you people _are_ clear about the requirements. But I'm not, and I'm sure many others aren't too. A clear statement of them would help things along and would doubtless lead to better code. This is pretty basic stuff!
[Devel] Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
On Mon, 7 Feb 2011 20:35:42 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:

This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .

This patch series implements a write function for the 'cgroup.procs' per-cgroup file, which enables atomic movement of multithreaded applications between cgroups. Writing the thread-ID of any thread in a threadgroup to a cgroup's procs file causes all threads in the group to be moved to that cgroup safely with respect to threads forking/exiting. (Possible usage scenario: If running a multithreaded build system that sucks up system resources, this lets you restrict it all at once into a new cgroup to keep it under control.)

Example: Suppose pid 31337 clones new threads 31338 and 31339.

	# cat /dev/cgroup/tasks
	...
	31337
	31338
	31339
	# mkdir /dev/cgroup/foo
	# echo 31337 > /dev/cgroup/foo/cgroup.procs
	# cat /dev/cgroup/foo/tasks
	31337
	31338
	31339

A new lock, called threadgroup_fork_lock and living in signal_struct, is introduced to ensure atomicity when moving threads between cgroups. It's taken for writing during the operation, and taken for reading in fork() around the calls to cgroup_fork() and cgroup_post_fork().

The above six month old text is the best (and almost the only) explanation of the rationale for the entire patch series. Is it still correct and complete?

Assuming yes, then... how do we determine whether the feature is sufficiently useful to justify merging and maintaining it? Will people actually use it?

Was there some particular operational situation which led you to think that the kernel should have this capability? If so, please help us out here and lavishly describe it.
[Devel] Re: [PATCH, v4 2/2] cgroups: introduce timer slack subsystem
On Thu, 3 Feb 2011 16:34:14 +0200 Kirill A. Shutemov kir...@shutemov.name wrote:

From: Kirill A. Shutemov kir...@shutemov.name

Provides a way of grouping tasks by timer slack value. Introduces per-cgroup max and min timer slack values. When a task attaches to a cgroup, its timer slack value is adjusted (if needed) to fit the min-max range. It also provides a way to set the timer slack value for all tasks in the cgroup at once.

This functionality is useful in mobile devices where certain background apps are attached to a cgroup and minimum wakeups are desired.

This description is quite scanty. It doesn't explain what the benefit is, when one might want to use it, how to use it, etc. A nice description under Documentation/cgroups/ would be appropriate. I read this and come away with no sense of how useful or desirable this code is.

...

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -124,6 +124,8 @@ extern struct cred init_cred;
 # define INIT_PERF_EVENTS(tsk)
 #endif

+#define TIMER_SLACK_NS_DEFAULT 50000

Seems to be an unrelated cleanup. And given that this #define is only used in a single place, its cleanliness is questionable.

 /*
  * INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1f (=2MB)
@@ -177,7 +179,7 @@ extern struct cred init_cred;
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock),	\
-	.timer_slack_ns = 50000, /* 50 usec default slack */		\
+	.timer_slack_ns = TIMER_SLACK_NS_DEFAULT,			\
 	.pids = {							\
 		[PIDTYPE_PID]  = INIT_PID_LINK(PIDTYPE_PID),		\
 		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),		\

...

+extern int (*timer_slack_check)(struct task_struct *task,
+			unsigned long slack_ns);

Nope. Please put this in a header file so we can be sure that callers see the same prototype as does the definition.

...

+/*
+ * Adjust ->timer_slack_ns and ->default_max_slack_ns of the task to fit
+ * limits of the cgroup.
+ */
+static void tslack_adjust_task(struct timer_slack_cgroup *tslack_cgroup,
+		struct task_struct *tsk)
+{
+	if (tslack_cgroup->min_slack_ns > tsk->timer_slack_ns)
+		tsk->timer_slack_ns = tslack_cgroup->min_slack_ns;
+	else if (tslack_cgroup->max_slack_ns < tsk->timer_slack_ns)
+		tsk->timer_slack_ns = tslack_cgroup->max_slack_ns;
+
+	if (tslack_cgroup->min_slack_ns > tsk->default_timer_slack_ns)
+		tsk->default_timer_slack_ns = tslack_cgroup->min_slack_ns;
+	else if (tslack_cgroup->max_slack_ns < tsk->default_timer_slack_ns)
+		tsk->default_timer_slack_ns = tslack_cgroup->max_slack_ns;
+}
+
+static void tslack_cgroup_attach(struct cgroup_subsys *subsys,
+		struct cgroup *cgroup, struct cgroup *prev,
+		struct task_struct *tsk, bool threadgroup)
+{
+	tslack_adjust_task(cgroup_to_tslack_cgroup(cgroup), tsk);
+}
+
+static int tslack_write_set_slack_ns(struct cgroup *cgroup, struct cftype *cft,
+		u64 val)
+{
+	struct timer_slack_cgroup *tslack_cgroup;
+	struct cgroup_iter it;
+	struct task_struct *task;
+
+	tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
+	if (!is_timer_slack_allowed(cgroup_to_tslack_cgroup(cgroup), val))
+		return -EPERM;
+
+	/* Change timer slack value for all tasks in the cgroup */
+	cgroup_iter_start(cgroup, &it);
+	while ((task = cgroup_iter_next(cgroup, &it)))
+		task->timer_slack_ns = val;
+	cgroup_iter_end(cgroup, &it);
+
+	return 0;
+}
+
+static u64 tslack_read_range(struct cgroup *cgroup, struct cftype *cft)
+{
+	struct timer_slack_cgroup *tslack_cgroup;
+
+	tslack_cgroup = cgroup_to_tslack_cgroup(cgroup);
+	switch (cft->private) {
+	case TIMER_SLACK_MIN:
+		return tslack_cgroup->min_slack_ns;
+	case TIMER_SLACK_MAX:
+		return tslack_cgroup->max_slack_ns;
+	default:
+		BUG();
+	};

Extraneous ";".

+}

These proposed new userspace interfaces should be documented somewhere, please. They are the most important part of the patch.

...

+struct cgroup_subsys timer_slack_subsys = {
+	.name = "timer_slack",
+	.module = THIS_MODULE,
+#ifdef CONFIG_CGROUP_TIMER_SLACK

Confused. This file won't even be compiled if CONFIG_CGROUP_TIMER_SLACK=n?

+	.subsys_id = timer_slack_subsys_id,
+#endif
+	.create = tslack_cgroup_create,
+	.destroy = tslack_cgroup_destroy,
+	.attach = tslack_cgroup_attach,
+	.populate = tslack_cgroup_populate,
+};
+

...

@@ -1694,12 +1698,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
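For the "Extraneous ;" remark, the fix is simply to drop the semicolon after the switch's closing brace; a sketch of the corrected tail:

	switch (cft->private) {
	case TIMER_SLACK_MIN:
		return tslack_cgroup->min_slack_ns;
	case TIMER_SLACK_MAX:
		return tslack_cgroup->max_slack_ns;
	default:
		BUG();
	}	/* no trailing ';' after a switch block */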
[Devel] Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
On Fri, 4 Feb 2011 16:25:15 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

On Mon, Jan 24, 2011 at 01:05:29PM -0800, Andrew Morton wrote:
On Sun, 26 Dec 2010 07:09:51 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum bb...@andrew.cmu.edu

This patch adds an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

...

+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else

Risky. sched.h doesn't include rwsem.h. We could make it do so, but almost every compilation unit in the kernel includes sched.h. It would be nicer to make the kernel build finer-grained, rather than blunter-grained. Don't be afraid to add new header files if that is one way of doing this!

Hmm, good point. But there's also:

+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore threadgroup_fork_lock;
+#endif

in the signal_struct, also in sched.h, which needs to be there. Or I could change it to a struct pointer with a forward incomplete declaration above, and kmalloc/kfree it? I don't like adding more alloc/free calls but don't know if it's more or less important than header granularity.

What about adding a new header file which includes rwsem.h and sched.h and then defines the new interfaces?
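Andrew's suggestion amounts to a small dedicated header; the name and layout below are illustrative only, not from the series:

	/* Hypothetical include/linux/threadgroup_fork.h: pulls in both
	 * sched.h (for signal_struct) and rwsem.h itself, so sched.h need
	 * not grow a new dependency for its zillion includers. */
	#ifndef _LINUX_THREADGROUP_FORK_H
	#define _LINUX_THREADGROUP_FORK_H

	#include <linux/sched.h>
	#include <linux/rwsem.h>

	#ifdef CONFIG_CGROUPS
	static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
	{
		down_read(&tsk->signal->threadgroup_fork_lock);
	}
	static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
	{
		up_read(&tsk->signal->threadgroup_fork_lock);
	}
	#else
	static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
	static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
	#endif

	#endif /* _LINUX_THREADGROUP_FORK_H */

Note that this only solves the accessor side; as Ben points out, the rw_semaphore member embedded in signal_struct still forces sched.h to know the rwsem type, which is why the kmalloc'ed-pointer alternative comes up at all.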
[Devel] Re: [PATCH v2] cgroup/freezer: add per freezer duty ratio control
On Wed, 2 Feb 2011 16:42:20 -0800 jacob.jun@linux.intel.com wrote:

The freezer subsystem is used to manage batch jobs which can start and stop at the same time. However, sometimes it is desirable to let the kernel manage the freezer state automatically with a given duty ratio. For example, if we want to reduce the time that background apps are allowed to run, we can put them into a freezer subsystem and set the kernel to turn them THAWED/FROZEN at a given duty ratio.

This patch introduces two file nodes under cgroup: freezer.duty_ratio_pct and freezer.period_sec

Usage example: set the period to 5 seconds and the frozen duty ratio to 90%:

	[root@localhost aoa]# echo 90 > freezer.duty_ratio_pct
	[root@localhost aoa]# echo 5 > freezer.period_sec

Signed-off-by: Jacob Pan jacob.jun@linux.intel.com
---
 kernel/cgroup_freezer.c | 109 ++-
 1 files changed, 107 insertions(+), 2 deletions(-)

Update the documentation please. Documentation/cgroups/
[Devel] Re: [PATCH] cgroup : remove the ns_cgroup
On Tue, 25 Jan 2011 10:39:48 +0100 Daniel Lezcano daniel.lezc...@free.fr wrote:

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems:

* cgroup creation is out-of-control
* cgroup names can conflict when pids are looping
* it is not possible to have a single process handling a lot of namespaces without falling into an exponential creation time
* we may want to create a namespace without creating a cgroup

The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Signed-off-by: Serge E. Hallyn serge.hal...@canonical.com
Cc: Eric W. Biederman ebied...@xmission.com
Cc: Jamal Hadi Salim h...@cyberus.ca
Reviewed-by: Li Zefan l...@cn.fujitsu.com
Acked-by: Paul Menage men...@google.com
Acked-by: Matt Helsley matth...@us.ibm.com

...

 22 files changed, 4 insertions(+), 287 deletions(-)

I didn't see that one coming. This change is userspace-visible, is it not? What are the implications of this?

There's some discussion in that nearly-two-year-old thread regarding making provision for back-compatibility, but I'm not seeing such things in this patch?
[Devel] Re: [PATCH] cgroup : remove the ns_cgroup
On Tue, 25 Jan 2011 10:39:48 +0100 Daniel Lezcano daniel.lezc...@free.fr wrote:

This patch removes the ns_cgroup as suggested in the following thread:

I had this patch queued up in September last year, but dropped it. Why did I do that?
[Devel] Re: [PATCH] cgroup : remove the ns_cgroup
On Thu, 27 Jan 2011 09:08:51 +0800 Li Zefan l...@cn.fujitsu.com wrote:

Andrew Morton wrote:
On Tue, 25 Jan 2011 10:39:48 +0100 Daniel Lezcano daniel.lezc...@free.fr wrote:
This patch removes the ns_cgroup as suggested in the following thread:
I had this patch queued up in September last year, but dropped it. Why did I do that?

Because you wanted to wait for some time for users (if any) to notice this coming change.

Author: Daniel Lezcano daniel.lezc...@free.fr
Date: Wed Oct 27 15:33:38 2010 -0700

    cgroup: notify ns_cgroup deprecated

    The ns_cgroup will be removed very soon. Let's warn, for this version, that ns_cgroup is deprecated.

    Make ns_cgroup and clone_children exclusive. If clone_children is set and the ns_cgroup is mounted, let's fail with EINVAL when the ns_cgroup subsys is created (a printk will help the user to understand why the creation fails).

    Update the feature-removal-schedule file with the deprecated ns_cgroup.

    Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
    Acked-by: Paul Menage men...@google.com
    Signed-off-by: Andrew Morton a...@linux-foundation.org
    Signed-off-by: Linus Torvalds torva...@linux-foundation.org

ooh, that was clever of me. Here is the text which was missing from the changelog:

This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.37) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this.

Please provide XXX.

How do we know that 2.6.37->2.6.38 is long enough? Will any major distros be released containing this warning in that timeframe? I doubt it.
[Devel] Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
On Sun, 26 Dec 2010 07:09:51 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum bb...@andrew.cmu.edu

This patch adds an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

...

+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else

Risky. sched.h doesn't include rwsem.h. We could make it do so, but almost every compilation unit in the kernel includes sched.h. It would be nicer to make the kernel build finer-grained, rather than blunter-grained. Don't be afraid to add new header files if that is one way of doing this!
[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable
On Fri, 24 Dec 2010 06:45:00 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

Ah - kmalloc() while holding a spinlock is only a problem with GFP_KERNEL and friends, and NODEMASK_ALLOC() takes a gfp_flags argument for that reason. So changing the uses in cpuset_can_attach to GFP_ATOMIC would solve this concern, then?

It would introduce an additional concern. GFP_ATOMIC is bad, for a number of reasons. The main one of which is that it is vastly less reliable than GFP_KERNEL. And making the cpuset code less reliable is a regression, no?

Please try to find a better solution.
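The usual better solution is to move the allocation outside the locked region so GFP_KERNEL remains usable. Purely an illustrative pattern - "my_lock" is a stand-in for whatever lock actually applies:

	/* Allocate with GFP_KERNEL *before* taking the spinlock, instead
	 * of allocating under the lock with GFP_ATOMIC. */
	NODEMASK_ALLOC(nodemask_t, nodes, GFP_KERNEL);	/* may sleep - fine here */

	if (!nodes)
		return -ENOMEM;
	spin_lock(&my_lock);
	/* ... use *nodes; no allocations needed while the lock is held ... */
	spin_unlock(&my_lock);
	NODEMASK_FREE(nodes);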
[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable
On Wed, 15 Dec 2010 22:34:39 -0800 Paul Menage men...@google.com wrote:

Ping akpm?

Patches have gone a bit stale, sorry. Refactoring in kernel/cgroup_freezer.c necessitates a refresh and retest please.

On Fri, Oct 8, 2010 at 2:57 PM, Paul Menage men...@google.com wrote:
Hi Andrew, do you see any road-blockers for including this patch set in -mm? Thanks, Paul

On Tue, Aug 24, 2010 at 11:08 AM, Paul Menage men...@google.com wrote:
On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum bb...@andrew.cmu.edu wrote:

Makes procs file writable to move all threads by tgid at once

From: Ben Blum bb...@andrew.cmu.edu

This patch adds functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from escaping while the move is in progress.

Signed-off-by: Ben Blum bb...@andrew.cmu.edu
Reviewed-by: Paul Menage men...@google.com
[Devel] Re: [PATCH] user_ns: Improve the user_ns on-the-slab packaging
On Tue, 07 Dec 2010 17:12:33 +0300 Pavel Emelyanov xe...@parallels.com wrote:

@@ -126,3 +128,11 @@ gid_t user_ns_map_gid(struct user_namespace *to, const struct cred *cred, gid_t
 	/* No useful relationship so no mapping */
 	return overflowgid;
 }
+
+static __init int user_namespaces_init(void)
+{
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	return 0;
+}
+
+__initcall(user_namespaces_init);

checkpatch (which you apparently didn't use) says

WARNING: please use device_initcall() instead of __initcall()
#81: FILE: kernel/user_namespace.c:138:
+__initcall(user_namespaces_init);

which was a somewhat random recommendation. I think I'll switch it to plain old module_init(). Presumably user namespaces don't get used prior to initcalls being run.
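So the tail of the file would presumably end up as:

	static __init int user_namespaces_init(void)
	{
		user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
		return 0;
	}
	module_init(user_namespaces_init);	/* was __initcall() */

For built-in code, module_init() maps to the same initcall level, so the behaviour is unchanged; the spelling is just more conventional.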
[Devel] Re: [PATCH v4 05/11] writeback: create dirty_info structure
On Fri, 29 Oct 2010 00:09:08 -0700 Greg Thelen gthe...@google.com wrote:

Bundle the dirty limits and dirty memory usage metrics into a dirty_info structure to simplify the interfaces of routines that need them all.

Problems... These patches interact pretty badly with Fengguang's IO-less dirty throttling v2 patches. I fixed up writeback-create-dirty_info-structure.patch pretty mechanically, but when it got to memcg-check-memcg-dirty-limits-in-page-writeback.patch things got sticky and I gave up.

As your stuff was merged first, I'd normally send the bad news to Fengguang, but the memcg code is logically built upon the core writeback code, so I do think these patches should be staged after the changes to core writeback.

Also, while I was there it seemed that the chosen members of the dirty_info structure were a bit random. Perhaps we should be putting nr_dirty in there as well, perhaps other things. Please have a think about that. Also, in ratelimit_pages() we call global_dirty_info() to return four items, but that caller only actually uses two of them. Wasted effort?

So I'm afraid I'm going to have to request that you redo and retest these patches:

writeback-create-dirty_info-structure.patch
memcg-add-dirty-page-accounting-infrastructure.patch
memcg-add-kernel-calls-for-memcg-dirty-page-stats.patch
memcg-add-dirty-limits-to-mem_cgroup.patch
memcg-add-dirty-limits-to-mem_cgroup-use-native-word-to-represent-dirtyable-pages.patch
memcg-add-dirty-limits-to-mem_cgroup-catch-negative-per-cpu-sums-in-dirty-info.patch
memcg-add-dirty-limits-to-mem_cgroup-avoid-overflow-in-memcg_hierarchical_free_pages.patch
memcg-add-dirty-limits-to-mem_cgroup-correct-memcg_hierarchical_free_pages-return-type.patch
memcg-add-dirty-limits-to-mem_cgroup-avoid-free-overflow-in-memcg_hierarchical_free_pages.patch
memcg-cpu-hotplug-lockdep-warning-fix.patch
memcg-add-cgroupfs-interface-to-memcg-dirty-limits.patch
memcg-break-out-event-counters-from-other-stats.patch
memcg-check-memcg-dirty-limits-in-page-writeback.patch
memcg-use-native-word-page-statistics-counters.patch
memcg-use-native-word-page-statistics-counters-fix.patch
# memcg-add-mem_cgroup-parameter-to-mem_cgroup_page_stat.patch
memcg-pass-mem_cgroup-to-mem_cgroup_dirty_info.patch
#memcg-make-throttle_vm_writeout-memcg-aware.patch: troublesome: Kamezawa
memcg-make-throttle_vm_writeout-memcg-aware.patch
memcg-make-throttle_vm_writeout-memcg-aware-fix.patch
memcg-simplify-mem_cgroup_page_stat.patch
memcg-simplify-mem_cgroup_dirty_info.patch
memcg-make-mem_cgroup_page_stat-return-value-unsigned.patch

against the http://userweb.kernel.org/~akpm/mmotm/ which I just uploaded, sorry. I've uploaded my copy of all the above to http://userweb.kernel.org/~akpm/stuff/gthelen.tar.gz. I think only the two patches need fixing and retesting.

Also, while wrangling the above patches, I stumbled across rejects such as:

***************
*** 99,106 ****
 		   "state:		%8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
--- 98,106 ----
 		   "state:		%8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
+		   K(bdi_thresh), K(dirty_info.dirty_thresh),
+		   K(dirty_info.background_thresh),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);

Please, if you discover crud like this, just fix it up. One item per line:

		   "state:		%8lx\n",
		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
		   K(bdi_thresh),
		   K(dirty_info.dirty_thresh),
		   K(dirty_info.background_thresh),
		   nr_dirty,
		   nr_io,
		   nr_more_io,
		   !list_empty(&bdi->bdi_list), bdi->state);

all very simple. And while you're there, fix up the tab-tab-space-space-space indenting - just use tabs.

The other area where code maintenance is harder than it needs to be is in definitions of locals:

	long nr_reclaimable;
	long nr_dirty, bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
	long bdi_prev_dirty = 0;

again, that's just dopey. Change it to

	long nr_reclaimable;
	long nr_dirty;
	long bdi_dirty;		/* = file_dirty + writeback + unstable_nfs */
	long bdi_prev_dirty = 0;

All very simple. Thanks.
[Devel] Re: [PATCH v4 05/11] writeback: create dirty_info structure
On Wed, 17 Nov 2010 16:49:24 -0800 Andrew Morton a...@linux-foundation.org wrote:

against the http://userweb.kernel.org/~akpm/mmotm/ which I just uploaded

err, will upload Real Soon Now.
[Devel] Re: [PATCH v4 00/11] memcg: per cgroup dirty page accounting
On Fri, 29 Oct 2010 00:09:03 -0700 Greg Thelen gthe...@google.com wrote:

This is cool stuff - it's been a long haul. One day we'll be nearly-finished and someone will write a book telling people how to use it all, and lots of people will go "holy crap". I hope.

Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim) page cache used by a cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit.

The patches are based on a series proposed by Andrea Righi in Mar 2010.

Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs unstable.
- Extend mem_cgroup to record the total number of pages in each of the interesting dirty states (dirty, writeback, unstable_nfs).
- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs control files.

Curious minds will want to know what the default values are set to and how they were determined.

- Consider both system and per-memcg dirty limits in page writeback when deciding to queue background writeback or block for foreground writeback.

Known shortcomings:
- When a cgroup dirty limit is exceeded, then bdi writeback is employed to writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not just inodes contributing dirty pages to the cgroup exceeding its limit.

yup. Some broader discussion of the implications of this shortcoming is needed. I'm not sure where it would be placed, though. Documentation/ for now, until you write that book.

- When memory.use_hierarchy is set, then dirty limits are disabled. This is an implementation detail.

So this is unintentional, and forced upon us by the present implementation?

An enhanced implementation is needed to check the chain of parents to ensure that no dirty limit is exceeded.

How important is it that this be fixed? And how feasible would that fix be? A linear walk up the hierarchy list? More than that?

Performance data:
- A page fault microbenchmark workload was used to measure performance, which can be called in read or write mode (see the table after this list):

	f = open("foo." $cpu)
	truncate(f, 4096)
	alarm(60)
	while (1) {
		p = mmap(f, 4096)
		if (write)
			*p = 1
		else
			x = *p
		munmap(p)
	}

- The workload was called for several points in the patch series in different modes:
  - s_read is a single threaded reader
  - s_write is a single threaded writer
  - p_read is a 16 thread reader, each operating on a different file
  - p_write is a 16 thread writer, each operating on a different file
- Measurements were collected on a 16 core non-numa system using "perf stat --repeat 3". The -a option was used for parallel (p_*) runs.
- All numbers are page fault rate (M/sec). Higher is better.
- To compare the performance of a kernel without memcg, compare the first and last rows; neither has memcg configured. The first row does not include any of these memcg patches.
- To compare the performance of using memcg dirty limits, compare the baseline (2nd row, titled "w/ memcg") with the code and memcg enabled (2nd to last row, titled "all patches").
                        root_cgroup                        child_cgroup
                 s_read s_write p_read p_write      s_read s_write p_read p_write
mmotm w/o memcg  0.428  0.390   0.429  0.388
mmotm w/ memcg   0.411  0.378   0.391  0.362        0.412  0.377   0.385  0.363
all patches      0.384  0.360   0.370  0.348        0.381  0.363   0.368  0.347
all patches
 w/o memcg       0.431  0.402   0.427  0.395

afaict this benchmark has demonstrated that the changes do not cause an appreciable performance regression in terms of CPU loading, yes?

Can we come up with any tests which demonstrate the _benefits_ of the feature?
[Devel] Re: [PATCH v4 09/11] memcg: CPU hotplug lockdep warning fix
On Fri, 29 Oct 2010 00:09:12 -0700 Greg Thelen gthe...@google.com wrote:

From: Balbir Singh bal...@linux.vnet.ibm.com

memcg has lockdep warnings (sleep inside rcu lock)

...

Acked-by: Greg Thelen gthe...@google.com

You were on the patch delivery path, so this should be Signed-off-by:. I made that change to my copy.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
[Devel] Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
On Fri, 29 Oct 2010 00:09:05 -0700 Greg Thelen gthe...@google.com wrote:

Document cgroup dirty memory interfaces and statistics.

...

+When use_hierarchy=0, each cgroup has dirty memory usage and limits.
+System-wide dirty limits are also consulted. Dirty memory consumption is
+checked against both system-wide and per-cgroup dirty limits.
+
+The current implementation does enforce per-cgroup dirty limits when

does not, I trust.

+use_hierarchy=1. System-wide dirty limits are used for processes in such
+cgroups. Attempts to read memory.dirty_* files return the system-wide values.
+Writes to the memory.dirty_* files return error. An enhanced implementation is
+needed to check the chain of parents to ensure that no dirty limit is exceeded.
+
 6. Hierarchy support

 The memory controller supports a deep hierarchy and hierarchical accounting.

--
1.7.3.1
[Devel] Re: [PATCH 2/2] Kconfig : default all the namespaces to 'yes'
On Wed, 13 Oct 2010 09:44:30 -0500 Serge E. Hallyn se...@hallyn.com wrote:

Quoting Daniel Lezcano (daniel.lezc...@free.fr):
On 10/12/2010 07:16 PM, Serge E. Hallyn wrote:
Quoting Matt Helsley (matth...@us.ibm.com):
On Thu, Oct 07, 2010 at 03:15:33PM +0200, Daniel Lezcano wrote:

As the different namespaces depend on 'CONFIG_NAMESPACES', it is logical to enable all the namespaces when we enable NAMESPACES.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr

Subject of the patch email is a little confusing as it's not quite what happens. I'm mostly OK with it but I'm not sure we should enable user-ns by default just yet.

Acked-By: Matt Helsley matth...@us.ibm.com

In fact, perhaps we should keep the experimental tag on user namespaces.

The experimental tag is kept on the user namespace. This one is defaulting to yes when the namespaces and experimental are selected.

Oh, sounds good

My attention flagged. Can we please confirm that the current patch is still good?

From: Daniel Lezcano daniel.lezc...@free.fr

As the different namespaces depend on 'CONFIG_NAMESPACES', it is logical to enable all the namespaces when we enable NAMESPACES.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
Cc: Eric W. Biederman ebied...@xmission.com
Cc: David Miller da...@davemloft.net
Acked-By: Matt Helsley matth...@us.ibm.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 init/Kconfig | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff -puN init/Kconfig~namespaces-default-all-the-namespaces-to-yes-when-config_namespaces-is-selected init/Kconfig
--- a/init/Kconfig~namespaces-default-all-the-namespaces-to-yes-when-config_namespaces-is-selected
+++ a/init/Kconfig
@@ -739,6 +739,7 @@ config NAMESPACES
 config UTS_NS
 	bool "UTS namespace"
 	depends on NAMESPACES
+	default y
 	help
 	  In this namespace tasks see different info provided with the
 	  uname() system call
@@ -746,6 +747,7 @@ config UTS_NS
 config IPC_NS
 	bool "IPC namespace"
 	depends on NAMESPACES && (SYSVIPC || POSIX_MQUEUE)
+	default y
 	help
 	  In this namespace tasks work with IPC ids which correspond to
 	  different IPC objects in different namespaces.
@@ -753,6 +755,7 @@ config IPC_NS
 config USER_NS
 	bool "User namespace (EXPERIMENTAL)"
 	depends on NAMESPACES && EXPERIMENTAL
+	default y
 	help
 	  This allows containers, i.e. vservers, to use user namespaces
 	  to provide different user info for different servers.
@@ -760,8 +763,8 @@ config USER_NS
 config PID_NS
 	bool "PID Namespaces"
-	default n
 	depends on NAMESPACES
+	default y
 	help
 	  Support process id namespaces. This allows having multiple
 	  processes with the same pid as long as they are in different
@@ -769,8 +772,8 @@ config PID_NS
 config NET_NS
 	bool "Network namespace"
-	default n
 	depends on NAMESPACES && NET
+	default y
 	help
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
_
[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
On Fri, 8 Oct 2010 13:37:12 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Thu, 7 Oct 2010 16:14:54 -0700 Andrew Morton a...@linux-foundation.org wrote:
On Thu, 7 Oct 2010 17:04:05 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

Now, at task migration among cgroups, memory cgroup scans the page table and moves the accounting if flags are properly set. The core code, mem_cgroup_move_charge_pte_range(), does

	pte_offset_map_lock();
	for all ptes in a page table:
		1. look into the page table, find_and_get a page
		2. remove it from LRU
		3. move charge
		4. putback to LRU, put_page()
	pte_offset_map_unlock();

for pte entries on a 3rd-level page table?

This pte_offset_map_lock seems a bit long. This patch modifies the routine as:

	for each 32 pages:
		pte_offset_map_lock()
		find_and_get a page
		record it
		pte_offset_map_unlock()
	for all recorded pages:
		isolate it from LRU
		move charge
		putback to LRU
	for all recorded pages:
		put_page()

The patch makes the code larger, more complex and slower!

Slower?

Sure. It walks the same data three times, potentially causing thrashing in the L1 cache. It takes and releases locks at a higher frequency. It increases the text size.
[Devel] Re: [PATCH 1/1] cgroups: strcpy destination string overflow
On Tue, 5 Oct 2010 12:38:05 +0400 Evgeny Kuznetsov ext-eugeny.kuznet...@nokia.com wrote:

From: Evgeny Kuznetsov ext-eugeny.kuznet...@nokia.com

The strcpy function is used without a check for the maximum allowed source string length and could cause a destination string overflow. A check for the string length is added before using strcpy. The function now returns an error if the source string length is more than the maximum.

Signed-off-by: Evgeny Kuznetsov ext-eugeny.kuznet...@nokia.com
---
 kernel/cgroup.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c9483d8..82bbede 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1883,6 +1883,8 @@ static int cgroup_release_agent_write(struct cgroup *cgrp, struct cftype *cft,
 				      const char *buffer)
 {
 	BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
+	if (strlen(buffer) >= PATH_MAX)
+		return -EINVAL;
 	if (!cgroup_lock_live_group(cgrp))
 		return -ENODEV;
 	strcpy(cgrp->root->release_agent_path, buffer);

I don't think this can happen, because cftype.max_write_len is PATH_MAX. But it's pretty unobvious if this is actually true, and the code is fragile against future changes.
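A more future-proof variant - an alternative sketch, not the posted patch - is to let the copy truncate instead of trusting the caller's length guarantee:

	if (!cgroup_lock_live_group(cgrp))
		return -ENODEV;
	/* strlcpy() truncates at the destination size instead of
	 * overflowing, so this stays safe even if max_write_len
	 * changes later. */
	strlcpy(cgrp->root->release_agent_path, buffer,
		sizeof(cgrp->root->release_agent_path));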
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Tue, 28 Sep 2010 15:50:17 +0200 Daniel Lezcano daniel.lezc...@free.fr wrote:

On 09/27/2010 10:46 PM, Andrew Morton wrote:
On Mon, 27 Sep 2010 15:36:58 -0500 Serge E. Hallyn serge.hal...@canonical.com wrote:

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change?

Yes, it is. Patch 1 is needed to let lxc and libvirt both control containers with the same cgroup setup. Patch 3 however isn't *necessary* for that. Daniel, what do you think about holding off on patch 3?

One way of handling this would be to merge patches 1 and 2, which add the new interface, and also arrange for usage of the old interface(s) to emit a printk telling people that they're using a feature which is scheduled for removal.

Right, that makes sense. Will you take patches #1 and #2 and drop patch #3, and I send a new patch with the printk warning? Or shall I resend all?

I dropped #3. Please send the printk-warning patch. I'd suggest a printk_once(), nice and verbose.
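A sketch of the requested warning - the wording is an assumption, emitted from wherever the old interface is first used:

	/* Illustrative deprecation notice, printed once per boot: */
	printk_once(KERN_WARNING
		    "cgroup: the ns_cgroup subsystem is deprecated and "
		    "scheduled for removal; use the \"clone_children\" "
		    "cgroup flag instead\n");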
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Mon, 27 Sep 2010 12:14:10 +0200 Daniel Lezcano daniel.lezc...@free.fr wrote:

The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process from escaping the control group and uses the post_clone callback, so the child cgroup inherits the values of the parent cgroup.

Unfortunately, the more we use this cgroup the more we are facing problems with it:

(1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST

(2) the cgroup creation is out of control because there may be an application creating several namespaces, where the system will automatically create several cgroups behind its back and leave them on the cgroupfs (eg. a vrf based on the network namespace).

(3) the mix of (1) and (2) forces an administrator to regularly check and clean these cgroups.

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change? What are the implications of this?
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Mon, 27 Sep 2010 11:18:47 +0200 Michael Holzheu holz...@linux.vnet.ibm.com wrote:

Hello Andrew,

On Fri, 2010-09-24 at 11:50 -0700, Andrew Morton wrote:

This is a big change! If this is done right then we're heading in the direction of deprecating the longstanding way in which userspace observes the state of Linux processes and we're recommending that the whole world migrate to taskstats. I think?

Or it can be used as an alternative. Since procfs has its drawbacks (e.g. performance), an alternative could be helpful.

And it can be harmful. More kernel code to maintain and test, more userspace code to develop, maintain, etc. Less user testing than if there was a single interface.

Sure, the value has to be big enough to justify the effort. But as I said, with taskstats and procfs we already have two interfaces for getting task information.

That doesn't mean it was the right thing to do! For the reasons I outline above, it can be the wrong thing to do, and strengthening one of the alternatives worsens the problem.

Currently in procfs there is information that you can't find in taskstats. But also the other way round: in the taskstats structure there is very useful information that you can't get under proc, e.g. the task delay times, IO accounting, etc.

Sounds like a big screwup ;) Look at it this way: if you were going to sit down and start to design a new operating system from scratch, would you design the task status reporting system as it currently stands in Linux? Don't think so!

So currently tools have to use both interfaces to get all information, which is not optimal.

I worry that there's a dependency on CONFIG_NET? If so then that's a big problem because in N years time, 99% of the world will be using taskstats, but a few embedded losers will be stuck using (and having to support) the old tools.

Sure, but if we could add the /proc/taskstats approach, this dependency would not be there.

So why do we need to present the same info over netlink?

Good point. It is not really necessary. I started development using the netlink code, therefore I first added the new command in the netlink code. I also thought it would be a good idea to provide all netlink commands over the procfs interface to be consistent.

Maybe we should have delivered taskstats over procfs from day one.
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Mon, 27 Sep 2010 15:36:58 -0500 Serge E. Hallyn serge.hal...@canonical.com wrote:

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change?

Yes, it is. Patch 1 is needed to let lxc and libvirt both control containers with the same cgroup setup. Patch 3 however isn't *necessary* for that. Daniel, what do you think about holding off on patch 3?

One way of handling this would be to merge patches 1 and 2, which add the new interface, and also arrange for usage of the old interface(s) to emit a printk telling people that they're using a feature which is scheduled for removal. Later, we remove the old interface.
[Devel] Re: [PATCH 0/3][V2] remove the ns_cgroup
On Mon, 27 Sep 2010 13:45:26 -0700 ebied...@xmission.com (Eric W. Biederman) wrote:

Serge E. Hallyn serge.hal...@canonical.com writes:
Quoting Andrew Morton (a...@linux-foundation.org):
On Mon, 27 Sep 2010 12:14:10 +0200 Daniel Lezcano daniel.lezc...@free.fr wrote:

The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process from escaping the control group and uses the post_clone callback, so the child cgroup inherits the values of the parent cgroup.

Unfortunately, the more we use this cgroup the more we are facing problems with it:

(1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST

(2) the cgroup creation is out of control because there may be an application creating several namespaces, where the system will automatically create several cgroups behind its back and leave them on the cgroupfs (eg. a vrf based on the network namespace).

(3) the mix of (1) and (2) forces an administrator to regularly check and clean these cgroups.

This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings compatibility. We now have to manually create and add the task to a cgroup, which is consistent with the cgroup framework.

So this is a non-backward-compatible userspace-visible change?

Yes, it is.

We have always been able to compile out the ns cgroup, right? In which case this is not a backwards-incompatible change so much as the permanent removal of a borked kernel feature.

That's just spin ;) People whose code is dependent on the old feature get screwed over if we remove it. And sure, it's unlikely that many people at all are using this feature. But it's amazing what goes on out there, and if the cost to us of providing people with a migration period isn't too high then why not do it?
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Fri, 24 Sep 2010 11:10:15 +0200 Michael Holzheu holz...@linux.vnet.ibm.com wrote:

Hello Andrew,

On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:

GOALS OF THIS PATCH SET
-----------------------
The intention of this patch set is to provide better support for tools like top. The goal is to:

* provide a task snapshot mechanism where we can get a consistent view of all running tasks.
* provide a transport mechanism that does not require a lot of system calls and that allows implementing low CPU overhead task monitoring.
* provide microsecond CPU time granularity.

This is a big change! If this is done right then we're heading in the direction of deprecating the longstanding way in which userspace observes the state of Linux processes and we're recommending that the whole world migrate to taskstats. I think?

Or it can be used as an alternative. Since procfs has its drawbacks (e.g. performance), an alternative could be helpful.

And it can be harmful. More kernel code to maintain and test, more userspace code to develop, maintain, etc. Less user testing than if there was a single interface.

I worry that there's a dependency on CONFIG_NET? If so then that's a big problem because in N years time, 99% of the world will be using taskstats, but a few embedded losers will be stuck using (and having to support) the old tools.

Sure, but if we could add the /proc/taskstats approach, this dependency would not be there.

So why do we need to present the same info over netlink? If the info is available via procfs then userspace code should use that and not netlink, because that userspace code would also be applicable to CONFIG_NET=n systems.

Does this have the potential to save us from the CONFIG_NET=n problem?

Yes

Let's say that when it's all tested ;)

Are PIDs over all namespaces unique?

Nope. The same pid can be present in different namespaces at the same time.
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Thu, 23 Sep 2010 15:48:01 +0200 Michael Holzheu holz...@linux.vnet.ibm.com wrote: Currently tools like top gather the task information by reading procfs files. This has several disadvantages: * It is very CPU intensive, because a lot of system calls (readdir, open, read, close) are necessary. * No real task snapshot can be provided, because while the procfs files are read the system continues running. * The procfs times granularity is restricted to jiffies. In parallel to procfs there exists the taskstats binary interface that uses netlink sockets as transport mechanism to deliver task information to user space. There exists a taskstats command TASKSTATS_CMD_ATTR_PID to get task information for a given PID. This command can already be used for tools like top, but has also several disadvantages: * You first have to find out which PIDs are available in the system. Currently we have to use procfs again to do this. * For each task two system calls have to be issued (First send the command and then receive the reply). * No snapshot mechanism is available. GOALS OF THIS PATCH SET --- The intention of this patch set is to provide better support for tools like top. The goal is to: * provide a task snapshot mechanism where we can get a consistent view of all running tasks. * provide a transport mechanism that does not require a lot of system calls and that allows implementing low CPU overhead task monitoring. * provide microsecond CPU time granularity. This is a big change! If this is done right then we're heading in the direction of deprecating the longstanding way in which userspace observes the state of Linux processes and we're recommending that the whole world migrate to taskstats. I think? If so, much chin-scratching will be needed, coordination with util-linux people, etc. We'd need to think about the implications of taskstats versioning. It _is_ a versioned interface, so people can't just go and toss random new stuff in there at will - it's not like adding a new procfs file, or adding a new line to an existing one. I don't know if that's likely to be a significant problem. I worry that there's a dependency on CONFIG_NET? If so then that's a big problem because in N years time, 99% of the world will be using taskstats, but a few embedded losers will be stuck using (and having to support) the old tools. FIRST RESULTS - Together with this kernel patch set also user space code for a new top utility (ptop) is provided that exploits the new kernel infrastructure. See patch 10 for more details. TEST1: System with many sleeping tasks for ((i=0; i 1000; i++)) do sleep 100 done # ptop_new_proc pid user sys ste total Name (#)(%) (%) (%)(%) (str) 541 0.37 2.39 0.10 2.87 top 3743 0.03 0.05 0.00 0.07 ptop_new_proc Compared to the old top command that has to scan more than 1000 proc directories the new ptop consumes much less CPU time (0.05% system time on my s390 system). How many CPUs does that system have? What's the `top' update period? One second? So we're saying that a `top -d 1' consumes 2.4% of this mystery-number-of-CPUs machine? That's quite a lot. PATCHSET OVERVIEW - The code is not final and still has a few TODOs. But it is good enough for a first round of review. The following kernel patches are provided: [01] Prepare-0: Use real microsecond granularity for taskstats CPU times. [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands more easily. [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from filling the taskstats. 
[04] Add new command TASKSTATS_CMD_ATTR_PIDS to get a snapshot of multiple tasks.
[05] Add procfs interface for taskstats commands. This makes it possible to get a complete and consistent snapshot with all tasks using two system calls (ioctl and read). Transferring a snapshot of all running tasks is not possible using the existing netlink interface, because there the socket buffer size is the restricting factor.

So this is a binary interface which uses an ioctl. People don't like ioctls. Could we have triggered it with a write() instead? Does this have the potential to save us from the CONFIG_NET=n problem?

[06] Add TGID to taskstats.
[07] Add steal time per task accounting.
[08] Add cumulative CPU time (user, system and steal) to taskstats.

These didn't update the taskstats version number. Should they have?

[09] Fix exit CPU time accounting.
[10] Besides the kernel patches, user space code is provided that exploits the new kernel infrastructure. The user space code provides the following:
1. A proposal for a taskstats user space library:
1.1 Based on netlink (requires libnl-devel-1.1-5)
2.1 Based on the
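For reference, below is a hedged sketch of the existing per-PID round trip from user space, modeled on the kernel's Documentation/accounting/getdelays.c (error handling trimmed, attribute parsing simplified; the snapshot commands proposed above are not assumed to exist). It makes the two-system-calls-per-task cost concrete: one send() and one recv() per task, after an initial pair to resolve the genetlink family id.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>

#define GENLMSG_DATA(glh)  ((void *)((char *)NLMSG_DATA(glh) + GENL_HDRLEN))
#define NLA_DATA(na)       ((void *)((char *)(na) + NLA_HDRLEN))

struct msgtemplate {
	struct nlmsghdr n;
	struct genlmsghdr g;
	char buf[1024];
};

/* build a genetlink request carrying one attribute and send it */
static void send_cmd(int fd, __u16 nl_type, __u8 genl_cmd,
		     __u16 nla_type, void *data, int len)
{
	struct msgtemplate msg;
	struct nlattr *na;

	memset(&msg, 0, sizeof(msg));
	msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
	msg.n.nlmsg_type = nl_type;
	msg.n.nlmsg_flags = NLM_F_REQUEST;
	msg.g.cmd = genl_cmd;
	msg.g.version = 1;

	na = (struct nlattr *)GENLMSG_DATA(&msg);
	na->nla_type = nla_type;
	na->nla_len = len + NLA_HDRLEN;
	memcpy(NLA_DATA(na), data, len);
	msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);

	if (send(fd, &msg, msg.n.nlmsg_len, 0) < 0) {
		perror("send");
		exit(1);
	}
}

int main(int argc, char **argv)
{
	struct sockaddr_nl local = { .nl_family = AF_NETLINK };
	struct msgtemplate msg;
	struct taskstats *ts;
	struct nlattr *na;
	__u16 family_id;
	__u32 pid;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
	if (fd < 0 || bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
		perror("socket/bind");
		return 1;
	}

	/* 1st round trip: resolve the dynamic taskstats genetlink family id */
	send_cmd(fd, GENL_ID_CTRL, CTRL_CMD_GETFAMILY, CTRL_ATTR_FAMILY_NAME,
		 TASKSTATS_GENL_NAME, strlen(TASKSTATS_GENL_NAME) + 1);
	if (recv(fd, &msg, sizeof(msg), 0) < 0 || msg.n.nlmsg_type == NLMSG_ERROR)
		return 1;
	na = (struct nlattr *)GENLMSG_DATA(&msg);	/* CTRL_ATTR_FAMILY_NAME */
	na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len));
	family_id = *(__u16 *)NLA_DATA(na);		/* CTRL_ATTR_FAMILY_ID */

	/* 2nd round trip: the send+receive pair the text complains about,
	 * needed once per task when sampling the whole system */
	send_cmd(fd, family_id, TASKSTATS_CMD_GET, TASKSTATS_CMD_ATTR_PID,
		 &pid, sizeof(pid));
	if (recv(fd, &msg, sizeof(msg), 0) < 0 || msg.n.nlmsg_type == NLMSG_ERROR)
		return 1;
	na = (struct nlattr *)GENLMSG_DATA(&msg);	/* TASKSTATS_TYPE_AGGR_PID */
	na = (struct nlattr *)NLA_DATA(na);		/* nested: TASKSTATS_TYPE_PID */
	na = (struct nlattr *)((char *)na + NLA_ALIGN(na->nla_len)); /* TYPE_STATS */
	ts = NLA_DATA(na);

	printf("pid %u: utime %llu us, stime %llu us\n", pid,
	       (unsigned long long)ts->ac_utime,
	       (unsigned long long)ts->ac_stime);
	close(fd);
	return 0;
}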
[Devel] Re: [PATCH] cgroups: fix API thinko
On Fri, 06 Aug 2010 10:38:24 -0600 Alex Williamson alex.william...@redhat.com wrote:

On Fri, 2010-08-06 at 09:34 -0700, Sridhar Samudrala wrote:

On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote:

The cgroup_attach_task_current_cg API that we have upstream is backwards: we really need an API to attach the current task to the cgroups of another process A. In our case (vhost), a privileged user wants to attach its task to the cgroups of a less privileged one; the API makes us run it in the other task's context, and this fails. So let's make the API generic and just pass in 'from' and 'to' tasks. Add an inline wrapper for cgroup_attach_task_current_cg to avoid breaking bisect.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Paul, Li, Sridhar, could you please review the following patch? I only compile-tested it due to travel, but it looks straightforward to me. Alex Williamson volunteered to test and report the results. Sending out now for review as I might be offline for a bit. Will only try to merge when done, obviously. If OK, I would like to merge this through the -net tree, together with the patch fixing vhost-net. Let me know if that sounds ok. Thanks!

This patch is on top of net-next; it is needed to fix a vhost-net regression in net-next, where a non-privileged process can't enable the device anymore: when qemu uses vhost, inside the ioctl call it creates a thread and tries to add this thread to the groups of current, and it fails. But we control the thread, so to solve the problem we really should tell it 'connect to our cgroups'.

So am I correct to assume that this change is now needed in 2.6.36, and unneeded in 2.6.35?

Can it affect the userspace-kernel API in any manner? If so, it should be backported into earlier kernels to reduce the number of incompatible kernels out there.

Paul, did you have any comments? I didn't see any update in response to the minor review comments, so...

 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |    6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff -puN include/linux/cgroup.h~cgroups-fix-api-thinko-fix include/linux/cgroup.h
--- a/include/linux/cgroup.h~cgroups-fix-api-thinko-fix
+++ a/include/linux/cgroup.h
@@ -579,6 +579,7 @@ void cgroup_iter_end(struct cgroup *cgrp
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
+
 static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
 {
 	return cgroup_attach_task_all(current, tsk);
diff -puN kernel/cgroup.c~cgroups-fix-api-thinko-fix kernel/cgroup.c
--- a/kernel/cgroup.c~cgroups-fix-api-thinko-fix
+++ a/kernel/cgroup.c
@@ -1798,13 +1798,13 @@ out:
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 {
 	struct cgroupfs_root *root;
-	struct cgroup *cur_cg;
 	int retval = 0;
 
 	cgroup_lock();
 	for_each_active_root(root) {
-		cur_cg = task_cgroup_from_root(from, root);
-		retval = cgroup_attach_task(cur_cg, tsk);
+		struct cgroup *from_cg = task_cgroup_from_root(from, root);
+
+		retval = cgroup_attach_task(from_cg, tsk);
 		if (retval)
 			break;
 	}
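To make the direction of the new API concrete, here is a hedged sketch of the caller side in a vhost-like driver: the ioctl runs in the owner's context, creates a worker thread, and pulls that worker into the owner's cgroups. The worker function and creation details are illustrative assumptions, not the real vhost code; only cgroup_attach_task_all(from, to) comes from the patch above.

#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* hypothetical worker loop, stands in for the real vhost worker */
static int vhost_worker_fn(void *data)
{
	while (!kthread_should_stop())
		schedule();
	return 0;
}

static int start_worker_for(struct task_struct *owner)
{
	struct task_struct *worker;
	int err;

	worker = kthread_create(vhost_worker_fn, NULL, "vhost-%d", owner->pid);
	if (IS_ERR(worker))
		return PTR_ERR(worker);

	/* Old API: had to run in the worker's context to join 'current'.
	 * New API: the privileged owner (current, here 'owner') pushes the
	 * worker into the owner's cgroups directly: from 'owner' to 'worker'. */
	err = cgroup_attach_task_all(owner, worker);
	if (err) {
		kthread_stop(worker);
		return err;
	}
	wake_up_process(worker);
	return 0;
}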
[Devel] Re: [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
On Fri, 30 Jul 2010 19:56:49 -0400 Ben Blum bb...@andrew.cmu.edu wrote:

This patch series implements a write function for the 'cgroup.procs' per-cgroup file, which enables atomic movement of multithreaded applications between cgroups. Writing the thread-ID of any thread in a threadgroup to a cgroup's procs file causes all threads in the group to be moved to that cgroup safely with respect to threads forking/exiting. (Possible usage scenario: if a multithreaded build system is sucking up system resources, this lets you restrict it all at once into a new cgroup to keep it under control.)

I can see how that would be useful. No comments from anyone else?

patch 1/2 makes me cry with all those ifdefs. Maybe helper functions would help, but not a lot.

patch 2/2 looks very complicated.
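To make the usage scenario concrete, a minimal user-space sketch: writing any one thread id to cgroup.procs is meant to move the entire threadgroup in one step, whereas writing to the per-thread 'tasks' file moves only that thread. The mount point and group path below are assumptions for illustration.

#include <stdio.h>
#include <sys/types.h>

/* move the whole threadgroup that any_tid belongs to into cgroup_dir */
static int move_threadgroup(const char *cgroup_dir, pid_t any_tid)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)any_tid);	/* one write, all threads move */
	return fclose(f);			/* flush reports EPERM etc. */
}

int main(void)
{
	/* e.g. corral a runaway multithreaded build by any of its tids */
	return move_threadgroup("/dev/cgroup/build", 12345) ? 1 : 0;
}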
[Devel] Re: [Bugme-new] [Bug 16417] New: Slow context switches with SMP and CONFIG_FAIR_GROUP_SCHED
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

sched suckage! Do we have a linear search in there?

On Mon, 19 Jul 2010 14:38:09 GMT bugzilla-dae...@bugzilla.kernel.org wrote:

https://bugzilla.kernel.org/show_bug.cgi?id=16417

Summary: Slow context switches with SMP and CONFIG_FAIR_GROUP_SCHED
Product: Process Management
Version: 2.5
Kernel Version: 2.6.34.1
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Scheduler
AssignedTo: mi...@elte.hu
ReportedBy: pbour...@excellency.fr
Regression: No

Hello,

We have been experiencing slow context switches when using a large number of cgroups (around 600 groups) and CONFIG_FAIR_GROUP_SCHED. This causes an increase in system time usage for context-switch-heavy processes (measured with pidstat -w) and a drop in timer interrupt handling. This problem only appears on SMP: when booting with nosmp, the issue does not appear. From maxprocs=2 to maxprocs=8 we were able to reproduce it accurately.

Steps to reproduce:
- mount the cgroup filesystem in /dev/cgroup
- cd /dev/cgroup
- for i in $(seq 1 5000); do mkdir test_group_$i; done
- launch lat_ctx from lmbench, for instance ./lat_ctx -N 200 100

The results from lat_ctx were the following:
- SMP enabled, no cgroups: 2.65
- SMP enabled, 1000 cgroups: 3.40
- SMP enabled, 6000 cgroups: 3957.36
- SMP disabled, 6000 cgroups: 1.58

We can see that from a certain number of cgroups onwards, context switching starts taking a lot of time.

Another way to reproduce this problem:
- launch cat /dev/zero | pv -L 1G > /dev/null
- look at the CPU usage (about 40% here)
- cd /dev/cgroup; for i in $(seq 1 5000); do mkdir test_group_$i; done
- look at the CPU usage (about 80% here)

Also note that when a lot of cgroups are present, the system spends a lot of time in softirqs, and fewer timer interrupts are handled than normal (according to our graphs).
[Devel] Re: [PATCH v2] wait: use wrapper functions
On Fri, 7 May 2010 14:33:26 +0800 Changli Gao xiao...@gmail.com wrote:

epoll should not touch flags in wait_queue_t. This patch introduces a new function, __add_wait_queue_excl(), for users who use the wait queue as a LIFO queue.

__add_wait_queue_exclusive(), please. Let's be consistent.
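For context, the wrapper's shape, roughly as it later appeared in include/linux/wait.h under the name Andrew asks for. The point is that callers such as epoll set WQ_FLAG_EXCLUSIVE through a helper instead of poking wait_queue_t flags directly:

/* same as __add_wait_queue() but marks the waiter exclusive, so wake-ups
 * that wake "one exclusive waiter" stop at it */
static inline void __add_wait_queue_exclusive(wait_queue_head_t *q,
					      wait_queue_t *wait)
{
	wait->flags |= WQ_FLAG_EXCLUSIVE;
	__add_wait_queue(q, wait);
}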
[Devel] Re: Linux Checkpoint-Restart - v19
On Mon, 22 Feb 2010 18:17:00 -0500 Oren Laadan or...@cs.columbia.edu wrote:

Hi Andrew,

We've put a stake in the ground for our next set of checkpoint/restart patches, v19. It has some great new stuff, and we put extra effort into addressing your concerns. We would like to have the code included in -mm for wider feedback and testing. This one is able to checkpoint/restart screen and vnc sessions, and live-migrate network servers between hosts. It also adds support for x86-64 (in addition to x86-32, s390x and powerpc). It is rebased to kernel 2.6.33-rc8.

Since one of your main concerns was about what is not yet implemented and how complicated or ugly it will be to support that, we've put up a wiki page to address it. On it there is a simple table that lists what is not implemented and the anticipated solution impact, and for some entries a link to more details. The page is here: http://ckpt.wiki.kernel.org/index.php/Checklist

Does "Refuses to Checkpoint" mean that an attempt to checkpoint will fail, return the failure to userspace, and the system continues as before?

We want to stress that the patchset is already very useful as-is. We will keep working to implement more features cleanly. Some features we are working on include network namespaces and device configurations, mounts and mount namespaces, and file locks. Should a complicated feature prove hard to implement, users have alternative systems, like kvm, until we manage to come up with a clean solution.

We believe that maintenance is best addressed through testing. We now have a comprehensive test suite to automatically find regressions. In addition, we ran LTP and the results are the same with CHECKPOINT=n and =y.

If desired we'll send the whole patchset to lkml, but the git trees can be seen at:

kernel: http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
user tools: http://www.linux-cr.org/git/?p=user-cr.git;a=summary
tests suite: http://www.linux-cr.org/git/?p=tests-cr.git;a=summary

I'd suggest waiting until very shortly after 2.6.34-rc1, then please send all the patches onto the list and let's get to work.
[Devel] Re: [PATCH v4 0/4] cgroups: support for module-loadable subsystems
On Thu, 31 Dec 2009 00:10:50 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

This patch series implements support for building, loading, and unloading subsystems as modules, both within and outside the kernel source tree. It provides an interface, cgroup_load_subsys() and cgroup_unload_subsys(), which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example of a subsystem which can be converted into a module using these changes.

What is the value in this? What are the usage scenarios? Why does the benefit of this change exceed the cost/risk/etc of merging it?
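For a feel of the proposed interface, a hedged sketch of a minimal modular subsystem. The callback signatures follow the era's cgroup_subsys, but the exact registration details (e.g. the .module field, subsys id handling) may differ from the final patches, so treat this as illustrative only:

#include <linux/module.h>
#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/slab.h>

struct my_state {
	struct cgroup_subsys_state css;
};

static struct cgroup_subsys_state *my_create(struct cgroup_subsys *ss,
					     struct cgroup *cgrp)
{
	struct my_state *st = kzalloc(sizeof(*st), GFP_KERNEL);

	return st ? &st->css : ERR_PTR(-ENOMEM);
}

static void my_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	kfree(container_of(cgrp->subsys[ss->subsys_id],
			   struct my_state, css));
}

struct cgroup_subsys my_subsys = {
	.name		= "mymod",
	.create		= my_create,
	.destroy	= my_destroy,
	.module		= THIS_MODULE,	/* assumed registration hook */
};

static int __init my_init(void)
{
	/* register with the cgroup core at module load time */
	return cgroup_load_subsys(&my_subsys);
}

static void __exit my_exit(void)
{
	cgroup_unload_subsys(&my_subsys);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");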
[Devel] Re: [BUG] cpu controller can't provide fair CPU time for each group
(cc contain...@lists.linux-foundation.org)

On Thu, 05 Nov 2009 11:56:00 +0900 Miao Xie mi...@cn.fujitsu.com wrote:

Hi, Ingo

Could you look at the following problems?

Regards
Miao

on 2009-11-3 11:26, Miao Xie wrote:

Hi, Peter.

I found two problems with the cpu controller:
1) The cpu controller doesn't provide fair CPU time to groups when the tasks attached to those groups are bound to the same logical CPU.
2) The cpu controller doesn't provide fair CPU time to groups when the shares of each group are <= 2 * nr_cpus.

The details follow.

1) The first problem is that the cpu controller doesn't provide fair CPU time to groups when the tasks attached to those groups are bound to the same logical CPU. The reason is that there is something wrong with the computation of the per-cpu shares.

On my test box with 16 logical CPUs, I did the following:
a. create 2 cpu controller groups.
b. attach a task into one group and 2 tasks into the other.
c. bind the three tasks to the same logical cpu.

   +--------+    +--------+
   | group1 |    | group2 |
   +--------+    +--------+
       |              |
              CPU0
     Task A     Task B  Task C

The following are the reproduction steps:
# mkdir /dev/cpuctl
# mount -t cgroup -o cpu,noprefix cpuctl /dev/cpuctl
# mkdir /dev/cpuctl/1
# mkdir /dev/cpuctl/2
# cat /dev/zero > /dev/null &
# pid1=$!
# echo $pid1 > /dev/cpuctl/1/tasks
# taskset -p -c 0 $pid1
# cat /dev/zero > /dev/null &
# pid2=$!
# echo $pid2 > /dev/cpuctl/2/tasks
# taskset -p -c 0 $pid2
# cat /dev/zero > /dev/null &
# pid3=$!
# echo $pid3 > /dev/cpuctl/2/tasks
# taskset -p -c 0 $pid3

Some time later, I found that the task in group1 got 35% CPU time, not 50%. It was very strange that this result went against expectations. This problem was caused by the wrong computation of the per-cpu shares. According to the design of the cpu controller, the shares of each cpu controller group are divided among the CPUs according to the workload of each logical CPU:

cpu[i] shares = group shares * CPU[i] workload / sum(CPU workload)

But if a CPU has no task, the cpu controller pretends there is one of average load, usually 1024, the load of a task whose nice value is zero. So in the test, the shares of group1 on CPU0 are:

1024 * (1 * 1024) / (1 * 1024 + 15 * 1024) = 64

and the shares of group2 on CPU0 are:

1024 * (2 * 1024) / (2 * 1024 + 15 * 1024) = 120

The scheduler on CPU0 provided CPU time to each group according to the shares above. The bug occurred.

2) The second problem is that the cpu controller doesn't provide fair CPU time to groups when the shares of each group are <= 2 * nr_cpus. The reason is that the per-cpu shares are set to MIN_SHARES (= 2) if the shares of each group are <= 2 * nr_cpus.

On the test box with 16 logical CPUs, we do the following test:
a. create two cpu controller groups
b. attach 32 tasks into each group
c. set the shares of the first group to 16, the other to 32

   +--------+    +--------+
   | group1 |    | group2 |
   +--------+    +--------+
    shares=16     shares=32
       |              |
    16 Tasks      32 Tasks

Some time later, the first group got 50% CPU time, not 33%. This result also went against expectations. It is because the shares of the cpuctl group were small and there are many logical CPUs, so the computed per-cpu shares came out less than MIN_SHARES and were then set to MIN_SHARES.

Maybe 16 and 32 are not commonly used values, and we can avoid this problem on my box by setting a usual number (such as 1024). But the number of CPUs in a machine will keep growing in the future. If the number of CPUs is greater than 512, this bug will occur even if we set the shares of a group to 1024, which is a usual number. At that point, ordinary users will find the behaviour strange.
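A toy user-space model of the division Miao describes (not kernel code; the function and constant names here are illustrative) reproduces both reported numbers - 64 vs 120 for the pinned-task case, and the MIN_SHARES clamp that makes shares=16 and shares=32 indistinguishable:

#include <stdio.h>

#define MIN_SHARES	2
#define NICE_0_LOAD	1024

/* distribute group_shares over CPUs by load, idle CPUs pretend 1024 */
static unsigned long cpu_shares(unsigned long group_shares,
				unsigned long cpu_load[], int ncpus, int i)
{
	unsigned long sum = 0, load, shares;
	int c;

	for (c = 0; c < ncpus; c++)
		sum += cpu_load[c] ? cpu_load[c] : NICE_0_LOAD;
	load = cpu_load[i] ? cpu_load[i] : NICE_0_LOAD;
	shares = group_shares * load / sum;
	return shares < MIN_SHARES ? MIN_SHARES : shares;	/* the clamp */
}

int main(void)
{
	/* Problem 1: 16 CPUs, all tasks pinned to CPU0 */
	unsigned long g1[16] = { 1 * NICE_0_LOAD };	/* one task on CPU0 */
	unsigned long g2[16] = { 2 * NICE_0_LOAD };	/* two tasks on CPU0 */
	unsigned long even[16];
	int c;

	printf("group1 CPU0 shares: %lu\n", cpu_shares(1024, g1, 16, 0)); /* 64 */
	printf("group2 CPU0 shares: %lu\n", cpu_shares(1024, g2, 16, 0)); /* 120 */

	/* Problem 2: load spread evenly, tiny group shares: both groups
	 * clamp to MIN_SHARES, so 16 vs 32 yields 50%/50%, not 33%/67% */
	for (c = 0; c < 16; c++)
		even[c] = 2 * NICE_0_LOAD;
	printf("shares=16 per-cpu: %lu\n", cpu_shares(16, even, 16, 0)); /* 2 */
	printf("shares=32 per-cpu: %lu\n", cpu_shares(32, even, 16, 0)); /* 2 */
	return 0;
}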
[Devel] Re: pidns memory leak
On Tue, 13 Oct 2009 23:15:33 -0700 Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:

Daniel Lezcano [dlezc...@fr.ibm.com] wrote:

Sukadev Bhattiprolu wrote:

Ccing Andrea's new email id:

Daniel Lezcano [dlezc...@fr.ibm.com] wrote:

Following your explanation I was able to reproduce the problem with a simple program (added in attachment). But there is something I do not understand: why doesn't the leak appear if I do the 'lstat' (cf. test program) in the pid 2 context?

Hmm, are you sure there is no leak with this test program? If I put back the commit (7766755a2f249e7), I do see a leak in all three data structures (pid_2, proc_inode, pid_namespace).

Let me clarify :) The program leaks with commit 7766755a2f249e7 and does not leak without it. This is the expected behaviour, and this simple program spots the problem. I tried to modify the program and moved the lstat to process 2 in the child namespace. Following your analysis, I was expecting to see a leak too, but it didn't occur. I was wondering why; maybe there is something I didn't understand in the analysis.

Hmm. There are two separate dentries associated with the processes, one in each mount of /proc. The proc dentries in the child container are freed when the child container unmounts its /proc, so you don't see the leak when the lstat() is inside the container. When the lstat() is in the root container, it is accessing proc dentries from the _root container_. They are supposed to be flushed when the task exits (but the above commit prevents that flush). They should be freed when the /proc in the root container is unmounted - and leak until then?

This bug hasn't been fixed yet, has it?
[Devel] Re: pidns memory leak
On Mon, 2 Nov 2009 16:38:18 -0600 Serge E. Hallyn se...@us.ibm.com wrote:

This bug hasn't been fixed yet, has it?

Well, Suka did trace the bug to commit 7766755a2f249e7, and posted a patch to revert that, acked by Eric on Oct 20. Suka, were you going to repost that patch?

Ah. OK. Thanks. Found it in the backlog pile.
[Devel] Re: IO scheduler based IO controller V10
On Thu, 24 Sep 2009 15:25:04 -0400 Vivek Goyal vgo...@redhat.com wrote:

Hi All,

Here is V10 of the IO controller patches, generated on top of 2.6.31.

Thanks for the writeup. It really helps and is most worthwhile for a project of this importance, size and complexity.

What problem are we trying to solve
===================================
Provide a group IO scheduling feature in Linux along the lines of other resource controllers like cpu. IOW, provide a facility so that a user can group applications using cgroups and control the amount of disk time/bandwidth received by a group based on its weight.

How to solve the problem
========================
Different people have solved the issue differently. So far it looks like we have the following two core requirements when it comes to fairness at the group level:

- Control the bandwidth seen by groups.
- Control latencies when a request gets backlogged in a group.

At least three patchsets are now available (including this one):

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group and throttles the processes in the group if they exceed the user-specified limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device-mapper driver, providing fair access in terms of the amount of IO done (not in terms of disk time, as CFQ does). So one will set up one or more dm-ioband devices on top of a physical/logical block device, configure the ioband device, and pass in information like grouping etc. The device then keeps track of bios flowing through it and controls the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of the IO controller as a hierarchical group scheduling issue (along the lines of CFS group scheduling). Currently one can view Linux IO schedulers as flat, where there is one root group and all the IO belongs to that group. This patchset basically modifies IO schedulers to also support hierarchical group scheduling. CFQ already provides fairness among different processes; I have extended it to support group IO scheduling. I also took some of the code out of CFQ and put it in a common layer, so that the same group scheduling code can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each approach. Following are some of the thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller, not a proportional one. Additionally, it provides fairness in terms of the amount of IO done (and not in terms of disk time, as CFQ does).

Personally, I think that a proportional weight controller is useful to more people than just a max bandwidth controller. In addition, the IO scheduler based controller can also be enhanced to do max bandwidth control, so it can satisfy a wider set of requirements.

Fairness in terms of disk time vs size of IO
--------------------------------------------
A higher-level controller will most likely be limited to providing fairness in terms of the size/number of IOs done, and will find it hard to provide fairness in terms of disk time used (as CFQ provides between various prio levels). This is because only the IO scheduler knows how much disk time a queue has used, and information about queues and disk time used is not exported to higher layers. So a seeky application will still run away with a lot of disk time and bring down the overall throughput of the disk.

But that's only true if the thing is poorly implemented. A high-level controller will need some view of the busyness of the underlying device(s).
That could be the proportion of idle time, or average queue length, or average request latency, or some mix of these, or something else altogether. But these things are simple to calculate, and are simple to feed back to the higher-level controller, and probably don't require any changes to the IO scheduler at all, which is a great advantage.

And I must say that high-level throttling based upon feedback from lower layers seems like a much better model to me than hacking away in the IO scheduler layer. Both from an implementation point of view and from a "we can get it to work on things other than block devices" point of view.

Currently dm-ioband provides fairness in terms of the number/size of IOs.

Latencies and isolation between groups
--------------------------------------
A higher-level controller generally implements a bandwidth throttling solution: if a group exceeds either the max bandwidth or its proportional share, then throttle that group. This kind of approach will probably not help in controlling latencies, as that will depend on the underlying IO scheduler. Consider the following scenario. Assume there are two groups. One group is running multiple sequential
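Picking up the feedback idea above: a hedged user-space sketch of estimating device busyness from counters the block layer already exports - the io_ticks field of /sys/block/<dev>/stat - which is the kind of signal a high-level controller could throttle against. The device name, sampling interval and policy are illustrative assumptions, not part of any of the patchsets discussed here.

#include <stdio.h>
#include <unistd.h>

/* read the "time spent doing I/Os" counter: field 10 of the stat file */
static long long io_ticks_ms(const char *dev)
{
	char path[128];
	long long f[11] = { 0 };
	FILE *fp;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	fscanf(fp, "%lld %lld %lld %lld %lld %lld %lld %lld %lld %lld %lld",
	       &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7],
	       &f[8], &f[9], &f[10]);
	fclose(fp);
	return f[9];	/* io_ticks: ms the device was busy */
}

int main(void)
{
	long long prev = io_ticks_ms("sda"), cur;

	for (;;) {
		sleep(1);
		cur = io_ticks_ms("sda");
		/* busyness over the last second, in percent */
		printf("device busy: %lld%%\n", (cur - prev) / 10);
		/* a real controller would throttle over-share groups
		 * here, but only while busyness approaches 100% */
		prev = cur;
	}
	return 0;
}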