[Devel] Re: [PATCH v2 00/15] Lockd: grace period containerization
28.07.2012 01:54, J. Bruce Fields wrote:
> On Wed, Jul 25, 2012 at 04:55:45PM +0400, Stanislav Kinsbursky wrote:
>> Bruce, I feel this patch set is ready for inclusion.
>>
>> v2: 1) Rebased on Bruce's for-3.6 branch.
>>
>> This patch set makes the grace period and host reclaiming network namespace aware.
>
> On a quick skim--yes, that looks reasonable to me.
>
> It doesn't help with active/active cluster exports, because in that case we need some additional coordination between nfsd's. But it looks good enough to handle the case where each filesystem is exported from at most one server at a time, which is more than we currently handle.
>
> It's a little late for 3.6. Also I get the impression Al Viro has some lockd rework in progress, which we may want to wait for. So I'll likely look again into queueing this up for 3.7 once 3.6-rc1 is out.

Ok. Will Al Viro's lockd rework be part of the 3.6 kernel?

> --b.

Main ideas:
1) Moving of
	unsigned long next_gc;
	unsigned long nrhosts;
	struct delayed_work grace_period_end;
	struct lock_manager lockd_manager;
	struct list_head grace_list;
   to per-net Lockd data.
2) Moving of
	struct lock_manager nfsd4_manager;
   to per-net NFSd data.
3) Shutdown and garbage collection of NLM hosts are now network namespace aware.
4) restart_grace() now works only for init_net.

The following series implements...

---

Stanislav Kinsbursky (15):
      LockD: mark host per network namespace on garbage collect
      LockD: make garbage collector network namespace aware.
      LockD: manage garbage collection timeout per network namespace
      LockD: manage used host count per network namespace
      Lockd: host complaining function introduced
      Lockd: add more debug to host shutdown functions
      LockD: manage grace period per network namespace
      LockD: make lockd manager allocated per network namespace
      NFSd: make nfsd4_manager allocated per network namespace context.
      SUNRPC: service request network namespace helper introduced
      LockD: manage grace list per network namespace
      LockD: pass actual network namespace to grace period management functions
      Lockd: move grace period management from lockd() to per-net functions
      NFSd: make grace end flag per network namespace
      NFSd: make boot_time variable per network namespace

 fs/lockd/grace.c            |  16 +--
 fs/lockd/host.c             |  92 ++
 fs/lockd/netns.h            |   7 +++
 fs/lockd/svc.c              |  43 ++
 fs/lockd/svc4proc.c         |  13 +++--
 fs/lockd/svclock.c          |  16 +++
 fs/lockd/svcproc.c          |  15 --
 fs/lockd/svcsubs.c          |  19 +---
 fs/nfs/callback_xdr.c       |   4 +-
 fs/nfsd/export.c            |   4 +-
 fs/nfsd/netns.h             |   4 ++
 fs/nfsd/nfs4idmap.c         |   4 +-
 fs/nfsd/nfs4proc.c          |  18 ---
 fs/nfsd/nfs4state.c         | 104 ---
 fs/nfsd/state.h             |   4 +-
 include/linux/fs.h          |   5 +-
 include/linux/lockd/lockd.h |   6 +-
 include/linux/sunrpc/svc.h  |   2 +
 18 files changed, 231 insertions(+), 145 deletions(-)

-- 
Best regards,
Stanislav Kinsbursky

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers
On 07/11/2012 11:41 PM, Kir Kolyshkin wrote:
> Gentlemen,
>
> We are organizing a containers mini-summit during the next Linux Plumbers (San Diego, August 29-31). The idea is to gather and discuss everything relevant to namespaces, cgroups, resource management, checkpoint-restore and so on. We are trying to come up with a list of topics to discuss, so please reply with topic suggestions, and let me know if you are going to come.
>
> I probably forgot a few more people (for instance, I am not sure who else from Google is working on cgroups stuff), so feel free to forward this to anyone you believe should go, or just let me know whom I missed.

Hi Kir,

I have a presentation for LPC and I am awaiting approval for the funding. If it is accepted I will be there. One point to address could be time virtualization.

Thanks

-- Daniel

-- 
http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro Facebook |
http://twitter.com/#!/linaroorg Twitter |
http://www.linaro.org/linaro-blog/ Blog
[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers
On Wed, Jul 25, 2012 at 02:00:41PM +0400, Glauber Costa wrote:
> On 07/25/2012 02:00 PM, Eric W. Biederman wrote:
>> Glauber Costa <glom...@parallels.com> writes:
>>> On 07/12/2012 01:41 AM, Kir Kolyshkin wrote:
>>>> Gentlemen,
>>>> We are organizing a containers mini-summit during the next Linux Plumbers (San Diego, August 29-31). [...]
>>>
>>> BTW, sorry for not replying before (vacations + post-vacation laziness).
>>>
>>> I would be interested in adding /proc virtualization to the discussion. By now it seems userspace would be the best place for that to happen, in a fuse overlay. I know Daniel has an initial implementation of that, and it would be good to have it as a library that both OpenVZ and LXC (and whoever else wants) can use. Shouldn't take much time...
>>
>> What would you need proc virtualization for?
>
> proc provides a lot of information that userspace tools rely upon. For instance, when running top, you can draw per-process figures from what we have now, but you can't make sense of percentages without aggregating container-wide information. When you read /proc/cpuinfo, as well, you would expect to see something that matches your container configuration. free is another example. The list goes on.

Another interesting feature IMHO would be per-cgroup loadavg. A typical use case could be a monitoring system that wants to know which containers are more overloaded than others, instead of using a single system-wide measure in /proc/loadavg.

-Andrea
[Devel] Re: [PATCH 04/10] memcg: skip memcg kmem allocations in specified code regions
On Wed, Jul 25, 2012 at 06:38:15PM +0400, Glauber Costa wrote:
> This patch creates a mechanism that skips memcg accounting of allocations made during certain pieces of our core code. It basically works in the same way as preempt_disable()/preempt_enable(): by marking a region under which all allocations will be accounted to the root memcg.
>
> We need this to prevent races in early cache creation, when we allocate data using caches that are not necessarily created already.

Why not a GFP_* flag?

-- Kirill A. Shutemov
[Devel] Re: [PATCH 06/10] sl[au]b: Allocate objects from memcg cache
On Wed, Jul 25, 2012 at 06:38:17PM +0400, Glauber Costa wrote:
> We are able to match a cache allocation to a particular memcg. If the task doesn't change groups during the allocation itself - a rare event - this will give us a good picture about which group first touches a cache page.
>
> This patch uses the now-available infrastructure by calling memcg_kmem_get_cache() before all the cache allocations.
>
> Signed-off-by: Glauber Costa <glom...@parallels.com>
> CC: Christoph Lameter <c...@linux.com>
> CC: Pekka Enberg <penb...@cs.helsinki.fi>
> CC: Michal Hocko <mho...@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
> CC: Johannes Weiner <han...@cmpxchg.org>
> CC: Suleiman Souhlal <sulei...@google.com>
> ---
>  include/linux/slub_def.h | 18 +-
>  mm/memcontrol.c          |  2 ++
>  mm/slab.c                |  4
>  mm/slub.c                |  1 +
>  4 files changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 8bb8ad2..148000a 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -13,6 +13,8 @@
>  #include <linux/kobject.h>
>  #include <linux/kmemleak.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mm.h>
>  
>  enum stat_item {
>  	ALLOC_FASTPATH,		/* Allocation from cpu slab */
> @@ -209,14 +211,14 @@ static __always_inline int kmalloc_index(size_t size)
>   * This ought to end up with a global pointer to the right cache
>   * in kmalloc_caches.
>   */
> -static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
> +static __always_inline struct kmem_cache *kmalloc_slab(gfp_t flags, size_t size)
>  {
>  	int index = kmalloc_index(size);
>  
>  	if (index == 0)
>  		return NULL;
>  
> -	return kmalloc_caches[index];
> +	return memcg_kmem_get_cache(kmalloc_caches[index], flags);
>  }
>  
>  void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
> @@ -225,7 +227,13 @@ void *__kmalloc(size_t size, gfp_t flags);
>  static __always_inline void *
>  kmalloc_order(size_t size, gfp_t flags, unsigned int order)
>  {
> -	void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order);
> +	void *ret;
> +
> +	flags |= __GFP_COMP;
> +#ifdef CONFIG_MEMCG_KMEM
> +	flags |= __GFP_KMEMCG;
> +#endif

Em.. I don't see where __GFP_KMEMCG is defined. It should be 0 for !CONFIG_MEMCG_KMEM.

-- Kirill A. Shutemov
[Devel] Re: [PATCH 06/10] sl[au]b: Allocate objects from memcg cache
On 07/30/2012 04:58 PM, Kirill A. Shutemov wrote:
> On Wed, Jul 25, 2012 at 06:38:17PM +0400, Glauber Costa wrote:
>> [snip - full patch quoted upthread]
>> @@ -225,7 +227,13 @@ void *__kmalloc(size_t size, gfp_t flags);
>>  static __always_inline void *
>>  kmalloc_order(size_t size, gfp_t flags, unsigned int order)
>>  {
>> -	void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order);
>> +	void *ret;
>> +
>> +	flags |= __GFP_COMP;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	flags |= __GFP_KMEMCG;
>> +#endif
>
> Em.. I don't see where __GFP_KMEMCG is defined. It should be 0 for !CONFIG_MEMCG_KMEM.

It is not, sorry. As I said, this is dependent on another patch series. My main goal while sending this was to get the slab part - which will eventually come on top of that - discussed. Because they are both quite complex, I believe they benefit from being discussed separately.

You can find the latest version of that here:
https://lkml.org/lkml/2012/6/25/251
[Devel] Re: [PATCH 04/10] memcg: skip memcg kmem allocations in specified code regions
On 07/30/2012 04:50 PM, Kirill A. Shutemov wrote:
> On Wed, Jul 25, 2012 at 06:38:15PM +0400, Glauber Costa wrote:
>> This patch creates a mechanism that skips memcg accounting of allocations made during certain pieces of our core code. It basically works in the same way as preempt_disable()/preempt_enable(): by marking a region under which all allocations will be accounted to the root memcg.
>> We need this to prevent races in early cache creation, when we allocate data using caches that are not necessarily created already.
>
> Why not a GFP_* flag?

The main reason for this is to prevent nested calls of kmem_cache_create(), since they could create (and in my tests, do create) funny circular dependencies with each other. So the cache creation itself would proceed without involving memcg. Also, at first sight it is a bit weird to have cache creation itself depending on an allocation flag test.
[Devel] Re: [PATCH] SUNRPC: return negative value in case rpcbind client creation error
On Mon, Jul 30, 2012 at 11:12:05PM +0000, Myklebust, Trond wrote:
> On Fri, 2012-07-20 at 15:57 +0400, Stanislav Kinsbursky wrote:
>> Without this patch the kernel will panic on LockD start, because lockd_up() checks the lockd_up_net() result for a negative value. From my point of view, it's better to return a negative value from the rpcbind routines than to replace all such checks, as in lockd_up().
>>
>> Signed-off-by: Stanislav Kinsbursky <skinsbur...@parallels.com>
>> ---
>>  net/sunrpc/rpcb_clnt.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
>> index 92509ff..a70acae 100644
>> --- a/net/sunrpc/rpcb_clnt.c
>> +++ b/net/sunrpc/rpcb_clnt.c
>> @@ -251,7 +251,7 @@ static int rpcb_create_local_unix(struct net *net)
>>  	if (IS_ERR(clnt)) {
>>  		dprintk("RPC: failed to create AF_LOCAL rpcbind client (errno %ld).\n",
>>  			PTR_ERR(clnt));
>> -		result = -PTR_ERR(clnt);
>> +		result = PTR_ERR(clnt);
>>  		goto out;
>>  	}
>> @@ -298,7 +298,7 @@ static int rpcb_create_local_net(struct net *net)
>>  	if (IS_ERR(clnt)) {
>>  		dprintk("RPC: failed to create local rpcbind client (errno %ld).\n",
>>  			PTR_ERR(clnt));
>> -		result = -PTR_ERR(clnt);
>> +		result = PTR_ERR(clnt);
>>  		goto out;
>>  	}
>
> Who is supposed to carry this patch? Is it Bruce or is it me?

Works either way. Either way--it looks like the bug was introduced with c526611dd631b2802b6b0221ffb306c5fa25c86c "SUNRPC: Use a cached RPC client and transport for rpcbind upcalls" and 7402ab19cdd5943c7dd4f3399afe3abda8077ef5 "SUNRPC: Use AF_LOCAL for rpcbind upcalls", and should go to stable as well. (Looks like I said that before but accidentally dropped everyone off the cc.)

--b.
[Devel] Re: [PATCH v3] SUNRPC: protect service sockets lists during per-net shutdown
On Tue, 24 Jul 2012 15:40:37 -0400 J. Bruce Fields <bfie...@fieldses.org> wrote:
> On Tue, Jul 03, 2012 at 04:58:57PM +0400, Stanislav Kinsbursky wrote:
>> v3: 1) rebased on 3.5-rc3 kernel.
>> v2: destruction of currently processed transports added:
>> 1) Added marking of currently processed transports with XPT_CLOSE on per-net shutdown. These transports will be destroyed in svc_xprt_enqueue() (instead of being enqueued).
>
> That worries me:
>
> - Why did we originally defer close until svc_recv?

I don't think there was any obscure reason - it was just the natural place to do it. In svc_recv we are absolutely sure that the socket is idle. There are a number of things we might want to do, so we find the highest-priority one and do it. A state machine pattern?

> - Are we sure there's no risk to performing it immediately in svc_enqueue? Is it safe to call from the socket callbacks and wherever else we call svc_enqueue?

The latter point is the one I'd want to see verified. If svc_xprt_enqueue gets called in 'bh' context, and calls svc_delete_xprt which then calls svc_deferred_dequeue and that takes ->xpt_lock - does that mean that all lock/unlock of ->xpt_lock needs to be changed to use the _bh variants?

NeilBrown

> And in the past I haven't been good at testing for problems here--instead they tend to show up when a user somewhere tries shutting down a server that's under load. I'll look more closely. Meanwhile you could split out that change as a separate patch and convince me why it's right --b.
>
>> 2) A newly created temporary transport in svc_recv() will be destroyed if its parent was marked with XPT_CLOSE.
>> 3) spin_lock(&serv->sv_lock) was replaced by spin_lock_bh() in svc_close_net().
>>
>> Service sv_tempsocks and sv_permsocks lists are accessible by tasks with different network namespaces, and thus per-net service destruction must be protected. These lists are protected by the service's sv_lock. So let's wrap list manipulations with this lock and move transport destruction outside the wrapped area to prevent deadlocks.
>>
>> Signed-off-by: Stanislav Kinsbursky <skinsbur...@parallels.com>
>> ---
>>  net/sunrpc/svc_xprt.c | 56 ++---
>>  1 file changed, 52 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index 88f2bf6..4af2114 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -320,6 +320,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>>  	struct svc_pool *pool;
>>  	struct svc_rqst *rqstp;
>>  	int cpu;
>> +	int destroy = 0;
>>  
>>  	if (!svc_xprt_has_something_to_do(xprt))
>>  		return;
>> @@ -338,6 +339,17 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>>  	pool->sp_stats.packets++;
>>  
>> +	/*
>> +	 * Check transport close flag. It could be marked as closed on per-net
>> +	 * service shutdown.
>> +	 */
>> +	if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
>> +		/* Don't enqueue transport if it has to be destroyed. */
>> +		dprintk("svc: transport %p have to be closed\n", xprt);
>> +		destroy++;
>> +		goto out_unlock;
>> +	}
>> +
>>  	/* Mark transport as busy. It will remain in this state until
>>  	 * the provider calls svc_xprt_received. We update XPT_BUSY
>>  	 * atomically because it also guards against trying to enqueue
>> @@ -374,6 +386,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>>  out_unlock:
>>  	spin_unlock_bh(&pool->sp_lock);
>> +	if (destroy)
>> +		svc_delete_xprt(xprt);
>>  }
>>  EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
>> @@ -714,6 +728,13 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>>  		__module_get(newxpt->xpt_class->xcl_owner);
>>  		svc_check_conn_limits(xprt->xpt_server);
>>  		spin_lock_bh(&serv->sv_lock);
>> +		if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
>> +			dprintk("svc_recv: found XPT_CLOSE on listener\n");
>> +			set_bit(XPT_DETACHED, &newxpt->xpt_flags);
>> +			spin_unlock_bh(&pool->sp_lock);
>> +			svc_delete_xprt(newxpt);
>> +			goto out_closed;
>> +		}
>>  		set_bit(XPT_TEMP, &newxpt->xpt_flags);
>>  		list_add(&newxpt->xpt_list, &serv->sv_tempsocks);
>>  		serv->sv_tmpcnt++;
>> @@ -739,6 +760,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>>  		len = xprt->xpt_ops->xpo_recvfrom(rqstp);
>>  		dprintk("svc: got len=%d\n", len);
>>  	}
>> +out_closed:
>>  	svc_xprt_received(xprt);
>>  
>>  	/* No data, incomplete (TCP) read, or accept() */
>> @@ -936,6 +958,7 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net)
>>  	struct svc_pool *pool;
>>  	struct svc_xprt *xprt;
>>  	struct svc_xprt *tmp;
>> +	struct svc_rqst *rqstp;
>>  	int i;