[Devel] Re: [PATCH v2 00/15] Lockd: grace period containerization

2012-07-30 Thread Stanislav Kinsbursky

28.07.2012 01:54, J. Bruce Fields wrote:

On Wed, Jul 25, 2012 at 04:55:45PM +0400, Stanislav Kinsbursky wrote:

Bruce, I feel this patch set is ready for inclusion.

v2:
1) Rebase on Bruce's for-3.6 branch.

This patch set makes grace period and hosts reclaiming network namespace
aware.


On a quick skim--yes, that looks reasonable to me.

It doesn't help with active/active cluster exports, because in that case
we need some additional coordination between nfsd's.

But it looks good enough to handle the case where each filesystem is
exported from at most one server at a time, which is more than we
currently handle.

It's a little late for 3.6.  Also I get the impression Al Viro has some
lockd rework in progress, which we may want to wait for.

So I'll likely look again into queueing this up for 3.7 once 3.6-rc1 is
out.



Ok.
Will Al Viro's lockd rework be part of the 3.6 kernel?




--b.



Main ideas:
1)  moving of

unsigned long next_gc;
unsigned long nrhosts;

struct delayed_work grace_period_end;
struct lock_manager lockd_manager;
struct list_head grace_list;

to per-net Lockd data.

2) moving of

struct lock_manager nfsd4_manager;

to per-net NFSd data.

3) shutdown and garbage collection of NLM hosts are now network namespace aware.

4) restart_grace() now works only for init_net.

The following series implements...

---

Stanislav Kinsbursky (15):
   LockD: mark host per network namespace on garbage collect
   LockD: make garbage collector network namespace aware.
   LockD: manage garbage collection timeout per networks namespace
   LockD: manage used host count per networks namespace
   Lockd: host complaining function introduced
   Lockd: add more debug to host shutdown functions
   LockD: manage grace period per network namespace
   LockD: make lockd manager allocated per network namespace
   NFSd: make nfsd4_manager allocated per network namespace context.
   SUNRPC: service request network namespace helper introduced
   LockD: manage grace list per network namespace
   LockD: pass actual network namespace to grace period management functions
   Lockd: move grace period management from lockd() to per-net functions
   NFSd: make grace end flag per network namespace
   NFSd: make boot_time variable per network namespace


  fs/lockd/grace.c|   16 +--
  fs/lockd/host.c |   92 ++
  fs/lockd/netns.h|7 +++
  fs/lockd/svc.c  |   43 ++
  fs/lockd/svc4proc.c |   13 +++--
  fs/lockd/svclock.c  |   16 +++
  fs/lockd/svcproc.c  |   15 --
  fs/lockd/svcsubs.c  |   19 +---
  fs/nfs/callback_xdr.c   |4 +-
  fs/nfsd/export.c|4 +-
  fs/nfsd/netns.h |4 ++
  fs/nfsd/nfs4idmap.c |4 +-
  fs/nfsd/nfs4proc.c  |   18 ---
  fs/nfsd/nfs4state.c |  104 ---
  fs/nfsd/state.h |4 +-
  include/linux/fs.h  |5 +-
  include/linux/lockd/lockd.h |6 +-
  include/linux/sunrpc/svc.h  |2 +
  18 files changed, 231 insertions(+), 145 deletions(-)




--
Best regards,
Stanislav Kinsbursky

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers

2012-07-30 Thread Daniel Lezcano
On 07/11/2012 11:41 PM, Kir Kolyshkin wrote:
 Gentlemen,
 
 We are organizing containers mini-summit during next Linux Plumbers (San
 Diego, August 29-31).
 The idea is to gather and discuss everything relevant to namespaces,
 cgroups, resource management,
 checkpoint-restore and so on.
 
 We are trying to come up with a list of topics to discuss, so please
 reply with topic suggestions, and
 let me know if you are going to come.
 
 I probably forgot a few more people (such as, I am not sure who else
 from Google is working
 on cgroups stuff), so feel free to forward this to anyone you believe
 should go,
 or just let me know whom I missed.

Hi Kir,

I have a presentation for LPC and I am awaiting the approval for the
funding. If it is accepted I will be there.

One point to address could be the time virtualization.

Thanks
  -- Daniel

-- 
 http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  http://www.facebook.com/pages/Linaro Facebook |
http://twitter.com/#!/linaroorg Twitter |
http://www.linaro.org/linaro-blog/ Blog



[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers

2012-07-30 Thread Andrea Righi
On Wed, Jul 25, 2012 at 02:00:41PM +0400, Glauber Costa wrote:
 On 07/25/2012 02:00 PM, Eric W. Biederman wrote:
  Glauber Costa glom...@parallels.com writes:
  
  On 07/12/2012 01:41 AM, Kir Kolyshkin wrote:
  Gentlemen,
 
  We are organizing containers mini-summit during next Linux Plumbers (San
  Diego, August 29-31).
  The idea is to gather and discuss everything relevant to namespaces,
  cgroups, resource management,
  checkpoint-restore and so on.
 
  We are trying to come up with a list of topics to discuss, so please
  reply with topic suggestions, and
  let me know if you are going to come.
 
  I probably forgot a few more people (such as, I am not sure who else
  from Google is working
  on cgroups stuff), so feel free to forward this to anyone you believe
  should go,
  or just let me know whom I missed.
 
  Regards,
Kir.
 
  BTW, sorry for not replying before (vacations + post-vacations laziness)
 
  I would be interested in adding /proc virtualization to the discussion.
  By now it seems userspace would be the best place for that to happen, in
  a fuse overlay. I know Daniel has an initial implementation of that, and
  it would be good to have it as library that both OpenVZ and LXC (and
  whoever else wants) can use.
 
  Shouldn't take much time...
  
  What would you need proc virtualization for?
  
 
 proc provides a lot of information that userspace tools rely upon.
 For instance, when running top, you can draw per-process figures from
 what we have now, but you can't make sense of percentages without
 aggregating container-wide information.
 
 When you read /proc/cpuinfo, as well, you would expect to see something
 that matches your container configuration.
 
 free is another example. The list goes on.

Another interesting feature IMHO would be the per-cgroup loadavg. A
typical use case could be a monitoring system that wants to know which
containers are more overloaded than others, instead of using a single
system-wide measure in /proc/loadavg.

-Andrea



[Devel] Re: [PATCH 04/10] memcg: skip memcg kmem allocations in specified code regions

2012-07-30 Thread Kirill A. Shutemov
On Wed, Jul 25, 2012 at 06:38:15PM +0400, Glauber Costa wrote:
 This patch creates a mechanism that skips memcg allocations during
 certain pieces of our core code. It basically works in the same way
 as preempt_disable()/preempt_enable(): By marking a region under
 which all allocations will be accounted to the root memcg.
 
 We need this to prevent races in early cache creation, when we
 allocate data using caches that are not necessarily created already.

Why not a GFP_* flag?

-- 
 Kirill A. Shutemov



[Devel] Re: [PATCH 06/10] sl[au]b: Allocate objects from memcg cache

2012-07-30 Thread Kirill A. Shutemov
On Wed, Jul 25, 2012 at 06:38:17PM +0400, Glauber Costa wrote:
 We are able to match a cache allocation to a particular memcg.  If the
 task doesn't change groups during the allocation itself - a rare event,
 this will give us a good picture about who is the first group to touch a
 cache page.
 
 This patch uses the now available infrastructure by calling
 memcg_kmem_get_cache() before all the cache allocations.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 ---
  include/linux/slub_def.h |   18 +-
  mm/memcontrol.c  |2 ++
  mm/slab.c|4 
  mm/slub.c|1 +
  4 files changed, 20 insertions(+), 5 deletions(-)
 
 diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
 index 8bb8ad2..148000a 100644
 --- a/include/linux/slub_def.h
 +++ b/include/linux/slub_def.h
 @@ -13,6 +13,8 @@
  #include <linux/kobject.h>
  
  #include <linux/kmemleak.h>
 +#include <linux/memcontrol.h>
 +#include <linux/mm.h>
  
  enum stat_item {
   ALLOC_FASTPATH, /* Allocation from cpu slab */
 @@ -209,14 +211,14 @@ static __always_inline int kmalloc_index(size_t size)
   * This ought to end up with a global pointer to the right cache
   * in kmalloc_caches.
   */
 -static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 +static __always_inline struct kmem_cache *kmalloc_slab(gfp_t flags, size_t size)
  {
   int index = kmalloc_index(size);
  
   if (index == 0)
   return NULL;
  
 - return kmalloc_caches[index];
 + return memcg_kmem_get_cache(kmalloc_caches[index], flags);
  }
  
  void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 @@ -225,7 +227,13 @@ void *__kmalloc(size_t size, gfp_t flags);
  static __always_inline void *
  kmalloc_order(size_t size, gfp_t flags, unsigned int order)
  {
 - void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order);
 + void *ret;
 +
 + flags |= __GFP_COMP;
 +#ifdef CONFIG_MEMCG_KMEM
 + flags |= __GFP_KMEMCG;
 +#endif

Em.. I don't see where __GFP_KMEMCG is defined.
It should be 0 for !CONFIG_MEMCG_KMEM.

-- 
 Kirill A. Shutemov



[Devel] Re: [PATCH 06/10] sl[au]b: Allocate objects from memcg cache

2012-07-30 Thread Glauber Costa
On 07/30/2012 04:58 PM, Kirill A. Shutemov wrote:
 On Wed, Jul 25, 2012 at 06:38:17PM +0400, Glauber Costa wrote:
 We are able to match a cache allocation to a particular memcg.  If the
 task doesn't change groups during the allocation itself - a rare event,
 this will give us a good picture about who is the first group to touch a
 cache page.

 This patch uses the now available infrastructure by calling
 memcg_kmem_get_cache() before all the cache allocations.

 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 ---
  include/linux/slub_def.h |   18 +-
  mm/memcontrol.c  |2 ++
  mm/slab.c|4 
  mm/slub.c|1 +
  4 files changed, 20 insertions(+), 5 deletions(-)

 diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
 index 8bb8ad2..148000a 100644
 --- a/include/linux/slub_def.h
 +++ b/include/linux/slub_def.h
 @@ -13,6 +13,8 @@
  #include <linux/kobject.h>
  
  #include <linux/kmemleak.h>
 +#include <linux/memcontrol.h>
 +#include <linux/mm.h>
  
  enum stat_item {
  ALLOC_FASTPATH, /* Allocation from cpu slab */
 @@ -209,14 +211,14 @@ static __always_inline int kmalloc_index(size_t size)
   * This ought to end up with a global pointer to the right cache
   * in kmalloc_caches.
   */
 -static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 +static __always_inline struct kmem_cache *kmalloc_slab(gfp_t flags, size_t size)
  {
  int index = kmalloc_index(size);
  
  if (index == 0)
  return NULL;
  
 -return kmalloc_caches[index];
 +return memcg_kmem_get_cache(kmalloc_caches[index], flags);
  }
  
  void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 @@ -225,7 +227,13 @@ void *__kmalloc(size_t size, gfp_t flags);
  static __always_inline void *
  kmalloc_order(size_t size, gfp_t flags, unsigned int order)
  {
 -void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order);
 +void *ret;
 +
 +flags |= __GFP_COMP;
 +#ifdef CONFIG_MEMCG_KMEM
 +flags |= __GFP_KMEMCG;
 +#endif
 
 Em.. I don't see where __GFP_KMEMCG is defined.
 It should be 0 for !CONFIG_MEMCG_KMEM.
 
It is not, sorry.

As I said, this is dependent on another patch series.
My main goal while sending this was to get the slab part - that will
eventually come ontop of that - discussed. Because they are both quite
complex, I believe they benefit from being discussed separately.

You can find the latest version of that here:

https://lkml.org/lkml/2012/6/25/251



[Devel] Re: [PATCH 04/10] memcg: skip memcg kmem allocations in specified code regions

2012-07-30 Thread Glauber Costa
On 07/30/2012 04:50 PM, Kirill A. Shutemov wrote:
 On Wed, Jul 25, 2012 at 06:38:15PM +0400, Glauber Costa wrote:
  This patch creates a mechanism that skips memcg allocations during
 certain pieces of our core code. It basically works in the same way
 as preempt_disable()/preempt_enable(): By marking a region under
 which all allocations will be accounted to the root memcg.

 We need this to prevent races in early cache creation, when we
 allocate data using caches that are not necessarily created already.
 
 Why not a GFP_* flag?
 

The main reason for this is to prevent nested calls of
kmem_cache_create(), since they could create (and in my tests, do
create) funny circular dependencies with each other. So the cache
creation itself would proceed without involving memcg.

At first, it is a bit weird to have cache creation itself depending on an
allocation flag test.



[Devel] Re: [PATCH] SUNRPC: return negative value in case rpcbind client creation error

2012-07-30 Thread bfie...@fieldses.org
On Mon, Jul 30, 2012 at 11:12:05PM +, Myklebust, Trond wrote:
 On Fri, 2012-07-20 at 15:57 +0400, Stanislav Kinsbursky wrote:
  Without this patch the kernel will panic on LockD start, because lockd_up()
  checks the lockd_up_net() result for a negative value.
  From my point of view it's better to return a negative value from the rpcbind
  routines instead of replacing all such checks, like the one in lockd_up().
  
  Signed-off-by: Stanislav Kinsbursky skinsbur...@parallels.com
  ---
   net/sunrpc/rpcb_clnt.c |4 ++--
   1 files changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
  index 92509ff..a70acae 100644
  --- a/net/sunrpc/rpcb_clnt.c
  +++ b/net/sunrpc/rpcb_clnt.c
  @@ -251,7 +251,7 @@ static int rpcb_create_local_unix(struct net *net)
  if (IS_ERR(clnt)) {
  		dprintk("RPC:       failed to create AF_LOCAL rpcbind "
  			"client (errno %ld).\n", PTR_ERR(clnt));
  -   result = -PTR_ERR(clnt);
  +   result = PTR_ERR(clnt);
  goto out;
  }
   
  @@ -298,7 +298,7 @@ static int rpcb_create_local_net(struct net *net)
  if (IS_ERR(clnt)) {
  		dprintk("RPC:       failed to create local rpcbind "
  			"client (errno %ld).\n", PTR_ERR(clnt));
  -   result = -PTR_ERR(clnt);
  +   result = PTR_ERR(clnt);
  goto out;
  }
 
 Who is supposed to carry this patch? Is it Bruce or is it me?

Works either way.  Either way--it looks like the bug was introduced with

c526611dd631b2802b6b0221ffb306c5fa25c86c ("SUNRPC: Use a cached RPC client and
transport for rpcbind upcalls") and
7402ab19cdd5943c7dd4f3399afe3abda8077ef5 ("SUNRPC: Use AF_LOCAL for rpcbind
upcalls")

and should go to stable as well.

(Looks like I said that before but accidentally dropped everyone off the
cc.)

--b.



[Devel] Re: [PATCH v3] SUNRPC: protect service sockets lists during per-net shutdown

2012-07-30 Thread NeilBrown
On Tue, 24 Jul 2012 15:40:37 -0400 J. Bruce Fields bfie...@fieldses.org
wrote:

 On Tue, Jul 03, 2012 at 04:58:57PM +0400, Stanislav Kinsbursky wrote:
  v3:
  1) rebased on 3.5-rc3 kernel.
  
  v2: destruction of currently processing transport added:
  1) Added marking of currently processing transports with XPT_CLOSE on 
  per-net
  shutdown. These transports will be destroyed in svc_xprt_enqueue() (instead 
  of
  enqueueing).
 
 That worries me:
 
   - Why did we originally defer close until svc_recv?

I don't think there was any obscure reason - it was just the natural place to
do it.  In svc_recv we are absolutely sure that the socket is idle.  There
are a number of things we might want to do, so we find the highest-priority
one and do it.  State machine pattern?


   - Are we sure there's no risk to performing it immediately in
 svc_enqueue?  Is it safe to call from the socket callbacks and
 wherever else we call svc_enqueue?

The latter point is the one I'd want to see verified.  If svc_xprt_enqueue
gets called in 'bh' context, and calls svc_delete_xprt which then calls
svc_deferred_dequeue and that takes ->xpt_lock - does that mean that all
lock/unlock of ->xpt_lock needs to be changed to use the _bh variants?

NeilBrown


 
 And in the past I haven't been good at testing for problems
 here--instead they tend to show up when a use somewhere tries shutting
 down a server that's under load.
 
 I'll look more closely.  Meanwhile you could split out that change as a
 separate patch and convince me why it's right.
 
 --b.
 
  2) newly created temporary transport in svc_recv() will be destroyed if its
  parent was marked with XPT_CLOSE.
  3) spin_lock(&serv->sv_lock) was replaced by spin_lock_bh() in
  svc_close_net().
  
  Service sv_tempsocks and sv_permsocks lists are accessible by tasks with
  different network namespaces, and thus per-net service destruction must be
  protected.
  These lists are protected by the service sv_lock. So let's wrap list
  manipulations with this lock and move transport destruction outside the
  wrapped area to prevent deadlocks.
  
  Signed-off-by: Stanislav Kinsbursky skinsbur...@parallels.com
  ---
   net/sunrpc/svc_xprt.c |   56 
  ++---
   1 files changed, 52 insertions(+), 4 deletions(-)
  
  diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
  index 88f2bf6..4af2114 100644
  --- a/net/sunrpc/svc_xprt.c
  +++ b/net/sunrpc/svc_xprt.c
  @@ -320,6 +320,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
  struct svc_pool *pool;
  struct svc_rqst *rqstp;
  int cpu;
  +   int destroy = 0;
   
  if (!svc_xprt_has_something_to_do(xprt))
  return;
  @@ -338,6 +339,17 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
   
   	pool->sp_stats.packets++;
   
   +	/*
   +	 * Check transport close flag. It could be marked as closed on per-net
   +	 * service shutdown.
   +	 */
   +	if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
   +		/* Don't enqueue transport if it has to be destroyed. */
   +		dprintk("svc: transport %p have to be closed\n", xprt);
   +		destroy++;
   +		goto out_unlock;
   +	}
  +
  /* Mark transport as busy. It will remain in this state until
   * the provider calls svc_xprt_received. We update XPT_BUSY
   * atomically because it also guards against trying to enqueue
  @@ -374,6 +386,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
   
   out_unlock:
   	spin_unlock_bh(&pool->sp_lock);
  +   if (destroy)
  +   svc_delete_xprt(xprt);
   }
   EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
   
  @@ -714,6 +728,13 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
   		__module_get(newxpt->xpt_class->xcl_owner);
   		svc_check_conn_limits(xprt->xpt_server);
   		spin_lock_bh(&serv->sv_lock);
   +		if (test_bit(XPT_CLOSE, &xprt->xpt_flags)) {
   +			dprintk("svc_recv: found XPT_CLOSE on "
   +				"listener\n");
   +			set_bit(XPT_DETACHED, &newxpt->xpt_flags);
   +			spin_unlock_bh(&pool->sp_lock);
   +			svc_delete_xprt(newxpt);
   +			goto out_closed;
   +		}
   		set_bit(XPT_TEMP, &newxpt->xpt_flags);
   		list_add(&newxpt->xpt_list, &serv->sv_tempsocks);
   		serv->sv_tmpcnt++;
  @@ -739,6 +760,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
   		len = xprt->xpt_ops->xpo_recvfrom(rqstp);
   		dprintk("svc: got len=%d\n", len);
  }
  +out_closed:
  svc_xprt_received(xprt);
   
  /* No data, incomplete (TCP) read, or accept() */
  @@ -936,6 +958,7 @@ static void svc_clear_pools(struct svc_serv *serv, struct net *net)
  struct svc_pool *pool;
  struct svc_xprt *xprt;
  struct svc_xprt *tmp;
  +   struct svc_rqst *rqstp;
  int i;