Re: [Devel] call_usermodehelper in containers

2016-02-19 Thread Kamezawa Hiroyuki

On 2016/02/19 14:37, Ian Kent wrote:

On Fri, 2016-02-19 at 12:08 +0900, Kamezawa Hiroyuki wrote:

On 2016/02/19 5:45, Eric W. Biederman wrote:

Personally I am a fan of the don't be clever and capture a kernel
thread
approach as it is very easy to see what, if any, exploitation
opportunities there are.  The justifications for something more
clever
is trickier.  Of course we do something that from this perspective
would
be considered ``clever'' today with kthreadd and user mode helpers.



I read the old discussion. Let me clarify the idea: create a
helper kernel thread
to run the usermode helper, using kthreadd.

0) define a trigger to create an independent usermodehelper
environment for a container.
Option A) at creation of some namespace (pid, uid, etc...)
Option B) at creation of a new nsproxy
Option C) when a new system call (make_private_usermode_helper() or
similar) or some sysctl is invoked

   It's expected this would be triggered by the init process of a
container with some capability.
   And the scope of the effect should be defined: pid namespace? nsproxy?
or a new namespace?

1) create a helper thread.
task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
switch task's nsproxy to current (switch_task_namespaces())
switch task's cgroups to current (cgroup_attach_task_all())
switch task's cred to current.
copy task's capability from current
(and any other ?)
wake_up_process()

And create a link between kthread_wq and container.


Not sure I quite understand this but I thought the difficulty with this
approach previously (even though the approach was very much incomplete)
was knowing that all the "moving parts" would not allow vulnerabilities.


Ok, that was discussed.


And it looks like this would require a kernel thread for each instance.
So for a thousand containers that each mount an NFS mount that means, at
least, 1000 additional kernel threads. Might be able to sell that, if we
were lucky, but from a system administration POV it's horrible.


I agree.


There's also the question of existence (aka. lifetime) to deal with
since the thread above needs to be created at a time other than the
usermode helper callback.

What happens for SIGKILL on a container?


It depends on how the helper kthread is tied to a container related object.
If kthread is linked with some namespace, we can kill it when a namespace
goes away.

So, with your opinion,
 - a helper thread should be spawned on demand
 - the lifetime of it should be clear. It would be good for it to have the same
lifetime as the container.

I wonder whether there is any solution to the "moving parts" problem other than
calling do_fork() or copy_process() in the context of the container's init process,
if we do it all in the kernel.
Is that possible ?

Thanks,
-Kame


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] call_usermodehelper in containers

2016-02-18 Thread Kamezawa Hiroyuki
On 2016/02/19 5:45, Eric W. Biederman wrote: 
> Personally I am a fan of the don't be clever and capture a kernel thread
> approach as it is very easy to see what, if any, exploitation
> opportunities there are.  The justifications for something more clever
> is trickier.  Of course we do something that from this perspective would
> be considered ``clever'' today with kthreadd and user mode helpers.
> 

I read the old discussion. Let me clarify the idea: create a helper kernel 
thread 
to run the usermode helper, using kthreadd.

0) define a trigger to create an independent usermodehelper environment for a 
container.
   Option A) at creation of some namespace (pid, uid, etc...)
   Option B) at creation of a new nsproxy
   Option C) when a new system call (make_private_usermode_helper() or similar)
or some sysctl is invoked

  It's expected this would be triggered by the init process of a container with 
some capability.
  And the scope of the effect should be defined: pid namespace? nsproxy? or a new 
namespace?

1) create a helper thread.
   task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
   switch task's nsproxy to current (switch_task_namespaces())
   switch task's cgroups to current (cgroup_attach_task_all())
   switch task's cred to current.
   copy task's capability from current
   (and any other ?)
   wake_up_process()
   
   And create a link between kthread_wq and container.

2) modify call_usermodehelper() to use kthread_worker
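
To make step 1) concrete, here is a rough sketch on top of the kthread_worker
API (the ~4.x-era init_kthread_worker()/kthread_worker_fn() interface). Apart
from kthread_create(), switch_task_namespaces(), cgroup_attach_task_all() and
wake_up_process(), the names are illustrative, error handling is minimal, and
the cred/capability copy is only a placeholder comment:

	/* sketch only: spawn a per-container worker for usermode helpers */
	static struct kthread_worker *container_umh_worker_create(void)
	{
		struct kthread_worker *worker;
		struct task_struct *task;

		worker = kzalloc(sizeof(*worker), GFP_KERNEL);
		if (!worker)
			return ERR_PTR(-ENOMEM);
		init_kthread_worker(worker);

		task = kthread_create(kthread_worker_fn, worker, "usermodehelper");
		if (IS_ERR(task)) {
			kfree(worker);
			return ERR_CAST(task);
		}

		/* inherit the caller's (container init's) environment */
		get_nsproxy(current->nsproxy);
		switch_task_namespaces(task, current->nsproxy);
		cgroup_attach_task_all(current, task);
		/* copy cred/capabilities from current here */

		wake_up_process(task);
		return worker;
	}

Step 2) would then have call_usermodehelper() queue its work on this worker
with queue_kthread_work() instead of the global workqueue.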


It seems the problem is which object container private user mode helper should 
be tied to.

Regards,
-Kame
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] call_usermodehelper in containers

2016-02-17 Thread Kamezawa Hiroyuki
On 2016/02/18 11:57, Eric W. Biederman wrote:
> 
> Ccing The containers list because a related discussion is happening there
> and somehow this thread has never made it there.
> 
> Ian Kent  writes:
> 
>> On Mon, 2013-11-18 at 18:28 +0100, Oleg Nesterov wrote:
>>> On 11/15, Eric W. Biederman wrote:

 I don't understand that one.  Having a preforked thread with the
 proper
 environment that can act like kthreadd in terms of spawning user
 mode
 helpers works and is simple.
>>
>> Forgive me replying to such an old thread but ...
>>
>> After realizing workqueues can't be used to pre-create threads to run
>> usermode helpers I've returned to look at this.
> 
> If someone can wind up with a good implementation I will be happy.
> 
>>> Can't we ask ->child_reaper to create the non-daemonized kernel thread
>>> with the "right" ->nsproxy, ->fs, etc?
>>
>> Eric, do you think this approach would be sufficient too?
>>
>> Probably wouldn't be quite right for user namespaces but should provide
>> what's needed for other cases?
>>
>> It certainly has the advantage of not having to maintain a plague of
>> processes waiting around to execute helpers.
> 
> That certainly sounds attractive.  Especially for the case of everyone
> who wants to set a core pattern in a container.
> 
> I am fuzzy on all of the details right now, but what I do remember is
> that in the kernel the user mode helper concepts when they attempted to
> scrub a processes environment were quite error prone until we managed to
> get kthreadd(pid 2) on the scene which always had a clean environment.
> 
> If we are going to tie this kind of thing to the pid namespace I
> recommend simply denying it if you are in a user namespace without
> an appropriate pid namespace.  AKA simply not allowing things to be set up
> if current->pid_ns->user_ns != current->user_ns.
> 
Can't this be handled by a simple capability like CAP_SYS_USERMODEHELPER ?

The user_ns check seems to mean that a core-dump catcher running in the host
will not work if user namespaces are used.
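
For reference, the setup-time check Eric describes could look roughly like the
sketch below. CAP_SYS_USERMODEHELPER does not exist, so that capability (and
using ns_capable() for it) is only an assumption:

	/* sketch: gate per-container usermodehelper setup */
	static int umh_setup_allowed(void)
	{
		/* hypothetical capability; CAP_SYS_ADMIN is the conservative fallback */
		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
			return -EPERM;

		/* deny if the pid namespace does not belong to the caller's user_ns */
		if (task_active_pid_ns(current)->user_ns != current_user_ns())
			return -EPERM;

		return 0;
	}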

Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v5 06/14] memcg: kmem controller infrastructure

2012-10-17 Thread Kamezawa Hiroyuki
(2012/10/16 19:16), Glauber Costa wrote:
 This patch introduces infrastructure for tracking kernel memory pages to
 a given memcg. This will happen whenever the caller includes the
 __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
 
 In memcontrol.h those functions are wrapped in inline accessors.  The
 idea is to later on, patch those with static branches, so we don't incur
 any overhead when no mem cgroups with limited kmem are being used.
 
 Users of this functionality shall interact with the memcg core code
 through the following functions:
 
 memcg_kmem_newpage_charge: will return true if the group can handle the
 allocation. At this point, struct page is not
 yet allocated.
 
 memcg_kmem_commit_charge: will either revert the charge, if struct page
allocation failed, or embed memcg information
into page_cgroup.
 
 memcg_kmem_uncharge_page: called at free time, will revert the charge.
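
For readers following along, the expected calling sequence of these three
hooks is roughly the sketch below; it is paraphrased from the changelog (in
the series this logic lives inside the page allocator for __GFP_KMEMCG
allocations), and alloc_pages() merely stands in for the real allocation:

	/* sketch: how an allocation path drives the three memcg_kmem hooks */
	static struct page *kmemcg_alloc_sketch(gfp_t gfp, int order)
	{
		struct mem_cgroup *memcg = NULL;
		struct page *page;

		/* may fail if the group cannot handle the charge */
		if (!memcg_kmem_newpage_charge(gfp, &memcg, order))
			return NULL;

		page = alloc_pages(gfp, order);

		/* commits memcg into page_cgroup, or reverts the charge if !page */
		memcg_kmem_commit_charge(page, memcg, order);
		return page;
	}

	/* and at free time: memcg_kmem_uncharge_page(page, order); */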
 
 [ v2: improved comments and standardized function names ]
 [ v3: handle no longer opaque, functions not exported,
even more comments ]
 [ v4: reworked Used bit handling and surroundings for more clarity ]
 [ v5: simplified code for kmemcg compiled out and core functions in
memcontrol.c, moved kmem code to the middle to avoid forward decls ]
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 Acked-by: Michal Hocko mho...@suse.cz
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Tejun Heo t...@kernel.org

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v5 12/14] execute the whole memcg freeing in free_worker

2012-10-17 Thread Kamezawa Hiroyuki
(2012/10/16 19:16), Glauber Costa wrote:
 A lot of the initialization we do in mem_cgroup_create() is done with
 softirqs enabled. This includes grabbing a css id, which holds
 ss->id_lock->rlock, and the per-zone trees, which hold
 rtpz->lock->rlock. All of those signal to the lockdep mechanism that
 those locks can be used in SOFTIRQ-ON-W context. This means that the
 freeing of memcg structure must happen in a compatible context,
 otherwise we'll get a deadlock, like the one below, caught by lockdep:
 
[81103095] free_accounted_pages+0x47/0x4c
[81047f90] free_task+0x31/0x5c
[8104807d] __put_task_struct+0xc2/0xdb
[8104dfc7] put_task_struct+0x1e/0x22
[8104e144] delayed_put_task_struct+0x7a/0x98
[810cf0e5] __rcu_process_callbacks+0x269/0x3df
[810cf28c] rcu_process_callbacks+0x31/0x5b
[8105266d] __do_softirq+0x122/0x277
 
 This usage pattern could not be triggered before kmem came into play.
 With the introduction of kmem stack handling, it is possible that we
 call the last mem_cgroup_put() from the task destructor, which is run in
 an rcu callback. Such callbacks are run with softirqs disabled, leading
 to the offensive usage pattern.
 
 In general, we have little, if any, means to guarantee in which context
 the last memcg_put will happen. The best we can do is test it and try to
 make sure no invalid context releases are happening. But as we add more
 code to memcg, the possible interactions grow in number and expose more
 ways to get context conflicts. One thing to keep in mind, is that part
 of the freeing process is already deferred to a worker, such as vfree(),
 that can only be called from process context.
 
 For the moment, the only two functions we really need moved away are:
 
* free_css_id(), and
* mem_cgroup_remove_from_trees().
 
 But because the latter accesses per-zone info,
 free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we
 are left with the per_cpu stats only. Better move it all.
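
The deferral pattern described above boils down to something like this sketch
(struct and function names are illustrative, not the actual patch):

	/* sketch: do the final freeing from a worker, i.e. process context */
	struct memcg_like {
		struct work_struct work_freeing;
		/* css id, per-zone trees, per-cpu stats, ... */
	};

	static void memcg_free_work(struct work_struct *work)
	{
		struct memcg_like *memcg =
			container_of(work, struct memcg_like, work_freeing);

		/* softirqs are enabled here, so the SOFTIRQ-ON-W locks
		 * (css id lock, per-zone tree locks) can be taken safely */
		kfree(memcg);
	}

	static void memcg_free_deferred(struct memcg_like *memcg)
	{
		/* safe to call even from an RCU callback / softirq context */
		INIT_WORK(&memcg->work_freeing, memcg_free_work);
		schedule_work(&memcg->work_freeing);
	}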
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 Tested-by: Greg Thelen gthe...@google.com
 Acked-by: Michal Hocko mho...@suse.cz
 CC: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Tejun Heo t...@kernel.org

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 14/14] Add documentation about the kmem controller

2012-10-17 Thread Kamezawa Hiroyuki
(2012/10/08 19:06), Glauber Costa wrote:
 Signed-off-by: Glauber Costa glom...@parallels.com
 ---
   Documentation/cgroups/memory.txt | 55 
 +++-
   1 file changed, 54 insertions(+), 1 deletion(-)
 

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 06/14] memcg: kmem controller infrastructure

2012-10-16 Thread Kamezawa Hiroyuki

(2012/10/12 18:13), Glauber Costa wrote:

On 10/12/2012 12:57 PM, Michal Hocko wrote:

On Fri 12-10-12 12:44:57, Glauber Costa wrote:

On 10/12/2012 12:39 PM, Michal Hocko wrote:

On Fri 12-10-12 11:45:46, Glauber Costa wrote:

On 10/11/2012 04:42 PM, Michal Hocko wrote:

On Mon 08-10-12 14:06:12, Glauber Costa wrote:

[...]

+	/*
+	 * Conditions under which we can wait for the oom_killer.
+	 * __GFP_NORETRY should be masked by __mem_cgroup_try_charge,
+	 * but there is no harm in being explicit here
+	 */
+	may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);


Well we _have to_ check __GFP_NORETRY here because if we don't then we
can end up in OOM. mem_cgroup_do_charge returns CHARGE_NOMEM for
__GFP_NORETRY (without doing any reclaim) and if oom==true we decrement
oom retries counter and eventually hit OOM killer. So the comment is
misleading.


I will update. What i understood from your last message is that we don't
really need to, because try_charge will do it.


IIRC I just said it couldn't happen before because migration doesn't go
through charge and thp disable oom by default.



I had it changed to:

	/*
	 * Conditions under which we can wait for the oom_killer.
	 * We have to be able to wait, but also, if we can't retry,
	 * we obviously shouldn't go mess with oom.
	 */
	may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);


OK




+
+   _memcg = memcg;
+   ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
+ &_memcg, may_oom);
+
+   if (!ret) {
+   ret = res_counter_charge(&memcg->kmem, size, &fail_res);


Now that I'm thinking about the charging ordering we should charge the
kmem first because we would like to hit kmem limit before we hit u+k
limit, don't we.
Say that you have kmem limit 10M and the total limit 50M. Current `u'
would be 40M and this charge would cause kmem to hit the `k' limit. I
think we should fail to charge kmem before we go to u+k and potentially
reclaim/oom.
Or has this been already discussed and I just do not remember?


This has never been discussed as far as I remember. We charged u first
since day0, and you are so far the first one to raise it...

One of the things in favor of charging 'u' first is that
mem_cgroup_try_charge is already equipped to make a lot of decisions,
like when to allow reclaim, when to bypass charges, and it would be good
if we can reuse all that.


Hmm, I think that we should prevent from those decisions if kmem charge
would fail anyway (especially now when we do not have targeted slab
reclaim).



Let's revisit this discussion when we do have targeted reclaim. For now,
I'll agree that charging kmem first would be acceptable.

This will only make a difference when K < U anyway.


Yes and it should work as advertised (aka hit the k limit first).


Just so we don't ping-pong in another submission:

I changed memcontrol.h's memcg_kmem_newpage_charge to include:

 /* If the test is dying, just let it go. */
 if (unlikely(test_thread_flag(TIF_MEMDIE)
  || fatal_signal_pending(current)))
 return true;


I'm also attaching the proposed code in memcontrol.c

+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	struct res_counter *fail_res;
+	struct mem_cgroup *_memcg;
+	int ret = 0;
+	bool may_oom;
+
+	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+	if (ret)
+		return ret;
+
+	/*
+	 * Conditions under which we can wait for the oom_killer.
+	 * We have to be able to wait, but also, if we can't retry,
+	 * we obviously shouldn't go mess with oom.
+	 */
+	may_oom = (gfp & __GFP_WAIT) && !(gfp & __GFP_NORETRY);
+
+	_memcg = memcg;
+	ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
+				      &_memcg, may_oom);
+
+	if (ret == -EINTR)  {
+		/*
+		 * __mem_cgroup_try_charge() chosed to bypass to root due to
+		 * OOM kill or fatal signal.  Since our only options are to
+		 * either fail the allocation or charge it to this cgroup, do
+		 * it as a temporary condition. But we can't fail. From a
+		 * kmem/slab perspective, the cache has already been selected,
+		 * by mem_cgroup_get_kmem_cache(), so it is too late to change
+		 * our minds. This condition will only trigger if the task
+		 * entered memcg_charge_kmem in a sane state, but was
+		 * OOM-killed during __mem_cgroup_try_charge. Tasks that are
+		 * already dying when the allocation triggers should have been
+		 * already directed to the root cgroup.
+		 */
+		res_counter_charge_nofail(&memcg->res, size, &fail_res);
+		if (do_swap_account)
+			

[Devel] Re: [PATCH v4 08/14] res_counter: return amount of charges after res_counter_uncharge

2012-10-16 Thread Kamezawa Hiroyuki
(2012/10/08 19:06), Glauber Costa wrote:
 It is useful to know how many charges are still left after a call to
 res_counter_uncharge. While it is possible to issue a res_counter_read
 after uncharge, this can be racy.
 
 If we need, for instance, to take some action when the counters drop
 down to 0, only one of the callers should see it. This is the same
 semantics as the atomic variables in the kernel.
 
 Since the current return value is void, we don't need to worry about
 anything breaking due to this change: nobody relied on that, and only
 users appearing from now on will be checking this value.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Michal Hocko mho...@suse.cz
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 ---
   Documentation/cgroups/resource_counter.txt |  7 ---
   include/linux/res_counter.h| 12 +++-
   kernel/res_counter.c   | 20 +---
   3 files changed, 24 insertions(+), 15 deletions(-)
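
In other words, the intended usage pattern is the sketch below: whichever
caller sees the counter hit zero is the single one that takes the follow-up
action (the mem_cgroup_put() is just an example of such an action, assuming
the return-value change introduced by this patch):

	u64 remaining;

	remaining = res_counter_uncharge(&memcg->kmem, size);
	if (remaining == 0)
		mem_cgroup_put(memcg);	/* exactly one uncharger sees zero */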

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 09/14] memcg: kmem accounting lifecycle management

2012-10-16 Thread Kamezawa Hiroyuki

(2012/10/12 17:41), Michal Hocko wrote:

On Fri 12-10-12 11:47:17, Glauber Costa wrote:

On 10/11/2012 05:11 PM, Michal Hocko wrote:

On Mon 08-10-12 14:06:15, Glauber Costa wrote:

Because kmem charges can outlive the cgroup, we need to make sure that
we won't free the memcg structure while charges are still in flight.
For reviewing simplicity, the charge functions will issue
mem_cgroup_get() at every charge, and mem_cgroup_put() at every
uncharge.

This can get expensive, however, and we can do better. mem_cgroup_get()
only really needs to be issued once: when the first limit is set. In the
same spirit, we only need to issue mem_cgroup_put() when the last charge
is gone.

We'll need an extra bit in kmem_accounted for that: KMEM_ACCOUNTED_DEAD.
It will be set when the cgroup dies, if there are charges in the group.
If there aren't, we can proceed right away.

Our uncharge function will have to test that bit every time the charges
drop to 0. Because that is not the likely output of
res_counter_uncharge, this should not impose a big hit on us: it is
certainly much better than a reference count decrease at every
operation.

[ v3: merged all lifecycle related patches in one ]

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
CC: Christoph Lameter c...@linux.com
CC: Pekka Enberg penb...@cs.helsinki.fi
CC: Michal Hocko mho...@suse.cz
CC: Johannes Weiner han...@cmpxchg.org
CC: Suleiman Souhlal sulei...@google.com


OK, I like the optimization. I have just one comment to the
memcg_kmem_dead naming but other than that

Acked-by: Michal Hocko mho...@suse.cz

[...]

+static bool memcg_kmem_dead(struct mem_cgroup *memcg)


The name is tricky because it doesn't tell you that it clears the flag
which made me scratch my head when reading comment in kmem_cgroup_destroy


memcg_kmem_finally_kill_that_bastard() ?


memcg_kmem_test_and_clear_dead? I know it is long but at least it makes clear that the
flag is cleared. Or just open code it.
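
For what it's worth, the suggested rename would read roughly like this (a
sketch of the naming only, not the final patch):

	static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
	{
		/* the name now says the DEAD flag is consumed, not just read */
		return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
					  &memcg->kmem_accounted);
	}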



I agree. Ack by me with that naming.

Thanks,
-Kame



___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 10/14] memcg: use static branches when code not in use

2012-10-16 Thread Kamezawa Hiroyuki

(2012/10/12 16:47), Glauber Costa wrote:

On 10/11/2012 05:40 PM, Michal Hocko wrote:

On Mon 08-10-12 14:06:16, Glauber Costa wrote:

We can use static branches to patch the code in or out when not used.

Because the _ACTIVE bit on kmem_accounted is only set after the
increment is done, we guarantee that the root memcg will always be
selected for kmem charges until all call sites are patched (see
memcg_kmem_enabled).  This guarantees that no mischarges are applied.

static branch decrement happens when the last reference count from the
kmem accounting in memcg dies. This will only happen when the charges
drop down to 0.

When that happen, we need to disable the static branch only on those
memcgs that enabled it. To achieve this, we would be forced to
complicate the code by keeping track of which memcgs were the ones
that actually enabled limits, and which ones got it from its parents.

It is a lot simpler just to do static_key_slow_inc() on every child
that is accounted.
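
The static-branch pattern being described is roughly the sketch below (the
wrapper names around the real static_key API are illustrative):

	static struct static_key memcg_kmem_enabled_key = STATIC_KEY_INIT_FALSE;

	static inline bool memcg_kmem_enabled_sketch(void)
	{
		/* patched jump label: effectively free while no kmem limit is set */
		return static_key_false(&memcg_kmem_enabled_key);
	}

	static void memcg_kmem_mark_accounted_sketch(void)
	{
		/* done for every memcg that becomes accounted, children included */
		static_key_slow_inc(&memcg_kmem_enabled_key);
	}

	static void memcg_kmem_drop_accounted_sketch(void)
	{
		/* paired decrement when the last kmem reference of that memcg dies */
		static_key_slow_dec(&memcg_kmem_enabled_key);
	}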

[ v4: adapted this patch to the changes in kmem_accounted ]

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
CC: Christoph Lameter c...@linux.com
CC: Pekka Enberg penb...@cs.helsinki.fi
CC: Michal Hocko mho...@suse.cz
CC: Johannes Weiner han...@cmpxchg.org
CC: Suleiman Souhlal sulei...@google.com


Looks reasonable to me
Acked-by: Michal Hocko mho...@suse.cz

Just a little nit.

[...]


diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 634c7b5..724a08b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -344,11 +344,15 @@ struct mem_cgroup {
  /* internal only representation about the status of kmem accounting. */
  enum {
KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
+   KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
KMEM_ACCOUNTED_DEAD, /* dead memcg, pending kmem charges */
  };

-/* first bit */
-#define KMEM_ACCOUNTED_MASK 0x1
+/*
+ * first two bits. We account when limit is on, but only after
+ * call sites are patched
+ */
+#define KMEM_ACCOUNTED_MASK 0x3


The names are long but why not use KMEM_ACCOUNTED_ACTIVE*
#define KMEM_ACCOUNTED_MASK 1<<KMEM_ACCOUNTED_ACTIVE | \
1<<KMEM_ACCOUNTED_ACTIVATED


Because the names are long! =)



please use long macros ;) it's not bad.

Anyway,

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com







___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 02/14] memcg: Reclaim when more than one page needed.

2012-10-15 Thread Kamezawa Hiroyuki
(2012/10/08 19:06), Glauber Costa wrote:
 From: Suleiman Souhlal ssouh...@freebsd.org
 
 mem_cgroup_do_charge() was written before kmem accounting, and expects
 three cases: being called for 1 page, being called for a stock of 32
 pages, or being called for a hugepage.  If we call for 2 or 3 pages (and
 both the stack and several slabs used in process creation are such, at
 least with the debug options I had), it assumed it's being called for
 stock and just retried without reclaiming.
 
 Fix that by passing down a minsize argument in addition to the csize.
 
 And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
 needed at all (and presumably is since it's there, perhaps to handle
 races), then it should be extended to more than PAGE_SIZE, yet how far?
 And should there be a retry count limit, of what?  For now retry up to
 COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if
 __GFP_NORETRY.
 
 [v4: fixed nr pages calculation pointed out by Christoph Lameter ]
 
 Signed-off-by: Suleiman Souhlal sulei...@google.com
 Signed-off-by: Glauber Costa glom...@parallels.com
 Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 Acked-by: Michal Hocko mho...@suse.cz
 Acked-by: Johannes Weiner han...@cmpxchg.org

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Kamezawa Hiroyuki
(2012/08/13 17:28), Glauber Costa wrote:
 + * Needs to be called after memcg_kmem_new_page, regardless of success or
 + * failure of the allocation. if @page is NULL, this function will revert 
 the
 + * charges. Otherwise, it will commit the memcg given by @handle to the
 + * corresponding page_cgroup.
 + */
 +static __always_inline void
 +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
 order)
 +{
 +  if (memcg_kmem_on)
 +  __memcg_kmem_commit_page(page, handle, order);
 +}
 Shouldn't these 2 functions have short-cuts ?
 
 Sorry kame, what exactly do you mean?
 
I meant avoiding a function call. But please ignore, I missed the following patches.


 if (memcg_kmem_on && handle) ?
 I guess this can be done to avoid a function call.
 
 Maybe free() needs to access page_cgroup...

 Can you also be a bit more specific here?
 

Please ignore, I misunderstood the usage of free_accounted_pages().

 +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
 +{
 +  struct mem_cgroup *memcg;
 +  struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 +  bool ret = true;
 +  size_t size;
 +  struct task_struct *p;
 +
 +  *handle = NULL;
 +  rcu_read_lock();
 +  p = rcu_dereference(current->mm->owner);
 +  memcg = mem_cgroup_from_task(p);
 +  if (!memcg_kmem_enabled(memcg))
 +  goto out;
 +
 +  mem_cgroup_get(memcg);
 +
 This mem_cgroup_get() will be a potential performance problem.
 Don't you have good idea to avoid accessing atomic counter here ?
 I think some kind of percpu counter or a feature to disable move task
 will be a help.
 
 
 
 
 +  pc = lookup_page_cgroup(page);
 +  lock_page_cgroup(pc);
 +  pc->mem_cgroup = memcg;
 +  SetPageCgroupUsed(pc);
 +  unlock_page_cgroup(pc);
 +}
 +
 +void __memcg_kmem_free_page(struct page *page, int order)
 +{
 +  struct mem_cgroup *memcg;
 +  size_t size;
 +  struct page_cgroup *pc;
 +
 +  if (mem_cgroup_disabled())
 +  return;
 +
 +  pc = lookup_page_cgroup(page);
 +  lock_page_cgroup(pc);
 +  memcg = pc->mem_cgroup;
 +  pc->mem_cgroup = NULL;
 
 shouldn't this happen after checking Used bit ?
 Ah, BTW, why do you need to clear pc->memcg ?
 
 As for clearing pc->memcg, I think I'm just being overzealous. I can't
 foresee any problems due to removing it.
 
 As for the Used bit, what difference does it make when we clear it?

I just want to see the same logic used in mem_cgroup_uncharge_common().
Hmm, at setting pc->mem_cgroup, things happen in the order
   set pc->mem_cgroup
   set Used bit
So if you clear pc->mem_cgroup, doing it as
   unset Used bit
   clear pc->mem_cgroup
seems reasonable.
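
That is, something like the following sketch, mirroring
mem_cgroup_uncharge_common():

	/* commit side: set the owner first, then publish with the Used bit */
	lock_page_cgroup(pc);
	pc->mem_cgroup = memcg;
	SetPageCgroupUsed(pc);
	unlock_page_cgroup(pc);

	/* uncharge side: retire the Used bit before touching the owner */
	lock_page_cgroup(pc);
	ClearPageCgroupUsed(pc);
	pc->mem_cgroup = NULL;	/* if it is cleared at all */
	unlock_page_cgroup(pc);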


 
 +  if (!PageCgroupUsed(pc)) {
 +  unlock_page_cgroup(pc);
 +  return;
 +  }
 +  ClearPageCgroupUsed(pc);
 +  unlock_page_cgroup(pc);
 +
 +  /*
 +   * Checking if kmem accounted is enabled won't work for uncharge, since
 +   * it is possible that the user enabled kmem tracking, allocated, and
 +   * then disabled it again.
 +   *
 +   * We trust if there is a memcg associated with the page, it is a valid
 +   * allocation
 +   */
 +  if (!memcg)
 +  return;
 +
 +  WARN_ON(mem_cgroup_is_root(memcg));
 +  size = (1 << order) << PAGE_SHIFT;
 +  memcg_uncharge_kmem(memcg, size);
 +  mem_cgroup_put(memcg);
 Why do we need ref-counting here ? kmem res_counter cannot work as
 reference ?
 This is of course the pair of the mem_cgroup_get() you commented on
 earlier. If we need one, we need the other. If we don't need one, we
 don't need the other =)
 
 The guarantee we're trying to give here is that the memcg structure will
 stay around while there are dangling charges to kmem, that we decided
 not to move (remember: moving it for the stack is simple, for the slab
 is very complicated and ill-defined, and I believe it is better to treat
 all kmem equally here)
 
 So maybe we can be clever here, and avoid reference counting at all
 times. We call mem_cgroup_get() when the first charge occurs, and then
 go for mem_cgroup_put() when our count reaches 0.
 
 What do you think about that?
 

I think that should work. I don't want to add non-optimized atomic counter ops
in this very hot path.
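
A sketch of where this scheme points (and roughly what later versions of the
series do, cf. the lifecycle patch quoted earlier in this digest): pin the
memcg once, and let whichever uncharge empties the kmem counter of an
already-dead memcg drop that pin. It assumes the later res_counter_uncharge()
change that returns the remaining usage:

	static void memcg_uncharge_kmem_sketch(struct mem_cgroup *memcg, u64 size)
	{
		/* charges still outstanding: nothing else to do */
		if (res_counter_uncharge(&memcg->kmem, size))
			return;

		/* last charge of a dead memcg: drop the single pinning reference */
		if (memcg_kmem_test_and_clear_dead(memcg))
			mem_cgroup_put(memcg);
	}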


 
 +#ifdef CONFIG_MEMCG_KMEM
 +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 +{
 What does 'delta' mean ?

 I can change it to something like nr_bytes, more informative.
 
 +  struct res_counter *fail_res;
 +  struct mem_cgroup *_memcg;
 +  int ret;
 +  bool may_oom;
 +  bool nofail = false;
 +
 +  may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 +  !(gfp & __GFP_NORETRY);
 +
 +  ret = 0;
 +
 +  if (!memcg)
 +  return ret;
 +
 +  _memcg = memcg;
 +  ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 +  &_memcg, may_oom);
 +
 +  if (ret == -EINTR)  {
 +  nofail = true;
 +  /*
 +   * __mem_cgroup_try_charge() chosed to bypass to root due to
 +   * OOM kill or fatal signal.  Since our only options are to
 +   * either fail the allocation or charge it to this cgroup, do
 + 

[Devel] Re: [PATCH v2 04/11] kmem accounting basic infrastructure

2012-08-16 Thread Kamezawa Hiroyuki
(2012/08/13 17:36), Glauber Costa wrote:
 On 08/10/2012 09:02 PM, Kamezawa Hiroyuki wrote:
 (2012/08/09 22:01), Glauber Costa wrote:
 This patch adds the basic infrastructure for the accounting of the slab
 caches. To control that, the following files are created:

* memory.kmem.usage_in_bytes
* memory.kmem.limit_in_bytes
* memory.kmem.failcnt
* memory.kmem.max_usage_in_bytes

 They have the same meaning of their user memory counterparts. They
 reflect the state of the kmem res_counter.

 The code is not enabled until a limit is set. This can be tested by the
 flag kmem_accounted. This means that after the patch is applied, no
 behavioral changes exists for whoever is still using memcg to control
 their memory usage.

 We always account to both user and kernel resource_counters. This
 effectively means that an independent kernel limit is in place when the
 limit is set to a lower value than the user memory. An equal or higher
 value means that the user limit will always hit first, meaning that kmem
 is effectively unlimited.

 People who want to track kernel memory but not limit it, can set this
 limit to a very high number (like RESOURCE_MAX - 1page - that no one
 will ever hit, or equal to the user memory)

 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Michal Hocko mho...@suse.cz
 CC: Johannes Weiner han...@cmpxchg.org
 Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com

 Could you add  a patch for documentation of this new interface and a text
 explaining the behavior of kmem_accounting ?

 Hm, my concern is the difference of behavior between user page accounting and
 kmem accounting...but this is how tcp-accounting is working.

 Once you add Documentation, it's okay to add my Ack.

 I plan to add documentation in a separate patch. Due to that, can I add
 your ack to this patch here?
 
 Also, I find that the description text in patch0 grew to be quite
 informative and complete. I plan to add that to the documentation
 if that is ok with you
 

Ack to this patch.

-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v2 02/11] memcg: Reclaim when more than one page needed.

2012-08-10 Thread Kamezawa Hiroyuki

(2012/08/11 0:42), Michal Hocko wrote:

On Thu 09-08-12 17:01:10, Glauber Costa wrote:
[...]

@@ -2317,18 +2318,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
-* nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
-* of regular pages (CHARGE_BATCH), or a single regular page (1).
-*
 * Never reclaim on behalf of optional batching, retry with a
 * single page instead.
 */
-   if (nr_pages == CHARGE_BATCH)
+   if (nr_pages > min_pages)
return CHARGE_RETRY;


This is dangerous because THP charges will be retried now while they
previously failed with CHARGE_NOMEM which means that we will keep
attempting potentially endlessly.


with THP, I thought nr_pages == min_pages, and no retry.



Why cannot we simply do if (nr_pages  CHARGE_BATCH) and get rid of the
min_pages altogether?


Hm, I think a slab can be larger than CHARGE_BATCH.


Also the comment doesn't seem to be valid anymore.


I agree it's not clean. Because our assumption on nr_pages are changed,
I think this behavior should not depend on nr_pages value..
Shouldn't we have a flag to indicate trial-for-batched charge ?


Thanks,
-Kame




___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v2 04/11] kmem accounting basic infrastructure

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
 This patch adds the basic infrastructure for the accounting of the slab
 caches. To control that, the following files are created:
 
   * memory.kmem.usage_in_bytes
   * memory.kmem.limit_in_bytes
   * memory.kmem.failcnt
   * memory.kmem.max_usage_in_bytes
 
 They have the same meaning of their user memory counterparts. They
 reflect the state of the kmem res_counter.
 
 The code is not enabled until a limit is set. This can be tested by the
 flag kmem_accounted. This means that after the patch is applied, no
 behavioral changes exists for whoever is still using memcg to control
 their memory usage.
 
 We always account to both user and kernel resource_counters. This
 effectively means that an independent kernel limit is in place when the
 limit is set to a lower value than the user memory. An equal or higher
 value means that the user limit will always hit first, meaning that kmem
 is effectively unlimited.
 
 People who want to track kernel memory but not limit it, can set this
 limit to a very high number (like RESOURCE_MAX - 1page - that no one
 will ever hit, or equal to the user memory)
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Michal Hocko mho...@suse.cz
 CC: Johannes Weiner han...@cmpxchg.org
 Reviewed-by: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com

Could you add  a patch for documentation of this new interface and a text
explaining the behavior of kmem_accounting ?

Hm, my concern is the difference of behavior between user page accounting and
kmem accounting...but this is how tcp-accounting is working.

Once you add Documentation, it's okay to add my Ack.

Thanks,
-Kame


 ---
   mm/memcontrol.c | 69 
 -
   1 file changed, 68 insertions(+), 1 deletion(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index b0e29f4..54e93de 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -273,6 +273,10 @@ struct mem_cgroup {
   };
   
   /*
 +  * the counter to account for kernel memory usage.
 +  */
 + struct res_counter kmem;
 + /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
 @@ -287,6 +291,7 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
   bool use_hierarchy;
 + bool kmem_accounted;
   
   booloom_lock;
   atomic_tunder_oom;
 @@ -397,6 +402,7 @@ enum res_type {
   _MEM,
   _MEMSWAP,
   _OOM_TYPE,
 + _KMEM,
   };
   
   #define MEMFILE_PRIVATE(x, val) ((x) << 16 | (val))
 @@ -1499,6 +1505,10 @@ done:
   res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
   res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
   res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
 + printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
 + res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
 + res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
 + res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
   
   mem_cgroup_print_oom_stat(memcg);
   }
 @@ -4008,6 +4018,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, 
 struct cftype *cft,
   else
   val = res_counter_read_u64(&memcg->memsw, name);
   break;
 + case _KMEM:
 + val = res_counter_read_u64(&memcg->kmem, name);
 + break;
   default:
   BUG();
   }
 @@ -4046,8 +4059,23 @@ static int mem_cgroup_write(struct cgroup *cont, 
 struct cftype *cft,
   break;
   if (type == _MEM)
   ret = mem_cgroup_resize_limit(memcg, val);
 - else
 + else if (type == _MEMSWAP)
   ret = mem_cgroup_resize_memsw_limit(memcg, val);
 + else if (type == _KMEM) {
 + ret = res_counter_set_limit(&memcg->kmem, val);
 + if (ret)
 + break;
 + /*
 +  * Once enabled, can't be disabled. We could in theory
 +  * disable it if we haven't yet created any caches, or
 +  * if we can shrink them all to death.
 +  *
 +  * But it is not worth the trouble
 +  */
 + if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 + memcg->kmem_accounted = true;
 + } else
 + return -EINVAL;
   break;
   case RES_SOFT_LIMIT:
   ret = res_counter_memparse_write_strategy(buffer, val);
 @@ -4113,12 +4141,16 @@ static int mem_cgroup_reset(struct cgroup *cont, 
 unsigned int event)
   case RES_MAX_USAGE:
   if (type == _MEM)
   res_counter_reset_max(&memcg->res);
 + else if (type == _KMEM

[Devel] Re: [PATCH v2 05/11] Add a __GFP_KMEMCG flag

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
 This flag is used to indicate to the callees that this allocation is a
 kernel allocation in process context, and should be accounted to
 current's memcg. It takes numerical place of the of the recently removed
 __GFP_NO_KSWAPD.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 CC: Rik van Riel r...@redhat.com
 CC: Mel Gorman m...@csn.ul.ie

Okay, so, only memcg-aware allocations are accounted.
It seems a safe way to go.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
 This patch introduces infrastructure for tracking kernel memory pages to
 a given memcg. This will happen whenever the caller includes the
 __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
 
 In memcontrol.h those functions are wrapped in inline accessors.  The
 idea is to later on, patch those with static branches, so we don't incur
 any overhead when no mem cgroups with limited kmem are being used.
 
 [ v2: improved comments and standardized function names ]
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 ---
   include/linux/memcontrol.h |  79 +++
   mm/memcontrol.c| 185 
 +
   2 files changed, 264 insertions(+)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 8d9489f..75b247e 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -21,6 +21,7 @@
   #define _LINUX_MEMCONTROL_H
   #include <linux/cgroup.h>
   #include <linux/vm_event_item.h>
 +#include <linux/hardirq.h>
   
   struct mem_cgroup;
   struct page_cgroup;
 @@ -399,6 +400,11 @@ struct sock;
   #ifdef CONFIG_MEMCG_KMEM
   void sock_update_memcg(struct sock *sk);
   void sock_release_memcg(struct sock *sk);
 +
 +#define memcg_kmem_on 1
 +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
 +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
 +void __memcg_kmem_free_page(struct page *page, int order);
   #else
   static inline void sock_update_memcg(struct sock *sk)
   {
 @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
   static inline void sock_release_memcg(struct sock *sk)
   {
   }
 +
 +#define memcg_kmem_on 0
 +static inline bool
 +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
 +{
 + return false;
 +}
 +
 +static inline void  __memcg_kmem_free_page(struct page *page, int order)
 +{
 +}
 +
 +static inline void
 +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
 order)
 +{
 +}
   #endif /* CONFIG_MEMCG_KMEM */
 +
 +/**
 + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
 + * @gfp: the gfp allocation flags.
 + * @handle: a pointer to the memcg this was charged against.
 + * @order: allocation order.
 + *
 + * returns true if the memcg where the current task belongs can hold this
 + * allocation.
 + *
 + * We return true automatically if this allocation is not to be accounted to
 + * any memcg.
 + */
 +static __always_inline bool
 +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
 +{
 + if (!memcg_kmem_on)
 + return true;
 + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
 + return true;
 + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
 + return true;
 + return __memcg_kmem_new_page(gfp, handle, order);
 +}
 +
 +/**
 + * memcg_kmem_free_page: uncharge pages from memcg
 + * @page: pointer to struct page being freed
 + * @order: allocation order.
 + *
 + * there is no need to specify memcg here, since it is embedded in 
 page_cgroup
 + */
 +static __always_inline void
 +memcg_kmem_free_page(struct page *page, int order)
 +{
 + if (memcg_kmem_on)
 + __memcg_kmem_free_page(page, order);
 +}
 +
 +/**
 + * memcg_kmem_commit_page: embeds correct memcg in a page
 + * @handle: a pointer to the memcg this was charged against.
 + * @page: pointer to struct page recently allocated
 + * @handle: the memcg structure we charged against
 + * @order: allocation order.
 + *
 + * Needs to be called after memcg_kmem_new_page, regardless of success or
 + * failure of the allocation. if @page is NULL, this function will revert the
 + * charges. Otherwise, it will commit the memcg given by @handle to the
 + * corresponding page_cgroup.
 + */
 +static __always_inline void
 +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
 order)
 +{
 + if (memcg_kmem_on)
 + __memcg_kmem_commit_page(page, handle, order);
 +}

Shouldn't these 2 functions have short-cuts ?

if (memcg_kmem_on && handle) ?

Maybe free() needs to access page_cgroup...



   #endif /* _LINUX_MEMCONTROL_H */
   
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 54e93de..e9824c1 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -10,6 +10,10 @@
* Copyright (C) 2009 Nokia Corporation
* Author: Kirill A. Shutemov
*
 + * Kernel Memory Controller
 + * Copyright (C) 2012 Parallels Inc. and Google Inc.
 + * Authors: Glauber Costa and Suleiman Souhlal
 + *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation

[Devel] Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
 When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 page allocator will call the corresponding memcg functions to validate
 the allocation. Tasks in the root memcg can always proceed.
 
 To avoid adding markers to the page - and a kmem flag that would
 necessarily follow, as much as doing page_cgroup lookups for no reason,
 whoever is marking its allocations with __GFP_KMEMCG flag is responsible
 for telling the page allocator that this is such an allocation at
 free_pages() time. This is done by the invocation of
 __free_accounted_pages() and free_accounted_pages().
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com

Ah, ok. free_accounted_page() seems good.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

I myself is okay with this. But...

Because you add a new hook to alloc_pages(), please get Ack from Mel
before requesting merge.

Thanks,
-Kame




 ---
   include/linux/gfp.h |  3 +++
   mm/page_alloc.c | 38 ++
   2 files changed, 41 insertions(+)
 
 diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 index d8eae4d..029570f 100644
 --- a/include/linux/gfp.h
 +++ b/include/linux/gfp.h
 @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int 
 order);
   extern void free_hot_cold_page(struct page *page, int cold);
   extern void free_hot_cold_page_list(struct list_head *list, int cold);
   
 +extern void __free_accounted_pages(struct page *page, unsigned int order);
 +extern void free_accounted_pages(unsigned long addr, unsigned int order);
 +
   #define __free_page(page) __free_pages((page), 0)
   #define free_page(addr) free_pages((addr), 0)
   
 diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 index b956cec..da341dc 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
 order,
   struct page *page = NULL;
   int migratetype = allocflags_to_migratetype(gfp_mask);
   unsigned int cpuset_mems_cookie;
 + void *handle = NULL;
   
   gfp_mask = gfp_allowed_mask;
   
 @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
 order,
   return NULL;
   
   /*
 +  * Will only have any effect when __GFP_KMEMCG is set.
 +  * This is verified in the (always inline) callee
 +  */
 + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
 + return NULL;
 +
 + /*
* Check the zones suitable for the gfp_mask contain at least one
* valid zone. It's possible to have an empty zonelist as a result
* of GFP_THISNODE and a memoryless node
 @@ -2583,6 +2591,8 @@ out:
   if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
   goto retry_cpuset;
   
 + memcg_kmem_commit_page(page, handle, order);
 +
   return page;
   }
   EXPORT_SYMBOL(__alloc_pages_nodemask);
 @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
   
   EXPORT_SYMBOL(free_pages);
   
 +/*
 + * __free_accounted_pages and free_accounted_pages will free pages allocated
 + * with __GFP_KMEMCG.
 + *
 + * Those pages are accounted to a particular memcg, embedded in the
 + * corresponding page_cgroup. To avoid adding a hit in the allocator to 
 search
 + * for that information only to find out that it is NULL for users who have 
 no
 + * interest in that whatsoever, we provide these functions.
 + *
 + * The caller knows better which flags it relies on.
 + */
 +void __free_accounted_pages(struct page *page, unsigned int order)
 +{
 + memcg_kmem_free_page(page, order);
 + __free_pages(page, order);
 +}
 +EXPORT_SYMBOL(__free_accounted_pages);
 +
 +void free_accounted_pages(unsigned long addr, unsigned int order)
 +{
 + if (addr != 0) {
 + VM_BUG_ON(!virt_addr_valid((void *)addr));
 + memcg_kmem_free_page(virt_to_page((void *)addr), order);
 + __free_pages(virt_to_page((void *)addr), order);
 + }
 +}
 +EXPORT_SYMBOL(free_accounted_pages);
 +
   static void *make_alloc_exact(unsigned long addr, unsigned order, size_t 
 size)
   {
   if (addr) {
 


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
 The current memcg slab cache management fails to present satisfatory
 hierarchical behavior in the following scenario:
 
 - /cgroups/memory/A/B/C
 
 * kmem limit set at A,
 * A and B have no tasks,
 * span a new task in in C.
 
 Because kmem_accounted is a boolean that was not set for C, no
 accounting would be done. This is, however, not what we expect.
 
 The basic idea, is that when a cgroup is limited, we walk the tree
 upwards (something Kame and I already thought about doing for other
 purposes), and make sure that we store the information about the parent
 being limited in kmem_accounted (that is turned into a bitmap: two
 booleans would not be space efficient). The code for that is taken from
 sched/core.c. My reasons for not putting it into a common place is to
 dodge the type issues that would arise from a common implementation
 between memcg and the scheduler - but I think that it should ultimately
 happen, so if you want me to do it now, let me know.
 
 We do the reverse operation when a formerly limited cgroup becomes
 unlimited.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com



 ---
   mm/memcontrol.c | 88 
 +++--
   1 file changed, 79 insertions(+), 9 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 3216292..3d30b79 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -295,7 +295,8 @@ struct mem_cgroup {
* Should the accounting and control be hierarchical, per subtree?
*/
   bool use_hierarchy;
 - bool kmem_accounted;
 +
 + unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
   
   booloom_lock;
   atomic_tunder_oom;
 @@ -348,6 +349,38 @@ struct mem_cgroup {
   #endif
   };
   
 +enum {
 + KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
 + KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
 +};
 +
 +#ifdef CONFIG_MEMCG_KMEM
 +static bool memcg_kmem_account(struct mem_cgroup *memcg)
 +{
 + return !test_and_set_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
 +}
 +
 +static bool memcg_kmem_clear_account(struct mem_cgroup *memcg)
 +{
 + return test_and_clear_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
 +}
 +
 +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
 +{
 + return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
 +}
 +
 +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
 +{
 + set_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted);
 +}
 +
 +static void memcg_kmem_clear_account_parent(struct mem_cgroup *memcg)
 +{
 + clear_bit(KMEM_ACCOUNTED_PARENT, &memcg->kmem_accounted);
 +}
 +#endif /* CONFIG_MEMCG_KMEM */
 +
   /* Stuffs for move charges at task migration. */
   /*
* Types of charges to be moved. move_charge_at_immitgrate is treated as a
 @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
   
   static void disarm_kmem_keys(struct mem_cgroup *memcg)
   {
 - if (memcg->kmem_accounted)
 + if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
   static_key_slow_dec(&memcg_kmem_enabled_key);
   }
   #else
 @@ -4171,17 +4204,54 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, 
 struct cftype *cft,
   static void memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
   {
   #ifdef CONFIG_MEMCG_KMEM
 - /*
 -  * Once enabled, can't be disabled. We could in theory disable it if we
 -  * haven't yet created any caches, or if we can shrink them all to
 -  * death. But it is not worth the trouble.
 -  */
 + struct mem_cgroup *iter;
 +
   mutex_lock(&set_limit_mutex);
 - if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
 + if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
 +
 + /*
 +  * Once enabled, can't be disabled. We could in theory disable
 +  * it if we haven't yet created any caches, or if we can shrink
 +  * them all to death. But it is not worth the trouble
 +  */
   static_key_slow_inc(&memcg_kmem_enabled_key);
 - memcg->kmem_accounted = true;
 +
 + if (!memcg->use_hierarchy)
 + goto out;
 +
 + for_each_mem_cgroup_tree(iter, memcg) {
 + if (iter == memcg)
 + continue;
 + memcg_kmem_account_parent(iter);
 + }

Could you add an explanation comment ?


 + } else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
 +
 + if (!memcg->use_hierarchy)
 + goto out;
 +
ditto.

 + for_each_mem_cgroup_tree(iter, memcg

[Devel] Re: [PATCH v2 11/11] protect architectures where THREAD_SIZE = PAGE_SIZE against fork bombs

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
 Because those architectures will draw their stacks directly from the
 page allocator, rather than the slab cache, we can directly pass
 __GFP_KMEMCG flag, and issue the corresponding free_pages.
 
 This code path is taken when the architecture doesn't define
 CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
 THREAD_SIZE = PAGE_SIZE. Luckily, most - if not all - of the remaining
 architectures fall in this category.
 
 This will guarantee that every stack page is accounted to the memcg the
 process currently lives on, and will have the allocations to fail if
 they go over limit.
 
 For the time being, I am defining a new variant of THREADINFO_GFP, not
 to mess with the other path. Once the slab is also tracked by memcg, we
 can get rid of that flag.
 
 Tested to successfully protect against :(){ :|:& };:
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 Acked-by: Frederic Weisbecker fweis...@redhat.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


 ---
   include/linux/thread_info.h | 2 ++
   kernel/fork.c   | 4 ++--
   2 files changed, 4 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
 index ccc1899..e7e0473 100644
 --- a/include/linux/thread_info.h
 +++ b/include/linux/thread_info.h
 @@ -61,6 +61,8 @@ extern long do_no_restart_syscall(struct restart_block 
 *parm);
   # define THREADINFO_GFP (GFP_KERNEL | __GFP_NOTRACK)
   #endif
   
 +#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
 +
   /*
* flag set/clear/test wrappers
* - pass TIF_ constants to these functions
 diff --git a/kernel/fork.c b/kernel/fork.c
 index dc3ff16..b0b90c3 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -142,7 +142,7 @@ void __weak arch_release_thread_info(struct thread_info 
 *ti) { }
   static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 int node)
   {
 - struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 + struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
THREAD_SIZE_ORDER);
   
   return page ? page_address(page) : NULL;
 @@ -151,7 +151,7 @@ static struct thread_info *alloc_thread_info_node(struct 
 task_struct *tsk,
   static inline void free_thread_info(struct thread_info *ti)
   {
   arch_release_thread_info(ti);
 - free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 + free_accounted_pages((unsigned long)ti, THREAD_SIZE_ORDER);
   }
   # else
   static struct kmem_cache *thread_info_cache;
 


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v2 02/11] memcg: Reclaim when more than one page needed.

2012-08-10 Thread Kamezawa Hiroyuki

(2012/08/11 2:28), Michal Hocko wrote:

On Sat 11-08-12 01:49:25, KAMEZAWA Hiroyuki wrote:

(2012/08/11 0:42), Michal Hocko wrote:

On Thu 09-08-12 17:01:10, Glauber Costa wrote:
[...]

@@ -2317,18 +2318,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
/*
-* nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
-* of regular pages (CHARGE_BATCH), or a single regular page (1).
-*
 * Never reclaim on behalf of optional batching, retry with a
 * single page instead.
 */
-   if (nr_pages == CHARGE_BATCH)
+   if (nr_pages > min_pages)
return CHARGE_RETRY;


This is dangerous because THP charges will be retried now while they
previously failed with CHARGE_NOMEM which means that we will keep
attempting potentially endlessly.


with THP, I thought nr_pages == min_pages, and no retry.


right you are.


Why cannot we simply do if (nr_pages  CHARGE_BATCH) and get rid of the
min_pages altogether?


Hm, I think a slab can be larger than CHARGE_BATCH.


Also the comment doesn't seem to be valid anymore.


I agree it's not clean. Because our assumption on nr_pages are changed,
I think this behavior should not depend on nr_pages value..
Shouldn't we have a flag to indicate trial-for-batched charge ?


dunno, it would require a new parameter anyway (because abusing gfp
doesn't seem great idea).


ok, agreed.

-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers

2012-07-20 Thread Kamezawa Hiroyuki

(2012/07/12 6:41), Kir Kolyshkin wrote:

Gentlemen,

We are organizing containers mini-summit during next Linux Plumbers (San Diego, 
August 29-31).
The idea is to gather and discuss everything relevant to namespaces, cgroups, 
resource management,
checkpoint-restore and so on.

We are trying to come up with a list of topics to discuss, so please reply with 
topic suggestions, and
let me know if you are going to come.



I think I'll be there.

-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 10/11] memcg: allow a memcg with kmem charges to be destructed.

2012-06-26 Thread Kamezawa Hiroyuki
(2012/06/25 23:15), Glauber Costa wrote:
 Because the ultimate goal of the kmem tracking in memcg is to
 track slab pages as well, we can't guarantee that we'll always
 be able to point a page to a particular process, and migrate
 the charges along with it - since in the common case, a page
 will contain data belonging to multiple processes.
 
 Because of that, when we destroy a memcg, we only make sure
 the destruction will succeed by discounting the kmem charges
 from the user charges when we try to empty the cgroup.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 ---
   mm/memcontrol.c |   10 +-
   1 file changed, 9 insertions(+), 1 deletion(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index a6a440b..bb9b6fe 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -598,6 +598,11 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
   {
   if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
   static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 + /*
 +  * This check can't live in kmem destruction function,
 +  * since the charges will outlive the cgroup
 +  */
 + BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
   }
   #else
   static void disarm_kmem_keys(struct mem_cgroup *memcg)
 @@ -3838,6 +3843,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup 
 *memcg, bool free_all)
   int node, zid, shrink;
   int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
   struct cgroup *cgrp = memcg->css.cgroup;
 + u64 usage;
   
   css_get(&memcg->css);
   
 @@ -3877,8 +3883,10 @@ move_account:
   if (ret == -ENOMEM)
   goto try_to_free;
   cond_resched();
 + usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
 + res_counter_read_u64(&memcg->kmem, RES_USAGE);
   /* ret should also be checked to ensure all lists are empty. */
 - } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
 + } while (usage > 0 || ret);
   out:
   css_put(&memcg->css);
   return ret;
 
Hmm... maybe this works well enough. Could you add more comments to the code?

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com





[Devel] Re: [PATCH] fix bad behavior in use_hierarchy file

2012-06-25 Thread Kamezawa Hiroyuki
(2012/06/25 18:21), Glauber Costa wrote:
 I have an application that does the following:
 
 * copy the state of all controllers attached to a hierarchy
 * replicate it as a child of the current level.
 
 I would expect writes to the files to mostly succeed, since they
 are inheriting sane values from parents.
 
 But that is not the case for use_hierarchy. If it is set to 0, we
 succeed ok. If we're set to 1, the value of the file is automatically
 set to 1 in the children, but if userspace tries to write the
 very same 1, it will fail. That same situation happens if we
 set use_hierarchy, create a child, and then try to write 1 again.
 
 Now, there is no reason whatsoever for failing to write a value
 that is already there. It doesn't even match the comments, that
 states:
 
   /* If parent's use_hierarchy is set, we can't make any modifications
* in the child subtrees...
 
 since we are not changing anything.
 
 The following patch tests the new value against the one we're storing,
 and automatically return 0 if we're not proposing a change.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Dhaval Giani dhaval.gi...@gmail.com
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org

Hm. 
Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com




[Devel] Re: [PATCH 11/11] protect architectures where THREAD_SIZE = PAGE_SIZE against fork bombs

2012-06-25 Thread Kamezawa Hiroyuki

(2012/06/26 13:57), David Rientjes wrote:

On Mon, 25 Jun 2012, Glauber Costa wrote:


diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index ccc1899..914ec07 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,6 +61,12 @@ extern long do_no_restart_syscall(struct restart_block 
*parm);
  # define THREADINFO_GFP   (GFP_KERNEL | __GFP_NOTRACK)
  #endif

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
+#else
+# define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP)
+#endif
+


This type of requirement is going to become nasty very quickly if nobody
can use __GFP_KMEMCG without testing for CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
Perhaps define __GFP_KMEMCG to be 0x0 if it's not enabled, similar to how
kmemcheck does?


I agree.

Thanks,
-Kame
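
The kmemcheck-style fallback agreed on here would look roughly like the sketch below (the bit value is illustrative only; only the shape of the #define matters):

    /* Sketch: define __GFP_KMEMCG away when the kmem controller is not
     * compiled in, so callers never need an #ifdef of their own. */
    #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
    # define __GFP_KMEMCG   ((__force gfp_t)0x400000u)  /* example bit */
    #else
    # define __GFP_KMEMCG   ((__force gfp_t)0)
    #endif

    /* Callers can then use it unconditionally: */
    #define THREADINFO_GFP_ACCOUNTED    (THREADINFO_GFP | __GFP_KMEMCG)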




[Devel] Re: [PATCH 08/11] memcg: disable kmem code when not in use.

2012-06-25 Thread Kamezawa Hiroyuki
(2012/06/25 23:15), Glauber Costa wrote:
 We can use jump labels to patch the code in or out
 when not used.
 
 Because the assignment: memcg-kmem_accounted = true
 is done after the jump labels increment, we guarantee
 that the root memcg will always be selected until
 all call sites are patched (see mem_cgroup_kmem_enabled).
 This guarantees that no mischarges are applied.
 
 Jump label decrement happens when the last reference
 count from the memcg dies. This will only happen when
 the caches are all dead.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



[Devel] Re: [PATCH v4 23/25] memcg: propagate kmem limiting information to children

2012-06-22 Thread Kamezawa Hiroyuki
(2012/06/20 17:59), Glauber Costa wrote:
 On 06/19/2012 12:54 PM, Glauber Costa wrote:
 On 06/19/2012 12:35 PM, Glauber Costa wrote:
 On 06/19/2012 04:16 AM, Kamezawa Hiroyuki wrote:
 (2012/06/18 21:43), Glauber Costa wrote:
 On 06/18/2012 04:37 PM, Kamezawa Hiroyuki wrote:
 (2012/06/18 19:28), Glauber Costa wrote:
 The current memcg slab cache management fails to present satisfatory 
 hierarchical
 behavior in the following scenario:

 -   /cgroups/memory/A/B/C

 * kmem limit set at A
 * A and B empty taskwise
 * bash in C does find /

 Because kmem_accounted is a boolean that was not set for C, no 
 accounting
 would be done. This is, however, not what we expect.


 Hmmdo we need this new routines even while we have mem_cgroup_iter() 
 ?

 Doesn't this work ?

  struct mem_cgroup {
  .
  bool kmem_accounted_this;
  atomic_t kmem_accounted;
  
  }

 at set limit

  set_limit(memcg) {

  if (newly accounted) {
  mem_cgroup_iter() {
   atomic_inc(&iter->kmem_accounted)
  }
  } else {
  mem_cgroup_iter() {
   atomic_dec(&iter->kmem_accounted);
  }
  }


 hm ? Then, you can see kmem is accounted or not by 
 atomic_read(&memcg->kmem_accounted);


 Accounted by itself / parent is still useful, and I see no reason to use
 an atomic + bool if we can use a pair of bits.

 As for the routine, I guess mem_cgroup_iter will work... It does a lot
 more than I need, but for the sake of using what's already in there, I
 can switch to it with no problems.


 Hmm. please start from reusing existing routines.
 If it's not enough, some enhancement for generic cgroup  will be welcomed
 rather than completely new one only for memcg.


 And now that I am trying to adapt the code to the new function, I
 remember clearly why I done this way. Sorry for my failed memory.

 That has to do with the order of the walk. I need to enforce hierarchy,
 which means whenever a cgroup has !use_hierarchy, I need to cut out that
 branch, but continue scanning the tree for other branches.

 That is a lot easier to do with depth-search tree walks like the one
 proposed in this patch. for_each_mem_cgroup() seems to walk the tree in
 css-creation order. Which means we need to keep track of parents that
 has hierarchy disabled at all times ( can be many ), and always test for
 ancestorship - which is expensive, but I don't particularly care.

 But I'll give another shot with this one.


 Humm, silly me. I was believing the hierarchical settings to be more
 flexible than they really are.

 I thought that it could be possible for a children of a parent with
 use_hierarchy = 1 to have use_hierarchy = 0.

 It seems not to be the case. This makes my life a lot easier.

 
 How about the following patch?
 
 It is still expensive in the clear_bit case, because I can't just walk
 the whole tree flipping the bit down: I need to stop whenever I see a
 branch whose root is itself accounted - and the ordering of iter forces
 me to always check the tree up (So we got O(n*h) h being height instead
 of O(n)).
 
 for flipping the bit up, it is easy enough.
 
 
Yes. It seems much nicer.

Thanks,
-Kame
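
For reference, once use_hierarchy cannot be re-disabled below an ancestor, the set-side walk discussed here reduces to something like the sketch below (assumptions: mem_cgroup_iter()'s usual root/prev pre-order iteration and a KMEM_ACCOUNTED_PARENT bit; the clear side, which must stop at self-accounted branches and is O(n*h), is left out):

    /* Sketch: when a kmem limit is first set on @memcg, mark the whole
     * subtree so descendants account even without their own limit. */
    static void memcg_kmem_mark_subtree(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *iter;

            for (iter = mem_cgroup_iter(memcg, NULL, NULL); iter;
                 iter = mem_cgroup_iter(memcg, iter, NULL)) {
                    if (iter == memcg)
                            set_bit(KMEM_ACCOUNTED_THIS, &iter->kmem_accounted);
                    else
                            set_bit(KMEM_ACCOUNTED_PARENT, &iter->kmem_accounted);
            }
    }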



[Devel] Re: [PATCH v4 05/25] memcg: Always free struct memcg through schedule_work()

2012-06-21 Thread Kamezawa Hiroyuki

(2012/06/20 17:40), Glauber Costa wrote:

On 06/20/2012 11:32 AM, Pekka Enberg wrote:

Maybe Pekka can merge the current -mm with his tree?

I first want to have a stable base from Christoph's common slab series
before I am comfortable with going forward with the memcg parts.

Feel free to push forward any preparational patches to the slab
allocators, though.

Pekka


Kame and others:

If you are already comfortable with the general shape of the series, it would 
do me good to do the same with the memcg preparation patches, so we have less 
code to review and merge in the next window.

They are:

memcg: Make it possible to use the stock for more than one page.
memcg: Reclaim when more than one page needed.
memcg: change defines to an enum

Do you see any value in merging them now ?



I'll be okay with the 3 patches for memcg.

Thanks,
-Kame



[Devel] Re: [PATCH v4 05/25] memcg: Always free struct memcg through schedule_work()

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 19:27), Glauber Costa wrote:
 Right now we free struct memcg with kfree right after a
 rcu grace period, but defer it if we need to use vfree() to get
 rid of that memory area. We do that by need, because we need vfree
 to be called in a process context.
 
 This patch unifies this behavior, by ensuring that even kfree will
 happen in a separate thread. The goal is to have a stable place to
 call the upcoming jump label destruction function outside the realm
 of the complicated and quite far-reaching cgroup lock (that can't be
 held when calling neither the cpu_hotplug.lock nor the jump_label_mutex)
 
 Signed-off-by: Glauber Costaglom...@parallels.com
 CC: Tejun Heot...@kernel.org
 CC: Li Zefanlize...@huawei.com
 CC: Kamezawa Hiroyukikamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weinerhan...@cmpxchg.org
 CC: Michal Hockomho...@suse.cz

How about cut out this patch and merge first as simple cleanu up and
to reduce patch stack on your side ?

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

 ---
   mm/memcontrol.c |   24 +---
   1 file changed, 13 insertions(+), 11 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index e3b528e..ce15be4 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -245,8 +245,8 @@ struct mem_cgroup {
*/
   struct rcu_head rcu_freeing;
   /*
 -  * But when using vfree(), that cannot be done at
 -  * interrupt time, so we must then queue the work.
 +  * We also need some space for a worker in deferred freeing.
 +  * By the time we call it, rcu_freeing is not longer in use.
*/
   struct work_struct work_freeing;
   };
 @@ -4826,23 +4826,28 @@ out_free:
   }
 
   /*
 - * Helpers for freeing a vzalloc()ed mem_cgroup by RCU,
 + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
* but in process context.  The work_freeing structure is overlaid
* on the rcu_freeing structure, which itself is overlaid on memsw.
*/
 -static void vfree_work(struct work_struct *work)
 +static void free_work(struct work_struct *work)
   {
   struct mem_cgroup *memcg;
 + int size = sizeof(struct mem_cgroup);
 
   memcg = container_of(work, struct mem_cgroup, work_freeing);
 - vfree(memcg);
 + if (size < PAGE_SIZE)
 + kfree(memcg);
 + else
 + vfree(memcg);
   }
 -static void vfree_rcu(struct rcu_head *rcu_head)
 +
 +static void free_rcu(struct rcu_head *rcu_head)
   {
   struct mem_cgroup *memcg;
 
   memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
 - INIT_WORK(&memcg->work_freeing, vfree_work);
 + INIT_WORK(&memcg->work_freeing, free_work);
   schedule_work(&memcg->work_freeing);
   }
 
 @@ -4868,10 +4873,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
   free_mem_cgroup_per_zone_info(memcg, node);
 
   free_percpu(memcg->stat);
 - if (sizeof(struct mem_cgroup) < PAGE_SIZE)
 - kfree_rcu(memcg, rcu_freeing);
 - else
 - call_rcu(&memcg->rcu_freeing, vfree_rcu);
 + call_rcu(&memcg->rcu_freeing, free_rcu);
   }
 
   static void mem_cgroup_get(struct mem_cgroup *memcg)




[Devel] Re: [PATCH v4 00/25] kmem limitation for memcg

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 19:27), Glauber Costa wrote:
 Hello All,
 
 This is my new take for the memcg kmem accounting. This should merge
 all of the previous comments from you guys, specially concerning the big churn
 inside the allocators themselves.
 
 My focus in this new round was to keep the changes in the cache internals to
 a minimum. To do that, I relied upon two main pillars:
 
   * Cristoph's unification series, that allowed me to put must of the changes
 in a common file. Even then, the changes are not too many, since the 
 overal
 level of invasiveness was decreased.
   * Accounting is done directly from the page allocator. This means some pages
 can fail to be accounted, but that can only happen when the task calling
 kmem_cache_alloc or kmalloc is not the same task allocating a new page.
 This never happens in steady state operation if the tasks are kept in the
 same memcg. Naturally, if the page ends up being accounted to a memcg that
 is not limited (such as root memcg), that particular page will simply not
 be accounted.
 
 The dispatcher code stays (mem_cgroup_get_kmem_cache), being the mechanism who
 guarantees that, during steady state operation, all objects allocated in a 
 page
 will belong to the same memcg. I consider this a good compromise point between
 strict and loose accounting here.
 

2 questions.

  - Do you have performance numbers ?

  - Do you think user-memory memcg should be switched to page-allocator level 
accounting ?
(it will require some study for modifying current bached-freeing and 
per-cpu-stock
 logics...)

Thanks,
-Kame



[Devel] Re: [PATCH v4 17/25] skip memcg kmem allocations in specified code regions

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 19:28), Glauber Costa wrote:
 This patch creates a mechanism that skip memcg allocations during
 certain pieces of our core code. It basically works in the same way
 as preempt_disable()/preempt_enable(): By marking a region under
 which all allocations will be accounted to the root memcg.
 
 We need this to prevent races in early cache creation, when we
 allocate data using caches that are not necessarily created already.
 
 Signed-off-by: Glauber Costaglom...@parallels.com
 CC: Christoph Lameterc...@linux.com
 CC: Pekka Enbergpenb...@cs.helsinki.fi
 CC: Michal Hockomho...@suse.cz
 CC: Kamezawa Hiroyukikamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weinerhan...@cmpxchg.org
 CC: Suleiman Souhlalsulei...@google.com

I'm ok with this approach.

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com




[Devel] Re: [PATCH v4 19/25] memcg: disable kmem code when not in use.

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 19:28), Glauber Costa wrote:
 We can use jump labels to patch the code in or out
 when not used.
 
 Because the assignment: memcg-kmem_accounted = true
 is done after the jump labels increment, we guarantee
 that the root memcg will always be selected until
 all call sites are patched (see mem_cgroup_kmem_enabled).
 This guarantees that no mischarges are applied.
 
 Jump label decrement happens when the last reference
 count from the memcg dies. This will only happen when
 the caches are all dead.
 
 Signed-off-by: Glauber Costaglom...@parallels.com
 CC: Christoph Lameterc...@linux.com
 CC: Pekka Enbergpenb...@cs.helsinki.fi
 CC: Michal Hockomho...@suse.cz
 CC: Kamezawa Hiroyukikamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weinerhan...@cmpxchg.org
 CC: Suleiman Souhlalsulei...@google.com
 ---
   include/linux/memcontrol.h |5 -
   mm/memcontrol.c|   22 +-
   2 files changed, 25 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 27a3f16..47ccd80 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -22,6 +22,7 @@
   #include <linux/cgroup.h>
   #include <linux/vm_event_item.h>
   #include <linux/hardirq.h>
 +#include <linux/jump_label.h>
 
   struct mem_cgroup;
   struct page_cgroup;
 @@ -451,7 +452,6 @@ bool __mem_cgroup_new_kmem_page(gfp_t gfp, void *handle, 
 int order);
   void __mem_cgroup_commit_kmem_page(struct page *page, void *handle, int 
 order);
   void __mem_cgroup_free_kmem_page(struct page *page, int order);
 
 -#define mem_cgroup_kmem_on 1
   struct kmem_cache *
   __mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp);
 
 @@ -459,6 +459,9 @@ static inline bool has_memcg_flag(gfp_t gfp)
   {
   return gfp & __GFP_SLABMEMCG;
   }
 +
 +extern struct static_key mem_cgroup_kmem_enabled_key;
 +#define mem_cgroup_kmem_on static_key_false(&mem_cgroup_kmem_enabled_key)
   #else
   static inline void mem_cgroup_register_cache(struct mem_cgroup *memcg,
struct kmem_cache *s)
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index b47ab87..5295ab6 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -422,6 +422,10 @@ static void mem_cgroup_put(struct mem_cgroup *memcg);
   #include <net/sock.h>
   #include <net/ip.h>
 
 +struct static_key mem_cgroup_kmem_enabled_key;
 +/* so modules can inline the checks */
 +EXPORT_SYMBOL(mem_cgroup_kmem_enabled_key);
 +
   static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
   static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 
 delta);
   static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
 @@ -468,6 +472,12 @@ void sock_release_memcg(struct sock *sk)
   }
   }
 
 +static void disarm_static_keys(struct mem_cgroup *memcg)
 +{
 + if (memcg->kmem_accounted)
 + static_key_slow_dec(&mem_cgroup_kmem_enabled_key);
 +}
 +
   #ifdef CONFIG_INET
   struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
   {
 @@ -831,6 +841,10 @@ static void memcg_slab_init(struct mem_cgroup *memcg)
   for (i = 0; i < MAX_KMEM_CACHE_TYPES; i++)
   memcg->slabs[i] = NULL;
   }
 +#else
 +static inline void disarm_static_keys(struct mem_cgroup *memcg)
 +{
 +}
   #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
   static void drain_all_stock_async(struct mem_cgroup *memcg);
 @@ -4344,8 +4358,13 @@ static int mem_cgroup_write(struct cgroup *cont, 
 struct cftype *cft,
*
* But it is not worth the trouble
*/
 - if (!memcg->kmem_accounted && val != RESOURCE_MAX)
 + mutex_lock(&set_limit_mutex);
 + if (!memcg->kmem_accounted && val != RESOURCE_MAX
 +   && !memcg->kmem_accounted) {

I'm sorry why you check the value twice ?

Thanks,
-Kame



[Devel] Re: [PATCH v4 23/25] memcg: propagate kmem limiting information to children

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 19:28), Glauber Costa wrote:
 The current memcg slab cache management fails to present satisfatory 
 hierarchical
 behavior in the following scenario:
 
 -  /cgroups/memory/A/B/C
 
 * kmem limit set at A
 * A and B empty taskwise
 * bash in C does find /
 
 Because kmem_accounted is a boolean that was not set for C, no accounting
 would be done. This is, however, not what we expect.
 

Hmmdo we need this new routines even while we have mem_cgroup_iter() ?

Doesn't this work ?

struct mem_cgroup {
.
bool kmem_accounted_this;
atomic_t kmem_accounted;

}

at set limit

set_limit(memcg) {

if (newly accounted) {
mem_cgroup_iter() {
   atomic_inc(&iter->kmem_accounted)
}
} else {
mem_cgroup_iter() {
   atomic_dec(&iter->kmem_accounted);
}
}


hm ? Then, you can see kmem is accounted or not by 
atomic_read(&memcg->kmem_accounted);

Thanks,
-Kame




[Devel] Re: [PATCH v4 05/25] memcg: Always free struct memcg through schedule_work()

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 21:10), Glauber Costa wrote:
 On 06/18/2012 04:07 PM, Kamezawa Hiroyuki wrote:
 (2012/06/18 19:27), Glauber Costa wrote:
 Right now we free struct memcg with kfree right after a
 rcu grace period, but defer it if we need to use vfree() to get
 rid of that memory area. We do that by need, because we need vfree
 to be called in a process context.

 This patch unifies this behavior, by ensuring that even kfree will
 happen in a separate thread. The goal is to have a stable place to
 call the upcoming jump label destruction function outside the realm
 of the complicated and quite far-reaching cgroup lock (that can't be
 held when calling neither the cpu_hotplug.lock nor the jump_label_mutex)

 Signed-off-by: Glauber Costaglom...@parallels.com
 CC: Tejun Heot...@kernel.org
 CC: Li Zefanlize...@huawei.com
 CC: Kamezawa Hiroyukikamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weinerhan...@cmpxchg.org
 CC: Michal Hockomho...@suse.cz

 How about cut out this patch and merge first as simple cleanu up and
 to reduce patch stack on your side ?

 Acked-by: KAMEZAWA Hiroyukikamezawa.hir...@jp.fujitsu.com
 
 I believe this is already in the -mm tree (from the sock memcg fixes)
 
 But actually, my main trouble with this series here, is that I am basing
 it on Pekka's tree, while some of the fixes are in -mm already.
 If I'd base it on -mm I would lose some of the stuff as well.
 
 Maybe Pekka can merge the current -mm with his tree?
 
 So far I am happy with getting comments from people about the code, so I
 did not get overly concerned about that.
 

Sure. thank you.
-Kame




[Devel] Re: [PATCH v4 23/25] memcg: propagate kmem limiting information to children

2012-06-18 Thread Kamezawa Hiroyuki
(2012/06/18 21:43), Glauber Costa wrote:
 On 06/18/2012 04:37 PM, Kamezawa Hiroyuki wrote:
 (2012/06/18 19:28), Glauber Costa wrote:
 The current memcg slab cache management fails to present satisfatory 
 hierarchical
 behavior in the following scenario:

 -   /cgroups/memory/A/B/C

 * kmem limit set at A
 * A and B empty taskwise
 * bash in C does find /

 Because kmem_accounted is a boolean that was not set for C, no accounting
 would be done. This is, however, not what we expect.


 Hmmdo we need this new routines even while we have mem_cgroup_iter() ?

 Doesn't this work ?

  struct mem_cgroup {
  .
  bool kmem_accounted_this;
  atomic_t kmem_accounted;
  
  }

 at set limit

  set_limit(memcg) {

  if (newly accounted) {
  mem_cgroup_iter() {
   atomic_inc(&iter->kmem_accounted)
  }
  } else {
  mem_cgroup_iter() {
   atomic_dec(&iter->kmem_accounted);
  }
  }


 hm ? Then, you can see kmem is accounted or not by 
  atomic_read(&memcg->kmem_accounted);

 
 Accounted by itself / parent is still useful, and I see no reason to use
 an atomic + bool if we can use a pair of bits.
 
 As for the routine, I guess mem_cgroup_iter will work... It does a lot
 more than I need, but for the sake of using what's already in there, I
 can switch to it with no problems.
 

Hmm. please start from reusing existing routines.
If it's not enough, some enhancement for generic cgroup  will be welcomed
rather than completely new one only for memcg.

Thanks,
-Kame





[Devel] Re: [PATCH v3 00/28] kmem limitation for memcg

2012-06-13 Thread Kamezawa Hiroyuki

(2012/06/07 23:00), Frederic Weisbecker wrote:

On Thu, Jun 07, 2012 at 02:53:07PM +0400, Glauber Costa wrote:

On 06/07/2012 02:26 PM, Frederic Weisbecker wrote:

On Fri, May 25, 2012 at 05:03:20PM +0400, Glauber Costa wrote:

Hello All,

This is my new take for the memcg kmem accounting. This should merge
all of the previous comments from you, plus fix a bunch of bugs.

At this point, I consider the series pretty mature. Since last submission
2 weeks ago, I focused on broadening the testing coverage. Some bugs were
fixed, but that of course doesn't mean no bugs exist.

I believe some of the early patches here are already in some trees around.
I don't know who should pick this, so if everyone agrees with what's in here,
please just ack them and tell me which tree I should aim for (-mm? Hocko's?)
and I'll rebase it.

I should point out again that most, if not all, of the code in the caches
are wrapped in static_key areas, meaning they will be completely patched out
until the first limit is set. Enabling and disabling of static_keys incorporate
the last fixes for sock memcg, and should be pretty robust.

I also put a lot of effort, as you will all see, in the proper separation
of the patches, so the review process is made as easy as the complexity of
the work allows to.


So I believe that if I want to implement a per kernel stack 
accounting/limitation,
I need to work on top of your patchset.

What do you think about having some sub kmem accounting based on the caches?
For example there could be a specific accounting per kmem cache.

Like if we use a specific kmem cache to allocate the kernel stack
(as is done by some archs but I can generalize that for those who want
kernel stack accounting), allocations are accounted globally in the memcg as
done in your patchset but also on a seperate counter only for this kmem cache
on the memcg, resulting in a kmem.stack.usage somewhere.

The concept of per kmem cache accounting can be expanded more for any
kind of finegrained kmem accounting.

Thoughts?


I believe a general separation is too much, and will lead to knob
explosion. So I don't think it is a good idea.


Right. This could be an option in kmem_cache_create() or something.



Now, for the stack itself, it can be justified. The question that
remains to be answered is:

Why do you need to set the stack value separately? Isn't accounting
the stack value, and limiting against the global kmem limit enough?


Well, I may want to let my container have a full access to some kmem
resources (net, file, etc...) but defend against fork bombs or NR_PROC
rlimit exhaustion of other containers.

So I need to be able to set my limit precisely on kstack.


You explained that the limitation is necessary for fork-bomb, and the bad
point of fork-bomb is that it can cause OOM. So, the problem is OOM not 
fork-bomb.

If the problem is OOM, IIUC, generic kernel memory limiting will work better 
than
kernel stack limiting.

Thanks,
-Kame




[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time

2012-05-17 Thread KAMEZAWA Hiroyuki
(2012/05/17 18:52), Glauber Costa wrote:

 On 05/17/2012 09:37 AM, Andrew Morton wrote:
  If that happens, locking in static_key_slow_inc will prevent any damage.
  My previous version had explicit code to prevent that, but we were
  pointed out that this is already part of the static_key expectations, so
  that was dropped.
 This makes no sense.  If two threads run that code concurrently,
 key->enabled gets incremented twice.  Nobody anywhere has a record that
 this happened so it cannot be undone.  key->enabled is now in an
 unknown state.
 
 Kame, Tejun,
 
 Andrew is right. It seems we will need that mutex after all. Just this 
 is not a race, and neither something that should belong in the 
 static_branch interface.
 


Hmmhow about having

res_counter_xchg_limit(res, old_limit, new_limit);

if (!cg_proto->updated && old_limit == RESOURCE_MAX)
update labels...

Then, no mutex overhead maybe and activated will be updated only once.
Ah, but please fix in a way you like. Above is an example.

Thanks,
-Kame
(*) I'm sorry I won't be able to read e-mails, tomorrow.





[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time

2012-05-17 Thread KAMEZAWA Hiroyuki
(2012/05/17 19:22), Glauber Costa wrote:

 On 05/17/2012 02:18 PM, KAMEZAWA Hiroyuki wrote:
 (2012/05/17 18:52), Glauber Costa wrote:

 On 05/17/2012 09:37 AM, Andrew Morton wrote:
   If that happens, locking in static_key_slow_inc will prevent any 
 damage.
   My previous version had explicit code to prevent that, but we were
   pointed out that this is already part of the static_key expectations, 
 so
   that was dropped.
 This makes no sense.  If two threads run that code concurrently,
 key->enabled gets incremented twice.  Nobody anywhere has a record that
 this happened so it cannot be undone.  key->enabled is now in an
 unknown state.

 Kame, Tejun,

 Andrew is right. It seems we will need that mutex after all. Just this
 is not a race, and neither something that should belong in the
 static_branch interface.



 Hmmhow about having

 res_counter_xchg_limit(res,old_limit, new_limit);

 if (!cg_proto->updated && old_limit == RESOURCE_MAX)
  update labels...

 Then, no mutex overhead maybe and activated will be updated only once.
 Ah, but please fix in a way you like. Above is an example.
 
 I think a mutex is a lot cleaner than adding a new function to the 
 res_counter interface.
 
 We could do a counter, and then later decrement the key until the 
 counter reaches zero, but between those two, I still think a mutex here 
 is preferable.
 
 Only that, instead of coming up with a mutex of ours, we could export 
 and reuse set_limit_mutex from memcontrol.c
 


ok, please.

thx,
-Kame

 
 Thanks,
 -Kame
 (*) I'm sorry I won't be able to read e-mails, tomorrow.

 Ok Kame. I am not in a terrible hurry to fix this, it doesn't seem to be 
 hurting any real workload.
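
In outline, the set_limit_mutex approach agreed on here looks like the sketch below (the cg_proto 'activated' field name and the elided limit update are assumptions; the point is only that the static key is incremented at most once, with the flag set after static_key_slow_inc() returns):

    /* Sketch: serialize activation so key->enabled can only be bumped
     * once per cg_proto, however often the limit file is rewritten. */
    static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
    {
            struct cg_proto *cg_proto = tcp_proto_cgroup(memcg);

            if (!cg_proto)
                    return 0;

            /* ... update the tcp res_counter limit here (elided) ... */

            mutex_lock(&set_limit_mutex);
            if (val != RESOURCE_MAX && !cg_proto->activated) {
                    static_key_slow_inc(&memcg_socket_limit_enabled);
                    /* mark active only after all call sites are patched */
                    cg_proto->activated = true;
            }
            mutex_unlock(&set_limit_mutex);

            return 0;
    }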
 
 





[Devel] Re: [PATCH v2 19/29] skip memcg kmem allocations in specified code regions

2012-05-16 Thread KAMEZAWA Hiroyuki
(2012/05/16 15:19), Glauber Costa wrote:

 On 05/15/2012 06:46 AM, KAMEZAWA Hiroyuki wrote:
 (2012/05/12 2:44), Glauber Costa wrote:

 This patch creates a mechanism that skip memcg allocations during
 certain pieces of our core code. It basically works in the same way
 as preempt_disable()/preempt_enable(): By marking a region under
 which all allocations will be accounted to the root memcg.

 We need this to prevent races in early cache creation, when we
 allocate data using caches that are not necessarily created already.

 Signed-off-by: Glauber Costaglom...@parallels.com
 CC: Christoph Lameterc...@linux.com
 CC: Pekka Enbergpenb...@cs.helsinki.fi
 CC: Michal Hockomho...@suse.cz
 CC: Kamezawa Hiroyukikamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weinerhan...@cmpxchg.org
 CC: Suleiman Souhlalsulei...@google.com


 The concept seems okay to me but...

 ---
   include/linux/sched.h |1 +
   mm/memcontrol.c   |   25 +
   2 files changed, 26 insertions(+), 0 deletions(-)

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 81a173c..0501114 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1613,6 +1613,7 @@ struct task_struct {
 unsigned long nr_pages; /* uncharged usage */
 unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
 } memcg_batch;
 +   atomic_t memcg_kmem_skip_account;


 If only 'current' thread touch this, you don't need to make this atomic 
 counter.
 you can use 'long'.

 You're absolutely right, Kame, thanks.
 I first used atomic_t because I had it tested against current-mm-owner.
 
 Do you, btw, agree to use current instead of owner here?
 You can find the rationale in earlier mails between me and Suleiman.

I agree to use current. This information depends on the context of callers.

Thanks,
-Kame




[Devel] Re: [PATCH v2 18/29] memcg: kmem controller charge/uncharge infrastructure

2012-05-16 Thread KAMEZAWA Hiroyuki
(2012/05/16 15:42), Glauber Costa wrote:

 On 05/15/2012 06:57 AM, KAMEZAWA Hiroyuki wrote:
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
  +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
  +{
  + struct res_counter *fail_res;
  + struct mem_cgroup *_memcg;
  + int may_oom, ret;
  + bool nofail = false;
  +
 + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 + !(gfp & __GFP_NORETRY);
  +
  + ret = 0;
  +
  + if (!memcg)
  + return ret;
  +
  + _memcg = memcg;
  + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 + &_memcg, may_oom);
 +
 + if ((ret == -EINTR) || (ret && (gfp & __GFP_NOFAIL))) {
  + nofail = true;
  + /*
  +  * __mem_cgroup_try_charge() chose to bypass to root due
  +  * to OOM kill or fatal signal.
  +  * Since our only options are to either fail the
  +  * allocation or charge it to this cgroup, force the
  +  * change, going above the limit if needed.
  +  */
 + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
 + if (do_swap_account)
 + res_counter_charge_nofail(&memcg->memsw, delta,
 + &fail_res);
 + } else if (ret == -ENOMEM)
 + return ret;
 +
 + if (nofail)
 + res_counter_charge_nofail(&memcg->kmem, delta, &fail_res);
 + else
 + ret = res_counter_charge(&memcg->kmem, delta, &fail_res);

Ouch, you allow usage > limit? It's a BUG.

 IMHO, if GFP_NOFAIL, memcg accounting should be skipped. Please

if (gfp_mask & __GFP_NOFAIL)
  return 0;

 Or avoid calling memcg_charge_kmem() you can do that as you do in patch 
 19/29,
 I guess you can use a trick like

 == in 19/29
 +if (!current->mm || atomic_read(&current->memcg_kmem_skip_account))
 +return cachep;
 +
   gfp |= cachep->allocflags;
 ==

 == change like this
   gfp |= cachep->allocflags;

   if (!current->mm || current->memcg_kmem_skip_account || gfp & 
 __GFP_NOFAIL))
 ==

 Is this difficult ?

 Thanks,
 -Kame
 
 Well, we disagree with that.
 I actually voiced this earlier to Suleiman in the thread, but it is good
 that you brought this up again - this is quite important.
 
 I will repeat my rationale here, and if you still are not convinced,
 tell me and I will be happy to switch over.
 
 I believe that the whole reasoning behind this, is to have allocations
 failing if we go over limit. If the allocation won't fail anyway, it
 doesn't really matter who we charge this to.
 
 However, if the allocation still came from a particular memcg, those
 nofail allocation may prevent it to use more memory when a normal
 allocation takes place.
 
 Consider this:
 
 limit = 4M
 usage = 4M - 4k
 
 If at this point the memcg hits a NOFAIL allocation worth 2 pages, by
 the method I am using, the memcg will be at 4M + 4k after the
 allocation. Charging it to the root memcg will leave it at 4M - 4k.
 
 This means that to be able to allocate a page again, you need to free
 two other pages, be it the 2 pages used by the GFP allocation or any
 other. In other words: the memcg that originated the charge is held
 accountable for it. If he says it can't fail for whatever reason, fine,
 we respect that,  but we punish it later for other allocations.
 

I personally think 'we punish it later' is a bad thing for resource accounting.
We have a 'hard limit'. It's not a soft limit.

 Without that GFP_NOFAIL becomes just a nice way for people to bypass
 those controls altogether, since after a ton of GFP_NOFAIL allocations,
 normal allocations will still succeed.
 

Allowing people to bypass is not bad because they're kernel.

But, IIUC, from gfp.h
==
 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.  This modifier is deprecated and no new
 * users should be added.
==

GFP_NOFAIL will go away and no new user is recommended.

So, please skip GFP_NOFAIL accounting and avoid writing
"usage may go over the limit if you're unfortunate, sorry" into the memcg documentation.


 The change you propose is totally doable. I just don't believe it should
 be done.
 
 But let me know where you stand.
 

My standpoint is that keeping usage <= limit is the spec, and
it is important in enterprise systems. So, please avoid usage > limit.

Thanks,
-Kame
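
Concretely, the early return being asked for amounts to something like this at the top of the charge path (a sketch against the memcg_charge_kmem() quoted above, not the final code):

    /* Sketch: deprecated __GFP_NOFAIL allocations simply bypass kmem
     * accounting, so they can never push usage above the hard limit. */
    int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
    {
            if (!memcg)
                    return 0;

            if (gfp & __GFP_NOFAIL)
                    return 0;

            /* ... normal __mem_cgroup_try_charge()/res_counter path ... */
            return 0;
    }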











[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time

2012-05-16 Thread KAMEZAWA Hiroyuki
(2012/05/16 16:04), Glauber Costa wrote:

 On 05/16/2012 10:03 AM, Glauber Costa wrote:
 BTW, what is the relationship between 1/2 and 2/2  ?
 Can't do jump label patching inside an interrupt handler. They need to
 happen when we free the structure, and I was about to add a worker
 myself when I found out we already have one: just we don't always use it.

 Before we merge it, let me just make sure the issue with config Li
 pointed out don't exist. I did test it, but since I've reposted this
 many times with multiple tiny changes - the type that will usually get
 us killed, I'd be more comfortable with an extra round of testing if
 someone spotted a possibility.

 Who is merging this fix, btw ?
 I find it to be entirely memcg related, even though it touches a file in
 net (but a file with only memcg code in it)

 
 For the record, I compiled test it many times, and the problem that Li
 wondered about seems not to exist.
 

Ah... Hmm. I guess the dependency problem, if any, will be found in -mm rather than
netdev...

David, can this bug-fix patch goes via -mm tree ? Or will you pick up ?

CC'ed David Miller and Andrew Morton.

Thanks,
-Kame




[Devel] Re: [PATCH v2 18/29] memcg: kmem controller charge/uncharge infrastructure

2012-05-16 Thread KAMEZAWA Hiroyuki
(2012/05/16 17:25), Glauber Costa wrote:

 On 05/16/2012 12:18 PM, KAMEZAWA Hiroyuki wrote:
 If at this point the memcg hits a NOFAIL allocation worth 2 pages, by
  the method I am using, the memcg will be at 4M + 4k after the
  allocation. Charging it to the root memcg will leave it at 4M - 4k.
  
  This means that to be able to allocate a page again, you need to free
  two other pages, be it the 2 pages used by the GFP allocation or any
  other. In other words: the memcg that originated the charge is held
  accountable for it. If he says it can't fail for whatever reason, fine,
  we respect that,  but we punish it later for other allocations.
  
 I personally think 'we punish it later' is bad thing at resource accounting.
 We have 'hard limit'. It's not soft limit.
 
 That only makes sense if you will fail the allocation. If you won't, you
 are over your hard limit anyway. You are just masquerading that.
 


'showing usage > limit to the user' and 'avoiding accounting'
are totally different user experiences.



  Without that GFP_NOFAIL becomes just a nice way for people to bypass
  those controls altogether, since after a ton of GFP_NOFAIL allocations,
  normal allocations will still succeed.
  
 Allowing people to bypass is not bad because they're kernel.
 
 No, they are not. They are in process context, on behalf of a process
 that belongs to a valid memcg. If they happen to be a kernel thread,
!current->mm test will send the allocation to the root memcg already.
 


Yes, but it's kernel code. There will be some special reason to use 
__GFP_NOFAIL.


 But, IIUC, from gfp.h
 ==
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
   * cannot handle allocation failures.  This modifier is deprecated and no 
 new
   * users should be added.
 ==

 GFP_NOFAIL will go away and no new user is recommended.

 Yes, I am aware of that. That's actually why I don't plan to insist on
 this too much - although your e-mail didn't really convince me.
 
 It should not matter in practice.
 
 So, please skip GFP_NOFAIL accounting and avoid writing
 "usage may go over the limit if you're unfortunate, sorry" into the memcg 
 documentation.
 
 I won't write that, because that's not true. Is more like: Allocations
 that can fail will fail if you go over limit.
 


  The change you propose is totally doable. I just don't believe it should
  be done.
  
  But let me know where you stand.
  
 My standpoint is that keeping usage <= limit is the spec, and
 it is important in enterprise systems. So, please avoid usage > limit.

 As I said, I won't make a case here because those allocations shouldn't
 matter in real life anyway. I can change it.
 

My standing point is that 'usage > limit' is a bug. So please avoid it if
__GFP_NOFAIL allocation is not very important.


Thanks,
-Kame




[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time

2012-05-16 Thread KAMEZAWA Hiroyuki
(2012/05/17 6:13), Andrew Morton wrote:

 On Fri, 11 May 2012 17:11:17 -0300
 Glauber Costa glom...@parallels.com wrote:
 
 We call the destroy function when a cgroup starts to be removed,
 such as by a rmdir event.

 However, because of our reference counters, some objects are still
 inflight. Right now, we are decrementing the static_keys at destroy()
 time, meaning that if we get rid of the last static_key reference,
 some objects will still have charges, but the code to properly
 uncharge them won't be run.

 This becomes a problem specially if it is ever enabled again, because
 now new charges will be added to the staled charges making keeping
 it pretty much impossible.

 We just need to be careful with the static branch activation:
 since there is no particular preferred order of their activation,
 we need to make sure that we only start using it after all
 call sites are active. This is achieved by having a per-memcg
 flag that is only updated after static_key_slow_inc() returns.
 At this time, we are sure all sites are active.

 This is made per-memcg, not global, for a reason:
 it also has the effect of making socket accounting more
 consistent. The first memcg to be limited will trigger static_key()
 activation, therefore, accounting. But all the others will then be
 accounted no matter what. After this patch, only limited memcgs
 will have its sockets accounted.
 
 So I'm scratching my head over what the actual bug is, and how
 important it is.  AFAICT it will cause charging stats to exhibit some
 inaccuracy when memcg's are being torn down?
 
 I don't know how serious this in in the real world and so can't decide
 which kernel version(s) we should fix.
 
 When fixing bugs, please always fully describe the bug's end-user
 impact, so that I and others can make these sorts of decisions.
 


Ah, this was a bug report from me. tcp accounting can be easily broken.
Costa, could you include this ?
==

tcp memcontrol uses static_branch to optimize the limit=RESOURCE_MAX case.
If all cgroups have limit=RESOURCE_MAX, resource usage is not accounted.
But it's buggy now.

For example, do following
# while sleep 1;do
   echo 9223372036854775807 > /cgroup/memory/A/memory.kmem.tcp.limit_in_bytes;
   echo 300M > /cgroup/memory/A/memory.kmem.tcp.limit_in_bytes;
   done
and run network application under A. tcp's usage is sometimes accounted
and sometimes not accounted because of frequent changes of static_branch.
Then,  you can see broken tcp.usage_in_bytes.
WARN_ON() is printed because res_counter->usage goes below 0.
==
kernel: [ cut here ]--
kernel: WARNING: at kernel/res_counter.c:96 
res_counter_uncharge_locked+0x37/0x40()
 snip
kernel: Pid: 17753, comm: bash Tainted: G  W3.3.0+ #99
kernel: Call Trace:
kernel: IRQ  [8104cc9f] warn_slowpath_common+0x7f/0xc0
kernel: [810d7e88] ? rb_reserve__next_event+0x68/0x470
kernel: [8104ccfa] warn_slowpath_null+0x1a/0x20
kernel: [810b4e37] res_counter_uncharge_locked+0x37/0x40
 ...
==









[Devel] Re: [PATCH v2 19/29] skip memcg kmem allocations in specified code regions

2012-05-14 Thread KAMEZAWA Hiroyuki
(2012/05/12 2:44), Glauber Costa wrote:

 This patch creates a mechanism that skip memcg allocations during
 certain pieces of our core code. It basically works in the same way
 as preempt_disable()/preempt_enable(): By marking a region under
 which all allocations will be accounted to the root memcg.
 
 We need this to prevent races in early cache creation, when we
 allocate data using caches that are not necessarily created already.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com


The concept seems okay to me but...

 ---
  include/linux/sched.h |1 +
  mm/memcontrol.c   |   25 +
  2 files changed, 26 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 81a173c..0501114 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1613,6 +1613,7 @@ struct task_struct {
   unsigned long nr_pages; /* uncharged usage */
   unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
   } memcg_batch;
 + atomic_t memcg_kmem_skip_account;


If only 'current' thread touch this, you don't need to make this atomic counter.
you can use 'long'.

Thanks,
-Kame



[Devel] Re: [PATCH v2 18/29] memcg: kmem controller charge/uncharge infrastructure

2012-05-14 Thread KAMEZAWA Hiroyuki
(2012/05/12 2:44), Glauber Costa wrote:

 With all the dependencies already in place, this patch introduces
 the charge/uncharge functions for the slab cache accounting in memcg.
 
 Before we can charge a cache, we need to select the right cache.
 This is done by using the function __mem_cgroup_get_kmem_cache().
 
 If we should use the root kmem cache, this function tries to detect
 that and return as early as possible.
 
 The charge and uncharge functions comes in two flavours:
  * __mem_cgroup_(un)charge_slab(), that assumes the allocation is
a slab page, and
  * __mem_cgroup_(un)charge_kmem(), that does not. This later exists
because the slub allocator draws the larger kmalloc allocations
from the page allocator.
 
 In memcontrol.h those functions are wrapped in inline acessors.
 The idea is to later on, patch those with jump labels, so we don't
 incur any overhead when no mem cgroups are being used.
 
 Because the slub allocator tends to inline the allocations whenever
 it can, those functions need to be exported so modules can make use
 of it properly.
 
 I apologize in advance to the reviewers. This patch is quite big, but
 I was not able to split it any further due to all the dependencies
 between the code.
 
 This code is inspired by the code written by Suleiman Souhlal,
 but heavily changed.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 ---
  include/linux/memcontrol.h |   67 
  init/Kconfig   |2 +-
  mm/memcontrol.c|  379 
 +++-
  3 files changed, 446 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index f93021a..c555799 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -21,6 +21,7 @@
  #define _LINUX_MEMCONTROL_H
  #include <linux/cgroup.h>
  #include <linux/vm_event_item.h>
 +#include <linux/hardirq.h>
  
  struct mem_cgroup;
  struct page_cgroup;
 @@ -447,6 +448,19 @@ void mem_cgroup_register_cache(struct mem_cgroup *memcg,
  void mem_cgroup_release_cache(struct kmem_cache *cachep);
  extern char *mem_cgroup_cache_name(struct mem_cgroup *memcg,
  struct kmem_cache *cachep);
 +
 +void mem_cgroup_flush_cache_create_queue(void);
 +bool __mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp,
 +   size_t size);
 +void __mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size);
 +
 +bool __mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp);
 +void __mem_cgroup_free_kmem_page(struct page *page);
 +
 +struct kmem_cache *
 +__mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp);
 +
 +#define mem_cgroup_kmem_on 1
  #else
  static inline void mem_cgroup_register_cache(struct mem_cgroup *memcg,
struct kmem_cache *s)
 @@ -463,6 +477,59 @@ static inline void sock_update_memcg(struct sock *sk)
  static inline void sock_release_memcg(struct sock *sk)
  {
  }
 +
 +static inline void
 +mem_cgroup_flush_cache_create_queue(void)
 +{
 +}
 +
 +static inline void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
 +{
 +}
 +
 +#define mem_cgroup_kmem_on 0
 +#define __mem_cgroup_get_kmem_cache(a, b) a
 +#define __mem_cgroup_charge_slab(a, b, c) false
 +#define __mem_cgroup_new_kmem_page(a, gfp) false
 +#define __mem_cgroup_uncharge_slab(a, b)
 +#define __mem_cgroup_free_kmem_page(b)
  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 +static __always_inline struct kmem_cache *
 +mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
 +{
 + if (mem_cgroup_kmem_on && current->mm && !in_interrupt())
 + return __mem_cgroup_get_kmem_cache(cachep, gfp);
 + return cachep;
 +}
 +
 +static __always_inline bool
 +mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size)
 +{
 + if (mem_cgroup_kmem_on)
 + return __mem_cgroup_charge_slab(cachep, gfp, size);
 + return true;
 +}
 +
 +static __always_inline void
 +mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size)
 +{
 + if (mem_cgroup_kmem_on)
 + __mem_cgroup_uncharge_slab(cachep, size);
 +}
 +
 +static __always_inline
 +bool mem_cgroup_new_kmem_page(struct page *page, gfp_t gfp)
 +{
 + if (mem_cgroup_kmem_on && current->mm && !in_interrupt())
 + return __mem_cgroup_new_kmem_page(page, gfp);
 + return true;
 +}
 +
 +static __always_inline
 +void mem_cgroup_free_kmem_page(struct page *page)
 +{
 + if (mem_cgroup_kmem_on)
 + __mem_cgroup_free_kmem_page(page);
 +}
  #endif /* _LINUX_MEMCONTROL_H */
  
 diff --git a/init/Kconfig b/init/Kconfig
 index 72f33fa..071b7e3 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -696,7

[Devel] Re: [PATCH v2 11/29] cgroups: ability to stop res charge propagation on bounded ancestor

2012-05-14 Thread KAMEZAWA Hiroyuki
(2012/05/12 2:44), Glauber Costa wrote:

 From: Frederic Weisbecker fweis...@gmail.com
 
 Moving a task from a cgroup to another may require to substract its
 resource charge from the old cgroup and add it to the new one.
 
 For this to happen, the uncharge/charge propagation can just stop when we
 reach the common ancestor for the two cgroups.  Further the performance
 reasons, we also want to avoid to temporarily overload the common
 ancestors with a non-accurate resource counter usage if we charge first
 the new cgroup and uncharge the old one thereafter.  This is going to be a
 requirement for the coming max number of task subsystem.
 
 To solve this, provide a pair of new API that can charge/uncharge a
 resource counter until we reach a given ancestor.
 
 Signed-off-by: Frederic Weisbecker fweis...@gmail.com
 Acked-by: Paul Menage p...@paulmenage.org
 Acked-by: Glauber Costa glom...@parallels.com
 Cc: Li Zefan l...@cn.fujitsu.com
 Cc: Johannes Weiner han...@cmpxchg.org
 Cc: Aditya Kali adityak...@google.com
 Cc: Oleg Nesterov o...@redhat.com
 Cc: Kay Sievers kay.siev...@vrfy.org
 Cc: Tim Hockin thoc...@hockin.org
 Cc: Tejun Heo hte...@gmail.com
 Acked-by: Kirill A. Shutemov kir...@shutemov.name
 Signed-off-by: Andrew Morton a...@linux-foundation.org


Where is this function called in this series ?

Thanks,
-Kame



[Devel] Re: [PATCH v5 1/2] Always free struct memcg through schedule_work()

2012-05-13 Thread KAMEZAWA Hiroyuki
(2012/05/12 5:11), Glauber Costa wrote:

 Right now we free struct memcg with kfree right after a
 rcu grace period, but defer it if we need to use vfree() to get
 rid of that memory area. We do that by need, because we need vfree
 to be called in a process context.
 
 This patch unifies this behavior, by ensuring that even kfree will
 happen in a separate thread. The goal is to have a stable place to
 call the upcoming jump label destruction function outside the realm
 of the complicated and quite far-reaching cgroup lock (that can't be
 held when calling neither the cpu_hotplug.lock nor the jump_label_mutex)
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Tejun Heo t...@kernel.org
 CC: Li Zefan lize...@huawei.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Michal Hocko mho...@suse.cz


I think we'll need to revisit this, again.
for now,
Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



[Devel] Re: [PATCH v5 2/2] decrement static keys on real destroy time

2012-05-13 Thread KAMEZAWA Hiroyuki
(2012/05/12 5:11), Glauber Costa wrote:

 We call the destroy function when a cgroup starts to be removed,
 such as by a rmdir event.
 
 However, because of our reference counters, some objects are still
 inflight. Right now, we are decrementing the static_keys at destroy()
 time, meaning that if we get rid of the last static_key reference,
 some objects will still have charges, but the code to properly
 uncharge them won't be run.
 
 This becomes a problem specially if it is ever enabled again, because
 now new charges will be added to the staled charges making keeping
 it pretty much impossible.
 
 We just need to be careful with the static branch activation:
 since there is no particular preferred order of their activation,
 we need to make sure that we only start using it after all
 call sites are active. This is achieved by having a per-memcg
 flag that is only updated after static_key_slow_inc() returns.
 At this time, we are sure all sites are active.
 
 This is made per-memcg, not global, for a reason:
 it also has the effect of making socket accounting more
 consistent. The first memcg to be limited will trigger static_key()
 activation, therefore, accounting. But all the others will then be
 accounted no matter what. After this patch, only limited memcgs
 will have its sockets accounted.
 
 [v2: changed a tcp limited flag for a generic proto limited flag ]
 [v3: update the current active flag only after the static_key update ]
 [v4: disarm_static_keys() inside free_work ]
 [v5: got rid of tcp_limit_mutex, now in the static_key interface ]
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Tejun Heo t...@kernel.org
 CC: Li Zefan lize...@huawei.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Michal Hocko mho...@suse.cz


Thank you for your patient works.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

BTW, what is the relationship between 1/2 and 2/2  ?

Thanks,
-Kame






[Devel] Re: [PATCH v4 1/3] make jump_labels wait while updates are in place

2012-04-26 Thread KAMEZAWA Hiroyuki
(2012/04/27 9:43), Steven Rostedt wrote:

 On Thu, Apr 26, 2012 at 07:51:05PM -0300, Glauber Costa wrote:
 In mem cgroup, we need to guarantee that two concurrent updates
 of the jump_label interface wait for each other. IOW, we can't have
 other updates returning while the first one is still patching the
 kernel around, otherwise we'll race.
 
 But it shouldn't. The code as is should prevent that.
 

 I believe this is something that can fit well in the static branch
 API, without noticeable disadvantages:

 * in the common case, it will be a quite simple lock/unlock operation
 * Every context that calls static_branch_slow* already expects to be
   in sleeping context because it will mutex_lock the unlikely case.
 * static_key_slow_inc is not expected to be called in any fast path,
   otherwise it would be expected to have quite a different name. Therefore
   the mutex + atomic combination instead of just an atomic should not kill
   us.

 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Tejun Heo t...@kernel.org
 CC: Li Zefan lize...@huawei.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Michal Hocko mho...@suse.cz
 CC: Ingo Molnar mi...@elte.hu
 CC: Jason Baron jba...@redhat.com
 ---
  kernel/jump_label.c |   21 +++--
  1 files changed, 11 insertions(+), 10 deletions(-)

 diff --git a/kernel/jump_label.c b/kernel/jump_label.c
 index 4304919..5d09cb4 100644
 --- a/kernel/jump_label.c
 +++ b/kernel/jump_label.c
 @@ -57,17 +57,16 @@ static void jump_label_update(struct static_key *key, 
 int enable);
  
  void static_key_slow_inc(struct static_key *key)
  {
 +jump_label_lock();
  if (atomic_inc_not_zero(key->enabled))
 -		return;
 
 If key->enabled is not zero, there's nothing to be done. As the jump
 label has already been enabled. Note, the key->enabled doesn't get set
 until after the jump label is updated. Thus, if two tasks were to come
 in, they both would be locked on the jump_label_lock().
 

Ah, sorry, I misunderstood something. I'm sorry, Glauber.

-Kame



[Devel] Re: [PATCH 16/23] slab: provide kmalloc_no_account

2012-04-25 Thread KAMEZAWA Hiroyuki
(2012/04/25 23:29), Glauber Costa wrote:

 On 04/24/2012 10:44 PM, KAMEZAWA Hiroyuki wrote:
 (2012/04/23 8:53), Glauber Costa wrote:

 Some allocations need to be accounted to the root memcg regardless
 of their context. One trivial example, is the allocations we do
 during the memcg slab cache creation themselves. Strictly speaking,
 they could go to the parent, but it is way easier to bill them to
 the root cgroup.

 Only generic kmalloc allocations are allowed to be bypassed.

 The function is not exported, because drivers code should always
 be accounted.

 This code is mosly written by Suleiman Souhlal.

 Signed-off-by: Glauber Costaglom...@parallels.com
 CC: Christoph Lameterc...@linux.com
 CC: Pekka Enbergpenb...@cs.helsinki.fi
 CC: Michal Hockomho...@suse.cz
 CC: Kamezawa Hiroyukikamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weinerhan...@cmpxchg.org
 CC: Suleiman Souhlalsulei...@google.com


 Seems reasonable.
 Reviewed-by: KAMEZAWA Hiroyukikamezawa.hir...@jp.fujitsu.com

 Hmm...but can't we find the 'context' in an automatic way?

 
 Not that I can think of. Well, actually, not without adding some tests
 to the allocation path I'd rather not (like testing for the return
 address and then doing a table lookup, etc)
 
 An option would be to store it in the task_struct. So we would allocate
 as following:
 
 memcg_skip_account_start(p);
 do_a_bunch_of_allocations();
 memcg_skip_account_stop(p);
 
 The problem with that, is that it is quite easy to abuse.
 but if we don't export that to modules, it would be acceptable.
 
 Question is, given the fact that the number of kmalloc_no_account() is
 expected to be really small, is it worth it?
 

OK, but there was an idea of __GFP_NOACCOUNT; which is better?
Are you afraid that __GFP_NOACCOUNT could spread too much, rather than
kmalloc_no_account()?
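
(Just to illustrate the alternative, a rough, untested sketch; __GFP_NOACCOUNT
and the helper name below are hypothetical, not from any posted patch:)

	/*
	 * sketch: decide whether a slab allocation should be charged
	 * to the current task's memcg.
	 */
	static bool memcg_should_account(gfp_t gfp)
	{
		/* caller explicitly opted out of accounting */
		if (gfp & __GFP_NOACCOUNT)
			return false;
		/* no sane memcg context in irq or for kernel threads */
		if (in_interrupt() || !current->mm)
			return false;
		return true;
	}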


Thanks,
-Kame




[Devel] Re: [PATCH v2 5/5] decrement static keys on real destroy time

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/24 20:41), Glauber Costa wrote:

 On 04/23/2012 11:40 PM, KAMEZAWA Hiroyuki wrote:
 (2012/04/24 4:37), Glauber Costa wrote:

 We call the destroy function when a cgroup starts to be removed,
 such as by a rmdir event.

 However, because of our reference counters, some objects are still
 inflight. Right now, we are decrementing the static_keys at destroy()
 time, meaning that if we get rid of the last static_key reference,
 some objects will still have charges, but the code to properly
 uncharge them won't be run.

 This becomes a problem specially if it is ever enabled again, because
 now new charges will be added to the staled charges making keeping
 it pretty much impossible.

 We just need to be careful with the static branch activation:
 since there is no particular preferred order of their activation,
 we need to make sure that we only start using it after all
 call sites are active. This is achieved by having a per-memcg
 flag that is only updated after static_key_slow_inc() returns.
 At this time, we are sure all sites are active.

 This is made per-memcg, not global, for a reason:
 it also has the effect of making socket accounting more
 consistent. The first memcg to be limited will trigger static_key()
 activation, therefore, accounting. But all the others will then be
 accounted no matter what. After this patch, only limited memcgs
 will have its sockets accounted.

 [v2: changed a tcp limited flag for a generic proto limited flag ]
 [v3: update the current active flag only after the static_key update ]

 Signed-off-by: Glauber Costaglom...@parallels.com


 Acked-by: KAMEZAWA Hiroyukikamezawa.hir...@jp.fujitsu.com

 A small request below.

 snip


 +* -activated needs to be written after the static_key update.
 +*  This is what guarantees that the socket activation function
 +*  is the last one to run. See sock_update_memcg() for details,
 +*  and note that we don't mark any socket as belonging to this
 +*  memcg until that flag is up.
 +*
 +*  We need to do this, because static_keys will span multiple
 +*  sites, but we can't control their order. If we mark a socket
 +*  as accounted, but the accounting functions are not patched 
 in
 +*  yet, we'll lose accounting.
 +*
 +*  We never race with the readers in sock_update_memcg(), 
 because
 +*  when this value change, the code to process it is not 
 patched in
 +*  yet.
 +*/
 +   mutex_lock(tcp_set_limit_mutex);


 Could you explain for what this mutex is in above comment ?

 This is explained at the site where the mutex is defined.
 If you still want me to mention it here, or maybe expand the explanation
 there, I surely can.
 

Ah, I think it's better to mention one more complicated race.
Let me explain. 

Assume we don't have tcp_set_limit_mutex, and the jump_label is not activated yet,
i.e. memcg_socket_limit_enabled->count == 0.

When a user updates the limits of 2 cgroups at once, the following happens.

CPU A                                   CPU B

if (!cg_proto->activated)               if (!cg_proto->activated)
    static_key_inc()                        static_key_inc()
    => set counter 0->1                     => set counter 1->2, return immediately
    => hold mutex                           => cg_proto->activated = true
    => overwrite jmps.

Then, without the mutex, ->activated/->active may be set 'true' before the end
of the jump_label modification.
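
With the mutex, the update has to look like this (as in your patch), so that
->activated can only become true after the jump label patching has finished:

	mutex_lock(&tcp_set_limit_mutex);
	if (!cg_proto->activated) {
		/* patches all call sites before anyone can see ->activated */
		static_key_slow_inc(&memcg_socket_limit_enabled);
		cg_proto->activated = true;
	}
	mutex_unlock(&tcp_set_limit_mutex);
	cg_proto->active = true;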

Thanks,
-Kame






[Devel] Re: [PATCH 04/23] memcg: Make it possible to use the stock for more than one page.

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/21 6:57), Glauber Costa wrote:

 From: Suleiman Souhlal ssouh...@freebsd.org
 
 Signed-off-by: Suleiman Souhlal sulei...@google.com


OK, should work well enough.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



[Devel] Re: [PATCH 05/23] memcg: Reclaim when more than one page needed.

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/21 6:57), Glauber Costa wrote:

 From: Suleiman Souhlal ssouh...@freebsd.org
 
 mem_cgroup_do_charge() was written before slab accounting, and expects
 three cases: being called for 1 page, being called for a stock of 32 pages,
 or being called for a hugepage.  If we call for 2 pages (and several slabs
 used in process creation are such, at least with the debug options I had),
 it assumed it's being called for stock and just retried without reclaiming.
 
 Fix that by passing down a minsize argument in addition to the csize.
 
 And what to do about that (csize == PAGE_SIZE  ret) retry?  If it's
 needed at all (and presumably is since it's there, perhaps to handle
 races), then it should be extended to more than PAGE_SIZE, yet how far?


IIRC, it was for preventing rapid OOM kill and reducing latency.

 And should there be a retry count limit, of what?  For now retry up to
 COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
 and make sure not to do it if __GFP_NORETRY.
 
 Signed-off-by: Suleiman Souhlal sulei...@google.com


Hmm, maybe ok.

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


 ---
  mm/memcontrol.c |   18 +++---
  1 files changed, 11 insertions(+), 7 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 4b94b2d..cbffc4c 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -2187,7 +2187,8 @@ enum {
  };
  
  static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 - unsigned int nr_pages, bool oom_check)
 + unsigned int nr_pages, unsigned int min_pages,
 + bool oom_check)
  {
   unsigned long csize = nr_pages * PAGE_SIZE;
   struct mem_cgroup *mem_over_limit;
 @@ -2210,18 +2211,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup 
 *memcg, gfp_t gfp_mask,
   } else
   mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
   /*
 -  * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
 -  * of regular pages (CHARGE_BATCH), or a single regular page (1).
 -  *
* Never reclaim on behalf of optional batching, retry with a
* single page instead.
*/
 - if (nr_pages == CHARGE_BATCH)
 + if (nr_pages  min_pages)
   return CHARGE_RETRY;
  
   if (!(gfp_mask  __GFP_WAIT))
   return CHARGE_WOULDBLOCK;
  
 + if (gfp_mask  __GFP_NORETRY)
 + return CHARGE_NOMEM;
 +
   ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
   if (mem_cgroup_margin(mem_over_limit) = nr_pages)
   return CHARGE_RETRY;
 @@ -2234,8 +2235,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup 
 *memcg, gfp_t gfp_mask,
* unlikely to succeed so close to the limit, and we fall back
* to regular pages anyway in case of failure.
*/
 - if (nr_pages == 1  ret)
 + if (nr_pages = (PAGE_SIZE  PAGE_ALLOC_COSTLY_ORDER)  ret) {
 + cond_resched();
   return CHARGE_RETRY;
 + }
  
   /*
* At task move, charge accounts can be doubly counted. So, it's
 @@ -2369,7 +2372,8 @@ again:
   nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
   }
  
 - ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
 + ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
 + oom_check);
   switch (ret) {
   case CHARGE_OK:
   break;





[Devel] Re: [PATCH 07/23] change defines to an enum

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/21 6:57), Glauber Costa wrote:

 This is just a cleanup patch for clarity of expression.
 In earlier submissions, people asked it to be in a separate
 patch, so here it is.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org


Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com




[Devel] Re: [PATCH 08/23] don't force return value checking in res_counter_charge_nofail

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/21 6:57), Glauber Costa wrote:

 Since we will succeed with the allocation no matter what, there
 isn't the need to use __must_check with it. It can very well
 be optional.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Michal Hocko mho...@suse.cz


Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com




[Devel] Re: [PATCH 09/23] kmem slab accounting basic infrastructure

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/21 6:57), Glauber Costa wrote:

 This patch adds the basic infrastructure for the accounting of the slab
 caches. To control that, the following files are created:
 
  * memory.kmem.usage_in_bytes
  * memory.kmem.limit_in_bytes
  * memory.kmem.failcnt
  * memory.kmem.max_usage_in_bytes
 
 They have the same meaning of their user memory counterparts. They reflect
 the state of the kmem res_counter.
 
 The code is not enabled until a limit is set. This can be tested by the flag
 kmem_accounted. This means that after the patch is applied, no behavioral
 changes exists for whoever is still using memcg to control their memory usage.
 

Hmm, res_counter never goes negative?

 We always account to both user and kernel resource_counters. This effectively
 means that an independent kernel limit is in place when the limit is set
 to a lower value than the user memory. A equal or higher value means that the
 user limit will always hit first, meaning that kmem is effectively unlimited.
 
 People who want to track kernel memory but not limit it, can set this limit
 to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit,
 or equal to the user memory)
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org


The code itself seems fine.

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



[Devel] Re: [PATCH 11/23] slub: consider a memcg parameter in kmem_create_cache

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/21 6:57), Glauber Costa wrote:

 Allow a memcg parameter to be passed during cache creation.
 The slub allocator will only merge caches that belong to
 the same memcg.
 
 Default function is created as a wrapper, passing NULL
 to the memcg version. We only merge caches that belong
 to the same memcg.
 
From the memcontrol.c side, 3 helper functions are created:
 
  1) memcg_css_id: because slub needs a unique cache name
 for sysfs. Since this is visible, but not the canonical
 location for slab data, the cache name is not used, the
 css_id should suffice.
 
  2) mem_cgroup_register_cache: is responsible for assigning
 a unique index to each cache, and other general purpose
 setup. The index is only assigned for the root caches. All
 others are assigned index == -1.
 
  3) mem_cgroup_release_cache: can be called from the root cache
 destruction, and will release the index for other caches.
 
 This index mechanism was developed by Suleiman Souhlal.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 ---
  include/linux/memcontrol.h |   14 ++
  include/linux/slab.h   |6 ++
  mm/memcontrol.c|   29 +
  mm/slub.c  |   31 +++
  4 files changed, 76 insertions(+), 4 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index f94efd2..99e14b9 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -26,6 +26,7 @@ struct mem_cgroup;
  struct page_cgroup;
  struct page;
  struct mm_struct;
 +struct kmem_cache;
  
  /* Stats that can be updated by kernel. */
  enum mem_cgroup_page_stat_item {
 @@ -440,7 +441,20 @@ struct sock;
  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
  void sock_update_memcg(struct sock *sk);
  void sock_release_memcg(struct sock *sk);
 +int memcg_css_id(struct mem_cgroup *memcg);
 +void mem_cgroup_register_cache(struct mem_cgroup *memcg,
 +   struct kmem_cache *s);
 +void mem_cgroup_release_cache(struct kmem_cache *cachep);
  #else
 +static inline void mem_cgroup_register_cache(struct mem_cgroup *memcg,
 +  struct kmem_cache *s)
 +{
 +}
 +
 +static inline void mem_cgroup_release_cache(struct kmem_cache *cachep)
 +{
 +}
 +
  static inline void sock_update_memcg(struct sock *sk)
  {
  }
 diff --git a/include/linux/slab.h b/include/linux/slab.h
 index a5127e1..c7a7e05 100644
 --- a/include/linux/slab.h
 +++ b/include/linux/slab.h
 @@ -321,6 +321,12 @@ extern void *__kmalloc_track_caller(size_t, gfp_t, 
 unsigned long);
   __kmalloc(size, flags)
  #endif /* DEBUG_SLAB */
  
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +#define MAX_KMEM_CACHE_TYPES 400
 +#else
 +#define MAX_KMEM_CACHE_TYPES 0
 +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 +


why 400 ?


  #ifdef CONFIG_NUMA
  /*
   * kmalloc_node_track_caller is a special version of kmalloc_node that
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 36f1e6b..0015ed0 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -323,6 +323,11 @@ struct mem_cgroup {
  #endif
  };
  
 +int memcg_css_id(struct mem_cgroup *memcg)
 +{
 +	return css_id(&memcg->css);
 +}
 +
  /* Stuffs for move charges at task migration. */
  /*
   * Types of charges to be moved. move_charge_at_immitgrate is treated as a
 @@ -461,6 +466,30 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup 
 *memcg)
  }
  EXPORT_SYMBOL(tcp_proto_cgroup);
  #endif /* CONFIG_INET */
 +
 +/* Bitmap used for allocating the cache id numbers. */
 +static DECLARE_BITMAP(cache_types, MAX_KMEM_CACHE_TYPES);
 +
 +void mem_cgroup_register_cache(struct mem_cgroup *memcg,
 +struct kmem_cache *cachep)
 +{
 + int id = -1;
 +
 +	cachep->memcg_params.memcg = memcg;
 +
 + if (!memcg) {
 + id = find_first_zero_bit(cache_types, MAX_KMEM_CACHE_TYPES);
 +		BUG_ON(id < 0 || id >= MAX_KMEM_CACHE_TYPES);
 + __set_bit(id, cache_types);


No lock here? You need find_first_zero_bit_and_set_atomic() or something like it.
Rather than that, I think you can use lib/idr.c::ida_simple_get().
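
For example, an untested sketch (the helper names here are just for illustration):

	#include <linux/idr.h>

	static DEFINE_IDA(kmem_cache_ida);

	/* allocate a unique id for a root cache; ida does its own locking */
	static int memcg_alloc_cache_id(void)
	{
		return ida_simple_get(&kmem_cache_ida, 0,
				      MAX_KMEM_CACHE_TYPES, GFP_KERNEL);
	}

	static void memcg_free_cache_id(int id)
	{
		if (id >= 0)
			ida_simple_remove(&kmem_cache_ida, id);
	}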

 + } else
 +		INIT_LIST_HEAD(&cachep->memcg_params.destroyed_list);
 +	cachep->memcg_params.id = id;
 +}
 +
 +void mem_cgroup_release_cache(struct kmem_cache *cachep)
 +{
 +	__clear_bit(cachep->memcg_params.id, cache_types);
 +}
  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
  
  static void drain_all_stock_async(struct mem_cgroup *memcg);
 diff --git a/mm/slub.c b/mm/slub.c
 index 2652e7c..86e40cc 100644
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -32,6 +32,7 @@
  #include linux/prefetch.h
  
  #include trace/events/kmem.h
 +#include linux

[Devel] Re: [PATCH 16/23] slab: provide kmalloc_no_account

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/23 8:53), Glauber Costa wrote:

 Some allocations need to be accounted to the root memcg regardless
 of their context. One trivial example, is the allocations we do
 during the memcg slab cache creation themselves. Strictly speaking,
 they could go to the parent, but it is way easier to bill them to
 the root cgroup.
 
 Only generic kmalloc allocations are allowed to be bypassed.
 
 The function is not exported, because drivers code should always
 be accounted.
 
 This code is mosly written by Suleiman Souhlal.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com


Seems reasonable.
Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Hmm...but can't we find the 'context' in an automatic way?

-Kame

 ---
  include/linux/slab_def.h |1 +
  mm/slab.c|   23 +++
  2 files changed, 24 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
 index 06e4a3e..54d25d7 100644
 --- a/include/linux/slab_def.h
 +++ b/include/linux/slab_def.h
 @@ -114,6 +114,7 @@ extern struct cache_sizes malloc_sizes[];
  
  void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
  void *__kmalloc(size_t size, gfp_t flags);
 +void *kmalloc_no_account(size_t size, gfp_t flags);
  
  #ifdef CONFIG_TRACING
  extern void *kmem_cache_alloc_trace(size_t size,
 diff --git a/mm/slab.c b/mm/slab.c
 index c4ef684..13948c3 100644
 --- a/mm/slab.c
 +++ b/mm/slab.c
 @@ -3960,6 +3960,29 @@ void *__kmalloc(size_t size, gfp_t flags)
  }
  EXPORT_SYMBOL(__kmalloc);
  
 +static __always_inline void *__do_kmalloc_no_account(size_t size, gfp_t 
 flags,
 +  void *caller)
 +{
 + struct kmem_cache *cachep;
 + void *ret;
 +
 + cachep = __find_general_cachep(size, flags);
 + if (unlikely(ZERO_OR_NULL_PTR(cachep)))
 + return cachep;
 +
 + ret = __cache_alloc(cachep, flags, caller);
 + trace_kmalloc((unsigned long)caller, ret, size,
 +   cachep-buffer_size, flags);
 +
 + return ret;
 +}
 +
 +void *kmalloc_no_account(size_t size, gfp_t flags)
 +{
 + return __do_kmalloc_no_account(size, flags,
 +__builtin_return_address(0));
 +}
 +
  void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
  {
   return __do_kmalloc(size, flags, (void *)caller);





[Devel] Re: [PATCH 17/23] kmem controller charge/uncharge infrastructure

2012-04-24 Thread KAMEZAWA Hiroyuki
(2012/04/24 23:22), Frederic Weisbecker wrote:

 On Mon, Apr 23, 2012 at 03:25:59PM -0700, David Rientjes wrote:
 On Sun, 22 Apr 2012, Glauber Costa wrote:

 +/*
 + * Return the kmem_cache we're supposed to use for a slab allocation.
 + * If we are in interrupt context or otherwise have an allocation that
 + * can't fail, we return the original cache.
 + * Otherwise, we will try to use the current memcg's version of the cache.
 + *
 + * If the cache does not exist yet, if we are the first user of it,
 + * we either create it immediately, if possible, or create it 
 asynchronously
 + * in a workqueue.
 + * In the latter case, we will let the current allocation go through with
 + * the original cache.
 + *
 + * This function returns with rcu_read_lock() held.
 + */
 +struct kmem_cache *__mem_cgroup_get_kmem_cache(struct kmem_cache *cachep,
 +gfp_t gfp)
 +{
 +   struct mem_cgroup *memcg;
 +   int idx;
 +
 +   gfp |=  cachep-allocflags;
 +
 +   if ((current-mm == NULL))
 +   return cachep;
 +
 +   if (cachep-memcg_params.memcg)
 +   return cachep;
 +
 +   idx = cachep-memcg_params.id;
 +   VM_BUG_ON(idx == -1);
 +
 +   memcg = mem_cgroup_from_task(current);
 +   if (!mem_cgroup_kmem_enabled(memcg))
 +   return cachep;
 +
 +   if (rcu_access_pointer(memcg-slabs[idx]) == NULL) {
 +   memcg_create_cache_enqueue(memcg, cachep);
 +   return cachep;
 +   }
 +
 +   return rcu_dereference(memcg-slabs[idx]);
 +}
 +EXPORT_SYMBOL(__mem_cgroup_get_kmem_cache);
 +
 +void mem_cgroup_remove_child_kmem_cache(struct kmem_cache *cachep, int id)
 +{
 +   rcu_assign_pointer(cachep-memcg_params.memcg-slabs[id], NULL);
 +}
 +
 +bool __mem_cgroup_charge_kmem(gfp_t gfp, size_t size)
 +{
 +   struct mem_cgroup *memcg;
 +   bool ret = true;
 +
 +   rcu_read_lock();
 +   memcg = mem_cgroup_from_task(current);

 This seems horribly inconsistent with memcg charging of user memory since 
 it charges to p-mm-owner and you're charging to p.  So a thread attached 
 to a memcg can charge user memory to one memcg while charging slab to 
 another memcg?
 
 Charging to the thread rather than the process seem to me the right behaviour:
 you can have two threads of a same process attached to different cgroups.
 
 Perhaps it is the user memory memcg that needs to be fixed?
 

There is a problem with OOM-kill.
To free memory by killing a process, its 'mm' must be released by the kill.
So the OOM killer just finds a leader of the process.

Assume a process X consists of threads A and B, and A is the thread-group leader.

Put thread A into cgroup/Gold and
thread B into cgroup/Silver.

If we do accounting based on threads, we can't do anything at OOM in
cgroup/Silver.
The idea of 'killing thread A to kill thread B' breaks isolation.

As far as resources used by a process go, I think accounting should be done per
process.
It's not tied to a thread.

About kmem: if we count task_struct, page tables, etc., which can be freed by
the OOM killer (i.e. memory allocated for the 'process'), we should be aware of the OOM problem.
Using mm->owner makes sense to me until someone finds a better idea for handling
the OOM situation than task killing.

Thanks,
-Kame



[Devel] Re: [PATCH v2 1/5] don't attach a task to a dead cgroup

2012-04-23 Thread KAMEZAWA Hiroyuki
(2012/04/24 4:37), Glauber Costa wrote:

 Not all external callers of cgroup_attach_task() test to
 see if the cgroup is still live - the internal callers at
 cgroup.c does.
 
 With this test in cgroup_attach_task, we can assure that
 no tasks are ever moved to a cgroup that is past its
 destruction point and was already marked as dead.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Tejun Heo t...@kernel.org
 CC: Li Zefan lize...@huawei.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com


Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


 ---
  kernel/cgroup.c |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)
 
 diff --git a/kernel/cgroup.c b/kernel/cgroup.c
 index b61b938..932c318 100644
 --- a/kernel/cgroup.c
 +++ b/kernel/cgroup.c
 @@ -1927,6 +1927,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct 
 task_struct *tsk)
   struct cgroup_taskset tset = { };
   struct css_set *newcg;
  
 + if (cgroup_is_removed(cgrp))
 + return -ENODEV;
 +
   /* @tsk either already exited or can't exit until the end */
   if (tsk-flags  PF_EXITING)
   return -ESRCH;





[Devel] Re: [PATCH v2 3/5] change number_of_cpusets to an atomic

2012-04-23 Thread KAMEZAWA Hiroyuki
(2012/04/24 4:37), Glauber Costa wrote:

 This will allow us to call destroy() without holding the
 cgroup_mutex(). Other important updates inside update_flags()
 are protected by the callback_mutex.
 
 We could protect this variable with the callback_mutex as well,
 as suggested by Li Zefan, but we need to make sure we are protected
 by that mutex at all times, and some of its updates happen inside the
 cgroup_mutex - which means we would deadlock.
 
 An atomic variable is not expensive, since it is seldom updated,
 and protect us well.
 
 Signed-off-by: Glauber Costa glom...@parallels.com


Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



[Devel] Re: [PATCH v2 4/5] don't take cgroup_mutex in destroy()

2012-04-23 Thread KAMEZAWA Hiroyuki
(2012/04/24 4:37), Glauber Costa wrote:

 Most of the destroy functions are only doing very simple things
 like freeing memory.
 
 The ones who goes through lists and such, already use its own
 locking for those.
 
 * The cgroup itself won't go away until we free it, (after destroy)
 * The parent won't go away because we hold a reference count
 * There are no more tasks in the cgroup, and the cgroup is declared
   dead (cgroup_is_removed() == true)
 
 [v2: don't cgroup_lock the freezer and blkcg ]
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Tejun Heo t...@kernel.org
 CC: Li Zefan lize...@huawei.com
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Vivek Goyal vgo...@redhat.com
 ---
  kernel/cgroup.c |9 -
  1 files changed, 4 insertions(+), 5 deletions(-)
 
 diff --git a/kernel/cgroup.c b/kernel/cgroup.c
 index 932c318..976d332 100644
 --- a/kernel/cgroup.c
 +++ b/kernel/cgroup.c
 @@ -869,13 +869,13 @@ static void cgroup_diput(struct dentry *dentry, struct 
 inode *inode)
* agent */
   synchronize_rcu();
  
 - mutex_lock(cgroup_mutex);
   /*
* Release the subsystem state objects.
*/
   for_each_subsys(cgrp-root, ss)
   ss-destroy(cgrp);
  
 + mutex_lock(cgroup_mutex);
   cgrp-root-number_of_cgroups--;
   mutex_unlock(cgroup_mutex);
  
 @@ -3994,13 +3994,12 @@ static long cgroup_create(struct cgroup *parent, 
 struct dentry *dentry,
  
   err_destroy:
  
 + mutex_unlock(cgroup_mutex);
   for_each_subsys(root, ss) {
   if (cgrp-subsys[ss-subsys_id])
   ss-destroy(cgrp);
   }
  
 - mutex_unlock(cgroup_mutex);
 -
   /* Release the reference count that we took on the superblock */
   deactivate_super(sb);
  
 @@ -4349,9 +4348,9 @@ int __init_or_module cgroup_load_subsys(struct 
 cgroup_subsys *ss)
   int ret = cgroup_init_idr(ss, css);
   if (ret) {
   dummytop-subsys[ss-subsys_id] = NULL;
 + mutex_unlock(cgroup_mutex);
   ss-destroy(dummytop);
   subsys[i] = NULL;
 - mutex_unlock(cgroup_mutex);
   return ret;
   }
   }
 @@ -4447,10 +4446,10 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss)
* pointer to find their state. note that this also takes care of
* freeing the css_id.
*/
 + mutex_unlock(cgroup_mutex);
   ss-destroy(dummytop);
   dummytop-subsys[ss-subsys_id] = NULL;
  

I'm not fully sure, but can the dummytop->subsys[] update be done without locking?

Thanks,
-Kame




[Devel] Re: [PATCH v2 5/5] decrement static keys on real destroy time

2012-04-23 Thread KAMEZAWA Hiroyuki
(2012/04/24 4:37), Glauber Costa wrote:

 We call the destroy function when a cgroup starts to be removed,
 such as by a rmdir event.
 
 However, because of our reference counters, some objects are still
 inflight. Right now, we are decrementing the static_keys at destroy()
 time, meaning that if we get rid of the last static_key reference,
 some objects will still have charges, but the code to properly
 uncharge them won't be run.
 
 This becomes a problem specially if it is ever enabled again, because
 now new charges will be added to the staled charges making keeping
 it pretty much impossible.
 
 We just need to be careful with the static branch activation:
 since there is no particular preferred order of their activation,
 we need to make sure that we only start using it after all
 call sites are active. This is achieved by having a per-memcg
 flag that is only updated after static_key_slow_inc() returns.
 At this time, we are sure all sites are active.
 
 This is made per-memcg, not global, for a reason:
 it also has the effect of making socket accounting more
 consistent. The first memcg to be limited will trigger static_key()
 activation, therefore, accounting. But all the others will then be
 accounted no matter what. After this patch, only limited memcgs
 will have its sockets accounted.
 
 [v2: changed a tcp limited flag for a generic proto limited flag ]
 [v3: update the current active flag only after the static_key update ]
 
 Signed-off-by: Glauber Costa glom...@parallels.com


Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

A small request below.

snip


 +  * -activated needs to be written after the static_key update.
 +  *  This is what guarantees that the socket activation function
 +  *  is the last one to run. See sock_update_memcg() for details,
 +  *  and note that we don't mark any socket as belonging to this
 +  *  memcg until that flag is up.
 +  *
 +  *  We need to do this, because static_keys will span multiple
 +  *  sites, but we can't control their order. If we mark a socket
 +  *  as accounted, but the accounting functions are not patched 
 in
 +  *  yet, we'll lose accounting.
 +  *
 +  *  We never race with the readers in sock_update_memcg(), 
 because
 +  *  when this value change, the code to process it is not 
 patched in
 +  *  yet.
 +  */
 + mutex_lock(tcp_set_limit_mutex);


Could you explain for what this mutex is in above comment ?

Thanks,
-Kame

 +	if (!cg_proto->activated) {
 +		static_key_slow_inc(&memcg_socket_limit_enabled);
 +		cg_proto->activated = true;
 +	}
 +	mutex_unlock(&tcp_set_limit_mutex);
 +	cg_proto->active = true;
 + }
  
   return 0;
  }





[Devel] Re: [PATCH 3/3] decrement static keys on real destroy time

2012-04-20 Thread KAMEZAWA Hiroyuki
(2012/04/20 7:49), Glauber Costa wrote:

 We call the destroy function when a cgroup starts to be removed,
 such as by a rmdir event.
 
 However, because of our reference counters, some objects are still
 inflight. Right now, we are decrementing the static_keys at destroy()
 time, meaning that if we get rid of the last static_key reference,
 some objects will still have charges, but the code to properly
 uncharge them won't be run.
 
 This becomes a problem specially if it is ever enabled again, because
 now new charges will be added to the staled charges making keeping
 it pretty much impossible.
 
 We just need to be careful with the static branch activation:
 since there is no particular preferred order of their activation,
 we need to make sure that we only start using it after all
 call sites are active. This is achieved by having a per-memcg
 flag that is only updated after static_key_slow_inc() returns.
 At this time, we are sure all sites are active.
 
 This is made per-memcg, not global, for a reason:
 it also has the effect of making socket accounting more
 consistent. The first memcg to be limited will trigger static_key()
 activation, therefore, accounting. But all the others will then be
 accounted no matter what. After this patch, only limited memcgs
 will have its sockets accounted.
 
 [v2: changed a tcp limited flag for a generic proto limited flag ]
 
 Signed-off-by: Glauber Costa glom...@parallels.com

 ---
  include/net/sock.h|9 +++
  mm/memcontrol.c   |   20 +++-
  net/ipv4/tcp_memcontrol.c |   52 ++--
  3 files changed, 72 insertions(+), 9 deletions(-)
 
 diff --git a/include/net/sock.h b/include/net/sock.h
 index b3ebe6b..c5a2010 100644
 --- a/include/net/sock.h
 +++ b/include/net/sock.h
 @@ -914,6 +914,15 @@ struct cg_proto {
   int *memory_pressure;
   long*sysctl_mem;
   /*
 +  * active means it is currently active, and new sockets should
 +  * be assigned to cgroups.
 +  *
 +  * activated means it was ever activated, and we need to
 +  * disarm the static keys on destruction
 +  */
 + boolactivated;
 + boolactive; 
 + /*
* memcg field is used to find which memcg we belong directly
* Each memcg struct can hold more than one cg_proto, so container_of
* won't really cut.
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 7832b4d..01d25a0 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -404,6 +404,7 @@ void sock_update_memcg(struct sock *sk)
  {
   if (mem_cgroup_sockets_enabled) {
   struct mem_cgroup *memcg;
 + struct cg_proto *cg_proto;
  
   BUG_ON(!sk-sk_prot-proto_cgroup);
  
 @@ -423,9 +424,10 @@ void sock_update_memcg(struct sock *sk)
  
   rcu_read_lock();
   memcg = mem_cgroup_from_task(current);
 - if (!mem_cgroup_is_root(memcg)) {
 +	cg_proto = sk->sk_prot->proto_cgroup(memcg);
 +	if (!mem_cgroup_is_root(memcg) && cg_proto->active) {



 		mem_cgroup_get(memcg);
 -		sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
 +		sk->sk_cgrp = cg_proto;
   }



Is this correct? cg_proto->active can become true before all jump_labels are
patched, and then we can lose accounting. That will cause an underflow of
the res_counter.

cg_proto->active should be set after the jump_label modification.
Then things will work, I guess.

Thanks,
-Kame




   rcu_read_unlock();
   }
 @@ -442,6 +444,14 @@ void sock_release_memcg(struct sock *sk)
   }
  }
  
 +static void disarm_static_keys(struct mem_cgroup *memcg)
 +{
 +#ifdef CONFIG_INET
 + if (memcg-tcp_mem.cg_proto.activated)
 + static_key_slow_dec(memcg_socket_limit_enabled);
 +#endif
 +}
 +
  #ifdef CONFIG_INET
  struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
  {
 @@ -452,6 +462,11 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup 
 *memcg)
  }
  EXPORT_SYMBOL(tcp_proto_cgroup);
  #endif /* CONFIG_INET */
 +#else
 +static inline void disarm_static_keys(struct mem_cgroup *memcg)
 +{
 +}
 +
  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
  
  static void drain_all_stock_async(struct mem_cgroup *memcg);
 @@ -4883,6 +4898,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, 
 int count)
  {
   if (atomic_sub_and_test(count, memcg-refcnt)) {
   struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 + disarm_static_keys(memcg);
   __mem_cgroup_free(memcg);
   if (parent)
   mem_cgroup_put(parent);
 diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
 index 1517037..d02573a 100644
 --- a/net/ipv4/tcp_memcontrol.c
 +++ b/net/ipv4/tcp_memcontrol.c
 @@ -54,6 +54,8 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct 
 

[Devel] Re: [RFC 5/7] use percpu_counters for res_counter usage

2012-04-08 Thread KAMEZAWA Hiroyuki
(2012/03/30 22:53), Glauber Costa wrote:

 On 03/30/2012 11:58 AM, KAMEZAWA Hiroyuki wrote:
 ==

 Now, as we consume 'reserved' usage, we can avoid css_get(), a heavy atomic
 op. You may need to move this code as

  rcu_read_lock()
  
  res_counter_charge()
  if (failure) {
  css_tryget()
  rcu_read_unlock()
  } else {
  rcu_read_unlock()
  return success;
  }

 to compare performance. This css_get() affects performance very very much.
 
 thanks for the tip.
 
 But one thing:
 
 To be sure: it effectively mean that we are drawing from a dead memcg
 (because we pre-allocated, right?

Cached stock is consumed by the current task. It blocks removal of memcg.
It's not dead.

Thanks,
-Kame

 





[Devel] Re: [RFC 0/7] Initial proposal for faster res_counter updates

2012-03-30 Thread KAMEZAWA Hiroyuki
(2012/03/30 17:04), Glauber Costa wrote:

 Hi,
 
 Here is my take about how we can make res_counter updates faster.
 Keep in mind this is a bit of a hack intended as a proof of concept.
 
 The pros I see with this:
 
 * free updates in non-constrained paths. non-constrained paths includes
   unlimited scenarios, but also ones in which we are far from the limit.
 
 * No need to have a special cache mechanism in memcg. The problem with
   the caching is my opinion, is that we will forward-account pages, meaning
   that we'll consider accounted pages we never used. I am not sure
   anyone actually ran into this, but in theory, this can fire events
   much earlier than it should.
 


Note: assume a big system which has many cpus, and the user wants to divide
the system into containers. Currently, memcg's percpu caching is done
only while a task in the memcg is running on a cpu. So it's not as dangerous
as it looks.

But yes, if we can drop memcg's code, it's good. Then, we can remove some
amount of codes.

 But the cons:
 
 * percpu counters have signed quantities, so this would limit us 4G.
   We can add a shift and then count pages instead of bytes, but we
   are still in the 16T area here. Maybe we really need more than that.
 


struct percpu_counter {
raw_spinlock_t lock;
s64 count;

s64 limits us to 4G?


 * some of the additions here may slow down the percpu_counters for
   users that don't care about our usage. Things about min/max tracking
   enter in this category.
 


I think it's not very good to increase the size of percpu_counter. It's already
quite big... Hm. How about

struct percpu_counter_lazy {
	struct percpu_counter pcp;
	/* extra information */
	s64 margin;
};
?

 * growth of the percpu memory.



This may be a concern.

I'll look into patches.

Thanks,
-Kame

 
 It is still not clear for me if we should use percpu_counters as this
 patch implies, or if we should just replicate its functionality.
 
 I need to go through at least one more full round of auditing before
 making sure the locking is safe, specially my use of synchronize_rcu().
 
 As for measurements, the cache we have in memcg kind of distort things.
 I need to either disable it, or find the cases in which it is likely
 to lose and benchmark them, such as deep hierarchy concurrent updates
 with common parents.
 
 I also included a possible optimization that can be done when we
 are close to the limit to avoid the initial tests altogether, but
 it needs to be extended to avoid scanning the percpu areas as well.
 
 In summary, if this is to be carried forward, it definitely needs
 some love. It should be, however, more than enough to make the
 proposal clear.
 
 Comments are appreciated.
 
 Glauber Costa (7):
   split percpu_counter_sum
   consolidate all res_counter manipulation
   bundle a percpu counter into res_counters and use its lock
   move res_counter_set limit to res_counter.c
   use percpu_counters for res_counter usage
   Add min and max statistics to percpu_counter
   Global optimization
 
  include/linux/percpu_counter.h |3 +
  include/linux/res_counter.h|   63 ++---
  kernel/res_counter.c   |  151 
 +---
  lib/percpu_counter.c   |   16 -
  4 files changed, 151 insertions(+), 82 deletions(-)
 





[Devel] Re: [RFC 5/7] use percpu_counters for res_counter usage

2012-03-30 Thread KAMEZAWA Hiroyuki
(2012/03/30 17:04), Glauber Costa wrote:

 This is the bulk of the proposal.
 Updates to the res_counter are done to the percpu area, if we are
 inside what we can call the safe zone.
 
 The safe zone is whenever we are far enough from the limit to be
 sure this update won't touch it. It is bigger the bigger the system
 is, since it grows with the number of cpus.
 
 However, for unlimited scenarios, this will always be the case.
 In those situations we are sure to never be close to the limit simply
 because the limit is high enough.
 
 Small consumers will also be safe. This includes workloads that
 pin and unpin memory often, but never grow the total size of memory
 by too much.
 
 The memory reported (reads of RES_USAGE) in this way is actually
 more precise than we currently have (Actually would be, if we
 would disable the memcg caches): I am using percpu_counter_sum(),
 meaning the cpu areas will be scanned and accumulated.
 
 percpu_counter_read() can also be used for reading RES_USAGE.
 We could then be off by a factor of batch_size * #cpus. I consider
 this to be not worse than the current situation with the memcg caches.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 ---
  include/linux/res_counter.h |   15 ++
  kernel/res_counter.c|   61 
 ---
  2 files changed, 60 insertions(+), 16 deletions(-)
 
 diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
 index 53b271c..8c1c20e 100644
 --- a/include/linux/res_counter.h
 +++ b/include/linux/res_counter.h
 @@ -25,7 +25,6 @@ struct res_counter {
   /*
* the current resource consumption level
*/
 - unsigned long long usage;
   struct percpu_counter usage_pcp;
   /*
* the maximal value of the usage from the counter creation
 @@ -138,10 +137,12 @@ static inline unsigned long long 
 res_counter_margin(struct res_counter *cnt)
  {
   unsigned long long margin;
   unsigned long flags;
 + u64 usage;
  
   raw_spin_lock_irqsave(cnt-usage_pcp.lock, flags);
 - if (cnt-limit  cnt-usage)
 - margin = cnt-limit - cnt-usage;
 + usage = __percpu_counter_sum_locked(cnt-usage_pcp);
 + if (cnt-limit  usage)
 + margin = cnt-limit - usage;
   else
   margin = 0;
   raw_spin_unlock_irqrestore(cnt-usage_pcp.lock, flags);
 @@ -160,12 +161,14 @@ res_counter_soft_limit_excess(struct res_counter *cnt)
  {
   unsigned long long excess;
   unsigned long flags;
 + u64 usage;
  
   raw_spin_lock_irqsave(cnt-usage_pcp.lock, flags);
 - if (cnt-usage = cnt-soft_limit)
 + usage = __percpu_counter_sum_locked(cnt-usage_pcp);
 + if (usage = cnt-soft_limit)
   excess = 0;
   else
 - excess = cnt-usage - cnt-soft_limit;
 + excess = usage - cnt-soft_limit;
   raw_spin_unlock_irqrestore(cnt-usage_pcp.lock, flags);
   return excess;
  }
 @@ -175,7 +178,7 @@ static inline void res_counter_reset_max(struct 
 res_counter *cnt)
   unsigned long flags;
  
   raw_spin_lock_irqsave(cnt-usage_pcp.lock, flags);
 - cnt-max_usage = cnt-usage;
 + cnt-max_usage = __percpu_counter_sum_locked(cnt-usage_pcp);
   raw_spin_unlock_irqrestore(cnt-usage_pcp.lock, flags);
  }
  
 diff --git a/kernel/res_counter.c b/kernel/res_counter.c
 index 052efaf..8a99943 100644
 --- a/kernel/res_counter.c
 +++ b/kernel/res_counter.c
 @@ -28,9 +28,28 @@ int __res_counter_add(struct res_counter *c, long val, 
 bool fail)
   int ret = 0;
   u64 usage;
  
 + rcu_read_lock();
 +


Hmm... isn't it better to synchronize the percpu usage into the main counter
by smp_call_function() or similar at set-limit time, after setting 'global' mode?

	set global mode
	smp_call_function(drain all pcp counters to the main counter)
	set limit
	unset global mode
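
Rough idea, untested; the drain helper below is hypothetical:

	/* fold each cpu's delta into the central count while in 'global' mode */
	static void res_counter_drain_pcp(void *info)
	{
		struct res_counter *c = info;
		unsigned long flags;
		s32 *pcount;

		raw_spin_lock_irqsave(&c->usage_pcp.lock, flags);
		pcount = this_cpu_ptr(c->usage_pcp.counters);
		c->usage_pcp.count += *pcount;
		*pcount = 0;
		raw_spin_unlock_irqrestore(&c->usage_pcp.lock, flags);
	}

	/* in res_counter_set_limit(), after switching to global mode */
	on_each_cpu(res_counter_drain_pcp, cnt, 1);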

 +	if (val < 0) {
 +		percpu_counter_add(&c->usage_pcp, val);
 + rcu_read_unlock();
 + return 0;
 + }


Memo:
memcg's uncharge path is batched, so it will be bigger than
percpu_counter_batch in most cases. (And lock contention is low enough.)


 +
 +	usage = percpu_counter_read(&c->usage_pcp);
 +
 +	if (percpu_counter_read(&c->usage_pcp) + val <
 +	    (c->limit + num_online_cpus() * percpu_counter_batch)) {


Shouldn't this be c->limit - num_online_cpus() * percpu_counter_batch ?

Anyway, you can pre-calculate this value at cpu hotplug events.
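
Something like this (sketch; 'safe_limit' is a hypothetical cached field):

	/* refresh the cached fast-path threshold on cpu hotplug */
	static void res_counter_update_safe_limit(struct res_counter *c)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&c->usage_pcp.lock, flags);
		/* clamp to 0 if the limit is smaller than the slack */
		c->safe_limit = c->limit -
			num_online_cpus() * percpu_counter_batch;
		raw_spin_unlock_irqrestore(&c->usage_pcp.lock, flags);
	}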


 + percpu_counter_add(c-usage_pcp, val);
 + rcu_read_unlock();
 + return 0;
 + }
 +
 + rcu_read_unlock();
 +
 	raw_spin_lock(&c->usage_pcp.lock);
 
 -	usage = c->usage;
 +	usage = __percpu_counter_sum_locked(&c->usage_pcp);


Hmm, this part doesn't seem very good.
I don't think for_each_online_cpu() here will be the way to the final win.
Under a multi-level hierarchy, you may need to call for_each_online_cpu() at each
level.

Can't you update 

[Devel] Re: [RFC 5/7] use percpu_counters for res_counter usage

2012-03-30 Thread KAMEZAWA Hiroyuki
(2012/03/30 18:33), KAMEZAWA Hiroyuki wrote:

 (2012/03/30 17:04), Glauber Costa wrote:

 
 Hmm this part doesn't seem very good.
  I don't think for_each_online_cpu() here will be the way to the final win.
  Under a multi-level hierarchy, you may need to call for_each_online_cpu() at each
  level.
 
 Can't you update percpu counter's core logic to avoid using 
 for_each_online_cpu() ?
 For example, if you know what cpus have caches, you can use that cpu mask...
 
 Memo:
 Current implementation of memcg's percpu counting is reserving usage before
 its real use.
 Usually, the kernel doesn't have to scan percpu caches; it just drains the caches
 from the cpus reserving usages if we need to cancel reserved usages. (And it's
 automatically canceled
 when a cpu's memcg changes.)
 
 And 'reserving' avoids caching in multi-level counters: it updates
 multiple counters
 in batch, and the memcg core doesn't need to walk res_counter ancestors in the fast path.
 
 Considering res_counter's characteristics:
  - it has a _hard_ limit
  - it can be a tree, and usages are propagated to ancestors
  - all ancestors have hard limits
 
 Isn't it better to generalize the 'reserving resource' model ?
 You can provide 'precise usage' to the user by some logic.
 

Ah, one more point. Please see this memcg code.
==
	if (nr_pages == 1 && consume_stock(memcg)) {
		/*
		 * It seems dagerous to access memcg without css_get().
		 * But considering how consume_stok works, it's not
		 * necessary. If consume_stock success, some charges
		 * from this memcg are cached on this cpu. So, we
		 * don't need to call css_get()/css_tryget() before
		 * calling consume_stock().
		 */
		rcu_read_unlock();
		goto done;
	}
	/* after here, we may be blocked. we need to get refcnt */
	if (!css_tryget(&memcg->css)) {
		rcu_read_unlock();
		goto again;
	}
==

Now, as we consume 'reserved' usage, we can avoid css_get(), a heavy atomic
op. You may need to move this code as

	rcu_read_lock()

	res_counter_charge()
	if (failure) {
		css_tryget()
		rcu_read_unlock()
	} else {
		rcu_read_unlock()
		return success;
	}

to compare performance. This css_get() affects performance very very much.

Thanks,
-Kame











[Devel] Re: [PATCH] memcg: Do not open code accesses to res_counter members

2012-03-20 Thread KAMEZAWA Hiroyuki
(2012/03/21 1:53), Glauber Costa wrote:

 We should use the acessor res_counter_read_u64 for that.
 Although a purely cosmetic change is sometimes better of delayed,
 to avoid conflicting with other people's work, we are starting to
 have people touching this code as well, and reproducing the open
 code behavior because that's the standard =)
 
 Time to fix it, then.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 Cc: Johannes Weiner han...@cmpxchg.org
 Cc: Michal Hocko mho...@suse.cz
 Cc: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com



[Devel] Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.

2012-03-14 Thread KAMEZAWA Hiroyuki
(2012/03/14 21:29), Glauber Costa wrote:


   - What happens when a new cgroup created ?
 
 mem_cgroup_create() is called =)
 Heh, jokes apart, I don't really follow here. What exactly do you mean? 
 There shouldn't be anything extremely out of the ordinary.
 


Sorry, my words were too brief.

Assume a cgroup with
	cgroup.memory.limit_in_bytes=1G
	cgroup.memory.kmem.limit_in_bytes=400M

When a child cgroup is created, what should the default values be?
'unlimited', as in the current implementation?
Hmm.. maybe yes.

Thanks,
-Kame



[Devel] Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.

2012-03-13 Thread KAMEZAWA Hiroyuki
On Sun, 11 Mar 2012 12:12:04 +0400
Glauber Costa glom...@parallels.com wrote:

 On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
  Enabled with CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
 
  Adds the following files:
   - memory.kmem.independent_kmem_limit
   - memory.kmem.usage_in_bytes
   - memory.kmem.limit_in_bytes
 
  Signed-off-by: Suleiman Souhlalsulei...@google.com
  ---
mm/memcontrol.c |  136 
  ++-
1 files changed, 135 insertions(+), 1 deletions(-)
 
  diff --git a/mm/memcontrol.c b/mm/memcontrol.c
  index 37ad2cb..e6fd558 100644
  --- a/mm/memcontrol.c
  +++ b/mm/memcontrol.c
  @@ -220,6 +220,10 @@ enum memcg_flags {
   */
  MEMCG_MEMSW_IS_MINIMUM, /* Set when res.limit == memsw.limit */
  MEMCG_OOM_KILL_DISABLE, /* OOM-Killer disable */
  +   MEMCG_INDEPENDENT_KMEM_LIMIT,   /*
  +* kernel memory is not counted in
  +* memory.usage_in_bytes
  +*/
};


After looking at the code, I think we need to reconsider
whether independent_kmem_limit is good or not.

How about adding a MEMCG_KMEM_ACCOUNT flag instead of this, and using only
memcg->res/memcg->memsw rather than adding a new counter, memcg->kmem?

if MEMCG_KMEM_ACCOUNT is set     -> slab is accounted to mem->res/memsw.
if MEMCG_KMEM_ACCOUNT is not set -> slab is never accounted.

(I think an On/Off switch is required..)
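
i.e. something like this (a sketch on top of the proposed flag, not a real patch):

	/* sketch: charge a slab allocation only when the switch is on */
	static int memcg_charge_slab(struct mem_cgroup *memcg, size_t size)
	{
		struct res_counter *fail_res;

		/* one on/off switch instead of a separate kmem res_counter */
		if (!mem_cgroup_test_flag(memcg, MEMCG_KMEM_ACCOUNT))
			return 0;		/* slab is never accounted */

		/* otherwise bill mem->res, exactly like user pages */
		return res_counter_charge(&memcg->res, size, &fail_res);
	}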

Thanks,
-Kame



[Devel] Re: [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup.

2012-03-13 Thread KAMEZAWA Hiroyuki
On Fri,  9 Mar 2012 12:39:06 -0800
Suleiman Souhlal ssouh...@freebsd.org wrote:

 Signed-off-by: Suleiman Souhlal sulei...@google.com
 ---
  mm/memcontrol.c |   31 ++-
  1 files changed, 30 insertions(+), 1 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index e6fd558..6fbb438 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -382,6 +382,7 @@ static void mem_cgroup_get(struct mem_cgroup *memcg);
  static void mem_cgroup_put(struct mem_cgroup *memcg);
  static void memcg_kmem_init(struct mem_cgroup *memcg,
  struct mem_cgroup *parent);
 +static void memcg_kmem_move(struct mem_cgroup *memcg);
  
  static inline bool
  mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
 @@ -3700,6 +3701,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup 
 *memcg, bool free_all)
   int ret;
   int node, zid, shrink;
   int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 + unsigned long usage;
   struct cgroup *cgrp = memcg-css.cgroup;
  
   css_get(memcg-css);
 @@ -3719,6 +3721,8 @@ move_account:
   /* This is for making all *used* pages to be on LRU. */
   lru_add_drain_all();
   drain_all_stock_sync(memcg);
 + if (!free_all)
 + memcg_kmem_move(memcg);
   ret = 0;
   mem_cgroup_start_move(memcg);
   for_each_node_state(node, N_HIGH_MEMORY) {
 @@ -3740,8 +3744,14 @@ move_account:
   if (ret == -ENOMEM)
   goto try_to_free;
   cond_resched();
 + usage = memcg-res.usage;
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 + if (free_all  !mem_cgroup_test_flag(memcg,
 + MEMCG_INDEPENDENT_KMEM_LIMIT))
 + usage -= memcg-kmem.usage;
 +#endif
   /* ret should also be checked to ensure all lists are empty. */
 - } while (memcg-res.usage  0 || ret);
 + } while (usage  0 || ret);
  out:
   css_put(memcg-css);
   return ret;
 @@ -5689,9 +5699,28 @@ memcg_kmem_init(struct mem_cgroup *memcg, struct 
 mem_cgroup *parent)
   parent_res = parent-kmem;
   res_counter_init(memcg-kmem, parent_res);
  }
 +
 +static void
 +memcg_kmem_move(struct mem_cgroup *memcg)

The function name says 'move', but the code seems to just do 'forget'
or 'leak'...


 +{
 + unsigned long flags;
 + long kmem;
 +
 +	spin_lock_irqsave(&memcg->kmem.lock, flags);
 +	kmem = memcg->kmem.usage;
 +	res_counter_uncharge_locked(&memcg->kmem, kmem);
 +	spin_unlock_irqrestore(&memcg->kmem.lock, flags);
 +	if (!mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT))
 +		res_counter_uncharge(&memcg->res, kmem);
 +}

Please update memcg->memsw, too.
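
i.e. something like this, in memcg_kmem_move():

	if (!mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT)) {
		res_counter_uncharge(&memcg->res, kmem);
		/* keep the mem+swap counter in sync as well */
		res_counter_uncharge(&memcg->memsw, kmem);
	}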

Thanks,
-Kame



  #else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
  static void
  memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
  {
  }
 +
 +static void
 +memcg_kmem_move(struct mem_cgroup *memcg)
 +{
 +}
  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 -- 
 1.7.7.3
 



[Devel] Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.

2012-03-13 Thread KAMEZAWA Hiroyuki
On Tue, 13 Mar 2012 14:37:30 +0400
Glauber Costa glom...@parallels.com wrote:

  After looking at the code, I think we need to reconsider
  whether independent_kmem_limit is good or not.
 
  How about adding a MEMCG_KMEM_ACCOUNT flag instead of this, and using only
  memcg->res/memcg->memsw rather than adding a new counter, memcg->kmem?
 
  if MEMCG_KMEM_ACCOUNT is set     ->  slab is accounted to mem->res/memsw.
  if MEMCG_KMEM_ACCOUNT is not set ->  slab is never accounted.
 
  (I think an On/Off switch is required..)
 
  Thanks,
  -Kame
 
 
 This has been discussed before, I can probably find it in the archives 
 if you want to go back and see it.
 

Yes. IIUC, we agreed to have an independent kmem limit. I just want to think it
over again, because there are too many proposals and I'm getting confused.

As far as I can see, the ongoing works are:
 - kmem limit (by 2 guys)
 - hugetlb limit
 - per-lru locking (by 2 guys)
 - page_cgroup diet (by me, but stopped for now)
 - dirty-ratio and writeback
 - Tejun's proposal to remove pre_destroy()
 - moving shared resources

I'm thinking about what a simple plan and implementation would be.
Most of the series consist of 10+ patches...

Thank you for your help with clarification.



 But in a nutshell:
 
 1) Supposing independent knob disappear (I will explain in item 2 why I 
 don't want it to), I don't thing a flag makes sense either. *If* we are 
 planning to enable/disable this, it might make more sense to put some 
 work on it, and allow particular slabs to be enabled/disabled by writing 
 to memory.kmem.slabinfo (-* would disable all, +* enable all, +kmalloc* 
 enable all kmalloc, etc).
 
seems interesting.

 Alternatively, what we could do instead, is something similar to what 
 ended up being done for tcp, by request of the network people: if you 
 never touch the limit file, don't bother with it at all, and simply does 
 not account. With Suleiman's lazy allocation infrastructure, that should 
 actually be trivial. And then again, a flag is not necessary, because 
 writing to the limit file does the job, and also convey the meaning well 
 enough.
 

Hm.

 2) For the kernel itself, we are mostly concerned that a malicious 
 container may pin into memory big amounts of kernel memory which is, 
 ultimately, unreclaimable. 

Yes. This is a big problem, both for memcg and for the whole system.

In my experience, 2000 processes share a 10GB shared memory segment and eat up
a lot of memory ;(



 In particular, with overcommit allowed 
 scenarios, you can fill the whole physical memory (or at least a 
 significant part) with those objects, well beyond your softlimit 
 allowance, making the creation of further containers impossible.
 With user memory, you can reclaim the cgroup back to its place. With 
 kernel memory, you can't.
 
Agreed.

 In the particular example of 32-bit boxes, you can easily fill up a 
 large part of the available 1gb kernel memory with pinned memory and 
 render the whole system unresponsive.
 
 Never allowing the kernel memory to go beyond the soft limit was one of 
 the proposed alternatives. However, it may force you to establish a soft
 limit where one was not previously needed. Or, establish a low soft 
 limit when you really need a bigger one.
 
 All that said, while reading your message, thinking a bit, the following 
 crossed my mind:
 
 - We can account the slabs to memcg-res normally, and just store the
information that this is kernel memory into a percpu counter, as
I proposed recently.

Ok, then user can see the amount of kernel memory.


 - The knob goes away, and becomes implicit: if you ever write anything
to memory.kmem.limit_in_bytes, we transfer that memory to a separate
kmem res_counter, and proceed from there. We can keep accounting to
memcg-res anyway, just that kernel memory will now have a separate
limit.

Okay, then,

kmem_limit < memory.limit < memsw.limit

...seems reasonable to me.
This means the user can specify the 'ratio' of kmem within memory.limit.

More consideration will be interesting.
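
For illustration, the "implicit knob" idea quoted above might look roughly like
this (a sketch only, not taken from any posted patch; the kmem_accounted field
and the memcg_charge_kmem() helper are made-up names):

/*
 * Kernel memory is always billed to memcg->res.  Only after the user has
 * written to the kmem limit file is it additionally charged to a separate
 * kmem res_counter, which carries its own (smaller) limit.
 */
static int memcg_charge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages)
{
	unsigned long bytes = nr_pages * PAGE_SIZE;
	struct res_counter *fail;
	int ret;

	ret = res_counter_charge(&memcg->res, bytes, &fail);
	if (ret)
		return ret;

	if (memcg->kmem_accounted) {	/* set on the first write to the kmem limit */
		ret = res_counter_charge(&memcg->kmem, bytes, &fail);
		if (ret)
			res_counter_uncharge(&memcg->res, bytes);
	}
	return ret;
}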

 - We can show the amount of reclaimable kmem by some means ?
 - What happens when a new cgroup created ?
 - Should we have 'ratio' interface in kernel level ?
 - What happens at task moving ?
 - Should we allow per-slab accounting knob in /sys/kernel/slab/xxx ?
   or somewhere ?
 - Should we show per-memcg usage in /sys/kernel/slab/xxx ?
 - Should we have force_empty for kmem (as last resort) ?

With any implementation, my concern is
 - overhead/performance.
 - unreclaimable kmem
 - shared kmem between cgroups.


 - With this scheme, it may not be necessary to ever have a file
memory.kmem.soft_limit_in_bytes. Reclaim is always part of the normal
memcg reclaim.
 
Good.

 The outlined above would work for us, and make the whole scheme simpler, 
 I believe.
 
 What do you think ?

It sounds interesting to me.

Thanks,
-Kame











___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] memcg: Introduce __GFP_NOACCOUNT.

2012-02-29 Thread KAMEZAWA Hiroyuki
On Wed, 29 Feb 2012 11:09:50 -0800
Suleiman Souhlal sulei...@google.com wrote:

 On Tue, Feb 28, 2012 at 10:00 PM, KAMEZAWA Hiroyuki
 kamezawa.hir...@jp.fujitsu.com wrote:
  On Mon, 27 Feb 2012 14:58:47 -0800
  Suleiman Souhlal ssouh...@freebsd.org wrote:
 
  This is used to indicate that we don't want an allocation to be accounted
  to the current cgroup.
 
  Signed-off-by: Suleiman Souhlal sulei...@google.com
 
  I don't like this.
 
  Please add
 
  ___GFP_ACCOUNT  account this allocation to memcg
 
  Or make this as slab's flag if this work is for slab allocation.
 
 We would like to account for all the slab allocations that happen in
 process context.
 
 Manually marking every single allocation or kmem_cache with a GFP flag
 really doesn't seem like the right thing to do..
 
 Can you explain why you don't like this flag?
 

For example, tcp buffer limiting has its own logic for buffer size controlling.
_AND_, most kernel pages are not reclaimable at all.
I think you should start from reclaimable caches such as dcache, icache, etc.

If you want to use this more widely, you can discuss

+ #define GFP_KERNEL (... | ___GFP_ACCOUNT)

in the future. I'd like to see a small start, because memory allocation failure
is always terrible and makes the system unstable. Even if you notify:
"Ah, kernel memory allocation failed because of memory.limit, and there is a lot
 of unreclaimable memory usage. Please tweak the limit or kill tasks!!"

The user can't do anything because he can't create any new task because of OOM.

The system will remain unstable until an admin, who is not under any limit,
tweaks something or reboots the system.

Please start small until you provide an ecosystem that avoids the case where
the admin cannot log in and the only thing he can do is reboot.
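
To make the opt-in direction concrete, a minimal sketch (not a posted patch; the
flag value and the memcg_should_account() helper are assumptions of mine):

#define ___GFP_ACCOUNT	0x2000u				/* opt in: bill to the current memcg */
#define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)

/* Only explicitly flagged allocations ever reach the memcg charge path. */
static inline bool memcg_should_account(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_ACCOUNT) && !in_interrupt();
}

A reclaimable cache such as dcache would then pass GFP_KERNEL | __GFP_ACCOUNT
explicitly, and everything else would stay unaccounted by default.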

Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/7] small cleanup for memcontrol.c

2012-02-29 Thread KAMEZAWA Hiroyuki
On Wed, 29 Feb 2012 14:30:35 -0300
Glauber Costa glom...@parallels.com wrote:

 On 02/22/2012 12:01 PM, Glauber Costa wrote:
  On 02/22/2012 04:46 AM, KAMEZAWA Hiroyuki wrote:
  On Tue, 21 Feb 2012 15:34:33 +0400
  Glauber Costaglom...@parallels.com wrote:
 
  Move some hardcoded definitions to an enum type.
 
  Signed-off-by: Glauber Costaglom...@parallels.com
  CC: Kirill A. Shutemovkir...@shutemov.name
  CC: Greg Thelengthe...@google.com
  CC: Johannes Weinerjwei...@redhat.com
  CC: Michal Hockomho...@suse.cz
  CC: Hiroyouki Kamezawakamezawa.hir...@jp.fujitsu.com
  CC: Paul Turnerp...@google.com
  CC: Frederic Weisbeckerfweis...@gmail.com
 
  seems ok to me.
 
  Acked-by: KAMEZAWA Hiroyukikamezawa.hir...@jp.fujitsu.com
 
  BTW, this series is likely to go through many rounds of discussion.
  This patch can be probably picked separately, if you want to.
 
  a nitpick..
 
  ---
  mm/memcontrol.c | 10 +++---
  1 files changed, 7 insertions(+), 3 deletions(-)
 
  diff --git a/mm/memcontrol.c b/mm/memcontrol.c
  index 6728a7a..b15a693 100644
  --- a/mm/memcontrol.c
  +++ b/mm/memcontrol.c
  @@ -351,9 +351,13 @@ enum charge_type {
  };
 
  /* for encoding cft-private value on file */
  -#define _MEM (0)
  -#define _MEMSWAP (1)
  -#define _OOM_TYPE (2)
  +
  +enum mem_type {
  + _MEM = 0,
 
  =0 is required ?
  I believe not, but I always liked to use it to be 100 % explicit.
  Personal taste... Can change it, if this is a big deal.
 
 Kame, would you like me to send this cleanup without the = 0 ?
 

Not necessary.  Sorry for delayed response.
Lots of memcg patches are flying..

Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] memcg: Introduce __GFP_NOACCOUNT.

2012-02-29 Thread KAMEZAWA Hiroyuki
On Wed, 29 Feb 2012 21:24:11 -0300
Glauber Costa glom...@parallels.com wrote:

 On 02/29/2012 09:10 PM, KAMEZAWA Hiroyuki wrote:
  On Wed, 29 Feb 2012 11:09:50 -0800
  Suleiman Souhlalsulei...@google.com  wrote:
 
  On Tue, Feb 28, 2012 at 10:00 PM, KAMEZAWA Hiroyuki
  kamezawa.hir...@jp.fujitsu.com  wrote:
  On Mon, 27 Feb 2012 14:58:47 -0800
  Suleiman Souhlalssouh...@freebsd.org  wrote:
 
  This is used to indicate that we don't want an allocation to be accounted
  to the current cgroup.
 
  Signed-off-by: Suleiman Souhlalsulei...@google.com
 
  I don't like this.
 
  Please add
 
  ___GFP_ACCOUNT  account this allocation to memcg
 
  Or make this as slab's flag if this work is for slab allocation.
 
  We would like to account for all the slab allocations that happen in
  process context.
 
  Manually marking every single allocation or kmem_cache with a GFP flag
  really doesn't seem like the right thing to do..
 
  Can you explain why you don't like this flag?
 
 
  For example, tcp buffer limiting has another logic for buffer size 
  controling.
  _AND_, most of kernel pages are not reclaimable at all.
  I think you should start from reclaimable caches as dcache, icache etc.
 
  If you want to use this wider, you can discuss
 
  + #define GFP_KERNEL (... | ___GFP_ACCOUNT)
 
  in future. I'd like to see small start because memory allocation failure
  is always terrible and make the system unstable. Even if you notify
  Ah, kernel memory allocation failed because of memory.limit? and
many unreclaimable memory usage. Please tweak the limitation or kill 
  tasks!!
 
  The user can't do anything because he can't create any new task because of 
  OOM.
 
  The system will be being unstable until an admin, who is not under any 
  limit,
  tweaks something or reboot the system.
 
  Please do small start until you provide Eco-System to avoid a case that
  the admin cannot login and what he can do was only reboot.
 
 Having the root cgroup to be always unlimited should already take care 
 of the most extreme cases, right?
 
If an admin can log in to the root cgroup ;)
Anyway, if someone has a container under cgroup via a hosting service,
he can do nothing if the OOM killer cannot recover his container. That can be
caused by the kernel memory limit. And I'm not sure he can shut it down, because
he can't log in.

Thanks,
-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 08/10] memcg: Add CONFIG_CGROUP_MEM_RES_CTLR_KMEM_ACCT_ROOT.

2012-02-28 Thread KAMEZAWA Hiroyuki
On Tue, 28 Feb 2012 15:36:27 -0800
Suleiman Souhlal sulei...@google.com wrote:

 On Tue, Feb 28, 2012 at 5:34 AM, Glauber Costa glom...@parallels.com wrote:
  On 02/27/2012 07:58 PM, Suleiman Souhlal wrote:
 
  This config option dictates whether or not kernel memory in the
  root cgroup should be accounted.
 
  This may be useful in an environment where everything is supposed to be
  in a cgroup and accounted for. Large amounts of kernel memory in the
  root cgroup would indicate problems with memory isolation or accounting.
 
 
  I don't like accounting this stuff to the root memory cgroup. This causes
  overhead for everybody, including people who couldn't care less about memcg.
 
  If it were up to me, we would simply not account it, and end of story.
 
  However, if this is terribly important for you, I think you need to at
  least make it possible to enable it at runtime, and default it to disabled.
 
 Yes, that is why I made it a config option. If the config option is
 disabled, that memory does not get accounted at all.
 
 Making it configurable at runtime is not ideal, because we would
 prefer slab memory that was allocated before cgroups are created to
 still be counted toward root.
 

I never like doing accounting in the root cgroup.
If you want to do this, add a config option to account _all_ memory in the root.
Accounting only slab seems like a very dirty hack.

Thanks,
-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 04/10] memcg: Introduce __GFP_NOACCOUNT.

2012-02-28 Thread KAMEZAWA Hiroyuki
On Mon, 27 Feb 2012 14:58:47 -0800
Suleiman Souhlal ssouh...@freebsd.org wrote:

 This is used to indicate that we don't want an allocation to be accounted
 to the current cgroup.
 
 Signed-off-by: Suleiman Souhlal sulei...@google.com

I don't like this.

Please add 

___GFP_ACCOUNT  account this allocation to memcg

Or make this as slab's flag if this work is for slab allocation.

Thanks,
-Kame



 ---
  include/linux/gfp.h |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 index 581e74b..765c20f 100644
 --- a/include/linux/gfp.h
 +++ b/include/linux/gfp.h
 @@ -23,6 +23,7 @@ struct vm_area_struct;
  #define ___GFP_REPEAT	0x400u
  #define ___GFP_NOFAIL	0x800u
  #define ___GFP_NORETRY	0x1000u
 +#define ___GFP_NOACCOUNT	0x2000u
  #define ___GFP_COMP	0x4000u
  #define ___GFP_ZERO	0x8000u
  #define ___GFP_NOMEMALLOC	0x10000u
 @@ -76,6 +77,7 @@ struct vm_area_struct;
  #define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT)  /* See above */
  #define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL)  /* See above */
  #define __GFP_NORETRY((__force gfp_t)___GFP_NORETRY) /* See above */
 +#define __GFP_NOACCOUNT  ((__force gfp_t)___GFP_NOACCOUNT) /* Don't 
 account to the current cgroup */
  #define __GFP_COMP   ((__force gfp_t)___GFP_COMP)/* Add compound page 
 metadata */
  #define __GFP_ZERO   ((__force gfp_t)___GFP_ZERO)/* Return zeroed page 
 on success */
  #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use 
 emergency reserves */
 -- 
 1.7.7.3
 
 

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 03/10] memcg: Reclaim when more than one page needed.

2012-02-28 Thread KAMEZAWA Hiroyuki
On Mon, 27 Feb 2012 14:58:46 -0800
Suleiman Souhlal ssouh...@freebsd.org wrote:

 From: Hugh Dickins hu...@google.com
 
 mem_cgroup_do_charge() was written before slab accounting, and expects
 three cases: being called for 1 page, being called for a stock of 32 pages,
 or being called for a hugepage.  If we call for 2 pages (and several slabs
 used in process creation are such, at least with the debug options I had),
 it assumed it's being called for stock and just retried without reclaiming.
 
 Fix that by passing down a minsize argument in addition to the csize;
 and pass minsize to consume_stock() also, so that it can draw on stock
 for higher order slabs, instead of accumulating an increasing surplus
 of stock, as its nr_pages == 1 tests previously caused.
 
 And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
 needed at all (and presumably is since it's there, perhaps to handle
 races), then it should be extended to more than PAGE_SIZE, yet how far?

IIRC, this hack was added to avoid a very-easy-oom-kill in a small memcg.
If we can support dirty_ratio and make page reclaim wait until writeback
ends, we can remove this, I think.

 And should there be a retry count limit, of what?  For now retry up to
 COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
 and make sure not to do it if __GFP_NORETRY.
 

Could you split the changes to consume_stock() and mem_cgroup_do_charge() ?

These days, I personally don't like magical retry counts..
Let's see what happens once we can wait for I/O in memcg. We may not need
any retries and could clean up this loop.

Thanks,
-Kame


 Signed-off-by: Hugh Dickins hu...@google.com
 Signed-off-by: Suleiman Souhlal sulei...@google.com
 ---
  mm/memcontrol.c |   35 +++
  1 files changed, 19 insertions(+), 16 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 6f44fcb..c82ca1c 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -1928,19 +1928,19 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, 
 memcg_stock);
  static DEFINE_MUTEX(percpu_charge_mutex);
  
  /*
 - * Try to consume stocked charge on this cpu. If success, one page is 
 consumed
 - * from local stock and true is returned. If the stock is 0 or charges from a
 - * cgroup which is not current target, returns false. This stock will be
 - * refilled.
 + * Try to consume stocked charge on this cpu. If success, nr_pages pages are
 + * consumed from local stock and true is returned. If the stock is 0 or
 + * charges from a cgroup which is not current target, returns false.
 + * This stock will be refilled.
   */
 -static bool consume_stock(struct mem_cgroup *memcg)
 +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
  {
 	struct memcg_stock_pcp *stock;
 	bool ret = true;
 
 	stock = &get_cpu_var(memcg_stock);
 -	if (memcg == stock->cached && stock->nr_pages)
 -		stock->nr_pages--;
 +	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
 +		stock->nr_pages -= nr_pages;
 	else /* need to call res_counter_charge */
 		ret = false;
 	put_cpu_var(memcg_stock);
 @@ -2131,7 +2131,7 @@ enum {
  };
  
  static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 - unsigned int nr_pages, bool oom_check)
 +unsigned int nr_pages, unsigned int min_pages, bool oom_check)
  {
   unsigned long csize = nr_pages * PAGE_SIZE;
   struct mem_cgroup *mem_over_limit;
 @@ -2154,18 +2154,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup 
 *memcg, gfp_t gfp_mask,
   } else
   mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
   /*
 -  * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
 -  * of regular pages (CHARGE_BATCH), or a single regular page (1).
 -  *
* Never reclaim on behalf of optional batching, retry with a
* single page instead.
*/
 - if (nr_pages == CHARGE_BATCH)
 +	if (nr_pages > min_pages)
   return CHARGE_RETRY;
  
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
 +	if (gfp_mask & __GFP_NORETRY)
 +		return CHARGE_NOMEM;
 +
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
   return CHARGE_RETRY;
 @@ -2178,8 +2178,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup 
 *memcg, gfp_t gfp_mask,
* unlikely to succeed so close to the limit, and we fall back
* to regular pages anyway in case of failure.
*/
 -	if (nr_pages == 1 && ret)
 +	if (nr_pages <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) && ret) {
 + cond_resched();
   return CHARGE_RETRY;
 + }
  
   /*
* At task move, charge accounts can be doubly counted. So, it's
 @@ -2253,7 +2255,7 @@ again:
 	VM_BUG_ON(css_is_removed(&memcg->css));
   

[Devel] Re: [PATCH 02/10] memcg: Uncharge all kmem when deleting a cgroup.

2012-02-28 Thread KAMEZAWA Hiroyuki
On Mon, 27 Feb 2012 14:58:45 -0800
Suleiman Souhlal ssouh...@freebsd.org wrote:

 A later patch will also use this to move the accounting to the root
 cgroup.
 
 Signed-off-by: Suleiman Souhlal sulei...@google.com
 ---
  mm/memcontrol.c |   30 +-
  1 files changed, 29 insertions(+), 1 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 11e31d6..6f44fcb 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -378,6 +378,7 @@ static void mem_cgroup_get(struct mem_cgroup *memcg);
  static void mem_cgroup_put(struct mem_cgroup *memcg);
  static void memcg_kmem_init(struct mem_cgroup *memcg,
  struct mem_cgroup *parent);
 +static void memcg_kmem_move(struct mem_cgroup *memcg);
  
  /* Writing them here to avoid exposing memcg's inner layout */
  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 @@ -3674,6 +3675,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup 
 *memcg, bool free_all)
   int ret;
   int node, zid, shrink;
   int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 + unsigned long usage;
 	struct cgroup *cgrp = memcg->css.cgroup;
 
 	css_get(&memcg->css);
 @@ -3693,6 +3695,8 @@ move_account:
   /* This is for making all *used* pages to be on LRU. */
   lru_add_drain_all();
   drain_all_stock_sync(memcg);
 + if (!free_all)
 + memcg_kmem_move(memcg);
   ret = 0;
   mem_cgroup_start_move(memcg);
   for_each_node_state(node, N_HIGH_MEMORY) {
 @@ -3714,8 +3718,13 @@ move_account:
   if (ret == -ENOMEM)
   goto try_to_free;
   cond_resched();
 +	usage = memcg->res.usage;
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +	if (free_all && !memcg->independent_kmem_limit)
 +		usage -= memcg->kmem_bytes.usage;
 +#endif

Why do we need this even if memcg_kmem_move() does the uncharge ?

Thanks,
-Kame

   /* ret should also be checked to ensure all lists are empty. */
 -	} while (memcg->res.usage > 0 || ret);
 +	} while (usage > 0 || ret);
  out:
 	css_put(&memcg->css);
   return ret;
 @@ -5632,9 +5641,28 @@ memcg_kmem_init(struct mem_cgroup *memcg, struct 
 mem_cgroup *parent)
 	res_counter_init(&memcg->kmem_bytes, parent_res);
   memcg-independent_kmem_limit = 0;
  }
 +
 +static void
 +memcg_kmem_move(struct mem_cgroup *memcg)
 +{
 + unsigned long flags;
 + long kmem_bytes;
 +
 +	spin_lock_irqsave(&memcg->kmem_bytes.lock, flags);
 +	kmem_bytes = memcg->kmem_bytes.usage;
 +	res_counter_uncharge_locked(&memcg->kmem_bytes, kmem_bytes);
 +	spin_unlock_irqrestore(&memcg->kmem_bytes.lock, flags);
 +	if (!memcg->independent_kmem_limit)
 +		res_counter_uncharge(&memcg->res, kmem_bytes);
 +}
  #else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
  static void
  memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
  {
  }
 +
 +static void
 +memcg_kmem_move(struct mem_cgroup *memcg)
 +{
 +}
  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 -- 
 1.7.7.3
 
 

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/7] small cleanup for memcontrol.c

2012-02-21 Thread KAMEZAWA Hiroyuki
On Tue, 21 Feb 2012 15:34:33 +0400
Glauber Costa glom...@parallels.com wrote:

 Move some hardcoded definitions to an enum type.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Kirill A. Shutemov kir...@shutemov.name
 CC: Greg Thelen gthe...@google.com
 CC: Johannes Weiner jwei...@redhat.com
 CC: Michal Hocko mho...@suse.cz
 CC: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: Paul Turner p...@google.com
 CC: Frederic Weisbecker fweis...@gmail.com

seems ok to me.

Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

a nitpick..

 ---
  mm/memcontrol.c |   10 +++---
  1 files changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 6728a7a..b15a693 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -351,9 +351,13 @@ enum charge_type {
  };
  
  /* for encoding cft-private value on file */
 -#define _MEM (0)
 -#define _MEMSWAP (1)
 -#define _OOM_TYPE(2)
 +
 +enum mem_type {
 + _MEM = 0,

=0 is required ?

 + _MEMSWAP,
 + _OOM_TYPE,
 +};
 +
  #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
  #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
  #define MEMFILE_ATTR(val)	((val) & 0xffff)
 -- 
 1.7.7.6
 
 

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 3/7] per-cgroup slab caches

2012-02-21 Thread KAMEZAWA Hiroyuki
On Tue, 21 Feb 2012 15:34:35 +0400
Glauber Costa glom...@parallels.com wrote:

 This patch creates the infrastructure to allow us to register
 per-memcg slab caches. As an example implementation, I am tracking
 the dentry cache, but others will follow.
 
 I am using an opt-in istead of opt-out system here: this means the
 cache needs to explicitly register itself to be tracked by memcg.
 I prefer this approach since:
 
 1) not all caches are big enough to be relevant,
 2) most of the relevant ones will involve shrinking in some way,
and it would be better be sure they are shrinker-aware
 3) some objects, like network sockets, have their very own idea
of memory control, that goes beyond the allocation itself.
 
 Once registered, allocations made on behalf of a task living
 on a cgroup will be billed to it. It is a first-touch mechanism,
 but if follows what we have today, and the cgroup infrastructure
 itself. No overhead is expected in object allocation: only when
 slab pages are allocated and freed, any form of billing occurs.
 
 The allocation stays billed to a cgroup until it is destroyed.
 It is kept if a task leaves the cgroup, since it is close to
 impossible to efficiently map an object to a task - and the tracking
 is done by pages, which contain multiple objects.
 

Hmm... can't we do this by

 kmem_cache = kmem_cache_create(..., ..., SLAB_XXX_XXX | SLAB_MEMCG_AWARE)

 kmem_cache_alloc(kmem_cache, flags)
  => find a memcg_kmem_cache for the thread; if it doesn't exist, create it.

BTW, compared with ANON/FILE caches, we'll have many kinds of gfp_t flags.

Do we handle them correctly ?


Maybe I don't fully understand your implementation... but let me try to comment.
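
As a sketch of what that allocation-side lookup could look like (hypothetical:
SLAB_MEMCG_AWARE and memcg_kmem_cache_get() are made-up names here, while
kmem_cache_dup() is the helper proposed in this series):

void *memcg_aware_cache_alloc(struct kmem_cache *base, gfp_t flags)
{
	struct kmem_cache *cachep = base;
	struct mem_cgroup *memcg;

	if ((base->flags & SLAB_MEMCG_AWARE) && !in_interrupt()) {
		rcu_read_lock();
		memcg = mem_cgroup_from_task(current);
		if (!mem_cgroup_is_root(memcg)) {
			/* per-memcg copy of the cache, created lazily */
			cachep = memcg_kmem_cache_get(memcg, base);
			if (!cachep) {
				/*
				 * Real code must not call kmem_cache_dup()
				 * here under rcu_read_lock(); creation would
				 * be deferred (e.g. to a workqueue) and the
				 * base cache used until the copy exists.
				 */
				cachep = base;
			}
		}
		rcu_read_unlock();
	}
	return kmem_cache_alloc(cachep, flags);
}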


 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Kirill A. Shutemov kir...@shutemov.name
 CC: Greg Thelen gthe...@google.com
 CC: Johannes Weiner jwei...@redhat.com
 CC: Michal Hocko mho...@suse.cz
 CC: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: Paul Turner p...@google.com
 CC: Frederic Weisbecker fweis...@gmail.com
 CC: Pekka Enberg penb...@kernel.org
 CC: Christoph Lameter c...@linux.com
 ---
  include/linux/memcontrol.h |   29 ++
  include/linux/slab.h   |8 
  include/linux/slub_def.h   |2 +
  mm/memcontrol.c|   93 
 
  mm/slub.c  |   68 +--
  5 files changed, 195 insertions(+), 5 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 4d34356..95e7e19 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -21,12 +21,41 @@
  #define _LINUX_MEMCONTROL_H
  #include <linux/cgroup.h>
  #include <linux/vm_event_item.h>
 +#include <linux/slab.h>
 +#include <linux/workqueue.h>
 +#include <linux/shrinker.h>
  
  struct mem_cgroup;
  struct page_cgroup;
  struct page;
  struct mm_struct;
  
 +struct memcg_kmem_cache {
 + struct kmem_cache *cache;
 + struct mem_cgroup *memcg; /* Should be able to do without this */
 +};
 +
 +struct memcg_cache_struct {
 + int index;
 + struct kmem_cache *cache;
 +};
 +
 +enum memcg_cache_indexes {
 + CACHE_DENTRY,
 + NR_CACHES,
 +};
 +
 +int memcg_kmem_newpage(struct mem_cgroup *memcg, struct page *page, unsigned 
 long pages);
 +void memcg_kmem_freepage(struct mem_cgroup *memcg, struct page *page, 
 unsigned long pages);
 +struct mem_cgroup *memcg_from_shrinker(struct shrinker *s);
 +
 +struct memcg_kmem_cache *memcg_cache_get(struct mem_cgroup *memcg, int 
 index);
 +void register_memcg_cache(struct memcg_cache_struct *cache);
 +void memcg_slab_destroy(struct kmem_cache *cache, struct mem_cgroup *memcg);
 +
 +struct kmem_cache *
 +kmem_cache_dup(struct mem_cgroup *memcg, struct kmem_cache *base);
 +
  /* Stats that can be updated by kernel. */
  enum mem_cgroup_page_stat_item {
   MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
 diff --git a/include/linux/slab.h b/include/linux/slab.h
 index 573c809..8a372cd 100644
 --- a/include/linux/slab.h
 +++ b/include/linux/slab.h
 @@ -98,6 +98,14 @@
  void __init kmem_cache_init(void);
  int slab_is_available(void);
  
 +struct mem_cgroup;
 +
 +unsigned long slab_nr_pages(struct kmem_cache *s);
 +
 +struct kmem_cache *kmem_cache_create_cg(struct mem_cgroup *memcg,
 + const char *, size_t, size_t,
 + unsigned long,
 + void (*)(void *));
  struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
   unsigned long,
   void (*)(void *));
 diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
 index a32bcfd..4f39fff 100644
 --- a/include/linux/slub_def.h
 +++ b/include/linux/slub_def.h
 @@ -100,6 +100,8 @@ struct kmem_cache {
   struct kobject kobj;/* For sysfs */
  #endif
  
 + struct mem_cgroup *memcg;
 + struct kmem_cache *parent_slab; /* can be moved out of here as well */
  #ifdef CONFIG_NUMA
   /*
 

[Devel] Re: [PATCH 4/7] chained slab caches: move pages to a different cache when a cache is destroyed.

2012-02-21 Thread KAMEZAWA Hiroyuki
On Tue, 21 Feb 2012 15:34:36 +0400
Glauber Costa glom...@parallels.com wrote:

 In the context of tracking kernel memory objects to a cgroup, the
 following problem appears: we may need to destroy a cgroup, but
 this does not guarantee that all objects inside the cache are dead.
 This can't be guaranteed even if we shrink the cache beforehand.
 
 The simple option is to simply leave the cache around. However,
 intensive workloads may have generated a lot of objects and thus
 the dead cache will live in memory for a long while.
 
 Scanning the list of objects in the dead cache takes time, and
 would probably require us to lock the free path of every objects
 to make sure we're not racing against the update.
 
 I decided to give a try to a different idea then - but I'd be
 happy to pursue something else if you believe it would be better.
 
 Upon memcg destruction, all the pages on the partial list
 are moved to the new slab (usually the parent memcg, or root memcg)
 When an object is freed, there are high stakes that no list locks
 are needed - so this case poses no overhead. If list manipulation
 is indeed needed, we can detect this case, and perform it
 in the right slab.
 
 If all pages were residing in the partial list, we can free
 the cache right away. Otherwise, we do it when the last cache
 leaves the full list.
 

How about starting from 'don't handle slabs on a dead memcg',
if shrink_slab() can find them ?

This move complicates the whole implementation, I think...

Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 5/7] shrink support for memcg kmem controller

2012-02-21 Thread KAMEZAWA Hiroyuki
On Tue, 21 Feb 2012 15:34:37 +0400
Glauber Costa glom...@parallels.com wrote:

 This patch adds the shrinker interface to memcg proposed kmem
 controller. With this, softlimits starts being meaningful. I didn't
 played to much with softlimits itself, since it is a bit in progress
 for the general case as well. But this patch at least makes vmscan.c
 no longer skip shrink_slab for the memcg case.
 
 It also allows us to set the hard limit to a lower value than
 current usage, as it is possible for the current memcg: a reclaim
 is carried on, and if we succeed in freeing enough of kernel memory,
 we can lower the limit.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Kirill A. Shutemov kir...@shutemov.name
 CC: Greg Thelen gthe...@google.com
 CC: Johannes Weiner jwei...@redhat.com
 CC: Michal Hocko mho...@suse.cz
 CC: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: Paul Turner p...@google.com
 CC: Frederic Weisbecker fweis...@gmail.com
 CC: Pekka Enberg penb...@kernel.org
 CC: Christoph Lameter c...@linux.com
 ---
  include/linux/memcontrol.h |5 +++
  include/linux/shrinker.h   |4 ++
  mm/memcontrol.c|   87 
 ++--
  mm/vmscan.c|   60 +--
  4 files changed, 150 insertions(+), 6 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 6138d10..246b2d4 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -33,12 +33,16 @@ struct mm_struct;
  struct memcg_kmem_cache {
   struct kmem_cache *cache;
   struct work_struct destroy;
 + struct list_head lru;
 + u32 nr_objects;
   struct mem_cgroup *memcg; /* Should be able to do without this */
  };
  
  struct memcg_cache_struct {
   int index;
   struct kmem_cache *cache;
 + int (*shrink_fn)(struct shrinker *shrink, struct shrink_control *sc);
 + struct shrinker shrink;
  };
  
  enum memcg_cache_indexes {
 @@ -53,6 +57,7 @@ struct mem_cgroup *memcg_from_shrinker(struct shrinker *s);
  struct memcg_kmem_cache *memcg_cache_get(struct mem_cgroup *memcg, int 
 index);
  void register_memcg_cache(struct memcg_cache_struct *cache);
  void memcg_slab_destroy(struct kmem_cache *cache, struct mem_cgroup *memcg);
 +bool memcg_slab_reclaim(struct mem_cgroup *memcg);
  
  struct kmem_cache *
  kmem_cache_dup(struct mem_cgroup *memcg, struct kmem_cache *base);
 diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
 index 07ceb97..11efdba 100644
 --- a/include/linux/shrinker.h
 +++ b/include/linux/shrinker.h
 @@ -1,6 +1,7 @@
  #ifndef _LINUX_SHRINKER_H
  #define _LINUX_SHRINKER_H
  
 +struct mem_cgroup;
  /*
   * This struct is used to pass information from page reclaim to the 
 shrinkers.
   * We consolidate the values for easier extention later.
 @@ -10,6 +11,7 @@ struct shrink_control {
  
   /* How many slab objects shrinker() should scan and try to reclaim */
   unsigned long nr_to_scan;
 + struct mem_cgroup *memcg;
  };
  
  /*
 @@ -40,4 +42,6 @@ struct shrinker {
  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
  extern void register_shrinker(struct shrinker *);
  extern void unregister_shrinker(struct shrinker *);
 +
 +extern void register_shrinker_memcg(struct shrinker *);
  #endif
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 1b1db88..9c89a3c 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -3460,6 +3460,54 @@ static int mem_cgroup_resize_limit(struct mem_cgroup 
 *memcg,
   return ret;
  }
  
 +static int mem_cgroup_resize_kmem_limit(struct mem_cgroup *memcg,
 + unsigned long long val)
 +{
 +
 + int retry_count;
 + int ret = 0;
 + int children = mem_cgroup_count_children(memcg);
 + u64 curusage, oldusage;
 +
 + struct shrink_control shrink = {
 + .gfp_mask = GFP_KERNEL,
 + .memcg = memcg,
 + };
 +
 + /*
 +  * For keeping hierarchical_reclaim simple, how long we should retry
 +  * is depends on callers. We set our retry-count to be function
 +  * of # of children which we should visit in this loop.
 +  */
 + retry_count = MEM_CGROUP_RECLAIM_RETRIES * children;
 +
 +	oldusage = res_counter_read_u64(&memcg->kmem, RES_USAGE);
 +
 + while (retry_count) {
 + if (signal_pending(current)) {
 + ret = -EINTR;
 + break;
 + }
 +	mutex_lock(&set_limit_mutex);
 +	ret = res_counter_set_limit(&memcg->kmem, val);
 +	mutex_unlock(&set_limit_mutex);
 + if (!ret)
 + break;
 +
 +	shrink_slab(&shrink, 0, 0);
 +
 +	curusage = res_counter_read_u64(&memcg->kmem, RES_USAGE);
 +
 +	/* Usage is reduced ? */
 +	if (curusage >= oldusage)
 + retry_count--;
 + else
 + oldusage = curusage;

[Devel] Re: [PATCH v9 0/9] Request for inclusion: per-cgroup tcp memory pressure controls

2011-12-14 Thread KAMEZAWA Hiroyuki
On Mon, 12 Dec 2011 11:47:00 +0400
Glauber Costa glom...@parallels.com wrote:

 Hi,
 
 This series fixes all the few comments raised in the last round,
 and seem to have acquired consensus from the memcg side.
 
 Dave, do you think it is acceptable now from the networking PoV?
 In case positive, would you prefer merging this trough your tree,
 or acking this so a cgroup maintainer can do it?
 

I hit this bug on the _1st_ run. Please enable _all_ debug options!

From this log, your mem_cgroup_sockets_init() is totally buggy, because the
tcp init routine calls percpu_alloc() under read_lock.

Please fix. I'll turn off the config for today's linux-next.

Regards,
-Kame


==
Dec 15 14:47:15 bluextal kernel: [  228.386054] BUG: sleeping function called from invalid context at kernel/mutex.c:271
Dec 15 14:47:15 bluextal kernel: [  228.386286] in_atomic(): 1, irqs_disabled(): 0, pid: 2692, name: cgcreate
Dec 15 14:47:15 bluextal kernel: [  228.386438] 4 locks held by cgcreate/2692:
Dec 15 14:47:15 bluextal kernel: [  228.386580]  #0:  (&sb->s_type->i_mutex_key#15/1){+.+.+.}, at: [ffff8118717e] kern_path_create+0x8e/0x150
Dec 15 14:47:15 bluextal kernel: [  228.387238]  #1:  (cgroup_mutex){+.+.+.}, at: [810c2907] cgroup_mkdir+0x77/0x3f0
Dec 15 14:47:15 bluextal kernel: [  228.387755]  #2:  (&sb->s_type->i_mutex_key#15/2){+.+.+.}, at: [ffff810be168] cgroup_create_file+0xc8/0xf0
Dec 15 14:47:15 bluextal kernel: [  228.388401]  #3:  (proto_list_lock){.+}, at: [814e7fe4] mem_cgroup_sockets_init+0x24/0xf0
Dec 15 14:47:15 bluextal kernel: [  228.388922] Pid: 2692, comm: cgcreate Not tainted 3.2.0-rc5-next-20111214+ #34
Dec 15 14:47:15 bluextal kernel: [  228.389154] Call Trace:
Dec 15 14:47:15 bluextal kernel: [  228.389296]  [81080f07] __might_sleep+0x107/0x140
Dec 15 14:47:15 bluextal kernel: [  228.389446]  [815ce15f] mutex_lock_nested+0x2f/0x50
Dec 15 14:47:15 bluextal kernel: [  228.389597]  [811354fd] pcpu_alloc+0x4d/0xa20
Dec 15 14:47:15 bluextal kernel: [  228.389745]  [810a5759] ? debug_check_no_locks_freed+0xa9/0x180
Dec 15 14:47:15 bluextal kernel: [  228.389897]  [810a5615] ? trace_hardirqs_on_caller+0x105/0x190
Dec 15 14:47:15 bluextal kernel: [  228.390057]  [810a56ad] ? trace_hardirqs_on+0xd/0x10
Dec 15 14:47:15 bluextal kernel: [  228.390206]  [810a3600] ? lockdep_init_map+0x50/0x140
Dec 15 14:47:15 bluextal kernel: [  228.390356]  [81135f00] __alloc_percpu+0x10/0x20
Dec 15 14:47:15 bluextal kernel: [  228.390506]  [812fda18] __percpu_counter_init+0x58/0xc0
Dec 15 14:47:15 bluextal kernel: [  228.390658]  [8158fb26] tcp_init_cgroup+0xd6/0x140
Dec 15 14:47:15 bluextal kernel: [  228.390808]  [814e8014] mem_cgroup_sockets_init+0x54/0xf0
Dec 15 14:47:15 bluextal kernel: [  228.390961]  [8116b47b] mem_cgroup_populate+0xab/0xc0
Dec 15 14:47:15 bluextal kernel: [  228.391120]  [810c053a] cgroup_populate_dir+0x7a/0x110
Dec 15 14:47:15 bluextal kernel: [  228.391271]  [810c2b16] cgroup_mkdir+0x286/0x3f0
Dec 15 14:47:15 bluextal kernel: [  228.391420]  [811836b4] vfs_mkdir+0xa4/0xe0
Dec 15 14:47:15 bluextal kernel: [  228.391566]  [8118747d] sys_mkdirat+0xcd/0xe0
Dec 15 14:47:15 bluextal kernel: [  228.391714]  [811874a8] sys_mkdir+0x18/0x20
Dec 15 14:47:15 bluextal kernel: [  228.391862]  [815d86cf] tracesys+0xdd/0xe2
==

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: How to draw values for /proc/stat

2011-12-11 Thread KAMEZAWA Hiroyuki
On Sun, 11 Dec 2011 15:50:56 +0100
Glauber Costa glom...@parallels.com wrote:

 On 12/09/2011 03:55 PM, Glauber Costa wrote:
  On 12/09/2011 12:03 PM, Peter Zijlstra wrote:
  On Mon, 2011-12-05 at 07:32 -0200, Glauber Costa wrote:
  Hi,
 
  Specially Peter and Paul, but all the others:
 
  As you can see in https://lkml.org/lkml/2011/12/4/178, and in my answer
  to that, there is a question - one I've asked before but without that
  much of an audience - of whether /proc files read from process living on
  cgroups should display global or per-cgroup resources.
 
  In the past, I was arguing for a knob to control that, but I recently
  started to believe that a knob here will only overcomplicate matters:
  if you live in a cgroup, you should display only the resources you can
  possibly use. Global is for whoever is in the main cgroup.
 
  Now, it comes two questions:
  1) Do you agree with that, for files like /proc/stat ? I think the most
  important part is to be consistent inside the system, regardless of what
  is done
 
  Personally I don't give a rats arse about (/proc vs) cgroups :-)
  Currently /proc is unaffected by whatever cgroup you happen to be in and
  that seems to make some sort of sense.
 
  Namespaces seem to be about limiting visibility, cgroups about
  controlling resources.
 
  The two things are hopelessly disjoint atm, but I believe someone was
  looking at this mess.
 
  I did take a look at this (if anyone else was, I'd like to know so we
  can share some ideas), but I am not convinced we should do anything to
  join them anymore. We virtualization people are to the best of my
  knowledge the only ones doing namespaces. Cgroups, OTOH, got a lot bigger.
 
  What I am mostly concerned about now, is how consistent they will be.
  /proc always being always global indeed does make sense, but my question
  still stands: if you live in a resource-controlled world, why should you
  even see resources you will never own ?
 
 
  IOW a /proc namespace coupled to cgroup scope would do what you want.
  Now my head hurts..
 
  Mine too. The idea is good, but too broad. Boils down to: How do you
  couple them? And none of the methods I thought about seemed to make any
  sense.
 
  If we really want to have the values in /proc being opted-in, I think
  Kamezawa's idea of a mount option is the winner so far.
 
 
 Ok:
 
 How about the following patch to achieve this ?

Hmm, what I thought of was a mount option for procfs. Containers will mount their own
/proc file systems. Do you have any pros/cons ?
IIUC, cgroup can be mounted per subsystem. Then, options can be passed per
subsystem. It's a mess, but we don't need to bring this into procfs.

How about

  # mount -t procfs proc /container_root/proc -o cgroup_aware

to show a cgroup-aware procfs ? I think this will be easy to use with
namespaces/chroot, etc.
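
A rough sketch of how such a mount option might gate the output (all the helper
names below are hypothetical; this is not a posted patch):

/* in fs/proc/stat.c */
static int show_stat(struct seq_file *m, void *v)
{
	/* procfs_cgroup_aware() would test a flag recorded at mount time,
	 * i.e. whether this procfs instance was mounted with -o cgroup_aware. */
	if (procfs_cgroup_aware(m))
		return show_stat_memcg(m, mem_cgroup_from_task(current));

	return show_stat_global(m);	/* today's behaviour */
}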

Thanks,
-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 3/9] socket: initial cgroup code.

2011-12-11 Thread KAMEZAWA Hiroyuki
On Fri, 9 Dec 2011 10:43:00 -0200
Glauber Costa glom...@parallels.com wrote:

 On 12/09/2011 12:05 AM, KAMEZAWA Hiroyuki wrote:
  On Mon,  5 Dec 2011 19:34:57 -0200
  Glauber Costaglom...@parallels.com  wrote:
 
  The goal of this work is to move the memory pressure tcp
  controls to a cgroup, instead of just relying on global
  conditions.
 
  To avoid excessive overhead in the network fast paths,
  the code that accounts allocated memory to a cgroup is
  hidden inside a static_branch(). This branch is patched out
  until the first non-root cgroup is created. So when nobody
  is using cgroups, even if it is mounted, no significant performance
  penalty should be seen.
 
  This patch handles the generic part of the code, and has nothing
  tcp-specific.
 
  Signed-off-by: Glauber Costaglom...@parallels.com
  CC: Kirill A. Shutemovkir...@shutemov.name
  CC: KAMEZAWA Hiroyukikamezawa.hir...@jp.fujtsu.com
  CC: David S. Millerda...@davemloft.net
  CC: Eric W. Biedermanebied...@xmission.com
  CC: Eric Dumazeteric.duma...@gmail.com
 
  I already replied Reviewed-by: but...
 Feel free. Reviews, the more, the merrier.
 
 
 
  +/* Writing them here to avoid exposing memcg's inner layout */
  +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
  +#ifdef CONFIG_INET
   +#include <net/sock.h>
  +
  +static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
  +void sock_update_memcg(struct sock *sk)
  +{
  +  /* A socket spends its whole life in the same cgroup */
   +	if (sk->sk_cgrp) {
   +		WARN_ON(1);
   +		return;
   +	}
   +	if (static_branch(&memcg_socket_limit_enabled)) {
   +		struct mem_cgroup *memcg;
   +
   +		BUG_ON(!sk->sk_prot->proto_cgroup);
   +
   +		rcu_read_lock();
   +		memcg = mem_cgroup_from_task(current);
   +		if (!mem_cgroup_is_root(memcg)) {
   +			mem_cgroup_get(memcg);
   +			sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
  +  }
  +  rcu_read_unlock();
  +  }
  +}
 
  Here, you do mem_cgroup_get() if !mem_cgroup_is_root().
 
 
  +EXPORT_SYMBOL(sock_update_memcg);
  +
  +void sock_release_memcg(struct sock *sk)
  +{
   +	if (static_branch(&memcg_socket_limit_enabled) && sk->sk_cgrp) {
   +		struct mem_cgroup *memcg;
   +		WARN_ON(!sk->sk_cgrp->memcg);
   +		memcg = sk->sk_cgrp->memcg;
  +  mem_cgroup_put(memcg);
  +  }
  +}
 
 
  You don't check !mem_cgroup_is_root(). Hm, root memcg will not be freed
  by this ?
 
 No, I don't. But I check if sk->sk_cgrp is filled. So it is implied, 
 because we only fill in this value if !mem_cgroup_is_root().

Ah, ok. thank you.
-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 1/9] Basic kernel memory functionality for the Memory Controller

2011-12-11 Thread KAMEZAWA Hiroyuki
On Fri, 9 Dec 2011 12:37:23 -0200
Glauber Costa glom...@parallels.com wrote:

 On 12/08/2011 11:21 PM, KAMEZAWA Hiroyuki wrote:
  Hm, why do you check val != parent->kmem_independent_accounting ?
 
  	if (parent && parent->use_hierarchy)
  		return -EINVAL;
  ?
 
  BTW, you didn't check whether this cgroup has children or not.
  I think
 
  	if (this_cgroup->use_hierarchy &&
  	    !list_empty(this_cgroup->children))
  		return -EINVAL;
 
 How about this?
 
  val = !!val;
 
  /*
   * This follows the same hierarchy restrictions than
   * mem_cgroup_hierarchy_write()
   */
 	if (!parent || !parent->use_hierarchy) {
 		if (list_empty(&cgroup->children))
 			memcg->kmem_independent_accounting = val;
  else
  return -EBUSY;
  }
  else
  return -EINVAL;
 
  return 0;
 
seems good to me.

Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 2/9] foundations of per-cgroup memory pressure controlling.

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:34:56 -0200
Glauber Costa glom...@parallels.com wrote:

 This patch replaces all uses of struct sock fields' memory_pressure,
 memory_allocated, sockets_allocated, and sysctl_mem to acessor
 macros. Those macros can either receive a socket argument, or a mem_cgroup
 argument, depending on the context they live in.
 
 Since we're only doing a macro wrapping here, no performance impact at all is
 expected in the case where we don't have cgroups disabled.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: David S. Miller da...@davemloft.net
 CC: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: Eric W. Biederman ebied...@xmission.com
 CC: Eric Dumazet eric.duma...@gmail.com

please get ack from network guys. 
from me.
Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 3/9] socket: initial cgroup code.

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:34:57 -0200
Glauber Costa glom...@parallels.com wrote:

 The goal of this work is to move the memory pressure tcp
 controls to a cgroup, instead of just relying on global
 conditions.
 
 To avoid excessive overhead in the network fast paths,
 the code that accounts allocated memory to a cgroup is
 hidden inside a static_branch(). This branch is patched out
 until the first non-root cgroup is created. So when nobody
 is using cgroups, even if it is mounted, no significant performance
 penalty should be seen.
 
 This patch handles the generic part of the code, and has nothing
 tcp-specific.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Kirill A. Shutemov kir...@shutemov.name
 CC: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujtsu.com
 CC: David S. Miller da...@davemloft.net
 CC: Eric W. Biederman ebied...@xmission.com
 CC: Eric Dumazet eric.duma...@gmail.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 4/9] tcp memory pressure controls

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:34:58 -0200
Glauber Costa glom...@parallels.com wrote:

 This patch introduces memory pressure controls for the tcp
 protocol. It uses the generic socket memory pressure code
 introduced in earlier patches, and fills in the
 necessary data in cg_proto struct.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujtisu.com
 CC: Eric W. Biederman ebied...@xmission.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 5/9] per-netns ipv4 sysctl_tcp_mem

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:34:59 -0200
Glauber Costa glom...@parallels.com wrote:

 This patch allows each namespace to independently set up
 its levels for tcp memory pressure thresholds. This patch
 alone does not buy much: we need to make this values
 per group of process somehow. This is achieved in the
 patches that follows in this patchset.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: David S. Miller da...@davemloft.net
 CC: Eric W. Biederman ebied...@xmission.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 6/9] tcp buffer limitation: per-cgroup limit

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:35:00 -0200
Glauber Costa glom...@parallels.com wrote:

 This patch uses the tcp.limit_in_bytes field of the kmem_cgroup to
 effectively control the amount of kernel memory pinned by a cgroup.
 
 This value is ignored in the root cgroup, and in all others,
 caps the value specified by the admin in the net namespaces'
 view of tcp_sysctl_mem.
 
 If namespaces are being used, the admin is allowed to set a
 value bigger than cgroup's maximum, the same way it is allowed
 to set pretty much unlimited values in a real box.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: David S. Miller da...@davemloft.net
 CC: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: Eric W. Biederman ebied...@xmission.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 7/9] Display current tcp memory allocation in kmem cgroup

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:35:01 -0200
Glauber Costa glom...@parallels.com wrote:

 This patch introduces kmem.tcp.usage_in_bytes file, living in the
 kmem_cgroup filesystem. It is a simple read-only file that displays the
 amount of kernel memory currently consumed by the cgroup.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 Reviewed-by: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: David S. Miller da...@davemloft.net
 CC: Eric W. Biederman ebied...@xmission.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 9/9] Display maximum tcp memory allocation in kmem cgroup

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:35:03 -0200
Glauber Costa glom...@parallels.com wrote:

 This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
 kmem_cgroup filesystem. The root cgroup will display a value equal
 to RESOURCE_MAX. This is to avoid introducing any locking schemes in
 the network paths when cgroups are not being actively used.
 
 All others, will see the maximum memory ever used by this cgroup.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 Reviewed-by: Hiroyouki Kamezawa kamezawa.hir...@jp.fujitsu.com
 CC: David S. Miller da...@davemloft.net
 CC: Eric W. Biederman ebied...@xmission.com

Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v8 3/9] socket: initial cgroup code.

2011-12-08 Thread KAMEZAWA Hiroyuki
On Mon,  5 Dec 2011 19:34:57 -0200
Glauber Costa glom...@parallels.com wrote:

 The goal of this work is to move the memory pressure tcp
 controls to a cgroup, instead of just relying on global
 conditions.
 
 To avoid excessive overhead in the network fast paths,
 the code that accounts allocated memory to a cgroup is
 hidden inside a static_branch(). This branch is patched out
 until the first non-root cgroup is created. So when nobody
 is using cgroups, even if it is mounted, no significant performance
 penalty should be seen.
 
 This patch handles the generic part of the code, and has nothing
 tcp-specific.
 
 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Kirill A. Shutemov kir...@shutemov.name
 CC: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujtsu.com
 CC: David S. Miller da...@davemloft.net
 CC: Eric W. Biederman ebied...@xmission.com
 CC: Eric Dumazet eric.duma...@gmail.com

I already replied Reviewed-by: but...


 +/* Writing them here to avoid exposing memcg's inner layout */
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +#ifdef CONFIG_INET
  +#include <net/sock.h>
 +
 +static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
 +void sock_update_memcg(struct sock *sk)
 +{
 + /* A socket spends its whole life in the same cgroup */
  +	if (sk->sk_cgrp) {
  +		WARN_ON(1);
  +		return;
  +	}
  +	if (static_branch(&memcg_socket_limit_enabled)) {
  +		struct mem_cgroup *memcg;
  +
  +		BUG_ON(!sk->sk_prot->proto_cgroup);
  +
  +		rcu_read_lock();
  +		memcg = mem_cgroup_from_task(current);
  +		if (!mem_cgroup_is_root(memcg)) {
  +			mem_cgroup_get(memcg);
  +			sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
 + }
 + rcu_read_unlock();
 + }
 +}

Here, you do mem_cgroup_get() if !mem_cgroup_is_root().


 +EXPORT_SYMBOL(sock_update_memcg);
 +
 +void sock_release_memcg(struct sock *sk)
 +{
  +	if (static_branch(&memcg_socket_limit_enabled) && sk->sk_cgrp) {
  +		struct mem_cgroup *memcg;
  +		WARN_ON(!sk->sk_cgrp->memcg);
  +		memcg = sk->sk_cgrp->memcg;
 + mem_cgroup_put(memcg);
 + }
 +}


You don't check !mem_cgroup_is_root(). Hm, root memcg will not be freed
by this ?

Thanks,
-Kame


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v7 00/10] Request for Inclusion: per-cgroup tcp memory pressure

2011-12-05 Thread KAMEZAWA Hiroyuki
On Mon, 5 Dec 2011 07:09:51 -0200
Glauber Costa glom...@parallels.com wrote:

 On 12/05/2011 12:06 AM, KAMEZAWA Hiroyuki wrote:
  On Fri, 2 Dec 2011 16:04:08 -0200
  Glauber Costaglom...@parallels.com  wrote:
 
  On 11/30/2011 12:11 AM, KAMEZAWA Hiroyuki wrote:
  On Tue, 29 Nov 2011 21:56:51 -0200
  Glauber Costaglom...@parallels.com   wrote:
 
  Hi,
 
  This patchset implements per-cgroup tcp memory pressure controls. It did 
  not change
  significantly since last submission: rather, it just merges the comments 
  Kame had.
  Most of them are style-related and/or Documentation, but there are two 
  real bugs he
  managed to spot (thanks)
 
  Please let me know if there is anything else I should address.
 
 
   After reading all the code again, something feels strange. Could you clarify ?
 
  Here.
  ==
  +void sock_update_memcg(struct sock *sk)
  +{
  + /* right now a socket spends its whole life in the same cgroup */
   +	if (sk->sk_cgrp) {
   +		WARN_ON(1);
   +		return;
   +	}
   +	if (static_branch(&memcg_socket_limit_enabled)) {
   +		struct mem_cgroup *memcg;
   +
   +		BUG_ON(!sk->sk_prot->proto_cgroup);
   +
   +		rcu_read_lock();
   +		memcg = mem_cgroup_from_task(current);
   +		if (!mem_cgroup_is_root(memcg))
   +			sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
  + rcu_read_unlock();
  ==
 
  sk->sk_cgrp is set to a memcg without any reference count.
 
  Then, there is no check preventing rmdir() from freeing the mem cgroup.
 
  Is there some css_get() or mem_cgroup_get() somewhere ?
 
 
  There were a css_get in the first version of this patchset. It was
  removed, however, because it was deemed anti-intuitive to prevent rmdir,
  since we can't know which sockets are blocking it, or do anything about
  it. Or did I misunderstand something ?
 
 
  Maybe I misunderstood. Thank you. Ok, there is no css_get/put and
  rmdir() is allowed. But, hmm... what's guarding threads from stale
  pointer access ?
 
  Does a memory cgroup which is pointed to by sk->sk_cgrp always exist ?
 
 If I am not mistaken, yes, it will. (Ok, right now it won't)
 
 Reason is a cgroup can't be removed unless it is empty.
 To make it empty, you need to move the tasks away.
 
 So the sockets will be moved away as well when you do it. So right now 
 they are not, so it would then probably be better to increase a 
 reference count with a comment saying that it is temporary.
 

I'm sorry if I misunderstand.

At task exit, __fput() will be called on the file descriptors, yes.
__fput() calls f_op->release() => inet_release() => tcp_close().

But a TCP socket may stay alive after task exit, until the protocol-level
close completes. For example, until all the messages in the send buffer
are acked, the socket and the tcp connection will not disappear.

In short, a socket's lifetime is different from its task's.
So, there may be sockets which do not belong to any task.
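
To illustrate the window this creates, the reference-count pairing from the v8
code quoted elsewhere in this thread (a sketch of the call sites, not new API):

	/* process context, at socket creation */
	sock_update_memcg(sk);	/* mem_cgroup_get(memcg); sets sk->sk_cgrp */

	/* the owning task may exit here; the socket survives until the
	 * protocol is done, e.g. until the send buffer is fully acked */

	/* protocol close (tcp_close()/inet_release()), maybe long after exit */
	sock_release_memcg(sk);	/* mem_cgroup_put(sk->sk_cgrp->memcg) */

The reference taken in sock_update_memcg() is what keeps the memcg valid
between those two points.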



Thanks,
-Kame













___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: How to draw values for /proc/stat

2011-12-05 Thread KAMEZAWA Hiroyuki
On Mon, 5 Dec 2011 07:32:33 -0200
Glauber Costa glom...@parallels.com wrote:

 Hi,
 
 Specially Peter and Paul, but all the others:
 
 As you can see in https://lkml.org/lkml/2011/12/4/178, and in my answer 
 to that, there is a question - one I've asked before but without that 
 much of an audience - of whether /proc files read from process living on 
 cgroups should display global or per-cgroup resources.
 
 In the past, I was arguing for a knob to control that, but I recently 
 started to believe that a knob here will only overcomplicate matters:
 if you live in a cgroup, you should display only the resources you can 
 possibly use. Global is for whoever is in the main cgroup.
 

Hm. I have a suggestion and a concern.

(A suggestion)
   How about having a mount option for procfs ?
   For example,
mount -t proc  -o cgroup_virtualized
   Then, /proc/stat etc shows per-cgroup information.

(A concern)
   /proc/stat will be a mixture of virtualized values and non-virtualized values.
   1. Don't users need to know whether each value is virtualized or not ?
   2. Can we have a way to show a "this value is virtualized!" annotation ?


 Now, it comes two questions:
 1) Do you agree with that, for files like /proc/stat ? I think the most 
 important part is to be consistent inside the system, regardless of what 
 is done
 
I think some kind of care for users is required, as I wrote above.


 2) Will cpuacct stay? I think if it does, that becomes almost mandatory 
 (at least the bind mount idea is pretty much over here), because drawing 
 value for /proc/stat becomes quite complex.
 The cpuacct cgroup can provide user, sys, etc values. But we also have:
 

If virtualized /proc/stat works, I don't think 'account only' cgroup is
necessary. It can be obsolete.

Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v7 02/10] foundations of per-cgroup memory pressure controlling.

2011-12-04 Thread KAMEZAWA Hiroyuki
On Fri, 2 Dec 2011 15:46:46 -0200
Glauber Costa glom...@parallels.com wrote:

 
static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{
  +  struct mem_cgroup *memcg = mem_cgroup_from_task(current);
  +
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
 			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c "
 			"%2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
 -		   proto->memory_allocated != NULL ?
 			atomic_long_read(proto->memory_allocated) : -1L,
 -		   proto->memory_pressure != NULL ?
 			*proto->memory_pressure ? "yes" : "no" : "NI",
  + sock_prot_memory_allocated(proto, memcg),
  + sock_prot_memory_pressure(proto, memcg),
 
  I wonder if I should say NO here. (Networking guys, are you ok with this ??)
 
  IIUC, this means there is no way to see the aggregated sockstat of the whole system.
  And the result depends on the cgroup the caller is under.
 
  I think you should show the aggregated sockstat (global + per-memcg) here and
  show the per-memcg ones via the /cgroup interface, or add a private_sockstat
  to show a per-cgroup summary.
 
 
 Hi Kame,
 
 Yes, the statistics displayed depends on which cgroup you live.
 Also, note that the parent cgroup here is always updated (even when 
 use_hierarchy is set to 0). So it is always possible to grab global 
 statistics, by being in the root cgroup.
 
 For the others, I believe it to be a question of naturalization. Any 
 tool that is fetching these values is likely interested in the amount of 
 resources available/used. When you are on a cgroup, the amount of 
 resources available/used changes, so that's what you should see.
 
  It also brings up the point of resource isolation: if you shouldn't interfere 
  with another set of processes' resources, there is no reason for you to see 
  them in the first place.
 
 So given all that, I believe that whenever we talk about resources in a 
 cgroup, we should talk about cgroup-local ones.

But you are changing /proc/ information without any discussion with the other guys.
If you go this way, you should split this patch out as an independent add-on patch
and discuss what it should be. For example, /proc/meminfo doesn't reflect
memcg's information (for now). And the scheduler statistics in /proc/stat don't
reflect cgroup's information either.

So, please discuss the problem in an open way. This issue is not only related to
this patch but also to other cgroups. Sneaking this kind of _big_ change into
the middle of a complicated patch series isn't good.

In short, could you split this patch into an independent patch and discuss it
again ? If we agree the general direction should go this way, other guys will
post patches for cpu, memory, blkio, etc.


Thanks,
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v7 04/10] tcp memory pressure controls

2011-12-04 Thread KAMEZAWA Hiroyuki
On Fri, 2 Dec 2011 15:57:28 -0200
Glauber Costa glom...@parallels.com wrote:

 On 11/29/2011 11:49 PM, KAMEZAWA Hiroyuki wrote:
 
  -static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
  +struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
{
 return container_of(cgroup_subsys_state(cont,
 mem_cgroup_subsys_id), struct mem_cgroup,
  @@ -4717,14 +4732,27 @@ static int register_kmem_files(struct cgroup 
  *cont, struct cgroup_subsys *ss)
 
 ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
ARRAY_SIZE(kmem_cgroup_files));
  +
  +  if (!ret)
  +  ret = mem_cgroup_sockets_init(cont, ss);
 return ret;
};
 
  You do the initialization here. The reason, I think, is
  1. 'proto_list' is not available at creation of the root cgroup, so
   you need to delay the setup until mounting.
 
  If so, please add a comment or find another way.
  This doesn't seem very clean to me.
 
 Yes, we can indeed run into some ordering issues. A part of the 
 initialization can be done earlier, but I preferred to move it all later
 instead of creating two functions for it. I can change that if you 
 want, no big deal.
 

Hmm. Please add comments about the 'issue'. It will help readers.
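Something along these lines near the call site would be enough (the wording is
only a suggestion):

/*
 * Protocols register on proto_list after the root memcg has been created,
 * so the per-protocol socket accounting has to be set up late, when the
 * cgroup files are registered, rather than at memcg creation time.
 */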


  +  tcp->tcp_prot_mem[0] = sysctl_tcp_mem[0];
  +  tcp->tcp_prot_mem[1] = sysctl_tcp_mem[1];
  +  tcp->tcp_prot_mem[2] = sysctl_tcp_mem[2];
  +  tcp->tcp_memory_pressure = 0;
 
  Question:
 
  Will this value be updated when an admin changes the sysctl ?
 
 yes.
 
  I guess this value is set by a system init script or the like, which may
  happen later than mounting the cgroup.
  I don't like having to write a guideline like 'please set the sysctl value
  before mounting the cgroup'.
 
 Agreed.
 
 This code is in patch 6 (together with the limiting):
 
 +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 +   rcu_read_lock();
 +   memcg = mem_cgroup_from_task(current);
 +
 +   tcp_prot_mem(memcg, vec[0], 0);
 +   tcp_prot_mem(memcg, vec[1], 1);
 +   tcp_prot_mem(memcg, vec[2], 2);
 +   rcu_read_unlock();
 +#endif
 
 tcp_prot_mem is just a wrapper around the assignment so we can access 
 memcg's inner fields.
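 
 Roughly like this (the lookup helper and field names are assumptions, not the
 patch's actual code):
 
 static void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
 {
 	struct tcp_memcontrol *tcp = tcp_from_memcg(memcg);   /* hypothetical lookup */
 
 	if (tcp)
 		tcp->tcp_prot_mem[idx] = val;   /* mirrors sysctl_tcp_mem[idx] per memcg */
 }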
 


Ok. The sysctl and the cgroup are updated at the same time.
Thank you.
-Kame

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

