[Devel] Re: + memcg-fix-deadlock-between-cpuset-and-memcg.patch added to -mm tree

2011-01-05 Thread Daisuke Nishimura
(resend with adding Cc: containers and linux-mm)

Hi, Andrew.

Thank you for picking up this patch.

But this version has a small race problem. This is a fix-up patch.

===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

We must clear mc.moving_task before waking up all waiters at the end of
task migration.

Otherwise, there can be a small race like:

mem_cgroup_clear_mc()            |  mem_cgroup_wait_acct_move()
---------------------------------+------------------------------------
  __mem_cgroup_clear_mc()        |
    wake_up_all()                |
                                 |    prepare_to_wait()
                                 |    if (mc.moving_task) -> true
                                 |      schedule()
                                 |      -> no one wakes it up.
  mc.moving_task = NULL          |
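
For illustration, the waiter side follows the usual prepare_to_wait()/schedule()
pattern; this is only a rough sketch, not necessarily identical to the real
mem_cgroup_wait_acct_move():

static bool wait_for_move_sketch(void)
{
	DEFINE_WAIT(wait);

	if (!mc.moving_task || mc.moving_task == current)
		return false;
	prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
	if (mc.moving_task)	/* the re-check shown in the diagram above */
		schedule();
	finish_wait(&mc.waitq, &wait);
	return true;
}

Because the waiter is already on mc.waitq when it re-checks mc.moving_task,
clearing mc.moving_task before wake_up_all() means the waiter either sees NULL
and skips schedule(), or is woken by the subsequent wake_up_all().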

Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 mm/memcontrol.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b108b30..61678be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4714,12 +4714,16 @@ static void mem_cgroup_clear_mc(void)
 {
struct mem_cgroup *from = mc.from;
 
+   /*
+* we must clear moving_task before waking up waiters at the end of
+* task migration.
+*/
+   mc.moving_task = NULL;
__mem_cgroup_clear_mc();
spin_lock(&mc.lock);
mc.from = NULL;
mc.to = NULL;
spin_unlock(&mc.lock);
-   mc.moving_task = NULL;
mem_cgroup_end_move(from);
 }
 
-- 
1.7.1



[Devel] [RFC][BUGFIX] memcg: fix dead lock between cpuset and memcg (Re: [PATCH v5 3/3] cgroups: make procs file writable)

2010-12-27 Thread Daisuke Nishimura
 It looks to me like when memcg holds the mmap_sem the whole time, it's
 just to avoid the deadlock, not that there's some need for the
 stuff under mmap_sem not to change between can_attach and attach. But if
 there is such a need, then the write-side in mpol_rebind_mm may conflict
 even with my proposed solution.
 
 Regardless, the best way would be to avoid holding the mmap_sem across
 the whole window, possibly by solving the move_charge deadlock some
 other internal way, if at all possible?
 
I made a patch to fix these problems (the deadlock between cpuset and memcg that
commit b1dd693e introduces, and the deadlocks which that commit fixed).
I'll test and resend this after the new year holidays in Japan.

===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

Commit b1dd693e (memcg: avoid deadlock between move charge and try_charge())
can cause another deadlock on mmap_sem during task migration if cpuset and memcg
are mounted onto the same mount point.

After the commit, cgroup_attach_task() has a sequence like:

cgroup_attach_task()
  ss->can_attach()
    cpuset_can_attach()
    mem_cgroup_can_attach()
      down_read(mmap_sem)          (1)
  ss->attach()
    cpuset_attach()
      mpol_rebind_mm()
        down_write(mmap_sem)       (2)
        up_write(mmap_sem)
      cpuset_migrate_mm()
        do_migrate_pages()
          down_read(mmap_sem)
          up_read(mmap_sem)
    mem_cgroup_move_task()
      mem_cgroup_clear_mc()
        up_read(mmap_sem)

We hit a deadlock at (2) because we have already acquired the mmap_sem at (1).
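
For what it's worth, the core of the problem can be reproduced with a tiny
userspace analogue: taking the write side of a reader-writer lock while the
same thread already holds the read side blocks forever, just like down_read()
at (1) followed by down_write() at (2) on mmap_sem.

#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_rwlock_t sem = PTHREAD_RWLOCK_INITIALIZER;

	pthread_rwlock_rdlock(&sem);	/* (1) mem_cgroup_can_attach(): down_read() */
	printf("read lock held, trying to take the write lock...\n");
	pthread_rwlock_wrlock(&sem);	/* (2) mpol_rebind_mm(): down_write()       */
	printf("never reached\n");	/* typically hangs here (or EDEADLK)        */
	return 0;
}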

But the commit itself is necessary to fix deadlocks which existed before it,
like:

Ex.1)
move charge |try charge
  --+--
mem_cgroup_can_attach() |  down_write(mmap_sem)
  mc.moving_task = current  |..
  mem_cgroup_precharge_mc() |  __mem_cgroup_try_charge()
mem_cgroup_count_precharge()|prepare_to_wait()
  down_read(mmap_sem)  |if (mc.moving_task)
  -> cannot acquire the lock |-> true
|  schedule()
|  -> move charge should wake it up

Ex.2)
move charge |try charge
  --+--
mem_cgroup_can_attach() |
  mc.moving_task = current  |
  mem_cgroup_precharge_mc() |
mem_cgroup_count_precharge()|
  down_read(mmap_sem)  |
  ..|
  up_read(mmap_sem)|
|  down_write(mmap_sem)
mem_cgroup_move_task()  |..
  mem_cgroup_move_charge()  |  __mem_cgroup_try_charge()
down_read(mmap_sem)|prepare_to_wait()
-> cannot acquire the lock   |if (mc.moving_task)
|-> true
|  schedule()
|  -> move charge should wake it up

This patch fixes all of these problems by:
1. reverting the commit.
2. To fix Ex.1, setting mc.moving_task only after mem_cgroup_count_precharge()
   has released the mmap_sem.
3. To fix Ex.2, using down_read_trylock() instead of down_read() in
   mem_cgroup_move_charge() and, if it fails to acquire the lock, cancelling
   all extra charges, waking up all waiters, and retrying the trylock (see the
   sketch below).
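
A rough sketch of what point 3 looks like in code (close to, but not necessarily
identical to, the final patch):

static void mem_cgroup_move_charge(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	lru_add_drain_all();
retry:
	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
		/*
		 * Someone holding the mmap_sem may be waiting in our waitq.
		 * Cancel all extra charges, wake up all waiters, and retry.
		 * Moving charge is a best-effort feature, so losing some
		 * precharge here is acceptable.
		 */
		__mem_cgroup_clear_mc();
		cond_resched();
		goto retry;
	}
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		int ret;
		struct mm_walk mem_cgroup_move_charge_walk = {
			.pmd_entry = mem_cgroup_move_charge_pte_range,
			.mm = mm,
			.private = vma,
		};
		if (is_vm_hugetlb_page(vma))
			continue;
		ret = walk_page_range(vma->vm_start, vma->vm_end,
				      &mem_cgroup_move_charge_walk);
		if (ret)
			/* someone else holds a reference; stop early */
			break;
	}
	up_read(&mm->mmap_sem);
}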

Reported-by: Ben Blum bb...@andrew.cmu.edu
Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 mm/memcontrol.c |   78 +++
 1 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a22b41..b108b30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,7 +292,6 @@ static struct move_charge_struct {
unsigned long moved_charge;
unsigned long moved_swap;
struct task_struct *moving_task;/* a task moving charges */
-   struct mm_struct *mm;
wait_queue_head_t waitq;/* a waitq for other context */
 } mc = {
.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
@@ -4639,7 +4638,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
	unsigned long precharge;
	struct vm_area_struct *vma;

-	/* We've already held the mmap_sem */
+	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct mm_walk mem_cgroup_count_precharge_walk = {
			.pmd_entry = mem_cgroup_count_precharge_pte_range,
@@ -4651,6 +4650,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
		walk_page_range(vma->vm_start, vma->vm_end,
				&mem_cgroup_count_precharge_walk

[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable

2010-12-26 Thread Daisuke Nishimura
On Fri, 24 Dec 2010 21:55:08 -0500
Ben Blum bb...@andrew.cmu.edu wrote:

 On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
  On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
   Patches have gone a bit stale, sorry.  Refactoring in
   kernel/cgroup_freezer.c necessitates a refresh and retest please.
  
  commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
  in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
  wraps kmalloc) instead of on the stack.
  
  1. All these functions have 'void' return values, indicating that
  calling them must not fail. Sure there are bailout cases, but no
 semblance of cross-function error propagation. Most importantly,
 cpuset_attach is a subsystem callback, which MUST not fail given the
 way it's used in cgroups, so relying on kmalloc is not safe.
  
  2. I'm working on a patch series which needs to hold tasklist_lock
 across ss-attach callbacks (in cpuset_attach's if (threadgroup)
 case, this is how safe traversal of tsk-thread_group will be
 ensured), and kmalloc isn't allowed while holding a spin-lock. 
  
  Why do we need heap-allocation here at all? In each case their scope is
  exactly the function's scope, and neither the commit nor the surrounding
  patch series give any explanation. I'd like to revert the patch if
  possible.
  
  cc'ing Miao Xie (author) and David Rientjes (acker).
  
  -- Ben
 
 Well even with the proposed solution to this there is still another
 problem that I see - that of mmap_sem. cpuset_attach() calls into
 mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
 writing and reading respectively. This is going to conflict with
 tasklist_lock... but moreover, the memcontrol subsys also touches the
 task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
 mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
 
 So we have (currently, even without my patches):
 
 cgroup_attach_task
 (1) cpuset_can_attach
 (2) mem_cgroup_can_attach
  -> down_read(mm->mmap_sem);
 (3) cpuset_attach
  -> mpol_rebind_mm
     -> down_write(mm->mmap_sem);
     -> up_write(mm->mmap_sem);
  -> cpuset_migrate_mm
     -> do_migrate_pages
        -> down_read(mm->mmap_sem);
        -> up_read(mm->mmap_sem);
 (4) mem_cgroup_move_task
  -> mem_cgroup_clear_mc
     -> up_read(...);
 
hmm, nice catch.

 Is there some interdependency I'm missing here that guarantees recursive
 locking/deadlock will be avoided? It all looks like typical-case code.
 
Unfortunately, not.
I couldn't hit this because I mount all subsystems onto different
mount points in my environment.

 I think we should move taking the mmap_sem all the way up into
 cgroup_attach_task and cgroup_attach_proc; it will be held for writing
 the whole time. I don't quite understand the mempolicy stuff but maybe
 there can be ways to use mpol_rebind_mm and do_migrate_pages when the
 lock is already held.
 
I agree.
Another solution (just an idea): avoid enabling both the move_charge feature of
memcg and the memory_migrate feature of cpuset at the same time iff they are
mounted under the same mount point. But, hmm... it's not a good idea to make a
subsystem take account of another subsystem, IMHO.


Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v2][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting

2010-10-24 Thread Daisuke Nishimura
 limit, and
 b) system usage is at system limit
 Then the above patch will not trigger writeback.  Example with two
 memcg:
           sys
        B       C
 
          limit  usage
    sys    10     10
    B       7      6
    C       5      4
 
 If B wants to grow, the system will exceed system limit of 10 and should
 be throttled.  However, the smaller limit (7) will be selected and
 applied to memcg usage (6), which indicates no need to throttle, so the
 system could get as bad as:
 
          limit  usage
    sys    10     12
    B       7      7
    C       5      5
 
 In this case the system usage exceeds the system limit because each of
 the per-memcg checks sees no per-memcg problem.
 
nice catch !

 To solve this I propose we create a new structure to aggregate both
 dirty limit and usage data:
   struct dirty_limits {
  unsigned long dirty_thresh;
  unsigned long background_thresh;
  unsigned long nr_reclaimable;
  unsigned long nr_writeback;
   };
 
 global_dirty_limits() would then query both the global and memcg limits
 and dirty usage of one that is closest to its limit.  This change makes
 global_dirty_limits() look like:
 
 void global_dirty_limits(struct dirty_limits *limits)
 {
   unsigned long background;
   unsigned long dirty;
   unsigned long nr_reclaimable;
   unsigned long nr_writeback;
   unsigned long available_memory = determine_dirtyable_memory();
   struct task_struct *tsk;
 
   if (vm_dirty_bytes)
   dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
   else
   dirty = (vm_dirty_ratio * available_memory) / 100;
 
   if (dirty_background_bytes)
   background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
   else
   background = (dirty_background_ratio * available_memory) / 100;
 
   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
   global_page_state(NR_UNSTABLE_NFS);
   nr_writeback = global_page_state(NR_WRITEBACK);
 
   if (mem_cgroup_dirty_limits(available_memory, limits) &&
       dirty_available(limits->dirty_thresh, limits->nr_reclaimable,
                       limits->nr_writeback) &&
       dirty_available(dirty, nr_reclaimable, nr_writeback)) {
           dirty = min(dirty, limits->dirty_thresh);
           background = min(background, limits->background_thresh);
   } else {
           limits->nr_reclaimable = nr_reclaimable;
           limits->nr_writeback = nr_writeback;
   }
 
   if (background >= dirty)
           background = dirty / 2;
   tsk = current;
   if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
           background += background / 4;
           dirty += dirty / 4;
   }
   limits->background_thresh = background;
   limits->dirty_thresh = dirty;
 }
 
 Because this approach considered both memcg and system limits, the
 problem described above is avoided.
 
 I have this change integrated into the memcg dirty limit series (-v3 was
 the last post; v4 is almost ready with this change).  I will post -v4
 with this approach if there is no strong objection.
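
dirty_available() is referenced in the code above but not shown; presumably it
just checks whether dirty usage is still strictly below the given threshold --
a guess at its intent, not the actual helper:

static bool dirty_available(unsigned long dirty_thresh,
			    unsigned long nr_reclaimable,
			    unsigned long nr_writeback)
{
	return nr_reclaimable + nr_writeback < dirty_thresh;
}

With that reading the B/C example works out: the system-wide check fails because
global usage (10) is not below the global limit (10), so the else branch keeps
the global counters and B gets throttled against the system limit rather than
its own.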
 
I have one concern now.

I've noticed that global_dirty_limits() is called from bdi_debug_stats_show(),
so the output of cat /sys/kernel/debug/bdi/.../stats changes depending
on the memcg in which the command is executed.

Can you take that into account too?

Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v2][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting

2010-10-20 Thread Daisuke Nishimura
On Wed, 20 Oct 2010 14:02:55 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 Fixed one here.
 ==
 From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 Now, when calculating the dirty limit, vm_dirty_param() is called.
 This function returns dirty-limit related parameters, taking
 memory cgroup settings into account.
 
 Now, assume that vm_dirty_bytes=100M (the global dirty limit), the
 memory cgroup has 1G of pages and a dirty_ratio of 40, and dirtyable memory is
 500MB.
 
 In this case, global_dirty_limits() will consider the dirty limit to be
 500 * 0.4 = 200MB. This is bad... the memory cgroup is not a back door.
 
 This patch limits the return value of vm_dirty_param() considering the
 global settings.
 
 Changelog:
  - fixed the mem argument from int to u64
  - fixed to use global available memory to cap the memcg's value.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats

2010-10-19 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:39 -0700
Greg Thelen gthe...@google.com wrote:

 Add calls into memcg dirty page accounting.  Notify memcg when pages
 transition between clean, file dirty, writeback, and unstable nfs.
 This allows the memory controller to maintain an accurate view of
 the amount of its memory that is dirty.
 
 Signed-off-by: Greg Thelen gthe...@google.com
 Signed-off-by: Andrea Righi ari...@develer.com

Reviewed-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces

2010-10-19 Thread Daisuke Nishimura
.
 +
  6. Hierarchy support
  
  The memory controller supports a deep hierarchy and hierarchical accounting.
 -- 
 1.7.1
 
Can you clarify whether we can limit the total dirty pages under a hierarchy
in the use_hierarchy==1 case?
If we can, I think it would be better to note it in this documentation.


Thanks,
Daisuke Nishimura. 


[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces

2010-10-19 Thread Daisuke Nishimura
On Wed, 20 Oct 2010 09:11:09 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 On Tue, 19 Oct 2010 14:00:58 -0700
 Greg Thelen gthe...@google.com wrote:
 
(snip)
  +When use_hierarchy=0, each cgroup has independent dirty memory usage and 
  limits.
  +
  +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage 
  will
  +compare its total_dirty memory (which includes sum of all child cgroup 
  dirty
  +memory) to its dirty limits.  This keeps a parent from explicitly 
  exceeding its
  +dirty limits.  However, a child cgroup can increase its dirty usage without
  +considering the parent's dirty limits.  Thus the parent's total_dirty can 
  exceed
  +the parent's dirty limits as a child dirties pages.
 
 Hmm. In short, dirty_ratio with use_hierarchy=1 doesn't work as a user expects.
 Is this a spec. or a current implementation ?
 
 I think as following.
  - add a limitation: at setting a child's dirty_ratio, it must be below the
 parent's. If it exceeds the parent's dirty_ratio, EINVAL is returned.
 
 Could you modify setting memory.dirty_ratio code ?
 Then, a child's dirty_ratio will never exceed its parent's. (If I understand
 correctly.)
 
 memory.dirty_limit_in_bytes will be a bit more complicated, but I think you
 can.
 
I agree.

At first impression, this limitation seems a bit of an overkill to me, because
we currently allow a child's memory.limit_in_bytes to be bigger than its parent's.
But considering it more, the situation is different, because usage_in_bytes never
exceeds limit_in_bytes.
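
For illustration, the check KAMEZAWA-san proposes might look roughly like this
(function and field names here are made up for the sketch, not taken from the
actual series):

static int mem_cgroup_dirty_ratio_write(struct cgroup *cgrp,
					struct cftype *cft, u64 val)
{
	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

	if (val > 100)
		return -EINVAL;
	/* proposed rule: a child's dirty_ratio may not exceed its parent's */
	if (parent && memcg->use_hierarchy && val > parent->dirty_ratio)
		return -EINVAL;
	memcg->dirty_ratio = val;	/* hypothetical field */
	return 0;
}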


Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup

2010-10-19 Thread Daisuke Nishimura
 +static unsigned long long
 +memcg_hierarchical_free_pages(struct mem_cgroup *mem)
 +{
 + struct cgroup *cgroup;
 + unsigned long long min_free, free;
 +
 +	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
 +		res_counter_read_u64(&mem->res, RES_USAGE);
 +	cgroup = mem->css.cgroup;
 +	if (!mem->use_hierarchy)
 +		goto out;
 +
 +	while (cgroup->parent) {
 +		cgroup = cgroup->parent;
 +		mem = mem_cgroup_from_cont(cgroup);
 +		if (!mem->use_hierarchy)
 +			break;
 +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
 +			res_counter_read_u64(&mem->res, RES_USAGE);
 +		min_free = min(min_free, free);
 +	}
 +out:
 +	/* Translate free memory in pages */
 +	return min_free >> PAGE_SHIFT;
 +}
 +
I think you can simplify this function using parent_mem_cgroup().

unsigned long long free, min_free = ULLONG_MAX;

while (mem) {
	free = res_counter_read_u64(&mem->res, RES_LIMIT) -
		res_counter_read_u64(&mem->res, RES_USAGE);
	min_free = min(min_free, free);
	mem = parent_mem_cgroup(mem);
}

/* Translate free memory in pages */
return min_free >> PAGE_SHIFT;

And, IMHO, we should return min(global_page_state(NR_FREE_PAGES), min_free >> PAGE_SHIFT),
because we are allowed to set no limit (or a very big limit) in a memcg,
so min_free can be very big if we don't set a limit on all the memcgs in the
hierarchy.
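
Something like the following wrapper shows what I mean -- a sketch only,
building on the function above:

static unsigned long long
memcg_hierarchical_free_pages_capped(struct mem_cgroup *mem)
{
	unsigned long long min_free = memcg_hierarchical_free_pages(mem);

	/* cap the hierarchical headroom by what the system really has free */
	return min_t(unsigned long long,
		     global_page_state(NR_FREE_PAGES), min_free);
}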


Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v3 05/11] memcg: add dirty page accounting infrastructure

2010-10-19 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:38 -0700
Greg Thelen gthe...@google.com wrote:

 Add memcg routines to track dirty, writeback, and unstable_NFS pages.
 These routines are not yet used by the kernel to count such pages.
 A later change adds kernel calls to these new routines.
 
 Signed-off-by: Greg Thelen gthe...@google.com
 Signed-off-by: Andrea Righi ari...@develer.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits

2010-10-19 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:42 -0700
Greg Thelen gthe...@google.com wrote:

 Add cgroupfs interface to memcg dirty page limits:
   Direct write-out is controlled with:
   - memory.dirty_ratio
   - memory.dirty_limit_in_bytes
 
   Background write-out is controlled with:
   - memory.dirty_background_ratio
   - memory.dirty_background_limit_bytes
 
 Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
 and 'G' suffixes for byte counts.  This patch provides the
 same functionality for memory.dirty_limit_in_bytes and
 memory.dirty_background_limit_bytes.
 
 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 Signed-off-by: Greg Thelen gthe...@google.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

One question: shouldn't we return -EINVAL when a value bigger than the global
one (if any) is written to dirty(_background)_limit_bytes? Or do you intentionally
set the input value without comparing it with the global value?
But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or somewhere
similar, so that more dirty pages than the global limit are not allowed.


Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits

2010-10-19 Thread Daisuke Nishimura
On Wed, 20 Oct 2010 12:31:10 +0900
Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

 On Mon, 18 Oct 2010 17:39:42 -0700
 Greg Thelen gthe...@google.com wrote:
 
  Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_limit_in_bytes
  
Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_limit_bytes
  
  Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
  and 'G' suffixes for byte counts.  This patch provides the
  same functionality for memory.dirty_limit_in_bytes and
  memory.dirty_background_limit_bytes.
  
  Signed-off-by: Andrea Righi ari...@develer.com
  Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
  Signed-off-by: Greg Thelen gthe...@google.com
 
 Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
 
 One question: shouldn't we return -EINVAL when writing to 
 dirty(_background)_limit_bytes
 a bigger value than that of global one(if any) ? Or do you intentionally
 set the input value without comparing it with the global value ?
 But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or 
 something
 not to allow dirty pages more than global limit.
 
Oh, Kamezawa-san has just sent a fix for this problem :)
Please ignore this comment.

Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces

2010-10-19 Thread Daisuke Nishimura
On Wed, 20 Oct 2010 11:24:31 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 On Wed, 20 Oct 2010 10:14:21 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  On Wed, 20 Oct 2010 09:48:21 +0900
  Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
  
   On Wed, 20 Oct 2010 09:11:09 +0900
   KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
   
On Tue, 19 Oct 2010 14:00:58 -0700
Greg Thelen gthe...@google.com wrote:

   (snip)
 +When use_hierarchy=0, each cgroup has independent dirty memory usage 
 and limits.
 +
 +When use_hierarchy=1, a parent cgroup increasing its dirty memory 
 usage will
 +compare its total_dirty memory (which includes sum of all child 
 cgroup dirty
 +memory) to its dirty limits.  This keeps a parent from explicitly 
 exceeding its
 +dirty limits.  However, a child cgroup can increase its dirty usage 
 without
 +considering the parent's dirty limits.  Thus the parent's 
 total_dirty can exceed
 +the parent's dirty limits as a child dirties pages.

Hmm. in short, dirty_ratio in use_hierarchy=1 doesn't work as an user 
expects.
Is this a spec. or a current implementation ?

I think as following.
 - add a limitation as At setting chidlren's dirty_ratio, it must be 
below parent's.
   If it exceeds parent's dirty_ratio, EINVAL is returned.

Could you modify setting memory.dirty_ratio code ?
Then, parent's dirty_ratio will never exceeds its own. (If I understand 
correctly.)

memory.dirty_limit_in_bytes will be a bit more complecated, but I 
think you can.

   I agree.
   
   At the first impression, this limitation seems a bit overkill for me, 
   because
   we allow memory.limit_in_bytes of a child bigger than that of parent now.
   But considering more, the situation is different, because usage_in_bytes 
   never
   exceeds limit_in_bytes.
   
  
  I'd like to consider a patch.
  Please mention that the use_hierarchy=1 case depends on the implementation,
  for now.
  
 
 BTW, how about supporting dirty_limit_in_bytes when use_hierarchy=0 and leaving
 it as broken when use_hierarchy=1?
 It seems we can only support dirty_ratio when hierarchy is used.
 
It's all right for me.
This feature would be useful even w/o hierarchy support.

Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static.

2010-10-19 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:43 -0700
Greg Thelen gthe...@google.com wrote:

 The determine_dirtyable_memory() function is not used outside of
 page writeback.  Make the routine static.  No functional change.
 Just a cleanup in preparation for a change that adds memcg dirty
 limits consideration into global_dirty_limits().
 
 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Greg Thelen gthe...@google.com

Reviewed-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix

2010-10-19 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:41 -0700
Greg Thelen gthe...@google.com wrote:

 From: Balbir Singh bal...@linux.vnet.ibm.com
 
 memcg has lockdep warnings (sleep inside rcu lock)
 
 Recent move to get_online_cpus() ends up calling get_online_cpus() from
 mem_cgroup_read_stat(). However mem_cgroup_read_stat() is called under rcu
 lock. get_online_cpus() can sleep. The dirty limit patches expose
 this BUG more readily due to their usage of mem_cgroup_page_stat()
 
 This patch address this issue as identified by lockdep and moves the
 hotplug protection to a higher layer. This might increase the time
 required to hotplug, but not by much.
 
 Warning messages
 
 BUG: sleeping function called from invalid context at kernel/cpu.c:62
 in_atomic(): 0, irqs_disabled(): 0, pid: 6325, name: pagetest
 2 locks held by pagetest/6325:
 do_page_fault+0x27d/0x4a0
 mem_cgroup_page_stat+0x0/0x23f
 Pid: 6325, comm: pagetest Not tainted 2.6.36-rc5-mm1+ #201
 Call Trace:
 [81041224] __might_sleep+0x12d/0x131
 [8104f4af] get_online_cpus+0x1c/0x51
 [8110eedb] mem_cgroup_read_stat+0x27/0xa3
 [811125d2] mem_cgroup_page_stat+0x131/0x23f
 [811124a1] ? mem_cgroup_page_stat+0x0/0x23f
 [810d57c3] global_dirty_limits+0x42/0xf8
 [810d58b3] throttle_vm_writeout+0x3a/0xb4
 [810dc2f8] shrink_zone+0x3e6/0x3f8
 [81074a35] ? ktime_get_ts+0xb2/0xbf
 [810dd1aa] do_try_to_free_pages+0x106/0x478
 [810dd601] try_to_free_mem_cgroup_pages+0xe5/0x14c
 [8110f947] mem_cgroup_hierarchical_reclaim+0x314/0x3a2
 [8b31] __mem_cgroup_try_charge+0x29b/0x593
 [894a] ? __mem_cgroup_try_charge+0xb4/0x593
 [81071258] ? local_clock+0x40/0x59
 [81009015] ? sched_clock+0x9/0xd
 [810710d5] ? sched_clock_local+0x1c/0x82
 [8111398a] mem_cgroup_charge_common+0x4b/0x76
 [81141469] ? bio_add_page+0x36/0x38
 [81113ba9] mem_cgroup_cache_charge+0x1f4/0x214
 [810cd195] add_to_page_cache_locked+0x4a/0x148
 
 
 Acked-by: Greg Thelen gthe...@google.com
 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback

2010-10-19 Thread Daisuke Nishimura
On Wed, 20 Oct 2010 13:18:57 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 On Tue, 19 Oct 2010 10:00:15 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  On Mon, 18 Oct 2010 17:39:44 -0700
  Greg Thelen gthe...@google.com wrote:
  
   If the current process is in a non-root memcg, then
   global_dirty_limits() will consider the memcg dirty limit.
   This allows different cgroups to have distinct dirty limits
   which trigger direct and background writeback at different
   levels.
   
   Signed-off-by: Andrea Righi ari...@develer.com
   Signed-off-by: Greg Thelen gthe...@google.com
  
  Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  
 Why is FREEPAGES in memcg not counted as dirtyable?
 
It's counted via memcg_hierarchical_free_pages() in mem_cgroup_page_stat().
As for FREEPAGES, we should count ancestors, not children.

mem_cgroup_page_stat():
	if (mem && !mem_cgroup_is_root(mem)) {
		/*
		 * If we're looking for dirtyable pages we need to evaluate
		 * free pages depending on the limit and usage of the parents
		 * first of all.
		 */
		if (item == MEMCG_NR_DIRTYABLE_PAGES)
			value = memcg_hierarchical_free_pages(mem);
		else
			value = 0;
		/*
		 * Recursively evaluate page statistics against all cgroups
		 * under the hierarchy tree
		 */
		for_each_mem_cgroup_tree(iter, mem)
			value += mem_cgroup_local_page_stat(iter, item);



[Devel] Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback

2010-10-19 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:44 -0700
Greg Thelen gthe...@google.com wrote:

 If the current process is in a non-root memcg, then
 global_dirty_limits() will consider the memcg dirty limit.
 This allows different cgroups to have distinct dirty limits
 which trigger direct and background writeback at different
 levels.
 
 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Greg Thelen gthe...@google.com

Reviewed-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking

2010-10-18 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:34 -0700
Greg Thelen gthe...@google.com wrote:

 Add additional flags to page_cgroup to track dirty pages
 within a mem_cgroup.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Greg Thelen gthe...@google.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 03/11] memcg: create extensible page stat update routines

2010-10-18 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:36 -0700
Greg Thelen gthe...@google.com wrote:

 Replace usage of the mem_cgroup_update_file_mapped() memcg
 statistic update routine with two new routines:
 * mem_cgroup_inc_page_stat()
 * mem_cgroup_dec_page_stat()
 
 As before, only the file_mapped statistic is managed.  However,
 these more general interfaces allow for new statistics to be
 more easily added.  New statistics are added with memcg dirty
 page accounting.
 
 Signed-off-by: Greg Thelen gthe...@google.com
 Signed-off-by: Andrea Righi ari...@develer.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp


[Devel] Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration

2010-10-18 Thread Daisuke Nishimura
On Mon, 18 Oct 2010 17:39:37 -0700
Greg Thelen gthe...@google.com wrote:

 From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize
 the page accounting and migration code.  This reworks the
 locking scheme of _update_stat() and _move_account() by
 adding new lock bit PCG_MOVE_LOCK, which is always taken
 under IRQ disable.
 
 1. If pages are being migrated from a memcg, then updates to
that memcg page statistics are protected by grabbing
PCG_MOVE_LOCK using move_lock_page_cgroup().  In an
upcoming commit, memcg dirty page accounting will be
updating memcg page accounting (specifically: num
writeback pages) from IRQ context (softirq).  Avoid a
deadlocking nested spin lock attempt by disabling irq on
the local processor when grabbing the PCG_MOVE_LOCK.
 
 2. lock for update_page_stat is used only for avoiding race
with move_account().  So, IRQ awareness of
lock_page_cgroup() itself is not a problem.  The problem
is between mem_cgroup_update_page_stat() and
mem_cgroup_move_account_page().
 
 Trade-off:
   * Changing lock_page_cgroup() to always disable IRQ (or
 local_bh) has some impacts on performance and I think
 it's bad to disable IRQ when it's not necessary.
   * adding a new lock makes move_account() slower.  Score is
 here.
 
 Performance Impact: moving a 8G anon process.
 
 Before:
   real0m0.792s
   user0m0.000s
   sys 0m0.780s
 
 After:
   real0m0.854s
   user0m0.000s
   sys 0m0.842s
 
 This score is bad but planned patches for optimization can reduce
 this impact.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 Signed-off-by: Greg Thelen gthe...@google.com

This approach is more straightforward and easy to understand.

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
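
For reference, the locking helpers this description implies would look roughly
like the following (a sketch, assuming the new PCG_MOVE_LOCK bit lives in
pc->flags):

static inline void move_lock_page_cgroup(struct page_cgroup *pc,
					 unsigned long *flags)
{
	/*
	 * Stat updates can come from both normal and IRQ context (e.g.
	 * writeback completion), so IRQs are disabled before taking the
	 * bit spinlock to avoid a nested-lock deadlock.
	 */
	local_irq_save(*flags);
	bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
}

static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
					   unsigned long *flags)
{
	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
	local_irq_restore(*flags);
}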


[Devel] Re: [PATCH v4] memcg: reduce lock time at move charge

2010-10-12 Thread Daisuke Nishimura
On Tue, 12 Oct 2010 14:48:01 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 On Tue, 12 Oct 2010 12:56:13 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
    +err_out:
    +	for (; mt < info + num; mt++)
    +		if (mt->type == MC_TARGET_PAGE) {
    +			putback_lru_page(mt->val.page);
   Is this putback_lru_page() necessary?
   is_target_pte_for_mc() doesn't isolate the page.
  
 Ok, v4 here. tested failure path and success path.
 
Looks good to me.

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-11 Thread Daisuke Nishimura
 +err_out:
 +	for (; mt < info + num; mt++)
 +		if (mt->type == MC_TARGET_PAGE) {
 +			putback_lru_page(mt->val.page);
Is this putback_lru_page() necessary ?
is_target_pte_for_mc() doesn't isolate the page.

Thanks,
Daisuke Nishimura.


 +			put_page(mt->val.page);
 + }
 + goto out;
  }
  
  static void mem_cgroup_move_charge(struct mm_struct *mm)
 


[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread Daisuke Nishimura
On Thu, 7 Oct 2010 16:14:54 -0700
Andrew Morton a...@linux-foundation.org wrote:

 On Thu, 7 Oct 2010 17:04:05 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  Now, at task migration among cgroups, the memory cgroup scans the page table
  and moves accounting if the flags are properly set.
  
  The core code, mem_cgroup_move_charge_pte_range() does
  
  pte_offset_map_lock();
  for all ptes in a page table:
  1. look into page table, find_and_get a page
  2. remove it from LRU.
  3. move charge.
  4. putback to LRU. put_page()
  pte_offset_map_unlock();
  
  for pte entries on a 3rd level? page table.
  
  This pte_offset_map_lock seems a bit long. This patch modifies the routine as
  
  for 32 pages: pte_offset_map_lock()
find_and_get a page
record it
pte_offset_map_unlock()
  for all recorded pages
isolate it from LRU.
move charge
putback to LRU
  for all recorded pages
put_page()
 
 The patch makes the code larger, more complex and slower!
 
Before this patch:
   text    data     bss     dec     hex filename
  27163   11782    4100   43045    a825 mm/memcontrol.o

After this patch:
   text    data     bss     dec     hex filename
  27307   12294    4100   43701    aab5 mm/memcontrol.o

hmm, allocating mc.target[] statically might be bad, but I'm now wondering
whether I could allocate mc itself dynamically (I'll try).

 I do think we're owed a more complete description of its benefits than
 seems a bit long.  Have problems been observed?  Any measurements
 taken?
 
IIUC, this patch is necessary for [PATCH] memcg: lock-free clear page 
writeback
later, but I agree we should describe it.

Thanks,
Daisuke Nishimura.


[Devel] Re: [RFC] Restrict size of page_cgroup-flags

2010-10-06 Thread Daisuke Nishimura
On Wed, 6 Oct 2010 19:53:14 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 I propose restricting page_cgroup.flags to 16 bits. The patch for the
 same is below. Comments?
 
 
 Restrict the bits usage in page_cgroup.flags
 
 From: Balbir Singh bal...@linux.vnet.ibm.com
 
 Restricting the flags helps control the otherwise unbounded growth of the flags.
 Restricting it to 16 bits gives us the possibility of merging the
 cgroup id with the flags (atomicity permitting) and saving a whole
 long word in page_cgroup
 
I agree that reducing the size of page_cgroup would be good and important.
But wouldn't it be better to remove ->page, if possible?

Thanks,
Daisuke Nishimura.

 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 ---
 
  include/linux/page_cgroup.h |3 +++
  mm/page_cgroup.c|1 +
  2 files changed, 4 insertions(+), 0 deletions(-)
 
 
 diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
 index 872f6b1..10c37b4 100644
 --- a/include/linux/page_cgroup.h
 +++ b/include/linux/page_cgroup.h
 @@ -44,8 +44,11 @@ enum {
   PCG_FILE_WRITEBACK, /* page is under writeback */
   PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
   PCG_MIGRATION, /* under page migration */
 + PCG_MAX_NR,
  };
  
 +#define PCG_MAX_BIT_SIZE 16
 +
  #define TESTPCGFLAG(uname, lname)\
  static inline int PageCgroup##uname(struct page_cgroup *pc)  \
   { return test_bit(PCG_##lname, &pc->flags); }
 diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
 index 5bffada..e16ad2e 100644
 --- a/mm/page_cgroup.c
 +++ b/mm/page_cgroup.c
 @@ -258,6 +258,7 @@ void __init page_cgroup_init(void)
   unsigned long pfn;
   int fail = 0;
  
  +	BUILD_BUG_ON(PCG_MAX_NR >= PCG_MAX_BIT_SIZE);
   if (mem_cgroup_disabled())
   return;
  
 
 -- 
   Three Cheers,
   Balbir
 
 


[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-06 Thread Daisuke Nishimura
On Thu, 7 Oct 2010 09:35:45 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 On Wed, 6 Oct 2010 09:15:34 +0900
 Minchan Kim minchan@gmail.com wrote:
 
  First of all, we could add your patch as it is and I don't expect any
  regression report about interrupt latency.
  That's because many embedded guys doesn't use mmotm and have a
  tendency to not report regression of VM.
  Even they don't use memcg. Hmm...
  
  I pass the decision to MAINTAINER Kame and Balbir.
  Thanks for the detail explanation.
  
 
 Hmm. IRQ delay is a concern. So, my option is this. What do you think?
 
 1. remove local_irq_save()/restore() in lock/unlock_page_cgroup().
yes, I don't like it.
 
 2. At moving charge, do this:
   a) lock_page() or trylock_page()
   b) wait_on_page_writeback()
   c) do move_account under lock_page_cgroup().
   d) unlock_page()
 
 
 Then, Writeback updates will never come from IRQ context while
 lock/unlock_page_cgroup() is held by move_account(). There will be no race.
 
hmm, if we do that, I think we need to do it under pte_lock in
mem_cgroup_move_charge_pte_range(). But we can't do wait_on_page_writeback()
under pte_lock, right? Or we would need to re-organize the current move-charge
implementation.

Thanks,
Daisuke Nishimura.


[Devel] Re: [RFC] Restrict size of page_cgroup-flags

2010-10-06 Thread Daisuke Nishimura
On Thu, 7 Oct 2010 08:44:59 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 * nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-10-07 
 09:54:58]:
 
  On Wed, 6 Oct 2010 19:53:14 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   I propose restricting page_cgroup.flags to 16 bits. The patch for the
   same is below. Comments?
   
   
   Restrict the bits usage in page_cgroup.flags
   
   From: Balbir Singh bal...@linux.vnet.ibm.com
   
   Restricting the flags helps control growth of the flags unbound.
   Restriciting it to 16 bits gives us the possibility of merging
   cgroup id with flags (atomicity permitting) and saving a whole
   long word in page_cgroup
   
  I agree that reducing the size of page_cgroup would be good and important.
   But, wouldn't it be better to remove ->page, if possible ?
 
 
 Without the page pointer, how do we go from pc to page for reclaim? 
 
We store page_cgroups in arrays now, so I suppose we can implement pc_to_pfn()
using a calculation similar to the one page_to_pfn() does.
IIRC, KAMEZAWA-san talked about it in another thread.
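
A rough illustration of the pc_to_pfn() idea for a FLATMEM-style layout --
pc_node_id() is hypothetical, standing in for a node id stored somewhere in
pc->flags as discussed elsewhere in this thread:

static unsigned long pc_to_pfn(struct page_cgroup *pc)
{
	/* hypothetical: recover the node id from bits kept in pc->flags */
	int nid = pc_node_id(pc);
	struct pglist_data *pgdat = NODE_DATA(nid);

	/* the same index arithmetic page_to_pfn() uses for struct page */
	return (pc - pgdat->node_page_cgroup) + pgdat->node_start_pfn;
}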

Thanks,
Daisuke Nishimura.


[Devel] Re: [RFC] Restrict size of page_cgroup-flags

2010-10-06 Thread Daisuke Nishimura
On Thu, 7 Oct 2010 13:22:33 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 On Thu, 7 Oct 2010 09:26:08 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-10-07 12:18:16]:
  
   On Thu, 7 Oct 2010 08:42:04 +0530
   Balbir Singh bal...@linux.vnet.ibm.com wrote:
   
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-10-07 
08:58:58]:

 On Wed, 6 Oct 2010 19:53:14 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  I propose restricting page_cgroup.flags to 16 bits. The patch for 
  the
  same is below. Comments?
  
  
  Restrict the bits usage in page_cgroup.flags
  
  From: Balbir Singh bal...@linux.vnet.ibm.com
  
  Restricting the flags helps control growth of the flags unbound.
  Restriciting it to 16 bits gives us the possibility of merging
  cgroup id with flags (atomicity permitting) and saving a whole
  long word in page_cgroup
  
  Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 
 Doesn't make sense until you show the usage of existing bits.

??

   Limiting something for NOT EXISTING PATCH doesn't make sense, in general.
   
   
 And I guess 16bit may be too large on 32bit systems.

too large on 32 bit systems? My intention is to keep the flags to 16
bits and then use cgroup id for the rest and see if we can remove
mem_cgroup pointer

   
   You can't use flags field to store mem_cgroup_id while we use lock bit on 
   it.
   We have to store something more stable...as pfn or node-id or zone-id.
   
   It's very racy. 
  
  
   Yes, correct, it is racy; from what I know there is no easy way to write
   the upper 16 bits of the flags word without affecting the lower 16 bits if
   those 16 bits are changing. One of the techniques could be to have a lock
   for the unsigned long word itself - but I don't know what performance
   overhead that would add. Having said that, I would like to explore
   techniques that allow me to merge the two.
  
 
  to store pfn, we would need to limit the flags to under 12 bits.
I'm sorry if I am missing something, but is that valid in the PAE case too?
I think it would be better to store a section id (or node id) rather than a pfn.

 I'll schedule my patch if the dirty_ratio one goes in.
 
Agreed. I think we should do the dirty_ratio patches first.


Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking

2010-10-05 Thread Daisuke Nishimura
On Sun,  3 Oct 2010 23:57:56 -0700
Greg Thelen gthe...@google.com wrote:

 Add additional flags to page_cgroup to track dirty pages
 within a mem_cgroup.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Greg Thelen gthe...@google.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

Thanks,
Daisuke Nishimura.

 ---
  include/linux/page_cgroup.h |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
 index 5bb13b3..b59c298 100644
 --- a/include/linux/page_cgroup.h
 +++ b/include/linux/page_cgroup.h
 @@ -40,6 +40,9 @@ enum {
   PCG_USED, /* this object is in use. */
   PCG_ACCT_LRU, /* page has been accounted for */
   PCG_FILE_MAPPED, /* page is accounted as mapped */
 + PCG_FILE_DIRTY, /* page is dirty */
 + PCG_FILE_WRITEBACK, /* page is under writeback */
 + PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
   PCG_MIGRATION, /* under page migration */
  };
  
 @@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
  
 +#define TESTSETPCGFLAG(uname, lname)					\
 +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
 +	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
  
  TESTPCGFLAG(Locked, LOCK)
  
  /* Cache flag is set only once (at allocation) */
 @@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
  TESTPCGFLAG(FileMapped, FILE_MAPPED)
  
 +SETPCGFLAG(FileDirty, FILE_DIRTY)
 +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
 +TESTPCGFLAG(FileDirty, FILE_DIRTY)
 +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
 +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
 +
 +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
 +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
 +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
 +
 +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
 +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
 +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
 +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
 +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
 +
  SETPCGFLAG(Migration, MIGRATION)
  CLEARPCGFLAG(Migration, MIGRATION)
  TESTPCGFLAG(Migration, MIGRATION)
 -- 
 1.7.1
 


[Devel] Re: [PATCH 02/10] memcg: document cgroup dirty memory interfaces

2010-10-05 Thread Daisuke Nishimura
On Sun,  3 Oct 2010 23:57:57 -0700
Greg Thelen gthe...@google.com wrote:

 Document cgroup dirty memory interfaces and statistics.
 
 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Greg Thelen gthe...@google.com

I think you will change nfs to nfs_unstable, but anyway,

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

Thanks
Daisuke Nishimura.

 ---
  Documentation/cgroups/memory.txt |   37 +
  1 files changed, 37 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/cgroups/memory.txt 
 b/Documentation/cgroups/memory.txt
 index 7781857..eab65e2 100644
 --- a/Documentation/cgroups/memory.txt
 +++ b/Documentation/cgroups/memory.txt
 @@ -385,6 +385,10 @@ mapped_file  - # of bytes of mapped file (includes 
 tmpfs/shmem)
  pgpgin   - # of pages paged in (equivalent to # of charging 
 events).
  pgpgout  - # of pages paged out (equivalent to # of uncharging 
 events).
  swap - # of bytes of swap usage
 +dirty- # of bytes that are waiting to get written back to 
 the disk.
 +writeback- # of bytes that are actively being written back to the disk.
 +nfs  - # of bytes sent to the NFS server, but not yet committed to
 + the actual storage.
  inactive_anon- # of bytes of anonymous memory and swap cache memory 
 on
   LRU list.
  active_anon  - # of bytes of anonymous and swap cache memory on active
 @@ -453,6 +457,39 @@ memory under it will be reclaimed.
  You can reset failcnt by writing 0 to failcnt file.
  # echo 0  .../memory.failcnt
  
 +5.5 dirty memory
 +
 +Control the maximum amount of dirty pages a cgroup can have at any given 
 time.
 +
 +Limiting dirty memory is like fixing the max amount of dirty (hard to 
 reclaim)
 +page cache used by a cgroup.  So, in case of multiple cgroup writers, they 
 will
 +not be able to consume more than their designated share of dirty pages and 
 will
 +be forced to perform write-out if they cross that limit.
 +
 +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  
 It
 +is possible to configure a limit to trigger both a direct writeback or a
 +background writeback performed by per-bdi flusher threads.  The root cgroup
 +memory.dirty_* control files are read-only and match the contents of
 +the /proc/sys/vm/dirty_* files.
 +
 +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
 +
 +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage 
 of
 +  cgroup memory) at which a process generating dirty pages will itself start
 +  writing out dirty data.
 +
 +- memory.dirty_bytes: the amount of dirty memory (expressed in bytes) in the
 +  cgroup at which a process generating dirty pages will start itself writing 
 out
 +  dirty data.
 +
 +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
 +  (expressed as a percentage of cgroup memory) at which background writeback
 +  kernel threads will start writing out dirty data.
 +
 +- memory.dirty_background_bytes: the amount of dirty memory (expressed in 
 bytes)
 +  in the cgroup at which background writeback kernel threads will start 
 writing
 +  out dirty data.
 +
  6. Hierarchy support
  
  The memory controller supports a deep hierarchy and hierarchical accounting.
 -- 
 1.7.1
 


[Devel] Re: Cgroup OOM bug

2010-05-09 Thread Daisuke Nishimura
On Mon, 10 May 2010 09:01:31 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Fri, 7 May 2010 14:30:21 -0700
 Jeremy Hanmer jer...@hq.newdream.net wrote:
 
  I've found a reproducible case where you can create an infinite loop inside 
  the oom killer if you create a cgroup with a single process that has 
  oom_adj=-17 and trigger an OOM situation inside that cgroup.
  
  It seems that oom_adj should probably be ignored in the case where it makes 
  the OOM situation unsolvable.  Here's an excerpt from the console output 
  when the bug is triggered:
  
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child
  
 
 Now, in the -mm tree, 2 options are supported and they are under test.
 
   1. oom_notifier ... an event notifier which notifies OOM.
 
   2. oom_disable  ... disable the OOM killer. All threads will wait for the end of
  OOM. No cpu consumption.
 
 By 1+2, a user (a user-land daemon) can implement its own OOM handler.
 Please wait for it.
 
And you also need
memcg-make-oom-killer-a-no-op-when-no-killable-task-can-be-found.patch.
IIUC, we will be stuck in the infinite loop inside mem_cgroup_out_of_memory()
otherwise.

Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock

2010-04-15 Thread Daisuke Nishimura
On Wed, 14 Apr 2010 21:48:25 -0700, Greg Thelen gthe...@google.com wrote:
 On Wed, Apr 14, 2010 at 7:40 PM, Daisuke Nishimura
 nishim...@mxp.nes.nec.co.jp wrote:
  On Wed, 14 Apr 2010 13:14:07 -0700, Greg Thelen gthe...@google.com wrote:
  Vivek Goyal vgo...@redhat.com writes:
 
   On Tue, Apr 13, 2010 at 11:55:12PM -0700, Greg Thelen wrote:
   On Thu, Mar 18, 2010 at 8:00 PM, KAMEZAWA Hiroyuki 
   kamezawa.hir...@jp.fujitsu.com wrote:
On Fri, 19 Mar 2010 08:10:39 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:
   
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-19 
10:23:32]:
   
 On Thu, 18 Mar 2010 21:58:55 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:

  * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-18 
  13:35:27]:

    Then, no problem. It's ok to add mem_cgroup_update_stat() independent from
    mem_cgroup_update_file_mapped(). The look may be messy but it's not your
    fault. But please write why you add a new function in the patch
    description.
  
   I'm sorry for wasting your time.
 
  Do we need to go down this route? We could check the stat and do 
  the
  correct thing. In case of FILE_MAPPED, always grab 
  page_cgroup_lock
  and for others potentially look at trylock. It is OK for 
  different
  stats to be protected via different locks.
 

 I _don't_ want to see a mixture of spinlock and trylock in a 
 function.

   
A well documented well written function can help. The other thing is 
to
of-course solve this correctly by introducing different locking 
around
the statistics. Are you suggesting the later?
   
   
No. As I wrote.
       - don't modify codes around FILE_MAPPED in this series.
       - add a new functions for new statistics
Then,
       - think about clean up later, after we confirm all things work 
as expected.
  
    I have ported Andrea Righi's memcg dirty page accounting patches to the
    latest mmotm-2010-04-05-16-09.  In doing so I have to address this locking
    issue.  Does the following look good?  I will (of course) submit the
    entire patch for review, but I wanted to make sure I was aiming in the
    right direction.
  
   void mem_cgroup_update_page_stat(struct page *page,
                      enum mem_cgroup_write_page_stat_item idx, bool 
   charge)
   {
      static int seq;
      struct page_cgroup *pc;
  
      if (mem_cgroup_disabled())
              return;
      pc = lookup_page_cgroup(page);
      if (!pc || mem_cgroup_is_root(pc-mem_cgroup))
              return;
  
      /*
       * This routine does not disable irq when updating stats.  So it is
       * possible that a stat update from within interrupt routine, could
       * deadlock.  Use trylock_page_cgroup() to avoid such deadlock.  This
       * makes the memcg counters fuzzy.  More complicated, or lower
       * performing locking solutions avoid this fuzziness, but are not
       * currently needed.
       */
      if (irqs_disabled()) {
               ^
   Or may be in_interrupt()?
 
  Good catch.  I will replace irqs_disabled() with in_interrupt().
 
  I think you should check both. __remove_from_page_cache(), which will update
  DIRTY, is called with irqs disabled (IOW, under mapping->tree_lock) but not in
  interrupt context.
 
 The only reason to use trylock in this case is to prevent deadlock
 when running in a context that may have preempted or interrupted a
 routine that already holds the bit locked.  In
 __remove_from_page_cache() irqs are disabled, but that does not imply
 that a routine holding the spinlock has been preempted.  When the bit
 is locked, preemption is disabled.  The only way to interrupt a holder
 of the bit is for an interrupt to occur (I/O, timer, etc.).  So I think
 that in_interrupt() is sufficient.  Am I missing something?
 
IIUC, it would be enough to prevent the deadlock where one CPU tries to acquire
the same page cgroup lock it already holds. But there is still a possibility
that 2 CPUs can deadlock with each other (please see commit e767e056).
IOW, my point is: don't call lock_page_cgroup() under mapping->tree_lock.
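
The scenario I have in mind is roughly what commit e767e056 describes:

	CPU0                                      CPU1
	lock_page_cgroup(pcA)
	                                          spin_lock_irq(&mapping->tree_lock)
	  <interrupt>                             lock_page_cgroup(pcA)
	  spin_lock_irqsave(&mapping->tree_lock)    ... spins on CPU0 ...
	    ... spins on CPU1 ...                 => deadlock

Note that CPU1 is not in interrupt context here, so an in_interrupt() check on
its side would not make it fall back to trylock; only an irqs_disabled() check
(or simply never taking lock_page_cgroup() under tree_lock) covers that side.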

Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock

2010-04-14 Thread Daisuke Nishimura
On Wed, 14 Apr 2010 13:14:07 -0700, Greg Thelen gthe...@google.com wrote:
 Vivek Goyal vgo...@redhat.com writes:
 
  On Tue, Apr 13, 2010 at 11:55:12PM -0700, Greg Thelen wrote:
  On Thu, Mar 18, 2010 at 8:00 PM, KAMEZAWA Hiroyuki 
  kamezawa.hir...@jp.fujitsu.com wrote:
   On Fri, 19 Mar 2010 08:10:39 +0530
   Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-19 
   10:23:32]:
  
On Thu, 18 Mar 2010 21:58:55 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:
   
 * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-18 
 13:35:27]:
   
   Then, no problem. It's ok to add mem_cgroup_update_stat()
   independent from mem_cgroup_update_file_mapped(). It may look messy
   but it's not your fault. But please write why you add a new function
   in the patch description.
 
  I'm sorry for wasting your time.

 Do we need to go down this route? We could check the stat and do the
 correct thing. In case of FILE_MAPPED, always grab page_cgroup_lock
 and for others potentially look at trylock. It is OK for different
 stats to be protected via different locks.

   
I _don't_ want to see a mixture of spinlock and trylock in a function.
   
  
   A well documented well written function can help. The other thing is to
   of-course solve this correctly by introducing different locking around
   the statistics. Are you suggesting the later?
  
  
   No. As I wrote.
          - don't modify codes around FILE_MAPPED in this series.
          - add a new functions for new statistics
   Then,
          - think about clean up later, after we confirm all things work as 
   expected.
  
  I have ported Andrea Righi's memcg dirty page accounting patches to the
  latest mmotm-2010-04-05-16-09.  In doing so I have to address this locking
  issue.  Does the following look good?  I will (of course) submit the entire
  patch for review, but I wanted to make sure I was aiming in the right
  direction.
  
  void mem_cgroup_update_page_stat(struct page *page,
 enum mem_cgroup_write_page_stat_item idx, bool charge)
  {
 static int seq;
 struct page_cgroup *pc;
  
 if (mem_cgroup_disabled())
 return;
 pc = lookup_page_cgroup(page);
 if (!pc || mem_cgroup_is_root(pc-mem_cgroup))
 return;
  
 /*
  * This routine does not disable irq when updating stats.  So it is
  * possible that a stat update from within interrupt routine, could
  * deadlock.  Use trylock_page_cgroup() to avoid such deadlock.  This
  * makes the memcg counters fuzzy.  More complicated, or lower
  * performing locking solutions avoid this fuzziness, but are not
  * currently needed.
  */
 if (irqs_disabled()) {
  ^
  Or may be in_interrupt()?
 
 Good catch.  I will replace irqs_disabled() with in_interrupt().
 
I think you should check both. __remove_from_page_cache(), which will update
DIRTY, is called with irqs disabled (IOW, under mapping->tree_lock) but not in
interrupt context.

Anyway, I tend to agree with KAMEZAWA-san: use trylock always (except for
FILE_MAPPED), or add some new interfaces (e.g. mem_cgroup_update_stat_locked/safe...).
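
That is, something along these lines (the names are only tentative):

	/* for call sites that already hold mapping->tree_lock (irqs disabled) */
	void mem_cgroup_update_stat_locked(struct page *page, int idx, bool charge);

	/* for call sites holding no such lock; takes lock_page_cgroup() itself */
	void mem_cgroup_update_stat_safe(struct page *page, int idx, bool charge);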

Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock

2010-03-17 Thread Daisuke Nishimura
On Thu, 18 Mar 2010 09:45:19 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Thu, 18 Mar 2010 08:54:11 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  On Wed, 17 Mar 2010 17:28:55 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   * Andrea Righi ari...@develer.com [2010-03-15 00:26:38]:
   
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, file-mapped is maintained. But a more generic update function
will be needed for dirty page accounting.

For accounting page status, we have to guarantee lock_page_cgroup()
will never be called with tree_lock held.
To guarantee that, we use trylock when updating status.
By this, we do fuzzy accounting, but in almost all cases it's correct.
   
   
    I don't like this at all; "in almost all cases" is not acceptable
    for statistics, since decisions will be made on them and having them
    incorrect is really bad. Could we do a form of deferred statistics and
    fix this?
   
  
   plz show your implementation which has no performance regression.
   For me, I don't need file_mapped accounting at all. If we can remove that,
   we can add a simple migration lock.
   file_mapped is a feature you added. please improve it.
  
 
 BTW, I should explain how accurate this accounting is in this patch itself.
 
 Now, lock_page_cgroup/unlock_page_cgroup happens when
   - charge/uncharge/migrate/move accounting
 
 Then, the lock contention (trylock failure) seems to occur in conflict
 with
   - charge, uncharge, migrate, move accounting
 
 About dirty accounting, charge/uncharge/migrate are operations done
 synchronously with the radix-tree (holding tree_lock etc). Then there is no
 account leak. Move accounting is the only source of inaccuracy...but I don't
 think this move-task case is critical...moreover, we don't move any file
 pages at task-move, now.
 (But Nishimura-san has a plan to do so.)
 So, contention will happen only in conflict with force_empty.
 
 About FILE_MAPPED accounting, it's not synchronous with radix-tree operations.
 Then, an accounting miss seems possible at charge/uncharge/migrate/account
 move.
 But
   charge ...... we don't account a page as FILE_MAPPED before it's charged.
   uncharge .... the usual operation in truncation is
                 unmap -> remove-from-radix-tree.
                 Then, it's sequential in almost all cases. The race exists
                 when...
                 Assume there are 2 threads A and B. A truncates a file, B
                 maps/unmaps it.
                 This is a very unusual conflict.
   migrate ..... we do try_to_unmap before migrating pages. Then, FILE_MAPPED
                 is properly handled.
   move account  we don't have move-account-of-mapped-file, yet.
 
FILE_MAPPED is updated under the pte lock. OTOH, move accounting is also done
under the pte lock. The page cgroup lock is taken under the pte lock in both
cases, so move accounting is not such a problem for FILE_MAPPED.


Thanks,
Daisuke Nishimura.

 Then, this trylock contention happens in conflict with force_empty and
 truncate.
 
 
 Then, the main source of contention is force_empty. But it's called when
 removing a memcg, and accounting for such a memcg is not important.
 Then, I say this accounting is okay.
 
 To be more accurate, we may need another migration lock. But to get better
 performance for the root cgroup, we have to call mem_cgroup_is_root() before
 taking the lock, and there will be another complicated race.
 
 Bye,
 -Kame
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 2/5] memcg: dirty memory documentation

2010-03-16 Thread Daisuke Nishimura
On Mon, 15 Mar 2010 00:26:39 +0100, Andrea Righi ari...@develer.com wrote:
 Document cgroup dirty memory interfaces and statistics.
 
 Signed-off-by: Andrea Righi ari...@develer.com
 ---
  Documentation/cgroups/memory.txt |   36 
  1 files changed, 36 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/cgroups/memory.txt 
 b/Documentation/cgroups/memory.txt
 index 49f86f3..38ca499 100644
 --- a/Documentation/cgroups/memory.txt
 +++ b/Documentation/cgroups/memory.txt
 @@ -310,6 +310,11 @@ cache- # of bytes of page cache memory.
  rss  - # of bytes of anonymous and swap cache memory.
  pgpgin   - # of pages paged in (equivalent to # of charging 
 events).
  pgpgout  - # of pages paged out (equivalent to # of uncharging 
 events).
 +filedirty- # of pages that are waiting to get written back to the disk.
 +writeback- # of pages that are actively being written back to the disk.
 +writeback_tmp- # of pages used by FUSE for temporary writeback 
 buffers.
 +nfs  - # of NFS pages sent to the server, but not yet committed to
 +   the actual storage.
  active_anon  - # of bytes of anonymous and  swap cache memory on active
 lru list.
  inactive_anon- # of bytes of anonymous memory and swap cache memory 
 on
 @@ -345,6 +350,37 @@ Note:
- a cgroup which uses hierarchy and it has child cgroup.
- a cgroup which uses hierarchy and not the root of hierarchy.
  
 +5.4 dirty memory
 +
 +  Control the maximum amount of dirty pages a cgroup can have at any given 
 time.
 +
 +  Limiting dirty memory is like fixing the max amount of dirty (hard to
 +  reclaim) page cache used by any cgroup. So, in case of multiple cgroup 
 writers,
 +  they will not be able to consume more than their designated share of dirty
 +  pages and will be forced to perform write-out if they cross that limit.
 +
 +  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
 +  It is possible to configure a limit to trigger both a direct writeback or a
 +  background writeback performed by per-bdi flusher threads.
 +
 +  Per-cgroup dirty limits can be set using the following files in the 
 cgroupfs:
 +
 +  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
 +amount of dirty memory at which a process which is generating disk writes
 +inside the cgroup will start itself writing out dirty data.
 +
 +  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed 
 in
 +bytes) at which a process generating disk writes will start itself 
 writing
 +out dirty data.
 +
 +  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
 +memory, the amount of dirty memory at which background writeback kernel
 +threads will start writing out dirty data.
 +
 +  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup 
 (in
 +bytes) at which background writeback kernel threads will start writing 
 out
 +dirty data.
 +
  
It would be better to note what those files mean for the root cgroup.
We cannot write any value to them, IOW, we cannot control the dirty limit of
the root cgroup.
And they show the same value as the global one (strictly speaking, that's not
true because the global values can change. We need a hook in
mem_cgroup_dirty_read()?).
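
For what it's worth, from user space the proposed knobs would be driven like
any other cgroupfs file, e.g. (a sketch only; the mount point and group name
are just examples):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* 10% dirty limit, 5% background limit for this group */
		int fd = open("/cgroups/memory/grp0/memory.dirty_ratio", O_WRONLY);
		write(fd, "10", 2);
		close(fd);

		fd = open("/cgroups/memory/grp0/memory.dirty_background_ratio", O_WRONLY);
		write(fd, "5", 1);
		close(fd);
		return 0;
	}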

Thanks,
Daisuke Nishimura.

  6. Hierarchy support
  
 -- 
 1.6.3.3
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-16 Thread Daisuke Nishimura
On Tue, 16 Mar 2010 10:11:50 -0400
Vivek Goyal vgo...@redhat.com wrote:

 On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote:
 
 [..]
   + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
   + * @item:memory statistic item exported to the kernel
   + *
   + * Return the accounted statistic value, or a negative value in case of 
   error.
   + */
   +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
   +{
   + struct mem_cgroup_page_stat stat = {};
   + struct mem_cgroup *mem;
   +
   + rcu_read_lock();
   + mem = mem_cgroup_from_task(current);
   + if (mem  !mem_cgroup_is_root(mem)) {
   + /*
   +  * If we're looking for dirtyable pages we need to evaluate
   +  * free pages depending on the limit and usage of the parents
   +  * first of all.
   +  */
   + if (item == MEMCG_NR_DIRTYABLE_PAGES)
   + stat.value = memcg_get_hierarchical_free_pages(mem);
   + /*
   +  * Recursively evaluate page statistics against all cgroup
   +  * under hierarchy tree
   +  */
   + stat.item = item;
   + mem_cgroup_walk_tree(mem, stat, mem_cgroup_page_stat_cb);
   + } else
   + stat.value = -EINVAL;
   + rcu_read_unlock();
   +
   + return stat.value;
   +}
   +
  hmm, mem_cgroup_page_stat() can return negative value, but you place 
  BUG_ON()
  in [5/5] to check it returns negative value. What happens if the current is 
  moved
  to root between mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat() ?
  How about making mem_cgroup_has_dirty_limit() return the target mem_cgroup, 
  and
  passing the mem_cgroup to mem_cgroup_page_stat() ?
  
 
 Hmm, if mem_cgroup_has_dirty_limit() returns a pointer to the memcg, then one
 will have to use rcu_read_lock() and that will look ugly.
 
agreed.

 Why don't we simply look at the return value and if it is negative, we
 fall back to using global stats and get rid of BUG_ON()?
 
 Or, modify mem_cgroup_page_stat() to return global stats if it can't
 determine per cgroup stat for some reason. (mem=NULL or root cgroup etc).
 
I don't have any objection as long as we don't hit BUG_ON.
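
FWIW, the fallback could be as simple as something like this at the call site
(just a sketch; determine_dirtyable_memory() stands for however page-writeback.c
already computes the global dirtyable number):

	s64 nr = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
	if (nr < 0)
		/* no memcg, root cgroup, or raced with a move: use global stats */
		nr = determine_dirtyable_memory();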


Thanks,
Daisuke Nishimura.

 Vivek
 
 --
 To unsubscribe, send a message with 'unsubscribe linux-mm' in
 the body to majord...@kvack.org.  For more info on Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a
 



___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-15 Thread Daisuke Nishimura
  !do_swap_account)
continue;
+   if (i = MCS_FILE_STAT_STAR  i = MCS_FILE_STAT_END 
+  mem_cgroup_is_root(mem_cont))
+   continue;
cb-fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
}

Is this to avoid showing the file stats in the root cgroup? They are
meaningless values there anyway.
Of course, you'd better mention it in [2/5] too.


Thanks,
Daisuke Nishimura.

   /* per zone stat */
   val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 @@ -3583,6 +3917,63 @@ unlock:
   return ret;
  }
  
 +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
 +{
 + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 +
 + switch (cft-private) {
 + case MEM_CGROUP_DIRTY_RATIO:
 + return memcg-dirty_param.dirty_ratio;
 + case MEM_CGROUP_DIRTY_BYTES:
 + return memcg-dirty_param.dirty_bytes;
 + case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
 + return memcg-dirty_param.dirty_background_ratio;
 + case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
 + return memcg-dirty_param.dirty_background_bytes;
 + default:
 + BUG();
 + }
 +}
 +
 +static int
 +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
 +{
 + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 + int type = cft-private;
 +
 + if (cgrp-parent == NULL)
 + return -EINVAL;
 + if ((type == MEM_CGROUP_DIRTY_RATIO ||
 + type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)  val  100)
 + return -EINVAL;
 + /*
 +  * TODO: provide a validation check routine. And retry if validation
 +  * fails.
 +  */
 + switch (type) {
 + case MEM_CGROUP_DIRTY_RATIO:
 + memcg-dirty_param.dirty_ratio = val;
 + memcg-dirty_param.dirty_bytes = 0;
 + break;
 + case MEM_CGROUP_DIRTY_BYTES:
 + memcg-dirty_param.dirty_ratio  = 0;
 + memcg-dirty_param.dirty_bytes = val;
 + break;
 + case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
 + memcg-dirty_param.dirty_background_ratio = val;
 + memcg-dirty_param.dirty_background_bytes = 0;
 + break;
 + case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
 + memcg-dirty_param.dirty_background_ratio = 0;
 + memcg-dirty_param.dirty_background_bytes = val;
 + break;
 + default:
 + BUG();
 + break;
 + }
 + return 0;
 +}
 +
  static struct cftype mem_cgroup_files[] = {
   {
   .name = usage_in_bytes,
 @@ -3634,6 +4025,30 @@ static struct cftype mem_cgroup_files[] = {
   .write_u64 = mem_cgroup_swappiness_write,
   },
   {
 + .name = dirty_ratio,
 + .read_u64 = mem_cgroup_dirty_read,
 + .write_u64 = mem_cgroup_dirty_write,
 + .private = MEM_CGROUP_DIRTY_RATIO,
 + },
 + {
 + .name = dirty_bytes,
 + .read_u64 = mem_cgroup_dirty_read,
 + .write_u64 = mem_cgroup_dirty_write,
 + .private = MEM_CGROUP_DIRTY_BYTES,
 + },
 + {
 + .name = dirty_background_ratio,
 + .read_u64 = mem_cgroup_dirty_read,
 + .write_u64 = mem_cgroup_dirty_write,
 + .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
 + },
 + {
 + .name = dirty_background_bytes,
 + .read_u64 = mem_cgroup_dirty_read,
 + .write_u64 = mem_cgroup_dirty_write,
 + .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
 + },
 + {
   .name = move_charge_at_immigrate,
   .read_u64 = mem_cgroup_move_charge_read,
   .write_u64 = mem_cgroup_move_charge_write,
 @@ -3892,8 +4307,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct 
 cgroup *cont)
   mem-last_scanned_child = 0;
   spin_lock_init(mem-reclaim_param_lock);
  
 - if (parent)
 + if (parent) {
   mem-swappiness = get_swappiness(parent);
 + mem-dirty_param = parent-dirty_param;
 + } else {
 + while (1) {
 + get_global_vm_dirty_param(mem-dirty_param);
 + /*
 +  * Since global dirty parameters are not protected we
 +  * try to speculatively read them and retry if we get
 +  * inconsistent values.
 +  */
 + if (likely(dirty_param_is_valid(mem-dirty_param)))
 + break;
 + }
 + }
   atomic_set(mem-refcnt, 1);
   mem-move_charge_at_immigrate = 0;
   mutex_init(mem-thresholds_lock);
 -- 
 1.6.3.3
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)

2010-03-11 Thread Daisuke Nishimura
On Thu, 11 Mar 2010 15:15:11 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Thu, 11 Mar 2010 14:13:00 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  On Thu, 11 Mar 2010 13:58:47 +0900
  Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
I'll consider yet another fix for race in account migration if I can.

   me too.
   
  
  How about this ? Assume that the race is very rare.
  
  1. use trylock when updating statistics.
 If trylock fails, don't account it.
  
  2. add PCG_FLAG for all status as
  
  +   PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/
  +   PCG_ACCT_DIRTY, /* page is dirty */
  +   PCG_ACCT_WRITEBACK, /* page is being written back to disk */
  +   PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
  +   PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
  
  3. At reducing counter, check PCG_xxx flags by
  TESTCLEARPCGFLAG()
  
  This is similar to a method already _used_ for LRU accounting. And we can
  expect that this method's error range never gets too bad.
  
I agree with you. I've been thinking whether we can remove page cgroup lock
in update_stat as we do in lru handling codes.
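
For example, the clear/uncharge side could then become something like this
(the flag and counter names are only tentative, following the proposed
PCG_ACCT_* flags):

	/* only decrement counters we really accounted for this page */
	if (TestClearPageCgroupAcctDirty(pc))
		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
	if (TestClearPageCgroupAcctWriteback(pc))
		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_WRITEBACK]);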

  I think this kind of fuzzy accounting is enough for writeback status.
  Does anyone need strict accounting ?
  
 
IMHO, we don't need strict accounting.

 How this looks ?
I agree to this direction. One concern is we re-introduce trylock again..

Some comments are inlined.

 ==
 From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 Now, file-mapped is maintained. But a more generic update function
 will be needed for dirty page accounting.
 
 For accounting page status, we have to guarantee lock_page_cgroup()
 will never be called with tree_lock held.
 To guarantee that, we use trylock when updating status.
 By this, we do fuzzy accounting, but in almost all cases it's correct.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 ---
  include/linux/memcontrol.h  |7 +++
  include/linux/page_cgroup.h |   15 +++
  mm/memcontrol.c |   88 
 +---
  mm/rmap.c   |4 +-
  4 files changed, 90 insertions(+), 24 deletions(-)
 
 Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
 ===
 --- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
 +++ mmotm-2.6.34-Mar9/mm/memcontrol.c
 @@ -1348,30 +1348,79 @@ bool mem_cgroup_handle_oom(struct mem_cg
   * Currently used to update mapped file statistics, but the routine can be
   * generalized to update other statistics as well.
   */
 -void mem_cgroup_update_file_mapped(struct page *page, int val)
 +void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
  {
   struct mem_cgroup *mem;
 - struct page_cgroup *pc;
 -
 - pc = lookup_page_cgroup(page);
 - if (unlikely(!pc))
 - return;
 + int val;
  
 - lock_page_cgroup(pc);
   mem = pc-mem_cgroup;
 - if (!mem)
 - goto done;
 + if (!mem || !PageCgroupUsed(pc))
 + return;
  
 - if (!PageCgroupUsed(pc))
 - goto done;
 + if (charge)
 + val = 1;
 + else
 + val = -1;
  
 + switch (idx) {
 + case MEMCG_NR_FILE_MAPPED:
 + if (charge) {
 + if (!PageCgroupFileMapped(pc))
 + SetPageCgroupFileMapped(pc);
 + else
 + val = 0;
 + } else {
 + if (PageCgroupFileMapped(pc))
 + ClearPageCgroupFileMapped(pc);
 + else
 + val = 0;
 + }
Wouldn't using !TestSetPageCgroupFileMapped(pc) or
TestClearPageCgroupFileMapped(pc) be better?
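
Something like this (just a sketch; it assumes TestSet/TestClear variants are
added alongside the existing Set/Clear page_cgroup flag macros):

	case MEMCG_NR_FILE_MAPPED:
		if (charge) {
			if (TestSetPageCgroupFileMapped(pc))
				val = 0;	/* was already accounted */
		} else {
			if (!TestClearPageCgroupFileMapped(pc))
				val = 0;	/* was not accounted */
		}
		idx = MEM_CGROUP_STAT_FILE_MAPPED;
		break;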

 + idx = MEM_CGROUP_STAT_FILE_MAPPED;
 + break;
 + default:
 + BUG();
 + break;
 + }
   /*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
 - __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 + __this_cpu_add(mem-stat-count[idx], val);
 +}
  
 -done:
 - unlock_page_cgroup(pc);
 +void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
 +{
 + struct page_cgroup *pc;
 +
 + pc = lookup_page_cgroup(page);
 + if (unlikely(!pc))
 + return;
 +
 + if (trylock_page_cgroup(pc)) {
 + __mem_cgroup_update_stat(pc, idx, charge);
 + unlock_page_cgroup(pc);
 + }
 + return;
 +}
 +
 +static void mem_cgroup_migrate_stat(struct page_cgroup *pc,
 + struct mem_cgroup *from, struct mem_cgroup *to)
 +{
 + preempt_disable();
 + if (PageCgroupFileMapped(pc)) {
 + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 + }
 + preempt_enable

[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-11 Thread Daisuke Nishimura
On Thu, 11 Mar 2010 18:42:44 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Thu, 11 Mar 2010 18:25:00 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
  Then, it's no problem to check whether pc->mem_cgroup is the root cgroup or
  not without the spinlock.
  ==
  void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
  {
  pc = lookup_page_cgroup(page);
  if (unlikely(!pc) || mem_cgroup_is_root(pc->mem_cgroup))
  return;
  ...
  }
  ==
  This can be handled with the same logic as the lock-failure path.
  And we just ignore the accounting.
  
  There will be no spinlocks... To do more than this,
  I think we have to use struct page rather than struct page_cgroup.
  
 Hmm..like this ? The bad point of this patch is that this will corrupt 
 FILE_MAPPED
 status in root cgroup. This kind of change is not very good.
 So, one way is to use this kind of function only for new parameters. Hmm.
IMHO, if we disable accounting of file stats in the root cgroup, it would be
better not to show them in memory.stat, to avoid confusing users.
But, hmm, I think accounting them in the root cgroup isn't so meaningless.
Isn't making mem_cgroup_has_dirty_limit() return false in the case of the root
cgroup enough?
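
Just to illustrate (a rough sketch, keeping the helper's current name):

	/* no per-memcg dirty limit for unaccounted tasks or the root cgroup */
	bool mem_cgroup_has_dirty_limit(void)
	{
		struct mem_cgroup *mem;
		bool ret;

		if (mem_cgroup_disabled())
			return false;
		rcu_read_lock();
		mem = mem_cgroup_from_task(current);
		ret = mem && !mem_cgroup_is_root(mem);
		rcu_read_unlock();
		return ret;
	}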

 ==
 
 From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 Now, file-mapped is maintained. But a more generic update function
 will be needed for dirty page accounting.
 
 For accounting page status, we have to guarantee lock_page_cgroup()
 will never be called with tree_lock held.
 To guarantee that, we use trylock when updating status.
 By this, we do fuzzy accounting, but in almost all cases it's correct.
 
 Changelog:
  - removed unnecessary preempt_disable()
  - added root cgroup check. By this, we do no lock/account in root cgroup.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Looks good overall.


Thanks,
Daisuke Nishimura.

 ---
  include/linux/memcontrol.h  |7 ++-
  include/linux/page_cgroup.h |   15 +++
  mm/memcontrol.c |   92 
 +---
  mm/rmap.c   |4 -
  4 files changed, 94 insertions(+), 24 deletions(-)
 
 Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
 ===
 --- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
 +++ mmotm-2.6.34-Mar9/mm/memcontrol.c
 @@ -1348,30 +1348,83 @@ bool mem_cgroup_handle_oom(struct mem_cg
   * Currently used to update mapped file statistics, but the routine can be
   * generalized to update other statistics as well.
   */
 -void mem_cgroup_update_file_mapped(struct page *page, int val)
 +void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
  {
   struct mem_cgroup *mem;
 - struct page_cgroup *pc;
 -
 - pc = lookup_page_cgroup(page);
 - if (unlikely(!pc))
 - return;
 + int val;
  
 - lock_page_cgroup(pc);
   mem = pc-mem_cgroup;
 - if (!mem)
 - goto done;
 + if (!mem || !PageCgroupUsed(pc))
 + return;
  
 - if (!PageCgroupUsed(pc))
 - goto done;
 + if (charge)
 + val = 1;
 + else
 + val = -1;
  
 + switch (idx) {
 + case MEMCG_NR_FILE_MAPPED:
 + if (charge) {
 + if (!PageCgroupFileMapped(pc))
 + SetPageCgroupFileMapped(pc);
 + else
 + val = 0;
 + } else {
 + if (PageCgroupFileMapped(pc))
 + ClearPageCgroupFileMapped(pc);
 + else
 + val = 0;
 + }
 + idx = MEM_CGROUP_STAT_FILE_MAPPED;
 + break;
 + default:
 + BUG();
 + break;
 + }
   /*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
 - __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 + __this_cpu_add(mem-stat-count[idx], val);
 +}
  
 -done:
 - unlock_page_cgroup(pc);
 +void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
 +{
 + struct page_cgroup *pc;
 +
 + pc = lookup_page_cgroup(page);
 + if (!pc || mem_cgroup_is_root(pc-mem_cgroup))
 + return;
 +
 + if (trylock_page_cgroup(pc)) {
 + __mem_cgroup_update_stat(pc, idx, charge);
 + unlock_page_cgroup(pc);
 + }
 + return;
 +}
 +
 +static void mem_cgroup_migrate_stat(struct page_cgroup *pc,
 + struct mem_cgroup *from, struct mem_cgroup *to)
 +{
 + if (PageCgroupFileMapped(pc)) {
 + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 + if (!mem_cgroup_is_root(to)) {
 + 
 __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 + } else {
 + ClearPageCgroupFileMapped(pc);
 + }
 + }
 +}
 +
 +static void
 +__mem_cgroup_stat_fixup

[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)

2010-03-10 Thread Daisuke Nishimura
On Wed, 10 Mar 2010 09:26:24 +0530, Balbir Singh bal...@linux.vnet.ibm.com 
wrote:
 * nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-03-10 
 10:43:09]:
 
   Please please measure the performance overhead of this change.
   
  
  here.
  
I made a patch below and measured the time(average of 10 times) 
of kernel build
on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

  
  after2(local_irq_save/restore)
  - root cgroup: 191.42 sec
  - child cgroup: 193.55 sec
  
  hmm, I think it's within the error range, but by testing several times I can
  see a tendency that it gets slower as I add additional code. Using
  local_irq_disable()/enable() everywhere except in
  mem_cgroup_update_file_mapped() (it can be the only candidate to be called
  with irqs disabled in the future) might be the choice.
 
 
 Error range would depend on things like standard deviation and
 repetition. It might be good to keep update_file_mapped and see the
 impact. My concern is with large systems, the difference might be
 larger.
  
 -- 
   Three Cheers,
   Balbir
I made a patch (attached) using both local_irq_disable()/enable() and
local_irq_save()/restore().
local_irq_save()/restore() is used only in mem_cgroup_update_file_mapped().

And I attached a histogram of 30 kernel-build runs in the root cgroup for
each:

  before_root: no irq operation(original)
  after_root: local_irq_disable/enable for all
  after2_root: local_irq_save/restore for all
  after3_root: mixed version(attached)

hmm, there seems to be a tendency that before < after < after3 < after2 ?
Should I replace the save/restore version with the mixed version ?


Thanks,
Daisuke Nishimura.
===
 include/linux/page_cgroup.h |   28 ++--
 mm/memcontrol.c |   36 ++--
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..c0aca62 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,16 +83,40 @@ static inline enum zone_type page_cgroup_zid(struct 
page_cgroup *pc)
return page_zonenum(pc-page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_lock(PCG_LOCK, pc-flags);
 }
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_unlock(PCG_LOCK, pc-flags);
 }
 
+#define lock_page_cgroup_irq(pc)   \
+   do {\
+   local_irq_disable();\
+   __lock_page_cgroup(pc); \
+   } while (0)
+
+#define unlock_page_cgroup_irq(pc) \
+   do {\
+   __unlock_page_cgroup(pc);   \
+   local_irq_enable(); \
+   } while (0)
+
+#define lock_page_cgroup_irqsave(pc, flags)\
+   do {\
+   local_irq_save(flags);  \
+   __lock_page_cgroup(pc); \
+   } while (0)
+
+#define unlock_page_cgroup_irqrestore(pc, flags)   \
+   do {\
+   __unlock_page_cgroup(pc);   \
+   local_irq_restore(flags);   \
+   } while (0)
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ea959..11d483e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1354,12 +1354,13 @@ void mem_cgroup_update_file_mapped(struct page *page, 
int val)
 {
struct mem_cgroup *mem;
struct page_cgroup *pc;
+   unsigned long flags;
 
pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
 
-   lock_page_cgroup(pc);
+   lock_page_cgroup_irqsave(pc, flags);
mem = pc-mem_cgroup;
if (!mem)
goto done;
@@ -1373,7 +1374,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int 
val)
__this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup_irqrestore(pc, flags);
 }
 
 /*
@@ -1711,7 +1712,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
VM_BUG_ON(!PageLocked(page));
 
pc = lookup_page_cgroup(page);
-   lock_page_cgroup(pc);
+   lock_page_cgroup_irq(pc);
if (PageCgroupUsed(pc)) {
mem = pc-mem_cgroup;
if (mem  !css_tryget(mem-css))
@@ -1725,7 +1726,7 @@ struct mem_cgroup

[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)

2010-03-10 Thread Daisuke Nishimura
On Thu, 11 Mar 2010 13:49:08 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Thu, 11 Mar 2010 13:31:23 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
  On Wed, 10 Mar 2010 09:26:24 +0530, Balbir Singh 
  bal...@linux.vnet.ibm.com wrote:
   * nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-03-10 
   10:43:09]:
 
  I made a patch(attached) using both local_irq_disable/enable and 
  local_irq_save/restore.
  local_irq_save/restore is used only in mem_cgroup_update_file_mapped.
  
  And I attached a histogram graph of 30 times kernel build in root cgroup 
  for each.
  
before_root: no irq operation(original)
after_root: local_irq_disable/enable for all
after2_root: local_irq_save/restore for all
after3_root: mixed version(attached)
  
  hmm, there seems to be a tendency that before < after < after3 < after2 ?
  Should I replace the save/restore version with the mixed version ?
  
 
 IMHO, starting from the after2_root version is the easiest.
 If there is a chance that lock/unlock_page_cgroup can be called in
 interrupt context, we _have to_ disable IRQs anyway.
 And if we have to do this, I prefer a migration_lock rather than this mixture.
 
I see.

 BTW, how big is your system? Balbir-san's concern is about bigger machines.
 But I'm not sure this change is affected by the size of the machine.
 I'm sorry, I have no big machine now.
 
My test machine has 8 CPUs, and I ran all the tests with make -j8.
Sorry, I don't have easy access to a huge machine either.

 I'll consider yet another fix for race in account migration if I can.
 
me too.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)

2010-03-09 Thread Daisuke Nishimura
 Please please measure the performance overhead of this change.
 

here.

  I made a patch below and measured the time(average of 10 times) of 
  kernel build
  on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).
  
  before
  - root cgroup: 190.47 sec
  - child cgroup: 192.81 sec
  
  after
  - root cgroup: 191.06 sec
  - child cgroup: 193.06 sec
  

after2(local_irq_save/restore)
- root cgroup: 191.42 sec
- child cgroup: 193.55 sec

hmm, I think it's within the error range, but by testing several times I can
see a tendency that it gets slower as I add additional code. Using
local_irq_disable()/enable() everywhere except in
mem_cgroup_update_file_mapped() (it can be the only candidate to be called
with irqs disabled in the future) might be the choice.


Thanks,
Daisuke Nishimura.

On Tue, 9 Mar 2010 10:20:58 +0530, Balbir Singh bal...@linux.vnet.ibm.com 
wrote:
 * nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-03-09 
 10:29:28]:
 
  On Tue, 9 Mar 2010 09:19:14 +0900, KAMEZAWA Hiroyuki 
  kamezawa.hir...@jp.fujitsu.com wrote:
   On Tue, 9 Mar 2010 01:12:52 +0100
   Andrea Righi ari...@develer.com wrote:
   
On Mon, Mar 08, 2010 at 05:31:00PM +0900, KAMEZAWA Hiroyuki wrote:
 On Mon, 8 Mar 2010 17:07:11 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
  On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki 
  kamezawa.hir...@jp.fujitsu.com wrote:
   On Mon, 8 Mar 2010 11:17:24 +0900
   Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
   
 But IIRC, clear_writeback is done under treelock No ?
 
The place where NR_WRITEBACK is updated is out of tree_lock.

   1311 int test_clear_page_writeback(struct page *page)
   1312 {
   1313 struct address_space *mapping = 
page_mapping(page);
   1314 int ret;
   1315
   1316 if (mapping) {
   1317 struct backing_dev_info *bdi = 
mapping-backing_dev_info;
   1318 unsigned long flags;
   1319
   1320 spin_lock_irqsave(mapping-tree_lock, 
flags);
   1321 ret = TestClearPageWriteback(page);
   1322 if (ret) {
   1323 
radix_tree_tag_clear(mapping-page_tree,
   1324 
page_index(page),
   1325 
PAGECACHE_TAG_WRITEBACK);
   1326 if 
(bdi_cap_account_writeback(bdi)) {
   1327 __dec_bdi_stat(bdi, 
BDI_WRITEBACK);
   1328 __bdi_writeout_inc(bdi);
   1329 }
   1330 }
   1331 
spin_unlock_irqrestore(mapping-tree_lock, flags);
   1332 } else {
   1333 ret = TestClearPageWriteback(page);
   1334 }
   1335 if (ret)
   1336 dec_zone_page_state(page, NR_WRITEBACK);
   1337 return ret;
   1338 }
   
   We can move this up to under tree_lock. Considering memcg, all 
   our target has mapping.
   
   If we newly account bounce-buffers (for NILFS, FUSE, etc..), 
   which has no -mapping,
   we need much more complex new charge/uncharge theory.
   
   But yes, adding new lock scheme seems complicated. (Sorry Andrea.)
   My concerns is performance. We may need somehing new 
   re-implementation of
   locks/migrate/charge/uncharge.
   
  I agree. Performance is my concern too.
  
  I made a patch below and measured the time(average of 10 times) of 
  kernel build
  on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).
  
  before
  - root cgroup: 190.47 sec
  - child cgroup: 192.81 sec
  
  after
  - root cgroup: 191.06 sec
  - child cgroup: 193.06 sec
  
  Hmm... about 0.3% slower for root, 0.1% slower for child.
  
 
 Hmm...accepatable ? (sounds it's in error-range)
 
 BTW, why local_irq_disable() ? 
 local_irq_save()/restore() isn't better ?

Probably there's not the overhead of saving flags? 
   maybe.
   
Anyway, it would make the code much more readable...

   ok.
   
   please go ahead in this direction. Nishimura-san, would you post an
   independent patch ? If no, Andrea-san, please.
   
  This is the updated version.
  
  Andrea-san, can you merge this into your patch set ?
  
 
 Please please measure the performance overhead of this change.
 
 -- 
   Three Cheers,
   Balbir
___
Containers mailing list
contain...@lists.linux

[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-08 Thread Daisuke Nishimura
On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Mon, 8 Mar 2010 11:17:24 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
   But IIRC, clear_writeback is done under treelock No ?
   
  The place where NR_WRITEBACK is updated is out of tree_lock.
  
 1311 int test_clear_page_writeback(struct page *page)
 1312 {
 1313 struct address_space *mapping = page_mapping(page);
 1314 int ret;
 1315
 1316 if (mapping) {
 1317 struct backing_dev_info *bdi = 
  mapping-backing_dev_info;
 1318 unsigned long flags;
 1319
 1320 spin_lock_irqsave(mapping-tree_lock, flags);
 1321 ret = TestClearPageWriteback(page);
 1322 if (ret) {
 1323 radix_tree_tag_clear(mapping-page_tree,
 1324 page_index(page),
 1325 
  PAGECACHE_TAG_WRITEBACK);
 1326 if (bdi_cap_account_writeback(bdi)) {
 1327 __dec_bdi_stat(bdi, BDI_WRITEBACK);
 1328 __bdi_writeout_inc(bdi);
 1329 }
 1330 }
 1331 spin_unlock_irqrestore(mapping-tree_lock, flags);
 1332 } else {
 1333 ret = TestClearPageWriteback(page);
 1334 }
 1335 if (ret)
 1336 dec_zone_page_state(page, NR_WRITEBACK);
 1337 return ret;
 1338 }
 
 We can move this up to under tree_lock. Considering memcg, all our targets
 have a mapping.
 
 If we newly account bounce-buffers (for NILFS, FUSE, etc.), which have no
 ->mapping, we need a much more complex new charge/uncharge theory.
 
 But yes, adding a new lock scheme seems complicated. (Sorry, Andrea.)
 My concern is performance. We may need some new re-implementation of
 locks/migrate/charge/uncharge.
 
I agree. Performance is my concern too.

I made a patch below and measured the time(average of 10 times) of kernel build
on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

Hmm... about 0.3% slower for root, 0.1% slower for child.

===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

In the current implementation, we don't have to disable irqs at lock_page_cgroup()
because the lock is never acquired in interrupt context.
But we are going to do so in a later patch, so this patch encloses all of
lock_page_cgroup()/unlock_page_cgroup() with local_irq_disable()/local_irq_enable().

Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 mm/memcontrol.c |   17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ea959..e5ae1a1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1359,6 +1359,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int 
val)
if (unlikely(!pc))
return;
 
+   local_irq_disable();
lock_page_cgroup(pc);
mem = pc-mem_cgroup;
if (!mem)
@@ -1374,6 +1375,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int 
val)
 
 done:
unlock_page_cgroup(pc);
+   local_irq_enable();
 }
 
 /*
@@ -1711,6 +1713,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
VM_BUG_ON(!PageLocked(page));
 
pc = lookup_page_cgroup(page);
+   local_irq_disable();
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
mem = pc-mem_cgroup;
@@ -1726,6 +1729,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
rcu_read_unlock();
}
unlock_page_cgroup(pc);
+   local_irq_enable();
return mem;
 }
 
@@ -1742,9 +1746,11 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup 
*mem,
if (!mem)
return;
 
+   local_irq_disable();
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
+   local_irq_enable();
mem_cgroup_cancel_charge(mem);
return;
}
@@ -1775,6 +1781,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup 
*mem,
mem_cgroup_charge_statistics(mem, pc, true);
 
unlock_page_cgroup(pc);
+   local_irq_enable();
/*
 * charge_statistics updated event counter. Then, check it.
 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1844,12 +1851,14 @@ static int mem_cgroup_move_account(struct page_cgroup 
*pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
int ret = -EINVAL;
+   local_irq_disable();
lock_page_cgroup(pc

[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-08 Thread Daisuke Nishimura
On Mon, 8 Mar 2010 17:31:00 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Mon, 8 Mar 2010 17:07:11 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
  On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki 
  kamezawa.hir...@jp.fujitsu.com wrote:
   On Mon, 8 Mar 2010 11:17:24 +0900
   Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
   
 But IIRC, clear_writeback is done under treelock No ?
 
The place where NR_WRITEBACK is updated is out of tree_lock.

   1311 int test_clear_page_writeback(struct page *page)
   1312 {
   1313 struct address_space *mapping = page_mapping(page);
   1314 int ret;
   1315
   1316 if (mapping) {
   1317 struct backing_dev_info *bdi = 
mapping-backing_dev_info;
   1318 unsigned long flags;
   1319
   1320 spin_lock_irqsave(mapping-tree_lock, flags);
   1321 ret = TestClearPageWriteback(page);
   1322 if (ret) {
   1323 
radix_tree_tag_clear(mapping-page_tree,
   1324 
page_index(page),
   1325 
PAGECACHE_TAG_WRITEBACK);
   1326 if (bdi_cap_account_writeback(bdi)) {
   1327 __dec_bdi_stat(bdi, 
BDI_WRITEBACK);
   1328 __bdi_writeout_inc(bdi);
   1329 }
   1330 }
   1331 spin_unlock_irqrestore(mapping-tree_lock, 
flags);
   1332 } else {
   1333 ret = TestClearPageWriteback(page);
   1334 }
   1335 if (ret)
   1336 dec_zone_page_state(page, NR_WRITEBACK);
   1337 return ret;
   1338 }
   
   We can move this up to under tree_lock. Considering memcg, all our target 
   has mapping.
   
   If we newly account bounce-buffers (for NILFS, FUSE, etc..), which has no 
   -mapping,
   we need much more complex new charge/uncharge theory.
   
   But yes, adding new lock scheme seems complicated. (Sorry Andrea.)
   My concerns is performance. We may need somehing new re-implementation of
   locks/migrate/charge/uncharge.
   
  I agree. Performance is my concern too.
  
  I made a patch below and measured the time(average of 10 times) of kernel 
  build
  on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).
  
  before
  - root cgroup: 190.47 sec
  - child cgroup: 192.81 sec
  
  after
  - root cgroup: 191.06 sec
  - child cgroup: 193.06 sec
  
  Hmm... about 0.3% slower for root, 0.1% slower for child.
  
 
 Hmm...accepatable ? (sounds it's in error-range)
 
 BTW, why local_irq_disable() ? 
 local_irq_save()/restore() isn't better ?
 
I don't have any strong reason. All of lock_page_cgroup() is *now* called w/o
irqs disabled, so I just used disable()/enable() instead of save()/restore().
I think disable()/enable() is better in those cases because we don't need to
save/restore the EFLAGS register with pushf/popf; I don't have any numbers
though, and there probably wouldn't be a big difference in performance.
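
To make the difference concrete, the two patterns being compared are basically
the following (the helper names are just for illustration):

	static void update_stat_irq_off(struct page_cgroup *pc)
	{
		local_irq_disable();		/* plain cli on x86 */
		lock_page_cgroup(pc);
		/* ... update the counter ... */
		unlock_page_cgroup(pc);
		local_irq_enable();		/* plain sti */
	}

	static void update_stat_irq_save(struct page_cgroup *pc)
	{
		unsigned long flags;

		local_irq_save(flags);		/* additionally saves EFLAGS */
		lock_page_cgroup(pc);
		/* ... update the counter ... */
		unlock_page_cgroup(pc);
		local_irq_restore(flags);	/* restores the saved EFLAGS */
	}

The save/restore form becomes mandatory once the function can be entered with
irqs already disabled, because it must not unconditionally re-enable them.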


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-08 Thread Daisuke Nishimura
On Tue, 9 Mar 2010 09:20:54 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Tue, 9 Mar 2010 09:18:45 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
  On Mon, 8 Mar 2010 17:31:00 +0900, KAMEZAWA Hiroyuki 
  kamezawa.hir...@jp.fujitsu.com wrote:
   On Mon, 8 Mar 2010 17:07:11 +0900
   Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
   Hmm...accepatable ? (sounds it's in error-range)
   
   BTW, why local_irq_disable() ? 
   local_irq_save()/restore() isn't better ?
   
  I don't have any strong reason. All of lock_page_cgroup() is *now* called 
  w/o irq disabled,
  so I used just disable()/enable() instead of save()/restore().
 
 My point is, this will be used under treelock soon.
 
I agree.

I'll update the patch using save()/restore(), and repost later.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)

2010-03-08 Thread Daisuke Nishimura
On Tue, 9 Mar 2010 09:19:14 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Tue, 9 Mar 2010 01:12:52 +0100
 Andrea Righi ari...@develer.com wrote:
 
  On Mon, Mar 08, 2010 at 05:31:00PM +0900, KAMEZAWA Hiroyuki wrote:
   On Mon, 8 Mar 2010 17:07:11 +0900
   Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
   
On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Mon, 8 Mar 2010 11:17:24 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
   But IIRC, clear_writeback is done under treelock No ?
   
  The place where NR_WRITEBACK is updated is out of tree_lock.
  
 1311 int test_clear_page_writeback(struct page *page)
 1312 {
 1313 struct address_space *mapping = page_mapping(page);
 1314 int ret;
 1315
 1316 if (mapping) {
 1317 struct backing_dev_info *bdi = 
  mapping-backing_dev_info;
 1318 unsigned long flags;
 1319
 1320 spin_lock_irqsave(mapping-tree_lock, 
  flags);
 1321 ret = TestClearPageWriteback(page);
 1322 if (ret) {
 1323 
  radix_tree_tag_clear(mapping-page_tree,
 1324 
  page_index(page),
 1325 
  PAGECACHE_TAG_WRITEBACK);
 1326 if (bdi_cap_account_writeback(bdi)) 
  {
 1327 __dec_bdi_stat(bdi, 
  BDI_WRITEBACK);
 1328 __bdi_writeout_inc(bdi);
 1329 }
 1330 }
 1331 spin_unlock_irqrestore(mapping-tree_lock, 
  flags);
 1332 } else {
 1333 ret = TestClearPageWriteback(page);
 1334 }
 1335 if (ret)
 1336 dec_zone_page_state(page, NR_WRITEBACK);
 1337 return ret;
 1338 }
 
 We can move this up to under tree_lock. Considering memcg, all our 
 target has mapping.
 
 If we newly account bounce-buffers (for NILFS, FUSE, etc..), which 
 has no -mapping,
 we need much more complex new charge/uncharge theory.
 
 But yes, adding new lock scheme seems complicated. (Sorry Andrea.)
 My concerns is performance. We may need somehing new 
 re-implementation of
 locks/migrate/charge/uncharge.
 
I agree. Performance is my concern too.

I made a patch below and measured the time(average of 10 times) of 
kernel build
on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

Hmm... about 0.3% slower for root, 0.1% slower for child.

   
   Hmm...accepatable ? (sounds it's in error-range)
   
   BTW, why local_irq_disable() ? 
   local_irq_save()/restore() isn't better ?
  
  Probably there's not the overhead of saving flags? 
 maybe.
 
  Anyway, it would make the code much more readable...
  
 ok.
 
 please go ahead in this direction. Nishimura-san, would you post an
 independent patch ? If no, Andrea-san, please.
 
This is the updated version.

Andrea-san, can you merge this into your patch set ?

===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

In the current implementation, we don't have to disable irqs at lock_page_cgroup()
because the lock is never acquired in interrupt context.
But we are going to call it in a later patch in interrupt context or with
irqs disabled, so this patch disables irqs at lock_page_cgroup() and enables them
at unlock_page_cgroup().

Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 include/linux/page_cgroup.h |   16 ++--
 mm/memcontrol.c |   43 +--
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..0d2f92c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,16 +83,28 @@ static inline enum zone_type page_cgroup_zid(struct 
page_cgroup *pc)
return page_zonenum(pc-page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_lock(PCG_LOCK, pc-flags);
 }
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_unlock(PCG_LOCK, pc-flags);
 }
 
+#define lock_page_cgroup(pc, flags)\
+   do {\
+   local_irq_save(flags

[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-07 Thread Daisuke Nishimura
 +/*
 + * mem_cgroup_update_page_stat_locked() - update memcg file cache's 
 accounting
 + * @page:the page involved in a file cache operation.
 + * @idx: the particular file cache statistic.
 + * @charge:  true to increment, false to decrement the statistic specified
 + *   by @idx.
 + *
 + * Update memory cgroup file cache's accounting from a locked context.
 + *
 + * NOTE: must be called with mapping-tree_lock held.
 + */
 +void mem_cgroup_update_page_stat_locked(struct page *page,
 + enum mem_cgroup_write_page_stat_item idx, bool charge)
 +{
 + struct address_space *mapping = page_mapping(page);
 + struct page_cgroup *pc;
 +
 + if (mem_cgroup_disabled())
 + return;
 + WARN_ON_ONCE(!irqs_disabled());
 + WARN_ON_ONCE(mapping  !spin_is_locked(mapping-tree_lock));
 +
I think this is the wrong place to insert the assertion.
The problem with the page cgroup lock is that it can be interrupted in the
current implementation.
So,

a) it must not be acquired under another lock which can be acquired in interrupt
   context, such as mapping->tree_lock, to avoid:

   context1                              context2
   lock_page_cgroup(pcA)
                                         spin_lock_irq(&tree_lock)
     <interrupted>                       lock_page_cgroup(pcA)
     spin_lock_irqsave(&tree_lock)         => fail
       => fail

b) it must not be acquired in interrupt context, to avoid:

   lock_page_cgroup(pcA)
   <interrupted>
   lock_page_cgroup(pcA)
     => fail

I think something like this would be better:

@@ -83,8 +83,14 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }

+#include <linux/irqflags.h>
+#include <linux/hardirq.h>
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
+#ifdef CONFIG_DEBUG_VM
+	WARN_ON_ONCE(irqs_disabled());
+	WARN_ON_ONCE(in_interrupt());
+#endif
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }

 + pc = lookup_page_cgroup(page);
 + if (unlikely(!pc) || !PageCgroupUsed(pc))
 + return;
 + mem_cgroup_update_page_stat(pc, idx, charge);
 +}
 +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_locked);
 +
 +/*
 + * mem_cgroup_update_page_stat_unlocked() - update memcg file cache's 
 accounting
 + * @page:the page involved in a file cache operation.
 + * @idx: the particular file cache statistic.
 + * @charge:  true to increment, false to decrement the statistic specified
 + *   by @idx.
 + *
 + * Update memory cgroup file cache's accounting from an unlocked context.
 + */
 +void mem_cgroup_update_page_stat_unlocked(struct page *page,
 + enum mem_cgroup_write_page_stat_item idx, bool charge)
 +{
 + struct page_cgroup *pc;
 +
 + if (mem_cgroup_disabled())
 + return;
 + pc = lookup_page_cgroup(page);
 + if (unlikely(!pc) || !PageCgroupUsed(pc))
 + return;
 + lock_page_cgroup(pc);
 + mem_cgroup_update_page_stat(pc, idx, charge);
   unlock_page_cgroup(pc);
  }
 +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_unlocked);
  
IIUC, test_clear_page_writeback() (at least) can be called in interrupt context.
This means lock_page_cgroup() would be called in interrupt context, that is,
case b) above can happen.
hmm... I don't have any good idea for now except disabling irqs around the page
cgroup lock to avoid all of this mess.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-07 Thread Daisuke Nishimura
On Mon, 8 Mar 2010 10:56:41 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Mon, 8 Mar 2010 10:44:47 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
   +/*
   + * mem_cgroup_update_page_stat_locked() - update memcg file cache's 
   accounting
   + * @page:the page involved in a file cache operation.
   + * @idx: the particular file cache statistic.
   + * @charge:  true to increment, false to decrement the statistic 
   specified
   + *   by @idx.
   + *
   + * Update memory cgroup file cache's accounting from a locked context.
   + *
   + * NOTE: must be called with mapping-tree_lock held.
   + */
   +void mem_cgroup_update_page_stat_locked(struct page *page,
   + enum mem_cgroup_write_page_stat_item idx, bool charge)
   +{
   + struct address_space *mapping = page_mapping(page);
   + struct page_cgroup *pc;
   +
   + if (mem_cgroup_disabled())
   + return;
   + WARN_ON_ONCE(!irqs_disabled());
   + WARN_ON_ONCE(mapping  !spin_is_locked(mapping-tree_lock));
   +
  I think this is a wrong place to insert assertion.
  The problem about page cgroup lock is that it can be interrupted in current 
  implementation.
  So,
  
  a) it must not be aquired under another lock which can be aquired in 
  interrupt context,
 such as mapping-tree_lock, to avoid:
  
  context1context2
  lock_page_cgroup(pcA)
  spin_lock_irq(tree_lock)
  lock_page_cgroup(pcA)   interrupted
  =fail  spin_lock_irqsave(tree_lock)
  =fail
  
  b) it must not be aquired in interrupt context to avoid:
  
  lock_page_cgroup(pcA)
  interrupted
  lock_page_cgroup(pcA)
  =fail
  
  I think something like this would be better:
  
  @@ -83,8 +83,14 @@ static inline enum zone_type page_cgroup_zid(struct 
  page_cgroup *pc)
  return page_zonenum(pc-page);
   }
  
  +#include <linux/irqflags.h>
  +#include <linux/hardirq.h>
   static inline void lock_page_cgroup(struct page_cgroup *pc)
   {
  +#ifdef CONFIG_DEBUG_VM
  +   WARN_ON_ONCE(irqs_disabled());
  +   WARN_ON_ONCE(in_interrupt());
  +#endif
  bit_spin_lock(PCG_LOCK, pc-flags);
   }
  
   + pc = lookup_page_cgroup(page);
   + if (unlikely(!pc) || !PageCgroupUsed(pc))
   + return;
   + mem_cgroup_update_page_stat(pc, idx, charge);
   +}
   +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_locked);
   +
   +/*
   + * mem_cgroup_update_page_stat_unlocked() - update memcg file cache's 
   accounting
   + * @page:the page involved in a file cache operation.
   + * @idx: the particular file cache statistic.
   + * @charge:  true to increment, false to decrement the statistic 
   specified
   + *   by @idx.
   + *
   + * Update memory cgroup file cache's accounting from an unlocked context.
   + */
   +void mem_cgroup_update_page_stat_unlocked(struct page *page,
   + enum mem_cgroup_write_page_stat_item idx, bool charge)
   +{
   + struct page_cgroup *pc;
   +
   + if (mem_cgroup_disabled())
   + return;
   + pc = lookup_page_cgroup(page);
   + if (unlikely(!pc) || !PageCgroupUsed(pc))
   + return;
   + lock_page_cgroup(pc);
   + mem_cgroup_update_page_stat(pc, idx, charge);
 unlock_page_cgroup(pc);
}
   +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_unlocked);

  IIUC, test_clear_page_writeback(at least) can be called under interrupt 
  context.
  This means lock_page_cgroup() is called under interrupt context, that is,
  the case b) above can happen.
  hmm... I don't have any good idea for now except disabling irq around page 
  cgroup lock
  to avoid all of these mess things.
  
 
 Hmm...simply IRQ-off for all updates ?
I think so, for the current code.
But after these changes, we must use local_irq_save()/restore()
instead of local_irq_disable()/enable() in mem_cgroup_update_page_stat().
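(As a sketch of why that matters: local_irq_disable()/enable() would unconditionally
re-enable interrupts, which is wrong if the caller already ran with them disabled,
e.g. under mapping->tree_lock. The function names below are placeholders, not real
kernel functions.)

/* not safe if the caller may already run with irqs disabled */
static void update_stat_unsafe(void)
{
        local_irq_disable();
        /* ... update the per-cpu counters ... */
        local_irq_enable();     /* may re-enable irqs behind the caller's back */
}

/* safe: restores whatever irq state the caller had */
static void update_stat_safe(void)
{
        unsigned long flags;

        local_irq_save(flags);
        /* ... update the per-cpu counters ... */
        local_irq_restore(flags);
}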

 But IIRC, clear_writeback is done under treelock No ?
 
The place where NR_WRITEBACK is updated is outside tree_lock.

   1311 int test_clear_page_writeback(struct page *page)
   1312 {
   1313 struct address_space *mapping = page_mapping(page);
   1314 int ret;
   1315
   1316 if (mapping) {
   1317 struct backing_dev_info *bdi = 
mapping-backing_dev_info;
   1318 unsigned long flags;
   1319
   1320 spin_lock_irqsave(mapping-tree_lock, flags);
   1321 ret = TestClearPageWriteback(page);
   1322 if (ret) {
   1323 radix_tree_tag_clear(mapping-page_tree,
   1324 page_index(page),
   1325 
PAGECACHE_TAG_WRITEBACK);
   1326 if (bdi_cap_account_writeback(bdi)) {
   1327

[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-04 Thread Daisuke Nishimura
 + ClearPageCgroupDirty(pc);
 + break;
 + case MEM_CGROUP_STAT_WRITEBACK:
 + if (val  0)
 + SetPageCgroupWriteback(pc);
 + else
 + ClearPageCgroupWriteback(pc);
 + break;
 + case MEM_CGROUP_STAT_WRITEBACK_TEMP:
 + if (val  0)
 + SetPageCgroupWritebackTemp(pc);
 + else
 + ClearPageCgroupWritebackTemp(pc);
 + break;
 + case MEM_CGROUP_STAT_UNSTABLE_NFS:
 + if (val  0)
 + SetPageCgroupUnstableNFS(pc);
 + else
 + ClearPageCgroupUnstableNFS(pc);
 + break;
 + default:
 + BUG();
 + break;
 + }
 + mem = pc-mem_cgroup;
 + if (likely(mem))
 + __this_cpu_add(mem-stat-count[idx], val);
 + unlock_page_cgroup_migrate(pc);
  }
 +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
  
  /*
   * size of first charge trial. 32 comes from vmscan.c's magic value.
 @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct 
 mem_cgroup *mem,
   memcg_check_events(mem, pc-page);
  }
  
 +/*
 + * Update file cache accounted statistics on task migration.
 + *
 + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
 + * So, at the moment this function simply returns without updating accounted
 + * statistics, because we deal only with anonymous pages here.
 + */
This function is not unique to task migration. It's called from rmdir() too.
So this comment isn't needed.

 +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
 + struct mem_cgroup *from, struct mem_cgroup *to)
 +{
 + struct page *page = pc-page;
 +
 + if (!page_mapped(page) || PageAnon(page))
 + return;
 +
 + if (PageCgroupFileMapped(pc)) {
 + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 + }
 + if (PageCgroupDirty(pc)) {
 + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_DIRTY]);
 + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_DIRTY]);
 + }
 + if (PageCgroupWriteback(pc)) {
 + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_WRITEBACK]);
 + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WRITEBACK]);
 + }
 + if (PageCgroupWritebackTemp(pc)) {
 + __this_cpu_dec(
 + from-stat-count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
 + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
 + }
 + if (PageCgroupUnstableNFS(pc)) {
 + __this_cpu_dec(
 + from-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
 + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
 + }
 +}
 +
  /**
   * __mem_cgroup_move_account - move account of the page
   * @pc:  page_cgroup of the page.
 @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct 
 mem_cgroup *mem,
  static void __mem_cgroup_move_account(struct page_cgroup *pc,
   struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
  {
 - struct page *page;
 -
   VM_BUG_ON(from == to);
   VM_BUG_ON(PageLRU(pc-page));
   VM_BUG_ON(!PageCgroupLocked(pc));
   VM_BUG_ON(!PageCgroupUsed(pc));
   VM_BUG_ON(pc-mem_cgroup != from);
  
 - page = pc-page;
 - if (page_mapped(page)  !PageAnon(page)) {
 - /* Update mapped_file data for mem_cgroup */
 - preempt_disable();
 - __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 - __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
 - preempt_enable();
 - }
 + preempt_disable();
 + lock_page_cgroup_migrate(pc);
 + __mem_cgroup_update_file_stat(pc, from, to);
 +
   mem_cgroup_charge_statistics(from, pc, false);
   if (uncharge)
   /* This is not cancel, but cancel_charge does all we need. */
 @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct 
 page_cgroup *pc,
   /* caller should have done css_get */
   pc-mem_cgroup = to;
   mem_cgroup_charge_statistics(to, pc, true);
 + unlock_page_cgroup_migrate(pc);
 + preempt_enable();
Glad to see this cleanup :)
But, hmm, I don't think preempt_disable()/enable() is enough (and bit_spin_lock()/unlock()
disables preemption anyway). lock/unlock_page_cgroup_migrate() can be called in irq
context (e.g. end_page_writeback()), so I think we must use local_irq_disable()/enable()
here.


Thanks,
Daisuke Nishimura.

   /*
* We charges against to which may not have any tasks. Then, to
* can be under rmdir(). But in current implementation, caller of
 @@ -3042,6 +3261,10 @@ enum {
   MCS_PGPGIN,
   MCS_PGPGOUT,
   MCS_SWAP,
 + MCS_FILE_DIRTY,
 + MCS_WRITEBACK,
 + MCS_WRITEBACK_TEMP

[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-03 Thread Daisuke Nishimura
  +   ClearPageCgroupAcctDirty(pc);
  +   break;
  +   case MEM_CGROUP_STAT_WBACK:
  +   if (set)
  +   SetPageCgroupAcctWB(pc);
  +   else
  +   ClearPageCgroupAcctWB(pc);
  +   break;
  +   case MEM_CGROUP_STAT_WBACK_TEMP:
  +   if (set)
  +   SetPageCgroupAcctWBTemp(pc);
  +   else
  +   ClearPageCgroupAcctWBTemp(pc);
  +   break;
  +   case MEM_CGROUP_STAT_UNSTABLE_NFS:
  +   if (set)
  +   SetPageCgroupAcctUnstableNFS(pc);
  +   else
  +   ClearPageCgroupAcctUnstableNFS(pc);
  +   break;
  +   default:
  +   BUG();
  +   break;
  +   }
  +   mem = pc-mem_cgroup;
  +   if (set)
  +   __this_cpu_inc(mem-stat-count[idx]);
  +   else
  +   __this_cpu_dec(mem-stat-count[idx]);
  +   unlock_page_cgroup_migrate(pc);
  +}
  +
  +static void move_acct_information(struct mem_cgroup *from,
  +   struct mem_cgroup *to,
  +   struct page_cgroup *pc)
  +{
  +   /* preemption is disabled, migration_lock is held. */
  +   if (PageCgroupAcctDirty(pc)) {
  +   __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_DIRTY]);
  +   __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_DIRTY]);
  +   }
  +   if (PageCgroupAcctWB(pc)) {
  +   __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_WBACK]);
  +   __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WBACK]);
  +   }
  +   if (PageCgroupAcctWBTemp(pc)) {
  +   __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_WBACK_TEMP]);
  +   __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WBACK_TEMP]);
  +   }
  +   if (PageCgroupAcctUnstableNFS(pc)) {
  +   __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
  +   __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
  +   }
  +}
  +
  +/*
* size of first charge trial. 32 comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
*/
  @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st
  VM_BUG_ON(!PageCgroupUsed(pc));
  VM_BUG_ON(pc-mem_cgroup != from);
   
  +   preempt_disable();
  +   lock_page_cgroup_migrate(pc);
  page = pc-page;
  if (page_mapped(page)  !PageAnon(page)) {
  /* Update mapped_file data for mem_cgroup */
  -   preempt_disable();
  __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
  __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
  -   preempt_enable();
  }
  mem_cgroup_charge_statistics(from, pc, false);
  +   move_acct_information(from, to, pc);
 
 Kame-san, a question. According to is_target_pte_for_mc() it seems we
 don't move file pages across cgroups for now. If !PageAnon(page) we just
 return 0 and the page won't be selected for migration in
 mem_cgroup_move_charge_pte_range().
 
You're right. It's my TODO to move file pages at task migration.

 So, if I've understood well the code is correct in perspective, but
 right now it's unnecessary. File pages are not moved on task migration
 across cgroups and, at the moment, there's no way for file page
 accounted statistics to go negative.
 
 Or am I missing something?
 
__mem_cgroup_move_account() will be called not only at task migration
but also at rmdir, so I think it would be better to handle file pages anyway.
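(A toy user-space illustration of why: if the per-page stat stays with the old cgroup
while the charge itself is moved at rmdir/migration, the later decrement lands on the
new cgroup and the counter can go negative, just as was pointed out for FILE_MAPPED.
This is only a model, not kernel code.)

#include <stdio.h>

/* toy model: the stat update always goes to the cgroup the page is
 * currently charged to (pc->mem_cgroup) */
struct memcg { long file_mapped; };

int main(void)
{
        struct memcg child = { 0 }, parent = { 0 };

        child.file_mapped++;    /* file page gets mapped while charged to child */

        /* rmdir(child): the charge is moved to parent by
         * __mem_cgroup_move_account(), but suppose the FILE_MAPPED stat
         * is not moved along with it ... */

        parent.file_mapped--;   /* ... the page is unmapped later, and the
                                 * decrement now hits the parent */

        printf("parent FILE_MAPPED = %ld (negative)\n", parent.file_mapped);
        printf("child  FILE_MAPPED = %ld (leaked)\n", child.file_mapped);
        return 0;
}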


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-02 Thread Daisuke Nishimura
On Tue, 2 Mar 2010 09:01:58 +0100, Andrea Righi ari...@develer.com wrote:
 On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
  On Mon,  1 Mar 2010 22:23:40 +0100
  Andrea Righi ari...@develer.com wrote:
  
   Apply the cgroup dirty pages accounting and limiting infrastructure to
   the opportune kernel functions.
   
   Signed-off-by: Andrea Righi ari...@develer.com
  
  Seems nice.
  
  Hmm. the last problem is moving account between memcg.
  
  Right ?
 
 Correct. This was actually the last item of the TODO list. Anyway, I'm
 still considering if it's correct to move dirty pages when a task is
 migrated from a cgroup to another. Currently, dirty pages just remain in
 the original cgroup and are flushed depending on the original cgroup
 settings. That is not totally wrong... at least moving the dirty pages
 between memcgs should be optional (move_charge_at_immigrate?).
 
FYI, I'm planning to add file-cache and shmem/tmpfs support to the move_charge
feature for 2.6.35.
But, hmm, it would be complicated if we try to move the dirty accounting too.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-02 Thread Daisuke Nishimura
On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi ari...@develer.com wrote:
 On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
  * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-02 17:23:16]:
  
   On Tue, 2 Mar 2010 09:01:58 +0100
   Andrea Righi ari...@develer.com wrote:
   
On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
 On Mon,  1 Mar 2010 22:23:40 +0100
 Andrea Righi ari...@develer.com wrote:
 
  Apply the cgroup dirty pages accounting and limiting infrastructure 
  to
  the opportune kernel functions.
  
  Signed-off-by: Andrea Righi ari...@develer.com
 
 Seems nice.
 
 Hmm. the last problem is moving account between memcg.
 
 Right ?

Correct. This was actually the last item of the TODO list. Anyway, I'm
still considering if it's correct to move dirty pages when a task is
migrated from a cgroup to another. Currently, dirty pages just remain in
the original cgroup and are flushed depending on the original cgroup
settings. That is not totally wrong... at least moving the dirty pages
between memcgs should be optional (move_charge_at_immigrate?).

   
   My concern is 
- migration between memcg is already suppoted
   - at task move
   - at rmdir
   
   Then, if you leave DIRTY_PAGE accounting to original cgroup,
   the new cgroup (migration target)'s Dirty page accounting may
   goes to be negative, or incorrect value. Please check FILE_MAPPED
   implementation in __mem_cgroup_move_account()
   
   As
  if (page_mapped(page)  !PageAnon(page)) {
   /* Update mapped_file data for mem_cgroup */
   preempt_disable();
   
   __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
   
   __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
   preempt_enable();
   }
   then, FILE_MAPPED never goes negative.
  
  
  Absolutely! I am not sure how complex dirty memory migration will be,
  but one way of working around it would be to disable migration of
  charges when the feature is enabled (dirty* is set in the memory
  cgroup). We might need additional logic to allow that to happen. 
 
 I've started to look at dirty memory migration. First attempt is to add
 DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
 __mem_cgroup_move_account(). Probably I'll have something ready for the
 next version of the patch. I still need to figure if this can work as
 expected...
 
I agree it's the right direction (in fact, I have been planning to post a patch
in that direction), so I leave it to you.
Can you add a PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
same way as the other flags you're trying to add, and we can change
if (page_mapped(page) && !PageAnon(page)) to if (PageCgroupFileMapped(pc))
in __mem_cgroup_move_account(). It would be cleaner than the current code, IMHO.
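(Roughly, what I have in mind is something like the fragments below; PCG_FILE_MAPPED
and its Set/Clear/Test helpers would be the new page_cgroup flag, and the per-cpu stat
names are the existing ones. Untested, just to show the shape.)

/* in mem_cgroup_update_file_mapped(): remember the state on the page_cgroup */
if (val > 0)
        SetPageCgroupFileMapped(pc);
else
        ClearPageCgroupFileMapped(pc);

/* in __mem_cgroup_move_account(): no page_mapped()/PageAnon() test needed */
if (PageCgroupFileMapped(pc)) {
        __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
        __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
}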


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-02 Thread Daisuke Nishimura
 diff --git a/mm/filemap.c b/mm/filemap.c
 index fe09e51..f85acae 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
* having removed the page entirely.
*/
   if (PageDirty(page)  mapping_cap_account_dirty(mapping)) {
 + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
   dec_zone_page_state(page, NR_FILE_DIRTY);
   dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
   }
(snip)
 @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
  void account_page_dirtied(struct page *page, struct address_space *mapping)
  {
   if (mapping_cap_account_dirty(mapping)) {
 + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
   __inc_zone_page_state(page, NR_FILE_DIRTY);
   __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
   task_dirty_inc(current);
As far as I can see, those two functions (at least) call mem_cgroup_update_stat(),
which acquires the page cgroup lock, under mapping->tree_lock.
But as I fixed before in commit e767e056, the page cgroup lock must not be acquired
under mapping->tree_lock.
hmm, we should either call mem_cgroup_update_stat() outside mapping->tree_lock,
or add local_irq_save()/restore() around lock/unlock_page_cgroup() to avoid
deadlock.


Thanks,
Daisuke Nishimura.

On Mon,  1 Mar 2010 22:23:40 +0100, Andrea Righi ari...@develer.com wrote:
 Apply the cgroup dirty pages accounting and limiting infrastructure to
 the opportune kernel functions.
 
 Signed-off-by: Andrea Righi ari...@develer.com
 ---
  fs/fuse/file.c  |5 +++
  fs/nfs/write.c  |4 ++
  fs/nilfs2/segment.c |   10 +-
  mm/filemap.c|1 +
  mm/page-writeback.c |   84 --
  mm/rmap.c   |4 +-
  mm/truncate.c   |2 +
  7 files changed, 76 insertions(+), 34 deletions(-)
 
 diff --git a/fs/fuse/file.c b/fs/fuse/file.c
 index a9f5e13..dbbdd53 100644
 --- a/fs/fuse/file.c
 +++ b/fs/fuse/file.c
 @@ -11,6 +11,7 @@
   #include <linux/pagemap.h>
   #include <linux/slab.h>
   #include <linux/kernel.h>
  +#include <linux/memcontrol.h>
   #include <linux/sched.h>
   #include <linux/module.h>
  
 @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
 struct fuse_req *req)
  
   list_del(req-writepages_entry);
   dec_bdi_stat(bdi, BDI_WRITEBACK);
 + mem_cgroup_update_stat(req-pages[0],
 + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
   dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
   bdi_writeout_inc(bdi);
   wake_up(fi-page_waitq);
 @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
   req-inode = inode;
  
   inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
 + mem_cgroup_update_stat(tmp_page,
 + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
   inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
   end_page_writeback(page);
  
 diff --git a/fs/nfs/write.c b/fs/nfs/write.c
 index b753242..7316f7a 100644
 --- a/fs/nfs/write.c
 +++ b/fs/nfs/write.c
 @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
   req-wb_index,
   NFS_PAGE_TAG_COMMIT);
   spin_unlock(inode-i_lock);
 + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
   inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
   inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE);
   __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
   struct page *page = req-wb_page;
  
   if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
 + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
   dec_zone_page_state(page, NR_UNSTABLE_NFS);
   dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE);
   return 1;
 @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
 *head, int how)
   req = nfs_list_entry(head-next);
   nfs_list_remove_request(req);
   nfs_mark_request_commit(req);
 + mem_cgroup_update_stat(req-wb_page,
 + MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
   dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
   dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
   BDI_UNSTABLE);
 diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
 index ada2f1b..aef6d13 100644
 --- a/fs/nilfs2/segment.c
 +++ b/fs/nilfs2/segment.c
 @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, 
 struct list_head *out)
   } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head);
   kunmap_atomic(kaddr, KM_USER0);
  
 - if (!TestSetPageWriteback(clone_page))
 + if (!TestSetPageWriteback(clone_page

[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-02 Thread Daisuke Nishimura
On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:
 On Wed, 3 Mar 2010 11:12:38 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
   diff --git a/mm/filemap.c b/mm/filemap.c
   index fe09e51..f85acae 100644
   --- a/mm/filemap.c
   +++ b/mm/filemap.c
   @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
  * having removed the page entirely.
  */
 if (PageDirty(page)  mapping_cap_account_dirty(mapping)) {
   + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 dec_zone_page_state(page, NR_FILE_DIRTY);
 dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
 }
  (snip)
   @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space 
   *mapping)
{
 if (mapping_cap_account_dirty(mapping)) {
   + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 __inc_zone_page_state(page, NR_FILE_DIRTY);
 __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
 task_dirty_inc(current);
  As long as I can see, those two functions(at least) calls 
  mem_cgroup_update_state(),
  which acquires page cgroup lock, under mapping-tree_lock.
  But as I fixed before in commit e767e056, page cgroup lock must not 
  acquired under
  mapping-tree_lock.
  hmm, we should call those mem_cgroup_update_state() outside 
  mapping-tree_lock,
  or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid 
  dead-lock.
  
 Ah, good catch! But hmm...
 This account_page_dirtted() seems to be called under IRQ-disabled.
 About  __remove_from_page_cache(), I think page_cgroup should have its own 
 DIRTY flag,
 then, mem_cgroup_uncharge_page() can handle it automatically.
 
 But. there are no guarantee that following never happens. 
   lock_page_cgroup()
   === interrupt.
   - mapping-tree_lock()
 Even if mapping-tree_lock is held with IRQ-disabled.
 Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup().
 
 Then, hm...some kind of new trick ? as..
 (Follwoing patch is not tested!!)
 
If we can verify that every caller of mem_cgroup_update_stat() consistently either
holds or does not hold tree_lock, this direction will work fine.
But if we can't, we have to add local_irq_save() to lock_page_cgroup() like
below.

===
 include/linux/page_cgroup.h |8 ++--
 mm/memcontrol.c |   43 +--
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..51da916 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,15 +83,19 @@ static inline enum zone_type page_cgroup_zid(struct 
page_cgroup *pc)
return page_zonenum(pc-page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_lock(PCG_LOCK, pc-flags);
 }
+#define lock_page_cgroup(pc, flags) \
+  do { local_irq_save(flags); __lock_page_cgroup(pc); } while (0)
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_unlock(PCG_LOCK, pc-flags);
 }
+#define unlock_page_cgroup(pc, flags) \
+  do { __unlock_page_cgroup(pc); local_irq_restore(flags); } while (0)
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00ed4b1..40b9be4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1327,12 +1327,13 @@ void mem_cgroup_update_file_mapped(struct page *page, 
int val)
 {
struct mem_cgroup *mem;
struct page_cgroup *pc;
+   unsigned long flags;
 
pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
 
-   lock_page_cgroup(pc);
+   lock_page_cgroup(pc, flags);
mem = pc-mem_cgroup;
if (!mem)
goto done;
@@ -1346,7 +1347,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int 
val)
__this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup(pc, flags);
 }
 
 /*
@@ -1680,11 +1681,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
struct page_cgroup *pc;
unsigned short id;
swp_entry_t ent;
+   unsigned long flags;
 
VM_BUG_ON(!PageLocked(page));
 
pc = lookup_page_cgroup(page);
-   lock_page_cgroup(pc);
+   lock_page_cgroup(pc, flags);
if (PageCgroupUsed(pc)) {
mem = pc-mem_cgroup;
if (mem  !css_tryget(mem-css))
@@ -1698,7 +1700,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
mem = NULL;
rcu_read_unlock

[Devel] Re: [PATCH -mmotm 1/2] memcg: dirty pages accounting and limiting infrastructure

2010-03-01 Thread Daisuke Nishimura
On Fri, 26 Feb 2010 23:52:30 +0100, Andrea Righi ari...@develer.com wrote:
 Infrastructure to account dirty pages per cgroup and add dirty limit
 interfaces in the cgroupfs:
 
  - Active write-out: memory.dirty_ratio, memory.dirty_bytes
  - Background write-out: memory.dirty_background_ratio, 
 memory.dirty_background_bytes
 
It looks good for me in general.

 Signed-off-by: Andrea Righi ari...@develer.com
 ---
  include/linux/memcontrol.h |   74 +-
  mm/memcontrol.c|  354 
 
  2 files changed, 399 insertions(+), 29 deletions(-)
 

(snip)

 +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
 +{
 + struct mem_cgroup_page_stat stat = {};
 + struct mem_cgroup *memcg;
 +
I think it would be better to add an if (mem_cgroup_disabled()) check at the top
here (see the sketch after the quoted code below).

 + rcu_read_lock();
 + memcg = mem_cgroup_from_task(current);
 + if (memcg) {
 + /*
 +  * Recursively evaulate page statistics against all cgroup
 +  * under hierarchy tree
 +  */
 + stat.item = item;
 + mem_cgroup_walk_tree(memcg, stat, mem_cgroup_page_stat_cb);
 + } else
 + stat.value = -ENOMEM;
 + rcu_read_unlock();
 +
 + return stat.value;
 +}
 +
  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
  {
   int *val = data;
 @@ -1263,10 +1419,10 @@ static void record_last_oom(struct mem_cgroup *mem)
  }
  
  /*
 - * Currently used to update mapped file statistics, but the routine can be
 - * generalized to update other statistics as well.
 + * Generalized routine to update memory cgroup statistics.
   */
 -void mem_cgroup_update_file_mapped(struct page *page, int val)
 +void mem_cgroup_update_stat(struct page *page,
 + enum mem_cgroup_stat_index idx, int val)
  {
   struct mem_cgroup *mem;
   struct page_cgroup *pc;
ditto.

 @@ -1286,7 +1442,8 @@ void mem_cgroup_update_file_mapped(struct page *page, 
 int val)
   /*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
 - __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 + VM_BUG_ON(idx = MEM_CGROUP_STAT_NSTATS);
 + __this_cpu_add(mem-stat-count[idx], val);
  
  done:
   unlock_page_cgroup(pc);
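(The early-return I mean is just this; a sketch only, with the rest of the function
elided. mem_cgroup_disabled() is the existing helper, and the same check would go at
the top of mem_cgroup_page_stat() as noted above.)

void mem_cgroup_update_stat(struct page *page,
                enum mem_cgroup_stat_index idx, int val)
{
        /* nothing is accounted when memcg is disabled at boot
         * (cgroup_disable=memory), so bail out before touching
         * any page_cgroup */
        if (mem_cgroup_disabled())
                return;
        /* ... lookup_page_cgroup() / lock_page_cgroup() / update,
         * as in the patch above ... */
}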


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 3/4] memcg: rework usage of stats by soft limit

2009-12-27 Thread Daisuke Nishimura
On Sun, 27 Dec 2009 04:09:01 +0200, Kirill A. Shutemov kir...@shutemov.name 
wrote:
 Instead of incrementing counter on each page in/out and comparing it
 with constant, we set counter to constant, decrement counter on each
 page in/out and compare it with zero. We want to make comparing as fast
 as possible. On many RISC systems (probably not only RISC) comparing
 with zero is more effective than comparing with a constant, since not
 every constant can be immediate operand for compare instruction.
 
 Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
 since really it's not a generic counter.
 
 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  mm/memcontrol.c |   19 ++-
  1 files changed, 14 insertions(+), 5 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 1d71cb4..36eb7af 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -69,8 +69,9 @@ enum mem_cgroup_stat_index {
   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
 - MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
 + used by soft limit implementation */
  
   MEM_CGROUP_STAT_NSTATS,
  };
 @@ -90,6 +91,13 @@ __mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu 
 *stat,
   stat-count[idx] = 0;
  }
  
 +static inline void
 +__mem_cgroup_stat_set(struct mem_cgroup_stat_cpu *stat,
 + enum mem_cgroup_stat_index idx, s64 val)
 +{
 + stat-count[idx] = val;
 +}
 +
I think it would be better to name it __mem_cgroup_stat_set_safe.
And could you remove the definition of __mem_cgroup_stat_reset ?


Thanks,
Daisuke Nishimura.

  static inline s64
  __mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
   enum mem_cgroup_stat_index idx)
 @@ -380,9 +388,10 @@ static bool mem_cgroup_soft_limit_check(struct 
 mem_cgroup *mem)
  
   cpu = get_cpu();
   cpustat = mem-stat.cpustat[cpu];
 - val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
 - if (unlikely(val  SOFTLIMIT_EVENTS_THRESH)) {
 - __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
 + val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT);
 + if (unlikely(val  0)) {
 + __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_SOFTLIMIT,
 + SOFTLIMIT_EVENTS_THRESH);
   ret = true;
   }
   put_cpu();
 @@ -515,7 +524,7 @@ static void mem_cgroup_charge_statistics(struct 
 mem_cgroup *mem,
   else
   __mem_cgroup_stat_add_safe(cpustat,
   MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
 - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
 + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
   put_cpu();
  }
  
 -- 
 1.6.5.7
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH v4 4/4] memcg: implement memory thresholds

2009-12-27 Thread Daisuke Nishimura
On Sun, 27 Dec 2009 04:09:02 +0200, Kirill A. Shutemov kir...@shutemov.name 
wrote:
 It allows to register multiple memory and memsw thresholds and gets
 notifications when it crosses.
 
 To register a threshold application need:
 - create an eventfd;
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write string like event_fd memory.usage_in_bytes threshold to
   cgroup.event_control.
 
 Application will be notified through eventfd when memory usage crosses
 threshold in any direction.
 
 It's applicable for root and non-root cgroup.
 
 It uses stats to track memory usage, simmilar to soft limits. It checks
 if we need to send event to userspace on every 100 page in/out. I guess
 it's good compromise between performance and accuracy of thresholds.
 
 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  Documentation/cgroups/memory.txt |   19 +++-
  mm/memcontrol.c  |  275 
 ++
  2 files changed, 293 insertions(+), 1 deletions(-)
 
 diff --git a/Documentation/cgroups/memory.txt 
 b/Documentation/cgroups/memory.txt
 index b871f25..195af07 100644
 --- a/Documentation/cgroups/memory.txt
 +++ b/Documentation/cgroups/memory.txt
 @@ -414,7 +414,24 @@ NOTE1: Soft limits take effect over a long period of 
 time, since they involve
  NOTE2: It is recommended to set the soft limit always below the hard limit,
 otherwise the hard limit will take precedence.
  
 -8. TODO
 +8. Memory thresholds
 +
 +Memory controler implements memory thresholds using cgroups notification
 +API (see cgroups.txt). It allows to register multiple memory and memsw
 +thresholds and gets notifications when it crosses.
 +
 +To register a threshold application need:
 + - create an eventfd using eventfd(2);
 + - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 + - write string like event_fd memory.usage_in_bytes threshold to
 +   cgroup.event_control.
 +
 +Application will be notified through eventfd when memory usage crosses
 +threshold in any direction.
 +
 +It's applicable for root and non-root cgroup.
 +
 +9. TODO
  
  1. Add support for accounting huge pages (as a separate controller)
  2. Make per-cgroup scanner reclaim not-shared pages first
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 36eb7af..3a0a6a1 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
It may be a nitpick, but my patch
(http://marc.info/?l=linux-mm-commitsm=126152804420992w=2)
has already modified this area.

I think it might be better for you to apply my patches by hand, or to wait for the
next mmotm to be released, to avoid bothering Andrew.
(There is enough time left until the next merge window :))

(snip)

 +static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 +{
 + struct mem_cgroup_threshold_ary *thresholds;
 + u64 usage = mem_cgroup_usage(memcg, swap);
 + int i, cur;
 +
I think calling mem_cgroup_usage() after the if (!thresholds) check
decreases the overhead a little when no thresholds are set
(a sketch of the reordering follows the quoted function below).
I've confirmed that the change makes the assembler output different.

 + rcu_read_lock();
 + if (!swap) {
 + thresholds = rcu_dereference(memcg-thresholds);
 + } else {
 + thresholds = rcu_dereference(memcg-memsw_thresholds);
 + }
 +
 + if (!thresholds)
 + goto unlock;
 +
 + cur = atomic_read(thresholds-cur);
 +
 + /* Check if a threshold crossed in any direction */
 +
 + for(i = cur; i = 0 
 + unlikely(thresholds-entries[i].threshold  usage); i--) {
 + atomic_dec(thresholds-cur);
 + eventfd_signal(thresholds-entries[i].eventfd, 1);
 + }
 +
 + for(i = cur + 1; i  thresholds-size 
 + unlikely(thresholds-entries[i].threshold = usage); i++) {
 + atomic_inc(thresholds-cur);
 + eventfd_signal(thresholds-entries[i].eventfd, 1);
 + }
 +unlock:
 + rcu_read_unlock();
 +}
 +
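(The reordering mentioned above would look roughly like this; everything here is
taken from the quoted patch, only the mem_cgroup_usage() call is moved below the
NULL check, and the two scan loops are elided.)

static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
        struct mem_cgroup_threshold_ary *thresholds;
        u64 usage;

        rcu_read_lock();
        if (!swap)
                thresholds = rcu_dereference(memcg->thresholds);
        else
                thresholds = rcu_dereference(memcg->memsw_thresholds);

        if (!thresholds)
                goto unlock;

        /* only pay for the usage read when at least one threshold is set */
        usage = mem_cgroup_usage(memcg, swap);

        /* ... the two scan loops from the quoted patch, unchanged ... */
unlock:
        rcu_read_unlock();
}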


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

2009-12-12 Thread Daisuke Nishimura
On Sat, 12 Dec 2009 15:06:52 +0200
Kirill A. Shutemov kir...@shutemov.name wrote:

 On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
 d-nishim...@mtf.biglobe.ne.jp wrote:
  Sorry, I disagree this change.
 
  mem_cgroup_soft_limit_check() is used for checking how much current usage 
  exceeds
  the soft_limit_in_bytes and updating softlimit tree asynchronously, instead 
  of
  checking every charge/uncharge. What if you change the soft_limit_in_bytes,
  but the number of charges and uncharges are very balanced afterwards ?
  The softlimit tree will not be updated for a long time.
 
 I don't see how my patch affects the logic you've described.
 Statistics updates and
 checks in the same place. It just uses decrement instead of increment.
 
Ah... my bad. Ignore this comment, please.
I misunderstood this patch.

 
  And IIUC, it's the same for your threshold feature, right ?
  I think it would be better:
 
  - discard this change.
  - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
   and instead of adding a new STAT counter, do like:
 
         if (mem_cgroup_event_check(mem)) {
                 mem_cgroup_update_tree(mem, page);
                 mem_cgroup_threshold(mem);
         }
 
 I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
 run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
 between soft limits and thresholds in this case?
 
hmm, both the softlimit and your thresholds count events at the same places (charge
and uncharge).
So, I think those events can be shared.
Is there any reason they should run at different frequencies ?


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

2009-12-12 Thread Daisuke Nishimura
On Sat, 12 Dec 2009 21:46:08 +0200
Kirill A. Shutemov kir...@shutemov.name wrote:

 On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura
 d-nishim...@mtf.biglobe.ne.jp wrote:
  On Sat, 12 Dec 2009 15:06:52 +0200
  Kirill A. Shutemov kir...@shutemov.name wrote:
 
  On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura
  d-nishim...@mtf.biglobe.ne.jp wrote:
   And IIUC, it's the same for your threshold feature, right ?
   I think it would be better:
  
   - discard this change.
   - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
    and instead of adding a new STAT counter, do like:
  
          if (mem_cgroup_event_check(mem)) {
                  mem_cgroup_update_tree(mem, page);
                  mem_cgroup_threshold(mem);
          }
 
  I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be
  run with different frequency. How to share MEM_CGROUP_STAT_EVENTS
  between soft limits and thresholds in this case?
 
  hmm, both softlimit and your threshold count events at the same 
  place(charge and uncharge).
  So, I think those events can be shared.
  Is there any reason they should run in different frequency ?
 
 SOFTLIMIT_EVENTS_THRESH is 1000. If use the same value for thresholds,
 a threshold can
 be exceed on 1000*nr_cpu_id pages. It's too many. I think, that 100 is
 a reasonable value.
 
O.K. I see.

 mem_cgroup_soft_limit_check() resets MEM_CGROUP_STAT_EVENTS when it reaches
 SOFTLIMIT_EVENTS_THRESH. If I will do the same thing for
 THRESHOLDS_EVENTS_THRESH
 (which is 100) , mem_cgroup_event_check() will never be 'true'. Any
 idea how to share
 MEM_CGROUP_STAT_EVENTS in this case?
 
It's impossible if they run at different frequencies, as you say.

Thank you for your clarification.
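(A toy user-space illustration of the problem we just agreed on: if the shared
per-cpu counter is reset whenever the 100-event threshold check fires, the
1000-event softlimit check can never trigger. The two constants come from the
patches; the rest is made up.)

#include <stdio.h>

#define THRESHOLDS_EVENTS_THRESH        100
#define SOFTLIMIT_EVENTS_THRESH         1000

int main(void)
{
        int events = 0;                 /* the (shared) per-cpu event counter */
        int threshold_checks = 0, softlimit_checks = 0;
        int i;

        for (i = 0; i < 100000; i++) {
                events++;               /* one page in/out */
                if (events >= THRESHOLDS_EVENTS_THRESH) {
                        threshold_checks++;
                        events = 0;     /* reset for the threshold check ... */
                }
                if (events >= SOFTLIMIT_EVENTS_THRESH)
                        softlimit_checks++;     /* ... so this never fires */
        }
        printf("threshold checks: %d, softlimit checks: %d\n",
               threshold_checks, softlimit_checks);
        return 0;
}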


Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds

2009-12-11 Thread Daisuke Nishimura
On Sat, 12 Dec 2009 00:59:19 +0200
Kirill A. Shutemov kir...@shutemov.name wrote:

 It allows to register multiple memory and memsw thresholds and gets
 notifications when it crosses.
 
 To register a threshold application need:
 - create an eventfd;
 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
 - write string like event_fd memory.usage_in_bytes threshold to
   cgroup.event_control.
 
 Application will be notified through eventfd when memory usage crosses
 threshold in any direction.
 
 It's applicable for root and non-root cgroup.
 
 It uses stats to track memory usage, simmilar to soft limits. It checks
 if we need to send event to userspace on every 100 page in/out. I guess
 it's good compromise between performance and accuracy of thresholds.
 
 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  mm/memcontrol.c |  263 
 +++
  1 files changed, 263 insertions(+), 0 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index c6081cc..5ba2140 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -6,6 +6,10 @@
   * Copyright 2007 OpenVZ SWsoft Inc
   * Author: Pavel Emelianov xe...@openvz.org
   *
 + * Memory thresholds
 + * Copyright (C) 2009 Nokia Corporation
 + * Author: Kirill A. Shutemov
 + *
   * This program is free software; you can redistribute it and/or modify
   * it under the terms of the GNU General Public License as published by
   * the Free Software Foundation; either version 2 of the License, or
 @@ -38,6 +42,7 @@
   #include <linux/vmalloc.h>
   #include <linux/mm_inline.h>
   #include <linux/page_cgroup.h>
  +#include <linux/eventfd.h>
   #include "internal.h"
   
   #include <asm/uaccess.h>
 @@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for 
 remember boot option*/
  
  static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
This mutex has already been removed in the current mmotm.
Please write patches for memcg based on mmotm.

  #define SOFTLIMIT_EVENTS_THRESH (1000)
 +#define THRESHOLDS_EVENTS_THRESH (100)
  
  /*
   * Statistics for memory cgroup.

(snip)

 @@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct 
 *mm,
   if (mem_cgroup_soft_limit_check(mem))
   mem_cgroup_update_tree(mem, page);
  done:
 + if (mem_cgroup_threshold_check(mem)) {
 + mem_cgroup_threshold(mem, false);
 + if (do_swap_account)
 + mem_cgroup_threshold(mem, true);
 + }
   return 0;
  nomem:
   css_put(mem-css);
 @@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum 
 charge_type ctype)
  
   if (mem_cgroup_soft_limit_check(mem))
   mem_cgroup_update_tree(mem, page);
 + if (mem_cgroup_threshold_check(mem)) {
 + mem_cgroup_threshold(mem, false);
 + if (do_swap_account)
 + mem_cgroup_threshold(mem, true);
 + }
   /* at swapout, this memcg will be accessed to record to swap */
   if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
   css_put(mem-css);
Can the if (do_swap_account) check be moved into mem_cgroup_threshold() ?
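(For reference, this is how a user-space consumer might use the interface described
in the quoted changelog: create an eventfd, open memory.usage_in_bytes, write
"<event_fd> <usage_fd> <threshold>" to cgroup.event_control, then block on the
eventfd. The mount path and the 100MB value are made up, and the control-string
format is my reading of the patch, so treat this as a sketch.)

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
        /* paths are examples; adjust to where the memory cgroup is mounted */
        int efd = eventfd(0, 0);
        int ufd = open("/cgroup/0/memory.usage_in_bytes", O_RDONLY);
        int cfd = open("/cgroup/0/cgroup.event_control", O_WRONLY);
        unsigned long long threshold = 100ULL << 20;    /* 100MB, arbitrary */
        char buf[64];
        uint64_t hits;

        if (efd < 0 || ufd < 0 || cfd < 0) {
                perror("setup");
                return 1;
        }
        /* register: "<event_fd> <fd of memory.usage_in_bytes> <threshold>" */
        snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, threshold);
        if (write(cfd, buf, strlen(buf)) < 0) {
                perror("register threshold");
                return 1;
        }
        /* blocks until usage crosses the threshold in either direction */
        if (read(efd, &hits, sizeof(hits)) != sizeof(hits)) {
                perror("read eventfd");
                return 1;
        }
        printf("threshold crossed %llu time(s)\n", (unsigned long long)hits);
        return 0;
}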


Thanks,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit

2009-12-11 Thread Daisuke Nishimura
Sorry, I disagree this change.

mem_cgroup_soft_limit_check() is used for checking how much the current usage
exceeds soft_limit_in_bytes and for updating the softlimit tree asynchronously,
instead of doing that on every charge/uncharge. What if you change soft_limit_in_bytes,
but the numbers of charges and uncharges are well balanced afterwards ?
The softlimit tree will not be updated for a long time.

And IIUC, it's the same for your threshold feature, right ?
I think it would be better:

- discard this change.
- in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check,
  and instead of adding a new STAT counter, do like:

if (mem_cgroup_event_check(mem)) {
mem_cgroup_update_tree(mem, page);
mem_cgroup_threshold(mem);
}

Ah, yes. The current code doesn't call mem_cgroup_soft_limit_check() for the root
cgroup in the charge path, as you said in
http://marc.info/?l=linux-mmm=126021128400687w=2.
I think you can change that part as you want; I can change my patch
(http://marc.info/?l=linux-mmm=126023467303178w=2, it has not been sent to
Andrew yet anyway) to check mem_cgroup_is_root() in mem_cgroup_update_tree().
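(i.e. something like this at the top of mem_cgroup_update_tree(); just a sketch,
the rest of the function stays as it is in my patch.)

static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
{
        /* the root cgroup is not subject to the soft limit, so don't
         * bother updating the per-zone excess tree for it */
        if (mem_cgroup_is_root(mem))
                return;
        /* ... existing soft-limit tree update ... */
}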

Thanks,
Daisuke Nishimura.

On Sat, 12 Dec 2009 00:59:18 +0200
Kirill A. Shutemov kir...@shutemov.name wrote:

 Instead of incrementing counter on each page in/out and comparing it
 with constant, we set counter to constant, decrement counter on each
 page in/out and compare it with zero. We want to make comparing as fast
 as possible. On many RISC systems (probably not only RISC) comparing
 with zero is more effective than comparing with a constant, since not
 every constant can be immediate operand for compare instruction.
 
 Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
 since really it's not a generic counter.
 
 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  mm/memcontrol.c |   19 ++-
  1 files changed, 14 insertions(+), 5 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 0ff65ed..c6081cc 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -69,8 +69,9 @@ enum mem_cgroup_stat_index {
   MEM_CGROUP_STAT_MAPPED_FILE,  /* # of pages charged as file rss */
   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
 - MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
 + used by soft limit implementation */
  
   MEM_CGROUP_STAT_NSTATS,
  };
 @@ -90,6 +91,13 @@ __mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu 
 *stat,
   stat-count[idx] = 0;
  }
  
 +static inline void
 +__mem_cgroup_stat_set(struct mem_cgroup_stat_cpu *stat,
 + enum mem_cgroup_stat_index idx, s64 val)
 +{
 + stat-count[idx] = val;
 +}
 +
  static inline s64
  __mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
   enum mem_cgroup_stat_index idx)
 @@ -374,9 +382,10 @@ static bool mem_cgroup_soft_limit_check(struct 
 mem_cgroup *mem)
  
   cpu = get_cpu();
   cpustat = mem-stat.cpustat[cpu];
 - val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
 - if (unlikely(val  SOFTLIMIT_EVENTS_THRESH)) {
 - __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
 + val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT);
 + if (unlikely(val  0)) {
 + __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_SOFTLIMIT,
 + SOFTLIMIT_EVENTS_THRESH);
   ret = true;
   }
   put_cpu();
 @@ -509,7 +518,7 @@ static void mem_cgroup_charge_statistics(struct 
 mem_cgroup *mem,
   else
   __mem_cgroup_stat_add_safe(cpustat,
   MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
 - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
 + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1);
   put_cpu();
  }
  
 -- 
 1.6.5.3
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC v0 2/3] res_counter: implement thresholds

2009-11-26 Thread Daisuke Nishimura
Hi.

On Thu, 26 Nov 2009 19:11:16 +0200, Kirill A. Shutemov kir...@shutemov.name 
wrote:
 It allows to setup two thresholds: one above current usage and one
 below. Callback threshold_notifier() will be called if a threshold is
 crossed.
 
 Signed-off-by: Kirill A. Shutemov kir...@shutemov.name
 ---
  include/linux/res_counter.h |   44 
 +++
  kernel/res_counter.c|4 +++
  2 files changed, 48 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
 index fcb9884..bca99a5 100644
 --- a/include/linux/res_counter.h
 +++ b/include/linux/res_counter.h
 @@ -9,6 +9,10 @@
   *
   * Author: Pavel Emelianov xe...@openvz.org
   *
 + * Thresholds support
 + * Copyright (C) 2009 Nokia Corporation
 + * Author: Kirill A. Shutemov
 + *
   * See Documentation/cgroups/resource_counter.txt for more
   * info about what this counter is.
   */
 @@ -42,6 +46,13 @@ struct res_counter {
* the number of unsuccessful attempts to consume the resource
*/
   unsigned long long failcnt;
 +
 + unsigned long long threshold_above;
 + unsigned long long threshold_below;
 + void (*threshold_notifier)(struct res_counter *counter,
 + unsigned long long usage,
 + unsigned long long threshold);
 +
   /*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
 @@ -145,6 +156,20 @@ static inline bool 
 res_counter_soft_limit_check_locked(struct res_counter *cnt)
   return false;
  }
  
 +static inline void res_counter_threshold_notify_locked(struct res_counter 
 *cnt)
 +{
 + if (cnt-usage = cnt-threshold_above) {
 + cnt-threshold_notifier(cnt, cnt-usage, cnt-threshold_above);
 + return;
 + }
 +
 + if (cnt-usage  cnt-threshold_below) {
 + cnt-threshold_notifier(cnt, cnt-usage, cnt-threshold_below);
 + return;
 + }
 +}
 +
 +
  /**
   * Get the difference between the usage and the soft limit
   * @cnt: The counter
 @@ -238,4 +263,23 @@ res_counter_set_soft_limit(struct res_counter *cnt,
   return 0;
  }
  
 +static inline int
 +res_counter_set_thresholds(struct res_counter *cnt,
 + unsigned long long threshold_above,
 + unsigned long long threshold_below)
 +{
 + unsigned long flags;
 + int ret = -EINVAL;
 +
 + spin_lock_irqsave(cnt-lock, flags);
 + if ((cnt-usage  threshold_above) 
 + (cnt-usage = threshold_below)) {
 + cnt-threshold_above = threshold_above;
 + cnt-threshold_below = threshold_below;
 + ret = 0;
 + }
 + spin_unlock_irqrestore(cnt-lock, flags);
 + return ret;
 +}
 +
  #endif
 diff --git a/kernel/res_counter.c b/kernel/res_counter.c
 index bcdabf3..646c29c 100644
 --- a/kernel/res_counter.c
 +++ b/kernel/res_counter.c
 @@ -20,6 +20,8 @@ void res_counter_init(struct res_counter *counter, struct 
 res_counter *parent)
   spin_lock_init(counter-lock);
   counter-limit = RESOURCE_MAX;
   counter-soft_limit = RESOURCE_MAX;
 + counter-threshold_above = RESOURCE_MAX;
 + counter-threshold_below = 0ULL;
   counter-parent = parent;
  }
  
 @@ -33,6 +35,7 @@ int res_counter_charge_locked(struct res_counter *counter, 
 unsigned long val)
   counter-usage += val;
   if (counter-usage  counter-max_usage)
   counter-max_usage = counter-usage;
 + res_counter_threshold_notify_locked(counter);
   return 0;
  }
  
 @@ -73,6 +76,7 @@ void res_counter_uncharge_locked(struct res_counter 
 *counter, unsigned long val)
   val = counter-usage;
  
   counter-usage -= val;
 + res_counter_threshold_notify_locked(counter);
  }
  
hmm.. this adds new checks to the hot path of the process life cycle.

Do you have any numbers on the performance impact of these patches (without setting
any threshold)?
IMHO, it might be small enough to be ignored, because KAMEZAWA-san's coalesced
charge/uncharge patches have reduced the number of charges/uncharges on the
res_counter itself, but I want to know just to make sure.
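(To make the question concrete: with the defaults from the quoted res_counter_init()
(threshold_above = RESOURCE_MAX, threshold_below = 0), the added cost per
charge/uncharge is just the two comparisons in res_counter_threshold_notify_locked(),
and neither branch can fire. A toy user-space model, assuming the comparisons are
usage >= threshold_above and usage < threshold_below:)

#include <stdio.h>
#include <limits.h>

#define RESOURCE_MAX ULLONG_MAX

struct res_counter_model {
        unsigned long long usage;
        unsigned long long threshold_above;     /* RESOURCE_MAX by default */
        unsigned long long threshold_below;     /* 0 by default */
};

static int notifications;

static void threshold_notify(struct res_counter_model *cnt)
{
        /* with the default thresholds neither check can be true, so the
         * only per-charge overhead is these two comparisons */
        if (cnt->usage >= cnt->threshold_above) {
                notifications++;
                return;
        }
        if (cnt->usage < cnt->threshold_below)
                notifications++;
}

int main(void)
{
        struct res_counter_model cnt = { 0, RESOURCE_MAX, 0 };
        int i;

        for (i = 0; i < 1000000; i++) {
                cnt.usage += 4096;              /* charge one page */
                threshold_notify(&cnt);
                cnt.usage -= 4096;              /* uncharge it */
                threshold_notify(&cnt);
        }
        printf("notifications with default thresholds: %d\n", notifications);
        return 0;
}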


Regards,
Daisuke Nishimura.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: No limitation in physical memory in cgroups

2009-05-20 Thread Daisuke Nishimura
Hi.

Thank you for using memory cgroup :)

On Wed, 20 May 2009 13:52:29 +0800, anqin anqin@gmail.com wrote:
 Hi all,
 
 I would like to bind given physical memory to specific task group,
 however it does not take effect in my experiments. Are there
 something wrong in my experiments.
 
 My experiment is done under the kernel 2.6.29.3 and I constructed
 my task group by following commands:
 
 a) In shell #1, prepare a bash :
 
 # bash
 # echo $$
 2253
 
 b) In shell #2, prepare the memory control via cgroupfs:
 
 # mount -t cgroup cgroup /mnt/mycgrp
 # cd /mn/mycgrp
 # mkdir mycontainer
 # echo 0  mycontainer/cpuset.mems
 # echo 0-1  mycontainer/cpuset.cpus
 # echo 2252  mycontainer/tasks
 
 # cat mycontainer/memory.usage_in_bytes
 2875392
 # echo 300  mycontainer/memory.max_usage_in_bytes
 # cat mycontainer/memory.max_usage_in_bytes
 3002368
 
You need to write the limit to memory.limit_in_bytes.
max_usage_in_bytes reports the maximum usage seen so far (you can reset it
by writing to it).

See Documentation/cgroups/memory.txt (it might be a bit old).


Thanks,
Daisuke Nishimura.

 
 c) In Shell #1, run a memory consumer (in which, malloc() is called
 to allocate memory and not free until program is existed) to allocate
 500M memory:
 
 # /tmp/memoy_consumer_program  500
 
 In Shell #2, the used memory ascends from start point 2875392 when
 program begins (from number presented in memory.usage_in_bytes),
 but it return to start point when it touches the maximal boundary.
 
 On the other hand, I also run the top to watch the memory hold by
 memoy_consumer_program. In top, the memory (both virtual and rss
 memory ) is always growing without any limitation.
 
 Is this phenomenon the correct behaviors of memory cgroups?
 
 
 Best Regards,
 
 Anqin
 ___
 Containers mailing list
 contain...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/containers
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Daisuke Nishimura
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Wed, 30 Jul 2008 12:11:15 +0900
 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 
  On Wed, 30 Jul 2008 11:52:26 +0900
  KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
   mem+swap controller means a shrink to memory resource controller 
   (try_to_free_mem_cgroup_pages()) should drop only file caches.
   (Because kick-out-to-swap will never changes the usage.)
   
   right ? only global-lru can make a swap.
   maybe I can add optimization to do this. Hmm. I should see how OOM works
   under some situation.
   
I'm thinking of the mem+swap controller in a different way: as an add-on to
the mem controller, just like the current swap controller.
I mean adding memory.(mem+swap)_limit.

  (I'm sorry that I'm not a good writer of e-mail.)
  
  A brief summary about changes to mem controller.
  
   - mem+swap controller which limits the # sum of pages and swap_entries.
   - mem+swap controller just drops file caches when it reaches limit.
   - under mem+swap controller, recaliming Anon pages make no sense.
 Then,
- LRU for Anon is not necessary.
- LRU for tmpfs/shmem is not necessary.
just showing account is better.
   - we should see try_to_free_mem_cgroup() again to avoid too much OOM.
 Maybe Retries=5 is too small because we never do swap under us.
 a problem like struck-into-ext3-journal can easily make file-cache 
  reclaim
 difficult.
   - need some changes to documentation.
   - Should we have on/off switch of taking swap into account ?
 or should we implement mem+swap contoller in different name than
 memory controller ?
 If swap is not accounted, we need to do swap-out in memory reclaiming 
  path,
 again.
 
 Then, mem+swap controller finally means
  - under mem+swap controller, program works with no swap. Only global LRU
may make pages swapped-out.
  - If swap-accounting-mode is off, swap can be used unlimitedly.
 
 Hmm, sounds a bit differenct from what I want. How about others ?
 

Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] Transactional CGroup task attachment

2008-07-14 Thread Daisuke Nishimura
On Fri, 11 Jul 2008 09:20:58 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 Thank you for your effort.
 
 On Wed, 9 Jul 2008 23:46:33 -0700
 Paul Menage [EMAIL PROTECTED] wrote:
  3) memory
  
  Curently the memory cgroup only uses the mm-owner's cgroup at charge
  time, and keeps a reference to the cgroup on the page. However,
  patches have been proposed that would move all non-shared (page count
  == 1) pages to the destination cgroup when the mm-owner moves to a
  new cgroup. Since it's not possible to prevent page count changes
  without locking all mms on the system, even this transaction approach
  can't really give guarantees. However, something like the following
  would probably be suitable. It's very similar to the memrlimit
  approach, except for the fact that we have to handle the fact that the
  number of pages we finally move might not be exactly the same as the
  number of pages we thought we'd be moving.
  
  prepare_attach_sleep() {
down_read(mm-mmap_sem);
if (mm-owner != state-task) return 0;
count = count_unshared_pages(mm);
// save the count charged to the new cgroup
state-subsys[memcgroup_subsys_id] = (void *)count;
if ((ret = res_counter_charge(state-dest, count)) {
  up_read(mm-mmap_sem);
}
return ret;
  }
  
  commit_attach() {
if (mm-owner == state-task) {
  final_count = move_unshared_pages(mm, state-dest);
  res_counter_uncharge(state-src, final_count);
  count = state-subsys[memcgroup_subsys_id];
  res_counter_force_charge(state-dest, final_count - count);
}
up_read(mm-mmap_sem);
  }
  
  abort_attach_sleep() {
if (mm-owner == state-task) {
  count = state-subsys[memcgroup_subsys_id];
  res_counter_uncharge(state-dest, count);
}
up_read(mm-mmap_sem);
  }
  
 
 At frist look, maybe works well. we need some special codes (to move resource)
 but that's all.
 
 My small concern is a state change between prepare_attach_sleep() -
 commit_attach(). Hmm...but as you say, we cannot do down_write(mmap_sem).
 Maybe inserting some check codes to mem_cgroup_charge() to stop charge while
 move is the last thing we can do.
 
I have two comments.

- I think the page reclaiming code decreases the memory charge
  without holding mmap_sem (e.g. try_to_unmap(), __remove_mapping()).
  Shouldn't we handle these cases?
- When the swap controller is merged, I should implement
  prepare_attach_nosleep(), which holds swap_lock (a rough sketch follows below).
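
Very roughly, the shape I have in mind is something like the following
(a hypothetical userspace sketch only: prepare_attach_nosleep() does not
exist yet, and a pthread mutex stands in for swap_lock):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;

struct attach_state {
    int precharged;             /* swap entries reserved in the destination */
};

/* reserve room in the destination; keep swap_lock held on success */
static int prepare_attach_nosleep(struct attach_state *st, int needed, int room)
{
    pthread_mutex_lock(&swap_lock);
    if (needed > room) {
        pthread_mutex_unlock(&swap_lock);
        return -1;              /* would be -ENOMEM in the kernel */
    }
    st->precharged = needed;
    return 0;
}

static void commit_attach(struct attach_state *st)
{
    printf("moved %d precharged entries\n", st->precharged);
    pthread_mutex_unlock(&swap_lock);
}

static void abort_attach(struct attach_state *st)
{
    st->precharged = 0;         /* give the reservation back */
    pthread_mutex_unlock(&swap_lock);
}

int main(void)
{
    struct attach_state st = { 0 };
    int another_subsys_failed = 0;  /* pretend result of a later can_attach() */

    if (prepare_attach_nosleep(&st, 10, 20))
        return 1;               /* nothing reserved, nothing to undo */

    if (another_subsys_failed)
        abort_attach(&st);      /* roll our reservation back */
    else
        commit_attach(&st);
    return 0;
}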

 Anyway, if unwinding is supported officially, I think we can find a way
 to go.
 
I think so too.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] Transactional CGroup task attachment

2008-07-14 Thread Daisuke Nishimura
On Mon, 14 Jul 2008 16:54:44 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Mon, 14 Jul 2008 15:28:22 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
  On Fri, 11 Jul 2008 09:20:58 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] 
  wrote:
   Thank you for your effort.
   
   On Wed, 9 Jul 2008 23:46:33 -0700
   Paul Menage [EMAIL PROTECTED] wrote:
3) memory

Currently the memory cgroup only uses the mm->owner's cgroup at charge
time, and keeps a reference to the cgroup on the page. However,
patches have been proposed that would move all non-shared (page count
== 1) pages to the destination cgroup when the mm->owner moves to a
new cgroup. Since it's not possible to prevent page count changes
without locking all mms on the system, even this transaction approach
can't really give guarantees. However, something like the following
would probably be suitable. It's very similar to the memrlimit
approach, except for the fact that we have to handle the fact that the
number of pages we finally move might not be exactly the same as the
number of pages we thought we'd be moving.

prepare_attach_sleep() {
  down_read(&mm->mmap_sem);
  if (mm->owner != state->task) return 0;
  count = count_unshared_pages(mm);
  // save the count charged to the new cgroup
  state->subsys[memcgroup_subsys_id] = (void *)count;
  if ((ret = res_counter_charge(state->dest, count))) {
    up_read(&mm->mmap_sem);
  }
  return ret;
}

commit_attach() {
  if (mm->owner == state->task) {
    final_count = move_unshared_pages(mm, state->dest);
    res_counter_uncharge(state->src, final_count);
    count = state->subsys[memcgroup_subsys_id];
    res_counter_force_charge(state->dest, final_count - count);
  }
  up_read(&mm->mmap_sem);
}

abort_attach_sleep() {
  if (mm->owner == state->task) {
    count = state->subsys[memcgroup_subsys_id];
    res_counter_uncharge(state->dest, count);
  }
  up_read(&mm->mmap_sem);
}

   
   At first look, this maybe works well. We need some special code (to move
   resources)
   but that's all.
   
   My small concern is a state change between prepare_attach_sleep() and
   commit_attach(). Hmm...but as you say, we cannot do down_write(mmap_sem).
   Maybe inserting some check code into mem_cgroup_charge() to stop charging
   while
   a move is in progress is the last thing we can do.
   
  I have two comments.
  
  - I think page reclaiming code decreases the memory charge
without holding mmap_sem(e.g. try_to_unmap(), __remove_mapping()).
Shouldn't we handle these cases?
 
 I think decreasing is not a problem here.
 I don't like handling mmap_sem in some unclear way. I'd like to add some flag to
 mm_struct or page_struct to stop (skip/avoid) charge/uncharge while a task moves.
 
It would be a good idea.

  - When swap controller is merged, I should implement
prepare_attach_nosleep() which holds swap_lock.
  
 just making add_to_swap() fail during move is not enough ?
 
This can only avoid increases, I think.

I thought it would be better to avoid decreases too,
because otherwise some special handling of the uncharged usage
would be needed in rollback or commit.

Anyway, I think it depends on how to implement move and rollback,
and I will consider more.
Thank you for your suggestion.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 1/5] swapcgroup (v3): add cgroup files

2008-07-11 Thread Daisuke Nishimura
Hi, Dave-san.

On Thu, 10 Jul 2008 13:35:36 -0700, Dave Hansen [EMAIL PROTECTED] wrote:
 On Fri, 2008-07-04 at 15:17 +0900, Daisuke Nishimura wrote:
  +config CGROUP_SWAP_RES_CTLR
  +   bool "Swap Resource Controller for Control Groups"
  +   depends on CGROUP_MEM_RES_CTLR && SWAP
  +   help
  + Provides a swap resource controller that manages and limits swap usage.
  + Implemented as an add-on to the Memory Resource Controller.
 
 Could you make this just plain depend on 'CGROUP_MEM_RES_CTLR && SWAP'
 and not make it configurable?  I don't think the resource usage really
 justifies yet another .config knob to tune and break. :)
 

I'm not particular about using a kernel config option.

As I said in my ToDo, I'm going to implement another method
(a boot option or something) to disable (or enable?) this feature,
so I can make this config non-configurable after that.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-07 Thread Daisuke Nishimura
On Sat, 5 Jul 2008 13:29:44 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Fri, 4 Jul 2008 21:33:01 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
  On Fri, 4 Jul 2008 19:16:38 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] 
  wrote:
   On Fri, 4 Jul 2008 15:24:23 +0900
   Daisuke Nishimura [EMAIL PROTECTED] wrote:
   
This patch implements force_empty of swapcgroup.

Currently, it simply uncharges all the charges from the group.

I think there can be other implementations.

What I thought are:
- move all the charges to its parent.
- unuse(swap in) all the swap charged to the group.

   3. move all swap back to memory (see swapoff.)
   
   
  Do you mean swapping in all the swap including used by
  other groups?
 
 swapping in all swap used by the group (not by all group)
 
O.K. I intended to say the same thing in 2.
I'll try it and I think some part of this implementation can be
used by shrinking support too.

(snip)

   Hmm...but handling limit_change (at least, returns -EBUSY) will be 
   necessary.
  I think so too.
  But I'm not sure now it's good or bad to support shrinking at limit_change
  about swap.
  Shrinking swap means increasing the memory usage and that may cause
  another swapout.
 yes. but who reduce the limit ? it's the admin or users.
 
 At least, returning -EBUSY is necessary. You can use the
 "res_counter: check limit change" patch which I posted yesterday.
 
I saw your patch, and I agree that returning -EBUSY is the first step.

   Do you consider a some magical way to move pages in swap back to memory ?
   
  In this patch, I modified find_next_to_unuse() to find
  the entries charged to a specific group.
  It might be possible to modify try_to_unuse() (or define another function
  based on try_to_unuse()) to reduce the swap usage of a specified group
  down to some threshold.
  But, I think, one problem here is from which device the swap entries
  should be brought back to memory, i.e. the usage balance between swap devices.
  
 Ah, that's maybe difficult one.
 As memcg has its own LRU, add MRU to swap ...is not a choice ;(
 
Swap devices are used in order of their priority,
so storing per-device usage might be useful for this purpose...
Anyway, I should consider this more.
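
For illustration, a small userspace model (made-up names, not the kernel
patch) of how a modified find_next_to_unuse() can walk one device's
per-entry owner array and return only the entries charged to a given group:

#include <stdio.h>
#include <stddef.h>

struct group { const char *name; };

#define MAX_ENTRIES 8

struct swap_device {
    int in_use[MAX_ENTRIES];            /* stands in for swap_map[] */
    struct group *owner[MAX_ENTRIES];   /* stands in for p->memcg[] */
};

/* return the next offset after 'prev' charged to 'g', or 0 if there is none */
static size_t find_next_owned(struct swap_device *dev, size_t prev,
                              struct group *g)
{
    for (size_t i = prev + 1; i < MAX_ENTRIES; i++)
        if (dev->in_use[i] && dev->owner[i] == g)
            return i;
    return 0;
}

int main(void)
{
    struct group a = { "A" }, b = { "B" };
    struct swap_device dev = { { 0 } };

    dev.in_use[1] = 1; dev.owner[1] = &a;
    dev.in_use[2] = 1; dev.owner[2] = &b;
    dev.in_use[5] = 1; dev.owner[5] = &a;

    for (size_t i = 0; (i = find_next_owned(&dev, i, &a)) != 0; )
        printf("entry %zu is charged to group %s\n", i, a.name);
    return 0;
}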

   In general, I like this set but we can't change the limit on demand. 
   (maybe)
   (just putting it to TO-DO-List is okay to me.)
   
  I'm sorry but what do you mean by change the limit on demand?
  Could you explain more?
  
 In short, the administrator has to write a perfect plan to set
 each group's swap limit beforehand, because we cannot decrease used swap.
 
 1st problem is that the user cannot reduce the usage of swap by hand.
 (He can reduce by killing process or deleting shmem.)
 Once the usage of swap of a group grows, other groups can't use much.
 
 2nd problem is there is no entity who controls the total amount of swap.
 The user/admin has to check the amount of free swap space by himself when
 planning each group's swap limit, more carefully than with memcg.
 
 So, I think rich-control of hierarchy will be of no use ;)
 All things should be planned before the system starts.
 
 In memcg, the amount of free memory is maintained by the global LRU. It does much
 of the job for us. But free swap space isn't. It's just used on demand.
 
 If we can't decrease usage of swap by a group by hand, the problem which this
 swap-space-controller want to fix will not be fixed at pleasant level.
 
Thank you for your explanation.
I see your point and agree that shrinking support is desirable.
I'll add it to my ToDo.

 Anyway, please return -EBUSY at setting limit  usage, at first :)
 That's enough for me, now.
 
Yes, again.



Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 0/5] swapcgroup (v3)

2008-07-07 Thread Daisuke Nishimura
Hi, Balbir-san.

On Sat, 05 Jul 2008 12:22:25 +0530, Balbir Singh [EMAIL PROTECTED] wrote:
 Daisuke Nishimura wrote:
  Hi.
  
  This is new version of swapcgroup.
  
  Major changes from previous version
  - Rebased on 2.6.26-rc5-mm3.
The new -mm has been released, but these patches
can be applied on 2.6.26-rc8-mm1 too with only some offset warnings.
I tested these patches on 2.6.26-rc5-mm3 with some fixes about memory,
and it seems to work fine.
  - (NEW) Implemented force_empty.
Currently, it simply uncharges all the charges from the group.
  
  Patches
  - [1/5] add cgroup files
  - [2/5] add a member to swap_info_struct
  - [3/5] implement charge and uncharge
  - [4/5] modify vm_swap_full() 
  - [5/5] implement force_empty
  
  ToDo(in my thought. Feel free to add some others here.)
  - need some documentation
Add to memory.txt? or create a new documentation file?
  
 
 I think memory.txt is good. But then, we'll need to add a Table of Contents to
 it, so that swap controller documentation can be located easily.
 
I think memory.txt is a self-contained document,
so I don't want to change it, honestly.

I'll write a documentation for swap as a new file first for review.

  - add option to disable only this feature
I'm wondering if this option is needed.
memcg has already the boot option to disable it.
Is there any case where memory should be accounted but swap should not?
  
 
 That depends on what use case you are trying to provide. Let's say I needed
 backward compatibility with 2.6.25, then I would account for memory and leave
 out swap (even though we have swap controller).
 
O.K. I'll add option.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mm 0/5] swapcgroup (v3)

2008-07-04 Thread Daisuke Nishimura
Hi.

This is new version of swapcgroup.

Major changes from previous version
- Rebased on 2.6.26-rc5-mm3.
  The new -mm has been released, but these patches
  can be applied on 2.6.26-rc8-mm1 too with only some offset warnings.
  I tested these patches on 2.6.26-rc5-mm3 with some fixes about memory,
  and it seems to work fine.
- (NEW) Implemented force_empty.
  Currently, it simply uncharges all the charges from the group.

Patches
- [1/5] add cgroup files
- [2/5] add a member to swap_info_struct
- [3/5] implement charge and uncharge
- [4/5] modify vm_swap_full() 
- [5/5] implement force_empty

ToDo(in my thought. Feel free to add some others here.)
- need some documentation
  Add to memory.txt? or create a new documentation file?

- add option to disable only this feature
  I'm wondering if this option is needed.
  memcg has already the boot option to disable it.
  Is there any case where memory should be accounted but swap should not?

- hierarchy support
- move charges along with task
  Both of them need more discussion.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mm 1/5] swapcgroup (v3): add cgroup files

2008-07-04 Thread Daisuke Nishimura
Even if memory usage is limited by the memory cgroup,
swap space is shared, so resource isolation is not enough.
If one group uses up most of the swap space, it can affect
other groups anyway.

The purpose of swapcgroup is to limit swap usage per group,
just as the memory cgroup limits RSS memory usage.
It's now implemented as an add-on to the memory cgroup.


This patch adds cgroup files(and a member to struct mem_cgroup)
for swapcgroup.

Files to be added by this patch are:

- memory.swap_usage_in_bytes
- memory.swap_max_usage_in_bytes
- memory.swap_limit_in_bytes
- memory.swap_failcnt

The meaning of these files is the same as for the corresponding memory cgroup files.
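
For reference, a toy userspace model (not the kernel res_counter code) of
the semantics these files report (usage, the high-water mark, the limit,
and the failure count), with values chosen only for illustration:

#include <stdio.h>

struct counter {
    unsigned long long usage, max_usage, limit, failcnt;
};

static int charge(struct counter *c, unsigned long long val)
{
    if (c->usage + val > c->limit) {
        c->failcnt++;                   /* reported via *_failcnt */
        return -1;                      /* -ENOMEM in the kernel */
    }
    c->usage += val;
    if (c->usage > c->max_usage)
        c->max_usage = c->usage;        /* reported via *_max_usage_in_bytes */
    return 0;
}

static void uncharge(struct counter *c, unsigned long long val)
{
    c->usage -= val;
}

int main(void)
{
    struct counter swap_res = { .limit = 3 * 4096ULL };
    int i;

    for (i = 0; i < 4; i++)             /* the 4th charge exceeds the limit */
        charge(&swap_res, 4096);
    uncharge(&swap_res, 4096);

    printf("swap_usage_in_bytes:     %llu\n", swap_res.usage);
    printf("swap_max_usage_in_bytes: %llu\n", swap_res.max_usage);
    printf("swap_limit_in_bytes:     %llu\n", swap_res.limit);
    printf("swap_failcnt:            %llu\n", swap_res.failcnt);
    return 0;
}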


Change log
v2-v3
- Rebased on 2.6.26-rc5-mm3.
v1-v2
- Rebased on 2.6.26-rc2-mm1.
- Implemented as a add-on to memory cgroup.


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 init/Kconfig|7 +
 mm/memcontrol.c |   67 +++
 2 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 847931a..c604b6d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -418,6 +418,13 @@ config CGROUP_MEMRLIMIT_CTLR
  memory RSS and Page Cache control. Virtual address space control
  is provided by this controller.
 
+config CGROUP_SWAP_RES_CTLR
+   bool "Swap Resource Controller for Control Groups"
+   depends on CGROUP_MEM_RES_CTLR && SWAP
+   help
+ Provides a swap resource controller that manages and limits swap usage.
+ Implemented as an add-on to the Memory Resource Controller.
+
 config SYSFS_DEPRECATED
bool
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8fe33c..ddc842b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -133,6 +133,12 @@ struct mem_cgroup {
 * statistics.
 */
struct mem_cgroup_stat stat;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   /*
+* the counter to account for swap usage
+*/
+   struct res_counter swap_res;
+#endif
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -953,6 +959,39 @@ static int mem_control_stat_show(struct cgroup *cont, 
struct cftype *cft,
return 0;
 }
 
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+static u64 swap_cgroup_read(struct cgroup *cont, struct cftype *cft)
+{
+   return res_counter_read_u64(&mem_cgroup_from_cont(cont)->swap_res,
+   cft->private);
+}
+
+static ssize_t swap_cgroup_write(struct cgroup *cont, struct cftype *cft,
+   struct file *file, const char __user *userbuf,
+   size_t nbytes, loff_t *ppos)
+{
+   return res_counter_write(&mem_cgroup_from_cont(cont)->swap_res,
+   cft->private, userbuf, nbytes, ppos,
+   mem_cgroup_write_strategy);
+}
+
+static int swap_cgroup_reset(struct cgroup *cont, unsigned int event)
+{
+   struct mem_cgroup *mem;
+
+   mem = mem_cgroup_from_cont(cont);
+   switch (event) {
+   case RES_MAX_USAGE:
+   res_counter_reset_max(mem-swap_res);
+   break;
+   case RES_FAILCNT:
+   res_counter_reset_failcnt(mem-swap_res);
+   break;
+   }
+   return 0;
+}
+#endif
+
 static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -985,6 +1024,31 @@ static struct cftype mem_cgroup_files[] = {
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   {
+   .name = "swap_usage_in_bytes",
+   .private = RES_USAGE,
+   .read_u64 = swap_cgroup_read,
+   },
+   {
+   .name = "swap_max_usage_in_bytes",
+   .private = RES_MAX_USAGE,
+   .trigger = swap_cgroup_reset,
+   .read_u64 = swap_cgroup_read,
+   },
+   {
+   .name = "swap_limit_in_bytes",
+   .private = RES_LIMIT,
+   .write = swap_cgroup_write,
+   .read_u64 = swap_cgroup_read,
+   },
+   {
+   .name = "swap_failcnt",
+   .private = RES_FAILCNT,
+   .trigger = swap_cgroup_reset,
+   .read_u64 = swap_cgroup_read,
+   },
+#endif
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1063,6 +1127,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup 
*cont)
}
 
res_counter_init(mem-res);
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   res_counter_init(mem-swap_res);
+#endif
 
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mm 2/5] swapcgroup (v3): add a member to swap_info_struct

2008-07-04 Thread Daisuke Nishimura
This patch adds a member to swap_info_struct for the cgroup.

This member, an array of pointers to mem_cgroup, is used to
remember to which cgroup each swap entry is charged.

The memory for this array of pointers is allocated on swapon,
and freed on swapoff.


Change log
v2-v3
- Rebased on 2.6.26-rc5-mm3
- add helper functions and removed #ifdef from sys_swapon()/sys_swapoff().
- add check on mem_cgroup_subsys.disabled
v1-v2
- Rebased on 2.6.26-rc2-mm1
- Implemented as a add-on to memory cgroup.


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/memcontrol.h |   20 +++-
 include/linux/swap.h   |3 +++
 mm/memcontrol.c|   36 
 mm/swapfile.c  |   11 +++
 4 files changed, 69 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ee1b2fc..b6ff509 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct swap_info_struct;
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
@@ -165,5 +166,22 @@ static inline long mem_cgroup_calc_reclaim(struct 
mem_cgroup *mem,
 }
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
-#endif /* _LINUX_MEMCONTROL_H */
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+extern struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p);
+extern int swap_info_alloc_memcg(struct swap_info_struct *p,
+   unsigned long maxpages);
+#else
+static inline
+struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p)
+{
+   return NULL;
+}
 
+static inline
+int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages)
+{
+   return 0;
+}
+#endif
+
+#endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a3af95b..6e1b03d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -142,6 +142,9 @@ struct swap_info_struct {
struct swap_extent *curr_swap_extent;
unsigned old_block_size;
unsigned short * swap_map;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   struct mem_cgroup **memcg;
+#endif
unsigned int lowest_bit;
unsigned int highest_bit;
unsigned int cluster_next;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ddc842b..81bb7fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1209,3 +1209,39 @@ struct cgroup_subsys mem_cgroup_subsys = {
.attach = mem_cgroup_move_task,
.early_init = 0,
 };
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+/* called with swap_lock held */
+struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p)
+{
+   struct mem_cgroup **mem;
+
+   /* just clear p->memcg, without checking mem_cgroup_subsys.disabled */
+   mem = p->memcg;
+   p->memcg = NULL;
+
+   return mem;
+}
+
+/* called without swap_lock held */
+int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages)
+{
+   int ret = 0;
+
+   if (mem_cgroup_subsys.disabled)
+   goto out;
+
+   p->memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *));
+   if (!p->memcg) {
+   /* make swapon fail */
+   printk(KERN_ERR "Unable to allocate memory for memcg\n");
+   ret = -ENOMEM;
+   goto out;
+   }
+   memset(p->memcg, 0, maxpages * sizeof(struct mem_cgroup *));
+
+out:
+   return ret;
+}
+#endif
+
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bf7d13d..312c573 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1228,6 +1228,7 @@ asmlinkage long sys_swapoff(const char __user * 
specialfile)
unsigned short *swap_map;
struct file *swap_file, *victim;
struct address_space *mapping;
+   struct mem_cgroup **memcg = NULL;
struct inode *inode;
char * pathname;
int i, type, prev;
@@ -1328,10 +1329,12 @@ asmlinkage long sys_swapoff(const char __user * 
specialfile)
p-max = 0;
swap_map = p-swap_map;
p-swap_map = NULL;
+   memcg = swap_info_clear_memcg(p);
p-flags = 0;
spin_unlock(swap_lock);
mutex_unlock(swapon_mutex);
vfree(swap_map);
+   vfree(memcg);
inode = mapping-host;
if (S_ISBLK(inode-i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1475,6 +1478,7 @@ asmlinkage long sys_swapon(const char __user * 
specialfile, int swap_flags)
unsigned long maxpages = 1;
int swapfilesize;
unsigned short *swap_map;
+   struct mem_cgroup **memcg = NULL;
struct page *page = NULL;
struct inode *inode = NULL;
int did_down = 0;
@@ -1498,6 +1502,7 @@ asmlinkage long sys_swapon(const char __user * 
specialfile, int swap_flags)
p-swap_file = NULL;
p-old_block_size = 0;
p-swap_map = NULL;
+   swap_info_clear_memcg(p);
p-lowest_bit = 0;
p-highest_bit = 0;
p

[Devel] [PATCH -mm 3/5] swapcgroup (v3): implement charge and uncharge

2008-07-04 Thread Daisuke Nishimura
This patch implements charge and uncharge of swapcgroup.

- what will be charged ?
  charge the number of swap entries in bytes.

- when to charge/uncharge ?
  charge at get_swap_page(), and uncharge at swap_entry_free().

- to what group charge the swap entry ?
  To determine to what mem_cgroup the swap entry should be charged,
  I changed the argument of get_swap_page() from (void) to
  (struct page *). As a result, get_swap_page() can determine
  to what mem_cgroup it should charge the swap entry
  by referring to page->page_cgroup->mem_cgroup.

- from what group uncharge the swap entry ?
  I added, in the previous patch, a member 'struct mem_cgroup **' to
  swap_info_struct: an array of pointers recording to which mem_cgroup
  each swap entry is charged.
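
As a rough, self-contained model of this lifecycle (plain userspace C with
made-up names, not the actual patch): charge the owning group when a slot is
handed out, record the owner per offset, and uncharge that recorded owner
when the slot is freed:

#include <stdio.h>

#define NSLOTS 4

struct group { long swap_usage; };

static struct group *owner[NSLOTS];     /* stands in for si->memcg[] */
static int slot_used[NSLOTS];

/* hand out a free slot and charge it to 'g' (the get_swap_page() side) */
static int get_slot(struct group *g)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = 1;
            owner[i] = g;
            g->swap_usage++;            /* swap_cgroup_charge() */
            return i;
        }
    }
    return -1;
}

/* free a slot and uncharge its recorded owner (the swap_entry_free() side) */
static void free_slot(int i)
{
    if (owner[i]) {
        owner[i]->swap_usage--;         /* swap_cgroup_uncharge() */
        owner[i] = NULL;
    }
    slot_used[i] = 0;
}

int main(void)
{
    struct group g = { 0 };
    int s = get_slot(&g);

    printf("after charge:   usage=%ld\n", g.swap_usage);
    free_slot(s);
    printf("after uncharge: usage=%ld\n", g.swap_usage);
    return 0;
}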


Change log
v2-v3
- Rebased on 2.6.26-rc5-mm3
- make swap_cgroup_charge() fail when !pc.
- add check on mem_cgroup_subsys.disabled
v1-v2
- Rebased on 2.6.26-rc2-mm1
- Implemented as a add-on to memory cgroup.


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/memcontrol.h |   17 ++
 include/linux/swap.h   |4 +-
 mm/memcontrol.c|   53 
 mm/shmem.c |2 +-
 mm/swap_state.c|2 +-
 mm/swapfile.c  |   13 ++-
 6 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6ff509..f3c84f7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -170,6 +170,11 @@ static inline long mem_cgroup_calc_reclaim(struct 
mem_cgroup *mem,
 extern struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p);
 extern int swap_info_alloc_memcg(struct swap_info_struct *p,
unsigned long maxpages);
+extern int swap_cgroup_charge(struct page *page,
+   struct swap_info_struct *si,
+   unsigned long offset);
+extern void swap_cgroup_uncharge(struct swap_info_struct *si,
+   unsigned long offset);
 #else
 static inline
 struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p)
@@ -182,6 +187,18 @@ int swap_info_alloc_memcg(struct swap_info_struct *p, 
unsigned long maxpages)
 {
return 0;
 }
+
+static inline int swap_cgroup_charge(struct page *page,
+   struct swap_info_struct *si,
+   unsigned long offset)
+{
+   return 0;
+}
+
+static inline void swap_cgroup_uncharge(struct swap_info_struct *si,
+   unsigned long offset)
+{
+}
 #endif
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6e1b03d..f80255b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -298,7 +298,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t get_swap_page(struct page *);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int swap_duplicate(swp_entry_t);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
@@ -405,7 +405,7 @@ static inline int remove_exclusive_swap_page_ref(struct 
page *page)
return 0;
 }
 
-static inline swp_entry_t get_swap_page(void)
+static inline swp_entry_t get_swap_page(struct page *page)
 {
swp_entry_t entry;
entry.val = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 81bb7fa..d16d0a5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1243,5 +1243,58 @@ int swap_info_alloc_memcg(struct swap_info_struct *p, 
unsigned long maxpages)
 out:
return ret;
 }
+
+int swap_cgroup_charge(struct page *page,
+   struct swap_info_struct *si,
+   unsigned long offset)
+{
+   int ret;
+   struct page_cgroup *pc;
+   struct mem_cgroup *mem;
+
+   if (mem_cgroup_subsys.disabled)
+   return 0;
+
+   lock_page_cgroup(page);
+   pc = page_get_page_cgroup(page);
+   if (unlikely(!pc)) {
+   /* make get_swap_page fail */
+   unlock_page_cgroup(page);
+   return -ENOMEM;
+   } else {
+   mem = pc->mem_cgroup;
+   css_get(&mem->css);
+   }
+   unlock_page_cgroup(page);
+
+   ret = res_counter_charge(&mem->swap_res, PAGE_SIZE);
+   if (!ret)
+   si->memcg[offset] = mem;
+   else
+   css_put(&mem->css);
+
+   return ret;
+}
+
+void swap_cgroup_uncharge(struct swap_info_struct *si,
+   unsigned long offset)
+{
+   struct mem_cgroup *mem = si->memcg[offset];
+
+   if (mem_cgroup_subsys.disabled)
+   return;
+
+   /* mem would be NULL:
+* 1. when get_swap_page() failed at charging swap_cgroup,
+*and called swap_entry_free().
+* 2. when this swap entry had

[Devel] [PATCH -mm 4/5] swapcgroup (v3): modify vm_swap_full()

2008-07-04 Thread Daisuke Nishimura
This patch modifies vm_swap_full() to calculate
the rate of swap usage per cgroup.

The purpose of this change is to free freeable swap caches
(that is, swap entries) per cgroup, so that swap_cgroup_charge()
fails less frequently.

I tested whether this patch can reduce the swap usage or not,
by running qsbench (8 x 40M) 10 times in mem:128M/swap:256M group
and mem:256M/swap:128M group.
And I confirmed that this patch can keep swap usage in the range
(half of the limit < usage < the limit),
whereas without this patch, swap usage tends to stay near the limit.
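
As a quick illustration of the per-group check this behaviour relies on
(a userspace toy, not the kernel code): usage * 2 > limit becomes true once
more than half of the group's swap limit is in use, and from then on
freeable swap caches are released:

#include <stdio.h>

static int group_swap_half_full(unsigned long long usage, unsigned long long limit)
{
    return usage * 2 > limit;
}

int main(void)
{
    unsigned long long limit = 128ULL << 20;    /* 128M, as in the test above */
    unsigned long long samples[] = { 32ULL << 20, 64ULL << 20, 96ULL << 20 };

    for (int i = 0; i < 3; i++)
        printf("usage %3lluM -> release swap caches: %s\n",
               samples[i] >> 20,
               group_swap_half_full(samples[i], limit) ? "yes" : "no");
    return 0;
}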


Change log
v2-v3
- Rebased on 2.6.26-rc5-mm3
- take into account global swap usage.
- change arg of vm_swap_full from page to memcg.
- add check on mem_cgroup_subsys.disabled
v1-v2
- new patch


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/memcontrol.h |   12 
 include/linux/swap.h   |4 +++-
 mm/memcontrol.c|   18 ++
 mm/memory.c|4 +++-
 mm/swapfile.c  |   38 +-
 mm/vmscan.c|6 +++---
 6 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f3c84f7..1e6eb0c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -175,6 +175,8 @@ extern int swap_cgroup_charge(struct page *page,
unsigned long offset);
 extern void swap_cgroup_uncharge(struct swap_info_struct *si,
unsigned long offset);
+extern struct mem_cgroup *page_to_memcg(struct page *page);
+extern int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg);
 #else
 static inline
 struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p)
@@ -199,6 +201,16 @@ static inline void swap_cgroup_uncharge(struct 
swap_info_struct *si,
unsigned long offset)
 {
 }
+
+static inline struct mem_cgroup *page_to_memcg(struct page *page)
+{
+   return NULL;
+}
+
+static inline int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg)
+{
+   return 0;
+}
 #endif
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f80255b..2b95df9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -161,7 +161,9 @@ struct swap_list_t {
 };
 
 /* Swap 50% full? Release swapcache more aggressively.. */
-#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
+#define vm_swap_full(memcg) ((nr_swap_pages*2 < total_swap_pages) \
+   || swap_cgroup_vm_swap_full(memcg))
+
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d16d0a5..160fca1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1296,5 +1296,23 @@ void swap_cgroup_uncharge(struct swap_info_struct *si,
css_put(mem-css);
}
 }
+
+int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg)
+{
+   u64 usage;
+   u64 limit;
+
+   if (memcg)
+   css_get(&memcg->css);
+   else
+   return 0;
+
+   usage = res_counter_read_u64(&memcg->swap_res, RES_USAGE);
+   limit = res_counter_read_u64(&memcg->swap_res, RES_LIMIT);
+
+   css_put(&memcg->css);
+
+   return usage * 2 > limit;
+}
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index 536d748..afcf737 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2230,7 +2230,9 @@ static int do_swap_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
page_add_anon_rmap(page, vma, address);
 
swap_free(entry);
-   if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+   if (vm_swap_full(page_to_memcg(page))
+   || (vma->vm_flags & VM_LOCKED)
+   || PageMlocked(page))
remove_exclusive_swap_page(page);
unlock_page(page);
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 57798c5..6a863bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -28,6 +28,7 @@
 #include linux/capability.h
 #include linux/syscalls.h
 #include linux/memcontrol.h
+#include linux/cgroup.h
 
 #include asm/pgtable.h
 #include asm/tlbflush.h
@@ -447,7 +448,8 @@ void free_swap_and_cache(swp_entry_t entry)
/* Only cache user (+us), or swap space full? Free it! */
/* Also recheck PageSwapCache after page is locked (above) */
if (PageSwapCache(page) && !PageWriteback(page) &&
-   (one_user || vm_swap_full())) {
+   (one_user
+   || vm_swap_full(page_to_memcg(page)))) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
@@ -1889,3 +1891,37 @@ int valid_swaphandles(swp_entry_t entry, unsigned long 
*offset)
*offset = ++toff;
return nr_pages? ++nr_pages: 0;
 }
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+/*
+ * returns

[Devel] [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-04 Thread Daisuke Nishimura
This patch implements force_empty of swapcgroup.

Currently, it simply uncharges all the charges from the group.

I think there can be other implementations.

What I thought are:
- move all the charges to its parent.
- unuse(swap in) all the swap charged to the group.

But in any case, I think before implementing this way,
hierarchy and move charges support are needed.

So I think this is enough for the first step.


Change log
v2-v3
- new patch


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/memcontrol.h |5 
 mm/memcontrol.c|   47 +++
 mm/swapfile.c  |   52 +--
 3 files changed, 91 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1e6eb0c..8c4bad0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -177,6 +177,7 @@ extern void swap_cgroup_uncharge(struct swap_info_struct 
*si,
unsigned long offset);
 extern struct mem_cgroup *page_to_memcg(struct page *page);
 extern int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg);
+extern void __swap_cgroup_force_empty(struct mem_cgroup *mem);
 #else
 static inline
 struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p)
@@ -211,6 +212,10 @@ static inline int swap_cgroup_vm_swap_full(struct 
mem_cgroup *memcg)
 {
return 0;
 }
+
+static inline void __swap_cgroup_force_empty(struct mem_cgroup *mem)
+{
+}
 #endif
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 160fca1..f93907c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -826,6 +826,34 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup 
*mem,
spin_unlock_irqrestore(mz-lru_lock, flags);
 }
 
+static inline int mem_cgroup_in_use(struct mem_cgroup *mem)
+{
+   return mem->res.usage > 0;
+}
+
+static inline int swap_cgroup_in_use(struct mem_cgroup *mem)
+{
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   return mem->swap_res.usage > 0;
+#else
+   return 0;
+#endif
+}
+
+static void __mem_cgroup_force_empty(struct mem_cgroup *mem)
+{
+   int node, zid;
+
+   for_each_node_state(node, N_POSSIBLE)
+   for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+   struct mem_cgroup_per_zone *mz;
+   enum lru_list l;
+   mz = mem_cgroup_zoneinfo(mem, node, zid);
+   for_each_lru(l)
+   mem_cgroup_force_empty_list(mem, mz, l);
+   }
+}
+
 /*
  * make mem_cgroup's charge to be 0 if there is no task.
  * This enables deleting this mem_cgroup.
@@ -833,7 +861,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup 
*mem,
 static int mem_cgroup_force_empty(struct mem_cgroup *mem)
 {
int ret = -EBUSY;
-   int node, zid;
 
css_get(mem-css);
/*
@@ -841,18 +868,17 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem)
 * active_list - inactive_list while we don't take a lock.
 * So, we have to do loop here until all lists are empty.
 */
-   while (mem->res.usage > 0) {
if (atomic_read(&mem->css.cgroup->count) > 0)
goto out;
goto out;
-   for_each_node_state(node, N_POSSIBLE)
-   for (zid = 0; zid  MAX_NR_ZONES; zid++) {
-   struct mem_cgroup_per_zone *mz;
-   enum lru_list l;
-   mz = mem_cgroup_zoneinfo(mem, node, zid);
-   for_each_lru(l)
-   mem_cgroup_force_empty_list(mem, mz, l);
-   }
+
+   if (mem_cgroup_in_use(mem))
+   __mem_cgroup_force_empty(mem);
+
+   if (swap_cgroup_in_use(mem))
+   __swap_cgroup_force_empty(mem);
}
+
ret = 0;
 out:
css_put(mem-css);
@@ -1289,6 +1315,7 @@ void swap_cgroup_uncharge(struct swap_info_struct *si,
 *and called swap_entry_free().
 * 2. when this swap entry had been assigned by
 *get_swap_page_of_type() (via SWSUSP?).
+* 3. force empty can make such entries.
 */
if (mem) {
res_counter_uncharge(mem-swap_res, PAGE_SIZE);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6a863bc..2263ae8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -704,15 +704,30 @@ static int unuse_mm(struct mm_struct *mm,
 }
 
 /*
+ * return the mem_cgroup to which a entry is charged
+ */
+static inline struct mem_cgroup *entry_to_memcg(swp_entry_t entry)
+{
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   return swap_info[swp_type(entry)].memcg[swp_offset(entry)];
+#else
+   return NULL;
+#endif
+}
+
+/*
  * Scan swap_map from current position to next entry still

[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-04 Thread Daisuke Nishimura
Hi, Yamamoto-san.

Thank you for your comment.

On Fri,  4 Jul 2008 15:54:31 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO Takashi) 
wrote:
 hi,
 
  +/*
  + * uncharge all the entries that are charged to the group.
  + */
  +void __swap_cgroup_force_empty(struct mem_cgroup *mem)
  +{
  +   struct swap_info_struct *p;
  +   int type;
  +
  +   spin_lock(swap_lock);
  +   for (type = swap_list.head; type = 0; type = swap_info[type].next) {
  +   p = swap_info + type;
  +
  +   if ((p-flags  SWP_ACTIVE) == SWP_ACTIVE) {
  +   unsigned int i = 0;
  +
  +   spin_unlock(swap_lock);
 
 what prevents the device from being swapoff'ed while you drop swap_lock?
 
Nothing.

After searching for the entry to be uncharged (find_next_to_unuse below),
I recheck under swap_lock whether the entry is charged to the group.
Even if the device is swapoff'ed, swapoff must have uncharged the entry,
so I don't think it's needed anyway.

 YAMAMOTO Takashi
 
  +   while ((i = find_next_to_unuse(p, i, mem)) != 0) {
  +   spin_lock(&swap_lock);
  +   if (p->swap_map[i] && p->memcg[i] == mem)
Ah, I think a check on p->swap_map (i.e. bail out if !p->swap_map) should be
added, to make sure the device has not been swapoff'ed (see the sketch below).
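
To be concrete, a self-contained sketch (userspace stubs and illustrative
names, not the real kernel code) of the loop with that extra check: after
re-taking swap_lock, p->swap_map is tested first, so a device that was
swapoff'ed while the lock was dropped is left alone:

#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct mem_cgroup { int dummy; };

struct swap_info {
    unsigned short *swap_map;           /* NULL once the device is swapoff'ed */
    struct mem_cgroup **memcg;
    size_t max;
};

static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;

/* stand-in for the modified find_next_to_unuse(): next entry owned by mem */
static size_t next_owned(struct swap_info *p, size_t prev, struct mem_cgroup *mem)
{
    for (size_t i = prev + 1; p->swap_map && i < p->max; i++)
        if (p->swap_map[i] && p->memcg[i] == mem)
            return i;
    return 0;
}

static void uncharge_all(struct swap_info *p, struct mem_cgroup *mem)
{
    size_t i = 0;

    while ((i = next_owned(p, i, mem)) != 0) {
        pthread_mutex_lock(&swap_lock);
        /* recheck: device still swapon'ed and entry still charged to us */
        if (p->swap_map && p->swap_map[i] && p->memcg[i] == mem) {
            p->memcg[i] = NULL;         /* swap_cgroup_uncharge(p, i) */
            printf("uncharged entry %zu\n", i);
        }
        pthread_mutex_unlock(&swap_lock);
    }
}

int main(void)
{
    unsigned short map[4] = { 0, 1, 1, 0 };
    struct mem_cgroup mem;
    struct mem_cgroup *owners[4] = { NULL, &mem, NULL, NULL };
    struct swap_info p = { .swap_map = map, .memcg = owners, .max = 4 };

    uncharge_all(&p, &mem);
    return 0;
}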


Thanks,
Daisuke Nishimura.

  +   swap_cgroup_uncharge(p, i);
  +   spin_unlock(swap_lock);
  +   }
  +   spin_lock(swap_lock);
  +   }
  +   }
  +   spin_unlock(swap_lock);
  +
  +   return;
  +}
   #endif
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-04 Thread Daisuke Nishimura
On Fri,  4 Jul 2008 16:48:28 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO Takashi) 
wrote:
  Hi, Yamamoto-san.
  
  Thank you for your comment.
  
  On Fri,  4 Jul 2008 15:54:31 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO 
  Takashi) wrote:
   hi,
   
+/*
+ * uncharge all the entries that are charged to the group.
+ */
+void __swap_cgroup_force_empty(struct mem_cgroup *mem)
+{
+   struct swap_info_struct *p;
+   int type;
+
+   spin_lock(swap_lock);
+   for (type = swap_list.head; type = 0; type = 
swap_info[type].next) {
+   p = swap_info + type;
+
+   if ((p-flags  SWP_ACTIVE) == SWP_ACTIVE) {
+   unsigned int i = 0;
+
+   spin_unlock(swap_lock);
   
   what prevents the device from being swapoff'ed while you drop swap_lock?
   
  Nothing.
  
  After searching the entry to be uncharged(find_next_to_unuse below),
  I recheck under swap_lock whether the entry is charged to the group.
  Even if the device is swapoff'ed, swap_off must have uncharged the entry,
  so I don't think it's needed anyway.
  
   YAMAMOTO Takashi
   
+   while ((i = find_next_to_unuse(p, i, mem)) != 
0) {
+   spin_lock(swap_lock);
+   if (p-swap_map[i]  p-memcg[i] == 
mem)
  Ah, I think it should be added !p-swap_map to check the device has not
  been swapoff'ed.
 
 find_next_to_unuse seems to have fragile assumptions and
 can dereference p->swap_map as well.
 
You're right.
Thank you for pointing it out!

I'll consider more.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 4/5] swapcgroup (v3): modify vm_swap_full()

2008-07-04 Thread Daisuke Nishimura
Hi, Kamezawa-san.

On Fri, 4 Jul 2008 18:58:45 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Fri, 4 Jul 2008 15:22:44 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
   /* Swap 50% full? Release swapcache more aggressively.. */
  -#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
  +#define vm_swap_full(memcg) ((nr_swap_pages*2 < total_swap_pages) \
  +   || swap_cgroup_vm_swap_full(memcg))
  +
   
 Maybe nitpick but I like 
 ==
   vm_swap_full(page)  ((nr_swap_pages *2 < total_swap_pages)
   || swap_cgroup_vm_swap_full_page(page))
 ==
 rather than vm_swap_full(memcg)
 
Well, I used page in v2, but Kosaki-san said vm_swap_full()
is not a page-granularity operation, so it should be changed.
 
And more,

  @@ -1317,7 +1317,7 @@ static void shrink_active_list(unsigned long 
  nr_pages, struct zone *zone,
  __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
  pgmoved = 0;
  spin_unlock_irq(zone-lru_lock);
  -   if (vm_swap_full())
  +   if (vm_swap_full(sc-mem_cgroup))
  pagevec_swap_free(pvec);
  __pagevec_release(pvec);
  spin_lock_irq(zone-lru_lock);
  @@ -1328,7 +1328,7 @@ static void shrink_active_list(unsigned long 
  nr_pages, struct zone *zone,
  __count_zone_vm_events(PGREFILL, zone, pgscanned);
  __count_vm_events(PGDEACTIVATE, pgdeactivate);
  spin_unlock_irq(zone-lru_lock);
  -   if (vm_swap_full())
  +   if (vm_swap_full(sc-mem_cgroup))
  pagevec_swap_free(pvec);
   
  pagevec_release(pvec);

page cannot be determined in those places.
I don't want to change pagevec_swap_free(), so I changed
the argument of vm_swap_full().

 And could you change this to an inline function ?
 
Of course.
I think it would be better.
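
Roughly what I have in mind for the inline form (userspace stubs so the
sketch compiles standalone; swap_cgroup_vm_swap_full() here is a simplified
stand-in, not the exact kernel function):

#include <stdio.h>

struct mem_cgroup { unsigned long long swap_usage, swap_limit; };

static long nr_swap_pages = 300;        /* free swap pages (stub value) */
static long total_swap_pages = 400;

static inline int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg)
{
    if (!memcg)
        return 0;
    return memcg->swap_usage * 2 > memcg->swap_limit;
}

/* swap more than 50% full, globally or within the group's swap limit? */
static inline int vm_swap_full(struct mem_cgroup *memcg)
{
    return nr_swap_pages * 2 < total_swap_pages ||
           swap_cgroup_vm_swap_full(memcg);
}

int main(void)
{
    struct mem_cgroup g = { .swap_usage = 96, .swap_limit = 128 };

    printf("vm_swap_full: %d\n", vm_swap_full(&g));
    return 0;
}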


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 0/5] swapcgroup (v3)

2008-07-04 Thread Daisuke Nishimura
On Fri, 4 Jul 2008 18:40:33 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Fri, 4 Jul 2008 15:15:36 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
  Hi.
  
  This is new version of swapcgroup.
  
  Major changes from previous version
  - Rebased on 2.6.26-rc5-mm3.
The new -mm has been released, but these patches
can be applied on 2.6.26-rc8-mm1 too with only some offset warnings.
I tested these patches on 2.6.26-rc5-mm3 with some fixes about memory,
and it seems to work fine.
  - (NEW) Implemented force_empty.
Currently, it simply uncharges all the charges from the group.
  
  Patches
  - [1/5] add cgroup files
  - [2/5] add a member to swap_info_struct
  - [3/5] implement charge and uncharge
  - [4/5] modify vm_swap_full() 
  - [5/5] implement force_empty
  
  ToDo(in my thought. Feel free to add some others here.)
  - need some documentation
Add to memory.txt? or create a new documentation file?
  
 Maybe new documentation file is better. 
 
O.K.

  - add option to disable only this feature
I'm wondering if this option is needed.
memcg has already the boot option to disable it.
Is there any case where memory should be accounted but swap should not?
 
 On x86-32, the area for vmalloc() is very small, and the array for
 swap_info_struct[] will use a large amount of it. If the vmalloc() area is
 too small, the kernel cannot load modules.
 
It would be critical.
I'll add an option.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-04 Thread Daisuke Nishimura
On Fri, 4 Jul 2008 19:16:38 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Fri, 4 Jul 2008 15:24:23 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
  This patch implements force_empty of swapcgroup.
  
  Currently, it simply uncharges all the charges from the group.
  
  I think there can be other implementations.
  
  What I thought are:
  - move all the charges to its parent.
  - unuse(swap in) all the swap charged to the group.
  
 3. move all swap back to memory (see swapoff.)
 
 
Do you mean swapping in all the swap, including that used by
other groups?
It would be one choice anyway.

  But in any case, I think before implementing this way,
  hierarchy and move charges support are needed.
  
  So I think this is enough for the first step.
  
 
 I don't think hierarchy/automatic-load-balancer for swap cg is necessary.
It's a question of what the hierarchy would look like, I think.
I'm saying hierarchy here just to mean some kind of feature
where a parent includes its children.
I think hierarchy is needed if we implement choice 1 above,
and I personally think it would be the best choice.

 Hmm...but handling limit_change (at least, returns -EBUSY) will be necessary.
I think so too.
But I'm not sure now it's good or bad to support shrinking at limit_change
about swap.
Shrinking swap means increasing the memory usage and that may cause
another swapout.

 Do you consider a some magical way to move pages in swap back to memory ?
 
In this patch, I modified find_next_to_unuse() to find
the entries charged to a specific group.
It might be possible to modify try_to_unuse() (or define another function
based on try_to_unuse()) to reduce the swap usage of a specified group
down to some threshold.
But, I think, one problem here is from which device the swap entries
should be brought back to memory, i.e. the usage balance between swap devices.

 In general, I like this set but we can't change the limit on demand. (maybe)
 (just putting it to TO-DO-List is okay to me.)
 
I'm sorry but what do you mean by change the limit on demand?
Could you explain more?


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-20 Thread Daisuke Nishimura
On Wed, 11 Jun 2008 13:57:34 +0530
Balbir Singh [EMAIL PROTECTED] wrote:

(snip)

   2. Don't move any usage at task move. (current implementation.)
 Pros.
   - no complication in the code.
 Cons.
   - A task's usage is charged to the wrong cgroup.
   - Not sure, but I believe the users don't want this.
  
  I'd say stick with this unless there a strong arguments in favour of
  changing, based on concrete needs.
  
  One reason is that I think a typical usage of the memory controller is
  fork() -> move -> exec() (by libcg ?), and exec() will flush all the usage.
  
  Exactly - this is a good reason *not* to implement move - because then
  you drag all the usage of the middleware daemon into the new cgroup.
  
 
 Yes. The other thing is that charges will eventually fade away. Please see the
 cgroup implementation of page_referenced() and mark_page_accessed(). The
 original group on memory pressure will drop pages that were left behind by a
 task that migrates. The new group will pick it up if referenced.
 
Hmm..
So, it seems that some kind of Lazy Mode (#3 of Kamezawa-san's options)
has been implemented already.

But, one of the reasons I think usage should be moved
is to make the usage as accurate as possible, that is,
the size of memory used by the processes in the group at the moment.

I agree that statistics is not the purpose of memcg (and swap),
but, IMHO, it's a useful feature of memcg.
Administrators can tell how busy or idle each group is from it.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] cgroup: support checking of subsystem dependencies

2008-06-18 Thread Daisuke Nishimura
On Thu, 19 Jun 2008 08:40:24 +0800, Li Zefan [EMAIL PROTECTED] wrote:
 KAMEZAWA Hiroyuki wrote:
  On Wed, 18 Jun 2008 16:01:49 +0800
  Li Zefan [EMAIL PROTECTED] wrote:
  
  This allows one subsystem to require that it only be mounted when some
  other subsystems are also present in the proposed hierarchy.
 
  For example if subsystem foo depends on bar, the following will fail:
   # mount -t cgroup -ofoo xxx /dev/cgroup
 
  You should mount with both subsystems:
   # mount -t cgroup -ofoo,bar xxx /dev/cgroup
 
  I'm just curious. May I ask Is there such cgroup subsystem now ?
  
 
 Currently no, but people who are developing or wants to develop new subsystems
 may find this useful, for example Daisuke Nishimura-san's swap controller.
 
My current version is implemented as an add-on to the memory controller,
so it can be used by just mounting the memory controller :)

But anyway, this feature might be used in the future by some subsystems.


Thanks,
Daisuke Nishimura.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-10 Thread Daisuke Nishimura
Hi, Kamezawa-san.

Sorry for late reply.

On Fri, 6 Jun 2008 10:52:35 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 Move Usage at Task Move (just an experimantal for discussion)
 I tested this but don't think bug-free.
 
 In current memcg, when task moves to a new cg, the usage remains in the old 
 cg.
 This is considered to be not good.
 
I agree.

 This is a trial to move usage from old cg to new cg at task move.
 Finally, you'll see the problems we have to handle are failure and rollback.
 
 This one's Basic algorithm is
 
  0. can_attach() is called.
  1. count movable pages by scanning page table. isolate all pages from 
 LRU.
  2. try to create enough room in new memory cgroup
  3. start moving page accounting
  4. putback pages to LRU.
  5. can_attach() for other cgroups are called.
 
You isolate pages and move their charges in can_attach(),
but that means pages that are allocated between the page isolation
and the update of tsk->cgroups remain charged to the old group, right?

I think it would be better, if possible, to move charges in attach(),
as cpuset migrates pages in cpuset_attach().
But one of the problems with that is that attach() does not return
any value, so there is no way to notify failure...
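
To illustrate the constraint (stand-in types and signatures only, not the
real struct cgroup_subsys): can_attach() can veto with an error code, but
attach() returns void, so a failure while moving charges there has no way
to propagate back:

#include <stdio.h>

struct task;
struct cgroup;

struct subsys_ops {
    int  (*can_attach)(struct cgroup *dst, struct task *t);  /* may fail */
    void (*attach)(struct cgroup *dst, struct task *t);      /* must not */
};

static int memcg_can_attach(struct cgroup *dst, struct task *t)
{
    (void)dst; (void)t;
    return 0;                   /* a precharge/reservation could fail here */
}

static void memcg_attach(struct cgroup *dst, struct task *t)
{
    (void)dst; (void)t;
    /* if moving charges failed here, nobody would ever see the error */
}

static const struct subsys_ops memcg_ops = {
    .can_attach = memcg_can_attach,
    .attach     = memcg_attach,
};

int main(void)
{
    struct cgroup *dst = NULL;
    struct task *t = NULL;

    if (memcg_ops.can_attach(dst, t) == 0)
        memcg_ops.attach(dst, t);
    printf("attach done (no error path exists for attach())\n");
    return 0;
}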

 A case study.
 
   group_A - limit=1G, task_X's usage= 800M.
   group_B - limit=1G, usage=500M.
 
 For moving task_X from group_A to group_B.
   - group_B  should be reclaimed or have enough room.
 
 While moving task_X from group_A to group_B.
   - group_B's memory usage can be changed
   - group_A's memory usage can be changed
 
   We accounts the resouce based on pages. Then, we can't move all resource
   usage at once.
 
   If group_B has no more room when we've moved 700M of task_X to group_B,
   we have to move 700M of task_X back to group_A. So I implemented roll-back.
   But other process may use up group_A's available resource at that point.
   
   For avoiding that, preserve 800M in group_B before moving task_X means that
   task_X can occupy 1600M of resource at moving. (So I don't do in this 
 patch.)
 
   This patch uses Best-Effort rollback. Failure in rollback is ignored and
   the usage is just leaked.
 
If we implement rollback in the kernel, I think it must not fail, to prevent
leaking usage.
How about using charge_force for the rollback?

Or, instead of implementing rollback in the kernel,
how about making the user (or middleware?) re-echo the pid to roll back
on failure?
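
As a toy model of the leak problem (plain userspace C; the numbers are only
inspired by the case study quoted above, not measurements): while charges
are being rolled back, another task in the source group can consume the
freed room, so part of the moved usage can no longer be returned:

#include <stdio.h>

struct grp { long usage, limit; };      /* values in MB for this toy */

static int charge(struct grp *g, long amount)
{
    if (g->usage + amount > g->limit)
        return -1;
    g->usage += amount;
    return 0;
}

int main(void)
{
    struct grp a = { .usage = 800, .limit = 1024 };     /* group_A */
    struct grp b = { .usage = 500, .limit = 1024 };     /* group_B */
    long moved = 0, rolled_back = 0;

    /* move task_X's pages A -> B until B hits its limit */
    while (moved < 800 && charge(&b, 1) == 0) {
        a.usage -= 1;
        moved += 1;
    }
    printf("moved %ldM before group_B filled up\n", moved);

    /* meanwhile some other task in group_A allocates 400M */
    charge(&a, 400);

    /* best-effort rollback: try to put the moved pages back into group_A */
    while (rolled_back < moved && charge(&a, 1) == 0) {
        b.usage -= 1;
        rolled_back += 1;
    }
    printf("rolled back %ldM of %ldM; %ldM is leaked unless we force-charge\n",
           rolled_back, moved, moved - rolled_back);
    return 0;
}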

 Roll-back can happen when
 (a) in phase 3. cannot move a page to new cgroup because of limit.
 (b) in phase 5. other cgourp subsys returns error in can_attach().
 
Isn't rollbak needed on failure between can_attach and attach(e.g. failure
on find_css_set, ...)?

 +int mem_cgroup_recharge_task(struct mem_cgroup *newcg,
 + struct task_struct *task)
 +{
(snip)
 + /* create enough room before move */
 + necessary = info.count * PAGE_SIZE;
 +
 + do {
 + spin_lock(newcg-res.lock);
 + if (newcg-res.limit  necessary)
 + rc = -ENOMEM;
I think it should be (newcg-res.limit  necessary).


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-10 Thread Daisuke Nishimura
On Tue, 10 Jun 2008 17:26:37 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
   This is a trial to move usage from old cg to new cg at task move.
   Finally, you'll see the problems we have to handle are failure and 
   rollback.
   
   This one's Basic algorithm is
   
0. can_attach() is called.
1. count movable pages by scanning page table. isolate all pages 
   from LRU.
2. try to create enough room in new memory cgroup
3. start moving page accouing
4. putback pages to LRU.
5. can_attach() for other cgroups are called.
   
  You isolate pages and move charges of them by can_attach(),
  but it means that pages that are allocated between page isolation
  and moving tsk-cgroups remains charged to old group, right?
 yes.
 
  
  I think it would be better if possible to move charges by attach()
  as cpuset migrates pages by cpuset_attach().
  But one of the problem of it is that attch() does not return
  any value, so there is no way to notify failure...
  
 yes, here again. it makes roll-back more difficult.
 
I think so too. That's why I said it is one of the problems.

   A case study.
   
 group_A - limit=1G, task_X's usage= 800M.
 group_B - limit=1G, usage=500M.
   
   For moving task_X from group_A to group_B.
 - group_B  should be reclaimed or have enough room.
   
   While moving task_X from group_A to group_B.
 - group_B's memory usage can be changed
 - group_A's memory usage can be changed
   
 We accounts the resouce based on pages. Then, we can't move all resource
 usage at once.
   
 If group_B has no more room when we've moved 700M of task_X to group_B,
 we have to move 700M of task_X back to group_A. So I implemented 
   roll-back.
 But other process may use up group_A's available resource at that point.
 
 For avoiding that, preserve 800M in group_B before moving task_X means 
   that
 task_X can occupy 1600M of resource at moving. (So I don't do in this 
   patch.)
   
 This patch uses Best-Effort rollback. Failure in rollback is ignored and
 the usage is just leaked.
   
  If implement rollback in kernel, I think it must not fail to prevent
  leak of usage.
  How about using charge_force for rollbak?
  
 means allowing to exceed limit ?
 
Yes.
I agree that exceeding the limit is not good, but I
just feel that it's better than leaking usage.
Of course, I think the usage should be decreased later
by some method.

  Or, instead of implementing rollback in kernel,
  how about making user(or middle ware?) re-echo pid to rollbak
  on failure?
  
 
 If the user does things well and the system works in a better way, that is O.K.
 If the user doesn't do things well and the system ends up in a broken state, that is very bad.
 
Hum...

I think users must know what they are doing.

They must know that moving a process to another group
that doesn't have enough room for it may fail in a half-done state,
if that is the behavior of the kernel.
And they should handle the error by themselves, IMHO.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-10 Thread Daisuke Nishimura
On Wed, 11 Jun 2008 13:14:37 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Wed, 11 Jun 2008 12:44:46 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
   I'm now considering following logic. How do you think ?
   
   Assume: move TASK from group:CURR to group:DEST.
   
   == move_task(TASK, CURR, DEST)
   
   if (DEST's limit is unlimited)
 moving TASK
 return success.
   
   usage = check_usage_of_task(TASK).
   
   /* try to reserve enough room in the destination */
   if (try_to_reserve_enough_room(DEST, usage)) {
 move TASK to DEST and move pages AMAP.
 /* usage_of_task(TASK) can be changed while we do this.
Then, we move AMAP. */
 return success;
   }
   return failure.
   ==
  
  AMAP means that you might leave some random charges in CURR?
  
 yes. but we can reduce the bad case in some ways 
  - reserve more than necessary.
  or
  - read_lock mmap_sem while moving.
 
I prefer the latter.
Though it's expensive, I think moving a task would not happen
so often.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-27 Thread Daisuke Nishimura
On 2008/05/27 16:42 +0900, Balbir Singh wrote:
 YAMAMOTO Takashi wrote:
 hi,

 Thanks for looking into this. Yamamoto-San is also looking into a swap
 controller. Is there a consensus on the approach?

 Not yet, but I think we should have some consensus each other
 before going further.


 Thanks,
 Daisuke Nishimura.
 while nishimura-san's one still seems to have a lot of todo,
 it seems good enough as a start point to me.
 so i'd like to withdraw mine.

 nishimura-san, is it ok for you?

Of cource.
I'll work hard to make it better.

 
 I would suggest that me merge the good parts from both into the swap 
 controller.
 Having said that I'll let the two of you decide on what the good aspects of 
 both
 are. I cannot see any immediate overlap, but there might be some w.r.t.
 infrastructure used.
 
Well, you mean you'll make another patch based on yamamoto-san's
and mine?

Basically, I think it's difficult to merge
because we charge different objects.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-27 Thread Daisuke Nishimura
On 2008/05/27 22:18 +0900, Balbir Singh wrote:
 Daisuke Nishimura wrote:
 On 2008/05/27 16:42 +0900, Balbir Singh wrote:
 YAMAMOTO Takashi wrote:
 hi,

 Thanks for looking into this. Yamamoto-San is also looking into a swap
 controller. Is there a consensus on the approach?

 Not yet, but I think we should have some consensus each other
 before going further.


 Thanks,
 Daisuke Nishimura.
 while nishimura-san's one still seems to have a lot of todo,
 it seems good enough as a start point to me.
 so i'd like to withdraw mine.

 nishimura-san, is it ok for you?

 Of cource.
 I'll work hard to make it better.

 I would suggest that me merge the good parts from both into the swap 
 controller.
 Having said that I'll let the two of you decide on what the good aspects of 
 both
 are. I cannot see any immediate overlap, but there might be some w.r.t.
 infrastructure used.

 Well, you mean you'll make another patch based on yamamoto-san's
 and mine?

 Basically, I think it's difficult to merge
 because we charge different objects.

 
 Yes, I know - that's why I said infrastructure, to see the grouping and common
 data structure aspects. I'll try and review the code.
 
OK.

I'll wait for your patch.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-27 Thread Daisuke Nishimura
On 2008/05/27 22:46 +0900, Balbir Singh wrote:
 Daisuke Nishimura wrote:
 On 2008/05/27 22:18 +0900, Balbir Singh wrote:
 Daisuke Nishimura wrote:
 On 2008/05/27 16:42 +0900, Balbir Singh wrote:
 YAMAMOTO Takashi wrote:
 hi,

 Thanks for looking into this. Yamamoto-San is also looking into a swap
 controller. Is there a consensus on the approach?

 Not yet, but I think we should have some consensus each other
 before going further.


 Thanks,
 Daisuke Nishimura.
 while nishimura-san's one still seems to have a lot of todo,
 it seems good enough as a start point to me.
 so i'd like to withdraw mine.

 nishimura-san, is it ok for you?

 Of cource.
 I'll work hard to make it better.

 I would suggest that me merge the good parts from both into the swap 
 controller.
 Having said that I'll let the two of you decide on what the good aspects 
 of both
 are. I cannot see any immediate overlap, but there might be some w.r.t.
 infrastructure used.

 Well, you mean you'll make another patch based on yamamoto-san's
 and mine?

 Basically, I think it's difficult to merge
 because we charge different objects.

 Yes, I know - that's why I said infrastructure, to see the grouping and 
 common
 data structure aspects. I'll try and review the code.

 OK.

 I'll wait for your patch.
 
 Hi, Daisuke-San
 
 I am not sending out any patch (sorry for the confusion, if I caused it). I am
 going to review the swapcgroup patchset.
 

No problem :-)

I would very much appreciate it if you could review the swap cgroup patches.

I'll update my patches based on your and others' comments.


Thanks,
Daisuke Nishimura.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 3/4] swapcgroup: implement charge/uncharge

2008-05-23 Thread Daisuke Nishimura
On 2008/05/22 16:37 +0900, KAMEZAWA Hiroyuki wrote:
 On Thu, 22 May 2008 15:20:05 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +int swap_cgroup_charge(struct page *page,
 +struct swap_info_struct *si,
 +unsigned long offset)
 +{
 +int ret;
 +struct page_cgroup *pc;
 +struct mem_cgroup *mem;
 +
 +lock_page_cgroup(page);
 +pc = page_get_page_cgroup(page);
 +if (unlikely(!pc))
 +mem = &init_mem_cgroup;
 +else
 +mem = pc->mem_cgroup;
 +unlock_page_cgroup(page);
 
 If !pc, the page is used before memory controller is available. But is it
 good to be charged to init_mem_cgroup() ?
I'm sorry, but I can't understand this situation.
The memory controller is initialized during kernel initialization,
so aren't all processes created after it is initialized?

 How about returning 'failure' in this case ? I think returning 'failure' here
 is not so bad.
 
 
Which of the following do you mean by 'failure'?

A. make it fail to get swap entry, so the page cannot be swapped out.
B. don't charge this swap entry to any cgroup, but the page
   would be swapped out.

I don't want to do B, because I don't want to create such
not-charged-to-anywhere entries.
And I've seen this condition (!pc) become true several times,
so I charged to init_mem_cgroup.


BTW, I noticed that almost all of the functions I added in this patch set
should check mem_cgroup_subsys.disabled first, because they depend on
the memory cgroup. A sketch of the check is below.
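
For example, each entry point could start with something like this
(just a sketch; the rest of the function stays as in the patch):

	/* do nothing when the memory controller is disabled at boot */
	if (mem_cgroup_subsys.disabled)
		return 0;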

 +
 +css_get(&mem->css);
 
 move this css_get() before unlock_page_cgroup() is safer.
 
OK, thanks.


Daisuke Nishimura.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup

2008-05-23 Thread Daisuke Nishimura
On 2008/05/22 21:32 +0900, KOSAKI Motohiro wrote:
 perhaps, I don't understand your intention exactly.
 Why can't you make wrapper function?
 
 e.g.
 vm_swap_full(page_to_memcg(page))
 
OK.
I'll try it.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/4] swapcgroup(v2)

2008-05-22 Thread Daisuke Nishimura
Hi.

I updated my swapcgroup patch.

Major changes from previous version(*1):
- Rebased on 2.6.26-rc2-mm1 + KAMEZAWA-san's performance
  improvement patchset v4.
- Implemented as an add-on to the memory cgroup.
  So, there is no need to add a new member to page_cgroup now.
- (NEW) Modified vm_swap_full() to calculate the rate of
  swap usage per cgroup.

Patches:
- [1/4] add cgroup files
- [2/4] add member to swap_info_struct for cgroup
- [3/4] implement charge/uncharge
- [4/4] modify vm_swap_full for cgroup

ToDo:
- handle force_empty.
- make it possible for users to select if they use
  this feature or not, and avoid overhead for users
  not using this feature.
- move charges along with task move between cgroups.

*1
https://lists.linux-foundation.org/pipermail/containers/2008-March/010216.html


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/4] swapcgroup: add cgroup files

2008-05-22 Thread Daisuke Nishimura
This patch adds cgroup files (and a member to struct mem_cgroup)
for swapcgroup.

The files to be added are:

- memory.swap_usage_in_bytes
- memory.swap_max_usage_in_bytes
- memory.swap_limit_in_bytes
- memory.swap_failcnt

The meaning of these files is the same as in the memory cgroup.


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 init/Kconfig|7 +
 mm/memcontrol.c |   67 +++
 2 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 6135d07..aabbb83 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -407,6 +407,13 @@ config CGROUP_MEM_RES_CTLR
  This config option also selects MM_OWNER config option, which
  could in turn add some fork/exit overhead.
 
+config CGROUP_SWAP_RES_CTLR
+   bool "Swap Resource Controller for Control Groups"
+   depends on CGROUP_MEM_RES_CTLR && SWAP
+   help
+ Provides a swap resource controller that manages and limits swap usage.
+ Implemented as an add-on to the Memory Resource Controller.
+
 config SYSFS_DEPRECATED
bool
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a96577f..a837215 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -141,6 +141,12 @@ struct mem_cgroup {
 * statistics.
 */
struct mem_cgroup_stat stat;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   /*
+* the counter to account for swap usage
+*/
+   struct res_counter swap_res;
+#endif
 };
 static struct mem_cgroup init_mem_cgroup;
 
@@ -960,6 +966,39 @@ static int mem_control_stat_show(struct cgroup *cont, 
struct cftype *cft,
return 0;
 }
 
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+static u64 swap_cgroup_read(struct cgroup *cont, struct cftype *cft)
+{
+   return res_counter_read_u64(&mem_cgroup_from_cont(cont)->swap_res,
+   cft->private);
+}
+
+static ssize_t swap_cgroup_write(struct cgroup *cont, struct cftype *cft,
+   struct file *file, const char __user *userbuf,
+   size_t nbytes, loff_t *ppos)
+{
+   return res_counter_write(&mem_cgroup_from_cont(cont)->swap_res,
+   cft->private, userbuf, nbytes, ppos,
+   mem_cgroup_write_strategy);
+}
+
+static int swap_cgroup_reset(struct cgroup *cont, unsigned int event)
+{
+   struct mem_cgroup *mem;
+
+   mem = mem_cgroup_from_cont(cont);
+   switch (event) {
+   case RES_MAX_USAGE:
+   res_counter_reset_max(&mem->swap_res);
+   break;
+   case RES_FAILCNT:
+   res_counter_reset_failcnt(&mem->swap_res);
+   break;
+   }
+   return 0;
+}
+#endif
+
 static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -992,6 +1031,31 @@ static struct cftype mem_cgroup_files[] = {
.name = "stat",
.read_map = mem_control_stat_show,
},
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   {
+   .name = "swap_usage_in_bytes",
+   .private = RES_USAGE,
+   .read_u64 = swap_cgroup_read,
+   },
+   {
+   .name = "swap_max_usage_in_bytes",
+   .private = RES_MAX_USAGE,
+   .trigger = swap_cgroup_reset,
+   .read_u64 = swap_cgroup_read,
+   },
+   {
+   .name = "swap_limit_in_bytes",
+   .private = RES_LIMIT,
+   .write = swap_cgroup_write,
+   .read_u64 = swap_cgroup_read,
+   },
+   {
+   .name = "swap_failcnt",
+   .private = RES_FAILCNT,
+   .trigger = swap_cgroup_reset,
+   .read_u64 = swap_cgroup_read,
+   },
+#endif
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1069,6 +1133,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup 
*cont)
}
 
res_counter_init(&mem->res);
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   res_counter_init(&mem->swap_res);
+#endif
 
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))


___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/4] swapcgroup: add member to swap_info_struct for cgroup

2008-05-22 Thread Daisuke Nishimura
This patch adds a member to swap_info_struct for cgroup.

This member, an array of pointers to mem_cgroup, is used to
remember to which cgroup each swap entry is charged.

The memory for this array of pointers is allocated on swapon,
and freed on swapoff.


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/swap.h |3 +++
 mm/swapfile.c|   32 
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index de40f16..67de27b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -141,6 +141,9 @@ struct swap_info_struct {
struct swap_extent *curr_swap_extent;
unsigned old_block_size;
unsigned short * swap_map;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   struct mem_cgroup **memcg;
+#endif
unsigned int lowest_bit;
unsigned int highest_bit;
unsigned int cluster_next;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d3caf3a..232bf20 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1207,6 +1207,9 @@ asmlinkage long sys_swapoff(const char __user * 
specialfile)
 {
struct swap_info_struct * p = NULL;
unsigned short *swap_map;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   struct mem_cgroup **memcg;
+#endif
struct file *swap_file, *victim;
struct address_space *mapping;
struct inode *inode;
@@ -1309,10 +1312,17 @@ asmlinkage long sys_swapoff(const char __user * 
specialfile)
p->max = 0;
swap_map = p->swap_map;
p->swap_map = NULL;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   memcg = p->memcg;
+   p->memcg = NULL;
+#endif
p->flags = 0;
spin_unlock(&swap_lock);
mutex_unlock(&swapon_mutex);
vfree(swap_map);
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   vfree(memcg);
+#endif
inode = mapping->host;
if (S_ISBLK(inode->i_mode)) {
struct block_device *bdev = I_BDEV(inode);
@@ -1456,6 +1466,9 @@ asmlinkage long sys_swapon(const char __user * 
specialfile, int swap_flags)
unsigned long maxpages = 1;
int swapfilesize;
unsigned short *swap_map;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   struct mem_cgroup **memcg;
+#endif
struct page *page = NULL;
struct inode *inode = NULL;
int did_down = 0;
@@ -1479,6 +1492,9 @@ asmlinkage long sys_swapon(const char __user * 
specialfile, int swap_flags)
p->swap_file = NULL;
p->old_block_size = 0;
p->swap_map = NULL;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   p->memcg = NULL;
+#endif
p->lowest_bit = 0;
p->highest_bit = 0;
p->cluster_nr = 0;
@@ -1651,6 +1667,15 @@ asmlinkage long sys_swapon(const char __user * 
specialfile, int swap_flags)
1 /* header page */;
if (error)
goto bad_swap;
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   p->memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *));
+   if (!p->memcg) {
+   error = -ENOMEM;
+   goto bad_swap;
+   }
+   memset(p->memcg, 0, maxpages * sizeof(struct mem_cgroup *));
+#endif
}
 
if (nr_good_pages) {
@@ -1710,11 +1735,18 @@ bad_swap_2:
swap_map = p->swap_map;
p->swap_file = NULL;
p->swap_map = NULL;
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   memcg = p->memcg;
+   p->memcg = NULL;
+#endif
p->flags = 0;
if (!(swap_flags & SWAP_FLAG_PREFER))
++least_priority;
spin_unlock(&swap_lock);
vfree(swap_map);
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+   vfree(memcg);
+#endif
if (swap_file)
filp_close(swap_file, NULL);
 out:


___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/4] swapcgroup: implement charge/uncharge

2008-05-22 Thread Daisuke Nishimura
This patch implements charge and uncharge of swapcgroup.

- what will be charged?
  the number of swap entries, accounted in bytes.

- when to charge/uncharge?
  charge at get_swap_page(), and uncharge at swap_entry_free().

- to what group is the swap entry charged?
  To determine to what mem_cgroup the swap entry should be charged,
  I changed the argument of get_swap_page() from (void) to
  (struct page *). As a result, get_swap_page() can determine
  to what mem_cgroup it should charge the swap entry
  by referring to page->page_cgroup->mem_cgroup.

- from what group is the swap entry uncharged?
  I added to swap_info_struct a member 'struct mem_cgroup **',
  an array of pointers recording to which mem_cgroup each swap
  entry is charged.
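
As a rough sketch (not the literal hunk of this patch), the charge point
in get_swap_page() would look something like this once a free slot
'offset' has been found in 'si':

	if (swap_cgroup_charge(page, si, offset)) {
		/* over the cgroup's swap limit: give the slot back and fail */
		swap_entry_free(si, offset);
		return (swp_entry_t) { 0 };
	}
	return swp_entry(type, offset);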


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/memcontrol.h |   21 +++
 include/linux/swap.h   |4 +-
 mm/memcontrol.c|   47 
 mm/shmem.c |2 +-
 mm/swap_state.c|2 +-
 mm/swapfile.c  |   14 -
 6 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fdf3967..a7e6621 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct swap_info_struct;
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
@@ -172,5 +173,25 @@ static inline long mem_cgroup_calc_reclaim_inactive(struct 
mem_cgroup *mem,
 }
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+extern int swap_cgroup_charge(struct page *page,
+   struct swap_info_struct *si,
+   unsigned long offset);
+extern void swap_cgroup_uncharge(struct swap_info_struct *si,
+   unsigned long offset);
+#else /* CONFIG_CGROUP_SWAP_RES_CTLR */
+static inline int swap_cgroup_charge(struct page *page,
+   struct swap_info_struct *si,
+   unsigned long offset)
+{
+   return 0;
+}
+
+static inline void swap_cgroup_uncharge(struct swap_info_struct *si,
+   unsigned long offset)
+{
+}
+#endif /* CONFIG_CGROUP_SWAP_RES_CTLR */
+
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 67de27b..18887f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -241,7 +241,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t get_swap_page(struct page *);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int swap_duplicate(swp_entry_t);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
@@ -342,7 +342,7 @@ static inline int remove_exclusive_swap_page(struct page *p)
return 0;
 }
 
-static inline swp_entry_t get_swap_page(void)
+static inline swp_entry_t get_swap_page(struct page *page)
 {
swp_entry_t entry;
entry.val = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a837215..84e803d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1220,3 +1220,50 @@ struct cgroup_subsys mem_cgroup_subsys = {
.attach = mem_cgroup_move_task,
.early_init = 0,
 };
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+int swap_cgroup_charge(struct page *page,
+   struct swap_info_struct *si,
+   unsigned long offset)
+{
+   int ret;
+   struct page_cgroup *pc;
+   struct mem_cgroup *mem;
+
+   lock_page_cgroup(page);
+   pc = page_get_page_cgroup(page);
+   if (unlikely(!pc))
+   mem = &init_mem_cgroup;
+   else
+   mem = pc->mem_cgroup;
+   unlock_page_cgroup(page);
+
+   css_get(&mem->css);
+   ret = res_counter_charge(&mem->swap_res, PAGE_SIZE);
+   if (!ret)
+   si->memcg[offset] = mem;
+   else
+   css_put(&mem->css);
+
+   return ret;
+}
+
+void swap_cgroup_uncharge(struct swap_info_struct *si,
+   unsigned long offset)
+{
+   struct mem_cgroup *mem = si->memcg[offset];
+
+   /* mem would be NULL:
+* 1. when get_swap_page() failed at charging swap_cgroup,
+*and called swap_entry_free().
+* 2. when this swap entry had been assigned by
+*get_swap_page_of_type() (via SWSUSP?).
+*/
+   if (mem) {
+   res_counter_uncharge(&mem->swap_res, PAGE_SIZE);
+   si->memcg[offset] = NULL;
+   css_put(&mem->css);
+   }
+}
+#endif
+
diff --git a/mm/shmem.c b/mm/shmem.c
index 95b056d..69f8909 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1029,7 +1029,7 @@ static int shmem_writepage(struct page *page, struct 
writeback_control *wbc)
 * want to check if there's a redundant swappage

[Devel] [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup

2008-05-22 Thread Daisuke Nishimura
This patch modifies vm_swap_full() to calculate
the rate of swap usage per cgroup.

The purpose of this change is to free freeable swap caches
(that is, swap entries) per cgroup, so that swap_cgroup_charge()
fails less frequently.

I think I can free freeable swap caches more aggressively
with one of the patches in Rik's splitlru patchset:

  [patch 04/14] free swap space on swap-in/activation

but, I should verify whether this change to vm_swap_full()
is valid.


Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

---
 include/linux/memcontrol.h |3 +++
 include/linux/swap.h   |6 +-
 mm/memcontrol.c|   10 ++
 mm/memory.c|2 +-
 mm/swapfile.c  |   35 ++-
 5 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7e6621..256b298 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,6 +179,9 @@ extern int swap_cgroup_charge(struct page *page,
unsigned long offset);
 extern void swap_cgroup_uncharge(struct swap_info_struct *si,
unsigned long offset);
+extern int swap_cgroup_vm_swap_full(struct page *page);
+extern u64 swap_cgroup_read_usage(struct mem_cgroup *mem);
+extern u64 swap_cgroup_read_limit(struct mem_cgroup *mem);
 #else /* CONFIG_CGROUP_SWAP_RES_CTLR */
 static inline int swap_cgroup_charge(struct page *page,
struct swap_info_struct *si,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 18887f0..ef156c9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,8 +159,12 @@ struct swap_list_t {
int next;   /* swapfile to be used next */
 };
 
+#ifndef CONFIG_CGROUP_SWAP_RES_CTLR
 /* Swap 50% full? Release swapcache more aggressively.. */
-#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
+#define vm_swap_full(page) (nr_swap_pages*2 < total_swap_pages)
+#else
+#define vm_swap_full(page) swap_cgroup_vm_swap_full(page)
+#endif
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 84e803d..58d72ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1265,5 +1265,15 @@ void swap_cgroup_uncharge(struct swap_info_struct *si,
css_put(&mem->css);
}
 }
+
+u64 swap_cgroup_read_usage(struct mem_cgroup *mem)
+{
+   return res_counter_read_u64(&mem->swap_res, RES_USAGE);
+}
+
+u64 swap_cgroup_read_limit(struct mem_cgroup *mem)
+{
+   return res_counter_read_u64(&mem->swap_res, RES_LIMIT);
+}
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index df8f0e9..be2ff96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2175,7 +2175,7 @@ static int do_swap_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
page_add_anon_rmap(page, vma, address);
 
swap_free(entry);
-   if (vm_swap_full())
+   if (vm_swap_full(page))
remove_exclusive_swap_page(page);
unlock_page(page);
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 682b71e..9256c2d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -429,7 +429,7 @@ void free_swap_and_cache(swp_entry_t entry)
/* Only cache user (+us), or swap space full? Free it! */
/* Also recheck PageSwapCache after page is locked (above) */
if (PageSwapCache(page) && !PageWriteback(page) &&
-   (one_user || vm_swap_full())) {
+   (one_user || vm_swap_full(page))) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
@@ -1892,3 +1892,36 @@ int valid_swaphandles(swp_entry_t entry, unsigned long 
*offset)
*offset = ++toff;
return nr_pages? ++nr_pages: 0;
 }
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+int swap_cgroup_vm_swap_full(struct page *page)
+{
+   int ret;
+   struct swap_info_struct *p;
+   struct mem_cgroup *mem;
+   u64 usage;
+   u64 limit;
+   swp_entry_t entry;
+
+   VM_BUG_ON(!PageLocked(page));
+   VM_BUG_ON(!PageSwapCache(page));
+
+   ret = 0;
+   entry.val = page_private(page);
+   p = swap_info_get(entry);
+   if (!p)
+   goto out;
+
+   mem = p->memcg[swp_offset(entry)];
+   usage = swap_cgroup_read_usage(mem) / PAGE_SIZE;
+   limit = swap_cgroup_read_limit(mem) / PAGE_SIZE;
+   limit = (limit < total_swap_pages) ? limit : total_swap_pages;
+
+   ret = usage * 2 > limit;
+
+   spin_unlock(&swap_lock);
+
+out:
+   return ret;
+}
+#endif


___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/4] swapcgroup: add member to swap_info_struct for cgroup

2008-05-22 Thread Daisuke Nishimura
Hi.

On 2008/05/22 16:23 +0900, KAMEZAWA Hiroyuki wrote:
 On Thu, 22 May 2008 15:18:51 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
 This patch add a member to swap_info_struct for cgroup.

 This member, array of pointers to mem_cgroup, is used to
 remember to which cgroup each swap entries are charged.

 The memory for this array of pointers is allocated on swapon,
 and freed on swapoff.

 Hi, in general, #ifdefs in the middle of functions are not good style.
 I'd like to comment some hints.
 
I completely agree that it's not good style.

 Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]

 ---
  include/linux/swap.h |3 +++
  mm/swapfile.c|   32 
  2 files changed, 35 insertions(+), 0 deletions(-)

 diff --git a/include/linux/swap.h b/include/linux/swap.h
 index de40f16..67de27b 100644
 --- a/include/linux/swap.h
 +++ b/include/linux/swap.h
 @@ -141,6 +141,9 @@ struct swap_info_struct {
  struct swap_extent *curr_swap_extent;
  unsigned old_block_size;
  unsigned short * swap_map;
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +struct mem_cgroup **memcg;
 +#endif
  unsigned int lowest_bit;
  unsigned int highest_bit;
  unsigned int cluster_next;
 diff --git a/mm/swapfile.c b/mm/swapfile.c
 index d3caf3a..232bf20 100644
 --- a/mm/swapfile.c
 +++ b/mm/swapfile.c
 @@ -1207,6 +1207,9 @@ asmlinkage long sys_swapoff(const char __user * 
 specialfile)
  {
  struct swap_info_struct * p = NULL;
  unsigned short *swap_map;
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +struct mem_cgroup **memcg;
 +#endif
 Remove #ifdef.
 
  struct mem_cgroup **memcg = NULL;
 
good idea.
I'll do it.

  struct file *swap_file, *victim;
  struct address_space *mapping;
  struct inode *inode;
 @@ -1309,10 +1312,17 @@ asmlinkage long sys_swapoff(const char __user * 
 specialfile)
  p-max = 0;
  swap_map = p-swap_map;
  p-swap_map = NULL;
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +memcg = p-memcg;
 +p-memcg = NULL;
 +#endif
 
 
 ==
 #ifdef CONFIG_CGROUP_SWAP_RES_CTR
   void  swap_cgroup_init_memcg(p, memcg)
   {
 do something.
   }
 #else
void  swap_cgroup_init_memcg(p, memcg)
   {
   }
 #endif
 ==
 
I think swap_cgroup_init_memcg should return the old value
of p->memcg, and I would like to name it swap_cgroup_clear_memcg,
because it is called by sys_swapoff, so "clear" rather than "init"
would be better.

How about something like this?

struct mem_cgroup **swap_cgroup_clear_memcg(struct swap_info_struct *p)
{
	struct mem_cgroup **mem;

	mem = p->memcg;
	p->memcg = NULL;

	return mem;
}

and at sys_swapoff():

struct mem_cgroup **memcg;
 :
memcg = swap_cgroup_clear_memcg(p);
 :
if (memcg)
vfree(memcg);

  p-flags = 0;
  spin_unlock(swap_lock);
  mutex_unlock(swapon_mutex);
  vfree(swap_map);
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +vfree(memcg);
 +#endif
  if (memcg)
   vfree(memcg);
 
 
will do.

  inode = mapping-host;
  if (S_ISBLK(inode-i_mode)) {
  struct block_device *bdev = I_BDEV(inode);
 @@ -1456,6 +1466,9 @@ asmlinkage long sys_swapon(const char __user * 
 specialfile, int swap_flags)
  unsigned long maxpages = 1;
  int swapfilesize;
  unsigned short *swap_map;
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +struct mem_cgroup **memcg;
 +#endif
 Remove #ifdefs
 
will do.

  struct page *page = NULL;
  struct inode *inode = NULL;
  int did_down = 0;
 @@ -1479,6 +1492,9 @@ asmlinkage long sys_swapon(const char __user * 
 specialfile, int swap_flags)
  p-swap_file = NULL;
  p-old_block_size = 0;
  p-swap_map = NULL;
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +p-memcg = NULL;
 +#endif
 
 void init_swap_ctlr_memcg(p);
 
I would like to call this one swap_cgroup_init_memcg.

  p-lowest_bit = 0;
  p-highest_bit = 0;
  p-cluster_nr = 0;
 @@ -1651,6 +1667,15 @@ asmlinkage long sys_swapon(const char __user * 
 specialfile, int swap_flags)
  1 /* header page */;
  if (error)
  goto bad_swap;
 +
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +p-memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *));
 +if (!p-memcg) {
 +error = -ENOMEM;
 +goto bad_swap;
 +}
 +memset(p-memcg, 0, maxpages * sizeof(struct mem_cgroup *));
 +#endif
 void alloc_swap_ctlr_memcg(p)
 
OK.
I'll implement swap_cgroup_alloc_memcg.

 But this implies swapon will fail at memory shortage. Is it good ?
 
Hmm.
Would it be better to just disable this feature?

  }
  
  if (nr_good_pages) {
 @@ -1710,11 +1735,18 @@ bad_swap_2:
  swap_map = p-swap_map;
  p-swap_file = NULL;
  p-swap_map = NULL;
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +memcg = p-memcg;
 +p-memcg = NULL;
 +#endif
  p-flags = 0;
  if (!(swap_flags  SWAP_FLAG_PREFER))
  ++least_priority;
  spin_unlock(swap_lock
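
Putting the pieces of this review together, the wrappers might end up
looking roughly like this (a sketch only, using the names agreed on
above; the real follow-up patch may differ):

#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
static int swap_cgroup_alloc_memcg(struct swap_info_struct *p,
				   unsigned long maxpages)
{
	p->memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *));
	if (!p->memcg)
		return -ENOMEM;
	memset(p->memcg, 0, maxpages * sizeof(struct mem_cgroup *));
	return 0;
}

static void swap_cgroup_init_memcg(struct swap_info_struct *p)
{
	p->memcg = NULL;
}

static struct mem_cgroup **swap_cgroup_clear_memcg(struct swap_info_struct *p)
{
	struct mem_cgroup **memcg = p->memcg;

	p->memcg = NULL;
	return memcg;
}
#else
static inline int swap_cgroup_alloc_memcg(struct swap_info_struct *p,
					  unsigned long maxpages)
{
	return 0;
}
static inline void swap_cgroup_init_memcg(struct swap_info_struct *p) {}
static inline struct mem_cgroup **swap_cgroup_clear_memcg(struct swap_info_struct *p)
{
	return NULL;
}
#endif

With that, sys_swapoff() would only need "memcg = swap_cgroup_clear_memcg(p);"
followed by "vfree(memcg);" after the lock is dropped, with no #ifdef in
the function body.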

[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup

2008-05-22 Thread Daisuke Nishimura
Hi,

On 2008/05/22 17:00 +0900, KOSAKI Motohiro wrote:
 Hi,
 
 +#ifndef CONFIG_CGROUP_SWAP_RES_CTLR
  /* Swap 50% full? Release swapcache more aggressively.. */
 -#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 +#define vm_swap_full(page) (nr_swap_pages*2 < total_swap_pages)
 +#else
 +#define vm_swap_full(page) swap_cgroup_vm_swap_full(page)
 +#endif
 
 I'd prefer #ifdef rather than #ifndef.
 
 so...
 
 #ifdef CONFIG_CGROUP_SWAP_RES_CTLR
   your definition
 #else
   original definition
 #endif
 
OK.
I'll change it.

 and vm_swap_full() isn't page granularity operation.
 this is memory(or swap) cgroup operation.
 
 this argument is slightly odd.
 
But what the callers of vm_swap_full() know is the page,
not the mem_cgroup.
I don't want to add something like this to the callers:

  pc = get_page_cgroup(page);
  mem = pc->mem_cgroup;
  vm_swap_full(mem);
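
For reference, a page-to-mem_cgroup lookup along the lines KOSAKI-san
suggests could look roughly like the following (page_to_memcg() is not
in the posted patches; it just reuses the lock_page_cgroup() /
page_get_page_cgroup() pattern from patch 3/4):

	static struct mem_cgroup *page_to_memcg(struct page *page)
	{
		struct page_cgroup *pc;
		struct mem_cgroup *mem = NULL;

		lock_page_cgroup(page);
		pc = page_get_page_cgroup(page);
		if (pc)
			mem = pc->mem_cgroup;
		unlock_page_cgroup(page);

		return mem;
	}

Even then, every caller of vm_swap_full() would have to do this lookup,
which is the point above.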


Thanks,
Daisuke Nishimura.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup

2008-05-22 Thread Daisuke Nishimura
Hi,

On 2008/05/22 15:45 +0900, YAMAMOTO Takashi wrote:
 @@ -1892,3 +1892,36 @@ int valid_swaphandles(swp_entry_t entry, unsigned 
 long *offset)
  *offset = ++toff;
  return nr_pages? ++nr_pages: 0;
  }
 +
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +int swap_cgroup_vm_swap_full(struct page *page)
 +{
 +int ret;
 +struct swap_info_struct *p;
 +struct mem_cgroup *mem;
 +u64 usage;
 +u64 limit;
 +swp_entry_t entry;
 +
 +VM_BUG_ON(!PageLocked(page));
 +VM_BUG_ON(!PageSwapCache(page));
 +
 +ret = 0;
 +entry.val = page_private(page);
 +p = swap_info_get(entry);
 +if (!p)
 +goto out;
 +
 +mem = p->memcg[swp_offset(entry)];
 +usage = swap_cgroup_read_usage(mem) / PAGE_SIZE;
 +limit = swap_cgroup_read_limit(mem) / PAGE_SIZE;
 +limit = (limit < total_swap_pages) ? limit : total_swap_pages;
 +
 +ret = usage * 2 > limit;
 +
 +spin_unlock(&swap_lock);
 +
 +out:
 +return ret;
 +}
 +#endif
 
 shouldn't it check the global usage (nr_swap_pages) as well?
 
 YAMAMOTO Takashi
 

I didn't check the global usage because I didn't want
one group to be influenced by other groups.

But with the code above, there are cases where
vm_swap_full() returns false even when more than
half of the swap space is used globally.

Thank you for pointing it out.

How about something like this?

  :
usage = swap_cgroup_read_usage(mem); // no need to convert to number of pages
limit = swap_cgroup_read_limit(mem); // no need to convert to number of pages
ret = (usage * 2 > limit) || (nr_swap_pages * 2 < total_swap_pages);
  :
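
Folded back into the function from patch 4/4, the adjusted check might
look roughly like this (a sketch, not a tested patch):

	int swap_cgroup_vm_swap_full(struct page *page)
	{
		int ret = 0;
		struct swap_info_struct *p;
		struct mem_cgroup *mem;
		u64 usage, limit;
		swp_entry_t entry;

		VM_BUG_ON(!PageLocked(page));
		VM_BUG_ON(!PageSwapCache(page));

		entry.val = page_private(page);
		p = swap_info_get(entry);
		if (!p)
			goto out;

		mem = p->memcg[swp_offset(entry)];
		usage = swap_cgroup_read_usage(mem);
		limit = swap_cgroup_read_limit(mem);
		/* full if either this cgroup or the whole system is above 50% */
		ret = (usage * 2 > limit) ||
		      (nr_swap_pages * 2 < total_swap_pages);

		spin_unlock(&swap_lock);
	out:
		return ret;
	}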


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-22 Thread Daisuke Nishimura
On 2008/05/22 16:44 +0900, KAMEZAWA Hiroyuki wrote:
 On Thu, 22 May 2008 15:13:41 +0900
 Daisuke Nishimura [EMAIL PROTECTED] wrote:
 
 Hi.

 I updated my swapcgroup patch.

 seems good in general.
 

Thanks :-)
 
 ToDo:
 - handle force_empty.
 
 Without this, we can do rmdir() against cgroup with swap. right ?
 
You are right.

There are cases where the cgroup directory cannot be removed
because some swap usage remains
even when no tasks remain in the directory.
In such cases, the only way to remove the directory is currently
to do swapoff.

So, I think this is the most important todo.

 - make it possible for users to select if they use
   this feature or not, and avoid overhead for users
   not using this feature.
 - move charges along with task move between cgroups.

 I think memory-controller's anon pages should also do this


 But how do you think about shared entries ?
 
Yes.
This is a big problem. I don't have any practical idea yet,
but at least I think we should avoid having a shared
entry charged to different groups.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-22 Thread Daisuke Nishimura
(sorry, I sent the previous mail before completing it)

On 2008/05/23 11:10 +0900, Daisuke Nishimura wrote:
 On 2008/05/22 16:44 +0900, KAMEZAWA Hiroyuki wrote:
(snip)
 - make it possible for users to select if they use
   this feature or not, and avoid overhead for users
   not using this feature.
 - move charges along with task move between cgroups.

 I think memory-controller's anon pages should also do this
 
I want it too.

Not only is it useful for users IMHO,
but I also need it to charge a swap entry to the group which the task
belongs to at the point of swap-out.


Thanks,
Daisuke Nishimura.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-22 Thread Daisuke Nishimura
Hi.

On 2008/05/23 6:27 +0900, Balbir Singh wrote:
 Daisuke Nishimura wrote:
 Hi.

 I updated my swapcgroup patch.

 Major changes from previous version(*1):
 - Rebased on 2.6.26-rc2-mm1 + KAMEZAWA-san's performance
   improvement patchset v4.
 - Implemented as a add-on to memory cgroup.
   So, there is no need to add a new member to page_cgroup now.
 - (NEW)Modified vm_swap_full() to calculate the rate of
   swap usage per cgroup.

 Patchs:
 - [1/4] add cgroup files
 - [2/4] add member to swap_info_struct for cgroup
 - [3/4] implement charge/uncharge
 - [4/4] modify vm_swap_full for cgroup

 ToDo:
 - handle force_empty.
 - make it possible for users to select if they use
   this feature or not, and avoid overhead for users
   not using this feature.
 - move charges along with task move between cgroups.

 
 Thanks for looking into this. Yamamoto-San is also looking into a swap
 controller. Is there a consensus on the approach?
 
Not yet, but I think we should reach some consensus with each other
before going further.


Thanks,
Daisuke Nishimura.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


  1   2   >