[Devel] Re: [PATCH][RFC] memory.min_usage again

2008-09-28 Thread YAMAMOTO Takashi
hi,

 On Wed, 10 Sep 2008 08:32:15 -0700
 Balbir Singh [EMAIL PROTECTED] wrote:
 
  YAMAMOTO Takashi wrote:
   hi,
   
   hi,
  
   here's a patch to implement memory.min_usage,
   which controls the minimum memory usage for a cgroup.
  
   it works similarly to mlock;
   global memory reclamation doesn't reclaim memory from
   cgroups whose memory usage is below the value.
   setting it too high is a dangerous operation.
  
  
  Looking through the code I am a little worried, what if every cgroup is
  below minimum value and the system is under memory pressure, do we OOM,
  while we could have easily reclaimed?

i'm not sure what you are worrying about.  can you explain a little more?
under the configuration, OOM is an expected behaviour.

  
  I would prefer to see some heuristics around such a feature, mostly around
  the priority that do_try_to_free_pages() uses to determine how desperate we
  are for reclaiming memory.
  
 Taking priority of memory reclaim path into account is good.
 
 ==
 static unsigned long shrink_inactive_list(unsigned long max_scan,
 struct zone *zone, struct scan_control *sc,
 int priority, int file)
 ==
 How about ignoring min_usage if priority < DEF_PRIORITY - 2 ?

are you suggesting ignoring mlock etc as well in that case?
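
(for reference, a minimal sketch of what such a priority-based bypass could
look like; the helper name is hypothetical and this is not part of the
posted patch:)

==
/* hypothetical: skip the min_usage check once reclaim has become desperate */
static int mem_cgroup_canreclaim_prio(struct page *page,
		struct mem_cgroup *mem, int priority)
{
	if (priority < DEF_PRIORITY - 2)
		return 1;
	return mem_cgroup_canreclaim(page, mem);
}
==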

YAMAMOTO Takashi

 
 
 Thanks,
 -Kame
 


[Devel] [PATCH][RFC] memory.min_usage again

2008-09-10 Thread YAMAMOTO Takashi
hi,

 hi,
 
 here's a patch to implement memory.min_usage,
 which controls the minimum memory usage for a cgroup.
 
 it works similarly to mlock;
 global memory reclamation doesn't reclaim memory from
 cgroups whose memory usage is below the value.
 setting it too high is a dangerous operation.
 
 it's against 2.6.24-rc3-mm2 + memory.swappiness patch i posted here yesterday.
 but it's logically independent from the swappiness patch.
 
 todo:
 - restrict non-root user's operation regardless of owner of cgroupfs files?
 - make oom killer aware of this?
 
 YAMAMOTO Takashi

here's a new version adapted to 2.6.27-rc5-mm1.
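
(purely for illustration: with this applied, the knob could be set from
userland roughly as below.  the mount point, group name and value are
hypothetical; the file name comes from mem_cgroup_files[] in the patch.)

==
/* hypothetical example: reserve a minimum usage for one cgroup */
#include <stdio.h>

int main(void)
{
	/* assumes the memory controller is mounted at /cgroups and "grp" exists */
	FILE *f = fopen("/cgroups/grp/memory.min_usage_in_bytes", "w");

	if (f == NULL)
		return 1;
	fprintf(f, "%llu\n", 64ULL << 20);	/* keep ~64MB away from global reclaim */
	fclose(f);
	return 0;
}
==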

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ee1b2fc..fdf35bf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -72,6 +72,8 @@ extern void mem_cgroup_record_reclaim_priority(struct 
mem_cgroup *mem,
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
int priority, enum lru_list lru);
 
+extern int mem_cgroup_canreclaim(struct page *page, struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 static inline void page_reset_bad_cgroup(struct page *page)
 {
@@ -163,6 +165,13 @@ static inline long mem_cgroup_calc_reclaim(struct 
mem_cgroup *mem,
 {
return 0;
 }
+
+static inline int mem_cgroup_canreclaim(struct page *page,
+   struct mem_cgroup *mem)
+{
+   return 1;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2979d22..a567bdb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,7 @@ struct mem_cgroup {
struct mem_cgroup_lru_info info;
 
int prev_priority;  /* for recording reclaim priority */
+   unsigned long long min_usage; /* XXX should be a part of res_counter? */
/*
 * statistics.
 */
@@ -1004,6 +1005,28 @@ static int mem_control_stat_show(struct cgroup *cont, 
struct cftype *cft,
return 0;
 }
 
+static int mem_cgroup_min_usage_write(struct cgroup *cg, struct cftype *cft,
+   const char *buffer)
+{
+   struct mem_cgroup *mem = mem_cgroup_from_cont(cg);
+   unsigned long long val;
+   int error;
+
+   error = res_counter_memparse_write_strategy(buffer, &val);
+   if (error)
+   return error;
+
+   mem->min_usage = val;
+   return 0;
+}
+
+static u64 mem_cgroup_min_usage_read(struct cgroup *cg, struct cftype *cft)
+{
+   struct mem_cgroup *mem = mem_cgroup_from_cont(cg);
+
+   return mem->min_usage;
+}
+
 static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -1036,8 +1059,43 @@ static struct cftype mem_cgroup_files[] = {
.name = "stat",
.read_map = mem_control_stat_show,
},
+   {
+   .name = "min_usage_in_bytes",
+   .write_string = mem_cgroup_min_usage_write,
+   .read_u64 = mem_cgroup_min_usage_read,
+   },
 };
 
+int mem_cgroup_canreclaim(struct page *page, struct mem_cgroup *mem1)
+{
+   struct page_cgroup *pc;
+   int result = 1;
+
+   if (mem1 != NULL)
+   return 1;
+
+   lock_page_cgroup(page);
+   pc = page_get_page_cgroup(page);
+   if (pc) {
+   struct mem_cgroup *mem2 = pc->mem_cgroup;
+   unsigned long long min_usage;
+
+   BUG_ON(mem2 == NULL);
+   min_usage = mem2->min_usage;
+   if (min_usage != 0) {
+   unsigned long flags;
+
+   spin_lock_irqsave(&mem2->res.lock, flags);
+   if (mem2->res.usage <= min_usage)
+   result = 0;
+   spin_unlock_irqrestore(&mem2->res.lock, flags);
+   }
+   }
+   unlock_page_cgroup(page);
+
+   return result;
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
struct mem_cgroup_per_node *pn;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33e4319..ef37968 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -673,6 +673,9 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
referenced && page_mapping_inuse(page))
goto activate_locked;
 
+   if (!mem_cgroup_canreclaim(page, sc->mem_cgroup))
+   goto activate_locked;
+
 #ifdef CONFIG_SWAP
/*
 * Anonymous process memory has backing store?
@@ -1294,7 +1297,9 @@ static void shrink_active_list(unsigned long nr_pages, 
struct zone *zone,
continue;
}
 
-   if (page_referenced(page, 0, sc->mem_cgroup)) {
+   if (!mem_cgroup_canreclaim(page, sc->mem_cgroup

[Devel] Re: [PATCH][RFC] dirty balancing for cgroups

2008-08-13 Thread YAMAMOTO Takashi
hi,

  @@ -485,7 +502,10 @@ unsigned long mem_cgroup_isolate_pages(unsigned long 
  nr_to_scan,
  if (PageUnevictable(page) ||
   (PageActive(page) && !active) ||
   (!PageActive(page) && active)) {
  -   __mem_cgroup_move_lists(pc, page_lru(page));
  +   if (try_lock_page_cgroup(page)) {
  +   __mem_cgroup_move_lists(pc, page_lru(page));
  +   unlock_page_cgroup(page);
  +   }
  continue;
  }
 
 This chunk seems unrelated and lost

it's necessary to protect from mem_cgroup_{set,clear}_dirty
which modify pc->flags without holding mz->lru_lock.

 I personally dislike the != 0, == 0 comparisons for bitmask operations,
 they seem to make it harder to read somehow. I prefer to write
 !(flags & mask) and (flags & mask), instead.
 
 I guess taste differs,...

yes, it seems different. :)

YAMAMOTO Takashi


[Devel] Re: [PATCH][RFC] dirty balancing for cgroups

2008-08-06 Thread YAMAMOTO Takashi
hi,

 On Fri, 11 Jul 2008 17:34:46 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
  hi,
  
my patch penalizes heavy-writer cgroups as task_dirty_limit does
for heavy-writer tasks.  i don't think that it's necessary to be
tied to the memory subsystem because i merely want to group writers.

   Hmm, maybe what I need is different from this ;)
   Does not seem to be a help for memory reclaim under memcg.
  
  to implement what you need, i think that we need to keep track of
  the numbers of dirty-pages in each memory cgroups as a first step.
  do you agree?
  
 yes, I think so, now.
 
 may be not difficult but will add extra overhead ;( Sigh..

the following is a patch to add the overhead. :)
any comments?

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

diff --git a/fs/buffer.c b/fs/buffer.c
index 9749a90..a2dc642 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -41,6 +41,7 @@
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <linux/memcontrol.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -708,12 +709,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 static int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
 {
-   if (unlikely(!mapping))
-   return !TestSetPageDirty(page);
+   if (unlikely(!mapping)) {
+   if (TestSetPageDirty(page)) {
+   return 0;
+   } else {
+   mem_cgroup_set_page_dirty(page);
+   return 1;
+   }
+   }
 
if (TestSetPageDirty(page))
return 0;
 
+   mem_cgroup_set_page_dirty(page);
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -762,8 +770,14 @@ int __set_page_dirty_buffers(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-   if (unlikely(!mapping))
-   return !TestSetPageDirty(page);
+   if (unlikely(!mapping)) {
+   if (TestSetPageDirty(page)) {
+   return 0;
+   } else {
+   mem_cgroup_set_page_dirty(page);
+   return 1;
+   }
+   }
 
spin_lock(&mapping->private_lock);
if (page_has_buffers(page)) {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ee1b2fc..f04441f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,6 +57,9 @@ extern int
 mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
 extern void mem_cgroup_end_migration(struct page *page);
 
+extern void mem_cgroup_set_page_dirty(struct page *page);
+extern void mem_cgroup_clear_page_dirty(struct page *page);
+
 /*
  * For memory reclaim.
  */
@@ -132,6 +135,14 @@ static inline void mem_cgroup_end_migration(struct page 
*page)
 {
 }
 
+static inline void mem_cgroup_set_page_dirty(struct page *page)
+{
+}
+
+static inline void mem_cgroup_clear_page_dirty(struct page *page)
+{
+}
+
 static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
 {
return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 344a477..33d14b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@ enum mem_cgroup_stat_index {
 */
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS,   /* # of pages charged as rss */
+   MEM_CGROUP_STAT_DIRTY, /* # of dirty pages */
MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
 
@@ -73,6 +74,14 @@ static void __mem_cgroup_stat_add_safe(struct 
mem_cgroup_stat *stat,
stat->cpustat[cpu].count[idx] += val;
 }
 
+static void __mem_cgroup_stat_add(struct mem_cgroup_stat *stat,
+   enum mem_cgroup_stat_index idx, int val)
+{
+   int cpu = get_cpu();
+   stat->cpustat[cpu].count[idx] += val;
+   put_cpu();
+}
+
 static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
enum mem_cgroup_stat_index idx)
 {
@@ -164,6 +173,7 @@ struct page_cgroup {
 #define PAGE_CGROUP_FLAG_ACTIVE(0x2)   /* page is active in this 
cgroup */
 #define PAGE_CGROUP_FLAG_FILE (0x4)/* page is file system backed */
 #define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+#define PAGE_CGROUP_FLAG_DIRTY (0x10)  /* accounted as dirty */
 
 static int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -196,6 +206,9 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup 
*mem, int flags,
else
__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
 
+   if (flags & PAGE_CGROUP_FLAG_DIRTY)
+   __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_DIRTY, val);
+
if (charge

[Devel] Re: [PATCH][RFC] dirty balancing for cgroups

2008-08-06 Thread YAMAMOTO Takashi
hi,

 On Wed,  6 Aug 2008 17:20:46 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
  hi,
  
   On Fri, 11 Jul 2008 17:34:46 +0900 (JST)
   [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
   
hi,

  my patch penalizes heavy-writer cgroups as task_dirty_limit does
  for heavy-writer tasks.  i don't think that it's necessary to be
  tied to the memory subsystem because i merely want to group writers.
  
 Hmm, maybe what I need is different from this ;)
 Does not seem to be a help for memory reclaim under memcg.

to implement what you need, i think that we need to keep track of
the numbers of dirty-pages in each memory cgroups as a first step.
do you agree?

   yes, I think so, now.
   
   may be not difficult but will add extra overhead ;( Sigh..
  
  the following is a patch to add the overhead. :)
  any comments?
  
 Do you have some numbers ? ;) 

not yet.

 I like this because this seems very straightforward. thank you.

good to hear.

 How about changing these to be
 
 ==
 void mem_cgroup_test_set_page_dirty()
 {
   if (try_lock_page_cgroup(pg)) {
   pc = page_get_page_cgroup(pg);
   if (pc ..) {
   }
   unlock_page_cgroup(pg)
   }
 }
 ==

i'm not sure how many opportunities to update statistics
we would lose for the trylock failure.
although the statistics don't need to be too precise,
its error should have a reasonable upper-limit to be useful.
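
(to make the comparison concrete, a sketch of the trylock-based variant
being suggested might look like this.  the function name follows the
pseudocode above; the mem_cgroup ->stat member name is an assumption, the
rest uses the helpers introduced by the patch:)

==
void mem_cgroup_test_set_page_dirty(struct page *pg)
{
	struct page_cgroup *pc;

	/* if the lock is contended, skip the update; the counter stays approximate */
	if (!try_lock_page_cgroup(pg))
		return;
	pc = page_get_page_cgroup(pg);
	if (pc != NULL && !(pc->flags & PAGE_CGROUP_FLAG_DIRTY)) {
		pc->flags |= PAGE_CGROUP_FLAG_DIRTY;
		__mem_cgroup_stat_add(&pc->mem_cgroup->stat,
				MEM_CGROUP_STAT_DIRTY, 1);
	}
	unlock_page_cgroup(pg);
}
==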

 Off-topic: I wonder we can delete this lock in future.
 
 Because page->page_cgroup is
  1. attached at first use. (Obviously no race with set_dirty)
  2. deleted at removal. (force_empty is problematic here..)

i hope it's possible. :)

YAMAMOTO Takashi

 
 But, now, we need this lock.
 
 Thanks,
 -Kame
 


[Devel] Re: [PATCH][RFC] dirty balancing for cgroups

2008-07-11 Thread YAMAMOTO Takashi
- This looks simple but, could you merge this into memory resource 
   controller ?
  
  why?
  
 3 points.
  1. Is this useful if used alone ?

it can be.  why not?

  2. memcg requires this kind of feature, basically.
 
  3. I wonder I need more work to make this work well under memcg.

i'm not sure if i understand these points.  can you explain a bit?

my patch penalizes heavy-writer cgroups as task_dirty_limit does
for heavy-writer tasks.  i don't think that it's necessary to be
tied to the memory subsystem because i merely want to group writers.

otoh, if you want to limit the number (or percentage or whatever) of
dirty pages in a memory cgroup, it can't be done independently from
the memory subsystem, of course.  it's another story, tho.

YAMAMOTO Takashi

 
 If chasing page-cgroup and memcg make this patch much more complex,
 I think this style of implementation is a choice.
 
  About 3. 
 Does this work well if I change get_dirty_limit()'s
 determine_dirtyable_memory() calculation under memcg ?
 But to do this seems not valid if the dirty_ratio cgroup and memcg cgroup
 contain different sets of tasks.
  
 Thanks,
 -Kame


[Devel] Re: [PATCH][RFC] dirty balancing for cgroups

2008-07-11 Thread YAMAMOTO Takashi
hi,

  my patch penalizes heavy-writer cgroups as task_dirty_limit does
  for heavy-writer tasks.  i don't think that it's necessary to be
  tied to the memory subsystem because i merely want to group writers.
  
 Hmm, maybe what I need is different from this ;)
 Does not seem to be a help for memory reclaim under memcg.

to implement what you need, i think that we need to keep track of
the numbers of dirty-pages in each memory cgroups as a first step.
do you agree?

YAMAMOTO Takashi


[Devel] Re: [PATCH][RFC] dirty balancing for cgroups

2008-07-10 Thread YAMAMOTO Takashi
hi,

 On Wed,  9 Jul 2008 15:00:34 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
  hi,
  
  the following patch is a simple implementation of
  dirty balancing for cgroups.  any comments?
  
  it depends on the following fix:
  http://lkml.org/lkml/2008/7/8/428
  
 
 A few comments  ;)

thanks for comments.

  - This looks simple but, could you merge this into memory resource 
 controller ?

why?

(if conflict, I'll queue on my stack.)
  - Do you have some number ? or How we can test this works well ?

i have no numbers yet.
it should work as well as the one for tasks.  (ie. i don't know. :)

  - please CC to linux-mm.

ok, i will.

YAMAMOTO Takashi

 
 Thanks,
 -Kame
 
  YAMAMOTO Takashi
  
  
  Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
  ---
  
  diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
  index 23c02e2..f5453cc 100644
  --- a/include/linux/cgroup_subsys.h
  +++ b/include/linux/cgroup_subsys.h
  @@ -52,3 +52,9 @@ SUBSYS(memrlimit_cgroup)
   #endif
   
   /* */
  +
  +#ifdef CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR
  +SUBSYS(memdirtylimit_cgroup)
  +#endif
  +
  +/* */
  diff --git a/include/linux/memdirtylimitcgroup.h 
  b/include/linux/memdirtylimitcgroup.h
  new file mode 100644
  index 000..667d312
  --- /dev/null
  +++ b/include/linux/memdirtylimitcgroup.h
  @@ -0,0 +1,47 @@
  +
  +/*
  + * memdirtylimitcgroup.h COPYRIGHT FUJITSU LIMITED 2008
  + *
  + * Author: [EMAIL PROTECTED]
  + */
  +
  +struct task_struct;
  +
  +#if defined(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR)
  +
  +void memdirtylimitcgroup_dirty_inc(struct task_struct *);
  +void memdirtylimitcgroup_dirty_limit(struct task_struct *, long *);
  +void memdirtylimitcgroup_change_shift(int);
  +void memdirtylimitcgroup_init(int);
  +
  +#else /* defined(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR) */
  +
  +static inline void
  +memdirtylimitcgroup_dirty_inc(struct task_struct *t)
  +{
  +
  +   /* nothing */
  +}
  +
  +static inline void
  +memdirtylimitcgroup_dirty_limit(struct task_struct *t, long *dirtyp)
  +{
  +
  +   /* nothing */
  +}
  +
  +static inline void
  +memdirtylimitcgroup_change_shift(int shift)
  +{
  +
  +   /* nothing */
  +}
  +
  +static inline void
  +memdirtylimitcgroup_init(int shift)
  +{
  +
  +   /* nothing */
  +}
  +
  +#endif /* defined(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR) */
  diff --git a/init/Kconfig b/init/Kconfig
  index 162d462..985bac8 100644
  --- a/init/Kconfig
  +++ b/init/Kconfig
  @@ -418,6 +418,12 @@ config CGROUP_MEMRLIMIT_CTLR
memory RSS and Page Cache control. Virtual address space control
is provided by this controller.
   
  +config CGROUP_MEMDIRTYLIMIT_CTLR
+   bool "Memory Dirty Limit Controller for Control Groups"
+   depends on CGROUPS && RESOURCE_COUNTERS
  +   help
  + XXX TBD
  +
   config SYSFS_DEPRECATED
  bool
   
  diff --git a/mm/Makefile b/mm/Makefile
  index f54232d..8603d19 100644
  --- a/mm/Makefile
  +++ b/mm/Makefile
  @@ -35,4 +35,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o
   obj-$(CONFIG_QUICKLIST) += quicklist.o
   obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
   obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
  +obj-$(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR) += memdirtylimitcgroup.o
   obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
  diff --git a/mm/memdirtylimitcgroup.c b/mm/memdirtylimitcgroup.c
  new file mode 100644
  index 000..b70b33d
  --- /dev/null
  +++ b/mm/memdirtylimitcgroup.c
  @@ -0,0 +1,179 @@
  +
  +/*
  + * memdirtylimitcgroup.c COPYRIGHT FUJITSU LIMITED 2008
  + *
  + * Author: [EMAIL PROTECTED]
  + */
  +
  +#include <linux/err.h>
  +#include <linux/cgroup.h>
  +#include <linux/rcupdate.h>
  +#include <linux/slab.h>
  +#include <linux/memdirtylimitcgroup.h>
  +
  +#include <asm/div64.h>
  +
  +static struct prop_descriptor vm_cgroup_dirties;
  +
  +struct memdirtylimit_cgroup {
  +   struct cgroup_subsys_state dlcg_css;
  +   spinlock_t dlcg_lock;
  +   struct prop_local_single dlcg_dirties;
  +};
  +
  +static struct cgroup_subsys_state *
  +task_to_css(struct task_struct *task)
  +{
  +
  +   return task_subsys_state(task, memdirtylimit_cgroup_subsys_id);
  +}
  +
  +static struct memdirtylimit_cgroup *
  +css_to_dlcg(struct cgroup_subsys_state *css)
  +{
  +
  +   return container_of(css, struct memdirtylimit_cgroup, dlcg_css);
  +}
  +
  +static struct cgroup_subsys_state *
  +cg_to_css(struct cgroup *cg)
  +{
  +
  +   return cgroup_subsys_state(cg, memdirtylimit_cgroup_subsys_id);
  +}
  +
  +static struct memdirtylimit_cgroup *
  +cg_to_dlcg(struct cgroup *cg)
  +{
  +
  +   return css_to_dlcg(cg_to_css(cg));
  +}
  +
  +/*  */
  +
  +static void
  +getfraction(struct memdirtylimit_cgroup *dlcg, long *numeratorp,
  +long *denominatorp)
  +{
  +
  +   spin_lock(&dlcg->dlcg_lock);
  +   prop_fraction_single(&vm_cgroup_dirties, &dlcg->dlcg_dirties,
  +  numeratorp, denominatorp);
  +   spin_unlock(&dlcg->dlcg_lock

[Devel] [PATCH][RFC] dirty balancing for cgroups

2008-07-09 Thread YAMAMOTO Takashi
hi,

the following patch is a simple implementation of
dirty balancing for cgroups.  any comments?

it depends on the following fix:
http://lkml.org/lkml/2008/7/8/428

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 23c02e2..f5453cc 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -52,3 +52,9 @@ SUBSYS(memrlimit_cgroup)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR
+SUBSYS(memdirtylimit_cgroup)
+#endif
+
+/* */
diff --git a/include/linux/memdirtylimitcgroup.h 
b/include/linux/memdirtylimitcgroup.h
new file mode 100644
index 000..667d312
--- /dev/null
+++ b/include/linux/memdirtylimitcgroup.h
@@ -0,0 +1,47 @@
+
+/*
+ * memdirtylimitcgroup.h COPYRIGHT FUJITSU LIMITED 2008
+ *
+ * Author: [EMAIL PROTECTED]
+ */
+
+struct task_struct;
+
+#if defined(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR)
+
+void memdirtylimitcgroup_dirty_inc(struct task_struct *);
+void memdirtylimitcgroup_dirty_limit(struct task_struct *, long *);
+void memdirtylimitcgroup_change_shift(int);
+void memdirtylimitcgroup_init(int);
+
+#else /* defined(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR) */
+
+static inline void
+memdirtylimitcgroup_dirty_inc(struct task_struct *t)
+{
+
+   /* nothing */
+}
+
+static inline void
+memdirtylimitcgroup_dirty_limit(struct task_struct *t, long *dirtyp)
+{
+
+   /* nothing */
+}
+
+static inline void
+memdirtylimitcgroup_change_shift(int shift)
+{
+
+   /* nothing */
+}
+
+static inline void
+memdirtylimitcgroup_init(int shift)
+{
+
+   /* nothing */
+}
+
+#endif /* defined(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR) */
diff --git a/init/Kconfig b/init/Kconfig
index 162d462..985bac8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -418,6 +418,12 @@ config CGROUP_MEMRLIMIT_CTLR
  memory RSS and Page Cache control. Virtual address space control
  is provided by this controller.
 
+config CGROUP_MEMDIRTYLIMIT_CTLR
+   bool "Memory Dirty Limit Controller for Control Groups"
+   depends on CGROUPS && RESOURCE_COUNTERS
+   help
+ XXX TBD
+
 config SYSFS_DEPRECATED
bool
 
diff --git a/mm/Makefile b/mm/Makefile
index f54232d..8603d19 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,4 +35,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
 obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
+obj-$(CONFIG_CGROUP_MEMDIRTYLIMIT_CTLR) += memdirtylimitcgroup.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/memdirtylimitcgroup.c b/mm/memdirtylimitcgroup.c
new file mode 100644
index 000..b70b33d
--- /dev/null
+++ b/mm/memdirtylimitcgroup.c
@@ -0,0 +1,179 @@
+
+/*
+ * memdirtylimitcgroup.c COPYRIGHT FUJITSU LIMITED 2008
+ *
+ * Author: [EMAIL PROTECTED]
+ */
+
+#include <linux/err.h>
+#include <linux/cgroup.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+#include <linux/memdirtylimitcgroup.h>
+
+#include <asm/div64.h>
+
+static struct prop_descriptor vm_cgroup_dirties;
+
+struct memdirtylimit_cgroup {
+   struct cgroup_subsys_state dlcg_css;
+   spinlock_t dlcg_lock;
+   struct prop_local_single dlcg_dirties;
+};
+
+static struct cgroup_subsys_state *
+task_to_css(struct task_struct *task)
+{
+
+   return task_subsys_state(task, memdirtylimit_cgroup_subsys_id);
+}
+
+static struct memdirtylimit_cgroup *
+css_to_dlcg(struct cgroup_subsys_state *css)
+{
+
+   return container_of(css, struct memdirtylimit_cgroup, dlcg_css);
+}
+
+static struct cgroup_subsys_state *
+cg_to_css(struct cgroup *cg)
+{
+
+   return cgroup_subsys_state(cg, memdirtylimit_cgroup_subsys_id);
+}
+
+static struct memdirtylimit_cgroup *
+cg_to_dlcg(struct cgroup *cg)
+{
+
+   return css_to_dlcg(cg_to_css(cg));
+}
+
+/*  */
+
+static void
+getfraction(struct memdirtylimit_cgroup *dlcg, long *numeratorp,
+long *denominatorp)
+{
+
+   spin_lock(&dlcg->dlcg_lock);
+   prop_fraction_single(&vm_cgroup_dirties, &dlcg->dlcg_dirties,
+  numeratorp, denominatorp);
+   spin_unlock(&dlcg->dlcg_lock);
+}
+
+/*  */
+
+void
+memdirtylimitcgroup_dirty_inc(struct task_struct *t)
+{
+   struct memdirtylimit_cgroup *dlcg;
+
+   rcu_read_lock();
+   dlcg = css_to_dlcg(task_to_css(t));
+   spin_lock(&dlcg->dlcg_lock);
+   prop_inc_single(&vm_cgroup_dirties, &dlcg->dlcg_dirties);
+   spin_unlock(&dlcg->dlcg_lock);
+   rcu_read_unlock();
+}
+
+void
+memdirtylimitcgroup_dirty_limit(struct task_struct *t, long *dirtyp)
+{
+   struct memdirtylimit_cgroup *dlcg;
+   unsigned long dirty = *dirtyp;
+   uint64_t tmp;
+   long numerator;
+   long denominator;
+
+   BUG_ON(*dirtyp < 0);
+
+   rcu_read_lock();
+   dlcg = css_to_dlcg(task_to_css(t));
+   getfraction(dlcg, &numerator, &denominator);
+   rcu_read_unlock
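
(the message is truncated above.  as a rough idea only, the remainder
presumably scales the limit down by the cgroup's share of recent
dirtyings, much like task_dirty_limit() in mm/page-writeback.c; a
hypothetical, stand-alone sketch of that step:)

==
/* hypothetical sketch, mirroring task_dirty_limit(): reduce the dirty
 * limit in proportion to this cgroup's share of recent dirtyings. */
static void scale_dirty_limit(long *dirtyp, long numerator, long denominator)
{
	unsigned long dirty = *dirtyp;
	uint64_t inv = (uint64_t)(dirty >> 3) * numerator;

	do_div(inv, denominator);
	dirty -= inv;
	if (dirty < (unsigned long)*dirtyp / 2)
		dirty = *dirtyp / 2;
	*dirtyp = dirty;
}
==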

[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-04 Thread YAMAMOTO Takashi
hi,

 +/*
 + * uncharge all the entries that are charged to the group.
 + */
 +void __swap_cgroup_force_empty(struct mem_cgroup *mem)
 +{
 + struct swap_info_struct *p;
 + int type;
 +
 + spin_lock(&swap_lock);
 + for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
 + p = swap_info + type;
 +
 + if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) {
 + unsigned int i = 0;
 +
 + spin_unlock(&swap_lock);

what prevents the device from being swapoff'ed while you drop swap_lock?

YAMAMOTO Takashi

 + while ((i = find_next_to_unuse(p, i, mem)) != 0) {
 + spin_lock(&swap_lock);
 + if (p->swap_map[i] && p->memcg[i] == mem)
 + swap_cgroup_uncharge(p, i);
 + spin_unlock(&swap_lock);
 + }
 + spin_lock(&swap_lock);
 + }
 + }
 + spin_unlock(&swap_lock);
 +
 + return;
 +}
  #endif


[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty

2008-07-04 Thread YAMAMOTO Takashi
 Hi, Yamamoto-san.
 
 Thank you for your comment.
 
 On Fri,  4 Jul 2008 15:54:31 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO 
 Takashi) wrote:
  hi,
  
   +/*
   + * uncharge all the entries that are charged to the group.
   + */
   +void __swap_cgroup_force_empty(struct mem_cgroup *mem)
   +{
   + struct swap_info_struct *p;
   + int type;
   +
    + spin_lock(&swap_lock);
    + for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
    + p = swap_info + type;
    +
    + if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) {
    + unsigned int i = 0;
    +
    + spin_unlock(&swap_lock);
  
  what prevents the device from being swapoff'ed while you drop swap_lock?
  
 Nothing.
 
 After searching the entry to be uncharged(find_next_to_unuse below),
 I recheck under swap_lock whether the entry is charged to the group.
 Even if the device is swapoff'ed, swap_off must have uncharged the entry,
 so I don't think it's needed anyway.
 
  YAMAMOTO Takashi
  
   + while ((i = find_next_to_unuse(p, i, mem)) != 0) {
    + spin_lock(&swap_lock);
    + if (p->swap_map[i] && p->memcg[i] == mem)
 Ah, I think !p->swap_map should be added to check that the device has not
 been swapoff'ed.

find_next_to_unuse seems to have fragile assumptions and
can dereference p->swap_map as well.

YAMAMOTO Takashi

 
 
 Thanks,
 Daisuke Nishimura.
 
   + swap_cgroup_uncharge(p, i);
    + spin_unlock(&swap_lock);
    + }
    + spin_lock(&swap_lock);
    + }
    + }
    + spin_unlock(&swap_lock);
   +
   + return;
   +}
#endif


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-11 Thread YAMAMOTO Takashi
  i think that you can redirect new charges in TASK to DEST
  so that usage_of_task(TASK) will not grow.
  
 
 Hmm, to do that, we have to handle complicated cgroup's attach ops.
 
 at this moving, memcg is pointed by
  - TASK->cgroup->memcg (CURR)
 after move
  - TASK->another_cgroup->memcg (DEST)
 
 This move happens before cgroup is replaced by another_cgroup.

currently cgroup_attach_task calls ->attach callbacks after
assigning tsk->cgroups.  are you talking about something else?

YAMAMOTO Takashi


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-10 Thread YAMAMOTO Takashi
 On Tue, 10 Jun 2008 14:50:32 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
3. Use Lazy Manner
 When the task moves, we can mark the pages used by it as
 Wrong Charge, Should be dropped, and add them some penalty in the 
   LRU.
   Pros.
 - no complicated ones.
 - the pages will be gradually moved at memory pressure.
   Cons.
 - A task's usage can exceed the limit for a while.
 - can't handle mlocked() memory in proper way.
   
4. Allow Half-moved state and abandon rollback.
   Pros.
 - no complicated ones in the code.
   Cons.
 - the users will be in chaos.
  
  how about:
  
  5. try to move charges as your patch does.
 if the target cgroup's usage is going to exceed the limit,
 try to shrink it.  if it failed, just leave it exceeded.
 (ie. no rollback)
 for the memory subsystem, which can use its OOM killer,
 the failure should be rare.
  
 
 Hmm, allowing exceed and cause OOM kill ?
 
 One difficult point is that the users cannot know they can move task
 without any risk. How to handle the risk can be a point. 
 I don't like that approarch in general because I don't like exceed
 status. But implementation will be easy.

regardless of how to handle task moves,
it's important to provide information to help users
to avoid unreasonable cgroup/task placement.
otherwise, they will be surprised by OOM-killer etc anyway.

having said that, if you decide to put too large tasks into
a cgroup with too small limit, i don't think that there are
many choices besides OOM-kill and allowing exceed.

actually, i think that #3 and #5 are somewhat similar.
a big difference is that, while #5 shrinks the cgroup immediately,
#3 does it later.  in case we need to do OOM-kill, i prefer to do it
sooner than later.

   After writing this patch, for me, 3 is attractive. now.
   (or using Lazy manner and allow moving of usage instead of freeing it.)
   
    One reason is that I think a typical usage of the memory controller is
    fork()->move->exec(). (by libcg ?) and exec() will flush all the usage.
  
  i guess that moving long-running applications can be desirable
  esp. for not so well-designed systems.
  
 
 hmm, for not so well-designed systems ... true.
 But 5 has the same kind of risks for not so well-designed systems ;)

i don't claim that #5 is a perfect solution for everyone. :)

YAMAMOTO Takashi


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-10 Thread YAMAMOTO Takashi
 I'm now considering following logic. How do you think ?
 
 Assume: move TASK from group:CURR to group:DEST.
 
 == move_task(TASK, CURR, DEST)
 
 if (DEST's limit is unlimited)
   moving TASK
   return success.
 
 usage = check_usage_of_task(TASK).
 
 /* try to reserve enough room in destionation */
 if (try_to_reserve_enough_room(DEST, usage)) {
   move TASK to DEST and move pages AMAP.
   /* usage_of_task(TASK) can be changed while we do this.
  Then, we move AMAP. */
   return success;
 }
 return failure.
 ==

AMAP means that you might leave some random charges in CURR?

i think that you can redirect new charges in TASK to DEST
so that usage_of_task(TASK) will not grow.

YAMAMOTO Takashi


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-10 Thread YAMAMOTO Takashi
  having said that, if you decide to put too large tasks into
  a cgroup with too small limit, i don't think that there are
  many choices besides OOM-kill and allowing exceed.
  
 IMHO, allowing exceed is harmful without changing the definition of limit.
 limit is hard-limit, now, not soft-limit. Changing the definition just for
 this is not acceptable for me. 

even with the current code, the exceed condition can be created
by simply lowering the limit.
(well, i know that some of your patches floating around change it.)

 Maybe move under limit itself is crazy ops ... Hmm...
 
 Should we allow task move when the destination cgroup is unlimited ?
 Isn't it useful ?

i think it makes some sense.

  actually, i think that #3 and #5 are somewhat similar.
  a big difference is that, while #5 shrinks the cgroup immediately,
  #3 does it later.  in case we need to do OOM-kill, i prefer to do it
  sooner than later.
  
 #3 will not cause OOM-killer, I hope...A user can notice memory shortage.

we are talking about the case where a cgroup's working set is getting
hopelessly larger than its limit.  i don't see why #3 will not
cause OOM-kill.  can you explain?

YAMAMOTO Takashi


[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move

2008-06-09 Thread YAMAMOTO Takashi
 For avoiding complicated rollbacks,
 I think of following ways of policy for task moving (you can add here.)
 
  1. Before moving usage, reserve usage in the new cgroup and old cgroup.
 Pros.
  - rollback will be very easy.
 Cons.
   - A task will use twice its own usage virtually for a while.
   - some amount of cpu time will be necessary to move _Big_ apps.
   - It's difficult to move _Big_ apps to a small memcg.
  - we have to add special case handling.
 
  2. Don't move any usage at task move. (current implementation.)
 Pros.
   - no complication in the code.
 Cons.
   - A task's usage is charged to the wrong cgroup.
   - Not sure, but I believe the users don't want this.
 
  3. Use Lazy Manner
   When the task moves, we can mark the pages used by it as
   Wrong Charge, Should be dropped, and add them some penalty in the LRU.
 Pros.
   - no complicated ones.
   - the pages will be gradually moved at memory pressure.
 Cons.
   - A task's usage can exceed the limit for a while.
   - can't handle mlocked() memory in proper way.
 
  4. Allow Half-moved state and abandon rollback.
 Pros.
   - no complicated ones in the code.
 Cons.
   - the users will be in chaos.

how about:

5. try to move charges as your patch does.
   if the target cgroup's usage is going to exceed the limit,
   try to shrink it.  if it failed, just leave it exceeded.
   (ie. no rollback)
   for the memory subsystem, which can use its OOM killer,
   the failure should be rare.
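
(a rough sketch of what 5. could look like; the helper name is hypothetical,
res_counter_charge_force is the forced-charge helper from the swap
controller patch, and mem_cgroup::res is the existing resource counter:)

==
/* hypothetical sketch of option 5: charge the destination, shrink it on
 * failure, and in the worst case leave it exceeded (no rollback). */
static void move_usage_to_dest(struct mem_cgroup *from, struct mem_cgroup *to,
		unsigned long long usage)
{
	if (res_counter_charge(&to->res, usage)) {
		/* over limit: try to reclaim from the destination, then retry */
		try_to_free_mem_cgroup_pages(to, GFP_KERNEL);
		if (res_counter_charge(&to->res, usage))
			res_counter_charge_force(&to->res, usage);
	}
	res_counter_uncharge(&from->res, usage);
}
==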

 After writing this patch, for me, 3 is attractive. now.
 (or using Lazy manner and allow moving of usage instead of freeing it.)
 
 One reason is that I think a typical usage of the memory controller is
 fork()->move->exec(). (by libcg ?) and exec() will flush all the usage.

i guess that moving long-running applications can be desirable
esp. for not so well-designed systems.

YAMAMOTO Takashi


[Devel] Re: [RFC 1/2] memcg: hierarchy support core (yet another one)

2008-05-28 Thread YAMAMOTO Takashi
 @@ -39,6 +39,18 @@ struct res_counter {
*/
   unsigned long long failcnt;
   /*
 +  * the amount of resource that comes from the parent cgroup. Should be
 +  * returned to the parent at destroying/resizing this res_counter.
 +  */
 + unsigned long long borrow;

why do you need this in addition to the limit?
ie. aren't their values always equal except the root cgroup?

YAMAMOTO Takashi


[Devel] Re: [PATCH 0/4] swapcgroup(v2)

2008-05-27 Thread YAMAMOTO Takashi
hi,

  Thanks for looking into this. Yamamoto-San is also looking into a swap
  controller. Is there a consensus on the approach?
  
 Not yet, but I think we should have some consensus each other
 before going further.
 
 
 Thanks,
 Daisuke Nishimura.

while nishimura-san's one still seems to have a lot of todo,
it seems good enough as a start point to me.
so i'd like to withdraw mine.

nishimura-san, is it ok for you?

YAMAMOTO Takashi


[Devel] Re: [RFC 2/4] memcg: high-low watermark

2008-05-26 Thread YAMAMOTO Takashi
 +enum res_state res_counter_state(struct res_counter *counter)
 +{
 + unsigned long flags;
 + enum res_state ret = RES_BELOW_LIMIT;
 +
 + spin_lock_irqsave(&counter->lock, flags);
 + if (counter->use_watermark) {
 + if (counter->usage <= counter->lwmark)
 + ret = RES_BELOW_LOW;
 + else if (counter->usage <= counter->hwmark)
 + ret = RES_BELOW_HIGH;
 + }
 + spin_unlock_irqrestore(&counter->lock, flags);
 + return ret;
 +}

can't it be RES_OVER_LIMIT?
eg. when you lower the limit.

YAMAMOTO Takashi


[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup

2008-05-25 Thread YAMAMOTO Takashi
 How about something like this?
 
   :
 usage = swap_cgroup_read_usage(mem);  //no need to align to number of page
 limit = swap_cgroup_read_limit(mem);  //no need to align to number of page
 ret = (usage * 2 > limit) || (nr_swap_pages * 2 < total_swap_pages)
   :

it seems reasonable to me.

YAMAMOTO Takashi


[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup

2008-05-22 Thread YAMAMOTO Takashi
 @@ -1892,3 +1892,36 @@ int valid_swaphandles(swp_entry_t entry, unsigned long 
 *offset)
   *offset = ++toff;
   return nr_pages? ++nr_pages: 0;
  }
 +
 +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
 +int swap_cgroup_vm_swap_full(struct page *page)
 +{
 + int ret;
 + struct swap_info_struct *p;
 + struct mem_cgroup *mem;
 + u64 usage;
 + u64 limit;
 + swp_entry_t entry;
 +
 + VM_BUG_ON(!PageLocked(page));
 + VM_BUG_ON(!PageSwapCache(page));
 +
 + ret = 0;
 + entry.val = page_private(page);
 + p = swap_info_get(entry);
 + if (!p)
 + goto out;
 +
 + mem = p->memcg[swp_offset(entry)];
 + usage = swap_cgroup_read_usage(mem) / PAGE_SIZE;
 + limit = swap_cgroup_read_limit(mem) / PAGE_SIZE;
 + limit = (limit < total_swap_pages) ? limit : total_swap_pages;
 +
 + ret = usage * 2 > limit;
 +
 + spin_unlock(&swap_lock);
 +
 +out:
 + return ret;
 +}
 +#endif

shouldn't it check the global usage (nr_swap_pages) as well?

YAMAMOTO Takashi


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-05-21 Thread YAMAMOTO Takashi
 BTW, I'm just trying to make my swapcontroller patch
 that is rebased on recent kernel and implemented
 as part of memory controller.
 I'm going to submit it by the middle of May.

what's the status of this?

YAMAMOTO Takashi


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-05-18 Thread YAMAMOTO Takashi
  deterministic in the sense that, even when two or more processes
  from different cgroups are sharing a page, both of them, rather than
  only unlucky one, are always charged.
  
 
 I'm not sure whether this behavior itself is good or bad,
 but I think it's not good idea to make memory controller,
 which charges only one process for a shared page,
 and swap controller behave differently.
 I think it will be confusing for users. At least,
 I would feel it strange.

i agree that yours can be better integrated with the memory controller. 

unlike yours, mine was designed to be independent from
the memory controller as far as possible.
(i don't want to complicate the memory controller.)

YAMAMOTO Takashi


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-05-15 Thread YAMAMOTO Takashi
 On Tue, May 13, 2008 at 8:21 PM, YAMAMOTO Takashi
 [EMAIL PROTECTED] wrote:
   
Could you please mention what the limitations are? We could get those 
  fixed or
take another serious look at the mm-owner patches.
 
   for example, its callback can't sleep.
 
 
 You need to be able to sleep in order to take mmap_sem, right?

yes.
besides that, i prefer not to hold a spinlock when traversing PTEs
as it can take somewhat long.

 Of course, having lots of datapath operations also take cgroup_mutex
 would be really painful, so it's not practical to use for things that
 can become non-attachable due to a process consuming some resources.
 This is part of the reason that I started working on the lock-mode
 patches that I sent out yesterday, in order to make finer-grained
 locking simpler. I'm going to rework those to make the locking more
 explicit, and I'll bear this use case in mind while I'm doing it.

thanks.

 A few comments on the patch:
 
 - you're not really limiting swap usage, you're limiting swapped-out
 address space. So it looks as though if a process has swapped out most
 of its address space, and forks a child, the total swap charge for
 the cgroup will double. Is that correct?

yes.

 If so, why is this better
 than charging for actual swap usage?

its behaviour is more deterministic and it uses less memory.
(than nishimura-san's one, which charges for actual swap usage.)

 - what will happen if someone creates non-NPTL threads, which share an
 mm but not a thread group (so each of them is a thread group leader)?

a thread which is most recently assigned to a cgroup will win.

 - if you were to store a pointer in the page rather than the

a pointer?  a pointer to what?

 swap_cgroup pointer, then (in combination with mm-owner) you wouldn't
 need to do the rebinding to the new swap_cgroup when a process moves
 to a different cgroup - you could instead keep a swapped pte count
 in the mm, and just charge that to the new cgroup and uncharge it from
 the old cgroup. You also wouldn't need to keep ref counts on the
 swap_cgroup.

PTE walking in my patch is for locking, not for rebinding.
ie. to deal with concurrent swap activities.
the fact that each page table page has its own lock (pte_lockptr)
complicated it.

 - ideally this wouldn't actually start charging until it was bound on
 to a cgroups hierarchy, although I guess that the performance of this
 is less important than something like the virtual address space
 controller, since once we start swapping we can expect performance to
 be bad anyway.

i agree.

YAMAMOTO Takashi


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-05-15 Thread YAMAMOTO Takashi
If so, why is this better
than charging for actual swap usage?
 
    its behaviour is more deterministic and it uses less memory.
   (than nishimura-san's one, which charges for actual swap usage.)
 
 
 Using less memory is good, but maybe not worth it if the result isn't so 
 useful.
 
 I'd say that it's less deterministic than nishimura-san's controller -
 with his you just need to know how much swap is in use (which you can
 tell by observing the app on a real system) but with yours you also
 have to know whether there are any processes sharing anon pages (but
 not mms).

deterministic in the sense that, even when two or more processes
from different cgroups are sharing a page, both of them, rather than
only unlucky one, are always charged.

another related advantage is that it's possible to move charges
quite precisely when moving a task among cgroups.

 Now it's true that if all the apps you need to run do an execve()
 after forking, then the number of swap ptes really does track the
 amount of swap space in use pretty accurately, since there's not going
 to be any sharing of anon memory between mms. And it might be that
 people decide that the reduced memory overhead justifies this
 limitation. But I think it should be made explicit in the patch
 description and documentation that this controller achieves its
 reduced overhead at the cost of giving (IMO) bogus results on a rather
 ancient but still perfectly legitimate class of Unix application. (The
 apache httpd server used to work this way, for instance. It may still
 but I've not looked at it in a while).

fair enough.

- what will happen if someone creates non-NPTL threads, which share an
mm but not a thread group (so each of them is a thread group leader)?
 
   a thread which is most recently assigned to a cgroup will win.
 
 
 Doesn't that risk triggering the BUG_ON(mm->swap_cgroup != oldscg) in
 swap_cgroup_attach() ?

which version of the patch you are looking at?
the following is the latest copy.

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.25-rc8-mm2/mm/swapcontrol.c.BACKUP2008-04-21 
12:06:44.0 +0900
+++ linux-2.6.25-rc8-mm2/mm/swapcontrol.c   2008-05-15 13:48:00.0 
+0900
@@ -0,0 +1,570 @@
+
+/*
+ * swapcontrol.c COPYRIGHT FUJITSU LIMITED 2008
+ *
+ * Author: [EMAIL PROTECTED]
+ */
+
+/*
+ * there are two types of swap users.
+ * this controller handles both of them.
+ *
+ * - anonymous pages
+ * eg. user stacks, MAP_PRIVATE pages
+ *
+ * we track the number of PTEs with swap entries.
+ * it's precise wrt moving tasks between cgroups.
+ *
+ * - anonymous objects (aka shmem)
+ * eg. tmpfs, sysvshm, MAP_SHARED anonymous mapping
+ *
+ * anonymous objects are associated to a cgroup when they are created
+ * and the number of on-disk swap slots used by them are counted
+ * for the cgroup.  the association is persistent except cgroup removal,
+ * in which case, its associated objects are moved to init_mm's cgroup.
+ */
+
+#include <linux/err.h>
+#include <linux/cgroup.h>
+#include <linux/res_counter.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/swapcontrol.h>
+#include <linux/swapops.h>
+#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
+
+struct swap_cgroup {
+   struct cgroup_subsys_state scg_css;
+   struct res_counter scg_counter;
+   struct list_head scg_shmem_list;
+};
+
+static struct cgroup_subsys_state *
+task_to_css(struct task_struct *task)
+{
+
+   return task_subsys_state(task, swap_cgroup_subsys_id);
+}
+
+static struct swap_cgroup *
+css_to_scg(struct cgroup_subsys_state *css)
+{
+
+   return container_of(css, struct swap_cgroup, scg_css);
+}
+
+static struct cgroup_subsys_state *
+cg_to_css(struct cgroup *cg)
+{
+
+   return cgroup_subsys_state(cg, swap_cgroup_subsys_id);
+}
+
+static struct swap_cgroup *
+cg_to_scg(struct cgroup *cg)
+{
+
+   return css_to_scg(cg_to_css(cg));
+}
+
+/*  */
+/*
+ * anonymous pages
+ */
+
+/*
+ * lazily initialize ptp-ptp_swap_cgroup.
+ *
+ * called with page table locked.
+ */
+static struct swap_cgroup *
+swap_cgroup_prepare_ptp(struct page *ptp, struct mm_struct *mm)
+{
+   struct swap_cgroup *scg = ptp->ptp_swap_cgroup;
+
+   BUG_ON(mm == NULL);
+   BUG_ON(mm->swap_cgroup == NULL);
+
+   /*
+* scg == NULL here means that it's the first time we access
+* this PTP.  in that case, initialize ptp->ptp_swap_cgroup
+* from mm->swap_cgroup.
+*/
+   if (scg == NULL) {
+   /*
+* see swap_cgroup_attach.
+*/
+   smp_rmb();
+   scg = mm->swap_cgroup;
+   BUG_ON(scg == NULL);
+   ptp->ptp_swap_cgroup = scg;
+   }
+
+   return scg;
+}
+
+/*
+ * called with page table locked.
+ */
+int
+swap_cgroup_charge(struct page *ptp

[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-05-13 Thread YAMAMOTO Takashi
  + BUG_ON(mm == NULL);
  + BUG_ON(mm->swap_cgroup == NULL);
  + if (scg == NULL) {
  + /*
  +  * see swap_cgroup_attach.
  +  */
  + smp_rmb();
  + scg = mm->swap_cgroup;
  With the mm-owner patches, we need not maintain a separate
   mm->swap_cgroup.
  
  i don't think the mm-owner patch, at least with the current form,
  can replace it.
  
 
 Could you please mention what the limitations are? We could get those fixed or
 take another serious look at the mm-owner patches.

for example, its callback can't sleep.

  + /*
  +  * swap_cgroup_attach is in progress.
  +  */
  +
  + res_counter_charge_force(&newscg->scg_counter, PAGE_CACHE_SIZE);
  So, we force the counter to go over limit?
  
  yes.
  
  as newscg != oldscg here means the task is being moved between cgroups,
  this instance of res_counter_charge_force should not matter much.
  
 
 Isn't it bad to force a group to go over its limit due to migration?

we don't have many choices as long as ->attach can't fail.
although we can have racy checks in ->can_attach, i'm not happy with it.

  We need to write actual numbers here? Can't we keep the write
  interface consistent with the memory controller?
  
  i'm not sure what you mean here.  can you explain a bit more?
  do you mean K, M, etc?
  
 
 Yes, I mean the same format that memparse() uses.

i'll take a look.

  Is moving to init_mm (root
  cgroup) a good idea? Ideally with support for hierarchies, if we ever
  do move things, it should be to the parent cgroup.
  
  i chose init_mm because there seemed to be no consensus about
  cgroup hierarchy semantics.
  
 
 I would suggest that we fail deletion of a group for now. I have a set of
 patches for the cgroup hierarchy semantics. I think the parent is the best 
 place
 to move it.

ok.

  + info->swap_cgroup = newscg;
  + res_counter_uncharge(&oldscg->scg_counter, bytes);
  + res_counter_charge_force(&newscg->scg_counter, bytes);
  I don't like the excessive use of res_counter_charge_force(), it seems
  like we might end up bypassing the controller all together. I would
  rather fail the destroy operation if the charge fails.
  
  + bytes = swslots * PAGE_CACHE_SIZE;
  + res_counter_uncharge(&oldscg->scg_counter, bytes);
  + /*
  +  * XXX ignore newscg's limit because the cgroup ->attach method can't fail.
  +  */
  + res_counter_charge_force(&newscg->scg_counter, bytes);
  But in the future, we could plan on making attach fail (I have a
  requirement for it). Again, I don't like the _force operation
  
  allowing these operations fail implies to have code to back out
  partial operations.  i'm afraid that it will be too complex.
  
 
 OK, we need to find out a way to fix that then.

note that a failure can affect other subsystems which belong to
the same hierarchy as well, and, even worse, a back-out attempt can also fail.
i'm afraid that we need to play some kind of transaction-commit game,
which can make subsystems too complex to implement properly.

YAMAMOTO Takashi


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-05-06 Thread YAMAMOTO Takashi
hi,

 Hi, Thanks for the patches and your patience. I've just applied your
 patches on top of 2.6.25-mm1 (it had a minor reject, that I've fixed).
 I am building and testing the patches along with KAMEZAWA-San's low
 overhead patches.

thanks.

  +#include <linux/err.h>
  +#include <linux/cgroup.h>
  +#include <linux/hugetlb.h>
 
 My powerpc build fails, we need to move hugetlb.h down to the bottom

what's the error message?

  +struct swap_cgroup {
  +   struct cgroup_subsys_state scg_css;
 
 Can't we call this just css. Since the structure is swap_cgroup it
 already has the required namespace required to distinguish it from
 other css's. Please see page 4 of The practice of programming, be
 consistent. The same comment applies to all members below.

i don't have the book.
i like this kind of prefixes as it's grep-friendly.

  +#define task_to_css(task) task_subsys_state((task), swap_cgroup_subsys_id)
  +#define css_to_scg(css) container_of((css), struct swap_cgroup, scg_css)
  +#define cg_to_css(cg)   cgroup_subsys_state((cg), swap_cgroup_subsys_id)
  +#define cg_to_scg(cg)   css_to_scg(cg_to_css(cg))
 
 Aren't static inline better than macros? I would suggest moving to
 them.

sounds like a matter of preference.
either ok for me.

  +static struct swap_cgroup *
  +swap_cgroup_prepare_ptp(struct page *ptp, struct mm_struct *mm)
  +{
  +   struct swap_cgroup *scg = ptp->ptp_swap_cgroup;
  +
 
 Is this routine safe w.r.t. concurrent operations, modifications to
 ptp_swap_cgroup?

it's always accessed with the page table locked.

  +   BUG_ON(mm == NULL);
  +   BUG_ON(mm->swap_cgroup == NULL);
  +   if (scg == NULL) {
  +   /*
  +* see swap_cgroup_attach.
  +*/
  +   smp_rmb();
  +   scg = mm->swap_cgroup;
 
 With the mm-owner patches, we need not maintain a separate
  mm->swap_cgroup.

i don't think the mm-owner patch, at least with the current form,
can replace it.

  +   /*
  +* swap_cgroup_attach is in progress.
  +*/
  +
  +   res_counter_charge_force(&newscg->scg_counter, PAGE_CACHE_SIZE);
 
 So, we force the counter to go over limit?

yes.

as newscg != oldscg here means the task is being moved between cgroups,
this instance of res_counter_charge_force should not matter much.

  +static int
  +swap_cgroup_write_u64(struct cgroup *cg, struct cftype *cft, u64 val)
  +{
  +   struct res_counter *counter = &cg_to_scg(cg)->scg_counter;
  +   unsigned long flags;
  +
  +   /* XXX res_counter_write_u64 */
  +   BUG_ON(cft->private != RES_LIMIT);
  +   spin_lock_irqsave(&counter->lock, flags);
  +   counter->limit = val;
  +   spin_unlock_irqrestore(&counter->lock, flags);
  +   return 0;
  +}
  +
 
 We need to write actual numbers here? Can't we keep the write
 interface consistent with the memory controller?

i'm not sure what you mean here.  can you explain a bit more?
do you mean K, M, etc?

  +static void
  +swap_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cg)
  +{
  +   struct swap_cgroup *oldscg = cg_to_scg(cg);
  +   struct swap_cgroup *newscg;
  +   struct list_head *pos;
  +   struct list_head *next;
  +
  +   /*
  +* move our anonymous objects to init_mm's group.
  +*/
 
 Is this good design, should be allow a swap_cgroup to be destroyed,
 even though pages are allocated to it?

first of all, i'm not quite happy with this design. :)
having said that, what else can we do?
i tend to think that trying to swap-in these pages is too much effort
for little benefit.

 Is moving to init_mm (root
 cgroup) a good idea? Ideally with support for hierarchies, if we ever
 do move things, it should be to the parent cgroup.

i chose init_mm because there seemed to be no consensus about
cgroup hierarchy semantics.

  +   info->swap_cgroup = newscg;
  +   res_counter_uncharge(&oldscg->scg_counter, bytes);
  +   res_counter_charge_force(&newscg->scg_counter, bytes);
 
 I don't like the excessive use of res_counter_charge_force(), it seems
 like we might end up bypassing the controller all together. I would
 rather fail the destroy operation if the charge fails.

  +   bytes = swslots * PAGE_CACHE_SIZE;
  +   res_counter_uncharge(&oldscg->scg_counter, bytes);
  +   /*
  +    * XXX ignore newscg's limit because cgroup ->attach method can't fail.
  +    */
  +   res_counter_charge_force(&newscg->scg_counter, bytes);
 
 But in the future, we could plan on making attach fail (I have a
 requirement for it). Again, I don't like the _force operation

allowing these operations to fail implies having code to back out
partial operations.  i'm afraid that it will be too complex.

  +static void
  +swap_cgroup_attach_mm(struct mm_struct *mm, struct swap_cgroup *oldscg,
  +struct swap_cgroup *newscg)
 
 We need comments about the function, why do we attach an mm?

instead of a task, you mean?
because we count the number of ptes which point to swap
and ptes belong to an mm, not a task.

YAMAMOTO Takashi

[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-04-29 Thread YAMAMOTO Takashi
   - anonymous objects (shmem) are not accounted.
  IMHO, shmem should be accounted.
  I agree it's difficult in your implementation,
  but are you going to support it?
 
 it should be trivial to track how much swap an anonymous object is using.
 i'm not sure how it should be associated with cgroups, tho.
 
 YAMAMOTO Takashi

i implemented shmem swap accounting.  see below.

YAMAMOTO Takashi


the following is another swap controller, which was designed and
implemented independently from nishimura-san's one.

some random differences from nishimura-san's one:
- counts and limits the number of ptes with swap entries instead of
  on-disk swap slots.  (except swap slots used by anonymous objects,
  which are counted differently.  see below.)
- no swapon-time memory allocation.
- precise wrt moving tasks between cgroups.
- [NEW] anonymous objects (shmem) are associated to a cgroup when they are
  created and the number of on-disk swap slots used by them is counted for
  the cgroup.  the association is persistent except cgroup removal,
  in which case, its associated objects are moved to init_mm's cgroup.


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.25-rc8-mm2/mm/swapcontrol.c.BACKUP2008-04-21 
12:06:44.0 +0900
+++ linux-2.6.25-rc8-mm2/mm/swapcontrol.c   2008-04-28 14:16:03.0 
+0900
@@ -0,0 +1,447 @@
+
+/*
+ * swapcontrol.c COPYRIGHT FUJITSU LIMITED 2008
+ *
+ * Author: [EMAIL PROTECTED]
+ */
+
+#include linux/err.h
+#include linux/cgroup.h
+#include linux/hugetlb.h
+#include linux/res_counter.h
+#include linux/pagemap.h
+#include linux/slab.h
+#include linux/swap.h
+#include linux/swapcontrol.h
+#include linux/swapops.h
+#include linux/shmem_fs.h
+
+struct swap_cgroup {
+   struct cgroup_subsys_state scg_css;
+   struct res_counter scg_counter;
+   struct list_head scg_shmem_list;
+};
+
+#definetask_to_css(task) task_subsys_state((task), 
swap_cgroup_subsys_id)
+#definecss_to_scg(css) container_of((css), struct swap_cgroup, scg_css)
+#definecg_to_css(cg)   cgroup_subsys_state((cg), swap_cgroup_subsys_id)
+#definecg_to_scg(cg)   css_to_scg(cg_to_css(cg))
+
+static DEFINE_MUTEX(swap_cgroup_shmem_lock);
+
+static struct swap_cgroup *
+swap_cgroup_prepare_ptp(struct page *ptp, struct mm_struct *mm)
+{
+   struct swap_cgroup *scg = ptp-ptp_swap_cgroup;
+
+   BUG_ON(mm == NULL);
+   BUG_ON(mm-swap_cgroup == NULL);
+   if (scg == NULL) {
+   /*
+* see swap_cgroup_attach.
+*/
+   smp_rmb();
+   scg = mm-swap_cgroup;
+   BUG_ON(scg == NULL);
+   ptp-ptp_swap_cgroup = scg;
+   }
+
+   return scg;
+}
+
+/*
+ * called with page table locked.
+ */
+int
+swap_cgroup_charge(struct page *ptp, struct mm_struct *mm)
+{
+   struct swap_cgroup * const scg = swap_cgroup_prepare_ptp(ptp, mm);
+
+   return res_counter_charge(scg-scg_counter, PAGE_CACHE_SIZE);
+}
+
+/*
+ * called with page table locked.
+ */
+void
+swap_cgroup_uncharge(struct page *ptp)
+{
+   struct swap_cgroup * const scg = ptp-ptp_swap_cgroup;
+
+   BUG_ON(scg == NULL);
+   res_counter_uncharge(scg-scg_counter, PAGE_CACHE_SIZE);
+}
+
+/*
+ * called with both page tables locked.
+ */
+void
+swap_cgroup_remap_charge(struct page *oldptp, struct page *newptp,
+struct mm_struct *mm)
+{
+   struct swap_cgroup * const oldscg = oldptp-ptp_swap_cgroup;
+   struct swap_cgroup * const newscg = swap_cgroup_prepare_ptp(newptp, mm);
+
+   BUG_ON(oldscg == NULL);
+   BUG_ON(newscg == NULL);
+   if (oldscg == newscg) {
+   return;
+   }
+
+   /*
+* swap_cgroup_attach is in progress.
+*/
+
+   res_counter_charge_force(newscg-scg_counter, PAGE_CACHE_SIZE);
+   res_counter_uncharge(oldscg-scg_counter, PAGE_CACHE_SIZE);
+}
+
+void
+swap_cgroup_init_mm(struct mm_struct *mm, struct task_struct *t)
+{
+   struct swap_cgroup *scg;
+   struct cgroup *cg;
+
+   /* mm-swap_cgroup is not NULL in the case of dup_mm */
+   cg = task_cgroup(t, swap_cgroup_subsys_id);
+   BUG_ON(cg == NULL);
+   scg = cg_to_scg(cg);
+   BUG_ON(scg == NULL);
+   css_get(scg-scg_css);
+   mm-swap_cgroup = scg;
+}
+
+void
+swap_cgroup_exit_mm(struct mm_struct *mm)
+{
+   struct swap_cgroup * const scg = mm-swap_cgroup;
+
+   /*
+* note that swap_cgroup_attach does get_task_mm which
+* prevents us from being called.  we can assume mm-swap_cgroup
+* is stable.
+*/
+
+   BUG_ON(scg == NULL);
+   mm-swap_cgroup = NULL; /* it isn't necessary.  just being tidy. */
+   css_put(scg-scg_css);
+}
+
+static u64
+swap_cgroup_read_u64(struct cgroup *cg, struct cftype *cft)
+{
+
+   return res_counter_read_u64(cg_to_scg(cg)-scg_counter, cft-private);
+}
+
+static int
+swap_cgroup_write_u64(struct cgroup *cg, struct cftype *cft, u64 val

[Devel] Re: [RFC][-mm] Memory controller hierarchy support (v1)

2008-04-19 Thread YAMAMOTO Takashi
 -int res_counter_charge(struct res_counter *counter, unsigned long val)
 +int res_counter_charge(struct res_counter *counter, unsigned long val,
 +		struct res_counter **limit_exceeded_at)
  {
 	int ret;
 	unsigned long flags;
 +	struct res_counter *c, *unroll_c;
 
 -	spin_lock_irqsave(&counter->lock, flags);
 -	ret = res_counter_charge_locked(counter, val);
 -	spin_unlock_irqrestore(&counter->lock, flags);
 +	*limit_exceeded_at = NULL;
 +	local_irq_save(flags);
 +	for (c = counter; c != NULL; c = c->parent) {
 +		spin_lock(&c->lock);
 +		ret = res_counter_charge_locked(c, val);
 +		spin_unlock(&c->lock);
 +		if (ret < 0) {
 +			*limit_exceeded_at = c;
 +			goto unroll;
 +		}
 +	}
 +	local_irq_restore(flags);
 +	return 0;
 +
 +unroll:
 +	for (unroll_c = counter; unroll_c != c; unroll_c = unroll_c->parent) {
 +		spin_lock(&unroll_c->lock);
 +		res_counter_uncharge_locked(unroll_c, val);
 +		spin_unlock(&unroll_c->lock);
 +	}
 +	local_irq_restore(flags);
 	return ret;
  }

i wonder how much performance impact this involves.

it increases the number of atomic ops per charge/uncharge and
makes the common case (success) of every charge/uncharge in a system
touch global (ie. the root cgroup's) cachelines.
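
one way to keep the common case off the shared cachelines is per-cpu
batching, similar in spirit to the "borrow resource" patch discussed
elsewhere in this archive.  a rough sketch of the idea (names are
illustrative and it uses the existing two-argument res_counter_charge):
==
#define BORROW_UNIT	(32 * PAGE_SIZE)	/* arbitrary batch size */

static DEFINE_PER_CPU(unsigned long, borrowed);

/* charge one page, refilling a per-cpu stock from the shared counter
 * only occasionally so that most charges stay cpu-local. */
static int charge_one_page(struct res_counter *cnt)
{
	unsigned long *bw = &get_cpu_var(borrowed);
	int ret = 0;

	if (*bw < PAGE_SIZE) {
		ret = res_counter_charge(cnt, BORROW_UNIT);
		if (!ret)
			*bw += BORROW_UNIT;
	}
	if (!ret)
		*bw -= PAGE_SIZE;
	put_cpu_var(borrowed);
	return ret;
}
==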

 + /*
 +  * Ideally we need to hold cgroup_mutex here
 +  */
 + list_for_each_entry_safe_from(cgroup, cgrp,
 + &curr_cgroup->children, sibling) {
 + struct mem_cgroup *mem_child;
 +
 + mem_child = mem_cgroup_from_cont(cgroup);
 + ret = try_to_free_mem_cgroup_pages(mem_child,
 + gfp_mask);
 + mem->last_scanned_child = mem_child;
 + if (ret == 0)
 + break;
 + }

if i read it correctly, it makes us hit the last child again and again.

i think you want to reclaim from all cgroups under the curr_cgroup
including eg. children's children.
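
for illustration, visiting the whole subtree could be done by simple
recursion over cgroup->children; this sketch follows the ret convention
of the hunk above and ignores locking against concurrent rmdir:
==
static int mem_cgroup_reclaim_subtree(struct cgroup *cg, gfp_t gfp_mask)
{
	struct cgroup *child;
	int ret;

	/* reclaim from this cgroup first */
	ret = try_to_free_mem_cgroup_pages(mem_cgroup_from_cont(cg), gfp_mask);
	if (ret == 0)
		return 0;

	/* then from every descendant, not just the direct children */
	list_for_each_entry(child, &cg->children, sibling) {
		ret = mem_cgroup_reclaim_subtree(child, gfp_mask);
		if (ret == 0)
			break;
	}
	return ret;
}
==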

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-04-10 Thread YAMAMOTO Takashi
the following is a new version of the patch.
changes from the previous:
- fix a BUG_ON in swap_cgroup_attach and add a comment about it.

YAMAMOTO Takashi

Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.25-rc3-mm1/init/Kconfig.BACKUP2008-03-05 15:45:50.0 
+0900
+++ linux-2.6.25-rc3-mm1/init/Kconfig   2008-03-12 11:52:48.0 +0900
@@ -379,6 +379,12 @@ config CGROUP_MEM_RES_CTLR
  Only enable when you're ok with these trade offs and really
  sure you need the memory resource controller.
 
+config CGROUP_SWAP_RES_CTLR
+   bool Swap Resource Controller for Control Groups
+   depends on CGROUPS  RESOURCE_COUNTERS
+   help
+ XXX TBD
+
 config SYSFS_DEPRECATED
bool Create deprecated sysfs files
depends on SYSFS
--- linux-2.6.25-rc3-mm1/mm/swapfile.c.BACKUP   2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/swapfile.c  2008-03-14 17:25:40.0 +0900
@@ -28,6 +28,7 @@
 #include linux/capability.h
 #include linux/syscalls.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/pgtable.h
 #include asm/tlbflush.h
@@ -526,6 +527,7 @@ static int unuse_pte(struct vm_area_stru
}
 
inc_mm_counter(vma-vm_mm, anon_rss);
+   swap_cgroup_uncharge(pmd_page(*pmd));
get_page(page);
set_pte_at(vma-vm_mm, addr, pte,
   pte_mkold(mk_pte(page, vma-vm_page_prot)));
--- linux-2.6.25-rc3-mm1/mm/Makefile.BACKUP 2008-03-05 15:45:51.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/Makefile2008-03-12 11:53:31.0 +0900
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_SWAP_RES_CTLR) += swapcontrol.o
 
--- linux-2.6.25-rc3-mm1/mm/rmap.c.BACKUP   2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/rmap.c  2008-03-17 07:45:16.0 +0900
@@ -49,6 +49,7 @@
 #include linux/module.h
 #include linux/kallsyms.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/tlbflush.h
 
@@ -237,8 +238,9 @@ unsigned long page_address_in_vma(struct
  *
  * On success returns with pte mapped and locked.
  */
-pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address, spinlock_t **ptlp)
+pte_t *page_check_address1(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp,
+ struct page **ptpp)
 {
pgd_t *pgd;
pud_t *pud;
@@ -269,12 +271,21 @@ pte_t *page_check_address(struct page *p
spin_lock(ptl);
if (pte_present(*pte)  page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
+   if (ptpp != NULL) {
+   *ptpp = pmd_page(*(pmd));
+   }
return pte;
}
pte_unmap_unlock(pte, ptl);
return NULL;
 }
 
+pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp)
+{
+   return page_check_address1(page, mm, address, ptlp, NULL);
+}
+
 /*
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
@@ -710,13 +721,14 @@ static int try_to_unmap_one(struct page 
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
+   struct page *ptp;
int ret = SWAP_AGAIN;
 
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
 
-   pte = page_check_address(page, mm, address, ptl);
+   pte = page_check_address1(page, mm, address, ptl, ptp);
if (!pte)
goto out;
 
@@ -731,6 +743,12 @@ static int try_to_unmap_one(struct page 
goto out_unmap;
}
 
+   if (!migration  PageSwapCache(page)  swap_cgroup_charge(ptp, mm)) {
+   /* XXX should make the caller free the swap slot? */
+   ret = SWAP_FAIL;
+   goto out_unmap;
+   }
+
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
--- linux-2.6.25-rc3-mm1/mm/memory.c.BACKUP 2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/memory.c2008-03-14 18:54:21.0 +0900
@@ -51,6 +51,7 @@
 #include linux/init.h
 #include linux/writeback.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/pgalloc.h
 #include asm/uaccess.h
@@ -431,10 +432,10 @@ struct page *vm_normal_page(struct vm_ar
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-   unsigned long addr, int *rss

[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-04-07 Thread YAMAMOTO Takashi
 YAMAMOTO Takashi wrote:
  hi,
  
  i tried to reproduce the large swap cache issue, but no luck.
  can you provide a little more detailed instruction?
  
 This issue also happens on generic 2.6.25-rc3-mm1
 (with limiting only memory), so I think this issue is not
 related to your patch.
 I'm investigating this issue too.
 
 Below is my environment and how to reproduce.
 
 - System
   full virtualized xen guest based on RHEL5.1(x86_64).
 CPU: 2
 memory: 2GB
 swap: 1GB
   A config of the running kernel(2.6.25-rc3-mm1 with your patch)
   is attached.
 
 - how to reproduce
   - change swappiness to 100
 
 echo 100 > /proc/sys/vm/swappiness
 
   - mount cgroup fs
 
 # mount -t cgroup -o memory,swap none /cgroup
 
   - make cgroup for test
 
 # mkdir /cgroup/02
 # echo -n 64M > /cgroup/02/memory.limit_in_bytes
 # echo -n `expr 128 \* 1024 \* 1024` > /cgroup/02/swap.limit_in_bytes
 
   - run test
 
 # echo $$ > /cgroup/02/tasks
 # while true; do make clean; make -j2; done
 
   In other terminals, I run some monitoring processes, top,
   tail -f /var/log/messages, and displaying *.usage_in_bytes
   every seconds.
 
 
 Thanks,
 Daisuke Nishimura.

what i tried was essentially same.
for me, once vm_swap_full() got true, swap cache stopped growing as expected.

http://people.valinux.co.jp/~yamamoto/swap.png

it was taken by running
while :;do swapon -s|tail -1;sleep 1;done > foo
in an unlimited cgroup, and then plotted by gnuplot.
plot "foo" u 4

as my system has 1GB swap configured, the vm_swap_full() threshold is
around 500MB.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-03-27 Thread YAMAMOTO Takashi
hi,

i tried to reproduce the large swap cache issue, but no luck.
can you provide a little more detailed instruction?

the following is a new version of the patch.
changes from the previous:
- fix a bug; update accounting for mremap properly.
- some comments and assertions.

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.25-rc3-mm1/init/Kconfig.BACKUP2008-03-05 15:45:50.0 
+0900
+++ linux-2.6.25-rc3-mm1/init/Kconfig   2008-03-12 11:52:48.0 +0900
@@ -379,6 +379,12 @@ config CGROUP_MEM_RES_CTLR
  Only enable when you're ok with these trade offs and really
  sure you need the memory resource controller.
 
+config CGROUP_SWAP_RES_CTLR
+   bool Swap Resource Controller for Control Groups
+   depends on CGROUPS  RESOURCE_COUNTERS
+   help
+ XXX TBD
+
 config SYSFS_DEPRECATED
bool Create deprecated sysfs files
depends on SYSFS
--- linux-2.6.25-rc3-mm1/mm/swapfile.c.BACKUP   2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/swapfile.c  2008-03-14 17:25:40.0 +0900
@@ -28,6 +28,7 @@
 #include linux/capability.h
 #include linux/syscalls.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/pgtable.h
 #include asm/tlbflush.h
@@ -526,6 +527,7 @@ static int unuse_pte(struct vm_area_stru
}
 
inc_mm_counter(vma-vm_mm, anon_rss);
+   swap_cgroup_uncharge(pmd_page(*pmd));
get_page(page);
set_pte_at(vma-vm_mm, addr, pte,
   pte_mkold(mk_pte(page, vma-vm_page_prot)));
--- linux-2.6.25-rc3-mm1/mm/Makefile.BACKUP 2008-03-05 15:45:51.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/Makefile2008-03-12 11:53:31.0 +0900
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_SWAP_RES_CTLR) += swapcontrol.o
 
--- linux-2.6.25-rc3-mm1/mm/rmap.c.BACKUP   2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/rmap.c  2008-03-17 07:45:16.0 +0900
@@ -49,6 +49,7 @@
 #include linux/module.h
 #include linux/kallsyms.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/tlbflush.h
 
@@ -237,8 +238,9 @@ unsigned long page_address_in_vma(struct
  *
  * On success returns with pte mapped and locked.
  */
-pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address, spinlock_t **ptlp)
+pte_t *page_check_address1(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp,
+ struct page **ptpp)
 {
pgd_t *pgd;
pud_t *pud;
@@ -269,12 +271,21 @@ pte_t *page_check_address(struct page *p
spin_lock(ptl);
if (pte_present(*pte)  page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
+   if (ptpp != NULL) {
+   *ptpp = pmd_page(*(pmd));
+   }
return pte;
}
pte_unmap_unlock(pte, ptl);
return NULL;
 }
 
+pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp)
+{
+   return page_check_address1(page, mm, address, ptlp, NULL);
+}
+
 /*
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
@@ -710,13 +721,14 @@ static int try_to_unmap_one(struct page 
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
+   struct page *ptp;
int ret = SWAP_AGAIN;
 
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
 
-   pte = page_check_address(page, mm, address, ptl);
+   pte = page_check_address1(page, mm, address, ptl, ptp);
if (!pte)
goto out;
 
@@ -731,6 +743,12 @@ static int try_to_unmap_one(struct page 
goto out_unmap;
}
 
+   if (!migration  PageSwapCache(page)  swap_cgroup_charge(ptp, mm)) {
+   /* XXX should make the caller free the swap slot? */
+   ret = SWAP_FAIL;
+   goto out_unmap;
+   }
+
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
--- linux-2.6.25-rc3-mm1/mm/memory.c.BACKUP 2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/memory.c2008-03-14 18:54:21.0 +0900
@@ -51,6 +51,7 @@
 #include linux/init.h
 #include linux/writeback.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/pgalloc.h
 #include asm/uaccess.h
@@ -431,10 +432,10 @@ struct page *vm_normal_page(struct vm_ar
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm

[Devel] [PATCH] fix spurious EBUSY on memory cgroup removal

2008-03-24 Thread YAMAMOTO Takashi
[ resending with To: akpm.  Andrew, can you include this in -mm tree? ]

hi,

the following patch is to fix spurious EBUSY on cgroup removal.

YAMAMOTO Takashi


call mm_free_cgroup earlier.
otherwise a reference due to lazy mm switching can prevent cgroup removal.

Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
Acked-by: Balbir Singh [EMAIL PROTECTED]
---

--- linux-2.6.24-rc8-mm1/kernel/fork.c.BACKUP   2008-01-23 14:43:29.0 
+0900
+++ linux-2.6.24-rc8-mm1/kernel/fork.c  2008-01-31 17:26:31.0 +0900
@@ -393,7 +393,6 @@ void __mmdrop(struct mm_struct *mm)
 {
BUG_ON(mm == init_mm);
mm_free_pgd(mm);
-   mm_free_cgroup(mm);
destroy_context(mm);
free_mm(mm);
 }
@@ -415,6 +414,7 @@ void mmput(struct mm_struct *mm)
spin_unlock(mmlist_lock);
}
put_swap_token(mm);
+   mm_free_cgroup(mm);
mmdrop(mm);
}
 }
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH] another swap controller for cgroup

2008-03-17 Thread YAMAMOTO Takashi
  - anonymous objects (shmem) are not accounted.
 IMHO, shmem should be accounted.
 I agree it's difficult in your implementation,
 but are you going to support it?

it should be trivial to track how much swap an anonymous object is using.
i'm not sure how it should be associated with cgroups, tho.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH] another swap controller for cgroup

2008-03-16 Thread YAMAMOTO Takashi
hi,

the following is another swap controller, which was designed and
implemented independently from nishimura-san's one.

some random differences from nishimura-san's one:
- counts and limits the number of ptes with swap entries instead of
  on-disk swap slots.
- no swapon-time memory allocation.
- anonymous objects (shmem) are not accounted.
- precise wrt moving tasks between cgroups. 

this patch contains some unrelated small fixes which i've posted separately:
- exe_file fput botch fix
- cgroup_rmdir EBUSY fix

any comments?

YAMAMOTO Takashi


--- linux-2.6.25-rc3-mm1/init/Kconfig.BACKUP2008-03-05 15:45:50.0 
+0900
+++ linux-2.6.25-rc3-mm1/init/Kconfig   2008-03-12 11:52:48.0 +0900
@@ -379,6 +379,12 @@ config CGROUP_MEM_RES_CTLR
  Only enable when you're ok with these trade offs and really
  sure you need the memory resource controller.
 
+config CGROUP_SWAP_RES_CTLR
+   bool Swap Resource Controller for Control Groups
+   depends on CGROUPS  RESOURCE_COUNTERS
+   help
+ XXX TBD
+
 config SYSFS_DEPRECATED
bool Create deprecated sysfs files
depends on SYSFS
--- linux-2.6.25-rc3-mm1/mm/swapfile.c.BACKUP   2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/swapfile.c  2008-03-14 17:25:40.0 +0900
@@ -28,6 +28,7 @@
 #include linux/capability.h
 #include linux/syscalls.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/pgtable.h
 #include asm/tlbflush.h
@@ -526,6 +527,7 @@ static int unuse_pte(struct vm_area_stru
}
 
inc_mm_counter(vma-vm_mm, anon_rss);
+   swap_cgroup_uncharge(pmd_page(*pmd));
get_page(page);
set_pte_at(vma-vm_mm, addr, pte,
   pte_mkold(mk_pte(page, vma-vm_page_prot)));
--- linux-2.6.25-rc3-mm1/mm/Makefile.BACKUP 2008-03-05 15:45:51.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/Makefile2008-03-12 11:53:31.0 +0900
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_SWAP_RES_CTLR) += swapcontrol.o
 
--- linux-2.6.25-rc3-mm1/mm/rmap.c.BACKUP   2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/rmap.c  2008-03-17 07:45:16.0 +0900
@@ -49,6 +49,7 @@
 #include linux/module.h
 #include linux/kallsyms.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/tlbflush.h
 
@@ -237,8 +238,9 @@ unsigned long page_address_in_vma(struct
  *
  * On success returns with pte mapped and locked.
  */
-pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address, spinlock_t **ptlp)
+pte_t *page_check_address1(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp,
+ struct page **ptpp)
 {
pgd_t *pgd;
pud_t *pud;
@@ -269,12 +271,21 @@ pte_t *page_check_address(struct page *p
spin_lock(ptl);
if (pte_present(*pte)  page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
+   if (ptpp != NULL) {
+   *ptpp = pmd_page(*(pmd));
+   }
return pte;
}
pte_unmap_unlock(pte, ptl);
return NULL;
 }
 
+pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp)
+{
+   return page_check_address1(page, mm, address, ptlp, NULL);
+}
+
 /*
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
@@ -710,13 +721,14 @@ static int try_to_unmap_one(struct page 
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
+   struct page *ptp;
int ret = SWAP_AGAIN;
 
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
 
-   pte = page_check_address(page, mm, address, ptl);
+   pte = page_check_address1(page, mm, address, ptl, ptp);
if (!pte)
goto out;
 
@@ -731,6 +743,12 @@ static int try_to_unmap_one(struct page 
goto out_unmap;
}
 
+   if (!migration  PageSwapCache(page)  swap_cgroup_charge(ptp, mm)) {
+   /* XXX should make the caller free the swap slot? */
+   ret = SWAP_FAIL;
+   goto out_unmap;
+   }
+
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
--- linux-2.6.25-rc3-mm1/mm/memory.c.BACKUP 2008-03-05 15:45:52.0 
+0900
+++ linux-2.6.25-rc3-mm1/mm/memory.c2008-03-14 18:54:21.0 +0900
@@ -51,6 +51,7 @@
 #include linux/init.h
 #include linux/writeback.h
 #include linux/memcontrol.h
+#include linux/swapcontrol.h
 
 #include asm/pgalloc.h
 #include asm

[Devel] Re: [PATCH 2/2] Make res_counter hierarchical

2008-03-12 Thread YAMAMOTO Takashi
 @@ -36,10 +37,26 @@ int res_counter_charge(struct res_counter *counter, 
 unsigned long val)
  {
   int ret;
   unsigned long flags;
 + struct res_counter *c, *unroll_c;
 +
 + local_irq_save(flags);
 + for (c = counter; c != NULL; c = c-parent) {
 + spin_lock(c-lock);
 + ret = res_counter_charge_locked(c, val);
 + spin_unlock(c-lock);
 + if (ret  0)
 + goto unroll;
 + }
 + local_irq_restore(flags);
 + return 0;
  
 - spin_lock_irqsave(counter-lock, flags);
 - ret = res_counter_charge_locked(counter, val);
 - spin_unlock_irqrestore(counter-lock, flags);
 +unroll:
 + for (unroll_c = counter; unroll_c != c; unroll_c = unroll_c-parent) {
 + spin_lock(unroll_c-lock);
 + res_counter_uncharge_locked(unroll_c, val);
 + spin_unlock(unroll_c-lock);
 + }
 + local_irq_restore(flags);
   return ret;
  }

what prevents the topology (in particular, ->parent pointers) from
changing behind us?

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] Make res_counter hierarchical

2008-03-12 Thread YAMAMOTO Takashi
  @@ -36,10 +37,26 @@ int res_counter_charge(struct res_counter *counter, 
  unsigned long val)
   {
  int ret;
  unsigned long flags;
  +   struct res_counter *c, *unroll_c;
  +
  +   local_irq_save(flags);
  +   for (c = counter; c != NULL; c = c-parent) {
  +   spin_lock(c-lock);
  +   ret = res_counter_charge_locked(c, val);
  +   spin_unlock(c-lock);
  +   if (ret  0)
  +   goto unroll;
  +   }
  +   local_irq_restore(flags);
  +   return 0;
   
  -   spin_lock_irqsave(counter-lock, flags);
  -   ret = res_counter_charge_locked(counter, val);
  -   spin_unlock_irqrestore(counter-lock, flags);
  +unroll:
  +   for (unroll_c = counter; unroll_c != c; unroll_c = unroll_c-parent) {
  +   spin_lock(unroll_c-lock);
  +   res_counter_uncharge_locked(unroll_c, val);
  +   spin_unlock(unroll_c-lock);
  +   }
  +   local_irq_restore(flags);
  return ret;
   }
 
 what prevents the topology (in particular, ->parent pointers) from
 changing behind us?
 
 YAMAMOTO Takashi

to answer myself: cgroupfs rename doesn't allow topological changes
in the first place.

btw, i think you need to do the same for res_counter_limit_check_locked
as well.  i'm skeptical about doing this complicated stuff in the kernel,
esp. in this potentially performance-critical code.
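
for illustration, a hierarchical variant of the check would have to walk
the ->parent chain much like the charge path does; a sketch with a
hypothetical name:
==
static bool res_counter_check_under_limit_hierarchy(struct res_counter *cnt)
{
	struct res_counter *c;
	unsigned long flags;
	bool ret = true;

	local_irq_save(flags);
	for (c = cnt; c != NULL; c = c->parent) {
		spin_lock(&c->lock);
		if (c->usage >= c->limit)
			ret = false;
		spin_unlock(&c->lock);
		if (!ret)
			break;
	}
	local_irq_restore(flags);
	return ret;
}
==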

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] memory controller : background reclaim and avoid excessive locking [4/5] borrow resource

2008-02-17 Thread YAMAMOTO Takashi
 + /* try to charge */
 + ret = res_counter_charge(mem-res, mem-borrow_unit);
 + if (!ret) { /* success */
 + *bwp += (mem-borrow_unit - size);
 + goto out;
 + }
 + }
 + spin_lock(mem-res.lock);
 + ret = res_counter_charge_locked(mem-res, size);
 + spin_unlock(mem-res.lock);

although i don't know if it matters, this retrying of charge affects failcnt.

 +static void mem_cgroup_return_and_uncharge(struct mem_cgroup *mem, int size)
 +{
 + unsigned long flags;
 + int uncharge_size = 0;
 +
 + local_irq_save(flags);
 + if (mem-borrow_unit) {
 + int limit = mem-borrow_unit * 2;
 + int cpu;
 + s64 *bwp;
 + cpu = smp_processor_id();
 + bwp = mem-stat.cpustat[cpu].count[MEM_CGROUP_STAT_BORROW];
 + *bwp += size;
 + if (*bwp  limit) {
 + uncharge_size = *bwp - mem-borrow_unit;
 + *bwp = mem-borrow_unit;
 + }
 + } else
 + uncharge_size = size;
 +
 + if (uncharge_size) {
 + spin_lock(mem-res.lock);
 + res_counter_uncharge_locked(mem-res, size);

s/size/uncharge_size/

 @@ -1109,12 +1202,29 @@ static u64 mem_throttle_read(struct cgro
   return (u64)mem-throttle.limit;
  }
  
 +static int mem_bulkratio_write(struct cgroup *cont, struct cftype *cft, u64 
 val)
 +{
 + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 + int unit = val * PAGE_SIZE;
 + if (unit  (PAGE_SIZE  (MAX_ORDER/2)))
 + return -EINVAL;
 + mem-borrow_unit = unit;
 + return 0;
 +}

it seems unsafe with concurrent mem_cgroup_borrow_and_charge or
mem_cgroup_return_and_uncharge.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] memory controller : background reclaim and avoid excessive locking [5/5] lazy page_cgroup freeing

2008-02-17 Thread YAMAMOTO Takashi
 + /*
 +  * For lazy freeing (not GC)
 +  */
 + struct {
 + struct mem_cgroup_per_zone *mz;
 + int num;
 +#define GARBAGE_MAXSIZE  (16)
 + struct page_cgroup  *vec[GARBAGE_MAXSIZE];
 + } garbage[NR_CPUS];
  };

i think you want to dedicate cache lines.

 @@ -176,12 +185,14 @@ struct page_cgroup {
   struct list_head lru;   /* per cgroup LRU list */
   struct page *page;
   struct mem_cgroup *mem_cgroup;
 + struct mem_cgroup_per_zone *mz; /* now belongs to...*/

is this for performance?

 @@ -408,10 +427,12 @@ static void __mem_cgroup_move_lists(stru
   if (active) {
   MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
   pc-flags |= PAGE_CGROUP_FLAG_ACTIVE;
 + pc-mz = mz;
   list_move(pc-lru, mz-active_list);
   } else {
   MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
   pc-flags = ~PAGE_CGROUP_FLAG_ACTIVE;
 + pc-mz = mz;
   list_move(pc-lru, mz-inactive_list);
   }
  }

isn't pc->mz already assigned by __mem_cgroup_add_list?

 @@ -1050,11 +1114,15 @@ mem_cgroup_force_empty_list(struct mem_c
   if (list_empty(list))
   return;
  retry:
 + all_free_garbages(mem);
   count = FORCE_UNCHARGE_BATCH;
   spin_lock_irqsave(mz-lru_lock, flags);
  
   while (--count  !list_empty(list)) {
   pc = list_entry(list-prev, struct page_cgroup, lru);
 + /* If there are still garbage, exit and retry */
 + if (pc-flags  PAGE_CGROUP_FLAG_GARBAGE)
 + break;

i think mem_cgroup_isolate_pages needs a similar check.
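
(the suggested check would presumably mirror the force_empty one, e.g.
inside the lru scan loop of mem_cgroup_isolate_pages; sketch only, names
as in the patch above:)
==
	if (pc->flags & PAGE_CGROUP_FLAG_GARBAGE)
		continue;	/* skip entries queued for lazy freeing */
==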

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/3] memcgroup: fix typo in VM_BUG_ON()

2008-02-17 Thread YAMAMOTO Takashi
 No need for VM_BUG_ON(pc), since 'pc' is the list entry. This should
 be VM_BUG_ON(page).
 
 Signed-off-by: Li Zefan [EMAIL PROTECTED]
 Acked-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
 ---
  mm/memcontrol.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 6bded84..c2959ee 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -534,7 +534,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long 
 nr_to_scan,
   if (scan = nr_to_scan)
   break;
   page = pc-page;
 - VM_BUG_ON(!pc);
 + VM_BUG_ON(!page);

can't page be NULL here if mem_cgroup_uncharge clears pc->page behind us?
ie. bug.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH] fix spurious EBUSY on memory cgroup removal

2008-01-31 Thread YAMAMOTO Takashi
hi,

the following patch is to fix spurious EBUSY on cgroup removal.

YAMAMOTO Takashi


call mm_free_cgroup earlier.
otherwise a reference due to lazy mm switching can prevent cgroup removal.

Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.24-rc8-mm1/kernel/fork.c.BACKUP   2008-01-23 14:43:29.0 
+0900
+++ linux-2.6.24-rc8-mm1/kernel/fork.c  2008-01-31 17:26:31.0 +0900
@@ -393,7 +393,6 @@ void __mmdrop(struct mm_struct *mm)
 {
BUG_ON(mm == init_mm);
mm_free_pgd(mm);
-   mm_free_cgroup(mm);
destroy_context(mm);
free_mm(mm);
 }
@@ -415,6 +414,7 @@ void mmput(struct mm_struct *mm)
spin_unlock(mmlist_lock);
}
put_swap_token(mm);
+   mm_free_cgroup(mm);
mmdrop(mm);
}
 }
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [dm-devel] [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview

2008-01-24 Thread YAMAMOTO Takashi
 Hi,
 
On Wed, Jan 23, 2008 at 09:53:50PM +0900, Ryo Tsuruta wrote:
 Dm-band gives bandwidth to each job according to its weight, 
 which each job can set its own value to.
 At this time, a job is a group of processes with the same pid or pgrp 
 or uid.

It seems to rely on 'current' to classify bios and doesn't do it until 
the map
function is called, possibly in a different process context, so it won't
always identify the original source of the I/O correctly:
   
   Yes, this should be mentioned in the document with the current 
   implementation
   as you pointed out.
   
   By the way, I think once a memory controller of cgroup is introduced, it 
   will
   help to track down which cgroup is the original source.
  
  do you mean to make this a part of the memory subsystem?
 
 I just think if the memory subsystem is in front of us, we don't need to
 reinvent the wheel.
 
 But I don't have a concrete image how the interface between dm-band and
 the memory subsystem should be designed yet. I'd be appreciate if some of
 the cgroup developers give some ideas about it.

the current implementation of memory subsystem associates pages to
cgroups directly, rather than via tasks.  so it isn't straightforward to
use the information for other classification mechanisms like yours which
might not share the view of hierarchy with the memory subsystem.

YAMAMOTO Takashi

 
 Thanks,
 Hirokazu Takahashi.
 
 
  YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [dm-devel] [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview

2008-01-23 Thread YAMAMOTO Takashi
 Hi,
 
  On Wed, Jan 23, 2008 at 09:53:50PM +0900, Ryo Tsuruta wrote:
   Dm-band gives bandwidth to each job according to its weight, 
   which each job can set its own value to.
   At this time, a job is a group of processes with the same pid or pgrp or 
   uid.
  
  It seems to rely on 'current' to classify bios and doesn't do it until the 
  map
  function is called, possibly in a different process context, so it won't
  always identify the original source of the I/O correctly:
 
 Yes, this should be mentioned in the document with the current implementation
 as you pointed out.
 
 By the way, I think once a memory controller of cgroup is introduced, it will
 help to track down which cgroup is the original source.

do you mean to make this a part of the memory subsystem?

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH] reject '\n' in a cgroup name

2008-01-23 Thread YAMAMOTO Takashi
hi,

the following patch rejects '\n' in a cgroup name.
otherwise /proc/$$/cgroup is not parsable.

example:
imawoto% cat /proc/$$/cgroup
memory:/
imawoto% mkdir -p 
memory:/foo
imawoto% echo $$ | 
memory:/foo/tasks
imawoto% cat /proc/$$/cgroup
memory:/
memory:/foo
imawoto% 

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.24-rc8-mm1/kernel/cgroup.c.BACKUP 2008-01-23 14:43:29.0 
+0900
+++ linux-2.6.24-rc8-mm1/kernel/cgroup.c2008-01-24 13:56:28.0 
+0900
@@ -2216,6 +2216,10 @@ static long cgroup_create(struct cgroup 
struct cgroup_subsys *ss;
struct super_block *sb = root-sb;
 
+   /* reject a newline.  otherwise /proc/$$/cgroup is not parsable. */
+   if (strchr(dentry-d_name.name, '\n'))
+   return -EINVAL;
+
cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] memory.min_usage

2007-12-04 Thread YAMAMOTO Takashi
 But even with seqlock, we'll have to disable irq.

for writers, sure.
readers don't need to disable irq.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][for -mm] memory controller enhancements for reclaiming take2 [7/8] background reclaim for memory controller

2007-12-03 Thread YAMAMOTO Takashi
 @@ -1186,6 +1251,16 @@ static void free_mem_cgroup_per_zone_inf
  
  static struct mem_cgroup init_mem_cgroup;
  
 +static int __init mem_cgroup_reclaim_init(void)
 +{
 + init_mem_cgroup.daemon.thread = kthread_run(mem_cgroup_reclaim_daemon,
 + init_mem_cgroup, memcontd);
 + if (IS_ERR(init_mem_cgroup.daemon.thread))
 + BUG();
 + return 0;
 +}
 +late_initcall(mem_cgroup_reclaim_init);
 +
  static struct cgroup_subsys_state *
  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
  {
 @@ -1213,6 +1288,17 @@ mem_cgroup_create(struct cgroup_subsys *
   if (alloc_mem_cgroup_per_zone_info(mem, node))
   goto free_out;
  
 + /* Memory Reclaim Daemon per cgroup */
 + init_waitqueue_head(mem-daemon.waitq);
 + if (mem != init_mem_cgroup) {
 + /* Complicated...but we cannot call kthread create here..*/
 + /* init call will later assign kthread */
 + mem-daemon.thread = kthread_run(mem_cgroup_reclaim_daemon,
 + mem, memcontd);
 + if (IS_ERR(mem-daemon.thread))
 + goto free_out;
 + }
 +
   return mem-css;
  free_out:
   for_each_node_state(node, N_POSSIBLE)

you don't need the kthread as long as RES_HWMARK is infinite.
given the current default value of RES_HWMARK, you can simplify
initialization by deferring the kthread creation to mem_cgroup_write.
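
i.e. something along these lines in the write path, so that the thread is
only created once a finite high watermark is configured (hypothetical
helper name, error handling and locking omitted, other names as in the
patch):
==
static int mem_cgroup_start_reclaim_daemon(struct mem_cgroup *mem)
{
	if (mem->daemon.thread)
		return 0;
	mem->daemon.thread = kthread_run(mem_cgroup_reclaim_daemon,
					 mem, "memcontd");
	if (IS_ERR(mem->daemon.thread))
		return PTR_ERR(mem->daemon.thread);
	return 0;
}
==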

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH] memory.min_usage

2007-12-03 Thread YAMAMOTO Takashi
hi,

here's a patch to implement memory.min_usage,
which controls the minimum memory usage for a cgroup.

it works similarly to mlock;
global memory reclamation doesn't reclaim memory from
cgroups whose memory usage is below the value.
setting it too high is a dangerous operation.

it's against 2.6.24-rc3-mm2 + memory.swappiness patch i posted here yesterday.
but it's logically independent from the swappiness patch.

todo:
- restrict non-root user's operation regardless of owner of cgroupfs files?
- make oom killer aware of this?

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.24-rc3-mm2-swappiness/include/linux/memcontrol.h.BACKUP2  
2007-12-03 13:52:37.0 +0900
+++ linux-2.6.24-rc3-mm2-swappiness/include/linux/memcontrol.h  2007-12-03 
14:07:45.0 +0900
@@ -47,6 +47,7 @@ extern int mem_cgroup_cache_charge(struc
gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
+extern int mem_cgroup_canreclaim(struct page *page, struct mem_cgroup *mem);
 
 static inline struct mem_cgroup *mm_cgroup(const struct mm_struct *mm)
 {
--- linux-2.6.24-rc3-mm2-swappiness/mm/memcontrol.c.BACKUP2 2007-12-03 
13:52:13.0 +0900
+++ linux-2.6.24-rc3-mm2-swappiness/mm/memcontrol.c 2007-12-04 
11:42:07.0 +0900
@@ -134,6 +134,7 @@ struct mem_cgroup {
unsigned long control_type; /* control RSS or RSS+Pagecache */
int prev_priority;  /* for recording reclaim priority */
unsigned int swappiness;/* swappiness */
+   unsigned long long min_usage; /* XXX should be a part of res_counter? */
/*
 * statistics.
 */
@@ -1096,6 +1097,22 @@ static u64 mem_cgroup_swappiness_read(st
return mem-swappiness;
 }
 
+static int mem_cgroup_min_usage_write(struct cgroup *cont, struct cftype *cft,
+  u64 val)
+{
+   struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+   mem-min_usage = val;
+   return 0;
+}
+
+static u64 mem_cgroup_min_usage_read(struct cgroup *cont, struct cftype *cft)
+{
+   struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+   return mem-min_usage;
+}
+
 static struct cftype mem_cgroup_files[] = {
{
.name = usage_in_bytes,
@@ -1132,6 +1149,11 @@ static struct cftype mem_cgroup_files[] 
.write_uint = mem_cgroup_swappiness_write,
.read_uint = mem_cgroup_swappiness_read,
},
+   {
+   .name = min_usage_in_bytes,
+   .write_uint = mem_cgroup_min_usage_write,
+   .read_uint = mem_cgroup_min_usage_read,
+   },
 };
 
 /* XXX probably it's better to move try_to_free_mem_cgroup_pages to
@@ -1142,6 +1164,36 @@ int mem_cgroup_swappiness(struct mem_cgr
return mem-swappiness;
 }
 
+int mem_cgroup_canreclaim(struct page *page, struct mem_cgroup *mem1)
+{
+   struct page_cgroup *pc;
+   int result = 1;
+
+   if (mem1 != NULL)
+   return 1;
+
+   lock_page_cgroup(page);
+   pc = page_get_page_cgroup(page);
+   if (pc) {
+   struct mem_cgroup *mem2 = pc-mem_cgroup;
+   unsigned long long min_usage;
+
+   BUG_ON(mem2 == NULL);
+   min_usage = mem2-min_usage;
+   if (min_usage != 0) {
+   unsigned long flags;
+
+   spin_lock_irqsave(mem2-res.lock, flags);
+   if (mem2-res.usage = min_usage)
+   result = 0;
+   spin_unlock_irqrestore(mem2-res.lock, flags);
+   }
+   }
+   unlock_page_cgroup(page);
+
+   return result;
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
struct mem_cgroup_per_node *pn;
--- linux-2.6.24-rc3-mm2-swappiness/mm/vmscan.c.BACKUP2 2007-12-03 
13:52:22.0 +0900
+++ linux-2.6.24-rc3-mm2-swappiness/mm/vmscan.c 2007-12-03 14:11:42.0 
+0900
@@ -531,6 +531,11 @@ static unsigned long shrink_page_list(st
referenced  page_mapping_inuse(page))
goto activate_locked;
 
+#ifdef CONFIG_CGROUP_MEM_CONT
+   if (!mem_cgroup_canreclaim(page, sc-mem_cgroup))
+   goto activate_locked;
+#endif /* CONFIG_CGROUP_MEM_CONT */
+
 #ifdef CONFIG_SWAP
/*
 * Anonymous process memory has backing store?
@@ -1122,6 +1127,12 @@ static void shrink_active_list(unsigned 
list_add(page-lru, l_active);
continue;
}
+#ifdef CONFIG_CGROUP_MEM_CONT
+   if (!mem_cgroup_canreclaim(page, sc-mem_cgroup)) {
+   list_add(page-lru, l_active);
+   continue;
+   }
+#endif

[Devel] Re: [PATCH] memory.min_usage

2007-12-03 Thread YAMAMOTO Takashi
  +int mem_cgroup_canreclaim(struct page *page, struct mem_cgroup *mem1)
  +{
  +   struct page_cgroup *pc;
  +   int result = 1;
  +
  +   if (mem1 != NULL)
  +   return 1;
  +
  +   lock_page_cgroup(page);
  +   pc = page_get_page_cgroup(page);
  +   if (pc) {
  +   struct mem_cgroup *mem2 = pc-mem_cgroup;
  +   unsigned long long min_usage;
  +
  +   BUG_ON(mem2 == NULL);
  +   min_usage = mem2-min_usage;
  +   if (min_usage != 0) {
  +   unsigned long flags;
  +
  +   spin_lock_irqsave(mem2-res.lock, flags);
  +   if (mem2-res.usage = min_usage)
  +   result = 0;
  +   spin_unlock_irqrestore(mem2-res.lock, flags);
  +   }
  +   }
  +   unlock_page_cgroup(page);
  +
  +   return result;
  +}
  +
 
 How about adding lock_and_get_page_cgroup(page)/put_page_cgroup() ?
 ==
 struct page_cgroup *pc lock_and_get_page_cgroup(page)
 {
   struct page_cgroup *pc;
 
   lock_page_cgroup(page);
   pc = page_get_page_cgroup(page);
   if (atomic_inc_not_zero(pc-ref_cnt)) 
   pc = NULL;
   unlock_page_cgroup(page);
   return pc;
 }
 
 struct page_cgroup *pc put_page_cgroup(struct page_cgroup *pc)
 {
   mem_cgroup_uncharge(pc);
 }
 ==

i'm not sure if it's a good idea given that reference counting (atomic_foo)
has its own costs.

 BTW, how about add a status to res_counter ?
 My (based on your) current patch uses watermark_state.
 
 Maybe we can change it to be resource_state and show following status.
 
 RES_STATE_HIT_LIMIT,
 RES_STATE_ABOVE_HWATER,
 RES_STATE_ABOVE_LWATER,
 RES_STATE_STABLE, ?
 RES_STATE_BELOW_MIN,
 
 Useful ?

how about making res_counter use seq_lock?
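
i.e. let readers retry instead of spinning with irqs off.  a rough sketch
of what the read side could look like, assuming a seqlock_t is added to
res_counter (the member name here is hypothetical):
==
static inline unsigned long long
res_counter_read_usage(struct res_counter *cnt)
{
	unsigned long long usage;
	unsigned seq;

	do {
		seq = read_seqbegin(&cnt->seq);
		usage = cnt->usage;
	} while (read_seqretry(&cnt->seq, seq));

	return usage;	/* no irq disabling on the read side */
}
==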

   static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
   {
  struct mem_cgroup_per_node *pn;
  --- linux-2.6.24-rc3-mm2-swappiness/mm/vmscan.c.BACKUP2 2007-12-03 
  13:52:22.0 +0900
  +++ linux-2.6.24-rc3-mm2-swappiness/mm/vmscan.c 2007-12-03 
  14:11:42.0 +0900
  @@ -531,6 +531,11 @@ static unsigned long shrink_page_list(st
  referenced  page_mapping_inuse(page))
  goto activate_locked;
   
  +#ifdef CONFIG_CGROUP_MEM_CONT
  +   if (!mem_cgroup_canreclaim(page, sc-mem_cgroup))
  +   goto activate_locked;
  +#endif /* CONFIG_CGROUP_MEM_CONT */
  +
 
 Maybe 
 ==
 if (scan_global_lru(sc) &&
     !mem_cgroup_canreclaim(page, sc->mem_cgroup))
 	goto activate_locked;
 ==

i don't think the decision belongs to callers.
(at least it wasn't my intention.)

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH] memory.swappiness

2007-12-02 Thread YAMAMOTO Takashi
here's a trivial patch to implement memory.swappiness,
which controls swappiness for cgroup memory reclamation.

it's against 2.6.24-rc3-mm2.

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.24-rc3-mm2-swappiness/include/linux/memcontrol.h.BACKUP   
2007-12-03 11:49:27.176669111 +0900
+++ linux-2.6.24-rc3-mm2-swappiness/include/linux/memcontrol.h  2007-12-03 
10:00:29.049448425 +0900
@@ -46,6 +46,7 @@ extern void mem_cgroup_out_of_memory(str
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
 
 static inline struct mem_cgroup *mm_cgroup(const struct mm_struct *mm)
 {
--- linux-2.6.24-rc3-mm2-swappiness/mm/vmscan.c.BACKUP  2007-12-03 
07:49:00.0 +0900
+++ linux-2.6.24-rc3-mm2-swappiness/mm/vmscan.c 2007-12-03 10:01:57.559803379 
+0900
@@ -1030,7 +1030,7 @@ static int calc_reclaim_mapped(struct sc
 *
 * Max temporary value is vm_total_pages*100.
 */
-   imbalance *= (vm_swappiness + 1);
+   imbalance *= (sc-swappiness + 1);
imbalance /= 100;
 
/*
@@ -1445,7 +1445,7 @@ unsigned long try_to_free_mem_cgroup_pag
.may_writepage = !laptop_mode,
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
-   .swappiness = vm_swappiness,
+   .swappiness = mem_cgroup_swappiness(mem_cont),
.order = 0,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
--- linux-2.6.24-rc3-mm2-swappiness/mm/memcontrol.c.BACKUP  2007-12-03 
07:49:00.0 +0900
+++ linux-2.6.24-rc3-mm2-swappiness/mm/memcontrol.c 2007-12-03 
11:22:40.157163781 +0900
@@ -133,6 +133,7 @@ struct mem_cgroup {
 
unsigned long control_type; /* control RSS or RSS+Pagecache */
int prev_priority;  /* for recording reclaim priority */
+   unsigned int swappiness;/* swappiness */
/*
 * statistics.
 */
@@ -1077,7 +1078,23 @@ static int mem_control_stat_open(struct 
return single_open(file, mem_control_stat_show, cont);
 }
 
+static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft,
+  u64 val)
+{
+   struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+   if (val  100)
+   return -EINVAL;
+   mem-swappiness = val;
+   return 0;
+}
+
+static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft)
+{
+   struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+   return mem-swappiness;
+}
 
 static struct cftype mem_cgroup_files[] = {
{
@@ -1110,8 +1127,21 @@ static struct cftype mem_cgroup_files[] 
.name = stat,
.open = mem_control_stat_open,
},
+   {
+   .name = swappiness,
+   .write_uint = mem_cgroup_swappiness_write,
+   .read_uint = mem_cgroup_swappiness_read,
+   },
 };
 
+/* XXX probably it's better to move try_to_free_mem_cgroup_pages to
+  memcontrol.c and kill this */
+int mem_cgroup_swappiness(struct mem_cgroup *mem)
+{
+
+   return mem-swappiness;
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
struct mem_cgroup_per_node *pn;
@@ -1155,6 +1185,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(mem-res);
 
mem-control_type = MEM_CGROUP_TYPE_ALL;
+   mem-swappiness = 60; /* XXX probably should inherit a value from
+   either parent cgroup or global vm_swappiness */
memset(mem-info, 0, sizeof(mem-info));
 
for_each_node_state(node, N_POSSIBLE)
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][ only for review ] memory controller background reclaim [3/5] high/low watermark support in res_counter

2007-11-28 Thread YAMAMOTO Takashi
 This patch adds high/low watermark parameter to res_counter.
 splitted out from YAMAMOTO's background page reclaim for memory cgroup set.

thanks.

 +  * Watermarks
 +  * Should be changed automatically when the limit is changed and
 +  * keep low < high < limit.
 +  */
 + unsigned long long high_watermark;
 + unsigned long long low_watermark;
 + /*

do you have any specific reason not to allow low == high == limit?
if you want to ensure low < high < limit, it's better to make the
default values below meet the condition as well.
to me, it seems weird to prevent users from setting these values back to
the default.

YAMAMOTO Takashi

 @@ -17,6 +17,9 @@
  {
   spin_lock_init(counter-lock);
   counter-limit = (unsigned long long)LLONG_MAX;
 + counter-low_watermark = (unsigned long long)LLONG_MAX;
 + counter-high_watermark = (unsigned long long)LLONG_MAX;
 + counter-watermark_state = RES_WATERMARK_BELOW_LOW;
  }
  
  int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [3/10] per-zone active inactive counter

2007-11-28 Thread YAMAMOTO Takashi
 @@ -651,10 +758,11 @@
   /* Avoid race with charge */
   atomic_set(pc-ref_cnt, 0);
   if (clear_page_cgroup(page, pc) == pc) {
 + int active;
   css_put(mem-css);
 + active = pc-flags  PAGE_CGROUP_FLAG_ACTIVE;
   res_counter_uncharge(mem-res, PAGE_SIZE);
 - list_del_init(pc-lru);
 - mem_cgroup_charge_statistics(mem, pc-flags, false);
 + __mem_cgroup_remove_list(pc);
   kfree(pc);
   } else  /* being uncharged ? ...do relax */
   break;

'active' seems unused.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [3/10] per-zone active inactive counter

2007-11-28 Thread YAMAMOTO Takashi
 +static inline struct mem_cgroup_per_zone *
 +mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 +{
 + if (!mem-info.nodeinfo[nid])

can this be true?

YAMAMOTO Takashi

 + return NULL;
 + return mem-info.nodeinfo[nid]-zoneinfo[zid];
 +}
 +
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][ only for review ] memory controller background reclaim [3/5] high/low watermark support in res_counter

2007-11-28 Thread YAMAMOTO Takashi
  to me, it seems weird to prevent users from making these values back to
  the default.
  
 will fix.
 LLONG_MAX-1 for high
 LLONG_MAX-2 for low ...?

imo it's better to simply allow low == high == limit.

 BTW, it should be low + PAGE_SIZE < high + PAGE_SIZE < limit ...?

it shouldn't, unless you are going to throw away the res_counter abstraction.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller background reclamation

2007-11-26 Thread YAMAMOTO Takashi
 Balbir Singh wrote:
  YAMAMOTO Takashi wrote:
  +int batch_count = 128; /* XXX arbitrary */
  Could we define and use something like MEM_CGROUP_BATCH_COUNT for now?
  Later we could consider and see if it needs to be tunable. numbers are
  hard to read in code.
  although i don't think it makes sense, i can do so if you prefer.
 
  
  Using numbers like 128 make the code unreadable. I prefer something
  like MEM_CGROUP_BATCH_COUNT since its more readable than 128. If we ever
  propagate batch_count to other dependent functions, I'd much rather do
  it with a well defined name.
  
 
 I just checked we already have FORCE_UNCHARGE_BATCH, we could just
 rename and re-use it.

i don't think it's a good idea to use a single constant for
completely different things.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller background reclamation

2007-11-26 Thread YAMAMOTO Takashi
 Using numbers like 128 make the code unreadable. I prefer something
 like MEM_CGROUP_BATCH_COUNT since its more readable than 128. If we ever
 propagate batch_count to other dependent functions, I'd much rather do
 it with a well defined name.

here's a new version.

changes from the previous:
- define a constant.  (MEM_CGROUP_BG_RECLAIM_BATCH_COUNT)
  no functional changes.
- don't increment ALLOCSTALL vm event for mem cgroup because
  it isn't appropriate esp. for background reclamation.
  introduce MEM_CGROUP_STAT_ALLOCSTALL instead.
  (its value is the same as memory.failcnt unless GFP_ATOMIC.)

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.24-rc2-mm1-kame-pd/include/linux/res_counter.h.BACKUP 
2007-11-14 16:05:48.0 +0900
+++ linux-2.6.24-rc2-mm1-kame-pd/include/linux/res_counter.h2007-11-22 
15:14:32.0 +0900
@@ -32,6 +32,13 @@ struct res_counter {
 * the number of unsuccessful attempts to consume the resource
 */
unsigned long long failcnt;
+
+   /*
+* watermarks
+*/
+   unsigned long long high_watermark;
+   unsigned long long low_watermark;
+
/*
 * the lock to protect all of the above.
 * the routines below consider this to be IRQ-safe
@@ -66,6 +73,8 @@ enum {
RES_USAGE,
RES_LIMIT,
RES_FAILCNT,
+   RES_HIGH_WATERMARK,
+   RES_LOW_WATERMARK,
 };
 
 /*
@@ -124,4 +133,26 @@ static inline bool res_counter_check_und
return ret;
 }
 
+static inline bool res_counter_below_low_watermark(struct res_counter *cnt)
+{
+   bool ret;
+   unsigned long flags;
+
+   spin_lock_irqsave(&cnt->lock, flags);
+   ret = cnt->usage < cnt->low_watermark;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+   return ret;
+}
+
+static inline bool res_counter_above_high_watermark(struct res_counter *cnt)
+{
+   bool ret;
+   unsigned long flags;
+
+   spin_lock_irqsave(&cnt->lock, flags);
+   ret = cnt->usage > cnt->high_watermark;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+   return ret;
+}
+
 #endif
--- linux-2.6.24-rc2-mm1-kame-pd/kernel/res_counter.c.BACKUP2007-11-14 
16:05:52.0 +0900
+++ linux-2.6.24-rc2-mm1-kame-pd/kernel/res_counter.c   2007-11-22 
15:14:32.0 +0900
@@ -17,6 +17,8 @@ void res_counter_init(struct res_counter
 {
spin_lock_init(counter-lock);
counter-limit = (unsigned long long)LLONG_MAX;
+   counter-high_watermark = (unsigned long long)LLONG_MAX;
+   counter-low_watermark = (unsigned long long)LLONG_MAX;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -69,6 +71,10 @@ res_counter_member(struct res_counter *c
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+   case RES_HIGH_WATERMARK:
+   return &counter->high_watermark;
+   case RES_LOW_WATERMARK:
+   return &counter->low_watermark;
};
 
BUG();
@@ -99,6 +105,7 @@ ssize_t res_counter_write(struct res_cou
int ret;
char *buf, *end;
unsigned long long tmp, *val;
+   unsigned long flags;
 
buf = kmalloc(nbytes + 1, GFP_KERNEL);
ret = -ENOMEM;
@@ -122,9 +129,29 @@ ssize_t res_counter_write(struct res_cou
goto out_free;
}
 
+   spin_lock_irqsave(&counter->lock, flags);
val = res_counter_member(counter, member);
+   /* ensure low_watermark <= high_watermark <= limit */
+   switch (member) {
+   case RES_LIMIT:
+   if (tmp < counter->high_watermark)
+   goto out_locked;
+   break;
+   case RES_HIGH_WATERMARK:
+   if (tmp > counter->limit || tmp < counter->low_watermark)
+   goto out_locked;
+   break;
+   case RES_LOW_WATERMARK:
+   if (tmp > counter->high_watermark)
+   goto out_locked;
+   break;
+   }
*val = tmp;
+   BUG_ON(counter->high_watermark > counter->limit);
+   BUG_ON(counter->low_watermark > counter->high_watermark);
ret = nbytes;
+out_locked:
+   spin_unlock_irqrestore(&counter->lock, flags);
 out_free:
kfree(buf);
 out:
--- linux-2.6.24-rc2-mm1-kame-pd/mm/vmscan.c.BACKUP 2007-11-20 
13:11:09.0 +0900
+++ linux-2.6.24-rc2-mm1-kame-pd/mm/vmscan.c2007-11-27 08:09:51.0 
+0900
@@ -1333,7 +1333,6 @@ static unsigned long do_try_to_free_page
unsigned long lru_pages = 0;
int i;
 
-   count_vm_event(ALLOCSTALL);
/*
 * mem_cgroup will not do shrink_slab.
 */
@@ -1432,6 +1431,7 @@ unsigned long try_to_free_pages(struct z
.isolate_pages = isolate_pages_global,
};
 
+   count_vm_event(ALLOCSTALL);
return do_try_to_free_pages(zones, gfp_mask, sc);
 }
 
--- linux-2.6.24-rc2-mm1-kame-pd/mm/memcontrol.c.BACKUP 2007-11-20

[Devel] Re: [RFC] [PATCH] memory controller background reclamation

2007-11-25 Thread YAMAMOTO Takashi
hi,

  --- linux-2.6.24-rc2-mm1-kame-pd/kernel/res_counter.c.BACKUP
  2007-11-14 16:05:52.0 +0900
  +++ linux-2.6.24-rc2-mm1-kame-pd/kernel/res_counter.c   2007-11-22 
  15:14:32.0 +0900
  @@ -17,6 +17,8 @@ void res_counter_init(struct res_counter
   {
  spin_lock_init(&counter->lock);
  counter->limit = (unsigned long long)LLONG_MAX;
  +   counter->high_watermark = (unsigned long long)LLONG_MAX;
  +   counter->low_watermark = (unsigned long long)LLONG_MAX;
 
 Should low watermark also be LLONG_MAX?

what else do you suggest?  0?
currently it doesn't matter much because low_watermark is not used at all
as far as high_watermark is LLONG_MAX.

  +static void
  +mem_cgroup_reclaim(struct work_struct *work)
  +{
  +   struct mem_cgroup * const mem =
  +   container_of(work, struct mem_cgroup, reclaim_work);
  +   int batch_count = 128; /* XXX arbitrary */
 
 Could we define and use something like MEM_CGROUP_BATCH_COUNT for now?
 Later we could consider and see if it needs to be tunable. numbers are
 hard to read in code.

although i don't think it makes sense, i can do so if you prefer.

  +
  +   for (; batch_count > 0; batch_count--) {
  +   if (res_counter_below_low_watermark(&mem->res))
  +   break;
 
 Shouldn't we also check to see that we start reclaim in background only
 when we are above the high watermark?

i don't understand what you mean.  can you explain?
highwatermark is checked by mem_cgroup_charge_common before waking
these threads.
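
(rough sketch of the ordering described above, with assumed helper names, not
quoted from the patch: the charge path only wakes the worker once usage has
crossed the high watermark, and the worker then reclaims until usage drops
below the low watermark.)

static void mem_cgroup_kick_background_reclaim(struct mem_cgroup *mem)
{
	/* nothing to do while usage is still below the high watermark */
	if (!res_counter_above_high_watermark(&mem->res))
		return;
	mem_cgroup_schedule_reclaim_if_necessary(mem);
}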

 I'll start some tests on these patches.

thanks.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller background reclamation

2007-11-25 Thread YAMAMOTO Takashi
  currently it doesn't matter much because low_watermark is not used at all
  as far as high_watermark is LLONG_MAX.
  
 
 Don't we use by checking res_counter_below_low_watermark()?

yes, but only when we get above highwatermark.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH] memory controller per zone patches take 2 [4/10] calculate mapped ratio for memory cgroup

2007-11-22 Thread YAMAMOTO Takashi
 On Thu, 22 Nov 2007 17:34:20 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
 + /* usage is recorded in bytes */
 + total = mem->res.usage >> PAGE_SHIFT;
 + rss = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
 + return (rss * 100) / total;

Never tried 64 bit division on a 32 bit system. I hope we don't
have to resort to do_div() sort of functionality.

   Hmm, maybe it's better to make these numbers be just long.
   I'll try to change per-cpu-counter implementation.
   
   Thanks,
   -Kame
  
  besides that, i think 'total' can be zero here.
  
 Ah, This is what I do now.
 ==
 +/*
 + * Calculate mapped_ratio under memory controller. This will be used in
 + * vmscan.c for deteremining we have to reclaim mapped pages.
 + */
 +int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
 +{
 +   long total, rss;
 +
 +   /*
 +* usage is recorded in bytes. But, here, we assume the number of
 +* physical pages can be represented by long on any arch.
 +*/
 +   total = (long) (mem->res.usage >> PAGE_SHIFT);
 +   rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
 +   return (int)((rss * 100L) / total);
 +}
 ==
 
 maybe works well.
 
 -Kame

i meant that / total can cause a division-by-zero exception.
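
(illustration of the concern: an empty cgroup has usage == 0, so the ratio
needs a guard along these lines; this is only a sketch, not necessarily the
fix that was adopted.)

int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
{
	long total, rss;

	/* usage is recorded in bytes */
	total = (long)(mem->res.usage >> PAGE_SHIFT);
	if (total == 0)		/* nothing charged yet; avoid dividing by zero */
		return 0;
	rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
	return (int)((rss * 100L) / total);
}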

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller background reclamation

2007-11-22 Thread YAMAMOTO Takashi
   We could add the following API to resource counters
   
   res_counter_set_low_watermark
   res_counter_set_high_watermark
   res_counter_below_low_watermark
   res_counter_above_high_watermark
   
   and add
   
   low_watermark
   high_watermark
   
   members to the resource group. We could push out data
   upto the low watermark from the cgroup.
 
 i implemented something like that.  (and rebased to 2.6.24-rc2-mm1.)
 
 what's the best way to expose watermarks to userland is an open question.
 i took the simplest way for now.  do you have any suggestions?
 
 YAMAMOTO Takashi

here's a new version.

changes from the previous:
- rebase to linux-2.6.24-rc2-mm1 + kamezawa patchset.
- create workqueue when setting high watermark, rather than cgroup creation
  which is earlier on boot.

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.24-rc2-mm1-kame-pd/include/linux/res_counter.h.BACKUP 
2007-11-14 16:05:48.0 +0900
+++ linux-2.6.24-rc2-mm1-kame-pd/include/linux/res_counter.h2007-11-22 
15:14:32.0 +0900
@@ -32,6 +32,13 @@ struct res_counter {
 * the number of unsuccessful attempts to consume the resource
 */
unsigned long long failcnt;
+
+   /*
+* watermarks
+*/
+   unsigned long long high_watermark;
+   unsigned long long low_watermark;
+
/*
 * the lock to protect all of the above.
 * the routines below consider this to be IRQ-safe
@@ -66,6 +73,8 @@ enum {
RES_USAGE,
RES_LIMIT,
RES_FAILCNT,
+   RES_HIGH_WATERMARK,
+   RES_LOW_WATERMARK,
 };
 
 /*
@@ -124,4 +133,26 @@ static inline bool res_counter_check_und
return ret;
 }
 
+static inline bool res_counter_below_low_watermark(struct res_counter *cnt)
+{
+   bool ret;
+   unsigned long flags;
+
+   spin_lock_irqsave(&cnt->lock, flags);
+   ret = cnt->usage < cnt->low_watermark;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+   return ret;
+}
+
+static inline bool res_counter_above_high_watermark(struct res_counter *cnt)
+{
+   bool ret;
+   unsigned long flags;
+
+   spin_lock_irqsave(&cnt->lock, flags);
+   ret = cnt->usage > cnt->high_watermark;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+   return ret;
+}
+
 #endif
--- linux-2.6.24-rc2-mm1-kame-pd/kernel/res_counter.c.BACKUP2007-11-14 
16:05:52.0 +0900
+++ linux-2.6.24-rc2-mm1-kame-pd/kernel/res_counter.c   2007-11-22 
15:14:32.0 +0900
@@ -17,6 +17,8 @@ void res_counter_init(struct res_counter
 {
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+   counter->high_watermark = (unsigned long long)LLONG_MAX;
+   counter->low_watermark = (unsigned long long)LLONG_MAX;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -69,6 +71,10 @@ res_counter_member(struct res_counter *c
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+   case RES_HIGH_WATERMARK:
+   return &counter->high_watermark;
+   case RES_LOW_WATERMARK:
+   return &counter->low_watermark;
};
 
BUG();
@@ -99,6 +105,7 @@ ssize_t res_counter_write(struct res_cou
int ret;
char *buf, *end;
unsigned long long tmp, *val;
+   unsigned long flags;
 
buf = kmalloc(nbytes + 1, GFP_KERNEL);
ret = -ENOMEM;
@@ -122,9 +129,29 @@ ssize_t res_counter_write(struct res_cou
goto out_free;
}
 
+   spin_lock_irqsave(&counter->lock, flags);
val = res_counter_member(counter, member);
+   /* ensure low_watermark <= high_watermark <= limit */
+   switch (member) {
+   case RES_LIMIT:
+   if (tmp < counter->high_watermark)
+   goto out_locked;
+   break;
+   case RES_HIGH_WATERMARK:
+   if (tmp > counter->limit || tmp < counter->low_watermark)
+   goto out_locked;
+   break;
+   case RES_LOW_WATERMARK:
+   if (tmp > counter->high_watermark)
+   goto out_locked;
+   break;
+   }
*val = tmp;
+   BUG_ON(counter->high_watermark > counter->limit);
+   BUG_ON(counter->low_watermark > counter->high_watermark);
ret = nbytes;
+out_locked:
+   spin_unlock_irqrestore(&counter->lock, flags);
 out_free:
kfree(buf);
 out:
--- linux-2.6.24-rc2-mm1-kame-pd/mm/memcontrol.c.BACKUP 2007-11-20 
13:11:09.0 +0900
+++ linux-2.6.24-rc2-mm1-kame-pd/mm/memcontrol.c2007-11-22 
16:29:26.0 +0900
@@ -28,6 +28,7 @@
 #include <linux/rcupdate.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
+#include <linux/workqueue.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 
@@ -138,6 +139,10 @@ struct mem_cgroup {
 * statistics.
 */
struct mem_cgroup_stat stat;
+   /*
+* background

[Devel] Re: [RFC][ for -mm] memory controller enhancements for NUMA [1/10] record nid/zid on page_cgroup

2007-11-14 Thread YAMAMOTO Takashi
 This patch adds nid/zoneid value to page cgroup.
 This helps per-zone accounting for memory cgroup and reclaim routine.
 
 Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
 
  mm/memcontrol.c |6 ++
  1 file changed, 6 insertions(+)
 
 Index: linux-2.6.24-rc2-mm1/mm/memcontrol.c
 ===
 --- linux-2.6.24-rc2-mm1.orig/mm/memcontrol.c
 +++ linux-2.6.24-rc2-mm1/mm/memcontrol.c
 @@ -131,6 +131,8 @@ struct page_cgroup {
   atomic_t ref_cnt;   /* Helpful when pages move b/w  */
   /* mapped and cached states */
   int  flags;
 + short   nid;
 + short   zid;
  };
  #define PAGE_CGROUP_FLAG_CACHE   (0x1)   /* charged as cache */
  #define PAGE_CGROUP_FLAG_ACTIVE (0x2)/* page is active in this 
 cgroup */
 @@ -216,6 +218,10 @@ void page_assign_page_cgroup(struct page
   VM_BUG_ON(!page_cgroup_locked(page));
   locked = (page->page_cgroup & PAGE_CGROUP_LOCK);
   page->page_cgroup = ((unsigned long)pc | locked);
  + if (pc) {
  + pc->nid = page_to_nid(page);
  + pc->zid = page_zonenum(page);
 + }
  }
  
  struct page_cgroup *page_get_page_cgroup(struct page *page)

are they worth to be cached?
can't you use page_zonenum(pc->page)?

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][ for -mm] memory controller enhancements for NUMA [10/10] per-zone-lru

2007-11-14 Thread YAMAMOTO Takashi
 +struct mc_lru_head {
 + struct list_head active_list[MAX_NR_ZONES];
 + struct list_head inactive_list[MAX_NR_ZONES];
 +};
 +

i guess
struct foo {
struct list_head active_list;
struct list_head inactive_list;
} lists[MAX_NR_ZONES];
is better.

 @@ -139,8 +144,20 @@ struct mem_cgroup {
* Per zone statistics (used for memory reclaim)
*/
   struct mem_cgroup_zonestat zstat;
 +#ifndef CONFIG_NUMA
 + struct lru_head local_head;
 +#endif

struct mc_lru_head local_lru;

 +static int mem_cgroup_init_lru(struct mem_cgroup *mem)
 +{
 + int zone;
 + mem->lrus[0] = &mem->local_lru;

'zone' seems unused.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller background reclamation

2007-11-14 Thread YAMAMOTO Takashi
  We could add the following API to resource counters
  
  res_counter_set_low_watermark
  res_counter_set_high_watermark
  res_counter_below_low_watermark
  res_counter_above_high_watermark
  
  and add
  
  low_watermark
  high_watermark
  
  members to the resource group. We could push out data
  upto the low watermark from the cgroup.

i implemented something like that.  (and rebased to 2.6.24-rc2-mm1.)

what's the best way to expose watermarks to userland is an open question.
i took the simplest way for now.  do you have any suggestions?
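
(for illustration, assuming the watermarks end up as per-cgroup files next to
the existing limit file; the file and mount-point names below are hypothetical,
not taken from the patch.  because every value starts at LLONG_MAX and the
write handler enforces low_watermark <= high_watermark <= limit, the values
have to be lowered in this order:)

# hypothetical file names, shown only to illustrate the ordering constraint
echo $((40 * 1024 * 1024)) > /cgroups/foo/memory.low_watermark
echo $((48 * 1024 * 1024)) > /cgroups/foo/memory.high_watermark
echo $((64 * 1024 * 1024)) > /cgroups/foo/memory.limit_in_bytes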

YAMAMOTO Takashi


--- ./include/linux/res_counter.h.orig  2007-11-14 15:57:31.0 +0900
+++ ./include/linux/res_counter.h   2007-11-14 16:03:24.0 +0900
@@ -32,6 +32,13 @@ struct res_counter {
 * the number of unsuccessful attempts to consume the resource
 */
unsigned long long failcnt;
+
+   /*
+* watermarks
+*/
+   unsigned long long high_watermark;
+   unsigned long long low_watermark;
+
/*
 * the lock to protect all of the above.
 * the routines below consider this to be IRQ-safe
@@ -66,6 +73,8 @@ enum {
RES_USAGE,
RES_LIMIT,
RES_FAILCNT,
+   RES_HIGH_WATERMARK,
+   RES_LOW_WATERMARK,
 };
 
 /*
@@ -124,4 +133,26 @@ static inline bool res_counter_check_und
return ret;
 }
 
+static inline bool res_counter_below_low_watermark(struct res_counter *cnt)
+{
+   bool ret;
+   unsigned long flags;
+
+   spin_lock_irqsave(&cnt->lock, flags);
+   ret = cnt->usage < cnt->low_watermark;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+   return ret;
+}
+
+static inline bool res_counter_above_high_watermark(struct res_counter *cnt)
+{
+   bool ret;
+   unsigned long flags;
+
+   spin_lock_irqsave(&cnt->lock, flags);
+   ret = cnt->usage > cnt->high_watermark;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+   return ret;
+}
+
 #endif
--- ./kernel/res_counter.c.orig 2007-11-14 15:57:31.0 +0900
+++ ./kernel/res_counter.c  2007-11-14 16:03:24.0 +0900
@@ -17,6 +17,8 @@ void res_counter_init(struct res_counter
 {
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+   counter->high_watermark = (unsigned long long)LLONG_MAX;
+   counter->low_watermark = (unsigned long long)LLONG_MAX;
 }
 
 int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -69,6 +71,10 @@ res_counter_member(struct res_counter *c
return &counter->limit;
case RES_FAILCNT:
return &counter->failcnt;
+   case RES_HIGH_WATERMARK:
+   return &counter->high_watermark;
+   case RES_LOW_WATERMARK:
+   return &counter->low_watermark;
};
 
BUG();
@@ -99,6 +105,7 @@ ssize_t res_counter_write(struct res_cou
int ret;
char *buf, *end;
unsigned long long tmp, *val;
+   unsigned long flags;
 
buf = kmalloc(nbytes + 1, GFP_KERNEL);
ret = -ENOMEM;
@@ -122,9 +129,29 @@ ssize_t res_counter_write(struct res_cou
goto out_free;
}
 
+   spin_lock_irqsave(&counter->lock, flags);
val = res_counter_member(counter, member);
+   /* ensure low_watermark <= high_watermark <= limit */
+   switch (member) {
+   case RES_LIMIT:
+   if (tmp < counter->high_watermark)
+   goto out_locked;
+   break;
+   case RES_HIGH_WATERMARK:
+   if (tmp > counter->limit || tmp < counter->low_watermark)
+   goto out_locked;
+   break;
+   case RES_LOW_WATERMARK:
+   if (tmp > counter->high_watermark)
+   goto out_locked;
+   break;
+   }
*val = tmp;
+   BUG_ON(counter->high_watermark > counter->limit);
+   BUG_ON(counter->low_watermark > counter->high_watermark);
ret = nbytes;
+out_locked:
+   spin_unlock_irqrestore(&counter->lock, flags);
 out_free:
kfree(buf);
 out:
--- ./mm/memcontrol.c.orig  2007-11-14 15:57:46.0 +0900
+++ ./mm/memcontrol.c   2007-11-14 16:03:53.0 +0900
@@ -28,6 +28,7 @@
 #include <linux/rcupdate.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
+#include <linux/workqueue.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 
@@ -110,6 +111,10 @@ struct mem_cgroup {
 * statistics.
 */
struct mem_cgroup_stat stat;
+   /*
+* background reclamation.
+*/
+   struct work_struct reclaim_work;
 };
 
 /*
@@ -168,6 +173,9 @@ static void mem_cgroup_charge_statistics
 
 static struct mem_cgroup init_mem_cgroup;
 
+static DEFINE_MUTEX(mem_cgroup_workqueue_init_lock);
+static struct workqueue_struct *mem_cgroup_workqueue;
+
 static inline
 struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
 {
@@ -375,6 +383,50 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
 }
 
+static

[Devel] Re: [RFC][ for -mm] memory cgroup enhancements take3 [3/9] remember page cache

2007-10-30 Thread YAMAMOTO Takashi
 @@ -93,6 +95,11 @@ enum {
   MEM_CGROUP_TYPE_MAX,
  };
  
 +enum charge_type {
 + MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 + MEM_CGROUP_CHARGE_TYPE_MAPPED = 0,
 +};
 +

should be different values. :-)

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][ for -mm] memory cgroup enhancements take3 [9/9] per zone stat

2007-10-30 Thread YAMAMOTO Takashi
 +
 +/*
 + * Per-zone statistics.
 + * Please be carefull. The array can be very big on envrionments whic has
 + * very big MAX_NUMNODES . Adding new stat member to this will eat much 
 memory.
 + * Only Active/Inactive may be sutiable.

s/whic/which/
s/sutiable/suitable/

 +static inline void __mem_cgroup_zonesta_dec(struct mem_cgroup_zonestat 
 *zstat,

s/zonesta/zonestat/

 @@ -293,6 +369,23 @@ clear_page_cgroup(struct page *page, str
  
  static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
  {
 + int direction = 0;
 +
 + if (active && !(pc->flags & PAGE_CGROUP_FLAG_ACTIVE))
 + direction = 1; /* from inactive to active */
 + if (!active && (pc->flags & PAGE_CGROUP_FLAG_ACTIVE))
 + direction = -1;
 +
 + if (direction) {
 + struct mem_cgroup_zonestat *zstat = &pc->mem_cgroup->zonestat;
 + int index = page_cgroup_to_zonestat_index(pc);
 + preempt_disable();
 + __mem_cgroup_zonestat_add(zstat, MEM_CGROUP_ZONESTAT_ACTIVE,
 + direction, index);
 + __mem_cgroup_zonestat_add(zstat, MEM_CGROUP_ZONESTAT_INACTIVE,
 + direction, index);

dec?

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][ for -mm] memory cgroup enhancements take3 [9/9] per zone stat

2007-10-30 Thread YAMAMOTO Takashi
   + if (active && !(pc->flags & PAGE_CGROUP_FLAG_ACTIVE))
   + direction = 1; /* from inactive to active */
   + if (!active && (pc->flags & PAGE_CGROUP_FLAG_ACTIVE))
   + direction = -1;
   +
   + if (direction) {
   + struct mem_cgroup_zonestat *zstat = &pc->mem_cgroup->zonestat;
   + int index = page_cgroup_to_zonestat_index(pc);
   + preempt_disable();
   + __mem_cgroup_zonestat_add(zstat, MEM_CGROUP_ZONESTAT_ACTIVE,
   + direction, index);
   + __mem_cgroup_zonestat_add(zstat, MEM_CGROUP_ZONESTAT_INACTIVE,
   + direction, index);
  
  dec?
  
 direction(add value) is 1 or -1 here. Hmm, this is maybe confusing.
 ok, I'll clean up this.

adding the same value to both of active and inactive seems wrong.
i think you want to subtract 'direction' from inactive here.
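
(a sketch of the correction being suggested, using the helper names from the
quoted patch: the two per-zone counters have to move in opposite directions
when a page changes lists.)

		__mem_cgroup_zonestat_add(zstat, MEM_CGROUP_ZONESTAT_ACTIVE,
						direction, index);
		__mem_cgroup_zonestat_add(zstat, MEM_CGROUP_ZONESTAT_INACTIVE,
						-direction, index);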

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC] [PATCH] memory controller background reclamation

2007-10-22 Thread YAMAMOTO Takashi
hi,

i implemented background reclamation for your memory controller and
did a few benchmarks with and without it.  any comments?

YAMAMOTO Takashi


-
time make -j4 bzImage in a cgroup with 64MB limit:

without patch:
real22m22.389s
user13m35.163s
sys 6m12.283s

memory.failcnt after a run: 106647

with patch:
real19m25.860s
user14m10.549s
sys 6m13.831s

memory.failcnt after a run: 35366

for x in 1 2 3;do time dd if=/dev/zero bs=1M count=300|tail;done
in a cgroup with 16MB limit:

without patch:
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 19.0058 seconds, 16.6 MB/s
u 0.00s s 0.67s e 19.01s maj 90 min 2021 in 0 out 0
u 0.48s s 7.22s e 93.13s maj 20475 min 210903 in 0 out 0
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 14.8394 seconds, 21.2 MB/s
u 0.00s s 0.63s e 14.84s maj 53 min 1392 in 0 out 0
u 0.48s s 7.00s e 94.39s maj 20125 min 211145 in 0 out 0
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 14.0635 seconds, 22.4 MB/s
u 0.01s s 0.59s e 14.07s maj 50 min 1385 in 0 out 0
u 0.57s s 6.93s e 91.54s maj 20359 min 210911 in 0 out 0

memory.failcnt after a run: 8804

with patch:
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 5.94875 seconds, 52.9 MB/s
u 0.00s s 0.67s e 5.95s maj 206 min 5393 in 0 out 0
u 0.45s s 6.63s e 46.07s maj 21350 min 210252 in 0 out 0
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 8.16441 seconds, 38.5 MB/s
u 0.00s s 0.60s e 8.17s maj 240 min 4614 in 0 out 0
u 0.45s s 6.56s e 54.56s maj 22298 min 209255 in 0 out 0
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 4.60967 seconds, 68.2 MB/s
u 0.00s s 0.60s e 4.64s maj 196 min 4733 in 0 out 0
u 0.46s s 6.53s e 49.28s maj 3 min 209384 in 0 out 0

memory.failcnt after a run: 1589
-


--- linux-2.6.23-rc8-mm2-cgkswapd/mm/memcontrol.c.BACKUP2007-10-01 
17:19:57.0 +0900
+++ linux-2.6.23-rc8-mm2-cgkswapd/mm/memcontrol.c   2007-10-22 
13:46:01.0 +0900
@@ -27,6 +27,7 @@
 #include <linux/rcupdate.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
+#include <linux/workqueue.h>
 #include <linux/fs.h>
 
 #include <asm/uaccess.h>
@@ -63,6 +64,8 @@ struct mem_cgroup {
 */
spinlock_t lru_lock;
unsigned long control_type; /* control RSS or RSS+Pagecache */
+
+   struct work_struct reclaim_work;
 };
 
 /*
@@ -95,6 +98,9 @@ enum {
 
 static struct mem_cgroup init_mem_cgroup;
 
+static DEFINE_MUTEX(mem_cgroup_workqueue_init_lock);
+static struct workqueue_struct *mem_cgroup_workqueue;
+
 static inline
 struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
 {
@@ -250,6 +256,69 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
 }
 
+static int
+mem_cgroup_need_reclaim(struct mem_cgroup *mem)
+{
+   struct res_counter * const cnt = &mem->res;
+   int doreclaim;
+   unsigned long flags;
+
+   /* XXX should be in res_counter */
+   /* XXX should not hardcode a watermark */
+   spin_lock_irqsave(&cnt->lock, flags);
+   doreclaim = cnt->usage > cnt->limit / 4 * 3;
+   spin_unlock_irqrestore(&cnt->lock, flags);
+
+   return doreclaim;
+}
+
+static void
+mem_cgroup_schedule_reclaim_if_necessary(struct mem_cgroup *mem)
+{
+
+   if (mem_cgroup_workqueue == NULL) {
+   BUG_ON(mem->css.cgroup->parent != NULL);
+   return;
+   }
+
+   if (work_pending(&mem->reclaim_work))
+   return;
+
+   if (!mem_cgroup_need_reclaim(mem))
+   return;
+
+   css_get(&mem->css); /* XXX need some thoughts wrt cgroup removal. */
+   /*
+* XXX workqueue is not an ideal mechanism for our purpose.
+* revisit later.
+*/
+   if (!queue_work(mem_cgroup_workqueue, &mem->reclaim_work))
+   css_put(&mem->css);
+}
+
+static void
+mem_cgroup_reclaim(struct work_struct *work)
+{
+   struct mem_cgroup * const mem =
+   container_of(work, struct mem_cgroup, reclaim_work);
+   int batch_count = 128; /* XXX arbitrary */
+
+   for (; batch_count > 0; batch_count--) {
+   if (!mem_cgroup_need_reclaim(mem))
+   break;
+   /*
+* XXX try_to_free_foo is not a correct mechanism to
+* use here

[Devel] Re: [PATCH][ just for review] memory controller enhancements [4/5] statistics for memory cgroup

2007-10-15 Thread YAMAMOTO Takashi
  - changed from u64 to s64

why?

 +/*
 + * For batching mem_cgroup_charge_statistics() (see below).
 + */
 +static inline void mem_cgroup_stat_add(struct mem_cgroup_stat *stat,
 +enum mem_cgroup_stat_index idx, int val)
 +{
 + int cpu = smp_processor_id();
 + stat->cpustat[cpu].count[idx] += val;
 +}

i think the function name should be something which implies batching.

 @@ -207,8 +299,23 @@ clear_page_cgroup(struct page *page, str
  }
  
  
 -static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 +static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active,
 + struct mem_cgroup *mem)
  {

can mem be different from pc-mem_cgroup here?

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH][ just for review] memory controller enhancements [4/5] statistics for memory cgroup

2007-10-15 Thread YAMAMOTO Takashi
   +/*
   + * For batching mem_cgroup_charge_statistics() (see below).
   + */
   +static inline void mem_cgroup_stat_add(struct mem_cgroup_stat *stat,
   +enum mem_cgroup_stat_index idx, int val)
   +{
   + int cpu = smp_processor_id();
   + stat->cpustat[cpu].count[idx] += val;
   +}
  
  i think the function name should be something which implies batching.
  
 Hm, How about this ?
 ==
 mem_cgroup_stat_add_atomic() 
 ==
 and add this 
 ==
 VM_BUG_ON(preempt_count() == 0)
 ==

_atomic sounds like a different thing to me.  _nonpreemptible?
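
(sketch of the naming under discussion, illustrative only: the helper relies
on the caller having preemption disabled, so both the name and the debug check
would state that requirement.)

static inline void mem_cgroup_stat_add_nonpreemptible(struct mem_cgroup_stat *stat,
		enum mem_cgroup_stat_index idx, int val)
{
	/* the caller must have disabled preemption (or run in irq context) */
	VM_BUG_ON(preempt_count() == 0);
	stat->cpustat[smp_processor_id()].count[idx] += val;
}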

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH][ just for review] memory controller enhancements [1/5] force empty

2007-10-15 Thread YAMAMOTO Takashi
 Index: devel-2.6.23-rc8-mm2/mm/memcontrol.c
 ===
 --- devel-2.6.23-rc8-mm2.orig/mm/memcontrol.c
 +++ devel-2.6.23-rc8-mm2/mm/memcontrol.c
 @@ -469,6 +469,7 @@ void mem_cgroup_uncharge(struct page_cgr
   page = pc-page;
   /*
 * get page->cgroup and clear it under lock.
  +  * force-empty can drop page->cgroup without checking refcnt.

force_empty

 + char buf[2] = "0";

it should be static const unless you want a runtime assignment.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller statistics

2007-10-10 Thread YAMAMOTO Takashi
 On Wed, 10 Oct 2007 10:01:17 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
  hi,
  
i implemented some statistics for your memory controller.
  
  here's a new version.
  
  changes from the previous:
  - make counters per-cpu.
  - value *= PAGE_SIZE
  
 
 YAMAMOTO-san, I like this work and the patch seems good.
 But will HUNK with my work to some extent ;)
 
 So, I'd like to merge this patch on my patch set (against -mm) if you don't 
 mind.
 If it's ok, please give your Signed-off-by line. Then, I'll merge this to 
 mine.
 
 Thanks,
 -Kame

sure.  here's the latest version.

changes from the previous:
- fix a race in uncharge.  (check PCGF_ACTIVE with mem-lru_lock held)

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi [EMAIL PROTECTED]
---

--- linux-2.6.23-rc8-mm2-stat/mm/memcontrol.c.BACKUP2007-10-01 
17:19:57.0 +0900
+++ linux-2.6.23-rc8-mm2-stat/mm/memcontrol.c   2007-10-10 12:40:48.0 
+0900
@@ -25,6 +25,7 @@
 #include <linux/backing-dev.h>
 #include <linux/bit_spinlock.h>
 #include <linux/rcupdate.h>
+#include <linux/seq_file.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
@@ -34,6 +35,63 @@
 struct cgroup_subsys mem_cgroup_subsys;
 static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
 
+enum mem_cgroup_stat_index {
+   /*
+* for MEM_CONTAINER_TYPE_ALL, usage == pagecache + rss
+*/
+   MEM_CGROUP_STAT_PAGECACHE,  /* # of pages charged as cache */
+   MEM_CGROUP_STAT_RSS,/* # of pages charged as rss */
+
+   /*
+* redundant; usage == charge - uncharge
+*/
+   MEM_CGROUP_STAT_CHARGE, /* # of pages charged */
+   MEM_CGROUP_STAT_UNCHARGE,   /* # of pages uncharged */
+
+   MEM_CGROUP_STAT_ACTIVE, /* # of pages on active_list */
+   MEM_CGROUP_STAT_INACTIVE,   /* # of pages on inactive_list */
+
+   MEM_CGROUP_STAT_NSTATS,
+};
+
+static const struct mem_cgroup_stat_desc {
+   const char *msg;
+   u64 unit;
+} mem_cgroup_stat_desc[] = {
+   [MEM_CGROUP_STAT_PAGECACHE] = { "page_cache", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_RSS] = { "rss", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_CHARGE] = { "charge", PAGE_SIZE },
+   [MEM_CGROUP_STAT_UNCHARGE] = { "uncharge", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_ACTIVE] = { "active", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_INACTIVE] = { "inactive", PAGE_SIZE, },
+};
+
+struct mem_cgroup_stat_cpu {
+   u64 count[MEM_CGROUP_STAT_NSTATS];
+} cacheline_aligned_in_smp;
+
+struct mem_cgroup_stat {
+   struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
+};
+
+static inline void mem_cgroup_stat_inc(struct mem_cgroup_stat * stat,
+   enum mem_cgroup_stat_index idx)
+{
+   unsigned int cpu = get_cpu();
+
+   stat->cpustat[cpu].count[idx]++;
+   put_cpu();
+}
+
+static inline void mem_cgroup_stat_dec(struct mem_cgroup_stat * stat,
+   enum mem_cgroup_stat_index idx)
+{
+   unsigned int cpu = get_cpu();
+
+   stat->cpustat[cpu].count[idx]--;
+   put_cpu();
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -63,6 +121,11 @@ struct mem_cgroup {
 */
spinlock_t lru_lock;
unsigned long control_type; /* control RSS or RSS+Pagecache */
+
+   /*
+* statistics
+*/
+   struct mem_cgroup_stat stat;
 };
 
 /*
@@ -83,6 +146,9 @@ struct page_cgroup {
struct mem_cgroup *mem_cgroup;
atomic_t ref_cnt;   /* Helpful when pages move b/w  */
/* mapped and cached states */
+   int flags;
+#define PCGF_PAGECACHE  1   /* charged as page cache */
+#define PCGF_ACTIVE 2   /* on active_list */
 };
 
 enum {
@@ -164,10 +230,24 @@ static void __always_inline unlock_page_
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-   if (active)
-   list_move(&pc->lru, &pc->mem_cgroup->active_list);
-   else
-   list_move(&pc->lru, &pc->mem_cgroup->inactive_list);
+   struct mem_cgroup *mem = pc->mem_cgroup;
+   struct mem_cgroup_stat *stat = &mem->stat;
+
+   if (active) {
+   list_move(&pc->lru, &mem->active_list);
+   if ((pc->flags & PCGF_ACTIVE) == 0) {
+   mem_cgroup_stat_dec(stat, MEM_CGROUP_STAT_INACTIVE);
+   mem_cgroup_stat_inc(stat, MEM_CGROUP_STAT_ACTIVE);
+   pc->flags |= PCGF_ACTIVE;
+   }
+   } else {
+   list_move(&pc->lru, &mem->inactive_list);
+   if ((pc->flags & PCGF_ACTIVE) != 0) {
+   mem_cgroup_stat_dec(stat, MEM_CGROUP_STAT_ACTIVE);
+   mem_cgroup_stat_inc(stat, MEM_CGROUP_STAT_INACTIVE);
+   pc->flags &= ~PCGF_ACTIVE;
+   }
+   }
 }
 
 /*
@@ -256,10 +336,11 @@ unsigned long

[Devel] Re: [PATCH][for -mm] Fix and Enhancements for memory cgroup [1/6] fix refcnt race in charge/uncharge

2007-10-09 Thread YAMAMOTO Takashi
 The logic of uncharging is 
  - decrement refcnt -> lock page cgroup -> remove page cgroup.
 But the logic of charging is
  - lock page cgroup -> increment refcnt -> return.
 
 Then, one charge will be added to a page_cgroup under being removed.
 This makes no big trouble (like panic) but one charge is lost.
 
 This patch add a test at charging to verify page_cgroup's refcnt is
 greater than 0. If not, unlock and retry.
 
 Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
 
 
  mm/memcontrol.c |9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)
 
 Index: linux-2.6.23-rc8-mm2/mm/memcontrol.c
 ===
 --- linux-2.6.23-rc8-mm2.orig/mm/memcontrol.c
 +++ linux-2.6.23-rc8-mm2/mm/memcontrol.c
 @@ -271,14 +271,19 @@ int mem_cgroup_charge(struct page *page,
* to see if the cgroup page already has a page_cgroup associated
* with it
*/
 +retry:
   lock_page_cgroup(page);
   pc = page_get_page_cgroup(page);
   /*
* The page_cgroup exists and the page has already been accounted
*/
   if (pc) {
  - atomic_inc(&pc->ref_cnt);
  - goto done;
  + if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
 + /* this page is under being uncharge ? */
 + unlock_page_cgroup(page);

cpu_relax() here?

YAMAMOTO Takashi

 + goto retry;
 + } else
 + goto done;
   }
  
   unlock_page_cgroup(page);
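
(sketch of the charge-side retry with the suggested cpu_relax() added; based on
the hunk quoted above, illustrative only.)

retry:
	lock_page_cgroup(page);
	pc = page_get_page_cgroup(page);
	if (pc) {
		if (unlikely(!atomic_inc_not_zero(&pc->ref_cnt))) {
			/* the page_cgroup is being uncharged; back off and retry */
			unlock_page_cgroup(page);
			cpu_relax();
			goto retry;
		}
		goto done;
	}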
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH] memory controller statistics

2007-10-09 Thread YAMAMOTO Takashi
hi,

  i implemented some statistics for your memory controller.

here's a new version.

changes from the previous:
- make counters per-cpu.
- value *= PAGE_SIZE
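
(not part of the patch below; a sketch of the read side that the per-cpu
layout implies, with an assumed helper name: readers sum the per-cpu slots so
the hot increment path never touches a shared counter.)

static u64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
		enum mem_cgroup_stat_index idx)
{
	u64 total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += stat->cpustat[cpu].count[idx];
	return total;
}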

YAMAMOTO Takashi


--- linux-2.6.23-rc8-mm2-stat/mm/memcontrol.c.BACKUP2007-10-01 
17:19:57.0 +0900
+++ linux-2.6.23-rc8-mm2-stat/mm/memcontrol.c   2007-10-09 13:41:43.0 
+0900
@@ -25,6 +25,7 @@
 #include <linux/backing-dev.h>
 #include <linux/bit_spinlock.h>
 #include <linux/rcupdate.h>
+#include <linux/seq_file.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
@@ -34,6 +35,63 @@
 struct cgroup_subsys mem_cgroup_subsys;
 static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
 
+enum mem_cgroup_stat_index {
+   /*
+* for MEM_CONTAINER_TYPE_ALL, usage == pagecache + rss
+*/
+   MEM_CGROUP_STAT_PAGECACHE,  /* # of pages charged as cache */
+   MEM_CGROUP_STAT_RSS,/* # of pages charged as rss */
+
+   /*
+* redundant; usage == charge - uncharge
+*/
+   MEM_CGROUP_STAT_CHARGE, /* # of pages charged */
+   MEM_CGROUP_STAT_UNCHARGE,   /* # of pages uncharged */
+
+   MEM_CGROUP_STAT_ACTIVE, /* # of pages on active_list */
+   MEM_CGROUP_STAT_INACTIVE,   /* # of pages on inactive_list */
+
+   MEM_CGROUP_STAT_NSTATS,
+};
+
+static const struct mem_cgroup_stat_desc {
+   const char *msg;
+   u64 unit;
+} mem_cgroup_stat_desc[] = {
+   [MEM_CGROUP_STAT_PAGECACHE] = { "page_cache", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_RSS] = { "rss", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_CHARGE] = { "charge", PAGE_SIZE },
+   [MEM_CGROUP_STAT_UNCHARGE] = { "uncharge", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_ACTIVE] = { "active", PAGE_SIZE, },
+   [MEM_CGROUP_STAT_INACTIVE] = { "inactive", PAGE_SIZE, },
+};
+
+struct mem_cgroup_stat_cpu {
+   u64 count[MEM_CGROUP_STAT_NSTATS];
+} cacheline_aligned_in_smp;
+
+struct mem_cgroup_stat {
+   struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
+};
+
+static inline void mem_cgroup_stat_inc(struct mem_cgroup_stat * stat,
+   enum mem_cgroup_stat_index idx)
+{
+   unsigned int cpu = get_cpu();
+
+   stat->cpustat[cpu].count[idx]++;
+   put_cpu();
+}
+
+static inline void mem_cgroup_stat_dec(struct mem_cgroup_stat * stat,
+   enum mem_cgroup_stat_index idx)
+{
+   unsigned int cpu = get_cpu();
+
+   stat->cpustat[cpu].count[idx]--;
+   put_cpu();
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -63,6 +121,11 @@ struct mem_cgroup {
 */
spinlock_t lru_lock;
unsigned long control_type; /* control RSS or RSS+Pagecache */
+
+   /*
+* statistics
+*/
+   struct mem_cgroup_stat stat;
 };
 
 /*
@@ -83,6 +146,9 @@ struct page_cgroup {
struct mem_cgroup *mem_cgroup;
atomic_t ref_cnt;   /* Helpful when pages move b/w  */
/* mapped and cached states */
+   int flags;
+#define PCGF_PAGECACHE  1   /* charged as page cache */
+#define PCGF_ACTIVE 2   /* on active_list */
 };
 
 enum {
@@ -164,10 +230,24 @@ static void __always_inline unlock_page_
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-   if (active)
-   list_move(&pc->lru, &pc->mem_cgroup->active_list);
-   else
-   list_move(&pc->lru, &pc->mem_cgroup->inactive_list);
+   struct mem_cgroup *mem = pc->mem_cgroup;
+   struct mem_cgroup_stat *stat = &mem->stat;
+
+   if (active) {
+   list_move(&pc->lru, &mem->active_list);
+   if ((pc->flags & PCGF_ACTIVE) == 0) {
+   mem_cgroup_stat_dec(stat, MEM_CGROUP_STAT_INACTIVE);
+   mem_cgroup_stat_inc(stat, MEM_CGROUP_STAT_ACTIVE);
+   pc->flags |= PCGF_ACTIVE;
+   }
+   } else {
+   list_move(&pc->lru, &mem->inactive_list);
+   if ((pc->flags & PCGF_ACTIVE) != 0) {
+   mem_cgroup_stat_dec(stat, MEM_CGROUP_STAT_ACTIVE);
+   mem_cgroup_stat_inc(stat, MEM_CGROUP_STAT_INACTIVE);
+   pc->flags &= ~PCGF_ACTIVE;
+   }
+   }
 }
 
 /*
@@ -256,10 +336,11 @@ unsigned long mem_cgroup_isolate_pages(u
  * 0 if the charge was successful
 * < 0 if the cgroup is over its limit
  */
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
-   gfp_t gfp_mask)
+int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+   gfp_t gfp_mask, int is_cache)
 {
struct mem_cgroup *mem;
+   struct mem_cgroup_stat *stat;
struct page_cgroup *pc, *race_pc;
unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -365,8 +446,17 @@ noreclaim

[Devel] Re: [RFC] [PATCH] memory controller statistics

2007-10-03 Thread YAMAMOTO Takashi
 hi,
 
 i implemented some statistics for your memory controller.
 
 it's tested with 2.6.23-rc2-mm2 + memory controller v7.
 i think it can be applied to 2.6.23-rc4-mm1 as well.
 
YAMAMOTO Takashi
 
 todo: something like nr_active/inactive in /proc/vmstat.

here's the version i'm working on currently.  any comments?

changes from the previous version:

- adapt to 2.6.23-rc8-mm2, container - cgroup rename.
- reflect some of comments on this list.
- rename some macros as suggested by balbir
- sprinkle some inlines as suggested by balbir.
- remove isolate statistics
- remove PAGE_CONTAINER_CACHE hack and
  add flags member to page_container instead.
- make counters atomic_long_t.
- add some comments.
- coding style.
- implement nr_active/nr_inactive.  they show numbers of pages on
  per-cgroup lru lists.

todo:
- consider to make counters per-cpu.
- more statistics.

YAMAMOTO Takashi


--- linux-2.6.23-rc8-mm2-stat/mm/memcontrol.c.BACKUP2007-10-01 
17:19:57.0 +0900
+++ linux-2.6.23-rc8-mm2-stat/mm/memcontrol.c   2007-10-04 12:42:05.0 
+0900
@@ -25,6 +25,7 @@
 #include <linux/backing-dev.h>
 #include <linux/bit_spinlock.h>
 #include <linux/rcupdate.h>
+#include <linux/seq_file.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
@@ -34,6 +35,52 @@
 struct cgroup_subsys mem_cgroup_subsys;
 static const int MEM_CGROUP_RECLAIM_RETRIES = 5;
 
+enum mem_cgroup_stat_index {
+   /*
+* for MEM_CONTAINER_TYPE_ALL, usage == pagecache + rss
+*/
+   MEM_CGROUP_STAT_PAGECACHE,  /* # of pages charged as cache */
+   MEM_CGROUP_STAT_RSS,/* # of pages charged as rss */
+
+   /*
+* redundant; usage == charge - uncharge
+*/
+   MEM_CGROUP_STAT_CHARGE, /* # of pages charged */
+   MEM_CGROUP_STAT_UNCHARGE,   /* # of pages uncharged */
+
+   MEM_CGROUP_STAT_ACTIVE, /* # of pages on active_list */
+   MEM_CGROUP_STAT_INACTIVE,   /* # of pages on inactive_list */
+
+   MEM_CGROUP_STAT_NSTATS,
+};
+
+static const char * const mem_cgroup_stat_desc[] = {
+   [MEM_CGROUP_STAT_PAGECACHE] = "page_cache",
+   [MEM_CGROUP_STAT_RSS] = "rss",
+   [MEM_CGROUP_STAT_CHARGE] = "charge",
+   [MEM_CGROUP_STAT_UNCHARGE] = "uncharge",
+   [MEM_CGROUP_STAT_ACTIVE] = "nr_active",
+   [MEM_CGROUP_STAT_INACTIVE] = "nr_inactive",
+};
+
+struct mem_cgroup_stat {
+   atomic_long_t count[MEM_CGROUP_STAT_NSTATS];
+};
+
+static inline void mem_cgroup_stat_inc(struct mem_cgroup_stat * stat,
+   enum mem_cgroup_stat_index idx)
+{
+
+   atomic_long_inc(&stat->count[idx]);
+}
+
+static inline void mem_cgroup_stat_dec(struct mem_cgroup_stat * stat,
+   enum mem_cgroup_stat_index idx)
+{
+
+   atomic_long_dec(&stat->count[idx]);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -63,6 +110,11 @@ struct mem_cgroup {
 */
spinlock_t lru_lock;
unsigned long control_type; /* control RSS or RSS+Pagecache */
+
+   /*
+* statistics
+*/
+   struct mem_cgroup_stat stat;
 };
 
 /*
@@ -83,6 +135,9 @@ struct page_cgroup {
struct mem_cgroup *mem_cgroup;
atomic_t ref_cnt;   /* Helpful when pages move b/w  */
/* mapped and cached states */
+   int flags;
+#define PCGF_PAGECACHE  1   /* charged as page cache */
+#define PCGF_ACTIVE 2   /* on active_list */
 };
 
 enum {
@@ -164,10 +219,24 @@ static void __always_inline unlock_page_
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
-   if (active)
-   list_move(&pc->lru, &pc->mem_cgroup->active_list);
-   else
-   list_move(&pc->lru, &pc->mem_cgroup->inactive_list);
+   struct mem_cgroup *mem = pc->mem_cgroup;
+   struct mem_cgroup_stat *stat = &mem->stat;
+
+   if (active) {
+   list_move(&pc->lru, &mem->active_list);
+   if ((pc->flags & PCGF_ACTIVE) == 0) {
+   mem_cgroup_stat_dec(stat, MEM_CGROUP_STAT_INACTIVE);
+   mem_cgroup_stat_inc(stat, MEM_CGROUP_STAT_ACTIVE);
+   pc->flags |= PCGF_ACTIVE;
+   }
+   } else {
+   list_move(&pc->lru, &mem->inactive_list);
+   if ((pc->flags & PCGF_ACTIVE) != 0) {
+   mem_cgroup_stat_dec(stat, MEM_CGROUP_STAT_ACTIVE);
+   mem_cgroup_stat_inc(stat, MEM_CGROUP_STAT_INACTIVE);
+   pc->flags &= ~PCGF_ACTIVE;
+   }
+   }
 }
 
 /*
@@ -256,10 +325,11 @@ unsigned long mem_cgroup_isolate_pages(u
  * 0 if the charge was successful
 * < 0 if the cgroup is over its limit
  */
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
-   gfp_t

[Devel] [RFC] [PATCH] memory controller statistics

2007-09-06 Thread YAMAMOTO Takashi
hi,

i implemented some statistics for your memory controller.

it's tested with 2.6.23-rc2-mm2 + memory controller v7.
i think it can be applied to 2.6.23-rc4-mm1 as well.

YAMAMOTO Takashi

todo: something like nr_active/inactive in /proc/vmstat.

--- ./mm/memcontrol.c.BACKUP2007-08-29 17:13:09.0 +0900
+++ ./mm/memcontrol.c   2007-09-06 16:26:13.0 +0900
@@ -24,6 +24,7 @@
 #include <linux/page-flags.h>
 #include <linux/bit_spinlock.h>
 #include <linux/rcupdate.h>
+#include <linux/seq_file.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
@@ -33,6 +34,54 @@
 struct container_subsys mem_container_subsys;
 static const int MEM_CONTAINER_RECLAIM_RETRIES = 5;
 
+enum mem_container_stat_index {
+   /*
+* for MEM_CONTAINER_TYPE_ALL, usage == pagecache + rss
+*/
+   MEMCONT_STAT_PAGECACHE,
+   MEMCONT_STAT_RSS,
+
+   /*
+* redundant; usage == charge - uncharge
+*/
+   MEMCONT_STAT_CHARGE,
+   MEMCONT_STAT_UNCHARGE,
+
+   /*
+* mostly for debug
+*/
+   MEMCONT_STAT_ISOLATE,
+   MEMCONT_STAT_ISOLATE_FAIL,
+   MEMCONT_STAT_NSTATS,
+};
+
+static const char * const mem_container_stat_desc[] = {
+   [MEMCONT_STAT_PAGECACHE] = "page_cache",
+   [MEMCONT_STAT_RSS] = "rss",
+   [MEMCONT_STAT_CHARGE] = "charge",
+   [MEMCONT_STAT_UNCHARGE] = "uncharge",
+   [MEMCONT_STAT_ISOLATE] = "isolate",
+   [MEMCONT_STAT_ISOLATE_FAIL] = "isolate_fail",
+};
+
+struct mem_container_stat {
+   atomic_t count[MEMCONT_STAT_NSTATS];
+};
+
+static void mem_container_stat_inc(struct mem_container_stat * stat,
+   enum mem_container_stat_index idx)
+{
+
+   atomic_inc(&stat->count[idx]);
+}
+
+static void mem_container_stat_dec(struct mem_container_stat * stat,
+   enum mem_container_stat_index idx)
+{
+
+   atomic_dec(&stat->count[idx]);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per container. We would eventually like to provide
@@ -62,6 +111,7 @@ struct mem_container {
 */
spinlock_t lru_lock;
unsigned long control_type; /* control RSS or RSS+Pagecache */
+   struct mem_container_stat stat;
 };
 
 /*
@@ -72,6 +122,12 @@ struct mem_container {
 #define PAGE_CONTAINER_LOCK_BIT 0x0
 #define PAGE_CONTAINER_LOCK (1 << PAGE_CONTAINER_LOCK_BIT)
 
+/* XXX hack; shouldn't be here.  it really belongs to struct page_container. */
+#define PAGE_CONTAINER_CACHE_BIT 0x1
+#define PAGE_CONTAINER_CACHE (1 << PAGE_CONTAINER_CACHE_BIT)
+
+#define PAGE_CONTAINER_FLAGS (PAGE_CONTAINER_LOCK | PAGE_CONTAINER_CACHE)
+
 /*
  * A page_container page is associated with every page descriptor. The
  * page_container helps us identify information about the container
@@ -134,9 +190,9 @@ static inline int page_container_locked(
page->page_container);
 }
 
-void page_assign_page_container(struct page *page, struct page_container *pc)
+static void page_assign_page_container_flags(struct page *page, int flags,
+struct page_container *pc)
 {
-   int locked;
 
/*
 * While resetting the page_container we might not hold the
@@ -145,14 +201,20 @@ void page_assign_page_container(struct p
 */
if (pc)
VM_BUG_ON(!page_container_locked(page));
-   locked = (page->page_container & PAGE_CONTAINER_LOCK);
-   page->page_container = ((unsigned long)pc | locked);
+   flags |= (page->page_container & PAGE_CONTAINER_LOCK);
+   page->page_container = ((unsigned long)pc | flags);
+}
+
+void page_assign_page_container(struct page *page, struct page_container *pc)
+{
+
+   page_assign_page_container_flags(page, 0, pc);
 }
 
 struct page_container *page_get_page_container(struct page *page)
 {
return (struct page_container *)
-   (page->page_container & ~PAGE_CONTAINER_LOCK);
+   (page->page_container & ~PAGE_CONTAINER_FLAGS);
 }
 
 void __always_inline lock_page_container(struct page *page)
@@ -203,6 +265,7 @@ unsigned long mem_container_isolate_page
LIST_HEAD(pc_list);
struct list_head *src;
struct page_container *pc;
+   struct mem_container_stat *stat = &mem_cont->stat;

if (active)
src = &mem_cont->active_list;
@@ -244,6 +307,9 @@ unsigned long mem_container_isolate_page
if (__isolate_lru_page(page, mode) == 0) {
list_move(&page->lru, dst);
nr_taken++;
+   mem_container_stat_inc(stat, MEMCONT_STAT_ISOLATE);
+   } else {
+   mem_container_stat_inc(stat, MEMCONT_STAT_ISOLATE_FAIL);
}
}
 
@@ -260,9 +326,11 @@ unsigned long mem_container_isolate_page
  * 0 if the charge was successful
 * < 0 if the container is over its limit
  */
-int mem_container_charge(struct page *page, struct 

[Devel] Re: [RFC] [PATCH] memory controller statistics

2007-09-06 Thread YAMAMOTO Takashi
hi,

thanks for comments.

 Hi,
 
 On Fri,  7 Sep 2007 12:39:42 +0900 (JST)
 [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:
 
  +enum mem_container_stat_index {
  +   /*
  +* for MEM_CONTAINER_TYPE_ALL, usage == pagecache + rss
  +*/
  +   MEMCONT_STAT_PAGECACHE,
  +   MEMCONT_STAT_RSS,
  +
  +   /*
  +* redundant; usage == charge - uncharge
  +*/
  +   MEMCONT_STAT_CHARGE,
  +   MEMCONT_STAT_UNCHARGE,
  +
  +   /*
  +* mostly for debug
  +*/
  +   MEMCONT_STAT_ISOLATE,
  +   MEMCONT_STAT_ISOLATE_FAIL,
  +   MEMCONT_STAT_NSTATS,
  +};
  +
 please add comments on each statistics name.

sure.

 It's uneasy to catch the meaning of 
 ISOLATE/ISOLATE_FAIL without comments.

they aren't useful for users who don't read the relevant code.
probably they should be just removed.

  +static const char * const mem_container_stat_desc[] = {
   +   [MEMCONT_STAT_PAGECACHE] = "page_cache",
   +   [MEMCONT_STAT_RSS] = "rss",
   +   [MEMCONT_STAT_CHARGE] = "charge",
   +   [MEMCONT_STAT_UNCHARGE] = "uncharge",
   +   [MEMCONT_STAT_ISOLATE] = "isolate",
   +   [MEMCONT_STAT_ISOLATE_FAIL] = "isolate_fail",
  +};
  +
  +struct mem_container_stat {
  +   atomic_t count[MEMCONT_STAT_NSTATS];
  +};
  +
  +static void mem_container_stat_inc(struct mem_container_stat * stat,
  +   enum mem_container_stat_index idx)
  +{
  +
   +   atomic_inc(&stat->count[idx]);
   +}
   +
   +static void mem_container_stat_dec(struct mem_container_stat * stat,
   +   enum mem_container_stat_index idx)
   +{
   +
   +   atomic_dec(&stat->count[idx]);
  +}
  +
 
 Can we do this accounting as mod_zone_page_state()(in mm/vmstat.c) ?
 (use per-cpu data for accounting.) 

we can do so later.

  +/* XXX hack; shouldn't be here.  it really belongs to struct 
  page_container. */
   +#define PAGE_CONTAINER_CACHE_BIT 0x1
   +#define PAGE_CONTAINER_CACHE (1 << PAGE_CONTAINER_CACHE_BIT)
  +
 
 Is this used for remebering whether a page is charged as page-cache or not ?

yes.

  +   page_assign_page_container_flags(page,
  +   is_cache ? PAGE_CONTAINER_CACHE : 0, pc);
  +
   +   stat = &mem->stat;
  +   if (is_cache) {
  +   mem_container_stat_inc(stat, MEMCONT_STAT_PAGECACHE);
  +   } else {
  +   mem_container_stat_inc(stat, MEMCONT_STAT_RSS);
  +   }
 
 nitpick,in linux style, one-sentence block shouldn't have braces {}.
 
 ==
 if (is_cache) 
   mem_cont...
 else
   mem_cont...
 ==

sure.

  +   mem_container_stat_inc(stat, MEMCONT_STAT_CHARGE);
   
   spin_lock_irqsave(&mem->lru_lock, flags);
   list_add(&pc->lru, &mem->active_list);
  @@ -377,6 +454,12 @@ err:
  return -ENOMEM;
   }
   
  +int mem_container_charge(struct page *page, struct mm_struct *mm)
  +{
  +
  +   return mem_container_charge_common(page, mm, 0);
  +}
  +
   /*
* See if the cached pages should be charged at all?
*/
  @@ -388,7 +471,7 @@ int mem_container_cache_charge(struct pa
   
   mem = rcu_dereference(mm->mem_container);
   if (mem->control_type == MEM_CONTAINER_TYPE_ALL)
  -   return mem_container_charge(page, mm);
  +   return mem_container_charge_common(page, mm, 1);
  else
  return 0;
   }
  @@ -411,15 +494,29 @@ void mem_container_uncharge(struct page_
  return;
   
   if (atomic_dec_and_test(&pc->ref_cnt)) {
   +   struct mem_container_stat *stat;
   +   int is_cache;
   +
   page = pc->page;
   lock_page_container(page);
   mem = pc->mem_container;
   css_put(&mem->css);
  +   /* XXX */
 This kind of comment is bad.

sure.

   +   is_cache = (page->page_container & PAGE_CONTAINER_CACHE) != 0;
   page_assign_page_container(page, NULL);
   unlock_page_container(page);
   res_counter_uncharge(&mem->res, 1);
    
   +   stat = &mem->stat;
   +   if (is_cache) {
   +   mem_container_stat_dec(stat, MEMCONT_STAT_PAGECACHE);
   +   } else {
   +   mem_container_stat_dec(stat, MEMCONT_STAT_RSS);
   +   }
   +   mem_container_stat_inc(stat, MEMCONT_STAT_UNCHARGE);
   +
   spin_lock_irqsave(&mem->lru_lock, flags);
   +   BUG_ON(list_empty(&pc->lru));
 
 Why this BUG_ON() is added ?
 
 Thanks
 -Kame

to ensure that my understanding is correct.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [-mm PATCH 5/10] Memory controller task migration (v7)

2007-08-28 Thread YAMAMOTO Takashi
 YAMAMOTO Takashi wrote:
  Allow tasks to migrate from one container to the other. We migrate
  mm_struct's mem_container only when the thread group id migrates.
  
  +  /*
  +   * Only thread group leaders are allowed to migrate, the mm_struct is
  +   * in effect owned by the leader
  +   */
   +  if (p->tgid != p->pid)
  +  goto out;
  
  does it mean that you can't move a process between containers
  once its thread group leader exited?
  
  YAMAMOTO Takashi
 
 
 Hi,
 
 Good catch! Currently, we treat the mm as owned by the thread group leader.
 But this policy can be easily adapted to any other desired policy.
 Would you like to see it change to something else?
 
 -- 
   Warm Regards,
   Balbir Singh
   Linux Technology Center
   IBM, ISTL

although i have no good idea right now, something which allows
to move a process with its thread group leader dead would be better.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [-mm PATCH 5/10] Memory controller task migration (v7)

2007-08-27 Thread YAMAMOTO Takashi
 Allow tasks to migrate from one container to the other. We migrate
 mm_struct's mem_container only when the thread group id migrates.

 + /*
 +  * Only thread group leaders are allowed to migrate, the mm_struct is
 +  * in effect owned by the leader
 +  */
 + if (p->tgid != p->pid)
 + goto out;

does it mean that you can't move a process between containers
once its thread group leader exited?

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] Memory controller Add Documentation

2007-08-24 Thread YAMAMOTO Takashi
 +echo 1 > /proc/sys/vm/drop_pages will help get rid of some of the pages
 +cached in the container (page cache pages).

drop_caches

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [-mm PATCH 4/9] Memory controller memory accounting (v4)

2007-08-15 Thread YAMAMOTO Takashi
 YAMAMOTO Takashi wrote:
  +  lock_meta_page(page);
  +  /*
  +   * Check if somebody else beat us to allocating the meta_page
  +   */
  +  race_mp = page_get_meta_page(page);
  +  if (race_mp) {
  +  kfree(mp);
  +  mp = race_mp;
   +  atomic_inc(&mp->ref_cnt);
   +  res_counter_uncharge(&mem->res, 1);
  +  goto done;
  +  }
  
  i think you need css_put here.
 
 Thats correct. We do need css_put in this path.
 
 Thanks,
 Vaidy

v5 still seems to have the problem.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [-mm PATCH 8/9] Memory controller add switch to control what type of pages to limit (v4)

2007-08-13 Thread YAMAMOTO Takashi
 YAMAMOTO Takashi wrote:
  Choose if we want cached pages to be accounted or not. By default both
  are accounted for. A new set of tunables are added.
 
  echo -n 1 > mem_control_type
 
  switches the accounting to account for only mapped pages
 
  echo -n 2 > mem_control_type
 
  switches the behaviour back
  
  MEM_CONTAINER_TYPE_ALL is 3, not 2.
  
 
 Thanks, I'll fix the comment on the top.
 
  YAMAMOTO Takashi
  
  +enum {
  +  MEM_CONTAINER_TYPE_UNSPEC = 0,
  +  MEM_CONTAINER_TYPE_MAPPED,
  +  MEM_CONTAINER_TYPE_CACHED,

what's MEM_CONTAINER_TYPE_CACHED, btw?
it seems that nothing distinguishes it from MEM_CONTAINER_TYPE_MAPPED.

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [-mm PATCH 8/9] Memory controller add switch to control what type of pages to limit (v4)

2007-08-12 Thread YAMAMOTO Takashi
 Choose if we want cached pages to be accounted or not. By default both
 are accounted for. A new set of tunables are added.
 
 echo -n 1 > mem_control_type
 
 switches the accounting to account for only mapped pages
 
 echo -n 2 > mem_control_type
 
 switches the behaviour back

MEM_CONTAINER_TYPE_ALL is 3, not 2.

YAMAMOTO Takashi

 +enum {
 + MEM_CONTAINER_TYPE_UNSPEC = 0,
 + MEM_CONTAINER_TYPE_MAPPED,
 + MEM_CONTAINER_TYPE_CACHED,
 + MEM_CONTAINER_TYPE_ALL,
 + MEM_CONTAINER_TYPE_MAX,
 +} mem_control_type;
 +
 +static struct mem_container init_mem_container;

 + mem = rcu_dereference(mm->mem_container);
 + if (mem->control_type == MEM_CONTAINER_TYPE_ALL)
 + return mem_container_charge(page, mm);
 + else
 + return 0;
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [-mm PATCH 4/9] Memory controller memory accounting (v4)

2007-07-30 Thread YAMAMOTO Takashi
 + lock_meta_page(page);
 + /*
 +  * Check if somebody else beat us to allocating the meta_page
 +  */
 + race_mp = page_get_meta_page(page);
 + if (race_mp) {
 + kfree(mp);
 + mp = race_mp;
 + atomic_inc(&mp->ref_cnt);
 + res_counter_uncharge(&mem->res, 1);
 + goto done;
 + }

i think you need css_put here.
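
(sketch of the fix being pointed out, based on the quoted hunk; illustrative
only: the loser of the race has to drop the css reference it took earlier in
the charge path.)

	race_mp = page_get_meta_page(page);
	if (race_mp) {
		css_put(&mem->css);	/* drop the reference taken above */
		kfree(mp);
		mp = race_mp;
		atomic_inc(&mp->ref_cnt);
		res_counter_uncharge(&mem->res, 1);
		goto done;
	}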

YAMAMOTO Takashi
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][-mm PATCH 6/8] Memory controller add per container LRU and reclaim (v3)

2007-07-24 Thread YAMAMOTO Takashi
hi,

 +unsigned long mem_container_isolate_pages(unsigned long nr_to_scan,
 + struct list_head *dst,
 + unsigned long *scanned, int order,
 + int mode, struct zone *z,
 + struct mem_container *mem_cont,
 + int active)
 +{
 + unsigned long nr_taken = 0;
 + struct page *page;
 + unsigned long scan;
 + LIST_HEAD(mp_list);
 + struct list_head *src;
 + struct meta_page *mp;
 +
 + if (active)
 + src = &mem_cont->active_list;
 + else
 + src = &mem_cont->inactive_list;
 +
 + for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 + mp = list_entry(src->prev, struct meta_page, lru);
 + page = mp->page;
 +

- is it safe to pick the lists without mem_cont->lru_lock held?

- what prevents mem_container_uncharge from freeing this meta_page
 behind us?

YAMAMOTO Takashi
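
One way of answering both questions, sketched on the assumption that
mem_container_uncharge() takes the same mem_cont->lru_lock before unlinking
and freeing a meta_page; holding the lock across the whole walk then keeps
entries from being freed behind the scan:

	spin_lock(&mem_cont->lru_lock);
	if (active)
		src = &mem_cont->active_list;
	else
		src = &mem_cont->inactive_list;

	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
		mp = list_entry(src->prev, struct meta_page, lru);
		page = mp->page;
		/* isolate the page onto dst as before ... */
	}
	spin_unlock(&mem_cont->lru_lock);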


[Devel] Re: [-mm PATCH 4/8] Memory controller memory accounting (v2)

2007-07-13 Thread YAMAMOTO Takashi
 On 7/10/07, YAMAMOTO Takashi [EMAIL PROTECTED] wrote:
  hi,
 
   diff -puN mm/memory.c~mem-control-accounting mm/memory.c
   --- linux-2.6.22-rc6/mm/memory.c~mem-control-accounting   2007-07-05 
   13:45:18.0 -0700
   +++ linux-2.6.22-rc6-balbir/mm/memory.c   2007-07-05 
   13:45:18.0 -0700
 
   @@ -1731,6 +1736,9 @@ gotten:
 cow_user_page(new_page, old_page, address, vma);
 }
  
   + if (mem_container_charge(new_page, mm))
   + goto oom;
   +
 /*
  * Re-check the pte - we dropped the lock
  */
 
  it seems that the page will be leaked on error.
 
 You mean meta_page right?

no.  i meant 'new_page'.

YAMAMOTO Takashi
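
A sketch of the kind of fix implied above: release the freshly allocated COW
page before taking the oom exit, since that label (in the surrounding
do_wp_page code; an assumption here) only drops old_page:

	cow_user_page(new_page, old_page, address, vma);

	if (mem_container_charge(new_page, mm)) {
		page_cache_release(new_page);	/* do not leak the new COW page */
		goto oom;
	}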


[Devel] Re: [-mm PATCH 6/8] Memory controller add per container LRU and reclaim (v2)

2007-07-13 Thread YAMAMOTO Takashi
 Add the meta_page to the per container LRU. The reclaim algorithm has been
 modified to make the isolate_lru_pages() as a pluggable component. The
 scan_control data structure now accepts the container on behalf of which
 reclaims are carried out. try_to_free_pages() has been extended to become
 container aware.
 
 Signed-off-by: Balbir Singh [EMAIL PROTECTED]

it seems that the number of pages to scan (nr_active/nr_inactive
in shrink_zone) is calculated from NR_ACTIVE and NR_INACTIVE of the zone,
even in the case of per-container reclaim.  is it intended?

YAMAMOTO Takashi
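
For reference, the scan targets in question are computed roughly like this in
the 2.6.22-era shrink_zone() (a paraphrase of that code, so details are
approximate); both figures come from zone-wide counters whether or not the
scan is on behalf of a container:

	zone->nr_scan_active +=
		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
	nr_active = zone->nr_scan_active;

	zone->nr_scan_inactive +=
		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
	nr_inactive = zone->nr_scan_inactive;
	/* per-zone totals, not per-container LRU sizes */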


[Devel] Re: [-mm PATCH 4/8] Memory controller memory accounting (v2)

2007-07-13 Thread YAMAMOTO Takashi
hi,

 diff -puN mm/memory.c~mem-control-accounting mm/memory.c
 --- linux-2.6.22-rc6/mm/memory.c~mem-control-accounting   2007-07-05 
 13:45:18.0 -0700
 +++ linux-2.6.22-rc6-balbir/mm/memory.c   2007-07-05 13:45:18.0 
 -0700

 @@ -1731,6 +1736,9 @@ gotten:
   cow_user_page(new_page, old_page, address, vma);
   }
  
 + if (mem_container_charge(new_page, mm))
 + goto oom;
 +
   /*
* Re-check the pte - we dropped the lock
*/

it seems that the page will be leaked on error.

 @@ -2188,6 +2196,11 @@ static int do_swap_page(struct mm_struct
   }
  
   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 + if (mem_container_charge(page, mm)) {
 + ret = VM_FAULT_OOM;
 + goto out;
 + }
 +
   mark_page_accessed(page);
   lock_page(page);
  

ditto.

 @@ -2264,6 +2278,9 @@ static int do_anonymous_page(struct mm_s
   if (!page)
   goto oom;
  
 + if (mem_container_charge(page, mm))
 + goto oom;
 +
 + entry = mk_pte(page, vma->vm_page_prot);
   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
  

ditto.

can you check the rest of the patch by yourself?  thanks.

YAMAMOTO Takashi
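
The shape of the fix for the "ditto" sites would be the same: drop the
reference just taken before bailing out. A sketch for the do_anonymous_page
hunk above (the oom label and its behaviour are assumptions about the
surrounding code):

	if (!page)
		goto oom;

	if (mem_container_charge(page, mm)) {
		page_cache_release(page);	/* free the anonymous page just allocated */
		goto oom;
	}

	entry = mk_pte(page, vma->vm_page_prot);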