[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)

2007-09-20 Thread Hugh Dickins
On Tue, 18 Sep 2007, J. Bruce Fields wrote:
 On Tue, Sep 18, 2007 at 12:54:56PM -0400, Trond Myklebust wrote:
  
  It gets even better when you throw mmap() into the mix :-)
 
 Hm.  Documentation/mandatory.txt claims that mandatory locks and
 mmap() with MAP_SHARED exclude each other, but I can't see where that's
 enforced.  That file doesn't make any mention of the above race.

I believe the locks_verify_locked() call from mm/mmap.c prevents
mmap'ing shared-write a file with mandatory locks in force;
and the mapping_writably_mapped() calls from fs/locks.c prevent
mandatory locking on a file while it's mmap'ed shared-write.

Though I think there's no lock to prevent those checks racing,
so it's not quite watertight.
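
(For reference, roughly how those two checks fit together - this is a
paraphrase from memory, not the exact mainline code, so treat the names
and details as approximate:

	/* mm/mmap.c, MAP_SHARED with PROT_WRITE: */
	if (locks_verify_locked(inode))		/* mandatory lock already held? */
		return -EAGAIN;

	/* fs/locks.c, when a mandatory lock is being set: */
	if (IS_MANDLOCK(inode) &&
	    (inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID &&
	    mapping_writably_mapped(filp->f_mapping))	/* shared-writable mmap? */
		return -EAGAIN;

Each side checks the other's state, but nothing serializes the two checks
against each other.)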

Hugh



[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()

2007-10-07 Thread Hugh Dickins
On Fri, 5 Oct 2007, Balbir Singh wrote:
 
 Found-by: Hugh Dickins [EMAIL PROTECTED]
 
 mem_cgroup_charge() in unuse_pte() is called under a lock, the pte_lock.
 That's clearly incorrect, since we pass GFP_KERNEL to mem_cgroup_charge()
 for allocation of the page_cgroup.
 
 This patch releases the lock and reacquires it after the call to
 mem_cgroup_charge().
 
 Tested on a powerpc box by calling swapoff in the middle of a cgroup
 running a workload that pushes pages to swap.

Hard to test it adequately at present, while that call
to mem_cgroup_charge is never allocating anything new.

Sorry, it's a bit ugly (the intertwining of unuse_pte and its caller),
it's got a bug, and fixing that bug makes it uglier.

The bug is that you ignore the pte pointer returned by pte_offset_map_lock:
we could be preempted onto a different cpu just there, so a different
cpu's kmap_atomic area would be used, with a different pte pointer, which
would need to be passed back to the caller for when it unmaps.
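
Schematically, the shape the fix needs is below - just a sketch against
your patch, with "charged" as an illustrative local, to show that the
re-mapped pte has to be captured and handed back:

	pte_unmap_unlock(pte - 1, *ptl);	/* drop atomic kmap and lock */
	charged = !mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL);
	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
	/*
	 * pte may now point into a different cpu's kmap_atomic area than
	 * before: this value, not the old one, must reach the caller's unmap.
	 */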

I much prefer my patch appended further down: considering how it's safe
for you to drop the ptl there because of holding the page lock pushed me
into seeing that we can actually do our scanning without ptl, which in
many configurations has the advantage of staying preemptible (though
preemptible swapoff is not terribly high on anyone's ticklist ;).

But you may well prefer that we split it into two: with me taking
responsibility and blame for the preliminary patch which relaxes
the locking, and you then adding the mem_cgroup_charge (and the
exceptional mem_cgroup_uncharge_page) on top of that.

Hugh

 
 Signed-off-by: Balbir Singh [EMAIL PROTECTED]
 ---
 
  mm/swapfile.c |   16 
  1 file changed, 12 insertions(+), 4 deletions(-)
 
 diff -puN mm/swapfile.c~memory-controller-fix-unuse-pte-charging mm/swapfile.c
 --- linux-2.6.23-rc8/mm/swapfile.c~memory-controller-fix-unuse-pte-charging   
 2007-10-03 13:45:56.0 +0530
 +++ linux-2.6.23-rc8-balbir/mm/swapfile.c 2007-10-05 08:49:54.0 
 +0530
 @@ -507,11 +507,18 @@ unsigned int count_swap_pages(int type, 
   * just let do_wp_page work it out if a write is requested later - to
   * force COW, vm_page_prot omits write permission from any private vma.
   */
 -static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
 - unsigned long addr, swp_entry_t entry, struct page *page)
 +static int unuse_pte(struct vm_area_struct *vma, pte_t *pte, pmd_t *pmd,
 + unsigned long addr, swp_entry_t entry, struct page *page,
 + spinlock_t **ptl)
  {
 - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
 + pte_unmap_unlock(pte - 1, *ptl);
 +
 + if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) {
 + pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
   return -ENOMEM;
 + }
 +
 + pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
  
   inc_mm_counter(vma->vm_mm, anon_rss);
   get_page(page);
 @@ -543,7 +550,8 @@ static int unuse_pte_range(struct vm_are
* Test inline before going to call unuse_pte.
*/
   if (unlikely(pte_same(*pte, swp_pte))) {
 - ret = unuse_pte(vma, pte++, addr, entry, page);
 + ret = unuse_pte(vma, pte++, pmd, addr, entry, page,
 + &ptl);
   break;
   }
   } while (pte++, addr += PAGE_SIZE, addr != end);

--- 2.6.23-rc8-mm2/mm/swapfile.c2007-09-27 12:03:36.0 +0100
+++ linux/mm/swapfile.c 2007-10-07 14:33:05.0 +0100
@@ -507,11 +507,23 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
 {
+   spinlock_t *ptl;
+   pte_t *pte;
+   int ret = 1;
+
if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
-   return -ENOMEM;
+   ret = -ENOMEM;
+
+   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
+   if (ret > 0)
+   mem_cgroup_uncharge_page(page);
+   ret = 0;
+   goto out;
+   }
 
inc_mm_counter(vma->vm_mm, anon_rss);
get_page(page);
@@ -524,7 +536,9 @@ static int unuse_pte(struct vm_area_stru
 * immediately swapped out again after swapon.
 */
activate_page(page);
-   return 1;
+out:
+   pte_unmap_unlock(pte, ptl);
+   return ret;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -533,21 +547,33 @@ static int unuse_pte_range(struct vm_are
 {
pte_t swp_pte = swp_entry_to_pte(entry);
pte_t *pte

[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()

2007-10-22 Thread Hugh Dickins
On Mon, 15 Oct 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
  
  --- 2.6.23-rc8-mm2/mm/swapfile.c2007-09-27 12:03:36.0 +0100
  +++ linux/mm/swapfile.c 2007-10-07 14:33:05.0 +0100
  @@ -507,11 +507,23 @@ unsigned int count_swap_pages(int type, 
* just let do_wp_page work it out if a write is requested later - to
* force COW, vm_page_prot omits write permission from any private vma.
*/
  -static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
  +static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
  unsigned long addr, swp_entry_t entry, struct page *page)
...
 
 I tested this patch and it seems to be working fine. I tried swapoff -a
 in the middle of tests consuming swap. Not 100% rigorous, but a good
 test nevertheless.
 
 Tested-by: Balbir Singh [EMAIL PROTECTED]

Thanks, Balbir.  Sorry for the delay.  I've not forgotten our
agreement that I should be splitting it into before-and-after
mem cgroup patches.  But it's low priority for me until we're
genuinely assigning to a cgroup there.  Hope to get back to
looking into that tomorrow, but no promises.

I think you still see no problem, where I claim that simply
omitting the mem charge mods from mm/swap_state.c leads to OOMs?
Maybe our difference is because my memhog in the cgroup is using
more memory than RAM, not just more memory than allowed to the
cgroup.  I suspect that arrives at a state (when the swapcache
pages are not charged) where it cannot locate the pages it needs
to reclaim to stay within its limit.

Hugh


[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()

2007-10-25 Thread Hugh Dickins
On Wed, 24 Oct 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
  
  Thanks, Balbir.  Sorry for the delay.  I've not forgotten our
  agreement that I should be splitting it into before-and-after
  mem cgroup patches.  But it's low priority for me until we're
  genuinely assigning to a cgroup there.  Hope to get back to
  looking into that tomorrow, but no promises.
 
 No Problem. We have some time with this one.

Phew - I still haven't got there.

  I think you still see no problem, where I claim that simply
  omitting the mem charge mods from mm/swap_state.c leads to OOMs?
  Maybe our difference is because my memhog in the cgroup is using
  more memory than RAM, not just more memory than allowed to the
  cgroup.  I suspect that arrives at a state (when the swapcache
  pages are not charged) where it cannot locate the pages it needs
  to reclaim to stay within its limit.
 
 Yes, in my case I use less memory than RAM, but more than is allowed
 by the cgroup. It's quite possible that in your case the
 swapcache has grown significantly without any limit/control on it.
 The memhog program is using memory at a rate much higher than the
 rate of reclaim. Could you share your memhog program, please?

Gosh, it's nothing special.  Appended below, but please don't shame
me by taking it too seriously.  Defaults to working on a 600M mmap
because I'm in the habit of booting mem=512M.  You probably have
something better yourself that you'd rather use.

 In the use case you've mentioned/tested, having these mods to
 control swapcache is actually useful, right?

No idea what you mean by "these mods to control swapcache"?

With your mem_cgroup mods in mm/swap_state.c, swapoff assigns
the pages read in from swap to whoever's running swapoff and your
unuse_pte mem_cgroup_charge never does anything useful: swap pages
should get assigned to the appropriate cgroups at that point.

Without your mem_cgroup mods in mm/swap_state.c, unuse_pte makes
the right assignments (I believe).  But I find that swapout (using
600M in a 512M machine) from a 200M cgroup quickly OOMs, whereas
it behaves correctly with your mm/swap_state.c.

Thought little yet about what happens to shmem swapped pages,
and swap readahead pages; but still suspect that they and the
above issue will need a limbo cgroup, for pages which are
expected to belong to a not-yet-identified mem cgroup.

 
 Could you share your major objections at this point with the memory
 controller at this point. I hope to be able to look into/resolve them
 as my first priority in my list of items to work on.

The things I've noticed so far, as mentioned before and above.

But it does worry me that I only came here through finding swapoff
broken by that unuse_mm return value, and then found one issue
after another.  It feels like the mem cgroup people haven't really
thought through or tested swap at all, and that if I looked further
I'd uncover more.

That's simply FUD, and I apologize if I'm being unfair: but that
is how it feels, and I expect we all know that phase in a project
when solving one problem uncovers three - suggests it's not ready.

Hugh

/* swapout.c */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	unsigned long *base = (unsigned long *)0x08400000;
	unsigned long size;
	unsigned long limit;
	unsigned long i;
	char *ptr = NULL;

	size = argv[1]? strtoul(argv[1], &ptr, 0): 600;
	if (size >= 3*1024)
		size = 0;
	size *= 1024*1024;
	limit = size / sizeof(unsigned long);
	if (size == 0 || base + limit + 1024 > &size) {
		errno = EINVAL;
		perror("swapout");
		exit(1);
	}
	base = mmap(base, size, PROT_READ|PROT_WRITE,
		MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (base == (unsigned long *)(-1)) {
		perror("mmap");
		exit(1);
	}
	for (i = 0; i < limit; i++)
		base[i] = i;
	if (ptr && *ptr == '.') {
		printf("Type Return to continue ");
		fflush(stdout);
		getchar();
	}
	for (i = 0; i < limit; i++)
		base[i] = limit - i;
	return 0;
}
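
(In case it helps: I just do "cc -o swapout swapout.c" and run "./swapout"
or "./swapout 700." - the argument is in MB, and per the code above a
trailing '.' makes it pause for Return between the two passes.)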


[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()

2007-10-29 Thread Hugh Dickins
On Mon, 29 Oct 2007, Balbir Singh wrote:
 On Mon, Oct 29, 2007 at 01:57:40AM +0530, Balbir Singh wrote:
 Hugh Dickins wrote:
 
 [snip]
  
  Without your mem_cgroup mods in mm/swap_state.c, unuse_pte makes
  the right assignments (I believe).  But I find that swapout (using
  600M in a 512M machine) from a 200M cgroup quickly OOMs, whereas
  it behaves correctly with your mm/swap_state.c.
  
 
 On my UML setup, I booted the UML instance with 512M of memory and
 used the swapout program that you shared. I tried two things
 
 
 1. Ran swapout without any changes. The program ran well without
    any OOM condition occurring; a lot of reclaim occurred.
 2. Ran swapout with the changes to mm/swap_state.c removed (diff below)
    and I still did not see any OOM. The reclaim count was much lower,
    since swap cache did not get accounted back to the cgroup from
    which pages were being evicted.
 
 I am not sure why I don't see the OOM that you see, still trying. Maybe
 I'm missing something obvious at this late hour of the night :-)

I reconfirm that I do see those OOMs.  I'll have to try harder to
analyze how they come about: I sure don't expect you to debug a
problem you cannot reproduce.  But what happens if you try it
native rather than using UML?

Hugh


[Devel] [PATCH 1/6 mm] swapoff: scan ptes preemptibly

2007-11-08 Thread Hugh Dickins
Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency
in swapoff by scanning the page table preemptibly: so long as unuse_pte is
careful to recheck that entry under pte lock.

(To tell the truth, this patch was not inspired by any cries for lower
latency here: rather, this restructuring permits a future memory controller
patch to allocate with GFP_KERNEL in unuse_pte, where before it could not.
But it would be wrong to tuck this change away inside a memcgroup patch.)

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
This patch could go anywhere in the mm series before the memory-controller
patches: I suggest just after swapin-fix-valid-swaphandles-defect.patch
Subsequent patches N/6 go in different places amongst the memory-controller
patches: please see accompanying suggestions.

 mm/swapfile.c |   38 +++---
 1 file changed, 31 insertions(+), 7 deletions(-)

--- patch0/mm/swapfile.c2007-11-07 19:41:45.0 +
+++ patch1/mm/swapfile.c2007-11-08 12:34:12.0 +
@@ -506,9 +506,19 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
 {
+   spinlock_t *ptl;
+   pte_t *pte;
+   int found = 1;
+
+   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
+   found = 0;
+   goto out;
+   }
+
inc_mm_counter(vma->vm_mm, anon_rss);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
@@ -520,6 +530,9 @@ static void unuse_pte(struct vm_area_str
 * immediately swapped out again after swapon.
 */
activate_page(page);
+out:
+   pte_unmap_unlock(pte, ptl);
+   return found;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -528,22 +541,33 @@ static int unuse_pte_range(struct vm_are
 {
pte_t swp_pte = swp_entry_to_pte(entry);
pte_t *pte;
-   spinlock_t *ptl;
int found = 0;
 
-   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   /*
+* We don't actually need pte lock while scanning for swp_pte: since
+* we hold page lock and mmap_sem, swp_pte cannot be inserted into the
+* page table while we're scanning; though it could get zapped, and on
+* some architectures (e.g. x86_32 with PAE) we might catch a glimpse
+* of unmatched parts which look like swp_pte, so unuse_pte must
+* recheck under pte lock.  Scanning without pte lock lets it be
+* preemptible whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
+*/
+   pte = pte_offset_map(pmd, addr);
do {
/*
 * swapoff spends a _lot_ of time in this loop!
 * Test inline before going to call unuse_pte.
 */
if (unlikely(pte_same(*pte, swp_pte))) {
-   unuse_pte(vma, pte++, addr, entry, page);
-   found = 1;
-   break;
+   pte_unmap(pte);
+   found = unuse_pte(vma, pmd, addr, entry, page);
+   if (found)
+   goto out;
+   pte = pte_offset_map(pmd, addr);
}
} while (pte++, addr += PAGE_SIZE, addr != end);
-   pte_unmap_unlock(pte - 1, ptl);
+   pte_unmap(pte - 1);
+out:
return found;
 }
 


[Devel] [PATCH 2/6 mm] memcgroup: temporarily revert swapoff mod

2007-11-08 Thread Hugh Dickins
Whaaa?  This patch precisely reverts the "swapoff: scan ptes preemptibly"
patch just presented.  It's a temporary measure to allow existing memory
controller patches to apply without rejects: in due course they should be
rendered down into one sensible patch, and this reversion disappear.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
This patch should go immediately before the memory-controller patches,
or immediately before memory-controller-memory-accounting-v7.patch

 mm/swapfile.c |   38 +++---
 1 file changed, 7 insertions(+), 31 deletions(-)

--- patch1/mm/swapfile.c2007-11-08 12:34:12.0 +
+++ patch2/mm/swapfile.c2007-11-08 12:34:12.0 +
@@ -506,19 +506,9 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
+static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, swp_entry_t entry, struct page *page)
 {
-   spinlock_t *ptl;
-   pte_t *pte;
-   int found = 1;
-
-   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-   if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
-   found = 0;
-   goto out;
-   }
-
inc_mm_counter(vma->vm_mm, anon_rss);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
@@ -530,9 +520,6 @@ static int unuse_pte(struct vm_area_stru
 * immediately swapped out again after swapon.
 */
activate_page(page);
-out:
-   pte_unmap_unlock(pte, ptl);
-   return found;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -541,33 +528,22 @@ static int unuse_pte_range(struct vm_are
 {
pte_t swp_pte = swp_entry_to_pte(entry);
pte_t *pte;
+   spinlock_t *ptl;
int found = 0;
 
-   /*
-* We don't actually need pte lock while scanning for swp_pte: since
-* we hold page lock and mmap_sem, swp_pte cannot be inserted into the
-* page table while we're scanning; though it could get zapped, and on
-* some architectures (e.g. x86_32 with PAE) we might catch a glimpse
-* of unmatched parts which look like swp_pte, so unuse_pte must
-* recheck under pte lock.  Scanning without pte lock lets it be
-* preemptible whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
-*/
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
/*
 * swapoff spends a _lot_ of time in this loop!
 * Test inline before going to call unuse_pte.
 */
if (unlikely(pte_same(*pte, swp_pte))) {
-   pte_unmap(pte);
-   found = unuse_pte(vma, pmd, addr, entry, page);
-   if (found)
-   goto out;
-   pte = pte_offset_map(pmd, addr);
+   unuse_pte(vma, pte++, addr, entry, page);
+   found = 1;
+   break;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
-   pte_unmap(pte - 1);
-out:
+   pte_unmap_unlock(pte - 1, ptl);
return found;
 }
 


[Devel] [PATCH 3/6 mm] memcgroup: fix try_to_free order

2007-11-08 Thread Hugh Dickins
Why does try_to_free_mem_cgroup_pages try for order 1 pages?  It's called
when mem_cgroup_charge_common would go over the limit, and that's adding
an order 0 page.  I see no reason: it has to be a typo: fix it.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after memory-controller-add-per-container-lru-and-reclaim-v7.patch

 mm/vmscan.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- patch2/mm/vmscan.c  2007-11-08 15:46:21.0 +
+++ patch3/mm/vmscan.c  2007-11-08 15:48:08.0 +
@@ -1354,7 +1354,7 @@ unsigned long try_to_free_mem_cgroup_pag
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = vm_swappiness,
-   .order = 1,
+   .order = 0,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
};


[Devel] [PATCH 5/6 mm] memcgroup: fix zone isolation OOM

2007-11-08 Thread Hugh Dickins
mem_cgroup_charge_common shows a tendency to OOM without good reason,
when a memhog goes well beyond its rss limit but with plenty of swap
available.  Seen on x86 but not on PowerPC; seen when the next patch
omits swapcache from memcgroup, but we presume it can happen without.

mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria
for OOM avoidance.  Already it has to scan beyond the nr_to_scan limit
when it finds a !LRU page or an active page when handling inactive or
an inactive page when handling active.  It needs to do exactly the same
when it finds a page from the wrong zone (the x86 tests had two zones,
the PowerPC tests had only one).

Don't increment scan and then decrement it in these cases, just move
the incrementation down.  Fix recent off-by-one when checking against
nr_to_scan.  Cut out "Check if the meta page went away from under us",
presumably left over from early debugging: no amount of such checks
could save us if this list really were being updated without locking.

This change does make the unlimited scan while holding two spinlocks
even worse - bad for latency and bad for containment; but that's a
separate issue which is better left to be fixed a little later.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after 
bugfix-for-memory-cgroup-controller-avoid-pagelru-page-in-mem_cgroup_isolate_pages-fix.patch
or just before memory-cgroup-enhancements

 mm/memcontrol.c |   17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

--- patch4/mm/memcontrol.c  2007-11-08 16:03:33.0 +
+++ patch5/mm/memcontrol.c  2007-11-08 16:51:39.0 +
@@ -260,24 +260,20 @@ unsigned long mem_cgroup_isolate_pages(u
spin_lock(&mem_cont->lru_lock);
scan = 0;
list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
-   if (scan++ > nr_to_scan)
+   if (scan >= nr_to_scan)
break;
page = pc->page;
VM_BUG_ON(!pc);
 
-   if (unlikely(!PageLRU(page))) {
-   scan--;
+   if (unlikely(!PageLRU(page)))
continue;
-   }
 
if (PageActive(page) && !active) {
__mem_cgroup_move_lists(pc, true);
-   scan--;
continue;
}
if (!PageActive(page) && active) {
__mem_cgroup_move_lists(pc, false);
-   scan--;
continue;
}
 
@@ -288,13 +284,8 @@ unsigned long mem_cgroup_isolate_pages(u
if (page_zone(page) != z)
continue;
 
-   /*
-* Check if the meta page went away from under us
-*/
-   if (!list_empty(&pc->lru))
-   list_move(&pc->lru, &pc_list);
-   else
-   continue;
+   scan++;
+   list_move(&pc->lru, &pc_list);
 
if (__isolate_lru_page(page, mode) == 0) {
list_move(&page->lru, dst);


[Devel] [PATCH 6/6 mm] memcgroup: revert swap_state mods

2007-11-08 Thread Hugh Dickins
If we're charging rss and we're charging cache, it seems obvious that
we should be charging swapcache - as has been done.  But in practice
that doesn't work out so well: both swapin readahead and swapoff leave
the majority of pages charged to the wrong cgroup (the cgroup that
happened to read them in, rather than the cgroup to which they belong).

(Which is why unuse_pte's GFP_KERNEL while holding pte lock never
showed up as a problem: no allocation was ever done there, every page
read being already charged to the cgroup which initiated the swapoff.)

It all works rather better if we leave the charging to do_swap_page and
unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c
to what it was before the memory-controller patches.  This also speeds
up significantly a contained process working at its limit: because it
no longer needs to keep waiting for swap writeback to complete.

Is it unfair that swap pages become uncharged once they're unmapped,
even though they're still clearly private to particular cgroups?  For
a short while, yes; but PageReclaim arranges for those pages to go to
the end of the inactive list and be reclaimed soon if necessary.

shmem/tmpfs pages are a distinct case: their charging also benefits
from this change, but their second life on the lists as swapcache
pages may prove more unfair - that I need to check next.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after 5/6: the tree builds okay if it goes earlier, just after
memory-controller-bug_on.patch, but 5/6 fixes OOM made more likely by 6/6.
Alternatively, hand edit all of the mm/swap_state.c mods out of all of the
memory-controller patches which modify it.

 mm/swap_state.c |   15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

--- patch5/mm/swap_state.c  2007-11-08 15:58:50.0 +
+++ patch6/mm/swap_state.c  2007-11-08 16:01:11.0 +
@@ -17,7 +17,6 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
-#include <linux/memcontrol.h>
 
 #include <asm/pgtable.h>
 
@@ -79,11 +78,6 @@ static int __add_to_swap_cache(struct pa
BUG_ON(!PageLocked(page));
BUG_ON(PageSwapCache(page));
BUG_ON(PagePrivate(page));
-
-   error = mem_cgroup_cache_charge(page, current->mm, gfp_mask);
-   if (error)
-   goto out;
-
error = radix_tree_preload(gfp_mask);
if (!error) {
write_lock_irq(&swapper_space.tree_lock);
@@ -95,14 +89,10 @@ static int __add_to_swap_cache(struct pa
set_page_private(page, entry.val);
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
-   } else
-   mem_cgroup_uncharge_page(page);
-
+   }
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
-   } else
-   mem_cgroup_uncharge_page(page);
-out:
+   }
return error;
 }
 
@@ -143,7 +133,6 @@ void __delete_from_swap_cache(struct pag
BUG_ON(PageWriteback(page));
BUG_ON(PagePrivate(page));
 
-   mem_cgroup_uncharge_page(page);
radix_tree_delete(&swapper_space.page_tree, page_private(page));
set_page_private(page, 0);
ClearPageSwapCache(page);


[Devel] [PATCH 4/6 mm] memcgroup: reinstate swapoff mod

2007-11-08 Thread Hugh Dickins
This patch reinstates the "swapoff: scan ptes preemptibly" mod we started
with: in due course it should be rendered down into the earlier patches,
leaving us with a more straightforward mem_cgroup_charge mod to unuse_pte,
allocating with GFP_KERNEL while holding no spinlock and no atomic kmap.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after memory-controller-make-charging-gfp-mask-aware.patch or
you may prefer to insert 4-6 all together before memory-cgroup-enhancements

 mm/swapfile.c |   42 ++
 1 file changed, 34 insertions(+), 8 deletions(-)

--- patch3/mm/swapfile.c2007-11-08 15:48:08.0 +
+++ patch4/mm/swapfile.c2007-11-08 15:55:12.0 +
@@ -507,11 +507,23 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
 {
+   spinlock_t *ptl;
+   pte_t *pte;
+   int ret = 1;
+
if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
-   return -ENOMEM;
+   ret = -ENOMEM;
+
+   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
+   if (ret > 0)
+   mem_cgroup_uncharge_page(page);
+   ret = 0;
+   goto out;
+   }
 
inc_mm_counter(vma->vm_mm, anon_rss);
get_page(page);
@@ -524,7 +536,9 @@ static int unuse_pte(struct vm_area_stru
 * immediately swapped out again after swapon.
 */
activate_page(page);
-   return 1;
+out:
+   pte_unmap_unlock(pte, ptl);
+   return ret;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -533,21 +547,33 @@ static int unuse_pte_range(struct vm_are
 {
pte_t swp_pte = swp_entry_to_pte(entry);
pte_t *pte;
-   spinlock_t *ptl;
int ret = 0;
 
-   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   /*
+* We don't actually need pte lock while scanning for swp_pte: since
+* we hold page lock and mmap_sem, swp_pte cannot be inserted into the
+* page table while we're scanning; though it could get zapped, and on
+* some architectures (e.g. x86_32 with PAE) we might catch a glimpse
+* of unmatched parts which look like swp_pte, so unuse_pte must
+* recheck under pte lock.  Scanning without pte lock lets it be
+* preemptible whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
+*/
+   pte = pte_offset_map(pmd, addr);
do {
/*
 * swapoff spends a _lot_ of time in this loop!
 * Test inline before going to call unuse_pte.
 */
if (unlikely(pte_same(*pte, swp_pte))) {
-   ret = unuse_pte(vma, pte++, addr, entry, page);
-   break;
+   pte_unmap(pte);
+   ret = unuse_pte(vma, pmd, addr, entry, page);
+   if (ret)
+   goto out;
+   pte = pte_offset_map(pmd, addr);
}
} while (pte++, addr += PAGE_SIZE, addr != end);
-   pte_unmap_unlock(pte - 1, ptl);
+   pte_unmap(pte - 1);
+out:
return ret;
 }
 


[Devel] Re: [PATCH 6/6 mm] memcgroup: revert swap_state mods

2007-11-11 Thread Hugh Dickins
On Fri, 9 Nov 2007, KAMEZAWA Hiroyuki wrote:
 On Fri, 9 Nov 2007 07:14:22 + (GMT)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  If we're charging rss and we're charging cache, it seems obvious that
  we should be charging swapcache - as has been done.  But in practice
  that doesn't work out so well: both swapin readahead and swapoff leave
  the majority of pages charged to the wrong cgroup (the cgroup that
  happened to read them in, rather than the cgroup to which they belong).
 
 Thank you. I welcome this patch :)

Thank you!  But perhaps less welcome if I don't confirm...

 Could I confirm a change in the logic  ?
 
  * Before this patch, wrong swapcache charge is added to one who
called try_to_free_page().

try_to_free_pages?  No, I don't think any wrong charge was made
there.  It was when reading in swap pages.  The usual way is by
swapin_readahead, which reads in a cluster of swap pages, which
are quite likely to belong to different memcgroups, but were all
charged to the one which is doing the fault on its target page.
Another way is in swapoff, where they all got charged to whoever
was doing the swapoff (and the charging in unuse_pte was a no-op).
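
To spell out what I mean, here is a from-memory paraphrase of that path -
not the exact code, with the loop bounds and variables only illustrative.
Swapin readahead brings in a whole cluster around the target entry, and
with the mm/swap_state.c mods each page is charged, as it enters swap
cache, to current->mm, i.e. to whoever happens to be faulting or running
swapoff:

	/* paraphrase of swapin_readahead(), details approximate */
	for (offset = start; offset < start + cluster; offset++) {
		/*
		 * read_swap_cache_async -> add_to_swap_cache, which with the
		 * swap_state.c mods does mem_cgroup_cache_charge(page,
		 * current->mm, ...): so the whole cluster lands in the
		 * faulter's cgroup, whoever the pages really belong to.
		 */
		page = read_swap_cache_async(
			swp_entry(swp_type(entry), offset), vma, addr);
		if (page)
			page_cache_release(page);
	}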

  * After this patch, anonymous page's charge will drop to 0 when
page_remove_rmap() is called.

Yes, when its final (usually its only) page_remove_rmap is called.

Hugh


[Devel] Re: [PATCH][SHMEM] Factor out sbi-free_inodes manipulations

2007-11-23 Thread Hugh Dickins
Looks good, but we can save slightly more there (depending on config),
and I found your inc/dec names a little confusing, since the count is
going the other way: how do you feel about this version?  (I'd like it
better if those helpers could take a struct inode *, but they cannot.)
Hugh


From: Pavel Emelyanov [EMAIL PROTECTED]

The shmem_sb_info structure has a number of free_inodes. This
value is altered in appropriate places under spinlock and with
the sbi-max_inodes != 0 check.

Consolidate these manipulations into two helpers.

This is minus 42 bytes of shmem.o and minus 4 :) lines of code.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---

 mm/shmem.c |   72 ---
 1 file changed, 34 insertions(+), 38 deletions(-)

--- 2.6.24-rc3/mm/shmem.c   2007-11-07 04:21:45.0 +
+++ linux/mm/shmem.c2007-11-23 12:43:28.0 +
@@ -207,6 +207,31 @@ static void shmem_free_blocks(struct ino
}
 }
 
+static int shmem_reserve_inode(struct super_block *sb)
+{
+   struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+   if (sbinfo->max_inodes) {
+   spin_lock(&sbinfo->stat_lock);
+   if (!sbinfo->free_inodes) {
+   spin_unlock(&sbinfo->stat_lock);
+   return -ENOMEM;
+   }
+   sbinfo->free_inodes--;
+   spin_unlock(&sbinfo->stat_lock);
+   }
+   return 0;
+}
+
+static void shmem_free_inode(struct super_block *sb)
+{
+   struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+   if (sbinfo->max_inodes) {
+   spin_lock(&sbinfo->stat_lock);
+   sbinfo->free_inodes++;
+   spin_unlock(&sbinfo->stat_lock);
+   }
+}
+
 /*
  * shmem_recalc_inode - recalculate the size of an inode
  *
@@ -762,7 +787,6 @@ static int shmem_notify_change(struct de
 
 static void shmem_delete_inode(struct inode *inode)
 {
-   struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
struct shmem_inode_info *info = SHMEM_I(inode);
 
if (inode->i_op->truncate == shmem_truncate) {
@@ -777,11 +801,7 @@ static void shmem_delete_inode(struct in
}
}
BUG_ON(inode->i_blocks);
-   if (sbinfo->max_inodes) {
-   spin_lock(&sbinfo->stat_lock);
-   sbinfo->free_inodes++;
-   spin_unlock(&sbinfo->stat_lock);
-   }
+   shmem_free_inode(inode->i_sb);
clear_inode(inode);
 }
 
@@ -1398,15 +1418,8 @@ shmem_get_inode(struct super_block *sb, 
struct shmem_inode_info *info;
struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 
-   if (sbinfo->max_inodes) {
-   spin_lock(&sbinfo->stat_lock);
-   if (!sbinfo->free_inodes) {
-   spin_unlock(&sbinfo->stat_lock);
-   return NULL;
-   }
-   sbinfo->free_inodes--;
-   spin_unlock(&sbinfo->stat_lock);
-   }
+   if (shmem_reserve_inode(sb))
+   return NULL;
 
inode = new_inode(sb);
if (inode) {
@@ -1450,11 +1463,8 @@ shmem_get_inode(struct super_block *sb, 
NULL);
break;
}
-   } else if (sbinfo->max_inodes) {
-   spin_lock(&sbinfo->stat_lock);
-   sbinfo->free_inodes++;
-   spin_unlock(&sbinfo->stat_lock);
-   }
+   } else
+   shmem_free_inode(sb);
return inode;
 }
 
@@ -1797,22 +1807,14 @@ static int shmem_create(struct inode *di
 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 {
struct inode *inode = old_dentry->d_inode;
-   struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
/*
 * No ordinary (disk based) filesystem counts links as inodes;
 * but each new link needs a new dentry, pinning lowmem, and
 * tmpfs dentries cannot be pruned until they are unlinked.
 */
-   if (sbinfo->max_inodes) {
-   spin_lock(&sbinfo->stat_lock);
-   if (!sbinfo->free_inodes) {
-   spin_unlock(&sbinfo->stat_lock);
-   return -ENOSPC;
-   }
-   sbinfo->free_inodes--;
-   spin_unlock(&sbinfo->stat_lock);
-   }
+   if (shmem_reserve_inode(inode->i_sb))
+   return -ENOSPC;
 
dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
@@ -1827,14 +1829,8 @@ static int shmem_unlink(struct inode *di
 {
struct inode *inode = dentry->d_inode;
 
-   if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) {
-   struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-   if (sbinfo->max_inodes) {
-   spin_lock(&sbinfo->stat_lock);
-   sbinfo->free_inodes++;
-   spin_unlock(&sbinfo->stat_lock

[Devel] Re: [PATCH][SHMEM] Factor out sbi-free_inodes manipulations

2007-11-27 Thread Hugh Dickins
On Mon, 26 Nov 2007, Andrew Morton wrote:
 
 It is peculiar to (wrongly) return -ENOMEM
 
  +   if (shmem_reserve_inode(inode->i_sb))
  +   return -ENOSPC;
 
 and to then correct it in the caller..

Oops, I missed that completely.

 Something boringly conventional such as the below, perhaps?

Much better, thanks.
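
For the record, the "boringly conventional" shape - my reconstruction,
names mine, not Andrew's actual patch - is simply to have the helper
return the final error itself, and let callers propagate it:

	static int shmem_reserve_inode(struct super_block *sb)
	{
		struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
		if (sbinfo->max_inodes) {
			spin_lock(&sbinfo->stat_lock);
			if (!sbinfo->free_inodes) {
				spin_unlock(&sbinfo->stat_lock);
				return -ENOSPC;	/* the error callers want */
			}
			sbinfo->free_inodes--;
			spin_unlock(&sbinfo->stat_lock);
		}
		return 0;
	}

	/* and, e.g. in shmem_link(): */
	ret = shmem_reserve_inode(inode->i_sb);
	if (ret)
		return ret;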

Hugh



[Devel] Re: [RFC/PATCH] cgroup swap subsystem

2008-03-05 Thread Hugh Dickins
On Wed, 5 Mar 2008, Pavel Emelyanov wrote:
 Daisuke Nishimura wrote:
  
  Todo:
- rebase new kernel, and split into some patches.
- Merge with memory subsystem (if it would be better), or
  remove dependency on CONFIG_CGROUP_MEM_CONT if possible
  (needs to make page_cgroup more generic one).
 
 Merge is a must IMHO. I can hardly imagine a situation in which
 someone would need these two separately.

Strongly agree.  Nobody's interested in swap as such: it's just
secondary memory, where RAM is primary memory.  People want to
control memory as the sum of the two; and I expect they may also
want to control primary memory (all that the current memcg does)
within that.  I wonder if such nesting of limits fits easily
into cgroups or will be problematic.

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.
 
 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.

I see Andrew has already acted, so it's now moot.  But I'd like to
say that I do agree with you and the conclusion to hold off for now.

I was a bit alarmed earlier to see those patches sailing on through;
but realized that I'd done very little to substantiate my hatred of
the whole thing, and decided that I didn't feel strongly enough to
stand in the way now.  But I am glad you've stepped in, thank you.

(Different topic, but one day I ought to get around to saying again
how absurd I think a swap controller; whereas a mem+swap controller
makes plenty of sense.  I think Rik and others said the same.)

By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
but no use of it, when doing swapoff a week ago.  Not investigated
at all, I'm afraid, but at a guess it might come from memrlimit work
placing too much faith in the mm_users count - swapoff is only one
of several places which have to inc/dec mm_users for some reason.
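
To be more concrete about the mm_users point, here is a rough paraphrase
of the try_to_unuse() pattern - not exact code: it pins each mm with
mm_users while scanning, and the final mmput, with its exit_mmap and
uncharging, can then run long after the task which was mm->owner has
exited and been freed:

	/* paraphrase of mm/swapfile.c try_to_unuse(), details approximate */
	atomic_inc(&mm->mm_users);	/* pin the mm while its ptes are scanned */
	spin_unlock(&mmlist_lock);
	/* ... unuse_mm(mm, entry, page) walks the page tables ... */
	mmput(mm);			/* may be the final reference: exit_mmap(),
					 * and so memrlimit uncharging, runs here,
					 * perhaps after mm->owner's task is gone */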

BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde =  
Oops:  [#1] PREEMPT SMP 
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
snd_seq_device thermal ac battery button

Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[<7817078f>] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518  
    7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 
   7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3 
Call Trace:
 [<78161323>] ? exit_mmap+0xaf/0x133
 [<781226b1>] ? mmput+0x4c/0xba
 [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
 [<78371534>] ? _spin_unlock+0x22/0x3c
 [<7816636a>] ? sys_swapoff+0x17b/0x37c
 [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
 ===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 
08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 
50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b 
EIP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:
 
  (Different topic, but one day I ought to get around to saying again
  how absurd I think a swap controller; whereas a mem+swap controller
  makes plenty of sense.  I think Rik and others said the same.)
 
 Agreed that a swap controller without a memory controller doesn't make
 much sense, but a memory controller without a swap controller can make
 sense on machines that don't intend to use swap.

I agree that a memory controller without a swap controller can
make sense: I hope so, anyway, since that's what's in mainline.
Even if swap is used, memory is a more precious resource than swap,
and you were right to go about controlling memory first.

 
 So if they were separate controllers, we'd use the proposed cgroup
 dependency features to make the swap controller depend on the memory
 controller - in which case you'd only be able to mount the swap
 controller on a hierarchy that also had the memory controller, and the
 swap controller would be able to make use of the page ownership
 information.
 
 It's more of a modularity issue than a functionality issue, I think -
 the swap controller and memory controller are tracking fundamentally
 different things (space on disk versus pages in memory), and the only
 dependency between them is the memory controller tracking the
 ownership of a page and providing it to the swap controller.

It sounds as if you're interpreting my mem+swap controller as a
mem controller and a swap controller and the swap controller makes
use of some of the mem controller infrastructure.

No, I'm trying to say something stronger than that.  I'm saying,
as I've said before, that I cannot imagine why anyone would want
to control swap itself - what they want to control is the total
of mem+swap.  Swap is a second-class citizen, nobody wants swap
if they can have mem, so why control it separately?

IIRC Rik expressed the same by pointing out that a cgroup at its
swap limit would then be forced to grow in mem (until it hits its
mem limit): so controlling the less precious resource would increase
pressure on the more precious resource.  (Actually, that probably
bears little relation to what he said - sorry, Rik!)  I don't recall
what answer he got, perhaps I'd be persuaded if I heard it again.

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Balbir Singh wrote:
 
 I'll try and recreate the problem and fix it. If
 memrlimit_cgroup_uncharge_as() created the problem, it's most likely
 related to mm->owner not being correct and we are dereferencing the
 wrong memory.
 
 Thanks for the bug report, I'll look further.

Good luck!  I have only seen it once, on a dual-core laptop; though
I don't remember to try swapoff while busy as often as I should (be
sure to alternate between a couple or more of swapareas, so you can
swap a new one on just before swapping an old one off, to be pretty
sure of success).

May be easier to find in the source: my suspicion is that a bad
mm_users assumption will come into it.  But I realize now that it
could be entirely unrelated to memrlimit, just that uncharge_as
was the one to get hit by bad refcounting elsewhere.

Oh, that reminds me, I never reported back on my res_counter warnings
at shutdown: never saw them again, once I added in the set of changes
you came up with shortly after that - thanks.

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins
On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
 On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  IIRC Rik expressed the same by pointing out that a cgroup at its
  swap limit would then be forced to grow in mem (until it hits its
  mem limit): so controlling the less precious resource would increase
  pressure on the more precious resource.  (Actually, that probably
  bears little relation to what he said - sorry, Rik!)  I don't recall
  what answer he got, perhaps I'd be persuaded if I heard it again.
  
 Added Nishimura to CC.
 
 IMHO, from user point of view, both of
  - having 2 controls as mem controller + swap controller
  - mem + swap controller
 doesn't have much difference. The users will use as they like.

I'm not suggesting either one of those alternatives.

I'm suggesting we have a mem controller (the thing we already have)
and a mem+swap controller (which we don't yet have: a controller
for the total mem+swap of a cgroup); the mem+swap controller likely
making use of much that is in the mem controller, as Paul has said.

(Unfortunately I don't have a good name for this mem+swap.)

I happen to believe that the mem+swap controller would actually be
a lot more useful than the current mem controller, and would expect
many to run with mem+swap controller enabled but mem controller
disabled or unlimited.  How much is mem and how much is swap being
left to global reclaim to decide, not imposed by any cgroup policy.

What I don't like the sound of at all is a swap controller.  Do you
think that a mem controller (limit 1G) and a mem+swap controller
(limit 2G) is equivalent to a mem controller (limit 1G) and a
swap controller (limit 1G)?  No: imagine memory pressure from
outside the cgroup - with the mem+swap controller it can push as
much as suits of the 2G out to swap; whereas with the swap controller,
once 1G is out, it has to stop pushing any more of that cgroup out.
I think that's absurd - but perhaps I just haven't looked, and
I've totally misinterpreted the talk of a swap controller.

 
 From memory controller's point of view, treating mem+swap by the same
 controller makes sense. Because the memory controller can check whether we can use
 more swap or not, we can avoid hopeless-scanning of Anon at swap-shortage.
 (By split-lru, I think we can do this avoidance.)

That's a detail I'm not concerned with on this level.

  
 Another-Topic?
 
 In recent servers, memory is big, swap is (relatively) small.

You'll know much more about those common proportions than I do.
I'd wonder why such big memory servers have any swap at all:
to cope with VM management defects we should be fixing?

 And under memory resource controller, the whole swap is easily occupied
 by a group. I want to avoid it.

Why?  I presume because you're thinking it a precious resource.
I don't think its relative smallness makes it more precious.

 
 For users, swap is not precious because it's not fast. 

Yes, and that's my view.

 But for memory reclaiming, swap is precious resource to page out
 anonymous/shmem/tmpfs memory.

I see that makes swap a useful resource, I don't see that it makes
it a precious resource.  We page out to it precisely because it's
less precious than the memory; both users and kernel would much
prefer to keep all the data in memory, but sometimes there isn't
enough memory so we go to swap.

There is just one way in which I see swap as precious, and that
is to get around some VM management stupidity.  If, for example,
on i386 there's a shortage of lowmem and lots of anonymous in lowmem
that we should shift to highmem, then I think it's still the case
that we have to do that balancing via writing out to and reading
in from swap, because nobody has actually hooked up page migration
to do that when appropriate?  But that's an argument for extending
page migration, not for needing a swap controller.

 I think usual system-admin considers swap as some emergency spare of memory.

Yes, I do too.

 I'd like to allow this emergency spare to each cgroup.

We do allow that emergency spare to each cgroup.  Perhaps you're
saying you want to divide it up in advance between the cgroups?
But why?  Sounds like a nice idea (reminds me of what Paul said
about using temporary files), but a solution to what problem?

 (For example, swap is used even if vm.swappiness==0. This is for avoiding
 the OOM-killer under some situations; this behavior was added by Rik.)

Sorry, I don't know what you're referring to there, but again,
suspect it's a detail we don't need to be concerned with here.

 
 == following is another use case I explained to Rik at 23/May/08 ==
 
 IIRC, a man showed his motivation to control swap at an OLS2007 BOF as follows.
 Consider the following system (and there is no swap controller):
 Memory 4G, Swap 1G, with 2 cgroups A and B.
 
 state 1) swap is not used.
   A: memory limit to be 1G, no swap usage, memory_usage=0M
   B: memory limit to be 1G, no swap usage, memory_usage=0M
 
 state 2) Run

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 12:46 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
  No, I'm trying to say something stronger than that.  I'm saying,
  as I've said before, that I cannot imagine why anyone would want
  to control swap itself - what they want to control is the total
  of mem+swap.  Swap is a second-class citizen, nobody wants swap
  if they can have mem, so why control it separately?
 
 Scheduling jobs on to machines is much more straightforward when they
 request xGB of memory and yGB of swap rather than just (x+y)GB of
 (memory+swap). We want to be able to guarantee to jobs that they will
 be able to use xGB of real memory.

I don't see that I'm denying you a way to guarantee that (though I've
been thinking more of the limits than the guarantees): I'm not saying
that you cannot have a mem controller, I'm saying that you can also
have a mem+swap controller; but that a swap-by-itself controller
makes no sense to me.

 Actually my preferred approach to swap controlling would be something like:
 
  - allow malloc to support mmapping pages from a temporary file rather
  than mmapping anonymous memory

I think that works until you get to fork: shared files and
private/anonymous/swap behave differently from then on.
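
To make the contrast concrete, here is a toy sketch of the idea - names,
sizes and details mine, not Paul's proposal: an arena mmapped from an
unlinked temporary file consists of shared file pages, so after fork()
parent and child see each other's writes (and the pages are written back
to the file under pressure), whereas ordinary anonymous memory is private
copy-on-write and goes to swap:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>

	#define ARENA_SIZE (4*1024*1024)	/* size chosen arbitrarily */

	int main(void)
	{
		char path[] = "/tmp/arenaXXXXXX";
		int fd = mkstemp(path);
		unlink(path);		/* file lives on only via fd and mapping */
		ftruncate(fd, ARENA_SIZE);

		/* file-backed "malloc" arena: shared file pages */
		char *arena = mmap(NULL, ARENA_SIZE, PROT_READ|PROT_WRITE,
					MAP_SHARED, fd, 0);
		/* ordinary anonymous memory: private, COW after fork */
		char *anon = mmap(NULL, ARENA_SIZE, PROT_READ|PROT_WRITE,
					MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

		strcpy(arena, "before fork");
		strcpy(anon, "before fork");
		if (fork() == 0) {
			strcpy(arena, "child wrote");	/* parent sees this */
			strcpy(anon, "child wrote");	/* parent does not */
			return 0;
		}
		wait(NULL);
		printf("arena: %s\nanon:  %s\n", arena, anon);
		return 0;
	}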

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins
On Fri, 25 Jul 2008, Balbir Singh wrote:
 
 I see what you're saying. When you look at Linux right now, we control swap
 independent of memory, so I am not totally opposed to setting swap, instead of
 swap+mem. I might not want to swap from a particular cgroup, in which case, I
 set swap to 0 and risk OOMing, which might be an acceptable trade-off 
 depending
 on my setup. I could easily change this policy on demand and add swap if 
 OOMing
 was no longer OK.

It's taken me a while to understand your point.  I think you're
saying that with a swap controller, you can set the swap limit to 0
on a cgroup if you want to keep it entirely in memory, without setting
any mem limit upon it; whereas with my mem+swap controller, you'd have
to set a mem limit then an equal mem+swap limit to achieve the same
"never go to swap" effect, and maybe you don't want to set a mem limit.

Hmm, but an unreachably high mem limit, and equal mem+swap limit,
would achieve that effect.  Sorry, I don't think I have understood
(and even if the unreachably high limit didn't work, this seems more
about setting a don't-swap flag than imposing a swap limit).

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Hugh Dickins
On Tue, 5 Aug 2008, Balbir Singh wrote:
 Hugh Dickins wrote:
 [snip]
  
  BUG: unable to handle kernel paging request at 6b6b6b8b
   IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
   Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
    [<78161323>] ? exit_mmap+0xaf/0x133
    [<781226b1>] ? mmput+0x4c/0xba
    [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
    [<78371534>] ? _spin_unlock+0x22/0x3c
    [<7816636a>] ? sys_swapoff+0x17b/0x37c
    [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
 
 I am unable to reproduce the problem,

Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then
back to 2.6.26-rc8-mm1.  But I've been SO stupid: saw it originally
on one machine with SLAB_DEBUG=y, have been trying since mostly on
another with SLUB_DEBUG=y, but never thought to boot with
slub_debug=P,task_struct until now.

 but I do have an initial hypothesis
 
  CPU0  CPU1
    try_to_unuse
  task 1 starts exiting look at mm = task1->mm
  ..increment mm_users
  task 1 exits
  mm->owner needs to be updated, but
  no new owner is found
  (mm_users > 1, but no other task
  has task->mm = task1->mm)
  mm_update_next_owner() leaves
  
  grace period
    user count drops, call mmput(mm)
  task 1 freed
    dereferencing mm->owner fails

Yes, that looks right to me: seems obvious now.  I don't think your
careful alternation of CPU0/1 events at the end matters: the swapoff
CPU simply dereferences mm->owner after that task has gone.

(That's a shame, I'd always hoped that mm->owner->comm was going to
be good for use in mm messages, even when tearing down the mm.)

 I do have a potential solution in mind, but I want to make sure my
 hypothesis is correct.

It seems wrong that memrlimit_cgroup_uncharge_as should be called
after mm->owner may have been changed, even if it's to something safe.
But I forget the mm/task exit details, surely they're tricky.

By the way, is the ordering in mm_update_next_owner the best?
Would there be less movement if it searched amongst siblings before
it searched amongst children?  Ought it to make a first pass trying
to stay within the same cgroup?

Hugh