RE: System slowdown on 2.4.1-ac20 (recurring from 2.4.0)

2001-02-28 Thread Balbir Singh

Just FYI:

I remember posting something a few days ago to make the serial console more
reliable for exactly such situations. Some allocations in the serial port
driver are done at runtime using page_alloc; if the system runs out of
memory, the serial tty driver will not work properly.

I am not saying that you ran out of memory. All I am saying is that it is
possible to make the serial tty driver more reliable by using boot-time
initialization.
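
To illustrate the idea: the point is to grab the buffers once at
initialization instead of allocating on demand, so the console output path
can never fail under memory pressure. The following is only a userspace
sketch of the pattern (the names, sizes and buffer pool are made up for
illustration), not the actual driver change:

#include <stdio.h>
#include <stdlib.h>

#define NBUF	4
#define BUFSZ	4096

static char *bufs[NBUF];

/* Done once at startup ("boot time"); an allocation failure is caught
 * early instead of in the middle of a crisis. */
static int init_buffers(void)
{
	int i;

	for (i = 0; i < NBUF; i++) {
		bufs[i] = malloc(BUFSZ);
		if (!bufs[i])
			return -1;
	}
	return 0;
}

/* Hot path: no allocation, so it keeps working when memory is gone. */
static char *get_buffer(int i)
{
	return bufs[i % NBUF];
}

int main(void)
{
	if (init_buffers() < 0)
		return 1;
	sprintf(get_buffer(0), "console output survives OOM\n");
	fputs(get_buffer(0), stdout);
	return 0;
}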

Please excuse me if you find this a little off-topic.



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Vibol Hou
Sent: Monday, February 26, 2001 4:25 PM
To: Linux-Kernel
Subject: System slowdown on 2.4.1-ac20 (recurring from 2.4.0)


I reported this problem a long while ago, but no one answered my pleas.
To tell you the honest truth, I don't know where to begin looking.  It's
difficult to poke around when the serial console is unresponsive :/

When I was running 2.4.0, the system, a dual-processor webserver, would
_completely_ slow down after about 3 days of constant uptime (and a few
million pages served).  I mean _SLOW_.  I could get commands executed, but
it would take an unholy long time to type the commands in.  It seemed the
server was dropping lots of packets.  All TCP services simply stopped or
slowed.  ICMP packet loss to the server would range sporadically from 50% to
75%.  Web service was rendered useless.  SSH _barely_ worked.  The handful of
commands I could run (w, free, memstat, top) showed nothing out of the
ordinary.  Back then, I didn't have a serial console set up.

Now I'm running 2.4.1-ac20, and I set up a serial console to try to catch any
errors.  I was hoping the problem wouldn't recur with this newer kernel, but
it still happens, now at about 5 days of uptime.  When I manage to
get in a 'shutdown -h now' through SSH, the serial console spits out:

INIT: Switching to runlevel: 0
INIT:

And that's it.  It doesn't even seem to be able to finish shutting down.
Thus far, no one else has reported problems similar to mine, so it makes me
wonder what is wrong.  The system ran fine with an uptime of over
100 days with the old 2.2.17 kernel.  What stymies me is the fact that the
serial console is completely unresponsive to input when the server gets into
this state.

Having said that, does anyone have any ideas or pointers for me?  Again,
this may seem like a fairly uninformative e-mail, but that's just because I
can't do anything on the server when it gets into this state.  If there is
anything you recommend I do when this happens again (other than restart the
system), please let me know and I'll try it.

--
Vibol Hou
KhmerConnection, http://khmer.cc




Is it useful to support user level drivers

2001-06-21 Thread Balbir Singh

I realize that the Linux kernel supports user-level drivers (via ioperm,
etc.). However, interrupts at user level are not supported. Does anyone
think it would be a good idea to add user-level interrupt support? I have
a framework for it, but it still needs a lot of work.
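
For context, the port-I/O side of a user-level driver looks roughly like
the sketch below; what ioperm cannot provide is a way to field the device's
interrupts, which is the gap the framework mentioned above would fill. The
port number is an arbitrary assumption for illustration (x86 only, requires
root):

#include <stdio.h>
#include <sys/io.h>

#define LPT_DATA 0x378	/* legacy parallel port, purely illustrative */

int main(void)
{
	/* Ask the kernel for access to a single I/O port. */
	if (ioperm(LPT_DATA, 1, 1) < 0) {
		perror("ioperm");
		return 1;
	}

	outb(0xff, LPT_DATA);			/* drive the data lines high */
	printf("read back: 0x%02x\n", inb(LPT_DATA));

	ioperm(LPT_DATA, 1, 0);			/* drop the access again */
	return 0;
}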

Depending on the response I get, I can send out more email. Please cc me
on replies, as I am no longer on the Linux kernel mailing list - due to
the humble size of my mailbox.

Balbir




Re: A signal fairy tale

2001-06-26 Thread Balbir Singh

Shouldn't there be a sigclose() and other operations to make the API
orthogonal? sigopen() should be selective about the signals it allows
as arguments. Try to make sigopen() thread-specific, so that one thread
doing a sigopen() does not imply it will do all the signal handling for
all the threads.

Does using sigopen() imply that signal(), sigaction(), etc. cannot be used?
In the same process, a library could do a sigopen() while the process
uses sigaction()/signal() without knowing what the library does (which
signals it handles, etc.).

Let me know when somebody has a patch or needs help; I would like to
help or take a look at it.

Balbir

|NAME
|   sigopen - open a signal as a file descriptor
| 
|SYNOPSIS
|   #include <signal.h>
| 
|   int sigopen(int signum);
| 
|DESCRIPTION
|   The sigopen system call opens signal number signum as a file descriptor.
|   That signal is no longer delivered normally or available for pickup
|   with sigwait() et al.  Instead, it must be picked up by calling
|   read() on the file descriptor returned by sigopen(); the buffer passed to
|   read() must have a size which is a multiple of sizeof(siginfo_t).
|   Multiple signals may be picked up with a single call to read().
|   When that file descriptor is closed, the signal is available once more 
|   for traditional use.
|   A signal number cannot be opened more than once concurrently; sigopen() 
|   thus provides a way to avoid signal usage clashes in large programs.
|
|RETURN VALUE
|   sigopen() returns the new file descriptor, or -1 on error (in which case, errno
|   is set appropriately).
|
|ERRORS
|   EWOULDBLOCK signal is already open
|
|NOTES
|   read() will block when reading from a file descriptor opened by sigopen()
|   until a signal is available unless fcntl(fd, F_SETFL, O_NONBLOCK) is called
|   to set it into nonblocking mode.
|
|HISTORY
|   sigopen() first appeared in the 2.5.2 Linux kernel.
|
|Linux  July, 2001 1   
|
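
To make the proposed calling convention concrete, here is how a client
might use it. To be clear, sigopen() does not exist in any kernel or libc;
this is purely a sketch of the interface described in the man page above:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	siginfo_t info[4];	/* buffer size: a multiple of sizeof(siginfo_t) */
	ssize_t n;
	int fd, i;

	fd = sigopen(SIGUSR1);	/* proposed syscall, not a real one */
	if (fd < 0)
		return 1;

	/* One read() may return several queued signals. */
	n = read(fd, info, sizeof(info));
	for (i = 0; i < (int)(n / (ssize_t)sizeof(siginfo_t)); i++)
		printf("got signal %d from pid %d\n",
		       info[i].si_signo, (int)info[i].si_pid);

	close(fd);	/* the signal reverts to normal delivery */
	return 0;
}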




Re: A signal fairy tale

2001-06-27 Thread Balbir Singh

|
| Let me know, when somebody has a patch or needs help, I would like to
| help or take a look at it.
|
|Maybe we can both hack on this.
|

Sure, that should be interesting. Did you have something in mind? We can
start right away.




Re: Reg kdb utility

2001-07-03 Thread Balbir Singh

You need to compile with the correct kernel headers, using the include
path feature: -I <path to new headers>

Balbir



On Tue, 3 Jul 2001, SATHISH.J wrote:

|Hi,
|
|I tried to use kdb on my 2.2.14-12 kernel. I was able to compile the file 
|/usr/src/linux/arch/i386/kdb/modules/kdbm_vm.c and could get the object
|file. When I tried to insert it as a module, it gives the following error
|message:
|
|./kdbm_vm.o: kernel-module version mismatch
|./kdbm_vm.o was compiled for kernel version .2.14-12
|while this kernel is version 2.2.14-12.
|
|
|
|Please tell me why this message comes.
|
|Thanks in advance,
|
|Regards,
|satish.j




[PATCH -mm] Taskstats fix getdelays usage information

2007-04-19 Thread Balbir Singh


Add usage to getdelays.c. This patch was originally posted by Randy Dunlap
http://lkml.org/lkml/2007/3/19/168

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
---

 Documentation/accounting/getdelays.c |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff -puN Documentation/accounting/getdelays.c~fix-getdelays-usage 
Documentation/accounting/getdelays.c
--- linux-2.6.20/Documentation/accounting/getdelays.c~fix-getdelays-usage   
2007-04-19 14:41:45.0 +0530
+++ linux-2.6.20-balbir/Documentation/accounting/getdelays.c2007-04-19 
14:42:26.0 +0530
@@ -72,6 +72,16 @@ struct msgtemplate {
 
 char cpumask[100+6*MAX_CPUS];
 
+static void usage(void)
+{
+   fprintf(stderr, "getdelays [-dilv] [-w logfile] [-r bufsize] "
+   "[-m cpumask] [-t tgid] [-p pid]\n");
+   fprintf(stderr, "  -d: print delayacct stats\n");
+   fprintf(stderr, "  -i: print IO accounting (works only with -p)\n");
+   fprintf(stderr, "  -l: listen forever\n");
+   fprintf(stderr, "  -v: debug on\n");
+}
+
 /*
  * Create a raw netlink socket and bind
  */
@@ -227,7 +237,7 @@ int main(int argc, char *argv[])
struct msgtemplate msg;
 
while (1) {
-   c = getopt(argc, argv, "diw:r:m:t:p:v:l");
+   c = getopt(argc, argv, "diw:r:m:t:p:vl");
	if (c < 0)
break;
 
@@ -277,7 +287,7 @@ int main(int argc, char *argv[])
loop = 1;
break;
default:
-   printf("Unknown option %d\n", c);
+   usage();
exit(-1);
}
}
_

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[PATCH -mm] Taskstats fix the structure members alignment issue

2007-04-20 Thread Balbir Singh
192,# virtmem
200,# hiwater_rss
208,# hiwater_vm
216,# read_char
224,# write_char
232,# read_syscalls
240,# write_syscalls
248,# read_bytes
256,# write_bytes
264,# cancelled_write_bytes
);


Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---

 include/linux/taskstats.h |   11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff -puN include/linux/taskstats.h~fix-taskstats-alignment 
include/linux/taskstats.h
--- linux-2.6.20/include/linux/taskstats.h~fix-taskstats-alignment  
2007-04-19 12:28:25.0 +0530
+++ linux-2.6.20-balbir/include/linux/taskstats.h   2007-04-19 
13:21:48.0 +0530
@@ -66,7 +66,7 @@ struct taskstats {
/* Delay waiting for cpu, while runnable
 * count, delay_total NOT updated atomically
 */
-   __u64   cpu_count;
+   __u64   cpu_count __attribute__((aligned(8)));
__u64   cpu_delay_total;
 
	/* Following four fields atomically updated using task->delays->lock */
@@ -101,14 +101,17 @@ struct taskstats {
 
/* Basic Accounting Fields start */
 	char	ac_comm[TS_COMM_LEN];	/* Command name */
-	__u8	ac_sched;	/* Scheduling discipline */
+	__u8	ac_sched __attribute__((aligned(8)));
+			/* Scheduling discipline */
 	__u8	ac_pad[3];
-   __u32   ac_uid; /* User ID */
+   __u32   ac_uid __attribute__((aligned(8)));
+   /* User ID */
__u32   ac_gid; /* Group ID */
__u32   ac_pid; /* Process ID */
__u32   ac_ppid;/* Parent process ID */
__u32   ac_btime;   /* Begin time [sec since 1970] */
-   __u64   ac_etime;   /* Elapsed time [usec] */
+   __u64   ac_etime __attribute__((aligned(8)));
+   /* Elapsed time [usec] */
__u64   ac_utime;   /* User CPU time [usec] */
	__u64   ac_stime;   /* System CPU time [usec] */
__u64   ac_minflt;  /* Minor Page Fault Count */
_
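
The problem the aligned(8) annotations address can be demonstrated outside
the kernel. The struct below is a cut-down stand-in for taskstats, not the
real header: on i386 a 64-bit integer is only 4-byte aligned inside a
struct, so a 32-bit application and a 64-bit kernel disagree about member
offsets unless the alignment is forced:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct old_layout {			/* stand-in for the broken layout */
	uint32_t ac_btime;
	uint64_t ac_etime;		/* offset 4 on i386, 8 on x86_64 */
};

struct fixed_layout {			/* stand-in for the patched layout */
	uint32_t ac_btime;
	uint64_t ac_etime __attribute__((aligned(8)));	/* offset 8 everywhere */
};

int main(void)
{
	printf("old: %zu, fixed: %zu\n",
	       offsetof(struct old_layout, ac_etime),
	       offsetof(struct fixed_layout, ac_etime));
	return 0;
}

Compiled with gcc -m32 and -m64, the first struct reports different
ac_etime offsets (4 vs 8), while the second reports 8 in both cases, which
is what lets a 32-bit reader parse a 64-bit kernel's taskstats.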

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH -mm] Taskstats fix the structure members alignment issue

2007-04-21 Thread Balbir Singh

Andrew Morton wrote:

On Fri, 20 Apr 2007 22:13:41 +0530
Balbir Singh [EMAIL PROTECTED] wrote:


We broke the alignment of members of taskstats to the 8 byte boundary
with the CSA patches. In the current kernel, the taskstats structure is
not suitable for use by 32 bit applications on a 64 bit kernel.



ugh, that was bad of us.


Yes :-)




...
The patch adds an __attribute__((aligned(8))) to the
taskstats structure members so that 32 bit applications using taskstats
can work with a 64 bit kernel.


But there might be 32-bit applications out there which are using the
present wrong structure?

otoh, I assume that those applications would be using taskstats.h and would
hence encounter this bug and we would have heard about it, is that correct?



Yes, correct.


otoh^2, 32-bit applications running under 32-bit kernels will presently be
functioning correctly, and your change will require that those applications
be recompiled, I think?



Yes, correct. They would be broken with this fix. We could bump up the
version TASKSTATS_VERSION to 4. Would you like a new patch with the
version bumped up?



This patch looks like 2.6.20 and 2.6.21 material, but very carefully...


Yes, 2.6.20 and 2.6.21 sound correct.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH -mm] Taskstats fix the structure members alignment issue

2007-04-21 Thread Balbir Singh

Andrew Morton wrote:

On Sat, 21 Apr 2007 18:29:21 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


The patch adds an __attribute__((aligned(8))) to the
taskstats structure members so that 32 bit applications using taskstats
can work with a 64 bit kernel.

But there might be 32-bit applications out there which are using the
present wrong structure?

otoh, I assume that those applications would be using taskstats.h and would
hence encounter this bug and we would have heard about it, is that correct?


Yes, correct.


otoh^2, 32-bit applications running under 32-bit kernels will presently be
functioning correctly, and your change will require that those applications
be recompiled, I think?


Yes, correct. They would be broken with this fix. We could bump up the
version TASKSTATS_VERSION to 4. Would you like a new patch with the
version bumped up?


I can do that.


Thanks




This patch looks like 2.6.20 and 2.6.21 material, but very carefully...

Yes, 2.6.20 and 2.6.21 sound correct.


OK.  I guess we have little choice but to slam it in asap, with a 2.6.20.x 
backport
before too many people start using the old interface.


Thanks, again!

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH 8/8] Per-container pages reclamation

2007-04-24 Thread Balbir Singh

Pavel Emelianov wrote:

Implement try_to_free_pages_in_container() to free the
pages in container that has run out of memory.

The scan_control->isolate_pages() function isolates the
container pages only.



Pavel,

I've just started playing around with these patches; I preferred
the approach of v1. Please see below.


+static unsigned long isolate_container_pages(unsigned long nr_to_scan,
+   struct list_head *src, struct list_head *dst,
+   unsigned long *scanned, struct zone *zone)
+{
+   unsigned long nr_taken = 0;
+   struct page *page;
+   struct page_container *pc;
+   unsigned long scan;
+   LIST_HEAD(pc_list);
+
+   for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+   pc = list_entry(src->prev, struct page_container, list);
+   page = pc->page;
+   if (page_zone(page) != zone)
+   continue;


shrink_zone() will walk all pages looking for pages belonging to this
container, and this slows down reclaim quite a bit. Although we've
reused code, we've ended up walking the entire list of the zone to
find pages belonging to a particular container; this was the same
problem I had with my RSS controller patches.


+
+   list_move(&pc->list, &pc_list);
+



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH 8/8] Per-container pages reclamation

2007-04-24 Thread Balbir Singh

Pavel Emelianov wrote:

Balbir Singh wrote:

Pavel Emelianov wrote:

Implement try_to_free_pages_in_container() to free the
pages in container that has run out of memory.

The scan_control->isolate_pages() function isolates the
container pages only.


Pavel,

I've just started playing around with these patches, I preferred
the approach of v1. Please see below


+static unsigned long isolate_container_pages(unsigned long nr_to_scan,
+struct list_head *src, struct list_head *dst,
+unsigned long *scanned, struct zone *zone)
+{
+unsigned long nr_taken = 0;
+struct page *page;
+struct page_container *pc;
+unsigned long scan;
+LIST_HEAD(pc_list);
+
+for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+pc = list_entry(src->prev, struct page_container, list);
+page = pc->page;
+if (page_zone(page) != zone)
+continue;

shrink_zone() will walk all pages looking for pages belonging to this


No. shrink_zone() will walk container pages looking for pages in the
desired zone. A scan through the full zone is done only on global
memory shortage.



Yes, I see that now. But for each zone in the system, we walk through the
container's list - right?

I have some more fixes, improvements that I want to send across.
I'll start sending them out to you as I test and verify them.



container and this slows down the reclaim quite a bit. Although we've
reused code, we've ended up walking the entire list of the zone to
find pages belonging to a particular container, this was the same
problem I had with my RSS controller patches.


+
+list_move(&pc->list, &pc_list);
+







--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 5/7] Per-container OOM killer and page reclamation

2007-03-09 Thread Balbir Singh

Hi, Pavel,

Please find my patch to add LRU behaviour to your latest RSS controller.

Balbir Singh
Linux Technology Center
IBM, ISTL
Add LRU behaviour to the RSS controller patches posted by Pavel Emelianov

	http://lkml.org/lkml/2007/3/6/198

which was in turn similar to the RSS controller posted by me

	http://lkml.org/lkml/2007/2/26/8

Pavel's patches have a per-container list of pages, which helps reduce the
reclaim time of the RSS controller, but the per-container list of pages is
in FIFO order. I've implemented active and inactive lists per container to
help select the right set of pages to reclaim when the container is under
memory pressure.

I've tested these patches on a ppc64 machine and they work fine for
the minimal testing I've done.

Pavel, would you please include these patches in your next iteration?

Comments, suggestions and further improvements are as always welcome!

Signed-off-by: [EMAIL PROTECTED]
---

 include/linux/rss_container.h |1 
 mm/rss_container.c|   47 +++---
 mm/swap.c |5 
 mm/vmscan.c   |3 ++
 4 files changed, 44 insertions(+), 12 deletions(-)

diff -puN include/linux/rss_container.h~rss-container-lru2 include/linux/rss_container.h
--- linux-2.6.20/include/linux/rss_container.h~rss-container-lru2	2007-03-09 22:52:56.0 +0530
+++ linux-2.6.20-balbir/include/linux/rss_container.h	2007-03-10 00:39:59.0 +0530
@@ -19,6 +19,7 @@ int container_rss_prepare(struct page *,
 void container_rss_add(struct page_container *);
 void container_rss_del(struct page_container *);
 void container_rss_release(struct page_container *);
+void container_rss_move_lists(struct page *pg, bool active);
 
 int mm_init_container(struct mm_struct *mm, struct task_struct *tsk);
 void mm_free_container(struct mm_struct *mm);
diff -puN mm/rss_container.c~rss-container-lru2 mm/rss_container.c
--- linux-2.6.20/mm/rss_container.c~rss-container-lru2	2007-03-09 22:52:56.0 +0530
+++ linux-2.6.20-balbir/mm/rss_container.c	2007-03-10 02:42:54.0 +0530
@@ -17,7 +17,8 @@ static struct container_subsys rss_subsy
 
 struct rss_container {
 	struct res_counter res;
-	struct list_head page_list;
+	struct list_head inactive_list;
+	struct list_head active_list;
 	struct container_subsys_state css;
 };
 
@@ -96,6 +97,26 @@ void container_rss_release(struct page_c
 	kfree(pc);
 }
 
+void container_rss_move_lists(struct page *pg, bool active)
+{
+	struct rss_container *rss;
+	struct page_container *pc;
+
+	if (!page_mapped(pg))
+		return;
+
+	pc = page_container(pg);
+	BUG_ON(!pc);
+	rss = pc->cnt;
+
+	spin_lock_irq(&rss->res.lock);
+	if (active)
+		list_move(&pc->list, &rss->active_list);
+	else
+		list_move(&pc->list, &rss->inactive_list);
+	spin_unlock_irq(&rss->res.lock);
+}
+
 void container_rss_add(struct page_container *pc)
 {
 	struct page *pg;
@@ -105,7 +126,7 @@ void container_rss_add(struct page_conta
 	rss = pc->cnt;
 
 	spin_lock(&rss->res.lock);
-	list_add(&pc->list, &rss->page_list);
+	list_add(&pc->list, &rss->active_list);
 	spin_unlock(&rss->res.lock);
 
 	page_container(pg) = pc;
@@ -141,7 +162,10 @@ unsigned long container_isolate_pages(un
 	struct zone *z;
 
 	spin_lock_irq(&rss->res.lock);
-	src = &rss->page_list;
+	if (active)
+		src = &rss->active_list;
+	else
+		src = &rss->inactive_list;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 		pc = list_entry(src->prev, struct page_container, list);
@@ -152,13 +176,10 @@ unsigned long container_isolate_pages(un
 
 		spin_lock(&z->lru_lock);
 		if (PageLRU(page)) {
-			if ((active && PageActive(page)) ||
-					(!active && !PageActive(page))) {
-				if (likely(get_page_unless_zero(page))) {
-					ClearPageLRU(page);
-					nr_taken++;
-					list_move(&page->lru, dst);
-				}
+			if (likely(get_page_unless_zero(page))) {
+				ClearPageLRU(page);
+				nr_taken++;
+				list_move(&page->lru, dst);
 			}
 		}
 		spin_unlock(&z->lru_lock);
@@ -212,7 +233,8 @@ static int rss_create(struct container_s
 		return -ENOMEM;
 
 	res_counter_init(&rss->res);
-	INIT_LIST_HEAD(&rss->page_list);
+	INIT_LIST_HEAD(&rss->inactive_list);
+	INIT_LIST_HEAD(&rss->active_list);
 	cont->subsys[rss_subsys.subsys_id] = &rss->css;
 	return 0;
 }
@@ -284,7 +306,8 @@ static __init int rss_create_early(struc
 
 	rss = &init_rss_container;
 	res_counter_init(&rss->res);
-	INIT_LIST_HEAD(&rss->page_list);
+	INIT_LIST_HEAD(&rss->inactive_list);
+	INIT_LIST_HEAD(&rss->active_list);
 	cont->subsys[rss_subsys.subsys_id] = &rss->css;
 	ss->create = rss_create;
 	return 0;
diff -puN mm/vmscan.c~rss-container-lru2 mm/vmscan.c
--- linux-2.6.20/mm/vmscan.c~rss-container-lru2	2007-03-09 22:52:56.0 +0530
+++ linux-2.6.20-balbir/mm/vmscan.c	2007-03-10 00:42:35.0 +0530
@@ -1142,6 +1142,7 @@ static unsigned long container_shrink_pa
 			else
 				add_page_to_inactive_list(z, page);
 			spin_unlock_irq(&z->lru_lock);
+			container_rss_move_lists(page, false);
 
 			put_page(page);
 		}
@@ -1191,6 +1192,7 @@ static void

Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Balbir Singh

On 3/11/07, Andrew Morton [EMAIL PROTECTED] wrote:

 On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
  On Tue, 06 Mar 2007 17:55:29 +0300
  Pavel Emelianov [EMAIL PROTECTED] wrote:
 
 
 +struct rss_container {
 +   struct res_counter res;
 +   struct list_head page_list;
 +   struct container_subsys_state css;
 +};
 +
 +struct page_container {
 +   struct page *page;
 +   struct rss_container *cnt;
 +   struct list_head list;
 +};
 
 
  ah.  This looks good.  I'll find a hunk of time to go through this work
  and through Paul's patches.  It'd be good to get both patchsets lined
  up in -mm within a couple of weeks.  But..
 
  We need to decide whether we want to do per-container memory limitation via
  these data structures, or whether we do it via a physical scan of some
  software zone, possibly based on Mel's patches.
 i.e. a separate memzone for each container?

Yep.  Straightforward machine partitioning.  An attractive thing is that it
100% reuses existing page reclaim, unaltered.


We discussed zones for resource control and some of the disadvantages at
 http://lkml.org/lkml/2006/10/30/222

I need to look at Mel's patches to determine if they are suitable for
resource control. But in a thread of discussion on those patches, it was agreed
that memory fragmentation and resource control are independent issues.




 imho memzone approach is inconvenient for page sharing and shares accounting.
 it also makes memory management more strict, forbids overcommitting
 per-container etc.

umm, who said they were requirements?



We discussed some of the requirements in the RFC: Memory Controller
requirements thread
http://lkml.org/lkml/2006/10/30/51


 Maybe you have some ideas how we can decide on this?

We need to work out what the requirements are before we can settle on an
implementation.

Sigh.  Who is running this show?   Anyone?



All the stakeholders involved in the RFC discussion :-) We've been
talking and building on top of each other's patches. I hope that was a
good answer ;)


You can actually do a form of overcommitment by allowing multiple
containers to share one or more of the zones.  Whether that is sufficient
or suitable I don't know.  That depends on the requirements, and we haven't
even discussed those, let alone agreed to them.



There are other things like resizing a zone, finding the right size,
etc. I'll look
at Mel's patches to see what is supported.

Warm Regards,
Balbir Singh


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Balbir Singh

doesn't look so good to me, mainly because of the
additional per-page data and per-page processing

on 4GB memory, with 100 guests, 50% shared for each
guest, this basically means ~1 million pages, 500k shared
and 1500k x sizeof(page_container) entries, which
roughly boils down to ~25MB of wasted memory ...

increase the amount of shared pages and it starts
getting worse, but maybe I'm missing something here

 We need to decide whether we want to do per-container memory
 limitation via these data structures, or whether we do it via a
 physical scan of some software zone, possibly based on Mel's patches.

why not do simple page accounting (as done currently
in Linux) and use that for the limits, without
keeping the reference from container to page?

best,
Herbert



Herbert,

You lost me in the cc list and I almost missed this part of the
thread. Could you please not modify the cc list?

Thanks,
Balbir


Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Balbir Singh

On 3/12/07, Dave Hansen [EMAIL PROTECTED] wrote:

On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
 now VE2 maps the same page. You can't determine whether this page is mapped
 to this container or another one w/o the page->container pointer.

Hi Kirill,

I thought we can always get from the page to the VMA.  rmap provides
this to us via page->mapping and the 'struct address_space' or anon_vma.
Do we agree on that?

We can also get from the vma to the mm very easily, via vma->vm_mm,
right?

We can also get from a task to the container quite easily.

So, the only question becomes whether there is a 1:1 relationship
between mm_structs and containers.  Does each mm_struct belong to one
and only one container?  Basically, can a threaded process have
different threads in different containers?

It seems that we could bridge the gap pretty easily by either assigning
each mm_struct to a container directly, or putting some kind of
task-to-mm lookup.  Perhaps just a list like
mm->tasks_using_this_mm_list.

Not rocket science, right?

-- Dave


These patches are very similar to what I posted at
   http://lwn.net/Articles/223829/
In my patches, the thread group leader owns the mm_struct and all
threads belong to the same container. I did not have a per-container
LRU; walking the global list for reclaim was a bit slow, but otherwise
my patches did not add anything to struct page.

I used rmap information to get to the VMA and then the mm_struct.
Kirill, it is possible to determine all the containers that map the
page. Please see the page_in_container() function of
http://lkml.org/lkml/2007/2/26/7.

I was also thinking of using the page table(s) to identify all pages
belonging to a container, by obtaining all the mm_structs of tasks
belonging to a container. But this approach would not work well for
the page cache controller, when we add that to our memory controller.

Balbir


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-12 Thread Balbir Singh

hmm, it is very unlikely that this would happen,
for several reasons ... and indeed, checking the
thread in my mailbox shows that akpm dropped you ...



But, I got Andrew's email.



Subject: [RFC][PATCH 2/7] RSS controller core
From: Pavel Emelianov [EMAIL PROTECTED]
To: Andrew Morton [EMAIL PROTECTED], Paul Menage [EMAIL PROTECTED],
Srivatsa Vaddagiri [EMAIL PROTECTED],
Balbir Singh [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED],
Linux Kernel Mailing List linux-kernel@vger.kernel.org
Date: Tue, 06 Mar 2007 17:55:29 +0300

Subject: Re: [RFC][PATCH 2/7] RSS controller core
From: Andrew Morton [EMAIL PROTECTED]
To: Pavel Emelianov [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org
Date: Tue, 6 Mar 2007 14:00:36 -0800

that's the one I 'group' replied to ...

 Could you please not modify the cc list.

I never modify the cc unless explicitly asked
to do so. I wish others would have it that way
too :)



That's good to know, but my mailer shows


Andrew Morton [EMAIL PROTECTED]
to  Pavel Emelianov [EMAIL PROTECTED]   
cc  
Paul Menage [EMAIL PROTECTED],
Srivatsa Vaddagiri [EMAIL PROTECTED],
Balbir Singh [EMAIL PROTECTED] (see I am HERE),
devel@openvz.org,
Linux Kernel Mailing List linux-kernel@vger.kernel.org,
[EMAIL PROTECTED],
Kirill Korotaev [EMAIL PROTECTED]   
dateMar 7, 2007 3:30 AM 
subject Re: [RFC][PATCH 2/7] RSS controller core
mailed-by   vger.kernel.org 
On Tue, 06 Mar 2007 17:55:29 +0300

and your reply as

Andrew Morton [EMAIL PROTECTED],
Pavel Emelianov [EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org   
to  Andrew Morton [EMAIL PROTECTED] 
cc  
Pavel Emelianov [EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
[EMAIL PROTECTED],
Paul Menage [EMAIL PROTECTED],
List linux-kernel@vger.kernel.org   
dateMar 9, 2007 10:18 PM
subject Re: [RFC][PATCH 2/7] RSS controller core
mailed-by   vger.kernel.org

I am not sure what went wrong. Could you please check your mail
client, because it seemed to even change the email address to smtp.osdl.org,
which bounced back when I wrote to you earlier.


best,
Herbert



Cheers,
Balbir


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Balbir Singh

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.



We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Balbir Singh

Nick Piggin wrote:

Balbir Singh wrote:

Nick Piggin wrote:



And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and a container hits its limit, with a per-container
LRU, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.


Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to
even out.



Exactly!


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. For sharing of nodes
2. Resizing nodes
3. May be more


But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
core VM as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.

That said, I don't think this patch is terribly intrusive either.



Thanks, that's one of our goals: to keep it simple, understandable and
non-intrusive.




With the node-per-container idea, it will hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.


I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages
will get charged to more frequent users.




Yes, true, but what if a user does not want to control the page
cache usage in a particular container, or wants to turn off
RSS control?


However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?



I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.



We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.


But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.



For (2), we have the basic setup in the form of a per-container LRU list
and a pointer from struct page to the container that first brought in
the page.


I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?



I agree with your development model suggestion. One of the things we are
going to do in the near future is to build (2) and then add (3) and (4).
So far, we've not encountered any difficulties in building on top of (1).

Vaidy, any comments?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: taskstats accounting info

2007-03-14 Thread Balbir Singh

Randy.Dunlap wrote:

Hi,

Documentation/accounting/delay-accounting.txt says that the
getdelays program has a -c cmd argument, but that option
does not seem to exist in Documentation/accounting/getdelays.c.

Do you have an updated version of getdelays.c?
If not, please correct that documentation.



Yes, I did, but then I changed my laptop. I should have it archived
someplace; I'll dig it out or correct the documentation.


Is getdelays.c the best available example of a program
using the taskstats netlink interface?



It's the most portable example, since it does not depend on libnl.
It needs some cleaning up. I hope to get to it after the OLS
paper submission deadline.


Thanks,


Thanks for bringing the issue to my notice,

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Balbir Singh

Nick Piggin wrote:

Kirill Korotaev wrote:


The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.



User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner
container, who has charged it, is the most easy/clean way to handle
all the problems with dynamic context change, races, etc.,
and 1 pointer in the page struct is just 0.1% overhead.


The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a
good review at the moment, but I should eventually.



This patch is not really beancounters.

1. It uses the containers framework
2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8)

I would say that beancounters are changing and evolving.


Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.



big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical
memory areas)


The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem.

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really
significant. The benefits of simplicity seem to outweigh the downside.


But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.



But it lacks some other features:
1. page can't be shared easily with another container


I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other
nodes.


2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it


Yes there would be some accounting differences. I think it is hard
to say exactly what containers are using what page anyway, though.
What do you say about unmapped pages? Kernel allocations? etc.


3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.


?
page_to_nid(page) ~= container that owns it.


4. I guess container destroy requires destroy of memory zone,
   which means write out of dirty data. Which doesn't sound
   good for me as well.


I haven't looked at any implementation, but I think it is fine for
the zone to stay around.


5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.


I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because
you have an lru per container).



Yes, but we break the global LRU. With these RSS patches, reclaim not
triggered by containers still uses the global LRU; by using nodes,
we would lose the global LRU.


6. You cannot overcommit. AFAIU, the memory should be granted
   to node exclusive usage and cannot be used by another container,
   even if it is unused. This is not an option for us.


I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google
guys are going with it, I won't misrepresent their work any further ;)



Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?



hope we'll have it soon :)


Good luck ;)



I think we have made some forward progress on the consensus.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: taskstats accounting info

2007-03-15 Thread Balbir Singh

Andrew Morton wrote:

On Wed, 14 Mar 2007 17:48:32 +0530 Balbir Singh [EMAIL PROTECTED] wrote:
Randy.Dunlap wrote:

Hi,

Documentation/accounting/delay-accounting.txt says that the
getdelays program has a -c cmd argument, but that option
does not seem to exist in Documentation/account/getdelays.c.

Do you have an updated version of getdelays.c?
If not, please correct that documentation.


Yes, I did, but then I changed my laptop. I should have it archived
at some place, I'll dig it out or correct the documentation.


Is getdelays.c the best available example of a program
using the taskstats netlink interface?


It's the most portable example, since it does not depend on libnl.


err, what is libnl?


libnl is a library abstraction for netlink (libnetlink).



If there exists some real userspace infrastructure which utilises
taskstats, can we please get a reference to it into the kernel
Documentation?  Perhaps in the TASKSTATS Kconfig entry, thanks.



That sounds like a good idea. I'll check for details and get back.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH 3/7] containers (V7): Add generic multi-subsystem API to containers

2007-02-14 Thread Balbir Singh

[EMAIL PROTECTED] wrote:

This patch removes all cpuset-specific knowledge from the container
system, replacing it with a generic API that can be used by multiple
subsystems. Cpusets is adapted to be a container subsystem.

Signed-off-by: Paul Menage [EMAIL PROTECTED]



Hi, Paul,

This patch fails to apply for me

[EMAIL PROTECTED]:~/ocbalbir/images/kernels/containers/linux-2.6.20$ 
pushpatch

patching file include/linux/container.h
patching file include/linux/cpuset.h
patching file kernel/container.c
patch:  malformed patch at line 640: @@ -202,15 +418,18 @@ static 
DEFINE_MUTEX(callback_mutex);


multiuser_container does not apply

Is anybody else seeing this problem?

--
Balbir Singh
Linux Technology Center
IBM, ISTL


[RFC][PATCH][0/4] Memory controller (RSS Control)

2007-02-18 Thread Balbir Singh
This patch applies on top of Paul Menage's container patches (V7) posted at

http://lkml.org/lkml/2007/2/12/88

It implements a controller within the containers framework for limiting
memory usage (RSS usage).

The memory controller was discussed at length in the RFC posted to lkml
http://lkml.org/lkml/2006/10/30/51

Steps to use the controller
--


0. Download the patches, apply the patches
1. Turn on CONFIG_CONTAINER_MEMCTLR in kernel config, build the kernel
   and boot into the new kernel
2. mount -t container container -o memctlr /<mount point>
3. cd /<mount point>
   optionally do (mkdir directory; cd directory) under /<mount point>
4. echo $$ > tasks (attaches the current shell to the container)
5. echo -n (limit value) > memctlr_limit
6. cat memctlr_usage
7. Run tasks, check the usage of the controller, reclaim behaviour
8. Report bugs, get bug fixes and iterate (goto step 0).

Advantages of the patchset
--

1. Zero overhead in struct page (struct page is not expanded)
2. Minimal changes to the core-mm code
3. Shared pages are not reclaimed unless all mappings belong to overlimit
   containers.
4. It can be used to debug drivers/applications/kernel components in a
   constrained memory environment (similar to mem=XXX option), except that
   several containers can be created simultaneously without rebooting and
   the limits can be changed. NOTE: There is no support for limiting
   kernel memory allocations and page cache control (presently).

Testing
---
Ran kernbench and lmbench with containers enabled (container filesystem not
mounted); they seemed to run fine.
Created containers, attached tasks to containers with lower limits than
the memory the tasks require (memory hog tests) and ran some basic tests on
them.

TODO's and improvement areas

1. Come up with cool page replacement algorithms for containers
   (if possible without any changes to struct page)
2. Add page cache control
3. Add kernel memory allocator control
4. Extract benchmark numbers and overhead data

Comments & criticism are welcome.

Series
--
memctlr-setup.patch
memctlr-acct.patch
memctlr-reclaim-on-limit.patch
memctlr-doc.patch

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][1/4] RSS controller setup

2007-02-18 Thread Balbir Singh

This patch sets up the basic controller infrastructure on top of the
containers infrastructure. Two files are provided for monitoring
and control - memctlr_usage and memctlr_limit.

memctlr_usage shows the current usage (in pages, of RSS) and the limit
set by the user.

memctlr_limit can be used to set a limit on the RSS usage of the resource.
A special value of 0, indicates that the usage is unlimited. The limit
is set in units of pages.


Signed-off-by: [EMAIL PROTECTED]
---

 include/linux/memctlr.h |   22 ++
 init/Kconfig|7 +
 mm/Makefile |1 
 mm/memctlr.c|  169 
 4 files changed, 199 insertions(+)

diff -puN /dev/null include/linux/memctlr.h
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/include/linux/memctlr.h 2007-02-16 00:22:11.0 
+0530
@@ -0,0 +1,22 @@
+/* memctlr.h - Memory Controller for containers
+ *
+ * Copyright (C) Balbir Singh,   IBM Corp. 2006-2007
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_MEMCTLR_H
+#define _LINUX_MEMCTLR_H
+
+#ifdef CONFIG_CONTAINER_MEMCTLR
+
+#else /* CONFIG_CONTAINER_MEMCTLR  */
+
+#endif /* CONFIG_CONTAINER_MEMCTLR */
+#endif /* _LINUX_MEMCTLR_H */
diff -puN init/Kconfig~memctlr-setup init/Kconfig
--- linux-2.6.20/init/Kconfig~memctlr-setup 2007-02-15 21:58:42.0 
+0530
+++ linux-2.6.20-balbir/init/Kconfig2007-02-15 21:58:42.0 +0530
@@ -306,6 +306,13 @@ config CONTAINER_NS
   for instance virtual servers and checkpoint/restart
   jobs.
 
+config CONTAINER_MEMCTLR
+   bool "A simple RSS based memory controller"
+   select CONTAINERS
+   help
+ Provides a simple Resource Controller for monitoring and
+ controlling the total Resident Set Size of the tasks in a container
+
 config RELAY
	bool "Kernel->user space relay support (formerly relayfs)"
help
diff -puN mm/Makefile~memctlr-setup mm/Makefile
--- linux-2.6.20/mm/Makefile~memctlr-setup  2007-02-15 21:58:42.0 
+0530
+++ linux-2.6.20-balbir/mm/Makefile 2007-02-15 21:58:42.0 +0530
@@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_CONTAINER_MEMCTLR) += memctlr.o
diff -puN /dev/null mm/memctlr.c
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/mm/memctlr.c2007-02-16 00:22:11.0 +0530
@@ -0,0 +1,169 @@
+/*
+ * memctlr.c - Memory Controller for containers
+ *
+ * Copyright (C) Balbir Singh,   IBM Corp. 2006-2007
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/container.h>
+#include <linux/memctlr.h>
+
+#include <asm/uaccess.h>
+
+#define RES_USAGE_NO_LIMIT 0
+static const char version[] = "0.1";
+
+struct res_counter {
+   unsigned long usage;	/* The current usage of the resource being */
+				/* counted */
+   unsigned long limit;	/* The limit on the resource */
+   unsigned long nr_limit_exceeded;
+};
+
+struct memctlr {
+   struct container_subsys_state   css;
+   struct res_counter  counter;
+   spinlock_t  lock;
+};
+
+static struct container_subsys memctlr_subsys;
+
+static inline struct memctlr *memctlr_from_cont(struct container *cont)
+{
+   return container_of(container_subsys_state(cont, memctlr_subsys),
+   struct memctlr, css);
+}
+
+static inline struct memctlr *memctlr_from_task(struct task_struct *p)
+{
+   return memctlr_from_cont(task_container(p, memctlr_subsys));
+}
+
+static int memctlr_create(struct container_subsys *ss, struct container *cont)
+{
+   struct memctlr *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   spin_lock_init(&mem->lock);
+   cont->subsys[memctlr_subsys.subsys_id] = &mem->css;
+   return 0;
+}
+
+static void memctlr_destroy(struct container_subsys *ss,
+   struct container *cont)
+{
+   kfree

[RFC][PATCH][2/4] Add RSS accounting and control

2007-02-18 Thread Balbir Singh
);
if (unlikely(!pte_same(*page_table, orig_pte)))
-   goto out_nomap;
+   goto out_nomap_uncharge;
 
if (unlikely(!PageUptodate(page))) {
ret = VM_FAULT_SIGBUS;
-   goto out_nomap;
+   goto out_nomap_uncharge;
}
 
/* The page isn't present yet, go ahead with the fault. */
@@ -2068,6 +2083,8 @@ unlock:
pte_unmap_unlock(page_table, ptl);
 out:
return ret;
+out_nomap_uncharge:
+   memctlr_update_rss(mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
 out_nomap:
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
@@ -2092,6 +2109,9 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);
 
+   if (!memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT))
+   goto oom;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage(vma, address);
@@ -2108,6 +2128,8 @@ static int do_anonymous_page(struct mm_s
lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
} else {
+   memctlr_update_rss(mm, 1, MEMCTLR_DONT_CHECK_LIMIT);
+
/* Map the ZERO_PAGE - vm_page_prot is readonly */
page = ZERO_PAGE(address);
page_cache_get(page);
@@ -2218,6 +2240,9 @@ retry:
}
}
 
+   if (!memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT))
+   goto oom;
+
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
/*
 * For a file-backed vma, someone could have truncated or otherwise
@@ -2227,6 +2252,7 @@ retry:
	if (mapping && unlikely(sequence != mapping->truncate_count)) {
pte_unmap_unlock(page_table, ptl);
page_cache_release(new_page);
+   memctlr_update_rss(mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
cond_resched();
	sequence = mapping->truncate_count;
smp_rmb();
@@ -2265,6 +2291,7 @@ retry:
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
+   memctlr_update_rss(mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
goto unlock;
}
 
diff -puN mm/rmap.c~memctlr-acct mm/rmap.c
--- linux-2.6.20/mm/rmap.c~memctlr-acct 2007-02-18 22:55:50.0 +0530
+++ linux-2.6.20-balbir/mm/rmap.c   2007-02-18 23:28:16.0 +0530
@@ -602,6 +602,11 @@ void page_remove_rmap(struct page *page,
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : 
NR_FILE_MAPPED);
}
+   /*
+* When we pass MEMCTLR_DONT_CHECK_LIMIT, it is ok to call
+* this function under the pte lock (since we will not block in reclaim)
+*/
+   memctlr_update_rss(vma->vm_mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
 }
 
 /*
diff -puN mm/swapfile.c~memctlr-acct mm/swapfile.c
--- linux-2.6.20/mm/swapfile.c~memctlr-acct 2007-02-18 22:55:50.0 
+0530
+++ linux-2.6.20-balbir/mm/swapfile.c   2007-02-18 22:55:50.0 +0530
@@ -27,6 +27,7 @@
 #include <linux/mutex.h>
 #include <linux/capability.h>
 #include <linux/syscalls.h>
+#include <linux/memctlr.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -514,6 +515,7 @@ static void unuse_pte(struct vm_area_str
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, addr);
+	memctlr_update_rss(vma->vm_mm, 1, MEMCTLR_DONT_CHECK_LIMIT);
swap_free(entry);
/*
 * Move the page to the active list so it is not
_

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][3/4] Add reclaim support

2007-02-18 Thread Balbir Singh
,
+   .swappiness = 100,
+   };
+
+   /*
+* We try to shrink LRUs in 3 passes:
+* 0 = Reclaim from inactive_list only
+* 1 = Reclaim mapped (normal reclaim)
+* 2 = 2nd pass of type 1
+*/
+   for (pass = 0; pass < 3; pass++) {
+   int prio;
+
+   for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+   unsigned long nr_to_scan = nr_pages - ret;
+
+   sc.nr_scanned = 0;
+   ret += shrink_all_zones(nr_to_scan, prio,
+   pass, 1, &sc);
+   if (ret >= nr_pages)
+   goto out;
+
+   nr_total_scanned += sc.nr_scanned;
+   if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+   congestion_wait(WRITE, HZ / 10);
+   }
+   }
+out:
+   return ret;
+}
+#endif
+
 /* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness.  So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
_

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][4/4] RSS controller documentation

2007-02-18 Thread Balbir Singh



Signed-off-by: [EMAIL PROTECTED]
---

 Documentation/memctlr.txt |   70 ++
 1 file changed, 70 insertions(+)

diff -puN /dev/null Documentation/memctlr.txt
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/Documentation/memctlr.txt   2007-02-19 00:51:44.0 +0530
@@ -0,0 +1,70 @@
+Introduction
+
+
+The memory controller is a controller module written under the containers
+framework. It can be used to limit the resource usage of a group of
+tasks grouped by the container.
+
+Accounting
+--
+
+The memory controller tracks the RSS usage of the tasks in the container.
+The definition of RSS was debated on lkml in the following thread
+
+   http://lkml.org/lkml/2006/10/10/130
+
+This patch is flexible; it is easy to adapt it to any definition
+of RSS. The current accounting is based on the current definition of
+RSS. Each page mapped is charged to the container.
+
+The accounting is done at two levels: each process has RSS accounting in
+its mm_struct and in the container it belongs to. The mm_struct accounting
+is used when a task migrates to a different container. The accounting
+information for the task is subtracted from the source container
+and added to the destination container. If, as a result of the migration, the
+destination container goes over its limit, no action is taken until some task
+in the destination container runs and tries to map a new page in its
+page table.
+
+The current RSS usage can be seen in the memctlr_usage file. The value
+is in units of pages.
+
+Control
+---
+
+The memctlr_limit file allows the user to set a limit on the number of
+pages that can be mapped by the processes in the container. A special
+value of 0 (which is the default limit of any new container) indicates
+that the container can use an unlimited amount of RSS.
+
+Reclaim
+---
+
+When the limit set in the container is hit, the memory controller starts
+reclaiming pages belonging to the container (simulating a local LRU in
+some sense). isolate_lru_pages() has been modified to isolate LRU
+pages belonging to a specific container. Parallel reclaims on the same
+container are not allowed; other tasks end up waiting for any existing
+reclaim to finish.
+
+The reclaim code uses two internal knobs, retries and pushback. pushback
+specifies the percentage of memory to be reclaimed when the container goes
+over its limit. The retries knob controls how many times reclaim is retried
+before the task is killed (because reclaim failed).
+
+Shared pages are treated specially during reclaim. They are not force
+reclaimed; they are only unmapped from containers which are over their limit.
+This ensures that other containers do not pay a penalty for a shared
+page being reclaimed when a particular container goes over its limit.
+
+NOTE: All limits are hard limits.
+
+Future Plans
+
+
+The current controller implements only RSS control. It is planned to add
+the following components:
+
+1. Page Cache control
+2. mlock'ed memory control
+3. kernel memory allocation control (memory allocated on behalf of a task)
_
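
The pushback/retries policy described in the Reclaim section above might
look roughly like the sketch below. This is an illustration only: the
helper name, the mem->cont field and the knob values are assumptions,
not the posted code (the actual knobs are hard coded in patch 3/4).

#define PUSHBACK_PCT	20	/* reclaim 20% of the limit when over it */
#define NR_RETRIES	5

static int memctlr_reclaim_on_limit(struct memctlr *mem)
{
	unsigned long nr_pages = (mem->counter.limit * PUSHBACK_PCT) / 100;
	int retries = NR_RETRIES;

	while (retries--) {
		/* memctlr_shrink_mapped_memory() comes from patch 3/4 */
		if (memctlr_shrink_mapped_memory(nr_pages, mem->cont))
			return 0;	/* made progress, retry the charge */
	}
	return -ENOMEM;		/* reclaim failed, caller OOM-kills the task */
}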

-- 
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][0/4] Memory controller (RSS Control)

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 12:20:19 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


This patch applies on top of Paul Menage's container patches (V7) posted at

http://lkml.org/lkml/2007/2/12/88

It implements a controller within the containers framework for limiting
memory usage (RSS usage).


It's good to see someone building on someone else's work for once, rather
than everyone going off in different directions.  It makes one hope that we
might actually achieve something at last.



Thanks! It's good to know we are headed in the right direction.



The key part of this patchset is the reclaim algorithm:


@@ -636,6 +642,15 @@ static unsigned long isolate_lru_pages(u
 
		list_del(&page->lru);

target = src;
+   /*
+* For containers, do not scan the page unless it
+* belongs to the container we are reclaiming for
+*/
+   if (container && !page_in_container(page, zone, container)) {
+   scan--;
+   goto done;
+   }


Alas, I fear this might have quite bad worst-case behaviour.  One small
container which is under constant memory pressure will churn the
system-wide LRUs like mad, and will consume rather a lot of system time.
So it's a point at which container A can deleteriously affect things which
are running in other containers, which is exactly what we're supposed to
not do.



Hmm.. I guess it's space vs time then :-) A CPU controller could
control how much time is spent reclaiming ;)

Coming back, I see the problem you mentioned and we have been thinking
of several possible solutions. In my introduction I pointed out

   Come up with cool page replacement algorithms for containers
   (if possible without any changes to struct page)


The solutions we have looked at are

1. Overload the LRU list_head in struct page to have a global
   LRU + a per container LRU

	[ASCII diagram: Global LRU -- pages 0 and 1 chained together by
	 prev/next pointers on the single global list]

	[ASCII diagram: Global LRU + Container LRU -- pages 0..n on the
	 global list, with pages 1 and n (same container) additionally
	 chained on a per-container list via the overloaded lru pointers]


Pages 1 and n belong to the same container; to get to page 0, you need
two dereferences


2. Modify struct page to point to a container and allow each container to
   have a per-container LRU along with the global LRU


For efficiency we need the container LRU and we don't want to split
the global LRU either.
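
A sketch of what option 2 could look like; the struct and field names
(container_lru, page->container) are hypothetical, not part of the posted
patches:

/* each container carries its own LRU lists, next to the per-zone ones */
struct container_lru {
	spinlock_t	lock;
	struct list_head active_list;
	struct list_head inactive_list;
	unsigned long	nr_active;
	unsigned long	nr_inactive;
};

/* struct page would grow two fields */
struct page {
	/* ... existing fields, including the global per-zone lru ... */
	struct list_head container_lru;	/* links into container_lru lists */
	struct container *container;	/* owning container */
};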

We need to optimize the reclaim path, but I thought of that as a secondary
problem. Once we all agree that the controller looks simple, accounts well
and works, we can/should definitely optimize the reclaim path.


--
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][1/4] RSS controller setup

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 12:20:26 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


This patch sets up the basic controller infrastructure on top of the
containers infrastructure. Two files are provided for monitoring
and control: memctlr_usage and memctlr_limit.


The patches use the identifier memctlr a lot.  It is hard to remember,
and unpronounceable.  Something like memcontrol or mem_controller or
memory_controller would be more typical.



I'll change the name to memory_controller


...

+   BUG_ON(!mem);
+   if ((buffer = kmalloc(nbytes + 1, GFP_KERNEL)) == 0)
+   return -ENOMEM;


Please prefer to do

	buffer = kmalloc(nbytes + 1, GFP_KERNEL);
	if (buffer == NULL)
		return -ENOMEM;

ie: avoid the assign-and-test-in-the-same-statement thing.  This affects
the whole patchset.



I'll fix that


Also, please don't compare pointers to literal zero like that.  It makes me
get buried in patches to convert it to NULL.  I think this is a sparse
thing.



Good point, I'll fix it.


+   buffer[nbytes] = 0;
+   if (copy_from_user(buffer, userbuf, nbytes)) {
+   ret = -EFAULT;
+   goto out_err;
+   }
+
+   container_manage_lock();
+   if (container_is_removed(cont)) {
+   ret = -ENODEV;
+   goto out_unlock;
+   }
+
+   limit = simple_strtoul(buffer, NULL, 10);
+   /*
+* 0 is a valid limit (unlimited resource usage)
+*/
+   if (!limit && strcmp(buffer, "0"))
+   goto out_unlock;
+
+   spin_lock(&mem->lock);
+   mem->counter.limit = limit;
+   spin_unlock(&mem->lock);


The patches do this a lot: a single atomic assignment with a
pointless-looking lock/unlock around it.  It's often the case that this
idiom indicates a bug, or needless locking.  I think the only case where it
makes sense is when there's some other code somewhere which is doing

	spin_lock(&mem->lock);
	...
	use1(mem->counter.limit);
	...
	use2(mem->counter.limit);
	...
	spin_unlock(&mem->lock);

where use1() and use2() expect the two reads of mem->counter.limit to
return the same value.

Is that the case in these patches?  If not, we might have a problem in
there.



The next set of patches move to atomic values for the limits. That should
fix the locking.


+
+static ssize_t memctlr_read(struct container *cont, struct cftype *cft,
+   struct file *file, char __user *userbuf,
+   size_t nbytes, loff_t *ppos)
+{
+   unsigned long usage, limit;
+   char usagebuf[64];  /* Move away from stack later */
+   char *s = usagebuf;
+   struct memctlr *mem = memctlr_from_cont(cont);
+
+   spin_lock(&mem->lock);
+   usage = mem->counter.usage;
+   limit = mem->counter.limit;
+   spin_unlock(&mem->lock);
+
+   s += sprintf(s, "usage %lu, limit %ld\n", usage, limit);
+   return simple_read_from_buffer(userbuf, nbytes, ppos, usagebuf,
+   s - usagebuf);
+}


This output is hard to parse and to extend.  I'd suggest either two
separate files, or multi-line output:

usage: %lu kB
limit: %lu kB

and what are the units of these numbers?  Page counts?  If so, please don't
do that: it requires applications and humans to know the current kernel's
page size.
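
For instance, output in kilobytes could be produced like this (a sketch,
assuming PAGE_SIZE is at least 1KB so PAGE_SHIFT - 10 is non-negative):

	s += sprintf(s, "usage: %lu kB\n", usage << (PAGE_SHIFT - 10));
	s += sprintf(s, "limit: %lu kB\n", limit << (PAGE_SHIFT - 10));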



Yes, this looks much better. I'll move to this format. I get myself lost
in bc at times; that should have been a hint.


+static struct cftype memctlr_usage = {
+   .name = "memctlr_usage",
+   .read = memctlr_read,
+};
+
+static struct cftype memctlr_limit = {
+   .name = "memctlr_limit",
+   .write = memctlr_write,
+};
+
+static int memctlr_populate(struct container_subsys *ss,
+   struct container *cont)
+{
+   int rc;
+   if ((rc = container_add_file(cont, &memctlr_usage)) < 0)
+   return rc;
+   if ((rc = container_add_file(cont, &memctlr_limit)) < 0)


Clean up the first file here?



I used cpuset_populate() as an example to code this one up.
I don't think there is an easy way in containers to clean up
files. I'll double check


+   return rc;
+   return 0;
+}
+
+static struct container_subsys memctlr_subsys = {
+   .name = "memctlr",
+   .create = memctlr_create,
+   .destroy = memctlr_destroy,
+   .populate = memctlr_populate,
+};
+
+int __init memctlr_init(void)
+{
+   int id;
+
+   id = container_register_subsys(&memctlr_subsys);
+   printk("Initializing memctlr version %s, id %d\n", version, id);
+   return id < 0 ? id : 0;
+}
+
+module_init(memctlr_init);





Thanks for the detailed review,

--
Warm Regards,
Balbir Singh

Re: [ckrm-tech] [RFC][PATCH][0/4] Memory controller (RSS Control)

2007-02-19 Thread Balbir Singh

Kirill Korotaev wrote:

On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote:


Alas, I fear this might have quite bad worst-case behaviour.  One small
container which is under constant memory pressure will churn the
system-wide LRUs like mad, and will consume rather a lot of system time.
So it's a point at which container A can deleteriously affect things which
are running in other containers, which is exactly what we're supposed to
not do.


I think it's OK for a container to consume lots of system time during
reclaim, as long as we can account that time to the container involved
(i.e. if it's done during direct reclaim rather than by something like
kswapd).

hmm, is it ok to scan 100Gb of RAM for a 10MB RAM container?
in the UBC patch set we used page beancounters to track container pages.
This allows us to make an efficient scan controller and reclamation.

Thanks,
Kirill


Hi, Kirill,

Yes, that's a problem, but I think it's a problem that can be solved
in steps. First step, add reclaim. Second step, optimize reclaim.

--
Warm Regards,
Balbir Singh


Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 12:20:34 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


This patch adds the basic accounting hooks to account for pages allocated
into the RSS of a process. Accounting is maintained at two levels, in
the mm_struct of each task and in the memory controller data structure
associated with each node in the container.

When the limit specified for the container is exceeded, the task is killed.
RSS accounting is consistent with the current definition of RSS in the
kernel. Shared pages are accounted into the RSS of each process as is
done in the kernel currently. The code is flexible in that it can be easily
modified to work with any definition of RSS.

..

+static inline int memctlr_mm_init(struct mm_struct *mm)
+{
+   return 0;
+}


So it returns zero on success.  OK.



Oops, it should return 1 on success.


--- linux-2.6.20/kernel/fork.c~memctlr-acct 2007-02-18 22:55:50.0 +0530
+++ linux-2.6.20-balbir/kernel/fork.c   2007-02-18 22:55:50.0 +0530
@@ -50,6 +50,7 @@
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
 #include <linux/numtasks.h>
+#include <linux/memctlr.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -342,10 +343,15 @@ static struct mm_struct * mm_init(struct
	mm->free_area_cache = TASK_UNMAPPED_BASE;
	mm->cached_hole_size = ~0UL;
 
+	if (!memctlr_mm_init(mm))

+   goto err;
+


But here we treat zero as an error?



It's a BUG, I'll fix it.



if (likely(!mm_alloc_pgd(mm))) {
mm-def_flags = 0;
return mm;
}
+
+err:
free_mm(mm);
return NULL;
 }

...

+int memctlr_mm_init(struct mm_struct *mm)
+{
+   mm->counter = kmalloc(sizeof(struct res_counter), GFP_KERNEL);
+   if (!mm->counter)
+   return 0;
+   atomic_long_set(&mm->counter->usage, 0);
+   atomic_long_set(&mm->counter->limit, 0);
+   rwlock_init(&mm->container_lock);
+   return 1;
+}


ah-ha, we have another Documentation/SubmitChecklist customer.

It would be more conventional to make this return -EFOO on error,
zero on success.



ok.. I'll convert the functions to be consistent with the
return 0 on success philosophy.


+void memctlr_mm_free(struct mm_struct *mm)
+{
+   kfree(mm->counter);
+}
+
+static inline void memctlr_mm_assign_container_direct(struct mm_struct *mm,
+   struct container *cont)
+{
+   write_lock(&mm->container_lock);
+   mm->container = cont;
+   write_unlock(&mm->container_lock);
+}


More weird locking here.



The container field of the mm_struct is protected by a read write spin lock.


+void memctlr_mm_assign_container(struct mm_struct *mm, struct task_struct *p)
+{
+   struct container *cont = task_container(p, &memctlr_subsys);
+   struct memctlr *mem = memctlr_from_cont(cont);
+
+   BUG_ON(!mem);
+   write_lock(&mm->container_lock);
+   mm->container = cont;
+   write_unlock(&mm->container_lock);
+}


And here.


Ditto.




+/*
+ * Update the rss usage counters for the mm_struct and the container it belongs
+ * to. We do not fail rss for pages shared during fork (see copy_one_pte()).
+ */
+int memctlr_update_rss(struct mm_struct *mm, int count, bool check)
+{
+   int ret = 1;
+   struct container *cont;
+   long usage, limit;
+   struct memctlr *mem;
+
+   read_lock(&mm->container_lock);
+   cont = mm->container;
+   read_unlock(&mm->container_lock);
+
+   if (!cont)
+   goto done;


And here.  I mean, if there was a reason for taking the lock around that
read, then testing `cont' outside the lock just invalidated that reason.



We took a consistent snapshot of cont. It cannot change outside the lock,
we check the value outside. I am sure I missed something.


+static inline void memctlr_double_lock(struct memctlr *mem1,
+   struct memctlr *mem2)
+{
+   if (mem1 > mem2) {
+   spin_lock(&mem1->lock);
+   spin_lock(&mem2->lock);
+   } else {
+   spin_lock(&mem2->lock);
+   spin_lock(&mem1->lock);
+   }
+}


Conventionally we take the lower-addressed lock first when doing this, not
the higher-addressed one.



Will fix this.
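
The lower-address-first ordering Andrew suggests would read like this
(a sketch of the fix, assuming mem1 != mem2):

static inline void memctlr_double_lock(struct memctlr *mem1,
					struct memctlr *mem2)
{
	/* always take the lower-addressed lock first to avoid ABBA deadlocks */
	if (mem1 < mem2) {
		spin_lock(&mem1->lock);
		spin_lock(&mem2->lock);
	} else {
		spin_lock(&mem2->lock);
		spin_lock(&mem1->lock);
	}
}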


+static inline void memctlr_double_unlock(struct memctlr *mem1,
+   struct memctlr *mem2)
+{
+   if (mem1 > mem2) {
+   spin_unlock(&mem2->lock);
+   spin_unlock(&mem1->lock);
+   } else {
+   spin_unlock(&mem1->lock);
+   spin_unlock(&mem2->lock);
+   }
+}
+
...

retval = -ENOMEM;
+
+   if (!memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT))
+   goto out;
+


Again, please use zero for success and -EFOO for error.

That way, you don't have to assume that the reason memctlr_update_rss()
failed was out-of-memory.  Just propagate the error back.



Yes, will do
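
With the zero-on-success convention the call site above would become
something like (sketch):

	retval = memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT);
	if (retval)
		goto out;	/* propagate the real error, e.g. -ENOMEM */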

Re: [RFC][PATCH][0/4] Memory controller (RSS Control)

2007-02-19 Thread Balbir Singh

Paul Menage wrote:

On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote:


Alas, I fear this might have quite bad worst-case behaviour.  One small
container which is under constant memory pressure will churn the
system-wide LRUs like mad, and will consume rather a lot of system time.
So it's a point at which container A can deleteriously affect things 
which

are running in other containers, which is exactly what we're supposed to
not do.


I think it's OK for a container to consume lots of system time during
reclaim, as long as we can account that time to the container involved
(i.e. if it's done during direct reclaim rather than by something like
kswapd).

Churning the LRU could well be bad though, I agree.



I completely agree with you on reclaim consuming time.

Churning the LRU can be avoided by the means I mentioned before

1. Add a container pointer (per page struct), it is also
   useful for the page cache controller
2. Check if the page belongs to a particular container before
   the list_del(&page->lru), so that those pages can be skipped.
3. Use a double LRU list by overloading the lru list_head of
   struct page.


Paul




--
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][0/4] Memory controller (RSS Control)

2007-02-19 Thread Balbir Singh

Magnus Damm wrote:

On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote:
On Mon, 19 Feb 2007 12:20:19 +0530 Balbir Singh [EMAIL PROTECTED] 
wrote:


 This patch applies on top of Paul Menage's container patches (V7) 
posted at


   http://lkml.org/lkml/2007/2/12/88

 It implements a controller within the containers framework for limiting
 memory usage (RSS usage).



The key part of this patchset is the reclaim algorithm:

Alas, I fear this might have quite bad worst-case behaviour.  One small
container which is under constant memory pressure will churn the
system-wide LRUs like mad, and will consume rather a lot of system time.
So it's a point at which container A can deleteriously affect things 
which

are running in other containers, which is exactly what we're supposed to
not do.


Nice with a simple memory controller. The downside seems to be that it
doesn't scale very well when it comes to reclaim, but maybe that just
comes with being simple. Step by step, and maybe this is a good first
step?



Thanks, I totally agree.


Ideally I'd like to see unmapped pages handled on a per-container LRU
with a fallback to the system-wide LRUs. Shared/mapped pages could be
handled using PTE ageing/unmapping instead of page ageing, but that
may consume too much resources to be practical.

/ magnus


Keeping unmapped pages per container sounds interesting. I am not quite
sure what PTE ageing is; I will look it up.


--
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][3/4] Add reclaim support

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 12:20:42 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


This patch reclaims pages from a container when the container limit is hit.
The executable is oom'ed only when the container it is running in is over its
limit and we could not reclaim any pages belonging to the container.

A parameter called pushback, controls how much memory is reclaimed when the
limit is hit. It should be easy to expose this knob to user space, but
currently it is hard coded to 20% of the total limit of the container.

isolate_lru_pages() has been modified to isolate pages belonging to a
particular container, so that reclaim code will reclaim only container
pages. For shared pages, reclaim does not unmap all mappings of the page,
it only unmaps those mappings that are over their limit. This ensures
that other containers are not penalized while reclaiming shared pages.

Parallel reclaim per container is not allowed. Each controller has a wait
queue that ensures that only one task per container is running reclaim on
that container.


...

--- linux-2.6.20/include/linux/rmap.h~memctlr-reclaim-on-limit  2007-02-18 23:29:14.0 +0530
+++ linux-2.6.20-balbir/include/linux/rmap.h    2007-02-18 23:29:14.0 +0530
@@ -90,7 +90,15 @@ static inline void page_dup_rmap(struct 
  * Called from mm/vmscan.c to handle paging out

  */
 int page_referenced(struct page *, int is_locked);
-int try_to_unmap(struct page *, int ignore_refs);
+int try_to_unmap(struct page *, int ignore_refs, void *container);
+#ifdef CONFIG_CONTAINER_MEMCTLR
+bool page_in_container(struct page *page, struct zone *zone, void *container);
+#else
+static inline bool page_in_container(struct page *page, struct zone *zone,
+   void *container)
+{
+   return true;
+}
+#endif /* CONFIG_CONTAINER_MEMCTLR */
 
 /*

  * Called from mm/filemap_xip.c to unmap empty zero page
@@ -118,7 +126,8 @@ int page_mkclean(struct page *);
 #define anon_vma_link(vma) do {} while (0)
 
 #define page_referenced(page,l) TestClearPageReferenced(page)

-#define try_to_unmap(page, refs) SWAP_FAIL
+#define try_to_unmap(page, refs, container) SWAP_FAIL
+#define page_in_container(page, zone, container)  true


I spy a compile error.

The static-inline version looks nicer.




I will compile with the feature turned off and double check. I'll
also convert it to a static inline function.
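
The static-inline form of the !CONFIG_CONTAINER_MEMCTLR stub might read
(sketch):

static inline int try_to_unmap(struct page *page, int ignore_refs,
				void *container)
{
	return SWAP_FAIL;
}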



 static inline int page_mkclean(struct page *page)
 {
diff -puN include/linux/swap.h~memctlr-reclaim-on-limit include/linux/swap.h
--- linux-2.6.20/include/linux/swap.h~memctlr-reclaim-on-limit  2007-02-18 23:29:14.0 +0530
+++ linux-2.6.20-balbir/include/linux/swap.h    2007-02-18 23:29:14.0 +0530
@@ -188,6 +188,10 @@ extern void swap_setup(void);
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zone **, gfp_t);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
+#ifdef CONFIG_CONTAINER_MEMCTLR
+extern unsigned long memctlr_shrink_mapped_memory(unsigned long nr_pages,
+   void *container);
+#endif


Usually one doesn't need to put ifdefs around the declaration like this. 
If the function doesn't exist and nobody calls it, we're fine.  If someone

_does_ call it, we'll find out the error at link-time.



Sure, sounds good. I'll get rid of the #ifdefs.

 
+/*

+ * checks if the mm's container and the scan control's container match; if
+ * so, is the container over its limit. Returns 1 if the container is above
+ * its limit.
+ */
+int memctlr_mm_overlimit(struct mm_struct *mm, void *sc_cont)
+{
+   struct container *cont;
+   struct memctlr *mem;
+   long usage, limit;
+   int ret = 1;
+
+   if (!sc_cont)
+   goto out;
+
+   read_lock(&mm->container_lock);
+   cont = mm->container;
+
+   /*
+* Regular reclaim, let it proceed as usual
+*/
+   if (!sc_cont)
+   goto out;
+
+   ret = 0;
+   if (cont != sc_cont)
+   goto out;
+
+   mem = memctlr_from_cont(cont);
+   usage = atomic_long_read(&mem->counter.usage);
+   limit = atomic_long_read(&mem->counter.limit);
+   if (limit && (usage > limit))
+   ret = 1;
+out:
+   read_unlock(&mm->container_lock);
+   return ret;
+}


hm, I wonder how much additional lock traffic all this adds.



It's a read_lock() and most of the locks are read_locks
which allow for concurrent access, until the container
changes or goes away.


 int memctlr_mm_init(struct mm_struct *mm)
 {
	mm->counter = kmalloc(sizeof(struct res_counter), GFP_KERNEL);
@@ -77,6 +125,46 @@ void memctlr_mm_assign_container(struct 
 	write_unlock(&mm->container_lock);

 }
 
+static int memctlr_check_and_reclaim(struct container *cont, long usage,

+   long limit)
+{
+   unsigned long nr_pages = 0;
+   unsigned long nr_reclaimed = 0;
+   int retries = nr_retries;
+   int ret = 1;

Re: [RFC][PATCH][3/4] Add reclaim support

2007-02-19 Thread Balbir Singh

KAMEZAWA Hiroyuki wrote:

On Mon, 19 Feb 2007 12:20:42 +0530
Balbir Singh [EMAIL PROTECTED] wrote:


+int memctlr_mm_overlimit(struct mm_struct *mm, void *sc_cont)
+{
+   struct container *cont;
+   struct memctlr *mem;
+   long usage, limit;
+   int ret = 1;
+
+   if (!sc_cont)
+   goto out;
+
+   read_lock(&mm->container_lock);
+   cont = mm->container;



+out:
+   read_unlock(&mm->container_lock);
+   return ret;
+}
+


should be
==
out_and_unlock:
read_unlock(&mm->container_lock);
out_:
return ret;




Thanks, that's a much better convention!



-Kame




--
Warm Regards,
Balbir Singh


Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 16:07:44 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


+void memctlr_mm_free(struct mm_struct *mm)
+{
+   kfree(mm->counter);
+}
+
+static inline void memctlr_mm_assign_container_direct(struct mm_struct *mm,
+   struct container *cont)
+{
+   write_lock(&mm->container_lock);
+   mm->container = cont;
+   write_unlock(&mm->container_lock);
+}

More weird locking here.


The container field of the mm_struct is protected by a read write spin lock.


That doesn't mean anything to me.

What would go wrong if the above locking was simply removed?  And how does
the locking prevent that fault?



Some pages could be charged to the wrong container. Apart from that, I do not
see anything going bad (I'll double check that).




+void memctlr_mm_assign_container(struct mm_struct *mm, struct task_struct *p)
+{
+   struct container *cont = task_container(p, &memctlr_subsys);
+   struct memctlr *mem = memctlr_from_cont(cont);
+
+   BUG_ON(!mem);
+   write_lock(&mm->container_lock);
+   mm->container = cont;
+   write_unlock(&mm->container_lock);
+}

And here.

Ditto.


ditto ;)



:-)


+/*
+ * Update the rss usage counters for the mm_struct and the container it belongs
+ * to. We do not fail rss for pages shared during fork (see copy_one_pte()).
+ */
+int memctlr_update_rss(struct mm_struct *mm, int count, bool check)
+{
+   int ret = 1;
+   struct container *cont;
+   long usage, limit;
+   struct memctlr *mem;
+
+   read_lock(&mm->container_lock);
+   cont = mm->container;
+   read_unlock(&mm->container_lock);
+
+   if (!cont)
+   goto done;

And here.  I mean, if there was a reason for taking the lock around that
read, then testing `cont' outside the lock just invalidated that reason.


We took a consistent snapshot of cont. It cannot change outside the lock,
we check the value outside. I am sure I missed something.


If it cannot change outside the lock then we don't need to take the lock!



We took a snapshot that we thought was consistent. We check for the value
outside. I guess there is no harm; the worst thing that could happen
is wrong accounting during mm->container changes (when a task changes
containers).


MEMCTLR_DONT_CHECK_LIMIT exists for the following reasons

1. Pages are shared during fork, fork() is not failed at that point
since the pages are shared anyway, we allow the RSS limit to be
exceeded.
2. When ZERO_PAGE is added, we don't check for limits (zeromap_pte_range).
3. On reducing RSS (passing -1 as the value)


OK, that might make a nice comment somewhere (if it's not already there).


Yes, thanks for keeping us humble and honest, I'll add it.

--
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][1/4] RSS controller setup

2007-02-19 Thread Balbir Singh

Paul Menage wrote:

On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote:


This output is hard to parse and to extend.  I'd suggest either two
separate files, or multi-line output:

usage: %lu kB
limit: %lu kB


Two separate files would be the container usage model that I
envisaged, inherited from the way cpusets does things.

And in this case, it should definitely be the limit in one file,
readable and writeable, and the usage in another, probably only
readable.

Having to read a file called memctlr_usage to find the current limit
sounds wrong.



That sound right, I'll fix this.


Hmm, I don't appear to have documented this yet, but I think a good
naming scheme for container files is subsystem.whatever - i.e.
these should be memctlr.usage and memctlr.limit. The existing
grandfathered Cpusets names violate this, but I'm not sure there's a
lot we can do about that.



Why subsystem.whatever? Dots are harder to parse using regular
expressions and sound DOS'ish. I'd prefer _ to separate the
subsystem and whatever :-)


 +static int memctlr_populate(struct container_subsys *ss,
 + struct container *cont)
 +{
 + int rc;
 + if ((rc = container_add_file(cont, &memctlr_usage)) < 0)
 + return rc;
 + if ((rc = container_add_file(cont, &memctlr_limit)) < 0)

Clean up the first file here?


Containers don't currently provide an API for a subsystem to clean up
files from a directory - that's done automatically when the directory
is deleted.

I think I'll probably change the API for container_add_file to return
void, but mark an error in the container itself if something goes
wrong - that way rather than all the subsystems having to check for
error, container_populate_dir() can do so at the end of calling all
the subsystems' populate methods.



It should be easy to add container_remove_file() instead of marking
an error.


Paul



--
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][3/4] Add reclaim support

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 16:20:53 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


+ * so, is the container over its limit. Returns 1 if the container is above
+ * its limit.
+ */
+int memctlr_mm_overlimit(struct mm_struct *mm, void *sc_cont)
+{
+   struct container *cont;
+   struct memctlr *mem;
+   long usage, limit;
+   int ret = 1;
+
+   if (!sc_cont)
+   goto out;
+
+   read_lock(&mm->container_lock);
+   cont = mm->container;
+
+   /*
+* Regular reclaim, let it proceed as usual
+*/
+   if (!sc_cont)
+   goto out;
+
+   ret = 0;
+   if (cont != sc_cont)
+   goto out;
+
+   mem = memctlr_from_cont(cont);
+   usage = atomic_long_read(&mem->counter.usage);
+   limit = atomic_long_read(&mem->counter.limit);
+   if (limit && (usage > limit))
+   ret = 1;
+out:
+   read_unlock(&mm->container_lock);
+   return ret;
+}

hm, I wonder how much additional lock traffic all this adds.


It's a read_lock() and most of the locks are read_locks
which allow for concurrent access, until the container
changes or goes away.


read_lock isn't free, and I suspect we're calling this function pretty
often (every pagefault?). It'll be measurable on some workloads, on some
hardware.

It probably won't be terribly bad because each lock-taking is associated
with a clear_page().  But still, if there's any possibility of lightening
the locking up, now is the time to think about it.



Yes, good point. I'll revisit to see if barriers can replace the locking,
or if the locking is required at all.


@@ -66,6 +67,9 @@ struct scan_control {
int swappiness;
 
 	int all_unreclaimable;

+
+   void *container;/* Used by containers for reclaiming */
+   /* pages when the limit is exceeded  */
 };

eww.  Why void*?


I did not want to expose struct container in mm/vmscan.c.


It's already there, via rmap.h



Yes, true


An additional
thought was that no matter what container goes in the field would be
useful for reclaim.


Am having trouble parsing that sentence ;)




The thought was that, irrespective of the infrastructure that goes in,
having an entry for reclaim in scan_control would be useful. I guess
the name exposes what the type tries to hide :-)

--
Warm Regards,
Balbir Singh


Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control

2007-02-19 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 19 Feb 2007 16:39:33 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


Andrew Morton wrote:

On Mon, 19 Feb 2007 16:07:44 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


+void memctlr_mm_free(struct mm_struct *mm)
+{
+   kfree(mm->counter);
+}
+
+static inline void memctlr_mm_assign_container_direct(struct mm_struct *mm,
+   struct container *cont)
+{
+   write_lock(&mm->container_lock);
+   mm->container = cont;
+   write_unlock(&mm->container_lock);
+}

More weird locking here.


The container field of the mm_struct is protected by a read write spin lock.

That doesn't mean anything to me.

What would go wrong if the above locking was simply removed?  And how does
the locking prevent that fault?


Some pages could charged to the wrong container. Apart from that I do not
see anything going bad (I'll double check that).


Argh.  Please, think about this.



Sure, I will. I guess I am short-circuiting my thinking process :-)



That locking *doesn't do anything*.  Except for that one situation I
described: some other holder of the lock reads mm-container twice inside
the lock and requires that the value be the same both times (and that sort
of code should be converted to take a local copy, so this locking here can
be removed).



Yes, that makes sense.
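
The take-a-local-copy pattern boils down to something like this (sketch;
memctlr_charge() is a hypothetical helper, and a single aligned pointer
read needs no lock):

	struct container *cont = mm->container;	/* one-shot snapshot */

	if (cont)
		memctlr_charge(cont, count);	/* hypothetical helper */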


+
+   read_lock(&mm->container_lock);
+   cont = mm->container;
+   read_unlock(&mm->container_lock);
+
+   if (!cont)
+   goto done;

And here.  I mean, if there was a reason for taking the lock around that
read, then testing `cont' outside the lock just invalidated that reason.


We took a consistent snapshot of cont. It cannot change outside the lock,
we check the value outside. I am sure I missed something.

If it cannot change outside the lock then we don't need to take the lock!


We took a snapshot that we thought was consistent.


Consistent with what?  That's a single-word read inside that lock.



Yes, that makes sense.


We check for the value
outside. I guess there is no harm, the worst thing that could happen
is wrong accounting during mm-container changes (when a task changes
container).


If container_lock is held when a task is removed from the
container then yes, `cont' here can refer to a container to which the task
no longer belongs.

More worrisome is the potential for use-after-free.  What prevents the
pointer at mm->container from referring to freed memory after we've dropped
the lock?



The container cannot be freed unless all tasks holding references to it are
gone; that would ensure that all mm->container pointers are pointing
elsewhere and never to a stale value.

I hope my short-circuited brain got this right :-)



--
Warm Regards,
Balbir Singh


Re: [RFC][PATCH][0/4] Memory controller (RSS Control)

2007-02-19 Thread Balbir Singh

Magnus Damm wrote:

On 2/19/07, Balbir Singh [EMAIL PROTECTED] wrote:

Magnus Damm wrote:
 On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote:
 On Mon, 19 Feb 2007 12:20:19 +0530 Balbir Singh [EMAIL PROTECTED]
 wrote:

  This patch applies on top of Paul Menage's container patches (V7)
 posted at
 
http://lkml.org/lkml/2007/2/12/88
 
  It implements a controller within the containers framework for 
limiting

  memory usage (RSS usage).

 The key part of this patchset is the reclaim algorithm:

 Alas, I fear this might have quite bad worst-case behaviour.  One 
small

 container which is under constant memory pressure will churn the
 system-wide LRUs like mad, and will consume rather a lot of system 
time.

 So it's a point at which container A can deleteriously affect things
 which
 are running in other containers, which is exactly what we're 
supposed to

 not do.

 Nice with a simple memory controller. The downside seems to be that it
 doesn't scale very well when it comes to reclaim, but maybe that just
 comes with being simple. Step by step, and maybe this is a good first
 step?


Thanks, I totally agree.

 Ideally I'd like to see unmapped pages handled on a per-container LRU
 with a fallback to the system-wide LRUs. Shared/mapped pages could be
 handled using PTE ageing/unmapping instead of page ageing, but that
 may consume too much resources to be practical.

 / magnus

Keeping unmapped pages per container sounds interesting. I am not quite
sure what PTE ageing is; I will look it up.


You will most likely have no luck looking it up, so here is what I
mean by PTE ageing:

The most common unit for memory resource control seems to be physical
pages. Keeping track of pages is simple in the case of a single user
per page, but for shared pages tracking the owner becomes more
complex.

I consider unmapped pages to only have a single user at a time, so the
unit for unmapped memory resource control is physical pages. Apart
from implementation details such as fun with struct page and
scalability, handling this case is not so complicated.

Mapped or shared pages should be handled in a different way IMO. PTEs
should be used instead of using physical pages as unit for resource
control and reclaim. For the user this looks pretty much the same as
physical pages, apart for memory overcommit.

So instead of using a global page reclaim policy and reserving
physical pages per container I propose that resource controlled shared
pages should be handled using a PTE replacement policy. This policy is
used to keep the most active PTEs in the container backed by physical
pages. Inactive PTEs gets unmapped in favour over newer PTEs.

One way to implement this could be by populating the address space of
resource controlled processes with multiple smaller LRU2Qs. The
compact data structure that I have in mind is basically an array of
256 bytes, one byte per PTE. Associated with this data structure are
start indexes and lengths for two lists. The indexes are used in a
FAT-type of chain to form single linked lists. So we create active and
inactive list here - and we move PTEs between the lists when we check
the young bits from the page reclaim and when we apply memory
pressure. Unmapping is done through the normal page reclaimer but
using information from the PTE LRUs.
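
A sketch of the compact structure Magnus describes; the layout and field
names are hypothetical:

/* one of these per 256 ptes; 0xff terminates an index chain */
struct pte_lru {
	u8  next[256];		/* FAT-style singly-linked index chains */
	u8  active_head;
	u8  inactive_head;
	u16 active_len;		/* lengths can reach 256, so not u8 */
	u16 inactive_len;
};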

In my mind this should lead to more fair resource control of mapped
pages, but if it is possible to implement with low overhead, that's
another question. =)

Thanks for listening.

/ magnus



Thanks for explaining PTE aging.

--
Warm Regards,
Balbir Singh


Re: taskstats accounting info

2007-03-20 Thread Balbir Singh

Randy Dunlap wrote:

On Thu, 15 Mar 2007 11:06:55 -0800 Andrew Morton wrote:


It's the most portable example, since it does not depend on libnl.

err, what is libnl?


lib-netlink (as already answered, but I wrote this last week)



I was referring to the library at http://people.suug.ch/~tgr/libnl/


If there exists some real userspace infrastructure which utilises
taskstats, can we please get a reference to it into the kernel
Documentation?  Perhaps in the TASKSTATS Kconfig entry, thanks.



Balbir, I was working with getdelays.c when I initially wrote
these questions.  Here is a small patch for it.  Hopefully you can
use it when you find the updated version of it.

~Randy

From: Randy Dunlap [EMAIL PROTECTED]

1.  add usage() function

2.  add unknown character in %c format (was only in %d, not useful):

./getdelays: invalid option -- h
Unknown option '?' (63)

instead of:

./getdelays: invalid option -- h
Unknown option 63

(or just remove that message)

3.  -v does not use an optarg, so remove ':' in getopt string after 'v';



Thanks, these look good. I'll add them to my local copy.



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Linux-VServer example results for sharing vs. separate mappings ...

2007-03-25 Thread Balbir Singh

Andrew Morton wrote:
snip

The problem is memory reclaim.  A number of schemes which have been
proposed require a per-container page reclaim mechanism - basically a
separate scanner.

This is a huge, huge, huge problem.  The present scanner has been under
development for over a decade and has had tremendous amounts of work and
testing put into it.  And it still has problems.  But those problems will
be gradually addressed.

A per-container reclaim scheme really really really wants to reuse all that
stuff rather than creating a separate, parallel, new scanner which has the
same robustness requirements, only has a decade less test and development
done on it.  And which permanently doubles our maintenance costs.



The current per-container reclaim scheme does reuse a lot of code. As far
as code maintenance is concerned, I think it should be easy to merge
some of the common functionality by abstracting them out as different
functions. The container smartness comes in only in the
container_isolate_pages(). This is an easy to understand function.


So how do we reuse our existing scanner?  With physical containers.  One
can envisage several schemes:

a) slice the machine into 128 fake NUMA nodes, use each node as the
   basic block of memory allocation, manage the binding between these
   memory hunks and process groups with cpusets.

   This is what google are testing, and it works.


Don't we break the global LRU with this scheme?



b) Create a new memory abstraction, call it the software zone, which
   is mostly decoupled from the present hardware zones.  Most of the MM
   is reworked to use software zones.  The software zones are
   runtime-resizeable, and obtain their pages via some means from the
   hardware zones.  A container uses a software zone.



I think the problem would be figuring out where to allocate memory from?
What happens if a software zone spans across many hardware zones?


c) Something else, similar to the above.  Various schemes can be
   envisaged, it isn't terribly important for this discussion.


Let me repeat: this all has a huge upside in that it reuses the existing
page reclaimation logic.  And cpusets.  Yes, we do discover glitches, but
those glitches (such as Christoph's recent discovery of suboptimal
interaction between cpusets and the global dirty ratio) get addressed, and
we tend to strengthen the overall MM system as we address them.


So what are the downsides?  I think mainly the sharing issue:


I think binding the resource controller and the allocator might be
a bad idea, I tried experimenting with it and soon ran into some
hard to answer questions

1. How do we control the length of the zonelists that we need to
   allocate memory from (in a container)
2. Like you said, how do we share pages across zones (containers)
3. What happens to the global LRU behaviour
4. Do we need a per_cpu_pageset assoicated with containers
5. What do we do with unused memory in a zone, is it shared with
   other zones
6. Changing zones or creating an abstraction out of it is likely
   to impact the entire vm setup core; that is high risk, so
   do we really need to do it this way?



But how much of a problem will it be *in practice*?  Probably a lot of
people just won't notice or care.  There will be a few situations where it
may be a problem, but perhaps we can address those?  Forced migration of
pages from one zone into another is possible.  Or change the reclaim code
so that a page which hasn't been referenced from a process within its
hardware container is considered unreferenced (so it gets reclaimed).  Or a
manual nuke-all-the-pages knob which system administration tools can use. 
All doable, if we indeed have a demonstrable problem which needs to be

addressed.

And I do think it's worth trying to address these things, because the
thought of implementing a brand new memory reclaim mechanism scares the
pants off me.



The reclaim mechanism proposed *does not impact the non-container users*.
The only impact is container-driven reclaim; like every other new feature,
this can benefit from good testing in -mm. I believe we have something
simple and understandable to get us started. I would request you to consider
merging the RSS controller and containers patches in -mm. If too many people
complain, or we see the problems that you foresee and our testing,
enhancements and maintenance are unable to sort those problems out, we know
we'll have another approach to fall back upon :-) It'll also teach us
to listen to the maintainers when they talk of design ;)

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Fix race between attach_task and cpuset_exit

2007-03-25 Thread Balbir Singh

Hi, Vatsa,

Srivatsa Vaddagiri wrote:


diff -puN kernel/cpuset.c~cpuset_race_fix kernel/cpuset.c
--- linux-2.6.21-rc4/kernel/cpuset.c~cpuset_race_fix    2007-03-25 21:08:27.0 +0530
+++ linux-2.6.21-rc4-vatsa/kernel/cpuset.c  2007-03-25 21:25:05.0 +0530
@@ -1182,6 +1182,7 @@ static int attach_task(struct cpuset *cs
pid_t pid;
struct task_struct *tsk;
struct cpuset *oldcs;
+   struct cpuset *oldcs_tobe_released = NULL;


How about oldcs_to_be_released?


cpumask_t cpus;
nodemask_t from, to;
struct mm_struct *mm;
@@ -1237,6 +1238,8 @@ static int attach_task(struct cpuset *cs
}
atomic_inc(cs-count);
rcu_assign_pointer(tsk-cpuset, cs);
+   if (atomic_dec_and_test(&oldcs->count))
+   oldcs_tobe_released = oldcs;
task_unlock(tsk);

guarantee_online_cpus(cs, cpus);
@@ -1257,8 +1260,8 @@ static int attach_task(struct cpuset *cs

put_task_struct(tsk);
synchronize_rcu();
-   if (atomic_dec_and_test(&oldcs->count))
-   check_for_release(oldcs, ppathbuf);
+   if (oldcs_tobe_released)
+   check_for_release(oldcs_tobe_released, ppathbuf);
return 0;
 }

@@ -2200,10 +2203,6 @@ void cpuset_fork(struct task_struct *chi
  * it is holding that mutex while calling check_for_release(),
  * which calls kmalloc(), so can't be called holding callback_mutex().
  *
- * We don't need to task_lock() this reference to tsk->cpuset,
- * because tsk is already marked PF_EXITING, so attach_task() won't
- * mess with it, or task is a failed fork, never visible to attach_task.
- *
  * the_top_cpuset_hack:
  *
  *Set the exiting tasks cpuset to the root cpuset (top_cpuset).
@@ -2242,19 +2241,20 @@ void cpuset_exit(struct task_struct *tsk
 {
struct cpuset *cs;

+   task_lock(tsk);
	cs = tsk->cpuset;
	tsk->cpuset = &top_cpuset;   /* the_top_cpuset_hack - see above */
+	atomic_dec(&cs->count);


How about using a local variable like ref_count and using

ref_count = atomic_dec_and_test(&cs->count); This will avoid the two
atomic operations, atomic_dec() and atomic_read(), below.


+   task_unlock(tsk);

if (notify_on_release(cs)) {
char *pathbuf = NULL;

	mutex_lock(&manage_mutex);
-	if (atomic_dec_and_test(&cs->count))
+	if (!atomic_read(&cs->count))


if (ref_count == 0)


	check_for_release(cs, &pathbuf);
	mutex_unlock(&manage_mutex);
	cpuset_release_agent(pathbuf);
-	} else {
-	atomic_dec(&cs->count);
}
 }
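
Folding both suggestions in, the locked part of cpuset_exit() would read
roughly as follows (an untested sketch):

	int released;

	task_lock(tsk);
	cs = tsk->cpuset;
	tsk->cpuset = &top_cpuset;	/* the_top_cpuset_hack */
	released = atomic_dec_and_test(&cs->count);
	task_unlock(tsk);

	if (notify_on_release(cs)) {
		char *pathbuf = NULL;

		mutex_lock(&manage_mutex);
		if (released)
			check_for_release(cs, &pathbuf);
		mutex_unlock(&manage_mutex);
		cpuset_release_agent(pathbuf);
	}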



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Linux-VServer example results for sharing vs. separate mappings ...

2007-03-25 Thread Balbir Singh

Andrew Morton wrote:

Don't we break the global LRU with this scheme?


Sure, but that's deliberate!

(And we don't have a global LRU - the LRUs are per-zone).



Yes, true. But if we use zones for containers and say we have 400
of them, all under their limits, when the system wants
to reclaim memory we might not end up reclaiming the best pages.
Am I missing something?


b) Create a new memory abstraction, call it the software zone, which
   is mostly decoupled from the present hardware zones.  Most of the MM
   is reworked to use software zones.  The software zones are
   runtime-resizeable, and obtain their pages via some means from the
   hardware zones.  A container uses a software zone.


I think the problem would be figuring out where to allocate memory from?
What happens if a software zone spans across many hardware zones?


Yes, that would be the tricky part.  But we generally don't care what
physical zone user pages come from, apart from NUMA optimisation.


The reclaim mechanism proposed *does not impact the non-container users*.


Yup.  Let's keep plugging away with Pavel's approach, see where it gets us.



Yes, we have some changes that we've made to the reclaim logic, we hope
to integrate a page cache controller soon. We are also testing the
patches. Hopefully soon enough, they'll be in a good state and we can
request you to merge the containers and the rss limit (plus page cache)
controller soon.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Linux-VServer example results for sharing vs. separate mappings ...

2007-03-25 Thread Balbir Singh

Andrew Morton wrote:

On Mon, 26 Mar 2007 08:06:07 +0530 Balbir Singh [EMAIL PROTECTED] wrote:


Andrew Morton wrote:

Don't we break the global LRU with this scheme?

Sure, but that's deliberate!

(And we don't have a global LRU - the LRUs are per-zone).


Yes, true. But if we use zones for containers and say we have 400
of them, with all of them under limit. When the system wants
to reclaim memory, we might not end up reclaiming the best pages.
Am I missing something?


If a zone is under its min_pages limit, it needs reclaim.  Who/when/why
that reclaim is run doesn't really matter.

Yeah, we might run into some scaling problems with that many zones. 
They're unlikely to be unfixable.




ok.




b) Create a new memory abstraction, call it the software zone, which
   is mostly decoupled from the present hardware zones.  Most of the MM
   is reworked to use software zones.  The software zones are
   runtime-resizeable, and obtain their pages via some means from the
   hardware zones.  A container uses a software zone.


I think the problem would be figuring out where to allocate memory from?
What happens if a software zone spans across many hardware zones?

Yes, that would be the tricky part.  But we generally don't care what
physical zone user pages come from, apart from NUMA optimisation.


The reclaim mechanism proposed *does not impact the non-container users*.

Yup.  Let's keep plugging away with Pavel's approach, see where it gets us.


Yes, we have some changes that we've made to the reclaim logic, we hope
to integrate a page cache controller soon. We are also testing the
patches. Hopefully soon enough, they'll be in a good state and we can
request you to merge the containers and the rss limit (plus page cache)
controller soon.


Now I'm worried again.  This separation between rss controller and
pagecache is largely alien to memory reclaim.  With physical containers
these new concepts (and their implementations) don't need to exist - it is
already all implemented.

Designing brand-new memory reclaim machinery in mid-2007 sounds like a very
bad idea.   But let us see what it looks like.



I did not mean to worry you again :-) We do not plan to implement brand
new memory reclaim, we intend to modify some bits and pieces for per
container reclaim. We believe at this point that all the necessary
infrastructure is largely present in container_isolate_pages(). Adding
a page cache controller should not require core-mm surgery, just the
accounting bits.

We basically agree that designing a brand new reclaim machinery is a bad
idea; non-container users will not be impacted. Only container-driven
reclaim (caused by a container being at its limit) will see some change
in reclaim behaviour, and we shall try to keep the changes as
small as possible.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Cpu statistics accounting based on Paul Menage patches

2007-04-12 Thread Balbir Singh

Andrew Morton wrote:

On Wed, 11 Apr 2007 19:02:27 +0400
Pavel Emelianov [EMAIL PROTECTED] wrote:


Provides a per-container statistics concerning the numbers of tasks
in various states, system and user times, etc. Patch is inspired
by Paul's example of the used CPU time accounting. Although this
patch is independent from Paul's example to make it possible playing
with them separately.


Why is this actually needed?  If userspace has a list of the tasks which
are in a particular container, it can run around and add up the stats for
those tasks without kernel changes?

It's a bit irksome that we have so much accounting of this form in core
kernel, yet we have to go and add a completely new implementation to create
something which is similar to what we already have.  But I don't
immediately see a fix for that.  Apart from paragraph #1 ;)

Should there be linkage between per-container stats and
delivery-via-taskstats?  I can't think of one, really.

You have cpu stats.  Later, presumably, we'll need IO stats, MM stats,
context-switch stats, number-of-syscall stats, etc, etc.  Are we going to
reimplement all of those things as well?  See paragraph #1!

Bottom line: I think we seriously need to find some way of consolidating
per-container stats with our present per-task stats.  Perhaps we should
instead be looking at ways in which we can speed up paragraph #1.



This should be easy to build. per container stats can live in parallel
with per-task stats, but they can use the same general mechanism for
data communication to user space.
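
A sketch of what living in parallel could mean (illustrative; the
structure below is an assumption, modelled on struct taskstats and
delivered over the same genetlink channel):

	struct containerstats {
		__u64	nr_sleeping;		/* tasks in an interruptible wait */
		__u64	nr_running;		/* tasks on a runqueue */
		__u64	nr_uninterruptible;	/* tasks in an uninterruptible wait */
		__u64	user_time;		/* cumulative user time */
		__u64	system_time;		/* cumulative system time */
	};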

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control

2007-02-19 Thread Balbir Singh

Paul Menage wrote:

On 2/19/07, Balbir Singh [EMAIL PROTECTED] wrote:

More worrisome is the potential for use-after-free.  What prevents the
pointer at mm->container from referring to freed memory after we've dropped
the lock?


The container cannot be freed unless all tasks holding references to it are
gone,


... or have been moved to other containers. If you're not holding
task->alloc_lock or one of the container mutexes, there's nothing to
stop the task being moved to another container, and the container
being deleted.

If you're in an RCU section then you can guarantee that the container
(that you originally read from the task) and its subsystems at least
won't be deleted while you're accessing them, but for accounting like
this I suspect that's not enough, since you need to be adding to the
accounting stats on the correct container. I think you'll need to hold
mm->container_lock for the duration of memctl_update_rss().

Paul



Yes, that sounds like the correct thing to do.
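
In code, the suggestion amounts to something like this (a sketch;
mm->container_lock and memctl_update_rss() are the names used in this
thread, the body is illustrative):

	void memctl_update_rss(struct mm_struct *mm, long nr_pages)
	{
		struct container *cont;

		spin_lock(&mm->container_lock);
		cont = mm->container;	/* stable while the lock is held */
		atomic_long_add(nr_pages, &cont->rss_usage);
		spin_unlock(&mm->container_lock);
	}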

--
Warm Regards,
Balbir Singh


Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control

2007-02-19 Thread Balbir Singh

Vaidyanathan Srinivasan wrote:


Balbir Singh wrote:

Paul Menage wrote:

On 2/19/07, Balbir Singh [EMAIL PROTECTED] wrote:

More worrisome is the potential for use-after-free.  What prevents the
pointer at mm->container from referring to freed memory after we've dropped
the lock?


The container cannot be freed unless all tasks holding references to it are
gone,

... or have been moved to other containers. If you're not holding
task->alloc_lock or one of the container mutexes, there's nothing to
stop the task being moved to another container, and the container
being deleted.

If you're in an RCU section then you can guarantee that the container
(that you originally read from the task) and its subsystems at least
won't be deleted while you're accessing them, but for accounting like
this I suspect that's not enough, since you need to be adding to the
accounting stats on the correct container. I think you'll need to hold
mm->container_lock for the duration of memctl_update_rss().

Paul


Yes, that sounds like the correct thing to do.



Accounting accuracy will anyway be affected when a process is migrated
while it is still allocating pages.  Having a lock here does not
necessarily improve the accounting accuracy.  Charges from the old
container would have to be moved to the new container before deletion
which implies all tasks have already left the container and no
mm_struct is holding a pointer to it.

The only condition that will break our code will be if the container
pointer becomes invalid while we are updating stats.  This can be
prevented by RCU section as mentioned by Paul.  I believe explicit
lock and unlock may not provide additional benefit here.



Yes, if the container pointer becomes invalid, then consider the following
scenario

1. Use RCU, get a reference to the container
2. All tasks/mm's move to newer container (and the accounting information
   moves)
3. Container is RCU deleted
4. We still charge the older container that is going to be deleted soon
5. Release RCU
6. RCU garbage collects (callback runs)

We end up charging/uncharging a soon to be deleted container, that
is not good.

What did I miss?
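
The scenario above, expressed as code (illustrative; the field names
are assumptions):

	rcu_read_lock();
	cont = rcu_dereference(mm->container);	/* step 1 */
	/* steps 2 and 3 happen here: tasks move away and the
	 * container is RCU-deleted */
	atomic_long_inc(&cont->usage);		/* step 4: the charge lands
						 * on a dying container */
	rcu_read_unlock();			/* step 5 */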


--Vaidy




--
Warm Regards,
Balbir Singh


[RFC][PATCH][0/4] Memory controller (RSS Control) (

2007-02-24 Thread Balbir Singh
This patch applies on top of Paul Menage's container patches (V7) posted at

http://lkml.org/lkml/2007/2/12/88

It implements a controller within the containers framework for limiting
memory usage (RSS usage).

The memory controller was discussed at length in the RFC posted to lkml
http://lkml.org/lkml/2006/10/30/51

This is version 2 of the patch, version 1 was posted at
http://lkml.org/lkml/2007/2/19/10

I have tried to incorporate all comments; more details can be found
in the changelogs of individual patches. Any remaining mistakes are
all my fault.

The next question could be: why release version 2?

1. It serves as a decision point to decide if we should move to a per-container
   LRU list. Walking through the global LRU is slow; in this patchset I've
   tried to address the LRU churning issue. The patch
   memcontrol-reclaim-on-limit has more details
2. I've included fixes for several of the comments/issues raised in version 1

Steps to use the controller
--
0. Download the patches, apply the patches
1. Turn on CONFIG_CONTAINER_MEMCONTROL in kernel config, build the kernel
   and boot into the new kernel
2. mount -t container container -o memcontrol /<mount point>
3. cd /<mount point>
   optionally do (mkdir directory; cd directory) under /<mount point>
4. echo $$ > tasks (attaches the current shell to the container)
5. echo -n <limit value> > memcontrol_limit
6. cat memcontrol_usage
7. Run tasks, check the usage of the controller, reclaim behaviour
8. Report bugs, get bug fixes and iterate (goto step 0).

Advantages of the patchset
--
1. Zero overhead in struct page (struct page is not expanded)
2. Minimal changes to the core-mm code
3. Shared pages are not reclaimed unless all mappings belong to overlimit
   containers.
4. It can be used to debug drivers/applications/kernel components in a
   constrained memory environment (similar to mem=XXX option), except that
   several containers can be created simultaneously without rebooting and
   the limits can be changed. NOTE: There is no support for limiting
   kernel memory allocations and page cache control (presently).

Testing
---
Created containers, attached tasks to containers with lower limits than
the memory the tasks require (memory hog tests) and ran some basic tests on
them.
Tested the patches on UML and PowerPC. On UML tried the patches with the
config enabled and disabled (sanity check) and with containers enabled
but the memory controller disabled.

TODO's and improvement areas

1. Come up with cool page replacement algorithms for containers - still holds
   good (if possible without any changes to struct page)
2. Add page cache control
3. Add kernel memory allocator control
4. Extract benchmark numbers and overhead data

Comments & criticism are welcome.

Series
--
memcontrol-setup.patch
memcontrol-acct.patch
memcontrol-reclaim-on-limit.patch
memcontrol-doc.patch

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][1/4] RSS controller setup (

2007-02-24 Thread Balbir Singh


Changelog

1. Change the name from memctlr to memcontrol
2. Coding style changes, call the API and then check return value (for kmalloc).
3. Change the output format, to print sizes in both pages and kB
4. Split the usage and limit files to be independent (cat memcontrol_usage
   no longer prints the limit)

TODO's

1. Implement error handling mechanism for handling container_add_file()
   failures (this would depend on the containers code).

This patch sets up the basic controller infrastructure on top of the
containers infrastructure. Two files are provided for monitoring
and control: memcontrol_usage and memcontrol_limit.

memcontrol_usage shows the current usage (in pages, of RSS) and the limit
set by the user.

memcontrol_limit can be used to set a limit on the RSS usage of the resource.
A special value of 0, indicates that the usage is unlimited. The limit
is set in units of pages.

Signed-off-by: [EMAIL PROTECTED]
---

 include/linux/memcontrol.h |   33 +++
 init/Kconfig   |7 +
 mm/Makefile|1 
 mm/memcontrol.c|  193 +
 4 files changed, 234 insertions(+)

diff -puN /dev/null include/linux/memcontrol.h
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/include/linux/memcontrol.h  2007-02-24 
19:39:03.0 +0530
@@ -0,0 +1,33 @@
+/*
+ * memcontrol.h - Memory Controller for containers
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * © Copyright IBM Corporation, 2006-2007
+ *
+ * Author: Balbir Singh [EMAIL PROTECTED]
+ *
+ */
+
+#ifndef _LINUX_MEMCONTROL_H
+#define _LINUX_MEMCONTROL_H
+
+#ifdef CONFIG_CONTAINER_MEMCONTROL
+#ifndef kB
+#define kB 1024	/* One Kilo Byte */
+#endif
+
+#else /* CONFIG_CONTAINER_MEMCONTROL  */
+
+#endif /* CONFIG_CONTAINER_MEMCONTROL */
+#endif /* _LINUX_MEMCONTROL_H */
diff -puN init/Kconfig~memcontrol-setup init/Kconfig
--- linux-2.6.20/init/Kconfig~memcontrol-setup  2007-02-20 21:01:28.0 
+0530
+++ linux-2.6.20-balbir/init/Kconfig2007-02-20 21:01:28.0 +0530
@@ -306,6 +306,13 @@ config CONTAINER_NS
   for instance virtual servers and checkpoint/restart
   jobs.
 
+config CONTAINER_MEMCONTROL
+   bool "A simple RSS based memory controller"
+   select CONTAINERS
+   help
+ Provides a simple Resource Controller for monitoring and
+ controlling the total Resident Set Size of the tasks in a container
+
 config RELAY
bool "Kernel-user space relay support (formerly relayfs)"
help
diff -puN mm/Makefile~memcontrol-setup mm/Makefile
--- linux-2.6.20/mm/Makefile~memcontrol-setup   2007-02-20 21:01:28.0 
+0530
+++ linux-2.6.20-balbir/mm/Makefile 2007-02-20 21:01:28.0 +0530
@@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_CONTAINER_MEMCONTROL) += memcontrol.o
diff -puN /dev/null mm/memcontrol.c
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/mm/memcontrol.c 2007-02-24 19:39:24.0 +0530
@@ -0,0 +1,193 @@
+/*
+ * memcontrol.c - Memory Controller for containers
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * © Copyright IBM Corporation, 2006-2007
+ *
+ * Author: Balbir Singh [EMAIL PROTECTED]
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/container.h>
+#include <linux/memcontrol.h>
+
+#include <asm/uaccess.h>
+
+#define RES_USAGE_NO_LIMIT 0
+static const char version[] = "0.1";
+
+struct res_counter {
+   atomic_long_t usage;/* The current usage of the resource being */
+   /* counted */
+   atomic_long_t limit

[RFC][PATCH][2/4] Add RSS accounting and control (

2007-02-24 Thread Balbir Singh
-		goto out_nomap;
+		goto out_nomap_uncharge;
}
 
/* The page isn't present yet, go ahead with the fault. */
@@ -2068,6 +2084,8 @@ unlock:
pte_unmap_unlock(page_table, ptl);
 out:
return ret;
+out_nomap_uncharge:
+   memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
 out_nomap:
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
@@ -2092,6 +2110,9 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);
 
+   if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT))
+   goto oom;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage(vma, address);
@@ -2108,6 +2129,8 @@ static int do_anonymous_page(struct mm_s
lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
} else {
+   memcontrol_update_rss(mm, 1, MEMCONTROL_DONT_CHECK_LIMIT);
+
/* Map the ZERO_PAGE - vm_page_prot is readonly */
page = ZERO_PAGE(address);
page_cache_get(page);
@@ -2218,6 +2241,9 @@ retry:
}
}
 
+   if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT))
+   goto oom;
+
page_table = pte_offset_map_lock(mm, pmd, address, ptl);
/*
 * For a file-backed vma, someone could have truncated or otherwise
@@ -2227,6 +2253,7 @@ retry:
	if (mapping && unlikely(sequence != mapping->truncate_count)) {
pte_unmap_unlock(page_table, ptl);
page_cache_release(new_page);
+   memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
cond_resched();
	sequence = mapping->truncate_count;
smp_rmb();
@@ -2265,6 +2292,7 @@ retry:
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
+   memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
goto unlock;
}
 
diff -puN mm/rmap.c~memcontrol-acct mm/rmap.c
--- linux-2.6.20/mm/rmap.c~memcontrol-acct  2007-02-24 19:39:29.0 
+0530
+++ linux-2.6.20-balbir/mm/rmap.c   2007-02-24 19:39:29.0 +0530
@@ -602,6 +602,11 @@ void page_remove_rmap(struct page *page,
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : 
NR_FILE_MAPPED);
}
+   /*
+* When we pass MEMCONTROL_DONT_CHECK_LIMIT, it is ok to call
+* this function under the pte lock (since we will not block in reclaim)
+*/
+	memcontrol_update_rss(vma->vm_mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
 }
 
 /*
diff -puN mm/swapfile.c~memcontrol-acct mm/swapfile.c
--- linux-2.6.20/mm/swapfile.c~memcontrol-acct  2007-02-24 19:39:29.0 
+0530
+++ linux-2.6.20-balbir/mm/swapfile.c   2007-02-24 19:39:29.0 +0530
@@ -27,6 +27,7 @@
 #include <linux/mutex.h>
 #include <linux/capability.h>
 #include <linux/syscalls.h>
+#include <linux/memcontrol.h>
 
 #include asm/pgtable.h
 #include asm/tlbflush.h
@@ -514,6 +515,7 @@ static void unuse_pte(struct vm_area_str
	set_pte_at(vma->vm_mm, addr, pte,
		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
	page_add_anon_rmap(page, vma, addr);
+	memcontrol_update_rss(vma->vm_mm, 1, MEMCONTROL_DONT_CHECK_LIMIT);
swap_free(entry);
/*
 * Move the page to the active list so it is not
_
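
A sketch of the contract the memcontrol_update_rss() calls above rely
on (illustrative; the real function lives in the memcontrol-setup and
memcontrol-acct patches, and the field names here are assumptions):

	int memcontrol_update_rss(struct mm_struct *mm, long nr_pages, int check)
	{
		struct res_counter *cnt = &mm->container->res;	/* assumed layout */
		long usage;

		usage = atomic_long_add_return(nr_pages, &cnt->usage);
		if (check == MEMCONTROL_CHECK_LIMIT &&
		    atomic_long_read(&cnt->limit) != RES_USAGE_NO_LIMIT &&
		    usage > atomic_long_read(&cnt->limit)) {
			/* the real version tries per-container reclaim here
			 * and may sleep, which is why call sites holding the
			 * pte lock pass MEMCONTROL_DONT_CHECK_LIMIT */
			atomic_long_sub(nr_pages, &cnt->usage);
			return -ENOMEM;	/* caller treats this as OOM */
		}
		return 0;
	}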

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][3/4] Add reclaim support (

2007-02-24 Thread Balbir Singh
) {
		zone->nr_scan_active += (zone->nr_active >> prio) + 1;
-		if (zone->nr_scan_active >= nr_pages || pass > 3) {
+		if (zone->nr_scan_active >= nr_pages || pass > max_pass) {
			zone->nr_scan_active = 0;
			nr_to_scan = min(nr_pages, zone->nr_active);
			shrink_active_list(nr_to_scan, zone, sc, prio);
@@ -1394,7 +1431,7 @@ static unsigned long shrink_all_zones(un
		}

		zone->nr_scan_inactive += (zone->nr_inactive >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
+		if (zone->nr_scan_inactive >= nr_pages || pass > max_pass) {
			zone->nr_scan_inactive = 0;
			nr_to_scan = min(nr_pages, zone->nr_inactive);
			ret += shrink_inactive_list(nr_to_scan, zone, sc);
@@ -1405,7 +1442,9 @@ static unsigned long shrink_all_zones(un
 
return ret;
 }
+#endif
 
+#ifdef CONFIG_PM
 static unsigned long count_lru_pages(void)
 {
struct zone *zone;
@@ -1477,7 +1516,7 @@ unsigned long shrink_all_memory(unsigned
unsigned long nr_to_scan = nr_pages - ret;
 
sc.nr_scanned = 0;
-		ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
+		ret += shrink_all_zones(nr_to_scan, prio, pass, 3, &sc);
		if (ret >= nr_pages)
goto out;
 
@@ -1512,6 +1551,57 @@ out:
 }
 #endif
 
+#ifdef CONFIG_CONTAINER_MEMCONTROL
+/*
+ * Try to free `nr_pages' of memory, system-wide, and return the number of
+ * freed pages.
+ * Modelled after shrink_all_memory()
+ */
+unsigned long memcontrol_shrink_mapped_memory(unsigned long nr_pages,
+   struct container *container)
+{
+   unsigned long ret = 0;
+   int pass;
+   unsigned long nr_total_scanned = 0;
+
+   struct scan_control sc = {
+   .gfp_mask = GFP_KERNEL,
+   .may_swap = 0,
+   .swap_cluster_max = nr_pages,
+   .may_writepage = 1,
+   .container = container,
+   .may_swap = 1,
+   .swappiness = 100,
+   };
+
+   /*
+* We try to shrink LRUs in 3 passes:
+* 0 = Reclaim from inactive_list only
+* 1 = Reclaim mapped (normal reclaim)
+* 2 = 2nd pass of type 1
+*/
+	for (pass = 0; pass < 3; pass++) {
+		int prio;
+
+		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+			unsigned long nr_to_scan = nr_pages - ret;
+
+			sc.nr_scanned = 0;
+			ret += shrink_all_zones(nr_to_scan, prio,
+						pass, 1, &sc);
+			if (ret >= nr_pages)
+				goto out;
+
+			nr_total_scanned += sc.nr_scanned;
+			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+				congestion_wait(WRITE, HZ / 10);
+   }
+   }
+out:
+   return ret;
+}
+#endif
+
 /* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness.  So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
diff -puN include/linux/mm_types.h~memcontrol-reclaim-on-limit 
include/linux/mm_types.h
diff -puN include/linux/list.h~memcontrol-reclaim-on-limit include/linux/list.h
--- linux-2.6.20/include/linux/list.h~memcontrol-reclaim-on-limit   
2007-02-24 19:40:56.0 +0530
+++ linux-2.6.20-balbir/include/linux/list.h2007-02-24 19:40:56.0 
+0530
@@ -343,6 +343,32 @@ static inline void list_splice(struct li
__list_splice(list, head);
 }
 
+static inline void __list_splice_tail(struct list_head *list,
+   struct list_head *head)
+{
+	struct list_head *first = list->next;
+	struct list_head *last = list->prev;
+	struct list_head *at = head->prev;
+
+	first->prev = at;
+	at->next = first;
+
+	last->next = head;
+	head->prev = last;
+}
+
+/**
+ * list_splice_tail - join two lists, @list goes to the end (at head->prev)
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static inline void list_splice_tail(struct list_head *list,
+   struct list_head *head)
+{
+   if (!list_empty(list))
+   __list_splice_tail(list, head);
+}
+
 /**
  * list_splice_init - join two lists and reinitialise the emptied list.
  * @list: the new list to add.
_
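
An illustrative use of the new helper (the patch's own call sites are
not shown in this excerpt): splicing a batch of isolated pages back at
the tail of an LRU list so that their relative order is preserved:

	LIST_HEAD(isolated);

	/* ... isolate_lru_pages() fills 'isolated' ... */

	list_splice_tail(&isolated, &zone->inactive_list);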

-- 
Warm Regards,
Balbir Singh

[RFC][PATCH][4/4] RSS controller documentation (

2007-02-24 Thread Balbir Singh

---

Signed-off-by: [EMAIL PROTECTED]
---

 Documentation/memctlr.txt |   70 ++
 1 file changed, 70 insertions(+)

diff -puN /dev/null Documentation/memctlr.txt
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/Documentation/memctlr.txt   2007-02-24 
19:41:23.0 +0530
@@ -0,0 +1,70 @@
+Introduction
+------------
+
+The memory controller is a controller module written under the containers
+framework. It can be used to limit the resource usage of a group of
+tasks grouped by the container.
+
+Accounting
+--
+
+The memory controller tracks the RSS usage of the tasks in the container.
+The definition of RSS was debated on lkml in the following thread
+
+   http://lkml.org/lkml/2006/10/10/130
+
+This patch is flexible, it is easy to adapt the patch to any definition
+of RSS. The current accounting is based on the current definition of
+RSS. Each page mapped is charged to the container.
+
+The accounting is done at two levels: each process has RSS accounting in
+its mm_struct and in the container it belongs to. The mm_struct accounting
+is used when a task migrates from one container to another: the accounting
+information for the task is subtracted from the source container and added
+to the destination container. If, as a result of the migration, the
+destination container goes over its limit, no action is taken until some
+task in the destination container runs and tries to map a new page in its
+page table.
+
+The current RSS usage can be seen in the memcontrol_usage file. The value
+is in units of pages.
+
+Control
+-------
+
+The memcontrol_limit file allows the user to set a limit on the number of
+pages that can be mapped by the processes in the container. A special
+value of 0 (the default limit of any new container) indicates that the
+container can use an unlimited amount of RSS.
+
+Reclaim
+-------
+
+When the limit set in the container is hit, the memory controller starts
+reclaiming pages belonging to the container (simulating a local LRU in
+some sense). isolate_lru_pages() has been modified to isolate LRU
+pages belonging to a specific container. Parallel reclaims on the same
+container are not allowed; other tasks end up waiting for any existing
+reclaim to finish.
+
+The reclaim code uses two internal knobs, retries and pushback. pushback
+specifies the percentage of memory to be reclaimed when the container goes
+over its limit. The retries knob controls how many times reclaim is retried
+before the task is killed (because reclaim failed).
+
+Shared pages are treated specially during reclaim. They are not force
+reclaimed; they are only unmapped from containers which are over their limit.
+This ensures that other containers do not pay a penalty for a shared
+page being reclaimed when a particular container goes over its limit.
+
+NOTE: All limits are hard limits.
+
+Future Plans
+------------
+
+The current controller implements only RSS control. It is planned to add
+the following components
+
+1. Page Cache control
+2. mlock'ed memory control
+3. kernel memory allocation control (memory allocated on behalf of a task)
_
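
To make the two knobs concrete (the knob names come from the text
above; the arithmetic is an assumed illustration, not code from the
patch):

	/* pushback = 20 would mean: on hitting the limit, try to
	 * reclaim 20% of the container's pages before retrying */
	unsigned long nr_to_reclaim = (limit * pushback) / 100;
	int i;

	for (i = 0; i < retries; i++)
		if (memcontrol_shrink_mapped_memory(nr_to_reclaim,
						    container) >= nr_to_reclaim)
			break;
	if (i == retries)
		/* reclaim failed 'retries' times: kill the task */
		force_sig(SIGKILL, current);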

-- 
Warm Regards,
Balbir Singh


Memcontrol patchset (was Re: [RFC][PATCH][0/4] Memory controller (RSS Control) ()

2007-02-24 Thread Balbir Singh

Hi,

My script could not parse the "(v2)" and posted the patches with
subjects ending in "(" instead.

I apologize,
Balbir Singh


[RFC][PATCH][1/4] RSS controller setup (v2)

2007-02-25 Thread Balbir Singh


Changelog

1. Change the name from memctlr to memcontrol
2. Coding style changes, call the API and then check return value (for kmalloc).
3. Change the output format, to print sizes in both pages and kB
4. Split the usage and limit files to be independent (cat memcontrol_usage
   no longer prints the limit)

TODO's

1. Implement error handling mechanism for handling container_add_file()
   failures (this would depend on the containers code).

This patch sets up the basic controller infrastructure on top of the
containers infrastructure. Two files are provided for monitoring
and control: memcontrol_usage and memcontrol_limit.

memcontrol_usage shows the current usage (in pages, of RSS) and the limit
set by the user.

memcontrol_limit can be used to set a limit on the RSS usage of the resource.
A special value of 0, indicates that the usage is unlimited. The limit
is set in units of pages.

Signed-off-by: [EMAIL PROTECTED]
---

 include/linux/memcontrol.h |   33 +++
 init/Kconfig   |7 +
 mm/Makefile|1 
 mm/memcontrol.c|  193 +
 4 files changed, 234 insertions(+)

diff -puN /dev/null include/linux/memcontrol.h
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/include/linux/memcontrol.h  2007-02-24 
19:39:03.0 +0530
@@ -0,0 +1,33 @@
+/*
+ * memcontrol.h - Memory Controller for containers
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * © Copyright IBM Corporation, 2006-2007
+ *
+ * Author: Balbir Singh [EMAIL PROTECTED]
+ *
+ */
+
+#ifndef _LINUX_MEMCONTROL_H
+#define _LINUX_MEMCONTROL_H
+
+#ifdef CONFIG_CONTAINER_MEMCONTROL
+#ifndef kB
+#define kB 1024	/* One Kilo Byte */
+#endif
+
+#else /* CONFIG_CONTAINER_MEMCONTROL  */
+
+#endif /* CONFIG_CONTAINER_MEMCONTROL */
+#endif /* _LINUX_MEMCONTROL_H */
diff -puN init/Kconfig~memcontrol-setup init/Kconfig
--- linux-2.6.20/init/Kconfig~memcontrol-setup  2007-02-20 21:01:28.0 
+0530
+++ linux-2.6.20-balbir/init/Kconfig2007-02-20 21:01:28.0 +0530
@@ -306,6 +306,13 @@ config CONTAINER_NS
   for instance virtual servers and checkpoint/restart
   jobs.
 
+config CONTAINER_MEMCONTROL
+   bool "A simple RSS based memory controller"
+   select CONTAINERS
+   help
+ Provides a simple Resource Controller for monitoring and
+ controlling the total Resident Set Size of the tasks in a container
+
 config RELAY
bool "Kernel-user space relay support (formerly relayfs)"
help
diff -puN mm/Makefile~memcontrol-setup mm/Makefile
--- linux-2.6.20/mm/Makefile~memcontrol-setup   2007-02-20 21:01:28.0 
+0530
+++ linux-2.6.20-balbir/mm/Makefile 2007-02-20 21:01:28.0 +0530
@@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_CONTAINER_MEMCONTROL) += memcontrol.o
diff -puN /dev/null mm/memcontrol.c
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/mm/memcontrol.c 2007-02-24 19:39:24.0 +0530
@@ -0,0 +1,193 @@
+/*
+ * memcontrol.c - Memory Controller for containers
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * © Copyright IBM Corporation, 2006-2007
+ *
+ * Author: Balbir Singh [EMAIL PROTECTED]
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/container.h>
+#include <linux/memcontrol.h>
+
+#include <asm/uaccess.h>
+
+#define RES_USAGE_NO_LIMIT 0
+static const char version[] = "0.1";
+
+struct res_counter {
+   atomic_long_t usage;/* The current usage of the resource being */
+   /* counted */
+   atomic_long_t limit

[RFC][PATCH][4/4] RSS controller documentation (v2)

2007-02-25 Thread Balbir Singh

---

Signed-off-by: [EMAIL PROTECTED]
---

 Documentation/memctlr.txt |   70 ++
 1 file changed, 70 insertions(+)

diff -puN /dev/null Documentation/memctlr.txt
--- /dev/null   2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/Documentation/memctlr.txt   2007-02-24 
19:41:23.0 +0530
@@ -0,0 +1,70 @@
+Introduction
+------------
+
+The memory controller is a controller module written under the containers
+framework. It can be used to limit the resource usage of a group of
+tasks grouped by the container.
+
+Accounting
+----------
+
+The memory controller tracks the RSS usage of the tasks in the container.
+The definition of RSS was debated on lkml in the following thread
+
+   http://lkml.org/lkml/2006/10/10/130
+
+This patch is flexible, it is easy to adapt the patch to any definition
+of RSS. The current accounting is based on the current definition of
+RSS. Each page mapped is charged to the container.
+
+The accounting is done at two levels: each process has RSS accounting in
+its mm_struct and in the container it belongs to. The mm_struct accounting
+is used when a task migrates from one container to another: the accounting
+information for the task is subtracted from the source container and added
+to the destination container. If, as a result of the migration, the
+destination container goes over its limit, no action is taken until some
+task in the destination container runs and tries to map a new page in its
+page table.
+
+The current RSS usage can be seen in the memcontrol_usage file. The value
+is in units of pages.
+
+Control
+-------
+
+The memcontrol_limit file allows the user to set a limit on the number of
+pages that can be mapped by the processes in the container. A special
+value of 0 (the default limit of any new container) indicates that the
+container can use an unlimited amount of RSS.
+
+Reclaim
+-------
+
+When the limit set in the container is hit, the memory controller starts
+reclaiming pages belonging to the container (simulating a local LRU in
+some sense). isolate_lru_pages() has been modified to isolate LRU
+pages belonging to a specific container. Parallel reclaims on the same
+container are not allowed; other tasks end up waiting for any existing
+reclaim to finish.
+
+The reclaim code uses two internal knobs, retries and pushback. pushback
+specifies the percentage of memory to be reclaimed when the container goes
+over its limit. The retries knob controls how many times reclaim is retried
+before the task is killed (because reclaim failed).
+
+Shared pages are treated specially during reclaim. They are not force
+reclaimed; they are only unmapped from containers which are over their limit.
+This ensures that other containers do not pay a penalty for a shared
+page being reclaimed when a particular container goes over its limit.
+
+NOTE: All limits are hard limits.
+
+Future Plans
+------------
+
+The current controller implements only RSS control. It is planned to add
+the following components
+
+1. Page Cache control
+2. mlock'ed memory control
+3. kernel memory allocation control (memory allocated on behalf of a task)
_

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][3/4] Add reclaim support (v2)

2007-02-25 Thread Balbir Singh
) {
		zone->nr_scan_active += (zone->nr_active >> prio) + 1;
-		if (zone->nr_scan_active >= nr_pages || pass > 3) {
+		if (zone->nr_scan_active >= nr_pages || pass > max_pass) {
			zone->nr_scan_active = 0;
			nr_to_scan = min(nr_pages, zone->nr_active);
			shrink_active_list(nr_to_scan, zone, sc, prio);
@@ -1394,7 +1431,7 @@ static unsigned long shrink_all_zones(un
		}

		zone->nr_scan_inactive += (zone->nr_inactive >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
+		if (zone->nr_scan_inactive >= nr_pages || pass > max_pass) {
			zone->nr_scan_inactive = 0;
			nr_to_scan = min(nr_pages, zone->nr_inactive);
			ret += shrink_inactive_list(nr_to_scan, zone, sc);
@@ -1405,7 +1442,9 @@ static unsigned long shrink_all_zones(un
 
return ret;
 }
+#endif
 
+#ifdef CONFIG_PM
 static unsigned long count_lru_pages(void)
 {
struct zone *zone;
@@ -1477,7 +1516,7 @@ unsigned long shrink_all_memory(unsigned
unsigned long nr_to_scan = nr_pages - ret;
 
sc.nr_scanned = 0;
-		ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
+		ret += shrink_all_zones(nr_to_scan, prio, pass, 3, &sc);
		if (ret >= nr_pages)
goto out;
 
@@ -1512,6 +1551,57 @@ out:
 }
 #endif
 
+#ifdef CONFIG_CONTAINER_MEMCONTROL
+/*
+ * Try to free `nr_pages' of memory, system-wide, and return the number of
+ * freed pages.
+ * Modelled after shrink_all_memory()
+ */
+unsigned long memcontrol_shrink_mapped_memory(unsigned long nr_pages,
+   struct container *container)
+{
+   unsigned long ret = 0;
+   int pass;
+   unsigned long nr_total_scanned = 0;
+
+   struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.swap_cluster_max = nr_pages,
+		.may_writepage = 1,
+		.container = container,
+		.may_swap = 1,
+		.swappiness = 100,
+   };
+
+   /*
+* We try to shrink LRUs in 3 passes:
+* 0 = Reclaim from inactive_list only
+* 1 = Reclaim mapped (normal reclaim)
+* 2 = 2nd pass of type 1
+*/
+	for (pass = 0; pass < 3; pass++) {
+		int prio;
+
+		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+			unsigned long nr_to_scan = nr_pages - ret;
+
+			sc.nr_scanned = 0;
+			ret += shrink_all_zones(nr_to_scan, prio,
+						pass, 1, &sc);
+			if (ret >= nr_pages)
+				goto out;
+
+			nr_total_scanned += sc.nr_scanned;
+			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+				congestion_wait(WRITE, HZ / 10);
+   }
+   }
+out:
+   return ret;
+}
+#endif
+
 /* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness.  So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
diff -puN include/linux/mm_types.h~memcontrol-reclaim-on-limit 
include/linux/mm_types.h
diff -puN include/linux/list.h~memcontrol-reclaim-on-limit include/linux/list.h
--- linux-2.6.20/include/linux/list.h~memcontrol-reclaim-on-limit   
2007-02-24 19:40:56.0 +0530
+++ linux-2.6.20-balbir/include/linux/list.h2007-02-24 19:40:56.0 
+0530
@@ -343,6 +343,32 @@ static inline void list_splice(struct li
__list_splice(list, head);
 }
 
+static inline void __list_splice_tail(struct list_head *list,
+   struct list_head *head)
+{
+	struct list_head *first = list->next;
+	struct list_head *last = list->prev;
+	struct list_head *at = head->prev;
+
+	first->prev = at;
+	at->next = first;
+
+	last->next = head;
+	head->prev = last;
+}
+
+/**
+ * list_splice_tail - join two lists, @list goes to the end (at head->prev)
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static inline void list_splice_tail(struct list_head *list,
+   struct list_head *head)
+{
+   if (!list_empty(list))
+   __list_splice_tail(list, head);
+}
+
 /**
  * list_splice_init - join two lists and reinitialise the emptied list.
  * @list: the new list to add.
_

-- 
Warm Regards,
Balbir Singh

[RFC][PATCH][0/4] Memory controller (RSS Control) (v2)

2007-02-25 Thread Balbir Singh
This is a repost of the patches at
http://lkml.org/lkml/2007/2/24/65
The previous post had a misleading subject which ended with a (.


This patch applies on top of Paul Menage's container patches (V7) posted at

http://lkml.org/lkml/2007/2/12/88

It implements a controller within the containers framework for limiting
memory usage (RSS usage).

The memory controller was discussed at length in the RFC posted to lkml
http://lkml.org/lkml/2006/10/30/51

This is version 2 of the patch, version 1 was posted at
http://lkml.org/lkml/2007/2/19/10

I have tried to incorporate all comments; more details can be found
in the changelogs of individual patches. Any remaining mistakes are
all my fault.

The next question could be: why release version 2?

1. It serves as a decision point to decide if we should move to a per-container
   LRU list. Walking through the global LRU is slow; in this patchset I've
   tried to address the LRU churning issue. The patch
   memcontrol-reclaim-on-limit has more details
2. I've included fixes for several of the comments/issues raised in version 1

Steps to use the controller
--
0. Download the patches, apply the patches
1. Turn on CONFIG_CONTAINER_MEMCONTROL in kernel config, build the kernel
   and boot into the new kernel
2. mount -t container container -o memcontrol /<mount point>
3. cd /<mount point>
   optionally do (mkdir directory; cd directory) under /<mount point>
4. echo $$ > tasks (attaches the current shell to the container)
5. echo -n <limit value> > memcontrol_limit
6. cat memcontrol_usage
7. Run tasks, check the usage of the controller, reclaim behaviour
8. Report bugs, get bug fixes and iterate (goto step 0).

Advantages of the patchset
--
1. Zero overhead in struct page (struct page is not expanded)
2. Minimal changes to the core-mm code
3. Shared pages are not reclaimed unless all mappings belong to overlimit
   containers.
4. It can be used to debug drivers/applications/kernel components in a
   constrained memory environment (similar to mem=XXX option), except that
   several containers can be created simultaneously without rebooting and
   the limits can be changed. NOTE: There is no support for limiting
   kernel memory allocations and page cache control (presently).

Testing
---
Created containers, attached tasks to containers with lower limits than
the memory the tasks require (memory hog tests) and ran some basic tests on
them.
Tested the patches on UML and PowerPC. On UML tried the patches with the
config enabled and disabled (sanity check) and with containers enabled
but the memory controller disabled.

TODO's and improvement areas

1. Come up with cool page replacement algorithms for containers - still holds
   good (if possible without any changes to struct page)
2. Add page cache control
3. Add kernel memory allocator control
4. Extract benchmark numbers and overhead data

Comments & criticism are welcome.

Series
--
memcontrol-setup.patch
memcontrol-acct.patch
memcontrol-reclaim-on-limit.patch
memcontrol-doc.patch

-- 
Warm Regards,
Balbir Singh


[RFC][PATCH][2/4] Add RSS accounting and control (v2)

2007-02-25 Thread Balbir Singh
-		goto out_nomap;
+		goto out_nomap_uncharge;
}
 
/* The page isn't present yet, go ahead with the fault. */
@@ -2068,6 +2084,8 @@ unlock:
pte_unmap_unlock(page_table, ptl);
 out:
return ret;
+out_nomap_uncharge:
+   memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
 out_nomap:
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
@@ -2092,6 +2110,9 @@ static int do_anonymous_page(struct mm_s
/* Allocate our own private page. */
pte_unmap(page_table);
 
+   if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT))
+   goto oom;
+
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_zeroed_user_highpage(vma, address);
@@ -2108,6 +2129,8 @@ static int do_anonymous_page(struct mm_s
lru_cache_add_active(page);
page_add_new_anon_rmap(page, vma, address);
} else {
+   memcontrol_update_rss(mm, 1, MEMCONTROL_DONT_CHECK_LIMIT);
+
/* Map the ZERO_PAGE - vm_page_prot is readonly */
page = ZERO_PAGE(address);
page_cache_get(page);
@@ -2218,6 +2241,9 @@ retry:
}
}
 
+   if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT))
+   goto oom;
+
page_table = pte_offset_map_lock(mm, pmd, address, ptl);
/*
 * For a file-backed vma, someone could have truncated or otherwise
@@ -2227,6 +2253,7 @@ retry:
	if (mapping && unlikely(sequence != mapping->truncate_count)) {
pte_unmap_unlock(page_table, ptl);
page_cache_release(new_page);
+   memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
cond_resched();
	sequence = mapping->truncate_count;
smp_rmb();
@@ -2265,6 +2292,7 @@ retry:
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
+   memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
goto unlock;
}
 
diff -puN mm/rmap.c~memcontrol-acct mm/rmap.c
--- linux-2.6.20/mm/rmap.c~memcontrol-acct  2007-02-24 19:39:29.0 
+0530
+++ linux-2.6.20-balbir/mm/rmap.c   2007-02-24 19:39:29.0 +0530
@@ -602,6 +602,11 @@ void page_remove_rmap(struct page *page,
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : 
NR_FILE_MAPPED);
}
+   /*
+* When we pass MEMCONTROL_DONT_CHECK_LIMIT, it is ok to call
+* this function under the pte lock (since we will not block in reclaim)
+*/
+	memcontrol_update_rss(vma->vm_mm, -1, MEMCONTROL_DONT_CHECK_LIMIT);
 }
 
 /*
diff -puN mm/swapfile.c~memcontrol-acct mm/swapfile.c
--- linux-2.6.20/mm/swapfile.c~memcontrol-acct  2007-02-24 19:39:29.0 
+0530
+++ linux-2.6.20-balbir/mm/swapfile.c   2007-02-24 19:39:29.0 +0530
@@ -27,6 +27,7 @@
 #include <linux/mutex.h>
 #include <linux/capability.h>
 #include <linux/syscalls.h>
+#include <linux/memcontrol.h>
 
 #include asm/pgtable.h
 #include asm/tlbflush.h
@@ -514,6 +515,7 @@ static void unuse_pte(struct vm_area_str
	set_pte_at(vma->vm_mm, addr, pte,
		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
	page_add_anon_rmap(page, vma, addr);
+	memcontrol_update_rss(vma->vm_mm, 1, MEMCONTROL_DONT_CHECK_LIMIT);
swap_free(entry);
/*
 * Move the page to the active list so it is not
_

-- 
Warm Regards,
Balbir Singh


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Balbir Singh

Andrew Morton wrote:

So some urgent questions are: how are we going to do mem hotunplug and
per-container RSS?



Our basic unit of memory management is the zone.  Right now, a zone maps
onto some hardware-imposed thing.  But the zone-based MM works *well*.  I
suspect that a good way to solve both per-container RSS and mem hotunplug
is to split the zone concept away from its hardware limitations: create a
software zone and a hardware zone.  All the existing page allocator and
reclaim code remains basically unchanged, and it operates on software
zones.  Each software zones always lies within a single hardware zone. 
The software zones are resizeable.  For per-container RSS we give each

container one (or perhaps multiple) resizeable software zones.

For memory hotunplug, some of the hardware zone's software zones are marked
reclaimable and some are not; DIMMs which are wholly within reclaimable
zones can be depopulated and powered off or removed.

NUMA and cpusets screwed up: they've gone and used nodes as their basic
unit of memory management whereas they should have used zones.  This will
need to be untangled.


Anyway, that's just a shot in the dark.  Could be that we implement unplug
and RSS control by totally different means.  But I do wish that we'd sort
out what those means will be before we potentially complicate the story a
lot by adding antifragmentation.



Paul Menage had suggested something very similar in response to the RFC
for memory controllers I sent out, and it was suggested that we create
small zones (roughly 64 MB) to avoid the issue of a zone/node not being
shareable across containers. Even with a small size, there are some
issues. The following thread has the details discussed.


http://lkml.org/lkml/2006/10/30/120

RSS accounting is very easy (with minimal changes to the core mm);
supplemented with an efficient per-container reclaimer, it should be
easy to implement a good per-container RSS controller.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Balbir Singh

Linus Torvalds wrote:

On Thu, 1 Mar 2007, Andrew Morton wrote:

So some urgent questions are: how are we going to do mem hotunplug and
per-container RSS?


Also: how are we going to do this in virtualized environments? Usually the 
people who care about memory hotunplug are exactly the same people who
also care (or claim to care, or _will_ care) about virtualization.


My personal opinion is that while I'm not a huge fan of virtualization, 
these kinds of things really _can_ be handled more cleanly at that layer, 
and not in the kernel at all. Afaik, it's what IBM already does, and has 
been doing for a while. There's no shame in looking at what already works, 
especially if it's simpler.


Could you please clarify what that layer means - is it the
firmware/hardware for virtualization, or does it refer to user space?
With virtualization, the Linux kernel would end up acting as a hypervisor,
and resource management support like per-container RSS support needs to
be built into the kernel.

It would also be useful to have a resource controller like per-container
RSS control (container refers to a task grouping) within the kernel for
non-virtualized environments as well.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: The performance and behaviour of the anti-fragmentation related patches

2007-03-01 Thread Balbir Singh

Linus Torvalds wrote:


On Fri, 2 Mar 2007, Balbir Singh wrote:

My personal opinion is that while I'm not a huge fan of virtualization,
these kinds of things really _can_ be handled more cleanly at that layer,
and not in the kernel at all. Afaik, it's what IBM already does, and has
been doing for a while. There's no shame in looking at what already works,
especially if it's simpler.

Could you please clarify as to what that layer means - is it the
firmware/hardware for virtualization? or does it refer to user space?


Virtualization in general. We don't know what it is - in IBM machines it's 
a hypervisor. With Xen and VMware, it's usually a hypervisor too. With 
KVM, it's obviously a host Linux kernel/user-process combination.




Thanks for clarifying.

The point being that in the guests, hotunplug is almost useless (for 
bigger ranges), and we're much better off just telling the virtualization 
hosts on a per-page level whether we care about a page or not, than to 
worry about fragmentation.


And in hosts, we usually don't care EITHER, since it's usually done in a 
hypervisor.



It would also be useful to have a resource controller like per-container
RSS control (container refers to a task grouping) within the kernel for
non-virtualized environments as well.


.. but this has again no impact on anti-fragmentation.



Yes, I agree that anti-fragmentation and resource management are independent
of each other. I must admit to being a bit selfish here, in that my main
interest is in resource management, and we would love to see a well
written and easy to understand resource management infrastructure and
controllers to control CPU and memory usage. Since the issue of
per-container RSS control came up, I wanted to ensure that we do not mix
up resource control and anti-fragmentation.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [ckrm-tech] [PATCH 1/2] rcfs core patch

2007-03-01 Thread Balbir Singh

Srivatsa Vaddagiri wrote:

Heavily based on Paul Menage's (in turn cpuset's) work. The big difference
is that the patch uses task->nsproxy to group tasks for resource control
purposes (instead of task->containers).

The patch retains the same user interface as Paul Menage's patches. In
particular, you can have multiple hierarchies, each hierarchy giving a 
different composition/view of task-groups.


(Ideally this patch should have been split into 2 or 3 sub-patches, but
will do that on a subsequent version post)



With this, don't we end up with a lot of duplication between cpusets and rcfs?



Signed-off-by : Srivatsa Vaddagiri [EMAIL PROTECTED]
Signed-off-by : Paul Menage [EMAIL PROTECTED]


---

 linux-2.6.20-vatsa/include/linux/init_task.h |4 
 linux-2.6.20-vatsa/include/linux/nsproxy.h   |5 
 linux-2.6.20-vatsa/init/Kconfig  |   22 
 linux-2.6.20-vatsa/init/main.c   |1 
 linux-2.6.20-vatsa/kernel/Makefile   |1 



---


The diffstat does not look quite right.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] [RSDL 1/6] lists: add list splice tail

2007-03-04 Thread Balbir Singh

Con Kolivas wrote:

Add a list_splice_tail variant of list_splice.

Patch-by: Peter Zijlstra [EMAIL PROTECTED]
Signed-off-by: Con Kolivas [EMAIL PROTECTED]



Acked-by: Balbir Singh [EMAIL PROTECTED]

I had the exact same patch in my memcontrol series at
http://lkml.org/lkml/2007/2/24/68 (see the last
two functions).


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 1/7] Resource counters

2007-03-06 Thread Balbir Singh
 simple_read_from_buffer((void __user *)userbuf, nbytes,
+   pos, buf, s - buf);
+}
+
+ssize_t res_counter_write(struct res_counter *cnt, int member,
+   const char __user *userbuf, size_t nbytes, loff_t *pos)
+{
+   int ret;
+   char *buf, *end;
+   unsigned long tmp, *val;
+
+   buf = kmalloc(nbytes + 1, GFP_KERNEL);
+   ret = -ENOMEM;
+   if (buf == NULL)
+   goto out;
+
+   buf[nbytes] = 0;
+   ret = -EFAULT;
+   if (copy_from_user(buf, userbuf, nbytes))
+   goto out_free;
+
+   ret = -EINVAL;
+	tmp = simple_strtoul(buf, &end, 10);
+   if (*end != '\0')
+   goto out_free;
+
+   val = res_counter_member(cnt, member);
+   *val = tmp;
+   ret = nbytes;
+out_free:
+   kfree(buf);
+out:
+   return ret;
+}




These bits look a little out of sync, with no users for these routines in
this patch. Won't you get a compiler warning, compiling this bit alone?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-06 Thread Balbir Singh
-res.usage;
+}
+
+static void rss_move_task(struct container_subsys *ss,
+   struct container *cont,
+   struct container *old_cont,
+   struct task_struct *p)
+{
+   struct mm_struct *mm;
+   struct rss_container *rss, *old_rss;
+
+   mm = get_task_mm(p);
+   if (mm == NULL)
+   goto out;
+
+   rss = rss_from_cont(cont);
+   old_rss = rss_from_cont(old_cont);
+	if (old_rss != mm->rss_container)
+		goto out_put;
+
+	css_get_current(&rss->css);
+	rcu_assign_pointer(mm->rss_container, rss);
+	css_put(&old_rss->css);
+


I see that the charges are not migrated. Is that good?
If a user could find a way of migrating his/her task from
one container to another, it could create an issue with
the user's task taking up a big chunk of the RSS limit.

Can we migrate any task, or just the thread group leader?
In my patches, I allowed migration of just the thread
group leader. Imagine if you have several threads: no
matter which container they belong to, their mm gets
charged (the usage will not show up in their container's usage).
This could confuse the system administrator.
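
For contrast, a sketch of what migrating the charges would look like
(an assumption; the patch above deliberately leaves the charges in the
old container):

	static void rss_move_charges(struct mm_struct *mm,
				     struct rss_container *from,
				     struct rss_container *to)
	{
		long rss = get_mm_counter(mm, anon_rss) +
			   get_mm_counter(mm, file_rss);

		atomic_long_sub(rss, &from->res.usage);
		atomic_long_add(rss, &to->res.usage);	/* may push 'to' over
							 * its limit; see the
							 * comment above */
	}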


+out_put:
+   mmput(mm);
+out:
+   return;
+}
+
+static int rss_create(struct container_subsys *ss, struct container *cont)
+{
+   struct rss_container *rss;
+
+   rss = kzalloc(sizeof(struct rss_container), GFP_KERNEL);
+   if (rss == NULL)
+   return -ENOMEM;
+
+	res_counter_init(&rss->res);
+	INIT_LIST_HEAD(&rss->page_list);
+	cont->subsys[rss_subsys.subsys_id] = &rss->css;
+   return 0;
+}
+
+static void rss_destroy(struct container_subsys *ss,
+   struct container *cont)
+{
+   kfree(rss_from_cont(cont));
+}
+
+
+static ssize_t rss_read(struct container *cont, struct cftype *cft,
+   struct file *file, char __user *userbuf,
+   size_t nbytes, loff_t *ppos)
+{
+	return res_counter_read(&rss_from_cont(cont)->res, cft->private,
+   userbuf, nbytes, ppos);
+}
+
+static ssize_t rss_write(struct container *cont, struct cftype *cft,
+   struct file *file, const char __user *userbuf,
+   size_t nbytes, loff_t *ppos)
+{
+	return res_counter_write(&rss_from_cont(cont)->res, cft->private,
+   userbuf, nbytes, ppos);
+}
+
+
+static struct cftype rss_usage = {
+	.name = "rss_usage",
+   .private = RES_USAGE,
+   .read = rss_read,
+};
+
+static struct cftype rss_limit = {
+	.name = "rss_limit",
+   .private = RES_LIMIT,
+   .read = rss_read,
+   .write = rss_write,
+};
+
+static struct cftype rss_failcnt = {
+	.name = "rss_failcnt",
+   .private = RES_FAILCNT,
+   .read = rss_read,
+};
+
+static int rss_populate(struct container_subsys *ss,
+   struct container *cont)
+{
+   int rc;
+
+	if ((rc = container_add_file(cont, &rss_usage)) < 0)
+		return rc;
+	if ((rc = container_add_file(cont, &rss_failcnt)) < 0)
+		return rc;
+	if ((rc = container_add_file(cont, &rss_limit)) < 0)
+   return rc;
+
+   return 0;
+}
+
+static struct rss_container init_rss_container;
+
+static __init int rss_create_early(struct container_subsys *ss,
+   struct container *cont)
+{
+   struct rss_container *rss;
+
+	rss = &init_rss_container;
+	res_counter_init(&rss->res);
+	INIT_LIST_HEAD(&rss->page_list);
+	cont->subsys[rss_subsys.subsys_id] = &rss->css;
+	ss->create = rss_create;
+   return 0;
+}
+
+static struct container_subsys rss_subsys = {
+	.name = "rss",
+   .create = rss_create_early,
+   .destroy = rss_destroy,
+   .populate = rss_populate,
+   .attach = rss_move_task,
+};
+
+void __init container_rss_init_early(void)
+{
+	container_register_subsys(&rss_subsys);
+	init_mm.rss_container = rss_from_cont(
+			task_container(&init_task, &rss_subsys));
+	css_get_current(&init_mm.rss_container->css);
+}




--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 0/7] Resource controllers based on process containers

2007-03-06 Thread Balbir Singh

Pavel Emelianov wrote:

This patchset adds RSS accounting and control, and
limits the number of tasks and files within a container.

Based on top of Paul Menage's container subsystem v7.

The RSS controller includes a per-container RSS accounter,
reclamation and an OOM killer. It behaves like a standalone
machine - when a container runs out of resources it tries
to reclaim some pages, and if it doesn't succeed it
kills some task whose mm_struct belongs to the container in
question.

The task-count and file-count containers are very simple and
self-descriptive from the code.

As discussed before, when a task moves from one container
to another, no resources follow it - they keep holding the
container they were allocated in.



I have one problem with the patchset: I cannot compile
the patches individually, and some of the code is hard
to read as it depends on functions from future patches.
Patches 2, 3 and 4 fail to compile without patch 5 applied.

Patch 1 failed to apply with a reject in kernel/Makefile
I applied it on top of 2.6.20 with all of Paul Menage's
patches (all 7).



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: controlling mmap()'d vs read/write() pages

2007-03-28 Thread Balbir Singh

Herbert Poetzl wrote:

To me, one of the keys of Linux's global optimizations is being able
to use any memory globally for its most effective purpose, globally
(please ignore highmem :).  Let's say I have a 1GB container on a
machine that is at least 100% committed.  I mmap() a 1GB file and touch
the entire thing (I never touch it again).  I then go open another 1GB
file and r/w to it until the end of time.  I'm at or below my RSS limit,
but that 1GB of RAM could surely be better used for the second file.
How do we do this if we only account for a user's RSS?  Does this fit
into Alan's unfair bucket? ;)


what's the difference to a normal Linux system here?
when low on memory, the system will reclaim pages, and
guess what pages will be reclaimed first ...



But would it not bias application writers towards using read()/write()
calls over mmap()? They know that their calls are likely to be faster
when the application is run in a container. Without page cache control
we'll end up creating an asymmetrical container, where certain usage is 
charged and some usage is not.


Also, please note that when a page is unmapped and moved to the swap
cache, the swap cache uses the page cache. Without page cache control,
we could end up with too many pages moving over to the swap cache and
still occupying memory, while the original intention was to avoid this
scenario.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Linux 2.6.24-rc7 Build-Failure at __you_cannot_kmalloc_that_much

2008-01-08 Thread Balbir Singh
Kamalesh Babulal wrote:
 Andrew Morton wrote:
 On Mon, 07 Jan 2008 16:06:20 +0530 Kamalesh Babulal [EMAIL PROTECTED] 
 wrote:

 The defconfig make fails on x86_64 (AMD box) with following error

   CHK include/linux/utsrelease.h
   CALLscripts/checksyscalls.sh
   CHK include/linux/compile.h
   GEN .version
   CHK include/linux/compile.h
   UPD include/linux/compile.h
   CC  init/version.o
   LD  init/built-in.o
   LD  .tmp_vmlinux1
 drivers/built-in.o(.init.text+0x8d76): In function `dmi_id_init':
 : undefined reference to `__you_cannot_kmalloc_that_much'
 make: *** [.tmp_vmlinux1] Error 1


 # gcc --version
 gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-59)

 This was reported by Adrian Bunk http://lkml.org/lkml/2007/12/1/39
 That's odd.  afacit the only kmalloc in dmi_id_init() is

 dmi_dev = kzalloc(sizeof(*dmi_dev), GFP_KERNEL);

 and even gcc-3.2.3 should be able to get that right.

 Could you please a) verify that simply removing that line fixes the build
 error and then b) try to find some way of fixing it?

 Try replacing `sizeof(*dmi_dev)' with `sizeof(struct dmi_device_attribute)'
 and any other tricks you can think of to try to make the compiler process
 the code differently.

 
 removing the line fixes the issue, but changing the sizeof(*dmi_dev) to
 sizeof(struct device) is not helping.
 

Hi, Andrew,

We tried the following: we generated stabs debug information and found
that struct device is 560 bytes. We also found that the dead code
(__you_cannot_kmalloc_that_much) was not being eliminated, even though
nothing calls that function. I suspect __builtin_constant_p() and
dead-code elimination are the root causes of this error.
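For reference, here is a minimal userspace sketch of the kind of
construct involved: a deliberately undefined function is called inside
a branch that the optimizer is expected to prove dead. The symbol name
and the size cap below are made up for illustration:

#include <stdio.h>
#include <stdlib.h>

#define MAX_ALLOC	(128 * 1024)	/* pretend cap on allocation size */

extern void *__you_cannot_allocate_that_much(void);	/* never defined */

static inline void *checked_alloc(size_t size)
{
	/* For a small constant size this branch is compile-time false;
	 * with working dead-code elimination the undefined symbol is
	 * never referenced and the program links cleanly. */
	if (__builtin_constant_p(size) && size > MAX_ALLOC)
		return __you_cannot_allocate_that_much();
	return malloc(size);
}

int main(void)
{
	void *p = checked_alloc(sizeof(int));

	printf("allocated %p\n", p);
	free(p);
	return 0;
}

With gcc -O2 this links; if the compiler fails to prove the branch dead
(as gcc 3.2.3 appears to do in the report above), the call survives and
the link fails with an undefined reference, exactly the failure mode
seen here.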

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [LTP] Container Code Coverage for 2.6.23 mainline kernel

2008-01-09 Thread Balbir Singh
On Jan 9, 2008 12:52 PM, Rishikesh K. Rajak [EMAIL PROTECTED] wrote:

  Hi All,

  You can find the code coverage data for the container code which has been
 merged with mainline linux-2.6.23; the respective testcases are merged with
 LTP for the features called SYSVIPC NAMESPACE & UTS NAMESPACE.

  I have generated the coverage on an s390x machine; more info can be found
 below.

  Linux  2.6.23-gcov-autokern1 #1 SMP Wed Jan 9 00:27:11 EST 2008 s390x s390x
 s390x GNU/Linux

  Please let me know if you need more info.

  Thanks
  Rishi


Rishi,

Something is wrong with the HTML within the tarball; it still points
to files in /home/rishi/. Could you please fix that?

Thanks,
Balbir


Re: [LTP] Container Code Coverage for 2.6.23 mainline kernel

2008-01-09 Thread Balbir Singh
On Jan 9, 2008 2:45 PM, Subrata Modak [EMAIL PROTECTED] wrote:

 On Wed, 2008-01-09 at 14:38 +0530, Balbir Singh wrote:
  On Jan 9, 2008 12:52 PM, Rishikesh K. Rajak [EMAIL PROTECTED] wrote:
  
Hi All,
  
  You can find the code coverage data for the container code which has been
    merged with mainline linux-2.6.23; the respective testcases are merged with
    LTP for the features called SYSVIPC NAMESPACE & UTS NAMESPACE.
  
  I have generated the coverage on an s390x machine; more info can be found
    below.
  
Linux  2.6.23-gcov-autokern1 #1 SMP Wed Jan 9 00:27:11 EST 2008 s390x 
   s390x
   s390x GNU/Linux
  
Please let me know if you need more info.
  
Thanks
Rishi
  
 
  Rishi,
 
  Something is wrong with the HTML within the tarball; it still points
  to files in /home/rishi/. Could you please fix that?

 Balbir, this HTML is just a saved-as version of the original HTML, as
 Rishi cannot tar and send the original HTML file: it refers to
 hundreds of other files, which would be difficult to pack and send to
 the mailing list. So he has saved the original HTML and posted just
 that.

 --Subrata

 
  Thanks,
  Balbir

Subrata,

These files, or their data, are required to interpret the coverage
results. Details of each file and the lines covered will help
developers and testers know what is not covered by these test cases.

Balbir


Re: [patch 05/19] split LRU lists into anon file sets

2008-01-09 Thread Balbir Singh
* KAMEZAWA Hiroyuki [EMAIL PROTECTED] [2008-01-09 13:41:32]:

 I like this patch set thank you.
 
 On Tue, 08 Jan 2008 15:59:44 -0500
 Rik van Riel [EMAIL PROTECTED] wrote:
  Index: linux-2.6.24-rc6-mm1/mm/memcontrol.c
  ===================================================================
  --- linux-2.6.24-rc6-mm1.orig/mm/memcontrol.c	2008-01-07 11:55:09.000000000 -0500
  +++ linux-2.6.24-rc6-mm1/mm/memcontrol.c	2008-01-07 17:32:53.000000000 -0500
 snip
 
  -enum mem_cgroup_zstat_index {
  -	MEM_CGROUP_ZSTAT_ACTIVE,
  -	MEM_CGROUP_ZSTAT_INACTIVE,
  -
  -	NR_MEM_CGROUP_ZSTAT,
  -};
  -
   struct mem_cgroup_per_zone {
  	/*
  	 * spin_lock to protect the per cgroup LRU
  	 */
  	spinlock_t		lru_lock;
  -	struct list_head	active_list;
  -	struct list_head	inactive_list;
  -	unsigned long		count[NR_MEM_CGROUP_ZSTAT];
  +	struct list_head	lists[NR_LRU_LISTS];
  +	unsigned long		count[NR_LRU_LISTS];
   };
   /* Macro for accessing counter */
   #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
  @@ -160,6 +152,7 @@ struct page_cgroup {
   };
   #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
   #define PAGE_CGROUP_FLAG_ACTIVE	(0x2)	/* page is active in this cgroup */
  +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
  
 
 Now, we don't have control_type and a feature for accounting only CACHE.
 Balbir-san, do you have some new plan ?


Hi, KAMEZAWA-San,

The control_type feature is gone. We still have cached page
accounting, but we do not allow control of only RSS pages anymore; we
need to control both RSS and cached pages. I do not understand your
question about a new plan; is it about adding back control_type?

 
  BTW, is it better to use PageSwapBacked(pc->page) rather than adding a
  new flag PAGE_CGROUP_FLAG_FILE?
 
 
 PAGE_CGROUP_FLAG_ACTIVE is used because global reclaim can change
 ACTIVE/INACTIVE attribute without accessing memory cgroup.
  (Then, we cannot trust PageActive(pc->page))
 

Yes, correct. A page active on the node's zone LRU need not be active
in the memory cgroup.

  The ANON <-> FILE attribute can be changed dynamically (after being
  added to the LRU)?
  
  If no, using page_file_cache(pc->page) will be easy.
 
 Thanks,
 -Kame
 

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch 05/19] split LRU lists into anon file sets

2008-01-09 Thread Balbir Singh
* KAMEZAWA Hiroyuki [EMAIL PROTECTED] [2008-01-10 11:36:18]:

 On Thu, 10 Jan 2008 07:51:33 +0530
 Balbir Singh [EMAIL PROTECTED] wrote:
 
 #define PAGE_CGROUP_FLAG_CACHE (0x1)   /* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)  /* page is active in this 
cgroup */
+#define PAGE_CGROUP_FLAG_FILE  (0x4)   /* page is file system backed */

   
   Now, we don't have control_type and a feature for accounting only CACHE.
   Balbir-san, do you have some new plan ?
  
  
  Hi, KAMEZAWA-San,
  
  The control_type feature is gone. We still have cached page
  accounting, but we do not allow control of only RSS pages anymore. We
  need to control both RSS+cached pages. I do not understand your
  question about new plan? Is it about adding back control_type?
  
 Ah, just wanted to confirm that we can drop PAGE_CGROUP_FLAG_CACHE
 if page_file_cache() function and split-LRU is introduced.
 

Earlier we would have had a problem, since we even accounted swap
cache pages with PAGE_CGROUP_FLAG_CACHE, and I think page_file_cache()
does not treat swap cache pages as file backed. Our accounting is
based on mapped vs unmapped, whereas the new code from Rik accounts
file vs anonymous. I suspect we could live a little while longer
with PAGE_CGROUP_FLAG_CACHE and then, if we do not need it at all,
mark it down for removal. What do you think?


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch 00/19] VM pageout scalability improvements

2008-01-11 Thread Balbir Singh
* Rik van Riel [EMAIL PROTECTED] [2008-01-08 15:59:39]:

 On large memory systems, the VM can spend way too much time scanning
 through pages that it cannot (or should not) evict from memory. Not
 only does it use up CPU time, but it also provokes lock contention
 and can leave large systems under memory pressure in a catatonic state.
 
 Against 2.6.24-rc6-mm1
 
 This patch series improves VM scalability by:
 
 1) making the locking a little more scalable
 
 2) putting filesystem backed, swap backed and non-reclaimable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
 
 3) switching to SEQ replacement for the anonymous LRUs, so the
number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
 
 More info on the overall design can be found at:
 
   http://linux-mm.org/PageReplacementDesign
 
 
 Changelog:
 - merge memcontroller split LRU code into the main split LRU patch,
   since it is not functionally different (it was split up only to help
   people who had seen the last version of the patch series review it)
 - drop the page_file_cache debugging patch, since it never triggered
 - reintroduce code to not scan anon list if swap is full
 - add code to scan anon list if page cache is very small already
 - use lumpy reclaim more aggressively for smaller order > 1 allocations


Hi, Rik,

I've just started on the patch series; the compile fails for me on a
powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
elsewhere, in mm/page-writeback.c. Nothing global_lru_pages() uses
depends on CONFIG_PM. Here's a simple patch to fix it.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b14e188..39e6aef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1920,6 +1920,14 @@ void wakeup_kswapd(struct zone *zone, int order)
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
+unsigned long global_lru_pages(void)
+{
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
+}
+
 #ifdef CONFIG_PM
 /*
  * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
@@ -1968,14 +1976,6 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
 	return ret;
 }
 
-unsigned long global_lru_pages(void)
-{
-	return global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_FILE);
-}
-
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch 00/19] VM pageout scalability improvements

2008-01-11 Thread Balbir Singh
* Rik van Riel [EMAIL PROTECTED] [2008-01-08 15:59:39]:

 Changelog:
 - merge memcontroller split LRU code into the main split LRU patch,
   since it is not functionally different (it was split up only to help
   people who had seen the last version of the patch series review it)

Hi, Rik,

I see strange behaviour with this patchset. I have a program
(pagetest from Vaidy) that does the following:

1. Can allocate different kinds of memory, mapped, malloc'ed or shared
2. Allocates and touches all the memory in a loop (2 times)

I mount the memory controller and limit it to 400M and run pagetest
and ask it to touch 1000M. Without this patchset everything runs fine,
but with this patchset installed, I immediately see

 pagetest invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
 Call Trace:
 [c000e5aef400] [c000eb24] .show_stack+0x70/0x1bc (unreliable)
 [c000e5aef4b0] [c00c] .oom_kill_process+0x80/0x260
 [c000e5aef570] [c00bc498] .mem_cgroup_out_of_memory+0x6c/0x98
 [c000e5aef610] [c00f2574] .mem_cgroup_charge_common+0x1e0/0x414
 [c000e5aef6e0] [c00b852c] .add_to_page_cache+0x48/0x164
 [c000e5aef780] [c00b8664] .add_to_page_cache_lru+0x1c/0x68
 [c000e5aef810] [c012db50] .mpage_readpages+0xbc/0x15c
 [c000e5aef940] [c018bdac] .ext3_readpages+0x28/0x40
 [c000e5aef9c0] [c00c3978] .__do_page_cache_readahead+0x158/0x260
 [c000e5aefa90] [c00bac44] .filemap_fault+0x18c/0x3d4
 [c000e5aefb70] [c00cd510] .__do_fault+0xb0/0x588
 [c000e5aefc80] [c05653cc] .do_page_fault+0x440/0x620
 [c000e5aefe30] [c0005408] handle_page_fault+0x20/0x58
 Mem-info:
 Node 0 DMA per-cpu:
 CPU0: hi:6, btch:   1 usd:   4
 CPU1: hi:6, btch:   1 usd:   0
 CPU2: hi:6, btch:   1 usd:   3
 CPU3: hi:6, btch:   1 usd:   4
 Active_anon:9099 active_file:1523 inactive_anon:0
  inactive_file:2869 noreclaim:0 dirty:20 writeback:0 unstable:0
  free:44210 slab:639 mapped:1724 pagetables:475 bounce:0
 Node 0 DMA free:2829440kB min:7808kB low:9728kB high:11712kB
 active_anon:582336kB inactive_anon:0kB active_file:97472kB
 inactive_file:183616kB noreclaim:0kB present:3813760kB pages_scanned:0
 all_unreclaimable? no
 lowmem_reserve[]: 0 0 0
 Node 0 DMA: 3*64kB 5*128kB 5*256kB 4*512kB 2*1024kB 4*2048kB 3*4096kB
 2*8192kB 170*16384kB = 2828352kB
 Swap cache: add 0, delete 0, find 0/0
 Free swap  = 3148608kB
 Total swap = 3148608kB
 Free swap:   3148608kB
 59648 pages of RAM
 677 reserved pages
 28165 pages shared
 0 pages swap cached
 Memory cgroup out of memory: kill process 6593 (pagetest) score 1003 or a child
 Killed process 6593 (pagetest)

I am using a powerpc box with 64K pages. I'll try to investigate
further; this is just a heads-up on the failure I am seeing.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH] per-task I/O throttling

2008-01-11 Thread Balbir Singh
On Jan 11, 2008 4:15 AM, Andrea Righi [EMAIL PROTECTED] wrote:
 Allow to limit the bandwidth of I/O-intensive processes, like backup
 tools running in background, large files copy, checksums on huge files,
 etc.

 This kind of processes can noticeably impact the system responsiveness
 for some time and playing with tasks' priority is not always an
 acceptable solution.

 This patch allows one to specify a maximum I/O rate, in sectors per
 second, for each single process via /proc/PID/io_throttle (the default
 is zero, which means no limit).

 Signed-off-by: Andrea Righi [EMAIL PROTECTED]

Hi, Andrea,

We have been thinking of doing control-group-based I/O control. I have
not reviewed your patch in detail, but I can suggest looking at
OpenVZ's I/O controller. I/O bandwidth control is definitely
interesting. How did you test your solution?
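For readers wondering what a sectors-per-second limit means in
practice, here is a minimal userspace token-bucket sketch of the
behaviour such a knob implies (my own illustration, not Andrea's
implementation; the 512-byte sector size is an assumption):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct io_throttle {
	uint64_t rate;		/* sectors per second; 0 means unlimited */
	double tokens;		/* sectors we may still submit */
	struct timespec last;	/* time of the last refill */
};

static void throttle_sectors(struct io_throttle *t, uint64_t sectors)
{
	struct timespec now;

	if (t->rate == 0)
		return;
	clock_gettime(CLOCK_MONOTONIC, &now);
	/* refill tokens for the elapsed time, capping bursts at 1s worth */
	t->tokens += t->rate * ((now.tv_sec - t->last.tv_sec) +
				(now.tv_nsec - t->last.tv_nsec) / 1e9);
	if (t->tokens > (double)t->rate)
		t->tokens = (double)t->rate;
	t->last = now;
	t->tokens -= (double)sectors;
	if (t->tokens < 0) {	/* overdrawn: stall until repaid */
		double wait = -t->tokens / (double)t->rate;
		struct timespec d = { (time_t)wait,
				(long)((wait - (double)(time_t)wait) * 1e9) };
		nanosleep(&d, NULL);
	}
}

int main(void)
{
	struct io_throttle t = { .rate = 1000 };  /* ~512KB/s in 512B sectors */

	clock_gettime(CLOCK_MONOTONIC, &t.last);
	for (int i = 0; i < 4; i++) {
		throttle_sectors(&t, 512);	/* pretend 256KB of I/O */
		printf("chunk %d submitted\n", i);
	}
	return 0;
}

The same bucket works unchanged if the unit is bytes or I/O operations
rather than sectors, which is relevant to the unit discussion later in
this thread.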

Balbir


Re: [patch 18/19] account mlocked pages

2008-01-11 Thread Balbir Singh
* Rik van Riel [EMAIL PROTECTED] [2008-01-08 15:59:57]:

The following patch is required to compile the code with
CONFIG_NORECLAIM enabled and CONFIG_NORECLAIM_MLOCK disabled.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c8ccf8f..fb08ee8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -88,6 +88,8 @@ enum zone_stat_item {
 	NR_NORECLAIM,	/*   */
 #ifdef CONFIG_NORECLAIM_MLOCK
 	NR_MLOCK,	/* mlock()ed pages found and moved off LRU */
+#else
+	NR_MLOCK = NR_ACTIVE_FILE,	/* avoid compiler errors... */
 #endif
 #else
 	NR_NORECLAIM = NR_ACTIVE_FILE,	/* avoid compiler errors in dead code */

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH] per-task I/O throttling

2008-01-12 Thread Balbir Singh
* Peter Zijlstra [EMAIL PROTECTED] [2008-01-12 10:46:37]:

 
 On Fri, 2008-01-11 at 23:57 -0500, [EMAIL PROTECTED] wrote:
  On Fri, 11 Jan 2008 17:32:49 +0100, Andrea Righi said:
  
   The interesting feature is that it allows to set a priority for each
   process container, but AFAIK it doesn't allow to partition the
   bandwidth between different containers (that would be a nice feature
   IMHO). For example it would be great to be able to define per-container
   limits, like assign 10MB/s for processes in container A, 30MB/s to
   container B, 20MB/s to container C, etc.
  
  Has anybody considered allocating based on *seeks* rather than bytes moved,
  or counting seeks as virtual bytes for the purposes of accounting (if the
  disk can do 50mbytes/sec, and a seek takes 5millisecs, then count it as 100K
  of data)?
 
 I was considering a time scheduler, you can fill your time slot with
 seeks or data, it might be what CFQ does, but I've never even read the
 code.


So far the definition of I/O bandwidth has been w.r.t. time. Not all
I/O devices have sectors; I'd prefer bytes over a period of time.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Assign IRQs to HPET Timers

2008-01-12 Thread Balbir Singh
* Balaji Rao [EMAIL PROTECTED] [2008-01-12 00:36:11]:

 Assign an IRQ to HPET Timer devices when interrupt enable is requested.
 This now makes the HPET userspace API work.


A more detailed changelog would help us understand the nature and
origin of the problem and how to reproduce it.
 
 drivers/char/hpet.c  |   31 +--
 include/linux/hpet.h |2 +-
 2 files changed, 30 insertions(+), 3 deletions(-)
 
 Signed-off-by: Balaji Rao R [EMAIL PROTECTED]
 
 diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
 index 4c16778..92bd889 100644
 --- a/drivers/char/hpet.c
 +++ b/drivers/char/hpet.c
 @@ -390,7 +390,8 @@ static int hpet_ioctl_ieon(struct hpet_dev *devp)
 	struct hpets *hpetp;
 	int irq;
 	unsigned long g, v, t, m;
 -	unsigned long flags, isr;
 +	unsigned long flags, isr, irq_bitmap;
 +	u64 hpet_config;
 
 	timer = devp->hd_timer;
 	hpet = devp->hd_hpet;
 @@ -412,7 +413,29 @@ static int hpet_ioctl_ieon(struct hpet_dev *devp)
 	devp->hd_flags |= HPET_SHARED_IRQ;
 	spin_unlock_irq(&hpet_lock);
 
 -	irq = devp->hd_hdwirq;
 +	/* Assign an IRQ to the timer */
 +	hpet_config = readq(&timer->hpet_config);
 +	irq_bitmap =
 +		(hpet_config & Tn_INT_ROUTE_CAP_MASK) >> Tn_INT_ROUTE_CAP_SHIFT;

Should we check if the interrupts are being delivered via FSB, prior
to doing this?

 +	if (!irq_bitmap)
 +		irq = 0;	/* No IRQ assignable */
 +	else {
 +		irq = find_first_bit(&irq_bitmap, 32);

 +		do {
 +			hpet_config |= irq << Tn_INT_ROUTE_CNF_SHIFT;
 +			writeq(hpet_config, &timer->hpet_config);
 +
 +			/* Check whether we wrote a valid IRQ
 +			 * number by reading back the field
 +			 */
 +			hpet_config = readq(&timer->hpet_config);
 +			if (irq == ((hpet_config & Tn_INT_ROUTE_CNF_MASK)
 +					>> Tn_INT_ROUTE_CNF_SHIFT)) {
 +				devp->hd_hdwirq = irq;
 +				break;	/* Success */
 +			}
 +		} while ((irq = find_next_bit(&irq_bitmap, 32, irq)));
 +	}

Shouldn't we do this at hpet_alloc() time?


 	if (irq) {
 		unsigned long irq_flags;
 @@ -509,6 +532,10 @@ hpet_ioctl_common(struct hpet_dev *devp, int cmd, unsigned long arg, int kernel)
 		break;
 	v = readq(&timer->hpet_config);
 	v &= ~Tn_INT_ENB_CNF_MASK;
 +
 +	/* Zero out the IRQ field */
 +	v &= ~Tn_INT_ROUTE_CNF_MASK;
 +
 	writeq(v, &timer->hpet_config);
 	if (devp->hd_irq) {
 		free_irq(devp->hd_irq, devp);
 diff --git a/include/linux/hpet.h b/include/linux/hpet.h
 index 707f7cb..e3c0b2a 100644
 --- a/include/linux/hpet.h
 +++ b/include/linux/hpet.h
 @@ -64,7 +64,7 @@ struct hpet {
 	 */
 
  #define	Tn_INT_ROUTE_CAP_MASK		(0xffffffff00000000ULL)
 -#define	Tn_INI_ROUTE_CAP_SHIFT		(32UL)
 +#define	Tn_INT_ROUTE_CAP_SHIFT		(32UL)
  #define	Tn_FSB_INT_DELCAP_MASK		(0x8000UL)
  #define	Tn_FSB_INT_DELCAP_SHIFT	(15)
  #define	Tn_FSB_EN_CNF_MASK		(0x4000UL)

The patch looks good overall!

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH] per-task I/O throttling

2008-01-12 Thread Balbir Singh
* Andrea Righi [EMAIL PROTECTED] [2008-01-12 19:01:14]:

 Peter Zijlstra wrote:
  On Sat, 2008-01-12 at 16:27 +0530, Balbir Singh wrote:
  * Peter Zijlstra [EMAIL PROTECTED] [2008-01-12 10:46:37]:
 
  On Fri, 2008-01-11 at 23:57 -0500, [EMAIL PROTECTED] wrote:
  On Fri, 11 Jan 2008 17:32:49 +0100, Andrea Righi said:
 
  The interesting feature is that it allows to set a priority for each
  process container, but AFAIK it doesn't allow to partition the
  bandwidth between different containers (that would be a nice feature
  IMHO). For example it would be great to be able to define per-container
  limits, like assign 10MB/s for processes in container A, 30MB/s to
  container B, 20MB/s to container C, etc.
  Has anybody considered allocating based on *seeks* rather than bytes 
  moved,
  or counting seeks as virtual bytes for the purposes of accounting (if 
  the
  disk can do 50mbytes/sec, and a seek takes 5millisecs, then count it as 
  100K
  of data)?
  I was considering a time scheduler, you can fill your time slot with
  seeks or data, it might be what CFQ does, but I've never even read the
  code.
 
  So far the definition of I/O bandwidth has been w.r.t time. Not all IO
  devices have sectors; I'd prefer bytes over a period of time.
  
  Doing a time based one would only require knowing the (avg) delay of
  seeks, whereas doing a bytes based one would also require knowing the
  (avg) speed of the device.
  
  That is, if you're also interested in providing a latency guarantee.
  Because that'd force you to convert bytes to time again.
 
 So, what about considering both bytes/sec and io-operations/sec? In this
 way we should be able to limit huge streams of data and seek storms (or
 any mix of them).
 
 Regarding CFQ, AFAIK it's only possible to configure an I/O priority
 for a process, but there's no way, for example, to limit the bandwidth
 (or I/O operations/sec) for a particular user or group.
 

Limiting usage is also a very useful feature. Andrea, could you please
port your patches over to control groups?

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Re: Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache ...

2007-09-12 Thread Balbir Singh
Lee Schermerhorn wrote:
 On Wed, 2007-09-12 at 16:41 +0100, Andy Whitcroft wrote:
 On Wed, Sep 12, 2007 at 11:09:47AM -0400, Lee Schermerhorn wrote:

 Interesting, I don't see a memory controller function in the stack
 trace, but I'll double check to see if I can find some silly race
 condition in there.
 right.  I noticed that after I sent the mail.  

 Also, config available at:
 http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont
 Be interested to know the outcome of any bisect you do.  Given its
 tripping in reclaim.
 
 Problem isolated to memory controller patches.  This patch seems to fix
 this particular problem.  I've only run the test for a few minutes with
 and without memory controller configured, but I did observe reclaim
 kicking in several times.  W/o this patch, system would panic as soon as
 I entered direct/zone reclaim--less than a minute.
 

Thanks, excellent catch! The patch looks sane, and thanks for your help
in sorting this issue out. Hmm.. that means I never hit direct/zone
reclaim in my tests (I'll make a mental note to enhance my test cases
to cover this scenario).

 Lee
 
 
 PATCH 2.6.23-rc4-mm1 Memory Controller:  initialize all scan_controls'
   isolate_pages member.
 
 We need to initialize all scan_controls' isolate_pages member.
 Otherwise, shrink_active_list() attempts to execute at undefined
 location.
 
 Signed-off-by:  Lee Schermerhorn [EMAIL PROTECTED]
 
  mm/vmscan.c |2 ++
  1 file changed, 2 insertions(+)
 
  Index: Linux/mm/vmscan.c
  ===================================================================
  --- Linux.orig/mm/vmscan.c	2007-09-10 13:22:21.000000000 -0400
  +++ Linux/mm/vmscan.c	2007-09-12 15:30:27.000000000 -0400
  @@ -1758,6 +1758,7 @@ unsigned long shrink_all_memory(unsigned
  	.swap_cluster_max = nr_pages,
  	.may_writepage = 1,
  	.swappiness = vm_swappiness,
  +	.isolate_pages = isolate_pages_global,
  	};
  
  	current->reclaim_state = &reclaim_state;
  @@ -1941,6 +1942,7 @@ static int __zone_reclaim(struct zone *z
  	SWAP_CLUSTER_MAX),
  	.gfp_mask = gfp_mask,
  	.swappiness = vm_swappiness,
  +	.isolate_pages = isolate_pages_global,
  	};
  	unsigned long slab_reclaimable;
 
 
 


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Memory shortage can result in inconsistent flocks state

2007-09-13 Thread Balbir Singh
On 9/13/07, Pavel Emelyanov [EMAIL PROTECTED] wrote:
 J. Bruce Fields wrote:
  On Tue, Sep 11, 2007 at 04:38:13PM +0400, Pavel Emelyanov wrote:
  This is a known feature that such re-locking is not atomic,
  but in the racy case the file should stay locked (although by
  some other process), but in this case the file will be unlocked.
 
  That's a little subtle (I assume you've never seen this actually
  happen?), but it makes sense to me.

 Well, this situation is hard to notice since usually programs
 try to finish up when some error is returned from the kernel,
 but I do believe that this could happen in one of the openvz
 kernels since we limit the kernel memory usage for containers
 and thus -ENOMEM is a common error.


The fault injection framework should be able to introduce the same
error. Of course hitting the error would require careful setup of the
fault parameters.
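As a userspace analogue of the idea (the kernel side would use knobs
such as failslab or fail_page_alloc), here is a sketch of injecting
-ENOMEM on every Nth allocation to exercise such error paths; all
names below are made up:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned long alloc_count;
static unsigned long fail_every = 3;	/* knob, like a fault interval */

/* Fail every Nth allocation so ENOMEM handling can be tested. */
static void *flaky_malloc(size_t size)
{
	if (fail_every && ++alloc_count % fail_every == 0) {
		errno = ENOMEM;
		return NULL;
	}
	return malloc(size);
}

int main(void)
{
	for (int i = 0; i < 6; i++) {
		void *p = flaky_malloc(64);

		/* A robust caller must unwind consistently here, which
		 * is exactly what the flock re-locking case gets wrong. */
		printf("alloc %d: %s\n", i, p ? "ok" : "ENOMEM");
		free(p);
	}
	return 0;
}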

Balbir


Re: cpuset trouble after hibernate

2007-09-17 Thread Balbir Singh
Andrew Morton wrote:
 On Mon, 10 Sep 2007 11:45:10 +0200 (CEST) Simon Derr [EMAIL PROTECTED] 
 wrote:
 
 On Sat, 8 Sep 2007, Nicolas Capit wrote:

 Hello,

 This is my situation:
   - I mounted the pseudo cpuset filesystem into /dev/cpuset
   - I created a cpuset named oar with my 2 cpus

 cat /dev/cpuset/oar/cpus 
 0-1

   - Then I hibernate my computer with 'echo -n disk /sys/power/state'
   - After reboot:

 cat /dev/cpuset/oar/cpus 
 0

 Why did I lost a cpu?
 Is this a normal behavior???
 Hi Nicolas,

 I believe this is related to the fact that hibernation uses the hotplug 
 subsystem to disable all CPUs except the boot CPU.

 Thus guarantee_online_cpus() is called on each cpuset and removes all 
 CPUs, except CPU 0, from all cpusets.

 I'm not quite sure about if/how this should be fixed in the kernel, 
 though. Looks like a very simple user-land workaround would be enough.

 
 Yeah.  Bug, surely.  But I guess it's always been there.
 
 What are the implications of this for cpusets-via-containers?
 

I suspect the functionality of cpusets is not affected by containers.

I wonder if containers should become suspend/resume aware and pass
those events on to controllers. I think it's only the bus drivers
and device drivers that do that now.



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH mm] fix swapoff breakage; however...

2007-09-17 Thread Balbir Singh
Hugh Dickins wrote:
 rc4-mm1's memory-controller-memory-accounting-v7.patch broke swapoff:
 it extended unuse_pte_range's boolean found return code to allow an
 error return too; but ended up returning found (1) as an error.
 Replace that by success (0) before it gets to the upper level.
 
 Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
 ---
 More fundamentally, it looks like any container brought over its limit in
 unuse_pte will abort swapoff: that doesn't seem contained to me.
 Maybe unuse_pte should just let containers go over their limits without
 error?  Or swap should be counted along with RSS?  Needs reconsideration.
 
  mm/swapfile.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 --- 2.6.23-rc4-mm1/mm/swapfile.c  2007-09-07 13:09:42.0 +0100
 +++ linux/mm/swapfile.c   2007-09-17 15:14:47.0 +0100
 @@ -642,7 +642,7 @@ static int unuse_mm(struct mm_struct *mm
   break;
   }
 	up_read(&mm->mmap_sem);
 -	return ret;
 +	return (ret < 0) ? ret : 0;

Thanks for catching this. There are three possible solutions:

1. Account each RSS page with a probable swap cache page, double
   the RSS accounting to ensure that swapoff will not fail.
2. Account for the RSS page just once, do not account swap cache
   pages
3. Follow your suggestion and let containers go over their limits
   without error

With the current approach, a container over its limit will not
be able to call swapoff successfully; is that bad?

We plan to implement per-container/per-cpuset swap in the future.
Given that, isn't this expected functionality? A container over its
limit cannot really swapoff a swap device. If we allow pages to
be unused, we could end up with a container that exceeds
its limit by a significant amount by calling swapoff.


  }
 
  /*


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: 2.6.23-rc4-mm1 compile error for ppc 32

2007-09-17 Thread Balbir Singh
Benjamin Herrenschmidt wrote:
 On Sat, 2007-09-15 at 11:00 -0400, Mathieu Desnoyers wrote:
 * Benjamin Herrenschmidt ([EMAIL PROTECTED]) wrote:
 On Thu, 2007-09-13 at 15:17 -0700, Andrew Morton wrote:

 Like this?

 --- a/include/asm-powerpc/bitops.h~powerpc-lock-bitops-fix
 +++ a/include/asm-powerpc/bitops.h
 @@ -226,7 +226,7 @@ static __inline__ void set_bits(unsigned
  
  static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)
  {
 -	__asm__ __volatile__(LWSYNC_ON_SMP ::: "memory");
 +	__asm__ __volatile__(LWSYNC_ON_SMP "" ::: "memory");
 	__clear_bit(nr, addr);
  }
  
 Looks ok. Can somebody test ? I'm still travelling...


 Hi Benjamin,

 With this patch and hrtimer.c fixes, 2.6.23-rc4-mm1 PPC arch (for
 powerpc 405) compiles fine.

 I still see errors/warnings from modpost though:
 
 Looks like the legacy ISA DMA crap no ? I don't know much about it.
 
 Ben.
 

Kamalesh has reported a similar bug and is looking to fix this problem.
He's been looking at the Kconfigs to see that all of them either
depend on GENERIC_ISA_DMA or select it.

 make -f /home/compudj/git/linux-2.6-lttng/scripts/Makefile.modpost
   scripts/mod/modpost -m  -o /home/compudj/obj/powerpc-405/Module.symvers
 -s
 ERROR: request_dma [sound/oss/sscape.ko] undefined!
 ERROR: free_dma [sound/oss/sscape.ko] undefined!
 ERROR: dma_spin_lock [sound/oss/sscape.ko] undefined!
 ERROR: free_dma [sound/oss/sound.ko] undefined!
 ERROR: request_dma [sound/oss/sound.ko] undefined!
 ERROR: dma_spin_lock [sound/oss/sound.ko] undefined!
 ERROR: dma_spin_lock [sound/core/snd.ko] undefined!
 ERROR: dma_spin_lock [net/irda/irda.ko] undefined!
 WARNING: div64_64 [net/ipv4/tcp_cubic.ko] has no CRC!
 ERROR: free_dma [drivers/parport/parport_pc.ko] undefined!
 ERROR: request_dma [drivers/parport/parport_pc.ko] undefined!
 ERROR: dma_spin_lock [drivers/parport/parport_pc.ko] undefined!
 ERROR: request_dma [drivers/net/irda/w83977af_ir.ko] undefined!
 ERROR: free_dma [drivers/net/irda/w83977af_ir.ko] undefined!
 ERROR: request_dma [drivers/net/irda/via-ircc.ko] undefined!
 ERROR: free_dma [drivers/net/irda/via-ircc.ko] undefined!
 ERROR: request_dma [drivers/net/irda/smsc-ircc2.ko] undefined!
 ERROR: free_dma [drivers/net/irda/smsc-ircc2.ko] undefined!
 ERROR: free_dma [drivers/net/irda/nsc-ircc.ko] undefined!
 ERROR: request_dma [drivers/net/irda/nsc-ircc.ko] undefined!
 ERROR: free_dma [drivers/net/irda/ali-ircc.ko] undefined!
 ERROR: request_dma [drivers/net/irda/ali-ircc.ko] undefined!
 ERROR: request_dma [drivers/mmc/host/wbsd.ko] undefined!
 ERROR: dma_spin_lock [drivers/mmc/host/wbsd.ko] undefined!
 ERROR: free_dma [drivers/mmc/host/wbsd.ko] undefined!

 Mathieu

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Configurable reclaim batch size

2007-09-17 Thread Balbir Singh
Peter Zijlstra wrote:
 On Mon, 17 Sep 2007 10:54:59 -0700 (PDT) Christoph Lameter
 [EMAIL PROTECTED] wrote:
 
 On Sat, 15 Sep 2007, Peter Zijlstra wrote:

 It increases the lock hold times though. Otoh it might work out with the
 lock placement.
 Yeah may be good for NUMA.
 
 Might, I'd just like a _little_ justification for an extra tunable.
 
 Do you have any numbers that show this is worthwhile?
 Tried to run AIM7 but the improvements are in the noise. I need a tests 
 that really does large memory allocation and stresses the LRU. I could 
 code something up but then Lee's patch addresses some of the same issues.
 Is there any standard test that shows LRU handling regressions?
 
 hehe, I wish. I was just hoping you'd done this patch as a result of an
 actual problem and not a hunch.

Please do let me know if someone finds a good standard test for it or a
way to stress reclaim. I've heard AIM7 come up often, but never been
able to push it much. I should retry.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH mm] fix swapoff breakage; however...

2007-09-17 Thread Balbir Singh
Hugh Dickins wrote:
 On Tue, 18 Sep 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
 More fundamentally, it looks like any container brought over its limit in
  unuse_pte will abort swapoff: that doesn't seem contained to me.
 Maybe unuse_pte should just let containers go over their limits without
 error?  Or swap should be counted along with RSS?  Needs reconsideration.
  Thanks for catching this. There are three possible solutions:

 1. Account each RSS page with a probable swap cache page, double
the RSS accounting to ensure that swapoff will not fail.
 2. Account for the RSS page just once, do not account swap cache
pages
 
 Neither of those makes sense to me, but I may be misunderstanding.
 
 What would make sense is (what I meant when I said swap counted
 along with RSS) not to count pages out and back in as they are
 go out to swap and back in, just keep count of instantiated pages
 

I am not sure how you define instantiated pages. I suspect that
you mean RSS + pages swapped out (swap_pte)?

 I say make sense meaning that the numbers could be properly
 accounted; but it may well be unpalatable to treat fast RAM as
 equal to slow swap.
 
 3. Follow your suggestion and let containers go over their limits
without error

 With the current approach, a container over it's limit will not
 be able to call swapoff successfully, is that bad?
 
 That's not so bad.  What's bad is that anyone else with the
 CAP_SYS_ADMIN to swapoff is liable to be prevented by containers
 going over their limits.
 

If a swapoff is going to push a container over its limit, then
we break the container and the isolation it provides. Upon swapoff
failure, maybe we could get the container to print a nice
little warning so that anyone with CAP_SYS_ADMIN can fix the
container limit and retry swapoff.
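To make the trade-off concrete, a toy model of the two accounting
schemes discussed here, with hypothetical names (scheme A charges
resident pages only, as the current code does; scheme B keeps the
charge while a page sits in swap, Hugh's "instantiated pages"):

#include <stdbool.h>
#include <stdio.h>

struct toy_container {
	long usage;	/* pages currently charged */
	long limit;	/* hard limit in pages */
};

static bool charge(struct toy_container *c, long pages)
{
	if (c->usage + pages > c->limit)
		return false;	/* maps to -ENOMEM in unuse_pte() */
	c->usage += pages;
	return true;
}

int main(void)
{
	struct toy_container c = { .usage = 0, .limit = 100 };

	charge(&c, 100);	/* container reaches its limit */
	c.usage -= 40;		/* scheme A: 40 pages swap out, uncharged */
	charge(&c, 40);		/* workload faults in 40 new pages */
	/* swapoff must now re-charge the 40 swapped-out pages */
	if (!charge(&c, 40))
		printf("swapoff aborts: container over its limit\n");
	/* under scheme B usage would have stayed at 100 throughout and
	 * swapoff would have nothing to charge */
	return 0;
}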

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH mm] fix swapoff breakage; however...

2007-09-17 Thread Balbir Singh
Hugh Dickins wrote:
 On Tue, 18 Sep 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
 What would make sense is (what I meant when I said swap counted
 along with RSS) not to count pages out and back in as they are
 go out to swap and back in, just keep count of instantiated pages

 I am not sure how you define instantiated pages. I suspect that
 you mean RSS + pages swapped out (swap_pte)?
 
 That's it.  (Whereas file pages counted out when paged out,
 then counted back in when paged back in.)
 
  If a swapoff is going to push a container over its limit, then
  we break the container and the isolation it provides.
 
 Is it just my traditional bias, that makes me prefer you break
 your container than my swapoff?  I'm not sure.



:-) Please see my response below

 Upon swapoff
 failure, may be we could get the container to print a nice
 little warning so that anyone else with CAP_SYS_ADMIN can fix the
 container limit and retry swapoff.
 
 And then they hit the next one... rather like trying to work out
 the dependencies of packages for oneself: a very tedious process.
 

Yes, but here's the overall picture of what is happening

1. The system administrator set up a memory container to contain
   a group of applications.
2. The administrator tried to swapoff one or a group of swap files/
   devices.
3. Operation 2 failed due to a container being above its limit,
   which implies that at some point the container went over its
   limit and some of its pages were swapped out.

During swapoff, we try to account for pages coming back into the
container. Our charging routine does try to reclaim pages, which in
turn implies it will use another swap device or reclaim page cache;
if both fail, we return -ENOMEM.

Given that the system administrator has setup the container and
the swap devices, I feel that he is in better control of what
to do with the system when swapoff fails.

In the future we plan to implement per-container swap (a feature
desired by several people). Assuming that administrators use
per-container swap, failing on limit sounds like the right way to
go forward.

 If the swapoff succeeds, that does mean there was actually room
 in memory (+ other swap) for everyone, even if some have gone over
 their nominal limits.  (But if the swapoff runs out of memory in
 the middle, yes, it might well have assigned the memory unfairly.)
 

Yes, precisely my point: the administrator is the best person
to decide how to assign memory to containers. Would it help
to add a container tunable that says it is OK for this container
to go over its limit during a swapoff?

 The appropriate answer may depend on what you do when a container
 tries to fault in one more page than its limit.  Apparently just
 fail it (no attempt to page out another page from that container).
 

The problem with that approach is that applications will fail
in the middle of their task. They will never get a chance
to run at all; they will always get killed in the middle.
We want to be able to reclaim pages from the container and
let the application continue.

 So, if the whole system is under memory pressure, kswapd will
 be keeping the RSS of all tasks low, and they won't reach their
 limits; whereas if the system is not under memory pressure,
 tasks will easily approach their limits and so fail.
 

Tasks failing on limit does not sound good unless we are out
of all backup memory (slow storage). We still let the application
run, although slowly.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: 2.6.23-rc6-mm1 panic (memory controller issue ?)

2007-09-18 Thread Balbir Singh
Badari Pulavarty wrote:
 On Tue, 2007-09-18 at 15:21 -0700, Badari Pulavarty wrote:
 Hi Balbir,

 I get following panic from SLUB, while doing simple fsx tests.
 I haven't used any container/memory controller stuff except 
 that I configured them in :(

 Looks like slub doesn't like one of the flags passed in ?

 Known issue ? Ideas ?

 
 I think, I found the issue. I am still running tests to
 verify. Does this sound correct ?
 
 Thanks,
 Badari
 
 Need to strip __GFP_HIGHMEM flag while passing to 
 mem_container_cache_charge().
 
 Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]
  mm/filemap.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 Index: linux-2.6.23-rc6/mm/filemap.c
 ===================================================================
 --- linux-2.6.23-rc6.orig/mm/filemap.c	2007-09-18 12:43:54.000000000 -0700
 +++ linux-2.6.23-rc6/mm/filemap.c	2007-09-18 19:14:44.000000000 -0700
 @@ -441,7 +441,8 @@ int filemap_write_and_wait_range(struct 
  int add_to_page_cache(struct page *page, struct address_space *mapping,
  		pgoff_t offset, gfp_t gfp_mask)
  {
 -	int error = mem_container_cache_charge(page, current->mm, gfp_mask);
 +	int error = mem_container_cache_charge(page, current->mm,
 +			gfp_mask & ~__GFP_HIGHMEM);
  	if (error)
  		goto out;
 
 
 

Hi, Badari,

The fix looks correct; the radix_tree_preload() call in
add_to_page_cache() does the same masking. Thanks for identifying
the fix.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: 2.6.23-rc6-mm1 panic (memory controller issue ?)

2007-09-19 Thread Balbir Singh
Christoph Lameter wrote:
 On Wed, 19 Sep 2007, Balbir Singh wrote:
 
 The fix looks correct, radix_tree_preload() does the same thing in
 add_to_page_cache(). Thanks for identifying the fix
 
 Hmmm Radix tree preload can only take a limited set of flags?
 

Yes, the whole code path is very interesting. From add_to_page_cache()
we call radix_tree_preload() with __GFP_HIGHMEM cleared, but from
__add_to_swap_cache() we don't make any changes to the gfp_mask.
radix_tree_preload() calls kmem_cache_alloc(), and in SLUB there is a
check

BUG_ON(flags & GFP_SLAB_BUG_MASK);

So I guess all our allocations should check against __GFP_DMA and
__GFP_HIGHMEM. I'll review the code, test it and send a fix.
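A self-contained sketch of the masking in question; the flag values
are illustrative stand-ins for include/linux/gfp.h, and the helper
name is made up:

#include <stdio.h>

/* Illustrative zone-modifier bits, stand-ins for include/linux/gfp.h. */
#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define GFP_ZONEMASK	(__GFP_DMA | __GFP_HIGHMEM)

/* Slab allocations must not carry zone modifiers the slab cache was
 * not created with (SLUB traps them via
 * BUG_ON(flags & GFP_SLAB_BUG_MASK)), so a caller handing its gfp_mask
 * to a slab-backed helper such as radix_tree_preload() masks them out
 * first. */
static unsigned int sanitize_gfp_for_slab(unsigned int gfp_mask)
{
	return gfp_mask & ~GFP_ZONEMASK;
}

int main(void)
{
	unsigned int mask = 0x10u /* pretend __GFP_WAIT */ | __GFP_HIGHMEM;

	printf("before: %#x, after: %#x\n", mask, sanitize_gfp_for_slab(mask));
	return 0;
}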

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: 2.6.23-rc6-mm1 panic (memory controller issue ?)

2007-09-19 Thread Balbir Singh
Christoph Lameter wrote:
 On Wed, 19 Sep 2007, Balbir Singh wrote:
 
 Yes, the whole code is very interesting. From add_to_page_cache()
 we call radix_tree_preload with __GFP_HIGHMEM cleared, but
 from __add_to_swap_cache(), we don't make any changes to the
 gfp_mask. radix_tree_preload() calls kmem_cache_alloc() and in slub
 there is a check

 BUG_ON(flags  GFP_SLAB_BUG_MASK);

 So, I guess all our allocations should check against __GFP_DMA and
 __GFP_HIGHMEM. I'll review the code, test it and send a fix.
 
 You need to use the proper mask from include/linux/gfp.h. Masking 
 individual bits will create problems when we create new bits.
 

I agree 100%; that's why I want to review the code. I want to use
a mask that clears the GFP_SLAB_BUG_MASK bits, and I want to check
other call sites that use gfp_mask as well.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

2007-09-20 Thread Balbir Singh
Andrew Morton wrote:
 On Tue, 18 Sep 2007 00:23:39 +0200 Guillaume Chazarain [EMAIL PROTECTED] 
 wrote:
 
 TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
 the basic and extended accounting.  With this patch,
 TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
 of a thread group.  This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
 fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
 (http://guichaz.free.fr/misc/iotop.py).
 
 This patch conflicts somewhat with
 add-scaled-time-to-taskstats-based-process-accounting.patch
 
 I fixed it up like this:
 
  void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
  {
  	if (task->flags & PF_SUPERPRIV)
  		stats->ac_flag |= ASU;
  	if (task->flags & PF_DUMPCORE)
  		stats->ac_flag |= ACORE;
  	if (task->flags & PF_SIGNALED)
  		stats->ac_flag |= AXSIG;
  	if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
  		/*
  		 * Threads are created by do_fork() and don't exec but not in
  		 * the AFORK sense, as the latter involves fork(2).
  		 */
  		stats->ac_flag |= AFORK;
  
  	stats->ac_utimescaled +=
  		cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
  	stats->ac_stimescaled +=
  		cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
  	stats->ac_utime  += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
  	stats->ac_stime  += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
  	stats->ac_minflt += task->min_flt;
  	stats->ac_majflt += task->maj_flt;
  }
 
 (note the s/=/+=/ in there) but it all needs reviewing and checking and
 testing please.

Andrew,

Thanks for reviewing the patchset; this patch is in my review and test
queue (which has gotten rather long of late). I'll test it further and
get back.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

2007-09-20 Thread Balbir Singh
 Andrew,

 Thanks for reviewing the patchset, this patch is on my review and test
 queue (which has gotten rather long of late). I'll test it further and
 get back.
 
  I still think this version is very wrong. It makes the ->signal->stats
  absolutely meaningless. Quoting myself:
 


Hi, Oleg,

Yes, I see; removing the memcpy is definitely wrong, thanks for
catching it. I did not get a chance to review the patch (it's on my
review queue), but I am very glad that you have reviewed it and
identified potential issues.

Big Thanks!

-- 
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Revert for cgroups CPU accounting subsystem patch

2007-11-12 Thread Balbir Singh
Paul Menage wrote:
 On Nov 12, 2007 10:00 PM, Srivatsa Vaddagiri [EMAIL PROTECTED] wrote:
 On second thoughts, this may be a usefull controller of its own.
 Say I just want to monitor usage (for accounting purpose) of a group of
 tasks, but don't want to control their cpu consumption, then cpuacct
 controller would come in handy.

 
 That's plausible, but having two separate ways of tracking and
 reporting the CPU usage of a cgroup seems wrong.
 
 How bad would it be in your suggested case if you just give each
 cgroup the same weight? So there would be fair scheduling between
 cgroups, which seems as reasonable as any other choice in the event
 that the CPU is contended.
 

Right now, one of the limitations of the CPU controller is that
the moment you create another control group, the bandwidth gets
divided by the default number of shares. We can't create groups
just for monitoring. cpu_acct fills this gap. I think in the
long run, we should move the helper functions into cpu_acct.c
and the interface logic into kernel/sched.c (cpu controller).

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: top lies ?

2007-11-12 Thread Balbir Singh
On Nov 13, 2007 12:38 PM, Al Boldi [EMAIL PROTECTED] wrote:
 kloczek wrote:
  Some data shown by the top command looks completely trashed.
  Fragment from top output:
 
  Mem:   2075784k total,  2053352k used,    22432k free,    19260k buffers
  Swap:  2096472k total,      136k used,  2096336k free,  1335080k cached
 
    PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP nFLT WCHAN COMMAND
  14515 mysql 20   0 1837m 563m 4132 S   39 27.8 27:14.20  1.2g   18    -  mysqld
 
  How is it possible that swap usage is 136k while the swapped-out
  portion of (in this case) the mysqld process is 1.2g?

 Welcome to OverCommit, aka OOM-nirvana.

 Try this:
 # echo 2  /proc/sys/vm/overcommit_memory
 # echo 0  /proc/sys/vm/overcommit_ratio

 But make sure you have enough swap.


 Thanks!

The swap cache looks pretty big; maybe top is including that data
while reporting swap usage.
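For what it's worth, older versions of top derive the per-process SWAP
column as VIRT - RES (if I remember the procps implementation
correctly), so untouched virtual memory is also reported as swapped.
A quick userspace check:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate 1GB of address space but touch none of it: VIRT grows by
 * ~1GB while RES barely moves, so a SWAP column computed as VIRT - RES
 * reports ~1GB "swapped" even with no swap in use. */
int main(void)
{
	char *p = malloc(1UL << 30);

	if (!p)
		return 1;
	printf("pid %d: allocated 1GB, touched nothing; "
	       "check top's VIRT/RES/SWAP now\n", getpid());
	pause();	/* keep the mapping alive for inspection */
	free(p);
	return 0;
}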

Balbir


Re: Revert for cgroups CPU accounting subsystem patch

2007-11-12 Thread Balbir Singh
Paul Menage wrote:
 On Nov 12, 2007 11:00 PM, Balbir Singh [EMAIL PROTECTED] wrote:
 Right now, one of the limitations of the CPU controller is that
 the moment you create another control group, the bandwidth gets
 divided by the default number of shares. We can't create groups
 just for monitoring.
 
 Could we get around this with, say, a flag that always treats a CFS
 schedulable entity as having a weight equal to the number of runnable
 tasks in it? So CPU bandwidth would be shared between groups in
 proportion to the number of runnable tasks, which would distribute the
 cycles approximately equivalently to them all being separate
 schedulable entities.
 

I think it's a good hack, but I am not sure about the complexity of
implementing it. I worry that if the number of tasks increases (say,
runs into thousands for one or more groups while a few groups have
just a few tasks), we'll lose out on accuracy.
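A quick sketch of the bandwidth split under the proposed flag, for
three groups with very different task counts (my own numbers, purely
illustrative):

#include <stdio.h>

/* Weight each group by its runnable task count (Paul's proposal) and
 * compare with equal per-group shares. */
int main(void)
{
	int runnable[] = { 1000, 1000, 2 };
	int total = 0;

	for (int i = 0; i < 3; i++)
		total += runnable[i];
	for (int i = 0; i < 3; i++)
		printf("group %d: task-weighted %6.2f%%, equal-share %6.2f%%\n",
		       i, 100.0 * runnable[i] / total, 100.0 / 3);
	return 0;
}

With task weighting the two large groups take about 49.9% each and the
small group about 0.1%, which is what monitoring-only grouping wants;
equal shares would squeeze the large groups to a third each.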

 cpu_acct fills this gap.
 
 Agreed, but not in the right way IMO.
 

I think we already have the code; we need to make it more useful and
reusable.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH 0/2] memcgroup: work better with tmpfs

2007-12-18 Thread Balbir Singh
Hugh Dickins wrote:
 Here's a couple of patches to get memcgroups working better with tmpfs
 and shmem, in conjunction with the tmpfs patches I just posted.  There
 will be another to come later on, but I shouldn't wait any longer to get
 these out to you.
 

Hi, Hugh,

Thank you so much for the review, some comments below

 (The missing patch will want to leave a mem_cgroup associated with a tmpfs
 file or shm object, so that if its pages get brought back from swap by
 swapoff, they can be associated with that mem_cgroup rather than the one
 which happens to be running swapoff.)
 
  mm/memcontrol.c |   81 --
  mm/shmem.c  |   28 +++
  2 files changed, 63 insertions(+), 46 deletions(-)
 
 But on the way I've noticed a number of issues with memcgroups not dealt
 with in these patches.
 
  1. Why is spin_lock_irqsave rather than spin_lock needed on mz->lru_lock?
 If it is needed, doesn't mem_cgroup_isolate_pages need to use it too?
 

We always call mem_cgroup_isolate_pages() from shrink_(in)active_list,
under spin_lock_irq of the zone's lru lock. That's why we don't take
the lock explicitly in that routine.
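
The shape of those call sites is roughly the following (a simplified
sketch; the argument list here is illustrative, not the exact vmscan
source):

	spin_lock_irq(&zone->lru_lock);
	/*
	 * The caller already holds the lru lock with IRQs disabled,
	 * so mem_cgroup_isolate_pages() must not take it again.
	 */
	nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
					    mem_cont, active);
	spin_unlock_irq(&zone->lru_lock);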

 2. There's mem_cgroup_charge and mem_cgroup_cache_charge (wouldn't the
 former be better called mem_cgroup_charge_mapped? why does the latter

Yes, it would be. After we've refactored the code, the new name makes sense.

 test MEM_CGROUP_TYPE_ALL instead of MEM_CGROUP_TYPE_CACHED? I still don't
 understand your enums there).

We do that to ensure that we charge page cache pages only when the
accounting type is set to MEM_CGROUP_TYPE_ALL. If the type is anything
else, we ignore cached pages; MEM_CGROUP_TYPE_CACHED did not exist when
the patches initially went in.

 But there's only mem_cgroup_uncharge.
 So when, for example, an add_to_page_cache fails, the uncharge may not
 balance the charge?
 

We use mem_cgroup_uncharge() everywhere. The reason is that the
control type might be switched at runtime: we uncharge any page that
has a page_cgroup associated with it, so once a page has been charged,
uncharge does not need to distinguish between charge types.
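
In code, the rule is roughly this (a simplified sketch, not the exact
source; release_charge() is a hypothetical stand-in for the real
accounting release):

struct page;
struct page_cgroup;

/* declared elsewhere in the real source; shown here for shape only */
struct page_cgroup *page_get_page_cgroup(struct page *page);
void release_charge(struct page_cgroup *pc);	/* hypothetical stand-in */

void mem_cgroup_uncharge_sketch(struct page *page)
{
	struct page_cgroup *pc = page_get_page_cgroup(page);

	if (!pc)
		return;		/* never charged: nothing to undo */

	/* release the charge recorded in pc, whatever its type */
	release_charge(pc);
}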

 3. mem_cgroup_charge_common has rcu_read_lock/unlock around its
 rcu_dereference; mem_cgroup_cache_charge does not: is that right?
 

Very good catch! Will fix it.

 4. That page_assign_page_cgroup in free_hot_cold_page, what case is that
 handling?  Wouldn't there be a leak if it ever happens?  I've been running
 with a BUG_ON(page->page_cgroup) there and not hit it - should it perhaps
 be a Bad page state case?
 

Our cleanup in page_cache_uncharge() does take care of cleaning up the
page_cgroup. I think you've got it right; it should be a BUG_ON in
free_hot_cold_page().

 Hugh

Thanks for the detailed review and fixes.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page

2007-12-18 Thread Balbir Singh


Based on the recommendation and observations of Hugh Dickins,
page_assign_page_cgroup() is not required here. This patch replaces it
with a VM_BUG_ON(), so that any page reaching free_hot_cold_page() with
a stale page_cgroup is caught.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---

 mm/page_alloc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/page_alloc.c~memory-controller-move-to-bug-on-in-free_hot_cold_page mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c~memory-controller-move-to-bug-on-in-free_hot_cold_page	2007-12-19 11:31:46.0 +0530
+++ linux-2.6.24-rc5-balbir/mm/page_alloc.c	2007-12-19 11:33:45.0 +0530
@@ -995,7 +995,7 @@ static void fastcall free_hot_cold_page(
 
if (!PageHighMem(page))
debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
-   page_assign_page_cgroup(page, NULL);
+   VM_BUG_ON(page_get_page_cgroup(page));
arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);
 
_
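
One note on the trade-off here: with CONFIG_DEBUG_VM enabled, a page
freed with a stale page_cgroup now fails loudly in free_hot_cold_page()
instead of being silently cleaned up; with it disabled, the check
compiles away entirely, which is safe only because every charged page
is expected to have been uncharged before it reaches this point.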

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[PATCH] Memory controller use rcu_read_lock() in mem_cgroup_cache_charge()

2007-12-20 Thread Balbir Singh


Hugh Dickins noticed that we were using rcu_dereference() without
rcu_read_lock() in the cache charging routine. The patch below fixes
this problem.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---

 mm/memcontrol.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-use-rcu-lead-lock mm/memcontrol.c
--- linux-2.6.24-rc5/mm/memcontrol.c~memory-controller-use-rcu-lead-lock	2007-12-19 11:52:44.0 +0530
+++ linux-2.6.24-rc5-balbir/mm/memcontrol.c	2007-12-20 14:01:45.0 +0530
@@ -717,16 +717,20 @@ int mem_cgroup_charge(struct page *page,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
 {
+   int ret = 0;
struct mem_cgroup *mem;
if (!mm)
mm = init_mm;
 
+   rcu_read_lock();
mem = rcu_dereference(mm->mem_cgroup);
+   css_get(&mem->css);
+   rcu_read_unlock();
if (mem->control_type == MEM_CGROUP_TYPE_ALL)
-   return mem_cgroup_charge_common(page, mm, gfp_mask,
+   ret = mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_CACHE);
-   else
-   return 0;
+   css_put(&mem->css);
+   return ret;
 }
 
 /*
_
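
For the record, the css_get()/css_put() pair is what makes this safe:
rcu_read_unlock() only ends the window in which the mm->mem_cgroup
pointer is guaranteed to stay valid, so the reference taken inside the
read-side section is what keeps the mem_cgroup alive across the call to
mem_cgroup_charge_common().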

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page

2007-12-20 Thread Balbir Singh
Peter Zijlstra wrote:
 On Thu, 2007-12-20 at 14:16 +, Hugh Dickins wrote:
 On Thu, 20 Dec 2007, Peter Zijlstra wrote:
 On Thu, 2007-12-20 at 13:14 +, Hugh Dickins wrote:
 On Wed, 19 Dec 2007, Dave Hansen wrote:
 -   page_assign_page_cgroup(page, NULL);
 +   VM_BUG_ON(page_get_page_cgroup(page));
 Hi Balbir,

 You generally want to do these like:

   foo = page_assign_page_cgroup(page, NULL);
   VM_BUG_ON(foo);

 Some embedded people have been known to optimize kernel size like this:

   #define VM_BUG_ON(x) do{}while(0)
 Balbir's patch looks fine to me: I don't get your point there, Dave.
 There was a lengthy discussion here:
   http://lkml.org/lkml/2007/12/14/131

 on the merit of debug statements with side effects.
 Of course, but what's the relevance?

 But looking at our definition:

 #ifdef CONFIG_DEBUG_VM
 #define VM_BUG_ON(cond) BUG_ON(cond)
 #else
 #define VM_BUG_ON(condition) do { } while(0)
 #endif

 disabling CONFIG_DEBUG_VM breaks the code as proposed by Balbir in that
 it will no longer acquire the reference.
 But what reference?

 struct page_cgroup *page_get_page_cgroup(struct page *page)
 {
  return (struct page_cgroup *)
  (page->page_cgroup & ~PAGE_CGROUP_LOCK);
 }

 I guess the issue is that often a get function has a complementary
 put function, but this isn't one of them.  Would page_page_cgroup
 be a better name, perhaps?  I don't know.
 
 Ah, yes, I mistakenly assumed it was a reference get. In that case I
 stand corrected and do not have any objections.
 

I was going to say the same thing: page_get_page_cgroup() does not take
any references. Maybe the _get_ in the name is what's confusing.
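
To make the (ultimately inapplicable) concern concrete, a tiny
hypothetical example; take_ref() and the counter are made up purely for
illustration:

/* the !CONFIG_DEBUG_VM variant of the macro */
#define VM_BUG_ON(cond) do { } while (0)

static int refs;
static int take_ref(void *obj) { (void)obj; return ++refs; }

void example(void *obj)
{
	/* take_ref() is never called: refs silently stays 0 */
	VM_BUG_ON(take_ref(obj));
}

Since page_get_page_cgroup() only reads page->page_cgroup, compiling
the check out loses nothing in this case.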


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

