RE: System slowdown on 2.4.1-ac20 (recurring from 2.4.0)
Just FYI: I remember posting something a few days ago to make the serial
console more reliable in situations like this. Some allocations in the serial
port driver are done at runtime using page_alloc; if the system runs out of
memory, the serial tty driver will not work properly. I am not saying that you
ran out of memory. All I am saying is that it is possible to make the serial
tty driver more reliable by using boot-time initialization. Please excuse me
if you find this a little off-topic.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Vibol Hou
Sent: Monday, February 26, 2001 4:25 PM
To: Linux-Kernel
Subject: System slowdown on 2.4.1-ac20 (recurring from 2.4.0)

I've reported this problem a long while ago, but no one answered my pleas. To
tell you the honest truth, I don't know where to begin looking. It's difficult
to poke around when the serial console is unresponsive :/

When I was running 2.4.0, the system, a dual-processor webserver, would
_completely_ slow down after about 3 days of constant uptime (and a few
million pages served). I mean _SLOW_. I could get commands executed, but it
would take an unholy long time to type the commands in. It seemed the server
was dropping lots of packets. All TCP services simply stopped or slowed. ICMP
packet loss to the server would vary sporadically from 50% to 75%. Web service
was rendered useless. SSH _barely_ worked. The few commands I could run (w,
free, memstat, top) showed nothing out of the ordinary. Back then, I didn't
have a serial console set up.

Now, I'm running 2.4.1-ac20 and I set up a serial console to try to catch any
errors. I was hoping the problem wouldn't recur with this newer kernel, but it
still happens, now at about 5 days of uptime. When I manage to get in a
'shutdown -h now' through SSH, the serial console spits out:

INIT: Switching to runlevel: 0
INIT:

And that's it. It doesn't even seem to be able to finish shutting down.
Thus far, no one else has reported problems similar to mine, so it makes me
wonder what is wrong. The system ran fine with an uptime of over 100 days with
the old 2.2.17 kernel. What baffles me is the fact that the serial console is
completely unresponsive to input when the server gets into this state.

Having said that, does anyone have any ideas or pointers for me? Again, this
may seem like a fairly uninformative e-mail, but that's just because I can't
do anything on the server when it gets into this state. If there is anything
you recommend I do when this happens again (other than restart the system),
please let me know and I'll try it.

--
Vibol Hou
KhmerConnection, http://khmer.cc

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Is it useful to support user-level drivers?
I realize that the Linux kernel supports user-level drivers (via ioperm,
etc.). However, interrupts at user level are not supported. Does anyone think
it would be a good idea to add user-level interrupt support? I have a
framework for it, but it still needs a lot of work. Depending on the response
I get, I can send out more email.

Please cc me on the replies, as I am no longer a part of the Linux kernel
mailing list - due to the humble size of my mail box.

Balbir
Re: A signal fairy tale
Shouldn't there be a sigclose() and other operations to make the API
orthogonal? sigopen() should be selective about the signals it allows as an
argument. Try and make sigopen() thread-specific, so that if one thread does a
sigopen(), it does not imply it will do all the signal handling for all the
threads. Does using sigopen() imply that signal(), sigaction(), etc. cannot be
used? In the same process one could do a sigopen() in a library, but the
process could use sigaction()/signal() without knowing what the library does
(which signals it handles, etc).

Let me know when somebody has a patch or needs help; I would like to help or
take a look at it.

Balbir

|NAME
|       sigopen - open a signal as a file descriptor
|
|SYNOPSIS
|       #include <signal.h>
|
|       int sigopen(int signum);
|
|DESCRIPTION
|       The sigopen system call opens signal number signum as a file
|       descriptor. That signal is no longer delivered normally or available
|       for pickup with sigwait() et al. Instead, it must be picked up by
|       calling read() on the file descriptor returned by sigopen(); the
|       buffer passed to read() must have a size which is a multiple of
|       sizeof(siginfo_t). Multiple signals may be picked up with a single
|       call to read().
|       When that file descriptor is closed, the signal is available once
|       more for traditional use.
|       A signal number cannot be opened more than once concurrently;
|       sigopen() thus provides a way to avoid signal usage clashes in large
|       programs.
|
|RETURN VALUE
|       sigopen returns the new file descriptor, or -1 on error (in which
|       case errno is set appropriately).
|
|ERRORS
|       EWOULDBLOCK  signal is already open
|
|NOTES
|       read() will block when reading from a file descriptor opened by
|       sigopen() until a signal is available, unless fcntl(fd, F_SETFL,
|       O_NONBLOCK) is called to set it into nonblocking mode.
|
|HISTORY
|       sigopen() first appeared in the 2.5.2 Linux kernel.
|
|Linux                            July, 2001                                 1
Re: A signal fairy tale
||Let me know when somebody has a patch or needs help; I would like to
||help or take a look at it.
|
|Maybe we can both hack on this.

Sure, that should be interesting. Did you have something in mind? We can start
right away.
Re: Reg kdb utility
You need to compile with the correct kernel headers, using the include path
option (-I <path to new headers>).

Balbir

On Tue, 3 Jul 2001, SATHISH.J wrote:
|Hi,
|
|I tried to use kdb on my 2.2.14-12 kernel. I was able to compile the file
|/usr/src/linux/arch/i386/kdb/modules/kdbm_vm.c and could get the object
|file. When I tried to insert it as a module it gives the following error
|message:
|
|./kdbm_vm.o: kernel-module version mismatch
|./kdbm_vm.o was compiled for kernel version .2.14-12
|while this kernel is version 2.2.14-12.
|
|Please tell me why this message comes.
|
|Thanks in advance,
|
|Regards,
|satish.j
[PATCH -mm] Taskstats fix getdelays usage information
Add usage information to getdelays.c. This patch was originally posted by
Randy Dunlap: http://lkml.org/lkml/2007/3/19/168

Signed-off-by: Randy Dunlap [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
---

 Documentation/accounting/getdelays.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff -puN Documentation/accounting/getdelays.c~fix-getdelays-usage Documentation/accounting/getdelays.c
--- linux-2.6.20/Documentation/accounting/getdelays.c~fix-getdelays-usage	2007-04-19 14:41:45.0 +0530
+++ linux-2.6.20-balbir/Documentation/accounting/getdelays.c	2007-04-19 14:42:26.0 +0530
@@ -72,6 +72,16 @@ struct msgtemplate {
 char cpumask[100+6*MAX_CPUS];
 
+static void usage(void)
+{
+	fprintf(stderr, "getdelays [-dilv] [-w logfile] [-r bufsize] "
+			"[-m cpumask] [-t tgid] [-p pid]\n");
+	fprintf(stderr, "  -d: print delayacct stats\n");
+	fprintf(stderr, "  -i: print IO accounting (works only with -p)\n");
+	fprintf(stderr, "  -l: listen forever\n");
+	fprintf(stderr, "  -v: debug on\n");
+}
+
 /*
  * Create a raw netlink socket and bind
  */
@@ -227,7 +237,7 @@ int main(int argc, char *argv[])
 	struct msgtemplate msg;
 
 	while (1) {
-		c = getopt(argc, argv, "diw:r:m:t:p:v:l");
+		c = getopt(argc, argv, "diw:r:m:t:p:vl");
 		if (c < 0)
 			break;
@@ -277,7 +287,7 @@ int main(int argc, char *argv[])
 			loop = 1;
 			break;
 		default:
-			printf("Unknown option %d\n", c);
+			usage();
 			exit(-1);
 		}
 	}
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
[PATCH -mm] Taskstats fix the structure members alignment issue
 192,# virtmem
 200,# hiwater_rss
 208,# hiwater_vm
 216,# read_char
 224,# write_char
 232,# read_syscalls
 240,# write_syscalls
 248,# read_bytes
 256,# write_bytes
 264,# cancelled_write_bytes
);

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---

 include/linux/taskstats.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff -puN include/linux/taskstats.h~fix-taskstats-alignment include/linux/taskstats.h
--- linux-2.6.20/include/linux/taskstats.h~fix-taskstats-alignment	2007-04-19 12:28:25.0 +0530
+++ linux-2.6.20-balbir/include/linux/taskstats.h	2007-04-19 13:21:48.0 +0530
@@ -66,7 +66,7 @@ struct taskstats {
 	/* Delay waiting for cpu, while runnable
 	 * count, delay_total NOT updated atomically
 	 */
-	__u64	cpu_count;
+	__u64	cpu_count __attribute__((aligned(8)));
 	__u64	cpu_delay_total;
 
 	/* Following four fields atomically updated using task->delays->lock */
@@ -101,14 +101,17 @@ struct taskstats {
 	/* Basic Accounting Fields start */
 	char	ac_comm[TS_COMM_LEN];	/* Command name */
-	__u8	ac_sched;		/* Scheduling discipline */
+	__u8	ac_sched __attribute__((aligned(8)));
+					/* Scheduling discipline */
 	__u8	ac_pad[3];
-	__u32	ac_uid;			/* User ID */
+	__u32	ac_uid __attribute__((aligned(8)));
+					/* User ID */
 	__u32	ac_gid;			/* Group ID */
 	__u32	ac_pid;			/* Process ID */
 	__u32	ac_ppid;		/* Parent process ID */
 	__u32	ac_btime;		/* Begin time [sec since 1970] */
-	__u64	ac_etime;		/* Elapsed time [usec] */
+	__u64	ac_etime __attribute__((aligned(8)));
+					/* Elapsed time [usec] */
 	__u64	ac_utime;		/* User CPU time [usec] */
 	__u64	ac_stime;		/* System CPU time [usec] */
 	__u64	ac_minflt;	/* Minor Page Fault Count */
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH -mm] Taskstats fix the structure members alignment issue
Andrew Morton wrote:
> On Fri, 20 Apr 2007 22:13:41 +0530 Balbir Singh [EMAIL PROTECTED] wrote:
>> We broke the alignment of members of taskstats to the 8 byte boundary
>> with the CSA patches. In the current kernel, the taskstats structure is
>> not suitable for use by 32 bit applications in a 64 bit kernel.
>
> ugh, that was bad of us.

Yes :-)

>> ...
>> The patch adds an __attribute__((aligned(8))) to the taskstats
>> structure members so that 32 bit applications using taskstats can work
>> with a 64 bit kernel.
>
> But there might be 32-bit applications out there which are using the
> present wrong structure?
>
> otoh, I assume that those applications would be using taskstats.h and
> would hence encounter this bug and we would have heard about it, is
> that correct?

Yes, correct.

> otoh^2, 32-bit applications running under 32-bit kernels will presently
> be functioning correctly, and your change will require that those
> applications be recompiled, I think?

Yes, correct. They would be broken with this fix. We could bump up the
version TASKSTATS_VERSION to 4. Would you like a new patch with the version
bumped up?

> This patch looks like 2.6.20 and 2.6.21 material, but very carefully...

Yes, 2.6.20 and 2.6.21 sound correct.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH -mm] Taskstats fix the structure members alignment issue
Andrew Morton wrote:
> On Sat, 21 Apr 2007 18:29:21 +0530 Balbir Singh [EMAIL PROTECTED] wrote:
>>>> The patch adds an __attribute__((aligned(8))) to the taskstats
>>>> structure members so that 32 bit applications using taskstats can
>>>> work with a 64 bit kernel.
>>> But there might be 32-bit applications out there which are using the
>>> present wrong structure?
>>>
>>> otoh, I assume that those applications would be using taskstats.h and
>>> would hence encounter this bug and we would have heard about it, is
>>> that correct?
>> Yes, correct.
>>
>>> otoh^2, 32-bit applications running under 32-bit kernels will
>>> presently be functioning correctly, and your change will require that
>>> those applications be recompiled, I think?
>> Yes, correct. They would be broken with this fix. We could bump up the
>> version TASKSTATS_VERSION to 4. Would you like a new patch with the
>> version bumped up?
>
> I can do that.

Thanks!

>>> This patch looks like 2.6.20 and 2.6.21 material, but very
>>> carefully...
>> Yes, 2.6.20 and 2.6.21 sound correct.
>
> OK. I guess we have little choice but to slam it in asap, with a
> 2.6.20.x backport, before too many people start using the old
> interface.

Thanks, again!

--
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH 8/8] Per-container pages reclamation
Pavel Emelianov wrote:
> Implement try_to_free_pages_in_container() to free the pages in a
> container that has run out of memory.
>
> The scan_control->isolate_pages() function isolates the container
> pages only.

Pavel,

I've just started playing around with these patches; I preferred the approach
of v1. Please see below.

> +static unsigned long isolate_container_pages(unsigned long nr_to_scan,
> +		struct list_head *src, struct list_head *dst,
> +		unsigned long *scanned, struct zone *zone)
> +{
> +	unsigned long nr_taken = 0;
> +	struct page *page;
> +	struct page_container *pc;
> +	unsigned long scan;
> +	LIST_HEAD(pc_list);
> +
> +	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
> +		pc = list_entry(src->prev, struct page_container, list);
> +		page = pc->page;
> +		if (page_zone(page) != zone)
> +			continue;

shrink_zone() will walk all pages looking for pages belonging to this
container, and this slows down the reclaim quite a bit. Although we've reused
code, we've ended up walking the entire list of the zone to find pages
belonging to a particular container; this was the same problem I had with my
RSS controller patches.

> +
> +		list_move(&pc->list, &pc_list);
> +

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH 8/8] Per-container pages reclamation
Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Pavel Emelianov wrote:
>>> Implement try_to_free_pages_in_container() to free the pages in a
>>> container that has run out of memory.
>>>
>>> The scan_control->isolate_pages() function isolates the container
>>> pages only.
>>
>> Pavel,
>>
>> I've just started playing around with these patches; I preferred the
>> approach of v1. Please see below.
>>
>>> +static unsigned long isolate_container_pages(unsigned long nr_to_scan,
>>> +		struct list_head *src, struct list_head *dst,
>>> +		unsigned long *scanned, struct zone *zone)
>>> +{
>>> +	unsigned long nr_taken = 0;
>>> +	struct page *page;
>>> +	struct page_container *pc;
>>> +	unsigned long scan;
>>> +	LIST_HEAD(pc_list);
>>> +
>>> +	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
>>> +		pc = list_entry(src->prev, struct page_container, list);
>>> +		page = pc->page;
>>> +		if (page_zone(page) != zone)
>>> +			continue;
>>
>> shrink_zone() will walk all pages looking for pages belonging to this
>
> No. shrink_zone() will walk container pages looking for pages in the
> desired zone. A scan through the full zone is done on global memory
> shortage.

Yes, I see that now. But for each zone in the system, we walk through the
container's list - right?

I have some more fixes and improvements that I want to send across. I'll
start sending them out to you as I test and verify them.

>> container, and this slows down the reclaim quite a bit. Although we've
>> reused code, we've ended up walking the entire list of the zone to
>> find pages belonging to a particular container; this was the same
>> problem I had with my RSS controller patches.
>
>>> +
>>> +		list_move(&pc->list, &pc_list);
>>> +

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [RFC][PATCH 5/7] Per-container OOM killer and page reclamation
Hi, Pavel,

Please find below my patch to add LRU behaviour to your latest RSS
controller.

Balbir Singh
Linux Technology Center
IBM, ISTL

Add LRU behaviour to the RSS controller patches posted by Pavel Emelianov

	http://lkml.org/lkml/2007/3/6/198

which was in turn similar to the RSS controller posted by me

	http://lkml.org/lkml/2007/2/26/8

Pavel's patches have a per-container list of pages, which helps reduce the
reclaim time of the RSS controller, but the per-container list of pages is in
FIFO order. I've implemented active and inactive lists per container to help
select the right set of pages to reclaim when the container is under memory
pressure.

I've tested these patches on a ppc64 machine and they work fine for the
minimal testing I've done.

Pavel, would you please include these patches in your next iteration?

Comments, suggestions and further improvements are as always welcome!

Signed-off-by: [EMAIL PROTECTED]
---

 include/linux/rss_container.h |    1 +
 mm/rss_container.c            |   47 ++++++++++++++++++++++++-----------
 mm/swap.c                     |    5 ++++
 mm/vmscan.c                   |    3 ++
 4 files changed, 44 insertions(+), 12 deletions(-)

diff -puN include/linux/rss_container.h~rss-container-lru2 include/linux/rss_container.h
--- linux-2.6.20/include/linux/rss_container.h~rss-container-lru2	2007-03-09 22:52:56.0 +0530
+++ linux-2.6.20-balbir/include/linux/rss_container.h	2007-03-10 00:39:59.0 +0530
@@ -19,6 +19,7 @@ int container_rss_prepare(struct page *,
 void container_rss_add(struct page_container *);
 void container_rss_del(struct page_container *);
 void container_rss_release(struct page_container *);
+void container_rss_move_lists(struct page *pg, bool active);
 
 int mm_init_container(struct mm_struct *mm, struct task_struct *tsk);
 void mm_free_container(struct mm_struct *mm);

diff -puN mm/rss_container.c~rss-container-lru2 mm/rss_container.c
--- linux-2.6.20/mm/rss_container.c~rss-container-lru2	2007-03-09 22:52:56.0 +0530
+++ linux-2.6.20-balbir/mm/rss_container.c	2007-03-10 02:42:54.0 +0530
@@ -17,7 +17,8 @@ static struct container_subsys rss_subsys
 
 struct rss_container {
 	struct res_counter res;
-	struct list_head page_list;
+	struct list_head inactive_list;
+	struct list_head active_list;
 	struct container_subsys_state css;
 };
 
@@ -96,6 +97,26 @@ void container_rss_release(struct page_c
 	kfree(pc);
 }
 
+void container_rss_move_lists(struct page *pg, bool active)
+{
+	struct rss_container *rss;
+	struct page_container *pc;
+
+	if (!page_mapped(pg))
+		return;
+
+	pc = page_container(pg);
+	BUG_ON(!pc);
+	rss = pc->cnt;
+
+	spin_lock_irq(&rss->res.lock);
+	if (active)
+		list_move(&pc->list, &rss->active_list);
+	else
+		list_move(&pc->list, &rss->inactive_list);
+	spin_unlock_irq(&rss->res.lock);
+}
+
 void container_rss_add(struct page_container *pc)
 {
 	struct page *pg;
@@ -105,7 +126,7 @@ void container_rss_add(struct page_conta
 	rss = pc->cnt;
 
 	spin_lock(&rss->res.lock);
-	list_add(&pc->list, &rss->page_list);
+	list_add(&pc->list, &rss->active_list);
 	spin_unlock(&rss->res.lock);
 
 	page_container(pg) = pc;
@@ -141,7 +162,10 @@ unsigned long container_isolate_pages(un
 	struct zone *z;
 
 	spin_lock_irq(&rss->res.lock);
-	src = &rss->page_list;
+	if (active)
+		src = &rss->active_list;
+	else
+		src = &rss->inactive_list;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 		pc = list_entry(src->prev, struct page_container, list);
@@ -152,13 +176,10 @@ unsigned long container_isolate_pages(un
 
 		spin_lock(&z->lru_lock);
 		if (PageLRU(page)) {
-			if ((active && PageActive(page)) ||
-					(!active && !PageActive(page))) {
-				if (likely(get_page_unless_zero(page))) {
-					ClearPageLRU(page);
-					nr_taken++;
-					list_move(&page->lru, dst);
-				}
+			if (likely(get_page_unless_zero(page))) {
+				ClearPageLRU(page);
+				nr_taken++;
+				list_move(&page->lru, dst);
 			}
 		}
 		spin_unlock(&z->lru_lock);
@@ -212,7 +233,8 @@ static int rss_create(struct container_s
 		return -ENOMEM;
 
 	res_counter_init(&rss->res);
-	INIT_LIST_HEAD(&rss->page_list);
+	INIT_LIST_HEAD(&rss->inactive_list);
+	INIT_LIST_HEAD(&rss->active_list);
 	cont->subsys[rss_subsys.subsys_id] = &rss->css;
 	return 0;
 }
@@ -284,7 +306,8 @@ static __init int rss_create_early(struc
 	rss = &init_rss_container;
 
 	res_counter_init(&rss->res);
-	INIT_LIST_HEAD(&rss->page_list);
+	INIT_LIST_HEAD(&rss->inactive_list);
+	INIT_LIST_HEAD(&rss->active_list);
 	cont->subsys[rss_subsys.subsys_id] = &rss->css;
 	ss->create = rss_create;
 	return 0;

diff -puN mm/vmscan.c~rss-container-lru2 mm/vmscan.c
--- linux-2.6.20/mm/vmscan.c~rss-container-lru2	2007-03-09 22:52:56.0 +0530
+++ linux-2.6.20-balbir/mm/vmscan.c	2007-03-10 00:42:35.0 +0530
@@ -1142,6 +1142,7 @@ static unsigned long container_shrink_pa
 		else
 			add_page_to_inactive_list(z, page);
 		spin_unlock_irq(&z->lru_lock);
+		container_rss_move_lists(page, false);
 
 		put_page(page);
 	}
@@ -1191,6 +1192,7 @@ static void
Re: [RFC][PATCH 2/7] RSS controller core
On 3/11/07, Andrew Morton [EMAIL PROTECTED] wrote:
> On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev [EMAIL PROTECTED] wrote:
>> Andrew Morton wrote:
>>> On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote:
>>>> +struct rss_container {
>>>> +	struct res_counter res;
>>>> +	struct list_head page_list;
>>>> +	struct container_subsys_state css;
>>>> +};
>>>> +
>>>> +struct page_container {
>>>> +	struct page *page;
>>>> +	struct rss_container *cnt;
>>>> +	struct list_head list;
>>>> +};
>>>
>>> ah. This looks good. I'll find a hunk of time to go through this work
>>> and through Paul's patches. It'd be good to get both patchsets lined
>>> up in -mm within a couple of weeks. But.. We need to decide whether
>>> we want to do per-container memory limitation via these data
>>> structures, or whether we do it via a physical scan of some software
>>> zone, possibly based on Mel's patches.
>>
>> i.e. a separate memzone for each container?
>
> Yep. Straightforward machine partitioning. An attractive thing is that
> it 100% reuses existing page reclaim, unaltered.

We discussed zones for resource control and some of the disadvantages at
http://lkml.org/lkml/2006/10/30/222. I need to look at Mel's patches to
determine if they are suitable for resource control. But in a thread of
discussion on those patches, it was agreed that memory fragmentation and
resource control are independent issues.

>> imho the memzone approach is inconvenient for page sharing and shares
>> accounting. it also makes memory management more strict, forbids
>> overcommitting per-container etc.
>
> umm, who said they were requirements?

We discussed some of the requirements in the "RFC: Memory Controller
requirements" thread: http://lkml.org/lkml/2006/10/30/51

>> Maybe you have some ideas how we can decide on this?
>
> We need to work out what the requirements are before we can settle on
> an implementation. Sigh. Who is running this show? Anyone?

All the stakeholders involved in the RFC discussion :-) We've been talking
and building on top of each other's patches. I hope that was a good answer ;)

> You can actually do a form of overcommitment by allowing multiple
> containers to share one or more of the zones. Whether that is
> sufficient or suitable I don't know. That depends on the requirements,
> and we haven't even discussed those, let alone agreed to them.

There are other things like resizing a zone, finding the right size, etc.
I'll look at Mel's patches to see what is supported.

Warm Regards,
Balbir Singh
Re: [RFC][PATCH 2/7] RSS controller core
>>> doesn't look so good for me, mainly because of the additional
>>> per-page data and per-page processing
>>>
>>> on 4GB memory, with 100 guests, 50% shared for each guest, this
>>> basically means ~1mio pages, 500k shared, and 1500k x
>>> sizeof(page_container) entries, which roughly boils down to ~25MB of
>>> wasted memory ... increase the amount of shared pages and it starts
>>> getting worse, but maybe I'm missing something here
>>
>> We need to decide whether we want to do per-container memory
>> limitation via these data structures, or whether we do it via a
>> physical scan of some software zone, possibly based on Mel's patches.
>
> why not do simple page accounting (as done currently in Linux) and use
> that for the limits, without keeping the reference from container to
> page?
>
> best,
> Herbert

Herbert,

You lost me in the cc list and I almost missed this part of the thread.
Could you please not modify the cc list.

Thanks,
Balbir
Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting
On 3/12/07, Dave Hansen [EMAIL PROTECTED] wrote:
> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
>> now VE2 maps the same page. You can't determine whether this page is
>> mapped to this container or another one w/o the page->container
>> pointer.
>
> Hi Kirill,
>
> I thought we can always get from the page to the VMA. rmap provides
> this to us via page->mapping and the 'struct address_space' or
> anon_vma. Do we agree on that?
>
> We can also get from the vma to the mm very easily, via vma->vm_mm,
> right?
>
> We can also get from a task to the container quite easily.
>
> So, the only question becomes whether there is a 1:1 relationship
> between mm_structs and containers. Does each mm_struct belong to one
> and only one container? Basically, can a threaded process have
> different threads in different containers?
>
> It seems that we could bridge the gap pretty easily by either assigning
> each mm_struct to a container directly, or putting some kind of
> task-to-mm lookup. Perhaps just a list like
> mm->tasks_using_this_mm_list. Not rocket science, right?
>
> -- Dave

These patches are very similar to what I posted at
http://lwn.net/Articles/223829/. In my patches, the thread group leader owns
the mm_struct and all threads belong to the same container. I did not have a
per-container LRU; walking the global list for reclaim was a bit slow, but
otherwise my patches did not add anything to struct page. I used rmap
information to get to the VMA and then to the mm_struct.

Kirill, it is possible to determine all the containers that map a page.
Please see the page_in_container() function of
http://lkml.org/lkml/2007/2/26/7.

I was also thinking of using the page table(s) to identify all pages
belonging to a container, by obtaining all the mm_structs of tasks belonging
to a container. But this approach would not work well for the page cache
controller, when we add that to our memory controller.
Balbir
Re: [RFC][PATCH 2/7] RSS controller core
> hmm, it is very unlikely that this would happen, for several reasons
> ... and indeed, checking the thread in my mailbox shows that akpm
> dropped you ...

But, I got Andrew's email:

  Subject: [RFC][PATCH 2/7] RSS controller core
  From: Pavel Emelianov [EMAIL PROTECTED]
  To: Andrew Morton [EMAIL PROTECTED], Paul Menage [EMAIL PROTECTED],
      Srivatsa Vaddagiri [EMAIL PROTECTED], Balbir Singh [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED],
      Linux Kernel Mailing List linux-kernel@vger.kernel.org
  Date: Tue, 06 Mar 2007 17:55:29 +0300

  Subject: Re: [RFC][PATCH 2/7] RSS controller core
  From: Andrew Morton [EMAIL PROTECTED]
  To: Pavel Emelianov [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
      Paul Menage [EMAIL PROTECTED], List linux-kernel@vger.kernel.org
  Date: Tue, 6 Mar 2007 14:00:36 -0800

> that's the one I 'group' replied to ...
>
>> Could you please not modify the cc list.
>
> I never modify the cc unless explicitly asked to do so. I wish others
> would have it that way too :)

That's good to know, but my mailer shows

  Andrew Morton [EMAIL PROTECTED]
  to        Pavel Emelianov [EMAIL PROTECTED]
  cc        Paul Menage [EMAIL PROTECTED],
            Srivatsa Vaddagiri [EMAIL PROTECTED],
            Balbir Singh [EMAIL PROTECTED] (see I am HERE),
            devel@openvz.org,
            Linux Kernel Mailing List linux-kernel@vger.kernel.org,
            [EMAIL PROTECTED],
            Kirill Korotaev [EMAIL PROTECTED]
  date      Mar 7, 2007 3:30 AM
  subject   Re: [RFC][PATCH 2/7] RSS controller core
  mailed-by vger.kernel.org
  On Tue, 06 Mar 2007 17:55:29 +0300

and your reply as

  Andrew Morton [EMAIL PROTECTED], Pavel Emelianov [EMAIL PROTECTED],
  [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED],
  Paul Menage [EMAIL PROTECTED], List linux-kernel@vger.kernel.org
  to        Andrew Morton [EMAIL PROTECTED]
  cc        Pavel Emelianov [EMAIL PROTECTED], [EMAIL PROTECTED],
            [EMAIL PROTECTED], [EMAIL PROTECTED],
            Paul Menage [EMAIL PROTECTED],
            List linux-kernel@vger.kernel.org
  date      Mar 9, 2007 10:18 PM
  subject   Re: [RFC][PATCH 2/7] RSS controller core
  mailed-by vger.kernel.org

I am not sure what went wrong. Could you please check your mail client,
because it seemed to even change the email address to smtp.osdl.org, which
bounced back when I wrote to you earlier.

> best,
> Herbert

Cheers,
Balbir
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code
> What's the big deal so many accounting people have with just RSS? I'm
> not a container person, this is an honest question. Because from my POV
> if you conveniently ignore everything else... you may as well just not
> do any accounting at all.

We decided to implement accounting and control in phases:

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages:

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code
Nick Piggin wrote: Balbir Singh wrote: Nick Piggin wrote: And strangely, this example does not go outside the parameters of what you asked for AFAIKS. In the worst case of one container getting _all_ the shared pages, they will still remain inside their maximum rss limit.

When that does happen and a container hits its limit, with an LRU per container, if the container is not actually using those pages, they'll get thrown out of that container and get mapped into the container that is using those pages most frequently.

Exactly. Statistically, first touch will work OK. It may mean some reclaim inefficiencies in corner cases, but things will tend to even out.

Exactly! So they might get penalised a bit on reclaim, but maximum rss limits will work fine, and you can (almost) guarantee X amount of memory for a given container, and it will _work_. But I also take back my comments about this being the only design I have seen that gets everything, because the node-per-container idea is a really good one on the surface. And it could mean even less impact on the core VM than this patch. That is also a first-touch scheme.

With the proposed node-per-container, we will need to make massive core VM changes to reorganize zones and nodes. We would want to allow:
1. Sharing of nodes
2. Resizing nodes
3. Maybe more

But a lot of that is happening anyway for other reasons (eg. memory plug/unplug). And I don't consider node/zone setup to be part of the core VM as such... it is _good_ if we can move extra work into setup rather than have it in the mm. That said, I don't think this patch is terribly intrusive either.

Thanks, that's one of our goals: to keep it simple, understandable and non-intrusive.

With the node-per-container idea, it will be hard to control page cache limits independently of RSS limits or mlock limits. NOTE: page cache == unmapped page cache here.

I don't know that it would be particularly harder than any other first-touch scheme.
If one container ends up being charged with too much pagecache, eventually they'll reclaim a bit of it and the pages will get charged to more frequent users.

Yes, true, but what if a user does not want to control the page cache usage in a particular container, or wants to turn off RSS control?

However the messed up accounting that doesn't handle sharing between groups of processes properly really bugs me. Especially when we have the infrastructure to do it right. Does that make more sense?

I think it is simplistic. Sure, you could probably use some of the rmap stuff to account shared mapped _user_ pages once for each container that touches them. And this patchset isn't preventing that. But how do you account kernel allocations? How do you account unmapped pagecache? What's the big deal so many accounting people have with just RSS? I'm not a container person, this is an honest question. Because from my POV, if you conveniently ignore everything else... you may as well just not do any accounting at all.

We decided to implement accounting and control in phases:
1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages:
1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.

But this patch gives the groundwork to handle 1-4, and it is in a small chunk, and one would be able to apply different limits to different types of pages with it. Just using rmap to handle 1 does not really seem like a viable alternative because it fundamentally isn't going to handle 2 or 4.

For (2), we have the basic setup in the form of a per-container LRU list and a pointer from struct page to the container that first brought in the page.

I'm not saying that you couldn't _later_ add something that uses rmap or our current RSS accounting to tweak container-RSS semantics. But isn't it sensible to lay the groundwork first?
Get a clear path to something that is good (not perfect), but *works*?

I agree with your development model suggestion. One of the things we are going to do in the near future is to build (2) and then add (3) and (4). So far, we've not encountered any difficulties in building on top of (1). Vaidy, any comments?

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: taskstats accounting info
Randy.Dunlap wrote: Hi, Documentation/accounting/delay-accounting.txt says that the getdelays program has a -c cmd argument, but that option does not seem to exist in Documentation/account/getdelays.c. Do you have an updated version of getdelays.c? If not, please correct that documentation.

Yes, I did, but then I changed my laptop. I should have it archived at some place; I'll dig it out or correct the documentation.

Is getdelays.c the best available example of a program using the taskstats netlink interface?

It's the most portable example, since it does not depend on libnl. It needs some cleaning up. I hope to get to it after the OLS paper submission deadline.

Thanks for bringing the issue to my notice,

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [RFC][PATCH 4/7] RSS accounting hooks over the code
Nick Piggin wrote: Kirill Korotaev wrote: The approaches I have seen that don't have a struct page pointer do intrusive things like try to put hooks everywhere throughout the kernel where a userspace task can cause an allocation (and of course end up missing many, so they aren't secure anyway)... and basically just nasty stuff that will never get merged.

The user beancounters patch has got through all these... The approach where each charged object has a pointer to the owner container who has charged it is the easiest/cleanest way to handle all the problems with dynamic context change, races, etc., and 1 pointer in the page struct is just 0.1% overhead.

The pointer in struct page approach is a decent one, which I have liked since this whole container effort came up. IIRC Linus and Alan also thought that was a reasonable way to go. I haven't reviewed the rest of the beancounters patch since looking at it quite a few months ago... I probably don't have time for a good review at the moment, but I should eventually.

This patch is not really beancounters:
1. It uses the containers framework
2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8)

I would say that beancounters are changing and evolving.

Struct page overhead really isn't bad. Sure, nobody who doesn't use containers will want to turn it on, but unless you're using a big PAE system you're actually unlikely to notice.

big PAE doesn't make any difference IMHO (until struct pages are not created for non-present physical memory areas)

The issue is just that struct pages use low memory, which is a really scarce commodity on PAE. One more pointer in the struct page means 64MB less lowmem. But PAE is crap anyway. We've already made enough concessions in the kernel to support it.

I agree: struct page overhead is not really significant. The benefits of simplicity seem to outweigh the downside.
But again, I'll say the node-container approach of course does avoid this nicely (because we already can get the node from the page). So definitely that approach needs to be discredited before going with this one. But it lacks some other features:

1. A page can't be shared easily with another container.

I think they could be shared. You allocate _new_ pages from your own node, but you can definitely use existing pages allocated to other nodes.

2. A shared page can't be accounted honestly to containers as fraction = PAGE_SIZE / containers-using-it.

Yes, there would be some accounting differences. I think it is hard to say exactly what containers are using what page anyway, though. What do you say about unmapped pages? Kernel allocations? etc.

3. It doesn't help accounting of kernel memory structures. e.g. in OpenVZ we use exactly the same pointer on the page to track which container owns it, e.g. pages used for page tables are accounted this way.

? page_to_nid(page) ~= container that owns it.

4. I guess container destroy requires destroy of the memory zone, which means write-out of dirty data. Which doesn't sound good to me either.

I haven't looked at any implementation, but I think it is fine for the zone to stay around.

5. Memory reclamation in case of global memory shortage becomes a tricky/unfair task.

I don't understand why? You can much more easily target a specific container for reclaim with this approach than with others (because you have an LRU per container).

Yes, but we break the global LRU. With these RSS patches, reclaim not triggered by containers still uses the global LRU; by using nodes, we would have lost the global LRU.

6. You cannot overcommit. AFAIU, the memory should be granted to the node for exclusive usage and cannot be used by other containers, even if it is unused. This is not an option for us.

I'm not sure about that. If you have a larger number of nodes, then you could assign more free nodes to a container on demand.
But I think there would definitely be less flexibility with nodes... I don't know... and seeing as I don't really know where the google guys are going with it, I won't misrepresent their work any further ;)

Everyone seems to have a plan ;) I don't read the containers list... does everyone still have *different* plans, or is any sort of consensus being reached?

hope we'll have it soon :)

Good luck ;) I think we have made some forward progress on the consensus.

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: taskstats accounting info
Andrew Morton wrote: On Wed, 14 Mar 2007 17:48:32 +0530 Balbir Singh [EMAIL PROTECTED] wrote: Randy.Dunlap wrote: Hi, Documentation/accounting/delay-accounting.txt says that the getdelays program has a -c cmd argument, but that option does not seem to exist in Documentation/account/getdelays.c. Do you have an updated version of getdelays.c? If not, please correct that documentation. Yes, I did, but then I changed my laptop. I should have it archived at some place, I'll dig it out or correct the documentation. Is getdelays.c the best available example of a program using the taskstats netlink interface? It's the most portable example, since it does not depend on libnl.

err, what is libnl?

libnl is a library abstraction for netlink (libnetlink).

If there exists some real userspace infrastructure which utilises taskstats, can we please get a reference to it into the kernel Documentation? Perhaps in the TASKSTATS Kconfig entry, thanks.

That sounds like a good idea. I'll check for details and get back.

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH 3/7] containers (V7): Add generic multi-subsystem API to containers
[EMAIL PROTECTED] wrote: This patch removes all cpuset-specific knowledge from the container system, replacing it with a generic API that can be used by multiple subsystems. Cpusets is adapted to be a container subsystem. Signed-off-by: Paul Menage [EMAIL PROTECTED]

Hi, Paul,

This patch fails to apply for me:

[EMAIL PROTECTED]:~/ocbalbir/images/kernels/containers/linux-2.6.20$ pushpatch
patching file include/linux/container.h
patching file include/linux/cpuset.h
patching file kernel/container.c
patch: malformed patch at line 640: @@ -202,15 +418,18 @@ static DEFINE_MUTEX(callback_mutex);
multiuser_container does not apply

Is anybody else seeing this problem?

-- Balbir Singh Linux Technology Center IBM, ISTL
[RFC][PATCH][0/4] Memory controller (RSS Control)
This patch applies on top of Paul Menage's container patches (V7) posted at http://lkml.org/lkml/2007/2/12/88. It implements a controller within the containers framework for limiting memory usage (RSS usage). The memory controller was discussed at length in the RFC posted to lkml: http://lkml.org/lkml/2006/10/30/51

Steps to use the controller
---------------------------
0. Download the patches, apply the patches
1. Turn on CONFIG_CONTAINER_MEMCTLR in kernel config, build the kernel and boot into the new kernel
2. mount -t container container -o memctlr /mount point
3. cd /mount point (optionally do: mkdir directory; cd directory, under /mount point)
4. echo $$ > tasks (attaches the current shell to the container)
5. echo -n (limit value) > memctlr_limit
6. cat memctlr_usage
7. Run tasks, check the usage of the controller, reclaim behaviour
8. Report bugs, get bug fixes and iterate (goto step 0).

Advantages of the patchset
--------------------------
1. Zero overhead in struct page (struct page is not expanded)
2. Minimal changes to the core-mm code
3. Shared pages are not reclaimed unless all mappings belong to overlimit containers.
4. It can be used to debug drivers/applications/kernel components in a constrained memory environment (similar to the mem=XXX boot option), except that several containers can be created simultaneously without rebooting and the limits can be changed.

NOTE: There is no support for limiting kernel memory allocations and page cache control (presently).

Testing
-------
Ran kernbench and lmbench with containers enabled (container filesystem not mounted); they seemed to run fine. Created containers, attached tasks to containers with lower limits than the memory the tasks require (memory hog tests) and ran some basic tests on them.

TODO's and improvement areas
----------------------------
1. Come up with cool page replacement algorithms for containers (if possible without any changes to struct page)
2. Add page cache control
3. Add kernel memory allocator control
4. Extract benchmark numbers and overhead data

Comments and criticism are welcome.
Series
------
memctlr-setup.patch
memctlr-acct.patch
memctlr-reclaim-on-limit.patch
memctlr-doc.patch

-- Warm Regards, Balbir Singh
[RFC][PATCH][1/4] RSS controller setup
This patch sets up the basic controller infrastructure on top of the containers infrastructure. Two files are provided for monitoring and control: memctlr_usage and memctlr_limit. memctlr_usage shows the current usage (in pages of RSS) and the limit set by the user. memctlr_limit can be used to set a limit on the RSS usage of the resource. A special value of 0 indicates that the usage is unlimited. The limit is set in units of pages.

Signed-off-by: [EMAIL PROTECTED]
---
 include/linux/memctlr.h |  22 ++
 init/Kconfig            |   7 +
 mm/Makefile             |   1
 mm/memctlr.c            | 169
 4 files changed, 199 insertions(+)

diff -puN /dev/null include/linux/memctlr.h
--- /dev/null	2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/include/linux/memctlr.h	2007-02-16 00:22:11.0 +0530
@@ -0,0 +1,22 @@
+/* memctlr.h - Memory Controller for containers
+ *
+ * Copyright (C) Balbir Singh, IBM Corp. 2006-2007
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_MEMCTLR_H
+#define _LINUX_MEMCTLR_H
+
+#ifdef CONFIG_CONTAINER_MEMCTLR
+
+#else /* CONFIG_CONTAINER_MEMCTLR */
+
+#endif /* CONFIG_CONTAINER_MEMCTLR */
+#endif /* _LINUX_MEMCTLR_H */

diff -puN init/Kconfig~memctlr-setup init/Kconfig
--- linux-2.6.20/init/Kconfig~memctlr-setup	2007-02-15 21:58:42.0 +0530
+++ linux-2.6.20-balbir/init/Kconfig	2007-02-15 21:58:42.0 +0530
@@ -306,6 +306,13 @@ config CONTAINER_NS
 	  for instance virtual servers and checkpoint/restart jobs.
+config CONTAINER_MEMCTLR
+	bool "A simple RSS based memory controller"
+	select CONTAINERS
+	help
+	  Provides a simple Resource Controller for monitoring and
+	  controlling the total Resident Set Size of the tasks in a container
+
 config RELAY
 	bool "Kernel->user space relay support (formerly relayfs)"
 	help

diff -puN mm/Makefile~memctlr-setup mm/Makefile
--- linux-2.6.20/mm/Makefile~memctlr-setup	2007-02-15 21:58:42.0 +0530
+++ linux-2.6.20-balbir/mm/Makefile	2007-02-15 21:58:42.0 +0530
@@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_CONTAINER_MEMCTLR) += memctlr.o

diff -puN /dev/null mm/memctlr.c
--- /dev/null	2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/mm/memctlr.c	2007-02-16 00:22:11.0 +0530
@@ -0,0 +1,169 @@
+/*
+ * memctlr.c - Memory Controller for containers
+ *
+ * Copyright (C) Balbir Singh, IBM Corp. 2006-2007
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/fs.h>
+#include <linux/container.h>
+#include <linux/memctlr.h>
+
+#include <asm/uaccess.h>
+
+#define RES_USAGE_NO_LIMIT	0
+static const char version[] = "0.1";
+
+struct res_counter {
+	unsigned long usage;	/* The current usage of the resource being */
+				/* counted */
+	unsigned long limit;	/* The limit on the resource */
+	unsigned long nr_limit_exceeded;
+};
+
+struct memctlr {
+	struct container_subsys_state css;
+	struct res_counter counter;
+	spinlock_t lock;
+};
+
+static struct container_subsys memctlr_subsys;
+
+static inline struct memctlr *memctlr_from_cont(struct container *cont)
+{
+	return container_of(container_subsys_state(cont, memctlr_subsys),
+				struct memctlr, css);
+}
+
+static inline struct memctlr *memctlr_from_task(struct task_struct *p)
+{
+	return memctlr_from_cont(task_container(p, memctlr_subsys));
+}
+
+static int memctlr_create(struct container_subsys *ss, struct container *cont)
+{
+	struct memctlr *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	if (!mem)
+		return -ENOMEM;
+
+	spin_lock_init(&mem->lock);
+	cont->subsys[memctlr_subsys.subsys_id] = &mem->css;
+	return 0;
+}
+
+static void memctlr_destroy(struct container_subsys *ss,
+				struct container *cont)
+{
+	kfree
[RFC][PATCH][2/4] Add RSS accounting and control
);
 	if (unlikely(!pte_same(*page_table, orig_pte)))
-		goto out_nomap;
+		goto out_nomap_uncharge;
 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
-		goto out_nomap;
+		goto out_nomap_uncharge;
 	}

 	/* The page isn't present yet, go ahead with the fault. */
@@ -2068,6 +2083,8 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
out:
 	return ret;
+out_nomap_uncharge:
+	memctlr_update_rss(mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
out_nomap:
 	pte_unmap_unlock(page_table, ptl);
 	unlock_page(page);
@@ -2092,6 +2109,9 @@ static int do_anonymous_page(struct mm_s
 	/* Allocate our own private page. */
 	pte_unmap(page_table);

+	if (!memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT))
+		goto oom;
+
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	page = alloc_zeroed_user_highpage(vma, address);
@@ -2108,6 +2128,8 @@ static int do_anonymous_page(struct mm_s
 		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
 	} else {
+		memctlr_update_rss(mm, 1, MEMCTLR_DONT_CHECK_LIMIT);
+
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
 		page_cache_get(page);
@@ -2218,6 +2240,9 @@ retry:
 		}
 	}

+	if (!memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT))
+		goto oom;
+
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	/*
 	 * For a file-backed vma, someone could have truncated or otherwise
@@ -2227,6 +2252,7 @@ retry:
 	if (mapping && unlikely(sequence != mapping->truncate_count)) {
 		pte_unmap_unlock(page_table, ptl);
 		page_cache_release(new_page);
+		memctlr_update_rss(mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
 		cond_resched();
 		sequence = mapping->truncate_count;
 		smp_rmb();
@@ -2265,6 +2291,7 @@ retry:
 	} else {
 		/* One of our sibling threads was faster, back out.
 		 */
 		page_cache_release(new_page);
+		memctlr_update_rss(mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
 		goto unlock;
 	}

diff -puN mm/rmap.c~memctlr-acct mm/rmap.c
--- linux-2.6.20/mm/rmap.c~memctlr-acct	2007-02-18 22:55:50.0 +0530
+++ linux-2.6.20-balbir/mm/rmap.c	2007-02-18 23:28:16.0 +0530
@@ -602,6 +602,11 @@ void page_remove_rmap(struct page *page,
 		__dec_zone_page_state(page,
 			PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
+	/*
+	 * When we pass MEMCTLR_DONT_CHECK_LIMIT, it is ok to call
+	 * this function under the pte lock (since we will not block in reclaim)
+	 */
+	memctlr_update_rss(vma->vm_mm, -1, MEMCTLR_DONT_CHECK_LIMIT);
 }

diff -puN mm/swapfile.c~memctlr-acct mm/swapfile.c
--- linux-2.6.20/mm/swapfile.c~memctlr-acct	2007-02-18 22:55:50.0 +0530
+++ linux-2.6.20-balbir/mm/swapfile.c	2007-02-18 22:55:50.0 +0530
@@ -27,6 +27,7 @@
 #include <linux/mutex.h>
 #include <linux/capability.h>
 #include <linux/syscalls.h>
+#include <linux/memctlr.h>

 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -514,6 +515,7 @@ static void unuse_pte(struct vm_area_str
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, addr);
+	memctlr_update_rss(vma->vm_mm, 1, MEMCTLR_DONT_CHECK_LIMIT);
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
_

-- Warm Regards, Balbir Singh
[RFC][PATCH][3/4] Add reclaim support
,
+		.swappiness = 100,
+	};
+
+	/*
+	 * We try to shrink LRUs in 3 passes:
+	 * 0 = Reclaim from inactive_list only
+	 * 1 = Reclaim mapped (normal reclaim)
+	 * 2 = 2nd pass of type 1
+	 */
+	for (pass = 0; pass < 3; pass++) {
+		int prio;
+
+		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+			unsigned long nr_to_scan = nr_pages - ret;
+
+			sc.nr_scanned = 0;
+			ret += shrink_all_zones(nr_to_scan, prio,
+						pass, 1, &sc);
+			if (ret >= nr_pages)
+				goto out;
+
+			nr_total_scanned += sc.nr_scanned;
+			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+				congestion_wait(WRITE, HZ / 10);
+		}
+	}
+out:
+	return ret;
+}
+#endif
+
 /*
  * It's optimal to keep kswapds on the same CPUs as their memory, but
  * not required for correctness.  So if the last cpu in a node goes
  * away, we get changed to run anywhere: as the first one comes back,
_

-- Warm Regards, Balbir Singh
[RFC][PATCH][4/4] RSS controller documentation
Signed-off-by: [EMAIL PROTECTED]
---
 Documentation/memctlr.txt | 70 ++
 1 file changed, 70 insertions(+)

diff -puN /dev/null Documentation/memctlr.txt
--- /dev/null	2007-02-02 22:51:23.0 +0530
+++ linux-2.6.20-balbir/Documentation/memctlr.txt	2007-02-19 00:51:44.0 +0530
@@ -0,0 +1,70 @@
+Introduction
+------------
+
+The memory controller is a controller module written under the containers
+framework. It can be used to limit the resource usage of a group of
+tasks grouped by the container.
+
+Accounting
+----------
+
+The memory controller tracks the RSS usage of the tasks in the container.
+The definition of RSS was debated on lkml in the following thread:
+
+	http://lkml.org/lkml/2006/10/10/130
+
+This patch is flexible; it is easy to adapt the patch to any definition
+of RSS. The current accounting is based on the current definition of
+RSS. Each page mapped is charged to the container.
+
+The accounting is done at two levels: each process has RSS accounting in
+the mm_struct and in the container it belongs to. The mm_struct accounting
+is used when a task switches (migrates to a different) container(s). The
+accounting information for the task is subtracted from the source container
+and added to the destination container. If as a result of the migration the
+destination container goes over limit, no action is taken until some task
+in the destination container runs and tries to map a new page in its
+page table.
+
+The current RSS usage can be seen in the memctlr_usage file. The value
+is in units of pages.
+
+Control
+-------
+
+The memctlr_limit file allows the user to set a limit on the number of
+pages that can be mapped by the processes in the container. A special
+value of 0 (which is the default limit of any new container) indicates
+that the container can use an unlimited amount of RSS.
+
+Reclaim
+-------
+
+When the limit set in the container is hit, the memory controller starts
+reclaiming pages belonging to the container (simulating a local LRU in
+some sense).
+isolate_lru_pages() has been modified to isolate LRU pages
+belonging to a specific container. Parallel reclaims on the same
+container are not allowed; other tasks end up waiting for any existing
+reclaim to finish.
+
+The reclaim code uses two internal knobs, retries and pushback. pushback
+specifies the percentage of memory to be reclaimed when the container goes
+over limit. The retries knob controls how many times reclaim is retried
+before the task is killed (because reclaim failed).
+
+Shared pages are treated specially during reclaim. They are not force
+reclaimed; they are only unmapped from containers which are over limit.
+This ensures that other containers do not pay a penalty for a shared
+page being reclaimed when a particular container goes over its limit.
+
+NOTE: All limits are hard limits.
+
+Future Plans
+------------
+
+The current controller implements only RSS control. It is planned to add
+the following components:
+
+1. Page Cache control
+2. mlock'ed memory control
+3. kernel memory allocation control (memory allocated on behalf of a task)
_

-- Warm Regards, Balbir Singh
Re: [RFC][PATCH][0/4] Memory controller (RSS Control)
Andrew Morton wrote: On Mon, 19 Feb 2007 12:20:19 +0530 Balbir Singh [EMAIL PROTECTED] wrote: This patch applies on top of Paul Menage's container patches (V7) posted at http://lkml.org/lkml/2007/2/12/88 It implements a controller within the containers framework for limiting memory usage (RSS usage).

It's good to see someone building on someone else's work for once, rather than everyone going off in different directions. It makes one hope that we might actually achieve something at last.

Thanks! It's good to know we are headed in the right direction.

The key part of this patchset is the reclaim algorithm:

@@ -636,6 +642,15 @@ static unsigned long isolate_lru_pages(u
 		list_del(&page->lru);
 		target = src;
+		/*
+		 * For containers, do not scan the page unless it
+		 * belongs to the container we are reclaiming for
+		 */
+		if (container && !page_in_container(page, zone, container)) {
+			scan--;
+			goto done;
+		}

Alas, I fear this might have quite bad worst-case behaviour. One small container which is under constant memory pressure will churn the system-wide LRUs like mad, and will consume rather a lot of system time. So it's a point at which container A can deleteriously affect things which are running in other containers, which is exactly what we're supposed to not do.

Hmm.. I guess it's space vs time then :-) A CPU controller could control how much time is spent reclaiming ;)

Coming back, I see the problem you mentioned and we have been thinking of several possible solutions. In my introduction I pointed out: "Come up with cool page replacement algorithms for containers (if possible without any changes to struct page)". The solutions we have looked at are:

1. Overload the LRU list_head in struct page to carry both a global LRU and a per-container LRU.

[ASCII diagram elided: pages 0 and 1 chained on the global LRU; with the container LRU overloaded onto the same list_head, pages 1 and n of one container are chained together as well.]

Page 1 and n belong to the same container; to get to page 0, you need two dereferences.

2. Modify struct page to point to a container, and allow each container to have a per-container LRU along with the global LRU.

For efficiency we need the container LRU, and we don't want to split the global LRU either. We need to optimize the reclaim path, but I thought of that as a secondary problem. Once we all agree that the controller looks simple, accounts well and works, we can/should definitely optimize the reclaim path.

-- Warm Regards, Balbir Singh
Re: [RFC][PATCH][1/4] RSS controller setup
Andrew Morton wrote: On Mon, 19 Feb 2007 12:20:26 +0530 Balbir Singh [EMAIL PROTECTED] wrote: This patch sets up the basic controller infrastructure on top of the containers infrastructure. Two files are provided for monitoring and control: memctlr_usage and memctlr_limit.

The patches use the identifier memctlr a lot. It is hard to remember, and unpronounceable. Something like memcontrol or mem_controller or memory_controller would be more typical.

I'll change the name to memory_controller.

...

+	BUG_ON(!mem);
+	if ((buffer = kmalloc(nbytes + 1, GFP_KERNEL)) == 0)
+		return -ENOMEM;

Please prefer to do

	buffer = kmalloc(nbytes + 1, GFP_KERNEL);
	if (buffer == NULL)
		return -ENOMEM;

ie: avoid the assign-and-test-in-the-same-statement thing. This affects the whole patchset.

I'll fix that.

Also, please don't compare pointers to literal zero like that. It makes me get buried in patches to convert it to NULL. I think this is a sparse thing.

Good point, I'll fix it.

+	buffer[nbytes] = 0;
+	if (copy_from_user(buffer, userbuf, nbytes)) {
+		ret = -EFAULT;
+		goto out_err;
+	}
+
+	container_manage_lock();
+	if (container_is_removed(cont)) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	limit = simple_strtoul(buffer, NULL, 10);
+	/*
+	 * 0 is a valid limit (unlimited resource usage)
+	 */
+	if (!limit && strcmp(buffer, "0"))
+		goto out_unlock;
+
+	spin_lock(&mem->lock);
+	mem->counter.limit = limit;
+	spin_unlock(&mem->lock);

The patches do this a lot: a single atomic assignment with a pointless-looking lock/unlock around it. It's often the case that this idiom indicates a bug, or needless locking. I think the only case where it makes sense is when there's some other code somewhere which is doing

	spin_lock(&mem->lock);
	...
	use1(mem->counter.limit);
	...
	use2(mem->counter.limit);
	...
	spin_unlock(&mem->lock);

where use1() and use2() expect the two reads of mem->counter.limit to return the same value. Is that the case in these patches? If not, we might have a problem in there.
The next set of patches move to atomic values for the limits. That should fix the locking. + +static ssize_t memctlr_read(struct container *cont, struct cftype *cft, + struct file *file, char __user *userbuf, + size_t nbytes, loff_t *ppos) +{ + unsigned long usage, limit; + char usagebuf[64]; /* Move away from stack later */ + char *s = usagebuf; + struct memctlr *mem = memctlr_from_cont(cont); + + spin_lock(&mem->lock); + usage = mem->counter.usage; + limit = mem->counter.limit; + spin_unlock(&mem->lock); + + s += sprintf(s, "usage %lu, limit %ld\n", usage, limit); + return simple_read_from_buffer(userbuf, nbytes, ppos, usagebuf, + s - usagebuf); +} This output is hard to parse and to extend. I'd suggest either two separate files, or multi-line output: usage: %lu kB limit: %lu kB and what are the units of these numbers? Page counts? If so, please don't do that: it requires applications and humans to know the current kernel's page size. Yes, this looks much better. I'll move to this format. I get myself lost in bc at times, that should have been a hint. +static struct cftype memctlr_usage = { + .name = "memctlr_usage", + .read = memctlr_read, +}; + +static struct cftype memctlr_limit = { + .name = "memctlr_limit", + .write = memctlr_write, +}; + +static int memctlr_populate(struct container_subsys *ss, + struct container *cont) +{ + int rc; + if ((rc = container_add_file(cont, &memctlr_usage)) < 0) + return rc; + if ((rc = container_add_file(cont, &memctlr_limit)) < 0) Clean up the first file here? I used cpuset_populate() as an example to code this one up. I don't think there is an easy way in containers to clean up files. 
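The suggested multi-line, kB-based format can be sketched in user space. The function name and the fixed page size parameter are illustrative assumptions; the point is that the kernel converts to kB so userspace never has to know the page size.

```c
#include <stdio.h>

/* Sketch of the multi-line report format suggested in the review,
 * converting an internal page count to kB before it reaches userspace.
 * page_kb (kB per page) is passed in here only for illustration; the
 * kernel would derive it from PAGE_SIZE. */
int format_usage(char *buf, size_t len, unsigned long usage_pages,
                 unsigned long limit_pages, unsigned long page_kb)
{
        return snprintf(buf, len, "usage: %lu kB\nlimit: %lu kB\n",
                        usage_pages * page_kb, limit_pages * page_kb);
}
```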
I'll double check. + return rc; + return 0; +} + +static struct container_subsys memctlr_subsys = { + .name = "memctlr", + .create = memctlr_create, + .destroy = memctlr_destroy, + .populate = memctlr_populate, +}; + +int __init memctlr_init(void) +{ + int id; + + id = container_register_subsys(&memctlr_subsys); + printk("Initializing memctlr version %s, id %d\n", version, id); + return id < 0 ? id : 0; +} + +module_init(memctlr_init); Thanks for the detailed review, -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ckrm-tech] [RFC][PATCH][0/4] Memory controller (RSS Control)
Kirill Korotaev wrote: On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote: Alas, I fear this might have quite bad worst-case behaviour. One small container which is under constant memory pressure will churn the system-wide LRUs like mad, and will consume rather a lot of system time. So it's a point at which container A can deleteriously affect things which are running in other containers, which is exactly what we're supposed to not do. I think it's OK for a container to consume lots of system time during reclaim, as long as we can account that time to the container involved (i.e. if it's done during direct reclaim rather than by something like kswapd). hmm, is it ok to scan 100Gb of RAM for a 10MB RAM container? in the UBC patch set we used page beancounters to track container pages. This allows us to make an efficient scan controller and reclamation. Thanks, Kirill Hi, Kirill, Yes, that's a problem, but I think it's a problem that can be solved in steps. First step, add reclaim. Second step, optimize reclaim. -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control
Andrew Morton wrote: On Mon, 19 Feb 2007 12:20:34 +0530 Balbir Singh [EMAIL PROTECTED] wrote: This patch adds the basic accounting hooks to account for pages allocated into the RSS of a process. Accounting is maintained at two levels, in the mm_struct of each task and in the memory controller data structure associated with each node in the container. When the limit specified for the container is exceeded, the task is killed. RSS accounting is consistent with the current definition of RSS in the kernel. Shared pages are accounted into the RSS of each process as is done in the kernel currently. The code is flexible in that it can be easily modified to work with any definition of RSS. .. +static inline int memctlr_mm_init(struct mm_struct *mm) +{ + return 0; +} So it returns zero on success. OK. Oops, it should return 1 on success. --- linux-2.6.20/kernel/fork.c~memctlr-acct 2007-02-18 22:55:50.0 +0530 +++ linux-2.6.20-balbir/kernel/fork.c 2007-02-18 22:55:50.0 +0530 @@ -50,6 +50,7 @@ #include <linux/taskstats_kern.h> #include <linux/random.h> #include <linux/numtasks.h> +#include <linux/memctlr.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -342,10 +343,15 @@ static struct mm_struct * mm_init(struct mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + if (!memctlr_mm_init(mm)) + goto err; + But here we treat zero as an error? It's a BUG, I'll fix it. if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; return mm; } + +err: free_mm(mm); return NULL; } ... +int memctlr_mm_init(struct mm_struct *mm) +{ + mm->counter = kmalloc(sizeof(struct res_counter), GFP_KERNEL); + if (!mm->counter) + return 0; + atomic_long_set(&mm->counter->usage, 0); + atomic_long_set(&mm->counter->limit, 0); + rwlock_init(&mm->container_lock); + return 1; +} ah-ha, we have another Documentation/SubmitChecklist customer. It would be more conventional to make this return -EFOO on error, zero on success. ok.. I'll convert the functions to be consistent with the return 0 on success philosophy. 
+void memctlr_mm_free(struct mm_struct *mm) +{ + kfree(mm->counter); +} + +static inline void memctlr_mm_assign_container_direct(struct mm_struct *mm, + struct container *cont) +{ + write_lock(&mm->container_lock); + mm->container = cont; + write_unlock(&mm->container_lock); +} More weird locking here. The container field of the mm_struct is protected by a read write spin lock. +void memctlr_mm_assign_container(struct mm_struct *mm, struct task_struct *p) +{ + struct container *cont = task_container(p, memctlr_subsys); + struct memctlr *mem = memctlr_from_cont(cont); + + BUG_ON(!mem); + write_lock(&mm->container_lock); + mm->container = cont; + write_unlock(&mm->container_lock); +} And here. Ditto. +/* + * Update the rss usage counters for the mm_struct and the container it belongs + * to. We do not fail rss for pages shared during fork (see copy_one_pte()). + */ +int memctlr_update_rss(struct mm_struct *mm, int count, bool check) +{ + int ret = 1; + struct container *cont; + long usage, limit; + struct memctlr *mem; + + read_lock(&mm->container_lock); + cont = mm->container; + read_unlock(&mm->container_lock); + + if (!cont) + goto done; And here. I mean, if there was a reason for taking the lock around that read, then testing `cont' outside the lock just invalidated that reason. We took a consistent snapshot of cont. It cannot change outside the lock, we check the value outside. I am sure I missed something. +static inline void memctlr_double_lock(struct memctlr *mem1, + struct memctlr *mem2) +{ + if (mem1 > mem2) { + spin_lock(&mem1->lock); + spin_lock(&mem2->lock); + } else { + spin_lock(&mem2->lock); + spin_lock(&mem1->lock); + } +} Conventionally we take the lower-addressed lock first when doing this, not the higher-addressed one. Will fix this. +static inline void memctlr_double_unlock(struct memctlr *mem1, + struct memctlr *mem2) +{ + if (mem1 > mem2) { + spin_unlock(&mem2->lock); + spin_unlock(&mem1->lock); + } else { + spin_unlock(&mem1->lock); + spin_unlock(&mem2->lock); + } +} + ... 
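The lock-ordering convention Andrew refers to can be sketched in user space with pthread mutexes standing in for the kernel spinlocks (the function names here are illustrative, not from the patches): always take the lower-addressed lock first, so that two tasks locking the same pair in opposite argument order agree on one global order and cannot deadlock.

```c
#include <pthread.h>

/* Sketch of address-ordered double locking: whichever argument order
 * the caller uses, the lower-addressed mutex is always taken first. */
static void double_lock(pthread_mutex_t *a, pthread_mutex_t *b)
{
        if (a < b) {
                pthread_mutex_lock(a);
                pthread_mutex_lock(b);
        } else {
                pthread_mutex_lock(b);
                pthread_mutex_lock(a);
        }
}

static void double_unlock(pthread_mutex_t *a, pthread_mutex_t *b)
{
        /* Unlock order does not affect deadlock safety; reverse of the
         * lock order is merely conventional. */
        if (a < b) {
                pthread_mutex_unlock(b);
                pthread_mutex_unlock(a);
        } else {
                pthread_mutex_unlock(a);
                pthread_mutex_unlock(b);
        }
}
```

With this shape, `double_lock(x, y)` and `double_lock(y, x)` acquire the two locks in the same internal order.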
retval = -ENOMEM; + + if (!memctlr_update_rss(mm, 1, MEMCTLR_CHECK_LIMIT)) + goto out; + Again, please use zero for success and -EFOO for error. That way, you don't have to assume that the reason memctlr_update_rss() failed was out-of-memory. Just propagate the error back. Yes, will do
Re: [RFC][PATCH][0/4] Memory controller (RSS Control)
Paul Menage wrote: On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote: Alas, I fear this might have quite bad worst-case behaviour. One small container which is under constant memory pressure will churn the system-wide LRUs like mad, and will consume rather a lot of system time. So it's a point at which container A can deleteriously affect things which are running in other containers, which is exactly what we're supposed to not do. I think it's OK for a container to consume lots of system time during reclaim, as long as we can account that time to the container involved (i.e. if it's done during direct reclaim rather than by something like kswapd). Churning the LRU could well be bad though, I agree. I completely agree with you on reclaim consuming time. Churning the LRU can be avoided by the means I mentioned before: 1. Add a container pointer (per page struct), it is also useful for the page cache controller 2. Check if the page belongs to a particular container before the list_del(&page->lru), so that those pages can be skipped. 3. Use a double LRU list by overloading the lru list_head of struct page. Paul -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH][0/4] Memory controller (RSS Control)
Magnus Damm wrote: On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote: On Mon, 19 Feb 2007 12:20:19 +0530 Balbir Singh [EMAIL PROTECTED] wrote: This patch applies on top of Paul Menage's container patches (V7) posted at http://lkml.org/lkml/2007/2/12/88 It implements a controller within the containers framework for limiting memory usage (RSS usage). The key part of this patchset is the reclaim algorithm: Alas, I fear this might have quite bad worst-case behaviour. One small container which is under constant memory pressure will churn the system-wide LRUs like mad, and will consume rather a lot of system time. So it's a point at which container A can deleteriously affect things which are running in other containers, which is exactly what we're supposed to not do. Nice with a simple memory controller. The downside seems to be that it doesn't scale very well when it comes to reclaim, but maybe that just comes with being simple. Step by step, and maybe this is a good first step? Thanks, I totally agree. Ideally I'd like to see unmapped pages handled on a per-container LRU with a fallback to the system-wide LRUs. Shared/mapped pages could be handled using PTE ageing/unmapping instead of page ageing, but that may consume too much resources to be practical. / magnus Keeping unmapped pages per container sounds interesting. I am not quite sure what PTE ageing is, I will look it up. -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH][3/4] Add reclaim support
Andrew Morton wrote: On Mon, 19 Feb 2007 12:20:42 +0530 Balbir Singh [EMAIL PROTECTED] wrote: This patch reclaims pages from a container when the container limit is hit. The executable is oom'ed only when the container it is running in, is overlimit and we could not reclaim any pages belonging to the container A parameter called pushback, controls how much memory is reclaimed when the limit is hit. It should be easy to expose this knob to user space, but currently it is hard coded to 20% of the total limit of the container. isolate_lru_pages() has been modified to isolate pages belonging to a particular container, so that reclaim code will reclaim only container pages. For shared pages, reclaim does not unmap all mappings of the page, it only unmaps those mappings that are over their limit. This ensures that other containers are not penalized while reclaiming shared pages. Parallel reclaim per container is not allowed. Each controller has a wait queue that ensures that only one task per control is running reclaim on that container. ... 
--- linux-2.6.20/include/linux/rmap.h~memctlr-reclaim-on-limit 2007-02-18 23:29:14.0 +0530 +++ linux-2.6.20-balbir/include/linux/rmap.h2007-02-18 23:29:14.0 +0530 @@ -90,7 +90,15 @@ static inline void page_dup_rmap(struct * Called from mm/vmscan.c to handle paging out */ int page_referenced(struct page *, int is_locked); -int try_to_unmap(struct page *, int ignore_refs); +int try_to_unmap(struct page *, int ignore_refs, void *container); +#ifdef CONFIG_CONTAINER_MEMCTLR +bool page_in_container(struct page *page, struct zone *zone, void *container); +#else +static inline bool page_in_container(struct page *page, struct zone *zone, void *container) +{ + return true; +} +#endif /* CONFIG_CONTAINER_MEMCTLR */ /* * Called from mm/filemap_xip.c to unmap empty zero page @@ -118,7 +126,8 @@ int page_mkclean(struct page *); #define anon_vma_link(vma) do {} while (0) #define page_referenced(page,l) TestClearPageReferenced(page) -#define try_to_unmap(page, refs) SWAP_FAIL +#define try_to_unmap(page, refs, container) SWAP_FAIL +#define page_in_container(page, zone, container) true I spy a compile error. The static-inline version looks nicer. I will compile with the feature turned off and double check. I'll also convert it to a static inline function. static inline int page_mkclean(struct page *page) { diff -puN include/linux/swap.h~memctlr-reclaim-on-limit include/linux/swap.h --- linux-2.6.20/include/linux/swap.h~memctlr-reclaim-on-limit 2007-02-18 23:29:14.0 +0530 +++ linux-2.6.20-balbir/include/linux/swap.h2007-02-18 23:29:14.0 +0530 @@ -188,6 +188,10 @@ extern void swap_setup(void); /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zone **, gfp_t); extern unsigned long shrink_all_memory(unsigned long nr_pages); +#ifdef CONFIG_CONTAINER_MEMCTLR +extern unsigned long memctlr_shrink_mapped_memory(unsigned long nr_pages, + void *container); +#endif Usually one doesn't need to put ifdefs around the declaration like this. 
If the function doesn't exist and nobody calls it, we're fine. If someone _does_ call it, we'll find out the error at link-time. Sure, sounds good. I'll get rid of the #ifdefs. +/* + * checks if the mm's container and scan control passed container match, if + * so, is the container over its limit. Returns 1 if the container is above + * its limit. + */ +int memctlr_mm_overlimit(struct mm_struct *mm, void *sc_cont) +{ + struct container *cont; + struct memctlr *mem; + long usage, limit; + int ret = 1; + + if (!sc_cont) + goto out; + + read_lock(&mm->container_lock); + cont = mm->container; + + /* + * Regular reclaim, let it proceed as usual + */ + if (!sc_cont) + goto out; + + ret = 0; + if (cont != sc_cont) + goto out; + + mem = memctlr_from_cont(cont); + usage = atomic_long_read(&mem->counter.usage); + limit = atomic_long_read(&mem->counter.limit); + if (limit && (usage > limit)) + ret = 1; +out: + read_unlock(&mm->container_lock); + return ret; +} hm, I wonder how much additional lock traffic all this adds. It's a read_lock() and most of the locks are read_locks which allow for concurrent access, until the container changes or goes away int memctlr_mm_init(struct mm_struct *mm) { mm->counter = kmalloc(sizeof(struct res_counter), GFP_KERNEL); @@ -77,6 +125,46 @@ void memctlr_mm_assign_container(struct write_unlock(&mm->container_lock); } +static int memctlr_check_and_reclaim(struct container *cont, long usage, + long limit) +{ + unsigned long nr_pages = 0; + unsigned long nr_reclaimed = 0; + int retries = nr_retries; + int ret = 1
Re: [RFC][PATCH][3/4] Add reclaim support
KAMEZAWA Hiroyuki wrote: On Mon, 19 Feb 2007 12:20:42 +0530 Balbir Singh [EMAIL PROTECTED] wrote: +int memctlr_mm_overlimit(struct mm_struct *mm, void *sc_cont) +{ + struct container *cont; + struct memctlr *mem; + long usage, limit; + int ret = 1; + + if (!sc_cont) + goto out; + + read_lock(&mm->container_lock); + cont = mm->container; +out: + read_unlock(&mm->container_lock); + return ret; +} + should be == out_and_unlock: read_unlock(&mm->container_lock); out: return ret; Thanks, that's a much better convention! -Kame -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control
Andrew Morton wrote: On Mon, 19 Feb 2007 16:07:44 +0530 Balbir Singh [EMAIL PROTECTED] wrote: +void memctlr_mm_free(struct mm_struct *mm) +{ + kfree(mm->counter); +} + +static inline void memctlr_mm_assign_container_direct(struct mm_struct *mm, + struct container *cont) +{ + write_lock(&mm->container_lock); + mm->container = cont; + write_unlock(&mm->container_lock); +} More weird locking here. The container field of the mm_struct is protected by a read write spin lock. That doesn't mean anything to me. What would go wrong if the above locking was simply removed? And how does the locking prevent that fault? Some pages could be charged to the wrong container. Apart from that I do not see anything going bad (I'll double check that). +void memctlr_mm_assign_container(struct mm_struct *mm, struct task_struct *p) +{ + struct container *cont = task_container(p, memctlr_subsys); + struct memctlr *mem = memctlr_from_cont(cont); + + BUG_ON(!mem); + write_lock(&mm->container_lock); + mm->container = cont; + write_unlock(&mm->container_lock); +} And here. Ditto. ditto ;) :-) +/* + * Update the rss usage counters for the mm_struct and the container it belongs + * to. We do not fail rss for pages shared during fork (see copy_one_pte()). + */ +int memctlr_update_rss(struct mm_struct *mm, int count, bool check) +{ + int ret = 1; + struct container *cont; + long usage, limit; + struct memctlr *mem; + + read_lock(&mm->container_lock); + cont = mm->container; + read_unlock(&mm->container_lock); + + if (!cont) + goto done; And here. I mean, if there was a reason for taking the lock around that read, then testing `cont' outside the lock just invalidated that reason. We took a consistent snapshot of cont. It cannot change outside the lock, we check the value outside. I am sure I missed something. If it cannot change outside the lock then we don't need to take the lock! We took a snapshot that we thought was consistent. We check for the value outside. 
I guess there is no harm, the worst thing that could happen is wrong accounting during mm->container changes (when a task changes container). MEMCTLR_DONT_CHECK_LIMIT exists for the following reasons: 1. Pages are shared during fork, fork() is not failed at that point since the pages are shared anyway, we allow the RSS limit to be exceeded. 2. When ZERO_PAGE is added, we don't check for limits (zeromap_pte_range). 3. On reducing RSS (passing -1 as the value) OK, that might make a nice comment somewhere (if it's not already there). Yes, thanks for keeping us humble and honest, I'll add it. -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH][1/4] RSS controller setup
Paul Menage wrote: On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote: This output is hard to parse and to extend. I'd suggest either two separate files, or multi-line output: usage: %lu kB limit: %lu kB Two separate files would be the container usage model that I envisaged, inherited from the way cpusets does things. And in this case, it should definitely be the limit in one file, readable and writeable, and the usage in another, probably only readable. Having to read a file called memctlr_usage to find the current limit sounds wrong. That sounds right, I'll fix this. Hmm, I don't appear to have documented this yet, but I think a good naming scheme for container files is subsystem.whatever - i.e. these should be memctlr.usage and memctlr.limit. The existing grandfathered Cpusets names violate this, but I'm not sure there's a lot we can do about that. Why subsystem.whatever? Dots are harder to parse using regular expressions and sound DOS'ish. I'd prefer _ to separate the subsystem and whatever :-) +static int memctlr_populate(struct container_subsys *ss, + struct container *cont) +{ + int rc; + if ((rc = container_add_file(cont, &memctlr_usage)) < 0) + return rc; + if ((rc = container_add_file(cont, &memctlr_limit)) < 0) Clean up the first file here? Containers don't currently provide an API for a subsystem to clean up files from a directory - that's done automatically when the directory is deleted. I think I'll probably change the API for container_add_file to return void, but mark an error in the container itself if something goes wrong - that way rather than all the subsystems having to check for error, container_populate_dir() can do so at the end of calling all the subsystems' populate methods. It should be easy to add container_remove_file() instead of marking an error. 
Paul -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH][3/4] Add reclaim support
Andrew Morton wrote: On Mon, 19 Feb 2007 16:20:53 +0530 Balbir Singh [EMAIL PROTECTED] wrote: + * so, is the container over its limit. Returns 1 if the container is above + * its limit. + */ +int memctlr_mm_overlimit(struct mm_struct *mm, void *sc_cont) +{ + struct container *cont; + struct memctlr *mem; + long usage, limit; + int ret = 1; + + if (!sc_cont) + goto out; + + read_lock(&mm->container_lock); + cont = mm->container; + + /* + * Regular reclaim, let it proceed as usual + */ + if (!sc_cont) + goto out; + + ret = 0; + if (cont != sc_cont) + goto out; + + mem = memctlr_from_cont(cont); + usage = atomic_long_read(&mem->counter.usage); + limit = atomic_long_read(&mem->counter.limit); + if (limit && (usage > limit)) + ret = 1; +out: + read_unlock(&mm->container_lock); + return ret; +} hm, I wonder how much additional lock traffic all this adds. It's a read_lock() and most of the locks are read_locks which allow for concurrent access, until the container changes or goes away read_lock isn't free, and I suspect we're calling this function pretty often (every pagefault?) It'll be measurable on some workloads, on some hardware. It probably won't be terribly bad because each lock-taking is associated with a clear_page(). But still, if there's any possibility of lightening the locking up, now is the time to think about it. Yes, good point. I'll revisit to see if barriers can replace the locking or if the locking is required at all? @@ -66,6 +67,9 @@ struct scan_control { int swappiness; int all_unreclaimable; + + void *container; /* Used by containers for reclaiming */ + /* pages when the limit is exceeded */ }; eww. Why void*? I did not want to expose struct container in mm/vmscan.c. It's already there, via rmap.h Yes, true An additional thought was that no matter what container goes in the field would be useful for reclaim. 
Am having trouble parsing that sentence ;) The thought was that, irrespective of the infrastructure that goes in, having an entry for reclaim in scan_control would be useful. I guess the name exposes what the type tries to hide :-) -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control
Andrew Morton wrote: On Mon, 19 Feb 2007 16:39:33 +0530 Balbir Singh [EMAIL PROTECTED] wrote: Andrew Morton wrote: On Mon, 19 Feb 2007 16:07:44 +0530 Balbir Singh [EMAIL PROTECTED] wrote: +void memctlr_mm_free(struct mm_struct *mm) +{ + kfree(mm->counter); +} + +static inline void memctlr_mm_assign_container_direct(struct mm_struct *mm, + struct container *cont) +{ + write_lock(&mm->container_lock); + mm->container = cont; + write_unlock(&mm->container_lock); +} More weird locking here. The container field of the mm_struct is protected by a read write spin lock. That doesn't mean anything to me. What would go wrong if the above locking was simply removed? And how does the locking prevent that fault? Some pages could be charged to the wrong container. Apart from that I do not see anything going bad (I'll double check that). Argh. Please, think about this. Sure, I will. I guess I am short circuiting my thinking process :-) That locking *doesn't do anything*. Except for that one situation I described: some other holder of the lock reads mm->container twice inside the lock and requires that the value be the same both times (and that sort of code should be converted to take a local copy, so this locking here can be removed). Yes, that makes sense. + + read_lock(&mm->container_lock); + cont = mm->container; + read_unlock(&mm->container_lock); + + if (!cont) + goto done; And here. I mean, if there was a reason for taking the lock around that read, then testing `cont' outside the lock just invalidated that reason. We took a consistent snapshot of cont. It cannot change outside the lock, we check the value outside. I am sure I missed something. If it cannot change outside the lock then we don't need to take the lock! We took a snapshot that we thought was consistent. Consistent with what? That's a single-word read inside that lock. Yes, that makes sense. We check for the value outside. 
I guess there is no harm, the worst thing that could happen is wrong accounting during mm->container changes (when a task changes container). If container_lock is held when a task is removed from the container then yes, `cont' here can refer to a container to which the task no longer belongs. More worrisome is the potential for use-after-free. What prevents the pointer at mm->container from referring to freed memory after we've dropped the lock? The container cannot be freed unless all tasks holding references to it are gone; that would ensure that all mm->container pointers are pointing elsewhere and never to a stale value. I hope my short-circuited brain got this right :-) -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
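The lifetime rule Balbir describes can be sketched with a plain reference count (all names here are invented for illustration; the kernel would use atomic_t and proper locking, and kfree() in place of the `freed` flag): the container object is destroyed only when the last task drops its reference, so a pointer read while the reading task still holds its own reference can never dangle.

```c
/* Hypothetical sketch of refcounted container lifetime: each task in
 * the container holds one reference; the object is "freed" only when
 * the last reference is dropped. */
struct container_obj {
        int refcount;
        int freed;      /* stand-in for actual deallocation */
};

static void container_get(struct container_obj *c)
{
        c->refcount++;
}

/* Returns 1 if this put dropped the last reference and freed it. */
static int container_put(struct container_obj *c)
{
        if (--c->refcount == 0) {
                c->freed = 1;   /* stand-in for kfree(c) */
                return 1;
        }
        return 0;
}
```

While any task's reference is still held, `freed` stays 0, which is the property that makes the unlocked read of mm->container safe against use-after-free.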
Re: [RFC][PATCH][0/4] Memory controller (RSS Control)
Magnus Damm wrote: On 2/19/07, Balbir Singh [EMAIL PROTECTED] wrote: Magnus Damm wrote: On 2/19/07, Andrew Morton [EMAIL PROTECTED] wrote: On Mon, 19 Feb 2007 12:20:19 +0530 Balbir Singh [EMAIL PROTECTED] wrote: This patch applies on top of Paul Menage's container patches (V7) posted at http://lkml.org/lkml/2007/2/12/88 It implements a controller within the containers framework for limiting memory usage (RSS usage). The key part of this patchset is the reclaim algorithm: Alas, I fear this might have quite bad worst-case behaviour. One small container which is under constant memory pressure will churn the system-wide LRUs like mad, and will consume rather a lot of system time. So it's a point at which container A can deleteriously affect things which are running in other containers, which is exactly what we're supposed to not do. Nice with a simple memory controller. The downside seems to be that it doesn't scale very well when it comes to reclaim, but maybe that just comes with being simple. Step by step, and maybe this is a good first step? Thanks, I totally agree. Ideally I'd like to see unmapped pages handled on a per-container LRU with a fallback to the system-wide LRUs. Shared/mapped pages could be handled using PTE ageing/unmapping instead of page ageing, but that may consume too much resources to be practical. / magnus Keeping unmapped pages per container sounds interesting. I am not quite sure what PTE ageing is, I will look it up. You will most likely have no luck looking it up, so here is what I mean by PTE ageing: The most common unit for memory resource control seems to be physical pages. Keeping track of pages is simple in the case of a single user per page, but for shared pages tracking the owner becomes more complex. I consider unmapped pages to only have a single user at a time, so the unit for unmapped memory resource control is physical pages. 
Apart from implementation details such as fun with struct page and scalability, handling this case is not so complicated. Mapped or shared pages should be handled in a different way IMO. PTEs should be used instead of physical pages as the unit for resource control and reclaim. For the user this looks pretty much the same as physical pages, apart from memory overcommit. So instead of using a global page reclaim policy and reserving physical pages per container I propose that resource controlled shared pages should be handled using a PTE replacement policy. This policy is used to keep the most active PTEs in the container backed by physical pages. Inactive PTEs get unmapped in favour of newer PTEs. One way to implement this could be by populating the address space of resource controlled processes with multiple smaller LRU2Qs. The compact data structure that I have in mind is basically an array of 256 bytes, one byte per PTE. Associated with this data structure are start indexes and lengths for two lists. The indexes are used in a FAT-type of chain to form singly linked lists. So we create active and inactive lists here - and we move PTEs between the lists when we check the young bits from the page reclaim and when we apply memory pressure. Unmapping is done through the normal page reclaimer but using information from the PTE LRUs. In my mind this should lead to more fair resource control of mapped pages, but if it is possible to implement with low overhead, that's another question. =) Thanks for listening. / magnus Thanks for explaining PTE aging. -- Warm Regards, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
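The compact FAT-style chain Magnus describes can be sketched as follows. All names and the active-list operations are illustrative assumptions, not from any actual patch: a 256-byte array where next[i] names the following PTE slot in a list, plus a head index and length for each of the two lists (only the active list is shown; the inactive list would work identically).

```c
#include <stdint.h>

/* Sketch of a FAT-type chain over 256 PTE slots: one byte per slot
 * naming the next slot in its list. Slot 0xff is sacrificed as the
 * end-of-list sentinel. */
#define NSLOTS  256
#define NIL     0xff

struct pte_lru {
        uint8_t next[NSLOTS];
        uint8_t active_head;
        uint8_t active_len;
        /* an inactive_head/inactive_len pair would complete the 2Q */
};

static void lru_init(struct pte_lru *l)
{
        l->active_head = NIL;
        l->active_len = 0;
}

/* Push slot i onto the front of the active list (e.g. young bit set). */
static void lru_activate(struct pte_lru *l, uint8_t i)
{
        l->next[i] = l->active_head;
        l->active_head = i;
        l->active_len++;
}

/* Pop the head of the active list (e.g. to demote or unmap it under
 * memory pressure). Returns -1 when the list is empty. */
static int lru_pop_active(struct pte_lru *l)
{
        uint8_t i;

        if (l->active_head == NIL)
                return -1;
        i = l->active_head;
        l->active_head = l->next[i];
        l->active_len--;
        return i;
}
```

The whole structure costs a little over one byte per PTE, which is what makes the idea attractive compared to full list_head-based LRUs.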
Re: taskstats accounting info
Randy Dunlap wrote: On Thu, 15 Mar 2007 11:06:55 -0800 Andrew Morton wrote: It's the most portable example, since it does not depend on libnl. err, what is libnl? lib-netlink (as already answered, but I wrote this last week) I was referring to the library at http://people.suug.ch/~tgr/libnl/ If there exists some real userspace infrastructure which utilises taskstats, can we please get a reference to it into the kernel Documentation? Perhaps in the TASKSTATS Kconfig entry, thanks. Balbir, I was working with getdelays.c when I initially wrote these questions. Here is a small patch for it. Hopefully you can use it when you find the updated version of it. ~Randy From: Randy Dunlap [EMAIL PROTECTED] 1. add usage() function 2. print the unknown character in %c format (was only in %d, not useful): ./getdelays: invalid option -- h Unknown option '?' (63) instead of: ./getdelays: invalid option -- h Unknown option 63 (or just remove that message) 3. -v does not use an optarg, so remove ':' in getopt string after 'v'; Thanks, these look good. I'll add them to my local copy. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux-VServer example results for sharing vs. separate mappings ...
Andrew Morton wrote: snip The problem is memory reclaim. A number of schemes which have been proposed require a per-container page reclaim mechanism - basically a separate scanner. This is a huge, huge, huge problem. The present scanner has been under development for over a decade and has had tremendous amounts of work and testing put into it. And it still has problems. But those problems will be gradually addressed. A per-container reclaim scheme really really really wants to reuse all that stuff rather than creating a separate, parallel, new scanner which has the same robustness requirements, only has a decade less test and development done on it. And which permanently doubles our maintenance costs. The current per-container reclaim scheme does reuse a lot of code. As far as code maintenance is concerned, I think it should be easy to merge some of the common functionality by abstracting it out into different functions. The container smartness comes in only in container_isolate_pages(). This is an easy-to-understand function. So how do we reuse our existing scanner? With physical containers. One can envisage several schemes: a) slice the machine into 128 fake NUMA nodes, use each node as the basic block of memory allocation, manage the binding between these memory hunks and process groups with cpusets. This is what Google are testing, and it works. Don't we break the global LRU with this scheme? b) Create a new memory abstraction, call it the software zone, which is mostly decoupled from the present hardware zones. Most of the MM is reworked to use software zones. The software zones are runtime-resizeable, and obtain their pages via some means from the hardware zones. A container uses a software zone. I think the problem would be figuring out where to allocate memory from. What happens if a software zone spans many hardware zones? c) Something else, similar to the above. Various schemes can be envisaged; it isn't terribly important for this discussion.
Let me repeat: this all has a huge upside in that it reuses the existing page reclamation logic. And cpusets. Yes, we do discover glitches, but those glitches (such as Christoph's recent discovery of suboptimal interaction between cpusets and the global dirty ratio) get addressed, and we tend to strengthen the overall MM system as we address them. So what are the downsides? I think mainly the sharing issue: I think binding the resource controller and the allocator might be a bad idea. I tried experimenting with it and soon ran into some hard-to-answer questions: 1. How do we control the length of the zonelists that we need to allocate memory from (in a container)? 2. Like you said, how do we share pages across zones (containers)? 3. What happens to the global LRU behaviour? 4. Do we need a per_cpu_pageset associated with containers? 5. What do we do with unused memory in a zone; is it shared with other zones? 6. Changing zones or creating an abstraction out of them is likely to impact the entire core VM setup; that is high risk, so do we really need to do it this way? But how much of a problem will it be *in practice*? Probably a lot of people just won't notice or care. There will be a few situations where it may be a problem, but perhaps we can address those? Forced migration of pages from one zone into another is possible. Or change the reclaim code so that a page which hasn't been referenced from a process within its hardware container is considered unreferenced (so it gets reclaimed). Or a manual nuke-all-the-pages knob which system administration tools can use. All doable, if we indeed have a demonstrable problem which needs to be addressed. And I do think it's worth trying to address these things, because the thought of implementing a brand new memory reclaim mechanism scares the pants off me. The reclaim mechanism proposed *does not impact non-container users*.
The only impact is container-driven reclaim; like every other new feature, this can benefit from good testing in -mm. I believe we have something simple and understandable to get us started. I would request you to consider merging the RSS controller and containers patches in -mm. If too many people complain, or we see the problems that you foresee and our testing, enhancements and maintenance are unable to sort those problems out, we know we'll have another approach to fall back upon :-) It'll also teach us to listen to the maintainers when they talk of design ;) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH] Fix race between attach_task and cpuset_exit
Hi, Vatsa, Srivatsa Vaddagiri wrote: diff -puN kernel/cpuset.c~cpuset_race_fix kernel/cpuset.c --- linux-2.6.21-rc4/kernel/cpuset.c~cpuset_race_fix2007-03-25 21:08:27.0 +0530 +++ linux-2.6.21-rc4-vatsa/kernel/cpuset.c 2007-03-25 21:25:05.0 +0530 @@ -1182,6 +1182,7 @@ static int attach_task(struct cpuset *cs pid_t pid; struct task_struct *tsk; struct cpuset *oldcs; + struct cpuset *oldcs_tobe_released = NULL; How about oldcs_to_be_released? cpumask_t cpus; nodemask_t from, to; struct mm_struct *mm; @@ -1237,6 +1238,8 @@ static int attach_task(struct cpuset *cs } atomic_inc(cs-count); rcu_assign_pointer(tsk-cpuset, cs); + if (atomic_dec_and_test(oldcs-count)) + oldcs_tobe_released = oldcs; task_unlock(tsk); guarantee_online_cpus(cs, cpus); @@ -1257,8 +1260,8 @@ static int attach_task(struct cpuset *cs put_task_struct(tsk); synchronize_rcu(); - if (atomic_dec_and_test(oldcs-count)) - check_for_release(oldcs, ppathbuf); + if (oldcs_tobe_released) + check_for_release(oldcs_tobe_released, ppathbuf); return 0; } @@ -2200,10 +2203,6 @@ void cpuset_fork(struct task_struct *chi * it is holding that mutex while calling check_for_release(), * which calls kmalloc(), so can't be called holding callback_mutex(). * - * We don't need to task_lock() this reference to tsk-cpuset, - * because tsk is already marked PF_EXITING, so attach_task() won't - * mess with it, or task is a failed fork, never visible to attach_task. - * * the_top_cpuset_hack: * *Set the exiting tasks cpuset to the root cpuset (top_cpuset). @@ -2242,19 +2241,20 @@ void cpuset_exit(struct task_struct *tsk { struct cpuset *cs; + task_lock(tsk); cs = tsk-cpuset; tsk-cpuset = top_cpuset; /* the_top_cpuset_hack - see above */ + atomic_dec(cs-count); How about using a local variable like ref_count and using ref_count = atomic_dec_and_test(cs-count); This will avoid the two atomic operations, atomic_dec() and atomic_read() below. 
+ task_unlock(tsk); if (notify_on_release(cs)) { char *pathbuf = NULL; mutex_lock(&manage_mutex); - if (atomic_dec_and_test(&cs->count)) + if (!atomic_read(&cs->count)) if (ref_count == 0) check_for_release(cs, pathbuf); mutex_unlock(&manage_mutex); cpuset_release_agent(pathbuf); - } else { - atomic_dec(&cs->count); } } -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: Linux-VServer example results for sharing vs. separate mappings ...
Andrew Morton wrote: Don't we break the global LRU with this scheme? Sure, but that's deliberate! (And we don't have a global LRU - the LRUs are per-zone). Yes, true. But if we use zones for containers and say we have 400 of them, all of them under their limits, then when the system wants to reclaim memory, we might not end up reclaiming the best pages. Am I missing something? b) Create a new memory abstraction, call it the software zone, which is mostly decoupled from the present hardware zones. Most of the MM is reworked to use software zones. The software zones are runtime-resizeable, and obtain their pages via some means from the hardware zones. A container uses a software zone. I think the problem would be figuring out where to allocate memory from? What happens if a software zone spans across many hardware zones? Yes, that would be the tricky part. But we generally don't care what physical zone user pages come from, apart from NUMA optimisation. The reclaim mechanism proposed *does not impact the non-container users*. Yup. Let's keep plugging away with Pavel's approach, see where it gets us. Yes, we have some changes that we've made to the reclaim logic, and we hope to integrate a page cache controller soon. We are also testing the patches. Hopefully soon enough they'll be in a good state and we can request you to merge the containers and the rss limit (plus page cache) controller. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: Linux-VServer example results for sharing vs. separate mappings ...
Andrew Morton wrote: On Mon, 26 Mar 2007 08:06:07 +0530 Balbir Singh [EMAIL PROTECTED] wrote: Andrew Morton wrote: Don't we break the global LRU with this scheme? Sure, but that's deliberate! (And we don't have a global LRU - the LRUs are per-zone). Yes, true. But if we use zones for containers and say we have 400 of them, with all of them under limit. When the system wants to reclaim memory, we might not end up reclaiming the best pages. Am I missing something? If a zone is under its min_pages limit, it needs reclaim. Who/when/why that reclaim is run doesn't really matter. Yeah, we might run into some scaling problems with that many zones. They're unlikely to be unfixable. ok. b) Create a new memory abstraction, call it the software zone, which is mostly decoupled from the present hardware zones. Most of the MM is reworked to use software zones. The software zones are runtime-resizeable, and obtain their pages via some means from the hardware zones. A container uses a software zone. I think the problem would be figuring out where to allocate memory from? What happens if a software zone spans across many hardware zones? Yes, that would be the tricky part. But we generally don't care what physical zone user pages come from, apart from NUMA optimisation. The reclaim mechanism proposed *does not impact the non-container users*. Yup. Let's keep plugging away with Pavel's approach, see where it gets us. Yes, we have some changes that we've made to the reclaim logic, we hope to integrate a page cache controller soon. We are also testing the patches. Hopefully soon enough, they'll be in a good state and we can request you to merge the containers and the rss limit (plus page cache) controller soon. Now I'm worried again. This separation between rss controller and pagecache is largely alien to memory reclaim. With physical containers these new concepts (and their implementations) don't need to exist - it is already all implemented. 
Designing brand-new memory reclaim machinery in mid-2007 sounds like a very bad idea. But let us see what it looks like. I did not mean to worry you again :-) We do not plan to implement brand new memory reclaim; we intend to modify some bits and pieces for per-container reclaim. We believe at this point that all the necessary infrastructure is largely present in container_isolate_pages(). Adding a page cache controller should not require core-mm surgery, just the accounting bits. We basically agree that designing a brand new reclaim machinery is a bad idea, and non-container users will not be impacted. Only container-driven reclaim (caused by a container being at its limit) will see some change in reclaim behaviour, and we shall try to keep the changes as small as possible. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH] Cpu statistics accounting based on Paul Menage patches
Andrew Morton wrote: On Wed, 11 Apr 2007 19:02:27 +0400 Pavel Emelianov [EMAIL PROTECTED] wrote: Provides per-container statistics concerning the numbers of tasks in various states, system and user times, etc. Patch is inspired by Paul's example of the used CPU time accounting. Although this patch is independent from Paul's example to make it possible to play with them separately. Why is this actually needed? If userspace has a list of the tasks which are in a particular container, it can run around and add up the stats for those tasks without kernel changes? It's a bit irksome that we have so much accounting of this form in core kernel, yet we have to go and add a completely new implementation to create something which is similar to what we already have. But I don't immediately see a fix for that. Apart from paragraph #1 ;) Should there be linkage between per-container stats and delivery-via-taskstats? I can't think of one, really. You have cpu stats. Later, presumably, we'll need IO stats, MM stats, context-switch stats, number-of-syscall stats, etc, etc. Are we going to reimplement all of those things as well? See paragraph #1! Bottom line: I think we seriously need to find some way of consolidating per-container stats with our present per-task stats. Perhaps we should instead be looking at ways in which we can speed up paragraph #1. This should be easy to build. Per-container stats can live in parallel with per-task stats, and they can use the same general mechanism for data communication to user space. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control
Paul Menage wrote: On 2/19/07, Balbir Singh [EMAIL PROTECTED] wrote: More worrisome is the potential for use-after-free. What prevents the pointer at mm->container from referring to freed memory after we've dropped the lock? The container cannot be freed unless all tasks holding references to it are gone, ... or have been moved to other containers. If you're not holding task->alloc_lock or one of the container mutexes, there's nothing to stop the task being moved to another container, and the container being deleted. If you're in an RCU section then you can guarantee that the container (that you originally read from the task) and its subsystems at least won't be deleted while you're accessing them, but for accounting like this I suspect that's not enough, since you need to be adding to the accounting stats on the correct container. I think you'll need to hold mm->container_lock for the duration of memctl_update_rss() Paul Yes, that sounds like the correct thing to do. -- Warm Regards, Balbir Singh
Re: [ckrm-tech] [RFC][PATCH][2/4] Add RSS accounting and control
Vaidyanathan Srinivasan wrote: Balbir Singh wrote: Paul Menage wrote: On 2/19/07, Balbir Singh [EMAIL PROTECTED] wrote: More worrisome is the potential for use-after-free. What prevents the pointer at mm-container from referring to freed memory after we're dropped the lock? The container cannot be freed unless all tasks holding references to it are gone, ... or have been moved to other containers. If you're not holding task-alloc_lock or one of the container mutexes, there's nothing to stop the task being moved to another container, and the container being deleted. If you're in an RCU section then you can guarantee that the container (that you originally read from the task) and its subsystems at least won't be deleted while you're accessing them, but for accounting like this I suspect that's not enough, since you need to be adding to the accounting stats on the correct container. I think you'll need to hold mm-container_lock for the duration of memctl_update_rss() Paul Yes, that sounds like the correct thing to do. Accounting accuracy will anyway be affected when a process is migrated while it is still allocating pages. Having a lock here does not necessarily improve the accounting accuracy. Charges from the old container would have to be moved to the new container before deletion which implies all tasks have already left the container and no mm_struct is holding a pointer to it. The only condition that will break our code will be if the container pointer becomes invalid while we are updating stats. This can be prevented by RCU section as mentioned by Paul. I believe explicit lock and unlock may not provide additional benefit here. Yes, if the container pointer becomes invalid, then consider the following scenario 1. Use RCU, get a reference to the container 2. All tasks/mm's move to newer container (and the accounting information moves) 3. Container is RCU deleted 4. We still charge the older container that is going to be deleted soon 5. Release RCU 6. 
RCU garbage collects (callback runs). We end up charging/uncharging a soon-to-be-deleted container; that is not good. What did I miss? --Vaidy -- Warm Regards, Balbir Singh
[RFC][PATCH][0/4] Memory controller (RSS Control) (
This patch applies on top of Paul Menage's container patches (V7) posted at http://lkml.org/lkml/2007/2/12/88 It implements a controller within the containers framework for limiting memory usage (RSS usage). The memory controller was discussed at length in the RFC posted to lkml http://lkml.org/lkml/2006/10/30/51 This is version 2 of the patch, version 1 was posted at http://lkml.org/lkml/2007/2/19/10 I have tried to incorporate all comments; more details can be found in the changelogs of individual patches. Any remaining mistakes are all my fault. The next question could be why release version 2? 1. It serves as a decision point to decide if we should move to a per-container LRU list. Walking through the global LRU is slow; in this patchset I've tried to address the LRU churning issue. The patch memcontrol-reclaim-on-limit has more details 2. I've included fixes for several of the comments/issues raised in version 1 Steps to use the controller -- 0. Download the patches, apply the patches 1. Turn on CONFIG_CONTAINER_MEMCONTROL in kernel config, build the kernel and boot into the new kernel 2. mount -t container container -o memcontrol /mount point 3. cd /mount point optionally do (mkdir directory; cd directory) under /mount point 4. echo $$ > tasks (attaches the current shell to the container) 5. echo -n (limit value) > memcontrol_limit 6. cat memcontrol_usage 7. Run tasks, check the usage of the controller, reclaim behaviour 8. Report bugs, get bug fixes and iterate (goto step 0). Advantages of the patchset -- 1. Zero overhead in struct page (struct page is not expanded) 2. Minimal changes to the core-mm code 3. Shared pages are not reclaimed unless all mappings belong to overlimit containers. 4. It can be used to debug drivers/applications/kernel components in a constrained memory environment (similar to the mem=XXX option), except that several containers can be created simultaneously without rebooting and the limits can be changed.
NOTE: There is no support for limiting kernel memory allocations and page cache control (presently). Testing --- Created containers, attached tasks to containers with lower limits than the memory the tasks require (memory hog tests) and ran some basic tests on them. Tested the patches on UML and PowerPC. On UML, tried the patches with the config enabled and disabled (sanity check) and with containers enabled but the memory controller disabled. TODO's and improvement areas 1. Come up with cool page replacement algorithms for containers - still holds good (if possible without any changes to struct page) 2. Add page cache control 3. Add kernel memory allocator control 4. Extract benchmark numbers and overhead data Comments and criticism are welcome. Series -- memcontrol-setup.patch memcontrol-acct.patch memcontrol-reclaim-on-limit.patch memcontrol-doc.patch -- Warm Regards, Balbir Singh
[RFC][PATCH][1/4] RSS controller setup (
Changelog 1. Change the name from memctlr to memcontrol 2. Coding style changes, call the API and then check return value (for kmalloc). 3. Change the output format, to print sizes in both pages and kB 4. Split the usage and limit files to be independent (cat memcontrol_usage no longer prints the limit) TODO's 1. Implement error handling mechansim for handling container_add_file() failures (this would depend on the containers code). This patch sets up the basic controller infrastructure on top of the containers infrastructure. Two files are provided for monitoring and control memcontrol_usage and memcontrol_limit. memcontrol_usage shows the current usage (in pages, of RSS) and the limit set by the user. memcontrol_limit can be used to set a limit on the RSS usage of the resource. A special value of 0, indicates that the usage is unlimited. The limit is set in units of pages. Signed-off-by: [EMAIL PROTECTED] --- include/linux/memcontrol.h | 33 +++ init/Kconfig |7 + mm/Makefile|1 mm/memcontrol.c| 193 + 4 files changed, 234 insertions(+) diff -puN /dev/null include/linux/memcontrol.h --- /dev/null 2007-02-02 22:51:23.0 +0530 +++ linux-2.6.20-balbir/include/linux/memcontrol.h 2007-02-24 19:39:03.0 +0530 @@ -0,0 +1,33 @@ +/* + * memcontrol.h - Memory Controller for containers + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. 
+ * + * © Copyright IBM Corporation, 2006-2007 + * + * Author: Balbir Singh [EMAIL PROTECTED] + * + */ + +#ifndef _LINUX_MEMCONTROL_H +#define _LINUX_MEMCONTROL_H + +#ifdef CONFIG_CONTAINER_MEMCONTROL +#ifndef kB +#define kB 1024/* One Kilo Byte */ +#endif + +#else /* CONFIG_CONTAINER_MEMCONTROL */ + +#endif /* CONFIG_CONTAINER_MEMCONTROL */ +#endif /* _LINUX_MEMCONTROL_H */ diff -puN init/Kconfig~memcontrol-setup init/Kconfig --- linux-2.6.20/init/Kconfig~memcontrol-setup 2007-02-20 21:01:28.0 +0530 +++ linux-2.6.20-balbir/init/Kconfig2007-02-20 21:01:28.0 +0530 @@ -306,6 +306,13 @@ config CONTAINER_NS for instance virtual servers and checkpoint/restart jobs. +config CONTAINER_MEMCONTROL + bool A simple RSS based memory controller + select CONTAINERS + help + Provides a simple Resource Controller for monitoring and + controlling the total Resident Set Size of the tasks in a container + config RELAY bool Kernel-user space relay support (formerly relayfs) help diff -puN mm/Makefile~memcontrol-setup mm/Makefile --- linux-2.6.20/mm/Makefile~memcontrol-setup 2007-02-20 21:01:28.0 +0530 +++ linux-2.6.20-balbir/mm/Makefile 2007-02-20 21:01:28.0 +0530 @@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o +obj-$(CONFIG_CONTAINER_MEMCONTROL) += memcontrol.o diff -puN /dev/null mm/memcontrol.c --- /dev/null 2007-02-02 22:51:23.0 +0530 +++ linux-2.6.20-balbir/mm/memcontrol.c 2007-02-24 19:39:24.0 +0530 @@ -0,0 +1,193 @@ +/* + * memcontrol.c - Memory Controller for containers + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * © Copyright IBM Corporation, 2006-2007 + * + * Author: Balbir Singh [EMAIL PROTECTED] + * + */ + +#include linux/init.h +#include linux/parser.h +#include linux/fs.h +#include linux/container.h +#include linux/memcontrol.h + +#include asm/uaccess.h + +#define RES_USAGE_NO_LIMIT 0 +static const char version[] = 0.1; + +struct res_counter { + atomic_long_t usage;/* The current usage of the resource being */ + /* counted */ + atomic_long_t limit
[RFC][PATCH][2/4] Add RSS accounting and control (
out_nomap; + goto out_nomap_uncharge; } /* The page isn't present yet, go ahead with the fault. */ @@ -2068,6 +2084,8 @@ unlock: pte_unmap_unlock(page_table, ptl); out: return ret; +out_nomap_uncharge: + memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); out_nomap: pte_unmap_unlock(page_table, ptl); unlock_page(page); @@ -2092,6 +2110,9 @@ static int do_anonymous_page(struct mm_s /* Allocate our own private page. */ pte_unmap(page_table); + if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT)) + goto oom; + if (unlikely(anon_vma_prepare(vma))) goto oom; page = alloc_zeroed_user_highpage(vma, address); @@ -2108,6 +2129,8 @@ static int do_anonymous_page(struct mm_s lru_cache_add_active(page); page_add_new_anon_rmap(page, vma, address); } else { + memcontrol_update_rss(mm, 1, MEMCONTROL_DONT_CHECK_LIMIT); + /* Map the ZERO_PAGE - vm_page_prot is readonly */ page = ZERO_PAGE(address); page_cache_get(page); @@ -2218,6 +2241,9 @@ retry: } } + if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT)) + goto oom; + page_table = pte_offset_map_lock(mm, pmd, address, ptl); /* * For a file-backed vma, someone could have truncated or otherwise @@ -2227,6 +2253,7 @@ retry: if (mapping unlikely(sequence != mapping-truncate_count)) { pte_unmap_unlock(page_table, ptl); page_cache_release(new_page); + memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); cond_resched(); sequence = mapping-truncate_count; smp_rmb(); @@ -2265,6 +2292,7 @@ retry: } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); + memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); goto unlock; } diff -puN mm/rmap.c~memcontrol-acct mm/rmap.c --- linux-2.6.20/mm/rmap.c~memcontrol-acct 2007-02-24 19:39:29.0 +0530 +++ linux-2.6.20-balbir/mm/rmap.c 2007-02-24 19:39:29.0 +0530 @@ -602,6 +602,11 @@ void page_remove_rmap(struct page *page, __dec_zone_page_state(page, PageAnon(page) ? 
NR_ANON_PAGES : NR_FILE_MAPPED); } + /* +* When we pass MEMCONTROL_DONT_CHECK_LIMIT, it is ok to call +* this function under the pte lock (since we will not block in reclaim) +*/ + memcontrol_update_rss(vma->vm_mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); } /* diff -puN mm/swapfile.c~memcontrol-acct mm/swapfile.c --- linux-2.6.20/mm/swapfile.c~memcontrol-acct 2007-02-24 19:39:29.0 +0530 +++ linux-2.6.20-balbir/mm/swapfile.c 2007-02-24 19:39:29.0 +0530 @@ -27,6 +27,7 @@ #include <linux/mutex.h> #include <linux/capability.h> #include <linux/syscalls.h> +#include <linux/memcontrol.h> #include <asm/pgtable.h> #include <asm/tlbflush.h> @@ -514,6 +515,7 @@ static void unuse_pte(struct vm_area_str set_pte_at(vma->vm_mm, addr, pte, pte_mkold(mk_pte(page, vma->vm_page_prot))); page_add_anon_rmap(page, vma, addr); + memcontrol_update_rss(vma->vm_mm, 1, MEMCONTROL_DONT_CHECK_LIMIT); swap_free(entry); /* * Move the page to the active list so it is not _ -- Warm Regards, Balbir Singh
[RFC][PATCH][3/4] Add reclaim support (
) { zone->nr_scan_active += (zone->nr_active >> prio) + 1; - if (zone->nr_scan_active >= nr_pages || pass > 3) { + if (zone->nr_scan_active >= nr_pages || pass > max_pass) { zone->nr_scan_active = 0; nr_to_scan = min(nr_pages, zone->nr_active); shrink_active_list(nr_to_scan, zone, sc, prio); @@ -1394,7 +1431,7 @@ static unsigned long shrink_all_zones(un } zone->nr_scan_inactive += (zone->nr_inactive >> prio) + 1; - if (zone->nr_scan_inactive >= nr_pages || pass > 3) { + if (zone->nr_scan_inactive >= nr_pages || pass > max_pass) { zone->nr_scan_inactive = 0; nr_to_scan = min(nr_pages, zone->nr_inactive); ret += shrink_inactive_list(nr_to_scan, zone, sc); @@ -1405,7 +1442,9 @@ static unsigned long shrink_all_zones(un return ret; } +#endif +#ifdef CONFIG_PM static unsigned long count_lru_pages(void) { struct zone *zone; @@ -1477,7 +1516,7 @@ unsigned long shrink_all_memory(unsigned unsigned long nr_to_scan = nr_pages - ret; sc.nr_scanned = 0; - ret += shrink_all_zones(nr_to_scan, prio, pass, &sc); + ret += shrink_all_zones(nr_to_scan, prio, pass, 3, &sc); if (ret >= nr_pages) goto out; @@ -1512,6 +1551,57 @@ out: } #endif +#ifdef CONFIG_CONTAINER_MEMCONTROL +/* + * Try to free `nr_pages' of memory, system-wide, and return the number of + * freed pages.
+ * Modelled after shrink_all_memory() + */ +unsigned long memcontrol_shrink_mapped_memory(unsigned long nr_pages, + struct container *container) +{ + unsigned long ret = 0; + int pass; + unsigned long nr_total_scanned = 0; + + struct scan_control sc = { + .gfp_mask = GFP_KERNEL, + .may_swap = 0, + .swap_cluster_max = nr_pages, + .may_writepage = 1, + .container = container, + .may_swap = 1, + .swappiness = 100, + }; + + /* +* We try to shrink LRUs in 3 passes: +* 0 = Reclaim from inactive_list only +* 1 = Reclaim mapped (normal reclaim) +* 2 = 2nd pass of type 1 +*/ + for (pass = 0; pass < 3; pass++) { + int prio; + + for (prio = DEF_PRIORITY; prio >= 0; prio--) { + unsigned long nr_to_scan = nr_pages - ret; + + sc.nr_scanned = 0; + ret += shrink_all_zones(nr_to_scan, prio, + pass, 1, &sc); + if (ret >= nr_pages) + goto out; + + nr_total_scanned += sc.nr_scanned; + if (sc.nr_scanned && prio < DEF_PRIORITY - 2) + congestion_wait(WRITE, HZ / 10); + } + } +out: + return ret; +} +#endif + /* It's optimal to keep kswapds on the same CPUs as their memory, but not required for correctness.
So if the last cpu in a node goes away, we get changed to run anywhere: as the first one comes back, diff -puN include/linux/mm_types.h~memcontrol-reclaim-on-limit include/linux/mm_types.h diff -puN include/linux/list.h~memcontrol-reclaim-on-limit include/linux/list.h --- linux-2.6.20/include/linux/list.h~memcontrol-reclaim-on-limit 2007-02-24 19:40:56.0 +0530 +++ linux-2.6.20-balbir/include/linux/list.h 2007-02-24 19:40:56.0 +0530 @@ -343,6 +343,32 @@ static inline void list_splice(struct li __list_splice(list, head); } +static inline void __list_splice_tail(struct list_head *list, + struct list_head *head) +{ + struct list_head *first = list->next; + struct list_head *last = list->prev; + struct list_head *at = head->prev; + + first->prev = at; + at->next = first; + + last->next = head; + head->prev = last; +} + +/** + * list_splice_tail - join two lists, @list goes to the end (at head->prev) + * @list: the new list to add. + * @head: the place to add it in the first list. + */ +static inline void list_splice_tail(struct list_head *list, + struct list_head *head) +{ + if (!list_empty(list)) + __list_splice_tail(list, head); +} + /** * list_splice_init - join two lists and reinitialise the emptied list. * @list: the new list to add. _ -- Warm Regards, Balbir Singh
Memcontrol patchset (was Re: [RFC][PATCH][0/4] Memory controller (RSS Control) ()
Hi, My script could not parse the "(v2)" suffix and posted the patches with the subject followed by "(" instead. I apologize, Balbir Singh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH][1/4] RSS controller setup (v2)
Changelog 1. Change the name from memctlr to memcontrol 2. Coding style changes: call the API and then check the return value (for kmalloc). 3. Change the output format to print sizes in both pages and kB 4. Split the usage and limit files to be independent (cat memcontrol_usage no longer prints the limit) TODO's 1. Implement an error handling mechanism for handling container_add_file() failures (this would depend on the containers code). This patch sets up the basic controller infrastructure on top of the containers infrastructure. Two files are provided for monitoring and control: memcontrol_usage and memcontrol_limit. memcontrol_usage shows the current usage (in pages, of RSS) and the limit set by the user. memcontrol_limit can be used to set a limit on the RSS usage of the resource. A special value of 0 indicates that the usage is unlimited. The limit is set in units of pages. Signed-off-by: [EMAIL PROTECTED] --- include/linux/memcontrol.h | 33 +++ init/Kconfig | 7 + mm/Makefile | 1 mm/memcontrol.c | 193 + 4 files changed, 234 insertions(+) diff -puN /dev/null include/linux/memcontrol.h --- /dev/null 2007-02-02 22:51:23.0 +0530 +++ linux-2.6.20-balbir/include/linux/memcontrol.h 2007-02-24 19:39:03.0 +0530 @@ -0,0 +1,33 @@ +/* + * memcontrol.h - Memory Controller for containers + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ * + * © Copyright IBM Corporation, 2006-2007 + * + * Author: Balbir Singh [EMAIL PROTECTED] + * + */ + +#ifndef _LINUX_MEMCONTROL_H +#define _LINUX_MEMCONTROL_H + +#ifdef CONFIG_CONTAINER_MEMCONTROL +#ifndef kB +#define kB 1024/* One Kilo Byte */ +#endif + +#else /* CONFIG_CONTAINER_MEMCONTROL */ + +#endif /* CONFIG_CONTAINER_MEMCONTROL */ +#endif /* _LINUX_MEMCONTROL_H */ diff -puN init/Kconfig~memcontrol-setup init/Kconfig --- linux-2.6.20/init/Kconfig~memcontrol-setup 2007-02-20 21:01:28.0 +0530 +++ linux-2.6.20-balbir/init/Kconfig2007-02-20 21:01:28.0 +0530 @@ -306,6 +306,13 @@ config CONTAINER_NS for instance virtual servers and checkpoint/restart jobs. +config CONTAINER_MEMCONTROL + bool A simple RSS based memory controller + select CONTAINERS + help + Provides a simple Resource Controller for monitoring and + controlling the total Resident Set Size of the tasks in a container + config RELAY bool Kernel-user space relay support (formerly relayfs) help diff -puN mm/Makefile~memcontrol-setup mm/Makefile --- linux-2.6.20/mm/Makefile~memcontrol-setup 2007-02-20 21:01:28.0 +0530 +++ linux-2.6.20-balbir/mm/Makefile 2007-02-20 21:01:28.0 +0530 @@ -29,3 +29,4 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o +obj-$(CONFIG_CONTAINER_MEMCONTROL) += memcontrol.o diff -puN /dev/null mm/memcontrol.c --- /dev/null 2007-02-02 22:51:23.0 +0530 +++ linux-2.6.20-balbir/mm/memcontrol.c 2007-02-24 19:39:24.0 +0530 @@ -0,0 +1,193 @@ +/* + * memcontrol.c - Memory Controller for containers + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * © Copyright IBM Corporation, 2006-2007 + * + * Author: Balbir Singh <[EMAIL PROTECTED]> + * + */ + +#include <linux/init.h> +#include <linux/parser.h> +#include <linux/fs.h> +#include <linux/container.h> +#include <linux/memcontrol.h> + +#include <asm/uaccess.h> + +#define RES_USAGE_NO_LIMIT 0 +static const char version[] = "0.1"; + +struct res_counter { + atomic_long_t usage; /* The current usage of the resource being */ + /* counted */ + atomic_long_t limit
[RFC][PATCH][4/4] RSS controller documentation (v2)
--- Signed-off-by: [EMAIL PROTECTED] --- Documentation/memctlr.txt | 70 ++ 1 file changed, 70 insertions(+) diff -puN /dev/null Documentation/memctlr.txt --- /dev/null 2007-02-02 22:51:23.0 +0530 +++ linux-2.6.20-balbir/Documentation/memctlr.txt 2007-02-24 19:41:23.0 +0530 @@ -0,0 +1,70 @@ +Introduction + + +The memory controller is a controller module written under the containers +framework. It can be used to limit the resource usage of a group of +tasks grouped by the container. + +Accounting +-- + +The memory controller tracks the RSS usage of the tasks in the container. +The definition of RSS was debated on lkml in the following thread: + + http://lkml.org/lkml/2006/10/10/130 + +This patch is flexible; it is easy to adapt the patch to any definition +of RSS. The current accounting is based on the current definition of +RSS. Each page mapped is charged to the container. + +The accounting is done at two levels: each process has RSS accounting in +the mm_struct and in the container it belongs to. The mm_struct accounting +is used when a task switches (migrates to a different) container. The +accounting information for the task is subtracted from the source container +and added to the destination container. If, as a result of the migration, the +destination container goes over its limit, no action is taken until some task +in the destination container runs and tries to map a new page in its +page table. + +The current RSS usage can be seen in the memcontrol_usage file. The value +is in units of pages. + +Control +--- + +The memcontrol_limit file allows the user to set a limit on the number of +pages that can be mapped by the processes in the container. A special +value of 0 (which is the default limit of any new container) indicates +that the container can use an unlimited amount of RSS. + +Reclaim +--- + +When the limit set in the container is hit, the memory controller starts +reclaiming pages belonging to the container (simulating a local LRU in +some sense).
isolate_lru_pages() has been modified to isolate lru +pages belonging to a specific container. Parallel reclaims on the same +container are not allowed; other tasks end up waiting for any existing +reclaim to finish. + +The reclaim code uses two internal knobs, retries and pushback. pushback +specifies the percentage of memory to be reclaimed when the container goes +over its limit. The retries knob controls how many times reclaim is retried +before the task is killed (because reclaim failed). + +Shared pages are treated specially during reclaim. They are not force +reclaimed; they are only unmapped from containers which are over their limit. +This ensures that other containers do not pay a penalty for a shared +page being reclaimed when a particular container goes over its limit. + +NOTE: All limits are hard limits. + +Future Plans + + +The current controller implements only RSS control. It is planned to add +the following components: + +1. Page Cache control +2. mlock'ed memory control +3. kernel memory allocation control (memory allocated on behalf of a task) _ -- Warm Regards, Balbir Singh
[RFC][PATCH][3/4] Add reclaim support (v2)
) { zone->nr_scan_active += (zone->nr_active >> prio) + 1; - if (zone->nr_scan_active >= nr_pages || pass > 3) { + if (zone->nr_scan_active >= nr_pages || pass > max_pass) { zone->nr_scan_active = 0; nr_to_scan = min(nr_pages, zone->nr_active); shrink_active_list(nr_to_scan, zone, sc, prio); @@ -1394,7 +1431,7 @@ static unsigned long shrink_all_zones(un } zone->nr_scan_inactive += (zone->nr_inactive >> prio) + 1; - if (zone->nr_scan_inactive >= nr_pages || pass > 3) { + if (zone->nr_scan_inactive >= nr_pages || pass > max_pass) { zone->nr_scan_inactive = 0; nr_to_scan = min(nr_pages, zone->nr_inactive); ret += shrink_inactive_list(nr_to_scan, zone, sc); @@ -1405,7 +1442,9 @@ static unsigned long shrink_all_zones(un return ret; } +#endif +#ifdef CONFIG_PM static unsigned long count_lru_pages(void) { struct zone *zone; @@ -1477,7 +1516,7 @@ unsigned long shrink_all_memory(unsigned unsigned long nr_to_scan = nr_pages - ret; sc.nr_scanned = 0; - ret += shrink_all_zones(nr_to_scan, prio, pass, &sc); + ret += shrink_all_zones(nr_to_scan, prio, pass, 3, &sc); if (ret >= nr_pages) goto out; @@ -1512,6 +1551,57 @@ out: } #endif +#ifdef CONFIG_CONTAINER_MEMCONTROL +/* + * Try to free `nr_pages' of memory, system-wide, and return the number of + * freed pages.
+ * Modelled after shrink_all_memory() + */ +unsigned long memcontrol_shrink_mapped_memory(unsigned long nr_pages, + struct container *container) +{ + unsigned long ret = 0; + int pass; + unsigned long nr_total_scanned = 0; + + struct scan_control sc = { + .gfp_mask = GFP_KERNEL, + .swap_cluster_max = nr_pages, + .may_writepage = 1, + .container = container, + .may_swap = 1, + .swappiness = 100, + }; + + /* + * We try to shrink LRUs in 3 passes: + * 0 = Reclaim from inactive_list only + * 1 = Reclaim mapped (normal reclaim) + * 2 = 2nd pass of type 1 + */ + for (pass = 0; pass < 3; pass++) { + int prio; + + for (prio = DEF_PRIORITY; prio >= 0; prio--) { + unsigned long nr_to_scan = nr_pages - ret; + + sc.nr_scanned = 0; + ret += shrink_all_zones(nr_to_scan, prio, + pass, 1, &sc); + if (ret >= nr_pages) + goto out; + + nr_total_scanned += sc.nr_scanned; + if (sc.nr_scanned && prio < DEF_PRIORITY - 2) + congestion_wait(WRITE, HZ / 10); + } + } +out: + return ret; +} +#endif + /* It's optimal to keep kswapds on the same CPUs as their memory, but not required for correctness.
So if the last cpu in a node goes away, we get changed to run anywhere: as the first one comes back, diff -puN include/linux/mm_types.h~memcontrol-reclaim-on-limit include/linux/mm_types.h diff -puN include/linux/list.h~memcontrol-reclaim-on-limit include/linux/list.h --- linux-2.6.20/include/linux/list.h~memcontrol-reclaim-on-limit 2007-02-24 19:40:56.0 +0530 +++ linux-2.6.20-balbir/include/linux/list.h 2007-02-24 19:40:56.0 +0530 @@ -343,6 +343,32 @@ static inline void list_splice(struct li __list_splice(list, head); } +static inline void __list_splice_tail(struct list_head *list, + struct list_head *head) +{ + struct list_head *first = list->next; + struct list_head *last = list->prev; + struct list_head *at = head->prev; + + first->prev = at; + at->next = first; + + last->next = head; + head->prev = last; +} + +/** + * list_splice_tail - join two lists, @list goes to the end (at head->prev) + * @list: the new list to add. + * @head: the place to add it in the first list. + */ +static inline void list_splice_tail(struct list_head *list, + struct list_head *head) +{ + if (!list_empty(list)) + __list_splice_tail(list, head); +} + /** * list_splice_init - join two lists and reinitialise the emptied list. * @list: the new list to add. _ -- Warm Regards, Balbir Singh
[RFC][PATCH][0/4] Memory controller (RSS Control) (v2)
This is a repost of the patches at http://lkml.org/lkml/2007/2/24/65 The previous post had a misleading subject which ended with a "(". This patch applies on top of Paul Menage's container patches (V7) posted at http://lkml.org/lkml/2007/2/12/88 It implements a controller within the containers framework for limiting memory usage (RSS usage). The memory controller was discussed at length in the RFC posted to lkml http://lkml.org/lkml/2006/10/30/51 This is version 2 of the patch; version 1 was posted at http://lkml.org/lkml/2007/2/19/10 I have tried to incorporate all comments; more details can be found in the changelogs of individual patches. Any remaining mistakes are all my fault. The next question could be: why release version 2? 1. It serves as a decision point to decide if we should move to a per-container LRU list. Walking through the global LRU is slow; in this patchset I've tried to address the LRU churning issue. The patch memcontrol-reclaim-on-limit has more details 2. I've included fixes for several of the comments/issues raised in version 1 Steps to use the controller -- 0. Download the patches, apply the patches 1. Turn on CONFIG_CONTAINER_MEMCONTROL in kernel config, build the kernel and boot into the new kernel 2. mount -t container container -o memcontrol /<mount point> 3. cd /<mount point>; optionally do (mkdir directory; cd directory) under /<mount point> 4. echo $$ > tasks (attaches the current shell to the container) 5. echo -n (limit value) > memcontrol_limit 6. cat memcontrol_usage 7. Run tasks, check the usage of the controller, reclaim behaviour 8. Report bugs, get bug fixes and iterate (goto step 0). Advantages of the patchset -- 1. Zero overhead in struct page (struct page is not expanded) 2. Minimal changes to the core-mm code 3. Shared pages are not reclaimed unless all mappings belong to overlimit containers. 4.
It can be used to debug drivers/applications/kernel components in a constrained memory environment (similar to the mem=XXX boot option), except that several containers can be created simultaneously without rebooting and the limits can be changed. NOTE: There is no support for limiting kernel memory allocations and page cache control (presently). Testing --- Created containers, attached tasks to containers with lower limits than the memory the tasks require (memory hog tests) and ran some basic tests on them. Tested the patches on UML and PowerPC. On UML, tried the patches with the config enabled and disabled (sanity check) and with containers enabled but the memory controller disabled. TODO's and improvement areas 1. Come up with cool page replacement algorithms for containers - still holds good (if possible without any changes to struct page) 2. Add page cache control 3. Add kernel memory allocator control 4. Extract benchmark numbers and overhead data Comments and criticism are welcome. Series -- memcontrol-setup.patch memcontrol-acct.patch memcontrol-reclaim-on-limit.patch memcontrol-doc.patch -- Warm Regards, Balbir Singh
[RFC][PATCH][2/4] Add RSS accounting and control (v2)
out_nomap; + goto out_nomap_uncharge; } /* The page isn't present yet, go ahead with the fault. */ @@ -2068,6 +2084,8 @@ unlock: pte_unmap_unlock(page_table, ptl); out: return ret; +out_nomap_uncharge: + memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); out_nomap: pte_unmap_unlock(page_table, ptl); unlock_page(page); @@ -2092,6 +2110,9 @@ static int do_anonymous_page(struct mm_s /* Allocate our own private page. */ pte_unmap(page_table); + if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT)) + goto oom; + if (unlikely(anon_vma_prepare(vma))) goto oom; page = alloc_zeroed_user_highpage(vma, address); @@ -2108,6 +2129,8 @@ static int do_anonymous_page(struct mm_s lru_cache_add_active(page); page_add_new_anon_rmap(page, vma, address); } else { + memcontrol_update_rss(mm, 1, MEMCONTROL_DONT_CHECK_LIMIT); + /* Map the ZERO_PAGE - vm_page_prot is readonly */ page = ZERO_PAGE(address); page_cache_get(page); @@ -2218,6 +2241,9 @@ retry: } } + if (memcontrol_update_rss(mm, 1, MEMCONTROL_CHECK_LIMIT)) + goto oom; + page_table = pte_offset_map_lock(mm, pmd, address, ptl); /* * For a file-backed vma, someone could have truncated or otherwise @@ -2227,6 +2253,7 @@ retry: if (mapping unlikely(sequence != mapping-truncate_count)) { pte_unmap_unlock(page_table, ptl); page_cache_release(new_page); + memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); cond_resched(); sequence = mapping-truncate_count; smp_rmb(); @@ -2265,6 +2292,7 @@ retry: } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); + memcontrol_update_rss(mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); goto unlock; } diff -puN mm/rmap.c~memcontrol-acct mm/rmap.c --- linux-2.6.20/mm/rmap.c~memcontrol-acct 2007-02-24 19:39:29.0 +0530 +++ linux-2.6.20-balbir/mm/rmap.c 2007-02-24 19:39:29.0 +0530 @@ -602,6 +602,11 @@ void page_remove_rmap(struct page *page, __dec_zone_page_state(page, PageAnon(page) ? 
NR_ANON_PAGES : NR_FILE_MAPPED); } + /* + * When we pass MEMCONTROL_DONT_CHECK_LIMIT, it is ok to call + * this function under the pte lock (since we will not block in reclaim) + */ + memcontrol_update_rss(vma->vm_mm, -1, MEMCONTROL_DONT_CHECK_LIMIT); } /* diff -puN mm/swapfile.c~memcontrol-acct mm/swapfile.c --- linux-2.6.20/mm/swapfile.c~memcontrol-acct 2007-02-24 19:39:29.0 +0530 +++ linux-2.6.20-balbir/mm/swapfile.c 2007-02-24 19:39:29.0 +0530 @@ -27,6 +27,7 @@ #include <linux/mutex.h> #include <linux/capability.h> #include <linux/syscalls.h> +#include <linux/memcontrol.h> #include <asm/pgtable.h> #include <asm/tlbflush.h> @@ -514,6 +515,7 @@ static void unuse_pte(struct vm_area_str set_pte_at(vma->vm_mm, addr, pte, pte_mkold(mk_pte(page, vma->vm_page_prot))); page_add_anon_rmap(page, vma, addr); + memcontrol_update_rss(vma->vm_mm, 1, MEMCONTROL_DONT_CHECK_LIMIT); swap_free(entry); /* * Move the page to the active list so it is not _ -- Warm Regards, Balbir Singh
Re: The performance and behaviour of the anti-fragmentation related patches
Andrew Morton wrote: So some urgent questions are: how are we going to do mem hotunplug and per-container RSS? Our basic unit of memory management is the zone. Right now, a zone maps onto some hardware-imposed thing. But the zone-based MM works *well*. I suspect that a good way to solve both per-container RSS and mem hotunplug is to split the zone concept away from its hardware limitations: create a software zone and a hardware zone. All the existing page allocator and reclaim code remains basically unchanged, and it operates on software zones. Each software zone always lies within a single hardware zone. The software zones are resizeable. For per-container RSS we give each container one (or perhaps multiple) resizeable software zones. For memory hotunplug, some of the hardware zone's software zones are marked reclaimable and some are not; DIMMs which are wholly within reclaimable zones can be depopulated and powered off or removed. NUMA and cpusets screwed up: they've gone and used nodes as their basic unit of memory management whereas they should have used zones. This will need to be untangled. Anyway, that's just a shot in the dark. Could be that we implement unplug and RSS control by totally different means. But I do wish that we'd sort out what those means will be before we potentially complicate the story a lot by adding antifragmentation. Paul Menage had suggested something very similar in response to the RFC for memory controllers I sent out, and it was suggested that we create small zones (roughly 64 MB) to avoid the issue of a zone/node not being shareable across containers. Even with a small size, there are some issues. The following thread discusses the details: http://lkml.org/lkml/2006/10/30/120 RSS accounting is very easy (with minimal changes to the core mm); supplemented with an efficient per-container reclaimer, it should be easy to implement a good per-container RSS controller.
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: The performance and behaviour of the anti-fragmentation related patches
Linus Torvalds wrote: On Thu, 1 Mar 2007, Andrew Morton wrote: So some urgent questions are: how are we going to do mem hotunplug and per-container RSS? Also: how are we going to do this in virtualized environments? Usually the people who care about memory hotunplug are exactly the same people who also care (or claim to care, or _will_ care) about virtualization. My personal opinion is that while I'm not a huge fan of virtualization, these kinds of things really _can_ be handled more cleanly at that layer, and not in the kernel at all. Afaik, it's what IBM already does, and has been doing for a while. There's no shame in looking at what already works, especially if it's simpler. Could you please clarify what that layer means - is it the firmware/hardware for virtualization, or does it refer to user space? With virtualization the Linux kernel would end up acting as a hypervisor, and resource management support like per-container RSS support needs to be built into the kernel. It would also be useful to have a resource controller like per-container RSS control (container refers to a task grouping) within the kernel for non-virtualized environments as well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: The performance and behaviour of the anti-fragmentation related patches
Linus Torvalds wrote: On Fri, 2 Mar 2007, Balbir Singh wrote: My personal opinion is that while I'm not a huge fan of virtualization, these kinds of things really _can_ be handled more cleanly at that layer, and not in the kernel at all. Afaik, it's what IBM already does, and has been doing for a while. There's no shame in looking at what already works, especially if it's simpler. Could you please clarify as to what that layer means - is it the firmware/hardware for virtualization? or does it refer to user space? Virtualization in general. We don't know what it is - in IBM machines it's a hypervisor. With Xen and VMware, it's usually a hypervisor too. With KVM, it's obviously a host Linux kernel/user-process combination. Thanks for clarifying. The point being that in the guests, hotunplug is almost useless (for bigger ranges), and we're much better off just telling the virtualization hosts on a per-page level whether we care about a page or not, than to worry about fragmentation. And in hosts, we usually don't care EITHER, since it's usually done in a hypervisor. It would also be useful to have a resource controller like per-container RSS control (container refers to a task grouping) within the kernel or non-virtualized environments as well. .. but this has again no impact on anti-fragmentation. Yes, I agree that anti-fragmentation and resource management are independent of each other. I must admit to being a bit selfish here, in that my main interest is in resource management and we would love to see a well written and easy to understand resource management infrastructure and controllers to control CPU and memory usage. Since the issue of per-container RSS control came up, I wanted to ensure that we do not mix up resource control and anti-fragmentation. 
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [ckrm-tech] [PATCH 1/2] rcfs core patch
Srivatsa Vaddagiri wrote: Heavily based on Paul Menage's (in turn cpuset) work. The big difference is that the patch uses task->nsproxy to group tasks for resource control purposes (instead of task->containers). The patch retains the same user interface as Paul Menage's patches. In particular, you can have multiple hierarchies, each hierarchy giving a different composition/view of task-groups. (Ideally this patch should have been split into 2 or 3 sub-patches, but will do that on a subsequent version post) With this, don't we end up with a lot of duplication between cpusets and rcfs? Signed-off-by : Srivatsa Vaddagiri [EMAIL PROTECTED] Signed-off-by : Paul Menage [EMAIL PROTECTED] --- linux-2.6.20-vatsa/include/linux/init_task.h | 4 linux-2.6.20-vatsa/include/linux/nsproxy.h | 5 linux-2.6.20-vatsa/init/Kconfig | 22 linux-2.6.20-vatsa/init/main.c | 1 linux-2.6.20-vatsa/kernel/Makefile | 1 --- The diffstat does not look quite right. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH] [RSDL 1/6] lists: add list splice tail
Con Kolivas wrote: Add a list_splice_tail variant of list_splice. Patch-by: Peter Zijlstra [EMAIL PROTECTED] Signed-off-by: Con Kolivas [EMAIL PROTECTED] Acked-by: Balbir Singh [EMAIL PROTECTED] I had the same exact patch in my memcontrol at http://lkml.org/lkml/2007/2/24/68 (see the last two functions). -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 1/7] Resource counters
simple_read_from_buffer((void __user *)userbuf, nbytes, + pos, buf, s - buf); +} + +ssize_t res_counter_write(struct res_counter *cnt, int member, + const char __user *userbuf, size_t nbytes, loff_t *pos) +{ + int ret; + char *buf, *end; + unsigned long tmp, *val; + + buf = kmalloc(nbytes + 1, GFP_KERNEL); + ret = -ENOMEM; + if (buf == NULL) + goto out; + + buf[nbytes] = 0; + ret = -EFAULT; + if (copy_from_user(buf, userbuf, nbytes)) + goto out_free; + + ret = -EINVAL; + tmp = simple_strtoul(buf, &end, 10); + if (*end != '\0') + goto out_free; + + val = res_counter_member(cnt, member); + *val = tmp; + ret = nbytes; +out_free: + kfree(buf); +out: + return ret; +} These bits look a little out of sync, with no users for these routines in this patch. Won't you get a compiler warning when compiling this bit alone? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [RFC][PATCH 2/7] RSS controller core
-res.usage; +} + +static void rss_move_task(struct container_subsys *ss, + struct container *cont, + struct container *old_cont, + struct task_struct *p) +{ + struct mm_struct *mm; + struct rss_container *rss, *old_rss; + + mm = get_task_mm(p); + if (mm == NULL) + goto out; + + rss = rss_from_cont(cont); + old_rss = rss_from_cont(old_cont); + if (old_rss != mm->rss_container) + goto out_put; + + css_get_current(&rss->css); + rcu_assign_pointer(mm->rss_container, rss); + css_put(&old_rss->css); + I see that the charges are not migrated. Is that good? If a user could find a way of migrating his/her task from one container to another, it could create an issue with the user's task taking up a big chunk of the RSS limit. Can we migrate any task, or just the thread group leader? In my patches, I allowed migration of just the thread group leader. Imagine you have several threads: no matter which container they belong to, their mm gets charged (the usage will not show up in their container's usage). This could confuse the system administrator.
+out_put:
+	mmput(mm);
+out:
+	return;
+}
+
+static int rss_create(struct container_subsys *ss, struct container *cont)
+{
+	struct rss_container *rss;
+
+	rss = kzalloc(sizeof(struct rss_container), GFP_KERNEL);
+	if (rss == NULL)
+		return -ENOMEM;
+
+	res_counter_init(&rss->res);
+	INIT_LIST_HEAD(&rss->page_list);
+	cont->subsys[rss_subsys.subsys_id] = &rss->css;
+	return 0;
+}
+
+static void rss_destroy(struct container_subsys *ss,
+		struct container *cont)
+{
+	kfree(rss_from_cont(cont));
+}
+
+
+static ssize_t rss_read(struct container *cont, struct cftype *cft,
+		struct file *file, char __user *userbuf,
+		size_t nbytes, loff_t *ppos)
+{
+	return res_counter_read(&rss_from_cont(cont)->res, cft->private,
+			userbuf, nbytes, ppos);
+}
+
+static ssize_t rss_write(struct container *cont, struct cftype *cft,
+		struct file *file, const char __user *userbuf,
+		size_t nbytes, loff_t *ppos)
+{
+	return res_counter_write(&rss_from_cont(cont)->res, cft->private,
+			userbuf, nbytes, ppos);
+}
+
+
+static struct cftype rss_usage = {
+	.name = "rss_usage",
+	.private = RES_USAGE,
+	.read = rss_read,
+};
+
+static struct cftype rss_limit = {
+	.name = "rss_limit",
+	.private = RES_LIMIT,
+	.read = rss_read,
+	.write = rss_write,
+};
+
+static struct cftype rss_failcnt = {
+	.name = "rss_failcnt",
+	.private = RES_FAILCNT,
+	.read = rss_read,
+};
+
+static int rss_populate(struct container_subsys *ss,
+		struct container *cont)
+{
+	int rc;
+
+	if ((rc = container_add_file(cont, &rss_usage)) < 0)
+		return rc;
+	if ((rc = container_add_file(cont, &rss_failcnt)) < 0)
+		return rc;
+	if ((rc = container_add_file(cont, &rss_limit)) < 0)
+		return rc;
+
+	return 0;
+}
+
+static struct rss_container init_rss_container;
+
+static __init int rss_create_early(struct container_subsys *ss,
+		struct container *cont)
+{
+	struct rss_container *rss;
+
+	rss = &init_rss_container;
+	res_counter_init(&rss->res);
+	INIT_LIST_HEAD(&rss->page_list);
+	cont->subsys[rss_subsys.subsys_id] = &rss->css;
+	ss->create = rss_create;
+	return
0;
+}
+
+static struct container_subsys rss_subsys = {
+	.name = "rss",
+	.create = rss_create_early,
+	.destroy = rss_destroy,
+	.populate = rss_populate,
+	.attach = rss_move_task,
+};
+
+void __init container_rss_init_early(void)
+{
+	container_register_subsys(&rss_subsys);
+	init_mm.rss_container = rss_from_cont(
+			task_container(&init_task, &rss_subsys));
+	css_get_current(&init_mm.rss_container->css);
+}

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [RFC][PATCH 0/7] Resource controllers based on process containers
Pavel Emelianov wrote:

  This patchset adds RSS accounting and control, and limits the number of
  tasks and files within a container. It is based on top of Paul Menage's
  container subsystem v7.

  The RSS controller includes a per-container RSS accounter, reclamation
  and an OOM killer. It behaves like a standalone machine: when the
  container runs out of resources it tries to reclaim some pages, and if
  it doesn't succeed it kills some task whose mm_struct belongs to the
  container in question.

  The tasks and files containers are very simple and self-descriptive
  from the code.

  As discussed before, when a task moves from one container to another no
  resources follow it; they keep holding the container they were
  allocated in.

I have one problem with the patchset: I cannot compile the patches
individually, and some of the code is hard to read as it depends on
functions from future patches. Patches 2, 3 and 4 fail to compile without
patch 5 applied. Patch 1 failed to apply with a reject in kernel/Makefile;
I applied it on top of 2.6.20 with all of Paul Menage's patches (all 7).

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: controlling mmap()'d vs read/write() pages
Herbert Poetzl wrote:

  To me, one of the keys of Linux's global optimizations is being able to
  use any memory globally for its most effective purpose, globally
  (please ignore highmem :). Let's say I have a 1GB container on a
  machine that is at least 100% committed. I mmap() a 1GB file and touch
  the entire thing (I never touch it again). I then go open another 1GB
  file and r/w to it until the end of time. I'm at or below my RSS limit,
  but that 1GB of RAM could surely be better used for the second file.
  How do we do this if we only account for a user's RSS? Does this fit
  into Alan's unfair bucket? ;)

  what's the difference to a normal Linux system here? when low on
  memory, the system will reclaim pages, and guess what pages will be
  reclaimed first ...

But would it not bias application writers towards using read()/write()
calls over mmap()? They know that their calls are likely to be faster when
the application is run in a container. Without page cache control we'll
end up creating an asymmetrical container, where certain usage is charged
and some usage is not.

Also, please note that when a page is unmapped and moved to the swap
cache, the swap cache uses the page cache. Without page cache control, we
could end up with too many pages moving over to the swap cache and still
occupying memory, while the original intention was to avoid this scenario.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: Linux 2.6.24-rc7 Build-Failure at __you_cannot_kmalloc_that_much
Kamalesh Babulal wrote:
  Andrew Morton wrote:
    On Mon, 07 Jan 2008 16:06:20 +0530 Kamalesh Babulal [EMAIL PROTECTED] wrote:

      The defconfig make fails on x86_64 (AMD box) with the following error:

        CHK     include/linux/utsrelease.h
        CALL    scripts/checksyscalls.sh
        CHK     include/linux/compile.h
        GEN     .version
        CHK     include/linux/compile.h
        UPD     include/linux/compile.h
        CC      init/version.o
        LD      init/built-in.o
        LD      .tmp_vmlinux1
      drivers/built-in.o(.init.text+0x8d76): In function `dmi_id_init':
      : undefined reference to `__you_cannot_kmalloc_that_much'
      make: *** [.tmp_vmlinux1] Error 1

      # gcc --version
      gcc (GCC) 3.2.3 20030502 (Red Hat Linux 3.2.3-59)

      This was reported by Adrian Bunk: http://lkml.org/lkml/2007/12/1/39

    That's odd. afacit the only kmalloc in dmi_id_init() is

        dmi_dev = kzalloc(sizeof(*dmi_dev), GFP_KERNEL);

    and even gcc-3.2.3 should be able to get that right.

    Could you please a) verify that simply removing that line fixes the
    build error and then b) try to find some way of fixing it? Try
    replacing `sizeof(*dmi_dev)' with `sizeof(struct dmi_device_attribute)'
    and any other tricks you can think of to try to make the compiler
    process the code differently.

  Removing the line fixes the issue, but changing sizeof(*dmi_dev) to
  sizeof(struct device) is not helping.

Hi, Andrew,

We tried the following and generated stabs information. The size of struct
device is 560 bytes. We found that dead code was not being eliminated
(__you_cannot_kmalloc_that_much), even though no one called that function.
I suspect __builtin_constant_p() and dead code elimination as the root
causes of this error.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [LTP] Container Code Coverage for 2.6.23 mainline kernel
On Jan 9, 2008 12:52 PM, Rishikesh K. Rajak [EMAIL PROTECTED] wrote:

  Hi All,

  You can find the code coverage data for the container code which has
  been merged into mainline linux-2.6.23; the respective testcases are
  merged into LTP for the SYSVIPC NAMESPACE and UTS NAMESPACE features.
  I have generated the data on an s390x machine; more info below.

  Linux 2.6.23-gcov-autokern1 #1 SMP Wed Jan 9 00:27:11 EST 2008 s390x
  s390x s390x GNU/Linux

  Please let me know if you need more info.

  Thanks
  Rishi

Rishi,

Something is wrong with the HTML within the tarball, it still points to
files in /home/rishi/. Could you please fix that?

Thanks,
Balbir
Re: [LTP] Container Code Coverage for 2.6.23 mainline kernel
On Jan 9, 2008 2:45 PM, Subrata Modak [EMAIL PROTECTED] wrote:
  On Wed, 2008-01-09 at 14:38 +0530, Balbir Singh wrote:
    On Jan 9, 2008 12:52 PM, Rishikesh K. Rajak [EMAIL PROTECTED] wrote:

      Hi All,

      You can find the code coverage data for the container code which
      has been merged into mainline linux-2.6.23; the respective
      testcases are merged into LTP for the SYSVIPC NAMESPACE and UTS
      NAMESPACE features. I have generated the data on an s390x machine;
      more info below.

      Linux 2.6.23-gcov-autokern1 #1 SMP Wed Jan 9 00:27:11 EST 2008
      s390x s390x s390x GNU/Linux

      Please let me know if you need more info.

      Thanks
      Rishi

    Rishi,

    Something is wrong with the HTML within the tarball, it still points
    to files in /home/rishi/. Could you please fix that?

  Balbir, this HTML is just a "saved as" version of the original HTML.
  Rishi cannot tar and send the original HTML file, as it refers to
  hundreds of other files, which would be difficult to pack and send to
  the mailing list. So he saved the original HTML and posted just that.

  --Subrata

Subrata,

These files (or their data) are required to interpret the coverage data.
Details for each file and the lines covered will help developers and
testers know what is not covered by these test cases.

Balbir
Re: [patch 05/19] split LRU lists into anon file sets
* KAMEZAWA Hiroyuki [EMAIL PROTECTED] [2008-01-09 13:41:32]:

  I like this patch set, thank you.

  On Tue, 08 Jan 2008 15:59:44 -0500 Rik van Riel [EMAIL PROTECTED] wrote:

    Index: linux-2.6.24-rc6-mm1/mm/memcontrol.c
    ===
    --- linux-2.6.24-rc6-mm1.orig/mm/memcontrol.c	2008-01-07 11:55:09.0 -0500
    +++ linux-2.6.24-rc6-mm1/mm/memcontrol.c	2008-01-07 17:32:53.0 -0500

  <snip>

    -enum mem_cgroup_zstat_index {
    -	MEM_CGROUP_ZSTAT_ACTIVE,
    -	MEM_CGROUP_ZSTAT_INACTIVE,
    -
    -	NR_MEM_CGROUP_ZSTAT,
    -};
    -
     struct mem_cgroup_per_zone {
     	/*
     	 * spin_lock to protect the per cgroup LRU
     	 */
     	spinlock_t		lru_lock;
    -	struct list_head	active_list;
    -	struct list_head	inactive_list;
    -	unsigned long count[NR_MEM_CGROUP_ZSTAT];
    +	struct list_head	lists[NR_LRU_LISTS];
    +	unsigned long count[NR_LRU_LISTS];
     };
     /* Macro for accessing counter */
     #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])

    @@ -160,6 +152,7 @@ struct page_cgroup {
     };
     #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
     #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
    +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */

  Now, we don't have control_type and a feature for accounting only
  CACHE. Balbir-san, do you have some new plan?

Hi, KAMEZAWA-San,

The control_type feature is gone. We still have cached page accounting,
but we do not allow control of only RSS pages anymore; we need to control
both RSS and cached pages. I do not understand your question about a new
plan. Is it about adding back control_type?

  BTW, is it better to use PageSwapBacked(pc->page) rather than adding a
  new flag PAGE_CGROUP_FLAG_FILE? PAGE_CGROUP_FLAG_ACTIVE is used because
  global reclaim can change the ACTIVE/INACTIVE attribute without
  accessing the memory cgroup. (Then, we cannot trust
  PageActive(pc->page).)

Yes, correct. A page active on the node's zone LRU need not be active in
the memory cgroup.

  Can the ANON/FILE attribute be changed dynamically (after the page is
  added to the LRU)? If not, using page_file_cache(pc->page) will be
  easy.
  Thanks,
  -Kame

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [patch 05/19] split LRU lists into anon file sets
* KAMEZAWA Hiroyuki [EMAIL PROTECTED] [2008-01-10 11:36:18]:

  On Thu, 10 Jan 2008 07:51:33 +0530 Balbir Singh [EMAIL PROTECTED] wrote:

      #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
      #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
      +#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */

      Now, we don't have control_type and a feature for accounting only
      CACHE. Balbir-san, do you have some new plan?

    Hi, KAMEZAWA-San,

    The control_type feature is gone. We still have cached page
    accounting, but we do not allow control of only RSS pages anymore;
    we need to control both RSS and cached pages. I do not understand
    your question about a new plan. Is it about adding back
    control_type?

  Ah, just wanted to confirm that we can drop PAGE_CGROUP_FLAG_CACHE if
  the page_file_cache() function and the split LRU are introduced.

Earlier we would have had a problem, since we even accounted for swap
cache with PAGE_CGROUP_FLAG_CACHE, and I think page_file_cache() does not
treat swap cache pages as file cache. Our accounting is based on mapped
vs. unmapped, whereas the new code from Rik accounts file vs. anonymous. I
suspect we could live a little while longer with PAGE_CGROUP_FLAG_CACHE,
and then if we do not need it at all, we can mark it down for removal.
What do you think?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [patch 00/19] VM pageout scalability improvements
* Rik van Riel [EMAIL PROTECTED] [2008-01-08 15:59:39]:

  On large memory systems, the VM can spend way too much time scanning
  through pages that it cannot (or should not) evict from memory. Not
  only does it use up CPU time, but it also provokes lock contention and
  can leave large systems under memory pressure in a catatonic state.

  Against 2.6.24-rc6-mm1

  This patch series improves VM scalability by:

  1) making the locking a little more scalable

  2) putting filesystem backed, swap backed and non-reclaimable pages
     onto their own LRUs, so the system only scans the pages that it
     can/should evict from memory

  3) switching to SEQ replacement for the anonymous LRUs, so the number
     of pages that need to be scanned when the system starts swapping is
     bound to a reasonable number

  More info on the overall design can be found at:
  http://linux-mm.org/PageReplacementDesign

  Changelog:
  - merge memcontroller split LRU code into the main split LRU patch,
    since it is not functionally different (it was split up only to help
    people who had seen the last version of the patch series review it)
  - drop the page_file_cache debugging patch, since it never triggered
  - reintroduce code to not scan the anon list if swap is full
  - add code to scan the anon list if the page cache is very small already
  - use lumpy reclaim more aggressively for smaller order > 1 allocations

Hi, Rik,

I've just started on the patch series; the compile fails for me on a
powerpc box. global_lru_pages() is defined under CONFIG_PM, but used
elsewhere in mm/page-writeback.c. None of the global_lru_pages()
parameters depend on CONFIG_PM. Here's a simple patch to fix it.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b14e188..39e6aef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1920,6 +1920,14 @@ void wakeup_kswapd(struct zone *zone, int order)
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }

+unsigned long global_lru_pages(void)
+{
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
+}
+
 #ifdef CONFIG_PM
 /*
  * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
@@ -1968,14 +1976,6 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
 	return ret;
 }

-unsigned long global_lru_pages(void)
-{
-	return global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_FILE);
-}
-
 /*
  * Try to free `nr_pages' of memory, system-wide, and return the number of
  * freed pages.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [patch 00/19] VM pageout scalability improvements
* Rik van Riel [EMAIL PROTECTED] [2008-01-08 15:59:39]:

  Changelog:
  - merge memcontroller split LRU code into the main split LRU patch,
    since it is not functionally different (it was split up only to help
    people who had seen the last version of the patch series review it)

Hi, Rik,

I see strange behaviour with this patchset. I have a program (pagetest,
from Vaidy) that does the following:

1. Can allocate different kinds of memory: mapped, malloc'ed or shared
2. Allocates and touches all the memory in a loop (2 times)

I mount the memory controller, limit it to 400M, and run pagetest asking
it to touch 1000M. Without this patchset everything runs fine, but with
this patchset installed I immediately see

pagetest invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Call Trace:
[c000e5aef400] [c000eb24] .show_stack+0x70/0x1bc (unreliable)
[c000e5aef4b0] [c00c] .oom_kill_process+0x80/0x260
[c000e5aef570] [c00bc498] .mem_cgroup_out_of_memory+0x6c/0x98
[c000e5aef610] [c00f2574] .mem_cgroup_charge_common+0x1e0/0x414
[c000e5aef6e0] [c00b852c] .add_to_page_cache+0x48/0x164
[c000e5aef780] [c00b8664] .add_to_page_cache_lru+0x1c/0x68
[c000e5aef810] [c012db50] .mpage_readpages+0xbc/0x15c
[c000e5aef940] [c018bdac] .ext3_readpages+0x28/0x40
[c000e5aef9c0] [c00c3978] .__do_page_cache_readahead+0x158/0x260
[c000e5aefa90] [c00bac44] .filemap_fault+0x18c/0x3d4
[c000e5aefb70] [c00cd510] .__do_fault+0xb0/0x588
[c000e5aefc80] [c05653cc] .do_page_fault+0x440/0x620
[c000e5aefe30] [c0005408] handle_page_fault+0x20/0x58
Mem-info:
Node 0 DMA per-cpu:
CPU0: hi: 6, btch: 1 usd: 4
CPU1: hi: 6, btch: 1 usd: 0
CPU2: hi: 6, btch: 1 usd: 3
CPU3: hi: 6, btch: 1 usd: 4
active_anon:9099 active_file:1523 inactive_anon:0 inactive_file:2869
noreclaim:0 dirty:20 writeback:0 unstable:0 free:44210 slab:639
mapped:1724 pagetables:475 bounce:0
Node 0 DMA free:2829440kB min:7808kB low:9728kB high:11712kB
active_anon:582336kB inactive_anon:0kB active_file:97472kB
inactive_file:183616kB noreclaim:0kB
present:3813760kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 3*64kB 5*128kB 5*256kB 4*512kB 2*1024kB 4*2048kB 3*4096kB
2*8192kB 170*16384kB = 2828352kB
Swap cache: add 0, delete 0, find 0/0
Free swap  = 3148608kB
Total swap = 3148608kB
Free swap:       3148608kB
59648 pages of RAM
677 reserved pages
28165 pages shared
0 pages swap cached
Memory cgroup out of memory: kill process 6593 (pagetest) score 1003 or a child
Killed process 6593 (pagetest)

I am using a powerpc box with a 64K page size. I'll try to investigate
further; this is just a heads-up on the failure I am seeing.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [RFC][PATCH] per-task I/O throttling
On Jan 11, 2008 4:15 AM, Andrea Righi [EMAIL PROTECTED] wrote:

  Allow limiting the bandwidth of I/O-intensive processes, like backup
  tools running in the background, large file copies, checksums on huge
  files, etc. These kinds of processes can noticeably impact system
  responsiveness for some time, and playing with task priorities is not
  always an acceptable solution.

  This patch allows specifying a maximum I/O rate in sectors per second
  for each single process via /proc/PID/io_throttle (the default is
  zero, which means no limit).

  Signed-off-by: Andrea Righi [EMAIL PROTECTED]

Hi, Andrea,

We have been thinking of doing control-group-based I/O control. I have
not reviewed your patch in detail, but I can suggest looking at OpenVZ's
I/O controller. I/O bandwidth control is definitely interesting. How did
you test your solution?

Balbir
Re: [patch 18/19] account mlocked pages
* Rik van Riel [EMAIL PROTECTED] [2008-01-08 15:59:57]:

The following patch is required to compile the code with CONFIG_NORECLAIM
enabled and CONFIG_NORECLAIM_MLOCK disabled.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c8ccf8f..fb08ee8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -88,6 +88,8 @@ enum zone_stat_item {
 	NR_NORECLAIM,	/* */
 #ifdef CONFIG_NORECLAIM_MLOCK
 	NR_MLOCK,	/* mlock()ed pages found and moved off LRU */
+#else
+	NR_MLOCK=NR_ACTIVE_FILE, /* avoid compiler errors... */
 #endif
 #else
 	NR_NORECLAIM=NR_ACTIVE_FILE, /* avoid compiler errors in dead code */

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [RFC][PATCH] per-task I/O throttling
* Peter Zijlstra [EMAIL PROTECTED] [2008-01-12 10:46:37]:

  On Fri, 2008-01-11 at 23:57 -0500, [EMAIL PROTECTED] wrote:
    On Fri, 11 Jan 2008 17:32:49 +0100, Andrea Righi said:

      The interesting feature is that it allows setting a priority for
      each process container, but AFAIK it doesn't allow partitioning
      the bandwidth between different containers (which would be a nice
      feature IMHO). For example, it would be great to be able to define
      per-container limits, like assigning 10MB/s to processes in
      container A, 30MB/s to container B, 20MB/s to container C, etc.

    Has anybody considered allocating based on *seeks* rather than bytes
    moved, or counting seeks as virtual bytes for the purposes of
    accounting (if the disk can do 50Mbytes/sec, and a seek takes 5
    milliseconds, then count it as 100K of data)?

  I was considering a time scheduler: you can fill your time slot with
  seeks or data. It might be what CFQ does, but I've never even read the
  code.

So far the definition of I/O bandwidth has been with respect to time. Not
all I/O devices have sectors; I'd prefer bytes over a period of time.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH] Assign IRQs to HPET Timers
* Balaji Rao [EMAIL PROTECTED] [2008-01-12 00:36:11]:

  Assign an IRQ to HPET timer devices when interrupt enable is requested.
  This now makes the HPET userspace API work.

A more detailed changelog would better help understand the nature and
origin of the problem and how to reproduce it.

   drivers/char/hpet.c  | 31 +-
   include/linux/hpet.h |  2 +-
   2 files changed, 30 insertions(+), 3 deletions(-)

  Signed-off-by: Balaji Rao [EMAIL PROTECTED]

  diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
  index 4c16778..92bd889 100644
  --- a/drivers/char/hpet.c
  +++ b/drivers/char/hpet.c
  @@ -390,7 +390,8 @@ static int hpet_ioctl_ieon(struct hpet_dev *devp)
   	struct hpets *hpetp;
   	int irq;
   	unsigned long g, v, t, m;
  -	unsigned long flags, isr;
  +	unsigned long flags, isr, irq_bitmap;
  +	u64 hpet_config;

   	timer = devp->hd_timer;
   	hpet = devp->hd_hpet;
  @@ -412,7 +413,29 @@ static int hpet_ioctl_ieon(struct hpet_dev *devp)
   		devp->hd_flags |= HPET_SHARED_IRQ;
   	spin_unlock_irq(&hpet_lock);

  -	irq = devp->hd_hdwirq;
  +	/* Assign an IRQ to the timer */
  +	hpet_config = readq(&timer->hpet_config);
  +	irq_bitmap =
  +		(hpet_config & Tn_INT_ROUTE_CAP_MASK) >> Tn_INT_ROUTE_CAP_SHIFT;

Should we check whether the interrupts are being delivered via FSB prior
to doing this?

  +	if (!irq_bitmap)
  +		irq = 0;	/* No IRQ assignable */
  +	else {
  +		irq = find_first_bit(&irq_bitmap, 32);
  +		do {
  +			hpet_config |= irq << Tn_INT_ROUTE_CNF_SHIFT;
  +			writeq(hpet_config, &timer->hpet_config);
  +
  +			/* Check whether we wrote a valid IRQ
  +			 * number by reading back the field
  +			 */
  +			hpet_config = readq(&timer->hpet_config);
  +			if (irq == (hpet_config & Tn_INT_ROUTE_CNF_MASK)
  +					>> Tn_INT_ROUTE_CNF_SHIFT) {
  +				devp->hd_hdwirq = irq;
  +				break;	/* Success */
  +			}
  +		} while ((irq = find_next_bit(&irq_bitmap, 32, irq)));
  +	}

Shouldn't we do this at hpet_alloc() time?
   	if (irq) {
   		unsigned long irq_flags;

  @@ -509,6 +532,10 @@ hpet_ioctl_common(struct hpet_dev *devp, int cmd, unsigned long arg, int kernel)
   		break;
   		v = readq(&timer->hpet_config);
   		v &= ~Tn_INT_ENB_CNF_MASK;
  +
  +		/* Zero out the IRQ field */
  +		v &= ~Tn_INT_ROUTE_CNF_MASK;
  +
   		writeq(v, &timer->hpet_config);
   		if (devp->hd_irq) {
   			free_irq(devp->hd_irq, devp);

  diff --git a/include/linux/hpet.h b/include/linux/hpet.h
  index 707f7cb..e3c0b2a 100644
  --- a/include/linux/hpet.h
  +++ b/include/linux/hpet.h
  @@ -64,7 +64,7 @@ struct hpet {
    */

   #define	Tn_INT_ROUTE_CAP_MASK		(0xFFFFFFFF00000000ULL)
  -#define	Tn_INI_ROUTE_CAP_SHIFT		(32UL)
  +#define	Tn_INT_ROUTE_CAP_SHIFT		(32UL)
   #define	Tn_FSB_INT_DELCAP_MASK		(0x8000UL)
   #define	Tn_FSB_INT_DELCAP_SHIFT		(15)
   #define	Tn_FSB_EN_CNF_MASK		(0x4000UL)

The patch looks good overall!

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [RFC][PATCH] per-task I/O throttling
* Andrea Righi [EMAIL PROTECTED] [2008-01-12 19:01:14]:

  Peter Zijlstra wrote:
    On Sat, 2008-01-12 at 16:27 +0530, Balbir Singh wrote:
      * Peter Zijlstra [EMAIL PROTECTED] [2008-01-12 10:46:37]:
        On Fri, 2008-01-11 at 23:57 -0500, [EMAIL PROTECTED] wrote:
          On Fri, 11 Jan 2008 17:32:49 +0100, Andrea Righi said:

            The interesting feature is that it allows setting a priority
            for each process container, but AFAIK it doesn't allow
            partitioning the bandwidth between different containers
            (which would be a nice feature IMHO). For example, it would
            be great to be able to define per-container limits, like
            assigning 10MB/s to processes in container A, 30MB/s to
            container B, 20MB/s to container C, etc.

          Has anybody considered allocating based on *seeks* rather than
          bytes moved, or counting seeks as virtual bytes for the
          purposes of accounting (if the disk can do 50Mbytes/sec, and a
          seek takes 5 milliseconds, then count it as 100K of data)?

        I was considering a time scheduler: you can fill your time slot
        with seeks or data. It might be what CFQ does, but I've never
        even read the code.

      So far the definition of I/O bandwidth has been with respect to
      time. Not all I/O devices have sectors; I'd prefer bytes over a
      period of time.

    Doing a time-based one would only require knowing the (avg) delay of
    seeks, whereas doing a bytes-based one would also require knowing
    the (avg) speed of the device. That is, if you're also interested in
    providing a latency guarantee, because that would force you to
    convert bytes to time again.

  So, what about considering both bytes/sec and I/O operations/sec? In
  this way we should be able to limit huge streams of data and seek
  storms (or any mix of them). Regarding CFQ, AFAIK it's only possible
  to configure an I/O priority for a process, but there's no way, for
  example, to limit the bandwidth (or I/O operations/sec) for a
  particular user or group.

Limiting usage is also a very useful feature. Andrea, could you please
port your patches over to control groups?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH] Re: Kernel Panic - 2.6.23-rc4-mm1 ia64 - was Re: Update: [Automatic] NUMA replicated pagecache ...
Lee Schermerhorn wrote:
  On Wed, 2007-09-12 at 16:41 +0100, Andy Whitcroft wrote:
    On Wed, Sep 12, 2007 at 11:09:47AM -0400, Lee Schermerhorn wrote:

      Interesting. I don't see a memory controller function in the stack
      trace, but I'll double check to see if I can find some silly race
      condition in there.

    right. I noticed that after I sent the mail. Also, config available
    at:
    http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont

    Be interested to know the outcome of any bisect you do, given it's
    tripping in reclaim.

  Problem isolated to the memory controller patches. This patch seems to
  fix this particular problem. I've only run the test for a few minutes
  with and without the memory controller configured, but I did observe
  reclaim kicking in several times. W/o this patch, the system would
  panic as soon as I entered direct/zone reclaim: less than a minute.

Thanks, excellent catch! The patch looks sane. Thanks for your help in
sorting this issue out. Hmm... that means I never hit direct/zone reclaim
in my tests (I'll make a mental note to enhance my test cases to cover
this scenario).

  Lee

  PATCH 2.6.23-rc4-mm1 Memory Controller: initialize all scan_controls'
  isolate_pages member.

  We need to initialize all scan_controls' isolate_pages member.
  Otherwise, shrink_active_list() attempts to execute at an undefined
  location.
  Signed-off-by: Lee Schermerhorn [EMAIL PROTECTED]

   mm/vmscan.c | 2 ++
   1 file changed, 2 insertions(+)

  Index: Linux/mm/vmscan.c
  ===
  --- Linux.orig/mm/vmscan.c	2007-09-10 13:22:21.0 -0400
  +++ Linux/mm/vmscan.c	2007-09-12 15:30:27.0 -0400
  @@ -1758,6 +1758,7 @@ unsigned long shrink_all_memory(unsigned
   		.swap_cluster_max = nr_pages,
   		.may_writepage = 1,
   		.swappiness = vm_swappiness,
  +		.isolate_pages = isolate_pages_global,
   	};

   	current->reclaim_state = &reclaim_state;
  @@ -1941,6 +1942,7 @@ static int __zone_reclaim(struct zone *z
   					SWAP_CLUSTER_MAX),
   		.gfp_mask = gfp_mask,
   		.swappiness = vm_swappiness,
  +		.isolate_pages = isolate_pages_global,
   	};
   	unsigned long slab_reclaimable;

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Re: [PATCH] Memory shortage can result in inconsistent flocks state
On 9/13/07, Pavel Emelyanov [EMAIL PROTECTED] wrote:
  J. Bruce Fields wrote:
    On Tue, Sep 11, 2007 at 04:38:13PM +0400, Pavel Emelyanov wrote:

      This is a known feature, that such re-locking is not atomic; in
      the racy case the file should stay locked (although by some other
      process), but in this case the file will be unlocked.

    That's a little subtle (I assume you've never seen this actually
    happen?), but it makes sense to me.

  Well, this situation is hard to notice, since usually programs try to
  finish up when some error is returned from the kernel, but I do believe
  that this could happen in one of the OpenVZ kernels, since we limit the
  kernel memory usage for containers and thus -ENOMEM is a common error.

The fault injection framework should be able to introduce the same error.
Of course, hitting the error would require careful setup of the fault
parameters.

Balbir
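For reference, with CONFIG_FAULT_INJECTION and CONFIG_FAILSLAB enabled, the standard debugfs knobs can force -ENOMEM from slab allocations without an OpenVZ kernel. A possible setup, as a configuration sketch (the probability and count values are illustrative; requires root and debugfs):

```sh
mount -t debugfs none /sys/kernel/debug

# fail 10% of slab allocations, at most 1000 times
echo 10   > /sys/kernel/debug/failslab/probability
echo 1000 > /sys/kernel/debug/failslab/times

# only inject failures into explicitly marked processes
echo Y > /sys/kernel/debug/failslab/task-filter
echo 1 > /proc/self/make-it-fail
```

With the task filter in place, only the marked test process sees the injected -ENOMEM, which is the "careful setup of the fault parameters" mentioned above.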
Re: cpuset trouble after hibernate
Andrew Morton wrote: On Mon, 10 Sep 2007 11:45:10 +0200 (CEST) Simon Derr [EMAIL PROTECTED] wrote: On Sat, 8 Sep 2007, Nicolas Capit wrote: Hello, This is my situation: - I mounted the pseudo cpuset filesystem on /dev/cpuset - I created a cpuset named oar with my 2 cpus; cat /dev/cpuset/oar/cpus shows 0-1 - Then I hibernated my computer with 'echo -n disk > /sys/power/state' - After reboot: cat /dev/cpuset/oar/cpus shows 0 Why did I lose a cpu? Is this normal behavior??? Hi Nicolas, I believe this is related to the fact that hibernation uses the hotplug subsystem to disable all CPUs except the boot CPU. Thus guarantee_online_cpus() is called on each cpuset and removes all CPUs, except CPU 0, from all cpusets. I'm not quite sure if/how this should be fixed in the kernel, though. Looks like a very simple user-land workaround would be enough. Yeah. Bug, surely. But I guess it's always been there. What are the implications of this for cpusets-via-containers? I suspect the functionality of cpusets is not affected by containers. I wonder if containers should become suspend/resume aware and pass that option on to controllers. I think it's only the bus drivers and device drivers that do that now. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH mm] fix swapoff breakage; however...
Hugh Dickins wrote: rc4-mm1's memory-controller-memory-accounting-v7.patch broke swapoff: it extended unuse_pte_range's boolean "found" return code to allow an error return too, but ended up returning found (1) as an error. Replace that by success (0) before it gets to the upper level.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
More fundamentally, it looks like any container brought over its limit in unuse_pte will abort swapoff: that doesn't seem contained to me. Maybe unuse_pte should just let containers go over their limits without error? Or swap should be counted along with RSS? Needs reconsideration.

 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 2.6.23-rc4-mm1/mm/swapfile.c	2007-09-07 13:09:42.0 +0100
+++ linux/mm/swapfile.c	2007-09-17 15:14:47.0 +0100
@@ -642,7 +642,7 @@ static int unuse_mm(struct mm_struct *mm
 			break;
 	}
 	up_read(&mm->mmap_sem);
-	return ret;
+	return (ret < 0) ? ret : 0;
 }

 /*

Thanks for catching this. There are three possible solutions: 1. Account each RSS page with a probable swap cache page, doubling the RSS accounting to ensure that swapoff will not fail. 2. Account for the RSS page just once; do not account swap cache pages. 3. Follow your suggestion and let containers go over their limits without error. With the current approach, a container over its limit will not be able to call swapoff successfully; is that bad? We plan to implement per-container/per-cpuset swap in the future. Given that, isn't this expected functionality? If you are over your limit, you cannot really swapoff a swap device. If we allow pages to be unused, we could end up with a container that exceeds its limit by a significant amount by calling swapoff.
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: 2.6.23-rc4-mm1 compile error for ppc 32
Benjamin Herrenschmidt wrote: On Sat, 2007-09-15 at 11:00 -0400, Mathieu Desnoyers wrote: * Benjamin Herrenschmidt ([EMAIL PROTECTED]) wrote: On Thu, 2007-09-13 at 15:17 -0700, Andrew Morton wrote: Like this?

--- a/include/asm-powerpc/bitops.h~powerpc-lock-bitops-fix
+++ a/include/asm-powerpc/bitops.h
@@ -226,7 +226,7 @@ static __inline__ void set_bits(unsigned
 static __inline__ void __clear_bit_unlock(int nr, volatile unsigned long *addr)
 {
-	__asm__ __volatile__(LWSYNC_ON_SMP ::: "memory");
+	__asm__ __volatile__(LWSYNC_ON_SMP ::: "memory");
 	__clear_bit(nr, addr);
 }

Looks ok. Can somebody test? I'm still travelling... Hi Benjamin, With this patch and the hrtimer.c fixes, 2.6.23-rc4-mm1 compiles fine for the PPC arch (powerpc 405). I still see errors/warnings from modpost though: Looks like the legacy ISA DMA crap, no? I don't know much about it. Ben. Kamalesh has reported a similar bug and is looking to fix this problem. He's been checking the Kconfigs to see that all of them either depend on GENERIC_ISA_DMA or select it.

make -f /home/compudj/git/linux-2.6-lttng/scripts/Makefile.modpost
scripts/mod/modpost -m -o /home/compudj/obj/powerpc-405/Module.symvers -s
ERROR: request_dma [sound/oss/sscape.ko] undefined!
ERROR: free_dma [sound/oss/sscape.ko] undefined!
ERROR: dma_spin_lock [sound/oss/sscape.ko] undefined!
ERROR: free_dma [sound/oss/sound.ko] undefined!
ERROR: request_dma [sound/oss/sound.ko] undefined!
ERROR: dma_spin_lock [sound/oss/sound.ko] undefined!
ERROR: dma_spin_lock [sound/core/snd.ko] undefined!
ERROR: dma_spin_lock [net/irda/irda.ko] undefined!
WARNING: div64_64 [net/ipv4/tcp_cubic.ko] has no CRC!
ERROR: free_dma [drivers/parport/parport_pc.ko] undefined!
ERROR: request_dma [drivers/parport/parport_pc.ko] undefined!
ERROR: dma_spin_lock [drivers/parport/parport_pc.ko] undefined!
ERROR: request_dma [drivers/net/irda/w83977af_ir.ko] undefined!
ERROR: free_dma [drivers/net/irda/w83977af_ir.ko] undefined!
ERROR: request_dma [drivers/net/irda/via-ircc.ko] undefined!
ERROR: free_dma [drivers/net/irda/via-ircc.ko] undefined!
ERROR: request_dma [drivers/net/irda/smsc-ircc2.ko] undefined!
ERROR: free_dma [drivers/net/irda/smsc-ircc2.ko] undefined!
ERROR: free_dma [drivers/net/irda/nsc-ircc.ko] undefined!
ERROR: request_dma [drivers/net/irda/nsc-ircc.ko] undefined!
ERROR: free_dma [drivers/net/irda/ali-ircc.ko] undefined!
ERROR: request_dma [drivers/net/irda/ali-ircc.ko] undefined!
ERROR: request_dma [drivers/mmc/host/wbsd.ko] undefined!
ERROR: dma_spin_lock [drivers/mmc/host/wbsd.ko] undefined!
ERROR: free_dma [drivers/mmc/host/wbsd.ko] undefined!

Mathieu -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH] Configurable reclaim batch size
Peter Zijlstra wrote: On Mon, 17 Sep 2007 10:54:59 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote: On Sat, 15 Sep 2007, Peter Zijlstra wrote: It increases the lock hold times though. Otoh it might work out with the lock placement. Yeah, may be good for NUMA. Might; I'd just like a _little_ justification for an extra tunable. Do you have any numbers that show this is worthwhile? Tried to run AIM7 but the improvements are in the noise. I need a test that really does large memory allocation and stresses the LRU. I could code something up, but then Lee's patch addresses some of the same issues. Is there any standard test that shows LRU handling regressions? hehe, I wish. I was just hoping you'd done this patch as a result of an actual problem and not a hunch. Please do let me know if someone finds a good standard test for it or a way to stress reclaim. I've heard AIM7 come up often, but never been able to push it much. I should retry. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH mm] fix swapoff breakage; however...
Hugh Dickins wrote: On Tue, 18 Sep 2007, Balbir Singh wrote: Hugh Dickins wrote: More fundamentally, it looks like any container brought over its limit in unuse_pte will abort swapoff: that doesn't seem contained to me. Maybe unuse_pte should just let containers go over their limits without error? Or swap should be counted along with RSS? Needs reconsideration. Thanks for catching this. There are three possible solutions: 1. Account each RSS page with a probable swap cache page, doubling the RSS accounting to ensure that swapoff will not fail. 2. Account for the RSS page just once; do not account swap cache pages. Neither of those makes sense to me, but I may be misunderstanding. What would make sense is (what I meant when I said swap counted along with RSS) not to count pages out and back in as they go out to swap and back in, but just to keep count of instantiated pages. I am not sure how you define instantiated pages. I suspect that you mean RSS + pages swapped out (swap_pte)? I say "make sense" meaning that the numbers could be properly accounted; but it may well be unpalatable to treat fast RAM as equal to slow swap. 3. Follow your suggestion and let containers go over their limits without error. With the current approach, a container over its limit will not be able to call swapoff successfully; is that bad? That's not so bad. What's bad is that anyone else with the CAP_SYS_ADMIN to swapoff is liable to be prevented by containers going over their limits. If a swapoff is going to push a container over its limit, then we break the container and the isolation it provides. Upon swapoff failure, maybe we could have the container print a nice little warning so that anyone else with CAP_SYS_ADMIN can fix the container limit and retry swapoff.
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH mm] fix swapoff breakage; however...
Hugh Dickins wrote: On Tue, 18 Sep 2007, Balbir Singh wrote: Hugh Dickins wrote: What would make sense is (what I meant when I said swap counted along with RSS) not to count pages out and back in as they go out to swap and back in, but just to keep count of instantiated pages. I am not sure how you define instantiated pages. I suspect that you mean RSS + pages swapped out (swap_pte)? That's it. (Whereas file pages are counted out when paged out, then counted back in when paged back in.) If a swapoff is going to push a container over its limit, then we break the container and the isolation it provides. Is it just my traditional bias that makes me prefer you break your container than my swapoff? I'm not sure. :-) Please see my response below. Upon swapoff failure, maybe we could have the container print a nice little warning so that anyone else with CAP_SYS_ADMIN can fix the container limit and retry swapoff. And then they hit the next one... rather like trying to work out the dependencies of packages for oneself: a very tedious process. Yes, but here's the overall picture of what is happening: 1. The system administrator set up a memory container to contain a group of applications. 2. The administrator tried to swapoff one or a group of swap files/devices. 3. Operation 2 failed due to a container being above its limit, which implies that at some point a container went over its limit and some of its pages were swapped out. During swapoff, we try to account for pages coming back into the container; our charging routine does try to reclaim pages, which in turn implies that it will use another swap device or reclaim page cache, and if both fail, we return -ENOMEM. Given that the system administrator has set up the container and the swap devices, I feel that he is in better control of what to do with the system when swapoff fails.
In the future we plan to implement per-container swap (a feature desired by several people); assuming that administrators use per-container swap in the future, failing on limit sounds like the right way to go forward. If the swapoff succeeds, that does mean there was actually room in memory (+ other swap) for everyone, even if some have gone over their nominal limits. (But if the swapoff runs out of memory in the middle, yes, it might well have assigned the memory unfairly.) Yes, precisely my point: the administrator is the best person to decide how to assign memory to containers. Would it help to add a container tunable that says it's OK for this container to go over its limit during a swapoff? The appropriate answer may depend on what you do when a container tries to fault in one more page than its limit. Apparently just fail it (no attempt to page out another page from that container). The problem with that approach is that applications will fail in the middle of their task. They will never get a chance to run at all; they will always get killed in the middle. We want to be able to reclaim pages from the container and let the application continue. So, if the whole system is under memory pressure, kswapd will be keeping the RSS of all tasks low, and they won't reach their limits; whereas if the system is not under memory pressure, tasks will easily approach their limits and so fail. Tasks failing on limit does not sound good unless we are out of all backup memory (slow storage). We still let the application run, although slowly. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: 2.6.23-rc6-mm1 panic (memory controller issue ?)
Badari Pulavarty wrote: On Tue, 2007-09-18 at 15:21 -0700, Badari Pulavarty wrote: Hi Balbir, I get the following panic from SLUB while doing simple fsx tests. I haven't used any container/memory controller stuff except that I configured them in :( Looks like slub doesn't like one of the flags passed in? Known issue? Ideas? I think I found the issue. I am still running tests to verify. Does this sound correct? Thanks, Badari. We need to strip the __GFP_HIGHMEM flag while passing to mem_container_cache_charge().

Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

 mm/filemap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6.23-rc6/mm/filemap.c
===================================================================
--- linux-2.6.23-rc6.orig/mm/filemap.c	2007-09-18 12:43:54.0 -0700
+++ linux-2.6.23-rc6/mm/filemap.c	2007-09-18 19:14:44.0 -0700
@@ -441,7 +441,8 @@ int filemap_write_and_wait_range(struct
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = mem_container_cache_charge(page, current->mm, gfp_mask);
+	int error = mem_container_cache_charge(page, current->mm,
+						gfp_mask & ~__GFP_HIGHMEM);
 	if (error)
 		goto out;

Hi, Badari, the fix looks correct; radix_tree_preload() does the same thing in add_to_page_cache(). Thanks for identifying the fix. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: 2.6.23-rc6-mm1 panic (memory controller issue ?)
Christoph Lameter wrote: On Wed, 19 Sep 2007, Balbir Singh wrote: The fix looks correct; radix_tree_preload() does the same thing in add_to_page_cache(). Thanks for identifying the fix. Hmmm... radix tree preload can only take a limited set of flags? Yes, the whole code is very interesting. From add_to_page_cache() we call radix_tree_preload() with __GFP_HIGHMEM cleared, but from __add_to_swap_cache() we don't make any changes to the gfp_mask. radix_tree_preload() calls kmem_cache_alloc(), and in slub there is a check: BUG_ON(flags & GFP_SLAB_BUG_MASK); So, I guess all our allocations should check against __GFP_DMA and __GFP_HIGHMEM. I'll review the code, test it and send a fix. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: 2.6.23-rc6-mm1 panic (memory controller issue ?)
Christoph Lameter wrote: On Wed, 19 Sep 2007, Balbir Singh wrote: Yes, the whole code is very interesting. From add_to_page_cache() we call radix_tree_preload() with __GFP_HIGHMEM cleared, but from __add_to_swap_cache() we don't make any changes to the gfp_mask. radix_tree_preload() calls kmem_cache_alloc(), and in slub there is a check: BUG_ON(flags & GFP_SLAB_BUG_MASK); So, I guess all our allocations should check against __GFP_DMA and __GFP_HIGHMEM. I'll review the code, test it and send a fix. You need to use the proper mask from include/linux/gfp.h. Masking individual bits will create problems when we create new bits. I agree 100%; that's why I want to review the code. I want to use a mask that clears the GFP_SLAB_BUG_MASK bits and review it. I want to check against other call sites that use gfp_mask as well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)
Andrew Morton wrote: On Tue, 18 Sep 2007 00:23:39 +0200 Guillaume Chazarain [EMAIL PROTECTED] wrote: TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not the basic and extended accounting. With this patch, TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P (http://guichaz.free.fr/misc/iotop.py). This patch conflicts somewhat with add-scaled-time-to-taskstats-based-process-accounting.patch; I fixed it up like this:

void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{
	if (task->flags & PF_SUPERPRIV)
		stats->ac_flag |= ASU;
	if (task->flags & PF_DUMPCORE)
		stats->ac_flag |= ACORE;
	if (task->flags & PF_SIGNALED)
		stats->ac_flag |= AXSIG;
	if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
		/*
		 * Threads are created by do_fork() and don't exec, but not in
		 * the AFORK sense, as the latter involves fork(2).
		 */
		stats->ac_flag |= AFORK;
	stats->ac_utimescaled += cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
	stats->ac_stimescaled += cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
	stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
	stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
	stats->ac_minflt += task->min_flt;
	stats->ac_majflt += task->maj_flt;
}

(note the s/=/+=/ in there) but it all needs reviewing and checking and testing please. Andrew, thanks for reviewing the patchset; this patch is on my review and test queue (which has gotten rather long of late). I'll test it further and get back. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)
Andrew, Thanks for reviewing the patchset; this patch is on my review and test queue (which has gotten rather long of late). I'll test it further and get back. I still think this version is very wrong. It makes the ->signal->stats absolutely meaningless. Quoting myself: Hi, Oleg, Yes, I see, removing the memcpy is definitely wrong. Thanks for catching it. I did not get a chance to review the patch (it's on my review queue), but since you've reviewed it, I am very glad that you have and identified potential issues. Big Thanks! -- Balbir Singh Linux Technology Center IBM, ISTL
Re: Revert for cgroups CPU accounting subsystem patch
Paul Menage wrote: On Nov 12, 2007 10:00 PM, Srivatsa Vaddagiri [EMAIL PROTECTED] wrote: On second thoughts, this may be a useful controller of its own. Say I just want to monitor usage (for accounting purposes) of a group of tasks, but don't want to control their cpu consumption; then the cpuacct controller would come in handy. That's plausible, but having two separate ways of tracking and reporting the CPU usage of a cgroup seems wrong. How bad would it be in your suggested case if you just gave each cgroup the same weight? There would then be fair scheduling between cgroups, which seems as reasonable as any other choice in the event that the CPU is contended. Right now, one of the limitations of the CPU controller is that the moment you create another control group, the bandwidth gets divided by the default number of shares. We can't create groups just for monitoring; cpu_acct fills this gap. I think in the long run, we should move the helper functions into cpu_acct.c and the interface logic into kernel/sched.c (the cpu controller). -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: top lies ?
On Nov 13, 2007 12:38 PM, Al Boldi [EMAIL PROTECTED] wrote: kloczek wrote: Some data shown by the top command looks completely trashed. Fragment from top output:

Mem:  2075784k total, 2053352k used,   22432k free,   19260k buffers
Swap: 2096472k total,     136k used, 2096336k free, 1335080k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ SWAP nFLT WCHAN COMMAND
14515 mysql 20  0 1837m 563m 4132 S   39 27.8 27:14.20 1.2g   18 -     mysqld

How is it possible that swap usage is 136k while the swapped-out portion of (in this case) the mysqld process is 1.2g? Welcome to OverCommit, aka OOM-nirvana. Try this: # echo 2 > /proc/sys/vm/overcommit_memory # echo 0 > /proc/sys/vm/overcommit_ratio But make sure you have enough swap. Thanks! The swap cache looks pretty big; maybe top is including that data while reporting swap usage. Balbir
Re: Revert for cgroups CPU accounting subsystem patch
Paul Menage wrote: On Nov 12, 2007 11:00 PM, Balbir Singh [EMAIL PROTECTED] wrote: Right now, one of the limitations of the CPU controller is that the moment you create another control group, the bandwidth gets divided by the default number of shares. We can't create groups just for monitoring. Could we get around this with, say, a flag that always treats a CFS schedulable entity as having a weight equal to the number of runnable tasks in it? CPU bandwidth would then be shared between groups in proportion to the number of runnable tasks, which would distribute the cycles approximately equivalently to them all being separate schedulable entities. I think it's a good hack, but I'm not sure about the complexity of implementing it. I worry that if the number of tasks increases (say they run into thousands for one or more groups while a few groups have just a few tasks), we'll lose out on accuracy. cpu_acct fills this gap. Agreed, but not in the right way IMO. I think we already have the code; we need to make it more useful and reusable. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH 0/2] memcgroup: work better with tmpfs
Hugh Dickins wrote: Here's a couple of patches to get memcgroups working better with tmpfs and shmem, in conjunction with the tmpfs patches I just posted. There will be another to come later on, but I shouldn't wait any longer to get these out to you. Hi, Hugh, Thank you so much for the review; some comments below. (The missing patch will want to leave a mem_cgroup associated with a tmpfs file or shm object, so that if its pages get brought back from swap by swapoff, they can be associated with that mem_cgroup rather than the one which happens to be running swapoff.) mm/memcontrol.c | 81 -- mm/shmem.c | 28 +++ 2 files changed, 63 insertions(+), 46 deletions(-) But on the way I've noticed a number of issues with memcgroups not dealt with in these patches. 1. Why is spin_lock_irqsave rather than spin_lock needed on mz->lru_lock? If it is needed, doesn't mem_cgroup_isolate_pages need to use it too? We always call mem_cgroup_isolate_pages() from shrink_(in)active_pages under spin_lock_irq of the zone's lru lock. That's the reason we don't take it explicitly in the routine. 2. There's mem_cgroup_charge and mem_cgroup_cache_charge (wouldn't the former be better called mem_cgroup_charge_mapped? why does the latter test MEM_CGROUP_TYPE_ALL instead of MEM_CGROUP_TYPE_CACHED? I still don't understand your enums there). Yes, it would be. After we've refactored the code, the new name makes sense. We do that to ensure that we charge page cache pages only when the accounting type is set to MEM_CGROUP_TYPE_ALL. If the type is anything else, we ignore cached pages; we did not have MEM_CGROUP_TYPE_CACHED initially when the patches went in. But there's only mem_cgroup_uncharge. So when, for example, an add_to_page_cache fails, the uncharge may not balance the charge? We use mem_cgroup_uncharge() everywhere.
The reason being, we might switch the control type; we uncharge pages that have a page_cgroup associated with them, hence once we've charged, uncharge does not distinguish between charge types. 3. mem_cgroup_charge_common has rcu_read_lock/unlock around its rcu_dereference; mem_cgroup_cache_charge does not: is that right? Very good catch! Will fix it. 4. That page_assign_page_cgroup in free_hot_cold_page, what case is that handling? Wouldn't there be a leak if it ever happens? I've been running with a BUG_ON(page->page_cgroup) there and not hit it - should it perhaps be a Bad page state case? Our cleanup in page_cache_uncharge() does take care of cleaning up the page_cgroup. I think you've got it right; it should be a BUG_ON in free_hot_cold_page(). Hugh Thanks for the detailed review and fixes. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page
Based on the recommendation and observations of Hugh Dickins, page_assign_page_cgroup(page, NULL) is not required in free_hot_cold_page(). This patch replaces it with a VM_BUG_ON, so that we can catch stray page_cgroup pointers there.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/page_alloc.c~memory-controller-move-to-bug-on-in-free_hot_cold_page mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c~memory-controller-move-to-bug-on-in-free_hot_cold_page	2007-12-19 11:31:46.0 +0530
+++ linux-2.6.24-rc5-balbir/mm/page_alloc.c	2007-12-19 11:33:45.0 +0530
@@ -995,7 +995,7 @@ static void fastcall free_hot_cold_page(
 	if (!PageHighMem(page))
 		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
-	page_assign_page_cgroup(page, NULL);
+	VM_BUG_ON(page_get_page_cgroup(page));
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
_

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[PATCH] Memory controller use rcu_read_lock() in mem_cgroup_cache_charge()
Hugh Dickins noticed that we were using rcu_dereference() without rcu_read_lock() in the cache charging routine. The patch below fixes this problem.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---
 mm/memcontrol.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-use-rcu-lead-lock mm/memcontrol.c
--- linux-2.6.24-rc5/mm/memcontrol.c~memory-controller-use-rcu-lead-lock	2007-12-19 11:52:44.0 +0530
+++ linux-2.6.24-rc5-balbir/mm/memcontrol.c	2007-12-20 14:01:45.0 +0530
@@ -717,16 +717,20 @@ int mem_cgroup_charge(struct page *page,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	int ret = 0;
 	struct mem_cgroup *mem;

 	if (!mm)
 		mm = &init_mm;

+	rcu_read_lock();
 	mem = rcu_dereference(mm->mem_cgroup);
+	css_get(&mem->css);
+	rcu_read_unlock();
 	if (mem->control_type == MEM_CGROUP_TYPE_ALL)
-		return mem_cgroup_charge_common(page, mm, gfp_mask,
+		ret = mem_cgroup_charge_common(page, mm, gfp_mask,
 				MEM_CGROUP_CHARGE_TYPE_CACHE);
-	else
-		return 0;
+	css_put(&mem->css);
+	return ret;
 }
_

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
Re: [PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page
Peter Zijlstra wrote: On Thu, 2007-12-20 at 14:16 +0000, Hugh Dickins wrote: On Thu, 20 Dec 2007, Peter Zijlstra wrote: On Thu, 2007-12-20 at 13:14 +0000, Hugh Dickins wrote: On Wed, 19 Dec 2007, Dave Hansen wrote:

-	page_assign_page_cgroup(page, NULL);
+	VM_BUG_ON(page_get_page_cgroup(page));

Hi Balbir, You generally want to do these like:

	foo = page_assign_page_cgroup(page, NULL);
	VM_BUG_ON(foo);

Some embedded people have been known to optimize kernel size like this:

#define VM_BUG_ON(x) do { } while (0)

Balbir's patch looks fine to me: I don't get your point there, Dave. There was a lengthy discussion here: http://lkml.org/lkml/2007/12/14/131 on the merit of debug statements with side effects. Of course, but what's the relevance? But looking at our definition:

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
#define VM_BUG_ON(condition) do { } while (0)
#endif

disabling CONFIG_DEBUG_VM breaks the code as proposed by Balbir in that it will no longer acquire the reference. But what reference?

struct page_cgroup *page_get_page_cgroup(struct page *page)
{
	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
}

I guess the issue is that often a get function has a complementary put function, but this isn't one of them. Would page_page_cgroup be a better name, perhaps? I don't know. Ah, yes, I mistakenly assumed it was a reference get. In that case I stand corrected and do not have any objections. I was going to say the same thing: page_get_page_cgroup() does not hold any references. Maybe the _get_ in the name is confusing. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL