Re: VM: dynamic swap remapping (patch)
ehlo.

> Well Jos seems to have provided a pretty interesting document on how it works in AIX, but I was wondering if they do anything wrt low/high watermarks like my idea. Basically you'd like to inform processes that the danger has been alleviated so that they can cautiously start accepting more work rather than freaking out and shutting out clients forever...

Actually, most applications assume that everything is OK unless something tells them it is not. Regular OOM protection may be built as:

    void on_sigdanger(int)
    {
        throw std::runtime_error("out of memory");
    }
    ...
    while( there_are_more_requests )
    {
        try
        {
            do_some_work_eating_lot_of_memory();
        }
        catch(const std::exception& ex)
        {
            cerr << ex.what() << endl;
        }
    }

I.e., we attempt to execute user requests while we have them in our queue, but we get an exception and stop processing the current request if the system is out of memory. As soon as the system has enough free space again, we continue normal processing without any special handling on our side.

It means that a signal opposite to SIGDANGER is rarely required, if required at all. You should be glad, it reduces the work to do. ;)

P.S. I know that throwing inside a signal handler is a bad technique, but it works (and works better than setting a flag and testing it everywhere).

dozen

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
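For comparison, the "set a flag and test it" alternative mentioned in the P.S. can be sketched in plain C. This is only an illustration under stated assumptions: SIGUSR1 stands in for SIGDANGER (which FreeBSD does not define), and `process_requests` and `demo_work` are hypothetical names, not part of any patch in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Assumption: SIGUSR1 is a stand-in for a SIGDANGER-like signal. */
#define SIG_LOWMEM SIGUSR1

/* Only a sig_atomic_t flag is touched inside the handler. */
static volatile sig_atomic_t low_memory = 0;

static void on_low_memory(int signo)
{
    (void)signo;
    low_memory = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_low_memory_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_low_memory;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIG_LOWMEM, &sa, NULL);
}

/* Hypothetical unit of work; request 2 simulates the kernel's warning. */
void demo_work(int id)
{
    if (id == 2)
        raise(SIG_LOWMEM);
}

/*
 * Run up to n requests, testing the flag before each one.
 * Returns the number of requests completed before the warning was seen.
 */
int process_requests(int n, void (*work)(int))
{
    int done = 0;
    assert(work != NULL);
    for (int i = 0; i < n; i++) {
        if (low_memory) {
            low_memory = 0;   /* acknowledge and stop taking new work */
            break;
        }
        work(i);
        done++;
    }
    return done;
}
```

The cost the P.S. complains about is visible here: the flag is only noticed at the points where the code explicitly tests it, so a long-running request has to poll it on its own.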
Re: VM: dynamic swap remapping (patch)
i got a (way old) ppc 604e in the corner of my office. it's a 74p, with the latest 4.3.3 patchlevel from one month ago or so installed. i could arrange ssh access to the box if somebody cares, although i am not available 24x7 for remote hands ;-) there's nothing critical on it, the box has 128mb ram, so contact me off-list if you want to play around with it.

/k

Greg Lehey([EMAIL PROTECTED])@2001.10.01 13:19:51 +:
> On Sunday, 30 September 2001 at 14:55:58 -0500, Alfred Perlstein wrote:
> > * Jos Backus [EMAIL PROTECTED] [010930 14:35] wrote:
> > > On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> > > > * Jos Backus [EMAIL PROTECTED] [010930 12:55] wrote:
> > > > > AIX has SIGDANGER.
> > > > Anyone care to tell me how it works in AIX? If the interface is nice, cloning it would be kind of cool.
> > > I don't currently have access to an AIX system, but http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm has some (useful) info.
> > It sure does! I think I'm going to make a proposal on -arch about this, to be perfectly honest, AIX has a good implementation, I haven't read it all yet, but it doesn't look like it gives the applications a notification when the danger is gone, we'll have to figure that out, or I'll have to read more into this.
> If it's any help, I have an AIX box here. It belongs to IBM, so I have to respect security issues, but I'll do what I can.
>
> Greg
> --
> See complete headers for address and phone numbers

--
Gravity is an unforgiving motherfucker.
KR433/KR11-RIPE -- WebMonster Community Founder -- nGENn GmbH Senior Techie
http://www.webmonster.de/ -- ftp://ftp.webmonster.de/ -- http://www.ngenn.net/
karstenrohrbach.de -- alphangenn.net -- alphascene.org -- [EMAIL PROTECTED]
GnuPG 0x2964BF46 2001-03-15 42F9 9FFF 50D4 2F38 DBEE DF22 3340 4F4E 2964 BF46
Please do not remove my address from To: and Cc: fields in mailing lists. 10x
Re: VM: dynamic swap remapping (patch)
:    Second, application not always grows to 1G, most of the time it keeps
:    as small as 500M ;). Why should we precommit 1G for 500M data? Doing
:    multi-mmap memory management is additional pain.

    Why not? Disk space is cheap. For a problem like this I would simply throw in two 30G+ hard drives and partition them with 16G of swap each, giving me 32G of swap for the machine. If you needed to do it cheaply you could even use IDE, though personally I would use SCSI for reliability. Depending on the amount of real memory in the machine you might have to tweak a few kernel options (like matching NSWAP to the actual number of swap devices), but basically it should just work.

    Even using file-backed memory is fairly trivial. You don't need to do multi-mmap memory management or do any kernel tweaking. Just reserve 1G and use a single mmap() and file per process.

                                                -Matt
Re: VM: dynamic swap remapping (patch)
ehlo.

> My suggestion, (but not my final say, i'm still open to ideas):
>
> Implement a memory status signal to notify processes of changes in the relative amount of system memory. When memory reaches a low or high watermark, the signal is broadcast to all running processes. The default disposition will be to ignore the signal. The signal will be named SIGMEMINFO. (SIGXfoo means 'process has exceeded resource foo')

Agreed. As for SIG_IGN, can anyone tell me -- can I force an existing application to use my signal handler? For example, by preloading some shared library? If so, there are no arguments against ignoring the signal by default.

> The signal will pass via the siginfo struct information such that the process can determine if the system has just exceeded the low watermark (danger) or has reclaimed down to the high watermark (enough free memory).

Passing more info is always better. Agreed.

> a) over allocate swap a bit and set the low watermark carefully.
> b) do the following enhancement: Provide a system whereby you can swap to the filesystem without additional upcalls/syscalls from userspace, basically, provide some means of paging to the filesystem automatically. then, set your lowwater mark to the size of your swap partition, now your system will alert your processes and automatically swap _anyone_ to the filesystem.
>
> I really think that this would be more flexible and still allow you to achieve what you want... What do you think?

I can't say anything until I get the details. Sorry, English is neither my native language nor one I use often, so I may easily miss important details, but here are my random comments:

Initially, I was trying the same (I think) approach, but there were some problems. Some kernel functions refused to work with VM objects of processes other than curproc. I.e., it could be hard to work with bigproc inside the swap daemon; and the swap daemon is the only place where we can detect the OOM condition; that's why I used a signal to transfer control to user space, and then back into the kernel -- already in another process. Another reason to do it this way is to make all limits and quotas work automatically. Also, I did not want to keep the swap daemon busy for too long.

Also, what does "over allocate swap a bit" mean? How do we compute the value of that bit? At what moment should we preallocate? Should we repeat the preallocation after getting SIGMEMINFO (himark)?

Also, you cannot set the low mark to the size of the swap partition. To create file-based swap you need some memory (file operations require it). So, the low mark should be a bit lower (that's why I raised the value of nswap_lowat).

Finally, if you over allocate swap for every process in the system, the whole swap can wind up consisting of nothing but preallocations. Resource management is the role of the kernel. Any hard reservation interferes with that.

--
dozen @ home
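To make the siginfo idea concrete, here is a userland sketch of how a process might tell "low watermark crossed" from "back above the high watermark" when both arrive as the same signal. Everything named here is an assumption for illustration: SIGUSR1 stands in for the proposed SIGMEMINFO, the `mem_event` codes are invented, and `sigqueue()` to our own pid merely simulates what the kernel would broadcast.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Assumption: SIGUSR1 stands in for the proposed SIGMEMINFO. */
#define SIG_MEMINFO SIGUSR1

/* Hypothetical event codes carried in the siginfo payload. */
enum mem_event { MEM_NONE = 0, MEM_LOW = 1, MEM_OK = 2 };

static volatile sig_atomic_t last_event = MEM_NONE;

/* SA_SIGINFO handler: record which watermark was crossed. */
static void on_meminfo(int signo, siginfo_t *info, void *ctx)
{
    (void)signo;
    (void)ctx;
    last_event = info->si_value.sival_int;
}

/* Returns 0 on success, -1 on failure. */
int install_meminfo_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_meminfo;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIG_MEMINFO, &sa, NULL);
}

/* Simulate the kernel's notification by queueing the signal to ourselves. */
int simulate_meminfo(enum mem_event ev)
{
    union sigval v;
    assert(ev == MEM_LOW || ev == MEM_OK);
    v.sival_int = (int)ev;
    return sigqueue(getpid(), SIG_MEMINFO, v);
}

/* Most recent event seen by the handler. */
int current_mem_event(void)
{
    return last_event;
}
```

With the default disposition set to ignore, a process that never installs a handler is unaffected, which matches the proposal quoted above.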
Re: VM: dynamic swap remapping (patch)
* Matt Dillon [EMAIL PROTECTED] [010930 02:53] wrote:
> :    Second, application not always grows to 1G, most of the time it keeps
> :    as small as 500M ;). Why should we precommit 1G for 500M data? Doing
> :    multi-mmap memory management is additional pain.
>
> Why not? Disk space is cheap. For a problem like this I would simply throw in two 30G+ hard drives and partition them with 16G of swap each, giving me 32G of swap for the machine. If you needed to do it cheaply you could even use IDE, though personally I would use SCSI for reliability. Depending on the amount of real memory in the machine you might have to tweak a few kernel options (like matching NSWAP to the actual number of swap devices), but basically it should just work.
>
> Even using file-backed memory is fairly trivial. You don't need to do multi-mmap memory management or do any kernel tweaking. Just reserve 1G and use a single mmap() and file per process.

What he needs is a system to inform him that things aren't looking so good, check my email for what I think is a pretty good solution.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
Re: VM: dynamic swap remapping (patch)
In message [EMAIL PROTECTED], Matt Dillon writes:
> :    Second, application not always grows to 1G, most of the time it keeps
> :    as small as 500M ;). Why should we precommit 1G for 500M data? Doing
> :    multi-mmap memory management is additional pain.
>
> Even using file-backed memory is fairly trivial. You don't need to do multi-mmap memory management or do any kernel tweaking. Just reserve 1G and use a single mmap() and file per process.

I once had a patch to phkmalloc() which backed all malloc'ed VM with hidden files in the user's homedir. It was written to put the VM usage under QUOTA control, but it had many useful side effects as well.

I can't seem to find it right now, but it is trivial to do: just replace the sbrk(2) with mmap(). Only downside is the needed filedescriptor which some shells don't like.

--
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]   | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: VM: dynamic swap remapping (patch)
On Sat, 29 Sep 2001, Alfred Perlstein wrote:
> * Vladimir Dozen [EMAIL PROTECTED] [010929 14:38] wrote:
> > P.S. Anyway, I do NOT insist my solution is better, and even that it is good for anything at all. It was fun for me to hack in BSD kernel, and it was interesting challenge, and I feel need to share results with others. At worst, I will recommend our customer to setup processing farm under FreeBSD with applied patch.
>
> I'm really impressed with the work you put into this, but it seems that you've tried to tackle two problems at the same time. Indeed, the whole idea of swapping tasks to the filesystem is nice, but having the task do this all by itself isn't a good option for many people...
>
> My suggestion, (but not my final say, i'm still open to ideas):
>
> Implement a memory status signal to notify processes of changes in the relative amount of system memory. When memory reaches a low or high watermark, the signal is broadcast to all running processes. The default disposition will be to ignore the signal. The signal will be named SIGMEMINFO. (SIGXfoo means 'process has exceeded resource foo')

That'd be SIGDANGER, right ?

> b) do the following enhancement: Provide a system whereby you can swap to the filesystem without additional upcalls/syscalls from userspace, basically, provide some means of paging to the filesystem automatically.

Sounds like a winner, when swap runs out a process gets suspended onto the filesystem automatically and SIGDANGER is sent out to give others a chance to clean themselves up.

If enough space is freed, the suspended process can get back into the system. This should also preserve leaky applications while at the same time leaving the system intact...

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to [EMAIL PROTECTED] (spam digging piggy)
Re: VM: dynamic swap remapping (patch)
ehlo.

> :    Second, application not always grows to 1G, most of the time it keeps
> :    as small as 500M ;). Why should we precommit 1G for 500M data? Doing
> :    multi-mmap memory management is additional pain.
>
> Why not? Disk space is cheap.

Developer time is expensive. Someone already wrote good allocation routines, and they are inside libc. Reinventing the bicycle in every new large-scale application doesn't sound good to me.

> For a problem like this I would simply throw in two 30G+ hard drives and partition them with 16G of swap each, giving me 32G of swap for the machine.

As was said here before, there are actually two problems: notification (avoiding silent kills) and getting more paging space. The second can be solved by adding swap space. The first cannot. As a developer, I'm more interested in the first. The current solution with killproc() is not acceptable. Just imagine OS documentation which says: the OS may terminate a process at any point with no warning or notification. Would you like to use it? But this is exactly what FreeBSD does at OOM.

> Even using file-backed memory is fairly trivial. You don't need to do multi-mmap memory management or do any kernel tweaking. Just reserve 1G and use a single mmap() and file per process.

As I already said, it is not trivial. It involves writing or adapting some allocation code. It means time and human resources, i.e. money.

--
dozen @ home
Re: VM: dynamic swap remapping (patch)
* Rik van Riel [EMAIL PROTECTED] [010930 04:12] wrote:
> On Sat, 29 Sep 2001, Alfred Perlstein wrote:
> > * Vladimir Dozen [EMAIL PROTECTED] [010929 14:38] wrote:
> > > P.S. Anyway, I do NOT insist my solution is better, and even that it is good for anything at all. It was fun for me to hack in BSD kernel, and it was interesting challenge, and I feel need to share results with others. At worst, I will recommend our customer to setup processing farm under FreeBSD with applied patch.
> >
> > I'm really impressed with the work you put into this, but it seems that you've tried to tackle two problems at the same time. Indeed, the whole idea of swapping tasks to the filesystem is nice, but having the task do this all by itself isn't a good option for many people...
> >
> > My suggestion, (but not my final say, i'm still open to ideas):
> >
> > Implement a memory status signal to notify processes of changes in the relative amount of system memory. When memory reaches a low or high watermark, the signal is broadcast to all running processes. The default disposition will be to ignore the signal. The signal will be named SIGMEMINFO. (SIGXfoo means 'process has exceeded resource foo')
>
> That'd be SIGDANGER, right ?

Sort of.

> > b) do the following enhancement: Provide a system whereby you can swap to the filesystem without additional upcalls/syscalls from userspace, basically, provide some means of paging to the filesystem automatically.
>
> Sounds like a winner, when swap runs out a process gets suspended onto the filesystem automatically and SIGDANGER is sent out to give others a chance to clean themselves up.

Well, no, the idea is to have a low and high watermark so that flip-flopping on the boundary doesn't generate a lot of signals. SIGDANGER is ok for a name, but slightly misleading because I wanted to piggyback some info in the siginfo to tell processes when the danger has passed. Well ok, the name is ok, but I do want an upcall when the situation is alleviated.

Let me also state that it may be wise to add heuristics to the system to not SIGDANGER anything that is completely swapped out or hasn't run in a long time, this would avoid a spike in thrashing at the time of the broadcast.

> If enough space is freed, the suspended process can get back into the system. This should also preserve leaky applications while at the same time leaving the system intact...

Hopefully, also having a SIGDANGER handler may be an indication to the kernel to give you a second chance before shooting at you. I know it could be used to subvert behavior to have another naive program killed, however that could be a tunable to give those trying to do the right thing a second chance.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
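The low/high watermark point is easy to state as a small state machine: a notification is produced only when the level crosses the far threshold, so hovering around one boundary generates nothing. The sketch below is illustrative only; the struct, the enum names, and the choice to measure free swap pages are assumptions, not the kernel's actual variables.

```c
#include <assert.h>
#include <stddef.h>

enum mem_state  { MEM_STATE_OK, MEM_STATE_DANGER };
enum mem_notify { NOTIFY_NONE, NOTIFY_DANGER, NOTIFY_ALL_CLEAR };

/* Hypothetical watermark tracker, measured in free swap pages. */
struct watermark {
    size_t low;            /* at or below this many free pages: danger */
    size_t high;           /* at or above this many free pages: all clear */
    enum mem_state state;  /* what was last announced */
};

/*
 * Feed in the current number of free pages; returns the notification to
 * broadcast, if any. Values between low and high never change the state,
 * which is what suppresses flip-flopping on a single boundary.
 */
enum mem_notify watermark_update(struct watermark *w, size_t free_pages)
{
    assert(w != NULL && w->low < w->high);

    if (w->state == MEM_STATE_OK && free_pages <= w->low) {
        w->state = MEM_STATE_DANGER;
        return NOTIFY_DANGER;
    }
    if (w->state == MEM_STATE_DANGER && free_pages >= w->high) {
        w->state = MEM_STATE_OK;
        return NOTIFY_ALL_CLEAR;
    }
    return NOTIFY_NONE;
}
```

For example, with low = 100 and high = 400, a sequence of 90, 110, 95 free pages yields one danger notification and then silence; the all-clear is sent only once the count is back at 400 or more.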
Re: VM: dynamic swap remapping (patch)
ehlo.

> You're still thinking of the combined solution, just think of a system where all you have right now is the signals I mentioned.

Yah, now I think I got it. Well, actually, the signal(s) is all I need. The remapping was just a bonus.

To be more precise, I need only one signal -- the one sent when the low mark is passed. Some other application might be interested in the second -- hi mark -- signal, but mine isn't.

SIGDANGER is the signal from Irix, AFAIR? So, how about accepting this name (just to not increase the entropy of the Universe) and sending it to all processes when nswap_lowat is reached?

The only point -- I would prefer to be able to set nswap_lowat via sysctl, since I cannot predict how much memory will be consumed while freeing memory ;) (e.g., throwing an exception in C++ may eat memory due to creating the exception object; logging may eat memory too).

> Just think what happens if your filesystems are full and you run out of swap...

The same that happens today -- killproc() will kill me. The situation doesn't become worse with remapping, it is just ... mmm... prolonged.

--
dozen @ home
Re: VM: dynamic swap remapping (patch)
* Vladimir Dozen [EMAIL PROTECTED] [010930 04:41] wrote:
> ehlo.
>
> > You're still thinking of the combined solution, just think of a system where all you have right now is the signals I mentioned.
>
> Yah, now I think I got it. Well, actually, the signal(s) is all I need. The remapping was just a bonus.
>
> To be more precise, I need only one signal -- the one sent when the low mark is passed. Some other application might be interested in the second -- hi mark -- signal, but mine isn't.
>
> SIGDANGER is the signal from Irix, AFAIR? So, how about accepting this name (just to not increase the entropy of the Universe) and sending it to all processes when nswap_lowat is reached?
>
> The only point -- I would prefer to be able to set nswap_lowat via sysctl, since I cannot predict how much memory will be consumed while freeing memory ;) (e.g., throwing an exception in C++ may eat memory due to creating the exception object; logging may eat memory too).

You want to submit a patch? If not I can take a look at it, but it's been a bit since I've looked at the vm system.

> > Just think what happens if your filesystems are full and you run out of swap...
>
> The same that happens today -- killproc() will kill me. The situation doesn't become worse with remapping, it is just ... mmm... prolonged.
>
> --
> dozen @ home

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
Re: VM: dynamic swap remapping (patch)
ehlo.

> You want to submit a patch? If not I can take a look at it, but it's been a bit since I've looked at the vm system.

Except for the sysctl, the patch is quite simple due to the fact that hysteresis is already implemented in swap_pager.c, something like:

diff vm/swap_pager.c vm.new/swap_pager.c
217a218,219
>         struct proc* p;
>
218a221,225
>         /* warn all processes */
>         for( p = allproc.lh_first; p != 0; p = p->p_list.le_next )
>         {
>                 psignal(p,SIGDANGER);
>         }
diff sys/signal.h sys.new/signal.h
105a106,109
>
> #ifndef _POSIX_SOURCE
> #define SIGDANGER 32 /* close to out-of-memory */
> #endif
diff kern/kern_sig.c kern.new/kern_sig.c
165a166
>         SA_IGNORE /* SIGDANGER */

--
dozen @ home
Re: VM: dynamic swap remapping (patch)
ehlo.

> diff vm/swap_pager.c vm.new/swap_pager.c
> 217a218,219
> >         struct proc* p;
> >
> 218a221,225
> >         /* warn all processes */
> >         for( p = allproc.lh_first; p != 0; p = p->p_list.le_next )
> >         {
> >                 psignal(p,SIGDANGER);
> >         }

Oops, it doesn't work. All processes died. Why? Should something be changed in libc?

--
dozen @ home
Re: VM: dynamic swap remapping (patch)
* Vladimir Dozen [EMAIL PROTECTED] [010930 06:16] wrote:
> ehlo.
>
> > diff vm/swap_pager.c vm.new/swap_pager.c
> > 217a218,219
> > >         struct proc* p;
> > >
> > 218a221,225
> > >         /* warn all processes */
> > >         for( p = allproc.lh_first; p != 0; p = p->p_list.le_next )
> > >         {
> > >                 psignal(p,SIGDANGER);
> > >         }
>
> Oops, it doesn't work. All processes died. Why? Should something be changed in libc?

I'll take a look at implementing it sometime this week. I want to do the siginfo thing if possible.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
Re: VM: dynamic swap remapping (patch)
On Sun, Sep 30, 2001 at 01:44:37PM +, Vladimir Dozen wrote:
> SIGDANGER is the signal from Irix, AFAIR?

AIX has SIGDANGER.

--
Jos Backus, Santa Clara, CA
[EMAIL PROTECTED]
use Std::Disclaimer;
Re: VM: dynamic swap remapping (patch)
On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> * Jos Backus [EMAIL PROTECTED] [010930 12:55] wrote:
> > AIX has SIGDANGER.
>
> Anyone care to tell me how it works in AIX? If the interface is nice, cloning it would be kind of cool.

I don't currently have access to an AIX system, but

http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm

has some (useful) info.

--
Jos Backus, Santa Clara, CA
[EMAIL PROTECTED]
use Std::Disclaimer;
Re: VM: dynamic swap remapping (patch)
* Jos Backus [EMAIL PROTECTED] [010930 14:35] wrote:
> On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> > * Jos Backus [EMAIL PROTECTED] [010930 12:55] wrote:
> > > AIX has SIGDANGER.
> >
> > Anyone care to tell me how it works in AIX? If the interface is nice, cloning it would be kind of cool.
>
> I don't currently have access to an AIX system, but http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm has some (useful) info.

It sure does! I think I'm going to make a proposal on -arch about this. To be perfectly honest, AIX has a good implementation. I haven't read it all yet, but it doesn't look like it gives the applications a notification when the danger is gone; we'll have to figure that out, or I'll have to read more into this.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
Re: VM: dynamic swap remapping (patch)
:
:In message [EMAIL PROTECTED], Matt Dillon writes:
::    Second, application not always grows to 1G, most of the time it keeps
::    as small as 500M ;). Why should we precommit 1G for 500M data? Doing
::    multi-mmap memory management is additional pain.
:
:Even using file-backed memory is fairly trivial. You don't need to
:do multi-mmap memory management or do any kernel tweaking. Just
:reserve 1G and use a single mmap() and file per process.
:
:I once had a patch to phkmalloc() which backed all malloc'ed VM with
:hidden files in the users homedir. It was written to put the VM
:usage under QUOTA control, but it had many useful side effects as well.
:
:I can't seem to find it right now, but it is trivial to do: just
:replace the sbrk(2) with mmap(). Only downside is the needed
:filedescriptor which some shells don't like.
:
:--
:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
:[EMAIL PROTECTED]       | TCP/IP since RFC 956

    I think the file descriptor problem can be solved easily... simply open the file, mmap() the entire 1G segment for this special application, and then close() the file. Then have sbrk() just eat out of the mapped segment. Alternatively sbrk() could open/mmap/close in large 1MB or 4MB segments, again leaving no file descriptors dangling.

                                                -Matt
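A rough userland sketch of the open/mmap()/close() idea, with an sbrk-like bump allocator eating out of the mapped segment. The `arena` type and both function names are hypothetical, and the sketch assumes the backing file already exists and is at least `size` bytes long; how that file gets created is left out on purpose.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical file-backed arena; allocations are carved off sequentially. */
struct arena {
    char  *base;   /* start of the mapping */
    size_t size;   /* bytes mapped */
    size_t used;   /* bytes handed out so far */
};

/*
 * Map `size` bytes of an existing file and close the descriptor right away.
 * The mapping stays valid after close(), so no descriptor is left dangling.
 * Returns 0 on success, -1 on failure.
 */
int arena_open(struct arena *a, const char *path, size_t size)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)size) {
        close(fd);
        return -1;
    }
    /* MAP_SHARED so dirty pages are written to the file, not to swap. */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    a->base = p;
    a->size = size;
    a->used = 0;
    return 0;
}

/* sbrk-like allocation: round up to 16 bytes; NULL when the arena is full. */
void *arena_alloc(struct arena *a, size_t len)
{
    size_t n = (len + 15) & ~(size_t)15;
    void *p;

    assert(a->used <= a->size);
    if (n < len || n > a->size - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += n;
    return p;
}
```

A real malloc replacement would also need to free and reuse blocks; this only shows that the descriptor can be dropped immediately after the mapping is made.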
Re: VM: dynamic swap remapping (patch)
:> :Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
:> :[EMAIL PROTECTED]       | TCP/IP since RFC 956
:>
:>     I think the file descriptor problem can be solved easily... simply
:>     open the file, mmap() the entire 1G segment for this special application,
:>     and then close() the file. Then have sbrk() just eat out of the mapped
:>     segment. Alternatively sbrk() could open/mmap/close in large 1MB or 4MB
:>     segments, again leaving no file descriptors dangling.
:
:Won't that cause fragmentation? You're forgetting the need to
:ftruncate or pre-zero the file unless that's been fixed.
:
:--
:-Alfred Perlstein [[EMAIL PROTECTED]]

    You have to pre-zero the file. You can do it in reasonably-sized chunks (like 4M) without causing fragmentation. You *CANNOT* use ftruncate() to extend the file - that will virtually guarantee massive fragmentation.

                                                -Matt
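Pre-zeroing in fixed-size chunks can be sketched as below. The function name, the 4 MB constant, and the decision to hand back the open descriptor are assumptions made for the example; the only point taken from the message is that the file is filled by writing real zero blocks rather than being extended with ftruncate().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Chunk size for each write; 4 MB as suggested in the message above. */
#define ZERO_CHUNK ((size_t)4 * 1024 * 1024)

/*
 * Create (or truncate) `path` and fill it with `total` zero bytes by
 * writing ZERO_CHUNK-sized blocks, so the blocks are really allocated.
 * Returns an open descriptor on success, -1 on failure.
 */
int create_prezeroed(const char *path, size_t total)
{
    char *zeros;
    size_t done = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    zeros = calloc(1, ZERO_CHUNK);
    if (zeros == NULL) {
        close(fd);
        return -1;
    }

    while (done < total) {
        size_t want = total - done < ZERO_CHUNK ? total - done : ZERO_CHUNK;
        ssize_t n = write(fd, zeros, want);

        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, retry the same chunk */
        if (n <= 0) {               /* disk full, quota exceeded, ... */
            free(zeros);
            close(fd);
            return -1;
        }
        done += (size_t)n;          /* short writes are simply continued */
    }

    free(zeros);
    return fd;
}
```

A failure here is reported at setup time, for instance when the filesystem or the user's quota is exhausted, which is a much friendlier place to find out than a page fault later on.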
Re: VM: dynamic swap remapping (patch)
In message [EMAIL PROTECTED] Vladimir Dozen writes:
: SIGDANGER is the signal from Irix, AFAIR?

AIX.

Warner
Re: VM: dynamic swap remapping (patch)
On Sunday, 30 September 2001 at 14:55:58 -0500, Alfred Perlstein wrote:
> * Jos Backus [EMAIL PROTECTED] [010930 14:35] wrote:
> > On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> > > * Jos Backus [EMAIL PROTECTED] [010930 12:55] wrote:
> > > > AIX has SIGDANGER.
> > >
> > > Anyone care to tell me how it works in AIX? If the interface is nice, cloning it would be kind of cool.
> >
> > I don't currently have access to an AIX system, but http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm has some (useful) info.
>
> It sure does! I think I'm going to make a proposal on -arch about this, to be perfectly honest, AIX has a good implementation, I haven't read it all yet, but it doesn't look like it gives the applications a notification when the danger is gone, we'll have to figure that out, or I'll have to read more into this.

If it's any help, I have an AIX box here. It belongs to IBM, so I have to respect security issues, but I'll do what I can.

Greg
--
See complete headers for address and phone numbers
Re: VM: dynamic swap remapping (patch)
* Greg Lehey [EMAIL PROTECTED] [010930 22:49] wrote:
> On Sunday, 30 September 2001 at 14:55:58 -0500, Alfred Perlstein wrote:
> > * Jos Backus [EMAIL PROTECTED] [010930 14:35] wrote:
> > > On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> > > > * Jos Backus [EMAIL PROTECTED] [010930 12:55] wrote:
> > > > > AIX has SIGDANGER.
> > > >
> > > > Anyone care to tell me how it works in AIX? If the interface is nice, cloning it would be kind of cool.
> > >
> > > I don't currently have access to an AIX system, but http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm has some (useful) info.
> >
> > It sure does! I think I'm going to make a proposal on -arch about this, to be perfectly honest, AIX has a good implementation, I haven't read it all yet, but it doesn't look like it gives the applications a notification when the danger is gone, we'll have to figure that out, or I'll have to read more into this.
>
> If it's any help, I have an AIX box here. It belongs to IBM, so I have to respect security issues, but I'll do what I can.

Well Jos seems to have provided a pretty interesting document on how it works in AIX, but I was wondering if they do anything wrt low/high watermarks like my idea. Basically you'd like to inform processes that the danger has been alleviated so that they can cautiously start accepting more work rather than freaking out and shutting out clients forever...

This might lead to a situation where SIGDANGER starts getting sent informing that things are looking bleak, then processes start freeing resources, they get the second SIGDANGER to let them know that things are looking ok so they ramp up again, and the cycle repeats. I guess that's not optimal, but I'd like FreeBSD to let processes know that things are looking better so they can go from scrooge mode to thrifty mode.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
Re: VM: dynamic swap remapping (patch)
On Sun, Sep 30, 2001 at 11:41:14PM -0500, Alfred Perlstein wrote:
> > If it's any help, I have an AIX box here. It belongs to IBM, so I have to respect security issues, but I'll do what I can.

I seem to remember that one could set a watermark using the "no" command, but I could be wrong. No AIX to verify this, maybe Greg can.

The link below has some info, too:

http://nscp.upenn.edu/aix4.3html/aixbman/prftungd/tunableaixparms.htm

--
Jos Backus, Santa Clara, CA
[EMAIL PROTECTED]
use Std::Disclaimer;
Re: VM: dynamic swap remapping (patch)
Alfred Perlstein wrote:
> [ ... SIGDANGER ... ]
>
> Well Joe seems to have provided a pretty interesting document on how it works in AIX, but I was wondering if they do anything wrt low/high watermarks like my idea. Basically you'd like to inform processes that the danger has been alleviated so that they can cautiously start accepting more work rather than freaking out and shutting out clients forever...

The process is supposed to return unused memory to the system when it gets the signal, if it can. It's not supposed to shed all load until it gets the "all clear" signal.

I don't know if there are any good books on Windows internals, but the Windows VM system does the same thing: it notifies all kernel subsystems that they need to free up memory, if they can. The VFAT32 IFS will basically return exactly one page out of the many thousands it is using for cache when it gets the request (it is implemented as a callback, which you must provide when you register for VM services).

> This might lead to a situation where SIGDANGER starts getting sent informing that things are looking bleak, then processes start freeing resources, they get the second SIGDANGER to let them know that things are looking ok so they ramp up again and the cycle repeats, I guess that's not optimal, but I'd like FreeBSD to let processes know that things are looking better so they can go from scrooge mode to thrifty mode.

The idea is just to free resources, if you can, and to mark the processes which are precious by whether or not they have a signal handler. A close reading of the other document posted (it seemed to be the admin manual, judging from the URL) will indicate that the follow-on SIGKILL is not sent to processes that have a SIGDANGER handler registered.

Note that this does not mean that your process won't be killed off as a result of a page-not-present fault, so abusing the interface is not really tolerated very well by the system.

I think signalling an all clear is really a bad idea; a soft hysteresis loop is much less prone to pendulum swings than a hard hysteresis loop (lesson #1 in the book "Fuzzy Logic").

-- Terry
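The behaviour Terry describes (keep working, but hand back whatever memory you can when the warning arrives) can be sketched in C. This is only an illustration under stated assumptions: SIGDANGER exists on AIX, so SIGUSR1 stands in for it elsewhere, and the cache, its sizes, and the function names (lowmem_install, cache_fill, lowmem_poll) are all hypothetical. The handler only sets a flag; the memory is released later from normal program context.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* SIGDANGER is AIX-specific; SIGUSR1 is a stand-in so the sketch builds elsewhere. */
#ifdef SIGDANGER
#define LOWMEM_SIGNAL SIGDANGER
#else
#define LOWMEM_SIGNAL SIGUSR1
#endif

#define CACHE_SLOTS 64      /* hypothetical discardable cache */
#define SLOT_BYTES  4096

static void *cache[CACHE_SLOTS];
static volatile sig_atomic_t lowmem_pending;

/* Async-signal-safe handler: it only records that the warning arrived. */
static void on_lowmem(int signo)
{
    (void)signo;
    lowmem_pending = 1;
}

/* Register the handler; returns 0 on success, -1 on failure. */
int lowmem_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_lowmem;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(LOWMEM_SIGNAL, &sa, NULL);
}

/* Populate every empty cache slot; returns the number of slots in use. */
int cache_fill(void)
{
    int used = 0;

    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i] == NULL)
            cache[i] = malloc(SLOT_BYTES);
        if (cache[i] != NULL)
            used++;
    }
    return used;
}

/*
 * Call between units of work. If a warning is pending, give the cache
 * back and keep serving requests. Returns the number of slots released.
 */
int lowmem_poll(void)
{
    int released = 0;

    if (!lowmem_pending)
        return 0;
    lowmem_pending = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i] != NULL) {
            free(cache[i]);
            cache[i] = NULL;
            released++;
        }
    }
    assert(released <= CACHE_SLOTS);
    return released;
}
```

A server loop would call lowmem_poll() after each request. Nothing here waits for an "all clear": the cache simply refills on demand, which is the soft behaviour argued for above.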
Re: VM: dynamic swap remapping (patch)
Vladimir Dozen([EMAIL PROTECTED])@2001.09.29 15:59:41 +:
> ehlo.
>
> (Sorry for the long pre-history; I believe it is necessary.)
>
> My current employer develops large CORBA-based data mining servers. They usually run under HP-UX, but, following the current fashion for building processing farms, I was tasked with building a version for the free unices. The initial platform was Linux, and the build itself went smoothly, but very soon we ran into a problem. We use pthreads; to be more precise, we use a thread-per-client model. This means that at any one time we may be computing anywhere from one to a few tens of client sessions. Each session may eat as much as 1G of address space, and even more (actually, there are no limits except hardware ones).

IIRC from the problems we had with a project some while ago, mm might help [http://www.engelschall.com/sw/mm/]. It wraps malloc() and friends into a neat API, including preallocation in fs space (the features are somewhat OS dependent) and fast shared memory.

/k
VM: dynamic swap remapping (patch)
ehlo.

(Sorry for the long pre-history; I believe it is necessary.)

My current employer develops large CORBA-based data mining servers. They usually run under HP-UX, but, following the current fashion for building processing farms, I was tasked with building a version for the free unices. The initial platform was Linux, and the build itself went smoothly, but very soon we ran into a problem. We use pthreads; to be more precise, we use a thread-per-client model. This means that at any one time we may be computing anywhere from one to a few tens of client sessions. Each session may eat as much as 1G of address space, and even more (actually, there are no limits except hardware ones).

The problem was how Linux (and FreeBSD, as we soon discovered) treats the out-of-memory (OOM) situation. Under HP-UX memory is precommitted (i.e., swap is reserved for every allocated page), so as soon as we hit OOM, malloc() returns NULL or operator new() throws an exception. That gives us the opportunity to unwind the stack, tell the client we cannot perform his request right now and, most important, continue executing other clients' requests.

Linux and FreeBSD simply killed our whole process, and we had no chance to learn that we were out of memory! All the data of all our clients (some of whom had been processing for days) was lost. :( Very unfriendly, and, what may be more important, this kind of interaction (the absence of it, really) between OS and application reduces the chances of porting really large applications to FreeBSD, because no one can trust an OS that can simply trash user data with no warning.

It seems to me the OS must use every chance to continue executing an application instead of killing it. I do think it is the Right Way.

I have written a patch that modifies the behaivour (have I spelled this word right? ;) of the VM when we are out of memory. Instead of killing the largest process, we remap parts of its address space onto temporary files (exactly as HP-UX does when swapping into a directory is turned on).
Of course, we cannot do it when we are absolutely out of swap; we do it a bit earlier, when the swap daemon finds that free swap pages have dropped to nswap_lowat. I called this patch OOM Keeper, as opposed to the OOM Killer used in Linux (yah, I prefer BSD).

Here is the generic algorithm:

1. The swap daemon finds vm_swap_size < nswap_lowat; it calls vm_oomkeeper_swap_almost_full().
2. vm_oomkeeper_swap_almost_full() searches for the process having the largest vm_object of type OBJT_SWAP, and sends it a signal (proposed name: SIGXMEM).
3. The process gets the signal and calls a special syscall (proposed name: remap).
4. (We are in the kernel again; this time curproc is our big process, in vm_oomkeeper_process.) While free swap blocks are lower than nswap_hiwat, we do the following:
   a) find the largest object of type OBJT_SWAP in the current process
   b) create a temporary file and unlink() it
   c) save the first 1M of the object into the file
   d) cut the first 1M of the map (here we get free swap blocks)
   e) mmap the file onto the place where the data was before.

If any of the above fails, the old killproc() will trigger, so the system will still be able to drop buggy processes.

Note: the process now has a chance to do something in an OOM situation. It can simply ignore the signal, and it will be killed soon. It can call remap(), and it will be remapped onto files -- this will slow things down, but will allow processing to continue. It can free some space (e.g., by unmapping anonymous mmaps). Finally, it can save its current data and terminate, if none of the above is acceptable.

Note also that ulimits and quotas are in effect, since the files are created under the process's credentials.

This patch was tested on my home PC with 64M RAM and 64M swap; I was able to run processes with _committed_ address space up to 512M in various scenarios: large malloc then commit, small incremental mallocs with immediate commit, random commit, parallel runs of two or three such memory eaters, etc. No doubt it requires additional testing.
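Steps b) through e) of the algorithm can be approximated from userspace, which may help in following what the kernel side does. The following is a minimal C sketch and not the vm_oomkeeper.c patch, which is not shown in this thread. The function name remap_to_file and the file name pattern are hypothetical; $REMAPDIR is the variable proposed in the post, with /tmp as an assumed fallback. It assumes that no other thread writes to the region while it is being copied.

```c
#define _DEFAULT_SOURCE 1   /* glibc: expose mkstemp(); harmless elsewhere */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move the page-aligned region [addr, addr + len) from anonymous memory
 * onto an unlinked temporary file mapped at the same address.
 * Returns 0 on success, -1 on failure.
 */
int remap_to_file(void *addr, size_t len)
{
    const char *dir = getenv("REMAPDIR");
    long pagesz = sysconf(_SC_PAGESIZE);
    char path[1024];
    size_t done = 0;
    int fd, n;

    assert(addr != NULL && len > 0 && pagesz > 0);
    assert((uintptr_t)addr % (uintptr_t)pagesz == 0);
    assert(len % (size_t)pagesz == 0);

    if (dir == NULL)
        dir = "/tmp";               /* assumed default, not from the post */
    n = snprintf(path, sizeof(path), "%s/oomk.XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fd = mkstemp(path);             /* step b: create a temporary file ... */
    if (fd < 0)
        return -1;
    unlink(path);                   /* ... and unlink() it right away */

    while (done < len) {            /* step c: save the data into the file */
        ssize_t w = write(fd, (char *)addr + done, len - done);
        if (w <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)w;
    }

    /*
     * Steps d and e in one call: MAP_FIXED replaces the anonymous pages
     * with the file-backed mapping at the same address.
     */
    if (mmap(addr, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        close(fd);
        return -1;
    }
    close(fd);                      /* the mapping keeps the file alive */
    return 0;
}
```

Because the file is created by the process itself, quota and ulimit checks apply to it, matching the note about credentials above. The kernel version works on 1M pieces of the largest swap-backed object; this sketch takes whatever region the caller passes in.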
The patch lives entirely in a separate file -- vm_oomkeeper.c -- and it requires only a single intrusion point in the current code: one added line in swap_pager.c:swp_sizechk(). But, to fully implement it, I have to add a new signal and a new syscall to the system. I do not want to go that far until I know whether my patch is acceptable to the FreeBSD team.

To make it fully controllable it would also be useful to set nswap_{hi,lo}wat via the sysctl interface. In any case, when using OOMK these two should be raised about 4 to 8 times (from 400K to 2-4M).

It would also be valuable if the default action for SIGXMEM were not SIG_IGN but calling remap(). This requires patching libc. A special environment variable ($REMAPDIR) might be used to set the location of the temporary files.

I can send vm_oomkeeper.c on request (it is 12K long, and I do not want to
Re: VM: dynamic swap remapping (patch)
* Vladimir Dozen [EMAIL PROTECTED] [010929 06:57] wrote:
> ehlo.
>
> (Sorry for long pre-history, I believe it is necessary.)
> [snip]
> Comments?

Wow! This is really awesome work you've done. Perhaps you can put the patch up at a URL someplace? If not, mail it to me in private and I can put it up for people to see.

One thing though: I think that this behaviour should be toggled via a sysctl, but I think I can manage doing that for you.

One other question: why not just set an option to make FreeBSD not overcommit? I've always wanted the ability to turn off overcommit for exactly the same reasons you do.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.'
Re: VM: dynamic swap remapping (patch)
On Sat, Sep 29, 2001 at 07:10:24AM -0500, Alfred Perlstein wrote:
> * Vladimir Dozen [EMAIL PROTECTED] [010929 06:57] wrote:
> > ehlo.
> >
> > (Sorry for long pre-history, I believe it is necessary.)
> > [snip]
> > Comments?
>
> Wow! This is really awesome work you've done. Perhaps you can put the patch up at a URL someplace? If not, mail it to me in private and I can put it up for people to see.
>
> One thing though: I think that this behaviour should be toggled via a sysctl, but I think I can manage doing that for you.
>
> One other question: why not just set an option to make FreeBSD not overcommit? I've always wanted the ability to turn off overcommit for exactly the same reasons you do.

FWIW: Tru64 has had this capability since day one. You can select swap-overcommit mode by removing a symlink (/sbin/swapdefault -> /dev/foob), where /dev/foob is the primary swap partition.

W/

--
Bulte, Arnhem, The Netherlands
Re: VM: dynamic swap remapping (patch)
:> overcommit? I've always wanted the ability to turn off overcommit
:> for exactly the same reasons you do.
:
:FWIW: Tru64 has had this capability since day one. You can select
:swap-overcommit mode by removing a symlink (/sbin/swapdefault -> /dev/foob),
:where /dev/foob is the primary swap partition.
:
:W/

Well, the overcommit argument comes up once or twice a year. Frankly I don't see much of a point to it. While it is true that you could implement a signal, the plain fact of the matter is that having to deal with the possibility in a program at the N points (generally hundreds of points) where that program allocates memory, either directly or indirectly, virtually guarantees that you will introduce bugs into the system. You also cannot guarantee that your process will have time to clean up prior to the system killing it, nor can you guarantee that all the standard system utilities and daemons will be able to gracefully handle the out-of-memory condition. In other words, you could implement the signal and even have the program use it, but you will still likely leave gaping holes in the implementation that will result in lost data.

It is much easier to manage memory manually. For example, if these programs require 1G of independent memory to run, it ought to be a fairly simple matter to create a 1GB file for each process (using dd rather than ftruncate() to create the file so the blocks are preallocated), mmap() it using PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NOSYNC, and do your memory management out of that. The memory space will be backed by the file rather than by swap. You get all the benefits of the standard overcommit capabilities of the system as well as the ability to pre-reserve the main workspace for the programs, and you automatically get persistent storage for the data.

Problem solved.

-Matt
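Matt's suggestion can be sketched in C as follows. This is an illustration under assumptions, not code from the thread: workspace_create and the bump allocator (struct arena, arena_alloc) are hypothetical names, the file is filled by write() calls inside the program instead of by dd, and MAP_NOSYNC is FreeBSD-specific, so it is defined to 0 where the system does not provide it.

```c
#define _DEFAULT_SOURCE 1   /* glibc feature macro; harmless elsewhere */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0        /* FreeBSD-specific flag; no-op in this sketch elsewhere */
#endif

/*
 * Create `path`, write every block of it (as dd would, unlike
 * ftruncate(), which may leave a sparse file), and map it read/write.
 * Returns the mapping base, or MAP_FAILED on error.
 */
void *workspace_create(const char *path, size_t len)
{
    static const char zeros[65536];
    size_t done = 0;
    void *base;
    int fd;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return MAP_FAILED;

    while (done < len) {
        size_t want = len - done < sizeof(zeros) ? len - done : sizeof(zeros);
        ssize_t w = write(fd, zeros, want);
        if (w <= 0) {               /* e.g. disk full: fail before any work starts */
            close(fd);
            return MAP_FAILED;
        }
        done += (size_t)w;
    }

    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOSYNC, fd, 0);
    close(fd);                      /* the mapping keeps the file open */
    return base;
}

/* A deliberately tiny bump allocator over the reserved workspace. */
struct arena {
    char  *base;
    size_t len;
    size_t used;
};

/* Returns 16-byte-aligned space from the arena, or NULL when it is exhausted. */
void *arena_alloc(struct arena *a, size_t n)
{
    void *p;

    if (n > a->len)
        return NULL;
    n = (n + 15) & ~(size_t)15;
    if (n > a->len - a->used)
        return NULL;                /* caller can refuse the request cleanly */
    p = a->base + a->used;
    a->used += n;
    return p;
}
```

Running out of space now shows up as a NULL return from arena_alloc(), or as a failed write() during setup, so the program can turn one client away instead of being killed. A real allocator would also need free(), locking, and so on, which is the cost Vladimir's reply objects to.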
Re: VM: dynamic swap remapping (patch)
On Sat, 29 Sep 2001, Vladimir Dozen wrote:
> I have written a patch that modifies the behaivour (have I spelled this word right? ;) of the VM when we are out of memory. Instead of killing the largest process, we remap parts of its address space onto temporary files (exactly as HP-UX does when swapping into a directory is turned on).

This is not instead of killing; it is just a way to delay the killing of processes longer. Once your disk is full you'd still run into the choice between a deadlock and a kill...

It's an awesome way of delaying the out-of-memory problem, though, because a suspended application won't be able to allocate anything more, giving the system a better chance to let the running apps run to completion. Alternatively, the one leaky application is suspended and the rest of the system continues to run without any problems.

In short, I like it ;)

regards,

Rik
--
IA64: a worthy successor to i860.
http://www.surriel.com/ http://distro.conectiva.com/
Re: VM: dynamic swap remapping (patch)
* Vladimir Dozen [EMAIL PROTECTED] [010929 14:38] wrote:
> P.S. Anyway, I do NOT insist that my solution is better, or even that it is good for anything at all. It was fun for me to hack on the BSD kernel, it was an interesting challenge, and I feel the need to share the results with others. At worst, I will recommend that our customer set up the processing farm under FreeBSD with the patch applied.

I'm really impressed with the work you put into this, but it seems that you've tried to tackle two problems at the same time, and by tying them together made it less flexible and possibly more error prone.

My suggestion (but not my final say, I'm still open to ideas):

Implement a memory status signal to notify processes of changes in the relative amount of system memory. When memory reaches a low or high watermark, the signal is broadcast to all running processes. The default disposition will be to ignore the signal. The signal will be named SIGMEMINFO (SIGXfoo means 'process has exceeded resource foo'). The signal will pass information via the siginfo struct such that the process can determine whether the system has just exceeded the low watermark (danger) or has reclaimed down to the high watermark (enough free memory).

This is just to provide processes with a warning to scale back consumption, exit, or release resources. The good part is that it's broadcast, and all interested parties will do something, hopefully the right thing.

To achieve nearly the same effect as your patch, I would implement the above low/high watermark notification, then either:

a) over-allocate swap a bit and set the low watermark carefully, or
b) make the following enhancement: provide a system whereby you can swap to the filesystem without additional upcalls/syscalls from userspace; basically, provide some means of paging to the filesystem automatically. Then set your low watermark to the size of your swap partition; now your system will alert your processes and automatically swap _anyone_ to the filesystem.

I really think that this would be more flexible and still allow you to achieve what you want... What do you think?

--
-Alfred Perlstein [[EMAIL PROTECTED]]
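What the receiving side of such a notification might look like can be sketched in C with an SA_SIGINFO handler. Everything named here is an assumption for illustration: SIGMEMINFO was only proposed in this thread, so SIGUSR2 stands in for it; the MEM_LOW/MEM_OK codes are hypothetical; and meminfo_notify() plays the role of the kernel by calling sigqueue(), a POSIX function that not every system provides.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* SIGMEMINFO is only a proposal; SIGUSR2 is a stand-in for this sketch. */
#define SIGMEMINFO_STANDIN SIGUSR2

/* Hypothetical payload values carried in the siginfo struct. */
enum mem_level { MEM_UNKNOWN = 0, MEM_LOW = 1, MEM_OK = 2 };

static volatile sig_atomic_t current_level = MEM_UNKNOWN;

/* Handler: record which watermark was crossed, nothing else. */
static void on_meminfo(int signo, siginfo_t *info, void *ctx)
{
    (void)signo;
    (void)ctx;
    current_level = info->si_value.sival_int;
}

/* Opt in to notifications (the proposed default would be to ignore them). */
int meminfo_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_meminfo;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    return sigaction(SIGMEMINFO_STANDIN, &sa, NULL);
}

/* Stand-in for the kernel broadcast: deliver one level to one process. */
int meminfo_notify(pid_t pid, enum mem_level level)
{
    union sigval v;

    assert(level == MEM_LOW || level == MEM_OK);
    v.sival_int = level;
    return sigqueue(pid, SIGMEMINFO_STANDIN, v);
}

/* Policy check for a server loop: take new work unless memory is low. */
int meminfo_accepting_work(void)
{
    return current_level != MEM_LOW;
}
```

A server would call meminfo_accepting_work() before dequeuing a request, scaling back after MEM_LOW and resuming after MEM_OK. Whether the second notification is wise at all is disputed elsewhere in the thread.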
Re: VM: dynamic swap remapping (patch)
ehlo.

> You also cannot guarantee that your process will have time to clean up prior to the system killing it, nor can you guarantee that all the standard system utilities and daemons will be able to gracefully handle the out-of-memory condition. In other words, you could implement the signal and even have the program use it, but you will still likely leave gaping holes in the implementation that will result in lost data.

Actually, things as I coded them are better suited precisely for poorly written daemons that never check the malloc result. Precommit will just kill them as soon as malloc() returns NULL and they dereference it. Killproc() will kill them too. Remapping will save them. Disk space is now large enough to let them live until root notices that they have grown too much and does something (kills them manually, probably ;).

> It is much easier to manage memory manually. For example, if these programs require 1G of independent memory to run, it ought to be a fairly simple matter to create a 1GB file for each process (using dd rather than ftruncate() to create the file so the blocks are preallocated), mmap() it using PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NOSYNC, and do your memory management out of that.

First of all, it is NOT easier. Doing your own memory management is not that simple, especially with threads and SMP -- we have seen a 50% performance impact when two threads on two processors were doing intensive allocations (it was not FreeBSD, and these were kernel threads).

Second, the application does not always grow to 1G; most of the time it stays as small as 500M ;). Why should we precommit 1G for 500M of data? Doing multi-mmap memory management is additional pain.

Third, swapping to a device is faster, and, while we have enough swap, I would prefer to swap there. Even a few percent matters for a 5-day computation.

> Problem solved.

If I'm the developer -- probably, yes. What if I'm the system administrator, and have to run something large _and important_? The day I notice that the monster is creating swap files, I'll know I have to add RAM. I will have time, since it still works; it was not killed.

P.S. Anyway, I do NOT insist that my solution is better, or even that it is good for anything at all. It was fun for me to hack on the BSD kernel, it was an interesting challenge, and I feel the need to share the results with others. At worst, I will recommend that our customer set up the processing farm under FreeBSD with the patch applied.

--
dozen @ home