Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote: On Wed, 24 Sep 2003, Etienne Goyer wrote: On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote: However, I have looked into this and to my surprise, Linux is indeed restarting the system calls instead of returning with EINTR. However, the answer here is to set up the alarm() handler with sigaction without setting SA_RESTART, not to jump through select() hoops or make nonblocking lock attempts. I consult my local Unix guru on the subject, and he point me to Advanced Programming in the Unix Environment by W. Richard Stevens, section 10.5 (page 275 in my edition) Interrupted System Calls. The revelancy of this text may be questionned since it was published in 1992, but the behavior of SVR4 vs BSD differ substantially in this regard according to it. My guru concluded that system call interuption by signal is an assumption that may lead to portability problem. In the fud case, he is right. With SYSV you will get the interrupted system call, unless you tell it somehow not to do it (the SA_RESTART stuff). If we are to accomodate the BSDs, we can: 1. Let them have the short end of the stick (they get what we have now, that is, deadlocks). Not good. 2. Let them use the low-performance non-blocking mode (good solution). 3. Find out how to get EINTR from BSD (it should be possible nowadays). As for old legacy systems (too old BSDs, SunOS 4, very old Linux), well, I am inclined to simply ignore them. Let them use solution (2) above. That is the least of their problems, given the amount of annoying and dangerous bugs those old kernels have. #2 doesn't sound like a good option for anyone. It is not clear from the context in this message whether you want restartable system calls or not. BSD4.2 introduced restartable system calls. BSD4.3 introduced siginterrupt() to disable restartable system calls for a certain signal. So even with SunOS 4, you'd have the option. Newer BSDs have SA_RESTART via sigaction(). 
sigaction() is supposed to conform to the POSIX.1 standard, and therefore probably works the same as Linux. -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh Tom
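
For the BSD-style systems Tom mentions, a minimal sketch of the siginterrupt() route might look like the following. It assumes a platform that still declares siginterrupt() (POSIX marks it obsolescent in favour of sigaction()); the function name blocked_read_errno_siginterrupt and the pipe-based demo are hypothetical, not taken from Cyrus.

```c
#define _XOPEN_SOURCE 500   /* assumption: request the XSI interfaces, incl. siginterrupt() */
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Empty handler: its only job is to make the blocked call return. */
static void on_alarm(int signo)
{
    (void)signo;
}

/*
 * Hypothetical demo: block in read() on an empty pipe for `seconds`
 * seconds. Returns the errno read() failed with, 0 if read() did not
 * fail, or -1 on a setup error.
 */
int blocked_read_errno_siginterrupt(unsigned int seconds)
{
    int fds[2];
    char c;
    int result = 0;

    if (pipe(fds) < 0)
        return -1;

    /* Install the handler the old way, then ask for interrupted
     * (non-restarted) system calls for this one signal. */
    if (signal(SIGALRM, on_alarm) == SIG_ERR ||
        siginterrupt(SIGALRM, 1) < 0) {
        result = -1;
    } else {
        alarm(seconds);
        if (read(fds[0], &c, 1) < 0)   /* nothing ever writes to the pipe */
            result = errno;
        alarm(0);
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

siginterrupt() has to be called after the handler is installed, because it modifies the action currently registered for the signal.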
Re: followup: stuck lmtpd processes
On Fri, 26 Sep 2003, Tom wrote: On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote: With SYSV you will get the interrupted system call, unless you tell it somehow not to do it (the SA_RESTART stuff). If we are to accomodate the BSDs, we can: 1. Let them have the short end of the stick (they get what we have now, that is, deadlocks). Not good. 2. Let them use the low-performance non-blocking mode (good solution). 3. Find out how to get EINTR from BSD (it should be possible nowadays). As for old legacy systems (too old BSDs, SunOS 4, very old Linux), well, I am inclined to simply ignore them. Let them use solution (2) above. That is the least of their problems, given the amount of annoying and dangerous bugs those old kernels have. #2 doesn't sound like a good option for anyone. It is not clear from It is tried and true, and the patch is here NOW. the context in this message whether you want restartable system calls or not. BSD4.2 introduced restartable system calls. BSD4.3 introduced siginterrupt() to disable restartable system calls for a certain signal. So even with SunOS 4, you'd have the option. I have the code ready for POSIX. If anyone who uses SunOS and BSD4.2/4.3 wants to contribute the autoconf code and changes to get non-restartable syscalls for a signal, they're welcome... I don't like to write code I can't test. -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh
Re: followup: stuck lmtpd processes
On Wed, Sep 24, 2003 at 08:01:11PM +0100, Patrick Welche wrote:
> I don't understand. The only alarm() business I can see in imap/fud.c is around recvfrom, which at least according to its man page says
>
>   [EINTR] The receive was interrupted by delivery of a signal before any data were available.
>
> What system calls are there? I was looking for syscall()...

Sorry, I may have got the nomenclature wrong since I am not very familiar with Unix C programming. However, the man page is wrong about EINTR, at least as far as RedHat 7.x is concerned. In a Murder environment, when following a referral:

1. The alarm() is set.
2. recvfrom() is called to read the UDP reply from the referred server.
3. If the referred server is not answering:
   3.1 The SIGALRM handler is eventually executed.
   3.2 fud goes back to waiting in recvfrom() and deadlocks.

This is reproducible in my environment. Maybe I got something wrong in my crude syslog() debugging, but I am confident that this is what happens. The patch I posted earlier fixes the problem for me.

--
Etienne Goyer    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com    [EMAIL PROTECTED]
Re: followup: stuck lmtpd processes
On Thu, 25 Sep 2003, Etienne Goyer wrote:
> However, the man page is wrong about EINTR, at least as far as RedHat 7.x is concerned. In a Murder environment, when following a referral:

No, it isn't wrong. The problem is signals that are configured via signal() instead of sigaction(). On Linux, signal() implicitly sets the SA_RESTART flag, so the restart behavior comes from how the signal handler was installed, not from the system call. If SA_RESTART is not set, then the system call is interrupted (and returns EINTR) as it should be.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper
Re: followup: stuck lmtpd processes
Oops. Sorry, my mailbox went temporarily over quota and the delivery of the original thread was deferred until after I had read and responded to the followup.

It looks like the locking mechanism is working correctly here and the bug is really in the network timeout (or in the implementation that allows a network call after the append setup is called).

The patch I wrote still might help you since it would prevent an individual user's problem from taking down the mail system. The user's mailbox would remain inaccessible, but the lmtpd processes attempting delivery would exit with errors and mail delivery as a whole would proceed.

I still believe that a system that uses locking with no timeout mechanism is inherently fragile. A single programming error can lead to a cascade failure that takes down the entire mail system as more and more processes hang up trying to get the lock.

Just my two cents,
John

Andrew Morgan wrote:
> On Tue, 23 Sep 2003, John Wade wrote:
> > Hi Andrew,
> >
> > I was the one who wrote the message you found. I finally came to the conclusion that the flat file locking mechanism is somewhat broken in Cyrus, but I was never a good enough C programmer to pin down what was happening. (The mmap stuff makes it really tricky to debug.) I wanted to blame it on the Linux kernel, but I know that others have experienced the same problems in Solaris. I finally gave up and wrote a locking timeout patch for 2.0.16. See http://www.oakton.edu/~jwade/cyrus/ for the patch and full details.
> >
> > A number of other folks have tried this patch successfully on 2.0.16 and 2.1.x, and I know it has resolved our problem. If you can solve the particular bug that causes this, more power to you; if not, my workaround resolves a number of possible deadlock issues.
> >
> > Enjoy,
> > John
>
> Hey John,
>
> Thanks for that message. If you've read a little further in your info-cyrus messages, you'll see that I apparently have hit upon a different bug than the one you found (I think).
>
> Your page was instrumental in helping me track down the source of the problem though. It turns out I had an imaps process that hung onto the lock on the user's quota file. Apparently it obtained the lock, then went off to read from the network connection and never came back.
>
> I think your patch would fix the problem where a lot of processes are contending for a lock (by making them retry), but it wouldn't help if a single process keeps the lock indefinitely. Ideally it should not be possible for a process to get hung while it is holding the lock, but that will require some careful programming in this particular case.
>
> In the meantime, I'll have to keep an eye on the system. Thanks again for your debugging clues...
>
> Andy
Re: followup: stuck lmtpd processes
On Tue, 23 Sep 2003, Andrew Morgan wrote: I think your patch would fix the problem where are lot of processes are contending for a lock (by making them retry), but it wouldn't help if a single process keeps the lock indefinately. I agree. The whole act of retrying for a lock is pretty silly, when you think about it. The kernel is responsible for waking processes up when they are blocking on a lock and it becomes available. If this isn't happening (causing the need to do locks in a nonblocking fashion) then something is wrong with the *kernel* not the application. The exponential backoff is pretty silly from a performance standpoint as well. The bug where the quota file stays locked is something else entirely. -Rob -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456 Research Systems Programmer * /usr/contributed Gatekeeper
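
To make the two approaches under discussion concrete, here is a rough sketch contrasting a blocking fcntl() lock (the kernel wakes the waiter) with a non-blocking retry loop that backs off exponentially. It is not John Wade's patch; the function names, the 10 ms starting delay and the whole-file lock range are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Fill in a request for a write lock on the whole file. */
static void whole_file_write_lock(struct flock *fl)
{
    memset(fl, 0, sizeof(*fl));
    fl->l_type = F_WRLCK;
    fl->l_whence = SEEK_SET;
    fl->l_start = 0;
    fl->l_len = 0;              /* 0 means "through end of file" */
}

/* Blocking variant: sleeps in the kernel until the lock is granted.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
int lock_blocking(int fd)
{
    struct flock fl;

    whole_file_write_lock(&fl);
    return fcntl(fd, F_SETLKW, &fl);
}

/* Retry variant: polls with F_SETLK, doubling the sleep between tries.
 * Returns 0 on success, -1 with errno = EAGAIN once max_tries is used up. */
int lock_retry_backoff(int fd, int max_tries)
{
    struct flock fl;
    struct timespec ts;
    long delay_ms = 10;         /* assumed starting delay */
    int i;

    whole_file_write_lock(&fl);
    for (i = 0; i < max_tries; i++) {
        if (fcntl(fd, F_SETLK, &fl) == 0)
            return 0;
        if (errno != EACCES && errno != EAGAIN)
            return -1;          /* a real error, not lock contention */
        ts.tv_sec = delay_ms / 1000;
        ts.tv_nsec = (delay_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);
        delay_ms *= 2;
    }
    errno = EAGAIN;
    return -1;
}
```

The retry variant can sleep well past the moment the lock is released, which is the performance cost Rob objects to; the blocking variant has no such delay but, on its own, no timeout either.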
Re: followup: stuck lmtpd processes
Hi,

I don't have time this morning to have a look at your patch and understand the issue, but it reminded me of another bug I found a few months ago. It may or may not relate to the problem you are fixing. I just think you might be interested in knowing.

It's the timeout part of your problem that rang a bell for me. In Linux, a system call interrupted by alarm() does not return; the SIGALRM handler is executed, then the syscall is retried. If the program you are trying to troubleshoot depends on alarm() to interrupt a blocking system call, it will deadlock in Linux (at least, it does in RedHat 7.3).

I stumbled on this bug in the fud server and submitted a patch, but it was not considered AFAIK since it was not reproducible in Solaris (duh!) and Linux was vaguely blamed while nothing was being done to address the issue.

Sorry if this is unrelated to your problem. I thought you might be interested in my experience in case it is.

On Tue, Sep 23, 2003 at 10:35:04PM -0500, John Wade wrote:
> Hi Andrew,
>
> I was the one who wrote the message you found. I finally came to the conclusion that the flat file locking mechanism is somewhat broken in Cyrus, but I was never a good enough C programmer to pin down what was happening. (The mmap stuff makes it really tricky to debug.) I wanted to blame it on the Linux kernel, but I know that others have experienced the same problems in Solaris. I finally gave up and wrote a locking timeout patch for 2.0.16. See http://www.oakton.edu/~jwade/cyrus/ for the patch and full details.
>
> A number of other folks have tried this patch successfully on 2.0.16 and 2.1.x, and I know it has resolved our problem. If you can solve the particular bug that causes this, more power to you; if not, my workaround resolves a number of possible deadlock issues.
>
> Enjoy,
> John
>
> Andrew Morgan wrote:
> > Following up on my previous post about stuck lmtpd processes.
> >
> > I found this incredibly detailed faq at:
> > http://www.faqchest.com/prgm/cyrus-l/cyrus-01/cyrus-0111/cyrus-011102/cyrus0023_33254.html
> >
> > This isn't exactly the same problem, but the steps on that page helped me figure out that they are all stuck trying to get a lock on:
> >
> > /private/cyrus/mail/k/user/krolickp/cyrus.header
> >
> > Looking at /proc/locks shows:
> >
> > 7: POSIX ADVISORY WRITE 21903 08:11:42107658 0 EOF d23895e0 c3217f44 c510e4c4 ccbf076c
> > 7: -> POSIX ADVISORY WRITE 32485 08:11:42107658 0 EOF ccbf0760 ee36ac44 f3bb26a4 d23895e0 ee36ac4c
> > 7: -> POSIX ADVISORY WRITE 1802 08:11:42107658 0 EOF ee36ac40 c050ea04 ccbf0764 d23895e0 c050ea0c
> > 7: -> POSIX ADVISORY WRITE 1217 08:11:42107658 0 EOF c050ea00 ee36a344 ee36ac44 d23895e0 ee36a34c
> > ...
> >
> > I don't see how this deadlock occurred, but I'm willing to help debug it.
> >
> > Andy

--
Etienne Goyer    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com    [EMAIL PROTECTED]
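
As a complement to reading /proc/locks by hand, the holder of a conflicting POSIX lock can also be queried from C with fcntl(F_GETLK). The sketch below is only an illustration; lock_holder is a hypothetical helper, and it probes for a whole-file write lock, which is an assumption about how the file is locked.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel which process, if any, holds a lock that would block
 * a whole-file write lock on fd.
 * Returns the holder's PID, 0 if nothing conflicts, -1 on error.
 * Locks held by the calling process itself are never reported.
 */
pid_t lock_holder(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;        /* the lock we would like to take */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    if (fcntl(fd, F_GETLK, &fl) < 0)
        return -1;
    if (fl.l_type == F_UNLCK)   /* kernel says the lock could be granted */
        return 0;
    return fl.l_pid;
}
```

Run against a file such as the cyrus.header above, the returned PID would correspond to the first entry of the /proc/locks listing, the one without the "->" waiter marker.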
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Rob Siemborski wrote:
> think about it. The kernel is responsible for waking processes up when they are blocking on a lock and it becomes available. If this isn't happening (causing the need to do locks in a nonblocking fashion) then something is wrong with the *kernel* not the application.

Agreed, but if we are going to keep the blocking-on-lock behaviour (and I know we are ;-)), we really, really should have a way to timeout and kill the process if that lock does not release after a while.

Resilience IS necessary... As an admin, I want to know there are problems from syslog events, not because the whole system stopped. Right now, at least in Linux (which DOES have kernel/glibc bugs in that area) that means we end up needing the slow-as-hell backoff non-blocking locks stuff.

--
One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote: Agreed, but if we are going to keep the blocking-on-lock behaviour (and I know we are ;-)), we really, really should have a way to timeout and kill the process if that lock does not release after a while. Resilience IS necessary... As an admin, I want to know there are problems from syslog events, not because the whole system stopped. Right now, at least in Linux (which DOES have kernel/glibc bugs in that area) that means we end up needing the slow-as-hell backoff non-blocking locks stuff. As I understand it (based on your comments to Bug 1177), just setting an alarm() around the flock/fcntl calls isn't good enough to solve the Linux problem. -Rob -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456 Research Systems Programmer * /usr/contributed Gatekeeper
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Rob Siemborski wrote: On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote: Agreed, but if we are going to keep the blocking-on-lock behaviour (and I know we are ;-)), we really, really should have a way to timeout and kill the process if that lock does not release after a while. Resilience IS necessary... As an admin, I want to know there are problems from syslog events, not because the whole system stopped. Right now, at least in Linux (which DOES have kernel/glibc bugs in that area) that means we end up needing the slow-as-hell backoff non-blocking locks stuff. As I understand it (based on your comments to Bug 1177), just setting an alarm() around the flock/fcntl calls isn't good enough to solve the Linux problem. It is not a general solution when you hit glibc/kernel bugs, but I can certainly live with it IF I manage to track down a version of glibc and kernel that won't deadlock, that we can recommend. Either that, or allow for runtime-switchable behaviours (I am willing to code this). Now, I need to find some time to write a fctnl/flock deadlock test case. If anyone has one already, please send it my way... -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh
Re: followup: stuck lmtpd processes
I just wanted to add something to this discussion...

First of all, we see the problem in Tru64 as well. When we upgraded to the 2.2 series, we put in the locking patch that John described below. This has helped us, but the locking problem has *not* gone away... in fact, it does a better job of *hiding* the problem than fixing it... besides, doing some kind of backoff timeout mechanism doesn't solve the problem that some other process has permanently placed a lock on a file; all it does is prevent lots of LMTP processes from stacking up *because* some other process permanently locked a file.

As a result, we still experience the problem, and it manifests itself as a particular user being unable to receive any email... and until that user actually notices that this is happening, we may not notice. The log files (the syslog mail log particularly) do give some clues, however, such as a lot of System I/O errors when talking to LMTP for a particular user.

When looking at what file the processes are all waiting to get a lock on, it usually turns out to be the cyrus.header file and not the quota file. Is this still the same bug described by Rob on bugzilla? Does it have to be the quota file?

Also, when we find the specific imaps process that happens to have the cyrus.header lock file opened for writing and has it locked, if we kill it off, we find that the write lock goes to another imaps process or to one of the LMTP processes and gets stuck there... we kill that one off and it goes to the next one and gets stuck. We never saw a case where all the other processes became unstuck and the problem went away.

As a consequence, the only solution we have when we see the problem is to restart the Cyrus server (we usually wait until after work hours at least). I am not convinced the patch described below has helped us much, as when we saw the LMTP processes stacking up, it was right in our face and we could deal with it sooner rather than later.
Anyways, those are my thoughts on the subject. Scott --On Tuesday, September 23, 2003 10:35 PM -0500 John Wade [EMAIL PROTECTED] wrote: Hi Andrew, I was the one who wrote the message you found. I finally came to the conclusion that the flat file locking mechanism is somewhat broken in Cyrus, but I was never a good enough C programmer to pin down what was happening. (The mmap stuff makes it really tricky to debug.)I wanted to blame it on the Linux kernel, but I know that others have experienced the same problems in Solaris. I finally gave up and wrote a locking timeout patch for 2.0.16. see http://www.oakton.edu/~jwade/cyrus/ for the patch and full details A number of other folks have tried this patch successfully on 2.0.16 and 2.1.x, and I know it has resolved our problem. If you can solve the particular bug that causes this, more power to you, if not, my work around resolves a number of possible deadlock issues. Enjoy, John Andrew Morgan wrote: Following up on my previous post about stuck lmtpd processes. I found this incredibly detailed faq at: http://www.faqchest.com/prgm/cyrus-l/cyrus-01/cyrus-0111/cyrus-011102/cy rus0023_33254.html This isn't exactly the same problem, but the steps on that page helped me figure out that they are all stuck trying to get a lock on: /private/cyrus/mail/k/user/krolickp/cyrus.header Looking at /proc/locks shows: 7: POSIX ADVISORY WRITE 21903 08:11:42107658 0 EOF d23895e0 c3217f44 c510e4c4 ccbf076c 7: - POSIX ADVISORY WRITE 32485 08:11:42107658 0 EOF ccbf0760 ee36ac44 f3bb26a4 d23895e0 ee36ac4c 7: - POSIX ADVISORY WRITE 1802 08:11:42107658 0 EOF ee36ac40 c050ea04 ccbf0764 d23895e0 c050ea0c 7: - POSIX ADVISORY WRITE 1217 08:11:42107658 0 EOF c050ea00 ee36a344 ee36ac44 d23895e0 ee36a34c ... I don't see how this deadlock occurred, but I'm willing to help debug it. Andy -- +---+ Scott W. 
Adkins    http://www.cns.ohiou.edu/~sadkins/
UNIX Systems Engineer    mailto:[EMAIL PROTECTED]
ICQ 7626282    Work (740)593-9478    Fax (740)593-1944
+---+
PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Scott Adkins wrote: Also, when we find the specific imaps process that happens to have the cyrus.header lock file opened for writing and has it locked, if we kill it off, we find that the write lock goes to another imaps process or to one of the LMTP processes and gets stuck there... we kill that one off and it goes to the next one and gets stuck. We never saw a case where all the other processes became unstuck and the problem went away. So, if these processes are getting stuck -- what are they getting stuck waiting for? Processes don't acquire a lock and then just stop. As a consquence, the only solution we have when we see the problem is to restart the Cyrus server (we usually wait until after work hours at least). I am not convinced the patch described below has helped us much, as when we saw the LMTP processes stacking up, it was right in our face and we could deal with it sooner than later. I agree. -Rob -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456 Research Systems Programmer * /usr/contributed Gatekeeper
Re: followup: stuck lmtpd processes
On Wed, Sep 24, 2003 at 11:13:06AM -0300, Henrique de Moraes Holschuh wrote:
> It is not a general solution when you hit glibc/kernel bugs, but I can certainly live with it IF I manage to track down a version of glibc and kernel that won't deadlock, that we can recommend. Either that, or allow for runtime-switchable behaviours (I am willing to code this).

This is not good enough. You can't recommend a specific kernel/glibc version; this is dictated by the distribution people use. You can't just recommend using the latest either, because a lot of (most?) people will prefer to use an older, well-known, stable distribution (Debian stable, RedHat 7.3).

The obvious solution is to not use alarm() to interrupt a blocking syscall, but to use a non-blocking call with select() instead. I am not a very proficient C Unix programmer, so maybe this suggestion makes no sense. However, in the case of the bug with fud I discussed in another post, it was a perfectly workable solution.

And please don't dismiss it as a problem with Linux, not Cyrus. Linux may well be broken (I can't tell), but it still constitutes the vast majority of Cyrus installations (I would believe), and thus merits being accommodated.

--
Etienne Goyer    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com    [EMAIL PROTECTED]
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Etienne Goyer wrote:
> The obvious solution is to not use alarm() to interrupt a blocking syscall, but to use a non-blocking call with select() instead. I am not a very proficient C Unix programmer, so maybe this suggestion makes no sense. However, in the case of the bug with fud I discussed in another post, it was a perfectly workable solution.

This is *not* the correct solution.

However, I have looked into this and to my surprise, Linux is indeed restarting the system calls instead of returning with EINTR. However, the answer here is to set up the alarm() handler with sigaction without setting SA_RESTART, not to jump through select() hoops or make nonblocking lock attempts.

I'll work on fixing fud shortly (it's using signal() and it should be using sigaction()).

> And please don't dismiss it as a problem with Linux, not Cyrus. Linux may well be broken (I can't tell), but it still constitutes the vast majority of Cyrus installations (I would believe), and thus merits being accommodated.

Linux should certainly be accommodated, but not in a way that causes *severe* performance penalties on operating systems that don't mysteriously fall asleep in the middle of a blocking flock() or fcntl() call.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Rob Siemborski wrote:
> However, I have looked into this and to my surprise, Linux is indeed restarting the system calls instead of returning with EINTR. However, the answer here is to set up the alarm() handler with sigaction without setting SA_RESTART, not to jump through select() hoops or make nonblocking lock attempts.

Yep, that's exactly it. Bullseye, Rob!

> I'll work on fixing fud shortly (it's using signal() and it should be using sigaction()).

And I will do a linux fcntl/flock patch that uses alarm() WITHOUT SA_RESTART as soon as I can.

> Linux should certainly be accommodated, but not in a way that causes *severe* performance penalties on operating systems that don't mysteriously fall asleep in the middle of a blocking flock() or fcntl() call.

Agreed.

--
One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh
Re: followup: stuck lmtpd processes
On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote: However, I have looked into this and to my surprise, Linux is indeed restarting the system calls instead of returning with EINTR. However, the answer here is to set up the alarm() handler with sigaction without setting SA_RESTART, not to jump through select() hoops or make nonblocking lock attempts. I consult my local Unix guru on the subject, and he point me to Advanced Programming in the Unix Environment by W. Richard Stevens, section 10.5 (page 275 in my edition) Interrupted System Calls. The revelancy of this text may be questionned since it was published in 1992, but the behavior of SVR4 vs BSD differ substantially in this regard according to it. My guru concluded that system call interuption by signal is an assumption that may lead to portability problem. In the fud case, he is right. I'll work on fixing fud shortly (its using signal() and it should be using sigaction()). The included patch against 2.1.13 work for me. -- Etienne GoyerLinux Québec Technologies Inc. 
http://www.LinuxQuebec.com    [EMAIL PROTECTED]

--- fud.c.orig	Wed Jun  4 14:19:43 2003
+++ fud.c	Wed Jun  4 15:28:54 2003
@@ -53,6 +53,7 @@
 #include <syslog.h>
 #include <signal.h>
 #include <sys/types.h>
+#include <sys/time.h>
 #include <sys/param.h>
 #include <sys/stat.h>
 #include <netinet/in.h>
@@ -62,6 +63,7 @@
 #include <errno.h>
 #include <com_err.h>
 #include <pwd.h>
+#include <sys/select.h>
 
 #include "assert.h"
 #include "mboxlist.h"
@@ -196,12 +198,6 @@
     shut_down(r);
 }
 
-static void cyrus_timeout(int signo)
-{
-    signo = 0;
-    return;
-}
-
 /* Send a proxy request to the backend, send their reply to sfrom */
 int do_proxy_request(const char *who, const char *name,
                      const char *backend_host,
@@ -210,8 +206,10 @@
     char tmpbuf[1024];
     int x, rc;
     int csoc = -1;
+    fd_set ssock;
     struct sockaddr_in cin, cout;
     struct hostent *hp;
+    struct timeval tv;
     static int backend_port = 0; /* fud port in NETWORK BYTE ORDER */
 
     /* Open a UDP socket to the Cyrus mail server */
@@ -240,14 +238,24 @@
     /* Send the query and wait for a reply */
     sendto (csoc, tmpbuf, strlen (tmpbuf), 0, (struct sockaddr *) &cin, x);
     memset (tmpbuf, '\0', strlen (tmpbuf));
-    signal (SIGALRM, cyrus_timeout);
     rc = 0;
-    alarm (1);
-    rc = recvfrom (csoc, tmpbuf, sizeof(tmpbuf), 0,
-                   (struct sockaddr *) &cout, &x);
-    alarm (0);
-    if (rc < 1) {
-        rc = IMAP_SERVER_UNAVAILABLE;
+    FD_ZERO(&ssock);
+    FD_SET(csoc, &ssock);
+    tv.tv_sec = 5;
+    tv.tv_usec = 0;
+    rc = select(csoc + 1, &ssock, NULL, NULL, &tv);
+    if (rc > 0) {
+        syslog(LOG_ERR, "Reading sock");
+        rc = recvfrom (csoc, tmpbuf, sizeof(tmpbuf), 0,
+                       (struct sockaddr *) &cout, &x);
+        if (rc < 1) {
+            rc = IMAP_SERVER_UNAVAILABLE;
+            send_reply(sfrom, REQ_UNK, who, name, 0, 0, 0);
+            goto done;
+        }
+    } else {
+        syslog(LOG_ERR, "FUD timeout");
+        rc = IMAP_SERVER_UNAVAILABLE;
         send_reply(sfrom, REQ_UNK, who, name, 0, 0, 0);
         goto done;
     }
Re: followup: stuck lmtpd processes
On Wed, Sep 24, 2003 at 12:57:37PM -0300, Henrique de Moraes Holschuh wrote:
> I did check ALL the documentation already, and ALL of it says that SIGALRM MUST interrupt the syscall, and that it HAS to return EINTR. So, it is a bug. So, it needs to be squashed, and people have to either patch or upgrade their systems... or deal with diminished performance.

Please have a look at the Stevens reference I made in reply to Rob. According to him, BSD circa 1992 was not adhering to this behavior. Whether modern BSDs perpetuate this or not, I can't tell. According to Stevens again, SunOS 4.1.2 had yet another behavior in this regard. Whether these ancient OSes should be accommodated or not is a decision I am not qualified to comment upon.

> > And please don't dismiss it as a problem with Linux, not Cyrus. Linux may well be broken (I can't tell), but it still constitutes the vast majority of Cyrus installations (I would believe), and thus merits being accommodated.
>
> Something that works in Linux, sure. Something that works in broken Linux? No. Fix the breakage in Linux, instead. That's our strength, and I *will* stick to it as a Debian maintainer.

While I agree with you on a technical level and admire your commitment to excellence, this may not be practical. The installed base is huge and the interested parties (Linux distributors) numerous. Getting everybody to update broken packages will be quite an endeavour. Considering this bug touches upon the kernel and glibc, expecting end-users to patch themselves without support from their distributor is not an option either.

> There is a proper Unix way to do it (using alarm(). this needs to be added to Cyrus IMHO) that *might* not work in certain Linux glibc/kernel combinations.

That's the crux of the problem: if the glibc/kernel combination corresponds to the major part of the installed base, it might continue to hurt for a long time.

> Now, if other Unixes have stupid lock and alarm() bugs, that deadlock testing code would be even more useful... :-)

In the case of closed-source OSes, there may be nothing we can do about it except working around the bug.

--
Etienne Goyer    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com    [EMAIL PROTECTED]
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Etienne Goyer wrote: I'll work on fixing fud shortly (its using signal() and it should be using sigaction()). The included patch against 2.1.13 work for me. This sort of thing won't work for file locking. I've just committed a patch to fud that uses sigaction() [Which since we're assuming POSIX anyway, should hopefully be enough]. -Rob -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456 Research Systems Programmer * /usr/contributed Gatekeeper
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Etienne Goyer wrote:
> > Something that works in Linux, sure. Something that works in broken Linux? No. Fix the breakage in Linux, instead. That's our strength, and I *will* stick to it as a Debian maintainer.
>
> While I agree with you on a technical level and admire your commitment to excellence, this may not be practical. The installed base is huge and the interested parties (Linux distributors) numerous. Getting everybody to update broken packages will be quite an endeavour.

We don't have to get everybody to update, only the people who are running Cyrus.

> > There is a proper Unix way to do it (using alarm(). this needs to be added to Cyrus IMHO) that *might* not work in certain Linux glibc/kernel combinations.
>
> That's the crux of the problem: if the glibc/kernel combination corresponds to the major part of the installed base, it might continue to hurt for a long time.

If it is a single combination, that can be documented.

> > Now, if other Unixes have stupid lock and alarm() bugs, that deadlock testing code would be even more useful... :-)
>
> In the case of closed-source OSes, there may be nothing we can do about it except working around the bug.

Or reporting it to the vendor. Which is exactly what I think we should do in the case of Linux.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper
Re: followup: stuck lmtpd processes
Thanks. I'll test it by the end of the week, and report. On Wed, Sep 24, 2003 at 01:18:12PM -0400, Rob Siemborski wrote: On Wed, 24 Sep 2003, Etienne Goyer wrote: I'll work on fixing fud shortly (its using signal() and it should be using sigaction()). The included patch against 2.1.13 work for me. This sort of thing won't work for file locking. I've just committed a patch to fud that uses sigaction() [Which since we're assuming POSIX anyway, should hopefully be enough]. -Rob -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456 Research Systems Programmer * /usr/contributed Gatekeeper -- Etienne GoyerLinux Québec Technologies Inc. http://www.LinuxQuebec.com [EMAIL PROTECTED]
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Etienne Goyer wrote:

On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote:

However, I have looked into this and to my surprise, Linux is indeed restarting the system calls instead of returning with EINTR. However, the answer here is to set up the alarm() handler with sigaction without setting SA_RESTART, not to jump through select() hoops or make nonblocking lock attempts.

I consulted my local Unix guru on the subject, and he pointed me to Advanced Programming in the Unix Environment by W. Richard Stevens, section 10.5 (page 275 in my edition), Interrupted System Calls. The relevance of this text may be questioned since it was published in 1992, but according to it the behavior of SVR4 and BSD differs substantially in this regard. My guru concluded that system call interruption by signal is an assumption that may lead to portability problems.

In the fud case, he is right. With SYSV you will get the interrupted system call, unless you tell it somehow not to do it (the SA_RESTART stuff).

If we are to accommodate the BSDs, we can:

1. Let them have the short end of the stick (they get what we have now, that is, deadlocks). Not good.
2. Let them use the low-performance non-blocking mode (good solution).
3. Find out how to get EINTR from BSD (it should be possible nowadays).

As for old legacy systems (too old BSDs, SunOS 4, very old Linux), well, I am inclined to simply ignore them. Let them use solution (2) above. That is the least of their problems, given the number of annoying and dangerous bugs those old kernels have.

--
One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh
Re: followup: stuck lmtpd processes
On Wed, Sep 24, 2003 at 02:20:50PM -0300, Henrique de Moraes Holschuh wrote:

On Wed, 24 Sep 2003, Etienne Goyer wrote:

On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote:

However, I have looked into this and to my surprise, Linux is indeed restarting the system calls instead of returning with EINTR. However, the answer here is to set up the alarm() handler with sigaction without setting SA_RESTART, not to jump through select() hoops or make nonblocking lock attempts.

I consulted my local Unix guru on the subject, and he pointed me to Advanced Programming in the Unix Environment by W. Richard Stevens, section 10.5 (page 275 in my edition), Interrupted System Calls. The relevance of this text may be questioned since it was published in 1992, but according to it the behavior of SVR4 and BSD differs substantially in this regard. My guru concluded that system call interruption by signal is an assumption that may lead to portability problems.

In the fud case, he is right. With SYSV you will get the interrupted system call, unless you tell it somehow not to do it (the SA_RESTART stuff).

If we are to accommodate the BSDs, we can:

1. Let them have the short end of the stick (they get what we have now, that is, deadlocks). Not good.
2. Let them use the low-performance non-blocking mode (good solution).
3. Find out how to get EINTR from BSD (it should be possible nowadays).

I don't understand. The only alarm() business I can see in imap/fud.c is around recvfrom, which at least according to its man page says:

[EINTR] The receive was interrupted by delivery of a signal before any data were available.

What system calls are there? I was looking for syscall()...

Cheers,
Patrick
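To make the recvfrom() question concrete: recvfrom() is itself a system call, so whether it fails with EINTR or is silently restarted after SIGALRM depends on how the handler was installed. The following is a minimal sketch, assuming a POSIX system. The names `recv_with_timeout` and `on_alarm` are hypothetical and chosen for illustration; this is not the code from imap/fud.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Empty handler: its only job is to make the blocked call return. */
static void on_alarm(int signo) { (void)signo; }

/* Hypothetical helper (not from fud.c): wait up to `seconds` for a
 * datagram on `sock`. Returns the number of bytes received, or -1;
 * errno == EINTR means the alarm went off before any data arrived. */
ssize_t recv_with_timeout(int sock, void *buf, size_t len, unsigned seconds)
{
    struct sigaction sa, old;
    ssize_t n;
    int saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;              /* deliberately NOT SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    alarm(seconds);
    n = recvfrom(sock, buf, len, 0, NULL, NULL);
    saved = errno;                /* keep recvfrom's errno across cleanup */
    alarm(0);                     /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old, NULL);
    errno = saved;
    return n;
}
```

If the handler were installed with signal() on a platform that gives signal() BSD semantics, the same recvfrom() could be restarted instead of failing, which is the behavior described above. Note that the sketch has a small window in which the alarm could fire before recvfrom() starts blocking; with timeouts of a second or more that is unlikely, but it is not impossible.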
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, John Wade wrote:

The patch I wrote still might help you since it would prevent an individual user's problem from taking down the mail system. The user's mailbox would remain inaccessible, but the lmtpd processes attempting delivery would exit with errors and mail delivery as a whole would proceed. I still believe that a system that uses locking with no timeout mechanism is inherently fragile. A single programming error can lead to a cascade failure that takes down the entire mail system as more and more processes hang up trying to get the lock.

In this case, the problem only affected a single user. Anything that tried to get a lock on that user's quota file hung. I had a lot of lmtpd processes hung waiting for the lock, but that didn't really impact the system. The other users of the system were unaffected.

I'm not saying it is acceptable, but the impact was minor.

Andy
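As an illustration of what "locking with a timeout" can look like, here is a minimal sketch that combines a blocking fcntl() lock with alarm(), along the lines suggested earlier in the thread. It assumes a POSIX system. `lock_with_timeout` is a hypothetical name; this is neither John Wade's patch nor the Cyrus lock_fcntl.c code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: exists only so SIGALRM interrupts the blocked fcntl(). */
static void on_alarm(int signo) { (void)signo; }

/* Hypothetical helper: try to take an exclusive lock on the whole file,
 * giving up after `seconds`. Returns 0 when the lock is held, -1 on
 * failure; errno == EINTR means the attempt timed out. */
int lock_with_timeout(int fd, unsigned seconds)
{
    struct flock fl;
    struct sigaction sa, old;
    int r, saved;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;          /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;       /* l_start = 0, l_len = 0: whole file */

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;              /* no SA_RESTART, so fcntl() can fail with EINTR */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    alarm(seconds);
    r = fcntl(fd, F_SETLKW, &fl); /* blocks until locked or interrupted */
    saved = errno;
    alarm(0);
    sigaction(SIGALRM, &old, NULL);
    errno = saved;
    return r;
}
```

A caller that gets -1 with EINTR can log the failure and exit instead of hanging, so waiting processes do not pile up behind a stuck lock holder. It does not release the lock itself; the process holding it still has to be found and dealt with.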
Re: followup: stuck lmtpd processes
Andy,

It's happened to me before... Don't think it can't... That's all I'm saying...

-John

Andrew Morgan wrote:

On Wed, 24 Sep 2003, John C. Amodeo wrote:

...until your system runs out of available open files... Then the real fun begins... :-)

-John

[EMAIL PROTECTED] tools]# cat /proc/sys/fs/file-max
209708

I'm in a lot of trouble if I've got 209708 files open. :)

Andy

--
John C. Amodeo - Associate Director of Information Technology
Faculty of Arts and Sciences -- Computer Network Operations
Rutgers, The State University of New Jersey
732.932.9455-voice 732.932.0013-fax
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, John C. Amodeo wrote:

...until your system runs out of available open files... Then the real fun begins... :-)

-John

[EMAIL PROTECTED] tools]# cat /proc/sys/fs/file-max
209708

I'm in a lot of trouble if I've got 209708 files open. :)

Andy
Re: followup: stuck lmtpd processes
...until your system runs out of available open files... Then the real fun begins... :-)

-John

Andrew Morgan wrote:

On Wed, 24 Sep 2003, John Wade wrote:

The patch I wrote still might help you since it would prevent an individual user's problem from taking down the mail system. The user's mailbox would remain inaccessible, but the lmtpd processes attempting delivery would exit with errors and mail delivery as a whole would proceed. I still believe that a system that uses locking with no timeout mechanism is inherently fragile. A single programming error can lead to a cascade failure that takes down the entire mail system as more and more processes hang up trying to get the lock.

In this case, the problem only affected a single user. Anything that tried to get a lock on that user's quota file hung. I had a lot of lmtpd processes hung waiting for the lock, but that didn't really impact the system. The other users of the system were unaffected.

I'm not saying it is acceptable, but the impact was minor.

Andy

--
John C. Amodeo - Associate Director of Information Technology
Faculty of Arts and Sciences -- Computer Network Operations
Rutgers, The State University of New Jersey
732.932.9455-voice 732.932.0013-fax
Re: followup: stuck lmtpd processes
On Wed, 24 Sep 2003, Scott Adkins wrote:

When looking at what file the processes are all waiting to get a lock on, it usually turns out to be the cyrus.header file and not the quota file. Is this still the same bug described by Rob on bugzilla? Does it have to be the quota file?

Also, when we find the specific imaps process that happens to have the cyrus.header lock file opened for writing and has it locked, if we kill it off, we find that the write lock goes to another imaps process or to one of the LMTP processes and gets stuck there... we kill that one off and it goes to the next one and gets stuck. We never saw a case where all the other processes became unstuck and the problem went away.

Are you sure that the processes are hung on the cyrus.header lock? That's what I originally thought when I was only looking at the output of lsof and /proc/locks (Linux). When I actually ran a gdb backtrace on one of the stuck processes, it became obvious that the lock was on the quota file instead:

(gdb) bt
#0 0x402ae5fb in fcntl () from /lib/libc.so.6
#1 0x08077504 in lock_reopen (fd=16, filename=0xbfffa098 /var/spool/cyrus/config/quota/k/user.krolickp, sbuf=0xbfffa040, failaction=0xbfffa03c) at lock_fcntl.c:87
#2 0x080570b6 in mailbox_lock_quota (quota=0xbfffc3c4) at mailbox.c:1016
#3 0x08053f73 in append_setup (as=0xbfffc118, name=0xbfffb114 user.krolickp, format=0, userid=0x0, auth_state=0x0, aclcheck=0, quotacheck=0) at append.c:209

I also saw exactly the behavior you describe when killing processes. I originally tried killing all the lmtpd processes that were stuck because I believed that one of the lmtpd processes was stuck holding the lock on cyrus.header. When I killed the one holding the lock on cyrus.header, another lmtpd process would grab the lock but still be stuck. When I finally killed the process holding the quota file lock (an imaps process), all the lmtpd processes got unstuck and delivered the waiting mail.

It sounds to me like you're not actually killing the process that has the lock that all the other processes are waiting for.

Andy
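Besides lsof, /proc/locks and gdb, the kernel can be asked directly which process holds a conflicting fcntl() lock, using F_GETLK. The sketch below assumes a POSIX system. `lock_holder` is a hypothetical helper written for illustration and is not part of Cyrus.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report who would block an exclusive lock on the
 * whole file behind `fd`. Returns the PID of one process holding a
 * conflicting lock, 0 if the lock could be granted, or -1 on error. */
pid_t lock_holder(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;          /* ask about an exclusive lock */
    fl.l_whence = SEEK_SET;       /* l_start = 0, l_len = 0: whole file */

    if (fcntl(fd, F_GETLK, &fl) < 0)
        return (pid_t)-1;
    if (fl.l_type == F_UNLCK)     /* nothing in the way */
        return 0;
    return fl.l_pid;              /* holder of one conflicting lock */
}
```

Run against the quota file rather than cyrus.header, a check like this would point at the process actually holding the lock. F_GETLK describes only one conflicting lock per call, never reports locks held by the calling process itself, and says nothing about processes that are merely waiting.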
Re: followup: stuck lmtpd processes
Well, that could definitely be a problem... Next time we see a lock problem occur, I will look based on the information below to see if it is really a lock problem on the quota file.

Thanks,
Scott

--On Wednesday, September 24, 2003 12:32 PM -0700 Andrew Morgan [EMAIL PROTECTED] wrote:

On Wed, 24 Sep 2003, Scott Adkins wrote:

When looking at what file the processes are all waiting to get a lock on, it usually turns out to be the cyrus.header file and not the quota file. Is this still the same bug described by Rob on bugzilla? Does it have to be the quota file?

Also, when we find the specific imaps process that happens to have the cyrus.header lock file opened for writing and has it locked, if we kill it off, we find that the write lock goes to another imaps process or to one of the LMTP processes and gets stuck there... we kill that one off and it goes to the next one and gets stuck. We never saw a case where all the other processes became unstuck and the problem went away.

Are you sure that the processes are hung on the cyrus.header lock? That's what I originally thought when I was only looking at the output of lsof and /proc/locks (Linux). When I actually ran a gdb backtrace on one of the stuck processes, it became obvious that the lock was on the quota file instead:

(gdb) bt
#0 0x402ae5fb in fcntl () from /lib/libc.so.6
#1 0x08077504 in lock_reopen (fd=16, filename=0xbfffa098 /var/spool/cyrus/config/quota/k/user.krolickp, sbuf=0xbfffa040, failaction=0xbfffa03c) at lock_fcntl.c:87
#2 0x080570b6 in mailbox_lock_quota (quota=0xbfffc3c4) at mailbox.c:1016
#3 0x08053f73 in append_setup (as=0xbfffc118, name=0xbfffb114 user.krolickp, format=0, userid=0x0, auth_state=0x0, aclcheck=0, quotacheck=0) at append.c:209

I also saw exactly the behavior you describe when killing processes. I originally tried killing all the lmtpd processes that were stuck because I believed that one of the lmtpd processes was stuck holding the lock on cyrus.header. When I killed the one holding the lock on cyrus.header, another lmtpd process would grab the lock but still be stuck. When I finally killed the process holding the quota file lock (an imaps process), all the lmtpd processes got unstuck and delivered the waiting mail.

It sounds to me like you're not actually killing the process that has the lock that all the other processes are waiting for.

Andy

--
+---+
Scott W. Adkins http://www.cns.ohiou.edu/~sadkins/
UNIX Systems Engineer mailto:[EMAIL PROTECTED]
ICQ 7626282 Work (740)593-9478 Fax (740)593-1944
+---+
PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/
Re: followup: stuck lmtpd processes
Hi Andrew,

I was the one who wrote the message you found. I finally came to the conclusion that the flat file locking mechanism is somewhat broken in Cyrus, but I was never a good enough C programmer to pin down what was happening. (The mmap stuff makes it really tricky to debug.) I wanted to blame it on the Linux kernel, but I know that others have experienced the same problems in Solaris.

I finally gave up and wrote a locking timeout patch for 2.0.16. See http://www.oakton.edu/~jwade/cyrus/ for the patch and full details.

A number of other folks have tried this patch successfully on 2.0.16 and 2.1.x, and I know it has resolved our problem. If you can solve the particular bug that causes this, more power to you; if not, my workaround resolves a number of possible deadlock issues.

Enjoy,
John

Andrew Morgan wrote:

Following up on my previous post about stuck lmtpd processes. I found this incredibly detailed faq at:

http://www.faqchest.com/prgm/cyrus-l/cyrus-01/cyrus-0111/cyrus-011102/cyrus0023_33254.html

This isn't exactly the same problem, but the steps on that page helped me figure out that they are all stuck trying to get a lock on:

/private/cyrus/mail/k/user/krolickp/cyrus.header

Looking at /proc/locks shows:

7: POSIX ADVISORY WRITE 21903 08:11:42107658 0 EOF d23895e0 c3217f44 c510e4c4 ccbf076c
7: - POSIX ADVISORY WRITE 32485 08:11:42107658 0 EOF ccbf0760 ee36ac44 f3bb26a4 d23895e0 ee36ac4c
7: - POSIX ADVISORY WRITE 1802 08:11:42107658 0 EOF ee36ac40 c050ea04 ccbf0764 d23895e0 c050ea0c
7: - POSIX ADVISORY WRITE 1217 08:11:42107658 0 EOF c050ea00 ee36a344 ee36ac44 d23895e0 ee36a34c
...

I don't see how this deadlock occurred, but I'm willing to help debug it.

Andy
Re: followup: stuck lmtpd processes
On Tue, 23 Sep 2003, John Wade wrote:

Hi Andrew,

I was the one who wrote the message you found. I finally came to the conclusion that the flat file locking mechanism is somewhat broken in Cyrus, but I was never a good enough C programmer to pin down what was happening. (The mmap stuff makes it really tricky to debug.) I wanted to blame it on the Linux kernel, but I know that others have experienced the same problems in Solaris.

I finally gave up and wrote a locking timeout patch for 2.0.16. See http://www.oakton.edu/~jwade/cyrus/ for the patch and full details.

A number of other folks have tried this patch successfully on 2.0.16 and 2.1.x, and I know it has resolved our problem. If you can solve the particular bug that causes this, more power to you; if not, my workaround resolves a number of possible deadlock issues.

Enjoy,
John

Hey John,

Thanks for that message. If you've read a little further in your info-cyrus messages, you'll see that I apparently have hit upon a different bug than the one you found (I think). Your page was instrumental in helping me track down the source of the problem though.

It turns out I had an imaps process that hung onto the lock on the user's quota file. Apparently it obtained the lock, then went off to read from the network connection and never came back. I think your patch would fix the problem where a lot of processes are contending for a lock (by making them retry), but it wouldn't help if a single process keeps the lock indefinitely.

Ideally it should not be possible for a process to get hung while it is holding the lock, but that will require some careful programming in this particular case. In the meantime, I'll have to keep an eye on the system.

Thanks again for your debugging clues...

Andy
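For comparison with the alarm() approach, the "retry" style of locking mentioned here can be sketched with non-blocking F_SETLK attempts and a bounded number of tries. This is only an illustration, assuming a POSIX system. `lock_retry`, its parameters, and the retry policy are hypothetical, and the actual patch referenced above may work quite differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: make up to `attempts` non-blocking tries to take
 * an exclusive lock on the whole file, sleeping `delay_ms` between
 * tries. Returns 0 when the lock is held; -1 with errno == EAGAIN when
 * every attempt found the file locked; -1 with another errno on error. */
int lock_retry(int fd, int attempts, long delay_ms)
{
    struct flock fl;
    struct timespec ts;
    int i;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;       /* l_start = 0, l_len = 0: whole file */

    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < attempts; i++) {
        if (fcntl(fd, F_SETLK, &fl) == 0)
            return 0;             /* lock acquired */
        /* POSIX allows either EACCES or EAGAIN for "already locked". */
        if (errno != EACCES && errno != EAGAIN)
            return -1;            /* a real error, not contention */
        if (i + 1 < attempts)
            nanosleep(&ts, NULL); /* back off before the next try */
    }
    errno = EAGAIN;
    return -1;
}
```

As noted above, retrying only bounds how long a waiter hangs. If one process holds the lock indefinitely, every caller eventually gives up with an error rather than blocking, but the mailbox stays unavailable until the holder releases the lock or is killed.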