Re: [PATCH] Updated master.c process counting patch
On Thu, 16 May 2002, Jeremy Howard wrote: I *strongly* recommend also including shutdown.diff. This is important in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the [...]

I had a talk with some kernel people, and they confirmed that. shutdown(socket, SHUT_RD) should _always_ be done under Linux if you really don't need to read from the socket anymore. For AF_INET* sockets, anyway.

shutdown(socket, SHUT_RDWR) will reduce the CLOSE_WAIT state even more, but it effectively trashes the connection; I don't like this idea very much. It is far friendlier to the client if you let it read the last stuff you sent it (such as error messages!) at its own leisure.

I am, thus, somewhat wary of adding SHUT_RDWR unconditionally. Maybe we could add a runtime option that very busy sites can set if they need even faster socket recycling?

-- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot
Henrique Holschuh
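The trade-off between SHUT_RD and SHUT_RDWR discussed above can be sketched in C. This is only an illustration and is not taken from shutdown.diff: the function name close_client and the fast_recycle flag (standing in for the proposed runtime option) are hypothetical.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: release a client socket we no longer need to read.
 * fast_recycle stands in for the proposed runtime option; it is not an
 * existing Cyrus setting. Returns 0 on success, -1 if close() fails. */
int close_client(int fd, int fast_recycle)
{
    assert(fd >= 0);

    /* SHUT_RD disables further receives only, so data already queued for
     * the client (such as an error message) can still be delivered.
     * SHUT_RDWR disables both directions. */
    int how = fast_recycle ? SHUT_RDWR : SHUT_RD;

    /* A shutdown() failure (for example ENOTCONN when the peer is already
     * gone) is ignored here: the descriptor must be released either way. */
    (void)shutdown(fd, how);

    return close(fd);
}
```

With fast_recycle set to 0 this follows the conservative behaviour preferred in the message above; a busy site would pass 1 to opt in to SHUT_RDWR.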
Re: [PATCH] Updated master.c process counting patch
Henrique de Moraes Holschuh wrote: I am, thus, somewhat wary of adding SHUT_RDWR unconditionally. Maybe we could add a runtime option that very busy sites can set if they need even faster socket recycling?

That sounds like a good idea.
Re: [PATCH] Updated master.c process counting patch
Henrique de Moraes Holschuh wrote: On Thu, 16 May 2002, Jeremy Howard wrote: I *strongly* recommend also including shutdown.diff. This is important in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the ' !imapd_in->tls_conn' bit everywhere for general distribution--this is a workaround for a memory corruption problem that is unrelated to this patch.

I see. What about all those commented lines in your original master-avail.diff patch? Were they past shutdown() experiments?

That's correct, Henrique. The commented lines should not be included in a widely distributed version. I left them in, but commented out, to remind myself what experiments we had done. Only the lines that are *not* commented out should be included.

I believe (?) that this issue is less important on Solaris, because I think that it handles close() differently from Linux. However, on Linux it is vital to flush receive buffers and call shutdown() to avoid hanging connections.
Re: [PATCH] Updated master.c process counting patch
On Fri, 17 May 2002, Jeremy Howard wrote: I believe (?) that this issue is less important on Solaris, because I think that it handles close() differently from Linux. However, on Linux it is vital to flush receive buffers and call shutdown() to avoid hanging connections.

I am still studying the shutdown issue. BTW, why didn't you shutdown() all streams at once and then sleep, instead of doing shutdown() and sleep one stream at a time?

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
Henrique de Moraes Holschuh wrote: On Fri, 17 May 2002, Jeremy Howard wrote: I believe (?) that this issue is less important on Solaris, because I think that it handles close() differently from Linux. However, on Linux it is vital to flush receive buffers and call shutdown() to avoid hanging connections.

I am still studying the shutdown issue. BTW, why didn't you shutdown() all streams at once and then sleep, instead of doing shutdown() and sleep one stream at a time?

Because it was late at night and I wasn't thinking straight... ;-)
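The ordering Henrique asks about (shut down every stream first, then sleep once) could look roughly like this in C. It is a sketch under assumptions: shutdown.diff itself is not shown in this thread, so the function name, the grace_seconds parameter, and the choice of SHUT_RD are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: half-close every descriptor first, then wait once.
 * grace_seconds is an illustrative parameter, not a value from the patch.
 * Returns the number of descriptors on which shutdown() succeeded. */
size_t shutdown_all_then_wait(const int *fds, size_t nfds,
                              unsigned int grace_seconds)
{
    size_t ok = 0;

    assert(fds != NULL || nfds == 0);

    for (size_t i = 0; i < nfds; i++) {
        if (shutdown(fds[i], SHUT_RD) == 0)
            ok++;
    }

    /* A single sleep covers all streams, so the total delay no longer
     * grows with the number of streams. */
    if (ok > 0 && grace_seconds > 0)
        sleep(grace_seconds);

    return ok;
}
```

Compared with a shutdown-then-sleep pair per stream, the total delay here stays at grace_seconds regardless of how many streams are open.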
Re: [PATCH] Updated master.c process counting patch
On Thu, 16 May 2002, Jeremy Howard wrote: I *strongly* recommend also including shutdown.diff. This is important in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the ' !imapd_in->tls_conn' bit everywhere for general distribution--this is a workaround for a memory corruption problem that is unrelated to this patch.

I see. What about all those commented lines in your original master-avail.diff patch? Were they past shutdown() experiments?

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
Jeremy Howard wrote: Henrique de Moraes Holschuh wrote: I don't know what Ken and Lawrence think of these patches, but I just finished porting the child pid tracking of master-avail.diff to 2.1.4CVS, and will post that to this list soon. I will also include it in Debian, which will give some field-testing to the patch.

I *strongly* recommend also including shutdown.diff. This is important in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the ' !imapd_in->tls_conn' bit everywhere for general distribution--this is a workaround for a memory corruption problem that is unrelated to this patch.

I'm running a config almost the same as yours and have never seen this problem. AFAIK, the CMU guys have never seen this either. Do you have a core that you can run a backtrace on, or can you force a core by setting MALLOC_CHECK_=2 before starting master (see malloc(3) for details)?

What does your DB config look like? Are you using skiplist for everything by any chance?

name       : Cyrus IMAPD
version    : v2.1.4 2002/05/14 16:51:51
vendor     : Project Cyrus
support-url: http://asg.web.cmu.edu/cyrus
os         : Linux
os-version : 2.4.18-SGI_XFS_1.1smp
command    : imapd
arguments  :
environment: Cyrus SASL 2.1.3
             Sleepycat Software: Berkeley DB 3.3.11: (July 12, 2001)
             OpenSSL 0.9.6b [engine] 9 Jul 2001
             CMU Sieve 2.2
             TCP Wrappers
             UCD-SNMP 4.2.3
             lock = flock
             auth = unix
             idle = idled
             mboxlist.db = skiplist
             subs.db = flat
             seen.db = skiplist
             duplicate.db = db3-nosync
             tls.db = db3-nosync

-- Kenneth Murchison Oceana Matrix Ltd. Software Engineer 21 Princeton Place 716-662-8973 x26 Orchard Park, NY 14127 --PGP Public Key--http://www.oceana.com/~ken/ksm.pgp
Re: [PATCH] Updated master.c process counting patch
Jeremy Howard wrote: Lawrence Greenfield wrote: Date: Wed, 15 May 2002 16:02:42 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...] The point is, if that indeed happens, log or no log, master loses track of the number of children that can service requests. That would be a bug, and the patch supposedly fixes this bug. It really doesn't matter (for accepting or not the patch) why the child died.

Yes, I understand that. However, if the master (in real life situations) is actually losing track of the number of available service processes without one of those service processes crashing (either by the sysadmin or otherwise) then there's some other problem in the child accounting.

The child accounting is fine. The problem in our case was always caused by child segfaults, or failure to properly close TCP connections. We still see segfaults (about one per fifty thousand connections I'd guess),

Can you send us a backtrace from a core? If you're not getting a core, please set up your system to dump one. Here are the bits that I use in my Cyrus startup script on Linux:

    cd /var/imap/cores
    ulimit -c unlimited
    export MALLOC_CHECK_=2
    $master

If you have multiple services/processes the cores will overwrite each other, so you need to catch it fairly quickly (unless they all have the same failure).

Ken
Re: [PATCH] Updated master.c process counting patch
On Thu, 16 May 2002, Ken Murchison wrote: If you have multiple services/processes the cores will overwrite each other, so you need to catch it fairly quickly (unless they all have the [...]

Unless you tell the kernel to use the pid in the corefile name... Add this to the script on Linux 2.4.x:

    [ -f /proc/sys/kernel/core_uses_pid ] && \
        echo 1 > /proc/sys/kernel/core_uses_pid

I don't know which other kernels can do that, but it should be a reasonably common feature, given how useful it is...

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
Henrique de Moraes Holschuh wrote: On Thu, 16 May 2002, Ken Murchison wrote: If you have multiple services/processes the cores will overwrite each other, so you need to catch it fairly quickly (unless they all have the [...]

Unless you tell the kernel to use the pid in the corefile name... Add this to the script on Linux 2.4.x:

    [ -f /proc/sys/kernel/core_uses_pid ] && \
        echo 1 > /proc/sys/kernel/core_uses_pid

Right. The reason I didn't suggest this is that some large sites might be worried about cores taking up a lot of disk space, and I didn't want them screaming at me ;)

Ken
Re: [PATCH] Updated master.c process counting patch
For what it's worth, we run into this problem a lot in our environment as well... I had come to the same conclusion as Jeremy and Henrique before they even posted their messages about it to the list.

The environment we are running is as follows:

    Cyrus IMAPD 2.0.16
    Compaq Tru64 5.1 on Alpha Platform
    Cyrus SASL 1.5.27
    Sleepycat BerkeleyDB 3.2.9
    OpenSSL 0.9.6
    UCD-SNMP 4.2.1
    lock = flock
    auth = unix
    mboxlist.db = flat
    subs.db = flat
    seen.db = flat
    duplicate.db = db3-nosync
    tls.db = none

The problem existed before we enabled SSL, so it isn't SSL related, and the problem existed before we added SNMP to the mix. The only place we use BerkeleyDB is for the duplicate delivery database, and only lmtpd is linked against the libdb.so file (I modified the Makefile to not link pthreads or BerkeleyDB into imapd, pop3d, etc., since they are using a flat file database anyway).

As for when the problem would arise... well, obviously, when we ran into resource issues, master would lose track of the count of children as they segfaulted, etc. This problem was just recently described. However, we rarely had resource issues like that.

The other time we could catch the problem occurring was when the system was booting and the Cyrus server was first started. For some reason, we would lose a couple of the services within a short period of time after Cyrus was started. We would have to shut down Cyrus and restart it again. Rarely, the problem would occur again within the first few minutes of the restart, and we would have to do it again. Usually, though, the restart would work fine.

Finally, we would periodically lose services for no apparent reason, meaning that the machine had not been recently rebooted, and we were not suffering from any apparent resource shortages (either at a system level or at the cyrus user level).

In any case, the problem exists in our Tru64 environment.

Scott

--On Thursday, May 16, 2002 9:25 AM -0400 Ken Murchison [EMAIL PROTECTED] wrote: I'm running a config almost the same as yours and have never seen this problem. AFAIK, the CMU guys have never seen this either. Do you have a core that you can run a backtrace on, or can you force a core by setting MALLOC_CHECK_=2 before starting master (see malloc(3) for details)? What does your DB config look like? Are you using skiplist for everything by any chance?

name       : Cyrus IMAPD
version    : v2.1.4 2002/05/14 16:51:51
vendor     : Project Cyrus
support-url: http://asg.web.cmu.edu/cyrus
os         : Linux
os-version : 2.4.18-SGI_XFS_1.1smp
command    : imapd
arguments  :
environment: Cyrus SASL 2.1.3
             Sleepycat Software: Berkeley DB 3.3.11: (July 12, 2001)
             OpenSSL 0.9.6b [engine] 9 Jul 2001
             CMU Sieve 2.2
             TCP Wrappers
             UCD-SNMP 4.2.3
             lock = flock
             auth = unix
             idle = idled
             mboxlist.db = skiplist
             subs.db = flat
             seen.db = skiplist
             duplicate.db = db3-nosync
             tls.db = db3-nosync

-- Scott W. Adkins http://www.cns.ohiou.edu/~sadkins/ UNIX Systems Engineer mailto:[EMAIL PROTECTED] ICQ 7626282 Work (740)593-9478 Fax (740)593-1944 PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/
Re: [PATCH] Updated master.c process counting patch
Ken Murchison wrote: I'm running a config almost the same as yours and have never seen this problem. AFAIK, the CMU guys have never seen this either. Do you have a core that you can run a backtrace on, or can you force a core by setting MALLOC_CHECK_=2 before starting master (see malloc(3) for details)?

Most of the segfaults were due to imapd_out or imapd_in being corrupted. The workaround discussed in our patch has solved most of these. I'll try to get a core file for the rare segfaults that we still get, to see what the unresolved issues are. The imapd_out corruption problem can't be solved by studying the core file AFAICT, because we can't see where the corruption is occurring.

What does your DB config look like? Are you using skiplist for everything by any chance?

Yes, 2.1.3 with skiplist for everything. We have the tls cache turned off, however. We prune the delivery database with -E0 every hour to avoid it getting big (otherwise DB recovery takes too long).
Re: [PATCH] Updated master.c process counting patch
Jeremy Howard wrote: Ken Murchison wrote: I'm running a config almost the same as yours and have never seen this problem. AFAIK, the CMU guys have never seen this either. Do you have a core that you can run a backtrace on, or can you force a core by setting MALLOC_CHECK_=2 before starting master (see malloc(3) for details)?

Most of the segfaults were due to imapd_out or imapd_in being corrupted. The workaround discussed in our patch has solved most of these. I'll try to get a core file for the rare segfaults that we still get, to see what the unresolved issues are. The imapd_out corruption problem can't be solved by studying the core file AFAICT, because we can't see where the corruption is occurring.

If you set MALLOC_CHECK_=2, then imapd will abort() whenever it thinks that there might be a corruption. By examining this core, it is easier to track down these problems. I've done this a few times to track down subtle errors that have baffled others.

Ken
Re: [PATCH] Updated master.c process counting patch
Date: Fri, 17 May 2002 09:35:39 +1000 From: Jeremy Howard [EMAIL PROTECTED] [...] Yes, 2.1.3 with skiplist for everything. We have the tls cache turned off, however. We prune the delivery database with -E0 every hour to avoid it getting big (otherwise DB recovery takes too long). Decreasing your checkpoint interval (we run at 10 minutes; 5 minutes is also pretty good) can also help immensely with this. Larry
Repeatable IMAP crash. Was Re: [PATCH] Updated master.c process counting patch
Actually, a number of users have been saying that renaming folders seems to cause problems. I think I've found a completely reproducible scenario that crashes the IMAP server for me every time.

[root@server2 root]# telnet localhost 143
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
* OK blahserver Cyrus IMAP4 v2.1.3 server ready
. login testuser blahpassword
. OK User logged in
. select inbox
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen)
* OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen \*)]
* 0 EXISTS
* 0 RECENT
* OK [UIDVALIDITY 1021369326]
* OK [UIDNEXT 1]
. OK [READ-WRITE] Completed
. create inbox.t1
. OK Completed
. create inbox.t2
. OK Completed
. select inbox
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen)
* OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen \*)]
* 0 EXISTS
* 0 RECENT
* OK [UIDVALIDITY 1021369326]
* OK [UIDNEXT 1]
. OK [READ-WRITE] Completed
. rename inbox.t1 inbox.t1_
. OK Completed
. rename inbox.t2 inbox.t2_
Connection closed by foreign host.

And in the IMAP log:

May 16 22:22:25 blahserver master[23269]: process 8649 exited, signaled to death by 11

Does this do the same thing for anyone else? I'm not sure of all our config details; I'm sure Jeremy can post them if you need them.

Rob

- Original Message -
From: Ken Murchison [EMAIL PROTECTED]
To: Jeremy Howard [EMAIL PROTECTED]
Cc: Lawrence Greenfield [EMAIL PROTECTED]; [EMAIL PROTECTED]; Henrique de Moraes Holschuh [EMAIL PROTECTED]
Sent: Thursday, May 16, 2002 11:31 PM
Subject: Re: [PATCH] Updated master.c process counting patch

Jeremy Howard wrote: Lawrence Greenfield wrote: Date: Wed, 15 May 2002 16:02:42 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...] The point is, if that indeed happens, log or no log, master loses track of the number of children that can service requests. That would be a bug, and the patch supposedly fixes this bug. It really doesn't matter (for accepting or not the patch) why the child died.

Yes, I understand that. However, if the master (in real life situations) is actually losing track of the number of available service processes without one of those service processes crashing (either by the sysadmin or otherwise) then there's some other problem in the child accounting.

The child accounting is fine. The problem in our case was always caused by child segfaults, or failure to properly close TCP connections. We still see segfaults (about one per fifty thousand connections I'd guess),

Can you send us a backtrace from a core? If you're not getting a core, please set up your system to dump one. Here are the bits that I use in my Cyrus startup script on Linux:

    cd /var/imap/cores
    ulimit -c unlimited
    export MALLOC_CHECK_=2
    $master

If you have multiple services/processes the cores will overwrite each other, so you need to catch it fairly quickly (unless they all have the same failure).

Ken
Re: [PATCH] Updated master.c process counting patch
Ken Murchison wrote: If you set MALLOC_CHECK_=2, then imapd will abort() whenever it thinks that there might be a corruption. By examining this core, it is easier to track down these problems. I've done this a few times to track down the subtle errors that have baffled others. Great. We've just updated our init.d file to use this, and to spit cores into different files using the pid, so hopefully we'll have some more info over the next few days.
Re: [PATCH] Updated master.c process counting patch
I am still waiting to hear from Ken and Lawrence on what they think about these patches. Will any or all of them be implemented in the next release?

Scott

--On Wednesday, May 15, 2002 12:41 PM +1000 Jeremy Howard [EMAIL PROTECTED] wrote: Thanks to Jaska Kivela, some patch formatting problems that caused the master.c process counting patch to not apply cleanly have been resolved. The patch set has been updated, and now also incorporates the master.c race condition patch: http://jhoward.fastmail.fm/patches/cyrus/imap-diff.tgz

The only file changed in the patch set is master-avail.diff. master-avail.diff solves the problem that master fails to correctly keep count of child processes if processes do not exit cleanly. This manifests itself as Cyrus failing to accept new connections on one or more ports after a while, when using preforking in imapd.conf.

-- Scott W. Adkins
Re: [PATCH] Updated master.c process counting patch
On Wed, 15 May 2002, Scott Adkins wrote: I am still waiting to hear from Ken and Lawrence on what they think about these patches. Will any or all of them be implemented in the next release?

I don't know what Ken and Lawrence think of these patches, but I just finished porting the child pid tracking of master-avail.diff to 2.1.4CVS, and will post that to this list soon. I will also include it in Debian, which will give some field-testing to the patch.

I don't know about all the other patches, though. I have included the safe_flock patches, and I *may* include the alarm and locking stuff in master-avail.diff later, but I must study and understand it first.

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
On Wed, 15 May 2002, Scott Russell wrote: On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote: I don't know about all the other patches, though. I have included the safe_flock patches, and I *may* include the alarm and locking stuff in master-avail.diff later, but I must study and understand it first. Are the patches TLS sane / tested? At one point I remember reading The ones I am including in Debian are. No TLS is _not_ an option IMHO. -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote: On Wed, 15 May 2002, Scott Adkins wrote: I am still waiting to hear from Ken and Lawrence on what they think about these patches. Will any or all of them be implemented in the next release?

I don't know what Ken and Lawrence think of these patches, but I just finished porting the child pid tracking of master-avail.diff to 2.1.4CVS, and will post that to this list soon. I will also include it in Debian, which will give some field-testing to the patch. I don't know about all the other patches, though. I have included the safe_flock patches, and I *may* include the alarm and locking stuff in master-avail.diff later, but I must study and understand it first.

Are the patches TLS sane / tested? At one point I remember reading that they wouldn't work with TLS connections. That was some time ago and I might have lost track of the status.

-- Regards, Scott Russell ([EMAIL PROTECTED]) Linux Technology Center, System Admin, RHCE. T/L 441-9289 / External 919-543-9289 http://bzimage.raleigh.ibm.com/webcam
Re: [PATCH] Updated master.c process counting patch
On Wed, May 15, 2002 at 10:52:39AM -0300, Henrique de Moraes Holschuh wrote: On Wed, 15 May 2002, Scott Russell wrote: On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote: I don't know about all the other patches, though. I have included the safe_flock patches, and I *may* include the alarm and locking stuff in master-avail.diff later, but I must study and understand it first. Are the patches TLS sane / tested? At one point I remember reading The ones I am including in Debian are. No TLS is _not_ an option IMHO. Agreed. I'm thankful to the people who took the time to track this down and write the patch. I'm also thankful to those who added TLS support to the patch :) -- Regards, Scott Russell ([EMAIL PROTECTED]) Linux Technology Center, System Admin, RHCE. T/L 441-9289 / External 919-543-9289 http://bzimage.raleigh.ibm.com/webcam
Re: [PATCH] Updated master.c process counting patch
Date: Wed, 15 May 2002 08:34:29 -0400 From: Scott Adkins [EMAIL PROTECTED] [...] I am still waiting to hear from Ken and Lawrence on what they think about these patches. Will any or all of them be implemented in the next release?

I'm still wondering what causes these problems. Some reports say that service processes aren't crashing; if they're not crashing, how is the count getting off?

I think in general this stuff is a good idea, but I'd like to understand a little better how this is happening. We don't have this problem on any of our servers.

Larry
Re: [PATCH] Updated master.c process counting patch
On Wed, 15 May 2002, Lawrence Greenfield wrote: Date: Wed, 15 May 2002 08:34:29 -0400 From: Scott Adkins [EMAIL PROTECTED] [...] I am still waiting to hear from Ken and Lawrence on what they think about these patches. Will any or all of them be implemented in the next release?

I'm still wondering what causes these problems. Some reports say that service processes aren't crashing; if they're not crashing, how is the count getting off?

Good question, isn't it? I am trying to track a segfault in the auth_unix callbacks with SASL 2.1.2 [1], but after that I will try to do a once-over of the entire master flow, with and without the child pid tracking patches.

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=145766

We don't have this problem on any of our servers.

Do you have preforking enabled? If you do (and if I understood the issue correctly), start kill -9'ing service processes, and it should be possible to duplicate the bug. I will try that just now, in fact.

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
Date: Wed, 15 May 2002 15:37:50 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...] I'm still wondering what causes these problems. Some reports say that service processes aren't crashing; if they're not crashing, how is the count getting off?

Good question, isn't it? I am trying to track a segfault in the auth_unix callbacks with SASL 2.1.2 [1], but after that I will try to do a once-over of the entire master flow, with and without the child pid tracking patches.

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=145766

auth_unix is part of the authorization, not part of libsasl. Regardless, this code happens after the service has told the master it's unavailable, so a crash here wouldn't cause the master's count to get off.

We don't have this problem on any of our servers.

Do you have preforking enabled? If you do (and if I understood the issue correctly), start kill -9'ing service processes, and it should be possible to duplicate the bug. I will try that just now, in fact.

Sure, if you intentionally kill processes that are waiting for connections this happens. I understand this. But if I did that, master would log messages that the processes were dying incorrectly (signaled to death by 9).

Larry
Re: [PATCH] Updated master.c process counting patch
On Wed, 15 May 2002, Lawrence Greenfield wrote: Good question, isn't it? I am trying to track a segfault in the auth_unix callbacks with SASL 2.1.2 [1], but after that I will try to do a once-over of the entire master flow, with and without the child pid tracking patches.

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=145766

auth_unix is part of the authorization, not part of libsasl.

It registers callbacks with SASL. SASL calls auth_newstate via the callback interface, and glibc dies in the middle of a getgrent from auth_newstate. The problem could be anywhere.

Do you have preforking enabled? If you do (and if I understood the issue correctly), start kill -9'ing service processes, and it should be possible to duplicate the bug. I will try that just now, in fact.

Sure, if you intentionally kill processes that are waiting for connections this happens. I understand this. But if I did that, master would log messages that the processes were dying incorrectly (signaled to death by 9).

The point is, if that indeed happens, log or no log, master loses track of the number of children that can service requests. That would be a bug, and the patch supposedly fixes this bug. It really doesn't matter (for accepting or not the patch) why the child died.

I _do_ agree that we have to track down why the children are dying, too. But that is a separate issue.

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
Date: Wed, 15 May 2002 16:02:42 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...] The point is, if that indeed happens, log or no log, master loses track of the number of children that can service requests. That would be a bug, and the patch supposedly fixes this bug. It really doesn't matter (for accepting or not the patch) why the child died. Yes, I understand that. However, if the master (in real life situations) is actually losing track of the number of available service processes without one of those service processes crashing (either by the sysadmin or otherwise) then there's some other problem in the child accounting. I'd like to know if there is before doing something that may mask the problem completely. Larry
Re: [PATCH] Updated master.c process counting patch
On Wed, 15 May 2002, Lawrence Greenfield wrote: service processes without one of those service processes crashing (either by the sysadmin or otherwise) then there's some other problem in the child accounting.

Well, anything that could cause the messages to be lost will cause trouble for the accounting without the patch. I can't see any other races in there right now, though.

I did duplicate the problem caused by children deaths, and I did verify that the patch fixes the problem, so at least that side of things it seems to handle well.

Children forking is behaving erratically, though. Before the first connect() to any of the master-controlled services, a random number of children are created (up to the prefork setting). After a connect(), all the missing children (maybe subject to maxforkrate, I didn't test) are created. This might be the way the code is supposed to work, though.

I'd like to know if there is before doing something that may mask the problem completely.

Timing the children, and logging warnings if they take too long to tell us that they are available for work or die without telling us they were going to exit cleanly, would give out the same information, while avoiding the worst of the problems the bug creates (no children left to service incoming requests). This would complicate the code in master/ a bit, though.

-- Henrique Holschuh
Re: [PATCH] Updated master.c process counting patch
--On Wednesday, May 15, 2002 4:29 PM -0400 Lawrence Greenfield [EMAIL PROTECTED] wrote: Date: Wed, 15 May 2002 16:02:42 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...] The point is, if that indeed happens, log or no log, master loses track of the number of children that can service requests. That would be a bug, and the patch supposedly fixes this bug. It really doesn't matter (for accepting or not the patch) why the child died.

Yes, I understand that. However, if the master (in real life situations) is actually losing track of the number of available service processes without one of those service processes crashing (either by the sysadmin or otherwise) then there's some other problem in the child accounting. I'd like to know if there is before doing something that may mask the problem completely. Larry

Would it be sufficient if the patch were altered slightly to send a LOG_DEBUG message to syslog every time the master decremented one of the worker counters, specifying what it did? Right now, it seems like the indicator of the problem is a service unavailability. If the key step in the process that we're covering is the report of the dead child to the master, and you'd be just as happy with a log message as with a blatantly obvious failure, well heck, let's do it! I'm happy to send you any bit of logging information you want, just so long as my servers stay available! :)

Michael Bacon OIT Systems Administration
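The logging idea above might look something like the following C sketch. It is an assumption about shape only, not the patch's code: the function name, the message wording, and the separate counter argument are all hypothetical. Only syslog(3) with the LOG_DEBUG priority comes from the suggestion itself.

```c
#include <assert.h>
#include <stddef.h>
#include <syslog.h>

/* Hypothetical helper: decrement a service's count of available workers
 * after a child died without announcing its exit, and say so in the log.
 * Returns the new count. */
int decrement_ready_logged(const char *service_name, int *ready_workers,
                           long dead_pid)
{
    assert(service_name != NULL);
    assert(ready_workers != NULL);

    /* Never let the counter go negative, even if accounting is already off. */
    if (*ready_workers > 0)
        (*ready_workers)--;

    syslog(LOG_DEBUG,
           "service %s: pid %ld died while counted as available; "
           "ready_workers now %d",
           service_name, dead_pid, *ready_workers);

    return *ready_workers;
}
```

Because the message is emitted at the point where the counter changes, the log would show each correction the master makes, which is the information Larry wants to avoid masking.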
Re: [PATCH] Updated master.c process counting patch
--On Wednesday, May 15, 2002 2:50 PM -0400 Lawrence Greenfield [EMAIL PROTECTED] wrote:

Date: Wed, 15 May 2002 15:37:50 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...]

We don't have this problem on any of our servers.

Do you have preforking enabled? If you do (and if I did understand the issue correctly), start kill -9'ing service processes, and it should be possible to duplicate the bug. I will try that just now, in fact.

Sure, if you intentionally kill processes that are waiting for connections this happens. I understand this. But if I did that, master would log messages that the processes were dying incorrectly (signaled to death by 9).

Just to follow up a bit more -- I'm not sure if this will be terribly helpful, but it may be illustrative of the one situation which might cause such a problem. Here's a snippet of log that occurred shortly before we ran into a service problem:

May 2 15:35:14 delrey.acpub.duke.edu master[16360]: can't fork process to run checkpoint
May 2 15:35:14 delrey.acpub.duke.edu last message repeated 45 times
May 2 15:35:15 delrey.acpub.duke.edu master[16360]: process 7494 exited, signaled to death by 9

Preceding this is a couple hundred more messages generally complaining about inability to fork, lack of resources, inability to load a shared library, and all of the other things one would expect to see when swap runs out. However, I don't see any other mention of process 7494. (This is logging at the user6.info level, so we don't get all of the debug messages, unfortunately.)

Now, the core problem here is very simple: This is a poor, beleaguered SparcStation 20 with 17k mailboxes on it, and at a peak block of time, it just ran out of available memory. The solution is also simple: we need to upgrade our hardware. However, since we're currently under budget cuts as our Executive VP scrounges up money to go around building big buildings, that's not exactly a viable option.
Tracking down the whys and wherefores of a process dying under a resource crunch is not likely to be terribly productive -- one can't very well expect a service to keep functioning normally under those circumstances. However, it would be nice if the master process paid attention to the SIGCHLD and the information from the wait() call and took note of the fact that the ex-process has shaken off this mortal coil, so that 15 minutes later, the process miscount didn't cause the master to start blithely ignoring incoming requests.

I hope this is helpful, Michael Bacon OIT Systems Administration Duke University
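The bookkeeping asked for here can be sketched as below. This is an illustration only, not the code in master-avail.diff: `struct child_slot`, `struct counts`, `note_child_exit` and `reap_children` are hypothetical names, and the real master keeps its state differently. The only system call used is `waitpid()` with `WNOHANG`.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical table entry: one slot per forked child. */
struct child_slot {
    pid_t pid;      /* 0 when the slot is unused */
    int is_ready;   /* nonzero if the child last reported itself available */
};

/* Hypothetical counters for one service. */
struct counts {
    int ready_workers;  /* children idle and waiting for a connection */
    int nactive;        /* children currently alive */
};

/* Adjust the counters for an exited child. Returns 1 if the pid was known. */
static int note_child_exit(struct child_slot *table, size_t n,
                           pid_t pid, struct counts *cnt)
{
    if (pid <= 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (table[i].pid != pid)
            continue;
        if (table[i].is_ready)
            cnt->ready_workers--;   /* it can no longer take a connection */
        cnt->nactive--;
        table[i].pid = 0;
        table[i].is_ready = 0;
        return 1;
    }
    return 0;
}

/* Reap every exited child without blocking; returns how many were reaped. */
static int reap_children(struct child_slot *table, size_t n,
                         struct counts *cnt)
{
    int status, reaped = 0;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        note_child_exit(table, n, pid, cnt);
        reaped++;
    }
    return reaped;
}
```

A common arrangement is for the SIGCHLD handler to set only a flag and for the main loop to call something like `reap_children()`, so that the counters are never touched from signal context.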
Re: [PATCH] Updated master.c process counting patch
Henrique de Moraes Holschuh wrote:

I don't know what Ken and Lawrence think of these patches, but I just finished porting the child pid tracking of master-avail.diff to 2.1.4CVS, and will post that to this list soon. I will also include it in Debian, which will give some field-testing to the patch.

I *strongly* recommend also including shutdown.diff. This is important in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the ' !imapd_in->tls_conn' bit everywhere for general distribution--this is a workaround for a memory corruption problem that is unrelated to this patch.
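The contents of shutdown.diff are not shown in this thread, so the following is only a sketch of the general technique it is described as using: discard pending input, call shutdown() on the read side, then close(). The function name `finish_connection` is hypothetical, and `MSG_DONTWAIT` is assumed to be available (it is on Linux).

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Drain and close a connected socket. Returns 0 on success, -1 if
 * shutdown() or close() reported an unexpected error.
 */
static int finish_connection(int fd)
{
    char buf[4096];
    ssize_t n;
    int rc = 0;

    /* Throw away input that has already arrived, without blocking. */
    do {
        n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
    } while (n > 0 || (n < 0 && errno == EINTR));

    /* We will not read again. ENOTCONN only means the peer is already gone. */
    if (shutdown(fd, SHUT_RD) < 0 && errno != ENOTCONN)
        rc = -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Only the read side is shut down here, so data already queued for the client can still be delivered before the descriptor is closed.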
Re: [PATCH] Updated master.c process counting patch
Lawrence Greenfield wrote:

Date: Wed, 15 May 2002 16:02:42 -0300 From: Henrique de Moraes Holschuh [EMAIL PROTECTED] [...]

The point is, if that indeed happens, log or no log, master loses track of the number of children that can service requests. That would be a bug, and the patch supposedly fixes this bug. It really doesn't matter (for accepting or not the patch) why the child died.

Yes, I understand that. However, if the master (in real life situations) is actually losing track of the number of available service processes without one of those service processes crashing (either by the sysadmin or otherwise) then there's some other problem in the child accounting.

The child accounting is fine. The problem in our case was always caused by child segfaults, or failure to properly close TCP connections. We still see segfaults (about one per fifty thousand connections I'd guess), and occasional TCP closing problems, but they're much reduced with the other patches. However, with the master.c patch these intermittent problems have no practical impact on our users, since the server handles them gracefully.

The master.c patch is important because without it, any little problem in any daemon that causes a child to crash will eventually bring down the whole server. With the patch, processes are still counted correctly and therefore a child segfaulting will not stop the server from accepting connections. Child segfaults still cause the problem to be logged as per usual--the only difference with the patch is that child processes are correctly counted in this event.
Re: [PATCH] Updated master.c process counting patch
Henrique de Moraes Holschuh wrote: On Wed, 15 May 2002, Scott Russell wrote: On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote:

I don't know about all the other patches, though. I have included the safe_flock patches, and I *may* include the alarm and locking stuff in master-avail.diff later, but I must study and understand it first.

Are the patches TLS sane / tested? At one point I remember reading

The ones I am including in Debian are. No TLS is _not_ an option IMHO.

The TLS issue was simply required to work around a memory corruption bug. The patches all work correctly with TLS, as long as you remove the ' !imapd_in->tls_conn' bits in imapd.c and pop3d.c. These extra tests have nothing to do with the other patches. The problem is that imapd_in et al sometimes get overwritten, causing prot_flush and prot_fill to segfault when they are called during shut_down() or imapd_reset(). Therefore we wanted some way of checking whether imapd_in was still sane, and we happened to pick checking that ->tls_conn==0. A better approach would be to make the first element of the record a known constant and check for that. An even better approach would be to work out why the structure is getting corrupted in the first place!

To clarify--the memory corruption problem occurs regardless of whether the other patches are present (in our environment at least), and needs to be worked around regardless of whether the other patches are present. The issues are completely independent. The memory corruption bug is something we see every few thousand connections under Linux 2.4.18 with Cyrus 2.1.3 in a mixed client high load environment.
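The "known constant as first element" idea can be sketched as follows. `struct stream`, `STREAM_MAGIC`, `stream_init` and `stream_is_sane` are hypothetical stand-ins invented for illustration; the real Cyrus stream record has a different layout, and the magic value is arbitrary.

```c
#include <stddef.h>

/* Arbitrary marker value, chosen only for this illustration. */
#define STREAM_MAGIC 0x50524f54u

/* Hypothetical stand-in for the real stream record. */
struct stream {
    unsigned int magic;   /* must stay the first member */
    int fd;
    void *tls_conn;
};

/* Set up a record and stamp it with the marker. */
static void stream_init(struct stream *s, int fd)
{
    s->magic = STREAM_MAGIC;
    s->fd = fd;
    s->tls_conn = NULL;
}

/* Nonzero if the record still carries the marker it was initialised with. */
static int stream_is_sane(const struct stream *s)
{
    return s != NULL && s->magic == STREAM_MAGIC;
}
```

A shutdown path would then flush the stream only when `stream_is_sane()` returns nonzero. Like the tls_conn test it would replace, this only detects some corruption; it does not explain or prevent it.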
[PATCH] Updated master.c process counting patch
Thanks to Jaska Kivela, some patch formatting problems that caused the master.c process counting patch to not apply cleanly have been resolved. The patch set has been updated, and now also incorporates the master.c race condition patch:

http://jhoward.fastmail.fm/patches/cyrus/imap-diff.tgz

The only file changed in the patch set is master-avail.diff.

master-avail.diff solves the problem that master fails to correctly keep count of child processes if processes do not exit cleanly. This manifests itself as Cyrus failing to accept new connections on one or more ports after a while, when using preforking in imapd.conf.