Re: One more attempt: stuck processes
Gary Mills wrote: On Mon, Nov 19, 2007 at 12:35:46PM -0500, Ken Murchison wrote: Sebastian Hagedorn wrote: -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 17. November 2007 11:21:38 -0500 regarding Re: One more attempt: stuck processes: Here's a patch that seems to fix the problem. I did some basic testing (Linux only) to make sure that it doesn't break anything else, but its always possible that it has some unforseen side effects. Keep an eye on it and let me know if you see anything unusual. Thanks, it seems to be working fine so far! Enjoy your weekend now ... Gary, have you tried the patch? On Solaris 9, SO_RCVTIMEO is not mentioned in the setsockopt man page. My reading tells me that it is defined in the header file, but if it's actually used, setsockopt() will return an error. I understand that this is the case for other operating systems too. This seems to be a known problem with setsockopt(). I just checked the Opensolaris source; it does function there. This is the code: case SO_RCVTIMEO: if (optlen == sizeof (uint32_t)) sockets[i].in_timeout = *(uint32_t *)optval; else { errno = EINVAL; } break; Note that the option has to be an int, and the length has to be that of an int as well. Linux wants a `struct timeval'. I wonder if there's a standard in this area? OK. Can you both try this alternate patch? It should be portable, and GDB shouldn't cause it to kick out. I've set it up so that for SSL-wrapped services it will timeout after 3 minutes, otherwise it uses the service-specific timeout. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Index: imapd.c === RCS file: /afs/andrew/system/cvs/src/cyrus/imap/imapd.c,v retrieving revision 1.535 diff -u -r1.535 imapd.c --- imapd.c 9 Nov 2007 12:54:22 - 1.535 +++ imapd.c 20 Nov 2007 14:15:24 - @@ -148,6 +148,7 @@ /* end PROXY STUFF */ /* per-user/session state */ +int imapd_timeout; struct protstream *imapd_out = NULL; struct protstream *imapd_in = NULL; struct protgroup *protin = NULL; @@ -701,7 +702,6 @@ #endif { socklen_t salen; -int timeout; sasl_security_properties_t *secprops = NULL; struct sockaddr_storage imapd_localaddr, imapd_remoteaddr; char localip[60], remoteip[60]; @@ -778,9 +778,10 @@ proc_register(imapd, imapd_clienthost, NULL, NULL); /* Set inactivity timer */ -timeout = config_getint(IMAPOPT_TIMEOUT); -if (timeout 30) timeout = 30; -prot_settimeout(imapd_in, timeout*60); +imapd_timeout = config_getint(IMAPOPT_TIMEOUT); +if (imapd_timeout 30) imapd_timeout = 30; +imapd_timeout *= 60; +prot_settimeout(imapd_in, imapd_timeout); prot_setflushonread(imapd_in, imapd_out); /* we were connected on imaps port so we should do @@ -6626,6 +6627,7 @@ result=tls_start_servertls(0, /* read */ 1, /* write */ + imaps ? 180 : imapd_timeout, layerp, auth_id, tls_conn); Index: lmtpengine.c === RCS file: /afs/andrew/system/cvs/src/cyrus/imap/lmtpengine.c,v retrieving revision 1.122 diff -u -r1.122 lmtpengine.c --- lmtpengine.c 10 Oct 2007 15:14:39 - 1.122 +++ lmtpengine.c 20 Nov 2007 14:15:25 - @@ -1579,6 +1579,7 @@ r=tls_start_servertls(0, /* read */ 1, /* write */ + 360, /* 6 minutes */ layerp, auth_id, (cd.tls_conn)); Index: mupdate.c === RCS file: /afs/andrew/system/cvs/src/cyrus/imap/mupdate.c,v retrieving revision 1.99 diff -u -r1.99 mupdate.c --- mupdate.c 10 Oct 2007 15:14:39 - 1.99 +++ mupdate.c 20 Nov 2007 14:15:25 - @@ -1941,6 +1941,7 @@ result=tls_start_servertls(C-pin-fd, /* read */ C-pout-fd, /* write */ + 180, /* 3 minutes */ layerp, auth_id, C-tlsconn); Index: nntpd.c === RCS file: /afs/andrew/system/cvs/src/cyrus/imap/nntpd.c,v retrieving revision 1.60 diff -u -r1.60 nntpd.c --- nntpd.c 10 Oct 2007 15:14:39 - 1.60 +++ nntpd.c 20 Nov 2007 14:15:25 - @@ -142,6 +142,7 @@ sasl_conn_t *nntp_saslconn; /* the sasl connection context */ +int nntp_timeout; char newsprefix[100] = ; char *nntp_userid = 0, *newsmaster; struct auth_state *nntp_authstate = 0, *newsmaster_authstate; @@ -523,7 +524,6 @@ char localip[60], remoteip[60]; char hbuf[NI_MAXHOST]; int niflags; -int timeout; sasl_security_properties_t *secprops=NULL; char unavail[1024]; @@ -590,9 +590,10 @@ proc_register(nntpd, nntp_clienthost, NULL, NULL); /* Set inactivity timer
Re: One more attempt: stuck processes
--On 20. November 2007 09:20:42 -0500 Ken Murchison [EMAIL PROTECTED] wrote: OK. Can you both try this alternate patch? It should be portable, and GDB shouldn't cause it to kick out. I've set it up so that for SSL-wrapped services it will timeout after 3 minutes, otherwise it uses the service-specific timeout. Thanks for working on this. I think you missed a spot: gcc -c -I.. -I./../sieve -I./../imap -I./../lib -I/usr/include/db4 -I/usr/kerberos/include -I/usr/include/et -DHAVE_CONFIG_H -I/usr/kerberos/include -O2 -g -pipe -march=i386 -mcpu=i686 -fPIC \ parser.c parser.c: In function `cmd_starttls': parser.c:834: warning: passing arg 3 of `tls_start_servertls' makes integer from pointer without a cast parser.c:834: warning: passing arg 4 of `tls_start_servertls' from incompatible pointer type parser.c:834: warning: passing arg 5 of `tls_start_servertls' from incompatible pointer type parser.c:834: too few arguments to function `tls_start_servertls' I can fix this myself, but it's probably easier if you do it. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpImfrnNAhAO.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 20. November 2007 15:59:18 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote: I can fix this myself, but it's probably easier if you do it. Just FYI: I fixed it locally with a 3 minute timeout and it compiled fine. I'll start testing it now. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpoLn6l11ZH7.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 19. November 2007 12:35:46 -0500 regarding Re: One more attempt: stuck processes: How are things looking today? Good! When I just checked I thought I'd found a new hanging pop3d process, because it's been around for 6 hours, but in fact it's still working: Nov 19 14:17:16 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found ... Nov 19 19:04:23 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found Nov 19 19:04:27 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found Nov 19 19:04:30 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found Sigh. But that's beside the matter. The only other potential downside the patch has is that stracing or gdb'ing it causes the timeout to trigger prematurely. AFAIK that's a common issue for signals. I'm not sure if there's something that could be done about that, i.e. catching ERESTART or something like that? -- Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587 pgpOw7IfeSb18.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 19. November 2007 13:17:07 -0500 regarding Re: One more attempt: stuck processes: The only other potential downside the patch has is that stracing or gdb'ing it causes the timeout to trigger prematurely. AFAIK that's a common issue for signals. I'm not sure if there's something that could be done about that, i.e. catching ERESTART or something like that? Is this only a problem with pop3s, or does it also effect the STLS command? Yes, it does: telnet cyrus 110 ... stls +OK Begin TLS negotiation now -ERR [SYS/PERM] Starttls failed That's when I straced the process ... -- Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587 pgpgMNLWdKmc1.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 19. November 2007 13:17:07 -0500 regarding Re: One more attempt: stuck processes: The only other potential downside the patch has is that stracing or gdb'ing it causes the timeout to trigger prematurely. AFAIK that's a common issue for signals. I'm not sure if there's something that could be done about that, i.e. catching ERESTART or something like that? Is this only a problem with pop3s, or does it also effect the STLS command? Yes, it does: telnet cyrus 110 ... stls +OK Begin TLS negotiation now -ERR [SYS/PERM] Starttls failed That's when I straced the process ... If you want to write and test a patch to resolve this, I'll certainly take a look at it. Another way to resolve the blocking SSL_accept() problem would be to make the socket non-blocking and then check the SSL_accept() return code to see if we need to call SSL_accept() again. I took the easy way out by setting SO_RCVTIMEO. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 19. November 2007 12:35:46 -0500 regarding Re: One more attempt: stuck processes: How are things looking today? Good! When I just checked I thought I'd found a new hanging pop3d process, because it's been around for 6 hours, but in fact it's still working: Nov 19 14:17:16 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found ... Nov 19 19:04:23 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found Nov 19 19:04:27 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found Nov 19 19:04:30 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net [80.134.87.49] LOGIN user not found Sigh. But that's beside the matter. The only other potential downside the patch has is that stracing or gdb'ing it causes the timeout to trigger prematurely. AFAIK that's a common issue for signals. I'm not sure if there's something that could be done about that, i.e. catching ERESTART or something like that? Is this only a problem with pop3s, or does it also effect the STLS command? -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Mon, Nov 19, 2007 at 12:35:46PM -0500, Ken Murchison wrote: Sebastian Hagedorn wrote: -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 17. November 2007 11:21:38 -0500 regarding Re: One more attempt: stuck processes: Here's a patch that seems to fix the problem. I did some basic testing (Linux only) to make sure that it doesn't break anything else, but its always possible that it has some unforseen side effects. Keep an eye on it and let me know if you see anything unusual. Thanks, it seems to be working fine so far! Enjoy your weekend now ... Gary, have you tried the patch? On Solaris 9, SO_RCVTIMEO is not mentioned in the setsockopt man page. My reading tells me that it is defined in the header file, but if it's actually used, setsockopt() will return an error. I understand that this is the case for other operating systems too. This seems to be a known problem with setsockopt(). I just checked the Opensolaris source; it does function there. This is the code: case SO_RCVTIMEO: if (optlen == sizeof (uint32_t)) sockets[i].in_timeout = *(uint32_t *)optval; else { errno = EINVAL; } break; Note that the option has to be an int, and the length has to be that of an int as well. Linux wants a `struct timeval'. I wonder if there's a standard in this area? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 06:37:52PM +0100, Sebastian Hagedorn wrote: OK. Still the symptom seems to be different from what I'm seeing. It may be. As I said I had no time so far to investigate it in depth, I just wanted to say mee too for the hung process problem. Could it be that you have a process limit in /etc/cyrus.conf? Of course. And when the number of hung processes reaches that limit users start to complain that they can not read their mail. Gabor -- - MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences - Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 15:54:50 -0500 regarding Re: One more attempt: stuck processes: That's exactly what Gary is seeing. Right. Apparently stripped binaries aren't any good for straces. Its blocking in SSL_accept(). Apparently the client connects to port 995, and then either sends nothing, or goes away and leaves the socket open. Here's a patch that seems to fix the problem. I did some basic testing (Linux only) to make sure that it doesn't break anything else, but its always possible that it has some unforseen side effects. Keep an eye on it and let me know if you see anything unusual. --- prot.c.~1.93.~ 2007-11-17 10:02:49.0 -0500 +++ prot.c 2007-11-17 11:21:05.0 -0500 @@ -58,6 +58,7 @@ #ifdef HAVE_UNISTD_H #include unistd.h #endif +#include sys/time.h #include sys/types.h #include sys/stat.h #include netinet/in.h @@ -214,10 +215,20 @@ */ int prot_settimeout(struct protstream *s, int timeout) { +struct timeval tv; + assert(!s-write); s-read_timeout = timeout; s-timeout_mark = time(NULL) + timeout; + +/* Set a read timeout on the socket itself. This primarily prevents + * OpenSSL calls such as SSL_accept() from blocking indefinitely. + */ +tv.tv_sec = timeout; +tv.tv_usec = 0; +setsockopt(0, SOL_SOCKET, SO_RCVTIMEO, tv, sizeof(struct timeval)); + return 0; } -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 19:25:19 +0100 Simon Matter [EMAIL PROTECTED] wrote: It's blinking red, which normally means a broken link. I'm not sure how The file 0 is a symbolic symlink which doesn't really point to a file, that's why the shell shows it blinking. Everything okay here. Thanks. That's what I thought, but I wasn't sure. reliable that is in this case. Anyway, lsof reports: pop3d 25038 cyrus0u IPv4 -64802663 TCP cyrus.rrz.uni-koeln.de:pop3s-p50865F5D.dip.t-dialin.net:1064 (ESTABLISHED) It *thinks* the connections is still open. So does netstat: # LANG=C netstat -a|grep p50865F5D tcp0 0 cyrus.rrz.uni-koeln.d:pop3s p50865F5D.dip.t-dialin:1064 ESTABLISHED But obviously that connection is dead. I don't know what conclusions to draw from that ... Just two ideas come to mind: 1) Since it only happens on dialup connections, could it be that the dialin router at the providers end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall? OK, let's run with that one. a) We don't really have a firewall, we only use ACLs on the Cisco routers. You can't even filter TCP/RST there. b) Even *if* a TCP/RST had been dropped, lost or whatever, the server *still* should timeout eventually! 2) Could it be that SO_LINGER should be used as socket option in service_create() in master/master.c. I didn't remember that option, so I just read up on it. It seems as though SO_LINGER is very dependent on implementation. If I get your intention correctly SO_LINGER would have to be set with l_onoff set to non-zero and l_linger to zero, right? So close() would return immediately? That might make sense if the stack trace showed a call to close(). But if I understand the code correctly, close() isn't called at all. The socket is closed as a result of a call to exit(). And that defeats all use of SO_LINGER: When the socket is closed as part of exit(2), it always lingers in the background. If it's complete nonsense, ignore it. I wouldn't know :-) -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpproVHc1y86.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 16:52:27 +0100 Gabor Gombas [EMAIL PROTECTED] wrote: On Fri, Nov 16, 2007 at 12:36:49PM +0100, Sebastian Hagedorn wrote: He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem. FYI I also know a system that has problems with hung Cyrus processes. AFAIR they have problems with pop3s only, but that may be because there are more POP3 than IMAP users, I don't know. The system in question runs 2.3.8 on Debian Etch currently. That's a 2.6 kernel, right? I intend to help diagnose that system but had no time so far; they're now running a script that does a POP3 connection every couple of minutes and if that takes too long it restarts Cyrus. Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time. There is nothing interesting in the logs: Oct 15 02:39:31 host cyrus/master[6102]: about to exec /usr/local/cyrus/sbin/pop3d Oct 15 02:39:31 host cyrus/pop3s[6102]: executed Oct 15 02:39:31 host cyrus/pop3s[6102]: accepted connection That's what I'm seeing. Could you get a stack trace? OTOH there are a lot of messages like the following: Oct 16 14:13:10 host cyrus/master[26136]: about to exec /usr/local/cyrus/sbin/pop3d Oct 16 14:13:10 host cyrus/pop3s[26136]: executed Oct 16 14:13:10 host cyrus/pop3s[26136]: accepted connection Oct 16 14:13:10 host cyrus/pop3s[26136]: pop3s failed: [XX.XXX.XX.XXX] Oct 16 14:13:10 host cyrus/pop3s[26136]: Fatal error: tls_start_servertls() failed Oct 16 14:13:10 host cyrus/master[15923]: process 26136 exited, status 75 Oct 16 14:13:10 host cyrus/master[15923]: service pop3s pid 26136 in BUSY state: terminated abnormally Any idea what's causing that? I have many of those as well. I suppose that could be any number of things. Faulty protocol or dropped connections. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgp3H24eUgNSV.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 03:20:57PM +0100, Sebastian Hagedorn wrote: --On 16. November 2007 08:00:07 -0600 Gary Mills [EMAIL PROTECTED] wrote: This timeout doesn't work in some cases. We have lots of POP sessions that never terminate. That's interesting to hear! Especially since you are using Solaris. About 30 out of 40 are in that state now. Here's an example: cyrus 13075 708 0 Oct 14 ?0:05 pop3d -s cyrus 20023 708 0 Oct 29 ?0:00 pop3d cyrus 24560 708 1 07:38:03 ?0:03 pop3d cyrus 631 708 0 Oct 03 ?0:10 pop3d -s cyrus 6786 708 0 Oct 20 ?0:00 pop3d -s cyrus 29777 708 0 07:45:03 ?0:00 pop3d cyrus 19175 708 0 Oct 04 ?0:04 pop3d -s One I just checked is stuck in a read(): # truss -p 19175 read(0, 0x002316F0, 5) (sleeping...) ^?# pfiles 19175 19175: pop3d -s Current rlimit: 256 file descriptors 0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0 O_RDWR sockname: AF_INET 130.179.16.23 port: 995 peername: AF_INET 130.179.188.184 port: 51771 Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger. Easy: # pstack 19175 19175: pop3d -s fef9f810 read (0, 2316f0, 5) fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904 ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828 ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198 0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18 00029298 _start (0, 0, 0, 0, 0, 0) + 108 I've confirmed that the client has gone away a long time ago. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one. Please let me know if you get a trace from a hung process. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 05:20:00PM +0100, Sebastian Hagedorn wrote: That's a 2.6 kernel, right? Yes, 2.6.18-2-amd64. Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time. It's not a slowdown - the client connects, and hangs. It never even gets to the authentication phase (at least it's not logged). Clients that happen to connect to a non-affected process run normally. Also, IMAP connections do not seem to be affected, at least I did not hear any complaints about that. That's what I'm seeing. Could you get a stack trace? I intend to but I do not have the time currently. I'm not involved in the daily management of that machine and the operators are happy to just restart Cyrus when the hangs begin, and so far I never was around just when that happened. Gabor -- - MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences - Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote: Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger. Easy: # pstack 19175 19175: pop3d -s fef9f810 read (0, 2316f0, 5) fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18 00029298 _start (0, 0, 0, 0, 0, 0) + 108 Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation. Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything Yes. Gary's problem might be very similar to yours, depending on what I see from the patch that I just sent you. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 18:07:51 +0100 Gabor Gombas [EMAIL PROTECTED] wrote: Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time. It's not a slowdown - the client connects, and hangs. It never even gets to the authentication phase (at least it's not logged). Clients that happen to connect to a non-affected process run normally. Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf: allowapop: 0 Either of those things should get rid of that. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgp4g0isqF9Ha.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 01:54:24PM +0100, Alain Spineux wrote: On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote: --On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote: 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone. I said that socket should timeout, but this is true only when the protocol (TCP here) require a response (usualy AK here) or at connection establishement. On the contrary it should stay open indefinitely util something happens. Router doing NAT can drop a too old connection, because it has to maintains a NAT table and make some cleanup time to time, this where KEEPALIVE become usefull. This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it? Cyrus has already a built-in time out, it seems a lite conflicting to actively maintains the connection until it drop it itself ! This is the works of the client to actively maintains the connection, if it want it ! This timeout doesn't work in some cases. We have lots of POP sessions that never terminate. About 30 out of 40 are in that state now. Here's an example: cyrus 13075 708 0 Oct 14 ?0:05 pop3d -s cyrus 20023 708 0 Oct 29 ?0:00 pop3d cyrus 24560 708 1 07:38:03 ?0:03 pop3d cyrus 631 708 0 Oct 03 ?0:10 pop3d -s cyrus 6786 708 0 Oct 20 ?0:00 pop3d -s cyrus 29777 708 0 07:45:03 ?0:00 pop3d cyrus 19175 708 0 Oct 04 ?0:04 pop3d -s One I just checked is stuck in a read(): # truss -p 19175 read(0, 0x002316F0, 5) (sleeping...) ^?# pfiles 19175 19175: pop3d -s Current rlimit: 256 file descriptors 0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0 O_RDWR sockname: AF_INET 130.179.16.23 port: 995 peername: AF_INET 130.179.188.184 port: 51771 -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote: 1) Since it only happens on dialup connections, could it be that the dialin router at the providers end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall? OK, let's run with that one. a) We don't really have a firewall, we only use ACLs on the Cisco routers. You can't even filter TCP/RST there. b) Even *if* a TCP/RST had been dropped, lost or whatever, the server *still* should timeout eventually! I just had a discussion with a colleague regarding this. He made two observations: 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone. This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it? 2. The stack trace looks garbled: (gdb) bt #0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4 #6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x082ee848 in ?? () #12 0x in ?? () He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem. I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one. If that doesn't go anywhere, I guess I'll drop this investigation. We will upgrade to RHEL 5 some time next year, so hopefully that will bring new bugs :-) -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpbrIUha0peZ.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: The only reason I could imagine for the sequence of calls was signal handling. But let's be methodical. There's only one spot where SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only called in cmd_starttls(). That in turn is called either in cmdloop (for handling of STLS) or in service_main() for connections to port 995. Actually, now that I think about it, I believe SSL_accept() can be called from SSL_read() at any time if a renegotiation is required. Since shut_down() calls prot_fill(), which in turn can call SSL_read(), its possible that we can get an SSL_accept() call. Before I start hacking code, can you apply the following patch (sorry about the line breaks) and see if I'm heading in the right direction? Let me know if you get any of the WARNING messages in your logs. --- prot.c.~1.93.~ 2007-11-16 11:21:56.0 -0500 +++ prot.c 2007-11-16 11:23:32.0 -0500 @@ -468,6 +468,7 @@ /* just do a SSL read instead if we're under a tls layer */ if (s-tls_conn != NULL) { n = SSL_read(s-tls_conn, (char *) s-buf, PROT_BUFSIZE); + if (n = 0) syslog(LOG_WARNING, SSL_read() returned %d, n); } else { n = read(s-fd, s-buf, PROT_BUFSIZE); } -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 13:54:24 +0100 Alain Spineux [EMAIL PROTECTED] wrote: On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote: I just had a discussion with a colleague regarding this. He made two observations: 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone. I said that socket should timeout, but this is true only when the protocol (TCP here) require a response (usualy AK here) or at connection establishement. Right. On the contrary it should stay open indefinitely util something happens. Router doing NAT can drop a too old connection, because it has to maintains a NAT table and make some cleanup time to time, this where KEEPALIVE become usefull. Not only there, but I think also in the case of unilaterally dropped connections. This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it? Cyrus has already a built-in time out, it seems a lite conflicting to actively maintains the connection until it drop it itself ! I'm not sure I understand that sentence. This is the works of the client to actively maintains the connection, if it want it ! Yes, but what if the client is gone? I realise that *normally* the server keeps a built-in timeout, but I'm guessing that sometimes it doesn't work, perhaps because something (in prot_fill() perhaps?) blocks. I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one. If that doesn't go anywhere, I guess I'll drop this You could try to replace imapd by a home made script, something like . mv imapd imapd_ echo exec strace -o /tmp/imapd.$$ imapd_ $* imapd chmod imapd a+x Thanks for the suggestion. I'll think about it, although I'm wary of doing that on a production server. investigation. We will upgrade to RHEL 5 some time next year, so hopefully that will bring new bugs :-) Sorry but I dont understand what you are complaining about! I'm not complaining ... Is-it because the imap or pop client is loosing its connection and this disturb the user No. or just because you are getting some sleeping processes ? If it were some I wouldn't worry. I'm talking hundreds of processes! I know I can kill them, in fact for the pop3d processes we run this command once a month: ps -C pop3d -o pid,start|grep [a-z]|awk '{print $1}'|xargs kill (It kills pop3d processes that have the month in their start time, i.e. are more than a day old) But for imapd processes it's not as easy to tell if they are just long-living or stuck. Do you have a timeout option in your imapd.conf to force the imap/pop server to autologout ? No. But both POP and IMAP have default timeouts. They just don't work in my case. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpUj3SrktoJw.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 08:00:07 -0600 Gary Mills [EMAIL PROTECTED] wrote: This timeout doesn't work in some cases. We have lots of POP sessions that never terminate. That's interesting to hear! Especially since you are using Solaris. About 30 out of 40 are in that state now. Here's an example: cyrus 13075 708 0 Oct 14 ?0:05 pop3d -s cyrus 20023 708 0 Oct 29 ?0:00 pop3d cyrus 24560 708 1 07:38:03 ?0:03 pop3d cyrus 631 708 0 Oct 03 ?0:10 pop3d -s cyrus 6786 708 0 Oct 20 ?0:00 pop3d -s cyrus 29777 708 0 07:45:03 ?0:00 pop3d cyrus 19175 708 0 Oct 04 ?0:04 pop3d -s One I just checked is stuck in a read(): # truss -p 19175 read(0, 0x002316F0, 5) (sleeping...) ^?# pfiles 19175 19175: pop3d -s Current rlimit: 256 file descriptors 0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0 O_RDWR sockname: AF_INET 130.179.16.23 port: 995 peername: AF_INET 130.179.188.184 port: 51771 Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpvKBTMY4YQA.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote: Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger. Easy: # pstack 19175 19175: pop3d -s fef9f810 read (0, 2316f0, 5) fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18 00029298 _start (0, 0, 0, 0, 0, 0) + 108 Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation. Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpldwycIAjiI.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 12:36:49PM +0100, Sebastian Hagedorn wrote: He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem. FYI I also know a system that has problems with hung Cyrus processes. AFAIR they have problems with pop3s only, but that may be because there are more POP3 than IMAP users, I don't know. The system in question runs 2.3.8 on Debian Etch currently. I intend to help diagnose that system but had no time so far; they're now running a script that does a POP3 connection every couple of minutes and if that takes too long it restarts Cyrus. There is nothing interesting in the logs: Oct 15 02:39:31 host cyrus/master[6102]: about to exec /usr/local/cyrus/sbin/pop3d Oct 15 02:39:31 host cyrus/pop3s[6102]: executed Oct 15 02:39:31 host cyrus/pop3s[6102]: accepted connection ... and that's about it, nothing else is logged about the stuck process. As can be seen the process gets stuck just after it has been created, so -U 1 can not help. OTOH there are a lot of messages like the following: Oct 16 14:13:10 host cyrus/master[26136]: about to exec /usr/local/cyrus/sbin/pop3d Oct 16 14:13:10 host cyrus/pop3s[26136]: executed Oct 16 14:13:10 host cyrus/pop3s[26136]: accepted connection Oct 16 14:13:10 host cyrus/pop3s[26136]: pop3s failed: [XX.XXX.XX.XXX] Oct 16 14:13:10 host cyrus/pop3s[26136]: Fatal error: tls_start_servertls() failed Oct 16 14:13:10 host cyrus/master[15923]: process 26136 exited, status 75 Oct 16 14:13:10 host cyrus/master[15923]: service pop3s pid 26136 in BUSY state: terminated abnormally Any idea what's causing that? Gabor -- - MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences - Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 06:11:01PM +0100, Sebastian Hagedorn wrote: Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf: Debian uses /dev/urandom for a long time: # strings /usr/lib/libsasl2.so.2 | grep random /dev/urandom And according to the logs I have, after a pop3 process got stuck other IMAP users can still log in using TLS+PLAIN, so entropy can be ruled out. Gabor -- - MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences - Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: --On 16. November 2007 12:39:28 -0500 Ken Murchison [EMAIL PROTECTED] wrote: Sorry, my patch wasn't complete. It wasn't logging the value that I wanted. OK: Nov 16 18:48:17 lvr13 pop3s[1385]: SSL_read() returned 0:5 Nov 16 18:48:33 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:48:50 lvr13 pop3s[1980]: SSL_read() returned 0:6 Nov 16 18:48:54 lvr13 pop3s[1376]: SSL_read() returned 0:5 Nov 16 18:49:03 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:49:11 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:49:38 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:49:54 lvr13 pop3s[1404]: SSL_read() returned 0:5 I'm guessing that's still not enough: #define SSL_ERROR_SYSCALL 5 /* look at error stack/return value/errno */ #define SSL_ERROR_ZERO_RETURN 6 SSL_ERROR_SYSCALL Some I/O error occurred. The OpenSSL error queue may contain more information on the error. If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the error: If ret == 0, an EOF was observed that violates the pro- tocol. If ret == -1, the underlying BIO reported an I/O error (for socket I/O on Unix systems, consult errno for details). So should I add a call to ERR_get_error()? Not yet. I'm assuming that none of these processes has hung. We're getting an I/O error most likely because the client has closed the connection immediately after sending QUIT. This is harmless. What I really want to see is if we get a SSL_ERROR_WANT_xxx return code when we're hung. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Nov 16, 2007 6:11 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote: --On 16. November 2007 18:07:51 +0100 Gabor Gombas [EMAIL PROTECTED] wrote: Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time. It's not a slowdown - the client connects, and hangs. It never even gets to the authentication phase (at least it's not logged). Clients that happen to connect to a non-affected process run normally. Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf: allowapop: 0 Either of those things should get rid of that. The quick but the bad way to do this is for testing is # ls -l /dev/*random crw-rw-rw- 1 root root 1, 8 Nov 16 06:18 /dev/random cr--r--r-- 1 root root 1, 9 Nov 7 22:47 /dev/urandom # mv /dev/random /dev/random.orig # ln -sf /dev/urandom /dev/random And then, because letting /dev/random that way too long is insecure : # rm -f /dev/random # mv /dev/random.orig /dev/random This avoid to recompile the source just for testing. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html -- Alain Spineux aspineux gmail com May the sources be with you Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 05:13:13PM +0100, Sebastian Hagedorn wrote: --On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] wrote: Did you ever see non SSL connections get stuck? No. Most of mine are `pop3d -s', but I have seen a few without the `-s'. When I did a stack trace on one, it also turned out to be for an SSL session. So, I have to agree. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
OK, now I got this: Nov 16 18:37:06 lvr13 pop3s[23089]: SSL_read() returned -1 But that process terminated normally. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpRJSjlsSCf8.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote: --On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote: 1) Since it only happens on dialup connections, could it be that the dialin router at the providers end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall? OK, let's run with that one. a) We don't really have a firewall, we only use ACLs on the Cisco routers. You can't even filter TCP/RST there. b) Even *if* a TCP/RST had been dropped, lost or whatever, the server *still* should timeout eventually! I just had a discussion with a colleague regarding this. He made two observations: 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone. I said that socket should timeout, but this is true only when the protocol (TCP here) require a response (usualy AK here) or at connection establishement. On the contrary it should stay open indefinitely util something happens. Router doing NAT can drop a too old connection, because it has to maintains a NAT table and make some cleanup time to time, this where KEEPALIVE become usefull. This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it? Cyrus has already a built-in time out, it seems a lite conflicting to actively maintains the connection until it drop it itself ! This is the works of the client to actively maintains the connection, if it want it ! 2. The stack trace looks garbled: (gdb) bt #0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4 #6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x082ee848 in ?? () #12 0x in ?? () He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem. I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one. If that doesn't go anywhere, I guess I'll drop this You could try to replace imapd by a home made script, something like . mv imapd imapd_ echo exec strace -o /tmp/imapd.$$ imapd_ $* imapd chmod imapd a+x investigation. We will upgrade to RHEL 5 some time next year, so hopefully that will bring new bugs :-) Sorry but I dont understand what you are complaining about! Is-it because the imap or pop client is loosing its connection and this disturb the user or just because you are getting some sleeping processes ? Or both :-) Do you have a timeout option in your imapd.conf to force the imap/pop server to autologout ? Regards. Alain -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html -- Alain Spineux aspineux gmail com May the sources be with you Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] wrote: Did you ever see non SSL connections get stuck? No. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpAIZv7hfTCt.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 11:27:52 -0500 Ken Murchison [EMAIL PROTECTED] wrote: Sebastian Hagedorn wrote: The only reason I could imagine for the sequence of calls was signal handling. But let's be methodical. There's only one spot where SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only called in cmd_starttls(). That in turn is called either in cmdloop (for handling of STLS) or in service_main() for connections to port 995. Actually, now that I think about it, I believe SSL_accept() can be called from SSL_read() at any time if a renegotiation is required. Since shut_down() calls prot_fill(), which in turn can call SSL_read(), its possible that we can get an SSL_accept() call. Before I start hacking code, can you apply the following patch (sorry about the line breaks) and see if I'm heading in the right direction? Let me know if you get any of the WARNING messages in your logs. --- prot.c.~1.93.~ 2007-11-16 11:21:56.0 -0500 +++ prot.c 2007-11-16 11:23:32.0 -0500 @@ -468,6 +468,7 @@ /* just do a SSL read instead if we're under a tls layer */ if (s-tls_conn != NULL) { n = SSL_read(s-tls_conn, (char *) s-buf, PROT_BUFSIZE); + if (n = 0) syslog(LOG_WARNING, SSL_read() returned %d, n); } else { n = read(s-fd, s-buf, PROT_BUFSIZE); } Yes, I do: Nov 16 17:59:34 lvr13 pop3s[3196]: SSL_read() returned 0 Nov 16 17:59:38 lvr13 pop3s[3196]: SSL_read() returned 0 Nov 16 18:00:09 lvr13 pop3s[3215]: SSL_read() returned 0 Nov 16 18:00:26 lvr13 pop3s[3847]: SSL_read() returned 0 Nov 16 18:00:34 lvr13 pop3s[3215]: SSL_read() returned 0 Nov 16 18:00:34 lvr13 pop3s[3199]: SSL_read() returned 0 Nov 16 18:00:39 lvr13 pop3s[3199]: SSL_read() returned 0 Nov 16 18:00:43 lvr13 pop3s[3229]: SSL_read() returned 0 Not all of these processes are stuck, though. (Maybe none are). Should I be looking for something specific? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpzM3I7B80P9.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: Nov 16 18:00:26 lvr13 pop3s[3847]: SSL_read() returned 0 Nov 16 18:00:34 lvr13 pop3s[3215]: SSL_read() returned 0 Nov 16 18:00:34 lvr13 pop3s[3199]: SSL_read() returned 0 Nov 16 18:00:39 lvr13 pop3s[3199]: SSL_read() returned 0 Nov 16 18:00:43 lvr13 pop3s[3229]: SSL_read() returned 0 Not all of these processes are stuck, though. (Maybe none are). Should I be looking for something specific? Sorry, my patch wasn't complete. It wasn't logging the value that I wanted. Try this: --- prot.c.~1.93.~ 2007-11-16 11:21:56.0 -0500 +++ prot.c 2007-11-16 12:37:55.0 -0500 @@ -468,6 +468,10 @@ /* just do a SSL read instead if we're under a tls layer */ if (s-tls_conn != NULL) { n = SSL_read(s-tls_conn, (char *) s-buf, PROT_BUFSIZE); + if (n = 0) { + syslog(LOG_WARNING, SSL_read() returned %d:%d, + n, SSL_get_error(s-tls_conn, n)); + } } else { n = read(s-fd, s-buf, PROT_BUFSIZE); } -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 12:39:28 -0500 Ken Murchison [EMAIL PROTECTED] wrote: Sorry, my patch wasn't complete. It wasn't logging the value that I wanted. OK: Nov 16 18:48:17 lvr13 pop3s[1385]: SSL_read() returned 0:5 Nov 16 18:48:33 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:48:50 lvr13 pop3s[1980]: SSL_read() returned 0:6 Nov 16 18:48:54 lvr13 pop3s[1376]: SSL_read() returned 0:5 Nov 16 18:49:03 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:49:11 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:49:38 lvr13 pop3s[1375]: SSL_read() returned 0:5 Nov 16 18:49:54 lvr13 pop3s[1404]: SSL_read() returned 0:5 I'm guessing that's still not enough: #define SSL_ERROR_SYSCALL 5 /* look at error stack/return value/errno */ #define SSL_ERROR_ZERO_RETURN 6 SSL_ERROR_SYSCALL Some I/O error occurred. The OpenSSL error queue may contain more information on the error. If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the error: If ret == 0, an EOF was observed that violates the pro- tocol. If ret == -1, the underlying BIO reported an I/O error (for socket I/O on Unix systems, consult errno for details). So should I add a call to ERR_get_error()? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpFMfHMrSvNV.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Hi Can I resume the problem in : The server is blocked in a read, waiting for the client next command. (this is normal, 99% of the process are in this state). But the autologout procedure is not working! Then this means the SIGALRM that should awake the process never come or is not handled properly! I simple call to sleep() or signal() could disturb this. If this append only when using SSL, maybe the problem is here and the ALRM should bne reloaded somewhere. This is useless now, but files in $cyrus_imap/proc/* contains the user and the selected mailbox of all these processes this could be useful to know if this what not always the same user at the origin of the problem, because he was using an old outlook or something. Regards On Nov 16, 2007 5:33 PM, Ken Murchison [EMAIL PROTECTED] wrote: Sebastian Hagedorn wrote: --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote: Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger. Easy: # pstack 19175 19175: pop3d -s fef9f810 read (0, 2316f0, 5) fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18 00029298 _start (0, 0, 0, 0, 0, 0) + 108 Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation. Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything Yes. Gary's problem might be very similar to yours, depending on what I see from the patch that I just sent you. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html -- Alain Spineux aspineux gmail com May the sources be with you Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 16. November 2007 18:21:21 +0100 Gabor Gombas [EMAIL PROTECTED] wrote: On Fri, Nov 16, 2007 at 06:11:01PM +0100, Sebastian Hagedorn wrote: Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf: Debian uses /dev/urandom for a long time: # strings /usr/lib/libsasl2.so.2 | grep random /dev/urandom And according to the logs I have, after a pop3 process got stuck other IMAP users can still log in using TLS+PLAIN, so entropy can be ruled out. OK. Still the symptom seems to be different from what I'm seeing. Could it be that you have a process limit in /etc/cyrus.conf? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpvuf1O5cVWm.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 12:58:49 -0500 regarding Re: One more attempt: stuck processes: So should I add a call to ERR_get_error()? Not yet. I'm assuming that none of these processes has hung. We're getting an I/O error most likely because the client has closed the connection immediately after sending QUIT. This is harmless. What I really want to see is if we get a SSL_ERROR_WANT_xxx return code when we're hung. I have both good and bad news. Bad news first: there is a stuck process that did *not* log that SSL_read line. Good news: the binary I'm running now isn't stripped and has much more detail in its stack trace: (gdb) bt #0 0x003d341e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x0017f2f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x0017d2b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x0089ec30 in ssl3_alert_code () from /lib/libssl.so.4 #4 0x0089edcc in ssl3_alert_code () from /lib/libssl.so.4 #5 0x008a00cf in ssl3_read_bytes () from /lib/libssl.so.4 #6 0x008a0ffc in ssl3_get_message () from /lib/libssl.so.4 #7 0x00896cab in ssl3_accept () from /lib/libssl.so.4 #8 0x00896944 in ssl3_accept () from /lib/libssl.so.4 #9 0x008a5c9a in SSL_accept () from /lib/libssl.so.4 #10 0x008a180d in ssl23_get_client_hello () from /lib/libssl.so.4 #11 0x008a1712 in ssl23_accept () from /lib/libssl.so.4 #12 0x008a5c9a in SSL_accept () from /lib/libssl.so.4 #13 0x08052cf3 in tls_start_servertls (readfd=-512, writefd=-512, layerbits=0xbfff7a78, authid=0xbfff7a74, ret=0x810bca0) at tls.c:803 #14 0x0804e553 in cmd_starttls (pop3s=1) at pop3d.c:1076 #15 0x0804d5cc in service_main (argc=2, argv=0x9e84008, envp=0xbfff9850) at pop3d.c:537 #16 0x08054550 in main (argc=2, argv=0x9, envp=0xbfff9850) at service.c:539 There's much less POP activity now, so I may have to wait until Monday for more results. -- Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587 pgpvINiK8adT6.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
I know it has been asked before and may be redundant, but... You answered that cyrus-sasl is using /dev/urandom and should not run out of entropy. However, what about openssl itself? It also uses random numbers. Perhaps, as a test renaming /dev/random and ln -s /dev/urandom /dev/random. Gary Mills wrote: On Fri, Nov 16, 2007 at 05:13:13PM +0100, Sebastian Hagedorn wrote: --On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] wrote: Did you ever see non SSL connections get stuck? No. Most of mine are `pop3d -s', but I have seen a few without the `-s'. When I did a stack trace on one, it also turned out to be for an SSL session. So, I have to agree. Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On Friday, November 16, 2007 3:54 PM -0500 Ken Murchison [EMAIL PROTECTED] wrote: I've reproduced the former by telneting to port 995 and doing nothing. I have been unable to reproduce the latter because as soon as I QUIT the telnet session or kill() the telnet process, pop3d exits gracefully. I agree with others that you want to do something other than kill the telnet session, like unplugging the cable. I've seen similar behavior out of cyrus that I thought could easily be people on laptops shutting their computers down hard, in some way that the TCP/IP stack never got a chance to clean up the connection properly. With a QUIT or a kill, you're giving the OS a chance to do the right thing. -Michael Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 03:54:50PM -0500, Ken Murchison wrote: That's exactly what Gary is seeing. Its blocking in SSL_accept(). Apparently the client connects to port 995, and then either sends nothing, or goes away and leaves the socket open. I've reproduced the former by telneting to port 995 and doing nothing. I have been unable to reproduce the latter because as soon as I QUIT the telnet session or kill() the telnet process, pop3d exits gracefully. You probably have to reboot the client at that point, or just disconnect the cable and take it home. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Nov 16, 2007 6:24 PM, Alain Spineux [EMAIL PROTECTED] wrote: Hi Can I resume the problem in : I'm wrong The server is blocked in a read, waiting for the client next command. (this is normal, 99% of the process are in this state). No it is waiting in select, and the select has a timeout ! But the autologout procedure is not working! Then this means the SIGALRM that should awake the process never come or is not handled properly! I simple call to sleep() or signal() could disturb this. If this append only when using SSL, maybe the problem is here and the ALRM should bne reloaded somewhere. Wrong wrong ! It could be right, but here the time out looks to be done using the select ! This is useless now, but files in $cyrus_imap/proc/* contains the user and the selected mailbox of all these processes this could be useful to know if this what not always the same user at the origin of the problem, because he was using an old outlook or something. Regards On Nov 16, 2007 5:33 PM, Ken Murchison [EMAIL PROTECTED] wrote: Sebastian Hagedorn wrote: --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote: Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger. Easy: # pstack 19175 19175: pop3d -s fef9f810 read (0, 2316f0, 5) fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18 00029298 _start (0, 0, 0, 0, 0, 0) + 108 Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation. Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything Yes. Gary's problem might be very similar to yours, depending on what I see from the patch that I just sent you. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html -- Alain Spineux aspineux gmail com May the sources be with you -- Alain Spineux aspineux gmail com May the sources be with you Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: Thanks. I will try this patch as soon as I can, but it's clearly not the only issue, because the same thing happens with POP processes. Here's an example for one: (gdb) bt #0 0x0096441e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00ac02f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00abe2b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x00fcfe13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x00fcec51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x00fce712 in ssl23_accept () from /lib/libssl.so.4 #6 0x00fd2c9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x08dd6848 in ?? () #12 0x in ?? () OK. What version of OpenSSL? Are they imaps/pop3s processes, or are your clients using STARTTLS? Are you sure that you're not running out of entropy? Have you tried adding the '-U 1' option to the offending services in cyrus.conf? -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 06:55:44 -0500 Ken Murchison [EMAIL PROTECTED] wrote: OK. What version of OpenSSL? cyradm says: Built w/OpenSSL 0.9.7a Feb 19 2003 Running w/OpenSSL 0.9.7a Feb 19 2003 rpm says: openssl-0.9.7a-33.23 This is RHEL 3. Are they imaps/pop3s processes, or are your clients using STARTTLS? Both. It doesn't seem to matter which. Are you sure that you're not running out of entropy? Yes, because cyrus-sasl uses /dev/urandom. Have you tried adding the '-U 1' option to the offending services in cyrus.conf? No. Since this potentially affects all IMAP and POP processes I would have to do it for all entries. Do you recommend that I try that? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpUWKZDaLfag.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 14. November 2007 16:39:44 -0500 Ken Murchison [EMAIL PROTECTED] wrote: It looks to me like we are timing out the client while the client is IDLEing, but we get a signal from idled in the middle of shutdown(). Try this patch. --- imapd.c.~1.535.~2007-11-14 16:16:21.0 -0500 +++ imapd.c 2007-11-14 16:22:59.0 -0500 @@ -836,6 +836,8 @@ { int i; +idle_done(imapd_mailbox); + proc_cleanup(); i = 0; Thanks. I will try this patch as soon as I can, but it's clearly not the only issue, because the same thing happens with POP processes. Here's an example for one: (gdb) bt #0 0x0096441e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00ac02f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00abe2b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x00fcfe13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x00fcec51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x00fce712 in ssl23_accept () from /lib/libssl.so.4 #6 0x00fd2c9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x08dd6848 in ?? () #12 0x in ?? () As you see it looks very much like the IMAP stack trace, but obviously the IDLE routine is not a factor. OTOH, all the stuck IMAP processes I've looked at so far *did* have those idle_xxx lines in their traces. I am correct in assuming that that would only occur for clients that actually use IDLE, right? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgp6gqxRgy74J.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: No. Since this potentially affects all IMAP and POP processes I would have to do it for all entries. Do you recommend that I try that? Since it looks like things are hanging when a process is being used, I'd like to see if the problem goes away if we don't reuse the processes. I'm just trying to do a bsearch() on the problem. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 08:21:48 -0500 Ken Murchison [EMAIL PROTECTED] wrote: No. Since this potentially affects all IMAP and POP processes I would have to do it for all entries. Do you recommend that I try that? Since it looks like things are hanging when a process is being used, I'd like to see if the problem goes away if we don't reuse the processes. I'm just trying to do a bsearch() on the problem. OK. I've made the change and HUP'ed master. I can see the first processes with -U 1. So I guess I'll have to wait a while to see if any of those get stuck as well. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpgSvEePTrl0.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 08:32:18 -0500 Ken Murchison [EMAIL PROTECTED] wrote: Since it looks like things are hanging when a process is being used, I'd like to see if the problem goes away if we don't reuse the processes. I'm just trying to do a bsearch() on the problem. OK. I've made the change and HUP'ed master. I can see the first processes with -U 1. So I guess I'll have to wait a while to see if any of those get stuck as well. OK, let me know. Didn't work :-( # ps -aef| grep U UIDPID PPID C STIME TTY TIME CMD ... cyrus25038 32385 0 15:44 ?00:00:00 pop3d -s -U 1 ... That's more than an hour ago. So: # strace -p 25038 Process 25038 attached - interrupt to quit read(0, unfinished ... # gdb -p 25038 GNU gdb Red Hat Linux (6.3.0.0-1.138.el3rh) ... (gdb) bt #0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4 #6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x082ee848 in ?? () #12 0x in ?? () Any other ideas? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpoJF9EbdufP.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: --On 15. November 2007 08:21:48 -0500 Ken Murchison [EMAIL PROTECTED] wrote: No. Since this potentially affects all IMAP and POP processes I would have to do it for all entries. Do you recommend that I try that? Since it looks like things are hanging when a process is being used, I'd like to see if the problem goes away if we don't reuse the processes. I'm just trying to do a bsearch() on the problem. OK. I've made the change and HUP'ed master. I can see the first processes with -U 1. So I guess I'll have to wait a while to see if any of those get stuck as well. OK, let me know. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: --On 15. November 2007 08:32:18 -0500 Ken Murchison [EMAIL PROTECTED] wrote: Since it looks like things are hanging when a process is being used, I'd like to see if the problem goes away if we don't reuse the processes. I'm just trying to do a bsearch() on the problem. OK. I've made the change and HUP'ed master. I can see the first processes with -U 1. So I guess I'll have to wait a while to see if any of those get stuck as well. OK, let me know. Didn't work :-( # ps -aef| grep U UIDPID PPID C STIME TTY TIME CMD ... cyrus25038 32385 0 15:44 ?00:00:00 pop3d -s -U 1 ... That's more than an hour ago. So: # strace -p 25038 Process 25038 attached - interrupt to quit read(0, unfinished ... # gdb -p 25038 GNU gdb Red Hat Linux (6.3.0.0-1.138.el3rh) ... (gdb) bt #0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4 #6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x082ee848 in ?? () #12 0x in ?? () Any other ideas? Not at the moment. I don't understand how SSL_accept() gets called from shut_down(). Are you running a Murder? -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 11:00:39 -0500 Ken Murchison [EMAIL PROTECTED] wrote: (gdb) bt # 0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6 # 1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4 # 2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4 # 3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4 # 4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4 # 5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4 # 6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4 # 7 0x08052cb3 in shut_down () # 8 0x0804e513 in shut_down () # 9 0x0804d58c in ?? () # 10 0x0001 in ?? () # 11 0x082ee848 in ?? () # 12 0x in ?? () Any other ideas? Not at the moment. I don't understand how SSL_accept() gets called from shut_down(). That's what I'd been wondering about myself. Are you running a Murder? No, it's a pretty plain setup. The only reason I could imagine for the sequence of calls was signal handling. But let's be methodical. There's only one spot where SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only called in cmd_starttls(). That in turn is called either in cmdloop (for handling of STLS) or in service_main() for connections to port 995. shut_down() could conceivably be called by signals_poll(). The POP process above has the following lines in syslog at debug level: Nov 15 15:44:18 lvr13 master[25038]: about to exec /usr/lib/cyrus-imapd/pop3d Nov 15 15:44:18 lvr13 pop3s[25038]: executed Nov 15 15:44:20 lvr13 pop3s[25038]: accepted connection That's it. So I'm guessing that the connection (which was to some dial-up IP address) got dropped while the process was in service_main(). I don't understand why that doesn't show up in the trace, and why last/first four entries read '??'. But even that doesn't really make sense, because shut_down() isn't actually a signal handler, it's only called at certain points by signal_polls(). -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpjsGUbLBc4U.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Nov 15, 2007 4:54 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote: --On 15. November 2007 08:32:18 -0500 Ken Murchison [EMAIL PROTECTED] wrote: Since it looks like things are hanging when a process is being used, I'd like to see if the problem goes away if we don't reuse the processes. I'm just trying to do a bsearch() on the problem. OK. I've made the change and HUP'ed master. I can see the first processes with -U 1. So I guess I'll have to wait a while to see if any of those get stuck as well. OK, let me know. Didn't work :-( # ps -aef| grep U UIDPID PPID C STIME TTY TIME CMD ... cyrus25038 32385 0 15:44 ?00:00:00 pop3d -s -U 1 ... That's more than an hour ago. So: # strace -p 25038 Process 25038 attached - interrupt to quit read(0, unfinished ... Do you know what is 0, if it was a socket it should timeout, isn't it ? # ls -l /proc/25038/fd should answer the question # gdb -p 25038 GNU gdb Red Hat Linux (6.3.0.0-1.138.el3rh) ... (gdb) bt #0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4 #4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4 #5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4 #6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4 #7 0x08052cb3 in shut_down () #8 0x0804e513 in shut_down () #9 0x0804d58c in ?? () #10 0x0001 in ?? () #11 0x082ee848 in ?? () #12 0x in ?? () Any other ideas? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html -- Alain Spineux aspineux gmail com May the sources be with you Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 18:14:05 +0100 Alain Spineux [EMAIL PROTECTED] wrote: # strace -p 25038 Process 25038 attached - interrupt to quit read(0, unfinished ... Do you know what is 0, if it was a socket it should timeout, isn't it ? It should, I guess, but it doesn't. # ls -l /proc/25038/fd should answer the question I did not know that. Usually I just use lsof. Interestingly, there's this: # ls -l /proc/25038/fd insgesamt 0 lrwx--1 cyrusmail 64 15. Nov 18:15 0 - socket:[4230164633] ... It's blinking red, which normally means a broken link. I'm not sure how reliable that is in this case. Anyway, lsof reports: pop3d 25038 cyrus0u IPv4 -64802663 TCP cyrus.rrz.uni-koeln.de:pop3s-p50865F5D.dip.t-dialin.net:1064 (ESTABLISHED) It *thinks* the connections is still open. So does netstat: # LANG=C netstat -a|grep p50865F5D tcp0 0 cyrus.rrz.uni-koeln.d:pop3s p50865F5D.dip.t-dialin:1064 ESTABLISHED But obviously that connection is dead. I don't know what conclusions to draw from that ... -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpyEp2xvUdqN.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 15. November 2007 18:14:05 +0100 Alain Spineux [EMAIL PROTECTED] wrote: # strace -p 25038 Process 25038 attached - interrupt to quit read(0, unfinished ... Do you know what is 0, if it was a socket it should timeout, isn't it ? It should, I guess, but it doesn't. # ls -l /proc/25038/fd should answer the question I did not know that. Usually I just use lsof. Interestingly, there's this: # ls -l /proc/25038/fd insgesamt 0 lrwx--1 cyrusmail 64 15. Nov 18:15 0 - socket:[4230164633] ... It's blinking red, which normally means a broken link. I'm not sure how The file 0 is a symbolic symlink which doesn't really point to a file, that's why the shell shows it blinking. Everything okay here. reliable that is in this case. Anyway, lsof reports: pop3d 25038 cyrus0u IPv4 -64802663 TCP cyrus.rrz.uni-koeln.de:pop3s-p50865F5D.dip.t-dialin.net:1064 (ESTABLISHED) It *thinks* the connections is still open. So does netstat: # LANG=C netstat -a|grep p50865F5D tcp0 0 cyrus.rrz.uni-koeln.d:pop3s p50865F5D.dip.t-dialin:1064 ESTABLISHED But obviously that connection is dead. I don't know what conclusions to draw from that ... Just two ideas come to mind: 1) Since it only happens on dialup connections, could it be that the dialin router at the providers end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall? 2) Could it be that SO_LINGER should be used as socket option in service_create() in master/master.c. If it's complete nonsense, ignore it. Simon Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
One more attempt: stuck processes
Hi, I've brought up this topic before. We've been running cyrus-imapd very happily for several years. Yet there's one issue that none of the updates have resolved. The last time I reported it we were running 2.2.12. Now we're running 2.3.8, but the issues is the same: POP and IMAP processes that use TLS/SSL get stuck. My observations point towards dropped dial-up connections as the cause. Here's an example: [EMAIL PROTECTED] root]# strace -p 4998 Process 4998 attached - interrupt to quit read(0, unfinished ... Process 4998 detached In gdb: 0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6 (gdb) bt #0 0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00c422f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00c402b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x00d26c30 in ssl3_alert_code () from /lib/libssl.so.4 #4 0x00d26dcc in ssl3_alert_code () from /lib/libssl.so.4 #5 0x00d280cf in ssl3_read_bytes () from /lib/libssl.so.4 #6 0x00d28ffc in ssl3_get_message () from /lib/libssl.so.4 #7 0x00d1ecab in ssl3_accept () from /lib/libssl.so.4 #8 0x00d1e944 in ssl3_accept () from /lib/libssl.so.4 #9 0x00d2dc9a in SSL_accept () from /lib/libssl.so.4 #10 0x00d2980d in ssl23_get_client_hello () from /lib/libssl.so.4 #11 0x00d29712 in ssl23_accept () from /lib/libssl.so.4 #12 0x00d2dc9a in SSL_accept () from /lib/libssl.so.4 #13 0x080a0a03 in idle_notify () #14 0x0805ef21 in idle_update () #15 0x08051fc6 in shut_down () #16 0x0804ebe4 in ?? () #17 0x08d53458 in ?? () #18 0x08d61a98 in ?? () #19 0x in ?? () (gdb) Ultimately all processes get stuck in libc, so this might be a library issue. My problem is that I have no way to confirm that. I've compared that stack trace to the source and I'm confused. idle_notify() is in idle.c, and it's very simple: /* * Notify idled of a mailbox change */ void idle_notify(struct mailbox *mailbox) { /* We should try to determine if we need to send this * (ie, is an imapd is IDLE on 'mailbox'?). */ idle_send_msg(IDLE_NOTIFY, mailbox); } So why doesn't the stack trace show a call to idle_send_msg()? Shouldn't the SSL routines be called from a procedure that actually does I/O? Anyway, if there is anybody who has seen something like this I would like to know about it. Also if anyone has a clue how to further debug this, please let me know! Thanks, Sebastian Hagedorn -- Sebastian Hagedorn - Postmaster - RZKR-R1 (Gebäude 52), Zimmer 18 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587 Skype: shagedorn pgpzRU0VT9Nay.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
On Wed, Nov 14, 2007 at 04:15:13PM +0100, Sebastian Hagedorn wrote: I've brought up this topic before. We've been running cyrus-imapd very happily for several years. Yet there's one issue that none of the updates have resolved. The last time I reported it we were running 2.2.12. Now we're running 2.3.8, but the issues is the same: POP and IMAP processes that use TLS/SSL get stuck. My observations point towards dropped dial-up connections as the cause. Here's an example: Have you tried setting `tls_session_timeout' to zero in imapd.conf to disable caching of TLS sessions? I believe that that solved the problem for me, although it happened rarely. I only use `berkeley-nosync' for `duplicate_db' and `tlscache_db', but I suspected that the problem was the result of a database deadlock. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
--On 14. November 2007 09:30:45 -0600 Gary Mills [EMAIL PROTECTED] wrote: On Wed, Nov 14, 2007 at 04:15:13PM +0100, Sebastian Hagedorn wrote: I've brought up this topic before. We've been running cyrus-imapd very happily for several years. Yet there's one issue that none of the updates have resolved. The last time I reported it we were running 2.2.12. Now we're running 2.3.8, but the issues is the same: POP and IMAP processes that use TLS/SSL get stuck. My observations point towards dropped dial-up connections as the cause. Here's an example: Have you tried setting `tls_session_timeout' to zero in imapd.conf to disable caching of TLS sessions? No. While it's possible that there's some connection, I don't see it from the stack trace. I think I'd like some indication from one of the developers that caching *could* be the cause before I try that. I believe that that solved the problem for me, although it happened rarely. I only use `berkeley-nosync' for `duplicate_db' and `tlscache_db', but I suspected that the problem was the result of a database deadlock. We use skiplist, which to my mind makes it even less likely that that's the cause ... thanks for the suggestion anyway. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:. pgpbKouWlbjeE.pgp Description: PGP signature Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote: Hi, I've brought up this topic before. We've been running cyrus-imapd very happily for several years. Yet there's one issue that none of the updates have resolved. The last time I reported it we were running 2.2.12. Now we're running 2.3.8, but the issues is the same: POP and IMAP processes that use TLS/SSL get stuck. My observations point towards dropped dial-up connections as the cause. Here's an example: [EMAIL PROTECTED] root]# strace -p 4998 Process 4998 attached - interrupt to quit read(0, unfinished ... Process 4998 detached In gdb: 0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6 (gdb) bt #0 0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6 #1 0x00c422f7 in BIO_new_socket () from /lib/libcrypto.so.4 #2 0x00c402b2 in BIO_read () from /lib/libcrypto.so.4 #3 0x00d26c30 in ssl3_alert_code () from /lib/libssl.so.4 #4 0x00d26dcc in ssl3_alert_code () from /lib/libssl.so.4 #5 0x00d280cf in ssl3_read_bytes () from /lib/libssl.so.4 #6 0x00d28ffc in ssl3_get_message () from /lib/libssl.so.4 #7 0x00d1ecab in ssl3_accept () from /lib/libssl.so.4 #8 0x00d1e944 in ssl3_accept () from /lib/libssl.so.4 #9 0x00d2dc9a in SSL_accept () from /lib/libssl.so.4 #10 0x00d2980d in ssl23_get_client_hello () from /lib/libssl.so.4 #11 0x00d29712 in ssl23_accept () from /lib/libssl.so.4 #12 0x00d2dc9a in SSL_accept () from /lib/libssl.so.4 #13 0x080a0a03 in idle_notify () #14 0x0805ef21 in idle_update () #15 0x08051fc6 in shut_down () #16 0x0804ebe4 in ?? () #17 0x08d53458 in ?? () #18 0x08d61a98 in ?? () #19 0x in ?? () (gdb) It looks to me like we are timing out the client while the client is IDLEing, but we get a signal from idled in the middle of shutdown(). Try this patch. --- imapd.c.~1.535.~2007-11-14 16:16:21.0 -0500 +++ imapd.c 2007-11-14 16:22:59.0 -0500 @@ -836,6 +836,8 @@ { int i; +idle_done(imapd_mailbox); + proc_cleanup(); i = 0; -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html