RE: Collaboration replacement via Toltec/Bynari (was How many people to admin a Cyrus system?)
On Thu, 2007-11-15 at 22:10 +0200, Joon Radley wrote:
> Hi Olaf,
>
>> That's interesting information. I have always thought that in the Exchange-Outlook world the processing was done on the server side and the messages were sitting on the server. Or is the client-side processing limited to the Toltec/Bynari solution?
>
> With Exchange-Outlook the Outlook message store resides on the server. When new mail is delivered to the Exchange server, it does the special processing before injecting it into the mail store. With the IMAP4 server and SMTP of your choice, the mail must be downloaded to be processed before it can be stored in the Outlook message store.

OK. Now everything is clear. Didn't Bynari have some server-side solution in the past? I remember (though I have not used it) a product called Bynari Server or something like that.

Regards,
Olaf
-- 
Olaf Frączyk [EMAIL PROTECTED]
NAVI http://www.navi.pl http://www.ntp.navi.pl

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: LARGE single-system Cyrus installs?
Rob Mueller [EMAIL PROTECTED] wrote:
>> About 30% of all I/O is to mailboxes.db, most of which is read. I haven't personally deployed a split-meta configuration, but I understand the meta files are similarly heavy I/O concentrators.
>
> That sounds odd. Given the size and hotness of mailboxes.db, and in most cases the size of mailboxes.db compared to the memory your machine has, basically the OS should end up caching the entire thing in memory.

Solaris 10 does this in my case. Via dtrace you'll see that open() on mailboxes.db and the read calls do not exceed the microsecond range. mailboxes.db is not the problem here. It is entirely cached and rarely written (only when creating, deleting or moving a mailbox).

Pascal
Re: One more attempt: stuck processes
--On 15. November 2007 19:25:19 +0100 Simon Matter [EMAIL PROTECTED] wrote:

>> It's blinking red, which normally means a broken link. I'm not sure how
>
> The file 0 is a symbolic link which doesn't really point to a file; that's why the shell shows it blinking. Everything is okay here.

Thanks. That's what I thought, but I wasn't sure.

>> reliable that is in this case. Anyway, lsof reports:
>>
>> pop3d 25038 cyrus 0u IPv4 -64802663 TCP cyrus.rrz.uni-koeln.de:pop3s->p50865F5D.dip.t-dialin.net:1064 (ESTABLISHED)
>>
>> It *thinks* the connection is still open. So does netstat:
>>
>> # LANG=C netstat -a|grep p50865F5D
>> tcp 0 0 cyrus.rrz.uni-koeln.d:pop3s p50865F5D.dip.t-dialin:1064 ESTABLISHED
>>
>> But obviously that connection is dead. I don't know what conclusions to draw from that ...
>
> Just two ideas come to mind:
>
> 1) Since it only happens on dialup connections, could it be that the dial-in router at the provider's end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall?

OK, let's run with that one.

a) We don't really have a firewall, we only use ACLs on the Cisco routers. You can't even filter TCP/RST there.
b) Even *if* a TCP/RST had been dropped, lost or whatever, the server *still* should time out eventually!

> 2) Could it be that SO_LINGER should be used as a socket option in service_create() in master/master.c?

I didn't remember that option, so I just read up on it. It seems as though SO_LINGER is very dependent on the implementation. If I understand your intention correctly, SO_LINGER would have to be set with l_onoff set to non-zero and l_linger set to zero, right? So close() would return immediately? That might make sense if the stack trace showed a call to close(). But if I understand the code correctly, close() isn't called at all. The socket is closed as a result of a call to exit(). And that defeats all use of SO_LINGER:

"When the socket is closed as part of exit(2), it always lingers in the background."

> If it's complete nonsense, ignore it.

I wouldn't know :-)
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
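For readers who have not used the option, the SO_LINGER setting discussed above (l_onoff non-zero, l_linger zero) can be sketched roughly as follows. This is only an illustration of the socket option, not code from Cyrus; the helper name set_abortive_close is hypothetical, and as noted in the message the exact close() behaviour depends on the implementation.

```c
#include <sys/socket.h>

/* Hypothetical helper, not part of Cyrus: enable SO_LINGER with a
 * zero timeout on an existing socket.  On common TCP stacks a later
 * close() then returns at once and unsent data is discarded.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_abortive_close(int fd)
{
    struct linger lg;

    lg.l_onoff = 1;   /* non-zero: SO_LINGER is active */
    lg.l_linger = 0;  /* linger time of zero seconds */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

A caller would invoke set_abortive_close(fd) on the accepted socket before it is closed. As the message points out, this only matters when close() is actually called on the descriptor.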
Re: One more attempt: stuck processes
--On 16. November 2007 16:52:27 +0100 Gabor Gombas [EMAIL PROTECTED] wrote:

> On Fri, Nov 16, 2007 at 12:36:49PM +0100, Sebastian Hagedorn wrote:
>
>> He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem.
>
> FYI I also know a system that has problems with hung Cyrus processes. AFAIR they have problems with pop3s only, but that may be because there are more POP3 than IMAP users, I don't know. The system in question runs 2.3.8 on Debian Etch currently.

That's a 2.6 kernel, right?

> I intend to help diagnose that system but had no time so far; they're now running a script that does a POP3 connection every couple of minutes and if that takes too long it restarts Cyrus.

Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time.

> There is nothing interesting in the logs:
>
> Oct 15 02:39:31 host cyrus/master[6102]: about to exec /usr/local/cyrus/sbin/pop3d
> Oct 15 02:39:31 host cyrus/pop3s[6102]: executed
> Oct 15 02:39:31 host cyrus/pop3s[6102]: accepted connection

That's what I'm seeing. Could you get a stack trace?

> OTOH there are a lot of messages like the following:
>
> Oct 16 14:13:10 host cyrus/master[26136]: about to exec /usr/local/cyrus/sbin/pop3d
> Oct 16 14:13:10 host cyrus/pop3s[26136]: executed
> Oct 16 14:13:10 host cyrus/pop3s[26136]: accepted connection
> Oct 16 14:13:10 host cyrus/pop3s[26136]: pop3s failed: [XX.XXX.XX.XXX]
> Oct 16 14:13:10 host cyrus/pop3s[26136]: Fatal error: tls_start_servertls() failed
> Oct 16 14:13:10 host cyrus/master[15923]: process 26136 exited, status 75
> Oct 16 14:13:10 host cyrus/master[15923]: service pop3s pid 26136 in BUSY state: terminated abnormally
>
> Any idea what's causing that?

I have many of those as well. I suppose that could be any number of things: a faulty protocol exchange or dropped connections.
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
Re: Timed Actions in Sieve
On Tue, Nov 13, 2007 at 11:24:48AM +, Ian G Batten wrote:
> We've been having a chat about how useful it would be to have timed actions in sieve: so that a vacation message could be set up for a duration which would automatically revert, so that a forwarding could be set up for the duration of a short-term project, etc, etc.
>
> The naive way is to add support to the sieve interface of choice (the squirrelmail plugin in our case) to handle deferred actions, but I can think of all sorts of security problems with that. Another would be a means to auto-generate regexps to match on Date: headers, but that's really tacky. The full solution would be to have the current time available in sieve scripts, to then match on.
>
> Has anyone else thought about this area?

We've had occasional complaints from people who set up a vacation message and then forgot to remove it later. They would like to be able to put a time limit on such things, so that they would stop working when that limit expires. More generally, I suppose they could specify start and stop times, so that they could set up the sieve script in advance of their vacation.

-- 
-Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 03:20:57PM +0100, Sebastian Hagedorn wrote:
> --On 16. November 2007 08:00:07 -0600 Gary Mills [EMAIL PROTECTED] wrote:
>
>> This timeout doesn't work in some cases. We have lots of POP sessions that never terminate.
>
> That's interesting to hear! Especially since you are using Solaris.
>
>> About 30 out of 40 are in that state now. Here's an example:
>>
>> cyrus 13075 708 0 Oct 14 ? 0:05 pop3d -s
>> cyrus 20023 708 0 Oct 29 ? 0:00 pop3d
>> cyrus 24560 708 1 07:38:03 ? 0:03 pop3d
>> cyrus 631 708 0 Oct 03 ? 0:10 pop3d -s
>> cyrus 6786 708 0 Oct 20 ? 0:00 pop3d -s
>> cyrus 29777 708 0 07:45:03 ? 0:00 pop3d
>> cyrus 19175 708 0 Oct 04 ? 0:04 pop3d -s
>>
>> One I just checked is stuck in a read():
>>
>> # truss -p 19175
>> read(0, 0x002316F0, 5) (sleeping...)
>> ^?# pfiles 19175
>> 19175: pop3d -s
>>   Current rlimit: 256 file descriptors
>>    0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0
>>       O_RDWR
>>       sockname: AF_INET 130.179.16.23 port: 995
>>       peername: AF_INET 130.179.188.184 port: 51771
>
> Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger.

Easy:

# pstack 19175
19175: pop3d -s
 fef9f810 read (0, 2316f0, 5)
 fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
 ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0
 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904
 ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828
 ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198
 0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
 00029298 _start (0, 0, 0, 0, 0, 0) + 108

I've confirmed that the client went away a long time ago.

-- 
-Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote:
> I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one.

Please let me know if you get a trace from a hung process.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 05:20:00PM +0100, Sebastian Hagedorn wrote:
> That's a 2.6 kernel, right?

Yes, 2.6.18-2-amd64.

> Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time.

It's not a slowdown - the client connects, and hangs. It never even gets to the authentication phase (at least it's not logged). Clients that happen to connect to a non-affected process run normally. Also, IMAP connections do not seem to be affected, at least I did not hear any complaints about that.

> That's what I'm seeing. Could you get a stack trace?

I intend to, but I do not have the time currently. I'm not involved in the daily management of that machine and the operators are happy to just restart Cyrus when the hangs begin, and so far I was never around just when that happened.

Gabor
-- 
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote:
> --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote:
>
>>> Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger.
>>
>> Easy:
>>
>> # pstack 19175
>> 19175: pop3d -s
>>  fef9f810 read (0, 2316f0, 5)
>>  fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
>>  ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
>>  ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
>>  ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
>>  ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
>>  ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
>>  ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0
>>  ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904
>>  ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828
>>  ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
>>  00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198
>>  0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
>>  0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
>>  00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
>>  00029298 _start (0, 0, 0, 0, 0, 0) + 108
>
> Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation.
>
> Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything.

Yes. Gary's problem might be very similar to yours, depending on what I see from the patch that I just sent you.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Re: One more attempt: stuck processes
--On 16. November 2007 18:07:51 +0100 Gabor Gombas [EMAIL PROTECTED] wrote:

>> Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time.
>
> It's not a slowdown - the client connects, and hangs. It never even gets to the authentication phase (at least it's not logged). Clients that happen to connect to a non-affected process run normally.

Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random, or disable APOP in /etc/imapd.conf:

allowapop: 0

Either of those things should get rid of that.
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 01:54:24PM +0100, Alain Spineux wrote:
> On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:
>> --On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote:
>>
>> 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone.
>
> I said that socket should timeout, but this is true only when the protocol (TCP here) require a response (usually ACK here) or at connection establishment. On the contrary it should stay open indefinitely until something happens. Router doing NAT can drop a too old connection, because it has to maintain a NAT table and make some cleanup from time to time, this is where KEEPALIVE becomes useful.
>
>> This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?
>
> Cyrus has already a built-in time out, it seems a little conflicting to actively maintain the connection until it drops it itself! This is the work of the client to actively maintain the connection, if it wants it!

This timeout doesn't work in some cases. We have lots of POP sessions that never terminate. About 30 out of 40 are in that state now. Here's an example:

cyrus 13075 708 0 Oct 14 ? 0:05 pop3d -s
cyrus 20023 708 0 Oct 29 ? 0:00 pop3d
cyrus 24560 708 1 07:38:03 ? 0:03 pop3d
cyrus 631 708 0 Oct 03 ? 0:10 pop3d -s
cyrus 6786 708 0 Oct 20 ? 0:00 pop3d -s
cyrus 29777 708 0 07:45:03 ? 0:00 pop3d
cyrus 19175 708 0 Oct 04 ? 0:04 pop3d -s

One I just checked is stuck in a read():

# truss -p 19175
read(0, 0x002316F0, 5) (sleeping...)
^?# pfiles 19175
19175: pop3d -s
  Current rlimit: 256 file descriptors
   0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0
      O_RDWR
      sockname: AF_INET 130.179.16.23 port: 995
      peername: AF_INET 130.179.188.184 port: 51771

-- 
-Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Re: One more attempt: stuck processes
--On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote:

>> 1) Since it only happens on dialup connections, could it be that the dialin router at the providers end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall?
>
> OK, let's run with that one.
>
> a) We don't really have a firewall, we only use ACLs on the Cisco routers. You can't even filter TCP/RST there.
> b) Even *if* a TCP/RST had been dropped, lost or whatever, the server *still* should timeout eventually!

I just had a discussion with a colleague regarding this. He made two observations:

1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone.

This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?

2. The stack trace looks garbled:

(gdb) bt
#0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
#1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
#3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
#4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
#5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4
#6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4
#7 0x08052cb3 in shut_down ()
#8 0x0804e513 in shut_down ()
#9 0x0804d58c in ?? ()
#10 0x0001 in ?? ()
#11 0x082ee848 in ?? ()
#12 0x in ?? ()

He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem.

I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one.

If that doesn't go anywhere, I guess I'll drop this investigation. We will upgrade to RHEL 5 some time next year, so hopefully that will bring new bugs :-)
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
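For reference, turning on SO_KEEPALIVE for a connected socket takes a single setsockopt() call. The sketch below only shows the option itself; the helper name enable_keepalive is hypothetical, and nothing here says where (or whether) Cyrus should call it.

```c
#include <sys/socket.h>

/* Hypothetical helper, not part of Cyrus: ask the TCP stack to probe
 * an idle connection so that a vanished peer is eventually noticed
 * and a blocked read() fails instead of sleeping forever.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

One caveat: the probe timing comes from system-wide TCP settings, and the default idle time before the first probe is typically around two hours, so a dead peer would still be detected only slowly unless those settings are tuned.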
Re: LARGE single-system Cyrus installs?
--On Friday, November 16, 2007 7:39 AM +0100 Pascal Gienger [EMAIL PROTECTED] wrote:

> Solaris 10 does this in my case. Via dtrace you'll see that open() on the mailboxes.db and read-calls do not exceed microsecond ranges. mailboxes.db is not the problem here. It is entirely cached and rarely written (creating, deleting and moving a mailbox).

This is where I think the actual user count may really influence this behavior. On our system, during heavy times, we can see writes to the mailboxes file separated by no more than 5-10 seconds.

If you're constantly freezing all cyrus processes for the duration of those writes, and those writes are taking any appreciable time at all, you're going to have a stuttering server with big load averages.

Again, it's not I/O throughput to be worried about here -- it's latency. If you don't have write caches in front of your disk, even with RAID you're still at the mercy of drive latency in the millisecond range. Not a problem if those writes are once every five minutes, but if you're at peak load on a big system and seeing them every couple of seconds, that's brutal.

-Michael
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote:
> The only reason I could imagine for the sequence of calls was signal handling. But let's be methodical. There's only one spot where SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only called in cmd_starttls(). That in turn is called either in cmdloop (for handling of STLS) or in service_main() for connections to port 995.

Actually, now that I think about it, I believe SSL_accept() can be called from SSL_read() at any time if a renegotiation is required. Since shut_down() calls prot_fill(), which in turn can call SSL_read(), it's possible that we can get an SSL_accept() call.

Before I start hacking code, can you apply the following patch (sorry about the line breaks) and see if I'm heading in the right direction? Let me know if you get any of the WARNING messages in your logs.

--- prot.c.~1.93.~ 2007-11-16 11:21:56.0 -0500
+++ prot.c 2007-11-16 11:23:32.0 -0500
@@ -468,6 +468,7 @@
     /* just do a SSL read instead if we're under a tls layer */
     if (s->tls_conn != NULL) {
         n = SSL_read(s->tls_conn, (char *) s->buf, PROT_BUFSIZE);
+        if (n <= 0) syslog(LOG_WARNING, "SSL_read() returned %d", n);
     } else {
         n = read(s->fd, s->buf, PROT_BUFSIZE);
     }

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Re: One more attempt: stuck processes
--On 16. November 2007 13:54:24 +0100 Alain Spineux [EMAIL PROTECTED] wrote:

> On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:
>
>> I just had a discussion with a colleague regarding this. He made two observations:
>>
>> 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone.
>
> I said that socket should timeout, but this is true only when the protocol (TCP here) require a response (usually ACK here) or at connection establishment.

Right.

> On the contrary it should stay open indefinitely until something happens. Router doing NAT can drop a too old connection, because it has to maintain a NAT table and make some cleanup from time to time, this is where KEEPALIVE becomes useful.

Not only there, but I think also in the case of unilaterally dropped connections.

>> This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?
>
> Cyrus has already a built-in time out, it seems a little conflicting to actively maintains the connection until it drop it itself !

I'm not sure I understand that sentence.

> This is the work of the client to actively maintain the connection, if it wants it !

Yes, but what if the client is gone? I realise that *normally* the server keeps a built-in timeout, but I'm guessing that sometimes it doesn't work, perhaps because something (in prot_fill() perhaps?) blocks.

>> I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one.
>>
>> If that doesn't go anywhere, I guess I'll drop this
>
> You could try to replace imapd by a home made script, something like:
>
> mv imapd imapd_
> echo 'exec strace -o /tmp/imapd.$$ imapd_ $*' > imapd
> chmod a+x imapd

Thanks for the suggestion. I'll think about it, although I'm wary of doing that on a production server.

>> investigation. We will upgrade to RHEL 5 some time next year, so hopefully that will bring new bugs :-)
>
> Sorry but I don't understand what you are complaining about!

I'm not complaining ...

> Is it because the imap or pop client is losing its connection and this disturbs the user

No.

> or just because you are getting some sleeping processes ?

If it were some I wouldn't worry. I'm talking hundreds of processes! I know I can kill them, in fact for the pop3d processes we run this command once a month:

ps -C pop3d -o pid,start|grep [a-z]|awk '{print $1}'|xargs kill

(It kills pop3d processes that have the month in their start time, i.e. are more than a day old.)

But for imapd processes it's not as easy to tell if they are just long-living or stuck.

> Do you have a timeout option in your imapd.conf to force the imap/pop server to autologout ?

No. But both POP and IMAP have default timeouts. They just don't work in my case.
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
--On 16. November 2007 08:00:07 -0600 Gary Mills [EMAIL PROTECTED] wrote:

> This timeout doesn't work in some cases. We have lots of POP sessions that never terminate.

That's interesting to hear! Especially since you are using Solaris.

> About 30 out of 40 are in that state now. Here's an example:
>
> cyrus 13075 708 0 Oct 14 ? 0:05 pop3d -s
> cyrus 20023 708 0 Oct 29 ? 0:00 pop3d
> cyrus 24560 708 1 07:38:03 ? 0:03 pop3d
> cyrus 631 708 0 Oct 03 ? 0:10 pop3d -s
> cyrus 6786 708 0 Oct 20 ? 0:00 pop3d -s
> cyrus 29777 708 0 07:45:03 ? 0:00 pop3d
> cyrus 19175 708 0 Oct 04 ? 0:04 pop3d -s
>
> One I just checked is stuck in a read():
>
> # truss -p 19175
> read(0, 0x002316F0, 5) (sleeping...)
> ^?# pfiles 19175
> 19175: pop3d -s
>   Current rlimit: 256 file descriptors
>    0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0
>       O_RDWR
>       sockname: AF_INET 130.179.16.23 port: 995
>       peername: AF_INET 130.179.188.184 port: 51771

Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger.
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
--On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote:

>> Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger.
>
> Easy:
>
> # pstack 19175
> 19175: pop3d -s
>  fef9f810 read (0, 2316f0, 5)
>  fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
>  ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
>  ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
>  ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
>  ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
>  ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
>  ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0
>  ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904
>  ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828
>  ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
>  00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198
>  0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
>  0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
>  00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
>  00029298 _start (0, 0, 0, 0, 0, 0) + 108

Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation.

Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything.
-- 
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 12:36:49PM +0100, Sebastian Hagedorn wrote:
> He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem.

FYI I also know a system that has problems with hung Cyrus processes. AFAIR they have problems with pop3s only, but that may be because there are more POP3 than IMAP users, I don't know. The system in question runs 2.3.8 on Debian Etch currently.

I intend to help diagnose that system but had no time so far; they're now running a script that does a POP3 connection every couple of minutes and if that takes too long it restarts Cyrus.

There is nothing interesting in the logs:

Oct 15 02:39:31 host cyrus/master[6102]: about to exec /usr/local/cyrus/sbin/pop3d
Oct 15 02:39:31 host cyrus/pop3s[6102]: executed
Oct 15 02:39:31 host cyrus/pop3s[6102]: accepted connection

... and that's about it, nothing else is logged about the stuck process. As can be seen, the process gets stuck just after it has been created, so -U 1 cannot help.

OTOH there are a lot of messages like the following:

Oct 16 14:13:10 host cyrus/master[26136]: about to exec /usr/local/cyrus/sbin/pop3d
Oct 16 14:13:10 host cyrus/pop3s[26136]: executed
Oct 16 14:13:10 host cyrus/pop3s[26136]: accepted connection
Oct 16 14:13:10 host cyrus/pop3s[26136]: pop3s failed: [XX.XXX.XX.XXX]
Oct 16 14:13:10 host cyrus/pop3s[26136]: Fatal error: tls_start_servertls() failed
Oct 16 14:13:10 host cyrus/master[15923]: process 26136 exited, status 75
Oct 16 14:13:10 host cyrus/master[15923]: service pop3s pid 26136 in BUSY state: terminated abnormally

Any idea what's causing that?

Gabor
-- 
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
RE: Collaboration replacement via Toltec/Bynari (was How many people to admin a Cyrus system?)
--On 15 November 2007 22:15:32 +0200 Joon Radley [EMAIL PROTECTED] wrote: Hi Ian, Cyrus Mailstore does handle final delivery, but there's plenty of opportunity to handle messages before that point. For example, we now use Exim and Cyrus Mailstore, and we have plenty of processing going on in Exim before hand off to Cyrus (with LMTP) including spamassassin, clamav and Exim filters. There are also processes between the two, for example Mailman. Very true, but it does not do the processing needed by Outlook. It cannot convert iTip and winmail.dat attachments to the related message objects and do the linking in the Outlook message store. This is where you need the transport mechanism of Outlook. So, the problem has nothing to do with IMAP, and everything to do with message handling before delivery to the mailbox. -- Ian Eiloart IT Services, University of Sussex x3148
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 06:11:01PM +0100, Sebastian Hagedorn wrote:

Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf:

Debian has used /dev/urandom for a long time:

# strings /usr/lib/libsasl2.so.2 | grep random
/dev/urandom

And according to the logs I have, after a pop3 process got stuck other IMAP users can still log in using TLS+PLAIN, so entropy can be ruled out.

Gabor

--
- MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences -
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote:

--On 16. November 2007 12:39:28 -0500 Ken Murchison [EMAIL PROTECTED] wrote:

Sorry, my patch wasn't complete. It wasn't logging the value that I wanted.

OK:

Nov 16 18:48:17 lvr13 pop3s[1385]: SSL_read() returned 0:5
Nov 16 18:48:33 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:48:50 lvr13 pop3s[1980]: SSL_read() returned 0:6
Nov 16 18:48:54 lvr13 pop3s[1376]: SSL_read() returned 0:5
Nov 16 18:49:03 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:11 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:38 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:54 lvr13 pop3s[1404]: SSL_read() returned 0:5

I'm guessing that's still not enough:

#define SSL_ERROR_SYSCALL 5 /* look at error stack/return value/errno */
#define SSL_ERROR_ZERO_RETURN 6

SSL_ERROR_SYSCALL
Some I/O error occurred. The OpenSSL error queue may contain more information on the error. If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the error: If ret == 0, an EOF was observed that violates the protocol. If ret == -1, the underlying BIO reported an I/O error (for socket I/O on Unix systems, consult errno for details).

So should I add a call to ERR_get_error()?

Not yet. I'm assuming that none of these processes has hung. We're getting an I/O error most likely because the client has closed the connection immediately after sending QUIT. This is harmless.

What I really want to see is if we get a SSL_ERROR_WANT_xxx return code when we're hung.

--
Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University
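For readers following the "ret:err" pairs in the log lines above, here is a minimal sketch of how they might be decoded. It is not part of Ken's patch. The helper name describe_ssl_read() and the DEMO_ constants are made up for illustration; the numeric values mirror those in OpenSSL's ssl.h, and the SSL_ERROR_SYSCALL branch assumes the OpenSSL error queue is empty, as the man page excerpt requires.

```c
/* Numeric values as defined in OpenSSL's <openssl/ssl.h>; repeated here
 * only so the sketch compiles without the OpenSSL headers. */
enum {
    DEMO_SSL_ERROR_WANT_READ   = 2,
    DEMO_SSL_ERROR_WANT_WRITE  = 3,
    DEMO_SSL_ERROR_SYSCALL     = 5,
    DEMO_SSL_ERROR_ZERO_RETURN = 6
};

/* Hypothetical helper: explain one "SSL_read() returned <ret>:<err>" pair.
 * Assumes the OpenSSL error queue was empty when err was obtained. */
const char *describe_ssl_read(int ret, int err)
{
    switch (err) {
    case DEMO_SSL_ERROR_ZERO_RETURN:
        return "clean TLS shutdown by peer";
    case DEMO_SSL_ERROR_SYSCALL:
        if (ret == 0)
            return "EOF without TLS shutdown";
        if (ret == -1)
            return "I/O error, check errno";
        return "I/O error";
    case DEMO_SSL_ERROR_WANT_READ:
    case DEMO_SSL_ERROR_WANT_WRITE:
        return "operation incomplete, retry needed";
    default:
        return "other";
    }
}
```

Under these assumptions, the frequent 0:5 lines would read as the client dropping the connection without a TLS shutdown, and the single 0:6 line as a clean shutdown, which is consistent with Ken's reading that both are harmless.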
Re: One more attempt: stuck processes
On Nov 16, 2007 6:11 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:

--On 16. November 2007 18:07:51 +0100 Gabor Gombas [EMAIL PROTECTED] wrote:

Hm, we don't suffer any actual slowdown, it's just that the number of processes increases over time.

It's not a slowdown - the client connects, and hangs. It never even gets to the authentication phase (at least it's not logged). Clients that happen to connect to a non-affected process run normally.

Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf:

allowapop: 0

Either of those things should get rid of that.

A quick but dirty way to do this for testing is:

# ls -l /dev/*random
crw-rw-rw- 1 root root 1, 8 Nov 16 06:18 /dev/random
cr--r--r-- 1 root root 1, 9 Nov 7 22:47 /dev/urandom
# mv /dev/random /dev/random.orig
# ln -sf /dev/urandom /dev/random

And then, because leaving /dev/random that way for too long is insecure:

# rm -f /dev/random
# mv /dev/random.orig /dev/random

This avoids recompiling the source just for testing.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.

--
Alain Spineux aspineux gmail com May the sources be with you
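As a less invasive check of the entropy theory than swapping device nodes, one could probe whether a read from /dev/random would block at all. The following is only a sketch under that assumption: try_read_random() is a made-up helper, not anything Cyrus or cyrus-sasl provides, and it relies on the device honouring O_NONBLOCK.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to read len bytes from the device at path without blocking.
 * Returns the number of bytes read, 0 if the read would have blocked
 * (on /dev/random that suggests the entropy pool is drained), or -1
 * on any other error. Hypothetical helper for illustration only. */
long try_read_random(const char *path, unsigned char *buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    ssize_t n;
    int saved;

    if (fd == -1)
        return -1;
    n = read(fd, buf, len);
    saved = errno;              /* close() may overwrite errno */
    close(fd);
    if (n == -1 && (saved == EAGAIN || saved == EWOULDBLOCK))
        return 0;
    return (long)n;
}
```

Calling it with "/dev/random" while a process is stuck, and getting 0 back, would point towards entropy starvation; a full buffer would argue against it.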
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 05:13:13PM +0100, Sebastian Hagedorn wrote: --On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] wrote: Did you ever see non SSL connections get stuck? No. Most of mine are `pop3d -s', but I have seen a few without the `-s'. When I did a stack trace on one, it also turned out to be for an SSL session. So, I have to agree. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
Re: One more attempt: stuck processes
OK, now I got this: Nov 16 18:37:06 lvr13 pop3s[23089]: SSL_read() returned -1 But that process terminated normally. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.
Re: LARGE single-system Cyrus installs?
On 15 Nov 07, at 1504, Michael Bacon wrote:

Interesting thought. We haven't gone to ZFS yet, although I like the idea a lot. My hunch is it's an enormous win for the mailbox partitions, but perhaps it's not a good thing for the meta partition. I'll have to let someone else who knows more about ZFS and write speeds vs. read speeds chime in here.

We're finding it a real win for the meta-partition. We're handling ~1000 users on a 2-way stripe by two-way mirror on the internal disks in a T2000 for the meta-data, with the message data coming in over NFS. We do see a few spikes of write operations (this is one instance from zpool iostat -v 1):

               capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
pool1        52.1G  25.9G      4    657  3.96K  3.71M
  mirror     26.0G  13.0G      4    354  3.96K  1.42M
    c0t0d0s4     -      -      0    135      0  1.42M
    c0t1d0s4     -      -      0    126  63.4K  1.42M
  mirror     26.0G  13.0G      0    302      0  2.29M
    c0t2d0s4     -      -      0    112      0  2.29M
    c0t3d0s4     -      -      0    109      0  2.29M
-----------  -----  -----  -----  -----  -----  -----

but it's showing no signs at all of being IO bound on the metadata. The spikes are really just spikes for a second: the typical level is about 10 ops / disk / sec.

ian
Re: One more attempt: stuck processes
On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:

--On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn [EMAIL PROTECTED] wrote:

1) Since it only happens on dialup connections, could it be that the dialin router at the providers end sends TCP/RST when a client hangs up and those packets are filtered somewhere, maybe on your firewall?

OK, let's run with that one. a) We don't really have a firewall, we only use ACLs on the Cisco routers. You can't even filter TCP/RST there. b) Even *if* a TCP/RST had been dropped, lost or whatever, the server *still* should timeout eventually!

I just had a discussion with a colleague regarding this. He made two observations:

1. In the absence of the SO_KEEPALIVE option it is entirely possible that a TCP connection remains ESTABLISHED even when the other side has gone.

I said that a socket should time out, but this is true only when the protocol (TCP here) requires a response (usually an ACK here) or at connection establishment. Otherwise it should stay open indefinitely until something happens. A router doing NAT can drop a connection that is too old, because it has to maintain a NAT table and clean it up from time to time; this is where KEEPALIVE becomes useful.

This may not be a solution to this particular problem, but it made me wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?

Cyrus already has a built-in timeout; it seems a little conflicting to actively maintain the connection until it drops it itself! It is the client's job to actively maintain the connection, if it wants it!

2. The stack trace looks garbled:

(gdb) bt
#0 0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
#1 0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2 0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
#3 0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
#4 0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
#5 0x005d9712 in ssl23_accept () from /lib/libssl.so.4
#6 0x005ddc9a in SSL_accept () from /lib/libssl.so.4
#7 0x08052cb3 in shut_down ()
#8 0x0804e513 in shut_down ()
#9 0x0804d58c in ?? ()
#10 0x0001 in ?? ()
#11 0x082ee848 in ?? ()
#12 0x in ?? ()

He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's version of OpenSSL messes up the stack. That would also explain why nobody else seems to have this problem.

I think I will try one more approach: I reverted cyrus.conf to not use -U 1 anymore, so that processes should be reused. I will strace one of the pop3d processes in the hope that it gets stuck. That way I should be able to see where things go wrong. If the process terminates normally I will try with another one. If that doesn't go anywhere, I guess I'll drop this investigation.

You could try to replace imapd with a home-made script, something like:

mv imapd imapd_
echo 'exec strace -o /tmp/imapd.$$ imapd_ $*' > imapd
chmod a+x imapd

We will upgrade to RHEL 5 some time next year, so hopefully that will bring new bugs :-)

Sorry, but I don't understand what you are complaining about! Is it because the IMAP or POP client is losing its connection and this disturbs the user, or just because you are getting some sleeping processes? Or both :-)

Do you have a timeout option in your imapd.conf to force the IMAP/POP server to autologout?

Regards.
Alain

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.

--
Alain Spineux aspineux gmail com May the sources be with you
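For reference, turning on the SO_KEEPALIVE option discussed in this message is a single setsockopt() call. The sketch below is not taken from Cyrus; enable_keepalive() and keepalive_enabled() are illustrative names. Note that with default kernel settings the first probe is typically sent only after a long idle period (two hours on a stock Linux system), so keepalive alone would not clear a dead connection quickly.

```c
#include <sys/socket.h>

/* Turn on TCP keepalive probes for a socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(). */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Report whether keepalive is enabled: 1 = on, 0 = off, -1 = error. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) == -1)
        return -1;
    return val != 0;
}
```

A server would call enable_keepalive() on each accepted descriptor; whether that is desirable alongside an application-level idle timeout is exactly the question raised above.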
Re: One more attempt: stuck processes
--On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] wrote: Did you ever see non SSL connections get stuck? No. -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.
Re: LARGE single-system Cyrus installs?
Dale Ghent wrote: On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote: Solaris 10 does this in my case. Via dtrace you'll see that open() on the mailboxes.db and read-calls do not exceed microsecond ranges. mailboxes.db is not the problem here. It is entirely cached and rarely written (creating, deleting and moving a mailbox). Hmm, I'm wondering if the Cyrus devs would be receptive to the idea of implementing some dtrace probes in Cyrus. Stuff such as mailbox open/close, IMAP operations such as SELECTs, message retrievals, and so on. We'd probably accept a patch, as long as it's portable. -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University
Re: One more attempt: stuck processes
--On 16. November 2007 11:27:52 -0500 Ken Murchison [EMAIL PROTECTED] wrote:

Sebastian Hagedorn wrote:

The only reason I could imagine for the sequence of calls was signal handling. But let's be methodical. There's only one spot where SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only called in cmd_starttls(). That in turn is called either in cmdloop (for handling of STLS) or in service_main() for connections to port 995.

Actually, now that I think about it, I believe SSL_accept() can be called from SSL_read() at any time if a renegotiation is required. Since shut_down() calls prot_fill(), which in turn can call SSL_read(), it's possible that we can get an SSL_accept() call.

Before I start hacking code, can you apply the following patch (sorry about the line breaks) and see if I'm heading in the right direction? Let me know if you get any of the WARNING messages in your logs.

--- prot.c.~1.93.~ 2007-11-16 11:21:56.0 -0500
+++ prot.c 2007-11-16 11:23:32.0 -0500
@@ -468,6 +468,7 @@
     /* just do a SSL read instead if we're under a tls layer */
     if (s->tls_conn != NULL) {
         n = SSL_read(s->tls_conn, (char *) s->buf, PROT_BUFSIZE);
+        if (n <= 0) syslog(LOG_WARNING, "SSL_read() returned %d", n);
     } else {
         n = read(s->fd, s->buf, PROT_BUFSIZE);
     }

Yes, I do:

Nov 16 17:59:34 lvr13 pop3s[3196]: SSL_read() returned 0
Nov 16 17:59:38 lvr13 pop3s[3196]: SSL_read() returned 0
Nov 16 18:00:09 lvr13 pop3s[3215]: SSL_read() returned 0
Nov 16 18:00:26 lvr13 pop3s[3847]: SSL_read() returned 0
Nov 16 18:00:34 lvr13 pop3s[3215]: SSL_read() returned 0
Nov 16 18:00:34 lvr13 pop3s[3199]: SSL_read() returned 0
Nov 16 18:00:39 lvr13 pop3s[3199]: SSL_read() returned 0
Nov 16 18:00:43 lvr13 pop3s[3229]: SSL_read() returned 0

Not all of these processes are stuck, though. (Maybe none are). Should I be looking for something specific?

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
Sebastian Hagedorn wrote:

Nov 16 18:00:26 lvr13 pop3s[3847]: SSL_read() returned 0
Nov 16 18:00:34 lvr13 pop3s[3215]: SSL_read() returned 0
Nov 16 18:00:34 lvr13 pop3s[3199]: SSL_read() returned 0
Nov 16 18:00:39 lvr13 pop3s[3199]: SSL_read() returned 0
Nov 16 18:00:43 lvr13 pop3s[3229]: SSL_read() returned 0

Not all of these processes are stuck, though. (Maybe none are). Should I be looking for something specific?

Sorry, my patch wasn't complete. It wasn't logging the value that I wanted. Try this:

--- prot.c.~1.93.~ 2007-11-16 11:21:56.0 -0500
+++ prot.c 2007-11-16 12:37:55.0 -0500
@@ -468,6 +468,10 @@
     /* just do a SSL read instead if we're under a tls layer */
     if (s->tls_conn != NULL) {
         n = SSL_read(s->tls_conn, (char *) s->buf, PROT_BUFSIZE);
+        if (n <= 0) {
+            syslog(LOG_WARNING, "SSL_read() returned %d:%d",
+                   n, SSL_get_error(s->tls_conn, n));
+        }
     } else {
         n = read(s->fd, s->buf, PROT_BUFSIZE);
     }

--
Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University
Re: One more attempt: stuck processes
--On 16. November 2007 12:39:28 -0500 Ken Murchison [EMAIL PROTECTED] wrote:

Sorry, my patch wasn't complete. It wasn't logging the value that I wanted.

OK:

Nov 16 18:48:17 lvr13 pop3s[1385]: SSL_read() returned 0:5
Nov 16 18:48:33 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:48:50 lvr13 pop3s[1980]: SSL_read() returned 0:6
Nov 16 18:48:54 lvr13 pop3s[1376]: SSL_read() returned 0:5
Nov 16 18:49:03 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:11 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:38 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:54 lvr13 pop3s[1404]: SSL_read() returned 0:5

I'm guessing that's still not enough:

#define SSL_ERROR_SYSCALL 5 /* look at error stack/return value/errno */
#define SSL_ERROR_ZERO_RETURN 6

SSL_ERROR_SYSCALL
Some I/O error occurred. The OpenSSL error queue may contain more information on the error. If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the error: If ret == 0, an EOF was observed that violates the protocol. If ret == -1, the underlying BIO reported an I/O error (for socket I/O on Unix systems, consult errno for details).

So should I add a call to ERR_get_error()?

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
Hi

Can I summarize the problem as follows: the server is blocked in a read, waiting for the client's next command (this is normal, 99% of the processes are in this state), but the autologout procedure is not working! Then this means the SIGALRM that should wake the process never comes or is not handled properly! A simple call to sleep() or signal() could disturb this. If this happens only when using SSL, maybe the problem is there and the alarm should be re-armed somewhere.

This is useless now, but the files in $cyrus_imap/proc/* contain the user and the selected mailbox of each of these processes. This could be useful to find out whether it is always the same user at the origin of the problem, for example because he is using an old Outlook or something.

Regards

On Nov 16, 2007 5:33 PM, Ken Murchison [EMAIL PROTECTED] wrote:

Sebastian Hagedorn wrote:

--On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote:

Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger.

Easy:

# pstack 19175
19175: pop3d -s
 fef9f810 read (0, 2316f0, 5)
 fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
 ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0
 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904
 ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828
 ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198
 0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
 00029298 _start (0, 0, 0, 0, 0, 0) + 108

Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation.

Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything

Yes. Gary's problem might be very similar to yours, depending on what I see from the patch that I just sent you.

--
Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University

--
Alain Spineux aspineux gmail com May the sources be with you
Re: One more attempt: stuck processes
--On 16. November 2007 18:21:21 +0100 Gabor Gombas [EMAIL PROTECTED] wrote: On Fri, Nov 16, 2007 at 06:11:01PM +0100, Sebastian Hagedorn wrote: Well, that just sounds like you're running out of entropy. That's a different issue. Recompile your cyrus-sasl to use /dev/urandom instead of /dev/random or disable apop in /etc/imapd.conf: Debian uses /dev/urandom for a long time: # strings /usr/lib/libsasl2.so.2 | grep random /dev/urandom And according to the logs I have, after a pop3 process got stuck other IMAP users can still log in using TLS+PLAIN, so entropy can be ruled out. OK. Still the symptom seems to be different from what I'm seeing. Could it be that you have a process limit in /etc/cyrus.conf? -- .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:. Zentrum für angewandte Informatik - Universitätsweiter Service RRZK .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:. .:.:.:.Skype: shagedorn.:.:.:.
Re: One more attempt: stuck processes
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 12:58:49 -0500 regarding Re: One more attempt: stuck processes:

So should I add a call to ERR_get_error()?

Not yet. I'm assuming that none of these processes has hung. We're getting an I/O error most likely because the client has closed the connection immediately after sending QUIT. This is harmless.

What I really want to see is if we get a SSL_ERROR_WANT_xxx return code when we're hung.

I have both good and bad news. Bad news first: there is a stuck process that did *not* log that SSL_read line. Good news: the binary I'm running now isn't stripped and has much more detail in its stack trace:

(gdb) bt
#0 0x003d341e in __read_nocancel () from /lib/tls/libc.so.6
#1 0x0017f2f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2 0x0017d2b2 in BIO_read () from /lib/libcrypto.so.4
#3 0x0089ec30 in ssl3_alert_code () from /lib/libssl.so.4
#4 0x0089edcc in ssl3_alert_code () from /lib/libssl.so.4
#5 0x008a00cf in ssl3_read_bytes () from /lib/libssl.so.4
#6 0x008a0ffc in ssl3_get_message () from /lib/libssl.so.4
#7 0x00896cab in ssl3_accept () from /lib/libssl.so.4
#8 0x00896944 in ssl3_accept () from /lib/libssl.so.4
#9 0x008a5c9a in SSL_accept () from /lib/libssl.so.4
#10 0x008a180d in ssl23_get_client_hello () from /lib/libssl.so.4
#11 0x008a1712 in ssl23_accept () from /lib/libssl.so.4
#12 0x008a5c9a in SSL_accept () from /lib/libssl.so.4
#13 0x08052cf3 in tls_start_servertls (readfd=-512, writefd=-512, layerbits=0xbfff7a78, authid=0xbfff7a74, ret=0x810bca0) at tls.c:803
#14 0x0804e553 in cmd_starttls (pop3s=1) at pop3d.c:1076
#15 0x0804d5cc in service_main (argc=2, argv=0x9e84008, envp=0xbfff9850) at pop3d.c:537
#16 0x08054550 in main (argc=2, argv=0x9, envp=0xbfff9850) at service.c:539

There's much less POP activity now, so I may have to wait until Monday for more results.

--
Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587
Re: LARGE single-system Cyrus installs?
On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote: Solaris 10 does this in my case. Via dtrace you'll see that open() on the mailboxes.db and read-calls do not exceed microsecond ranges. mailboxes.db is not the problem here. It is entirely cached and rarely written (creating, deleting and moving a mailbox). Hmm, I'm wondering if the Cyrus devs would be receptive to the idea of implementing some dtrace probes in Cyrus. Stuff such as mailbox open/close, IMAP operations such as SELECTs, message retrievals, and so on. I run cyrus on my personal server now, so maybe I'll fool around with that idea. /dale -- Dale Ghent Specialist, Storage and UNIX Systems UMBC - Office of Information Technology ECS 201 - x51705
Re: LARGE single-system Cyrus installs?
On 15 Nov 2007, at 18:25, Rob Mueller wrote: About 30% of all I/O is to mailboxes.db, most of which is read. I haven't personally deployed a split-meta configuration, but I understand the meta files are similarly heavy I/O concentrators. That sounds odd. Yeah, it's not right. I was reading my iostat output backwards. In fact, it's writes and presumably an artifact of having system logs on the same device as mailboxes.db. Sorry for the confusion. :wes
Re: One more attempt: stuck processes
I know it has been asked before and may be redundant, but... You answered that cyrus-sasl is using /dev/urandom and should not run out of entropy. However, what about openssl itself? It also uses random numbers. Perhaps, as a test, try renaming /dev/random and running ln -s /dev/urandom /dev/random. Gary Mills wrote: On Fri, Nov 16, 2007 at 05:13:13PM +0100, Sebastian Hagedorn wrote: --On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] wrote: Did you ever see non SSL connections get stuck? No. Most of mine are `pop3d -s', but I have seen a few without the `-s'. When I did a stack trace on one, it also turned out to be for an SSL session. So, I have to agree.
Help with xfermailbox
I'm experiencing errors when attempting to transfer a mailbox from one backend to another in a murder environment. This is my first try, so this could be due to misconfiguration.

I have three servers in my setup:

kaled.olp.net - MUPDATE master and frontend
gandalf.olp.net - backend #1
neo.olp.net - backend #2

When I issue the command xfermailbox user/9183641498 neo.olp.net from gandalf, I receive the error:

gandalf.olp.net> xfer user/9183641498 neo.olp.net
xfermailbox: The remote Server(s) denied the operation

And in neo's (destination backend) logs, I see:

Nov 16 14:16:18 neo cyrus/imap[6183]: accepted connection
Nov 16 14:16:19 neo cyrus/imap[6183]: login: gandalf.olp.net [65.161.252.87] cyrus-gandalf.olp.net GSSAPI User logged in
Nov 16 14:16:19 neo cyrus/imap[6183]: kick_mupdate: can't connect to target: No such file or directory

Sometimes I also get (in addition to the No such file or directory error):

Nov 16 13:44:57 neo cyrus/imap[6171]: decoding error: generic failure; SASL(-1): generic failure: , closing connection

The relevant portion of the code that generates this error appears to be in mupdate-client.c:

    strlcpy(buf, config_dir, sizeof(buf));
    strlcat(buf, FNAME_MUPDATE_TARGET_SOCK, sizeof(buf));
    memset((char *)&srvaddr, 0, sizeof(srvaddr));
    srvaddr.sun_family = AF_UNIX;
    strcpy(srvaddr.sun_path, buf);
    len = sizeof(srvaddr.sun_family) + strlen(srvaddr.sun_path) + 1;
    r = connect(s, (struct sockaddr *)&srvaddr, len);
    if (r == -1) {
        syslog(LOG_ERR, "kick_mupdate: can't connect to target: %m");
        goto done;
    }

FNAME_MUPDATE_TARGET_SOCK is defined in mupdate-client.h as:

    #define FNAME_MUPDATE_TARGET_SOCK "/socket/mupdate.target"

I can't find any sockets named mupdate.target on neo (my destination backend).

Relevant configurations can be found at:

http://support.olp.net/cyrus/kaled-imapd.conf
http://support.olp.net/cyrus/kaled-cyrus.conf
http://support.olp.net/cyrus/gandalf-imapd.conf
http://support.olp.net/cyrus/gandalf-cyrus.conf
http://support.olp.net/cyrus/neo-imapd.conf
http://support.olp.net/cyrus/neo-cyrus.conf

I'm running 2.3.10, with several Debian patches.

Thanks for any help,

--
Dan White [EMAIL PROTECTED] BTC Broadband
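The "No such file or directory" in neo's log is what connect() reports (errno ENOENT) when nothing exists at the UNIX socket path. The standalone sketch below reproduces just that step outside Cyrus; connect_unix() is a made-up name and the code is not taken from mupdate-client.c.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to a UNIX domain socket at path.
 * Returns the connected descriptor, or -1 with errno set
 * (ENOENT if no socket file exists at that path). */
int connect_unix(const char *path)
{
    struct sockaddr_un addr;
    int fd, n, saved;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    n = snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);
    if (n < 0 || (size_t)n >= sizeof(addr.sun_path)) {
        errno = ENAMETOOLONG;        /* path does not fit in sun_path */
        return -1;
    }

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        saved = errno;               /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Pointing such a probe at the configuration directory's socket/mupdate.target path on neo would confirm whether any process is listening there; which Cyrus service is expected to create that socket is not stated in this message.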
Re: One more attempt: stuck processes
--On Friday, November 16, 2007 3:54 PM -0500 Ken Murchison [EMAIL PROTECTED] wrote: I've reproduced the former by telneting to port 995 and doing nothing. I have been unable to reproduce the latter because as soon as I QUIT the telnet session or kill() the telnet process, pop3d exits gracefully. I agree with others that you want to do something other than kill the telnet session, like unplugging the cable. I've seen similar behavior out of cyrus that I thought could easily be people on laptops shutting their computers down hard, in some way that the TCP/IP stack never got a chance to clean up the connection properly. With a QUIT or a kill, you're giving the OS a chance to do the right thing. -Michael
Bingo!
-- Sebastian Hagedorn [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 22:03:21 +0100 regarding Re: One more attempt: stuck processes: The question is how pop3d knows that the connection is dropped. And maybe that's really where dial-up comes into play. I don't know if you're in a position to test that, but what happens if you telnet to port 995 from dial-up and then drop the dial-up connection? I guess I might try that from home now. That does it ... I disconnected my cable modem while having an open telnet connection to 995. Now that process is stuck. -- Sebastian Hagedorn - RZKR-R1 (Flachbau), Zi. 18, Robert-Koch-Str. 10 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587
Re: LARGE single-system Cyrus installs?
Dale Ghent wrote: On Nov 16, 2007, at 2:56 PM, Ken Murchison wrote: Dale Ghent wrote: On Nov 16, 2007, at 1:39 AM, Pascal Gienger wrote:

Solaris 10 does this in my case. Via dtrace you'll see that open() on the mailboxes.db and read-calls do not exceed microsecond ranges. mailboxes.db is not the problem here. It is entirely cached and rarely written (creating, deleting and moving a mailbox).

Hmm, I'm wondering if the Cyrus devs would be receptive to the idea of implementing some dtrace probes in Cyrus. Stuff such as mailbox open/close, IMAP operations such as SELECTs, message retrievals, and so on.

We'd probably accept a patch, as long as it's portable.

Portable in what sense, exactly? Currently the only OSes which offer DTrace are OSX 10.5 and Solaris 10 (and Solaris Next), so would I be correct to assume that you mean that a dtrace feature would have to work on those two OSes?

I don't care if it only works on Solaris 10, but the code can't get in the way of it compiling and running on any other non-DTrace system.

-- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University
Re: Bingo!
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 16:29:20 -0500 regarding Re: Bingo!:

That does it ... I disconnected my cable modem while having an open telnet connection to 995. Now that process is stuck.

Does the same thing happen if you telnet to port 110?

Actually yes - so far! But the stack trace and strace are instructive:

(gdb) bt
#0 0x006cf2e8 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1 0x08073f76 in prot_fill (s=0x9f6bf48) at prot.c:439
#2 0x080757ad in prot_fgets (buf=0xbfff7a30 quit, size=8191, s=0x9f6bf48) at prot.c:1196
#3 0x0804da6b in cmdloop () at pop3d.c:762
#4 0x0804d516 in service_main (argc=1, argv=0x9f1f008, envp=0xbfffb80c) at pop3d.c:543
#5 0x08054550 in main (argc=1, argv=0x9, envp=0xbfffb80c) at service.c:539

# strace -p 18432
Process 18432 attached - interrupt to quit
select(1, [0], NULL, NULL, {463, 9}

The select() will time out eventually, I'm sure. I'm currently waiting for that to happen.

-- Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587
Re: Bingo!
-- Sebastian Hagedorn [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 22:36:09 +0100 regarding Re: Bingo!:

The select() will time out eventually, I'm sure. I'm currently waiting for that to happen.

Here we go:

# strace -p 18432
Process 18432 attached - interrupt to quit
select(1, [0], NULL, NULL, {463, 9}) = 0 (Timeout)
time(NULL) = 1195249308
close(9) = 0
munmap(0xb47a4000, 4096) = 0
unlink(/var/lib/imap/proc/18432) = 0
munmap(0xb47a5000, 12214272) = 0
close(6) = 0
munmap(0xb41da000, 6070272) = 0
close(10) = 0
munmap(0xb534b000, 32768) = 0
munmap(0xb6953000, 2621440) = 0
munmap(0xb5353000, 23068672) = 0
munmap(0xb74a1000, 1318912) = 0
munmap(0xb735f000, 1318912) = 0
munmap(0xb721d000, 1318912) = 0
munmap(0xb70db000, 1318912) = 0
munmap(0xb6f99000, 1318912) = 0
munmap(0xb6e57000, 1318912) = 0
munmap(0xb6d15000, 1318912) = 0
munmap(0xb6bd3000, 1318912) = 0
munmap(0xb75f4000, 16384) = 0
exit_group(0) = ?

I suppose an alarm handler is in order?

-- Sebastian Hagedorn - RZKR-R1 (Flachbau), Zi. 18, Robert-Koch-Str. 10 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587
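The strace output shows pop3d blocked in select() with a timeout and exiting once that timeout expires. A minimal Python sketch of the same pattern follows, assuming a plain socket stands in for the client connection; `wait_for_command` and the timeout values are hypothetical and are not taken from Cyrus.

```python
import select
import socket


def wait_for_command(conn, timeout):
    """Return the next chunk of client data, or None if the client stays silent.

    Hypothetical helper: select() signals a timeout by returning empty lists,
    which is the point where a server can log the client out.
    """
    readable, _, _ = select.select([conn], [], [], timeout)
    if not readable:
        return None          # timed out: the peer sent nothing
    return conn.recv(8192)   # b"" here would mean an orderly close by the peer


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    print(wait_for_command(server_side, 0.2))   # silent client
    client_side.sendall(b"QUIT\r\n")
    print(wait_for_command(server_side, 0.2))   # data arrived
    server_side.close()
    client_side.close()
```

Because the timeout is part of the select() call itself, this approach needs no signal handler; the process simply stays in select() until the configured interval has passed.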
RE: Bingo!
It looks like it timed out properly, correct?

(from my phone)

-- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University

-Original Message-
From: Sebastian Hagedorn [EMAIL PROTECTED]
To: Ken Murchison [EMAIL PROTECTED]
Cc: Postmaster Uni Köln [EMAIL PROTECTED]; Cyrus IMAP info-cyrus@lists.andrew.cmu.edu
Sent: 11/16/07 4:43 PM
Subject: Re: Bingo!

-- Sebastian Hagedorn [EMAIL PROTECTED] is rumored to have mumbled on 16. November 2007 22:36:09 +0100 regarding Re: Bingo!:

The select() will time out eventually, I'm sure. I'm currently waiting for that to happen.

Here we go:

# strace -p 18432
Process 18432 attached - interrupt to quit
select(1, [0], NULL, NULL, {463, 9}) = 0 (Timeout)
time(NULL) = 1195249308
close(9) = 0
munmap(0xb47a4000, 4096) = 0
unlink(/var/lib/imap/proc/18432) = 0
munmap(0xb47a5000, 12214272) = 0
close(6) = 0
munmap(0xb41da000, 6070272) = 0
close(10) = 0
munmap(0xb534b000, 32768) = 0
munmap(0xb6953000, 2621440) = 0
munmap(0xb5353000, 23068672) = 0
munmap(0xb74a1000, 1318912) = 0
munmap(0xb735f000, 1318912) = 0
munmap(0xb721d000, 1318912) = 0
munmap(0xb70db000, 1318912) = 0
munmap(0xb6f99000, 1318912) = 0
munmap(0xb6e57000, 1318912) = 0
munmap(0xb6d15000, 1318912) = 0
munmap(0xb6bd3000, 1318912) = 0
munmap(0xb75f4000, 16384) = 0
exit_group(0) = ?

I suppose an alarm handler is in order?

-- Sebastian Hagedorn - RZKR-R1 (Flachbau), Zi. 18, Robert-Koch-Str. 10 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK Universität zu Köln / Cologne University - Tel. +49-221-478-5587
Re: One more attempt: stuck processes
On Fri, Nov 16, 2007 at 03:54:50PM -0500, Ken Murchison wrote:

That's exactly what Gary is seeing. It's blocking in SSL_accept(). Apparently the client connects to port 995, and then either sends nothing, or goes away and leaves the socket open. I've reproduced the former by telnetting to port 995 and doing nothing. I have been unable to reproduce the latter because as soon as I QUIT the telnet session or kill() the telnet process, pop3d exits gracefully.

You probably have to reboot the client at that point, or just disconnect the cable and take it home.

-- -Gary Mills--Unix Support--U of M Academic Computing and Networking-
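The case Ken describes, a client that connects to port 995 and then sends nothing, leaves the server waiting for the first bytes of the TLS handshake. One common way to bound that wait is to put a deadline on the socket before reading. The sketch below is Python, not the Cyrus patch (which is not shown in this thread); `read_record_header` is a hypothetical helper that reads only the 5-byte TLS record header (content type, two version bytes, two length bytes).

```python
import socket
import struct


def read_record_header(conn, timeout):
    """Read the 5-byte TLS record header, giving up after `timeout` seconds.

    Hypothetical helper. Returns (content_type, (major, minor), length),
    or None if the client never sent a complete header.
    """
    conn.settimeout(timeout)      # applies to each recv() call below
    header = b""
    try:
        while len(header) < 5:
            chunk = conn.recv(5 - len(header))
            if not chunk:         # peer closed before finishing the header
                return None
            header += chunk
    except socket.timeout:        # silent peer: stop waiting
        return None
    content_type, major, minor, length = struct.unpack("!BBBH", header)
    return content_type, (major, minor), length
```

Note that the timeout here applies per recv() call, so a client that trickles one byte at a time could stretch the total wait; an overall deadline would need a separate clock check.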
Re: One more attempt: stuck processes
On Nov 16, 2007 6:24 PM, Alain Spineux [EMAIL PROTECTED] wrote:

Hi

Can I summarize the problem as follows:

I was wrong.

The server is blocked in a read, waiting for the client's next command (this is normal; 99% of the processes are in this state).

No, it is waiting in select(), and the select() has a timeout!

But the autologout procedure is not working! That means the SIGALRM that should wake the process never arrives or is not handled properly. A simple call to sleep() or signal() could disturb this. If this happens only when using SSL, maybe the problem is there and the ALRM should be reloaded somewhere.

Wrong, wrong! It could be right, but here the timeout appears to be done using select()!

This is useless now, but the files in $cyrus_imap/proc/* contain the user and the selected mailbox of all these processes. This could be useful to know whether it was always the same user at the origin of the problem, because he was using an old Outlook or something.

Regards

On Nov 16, 2007 5:33 PM, Ken Murchison [EMAIL PROTECTED] wrote: Sebastian Hagedorn wrote: --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] wrote:

Could you get a stack trace? If you have gdb you just call it with gdb -p 19175. Then you can do bt at the prompt. I forget how to do it with Sun's debugger.

Easy:

# pstack 19175
19175: pop3d -s
 fef9f810 read (0, 2316f0, 5)
 fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
 ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + d0
 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904
 ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828
 ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198
 0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
 00029298 _start (0, 0, 0, 0, 0, 0) + 108

Thanks, that looks like progress! That stack trace looks similar enough to the one I'm seeing that I could imagine that it is what I *should* be seeing if the stack weren't garbled. Of course that's only speculation.

Ken, is it possible that the call to SSL_accept() in tls_start_servertls() blocks when the client goes away? That could explain everything

Yes. Gary's problem might be very similar to yours, depending on what I see from the patch that I just sent you.

-- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University

-- Alain Spineux aspineux gmail com May the sources be with you
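Alain's suggestion of checking the per-process files under $cyrus_imap/proc/ to see whether one user is behind the stuck processes could be sketched roughly as follows. This is Python, with loudly stated assumptions: the thread only says the files contain the user and selected mailbox, so the tab-separated "clienthost, user, mailbox" layout is an assumption, and `users_by_process` and `count_by_user` are hypothetical helpers. The file-per-PID naming matches the unlink(/var/lib/imap/proc/18432) call seen in the strace output.

```python
import collections
import pathlib


def users_by_process(proc_dir, pids):
    """Map each PID in `pids` to the user recorded in its proc file.

    Assumption (not confirmed in this thread): each file holds tab-separated
    fields "clienthost<TAB>user<TAB>mailbox"; the user and mailbox fields
    may be missing before login or before a mailbox is selected.
    """
    result = {}
    for pid in pids:
        path = pathlib.Path(proc_dir) / str(pid)
        try:
            fields = path.read_text().rstrip("\n").split("\t")
        except FileNotFoundError:
            continue              # process already exited and cleaned up
        result[pid] = fields[1] if len(fields) > 1 else None
    return result


def count_by_user(proc_dir, pids):
    """Count how many of the given processes belong to each user."""
    return collections.Counter(users_by_process(proc_dir, pids).values())
```

Feeding it the PIDs of the stuck processes (for example, collected from ps) would show at a glance whether a single account dominates; a count under the key None would indicate connections that never got as far as logging in.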
building cyrus 2.3.10 on x86_64
I'm trying to build Cyrus 2.3.10 on an x86_64 box running Debian Etch (stable). The first problem I ran into was out-of-date config.guess and config.sub scripts in the 2.3.10 tarball. I grabbed the latest copies of those files from the GNU website:

http://cvs.savannah.gnu.org/viewvc/*checkout*/config/config/config.guess
http://cvs.savannah.gnu.org/viewvc/*checkout*/config/config/config.sub

Could we get those added to the next tarball release?

Now, I'm getting an error during the make process when the Cyrus perl bits are compiled:

### Making all in /private/src/cyrus-imapd-2.3.10/perl/imap
Checking if your kit is complete...
Looks good
Writing Makefile for Cyrus::IMAP
make[2]: Entering directory `/private/src/cyrus-imapd-2.3.10/perl/imap'
cp IMAP/Admin.pm blib/lib/Cyrus/IMAP/Admin.pm
cp IMAP.pm blib/lib/Cyrus/IMAP.pm
cp IMAP/Shell.pm blib/lib/Cyrus/IMAP/Shell.pm
cp IMAP/IMSP.pm blib/lib/Cyrus/IMAP/IMSP.pm
/usr/bin/perl /usr/share/perl/5.8/ExtUtils/xsubpp -typemap /usr/share/perl/5.8/ExtUtils/typemap -typemap typemap IMAP.xs IMAP.xsc mv IMAP.xsc IMAP.c
cc -c -I../../lib -I../.. -I../../com_err/et -D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -DVERSION=\1.00\ -DXS_VERSION=\1.00\ -fPIC -I/usr/lib/perl/5.8/CORE -DPERL_POLLUTE IMAP.c
Running Mkbootstrap for Cyrus::IMAP ()
chmod 644 IMAP.bs
rm -f blib/arch/auto/Cyrus/IMAP/IMAP.so
cc -shared -L/usr/local/lib IMAP.o -o blib/arch/auto/Cyrus/IMAP/IMAP.so ../../lib/libcyrus.a ../../lib/libcyrus_min.a\
 -ldb-4.4 -lsasl2 -lssl -lcrypto \
/usr/bin/ld: ../../lib/libcyrus.a(imclient.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
../../lib/libcyrus.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
make[2]: *** [blib/arch/auto/Cyrus/IMAP/IMAP.so] Error 1
make[2]: Leaving directory `/private/src/cyrus-imapd-2.3.10/perl/imap'
make[1]: *** [all] Error 1
make[1]: Leaving directory `/private/src/cyrus-imapd-2.3.10/perl'
make: *** [all] Error 1

I was able to get it to compile cleanly by adding -fPIC to the CFLAGS definition in each Makefile. I'm not sure if this is the correct solution though! Any feedback?

Andy