Re: One more attempt: stuck processes

2007-11-20 Thread Ken Murchison

Gary Mills wrote:

On Mon, Nov 19, 2007 at 12:35:46PM -0500, Ken Murchison wrote:

Sebastian Hagedorn wrote:
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 
17. November 2007 11:21:38 -0500 regarding Re: One more attempt: stuck 
processes:



Here's a patch that seems to fix the problem.  I did some basic testing
(Linux only) to make sure that it doesn't break anything else, but its
always possible that it has some unforseen side effects.  Keep an eye on
it and let me know if you see anything unusual.

Thanks, it seems to be working fine so far! Enjoy your weekend now ...



Gary, have you tried the patch?


On Solaris 9, SO_RCVTIMEO is not mentioned in the setsockopt man page.
My reading tells me that it is defined in the header file, but if it's
actually used, setsockopt() will return an error.  I understand that
this is the case for other operating systems too.  This seems to be a
known problem with setsockopt().  I just checked the Opensolaris source;
it does function there.  This is the code:

 case SO_RCVTIMEO:
if (optlen == sizeof (uint32_t))
  sockets[i].in_timeout = *(uint32_t *)optval;
else {
errno = EINVAL;
}
break;

Note that the option has to be an int, and the length has to be
that of an int as well.  Linux wants a `struct timeval'.  I wonder
if there's a standard in this area?


OK.  Can you both try this alternate patch?  It should be portable, and 
GDB shouldn't cause it to kick out.  I've set it up so that for 
SSL-wrapped services it will timeout after 3 minutes, otherwise it uses 
the service-specific timeout.


--
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University
Index: imapd.c
===
RCS file: /afs/andrew/system/cvs/src/cyrus/imap/imapd.c,v
retrieving revision 1.535
diff -u -r1.535 imapd.c
--- imapd.c	9 Nov 2007 12:54:22 -	1.535
+++ imapd.c	20 Nov 2007 14:15:24 -
@@ -148,6 +148,7 @@
 /* end PROXY STUFF */
 
 /* per-user/session state */
+int imapd_timeout;
 struct protstream *imapd_out = NULL;
 struct protstream *imapd_in = NULL;
 struct protgroup *protin = NULL;
@@ -701,7 +702,6 @@
 #endif
 {
 socklen_t salen;
-int timeout;
 sasl_security_properties_t *secprops = NULL;
 struct sockaddr_storage imapd_localaddr, imapd_remoteaddr;
 char localip[60], remoteip[60];
@@ -778,9 +778,10 @@
 proc_register(imapd, imapd_clienthost, NULL, NULL);
 
 /* Set inactivity timer */
-timeout = config_getint(IMAPOPT_TIMEOUT);
-if (timeout  30) timeout = 30;
-prot_settimeout(imapd_in, timeout*60);
+imapd_timeout = config_getint(IMAPOPT_TIMEOUT);
+if (imapd_timeout  30) imapd_timeout = 30;
+imapd_timeout *= 60;
+prot_settimeout(imapd_in, imapd_timeout);
 prot_setflushonread(imapd_in, imapd_out);
 
 /* we were connected on imaps port so we should do 
@@ -6626,6 +6627,7 @@
   
 result=tls_start_servertls(0, /* read */
 			   1, /* write */
+			   imaps ? 180 : imapd_timeout,
 			   layerp,
 			   auth_id,
 			   tls_conn);
Index: lmtpengine.c
===
RCS file: /afs/andrew/system/cvs/src/cyrus/imap/lmtpengine.c,v
retrieving revision 1.122
diff -u -r1.122 lmtpengine.c
--- lmtpengine.c	10 Oct 2007 15:14:39 -	1.122
+++ lmtpengine.c	20 Nov 2007 14:15:25 -
@@ -1579,6 +1579,7 @@
   
 		r=tls_start_servertls(0, /* read */
   1, /* write */
+  360, /* 6 minutes */
   layerp,
   auth_id,
   (cd.tls_conn));
Index: mupdate.c
===
RCS file: /afs/andrew/system/cvs/src/cyrus/imap/mupdate.c,v
retrieving revision 1.99
diff -u -r1.99 mupdate.c
--- mupdate.c	10 Oct 2007 15:14:39 -	1.99
+++ mupdate.c	20 Nov 2007 14:15:25 -
@@ -1941,6 +1941,7 @@
   
 result=tls_start_servertls(C-pin-fd, /* read */
 			   C-pout-fd, /* write */
+			   180, /* 3 minutes */
 			   layerp,
 			   auth_id,
 			   C-tlsconn);
Index: nntpd.c
===
RCS file: /afs/andrew/system/cvs/src/cyrus/imap/nntpd.c,v
retrieving revision 1.60
diff -u -r1.60 nntpd.c
--- nntpd.c	10 Oct 2007 15:14:39 -	1.60
+++ nntpd.c	20 Nov 2007 14:15:25 -
@@ -142,6 +142,7 @@
 
 sasl_conn_t *nntp_saslconn; /* the sasl connection context */
 
+int nntp_timeout;
 char newsprefix[100] = ;
 char *nntp_userid = 0, *newsmaster;
 struct auth_state *nntp_authstate = 0, *newsmaster_authstate;
@@ -523,7 +524,6 @@
 char localip[60], remoteip[60];
 char hbuf[NI_MAXHOST];
 int niflags;
-int timeout;
 sasl_security_properties_t *secprops=NULL;
 char unavail[1024];
 
@@ -590,9 +590,10 @@
 proc_register(nntpd, nntp_clienthost, NULL, NULL);
 
 /* Set inactivity timer

Re: One more attempt: stuck processes

2007-11-20 Thread Sebastian Hagedorn
--On 20. November 2007 09:20:42 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



OK.  Can you both try this alternate patch?  It should be portable, and
GDB shouldn't cause it to kick out.  I've set it up so that for
SSL-wrapped services it will timeout after 3 minutes, otherwise it uses
the service-specific timeout.


Thanks for working on this. I think you missed a spot:

gcc -c -I.. -I./../sieve -I./../imap -I./../lib  -I/usr/include/db4 
-I/usr/kerberos/include   -I/usr/include/et  -DHAVE_CONFIG_H 
-I/usr/kerberos/include   -O2 -g -pipe -march=i386 -mcpu=i686 -fPIC \

parser.c
parser.c: In function `cmd_starttls':
parser.c:834: warning: passing arg 3 of `tls_start_servertls' makes integer 
from pointer without a cast
parser.c:834: warning: passing arg 4 of `tls_start_servertls' from 
incompatible pointer type
parser.c:834: warning: passing arg 5 of `tls_start_servertls' from 
incompatible pointer type

parser.c:834: too few arguments to function `tls_start_servertls'

I can fix this myself, but it's probably easier if you do it.
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpImfrnNAhAO.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-20 Thread Sebastian Hagedorn
--On 20. November 2007 15:59:18 +0100 Sebastian Hagedorn 
[EMAIL PROTECTED] wrote:



I can fix this myself, but it's probably easier if you do it.


Just FYI: I fixed it locally with a 3 minute timeout and it compiled fine. 
I'll start testing it now.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpoLn6l11ZH7.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-19 Thread Sebastian Hagedorn
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 19. 
November 2007 12:35:46 -0500 regarding Re: One more attempt: stuck 
processes:



How are things looking today?


Good! When I just checked I thought I'd found a new hanging pop3d process, 
because it's been around for 6 hours, but in fact it's still working:


Nov 19 14:17:16 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
[80.134.87.49] LOGIN user not found

...
Nov 19 19:04:23 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
[80.134.87.49] LOGIN user not found
Nov 19 19:04:27 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
[80.134.87.49] LOGIN user not found
Nov 19 19:04:30 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
[80.134.87.49] LOGIN user not found


Sigh. But that's beside the matter. The only other potential downside the 
patch has is that stracing or gdb'ing it causes the timeout to trigger 
prematurely. AFAIK that's a common issue for signals. I'm not sure if 
there's something that could be done about that, i.e. catching ERESTART or 
something like that?

--
Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587

pgpOw7IfeSb18.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-19 Thread Sebastian Hagedorn
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 19. 
November 2007 13:17:07 -0500 regarding Re: One more attempt: stuck 
processes:



The only other potential downside

the patch has is that stracing or gdb'ing it causes the timeout to
trigger prematurely. AFAIK that's a common issue for signals. I'm not
sure if there's something that could be done about that, i.e. catching
ERESTART or something like that?


Is this only a problem with pop3s, or does it also effect the STLS
command?


Yes, it does:

telnet cyrus 110
...
stls
+OK Begin TLS negotiation now
-ERR [SYS/PERM] Starttls failed

That's when I straced the process ...
--
Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587

pgpgMNLWdKmc1.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-19 Thread Ken Murchison
Sebastian Hagedorn wrote:
 -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 
 19. November 2007 13:17:07 -0500 regarding Re: One more attempt: stuck 
 processes:
 
 The only other potential downside
 the patch has is that stracing or gdb'ing it causes the timeout to
 trigger prematurely. AFAIK that's a common issue for signals. I'm not
 sure if there's something that could be done about that, i.e. catching
 ERESTART or something like that?

 Is this only a problem with pop3s, or does it also effect the STLS
 command?
 
 Yes, it does:
 
 telnet cyrus 110
 ...
 stls
 +OK Begin TLS negotiation now
 -ERR [SYS/PERM] Starttls failed
 
 That's when I straced the process ...

If you want to write and test a patch to resolve this, I'll certainly 
take a look at it.

Another way to resolve the blocking SSL_accept() problem would be to 
make the socket non-blocking and then check the SSL_accept() return code 
to see if we need to call SSL_accept() again.  I took the easy way out 
by setting SO_RCVTIMEO.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-19 Thread Ken Murchison
Sebastian Hagedorn wrote:
 -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 
 19. November 2007 12:35:46 -0500 regarding Re: One more attempt: stuck 
 processes:
 
 How are things looking today?
 
 Good! When I just checked I thought I'd found a new hanging pop3d 
 process, because it's been around for 6 hours, but in fact it's still 
 working:
 
 Nov 19 14:17:16 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
 [80.134.87.49] LOGIN user not found
 ...
 Nov 19 19:04:23 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
 [80.134.87.49] LOGIN user not found
 Nov 19 19:04:27 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
 [80.134.87.49] LOGIN user not found
 Nov 19 19:04:30 lvr13 pop3s[23493]: badlogin: p50865731.dip.t-dialin.net 
 [80.134.87.49] LOGIN user not found
 
 Sigh. But that's beside the matter. The only other potential downside 
 the patch has is that stracing or gdb'ing it causes the timeout to 
 trigger prematurely. AFAIK that's a common issue for signals. I'm not 
 sure if there's something that could be done about that, i.e. catching 
 ERESTART or something like that?

Is this only a problem with pop3s, or does it also effect the STLS command?

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-19 Thread Gary Mills
On Mon, Nov 19, 2007 at 12:35:46PM -0500, Ken Murchison wrote:
 Sebastian Hagedorn wrote:
 -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 
 17. November 2007 11:21:38 -0500 regarding Re: One more attempt: stuck 
 processes:
 
 Here's a patch that seems to fix the problem.  I did some basic testing
 (Linux only) to make sure that it doesn't break anything else, but its
 always possible that it has some unforseen side effects.  Keep an eye on
 it and let me know if you see anything unusual.
 
 Thanks, it seems to be working fine so far! Enjoy your weekend now ...

 Gary, have you tried the patch?

On Solaris 9, SO_RCVTIMEO is not mentioned in the setsockopt man page.
My reading tells me that it is defined in the header file, but if it's
actually used, setsockopt() will return an error.  I understand that
this is the case for other operating systems too.  This seems to be a
known problem with setsockopt().  I just checked the Opensolaris source;
it does function there.  This is the code:

 case SO_RCVTIMEO:
if (optlen == sizeof (uint32_t))
  sockets[i].in_timeout = *(uint32_t *)optval;
else {
errno = EINVAL;
}
break;

Note that the option has to be an int, and the length has to be
that of an int as well.  Linux wants a `struct timeval'.  I wonder
if there's a standard in this area?

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-17 Thread Gabor Gombas
On Fri, Nov 16, 2007 at 06:37:52PM +0100, Sebastian Hagedorn wrote:

 OK. Still the symptom seems to be different from what I'm seeing.

It may be. As I said I had no time so far to investigate it in depth, I
just wanted to say mee too for the hung process problem.

 Could it be that you have a process limit in /etc/cyrus.conf?

Of course. And when the number of hung processes reaches that limit
users start to complain that they can not read their mail.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-17 Thread Ken Murchison
Sebastian Hagedorn wrote:
 -- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 
 16. November 2007 15:54:50 -0500 regarding Re: One more attempt: stuck 
 processes:
 
 That's exactly what Gary is seeing.
 
 Right. Apparently stripped binaries aren't any good for straces.
 
 Its blocking in SSL_accept().
 Apparently the client connects to port 995, and then either sends
 nothing, or goes away and leaves the socket open.

Here's a patch that seems to fix the problem.  I did some basic testing 
(Linux only) to make sure that it doesn't break anything else, but its 
always possible that it has some unforseen side effects.  Keep an eye on 
it and let me know if you see anything unusual.

--- prot.c.~1.93.~  2007-11-17 10:02:49.0 -0500
+++ prot.c  2007-11-17 11:21:05.0 -0500
@@ -58,6 +58,7 @@
  #ifdef HAVE_UNISTD_H
  #include unistd.h
  #endif
+#include sys/time.h
  #include sys/types.h
  #include sys/stat.h
  #include netinet/in.h
@@ -214,10 +215,20 @@
   */
  int prot_settimeout(struct protstream *s, int timeout)
  {
+struct timeval tv;
+
  assert(!s-write);

  s-read_timeout = timeout;
  s-timeout_mark = time(NULL) + timeout;
+
+/* Set a read timeout on the socket itself.  This primarily prevents
+ * OpenSSL calls such as SSL_accept() from blocking indefinitely.
+ */
+tv.tv_sec = timeout;
+tv.tv_usec = 0;
+setsockopt(0, SOL_SOCKET, SO_RCVTIMEO, tv, sizeof(struct timeval));
+
  return 0;
  }



-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 15. November 2007 19:25:19 +0100 Simon Matter [EMAIL PROTECTED] 
wrote:



It's blinking red, which normally means a broken link. I'm not sure how


The file 0 is a symbolic symlink which doesn't really point to a file,
that's why the shell shows it blinking. Everything okay here.


Thanks. That's what I thought, but I wasn't sure.


reliable that is in this case. Anyway, lsof reports:

pop3d   25038 cyrus0u  IPv4  -64802663 TCP
cyrus.rrz.uni-koeln.de:pop3s-p50865F5D.dip.t-dialin.net:1064
(ESTABLISHED)

It *thinks* the connections is still open. So does netstat:

# LANG=C netstat -a|grep p50865F5D
tcp0  0 cyrus.rrz.uni-koeln.d:pop3s
p50865F5D.dip.t-dialin:1064
ESTABLISHED

But obviously that connection is dead. I don't know what conclusions to
draw from that ...


Just two ideas come to mind:

1) Since it only happens on dialup connections, could it be that the
dialin router at the providers end sends TCP/RST when a client hangs up
and those packets are filtered somewhere, maybe on your firewall?


OK, let's run with that one.

a) We don't really have a firewall, we only use ACLs on the Cisco routers. 
You can't even filter TCP/RST there.


b) Even *if* a TCP/RST had been dropped, lost or whatever, the server 
*still* should timeout eventually!



2) Could it be that SO_LINGER should be used as socket option in
service_create() in master/master.c.


I didn't remember that option, so I just read up on it. It seems as though 
SO_LINGER is very dependent on implementation.  If I get your intention 
correctly SO_LINGER would have to be set with l_onoff set to non-zero and 
l_linger to zero, right? So close() would return immediately? That might 
make sense if the stack trace showed a call to close(). But if I understand 
the code correctly, close() isn't called at all. The socket is closed as a 
result of a call to exit(). And that defeats all use of SO_LINGER:


When the socket is closed as part of exit(2), it always lingers in the 
background.



If it's complete nonsense, ignore it.


I wouldn't know :-)
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpproVHc1y86.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 16:52:27 +0100 Gabor Gombas [EMAIL PROTECTED] 
wrote:



On Fri, Nov 16, 2007 at 12:36:49PM +0100, Sebastian Hagedorn wrote:


He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's
version of OpenSSL messes up the stack. That would also explain why
nobody  else seems to have this problem.


FYI I also know a system that has problems with hung Cyrus processes.
AFAIR they have problems with pop3s only, but that may be because there
are more POP3 than IMAP users, I don't know. The system in question runs
2.3.8 on Debian Etch currently.


That's a 2.6 kernel, right?


I intend to help diagnose that system but had no time so far; they're
now running a script that does a POP3 connection every couple of minutes
and if that takes too long it restarts Cyrus.


Hm, we don't suffer any actual slowdown, it's just that the number of 
processes increases over time.



There is nothing interesting in the logs:

Oct 15 02:39:31 host cyrus/master[6102]: about to exec
/usr/local/cyrus/sbin/pop3d Oct 15 02:39:31 host cyrus/pop3s[6102]:
executed
Oct 15 02:39:31 host cyrus/pop3s[6102]: accepted connection


That's what I'm seeing. Could you get a stack trace?


OTOH there are a lot of messages like the following:

Oct 16 14:13:10 host cyrus/master[26136]: about to exec
/usr/local/cyrus/sbin/pop3d Oct 16 14:13:10 host cyrus/pop3s[26136]:
executed
Oct 16 14:13:10 host cyrus/pop3s[26136]: accepted connection
Oct 16 14:13:10 host cyrus/pop3s[26136]: pop3s failed:
[XX.XXX.XX.XXX] Oct 16 14:13:10 host cyrus/pop3s[26136]: Fatal error:
tls_start_servertls() failed Oct 16 14:13:10 host cyrus/master[15923]:
process 26136 exited, status 75 Oct 16 14:13:10 host cyrus/master[15923]:
service pop3s pid 26136 in BUSY state: terminated abnormally

Any idea what's causing that?


I have many of those as well. I suppose that could be any number of things. 
Faulty protocol or dropped connections.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgp3H24eUgNSV.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Gary Mills
On Fri, Nov 16, 2007 at 03:20:57PM +0100, Sebastian Hagedorn wrote:
 --On 16. November 2007 08:00:07 -0600 Gary Mills [EMAIL PROTECTED] 
 wrote:
 
 This timeout doesn't work in some cases.  We have lots of POP sessions
 that never terminate.
 
 That's interesting to hear! Especially since you are using Solaris.
 
  About 30 out of 40 are in that state now.
 Here's an example:
 
cyrus 13075   708  0   Oct 14 ?0:05 pop3d -s
cyrus 20023   708  0   Oct 29 ?0:00 pop3d
cyrus 24560   708  1 07:38:03 ?0:03 pop3d
cyrus   631   708  0   Oct 03 ?0:10 pop3d -s
cyrus  6786   708  0   Oct 20 ?0:00 pop3d -s
cyrus 29777   708  0 07:45:03 ?0:00 pop3d
cyrus 19175   708  0   Oct 04 ?0:04 pop3d -s
 
 One I just checked is stuck in a read():
 
   # truss -p 19175
   read(0, 0x002316F0, 5)  (sleeping...)
   ^?# pfiles 19175
   19175:  pop3d -s
 Current rlimit: 256 file descriptors
  0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0
 O_RDWR
   sockname: AF_INET 130.179.16.23  port: 995
   peername: AF_INET 130.179.188.184  port: 51771
 
 Could you get a stack trace? If you have gdb you just call it with gdb -p 
 19175. Then you can do bt at the prompt. I forget how to do it with 
 Sun's debugger.

Easy:

  # pstack 19175
  19175:  pop3d -s
   fef9f810 read (0, 2316f0, 5)
   fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
   ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
   ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
   ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
   ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
   ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
   ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000, ffbfe7a0) + 
d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122) + 904
   ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0) + 828
   ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
   00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00) + 198
   0002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
   0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
   00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
   00029298 _start   (0, 0, 0, 0, 0, 0) + 108

I've confirmed that the client has gone away a long time ago.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Ken Murchison
Sebastian Hagedorn wrote:

 I think I will try one more approach: I reverted cyrus.conf to not use 
 -U 1 anymore, so that processes should be reused. I will strace one of 
 the pop3d processes in the hope that it gets stuck. That way I should be 
 able to see where things go wrong. If the process terminates normally I 
 will try with another one.

Please let me know if you get a trace from a hung process.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Gabor Gombas
On Fri, Nov 16, 2007 at 05:20:00PM +0100, Sebastian Hagedorn wrote:

 That's a 2.6 kernel, right?

Yes, 2.6.18-2-amd64.

 Hm, we don't suffer any actual slowdown, it's just that the number of 
 processes increases over time.

It's not a slowdown - the client connects, and hangs. It never even gets
to the authentication phase (at least it's not logged). Clients that
happen to connect to a non-affected process run normally. Also, IMAP
connections do not seem to be affected, at least I did not hear any
complaints about that.

 That's what I'm seeing. Could you get a stack trace?

I intend to but I do not have the time currently. I'm not involved in
the daily management of that machine and the operators are happy to just
restart Cyrus when the hangs begin, and so far I never was around just
when that happened.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Ken Murchison
Sebastian Hagedorn wrote:
 --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] 
 wrote:
 
 Could you get a stack trace? If you have gdb you just call it with gdb
 -p  19175. Then you can do bt at the prompt. I forget how to do it
 with  Sun's debugger.

 Easy:

   # pstack 19175
   19175:  pop3d -s
fef9f810 read (0, 2316f0, 5)
fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000,
 ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122)
 + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0)
 + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00)
 + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
00029298 _start   (0, 0, 0, 0, 0, 0) + 108
 
 Thanks, that looks like progress! That stack trace looks similar enough 
 to the one I'm seeing that I could imagine that it is what I *should* be 
 seeing if the stack weren't garbled. Of course that's only speculation.
 
 Ken, is it possible that the call to SSL_accept() in 
 tls_start_servertls() blocks when the client goes away? That could 
 explain everything 

Yes.  Gary's problem might be very similar to yours, depending on what I 
see from the patch that I just sent you.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 18:07:51 +0100 Gabor Gombas [EMAIL PROTECTED] 
wrote:



Hm, we don't suffer any actual slowdown, it's just that the number of
processes increases over time.


It's not a slowdown - the client connects, and hangs. It never even gets
to the authentication phase (at least it's not logged). Clients that
happen to connect to a non-affected process run normally.


Well, that just sounds like you're running out of entropy. That's a 
different issue. Recompile your cyrus-sasl to use /dev/urandom instead of 
/dev/random or disable apop in /etc/imapd.conf:


allowapop: 0

Either of those things should get rid of that.
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgp4g0isqF9Ha.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Gary Mills
On Fri, Nov 16, 2007 at 01:54:24PM +0100, Alain Spineux wrote:
 On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:
  --On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn
  [EMAIL PROTECTED] wrote:
 
  1. In the absence of the SO_KEEPALIVE option it is entirely possible that a
  TCP connection remains ESTABLISHED even when the other side has gone.
 
 I said that socket should timeout, but this is true only when the
 protocol (TCP here)
 require a response (usualy AK here) or at connection establishement.
 On the contrary
 it should stay open indefinitely util something happens. Router doing
 NAT can drop
 a too old connection, because it has to maintains a NAT table and make some
 cleanup time to time, this where KEEPALIVE become usefull.
 
  This may not be a solution to this particular problem, but it made me
  wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?
 
 Cyrus has already a built-in time out, it seems a lite conflicting to actively
 maintains the connection until it drop it itself !
 This is the works of the client to actively maintains the connection,
 if it want it !

This timeout doesn't work in some cases.  We have lots of POP sessions
that never terminate.  About 30 out of 40 are in that state now.
Here's an example:

   cyrus 13075   708  0   Oct 14 ?0:05 pop3d -s
   cyrus 20023   708  0   Oct 29 ?0:00 pop3d
   cyrus 24560   708  1 07:38:03 ?0:03 pop3d
   cyrus   631   708  0   Oct 03 ?0:10 pop3d -s
   cyrus  6786   708  0   Oct 20 ?0:00 pop3d -s
   cyrus 29777   708  0 07:45:03 ?0:00 pop3d
   cyrus 19175   708  0   Oct 04 ?0:04 pop3d -s

One I just checked is stuck in a read():

  # truss -p 19175
  read(0, 0x002316F0, 5)  (sleeping...)
  ^?# pfiles 19175
  19175:  pop3d -s
Current rlimit: 256 file descriptors
 0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0
O_RDWR
  sockname: AF_INET 130.179.16.23  port: 995
  peername: AF_INET 130.179.188.184  port: 51771

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn 
[EMAIL PROTECTED] wrote:



1) Since it only happens on dialup connections, could it be that the
dialin router at the providers end sends TCP/RST when a client hangs up
and those packets are filtered somewhere, maybe on your firewall?


OK, let's run with that one.

a) We don't really have a firewall, we only use ACLs on the Cisco
routers. You can't even filter TCP/RST there.

b) Even *if* a TCP/RST had been dropped, lost or whatever, the server
*still* should timeout eventually!


I just had a discussion with a colleague regarding this. He made two 
observations:


1. In the absence of the SO_KEEPALIVE option it is entirely possible that a 
TCP connection remains ESTABLISHED even when the other side has gone.


This may not be a solution to this particular problem, but it made me 
wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?


2. The stack trace looks garbled:

(gdb) bt
#0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
#1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
#3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
#4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
#5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
#6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
#7  0x08052cb3 in shut_down ()
#8  0x0804e513 in shut_down ()
#9  0x0804d58c in ?? ()
#10 0x0001 in ?? ()
#11 0x082ee848 in ?? ()
#12 0x in ?? ()

He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's 
version of OpenSSL messes up the stack. That would also explain why nobody 
else seems to have this problem.


I think I will try one more approach: I reverted cyrus.conf to not use -U 
1 anymore, so that processes should be reused. I will strace one of the 
pop3d processes in the hope that it gets stuck. That way I should be able 
to see where things go wrong. If the process terminates normally I will try 
with another one. If that doesn't go anywhere, I guess I'll drop this 
investigation. We will upgrade to RHEL 5 some time next year, so hopefully 
that will bring new bugs :-)

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpbrIUha0peZ.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Ken Murchison
Sebastian Hagedorn wrote:

 The only reason I could imagine for the sequence of calls was signal 
 handling. But let's be methodical. There's only one spot where 
 SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only 
 called in cmd_starttls(). That in turn is called either in cmdloop (for 
 handling of STLS) or in service_main() for connections to port 995.

Actually, now that I think about it, I believe SSL_accept() can be 
called from SSL_read() at any time if a renegotiation is required. 
Since shut_down() calls prot_fill(), which in turn can call SSL_read(), 
its possible that we can get an SSL_accept() call.  Before I start 
hacking code, can you apply the following patch (sorry about the line 
breaks) and see if I'm heading in the right direction?  Let me know if 
you get any of the WARNING messages in your logs.


--- prot.c.~1.93.~  2007-11-16 11:21:56.0 -0500
+++ prot.c  2007-11-16 11:23:32.0 -0500
@@ -468,6 +468,7 @@
/* just do a SSL read instead if we're under a tls layer */
if (s-tls_conn != NULL) {
n = SSL_read(s-tls_conn, (char *) s-buf, PROT_BUFSIZE);
+   if (n = 0) syslog(LOG_WARNING, SSL_read() returned %d, n);
} else {
n = read(s-fd, s-buf, PROT_BUFSIZE);
}
-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 13:54:24 +0100 Alain Spineux [EMAIL PROTECTED] 
wrote:



On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED]
wrote:

I just had a discussion with a colleague regarding this. He made two
observations:

1. In the absence of the SO_KEEPALIVE option it is entirely possible
that a TCP connection remains ESTABLISHED even when the other side has
gone.


I said that socket should timeout, but this is true only when the
protocol (TCP here)
require a response (usualy AK here) or at connection establishement.


Right.


On the contrary
it should stay open indefinitely util something happens. Router doing
NAT can drop
a too old connection, because it has to maintains a NAT table and make
some cleanup time to time, this where KEEPALIVE become usefull.


Not only there, but I think also in the case of unilaterally dropped 
connections.



This may not be a solution to this particular problem, but it made me
wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?


Cyrus has already a built-in time out, it seems a lite conflicting to
actively maintains the connection until it drop it itself !


I'm not sure I understand that sentence.


This is the works of the client to actively maintains the connection,
if it want it !


Yes, but what if the client is gone? I realise that *normally* the server 
keeps a built-in timeout, but I'm guessing that sometimes it doesn't work, 
perhaps because something (in prot_fill() perhaps?) blocks.



I think I will try one more approach: I reverted cyrus.conf to not use
-U 1 anymore, so that processes should be reused. I will strace one of
the pop3d processes in the hope that it gets stuck. That way I should be
able to see where things go wrong. If the process terminates normally I
will try with another one. If that doesn't go anywhere, I guess I'll
drop this


You could try to replace imapd by a home made script, something like .

mv imapd imapd_
echo exec strace -o /tmp/imapd.$$ imapd_ $*  imapd
chmod imapd a+x


Thanks for the suggestion. I'll think about it, although I'm wary of doing 
that on a production server.



investigation. We will upgrade to RHEL 5 some time next year, so
hopefully that will bring new bugs :-)


Sorry but I dont understand what you are complaining about!


I'm not complaining ...


Is-it because the imap or pop client is loosing its connection and
this disturb the user


No.


or just because you are getting some sleeping processes ?


If it were some I wouldn't worry. I'm talking hundreds of processes! I 
know I can kill them, in fact for the pop3d processes we run this command 
once a month:


ps -C pop3d -o pid,start|grep [a-z]|awk '{print $1}'|xargs kill

(It kills pop3d processes that have the month in their start time, i.e. are 
more than a day old)


But for imapd processes it's not as easy to tell if they are just 
long-living or stuck.



Do you have a timeout option in your imapd.conf to force the
imap/pop server to autologout ?


No. But both POP and IMAP have default timeouts. They just don't work in my 
case.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpUj3SrktoJw.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 08:00:07 -0600 Gary Mills [EMAIL PROTECTED] 
wrote:



This timeout doesn't work in some cases.  We have lots of POP sessions
that never terminate.


That's interesting to hear! Especially since you are using Solaris.


 About 30 out of 40 are in that state now.
Here's an example:

   cyrus 13075   708  0   Oct 14 ?0:05 pop3d -s
   cyrus 20023   708  0   Oct 29 ?0:00 pop3d
   cyrus 24560   708  1 07:38:03 ?0:03 pop3d
   cyrus   631   708  0   Oct 03 ?0:10 pop3d -s
   cyrus  6786   708  0   Oct 20 ?0:00 pop3d -s
   cyrus 29777   708  0 07:45:03 ?0:00 pop3d
   cyrus 19175   708  0   Oct 04 ?0:04 pop3d -s

One I just checked is stuck in a read():

  # truss -p 19175
  read(0, 0x002316F0, 5)  (sleeping...)
  ^?# pfiles 19175
  19175:  pop3d -s
Current rlimit: 256 file descriptors
 0: S_IFSOCK mode:0666 dev:271,0 ino:25813 uid:0 gid:0 size:0
O_RDWR
  sockname: AF_INET 130.179.16.23  port: 995
  peername: AF_INET 130.179.188.184  port: 51771


Could you get a stack trace? If you have gdb you just call it with gdb -p 
19175. Then you can do bt at the prompt. I forget how to do it with 
Sun's debugger.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpvKBTMY4YQA.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED] 
wrote:



Could you get a stack trace? If you have gdb you just call it with gdb
-p  19175. Then you can do bt at the prompt. I forget how to do it
with  Sun's debugger.


Easy:

  # pstack 19175
  19175:  pop3d -s
   fef9f810 read (0, 2316f0, 5)
   fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
   ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
   ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
   ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
   ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
   ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
   ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000,
ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122)
+ 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0)
+ 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
   00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00)
+ 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
   0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
   00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
   00029298 _start   (0, 0, 0, 0, 0, 0) + 108


Thanks, that looks like progress! That stack trace looks similar enough to 
the one I'm seeing that I could imagine that it is what I *should* be 
seeing if the stack weren't garbled. Of course that's only speculation.


Ken, is it possible that the call to SSL_accept() in tls_start_servertls() 
blocks when the client goes away? That could explain everything 

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpldwycIAjiI.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Gabor Gombas
On Fri, Nov 16, 2007 at 12:36:49PM +0100, Sebastian Hagedorn wrote:

 He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's 
 version of OpenSSL messes up the stack. That would also explain why nobody 
 else seems to have this problem.

FYI I also know a system that has problems with hung Cyrus processes.
AFAIR they have problems with pop3s only, but that may be because there
are more POP3 than IMAP users, I don't know. The system in question runs
2.3.8 on Debian Etch currently.

I intend to help diagnose that system but had no time so far; they're
now running a script that does a POP3 connection every couple of minutes
and if that takes too long it restarts Cyrus.

There is nothing interesting in the logs:

Oct 15 02:39:31 host cyrus/master[6102]: about to exec 
/usr/local/cyrus/sbin/pop3d
Oct 15 02:39:31 host cyrus/pop3s[6102]: executed
Oct 15 02:39:31 host cyrus/pop3s[6102]: accepted connection

... and that's about it, nothing else is logged about the stuck process.
As can be seen the process gets stuck just after it has been created, so
-U 1 can not help.

OTOH there are a lot of messages like the following:

Oct 16 14:13:10 host cyrus/master[26136]: about to exec 
/usr/local/cyrus/sbin/pop3d
Oct 16 14:13:10 host cyrus/pop3s[26136]: executed
Oct 16 14:13:10 host cyrus/pop3s[26136]: accepted connection
Oct 16 14:13:10 host cyrus/pop3s[26136]: pop3s failed: 
[XX.XXX.XX.XXX]
Oct 16 14:13:10 host cyrus/pop3s[26136]: Fatal error: tls_start_servertls() 
failed
Oct 16 14:13:10 host cyrus/master[15923]: process 26136 exited, status 75
Oct 16 14:13:10 host cyrus/master[15923]: service pop3s pid 26136 in BUSY 
state: terminated abnormally

Any idea what's causing that?

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Gabor Gombas
On Fri, Nov 16, 2007 at 06:11:01PM +0100, Sebastian Hagedorn wrote:

 Well, that just sounds like you're running out of entropy. That's a 
 different issue. Recompile your cyrus-sasl to use /dev/urandom instead of 
 /dev/random or disable apop in /etc/imapd.conf:

Debian uses /dev/urandom for a long time:

# strings /usr/lib/libsasl2.so.2 | grep random
/dev/urandom

And according to the logs I have, after a pop3 process got stuck other
IMAP users can still log in using TLS+PLAIN, so entropy can be ruled
out.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Ken Murchison
Sebastian Hagedorn wrote:
 --On 16. November 2007 12:39:28 -0500 Ken Murchison 
 [EMAIL PROTECTED] wrote:
 
 Sorry, my patch wasn't complete.  It wasn't logging the value that I
 wanted.
 
 OK:
 
 Nov 16 18:48:17 lvr13 pop3s[1385]: SSL_read() returned 0:5
 Nov 16 18:48:33 lvr13 pop3s[1375]: SSL_read() returned 0:5
 Nov 16 18:48:50 lvr13 pop3s[1980]: SSL_read() returned 0:6
 Nov 16 18:48:54 lvr13 pop3s[1376]: SSL_read() returned 0:5
 Nov 16 18:49:03 lvr13 pop3s[1375]: SSL_read() returned 0:5
 Nov 16 18:49:11 lvr13 pop3s[1375]: SSL_read() returned 0:5
 Nov 16 18:49:38 lvr13 pop3s[1375]: SSL_read() returned 0:5
 Nov 16 18:49:54 lvr13 pop3s[1404]: SSL_read() returned 0:5
 
 I'm guessing that's still not enough:
 
 #define SSL_ERROR_SYSCALL   5 /* look at error stack/return 
 value/errno */
 #define SSL_ERROR_ZERO_RETURN   6
 
   SSL_ERROR_SYSCALL
   Some I/O error occurred.  The OpenSSL error queue may contain 
 more
   information on the error.  If the error queue is empty (i.e.
   ERR_get_error() returns 0), ret can be used to find out more 
 about
   the error: If ret == 0, an EOF was observed that violates the 
 pro-
   tocol.  If ret == -1, the underlying BIO reported an I/O error 
 (for
   socket I/O on Unix systems, consult errno for details).
 
 So should I add a call to ERR_get_error()?


Not yet.  I'm assuming that none of these processes has hung.  We're 
getting an I/O error most likely because the client has closed the 
connection immediately after sending QUIT.  This is harmless.

What I really want to see is if we get a SSL_ERROR_WANT_xxx return code 
when we're hung.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Alain Spineux
On Nov 16, 2007 6:11 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:
 --On 16. November 2007 18:07:51 +0100 Gabor Gombas [EMAIL PROTECTED]
 wrote:

  Hm, we don't suffer any actual slowdown, it's just that the number of
  processes increases over time.
 
  It's not a slowdown - the client connects, and hangs. It never even gets
  to the authentication phase (at least it's not logged). Clients that
  happen to connect to a non-affected process run normally.

 Well, that just sounds like you're running out of entropy. That's a
 different issue. Recompile your cyrus-sasl to use /dev/urandom instead of
 /dev/random or disable apop in /etc/imapd.conf:

 allowapop: 0

 Either of those things should get rid of that.

The quick but the bad way to do this is for testing is

# ls -l /dev/*random
crw-rw-rw- 1 root root 1, 8 Nov 16 06:18 /dev/random
cr--r--r-- 1 root root 1, 9 Nov  7 22:47 /dev/urandom

# mv /dev/random /dev/random.orig
# ln -sf /dev/urandom /dev/random

And then, because letting /dev/random that way too long is insecure :

# rm -f /dev/random
# mv /dev/random.orig /dev/random

This avoid to recompile the source just for testing.




 --
  .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
 .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
 
 Cyrus Home Page: http://cyrusimap.web.cmu.edu/
 Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
 List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html




-- 
Alain Spineux
aspineux gmail com
May the sources be with you

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Gary Mills
On Fri, Nov 16, 2007 at 05:13:13PM +0100, Sebastian Hagedorn wrote:
 --On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] 
 wrote:
 
 Did you ever see non SSL connections get stuck?
 
 No.

Most of mine are `pop3d -s', but I have seen a few without the `-s'.
When I did a stack trace on one, it also turned out to be for an SSL
session.  So, I have to agree.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn

OK, now I got this:

Nov 16 18:37:06 lvr13 pop3s[23089]: SSL_read() returned -1

But that process terminated normally.
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpRJSjlsSCf8.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Alain Spineux
On Nov 16, 2007 12:36 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:
 --On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn
 [EMAIL PROTECTED] wrote:

  1) Since it only happens on dialup connections, could it be that the
  dialin router at the providers end sends TCP/RST when a client hangs up
  and those packets are filtered somewhere, maybe on your firewall?
 
  OK, let's run with that one.
 
  a) We don't really have a firewall, we only use ACLs on the Cisco
  routers. You can't even filter TCP/RST there.
 
  b) Even *if* a TCP/RST had been dropped, lost or whatever, the server
  *still* should timeout eventually!

 I just had a discussion with a colleague regarding this. He made two
 observations:

 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a
 TCP connection remains ESTABLISHED even when the other side has gone.

I said that socket should timeout, but this is true only when the
protocol (TCP here)
require a response (usualy AK here) or at connection establishement.
On the contrary
it should stay open indefinitely util something happens. Router doing
NAT can drop
a too old connection, because it has to maintains a NAT table and make some
cleanup time to time, this where KEEPALIVE become usefull.


 This may not be a solution to this particular problem, but it made me
 wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?

Cyrus has already a built-in time out, it seems a lite conflicting to actively
maintains the connection until it drop it itself !
This is the works of the client to actively maintains the connection,
if it want it !


 2. The stack trace looks garbled:

 (gdb) bt
 #0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
 #1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
 #2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
 #3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
 #4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
 #5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
 #6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
 #7  0x08052cb3 in shut_down ()
 #8  0x0804e513 in shut_down ()
 #9  0x0804d58c in ?? ()
 #10 0x0001 in ?? ()
 #11 0x082ee848 in ?? ()
 #12 0x in ?? ()

 He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's
 version of OpenSSL messes up the stack. That would also explain why nobody
 else seems to have this problem.

 I think I will try one more approach: I reverted cyrus.conf to not use -U
 1 anymore, so that processes should be reused. I will strace one of the
 pop3d processes in the hope that it gets stuck. That way I should be able
 to see where things go wrong. If the process terminates normally I will try
 with another one. If that doesn't go anywhere, I guess I'll drop this

You could try to replace imapd by a home made script, something like .

mv imapd imapd_
echo exec strace -o /tmp/imapd.$$ imapd_ $*  imapd
chmod imapd a+x


 investigation. We will upgrade to RHEL 5 some time next year, so hopefully
 that will bring new bugs :-)

Sorry but I dont understand what you are complaining about!
Is-it because the imap or pop client is loosing its connection and
this disturb the user
or just because you are getting some sleeping processes ? Or both :-)

Do you have a timeout option in your imapd.conf to force the
imap/pop server to autologout ?


Regards.

Alain

 --
  .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
 .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
 
 Cyrus Home Page: http://cyrusimap.web.cmu.edu/
 Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
 List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html




-- 
Alain Spineux
aspineux gmail com
May the sources be with you

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] 
wrote:



Did you ever see non SSL connections get stuck?


No.
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpAIZv7hfTCt.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 11:27:52 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



Sebastian Hagedorn wrote:


The only reason I could imagine for the sequence of calls was signal
handling. But let's be methodical. There's only one spot where
SSL_accept() is called: in tls_start_servertls(). In pop3d.c that's only
called in cmd_starttls(). That in turn is called either in cmdloop (for
handling of STLS) or in service_main() for connections to port 995.


Actually, now that I think about it, I believe SSL_accept() can be called
from SSL_read() at any time if a renegotiation is required. Since
shut_down() calls prot_fill(), which in turn can call SSL_read(), its
possible that we can get an SSL_accept() call.  Before I start hacking
code, can you apply the following patch (sorry about the line breaks) and
see if I'm heading in the right direction?  Let me know if you get any of
the WARNING messages in your logs.


--- prot.c.~1.93.~  2007-11-16 11:21:56.0 -0500
+++ prot.c  2007-11-16 11:23:32.0 -0500
@@ -468,6 +468,7 @@
/* just do a SSL read instead if we're under a tls layer */
if (s-tls_conn != NULL) {
n = SSL_read(s-tls_conn, (char *) s-buf, PROT_BUFSIZE);
+   if (n = 0) syslog(LOG_WARNING, SSL_read() returned %d, n);
} else {
n = read(s-fd, s-buf, PROT_BUFSIZE);
}


Yes, I do:

Nov 16 17:59:34 lvr13 pop3s[3196]: SSL_read() returned 0
Nov 16 17:59:38 lvr13 pop3s[3196]: SSL_read() returned 0
Nov 16 18:00:09 lvr13 pop3s[3215]: SSL_read() returned 0
Nov 16 18:00:26 lvr13 pop3s[3847]: SSL_read() returned 0
Nov 16 18:00:34 lvr13 pop3s[3215]: SSL_read() returned 0
Nov 16 18:00:34 lvr13 pop3s[3199]: SSL_read() returned 0
Nov 16 18:00:39 lvr13 pop3s[3199]: SSL_read() returned 0
Nov 16 18:00:43 lvr13 pop3s[3229]: SSL_read() returned 0

Not all of these processes are stuck, though. (Maybe none are). Should I be 
looking for something specific?

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpzM3I7B80P9.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Ken Murchison
Sebastian Hagedorn wrote:

 Nov 16 18:00:26 lvr13 pop3s[3847]: SSL_read() returned 0
 Nov 16 18:00:34 lvr13 pop3s[3215]: SSL_read() returned 0
 Nov 16 18:00:34 lvr13 pop3s[3199]: SSL_read() returned 0
 Nov 16 18:00:39 lvr13 pop3s[3199]: SSL_read() returned 0
 Nov 16 18:00:43 lvr13 pop3s[3229]: SSL_read() returned 0
 
 Not all of these processes are stuck, though. (Maybe none are). Should I 
 be looking for something specific?

Sorry, my patch wasn't complete.  It wasn't logging the value that I 
wanted.  Try this:

--- prot.c.~1.93.~  2007-11-16 11:21:56.0 -0500
+++ prot.c  2007-11-16 12:37:55.0 -0500
@@ -468,6 +468,10 @@
/* just do a SSL read instead if we're under a tls layer */
if (s-tls_conn != NULL) {
n = SSL_read(s-tls_conn, (char *) s-buf, PROT_BUFSIZE);
+   if (n = 0) {
+   syslog(LOG_WARNING, SSL_read() returned %d:%d,
+  n, SSL_get_error(s-tls_conn, n));
+   }
} else {
n = read(s-fd, s-buf, PROT_BUFSIZE);
}


-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 12:39:28 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



Sorry, my patch wasn't complete.  It wasn't logging the value that I
wanted.


OK:

Nov 16 18:48:17 lvr13 pop3s[1385]: SSL_read() returned 0:5
Nov 16 18:48:33 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:48:50 lvr13 pop3s[1980]: SSL_read() returned 0:6
Nov 16 18:48:54 lvr13 pop3s[1376]: SSL_read() returned 0:5
Nov 16 18:49:03 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:11 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:38 lvr13 pop3s[1375]: SSL_read() returned 0:5
Nov 16 18:49:54 lvr13 pop3s[1404]: SSL_read() returned 0:5

I'm guessing that's still not enough:

#define SSL_ERROR_SYSCALL   5 /* look at error stack/return 
value/errno */

#define SSL_ERROR_ZERO_RETURN   6

  SSL_ERROR_SYSCALL
  Some I/O error occurred.  The OpenSSL error queue may contain 
more

  information on the error.  If the error queue is empty (i.e.
  ERR_get_error() returns 0), ret can be used to find out more 
about
  the error: If ret == 0, an EOF was observed that violates the 
pro-
  tocol.  If ret == -1, the underlying BIO reported an I/O error 
(for

  socket I/O on Unix systems, consult errno for details).

So should I add a call to ERR_get_error()?
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpFMfHMrSvNV.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Alain Spineux
Hi

Can I resume the problem in :

The server is blocked in a read, waiting for the client next command.
(this is normal,
99% of the process are in this state). But the autologout procedure is
not working!

Then this means the SIGALRM that should awake the process never come or is not
handled properly! I simple call to sleep() or signal() could disturb this.
If this append only when using SSL, maybe the problem is here and the
ALRM should
bne reloaded somewhere.

This is useless now, but files in $cyrus_imap/proc/* contains the user
and the selected mailbox
of all these processes this could be useful to know if this what not
always the same user at the
origin of the problem, because he was using an old outlook or something.

Regards

On Nov 16, 2007 5:33 PM, Ken Murchison [EMAIL PROTECTED] wrote:

 Sebastian Hagedorn wrote:
  --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED]
  wrote:
 
  Could you get a stack trace? If you have gdb you just call it with gdb
  -p  19175. Then you can do bt at the prompt. I forget how to do it
  with  Sun's debugger.
 
  Easy:
 
# pstack 19175
19175:  pop3d -s
 fef9f810 read (0, 2316f0, 5)
 fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
 ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
 ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
 ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
 ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
 ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
 ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000,
  ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122)
  + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0)
  + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
 00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00)
  + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
 0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
 00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
 00029298 _start   (0, 0, 0, 0, 0, 0) + 108
 
  Thanks, that looks like progress! That stack trace looks similar enough
  to the one I'm seeing that I could imagine that it is what I *should* be
  seeing if the stack weren't garbled. Of course that's only speculation.
 
  Ken, is it possible that the call to SSL_accept() in
  tls_start_servertls() blocks when the client goes away? That could
  explain everything 

 Yes.  Gary's problem might be very similar to yours, depending on what I
 see from the patch that I just sent you.

 --
 Kenneth Murchison
 Systems Programmer
 Project Cyrus Developer/Maintainer
 Carnegie Mellon University
 

 Cyrus Home Page: http://cyrusimap.web.cmu.edu/
 Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
 List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html




-- 
Alain Spineux
aspineux gmail com
May the sources be with you

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
--On 16. November 2007 18:21:21 +0100 Gabor Gombas [EMAIL PROTECTED] 
wrote:



On Fri, Nov 16, 2007 at 06:11:01PM +0100, Sebastian Hagedorn wrote:


Well, that just sounds like you're running out of entropy. That's a
different issue. Recompile your cyrus-sasl to use /dev/urandom instead
of  /dev/random or disable apop in /etc/imapd.conf:


Debian uses /dev/urandom for a long time:

# strings /usr/lib/libsasl2.so.2 | grep random
/dev/urandom

And according to the logs I have, after a pop3 process got stuck other
IMAP users can still log in using TLS+PLAIN, so entropy can be ruled
out.


OK. Still the symptom seems to be different from what I'm seeing. Could it 
be that you have a process limit in /etc/cyrus.conf?

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpvuf1O5cVWm.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Sebastian Hagedorn
-- Ken Murchison [EMAIL PROTECTED] is rumored to have mumbled on 16. 
November 2007 12:58:49 -0500 regarding Re: One more attempt: stuck 
processes:



So should I add a call to ERR_get_error()?



Not yet.  I'm assuming that none of these processes has hung.  We're
getting an I/O error most likely because the client has closed the
connection immediately after sending QUIT.  This is harmless.

What I really want to see is if we get a SSL_ERROR_WANT_xxx return code
when we're hung.


I have both good and bad news. Bad news first: there is a stuck process 
that did *not* log that SSL_read line.


Good news: the binary I'm running now isn't stripped and has much more 
detail in its stack trace:


(gdb) bt
#0  0x003d341e in __read_nocancel () from /lib/tls/libc.so.6
#1  0x0017f2f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2  0x0017d2b2 in BIO_read () from /lib/libcrypto.so.4
#3  0x0089ec30 in ssl3_alert_code () from /lib/libssl.so.4
#4  0x0089edcc in ssl3_alert_code () from /lib/libssl.so.4
#5  0x008a00cf in ssl3_read_bytes () from /lib/libssl.so.4
#6  0x008a0ffc in ssl3_get_message () from /lib/libssl.so.4
#7  0x00896cab in ssl3_accept () from /lib/libssl.so.4
#8  0x00896944 in ssl3_accept () from /lib/libssl.so.4
#9  0x008a5c9a in SSL_accept () from /lib/libssl.so.4
#10 0x008a180d in ssl23_get_client_hello () from /lib/libssl.so.4
#11 0x008a1712 in ssl23_accept () from /lib/libssl.so.4
#12 0x008a5c9a in SSL_accept () from /lib/libssl.so.4
#13 0x08052cf3 in tls_start_servertls (readfd=-512, writefd=-512, 
layerbits=0xbfff7a78, authid=0xbfff7a74,

   ret=0x810bca0) at tls.c:803
#14 0x0804e553 in cmd_starttls (pop3s=1) at pop3d.c:1076
#15 0x0804d5cc in service_main (argc=2, argv=0x9e84008, envp=0xbfff9850) at 
pop3d.c:537

#16 0x08054550 in main (argc=2, argv=0x9, envp=0xbfff9850) at service.c:539

There's much less POP activity now, so I may have to wait until Monday for 
more results.

--
Sebastian Hagedorn - Postmaster - RZKR-R1 (Flachbau), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587

pgpvINiK8adT6.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-16 Thread Michael M. Rach
I know it has been asked before and may be redundant, but...  You 
answered that cyrus-sasl is using /dev/urandom and should not run out of 
entropy.  However, what about openssl itself?  It also uses random 
numbers.  Perhaps, as a test renaming /dev/random and ln -s /dev/urandom 
/dev/random.

Gary Mills wrote:
 On Fri, Nov 16, 2007 at 05:13:13PM +0100, Sebastian Hagedorn wrote:
   
 --On 16. November 2007 14:23:17 +0100 Simon Matter [EMAIL PROTECTED] 
 wrote:

 
 Did you ever see non SSL connections get stuck?
   
 No.
 

 Most of mine are `pop3d -s', but I have seen a few without the `-s'.
 When I did a stack trace on one, it also turned out to be for an SSL
 session.  So, I have to agree.

   

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Michael Bacon
--On Friday, November 16, 2007 3:54 PM -0500 Ken Murchison 
[EMAIL PROTECTED] wrote:

 I've reproduced the former by telneting to port 995 and doing nothing.
 I have been unable to reproduce the latter because as soon as I QUIT the
 telnet session or kill() the telnet process, pop3d exits gracefully.

I agree with others that you want to do something other than kill the 
telnet session, like unplugging the cable.  I've seen similar behavior out 
of cyrus that I thought could easily be people on laptops shutting their 
computers down hard, in some way that the TCP/IP stack never got a chance 
to clean up the connection properly.  With a QUIT or a kill, you're giving 
the OS a chance to do the right thing.

-Michael

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Gary Mills
On Fri, Nov 16, 2007 at 03:54:50PM -0500, Ken Murchison wrote:
 
 That's exactly what Gary is seeing.  Its blocking in SSL_accept(). 
 Apparently the client connects to port 995, and then either sends 
 nothing, or goes away and leaves the socket open.
 
 I've reproduced the former by telneting to port 995 and doing nothing. 
 I have been unable to reproduce the latter because as soon as I QUIT the 
 telnet session or kill() the telnet process, pop3d exits gracefully.

You probably have to reboot the client at that point, or just
disconnect the cable and take it home.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-16 Thread Alain Spineux
On Nov 16, 2007 6:24 PM, Alain Spineux [EMAIL PROTECTED] wrote:
 Hi

 Can I resume the problem in :

I'm wrong


 The server is blocked in a read, waiting for the client next command.
 (this is normal,
 99% of the process are in this state).

No it is waiting in select, and the select has a timeout !

 But the autologout procedure is
 not working!

 Then this means the SIGALRM that should awake the process never come or is not
 handled properly! I simple call to sleep() or signal() could disturb this.
 If this append only when using SSL, maybe the problem is here and the
 ALRM should
 bne reloaded somewhere.

Wrong wrong ! It could be right, but here the time out looks to be
done using the select !


 This is useless now, but files in $cyrus_imap/proc/* contains the user
 and the selected mailbox
 of all these processes this could be useful to know if this what not
 always the same user at the
 origin of the problem, because he was using an old outlook or something.

 Regards


 On Nov 16, 2007 5:33 PM, Ken Murchison [EMAIL PROTECTED] wrote:
 
  Sebastian Hagedorn wrote:
   --On 16. November 2007 09:37:42 -0600 Gary Mills [EMAIL PROTECTED]
   wrote:
  
   Could you get a stack trace? If you have gdb you just call it with gdb
   -p  19175. Then you can do bt at the prompt. I forget how to do it
   with  Sun's debugger.
  
   Easy:
  
 # pstack 19175
 19175:  pop3d -s
  fef9f810 read (0, 2316f0, 5)
  fee1d2d0 read (0, 2316f0, 5, 0, 0, 0) + 5c
  ff06bb38 sock_read (1f0860, 2316f0, 5, 5, 0, 0) + 24
  ff068af0 BIO_read (1f0860, 2316f0, 5, fef98b84, 0, 0) + 110
  ff278488 ssl3_read_n (212798, 5, 8805, 0, 0, 203958) + 174
  ff2785fc ssl3_get_record (204ce0, 8000, 8400, 4400, f1, f0) + d0
  ff279424 ssl3_read_bytes (212798, 1000, 2000, 4, 0, ffbfe731) + 228
  ff27a99c ssl3_get_message (ff2a259c, 2070a0, 0, , 19000,
   ffbfe7a0) + d0 ff27042c ssl3_accept (2150, 2160, 2180, 21e0, 2110, 2122)
   + 904ff27bd2c ssl23_get_client_hello (2316fb, 6c, 6c, 4, fe79, 0)
   + 828ff27b4b4 ssl23_accept (4000, 2000, 0, 0, 0, 0) + 2a4
  00032d00 tls_start_servertls (0, 1, ffbfee24, ffbfee20, 1849a8, ff00)
   + 1980002c504 cmd_starttls (1, 1fd8b8, 0, 0, 0, 0) + 184
  0002a638 service_main (2, 192198, ffbffce0, 1aec4, 3508c, 1) + 488
  00035250 main (2, ffbffcd4, ffbffce0, 17c400, 0, 0) + e18
  00029298 _start   (0, 0, 0, 0, 0, 0) + 108
  
   Thanks, that looks like progress! That stack trace looks similar enough
   to the one I'm seeing that I could imagine that it is what I *should* be
   seeing if the stack weren't garbled. Of course that's only speculation.
  
   Ken, is it possible that the call to SSL_accept() in
   tls_start_servertls() blocks when the client goes away? That could
   explain everything 
 
  Yes.  Gary's problem might be very similar to yours, depending on what I
  see from the patch that I just sent you.
 
  --
  Kenneth Murchison
  Systems Programmer
  Project Cyrus Developer/Maintainer
  Carnegie Mellon University
  
 
  Cyrus Home Page: http://cyrusimap.web.cmu.edu/
  Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
  List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
 




 --
 Alain Spineux
 aspineux gmail com
 May the sources be with you




-- 
Alain Spineux
aspineux gmail com
May the sources be with you

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-15 Thread Ken Murchison
Sebastian Hagedorn wrote:

 Thanks. I will try this patch as soon as I can, but it's clearly not the 
 only issue, because the same thing happens with POP processes. Here's an 
 example for one:
 
 (gdb) bt
 #0  0x0096441e in __read_nocancel () from /lib/tls/libc.so.6
 #1  0x00ac02f7 in BIO_new_socket () from /lib/libcrypto.so.4
 #2  0x00abe2b2 in BIO_read () from /lib/libcrypto.so.4
 #3  0x00fcfe13 in ssl23_read_bytes () from /lib/libssl.so.4
 #4  0x00fcec51 in ssl23_get_client_hello () from /lib/libssl.so.4
 #5  0x00fce712 in ssl23_accept () from /lib/libssl.so.4
 #6  0x00fd2c9a in SSL_accept () from /lib/libssl.so.4
 #7  0x08052cb3 in shut_down ()
 #8  0x0804e513 in shut_down ()
 #9  0x0804d58c in ?? ()
 #10 0x0001 in ?? ()
 #11 0x08dd6848 in ?? ()
 #12 0x in ?? ()

OK.  What version of OpenSSL?

Are they imaps/pop3s processes, or are your clients using STARTTLS?

Are you sure that you're not running out of entropy?

Have you tried adding the '-U 1' option to the offending services in 
cyrus.conf?


-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-15 Thread Sebastian Hagedorn
--On 15. November 2007 06:55:44 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



OK.  What version of OpenSSL?


cyradm says:

Built w/OpenSSL 0.9.7a Feb 19 2003
Running w/OpenSSL 0.9.7a Feb 19 2003

rpm says:

openssl-0.9.7a-33.23

This is RHEL 3.


Are they imaps/pop3s processes, or are your clients using STARTTLS?


Both. It doesn't seem to matter which.


Are you sure that you're not running out of entropy?


Yes, because cyrus-sasl uses /dev/urandom.


Have you tried adding the '-U 1' option to the offending services in
cyrus.conf?


No. Since this potentially affects all IMAP and POP processes I would have 
to do it for all entries. Do you recommend that I try that?

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpUWKZDaLfag.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Sebastian Hagedorn
--On 14. November 2007 16:39:44 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



It looks to me like we are timing out the client while the client is
IDLEing, but we get a signal from idled in the middle of shutdown(). Try
this patch.

--- imapd.c.~1.535.~2007-11-14 16:16:21.0 -0500
+++ imapd.c 2007-11-14 16:22:59.0 -0500
@@ -836,6 +836,8 @@
  {
  int i;

+idle_done(imapd_mailbox);
+
  proc_cleanup();

  i = 0;


Thanks. I will try this patch as soon as I can, but it's clearly not the 
only issue, because the same thing happens with POP processes. Here's an 
example for one:


(gdb) bt
#0  0x0096441e in __read_nocancel () from /lib/tls/libc.so.6
#1  0x00ac02f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2  0x00abe2b2 in BIO_read () from /lib/libcrypto.so.4
#3  0x00fcfe13 in ssl23_read_bytes () from /lib/libssl.so.4
#4  0x00fcec51 in ssl23_get_client_hello () from /lib/libssl.so.4
#5  0x00fce712 in ssl23_accept () from /lib/libssl.so.4
#6  0x00fd2c9a in SSL_accept () from /lib/libssl.so.4
#7  0x08052cb3 in shut_down ()
#8  0x0804e513 in shut_down ()
#9  0x0804d58c in ?? ()
#10 0x0001 in ?? ()
#11 0x08dd6848 in ?? ()
#12 0x in ?? ()

As you see it looks very much like the IMAP stack trace, but obviously the 
IDLE routine is not a factor. OTOH, all the stuck IMAP processes I've 
looked at so far *did* have those idle_xxx lines in their traces. I am 
correct in assuming that that would only occur for clients that actually 
use IDLE, right?

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgp6gqxRgy74J.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Ken Murchison
Sebastian Hagedorn wrote:

 No. Since this potentially affects all IMAP and POP processes I would 
 have to do it for all entries. Do you recommend that I try that?

Since it looks like things are hanging when a process is being used, I'd 
like to see if the problem goes away if we don't reuse the processes. 
I'm just trying to do a bsearch() on the problem.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-15 Thread Sebastian Hagedorn
--On 15. November 2007 08:21:48 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



No. Since this potentially affects all IMAP and POP processes I would
have to do it for all entries. Do you recommend that I try that?


Since it looks like things are hanging when a process is being used, I'd
like to see if the problem goes away if we don't reuse the processes. I'm
just trying to do a bsearch() on the problem.


OK. I've made the change and HUP'ed master. I can see the first processes 
with -U 1. So I guess I'll have to wait a while to see if any of those get 
stuck as well.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpgSvEePTrl0.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Sebastian Hagedorn
--On 15. November 2007 08:32:18 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



Since it looks like things are hanging when a process is being used, I'd
like to see if the problem goes away if we don't reuse the processes.
I'm just trying to do a bsearch() on the problem.


OK. I've made the change and HUP'ed master. I can see the first
processes with -U 1. So I guess I'll have to wait a while to see if any
of those get stuck as well.


OK, let me know.


Didn't work :-(

# ps -aef| grep U
UIDPID  PPID  C STIME TTY  TIME CMD
...
cyrus25038 32385  0 15:44 ?00:00:00 pop3d -s -U 1
...

That's more than an hour ago. So:

# strace -p 25038
Process 25038 attached - interrupt to quit
read(0,  unfinished ...

# gdb -p 25038
GNU gdb Red Hat Linux (6.3.0.0-1.138.el3rh)
...
(gdb) bt
#0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
#1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
#3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
#4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
#5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
#6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
#7  0x08052cb3 in shut_down ()
#8  0x0804e513 in shut_down ()
#9  0x0804d58c in ?? ()
#10 0x0001 in ?? ()
#11 0x082ee848 in ?? ()
#12 0x in ?? ()

Any other ideas?
--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpoJF9EbdufP.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Ken Murchison
Sebastian Hagedorn wrote:
 --On 15. November 2007 08:21:48 -0500 Ken Murchison 
 [EMAIL PROTECTED] wrote:
 
 No. Since this potentially affects all IMAP and POP processes I would
 have to do it for all entries. Do you recommend that I try that?

 Since it looks like things are hanging when a process is being used, I'd
 like to see if the problem goes away if we don't reuse the processes. I'm
 just trying to do a bsearch() on the problem.
 
 OK. I've made the change and HUP'ed master. I can see the first 
 processes with -U 1. So I guess I'll have to wait a while to see if any 
 of those get stuck as well.

OK, let me know.

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-15 Thread Ken Murchison
Sebastian Hagedorn wrote:
 --On 15. November 2007 08:32:18 -0500 Ken Murchison 
 [EMAIL PROTECTED] wrote:
 
 Since it looks like things are hanging when a process is being used, 
 I'd
 like to see if the problem goes away if we don't reuse the processes.
 I'm just trying to do a bsearch() on the problem.

 OK. I've made the change and HUP'ed master. I can see the first
 processes with -U 1. So I guess I'll have to wait a while to see if any
 of those get stuck as well.

 OK, let me know.
 
 Didn't work :-(
 
 # ps -aef| grep U
 UIDPID  PPID  C STIME TTY  TIME CMD
 ...
 cyrus25038 32385  0 15:44 ?00:00:00 pop3d -s -U 1
 ...
 
 That's more than an hour ago. So:
 
 # strace -p 25038
 Process 25038 attached - interrupt to quit
 read(0,  unfinished ...
 
 # gdb -p 25038
 GNU gdb Red Hat Linux (6.3.0.0-1.138.el3rh)
 ...
 (gdb) bt
 #0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
 #1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
 #2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
 #3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
 #4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
 #5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
 #6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
 #7  0x08052cb3 in shut_down ()
 #8  0x0804e513 in shut_down ()
 #9  0x0804d58c in ?? ()
 #10 0x0001 in ?? ()
 #11 0x082ee848 in ?? ()
 #12 0x in ?? ()
 
 Any other ideas?


Not at the moment.  I don't understand how SSL_accept() gets called from 
shut_down().  Are you running a Murder?


-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-15 Thread Sebastian Hagedorn
--On 15. November 2007 11:00:39 -0500 Ken Murchison [EMAIL PROTECTED] 
wrote:



(gdb) bt
# 0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
# 1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
# 2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
# 3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
# 4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
# 5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
# 6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
# 7  0x08052cb3 in shut_down ()
# 8  0x0804e513 in shut_down ()
# 9  0x0804d58c in ?? ()
# 10 0x0001 in ?? ()
# 11 0x082ee848 in ?? ()
# 12 0x in ?? ()

Any other ideas?


Not at the moment.  I don't understand how SSL_accept() gets called from
shut_down().


That's what I'd been wondering about myself.


 Are you running a Murder?


No, it's a pretty plain setup.

The only reason I could imagine for the sequence of calls was signal 
handling. But let's be methodical. There's only one spot where SSL_accept() 
is called: in tls_start_servertls(). In pop3d.c that's only called in 
cmd_starttls(). That in turn is called either in cmdloop (for handling of 
STLS) or in service_main() for connections to port 995.


shut_down() could conceivably be called by signals_poll(). The POP process 
above has the following lines in syslog at debug level:


Nov 15 15:44:18 lvr13 master[25038]: about to exec 
/usr/lib/cyrus-imapd/pop3d

Nov 15 15:44:18 lvr13 pop3s[25038]: executed
Nov 15 15:44:20 lvr13 pop3s[25038]: accepted connection

That's it. So I'm guessing that the connection (which was to some dial-up 
IP address) got dropped while the process was in service_main(). I don't 
understand why that doesn't show up in the trace, and why last/first four 
entries read '??'.


But even that doesn't really make sense, because shut_down() isn't actually 
a signal handler, it's only called at certain points by signal_polls().

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpjsGUbLBc4U.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Alain Spineux
On Nov 15, 2007 4:54 PM, Sebastian Hagedorn [EMAIL PROTECTED] wrote:
 --On 15. November 2007 08:32:18 -0500 Ken Murchison [EMAIL PROTECTED]
 wrote:

  Since it looks like things are hanging when a process is being used, I'd
  like to see if the problem goes away if we don't reuse the processes.
  I'm just trying to do a bsearch() on the problem.
 
  OK. I've made the change and HUP'ed master. I can see the first
  processes with -U 1. So I guess I'll have to wait a while to see if any
  of those get stuck as well.
 
  OK, let me know.

 Didn't work :-(

 # ps -aef| grep U
 UIDPID  PPID  C STIME TTY  TIME CMD
 ...
 cyrus25038 32385  0 15:44 ?00:00:00 pop3d -s -U 1
 ...

 That's more than an hour ago. So:

 # strace -p 25038
 Process 25038 attached - interrupt to quit
 read(0,  unfinished ...

Do you know what is 0, if it was a socket it should timeout, isn't it ?

# ls -l /proc/25038/fd

should answer the question


 # gdb -p 25038
 GNU gdb Red Hat Linux (6.3.0.0-1.138.el3rh)
 ...
 (gdb) bt
 #0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
 #1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
 #2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
 #3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
 #4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
 #5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
 #6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
 #7  0x08052cb3 in shut_down ()
 #8  0x0804e513 in shut_down ()
 #9  0x0804d58c in ?? ()
 #10 0x0001 in ?? ()
 #11 0x082ee848 in ?? ()
 #12 0x in ?? ()

 Any other ideas?

 --
  .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
 Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
 .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
.:.:.:.Skype: shagedorn.:.:.:.
 
 Cyrus Home Page: http://cyrusimap.web.cmu.edu/
 Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
 List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html




-- 
Alain Spineux
aspineux gmail com
May the sources be with you

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Sebastian Hagedorn
--On 15. November 2007 18:14:05 +0100 Alain Spineux [EMAIL PROTECTED] 
wrote:



# strace -p 25038
Process 25038 attached - interrupt to quit
read(0,  unfinished ...


Do you know what is 0, if it was a socket it should timeout, isn't it ?


It should, I guess, but it doesn't.


# ls -l /proc/25038/fd

should answer the question


I did not know that. Usually I just use lsof. Interestingly, there's this:

#  ls -l /proc/25038/fd
insgesamt 0
lrwx--1 cyrusmail   64 15. Nov 18:15 0 - 
socket:[4230164633]

...

It's blinking red, which normally means a broken link. I'm not sure how 
reliable that is in this case. Anyway, lsof reports:


pop3d   25038 cyrus0u  IPv4  -64802663 TCP 
cyrus.rrz.uni-koeln.de:pop3s-p50865F5D.dip.t-dialin.net:1064 (ESTABLISHED)


It *thinks* the connections is still open. So does netstat:

# LANG=C netstat -a|grep p50865F5D
tcp0  0 cyrus.rrz.uni-koeln.d:pop3s p50865F5D.dip.t-dialin:1064 
ESTABLISHED


But obviously that connection is dead. I don't know what conclusions to 
draw from that ...

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpyEp2xvUdqN.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-15 Thread Simon Matter
 --On 15. November 2007 18:14:05 +0100 Alain Spineux [EMAIL PROTECTED]
 wrote:

 # strace -p 25038
 Process 25038 attached - interrupt to quit
 read(0,  unfinished ...

 Do you know what is 0, if it was a socket it should timeout, isn't it ?

 It should, I guess, but it doesn't.

# ls -l /proc/25038/fd

 should answer the question

 I did not know that. Usually I just use lsof. Interestingly, there's this:

 #  ls -l /proc/25038/fd
 insgesamt 0
 lrwx--1 cyrusmail   64 15. Nov 18:15 0 -
 socket:[4230164633]
 ...

 It's blinking red, which normally means a broken link. I'm not sure how

The file 0 is a symbolic symlink which doesn't really point to a file,
that's why the shell shows it blinking. Everything okay here.

 reliable that is in this case. Anyway, lsof reports:

 pop3d   25038 cyrus0u  IPv4  -64802663 TCP
 cyrus.rrz.uni-koeln.de:pop3s-p50865F5D.dip.t-dialin.net:1064
 (ESTABLISHED)

 It *thinks* the connections is still open. So does netstat:

 # LANG=C netstat -a|grep p50865F5D
 tcp0  0 cyrus.rrz.uni-koeln.d:pop3s
 p50865F5D.dip.t-dialin:1064
 ESTABLISHED

 But obviously that connection is dead. I don't know what conclusions to
 draw from that ...

Just two ideas come to mind:

1) Since it only happens on dialup connections, could it be that the
dialin router at the providers end sends TCP/RST when a client hangs up
and those packets are filtered somewhere, maybe on your firewall?

2) Could it be that SO_LINGER should be used as socket option in
service_create() in master/master.c.

If it's complete nonsense, ignore it.

Simon


Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


One more attempt: stuck processes

2007-11-14 Thread Sebastian Hagedorn

Hi,

I've brought up this topic before. We've been running cyrus-imapd very 
happily for several years. Yet there's one issue that none of the updates 
have resolved. The last time I reported it we were running 2.2.12. Now 
we're running 2.3.8, but the issues is the same: POP and IMAP processes 
that use TLS/SSL get stuck. My observations point towards dropped dial-up 
connections as the cause. Here's an example:


[EMAIL PROTECTED] root]# strace -p 4998
Process 4998 attached - interrupt to quit
read(0,  unfinished ...
Process 4998 detached

In gdb:

0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6
(gdb) bt
#0  0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6
#1  0x00c422f7 in BIO_new_socket () from /lib/libcrypto.so.4
#2  0x00c402b2 in BIO_read () from /lib/libcrypto.so.4
#3  0x00d26c30 in ssl3_alert_code () from /lib/libssl.so.4
#4  0x00d26dcc in ssl3_alert_code () from /lib/libssl.so.4
#5  0x00d280cf in ssl3_read_bytes () from /lib/libssl.so.4
#6  0x00d28ffc in ssl3_get_message () from /lib/libssl.so.4
#7  0x00d1ecab in ssl3_accept () from /lib/libssl.so.4
#8  0x00d1e944 in ssl3_accept () from /lib/libssl.so.4
#9  0x00d2dc9a in SSL_accept () from /lib/libssl.so.4
#10 0x00d2980d in ssl23_get_client_hello () from /lib/libssl.so.4
#11 0x00d29712 in ssl23_accept () from /lib/libssl.so.4
#12 0x00d2dc9a in SSL_accept () from /lib/libssl.so.4
#13 0x080a0a03 in idle_notify ()
#14 0x0805ef21 in idle_update ()
#15 0x08051fc6 in shut_down ()
#16 0x0804ebe4 in ?? ()
#17 0x08d53458 in ?? ()
#18 0x08d61a98 in ?? ()
#19 0x in ?? ()
(gdb)

Ultimately all processes get stuck in libc, so this might be a library 
issue. My problem is that I have no way to confirm that. I've compared that 
stack trace to the source and I'm confused. idle_notify() is in idle.c, and 
it's very simple:


/*
* Notify idled of a mailbox change
*/
void idle_notify(struct mailbox *mailbox)
{
   /* We should try to determine if we need to send this
* (ie, is an imapd is IDLE on 'mailbox'?).
*/
   idle_send_msg(IDLE_NOTIFY, mailbox);
}

So why doesn't the stack trace show a call to idle_send_msg()? Shouldn't 
the SSL routines be called from a procedure that actually does I/O?


Anyway, if there is anybody who has seen something like this I would like 
to know about it. Also if anyone has a clue how to further debug this, 
please let me know!


Thanks, Sebastian Hagedorn
--
Sebastian Hagedorn - Postmaster - RZKR-R1 (Gebäude 52), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587
Skype: shagedorn

pgpzRU0VT9Nay.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-14 Thread Gary Mills
On Wed, Nov 14, 2007 at 04:15:13PM +0100, Sebastian Hagedorn wrote:
 
 I've brought up this topic before. We've been running cyrus-imapd very 
 happily for several years. Yet there's one issue that none of the updates 
 have resolved. The last time I reported it we were running 2.2.12. Now 
 we're running 2.3.8, but the issues is the same: POP and IMAP processes 
 that use TLS/SSL get stuck. My observations point towards dropped dial-up 
 connections as the cause. Here's an example:

Have you tried setting `tls_session_timeout' to zero in imapd.conf to
disable caching of TLS sessions?  I believe that that solved the problem
for me, although it happened rarely.  I only use `berkeley-nosync' for
`duplicate_db' and `tlscache_db', but I suspected that the problem was
the result of a database deadlock.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: One more attempt: stuck processes

2007-11-14 Thread Sebastian Hagedorn
--On 14. November 2007 09:30:45 -0600 Gary Mills [EMAIL PROTECTED] 
wrote:



On Wed, Nov 14, 2007 at 04:15:13PM +0100, Sebastian Hagedorn wrote:


I've brought up this topic before. We've been running cyrus-imapd very
happily for several years. Yet there's one issue that none of the
updates  have resolved. The last time I reported it we were running
2.2.12. Now  we're running 2.3.8, but the issues is the same: POP and
IMAP processes  that use TLS/SSL get stuck. My observations point
towards dropped dial-up  connections as the cause. Here's an example:


Have you tried setting `tls_session_timeout' to zero in imapd.conf to
disable caching of TLS sessions?


No. While it's possible that there's some connection, I don't see it from 
the stack trace. I think I'd like some indication from one of the 
developers that caching *could* be the cause before I try that.



 I believe that that solved the problem
for me, although it happened rarely.  I only use `berkeley-nosync' for
`duplicate_db' and `tlscache_db', but I suspected that the problem was
the result of a database deadlock.


We use skiplist, which to my mind makes it even less likely that that's the 
cause ... thanks for the suggestion anyway.

--
.:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
.:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
  .:.:.:.Skype: shagedorn.:.:.:.

pgpbKouWlbjeE.pgp
Description: PGP signature

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: One more attempt: stuck processes

2007-11-14 Thread Ken Murchison
Sebastian Hagedorn wrote:
 Hi,
 
 I've brought up this topic before. We've been running cyrus-imapd very 
 happily for several years. Yet there's one issue that none of the 
 updates have resolved. The last time I reported it we were running 
 2.2.12. Now we're running 2.3.8, but the issues is the same: POP and 
 IMAP processes that use TLS/SSL get stuck. My observations point 
 towards dropped dial-up connections as the cause. Here's an example:
 
 [EMAIL PROTECTED] root]# strace -p 4998
 Process 4998 attached - interrupt to quit
 read(0,  unfinished ...
 Process 4998 detached
 
 In gdb:
 
 0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6
 (gdb) bt
 #0  0x003ed41e in __read_nocancel () from /lib/tls/libc.so.6
 #1  0x00c422f7 in BIO_new_socket () from /lib/libcrypto.so.4
 #2  0x00c402b2 in BIO_read () from /lib/libcrypto.so.4
 #3  0x00d26c30 in ssl3_alert_code () from /lib/libssl.so.4
 #4  0x00d26dcc in ssl3_alert_code () from /lib/libssl.so.4
 #5  0x00d280cf in ssl3_read_bytes () from /lib/libssl.so.4
 #6  0x00d28ffc in ssl3_get_message () from /lib/libssl.so.4
 #7  0x00d1ecab in ssl3_accept () from /lib/libssl.so.4
 #8  0x00d1e944 in ssl3_accept () from /lib/libssl.so.4
 #9  0x00d2dc9a in SSL_accept () from /lib/libssl.so.4
 #10 0x00d2980d in ssl23_get_client_hello () from /lib/libssl.so.4
 #11 0x00d29712 in ssl23_accept () from /lib/libssl.so.4
 #12 0x00d2dc9a in SSL_accept () from /lib/libssl.so.4
 #13 0x080a0a03 in idle_notify ()
 #14 0x0805ef21 in idle_update ()
 #15 0x08051fc6 in shut_down ()
 #16 0x0804ebe4 in ?? ()
 #17 0x08d53458 in ?? ()
 #18 0x08d61a98 in ?? ()
 #19 0x in ?? ()
 (gdb)

It looks to me like we are timing out the client while the client is 
IDLEing, but we get a signal from idled in the middle of shutdown(). 
Try this patch.

--- imapd.c.~1.535.~2007-11-14 16:16:21.0 -0500
+++ imapd.c 2007-11-14 16:22:59.0 -0500
@@ -836,6 +836,8 @@
  {
  int i;

+idle_done(imapd_mailbox);
+
  proc_cleanup();

  i = 0;


-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html