Re: [PATCH] Updated master.c process counting patch

2002-05-18 Thread Henrique de Moraes Holschuh

On Thu, 16 May 2002, Jeremy Howard wrote:
 I *strongly* recommend also including shutdown.diff. This is important 
 in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the 

I had a talk with some kernel people, and they confirmed that.
shutdown(socket, SHUT_RD) should _always_ be done under linux if you really
don't need to read from the socket anymore. For AF_INET* sockets, anyway.

shutdown(socket, SHUT_RDWR) will reduce the CLOSE_WAIT state even more,
however it effectively trashes the connection; I don't like this idea very
much. It is far more amiable to the client if you let it read the last stuff
you sent it (such as error messages!) at its own leisure.

I am, thus, somewhat wary of adding SHUT_RDWR unconditionally. Maybe we
could add a runtime-option that very busy sites can set if they need even
faster socket recycling?
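
To make the trade-off above concrete, here is a minimal C sketch of the two shutdown modes behind a switch. It is not taken from shutdown.diff: the helper name finish_connection() and the fast_recycle flag are hypothetical stand-ins for the runtime option being proposed.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from shutdown.diff).
 *
 * fast_recycle == 0: shutdown(fd, SHUT_RD) only, so data already queued
 *                    for the client (e.g. a final error message) can
 *                    still be delivered.
 * fast_recycle != 0: shutdown(fd, SHUT_RDWR), closing both directions.
 *
 * The descriptor is always closed. Returns 0 on success, -1 if either
 * shutdown() or close() failed.
 */
int finish_connection(int fd, int fast_recycle)
{
    int how = fast_recycle ? SHUT_RDWR : SHUT_RD;
    int rc = shutdown(fd, how);

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A site that wants the more aggressive behaviour would pass a non-zero fast_recycle, presumably read from a configuration option; everyone else keeps the gentler SHUT_RD path.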

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-18 Thread Jeremy Howard

Henrique de Moraes Holschuh wrote:

I am, thus, somewhat wary of adding SHUT_RDWR unconditionally. Maybe we
could add a runtime-option that very busy sites can set if they need even
faster socket recycling?
  

That sounds like a good idea.




Re: [PATCH] Updated master.c process counting patch

2002-05-17 Thread Jeremy Howard

Henrique de Moraes Holschuh wrote:

On Thu, 16 May 2002, Jeremy Howard wrote:
  

I *strongly* recommend also including shutdown.diff. This is important 
in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the 
'  !imapd_in->tls_conn' bit everywhere for general distribution--this 
is a workaround for a memory corruption problem that is unrelated to 
this patch.



I see. What about all those commented lines in your original
master-avail.diff patch?  Were they past shutdown() experiments?
  

That's correct, Henrique. The commented lines should not be included in 
a widely distributed version. I left them in, but commented out, in 
order to remind myself what experiments we had done. Only the lines that 
are *not* commented out should be included.

I believe (?) that this issue is less important on Solaris, because I 
think that it handles close() differently to Linux. However on Linux it 
is vital to flush receive buffers and call shutdown() to avoid hanging 
connections.
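
As an illustration of "flush receive buffers and call shutdown()", here is a minimal C sketch. It is an assumption about the general technique, not the code in shutdown.diff; the function name drain_and_close() and the choice to switch the socket to non-blocking mode are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from shutdown.diff): discard whatever the
 * client has already sent, stop further reads, then close the socket.
 * Returns the number of bytes discarded, or -1 if close() failed.
 */
long drain_and_close(int fd)
{
    char buf[4096];
    long discarded = 0;
    ssize_t n;
    int flags = fcntl(fd, F_GETFL, 0);

    /* Non-blocking mode so the drain loop cannot hang on an idle peer. */
    if (flags != -1)
        (void)fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    for (;;) {
        n = read(fd, buf, sizeof buf);
        if (n > 0) {
            discarded += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        break;                  /* EOF, EAGAIN, or a real error */
    }

    (void)shutdown(fd, SHUT_RD);
    if (close(fd) < 0)
        return -1;
    return discarded;
}
```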




Re: [PATCH] Updated master.c process counting patch

2002-05-17 Thread Henrique de Moraes Holschuh

On Fri, 17 May 2002, Jeremy Howard wrote:
 I believe (?) that this issue is less important on Solaris, because I 
 think that it handles close() differently to Linux. However on Linux it 
 is vital to flush receive buffers and call shutdown() to avoid hanging 
 connections.

I am still studying the shutdown issue. BTW, why didn't you shutdown() all
streams at once and then sleep, instead of doing shutdown() and sleep one
stream at a time?

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-17 Thread Jeremy Howard

Henrique de Moraes Holschuh wrote:

On Fri, 17 May 2002, Jeremy Howard wrote:
  

I believe (?) that this issue is less important on Solaris, because I 
think that it handles close() differently to Linux. However on Linux it 
is vital to flush receive buffers and call shutdown() to avoid hanging 
connections.



I am still studying the shutdown issue. BTW, why didn't you shutdown() all
streams at once and then sleep, instead of doing shutdown() and sleep one
stream at a time?
  

Because it was late at night and I wasn't thinking straight... ;-)




Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Henrique de Moraes Holschuh

On Thu, 16 May 2002, Jeremy Howard wrote:
 I *strongly* recommend also including shutdown.diff. This is important 
 in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the 
 '  !imapd_in->tls_conn' bit everywhere for general distribution--this 
 is a workaround for a memory corruption problem that is unrelated to 
 this patch.

I see. What about all those commented lines in your original
master-avail.diff patch?  Were they past shutdown() experiments?

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Ken Murchison



Jeremy Howard wrote:
 
 Henrique de Moraes Holschuh wrote:
 
 I don't know what Ken and Lawrence think of these patches, but I just
 finished porting the child pid tracking of master-avail.diff to 2.1.4CVS,
 and will post that to this list soon.  I will also include it in Debian,
 which will give some field-testing to the patch.
 
 
 I *strongly* recommend also including shutdown.diff. This is important
 in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the
 '  !imapd_in->tls_conn' bit everywhere for general distribution--this
 is a workaround for a memory corruption problem that is unrelated to
 this patch.

I'm running a config almost the same as you and have never seen this
problem.  AFAIK, the CMU guys have never seen this either.  Do you have
a core that you can run a backtrace on, or can you force a core by
setting MALLOC_CHECK_=2 before starting master (see malloc(3) for
details)?

What's your DB config look like?  Are you using skiplist for everything
by any chance?


name   : Cyrus IMAPD
version: v2.1.4 2002/05/14 16:51:51
vendor : Project Cyrus
support-url: http://asg.web.cmu.edu/cyrus
os : Linux
os-version : 2.4.18-SGI_XFS_1.1smp
command: imapd
arguments  : 
environment: Cyrus SASL 2.1.3
 Sleepycat Software: Berkeley DB 3.3.11: (July 12, 2001)
 OpenSSL 0.9.6b [engine] 9 Jul 2001
 CMU Sieve 2.2
 TCP Wrappers
 UCD-SNMP 4.2.3
 lock = flock
 auth = unix
 idle = idled
 mboxlist.db = skiplist
 subs.db = flat
 seen.db = skiplist
 duplicate.db = db3-nosync
 tls.db = db3-nosync

-- 
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp



Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Ken Murchison



Jeremy Howard wrote:
 
 Lawrence Greenfield wrote:
 
Date: Wed, 15 May 2002 16:02:42 -0300
From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
 [...]
The point is, if that indeed happens, log or no log, master loses track of
the number of children that can service requests. That would be a bug, and
the patch supposedly fixes this bug.  It really doesn't matter (for
accepting or not the patch) why the child died.
 
 Yes, I understand that.  However, if the master (in real life
 situations) is actually losing track of the number of available
 service processes without one of those service processes crashing
 (either by the sysadmin or otherwise) then there's some other problem
 in the child accounting.
 
 
 The child accounting is fine. The problem in our case was always caused
 by child segfaults, or failure to properly close TCP connections. We
 still see segfaults (about one per fifty thousand connections I'd
 guess),

Can you send us a backtrace from a core?  If you're not getting a core,
please setup your system to dump one.  Here are bits that I use in my
Cyrus startup script on Linux:

cd /var/imap/cores
ulimit -c unlimited
export MALLOC_CHECK_=2
$master 


If you have multiple services/processes the cores will overwrite each
other, so you need to catch it fairly quickly (unless they all have the
same failure).
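
For sites that start master from a small C wrapper rather than a shell script, the same preparation can be sketched as below. This is only an illustration: prepare_debug_env() is a hypothetical name, and the sketch raises the soft core limit to the current hard limit (which is what an unprivileged process is allowed to do) rather than forcing "unlimited".

```c
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper mirroring the startup-script lines above:
 *   cd <core_dir>; ulimit -c ...; export MALLOC_CHECK_=2
 * Returns 0 on success, -1 on the first failing step.
 */
int prepare_debug_env(const char *core_dir)
{
    struct rlimit rl;

    /* Cores are normally written to the current working directory. */
    if (chdir(core_dir) < 0)
        return -1;

    /* Raise the soft core-size limit as far as the hard limit allows. */
    if (getrlimit(RLIMIT_CORE, &rl) < 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_CORE, &rl) < 0)
        return -1;

    /* Inherited by programs exec'd after this point. */
    return setenv("MALLOC_CHECK_", "2", 1);
}
```

The wrapper would call this and then exec the master binary; both the resource limit and the environment variable are inherited across exec.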

Ken
-- 
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp



Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Henrique de Moraes Holschuh

On Thu, 16 May 2002, Ken Murchison wrote:
 If you have multiple services/processes the cores will overwrite each
 other, so you need to catch it fairly quickly (unless they all have the

Unless you tell the kernel to use the pid in the corefile name...
Add this to the script on Linux 2.4.x:

[ -f /proc/sys/kernel/core_uses_pid ] && \
  echo 1 > /proc/sys/kernel/core_uses_pid

I don't know which other kernels can do that, but it should be a reasonably
common feature, given how useful it is...

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Ken Murchison



Henrique de Moraes Holschuh wrote:
 
 On Thu, 16 May 2002, Ken Murchison wrote:
  If you have multiple services/processes the cores will overwrite each
  other, so you need to catch it fairly quickly (unless they all have the
 
 Unless you tell the kernel to use the pid in the corefile name...
 Add this to the script on Linux 2.4.x:
 
 [ -f /proc/sys/kernel/core_uses_pid ] && \
   echo 1 > /proc/sys/kernel/core_uses_pid

Right.  The reason I didn't suggest this is because some large sites
might be worried about cores taking up a lot of disk space, and I didn't
want them screaming at me ;)

Ken
-- 
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp



Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Scott Adkins

For what it's worth, we run into this problem a lot in our environment as
well... I had come to the same conclusion as Jeremy and Henrique did before
they even posted their message about it to the list.

The environment we are running is as follows:

  Cyrus IMAPD 2.0.16
  Compaq Tru64 5.1 on Alpha Platform
  Cyrus SASL 1.5.27
  Sleepycat BerkeleyDB 3.2.9
  OpenSSL 0.9.6
  UCD-SNMP 4.2.1
  lock = flock
  auth = unix
  mboxlist.db = flat
  subs.db = flat
  seen.db = flat
  duplicate.db = db3-nosync
  tls.db = none

The problem existed before we enabled SSL, so it isn't SSL related, and
the problem existed before we added SNMP to the mix.  The only place we
use BerkeleyDB is for the duplicate delivery database and only lmtpd is
linked against the libdb.so file (I modified the Makefile to not link
pthreads or BerkeleyDB into imapd, pop3d, etc, since they are using a
flat file database anyways).

As for when the problem would arise... well, obviously, when we ran into
resource issues, master would lose track of the count of children as they
segfaulted, etc.  This problem was just recently described.  However, we
rarely had resource issues like that.

The other time that we could catch the problem occurring is when the
system is booting and the Cyrus server is first started.  For some reason,
we would lose a couple of the services within a short period of time after
Cyrus was started.  We would have to shut down Cyrus and restart it again.
Rarely, the problem would occur again within the first few minutes of the
restart, and we would have to do it again.  Usually, though, the restart
would work fine.

Finally, we would periodically lose services for no apparent reason,
meaning that the machine had not been recently rebooted, and we were not
suffering from any apparent resource shortages (either at a system level
or at the cyrus user level).

In any case, the problem exists in our Tru64 environment.

Scott

--On Thursday, May 16, 2002 9:25 AM -0400 Ken Murchison [EMAIL PROTECTED] 
wrote:

 I'm running a config almost the same as you and have never seen this
 problem.  AFAIK, the CMU guys have never seen this either.  Do you have
 a core that you can run a backtrace on, or can you force a core by
 setting MALLOC_CHECK_=2 before starting master (see malloc(3) for
 details)?

 What's your DB config look like?  Are you using skiplist for everything
 by any chance?

 name   : Cyrus IMAPD
 version: v2.1.4 2002/05/14 16:51:51
 vendor : Project Cyrus
 support-url: http://asg.web.cmu.edu/cyrus
 os : Linux
 os-version : 2.4.18-SGI_XFS_1.1smp
 command: imapd
 arguments  :
 environment: Cyrus SASL 2.1.3
  Sleepycat Software: Berkeley DB 3.3.11: (July 12, 2001)
  OpenSSL 0.9.6b [engine] 9 Jul 2001
  CMU Sieve 2.2
  TCP Wrappers
  UCD-SNMP 4.2.3
  lock = flock
  auth = unix
  idle = idled
  mboxlist.db = skiplist
  subs.db = flat
  seen.db = skiplist
  duplicate.db = db3-nosync
  tls.db = db3-nosync

 --
 Kenneth Murchison Oceana Matrix Ltd.
 Software Engineer 21 Princeton Place
 716-662-8973 x26  Orchard Park, NY 14127
 --PGP Public Key--http://www.oceana.com/~ken/ksm.pgp


--
 +-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-+
  Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
   UNIX Systems Engineer  mailto:[EMAIL PROTECTED]
ICQ 7626282 Work (740)593-9478 Fax (740)593-1944
 +-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-+
 PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/




Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Jeremy Howard

Ken Murchison wrote:

I'm running a config almost the same as you and have never seen this
problem.  AFAIK, the CMU guys have never seen this either.  Do you have
a core that you can run a backtrace on, or can you force a core by
setting MALLOC_CHECK_=2 before starting master (see malloc(3) for
details)?

Most of the segfaults were due to the problem that imapd_out or imapd_in 
were corrupted. The workaround discussed in our patch has solved most of 
these. I'll try and get a core file for the rare segfaults that we still 
get to see what the unresolved issues are. The imapd_out corruption 
problem can't be solved by studying the core file AFAICT because we 
can't see where the corruption is occurring.

What's your DB config look like?  Are you using skiplist for everything
by any chance?
  

Yes, 2.1.3 with skiplist for everything. We have the tls cache turned 
off, however. We prune the delivery database with -E0 every hour to 
avoid it getting big (otherwise DB recovery takes too long).





Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Ken Murchison



Jeremy Howard wrote:
 
 Ken Murchison wrote:
 
 I'm running a config almost the same as you and have never seen this
 problem.  AFAIK, the CMU guys have never seen this either.  Do you have
 a core that you can run a backtrace on, or can you force a core by
 setting MALLOC_CHECK_=2 before starting master (see malloc(3) for
 details)?
 
 Most of the segfaults were due to the problem that imapd_out or imapd_in
 were corrupted. The workaround discussed in our patch has solved most of
 these. I'll try and get a core file for the rare segfaults that we still
 get to see what the unresolved issues are. The imapd_out corruption
 problem can't be solved by studying the core file AFAICT because we
 can't see where the corruption is occurring.

If you set MALLOC_CHECK_=2, then imapd will abort() whenever it thinks
that there might be a corruption.  By examining this core, it is easier
to track down these problems.  I've done this a few times to track down
the subtle errors that have baffled others.

Ken
-- 
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp



Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Lawrence Greenfield

   Date: Fri, 17 May 2002 09:35:39 +1000
   From: Jeremy Howard [EMAIL PROTECTED]
[...]
   Yes, 2.1.3 with skiplist for everything. We have the tls cache turned 
   off, however. We prune the delivery database with -E0 every hour to 
   avoid it getting big (otherwise DB recovery takes too long).

Decreasing your checkpoint interval (we run at 10 minutes; 5 minutes
is also pretty good) can also help immensely with this.

Larry





Repeatable IMAP crash. Was Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Rob Mueller (fastmail)

Actually, a number of users have been saying that renaming folders seems
to cause problems. I think I've found a completely reproducible scenario
that crashes the IMAP server for me every time.

[root@server2 root]# telnet localhost 143
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
* OK blahserver Cyrus IMAP4 v2.1.3 server ready
. login testuser blahpassword
. OK User logged in
. select inbox
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen)
* OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen \*)]
* 0 EXISTS
* 0 RECENT
* OK [UIDVALIDITY 1021369326]
* OK [UIDNEXT 1]
. OK [READ-WRITE] Completed
. create inbox.t1
. OK Completed
. create inbox.t2
. OK Completed
. select inbox
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen)
* OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen \*)]
* 0 EXISTS
* 0 RECENT
* OK [UIDVALIDITY 1021369326]
* OK [UIDNEXT 1]
. OK [READ-WRITE] Completed
. rename inbox.t1 inbox.t1_
. OK Completed
. rename inbox.t2 inbox.t2_
Connection closed by foreign host.

And in the IMAP log:

May 16 22:22:25 blahserver master[23269]: process 8649 exited, signaled to
death by 11

Does this do the same thing for anyone else?

I'm not sure of all our config details, I'm sure Jeremy can post them if you
need them.

Rob

- Original Message -
From: Ken Murchison [EMAIL PROTECTED]
To: Jeremy Howard [EMAIL PROTECTED]
Cc: Lawrence Greenfield [EMAIL PROTECTED];
[EMAIL PROTECTED]; Henrique de Moraes Holschuh
[EMAIL PROTECTED]
Sent: Thursday, May 16, 2002 11:31 PM
Subject: Re: [PATCH] Updated master.c process counting patch




 Jeremy Howard wrote:
 
  Lawrence Greenfield wrote:
 
 Date: Wed, 15 May 2002 16:02:42 -0300
 From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
  [...]
 The point is, if that indeed happens, log or no log, master loses
track of
 the number of children that can service requests. That would be a
bug, and
 the patch supposedly fixes this bug.  It really doesn't matter (for
 accepting or not the patch) why the child died.
  
  Yes, I understand that.  However, if the master (in real life
  situations) is actually losing track of the number of available
  service processes without one of those service processes crashing
  (either by the sysadmin or otherwise) then there's some other problem
  in the child accounting.
  
  
  The child accounting is fine. The problem in our case was always caused
  by child segfaults, or failure to properly close TCP connections. We
  still see segfaults (about one per fifty thousand connections I'd
  guess),

 Can you send us a backtrace from a core?  If you're not getting a core,
 please setup your system to dump one.  Here are bits that I use in my
 Cyrus startup script on Linux:

 cd /var/imap/cores
 ulimit -c unlimited
 export MALLOC_CHECK_=2
 $master 


 If you have multiple services/processes the cores will overwrite each
 other, so you need to catch it fairly quickly (unless they all have the
 same failure).

 Ken
 --
 Kenneth Murchison Oceana Matrix Ltd.
 Software Engineer 21 Princeton Place
 716-662-8973 x26  Orchard Park, NY 14127
 --PGP Public Key--http://www.oceana.com/~ken/ksm.pgp






Re: [PATCH] Updated master.c process counting patch

2002-05-16 Thread Jeremy Howard

Ken Murchison wrote:

If you set MALLOC_CHECK_=2, then imapd will abort() whenever it thinks
that there might be a corruption.  By examining this core, it is easier
to track down these problems.  I've done this a few times to track down
the subtle errors that have baffled others.
  

Great. We've just updated our init.d file to use this, and to spit cores 
into different files using the pid, so hopefully we'll have some more 
info over the next few days.




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Scott Adkins

I am still waiting to hear from Ken and Lawrence on what they think about
these patches?  Will any or all of them be implemented in the next release?

Scott

--On Wednesday, May 15, 2002 12:41 PM +1000 Jeremy Howard 
[EMAIL PROTECTED] wrote:

 Thanks to Jaska Kivela, some patch formatting problems that caused the
 master.c process counting patch to not apply cleanly have been resolved.
 The patch set has been updated, and now also incorporates the master.c
 race condition patch:

 http://jhoward.fastmail.fm/patches/cyrus/imap-diff.tgz

 The only file changed in the patch set is master-avail.diff .
 master-avail.diff solves the problem that master fails to correctly keep
 count of child processes if processes do not exit cleanly. This manifests
 itself as Cyrus failing to accept new connections on one or more ports
 after a while, when using preforking in imapd.conf.

--
 +-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-+
  Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
   UNIX Systems Engineer  mailto:[EMAIL PROTECTED]
ICQ 7626282 Work (740)593-9478 Fax (740)593-1944
 +-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=+=-=-=-=-=-=-=-=-+
 PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Henrique de Moraes Holschuh

On Wed, 15 May 2002, Scott Adkins wrote:
 I am still waiting to hear from Ken and Lawrence on what they think about
 these patches?  Will any or all of them be implemented in the next release?

I don't know what Ken and Lawrence think of these patches, but I just
finished porting the child pid tracking of master-avail.diff to 2.1.4CVS,
and will post that to this list soon.  I will also include it in Debian,
which will give some field-testing to the patch.

I don't know about all the other patches, though. I have included the
safe_flock patches, and I *may* include the alarm and locking stuff in
master-avail.diff later, but I must study and understand it first.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Henrique de Moraes Holschuh

On Wed, 15 May 2002, Scott Russell wrote:
 On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote:
  I don't know about all the other patches, though. I have included the
  safe_flock patches, and I *may* include the alarm and locking stuff in
  master-avail.diff later, but I must study and understand it first.
 
 Are the patches TLS sane / tested? At one point I remember reading

The ones I am including in Debian are. No TLS is _not_ an option IMHO.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Scott Russell

On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote:
 On Wed, 15 May 2002, Scott Adkins wrote:
  I am still waiting to hear from Ken and Lawrence on what they think about
  these patches?  Will any or all of them be implemented in the next release?
 
 I don't know what Ken and Lawrence think of these patches, but I just
 finished porting the child pid tracking of master-avail.diff to 2.1.4CVS,
 and will post that to this list soon.  I will also include it in Debian,
 which will give some field-testing to the patch.
 
 I don't know about all the other patches, though. I have included the
 safe_flock patches, and I *may* include the alarm and locking stuff in
 master-avail.diff later, but I must study and understand it first.
 

Are the patches TLS sane / tested? At one point I remember reading
that they wouldn't work with TLS connections. That was some time ago
and I might have lost track of the status.

-- 
Regards,
 Scott Russell ([EMAIL PROTECTED])
 Linux Technology Center, System Admin, RHCE.
 T/L 441-9289 / External 919-543-9289
 http://bzimage.raleigh.ibm.com/webcam




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Scott Russell

On Wed, May 15, 2002 at 10:52:39AM -0300, Henrique de Moraes Holschuh wrote:
 On Wed, 15 May 2002, Scott Russell wrote:
  On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote:
   I don't know about all the other patches, though. I have included the
   safe_flock patches, and I *may* include the alarm and locking stuff in
   master-avail.diff later, but I must study and understand it first.
  
  Are the patches TLS sane / tested? At one point I remember reading
 
 The ones I am including in Debian are. No TLS is _not_ an option IMHO.
 

Agreed. I'm thankful to the people who took the time to track this
down and write the patch. I'm also thankful to those who added TLS
support to the patch :)

-- 
Regards,
 Scott Russell ([EMAIL PROTECTED])
 Linux Technology Center, System Admin, RHCE.
 T/L 441-9289 / External 919-543-9289
 http://bzimage.raleigh.ibm.com/webcam




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Lawrence Greenfield

   Date: Wed, 15 May 2002 08:34:29 -0400
   From: Scott Adkins [EMAIL PROTECTED]
[...]
   I am still waiting to hear from Ken and Lawrence on what they think about
   these patches?  Will any or all of them be implemented in the next release?

I'm still wondering what causes these problems.  Some reports say that
service processes aren't crashing; if they're not crashing, how is the
count getting off?

I think in general this stuff is a good idea but I'd like to
understand a little better how this is happening.

We don't have this problem on any of our servers.

Larry




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Henrique de Moraes Holschuh

On Wed, 15 May 2002, Lawrence Greenfield wrote:
Date: Wed, 15 May 2002 08:34:29 -0400
From: Scott Adkins [EMAIL PROTECTED]
 [...]
I am still waiting to hear from Ken and Lawrence on what they think about
these patches?  Will any or all of them be implemented in the next release?
 
 I'm still wondering what causes these problems.  Some reports say that
 service processes aren't crashing; if they're not crashing, how is the
 count getting off?

Good question, isn't it?  I am trying to track a segfault in the auth_unix
callbacks with SASL 2.1.2 [1], but after that I will try to do a once-over
the entire master flow, with and without the child pid tracking patches.

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=145766 

 We don't have this problem on any of our servers.

Do you have preforking enabled?  If you do (and if I did understand the issue
correctly), start kill -9'ing service processes, and it should be possible
to duplicate the bug.  I will try that just now, in fact.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Lawrence Greenfield

   Date: Wed, 15 May 2002 15:37:50 -0300
   From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
[...]
I'm still wondering what causes these problems.  Some reports say that
service processes aren't crashing; if they're not crashing, how is the
count getting off?

   Good question, isn't it?  I am trying to track a segfault in the auth_unix
   callbacks with SASL 2.1.2 [1], but after that I will try to do a once-over
   the entire master flow, with and without the child pid tracking patches.

   [1] http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=145766 

auth_unix is part of the authorization, not part of libsasl.
Regardless, this code happens after the service has told the master
it's unavailable, so a crash here wouldn't cause the master's count to
get off.

We don't have this problem on any of our servers.

   Do you have preforking enabled?  If you do (and if I did understand the issue
   correctly), start kill -9'ing service processes, and it should be possible
   to duplicate the bug.  I will try that just now, in fact.

Sure, if you intentionally kill processes that are waiting for
connections this happens.  I understand this.  But if I did that,
master would log messages that the processes were dying incorrectly
(signaled to death by 9).

Larry




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Henrique de Moraes Holschuh

On Wed, 15 May 2002, Lawrence Greenfield wrote:
Good question, isn't it?  I am trying to track a segfault in the auth_unix
callbacks with SASL 2.1.2 [1], but after that I will try to do a once-over
the entire master flow, with and without the child pid tracking patches.
 
[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?archive=no&bug=145766 
 
 auth_unix is part of the authorization, not part of libsasl.

It registers callbacks with sasl. Sasl calls auth_newstate via the callback
interface, and glibc dies in the middle of a getgrent from auth_newstate.
The problem could be anywhere.

Do you have preforking enabled?  If you do (and if I did understand the issue
correctly), start kill -9'ing service processes, and it should be possible
to duplicate the bug.  I will try that just now, in fact.
 
 Sure, if you intentionally kill processes that are waiting for
 connections this happens.  I understand this.  But if I did that,
 master would log messages that the processes were dying incorrectly
 (signaled to death by 9).

The point is, if that indeed happens, log or no log, master loses track of
the number of children that can service requests. That would be a bug, and
the patch supposedly fixes this bug.  It really doesn't matter (for
accepting or not the patch) why the child died.

I _do_ agree that we have to track down why the children are dying, too. But
that is another separate issue.
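
The accounting problem argued about above can be sketched in a few lines of C. This is not the code in master-avail.diff; it is a minimal illustration of child pid tracking under assumed names (children[], ready_workers, child_add(), child_set_busy(), reap_children() are all hypothetical). The idea is that the parent remembers which pids it counted as available, so that a child that dies without saying goodbye (segfault, kill -9) is still subtracted from the count when it is reaped.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64

enum child_state { SLOT_FREE = 0, CHILD_READY, CHILD_BUSY };

struct child {
    pid_t pid;
    enum child_state state;
};

static struct child children[MAX_CHILDREN];
static int ready_workers;       /* children counted as able to accept() */

/* Record a freshly forked child as ready. Returns -1 if the table is full. */
int child_add(pid_t pid)
{
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (children[i].state == SLOT_FREE) {
            children[i].pid = pid;
            children[i].state = CHILD_READY;
            ready_workers++;
            return 0;
        }
    }
    return -1;
}

/* The child reported that it picked up a connection. */
int child_set_busy(pid_t pid)
{
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (children[i].state == CHILD_READY && children[i].pid == pid) {
            children[i].state = CHILD_BUSY;
            ready_workers--;
            return 0;
        }
    }
    return -1;
}

/*
 * Reap every exited child without blocking. A child that was still
 * counted as ready when it died is removed from ready_workers, however
 * it died. Returns the number of children reaped.
 */
int reap_children(void)
{
    int status, reaped = 0;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        for (int i = 0; i < MAX_CHILDREN; i++) {
            if (children[i].state != SLOT_FREE && children[i].pid == pid) {
                if (children[i].state == CHILD_READY)
                    ready_workers--;
                children[i].state = SLOT_FREE;
                break;
            }
        }
        reaped++;
    }
    return reaped;
}
```

With a table like this, the kill -9 experiment no longer leaves the parent believing that a dead process is still waiting for connections, and the exit status is still available for logging why the child died.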

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Lawrence Greenfield

   Date: Wed, 15 May 2002 16:02:42 -0300
   From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
[...]
   The point is, if that indeed happens, log or no log, master loses track of
   the number of children that can service requests. That would be a bug, and
   the patch supposedly fixes this bug.  It really doesn't matter (for
   accepting or not the patch) why the child died.

Yes, I understand that.  However, if the master (in real life
situations) is actually losing track of the number of available
service processes without one of those service processes crashing
(either by the sysadmin or otherwise) then there's some other problem
in the child accounting.

I'd like to know if there is before doing something that may mask the
problem completely.

Larry




Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Henrique de Moraes Holschuh

On Wed, 15 May 2002, Lawrence Greenfield wrote:
 service processes without one of those service processes crashing
 (either by the sysadmin or otherwise) then there's some other problem
 in the child accounting.

Well, anything that could cause the messages to be lost will cause trouble
for the accounting without the patch. I can't see any other races in there
right now, though.

I did duplicate the problem caused by child deaths, and I did verify that
the patch fixes the problem, so at least it seems to handle that side of
things well.

Children forking is behaving erratically, though. Before the first connect()
to any of the master-controlled services, a random number of children are
created (up to the prefork setting).  After a connect(), all the missing
children (maybe subject to maxforkrate, I didn't test) are created.  This
might be the way the code is supposed to work, though.

 I'd like to know if there is before doing something that may mask the
 problem completely.

Timing the children and logging a warning when one takes too long to tell us
that it is available for work, or dies without telling us it was going to
exit cleanly, would give out the same information while avoiding the worst
of the problems the bug creates (no children left to service incoming
requests).  This would complicate the code in master/ a bit, though.
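
A rough sketch of the timing idea, only to make the suggestion concrete. The
struct, the function name and the use of a FILE pointer instead of syslog are
all assumptions made for illustration; none of this is taken from master.c.

```c
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical per-child record; not the struct used in master.c. */
struct child_timer {
    pid_t  pid;         /* 0 marks an unused entry */
    time_t forked_at;   /* when the master forked this child */
    int    reported;    /* non-zero once the child said "available" */
};

/*
 * Warn about every child that has stayed silent for more than `limit`
 * seconds after being forked.  Returns the number of warnings written
 * to `out`.
 */
int warn_silent_children(const struct child_timer *tab, size_t n,
                         time_t now, time_t limit, FILE *out)
{
    size_t i;
    int warned = 0;

    for (i = 0; i < n; i++) {
        if (tab[i].pid == 0 || tab[i].reported)
            continue;               /* unused entry, or child checked in */
        if (now - tab[i].forked_at > limit) {
            fprintf(out, "child %ld silent for %ld s after fork\n",
                    (long)tab[i].pid, (long)(now - tab[i].forked_at));
            warned++;
        }
    }
    return warned;
}
```

The master would presumably call something like this from its main loop and
send the text to syslog rather than to a FILE.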

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh



Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Michael Bacon



--On Wednesday, May 15, 2002 4:29 PM -0400 Lawrence Greenfield 
[EMAIL PROTECTED] wrote:

Date: Wed, 15 May 2002 16:02:42 -0300
From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
 [...]
The point is, if that indeed happens, log or no log, master loses
 track of the number of children that can service requests. That would
 be a bug, and the patch supposedly fixes this bug.  It really doesn't
 matter (for accepting or not the patch) why the child died.

 Yes, I understand that.  However, if the master (in real life
 situations) is actually losing track of the number of available
 service processes without one of those service processes crashing
 (either by the sysadmin or otherwise) then there's some other problem
 in the child accounting.

 I'd like to know if there is before doing something that may mask the
 problem completely.

 Larry


Would it be sufficient if the patch were altered slightly to send a 
LOG_DEBUG message to syslog every time the master decremented one of the 
worker counters, specifying what it did?  Right now, it seems like the 
indicator of the problem is a service unavailability.  If the key step in the 
process that we're covering is the report of the dead child to the master, 
and you'd be just as happy with a log message as a blatantly obvious 
failure, well heck, let's do it!  I'm happy to send you any bit of logging 
information you want, just so long as my servers stay available!  :)
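
A minimal sketch of what such a logging decrement could look like. The
function name, its parameters and the message wording are hypothetical; only
syslog() and LOG_DEBUG are real.

```c
#include <sys/types.h>
#include <syslog.h>

/*
 * Decrement one worker counter and say so at LOG_DEBUG.
 * `service` and `counter_name` are labels for the log line only.
 * Returns the new counter value; the counter never goes below zero.
 */
int dec_worker_counter(int *counter, const char *service,
                       const char *counter_name, pid_t pid)
{
    if (*counter > 0)
        (*counter)--;
    syslog(LOG_DEBUG, "service %s: %s decremented to %d (pid %ld)",
           service, counter_name, *counter, (long)pid);
    return *counter;
}
```

With that in place, a miscount would at least leave a trail of debug lines
showing which pid caused each decrement.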

Michael Bacon
OIT Systems Administration



Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Michael Bacon



--On Wednesday, May 15, 2002 2:50 PM -0400 Lawrence Greenfield 
[EMAIL PROTECTED] wrote:

Date: Wed, 15 May 2002 15:37:50 -0300
From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
 [...]   

 We don't have this problem on any of our servers.

Do you have preforking enabled?  If you do (and if I did understand the
 issue correctly), start kill -9'ing service processes, and it should
 be possible to duplicate the bug.  I will try that just now, in fact.

 Sure, if you intentionally kill processes that are waiting for
 connections this happens.  I understand this.  But if I did that,
 master would log messages that the processes were dying incorrectly
 (signaled to death by 9).


Just to follow up a bit more -- I'm not sure if this will be terribly 
helpful, but it may be illustrative of the one situation which might cause 
such a problem.  Here's a snippet of log that occurred shortly before we ran 
into a service problem:

May  2 15:35:14 delrey.acpub.duke.edu master[16360]: can't fork process to run checkpoint
May  2 15:35:14 delrey.acpub.duke.edu last message repeated 45 times
May  2 15:35:15 delrey.acpub.duke.edu master[16360]: process 7494 exited, signaled to death by 9

Preceding this are a couple hundred more messages generally complaining 
about inability to fork, lack of resources, inability to load a shared 
library, and all of the other things one would expect to see when swap runs 
out.  However, I don't see any other mention of process 7494.  (This is 
logging at the user6.info level, so we don't get all of the debug messages, 
unfortunately.)  Now, the core problem here is very simple:  This is a 
poor, beleaguered SparcStation 20 with 17k mailboxes on it, and at a peak 
block of time, it just ran out of available memory.  The solution is also 
simple:  we need to upgrade our hardware.  However, since we're currently 
under budget cuts as our Executive VP scrounges up money to go around 
building big buildings, that's not exactly a viable option.

Tracking down the whys and wherefores of a process dying under a resource 
crunch is not likely to be terribly productive -- one can't very well 
expect a service to keep functioning normally under those circumstances. 
However, it would be nice if the master process paid attention to the 
SIGCHLD and the information from the wait() call and took note of the fact 
that the ex-process has shaken off this mortal coil, so that 15 minutes 
later, the process miscount didn't cause the master to start blithely 
ignoring incoming requests.
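
To make that concrete, here is a simplified sketch of reaping children and
fixing the count from the wait status. The table, the counter and the function
names are hypothetical and much cruder than whatever master.c really keeps; in
particular every tracked child is assumed to be idle. The waitpid() call and
the WIFSIGNALED/WTERMSIG macros are standard POSIX.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <syslog.h>

#define MAX_TRACKED 64   /* hypothetical table size */

/* Hypothetical bookkeeping; the real master.c structures differ. */
static pid_t tracked[MAX_TRACKED];   /* 0 marks a free slot */
static int   ready_workers;          /* children believed able to accept() */

/* Remember a freshly forked child.  Returns 0, or -1 if the table is full. */
int track_child(pid_t pid)
{
    int i;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == 0) {
            tracked[i] = pid;
            ready_workers++;
            return 0;
        }
    }
    return -1;
}

/* Forget a child and fix the count.  Returns -1 for an unknown pid. */
int untrack_child(pid_t pid)
{
    int i;

    if (pid <= 0)
        return -1;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == pid) {
            tracked[i] = 0;
            if (ready_workers > 0)
                ready_workers--;
            return 0;
        }
    }
    return -1;
}

/*
 * Reap every exited child without blocking; meant to run from the main
 * loop after a SIGCHLD was noted.  Returns the number of children reaped.
 */
int reap_children(void)
{
    int status, reaped = 0;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status))
            syslog(LOG_ERR, "process %ld exited, signaled to death by %d",
                   (long)pid, WTERMSIG(status));
        untrack_child(pid);   /* count is corrected however the child died */
        reaped++;
    }
    return reaped;
}
```

The point of keying on the pid is that the count gets corrected even when the
child never had a chance to tell the master anything before it died.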

I hope this is helpful,

Michael Bacon
OIT Systems Administration
Duke University





Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Jeremy Howard

Henrique de Moraes Holschuh wrote:

I don't know what Ken and Lawrence think of these patches, but I just
finished porting the child pid tracking of master-avail.diff to 2.1.4CVS,
and will post that to this list soon.  I will also include it in Debian,
which will give some field-testing to the patch.
  

I *strongly* recommend also including shutdown.diff. This is important 
in Linux to avoid sockets hanging around in CLOSE_WAIT state. Remove the 
'  !imapd_in->tls_conn' bit everywhere for general distribution--this 
is a workaround for a memory corruption problem that is unrelated to 
this patch.
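
For readers without the patch set at hand, a guess at the general shape of
the call involved. This is not the content of shutdown.diff; the function name
is hypothetical and only shutdown(), SHUT_RD and close() are real.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Stop reading from a client socket, then close it.  Data already
 * written to the peer stays readable on the other side.
 * Returns 0 on success, -1 if shutdown() or close() failed.
 */
int close_client_socket(int fd)
{
    int rc = 0;

    /* ENOTCONN just means the connection is already gone. */
    if (shutdown(fd, SHUT_RD) < 0 && errno != ENOTCONN)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```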





Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Jeremy Howard

Lawrence Greenfield wrote:

   Date: Wed, 15 May 2002 16:02:42 -0300
   From: Henrique de Moraes Holschuh [EMAIL PROTECTED]
[...]
   The point is, if that indeed happens, log or no log, master loses track of
   the number of children that can service requests. That would be a bug, and
   the patch supposedly fixes this bug.  It really doesn't matter (for
   accepting or not the patch) why the child died.

Yes, I understand that.  However, if the master (in real life
situations) is actually losing track of the number of available
service processes without one of those service processes crashing
(either by the sysadmin or otherwise) then there's some other problem
in the child accounting.
  

The child accounting is fine. The problem in our case was always caused 
by child segfaults, or failure to properly close TCP connections. We 
still see segfaults (about one per fifty thousand connections I'd 
guess), and occasional TCP closing problems, but they're much reduced 
with the other patches. However with the master.c patch these 
intermittent problems have no practical impact on our users, since the 
server handles them gracefully.

The master.c patch is important because without it, any little problem 
in any daemon that causes a child to crash, will eventually bring down 
the whole server. With the patch, processes are still counted correctly 
and therefore a child segfaulting will not stop the server from 
accepting connections.

Child segfaults still cause the problem to be logged as per usual--the 
only difference with the patch is that child processes are correctly 
counted in this event.





Re: [PATCH] Updated master.c process counting patch

2002-05-15 Thread Jeremy Howard

Henrique de Moraes Holschuh wrote:

On Wed, 15 May 2002, Scott Russell wrote:
  

On Wed, May 15, 2002 at 10:35:04AM -0300, Henrique de Moraes Holschuh wrote:


I don't know about all the other patches, though. I have included the
safe_flock patches, and I *may* include the alarm and locking stuff in
master-avail.diff later, but I must study and understand it first.
  

Are the patches TLS sane / tested? At one point I remember reading



The ones I am including in Debian are. No TLS is _not_ an option IMHO.

The TLS issue was simply required to work around a memory corruption 
bug. The patches all work correctly with TLS, as long as you remove the 
'  !imapd_in->tls_conn' bits in imapd.c and pop3d.c.

These extra tests have nothing to do with the other patches. The problem 
is that imapd_in et al sometimes get overwritten, causing prot_flush and 
prot_fill to segfault when they are called during shut_down() 
or imapd_reset(). Therefore we wanted some way of checking whether 
imapd_in was still sane, and we happened to pick checking that 
->tls_conn==0. A better approach would be to make the first element of 
the record a known constant and check for that. An even better approach 
would be to work out why the structure is getting corrupted in the first 
place!
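
The "known constant as first element" idea could look roughly like this. The
struct here is a made-up stand-in, not the real protstream layout, and the
magic value and function names are arbitrary choices for illustration.

```c
#include <stddef.h>

#define STREAM_MAGIC 0x50524f54u   /* ASCII "PROT"; arbitrary choice */

/* Hypothetical stand-in for a stream record guarded by a magic number. */
struct guarded_stream {
    unsigned int magic;   /* first element; set once at creation */
    int          fd;
    void        *tls_conn;
};

/* Initialise a record and stamp it with the known constant. */
void guarded_stream_init(struct guarded_stream *s, int fd)
{
    s->magic    = STREAM_MAGIC;
    s->fd       = fd;
    s->tls_conn = NULL;
}

/* Returns 1 if the record still carries the constant, 0 otherwise. */
int guarded_stream_is_sane(const struct guarded_stream *s)
{
    return s != NULL && s->magic == STREAM_MAGIC;
}
```

A check like this only detects an overwrite that happens to hit the first
word, so it is a tripwire and not a fix for the corruption itself.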

To clarify--the memory corruption problem occurs regardless of whether 
the other patches are present (in our environment at least), and needs 
to be worked around regardless of whether the other patches are present. 
The issues are completely independent. The memory corruption bug is 
something we see every few thousand connections under Linux 2.4.18 with 
Cyrus 2.1.3 in a mixed client high load environment.




[PATCH] Updated master.c process counting patch

2002-05-14 Thread Jeremy Howard

Thanks to Jaska Kivela, some patch formatting problems that caused the 
master.c process counting patch to not apply cleanly have been resolved. 
The patch set has been updated, and now also incorporates the master.c 
race condition patch:

http://jhoward.fastmail.fm/patches/cyrus/imap-diff.tgz

The only file changed in the patch set is master-avail.diff. 
master-avail.diff solves the problem that master fails to correctly keep 
count of child processes if processes do not exit cleanly. This 
manifests itself as Cyrus failing to accept new connections on one or 
more ports after a while, when using preforking in imapd.conf.