Re: followup: stuck lmtpd processes

2003-09-26 Thread Tom

On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote:

 On Wed, 24 Sep 2003, Etienne Goyer wrote:
  On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote:
   However, I have looked into this and to my surprise, Linux is indeed
   restarting the system calls instead of returning with EINTR.  However, the
   answer here is to set up the alarm() handler with sigaction without
   setting SA_RESTART, not to jump through select() hoops or make nonblocking
   lock attempts.
 
  I consulted my local Unix guru on the subject, and he pointed me to
  Advanced Programming in the Unix Environment by W. Richard Stevens,
  section 10.5 (page 275 in my edition), "Interrupted System Calls".
  The relevance of this text may be questioned since it was published in
  1992, but according to it the behavior of SVR4 and BSD differs
  substantially in this regard.  My guru concluded that relying on a signal
  to interrupt a system call is an assumption that may lead to portability
  problems.  In the fud case, he is right.

 With SYSV you will get the interrupted system call, unless you tell it
 somehow not to do it (the SA_RESTART stuff).  If we are to accommodate the
 BSDs, we can:
   1. Let them have the short end of the stick (they get what we have now,
  that is, deadlocks). Not good.
   2. Let them use the low-performance non-blocking mode (good solution).
   3. Find out how to get EINTR from BSD (it should be possible nowadays).

 As for old legacy systems (too old BSDs, SunOS 4, very old Linux), well, I
 am inclined to simply ignore them.  Let them use solution (2) above.  That
 is the least of their problems, given the amount of annoying and dangerous
 bugs those old kernels have.

  #2 doesn't sound like a good option for anyone.  It is not clear from
the context in this message whether you want restartable system calls or
not.  BSD4.2 introduced restartable system calls.  BSD4.3 introduced
siginterrupt() to disable restartable system calls for a certain signal.
So even with SunOS 4, you'd have the option.

  Newer BSDs have SA_RESTART via sigaction().  sigaction() is supposed to
conform to the POSIX.1 standard, and therefore probably works the same as
Linux.
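
For reference, a minimal sketch of the siginterrupt() route described above, assuming a system that still ships signal() and siginterrupt() (glibc does, although it marks siginterrupt() as obsolete in favour of sigaction()).  The handler and helper names are made up for illustration; this is not code from Cyrus.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Illustrative empty handler; it only needs to exist so SIGALRM is caught. */
static void on_alarm(int signo)
{
    (void)signo;
}

/*
 * Install the handler the old way, then ask for interrupted system calls
 * to fail with EINTR instead of being restarted for this one signal.
 * Returns 0 on success, -1 on failure.
 */
static int install_interrupting_handler(void)
{
    if (signal(SIGALRM, on_alarm) == SIG_ERR)
        return -1;
    return siginterrupt(SIGALRM, 1);   /* 1 = do not restart syscalls */
}

/*
 * Hypothetical helper: wait up to `seconds` for one byte on fd.
 * Returns 1 if a byte arrived, 0 on end of file, -1 on error
 * (errno == EINTR means the alarm fired first).
 */
static int read_byte_or_timeout(int fd, char *out, unsigned int seconds)
{
    ssize_t n;
    int saved_errno;

    assert(out != NULL && seconds > 0);
    alarm(seconds);
    n = read(fd, out, 1);
    saved_errno = errno;
    alarm(0);                          /* cancel a still-pending alarm */
    errno = saved_errno;
    return (int)n;
}
```

Calling install_interrupting_handler() once at startup and then read_byte_or_timeout() around a blocking read should give the EINTR behaviour on systems where plain signal() restarts the call.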

 --
   One disk to rule them all, One disk to find them. One disk to bring
   them all and in the darkness grind them. In the Land of Redmond
   where the shadows lie. -- The Silicon Valley Tarot
   Henrique Holschuh


Tom


Re: followup: stuck lmtpd processes

2003-09-26 Thread Henrique de Moraes Holschuh
On Fri, 26 Sep 2003, Tom wrote:
 On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote:
  With SYSV you will get the interrupted system call, unless you tell it
  somehow not to do it (the SA_RESTART stuff).  If we are to accommodate the
  BSDs, we can:
1. Let them have the short end of the stick (they get what we have now,
   that is, deadlocks). Not good.
2. Let them use the low-performance non-blocking mode (good solution).
3. Find out how to get EINTR from BSD (it should be possible nowadays).
 
  As for old legacy systems (too old BSDs, SunOS 4, very old Linux), well, I
  am inclined to simply ignore them.  Let them use solution (2) above.  That
  is the least of their problems, given the amount of annoying and dangerous
  bugs those old kernels have.
 
   #2 doesn't sound like a good option for anyone.  It is not clear from

It is tried and true, and the patch is here NOW.

 the context in this message whether you want restartable system calls or
 not.  BSD4.2 introduced restartable system calls.  BSD4.3 introduced
 siginterrupt() to disable restartable system calls for a certain signal.
 So even with SunOS 4, you'd have the option.

I have the code ready for POSIX.  If anyone who uses SunOS and BSD4.2/4.3
wants to contribute the autoconf code and changes to get non-restartable
syscalls for a signal, they're welcome... I don't like to write code I can't
test.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


Re: followup: stuck lmtpd processes

2003-09-25 Thread Etienne Goyer
On Wed, Sep 24, 2003 at 08:01:11PM +0100, Patrick Welche wrote:
 I don't understand. The only alarm() business I can see in imap/fud.c
 is around recvfrom which at least according to its man page says
 
  [EINTR]  The receive was interrupted by delivery of a signal
           before any data were available.
 
 What system calls are there? I was looking for syscall()...

Sorry, I may have got the nomenclature wrong since I am not very
familiar with Unix C programming.

However, the man page is wrong about EINTR at least as far as RedHat 7.x
is concerned.  In a murder environment, when following a referral:

1. The alarm() is set

2. recvfrom() waits for the UDP reply from the referred server

3. If the referred server is not answering:

  3.1 The SIGALRM handler is eventually executed

  3.2 fud goes back to waiting in recvfrom() and deadlocks

This is reproducible in my environment.  Maybe I got something wrong in
my crude syslog() debugging, but I am confident that this is what
happens.  The patch I posted earlier fixes the problem for me.

-- 
Etienne Goyer                    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com   [EMAIL PROTECTED]


Re: followup: stuck lmtpd processes

2003-09-25 Thread Rob Siemborski
On Thu, 25 Sep 2003, Etienne Goyer wrote:

 However, the man page is wrong about EINTR at least as far as RedHat 7.x
 is concerned.  In a murder environment, when following a referral:

No, it isn't wrong.  The problem is signals that are configured via
signal() instead of sigaction().  On Linux, signal() implicitly sets the
SA_RESTART flag, so it is the way the signal was set up, not the syscall,
that causes this behavior.

If SA_RESTART is not set, then the system call is interrupted (and returns
EINTR) as it should be.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread John Wade
Oooppss. Sorry, my mailbox went temporarily over quota and the delivery 
of the original thread was deferred until after I had read and responded 
to the followup.  It looks like the locking mechanism is working 
correctly here and the bug is really in the network timeout.  (or in the 
implementation that allows a network call after the append setup is called)

The patch I wrote still might help you since it would prevent an 
individual user's problem from taking down the mail system.   The user's 
mailbox would remain inaccessible, but the lmtpd processes attempting 
delivery would exit with errors and mail delivery as a whole would 
proceed. I still believe that a system that uses locking with no
timeout mechanism is inherently fragile.  A single programming error can
lead to a cascade failure that takes down the entire mail system as more 
and more processes hang up trying to get the lock.

Just my two cents,
John
Andrew Morgan wrote:

On Tue, 23 Sep 2003, John Wade wrote:

 

Hi Andrew,

I was the one who wrote the message you found.   I finally came to the
conclusion that the flat file locking mechanism is somewhat broken in
Cyrus, but I was never a good enough C programmer to pin down what was
happening.  (The mmap stuff makes it really tricky to debug.)  I
wanted to blame it on the Linux kernel, but I know that others have
experienced the same problems in Solaris.
I finally gave up and wrote a locking timeout patch for 2.0.16.   see
http://www.oakton.edu/~jwade/cyrus/ for the patch and full details
A number of other folks have tried this patch successfully on 2.0.16 and
2.1.x, and I know it has resolved our problem.
If you can solve the particular bug that causes this, more power to you,
if not, my work around resolves a number of possible deadlock issues.
Enjoy,
John
   

Hey John,

Thanks for that message.  If you've read a little further in your
info-cyrus messages, you'll see that I apparently have hit upon a
different bug than the one you found (I think).  Your page was
instrumental in helping me track down the source of the problem though.
It turns out I had an imaps process that hung onto the lock on the user's
quota file.  Apparently it obtained the lock, then went off to read from
the network connection and never came back.
I think your patch would fix the problem where a lot of processes are
contending for a lock (by making them retry), but it wouldn't help if a
single process keeps the lock indefinitely.  Ideally it should not be
possible for a process to get hung while it is holding the lock, but that
will require some careful programming in this particular case.  In the
meantime, I'll have to keep an eye on the system.
Thanks again for your debugging clues...

	Andy

 




Re: followup: stuck lmtpd processes

2003-09-24 Thread Rob Siemborski
On Tue, 23 Sep 2003, Andrew Morgan wrote:

 I think your patch would fix the problem where a lot of processes are
 contending for a lock (by making them retry), but it wouldn't help if a
 single process keeps the lock indefinitely.

I agree.  The whole act of retrying for a lock is pretty silly, when you
think about it.  The kernel is responsible for waking processes up when
they are blocking on a lock and it becomes available.  If this isn't
happening (causing the need to do locks in a nonblocking fashion) then
something is wrong with the *kernel* not the application.

The exponential backoff is pretty silly from a performance standpoint as
well.
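
For readers unfamiliar with the approach being criticised here, a non-blocking retry loop with exponential backoff looks roughly like the sketch below.  It is not John's patch; the function name, parameters and delays are assumptions chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Try to take an exclusive fcntl() lock on the whole file without blocking,
 * sleeping between attempts and doubling the delay each time.
 * Returns 0 once locked, -1 with errno set if every attempt failed
 * (EAGAIN when the lock stayed contended).
 */
static int lock_with_backoff(int fd, int max_tries, long first_delay_ms)
{
    struct flock fl;
    long delay_ms = first_delay_ms;
    int i;

    assert(max_tries > 0 && first_delay_ms > 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;     /* l_start = 0, l_len = 0: whole file */

    for (i = 0; i < max_tries; i++) {
        struct timespec ts;

        if (fcntl(fd, F_SETLK, &fl) == 0)
            return 0;
        if (errno != EACCES && errno != EAGAIN)
            return -1;          /* a real error, not contention */
        if (i + 1 == max_tries)
            break;
        ts.tv_sec = delay_ms / 1000;
        ts.tv_nsec = (delay_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);   /* the process sleeps instead of being woken */
        delay_ms *= 2;
    }
    errno = EAGAIN;
    return -1;
}
```

The sleep is where the performance cost comes from: a waiter can sit out most of its delay after the lock has already been released, whereas a blocking F_SETLKW waiter is woken by the kernel as soon as the lock is free.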

The bug where the quota file stays locked is something else entirely.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread Etienne Goyer
Hi,

I don't have time this morning to have a look at your patch and
understand the issue, but it reminded me of another bug I found a few
months ago.  It may or may not relate to the problem you are fixing.  I
just thought you might be interested in knowing.  It's the timeout part of
your problem that rang a bell.

In Linux, a system call interrupted by alarm() does not return; the
SIGALRM handler is executed, then the syscall is restarted.  If the
program you are trying to troubleshoot depends on alarm() to interrupt a
blocking system call, it will deadlock in Linux (at least, it does in
RedHat 7.3).

I stumbled on this bug in the fud server and submitted a patch, but it was
not considered AFAIK since it was not reproducible in Solaris (duh!) and
Linux was vaguely blamed while nothing was being done to address the
issue.

Sorry if this is unrelated to your problem.  I thought you may be
interested in my experience in case it is.

On Tue, Sep 23, 2003 at 10:35:04PM -0500, John Wade wrote:
 Hi Andrew,
 
 I was the one who wrote the message you found.   I finally came to the 
 conclusion that the flat file locking mechanism is somewhat broken in 
 Cyrus, but I was never a good enough C programmer to pin down what was 
 happening.  (The mmap stuff makes it really tricky to debug.)  I
 wanted to blame it on the Linux kernel, but I know that others have 
 experienced the same problems in Solaris.
 
 I finally gave up and wrote a locking timeout patch for 2.0.16.   see 
 http://www.oakton.edu/~jwade/cyrus/ for the patch and full details
 
 A number of other folks have tried this patch successfully on 2.0.16 and 
 2.1.x, and I know it has resolved our problem.
 
 If you can solve the particular bug that causes this, more power to you, 
 if not, my work around resolves a number of possible deadlock issues.
 
 Enjoy,
 John
 
 
 
 Andrew Morgan wrote:
 
 Following up on my previous post about stuck lmtpd processes.  I found
 this incredibly detailed faq at:
 
 http://www.faqchest.com/prgm/cyrus-l/cyrus-01/cyrus-0111/cyrus-011102/cyrus0023_33254.html
 
 This isn't exactly the same problem, but the steps on that page helped me
 figure out that they are all stuck trying to get a lock on:
 
 /private/cyrus/mail/k/user/krolickp/cyrus.header
 
 Looking at /proc/locks shows:
 
 7: POSIX  ADVISORY  WRITE 21903 08:11:42107658 0 EOF d23895e0 c3217f44 c510e4c4 ccbf076c
 7: -> POSIX  ADVISORY  WRITE 32485 08:11:42107658 0 EOF ccbf0760 ee36ac44 f3bb26a4 d23895e0 ee36ac4c
 7: -> POSIX  ADVISORY  WRITE 1802 08:11:42107658 0 EOF ee36ac40 c050ea04 ccbf0764 d23895e0 c050ea0c
 7: -> POSIX  ADVISORY  WRITE 1217 08:11:42107658 0 EOF c050ea00 ee36a344 ee36ac44 d23895e0 ee36a34c
 ...
 
 
 I don't see how this deadlock occurred, but I'm willing to help debug it.
 
  Andy
 
 
 
   
 

-- 
Etienne Goyer                    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com   [EMAIL PROTECTED]


Re: followup: stuck lmtpd processes

2003-09-24 Thread Henrique de Moraes Holschuh
On Wed, 24 Sep 2003, Rob Siemborski wrote:
 think about it.  The kernel is responsible for waking processes up when
 they are blocking on a lock and it becomes available.  If this isn't
 happening (causing the need to do locks in a nonblocking fashion) then
 something is wrong with the *kernel* not the application.

Agreed, but if we are going to keep the blocking-on-lock behaviour (and I
know we are ;-)), we really, really should have a way to timeout and kill
the process if that lock does not release after a while.

Resilience IS necessary... As an admin, I want to know there are problems
from syslog events, not because the whole system stopped.  Right now, at
least in Linux (which DOES have kernel/glibc bugs in that area) that means
we end up needing the slow-as-hell backoff non-blocking locks stuff.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


Re: followup: stuck lmtpd processes

2003-09-24 Thread Rob Siemborski
On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote:

 Agreed, but if we are going to keep the blocking-on-lock behaviour (and I
 know we are ;-)), we really, really should have a way to timeout and kill
 the process if that lock does not release after a while.

 Resilience IS necessary... As an admin, I want to know there are problems
 from syslog events, not because the whole system stopped.  Right now, at
 least in Linux (which DOES have kernel/glibc bugs in that area) that means
 we end up needing the slow-as-hell backoff non-blocking locks stuff.

As I understand it (based on your comments to Bug 1177), just setting an
alarm() around the flock/fcntl calls isn't good enough to solve the Linux
problem.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread Henrique de Moraes Holschuh
On Wed, 24 Sep 2003, Rob Siemborski wrote:
 On Wed, 24 Sep 2003, Henrique de Moraes Holschuh wrote:
  Agreed, but if we are going to keep the blocking-on-lock behaviour (and I
  know we are ;-)), we really, really should have a way to timeout and kill
  the process if that lock does not release after a while.
 
  Resilience IS necessary... As an admin, I want to know there are problems
  from syslog events, not because the whole system stopped.  Right now, at
  least in Linux (which DOES have kernel/glibc bugs in that area) that means
  we end up needing the slow-as-hell backoff non-blocking locks stuff.
 
 As I understand it (based on your comments to Bug 1177), just setting an
 alarm() around the flock/fcntl calls isn't good enough to solve the Linux
 problem.

It is not a general solution when you hit glibc/kernel bugs, but I can
certainly live with it IF I manage to track down a version of glibc and
kernel that won't deadlock, that we can recommend. Either that, or allow for
runtime-switchable behaviours (I am willing to code this).

Now, I need to find some time to write a fcntl/flock deadlock test case. If
anyone has one already, please send it my way...

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


Re: followup: stuck lmtpd processes

2003-09-24 Thread Scott Adkins
I just wanted to add something to this discussion...

First of all, we see the problem in Tru64 as well.  When we upgraded to the
2.2 series, we put in the locking patch that John described below.  This has
helped us, but the locking problem has *not* gone away... in fact, it does
a better job of *hiding* the problem than fixing it.  A backoff timeout
mechanism doesn't solve the problem that some other process has permanently
placed a lock on a file; all it does is prevent lots of LMTP processes from
stacking up *because* some other process permanently locked a file.

As a result, we still experience the problem, and it manifests itself as
a particular user being unable to receive any email... and until that
user actually notices that this is happening, we may not notice.  The log
files (the syslog mail log particularly) do give some clues, however,
such as a lot of System I/O errors when talking to LMTP for a particular
user.

When looking at what file the processes are all waiting to get a lock on,
it usually turns out to be the cyrus.header file and not the quota file.
Is this still the same bug described by Rob on bugzilla?  Does it have to
be the quota file?

Also, when we find the specific imaps process that happens to have the
cyrus.header lock file opened for writing and has it locked, if we kill
it off, we find that the write lock goes to another imaps process or to
one of the LMTP processes and gets stuck there... we kill that one off
and it goes to the next one and gets stuck.  We never saw a case where
all the other processes became unstuck and the problem went away.

As a consequence, the only solution we have when we see the problem is to
restart the Cyrus server (we usually wait until after work hours at least).

I am not convinced the patch described below has helped us much, as when
we saw the LMTP processes stacking up, it was right in our face and we
could deal with it sooner rather than later.

Anyways, those are my thoughts on the subject.

Scott

--On Tuesday, September 23, 2003 10:35 PM -0500 John Wade 
[EMAIL PROTECTED] wrote:

Hi Andrew,

I was the one who wrote the message you found.   I finally came to the
conclusion that the flat file locking mechanism is somewhat broken in
Cyrus, but I was never a good enough C programmer to pin down what was
happening.  (The mmap stuff makes it really tricky to debug.)  I wanted
to blame it on the Linux kernel, but I know that others have experienced
the same problems in Solaris.
I finally gave up and wrote a locking timeout patch for 2.0.16.   see
http://www.oakton.edu/~jwade/cyrus/ for the patch and full details
A number of other folks have tried this patch successfully on 2.0.16 and
2.1.x, and I know it has resolved our problem.
If you can solve the particular bug that causes this, more power to you,
if not, my work around resolves a number of possible deadlock issues.
Enjoy,
John


Andrew Morgan wrote:

Following up on my previous post about stuck lmtpd processes.  I found
this incredibly detailed faq at:
http://www.faqchest.com/prgm/cyrus-l/cyrus-01/cyrus-0111/cyrus-011102/cyrus0023_33254.html
This isn't exactly the same problem, but the steps on that page helped me
figure out that they are all stuck trying to get a lock on:
/private/cyrus/mail/k/user/krolickp/cyrus.header

Looking at /proc/locks shows:

7: POSIX  ADVISORY  WRITE 21903 08:11:42107658 0 EOF d23895e0 c3217f44 c510e4c4 ccbf076c
7: -> POSIX  ADVISORY  WRITE 32485 08:11:42107658 0 EOF ccbf0760 ee36ac44 f3bb26a4 d23895e0 ee36ac4c
7: -> POSIX  ADVISORY  WRITE 1802 08:11:42107658 0 EOF ee36ac40 c050ea04 ccbf0764 d23895e0 c050ea0c
7: -> POSIX  ADVISORY  WRITE 1217 08:11:42107658 0 EOF c050ea00 ee36a344 ee36ac44 d23895e0 ee36a34c
...
I don't see how this deadlock occurred, but I'm willing to help debug it.

	Andy








--
+---+
 Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
  UNIX Systems Engineer  mailto:[EMAIL PROTECTED]
   ICQ 7626282 Work (740)593-9478 Fax (740)593-1944
+---+
PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/



Re: followup: stuck lmtpd processes

2003-09-24 Thread Rob Siemborski
On Wed, 24 Sep 2003, Scott Adkins wrote:

 Also, when we find the specific imaps process that happens to have the
 cyrus.header lock file opened for writing and has it locked, if we kill
 it off, we find that the write lock goes to another imaps process or to
 one of the LMTP processes and gets stuck there... we kill that one off
 and it goes to the next one and gets stuck.  We never saw a case where
 all the other processes became unstuck and the problem went away.

So, if these processes are getting stuck -- what are they getting stuck
waiting for?  Processes don't acquire a lock and then just stop.

 As a consequence, the only solution we have when we see the problem is to
 restart the Cyrus server (we usually wait until after work hours at least).
 I am not convinced the patch described below has helped us much, as when
 we saw the LMTP processes stacking up, it was right in our face and we
 could deal with it sooner than later.

I agree.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread Etienne Goyer
On Wed, Sep 24, 2003 at 11:13:06AM -0300, Henrique de Moraes Holschuh wrote:
 It is not a general solution when you hit glibc/kernel bugs, but I can
 certainly live with it IF I manage to track down a version of glibc and
 kernel that won't deadlock, that we can recommend. Either that, or allow for
 runtime-switchable behaviours (I am willing to code this).

This is not good enough.  You can't recommend a specific kernel/glibc
version; this is dictated by the distribution people use.  You can't
just recommend using the latest either, because a lot of (most?) people
will prefer to use an older, well-known, stable distribution (Debian
stable, RedHat 7.3).

The obvious solution is to not use alarm() to interrupt a blocking
syscall, but to use a non-blocking call with select() instead.  I
am not a very proficient Unix C programmer, so maybe this suggestion
makes no sense.  However, in the case of the bug with fud I discussed in
another post, it was a perfectly workable solution.

And please don't dismiss it as a problem with Linux, not Cyrus.  Linux
may well be broken (I can't tell), but it still constitutes the vast
majority of Cyrus installations (I would believe), and thus deserves to be
accommodated.

-- 
Etienne Goyer                    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com   [EMAIL PROTECTED]


Re: followup: stuck lmtpd processes

2003-09-24 Thread Rob Siemborski
On Wed, 24 Sep 2003, Etienne Goyer wrote:

 The obvious solution is to not use alarm() to interrupt a blocking
 syscall, but to use a non-blocking call with select() instead.  I
 am not a very proficient Unix C programmer, so maybe this suggestion
 makes no sense.  However, in the case of the bug with fud I discussed in
 another post, it was a perfectly workable solution.

This is *not* the correct solution.

However, I have looked into this and to my surprise, Linux is indeed
restarting the system calls instead of returning with EINTR.  However, the
answer here is to set up the alarm() handler with sigaction without
setting SA_RESTART, not to jump through select() hoops or make nonblocking
lock attempts.

I'll work on fixing fud shortly (it's using signal() and it should be
using sigaction()).

 And please don't dismiss it as a problem with Linux, not Cyrus.  Linux
 may well be broken (I can't tell), but it still constitutes the vast
 majority of Cyrus installations (I would believe), and thus deserves to be
 accommodated.

Linux should certainly be accommodated, but not in a way that causes
*severe* performance penalties on operating systems that don't
mysteriously fall asleep in the middle of a blocking flock() or fcntl()
call.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread Henrique de Moraes Holschuh
On Wed, 24 Sep 2003, Rob Siemborski wrote:
 However, I have looked into this and to my surprise, Linux is indeed
 restarting the system calls instead of returning with EINTR.  However, the
 answer here is to set up the alarm() handler with sigaction without
 setting SA_RESTART, not to jump through select() hoops or make nonblocking
 lock attempts.

Yep, that's exactly it. Bullseye, Rob!

 I'll work on fixing fud shortly (it's using signal() and it should be
 using sigaction()).

And I will do a linux fcntl/flock patch that uses alarm() WITHOUT
SA_RESTART as soon as I can.

 Linux should certainly be accommodated, but not in a way that causes
 *severe* performance penalties on operating systems that don't
 mysteriously fall asleep in the middle of a blocking flock() or fcntl()
 call.

Agreed.

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


Re: followup: stuck lmtpd processes

2003-09-24 Thread Etienne Goyer
On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote:
 However, I have looked into this and to my surprise, Linux is indeed
 restarting the system calls instead of returning with EINTR.  However, the
 answer here is to set up the alarm() handler with sigaction without
 setting SA_RESTART, not to jump through select() hoops or make nonblocking
 lock attempts.

I consulted my local Unix guru on the subject, and he pointed me to
Advanced Programming in the Unix Environment by W. Richard Stevens,
section 10.5 (page 275 in my edition), "Interrupted System Calls".
The relevance of this text may be questioned since it was published in
1992, but according to it the behavior of SVR4 and BSD differs
substantially in this regard.  My guru concluded that relying on a signal
to interrupt a system call is an assumption that may lead to portability
problems.  In the fud case, he is right.
 
 I'll work on fixing fud shortly (it's using signal() and it should be
 using sigaction()).

The included patch against 2.1.13 works for me.

-- 
Etienne Goyer                    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com   [EMAIL PROTECTED]
--- fud.c.orig  Wed Jun  4 14:19:43 2003
+++ fud.c   Wed Jun  4 15:28:54 2003
@@ -53,6 +53,7 @@
 #include <syslog.h>
 #include <signal.h>
 #include <sys/types.h>
+#include <sys/time.h>
 #include <sys/param.h>
 #include <sys/stat.h>
 #include <netinet/in.h>
@@ -62,6 +63,7 @@
 #include <errno.h>
 #include <com_err.h>
 #include <pwd.h>
+#include <sys/select.h>
 
 #include "assert.h"
 #include "mboxlist.h"
@@ -196,12 +198,6 @@
 shut_down(r);
 }
 
-static void cyrus_timeout(int signo)
-{
-  signo = 0;
-  return;
-}
-
 /* Send a proxy request to the backend, send their reply to sfrom */
 int do_proxy_request(const char *who, const char *name,
 const char *backend_host,
@@ -210,8 +206,10 @@
 char tmpbuf[1024];
 int x, rc;
 int csoc = -1;
+fd_set ssock;
 struct sockaddr_in cin, cout;
 struct hostent *hp;
+struct timeval tv;
 static int backend_port = 0; /* fud port in NETWORK BYTE ORDER */
 
 /* Open a UDP socket to the Cyrus mail server */
@@ -240,14 +238,24 @@
 /* Send the query and wait for a reply */
 sendto (csoc, tmpbuf, strlen (tmpbuf), 0, (struct sockaddr *) &cin, x);
 memset (tmpbuf, '\0', strlen (tmpbuf));
-signal (SIGALRM, cyrus_timeout);
 rc = 0;
-alarm (1);
-rc = recvfrom (csoc, tmpbuf, sizeof(tmpbuf), 0,
-  (struct sockaddr *) &cout, &x);
-alarm (0);
-if (rc < 1) {
-   rc = IMAP_SERVER_UNAVAILABLE;
+FD_ZERO(&ssock);
+FD_SET(csoc, &ssock);
+tv.tv_sec = 5;
+tv.tv_usec = 0;
+rc = select(csoc + 1, &ssock, NULL, NULL, &tv);
+if (rc > 0) {
+syslog(LOG_ERR, "Reading sock");
+   rc = recvfrom (csoc, tmpbuf, sizeof(tmpbuf), 0,
+  (struct sockaddr *) &cout, &x);
+if (rc < 1) {
+   rc = IMAP_SERVER_UNAVAILABLE;
+send_reply(sfrom, REQ_UNK, who, name, 0, 0, 0);
+goto done;
+}
+} else {
+   syslog(LOG_ERR, "FUD timeout");
+rc = IMAP_SERVER_UNAVAILABLE;
 send_reply(sfrom, REQ_UNK, who, name, 0, 0, 0);
 goto done;
 }


Re: followup: stuck lmtpd processes

2003-09-24 Thread Etienne Goyer
On Wed, Sep 24, 2003 at 12:57:37PM -0300, Henrique de Moraes Holschuh wrote:
 I did check ALL the documentation already, and ALL of it says that sigalarm
 MUST interrupt the syscall, and that it HAS to return EINTR.  So, it is a
 bug.  So, it needs to be squashed, and people have to either patch or
 upgrade their systems... or deal with diminished performance.

Please have a look at the Stevens reference I made in reply to Rob.
According to him, BSD circa 1992 did not adhere to this behavior.  Whether
modern BSDs perpetuate this or not, I can't tell.  According to Stevens
again, SunOS 4.1.2 had yet another behavior in this regard.  Whether
these ancient OSes should be accommodated or not is a decision I am not
qualified to comment upon.

  And please don't dismiss it as a problem with Linux, not Cyrus.  Linux
  may well be broken (I can't tell), but it still constitutes the vast
  majority of Cyrus installations (I would believe), and thus deserves to be
  accommodated.
 
 Something that works in Linux, sure.  Something that works in broken Linux?
 No.  Fix the breakage in Linux, instead.  That's our strength, and I *will*
 stick to it as a Debian maintainer.

While I agree with you on a technical level and admire your commitment
to excellence, this may not be practical.  The installed base is huge
and the interested parties (Linux distributors) numerous.  Getting
everybody to update broken packages will be quite an endeavour.
Considering this bug touches upon the kernel and glibc, expecting end users
to patch it themselves without support from their distributor is not an
option either.

 There is a proper Unix way to do it (using alarm().  this needs to be added
 to Cyrus IMHO) that *might* not work in certain Linux glibc/kernel
 combinations.

That's the crux of the problem: if the glibc/kernel combination
corresponds to the major part of the installed base, it might continue to
hurt for a long time.

 Now, if other Unixes have stupid lock and alarm() bugs, that deadlock
 testing code would be even more useful... :-)

In the case of a closed-source OS, there may be nothing we can do about it
except work around the bug.

-- 
Etienne Goyer                    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com   [EMAIL PROTECTED]


Re: followup: stuck lmtpd processes

2003-09-24 Thread Rob Siemborski
On Wed, 24 Sep 2003, Etienne Goyer wrote:

  I'll work on fixing fud shortly (it's using signal() and it should be
  using sigaction()).

 The included patch against 2.1.13 works for me.

This sort of thing won't work for file locking.  I've just committed a
patch to fud that uses sigaction() [Which since we're assuming POSIX
anyway, should hopefully be enough].

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread Rob Siemborski
On Wed, 24 Sep 2003, Etienne Goyer wrote:

  Something that works in Linux, sure.  Something that works in broken Linux?
  No.  Fix the breakage in Linux, instead.  That's our strenght, and I *will*
  stick to it as a Debian maintainer.

 While I agree with you on a technical level and admire your commitment
 to excellence, this may not be practical.  The installed base is huge
 and the interested parties (Linux distributors) are numerous.  Getting
 everybody to update broken packages will be quite an endeavour.

We don't have to get everybody to update, only the people who are running
Cyrus.

  There is a proper Unix way to do it (using alarm().  this needs to be added
  to Cyrus IMHO) that *might* not work in certain Linux glibc/kernel
  combinations.

 That's the crux of the problem: if the glibc/kernel combination
 corresponds to the major part of the installed base, it might continue to
 hurt for a long time.

If it is a single combination, that can be documented.

  Now, if other Unixes have stupid lock and alarm() bugs, that deadlock
  testing code would be even more useful... :-)

 In the case of a closed-source OS, there may be nothing we can do about it
 except work around the bug.

Or reporting it to the vendor.  Which is exactly what I think we should do
in the case of Linux.

-Rob

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
Research Systems Programmer * /usr/contributed Gatekeeper



Re: followup: stuck lmtpd processes

2003-09-24 Thread Etienne Goyer
Thanks.  I'll test it by the end of the week, and report.

On Wed, Sep 24, 2003 at 01:18:12PM -0400, Rob Siemborski wrote:
 On Wed, 24 Sep 2003, Etienne Goyer wrote:
 
   I'll work on fixing fud shortly (it's using signal() and it should be
   using sigaction()).
 
  The included patch against 2.1.13 works for me.
 
 This sort of thing won't work for file locking.  I've just committed a
 patch to fud that uses sigaction() [Which since we're assuming POSIX
 anyway, should hopefully be enough].
 
 -Rob
 
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Rob Siemborski * Andrew Systems Group * Cyert Hall 207 * 412-268-7456
 Research Systems Programmer * /usr/contributed Gatekeeper

-- 
Etienne Goyer                    Linux Québec Technologies Inc.
http://www.LinuxQuebec.com       [EMAIL PROTECTED]


Re: followup: stuck lmtpd processes

2003-09-24 Thread Henrique de Moraes Holschuh
On Wed, 24 Sep 2003, Etienne Goyer wrote:
 On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote:
  However, I have looked into this and to my surprise, Linux is indeed
  restarting the system calls instead of returning with EINTR.  However, the
  answer here is to set up the alarm() handler with sigaction without
  setting SA_RESTART, not to jump through select() hoops or make nonblocking
  lock attempts.
 
 I consulted my local Unix guru on the subject, and he pointed me to
 Advanced Programming in the Unix Environment by W. Richard Stevens,
 section 10.5 (page 275 in my edition), Interrupted System Calls.
 The relevance of this text may be questioned since it was published in
 1992, but according to it the behavior of SVR4 and BSD differs
 substantially in this regard.  My guru concluded that relying on system
 call interruption by a signal is an assumption that may lead to
 portability problems.  In the fud case, he is right.

With SYSV you will get the interrupted system call, unless you tell it
somehow not to do it (the SA_RESTART stuff).  If we are to accommodate the
BSDs, we can:
  1. Let them have the short end of the stick (they get what we have now,
     that is, deadlocks). Not good.
  2. Let them use the low-performance non-blocking mode (good solution).
  3. Find out how to get EINTR from BSD (it should be possible nowadays).

As for old legacy systems (too old BSDs, SunOS 4, very old Linux), well, I
am inclined to simply ignore them.  Let them use solution (2) above.  That
is the least of their problems, given the amount of annoying and dangerous
bugs those old kernels have.
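
To make option (2) concrete, here is a rough sketch of what a
non-blocking lock mode could look like. It is an assumption-laden
illustration, not Cyrus code: lock_with_retry and its parameters are
hypothetical. It polls with fcntl(F_SETLK), which never blocks, and
sleeps between attempts, which is why this mode costs performance but
does not depend on signals interrupting a system call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: try to take a whole-file write lock without
 * blocking, retrying up to max_tries times with delay_ms between tries.
 * Returns 0 on success, -1 with errno == EAGAIN if the lock stayed busy,
 * or -1 with another errno on a real error. */
static int lock_with_retry(int fd, int max_tries, long delay_ms)
{
    struct flock fl;
    struct timespec delay;
    int i;

    assert(fd >= 0 && max_tries > 0 && delay_ms >= 0);

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                       /* 0 means "to end of file" */

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < max_tries; i++) {
        if (fcntl(fd, F_SETLK, &fl) == 0)
            return 0;                   /* got the lock */
        if (errno != EAGAIN && errno != EACCES)
            return -1;                  /* not a "lock is busy" error */
        nanosleep(&delay, NULL);        /* wait, then poll again */
    }
    errno = EAGAIN;
    return -1;
}
```

A caller would treat the EAGAIN case as a delivery failure for that one
mailbox and move on, rather than waiting on the lock forever.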

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh


Re: followup: stuck lmtpd processes

2003-09-24 Thread Patrick Welche
On Wed, Sep 24, 2003 at 02:20:50PM -0300, Henrique de Moraes Holschuh wrote:
 On Wed, 24 Sep 2003, Etienne Goyer wrote:
  On Wed, Sep 24, 2003 at 11:27:46AM -0400, Rob Siemborski wrote:
   However, I have looked into this and to my surprise, Linux is indeed
   restarting the system calls instead of returning with EINTR.  However, the
   answer here is to set up the alarm() handler with sigaction without
   setting SA_RESTART, not to jump through select() hoops or make nonblocking
   lock attempts.
  
  I consulted my local Unix guru on the subject, and he pointed me to
  Advanced Programming in the Unix Environment by W. Richard Stevens,
  section 10.5 (page 275 in my edition), Interrupted System Calls.
  The relevance of this text may be questioned since it was published in
  1992, but according to it the behavior of SVR4 and BSD differs
  substantially in this regard.  My guru concluded that relying on system
  call interruption by a signal is an assumption that may lead to
  portability problems.  In the fud case, he is right.
 
 With SYSV you will get the interrupted system call, unless you tell it
 somehow not to do it (the SA_RESTART stuff).  If we are to accommodate the
 BSDs, we can:
   1. Let them have the short end of the stick (they get what we have now,
      that is, deadlocks). Not good.
   2. Let them use the low-performance non-blocking mode (good solution).
   3. Find out how to get EINTR from BSD (it should be possible nowadays).

I don't understand. The only alarm() business I can see in imap/fud.c
is around recvfrom(), whose man page at least says:

 [EINTR]  The receive was interrupted by delivery of a signal
          before any data were available.

What system calls are there? I was looking for syscall()...

Cheers,

Patrick


Re: followup: stuck lmtpd processes

2003-09-24 Thread Andrew Morgan


On Wed, 24 Sep 2003, John Wade wrote:

 The patch I wrote still might help you since it would prevent an
 individual user's problem from taking down the mail system.   The user's
 mailbox would remain inaccessible, but the lmtpd processes attempting
 delivery would exit with errors and mail delivery as a whole would
 proceed. I still believe that a system that uses locking with no
 timeout mechanism is inherently fragile.  A single programming error can
 lead to a cascade failure that takes down the entire mail system as more
 and more processes hang up trying to get the lock.

In this case, the problem only affected a single user.  Anything that
tried to get a lock on that user's quota file hung.  I had a lot of lmtpd
processes hung waiting for the lock, but that didn't really impact the
system.  The other users of the system were unaffected.

I'm not saying it is acceptable, but the impact was minor.

Andy



Re: followup: stuck lmtpd processes

2003-09-24 Thread John C. Amodeo
Andy,

It's happened to me before...  Don't think it can't...  That's all I'm
saying...

-John

Andrew Morgan wrote:

 On Wed, 24 Sep 2003, John C. Amodeo wrote:

  ...until your system runs out of available open files...
 
  Then the real fun begins... :-)
 
  -John

 [EMAIL PROTECTED] tools]# cat /proc/sys/fs/file-max
 209708

 I'm in a lot of trouble if I've got 209708 files open.  :)

 Andy

--
_
John C. Amodeo - Associate Director of Information Technology
Faculty of Arts and Sciences -- Computer  Network Operations
Rutgers, The State University of New Jersey
732.932.9455-voice 732.932.0013-fax




Re: followup: stuck lmtpd processes

2003-09-24 Thread Andrew Morgan


On Wed, 24 Sep 2003, John C. Amodeo wrote:

 ...until your system runs out of available open files...

 Then the real fun begins... :-)

 -John

[EMAIL PROTECTED] tools]# cat /proc/sys/fs/file-max
209708


I'm in a lot of trouble if I've got 209708 files open.  :)

Andy



Re: followup: stuck lmtpd processes

2003-09-24 Thread John C. Amodeo
...until your system runs out of available open files...

Then the real fun begins... :-)

-John

Andrew Morgan wrote:

 On Wed, 24 Sep 2003, John Wade wrote:

  The patch I wrote still might help you since it would prevent an
  individual user's problem from taking down the mail system.   The user's
  mailbox would remain inaccessible, but the lmtpd processes attempting
  delivery would exit with errors and mail delivery as a whole would
  proceed. I still believe that a system that uses locking with no
  timeout mechanism is inherently fragile.  A single programming error can
  lead to a cascade failure that takes down the entire mail system as more
  and more processes hang up trying to get the lock.

 In this case, the problem only affected a single user.  Anything that
 tried to get a lock on that user's quota file hung.  I had a lot of lmtpd
 processes hung waiting for the lock, but that didn't really impact the
 system.  The other users of the system were unaffected.

 I'm not saying it is acceptable, but the impact was minor.

 Andy

--
_
John C. Amodeo - Associate Director of Information Technology
Faculty of Arts and Sciences -- Computer  Network Operations
Rutgers, The State University of New Jersey
732.932.9455-voice 732.932.0013-fax




Re: followup: stuck lmtpd processes

2003-09-24 Thread Andrew Morgan


On Wed, 24 Sep 2003, Scott Adkins wrote:

 When looking at what file the processes are all waiting to get a lock on,
 it usually turns out to be the cyrus.header file and not the quota file.
 Is this still the same bug described by Rob on bugzilla?  Does it have to
 be the quota file?

 Also, when we find the specific imaps process that happens to have the
 cyrus.header lock file opened for writing and has it locked, if we kill
 it off, we find that the write lock goes to another imaps process or to
 one of the LMTP processes and gets stuck there... we kill that one off
 and it goes to the next one and gets stuck.  We never saw a case where
 all the other processes became unstuck and the problem went away.

Are you sure that the processes are hung on the cyrus.header lock?  That's
what I originally thought when I was only looking at the output of lsof
and /proc/locks (linux).  When I actually ran a gdb backtrace on one of
the stuck processes, it became obvious that the lock was on the quota file
instead:

(gdb) bt
#0  0x402ae5fb in fcntl () from /lib/libc.so.6
#1  0x08077504 in lock_reopen (fd=16,
    filename=0xbfffa098 "/var/spool/cyrus/config/quota/k/user.krolickp",
    sbuf=0xbfffa040, failaction=0xbfffa03c) at lock_fcntl.c:87
#2  0x080570b6 in mailbox_lock_quota (quota=0xbfffc3c4) at mailbox.c:1016
#3  0x08053f73 in append_setup (as=0xbfffc118,
    name=0xbfffb114 "user.krolickp", format=0, userid=0x0, auth_state=0x0,
    aclcheck=0, quotacheck=0) at append.c:209


I also saw exactly the behavior you describe when killing processes.  I
originally tried killing all the lmtpd processes that were stuck because I
believed that one of the lmtpd processes was stuck holding the lock on
cyrus.header.  When I killed the one holding the lock on cyrus.header,
another lmtpd process would grab the lock but still be stuck.

When I finally killed the process holding the quota file lock (an imaps
process), all the lmtpd processes got unstuck and delivered the waiting
mail.

It sounds to me like you're not actually killing the process that has the
lock that all the other processes are waiting for.
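
One way to ask the kernel directly which process is in the way, rather
than inferring it from lsof or /proc/locks, is fcntl(F_GETLK). The
following is only a sketch under assumptions: lock_holder is a
hypothetical helper, not part of Cyrus, and it checks a whole-file write
lock, which may not match the exact range Cyrus locks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report which process holds a lock that would
 * block a whole-file write lock on fd.
 * Returns the holder's PID, 0 if the lock could be granted right now,
 * or -1 on error. */
static pid_t lock_holder(int fd)
{
    struct flock fl;

    assert(fd >= 0);

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 means "to end of file" */

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    if (fl.l_type == F_UNLCK)
        return 0;                   /* nothing would block us */
    return fl.l_pid;                /* PID of one conflicting lock holder */
}
```

Run against the quota file from the backtrace above, a check like this
would name the process that actually holds the lock, as opposed to one
that is merely first in the queue of waiters. F_GETLK does not report
locks held by the calling process itself.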

Andy



Re: followup: stuck lmtpd processes

2003-09-24 Thread Scott Adkins
Well, that could definitely be a problem... Next time we see a lock problem
occur, I will look based on the information below to see if it is really a
lock problem on the quota file.

Thanks,
Scott

--On Wednesday, September 24, 2003 12:32 PM -0700 Andrew Morgan
[EMAIL PROTECTED] wrote:

 On Wed, 24 Sep 2003, Scott Adkins wrote:

  When looking at what file the processes are all waiting to get a lock on,
  it usually turns out to be the cyrus.header file and not the quota file.
  Is this still the same bug described by Rob on bugzilla?  Does it have to
  be the quota file?

  Also, when we find the specific imaps process that happens to have the
  cyrus.header lock file opened for writing and has it locked, if we kill
  it off, we find that the write lock goes to another imaps process or to
  one of the LMTP processes and gets stuck there... we kill that one off
  and it goes to the next one and gets stuck.  We never saw a case where
  all the other processes became unstuck and the problem went away.

 Are you sure that the processes are hung on the cyrus.header lock?  That's
 what I originally thought when I was only looking at the output of lsof
 and /proc/locks (linux).  When I actually ran a gdb backtrace on one of
 the stuck processes, it became obvious that the lock was on the quota file
 instead:

 (gdb) bt
 #0  0x402ae5fb in fcntl () from /lib/libc.so.6
 #1  0x08077504 in lock_reopen (fd=16,
     filename=0xbfffa098 "/var/spool/cyrus/config/quota/k/user.krolickp",
     sbuf=0xbfffa040, failaction=0xbfffa03c) at lock_fcntl.c:87
 #2  0x080570b6 in mailbox_lock_quota (quota=0xbfffc3c4) at mailbox.c:1016
 #3  0x08053f73 in append_setup (as=0xbfffc118,
     name=0xbfffb114 "user.krolickp", format=0, userid=0x0, auth_state=0x0,
     aclcheck=0, quotacheck=0) at append.c:209

 I also saw exactly the behavior you describe when killing processes.  I
 originally tried killing all the lmtpd processes that were stuck because I
 believed that one of the lmtpd processes was stuck holding the lock on
 cyrus.header.  When I killed the one holding the lock on cyrus.header,
 another lmtpd process would grab the lock but still be stuck.

 When I finally killed the process holding the quota file lock (an imaps
 process), all the lmtpd processes got unstuck and delivered the waiting
 mail.

 It sounds to me like you're not actually killing the process that has the
 lock that all the other processes are waiting for.

 	Andy



--
+---+
 Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
  UNIX Systems Engineer  mailto:[EMAIL PROTECTED]
   ICQ 7626282 Work (740)593-9478 Fax (740)593-1944
+---+
PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/



Re: followup: stuck lmtpd processes

2003-09-23 Thread John Wade
Hi Andrew,

I was the one who wrote the message you found.   I finally came to the 
conclusion that the flat file locking mechanism is somewhat broken in 
Cyrus, but I was never a good enough C programmer to pin down what was 
happening.  (The mmap stuff makes it really tricky to debug.)  I
wanted to blame it on the Linux kernel, but I know that others have 
experienced the same problems in Solaris.

I finally gave up and wrote a locking timeout patch for 2.0.16.  See
http://www.oakton.edu/~jwade/cyrus/ for the patch and full details.

A number of other folks have tried this patch successfully on 2.0.16 and 
2.1.x, and I know it has resolved our problem.

If you can solve the particular bug that causes this, more power to you;
if not, my workaround resolves a number of possible deadlock issues.

Enjoy,
John


Andrew Morgan wrote:

Following up on my previous post about stuck lmtpd processes.  I found
this incredibly detailed faq at:
http://www.faqchest.com/prgm/cyrus-l/cyrus-01/cyrus-0111/cyrus-011102/cyrus0023_33254.html

This isn't exactly the same problem, but the steps on that page helped me
figure out that they are all stuck trying to get a lock on:
/private/cyrus/mail/k/user/krolickp/cyrus.header

Looking at /proc/locks shows:

7: POSIX  ADVISORY  WRITE 21903 08:11:42107658 0 EOF d23895e0 c3217f44 c510e4c4 ccbf076c
7: -> POSIX  ADVISORY  WRITE 32485 08:11:42107658 0 EOF ccbf0760 ee36ac44 f3bb26a4 d23895e0 ee36ac4c
7: -> POSIX  ADVISORY  WRITE 1802 08:11:42107658 0 EOF ee36ac40 c050ea04 ccbf0764 d23895e0 c050ea0c
7: -> POSIX  ADVISORY  WRITE 1217 08:11:42107658 0 EOF c050ea00 ee36a344 ee36ac44 d23895e0 ee36a34c
...

I don't see how this deadlock occurred, but I'm willing to help debug it.

	Andy



 




Re: followup: stuck lmtpd processes

2003-09-23 Thread Andrew Morgan


On Tue, 23 Sep 2003, John Wade wrote:

 Hi Andrew,

 I was the one who wrote the message you found.   I finally came to the
 conclusion that the flat file locking mechanism is somewhat broken in
 Cyrus, but I was never a good enough C programmer to pin down what was
 happening.  (The mmap stuff makes it really tricky to debug.)  I
 wanted to blame it on the Linux kernel, but I know that others have
 experienced the same problems in Solaris.

 I finally gave up and wrote a locking timeout patch for 2.0.16.   see
 http://www.oakton.edu/~jwade/cyrus/ for the patch and full details

 A number of other folks have tried this patch successfully on 2.0.16 and
 2.1.x, and I know it has resolved our problem.

 If you can solve the particular bug that causes this, more power to you;
 if not, my workaround resolves a number of possible deadlock issues.

 Enjoy,
 John

Hey John,

Thanks for that message.  If you've read a little further in your
info-cyrus messages, you'll see that I apparently have hit upon a
different bug than the one you found (I think).  Your page was
instrumental in helping me track down the source of the problem though.

It turns out I had an imaps process that hung onto the lock on the user's
quota file.  Apparently it obtained the lock, then went off to read from
the network connection and never came back.

I think your patch would fix the problem where a lot of processes are
contending for a lock (by making them retry), but it wouldn't help if a
single process keeps the lock indefinitely.  Ideally it should not be
possible for a process to get hung while it is holding the lock, but that
will require some careful programming in this particular case.  In the
meantime, I'll have to keep an eye on the system.
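
To illustrate that distinction, here is a sketch of a bounded wait on a
blocking lock. It is not John's patch (I have not reproduced its code);
lock_with_timeout and on_alarm are hypothetical names, and the approach
assumes a SIGALRM handler installed without SA_RESTART makes
fcntl(F_SETLKW) return EINTR. A waiter using something like this gives up
with an error instead of hanging, but the process that holds the lock is
untouched, so the mailbox stays locked until that process is dealt with.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked fcntl(). */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Hypothetical helper: blocking whole-file write lock that gives up
 * after `seconds`.  Returns 0 on success; on timeout returns -1 with
 * errno == EINTR. */
static int lock_with_timeout(int fd, unsigned seconds)
{
    struct sigaction sa, old;
    struct flock fl;
    int rc, saved;

    assert(fd >= 0 && seconds > 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                    /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) == -1)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;             /* l_start = l_len = 0: whole file */

    alarm(seconds);
    rc = fcntl(fd, F_SETLKW, &fl);      /* blocks until lock or signal */
    saved = errno;
    alarm(0);                           /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old, NULL);     /* restore the previous handler */
    errno = saved;
    return rc;
}
```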

Thanks again for your debugging clues...

Andy