On Fri, Dec 07, 2001, Sam Varshavchik <[EMAIL PROTECTED]> wrote:
> Johannes Erdfelt writes: 
> 
> > On Thu, Dec 06, 2001, Gordon Messmer <[EMAIL PROTECTED]> wrote:
> >> On Thu, 6 Dec 2001, Johannes Erdfelt wrote:
> >> > The mail server is busy much of the time, but I don't think it's busy
> >> > enough to naturally hit the respawnhi timeout. It looks like somehow
> >> > courier missed that a child finished and that's why it hit the respawnhi
> >> > timeout. 
> >> 
> >> I was wrong about that.  The child processes are still legitimately 
> >> running.  As fate would have it just as I started this email, I was pulled 
> >> in to some mail server issues and noticed that the respawnhi thing had 
> >> happened again.  All of the couriersmtp processes were stuck in a read() 
> >> system call on fd 5.  I have the control file from a couple, and there are 
> >> lots of DNS failures recorded. 
> >> 
> >> It's much too late to do any debugging right now, but I'll be over this 
> >> tomorrow.  In any case, it's not that courierd isn't harvesting children,
> >> it's that the children are blocking on an unprotected read().  (I thought
> >> they all had alarms in place...  /me shrugs)
> > 
> > I checked for any running processes, but I couldn't find any. I do have
> > lots of courier-related processes running (authdaemon, pop and imap) so I
> > may have missed one.
> > 
> > Either way, my system sat for 6 hours or so doing nothing. If you're
> > right that there was a process still running, something is missing a
> > timeout. 
> > 
> > I wonder what the longest timeout is. Presumably respawnhi could be hit
> > right after a legitimate process is spawned, which then has to wait for
> > a client to time out, so there will always be a chance that courier
> > just stops delivering email for a while.
> > 
> > respawnhi seems to need some sort of timeout, even if it's extremely
> > long.
> 
> The server is designed to restart itself only when no mail is pending. 
> 
> The problem is that the client should not be stuck like that.  There's a 
> select() before every read from the socket, so if anything, it should be 
> stuck in a select(). 
> 
> Get the date of the stuck message, and review your logs to see if there are 
> any errors in syslog around that time, or a little bit later. 

Happened again, not surprisingly 7 days from when I restarted the
server. I found this process hanging around:

daemon   28027  0.0  0.0  1916 1008 ?        S    Dec07   0:00 courieresmtp 0 ofr.pm0.net

It's been sitting for 7 days.

In the logs I see:

Dec  7 22:31:29 quattro courierd: started,id=00077C44.3C118991.00006D25,from=<[EMAIL PROTECTED]>,module=local,host=robjohn!!510!510!/home/robjohn!!,addr=<robjohn>
Dec  7 22:31:29 quattro courierlocal: id=00077C44.3C118991.00006D25,from=<[EMAIL PROTECTED]>,addr=<[EMAIL PROTECTED]>,size=3209,success: Message delivered.
Dec  7 22:31:29 quattro courierd: started,id=00077C45.3C118991.00006D2A,from=<[EMAIL PROTECTED]>,module=esmtp,host=mediaone.net,addr=<[EMAIL PROTECTED]>
Dec  7 22:31:35 quattro courieresmtp: id=00077C45.3C118991.00006D2A,from=<[EMAIL PROTECTED]>,addr=<[EMAIL PROTECTED]>: 550 5.7.1 <[EMAIL PROTECTED]>... Access denied
Dec  7 22:31:35 quattro courieresmtp: id=00077C45.3C118991.00006D2A,from=<[EMAIL PROTECTED]>,addr=<[EMAIL PROTECTED]>,status: failure
Dec  7 22:31:35 quattro courierd: completed,id=00077C45.3C118991.00006D2A
Dec  7 22:31:35 quattro courierd: started,id=00077C45.3C118991.00006D2A,from=<>,module=dsn,host=,addr=<[EMAIL PROTECTED]>
Dec  7 22:31:35 quattro courierd: newmsg,id=00077C44.3C118997.00006D32
Dec  7 22:31:35 quattro courierd: started,id=00077C44.3C118997.00006D32,from=<>,module=esmtp,host=ofr.pm0.net,addr=<[EMAIL PROTECTED]>
Dec  7 22:37:35 quattro courieresmtp: id=00077C44.3C118997.00006D32,from=<>,addr=<[EMAIL PROTECTED]>: Connection timed out
Dec  7 22:37:35 quattro courieresmtp: id=00077C44.3C118997.00006D32,from=<>,addr=<[EMAIL PROTECTED]>,status: deferred

I see later:

Dec  7 22:42:35 quattro courierd: started,id=00077C44.3C118997.00006D32,from=<>,module=esmtp,host=ofr.pm0.net,addr=<[EMAIL PROTECTED]>

I don't see any other messages for id 00077C44.3C118997.00006D32.
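
In case it helps anyone hunting for other messages in the same state: a rough Python sketch that scans syslog lines for ids whose last logged event is a "started" entry with nothing after it. It assumes each syslog entry sits on a single line and that the markers seen in the excerpt above ("courierd: started,", ",status:", ",success:", "courierd: completed,") are enough to tell start from finish. unfinished_ids is a made-up helper name, not part of Courier.

```python
import re

# Message ids in these logs look like 00077C44.3C118997.00006D32
ID_RE = re.compile(r"id=([0-9A-Fa-f.]+)")

def unfinished_ids(lines):
    """Return the ids whose most recent logged event is a 'started' line."""
    last_event = {}
    for line in lines:
        match = ID_RE.search(line)
        if not match:
            continue
        msg_id = match.group(1)
        if "courierd: started," in line:
            last_event[msg_id] = "started"
        elif (",status:" in line or ",success:" in line
              or "courierd: completed," in line):
            last_event[msg_id] = "finished"
    return sorted(i for i, ev in last_event.items() if ev == "started")
```

Something like unfinished_ids(open("/var/log/mail.log")) would list the candidates; the log path is a guess and depends on the syslog configuration. Run against a live log it will also report deliveries that are simply still in progress, so the timestamps still need a look.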

An strace on the aforementioned process resulted in:

quattro:~# strace -p 28027
read(6, 

Which just sits there.
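
For anyone following along, the select()-before-read() pattern Sam describes looks roughly like the sketch below. This is a minimal Python illustration for Unix-like systems, not taken from the Courier source; read_with_timeout and its parameters are names I made up for the example.

```python
import os
import select

def read_with_timeout(fd, nbytes, timeout_s):
    """Read up to nbytes from fd, raising TimeoutError if nothing arrives in time."""
    # Wait until fd is readable or the timeout expires.
    readable, _, _ = select.select([fd], [], [], timeout_s)
    if not readable:
        raise TimeoutError(f"no data on fd {fd} after {timeout_s}s")
    # select() reported readiness, so this read should return promptly.
    return os.read(fd, nbytes)
```

With a guard like this, a silent peer produces a timeout error instead of an indefinite block in read(). The select(2) man page on Linux notes that a descriptor can be reported readable and a later read can still block, so putting the descriptor in non-blocking mode as well is the safer variant.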

JE


