On Fri, Dec 07, 2001, Sam Varshavchik <[EMAIL PROTECTED]> wrote:
> Johannes Erdfelt writes:
>
> > On Thu, Dec 06, 2001, Gordon Messmer <[EMAIL PROTECTED]> wrote:
> >> On Thu, 6 Dec 2001, Johannes Erdfelt wrote:
> >> > The mail server is busy much of the time, but I don't think it's busy
> >> > enough to naturally hit the respawnhi timeout. It looks like somehow
> >> > courier missed that a child finished and that's why it hit the respawnhi
> >> > timeout.
> >>
> >> I was wrong about that. The child processes are still legitimately
> >> running. As fate would have it just as I started this email, I was pulled
> >> in to some mail server issues and noticed that the respawnhi thing had
> >> happened again. All of the couriersmtp processes were stuck in a read()
> >> system call on fd 5. I have the control file from a couple, and there are
> >> lots of DNS failures recorded.
> >>
> >> It's much too late to do any debugging right now, but I'll be over this
> >> tomorrow. In any case, it's not that courierd isn't harvesting children,
> >> it's that the children are blocking on an unprotected read(). (I thought
> >> they all had alarms in place... /me shrugs)
> >
> > I checked for any running processes, but I couldn't find any. I do have
> > lots of courier related process running (authdaemon, pop and imap) so I
> > may have missed one.
> >
> > Either way, my system sat for 6 hours or so doing nothing. If you're
> > right that there was a process still running, something is missing a
> > timeout.
> >
> > I wonder what the longest timeout is. I guess presumably the respawnhi
> > could happen at a time right after a legitimate process is spawned which
> > then needs to timeout to a client, there will always be the chance that
> > courier just stops delivering email for a while.
> >
> > respawnhi seems to need some sort of timeout, even if it's extremely
> > long.
>
> The server is designed to restart itself only when no mail is pending.
>
> The problem is that the client should not be stuck like that. There's a
> select() before every read from the socket, so if anything, it should be
> stuck in a select().
>
> Get the date of the stuck message, and review your logs to see if there are
> any errors in syslog around that time, or a little bit later.
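
For reference, the "select() before every read" pattern described above
could be sketched roughly as follows. This is only an illustration in
Python, not Courier's actual C code: the helper name read_with_timeout,
the 0.2 second timeout and the sample bytes are all made up for the
example.

```python
import os
import select
import socket


def read_with_timeout(fd, nbytes, timeout):
    """Wait up to `timeout` seconds for fd to become readable, then read.

    Returns the bytes read (b"" on EOF), or None if the wait timed out.
    Hypothetical helper for illustration; not taken from Courier.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        # The peer stayed silent: report a timeout instead of hanging.
        return None
    return os.read(fd, nbytes)


# Demonstration on a local socket pair standing in for an SMTP connection.
a, b = socket.socketpair()

# Nothing has been sent yet, so the guarded read gives up after the timeout.
silent = read_with_timeout(a.fileno(), 4096, 0.2)

# Once the other end writes something, the same call returns the data.
b.sendall(b"220 ready\r\n")  # placeholder bytes, not a real server reply
greeting = read_with_timeout(a.fileno(), 4096, 0.2)

a.close()
b.close()
```

Note that this only bounds the wait for readiness. If the descriptor is
in blocking mode, the read itself can in principle still block (the
Linux select(2) man page documents such cases), so a non-blocking
descriptor or an alarm would be an extra safeguard.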
Happened again, not surprisingly 7 days from when I restarted the server.

I found this process hanging around:

daemon 28027 0.0 0.0 1916 1008 ? S Dec07 0:00 courieresmtp 0 ofr.pm0.net

It's been sitting for 7 days.

In the logs I see:

Dec 7 22:31:29 quattro courierd: started,id=00077C44.3C118991.00006D25,from=<[EMAIL PROTECTED]>,module=local,host=robjohn!!510!510!/home/robjohn!!,addr=<robjohn>
Dec 7 22:31:29 quattro courierlocal: id=00077C44.3C118991.00006D25,from=<[EMAIL PROTECTED]>,addr=<[EMAIL PROTECTED]>,size=3209,success: Message delivered.
Dec 7 22:31:29 quattro courierd: started,id=00077C45.3C118991.00006D2A,from=<[EMAIL PROTECTED]>,module=esmtp,host=mediaone.net,addr=<[EMAIL PROTECTED]>
Dec 7 22:31:35 quattro courieresmtp: id=00077C45.3C118991.00006D2A,from=<[EMAIL PROTECTED]>,addr=<[EMAIL PROTECTED]>: 550 5.7.1 <[EMAIL PROTECTED]>... Access denied
Dec 7 22:31:35 quattro courieresmtp: id=00077C45.3C118991.00006D2A,from=<[EMAIL PROTECTED]>,addr=<[EMAIL PROTECTED]>,status: failure
Dec 7 22:31:35 quattro courierd: completed,id=00077C45.3C118991.00006D2A
Dec 7 22:31:35 quattro courierd: started,id=00077C45.3C118991.00006D2A,from=<>,module=dsn,host=,addr=<[EMAIL PROTECTED]>
Dec 7 22:31:35 quattro courierd: newmsg,id=00077C44.3C118997.00006D32
Dec 7 22:31:35 quattro courierd: started,id=00077C44.3C118997.00006D32,from=<>,module=esmtp,host=ofr.pm0.net,addr=<[EMAIL PROTECTED]>
Dec 7 22:37:35 quattro courieresmtp: id=00077C44.3C118997.00006D32,from=<>,addr=<[EMAIL PROTECTED]>: Connection timed out
Dec 7 22:37:35 quattro courieresmtp: id=00077C44.3C118997.00006D32,from=<>,addr=<[EMAIL PROTECTED]>,status: deferred

I see later:

Dec 7 22:42:35 quattro courierd: started,id=00077C44.3C118997.00006D32,from=<>,module=esmtp,host=ofr.pm0.net,addr=<[EMAIL PROTECTED]>

I don't see any other messages for id 00077C44.3C118997.00006D32.

An strace on the aforementioned process resulted in:

quattro:~# strace -p 28027
read(6,

Which just sits there.
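
In case it helps anyone chasing the same thing, a small script along
these lines could pick out delivery attempts that were started but
never logged a result. It is only a sketch: it assumes the syslog line
format matches the excerpt above ("courierd: started,id=..." followed
by later lines carrying the same id), and the function name
unfinished_attempts is my own invention, not part of Courier.

```python
import re

# Assumption: message ids look like the ones in the excerpt above,
# i.e. three dot-separated hex groups following "id=".
ID_RE = re.compile(r"\bid=([0-9A-Fa-f]+\.[0-9A-Fa-f]+\.[0-9A-Fa-f]+)")


def unfinished_attempts(lines):
    """Map message id -> last 'started' line that no later line followed up."""
    pending = {}
    for line in lines:
        match = ID_RE.search(line)
        if match is None:
            continue
        msgid = match.group(1)
        if "courierd: started," in line:
            # A new delivery attempt begins; remember it until something follows.
            pending[msgid] = line.rstrip("\n")
        else:
            # Any later line for this id (error, status, completed) closes it.
            pending.pop(msgid, None)
    return pending


# Three lines taken from the excerpt above, addresses as shown by the archive.
sample = [
    "Dec 7 22:31:35 quattro courierd: started,id=00077C44.3C118997.00006D32,"
    "from=<>,module=esmtp,host=ofr.pm0.net,addr=<[EMAIL PROTECTED]>",
    "Dec 7 22:37:35 quattro courieresmtp: id=00077C44.3C118997.00006D32,"
    "from=<>,addr=<[EMAIL PROTECTED]>,status: deferred",
    "Dec 7 22:42:35 quattro courierd: started,id=00077C44.3C118997.00006D32,"
    "from=<>,module=esmtp,host=ofr.pm0.net,addr=<[EMAIL PROTECTED]>",
]

stuck = unfinished_attempts(sample)
for msgid, line in stuck.items():
    print(msgid, "->", line)
```

On the three sample lines this reports the 22:42:35 attempt, since the
earlier one was closed by the "status: deferred" line. Run against a
full log it would also list attempts that are simply still in progress,
so the timestamps need a sanity check by hand.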
JE
