--On Wednesday, May 15, 2002 2:50 PM -0400 Lawrence Greenfield 
<[EMAIL PROTECTED]> wrote:

>    Date: Wed, 15 May 2002 15:37:50 -0300
>    From: Henrique de Moraes Holschuh <[EMAIL PROTECTED]>
> [...]

>    > We don't have this problem on any of our servers.
>
>    Do you have preforking enabled?  If you do (and if I did understand the
>    issue correctly), start kill -9'ing service processes, and it should
>    be possible to duplicate the bug.  I will try that just now, in fact.
>
> Sure, if you intentionally kill processes that are waiting for
> connections this happens.  I understand this.  But if I did that,
> master would log messages that the processes were dying incorrectly
> ("signaled to death by 9").


Just to follow up a bit more -- I'm not sure if this will be terribly 
helpful, but it may be illustrative of the one situation which might cause 
such a problem.  Here's a snippet of log that occurred shortly before we ran 
into a service problem:

May  2 15:35:14 delrey.acpub.duke.edu master[16360]: can't fork process to run checkpoint
May  2 15:35:14 delrey.acpub.duke.edu last message repeated 45 times
May  2 15:35:15 delrey.acpub.duke.edu master[16360]: process 7494 exited, signaled to death by 9

Preceding this are a couple hundred more messages generally complaining 
about inability to fork, lack of resources, inability to load a shared 
library, and all of the other things one would expect to see when swap runs 
out.  However, I don't see any other mention of process 7494.  (This is 
logging at the user6.info level, so we don't get all of the debug messages, 
unfortunately.)  Now, the core problem here is very simple:  This is a 
poor, beleaguered SPARCstation 20 with 17k mailboxes on it, and during a peak 
period it simply ran out of available memory.  The solution is also 
simple:  we need to upgrade our hardware.  However, since we're currently 
under budget cuts as our Executive VP scrounges up money to go around 
building big buildings, that's not exactly a viable option.

Tracking down the whys and wherefores of a process dying under a resource 
crunch is not likely to be terribly productive -- one can't very well 
expect a service to keep functioning normally under those circumstances. 
However, it would be nice if the master process paid attention to the 
SIGCHLD and the information from the wait() call, and took note of the fact 
that the ex-process has shuffled off this mortal coil, so that 15 minutes 
later, the process miscount doesn't cause the master to start blithely 
ignoring incoming requests.
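
To make the idea concrete, here is a minimal sketch in C of the kind of
bookkeeping described above.  It is not the actual Cyrus master code, and
every name in it (live_workers, spawn_worker, reap_children, idle_worker)
is made up for illustration.  The assumption is simply that the parent
keeps a count of live children and corrects it from waitpid() each time
SIGCHLD has fired.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bookkeeping: how many service children we believe are alive. */
static volatile sig_atomic_t sigchld_seen = 0;
static int live_workers = 0;

/* The handler only sets a flag; the real work happens in reap_children(). */
static void on_sigchld(int signo)
{
    (void)signo;
    sigchld_seen = 1;
}

/* Install the SIGCHLD handler.  Returns 0 on success, -1 on failure. */
int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Stand-in for a preforked service process waiting for a connection. */
void idle_worker(void)
{
    for (;;)
        pause();
}

/* Fork one child and count it.  Returns the child's pid, or -1 on failure. */
pid_t spawn_worker(void (*work)(void))
{
    pid_t pid = fork();

    if (pid == 0) {
        work();
        _exit(0);
    }
    if (pid > 0)
        live_workers++;
    return pid;
}

/* Collect every child that has exited, without blocking, and fix the count.
 * Returns the number of children reaped on this call. */
int reap_children(void)
{
    int status;
    int reaped = 0;
    pid_t pid;

    sigchld_seen = 0;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status))
            fprintf(stderr, "process %ld exited, signaled to death by %d\n",
                    (long)pid, WTERMSIG(status));
        else if (WIFEXITED(status))
            fprintf(stderr, "process %ld exited, status %d\n",
                    (long)pid, WEXITSTATUS(status));
        if (live_workers > 0)
            live_workers--;
        reaped++;
    }
    assert(live_workers >= 0);
    return reaped;
}

int live_worker_count(void)
{
    return live_workers;
}
```

A main loop built on this would call reap_children() whenever
sigchld_seen is set (or simply on every pass) before deciding whether it
has enough idle children to accept new connections.  Looping on waitpid()
with WNOHANG matters because several children can die while only one
SIGCHLD is delivered.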

I hope this is helpful,

Michael Bacon
OIT Systems Administration
Duke University

