On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:
> On 2019-07-02, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > Hi misc@!
> >
> > If anyone has got some tips about how to debug two hanging machines we have
> > in our test lab I am eager to learn.
> >
> > The machines run 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> > openup.  Other than that they are rather different, one small Zotac
> > ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> > R230 with Intel Xeon E3-1220.
> >
> > The overall symptoms are that it is possible to switch screens using
> > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
> > have to long press the power button to force power off.
> >
> > This happens during our nightly tests, which are quite resource intensive.
> >
> > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > possibly before the machines become unresponsive, but these entries appear
> > many more times before that point.  And after this "table is full" message
> > there are many syslog entries; on one machine smartd constantly complains about
> > an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> > the other the kernel complains about a probed monitor but no|invalid EDID.
> >
> > So it seems the machine is out of some resource and fails to spawn a login
> > shell.  Any clues to how I can find more details and a remedy?  I suspect a
> > full process table, but wonder how to detect and|or avoid that.
> >
> > I have considered having systat running on a console screen but do not know
> > which systat display that might tell me anything.
> >
> > Best regards
> 
> "/bsd: proc: table is full" means that the process table is full, but it
> doesn't tell you what caused this.
> 
> The process table size is controlled by kern.maxproc; it is possible
> that the default is insufficient for your needs, but it's also possible
> that there was a build-up of processes that didn't exit due to another
> problem on the system.
> 
> I would leave top(1) running on the system, and also save "ps ax" output
> regularly, then look at that output in the run-up to a failure, to see
> if that gives clues.
> 
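
A minimal sketch of the periodic-snapshot idea above, in Python. It assumes
the default OpenBSD "ps ax" column layout (PID TT STAT TIME COMMAND) and the
kern.nprocs and kern.maxproc sysctls; the function names, the log path and
the interval are hypothetical, not something from our setup.

```python
import os
import subprocess
import time
from collections import Counter


def sysctl_int(name):
    """Return an integer sysctl value, e.g. kern.nprocs or kern.maxproc."""
    out = subprocess.run(["sysctl", "-n", name], capture_output=True,
                         text=True, check=True).stdout
    return int(out.strip())


def top_commands(ps_output, n=5):
    """Count processes per command name in 'ps ax' output.

    Assumes the header line PID TT STAT TIME COMMAND, so the command is
    everything from the fifth column onwards.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:      # skip the header line
        fields = line.split(None, 4)
        if len(fields) == 5:
            counts[fields[4].split()[0]] += 1
    return counts.most_common(n)


def snapshot(logfile):
    """Append one timestamped snapshot of the process table to logfile."""
    ps = subprocess.run(["ps", "ax"], capture_output=True, text=True,
                        check=True).stdout
    nprocs = sysctl_int("kern.nprocs")
    maxproc = sysctl_int("kern.maxproc")
    with open(logfile, "a") as f:
        f.write("%s nprocs=%d maxproc=%d top=%r\n" % (
            time.strftime("%Y-%m-%d %H:%M:%S"), nprocs, maxproc,
            top_commands(ps)))
        f.write(ps)
        f.flush()
        os.fsync(f.fileno())     # try to get it on disk before a hang


def log_forever(logfile, interval):
    """Take a snapshot every 'interval' seconds until interrupted."""
    while True:
        snapshot(logfile)
        time.sleep(interval)
```

Something like log_forever("/var/log/proc-snapshots.log", 60), started before
the nightly tests, would leave a trail to read after the next forced power
off. Once the process table is already full the "ps" and "sysctl" calls will
themselves fail to spawn, so only the snapshots from the run-up are useful.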

It seems that the full process table is a secondary symptom, and that there
is something else that happens on the machines a few hours before the
process table fills...

On one machine I had left "systat pigs" running, and the last thing it
showed was about 90% for softnet and the rest <idle>, IIRC.

I have now corrected a presumably unrelated error in our nightly tests that
occurred just before the freeze.  The test started a child process that was
abandoned, and when it noticed its controlling socket close it started to
write an error log.  Previously that froze sometimes and a few hours later
the process table got full.  Now the child process is not abandoned, and
I have not seen the freeze since...
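
To illustrate what "not abandoned" means here, a rough Python sketch (not
the actual test code): the parent always closes the child's input and then
waits for it, killing it if it does not exit in time.  The helper name
run_with_child and the use of stdin in place of the controlling socket are
assumptions made for the example.

```python
import subprocess
import sys


def run_with_child(child_argv, body, timeout=5.0):
    """Start a helper child, run body(child), and always reap the child."""
    child = subprocess.Popen(child_argv, stdin=subprocess.PIPE)
    try:
        return body(child)
    finally:
        # Closing stdin stands in for closing the controlling socket.
        if child.stdin:
            child.stdin.close()
        try:
            child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            child.kill()     # did not exit on its own: force it
            child.wait()     # reap it so its process table slot is freed
```

For example, run_with_child([sys.executable, "-c", "import sys;
sys.stdin.read()"], lambda c: c.pid) starts a child that exits as soon as
its input is closed, and returns only after that child has been reaped.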

Still chasing ghosts, this can simply not be over yet.

Best Regards
-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
