On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:
> On 2019-07-02, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > Hi misc@!
> >
> > If anyone has got some tips about how to debug two hanging machines we have
> > in our test lab I am eager to learn.
> >
> > The machines run 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> > openup. Other than that they are rather different, one small Zotac
> > ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> > R230 with Intel Xeon E3-1220.
> >
> > The overall symptoms are that it is possible to switch screens using
> > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > prompt. Alt+Ctrl+Del does not work. The power button does not work. I
> > have to long press the power button to force power off.
> >
> > This happens during our nightly tests, that are quite resource intensive.
> >
> > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > possibly before the machines become unresponsive, but these entries appear
> > many more times before that point. And after this "table is full" message
> > there are many syslog entries; on one machine smartd constantly complains
> > about an unreadable (pending) sector and atascsi_passthru_done timeout,
> > and on the other the kernel complains about a probed monitor but
> > no|invalid EDID.
> >
> > So it seems the machine is out of some resource and fails to spawn a login
> > shell. Any clues to how I can find more details and a remedy? I suspect a
> > full process table, but wonder how to detect and|or avoid that.
> >
> > I have considered having systat running on a console screen but do not know
> > which systat display that might tell me anything.
> >
> > Best regards
>
> "/bsd: proc: table is full" means that the process table is full, but it
> doesn't tell you what caused this.
>
> The process table size is controlled by kern.maxproc, it is possible
> that the default is insufficient for your needs, but it's also possible
> that there was a build-up of processes that didn't exit due to another
> problem on the system.
>
> I would leave top(1) running on the system, and also save "ps ax" output
> regularly, then look at that output in the run-up to a failure, to see
> if that gives clues.
>
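
For saving "ps ax" output regularly, something like the following might do.
It is only a minimal sketch, assuming Python 3.7 or later is installed on the
test machines. The names snapshot() and watch(), the file naming and the
60 second interval are made up for illustration and are not from any
existing tool.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def snapshot(outdir, cmd=("ps", "ax")):
    """Run cmd once, save its output to a timestamped file.

    Returns (path, nprocs). The directory layout and file naming are
    illustrative assumptions.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    path = outdir / f"ps-{stamp}.txt"
    path.write_text(result.stdout)
    # "ps ax" prints one header line; subtract it to approximate the count.
    nprocs = max(len(result.stdout.splitlines()) - 1, 0)
    return path, nprocs


def watch(outdir, interval=60, iterations=None):
    """Take a snapshot every `interval` seconds.

    iterations=None means run until killed.
    """
    done = 0
    while iterations is None or done < iterations:
        try:
            path, nprocs = snapshot(outdir)
            print(f"{path.name}: {nprocs} processes", flush=True)
        except OSError as err:
            # Starting ps can itself fail once the process table is full,
            # so the failure is logged instead of ending the loop.
            print(f"{datetime.now().isoformat()}: snapshot failed: {err}",
                  flush=True)
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval)
```

Calling watch("/var/tmp/ps-log") from a script started before the nightly
tests would leave one file per minute to read through afterwards (the path is
just an example). The printed process counts can be compared by hand against
the limit reported by "sysctl kern.maxproc".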

It seems that the full process table is a secondary symptom, and that there
is something else that happens on the machines a few hours before the
process table fills...

On one machine I had left "systat pigs" running, and the last thing it
showed was about 90% for softnet and the rest <idle>, IIRC.

I have now corrected a presumably unrelated error in our nightly tests that
occurred just before the freeze. The test started a child process that was
abandoned, and when it noticed its controlling socket close it started to
write an error log. Previously that froze sometimes and a few hours later
the process table got full. Now the child process is not abandoned, and I
have not seen the freeze since...

Still chasing ghosts, this can simply not be over yet.

Best Regards
-- 
/ Raimo Niskanen, Erlang/OTP, Ericsson AB