On 2019-07-29, Raimo Niskanen <[email protected]> wrote:
> A new hang, which I tried to investigate:
>
> On July 19 the last log entry from my 'ps' log was from 14:55, which is
> also the time on the 'systat vmstat' screen when it froze.  Then the machine
> hums along, but just after midnight at 00:42:01 the first "/bsd: process:
> table is full" entry appears.  That message repeated until I rebooted it
> today, July 29, at 10:48.
>
> I had a terminal with top running.  It was still updating.  It showed about
> 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then (after
> the process table had gotten full) it had 1282 idle processes and 1 on
> processor, which was 'top' itself.
> Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
>
> I had 8 shells under tmux ready for debugging.  'ls' worked.
> 'systat' on one hung.  'top' on another failed with "cannot fork".
> 'exec ps ajxww' printed two lines with /sbin/init and /sbin/slaac
> and then hung.  'exec reboot' did not succeed.  Neither did a short press
> of the power button; that at least caused a printout "stopping daemon
> nginx(failed)", but got no further.  I had to do a hard power off.
>
> My theory now is that our daily tests right before 14:55 started a process
> (this process is the top 'top' process with 10:14 execution time) that
> triggers a lock or other contention problem in the kernel which causes
> one CPU to spin in the system, and blocks processes from dying.
> About 10 hours later the process table gets full.
>
> Any, ANY ideas of how to proceed would be appreciated!
>
> Best Regards

Did you notice any odd wait channels (the WAIT column in top output)?

Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
set), then try to break into DDB during a hang and see how things look
in ps there. (Test breaking into DDB before a hang first so you know
that you can do it .. you can just "c" to continue).
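
As a minimal sketch of that setting (assuming a stock /etc/sysctl.conf;
the comment line is only illustrative), the entry would look like the
fragment below. After the reboot, "sysctl ddb.console" should report 1.
How you break in depends on the console; see ddb(4) for the key sequence
on a local keyboard versus a BREAK on a serial console.

```
# /etc/sysctl.conf
# Allow entering the kernel debugger from the console.
ddb.console=1
```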

There might also be clues in things like "sh malloc" or "sh all pools".

Perhaps you could also get clues from running a kernel built with
'option WITNESS'.  You may get some messages in dmesg, and it also adds
commands to ddb like "show locks", "show all locks", "show witness" (see
ddb(4) for details).
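
Put together, a session at the ddb prompt during a hang might look
roughly like the sketch below. It only lists the commands mentioned
above and shows no output. The "show all locks" line assumes a kernel
built with 'option WITNESS'; without it that command is not available.

```
ddb> ps
ddb> show malloc
ddb> show all pools
ddb> show all locks
ddb> continue
```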

Can you provoke a hang by running this process manually?
