On Mon, Sep 09, 2019 at 05:42:02PM +0200, Raimo Niskanen wrote:
> On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> > On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > > On 2019-07-29, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > > > A new hang, I tried to investigate:
> > > >
> > > > On July 19 the last log entry from my 'ps' log was from 14:55, which is
> > > > also the time on the 'systat vmstat' screen when it froze. Then the
> > > > machine hums along, but just after midnight, at 00:42:01, the first
> > > > "/bsd: process: table is full" entry appears. That message repeats
> > > > until I rebooted the machine today, July 29, at 10:48.
> > > >
> > > > I had a terminal with top running, and it was still updating. It showed
> > > > about 98% sys and 2% spin on one of 4 CPUs, the others 100% idle. Then
> > > > (after the process table had gotten full) it showed 1282 idle processes
> > > > and 1 on processor, which was 'top' itself.
> > > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> > > >
> > > > I had 8 shells under tmux ready for debugging. 'ls' worked.
> > > > 'systat' on one hung. 'top' on another failed with "cannot fork".
> > > > 'exec ps ajxww' printed two lines, with /sbin/init and /sbin/slaacd,
> > > > and then hung. 'exec reboot' did not succeed. Neither did a short press
> > > > of the power button; that at least caused the printout "stopping daemon
> > > > nginx(failed)", but got no further. I had to do a hard power off.
> > > >
> > > > My theory now is that our daily tests right before 14:55 started a process
> > > > (the one at the top of the 'top' display, with 10:14 of execution time)
> > > > that triggers a lock or other contention problem in the kernel, which
> > > > causes one CPU to spin in the system and blocks processes from dying.
> > > > About 10 hours later the process table gets full.
> > > >
> > > > Any, ANY ideas of how to proceed would be appreciated!
> > > >
> > > > Best Regards
> > >
> > > Did you notice any odd waitchans (WAIT in top output)?
> > >
> > > Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> > > set), then try to break into DDB during a hang and see how things look
> > > in ps there. (Test breaking into DDB before a hang first so you know
> > > that you can do it... you can just "c" to continue.)
> > >
> > > There might also be clues in things like "sh malloc" or "sh all pools".
> > >
> > > Perhaps you could also get clues from running a kernel built with
> > > 'option WITNESS'; you may get some messages in dmesg, and it adds commands
> > > to ddb like "show locks", "show all locks", "show witness" (see ddb(4) for
> > > details).
> >
> > I have enabled WITNESS; it went so-so. We'll see what it catches.
> >
> > I downloaded the 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> > applied all patches for -stable (001-006), and built a kernel with:
> >
> > include "arch/amd64/conf/GENERIC"
> > option MULTIPROCESSOR
> > option MP_LOCKDEBUG
> > option WITNESS
> >
> > Then I activated in /etc/sysctl.conf:
> >
> > ddb.console=1
> > kern.witness.locktrace=1
> > kern.witness.watch=3
> >
> > For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt, and typed
> > "show witness". It printed lots of info and I scrolled down to the end, but
> > during the printout there was a UVM fault:
> >
> > Spin locks:
> > /usr/src/sys/....
> > :
> > bla bla bla
> > :
> > uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
> > kernel: page fault trap, code=0
> > Faulted in DDB: continuing...
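
(A side note on the kernel build quoted above, in case anyone wants to
repeat it: a config snippet like that normally goes into a file under
/usr/src/sys/arch/amd64/conf/ and is built with config(8). A sketch,
assuming the file was named WITNESS - the name is only an example:

    # cd /usr/src/sys/arch/amd64/conf
    # config WITNESS
    # cd ../compile/WITNESS
    # make
    # make install

and then reboot into the new /bsd.)
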
> >
> > Then I typed "cont" and it panicked.
> > If anybody wants details, I took a picture.
> >
> > Have I combined too many debugging options, or is this sh*t that happens?
> >
> > Nevertheless, now the machine is running again, with WITNESS...
> >
> > I'll be back.
>
> I have encountered some kind of stop, oddly enough not a panic - it
> just sat in ddb and I missed it for a week (or more). Then I did not
> remember what I had planned to do, so I "improvised" X-| , but anyway:
>
> ddb{0}> ps
> shows about 350 processes from cron, half of them in state netlock, half
Sorry, that should have been about 1350 processes...

> in state piperd. Then I have my test processes beam.smp: 6 in netlock, 6
> in piperd, about 70 in fsleep, 3 in poll, 3 in select, 4 in kqread.
> Then about 100 more ordinary-looking processes...
>
> ddb{0}> trace
> db_enter()...
> softclock(0)...
> softintr_dispatch(0)...
> Xsoftclock(0,0,1388,....)...
> acpicpu_idle()...
> sched_idle(ffffffff81ceff0)...
> end trace frame: 0x0, count: -6
>
> ddb{0}> show locks
> exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10) locked @
> /usr/src/sys/arch/amd64/amd64/softintr.c:87
> #0 witness_lock+0x41f
> #1 softintr_dispatch+0x56
> #2 Xsoftclock+0x1f
> #3 acpicpu_idle+0x271
> #4 sched_idle+0x235
> #5 proc_trampoline+0x1c
>
> ddb{0}> show nfsnode
> size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152
>
> (I think the size looks strange.)
>
> Then I tried "show map" and got a protection fault trap, gave up, and
> rebooted.
>
> That was it! Next time I will try:
> trace
> ps
> show malloc
> show all pools
> show locks
> show all locks
> unless anyone has got more or better suggestions...
>
> Best Regards
> --
>
> / Raimo Niskanen, Erlang/OTP, Ericsson AB

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
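
P.S. A sketch of how that next session could be captured, in case anyone
sees a problem with the plan: with sysctl ddb.log=1 (which I believe is
the default) the ddb output is also copied to the kernel message buffer,
and ending with "boot dump" instead of a plain reboot should leave a
crash dump for savecore(8) to pick up on the next boot - assuming swap
is at least the size of RAM and /var/crash has room:

    ddb{0}> trace
    ddb{0}> ps
    ddb{0}> show malloc
    ddb{0}> show all pools
    ddb{0}> show locks
    ddb{0}> show all locks
    ddb{0}> boot dump

After the reboot the dump should turn up as /var/crash/bsd.N together
with /var/crash/bsd.N.core.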