Re: How to debug hanging machines / proc: table is full

Andreas Kusalananda Kähäri Fri, 01 Nov 2019 03:55:58 -0700

On Fri, Nov 01, 2019 at 11:38:06AM +0100, Raimo Niskanen wrote:
> On Mon, Sep 09, 2019 at 05:44:32PM +0200, Raimo Niskanen wrote:
> > On Mon, Sep 09, 2019 at 05:42:02PM +0200, Raimo Niskanen wrote:
> > > On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> > > > On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > > > > On 2019-07-29, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > > > > > A new hang, I tried to invstigate:
> > > > > >
> > > > > > At July 19 the last log entry from my 'ps' log was from 14:55, 
> > > > > > which is
> > > > > > also the time on the 'systat vmstat' screen when it froze.  Then 
> > > > > > the machine
> > > > > > hums along but just after midnight at 00:42:01 the first "/bsd: 
> > > > > > process:
> > > > > > table is full" entry appears.  That message repeats until I 
> > > > > > rebooted it
> > > > > > today at July 29 10:48.
> > > > > >
> > > > > > I had a terminal with top running.  It was still updating.  It 
> > > > > > showed about
> > > > > > 98% sys and 2% spin on one of 4 CPUs, the others 100% idle.  Then 
> > > > > > (after
> > > > > > the process table had gotten full) it had 1282 idle processes and 1 
> > > > > > on
> > > > > > processor, which was 'top' itself.
> > > > > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M Swap: 0K/16G.
> > > > > >
> > > > > > I had 8 shells under tmux ready for debugging.  'ls worked.
> > > > > > 'systat' on one hung.  'top' on another failed with "cannot fork".
> > > > > > 'exec ps ajxww" printed two lines with /sbin/init and /sbin/slaac
> > > > > > and then hung.  'exec reboot' did not succeed.  Neither did a short 
> > > > > > power
> > > > > > button, that at least caused a printout "stopping daemon 
> > > > > > nginx(failed)",
> > > > > > but got no further.  I had to do a hard power off. 
> > > > > >
> > > > > > My theory now is that our daily tests right before 14:55 started a 
> > > > > > process
> > > > > > (this process is the top 'top' process with 10:14 execution time) 
> > > > > > that
> > > > > > triggers a lock or other contention problem in the kernel which 
> > > > > > causes
> > > > > > one CPU to spin in the system, and blocks processes from dying.
> > > > > > About 10 hours later the process table gets full.
> > > > > >
> > > > > > Any, ANY ideas of how to proceed would be appreciated!
> > > > > >
> > > > > > Best Regards
> > > > > 
> > > > > Did you notice any odd waitchan's (WAIT in top output)?
> > > > > 
> > > > > Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> > > > > set), then try to break into DDB during a hang and see how things look
> > > > > in ps there. (Test breaking into DDB before a hang first so you know
> > > > > that you can do it .. you can just "c" to continue).
> > > > > 
> > > > > There might also be clues in things like "sh malloc" or "sh all 
> > > > > pools".
> > > > > 
> > > > > Perhaps you could also get clues from running a kernel built with
> > > > > 'option WITNESS', you may get some messages in dmesg, or it adds 
> > > > > commands
> > > > > to ddb like "show locks", "show all locks", "show witness" (see 
> > > > > ddb(4) for
> > > > > details).
> > > > 
> > > > I have enabled Witness, it went so-so.  We'll see what it catches.
> > > > 
> > > > I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> > > > applied all patches for stable 001-006 and built a kernel with:
> > > >   include "arch/amd64/conf/GENERIC"
> > > >   option        MULTIPROCESSOR
> > > >   option        MP_LOCKDEBUG
> > > >   option        WITNESS
> > > > 
> > > > Then I activated in /etc/sysctl.conf:
> > > >   ddb.console=1
> > > >   kern.witness.locktrace=1
> > > >   kern.witness.watch=3
> > > > 
> > > > For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt and 
> > > > typed
> > > > "show witness".  It printed lots of info, I scrolled down to the end, 
> > > > but
> > > > during the printout there was an UVM fault:
> > > > 
> > > >   Spin locks:
> > > >   /usr/src/sys/....
> > > >   :
> > > >   bla bla bla
> > > >   :
> > > >   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
> > > >   kernel: page fault trap, code=0
> > > >   Faulted in DDB: continuing...
> > > > 
> > > > Then I typed "cont" and it panicked.
> > > > If anybody want details I took a picture.
> > > > 
> > > > Have I combined too many debugging options, or is this sh*t that 
> > > > happens?
> > > > 
> > > > Nevertheless, now the machine is running again, with Witness...
> > > > 
> > > > I'll be back.
> > > 
> > > I have encountered some kind of stop, oddly enought not a panic - it
> > > just sat in ddb and I missed it for a week (or more).  Then I did not
> > > remember what I had planned to do so I "improvised" X-| , but anyway:
> > > 
> > > ddb{0}> ps
> > > shows about 350 processes from cron, half of them in state netlock, half
> > 
> > Sorry, that should have been about 1350 processes...
> > 
> > > in state piperd.  Then I have my test processes beam.smp: 6 in netlock, 6
> > > in piperd, about 70 in fsleep, 3 in poll, 3 in select, 4 in kqread.
> > > Then about 100 more ordinary looking processes...
> > > 
> > > ddb{0}> trace
> > > db_enter()...
> > > softclock(0)...
> > > softintr_dispatch(0)...
> > > Xsoftclock(0,0,1388,....)...
> > > acpicpu_idle()...
> > > shed_idle(ffffffff81ceff0)...
> > > end trace frame: 0x0, count: -6
> > > 
> > > ddb{0}> show locks
> > > exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10) locked @
> > > /usr/src/sys/arch/amd64/amd64/softintr.c:87
> > > #0  witness_lock+0x41f
> > > #1  softintr_dispatc+0x56
> > > #2  Xsoftclock+0x1f
> > > #3  acpicpu_idle+0x271
> > > #4  sched_idle+0x235
> > > #5  proc_trampoline+0x1c
> 
> I have gotten two more hangs, no panik.  Had to go to the debug console
> with Alt+Ctrl+Esc.  show all locks shows this:


Did you ever investigate what left those 1350 cron jobs running (i.e.
actually look at the crontabs to see if they were correct and test run
the jobs to make sure that they run as expected)?  Is that amount of
cron jobs executing at once normal on your system?


> 
>   Process nnnnn (softnet) thread 0xffff8000ffffe4b0 (461575)
>   exclusive rwlock netlock r = 0 (0xffffffff81cfb420) locked @
>   /usr/src/sys/net/if.c:950
>   #0 witness_lock+0x41f
>   #1 if_netisr+0x22
>   #2 taskq_thread+0x4d
>   #3 proc_trampoline+0x1c
> 
>   Process nnnnn (idle0) thread 0xffff8000fffff518 (155781)
>   exclusive kernel_lock &kernel_lock r= 0 (0xffffffff81e177c0) locked @
>   /usr/src/sys/arch/amd64/amd64/softintr.c:87
>   #0 witness_lock+0x41f
>   #1 softintr_dispatch+0x56
>   #2 Xsoftclock+0x1f
>   #3 acpicpu_idle+0x271
>   #4 shec_idle+0x235
>   #5 proc_trampoline+0x1c
> 
> Any clues, anyone?
> The machine hangs about once a week.
> 
> 
> > > 
> > > ddb{0}> show nfsnode
> > > size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152
> > > 
> > > (I think the size looks strange)
> > > 
> > > Then I tried show map and got a protection fault trap, gave up and
> > > rebooted.
> > > 
> > > That was it!  Next time I will try:
> > >   trace
> > >   ps
> > >   show malloc
> > >   show all pools
> > >   show locks
> > >   show all locks
> > > unless anyone has got more or better suggestions...
> > > 
> > > Best Regards
> > > -- 
> > > 
> > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > 
> > -- 
> > 
> > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> 
> -- 
> 
> / Raimo Niskanen, Erlang/OTP, Ericsson AB

-- 
Andreas (Kusalananda) Kähäri
SciLifeLab, NBIS, ICM
Uppsala University, Sweden

Re: How to debug hanging machines / proc: table is full

Reply via email to