Re: NFS related panic? (was: Re: Killing a zombie process?)
On Fri 23 Oct 2015 at 00:46:57 +0200, Rhialto wrote:
> This problem is very repeatable, usually within a few hours, just now it
> happened within half an hour.
>
> It seems to me that somehow the nfs_reqq list gets corrupted. Then
> either there is a crash when traversing it in nfs_timer() (occurring in
> nfs_sigintr() due to being called with a bogus pointer), or there is a
> hang when one of the NFS requests gets lost and never retried.

I tried it with a TCP mount for NFS. It still hangs, this time after a bit
under an hour of uptime. So the cause is likely not packet loss.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl    -- 'this bath is too hot.'
Re: NFS related panic? (was: Re: Killing a zombie process?)
This problem is very repeatable, usually within a few hours; just now it
happened within half an hour.

It seems to me that somehow the nfs_reqq list gets corrupted. Then
either there is a crash when traversing it in nfs_timer() (occurring in
nfs_sigintr() due to being called with a bogus pointer), or there is a
hang when one of the NFS requests gets lost and never retried.

-Olaf.
Re: NFS related panic? (was: Re: Killing a zombie process?)
On Tue 20 Oct 2015 at 01:04:59 +0200, Rhialto wrote:
> with a rebuilt netbsd.gdb (hopefully the addresses match)
>
> #5 0x806b94b4 in nfs_sigintr (nmp=0x0, rep=0xfe81163730a8,
>    l=0x0) at ../../../../nfs/nfs_socket.c:871

nmp should not be NULL here... let's look at rep, where it comes from via
"nmp = rep->r_nmp;"

(gdb) print *(struct nfsreq *)0xfe81163730a8
$1 = {r_chain = {tqe_next = 0xfe811edcee40, tqe_prev = 0x1},
  r_mreq = 0x828f9888, r_mrep = 0x0, r_md = 0x0, r_dpos = 0x0,
  r_nmp = 0x0, r_xid = 0, r_flags = 0, r_retry = 0, r_rexmit = 0,
  r_procnum = 0, r_rtt = 0, r_lwp = 0x0}

well, r_chain.tqe_prev looks bogus (unless that's a special marker), so
let's look at tqe_next:

(gdb) print *((struct nfsreq *)0xfe81163730a8)->r_chain.tqe_next
$3 = {r_chain = {tqe_next = 0x0, tqe_prev = 0x15aa3c85d},
  r_mreq = 0xbd83e8af8fe58282, r_mrep = 0x81e39981e3a781e3,
  r_md = 0xe39d81e38180e38c, r_dpos = 0x8890e5b4a0e5ae81,
  r_nmp = 0xe57baf81e3ab81e3, r_xid = 2179183259,
  r_flags = -1565268289, r_retry = 0, r_rexmit = 0,
  r_procnum = 1520683101, r_rtt = 1, r_lwp = 0x80e39981e3a781e3}

well, even more bogus.

Too bad that the next frame has its argument optimized out...

-Olaf.
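[Editorial sketch, not part of the original message.] The reason tqe_prev = 0x1 looks bogus is the usual TAILQ back-link invariant from <sys/queue.h>: an element's tqe_prev should point at the pointer slot (the previous element's tqe_next, or the head's tqh_first) that points back at the element. A minimal user-space sketch of such a check follows. The names struct req, req_link_ok() and count_bad_links() are hypothetical and do not exist in the NetBSD sources, and struct req is only a stand-in for struct nfsreq.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct nfsreq; only the list linkage matters. */
struct req {
	TAILQ_ENTRY(req) r_chain;
	uint32_t r_xid;
};
TAILQ_HEAD(reqhead, req);

/*
 * Check the back link of one element: tqe_prev must point at the
 * pointer slot that points back at elm. Only NULL and misalignment
 * are rejected before dereferencing; a real kernel check would also
 * have to validate that the address is mapped.
 */
static bool
req_link_ok(const struct req *elm)
{
	struct req *const *prevp = elm->r_chain.tqe_prev;

	if (prevp == NULL || ((uintptr_t)prevp % sizeof(void *)) != 0)
		return false;
	return *prevp == elm;
}

/* Walk the list forward and count elements with an inconsistent back link. */
static int
count_bad_links(struct reqhead *head)
{
	struct req *rep;
	int bad = 0;

	TAILQ_FOREACH(rep, head, r_chain) {
		if (!req_link_ok(rep))
			bad++;
	}
	return bad;
}
```

A value such as 0x1 fails the alignment test immediately, which matches the impression from the gdb output that the entry had already been overwritten or freed by the time nfs_timer() reached it.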
Re: NFS related panic? (was: Re: Killing a zombie process?)
with a rebuilt netbsd.gdb (hopefully the addresses match)

(gdb) target kvm netbsd.5.core
0x8063d735 in cpu_reboot (howto=howto@entry=260, bootstr=bootstr@entry=0x0)
    at ../../../../arch/amd64/amd64/machdep.c:671
671             dumpsys();
(gdb) bt
#0  0x8063d735 in cpu_reboot (howto=howto@entry=260, bootstr=bootstr@entry=0x0)
    at ../../../../arch/amd64/amd64/machdep.c:671
#1  0x80865182 in vpanic (fmt=0x80d123b2 "trap",
    fmt@entry=0x80d123d2 "otection fault", ap=ap@entry=0xfe80b9fc1d10)
    at ../../../../kern/subr_prf.c:340
#2  0x8086523d in panic (fmt=fmt@entry=0x80d123d2 "otection fault")
    at ../../../../kern/subr_prf.c:256
#3  0x808a84d6 in trap (frame=0xfe80b9fc1e30)
    at ../../../../arch/amd64/amd64/trap.c:298
#4  0x80100f46 in alltraps ()
#5  0x806b94b4 in nfs_sigintr (nmp=0x0, rep=0xfe81163730a8, l=0x0)
    at ../../../../nfs/nfs_socket.c:871
#6  0x806b9b0e in nfs_timer (arg=) at ../../../../nfs/nfs_socket.c:752
#7  0x805e9458 in callout_softclock (v=)
    at ../../../../kern/kern_timeout.c:736
#8  0x805df84a in softint_execute (l=, s=, si=)
    at ../../../../kern/kern_softint.c:589
#9  softint_dispatch (pinned=, s=2) at ../../../../kern/kern_softint.c:871
#10 0x8011402f in Xsoftintr ()
(gdb) kvm proc 0xfe813fb39860
nfs_timer (arg=) at ../../../../nfs/nfs_socket.c:735
735     {
(gdb) bt
#0  nfs_timer (arg=) at ../../../../nfs/nfs_socket.c:735
#1  0x in ?? ()

-Olaf.
NFS related panic? (was: Re: Killing a zombie process?)
On Fri 16 Oct 2015 at 16:31:18 +0200, J. Hannken-Illjes wrote:
> On 16 Oct 2015, at 13:44, Rhialto wrote:
>
> > "Interesting" results: it built packages overnight (from around 22:30 to
> > 12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.
>
> With panic?

I re-tried and with a pure GENERIC 7.0 kernel it happened again and now
I have a crash dump. Its dmesg ends with this:

nfs server 10.0.0.16:/mnt/scratch: not responding
nfs server 10.0.0.16:/mnt/scratch: is alive again
fatal page fault in supervisor mode
trap type 6 code 0 rip 806b94b4 cs 8 rflags 10246 cr2 38 ilevel 2 rsp ff fffe80b9fc1f28
curlwp 0xfe813fb39860 pid 0.5 lowest kstack 0xfe80b9fbf2c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
callout_softclock() at netbsd:callout_softclock+0x248
softint_dispatch() at netbsd:softint_dispatch+0x79
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe80b9fc1ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu0: End traceback...

dumping to dev 0,1 (offset=199775, size=1023726):

pid 0.5 is this:

PID    LID S CPU FLAGS       STRUCT LWP *  NAME       WAIT
0    >   5 7   0   200     fe813fb39860    softclk/0

gdb (without debugging symbols) so far thinks this is in nfs_timer():

(gdb) kvm proc 0xfe813fb39860
0x806b9aab in nfs_timer ()
(gdb) bt
#0  0x806b9aab in nfs_timer ()
#1  0x in ?? ()

-Olaf.
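[Editorial sketch, not part of the original message.] On x86, cr2 holds the address whose access caused the page fault, so "cr2 38" means the kernel touched address 0x38. That is what a member read through a NULL structure pointer looks like: the fault address equals the member's offset. It fits the nmp=0x0 seen in nfs_sigintr() in the symbolized backtrace, although which member sits at that offset is not established in this thread. The layout below is hypothetical and is not the real struct nfsmount; it only illustrates the arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real struct nfsmount: seven 8-byte
 * fields followed by the member being read, which therefore sits at
 * offset 7 * 8 = 56 = 0x38.
 */
struct fake_mount {
	uint64_t pad[7];
	int32_t  flag_at_0x38;
};

/*
 * Address that a load of base->flag_at_0x38 would touch, computed
 * arithmetically so that nothing is actually dereferenced.
 */
static uintptr_t
fault_address(uintptr_t base)
{
	return base + offsetof(struct fake_mount, flag_at_0x38);
}
```

With base equal to 0 this yields 0x38, the same value as the cr2 in the dmesg output above.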