ZFS panic on a RELENG_8 NFS server (Was: panic: spin lock held too long (RELENG_8 from today))
Hiroki Sato h...@freebsd.org wrote in 20110907.094717.2272609566853905102@allbsd.org:

hr During this investigation a disk had to be replaced and resilvering
hr is now in progress. A deadlock and a forced reboot after that
hr make recovery of the zfs datasets take a long time (for committing
hr logs, I think), so I will try to reproduce the deadlock and get a
hr core dump after it finishes.

I think I could reproduce the symptoms. I cannot tell whether they are exactly the same as those that occurred on my box before, because the kernel was replaced with one built with some debugging options, but they are reproducible at least.

There are two symptoms. One is a panic. The DDB output when the panic occurred is the following:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x10040
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0x8065b926
stack pointer           = 0x28:0xff8257b94d70
frame pointer           = 0x28:0xff8257b94e10
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 992 (nfsd: service)
[ thread pid 992 tid 100586 ]
Stopped at      witness_checkorder+0x246:       movl 0x40(%r13),%ebx

db> bt
Tracing pid 992 tid 100586 td 0xff00595d9000
witness_checkorder() at witness_checkorder+0x246
_sx_slock() at _sx_slock+0x35
dmu_bonus_hold() at dmu_bonus_hold+0x57
zfs_zget() at zfs_zget+0x237
zfs_dirent_lock() at zfs_dirent_lock+0x488
zfs_dirlook() at zfs_dirlook+0x69
zfs_lookup() at zfs_lookup+0x26b
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x81
vfs_cache_lookup() at vfs_cache_lookup+0xf0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x40
lookup() at lookup+0x384
nfsvno_namei() at nfsvno_namei+0x268
nfsrvd_lookup() at nfsrvd_lookup+0xd6
nfsrvd_dorpc() at nfsrvd_dorpc+0x745
nfssvc_program() at nfssvc_program+0x447
svc_run_internal() at svc_run_internal+0x51b
svc_thread_start() at svc_thread_start+0xb
fork_exit() at fork_exit+0x11d
fork_trampoline() at fork_trampoline+0xe
--- trap 0xc, rip = 0x8006a031c, rsp = 0x7fffe6c8, rbp = 0x6 ---

The complete output can be found at:
http://people.allbsd.org/~hrs/zfs_panic_20110909_1/pool-zfs-20110909-1.txt

The other symptom is getting stuck on ZFS access. The kernel keeps running with no panic, but any access to the ZFS datasets makes the accessing program unresponsive. The DDB output can be found at:
http://people.allbsd.org/~hrs/zfs_panic_20110909_2/pool-zfs-20110909-2.txt

The trigger for both was some access to a ZFS dataset from the NFS clients. Because the access pattern was complex I could not narrow down the culprit, but it seems timing-dependent, and simply running rm -rf locally on the server can sometimes trigger them.

The crash dump and the kernel can be found at the following URLs:

panic: http://people.allbsd.org/~hrs/zfs_panic_20110909_1/
no panic but unresponsive: http://people.allbsd.org/~hrs/zfs_panic_20110909_2/
kernel: http://people.allbsd.org/~hrs/zfs_panic_20110909_kernel/

-- Hiroki

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
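For reference, the "rm -rf" trigger described above can be approximated with a small stress script. This is only a sketch: DIR and the tree sizes are placeholder assumptions, not the actual workload from this report, and on the server DIR would point at a ZFS dataset exported over NFS.

```shell
#!/bin/sh
# Hypothetical stress sketch: build a directory tree with many
# files/sub-directories, then remove it all at once, mimicking the
# local "rm -rf" that sometimes triggers the deadlock.
DIR=${DIR:-./zfs-stress}    # placeholder; a ZFS dataset path in practice
mkdir -p "$DIR"
i=1
while [ "$i" -le 500 ]; do
    mkdir -p "$DIR/d$i"
    : > "$DIR/d$i/a"        # create empty files
    : > "$DIR/d$i/b"
    i=$((i + 1))
done
rm -rf "$DIR"               # the removal is the suspected trigger
```

Running this in a loop (possibly with concurrent NFS client access) would be one way to exercise the timing-dependent path.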
Re: panic: spin lock held too long (RELENG_8 from today)
Attilio Rao atti...@freebsd.org wrote in CAJ-FndAChGndC=lkzni7i6mot+spw3-ofto9rh0+5wnnvwz...@mail.gmail.com:

at This should be enough for someone NFS-aware to look into it.
at
at Were you also able to get a core?

Yes. But as kib@ pointed out, it seems to be a deadlock in ZFS. Some experiments I did showed that this deadlock can be triggered at least by running rm -rf against a local directory that has a large number of files/sub-directories.

I then updated the kernel to the latest 8-STABLE plus the WITNESS option, because a fix for the LOR on the spa_config lock had been committed and tracking locks without WITNESS was hard. The deadlock can still be triggered after that.

During this investigation a disk had to be replaced and resilvering is now in progress. A deadlock and a forced reboot after that make recovery of the zfs datasets take a long time (for committing logs, I think), so I will try to reproduce the deadlock and get a core dump after it finishes.

If the old kernel and core of the deadlock I reported on Saturday are still useful for debugging, I can put them somewhere you can access.

-- Hiroki
Re: panic: spin lock held too long (RELENG_8 from today)
On Sat, Sep 03, 2011 at 12:05:47PM +0200, Attilio Rao wrote:
> This should be enough for someone NFS-aware to look into it.
>
> Were you also able to get a core? I'll try to look into it in the
> next days, in particular about the softclock state.

I am absolutely sure that this is a zfs deadlock.
Re: panic: spin lock held too long (RELENG_8 from today)
On Aug 19, 2011, at 7:53 PM, Attilio Rao wrote:
> If nobody complains about it earlier, I'll propose the patch to re@
> in 8 hours.

Just a friendly 'me too', for the records. 22 hours of heavy network/disk I/O and no panic yet -- prior to the patch it was a panic orgy.

Any response from re@ on the patch? It didn't appear to be in stable/8 as of yesterday:

[root@flanker/ttypts/0(../src/sys/kern)#] svn diff
Index: kern_timeout.c
===================================================================
--- kern_timeout.c      (revision 225280)
+++ kern_timeout.c      (working copy)
@@ -268,9 +268,11 @@
        CC_LOCK_ASSERT(cc);
        c->c_cpu = CPUBLOCK;
+       spinlock_enter();
        CC_UNLOCK(cc);
        new_cc = CC_CPU(new_cpu);
        CC_LOCK(new_cc);
+       spinlock_exit();
        c->c_cpu = new_cpu;
        return (new_cc);
 }

Regards,

    Trent.
Re: panic: spin lock held too long (RELENG_8 from today)
2011/9/1 Trent Nelson tr...@snakebite.org:
> Any response from re@ on the patch? It didn't appear to be in
> stable/8 as of yesterday:

It has been committed to stable/8 as r225288.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein
Re: panic: spin lock held too long (RELENG_8 from today)
On 8/18/2011 8:37 PM, Chip Camden wrote:
> st Thanks, Attilio. I've applied the patch and removed the extra debug
> st options I had added (though keeping debug symbols). I'll let you know if
> st I experience any more panics.
>
> No panic for 20 hours at this moment, FYI. For my NFS server, I think
> another 24 hours would be sufficient to confirm the stability. I will
> see how it works...
>
> -- Hiroki
>
> Likewise:
>
> $ uptime
>  5:37PM up 21:45, 5 users, load averages: 0.68, 0.45, 0.63
>
> So far, so good (knocks on head).

0(ns4)% uptime
 8:55AM up 22:39, 3 users, load averages: 0.01, 0.00, 0.00
0(ns4)%

So far so good for me too

    ---Mike

--
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/
Re: panic: spin lock held too long (RELENG_8 from today)
Quoth Mike Tancsa on Friday, 19 August 2011:
> So far so good for me too
> ---Mike

Still up and running here.

 8:02AM up 1 day, 12:10, 4 users, load averages: 0.08, 0.26, 0.52

After the panics began, I never went more than 12 hours without one before applying this patch. I think you nailed it, Attilio. Or at least, you moved it.

--
.O. | Sterling (Chip) Camden      | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91              | http://chipstips.com
Re: panic: spin lock held too long (RELENG_8 from today)
If nobody complains about it earlier, I'll propose the patch to re@ in 8 hours.

Attilio

2011/8/19 Mike Tancsa m...@sentex.net:
> So far so good for me too
> ---Mike
Re: panic: spin lock held too long (RELENG_8 from today)
Attilio Rao atti...@freebsd.org wrote in CAJ-FndDHmwa+=lnggu+5mk2xmtj8kwhb10jsoytkmgetvgn...@mail.gmail.com:

at If nobody complains about it earlier, I'll propose the patch to re@
at in 8 hours.

Running fine for 45 hours so far. Please go ahead!

-- Hiroki
Re: panic: spin lock held too long (RELENG_8 from today)
Chip Camden sterl...@camdensoftware.com wrote in 20110818025550.ga1...@libertas.local.camdensoftware.com:

st Quoth Attilio Rao on Thursday, 18 August 2011:
st In callout_cpu_switch(), if a low-priority thread migrating the
st callout gets preempted after the outgoing CPU queue lock is dropped
st (and is scheduled much later), we get this problem.
st
st In order to fix this bug it could be enough to use a critical section,
st but I think this should really be interrupt-safe, thus I'd wrap them
st up with spinlock_enter()/spinlock_exit(). Fortunately
st callout_cpu_switch() should be called rarely, and we already do
st expensive locking operations in callout, so we should not have a
st problem performance-wise.
st
st Can the guys I also CC'ed here try the following patch, with all the
st initial kernel options that were leading you to the deadlock? (Thus
st revert any debugging patch/option you added for the moment.)
st http://www.freebsd.org/~attilio/callout-fixup.diff
st
st Please note that this patch is for stable/8; if you can confirm the
st good result I'll commit it to -CURRENT and then back-merge as soon as
st possible.
st
st Thanks,
st Attilio
st
st Thanks, Attilio. I've applied the patch and removed the extra debug
st options I had added (though keeping debug symbols). I'll let you know if
st I experience any more panics.

No panic for 20 hours at this moment, FYI. For my NFS server, I think another 24 hours would be sufficient to confirm the stability. I will see how it works...

-- Hiroki
Re: panic: spin lock held too long (RELENG_8 from today)
Quoth Hiroki Sato on Friday, 19 August 2011:
> No panic for 20 hours at this moment, FYI. For my NFS server, I think
> another 24 hours would be sufficient to confirm the stability. I will
> see how it works...

Likewise:

$ uptime
 5:37PM up 21:45, 5 users, load averages: 0.68, 0.45, 0.63

So far, so good (knocks on head).
Re: panic: spin lock held too long (RELENG_8 from today)
Hi,

Mike Tancsa m...@sentex.net wrote in 4e15a08c.6090...@sentex.net:

mi On 7/7/2011 7:32 AM, Mike Tancsa wrote:
mi On 7/7/2011 4:20 AM, Kostik Belousov wrote:
mi
mi BTW, we had a similar panic, "spin lock held too long", the spinlock
mi is the sched lock N, on a busy 8-core box recently upgraded to
mi stable/8. Unfortunately, the machine hung dumping core, so the stack
mi trace for the owner thread was not available.
mi
mi I was unable to make any conclusion from the data that was present.
mi If the situation is reproducible, you could try to revert r221937.
mi This is pure speculation, though.
mi
mi Another crash just now after 5hrs uptime. I will try and revert
mi r221937 unless there is any extra debugging you want me to add to
mi the kernel instead?

I am also suffering from a reproducible panic on an 8-STABLE box, an NFS server with heavy I/O load. I could not get a kernel dump because this panic locked up the machine just after it occurred, but according to the stack trace it was the same as the posted one. Switching to an 8.2R kernel prevents this panic.

Any progress on the investigation?

--
spin lock 0x80cb46c0 (sched lock 0) held by 0xff01900458c0 (tid 100489) too long
panic: spin lock held too long
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
panic() at panic+0x187
_mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
_mtx_lock_spin() at _mtx_lock_spin+0x9e
sched_add() at sched_add+0x117
setrunnable() at setrunnable+0x78
sleepq_signal() at sleepq_signal+0x7a
cv_signal() at cv_signal+0x3b
xprt_active() at xprt_active+0xe3
svc_vc_soupcall() at svc_vc_soupcall+0xc
sowakeup() at sowakeup+0x69
tcp_do_segment() at tcp_do_segment+0x25e7
tcp_input() at tcp_input+0xcdd
ip_input() at ip_input+0xac
netisr_dispatch_src() at netisr_dispatch_src+0x7e
ether_demux() at ether_demux+0x14d
ether_input() at ether_input+0x17d
em_rxeof() at em_rxeof+0x1ca
em_handle_que() at em_handle_que+0x5b
taskqueue_run_locked() at taskqueue_run_locked+0x85
taskqueue_thread_loop() at taskqueue_thread_loop+0x4e
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--

-- Hiroki
Re: panic: spin lock held too long (RELENG_8 from today)
Quoth Hiroki Sato on Thursday, 18 August 2011:
> I am also suffering from a reproducible panic on an 8-STABLE box, an
> NFS server with heavy I/O load. I could not get a kernel dump
> because this panic locked up the machine just after it occurred, but
> according to the stack trace it was the same as the posted one.
> Switching to an 8.2R kernel can prevent this panic.
>
> Any progress on the investigation?

I'm also getting similar panics on 8.2-STABLE. They lock up everything and I have to power off. Once, I happened to be looking at the console when it happened and copied down the following:

Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
panic: sleeping thread
cpuid=1

Another time I got:

lock order reversal:
 1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
 2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587

I didn't copy down the traceback.

These panics seem to hit when I'm doing heavy WAN I/O. I can go for about a day without one as long as I stay away from the web or even chat. Last night this system copied a backup of 35GB over the local network without failing, but as soon as I hopped onto Firefox this morning, down she went. I don't know if that's coincidence or useful data.

I didn't get to say Thanks to Eitan Adler for attempting to help me with this on Monday night. Thanks, Eitan!
Re: panic: spin lock held too long (RELENG_8 from today)
On 8/17/2011 1:38 PM, Hiroki Sato wrote:
> Any progress on the investigation?

Unfortunately, I cannot reproduce it yet with a debugging kernel :(

    ---Mike
Re: panic: spin lock held too long (RELENG_8 from today)
2011/8/17 Hiroki Sato h...@freebsd.org:
> I am also suffering from a reproducible panic on an 8-STABLE box, an
> NFS server with heavy I/O load. I could not get a kernel dump
> because this panic locked up the machine just after it occurred, but
> according to the stack trace it was the same as the posted one.
> Switching to an 8.2R kernel can prevent this panic.
>
> Any progress on the investigation?

Hiroki,
how easily can you reproduce it?

It would be important to have a DDB textdump with this information:
- bt
- ps
- show allpcpu
- alltrace

Alternatively, a coredump from a kernel with the "stop cpu" patch, which Andriy can provide.

Thanks,
Attilio
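For anyone wanting to capture Attilio's list without sitting at the console, ddb(8) scripting plus textdump(4) can record it automatically at panic time. The following is only a sketch under assumptions: the event name and buffer size shown are illustrative, not taken from this thread, and the kernel must include KDB/DDB support.

```shell
# Run the requested DDB commands automatically on panic and save a
# textdump instead of (or before) a full coredump.
ddb script kdb.enter.panic="textdump set; capture on; show allpcpu; bt; ps; alltrace; capture off; call doadump; reset"

# Assumed sizing: enlarge the DDB output-capture buffer so long
# backtraces (ps/alltrace on a busy NFS server) are not truncated.
sysctl debug.ddb.capture.bufsize=524288
```

After the next panic, the captured output lands in the textdump tarball that savecore(8) retrieves from the dump device.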
Re: panic: spin lock held too long (RELENG_8 from today)
Attilio Rao atti...@freebsd.org wrote in caj-fndcdow0_b2mv0lzeo-tpea9+7oanj7ihvkqsm4j4b0d...@mail.gmail.com:

at Hiroki,
at how easily can you reproduce it?

It takes 5-10 hours. I installed another kernel for debugging just now, so I think I will be able to collect more detailed information in a couple of days.

at It would be important to have a DDB textdump with this information:
at - bt
at - ps
at - show allpcpu
at - alltrace
at
at Alternatively, a coredump from a kernel with the "stop cpu" patch,
at which Andriy can provide.

Okay, I will post them once I get another panic. Thanks!

-- Hiroki
Re: panic: spin lock held too long (RELENG_8 from today)
On Wed, Aug 17, 2011 at 10:52:01AM -0700, Chip Camden wrote:
> I'm also getting similar panics on 8.2-STABLE. They lock up everything
> and I have to power off. Once, I happened to be looking at the console
> when it happened and copied down the following:
>
> Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
> panic: sleeping thread
> cpuid=1

No idea, might be relevant to the thread.

> Another time I got:
>
> lock order reversal:
>  1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
>  2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
>
> I didn't copy down the traceback.

snaplk refers to UFS snapshots. The above must have been typed in manually, given the typos in the filenames. Either this is a different problem, or, if everyone in this thread is doing UFS snapshots (dump -L, mksnap_ffs, etc.) and hitting it, then I recommend people stop using UFS snapshots.

I've ranted about their unreliability in the past (years upon years ago -- still seems valid) and just how badly they can wedge a system. This is one of the many (MANY!) reasons why we use rsnapshot/rsync instead. The atime clobbering issue is the only downside.

I don't see what this has to do with heavy WAN I/O unless you're doing something like dump-over-ssh, in which case see the above paragraph.

> These panics seem to hit when I'm doing heavy WAN I/O. I can go for
> about a day without one as long as I stay away from the web or even
> chat. Last night this system copied a backup of 35GB over the local
> network without failing, but as soon as I hopped onto Firefox this
> morning, down she went. I don't know if that's coincidence or useful
> data.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, US  |
| Making life hard for others since 1977.              PGP 4BD6C0CB  |
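The rsnapshot/rsync alternative Jeremy mentions boils down to rotating hard-linked copies, with no filesystem snapshot involved. A minimal sketch, assuming placeholder directory names (rsnapshot automates the rotation and retention):

```shell
# Hard-link rotating backup: each snapshot directory looks like a full
# copy, but unchanged files share inodes with the previous snapshot.
SRC=./data                     # placeholder source tree
mkdir -p "$SRC" snapshots
echo "example" > "$SRC/file.txt"

# First full copy.
rsync -a "$SRC"/ snapshots/daily.1/

# Subsequent run: --link-dest hard-links unchanged files against the
# previous snapshot (path is relative to the destination directory).
rsync -a --link-dest=../daily.1 "$SRC"/ snapshots/daily.0/
```

Because only changed files consume new space, this gives snapshot-like history on plain UFS without mksnap_ffs or dump -L ever touching snaplk.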
Re: panic: spin lock held too long (RELENG_8 from today)
Quoth Jeremy Chadwick on Wednesday, 17 August 2011:
> snaplk refers to UFS snapshots. The above must have been typed in
> manually, given the typos in the filenames. Either this is a different
> problem, or, if everyone in this thread is doing UFS snapshots
> (dump -L, mksnap_ffs, etc.) and hitting it, then I recommend people
> stop using UFS snapshots.

If I'm doing UFS snapshots, I didn't know it.

Yes, everything was copied manually, because it only displays on the console and the keyboard does not respond after that point. So I copied it first to paper, then had to decode my lousy handwriting to put it in an email. Sorry for the scribal errors.
Re: panic: spin lock held too long (RELENG_8 from today)
Hiroki Sato h...@freebsd.org wrote in 20110818.043332.27079545013461535@allbsd.org:

hr Okay, I will post them once I get another panic. Thanks!

I got the panic with a crash dump this time. The result of bt, ps, allpcpu, and traces can be found at the following URL:

http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

-- Hiroki
Re: panic: spin lock held too long (RELENG_8 from today)
2011/8/18 Hiroki Sato h...@freebsd.org:

[...]

> I got the panic with a crash dump this time. The result of bt, ps,
> allpcpu, and traces can be found at the following URL:
>
> http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

I'm not sure I understand it: is a corefile also available? If yes,
where could I get it (with the relevant sources and kernel.debug)?

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein
Re: panic: spin lock held too long (RELENG_8 from today)
2011/8/18 Hiroki Sato h...@freebsd.org:

[...]

> I got the panic with a crash dump this time. The result of bt, ps,
> allpcpu, and traces can be found at the following URL:
>
> http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

Actually, I think I see the bug here.

In callout_cpu_switch(), if a low-priority thread migrating the callout
gets preempted after the outgoing CPU's queue lock is dropped (and is
scheduled again much later), we get this problem.

In order to fix this bug it could be enough to use a critical section,
but I think this should really be interrupt-safe, thus I'd wrap the
window with spinlock_enter()/spinlock_exit(). Fortunately,
callout_cpu_switch() should be called rarely, and we already do
expensive locking operations in the callout code, so we should not have
a problem performance-wise.

Can the guys I also CC'd here try the following patch, with all the
initial kernel options that were leading you to the deadlock (i.e.,
revert any debugging patch/option you added for the moment)?

http://www.freebsd.org/~attilio/callout-fixup.diff

Please note that this patch is for STABLE_8. If you can confirm the
good result, I'll commit it to -CURRENT and then merge it back as soon
as possible.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein
Re: panic: spin lock held too long (RELENG_8 from today)
On Wed, Aug 17, 2011 at 05:01:05PM -0700, Chip Camden wrote:
> Quoth Jeremy Chadwick on Wednesday, 17 August 2011:
>
> > > I'm also getting similar panics on 8.2-STABLE. Locks up everything
> > > and I have to power off. Once, I happened to be looking at the
> > > console when it happened and copied down the following:
> > >
> > >   Sleeping thread (tif 100037, pid 0) owns a non-sleepable lock
> > >   panic: sleeping thread
> > >   cpuid=1
> > >
> > > No idea, might be relevant to the thread. Another time I got:
> > >
> > >   lock order reversal:
> > >   1st 0xff000593e330 snaplk (snaplk) @ /usr/src/sys/kern/vfr_vnops.c:296
> > >   2nd 0xff0005e5d578 ufs (ufs) @ /usr/src/sys/ufs/ffs/ffs_snapshot.c:1587
> > >
> > > I didn't copy down the traceback.
> >
> > snaplk refers to UFS snapshots. The above must have been typed in
> > manually as well, due to some typos in the filenames. Either this is
> > a different problem, or if everyone in this thread is doing UFS
> > snapshots (dump -L, mksnap_ffs, etc.) and having this problem happen,
> > then I recommend people stop using UFS snapshots. I've ranted about
> > their unreliability in the past (years upon years ago -- still seems
> > valid) and just how badly they can wedge a system. This is one of the
> > many (MANY!) reasons why we use rsnapshot/rsync instead. The atime
> > clobbering issue is the only downside.
>
> If I'm doing UFS snapshots, I didn't know it.
>
> > The backtrace indicates that a UFS snapshot is being made -- which
> > causes the state to be set to the string snaplk, which is then
> > honoured in vfs_vnops.c. You can see for yourself:
> > grep -r snaplk /usr/src/sys. So yes, I'm inclined to believe
> > something on your system is doing UFS snapshot generation. Whether or
> > not other people are doing it as well is a different story.
>
> Yes, everything was copied manually because it only displays on the
> console and the keyboard does not respond after that point. So I copied
> first to paper, then had to decode my lousy handwriting to put it in an
> email. Sorry for the scribal errors.

That sounds more or less like what I saw with UFS snapshots: the system
would go catatonic in one way or another. It wouldn't hard lock (as in
if you had powered it off, etc.), it would live lock (as in the kernel
was wedged or held up/spinning doing something). I never saw a panic as
a result of UFS snapshots, only what I described here.

TL;DR -- Your system appears to be making UFS snapshots, and that
situation is possibly (likely?) unrelated to the sleeping thread issue
you see that causes a panic.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
Re: panic: spin lock held too long (RELENG_8 from today)
Quoth Attilio Rao on Thursday, 18 August 2011:

[...]

> Can the guys I also CC'd here try the following patch, with all the
> initial kernel options that were leading you to the deadlock (i.e.,
> revert any debugging patch/option you added for the moment)?
>
> http://www.freebsd.org/~attilio/callout-fixup.diff
>
> Please note that this patch is for STABLE_8. If you can confirm the
> good result, I'll commit it to -CURRENT and then merge it back as soon
> as possible.

Thanks, Attilio. I've applied the patch and removed the extra debug
options I had added (though keeping debug symbols). I'll let you know
if I experience any more panics.

Regards,
--
.O. | Sterling (Chip) Camden      | http://camdensoftware.com
..O | sterl...@camdensoftware.com | http://chipsquips.com
OOO | 2048R/D6DBAF91              | http://chipstips.com
Re: panic: spin lock held too long (RELENG_8 from today)
on 07/07/2011 08:55 Mike Tancsa said the following:
> I did a buildworld on this box to bring it up to RELENG_8 for the BIND
> fixes. Unfortunately, the formerly solid box (April 13th kernel)
> panic'd tonight with:
>
> Unread portion of the kernel message buffer:
> spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac8a0 (tid 100107) too long
> panic: spin lock held too long
> cpuid = 0
> Uptime: 13h30m4s
> Physical memory: 2035 MB
>
> Its a somewhat busy box taking in mail as well as backups for a few
> servers over nfs. At the time, it would have been getting about 250Mb/s
> inbound on its gigabit interface. Full core.txt file at
> http://www.tancsa.com/core-jul8-2011.txt

I thought that this was supposed to contain output of 'thread apply all
bt' in kgdb. Anyway, I think that the stacktrace for tid 100107 may have
some useful information:

#0  doadump () at pcpu.h:231
231     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump () at pcpu.h:231
#1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
#2  0xc06fd937 in panic (fmt=Variable fmt is not available.) at /usr/src/sys/kern/kern_shutdown.c:602
#3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0) at /usr/src/sys/kern/kern_mutex.c:490
#4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:526
#5  0xc0720254 in sched_add (td=0xc5dac5c0, flags=0) at /usr/src/sys/kern/sched_ule.c:1119
#6  0xc07203f9 in sched_wakeup (td=0xc5dac5c0) at /usr/src/sys/kern/sched_ule.c:1950
#7  0xc07061f8 in setrunnable (td=0xc5dac5c0) at /usr/src/sys/kern/kern_synch.c:499
#8  0xc07362af in sleepq_resume_thread (sq=0xca0da300, td=0xc5dac5c0, pri=Variable pri is not available.) at /usr/src/sys/kern/subr_sleepqueue.c:751
#9  0xc0736e18 in sleepq_signal (wchan=0xc5fafe50, flags=1, pri=0, queue=0) at /usr/src/sys/kern/subr_sleepqueue.c:825
#10 0xc06b6764 in cv_signal (cvp=0xc5fafe50) at /usr/src/sys/kern/kern_condvar.c:422
#11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.) at /usr/src/sys/rpc/svc.c:342
#12 0xc08ec502 in xprt_active (xprt=0xc95d9600) at /usr/src/sys/rpc/svc.c:378
#13 0xc08ee051 in svc_vc_soupcall (so=0xc6372ce0, arg=0xc95d9600, waitflag=1) at /usr/src/sys/rpc/svc_vc.c:747
#14 0xc075bbb1 in sowakeup (so=0xc6372ce0, sb=0xc6372d34) at /usr/src/sys/kern/uipc_sockbuf.c:191
#15 0xc08447bc in tcp_do_segment (m=0xcaa8d200, th=0xca6aa824, so=0xc6372ce0, tp=0xc63b4d20, drop_hdrlen=52, tlen=1448, iptos=0 '\0', ti_locked=2) at /usr/src/sys/netinet/tcp_input.c:1775
#16 0xc0847930 in tcp_input (m=0xcaa8d200, off0=20) at /usr/src/sys/netinet/tcp_input.c:1329
#17 0xc07ddaf7 in ip_input (m=0xcaa8d200) at /usr/src/sys/netinet/ip_input.c:787
#18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xcaa8d200) at /usr/src/sys/net/netisr.c:859
#19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xcaa8d200) at /usr/src/sys/net/netisr.c:946
#20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xcaa8d200) at /usr/src/sys/net/if_ethersubr.c:894
#21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xcaa8d200) at /usr/src/sys/net/if_ethersubr.c:753
#22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1) at /usr/src/sys/dev/nfe/if_nfe.c:2187
#23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440) at /usr/src/sys/kern/subr_taskqueue.c:248
#24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130) at /usr/src/sys/kern/subr_taskqueue.c:385
#25 0xc06d1027 in fork_exit (callout=0xc07388a0 taskqueue_thread_loop, arg=0xc56ff130, frame=0xc538ed28) at /usr/src/sys/kern/kern_fork.c:861
#26 0xc09a5c24 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275
(kgdb)

--
Andriy Gapon
Re: panic: spin lock held too long (RELENG_8 from today)
On Thu, Jul 07, 2011 at 10:36:42AM +0300, Andriy Gapon wrote:
> on 07/07/2011 08:55 Mike Tancsa said the following:
> > I did a buildworld on this box to bring it up to RELENG_8 for the
> > BIND fixes. Unfortunately, the formerly solid box (April 13th kernel)
> > panic'd tonight with:
> >
> > spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac8a0 (tid 100107) too long
> > panic: spin lock held too long
>
> I thought that this was supposed to contain output of 'thread apply all
> bt' in kgdb. Anyway, I think that the stacktrace for tid 100107 may
> have some useful information:

[...]

BTW, we had a similar panic, spinlock held too long, the spinlock is
the sched lock N, on a busy 8-core box recently upgraded to stable/8.
Unfortunately, the machine hung dumping core, so the stack trace for
the owner thread was not available.

I was unable to make any conclusion from the data that was present.
If the situation is reproducible, you could try to revert r221937.
This is pure speculation, though.
Re: panic: spin lock held too long (RELENG_8 from today)
On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> BTW, we had a similar panic, spinlock held too long, the spinlock is
> the sched lock N, on a busy 8-core box recently upgraded to stable/8.
> Unfortunately, the machine hung dumping core, so the stack trace for
> the owner thread was not available.
>
> I was unable to make any conclusion from the data that was present.
> If the situation is reproducible, you could try to revert r221937.
> This is pure speculation, though.

Another crash just now after 5hrs uptime. I will try and revert r221937
unless there is any extra debugging you want me to add to the kernel
instead? This is an inbound mail server so a little disruption is
possible.

kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...

Unread portion of the kernel message buffer:
spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long
panic: spin lock held too long
cpuid = 0
Uptime: 5h37m43s
Physical memory: 2035 MB
Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5

Reading symbols from /boot/kernel/amdsbwd.ko...Reading symbols from
/boot/kernel/amdsbwd.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/amdsbwd.ko
#0  doadump () at pcpu.h:231
231     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:231
#1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
#2  0xc06fd937 in panic (fmt=Variable fmt is not available.) at /usr/src/sys/kern/kern_shutdown.c:602
#3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0) at /usr/src/sys/kern/kern_mutex.c:490
#4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:526
#5  0xc0720254 in sched_add (td=0xc61892e0, flags=0) at /usr/src/sys/kern/sched_ule.c:1119
#6  0xc07203f9 in sched_wakeup (td=0xc61892e0) at /usr/src/sys/kern/sched_ule.c:1950
#7  0xc07061f8 in setrunnable (td=0xc61892e0) at /usr/src/sys/kern/kern_synch.c:499
#8  0xc07362af in sleepq_resume_thread (sq=0xc55311c0, td=0xc61892e0, pri=Variable pri is not available.) at /usr/src/sys/kern/subr_sleepqueue.c:751
#9  0xc0736e18 in sleepq_signal (wchan=0xc60386d0, flags=1, pri=0, queue=0) at /usr/src/sys/kern/subr_sleepqueue.c:825
#10 0xc06b6764 in cv_signal (cvp=0xc60386d0) at /usr/src/sys/kern/kern_condvar.c:422
#11 0xc08eaa0d in xprt_assignthread (xprt=Variable xprt is not available.) at /usr/src/sys/rpc/svc.c:342
#12 0xc08ec502 in xprt_active (xprt=0xc5db8a00) at /usr/src/sys/rpc/svc.c:378
#13 0xc08ee051 in svc_vc_soupcall (so=0xc618a19c, arg=0xc5db8a00, waitflag=1) at /usr/src/sys/rpc/svc_vc.c:747
#14 0xc075bbb1 in sowakeup (so=0xc618a19c, sb=0xc618a1f0) at /usr/src/sys/kern/uipc_sockbuf.c:191
#15 0xc08447bc in tcp_do_segment (m=0xc6567a00, th=0xc6785824, so=0xc618a19c, tp=0xc617e000, drop_hdrlen=52, tlen=1448, iptos=0 '\0', ti_locked=2) at /usr/src/sys/netinet/tcp_input.c:1775
#16 0xc0847930 in tcp_input (m=0xc6567a00, off0=20) at /usr/src/sys/netinet/tcp_input.c:1329
#17 0xc07ddaf7 in ip_input (m=0xc6567a00) at /usr/src/sys/netinet/ip_input.c:787
#18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xc6567a00) at /usr/src/sys/net/netisr.c:859
#19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xc6567a00) at /usr/src/sys/net/netisr.c:946
#20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xc6567a00) at /usr/src/sys/net/if_ethersubr.c:894
#21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xc6567a00) at /usr/src/sys/net/if_ethersubr.c:753
#22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1) at /usr/src/sys/dev/nfe/if_nfe.c:2187
#23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440) at /usr/src/sys/kern/subr_taskqueue.c:248
#24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130) at /usr/src/sys/kern/subr_taskqueue.c:385
#25 0xc06d1027 in fork_exit (callout=0xc07388a0 taskqueue_thread_loop, arg=0xc56ff130, frame=0xc538ed28) at /usr/src/sys/kern/kern_fork.c:861
#26 0xc09a5c24 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275

--
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/
Re: panic: spin lock held too long (RELENG_8 from today)
On Thu, Jul 07, 2011 at 07:32:41AM -0400, Mike Tancsa wrote:
> On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> > I was unable to make any conclusion from the data that was present.
> > If the situation is reproducible, you could try to revert r221937.
> > This is pure speculation, though.
>
> Another crash just now after 5hrs uptime. I will try and revert r221937
> unless there is any extra debugging you want me to add to the kernel
> instead?

[...]

1. info threads
2. Find the index value that matches the tid in question (in the above
   spin lock panic, that'd be tid 100109). The index value will be the
   first number shown on the left.
3. thread {index}
4. bt

If this doesn't work, alternatively you can try (from the beginning)
"thread apply all bt" and provide the output from that. (It will be
quite lengthy, and at this point I think tid 100109 is the one of
interest in this crash, based on what Andriy said earlier.)
Re: panic: spin lock held too long (RELENG_8 from today)
on 07/07/2011 14:41 Jeremy Chadwick said the following:
> 1. info threads
> 2. Find the index value that matches the tid in question (in the above
>    spin lock panic, that'd be tid 100109). The index value will be the
>    first number shown on the left.
> 3. thread {index}

Just in case: in kgdb there is a command 'tid' that does all of the
above steps in one go.

> 4. bt
>
> If this doesn't work, alternatively you can try (from the beginning)
> "thread apply all bt" and provide the output from that. (It will be
> quite lengthy, and at this point I think tid 100109 is the one of
> interest in this crash, based on what Andriy said earlier.)

--
Andriy Gapon
Re: panic: spin lock held too long (RELENG_8 from today)
On 7/7/2011 7:32 AM, Mike Tancsa wrote: On 7/7/2011 4:20 AM, Kostik Belousov wrote: BTW, we had a similar panic, spinlock held too long, the spinlock is the sched lock N, on busy 8-core box recently upgraded to the stable/8. Unfortunately, machine hung dumping core, so the stack trace for the owner thread was not available. I was unable to make any conclusion from the data that was present. If the situation is reproducable, you coulld try to revert r221937. This is pure speculation, though. Another crash just now after 5hrs uptime. I will try and revert r221937 unless there is any extra debugging you want me to add to the kernel instead ? This is an inbound mail server so a little disruption is possible kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13 GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type show copying to see the conditions. There is absolutely no warranty for GDB. Type show warranty for details. This GDB was configured as i386-marcel-freebsd... Unread portion of the kernel message buffer: spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long panic: spin lock held too long cpuid = 0 Uptime: 5h37m43s Physical memory: 2035 MB Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5 And the second crash from today kgdb /usr/obj/usr/src/sys/recycle/kernel.debug vmcore.13 GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type show copying to see the conditions. There is absolutely no warranty for GDB. Type show warranty for details. This GDB was configured as i386-marcel-freebsd... 
Unread portion of the kernel message buffer:
spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac2e0 (tid 100109) too long
panic: spin lock held too long
cpuid = 0
Uptime: 5h37m43s
Physical memory: 2035 MB
Dumping 260 MB: 245 229 213 197 181 165 149 133 117 101 85 69 53 37 21 5

Reading symbols from /boot/kernel/amdsbwd.ko...Reading symbols from /boot/kernel/amdsbwd.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/amdsbwd.ko
#0  doadump () at pcpu.h:231
231     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) tid 100109
[Switching to thread 82 (Thread 100109)]#0  sched_switch (td=0xc5dac2e0, newtd=0xc553c5c0, flags=260)
    at /usr/src/sys/kern/sched_ule.c:1866
1866            cpuid = PCPU_GET(cpuid);
(kgdb) list
1861            /*
1862             * We may return from cpu_switch on a different cpu.  However,
1863             * we always return with td_lock pointing to the current cpu's
1864             * run queue lock.
1865             */
1866            cpuid = PCPU_GET(cpuid);
1867            tdq = TDQ_CPU(cpuid);
1868            lock_profile_obtain_lock_success(
1869                &TDQ_LOCKPTR(tdq)->lock_object, 0, 0, __FILE__, __LINE__);
1870    #ifdef  HWPMC_HOOKS
(kgdb) p *td
$1 = {td_lock = 0xc0b1d200, td_proc = 0xc5db4000, td_plist = {tqe_next = 0xc5dac5c0, tqe_prev = 0xc5dac008}, td_runq = {tqe_next = 0x0, tqe_prev = 0xc0b1d334}, td_slpq = {tqe_next = 0x0, tqe_prev = 0xc65d3b00}, td_lockq = {tqe_next = 0x0, tqe_prev = 0xc51f6b38}, td_cpuset = 0xc5533e38, td_sel = 0x0, td_sleepqueue = 0xc65d3b00, td_turnstile = 0xc63ceb80, td_umtxq = 0xc5d229c0, td_tid = 100109, td_sigqueue = {sq_signals = {__bits = {0, 0, 0, 0}}, sq_kill = {__bits = {0, 0, 0, 0}}, sq_list = {tqh_first = 0x0, tqh_last = 0xc5dac340}, sq_proc = 0xc5db4000, sq_flags = 1}, td_flags = 4, td_inhibitors = 0, td_pflags = 2097152, td_dupfd = 0, td_sqqueue = 0, td_wchan = 0x0, td_wmesg = 0x0, td_lastcpu = 0 '\0', td_oncpu = 1 '\001', td_owepreempt = 0 '\0', td_tsqueue = 0 '\0', td_locks = -291, td_rw_rlocks = 0, td_lk_slocks = 0, td_blocked = 0x0, td_lockname = 0x0, td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0,
td_intr_nesting_level = 0, td_pinned = 0, td_ucred = 0xc5538100, td_estcpu = 0, td_slptick = 0, td_blktick = 0, td_ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, ru_maxrss = 1048, ru_ixrss = 85216, ru_idrss = 3834720, ru_isrss = 681728, ru_minflt = 0, ru_majflt = 0, ru_nswap = 0, ru_inblock = 82, ru_oublock = 271222, ru_msgsnd = 135625, ru_msgrcv = 2427350, ru_nsignals = 0, ru_nvcsw = 2076938, ru_nivcsw = 731134}, td_incruntime = 852332612, td_runtime = 88202475877, td_pticks = 5326, td_sticks = 48, td_iticks = 0, td_uticks = 0, td_intrval = 0, td_oldsigmask = {__bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {0, 0, 0, 0}}, td_generation = 2808072, td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 0}, td_xsig = 0, td_profil_addr
panic: spin lock held too long (RELENG_8 from today)
I did a buildworld on this box to bring it up to RELENG_8 for the BIND fixes. Unfortunately, the formerly solid box (April 13th kernel) panic'd tonight with:

Unread portion of the kernel message buffer:
spin lock 0xc0b1d200 (sched lock 1) held by 0xc5dac8a0 (tid 100107) too long
panic: spin lock held too long
cpuid = 0
Uptime: 13h30m4s
Physical memory: 2035 MB

It's a somewhat busy box, taking in mail as well as backups for a few servers over NFS. At the time, it would have been getting about 250Mb/s inbound on its gigabit interface. Full core.txt file at http://www.tancsa.com/core-jul8-2011.txt

#0  doadump () at pcpu.h:231
231     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump () at pcpu.h:231
#1  0xc06fd6d3 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:429
#2  0xc06fd937 in panic (fmt=Variable "fmt" is not available.) at /usr/src/sys/kern/kern_shutdown.c:602
#3  0xc06ed95f in _mtx_lock_spin_failed (m=0x0) at /usr/src/sys/kern/kern_mutex.c:490
#4  0xc06ed9e5 in _mtx_lock_spin (m=0xc0b1d200, tid=3312388992, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:526
#5  0xc0720254 in sched_add (td=0xc5dac5c0, flags=0) at /usr/src/sys/kern/sched_ule.c:1119
#6  0xc07203f9 in sched_wakeup (td=0xc5dac5c0) at /usr/src/sys/kern/sched_ule.c:1950
#7  0xc07061f8 in setrunnable (td=0xc5dac5c0) at /usr/src/sys/kern/kern_synch.c:499
#8  0xc07362af in sleepq_resume_thread (sq=0xca0da300, td=0xc5dac5c0, pri=Variable "pri" is not available.) at /usr/src/sys/kern/subr_sleepqueue.c:751
#9  0xc0736e18 in sleepq_signal (wchan=0xc5fafe50, flags=1, pri=0, queue=0) at /usr/src/sys/kern/subr_sleepqueue.c:825
#10 0xc06b6764 in cv_signal (cvp=0xc5fafe50) at /usr/src/sys/kern/kern_condvar.c:422
#11 0xc08eaa0d in xprt_assignthread (xprt=Variable "xprt" is not available.
) at /usr/src/sys/rpc/svc.c:342
#12 0xc08ec502 in xprt_active (xprt=0xc95d9600) at /usr/src/sys/rpc/svc.c:378
#13 0xc08ee051 in svc_vc_soupcall (so=0xc6372ce0, arg=0xc95d9600, waitflag=1) at /usr/src/sys/rpc/svc_vc.c:747
#14 0xc075bbb1 in sowakeup (so=0xc6372ce0, sb=0xc6372d34) at /usr/src/sys/kern/uipc_sockbuf.c:191
#15 0xc08447bc in tcp_do_segment (m=0xcaa8d200, th=0xca6aa824, so=0xc6372ce0, tp=0xc63b4d20, drop_hdrlen=52, tlen=1448, iptos=0 '\0', ti_locked=2) at /usr/src/sys/netinet/tcp_input.c:1775
#16 0xc0847930 in tcp_input (m=0xcaa8d200, off0=20) at /usr/src/sys/netinet/tcp_input.c:1329
#17 0xc07ddaf7 in ip_input (m=0xcaa8d200) at /usr/src/sys/netinet/ip_input.c:787
#18 0xc07b8859 in netisr_dispatch_src (proto=1, source=0, m=0xcaa8d200) at /usr/src/sys/net/netisr.c:859
#19 0xc07b8af0 in netisr_dispatch (proto=1, m=0xcaa8d200) at /usr/src/sys/net/netisr.c:946
#20 0xc07ae5e1 in ether_demux (ifp=0xc56ed800, m=0xcaa8d200) at /usr/src/sys/net/if_ethersubr.c:894
#21 0xc07aeb5f in ether_input (ifp=0xc56ed800, m=0xcaa8d200) at /usr/src/sys/net/if_ethersubr.c:753
#22 0xc09977b2 in nfe_int_task (arg=0xc56ff000, pending=1) at /usr/src/sys/dev/nfe/if_nfe.c:2187
#23 0xc07387ca in taskqueue_run_locked (queue=0xc5702440) at /usr/src/sys/kern/subr_taskqueue.c:248
#24 0xc073895c in taskqueue_thread_loop (arg=0xc56ff130) at /usr/src/sys/kern/subr_taskqueue.c:385
#25 0xc06d1027 in fork_exit (callout=0xc07388a0 <taskqueue_thread_loop>, arg=0xc56ff130, frame=0xc538ed28) at /usr/src/sys/kern/kern_fork.c:861
#26 0xc09a5c24 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275
(kgdb)

--
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada
http://www.tancsa.com/