Re: ffs snapshot lockup
On Thu, Oct 05, 2006 at 10:01:07AM -0400, Vivek Khera wrote: On Oct 5, 2006, at 4:30 AM, Kostik Belousov wrote: The network load was minimal at the time. I had everyone log out and close mail etc. What were the symptoms of locked system ? Could you log in on console, or do something at the shell prompt on console ? Console was non-responsive. This time dump locked doing /usr so pretty much anything you try to run will block. When the lockup happens when dump is running on my home dir (/u/yertle1) partition, as long as you don't need that partition you can log in and run any programs you like. I have a service account whose home dir is in / var and was able to login that time to that account. No such luck this time since any activity pretty much uses /usr. Ping was responding (our monitoring didn't complain it was down). The only thing I could do was break to debugger on the console. This is very strange. You 3 instances of getty where just reading the tty input, and all suspectible processes (like sshd) are waiting on net events. No processes are blocked on the fs. One nfsd is serving the request, and dump is active. pgpAr4aPXXmZW.pgp Description: PGP signature
Re: ffs snapshot lockup
On Fri, Oct 06, 2006 at 10:39:50AM +0300, Kostik Belousov wrote: On Thu, Oct 05, 2006 at 10:01:07AM -0400, Vivek Khera wrote: On Oct 5, 2006, at 4:30 AM, Kostik Belousov wrote: The network load was minimal at the time. I had everyone log out and close mail etc. What were the symptoms of locked system ? Could you log in on console, or do something at the shell prompt on console ? Console was non-responsive. This time dump locked doing /usr so pretty much anything you try to run will block. When the lockup happens when dump is running on my home dir (/u/yertle1) partition, as long as you don't need that partition you can log in and run any programs you like. I have a service account whose home dir is in / var and was able to login that time to that account. No such luck this time since any activity pretty much uses /usr. Ping was responding (our monitoring didn't complain it was down). The only thing I could do was break to debugger on the console. This is very strange. You 3 instances of getty where just reading the tty input, and all suspectible processes (like sshd) are waiting on net events. No processes are blocked on the fs. One nfsd is serving the request, and dump is active. To repeat something I said earlier: when creating a snapshot (e.g. which dump -L does), the entire system may become unresponsive untilk the snapshot completes, which can take many minutes. How long are you waiting before pronouncing the system deadlocked? What does ^T on the console (e.g. when trying to log in), show you? Kris pgpY6zPaZmvsz.pgp Description: PGP signature
Re: ffs snapshot lockup
On Fri, Oct 06, 2006 at 02:11:05PM -0400, Vivek Khera wrote: On Oct 6, 2006, at 1:57 PM, Kris Kennaway wrote: This is very strange. You 3 instances of getty where just reading the tty input, and all suspectible processes (like sshd) are waiting on net events. No processes are blocked on the fs. One nfsd is serving the request, and dump is active. To repeat something I said earlier: when creating a snapshot (e.g. which dump -L does), the entire system may become unresponsive untilk the snapshot completes, which can take many minutes. I know snapshot takes a while -- we're used to that. How long are you waiting before pronouncing the system deadlocked? 10's of minutes. What does ^T on the console (e.g. when trying to log in), show you? There were no active snapshotting in the progress. Snapshot was already made, and dump happily processed in the moment captured in the script. nothing. the console is non-responsive. the remote shells are non responsive to any input. I'm now convinced it was all stemming from some bug in bge driver (at least for my specific chipset.) Last night I put in an old spare 3c905 NIC and turned off the motherboard bge via BIOS. I can't make the machine lock up at all, even with the watchdog running, and doing level0 dumps. Also, even though this NIC is only 10/100 and the prior was running at GigE speed, the system is *way* more responsive to network operations. For example, when I logged in this morning my IMAP mail client took barely a second or or so to open my inbox, whereas before it would take upwards of 10 seconds. This machine was always this way since it was first set up running 5.3. I can't believe I lived with it for so long... I'd like to find a nice stable GigE NIC for it, since I know that the onboard bge is definitely sub-optimal with FreeBSD. Dell's diagnostics don't find any hardware fault, for what that's worth. Curiously, I have a handful of other Dell servers at the office which all have bge and run just great at GigE speed to the same switch. If it does lock up again, I'll be sure to let you know! Was this system patched by the stuff I submitted to you ? pgpT50ylzWbU5.pgp Description: PGP signature
Re: ffs snapshot lockup
On Fri, Oct 06, 2006 at 02:11:05PM -0400, Vivek Khera wrote: On Oct 6, 2006, at 1:57 PM, Kris Kennaway wrote: This is very strange. You 3 instances of getty where just reading the tty input, and all suspectible processes (like sshd) are waiting on net events. No processes are blocked on the fs. One nfsd is serving the request, and dump is active. To repeat something I said earlier: when creating a snapshot (e.g. which dump -L does), the entire system may become unresponsive untilk the snapshot completes, which can take many minutes. I know snapshot takes a while -- we're used to that. OK, thanks for confirming. I'm now convinced it was all stemming from some bug in bge driver (at least for my specific chipset.) Last night I put in an old spare 3c905 NIC and turned off the motherboard bge via BIOS. We'd be interested in diagnosing this problem separately; can you set up a test machine with this card and give me access to it? Kris pgp5BixhXIBFf.pgp Description: PGP signature
Re: ffs snapshot lockup
On Oct 6, 2006, at 2:20 PM, Kostik Belousov wrote: If it does lock up again, I'll be sure to let you know! Was this system patched by the stuff I submitted to you ? yes. i did not update anything except adding the xl driver to the kernel, so as to minimize changes. if this holds stable for a few days i may try upgrading to BETA2. are your fixes rolled in there yet?
Re: ffs snapshot lockup
On Oct 6, 2006, at 2:31 PM, Kris Kennaway wrote: I'm now convinced it was all stemming from some bug in bge driver (at least for my specific chipset.) Last night I put in an old spare 3c905 NIC and turned off the motherboard bge via BIOS. We'd be interested in diagnosing this problem separately; can you set up a test machine with this card and give me access to it? I'm not sure which card you want to diagnose. The Dell PE800 has an onboard bge ethernet which was apparently not working at full speed, or was stalling otherwise. The 3c905 is what I used to be able to turn off the bge NIC. It appears to be functioning well.
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 05:16:53PM -0400, Vivek Khera wrote: On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote: Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: As per advice of Kris Kenneway, I turned off the software watchdog to rule out that as my problem. Then I ran a level 3 dump. Dump of root fs went fine, then it proceeded to do /usr. After a few minutes it locked up. Typescript 20 at the above URL shows the debugging info from the break into debugger of the locked up system. Since /usr was locked, nobody could log in at all. The network load was minimal at the time. I had everyone log out and close mail etc. What were the symptoms of locked system ? Could you log in on console, or do something at the shell prompt on console ? Also, did the system respond to the pings ? Fs-related deadlocks (as well as stalled disk io) usually do not prevent lowest levels of the isr/network stack from working. Again, I do not see the fs deadlock per se in the supplied script. Dump does disk io, it seems that nfsd tries to serve some request. Sshd looks to be ready to accept connections. If console is available, but ping responses not arrive, this is definitely network card problem. pgpzVHIESJM7W.pgp Description: PGP signature
Re: ffs snapshot lockup
On Oct 5, 2006, at 4:30 AM, Kostik Belousov wrote: The network load was minimal at the time. I had everyone log out and close mail etc. What were the symptoms of locked system ? Could you log in on console, or do something at the shell prompt on console ? Console was non-responsive. This time dump locked doing /usr so pretty much anything you try to run will block. When the lockup happens when dump is running on my home dir (/u/yertle1) partition, as long as you don't need that partition you can log in and run any programs you like. I have a service account whose home dir is in / var and was able to login that time to that account. No such luck this time since any activity pretty much uses /usr. Ping was responding (our monitoring didn't complain it was down). The only thing I could do was break to debugger on the console.
Re: ffs snapshot lockup
On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote: Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: So last night I patched my 6.2-PRE and installed the kernel and rebooted. At 03:09 I got paged that the server was not responding. It had dropped into the kernel debugger on the console again. Typescript 19 (posted at the above URL) is the output of the debugging commands for it. These patches did not solve my problem (or maybe I have multiple problems and they only solved one.) The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync.
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 09:56:54AM -0400, Vivek Khera wrote: On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote: Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: So last night I patched my 6.2-PRE and installed the kernel and rebooted. At 03:09 I got paged that the server was not responding. It had dropped into the kernel debugger on the console again. Typescript 19 (posted at the above URL) is the output of the debugging commands for it. These patches did not solve my problem (or maybe I have multiple problems and they only solved one.) The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync. As I understand, you got the panic. Then, you shall post the panic message. If you have core file, then running kgdb on the core may show required information. (it shall be on the console exactly before en and backtrace (using the bt command of ddb) of the paniced thread. pgpNvqG4Z7KL1.pgp Description: PGP signature
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 05:08:08PM +0300, Kostik Belousov wrote: On Wed, Oct 04, 2006 at 09:56:54AM -0400, Vivek Khera wrote: On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote: Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: So last night I patched my 6.2-PRE and installed the kernel and rebooted. At 03:09 I got paged that the server was not responding. It had dropped into the kernel debugger on the console again. Typescript 19 (posted at the above URL) is the output of the debugging commands for it. These patches did not solve my problem (or maybe I have multiple problems and they only solved one.) The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync. As I understand, you got the panic. Then, you shall post the panic message. If you have core file, then running kgdb on the core may show required information. (it shall be on the console exactly before en and backtrace (using the bt command of ddb) of the paniced thread. YOu can also do 'show msgbuf' from DDB. Kris pgpd996vEkvrB.pgp Description: PGP signature
Re: ffs snapshot lockup
On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote: The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync. As I understand, you got the panic. Then, you shall post the panic message. If you have core file, then running kgdb on the core may show required information. (it shall be on the console exactly before en and backtrace (using the bt command of ddb) of the paniced thread. YOu can also do 'show msgbuf' from DDB. i ran kgdb on the vmcore file. since the dump was generated by calling doadump from DDB, the backtrace was showing the call stack of that. from what i read in the output from kgdb, it seems that something locked the kernel and we broke to debugger from the watchdog timeout (I enable software watchdog). When I fired up kgdb on my vmcore.19 file and ran the bt command, it said this: Unread portion of the kernel message buffer: interrupt total irq1: atkbd0 2 irq4: sio0 348 irq14: ata01 irq18: bge0 3228387 irq32: aac0 235404 irq34: ahc1 74 irq35: ahc0 15 cpu0: timer 36123790 Total39588021 KDB: stack backtrace: hardclock() at hardclock+0x1bb lapic_handle_timer() at lapic_handle_timer+0x117 Xtimerint() at Xtimerint+0x76 ithread_loop() at ithread_loop+0x148 fork_exit() at fork_exit+0xbb fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xa61d7d00, rbp = 0 --- KDB: enter: watchdog timeout Locked vnodes 0xff002df5b798: tag nfs, type VDIR usecount 2, writecount 0, refcount 2 mountedhere 0 flags (VV_ROOT) lock type nfs: EXCL (count 1) by thread 0xff002a6c5980 (pid 49843)#0 0x802442b4 at lockmgr+0x5b7 #1 0x803a0573 at VOP_LOCK_APV+0x80 #2 0x802be6e5 at vn_lock+0x65 #3 0x802b2cbe at vget+0x8f #4 0x802a84e6 at vfs_hash_get+0xc4 #5 0x8030a3cc at nfs_nget+0xb9 #6 0x80310a9e at nfs_root+0x34 #7 0x802a96d7 at lookup+0xa14 #8 0x802a9d12 at namei+0x385 #9 0x802b8b59 at kern_lstat+0x62 #10 0x802b8e73 at lstat+0x2a #11 0x8037ac13 at syscall+0x470 #12 0x80368aa8 at Xfast_syscall+0xa8 fileid 3 fsid 0x400ff02 Dumping 1015 MB (2 chunks) chunk 0: 1MB (160 pages) ... ok chunk 1: 1015MB (259776 pages) 999 983 967 951 935 919 903 887 871 855 839 823 807 791 775 759 743 727 711 695 679 663 647 631 615 599 583 567 551 535 519 503 487 471 455 439 423 407 391 375 359 343 327 311 295 279 263 247 231 215 199 183 167 151 135 119 103 87 71 55 39 23 7 #0 doadump () at pcpu.h:172 172 __asm __volatile(movq %%gs:0,%0 : =r (td)); (kgdb) bt #0 doadump () at pcpu.h:172 #1 0x8017719b in db_fncall (dummy1=0, dummy2=0, dummy3=0, dummy4=0x0) at /usr/src/sys/ddb/db_command.c:492 #2 0x801775bf in db_command_loop () at /usr/src/sys/ddb/db_command.c:350 #3 0x801792dd in db_trap (type=-1508017968, code=0) at /usr/src/sys/ddb/db_main.c:221 #4 0x8026c72c in kdb_trap (type=3, code=0, tf=0xa61d79d0) at /usr/src/sys/kern/subr_kdb.c:473 #5 0x8037a4bf in trap (frame= {tf_rdi = 0, tf_rsi = -2139025408, tf_rdx = 1, tf_rcx = 1057545, tf_r8 = 1048064, tf_r9 = 10, tf_rax = 29, tf_rbx = 0, tf_rbp = -1508017520, tf_r10 = -1508017760, tf_r11 = 10, tf_r12 = -2141840192, tf_r13 = 0, tf_r14 = -1099502938944, tf_r15 = -1099511596728, tf_trapno = 3, tf_addr = 0, tf_flags = -1099511596728, tf_err = 0, tf_rip = -2144943427, tf_cs = 8, tf_rflags = 134, tf_rsp = -1508017520, tf_ss = 16}) at /usr/src/sys/ amd64/amd64/trap.c:442 #6 0x8036890b in calltrap () at /usr/src/sys/amd64/amd64/exception.S:168 #7 0x8026c2bd in kdb_enter (msg=0x0) at cpufunc.h:63 #8 0x8036cc94 in lapic_handle_timer (frame= {cf_rdi = -2036801520, cf_rsi = 1, cf_rdx = -1099502946304, cf_rcx = -1095242940416, cf_r8 = -2143479528, cf_r9 = -2143559117, cf_rax = 12582912, cf_rbx = -2036801536, cf_rbp = -1508017200, cf_r10 = 0, cf_r11 = 4, cf_r12 = -1099511596800, cf_r13 = 0, cf_r14 = -1099502938944, cf_r15 = -1099511596728, cf_rip = -2145575931, cf_cs = 8, cf_rflags = 514, cf_rsp = -1508017280, cf_ss = 16}) at /usr/src/sys/amd64/amd64/local_apic.c:635 #9 0x80369166 in Xtimerint () at apic_vector.S:153 #10 0x801d1c05 in bge_intr (xsc=0x8698e010) at bus.h:241 #11 0x8023aab5 in ithread_loop (arg=0xff8494c0) ---Type return to continue, or q return to quit--- at /usr/src/sys/kern/kern_intr.c:682 #12 0x8023992f in fork_exit ( callout=0x8023a96d ithread_loop, arg=0xff8494c0, frame=0xa61d7c50) at /usr/src/sys/kern/kern_fork.c:821 #13 0x80368c6e in fork_trampoline ()
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 01:06:37PM -0400, Vivek Khera wrote: On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote: The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync. As I understand, you got the panic. Then, you shall post the panic message. If you have core file, then running kgdb on the core may show required information. (it shall be on the console exactly before en and backtrace (using the bt command of ddb) of the paniced thread. YOu can also do 'show msgbuf' from DDB. i ran kgdb on the vmcore file. since the dump was generated by calling doadump from DDB, the backtrace was showing the call stack of that. from what i read in the output from kgdb, it seems that something locked the kernel and we broke to debugger from the watchdog timeout (I enable software watchdog). When I fired up kgdb on my vmcore.19 file and ran the bt command, it said this: Unread portion of the kernel message buffer: interrupt total irq1: atkbd0 2 irq4: sio0 348 irq14: ata01 irq18: bge0 3228387 irq32: aac0 235404 irq34: ahc1 74 irq35: ahc0 15 cpu0: timer 36123790 Total39588021 KDB: stack backtrace: hardclock() at hardclock+0x1bb lapic_handle_timer() at lapic_handle_timer+0x117 Xtimerint() at Xtimerint+0x76 ithread_loop() at ithread_loop+0x148 fork_exit() at fork_exit+0xbb fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xa61d7d00, rbp = 0 --- KDB: enter: watchdog timeout Locked vnodes 0xff002df5b798: tag nfs, type VDIR usecount 2, writecount 0, refcount 2 mountedhere 0 flags (VV_ROOT) lock type nfs: EXCL (count 1) by thread 0xff002a6c5980 (pid 49843)#0 0x802442b4 at lockmgr+0x5b7 #1 0x803a0573 at VOP_LOCK_APV+0x80 #2 0x802be6e5 at vn_lock+0x65 #3 0x802b2cbe at vget+0x8f #4 0x802a84e6 at vfs_hash_get+0xc4 #5 0x8030a3cc at nfs_nget+0xb9 #6 0x80310a9e at nfs_root+0x34 #7 0x802a96d7 at lookup+0xa14 #8 0x802a9d12 at namei+0x385 #9 0x802b8b59 at kern_lstat+0x62 #10 0x802b8e73 at lstat+0x2a #11 0x8037ac13 at syscall+0x470 #12 0x80368aa8 at Xfast_syscall+0xa8 fileid 3 fsid 0x400ff02 Dumping 1015 MB (2 chunks) chunk 0: 1MB (160 pages) ... ok chunk 1: 1015MB (259776 pages) 999 983 967 951 935 919 903 887 871 855 839 823 807 791 775 759 743 727 711 695 679 663 647 631 615 599 583 567 551 535 519 503 487 471 455 439 423 407 391 375 359 343 327 311 295 279 263 247 231 215 199 183 167 151 135 119 103 87 71 55 39 23 7 #0 doadump () at pcpu.h:172 172 __asm __volatile(movq %%gs:0,%0 : =r (td)); (kgdb) bt #0 doadump () at pcpu.h:172 #1 0x8017719b in db_fncall (dummy1=0, dummy2=0, dummy3=0, dummy4=0x0) at /usr/src/sys/ddb/db_command.c:492 #2 0x801775bf in db_command_loop () at /usr/src/sys/ddb/db_command.c:350 #3 0x801792dd in db_trap (type=-1508017968, code=0) at /usr/src/sys/ddb/db_main.c:221 #4 0x8026c72c in kdb_trap (type=3, code=0, tf=0xa61d79d0) at /usr/src/sys/kern/subr_kdb.c:473 #5 0x8037a4bf in trap (frame= {tf_rdi = 0, tf_rsi = -2139025408, tf_rdx = 1, tf_rcx = 1057545, tf_r8 = 1048064, tf_r9 = 10, tf_rax = 29, tf_rbx = 0, tf_rbp = -1508017520, tf_r10 = -1508017760, tf_r11 = 10, tf_r12 = -2141840192, tf_r13 = 0, tf_r14 = -1099502938944, tf_r15 = -1099511596728, tf_trapno = 3, tf_addr = 0, tf_flags = -1099511596728, tf_err = 0, tf_rip = -2144943427, tf_cs = 8, tf_rflags = 134, tf_rsp = -1508017520, tf_ss = 16}) at /usr/src/sys/ amd64/amd64/trap.c:442 #6 0x8036890b in calltrap () at /usr/src/sys/amd64/amd64/exception.S:168 #7 0x8026c2bd in kdb_enter (msg=0x0) at cpufunc.h:63 #8 0x8036cc94 in lapic_handle_timer (frame= {cf_rdi = -2036801520, cf_rsi = 1, cf_rdx = -1099502946304, cf_rcx = -1095242940416, cf_r8 = -2143479528, cf_r9 = -2143559117, cf_rax = 12582912, cf_rbx = -2036801536, cf_rbp = -1508017200, cf_r10 = 0, cf_r11 = 4, cf_r12 = -1099511596800, cf_r13 = 0, cf_r14 = -1099502938944, cf_r15 = -1099511596728, cf_rip = -2145575931, cf_cs = 8, cf_rflags = 514, cf_rsp = -1508017280, cf_ss = 16}) at /usr/src/sys/amd64/amd64/local_apic.c:635 #9 0x80369166 in Xtimerint () at apic_vector.S:153 #10 0x801d1c05 in bge_intr (xsc=0x8698e010) at bus.h:241 Did you have any problems with your network card ? This seems to be quite popular plot in recent times. I do not see anything except wedged nfs request in this
Re: ffs snapshot lockup
On Oct 4, 2006, at 1:16 PM, Kostik Belousov wrote: #9 0x80369166 in Xtimerint () at apic_vector.S:153 #10 0x801d1c05 in bge_intr (xsc=0x8698e010) at bus.h:241 Did you have any problems with your network card ? This seems to be quite popular plot in recent times. quite possible. when i first bought this box, a dell pe800, the NIC was not recognized by FreeBSD 5.3. With some pciconf info and help from the stable@ list, we were able to have freebsd recognize the chipset and it seemed to work ok. network performance was never stellar. overall this machine feels much slower than it should be, but it is hard to tell whether that is caused by the network or the disk. i suspected the disk. anyhow, it was upgraded to FreeBSD 5.4, then to 6.0 (at which point i had to disable ACPI timer or it would lock during boot probing some devices). I do not see anything except wedged nfs request in this deadlock. Seems that nfs server does not respond. ok. i'll try my usual level 0 dump. that often causes the ffs to totally lock that partition. i may have multiple issues going on here :-( Please, include output of ps in the scripts. It makes much easier to grok the situation quickly.. noted. ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ffs snapshot lockup
On Oct 4, 2006, at 1:16 PM, Kostik Belousov wrote: #10 0x801d1c05 in bge_intr (xsc=0x8698e010) at bus.h:241 Did you have any problems with your network card ? This seems to be quite popular plot in recent times. Another note: I don't see any bge0 timeouts in my /var/log/messages, and never have over the last 2 years of this box. ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 01:06:37PM -0400, Vivek Khera wrote: On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote: The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync. As I understand, you got the panic. Then, you shall post the panic message. If you have core file, then running kgdb on the core may show required information. (it shall be on the console exactly before en and backtrace (using the bt command of ddb) of the paniced thread. YOu can also do 'show msgbuf' from DDB. i ran kgdb on the vmcore file. since the dump was generated by calling doadump from DDB, the backtrace was showing the call stack of that. from what i read in the output from kgdb, it seems that something locked the kernel and we broke to debugger from the watchdog timeout (I enable software watchdog). Hmm, be careful with that - if you set the timeout too low (and note that for some workloads O(minutes) may even be too low) then you'll get a lot of false positives. Kris pgpZ34qBnPXV7.pgp Description: PGP signature
Re: ffs snapshot lockup
On Oct 4, 2006, at 3:41 PM, Kris Kennaway wrote: from what i read in the output from kgdb, it seems that something locked the kernel and we broke to debugger from the watchdog timeout (I enable software watchdog). Hmm, be careful with that - if you set the timeout too low (and note that for some workloads O(minutes) may even be too low) then you'll get a lot of false positives. hmmm... the man page for watchdogd doesn't specify what the default timeout is, but that's what we've got running. [tappity-tapptity- tap...] source seems to indicate 16seconds timeout. interesting. so we could be getting hit with a bge interrupt storm and timing out. i'll turn off fido and see what happens. at this point, though, i think i have two separate issues. one with bge and watchdog timeout, and one with locking of the filesystem with mksnap_ffs, as the symptoms are different.
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 03:53:54PM -0400, Vivek Khera wrote: On Oct 4, 2006, at 3:41 PM, Kris Kennaway wrote: from what i read in the output from kgdb, it seems that something locked the kernel and we broke to debugger from the watchdog timeout (I enable software watchdog). Hmm, be careful with that - if you set the timeout too low (and note that for some workloads O(minutes) may even be too low) then you'll get a lot of false positives. hmmm... the man page for watchdogd doesn't specify what the default timeout is, but that's what we've got running. [tappity-tapptity- tap...] source seems to indicate 16seconds timeout. interesting. Yes, that's probably way too low. e.g. when creating a snapshot (as in your workload) your machine may be unresponsive for up to a few minutes depending on your filesystem size and I/O load. so we could be getting hit with a bge interrupt storm and timing out. i'll turn off fido and see what happens. at this point, though, i think i have two separate issues. one with bge and watchdog timeout, and one with locking of the filesystem with mksnap_ffs, as the symptoms are different. That sounds plausible. Many people are reporting issues involving NIC interrupts, but they're proving elusive to characterize so far (there may be multiple problems). kris pgpjeHGeti1Tr.pgp Description: PGP signature
Re: ffs snapshot lockup
On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote: Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: As per advice of Kris Kenneway, I turned off the software watchdog to rule out that as my problem. Then I ran a level 3 dump. Dump of root fs went fine, then it proceeded to do /usr. After a few minutes it locked up. Typescript 20 at the above URL shows the debugging info from the break into debugger of the locked up system. Since /usr was locked, nobody could log in at all. The network load was minimal at the time. I had everyone log out and close mail etc.
Re: ffs snapshot lockup
On Wed, Oct 04, 2006 at 03:41:48PM -0400, Kris Kennaway wrote: On Wed, Oct 04, 2006 at 01:06:37PM -0400, Vivek Khera wrote: On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote: The only thing I think was running at the time would be a large file copy from a remote system to this one using rsync. As I understand, you got the panic. Then, you shall post the panic message. If you have core file, then running kgdb on the core may show required information. (it shall be on the console exactly before en and backtrace (using the bt command of ddb) of the paniced thread. YOu can also do 'show msgbuf' from DDB. i ran kgdb on the vmcore file. since the dump was generated by calling doadump from DDB, the backtrace was showing the call stack of that. from what i read in the output from kgdb, it seems that something locked the kernel and we broke to debugger from the watchdog timeout (I enable software watchdog). Hmm, be careful with that - if you set the timeout too low (and note that for some workloads O(minutes) may even be too low) then you'll get a lot of false positives. Oh, yes. I've using this with success: watchdogd -t 3600 -e 'ls /tmp /dev /dev/null; true' -s 60 - Peter Kris -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFFJA58Wry0BWjoQKURAr5dAKDf4YLcBJU9owRw6N1L3FcgJkvOOgCfRQkq bd8+tGZVB28bkYBN6KL7iO0= =B7Vl -END PGP SIGNATURE- ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ffs snapshot lockup
On Mon, Oct 02, 2006 at 03:23:49PM -0400, Vivek Khera wrote: On Sep 22, 2006, at 4:36 PM, Kris Kennaway wrote: Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB (spammy, need that serial console), or at least trace the running processes (show allpcpu) and those listed in lockedvnods. Then call doadump and save the core+kernel.debug when you reboot. Well, what an exciting weekend we had! Three lockups/panics: two during running a dump (mksnap_ffs seems to be the culprit) and one running background fsck after rebooting from the prior crash. Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: 1. buffer ownership patch by Tor Egge (already in current, I think it shall be MFCed before 6.2) tegge 2006-10-02 02:06:27 UTC FreeBSD src repository Modified files: sys/kern kern_lock.c vfs_bio.c sys/sys buf.h lockmgr.h Log: If the buffer lock has waiters after the buffer has changed identity then getnewbuf() needs to drop the buffer in order to wake waiters that might sleep on the buffer in the context of the old identity. Revision ChangesPath 1.100 +15 -0 src/sys/kern/kern_lock.c 1.510 +11 -0 src/sys/kern/vfs_bio.c 1.194 +11 -0 src/sys/sys/buf.h 1.51 +1 -0 src/sys/sys/lockmgr.h ___ cvs-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/cvs-all To unsubscribe, send any mail to [EMAIL PROTECTED] Index: src/sys/kern/kern_lock.c diff -u src/sys/kern/kern_lock.c:1.99 src/sys/kern/kern_lock.c:1.100 --- src/sys/kern/kern_lock.c:1.99 Tue Aug 15 18:29:01 2006 +++ src/sys/kern/kern_lock.cMon Oct 2 02:06:26 2006 @@ -566,6 +566,21 @@ } /* + * Determine the number of waiters on a lock. + */ +int +lockwaiters(lkp) + struct lock *lkp; +{ + int count; + + mtx_lock(lkp-lk_interlock); + count = lkp-lk_waitcount; + mtx_unlock(lkp-lk_interlock); + return (count); +} + +/* * Print out information about state of a lock. Used by VOP_PRINT * routines to display status about contained locks. */ Index: src/sys/kern/vfs_bio.c diff -u src/sys/kern/vfs_bio.c:1.509 src/sys/kern/vfs_bio.c:1.510 --- src/sys/kern/vfs_bio.c:1.509Wed Aug 9 17:43:26 2006 +++ src/sys/kern/vfs_bio.c Mon Oct 2 02:06:26 2006 @@ -1890,6 +1890,17 @@ } /* +* Notify any waiters for the buffer lock about +* identity change by freeing the buffer. +*/ + if (qindex == QUEUE_CLEAN BUF_LOCKWAITERS(bp) 0) { + bp-b_flags |= B_INVAL; + bfreekva(bp); + brelse(bp); + goto restart; + } + + /* * If we are overcomitted then recover the buffer and its * KVM space. This occurs in rare situations when multiple * processes are blocked in getnewbuf() or allocbuf(). Index: src/sys/sys/buf.h diff -u src/sys/sys/buf.h:1.193 src/sys/sys/buf.h:1.194 --- src/sys/sys/buf.h:1.193 Fri Mar 31 02:56:30 2006 +++ src/sys/sys/buf.h Mon Oct 2 02:06:27 2006 @@ -371,6 +371,17 @@ return ret; } + +/* + * Find out the number of waiters on a lock. + */ +static __inline int BUF_LOCKWAITERS(struct buf *); +static __inline int +BUF_LOCKWAITERS(struct buf *bp) +{ + return (lockwaiters(bp-b_lock)); +} + #endif /* _KERNEL */ struct buf_queue_head { Index: src/sys/sys/lockmgr.h diff -u src/sys/sys/lockmgr.h:1.50 src/sys/sys/lockmgr.h:1.51 --- src/sys/sys/lockmgr.h:1.50 Tue Aug 15 18:29:01 2006 +++ src/sys/sys/lockmgr.h Mon Oct 2 02:06:27 2006 @@ -203,6 +203,7 @@ void lockmgr_printinfo(struct lock *); intlockstatus(struct lock *, struct thread *); intlockcount(struct lock *); +intlockwaiters(struct lock *); #ifdef DDB intlockmgr_chain(struct thread *td, struct thread **ownerp); #endif 2. this one (it shall be applied in the sys/ufs; please, test and report results) Index: ffs/ffs_inode.c === RCS file: /usr/local/arch/ncvs/src/sys/ufs/ffs/ffs_inode.c,v retrieving revision 1.106 diff -u -r1.106 ffs_inode.c --- ffs/ffs_inode.c 5 Apr 2005 08:49:41 - 1.106 +++ ffs/ffs_inode.c 27 Sep 2006 08:29:18 - @@ -66,9 +66,11 @@ * IN_ACCESS, IN_UPDATE, and IN_CHANGE flags respectively. Write the inode * to disk if the IN_MODIFIED flag is set (it may be set
Re: ffs snapshot lockup
On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote: Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...) It seems that you have snapshotted fs exported by nfsd ? At least, 18a is definitely the case. I have the patch (for current) that shall fix the issue. In fact, you need two patches: Yes, it is an NFS exported partition. The patches applied well to my existing 6.2-PRE source tree (I did not cvsup so as to minimize the changes to the system and prove that this fix either does or does not work for my problem.) There was minimal line fuzz, but no failures. If this works out, I'll cvsup and re-patch and test the latest sources. I'm building and installing the kernel now. Tomorrow when I'm at the office I'll try the dump again as that usually causes a lockup. Thanks for helping me solve this.
Re: ffs snapshot lockup
On Sep 22, 2006, at 4:36 PM, Kris Kennaway wrote: Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB (spammy, need that serial console), or at least trace the running processes (show allpcpu) and those listed in lockedvnods. Then call doadump and save the core+kernel.debug when you reboot. Well, what an exciting weekend we had! Three lockups/panics: two during running a dump (mksnap_ffs seems to be the culprit) and one running background fsck after rebooting from the prior crash. Details are posted at http://vivek.khera.org/scratch/crashlogs/ I have the crashdumps available to a kernel hacker upon request (i'd rather not make them generally available to the public...)
Re: ffs snapshot lockup
On Sep 25, 2006, at 2:58 PM, Jeremy Chadwick wrote: This problem sounds awfully like what I reported in a recent PR. Now I'm starting to have my doubts as to whether or not my hardware was indeed at fault... http://www.freebsd.org/cgi/query-pr.cgi?pr=103435 I think my issue is a bit different -- the FS is the only thing locked up, and only the single FS that mksnap_ffs is running on. The rest of the system continues to run quite nicely as long as the process doesn't attempt to access the locked file system. In my case it has been the home directory partition, which is easy to avoid if you're root. I don't observe ethernet timeouts, etc.
Re: ffs snapshot lockup
On Sep 22, 2006, at 4:36 PM, Kris Kennaway wrote: Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB (spammy, need that serial console), or at least trace the running processes (show allpcpu) and those listed in lockedvnods. Then call doadump and save the core+kernel.debug when you reboot. Well, it happened again (as I was building the debugging kernel, no less)... but it only locked up one file system not the whole box. So my home dir was wedged, and everything trying to use anyone's home dir was wedged, but /usr and / were still responding. So now I do have the kernel in place and the serial console is already there and your notes on what to do are there, so next time it happens I'm ready!
Re: ffs snapshot lockup
This problem sounds awfully like what I reported in a recent PR. Now I'm starting to have my doubts as to whether or not my hardware was indeed at fault... http://www.freebsd.org/cgi/query-pr.cgi?pr=103435 -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networkinghttp://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
ffs snapshot lockup
Yesterday we upgraded an amd64 system from 6.1 to 6.2-PRE from a cvsup of September 20 evening. The overnight nightly dump to tape went off normally. This afternoon when the sysadmin was performing the level0 backup, the ffs_mksnap program locked up the entire system. It wasn't the expected short hiccup that mksnap sometimes causes -- it was started just prior to lunchtime and when we got back everyone had wedged login sessions and IMAP sessions, and ffs_mksnap was in D state according to ps and idle for about 90 minutes. The only way out was to reboot the system. I know this is very un-useful as a bug report, but I'm putting this out in case anyone else has noticed this. It has never happened to us before when we were running 5.4 and then 6.1 on this box. Nothing else changed on the system during this time. The disk in question is on an adaptec RAID controller with the aac driver.
Re: ffs snapshot lockup
On Fri, Sep 22, 2006 at 02:55:34PM -0400, Vivek Khera wrote: Yesterday we upgraded an amd64 system from 6.1 to 6.2-PRE from a cvsup of September 20 evening. The overnight nightly dump to tape went off normally. This afternoon when the sysadmin was performing the level0 backup, the ffs_mksnap program locked up the entire system. It wasn't the expected short hiccup that mksnap sometimes causes -- it was started just prior to lunchtime and when we got back everyone had wedged login sessions and IMAP sessions, and ffs_mksnap was in D state according to ps and idle for about 90 minutes. The only way out was to reboot the system. I know this is very un-useful as a bug report, but I'm putting this out in case anyone else has noticed this. It has never happened to us before when we were running 5.4 and then 6.1 on this box. Nothing else changed on the system during this time. Yep, we'd need additional debugging to proceed. Kris ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: ffs snapshot lockup
On Sep 22, 2006, at 3:03 PM, Kris Kennaway wrote: I know this is very un-useful as a bug report, but I'm putting this out in case anyone else has noticed this. It has never happened to us before when we were running 5.4 and then 6.1 on this box. Nothing else changed on the system during this time. Yep, we'd need additional debugging to proceed. How does one debug a system where the disk sub-system is frozen? I have serial console with DDB/KDB built into the kernel and debugging symbols too. Is that the only option here? We're gonna try the full level 0 backup again monday and if it locks up I'll try to figure it out, but could use some hints as to where to poke in the kernel.
Re: ffs snapshot lockup
On Fri, Sep 22, 2006 at 04:24:41PM -0400, Vivek Khera wrote: On Sep 22, 2006, at 3:03 PM, Kris Kennaway wrote: I know this is very un-useful as a bug report, but I'm putting this out in case anyone else has noticed this. It has never happened to us before when we were running 5.4 and then 6.1 on this box. Nothing else changed on the system during this time. Yep, we'd need additional debugging to proceed. How does one debug a system where the disk sub-system is frozen? I have serial console with DDB/KDB built into the kernel and debugging symbols too. Is that the only option here? Perhaps, but you can probably also take a dump still. It's a filesystem deadlock, not a disk device problem. We're gonna try the full level 0 backup again monday and if it locks up I'll try to figure it out, but could use some hints as to where to poke in the kernel. Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB (spammy, need that serial console), or at least trace the running processes (show allpcpu) and those listed in lockedvnods. Then call doadump and save the core+kernel.debug when you reboot. Kris pgpxnqTnkvGrb.pgp Description: PGP signature