Re: ffs snapshot lockup

2006-10-06 Thread Kostik Belousov
On Thu, Oct 05, 2006 at 10:01:07AM -0400, Vivek Khera wrote:
 
 On Oct 5, 2006, at 4:30 AM, Kostik Belousov wrote:
 
 
 The network load was minimal at the time.  I had everyone log out and
 close mail etc.
 
 
 What were the symptoms of locked system ? Could you log in on  
 console, or
 do something at the shell prompt on console ?
 
 Console was non-responsive.  This time dump locked doing /usr so  
 pretty much anything you try to run will block.  When the lockup  
 happens when dump is running on my home dir (/u/yertle1) partition,  
 as long as you don't need that partition you can log in and run any  
 programs you like.  I have a service account whose home dir is in / 
 var and was able to login that time to that account.  No such luck  
 this time since any activity pretty much uses /usr.
 
 Ping was responding (our monitoring didn't complain it was down).
 
 The only thing I could do was break to debugger on the console.
 
This is very strange. You 3 instances of getty where just reading the
tty input, and all suspectible processes (like sshd) are waiting on net
events. No processes are blocked on the fs. One nfsd is serving the request,
and dump is active.


pgpAr4aPXXmZW.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-06 Thread Kris Kennaway
On Fri, Oct 06, 2006 at 10:39:50AM +0300, Kostik Belousov wrote:
 On Thu, Oct 05, 2006 at 10:01:07AM -0400, Vivek Khera wrote:
  
  On Oct 5, 2006, at 4:30 AM, Kostik Belousov wrote:
  
  
  The network load was minimal at the time.  I had everyone log out and
  close mail etc.
  
  
  What were the symptoms of locked system ? Could you log in on  
  console, or
  do something at the shell prompt on console ?
  
  Console was non-responsive.  This time dump locked doing /usr so  
  pretty much anything you try to run will block.  When the lockup  
  happens when dump is running on my home dir (/u/yertle1) partition,  
  as long as you don't need that partition you can log in and run any  
  programs you like.  I have a service account whose home dir is in / 
  var and was able to login that time to that account.  No such luck  
  this time since any activity pretty much uses /usr.
  
  Ping was responding (our monitoring didn't complain it was down).
  
  The only thing I could do was break to debugger on the console.
  
 This is very strange. You 3 instances of getty where just reading the
 tty input, and all suspectible processes (like sshd) are waiting on net
 events. No processes are blocked on the fs. One nfsd is serving the request,
 and dump is active.

To repeat something I said earlier: when creating a snapshot
(e.g. which dump -L does), the entire system may become unresponsive
untilk the snapshot completes, which can take many minutes.

How long are you waiting before pronouncing the system deadlocked?

What does ^T on the console (e.g. when trying to log in), show you?

Kris



pgpY6zPaZmvsz.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-06 Thread Kostik Belousov
On Fri, Oct 06, 2006 at 02:11:05PM -0400, Vivek Khera wrote:
 
 On Oct 6, 2006, at 1:57 PM, Kris Kennaway wrote:
 
 This is very strange. You 3 instances of getty where just reading the
 tty input, and all suspectible processes (like sshd) are waiting  
 on net
 events. No processes are blocked on the fs. One nfsd is serving  
 the request,
 and dump is active.
 
 To repeat something I said earlier: when creating a snapshot
 (e.g. which dump -L does), the entire system may become unresponsive
 untilk the snapshot completes, which can take many minutes.
 
 I know snapshot takes a while -- we're used to that.
 
 How long are you waiting before pronouncing the system deadlocked?
 
 
 
 10's of minutes.
 What does ^T on the console (e.g. when trying to log in), show you?
 
There were no active snapshotting in the progress. Snapshot was already
made, and dump happily processed in the moment captured in the script.

 nothing.  the console is non-responsive.  the remote shells are non  
 responsive to any input.
 
 I'm now convinced it was all stemming from some bug in bge driver (at  
 least for my specific chipset.)  Last night I put in an old spare  
 3c905 NIC and turned off the motherboard bge via BIOS.
 
 I can't make the machine lock up at all, even with the watchdog  
 running, and doing level0 dumps.
 
 Also, even though this NIC is only 10/100 and the prior was running  
 at GigE speed, the system is *way* more responsive to network  
 operations.  For example, when I logged in this morning my IMAP mail  
 client took barely a second or or so to open my inbox, whereas before  
 it would take upwards of 10 seconds.
 
 This machine was always this way since it was first set up running  
 5.3.  I can't believe I lived with it for so long...  I'd like to  
 find a nice stable GigE NIC for it, since I know that the onboard bge  
 is definitely sub-optimal with FreeBSD.  Dell's diagnostics don't  
 find any hardware fault, for what that's worth.
 
 Curiously, I have a handful of other Dell servers at the office which  
 all have bge and run just great at GigE speed to the same switch.
 
 If it does lock up again, I'll be sure to let you know!
 
Was this system patched by the stuff I submitted to you ?




pgpT50ylzWbU5.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-06 Thread Kris Kennaway
On Fri, Oct 06, 2006 at 02:11:05PM -0400, Vivek Khera wrote:
 
 On Oct 6, 2006, at 1:57 PM, Kris Kennaway wrote:
 
 This is very strange. You 3 instances of getty where just reading the
 tty input, and all suspectible processes (like sshd) are waiting  
 on net
 events. No processes are blocked on the fs. One nfsd is serving  
 the request,
 and dump is active.
 
 To repeat something I said earlier: when creating a snapshot
 (e.g. which dump -L does), the entire system may become unresponsive
 untilk the snapshot completes, which can take many minutes.
 
 I know snapshot takes a while -- we're used to that.

OK, thanks for confirming.

 I'm now convinced it was all stemming from some bug in bge driver (at  
 least for my specific chipset.)  Last night I put in an old spare  
 3c905 NIC and turned off the motherboard bge via BIOS.

We'd be interested in diagnosing this problem separately; can you set
up a test machine with this card and give me access to it?

Kris


pgp5BixhXIBFf.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-06 Thread Vivek Khera


On Oct 6, 2006, at 2:20 PM, Kostik Belousov wrote:


If it does lock up again, I'll be sure to let you know!


Was this system patched by the stuff I submitted to you ?



yes.  i did not update anything except adding the xl driver to the  
kernel, so as to minimize changes.


if this holds stable for a few days i may try upgrading to BETA2.   
are your fixes rolled in there yet?




Re: ffs snapshot lockup

2006-10-06 Thread Vivek Khera


On Oct 6, 2006, at 2:31 PM, Kris Kennaway wrote:


I'm now convinced it was all stemming from some bug in bge driver (at
least for my specific chipset.)  Last night I put in an old spare
3c905 NIC and turned off the motherboard bge via BIOS.


We'd be interested in diagnosing this problem separately; can you set
up a test machine with this card and give me access to it?


I'm not sure which card you want to diagnose.  The Dell PE800 has an  
onboard bge ethernet which was apparently not working at full speed,  
or was stalling otherwise.


The 3c905 is what I used to be able to turn off the bge NIC.  It  
appears to be functioning well.




Re: ffs snapshot lockup

2006-10-05 Thread Kostik Belousov
On Wed, Oct 04, 2006 at 05:16:53PM -0400, Vivek Khera wrote:
 
 On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote:
 
 Details are posted at http://vivek.khera.org/scratch/crashlogs/
 
 I have the crashdumps available to a kernel hacker upon request (i'd
 rather not make them generally available to the public...)
 
 It seems that you have snapshotted fs exported by nfsd ? At least,  
 18a is
 definitely the case. I have the patch (for current) that shall fix  
 the issue.
 In fact, you need two patches:
 
 As per advice of Kris Kenneway, I turned off the software watchdog to  
 rule out that as my problem. Then I ran a level 3 dump. Dump of root  
 fs went fine, then it proceeded to do /usr. After a few minutes it  
 locked up. Typescript 20 at the above URL shows the debugging info  
 from the break into debugger of the locked up system. Since /usr was  
 locked, nobody could log in at all.
 
 The network load was minimal at the time.  I had everyone log out and  
 close mail etc.
 

What were the symptoms of locked system ? Could you log in on console, or
do something at the shell prompt on console ?

Also, did the system respond to the pings ? Fs-related deadlocks (as
well as stalled disk io) usually do not prevent lowest levels of the
isr/network stack from working.

Again, I do not see the fs deadlock per se in the supplied script. Dump does
disk io, it seems that nfsd tries to serve some request. Sshd looks to be
ready to accept connections.

If console is available, but ping responses not arrive, this is definitely
network card problem.


pgpzVHIESJM7W.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-05 Thread Vivek Khera


On Oct 5, 2006, at 4:30 AM, Kostik Belousov wrote:



The network load was minimal at the time.  I had everyone log out and
close mail etc.



What were the symptoms of locked system ? Could you log in on  
console, or

do something at the shell prompt on console ?


Console was non-responsive.  This time dump locked doing /usr so  
pretty much anything you try to run will block.  When the lockup  
happens when dump is running on my home dir (/u/yertle1) partition,  
as long as you don't need that partition you can log in and run any  
programs you like.  I have a service account whose home dir is in / 
var and was able to login that time to that account.  No such luck  
this time since any activity pretty much uses /usr.


Ping was responding (our monitoring didn't complain it was down).

The only thing I could do was break to debugger on the console.





Re: ffs snapshot lockup

2006-10-04 Thread Vivek Khera


On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote:


Details are posted at http://vivek.khera.org/scratch/crashlogs/

I have the crashdumps available to a kernel hacker upon request (i'd
rather not make them generally available to the public...)

It seems that you have snapshotted fs exported by nfsd ? At least,  
18a is
definitely the case. I have the patch (for current) that shall fix  
the issue.

In fact, you need two patches:



So last night I patched my 6.2-PRE and installed the kernel and  
rebooted. At 03:09 I got paged that the server was not responding. It  
had dropped into the kernel debugger on the console again. Typescript  
19 (posted at the above URL) is the output of the debugging commands  
for it. These patches did not solve my problem (or maybe I have  
multiple problems and they only solved one.)


The only thing I think was running at the time would be a large file  
copy from a remote system to this one using rsync.

Re: ffs snapshot lockup

2006-10-04 Thread Kostik Belousov
On Wed, Oct 04, 2006 at 09:56:54AM -0400, Vivek Khera wrote:
 
 On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote:
 
 Details are posted at http://vivek.khera.org/scratch/crashlogs/
 
 I have the crashdumps available to a kernel hacker upon request (i'd
 rather not make them generally available to the public...)
 
 It seems that you have snapshotted fs exported by nfsd ? At least,  
 18a is
 definitely the case. I have the patch (for current) that shall fix  
 the issue.
 In fact, you need two patches:
 
 
 So last night I patched my 6.2-PRE and installed the kernel and  
 rebooted. At 03:09 I got paged that the server was not responding. It  
 had dropped into the kernel debugger on the console again. Typescript  
 19 (posted at the above URL) is the output of the debugging commands  
 for it. These patches did not solve my problem (or maybe I have  
 multiple problems and they only solved one.)
 
 The only thing I think was running at the time would be a large file  
 copy from a remote system to this one using rsync.

As I understand, you got the panic. Then, you shall post the panic message.
If you have core file, then running kgdb on the core may show required
information.
(it shall be on the console exactly before en
and backtrace (using the bt command of ddb) of the paniced thread.


pgpNvqG4Z7KL1.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-04 Thread Kris Kennaway
On Wed, Oct 04, 2006 at 05:08:08PM +0300, Kostik Belousov wrote:
 On Wed, Oct 04, 2006 at 09:56:54AM -0400, Vivek Khera wrote:
  
  On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote:
  
  Details are posted at http://vivek.khera.org/scratch/crashlogs/
  
  I have the crashdumps available to a kernel hacker upon request (i'd
  rather not make them generally available to the public...)
  
  It seems that you have snapshotted fs exported by nfsd ? At least,  
  18a is
  definitely the case. I have the patch (for current) that shall fix  
  the issue.
  In fact, you need two patches:
  
  
  So last night I patched my 6.2-PRE and installed the kernel and  
  rebooted. At 03:09 I got paged that the server was not responding. It  
  had dropped into the kernel debugger on the console again. Typescript  
  19 (posted at the above URL) is the output of the debugging commands  
  for it. These patches did not solve my problem (or maybe I have  
  multiple problems and they only solved one.)
  
  The only thing I think was running at the time would be a large file  
  copy from a remote system to this one using rsync.
 
 As I understand, you got the panic. Then, you shall post the panic message.
 If you have core file, then running kgdb on the core may show required
 information.
 (it shall be on the console exactly before en
 and backtrace (using the bt command of ddb) of the paniced thread.

YOu can also do 'show msgbuf' from DDB.

Kris


pgpd996vEkvrB.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-04 Thread Vivek Khera


On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote:



The only thing I think was running at the time would be a large file
copy from a remote system to this one using rsync.


As I understand, you got the panic. Then, you shall post the panic  
message.
If you have core file, then running kgdb on the core may show  
required

information.
(it shall be on the console exactly before en
and backtrace (using the bt command of ddb) of the paniced thread.


YOu can also do 'show msgbuf' from DDB.



i ran kgdb on the vmcore file.  since the dump was generated by  
calling doadump from DDB, the backtrace was showing the call stack of  
that.


from what i read in the output from kgdb, it seems that something  
locked the kernel and we broke to debugger from the watchdog timeout  
(I enable software watchdog).



When I fired up kgdb on my vmcore.19 file and ran the bt command, it  
said this:



Unread portion of the kernel message buffer:
interrupt   total
irq1: atkbd0   2
irq4: sio0   348
irq14: ata01
irq18: bge0  3228387
irq32: aac0   235404
irq34: ahc1   74
irq35: ahc0   15
cpu0: timer 36123790
Total39588021
KDB: stack backtrace:
hardclock() at hardclock+0x1bb
lapic_handle_timer() at lapic_handle_timer+0x117
Xtimerint() at Xtimerint+0x76
ithread_loop() at ithread_loop+0x148
fork_exit() at fork_exit+0xbb
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xa61d7d00, rbp = 0 ---
KDB: enter: watchdog timeout
Locked vnodes

0xff002df5b798: tag nfs, type VDIR
usecount 2, writecount 0, refcount 2 mountedhere 0
flags (VV_ROOT)
 lock type nfs: EXCL (count 1) by thread 0xff002a6c5980 (pid  
49843)#0 0x802442b4 at lockmgr+0x5b7

#1 0x803a0573 at VOP_LOCK_APV+0x80
#2 0x802be6e5 at vn_lock+0x65
#3 0x802b2cbe at vget+0x8f
#4 0x802a84e6 at vfs_hash_get+0xc4
#5 0x8030a3cc at nfs_nget+0xb9
#6 0x80310a9e at nfs_root+0x34
#7 0x802a96d7 at lookup+0xa14
#8 0x802a9d12 at namei+0x385
#9 0x802b8b59 at kern_lstat+0x62
#10 0x802b8e73 at lstat+0x2a
#11 0x8037ac13 at syscall+0x470
#12 0x80368aa8 at Xfast_syscall+0xa8

fileid 3 fsid 0x400ff02
Dumping 1015 MB (2 chunks)
  chunk 0: 1MB (160 pages) ... ok
  chunk 1: 1015MB (259776 pages) 999 983 967 951 935 919 903 887 871  
855 839 823 807 791 775 759 743 727 711 695 679 663 647 631 615 599  
583 567 551 535 519 503 487 471 455 439 423 407 391 375 359 343 327  
311 295 279 263 247 231 215 199 183 167 151 135 119 103 87 71 55 39 23 7


#0  doadump () at pcpu.h:172
172 __asm __volatile(movq %%gs:0,%0 : =r (td));
(kgdb) bt
#0  doadump () at pcpu.h:172
#1  0x8017719b in db_fncall (dummy1=0, dummy2=0, dummy3=0,  
dummy4=0x0)

at /usr/src/sys/ddb/db_command.c:492
#2  0x801775bf in db_command_loop ()
at /usr/src/sys/ddb/db_command.c:350
#3  0x801792dd in db_trap (type=-1508017968, code=0)
at /usr/src/sys/ddb/db_main.c:221
#4  0x8026c72c in kdb_trap (type=3, code=0,  
tf=0xa61d79d0)

at /usr/src/sys/kern/subr_kdb.c:473
#5  0x8037a4bf in trap (frame=
  {tf_rdi = 0, tf_rsi = -2139025408, tf_rdx = 1, tf_rcx =  
1057545, tf_r8 = 1048064, tf_r9 = 10, tf_rax = 29, tf_rbx = 0, tf_rbp  
= -1508017520, tf_r10 = -1508017760, tf_r11 = 10, tf_r12 =  
-2141840192, tf_r13 = 0, tf_r14 = -1099502938944, tf_r15 =  
-1099511596728, tf_trapno = 3, tf_addr = 0, tf_flags =  
-1099511596728, tf_err = 0, tf_rip = -2144943427, tf_cs = 8,  
tf_rflags = 134, tf_rsp = -1508017520, tf_ss = 16}) at /usr/src/sys/ 
amd64/amd64/trap.c:442

#6  0x8036890b in calltrap ()
at /usr/src/sys/amd64/amd64/exception.S:168
#7  0x8026c2bd in kdb_enter (msg=0x0) at cpufunc.h:63
#8  0x8036cc94 in lapic_handle_timer (frame=
  {cf_rdi = -2036801520, cf_rsi = 1, cf_rdx = -1099502946304,  
cf_rcx = -1095242940416, cf_r8 = -2143479528, cf_r9 = -2143559117,  
cf_rax = 12582912, cf_rbx = -2036801536, cf_rbp = -1508017200, cf_r10  
= 0, cf_r11 = 4, cf_r12 = -1099511596800, cf_r13 = 0, cf_r14 =  
-1099502938944, cf_r15 = -1099511596728, cf_rip = -2145575931, cf_cs  
= 8, cf_rflags = 514, cf_rsp = -1508017280, cf_ss = 16})

at /usr/src/sys/amd64/amd64/local_apic.c:635
#9  0x80369166 in Xtimerint () at apic_vector.S:153
#10 0x801d1c05 in bge_intr (xsc=0x8698e010) at bus.h:241
#11 0x8023aab5 in ithread_loop (arg=0xff8494c0)
---Type return to continue, or q return to quit---
at /usr/src/sys/kern/kern_intr.c:682
#12 0x8023992f in fork_exit (
callout=0x8023a96d ithread_loop, arg=0xff8494c0,
frame=0xa61d7c50) at /usr/src/sys/kern/kern_fork.c:821
#13 0x80368c6e in fork_trampoline ()

Re: ffs snapshot lockup

2006-10-04 Thread Kostik Belousov
On Wed, Oct 04, 2006 at 01:06:37PM -0400, Vivek Khera wrote:
 
 On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote:
 
 
 The only thing I think was running at the time would be a large file
 copy from a remote system to this one using rsync.
 
 As I understand, you got the panic. Then, you shall post the panic  
 message.
 If you have core file, then running kgdb on the core may show  
 required
 information.
 (it shall be on the console exactly before en
 and backtrace (using the bt command of ddb) of the paniced thread.
 
 YOu can also do 'show msgbuf' from DDB.
 
 
 i ran kgdb on the vmcore file.  since the dump was generated by  
 calling doadump from DDB, the backtrace was showing the call stack of  
 that.
 
 from what i read in the output from kgdb, it seems that something  
 locked the kernel and we broke to debugger from the watchdog timeout  
 (I enable software watchdog).
 
 
 When I fired up kgdb on my vmcore.19 file and ran the bt command, it  
 said this:
 
 
 Unread portion of the kernel message buffer:
 interrupt   total
 irq1: atkbd0   2
 irq4: sio0   348
 irq14: ata01
 irq18: bge0  3228387
 irq32: aac0   235404
 irq34: ahc1   74
 irq35: ahc0   15
 cpu0: timer 36123790
 Total39588021
 KDB: stack backtrace:
 hardclock() at hardclock+0x1bb
 lapic_handle_timer() at lapic_handle_timer+0x117
 Xtimerint() at Xtimerint+0x76
 ithread_loop() at ithread_loop+0x148
 fork_exit() at fork_exit+0xbb
 fork_trampoline() at fork_trampoline+0xe
 --- trap 0, rip = 0, rsp = 0xa61d7d00, rbp = 0 ---
 KDB: enter: watchdog timeout
 Locked vnodes
 
 0xff002df5b798: tag nfs, type VDIR
 usecount 2, writecount 0, refcount 2 mountedhere 0
 flags (VV_ROOT)
  lock type nfs: EXCL (count 1) by thread 0xff002a6c5980 (pid  
 49843)#0 0x802442b4 at lockmgr+0x5b7
 #1 0x803a0573 at VOP_LOCK_APV+0x80
 #2 0x802be6e5 at vn_lock+0x65
 #3 0x802b2cbe at vget+0x8f
 #4 0x802a84e6 at vfs_hash_get+0xc4
 #5 0x8030a3cc at nfs_nget+0xb9
 #6 0x80310a9e at nfs_root+0x34
 #7 0x802a96d7 at lookup+0xa14
 #8 0x802a9d12 at namei+0x385
 #9 0x802b8b59 at kern_lstat+0x62
 #10 0x802b8e73 at lstat+0x2a
 #11 0x8037ac13 at syscall+0x470
 #12 0x80368aa8 at Xfast_syscall+0xa8
 
 fileid 3 fsid 0x400ff02
 Dumping 1015 MB (2 chunks)
   chunk 0: 1MB (160 pages) ... ok
   chunk 1: 1015MB (259776 pages) 999 983 967 951 935 919 903 887 871  
 855 839 823 807 791 775 759 743 727 711 695 679 663 647 631 615 599  
 583 567 551 535 519 503 487 471 455 439 423 407 391 375 359 343 327  
 311 295 279 263 247 231 215 199 183 167 151 135 119 103 87 71 55 39 23 7
 
 #0  doadump () at pcpu.h:172
 172 __asm __volatile(movq %%gs:0,%0 : =r (td));
 (kgdb) bt
 #0  doadump () at pcpu.h:172
 #1  0x8017719b in db_fncall (dummy1=0, dummy2=0, dummy3=0,  
 dummy4=0x0)
 at /usr/src/sys/ddb/db_command.c:492
 #2  0x801775bf in db_command_loop ()
 at /usr/src/sys/ddb/db_command.c:350
 #3  0x801792dd in db_trap (type=-1508017968, code=0)
 at /usr/src/sys/ddb/db_main.c:221
 #4  0x8026c72c in kdb_trap (type=3, code=0,  
 tf=0xa61d79d0)
 at /usr/src/sys/kern/subr_kdb.c:473
 #5  0x8037a4bf in trap (frame=
   {tf_rdi = 0, tf_rsi = -2139025408, tf_rdx = 1, tf_rcx =  
 1057545, tf_r8 = 1048064, tf_r9 = 10, tf_rax = 29, tf_rbx = 0, tf_rbp  
 = -1508017520, tf_r10 = -1508017760, tf_r11 = 10, tf_r12 =  
 -2141840192, tf_r13 = 0, tf_r14 = -1099502938944, tf_r15 =  
 -1099511596728, tf_trapno = 3, tf_addr = 0, tf_flags =  
 -1099511596728, tf_err = 0, tf_rip = -2144943427, tf_cs = 8,  
 tf_rflags = 134, tf_rsp = -1508017520, tf_ss = 16}) at /usr/src/sys/ 
 amd64/amd64/trap.c:442
 #6  0x8036890b in calltrap ()
 at /usr/src/sys/amd64/amd64/exception.S:168
 #7  0x8026c2bd in kdb_enter (msg=0x0) at cpufunc.h:63
 #8  0x8036cc94 in lapic_handle_timer (frame=
   {cf_rdi = -2036801520, cf_rsi = 1, cf_rdx = -1099502946304,  
 cf_rcx = -1095242940416, cf_r8 = -2143479528, cf_r9 = -2143559117,  
 cf_rax = 12582912, cf_rbx = -2036801536, cf_rbp = -1508017200, cf_r10  
 = 0, cf_r11 = 4, cf_r12 = -1099511596800, cf_r13 = 0, cf_r14 =  
 -1099502938944, cf_r15 = -1099511596728, cf_rip = -2145575931, cf_cs  
 = 8, cf_rflags = 514, cf_rsp = -1508017280, cf_ss = 16})
 at /usr/src/sys/amd64/amd64/local_apic.c:635
 #9  0x80369166 in Xtimerint () at apic_vector.S:153
 #10 0x801d1c05 in bge_intr (xsc=0x8698e010) at bus.h:241

Did you have any problems with your network card ? This seems to be
quite popular plot in recent times.

I do not see anything except wedged nfs request in this 

Re: ffs snapshot lockup

2006-10-04 Thread Vivek Khera


On Oct 4, 2006, at 1:16 PM, Kostik Belousov wrote:


#9  0x80369166 in Xtimerint () at apic_vector.S:153
#10 0x801d1c05 in bge_intr (xsc=0x8698e010) at  
bus.h:241


Did you have any problems with your network card ? This seems to be
quite popular plot in recent times.


quite possible.  when i first bought this box, a dell pe800, the NIC  
was not recognized by FreeBSD 5.3.  With some pciconf info and help  
from the stable@ list, we were able to have freebsd recognize the  
chipset and it seemed to work ok.  network performance was never  
stellar.  overall this machine feels much slower than it should be,  
but it is hard to tell whether that is caused by the network or the  
disk.  i suspected the disk.


anyhow, it was upgraded to FreeBSD 5.4, then to 6.0 (at which point i  
had to disable ACPI timer or it would lock during boot probing some  
devices).




I do not see anything except wedged nfs request in this deadlock.  
Seems that

nfs server does not respond.


ok.  i'll try my usual level 0 dump.  that often causes the ffs to  
totally lock that partition.  i may have multiple issues going on  
here :-(




Please, include output of ps in the scripts. It makes much easier  
to grok

the situation quickly..


noted.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ffs snapshot lockup

2006-10-04 Thread Vivek Khera


On Oct 4, 2006, at 1:16 PM, Kostik Belousov wrote:

#10 0x801d1c05 in bge_intr (xsc=0x8698e010) at  
bus.h:241


Did you have any problems with your network card ? This seems to be
quite popular plot in recent times.



Another note: I don't see any bge0 timeouts in my /var/log/messages,  
and never have over the last 2 years of this box.


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ffs snapshot lockup

2006-10-04 Thread Kris Kennaway
On Wed, Oct 04, 2006 at 01:06:37PM -0400, Vivek Khera wrote:
 
 On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote:
 
 
 The only thing I think was running at the time would be a large file
 copy from a remote system to this one using rsync.
 
 As I understand, you got the panic. Then, you shall post the panic  
 message.
 If you have core file, then running kgdb on the core may show  
 required
 information.
 (it shall be on the console exactly before en
 and backtrace (using the bt command of ddb) of the paniced thread.
 
 YOu can also do 'show msgbuf' from DDB.
 
 
 i ran kgdb on the vmcore file.  since the dump was generated by  
 calling doadump from DDB, the backtrace was showing the call stack of  
 that.
 
 from what i read in the output from kgdb, it seems that something  
 locked the kernel and we broke to debugger from the watchdog timeout  
 (I enable software watchdog).

Hmm, be careful with that - if you set the timeout too low (and note
that for some workloads O(minutes) may even be too low) then you'll
get a lot of false positives.

Kris


pgpZ34qBnPXV7.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-04 Thread Vivek Khera


On Oct 4, 2006, at 3:41 PM, Kris Kennaway wrote:


from what i read in the output from kgdb, it seems that something
locked the kernel and we broke to debugger from the watchdog timeout
(I enable software watchdog).


Hmm, be careful with that - if you set the timeout too low (and note
that for some workloads O(minutes) may even be too low) then you'll
get a lot of false positives.


hmmm... the man page for watchdogd doesn't specify what the default  
timeout is, but that's what we've got running.   [tappity-tapptity- 
tap...] source seems to indicate 16seconds timeout.  interesting.


so we could be getting hit with a bge interrupt storm and timing  
out.  i'll turn off fido and see what happens.


at this point, though, i think i have two separate issues.  one with  
bge and watchdog timeout, and one with locking of the filesystem with  
mksnap_ffs, as the symptoms are different.




Re: ffs snapshot lockup

2006-10-04 Thread Kris Kennaway
On Wed, Oct 04, 2006 at 03:53:54PM -0400, Vivek Khera wrote:
 
 On Oct 4, 2006, at 3:41 PM, Kris Kennaway wrote:
 
 from what i read in the output from kgdb, it seems that something
 locked the kernel and we broke to debugger from the watchdog timeout
 (I enable software watchdog).
 
 Hmm, be careful with that - if you set the timeout too low (and note
 that for some workloads O(minutes) may even be too low) then you'll
 get a lot of false positives.
 
 hmmm... the man page for watchdogd doesn't specify what the default  
 timeout is, but that's what we've got running.   [tappity-tapptity- 
 tap...] source seems to indicate 16seconds timeout.  interesting.

Yes, that's probably way too low.  e.g. when creating a snapshot (as
in your workload) your machine may be unresponsive for up to a few
minutes depending on your filesystem size and I/O load.

 so we could be getting hit with a bge interrupt storm and timing  
 out.  i'll turn off fido and see what happens.
 
 at this point, though, i think i have two separate issues.  one with  
 bge and watchdog timeout, and one with locking of the filesystem with  
 mksnap_ffs, as the symptoms are different.

That sounds plausible.  Many people are reporting issues involving NIC
interrupts, but they're proving elusive to characterize so far (there
may be multiple problems).

kris





pgpjeHGeti1Tr.pgp
Description: PGP signature


Re: ffs snapshot lockup

2006-10-04 Thread Vivek Khera


On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote:


Details are posted at http://vivek.khera.org/scratch/crashlogs/

I have the crashdumps available to a kernel hacker upon request (i'd
rather not make them generally available to the public...)

It seems that you have snapshotted fs exported by nfsd ? At least,  
18a is
definitely the case. I have the patch (for current) that shall fix  
the issue.

In fact, you need two patches:


As per advice of Kris Kenneway, I turned off the software watchdog to  
rule out that as my problem. Then I ran a level 3 dump. Dump of root  
fs went fine, then it proceeded to do /usr. After a few minutes it  
locked up. Typescript 20 at the above URL shows the debugging info  
from the break into debugger of the locked up system. Since /usr was  
locked, nobody could log in at all.


The network load was minimal at the time.  I had everyone log out and  
close mail etc.




Re: ffs snapshot lockup

2006-10-04 Thread Peter Holm
On Wed, Oct 04, 2006 at 03:41:48PM -0400, Kris Kennaway wrote:
 On Wed, Oct 04, 2006 at 01:06:37PM -0400, Vivek Khera wrote:
  
  On Oct 4, 2006, at 12:39 PM, Kris Kennaway wrote:
  
  
  The only thing I think was running at the time would be a large file
  copy from a remote system to this one using rsync.
  
  As I understand, you got the panic. Then, you shall post the panic  
  message.
  If you have core file, then running kgdb on the core may show  
  required
  information.
  (it shall be on the console exactly before en
  and backtrace (using the bt command of ddb) of the paniced thread.
  
  YOu can also do 'show msgbuf' from DDB.
  
  
  i ran kgdb on the vmcore file.  since the dump was generated by  
  calling doadump from DDB, the backtrace was showing the call stack of  
  that.
  
  from what i read in the output from kgdb, it seems that something  
  locked the kernel and we broke to debugger from the watchdog timeout  
  (I enable software watchdog).
 
 Hmm, be careful with that - if you set the timeout too low (and note
 that for some workloads O(minutes) may even be too low) then you'll
 get a lot of false positives.
 


Oh, yes. I've using this with success:

watchdogd -t 3600 -e 'ls /tmp /dev  /dev/null; true' -s 60

- Peter
 Kris

 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.5 (FreeBSD)
 
 iD8DBQFFJA58Wry0BWjoQKURAr5dAKDf4YLcBJU9owRw6N1L3FcgJkvOOgCfRQkq
 bd8+tGZVB28bkYBN6KL7iO0=
 =B7Vl
 -END PGP SIGNATURE-
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ffs snapshot lockup

2006-10-03 Thread Kostik Belousov
On Mon, Oct 02, 2006 at 03:23:49PM -0400, Vivek Khera wrote:
 
 On Sep 22, 2006, at 4:36 PM, Kris Kennaway wrote:
 
 Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and
 DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB
 (spammy, need that serial console), or at least trace the running
 processes (show allpcpu) and those listed in lockedvnods.  Then call
 doadump and save the core+kernel.debug when you reboot.
 
 Well, what an exciting weekend we had!  Three lockups/panics: two  
 during running a dump (mksnap_ffs seems to be the culprit) and one  
 running background fsck after rebooting from the prior crash.
 
 Details are posted at http://vivek.khera.org/scratch/crashlogs/
 
 I have the crashdumps available to a kernel hacker upon request (i'd  
 rather not make them generally available to the public...)
 
It seems that you have snapshotted fs exported by nfsd ? At least, 18a is
definitely the case. I have the patch (for current) that shall fix the issue.
In fact, you need two patches:

1. buffer ownership patch by Tor Egge (already in current, I think it shall
be MFCed before 6.2)

tegge   2006-10-02 02:06:27 UTC

  FreeBSD src repository

  Modified files:
sys/kern kern_lock.c vfs_bio.c 
sys/sys  buf.h lockmgr.h 
  Log:
  If the buffer lock has waiters after the buffer has changed identity then
  getnewbuf() needs to drop the buffer in order to wake waiters that might
  sleep on the buffer in the context of the old identity.
  
  Revision  ChangesPath
  1.100 +15 -0 src/sys/kern/kern_lock.c
  1.510 +11 -0 src/sys/kern/vfs_bio.c
  1.194 +11 -0 src/sys/sys/buf.h
  1.51  +1 -0  src/sys/sys/lockmgr.h
___
cvs-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/cvs-all
To unsubscribe, send any mail to [EMAIL PROTECTED]


Index: src/sys/kern/kern_lock.c
diff -u src/sys/kern/kern_lock.c:1.99 src/sys/kern/kern_lock.c:1.100
--- src/sys/kern/kern_lock.c:1.99   Tue Aug 15 18:29:01 2006
+++ src/sys/kern/kern_lock.cMon Oct  2 02:06:26 2006
@@ -566,6 +566,21 @@
 }
 
 /*
+ * Determine the number of waiters on a lock.
+ */
+int
+lockwaiters(lkp)
+   struct lock *lkp;
+{
+   int count;
+
+   mtx_lock(lkp-lk_interlock);
+   count = lkp-lk_waitcount;
+   mtx_unlock(lkp-lk_interlock);
+   return (count);
+}
+
+/*
  * Print out information about state of a lock. Used by VOP_PRINT
  * routines to display status about contained locks.
  */
Index: src/sys/kern/vfs_bio.c
diff -u src/sys/kern/vfs_bio.c:1.509 src/sys/kern/vfs_bio.c:1.510
--- src/sys/kern/vfs_bio.c:1.509Wed Aug  9 17:43:26 2006
+++ src/sys/kern/vfs_bio.c  Mon Oct  2 02:06:26 2006
@@ -1890,6 +1890,17 @@
}
 
/*
+* Notify any waiters for the buffer lock about
+* identity change by freeing the buffer.
+*/
+   if (qindex == QUEUE_CLEAN  BUF_LOCKWAITERS(bp)  0) {
+   bp-b_flags |= B_INVAL;
+   bfreekva(bp);
+   brelse(bp);
+   goto restart;
+   }
+
+   /*
 * If we are overcomitted then recover the buffer and its
 * KVM space.  This occurs in rare situations when multiple
 * processes are blocked in getnewbuf() or allocbuf().
Index: src/sys/sys/buf.h
diff -u src/sys/sys/buf.h:1.193 src/sys/sys/buf.h:1.194
--- src/sys/sys/buf.h:1.193 Fri Mar 31 02:56:30 2006
+++ src/sys/sys/buf.h   Mon Oct  2 02:06:27 2006
@@ -371,6 +371,17 @@
return ret;
 }
 
+
+/*
+ * Find out the number of waiters on a lock.
+ */
+static __inline int BUF_LOCKWAITERS(struct buf *);
+static __inline int
+BUF_LOCKWAITERS(struct buf *bp)
+{
+   return (lockwaiters(bp-b_lock));
+}
+
 #endif /* _KERNEL */
 
 struct buf_queue_head {
Index: src/sys/sys/lockmgr.h
diff -u src/sys/sys/lockmgr.h:1.50 src/sys/sys/lockmgr.h:1.51
--- src/sys/sys/lockmgr.h:1.50  Tue Aug 15 18:29:01 2006
+++ src/sys/sys/lockmgr.h   Mon Oct  2 02:06:27 2006
@@ -203,6 +203,7 @@
 void   lockmgr_printinfo(struct lock *);
 intlockstatus(struct lock *, struct thread *);
 intlockcount(struct lock *);
+intlockwaiters(struct lock *);
 #ifdef DDB
 intlockmgr_chain(struct thread *td, struct thread **ownerp);
 #endif

2. this one (it shall be applied in the sys/ufs; please, test and report
results)

Index: ffs/ffs_inode.c
===
RCS file: /usr/local/arch/ncvs/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.106
diff -u -r1.106 ffs_inode.c
--- ffs/ffs_inode.c 5 Apr 2005 08:49:41 -   1.106
+++ ffs/ffs_inode.c 27 Sep 2006 08:29:18 -
@@ -66,9 +66,11 @@
  * IN_ACCESS, IN_UPDATE, and IN_CHANGE flags respectively.  Write the inode
  * to disk if the IN_MODIFIED flag is set (it may be set 

Re: ffs snapshot lockup

2006-10-03 Thread Vivek Khera


On Oct 3, 2006, at 4:43 AM, Kostik Belousov wrote:


Details are posted at http://vivek.khera.org/scratch/crashlogs/

I have the crashdumps available to a kernel hacker upon request (i'd
rather not make them generally available to the public...)

It seems that you have snapshotted fs exported by nfsd ? At least,  
18a is
definitely the case. I have the patch (for current) that shall fix  
the issue.

In fact, you need two patches:


Yes, it is an NFS exported partition.

The patches applied well to my existing 6.2-PRE source tree (I did  
not cvsup so as to minimize the changes to the system and prove that  
this fix either does or does not work for my problem.)  There was  
minimal line fuzz, but no failures.  If this works out, I'll cvsup  
and re-patch and test the latest sources.


I'm building and installing the kernel now.  Tomorrow when I'm at the  
office I'll  try the dump again as that usually causes a lockup.


Thanks for helping me solve this.



Re: ffs snapshot lockup

2006-10-02 Thread Vivek Khera


On Sep 22, 2006, at 4:36 PM, Kris Kennaway wrote:


Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and
DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB
(spammy, need that serial console), or at least trace the running
processes (show allpcpu) and those listed in lockedvnods.  Then call
doadump and save the core+kernel.debug when you reboot.


Well, what an exciting weekend we had!  Three lockups/panics: two  
during running a dump (mksnap_ffs seems to be the culprit) and one  
running background fsck after rebooting from the prior crash.


Details are posted at http://vivek.khera.org/scratch/crashlogs/

I have the crashdumps available to a kernel hacker upon request (i'd  
rather not make them generally available to the public...)




Re: ffs snapshot lockup

2006-09-26 Thread Vivek Khera


On Sep 25, 2006, at 2:58 PM, Jeremy Chadwick wrote:


This problem sounds awfully like what I reported in a recent PR.
Now I'm starting to have my doubts as to whether or not my hardware
was indeed at fault...

http://www.freebsd.org/cgi/query-pr.cgi?pr=103435


I think my issue is a bit different -- the FS is the only thing  
locked up, and only the single FS that mksnap_ffs is running on.  The  
rest of the system continues to run quite nicely as long as the  
process doesn't attempt to access the locked file system.  In my case  
it has been the home directory partition, which is easy to avoid if  
you're root.


I don't observe ethernet timeouts, etc.



Re: ffs snapshot lockup

2006-09-25 Thread Vivek Khera


On Sep 22, 2006, at 4:36 PM, Kris Kennaway wrote:


Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and
DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB
(spammy, need that serial console), or at least trace the running
processes (show allpcpu) and those listed in lockedvnods.  Then call
doadump and save the core+kernel.debug when you reboot.


Well, it happened again (as I was building the debugging kernel, no  
less)... but it only locked up one file system not the whole box.  So  
my home dir was wedged, and everything trying to use anyone's home  
dir was wedged, but /usr and / were still responding.


So now I do have the kernel in place and the serial console is  
already there and your notes on what to do are there, so next time it  
happens I'm ready!




Re: ffs snapshot lockup

2006-09-25 Thread Jeremy Chadwick
This problem sounds awfully like what I reported in a recent PR.
Now I'm starting to have my doubts as to whether or not my hardware
was indeed at fault...

http://www.freebsd.org/cgi/query-pr.cgi?pr=103435

-- 
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networkinghttp://www.parodius.com/ |
| UNIX Systems Administrator   Mountain View, CA, USA |
| Making life hard for others since 1977.   PGP: 4BD6C0CB |

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


ffs snapshot lockup

2006-09-22 Thread Vivek Khera
Yesterday we upgraded an amd64 system from 6.1 to 6.2-PRE from a  
cvsup of September 20 evening.  The overnight nightly dump to tape  
went off normally.  This afternoon when the sysadmin was performing  
the level0 backup, the ffs_mksnap program locked up the entire system.


It wasn't the expected short hiccup that mksnap sometimes causes --  
it was started just prior to lunchtime and when we got back everyone  
had wedged login sessions and IMAP sessions, and ffs_mksnap was in  
D state according to ps and idle for about 90 minutes.


The only way out was to reboot the system.

I know this is very un-useful as a bug report, but I'm putting this  
out in case anyone else has noticed this.  It has never happened to  
us before when we were running 5.4 and then 6.1 on this box.  Nothing  
else changed on the system during this time.


The disk in question is on an adaptec RAID controller with the aac  
driver.




Re: ffs snapshot lockup

2006-09-22 Thread Kris Kennaway
On Fri, Sep 22, 2006 at 02:55:34PM -0400, Vivek Khera wrote:
 Yesterday we upgraded an amd64 system from 6.1 to 6.2-PRE from a  
 cvsup of September 20 evening.  The overnight nightly dump to tape  
 went off normally.  This afternoon when the sysadmin was performing  
 the level0 backup, the ffs_mksnap program locked up the entire system.
 
 It wasn't the expected short hiccup that mksnap sometimes causes --  
 it was started just prior to lunchtime and when we got back everyone  
 had wedged login sessions and IMAP sessions, and ffs_mksnap was in  
 D state according to ps and idle for about 90 minutes.
 
 The only way out was to reboot the system.
 
 I know this is very un-useful as a bug report, but I'm putting this  
 out in case anyone else has noticed this.  It has never happened to  
 us before when we were running 5.4 and then 6.1 on this box.  Nothing  
 else changed on the system during this time.

Yep, we'd need additional debugging to proceed.

Kris
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ffs snapshot lockup

2006-09-22 Thread Vivek Khera


On Sep 22, 2006, at 3:03 PM, Kris Kennaway wrote:


I know this is very un-useful as a bug report, but I'm putting this
out in case anyone else has noticed this.  It has never happened to
us before when we were running 5.4 and then 6.1 on this box.  Nothing
else changed on the system during this time.


Yep, we'd need additional debugging to proceed.


How does one debug a system where the disk sub-system is frozen?  I  
have serial console with DDB/KDB built into the kernel and debugging  
symbols too.  Is that the only option here?


We're gonna try the full level 0 backup again monday and if it locks  
up I'll try to figure it out, but could use some hints as to where to  
poke in the kernel.




Re: ffs snapshot lockup

2006-09-22 Thread Kris Kennaway
On Fri, Sep 22, 2006 at 04:24:41PM -0400, Vivek Khera wrote:
 
 On Sep 22, 2006, at 3:03 PM, Kris Kennaway wrote:
 
 I know this is very un-useful as a bug report, but I'm putting this
 out in case anyone else has noticed this.  It has never happened to
 us before when we were running 5.4 and then 6.1 on this box.  Nothing
 else changed on the system during this time.
 
 Yep, we'd need additional debugging to proceed.
 
 How does one debug a system where the disk sub-system is frozen?  I  
 have serial console with DDB/KDB built into the kernel and debugging  
 symbols too.  Is that the only option here?

Perhaps, but you can probably also take a dump still.  It's a
filesystem deadlock, not a disk device problem.

 We're gonna try the full level 0 backup again monday and if it locks  
 up I'll try to figure it out, but could use some hints as to where to  
 poke in the kernel.

Start by enabling INVARIANTS, INVARIANT_SUPPORT, DEBUG_LOCKS and
DEBUG_VFS_LOCKS, then run 'show lockedvnods' and 'alltrace' in DDB
(spammy, need that serial console), or at least trace the running
processes (show allpcpu) and those listed in lockedvnods.  Then call
doadump and save the core+kernel.debug when you reboot.

Kris



pgpxnqTnkvGrb.pgp
Description: PGP signature