Re: Recurring problem: processes block accessing UFS file system

2006-01-12 Thread Don Lewis
On 11 Jan, Denis Shaposhnikov wrote:
 Hi!
 
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don Are you using any unusual file systems, such as nullfs or
  Don unionfs?
 
   Yes, I use a lot of nullfs. This is a host system for about 20
   jails with the system nullfs-mounted read-only:
 
  Don That would be my guess as to the cause of the problem. Hopefully
  Don DEBUG_VFS_LOCKS will help pinpoint the bug.
 
 I've got the problem again. Now I have a debug kernel and a crash
 dump. Here is the output from kdb. Do you need any additional
 information that I can get from the crash dump?


 0xc6bed3fc: tag ufs, type VDIR
 usecount 3, writecount 0, refcount 4 mountedhere 0
 flags ()
 v_object 0xc84fce0c ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc69d1d80 (pid 546) with 2 pending
#0 0xc04cd7ad at lockmgr+0x3c6
 #1 0xc053d780 at vop_stdlock+0x32
 #2 0xc0655075 at VOP_LOCK_APV+0x3b
 #3 0xc05db299 at ffs_lock+0x19
 #4 0xc0655075 at VOP_LOCK_APV+0x3b
 #5 0xc0557513 at vn_lock+0x6c
 #6 0xc054a320 at vget+0x87
 #7 0xc053a1f0 at cache_lookup+0xde
 #8 0xc053ac93 at vfs_cache_lookup+0xad
 #9 0xc0652eaf at VOP_LOOKUP_APV+0x3b
 #10 0xc053f786 at lookup+0x419
 #11 0xc0540262 at namei+0x2e7
 #12 0xc0558d37 at vn_open_cred+0x1c2
 #13 0xc055918a at vn_open+0x33
 #14 0xc054e6d3 at kern_open+0xd2
 #15 0xc054f055 at open+0x36
 #16 0xc0648a47 at syscall+0x23a
 #17 0xc06326ff at Xint0x80_syscall+0x1f
 
   ino 2072795, on dev ad4s1g
 
 0xc6c07bf4: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc940a744 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc9146c00 (pid 33016) with 1 pending
#0 0xc04cd7ad at lockmgr+0x3c6
 #1 0xc053d780 at vop_stdlock+0x32
 #2 0xc0655075 at VOP_LOCK_APV+0x3b
 #3 0xc05db299 at ffs_lock+0x19
 #4 0xc0655075 at VOP_LOCK_APV+0x3b
 #5 0xc0557513 at vn_lock+0x6c
 #6 0xc054a320 at vget+0x87
 #7 0xc053a1f0 at cache_lookup+0xde
 #8 0xc053ac93 at vfs_cache_lookup+0xad
 #9 0xc0652eaf at VOP_LOOKUP_APV+0x3b
 #10 0xc053f786 at lookup+0x419
 #11 0xc0540262 at namei+0x2e7
 #12 0xc0554134 at kern_rmdir+0x98
 #13 0xc055436e at rmdir+0x22
 #14 0xc0648a47 at syscall+0x23a
 #15 0xc06326ff at Xint0x80_syscall+0x1f
 
   ino 2072767, on dev ad4s1g
 

 db alltrace

 Tracing command parser3.cgi pid 33016 tid 100652 td 0xc9146c00
 sched_switch(c9146c00,0,1,5e57bc72,aaef377f) at sched_switch+0xd8
 mi_switch(1,0,0,e93d86c4,c04f425b) at mi_switch+0x150
 sleepq_switch(e93d86f8,c04e4cc6,c6bed454,50,c0672a31) at sleepq_switch+0x115
 sleepq_wait(c6bed454,50,c0672a31,0,c602a080) at sleepq_wait+0xb
 msleep(c6bed454,c06b9c98,50,c0672a31,0) at msleep+0x454
 acquire(6,da1dcb4c,c0504340,da1dcb4c,4c) at acquire+0x7a
 lockmgr(c6bed454,2002,c6bed4c4,c9146c00,e93d87e0) at lockmgr+0x4ce
 vop_stdlock(e93d8838,120,c06a4f20,e93d8838,e93d87f0) at vop_stdlock+0x32
 VOP_LOCK_APV(c06a55c0,e93d8838,e93d8808,c0655075,e93d8838) at 
 VOP_LOCK_APV+0x3b
 ffs_lock(e93d8838,f,2,c6bed3fc,e93d8854) at ffs_lock+0x19
 VOP_LOCK_APV(c06a4f20,e93d8838,c0653a6a,c053acd2,c0652eaf) at 
 VOP_LOCK_APV+0x3b
 vn_lock(c6bed3fc,2002,c9146c00,c06326ff,6) at vn_lock+0x6c
 vget(c6bed3fc,2002,c9146c00,c0653a6a,c053acd2) at vget+0x87
 vfs_hash_get(c6268c00,1fa0db,2,c9146c00,e93d89bc) at vfs_hash_get+0xf5
 ffs_vget(c6268c00,1fa0db,2,e93d89bc,e93d89c0) at ffs_vget+0x49
 ufs_lookup(e93d8a6c,c0685ac5,c6c07bf4,e93d8c34,e93d8aa8) at ufs_lookup+0x965
 VOP_CACHEDLOOKUP_APV(c06a4f20,e93d8a6c,e93d8c34,c9146c00,c7fac400) at 
 VOP_CACHEDLOOKUP_APV+0x59
 vfs_cache_lookup(e93d8b14,e93d8ac0,c6c07bf4,e93d8c34,e93d8b30) at 
 vfs_cache_lookup+0xec
 VOP_LOOKUP_APV(c06a4f20,e93d8b14,c9146c00,c065510a,c054a6f6) at 
 VOP_LOOKUP_APV+0x3b
 lookup(e93d8c0c,c954a800,400,e93d8c28,0) at lookup+0x419
 namei(e93d8c0c,ffdf,2,0,c6268c00) at namei+0x2e7
 kern_rmdir(c9146c00,824b500,0,e93d8d30,c0648a47) at kern_rmdir+0x98
 rmdir(c9146c00,e93d8d04,4,d,e93d8d38) at rmdir+0x22
 syscall(804003b,e50003b,bfbf003b,8310560,8310560) at syscall+0x23a
 Xint0x80_syscall() at Xint0x80_syscall+0x1f
 --- syscall (137, FreeBSD ELF32, rmdir), eip = 0xe8201bf, esp = 0xbfbfbcfc, 
 ebp = 0xbfbfbd18 ---

 Tracing command nginx pid 546 tid 100122 td 0xc69d1d80
 sched_switch(c69d1d80,0,1,f002cf9e,6c43d35e) at sched_switch+0xd8
 mi_switch(1,0,c0502e44,e8a73670,c04f425b) at mi_switch+0x150
 sleepq_switch(e8a736a4,c04e4cc6,c6c07c4c,50,c0672a31) at sleepq_switch+0x115
 sleepq_wait(c6c07c4c,50,c0672a31,0,c044b82a) at sleepq_wait+0xb
 msleep(c6c07c4c,c06ba45c,50,c0672a31,0) at msleep+0x454
 acquire(6,c0655075,c05db299,c0655075,c0557513) at acquire+0x7a
 lockmgr(c6c07c4c,2002,c6c07cbc,c69d1d80,e8a7378c) at lockmgr+0x4ce
 vop_stdlock(e8a737e4,c06a4f20,c06a4f20,e8a737e4,e8a7379c) at vop_stdlock+0x32
 VOP_LOCK_APV(c06a55c0,e8a737e4,e8a737b4,c0655075,e8a737e4) at 
 VOP_LOCK_APV+0x3b
 ffs_lock(e8a737e4,c0673faf,2,c6c07bf4,e8a73800) at ffs_lock+0x19
 VOP_LOCK_APV(c06a4f20,e8a737e4,e8a73814,c0504340,e8a73814) at 
 VOP_LOCK_APV+0x3b
 

Re: Recurring problem: processes block accessing UFS file system

2006-01-12 Thread Don Lewis
On 12 Jan, Don Lewis wrote:
 On 11 Jan, Denis Shaposhnikov wrote:
 Hi!
 
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don Are you using any unusual file systems, such as nullfs or
  Don unionfs?
 
   Yes, I use a lot of nullfs. This is a host system for about 20
   jails with the system nullfs-mounted read-only:
 
  Don That would be my guess as to the cause of the problem. Hopefully
  Don DEBUG_VFS_LOCKS will help pinpoint the bug.
 
 I've got the problem again. Now I have a debug kernel and a crash
 dump. Here is the output from kdb. Do you need any additional
 information that I can get from the crash dump?

 Process 33016 is executing rmdir().  While doing the lookup, it is
 holding a lock on vnode 0xc6c07bf4 and attempting to lock vnode
 c6bed3fc.  Vnode 0xc6c07bf4 should be the parent directory of c6bed3fc.
 
 Process 546 is executing open().  While doing the lookup, it is holding
 a lock on vnode 0xc6bed3fc while attempting to lock vnode c6c07bf4.
 Vnode 0xc6bed3fc should be the parent directory of c6c07bf4, but this is
 inconsistent with the previous paragraph.
 
 This situation should not be possible.  Using kgdb on your saved crash
 dump, print fmode and *ndp in the vn_open_cred() stack frame of
 process 546, and *nd in the kern_rmdir() stack frame of process 33016.
 The path names being looked up may be helpful.
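
 A rough sketch of the kgdb session for this (the thread and frame
 numbers below are placeholders that depend on the actual dump):

 # kgdb kernel.debug /var/crash/vmcore.0
 (kgdb) info threads            (find the kernel thread for pid 546)
 (kgdb) thread N                (N: that thread's number in the list)
 (kgdb) bt                      (locate the vn_open_cred() frame)
 (kgdb) frame M                 (M: that frame's number)
 (kgdb) print fmode
 (kgdb) print *ndp

 and likewise for pid 33016, selecting its kern_rmdir() frame and doing
 "print *nd" there.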
 
 Are there any symbolic links in the path names?  If so, what are the
 link contents?
 
 Are either of these processes jailed?  If so, same or different jails?
 
 What are inodes 2072767 and 2072795 on ad4s1g?
 
 Are you using snapshots?

I just thought of another possible cause for this problem.  Is it
possible that you have any hard links to directories in the file system
on ad4s1g?  That could put a loop in the directory tree and mess up the
normal parent-child relationship that we rely on to avoid deadlocks.
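
One rough way to check for that (the exact command syntax is a sketch
and may need adjusting): look for directory inodes that show up under
more than one name on the /var file system, for example

find -x /var -type d -exec stat -f '%i %N' {} + | sort -n | \
    awk 'seen[$1]++ == 1 { print "duplicate directory inode:", $1 }'

A full (non-background) fsck of ad4s1g should also complain if the
directory tree really has a loop in it.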



Re: Recurring problem: processes block accessing UFS file system

2006-01-11 Thread Denis Shaposhnikov
Hi!

 Don == Don Lewis [EMAIL PROTECTED] writes:

 Don Are you using any unusual file systems, such as nullfs or
 Don unionfs?

  Yes, I use a lot of nullfs. This is a host system for about 20
  jails with the system nullfs-mounted read-only:

 Don That would be my guess as to the cause of the problem. Hopefully
 Don DEBUG_VFS_LOCKS will help pinpoint the bug.

I've got the problem again. Now I have a debug kernel and a crash
dump. Here is the output from kdb. Do you need any additional
information that I can get from the crash dump?

Script started on Wed Jan 11 16:19:51 2006
Connected

~KDB: enter: Line break on console
[thread pid 10 tid 16 ]
Stopped at  kdb_enter+0x32: leave   
db ps
  pid   proc      uid  ppid  pgrp  flag     stat  wmesg    wchan    cmd
33279 c8fb66840 1 1 000 [SLPQ ufs 0xc690d454][SLP] sh
33017 c7f998b0 1008 1  2692 000 [SLPQ ufs 0xc690d300][SLP] httpd
33016 c9261000 1008 1  2692 0004000 [SLPQ ufs 0xc6bed454][SLP] parser3.cgi
33015 c8e1fd08 1008 1  2692 0004000 [SLPQ ufs 0xc6bed454][SLP] perl5.8.7
33013 c9796458 1008 1  2692 0004000 [SLPQ ufs 0xc690d454][SLP] parser3.cgi
30627 c8f2fd08 1012 21055 30627 0004002 [SLPQ ufs 0xc690d454][SLP] perl5.8.7
30210 c86e3684 1010 1 30210 0004000 [SLPQ ufs 0xc690d454][SLP] sftp-server
21225 c858522c 1012 1 21225 0004000 [SLPQ ufs 0xc6bf9300][SLP] sftp-server
21055 c6c6d684 1012 1 21055 0015002 [SLPQ ufs 0xc690d454][SLP] zsh
21031 c62978b0 1012 1 21031 0004002 [SLPQ ufs 0xc690d454][SLP] zsh
21003 c92696840  2728 21003 0004100 [SLPQ ufs 0xc690d454][SLP] sshd
 5004 c941eadc 1010 1  5004 0004000 [SLPQ ufs 0xc690d454][SLP] sftp-server
 2735 c80076840 1  2735 000 [SLPQ ufs 0xc690d454][SLP] cron
 2728 c7ca10000 1  2728 100 [SLPQ ufs 0xc690d454][SLP] sshd
 2705 c8006d08 1008 1  2703 0004002 [SLPQ ufs 0xc690d454][SLP] perl5.8.7
 2639 c8006adc0 1  2639 000 [SLPQ ufs 0xc690d454][SLP] syslogd
  546 c6a10684   80 1   540 100 [SLPQ ufs 0xc6c07c4c][SLP] nginx
   36 c60d722c0 0 0 204 [SLPQ - 0xe4c45cf0][SLP] schedcpu
   35 c60d74580 0 0 204 [SLPQ ktsusp 0xc60d74e0][SLP] syncer
   34 c60d76840 0 0 204 [SLPQ ktsusp 0xc60d770c][SLP] vnlru
   33 c60d78b00 0 0 204 [SLPQ ktsusp 0xc60d7938][SLP] bufdaemon
   32 c60d7adc0 0 0 20c [SLPQ pgzero 0xc06d0b14][SLP] pagezero
   31 c60d7d080 0 0 204 [SLPQ psleep 0xc06c8a48][SLP] vmdaemon
   30 c60d80000 0 0 204 [SLPQ psleep 0xc06c8a08][SLP] pagedaemon
   29 c60d822c0 0 0 204 [IWAIT] irq1: atkbd0
   28 c60d84580 0 0 204 [IWAIT] swi0: sio
   27 c602d6840 0 0 204 [IWAIT] irq18: atapci1
   26 c602d8b00 0 0 204 [IWAIT] irq15: ata1
   25 c602dadc0 0 0 204 [IWAIT] irq14: ata0
   24 c602dd080 0 0 204 [IWAIT] irq27: em1
   23 c6070 0 0 204 [IWAIT] irq26: em0
   22 c607022c0 0 0 204 [IWAIT] irq9: acpi0
   21 c60704580 0 0 204 [IWAIT] swi2: cambio
   20 c60706840 0 0 204 [IWAIT] swi6: task queue
9 c60708b00 0 0 204 [SLPQ - 0xc6064e00][SLP] acpi_task2
8 c6070adc0 0 0 204 [SLPQ - 0xc6064e00][SLP] acpi_task1
7 c5fd822c0 0 0 204 [SLPQ - 0xc6064e00][SLP] acpi_task0
   19 c5fd84580 0 0 204 [IWAIT] swi6: Giant taskq
6 c5fd86840 0 0 204 [SLPQ - 0xc6065000][SLP] thread taskq
   18 c5fd88b00 0 0 204 [IWAIT] swi5: Fast taskq
5 c5fd8adc0 0 0 204 [SLPQ - 0xc60650c0][SLP] kqueue taskq
   17 c5fd8d080 0 0 204 [SLPQ - 0xc06b8460][SLP] yarrow
4 c602d0000 0 0 204 [SLPQ - 0xc06b8d00][SLP] g_down
3 c602d22c0 0 0 204 [SLPQ - 0xc06b8cfc][SLP] g_up
2 c602d4580 0 0 204 [SLPQ - 0xc06b8cf4][SLP] g_event
   16 c5fd30000 0 0 204 [IWAIT] swi1: net
   15 c5fd322c0 0 0 204 [IWAIT] swi3: vm
   14 c5fd34580 0 0 20c [IWAIT] swi4: clock sio
   13 c5fd36840 0 0 20c [CPU 0] idle: cpu0
   12 c5fd38b00 0 0 20c [CPU 1] idle: cpu1
   11 c5fd3adc0 0 0 20c [CPU 2] idle: cpu2
   10 c5fd3d080 0 0 20c [CPU 3] idle: cpu3
1 c5fd80000 0 1 0004200 [SLPQ ufs 0xc690d5a8][SLP] init
0 c06b8e000 0 0 200 [IWAIT] swapper
21030 c8e1f8b0 1012 21003 21003 0002100 zomb[INACTIVE] sshd
 5005 c93f86840  2728  5005 0006100 zomb[INACTIVE] sshd
 5000 c83d1d080  2728  5000 0006100 zomb[INACTIVE] sshd
30211 c7907d080  2728 30211 0006100 zomb[INACTIVE] sshd
 4997 c6a10adc0  2728  4997 0006100 zomb[INACTIVE] sshd
 6404 c85586840  2728  6404 0006100 zomb[INACTIVE] sshd

db show lockedvnods
Locked vnodes

0xc690d550: tag ufs, type VDIR
usecount 1, writecount 0, 

Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Denis Shaposhnikov
Hi!

 Greg == Greg Rivers [EMAIL PROTECTED] writes:

 Greg It's taken more than a month, but the problem has recurred
 Greg without snapshots ever having been run.  I've got a good trace

I think that I have the same problem on a fresh CURRENT. For some
processes I see MWCHAN = ufs and D in the STAT, and I can't kill
such processes even with -9. The system can't kill them on shutdown
either, so shutdown hangs forever after the "All buffers synced"
message. At that point I entered KDB and did show lockedvnods:

Locked vnodes

0xc687cb58: tag ufs, type VDIR
usecount 1, writecount 0, refcount 2 mountedhere 0
flags ()
v_object 0xcb5b1934 ref 0 pages 0
 lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 
pending
ino 2072602, on dev ad4s1g

0xc687ca50: tag ufs, type VDIR
usecount 31, writecount 0, refcount 32 mountedhere 0
flags ()
v_object 0xc85d2744 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 
pending
ino 2072603, on dev ad4s1g

0xc687c948: tag ufs, type VDIR
usecount 2, writecount 0, refcount 3 mountedhere 0
flags ()
v_object 0xc875d000 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 
pending
ino 2072615, on dev ad4s1g

0xc691f420: tag ufs, type VDIR
usecount 2, writecount 0, refcount 3 mountedhere 0
flags ()
v_object 0xc8a773e0 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 pending
ino 2072680, on dev ad4s1g

0xc691f318: tag ufs, type VDIR
usecount 3, writecount 0, refcount 4 mountedhere 0
flags ()
v_object 0xc8a7b2e8 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc7019780 (pid 74103) with 2 
pending
ino 2072795, on dev ad4s1g

0xc69bb528: tag ufs, type VDIR
usecount 2, writecount 0, refcount 3 mountedhere 0
flags ()
v_object 0xc7890744 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc91f4600 (pid 74129) with 1 
pending
ino 2072767, on dev ad4s1g
Locked vnodes

0xc687cb58: tag ufs, type VDIR
usecount 1, writecount 0, refcount 2 mountedhere 0
flags ()
v_object 0xcb5b1934 ref 0 pages 0
 lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 
pending
ino 2072602, on dev ad4s1g

0xc687ca50: tag ufs, type VDIR
usecount 31, writecount 0, refcount 32 mountedhere 0
flags ()
v_object 0xc85d2744 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 
pending
ino 2072603, on dev ad4s1g

0xc687c948: tag ufs, type VDIR
usecount 2, writecount 0, refcount 3 mountedhere 0
flags ()
v_object 0xc875d000 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 
pending
ino 2072615, on dev ad4s1g

0xc691f420: tag ufs, type VDIR
usecount 2, writecount 0, refcount 3 mountedhere 0
flags ()
v_object 0xc8a773e0 ref 0 pages 1
 lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 pending
ino 2072680, on dev ad4s1g

0xc691f318: tag ufs, type VDIR
usecount 3, writecount 0, refcount 4 mountedhere 0
flags ()
v_object 0xc8a7b2e8 ref 0 pages 1
 lock type ufs: EXCL (count 1) by t(kgdb) 

After that I did "call doadump" and got a vmcore.

ps shows me:

(kgdb) ps
During symbol reading, Incomplete CFI data; unspecified registers at 0xc04d97eb.
  pid    proc      uid  ppid  pgrp   flag  stat  comm          wchan
74686 c94640000 1 1  00  1  sh   ufs c687caa8
74195 c970d0000  3074 74195  4000100  1  sshd ufs c687caa8
74178 c7682adc0  3074 74178  004000  1  sshd ufs c687c9a0
74129 c9b82adc 1008 1  5504  004000  1  parser3.cgi   ufs c691f370
74103 c70b5458 1008 1  5504  00  1  httpdufs c69bb580
65610 c92c0458 1005 1 65610  004000  1  sftp-server   ufs c691f478
 5518 c6247458 1008 1  5516  004002  1  perl5.8.7ufs c687caa8
 3081 c7523d080 1  3081  00  1  cron ufs c687caa8
 3074 c7682d080 1  3074  000100  1  sshd ufs c687caa8
 3016 c7523adc0 1  3016  00  1  syslogd  ufs c687caa8
  519 c68e4d08   80 1   518  000100  1  nginxufs c691f370
   34 c6260 0 0  000204  1  schedcpu - e88b3cf0
   33 c62438b00 0 0  000204  1  syncer   ktsusp c6243938
   32 c6243adc0 0 0  000204  1  vnlruktsusp c6243b64
   31 c6243d080 0 0  000204  1  bufdaemonktsusp c6243d90
   30 c62440000 0 0  00020c  1  pagezero pgzero c06c21a0
   29 c624422c0 0 0  000204  1  vmdaemon psleep c06c1d08
   28 c62444580 0 0  000204  1  pagedaemon   psleep c06c1cc8
   27 c602e6840 0 0  000204  1  irq1: atkbd0   
   26 c602e8b00 0 0  000204  1  swi0: sio
   25 c602eadc0 0 0  000204  1  irq18: atapci1   
   24 

Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Don Lewis
On  5 Jan, Denis Shaposhnikov wrote:
 Hi!
 
 Greg == Greg Rivers [EMAIL PROTECTED] writes:
 
  Greg It's taken more than a month, but the problem has recurred
  Greg without snapshots ever having been run.  I've got a good trace
 
 I think that I have the same problem on a fresh CURRENT. For some
 processes I see MWCHAN = ufs and D in the STAT, and I can't kill
 such processes even with -9. The system can't kill them on shutdown
 either, so shutdown hangs forever after the "All buffers synced"
 message. At that point I entered KDB and did show lockedvnods:
 
 Locked vnodes
 
 0xc687cb58: tag ufs, type VDIR
 usecount 1, writecount 0, refcount 2 mountedhere 0
 flags ()
 v_object 0xcb5b1934 ref 0 pages 0
  lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 
 pending
 ino 2072602, on dev ad4s1g
 
 0xc687ca50: tag ufs, type VDIR
 usecount 31, writecount 0, refcount 32 mountedhere 0
 flags ()
 v_object 0xc85d2744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 
 pending
 ino 2072603, on dev ad4s1g
 
 0xc687c948: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc875d000 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 
 pending
 ino 2072615, on dev ad4s1g
 
 0xc691f420: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc8a773e0 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 
 pending
 ino 2072680, on dev ad4s1g
 
 0xc691f318: tag ufs, type VDIR
 usecount 3, writecount 0, refcount 4 mountedhere 0
 flags ()
 v_object 0xc8a7b2e8 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc7019780 (pid 74103) with 2 
 pending
 ino 2072795, on dev ad4s1g
 
 0xc69bb528: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc7890744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc91f4600 (pid 74129) with 1 
 pending
 ino 2072767, on dev ad4s1g
 Locked vnodes
 
 0xc687cb58: tag ufs, type VDIR
 usecount 1, writecount 0, refcount 2 mountedhere 0
 flags ()
 v_object 0xcb5b1934 ref 0 pages 0
  lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 
 pending
 ino 2072602, on dev ad4s1g
 
 0xc687ca50: tag ufs, type VDIR
 usecount 31, writecount 0, refcount 32 mountedhere 0
 flags ()
 v_object 0xc85d2744 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 
 pending
 ino 2072603, on dev ad4s1g
 
 0xc687c948: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc875d000 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 
 pending
 ino 2072615, on dev ad4s1g
 
 0xc691f420: tag ufs, type VDIR
 usecount 2, writecount 0, refcount 3 mountedhere 0
 flags ()
 v_object 0xc8a773e0 ref 0 pages 1
  lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 
 pending
 ino 2072680, on dev ad4s1g
 
 0xc691f318: tag ufs, type VDIR
 usecount 3, writecount 0, refcount 4 mountedhere 0
 flags ()
 v_object 0xc8a7b2e8 ref 0 pages 1
  lock type ufs: EXCL (count 1) by t(kgdb) 
 
 After that I did "call doadump" and got a vmcore.
 
 ps shows me:
 
 (kgdb) ps
 During symbol reading, Incomplete CFI data; unspecified registers at 
 0xc04d97eb.
    pid    proc      uid  ppid  pgrp   flag  stat  comm          wchan
 74686 c94640000 1 1  00  1  sh   ufs c687caa8
 74195 c970d0000  3074 74195  4000100  1  sshd ufs c687caa8
 74178 c7682adc0  3074 74178  004000  1  sshd ufs c687c9a0
 74129 c9b82adc 1008 1  5504  004000  1  parser3.cgi   ufs c691f370
 74103 c70b5458 1008 1  5504  00  1  httpdufs c69bb580
 65610 c92c0458 1005 1 65610  004000  1  sftp-server   ufs c691f478
  5518 c6247458 1008 1  5516  004002  1  perl5.8.7ufs c687caa8
  3081 c7523d080 1  3081  00  1  cron ufs c687caa8
  3074 c7682d080 1  3074  000100  1  sshd ufs c687caa8
  3016 c7523adc0 1  3016  00  1  syslogd  ufs c687caa8
   519 c68e4d08   80 1   518  000100  1  nginxufs c691f370
34 c6260 0 0  000204  1  schedcpu - e88b3cf0
33 c62438b00 0 0  000204  1  syncer   ktsusp c6243938
32 c6243adc0 0 0  000204  1  vnlruktsusp c6243b64
31 c6243d080 0 0  000204  1  bufdaemonktsusp c6243d90
30 c62440000 0 0  00020c  1  pagezero pgzero c06c21a0
29 c624422c0 0 0  000204  1  vmdaemon psleep c06c1d08
28 c62444580 0 0  000204  1  pagedaemon   psleep c06c1cc8
27 c602e6840   

Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Denis Shaposhnikov
 Don == Don Lewis [EMAIL PROTECTED] writes:

 Don pid 519 wants to lock this vnode but some other thread is
 Don holding the vnode lock.  Unfortunately we don't know who the
 Don lock holder is because the message is truncated.

Is it possible to find out the answer from the crashdump?

 Don This might just be a vnode lock leak.  Build a debug kernel with
 Don the DEBUG_VFS_LOCKS and DEBUG_LOCKS options and see if anything
 Don shows up.
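
For reference, those are kernel configuration options, e.g.

options         DEBUG_VFS_LOCKS
options         DEBUG_LOCKS

in the kernel config file, followed by the usual kernel rebuild and
install.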

I'll try, thank you.

-- 
DSS5-RIPE DSS-RIPN 2:550/[EMAIL PROTECTED] 2:550/[EMAIL PROTECTED]
mailto:[EMAIL PROTECTED] http://neva.vlink.ru/~dsh/


Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Don Lewis
On  5 Jan, Denis Shaposhnikov wrote:
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don pid 519 wants to lock this vnode but some other thread is
  Don holding the vnode lock.  Unfortunately we don't know who the
  Don lock holder is because the message is truncated.
 
 Is it possible to find out the answer from the crashdump?

It's possible if you have the matching debug kernel, though it is more
painful.  In kgdb:
print *(struct vnode *)0xc691f318
print ((struct vnode *)0xc691f318)->v_vnlock->lk_lockholder->td_proc->p_pid
or something like that.

  Don This might just be a vnode lock leak.  Build a debug kernel with
  Don the DEBUG_VFS_LOCKS and DEBUG_LOCKS options and see if anything
  Don shows up.
 
 I'll try, thank you.

Are you using any unusual file systems, such as nullfs or unionfs?



Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Denis Shaposhnikov
 Don == Don Lewis [EMAIL PROTECTED] writes:

 Don Are you using any unusual file systems, such as nullfs or
 Don unionfs?

Yes, I use a lot of nullfs. This is a host system for about 20
jails with the system nullfs-mounted read-only:

/dev/mirror/gm0s1a on / (ufs, local, read-only)
devfs on /dev (devfs, local)
/dev/mirror/gm0s1d on /usr/local (ufs, local)
/dev/mirror/gm0s1e on /home (ufs, local, nosuid)
/dev/mirror/gm0s1f on /tmp (ufs, local, noexec)
/dev/mirror/gm0s1g on /var (ufs, NFS exported, local)
procfs on /proc (procfs, local)
devfs on /var/named/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/netflow/the.jail/.build (nullfs, local, 
read-only)
/var/jail/netflow/the.jail.etc on /var/jail/netflow/the.jail/.build/etc 
(nullfs, local, read-only)
/var/jail/netflow/the.jail.local on /var/jail/netflow/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/netflow/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/mysql/the.jail/.build (nullfs, local, 
read-only)
/var/jail/mysql/the.jail.etc on /var/jail/mysql/the.jail/.build/etc (nullfs, 
local, read-only)
/var/jail/mysql/the.jail.local on /var/jail/mysql/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/mysql/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/dspam/the.jail/.build (nullfs, local, 
read-only)
/var/jail/dspam/the.jail.etc on /var/jail/dspam/the.jail/.build/etc (nullfs, 
local, read-only)
/var/jail/dspam/the.jail.local on /var/jail/dspam/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/dspam/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/deliver_smtp/the.jail/.build (nullfs, 
local, read-only)
/var/jail/deliver_smtp/the.jail.etc on 
/var/jail/deliver_smtp/the.jail/.build/etc (nullfs, local, read-only)
/var/jail/deliver_smtp/the.jail.local on 
/var/jail/deliver_smtp/the.jail/.build/usr/local (nullfs, local, read-only)
devfs on /var/jail/deliver_smtp/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/clamav/the.jail/.build (nullfs, local, 
read-only)
/var/jail/clamav/the.jail.etc on /var/jail/clamav/the.jail/.build/etc (nullfs, 
local, read-only)
/var/jail/clamav/the.jail.local on /var/jail/clamav/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/clamav/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/mx1_smtp/the.jail/.build (nullfs, local, 
read-only)
/var/jail/mx1_smtp/the.jail.etc on /var/jail/mx1_smtp/the.jail/.build/etc 
(nullfs, local, read-only)
/var/jail/mx1_smtp/the.jail.local on 
/var/jail/mx1_smtp/the.jail/.build/usr/local (nullfs, local, read-only)
devfs on /var/jail/mx1_smtp/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/smtp_smtp/the.jail/.build (nullfs, local, 
read-only)
/var/jail/smtp_smtp/the.jail.etc on /var/jail/smtp_smtp/the.jail/.build/etc 
(nullfs, local, read-only)
/var/jail/smtp_smtp/the.jail.local on 
/var/jail/smtp_smtp/the.jail/.build/usr/local (nullfs, local, read-only)
/var/jail/mx1_smtp/the.jail/var/db/postfix on 
/var/jail/smtp_smtp/the.jail/var/db/postfix (nullfs, local, read-only)
devfs on /var/jail/smtp_smtp/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/cacti/the.jail/.build (nullfs, local, 
read-only)
/var/jail/cacti/the.jail.etc on /var/jail/cacti/the.jail/.build/etc (nullfs, 
local, read-only)
/var/jail/cacti/the.jail.local on /var/jail/cacti/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/cacti/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/rk80/the.jail/.build (nullfs, local, 
read-only)
/var/jail/rk80/the.jail.etc on /var/jail/rk80/the.jail/.build/etc (nullfs, 
local, read-only)
/var/jail/rk80/the.jail.local on /var/jail/rk80/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/rk80/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/syslog/the.jail/.build (nullfs, local, 
read-only)
/var/jail/syslog/the.jail.etc on /var/jail/syslog/the.jail/.build/etc (nullfs, 
local, read-only)
/var/jail/syslog/the.jail.local on /var/jail/syslog/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/syslog/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/tftpboot/the.jail/.build (nullfs, local, 
read-only)
/var/jail/tftpboot/the.jail.etc on /var/jail/tftpboot/the.jail/.build/etc 
(nullfs, local, read-only)
/var/jail/tftpboot/the.jail.local on 
/var/jail/tftpboot/the.jail/.build/usr/local (nullfs, local, read-only)
devfs on /var/jail/tftpboot/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/clickon/the.jail/.build (nullfs, local, 
read-only)
/var/jail/clickon/the.jail.etc on /var/jail/clickon/the.jail/.build/etc 
(nullfs, local, read-only)
/var/jail/clickon/the.jail.local on /var/jail/clickon/the.jail/.build/usr/local 
(nullfs, local, read-only)
devfs on /var/jail/clickon/the.jail/dev (devfs, local)
/var/jail/build/the.jail on /var/jail/sulci/the.jail/.build (nullfs, local, 

Re: Recurring problem: processes block accessing UFS file system

2006-01-05 Thread Don Lewis
On  5 Jan, Denis Shaposhnikov wrote:
 Don == Don Lewis [EMAIL PROTECTED] writes:
 
  Don Are you using any unusual file systems, such as nullfs or
  Don unionfs?
 
 Yes, I use a lot of nullfs. This is a host system for about 20
 jails with the system nullfs-mounted read-only:

That would be my guess as to the cause of the problem. Hopefully
DEBUG_VFS_LOCKS will help pinpoint the bug.



Re: Recurring problem: processes block accessing UFS file system

2006-01-04 Thread Don Lewis
On  3 Jan, Greg Rivers wrote:
 On Tue, 3 Jan 2006, Don Lewis wrote:

 Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is
 sleeping on the buf's b_xflags member.  It looks like 87117 is waiting
 for an in-progress write to complete.  There are a large number of other
 sendmail processes waiting in this same place.

 How about show buffer 0xdc76fe30?

 
 db show buffer 0xdc76fe30
 buf at 0xdc76fe30
 b_flags = 0x20a0<vmio,delwri,cache>
 b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
 b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368
 lockstatus = 0, excl count = 0, excl owner 0x
 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 
 0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 
 0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000)
 db

Hmm, it would be nice if DDB printed b_vflags so that we would know the
state of BV_BKGRDINPROG and BV_BKGRDWAIT.  As it is, we don't know if
the background write got lost or if we missed the wakeup.

At this point, I think it might be easier to do post-mortem debugging on
a core file with kgdb and kernel.debug.  Unless there is an obvious race
condition, I suspect this will be a tough slog.
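
Once a crash dump is available, something along these lines in kgdb
should recover the missing flag state (assuming the 6.x struct buf
field names):

(kgdb) print ((struct buf *)0xdc76fe30)->b_vflags
(kgdb) print ((struct buf *)0xdc76fe30)->b_xflags
(kgdb) print *(struct buf *)0xdc76fe30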




Re: Recurring problem: processes block accessing UFS file system

2006-01-04 Thread Greg Rivers

On Wed, 4 Jan 2006, Don Lewis wrote:


How about show buffer 0xdc76fe30?



db show buffer 0xdc76fe30
buf at 0xdc76fe30
b_flags = 0x20a0<vmio,delwri,cache>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368
lockstatus = 0, excl count = 0, excl owner 0x
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 
0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 
0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000)
db


Hmm, it would be nice if DDB printed b_vflags so that we would know the
state of BV_BKGRDINPROG and BV_BKGRDWAIT.  As it is, we don't know if
the background write got lost or if we missed the wakeup.

At this point, I think it might be easier to do post-mortem debugging on
a core file with kgdb and kernel.debug.  Unless there is an obvious race
condition, I suspect this will be a tough slog.



I attempted to get a core, but panic wedged the machine.  I'll try again 
next time this happens.  Thanks again for looking at this.


--
Greg


db continue
#

(free up a partition and set up a dump device...)

# sync;sync
# sync;sync
# sync;sync
# KDB: enter: Line break on console
[thread pid 13 tid 13 ]
Stopped at  kdb_enter+0x30: leave 
db panic

panic: from debugger
cpuid = 0
KDB: stack backtrace:
kdb_backtrace(c0625783,0,c060c2d4,e70c5a20,c043bb58) at kdb_backtrace+0x2f
panic(c060c2d4,e70c5ad8,c043aaec,c04d13c9,0) at panic+0x129
db_command_loop(c04d13c9,0,,e70c5a4c,e70c5a48) at db_command_loop
db_command(c06549e4,c062efa0,c0629aa8,c0629aac,e70c5b44) at db_command+0x2a4
db_command_loop(c04d13c9,0,0,0,0) at db_command_loop+0x6a
db_trap(3,0,3,c8522a80,e70c5b94) at db_trap+0xe2
kdb_trap(3,0,e70c5b9c,f2db0257,0) at kdb_trap+0xb6
trap(c8710008,28,e70c0028,f9,c8714c00) at trap+0x4d1
calltrap() at calltrap+0x5
--- trap 0x3, eip = 0xc04d13c9, esp = 0xe70c5bdc, ebp = 0xe70c5be4 ---
kdb_enter(c0622c7f,679ed6d4,36,c8522a80,c8714c00) at kdb_enter+0x30
siointr1(c8714c00,37f86c,e5857050,c8522a80,c8522a80) at siointr1+0xd1
siointr(c8714c00,2,0,4,c8522a80) at siointr+0x76
intr_execute_handlers(c851a490,e70c5c68,e70c5cac,c05da733,34) at 
intr_execute_handlers+0x8c
lapic_handle_intr(34) at lapic_handle_intr+0x3b
Xapic_isr1() at Xapic_isr1+0x33
--- interrupt, eip = 0xc078427a, esp = 0xe70c5cac, ebp = 0xe70c5cac ---
acpi_cpu_c1(37f867,43d0eea4,7b70cdc0,e70c5ce0,37f867) at acpi_cpu_c1+0x5
acpi_cpu_idle(e70c5d04,c049a72a,1,0,0) at acpi_cpu_idle+0x18a
cpu_idle(1,0,0,0,0) at cpu_idle+0x29
idle_proc(0,e70c5d38,0,0,0) at idle_proc+0xad
fork_exit(c049a67d,0,e70c5d38) at fork_exit+0x7b
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe70c5d6c, ebp = 0 ---
Uptime: 42d10h54m36s
GEOM_MIRROR: Device gm1: provider mirror/gm1 destroyed.
pid 13: corrected slot count (0->1)


Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Greg Rivers

On Tue, 22 Nov 2005, I wrote:


On Mon, 21 Nov 2005, Kris Kennaway wrote:

It may not be the same problem.  You should also try to obtain a trace when 
snapshots are not implicated.




Agreed.  I'll do so at the first opportunity.



First, my thanks to all of you for looking into this.

It's taken more than a month, but the problem has recurred without 
snapshots ever having been run.  I've got a good trace of the machine in 
this state (ftp://ftp.fedex.com/incoming/no-snapshots.bz2).  My apologies 
for the size of the debug output, but the processes had really stacked up 
this time before I noticed it.


I have enough capacity that I can afford to have this machine out of 
production for a while, so I've left it suspended in kdb for the time 
being in case additional information is needed.  Please let me know if 
there's anything else I can do to facilitate troubleshooting this. 
Thanks!


--
Greg


Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Don Lewis
On  3 Jan, Greg Rivers wrote:
 On Tue, 22 Nov 2005, I wrote:
 
 On Mon, 21 Nov 2005, Kris Kennaway wrote:

 It may not be the same problem.  You should also try to obtain a trace when 
 snapshots are not implicated.
 

 Agreed.  I'll do so at the first opportunity.

 
 First, my thanks to all of you for looking into this.
 
 It's taken more than a month, but the problem has recurred without 
 snapshots ever having been run.  I've got a good trace of the machine in 
 this state (attached).  My apologies for the size of the debug output, but 
 the processes had really stacked up this time before I noticed it.
 
 I have enough capacity that I can afford to have this machine out of 
 production for a while, so I've left it suspended in kdb for the time 
 being in case additional information is needed.  Please let me know if 
 there's anything else I can do to facilitate troubleshooting this. 
 Thanks!

There are a large number of sendmail processes waiting on vnode locks
which are held by other sendmail processes that are waiting on other
vnode locks, etc. until we get to sendmail pid 87150 which is holding a
vnode lock and waiting to lock a buf.

Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480
sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158
mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5
sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f
sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11
msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7
acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89
lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f
getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211
breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52
bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c
ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383
ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154
ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61
ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36
VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c
vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe
vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33
kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca
open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36
syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f

This doesn't appear to be a buf/memory exhausting problem because
syncer, bufdaemon, and pagedaemon all appear to be idle.

What does show lockedbufs say?



Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Greg Rivers

On Tue, 3 Jan 2006, Don Lewis wrote:


There are a large number of sendmail processes waiting on vnode locks
which are held by other sendmail processes that are waiting on other
vnode locks, etc. until we get to sendmail pid 87150 which is holding a
vnode lock and waiting to lock a buf.

Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480
sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158
mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5
sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f
sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11
msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7
acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89
lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f
getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211
breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52
bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c
ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383
ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154
ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61
ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36
VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c
vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe
vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33
kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca
open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36
syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f

This doesn't appear to be a buf/memory exhausting problem because
syncer, bufdaemon, and pagedaemon all appear to be idle.

What does show lockedbufs say?



db show lockedbufs
buf at 0xdc625678
b_flags = 0x2000<vmio>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xddcef000, b_blkno = 365233216
lockstatus = 2, excl count = 1, excl owner 0xfffe
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8a088, 
0x53175000),(0xc8984108, 0x2b8a089, 0x7ae56000),(0xc8984108, 0x2b8a08a, 
0xd3f57000),(0xc8984108, 0x2b8a08b, 0xd7d58000)

buf at 0xdc6b8ab0
b_flags = 0x2000<vmio>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xdf9ab000, b_blkno = 365257760
lockstatus = 2, excl count = 1, excl owner 0xfffe
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8ac84, 
0x60b1000),(0xc8984108, 0x2b8ac85, 0x454d2000),(0xc8984108, 0x2b8ac86, 
0x1b273000),(0xc8984108, 0x2b8ac87, 0x47b74000)

buf at 0xdc6c3cc8
b_flags = 0x2000<vmio>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xdfbd7000, b_blkno = 365265888
lockstatus = 2, excl count = 1, excl owner 0xfffe
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8b07c, 
0xba549000),(0xc8984108, 0x2b8b07d, 0x92eaa000),(0xc8984108, 0x2b8b07e, 
0xbdf4b000),(0xc8984108, 0x2b8b07f, 0x2090c000)

buf at 0xdc6d9e68
b_flags = 0x2000<vmio>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe0027000, b_blkno = 365240832
lockstatus = 2, excl count = 1, excl owner 0xfffe
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8a440, 
0x61b8d000),(0xc8984108, 0x2b8a441, 0xb0f4e000),(0xc8984108, 0x2b8a442, 
0x5f98f000),(0xc8984108, 0x2b8a443, 0x5c21)

buf at 0xdc6e5458
b_flags = 0x2000<vmio>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe025f000, b_blkno = 364617056
lockstatus = 2, excl count = 1, excl owner 0xfffe
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b773ac, 
0x4e539000),(0xc8984108, 0x2b773ad, 0x6e13a000),(0xc8984108, 0x2b773ae, 
0xc653b000),(0xc8984108, 0x2b773af, 0x14e3c000)

buf at 0xdc6fd4b8
b_flags = 0x2000<vmio>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe070f000, b_blkno = 365224960
lockstatus = 2, excl count = 1, excl owner 0xfffe
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b89c80, 
0x37c6d000),(0xc8984108, 0x2b89c81, 0x2e40e000),(0xc8984108, 0x2b89c82, 
0xa39af000),(0xc8984108, 0x2b89c83, 0x27ff)

buf at 0xdc713f50
b_flags = 0xa00200a0<remfree,vmio,clusterok,delwri,cache>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624
lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 
0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 
0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000)

buf at 0xdc765f50
b_flags = 0xa00200a0<remfree,vmio,clusterok,delwri,cache>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc9c68b60), b_data = 0xe1b7b000, b_blkno = 364555424
lockstatus = 2, excl count = 1, excl owner 0xce9d4c00
b_npages = 4, pages(OBJ, IDX, PA): (0xcff7, 

Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Don Lewis
On  3 Jan, Greg Rivers wrote:
 On Tue, 3 Jan 2006, Don Lewis wrote:
 
 There are a large number of sendmail processes waiting on vnode locks
 which are held by other sendmail processes that are waiting on other
 vnode locks, etc. until we get to sendmail pid 87150 which is holding a
 vnode lock and waiting to lock a buf.

 Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480
 sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158
 mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5
 sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f
 sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11
 msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7
 acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89
 lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f
 getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211
 breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52
 bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c
 ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383
 ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154
 ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61
 ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36
 VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c
 vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe
 vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33
 kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca
 open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36
 syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324
 Xint0x80_syscall() at Xint0x80_syscall+0x1f

 This doesn't appear to be a buf/memory exhausting problem because
 syncer, bufdaemon, and pagedaemon all appear to be idle.

 What does show lockedbufs say?

 
 db show lockedbufs

[snip]

It looks like this is the buf that pid 87150 is waiting for:

 buf at 0xdc713f50
 b_flags = 0xa00200a0<remfree,vmio,clusterok,delwri,cache>
 b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
 b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624
 lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80
 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 
 0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 
 0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000)

which is locked by this thread:

Tracing command sendmail pid 87117 tid 101335 td 0xcfeb5d80
sched_switch(cfeb5d80,0,1,fd1926a,640c65f9) at sched_switch+0x158
mi_switch(1,0,c04d7b33,dc76fe8c,ec883b2c) at mi_switch+0x1d5
sleepq_switch(dc76fe8c,ec883b60,c04bb9ce,dc76fe8c,4c) at sleepq_switch+0x16f
sleepq_wait(dc76fe8c,4c,c061e9ac,0,0) at sleepq_wait+0x11
msleep(dc76fe8c,c0662f80,4c,c061e9ac,0) at msleep+0x3d7
getdirtybuf(dc76fe30,c0662f80,1,ec883ba8,0) at getdirtybuf+0x221   
softdep_update_inodeblock(cd1bc528,dc713f50,1,4000,0) at softdep_update_inodeblock+0x267
ffs_update(cd953bb0,1,0,cd953bb0,ec883c78,c0529a59,0,0,0,4,1,cd953c2c) at ffs_update+0x27f
ffs_syncvnode(cd953bb0,1,4,ec883c78,c05f9a70) at ffs_syncvnode+0x52e
ffs_fsync(ec883cb4,ec883cd0,c052468a,c0646cc0,ec883cb4) at ffs_fsync+0x1c
VOP_FSYNC_APV(c0646cc0,ec883cb4,0,0,0) at VOP_FSYNC_APV+0x3a
fsync(cfeb5d80,ec883d04,4,cfeb5d80,ec883d2c) at fsync+0x1db
syscall(3b,3b,3b,80c7c1b,bfbfa6b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (95, FreeBSD ELF32, fsync), eip = 0x8830f63f, esp = 0xbfbfa66c, ebp
= 0xbfbfaf98 ---


Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is
sleeping on the buf's b_xflags member.  It looks like 87117 is waiting
for an in-progress write to complete.  There are a large number of other
sendmail processes waiting in this same place.

How about show buffer 0xdc76fe30?

This is getting into an area of the kernel that I do not understand
well.



Re: Recurring problem: processes block accessing UFS file system

2006-01-03 Thread Greg Rivers

On Tue, 3 Jan 2006, Don Lewis wrote:


db show lockedbufs


[snip]

It looks like this is the buf that pid 87150 is waiting for:


buf at 0xdc713f50
b_flags = 0xa00200a0<remfree,vmio,clusterok,delwri,cache>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624
lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 
0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 
0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000)


which is locked by this thread:

Tracing command sendmail pid 87117 tid 101335 td 0xcfeb5d80
sched_switch(cfeb5d80,0,1,fd1926a,640c65f9) at sched_switch+0x158
mi_switch(1,0,c04d7b33,dc76fe8c,ec883b2c) at mi_switch+0x1d5
sleepq_switch(dc76fe8c,ec883b60,c04bb9ce,dc76fe8c,4c) at sleepq_switch+0x16f
sleepq_wait(dc76fe8c,4c,c061e9ac,0,0) at sleepq_wait+0x11
msleep(dc76fe8c,c0662f80,4c,c061e9ac,0) at msleep+0x3d7
getdirtybuf(dc76fe30,c0662f80,1,ec883ba8,0) at getdirtybuf+0x221
softdep_update_inodeblock(cd1bc528,dc713f50,1,4000,0) at softdep_update_inodeblock+0x267
ffs_update(cd953bb0,1,0,cd953bb0,ec883c78,c0529a59,0,0,0,4,1,cd953c2c) at ffs_update+0x27f
ffs_syncvnode(cd953bb0,1,4,ec883c78,c05f9a70) at ffs_syncvnode+0x52e
ffs_fsync(ec883cb4,ec883cd0,c052468a,c0646cc0,ec883cb4) at ffs_fsync+0x1c
VOP_FSYNC_APV(c0646cc0,ec883cb4,0,0,0) at VOP_FSYNC_APV+0x3a
fsync(cfeb5d80,ec883d04,4,cfeb5d80,ec883d2c) at fsync+0x1db
syscall(3b,3b,3b,80c7c1b,bfbfa6b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (95, FreeBSD ELF32, fsync), eip = 0x8830f63f, esp = 0xbfbfa66c, ebp
= 0xbfbfaf98 ---


Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is
sleeping on the buf's b_xflags member.  It looks like 87117 is waiting
for an in-progress write to complete.  There are a large number of other
sendmail processes waiting in this same place.

How about show buffer 0xdc76fe30?



db show buffer 0xdc76fe30
buf at 0xdc76fe30
b_flags = 0x20a0<vmio,delwri,cache>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368
lockstatus = 0, excl count = 0, excl owner 0x
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 
0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 
0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000)
db



This is getting into an area of the kernel that I do not understand
well.



For me, we've long since passed that point.  :-)

--
Greg


Re: Recurring problem: processes block accessing UFS file system

2006-01-02 Thread Don Lewis
On 26 Nov, Tor Egge wrote:
 
 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.
 
 The vnode syncing loop in ffs_sync() has some problems:
 
   1. Softupdate processing performed after the loop has started might
  trigger the need for retrying the loop.  Processing of dirrem work
  items can cause IN_CHANGE to be set on some inodes, causing
  deadlock in ufs_inactive() later on while the file system is
  suspended.

I also don't like how this loop interacts with the vnode list churn done
by vnlru_free().  Maybe vnode recycling for a file system should be
skipped while a file system is being suspended or unmounted.

   2. nvp might no longer be associated with the same mount point after
  MNT_IUNLOCK(mp) has been called in the loop.  This can cause the
  vnode list traversal to be incomplete, with stale information in
  the snapshot.  Further damage can occur when background fsck uses
  that stale information.

It looks like this is handled in __mnt_vnode_next() by starting over.
Skipping vnode recycling should avoid this problem in the snapshot case.

This loop should be bypassed in normal operation and the individual
vnode syncing should be done by the syncer.  The only reason this loop
isn't skipped during normal operation is that timestamp updates
aren't sufficient to add vnodes to the syncer worklist.

 Just a few lines down from that loop is a new problem:
 
   3. softdep_flushworklist() might not have processed all dirrem work
  items associated with the file system even if both error and count
  are zero. This can cause both background fsck and softupdate
  processing (after file system has been resumed) to decrement the
  link count of an inode, causing file system corruption or a panic.

Are you sure this is still true after the changes that were committed to
both HEAD and RELENG_6 before 6.0-RELEASE?

All the pending items that hang around various lists make me nervous,
though.  I really think the number of each flavor should be tracked per
mount point and softupdates should complain if the counts are non-zero
at the end of the suspend and unmount tasks.
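
A toy illustration of that accounting idea, in plain C rather than
kernel code; every name here is made up:

#include <assert.h>

/* Hypothetical per-mount counters for pending softdep work items. */
struct softdep_counts {
        int dirrem;     /* pending directory-removal work items */
        int other;      /* everything else, lumped together here */
};

static void work_queued(struct softdep_counts *c)    { c->dirrem++; }
static void work_completed(struct softdep_counts *c) { c->dirrem--; }

/* At suspend/unmount time, complain if anything is still pending. */
static void check_quiescent(const struct softdep_counts *c)
{
        assert(c->dirrem == 0 && c->other == 0);
}

int main(void)
{
        struct softdep_counts c = { 0, 0 };

        work_queued(&c);
        work_completed(&c);
        check_quiescent(&c);    /* aborts if any items were still pending */
        return (0);
}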

  Processing of these work items while the file system is suspended
  causes a panic.




Re: Recurring problem: processes block accessing UFS file system

2006-01-02 Thread Don Lewis
On 26 Nov, Tor Egge wrote:
 
 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.
 
 The vnode syncing loop in ffs_sync() has some problems:

There is also a MNT_VNODE_FOREACH() loop in ffs_snapshot().



Re: Recurring problem: processes block accessing UFS file system

2006-01-02 Thread Don Lewis
On 22 Nov, Greg Rivers wrote:
 On Mon, 21 Nov 2005, Kris Kennaway wrote:
 
 It may not be the same problem.  You should also try to obtain a trace 
 when snapshots are not implicated.

 
 Agreed.  I'll do so at the first opportunity.
 
 
 'show lockedvnods' is very important for diagnosing filesystem 
 deadlocks, and 'alltrace' is the easiest way to obtain stack traces for 
 the resulting processes that are holding locks.  You'll want to 'set 
 $lines=0' and capture the output from the serial console to a file.

 
 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.

There appears to be a lock order reversal between a vnode lock and the
hand-rolled suspend lock.  Basically, it is not permissible to wait for
the suspend lock while holding a vnode lock.  In cases where a thread
holds a vnode lock and then wants to start a file system write, it has
to make a non-blocking attempt to grab the suspend lock and then release
the vnode lock if the suspend lock can't be obtained.  For example, in
vn_open_cred():

        if (vn_start_write(ndp->ni_dvp, &mp, V_NOWAIT) != 0) {
                NDFREE(ndp, NDF_ONLY_PNBUF);
                vput(ndp->ni_dvp);
                VFS_UNLOCK_GIANT(vfslocked);
                if ((error = vn_start_write(NULL, &mp,
                    V_XSLEEP | PCATCH)) != 0)
                        return (error);
                goto restart;
        }


The problem is that vput() may call vinactive(), which calls
ufs_inactive(), which may attempt to do a file system write and block on
the snapshot lock.

db trace 99361
Tracing pid 99361 tid 100437 td 0xce67cd80
sched_switch(ce67cd80,0,1,3d4e53a2,95c40548) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c86d706c,ebb00908) at mi_switch+0x1d5
sleepq_switch(c86d706c,ebb0093c,c04bb9ce,c86d706c,9f) at sleepq_switch+0x16f
sleepq_wait(c86d706c,9f,c061a026,0,c0647200) at sleepq_wait+0x11
msleep(c86d706c,c86d7044,29f,c061a026,0,c0653720,c8dc1aa0,ebb00978,ce67cd80) at msleep+0x3d7
vn_write_suspend_wait(c8dc1aa0,c86d7000,1,0,ca07f817) at vn_write_suspend_wait+0x181
ufs_inactive(ebb009c8,ebb009dc,c051b330,c0646cc0,ebb009c8) at ufs_inactive+0x1b4
VOP_INACTIVE_APV(c0646cc0,ebb009c8,a03,ebb009d0,c051054f) at VOP_INACTIVE_APV+0x3a
vinactive(c8dc1aa0,ce67cd80,ffdf,ebb00a24,c0513592) at vinactive+0x82
vput(c8dc1aa0,ffdf,2,ebb00a50,0) at vput+0x187
vn_open_cred(ebb00bc0,ebb00cc0,180,c9430580,5) at vn_open_cred+0xfb
vn_open(ebb00bc0,ebb00cc0,180,5,4b5fcfad) at vn_open+0x33
kern_open(ce67cd80,81414e0,0,a03,180) at kern_open+0xca
open(ce67cd80,ebb00d04,c,ce67cd80,8155000) at open+0x36
syscall(3b,3b,3b,0,a02) at syscall+0x324

99361 ce682624  100   673   673 100 [SLPQ suspfs 0xc86d706c][SLP] sendmail


If a thread gets stuck waiting for the suspend lock while holding a
vnode lock, then it is possible for the thread that is creating the
snapshot to get stuck when it attempts to lock that same vnode.

98639 c8e91418 11008 98637 97958 0004102 [SLPQ ufs 0xc8991d18][SLP] mksnap_ffs

db trace 98639
Tracing pid 98639 tid 100282 td 0xc8e8e300
sched_switch(c8e8e300,0,1,cc167362,733b6597) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c8991d18,eb80c544) at mi_switch+0x1d5
sleepq_switch(c8991d18,eb80c578,c04bb9ce,c8991d18,50) at sleepq_switch+0x16f
sleepq_wait(c8991d18,50,c06186b3,0,c065ae80) at sleepq_wait+0x11  
msleep(c8991d18,c065888c,50,c06186b3,0) at msleep+0x3d7
acquire(eb80c5e0,40,6,c8e8e300,0) at acquire+0x89
lockmgr(c8991d18,2002,c8991d3c,c8e8e300,eb80c608) at lockmgr+0x45f
vop_stdlock(eb80c660,0,c0646cc0,eb80c660,eb80c618) at vop_stdlock+0x2f
VOP_LOCK_APV(c0647200,eb80c660,eb80c630,c05f9f43,eb80c660) at VOP_LOCK_APV+0x44
ffs_lock(eb80c660,0,2002,c8991cc0,eb80c67c) at ffs_lock+0x19
VOP_LOCK_APV(c0646cc0,eb80c660,c06402e0,eb80c7d0,0) at VOP_LOCK_APV+0x44
vn_lock(c8991cc0,2002,c8e8e300,4000,c851f300) at vn_lock+0x132
ffs_snapshot(c86d7000,cc978e00,eb80c9a4,6c,eb80c964) at ffs_snapshot+0x152b
ffs_mount(c86d7000,c8e8e300,0,c8e8e300,c9f71bb0) at ffs_mount+0x9af
vfs_domount(c8e8e300,c88a8760,c88a8870,11211000,c9a77510) at vfs_domount+0x728
vfs_donmount(c8e8e300,11211000,eb80cbec,c9314580,e) at vfs_donmount+0x12e
kernel_mount(c8737b80,11211000,eb80cc30,6c,bfbfe8c8) at kernel_mount+0x46
ffs_cmount(c8737b80,bfbfe110,11211000,c8e8e300,0) at ffs_cmount+0x85
mount(c8e8e300,eb80cd04,10,c8e8e300,8052000) at mount+0x21b
syscall(3b,3b,3b,8814819c,bfbfe0b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f


This block of code in ufs_inactive() is triggering the problem:

        if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
                if ((ip->i_flag & 

Re: Recurring problem: processes block accessing UFS file system

2005-11-25 Thread Tor Egge

 Thanks Kris, these are exactly the clues I needed.  Since the deadlock 
 during a snapshot is fairly easy to reproduce, I did so and collected this 
 information below.  alltrace didn't work as I expected (didn't produce a 
 trace), so I traced each pid associated with a locked vnode separately.

The vnode syncing loop in ffs_sync() has some problems:

  1. Softupdate processing performed after the loop has started might
     trigger the need for retrying the loop.  Processing of dirrem work
     items can cause IN_CHANGE to be set on some inodes, causing a
     deadlock in ufs_inactive() later on while the file system is
     suspended.

  2. nvp might no longer be associated with the same mount point after
     MNT_IUNLOCK(mp) has been called in the loop (see the sketch after
     this list).  This can cause the vnode list traversal to be
     incomplete, with stale information in the snapshot.  Further damage
     can occur when background fsck uses that stale information.

Just a few lines down from that loop is a new problem:

  3. softdep_flushworklist() might not have processed all dirrem work
     items associated with the file system even if both error and count
     are zero.  This can cause both background fsck and softupdate
     processing (after the file system has been resumed) to decrement
     the link count of an inode, causing file system corruption or a
     panic.  Processing of these work items while the file system is
     suspended causes a panic.
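
To make item 2 concrete, here is a rough sketch of the kind of traversal
being described and of the guard such a loop needs (a simplified
illustration, not the actual ffs_sync() source; the field and macro names
are only assumed to match the 6.x code):

loop:
        MNT_ILOCK(mp);
        for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
                nvp = TAILQ_NEXT(vp, v_nmntvnodes);     /* remember successor */
                MNT_IUNLOCK(mp);
                /*
                 * Lock and flush vp here; this may sleep, and while the
                 * interlock is dropped nvp can be reclaimed or moved to a
                 * different mount point.
                 */
                MNT_ILOCK(mp);
                if (nvp != NULL && nvp->v_mount != mp) {
                        /*
                         * Successor went stale: the only safe option is to
                         * restart, otherwise part of the list is never
                         * synced and the snapshot records stale data.
                         */
                        MNT_IUNLOCK(mp);
                        goto loop;
                }
        }
        MNT_IUNLOCK(mp);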

- Tor Egge


Re: Recurring problem: processes block accessing UFS file system

2005-11-24 Thread Peter Jeremy
On Mon, 2005-Nov-21 21:23:10 -0600, Greg Rivers wrote:
 the 
deadlock also occurs under normal operation when no snapshots are running 
or have ever been run since boot.  It's just much less frequent in this 
case.

I've also seen this behaviour in both 5.x and 6.0 (I can't recall if
it bit me in 4.x).  In my case, the trigger is OpenOffice.org - one
of the offending processes is almost always OOo/program/pagein.
Unfortunately, I haven't been able to get to the bottom of this (and
my son isn't happy that OOo keeps deadlocking on him).

-- 
Peter Jeremy


Re: Recurring problem: processes block accessing UFS file system

2005-11-21 Thread Kris Kennaway
On Mon, Nov 21, 2005 at 05:54:09PM -0600, Greg Rivers wrote:
 I've recently put up three busy email relay hosts running 6.0-STABLE. 
 Performance is excellent except for a nagging critical issue that keeps 
 cropping up.
 
 /var/spool is its own file system mounted on a geom stripe of four BSD 
 partitions (details below).  Once every two or three days all the 
 processes accessing /var/spool block forever in disk wait.  All three 
 machines suffer this problem.  No diagnostic messages are generated and 
 the machines continue running fine otherwise, but a reboot is required to 
 clear the condition.  This problem occurs during normal operation, but is 
 particularly likely to occur during a backup when dump makes a snapshot.
 
 There doesn't appear to be a problem with gstripe, as gstripe status is 
 UP and I can read the raw device just fine while processes continue to 
 block on the file system.  I tried running a kernel with WITNESS and 
 DIAGNOSTIC, but these options shed no light.
 
 If I catch the problem early enough I can break successfully into kdb; 
 otherwise, if too many processes stack up, the machine hangs going into 
 kdb and must be power-cycled.

Make sure you have KDB_STOP_NMI in your kernel.
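
For reference, a minimal sketch of the debugger-related kernel config
lines (FreeBSD 6.x option names; verify against your kernel's NOTES file):

options         KDB             # kernel debugger framework
options         DDB             # interactive ddb backend
options         KDB_STOP_NMI    # stop other CPUs via NMI when entering the debugger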

 I obtained the following process listing and traces from kdb.  I traced
 mksnap_ffs which was blocked in ufs, and two random sendmail processes
 that were blocked in ufs and suspfs respectively.

Looks like a UFS snapshot deadlock.  Are you running something like
dump -L on this filesystem, or making other use of snapshots?  fsck -B
also uses them, but shouldn't be running except at boot time.

You should take this up with Kirk McKusick [EMAIL PROTECTED] - in
the meantime you can work around it by not making use of UFS
snapshots.
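
As a hedged example of that workaround (the dump level, target path, and
file system here are hypothetical):

# Dump without -L so no snapshot is taken; the image of a live file system
# may be slightly inconsistent, but the snapshot code is never entered.
dump -0au -f /backup/spool.dump /var/spool

# /etc/rc.conf -- keep background fsck (which also creates snapshots) off
background_fsck="NO"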

Kris



Re: Recurring problem: processes block accessing UFS file system

2005-11-21 Thread Ronald Klop
On Tue, 22 Nov 2005 00:54:09 +0100, Greg Rivers  
[EMAIL PROTECTED] wrote:


I've recently put up three busy email relay hosts running 6.0-STABLE.  
Performance is excellent except for a nagging critical issue that keeps  
cropping up.


/var/spool is its own file system mounted on a geom stripe of four BSD  
partitions (details below).  Once every two or three days all the  
processes accessing /var/spool block forever in disk wait.  All three  
machines suffer this problem.  No diagnostic messages are generated and  
the machines continue running fine otherwise, but a reboot is required  
to clear the condition.  This problem occurs during normal operation,  
but is particularly likely to occur during a backup when dump makes a  
snapshot.


There doesn't appear to be a problem with gstripe, as gstripe status is  
UP and I can read the raw device just fine while processes continue to  
block on the file system.  I tried running a kernel with WITNESS and  
DIAGNOSTIC, but these options shed no light.


If I catch the problem early enough I can break successfully into kdb;  
otherwise, if too many processes stack up, the machine hangs going into  
kdb and must be power-cycled.


I'd appreciate any insight anyone may have into this problem or advise  
on turning this report into a coherent PR.


I have a machine with 5.4-STABLE with the same problem. It hangs every  
couple of days if I make regular snapshots. It is a remote machine which I  
don't have easy access to. I disabled the snapshots and since than it  
didn't hang a single time.

I hoped it would be fixed in 6.0, but this sounds the same.

Ronald.

--
 Ronald Klop
 Amsterdam, The Netherlands


Re: Recurring problem: processes block accessing UFS file system

2005-11-21 Thread Greg Rivers

On Mon, 21 Nov 2005, Kris Kennaway wrote:

 Looks like a UFS snapshot deadlock.  Are you running something like dump 
 -L on this filesystem, or making other use of snapshots?

Indeed I am (dump -L), but as I said (not very clearly, sorry), the 
deadlock also occurs under normal operation when no snapshots are running 
or have ever been run since boot.  It's just much less frequent in this 
case.  Disabling dumps altogether was one of the first things I tried.

 fsck -B also uses them, but shouldn't be running except at boot time.

Right, it isn't even at boot since I have background_fsck disabled.

 You should take this up with Kirk McKusick [EMAIL PROTECTED] - in 
 the meantime you can work around it by not making use of UFS snapshots.

Sounds good.  Even though the problem may not be snapshot related, it 
certainly seems to be UFS related.  Thanks for your suggestions and your 
prompt reply.

--
Greg


Re: Recurring problem: processes block accessing UFS file system

2005-11-21 Thread Kris Kennaway
On Mon, Nov 21, 2005 at 09:23:10PM -0600, Greg Rivers wrote:
 On Mon, 21 Nov 2005, Kris Kennaway wrote:
 
 Looks like a UFS snapshot deadlock.  Are you running something like dump 
 -L on this filesystem, or making other use of snapshots?
 
 
 Indeed I am (dump -L), but as I said (not very clearly, sorry), the 
 deadlock also occurs under normal operation when no snapshots are running 
 or have ever been run since boot.  It's just much less frequent in this 
 case.  Disabling dumps altogether was one of the first things I tried.

It may not be the same problem.  You should also try to obtain a trace
when snapshots are not implicated.

'show lockedvnods' is very important for diagnosing filesystem
deadlocks, and 'alltrace' is the easiest way to obtain stack traces
for the resulting processes that are holding locks.  You'll want to
'set $lines=0' and capture the output from the serial console to a
file.
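
For reference, a capture session on the serial console might look like the
sketch below (the pid is hypothetical; a per-pid 'trace' is a fallback when
alltrace produces no output):

db> set $lines=0
db> show lockedvnods
db> alltrace
db> trace 1234
db> ps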

Kris

