Re: Recurring problem: processes block accessing UFS file system
On 11 Jan, Denis Shaposhnikov wrote:

Hi!

Don == Don Lewis [EMAIL PROTECTED] writes:

Don> Are you using any unusual file systems, such as nullfs or
Don> unionfs?

Yes, I use a lot of nullfs. This is a host system for about 20 jails with a nullfs-mounted read-only system:

Don> That would be my guess as to the cause of the problem. Hopefully
Don> DEBUG_VFS_LOCKS will help pinpoint the bug.

I've got the problem again. Now I have a debug kernel and a crash dump. This is the output from kdb. Do you need any additional information that I can get from the crash dump?

0xc6bed3fc: tag ufs, type VDIR
    usecount 3, writecount 0, refcount 4 mountedhere 0
    flags ()
    v_object 0xc84fce0c ref 0 pages 1
    lock type ufs: EXCL (count 1) by thread 0xc69d1d80 (pid 546) with 2 pending
#0 0xc04cd7ad at lockmgr+0x3c6
#1 0xc053d780 at vop_stdlock+0x32
#2 0xc0655075 at VOP_LOCK_APV+0x3b
#3 0xc05db299 at ffs_lock+0x19
#4 0xc0655075 at VOP_LOCK_APV+0x3b
#5 0xc0557513 at vn_lock+0x6c
#6 0xc054a320 at vget+0x87
#7 0xc053a1f0 at cache_lookup+0xde
#8 0xc053ac93 at vfs_cache_lookup+0xad
#9 0xc0652eaf at VOP_LOOKUP_APV+0x3b
#10 0xc053f786 at lookup+0x419
#11 0xc0540262 at namei+0x2e7
#12 0xc0558d37 at vn_open_cred+0x1c2
#13 0xc055918a at vn_open+0x33
#14 0xc054e6d3 at kern_open+0xd2
#15 0xc054f055 at open+0x36
#16 0xc0648a47 at syscall+0x23a
#17 0xc06326ff at Xint0x80_syscall+0x1f
    ino 2072795, on dev ad4s1g

0xc6c07bf4: tag ufs, type VDIR
    usecount 2, writecount 0, refcount 3 mountedhere 0
    flags ()
    v_object 0xc940a744 ref 0 pages 1
    lock type ufs: EXCL (count 1) by thread 0xc9146c00 (pid 33016) with 1 pending
#0 0xc04cd7ad at lockmgr+0x3c6
#1 0xc053d780 at vop_stdlock+0x32
#2 0xc0655075 at VOP_LOCK_APV+0x3b
#3 0xc05db299 at ffs_lock+0x19
#4 0xc0655075 at VOP_LOCK_APV+0x3b
#5 0xc0557513 at vn_lock+0x6c
#6 0xc054a320 at vget+0x87
#7 0xc053a1f0 at cache_lookup+0xde
#8 0xc053ac93 at vfs_cache_lookup+0xad
#9 0xc0652eaf at VOP_LOOKUP_APV+0x3b
#10 0xc053f786 at lookup+0x419
#11 0xc0540262 at namei+0x2e7
#12 0xc0554134 at kern_rmdir+0x98
#13 0xc055436e at rmdir+0x22
#14 0xc0648a47 at syscall+0x23a
#15 0xc06326ff at Xint0x80_syscall+0x1f
    ino 2072767, on dev ad4s1g

db> alltrace

Tracing command parser3.cgi pid 33016 tid 100652 td 0xc9146c00
sched_switch(c9146c00,0,1,5e57bc72,aaef377f) at sched_switch+0xd8
mi_switch(1,0,0,e93d86c4,c04f425b) at mi_switch+0x150
sleepq_switch(e93d86f8,c04e4cc6,c6bed454,50,c0672a31) at sleepq_switch+0x115
sleepq_wait(c6bed454,50,c0672a31,0,c602a080) at sleepq_wait+0xb
msleep(c6bed454,c06b9c98,50,c0672a31,0) at msleep+0x454
acquire(6,da1dcb4c,c0504340,da1dcb4c,4c) at acquire+0x7a
lockmgr(c6bed454,2002,c6bed4c4,c9146c00,e93d87e0) at lockmgr+0x4ce
vop_stdlock(e93d8838,120,c06a4f20,e93d8838,e93d87f0) at vop_stdlock+0x32
VOP_LOCK_APV(c06a55c0,e93d8838,e93d8808,c0655075,e93d8838) at VOP_LOCK_APV+0x3b
ffs_lock(e93d8838,f,2,c6bed3fc,e93d8854) at ffs_lock+0x19
VOP_LOCK_APV(c06a4f20,e93d8838,c0653a6a,c053acd2,c0652eaf) at VOP_LOCK_APV+0x3b
vn_lock(c6bed3fc,2002,c9146c00,c06326ff,6) at vn_lock+0x6c
vget(c6bed3fc,2002,c9146c00,c0653a6a,c053acd2) at vget+0x87
vfs_hash_get(c6268c00,1fa0db,2,c9146c00,e93d89bc) at vfs_hash_get+0xf5
ffs_vget(c6268c00,1fa0db,2,e93d89bc,e93d89c0) at ffs_vget+0x49
ufs_lookup(e93d8a6c,c0685ac5,c6c07bf4,e93d8c34,e93d8aa8) at ufs_lookup+0x965
VOP_CACHEDLOOKUP_APV(c06a4f20,e93d8a6c,e93d8c34,c9146c00,c7fac400) at VOP_CACHEDLOOKUP_APV+0x59
vfs_cache_lookup(e93d8b14,e93d8ac0,c6c07bf4,e93d8c34,e93d8b30) at vfs_cache_lookup+0xec
VOP_LOOKUP_APV(c06a4f20,e93d8b14,c9146c00,c065510a,c054a6f6) at VOP_LOOKUP_APV+0x3b
lookup(e93d8c0c,c954a800,400,e93d8c28,0) at lookup+0x419
namei(e93d8c0c,ffdf,2,0,c6268c00) at namei+0x2e7
kern_rmdir(c9146c00,824b500,0,e93d8d30,c0648a47) at kern_rmdir+0x98
rmdir(c9146c00,e93d8d04,4,d,e93d8d38) at rmdir+0x22
syscall(804003b,e50003b,bfbf003b,8310560,8310560) at syscall+0x23a
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (137, FreeBSD ELF32, rmdir), eip = 0xe8201bf, esp = 0xbfbfbcfc, ebp = 0xbfbfbd18 ---

Tracing command nginx pid 546 tid 100122 td 0xc69d1d80
sched_switch(c69d1d80,0,1,f002cf9e,6c43d35e) at sched_switch+0xd8
mi_switch(1,0,c0502e44,e8a73670,c04f425b) at mi_switch+0x150
sleepq_switch(e8a736a4,c04e4cc6,c6c07c4c,50,c0672a31) at sleepq_switch+0x115
sleepq_wait(c6c07c4c,50,c0672a31,0,c044b82a) at sleepq_wait+0xb
msleep(c6c07c4c,c06ba45c,50,c0672a31,0) at msleep+0x454
acquire(6,c0655075,c05db299,c0655075,c0557513) at acquire+0x7a
lockmgr(c6c07c4c,2002,c6c07cbc,c69d1d80,e8a7378c) at lockmgr+0x4ce
vop_stdlock(e8a737e4,c06a4f20,c06a4f20,e8a737e4,e8a7379c) at vop_stdlock+0x32
VOP_LOCK_APV(c06a55c0,e8a737e4,e8a737b4,c0655075,e8a737e4) at VOP_LOCK_APV+0x3b
ffs_lock(e8a737e4,c0673faf,2,c6c07bf4,e8a73800) at ffs_lock+0x19
VOP_LOCK_APV(c06a4f20,e8a737e4,e8a73814,c0504340,e8a73814) at VOP_LOCK_APV+0x3b
Re: Recurring problem: processes block accessing UFS file system
On 12 Jan, Don Lewis wrote:

On 11 Jan, Denis Shaposhnikov wrote:

Hi!

Don == Don Lewis [EMAIL PROTECTED] writes:

Don> Are you using any unusual file systems, such as nullfs or
Don> unionfs?

Yes, I use a lot of nullfs. This is a host system for about 20 jails with a nullfs-mounted read-only system:

Don> That would be my guess as to the cause of the problem. Hopefully
Don> DEBUG_VFS_LOCKS will help pinpoint the bug.

I've got the problem again. Now I have a debug kernel and a crash dump. This is the output from kdb. Do you need any additional information that I can get from the crash dump?

Process 33016 is executing rmdir(). While doing the lookup, it is holding a lock on vnode 0xc6c07bf4 and attempting to lock vnode 0xc6bed3fc. Vnode 0xc6c07bf4 should be the parent directory of 0xc6bed3fc.

Process 546 is executing open(). While doing the lookup, it is holding a lock on vnode 0xc6bed3fc while attempting to lock vnode 0xc6c07bf4. Vnode 0xc6bed3fc should be the parent directory of 0xc6c07bf4, but this is inconsistent with the previous paragraph. This situation should not be possible.

Using kgdb on your saved crash dump, print fmode and *ndp in the vn_open_cred() stack frame of process 546, and *nd in the kern_rmdir() stack frame of process 33016. The path names being looked up may be helpful. Are there any symbolic links in the path names? If so, what are the link contents?

Are either of these processes jailed? If so, the same jail or different jails? What are inodes 2072767 and 2072795 on ad4s1g? Are you using snapshots?

I just thought of another possible cause for this problem. Is it possible that you have any hard links to directories in the file system on ad4s1g? That could put a loop in the directory tree and mess up the normal parent-child relationship that we rely on to avoid deadlocks.
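The kgdb session being asked for here would look roughly like the following sketch. The kernel.debug and vmcore paths are only examples, the thread and frame numbers are hypothetical and have to be read off the actual dump, and the tids (100122 for pid 546, 100652 for pid 33016) come from the traces above; whether the nameidata in a given frame is printed as nd or *ndp depends on how it is declared in that frame.

    kgdb kernel.debug /var/crash/vmcore.0   # example paths

    (kgdb) info threads                  # find the kgdb thread numbers for
                                         # tid 100122 (pid 546, open) and
                                         # tid 100652 (pid 33016, rmdir)
    (kgdb) thread 42                     # hypothetical: the thread for pid 546
    (kgdb) bt                            # locate the vn_open_cred() frame
    (kgdb) frame 12                      # hypothetical frame number
    (kgdb) print fmode
    (kgdb) print *ndp
    (kgdb) thread 43                     # hypothetical: the thread for pid 33016
    (kgdb) frame 11                      # hypothetical: the kern_rmdir() frame
    (kgdb) print nd                      # the nameidata for the rmdir lookup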
Re: Recurring problem: processes block accessing UFS file system
Hi! Don == Don Lewis [EMAIL PROTECTED] writes: Don Are you using any unusual file systems, such as nullfs or Don unionfs? Yes, I'm use a lots of nullfs. This is a host system for about 20 jails with nullfs mounted ro system: Don That would be my guess as to the cause of the problem. Hopefully Don DEBUG_VFS_LOCKS will help pinpoint the bug. I've got the problem again. Now I have debug kernel and crash dump. That is an output from the kdb. Do you need any additional information which I can get from the crash dump? Script started on Wed Jan 11 16:19:51 2006 Connected ~KDB: enter: Line break on console [thread pid 10 tid 16 ] Stopped at kdb_enter+0x32: leave db ps pid proc uid ppid pgrp flag stat wmesgwchan cmd 33279 c8fb66840 1 1 000 [SLPQ ufs 0xc690d454][SLP] sh 33017 c7f998b0 1008 1 2692 000 [SLPQ ufs 0xc690d300][SLP] httpd 33016 c9261000 1008 1 2692 0004000 [SLPQ ufs 0xc6bed454][SLP] parser3.cgi 33015 c8e1fd08 1008 1 2692 0004000 [SLPQ ufs 0xc6bed454][SLP] perl5.8.7 33013 c9796458 1008 1 2692 0004000 [SLPQ ufs 0xc690d454][SLP] parser3.cgi 30627 c8f2fd08 1012 21055 30627 0004002 [SLPQ ufs 0xc690d454][SLP] perl5.8.7 30210 c86e3684 1010 1 30210 0004000 [SLPQ ufs 0xc690d454][SLP] sftp-server 21225 c858522c 1012 1 21225 0004000 [SLPQ ufs 0xc6bf9300][SLP] sftp-server 21055 c6c6d684 1012 1 21055 0015002 [SLPQ ufs 0xc690d454][SLP] zsh 21031 c62978b0 1012 1 21031 0004002 [SLPQ ufs 0xc690d454][SLP] zsh 21003 c92696840 2728 21003 0004100 [SLPQ ufs 0xc690d454][SLP] sshd 5004 c941eadc 1010 1 5004 0004000 [SLPQ ufs 0xc690d454][SLP] sftp-server 2735 c80076840 1 2735 000 [SLPQ ufs 0xc690d454][SLP] cron 2728 c7ca10000 1 2728 100 [SLPQ ufs 0xc690d454][SLP] sshd 2705 c8006d08 1008 1 2703 0004002 [SLPQ ufs 0xc690d454][SLP] perl5.8.7 2639 c8006adc0 1 2639 000 [SLPQ ufs 0xc690d454][SLP] syslogd 546 c6a10684 80 1 540 100 [SLPQ ufs 0xc6c07c4c][SLP] nginx 36 c60d722c0 0 0 204 [SLPQ - 0xe4c45cf0][SLP] schedcpu 35 c60d74580 0 0 204 [SLPQ ktsusp 0xc60d74e0][SLP] syncer 34 c60d76840 0 0 204 [SLPQ ktsusp 0xc60d770c][SLP] vnlru 33 c60d78b00 0 0 204 [SLPQ ktsusp 0xc60d7938][SLP] bufdaemon 32 c60d7adc0 0 0 20c [SLPQ pgzero 0xc06d0b14][SLP] pagezero 31 c60d7d080 0 0 204 [SLPQ psleep 0xc06c8a48][SLP] vmdaemon 30 c60d80000 0 0 204 [SLPQ psleep 0xc06c8a08][SLP] pagedaemon 29 c60d822c0 0 0 204 [IWAIT] irq1: atkbd0 28 c60d84580 0 0 204 [IWAIT] swi0: sio 27 c602d6840 0 0 204 [IWAIT] irq18: atapci1 26 c602d8b00 0 0 204 [IWAIT] irq15: ata1 25 c602dadc0 0 0 204 [IWAIT] irq14: ata0 24 c602dd080 0 0 204 [IWAIT] irq27: em1 23 c6070 0 0 204 [IWAIT] irq26: em0 22 c607022c0 0 0 204 [IWAIT] irq9: acpi0 21 c60704580 0 0 204 [IWAIT] swi2: cambio 20 c60706840 0 0 204 [IWAIT] swi6: task queue 9 c60708b00 0 0 204 [SLPQ - 0xc6064e00][SLP] acpi_task2 8 c6070adc0 0 0 204 [SLPQ - 0xc6064e00][SLP] acpi_task1 7 c5fd822c0 0 0 204 [SLPQ - 0xc6064e00][SLP] acpi_task0 19 c5fd84580 0 0 204 [IWAIT] swi6: Giant taskq 6 c5fd86840 0 0 204 [SLPQ - 0xc6065000][SLP] thread taskq 18 c5fd88b00 0 0 204 [IWAIT] swi5: Fast taskq 5 c5fd8adc0 0 0 204 [SLPQ - 0xc60650c0][SLP] kqueue taskq 17 c5fd8d080 0 0 204 [SLPQ - 0xc06b8460][SLP] yarrow 4 c602d0000 0 0 204 [SLPQ - 0xc06b8d00][SLP] g_down 3 c602d22c0 0 0 204 [SLPQ - 0xc06b8cfc][SLP] g_up 2 c602d4580 0 0 204 [SLPQ - 0xc06b8cf4][SLP] g_event 16 c5fd30000 0 0 204 [IWAIT] swi1: net 15 c5fd322c0 0 0 204 [IWAIT] swi3: vm 14 c5fd34580 0 0 20c [IWAIT] swi4: clock sio 13 c5fd36840 0 0 20c [CPU 0] idle: cpu0 12 c5fd38b00 0 0 20c [CPU 1] idle: cpu1 11 c5fd3adc0 0 0 20c [CPU 2] idle: cpu2 10 c5fd3d080 0 0 20c [CPU 3] idle: cpu3 
1 c5fd80000 0 1 0004200 [SLPQ ufs 0xc690d5a8][SLP] init 0 c06b8e000 0 0 200 [IWAIT] swapper 21030 c8e1f8b0 1012 21003 21003 0002100 zomb[INACTIVE] sshd 5005 c93f86840 2728 5005 0006100 zomb[INACTIVE] sshd 5000 c83d1d080 2728 5000 0006100 zomb[INACTIVE] sshd 30211 c7907d080 2728 30211 0006100 zomb[INACTIVE] sshd 4997 c6a10adc0 2728 4997 0006100 zomb[INACTIVE] sshd 6404 c85586840 2728 6404 0006100 zomb[INACTIVE] sshd db show lockedvnods Locked vnodes 0xc690d550: tag ufs, type VDIR usecount 1, writecount 0,
Re: Recurring problem: processes block accessing UFS file system
Hi! Greg == Greg Rivers [EMAIL PROTECTED] writes: Greg It's taken more than a month, but the problem has recurred Greg without snapshots ever having been run. I've got a good trace I think that I have the same problem on a fresh CURRENT. For some processes I see MWCHAN = ufs and D in the STAT. And I can't kill such processes even with -9. And system can't kill them too on shutdown. So, system can't do shutdown and wait forever after All buffers synced message. At this moment I've entered to KDB do show lockedvnods: Locked vnodes 0xc687cb58: tag ufs, type VDIR usecount 1, writecount 0, refcount 2 mountedhere 0 flags () v_object 0xcb5b1934 ref 0 pages 0 lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 pending ino 2072602, on dev ad4s1g 0xc687ca50: tag ufs, type VDIR usecount 31, writecount 0, refcount 32 mountedhere 0 flags () v_object 0xc85d2744 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 pending ino 2072603, on dev ad4s1g 0xc687c948: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc875d000 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 pending ino 2072615, on dev ad4s1g 0xc691f420: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc8a773e0 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 pending ino 2072680, on dev ad4s1g 0xc691f318: tag ufs, type VDIR usecount 3, writecount 0, refcount 4 mountedhere 0 flags () v_object 0xc8a7b2e8 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc7019780 (pid 74103) with 2 pending ino 2072795, on dev ad4s1g 0xc69bb528: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc7890744 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc91f4600 (pid 74129) with 1 pending ino 2072767, on dev ad4s1g Locked vnodes 0xc687cb58: tag ufs, type VDIR usecount 1, writecount 0, refcount 2 mountedhere 0 flags () v_object 0xcb5b1934 ref 0 pages 0 lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 pending ino 2072602, on dev ad4s1g 0xc687ca50: tag ufs, type VDIR usecount 31, writecount 0, refcount 32 mountedhere 0 flags () v_object 0xc85d2744 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 pending ino 2072603, on dev ad4s1g 0xc687c948: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc875d000 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 pending ino 2072615, on dev ad4s1g 0xc691f420: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc8a773e0 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 pending ino 2072680, on dev ad4s1g 0xc691f318: tag ufs, type VDIR usecount 3, writecount 0, refcount 4 mountedhere 0 flags () v_object 0xc8a7b2e8 ref 0 pages 1 lock type ufs: EXCL (count 1) by t(kgdb) After that I've done call doadump and got vmcore. ps show me: (kgdb) ps During symbol reading, Incomplete CFI data; unspecified registers at 0xc04d97eb. 
pidproc uid ppid pgrp flag stat comm wchan 74686 c94640000 1 1 00 1 sh ufs c687caa8 74195 c970d0000 3074 74195 4000100 1 sshd ufs c687caa8 74178 c7682adc0 3074 74178 004000 1 sshd ufs c687c9a0 74129 c9b82adc 1008 1 5504 004000 1 parser3.cgi ufs c691f370 74103 c70b5458 1008 1 5504 00 1 httpdufs c69bb580 65610 c92c0458 1005 1 65610 004000 1 sftp-server ufs c691f478 5518 c6247458 1008 1 5516 004002 1 perl5.8.7ufs c687caa8 3081 c7523d080 1 3081 00 1 cron ufs c687caa8 3074 c7682d080 1 3074 000100 1 sshd ufs c687caa8 3016 c7523adc0 1 3016 00 1 syslogd ufs c687caa8 519 c68e4d08 80 1 518 000100 1 nginxufs c691f370 34 c6260 0 0 000204 1 schedcpu - e88b3cf0 33 c62438b00 0 0 000204 1 syncer ktsusp c6243938 32 c6243adc0 0 0 000204 1 vnlruktsusp c6243b64 31 c6243d080 0 0 000204 1 bufdaemonktsusp c6243d90 30 c62440000 0 0 00020c 1 pagezero pgzero c06c21a0 29 c624422c0 0 0 000204 1 vmdaemon psleep c06c1d08 28 c62444580 0 0 000204 1 pagedaemon psleep c06c1cc8 27 c602e6840 0 0 000204 1 irq1: atkbd0 26 c602e8b00 0 0 000204 1 swi0: sio 25 c602eadc0 0 0 000204 1 irq18: atapci1 24
Re: Recurring problem: processes block accessing UFS file system
On 5 Jan, Denis Shaposhnikov wrote: Hi! Greg == Greg Rivers [EMAIL PROTECTED] writes: Greg It's taken more than a month, but the problem has recurred Greg without snapshots ever having been run. I've got a good trace I think that I have the same problem on a fresh CURRENT. For some processes I see MWCHAN = ufs and D in the STAT. And I can't kill such processes even with -9. And system can't kill them too on shutdown. So, system can't do shutdown and wait forever after All buffers synced message. At this moment I've entered to KDB do show lockedvnods: Locked vnodes 0xc687cb58: tag ufs, type VDIR usecount 1, writecount 0, refcount 2 mountedhere 0 flags () v_object 0xcb5b1934 ref 0 pages 0 lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 pending ino 2072602, on dev ad4s1g 0xc687ca50: tag ufs, type VDIR usecount 31, writecount 0, refcount 32 mountedhere 0 flags () v_object 0xc85d2744 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 pending ino 2072603, on dev ad4s1g 0xc687c948: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc875d000 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 pending ino 2072615, on dev ad4s1g 0xc691f420: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc8a773e0 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 pending ino 2072680, on dev ad4s1g 0xc691f318: tag ufs, type VDIR usecount 3, writecount 0, refcount 4 mountedhere 0 flags () v_object 0xc8a7b2e8 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc7019780 (pid 74103) with 2 pending ino 2072795, on dev ad4s1g 0xc69bb528: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc7890744 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc91f4600 (pid 74129) with 1 pending ino 2072767, on dev ad4s1g Locked vnodes 0xc687cb58: tag ufs, type VDIR usecount 1, writecount 0, refcount 2 mountedhere 0 flags () v_object 0xcb5b1934 ref 0 pages 0 lock type ufs: EXCL (count 1) by thread 0xc795d600 (pid 74686) with 1 pending ino 2072602, on dev ad4s1g 0xc687ca50: tag ufs, type VDIR usecount 31, writecount 0, refcount 32 mountedhere 0 flags () v_object 0xc85d2744 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc7683d80 (pid 74178) with 6 pending ino 2072603, on dev ad4s1g 0xc687c948: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc875d000 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc91f3300 (pid 65610) with 1 pending ino 2072615, on dev ad4s1g 0xc691f420: tag ufs, type VDIR usecount 2, writecount 0, refcount 3 mountedhere 0 flags () v_object 0xc8a773e0 ref 0 pages 1 lock type ufs: EXCL (count 1) by thread 0xc68e5780 (pid 519) with 1 pending ino 2072680, on dev ad4s1g 0xc691f318: tag ufs, type VDIR usecount 3, writecount 0, refcount 4 mountedhere 0 flags () v_object 0xc8a7b2e8 ref 0 pages 1 lock type ufs: EXCL (count 1) by t(kgdb) After that I've done call doadump and got vmcore. ps show me: (kgdb) ps During symbol reading, Incomplete CFI data; unspecified registers at 0xc04d97eb. 
pidproc uid ppid pgrp flag stat comm wchan 74686 c94640000 1 1 00 1 sh ufs c687caa8 74195 c970d0000 3074 74195 4000100 1 sshd ufs c687caa8 74178 c7682adc0 3074 74178 004000 1 sshd ufs c687c9a0 74129 c9b82adc 1008 1 5504 004000 1 parser3.cgi ufs c691f370 74103 c70b5458 1008 1 5504 00 1 httpdufs c69bb580 65610 c92c0458 1005 1 65610 004000 1 sftp-server ufs c691f478 5518 c6247458 1008 1 5516 004002 1 perl5.8.7ufs c687caa8 3081 c7523d080 1 3081 00 1 cron ufs c687caa8 3074 c7682d080 1 3074 000100 1 sshd ufs c687caa8 3016 c7523adc0 1 3016 00 1 syslogd ufs c687caa8 519 c68e4d08 80 1 518 000100 1 nginxufs c691f370 34 c6260 0 0 000204 1 schedcpu - e88b3cf0 33 c62438b00 0 0 000204 1 syncer ktsusp c6243938 32 c6243adc0 0 0 000204 1 vnlruktsusp c6243b64 31 c6243d080 0 0 000204 1 bufdaemonktsusp c6243d90 30 c62440000 0 0 00020c 1 pagezero pgzero c06c21a0 29 c624422c0 0 0 000204 1 vmdaemon psleep c06c1d08 28 c62444580 0 0 000204 1 pagedaemon psleep c06c1cc8 27 c602e6840
Re: Recurring problem: processes block accessing UFS file system
Don == Don Lewis [EMAIL PROTECTED] writes:

Don> pid 519 wants to lock this vnode but some other thread is
Don> holding the vnode lock. Unfortunately we don't know who the
Don> lock holder is because the message is truncated.

Is it possible to find out the answer from the crashdump?

Don> This might just be a vnode lock leak. Build a debug kernel with
Don> the DEBUG_VFS_LOCKS and DEBUG_LOCKS options and see if anything
Don> shows up.

I'll try, thank you.

--
DSS5-RIPE DSS-RIPN 2:550/[EMAIL PROTECTED] 2:550/[EMAIL PROTECTED]
mailto:[EMAIL PROTECTED] http://neva.vlink.ru/~dsh/
Re: Recurring problem: processes block accessing UFS file system
On 5 Jan, Denis Shaposhnikov wrote:

Don == Don Lewis [EMAIL PROTECTED] writes:

Don> pid 519 wants to lock this vnode but some other thread is
Don> holding the vnode lock. Unfortunately we don't know who the
Don> lock holder is because the message is truncated.

Is it possible to find out the answer from the crashdump?

It's possible if you have the matching debug kernel, though it is more painful. In kgdb:

    print *(struct vnode *)0xc691f318
    print ((struct vnode *)0xc691f318)->v_vnlock->lk_lockholder->td_proc->p_pid

or something like that.

Don> This might just be a vnode lock leak. Build a debug kernel with
Don> the DEBUG_VFS_LOCKS and DEBUG_LOCKS options and see if anything
Don> shows up.

I'll try, thank you.

Are you using any unusual file systems, such as nullfs or unionfs?
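For reference, the debug kernel being suggested here is an ordinary kernel config plus a few extra lines. A sketch of the relevant excerpt, using only the options discussed in this thread (KDB_STOP_NMI comes up later and only matters on SMP machines):

    # kernel config excerpt -- a sketch, not a complete configuration
    makeoptions     DEBUG=-g            # keep kernel.debug for kgdb post-mortems
    options         KDB                 # kernel debugger framework
    options         DDB                 # interactive debugger (the db> prompt)
    options         KDB_STOP_NMI        # helps stop the other CPUs on a wedged SMP box
    options         DEBUG_LOCKS         # extra bookkeeping for lockmgr locks
    options         DEBUG_VFS_LOCKS     # assert correct vnode locking in VOP calls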
Re: Recurring problem: processes block accessing UFS file system
Don == Don Lewis [EMAIL PROTECTED] writes: Don Are you using any unusual file systems, such as nullfs or Don unionfs? Yes, I'm use a lots of nullfs. This is a host system for about 20 jails with nullfs mounted ro system: /dev/mirror/gm0s1a on / (ufs, local, read-only) devfs on /dev (devfs, local) /dev/mirror/gm0s1d on /usr/local (ufs, local) /dev/mirror/gm0s1e on /home (ufs, local, nosuid) /dev/mirror/gm0s1f on /tmp (ufs, local, noexec) /dev/mirror/gm0s1g on /var (ufs, NFS exported, local) procfs on /proc (procfs, local) devfs on /var/named/dev (devfs, local) /var/jail/build/the.jail on /var/jail/netflow/the.jail/.build (nullfs, local, read-only) /var/jail/netflow/the.jail.etc on /var/jail/netflow/the.jail/.build/etc (nullfs, local, read-only) /var/jail/netflow/the.jail.local on /var/jail/netflow/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/netflow/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/mysql/the.jail/.build (nullfs, local, read-only) /var/jail/mysql/the.jail.etc on /var/jail/mysql/the.jail/.build/etc (nullfs, local, read-only) /var/jail/mysql/the.jail.local on /var/jail/mysql/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/mysql/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/dspam/the.jail/.build (nullfs, local, read-only) /var/jail/dspam/the.jail.etc on /var/jail/dspam/the.jail/.build/etc (nullfs, local, read-only) /var/jail/dspam/the.jail.local on /var/jail/dspam/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/dspam/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/deliver_smtp/the.jail/.build (nullfs, local, read-only) /var/jail/deliver_smtp/the.jail.etc on /var/jail/deliver_smtp/the.jail/.build/etc (nullfs, local, read-only) /var/jail/deliver_smtp/the.jail.local on /var/jail/deliver_smtp/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/deliver_smtp/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/clamav/the.jail/.build (nullfs, local, read-only) /var/jail/clamav/the.jail.etc on /var/jail/clamav/the.jail/.build/etc (nullfs, local, read-only) /var/jail/clamav/the.jail.local on /var/jail/clamav/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/clamav/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/mx1_smtp/the.jail/.build (nullfs, local, read-only) /var/jail/mx1_smtp/the.jail.etc on /var/jail/mx1_smtp/the.jail/.build/etc (nullfs, local, read-only) /var/jail/mx1_smtp/the.jail.local on /var/jail/mx1_smtp/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/mx1_smtp/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/smtp_smtp/the.jail/.build (nullfs, local, read-only) /var/jail/smtp_smtp/the.jail.etc on /var/jail/smtp_smtp/the.jail/.build/etc (nullfs, local, read-only) /var/jail/smtp_smtp/the.jail.local on /var/jail/smtp_smtp/the.jail/.build/usr/local (nullfs, local, read-only) /var/jail/mx1_smtp/the.jail/var/db/postfix on /var/jail/smtp_smtp/the.jail/var/db/postfix (nullfs, local, read-only) devfs on /var/jail/smtp_smtp/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/cacti/the.jail/.build (nullfs, local, read-only) /var/jail/cacti/the.jail.etc on /var/jail/cacti/the.jail/.build/etc (nullfs, local, read-only) /var/jail/cacti/the.jail.local on /var/jail/cacti/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/cacti/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/rk80/the.jail/.build 
(nullfs, local, read-only) /var/jail/rk80/the.jail.etc on /var/jail/rk80/the.jail/.build/etc (nullfs, local, read-only) /var/jail/rk80/the.jail.local on /var/jail/rk80/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/rk80/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/syslog/the.jail/.build (nullfs, local, read-only) /var/jail/syslog/the.jail.etc on /var/jail/syslog/the.jail/.build/etc (nullfs, local, read-only) /var/jail/syslog/the.jail.local on /var/jail/syslog/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/syslog/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/tftpboot/the.jail/.build (nullfs, local, read-only) /var/jail/tftpboot/the.jail.etc on /var/jail/tftpboot/the.jail/.build/etc (nullfs, local, read-only) /var/jail/tftpboot/the.jail.local on /var/jail/tftpboot/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/tftpboot/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/clickon/the.jail/.build (nullfs, local, read-only) /var/jail/clickon/the.jail.etc on /var/jail/clickon/the.jail/.build/etc (nullfs, local, read-only) /var/jail/clickon/the.jail.local on /var/jail/clickon/the.jail/.build/usr/local (nullfs, local, read-only) devfs on /var/jail/clickon/the.jail/dev (devfs, local) /var/jail/build/the.jail on /var/jail/sulci/the.jail/.build (nullfs, local,
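As a side note for anyone reproducing this kind of jail setup: a read-only nullfs mount like the ones listed above can be created either with mount_nullfs(8) or with an /etc/fstab entry. A minimal sketch, reusing paths from the listing above:

    # one-off mount:
    mount_nullfs -o ro /var/jail/build/the.jail /var/jail/netflow/the.jail/.build

    # or the equivalent /etc/fstab line:
    /var/jail/build/the.jail  /var/jail/netflow/the.jail/.build  nullfs  ro  0  0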
Re: Recurring problem: processes block accessing UFS file system
On 5 Jan, Denis Shaposhnikov wrote:

Don == Don Lewis [EMAIL PROTECTED] writes:

Don> Are you using any unusual file systems, such as nullfs or
Don> unionfs?

Yes, I use a lot of nullfs. This is a host system for about 20 jails with a nullfs-mounted read-only system:

That would be my guess as to the cause of the problem. Hopefully DEBUG_VFS_LOCKS will help pinpoint the bug.
Re: Recurring problem: processes block accessing UFS file system
On 3 Jan, Greg Rivers wrote:

On Tue, 3 Jan 2006, Don Lewis wrote:

Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is sleeping on the buf's b_xflags member. It looks like 87117 is waiting for an in-progress write to complete. There are a large number of other sendmail processes waiting in this same place. How about "show buffer 0xdc76fe30"?

db> show buffer 0xdc76fe30
buf at 0xdc76fe30
b_flags = 0x20a0<vmio,delwri,cache>
b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0
b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368
lockstatus = 0, excl count = 0, excl owner 0x
b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000)
db>

Hmn, it would be nice if DDB printed b_vflags so that we would know the state of BV_BKGRDINPROG and BV_BKGRDWAIT. As it is, we don't know if the background write got lost or if we missed the wakeup.

At this point, I think it might be easier to do post-mortem debugging on a core file with kgdb and kernel.debug. Unless there is an obvious race condition, I suspect this will be a tough slog.
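Once a core file is available, the fields DDB does not print can be read back with kgdb; a minimal sketch, assuming the same buf address and example paths for the kernel and dump:

    kgdb kernel.debug /var/crash/vmcore.0

    (kgdb) print/x ((struct buf *)0xdc76fe30)->b_vflags
    (kgdb) print/x ((struct buf *)0xdc76fe30)->b_xflags
    # compare the values against the BV_BKGRDINPROG / BV_BKGRDWAIT
    # definitions in sys/sys/buf.h for the running kernel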
Re: Recurring problem: processes block accessing UFS file system
On Wed, 4 Jan 2006, Don Lewis wrote: How about show buffer 0xdc76fe30? db show buffer 0xdc76fe30 buf at 0xdc76fe30 b_flags = 0x20a0vmio,delwri,cache b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368 lockstatus = 0, excl count = 0, excl owner 0x b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000) db Hmn, it would be nice if DDB printed b_vflags so that we would know the state of BV_BKGRDINPROG and BV_BKGRDWAIT. As it is, we don't know if the background write got lost or if we missed the wakeup. At this point, I think it might be easier to do post-mortem debugging on a core file with kgdb and kernel.debug. Unless there is an obvious race condition, I suspect this will be a tough slog. I attempted to get a core, but panic wedged the machine. I'll try again next time this happens. Thanks again for looking at this. -- Greg db continue # (free up a partition and set up a dump device...) # sync;sync # sync;sync # sync;sync # KDB: enter: Line break on console [thread pid 13 tid 13 ] Stopped at kdb_enter+0x30: leave db panic panic: from debugger cpuid = 0 KDB: stack backtrace: kdb_backtrace(c0625783,0,c060c2d4,e70c5a20,c043bb58) at kdb_backtrace+0x2f panic(c060c2d4,e70c5ad8,c043aaec,c04d13c9,0) at panic+0x129 db_command_loop(c04d13c9,0,,e70c5a4c,e70c5a48) at db_command_loop db_command(c06549e4,c062efa0,c0629aa8,c0629aac,e70c5b44) at db_command+0x2a4 db_command_loop(c04d13c9,0,0,0,0) at db_command_loop+0x6a db_trap(3,0,3,c8522a80,e70c5b94) at db_trap+0xe2 kdb_trap(3,0,e70c5b9c,f2db0257,0) at kdb_trap+0xb6 trap(c8710008,28,e70c0028,f9,c8714c00) at trap+0x4d1 calltrap() at calltrap+0x5 --- trap 0x3, eip = 0xc04d13c9, esp = 0xe70c5bdc, ebp = 0xe70c5be4 --- kdb_enter(c0622c7f,679ed6d4,36,c8522a80,c8714c00) at kdb_enter+0x30 siointr1(c8714c00,37f86c,e5857050,c8522a80,c8522a80) at siointr1+0xd1 siointr(c8714c00,2,0,4,c8522a80) at siointr+0x76 intr_execute_handlers(c851a490,e70c5c68,e70c5cac,c05da733,34) at intr_execute_handlers+0x8c lapic_handle_intr(34) at lapic_handle_intr+0x3b Xapic_isr1() at Xapic_isr1+0x33 --- interrupt, eip = 0xc078427a, esp = 0xe70c5cac, ebp = 0xe70c5cac --- acpi_cpu_c1(37f867,43d0eea4,7b70cdc0,e70c5ce0,37f867) at acpi_cpu_c1+0x5 acpi_cpu_idle(e70c5d04,c049a72a,1,0,0) at acpi_cpu_idle+0x18a cpu_idle(1,0,0,0,0) at cpu_idle+0x29 idle_proc(0,e70c5d38,0,0,0) at idle_proc+0xad fork_exit(c049a67d,0,e70c5d38) at fork_exit+0x7b fork_trampoline() at fork_trampoline+0x8 --- trap 0x1, eip = 0, esp = 0xe70c5d6c, ebp = 0 --- Uptime: 42d10h54m36s GEOM_MIRROR: Device gm1: provider mirror/gm1 destroyed. pid 13: corrected slot count (0-1) ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
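The "free up a partition and set up a dump device" step mentioned in the transcript above normally amounts to something like the following; the partition name is only an example, any spare swap-sized raw partition will do:

    # point the kernel at a dump partition (ad4s1b is a hypothetical example)
    dumpon /dev/ad4s1b

    # make it persistent across reboots via /etc/rc.conf:
    dumpdev="/dev/ad4s1b"
    dumpdir="/var/crash"

    # after the reboot, savecore(8) (run from rc) writes the dump out as
    # /var/crash/vmcore.N; from DDB the dump itself is triggered with
    # "call doadump", or by panicking as shown above.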
Re: Recurring problem: processes block accessing UFS file system
On Tue, 22 Nov 2005, I wrote: On Mon, 21 Nov 2005, Kris Kennaway wrote: It may not be the same problem. You should also try to obtain a trace when snapshots are not implicated. Agreed. I'll do so at the first opportunity. First, my thanks to all of you for looking into this. It's taken more than a month, but the problem has recurred without snapshots ever having been run. I've got a good trace of the machine in this state (ftp://ftp.fedex.com/incoming/no-snapshots.bz2). My apologies for the size of the debug output, but the processes had really stacked up this time before I noticed it. I have enough capacity that I can afford to have this machine out of production for a while, so I've left it suspended in kdb for the time being in case additional information is needed. Please let me know if there's anything else I can do to facilitate troubleshooting this. Thanks! -- Greg ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recurring problem: processes block accessing UFS file system
On 3 Jan, Greg Rivers wrote: On Tue, 22 Nov 2005, I wrote: On Mon, 21 Nov 2005, Kris Kennaway wrote: It may not be the same problem. You should also try to obtain a trace when snapshots are not implicated. Agreed. I'll do so at the first opportunity. First, my thanks to all of you for looking into this. It's taken more than a month, but the problem has recurred without snapshots ever having been run. I've got a good trace of the machine in this state (attached). My apologies for the size of the debug output, but the processes had really stacked up this time before I noticed it. I have enough capacity that I can afford to have this machine out of production for a while, so I've left it suspended in kdb for the time being in case additional information is needed. Please let me know if there's anything else I can do to facilitate troubleshooting this. Thanks! There are large number of sendmail processes waiting on vnode locks which are held by other sendmail processes that are waiting on other vnode locks, etc. until we get to sendmail pid 87150 which is holding a vnode lock and waiting to lock a buf. Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480 sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158 mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5 sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11 msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7 acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89 lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211 breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52 bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383 ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154 ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61 ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36 VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33 kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36 syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324 Xint0x80_syscall() at Xint0x80_syscall+0x1f This doesn't appear to be a buf/memory exhausting problem because syncer, bufdaemon, and pagedaemon all appear to be idle. What does show lockedbufs say? ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recurring problem: processes block accessing UFS file system
On Tue, 3 Jan 2006, Don Lewis wrote: There are large number of sendmail processes waiting on vnode locks which are held by other sendmail processes that are waiting on other vnode locks, etc. until we get to sendmail pid 87150 which is holding a vnode lock and waiting to lock a buf. Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480 sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158 mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5 sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11 msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7 acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89 lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211 breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52 bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383 ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154 ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61 ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36 VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33 kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36 syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324 Xint0x80_syscall() at Xint0x80_syscall+0x1f This doesn't appear to be a buf/memory exhausting problem because syncer, bufdaemon, and pagedaemon all appear to be idle. What does show lockedbufs say? 
db show lockedbufs buf at 0xdc625678 b_flags = 0x2000vmio b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xddcef000, b_blkno = 365233216 lockstatus = 2, excl count = 1, excl owner 0xfffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8a088, 0x53175000),(0xc8984108, 0x2b8a089, 0x7ae56000),(0xc8984108, 0x2b8a08a, 0xd3f57000),(0xc8984108, 0x2b8a08b, 0xd7d58000) buf at 0xdc6b8ab0 b_flags = 0x2000vmio b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xdf9ab000, b_blkno = 365257760 lockstatus = 2, excl count = 1, excl owner 0xfffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8ac84, 0x60b1000),(0xc8984108, 0x2b8ac85, 0x454d2000),(0xc8984108, 0x2b8ac86, 0x1b273000),(0xc8984108, 0x2b8ac87, 0x47b74000) buf at 0xdc6c3cc8 b_flags = 0x2000vmio b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xdfbd7000, b_blkno = 365265888 lockstatus = 2, excl count = 1, excl owner 0xfffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8b07c, 0xba549000),(0xc8984108, 0x2b8b07d, 0x92eaa000),(0xc8984108, 0x2b8b07e, 0xbdf4b000),(0xc8984108, 0x2b8b07f, 0x2090c000) buf at 0xdc6d9e68 b_flags = 0x2000vmio b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe0027000, b_blkno = 365240832 lockstatus = 2, excl count = 1, excl owner 0xfffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b8a440, 0x61b8d000),(0xc8984108, 0x2b8a441, 0xb0f4e000),(0xc8984108, 0x2b8a442, 0x5f98f000),(0xc8984108, 0x2b8a443, 0x5c21) buf at 0xdc6e5458 b_flags = 0x2000vmio b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe025f000, b_blkno = 364617056 lockstatus = 2, excl count = 1, excl owner 0xfffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b773ac, 0x4e539000),(0xc8984108, 0x2b773ad, 0x6e13a000),(0xc8984108, 0x2b773ae, 0xc653b000),(0xc8984108, 0x2b773af, 0x14e3c000) buf at 0xdc6fd4b8 b_flags = 0x2000vmio b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe070f000, b_blkno = 365224960 lockstatus = 2, excl count = 1, excl owner 0xfffe b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b89c80, 0x37c6d000),(0xc8984108, 0x2b89c81, 0x2e40e000),(0xc8984108, 0x2b89c82, 0xa39af000),(0xc8984108, 0x2b89c83, 0x27ff) buf at 0xdc713f50 b_flags = 0xa00200a0remfree,vmio,clusterok,delwri,cache b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624 lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000) buf at 0xdc765f50 b_flags = 0xa00200a0remfree,vmio,clusterok,delwri,cache b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc9c68b60), b_data = 0xe1b7b000, b_blkno = 364555424 lockstatus = 2, excl count = 1, excl owner 0xce9d4c00 b_npages = 4, pages(OBJ, IDX, PA): (0xcff7,
Re: Recurring problem: processes block accessing UFS file system
On 3 Jan, Greg Rivers wrote: On Tue, 3 Jan 2006, Don Lewis wrote: There are large number of sendmail processes waiting on vnode locks which are held by other sendmail processes that are waiting on other vnode locks, etc. until we get to sendmail pid 87150 which is holding a vnode lock and waiting to lock a buf. Tracing command sendmail pid 87150 tid 100994 td 0xcf1c5480 sched_switch(cf1c5480,0,1,b2c5195e,a480a2bc) at sched_switch+0x158 mi_switch(1,0,c04d7b33,dc713fb0,ec26a6ac) at mi_switch+0x1d5 sleepq_switch(dc713fb0,ec26a6e0,c04bb9ce,dc713fb0,50) at sleepq_switch+0x16f sleepq_wait(dc713fb0,50,c0618ef5,0,202122) at sleepq_wait+0x11 msleep(dc713fb0,c0658430,50,c0618ef5,0) at msleep+0x3d7 acquire(ec26a748,120,6,15c2e6e0,0) at acquire+0x89 lockmgr(dc713fb0,202122,c89855cc,cf1c5480,dc76fe30) at lockmgr+0x45f getblk(c8985550,15c2e6e0,0,4000,0) at getblk+0x211 breadn(c8985550,15c2e6e0,0,4000,0) at breadn+0x52 bread(c8985550,15c2e6e0,0,4000,0) at bread+0x4c ffs_vget(c887,ae58b3,2,ec26a8d4,8180) at ffs_vget+0x383 ffs_valloc(c8d41660,8180,c92e8d00,ec26a8d4,c05f9302) at ffs_valloc+0x154 ufs_makeinode(8180,c8d41660,ec26abd4,ec26abe8,ec26aa24) at ufs_makeinode+0x61 ufs_create(ec26aa50,ec26aa24,ec26ad04,ec26abc0,ec26ab0c) at ufs_create+0x36 VOP_CREATE_APV(c0646cc0,ec26aa50,2,ec26aa50,0) at VOP_CREATE_APV+0x3c vn_open_cred(ec26abc0,ec26acc0,180,c92e8d00,6) at vn_open_cred+0x1fe vn_open(ec26abc0,ec26acc0,180,6,c679eacb) at vn_open+0x33 kern_open(cf1c5480,81416c0,0,a03,180) at kern_open+0xca open(cf1c5480,ec26ad04,c,cf1c5480,8169000) at open+0x36 syscall(3b,bfbf003b,bfbf003b,0,a02) at syscall+0x324 Xint0x80_syscall() at Xint0x80_syscall+0x1f This doesn't appear to be a buf/memory exhausting problem because syncer, bufdaemon, and pagedaemon all appear to be idle. What does show lockedbufs say? db show lockedbufs [snip] looks like this is the buf that pid 87150 is waiting for: buf at 0xdc713f50 b_flags = 0xa00200a0remfree,vmio,clusterok,delwri,cache b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624 lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000) which is locked by this thread: Tracing command sendmail pid 87117 tid 101335 td 0xcfeb5d80 sched_switch(cfeb5d80,0,1,fd1926a,640c65f9) at sched_switch+0x158 mi_switch(1,0,c04d7b33,dc76fe8c,ec883b2c) at mi_switch+0x1d5 sleepq_switch(dc76fe8c,ec883b60,c04bb9ce,dc76fe8c,4c) at sleepq_switch+0x16f sleepq_wait(dc76fe8c,4c,c061e9ac,0,0) at sleepq_wait+0x11 msleep(dc76fe8c,c0662f80,4c,c061e9ac,0) at msleep+0x3d7 getdirtybuf(dc76fe30,c0662f80,1,ec883ba8,0) at getdirtybuf+0x221 softdep_update_inodeblock(cd1bc528,dc713f50,1,4000,0) at softdep_update_inodeblo ck+0x267 ffs_update(cd953bb0,1,0,cd953bb0,ec883c78,c0529a59,0,0,0,4,1,cd953c2c) at ffs_up date+0x27f ffs_syncvnode(cd953bb0,1,4,ec883c78,c05f9a70) at ffs_syncvnode+0x52e ffs_fsync(ec883cb4,ec883cd0,c052468a,c0646cc0,ec883cb4) at ffs_fsync+0x1c VOP_FSYNC_APV(c0646cc0,ec883cb4,0,0,0) at VOP_FSYNC_APV+0x3a fsync(cfeb5d80,ec883d04,4,cfeb5d80,ec883d2c) at fsync+0x1db syscall(3b,3b,3b,80c7c1b,bfbfa6b0) at syscall+0x324 Xint0x80_syscall() at Xint0x80_syscall+0x1f --- syscall (95, FreeBSD ELF32, fsync), eip = 0x8830f63f, esp = 0xbfbfa66c, ebp = 0xbfbfaf98 --- Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is sleeping on the buf's b_xflags member. 
It looks like 87117 is waiting for an in-progress write to complete. There are a large number of other sendmail processes waiting in this same place. How about show buffer 0xdc76fe30? This is getting into an area of the kernel that I do not understand well.
Re: Recurring problem: processes block accessing UFS file system
On Tue, 3 Jan 2006, Don Lewis wrote: db show lockedbufs [snip] looks like this is the buf that pid 87150 is waiting for: buf at 0xdc713f50 b_flags = 0xa00200a0remfree,vmio,clusterok,delwri,cache b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe0b7b000, b_blkno = 365094624 lockstatus = 2, excl count = 1, excl owner 0xcfeb5d80 b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b85cdc, 0xa89e9000),(0xc8984108, 0x2b85cdd, 0xa852a000),(0xc8984108, 0x2b85cde, 0xa850b000),(0xc8984108, 0x2b85cdf, 0xa836c000) which is locked by this thread: Tracing command sendmail pid 87117 tid 101335 td 0xcfeb5d80 sched_switch(cfeb5d80,0,1,fd1926a,640c65f9) at sched_switch+0x158 mi_switch(1,0,c04d7b33,dc76fe8c,ec883b2c) at mi_switch+0x1d5 sleepq_switch(dc76fe8c,ec883b60,c04bb9ce,dc76fe8c,4c) at sleepq_switch+0x16f sleepq_wait(dc76fe8c,4c,c061e9ac,0,0) at sleepq_wait+0x11 msleep(dc76fe8c,c0662f80,4c,c061e9ac,0) at msleep+0x3d7 getdirtybuf(dc76fe30,c0662f80,1,ec883ba8,0) at getdirtybuf+0x221 softdep_update_inodeblock(cd1bc528,dc713f50,1,4000,0) at softdep_update_inodeblo ck+0x267 ffs_update(cd953bb0,1,0,cd953bb0,ec883c78,c0529a59,0,0,0,4,1,cd953c2c) at ffs_up date+0x27f ffs_syncvnode(cd953bb0,1,4,ec883c78,c05f9a70) at ffs_syncvnode+0x52e ffs_fsync(ec883cb4,ec883cd0,c052468a,c0646cc0,ec883cb4) at ffs_fsync+0x1c VOP_FSYNC_APV(c0646cc0,ec883cb4,0,0,0) at VOP_FSYNC_APV+0x3a fsync(cfeb5d80,ec883d04,4,cfeb5d80,ec883d2c) at fsync+0x1db syscall(3b,3b,3b,80c7c1b,bfbfa6b0) at syscall+0x324 Xint0x80_syscall() at Xint0x80_syscall+0x1f --- syscall (95, FreeBSD ELF32, fsync), eip = 0x8830f63f, esp = 0xbfbfa66c, ebp = 0xbfbfaf98 --- Pid 87117 is playing with buf 0xdc76fe30 which is not locked, and is sleeping on the buf's b_xflags member. It looks like 87117 is waiting for an in-progress write to complete. There are a large number of other sendmail processes waiting in this same place. How about show buffer 0xdc76fe30? db show buffer 0xdc76fe30 buf at 0xdc76fe30 b_flags = 0x20a0vmio,delwri,cache b_error = 0, b_bufsize = 16384, b_bcount = 16384, b_resid = 0 b_bufobj = (0xc8985610), b_data = 0xe1d6b000, b_blkno = 365086368 lockstatus = 0, excl count = 0, excl owner 0x b_npages = 4, pages(OBJ, IDX, PA): (0xc8984108, 0x2b858d4, 0xa8de1000),(0xc8984108, 0x2b858d5, 0xa8c62000),(0xc8984108, 0x2b858d6, 0xa8de3000),(0xc8984108, 0x2b858d7, 0xa8e64000) db This is getting into an area of the kernel that I do not understand well. For me, we've long since passed that point. :-) -- Greg ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recurring problem: processes block accessing UFS file system
On 26 Nov, Tor Egge wrote:

>> Thanks Kris, these are exactly the clues I needed. Since the deadlock
>> during a snapshot is fairly easy to reproduce, I did so and collected
>> this information below. alltrace didn't work as I expected (didn't
>> produce a trace), so I traced each pid associated with a locked vnode
>> separately.
>
> The vnode syncing loop in ffs_sync() has some problems:
>
> 1. Softupdate processing performed after the loop has started might
>    trigger the need for retrying the loop. Processing of dirrem work
>    items can cause IN_CHANGE to be set on some inodes, causing a
>    deadlock in ufs_inactive() later on while the file system is
>    suspended.

I also don't like how this loop interacts with the vnode list churn done by vnlru_free(). Maybe vnode recycling for a file system should be skipped while a file system is being suspended or unmounted.

> 2. nvp might no longer be associated with the same mount point after
>    MNT_IUNLOCK(mp) has been called in the loop. This can cause the
>    vnode list traversal to be incomplete, with stale information in
>    the snapshot. Further damage can occur when background fsck uses
>    that stale information.

It looks like this is handled in __mnt_vnode_next() by starting over. Skipping vnode recycling should avoid this problem in the snapshot case.

This loop should be bypassed in normal operation and the individual vnode syncing should be done by the syncer. The only reason this loop isn't skipped during normal operation is that timestamp updates aren't sufficient to add vnodes to the syncer worklist.

> Just a few lines down from that loop is a new problem:
>
> 3. softdep_flushworklist() might not have processed all dirrem work
>    items associated with the file system even if both error and count
>    are zero. This can cause both background fsck and softupdate
>    processing (after the file system has been resumed) to decrement
>    the link count of an inode, causing file system corruption or a
>    panic.

Are you sure this is still true after the changes that were committed to both HEAD and RELENG_6 before 6.0-RELEASE?

All the pending items that hang around various lists make me nervous, though. I really think the number of each flavor should be tracked per mount point and softupdates should complain if the counts are non-zero at the end of the suspend and unmount tasks.

> Processing of these work items while the file system is suspended
> causes a panic.
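Purely to illustrate the per-mount-point accounting being suggested in that last remark (this is not existing FreeBSD code; the structure, field, and function names are made up for the sketch), the idea amounts to something like:

    /* Hypothetical sketch of the suggestion above -- not actual kernel code. */
    #include <sys/param.h>
    #include <sys/systm.h>

    struct softdep_mount_stats {
            int     sm_dirrem_pending;      /* dirrem work items still queued */
            int     sm_inodedep_pending;    /* ...one counter per work-item flavor */
    };

    /*
     * Bump/drop the counters wherever work items are queued and retired,
     * then call this at the end of the suspend and unmount paths so a
     * leftover item complains loudly instead of corrupting the snapshot.
     */
    static void
    softdep_check_drained(struct softdep_mount_stats *sm)
    {
            KASSERT(sm->sm_dirrem_pending == 0,
                ("softdep: %d dirrem work items left after flush",
                sm->sm_dirrem_pending));
            KASSERT(sm->sm_inodedep_pending == 0,
                ("softdep: %d inodedep work items left after flush",
                sm->sm_inodedep_pending));
    }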
Re: Recurring problem: processes block accessing UFS file system
On 26 Nov, Tor Egge wrote:

>> Thanks Kris, these are exactly the clues I needed. Since the deadlock
>> during a snapshot is fairly easy to reproduce, I did so and collected
>> this information below. alltrace didn't work as I expected (didn't
>> produce a trace), so I traced each pid associated with a locked vnode
>> separately.
>
> The vnode syncing loop in ffs_sync() has some problems:

There is also a MNT_VNODE_FOREACH() loop in ffs_snapshot().
Re: Recurring problem: processes block accessing UFS file system
On 22 Nov, Greg Rivers wrote:

> On Mon, 21 Nov 2005, Kris Kennaway wrote:
>
>> It may not be the same problem. You should also try to obtain a trace
>> when snapshots are not implicated.
>
> Agreed. I'll do so at the first opportunity.
>
>> 'show lockedvnods' is very important for diagnosing filesystem
>> deadlocks, and 'alltrace' is the easiest way to obtain stack traces
>> for the resulting processes that are holding locks. You'll want to
>> 'set $lines=0' and capture the output from the serial console to a
>> file.
>
> Thanks Kris, these are exactly the clues I needed. Since the deadlock
> during a snapshot is fairly easy to reproduce, I did so and collected
> this information below. alltrace didn't work as I expected (didn't
> produce a trace), so I traced each pid associated with a locked vnode
> separately.

There appears to be a lock order reversal between a vnode lock and the hand-rolled suspend lock. Basically, it is not permissible to wait for the suspend lock while holding a vnode lock. In cases where a thread holds a vnode lock and then wants to start a file system write, it has to make a non-blocking attempt to grab the suspend lock and then release the vnode lock if the suspend lock can't be obtained. For example, in vn_open_cred():

        if (vn_start_write(ndp->ni_dvp, mp, V_NOWAIT) != 0) {
                NDFREE(ndp, NDF_ONLY_PNBUF);
                vput(ndp->ni_dvp);
                VFS_UNLOCK_GIANT(vfslocked);
                if ((error = vn_start_write(NULL, mp,
                    V_XSLEEP | PCATCH)) != 0)
                        return (error);
                goto restart;
        }

The problem is that vput() may call vinactive(), which calls ufs_inactive(), which may attempt to do a file system write and block on the snapshot lock.

db> trace 99361
Tracing pid 99361 tid 100437 td 0xce67cd80
sched_switch(ce67cd80,0,1,3d4e53a2,95c40548) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c86d706c,ebb00908) at mi_switch+0x1d5
sleepq_switch(c86d706c,ebb0093c,c04bb9ce,c86d706c,9f) at sleepq_switch+0x16f
sleepq_wait(c86d706c,9f,c061a026,0,c0647200) at sleepq_wait+0x11
msleep(c86d706c,c86d7044,29f,c061a026,0,c0653720,c8dc1aa0,ebb00978,ce67cd80) at msleep+0x3d7
vn_write_suspend_wait(c8dc1aa0,c86d7000,1,0,ca07f817) at vn_write_suspend_wait+0x181
ufs_inactive(ebb009c8,ebb009dc,c051b330,c0646cc0,ebb009c8) at ufs_inactive+0x1b4
VOP_INACTIVE_APV(c0646cc0,ebb009c8,a03,ebb009d0,c051054f) at VOP_INACTIVE_APV+0x3a
vinactive(c8dc1aa0,ce67cd80,ffdf,ebb00a24,c0513592) at vinactive+0x82
vput(c8dc1aa0,ffdf,2,ebb00a50,0) at vput+0x187
vn_open_cred(ebb00bc0,ebb00cc0,180,c9430580,5) at vn_open_cred+0xfb
vn_open(ebb00bc0,ebb00cc0,180,5,4b5fcfad) at vn_open+0x33
kern_open(ce67cd80,81414e0,0,a03,180) at kern_open+0xca
open(ce67cd80,ebb00d04,c,ce67cd80,8155000) at open+0x36
syscall(3b,3b,3b,0,a02) at syscall+0x324

99361 ce682624 100 673 673 100 [SLPQ suspfs 0xc86d706c][SLP] sendmail

If a thread gets stuck waiting for the suspend lock while holding a vnode lock, then it is possible for the thread that is creating the snapshot to get stuck when it attempts to lock that same vnode.

98639 c8e91418 11008 98637 97958 0004102 [SLPQ ufs 0xc8991d18][SLP] mksnap_ffs

db> trace 98639
Tracing pid 98639 tid 100282 td 0xc8e8e300
sched_switch(c8e8e300,0,1,cc167362,733b6597) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c8991d18,eb80c544) at mi_switch+0x1d5
sleepq_switch(c8991d18,eb80c578,c04bb9ce,c8991d18,50) at sleepq_switch+0x16f
sleepq_wait(c8991d18,50,c06186b3,0,c065ae80) at sleepq_wait+0x11
msleep(c8991d18,c065888c,50,c06186b3,0) at msleep+0x3d7
acquire(eb80c5e0,40,6,c8e8e300,0) at acquire+0x89
lockmgr(c8991d18,2002,c8991d3c,c8e8e300,eb80c608) at lockmgr+0x45f
vop_stdlock(eb80c660,0,c0646cc0,eb80c660,eb80c618) at vop_stdlock+0x2f
VOP_LOCK_APV(c0647200,eb80c660,eb80c630,c05f9f43,eb80c660) at VOP_LOCK_APV+0x44
ffs_lock(eb80c660,0,2002,c8991cc0,eb80c67c) at ffs_lock+0x19
VOP_LOCK_APV(c0646cc0,eb80c660,c06402e0,eb80c7d0,0) at VOP_LOCK_APV+0x44
vn_lock(c8991cc0,2002,c8e8e300,4000,c851f300) at vn_lock+0x132
ffs_snapshot(c86d7000,cc978e00,eb80c9a4,6c,eb80c964) at ffs_snapshot+0x152b
ffs_mount(c86d7000,c8e8e300,0,c8e8e300,c9f71bb0) at ffs_mount+0x9af
vfs_domount(c8e8e300,c88a8760,c88a8870,11211000,c9a77510) at vfs_domount+0x728
vfs_donmount(c8e8e300,11211000,eb80cbec,c9314580,e) at vfs_donmount+0x12e
kernel_mount(c8737b80,11211000,eb80cc30,6c,bfbfe8c8) at kernel_mount+0x46
ffs_cmount(c8737b80,bfbfe110,11211000,c8e8e300,0) at ffs_cmount+0x85
mount(c8e8e300,eb80cd04,10,c8e8e300,8052000) at mount+0x21b
syscall(3b,3b,3b,8814819c,bfbfe0b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f

This block of code in ufs_inactive() is triggering the problem:

        if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
                if ((ip->i_flag
Re: Recurring problem: processes block accessing UFS file system
Thanks Kris, these are exactly the clues I needed. Since the deadlock during a snapshot is fairly easy to reproduce, I did so and collected this information below. alltrace didn't work as I expected (didn't produce a trace), so I traced each pid associated with a locked vnode separately. The vnode syncing loop in ffs_sync() has some problems: 1. Softupdate processing performed after the loop has started might trigger the need for retrying the loop. Processing of dirrem work items can cause IN_CHANGE to be set on some inodes, causing deadlock in ufs_inactive() later on while the file system is suspended). 2. nvp might no longer be associated with the same mount point after MNT_IUNLOCK(mp) has been called in the loop. This can cause the vnode list traversal to be incomplete, with stale information in the snapshot. Further damage can occur when background fsck uses that stale information. Just a few lines down from that loop is a new problem: 3. softdep_flushworklist() might not have processed all dirrem work items associated with the file system even if both error and count are zero. This can cause both background fsck and softupdate processing (after file system has been resumed) to decrement the link count of an inode, causing file system corruption or a panic. Processing of these work items while the file system is suspended causes a panic. - Tor Egge ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recurring problem: processes block accessing UFS file system
On Mon, 2005-Nov-21 21:23:10 -0600, Greg Rivers wrote:

> the deadlock also occurs under normal operation when no snapshots are
> running or have ever been run since boot. It's just much less frequent
> in this case.

I've also seen this behaviour in both 5.x and 6.0 (I can't recall if it bit me in 4.x). In my case, the trigger is OpenOffice.org - one of the offending processes is almost always OOo/program/pagein. Unfortunately, I haven't been able to get to the bottom of this (and my son isn't happy that OOo keeps deadlocking on him).

--
Peter Jeremy
Re: Recurring problem: processes block accessing UFS file system
On Mon, Nov 21, 2005 at 05:54:09PM -0600, Greg Rivers wrote:

> I've recently put up three busy email relay hosts running 6.0-STABLE.
> Performance is excellent except for a nagging critical issue that keeps
> cropping up. /var/spool is its own file system mounted on a geom stripe
> of four BSD partitions (details below). Once every two or three days
> all the processes accessing /var/spool block forever in disk wait. All
> three machines suffer this problem. No diagnostic messages are
> generated and the machines continue running fine otherwise, but a
> reboot is required to clear the condition. This problem occurs during
> normal operation, but is particularly likely to occur during a backup
> when dump makes a snapshot. There doesn't appear to be a problem with
> gstripe, as gstripe status is UP and I can read the raw device just
> fine while processes continue to block on the file system. I tried
> running a kernel with WITNESS and DIAGNOSTIC, but these options shed no
> light. If I catch the problem early enough I can break successfully
> into kdb; otherwise, if too many processes stack up, the machine hangs
> going into kdb and must be power-cycled.

Make sure you have KDB_STOP_NMI in your kernel.

> I obtained the following process listing and traces from kdb. I traced
> mksnap_ffs which was blocked in ufs, and two random sendmail processes
> that were blocked in ufs and suspfs respectively.

Looks like a UFS snapshot deadlock. Are you running something like dump -L on this filesystem, or making other use of snapshots? fsck -B also uses them, but shouldn't be running except at boot time.

You should take this up with Kirk McKusick [EMAIL PROTECTED] - in the meantime you can work around it by not making use of UFS snapshots.

Kris
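To make the suggested workaround concrete: avoiding snapshots here mainly means dropping the -L flag from the nightly dump (and not running fsck -B or mksnap_ffs). A sketch with a hypothetical output path; note that dumping a live file system without -L can produce an inconsistent dump, which is the trade-off being accepted:

    # snapshot-based dump (what triggers mksnap_ffs):
    dump -0 -L -a -u -f /backup/spool.dump /var/spool

    # snapshot-free workaround -- the same command without -L:
    dump -0 -a -u -f /backup/spool.dump /var/spool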
Re: Recurring problem: processes block accessing UFS file system
On Tue, 22 Nov 2005 00:54:09 +0100, Greg Rivers [EMAIL PROTECTED] wrote: I've recently put up three busy email relay hosts running 6.0-STABLE. Performance is excellent except for a nagging critical issue that keeps cropping up. /var/spool is its own file system mounted on a geom stripe of four BSD partitions (details below). Once every two or three days all the processes accessing /var/spool block forever in disk wait. All three machines suffer this problem. No diagnostic messages are generated and the machines continue running fine otherwise, but a reboot is required to clear the condition. This problem occurs during normal operation, but is particularly likely to occur during a backup when dump makes a snapshot. There doesn't appear to be a problem with gstripe, as gstripe status is UP and I can read the raw device just fine while processes continue to block on the file system. I tried running a kernel with WITNESS and DIAGNOSTIC, but these options shed no light. If I catch the problem early enough I can break successfully into kdb; otherwise, if too many processes stack up, the machine hangs going into kdb and must be power-cycled. I'd appreciate any insight anyone may have into this problem or advise on turning this report into a coherent PR. I have a machine with 5.4-STABLE with the same problem. It hangs every couple of days if I make regular snapshots. It is a remote machine which I don't have easy access to. I disabled the snapshots and since than it didn't hang a single time. I hoped it would be fixed in 6.0, but this sounds the same. Ronald. -- Ronald Klop Amsterdam, The Netherlands ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recurring problem: processes block accessing UFS file system
On Mon, 21 Nov 2005, Kris Kennaway wrote: Looks like a UFS snapshot deadlock. Are you running something like dump -L on this filesystem, or making other use of snapshots? Indeed I am (dump -L), but as I said (not very clearly, sorry), the deadlock also occurs under normal operation when no snapshots are running or have ever been run since boot. It's just much less frequent in this case. Disabling dumps altogether was one of the first things I tried. fsck -B also uses them, but shouldn't be running except at boot time. Right, it isn't even at boot since I have background_fsck disabled. You should take this up with Kirk McKusick [EMAIL PROTECTED] - in the meantime you can work around it by not making use of UFS snapshots. Sounds good. Even though the problem may not be snapshot related, it certainly seems to be UFS related. Thanks for your suggestions and your prompt reply. -- Greg ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Recurring problem: processes block accessing UFS file system
On Mon, Nov 21, 2005 at 09:23:10PM -0600, Greg Rivers wrote:

> On Mon, 21 Nov 2005, Kris Kennaway wrote:
>
>> Looks like a UFS snapshot deadlock. Are you running something like
>> dump -L on this filesystem, or making other use of snapshots?
>
> Indeed I am (dump -L), but as I said (not very clearly, sorry), the
> deadlock also occurs under normal operation when no snapshots are
> running or have ever been run since boot. It's just much less frequent
> in this case. Disabling dumps altogether was one of the first things I
> tried.

It may not be the same problem. You should also try to obtain a trace when snapshots are not implicated.

'show lockedvnods' is very important for diagnosing filesystem deadlocks, and 'alltrace' is the easiest way to obtain stack traces for the resulting processes that are holding locks. You'll want to 'set $lines=0' and capture the output from the serial console to a file.

Kris
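Putting this advice together with the commands used elsewhere in the thread, a typical capture session at the db> prompt (over a serial console with logging enabled) looks roughly like this:

    db> set $lines=0          # disable the pager so nothing is dropped
    db> ps                    # overall picture: look for processes stuck in "ufs"
    db> show lockedvnods      # which vnodes are locked, and by which threads
    db> alltrace              # stack traces of everything (or "trace <pid>" per pid)
    db> show lockedbufs       # used elsewhere in this thread to find stuck bufs
    db> show buffer <addr>    # inspect a specific buf from the lockedbufs output
    db> call doadump          # optionally write a crash dump for kgdb post-mortem
    db> continue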