Re: [Jfs-discussion] Strange problem with JFS
Sorry it's taken me so long to look into this, and thanks for all your
help with this bug.

On Sun, 2004-12-19 at 14:10 -0500, Sonny Rao wrote:
> Hmm, okay I had thought it might be caused by a lack of available
> transaction blocks or locks, but that doesn't seem to be the case
> based on the data you provided.

Actually, we do appear to be out of transaction locks (tlocks):

jfsSync       D C0348640     0  1803      1  2174  1802 (L-TLB)
de94bdd8 0046 dee198b0 c0348640 def1d72c 0008 c01f1c3a def1d72c
d808f600 000f424b dbe060f0 12707480 000f424b dee19a58 de94a000 de94bde8
c0118714 de94be08 e097fe87 dee198b0
Call Trace:
 [c01f1c3a] generic_make_request+0x15e/0x1de
 [c0118714] default_wake_function+0x0/0x12
 [e097fe87] txLockAlloc+0xa7/0x160 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0118714] default_wake_function+0x0/0x12
 [e0980bb5] txLock+0x275/0x470 [jfs]
 [e097f5b1] lbmStartIO+0xb1/0xe0 [jfs]
 [e097f46f] lbmWrite+0x10f/0x150 [jfs]
 [e097c99e] __get_metapage+0x5e/0x3d0 [jfs]
 [e097dcbd] lmGCwrite+0xdd/0xf0 [jfs]
 [e096ce50] diWrite+0x190/0x5f0 [jfs]
 [e0981473] txCommit+0x1b3/0x320 [jfs]
 [e098083b] txEnd+0x3b/0x140 [jfs]
 [e098351f] jfs_sync+0x21f/0x2c0 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e0983300] jfs_sync+0x0/0x2c0 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb

This is the thread that tries to free some tlocks when they start
getting low. I think generic_make_request is just stack noise. We
really don't want txLockAlloc to block here: we try very hard not to
exhaust all of the tlocks precisely so that this thread can make some
progress. The real problem, though, is the jfsCommit thread. If it
were not blocked, I suspect that a lot of tlocks would be freed.
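The "try very hard not to exhaust all of the tlocks" behavior is a common reservation pattern: a slice of the pool is held back so the threads that *free* locks can still allocate even when ordinary transactions cannot. A toy sketch of the idea in Python (illustrative only, not JFS code; the class and parameter names are made up):

```python
class TlockPool:
    """Toy allocator: a reserve is kept back for the freeing threads."""

    def __init__(self, total, reserve):
        self.free = total
        self.reserve = reserve

    def alloc(self, privileged=False):
        # Ordinary transactions may not dip into the reserved slice;
        # privileged callers (threads that release locks) may drain it.
        limit = 0 if privileged else self.reserve
        if self.free > limit:
            self.free -= 1
            return True
        return False  # caller must wait for tlocks to be released

pool = TlockPool(total=4, reserve=2)
print([pool.alloc() for _ in range(4)])  # [True, True, False, False]
print(pool.alloc(privileged=True))       # True: the reserve is intact
```

The failure mode in these backtraces is exactly what this pattern is meant to prevent: the privileged consumer (jfsSync) blocking in txLockAlloc instead of making progress.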
jfsCommit     D C0348640     0  1802      1  1803  1801 (L-TLB)
de429ee0 0046 dee18800 c0348640 df7dc000 dec76234 4000 e09357e4
de428000 00270284 dbe060f0 f6e6e500 000f424a dee189a8 dec7607c de429ef4
dee18800 dec76234 c026e3cd deeaa000 1bcc dec76080 dec76080
Call Trace:
 [c026e3cd] rwsem_down_read_failed+0x8f/0x17c
 [e097085d] .text.lock.jfs_imap+0x3ff/0x422 [jfs]
 [e09824de] txUpdateMap+0xae/0x250 [jfs]
 [e09808d7] txEnd+0xd7/0x140 [jfs]
 [e0982d00] txLazyCommit+0x20/0xe0 [jfs]
 [e0982f78] jfs_lazycommit+0x1b8/0x1e0 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e0982dc0] jfs_lazycommit+0x0/0x1e0 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb

This thread must be in diUpdatePMap, trying to acquire ipimap's
rdwrlock: IREAD_LOCK(). The funny thing is that the only place a write
lock is taken on this inode is in diNewIAG, which is only called under
diAlloc, and I don't see diAlloc in any of the stacks. I can't find any
error path that would leave the lock taken, so I can't account for why
this thread is blocked. I'm sure you would have noticed if a thread had
oopsed in this path, leaving the lock held.

> I am able to reproduce the problem on a laptop I have with me as
> well; I'm running a 2.6.8.1 kernel. It looks like all of the
> processes are stuck waiting on a semaphore somewhere in namei
> (.text.lock.namei), and upon reboot I have a bit of filesystem
> corruption.

I'd be interested in any info from jfs_fsck as to the nature of the
filesystem corruption.

> Shaggy will have to look into this some more, thanks for the report.

Thanks for all your help.
Shaggy
--
David Kleikamp
IBM Linux Technology Center
___
Jfs-discussion mailing list
Jfs-discussion@www-124.ibm.com
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion
Re: [Jfs-discussion] Strange problem with JFS
On Mon, 2005-01-10 at 10:59 -0600, Dave Kleikamp wrote:
> This thread must be in diUpdatePMap, trying to acquire ipimap's
> rdwrlock: IREAD_LOCK(). The funny thing is that the only place a
> write lock is taken on this inode is in diNewIAG, which is only
> called under diAlloc, which I don't see in any stacks. I can't find
> any error path that would leave the lock taken, so I can't account
> for why this thread would be blocked. I'm sure you would have noticed
> if a thread oopsed in this path, leaving the lock locked.

I found the deadlock. This thread is the missing link:

fstest        D C0348640     0  4598   4561  4599  4597 (NOTLB)
dbc11d88 0086 dbd57770 c0348640 df1ed83c d9912d50 d8dc7700 000f424a
dbd57918 df7dd244 0286 dbc1 dbd57770 c026d7d7 df7dd24c 0001
dbd57770 c0118714
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [e0970663] .text.lock.jfs_imap+0x205/0x422 [jfs]
 [c0169722] alloc_inode+0xf2/0x19b
 [e097bccb] ialloc+0x5b/0x1b0 [jfs]
 [e096399d] jfs_mkdir+0x5d/0x2f0 [jfs]
 [c0117c03] recalc_task_prio+0x93/0x188
 [c015dce7] permission+0x67/0x79
 [c015eeca] link_path_walk+0xccf/0xdb2
 [c015de3d] cached_lookup+0x23/0x85
 [c015dce7] permission+0x67/0x79
 [c0160379] vfs_mkdir+0xad/0x137
 [c01604ba] sys_mkdir+0xb7/0xf6
 [c01181fd] schedule_tail+0x41/0x4d
 [c0105dfd] sysenter_past_esp+0x52/0x71

The compiler does a lot of inlining, but I do believe this is in
diNewIAG. The thread is doing a down(JFS_IP(ipimap)->commit_sem) while
holding the rdwr_lock (which is a recent change). I will have to see
how hard this is to fix.
--
David Kleikamp
IBM Linux Technology Center
Re: [Jfs-discussion] Strange problem with JFS
On Thu, Dec 23, 2004 at 08:34:12AM -0600, Jon Nelson wrote:
> On Mon, 20 Dec 2004, Sonny Rao wrote:
> > This message is mostly for Shaggy, but here it is anyway (kinda
> > long): OK, I have some suspicious-looking backtraces now where some
> > threads are sleeping in jfs_unlink (which holds the directory
> > i_sem, I think) which hold the i_sem (from sys_mkdir). Well, it's
> > more or less out of my hands now.
>
> I wonder if this relates to my performance problems as well - I
> suspect it does, although a fix for this bug may not improve
> performance.

If you are out of tlocks frequently, then you're in luck. There are
module load-time parameters for the number of txblocks and txlocks;
just up the values and see if performance increases.

Sonny
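For concreteness, the tuning Sonny describes would look something like the fragment below. This is a sketch, not a tested recipe: the parameter names here (nTxBlock, nTxLock) are the ones the JFS transaction manager has used, but they and their accepted ranges depend on the kernel version, so check `modinfo jfs` first; it also assumes jfs is built as a module.

```shell
# /etc/modprobe.conf fragment (assumed parameter names -- verify with
# `modinfo jfs` on your kernel before relying on them):
options jfs nTxBlock=1024 nTxLock=8192
```

After reloading the module (or rebooting), rerun the workload and see whether throughput improves; if it does not, tlock exhaustion was probably not the bottleneck.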
Re: [Jfs-discussion] Strange problem with JFS
On Sat, Dec 18, 2004 at 02:49:38PM -0600, Jon Nelson wrote:
> I should note that if I crack open another terminal and strace the
> find, this is what I get:
>
> open(".", O_RDONLY|O_LARGEFILE)         = 5
> fchdir(5)                               = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> chdir(".")                              = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40252000
> write(1, ".\n", 2)                      = 2
> open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
> fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> fcntl64(6, F_SETFD, FD_CLOEXEC)         = 0
> getdents(6,    <- hangs here

Yeah, there appears to be a deadlock somewhere in the directory
handling code. I find that it's not always reproducible on my laptop,
depending on how much memory is free, other stuff going on, etc. But I
was able to trigger it once by simply running fstest and a find
concurrently, then killing fstest and running ls on the directory where
fstest was running.
Here's what the backtraces look like:

fstest        D C0348640     0  5093      1  5094  5092 (NOTLB)
f07f9f14 0082 f07f7870 c0348640 fff5 f13a992c c015eeca f13a992c
0001 f07f9f70 f242ad10 23063c80 000f42cd f07f7a18 f13a9994 0292
f07f8000 f07f7870 c026d7d7 f13a999c 0001 f07f7870 c0118714
Call Trace:
 [c015eeca] link_path_walk+0xccf/0xdb2
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c0162261] .text.lock.namei+0x109/0x168
 [c01068f1] error_code+0x2d/0x38
 [c01181fd] schedule_tail+0x41/0x4d
 [c0105dfd] sysenter_past_esp+0x52/0x71

fstest        D C0348640     0  5094      1  5095  5093 (NOTLB)
f07fbf14 0082 f07f72e0 c0348640 fff5 f13a992c c015eeca f13a992c
0001 f07fbf70 f242ad10 23063c80 000f42cd f07f7488 f13a9994 0292
f07fa000 f07f72e0 c026d7d7 f13a999c 0001 f07f72e0 c0118714
Call Trace:
 [c015eeca] link_path_walk+0xccf/0xdb2
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c0162261] .text.lock.namei+0x109/0x168
 [c01068f1] error_code+0x2d/0x38
 [c01181fd] schedule_tail+0x41/0x4d
 [c0105dfd] sysenter_past_esp+0x52/0x71

snip

find          D C0348AE8     0  5115   4192 (NOTLB)
f2233f24 0082 f1526330 c0348ae8 000f42ce fe04 0744ab6b 000f42ce
f1526330 000f4240 074f6600 000f42ce f1526ff8 f13a9994 0292 f2232000
f1526e50 c026d7d7 f13a999c 0001 f1526e50 c0118714
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c016393f] .text.lock.readdir+0x5/0x16
 [c0163900] sys_getdents64+0x71/0xab
 [c01637a4] filldir64+0x0/0xeb
 [c0105dfd] sysenter_past_esp+0x52/0x71

ls            D C0348640     0  5117   4191 (NOTLB)
f0581f24 0082 f15268c0 c0348640 f15bf354 f15af400 f14b8140 f14b8160
f747c1b0 000f4240 70291b00 000f42d1 f1526a68 f13a9994 0292 f058
f15268c0 c026d7d7 f13a999c 0001 f15268c0 c0118714
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c016393f] .text.lock.readdir+0x5/0x16
 [c0163900] sys_getdents64+0x71/0xab
 [c01637a4] filldir64+0x0/0xeb
 [c0105dfd] sysenter_past_esp+0x52/0x71

So someone is holding the readdir semaphore here, but I can't tell
which thread; find and ls are blocking on it. The jfsCommit and jfsSync
threads are blocking on something as well, but I haven't been able to
get backtraces for those yet. All of the fstest threads seem to be
stuck in some error path. I'll follow up if I find more.

Sonny
Re: [Jfs-discussion] Strange problem with JFS
On Mon, Dec 20, 2004 at 03:20:58AM -0500, Sonny Rao wrote:
> On Sat, Dec 18, 2004 at 02:49:38PM -0600, Jon Nelson wrote:
> > I should note that if I crack open another terminal and strace the
> > find, this is what I get:
> >
> > open(".", O_RDONLY|O_LARGEFILE)         = 5
> > fchdir(5)                               = 0
> > lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > chdir(".")                              = 0
> > lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
> > mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40252000
> > write(1, ".\n", 2)                      = 2
> > open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
> > fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> > fcntl64(6, F_SETFD, FD_CLOEXEC)         = 0
> > getdents(6,    <- hangs here
>
> Yeah, there appears to be a deadlock somewhere in the directory
> handling code. I find that it's not always reproducible on my laptop,
> depending on how much memory is free, other stuff going on, etc. But
> I was able to trigger it once by simply running fstest and a find
> concurrently, then killing fstest and running ls on the directory
> where fstest was running.

This message is mostly for Shaggy, but here it is anyway (kinda long):
OK, I have some suspicious-looking backtraces now where some threads
are sleeping in jfs_unlink (which holds the directory i_sem, I think)
which hold the i_sem (from sys_mkdir).

Of the 100 fstest threads overall, here are the counts of which
syscall each was running:

  unlink      -  4
  pwrite64    - 24
  close       -  7
  open/create -  3
  mkdir       -  1
  unknown     - 60  (error path, blocked on namei semaphore - probably
                     JFS namei.c?)

Here are some more backtraces. A pdflush thread writing back inodes:

pdflush       D C0348640   059 36158 (L-TLB)
c1717e2c 0046 dfeb0b90 c0348640 c1717e30 dcaf8a44 c1717e20 0010
c1717f34 db127260 e19f81c0 000f424a dfeb0d38 dcaf8800 0282 c1716000
dfeb0b90 c026d7d7 dcaf8808 0001 dfeb0b90 c0118714
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [e096360a] .text.lock.inode+0x5/0x9b [jfs]
 [e0962e4c] jfs_write_inode+0x4c/0x60 [jfs]
 [c0170e43] write_inode+0x46/0x48
 [c017102b] __sync_single_inode+0x1e6/0x202
 [c0171263] sync_sb_inodes+0x169/0x282
 [c0171462] writeback_inodes+0xe6/0xf2
 [c01393c8] wb_kupdate+0x99/0x10e
 [c0139d77] __pdflush+0xcb/0x1c4
 [c0139e70] pdflush+0x0/0x2c
 [c0139e98] pdflush+0x28/0x2c
 [c013932f] wb_kupdate+0x0/0x10e
 [c0139e70] pdflush+0x0/0x2c
 [c012d579] kthread+0xa5/0xab
 [c012d4d4] kthread+0x0/0xab
 [c010425d] kernel_thread_helper+0x5/0xb

The jfs kernel daemon threads:

jfsIO         S C0348640     0  1801      1  1802  1795 (L-TLB)
deeadf90 0046 dfaed770 c0348640 dfb619c0 0050 df2cdd40 df2cdd40
df75d5c0 d809d780 0e3514c0 000f424b dfaed918 deeac000 deeac000
deeadfc0 e000 e097faa5 df2cdd40 dfaed770 dfaed770
Call Trace:
 [e097faa5] jfsIOWait+0x125/0x160 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e097f980] jfsIOWait+0x0/0x160 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb

jfsCommit     D C0348640     0  1802      1  1803  1801 (L-TLB)
de429ee0 0046 dee18800 c0348640 df7dc000 dec76234 4000 e09357e4
de428000 00270284 dbe060f0 f6e6e500 000f424a dee189a8 dec7607c de429ef4
dee18800 dec76234 c026e3cd deeaa000 1bcc dec76080 dec76080
Call Trace:
 [c026e3cd] rwsem_down_read_failed+0x8f/0x17c
 [e097085d] .text.lock.jfs_imap+0x3ff/0x422 [jfs]
 [e09824de] txUpdateMap+0xae/0x250 [jfs]
 [e09808d7] txEnd+0xd7/0x140 [jfs]
 [e0982d00] txLazyCommit+0x20/0xe0 [jfs]
 [e0982f78] jfs_lazycommit+0x1b8/0x1e0 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e0982dc0] jfs_lazycommit+0x0/0x1e0 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb

jfsSync       D C0348640     0  1803      1  2174  1802 (L-TLB)
de94bdd8 0046 dee198b0 c0348640 def1d72c 0008 c01f1c3a def1d72c
d808f600 000f424b dbe060f0 12707480 000f424b dee19a58 de94a000
de94bde8 c0118714 de94be08 e097fe87 dee198b0
Call Trace:
 [c01f1c3a] generic_make_request+0x15e/0x1de
 [c0118714] default_wake_function+0x0/0x12
 [e097fe87] txLockAlloc+0xa7/0x160 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0118714] default_wake_function+0x0/0x12
 [e0980bb5] txLock+0x275/0x470 [jfs]
 [e097f5b1] lbmStartIO+0xb1/0xe0 [jfs]
 [e097f46f] lbmWrite+0x10f/0x150 [jfs]
 [e097c99e] __get_metapage+0x5e/0x3d0 [jfs]
 [e097dcbd]
Re: [Jfs-discussion] Strange problem with JFS
On Sat, Dec 18, 2004 at 02:47:27PM -0600, Jon Nelson wrote:
> On Sat, 18 Dec 2004, Jon Nelson wrote:
> > On Sat, 18 Dec 2004, Sonny Rao wrote:
> > > ...
> >
> > Well, I'm not really sure what I'm looking for, but I did this:
> >
> >   fstest -n 50 -f 1 -s 2048
> >
> > Then, in another terminal, I ran this:
> >
> >   find . | wc -l
> >
> > Boom! The disk light goes out, and I can't do ANYTHING with that
> > filesystem, instantly. I got fstest (I'm pretty sure mine is old)
> > at: http://samba.org/ftp/unpacked/junkcode/
>
> Also, from slabtop at the current moment:
>
>   10696 10695 99%  0.81K  26744  10696K  jfs_ip
>
> In this order: the contents of the files after boot, after cleaning
> the FS (due to the hanging), and then just a few seconds after it's
> hung (I can hang it almost instantly, 100% reproducible).

snip

Hmm, okay I had thought it might be caused by a lack of available
transaction blocks or locks, but that doesn't seem to be the case
based on the data you provided. I am able to reproduce the problem on
a laptop I have with me as well; I'm running a 2.6.8.1 kernel. It
looks like all of the processes are stuck waiting on a semaphore
somewhere in namei (.text.lock.namei), and upon reboot I have a bit of
filesystem corruption. Shaggy will have to look into this some more.
Thanks for the report.

Sonny
Re: [Jfs-discussion] Strange problem with JFS
I should note that if I crack open another terminal and strace the
find, this is what I get:

open(".", O_RDONLY|O_LARGEFILE)         = 5
fchdir(5)                               = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
chdir(".")                              = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40252000
write(1, ".\n", 2)                      = 2
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(6, F_SETFD, FD_CLOEXEC)         = 0
getdents(6,    <- hangs here

--
Jon Nelson [EMAIL PROTECTED]
Re: [Jfs-discussion] Strange problem with JFS
A small addendum: the load on this box is now 54, but it's totally
usable. This is a very low-end laptop with a 1.6GHz Celeron.

--
Jon Nelson [EMAIL PROTECTED]