Re: [Jfs-discussion] Strange problem with JFS

2005-01-10 Thread Dave Kleikamp
Sorry it's taken me so long to look into this, and thanks for all your
help with this bug.

On Sun, 2004-12-19 at 14:10 -0500, Sonny Rao wrote:
 Hmm, okay I had thought it might be caused by a lack of available
 transaction blocks or locks, but that doesn't seem to be the case
 based on the data you provided.

Actually, we do appear to be out of transaction locks (tlocks).

jfsSync   D C0348640 0  1803  1  2174  1802 (L-TLB)
de94bdd8 0046 dee198b0 c0348640 def1d72c 0008 c01f1c3a def1d72c 
   d808f600 000f424b dbe060f0  12707480 000f424b dee19a58
de94a000 
   de94bde8 c0118714 de94be08 e097fe87   
dee198b0 
Call Trace:
 [c01f1c3a] generic_make_request+0x15e/0x1de
 [c0118714] default_wake_function+0x0/0x12
 [e097fe87] txLockAlloc+0xa7/0x160 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0118714] default_wake_function+0x0/0x12
 [e0980bb5] txLock+0x275/0x470 [jfs]
 [e097f5b1] lbmStartIO+0xb1/0xe0 [jfs]
 [e097f46f] lbmWrite+0x10f/0x150 [jfs]
 [e097c99e] __get_metapage+0x5e/0x3d0 [jfs]
 [e097dcbd] lmGCwrite+0xdd/0xf0 [jfs]
 [e096ce50] diWrite+0x190/0x5f0 [jfs]
 [e0981473] txCommit+0x1b3/0x320 [jfs]
 [e098083b] txEnd+0x3b/0x140 [jfs]
 [e098351f] jfs_sync+0x21f/0x2c0 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e0983300] jfs_sync+0x0/0x2c0 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb

This is the thread that is trying to free some tlocks when they start
getting low.  I think generic_make_request is just stack noise.  We
really don't want txLockAlloc to block here.  We try really hard not to
exhaust all of the tlocks so that this thread can make some progress.

I think the real problem though is the jfsCommit thread.  If it were not
blocked, I suspect that there would be a lot of tlocks freed.

jfsCommit D C0348640 0  1802  1  1803  1801 (L-TLB)
de429ee0 0046 dee18800 c0348640 df7dc000 dec76234 4000 e09357e4 
   de428000 00270284 dbe060f0  f6e6e500 000f424a dee189a8
dec7607c 
   de429ef4 dee18800 dec76234 c026e3cd deeaa000 1bcc dec76080
dec76080 
Call Trace:
 [c026e3cd] rwsem_down_read_failed+0x8f/0x17c
 [e097085d] .text.lock.jfs_imap+0x3ff/0x422 [jfs]
 [e09824de] txUpdateMap+0xae/0x250 [jfs]
 [e09808d7] txEnd+0xd7/0x140 [jfs]
 [e0982d00] txLazyCommit+0x20/0xe0 [jfs]
 [e0982f78] jfs_lazycommit+0x1b8/0x1e0 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e0982dc0] jfs_lazycommit+0x0/0x1e0 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb

This thread must be in diUpdatePMap, trying to acquire ipimap's
rdwrlock: IREAD_LOCK().  The funny thing is that the only place a write
lock is taken on this inode is in diNewIAG, which is only called under
diAlloc, which I don't see in any stacks.

I can't find any error path that would leave the lock taken, so I can't
account for why this thread would be blocked.  I'm sure you would have
noticed if a thread oopsed in this path, leaving the lock locked.

 I am able to reproduce the problem on a laptop I have with me as well,
 I'm running a 2.6.8.1 kernel.
 
 It looks like all of the processes are stuck waiting on a semaphore
 somewhere in namei (.text.lock.namei)  and upon reboot i have a bit of
 filesystem corruption.

I'd be interested in any info from jfs_fsck as to the nature of the
filesystem corruption.

 Shaggy will have to look into this some more, thanks for the report.

Thanks for all your help.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

___
Jfs-discussion mailing list
Jfs-discussion@www-124.ibm.com
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion


Re: [Jfs-discussion] Strange problem with JFS

2005-01-10 Thread Dave Kleikamp
On Mon, 2005-01-10 at 10:59 -0600, Dave Kleikamp wrote:
 This thread must be in diUpdatePMap, trying to acquire ipimap's
 rdwrlock: IREAD_LOCK().  The funny thing is that the only place a write
 lock is taken on this inode is in diNewIAG, which is only called under
 diAlloc, which I don't see in any stacks.
 
 I can't find any error path that would leave the lock taken, so I can't
 account for why this thread would be blocked.  I'm sure you would have
 noticed if a thread oopsed in this path, leaving the lock locked.

I found the deadlock.  This thread is the missing link:

fstestD C0348640 0  4598   4561  4599  4597 (NOTLB)
dbc11d88 0086 dbd57770 c0348640     
    df1ed83c d9912d50  d8dc7700 000f424a dbd57918
df7dd244 
   0286 dbc1 dbd57770 c026d7d7 df7dd24c 0001 dbd57770
c0118714 
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [e0970663] .text.lock.jfs_imap+0x205/0x422 [jfs]
 [c0169722] alloc_inode+0xf2/0x19b
 [e097bccb] ialloc+0x5b/0x1b0 [jfs]
 [e096399d] jfs_mkdir+0x5d/0x2f0 [jfs]
 [c0117c03] recalc_task_prio+0x93/0x188
 [c015dce7] permission+0x67/0x79
 [c015eeca] link_path_walk+0xccf/0xdb2
 [c015de3d] cached_lookup+0x23/0x85
 [c015dce7] permission+0x67/0x79
 [c0160379] vfs_mkdir+0xad/0x137
 [c01604ba] sys_mkdir+0xb7/0xf6
 [c01181fd] schedule_tail+0x41/0x4d
 [c0105dfd] sysenter_past_esp+0x52/0x71

The compiler does a lot of inlining, but I do believe this is in
diNewIAG.  The thread is doing a down(JFS_IP(ipimap)-commit_sem) while
holding the rdwr_lock (which is a recent change).  I will have to see
how hard this is to fix.
-- 
David Kleikamp
IBM Linux Technology Center

___
Jfs-discussion mailing list
Jfs-discussion@www-124.ibm.com
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion


Re: [Jfs-discussion] Strange problem with JFS

2005-01-10 Thread Sonny Rao
On Thu, Dec 23, 2004 at 08:34:12AM -0600, Jon Nelson wrote:
 On Mon, 20 Dec 2004, Sonny Rao wrote:
 
 This message is mostly for Shaggy, but here it is anyway (kinda long):
 
 Ok I have some suspicious looking backtraces now where  some threads
 are sleeping in jfs_unlink (which holds the directory i_sem I think)
 which hold the i_sem (from sys_mkdir).
 
 Well, it's more or less out of my hands now. I wonder if this relates to 
 my performance problems as well - I suspect it does, although a fix for 
 this bug may ot improve performance.

If you are out of tlocks frequently, then you're in luck.  There's
module load time parameter for the number of txblocks and txlocks,
just up the values and see if performance increases.  

Sonny
___
Jfs-discussion mailing list
Jfs-discussion@www-124.ibm.com
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion


Re: [Jfs-discussion] Strange problem with JFS

2004-12-20 Thread Sonny Rao
On Sat, Dec 18, 2004 at 02:49:38PM -0600, Jon Nelson wrote:
 
 
 I should note that if I crack open another terminal and strace the find, 
 this is what I get:
 
 
 open(., O_RDONLY|O_LARGEFILE) = 5
 fchdir(5)   = 0
 lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 chdir(.)  = 0
 lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
 0) = 0x40252000
 write(1, .\n, 2.
 )  = 2
 open(., O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
 fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 fcntl64(6, F_SETFD, FD_CLOEXEC) = 0
 getdents(6,
 ^ hangs here.

Yeah, there appears to be a deadlock somewhere in the directory
handling code.  I find that it's not always reproducible on my laptop
depending on how much memory is free, other stuff going on, etc.  But
I was able to trigger it once by simply running fstest and a find
concurrently, then killing fstest and running ls on the directory
where fstest was running. 

Here's what the backtraces look like:

fstestD C0348640 0  5093  1  5094  5092 (NOTLB)
f07f9f14 0082 f07f7870 c0348640 fff5 f13a992c c015eeca f13a992c 
   0001 f07f9f70 f242ad10  23063c80 000f42cd f07f7a18 f13a9994 
   0292 f07f8000 f07f7870 c026d7d7 f13a999c 0001 f07f7870 c0118714 
Call Trace:
 [c015eeca] link_path_walk+0xccf/0xdb2
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c0162261] .text.lock.namei+0x109/0x168
 [c01068f1] error_code+0x2d/0x38
 [c01181fd] schedule_tail+0x41/0x4d
 [c0105dfd] sysenter_past_esp+0x52/0x71
fstestD C0348640 0  5094  1  5095  5093 (NOTLB)
f07fbf14 0082 f07f72e0 c0348640 fff5 f13a992c c015eeca f13a992c 
   0001 f07fbf70 f242ad10  23063c80 000f42cd f07f7488 f13a9994 
   0292 f07fa000 f07f72e0 c026d7d7 f13a999c 0001 f07f72e0 c0118714 
Call Trace:
 [c015eeca] link_path_walk+0xccf/0xdb2
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c0162261] .text.lock.namei+0x109/0x168
 [c01068f1] error_code+0x2d/0x38
 [c01181fd] schedule_tail+0x41/0x4d
 [c0105dfd] sysenter_past_esp+0x52/0x71

snip
find  D C0348AE8 0  5115   4192 (NOTLB)
f2233f24 0082 f1526330 c0348ae8 000f42ce fe04   
   0744ab6b 000f42ce f1526330 000f4240 074f6600 000f42ce f1526ff8 f13a9994 
   0292 f2232000 f1526e50 c026d7d7 f13a999c 0001 f1526e50 c0118714 
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c016393f] .text.lock.readdir+0x5/0x16
 [c0163900] sys_getdents64+0x71/0xab
 [c01637a4] filldir64+0x0/0xeb
 [c0105dfd] sysenter_past_esp+0x52/0x71
lsD C0348640 0  5117   4191 (NOTLB)
f0581f24 0082 f15268c0 c0348640  f15bf354 f15af400  
   f14b8140 f14b8160 f747c1b0 000f4240 70291b00 000f42d1 f1526a68 f13a9994 
   0292 f058 f15268c0 c026d7d7 f13a999c 0001 f15268c0 c0118714 
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [c016393f] .text.lock.readdir+0x5/0x16
 [c0163900] sys_getdents64+0x71/0xab
 [c01637a4] filldir64+0x0/0xeb
 [c0105dfd] sysenter_past_esp+0x52/0x71


So someone is holding the readdir semaphore here, but I can't tell
which thread.  find and ls are blocking on that.  Also the jfsCommit
and jfsSync threads are also blocking on something, but I haven't been
able to get the backtraces for those yet.  

All of the fstest threads seem to be stuck in some
error path.

I'll follow up if I find more.

Sonny
___
Jfs-discussion mailing list
[EMAIL PROTECTED]
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion


Re: [Jfs-discussion] Strange problem with JFS

2004-12-20 Thread Sonny Rao
On Mon, Dec 20, 2004 at 03:20:58AM -0500, Sonny Rao wrote:
 On Sat, Dec 18, 2004 at 02:49:38PM -0600, Jon Nelson wrote:
  
  
  I should note that if I crack open another terminal and strace the find, 
  this is what I get:
  
  
  open(., O_RDONLY|O_LARGEFILE) = 5
  fchdir(5)   = 0
  lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  chdir(.)  = 0
  lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
  0) = 0x40252000
  write(1, .\n, 2.
  )  = 2
  open(., O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
  fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
  fcntl64(6, F_SETFD, FD_CLOEXEC) = 0
  getdents(6,
  ^ hangs here.
 
 Yeah, there appears to be a deadlock somewhere in the directory
 handling code.  I find that it's not always reproducible on my laptop
 depending on how much memory is free, other stuff going on, etc.  But
 I was able to trigger it once by simply running fstest and a find
 concurrently, then killing fstest and running ls on the directory
 where fstest was running. 

This message is mostly for Shaggy, but here it is anyway (kinda long):

Ok I have some suspicious looking backtraces now where  some threads
are sleeping in jfs_unlink (which holds the directory i_sem I think)
which hold the i_sem (from sys_mkdir). 

Overall of the 100 fstest threads, here's numbers for what syscall 
each was running:

unlink  - 4
pwrite64- 24
close   - 7
open/create - 3
mkdir   - 1

unknown  - 60 (error path, blocked on namei semaphore, probably JFS
namei.c ? )

Here's some more backtraces:

pdflusher writing back inodes:

pdflush   D C0348640 059  36158 (L-TLB)
c1717e2c 0046 dfeb0b90 c0348640 c1717e30 dcaf8a44 c1717e20  
   0010 c1717f34 db127260  e19f81c0 000f424a dfeb0d38 dcaf8800 
   0282 c1716000 dfeb0b90 c026d7d7 dcaf8808 0001 dfeb0b90 c0118714 
Call Trace:
 [c026d7d7] __down+0x8b/0xfd
 [c0118714] default_wake_function+0x0/0x12
 [c026d984] __down_failed+0x8/0xc
 [e096360a] .text.lock.inode+0x5/0x9b [jfs]
 [e0962e4c] jfs_write_inode+0x4c/0x60 [jfs]
 [c0170e43] write_inode+0x46/0x48
 [c017102b] __sync_single_inode+0x1e6/0x202
 [c0171263] sync_sb_inodes+0x169/0x282
 [c0171462] writeback_inodes+0xe6/0xf2
 [c01393c8] wb_kupdate+0x99/0x10e
 [c0139d77] __pdflush+0xcb/0x1c4
 [c0139e70] pdflush+0x0/0x2c
 [c0139e98] pdflush+0x28/0x2c
 [c013932f] wb_kupdate+0x0/0x10e
 [c0139e70] pdflush+0x0/0x2c
 [c012d579] kthread+0xa5/0xab
 [c012d4d4] kthread+0x0/0xab
 [c010425d] kernel_thread_helper+0x5/0xb

jfs kernel daemon threads:

jfsIO S C0348640 0  1801  1  1802  1795 (L-TLB)
deeadf90 0046 dfaed770 c0348640 dfb619c0 0050  df2cdd40 
   df2cdd40 df75d5c0 d809d780  0e3514c0 000f424b dfaed918 deeac000 
   deeac000 deeadfc0 e000 e097faa5 df2cdd40 dfaed770  dfaed770 
Call Trace:
 [e097faa5] jfsIOWait+0x125/0x160 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e097f980] jfsIOWait+0x0/0x160 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb
jfsCommit D C0348640 0  1802  1  1803  1801 (L-TLB)
de429ee0 0046 dee18800 c0348640 df7dc000 dec76234 4000 e09357e4 
   de428000 00270284 dbe060f0  f6e6e500 000f424a dee189a8 dec7607c 
   de429ef4 dee18800 dec76234 c026e3cd deeaa000 1bcc dec76080 dec76080 
Call Trace:
 [c026e3cd] rwsem_down_read_failed+0x8f/0x17c
 [e097085d] .text.lock.jfs_imap+0x3ff/0x422 [jfs]
 [e09824de] txUpdateMap+0xae/0x250 [jfs]
 [e09808d7] txEnd+0xd7/0x140 [jfs]
 [e0982d00] txLazyCommit+0x20/0xe0 [jfs]
 [e0982f78] jfs_lazycommit+0x1b8/0x1e0 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0105d26] ret_from_fork+0x6/0x14
 [c0118714] default_wake_function+0x0/0x12
 [e0982dc0] jfs_lazycommit+0x0/0x1e0 [jfs]
 [c010425d] kernel_thread_helper+0x5/0xb
jfsSync   D C0348640 0  1803  1  2174  1802 (L-TLB)
de94bdd8 0046 dee198b0 c0348640 def1d72c 0008 c01f1c3a def1d72c 
   d808f600 000f424b dbe060f0  12707480 000f424b dee19a58 de94a000 
   de94bde8 c0118714 de94be08 e097fe87    dee198b0 
Call Trace:
 [c01f1c3a] generic_make_request+0x15e/0x1de
 [c0118714] default_wake_function+0x0/0x12
 [e097fe87] txLockAlloc+0xa7/0x160 [jfs]
 [c0118714] default_wake_function+0x0/0x12
 [c0118714] default_wake_function+0x0/0x12
 [e0980bb5] txLock+0x275/0x470 [jfs]
 [e097f5b1] lbmStartIO+0xb1/0xe0 [jfs]
 [e097f46f] lbmWrite+0x10f/0x150 [jfs]
 [e097c99e] __get_metapage+0x5e/0x3d0 [jfs]
 [e097dcbd] 

Re: [Jfs-discussion] Strange problem with JFS

2004-12-19 Thread Sonny Rao
On Sat, Dec 18, 2004 at 02:47:27PM -0600, Jon Nelson wrote:
 
 
 On Sat, 18 Dec 2004, Jon Nelson wrote:
 
 On Sat, 18 Dec 2004, Sonny Rao wrote:
 ...
 Well, I'm not really sure what I'm looking for, but I did this:
 
 fstest -n 50 -f 1 -s 2048
 
 Then, in another terminal, I ran this
 
 find . | wc -l
 
 Boom!  disk light goes out, can't do ANYTHING with that filesystem,
 instantly.
 
 I got fstest (I'm pretty sure mine is old) at:
 
 http://samba.org/ftp/unpacked/junkcode/
 
 Also, from slabtop at the current moment:
 
  10696  10695  99%0.81K   26744 10696K jfs_ip
 
 In this order: the contents of the files after boot, after cleaning the 
 FS (due to the hanging), and then just a few seconds after it's hung (I 
 can hang it almost instantly, 100% reproduceable).
snip

Hmm, okay I had thought it might be caused by a lack of available
transaction blocks or locks, but that doesn't seem to be the case
based on the data you provided.  

I am able to reproduce the problem on a laptop I have with me as well,
I'm running a 2.6.8.1 kernel.

It looks like all of the processes are stuck waiting on a semaphore
somewhere in namei (.text.lock.namei)  and upon reboot i have a bit of
filesystem corruption.

Shaggy will have to look into this some more, thanks for the report.

Sonny
___
Jfs-discussion mailing list
[EMAIL PROTECTED]
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion


Re: [Jfs-discussion] Strange problem with JFS

2004-12-18 Thread Jon Nelson

I should note that if I crack open another terminal and strace the find, 
this is what I get:

open(., O_RDONLY|O_LARGEFILE) = 5
fchdir(5)   = 0
lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
chdir(.)  = 0
lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(., {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 5), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0) = 0x40252000
write(1, .\n, 2.
)  = 2
open(., O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 6
fstat64(6, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(6, F_SETFD, FD_CLOEXEC) = 0
getdents(6,
^ hangs here.

--
Jon Nelson [EMAIL PROTECTED]
___
Jfs-discussion mailing list
[EMAIL PROTECTED]
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion


Re: [Jfs-discussion] Strange problem with JFS

2004-12-17 Thread Jon Nelson
A small addendum - the load on this box is now 54, but it's totally 
usable.  This is a very low end laptop with a celeron 1.6GHz.

--
Jon Nelson [EMAIL PROTECTED]
___
Jfs-discussion mailing list
[EMAIL PROTECTED]
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/jfs-discussion