reiserfs3 bug?

Jason Keltz Wed, 05 Apr 2006 07:39:54 -0700

Our Linux 2.4.32 NFS fileserver exports 4 reiserfs 3.6 filesystems to a whole bunch of hosts. Every 6 to 30 days, NFS suddenly seems to "hang" (i.e. all the hosts get the "nfs server not responding" message). The server is still up (we can ssh to it, etc.). Up to this point, we've thought it was a bug in NFS. We recently installed SGI kdb (kernel debugger) to help with debugging the problem, and we're now wondering whether it is actually reiserfs related. We need to get the output from several crashes in order to do more debugging, although we believe the problem is probably the same each time.

In a normal state, the WCHAN column of "ps" output lists the nfsd and [kreiserfsd] processes as "end". When the system gets into this state, both the nfsd and [kreiserfsd] processes report "down" for WCHAN. (I can imagine that if kreiserfsd and nfsd are both hanging on the same lock, bad things could happen.)
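The WCHAN check described above could be automated so that the hang is noticed as soon as it starts. The following is only a minimal sketch in Python, not something we actually run: it assumes a procps-style "ps" that accepts "-eo pid,wchan:32,comm", and the names find_stuck and snapshot are hypothetical. It also assumes that "down" is the exact string shown in the WCHAN column during the hang, as reported above.

```python
import subprocess

WATCHED = {"nfsd", "kreiserfsd"}  # process names taken from the report
STUCK_WCHAN = "down"              # WCHAN value observed during the hang

def find_stuck(ps_output):
    """Return (pid, comm) pairs for watched processes whose WCHAN is 'down'."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        pid, wchan, comm = fields
        comm = comm.strip()
        if comm in WATCHED and wchan == STUCK_WCHAN:
            stuck.append((int(pid), comm))
    return stuck

def snapshot():
    # Assumption: procps-style ps; the column width of 32 is arbitrary.
    result = subprocess.run(
        ["ps", "-eo", "pid,wchan:32,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    for pid, comm in find_stuck(snapshot()):
        print(f"{comm} (pid {pid}) is sleeping in '{STUCK_WCHAN}'")
```

Run from cron every minute or so, a script like this would at least timestamp the moment the processes first go into "down", which could then be matched against the kdb output.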


When the problem occurs, the kdb backtrace for kreiserfsd yields:

Mar 28 13:04:24 0xf635a000 247 1 0 1 D 0xf635a370 kreiserfsd
Mar 28 13:04:24 ESP        EIP        Function (args)
Mar 28 13:04:24 0xf635bef8 0xc011b144 schedule+0x2b4 (0xc0452e40, 0x0, 0xf635a000, 0xd4a10970, 0xd4a10970)
Mar 28 13:04:24 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Mar 28 13:04:24 0xf635bf40 0xc014680e __wait_on_buffer+0x6e (0xd4a10920, 0x9e0, 0xf635bf90, 0xc2937000, 0xf8ac6000)
Mar 28 13:04:24 kernel .text 0xc0100000 0xc01467a0 0xc0146840
Mar 28 13:04:24 0xf635bf68 0xf8943e79 [reiserfs]flush_commit_list+0x3e9
Mar 28 13:04:24 reiserfs .text 0xf8921060 0xf8943a90 0xf8943f60
Mar 28 13:04:24 0xf635bfa8 0xf894801d [reiserfs]flush_async_commits+0x3d (0xf7576800, 0xdd667cc0, 0xf635bfd8, 0xf635bfdc, 0x20)
Mar 28 13:04:24 reiserfs .text 0xf8921060 0xf8947fe0 0xf8948020
Mar 28 13:04:25 0xf635bfb8 0xf894652b [reiserfs]reiserfs_journal_commit_thread+0x1db
Mar 28 13:04:25 reiserfs .text 0xf8921060 0xf8946350 0xf89465f0
Mar 28 13:04:25 0xf635bff4 0xc010741e arch_kernel_thread+0x2e
Mar 28 13:04:25 kernel .text 0xc0100000 0xc01073f0 0xc0107430

Single-stepping on the processor after the problem occurs reveals that the system is "idle", i.e. not doing anything else with this.

I won't bother including the backtraces of the 256 nfsd processes on our fileserver here, but they probably give a lot more of the story. If you are interested, please see this link for the full details:


http://www.cs.yorku.ca/~jas/fileserver

If anyone has any ideas, or knows anywhere we could insert debugging code in order to help solve this problem, we would *really* appreciate your help!

We recently upgraded from 2.4.26 to 2.4.32 in the hope that the bug would have been fixed, but it didn't make any difference.

PS: A few times, when we issued the "reboot" command, the systems got "unstuck" (systems got "nfs ok") just before the system rebooted... whatever is stuck seems to get unstuck for a moment before the system is rebooted.


Thanks..

Jason Keltz
[EMAIL PROTECTED]
