Our Linux 2.4.32 NFS fileserver exports 4 reiserfs 3.6 filesystems to a large number of hosts. Every 6 to 30 days, NFS suddenly appears to "hang" (i.e. all the hosts get the "nfs server not responding" message). The server itself is still up (we can ssh to it, etc.). Until now, we thought it was a bug in NFS. We recently installed SGI kdb (kernel debugger) to help debug the problem, and we now wonder whether it is actually reiserfs related. We need to get the output from several crashes in order to do more debugging, although we believe the problem is probably the same each time.

In a normal state, the WCHAN column on "ps" output lists the nfsd and [kreiserfsd] processes as "end". When the system gets into this state, both the nfsd and [kreiserfsd] processes report "down" for WCHAN. (I can imagine that if kreiserfsd and nfs are both hanging on the same lock, bad things could happen..)
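
The WCHAN check described above could be automated so the hang is noticed as soon as it starts. The following is only a minimal sketch, assuming a procps-style "ps" that accepts "-eo pid,wchan,comm" and a Python 3 interpreter on the server; the names find_stuck and snapshot are hypothetical helpers, not part of any existing tool.

```python
import subprocess

# Process names taken from the report above; kernel threads may or may not
# be shown in brackets depending on the ps column used.
WATCHED = ("nfsd", "kreiserfsd")


def find_stuck(ps_output, wchan="down"):
    """Return (pid, name) for watched processes whose WCHAN equals `wchan`."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, chan, comm = fields
        name = comm.strip().strip("[]")
        if chan == wchan and name in WATCHED:
            stuck.append((int(pid), name))
    return stuck


def snapshot():
    """Run ps and return its raw output (procps option syntax is assumed)."""
    return subprocess.run(
        ["ps", "-eo", "pid,wchan,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling find_stuck(snapshot()) from cron every minute and logging a non-empty result would give a timestamp for the start of each hang.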

When the problem occurs, the kdb backtrace for kreiserfsd is:

Mar 28 13:04:24 0xf635a000 247 1 0 1 D 0xf635a370 kreiserfsd
Mar 28 13:04:24 ESP        EIP        Function (args)
Mar 28 13:04:24 0xf635bef8 0xc011b144 schedule+0x2b4 (0xc0452e40, 0x0, 0xf635a000, 0xd4a10970, 0xd4a10970)
Mar 28 13:04:24 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Mar 28 13:04:24 0xf635bf40 0xc014680e __wait_on_buffer+0x6e (0xd4a10920, 0x9e0, 0xf635bf90, 0xc2937000, 0xf8ac6000)
Mar 28 13:04:24 kernel .text 0xc0100000 0xc01467a0 0xc0146840
Mar 28 13:04:24 0xf635bf68 0xf8943e79 [reiserfs]flush_commit_list+0x3e9
Mar 28 13:04:24 reiserfs .text 0xf8921060 0xf8943a90 0xf8943f60
Mar 28 13:04:24 0xf635bfa8 0xf894801d [reiserfs]flush_async_commits+0x3d (0xf7576800, 0xdd667cc0, 0xf635bfd8, 0xf635bfdc, 0x20)
Mar 28 13:04:24 reiserfs .text 0xf8921060 0xf8947fe0 0xf8948020
Mar 28 13:04:25 0xf635bfb8 0xf894652b [reiserfs]reiserfs_journal_commit_thread+0x1db
Mar 28 13:04:25 reiserfs .text 0xf8921060 0xf8946350 0xf89465f0
Mar 28 13:04:25 0xf635bff4 0xc010741e arch_kernel_thread+0x2e
Mar 28 13:04:25 kernel .text 0xc0100000 0xc01073f0 0xc0107430
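
Since we want to compare the output from several hangs, it may help to reduce each logged backtrace to its chain of function names. This is a rough sketch only: it assumes the frame lines keep the "ESP EIP function+offset" layout shown above, and call_chain is a hypothetical helper name.

```python
import re

# A frame looks like: <ESP> <EIP> [module]function+0xoffset
# The "[module]" prefix is optional (absent for core kernel functions).
FRAME = re.compile(
    r"\b0x[0-9a-f]+\s+0x[0-9a-f]+\s+((?:\[\w+\])?\w+)\+0x[0-9a-f]+"
)


def call_chain(log_text):
    """Return the function names of a kdb backtrace, innermost frame first."""
    return [m.group(1) for m in FRAME.finditer(log_text)]
```

Two hangs with the same call_chain() result for kreiserfsd would support the guess that the problem is the same each time.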

Single-stepping the processor after the problem occurs reveals that the system is "idle", i.e. it is not doing anything else with this.

I won't bother including the output of the backtrace of the 256 nfs processes on our fileserver here, but they probably give a lot more of the story. If you are interested, please see this link for the full details:

http://www.cs.yorku.ca/~jas/fileserver

If anyone has any ideas, or can suggest where we could insert debugging code in order to help solve this problem, we would *really* appreciate your help!

We recently upgraded from 2.4.26 to 2.4.32 in the hopes that the bug would have been fixed, but it didn't make any difference.

PS: A few times, when we issued the "reboot" command, the client systems got "unstuck" (they reported "nfs ok") just before the server rebooted... whatever is stuck seems to get unstuck for a moment before the system is rebooted.

Thanks..

Jason Keltz
[EMAIL PROTECTED]
