Hello, I've recently set up an ocfs2 volume on a 4Gb/s SAN, which is mounted directly on two CentOS 5.5 machines (2.6.18-194.32.1.el5). Both servers are exporting the volume via NFSv3 to our HPC cluster. This is to replace a single NFS server exporting an ext3 volume, which was unable to keep up with our IO requirements. I switched over to the new ocfs2 volume on Monday, and it had been performing pretty well overall. This morning, however, I saw significant loads on both NFS servers (load >30, which is not unheard of since we are running 32 NFS threads per machine), and attempting to access the shared volume resulted in a hanging connection.
Logging into the NFS servers showed that the ocfs2 volume could be accessed fine and was responsive, but the load on the machines was clearly coming from nfsd. iostat showed no substantial activity on the ocfs2 volume despite the NFS load. dmesg output on both servers shows a number of hung task warnings:

Mar 25 12:02:13 bss-adm2 kernel: INFO: task nfsd:996 blocked for more than 120 seconds.
Mar 25 12:02:13 bss-adm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 12:02:13 bss-adm2 kernel: nfsd D ffff81041ab1d7e0 0 996 1 1008 1017 (L-TLB)
Mar 25 12:02:13 bss-adm2 kernel: ffff8102b8e05d00 0000000000000046 0000000000000246 ffffffff889678b9
Mar 25 12:02:13 bss-adm2 kernel: ffff8103eaa20000 000000000000000a ffff81041262e040 ffff81041ab1d7e0
Mar 25 12:02:13 bss-adm2 kernel: 00022676fafd0a61 00000000000021b6 ffff81041262e228 00000007de477ba0
Mar 25 12:02:13 bss-adm2 kernel: Call Trace:
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff88979236>] :ocfs2:ocfs2_permission+0x137/0x1a4
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8000d9d8>] permission+0x81/0xc8
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8882c981>] :nfsd:nfsd_lookup_dentry+0x306/0x418
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff887ab4b4>] :sunrpc:ip_map_match+0x19/0x30
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8882cab5>] :nfsd:nfsd_lookup+0x22/0xb0
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff887ab59e>] :sunrpc:ip_map_lookup+0xbc/0xc3
Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8883347d>] :nfsd:nfsd3_proc_lookup+0xc5/0xd2
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888281db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff887a8651>] :sunrpc:svc_process+0x454/0x71b
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff80064644>] __down_read+0x12/0x92
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff88828746>] :nfsd:nfsd+0x1a5/0x2cb
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Although these are obviously nfsd hangs, the fact that they occurred on both servers at the same time makes me suspect something on the ocfs2 side. It was necessary to shut down nfsd and restart the cluster nodes in order for them to resume.

Being new to ocfs2, I'm not sure quite where to look for clues as to what caused this. I'm guessing from the ocfs2_cluster_unlock at the top of the stack trace that this is an o2cb locking issue. The NFS traffic is going over the same (1Gb) network connections as the o2cb heartbeat, so I'm wondering if that may have contributed to the problem. I should be able to add a separate fabric for the o2cb heartbeat if that might be the cause; however, neither of the servers was fenced.

Anyone have any suggestions?

Many thanks,
James

--
Dr. James Abbott
Bioinformatics Software Developer
Imperial College, London