Are you mounting with nordirplus? For more details, refer to this email: http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
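In case it helps, a minimal sketch of what that looks like on an NFS client (nordirplus disables READDIRPLUS for an NFSv3 mount; the server name, export path, and mount point below are placeholders, not your setup):

    # Mount the export over NFSv3 with READDIRPLUS disabled:
    mount -t nfs -o vers=3,nordirplus server:/export/ocfs2vol /mnt/ocfs2vol

    # Or the equivalent /etc/fstab entry:
    server:/export/ocfs2vol  /mnt/ocfs2vol  nfs  vers=3,nordirplus  0 0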
On 03/25/2011 08:49 AM, James Abbott wrote:
> Hello,
>
> I've recently set up an ocfs2 volume via a 4Gb/s SAN which is directly
> mounted on two CentOS 5.5 machines (2.6.18-194.32.1.el5). Both servers
> are exporting the volume via NFSv3 to our HPC cluster. This is to
> replace a single NFS server exporting an ext3 volume which was unable
> to keep up with our IO requirements. I switched over to using the new
> ocfs2 volume on Monday, and it had been performing pretty well overall.
> This morning, however, I saw significant loads appearing on both the
> NFS servers (load > 30, which is not unheard of since we are running
> 32 NFS threads per machine); however, attempting to access the shared
> volume resulted in a hanging connection.
>
> Logging into the NFS servers showed that the ocfs2 volume could be
> accessed fine and was responsive; however, the load on the machines
> was clearly coming from nfsd. iostat showed there was no substantial
> activity on the ocfs2 volume despite the NFS load. dmesg output on
> both servers shows a number of hung task warnings:
>
> Mar 25 12:02:13 bss-adm2 kernel: INFO: task nfsd:996 blocked for more than 120 seconds.
> Mar 25 12:02:13 bss-adm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 25 12:02:13 bss-adm2 kernel: nfsd D ffff81041ab1d7e0 0 996 1 1008 1017 (L-TLB)
> Mar 25 12:02:13 bss-adm2 kernel: ffff8102b8e05d00 0000000000000046 0000000000000246 ffffffff889678b9
> Mar 25 12:02:13 bss-adm2 kernel: ffff8103eaa20000 000000000000000a ffff81041262e040 ffff81041ab1d7e0
> Mar 25 12:02:13 bss-adm2 kernel: 00022676fafd0a61 00000000000021b6 ffff81041262e228 00000007de477ba0
> Mar 25 12:02:13 bss-adm2 kernel: Call Trace:
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff88979236>] :ocfs2:ocfs2_permission+0x137/0x1a4
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8000d9d8>] permission+0x81/0xc8
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8882c981>] :nfsd:nfsd_lookup_dentry+0x306/0x418
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff887ab4b4>] :sunrpc:ip_map_match+0x19/0x30
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8882cab5>] :nfsd:nfsd_lookup+0x22/0xb0
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff887ab59e>] :sunrpc:ip_map_lookup+0xbc/0xc3
> Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8883347d>] :nfsd:nfsd3_proc_lookup+0xc5/0xd2
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888281db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff887a8651>] :sunrpc:svc_process+0x454/0x71b
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff80064644>] __down_read+0x12/0x92
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff88828746>] :nfsd:nfsd+0x1a5/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> Although these are obviously nfsd hangs, the fact that they occurred
> on both servers at the same time makes me suspect something on the
> ocfs2 side.
> It was necessary to shut down nfsd and restart the cluster nodes in
> order for them to resume.
>
> Being new to ocfs2, I'm not quite sure where to look for clues as to
> what caused this. I'm guessing from the ocfs2_cluster_unlock at the
> top of the stack trace that this is an o2cb locking issue. The NFS
> traffic is going over the same (1Gb) network connections as the o2cb
> heartbeat, so I'm wondering if that may have contributed to the
> problem. I should be able to add a separate fabric for the o2cb
> heartbeat if that might be the cause; however, neither of the servers
> was fenced.
>
> Anyone have any suggestions?
>
> Many thanks,
> James
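If this happens again, one way to collect more data before rebooting is to dump the cluster lock state on each node and compare; busy lock resources show what nfsd is blocked on. A sketch, assuming your ocfs2 device is /dev/sdX (substitute your actual device):

    # List lock resources and their states on this node:
    debugfs.ocfs2 -R "fs_locks" /dev/sdX

    # Capture full task stacks on both nodes to dmesg (if sysrq is enabled):
    echo t > /proc/sysrq-trigger

Seeing the same busy lockres from both nodes would help confirm (or rule out) a dlm-level stall rather than a plain nfsd problem.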
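On the heartbeat question: strictly speaking, the o2cb disk heartbeat goes over the SAN; what shares your 1Gb NICs with the NFS traffic is the o2net/dlm messaging, which follows the ip_address entries in /etc/ocfs2/cluster.conf. Moving that to a dedicated fabric is a config change plus an o2cb restart on both nodes (with the volume unmounted). A sketch with made-up node names and interconnect addresses; note the file format requires each field to be tab-indented:

    node:
            ip_port = 7777
            ip_address = 10.10.10.1
            number = 0
            name = nfs1
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.10.10.2
            number = 1
            name = nfs2
            cluster = ocfs2

    cluster:
            node_count = 2
            name = ocfs2

Given that dlm messages were presumably competing with 32 threads' worth of NFS traffic on the same links, a separate interconnect seems worth trying even though neither node was fenced.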