If the hang you are seeing starts after a node (with a mounted ocfs2 volume) dies, then it is a known issue. This specific recovery bug was introduced in 1.2.7 and fixed in 1.2.8-2. 1.2.8-SLES-r3074 maps to 1.2.8-1; the fixed build should be r3080 or later.
If so, upgrade to the latest SLES10 SP1 kernel. This was detected and fixed a few months ago:
http://oss.oracle.com/pipermail/ocfs2-commits/2008-January/002350.html

But is that the real issue? You don't mention a server going down in your original problem description, only during the test. Does a server go down during regular operation too?

One change I would recommend: your network idle timeout is too low. We have since increased the default for that to 30 secs.
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
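As a rough example (assuming your ocfs2-tools is new enough to expose the configurable network timeouts; the parameter names below come from newer tools and the values are only illustrative), the settings end up in /etc/sysconfig/o2cb along these lines:

  # network idle timeout in ms -- 30 secs is the newer default
  O2CB_IDLE_TIMEOUT_MS=30000
  # keepalive probe delay and reconnect delay, also in ms
  O2CB_KEEPALIVE_DELAY_MS=2000
  O2CB_RECONNECT_DELAY_MS=2000

If your tools support it, the safer route is to stop the cluster on all nodes and rerun '/etc/init.d/o2cb configure', which prompts for these values. Keep the timeouts identical on all nodes.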
Sunil

Sérgio Surkamp wrote:
> Hi all,
>
> We set up an OCFS2 cluster on our storage and exported it using NFS to
> other network servers. It was working fine, but suddenly it locked up
> all NFS clients and only unlocked after rebooting all servers (including
> the OCFS2 servers). It seems that under heavy load, the OCFS2+NFS
> solution is deadlocking.
>
> Setup:
> * 1 Dell Storage AX100
> * 2 Dell servers, running SuSE 10 SP1 x86_64, attached to the storage
>   using fibre channel QLogic HBAs
> * 4 Dell servers, running FreeBSD and accessing the shared storage by NFS
>
> The FreeBSD servers are split into two groups: two of them mount the
> suse #1 nfsd and two mount the suse #2 nfsd, to split the load. The
> network interfaces are connected by a gigabit network with a dedicated
> switch for NFS and OCFS2 (heartbeat/sync messages) traffic.
>
> Without NFS it seems to work fine. We stressed the filesystem using
> 'iozone' many times on both servers at the same time and it worked as
> expected.
>
> During deadlock recovery, we rebooted the slave OCFS2 server (suse01)
> first and checked 'dmesg' on the master:
>
> o2net: connection to node suse01 (num 1) at 192.168.0.1:7777 has been
> idle for 10.0 seconds, shutting it down.
> (0,0):o2net_idle_timer:1434 here are some times that might help debug
> the situation: (tmr 1211375306.9290 now 1211375316.11998 dr
> 1211375306.9272 adv 1211375306.9313:1211375306.9314 func (300d6acb:502)
> 1211374816.37752:1211374816.37756)
> o2net: no longer connected to node suse01 (num 1) at 192.168.0.1:7777
> (15331,4):dlm_get_lock_resource:932
> F59B45831EEA41F384BADE6C4B7A932B:M000000000000000000001ba9d5b7e0: at
> least one node (1) to recover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:932
> F59B45831EEA41F384BADE6C4B7A932B:$RECOVERY: at least one node (1)
> to recover before lock mastery can begin
> (5313,4):dlm_get_lock_resource:966 F59B45831EEA41F384BADE6C4B7A932B:
> recovery map is not empty, but must master $RECOVERY lock now
> (15331,4):ocfs2_replay_journal:1173 Recovering node 1 from slot 1 on
> device (8,17)
> kjournald starting. Commit interval 5 seconds
> o2net: accepted connection from node suse01 (num 1) at 192.168.0.1:7777
> ocfs2_dlm: Node 1 joins domain F59B45831EEA41F384BADE6C4B7A932B
> ocfs2_dlm: Nodes in domain ("F59B45831EEA41F384BADE6C4B7A932B"): 0 1
>
> It seems to me that something is deadlocking in the DLM resource
> manager. I used debugfs.ocfs2 to show me the active locks, and many of
> them have "Blocking Mode" and/or "Requested Mode" marked as "Invalid".
> Can that be one of the problems? Why is there an Invalid Blocking Mode
> for DLM locks? Is it just a pre-allocated empty lock?
>
> System configuration:
> --> o2cb:
> # O2CB_ENABLED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
>
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=ocfs2
>
> # TIMEOUT - 600s
> O2CB_HEARTBEAT_THRESHOLD=301
>
> --> cluster.conf:
> node:
>         ip_port = 7777
>         ip_address = 192.168.0.10
>         number = 0
>         name = suse02
>         cluster = ocfs2
>
> node:
>         ip_port = 7777
>         ip_address = 192.168.0.1
>         number = 1
>         name = suse01
>         cluster = ocfs2
>
> cluster:
>         node_count = 2
>         name = ocfs2
>
> FreeBSD setup:
> * Default NFS client configuration.
> * NFS locking daemon disabled.
> * NFS not soft mounted.
>
> SuSE package versions:
> ocfs2-tools-1.2.3-0.7
> ocfs2console-1.2.3-0.7
> nfs-utils-1.0.7-36.26
> nfsidmap-0.12-16.17
>
> OCFS2 kernel driver version:
> OCFS2 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 Node Manager 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 DLM 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
> OCFS2 DLMFS 1.2.8-SLES-r3074 Fri Jan 4 23:47:26 UTC 2008 (build sles)
>
> Any tip on what is going on?
>
> Thanks for any help.
>
> Regards,
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
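P.S. For the lock dump Sérgio mentions: the per-volume lock state can be printed with debugfs.ocfs2's fs_locks command, for example (the device path below is just an example):

  # echo "fs_locks" | debugfs.ocfs2 /dev/sdX

If I remember right, an "Invalid" blocking or requested mode there usually just means the lock resource is not currently in use or contended, not that something is broken.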
