Many thanks Marcos.
Kind regards
Paul Fretter

From: Marcos E. Matsunaga [mailto:[EMAIL PROTECTED]
Sent: 09 October 2007 13:31
To: paul fretter (TOC)
Cc: [email protected]
Subject: Re: [Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes

You may want to try increasing the network timeout. You will have to do it on all nodes. See the FAQ
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
with special attention to #104 and #105.

Regards,
Marcos Eduardo Matsunaga
Oracle USA
Linux Engineering

paul fretter (TOC) wrote:

To clarify: the host "node1" is OCFS2 node 0 in the config file. The log entries are from another system in the cluster.

Kind regards
Paul

-----Original Message-----
From: paul fretter (TOC)
Sent: 09 October 2007 11:41
To: [email protected]
Subject: Access to OCFS2 volume paused when a node crashes

There is a node (node1) on our cluster that for some reason hangs every now and again, and when that happens access to the OCFS2 volume is also paused for the other nodes.

We are running the latest version of OCFS2 and the tools, on RHEL4 (x86_64) with kernel 2.6.9-42. All nodes are connected by Fibre Channel to a common LUN for data sharing.

I guess there may be something I can do with configuring timeouts etc(?), but I thought I'd check with this list first.

Here is the relevant info from /var/log/messages:

Oct 9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it down.
Oct 9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1191925471.993435 now 1191925481.994292 dr 1191925471.993425 adv 1191925471.993436:1191925471.993437 func (98e2d068:507) 1191924562.14841:1191924562.14844)
Oct 9 11:24:41 jic55124 kernel: o2net: no longer connected to node node1 (num 0) at 10.10.10.1:7777
Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418 ERROR: link to 0 went down!
Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995 ERROR: status = -112

[EMAIL PROTECTED] ~]# tail /var/log/messages
Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995 ERROR: status = -107
Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418 ERROR: link to 0 went down!
Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995 ERROR: status = -107
Oct 9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at least one node (0) torecover before lock mastery can begin
Oct 9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119 device (8,80): dlm has evicted node 0
Oct 9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at least one node (0) torecover before lock mastery can begin
Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301 ERROR: node down! 0
Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
Oct 9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167 Recovering node 0 from slot 5 on device (8,80)
Oct 9 11:33:50 jic55124 kernel: kjournald starting. Commit interval 5 seconds

Many thanks
Paul Fretter

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
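For anyone wanting to check how often the network idle timeout fires before changing it, here is a minimal Python sketch that picks the o2net idle-timeout messages out of syslog lines. The regular expression is modelled only on the one message quoted in this thread, so it is an assumption that other OCFS2 versions use the same wording; the names IDLE_RE, find_idle_timeouts and SAMPLE are made up for illustration.

```python
import re

# Pattern modelled on the o2net idle-timeout line quoted in this thread.
# Assumption: other OCFS2 releases may word this message differently.
IDLE_RE = re.compile(
    r"o2net: connection to node (?P<name>\S+) \(num (?P<num>\d+)\) "
    r"at (?P<addr>[\d.]+):(?P<port>\d+) has been idle for "
    r"(?P<idle>\d+(?:\.\d+)?) seconds"
)


def find_idle_timeouts(lines):
    """Return one dict per o2net idle-timeout message found in lines."""
    events = []
    for line in lines:
        match = IDLE_RE.search(line)
        if match:
            events.append({
                "node": match.group("name"),
                "num": int(match.group("num")),
                "peer": f"{match.group('addr')}:{match.group('port')}",
                "idle_seconds": float(match.group("idle")),
            })
    return events


# Two lines taken from the log excerpt in this thread.
SAMPLE = [
    "Oct 9 11:24:41 jic55124 kernel: o2net: connection to node node1 "
    "(num 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, "
    "shutting it down.",
    "Oct 9 11:24:41 jic55124 kernel: o2net: no longer connected to "
    "node node1 (num 0) at 10.10.10.1:7777",
]

if __name__ == "__main__":
    for event in find_idle_timeouts(SAMPLE):
        print(event)
```

To run it against a real log, pass an open file instead of SAMPLE, for example find_idle_timeouts(open("/var/log/messages")). The idle_seconds value reported in each message can then be compared with the timeout configured on the nodes.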
