Hi Jon,

From the information you provided, I can see that your ocfs2 cluster has three nodes configured (nodes 0, 1 and 2), and that the network link to node 1 (hostname server1) went down, which is what produced the -ENOTCONN (-107) and -EHOSTDOWN (-112) errors when sending dlm messages.
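For reference, the node names, numbers and addresses in your logs suggest an /etc/ocfs2/cluster.conf roughly like the sketch below. This is only a reconstruction: the cluster name "ocfs2" is a placeholder, and server1's address is inferred from the "Accepted connection from node server1 (num 1) at 10.0.0.80:7777" message, so please compare it against your actual file on all three nodes (they must be identical):

    cluster:
            node_count = 3
            name = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.0.0.11
            number = 0
            name = server2
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.0.0.80
            number = 1
            name = server1
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.0.0.30
            number = 2
            name = server3
            cluster = ocfs2

On the error numbers themselves: -107 and -112 are simply the Linux errno values ENOTCONN and EHOSTDOWN. The dlm message-sending path returns them when the o2net socket to the target node is gone, so the dlm errors you see on server2 are a symptom of the lost connection to node 1 rather than a separate dlm problem.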
On 2015/10/29 20:39, Jonathan Ramsay wrote:
> Previous history: This cluster has been stable etc. for months.
>
> On two occasions this week, we have experienced one server lock up resulting in
> many D state processes - primarily nfs - requiring a reboot.
>
> 26337 D lookup_slow smbd
> 26372 D lookup_slow smbd
> 26381 D dlm_wait_for_lock_mastery smbd
> 26406 D iterate_dir smbd
> 26417 D iterate_dir bash
> 26530 D iterate_dir smbd
> 26557 D iterate_dir smbd
> 26761 D iterate_dir smbd
>
> Oct 26 12:05:01 server1 CRON[60021]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
> Oct 26 12:12:27 server1 kernel: [1633038.930488] o2net: Connection to node server2 (num 0) at 10.0.0.11:7777 has been idle for 30.5 secs, shutting it down.
> Oct 26 12:12:27 server1 kernel: [1633038.930579] o2net: No longer connected to node server2 (num 0) at 10.0.0.11:7777
> Oct 26 12:12:27 server1 kernel: [1633039.018762] o2net: Connection to node server3 (num 2) at 10.0.0.30:7777 shutdown, state 8
> Oct 26 12:12:27 server1 kernel: [1633039.018846] o2net: No longer connected to node server3 (num 2) at 10.0.0.30:7777
> Oct 26 12:12:27 server1 kernel: [1633039.018987] (kworker/u192:2,59600,0):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32 instead of 24
> Oct 26 12:12:32 server1 kernel: [1633043.622980] o2net: Accepted connection from node server3 (num 2) at 10.0.0.30:7777
> Oct 26 12:12:32 server1 kernel: [1633043.623052] o2net: Connected to node server2 (num 0) at 10.0.0.11:7777
>
> Oct 28 12:01:35 server2 kernel: [158857.315968] (kworker/u64:0,25700,8):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x276c0b15) to node 1
> Oct 28 12:01:35 server2 kernel: [158857.317060] o2dlm: Waiting on the death of node 1 in domain 27C73AA97B7F4D9398D884DB8DA467DB
> Oct 28 12:01:40 server2 kernel: [158862.288279] (nfsd,7107,7):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0xbd7d045e) to node 1
> Oct 28 12:01:40 server2 kernel: [158862.289730] o2dlm: Waiting on the death of node 1 in domain 71DEDAB8A10A4F849C60F454E41418FB
> Oct 28 12:01:40 server2 kernel: [158862.420276] (kworker/u64:0,25700,10):dlm_send_remote_convert_request:392 ERROR: Error -107 when sending message 504 (key 0x276c0b15) to node 1
> Oct 28 12:01:40 server2 kernel: [158862.421855] o2dlm: Waiting on the death of node 1 in domain 27C73AA97B7F4D9398D884DB8DA467DB
> Oct 28 12:01:45 server2 kernel: [158866.795872] (smbd,6126,8):dlm_do_master_request:1344 ERROR: link to 1 went down!
> Oct 28 12:01:45 server2 kernel: [158866.796643] (smbd,6126,8):dlm_get_lock_resource:929 ERROR: status = -107
> Oct 28 12:01:45 server2 kernel: [158867.392484] (nfsd,7107,7):dlm_send_remote_convert_request:392 ERROR: Er
>
> Oct 26 12:12:46 server2 kernel: [441237.624180] (smbd,9618,7):dlm_do_master_request:1344 ERROR: link to 1 went down!
> Oct 26 12:12:46 server2 kernel: [441237.624181] (smbd,9618,7):dlm_get_lock_resource:929 ERROR: status = -107
> Oct 26 12:12:46 server2 kernel: [441237.624832] (smbd,9618,7):dlm_do_master_request:1344 ERROR: link to 1 went down!
> Oct 26 12:12:46 server2 kernel: [441237.624833] (smbd,9618,7):dlm_get_lock_resource:929 ERROR: status = -107
> Oct 26 12:12:46 server2 kernel: [441237.914234] (smbd,16974,11):dlm_get_lock_resource:929 ERROR: status = -112
> Oct 26 12:12:50 server2 kernel: [441242.308190] o2net: Accepted connection from node server1 (num 1) at 10.0.0.80:7777
> Oct 26 12:15:01 server2 CRON[17178]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>
> Oct 28 12:01:04 server1 kernel: [1805223.124152] o2net: Connection to node server2 (num 0) at 10.0.0.11:7777 shutdown, state 8
> Oct 28 12:01:04 server1 kernel: [1805223.124252] o2net: No longer connected to node server2 (num 0) at 10.0.0.11:7777
> Oct 28 12:01:04 server1 kernel: [1805223.124285] (python,7983,0):dlm_do_master_request:1344 ERROR: link to 0 went down!
> Oct 28 12:01:04 server1 kernel: [1805223.124563] (kworker/u192:1,9377,4):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32 instead of 24
> Oct 28 12:01:04 server1 kernel: [1805223.145675] (python,7983,0):dlm_get_lock_resource:929 ERROR: status = -112
> Oct 28 12:01:04 server1 kernel: [1805223.237950] o2net: Connection to node server3 (num 2) at 10.0.0.30:7777 shutdown, state 8
> Oct 28 12:01:04 server1 kernel: [1805223.237993] o2net: No longer connected to node server3 (num 2) at 10.0.0.30:7777
> Oct 28 12:01:04 server1 kernel: [1805223.238067] (python,7983,2):dlm_do_master_request:1344 ERROR: link to 2 went down!
> Oct 28 12:01:04 server1 kernel: [1805223.238078] (kworker/u192:1,9377,12):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32 instead of 24
> Oct 28 12:01:04 server1 kernel: [1805223.238118] (dlm_thread,4881,6):dlm_send_proxy_ast_msg:482 ERROR: 71DEDAB8A10A4F849C60F454E41418FB: res P000000000000000000000000000000, error -112 send AST to node 2
> Oct 28 12:01:04 server1 kernel: [1805223.238121] (dlm_thread,4881,6):dlm_flush_asts:596 ERROR: status = -112
>
> I am investigating a perc controller as we had a battery fail message on server1 during reboot, but wanted some more input on these errors as well.
>
> Thank you for your time,
>
> Jon

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users