Re: [Ocfs2-users] Checking on the nature of these messages / errors

2015-10-29 Thread Joseph Qi
Hi Jon,
From the information you provided, I can see:
Your ocfs2 cluster has three nodes configured: nodes 0, 1 and 2.
The network link of node 1 (hostname server1) went down, which then
resulted in the -ENOTCONN (-107) and -EHOSTDOWN (-112) errors when
sending dlm messages.

On 2015/10/29 20:39, Jonathan Ramsay wrote:
> Previous history: this cluster has been stable for months.
> 
> On two occasions this week, we have experienced one server locking up,
> resulting in many D-state processes, primarily nfs, requiring a reboot.
> 
> 26337 D lookup_slow                smbd
> 26372 D lookup_slow                smbd
> 26381 D dlm_wait_for_lock_mastery  smbd
> 26406 D iterate_dir                smbd
> 26417 D iterate_dir                bash
> 26530 D iterate_dir                smbd
> 26557 D iterate_dir                smbd
> 26761 D iterate_dir                smbd
> 
> 
> Oct 26 12:05:01 server1 CRON[60021]: (root) CMD (command -v debian-sa1 > 
> /dev/null && debian-sa1 1 1)
> Oct 26 12:12:27 server1 kernel: [1633038.930488] o2net: Connection to node 
> server2  (num 0) at 10.0.0.11:  has been idle for 
> 30.5 secs, shutting it down.
> Oct 26 12:12:27 server1 kernel: [1633038.930579] o2net: No longer connected 
> to node server2  (num 0) at 10.0.0.11: 
> Oct 26 12:12:27 server1 kernel: [1633039.018762] o2net: Connection to node 
> server3  (num 2) at 10.0.0.30:  shutdown, state 8
> Oct 26 12:12:27 server1 kernel: [1633039.018846] o2net: No longer connected 
> to node server3  (num 2) at 10.0.0.30: 
> Oct 26 12:12:27 server1 kernel: [1633039.018987] 
> (kworker/u192:2,59600,0):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32 
> instead of 24
> Oct 26 12:12:32 server1 kernel: [1633043.622980] o2net: Accepted connection 
> from node server3  (num 2) at 10.0.0.30: 
> Oct 26 12:12:32 server1 kernel: [1633043.623052] o2net: Connected to node 
> server2  (num 0) at 10.0.0.11: 
> 
> 
> Oct 28 12:01:35 server2  kernel: [158857.315968] 
> (kworker/u64:0,25700,8):dlm_send_remote_convert_request:392 ERROR: Error -107 
> when sending message 504 (key 0x276c0b15) to node 1
> Oct 28 12:01:35 server2  kernel: [158857.317060] o2dlm: Waiting on the death 
> of node 1 in domain 27C73AA97B7F4D9398D884DB8DA467DB
> Oct 28 12:01:40 server2  kernel: [158862.288279] 
> (nfsd,7107,7):dlm_send_remote_convert_request:392 ERROR: Error -107 when 
> sending message 504 (key 0xbd7d045e) to node 1
> Oct 28 12:01:40 server2  kernel: [158862.289730] o2dlm: Waiting on the death 
> of node 1 in domain 71DEDAB8A10A4F849C60F454E41418FB
> Oct 28 12:01:40 server2  kernel: [158862.420276] 
> (kworker/u64:0,25700,10):dlm_send_remote_convert_request:392 ERROR: Error 
> -107 when sending message 504 (key 0x276c0b15) to node 1
> Oct 28 12:01:40 server2  kernel: [158862.421855] o2dlm: Waiting on the death 
> of node 1 in domain 27C73AA97B7F4D9398D884DB8DA467DB
> Oct 28 12:01:45 server2  kernel: [158866.795872] 
> (smbd,6126,8):dlm_do_master_request:1344 ERROR: link to 1 went down!
> Oct 28 12:01:45 server2  kernel: [158866.796643] 
> (smbd,6126,8):dlm_get_lock_resource:929 ERROR: status = -107
> Oct 28 12:01:45 server2  kernel: [158867.392484] 
> (nfsd,7107,7):dlm_send_remote_convert_request:392 ERROR: Er
> 
> 
> Oct 26 12:12:46 server2  kernel: [441237.624180] 
> (smbd,9618,7):dlm_do_master_request:1344 ERROR: link to 1 went down!
> Oct 26 12:12:46 server2  kernel: [441237.624181] 
> (smbd,9618,7):dlm_get_lock_resource:929 ERROR: status = -107
> Oct 26 12:12:46 server2  kernel: [441237.624832] 
> (smbd,9618,7):dlm_do_master_request:1344 ERROR: link to 1 went down!
> Oct 26 12:12:46 server2  kernel: [441237.624833] 
> (smbd,9618,7):dlm_get_lock_resource:929 ERROR: status = -107
> Oct 26 12:12:46 server2  kernel: [441237.914234] 
> (smbd,16974,11):dlm_get_lock_resource:929 ERROR: status = -112
> Oct 26 12:12:50 server2  kernel: [441242.308190] o2net: Accepted connection 
> from node server1  (num 1) at 10.0.0.80: 
> Oct 26 12:15:01 server2  CRON[17178]: (root) CMD (command -v debian-sa1 > 
> /dev/null && debian-sa1 1 1)
> 
> Oct 28 12:01:04 server1  kernel: [1805223.124152] o2net: Connection to node 
> server2  (num 0) at 10.0.0.11:  shutdown, state 8
> Oct 28 12:01:04 server1  kernel: [1805223.124252] o2net: No longer connected 
> to node server2  (num 0) at 10.0.0.11: 
> Oct 28 12:01:04 server1  kernel: [1805223.124285] 
> (python,7983,0):dlm_do_master_request:1344 ERROR: link to 0 went down!
> Oct 28 12:01:04 server1  kernel: [1805223.124563] 
> (kworker/u192:1,9377,4):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32 
> instead of 24
> Oct 28 12:01:04 server1  kernel: 

[Ocfs2-users] Checking on the nature of these messages / errors

2015-10-29 Thread Jonathan Ramsay
Previous history: this cluster has been stable for months.

On two occasions this week, we have experienced one server locking up,
resulting in many D-state processes, primarily nfs, requiring a reboot.

26337 D lookup_slow                smbd
26372 D lookup_slow                smbd
26381 D dlm_wait_for_lock_mastery  smbd
26406 D iterate_dir                smbd
26417 D iterate_dir                bash
26530 D iterate_dir                smbd
26557 D iterate_dir                smbd
26761 D iterate_dir                smbd
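For reference, one way to gather a PID / state / wchan / command listing
like the one above is to walk /proc; this is only a rough sketch (not
necessarily how the listing above was produced) and it needs enough
privilege to read other users' /proc entries:

    import os

    # List tasks stuck in uninterruptible sleep ("D") together with the
    # kernel function they are sleeping in (/proc/<pid>/wchan).
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            # comm is wrapped in parentheses and may contain spaces,
            # so split on the last closing paren rather than whitespace.
            comm = stat[stat.index("(") + 1:stat.rindex(")")]
            state = stat[stat.rindex(")") + 2:].split()[0]
            if state != "D":
                continue
            with open(f"/proc/{pid}/wchan") as f:
                wchan = f.read().strip() or "-"
            print(f"{pid:>6} D {wchan:<28} {comm}")
        except OSError:
            continue  # the task exited while we were reading it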


Oct 26 12:05:01 server1 CRON[60021]: (root) CMD (command -v debian-sa1 >
/dev/null && debian-sa1 1 1)
Oct 26 12:12:27 server1 kernel: [1633038.930488] o2net: Connection to node
server2  (num 0) at 10.0.0.11: has been idle for 30.5 secs, shutting it
down.
Oct 26 12:12:27 server1 kernel: [1633038.930579] o2net: No longer connected
to node server2  (num 0) at 10.0.0.11:
Oct 26 12:12:27 server1 kernel: [1633039.018762] o2net: Connection to node
server3  (num 2) at 10.0.0.30: shutdown, state 8
Oct 26 12:12:27 server1 kernel: [1633039.018846] o2net: No longer connected
to node server3  (num 2) at 10.0.0.30:
Oct 26 12:12:27 server1 kernel: [1633039.018987]
(kworker/u192:2,59600,0):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32
instead of 24
Oct 26 12:12:32 server1 kernel: [1633043.622980] o2net: Accepted connection
from node server3  (num 2) at 10.0.0.30:
Oct 26 12:12:32 server1 kernel: [1633043.623052] o2net: Connected to node
server2  (num 0) at 10.0.0.11:


Oct 28 12:01:35 server2  kernel: [158857.315968]
(kworker/u64:0,25700,8):dlm_send_remote_convert_request:392 ERROR: Error
-107 when sending message 504 (key 0x276c0b15) to node 1
Oct 28 12:01:35 server2  kernel: [158857.317060] o2dlm: Waiting on the
death of node 1 in domain 27C73AA97B7F4D9398D884DB8DA467DB
Oct 28 12:01:40 server2  kernel: [158862.288279]
(nfsd,7107,7):dlm_send_remote_convert_request:392 ERROR: Error -107 when
sending message 504 (key 0xbd7d045e) to node 1
Oct 28 12:01:40 server2  kernel: [158862.289730] o2dlm: Waiting on the
death of node 1 in domain 71DEDAB8A10A4F849C60F454E41418FB
Oct 28 12:01:40 server2  kernel: [158862.420276]
(kworker/u64:0,25700,10):dlm_send_remote_convert_request:392 ERROR: Error
-107 when sending message 504 (key 0x276c0b15) to node 1
Oct 28 12:01:40 server2  kernel: [158862.421855] o2dlm: Waiting on the
death of node 1 in domain 27C73AA97B7F4D9398D884DB8DA467DB
Oct 28 12:01:45 server2  kernel: [158866.795872]
(smbd,6126,8):dlm_do_master_request:1344 ERROR: link to 1 went down!
Oct 28 12:01:45 server2  kernel: [158866.796643]
(smbd,6126,8):dlm_get_lock_resource:929 ERROR: status = -107
Oct 28 12:01:45 server2  kernel: [158867.392484]
(nfsd,7107,7):dlm_send_remote_convert_request:392 ERROR: Er


Oct 26 12:12:46 server2  kernel: [441237.624180]
(smbd,9618,7):dlm_do_master_request:1344 ERROR: link to 1 went down!
Oct 26 12:12:46 server2  kernel: [441237.624181]
(smbd,9618,7):dlm_get_lock_resource:929 ERROR: status = -107
Oct 26 12:12:46 server2  kernel: [441237.624832]
(smbd,9618,7):dlm_do_master_request:1344 ERROR: link to 1 went down!
Oct 26 12:12:46 server2  kernel: [441237.624833]
(smbd,9618,7):dlm_get_lock_resource:929 ERROR: status = -107
Oct 26 12:12:46 server2  kernel: [441237.914234]
(smbd,16974,11):dlm_get_lock_resource:929 ERROR: status = -112
Oct 26 12:12:50 server2  kernel: [441242.308190] o2net: Accepted connection
from node server1  (num 1) at 10.0.0.80:
Oct 26 12:15:01 server2  CRON[17178]: (root) CMD (command -v debian-sa1 >
/dev/null && debian-sa1 1 1)

Oct 28 12:01:04 server1  kernel: [1805223.124152] o2net: Connection to node
server2  (num 0) at 10.0.0.11: shutdown, state 8
Oct 28 12:01:04 server1  kernel: [1805223.124252] o2net: No longer
connected to node server2  (num 0) at 10.0.0.11:
Oct 28 12:01:04 server1  kernel: [1805223.124285]
(python,7983,0):dlm_do_master_request:1344 ERROR: link to 0 went down!
Oct 28 12:01:04 server1  kernel: [1805223.124563]
(kworker/u192:1,9377,4):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32
instead of 24
Oct 28 12:01:04 server1  kernel: [1805223.145675]
(python,7983,0):dlm_get_lock_resource:929 ERROR: status = -112
Oct 28 12:01:04 server1  kernel: [1805223.237950] o2net: Connection to node
server3  (num 2) at 10.0.0.30: shutdown, state 8
Oct 28 12:01:04 server1  kernel: [1805223.237993] o2net: No longer
connected to node server3  (num 2) at 10.0.0.30:
Oct 28 12:01:04 server1  kernel: [1805223.238067]
(python,7983,2):dlm_do_master_request:1344 ERROR: link to 2 went down!
Oct 28 12:01:04 server1  kernel: [1805223.238078]
(kworker/u192:1,9377,12):o2net_send_tcp_msg:960 ERROR: sendmsg returned -32
instead of 24
Oct 28 12:01:04 server1  kernel: [1805223.238118]
(dlm_thread,4881,6):dlm_send_proxy_ast_msg:482 ERROR:
71DEDAB8A10A4F849C60F454E41418FB: res P00,
error -112 send AST to node 2
Oct 28 12:01:0