[Linux-HA] file system resource becomes inaccessible when any of the nodes goes down
SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70, openais-1.1.4-5.22.1.7).

This is a dual-primary DRBD cluster which mounts a file system resource on both cluster nodes simultaneously (the file system type is OCFS2). Whenever one of the nodes goes down, the file system (/sharedata) becomes inaccessible for exactly 35 seconds on the other (surviving/online) node, and then becomes available again on that node.

Please help me understand why the node which survives, or remains online, is unable to access the file system resource (/sharedata) for 35 seconds, and how I can fix the cluster so that the file system remains accessible on the surviving node without any interruption/delay (in my case about 35 seconds).

By "inaccessible" I mean that running ls -l /sharedata and df /sharedata returns no output and does not return the prompt on the online node for exactly 35 seconds once the other node becomes offline. E.g. node1 went offline somewhere around 01:37:15, and /sharedata was then inaccessible between 01:37:35 and 01:38:18 on the online node, i.e. node2.

/var/log/messages on node2, when node1 went offline:

Jul 5 01:37:26 node2 kernel: [ 675.255865] drbd r0: PingAck did not arrive in time.
Jul 5 01:37:26 node2 kernel: [ 675.255886] drbd r0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul 5 01:37:26 node2 kernel: [ 675.256030] block drbd0: new current UUID C23D1458962AD18D:A8DD404C9F563391:6A5F4A26F64BAF0B:6A5E4A26F64BAF0B
Jul 5 01:37:26 node2 kernel: [ 675.256079] drbd r0: asender terminated
Jul 5 01:37:26 node2 kernel: [ 675.256081] drbd r0: Terminating drbd_a_r0
Jul 5 01:37:26 node2 kernel: [ 675.256306] drbd r0: Connection closed
Jul 5 01:37:26 node2 kernel: [ 675.256338] drbd r0: conn( NetworkFailure -> Unconnected )
Jul 5 01:37:26 node2 kernel: [ 675.256339] drbd r0: receiver terminated
Jul 5 01:37:26 node2 kernel: [ 675.256340] drbd r0: Restarting receiver thread
Jul 5 01:37:26 node2 kernel: [ 675.256341] drbd r0: receiver (re)started
Jul 5 01:37:26 node2 kernel: [ 675.256344] drbd r0: conn( Unconnected -> WFConnection )
Jul 5 01:37:29 node2 corosync[4040]: [TOTEM ] A processor failed, forming new configuration.
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] CLM CONFIGURATION CHANGE
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] New Configuration:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.132)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Left:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.131)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Joined:
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 216: memb=1, new=0, lost=1
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] info: pcmk_peer_update: memb: node2 739307908
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] info: pcmk_peer_update: lost: node1 739307907
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] CLM CONFIGURATION CHANGE
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] New Configuration:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.132)
Jul 5 01:37:35 node2 cluster-dlm[4344]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 ocfs2_controld[4473]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Left:
Jul 5 01:37:35 node2 crmd[4050]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 stonith-ng[4046]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 cib[4045]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 cluster-dlm[4344]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 ocfs2_controld[4473]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Joined:
Jul 5 01:37:35 node2 crmd[4050]: warning: match_down_event: No match for shutdown action on node1
Jul 5 01:37:35 node2 stonith-ng[4046]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 cib[4045]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 cluster-dlm[4344]: update_cluster: Processing membership 216
Jul 5 01:37:35 node2 ocfs2_controld[4473]: confchg called
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 216: memb=1, new=0, lost=0
Jul 5 01:37:35 node2 crmd[4050]: notice: peer_update_callback: Stonith/shutdown of node1 not matched
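
For reference, a dual-primary DRBD + OCFS2 stack like the one described above is typically driven by a Pacemaker configuration along the following lines. This is only a minimal sketch: the resource names are illustrative, and only the DRBD resource r0, the device /dev/drbd0 and the mount point /sharedata are taken from the post itself; the actual configuration should be read from crm configure show on the cluster.

    primitive p-drbd_r0 ocf:linbit:drbd \
            params drbd_resource="r0" \
            op monitor interval="20" role="Master" \
            op monitor interval="30" role="Slave"
    ms ms-drbd_r0 p-drbd_r0 \
            meta master-max="2" clone-max="2" notify="true" interleave="true"
    primitive p-dlm ocf:pacemaker:controld op monitor interval="60" timeout="60"
    primitive p-o2cb ocf:ocfs2:o2cb op monitor interval="60" timeout="60"
    primitive p-fs-sharedata ocf:heartbeat:Filesystem \
            params device="/dev/drbd0" directory="/sharedata" fstype="ocfs2" \
            op monitor interval="20" timeout="40"
    primitive stonith-sbd stonith:external/sbd
    group g-ocfs2 p-dlm p-o2cb p-fs-sharedata
    clone cl-ocfs2 g-ocfs2 meta interleave="true"
    colocation col-fs-on-drbd inf: cl-ocfs2 ms-drbd_r0:Master
    order o-drbd-before-fs inf: ms-drbd_r0:promote cl-ocfs2:start

The relevant part of the sketch is the dependency chain: the Filesystem resource sits on top of O2CB and DLM, which in turn depend on cluster membership and fencing, so the OCFS2 mount cannot make progress until fencing of the lost peer has completed.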
Re: [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down
When a node goes down, the surviving node first sees it as unclean, exactly as in your logs: corosync forms a new configuration and Pacemaker issues a STONITH reboot request. Because you are using SBD, the lost node is only treated as safely down once the msgwait timeout has expired; only at that point does Pacemaker know the node is dead, and recovery proceeds in the order stonith -> DLM -> OCFS2, after which the file system responds again. That wait is the roughly 35-second freeze you are observing. If you need to reduce msgwait, be careful: it has to stay consistent with the other timeouts, otherwise you trade the delay for new cluster problems.
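
Assuming external/sbd really is the fencing device here, something along these lines shows where the delay comes from and which values have to move together if you shorten it. This is a sketch only: /dev/sdX1 stands in for the real SBD device, and the numbers are examples, not recommendations.

    # Timeouts currently written to the SBD device header
    # (the device is normally listed as SBD_DEVICE in /etc/sysconfig/sbd)
    sbd -d /dev/sdX1 dump
    # "Timeout (msgwait)" is the minimum time the cluster waits before it
    # considers a fenced peer to be safely dead

    # Re-create the header with shorter timeouts -- destructive, so only
    # with the cluster stack stopped on both nodes; -1 sets the watchdog
    # timeout, -4 sets msgwait, and msgwait should stay at roughly twice
    # the watchdog timeout
    sbd -d /dev/sdX1 -1 10 -4 20 create

    # Pacemaker's stonith-timeout must stay larger than msgwait, otherwise
    # fencing operations start timing out
    crm configure property stonith-timeout=40s

Shorter timeouts mean the OCFS2 mount unblocks sooner after a node failure, but they also leave less headroom for slow storage or a congested network, so test any change under realistic conditions before relying on it.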