[Linux-HA] file system resource becomes inaccessible when any of the nodes goes down

2015-07-05 Thread Muhammad Sharfuddin
SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70, 
openais-1.1.4-5.22.1.7)


It's a dual-primary DRBD cluster which mounts a file system resource on 
both cluster nodes simultaneously (the file system type is OCFS2).
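
For reference, a dual-primary DRBD + OCFS2 stack on SLES 11 HAE is usually 
configured along the following lines; this is only a sketch, and the device 
paths, port and resource names are examples, not necessarily the actual 
configuration here:

# /etc/drbd.d/r0.res (fragment, drbd 8.4 syntax): dual-primary needs this
resource r0 {
  net {
    allow-two-primaries yes;
  }
  # plus the usual on node1 { ... } / on node2 { ... } sections
}

# crm configure (sketch): DLM + O2CB + Filesystem cloned across both nodes
primitive p-dlm ocf:pacemaker:controld op monitor interval="60" timeout="60"
primitive p-o2cb ocf:ocfs2:o2cb op monitor interval="60" timeout="60"
primitive p-fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/sharedata" fstype="ocfs2" \
    op monitor interval="20" timeout="40"
group g-storage p-dlm p-o2cb p-fs
clone cl-storage g-storage meta interleave="true"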


Whenever one of the nodes goes down, the file system (/sharedata) becomes 
inaccessible for exactly 35 seconds on the other (surviving/online) node, 
and then becomes available again on that node.


Please help me understand why the node which survives, or remains online, 
is unable to access the file system resource (/sharedata) for 35 seconds, 
and how I can fix the cluster so that the file system remains accessible 
on the surviving node without any interruption/delay (in my case about 
35 seconds).


By inaccessible, I mean that running ls -l /sharedata and df /sharedata 
returns no output and does not give the prompt back on the online node 
for exactly 35 seconds once the other node goes offline.
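
If it helps to narrow this down, the hang can be watched from the surviving 
node while the peer is down; these are just standard diagnostic commands, 
nothing specific to this setup:

# time how long access actually blocks
time ls -l /sharedata

# DLM lockspaces stay in recovery until the dead node has been fenced
dlm_tool ls

# the failed node is shown as UNCLEAN until stonith confirms the kill
crm_mon -1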


E.g. node1 went offline somewhere around 01:37:15, and then the /sharedata 
file system was inaccessible between 01:37:35 and 01:38:18 on the online 
node, i.e. node2.



/var/log/messages on node2, when node1 went offline:
Jul  5 01:37:26 node2 kernel: [  675.255865] drbd r0: PingAck did not 
arrive in time.
Jul  5 01:37:26 node2 kernel: [  675.255886] drbd r0: peer( Primary -> 
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul  5 01:37:26 node2 kernel: [  675.256030] block drbd0: new current 
UUID C23D1458962AD18D:A8DD404C9F563391:6A5F4A26F64BAF0B:6A5E4A26F64BAF0B

Jul  5 01:37:26 node2 kernel: [  675.256079] drbd r0: asender terminated
Jul  5 01:37:26 node2 kernel: [  675.256081] drbd r0: Terminating drbd_a_r0
Jul  5 01:37:26 node2 kernel: [  675.256306] drbd r0: Connection closed
Jul  5 01:37:26 node2 kernel: [  675.256338] drbd r0: conn( 
NetworkFailure -> Unconnected )

Jul  5 01:37:26 node2 kernel: [  675.256339] drbd r0: receiver terminated
Jul  5 01:37:26 node2 kernel: [  675.256340] drbd r0: Restarting 
receiver thread

Jul  5 01:37:26 node2 kernel: [  675.256341] drbd r0: receiver (re)started
Jul  5 01:37:26 node2 kernel: [  675.256344] drbd r0: conn( Unconnected 
-> WFConnection )
Jul  5 01:37:29 node2 corosync[4040]:  [TOTEM ] A processor failed, 
forming new configuration.

Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] New Configuration:
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] r(0) ip(172.16.241.132)
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Left:
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] r(0) ip(172.16.241.131)
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Joined:
Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] notice: 
pcmk_peer_update: Transitional membership event on ring 216: memb=1, 
new=0, lost=1
Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update: 
memb: node2 739307908
Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update: 
lost: node1 739307907

Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] New Configuration:
Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] r(0) ip(172.16.241.132)
Jul  5 01:37:35 node2 cluster-dlm[4344]:   notice: 
plugin_handle_membership: Membership 216: quorum lost
Jul  5 01:37:35 node2 ocfs2_controld[4473]:   notice: 
plugin_handle_membership: Membership 216: quorum lost

Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Left:
Jul  5 01:37:35 node2 crmd[4050]:   notice: plugin_handle_membership: 
Membership 216: quorum lost
Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: 
plugin_handle_membership: Membership 216: quorum lost
Jul  5 01:37:35 node2 cib[4045]:   notice: plugin_handle_membership: 
Membership 216: quorum lost
Jul  5 01:37:35 node2 cluster-dlm[4344]:   notice: 
crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - 
state is now lost (was member)
Jul  5 01:37:35 node2 ocfs2_controld[4473]:   notice: 
crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - 
state is now lost (was member)

Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Joined:
Jul  5 01:37:35 node2 crmd[4050]:  warning: match_down_event: No match 
for shutdown action on node1
Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node node1[739307907] - state is now lost (was 
member)
Jul  5 01:37:35 node2 cib[4045]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node node1[739307907] - state is now lost (was 
member)
Jul  5 01:37:35 node2 cluster-dlm[4344]: update_cluster: Processing 
membership 216

Jul  5 01:37:35 node2 ocfs2_controld[4473]: confchg called
Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] notice: 
pcmk_peer_update: Stable membership event on ring 216: memb=1, new=0, lost=0
Jul  5 01:37:35 node2 crmd[4050]:   notice: peer_update_callback: 
Stonith/shutdown of node1 not matched

Re: [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down

2015-07-05 Thread emmanuel segura
When a node goes down, you will see the node in an unclean state. As you 
can see in your logs, corosync forms a new configuration, then a stonith 
reboot request is issued. You are using SBD, so the node only becomes 
offline once msgwait has expired; when msgwait expires, pacemaker knows 
the node is dead and then OCFS2 is usable again (the order is 
stonith -> dlm -> ocfs2). If you need to reduce msgwait, you need to be 
careful about the other timeouts and cluster problems.
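
The msgwait value (and the related watchdog timeout) can be read directly 
from the SBD device; the device path below is just a placeholder, and 
pacemaker's stonith-timeout should stay larger than msgwait:

# read the SBD header, including Timeout (watchdog) and Timeout (msgwait)
sbd -d /dev/disk/by-id/<your-sbd-device> dump

# timeouts can only be changed by re-creating the SBD metadata, e.g.
# watchdog 10s / msgwait 20s; do this only with the cluster stopped
sbd -d /dev/disk/by-id/<your-sbd-device> -1 10 -4 20 create

# keep stonith-timeout above msgwait, with some margin
crm configure property stonith-timeout=40s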
