Hi All,

I have a setup where two nodes, Server1 and Server2, are synchronized by DRBD and managed by Corosync. DRBD runs in Primary/Primary (dual-primary) mode.

When I shut down Server1, Server2 waits for some time and then restarts by itself. The other way around, when Server2 is shut down while Server1 stays online, nothing happens to Server1.

Please find the log and the package versions attached below. I understand that fencing is causing this, but is there any way to disable it?
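For context, the log below shows the fence-peer handler being invoked (crm-fence-peer.sh for resource r0). The fencing behaviour in question is configured along these lines in the DRBD resource file; this is a minimal sketch assuming the stock Ubuntu handler paths, not the exact contents of my config:

```
# Hypothetical excerpt of a dual-primary resource definition (DRBD 8.3 syntax)
resource r0 {
  net {
    allow-two-primaries;          # dual-primary mode
  }
  disk {
    # 'resource-and-stonith' suspends I/O on connection loss and calls the
    # fence-peer handler. Setting this to 'dont-care' disables fencing, but
    # that risks split-brain in a dual-primary setup.
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer         "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```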
--
Regards,
Kamal Kishore B V
root@server2:~# cat /etc/issue && dpkg -l | egrep "corosync|pacemaker|drbd|xen|ocfs2"
Ubuntu 12.04 LTS \n \l
ii corosync 1.4.2-2ubuntu0.1
Standards-based cluster framework (daemon and modules)
hi drbd8-utils 2:8.3.11-0ubuntu1
RAID 1 over tcp/ip for Linux utilities
ii libxen-4.1 4.1.5-0ubuntu0.12.04.3
Public libs for Xen
ii libxenstore3.0 4.1.5-0ubuntu0.12.04.3
Xenstore communications library for Xen
ii ocfs2-tools 1.6.3-4ubuntu1
tools for managing OCFS2 cluster filesystems
ii pacemaker 1.1.6-2ubuntu3.2
HA cluster resource manager
ii xen-hypervisor-4.1-amd64 4.1.5-0ubuntu0.12.04.3
Xen Hypervisor on AMD64
ii xen-utils-4.1 4.1.5-0ubuntu0.12.04.3
XEN administrative tools
ii xen-utils-common 4.1.2-1ubuntu1
XEN administrative tools - common files
ii xenstore-utils 4.1.5-0ubuntu0.12.04.3
Xenstore utilities for Xen
cat /etc/issue && dpkg -l | egrep "bridge-utils"
bridge-utils 1.5-2ubuntu7 Utilities for configuring the Linux Ethernet bridge
cat /etc/issue && dpkg -l | egrep "ssh"
Ubuntu 12.04 LTS \n \l
ii libssh-4 0.5.2-1
tiny C SSH library
ii openssh-client 1:5.9p1-5ubuntu1.4
secure shell (SSH) client, for secure access to remote machines
ii openssh-server 1:5.9p1-5ubuntu1.4
secure shell (SSH) server, for secure access from remote machines
ii ssh-askpass-gnome 1:5.9p1-5ubuntu1
interactive X program to prompt users for a passphrase for ssh-add
ii ssh-import-id 2.10-0ubuntu1
securely retrieve an SSH public key and install it locally
cat /etc/issue && dpkg -l | egrep "vncviewer"
Ubuntu 12.04 LTS \n \l
ii xtightvncviewer 1.3.9-6.2ubuntu2
virtual network computing client software for X
Apr 18 02:49:03 server2 crmd: [1181]: WARN: check_dead_member: Our DC node
(server1) left the cluster
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
origin=check_dead_member ]
Apr 18 02:49:03 server2 crmd: [1181]: info: update_dc: Unset DC server1
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Apr 18 02:49:03 server2 crmd: [1181]: info: do_te_control: Registering TE UUID:
11058a2e-0c23-4f0d-a911-65a5eb1b74a0
Apr 18 02:49:03 server2 crmd: [1181]: info: set_graph_functions: Setting custom
graph functions
Apr 18 02:49:03 server2 crmd: [1181]: info: unpack_graph: Unpacked transition
-1: 0 actions in 0 synapses
Apr 18 02:49:03 server2 crmd: [1181]: info: do_dc_takeover: Taking over DC
status for this partition
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_readwrite: We are now in
R/W mode
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_master for section 'all' (origin=local/crmd/17,
version=0.116.8): ok (rc=0)
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section cib (origin=local/crmd/18,
version=0.116.9): ok (rc=0)
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/20,
version=0.116.10): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: join_make_offer: Making join offers
based on membership 156200
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/22,
version=0.116.11): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: do_dc_join_offer_all: join-1:
Waiting on 1 outstanding join acks
Apr 18 02:49:03 server2 crmd: [1181]: info: ais_dispatch_message: Membership
156200: quorum still lost
Apr 18 02:49:03 server2 crmd: [1181]: info: crmd_ais_dispatch: Setting expected
votes to 2
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/25,
version=0.116.12): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: update_dc: Set DC to server2 (3.0.5)
Apr 18 02:49:03 server2 crmd: [1181]: info: config_query_callback: Shutdown
escalation occurs after: 1200000ms
Apr 18 02:49:03 server2 crmd: [1181]: info: config_query_callback: Checking for
expired actions every 900000ms
Apr 18 02:49:03 server2 crmd: [1181]: info: config_query_callback: Sending
expected-votes=2 to corosync
Apr 18 02:49:03 server2 crmd: [1181]: info: ais_dispatch_message: Membership
156200: quorum still lost
Apr 18 02:49:03 server2 crmd: [1181]: info: crmd_ais_dispatch: Setting expected
votes to 2
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/28,
version=0.116.13): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: State
transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
cause=C_FSA_INTERNAL origin=check_join_state ]
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: All 1 cluster
nodes responded to the join offer.
Apr 18 02:49:03 server2 crmd: [1181]: info: do_dc_join_finalize: join-1:
Syncing the CIB from server2 to the rest of the cluster
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_sync for section 'all' (origin=local/crmd/29,
version=0.116.13): ok (rc=0)
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section nodes (origin=local/crmd/30,
version=0.116.14): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: do_dc_join_ack: join-1: Updating
node state to member for server2
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_delete for section //node_state[@uname='server2']/lrm
(origin=local/crmd/31, version=0.116.15): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: erase_xpath_callback: Deletion of
"//node_state[@uname='server2']/lrm": ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: State
transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED
cause=C_FSA_INTERNAL origin=check_join_state ]
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: All 1 cluster
nodes are eligible to run resources.
Apr 18 02:49:03 server2 crmd: [1181]: info: do_dc_join_final: Ensuring DC,
quorum and node attributes are up-to-date
Apr 18 02:49:03 server2 crmd: [1181]: info: crm_update_quorum: Updating quorum
status to false (call=35)
Apr 18 02:49:03 server2 crmd: [1181]: info: abort_transition_graph:
do_te_invoke:167 - Triggered transition abort (complete=1) : Peer Cancelled
Apr 18 02:49:03 server2 crmd: [1181]: info: do_pe_invoke: Query 36: Requesting
the current CIB: S_POLICY_ENGINE
Apr 18 02:49:03 server2 attrd: [1179]: notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Apr 18 02:49:03 server2 attrd: [1179]: notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section nodes (origin=local/crmd/33,
version=0.116.17): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: WARN: match_down_event: No match for
shutdown action on server1
Apr 18 02:49:03 server2 crmd: [1181]: info: te_update_diff: Stonith/shutdown of
server1 not matched
Apr 18 02:49:03 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_modify for section cib (origin=local/crmd/35,
version=0.116.19): ok (rc=0)
Apr 18 02:49:03 server2 crmd: [1181]: info: abort_transition_graph:
te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state,
id=server1, magic=NA, cib=0.116.18) : Node failure
Apr 18 02:49:03 server2 crmd: [1181]: info: do_pe_invoke: Query 37: Requesting
the current CIB: S_POLICY_ENGINE
Apr 18 02:49:03 server2 crmd: [1181]: info: do_pe_invoke_callback: Invoking the
PE: query=37, ref=pe_calc-dc-1429305543-16, seq=156200, quorate=0
Apr 18 02:49:03 server2 attrd: [1179]: notice: attrd_trigger_update: Sending
flush op to all hosts for: master-resDRBDr1:0 (10000)
Apr 18 02:49:03 server2 pengine: [1180]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Apr 18 02:49:03 server2 pengine: [1180]: notice: RecurringOp: Start recurring
monitor (20s) for resXen1 on server2
Apr 18 02:49:03 server2 pengine: [1180]: notice: LogActions: Start
resXen1#011(server2)
Apr 18 02:49:03 server2 pengine: [1180]: notice: LogActions: Leave
resDRBDr1:0#011(Master server2)
Apr 18 02:49:03 server2 pengine: [1180]: notice: LogActions: Leave
resDRBDr1:1#011(Stopped)
Apr 18 02:49:03 server2 pengine: [1180]: notice: LogActions: Leave
resOCFS2r1:0#011(Started server2)
Apr 18 02:49:03 server2 pengine: [1180]: notice: LogActions: Leave
resOCFS2r1:1#011(Stopped)
Apr 18 02:49:03 server2 crmd: [1181]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Apr 18 02:49:03 server2 crmd: [1181]: info: unpack_graph: Unpacked transition
0: 2 actions in 2 synapses
Apr 18 02:49:03 server2 crmd: [1181]: info: do_te_invoke: Processing graph 0
(ref=pe_calc-dc-1429305543-16) derived from /var/lib/pengine/pe-input-1545.bz2
Apr 18 02:49:03 server2 crmd: [1181]: info: te_rsc_command: Initiating action
6: start resXen1_start_0 on server2 (local)
Apr 18 02:49:03 server2 crmd: [1181]: info: do_lrm_rsc_op: Performing
key=6:0:0:11058a2e-0c23-4f0d-a911-65a5eb1b74a0 op=resXen1_start_0 )
Apr 18 02:49:03 server2 lrmd: [1178]: info: rsc:resXen1 start[14] (pid 5929)
Apr 18 02:49:03 server2 pengine: [1180]: notice: process_pe_message: Transition
0: PEngine Input stored in: /var/lib/pengine/pe-input-1545.bz2
Apr 18 02:49:05 server2 kernel: [ 839.612059] o2net: Connection to node
server1 (num 0) at 192.168.0.91:7777 has been idle for 10.12 secs, shutting it
down.
Apr 18 02:49:05 server2 kernel: [ 839.612093] o2net: No longer connected to
node server1 (num 0) at 192.168.0.91:7777
Apr 18 02:49:05 server2 kernel: [ 839.612126]
(xend,5992,1):dlm_send_remote_convert_request:395 ERROR: Error -112 when
sending message 504 (key 0x2d50ec47) to node 0
Apr 18 02:49:05 server2 kernel: [ 839.612134] o2dlm: Waiting on the death of
node 0 in domain 89D0A7DA3B9B43EDB575466F176F7A0C
Apr 18 02:49:15 server2 kernel: [ 849.628073] o2net: No connection established
with node 0 after 10.0 seconds, giving up.
Apr 18 02:49:16 server2 kernel: [ 851.344088] block drbd0: PingAck did not
arrive in time.
Apr 18 02:49:16 server2 kernel: [ 851.344101] block drbd0: peer( Primary ->
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
susp( 0 -> 1 )
Apr 18 02:49:16 server2 kernel: [ 851.352284] block drbd0: asender terminated
Apr 18 02:49:16 server2 kernel: [ 851.352292] block drbd0: Terminating
drbd0_asender
Apr 18 02:49:16 server2 kernel: [ 851.352361] block drbd0: Connection closed
Apr 18 02:49:16 server2 kernel: [ 851.352405] block drbd0: conn(
NetworkFailure -> Unconnected )
Apr 18 02:49:16 server2 kernel: [ 851.352409] block drbd0: receiver terminated
Apr 18 02:49:16 server2 kernel: [ 851.352411] block drbd0: Restarting
drbd0_receiver
Apr 18 02:49:16 server2 kernel: [ 851.352413] block drbd0: receiver (re)started
Apr 18 02:49:16 server2 kernel: [ 851.352418] block drbd0: conn( Unconnected
-> WFConnection )
Apr 18 02:49:16 server2 kernel: [ 851.352496] block drbd0: helper command:
/sbin/drbdadm fence-peer minor-0
Apr 18 02:49:16 server2 crm-fence-peer.sh[6078]: invoked for r0
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: - <cib admin_epoch="0"
epoch="116" num_updates="21" />
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + <cib epoch="117"
num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2"
crm_feature_set="3.0.5" update-origin="server2" update-client="cibadmin"
cib-last-written="Sat Apr 18 02:37:19 2015" have-quorum="0" dc-uuid="server2" >
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + <configuration >
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + <constraints >
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + <rsc_location
rsc="msDRBDr1" id="drbd-fence-by-handler-msDRBDr1"
__crm_diff_marker__="added:top" >
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + <rule
role="Master" score="-INFINITY" id="drbd-fence-by-handler-rule-msDRBDr1" >
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + <expression
attribute="#uname" operation="ne" value="server2"
id="drbd-fence-by-handler-expr-msDRBDr1" />
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + </rule>
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + </rsc_location>
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + </constraints>
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + </configuration>
Apr 18 02:49:17 server2 cib: [1177]: info: cib:diff: + </cib>
Apr 18 02:49:17 server2 crmd: [1181]: info: abort_transition_graph:
te_update_diff:124 - Triggered transition abort (complete=0, tag=diff,
id=(null), magic=NA, cib=0.117.1) : Non-status change
Apr 18 02:49:17 server2 cib: [1177]: info: cib_process_request: Operation
complete: op cib_create for section constraints (origin=local/cibadmin/2,
version=0.117.1): ok (rc=0)
Apr 18 02:49:17 server2 crmd: [1181]: info: update_abort_priority: Abort
priority upgraded from 0 to 1000000
Apr 18 02:49:17 server2 crmd: [1181]: info: update_abort_priority: Abort action
done superceeded by restart
Apr 18 02:49:17 server2 crm-fence-peer.sh[6078]: INFO peer is reachable, my
disk is UpToDate: placed constraint 'drbd-fence-by-handler-msDRBDr1'
Apr 18 02:49:17 server2 kernel: [ 852.446543] block drbd0: helper command:
/sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Apr 18 02:49:17 server2 kernel: [ 852.446547] block drbd0: fence-peer helper
returned 4 (peer was fenced)
Apr 18 02:49:17 server2 kernel: [ 852.446553] block drbd0: pdsk( DUnknown ->
Outdated )
Apr 18 02:49:17 server2 kernel: [ 852.446586] block drbd0: new current UUID
1F5EB0210CC2AB1B:A158E0E3B5B108B9:EFB663F37532A93A:EFB563F37532A93B
Apr 18 02:49:17 server2 kernel: [ 852.492924] block drbd0: susp( 1 -> 0 )
Apr 18 02:49:25 server2 kernel: [ 859.644076] o2net: No connection established
with node 0 after 10.0 seconds, giving up.
Apr 18 02:49:25 server2 kernel: [ 859.644112]
(xend,5992,0):dlm_send_remote_convert_request:395 ERROR: Error -107 when
sending message 504 (key 0x2d50ec47) to node 0
Apr 18 02:49:25 server2 kernel: [ 859.644120] o2dlm: Waiting on the death of
node 0 in domain 89D0A7DA3B9B43EDB575466F176F7A0C
Apr 18 02:49:30 server2 kernel: [ 865.452111] o2net: Connection to node
server1 (num 0) at 192.168.0.91:7777 shutdown, state 7
Apr 18 02:49:33 server2 kernel: [ 868.452122] o2net: Connection to node
server1 (num 0) at 192.168.0.91:7777 shutdown, state 7
Apr 18 02:49:35 server2 kernel: [ 869.660076] o2net: No connection established
with node 0 after 10.0 seconds, giving up.
Apr 18 02:49:35 server2 kernel: [ 869.660116]
(xend,5992,0):dlm_send_remote_convert_request:395 ERROR: Error -107 when
sending message 504 (key 0x2d50ec47) to node 0
Apr 18 02:49:35 server2 kernel: [ 869.660123] o2dlm: Waiting on the death of
node 0 in domain 89D0A7DA3B9B43EDB575466F176F7A0C
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user