---------- Forwarded message ---------- From: ruchi priya <[email protected]> Date: Tue, Nov 2, 2010 at 11:16 PM Subject: split brain issue in drbd----help To: [email protected]
Hi All, , I have encounterd a strange behaviour of split brain .I have a 2 node DRBD cluster running.Even when the heartbeat link is not down ,both are attempting to become primary and each dont know if they are in the same membership.I have tried pinging the ip with -I option ,network is fine as well. If there are 2 nodes in the cluster & the redundant paths break ,how to determine which node will take over the service ?? And here fencing is enabled ( STONITH enabled) ,so when connectivity is restored ,heartbeat should stop services .The restart will result in one of the 2 machines providing the services ,but here both are attempting to become primary .!! How to resolve?? Please help.!! The ha.cf and ha-debug is attached. ha.cf: crm on debugfile /var/log/ha-debug logfile /var/log/ha-log logfacility local0 keepalive 2 deadtime 30 warntime 10 initdead 60 udpport 6940 auto_failback off node MDS_172-20-6-3 node MDS_172-20-6-60 ucast eth6 172.20.6.60 ucast eth5 172.20.8.80 ucast bond0 192.168.20.60 ha_debug : heartbeat[6407]: 2010/10/20_04:14:21 WARN: Core dumps could be lost if multiple dumps occur. heartbeat[6407]: 2010/10/20_04:14:21 WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability heartbeat[6407]: 2010/10/20_04:14:21 WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability heartbeat[6407]: 2010/10/20_04:14:21 WARN: Logging daemon is disabled --enabling logging daemon is recommended heartbeat[6407]: 2010/10/20_04:14:21 info: ************************** heartbeat[6407]: 2010/10/20_04:14:21 info: Configuration validated. Starting heartbeat 2.1.4 heartbeat[6408]: 2010/10/20_04:14:21 info: heartbeat: version 2.1.4 heartbeat[6408]: 2010/10/20_04:14:22 info: Heartbeat generation: 1271790490 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth6 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: bound send socket to device: eth6 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: bound receive socket to device: eth6 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: started on port 6940 interface eth6 to 172.20.6.60 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth5 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: bound send socket to device: eth5 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: bound receive socket to device: eth5 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: started on port 6940 interface eth5 to 172.20.8.80 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on bond0 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: bound send socket to device: bond0 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: bound receive socket to device: bond0 heartbeat[6408]: 2010/10/20_04:14:22 info: glib: ucast: started on port 6940 interface bond0 to 192.168.20.60 heartbeat[6408]: 2010/10/20_04:14:22 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[6408]: 2010/10/20_04:14:22 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[6408]: 2010/10/20_04:14:22 info: G_main_add_SignalHandler: Added signal handler for signal 17 heartbeat[6408]: 2010/10/20_04:14:22 info: Local status now set to: 'up' heartbeat[6408]: 2010/10/20_04:14:23 info: Link mds_172-20-6-60:eth5 up. heartbeat[6408]: 2010/10/20_04:14:23 info: Status update for node mds_172-20-6-60: status active heartbeat[6408]: 2010/10/20_04:14:23 info: Link mds_172-20-6-60:bond0 up. heartbeat[6408]: 2010/10/20_04:14:31 info: Comm_now_up(): updating status to active heartbeat[6408]: 2010/10/20_04:14:31 info: Local status now set to: 'active' heartbeat[6408]: 2010/10/20_04:14:31 info: Starting child client "//lib64/heartbeat/ccm" (1001,100) heartbeat[6408]: 2010/10/20_04:14:31 info: Starting child client "//lib64/heartbeat/cib" (1001,100) heartbeat[6792]: 2010/10/20_04:14:31 info: Starting "//lib64/heartbeat/ccm" as uid 1001 gid 100 (pid 6792) heartbeat[6408]: 2010/10/20_04:14:31 info: Starting child client "//lib64/heartbeat/lrmd -r" (0,0) heartbeat[6793]: 2010/10/20_04:14:31 info: Starting "//lib64/heartbeat/cib" as uid 1001 gid 100 (pid 6793) heartbeat[6408]: 2010/10/20_04:14:31 info: Starting child client "//lib64/heartbeat/stonithd" (0,0) heartbeat[6794]: 2010/10/20_04:14:31 info: Starting "//lib64/heartbeat/lrmd -r" as uid 0 gid 0 (pid 6794) heartbeat[6408]: 2010/10/20_04:14:31 info: Starting child client "//lib64/heartbeat/attrd" (1001,100) heartbeat[6795]: 2010/10/20_04:14:31 info: Starting "//lib64/heartbeat/stonithd" as uid 0 gid 0 (pid 6795) heartbeat[6408]: 2010/10/20_04:14:31 info: Starting child client "//lib64/heartbeat/crmd" (1001,100) heartbeat[6797]: 2010/10/20_04:14:31 info: Starting "//lib64/heartbeat/crmd" as uid 1001 gid 100 (pid 6797) heartbeat[6796]: 2010/10/20_04:14:31 info: Starting "//lib64/heartbeat/attrd" as uid 1001 gid 100 (pid 6796) lrmd[6794]: 2010/10/20_04:14:31 info: G_main_add_SignalHandler: Added signal handler for signal 15 stonithd[6795]: 2010/10/20_04:14:31 WARN: Core dumps could be lost if multiple dumps occur. attrd[6796]: 2010/10/20_04:14:31 info: G_main_add_SignalHandler: Added signal handler for signal 15 stonithd[6795]: 2010/10/20_04:14:31 WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability crmd[6797]: 2010/10/20_04:16:09 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm crmd[6797]: 2010/10/20_04:16:09 info: mem_handle_event: no mbr_track info crmd[6797]: 2010/10/20_04:16:09 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm crmd[6797]: 2010/10/20_04:16:09 info: mem_handle_event: instance=230, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3 crmd[6797]: 2010/10/20_04:16:09 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=230) cib[6793]: 2010/10/20_04:16:09 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm crmd[6797]: 2010/10/20_04:16:09 info: ccm_event_detail: NEW MEMBERSHIP: trans=230, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3 cib[6793]: 2010/10/20_04:16:09 info: mem_handle_event: no mbr_track info crmd[6797]: 2010/10/20_04:16:09 info: ccm_event_detail: CURRENT: mds_172-20-6-3 [nodeid=0, born=230] cib[6793]: 2010/10/20_04:16:09 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm cib[6793]: 2010/10/20_04:16:09 info: mem_handle_event: instance=230, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3 cib[6793]: 2010/10/20_04:16:09 info: cib_ccm_msg_callback: PEER: mds_172-20-6-3 crmd[6797]: 2010/10/20_04:16:12 WARN: crmd_ha_msg_callback: Ignoring HA message (op=noop) from mds_172-20-6-60: not in our membership list (size=1) crmd[6797]: 2010/10/20_04:16:21 info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm crmd[6797]: 2010/10/20_04:16:21 info: mem_handle_event: instance=230, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3 crmd[6797]: 2010/10/20_04:16:21 info: crmd_ccm_msg_callback: Quorum lost after event=NOT PRIMARY (id=230) cib[6793]: 2010/10/20_04:16:21 info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm cib[6793]: 2010/10/20_04:16:21 info: mem_handle_event: instance=230, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3 crmd[6797]: 2010/10/20_04:16:22 WARN: crmd_ha_msg_callback: Ignoring HA message (op=noop) from ccm crmd[6797]: 2010/10/20_04:16:31 info: mem_handle_event: instance=231, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3 cib[6793]: 2010/10/20_04:16:31 info: mem_handle_event: instance=231, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3 crmd[6797]: 2010/10/20_04:16:31 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=231) cib[6793]: 2010/10/20_04:16:31 info: cib_ccm_msg_callback: PEER: mds_172-20-6-3 crmd[6797]: 2010/10/20_04:16:31 info: ccm_event_detail: NEW MEMBERSHIP: trans=231, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3 crmd[6797]: 2010/10/20_04:16:31 info: ccm_event_detail: CURRENT: mds_172-20-6-3 [nodeid=0, born=231] crmd[6797]: 2010/10/20_04:16:46 info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped! crmd[6797]: 2010/10/20_04:16:46 WARN: do_log: [[FSA]] Input I_DC_TIMEOUT from crm_timer_popped() received in state (S_PENDING) crmd[6797]: 2010/10/20_04:16:46 info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ] crmd[6797]: 2010/10/20_04:16:46 info: do_election_count_vote: Updated voted hash for mds_172-20-6-3 to vote crmd[6797]: 2010/10/20_04:16:46 info: do_election_count_vote: Election ignore: our vote (mds_172-20-6-3) crmd[6797]: 2010/10/20_04:16:46 info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ] crmd[6797]: 2010/10/20_04:16:46 info: start_subsystem: Starting sub-system "tengine" crmd[6797]: 2010/10/20_04:16:46 info: start_subsystem: Starting sub-system "pengine" crmd[6797]: 2010/10/20_04:16:46 info: do_dc_takeover: Taking over DC status for this partition cib[6793]: 2010/10/20_04:16:46 info: cib_process_readwrite: We are now in R/W mode cib[6793]: 2010/10/20_04:16:46 info: revision_check: Updating CIB revision to 2.0 cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: - <cib epoch="0" num_updates="0" generated="false"/> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <cib epoch="1" num_updates="1" generated="true" crm_feature_set="2.0"/> crmd[6797]: 2010/10/20_04:16:46 info: erase_node_from_join: Removed dead node mds_172-20-6-3 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=0 crmd[6797]: 2010/10/20_04:16:46 info: join_make_offer: Making join offers based on membership 231 crmd[6797]: 2010/10/20_04:16:46 info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks cib[13168]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: - <cib epoch="1"/> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <cib epoch="2"> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <configuration> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <crm_config> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <cluster_property_set id="cib-bootstrap-options"> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <attributes> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.4-node: aa909246edb386137b986c5773344b98c6969999"/> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + </attributes> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + </cluster_property_set> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + </crm_config> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + </configuration> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + </cib> tengine[13166]: 2010/10/20_04:16:46 info: G_main_add_SignalHandler: Added signal handler for signal 15 tengine[13166]: 2010/10/20_04:16:46 info: G_main_add_TriggerHandler: Added signal manual handler tengine[13166]: 2010/10/20_04:16:46 info: G_main_add_TriggerHandler: Added signal manual handler cib[6793]: 2010/10/20_04:16:46 info: cib_null_callback: Setting cib_diff_notify callbacks for tengine: on cib[13168]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13168]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml.last (digest: //var/lib/heartbeat/crm/cib.xml.sig.last) cib[13168]: 2010/10/20_04:16:46 info: write_cib_contents: Wrote version 0.1.1 of the CIB to disk (digest: b9370c847a7ddce6dd22c533035fd5f8) crmd[6797]: 2010/10/20_04:16:46 info: update_dc: Set DC to mds_172-20-6-3 (2.0) cib[13168]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13168]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml.last (digest: //var/lib/heartbeat/crm/cib.xml.sig.last) cib[13171]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13171]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13171]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml.last (digest: //var/lib/heartbeat/crm/cib.xml.sig.last) crmd[6797]: 2010/10/20_04:16:46 info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ] crmd[6797]: 2010/10/20_04:16:46 info: do_state_transition: All 1 cluster nodes responded to the join offer. crmd[6797]: 2010/10/20_04:16:46 info: update_attrd: Connecting to attrd... cib[6793]: 2010/10/20_04:16:46 info: sync_our_cib: Syncing CIB to all peers attrd[6796]: 2010/10/20_04:16:46 info: attrd_local_callback: Sending full refresh cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: - <cib epoch="2"/> cib[6793]: 2010/10/20_04:16:46 info: log_data_element: cib:diff: + <cib epoch="3" dc_uuid="80ccc795-376f-40e8-b3a9-ac468886cdaa"/> cib[13171]: 2010/10/20_04:16:46 info: write_cib_contents: Wrote version 0.2.1 of the CIB to disk (digest: 9dce32038e30a29c2223a253f092ceb3) cib[13171]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13171]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml.last (digest: //var/lib/heartbeat/crm/cib.xml.sig.last) cib[13200]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13200]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13200]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml.last (digest: //var/lib/heartbeat/crm/cib.xml.sig.last) cib[13200]: 2010/10/20_04:16:46 info: write_cib_contents: Wrote version 0.3.2 of the CIB to disk (digest: 51c7a5192a7d3e272505ddfe7e68f7e6) cib[13200]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml (digest: //var/lib/heartbeat/crm/cib.xml.sig) cib[13200]: 2010/10/20_04:16:46 info: retrieveCib: Reading cluster configuration from: //var/lib/heartbeat/crm/cib.xml.last (digest: //var/lib/heartbeat/crm/cib.xml.sig.last) crmd[6797]: 2010/10/20_04:16:46 info: update_dc: Set DC to mds_172-20-6-3 (2.0) tengine[13166]: 2010/10/20_04:16:47 info: te_connect_stonith: Connected tengine[13166]: 2010/10/20_04:16:47 info: update_abort_priority: Abort priority upgraded to 1000000 tengine[13166]: 2010/10/20_04:16:47 info: update_abort_priority: 'DC Takeover' abort superceeded crmd[6797]: 2010/10/20_04:16:47 info: do_dc_join_ack: join-1: Updating node state to member for mds_172-20-6-3 crmd[6797]: 2010/10/20_04:16:47 info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ] crmd[6797]: 2010/10/20_04:16:47 info: do_state_transition: All 1 cluster nodes are eligible to run resources. tengine[13166]: 2010/10/20_04:26:48 WARN: notify_crmd: Delaying completion until all CIB updates complete tengine[13166]: 2010/10/20_04:26:48 info: notify_crmd: Exiting after transition tengine[13166]: 2010/10/20_04:26:48 info: te_init: Exiting tengine crmd[6797]: 2010/10/20_04:26:48 info: process_client_disconnect: Received HUP from tengine:[13166] crmd[6797]: 2010/10/20_04:26:48 WARN: do_dc_join_finalize: join-2: We are still in a transition. Delaying until the TE completes. crmd[6797]: 2010/10/20_04:26:48 info: crmdManagedChildDied: Process tengine:[13166] exited (signal=0, exitcode=0) crmd[6797]: 2010/10/20_04:26:48 WARN: do_dc_join_finalize: join-2: We are still in a transition. Delaying until the TE completes. cib[6793]: 2010/10/20_04:34:35 info: cib_stats: Processed 3 operations (3333.00us average, 0% utilization) in the last 10min cib[6793]: 2010/10/20_09:14:34 info: cib_stats: Processed 2 operations (0.00us average, 0% utilization) in the last 10min cib[6793]: 2010/10/20_09:54:34 info: cib_stats: Processed 2 operations (10000.00us average, 0% utilization) in the last 10min cib[6793]: 2010/10/20_10:17:39 WARN: cib_peer_callback: Discarding cib_apply_diff message (9bf8) from mds_172-20-6-60: not in our membership cib[6793]: 2010/10/20_10:17:39 WARN: cib_peer_callback: Discarding cib_apply_diff message (9bf9) from mds_172-20-6-60: not in our membership cib[6793]: 2010/10/20_10:17:39 WARN: cib_peer_callback: Discarding cib_apply_diff message (9bfb) from crmd[6797]: 2010/10/20_04:17:28 info: process_client_disconnect: Received HUP from pengine:[-1] crmd[6797]: 2010/10/20_04:17:28 WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_TERMINATE) crmd[6797]: 2010/10/20_04:17:28 WARN: do_dc_join_finalize: join-2: We are still in a transition. Delaying until the TE completes. cib[6793]: 2010/10/20_04:17:29 WARN: cib_peer_callback: Discarding cib_replace message (71d2) from mds_172-20-6-60: not in our membership cib[6793]: 2010/10/20_04:17:29 WARN: cib_peer_callback: Discarding cib_apply_diff message (71d3) from mds_172-20-6-60: not in our membership cib[6793]: 2010/10/20_04:17:29 WARN: cib_peer_callback: Discarding cib_apply_diff message (71d4) from mds_172-20-6-60: not in our membership cib[6793]: 2010/10/20_04:17:29 WARN: cib_peer_callback: Discarding cib_apply_diff message (71d5) from mds_172-20-6-60: not in our membership lrmd[6794]: 2010/10/20_04:17:30 info: RA output: (drbddisk_1:start:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config Command '/sbin/drbdsetup 0 primary' terminated with exit code 11 lrmd[6794]: 2010/10/20_04:17:32 info: RA output: (drbddisk_1:start:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config Command '/sbin/drbdsetup 0 primary' terminated with exit code 11 lrmd[6794]: 2010/10/20_04:17:34 info: RA output: (drbddisk_1:start:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config Command '/sbin/drbdsetup 0 primary' terminated with exit code 11 *WARN: crmd_ha_msg_callback: Ignoring HA message (op=noop) from mds_172-20-6-60: not in our membership list (size=1)* Best Regards Ruchi Priya -- Regards Ruchi Priya -- Regards Ruchi Priya _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
