Hi, I have a problem with a two-node cluster running Heartbeat 3.0.3 on Debian 6.0.3. The two nodes are set up as VMware VMs. To test failover behaviour I disable eth0 and eth1. Both NICs are configured in Heartbeat: eth0 with multicast and eth1 with broadcast. In most cases, when I reactivate the NICs, crm_mon shows both nodes running again after a few seconds. But sometimes it does not, and then ha-log shows the following errors:
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Link debian60-clnode1:eth1 up.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: CRIT: Cluster node debian60-clnode1 returning after partition.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Deadtime value may be too small.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: See FAQ for information on tuning deadtime.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Late heartbeat: Node debian60-clnode1: interval 234500 ms
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Status update for node debian60-clnode1: status active
Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: notice: crmd_ha_status_callback: Status update: Node debian60-clnode1 now has status [active] (DC=true)
Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: info: crm_update_peer_proc: debian60-clnode1.ais is now online
Nov 14 12:05:21 debian60-clnode2 cib: [32580]: WARN: cib_peer_callback: Discarding cib_apply_diff message (1518) from debian60-clnode1: not in our membership
Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: info: Break tie for 2 nodes cluster
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: no mbr_track info
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=514)
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: NEW MEMBERSHIP: trans=514, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: CURRENT: debian60-clnode2 [nodeid=1, born=514]
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: no mbr_track info
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=514)
Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: WARN: ccm_state_joined: dropping message of type CCM_TYPE_PROTOVERSION_RESP. Is this a Byzantine failure?
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/1497, version=0.1131.11): ok (rc=0)

The only way to get the cluster back into a consistent state is to restart Heartbeat on one node. I tested different downtimes, and the behaviour does not seem to depend on the length of the downtime. Is this behaviour of Heartbeat correct?

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
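For reference, the communication paths described in the question (eth0 multicast, eth1 broadcast) plus the timing knobs the "Deadtime value may be too small" warning refers to would be declared in /etc/ha.d/ha.cf roughly like this. This is an illustrative sketch: the multicast group, port, and timing values below are placeholders, not the poster's actual settings.

```
# /etc/ha.d/ha.cf -- illustrative sketch, not the poster's real config
mcast eth0 225.0.0.1 694 1 0   # device, mcast group, UDP port, TTL, loop
bcast eth1                     # broadcast heartbeat on the second link

# Timing -- the "Deadtime value may be too small" warning points here:
keepalive 1        # interval between heartbeats (seconds)
warntime 10        # log "late heartbeat" warnings after this long
deadtime 30        # declare a peer dead after this long without heartbeats
initdead 60        # extra allowance while nodes are booting

node debian60-clnode1
node debian60-clnode2
crm respawn        # hand cluster resource management to the Pacemaker CRM
```

Note that raising deadtime only changes how quickly a partition is declared, not how the cluster reconciles afterwards; in a two-node cluster with no fencing or third vote, a rejoin after a full partition can race like the log above shows, which is why the Split_Brain wiki page linked in the log generally recommends STONITH or a quorum tiebreaker.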
