Hi, I have a problem with a two-node cluster running Heartbeat 3.0.3 on Debian 6.0.3. The two nodes are set up as VMware VMs. To test failover behaviour I disable eth0 and eth1. Both NICs are configured in Heartbeat: eth0 with multicast and eth1 with broadcast. In most cases, when I reactivate the NICs, crm_mon shows both nodes running again after a few seconds. But sometimes it does not, and then ha-log shows the following errors:
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Link debian60-clnode1:eth1 up.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: CRIT: Cluster node debian60-clnode1 returning after partition.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Deadtime value may be too small.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: See FAQ for information on tuning deadtime.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Late heartbeat: Node debian60-clnode1: interval 234500 ms
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Status update for node debian60-clnode1: status active
Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: notice: crmd_ha_status_callback: Status update: Node debian60-clnode1 now has status [active] (DC=true)
Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: info: crm_update_peer_proc: debian60-clnode1.ais is now online
Nov 14 12:05:21 debian60-clnode2 cib: [32580]: WARN: cib_peer_callback: Discarding cib_apply_diff message (1518) from debian60-clnode1: not in our membership
Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: info: Break tie for 2 nodes cluster
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: no mbr_track info
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=514)
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: NEW MEMBERSHIP: trans=514, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: CURRENT: debian60-clnode2 [nodeid=1, born=514]
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: no mbr_track info
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=514)
Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: WARN: ccm_state_joined: dropping message of type CCM_TYPE_PROTOVERSION_RESP. Is this a Byzantine failure?
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/1497, version=0.1131.11): ok (rc=0)

The only way to get the cluster back into a consistent state is to restart Heartbeat on one node. I tested different downtimes, and the behaviour does not seem to depend on the length of the downtime. Is this behaviour of Heartbeat correct?

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
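For reference, the communication paths described in the question (eth0 multicast, eth1 broadcast) plus the timing knobs the "Deadtime value may be too small" warning refers to would be declared in /etc/ha.d/ha.cf roughly like this. This is an illustrative sketch: the multicast group, port, and timing values below are placeholders, not the poster's actual settings.

```
# /etc/ha.d/ha.cf -- illustrative sketch, not the poster's real config
mcast eth0 225.0.0.1 694 1 0   # device, mcast group, UDP port, TTL, loop
bcast eth1                     # broadcast heartbeat on the second link

# Timing -- the "Deadtime value may be too small" warning points here:
keepalive 1        # interval between heartbeats (seconds)
warntime 10        # log "late heartbeat" warnings after this long
deadtime 30        # declare a peer dead after this long without heartbeats
initdead 60        # extra allowance while nodes are booting

node debian60-clnode1
node debian60-clnode2
crm respawn        # hand cluster resource management to the Pacemaker CRM
```

Note that raising deadtime only changes how quickly a partition is declared, not how the cluster reconciles afterwards; in a two-node cluster with no fencing or third vote, a rejoin after a full partition can race like the log above shows, which is why the Split_Brain wiki page linked in the log generally recommends STONITH or a quorum tiebreaker.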
