On Mon, Nov 14, 2011 at 12:43:03PM +0100, Marcus wrote:
> Hi,
>
> I have some problems with a 2-node cluster running Heartbeat 3.0.3 on

Please reproduce with the most recent heartbeat,
http://hg.linux-ha.org/heartbeat-STABLE_3_0/
(http:// ... /archive/tip.tar.bz2)
There have been changes in the cleanup path for dead nodes.

> Debian 6.0.3. I set up 2 nodes in VMware VMs.

Did you clone one node into the other, or do a clean, independent install of
both?
Pacemaker version? Glue version?

> To test failover behaviour I
> disable eth0 and eth1.

How, exactly?
Is this "symmetrical" or "asymmetrical", i.e. does it block only incoming,
only outgoing, or both?
Have both nodes seen the respective other as dead?
Does it block both links at the same time, or one after the other?
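To illustrate what I mean, a rough sketch of two different ways to break the
links, run on the node itself (interface names are yours; 694 is heartbeat's
default udpport, adjust if you changed it in ha.cf; disconnecting the virtual
NIC in VMware is yet another, different failure mode):

    # "symmetric": the link goes away in both directions at once
    ifdown eth0 ; ifdown eth1

    # "asymmetric": drop only incoming heartbeat packets (UDP port 694);
    # the peer still receives ours
    iptables -I INPUT -p udp --dport 694 -j DROP

    # undo again later
    iptables -D INPUT -p udp --dport 694 -j DROP
    ifup eth0 ; ifup eth1

Which of these (or something else) you actually did determines what each node
was able to see of the other during the test.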
> The NICs are configured in heartbeat, eth0 with
> multicast and eth1 with broadcast. In most cases when I reactivate the
> NICs, crm_mon shows me both nodes running after a few seconds. But
> sometimes not. Then the ha-log shows the following errors:
>
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Link
> debian60-clnode1:eth1 up.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: CRIT: Cluster node
> debian60-clnode1 returning after partition.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: For information on
> cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Deadtime value may
> be too small.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: See FAQ for
> information on tuning deadtime.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: URL:
> http://linux-ha.org/wiki/FAQ#Heavy_Load
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Late heartbeat:
> Node debian60-clnode1: interval 234500 ms
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Status update for
> node debian60-clnode1: status active
> Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: notice:
> crmd_ha_status_callback: Status update: Node debian60-clnode1 now has status
> [active] (DC=true)
> Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: info: crm_update_peer_proc:
> debian60-clnode1.ais is now online
> Nov 14 12:05:21 debian60-clnode2 cib: [32580]: WARN: cib_peer_callback:
> Discarding cib_apply_diff message (1518) from debian60-clnode1: not in our
> membership
> Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: info: Break tie for 2 nodes
> cluster
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got
> an event OC_EV_MS_INVALID from ccm
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: no
> mbr_track info
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got
> an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event:
> instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: crmd_ccm_msg_callback:
> Quorum (re)attained after event=NEW MEMBERSHIP (id=514)
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: NEW
> MEMBERSHIP: trans=514, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail:
> CURRENT: debian60-clnode2 [nodeid=1, born=514]

Membership still only sees _one_ node here: nodes=1.
This somehow looks like stale data.

What does /usr/lib/heartbeat/ccm_test_client say during such an experiment,
on both nodes? (Or /usr/lib64/heartbeat/; it may be in the -dev package.)
Btw, as long as it can talk to ccm it does not terminate by itself; you need
to ctrl-c it...
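Roughly like this, started on both nodes before you break the links and left
running until well after you restore them (the path and the log file name are
just an example for your setup):

    # does not terminate by itself; stop it with ctrl-c when you are done
    /usr/lib/heartbeat/ccm_test_client 2>&1 | tee /tmp/ccm_$(hostname).log

Then compare the membership it reports on both nodes with what crm_mon claims
at the same time.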
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: populate_cib_nodes_ha:
> Requesting the list of configured nodes
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: no
> mbr_track info
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an
> event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event:
> instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_ccm_msg_callback:
> Processing CCM event=NEW MEMBERSHIP (id=514)
> Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: WARN: ccm_state_joined:
> dropping message of type CCM_TYPE_PROTOVERSION_RESP. Is this a Byzantine
> failure?
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_process_request:
> Operation complete: op cib_modify for section nodes (origin=local/crmd/1497,
> version=0.1131.11): ok (rc=0)
>
> The only way to get the cluster back to a consistent status is to
> restart heartbeat on one node.
>
> I tested different downtimes, and it seems that the behaviour does not
> depend on the downtime.
>
> Is this behaviour of heartbeat correct? I don't think so.

Anything about lost packets?

If you can "easily" reproduce this (with the most recent heartbeat), a tcpdump
may be useful, from, say, 10 heartbeats before you disable the links to half a
minute after you re-enable them.
(Or just crank up debugging so high that you even see message dumps in the
logs...)
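For the capture, something along these lines on both nodes should do (the
capture file names are just examples; the filter assumes the default
udpport 694):

    # start well before you disable the links
    tcpdump -i eth0 -w /tmp/hb_eth0_$(hostname).pcap udp port 694 &
    tcpdump -i eth1 -w /tmp/hb_eth1_$(hostname).pcap udp port 694 &

    # ... run the test, wait about half a minute after re-enabling ...

    # stop the captures (from the same shell)
    kill %1 %2

For the debug route, raising the "debug" level in ha.cf and restarting
heartbeat should, if I remember correctly, be enough to get message dumps
into the logs.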
-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/