On Mon, Nov 14, 2011 at 12:43:03PM +0100, Marcus wrote:
> Hi,
>
> I have some problems with a 2-node cluster running Heartbeat 3.0.3 on

Please reproduce with the most recent heartbeat,
http://hg.linux-ha.org/heartbeat-STABLE_3_0/
(http:// ... /archive/tip.tar.bz2)
There have been changes in the cleanup path for dead nodes.

> Debian 6.0.3. I set up 2 nodes in VMware VMs.

Did you clone one node into the other, or do a clean, independent install of
both?
Pacemaker version? Glue version?

> To test failover behaviour I
> disable eth0 and eth1.

How, exactly?
Is this "symmetrical" or "asymmetrical", i.e. does it block only incoming,
only outgoing, or both?
Have both nodes seen the respective other as dead?
Does it block both links at the same time, or one after the other?
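To illustrate what I mean, a rough sketch of two different ways to break the
links, run on the node itself (interface names are yours; 694 is heartbeat's
default udpport, adjust if you changed it in ha.cf; disconnecting the virtual
NIC in VMware is yet another, different failure mode):

    # "symmetric": the link goes away in both directions at once
    ifdown eth0 ; ifdown eth1

    # "asymmetric": drop only incoming heartbeat packets (UDP port 694);
    # the peer still receives ours
    iptables -I INPUT -p udp --dport 694 -j DROP

    # undo again later
    iptables -D INPUT -p udp --dport 694 -j DROP
    ifup eth0 ; ifup eth1

Which of these (or something else) you actually did determines what each node
was able to see of the other during the test.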
> The NICs are configured in heartbeat, eth0 with
> multicast and eth1 with broadcast. In most cases when I reactivate the
> NICs, crm_mon shows me both nodes running after a few seconds. But
> sometimes not. Then the ha-log shows the following errors:
>
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Link
> debian60-clnode1:eth1 up.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: CRIT: Cluster node
> debian60-clnode1 returning after partition.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: For information on
> cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Deadtime value may
> be too small.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: See FAQ for
> information on tuning deadtime.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: URL:
> http://linux-ha.org/wiki/FAQ#Heavy_Load
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Late heartbeat:
> Node debian60-clnode1: interval 234500 ms
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Status update for
> node debian60-clnode1: status active
> Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: notice:
> crmd_ha_status_callback: Status update: Node debian60-clnode1 now has status
> [active] (DC=true)
> Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: info: crm_update_peer_proc:
> debian60-clnode1.ais is now online
> Nov 14 12:05:21 debian60-clnode2 cib: [32580]: WARN: cib_peer_callback:
> Discarding cib_apply_diff message (1518) from debian60-clnode1: not in our
> membership
> Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: info: Break tie for 2 nodes
> cluster
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got
> an event OC_EV_MS_INVALID from ccm
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: no
> mbr_track info
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got
> an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event:
> instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: crmd_ccm_msg_callback:
> Quorum (re)attained after event=NEW MEMBERSHIP (id=514)
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: NEW
> MEMBERSHIP: trans=514, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail:
> CURRENT: debian60-clnode2 [nodeid=1, born=514]

Membership still only sees _one_ node here: nodes=1.
This somehow looks like stale data.

What does /usr/lib/heartbeat/ccm_test_client say during such an experiment,
on both nodes? (Or /usr/lib64/heartbeat/; it may be in the -dev package.)
Btw, as long as it can talk to ccm it does not terminate by itself; you need
to ctrl-c it...
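Roughly like this, started on both nodes before you break the links and left
running until well after you restore them (the path and the log file name are
just an example for your setup):

    # does not terminate by itself; stop it with ctrl-c when you are done
    /usr/lib/heartbeat/ccm_test_client 2>&1 | tee /tmp/ccm_$(hostname).log

Then compare the membership it reports on both nodes with what crm_mon claims
at the same time.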
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: populate_cib_nodes_ha:
> Requesting the list of configured nodes
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an
> event OC_EV_MS_INVALID from ccm
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: no
> mbr_track info
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an
> event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event:
> instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_ccm_msg_callback:
> Processing CCM event=NEW MEMBERSHIP (id=514)
> Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: WARN: ccm_state_joined:
> dropping message of type CCM_TYPE_PROTOVERSION_RESP. Is this a Byzantine
> failure?
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_process_request:
> Operation complete: op cib_modify for section nodes (origin=local/crmd/1497,
> version=0.1131.11): ok (rc=0)
>
> The only way to get the cluster back to a consistent status is to
> restart heartbeat on one node.
>
> I tested different downtimes, and it seems that the behaviour does not
> depend on the downtime.
>
> Is this behaviour of heartbeat correct? I don't think so.

Anything about lost packets?

If you can "easily" reproduce this (with the most recent heartbeat), a tcpdump
may be useful, from, say, 10 heartbeats before you disable the links to half a
minute after you re-enable them.
(Or just crank up debugging so high that you even see message dumps in the
logs...)
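For the capture, something along these lines on both nodes should do (the
capture file names are just examples; the filter assumes the default
udpport 694):

    # start well before you disable the links
    tcpdump -i eth0 -w /tmp/hb_eth0_$(hostname).pcap udp port 694 &
    tcpdump -i eth1 -w /tmp/hb_eth1_$(hostname).pcap udp port 694 &

    # ... run the test, wait about half a minute after re-enabling ...

    # stop the captures (from the same shell)
    kill %1 %2

For the debug route, raising the "debug" level in ha.cf and restarting
heartbeat should, if I remember correctly, be enough to get message dumps
into the logs.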
-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/