Hello,

I have a three-node cluster using pacemaker/corosync. When I reboot one node,
the node is sometimes unable to rejoin the cluster. I can reproduce this kind of
split brain about 10-20% of the time when I shut down a node.

What do you think of this problem?

My questions are:
- Is this a known problem?
- Is there any workaround to avoid it?
- How can I solve this problem?

[testserver001]
============
Last updated: Sat Mar 10 14:18:49 2012
Stack: openais
Current DC: NONE
3 Nodes configured, 3 expected votes
4 Resources configured.
============

OFFLINE: [ testserver001 testserver002 testserver003 ]

Migration summary:

[testserver002]
============
Last updated: Sat Mar 10 14:15:17 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.
============

Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

 Resource Group: testgroup
     testrsc (lsb:testmgr): Started testserver002
 stonith-testserver002 (stonith:external/ipmi): Started testserver003
 stonith-testserver003 (stonith:external/ipmi): Started testserver002
 stonith-testserver001 (stonith:external/ipmi): Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:

[testserver003]
============
Last updated: Sat Mar 10 14:19:07 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.
============

Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

 Resource Group: testgroup
     testrsc (lsb:testmgr): Started testserver002
 stonith-testserver002 (stonith:external/ipmi): Started testserver003
 stonith-testserver003 (stonith:external/ipmi): Started testserver002
 stonith-testserver001 (stonith:external/ipmi): Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:

- Checked information
  + https://bugzilla.redhat.com/show_bug.cgi?id=525589
    It looks like the packages I am using already support this.
  + http://comments.gmane.org/gmane.linux.highavailability.user/36101
    I checked the entries in /etc/hosts but did not find a wrong entry.
    ===
    127.0.0.1 testserver001 localhost
    ::1 localhost6.localdomain6 localhost6
    ===

- Investigation with tcpdump

  OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
  MESSAGE_TYPE_MCAST. I took this capture in a VMware environment.

  + MESSAGE_TYPE_ORF_TOKEN
    No.  Time                       Source     Destination Protocol Length Info
    119  2012-03-19 22:00:15.250310 172.27.4.1 172.27.4.2  UDP      112    Source port: 23489 Destination port: 23490

    Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
    Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
    Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
    User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
    Data (70 bytes)

    0000  00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".............
    0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b   ................
    (snip)

  + MESSAGE_TYPE_MCAST
    No.  Time                       Source     Destination  Protocol Length Info
    5141 2012-03-19 22:01:19.198346 172.27.4.2 226.94.16.16 UDP      1486   Source port: 23489 Destination port: 23490

    Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
    Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
    Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
    User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
    Data (1444 bytes)

    0000  01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".............
    0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b   ................
    (snip)

  NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
  see the following messages in pacemaker.log.

  + MESSAGE_TYPE_ORF_TOKEN
    No.   Time                       Source     Destination Protocol Length Info
    39605 2012-03-10 14:18:13.826778 172.27.4.2 172.27.4.3  UDP      112    Source port: 23489 Destination port: 23490

    Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
    Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
    Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
    User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
    Data (70 bytes)

    0000  00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00   ..".............
    0010  ff ff ff ff ac 1b 04 01 ac 1b 04 01 02 00 ac 1b   ................
    (snip)

  + pacemaker.log
    Mar 10 14:20:09 testserver001 crmd: [7551]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
    Mar 10 14:20:09 testserver001 crmd: [7551]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
    Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
    Mar 10 14:22:09 testserver001 crmd: [7551]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_te_control: Registering TE UUID: b2bb3cc4-cead-475c-bb73-3adbb60142ae
    Mar 10 14:22:09 testserver001 crmd: [7551]: WARN: cib_client_add_notify_callback: Callback already present
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: set_graph_functions: Setting custom graph functions
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_takeover: Taking over DC status for this partition
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_readwrite: We are now in R/W mode
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.143.0): ok (rc=0)
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.143.0): ok (rc=0)
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.143.0): ok (rc=0)
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.143.0): ok (rc=0)
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Checking for expired actions every 900000ms
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Sending expected-votes=3 to corosync
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.143.0): ok (rc=0)
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
    Mar 10 14:22:09 testserver001 crmd: [7551]: info: te_connect_stonith: Attempting connection to fencing daemon...
    Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.143.0): ok (rc=0)
    Mar 10 14:22:10 testserver001 crmd: [7551]: info: te_connect_stonith: Connected

  + enum message_type {
        MESSAGE_TYPE_ORF_TOKEN = 0,          /* Ordering, Reliability, Flow (ORF) control Token */
        MESSAGE_TYPE_MCAST = 1,              /* ring ordered multicast message */
        MESSAGE_TYPE_MEMB_MERGE_DETECT = 2,  /* merge rings if there are available rings */
        MESSAGE_TYPE_MEMB_JOIN = 3,          /* membership join message */
        MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4,  /* membership commit token */
        MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5,  /* cancel the holding of the token */
    };

- packages on CentOS 5.6
  + pacemaker-1.0.10-1.4.el5
  + corosync-1.2.5-1.3.el5

Thank you in advance,
Hisashi Osanai

Hisashi Osanai (osanai.hisa...@jp.fujitsu.com)

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
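
For reference, here is a minimal sketch of how the captured payloads can be matched against the enum message_type values quoted in this post. It assumes that the first byte of the UDP payload carries the message type. That is consistent with the dumps in this post (00 for the token, 01 for the multicast) but has not been checked against the corosync source, and it only applies to unencrypted traffic. The function name totem_message_type is hypothetical and is not part of corosync.

```python
# Values copied from the enum message_type quoted in this post.
MESSAGE_TYPES = {
    0: "MESSAGE_TYPE_ORF_TOKEN",
    1: "MESSAGE_TYPE_MCAST",
    2: "MESSAGE_TYPE_MEMB_MERGE_DETECT",
    3: "MESSAGE_TYPE_MEMB_JOIN",
    4: "MESSAGE_TYPE_MEMB_COMMIT_TOKEN",
    5: "MESSAGE_TYPE_TOKEN_HOLD_CANCEL",
}


def totem_message_type(payload_hex: str) -> str:
    """Name the message type for a hex dump of the UDP payload.

    Assumption: the first payload byte is the message type.
    """
    payload = bytes.fromhex(payload_hex)  # spaces between bytes are allowed
    if not payload:
        raise ValueError("empty payload")
    return MESSAGE_TYPES.get(payload[0], f"UNKNOWN({payload[0]})")


# First bytes of the payloads from the OK-case capture.
print(totem_message_type("00 00 22 ff ac 1b 04 01"))
print(totem_message_type("01 02 22 ff ac 1b 04 02"))
```

Running this over every frame of the NG capture would show whether anything other than the ORF token is exchanged while the node stays in S_PENDING.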