Hello, I have one cluster that is acting very strangely. I have never seen this kind of error on any other cluster (20+). The version is 2.1.4 (yes, I know it's old, but it's the latest stable on Debian Lenny and migrating is not an option; all the other clusters run fine on the same Heartbeat version).
What happens is that the heartbeat communication layer seems to get disrupted. I see a lot of eth0 up/down events on one node, and on the other I see this:

Mar 15 06:26:43 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 06:28:24 node1 ccm: [3801]: ERROR: llm_set_uptime: Negative uptime -1932263424 for node 0 [node1]
Mar 15 06:30:34 node1 ccm: [3801]: ERROR: llm_set_uptime: Negative uptime -1898708992 for node 0 [node1]
Mar 15 06:36:44 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 06:43:26 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:10:13 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:26:57 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:39:25 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:46:59 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:52:00 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:52:57 node1 crmd: [3806]: ERROR: crmd_ha_msg_callback: Another DC detected: node2 (op=noop)
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_add_running: Resource heartbeat::my_resource_2 appears to be active on 2 nodes.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_add_running: Resource heartbeat::my_ressource appears to be active on 2 nodes.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_create_actions: Attempting recovery of resource my_resource
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_create_actions: Attempting recovery of resource my_resource_2
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: process_pe_message: Transition 1: ERRORs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-error-70.bz2
Mar 15 07:53:41 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:54:07 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-4 from node2
Mar 15 07:54:10 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-6 from node2
Mar 15 07:54:25 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-12 from node2
Mar 15 07:54:26 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-13 from node2
Mar 15 07:54:28 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-14 from node2
Mar 15 07:54:32 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-16 from node2
Mar 15 07:54:41 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-20 from node2
Mar 15 07:54:53 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-25 from node2
Mar 15 07:54:57 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-27 from node2
Mar 15 07:56:16 node1 crmd: [3806]: ERROR: crmd_ha_msg_callback: Another DC detected: node2 (op=noop)

And on the other node:

Mar 15 06:41:50 node2 ccm: [3802]: ERROR: llm_set_uptime: Negative uptime -1026228224 for node 1 [node2]
Mar 15 07:08:53 node2 ccm: [3802]: ERROR: llm_set_uptime: Negative uptime -2818048 for node 1 [node2]
Mar 15 07:42:10 node2 ccm: [3802]: ERROR: llm_set_uptime: Negative uptime -1546190848 for node 1 [node2]
Mar 15 07:45:11 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1479081984 for node 1 [node2]
Mar 15 07:49:40 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1344864256 for node 1 [node2]
Mar 15 07:52:57 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1244200960 for node 1 [node2]
Mar 15 07:54:08 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-124 from node1
Mar 15 07:54:11 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-126 from node1
Mar 15 07:54:26 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-132 from node1
Mar 15 07:54:31 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-135 from node1
Mar 15 07:54:35 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: Node node1 is not a member
Mar 15 07:54:35 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: join-138: NACK'ing node node1 (ref join_request-crmd-1268636075-47939)
Mar 15 07:54:36 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-138 from node1
Mar 15 07:56:15 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1076428800 for node 1 [node2]
Mar 15 07:56:22 node2 crm_mon: [10016]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:57:59 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: Node node1 is not a member
Mar 15 07:57:59 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: join-153: NACK'ing node node1 (ref join_request-crmd-1268636278-48047)
Mar 15 07:58:00 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-153 from node1
Mar 15 07:59:33 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -958988288 for node 1 [node2]
Mar 15 07:59:38 node2 crm_mon: [11936]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:59:40 node2 crm_mon: [11951]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 08:00:25 node2 crmd: [3743]: ERROR: crmd_ha_msg_callback: Another DC detected: node1 (op=noop)

The interface goes down on one node every couple of minutes:

Mar 15 06:26:41 node2 heartbeat: [2708]: info: Link node1:eth0 dead.
Mar 15 06:26:42 node2 heartbeat: [2708]: info: Link node1:eth0 up.

On node1 there are only one or two "Link node2:eth0 dead" events in 24 hours. At first I thought it was a firewall issue, but there doesn't seem to be any firewall in between. What ends up happening is that each node reboots itself at random intervals (because "crm on" is set), so they STONITH themselves for no apparent reason?! Any ideas?
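To rule the firewall theory in or out, here is a minimal sketch of the checks I can think of; it assumes the default udpport of 694 and broadcast heartbeats on eth0, so adjust to your ha.cf:

    # Watch the raw heartbeat traffic on the suspect link
    # (694 is the default udpport; check /etc/ha.d/ha.cf if it was changed).
    tcpdump -i eth0 -n udp port 694

    # Check whether the NIC itself is flapping (driver/switch problem
    # rather than heartbeat): link state, error counters, kernel messages.
    ethtool eth0
    ip -s link show eth0
    dmesg | grep -i eth0

If the UDP packets keep flowing while heartbeat still logs the link as dead, that would point at heartbeat (or a host too starved to process packets in time) rather than the network.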
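And in case the single eth0 path really is the problem, a redundant heartbeat link would at least stop one flapping interface from splitting the membership and triggering STONITH. A sketch of the relevant /etc/ha.d/ha.cf directives (the second interface, the serial line, and the timings are illustrative, not our actual config):

    keepalive 1          # seconds between heartbeats
    warntime 10          # warn after this long without a heartbeat
    deadtime 30          # declare the peer dead after this long
    initdead 60          # extra grace period at startup (>= 2 * deadtime)
    udpport 694
    bcast eth0           # existing (flaky) link
    bcast eth1           # second, independent link (illustrative)
    serial /dev/ttyS0    # optional last-resort path over a null-modem cable
    baud 19200
    node node1
    node node2
    crm on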
