Hello, I have one cluster that is acting very strangely. I have never seen this kind of error on any other cluster (20+). The version is 2.1.4 (yes, I know it's old, but it's the latest stable on Debian Lenny and migrating is not an option; all the other clusters run fine on the same Heartbeat version).
What happens is that the heartbeat communication layer seems to get disrupted. I see a lot of eth0 up/down events on one node, and on the other I see this:

Mar 15 06:26:43 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 06:28:24 node1 ccm: [3801]: ERROR: llm_set_uptime: Negative uptime -1932263424 for node 0 [node1]
Mar 15 06:30:34 node1 ccm: [3801]: ERROR: llm_set_uptime: Negative uptime -1898708992 for node 0 [node1]
Mar 15 06:36:44 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 06:43:26 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:10:13 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:26:57 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:39:25 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:46:59 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:52:00 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:52:57 node1 crmd: [3806]: ERROR: crmd_ha_msg_callback: Another DC detected: node2 (op=noop)
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_add_running: Resource heartbeat::my_resource_2 appears to be active on 2 nodes.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_add_running: Resource heartbeat::my_ressource appears to be active on 2 nodes.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_create_actions: Attempting recovery of resource my_resource
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: native_create_actions: Attempting recovery of resource my_resource_2
Mar 15 07:53:02 node1 pengine: [14552]: ERROR: process_pe_message: Transition 1: ERRORs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-error-70.bz2
Mar 15 07:53:41 node1 ccm: [3801]: ERROR: ccm_state_sent_memlistreq: dropping message of type CCM_TYPE_FINAL_MEMLIST. Is this a Byzantine failure?
Mar 15 07:54:07 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-4 from node2
Mar 15 07:54:10 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-6 from node2
Mar 15 07:54:25 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-12 from node2
Mar 15 07:54:26 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-13 from node2
Mar 15 07:54:28 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-14 from node2
Mar 15 07:54:32 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-16 from node2
Mar 15 07:54:41 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-20 from node2
Mar 15 07:54:53 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-25 from node2
Mar 15 07:54:57 node1 crmd: [3806]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-27 from node2
Mar 15 07:56:16 node1 crmd: [3806]: ERROR: crmd_ha_msg_callback: Another DC detected: node2 (op=noop)

And on the other node:

Mar 15 06:41:50 node2 ccm: [3802]: ERROR: llm_set_uptime: Negative uptime -1026228224 for node 1 [node2]
Mar 15 07:08:53 node2 ccm: [3802]: ERROR: llm_set_uptime: Negative uptime -2818048 for node 1 [node2]
Mar 15 07:42:10 node2 ccm: [3802]: ERROR: llm_set_uptime: Negative uptime -1546190848 for node 1 [node2]
Mar 15 07:45:11 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1479081984 for node 1 [node2]
Mar 15 07:49:40 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1344864256 for node 1 [node2]
Mar 15 07:52:57 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1244200960 for node 1 [node2]
Mar 15 07:54:08 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-124 from node1
Mar 15 07:54:11 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-126 from node1
Mar 15 07:54:26 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-132 from node1
Mar 15 07:54:31 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-135 from node1
Mar 15 07:54:35 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: Node node1 is not a member
Mar 15 07:54:35 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: join-138: NACK'ing node node1 (ref join_request-crmd-1268636075-47939)
Mar 15 07:54:36 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-138 from node1
Mar 15 07:56:15 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -1076428800 for node 1 [node2]
Mar 15 07:56:22 node2 crm_mon: [10016]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:57:59 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: Node node1 is not a member
Mar 15 07:57:59 node2 crmd: [3743]: ERROR: do_dc_join_filter_offer: join-153: NACK'ing node node1 (ref join_request-crmd-1268636278-48047)
Mar 15 07:58:00 node2 crmd: [3743]: ERROR: do_dc_join_ack: Join not in progress: ignoring join-153 from node1
Mar 15 07:59:33 node2 ccm: [3738]: ERROR: llm_set_uptime: Negative uptime -958988288 for node 1 [node2]
Mar 15 07:59:38 node2 crm_mon: [11936]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 07:59:40 node2 crm_mon: [11951]: ERROR: See http://linux-ha.org/v2/faq/resource_too_active for more information.
Mar 15 08:00:25 node2 crmd: [3743]: ERROR: crmd_ha_msg_callback: Another DC detected: node1 (op=noop)

The interface goes down on one node every couple of minutes:

Mar 15 06:26:41 node2 heartbeat: [2708]: info: Link node1:eth0 dead.
Mar 15 06:26:42 node2 heartbeat: [2708]: info: Link node1:eth0 up.

On node1 there are only one or two "Link node2:eth0 dead" events in 24 hours. At first I thought it was a firewall issue, but there doesn't seem to be any firewall in between. What ends up happening is that each node reboots itself at random intervals (because "crm on" is set), so they STONITH themselves for no apparent reason?! Any ideas?
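To rule the firewall theory in or out, here is a minimal sketch of the checks I can think of; it assumes the default udpport of 694 and broadcast heartbeats on eth0, so adjust to your ha.cf:

    # Watch the raw heartbeat traffic on the suspect link
    # (694 is the default udpport; check /etc/ha.d/ha.cf if it was changed).
    tcpdump -i eth0 -n udp port 694

    # Check whether the NIC itself is flapping (driver/switch problem
    # rather than heartbeat): link state, error counters, kernel messages.
    ethtool eth0
    ip -s link show eth0
    dmesg | grep -i eth0

If the UDP packets keep flowing while heartbeat still logs the link as dead, that would point at heartbeat (or a host too starved to process packets in time) rather than the network.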
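And in case the single eth0 path really is the problem, a redundant heartbeat link would at least stop one flapping interface from splitting the membership and triggering STONITH. A sketch of the relevant /etc/ha.d/ha.cf directives (the second interface, the serial line, and the timings are illustrative, not our actual config):

    keepalive 1          # seconds between heartbeats
    warntime 10          # warn after this long without a heartbeat
    deadtime 30          # declare the peer dead after this long
    initdead 60          # extra grace period at startup (>= 2 * deadtime)
    udpport 694
    bcast eth0           # existing (flaky) link
    bcast eth1           # second, independent link (illustrative)
    serial /dev/ttyS0    # optional last-resort path over a null-modem cable
    baud 19200
    node node1
    node node2
    crm on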
