Hi,
we have a two-node cluster running Heartbeat 2.1.4 on RHEL 5 Linux.
We run OpenVZ on both nodes, with a shared GFS disk holding the VE data, and ManageVE resources, some on node1 and some on node2. We are seeing strange behavior: sometimes the VEs of one node appear started on both nodes. If we monitor with hb_gui, one of the nodes is shown as stopped, although heartbeat is running there. The logs on the node with all resources started look like this:

........................
heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped 
around (uptime).
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 4294966306 ms (> 510 ms) before being called (GSource: 0x9744758)
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status 
update: Node parmenides.upc.es now has status [dead]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 
4359079925 should have started at 64112728
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, 
new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 4294966546 ms (> 510 ms) before being called (GSource: 0x9744988)
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST: 
parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, 
new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 
4359079925 should have started at 64112704
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER: 
heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=41)
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 4294967216 ms (> 30000 ms) before being called (GSource: 0x9744a08)
ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead node parmenides.upc.es from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 
4359079925 should have started at 64112637
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP: 
trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch 
function for client audit was delayed 4294962206 ms
(> 5000 ms) before being called (GSource: 0x9744908)
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail:        CURRENT: 
heraclito.upc.es [nodeid=0, born=41]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 
4359079925 should have started at 64113138
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail:        LOST:    
parmenides.upc.es [nodeid=1, born=40]
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch 
function for memory stats was delayed 4244815296 ms
(> 5000 ms) before being called (GSource: 0x9744888)
..................
....................
heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es 
returning after partition.
heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster 
partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning 
deadtime.
heartbeat[24950]: 2008/12/19_18:07:55 info: URL: 
http://linux-ha.org/FAQ#heavy_load
heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node 
parmenides.upc.es: status active
crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status 
update: Node parmenides.upc.es now has status [active]
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, 
new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, 
new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: 
heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=42)
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: 
parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP: trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail:        CURRENT: 
heraclito.upc.es [nodeid=0, born=38]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail:        CURRENT: 
parmenides.upc.es [nodeid=1, born=42]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail:        NEW:     
parmenides.upc.es [nodeid=1, born=42]
.................


It seems that one of the nodes is declared lost and later recovers, but the cluster does not return to normal: resources belonging to the recovered node keep running on the other node as well. The only way I have found to resolve the situation is to take some manual action on the cluster, for instance stopping a resource. If I do that, the cluster seems to notice the situation, detects that some resources are running on more than one node, and corrects it.

What is happening? Is heavy load affecting us, as the log messages suggest? And why doesn't the cluster recover its normal status on its own?
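In case it helps: since the logs warn "Deadtime value may be too small", here are the timing directives from our ha.cf that I suspect are relevant. This is only a sketch with illustrative values (not our exact settings), assuming the stock Heartbeat 2.x /etc/ha.d/ha.cf layout:

```
# /etc/ha.d/ha.cf -- timing excerpt (illustrative values, to be tuned for load)
keepalive 2     # seconds between heartbeat packets
warntime  10    # warn when a heartbeat is this many seconds late
deadtime  30    # declare a node dead after this long without heartbeats
initdead  60    # extra allowance at startup; at least 2 * deadtime is customary
```

Would raising deadtime/warntime along these lines be the right direction for a heavily loaded node, or is the clock_t wraparound a separate problem?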
Thanks for your help.

Frank





_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
