Hi,
we have a two-node cluster with heartbeat 2.1.4 on RHEL 5 Linux.
We have OpenVZ on both nodes, a shared GFS disk holding the VE data, and
ManageVE resources, some on node1 and some on node2.
We are experiencing strange behavior. Sometimes the VEs of one node
appear started on both nodes. If we monitor with hb_gui, one of the
nodes is shown as stopped even though heartbeat is running there. The logs on the
node with all resources started look like this:
........................
heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped
around (uptime).
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
function for send local status was delayed 4294966306 ms (> 510 ms) before being called (GSource: 0x9744758)
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status
update: Node parmenides.upc.es now has status [dead]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at
4359079925 should have started at 64112728
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1,
new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
function for check for signals was delayed 4294966546 ms (> 510 ms) before being called (GSource: 0x9744988)
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST:
parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1,
new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at
4359079925 should have started at 64112704
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER:
heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum
(re)attained after event=NEW MEMBERSHIP (id=41)
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
function for update msgfree count was delayed 4294967216 ms (> 30000 ms) before being called (GSource: 0x9744a08)
ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead node
parmenides.upc.es from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at
4359079925 should have started at 64112637
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP:
trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
function for client audit was delayed 4294962206 ms
(> 5000 ms) before being called (GSource: 0x9744908)
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: CURRENT:
heraclito.upc.es [nodeid=0, born=41]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at
4359079925 should have started at 64113138
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: LOST:
parmenides.upc.es [nodeid=1, born=40]
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
function for memory stats was delayed 4244815296 ms
(> 5000 ms) before being called (GSource: 0x9744888)
..................
....................
heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es
returning after partition.
heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster
partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning
deadtime.
heartbeat[24950]: 2008/12/19_18:07:55 info: URL:
http://linux-ha.org/FAQ#heavy_load
heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node
parmenides.upc.es: status active
crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status
update: Node parmenides.upc.es now has status [active]
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
OC_EV_MS_INVALID from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2,
new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2,
new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER:
heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum
(re)attained after event=NEW MEMBERSHIP (id=42)
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER:
parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP:
trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT:
heraclito.upc.es [nodeid=0, born=38]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT:
parmenides.upc.es [nodeid=1, born=42]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW:
parmenides.upc.es [nodeid=1, born=42]
.................
It seems that one of the nodes is declared lost and later recovers, but the
cluster does not return to normal. Resources belonging to the recovered node
remain started on the other one. The only way I have found to resolve the
situation is to take manual action on the cluster, for instance stopping a
resource. When I do that, the cluster seems to notice the problem, detects
that some resources are active on more than one node, and corrects it.
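The log also warns "Deadtime value may be too small". As far as I understand, the relevant timing directives live in /etc/ha.d/ha.cf; the values below are only illustrative defaults from the documentation, not our actual configuration:

```
# /etc/ha.d/ha.cf -- heartbeat timing directives (illustrative values)
keepalive 2      # interval between heartbeat packets, in seconds
warntime 10      # log a late-heartbeat warning after 10s of silence
deadtime 30      # declare the peer node dead after 30s of silence
initdead 120     # extra allowance at startup; should be >= 2 * deadtime
```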
What is happening? Is heavy load affecting us, as the log messages suggest?
And why doesn't the cluster recover its normal status on its own?
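For reference, the "delayed N ms" figures in the warnings are all just under 2**32, which seems consistent with the "clock_t wrapped around" message at the top of the log. A quick check (plain Python arithmetic, nothing cluster-specific; the unsigned-32-bit interpretation is my assumption):

```python
# "delayed N ms" values copied from the Gmain_timeout_dispatch warnings.
delays_ms = [4294966306, 4294966546, 4294967216, 4294962206]

WRAP = 2 ** 32  # range of an unsigned 32-bit counter (assumption)

for d in delays_ms:
    # If a small negative interval were stored in an unsigned 32-bit
    # variable, it would appear as (negative value + 2**32):
    print(f"{d} ms -> {d - WRAP} ms if interpreted as wrapped")
```

If that reading is right, the apparent multi-week delays would really be small intervals of a few hundred to a few thousand milliseconds, i.e. an artifact of the clock wrap rather than evidence of genuine seven-week stalls.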
Thanks for your help.
Frank
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems