Hi,

On Mon, Dec 22, 2008 at 09:00:51AM +0100, Frank wrote:
> Hi,
> we have a two-node cluster with heartbeat 2.1.4 on RHEL 5 Linux.
> We have OpenVZ on both nodes, a shared GFS disk with VE data, and ManageVE
> resources, some on node1 and some on node2.
> We are experiencing strange behavior. Sometimes the VEs of one node appear
> started on both nodes. If we monitor with hb_gui, one of the
> nodes is shown as stopped although heartbeat is started there. Logs on the
> node with all resources started look like this:
>
> ........................
> heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped
> around (uptime).
Strange, this should've been fixed a long time ago, definitely before
2.1.4.

> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for send local status was delayed 4294966306 ms (> 510 ms) before
> being called (GSource: 0x9744758)
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status
> update: Node parmenides.upc.es now has status [dead]
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64112728
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
> heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1,
> new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for check for signals was delayed 4294966546 ms (> 510 ms) before
> being called (GSource: 0x9744988)
> cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST:
> parmenides.upc.es
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41,
> nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64112704
> cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER:
> heraclito.upc.es
> crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=41)
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for update msgfree count was delayed 4294967216 ms (> 30000 ms)
> before being called (GSource: 0x9744a08)
> ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
> crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead
> node parmenides.upc.es from join calculations: welcomed=0 itegrated=0
> finalized=0 confirmed=1
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64112637
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP:
> trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for client audit was delayed 4294962206 ms (> 5000 ms) before being
> called (GSource: 0x9744908)
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: CURRENT:
> heraclito.upc.es [nodeid=0, born=41]
>
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64113138
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: LOST:
> parmenides.upc.es [nodeid=1, born=40]
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for memory stats was delayed 4244815296 ms (> 5000 ms) before being
> called (GSource: 0x9744888)
> ..................
> ....................
> heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es
> returning after partition.
> heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster
> partitions, See URL: http://linux-ha.org/SplitBrain
> heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
> heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning
> deadtime.
> heartbeat[24950]: 2008/12/19_18:07:55 info: URL:
> http://linux-ha.org/FAQ#heavy_load
> heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
> heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node
> parmenides.upc.es: status active
> crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status
> update: Node parmenides.upc.es now has status [active]
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2,
> new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42,
> nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
> cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER:
> heraclito.upc.es
> crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=42)
> cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER:
> parmenides.upc.es
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP:
> trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT:
> heraclito.upc.es [nodeid=0, born=38]
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT:
> parmenides.upc.es [nodeid=1, born=42]
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW:
> parmenides.upc.es [nodeid=1, born=42]
> .................
>
> It seems like one of the nodes is lost and later recovered, but the
> cluster does not return to normal.
> Resources of the recovered node remain started on the
> other node. The only way I have found to resolve the situation is to take
> manual action on the cluster, for instance stopping a resource. If I do
> that, the cluster seems to realize what has happened, detects that some
> resources are running on more than one node, and corrects the situation.
>
> What is happening? Does heavy load affect us, as the log messages suggest?
> But why doesn't the cluster recover its normal status on its own?

Can you please produce a report with hb_report for the period around the
incident and file a bugzilla.

Thanks,

Dejan

> Thanks for your help.
>
> Frank
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
