Hi,

On Mon, Dec 22, 2008 at 09:00:51AM +0100, Frank wrote:
> Hi,
> we have a two-node cluster with heartbeat 2.1.4 on RHEL 5 Linux.
> We have OpenVZ on both nodes, a shared GFS disk with VE data, and ManageVE
> resources, some on node1 and some on node2.
> We are experiencing strange behavior. Sometimes the VEs of one node appear
> started on both nodes. If we monitor with hb_gui, one of the
> nodes is shown as stopped although heartbeat is started there. Logs on the
> node with all resources started look like this:
>
> ........................
> heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped
> around (uptime).
Strange, this should've been fixed a long time ago, definitely before
2.1.4.

> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for send local status was delayed 4294966306 ms (> 510 ms) before
> being called (GSource: 0x9744758)
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status
> update: Node parmenides.upc.es now has status [dead]
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64112728
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
> heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
> cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1,
> new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for check for signals was delayed 4294966546 ms (> 510 ms) before
> being called (GSource: 0x9744988)
> cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST:
> parmenides.upc.es
> crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41,
> nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64112704
> cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER:
> heraclito.upc.es
> crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=41)
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for update msgfree count was delayed 4294967216 ms (> 30000 ms)
> before being called (GSource: 0x9744a08)
> ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
> crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead
> node parmenides.upc.es from join calculations: welcomed=0 itegrated=0
> finalized=0 confirmed=1
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64112637
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP:
> trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for client audit was delayed 4294962206 ms (> 5000 ms) before being
> called (GSource: 0x9744908)
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: CURRENT:
> heraclito.upc.es [nodeid=0, born=41]
>
> heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started
> at 4359079925 should have started at 64113138
> crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: LOST:
> parmenides.upc.es [nodeid=1, born=40]
> heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch
> function for memory stats was delayed 4244815296 ms (> 5000 ms) before being
> called (GSource: 0x9744888)
> ..................
> ....................
> heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es
> returning after partition.
> heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster
> partitions, See URL: http://linux-ha.org/SplitBrain
> heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
> heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning
> deadtime.
> heartbeat[24950]: 2008/12/19_18:07:55 info: URL:
> http://linux-ha.org/FAQ#heavy_load
> heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
> heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node
> parmenides.upc.es: status active
> crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status
> update: Node parmenides.upc.es now has status [active]
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_INVALID from ccm
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2,
> new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
> crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42,
> nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
> cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER:
> heraclito.upc.es
> crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=42)
> cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER:
> parmenides.upc.es
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP:
> trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT:
> heraclito.upc.es [nodeid=0, born=38]
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT:
> parmenides.upc.es [nodeid=1, born=42]
> crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW:
> parmenides.upc.es [nodeid=1, born=42]
> .................
>
> It seems like one of the nodes is lost and later recovered, but the
> cluster does not return to normal.
> Resources of the recovered node remain started on the
> other node. The only way I have found to resolve the situation is to take
> manual action on the cluster, for instance stopping a resource. If I do
> that, the cluster seems to realize what has happened, detects that some
> resources are running on more than one node, and corrects the situation.
>
> What is happening? Does heavy load affect us, as the log messages suggest?
> But why doesn't the cluster recover its normal status on its own?

Can you please produce a report with hb_report for the period around the
incident and file a bugzilla.

Thanks,

Dejan

> Thanks for your help.
>
> Frank
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
