Hi,

On Wed, Sep 26, 2007 at 10:51:19AM +0200, Raoul Bhatia [IPAX] wrote:
> hello,
>
> i'll try to keep things short so please do not consider it rude:
>
> 2 (debian 4.0) nodes: eth0 = external; eth1 = hb channel
>
> the cluster has been in the state:
>
> >Current DC: webcluster02 (917954cd-0285-4fcd-9cd2-671736c4de66)
> >2 Nodes configured.
> > ...
> >Node: webcluster01 (49e81295-8e2f-4aeb-98f3-a14de6f62298): online
> >Node: webcluster02 (917954cd-0285-4fcd-9cd2-671736c4de66): online
>
> on webcluster01 i issued /etc/init.d/networking restart causing:
> >Sep 26 10:33:59 webcluster01 kernel: [78706.462355] tg3: eth1: Link is down.
> >Sep 26 10:34:02 webcluster01 kernel: [78708.758368] tg3: eth1: Link is up at 1000 Mbps, full duplex.
> >Sep 26 10:34:02 webcluster01 kernel: [78708.764028] tg3: eth1: Flow control is on for TX and on for RX.
> >Sep 26 10:34:29 webcluster01 heartbeat: [31919]: WARN: node webcluster02: is dead
> >Sep 26 10:34:29 webcluster01 heartbeat: [31919]: info: Link webcluster02:eth1 dead.
> >Sep 26 10:34:29 webcluster01 crmd: [31937]: notice: crmd_ha_status_callback: Status update: Node webcluster02 now has status [dead]
> >Sep 26 10:34:29 webcluster01 ccm: [31932]: info: Break tie for 2 nodes cluster
>
> now, crm_mon is in a split-brain situation:
>
> webcluster01:
> >Current DC: webcluster01 (49e81295-8e2f-4aeb-98f3-a14de6f62298)
> >2 Nodes configured.
> >...
> >Node: webcluster01 (49e81295-8e2f-4aeb-98f3-a14de6f62298): online
> >Node: webcluster02 (917954cd-0285-4fcd-9cd2-671736c4de66): OFFLINE
>                                                              ^^^^^^^
>
> webcluster02:
> > Current DC: webcluster02 (917954cd-0285-4fcd-9cd2-671736c4de66)
> > 2 Nodes configured.
> > ...
> > Node: webcluster01 (49e81295-8e2f-4aeb-98f3-a14de6f62298): online
> > Node: webcluster02 (917954cd-0285-4fcd-9cd2-671736c4de66): online
>                                                               ^^^^^^
>
> Q: how do i resolve this issue without restarting heartbeat?

There is probably no other reliable way. Actually, Heartbeat usually
recovers from split brain, as it should, but there seems to be a
bug/deficiency in the algorithm which sometimes leaves the cluster in a
state similar to the one you encountered. See

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1546

for more details.

> shouldn't there be a check to avoid this kind of split-brain situation?

What do you mean by "this kind"?

BTW, split brain conditions should be avoided at any cost. That doesn't
mean, however, that we shouldn't do our best to recover from them.

> do you need any further information?

You could attach the logs to the bugzilla entry.

Thanks.

Dejan

> cheers,
> raoul bhatia
>
> ps: i do not use stonith yet as i do not want stonith to interfere with
> configuration errors :)
> --
> ____________________________________________________________________
> DI (FH) Raoul Bhatia M.Sc.          email.          [EMAIL PROTECTED]
> Technischer Leiter
>
> IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
> Barawitzkagasse 10/2/2/11           email.          [EMAIL PROTECTED]
> 1190 Wien                           tel.               +43 1 3670030
> FN 277995t HG Wien                  fax.            +43 1 3670030 15
> ____________________________________________________________________
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
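
The kind of check Raoul asks about can be illustrated with a small, hypothetical Python sketch (not part of Heartbeat). It parses crm_mon output shaped like the snapshots quoted above and flags the cluster as looking split when the nodes name different DCs or disagree about a node's status. The function names `parse_crm_mon` and `looks_split` and the regular expressions are assumptions based only on the output format shown in this thread; other Heartbeat/crm_mon versions may print these lines differently, and how you collect the text from each node is left open.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Hypothetical patterns, modelled on the crm_mon lines quoted in this thread.
DC_RE = re.compile(r"Current DC:\s*(\S+)")
NODE_RE = re.compile(r"Node:\s*(\S+)\s*\([^)]*\):\s*(\S+)")


def parse_crm_mon(text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (DC name, {node name: lower-cased status}) from crm_mon text."""
    dc: Optional[str] = None
    nodes: Dict[str, str] = {}
    for line in text.splitlines():
        m = DC_RE.search(line)
        if m:
            dc = m.group(1)
            continue
        m = NODE_RE.search(line)
        if m:
            nodes[m.group(1)] = m.group(2).lower()
    return dc, nodes


def looks_split(views: Dict[str, str]) -> bool:
    """views maps a host name to the crm_mon text captured on that host.

    Heuristic only: report a split if the hosts name different DCs
    or disagree about any node's status.
    """
    parsed = [parse_crm_mon(text) for text in views.values()]
    if len({dc for dc, _ in parsed}) > 1:
        return True  # each partition appears to have elected its own DC
    names: Set[str] = set()
    for _, nodes in parsed:
        names.update(nodes)
    for name in names:
        # A node missing from one view shows up as None, i.e. a disagreement.
        if len({nodes.get(name) for _, nodes in parsed}) > 1:
            return True
    return False
```

Fed the two "now" snapshots above (webcluster01 reporting itself as DC with webcluster02 OFFLINE, webcluster02 reporting itself as DC with both nodes online), `looks_split` returns `True`. It only detects the condition and does nothing to recover from it. Status can also differ briefly during an ordinary membership change, so a real check would compare snapshots taken at about the same time and repeat before alarming.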
