Re: [Linux-HA] Split brain after node reboot...argggh

Dejan Muhamedagic Mon, 25 Feb 2008 14:23:53 -0800

Hi,

On Mon, Feb 25, 2008 at 10:26:25PM +0100, Johan Hoeke wrote:
> Andreas Mock wrote:
> >> -----Original Message-----
> >> From: [EMAIL PROTECTED] 
> >> [mailto:[EMAIL PROTECTED] On behalf of 
> >> Johan Hoeke
> >> Sent: Monday, 25 February 2008 16:59
> >> To: General Linux-HA mailing list
> >> Subject: Re: [Linux-HA] Split brain after node reboot...argggh
> >>
> >> Maybe totally unrelated but just in case,
> >>
> >> I had a split brain situation on a 2 node cluster a little while ago. I
> >> posted the hb_report here, and Dejan concluded that it sounded an awful
> >> lot like:
> >>
> >> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1768
> >>
> >> and asked me to reopen that case and upload the hb_report. I'm in the
> >> process of doing so now.
> >>
> >> See 
> >> http://www.mail-archive.com/[email protected]/msg06792.html
> >>
> >> I'm changing my nodes to not start heartbeat automatically after reboot.
> > 
> > 
> > Hi Johan,
> > 
> > thank you for your reply. I checked the entries and saw that
> > I also got the "WARN: node A down". I really don't know why
> > the upcoming node is not able to determine the status of
> > the other node correctly. But probably this is the problem.
> > 
> > Stonith-ing without waiting for the result seems really strange.
> > A pity for us that Andrew is on vacation.
> > 
> > Best regards
> > Andreas Mock
> 
> Hi Andreas,
> 
> Please take everything I say with a grain of salt because I'm in no way
> a heartbeat expert!
> 
> At the risk of further exposing my ignorance (stole that line from
> someone on this list) I'll comment on your hb_report:
> 
> I noticed from your report that you have quorum enabled for your two
> node cluster. I recall reading that it is best to not use quorum on a two
> node cluster. Sorry, can't find the link just now. I saw a reference to
> a twonode quorum module in your config file, so you might have that
> covered.


Nodes always have quorum in a two node cluster. That's what that
twonode module does. Otherwise, every node failure would be
considered a tie.
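
To illustrate the arithmetic (a sketch only, not heartbeat's actual
code; both function names are made up for this example): with a strict
majority rule, a lone survivor in a two node cluster holds exactly half
of the votes, which is a tie. A twonode-style tie-breaker lets the
surviving node keep quorum.

```python
def has_quorum_majority(live_nodes: int, total_nodes: int) -> bool:
    """Strict majority: more than half of the configured nodes are alive."""
    return 2 * live_nodes > total_nodes


def has_quorum_twonode(live_nodes: int, total_nodes: int) -> bool:
    """Hypothetical tie-breaker: a lone survivor of a two node cluster
    keeps quorum; larger clusters fall back to strict majority."""
    if total_nodes == 2 and live_nodes >= 1:
        return True
    return has_quorum_majority(live_nodes, total_nodes)


if __name__ == "__main__":
    # One node of two has failed: strict majority sees a tie.
    print(has_quorum_majority(1, 2))
    # The tie-breaker grants quorum to the survivor.
    print(has_quorum_twonode(1, 2))
```

Quorum granted this way says nothing about whether the other node is
really down, which is why fencing still matters.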

> I had an issue with a bad iptables causing my nodes not to see the
> other's heartbeat. Iptables became active on eth1 by mistake, where we

Can't remember, but here it looks like you have only one link for
heartbeat? You really should have more than one. Install another
network card, or use serial, or use the public interface.

> have our crossover heartbeat cable. Maybe a similar network problem has
> come up on your site.

> And your logging reads at times as if the heartbeat nodes are defined as
> pingd nodes as well:
> 
> Feb 25 15:32:13 dis01 pingd: [10135]: notice: pingd_lstatus_callback:
> Status update: Ping node dis02 now has status [dead]

That's odd. dis02 is not configured as a ping node.

> I'm sure I read somewhere that you're not supposed to use your cluster
> nodes as ping nodes.

Right.

> I'm using the gateway as a pingd target. Maybe the
> 62.146.40.161 address from your ha.cf is a gateway or something similar,
> I can't tell.
> 
> One last thing, and I'm repeating myself: I'm setting heartbeat so it
> won't start on reboot. This is because I pulled the heartbeat cable as a
> test the other day, and the nodes took turns shooting each other after
> rebooting. Dejan called it a shooting match. Classic mistake, I guess :)

It actually doesn't seem very obvious to me, so I wouldn't call
it a classic mistake. And, apart from network failures which
really are bad, you'd anyway have a very different game: the
failing host would come up, start heartbeat, and rejoin the
cluster.

Thanks,

Dejan

> 
> regards,
> 
> Johan
> 
> 
> 



> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
