Hi,

On Mon, Feb 25, 2008 at 04:36:51PM -0500, Doug Lochart wrote:
> On Mon, Feb 25, 2008 at 4:26 PM, Johan Hoeke <[EMAIL PROTECTED]> wrote:
> > Andreas Mock wrote:
> >  >> -----Original Message-----
> >  >> From: [EMAIL PROTECTED]
> >  >> [mailto:[EMAIL PROTECTED] On Behalf Of
> >  >> Johan Hoeke
> >  >> Sent: Monday, 25 February 2008 16:59
> >  >> To: General Linux-HA mailing list
> >  >> Subject: Re: [Linux-HA] Split brain after node reboot...argggh
> >  >>
> >  >> Maybe totally unrelated but just in case,
> >  >>
> >  >> I had a split brain situation on a 2 node cluster a little
> >  >> while ago. I posted the hb_report here, and Dejan concluded
> >  >> that it sounded an awful lot like:
> >  >>
> >  >> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1768
> >  >>
> >  >> and asked me to reopen that case and upload the hb_report.
> >  >> I'm in the process of doing so now.
> >  >>
> >  >> See
> >  >> http://www.mail-archive.com/[email protected]/msg06792.html
> >  >>
> >  >> I'm changing my nodes to not start heartbeat automatically
> >  >> after reboot.
> >  >
> >  >
> >  > Hi Johan,
> >  >
> >  > thank you for your reply. I checked the entries and saw that
> >  > I also got the "WARN: node A down". I really don't know why
> >  > the upcoming node is not able to determine the status of
> >  > the other node correctly. But probably this is the problem.
> >  >
> >  > Stonith-ing without waiting on the result seems really strange.
> >  > A pity for us that Andrew is on vacation.
> >  >
> >  > Best regards
> >  > Andreas Mock
> >
> >  Hi Andreas,
> >
> >  Please take everything I say with a grain of salt because I'm in no way
> >  a heartbeat expert!
> >
> >  At the risk of further exposing my ignorance (stole that line from
> >  someone on this list) I'll comment on your hb_report:
> >
> >  I noticed from your report that you have quorum enabled for your two
> >  node cluster. I recall reading that it is best not to use quorum on a two
> >  node cluster. Sorry, can't find the link just now. I saw a reference to
> >  a twonode quorum module in your config file, so you might have that
> >  covered.
> >
> >  I had an issue with a bad iptables configuration that stopped my nodes
> >  from seeing each other's heartbeat. Iptables became active on eth1 by
> >  mistake, where we have our crossover heartbeat cable. Maybe a similar
> >  network problem has come up on your site.
> >
> >  And your logging reads at times as if the heartbeat nodes are defined as
> >  pingd nodes as well:
> >
> >  Feb 25 15:32:13 dis01 pingd: [10135]: notice: pingd_lstatus_callback:
> >  Status update: Ping node dis02 now has status [dead]
> >
> >  I'm sure I read somewhere that you're not supposed to use your cluster
> >  nodes as ping nodes. I'm using the gateway as a pingd target. Maybe the
> >  62.146.40.161 address from your ha.cf is a gateway or something similar,
> >  I can't tell.
> >
> >  One last thing, and I'm repeating myself: I'm setting heartbeat so it
> >  won't start on reboot. This is because I pulled the heartbeat cable as a
> >  test the other day, and the nodes took turns shooting each other after
> >  rebooting. Dejan called it a shooting match. Classic mistake I guess :)
> 
> I am curious about this.  Why would it be a classic mistake?  It is only
> a mistake if there was some FAQ or guideline that everyone knew about
> that said not to do it.  I have asked the same question (about having
> heartbeat start on boot) and the answers I received said that it is OK
> to do it.

Not easy to decide either way. Yes, normally it should be OK to
start heartbeat automatically, given that there's a resilient
network behind it. In that case, should a split brain occur,
it should be short lived, and communication should be
established again while the node reboots.

It is still definitely the safest way to shut the node down and
even keep it down until an administrator establishes where the
problem is/was. Note that this is something which should happen
only extremely rarely. The best managed clusters are those that
never fail over.

> So now I am confused.  I have not tried yanking on the heartbeat cable
> because I have heartbeat set up to go out both interfaces.  If I did
> make it only one interface and yank the cable I could not use STONITH
> because at this point I can only use IPMI which requires eth0 to be
> available as it is IP addressable.

In this situation, if you have a split brain caused by the
network, then you'd have zero availability: both nodes would try
to shoot the other, but neither would be able to. They'd wait
indefinitely until the network is restored. Now, arguably, the
vast majority of clusters provide networked services, so there
would be an outage anyway.
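
To make the cases discussed in this thread a bit more concrete, here is
a rough toy model in Python. It is only a sketch of the reasoning, not
of how heartbeat or the CRM actually decide anything: the Node class,
the split_brain_outcome function and the outcome labels are all made up
for illustration, and the interface names are just examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one cluster node (illustration only)."""
    name: str
    heartbeat_links: frozenset  # interfaces carrying heartbeat traffic
    stonith_link: str           # interface used to reach the peer's fencing device (e.g. IPMI)


def split_brain_outcome(node_a, node_b, failed_links, autostart=True):
    """Classify what a two-node cluster would do when failed_links go down.

    This is a simplified model of the discussion, not heartbeat's real logic.
    """
    failed = set(failed_links)
    # The peers still see each other if any shared heartbeat link survives.
    surviving = (node_a.heartbeat_links & node_b.heartbeat_links) - failed
    if surviving:
        return "healthy"
    # Split brain: each node tries to fence the other over its STONITH path.
    a_can_shoot = node_a.stonith_link not in failed
    b_can_shoot = node_b.stonith_link not in failed
    if not (a_can_shoot or b_can_shoot):
        # Nobody can fence, so both wait until the network is restored.
        return "both wait for network"
    if a_can_shoot and b_can_shoot and autostart:
        # The shot node reboots, starts heartbeat, and shoots back.
        return "shooting match"
    # Fencing succeeds and the loser stays down (or cannot shoot back).
    return "one node fenced"


if __name__ == "__main__":
    # Heartbeat and IPMI both depend on eth0 only, and eth0 fails.
    a = Node("node-a", frozenset({"eth0"}), "eth0")
    b = Node("node-b", frozenset({"eth0"}), "eth0")
    print(split_brain_outcome(a, b, {"eth0"}))
```

With heartbeat on a single interface that IPMI also depends on, losing
that interface gives the "both wait for network" case described above.
Heartbeat on a separate crossover link with autostart enabled gives the
shooting match Johan saw, and disabling autostart turns that into a
single fencing.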

> When is it a good AND not so good situation to start heartbeat at boot?

Hope that this makes it a bit more clear.

Thanks,

Dejan

> 
> regards,
> 
> Doug
> 
> 
> 
> >  regards,
> >
> >  Johan
> >
> >
> >
> >
> > _______________________________________________
> >  Linux-HA mailing list
> >  [email protected]
> >  http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >  See also: http://linux-ha.org/ReportingProblems
> >
> 
> 
> 
> -- 
> What profits a man if he gains the whole world yet loses his soul?
