Re: [Linux-HA] Split Brain and not able to repair

Doug Lochart Wed, 20 Feb 2008 08:03:46 -0800

On Feb 20, 2008 8:59 AM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
>
> On Tue, Feb 19, 2008 at 04:19:06PM -0500, Doug Lochart wrote:
> > On Feb 19, 2008 2:32 PM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > On Tue, Feb 19, 2008 at 12:07:27PM -0500, Doug Lochart wrote:
> > > > I feel your pain.  I suffered through this as well as I am just
> > > > learning.  I was following a few tutorials and followed them closely
> > > > only to end up in SplitBrain (WTF??) so I plan on writing a tutorial
> > > > that covers all of what is needed to avoid this situation.
> > >
> > > To avoid split brain, which should be avoided at any cost,
> > > provide multiple heartbeat links.
> > >
> > > > However I am still struggling to determine what all that is.
> > >
> > > Split brain: http://www.linux-ha.org/SplitBrain
> > >
> > >
> > > > I was able to recover but I am not 100% sure how I did it.  I do know
> > > > that I looked at my drbd.conf file and found a few interesting
> > > > parameters that seemed to help me.  The commented-out first line shows
> > > > what all of these params were originally set to.  Once I read the conf
> > > > file comments I chose what I thought were the best values for my setup.
> > > >
> > > >     #after-sb-0pri disconnect;
> > > >     after-sb-0pri discard-older-primary;
> > > >     after-sb-1pri call-pri-lost-after-sb;
> > > >     after-sb-2pri call-pri-lost-after-sb;
> > > >     rr-conflict disconnect;
> > > >
> > > > I copied this conf file to both nodes and then restarted everything.
> > > > My split brain messages went away in syslog but I was still not able
> > > > to see everything so I did the following.
> > > >
> > > > /etc/init.d/heartbeat stop
> > > >
> > > > drbdadm detach all (on both nodes)
> > > > drbdadm state all (on both nodes)
> > > > drbdadm up all (on both nodes)
> > > > drbdadm state all (on both nodes)
> > > >
> > > > At this point mine was Secondary/Secondary so on the node I want as
> > > > primary I did
> > > >
> > > > drbdadm primary all
> > > >
> > > > This did it for me.  Then I restarted heartbeat
> > > >
> > > > /etc/init.d/heartbeat start
> > > >
> > > > Now all seems to be working; at least the logs look clean and the
> > > > resources are up.
> > > >
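The sequence quoted above can be written down as a dry-run plan. This is only a sketch in Python: it builds and prints the commands in the order given in the post and does not execute anything. The names recovery_plan and make_primary are made up for illustration, and the order is simply what is described above, not an official DRBD recovery procedure.

```python
def recovery_plan(make_primary):
    """Return the per-node command list, in the order described in the post.

    make_primary (hypothetical flag): True only on the node that should
    end up as the DRBD primary.
    """
    plan = [
        ["/etc/init.d/heartbeat", "stop"],   # stop heartbeat first
        ["drbdadm", "detach", "all"],        # on both nodes
        ["drbdadm", "state", "all"],         # check the reported state
        ["drbdadm", "up", "all"],            # on both nodes
        ["drbdadm", "state", "all"],         # expect Secondary/Secondary here
    ]
    if make_primary:
        # Only on the node chosen as primary.
        plan.append(["drbdadm", "primary", "all"])
    plan.append(["/etc/init.d/heartbeat", "start"])
    return plan


def render(plan):
    """Turn the plan into printable command lines (nothing is executed)."""
    return [" ".join(cmd) for cmd in plan]


if __name__ == "__main__":
    for line in render(recovery_plan(make_primary=True)):
        print(line)
```

Printing the plan for both values of make_primary gives a checklist to follow by hand on each node, which seems safer than scripting the real commands while the cluster is in an unknown state.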
> > > > What the tutorials I was following failed to mention was that you need
> > > > a fencing policy.
> > >
> > > There are quite a few places where fencing is mentioned.
> >
> > Not in the tutorials I was following.  They were not on the linux-ha
> > site.  I got so frustrated trying to find something that I had to look
> > elsewhere.
> >
> > > > Basically you need to set up and configure STONITH
> > > > so that one node will be able to KNOW that it has complete control
> > > > over the shared resource (disk).  STONITH will operate with many
> > > > devices (smart UPS, IPMI, etc.) and will shut the power off to the node
> > > > determined to be causing the problem.
> > > >
> > > > Hopefully others reading this will tell me where I am wrong or flesh 
> > > > this out.
> > >
> > > This is basically true. STONITH is used as a mechanism to fence a
> > > node in order to make sure that it is down.
> >
> > I guess most of my problems stem from me trying to use version 1
> > (haresources) and not version 2 because I am already confused as it is.
> > Can you configure STONITH with V1?  I do see an example for V2 but not V1.
>
> Try http://linux-ha.org/ha.cf/StonithDirective
>
> >
> > > > I am really green on this stuff and having a hard time finding good
> > > > docs/guides for newbies that really cover this stuff.
> > >
> > > I'm afraid that the documentation is not very well organized.
> > > Many people tried to improve it (it's a wiki), but so far there
> > > hasn't been much effect.
> >
> > I hope to put up a nice newbie-oriented
> > tutorial when I am done.
>
> I don't think that the documentation is that bad, just not well
> organized. The complexity of the matter doesn't help either.
>
> Thanks,
>
> Dejan


I will give you that.  The more I dig the more I find.  There are SO
many links on the HA site and there is no clear road map.  If you just
pick a page and exhaust all the links you will find lots of tidbits
that you did not expect.  Last night I watched Alan's 90 minute video
that he filmed at the Linux conf in Australia in 2007.  That was VERY
helpful, so to anyone out there in the newbie boat I highly recommend
taking the time to watch it.  From the video I also learned that what
is in the 'pressroom' on the website is not just press releases but
also docs, tutorials, and info about HA.  I would not have
thought to look there until I saw the video.

The more you learn the easier it is to find the information as you
know more of what you are looking for.  However for a newbie/novice it
is VERY unorganized.

regards,

Doug

>
> >
> > thanks,
> >
> > Doug
> >
> >
> > > Thanks,
> > >
> > > Dejan
> > >
> > >
> > > > good luck
> > > >
> > > > regards,
> > > >
> > > > Doug
> > > >
> > > >
> > > > On Feb 19, 2008 11:33 AM, Schmidt, Florian
> > > > <[EMAIL PROTECTED]> wrote:
> > > > > Hi readers,
> > > > >
> > > > > I caused a split brain on my testing machine, to see how it would
> > > > > react.
> > > > > On both machines I disabled the eth1 interface, over which the
> > > > > heartbeat happened.
> > > > >
> > > > > So DRBD was still connected (over the eth0 interface), but
> > > > > heartbeat was split-brained.
> > > > >
> > > > > After I saw what I expected (heartbeat failed to mount drbd on the
> > > > > secondary node, because the primary was still alive), I enabled the
> > > > > interfaces again and expected the nodes to recover the situation
> > > > > somehow, but this failed.
> > > > >
> > > > > I can restart one or both heartbeat instances now, but they aren't
> > > > > able to connect to each other :(
> > > > >
> > > > > Following crm_mon -1 on the nodes:
> > > > >
> > > > >
> > > > > First node (nodekrz)
> > > > >
> > > > > ============
> > > > > Last updated: Tue Feb 19 17:31:03 2008
> > > > > Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
> > > > > 2 Nodes configured.
> > > > > 2 Resources configured.
> > > > > ============
> > > > >
> > > > > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
> > > > > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): OFFLINE
> > > > >
> > > > > Master/Slave Set: drbd_master_slave
> > > > >     drbd_r0:0   (heartbeat::ocf:drbd):  Master noderz
> > > > >     drbd_r0:1   (heartbeat::ocf:drbd):  Stopped
> > > > > Resource Group: Filesystem_and_IP
> > > > >     Filesystem  (heartbeat::ocf:Filesystem):    Started noderz
> > > > >     Cluster_IP  (heartbeat::ocf:IPaddr):        Started noderz
> > > > >
> > > > >
> > > > > Second node: (noderz)
> > > > >
> > > > > ============
> > > > > Last updated: Tue Feb 19 17:30:17 2008
> > > > > Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
> > > > > 2 Nodes configured.
> > > > > 2 Resources configured.
> > > > > ============
> > > > >
> > > > > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
> > > > > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online
> > > > >
> > > > > Master/Slave Set: drbd_master_slave
> > > > >     drbd_r0:0   (heartbeat::ocf:drbd):  Master nodekrz
> > > > >     drbd_r0:1   (heartbeat::ocf:drbd):  Stopped
> > > > > Resource Group: Filesystem_and_IP
> > > > >     Filesystem  (heartbeat::ocf:Filesystem):    Started nodekrz
> > > > >     Cluster_IP  (heartbeat::ocf:IPaddr):        Started nodekrz
> > > > >
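The two crm_mon outputs quoted above show what one would expect from a split brain: each side names a different DC and lists the other side's DC as OFFLINE. A rough Python sketch of that check follows. It assumes only the "Current DC:" and "Node:" line formats seen in the quoted output; the function names parse_crm_mon and looks_split_brained are hypothetical, and other crm_mon versions may print these lines differently.

```python
import re

# Line formats taken from the crm_mon -1 output quoted in this thread.
DC_LINE = re.compile(r"^Current DC:\s+(\S+)")
NODE_LINE = re.compile(
    r"^Node:\s+(\S+)\s+\([0-9a-f-]+\):\s+(\S+)\s*$", re.IGNORECASE
)

VIEW_A = """\
Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): OFFLINE
"""

VIEW_B = """\
Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online
"""


def parse_crm_mon(text):
    """Return (dc_name, {node_name: lowercase_status}) from crm_mon -1 text."""
    dc = None
    nodes = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = DC_LINE.match(line)
        if match:
            dc = match.group(1)
            continue
        match = NODE_LINE.match(line)
        if match:
            nodes[match.group(1)] = match.group(2).lower()
    return dc, nodes


def looks_split_brained(view_a, view_b):
    """True if the two views disagree on the DC and each sees the other's DC offline."""
    dc_a, nodes_a = parse_crm_mon(view_a)
    dc_b, nodes_b = parse_crm_mon(view_b)
    if dc_a is None or dc_b is None:
        return False
    return (
        dc_a != dc_b
        and nodes_a.get(dc_b) == "offline"
        and nodes_b.get(dc_a) == "offline"
    )


if __name__ == "__main__":
    print(looks_split_brained(VIEW_A, VIEW_B))
```

This only compares what the two nodes report; it says nothing about why they cannot see each other, so the heartbeat link and logs still need checking by hand.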
> > > > > They are able to ping each other over the heartbeat-link.
> > > > >
> > > > > Like I said, restarting heartbeat on one or both nodes at the same
> > > > > time doesn't change anything.
> > > > >
> > > > > So what to do to solve this situation?
> > > > >
> > > > > Thanks for replies
> > > > >
> > > > > Florian
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Linux-HA mailing list
> > > > > [email protected]
> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > > See also: http://linux-ha.org/ReportingProblems
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > What profits a man if he gains the whole world yet loses his soul?
