Hi,

On Tue, Feb 19, 2008 at 12:07:27PM -0500, Doug Lochart wrote:
> I feel your pain. I suffered through this as well, as I am just
> learning. I followed a few tutorials closely, only to end up in
> split brain anyway, so I plan on writing a tutorial that covers
> all of what is needed to avoid this situation.
To avoid split brain, which should be avoided at all costs, provide
multiple heartbeat links.

> However I am still struggling to determine what all of that is.

Split brain: http://www.linux-ha.org/SplitBrain

> I was able to recover, but I am not 100% sure how I did it. I do know
> that I looked at my drbd.conf file and found a few interesting
> parameters that seemed to help me. The first one, commented out, is
> what all of these params were set to. Once I read the conf file
> comments I chose what I thought was best for my values:
>
> #after-sb-0pri disconnect;
> after-sb-0pri discard-older-primary;
> after-sb-1pri call-pri-lost-after-sb;
> after-sb-2pri call-pri-lost-after-sb;
> rr-conflict disconnect;
>
> I copied this conf file to both nodes and then restarted everything.
> The split brain messages went away in syslog, but I was still not
> able to see everything, so I did the following:
>
> /etc/init.d/heartbeat stop
>
> drbdadm detach all   (on both nodes)
> drbdadm state all    (on both nodes)
> drbdadm up all       (on both nodes)
> drbdadm state all    (on both nodes)
>
> At this point mine was Secondary/Secondary, so on the node I want as
> primary I did:
>
> drbdadm primary all
>
> That did it for me. Then I restarted heartbeat:
>
> /etc/init.d/heartbeat start
>
> Now all seems to be working; at least the logs look clean and the
> resources are up.
>
> What the tutorials I was following failed to mention was that you
> need a fencing policy.

There are quite a few places where fencing is mentioned.

> Basically you need to set up and configure STONITH so that one node
> will be able to KNOW that it has complete control over the shared
> resource (disk). STONITH works with many devices (smart UPS, IPMI,
> etc.) and will shut the power off to the node determined to be
> causing the problem.
>
> Hopefully others reading this will tell me where I am wrong or flesh
> this out.

This is basically true. STONITH is used as a mechanism to fence a node
in order to make sure that it is down.
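For the archives: the after-sb-* parameters Doug quotes belong in the
net section of a DRBD 8 resource definition. A minimal sketch of where
they go (the resource name r0, node names, disks and addresses below
are placeholders, not taken from this thread):

```
resource r0 {
  protocol C;
  net {
    # automatic recovery policies for the three split-brain cases
    # (0, 1 or 2 primaries detected after reconnect)
    after-sb-0pri discard-older-primary;
    after-sb-1pri call-pri-lost-after-sb;
    after-sb-2pri call-pri-lost-after-sb;
    rr-conflict   disconnect;
  }
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

As Doug says, the file must be identical on both nodes; run
"drbdadm adjust all" (or restart drbd) on both after changing it.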
> I am really green on this stuff and having a hard time finding good
> docs/guides for newbies that really cover it.

I'm afraid that the documentation is not very well organized. Many
people have tried to improve it (it's a wiki), but so far there hasn't
been much effect.

Thanks,

Dejan

> good luck
>
> regards,
>
> Doug
>
> On Feb 19, 2008 11:33 AM, Schmidt, Florian
> <[EMAIL PROTECTED]> wrote:
> > Hi readers,
> >
> > I caused a split brain on my testing machine, to see how it would
> > react: I disabled the eth1 interface on both machines, over which
> > the heartbeat traffic runs.
> >
> > So DRBD was still connected (over the eth0 interface), but
> > heartbeat was split-brained.
> >
> > After I saw what I expected (heartbeat failed to mount DRBD on the
> > secondary node, because the primary was still alive) I enabled the
> > interfaces again and expected the nodes to recover from the
> > situation somehow... but this failed.
> >
> > I can restart one or both heartbeat instances now, but they aren't
> > able to connect to each other :(
> >
> > Following is crm_mon -1 on the nodes.
> >
> > First node (nodekrz):
> >
> > ============
> > Last updated: Tue Feb 19 17:31:03 2008
> > Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
> > 2 Nodes configured.
> > 2 Resources configured.
> > ============
> >
> > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
> > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): OFFLINE
> >
> > Master/Slave Set: drbd_master_slave
> >     drbd_r0:0 (heartbeat::ocf:drbd): Master noderz
> >     drbd_r0:1 (heartbeat::ocf:drbd): Stopped
> > Resource Group: Filesystem_and_IP
> >     Filesystem (heartbeat::ocf:Filesystem): Started noderz
> >     Cluster_IP (heartbeat::ocf:IPaddr): Started noderz
> >
> > Second node (noderz):
> >
> > ============
> > Last updated: Tue Feb 19 17:30:17 2008
> > Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
> > 2 Nodes configured.
> > 2 Resources configured.
> > ============
> >
> > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
> > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online
> >
> > Master/Slave Set: drbd_master_slave
> >     drbd_r0:0 (heartbeat::ocf:drbd): Master nodekrz
> >     drbd_r0:1 (heartbeat::ocf:drbd): Stopped
> > Resource Group: Filesystem_and_IP
> >     Filesystem (heartbeat::ocf:Filesystem): Started nodekrz
> >     Cluster_IP (heartbeat::ocf:IPaddr): Started nodekrz
> >
> > They are able to ping each other over the heartbeat link.
> >
> > Like I said, restarting heartbeat on one or both nodes at the same
> > time doesn't change anything.
> >
> > So what can I do to resolve this situation?
> >
> > Thanks for any replies,
> >
> > Florian
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
>
> --
> What profits a man if he gains the whole world yet loses his soul?
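P.S. Since Florian's test partitioned the cluster by killing the single
eth1 heartbeat path: "multiple heartbeat links" just means more than one
communication directive in ha.cf, so that losing any one link does not
split the cluster. A sketch, using the interfaces and node names from
this thread (the serial line and the timing values are my own
assumptions, not from Florian's setup):

```
# ha.cf -- heartbeat over several independent paths
bcast  eth1          # dedicated crossover link
bcast  eth0          # second path over the regular LAN
serial /dev/ttyS0    # null-modem cable; survives any network failure

keepalive 1
deadtime  30

node nodekrz
node noderz
```

With at least two of these paths on separate physical media, a single
interface failure degrades redundancy instead of causing a split brain.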
