Re: [Linux-HA] Split Brain and not able to repair

Schmidt, Florian Wed, 20 Feb 2008 09:20:16 -0800

Doug, 

Well, I didn't copy and paste that. ;) I read about these options and also 
decided on disconnect...
But now I would like to know how to notify somebody about the split brain, and I 
thought you'd know something... but^^


Well, maybe one of the silent readers knows something about that...
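
One possible way to get such a notification, as a rough sketch only: later DRBD 8.x releases document a `split-brain` handler in drbd.conf that is meant to alert someone on duty. Whether your installed version has it is an assumption to check against your drbd.conf man page. The script name, its path, the "send" argument, and the recipient below are all hypothetical; the only DRBD-provided piece used is the DRBD_RESOURCE environment variable that DRBD exports to handler scripts.

```shell
#!/bin/sh
# Hypothetical handler: /usr/local/sbin/notify-sb.sh (name and path are assumptions).
# Assumed drbd.conf hook, only if your DRBD release supports the split-brain handler:
#   handlers { split-brain "/usr/local/sbin/notify-sb.sh send root"; }

# Build the notification text for a given resource and host name.
sb_message() {
    resource=$1
    host=$2
    printf 'DRBD split brain detected on resource %s (node %s).\n' "$resource" "$host"
    printf 'Automatic recovery is disabled; manual intervention is required.\n'
}

# Mail the message to the recipient (defaults to root) using mail(1).
notify_admin() {
    recipient=${1:-root}
    resource=${DRBD_RESOURCE:-unknown}   # exported by DRBD when it calls a handler
    host=$(uname -n)
    sb_message "$resource" "$host" | mail -s "DRBD split brain on $host" "$recipient"
}

# Only send when called with "send", so sourcing the file has no side effects.
if [ "${1:-}" = "send" ]; then
    notify_admin "${2:-root}"
fi
```

This only reports the split brain; it does not try to resolve it, which fits leaving the after-sb-* policies at disconnect.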

Anyway thanks for your disaster-recovery..it recovered :)

Regards

Florian


Florian,

What I posted as part of my parameters SHOULD not be used under normal
operations.  I think I posted this info in another thread (one that I
started) and a much more experienced person said it was a BAD idea.
The values for those should be left at disconnect because you will
more than likely want to make the final decision as to how to solve
your split brain manually.  Part of my reasoning behind posting those
values was to hopefully draw a more senior person into the
discussion :)

As for your email question, I cannot answer it.

Setting those parameters, however, did help me recover from my split
brain.  As I was in pure test mode, I couldn't care less about the data on
my drbd partition.

One thing I still have not discovered and would be great to know is
"When you have a split brain with a DRBD how do you get your system
back to normal?"  A Disaster recovery overview if you will.  I am
still using haresources at this point.
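
For what it is worth, here is a sketch of the manual split-brain recovery commands as I understand them from the DRBD 8.0 to 8.3 documentation. It is not a tested procedure for this cluster. The resource name "r0" is an assumption taken from the crm_mon output further down, and the `run` wrapper and DRY_RUN switch are hypothetical helpers added so the commands can be previewed before anything is executed. Stop heartbeat first, and choose the victim carefully: everything written on it since the split is discarded.

```shell
#!/bin/sh
# Sketch of manual DRBD split-brain recovery (DRBD 8.0-8.3 command syntax).
# Changes made on the "victim" node since the split are thrown away.

DRY_RUN=${DRY_RUN:-1}   # default: only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run on the node whose changes will be discarded (the victim).
recover_victim() {
    run drbdadm secondary "$1"
    run drbdadm -- --discard-my-data connect "$1"
}

# Run on the node whose data is kept (the survivor);
# only needed if that node is in the StandAlone connection state.
recover_survivor() {
    run drbdadm connect "$1"
}
```

For example, after sourcing the file, `recover_victim r0` on the victim and `recover_survivor r0` on the other node print the commands; setting DRY_RUN=0 would execute them. Once the resync has finished, promote the node you want as primary and start heartbeat again.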

On Feb 20, 2008 11:55 AM, Schmidt, Florian
<[EMAIL PROTECTED]> wrote:
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Doug Lochart
> Sent: Tuesday, 19 February 2008 18:07
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Split Brain and not able to repair
>
> I feel your pain.  I suffered through this as well, as I am just
> learning.  I was following a few tutorials and followed them closely
> only to end up in SplitBrain (WTF??), so I plan on writing a tutorial
> that covers everything that is needed to avoid this situation.
> However, I am still struggling to determine what all that is.
>
> I was able to recover, but I am not 100% sure how I did it.  I do know
> that I looked at my drbd.conf file and found a few interesting
> parameters that seemed to help me.  The first one, commented out, is what
> all of these params were set to.  Once I read the conf file comments I
> chose what I thought were the best values.
>
>     #after-sb-0pri disconnect;
>     after-sb-0pri discard-older-primary;
>     after-sb-1pri call-pri-lost-after-sb;
>     after-sb-2pri call-pri-lost-after-sb;
>     rr-conflict disconnect;
>
>
> [Florian]
> Thanks for your reply. I have added these options to my drbd.conf now :)
>
> But may I ask what your script "pri-lost-after-sb" consists of? What do you 
> execute in this script? I would like to send a mail to an administrator if a 
> split brain occurs, so that he can resolve the situation... :)
>
> Regards
> Florian
>
>
> I copied this conf file to both nodes and then restarted everything.
> My split brain messages went away in syslog but I was still not able
> to see everything so I did the following.
>
> /etc/init.d/heartbeat stop
>
> drbdadm detach all (on both nodes)
> drbdadm state all (on both nodes)
> drbdadm up all (on both nodes)
> drbdadm state all (on both nodes)
>
> At this point mine was Secondary/Secondary so on the node I want as
> primary I did
> drbdadm primary all.
>
> This did it for me.  Then I restarted heartbeat
>
> /etc/init.d/heartbeat start
>
> Now all seems to be working; at least the logs look clean and the
> resources are up.
>
> What the tutorials I was following failed to mention was that you need
> a fencing policy.  Basically you need to set up and configure STONITH
> so that one node will be able to KNOW that it has complete control
> over the shared resource (disk).  STONITH will operate with many
> devices (smart UPS, IPMI, etc.) and will shut the power off to the node
> determined to be causing the problem.
>
> Hopefully others reading this will tell me where I am wrong or flesh this out.
>
> I am really green on this stuff and having a hard time finding good
> docs/guides for newbies that really cover this stuff.
>
> good luck
>
> regards,
>
> Doug
>
>
> On Feb 19, 2008 11:33 AM, Schmidt, Florian
> <[EMAIL PROTECTED]> wrote:
> > Hi readers,
> >
> > I caused a split brain on my test setup to see how it would react.
> > On both machines I disabled the eth1 interface, over which the heartbeat
> > ran.
> >
> > So DRBD was still connected (over the eth0 interface), but heartbeat
> > was split-brained.
> >
> > After I saw what I expected (heartbeat failed to mount drbd on the
> > secondary node, because the primary was still alive), I enabled the
> > interfaces again and expected the nodes to recover the situation
> > somehow... but this failed.
> >
> > I can restart one or both heartbeat instances now, but they aren't able
> > to connect to each other :(
> >
> > Here is the output of crm_mon -1 on the nodes:
> >
> >
> > First node (nodekrz)
> >
> > ============
> > Last updated: Tue Feb 19 17:31:03 2008
> > Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
> > 2 Nodes configured.
> > 2 Resources configured.
> > ============
> >
> > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
> > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): OFFLINE
> >
> > Master/Slave Set: drbd_master_slave
> >     drbd_r0:0   (heartbeat::ocf:drbd):  Master noderz
> >     drbd_r0:1   (heartbeat::ocf:drbd):  Stopped
> > Resource Group: Filesystem_and_IP
> >     Filesystem  (heartbeat::ocf:Filesystem):    Started noderz
> >     Cluster_IP  (heartbeat::ocf:IPaddr):        Started noderz
> >
> >
> > Second node: (noderz)
> >
> > ============
> > Last updated: Tue Feb 19 17:30:17 2008
> > Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
> > 2 Nodes configured.
> > 2 Resources configured.
> > ============
> >
> > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
> > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online
> >
> > Master/Slave Set: drbd_master_slave
> >     drbd_r0:0   (heartbeat::ocf:drbd):  Master nodekrz
> >     drbd_r0:1   (heartbeat::ocf:drbd):  Stopped
> > Resource Group: Filesystem_and_IP
> >     Filesystem  (heartbeat::ocf:Filesystem):    Started nodekrz
> >     Cluster_IP  (heartbeat::ocf:IPaddr):        Started nodekrz
> >
> > They are able to ping each other over the heartbeat-link.
> >
> > Like I said, restarting heartbeat on one or both nodes at the same time
> > doesn't change anything.
> >
> > So what can I do to resolve this situation?
> >
> > Thanks for replies
> >
> > Florian
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
>
>
>
> --
> What profits a man if he gains the whole world yet loses his soul?
>



-- 
What profits a man if he gains the whole world yet loses his soul?
