I feel your pain. I suffered through this as well, since I am just
learning. I followed a few tutorials closely, only to end up in split
brain (WTF??), so I plan on writing a tutorial that covers everything
needed to avoid this situation. However, I am still struggling to
determine what all of that is.

I was able to recover, but I am not 100% sure how I did it. I do know
that I looked at my drbd.conf file and found a few interesting
parameters that seemed to help me. All of these parameters were
originally set to "disconnect", as in the first, commented-out line
below. After reading the comments in the conf file, I chose the values
I thought were best:
#after-sb-0pri disconnect;
after-sb-0pri discard-older-primary;
after-sb-1pri call-pri-lost-after-sb;
after-sb-2pri call-pri-lost-after-sb;
rr-conflict disconnect;
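
For anyone wondering where these lines go: as far as I know, in DRBD 8.x
they belong in the net section of a resource. Here is a rough sketch, not
a tested config. The resource name r0 is an assumption (guessed from
drbd_r0 in the crm_mon output quoted below), and the other sections are
whatever your existing drbd.conf already has.

```
resource r0 {
  net {
    # Original setting: no automatic recovery, just drop the connection.
    #after-sb-0pri disconnect;

    # Split brain detected while neither node is primary.
    after-sb-0pri discard-older-primary;
    # Split brain detected while exactly one node is primary.
    after-sb-1pri call-pri-lost-after-sb;
    # Split brain detected while both nodes are primary.
    after-sb-2pri call-pri-lost-after-sb;
    # The resync decision conflicts with the current roles.
    rr-conflict disconnect;
  }
  # disk, syncer and "on <hostname>" sections stay as they are
}
```

Two caveats, as I understand the drbd.conf man page: the
call-pri-lost-after-sb policy expects a pri-lost-after-sb handler to be
defined in the handlers section, and any discard-* policy means DRBD will
automatically throw away one node's changes, so pick these values with
care.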
I copied this conf file to both nodes and then restarted everything.
My split-brain messages went away in syslog, but I was still not able
to see everything, so I did the following:
/etc/init.d/heartbeat stop
drbdadm detach all (on both nodes)
drbdadm state all (on both nodes)
drbdadm up all (on both nodes)
drbdadm state all (on both nodes)
At this point mine was Secondary/Secondary, so on the node I wanted as
primary I ran:
drbdadm primary all
This did it for me. Then I restarted heartbeat:
/etc/init.d/heartbeat start
Now everything seems to be working; at least the logs look clean and
the resources are up.
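
The steps above can be collected into a small shell sketch. This is only
an illustration of the sequence, not a vetted recovery tool: the function
names and the DRY_RUN switch are made up for this example, and by default
it only prints the commands instead of running them. Whether heartbeat
should be stopped on one node or both is an assumption you will need to
settle for your own cluster.

```shell
#!/bin/sh
# Sketch of the recovery sequence described above.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 and run
# as root to actually execute them. Names here are hypothetical helpers.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the cluster manager so it does not fight the manual steps.
stop_heartbeat() {
    run /etc/init.d/heartbeat stop
}

# Run on BOTH nodes: detach, check state, bring up again, check state.
reset_drbd() {
    run drbdadm detach all
    run drbdadm state all
    run drbdadm up all
    run drbdadm state all
}

# Run ONLY on the node that should become primary.
promote_primary() {
    run drbdadm primary all
}

# Hand control back to heartbeat once DRBD looks sane.
start_heartbeat() {
    run /etc/init.d/heartbeat start
}
```

To use it, source the file and call stop_heartbeat and reset_drbd on each
node, then promote_primary on the chosen node only, and finally
start_heartbeat. Check the output of "drbdadm state all" by eye between
steps before promoting anything.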

What the tutorials I was following failed to mention is that you need
a fencing policy. Basically, you need to set up and configure STONITH
so that one node can KNOW that it has complete control over the shared
resource (disk). STONITH works with many devices (smart UPS, IPMI,
etc.) and will cut the power to the node determined to be causing the
problem.

Hopefully others reading this will tell me where I am wrong or flesh
this out. I am really green on this stuff and am having a hard time
finding good docs/guides for newbies that really cover it.

Good luck,
regards,
Doug
On Feb 19, 2008 11:33 AM, Schmidt, Florian
<[EMAIL PROTECTED]> wrote:
> Hi readers,
>
> I caused a split brain on my testing machine to see how it would react.
> On both machines I disabled the eth1 interface, which carries the
> heartbeat traffic.
>
> So DRBD was still connected (over the eth0 interface), but heartbeat
> was split-brained.
>
> After I saw what I expected (heartbeat failed to mount DRBD on the
> secondary node because the primary was still alive), I enabled the
> interfaces again and expected the nodes to recover from the situation
> somehow, but this failed.
>
> I can restart one or both heartbeat instances now, but they aren't able
> to connect to each other :(
>
> Here is the output of crm_mon -1 on each node:
>
>
> First node (nodekrz)
>
> ============
> Last updated: Tue Feb 19 17:31:03 2008
> Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
> 2 Nodes configured.
> 2 Resources configured.
> ============
>
> Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
> Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): OFFLINE
>
> Master/Slave Set: drbd_master_slave
> drbd_r0:0 (heartbeat::ocf:drbd): Master noderz
> drbd_r0:1 (heartbeat::ocf:drbd): Stopped
> Resource Group: Filesystem_and_IP
> Filesystem (heartbeat::ocf:Filesystem): Started noderz
> Cluster_IP (heartbeat::ocf:IPaddr): Started noderz
>
>
> Second node: (noderz)
>
> ============
> Last updated: Tue Feb 19 17:30:17 2008
> Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
> 2 Nodes configured.
> 2 Resources configured.
> ============
>
> Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
> Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online
>
> Master/Slave Set: drbd_master_slave
> drbd_r0:0 (heartbeat::ocf:drbd): Master nodekrz
> drbd_r0:1 (heartbeat::ocf:drbd): Stopped
> Resource Group: Filesystem_and_IP
> Filesystem (heartbeat::ocf:Filesystem): Started nodekrz
> Cluster_IP (heartbeat::ocf:IPaddr): Started nodekrz
>
> They are able to ping each other over the heartbeat-link.
>
> Like I said, restarting heartbeat on one or both nodes at the same time
> doesn't change anything.
>
> So what can I do to resolve this situation?
>
> Thanks for replies
>
> Florian
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
--
What profits a man if he gains the whole world yet loses his soul?