Hi,

On Tue, Feb 19, 2008 at 04:19:06PM -0500, Doug Lochart wrote:
> On Feb 19, 2008 2:32 PM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > On Tue, Feb 19, 2008 at 12:07:27PM -0500, Doug Lochart wrote:
> > > I feel your pain.  I suffered through this as well as I am just
> > > learning.  I was following a few tutorials and followed them closely
> > > only to end up in SplitBrain (WTF??) so I plan on writing a tutorial
> > > that covers all of what is needed to avoid this situation.
> >
> > To avoid split brain, which should be avoided at any cost,
> > provide multiple heartbeat links.
> >
> > > However I am still struggling to determine what all that is.
> >
> > Split brain: http://www.linux-ha.org/SplitBrain
> >
> > > I was able to recover but I am not 100% sure how I did it.  I do know
> > > that I looked at my drbd.conf file and found a few interesting
> > > parameters that seemed to help me.  The first one commented is what all
> > > of these params were set to.  Once I read the conf file comments I
> > > chose what I thought was best for my values.
> > >
> > > #after-sb-0pri disconnect;
> > > after-sb-0pri discard-older-primary;
> > > after-sb-1pri call-pri-lost-after-sb;
> > > after-sb-2pri call-pri-lost-after-sb;
> > > rr-conflict disconnect;
> > >
> > > I copied this conf file to both nodes and then restarted everything.
> > > My split brain messages went away in syslog but I was still not able
> > > to see everything so I did the following.
> > >
> > > /etc/init.d/heartbeat stop
> > >
> > > drbdadm detach all  (on both nodes)
> > > drbdadm state all   (on both nodes)
> > > drbdadm up all      (on both nodes)
> > > drbdadm state all   (on both nodes)
> > >
> > > At this point mine was Secondary/Secondary so on the node I want as
> > > primary I did
> > > drbdadm primary all.
> > >
> > > This did it for me.
> > > Then I restarted heartbeat
> > >
> > > /etc/init.d/heartbeat start
> > >
> > > Now all seems to be working; at least the logs look clean and the
> > > resources are up.
> > >
> > > What the tutorials I was following failed to mention was that you need
> > > a fencing policy.
> >
> > There are quite a few places where fencing is mentioned.
>
> Not in the tutorials I was following.  They were not on the linux-ha
> site.  I got so frustrated trying to find something I had to look
> elsewhere.
>
> > > Basically you need to set up and configure STONITH
> > > so that one node will be able to KNOW that it has complete control
> > > over the shared resource (disk).  STONITH will operate with many
> > > devices (smart ups, ipmi, etc) and will shut the power off to the node
> > > determined to be causing the problem.
> > >
> > > Hopefully others reading this will tell me where I am wrong or flesh this
> > > out.
> >
> > This is basically true. STONITH is used as a mechanism to fence a
> > node in order to make sure that it is down.
>
> I guess most of my problems stem from me trying to use version1
> (haresources) and not 2 because I am already confused as it is.  Can
> you configure STONITH with V1?  I do see an example for V2 but not V1.

Try http://linux-ha.org/ha.cf/StonithDirective

> > > I am really green on this stuff and having a hard time finding good
> > > docs/guides for newbies that really cover this stuff.
> >
> > I'm afraid that the documentation is not very well organized.
> > Many people tried to improve it (it's a wiki), but so far there
> > hasn't been much effect.
>
> Hopefully I would like to put up a nice newbie oriented
> tutorial when I am done.

I don't think that the documentation is that bad, just not well
organized. The complexity of the matter doesn't help either.

Thanks,

Dejan

> thanks,
>
> Doug
>
> > Thanks,
> >
> > Dejan
> >
> > > good luck
> > >
> > > regards,
> > >
> > > Doug
> > >
> > > On Feb 19, 2008 11:33 AM, Schmidt, Florian
> > > <[EMAIL PROTECTED]> wrote:
> > > > Hi readers,
> > > >
> > > > I caused a split brain on my testing machine, to see how it would react.
> > > > I disabled on both machines the eth1-interface, over which the heartbeat
> > > > happened.
> > > >
> > > > So the DRBD still was connected (over the eth0-interface) but heartbeat
> > > > was split-brained.
> > > >
> > > > After I saw what I expected (heartbeat failed to mount drbd on the
> > > > secondary node, because the primary was still alive) I enabled the
> > > > interfaces again and expected the nodes to recover the situation
> > > > somehow, but this failed.
> > > >
> > > > I can restart one or both heartbeat-instances now, but they aren't able
> > > > to connect to each other :(
> > > >
> > > > Following crm_mon -1 on the nodes:
> > > >
> > > >
> > > > First node (nodekrz)
> > > >
> > > > ============
> > > > Last updated: Tue Feb 19 17:31:03 2008
> > > > Current DC: noderz (91d062c3-ad0a-4c24-b759-acada7f19101)
> > > > 2 Nodes configured.
> > > > 2 Resources configured.
> > > > ============
> > > >
> > > > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): online
> > > > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): OFFLINE
> > > >
> > > > Master/Slave Set: drbd_master_slave
> > > >     drbd_r0:0 (heartbeat::ocf:drbd): Master noderz
> > > >     drbd_r0:1 (heartbeat::ocf:drbd): Stopped
> > > > Resource Group: Filesystem_and_IP
> > > >     Filesystem (heartbeat::ocf:Filesystem): Started noderz
> > > >     Cluster_IP (heartbeat::ocf:IPaddr): Started noderz
> > > >
> > > >
> > > > Second node: (noderz)
> > > >
> > > > ============
> > > > Last updated: Tue Feb 19 17:30:17 2008
> > > > Current DC: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d)
> > > > 2 Nodes configured.
> > > > 2 Resources configured.
> > > > ============
> > > >
> > > > Node: noderz (91d062c3-ad0a-4c24-b759-acada7f19101): OFFLINE
> > > > Node: nodekrz (44425bd9-2cba-4d6a-ac62-82a8bb81a23d): online
> > > >
> > > > Master/Slave Set: drbd_master_slave
> > > >     drbd_r0:0 (heartbeat::ocf:drbd): Master nodekrz
> > > >     drbd_r0:1 (heartbeat::ocf:drbd): Stopped
> > > > Resource Group: Filesystem_and_IP
> > > >     Filesystem (heartbeat::ocf:Filesystem): Started nodekrz
> > > >     Cluster_IP (heartbeat::ocf:IPaddr): Started nodekrz
> > > >
> > > > They are able to ping each other over the heartbeat-link.
> > > >
> > > > Like I said, restarting heartbeat on one or both nodes at the same time
> > > > doesn't change anything.
> > > >
> > > > So what to do to solve this situation?
> > > >
> > > > Thanks for replies
> > > >
> > > > Florian
> > > >
> > > > _______________________________________________
> > > > Linux-HA mailing list
> > > > [email protected]
> > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > See also: http://linux-ha.org/ReportingProblems
> > >
> > > --
> > > What profits a man if he gains the whole world yet loses his soul?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
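
As a rough illustration only, the manual recovery order Doug describes in the quoted text above could be collected into a small shell script. This is a sketch, not a tested procedure: the names `run`, `recover_node` and `DRY_RUN` are made up for this example, while the commands themselves are the ones given in the thread. It assumes heartbeat is stopped on both nodes first (the thread does not say so explicitly), and it does not coordinate the ordering between the two nodes. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the manual recovery order quoted in this thread.
# DRY_RUN, run and recover_node are hypothetical names for this example.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# recover_node primary|secondary
# Runs the per-node steps; only the node chosen as primary is promoted.
recover_node() {
    role="$1"
    case "$role" in
        primary|secondary) ;;
        *) echo "usage: recover_node primary|secondary" >&2; return 2 ;;
    esac

    # Assumption: heartbeat is stopped on both nodes before touching DRBD.
    run /etc/init.d/heartbeat stop || return 1

    # Steps the thread lists as "on both nodes".
    run drbdadm detach all || return 1
    run drbdadm state all  || return 1
    run drbdadm up all     || return 1
    run drbdadm state all  || return 1

    # Only on the node that should become primary.
    if [ "$role" = "primary" ]; then
        run drbdadm primary all || return 1
    fi

    run /etc/init.d/heartbeat start || return 1
}
```

Sourcing the file and calling `recover_node secondary` on one node and `recover_node primary` on the other prints the planned commands; nothing is executed unless DRY_RUN=0 is set. Stopping at the first failing command is a choice made for this sketch, so that a failed step is not followed by a promotion.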
