[Linux-HA] Split-Brain occurrence, what do I do now?

Doug Lochart Thu, 14 Feb 2008 19:08:16 -0800

I was following a few tutorials and had everything ready to test.  The
tutorial said to yank the power cord, but my sysadmin did not want me
to do that, so I simply pulled out both ethernet cables (one went to a
switched network, the other was a crossover between the two nodes for
heartbeats).  Server 2 did not take over (I had an init script error
that I forgot to fix).  I plugged the cables back in and restarted
both machines.  A cat of /proc/drbd displayed this:


GIT-hash: bd3e2c922f95c4fa0dca57a4f8c24bf8b249cc02 build by [EMAIL PROTECTED], 2008-02-01 07:33:35
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---
    ns:0 nr:0 dw:4 dr:249 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:1 misses:0 starving:0 dirty:0 changed:0
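
For anyone reading along, the status line above packs three things into
one row: connection state (cs), local/peer role (st) and local/peer disk
state (ds).  Below is a minimal sketch that splits them apart.  The
function name parse_status and the returned key names are made up for
illustration; only the line itself comes from the output above.

```python
import re

# Status line copied verbatim from the /proc/drbd output above.
STATUS = " 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---"


def parse_status(line):
    """Split a DRBD 8.0-style status line into named fields.

    parse_status and the key names below are hypothetical helpers,
    not part of any DRBD tool.
    """
    # Collect every "key:value" token; "0:" is skipped because it is
    # followed by a space, and "r---" has no colon at all.
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    local_role, peer_role = fields["st"].split("/")
    local_disk, peer_disk = fields["ds"].split("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


status = parse_status(STATUS)
for key, value in status.items():
    print(f"{key:12} {value}")
```

Read this way, the node is Primary with an up-to-date local disk, but it
is StandAlone and knows nothing about its peer.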

I looked in dmesg and saw this:

-----------------------------------------------------------------------------------------------------
drbd: initialised. Version: 8.0.8 (api:86/proto:86)
drbd: GIT-hash: bd3e2c922f95c4fa0dca57a4f8c24bf8b249cc02 build by [EMAIL PROTECTED], 2008-02-01 07:33:35
drbd: registered as block device major 147
drbd: minor_table @ 0xffff810073c19680
drbd0: disk( Diskless -> Attaching )
drbd0: Found 6 transactions (276 active extents) in activity log.
drbd0: max_segment_size ( = BIO size ) = 32768
drbd0: drbd_bm_resize called with capacity == 1953042632
drbd0: resync bitmap: bits=244130329 words=3814537
drbd0: size = 931 GB (976521316 KB)
drbd0: reading of bitmap took 580 jiffies
drbd0: recounting of set bits took additional 28 jiffies
drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
drbd0: disk( Attaching -> UpToDate )
drbd0: Writing meta data super block now.
drbd0: conn( StandAlone -> Unconnected )
drbd0: receiver (re)started
drbd0: conn( Unconnected -> WFConnection )
drbd0: Handshake successful: DRBD Network Protocol version 86
drbd0: conn( WFConnection -> WFReportParams )
drbd0: Split-Brain detected, dropping connection!
drbd0: self 6A01F5A6BB510A26:19D4D8C195E805A7:483E5FB7A5527AAD:0000000000000004
drbd0: peer 9C679A93F13525CE:19D4D8C195E805A6:483E5FB7A5527AAC:0000000000000004
drbd0: conn( WFReportParams -> Disconnecting )
drbd0: helper command: /sbin/drbdadm split-brain
drbd0: error receiving ReportState, l: 4!
drbd0: asender terminated
drbd0: tl_clear()
drbd0: Connection closed
drbd0: conn( Disconnecting -> StandAlone )
drbd0: receiver terminated
drbd0: role( Secondary -> Primary )
drbd0: Writing meta data super block now.
kjournald starting.  Commit interval 5 seconds
EXT3 FS on drbd0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.

-------------------------------------------------------------------------
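
The "self" and "peer" lines in the log above can be compared field by
field.  The sketch below does that under two assumptions that are not
stated anywhere in this post: that the four fields are the current,
bitmap and two history UUIDs in that order, and that the lowest bit of
each value is a flag to be ignored when comparing.  The names
parse_uuids and looks_diverged are hypothetical, and this is not DRBD's
actual decision logic, only a way to eyeball the two lines.

```python
# UUID strings copied verbatim from the dmesg output above.
SELF = "6A01F5A6BB510A26:19D4D8C195E805A7:483E5FB7A5527AAD:0000000000000004"
PEER = "9C679A93F13525CE:19D4D8C195E805A6:483E5FB7A5527AAC:0000000000000004"

# ASSUMPTION: field order is current, bitmap, then two history entries.
NAMES = ("current", "bitmap", "history1", "history2")


def parse_uuids(text):
    """Turn 'A:B:C:D' into a dict of ints (hypothetical helper).

    ASSUMPTION: the lowest bit is a flag, so it is masked off here.
    """
    return {name: int(part, 16) & ~1 for name, part in zip(NAMES, text.split(":"))}


def looks_diverged(mine, theirs):
    """True if both sides share an older generation but not the current one."""
    return (
        mine["current"] != theirs["current"]
        and mine["bitmap"] == theirs["bitmap"]
        and mine["bitmap"] != 0
    )


mine = parse_uuids(SELF)
theirs = parse_uuids(PEER)
for name in NAMES:
    same = "same" if mine[name] == theirs[name] else "DIFFERENT"
    print(f"{name:9} {mine[name]:016X} {theirs[name]:016X} {same}")
print("diverged:", looks_diverged(mine, theirs))
```

Under those assumptions only the first field differs between the two
nodes, which fits each node having started its own new data generation
while the other was unreachable.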

So I searched the mailing list for split-brain, and many of the posts I
found say that doing what I did (pulling both cables) will cause a
split-brain.  WTF??
I am using drbd 8.0.8, heartbeat 2.1.3_3 version 1 haresources style.

I am really confused.  I am following a tutorial and I go right into a
split brain.  I can't see how it would have been any different if I
yanked the power cord versus yanking the cables.  I thought this is
what heartbeat was supposed to handle?

How do I recover?  No data was lost as I was just doing an initial test.
What is it I need to do to prevent a split-brain from happening again?

Is there a good place to read about avoiding this situation?
Much of the info I have found jumps right in as if you were already a
master of this stuff.

In case you need them, my configs follow:

regards,

Douglas Lochart

haresources:
capestor1  IPaddr::10.3.120.140/24/eth0 drbddisk::r0 Filesystem::/dev/drbd0::/capestor::ext3 capestor-server

drbd.conf
-------------
global {
    usage-count yes;
}

common {
  syncer { rate 10M; }
}

resource r0 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    # these were commented out in the examples
    # outdate-peer "/usr/lib/drbd/outdate-peer.sh on amd 192.168.22.11 192.168.23.11 on alf 192.168.22.12 192.168.23.12";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    #pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
    # Notify someone in case DRBD split brained.
    #split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
  }

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
    # become-primary-on both;
  }

  disk {
    on-io-error   detach;
    # fencing resource-only;
    # size 10G;
  }

  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }

  syncer {
    rate 10M;
    # should this be 263168 for one terabyte?
    al-extents 257;
  }

  ##################################################################
  # Setup capestor1
  ##################################################################

  on capestor1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.3.120.134:7788;
    flexible-meta-disk  internal;
  }

  on capestor2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address    10.3.120.135:7788;
    meta-disk internal;
  }
}


-- 
What profits a man if he gains the whole world yet loses his soul?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
