Hi all!

Today I had a strange experience.

node A: 192.168.100.100, cc1-vie
  /dev/drbd1: primary
  /dev/drbd5: primary

node B: 192.168.100.101, cc1-sbg
  /dev/drbd1: secondary
  /dev/drbd5: secondary

The /dev/drbdX devices are used by a Xen domU.

resource manager-ha {
  startup {
    become-primary-on cc1-vie;
  }
  on cc1-vie {
    device    /dev/drbd1;
    disk      /dev/mapper/cc1--vienna-manager--disk--drbd;
    address   192.168.100.100:7789;
    meta-disk internal;
  }
  on cc1-sbg {
    device    /dev/drbd1;
    disk      /dev/mapper/cc1--sbg-manager--disk--drbd;
    address   192.168.100.101:7789;
    meta-disk internal;
  }
}

resource cc-manager-templates-ha {
  startup {
    become-primary-on cc1-vie;
  }
  on cc1-vie {
    device    /dev/drbd5;
    disk      /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
    address   192.168.100.100:7793;
    meta-disk internal;
  }
  on cc1-sbg {
    device    /dev/drbd5;
    disk      /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
    address   192.168.100.101:7793;
    meta-disk internal;
  }
}
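
As a sanity check (in case I mistyped something while pasting the config above): as far as I know, `drbdadm dump` re-parses /etc/drbd.conf and pretty-prints the result, so syntax slips such as a missing semicolon show up immediately as a parse error.

```shell
# Re-parse and print a single resource; a syntax error in drbd.conf
# (e.g. a missing ';') aborts with a parse error instead of printing.
drbdadm dump cc-manager-templates-ha

# Or dump all configured resources at once:
drbdadm dump all
```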

Everything was running fine. Then I rebooted both servers and spotted this in the kernel log:

block drbd5: Starting worker thread (from cqueue [1573])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (192 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 508 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )


This is the first thing that makes me nervous: there were ~500 MB to resynchronize, although the server was idle and everything was fully synchronized before the reboot.
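
If I understand the activity log correctly, an AL extent covers 4 MiB by default, and extents that were still "hot" when the node went down get marked out-of-sync on the next attach. The kernel log itself says the on-disk bitmap contributed 0 KB, so the 508 MB would come entirely from the AL. Under that assumption the number corresponds to 127 hot extents:

```shell
# Assumption: 4 MiB per activity-log extent (the DRBD default).
# 508 MB marked out-of-sync / 4 MiB per extent = 127 extents.
echo $((508 / 4))
```

Does that mean DRBD was not shut down cleanly, e.g. because the domU still had the device open when the host rebooted?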

Then I did some more reboots on node A and spotted again:

block drbd5: Starting worker thread (from cqueue [1630])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (126 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 488 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )

So why is it resynchronizing almost 500 MB again, although the partition sees no writes at all (it is merely mounted in a domU)?

Then some more reboots on node A and suddenly:

block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5:  state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5:  wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5:  state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5:  wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }


Then the status on node A was:

cc-manager-templates-ha Connected Primary/Secondary Diskless/UpToDate A r----

When I tried to attach the device manually, I got the error message "Split-Brain detected, dropping connection".


After some googling without finding any hint, the status suddenly changed:

cc-manager-templates-ha StandAlone Primary/Unknown UpToDate/DUnknown r---- xen-vbd: _cc-manager


So this single device is suddenly no longer connected. All the other DRBD devices are still connected and working fine; only this one is causing problems, although its configuration is identical.


What could cause such an issue? Everything was working fine, I just rebooted the servers.

Any hints what to do now to solve this issue?
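
The only procedure I have found so far is the manual split-brain recovery from the DRBD User's Guide. Since node B (cc1-sbg) is secondary and node A appears to hold the current data, I assume node B would be the split-brain "victim" whose changes get discarded, but please correct me if that is the wrong move here, given that node A is also reporting Diskless:

```shell
# On the node whose data is to be discarded (assumed here: cc1-sbg):
drbdadm secondary cc-manager-templates-ha
drbdadm -- --discard-my-data connect cc-manager-templates-ha

# On the surviving node (cc1-vie), if it is StandAlone rather than
# already waiting in WFConnection:
drbdadm connect cc-manager-templates-ha
```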

thanks
Klaus

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user