Hi Lars!

On 07.12.2010 19:45, Lars Ellenberg wrote:
On Mon, Dec 06, 2010 at 06:08:19PM +0100, Klaus Darilion wrote:
Hi all!

Today I had a strange experience.

node A: 192.168.100.100, cc1-vie
   /dev/drbd1: primary
   /dev/drbd5: primary

node B: 192.168.100.101, cc1-sbg
   /dev/drbd1: secondary
   /dev/drbd5: secondary

The /dev/drbdX devices are used by a xen domU.

resource manager-ha {
   startup {
     become-primary-on cc1-vie;
   }
   on cc1-vie {
     device    /dev/drbd1;
     disk      /dev/mapper/cc1--vienna-manager--disk--drbd;
     address   192.168.100.100:7789;
     meta-disk internal;
   }
   on cc1-sbg {
     device    /dev/drbd1;
     disk      /dev/mapper/cc1--sbg-manager--disk--drbd;
     address   192.168.100.101:7789;
     meta-disk internal;
   }
}

resource cc-manager-templates-ha {
   startup {
     become-primary-on cc1-vie;
   }
   on cc1-vie {
     device    /dev/drbd5;
   disk      /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
     address   192.168.100.100:7793;
     meta-disk internal;
   }
   on cc1-sbg {
     device    /dev/drbd5;
   disk      /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
     address   192.168.100.101:7793;
     meta-disk internal;
   }
}

Everything was running fine. Then I rebooted both servers, and afterwards I spotted:

block drbd5: Starting worker thread (from cqueue [1573])
block drbd5: disk( Diskless ->  Attaching )
block drbd5: Found 4 transactions (192 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 508 MB as out-of-sync based on AL.
block drbd5: disk( Attaching ->  UpToDate )


This is the first thing that makes me nervous: there were 500MB to
synchronize, although the server was idle and everything was in sync
before rebooting.

As was pointed out already,
read up on what we call the activity log.
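In short (assuming the 8.3 defaults): the activity log tracks "hot" 4MiB
extents, and after anything that looks like a Primary crash, every extent
still in the AL gets resynced. With the default al-extents of 127 that is
127 * 4MiB = 508MiB, which is exactly what your log reports. If that is too
much (or too little) for the workload, it can be tuned per resource:

   syncer {
     al-extents 127;  # 8.3 default; each extent covers 4MiB, 127 * 4MiB = 508MiB
   }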

Then some more reboots on node A and suddenly:

block drbd5: State change failed: Refusing to be Primary without at
least one UpToDate disk
block drbd5:   state = { cs:WFConnection ro:Secondary/Unknown
ds:Diskless/DUnknown r--- }
   ^^^^^^^^

You failed to attach, you have not yet connected,
so DRBD refuses to become Primary: which data should it be Primary with?
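For reference, the 8.3 userland can query those states per resource:

   drbdadm cstate cc-manager-templates-ha   # connection state, e.g. WFConnection
   drbdadm dstate cc-manager-templates-ha   # disk states, e.g. Diskless/DUnknown
   drbdadm role   cc-manager-templates-ha   # roles, e.g. Secondary/Unknown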

But how can it be Secondary without any disk?

Then the status on node A was:

cc-manager-templates-ha  Connected Primary/Secondary
Diskless/UpToDate A r----

It was able to establish the connection,
and was going Primary with the data of the peer.

Is this a feature? How can it know that the peer's data is up to date when it cannot attach to the local disk?

When I tried to manually attach the device I got error messages:
"Split-Brain detected, dropping connection".

Hm.  Ugly.
It should refuse the attach instead.
Did it just get the error message wrong,
or did it actually disconnect there?
What DRBD version would that be?

Ubuntu 10.04:
# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by r...@cc1-sbg, 2010-10-14 15:13:20


After some googling without finding any hint, the status suddenly changed:

cc-manager-templates-ha  StandAlone Primary/Unknown
UpToDate/DUnknown r---- xen-vbd: _cc-manager


So, suddenly this one device is not connected anymore. All the other
DRBD devices are still connected and working fine - only this single
device is causing problems, although it has an identical configuration.


What could cause such an issue? Everything was working fine, I just
rebooted the servers.

Any hints what to do now to solve this issue?

Your setup is broken.
Apparently something in your boot process, at least "sometimes",
claims the lower level devices so DRBD fails to attach.
  Fix that.
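A sketch of how to check what claims the backing device, using the
device-mapper path from your config (fuser/sysfs, nothing DRBD-specific):

   # resolve the /dev/mapper symlink to its dm-N node, then list holders
   DEV=$(basename $(readlink -f /dev/mapper/cc1--vienna-cc--manager--templates--drbd))
   ls /sys/block/$DEV/holders
   # and see whether a process still has the device open
   fuser -v /dev/mapper/cc1--vienna-cc--manager--templates--drbd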

Almost done (see below)

Your shutdown process is apparently broken enough to
not really shut down everything and demote/down DRBD,
so it stays Primary. That makes an "orderly" shutdown/reboot
look like a Primary crash to DRBD.
  Fix that.

Done. DRBD was shut down before xendomains, so DRBD refused to shut down because Xen still had the volumes mounted. I changed the order of the drbd symlinks in /etc/rcX.d/.
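For reference, the same reordering can be done with update-rc.d instead of
renaming the symlinks by hand (the sequence numbers below are only
illustrative, they must be chosen relative to the xendomains S/K numbers):

   update-rc.d -f drbd remove
   update-rc.d drbd start 70 2 3 4 5 . stop 08 0 1 6 .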

Are you sure that you have been the only one tampering with DRBD at the
time, or would heartbeat/pacemaker/whatever try to do something at the
same time?

no cluster managers - just me

And, BTW, no.
Your /etc/hosts file has zero to do with how DRBD behaves.

At least I can reproduce the bad behavior when I add the bug to /etc/hosts. I think it has something to do with how I address the disk. The volume which works fine is configured with:
  disk /dev/mapper/cc1--vienna-manager--disk--drbd

The other volume which causes the problems is configured with
  disk /dev/cc1-vienna/cc-manager-templates-drbd
which is a symlink to
  /dev/mapper/cc1--vienna-cc--manager--templates--drbd

So, I have no idea why, but it seems that if /etc/hosts is broken, the symlinks are not available when DRBD starts. When I stop/start the DRBD service after booting up, DRBD attaches to the disks fine. Strange.
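A crude, untested workaround might be to wait for the symlink before DRBD
attaches, e.g. at the top of the start action in /etc/init.d/drbd:

   # hypothetical: give udev/LVM up to 30 seconds to create the LV symlink
   i=0
   while [ ! -e /dev/cc1-vienna/cc-manager-templates-drbd ] && [ $i -lt 30 ]; do
       sleep 1
       i=$((i+1))
   done

Or simply point the disk directive at the /dev/mapper path, like the
resource that works.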

Thanks
Klaus