Thanks for the reply Felix,
Log extracts are included below & attached as requested.
> Hi,
>
> On 01/24/2011 01:20 AM, Lew wrote:
> > I've encountered some unexpected behavior with a split brain instance.
> > It seems from what has occurred that the default behavior is set to roll
> > back & discard changes.
> >
> > Recently in my sand pit, I've been manually disconnecting resources as
> > an ad hoc way of maintaining a snapshot for roll back.
> > This way if I'm happy with changes, I can reconnect and within seconds
> > we're fully synced again.
> >
> > I moved a server yesterday and discovered, after a drbdadm connect all,
> > that one of the resources had split brained and discarded a few days'
> > worth of work, rolling back to the point in time when the resource was
> > first disconnected.
> >
> > What's interesting to me is that the disconnected secondary node had
> > never been set primary, so how did we end up in split brain?
> > I also do not understand why it was only this resource that split
> > brained, when others that existed in seemingly identical configurations
> > and states did not.
> >
> > I expect I'll need to explicitly prohibit this behavior in a global net
> > section covering after-sb-0pri etc.
> > I still don't understand why discard & roll back has been chosen as the
> > default behavior; from my experience, I contend it should not be.
>
> It's not.
OK, seems to me something smells then.
> > Looks to me like a few days' work is lost, but if anyone knows of a way
> > to recover from a roll back discard scenario, I'd be very happy to
> > find out.
>
> Please share pertinent logs and drbd configuration.
Config
------
resource x2 {
  protocol A;
  syncer {
    rate 100M;
  }
  on emlsurit-v4 {
    device /dev/drbd9;
    disk /dev/r50lvm/emlsurit-x2-drbd;
    address 192.168.254.100:7799;
    flexible-meta-disk internal;
  }
  on emlsurit-v5 {
    device /dev/drbd9;
    disk /dev/r50lvm/emlsurit-x2-drbd;
    address 192.168.254.101:7799;
    meta-disk internal;
  }
}
Global Config (comments removed)
-------------
global {
  usage-count yes;
}
common {
  protocol A;
  handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh";
    echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh";
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
  }
  startup {
  }
  disk {
    no-disk-flushes;
    no-md-flushes;
  }
  net {
  }
  syncer {
  }
}
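For reference, this is roughly what I had in mind by covering after-sb-0pri etc. in the global net section. It is only a sketch, assuming the DRBD 8.3 option names after-sb-0pri, after-sb-1pri and after-sb-2pri and the policy value disconnect. As far as I can tell from the drbd.conf man page, disconnect is already the default for all three, so this would spell the defaults out rather than change anything; it is not part of my running config.

```
common {
  net {
    # Assumption: "disconnect" means no automatic resolution; the nodes
    # drop the connection and wait for manual intervention.

    # Policy when neither node is Primary at split brain detection.
    after-sb-0pri disconnect;

    # Policy when exactly one node is Primary.
    after-sb-1pri disconnect;

    # Policy when both nodes are Primary.
    after-sb-2pri disconnect;
  }
}
```

With everything set to disconnect, a detected split brain should leave both nodes StandAlone with their data untouched until someone picks a victim by hand.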
Message log extract (a bit long to post) -- see attached.
Cheers & thanks,
Lew
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.028980] block drbd9: Starting worker thread (from cqueue [1924])
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.029345] block drbd9: disk( Diskless -> Attaching )
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034009] block drbd9: Found 4 transactions (192 active extents) in activity log.
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034013] block drbd9: Method to ensure write ordering: barrier
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034017] block drbd9: Backing device's merge_bvec_fn() = ffffffff81435d30
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034020] block drbd9: max_segment_size ( = BIO size ) = 4096
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034023] block drbd9: drbd_bm_resize called with capacity == 31456248
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034144] block drbd9: resync bitmap: bits=3932031 words=61438
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.034146] block drbd9: size = 15 GB (15728124 KB)
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044907] block drbd9: recounting of set bits took additional 0 jiffies
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044910] block drbd9: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.044929] block drbd9: Marked additional 508 MB as out-of-sync based on AL.
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.049222] block drbd9: disk( Attaching -> UpToDate )
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.087220] block drbd9: conn( StandAlone -> Unconnected )
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.087237] block drbd9: Starting receiver thread (from drbd9_worker [2126])
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.087277] block drbd9: receiver (re)started
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.087281] block drbd9: conn( Unconnected -> WFConnection )
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.087332] block drbd9: conn( WFConnection -> Disconnecting )
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.287540] block drbd9: Discarding network configuration.
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.287690] block drbd9: Connection closed
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.287699] block drbd9: conn( Disconnecting -> StandAlone )
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.287823] block drbd9: receiver terminated
Jan 23 15:07:16 emlsurit-v4 kernel: [ 15.287828] block drbd9: Terminating receiver thread
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary -> Primary )
Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.122546] block drbd9: Creating new current UUID
Jan 23 15:55:06 emlsurit-v4 kernel: [ 2880.172752] type=1503 audit(1295758506.227:17): operation="open" pid=8340 parent=1787 profile="/usr/lib/libvirt/virt-aa-helper" requested_mask="r::" denied_mask="r::" fsuid=0 ouid=0 name="/dev/drbd9"
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806263] block drbd9: conn( StandAlone -> Unconnected )
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806312] block drbd9: Starting receiver thread (from drbd9_worker [2126])
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806353] block drbd9: receiver (re)started
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.806359] block drbd9: conn( Unconnected -> WFConnection )
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.904763] block drbd9: Handshake successful: Agreed network protocol version 91
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.904772] block drbd9: conn( WFConnection -> WFReportParams )
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.904796] block drbd9: Starting asender thread (from drbd9_receiver [18060])
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.904936] block drbd9: data-integrity-alg: <not-used>
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905963] block drbd9: drbd_sync_handshake:
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905967] block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 flags:0
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905971] block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905975] block drbd9: uuid_compare()=100 by rule 90
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.906273] block drbd9: helper command: /sbin/drbdadm split-brain minor-9
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.937925] block drbd9: conn( WFReportParams -> NetworkFailure )
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.937935] block drbd9: asender terminated
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.937938] block drbd9: Terminating asender thread
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.950821] block drbd9: helper command: /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.950827] block drbd9: conn( NetworkFailure -> Disconnecting )
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951122] block drbd9: Connection closed
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951129] block drbd9: conn( Disconnecting -> StandAlone )
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951149] block drbd9: receiver terminated
Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951151] block drbd9: Terminating receiver thread
Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487616] block drbd9: conn( StandAlone -> Unconnected )
Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487638] block drbd9: Starting receiver thread (from drbd9_worker [2126])
Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487690] block drbd9: receiver (re)started
Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487696] block drbd9: conn( Unconnected -> WFConnection )
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.182513] block drbd9: Handshake successful: Agreed network protocol version 91
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.182522] block drbd9: conn( WFConnection -> WFReportParams )
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.182539] block drbd9: Starting asender thread (from drbd9_receiver [20045])
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183313] block drbd9: data-integrity-alg: <not-used>
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183340] block drbd9: drbd_sync_handshake:
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183345] block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143799 flags:0
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183349] block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183353] block drbd9: uuid_compare()=100 by rule 90
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183610] block drbd9: helper command: /sbin/drbdadm split-brain minor-9
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192301] block drbd9: conn( WFReportParams -> NetworkFailure )
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192309] block drbd9: asender terminated
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192311] block drbd9: Terminating asender thread
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192702] block drbd9: helper command: /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192709] block drbd9: conn( NetworkFailure -> Disconnecting )
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193004] block drbd9: Connection closed
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193012] block drbd9: conn( Disconnecting -> StandAlone )
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193027] block drbd9: receiver terminated
Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193029] block drbd9: Terminating receiver thread
Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356300] block drbd9: conn( StandAlone -> Unconnected )
Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356326] block drbd9: Starting receiver thread (from drbd9_worker [2126])
Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356519] block drbd9: receiver (re)started
Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356527] block drbd9: conn( Unconnected -> WFConnection )