Hi all,
We are trying to use DRBD over a WAN. To simulate a LAN, we use an L2TP tunnel
between primary and secondary nodes.
The current setup uses a WAN emulator on the tunnel to emulate some network
constraints.
Incremental synchronization works, but we run into problems when we trigger a
full synchronization with:
> drbdadm invalidate r0
When we add only delay to the traffic (up to 60 ms), everything works fine, but
as soon as we add jitter, even as little as 2 ms, the mirrored partition locks
up and stops responding to monitoring after a few seconds.
The system then tries to force a switch-over, which sometimes fails, and we
have to wait for the full synchronization to finish.
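For anyone who wants to reproduce the two cases (delay only, and delay plus
jitter), here is a minimal sketch. It assumes Linux tc/netem instead of our
actual WAN emulator, and the interface name l2tpeth0 and the helper netem_cmd
are hypothetical. The helper only prints the commands; they would have to be
run as root to take effect.

```shell
#!/bin/sh
# Hypothetical helper: print the tc/netem command for a given delay and jitter.
# Arguments: interface name, delay in ms, jitter in ms (0 means no jitter).
netem_cmd() {
    dev=$1
    delay_ms=$2
    jitter_ms=$3
    if [ "$jitter_ms" -gt 0 ]; then
        # netem takes the jitter as a second value after the base delay
        echo "tc qdisc replace dev $dev root netem delay ${delay_ms}ms ${jitter_ms}ms"
    else
        echo "tc qdisc replace dev $dev root netem delay ${delay_ms}ms"
    fi
}

# Case that works for us: fixed 60 ms delay (interface name is an assumption).
netem_cmd l2tpeth0 60 0
# Case that fails for us: the same delay with 2 ms of jitter.
netem_cmd l2tpeth0 60 2
```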
We use DRBD 8.3.11 with a 3.2.0-49 kernel (Ubuntu 12.04).
Do you have any pointers?
Thanks a lot.
Jérôme
PS:
Here is our DRBD configuration:
global {
    usage-count no;
}
common {
    protocol B;
    handlers {
        initial-split-brain "/p25/bin/drbd-notify-initial-split-brain.sh";
        split-brain "/p25/bin/drbd-notify-split-brain.sh ; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/p25/bin/drbd-notify-pri-lost-after-sb.sh; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-on-incon-degr "/p25/bin/drbd-notify-pri-on-incon-degr.sh; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost "/p25/bin/drbd-notify-pri-lost.sh ; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        out-of-sync "/p25/bin/drbd-notify-out-of-sync.sh ; /p25/bin/drbd-notify-emergency-reboot.sh ; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/p25/bin/drbd-notify-io-error.sh ; /p25/bin/drbd-notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }
    startup {
    }
    disk {
    }
    net {
        after-sb-0pri discard-least-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri call-pri-lost-after-sb;
        rr-conflict call-pri-lost;
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 0;
    }
    syncer {
        rate 10M;
        verify-alg sha1;
    }
}
I don't see any DRBD errors in the logs:
Full sync starting:
Nov 8 15:49:08 localhost kernel: [88504.170917] block drbd0: conn( Connected
-> StartingSyncS ) pdsk( UpToDate -> Consistent )
Nov 8 15:49:08 localhost kernel: [88504.186884] block drbd0: bitmap WRITE of 4
pages took 0 jiffies
Nov 8 15:49:08 localhost kernel: [88504.191755] block drbd0: 510 MB (130515
bits) marked out-of-sync by on disk bit-map.
Nov 8 15:49:08 localhost kernel: [88504.198156] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0
Nov 8 15:49:08 localhost kernel: [88504.201237] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Nov 8 15:49:08 localhost kernel: [88504.201248] block drbd0: conn(
StartingSyncS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Nov 8 15:49:08 localhost kernel: [88504.201256] block drbd0: Began resync as
SyncSource (will sync 522060 KB [130515 bits set]).
Nov 8 15:49:08 localhost kernel: [88504.209902] block drbd0: updated sync UUID
0AB4A06E64296649:0001000000000000:0001000000000001:5662C47AC0552870
And stopping:
Nov 8 15:52:21 localhost kernel: [88697.310502] block drbd0: Resync done
(total 193 sec; paused 0 sec; 2704 K/sec)
Nov 8 15:52:21 localhost kernel: [88697.310513] block drbd0: updated UUIDs
0AB4A06E64296649:0000000000000000:0001000000000000:0001000000000001
Nov 8 15:52:21 localhost kernel: [88697.310524] block drbd0: conn( SyncSource
-> Connected ) pdsk( Inconsistent -> UpToDate )
Nov 8 15:52:21 localhost kernel: [88697.365073] block drbd0: bitmap WRITE of 0
pages took 0 jiffies
Nov 8 15:52:21 localhost kernel: [88697.366858] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
But there are filesystem and other errors:
Nov 8 15:49:12 localhost pengine: [964]: notice: unpack_rsc_op: Ignoring
expired failure mirroredFS_monitor_15000 (rc=-2,
magic=2:-2;14:293:0:e71d3650-2904-430b-90ce-db6f7cdd8d0e) on
19a21328-51e2-4130-bc85-c7e779598bf4
Nov 8 15:49:12 localhost pengine: [964]: WARN: unpack_rsc_op: Processing
failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown
error (1)
Nov 8 15:49:58 localhost crmd: [965]: ERROR: process_lrm_event: LRM operation
mirroredFS_monitor_15000 (61) Timed Out (timeout=40000ms)
Nov 8 15:49:58 localhost crmd: [965]: info: process_graph_event: Detected
action mirroredFS_monitor_15000 from a different transition: 293 vs. 1510
Nov 8 15:49:58 localhost pengine: [964]: WARN: unpack_rsc_op: Processing
failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown
error (1)
Nov 8 15:49:58 localhost attrd: [963]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-mirroredFS (1383925798)
Nov 8 15:50:58 localhost crmd: [965]: info: send_direct_ack: ACK'ing resource
op bcss_fail_60000 from 0:0:crm-resource-27411: lrm_invoke-lrmd-1383925858-1701
Nov 8 15:50:58 localhost crmd: [965]: info: process_lrm_event: LRM operation
bcss_asyncmon_0 (call=70, rc=1, cib-update=1677, confirmed=false) unknown error
Nov 8 15:50:58 localhost crmd: [965]: ERROR: process_graph_event: Action
bcss_asyncmon_0 (0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) initiated
outside of a transition
Nov 8 15:50:58 localhost crmd: [965]: info: abort_transition_graph:
process_graph_event:474 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=bcss_last_failure_0,
magic=0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, cib=0.323.426) :
Unexpected event
Nov 8 15:50:58 localhost crmd: [965]: WARN: update_failcount: Updating
failcount for bcss on 19a21328-51e2-4130-bc85-c7e779598bf4 after failed
asyncmon: rc=1 (update=value++, time=1383925858)
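As a sanity check on the resync figures in the DRBD log above, here is a small
sketch that recomputes the observed rate and compares it with the configured
syncer rate. It assumes that "rate 10M" means 10 * 1024 KB/s; the helper
rate_kbps is only for illustration.

```shell
#!/bin/sh
# Figures copied from the kernel log above.
total_kb=522060     # "will sync 522060 KB"
elapsed_s=193       # "total 193 sec"

# Assumption: "rate 10M" in the syncer section means 10 * 1024 KB/s.
configured_kbps=$((10 * 1024))

# Hypothetical helper: integer KB/s from a size in KB and a duration in seconds.
rate_kbps() {
    echo $(( $1 / $2 ))
}

observed_kbps=$(rate_kbps "$total_kb" "$elapsed_s")
percent=$(( observed_kbps * 100 / configured_kbps ))
ideal_s=$(( total_kb / configured_kbps ))

echo "observed rate: ${observed_kbps} KB/s"
echo "share of configured rate: ${percent}%"
echo "duration at configured rate: ${ideal_s} s (observed: ${elapsed_s} s)"
```

With integer division the observed rate comes out at 2704 KB/s, which matches
the "2704 K/sec" reported in the log, so the resync ran at roughly a quarter of
the configured rate.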
_______________________________________________
drbd-user mailing list
http://lists.linbit.com/mailman/listinfo/drbd-user