[DRBD-user] Mirrored partition locked with network jitter

Labussiere, Jerome Fri, 15 Nov 2013 02:30:46 -0800

Hi all,


We are trying to use DRBD over a WAN. To simulate a LAN, we use an L2TP tunnel 
between the primary and secondary nodes.

A WAN emulator on the tunnel imposes network constraints such as delay and 
jitter.

 

Incremental synchronization works, but we run into problems when we trigger a 
full synchronization with:

> drbdadm invalidate r0

 

When we add delay to the traffic (up to 60 ms), everything works fine. But as 
soon as we add jitter, even as little as 2 ms, the mirrored partition locks up 
and, after a few seconds, stops responding to monitoring.

The system then tries to force a switch-over, but this sometimes fails, and we 
have to wait for the full synchronization to finish.

 

We use DRBD 8.3.11 with a 3.2.0-49 kernel (Ubuntu 12.04).

 

Do you have any pointers?

 

Thanks a lot.

 

Jérôme

 

 

PS:

Here is our DRBD configuration:

global {
        usage-count no;
}

common {
        protocol B;

        handlers {
                initial-split-brain "/p25/bin/drbd-notify-initial-split-brain.sh";
                split-brain "/p25/bin/drbd-notify-split-brain.sh; /p25/bin/drbd-notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger; reboot -f";
                pri-lost-after-sb "/p25/bin/drbd-notify-pri-lost-after-sb.sh; /p25/bin/drbd-notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger; reboot -f";
                pri-on-incon-degr "/p25/bin/drbd-notify-pri-on-incon-degr.sh; /p25/bin/drbd-notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger; reboot -f";
                pri-lost "/p25/bin/drbd-notify-pri-lost.sh; /p25/bin/drbd-notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger; reboot -f";
                out-of-sync "/p25/bin/drbd-notify-out-of-sync.sh; /p25/bin/drbd-notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger; reboot -f";
                local-io-error "/p25/bin/drbd-notify-io-error.sh; /p25/bin/drbd-notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger; halt -f";
        }

        startup {
        }

        disk {
        }

        net {
                after-sb-0pri discard-least-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri call-pri-lost-after-sb;
                rr-conflict call-pri-lost;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 0;
        }

        syncer {
                rate 10M;
                verify-alg sha1;
        }
}

 

 

 

I don't see any DRBD errors in the logs:

Full sync starting:

Nov  8 15:49:08 localhost kernel: [88504.170917] block drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Consistent )
Nov  8 15:49:08 localhost kernel: [88504.186884] block drbd0: bitmap WRITE of 4 pages took 0 jiffies
Nov  8 15:49:08 localhost kernel: [88504.191755] block drbd0: 510 MB (130515 bits) marked out-of-sync by on disk bit-map.
Nov  8 15:49:08 localhost kernel: [88504.198156] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Nov  8 15:49:08 localhost kernel: [88504.201237] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Nov  8 15:49:08 localhost kernel: [88504.201248] block drbd0: conn( StartingSyncS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Nov  8 15:49:08 localhost kernel: [88504.201256] block drbd0: Began resync as SyncSource (will sync 522060 KB [130515 bits set]).
Nov  8 15:49:08 localhost kernel: [88504.209902] block drbd0: updated sync UUID 0AB4A06E64296649:0001000000000000:0001000000000001:5662C47AC0552870

 

And stopping:

Nov  8 15:52:21 localhost kernel: [88697.310502] block drbd0: Resync done (total 193 sec; paused 0 sec; 2704 K/sec)
Nov  8 15:52:21 localhost kernel: [88697.310513] block drbd0: updated UUIDs 0AB4A06E64296649:0000000000000000:0001000000000000:0001000000000001
Nov  8 15:52:21 localhost kernel: [88697.310524] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Nov  8 15:52:21 localhost kernel: [88697.365073] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Nov  8 15:52:21 localhost kernel: [88697.366858] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
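
As a sanity check on the figures in the resync log above, the reported rate can be recomputed from the bit count and duration and compared with the configured syncer rate. This is only a sketch under two assumptions that are not stated in the logs: that each bitmap bit covers 4 KiB, and that "rate 10M" means 10 MiB/s. The variable names are illustrative.

```python
# Figures copied from the kernel log of this run.
bits_out_of_sync = 130515   # "130515 bits set"
resync_seconds = 193        # "total 193 sec"

# Assumption: one bitmap bit tracks 4 KiB of data.
KIB_PER_BIT = 4
# Assumption: "rate 10M" in the syncer section means 10 MiB/s.
configured_kib_per_sec = 10 * 1024

total_kib = bits_out_of_sync * KIB_PER_BIT
# Integer division, to compare with the whole number printed in the log.
observed_kib_per_sec = total_kib // resync_seconds
fraction_of_configured = observed_kib_per_sec / configured_kib_per_sec

print(f"total to sync: {total_kib} KiB")
print(f"observed rate: {observed_kib_per_sec} KiB/s")
print(f"share of configured rate: {fraction_of_configured:.0%}")
```

Under those assumptions the numbers agree with the log (522060 KB, 2704 K/sec), and the full sync in this run averaged roughly a quarter of the configured rate.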

 

However, there are filesystem (FS) and other errors:

Nov  8 15:49:12 localhost pengine: [964]: notice: unpack_rsc_op: Ignoring expired failure mirroredFS_monitor_15000 (rc=-2, magic=2:-2;14:293:0:e71d3650-2904-430b-90ce-db6f7cdd8d0e) on 19a21328-51e2-4130-bc85-c7e779598bf4
Nov  8 15:49:12 localhost pengine: [964]: WARN: unpack_rsc_op: Processing failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown error (1)

Nov  8 15:49:58 localhost crmd: [965]: ERROR: process_lrm_event: LRM operation mirroredFS_monitor_15000 (61) Timed Out (timeout=40000ms)
Nov  8 15:49:58 localhost crmd: [965]: info: process_graph_event: Detected action mirroredFS_monitor_15000 from a different transition: 293 vs. 1510

Nov  8 15:49:58 localhost pengine: [964]: WARN: unpack_rsc_op: Processing failed op bcss_last_failure_0 on 19a21328-51e2-4130-bc85-c7e779598bf4: unknown error (1)
Nov  8 15:49:58 localhost attrd: [963]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-mirroredFS (1383925798)

Nov  8 15:50:58 localhost crmd: [965]: info: send_direct_ack: ACK'ing resource op bcss_fail_60000 from 0:0:crm-resource-27411: lrm_invoke-lrmd-1383925858-1701
Nov  8 15:50:58 localhost crmd: [965]: info: process_lrm_event: LRM operation bcss_asyncmon_0 (call=70, rc=1, cib-update=1677, confirmed=false) unknown error
Nov  8 15:50:58 localhost crmd: [965]: ERROR: process_graph_event: Action bcss_asyncmon_0 (0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx) initiated outside of a transition
Nov  8 15:50:58 localhost crmd: [965]: info: abort_transition_graph: process_graph_event:474 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=bcss_last_failure_0, magic=0:1;70:-1:0:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, cib=0.323.426) : Unexpected event
Nov  8 15:50:58 localhost crmd: [965]: WARN: update_failcount: Updating failcount for bcss on 19a21328-51e2-4130-bc85-c7e779598bf4 after failed asyncmon: rc=1 (update=value++, time=1383925858)
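
To relate the monitor failure to the start of the full sync, the timestamps in the logs above can be lined up. The sketch below works backwards from the mirroredFS monitor timeout to estimate when that monitor call began. It assumes the timeout was logged at the moment it expired, and that the year is 2013 (the syslog lines carry no year; this matches the date of this mail and the epoch value 1383925798 in the attrd line). Variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps copied from the logs; the year 2013 is an assumption (see above).
resync_started = datetime(2013, 11, 8, 15, 49, 8)      # "Began resync as SyncSource"
monitor_timed_out = datetime(2013, 11, 8, 15, 49, 58)  # "mirroredFS_monitor_15000 ... Timed Out"
monitor_timeout = timedelta(milliseconds=40000)        # "timeout=40000ms"

# Assumption: the crmd line is written when the 40 s timeout expires,
# so the hung monitor call started one timeout earlier.
monitor_started = monitor_timed_out - monitor_timeout
delay_after_resync = monitor_started - resync_started

print(f"hung monitor call started at about {monitor_started:%H:%M:%S}")
print(f"that is {delay_after_resync.total_seconds():.0f} s after the resync began")
```

On these assumptions the monitor call that never returned was issued about 10 seconds after the full sync started, which fits the "after a few seconds" behaviour described at the top of this mail.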

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
