Hi Eric,

I've had the pleasure of dealing with this exact issue, and in prod too :O


On Wed 12 Oct 2016 14:04:48, Eric Robinson wrote:
> This morning we are seeing an issue where drbd is repeatedly resyncing, 
> getting to 100%, and starting over, and never getting to an UpToDate/UpToDate 
> state.
> 
> On one node, it is logging this sequence over and over…
> 
> <snip>
> 
> Oct 12 06:56:11 ha14a kernel: d-con ha02_mysql: Starting asender thread (from 
> drbd_r_ha02_mys [804])
> Oct 12 06:56:11 ha14a kernel: block drbd1: drbd_sync_handshake:
> Oct 12 06:56:11 ha14a kernel: block drbd1: self 
> 13FB9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 
> flags:0
> Oct 12 06:56:11 ha14a kernel: block drbd1: peer 
> 38E17129E5821B5F:13FB9B08BF812C5B:13FA9B08BF812C5B:13F99B08BF812C5B bits:0 
> flags:0
> Oct 12 06:56:11 ha14a kernel: block drbd1: uuid_compare()=-1 by rule 50
> Oct 12 06:56:11 ha14a kernel: block drbd1: Becoming sync target due to disk 
> states.
> Oct 12 06:56:11 ha14a kernel: block drbd1: peer( Unknown -> Primary ) conn( 
> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 12 06:56:11 ha14a kernel: block drbd1: receive bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:56:11 ha14a kernel: block drbd1: send bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
> Oct 12 06:56:11 ha14a kernel: block drbd1: updated sync uuid 
> 13FC9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9
> Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-target minor-1
> Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-target minor-1 exit code 0 (0x0)
> Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFSyncUUID -> SyncTarget )
> Oct 12 06:56:11 ha14a kernel: block drbd1: Began resync as SyncTarget (will 
> sync 0 KB [0 bits set]).

The two lines below are the important ones: DRBD assumes a network
failure because the PingAck did not arrive in time.

> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in 
> time.
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary -> Unknown ) 
> conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )

You need to increase the timeout within which the PingAck is expected.

drbdadm net-options -v --ping-timeout=10 drbd0

This is the command I used. The --ping-timeout value is in tenths of a
second, so a value of '10' is actually 1s. Please confirm this in the
documentation for your version, as the DRBD I ran this on was 8.x.
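
In case the unit conversion is easy to get wrong under pressure, here is
a small Python sketch of the arithmetic. It assumes the tenths-of-a-second
unit described above; the function names are made up for illustration, and
it only builds the command string, it does not run drbdadm.

```python
def ping_timeout_units(seconds: float) -> int:
    """Convert seconds to tenths of a second (assumed --ping-timeout unit)."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    # Never return 0: round to the nearest tenth, with a floor of one tenth.
    return max(1, round(seconds * 10))


def net_options_command(resource: str, seconds: float) -> str:
    """Build (but do not execute) the drbdadm command for a given timeout."""
    units = ping_timeout_units(seconds)
    return f"drbdadm net-options -v --ping-timeout={units} {resource}"


# 'drbd0' is the resource name from my setup; substitute your own.
print(net_options_command("drbd0", 1.0))
```

For a 1s timeout on resource drbd0 this prints the same command as above.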

You may also need to tweak the timeout a bit.
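
One way to judge whether a given value is enough is to count how often
the PingAck message still shows up in the kernel log afterwards. A rough
Python sketch, assuming each message sits on a single line in the format
quoted above (the mail client wrapped it here); the function name is
hypothetical:

```python
import re
from collections import Counter

# Matches the "d-con <resource>: PingAck did not arrive in time" message
# as it appears in the log excerpt quoted in this thread.
PINGACK = re.compile(r"d-con (\S+): PingAck did not arrive in time")


def count_pingack_timeouts(lines):
    """Return a Counter of PingAck timeouts per DRBD resource name."""
    counts = Counter()
    for line in lines:
        match = PINGACK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in time.",
    "Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated",
]
print(count_pingack_timeouts(sample))
```

Feed it the lines of your own log file instead of the sample list. If the
count keeps growing after the change, the timeout is probably still too
low.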


Hope this helps

v


> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Terminating drbd_a_ha02_mys
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Connection closed
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( NetworkFailure -> 
> Unconnected )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver terminated
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Restarting receiver thread
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver (re)started
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( Unconnected -> 
> WFConnection )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Handshake successful: Agreed 
> network protocol version 101
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Peer authenticated using 20 
> bytes HMAC
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( WFConnection -> 
> WFReportParams )
> 
> </snip>
> 
> On the other node, it is saying this over and over…
> 
> <snip>
> 
> Oct 12 06:58:51 ha14b kernel: block drbd1: drbd_sync_handshake:
> Oct 12 06:58:51 ha14b kernel: block drbd1: self 
> 38E17129E5821B5F:148D9B08BF812C5B:148C9B08BF812C5B:148B9B08BF812C5B bits:0 
> flags:0
> Oct 12 06:58:51 ha14b kernel: block drbd1: peer 
> 148D9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 
> flags:0
> Oct 12 06:58:51 ha14b kernel: block drbd1: uuid_compare()=1 by rule 70
> Oct 12 06:58:51 ha14b kernel: block drbd1: Becoming sync source due to disk 
> states.
> Oct 12 06:58:51 ha14b kernel: block drbd1: peer( Unknown -> Secondary ) conn( 
> WFReportParams -> WFBitMapS )
> Oct 12 06:58:51 ha14b kernel: block drbd1: send bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:58:51 ha14b kernel: block drbd1: receive bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1
> Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1 exit code 0 (0x0)
> Oct 12 06:58:51 ha14b kernel: block drbd1: conn( WFBitMapS -> SyncSource )
> Oct 12 06:58:51 ha14b kernel: block drbd1: Began resync as SyncSource (will 
> sync 0 KB [0 bits set]).
> Oct 12 06:58:51 ha14b kernel: block drbd1: updated sync UUID 
> 38E17129E5821B5F:148E9B08BF812C5B:148D9B08BF812C5B:148C9B08BF812C5B
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: sock was shut down by peer
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: peer( Secondary -> Unknown ) 
> conn( SyncSource -> BrokenPipe )
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: short read (expected size 16)
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Connection closed
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( BrokenPipe -> 
> Unconnected )
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver terminated
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Restarting receiver thread
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver (re)started
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( Unconnected -> 
> WFConnection )
> 
> </snip>
> 
> However, I can guarantee that the network connection is solid. Running ping 
> flood, I get 30,000 packets sent with no loss or latency.
> 
> Help, please?
> 
> --
> Eric Robinson
> 

> _______________________________________________
> drbd-user mailing list
> [email protected]
> http://lists.linbit.com/mailman/listinfo/drbd-user


-- 
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265