Shot in the dark: are the drives (or their controller, if you're using RAID) 
using any form of caching? It is conceivable that when the resync finishes, DRBD 
tries to flush the data to the device, and if that takes far too long it 
could cause the DRBD kernel thread to time out.
Is I/O happening on those drives while they are resyncing?
Try running something like "sync ; sleep 1 ; sync" on the Inconsistent node 
while it's resyncing (I hope that won't kill your I/O).
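
A rough sketch of how that check could be timed, assuming a Linux node with 
/proc/meminfo and a POSIX shell. The helper names dirty_kb and timed_flush are 
made up for illustration; they are not DRBD tools. It shows how much dirty data 
is pending and how long the flush takes, so a slow flush shows up as a large 
number of seconds.

```shell
#!/bin/sh
# Hypothetical helper: print the Dirty and Writeback counters (in kB)
# from a meminfo-style file (defaults to /proc/meminfo).
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: run "sync ; sleep 1 ; sync" and report the elapsed
# wall-clock time, minus the 1 s sleep (one-second resolution only).
timed_flush() {
    start=$(date +%s)
    sync; sleep 1; sync
    end=$(date +%s)
    echo "flush took about $((end - start - 1)) s"
}

dirty_kb        # pending data before the flush
timed_flush     # a large value hints at a slow flush to the device
dirty_kb        # pending data after the flush
```

If the flush takes several seconds while the resync loop is running, that would 
at least be consistent with the caching theory; a fast flush would not rule it 
out.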

But that's really just a guess.

Jan

> On 12 Oct 2016, at 16:04, Eric Robinson <[email protected]> wrote:
> 
> This morning we are seeing an issue where drbd is repeatedly resyncing, 
> getting to 100%, and starting over, and never getting to an UpToDate/UpToDate 
> state.
>  
> On one node, it is logging this sequence over and over…
>  
> <snip>
>  
> Oct 12 06:56:11 ha14a kernel: d-con ha02_mysql: Starting asender thread (from 
> drbd_r_ha02_mys [804])
> Oct 12 06:56:11 ha14a kernel: block drbd1: drbd_sync_handshake:
> Oct 12 06:56:11 ha14a kernel: block drbd1: self 
> 13FB9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 
> flags:0
> Oct 12 06:56:11 ha14a kernel: block drbd1: peer 
> 38E17129E5821B5F:13FB9B08BF812C5B:13FA9B08BF812C5B:13F99B08BF812C5B bits:0 
> flags:0
> Oct 12 06:56:11 ha14a kernel: block drbd1: uuid_compare()=-1 by rule 50
> Oct 12 06:56:11 ha14a kernel: block drbd1: Becoming sync target due to disk 
> states.
> Oct 12 06:56:11 ha14a kernel: block drbd1: peer( Unknown -> Primary ) conn( 
> WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 12 06:56:11 ha14a kernel: block drbd1: receive bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:56:11 ha14a kernel: block drbd1: send bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
> Oct 12 06:56:11 ha14a kernel: block drbd1: updated sync uuid 
> 13FC9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9
> Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-target minor-1
> Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-target minor-1 exit code 0 (0x0)
> Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFSyncUUID -> SyncTarget )
> Oct 12 06:56:11 ha14a kernel: block drbd1: Began resync as SyncTarget (will 
> sync 0 KB [0 bits set]).
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in 
> time.
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary -> Unknown ) 
> conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Terminating drbd_a_ha02_mys
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Connection closed
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( NetworkFailure -> 
> Unconnected )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver terminated
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Restarting receiver thread
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver (re)started
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( Unconnected -> 
> WFConnection )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Handshake successful: Agreed 
> network protocol version 101
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Peer authenticated using 20 
> bytes HMAC
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( WFConnection -> 
> WFReportParams )
>  
> </snip>
>  
> On the other node, it is saying this over and over…
>  
> <snip>
>  
> Oct 12 06:58:51 ha14b kernel: block drbd1: drbd_sync_handshake:
> Oct 12 06:58:51 ha14b kernel: block drbd1: self 
> 38E17129E5821B5F:148D9B08BF812C5B:148C9B08BF812C5B:148B9B08BF812C5B bits:0 
> flags:0
> Oct 12 06:58:51 ha14b kernel: block drbd1: peer 
> 148D9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 
> flags:0
> Oct 12 06:58:51 ha14b kernel: block drbd1: uuid_compare()=1 by rule 70
> Oct 12 06:58:51 ha14b kernel: block drbd1: Becoming sync source due to disk 
> states.
> Oct 12 06:58:51 ha14b kernel: block drbd1: peer( Unknown -> Secondary ) conn( 
> WFReportParams -> WFBitMapS )
> Oct 12 06:58:51 ha14b kernel: block drbd1: send bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:58:51 ha14b kernel: block drbd1: receive bitmap stats 
> [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1
> Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm 
> before-resync-source minor-1 exit code 0 (0x0)
> Oct 12 06:58:51 ha14b kernel: block drbd1: conn( WFBitMapS -> SyncSource )
> Oct 12 06:58:51 ha14b kernel: block drbd1: Began resync as SyncSource (will 
> sync 0 KB [0 bits set]).
> Oct 12 06:58:51 ha14b kernel: block drbd1: updated sync UUID 
> 38E17129E5821B5F:148E9B08BF812C5B:148D9B08BF812C5B:148C9B08BF812C5B
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: sock was shut down by peer
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: peer( Secondary -> Unknown ) 
> conn( SyncSource -> BrokenPipe )
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: short read (expected size 16)
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Connection closed
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( BrokenPipe -> 
> Unconnected )
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver terminated
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Restarting receiver thread
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver (re)started
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( Unconnected -> 
> WFConnection )
>  
> </snip>
>  
> However, I can guarantee that the network connection is solid. Running ping 
> flood, I get 30,000 packets sent with no loss or latency.
>  
> Help, please?
>  
> --
> Eric Robinson
>  
> _______________________________________________
> drbd-user mailing list
> [email protected] <mailto:[email protected]>
> http://lists.linbit.com/mailman/listinfo/drbd-user 
> <http://lists.linbit.com/mailman/listinfo/drbd-user>
