> -----Original Message-----
> From: drbd-user-boun...@lists.linbit.com [mailto:drbd-user-
> boun...@lists.linbit.com] On Behalf Of Lars Ellenberg
> Sent: Wednesday, October 12, 2016 11:49 PM
> To: email@example.com
> Subject: Re: [DRBD-user] DRBD constantly re-syncing, getting to 100%,
> starting over. What?
> On Wed, Oct 12, 2016 at 04:35:58PM +0200, Jan Schermer wrote:
> > Shot in the dark - are the drives (or their controller if you're
> > using raid) using any form of caching? It is conceivable that when
> > resync is finished it tries flushing the data to the device, and if
> > this takes waaaaay too long it could lead to timeout of the drbd kernel
> > thread.
> > Is IO happening on those drives when they are resyncing?
> > Try running something like "sync ; sleep 1 ; sync" on the Inconsistent
> > node when it's resyncing (I hope that won't kill your IO)
> sync only affects stuff in the linux (buffer/) page cache; DRBD sits below
> that, so sync has "no effect" on DRBD IO.
> > > Oct 12 06:56:11 ha14a kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
> > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in
> > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary ->
> > > Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate ->
> > > DUnknown )
> has been said before:
> DRBD ping timeout is apparently too short for the latency in your setup.
> increase it appropriately.
> Where latency in this case involves network rtt plus kernel thread scheduling
> plus maybe additional synchronous (flush/fua) IO plus whatever else DRBD
> feels is necessary for a full DRBD to DRBD round-trip.
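> A minimal sketch of what raising it might look like in the resource
> configuration, assuming DRBD 8.4 syntax. The resource name ha02_mysql is
> taken from the log above; the value 30 is only an illustrative assumption,
> not a recommendation. ping-timeout is given in tenths of a second, and the
> default is 5 (500 ms).
>
> ```
> resource ha02_mysql {
>   net {
>     # time to wait for a PingAck, in tenths of a second (default: 5)
>     ping-timeout 30;   # hypothetical value: 3 seconds
>   }
>   # ... rest of the existing resource definition unchanged ...
> }
> ```
>
> The change would have to be made on both nodes and then applied, for
> example with "drbdadm adjust ha02_mysql".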
> > > However, I can guarantee that the network connection is solid.
> > > Running ping flood, I get 30,000 packets sent with no loss or
> > > latency.
> Mind telling us the network characteristics? IO backend?
> Virtualized? Distribution? Kernel and DRBD version(s)?
We have a dozen other DRBD clusters and this has never happened to any of the
others over the past decade or so, and they are all on the same switched
network. The nodes are in different data centers 22 miles apart connected by
gigabit fiber. Latency is always sub-millisecond. See the following ping
flood output:
[root@ha14a ~]# ping -f ha14b-cl
PING ha14b-cl.mycharts.md (198.51.100.43) 56(84) bytes of data.
--- ha14b-cl.mycharts.md ping statistics ---
23433 packets transmitted, 23432 received, 0% packet loss, time 15911ms
rtt min/avg/max/mdev = 0.585/0.659/0.847/0.021 ms, ipg/ewma 0.679/0.658 ms
The servers are all physical, running RHEL 6.3 with kernel
2.6.32-279.el6.x86_64. The DRBD version is 8.4.3.
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> DRBD(r) and LINBIT(r) are registered trademarks of LINBIT __ please don't Cc
> me, but send to list -- I'm subscribed
> drbd-user mailing list