Hi all,

I've got very similar problem to

http://lists.linbit.com/pipermail/drbd-user/2015-July/022243.html

that.


OS: Red Hat Enterprise Linux Server release 6.7 (Santiago)
kernel: kernel-2.6.32-573.1.1.el6.x86_64
DRBD: kmod-drbd84-8.4.6-1.el6.elrepo.x86_64


Two node cluster where after upgrade to the lastest kernel + Corosync
Pacemaker DRBD I've encoutered problems.

Secondary does a full resync, seems to successfully finish it but then
'PingAck did not arrive in time' happens and the resync repeats. This
keeps happening in a loop thereafter.

Please note that I could not observe any network problems, no problems
with the resync or any other services, except the very end of the
resync.

I've tried reboot the secondary (target) node, invalidating the data
and forcing full resync by hand, wating about 4 rounds of the loop :)
But to no avail.

Currently I'm in process of downgrading the kernel to see what happens
but as it's a production server this will take little time (outage) :/

I'm also fairly new to DRBD so please point out anything obvious I may
be missing.

thanks

v


Logs:

primary node (SyncSource)

Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( 
SyncSource -> NetworkFailure ) 
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:45 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Connection closed
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected 
) 
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:46 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) 
Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network 
protocol version 101
Aug 12 15:22:47 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol 
level
Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams 
) 
Aug 12 15:22:48 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from 
drbd_r_drbd0 [3465])
Aug 12 15:22:48 dbpg02 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:48 dbpg02 kernel: block drbd0: self 
453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 
flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer 
453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 
flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: was SyncSource, missed the resync 
finished event, corrected myself:
Aug 12 15:22:49 dbpg02 kernel: block drbd0: self 
453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:162 
flags:0
Aug 12 15:22:49 dbpg02 kernel: block drbd0: uuid_compare()=1 by rule 34
Aug 12 15:22:49 dbpg02 kernel: block drbd0: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapS ) pdsk( Inconsistent -> Consistent ) 
Aug 12 15:22:49 dbpg02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg02 kernel: block drbd0: receive bitmap stats 
[Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm 
before-resync-source minor-0
Aug 12 15:22:50 dbpg02 kernel: block drbd0: helper command: /sbin/drbdadm 
before-resync-source minor-0 exit code 0 (0x0)
Aug 12 15:22:50 dbpg02 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) 
pdsk( Consistent -> Inconsistent ) 
Aug 12 15:22:50 dbpg02 kernel: block drbd0: Began resync as SyncSource (will 
sync 672 KB [168 bits set]).
Aug 12 15:22:51 dbpg02 kernel: block drbd0: updated sync UUID 
453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197:FAE6FAB0FA09B197
Aug 12 15:22:52 dbpg02 kernel: block drbd0: Resync done (total 1 sec; paused 0 
sec; 672 K/sec)
Aug 12 15:22:52 dbpg02 kernel: block drbd0: updated UUIDs 
453310593995C7C3:0000000000000000:803193A96DCAEE9C:FAE7FAB0FA09B197
Aug 12 15:22:52 dbpg02 kernel: block drbd0: conn( SyncSource -> Connected ) 
pdsk( Inconsistent -> UpToDate ) 
Aug 12 15:22:52 dbpg02 crmd[5746]:   notice: crm_add_logfile: Additional 
logging available in /var/log/pacemaker.log
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: sock was shut down by peer
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: peer( Secondary -> Unknown ) conn( 
Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: short read (expected size 16)
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: meta connection shut down by peer.
Aug 12 15:22:53 dbpg02 kernel: block drbd0: new current UUID 
EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:53 dbpg02 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Connection closed
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected ) 
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm 
fence-peer drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: conn( Unconnected -> WFConnection )
Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: invoked for drbd0
Aug 12 15:22:54 dbpg02 kernel: drbd drbd0: Handshake successful: Agreed network 
protocol version 101
Aug 12 15:22:54 dbpg02 crmd[2719]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Aug 12 15:22:54 dbpg02 crm-fence-peer.sh[5817]: INFO peer is reachable, my disk 
is UpToDate: placed constraint 'drbd-fence-by-handler-drbd0-ms-drbd0'
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: update_validation: 
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: update_validation: Upgrading 
pacemaker-1.3-style configuration to pacemaker-2.0 with upgrade-1.3.xsl
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: update_validation: Transformed 
the configuration from pacemaker-1.2 to pacemaker-2.0
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Aug 12 15:22:54 dbpg02 pengine[2718]:   notice: process_pe_message: Calculated 
Transition 8: /var/lib/pacemaker/pengine/pe-input-20.bz2
Aug 12 15:22:54 dbpg02 crmd[2719]:   notice: run_graph: Transition 8 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-20.bz2): Complete
Aug 12 15:22:54 dbpg02 crmd[2719]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Agreed to support TRIM on protocol 
level
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: helper command: /sbin/drbdadm 
fence-peer drbd0 exit code 4 (0x400)
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: fence-peer helper returned 4 (peer 
was fenced)
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: pdsk( DUnknown -> Outdated ) 
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: conn( WFConnection -> WFReportParams 
) 
Aug 12 15:22:55 dbpg02 kernel: drbd drbd0: Starting ack_recv thread (from 
drbd_r_drbd0 [3465])
Aug 12 15:22:55 dbpg02 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:55 dbpg02 kernel: block drbd0: self 
EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 
flags:0
Aug 12 15:22:56 dbpg02 kernel: block drbd0: peer 
803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 
flags:0
Aug 12 15:22:56 dbpg02 kernel: block drbd0: uuid_compare()=2 by rule 80
Aug 12 15:22:56 dbpg02 kernel: block drbd0: Becoming sync source due to disk 
states.
Aug 12 15:22:56 dbpg02 kernel: block drbd0: Writing the whole bitmap, full sync 
required after drbd_sync_handshake.
Aug 12 15:22:56 dbpg02 kernel: block drbd0: bitmap WRITE of 6144 pages took 218 
jiffies
Aug 12 15:22:57 dbpg02 kernel: block drbd0: 768 GB (201320439 bits) marked 
out-of-sync by on disk bit-map.



secondary node (SyncTarget)


Aug 12 15:22:45 dbpg01 kernel: block drbd0: Resync done (total 8 sec; paused 0 
sec; 13212 K/sec)
Aug 12 15:22:45 dbpg01 kernel: block drbd0: updated UUIDs 
453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:45 dbpg01 kernel: block drbd0: conn( SyncTarget -> Connected ) 
disk( Inconsistent -> UpToDate ) 
Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm 
after-resync-target minor-0
Aug 12 15:22:45 dbpg01 crm-unfence-peer.sh[7419]: invoked for drbd0
Aug 12 15:22:45 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm 
after-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: sock was shut down by peer
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( 
Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: short read (expected size 16)
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: meta connection shut down by peer.
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Connection closed
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: conn( BrokenPipe -> Unconnected ) 
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:46 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) 
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network 
protocol version 101
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol 
level
Aug 12 15:22:47 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams 
) 
Aug 12 15:22:48 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from 
drbd_r_drbd0 [7326])
Aug 12 15:22:48 dbpg01 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:48 dbpg01 kernel: block drbd0: self 
453310593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 
flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 
453310593995C7C3:FAE7FAB0FA09B197:FAE6FAB0FA09B197:2B42A0D9A256A63A bits:152 
flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: was SyncTarget, peer missed the 
resync finished event, corrected peer:
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer 
453310593995C7C3:0000000000000000:FAE7FAB0FA09B197:FAE6FAB0FA09B197 bits:152 
flags:0
Aug 12 15:22:49 dbpg01 kernel: block drbd0: uuid_compare()=-1 by rule 35
Aug 12 15:22:49 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> 
UpToDate ) 
Aug 12 15:22:49 dbpg01 kernel: block drbd0: receive bitmap stats 
[Bytes(packets)]: plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 48(1), total 48; compression: 100.0%
Aug 12 15:22:50 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug 12 15:22:51 dbpg01 kernel: block drbd0: updated sync uuid 
803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm 
before-resync-target minor-0
Aug 12 15:22:51 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm 
before-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:51 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) 
disk( Outdated -> Inconsistent ) 
Aug 12 15:22:51 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will 
sync 672 KB [168 bits set]).
Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: PingAck did not arrive in time.
Aug 12 15:22:52 dbpg01 kernel: drbd drbd0: peer( Primary -> Unknown ) conn( 
SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: ack_receiver terminated
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Terminating drbd_a_drbd0
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Connection closed
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( NetworkFailure -> Unconnected 
) 
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver terminated
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: Restarting receiver thread
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: receiver (re)started
Aug 12 15:22:53 dbpg01 kernel: drbd drbd0: conn( Unconnected -> WFConnection ) 
Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Handshake successful: Agreed network 
protocol version 101
Aug 12 15:22:54 dbpg01 kernel: drbd drbd0: Agreed to support TRIM on protocol 
level
Aug 12 15:22:54 dbpg01 crmd[6750]:   notice: handle_request: Current ping 
state: S_NOT_DC
Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: conn( WFConnection -> WFReportParams 
) 
Aug 12 15:22:55 dbpg01 kernel: drbd drbd0: Starting ack_recv thread (from 
drbd_r_drbd0 [7326])
Aug 12 15:22:55 dbpg01 kernel: block drbd0: drbd_sync_handshake:
Aug 12 15:22:55 dbpg01 kernel: block drbd0: self 
803193A96DCAEE9C:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197 bits:0 
flags:0
Aug 12 15:22:56 dbpg01 kernel: block drbd0: peer 
EBEE6945467BB233:453310593995C7C3:803193A96DCAEE9C:FAE7FAB0FA09B197 bits:32 
flags:0
Aug 12 15:22:56 dbpg01 kernel: block drbd0: uuid_compare()=-2 by rule 60
Aug 12 15:22:56 dbpg01 kernel: block drbd0: Becoming sync target due to disk 
states.
Aug 12 15:22:56 dbpg01 kernel: block drbd0: Writing the whole bitmap, full sync 
required after drbd_sync_handshake.
Aug 12 15:22:56 dbpg01 kernel: block drbd0: bitmap WRITE of 6144 pages took 298 
jiffies
Aug 12 15:22:56 dbpg01 kernel: block drbd0: 768 GB (201320439 bits) marked 
out-of-sync by on disk bit-map.
Aug 12 15:22:57 dbpg01 attrd[6748]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd0 (<null>)
Aug 12 15:22:57 dbpg01 attrd[6748]:   notice: attrd_perform_update: Sent delete 
19: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, 
set=(null), section=status
Aug 12 15:22:57 dbpg01 attrd[6748]:   notice: attrd_perform_update: Sent delete 
21: node=dbpg01.sdp.syd.optusnet.com.au, attr=master-drbd0, id=<n/a>, 
set=(null), section=status
Aug 12 15:22:57 dbpg01 kernel: block drbd0: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Aug 12 15:22:57 dbpg01 crmd[7488]:   notice: crm_add_logfile: Additional 
logging available in /var/log/pacemaker.log
Aug 12 15:22:57 dbpg01 kernel: block drbd0: receive bitmap stats 
[Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 12 15:22:58 dbpg01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: 
plain 0(0), RLE 23(1), total 23; compression: 100.0%
Aug 12 15:22:58 dbpg01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug 12 15:22:59 dbpg01 kernel: block drbd0: updated sync uuid 
453410593995C7C2:0000000000000000:FAE7FAB0FA09B196:FAE6FAB0FA09B197
Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm 
before-resync-target minor-0
Aug 12 15:22:59 dbpg01 kernel: block drbd0: helper command: /sbin/drbdadm 
before-resync-target minor-0 exit code 0 (0x0)
Aug 12 15:22:59 dbpg01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Aug 12 15:22:59 dbpg01 kernel: block drbd0: Began resync as SyncTarget (will 
sync 805281756 KB [201320439 bits set]).





-- 
Regards

Viktor Villafuerte
Optus Internet Engineering
t: +61 2 80825265
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user

Reply via email to