On 02/11/18 08:45, Jarno Elonen wrote:
More clues:
Just witnessed a resync (after invalidate) to steadily go from 100%
out-of-sync to 0% (after several automatic disconnects and reconnects).
Immediately after reaching 0%, it went to -<very-large-number>%!
After that, drbdtop started showing 8.0ZiB out-of-sync.
Looks like a severe wrap-around bug.
-Jarno
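The wrap-around suspicion can be checked with plain arithmetic on the two counters from the status output in the original report (quoted further down). This is only a sketch: it assumes both counters are 64-bit and that out-of-sync is reported in KiB, which would match the 8.0ZiB shown by drbdtop. The helper name as_signed64 and the -3008 value are illustrative, and nothing here is a claim about where in DRBD the values actually originate.

```python
# Counter values copied verbatim from the drbdsetup status output in this thread.
OUT_OF_SYNC = 9223372036854774304
RS_IN_FLIGHT = 18446744073709549808

MASK64 = (1 << 64) - 1


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


# rs-in-flight sits just below 2**64, so it reads as a small negative number.
rs_signed = as_signed64(RS_IN_FLIGHT)

# out-of-sync sits just below 2**63.
oos_gap = (1 << 63) - OUT_OF_SYNC

# Illustration only (hypothetical value): a small negative count stored as
# unsigned 64-bit and then halved lands exactly on the reported number.
wrapped_then_halved = ((-3008) & MASK64) >> 1

# Assumption: the counter is in KiB. Size in ZiB (2**70 bytes):
oos_zib = OUT_OF_SYNC * 1024 / 2**70

print(rs_signed, oos_gap, wrapped_then_halved == OUT_OF_SYNC, round(oos_zib, 1))
```

Both values are a small distance below a power of two (1808 below 2**64 and 1504 below 2**63), which is what one would expect if a counter was decremented below zero rather than if the data were really that far out of sync.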
On Thu, 1 Nov 2018 at 22:30, Jarno Elonen <[email protected]> wrote:
Here's some more info.
Dmesg shows some suspicious-looking log messages, such as:
1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive
bitmap' by drbd_r_vm-117-s[5038]
2) Wrong magic value 0xffff0007 in protocol version 114
3) peer request with dagtag 399201392 not found
got_peer_ack [drbd] failed
4) Rejecting concurrent remote state change 2226202936 because of
state change 2923939731
Ignoring P_TWOPC_ABORT packet 2226202936.
5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but
bitmap already locked for 'write from resync_finished' by
drbd_w_vm-117-s[2812]
md_sync_timer expired! Worker calls drbd_md_sync().
6) incompatible discard-my-data settings
conn( Connecting -> Disconnecting )
error receiving P_PROTOCOL, e: -5 l: 7!
Two of the four nodes run DRBD 9.0.15-1 and two run 9.0.16-1. All
of them report transport API version 16:
== mox-a ==
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
root@mox-a, 2018-10-28 03:08:58
Transports (api:16): tcp (9.0.15-1)
== mox-b ==
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
root@mox-b, 2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)
== mox-c ==
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
root@mox-c, 2018-10-28 05:45:05
Transports (api:16): tcp (9.0.16-1)
== mox-d ==
version: 9.0.16-1 (api:2/proto:86-114)
GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
root@mox-d, 2018-10-29 00:22:23
Transports (api:16): tcp (9.0.16-1)
Running Proxmox (5.2-2), as you can guess from the host names. The DRBD
resources are managed by LINSTOR.
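A mixed-version cluster like this one can be spotted mechanically. The following is a minimal sketch that groups hosts by the DRBD version string, using the "version:" lines exactly as they appear in the listing above. How those lines are collected from each node is left out, and the names VERSION_LINES, parse_version and group_by_version are made up for this example.

```python
import re

# "version:" lines copied from the listing above, keyed by host name.
VERSION_LINES = {
    "mox-a": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-b": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-c": "version: 9.0.16-1 (api:2/proto:86-114)",
    "mox-d": "version: 9.0.16-1 (api:2/proto:86-114)",
}

VERSION_RE = re.compile(r"version:\s*(\S+)\s*\(api:(\d+)/proto:(\d+)-(\d+)\)")


def parse_version(line: str) -> tuple[str, tuple[int, int]]:
    """Return (version string, (min protocol, max protocol)) for one line."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized version line: {line!r}")
    return match.group(1), (int(match.group(3)), int(match.group(4)))


def group_by_version(lines: dict[str, str]) -> dict[str, list[str]]:
    """Map each DRBD version string to the sorted list of hosts running it."""
    groups: dict[str, list[str]] = {}
    for host, line in sorted(lines.items()):
        version, _proto_range = parse_version(line)
        groups.setdefault(version, []).append(host)
    return groups


groups = group_by_version(VERSION_LINES)
if len(groups) > 1:
    print("mixed DRBD versions:", groups)
```

For the four nodes above this reports two groups, even though the protocol range (86-114) is the same on all of them.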
On Thu, 1 Nov 2018 at 17:32, Jarno Elonen <[email protected]> wrote:
Okay, today one of these resources suffered sudden, severe
filesystem corruption on the primary.
On the other hand, the secondaries (that showed 8ZiB
out-of-sync) were still mountable after I disconnected the
corrupted primary. No idea how current the data on the secondaries
was, but drbdtop still showed them as connected and 8ZiB out-of-sync.
This is getting quite worrisome. Is anyone else experiencing
this with DRBD 9? Is it something really wrong in my setup, or
are there perhaps some known instabilities in DRBD 9.0.15-1?
-Jarno
On Wed, 31 Oct 2018 at 20:46, Jarno Elonen <[email protected]> wrote:
I've got several DRBD 9 resources that constantly show
*UpToDate* with 9223372036854774304 KiB (just under 8ZiB) of
out-of-sync data.
Any idea what might cause this and how to fix it?
Example:
# drbdsetup status --verbose --statistics vm-106-disk-1
vm-106-disk-1 node-id:0 role:Primary suspended:no
write-ordering:flush
volume:0 minor:1003 disk:UpToDate quorum:yes
size:16777688 read:215779 written:22369564
al-writes:89 bm-writes:0 upper-pending:0
lower-pending:0 al-suspended:no blocked:no
mox-a node-id:1 connection:Connected role:Secondary
congested:no ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
received:215116 sent:22368903
out-of-sync:9223372036854774304 pending:0 unacked:0
mox-c node-id:2 connection:Connected role:Secondary
congested:no ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
received:1188 sent:19884428 out-of-sync:0 pending:0
unacked:0
Version info:
version: 9.0.15-1 (api:2/proto:86-114)
GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
root@mox-b, 2018-10-10 17:50:25
Transports (api:16): tcp (9.0.15-1)
-Jarno
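Counters like the ones in this status output are easy to flag automatically. Below is a rough sketch that scans status text for numeric key:value fields and reports any value at or above a threshold. The threshold of 2**62 and the function name suspicious_counters are arbitrary choices for this example, and the sample text is an excerpt of the output above (the mox-a peer section) rather than live output.

```python
import re

# Excerpt of the status output above (the mox-a peer section).
STATUS_TEXT = """\
mox-a node-id:1 connection:Connected role:Secondary
congested:no ap-in-flight:0
rs-in-flight:18446744073709549808
volume:0 replication:Established peer-disk:UpToDate
resync-suspended:no
received:215116 sent:22368903
out-of-sync:9223372036854774304 pending:0 unacked:0
"""

# Assumption: no real counter on this cluster legitimately gets this large.
SUSPICIOUS_THRESHOLD = 1 << 62


def suspicious_counters(text: str, threshold: int = SUSPICIOUS_THRESHOLD):
    """Return (key, value) pairs whose numeric value is >= threshold."""
    hits = []
    # Only fields of the form key:<digits> are considered; text-valued
    # fields such as connection:Connected are skipped.
    for key, value in re.findall(r"([a-z-]+):(\d+)\b", text):
        if int(value) >= threshold:
            hits.append((key, int(value)))
    return hits


for key, value in suspicious_counters(STATUS_TEXT):
    print(f"{key} looks wrapped: {value}")
```

On the excerpt this flags rs-in-flight and out-of-sync and leaves the ordinary counters (received, sent, pending, unacked) alone.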
Not exactly the same issue you are seeing, but I have had an issue this
week with a newly created resource on a 9.0.16-1 primary against a
9.0.13-1 secondary.
As soon as I started writing to the new primary the secondary started
repeatedly disconnecting with the error:
drbd resource274 primary.host: Unexpected data packet ? (0x0036)
followed by a resync (and then the same error again, followed by another resync, ...)
Probably completely unrelated to your issues, and I know there are a
_lot_ of bug fixes between 9.0.13-1 and 9.0.16-1 (and I _do_ have a
long overdue update of the secondary planned very soon).
Theoretically, different 9.0.x kernel module versions should be able to
work together (same api). But in practice, I avoid it and usually update
drbd & kernel at the same time on all nodes.
So it could be that 9.0.16-1 has particular problems co-operating
with earlier versions, perhaps more so than other versions.
Eddie