On 02/11/18 08:45, Jarno Elonen wrote:
More clues:

Just witnessed a resync (after invalidate) steadily go from 100% out-of-sync to 0% (after several automatic disconnects and reconnects). Immediately after reaching 0%, it went to -<very-large-number>%! After that, drbdtop started showing 8.0ZiB out-of-sync.

Looks like a severe wrap-around bug.
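
The counters quoted further down in this thread (rs-in-flight and out-of-sync from drbdsetup status) fit that reading. Below is a minimal sketch, not DRBD code, that reinterprets them as 64-bit two's-complement values. The helper name as_signed64 is made up for illustration, and the assumption that out-of-sync is reported in KiB is mine.

```python
MASK64 = (1 << 64) - 1

def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value

# Values copied from the drbdsetup status output quoted in this thread.
rs_in_flight = 18446744073709549808
out_of_sync = 9223372036854774304

# rs-in-flight sits just below 2**64, i.e. a small negative number
# stored in an unsigned 64-bit counter.
rs_signed = as_signed64(rs_in_flight)
print("rs-in-flight as signed 64-bit:", rs_signed)

# out-of-sync sits just below 2**63.
gap_to_2_63 = (1 << 63) - out_of_sync
print("distance of out-of-sync below 2**63:", gap_to_2_63)

# ASSUMPTION: out-of-sync is in KiB. Under that assumption the value is
# about 8 ZiB, which matches what drbdtop displays.
zib = out_of_sync * 1024 / 2**70
print(f"out-of-sync in ZiB (if KiB): {zib:.3f}")
```

Both values land a few hundred to a couple of thousand units away from a power-of-two boundary, which is what an underflowed counter would look like.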

-Jarno


On Thu, 1 Nov 2018 at 22:30, Jarno Elonen wrote:

    Here's some more info.
    dmesg shows some suspicious-looking log messages, such as:

    1) FIXME drbd_s_vm-117-s[2830] op clear, bitmap locked for 'receive
    bitmap' by drbd_r_vm-117-s[5038]

    2) Wrong magic value 0xffff0007 in protocol version 114

    3) peer request with dagtag 399201392 not found
    got_peer_ack [drbd] failed

    4) Rejecting concurrent remote state change 2226202936 because of
    state change 2923939731
    Ignoring P_TWOPC_ABORT packet 2226202936.

    5) drbd_r_vm-117-s[5038] going to 'detect_finished_resyncs()' but
    bitmap already locked for 'write from resync_finished' by
    drbd_w_vm-117-s[2812]
    md_sync_timer expired! Worker calls drbd_md_sync().

    6) incompatible discard-my-data settings
    conn( Connecting -> Disconnecting )
    error receiving P_PROTOCOL, e: -5 l: 7!

    Two of the four nodes have DRBD 9.0.15-1 and two have 9.0.16-1. All
    of them report transport API version 16:

    == mox-a ==
    version: 9.0.15-1 (api:2/proto:86-114)
    GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
    root@mox-a, 2018-10-28 03:08:58
    Transports (api:16): tcp (9.0.15-1)

    == mox-b ==
    version: 9.0.15-1 (api:2/proto:86-114)
    GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
    root@mox-b, 2018-10-10 17:50:25
    Transports (api:16): tcp (9.0.15-1)

    == mox-c ==
    version: 9.0.16-1 (api:2/proto:86-114)
    GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
    root@mox-c, 2018-10-28 05:45:05
    Transports (api:16): tcp (9.0.16-1)

    == mox-d ==
    version: 9.0.16-1 (api:2/proto:86-114)
    GIT-hash: ab9777dfeaf9d619acc9a5201bfcae8103e9529c build by
    root@mox-d, 2018-10-29 00:22:23
    Transports (api:16): tcp (9.0.16-1)

    Running Proxmox (5.2-2), as you'd guess from the host names. DRBD
    resources are managed by LINSTOR.


    On Thu, 1 Nov 2018 at 17:32, Jarno Elonen wrote:

        Okay, today one of these resources got a sudden, severe
        filesystem corruption on the primary.

        On the other hand, the secondaries (that showed 8ZiB
        out-of-sync) were still mountable after I disconnected the
        corrupted primary. No idea how current the data on the secondaries
        was, but drbdtop still showed them as connected and 8ZiB out-of-sync.

        This is getting quite worrisome. Is anyone else experiencing
        this with DRBD 9? Is it something really wrong in my setup, or
        are there perhaps some known instabilities in DRBD 9.0.15-1?

        -Jarno


        On Wed, 31 Oct 2018 at 20:46, Jarno Elonen wrote:

            I've got several DRBD 9 resources that constantly show
            *UpToDate* with an out-of-sync value of 9223372036854774304
            (just under 2^63, displayed as 8ZiB).

            Any idea what might cause this and how to fix it?

            Example:

            # drbdsetup status --verbose --statistics vm-106-disk-1
            vm-106-disk-1 node-id:0 role:Primary suspended:no
                write-ordering:flush
              volume:0 minor:1003 disk:UpToDate quorum:yes
                  size:16777688 read:215779 written:22369564 al-writes:89 bm-writes:0 upper-pending:0
                  lower-pending:0 al-suspended:no blocked:no
              mox-a node-id:1 connection:Connected role:Secondary congested:no ap-in-flight:0
                  rs-in-flight:18446744073709549808
                volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
                    received:215116 sent:22368903 out-of-sync:9223372036854774304 pending:0 unacked:0
              mox-c node-id:2 connection:Connected role:Secondary congested:no ap-in-flight:0
                  rs-in-flight:18446744073709549808
                volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
                    received:1188 sent:19884428 out-of-sync:0 pending:0 unacked:0

            Version info:
            version: 9.0.15-1 (api:2/proto:86-114)
            GIT-hash: c46d27900f471ea0f5ba587592439a9ddde1d08b build by
            root@mox-b, 2018-10-10 17:50:25
            Transports (api:16): tcp (9.0.15-1)

            -Jarno

Not exactly the same issue you are seeing, but I have had an issue this week with a newly created resource on a 9.0.16-1 primary against a 9.0.13-1 secondary.

As soon as I started writing to the new primary the secondary started repeatedly disconnecting with the error:

drbd resource274 primary.host: Unexpected data packet ? (0x0036)

followed by a resync (and then the same error again, followed by another resync, ...)

Probably completely unrelated to your issues, and I know there are a _lot_ of bug fixes between 9.0.13-1 and 9.0.16-1 (and I _do_ have a long overdue update of the secondary planned very soon).

Theoretically, different 9.0.x kernel versions should be able to work together (same API). But in practice I avoid it and usually update DRBD and the kernel at the same time on all nodes.

So it could be that 9.0.16-1 has particular problems co-operating with earlier versions, perhaps more so than other releases.
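
If you want to spot mixed versions quickly, something like the following rough sketch could help. It only parses "version:" lines of the form quoted earlier in this thread and groups nodes by module version. The function names and the way the lines are collected (pasted in by hand here) are my own assumptions, not part of any DRBD or LINSTOR tool.

```python
import re

# Matches lines such as: version: 9.0.15-1 (api:2/proto:86-114)
VERSION_RE = re.compile(r"version:\s*(\S+)\s*\(api:(\d+)/proto:(\d+)-(\d+)\)")

def parse_version(line):
    """Return (version, api, (proto_min, proto_max)) from a version line."""
    m = VERSION_RE.search(line)
    if m is None:
        raise ValueError(f"not a DRBD version line: {line!r}")
    return m.group(1), int(m.group(2)), (int(m.group(3)), int(m.group(4)))

def group_by_version(node_lines):
    """Map each module version to the list of nodes reporting it."""
    groups = {}
    for node, line in node_lines.items():
        version, _api, _proto = parse_version(line)
        groups.setdefault(version, []).append(node)
    return groups

# Version lines as posted for the four nodes in this thread.
nodes = {
    "mox-a": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-b": "version: 9.0.15-1 (api:2/proto:86-114)",
    "mox-c": "version: 9.0.16-1 (api:2/proto:86-114)",
    "mox-d": "version: 9.0.16-1 (api:2/proto:86-114)",
}

groups = group_by_version(nodes)
if len(groups) > 1:
    for version, members in sorted(groups.items()):
        print(f"{version}: {', '.join(members)}")
```

Note that all four nodes advertise the same api and proto range, so the mismatch only shows up in the module version itself.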

Eddie
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
