To follow up: one of our other engineers may have discovered why DRBD won't auto-promote in this use case. It turns out that in ZFS 0.7.12 the device is opened with only the FMODE_EXCL flag passed into blkdev_get_by_path(), which DRBD does not detect during auto-promote. He plans to back-port a change from ZFS 0.8 that opens the device with FMODE_WRITE, which should trigger the code path that auto-promotes the resource in DRBD.
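To make the distinction concrete, here is a minimal C sketch of the two open styles. This is untested and for illustration only: the wrapper function and its arguments are hypothetical, while blkdev_get_by_path() and the FMODE_* flags are the real kernel API on this 3.10-era kernel.

#include <linux/fs.h>
#include <linux/blkdev.h>

static struct block_device *open_slog_bdev(const char *path, void *holder)
{
	/*
	 * ZFS 0.7.12 behavior as described above: an exclusive open
	 * without FMODE_WRITE, so DRBD never sees a writable opener and
	 * its auto-promote path never fires:
	 *
	 *     return blkdev_get_by_path(path, FMODE_EXCL, holder);
	 */

	/*
	 * ZFS 0.8-style open: read/write access is requested explicitly.
	 * A writable open is what gives DRBD the chance to auto-promote
	 * the resource to Primary.
	 */
	return blkdev_get_by_path(path, FMODE_READ | FMODE_WRITE | FMODE_EXCL,
				  holder);
}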
On Fri, Nov 15, 2019 at 2:57 PM Doug Cahill <handr...@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 4:34 AM Robert Altnoeder <robert.altnoe...@linbit.com> wrote:
>
>> Could you try a few things, so we can get a better picture of what's
>> happening there:
>> - Can you get a hash of the data on the backend device that DRBD is
>> writing to, from before and after one of those dubious Secondary-mode
>> writes, to verify whether or not any data is actually changed?
>
> Looks like the sha256sum changes on both the primary and secondary local
> backing devices of DRBD when I write to my zpool that has the DRBD log
> device.
>
> node0 before sha256sum
> [root@dccdx0 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum
> 17179869184 bytes (17 GB) copied, 69.4921 s, 247 MB/s
> 730d0f908c64a42ffc168211350b87a72c72ed56de2feef4be0904342acf20ac -
>
> node1 before sha256sum
> [root@dccdx1 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum
> 17179869184 bytes (17 GB) copied, 70.4586 s, 244 MB/s
> adbab9ee2a96ed476fe649cd10dc17994767190ae350a7be146c40427e272a73 -
>
> Write test:
> [root@dccdx0 ~]# dd if=/dev/urandom of=/dev/zvol/act_per_pool000/test_drbd bs=4k count=100000 oflag=sync,direct
> 409600000 bytes (410 MB) copied, 44.8633 s, 9.1 MB/s
>
> node0 after sha256sum
> [root@dccdx0 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum
> 17179869184 bytes (17 GB) copied, 71.6324 s, 240 MB/s
> e8c02e50daf281973b04ea1b76e6cdb8760a789245ade987ba5410deba68067d -
>
> node1 after sha256sum
> [root@dccdx1 ~]# dd if=/dev/sda1 bs=8M iflag=direct | sha256sum
> 2048+0 records in
> 2048+0 records out
> 17179869184 bytes (17 GB) copied, 68.326 s, 251 MB/s
> da8b90e8f57c20e4ea47a498157cb2865249d8b8cefc36aedb49a4467572924f -
>
>> - Can you switch the other peer into the Primary role manually (so that
>> the node where the problem occurs should refuse to become a Primary) and
>> see what happens when ZFS tries to write to that log?
>
> Attempt 1, with the zpool imported on the other node:
> This is run on the secondary side, where the zpool is not imported.
> [root@dccdx1 ~]# drbdadm primary r0
> r0: State change failed: (-10) State change was refused by peer node
> additional info from kernel:
> Declined by peer dccdx0 (id: 1), see the kernel log there
> Command 'drbdsetup primary r0' terminated with exit code 11
>
> Info logged in /var/log/messages:
> dccdx1: Preparing remote state change 1839699090
> dccdx0 kernel: [171910.178046] drbd r0: State change failed: Peer may not become primary while device is opened read-only
> dccdx0 kernel: [171910.195954] drbd r0 dccdx1: Aborting remote state change 1839699090
>
> Attempt 2, with the zpool exported:
> Export the pool on the primary node.
> On the secondary node, promote the DRBD resource to primary:
> [root@dccdx1 ~]# drbdadm primary r0
> [root@dccdx1 ~]# drbdadm status
> r0 role:Primary
>   disk:UpToDate
>   dccdx0 role:Secondary
>     peer-disk:UpToDate
>
> Import the zpool on the original node, with the DRBD device now secondary:
> [root@dccdx0 ~]# zpool import -f -o cachefile=none -d /dev/drbd/by-disk/disk/by-path -d /dev/disk/by-path -d /dev/mapper act_per_pool000
> The devices below are missing, use '-m' to import the pool anyway:
>         pci-0000:18:00.0-scsi-0:0:2:0-part1 [log]
> cannot import 'act_per_pool000': one or more devices is currently unavailable
>
>> I tried to reproduce the problem from user space (with auto-promote off
>> and trying to read/write from/to a Secondary), where it does not seem to
>> happen (everything is normal, cannot even read from a Secondary).
>> However, I expect those ZFS operations to be done by some code in the
>> kernel itself, and something may not be playing by the rules there -
>> maybe ZFS is causing some I/O without doing a proper open/close cycle,
>> or we are missing something in DRBD for some I/O case that's supposed to
>> be valid.
>
> Another dev is looking into using SystemTap so we can trace the kernel
> calls to block devices and see why they aren't being flagged as writable
> opens (a rough sketch of the same tracing idea appears after the quoted
> thread below). We are just as puzzled about how the ZFS vdisk kernel call
> is, or is not, being captured in a way that lets DRBD detect it and
> auto-promote.
>
>> br,
>> Robert
>>
>> On 11/14/19 10:28 PM, Doug Cahill wrote:
>>> I spent some more time looking into this with another developer and I
>>> can see while running "drbdsetup events2 r0" that there is a quick
>>> blip when I add the drbd r0 resource to my pool as the log device:
>>>
>>> change resource name:r0 role:Primary
>>> change resource name:r0 role:Secondary
>>>
>>> However, if I export and/or import the pool, the event never registers
>>> again. When I write to a vdisk on this pool I can see the nr:11766480
>>> dw:11766452 counts increase on the peer, which leads me to believe
>>> blocks are being written, yet the state never changes.
>>>
>>> I also tried to run dd against the "peer" side drbd device while the
>>> "active" side was writing data, and found a message in my syslog
>>> stating the peer may not become primary while the device is opened
>>> read-only, which doesn't make sense. The device is being written to,
>>> so how is the block device state being tricked into thinking it is
>>> read-only?
>>>
>>> =========in the log from the node I'm writing to the drbd resource
>>> drbd r0 dccdx0: Preparing remote state change 892694821
>>> drbd r0: State change failed: Peer may not become primary while device
>>> is opened read-only
>>> kernel: [92771.927574] drbd r0 dccdx0: Aborting remote state change
>>> 892694821
>>>
>>> On Thu, Nov 14, 2019 at 10:39 AM Doug Cahill <handr...@gmail.com> wrote:
>>>>
>>>> On Thu, Nov 14, 2019 at 4:52 AM Roland Kammerer
>>>> <roland.kamme...@linbit.com> wrote:
>>>>>
>>>>> On Wed, Nov 13, 2019 at 03:08:37PM -0500, Doug Cahill wrote:
>>>>>> I'm configuring a two node setup with drbd 9.0.20-1 on CentOS 7
>>>>>> (3.10.0-957.1.3.el7.x86_64) with a single resource backed by an SSD.
>>>>>> I've explicitly enabled auto-promote in my resource configuration to
>>>>>> use this feature.
>>>>>>
>>>>>> The drbd device is being used in a single-primary configuration as a
>>>>>> zpool SLOG device. The zpool is only ever imported on one node at a
>>>>>> time and the import is successful during cluster failover events
>>>>>> between nodes. I confirmed through zdb that the zpool includes the
>>>>>> configured drbd device path.
>>>>>>
>>>>>> My concern is that the drbdadm status output shows the Role of the
>>>>>> drbd resource as "Secondary" on both sides. The documentation reads
>>>>>> that the drbd resource will be auto-promoted to primary when it is
>>>>>> opened for writing.
>>>>>
>>>>> But also demoted when closed (don't know if this happens in your
>>>>> scenario).
>>>>>
>>>>>> drbdadm status
>>>>>> r0 role:Secondary
>>>>>>   disk:UpToDate
>>>>>>   dccdx0 role:Secondary
>>>>>>     peer-disk:UpToDate
>>>>>
>>>>> Maybe it is closed and demoted again and you look at it at the wrong
>>>>> points in time?
>>>>> Better look into the syslog for role changes, or monitor with
>>>>> "drbdsetup events2 r0". Do you see switches to Primary there?
>>>>
>>>> I checked the drbdadm status while my dd write session was in progress
>>>> and I see no change from Secondary to Primary. I also checked the
>>>> stats under /sys/class and it looks the same.
>>>>
>>>> cat /sys/kernel/debug/drbd/resources/r0/connections/dccdx0/0/proc_drbd
>>>> 0: cs:Established ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
>>>>     ns:3330728 nr:0 dw:20103080 dr:26292 al:131 bm:0 lo:0 pe:[0;0]
>>>>     ua:0 ap:[0;0] ep:1 wo:1 oos:0
>>>>     resync: used:0/61 hits:64 misses:4 starving:0 locked:0 changed:2
>>>>     act_log: used:0/1237 hits:28951 misses:536 starving:0 locked:0
>>>>     changed:132
>>>>     blocked on activity log: 0/0/0
>>>>
>>>>> Best, rck
>>
>> --
>> Robert ALTNOEDER - Software Developer
>> robert.altnoe...@linbit.com
>>
>> LINBIT | Keeping the Digital World Running
>> DRBD HA - Disaster Recovery - Software-defined Storage
>>
>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
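P.S. Since SystemTap came up above as the way to watch these opens: the same information can be captured with a tiny kprobe module. This is a rough, untested sketch only; the module and handler names are made up, while the kprobes API, the x86_64 pt_regs layout (arg 1 in di, arg 2 in si), and the FMODE_* flags are real kernel interfaces. It logs the open mode every caller passes to blkdev_get_by_path(), which should show whether ZFS is opening the SLOG with FMODE_WRITE set.

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/fs.h>

/* Log the fmode_t that each caller passes to blkdev_get_by_path(). */
static int trace_open_mode(struct kprobe *p, struct pt_regs *regs)
{
	const char *path = (const char *)regs->di; /* arg 1: device path */
	fmode_t mode = (fmode_t)regs->si;          /* arg 2: open mode   */

	pr_info("blkdev_get_by_path(%s): mode=0x%x write=%d excl=%d\n",
		path, (unsigned int)mode,
		!!(mode & FMODE_WRITE), !!(mode & FMODE_EXCL));
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "blkdev_get_by_path",
	.pre_handler = trace_open_mode,
};

static int __init trace_init(void)
{
	return register_kprobe(&kp);
}

static void __exit trace_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(trace_init);
module_exit(trace_exit);
MODULE_LICENSE("GPL");

After loading the module, importing the zpool and watching dmesg should show whether the open of the DRBD device carries FMODE_WRITE.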
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user