[REGRESSION] block: virtio-blk + LVM raid1 spurious sector-0 read failures on libaio/threads submit since 5ff3f74e145a ("block: simplify direct io validity check")

Vjaceslavs Klimovs Fri, 15 May 2026 10:06:28 -0700

Summary
-------
On v6.18, starting a libvirt/QEMU guest with virtio-blk backed by an
LVM "--type raid1" LV (drivers/md/dm-raid.c stacked on
drivers/md/raid1.c) makes md/raid1 register read failures at LV
sector 0 within seconds of "virsh start" and mark rimage_0 Faulty
once max_corrected_read_errors (default 20) is exceeded. Reads
succeed via the redirect path so guests boot, but every guest disk
ends up degraded on every VM start. Same workload on legacy
"--type mirror" (drivers/md/dm-raid1.c) crashes the host: a
zero-length READ reaches the NVMe controller, is rejected with
"Invalid Field in Command", and the dm-mirror recovery path oopses.


Symptom on dm-raid raid1 (post-conversion --type raid1)
-------------------------------------------------------
Per LV, at virsh start, in host dmesg:

  kernel: raid1_end_read_request: 95 callbacks suppressed
  kernel: raid1_read_request: 95 callbacks suppressed
  kernel: md/raid1:mdX: dm-58: rescheduling sector 0
  kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
  kernel: md/raid1:mdX: dm-58: rescheduling sector 0
  kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
  [... 10 rescheduling/redirecting pairs ...]
  kernel: md/raid1:mdX: dm-58: Raid device exceeded read_error threshold [cur 21:max 20]
  kernel: md/raid1:mdX: dm-58: Failing raid device
  kernel: md/raid1:mdX: Disk failure on dm-58, disabling device.
  kernel: md/raid1:mdX: Operation continuing on 1 devices.

  dmeventd: WARNING: Device #0 of raid1 array, vg0-iris_boot, has failed.
  dmeventd: WARNING: Waiting for resynchronization to finish before initiating repair on RAID device vg0-iris_boot.
  dmeventd: Use 'lvconvert --repair vg0/iris_boot' to replace failed device.

Subsequent "lvs -a":

  WARNING: RaidLV vg0/iris_boot needs to be refreshed!
  See character 'r' at position 9 in the RaidLV's attributes and its SubLV(s).

dmesg | grep nvme is EMPTY on this path. The NVMe driver is not
involved in producing the error; the failure originates between the
virtio-blk bio submission and raid1_end_read_request().

Symptom on legacy dm-mirror (pre-conversion --type mirror)
----------------------------------------------------------
Same workload on drivers/md/dm-raid1.c reaches the NVMe controller
as a zero-length READ and panics the host through dm-mirror's
recovery path:

  kernel: operation not supported error, dev nvme1n1, sector 935446535 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
  kernel: nvme1n1: I/O Cmd(0x2) @ LBA 935446535, 0 blocks, I/O Error (sct 0x0 / sc 0x2)
  [... 10+ identical bursts at same timestamp ...]
  dmeventd: Primary mirror device 252:58 read failed.
  dmeventd: vg0-iris_boot is now in-sync.
  [kernel oops in dm_mirror recovery path, full trace lost to console flash]

The "phys_seg 0", "0 blocks", "sct 0x0/sc 0x2" trio (NVMe Generic,
Invalid Field in Command, NVMe spec 4.1.1.2) is unambiguous: a bio
with bi_iter.bi_size == 0 and bi_vcnt == 0 left the block layer and
hit the controller. dm-raid raid1 hides this by retrying on the
surviving leg, but the upstream-of-md trigger is identical.
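
For readers decoding the "(sct 0x0 / sc 0x2)" pair by hand, the
lookup can be sketched as below. This is a minimal illustration, not
kernel code: the helper names nvme_sct_name() and
nvme_generic_sc_name() are hypothetical, and only a handful of
well-known values from the NVMe status code tables are included.

```c
/* Hypothetical helper: name an NVMe Status Code Type (SCT) value. */
static const char *nvme_sct_name(unsigned int sct)
{
    switch (sct) {
    case 0x0: return "Generic Command Status";
    case 0x1: return "Command Specific Status";
    case 0x2: return "Media and Data Integrity Errors";
    case 0x3: return "Path Related Status";
    case 0x7: return "Vendor Specific";
    default:  return "Reserved/unknown";
    }
}

/*
 * Hypothetical helper: name a Status Code (SC) value, valid only when
 * the SCT is 0x0 (generic). Only the first few entries are listed.
 */
static const char *nvme_generic_sc_name(unsigned int sc)
{
    switch (sc) {
    case 0x00: return "Successful Completion";
    case 0x01: return "Invalid Command Opcode";
    case 0x02: return "Invalid Field in Command";
    case 0x03: return "Command ID Conflict";
    case 0x04: return "Data Transfer Error";
    default:   return "Not listed in this sketch";
    }
}
```

Calling nvme_sct_name(0x0) and nvme_generic_sc_name(0x2) yields the
two strings quoted above for the logged status.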

Bisect
------
git bisect, v6.12..v6.18, 16 deterministic GOOD/BAD steps, no skips,
~104 minutes:

  5ff3f74e145adc79b49668adb8de276446acf6be is the first bad commit
  block: simplify direct io validity check

  --- a/block/fops.c
  +++ b/block/fops.c
  @@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
   static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
                                  struct iov_iter *iter)
   {
  -        return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
  -                !bdev_iter_is_aligned(bdev, iter);
  +        return (iocb->ki_pos | iov_iter_count(iter)) &
  +                        (bdev_logical_block_size(bdev) - 1);
   }

The dropped bdev_iter_is_aligned() used to walk the iov_iter and
reject per-segment misaligned/degenerate vectors at the blkdev fops
entry point. The replacement only validates ki_pos and total length
against the logical block size. Cases that are no longer rejected:

  - iter with iov_iter_count(iter) == 0  (degenerate; total length is
    "sector-aligned" since 0 % 512 == 0)
  - iter where total length is sector-aligned but a segment isn't
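
The difference between the two checks can be modelled in userspace.
The sketch below is an illustration under stated assumptions, not the
kernel implementation: struct seg stands in for an iov_iter segment,
dio_invalid_new() mirrors the expression added by the commit, and
dio_invalid_per_segment() is a simplified per-segment check (segment
length against the logical block size, buffer address against an
assumed alignment mask) that only approximates what
bdev_iter_is_aligned() did. All names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one iov_iter segment. */
struct seg {
    uintptr_t base; /* buffer address */
    size_t len;     /* segment length in bytes */
};

/* Sum of all segment lengths, standing in for iov_iter_count(). */
static size_t total_len(const struct seg *v, size_t n)
{
    size_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += v[i].len;
    return sum;
}

/* Models the post-5ff3f74e145a check: position and total length only. */
static bool dio_invalid_new(uint64_t pos, const struct seg *v, size_t n,
                            unsigned int lbs)
{
    return ((pos | total_len(v, n)) & (lbs - 1)) != 0;
}

/*
 * Simplified per-segment model (an assumption, not the kernel helper):
 * reject if the position, any segment length, or any segment address
 * is misaligned. addr_mask is an assumed buffer alignment mask.
 */
static bool dio_invalid_per_segment(uint64_t pos, const struct seg *v,
                                    size_t n, unsigned int lbs,
                                    unsigned int addr_mask)
{
    if (pos & (lbs - 1))
        return true;
    for (size_t i = 0; i < n; i++) {
        if (v[i].len & (lbs - 1))
            return true;
        if (v[i].base & addr_mask)
            return true;
    }
    return false;
}
```

With a 512-byte logical block size, a two-segment vector of 256 and
768 bytes at position 0 passes dio_invalid_new() (total 1024) but is
rejected by the per-segment model, and an empty vector passes
dio_invalid_new() because its total length is 0.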

The commit message justifies the removal with "The block layer
checks all the segments for validity later". This is true for the
io_uring submit path (which enters __blkdev_direct_IO directly and
does its own validation) but not for the libaio aio_read/write_iter
or the worker-pool sync read/write_iter paths that enter via
blkdev_{read,write}_iter() -> blkdev_dio_invalid(). For those paths,
the segment check has no replacement.

Reproducing
-----------

The trigger requires QEMU virtio-blk's specific submission shape AND
a non-io_uring submit. Userspace libaio alone, userspace
preadv-in-a-thread alone, and QEMU's raw-driver open probes (which
qemu-img info exercises identically) are all insufficient. The
combination that hits the bug is "guest-driven I/O through
virtio-blk-pci with cache.direct=on and aio in {native, threads}".

#regzbot introduced: 5ff3f74e145adc79b49668adb8de276446acf6be

Thanks,
Vjaceslavs Klimovs
