Summary
-------
On v6.18, starting a libvirt/QEMU guest with virtio-blk backed by an
LVM "--type raid1" LV (drivers/md/dm-raid.c stacked on
drivers/md/raid1.c) makes md/raid1 register read failures at LV
sector 0 within seconds of "virsh start" and mark rimage_0 Faulty
once max_corrected_read_errors (default 20) is exceeded. Reads
succeed via the redirect path so guests boot, but every guest disk
ends up degraded on every VM start. The same workload on legacy
"--type mirror" (drivers/md/dm-raid1.c) crashes the host: a
zero-length READ reaches the NVMe controller, is rejected with
"Invalid Field in Command", and the dm-mirror recovery path oopses.
Symptom on dm-raid raid1 (post-conversion --type raid1)
-------------------------------------------------------
Per LV, at virsh start, in host dmesg:
kernel: raid1_end_read_request: 95 callbacks suppressed
kernel: raid1_read_request: 95 callbacks suppressed
kernel: md/raid1:mdX: dm-58: rescheduling sector 0
kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
kernel: md/raid1:mdX: dm-58: rescheduling sector 0
kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
[... 10 rescheduling/redirecting pairs ...]
kernel: md/raid1:mdX: dm-58: Raid device exceeded read_error
threshold [cur 21:max 20]
kernel: md/raid1:mdX: dm-58: Failing raid device
kernel: md/raid1:mdX: Disk failure on dm-58, disabling device.
kernel: md/raid1:mdX: Operation continuing on 1 devices.
dmeventd: WARNING: Device #0 of raid1 array, vg0-iris_boot, has failed.
dmeventd: WARNING: Waiting for resynchronization to finish before
initiating repair on RAID device vg0-iris_boot.
dmeventd: Use 'lvconvert --repair vg0/iris_boot' to replace failed device.
Subsequent "lvs -a":
WARNING: RaidLV vg0/iris_boot needs to be refreshed!
See character 'r' at position 9 in the RaidLV's attributes and its SubLV(s).
dmesg | grep nvme is EMPTY on this path. The NVMe driver is not
involved in producing the error; the failure originates between the
virtio-blk bio submission and raid1_end_read_request().
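To make the "[cur 21:max 20]" line above concrete, here is a minimal
userspace sketch of the threshold comparison, assuming the count is
bumped once per corrected read error and compared with a strict
"greater than". The struct and function names are hypothetical and
are not the md symbols; any decay of the counter over time is ignored.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-leg error counter; not an md type. */
struct model_rdev {
    unsigned int read_errors;
};

/*
 * Count one more read error on this leg and report whether the leg
 * is now over the limit. Assumption: the comparison is strict, which
 * is what "cur 21:max 20" in the log suggests.
 */
bool model_read_error_exceeds(struct model_rdev *rdev, unsigned int max_errors)
{
    rdev->read_errors++;
    return rdev->read_errors > max_errors;
}
```

Under that assumption, 20 errors on a leg are tolerated with the
default limit and the 21st fails the leg, which matches the log.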
Symptom on legacy dm-mirror (pre-conversion --type mirror)
----------------------------------------------------------
The same workload on drivers/md/dm-raid1.c reaches the NVMe
controller as a zero-length READ and crashes the host through
dm-mirror's recovery path:
kernel: operation not supported error, dev nvme1n1, sector 935446535
op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
kernel: nvme1n1: I/O Cmd(0x2) @ LBA 935446535, 0 blocks, I/O Error
(sct 0x0 / sc 0x2)
[... 10+ identical bursts at same timestamp ...]
dmeventd: Primary mirror device 252:58 read failed.
dmeventd: vg0-iris_boot is now in-sync.
[kernel oops in dm_mirror recovery path, full trace lost to console flash]
The "phys_seg 0", "0 blocks", "sct 0x0/sc 0x2" trio (NVMe Generic,
Invalid Field in Command, NVMe spec 4.1.1.2) is unambiguous: a bio
with bi_iter.bi_size == 0 and bi_vcnt == 0 left the block layer and
hit the controller. dm-raid raid1 hides this by retrying on the
surviving leg, but the upstream-of-md trigger is identical.
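For readers decoding the "sct 0x0 / sc 0x2" pair by hand, a small
lookup sketch follows. It covers only the first few Generic Command
Status values; the function name is hypothetical and this is not the
nvme driver's own table.

```c
#include <assert.h>

/*
 * Map an (SCT, SC) pair to a name. Only SCT 0x0 (Generic Command
 * Status) and its first five status codes are listed; everything
 * else falls through to a placeholder string.
 */
const char *model_nvme_generic_status(unsigned int sct, unsigned int sc)
{
    /* SCT is a 3-bit field and SC an 8-bit field in the status word. */
    assert(sct <= 0x7 && sc <= 0xff);

    if (sct != 0x0)
        return "not a Generic Command Status";

    switch (sc) {
    case 0x00: return "Successful Completion";
    case 0x01: return "Invalid Command Opcode";
    case 0x02: return "Invalid Field in Command";
    case 0x03: return "Command ID Conflict";
    case 0x04: return "Data Transfer Error";
    default:   return "other generic status (not listed in this sketch)";
    }
}
```

model_nvme_generic_status(0x0, 0x2) returns "Invalid Field in
Command", the status quoted in the log above.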
Bisect
------
git bisect, v6.12..v6.18, 16 deterministic GOOD/BAD steps, no skips,
~104 minutes:
5ff3f74e145adc79b49668adb8de276446acf6be is the first bad commit
block: simplify direct io validity check
--- a/block/fops.c
+++ b/block/fops.c
@@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
struct iov_iter *iter)
{
- return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
- !bdev_iter_is_aligned(bdev, iter);
+ return (iocb->ki_pos | iov_iter_count(iter)) &
+ (bdev_logical_block_size(bdev) - 1);
}
The dropped bdev_iter_is_aligned() used to walk the iov_iter and
reject per-segment misaligned/degenerate vectors at the blkdev fops
entry point. The replacement only validates ki_pos and total length
against the logical block size. Cases that now pass that no longer
get rejected:
- iter with iov_iter_count(iter) == 0 (degenerate; total length is
"sector-aligned" since 0 % 512 == 0)
- iter where total length is sector-aligned but a segment isn't
The commit message justifies the removal with "The block layer
checks all the segments for validity later". This is true for the
io_uring submit path (which enters __blkdev_direct_IO directly and
does its own validation) but not for the libaio aio_read/write_iter
or the worker-pool sync read/write_iter paths that enter via
blkdev_{read,write}_iter() -> blkdev_dio_invalid(). For those paths,
the segment check has no replacement.
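The difference between the two checks can be illustrated with a
userspace model. This is a sketch, not kernel code: struct model_seg
and both functions are hypothetical stand-ins, the "old" variant
assumes the per-segment walk tests each segment's length against the
logical block mask and its address against a DMA alignment mask, and
the model deliberately takes no position on how the old walk treated
an empty iter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one iov_iter segment; not a kernel type. */
struct model_seg {
    uintptr_t base;
    size_t len;
};

static size_t model_total_len(const struct model_seg *segs, size_t nsegs)
{
    size_t total = 0;

    for (size_t i = 0; i < nsegs; i++)
        total += segs[i].len;
    return total;
}

/* Pre-commit shape: position check plus a walk over every segment. */
bool model_dio_invalid_old(uint64_t pos, const struct model_seg *segs,
                           size_t nsegs, unsigned int lbs,
                           unsigned int dma_mask)
{
    assert(lbs && (lbs & (lbs - 1)) == 0); /* power of two */

    if (pos & (lbs - 1))
        return true;
    for (size_t i = 0; i < nsegs; i++) {
        if (segs[i].len & (lbs - 1))
            return true;  /* segment length not block-aligned */
        if (segs[i].base & dma_mask)
            return true;  /* segment address not DMA-aligned */
    }
    return false;
}

/* Post-commit shape: only position and total length are tested. */
bool model_dio_invalid_new(uint64_t pos, const struct model_seg *segs,
                           size_t nsegs, unsigned int lbs)
{
    assert(lbs && (lbs & (lbs - 1)) == 0); /* power of two */

    return ((pos | model_total_len(segs, nsegs)) & (lbs - 1)) != 0;
}
```

With two 256-byte segments and a 512-byte logical block size, the old
model returns true (invalid) and the new model returns false, because
the total of 512 is aligned. An empty iter (nsegs == 0) also returns
false from the new model.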
Reproducing
-----------
The trigger requires QEMU virtio-blk's specific submission shape AND
a non-io_uring submit. Userspace libaio alone, userspace
preadv-in-a-thread alone, and QEMU's raw-driver open probes (which
qemu-img info exercises identically) are all insufficient. The
combination that hits the bug is "guest-driven I/O through
virtio-blk-pci with cache.direct=on and aio in {native, threads}".
#regzbot introduced: 5ff3f74e145adc79b49668adb8de276446acf6be
Thanks,
Vjaceslavs Klimovs