On 6/14/26 19:57, Dr. David Alan Gilbert wrote: > > I've got a repeatable raid hang/warn and would appreciate some pointers > as where to debug. > (I've been logging stuff on > https://bugzilla.kernel.org/show_bug.cgi?id=221535 )
Note: not my area of expertise, so I might be sending you totally off-track with this comment. Feel free to ignore it. But FWIW: Have you seen these reports? https://lore.kernel.org/all/2982107.4sosBPzcNG@electra/ https://lore.kernel.org/all/CAC_j7i1R7oy+nRhxEjCTba=dugn02w9x+p94dcu0ahv5+5t...@mail.gmail.com/ The former lead to a fix in the mdraid code that should be in the kernel version you are using. But in a reply to the latter report the repoter claimed that that fix is not enough (claiming "this was obvious" and also using dm), but things then stalled there. Ciao, Thorsten > This started off as debugging a case where I'd get my RAID1 (on the host) > getting a reliable 'rescheduling sector'/disk failure while running the qemu > block test suite > during a qemu build, but then I tried to build a smaller discrete > test, and now I've got a simply triggerable warn and test hang. > There's no errors from the underlying SATA layer on the storage, > everything resyncs just fine. > > I've got an existing LVM vg ('main') with two mirrors on sda2, and sdb2 > which are SATA disks. > > # lvcreate --type mirror --mirrors 1 -L 1G main /dev/sda2 /dev/sdb2 > # mkfs.ext4 /dev/mapper/main-lvol0 > # mount /dev/mapper/main-lvol0 /mnt/tmp/ > # chmod a+rwx /mnt/tmp > > $ dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=1 > > (I then wait for the IO to stop) > > then we've got this little test program: > > <--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--> > #include <errno.h> > #include <fcntl.h> > #include <asm-generic/fcntl.h> > #include <stdio.h> > #include <unistd.h> > > > const char* path="/mnt/tmp/testfile"; > static char buf[8192]; > > int main() > { > int fd=open(path, O_RDWR|O_DIRECT|O_CLOEXEC); > > errno=0; > int res3=pread(fd, buf, 4096, 0); > printf("pread of 4096 said: %d (%m)\n", res3); > > } > <--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--><--> > > running that, either hangs or gets a 'pread of 4096 said: -1 (Input/output > error)' > when it hangs it's unkillable. > > at the moment (on 7.1.0-rc7) this is giving: > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from > 252:24. Trying alternative device. > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------ > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read > failed. > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at > bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369 > > (full backtrace below) > (Note there is a moan in there about sdb IO error - repeated a lot - but > again, there's no SATA level errors, and the drive is fine on smart, and > I can read the whole of the underlying lvm mirrors, so I don't think it's > physically there). > > I did a blktrace, although that gives me a 23G blkparse output, hmm > (I see each event repeated a lot - maybe per thread?) > > 252,26 15 1 0.000000000 3435 Q RS 264192 + 8 [dbf] > 252,26 is /dev/mapper/main-lvol0 > 252,24 15 1 0.000005501 3435 A RS 264192 + 8 <- (252,26) 264192 > 252,24 is main-lvol0_mimage_0 > 252,24 15 2 0.000005761 3435 Q RS 264192 + 8 [dbf] > 8,0 15 1 0.000008646 3435 A RS 71634944 + 8 <- (252,24) > 264192 > so that's sda > 8,0 15 2 0.000008787 3435 A RS 73734144 + 8 <- (8,2) > 71634944 > I guess mapping down from sda2 to sda > 8,0 15 3 0.000009037 3435 Q RS 73734144 + 8 [dbf] > 8,0 15 4 0.000009809 3435 C RS 73734144 + 8 [65514] > ??? Hmm what's the 65514 there? > 252,24 15 3 0.000010320 3435 C RS 264192 + 8 [65514] > 252,25 15 1 0.000290384 369 Q R 264192 + 8 [kworker/15:1] > 252,25 is main-lvol0_mimage_1 > > and at this point I'm a bit lost as to what I'm looking for. > > Hints appreciated! > > (I don't believe this is a regression - or at least not recent) > > Dave > > > > > Jun 14 18:08:32 dalek kernel: device-mapper: raid1: Mirror read failed from > 252:24. Trying alternative device. > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------ > Jun 14 18:08:32 dalek dmeventd[1010]: Primary mirror device 252:24 read > failed. > Jun 14 18:08:32 dalek kernel: WARNING: block/bio.c:1044 at > bio_add_page+0x18b/0x250, CPU#15: kworker/15:1/369 > Jun 14 18:08:32 dalek dmeventd[1010]: main-lvol0 is now in-sync. > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 > act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill > nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 > nf_reje> > Jun 14 18:08:32 dalek kernel: drm_panel_backlight_quirks gpu_sched > drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring > sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac > scsi_dh_emc > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 Not > tainted 7.1.0-rc7+ #786 PREEMPT(lazy) > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be > Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020 > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror > Jun 14 18:08:32 dalek kernel: RIP: 0010:bio_add_page+0x18b/0x250 > Jun 14 18:08:32 dalek kernel: Code: 24 10 4c 8b 04 24 84 c0 0f 85 c9 00 00 00 > 41 0f b7 40 78 48 8b 74 24 08 8b 4c 24 14 e9 b4 fe ff ff 0f 0b 31 c0 e9 55 d1 > af 00 <0f> 0b eb f5 48 8b 7f 08 83 7f 60 05 0f 85 00 ff ff ff 49 8b 3b 4c > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176fc10 EFLAGS: 00010246 > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffffd1fb8176fd18 > RCX: 0000000000000000 > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: 0000000000000000 > RDI: ffff8d1a8eb28b00 > Jun 14 18:08:32 dalek kernel: RBP: 0000000000000000 R08: ffffd1fb8176fc38 > R09: ffffd1fb8176fc40 > Jun 14 18:08:32 dalek kernel: R10: ffffd1fb8176fc34 R11: 0000000000000000 > R12: 0000000000000000 > Jun 14 18:08:32 dalek kernel: R13: ffffd1fb8176fd90 R14: 0000000000000001 > R15: ffff8d1a8eb28b00 > Jun 14 18:08:32 dalek kernel: FS: 0000000000000000(0000) > GS:ffff8d29d161f000(0000) knlGS:0000000000000000 > Jun 14 18:08:32 dalek kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 0000000080050033 > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 > CR4: 0000000000350ef0 > Jun 14 18:08:32 dalek kernel: Call Trace: > Jun 14 18:08:32 dalek kernel: <TASK> > Jun 14 18:08:32 dalek kernel: do_region+0x227/0x2a0 > Jun 14 18:08:32 dalek kernel: dispatch_io+0xf1/0x150 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_get_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_next_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ? __pfx_read_callback+0x10/0x10 > Jun 14 18:08:32 dalek kernel: dm_io+0x169/0x2d0 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_get_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_next_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: do_reads+0x149/0x230 > Jun 14 18:08:32 dalek kernel: ? __pfx_read_callback+0x10/0x10 > Jun 14 18:08:32 dalek kernel: do_mirror+0x11a/0x2b0 > Jun 14 18:08:32 dalek kernel: process_one_work+0x19e/0x390 > Jun 14 18:08:32 dalek kernel: worker_thread+0x1a6/0x310 > Jun 14 18:08:32 dalek kernel: ? __pfx_worker_thread+0x10/0x10 > Jun 14 18:08:32 dalek kernel: kthread+0xe4/0x120 > Jun 14 18:08:32 dalek kernel: ? __pfx_kthread+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ret_from_fork+0x1a1/0x270 > Jun 14 18:08:32 dalek kernel: ? __pfx_kthread+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ret_from_fork_asm+0x1a/0x30 > Jun 14 18:08:32 dalek kernel: </TASK> > Jun 14 18:08:32 dalek kernel: ---[ end trace 0000000000000000 ]--- > Jun 14 18:08:32 dalek kernel: ------------[ cut here ]------------ > Jun 14 18:08:32 dalek kernel: WARNING: drivers/scsi/scsi_lib.c:1164 at > scsi_alloc_sgtables+0x38a/0x400, CPU#15: kworker/15:1/369 > Jun 14 18:08:32 dalek kernel: Modules linked in: nft_masq nft_reject_ipv4 > act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc rfkill > nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 > nf_reje> > Jun 14 18:08:32 dalek kernel: drm_panel_backlight_quirks gpu_sched > drm_suballoc_helper video nvme drm_display_helper nvme_core cec nvme_keyring > sp5100_tco nvme_auth wmi serio_raw fuse scsi_dh_alua i2c_dev scsi_dh_rdac > scsi_dh_emc > Jun 14 18:08:32 dalek kernel: CPU: 15 UID: 0 PID: 369 Comm: kworker/15:1 > Tainted: G W 7.1.0-rc7+ #786 PREEMPT(lazy) > Jun 14 18:08:32 dalek kernel: Tainted: [W]=WARN > Jun 14 18:08:32 dalek kernel: Hardware name: To Be Filled By O.E.M. To Be > Filled By O.E.M./X570 Pro4, BIOS P3.10 07/13/2020 > Jun 14 18:08:32 dalek kernel: Workqueue: kmirrord do_mirror > Jun 14 18:08:32 dalek kernel: RIP: 0010:scsi_alloc_sgtables+0x38a/0x400 > Jun 14 18:08:32 dalek kernel: Code: 8b 3d ba 2d a9 01 e9 d1 fd ff ff 48 8b 75 > 00 48 8d bb f0 fe ff ff e8 15 b7 b0 ff 48 89 ab e0 00 00 00 89 45 08 e9 30 ff > ff ff <0f> 0b 4c 8b 6c 24 30 b8 0a 00 00 00 e9 21 ff ff ff b8 09 00 00 00 > Jun 14 18:08:32 dalek kernel: RSP: 0018:ffffd1fb8176f7f0 EFLAGS: 00010246 > Jun 14 18:08:32 dalek kernel: RAX: 0000000000000000 RBX: ffff8d1aedad0110 > RCX: 0000000000000009 > Jun 14 18:08:32 dalek kernel: RDX: 0000000000000000 RSI: ffffffff99c15960 > RDI: ffff8d1aedad0110 > Jun 14 18:08:32 dalek kernel: RBP: ffff8d1a93d17000 R08: ffff8d1aedad0110 > R09: ffff8d1a818fa800 > Jun 14 18:08:32 dalek kernel: R10: 7020676e69736961 R11: 0000000000000000 > R12: 0000000000000000 > Jun 14 18:08:32 dalek kernel: R13: 0000000000000000 R14: ffff8d1a93394000 > R15: ffff8d1a93d17000 > Jun 14 18:08:32 dalek kernel: FS: 0000000000000000(0000) > GS:ffff8d29d161f000(0000) knlGS:0000000000000000 > Jun 14 18:08:32 dalek kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 0000000080050033 > Jun 14 18:08:32 dalek kernel: CR2: 00007f0ddcd7b9d0 CR3: 000000023dcbf000 > CR4: 0000000000350ef0 > Jun 14 18:08:32 dalek kernel: Call Trace: > Jun 14 18:08:32 dalek kernel: <TASK> > Jun 14 18:08:32 dalek kernel: ? srso_return_thunk+0x5/0x5f > Jun 14 18:08:32 dalek kernel: sd_setup_read_write_cmnd+0x9d/0x740 > Jun 14 18:08:32 dalek kernel: ? srso_return_thunk+0x5/0x5f > Jun 14 18:08:32 dalek kernel: scsi_queue_rq+0x4d2/0x890 > Jun 14 18:08:32 dalek kernel: blk_mq_dispatch_rq_list+0x241/0x530 > Jun 14 18:08:32 dalek kernel: ? srso_return_thunk+0x5/0x5f > Jun 14 18:08:32 dalek kernel: ? sbitmap_get+0x61/0x100 > Jun 14 18:08:32 dalek kernel: __blk_mq_do_dispatch_sched+0x330/0x340 > Jun 14 18:08:32 dalek kernel: __blk_mq_sched_dispatch_requests+0x143/0x180 > Jun 14 18:08:32 dalek kernel: blk_mq_sched_dispatch_requests+0x2d/0x70 > Jun 14 18:08:32 dalek kernel: blk_mq_run_hw_queue+0x2bf/0x350 > Jun 14 18:08:32 dalek kernel: ? srso_return_thunk+0x5/0x5f > Jun 14 18:08:32 dalek kernel: blk_mq_dispatch_list+0x172/0x350 > Jun 14 18:08:32 dalek kernel: blk_mq_flush_plug_list+0x51/0x1a0 > Jun 14 18:08:32 dalek kernel: ? blk_mq_submit_bio+0x71c/0x9f0 > Jun 14 18:08:32 dalek kernel: __blk_flush_plug+0x112/0x180 > Jun 14 18:08:32 dalek kernel: ? srso_return_thunk+0x5/0x5f > Jun 14 18:08:32 dalek kernel: __submit_bio+0x19c/0x260 > Jun 14 18:08:32 dalek kernel: __submit_bio_noacct+0x8e/0x210 > Jun 14 18:08:32 dalek kernel: do_region+0x14c/0x2a0 > Jun 14 18:08:32 dalek kernel: dispatch_io+0xf1/0x150 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_get_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_next_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ? __pfx_read_callback+0x10/0x10 > Jun 14 18:08:32 dalek kernel: dm_io+0x169/0x2d0 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_get_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ? __pfx_bio_next_page+0x10/0x10 > Jun 14 18:08:32 dalek kernel: do_reads+0x149/0x230 > Jun 14 18:08:32 dalek kernel: ? __pfx_read_callback+0x10/0x10 > Jun 14 18:08:32 dalek kernel: do_mirror+0x11a/0x2b0 > Jun 14 18:08:32 dalek kernel: process_one_work+0x19e/0x390 > Jun 14 18:08:32 dalek kernel: worker_thread+0x1a6/0x310 > Jun 14 18:08:32 dalek kernel: ? __pfx_worker_thread+0x10/0x10 > Jun 14 18:08:32 dalek kernel: kthread+0xe4/0x120 > Jun 14 18:08:32 dalek kernel: ? __pfx_kthread+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ret_from_fork+0x1a1/0x270 > Jun 14 18:08:32 dalek kernel: ? __pfx_kthread+0x10/0x10 > Jun 14 18:08:32 dalek kernel: ret_from_fork_asm+0x1a/0x30 > Jun 14 18:08:32 dalek kernel: </TASK> > Jun 14 18:08:32 dalek kernel: ---[ end trace 0000000000000000 ]--- > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:32 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > Jun 14 18:08:37 dalek kernel: blk_print_req_error: 241000 callbacks suppressed > Jun 14 18:08:37 dalek kernel: I/O error, dev sdb, sector 50606087 op > 0x0:(READ) flags 0x0 phys_seg 0 prio class 2 > >
