Hi,

On 2025/11/21 19:01, Filippo Giunchedi wrote:
> Hello linux-raid,
> I'm seeking assistance with the following bug: recent versions of mpt3sas
> started announcing a drive optimal_io_size of 0xFFF000, and when such drives
> are part of an mdraid raid10, the array's optimal_io_size ends up abnormally
> large (4293918720, i.e. 0xFFF00000, in the reproducer below).
>
> When an LVM PV is created on the array, its metadata area is by default
> aligned with the array's optimal_io_size, resulting in an abnormally large
> size of ~4GB. During GRUB's LVM detection an allocation is made based on the
> metadata area size, which results in an unbootable system. This problem shows
> up only for newly-created PVs, and thus systems with existing PVs are not
> affected in my testing.
>
> I was able to reproduce the problem on qemu using scsi-hd devices as shown
> below and on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006. The
> bug is present both on Debian's stable kernel and Linux 6.18, though I
> haven't yet determined when the change was introduced in mpt3sas.
>
> I'm wondering where the problem is in this case and what could be done to
> fix it?
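For context, a minimal back-of-the-envelope sketch of where the inflated value
comes from, assuming that raid10 "near" over 4 members with the default 512 KiB
chunk advertises a 2-stripe (1 MiB) io_opt of its own, and that the block layer
stacks io_opt values via lcm() as in blk_stack_limits() (the numbers match the
qemu reproducer quoted further down):

  # plain POSIX shell; lcm(a, b) = a / gcd(a, b) * b
  gcd() {
      a=$1; b=$2
      while [ "$b" -ne 0 ]; do set -- "$b" $((a % b)); a=$1; b=$2; done
      echo "$a"
  }
  member_opt=16773120            # 0xFFF000 as announced by the member drives
  stripe=$((2 * 512 * 1024))     # 2 data stripes * 512 KiB default chunk
  g=$(gcd "$member_opt" "$stripe")
  echo $((member_opt / g * stripe))   # 4293918720, i.e. 0xFFF00000

Since 0xFFF000 = 4095 * 4 KiB, it shares only a 4 KiB factor with the 1 MiB
stripe, so the lcm balloons to 4095 MiB; any member io_opt that is neither a
divisor nor a multiple of the stripe size inflates the stacked value this way.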
You can take a look at the following thread.

Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt - Yu Kuai
<https://lore.kernel.org/all/[email protected]/>

> thank you,
> Filippo
>
> On Thu, Nov 20, 2025 at 02:43:24PM +0000, Filippo Giunchedi wrote:
>> Hello Salvatore,
>> Thank you for the quick reply.
>>
>> On Wed, Nov 19, 2025 at 05:59:48PM +0100, Salvatore Bonaccorso wrote:
>> [...]
>>>> Capabilities: [348] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>>>> Capabilities: [380] Data Link Feature <?>
>>>> Kernel driver in use: mpt3sas
>>>
>>> This sounds like quite an interesting finding but probably hard to
>>> reproduce without the hardware if it comes down to being specific to the
>>> controller type and driver.
>>
>> That's a great point re: reproducibility, and it got me curious about
>> something I hadn't thought of testing, namely whether there's another angle
>> to this: does any block device with the same block I/O hints exhibit the
>> same problem? The answer is actually "yes".
>>
>> I used qemu's 'scsi-hd' device to set the same values so I could test
>> locally. On an already-installed VM I added the following to present four
>> new devices:
>>
>> -device virtio-scsi-pci,id=scsi0
>>
>> -drive file=./workdir/disks/disk3.qcow2,format=qcow2,if=none,id=drive3
>> -device scsi-hd,bus=scsi0.0,drive=drive3,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120
>>
>> -drive file=./workdir/disks/disk4.qcow2,format=qcow2,if=none,id=drive4
>> -device scsi-hd,bus=scsi0.0,drive=drive4,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120
>>
>> -drive file=./workdir/disks/disk5.qcow2,format=qcow2,if=none,id=drive5
>> -device scsi-hd,bus=scsi0.0,drive=drive5,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120
>>
>> -drive file=./workdir/disks/disk6.qcow2,format=qcow2,if=none,id=drive6
>> -device scsi-hd,bus=scsi0.0,drive=drive6,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120
>>
>> I used 10G files created with 'qemu-img create -f qcow2 <file> 10G', though
>> the size doesn't affect anything in my testing.
>>
>> Then in the VM:
>>
>> # cat /sys/block/sd[cdef]/queue/optimal_io_size
>> 16773120
>> 16773120
>> 16773120
>> 16773120
>> # mdadm --create /dev/md1 --level 10 --bitmap none --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md1 started.
>> # cat /sys/block/md1/queue/optimal_io_size
>> 4293918720
>>
>> I was able to reproduce the problem with src:linux 6.18~rc6-1~exp1 as well
>> as 6.12.57-1.
>>
>> Since it is easy to test this way, I tried a few different opt_io_size
>> values and was able to reproduce only with 16773120 (i.e. 0xFFF000).
>>
>>> I would like to ask: do you have the possibility to make an OS
>>> installation such that you can freely experiment with various kernels
>>> and then assemble the arrays under them? If so, that would be great, so
>>> that you could start bisecting to find where the change was introduced.
>>>
>>> I.e. install the OS independently of the controller, then bisect the
>>> Debian kernels manually between bookworm and trixie (6.1.y -> 6.12.y)
>>> to narrow down the upstream range.
>>
>> Yes, I'm able to perform testing on this host; in fact I worked around the
>> problem for now by disabling LVM's md alignment auto-detection, and thus we
>> have an installed system.
>> For reference, that's "devices { data_alignment_detection = 0 }" in lvm's
>> config.
>>
>>> Then bisect the upstream changes to find the offending commits. Let me
>>> know if you need more specific instructions on the idea.
>>
>> Having pointers on the recommended way to build Debian kernels would be of
>> great help, thank you!
>>
>>> Additionally it would be interesting to know if the issue persists in
>>> 6.17.8 or even 6.18~rc6-1~exp1, to be able to clearly indicate upstream
>>> that the issue persists in newer kernels.
>>>
>>> Ideally this actually goes to upstream asap once we are more confident
>>> about the subsystem where to report the issue. If we are reasonably
>>> confident it is mpt3sas-specific already then I would say to go
>>> already to:
>>
>> Given the qemu-based reproducer above, maybe this issue is actually two
>> bugs: raid10 as per above, and mpt3sas presenting 0xFFF000 as
>> optimal_io_size. While the latter might be suspicious, maybe it is not
>> wrong per se?
>>
>> best,
>> Filippo

--
Thanks,
Kuai
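As a quick check against the qemu reproducer above, a hypothetical way to
observe the oversized alignment on a freshly created PV and to apply the
workaround quoted above without editing lvm.conf (assuming stock lvm2
tooling; exact fields and output may vary):

  # create a PV on the reproducer array and look at where the data area starts;
  # per the report the metadata area balloons to ~4GB with the inflated io_opt
  pvcreate /dev/md1
  pvs -o pv_name,pe_start,pv_mda_size /dev/md1

  # same, with alignment detection disabled as in the workaround above
  pvremove /dev/md1
  pvcreate --config 'devices { data_alignment_detection = 0 }' /dev/md1
  pvs -o pv_name,pe_start,pv_mda_size /dev/md1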

