On Fri, Mar 26, 2021 at 3:21 AM ChangLimin <chan...@chinatelecom.cn> wrote:
> >On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <chan...@chinatelecom.cn> > wrote: > >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mre...@redhat.com> wrote: > >>On 22.03.21 10:25, ChangLimin wrote: > >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using > >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return > -EBUSY > >>> permanently. > >> > >>So as far as I can track back the discussion, Kevin asked on v1 why we’d > >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away > >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. > >>You haven’t explicitly replied to that question (as far as I can see), > >>so it kind of still stands. > >> > >>Implicitly, there are two conflicting answers in this patch: On one > >>hand, the commit message says “permanently”, and this is what you told > >>Nir as a realistic case where this can occur. > > > >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is > below. > >For other Linux version, the EBUSY may be temporary. > >Because Linux 5.10/5.11 is not used widely, so do not set > has_write_zeroes to false. > > > >>I'm afraid ChangLimin did not answer my question. I'm looking for real > >>world used case when qemu cannot write zeros to multipath device, when > >>nobody else is using the device. > >> > >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, > >>once with a multipath device, and once with logical volume on a vg > created > >>on the multipath device, and I could not reproduce this issue. > > > >The following is steps to reproduct the issue on Fedora 34. > > > ># uname -a > >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > > > >Is this the most recent kernel? I have 5.11.7 in fedora 32. > > > > > ># qemu-img -V > >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) > > > >1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 > ># iscsiadm -m discovery -t st -p 192.169.1.109 > >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > > > ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 > ># iscsiadm -m session > >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > (non-flash) > > > >2. start multipathd service > ># mpathconf --enable > ># systemctl start multipathd > > > >3. add multipath path > ># multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN > >wwid '36001405b76856e4816b48b99c6a77de3' added > > > ># multipathd add path /dev/sdb > >ok > > > ># multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb > >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 > >size=1.0G features='0' hwhandler='1 alua' wp=rw > >`-+- policy='service-time 0' prio=50 status=active > > `- 5:0:0:0 sdd 8:48 active ready running > > > >You are using user_friendly_names which is (sadly) the default. > >But I don't think it should matter. > > > >4. qemu-img return EBUSY both to dm-1 and sdb > ># wget > http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img > ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 > >qemu-img: error while writing at byte 0: Device or resource busy > > > ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb > >qemu-img: error while writing at byte 0: Device or resource busy > > > >5. blkdiscard also return EBUSY both to dm-1 and sdb > ># blkdiscard -o 0 -l 4096 /dev/dm-1 > >blkdiscard: cannot open /dev/dm-1: Device or resource busy > > > ># blkdiscard -o 0 -l 4096 /dev/sdb > >blkdiscard: cannot open /dev/sdb: No such file or directory > > > >6. dd write zero is good, because it does not use blkdiscard > ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct > >100+0 records in > >100+0 records out > >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s > > > >7. The LUN should support blkdiscard feature, otherwise it will not write > zero > >with ioctl(fd, BLKZEROOUT, range) > > > >Thanks! > > > >I could not reproduce this with kernel 5.10, but now I'm no 5.11: > ># uname -r > >5.11.7-100.fc32.x86_64 > > > ># qemu-img --version > >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1) > >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers > > > ># cat /etc/multipath.conf > >defaults { > >user_friendly_names no > >find_multipaths no > >} > > > >blacklist_exceptions { > > property "(SCSI_IDENT_|ID_WWN)" > >} > > > >blacklist { > >} > > > ># multipath -ll 36001405e884ab8ff4b44fdba6901099c > >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09 > >size=6.0G features='0' hwhandler='1 alua' wp=rw > >`-+- policy='service-time 0' prio=50 status=active > > `- 1:0:0:9 sdk 8:160 active ready running > > > >$ lsblk /dev/sdk > >NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT > >sdk 8:160 0 6G 0 disk > >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > > > >$ virt-builder fedora-32 -o disk.img > >[ 2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz > >[ 3.8] Planning how to build this image > >[ 3.8] Uncompressing > >[ 11.1] Opening the new disk > >[ 16.1] Setting a random seed > >[ 16.1] Setting passwords > >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF > >[ 17.0] Finishing off > > Output file: disk.img > > Output size: 6.0G > > Output format: raw > > Total usable space: 5.4G > > Free space: 4.0G (74%) > > > >$ qemu-img info disk.img > >image: disk.img > >file format: raw > >virtual size: 6 GiB (6442450944 bytes) > >disk size: 1.29 GiB > > > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img > /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > (100.00/100%) > > > >Works. > > > ># lsblk /dev/sdk > >NAME MAJ:MIN RM SIZE RO TYPE > MOUNTPOINT > >sdk 8:160 0 6G 0 disk > >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > > ├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part > > ├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part > > ├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part > > └─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part > > > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img > /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > (100.00/100%) > > > >Works, wiping the partitions. > > > ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt > > > ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 > >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs > (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) > > > >This is invalid use, converting to device with mounted file system, but > let's try: > > > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img > /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > (100.00/100%) > > > >Still works?! > > > ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c > >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing > initialization failed: Device or resource busy > > > ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c > > > >Works. > > > >This the configuration of the LUN on the server side: > > > > { > > > "aio": false, > > > "alua_tpgs": [ > > > { > > > "alua_access_state": 0, > > > "alua_access_status": 0, > > > "alua_access_type": 3, > > > "alua_support_active_nonoptimized": 1, > > > "alua_support_active_optimized": 1, > > > "alua_support_offline": 1, > > > "alua_support_standby": 1, > > > "alua_support_transitioning": 1, > > > "alua_support_unavailable": 1, > > > "alua_write_metadata": 0, > > > "implicit_trans_secs": 0, > > > "name": "default_tg_pt_gp", > > > "nonop_delay_msecs": 100, > > > "preferred": 0, > > > "tg_pt_gp_id": 0, > > > "trans_delay_msecs": 0 > > > } > > > ], > > > "attributes": { > > > "block_size": 512, > > > "emulate_3pc": 1, > > > "emulate_caw": 1, > > > "emulate_dpo": 1, > > > "emulate_fua_read": 1, > > > "emulate_fua_write": 1, > > > "emulate_model_alias": 1, > > > "emulate_pr": 1, > > > "emulate_rest_reord": 0, > > > "emulate_tas": 1, > > > "emulate_tpu": 1, > > > "emulate_tpws": 1, > > > "emulate_ua_intlck_ctrl": 0, > > > "emulate_write_cache": 1, > > > "enforce_pr_isids": 1, > > > "force_pr_aptpl": 0, > > > "is_nonrot": 0, > > > "max_unmap_block_desc_count": 1, > > > "max_unmap_lba_count": 8192, > > > "max_write_same_len": 65335, > > > "optimal_sectors": 16384, > > > "pi_prot_format": 0, > > > "pi_prot_type": 0, > > > "pi_prot_verify": 0, > > > "queue_depth": 128, > > > "unmap_granularity": 1, > > > "unmap_granularity_alignment": 0, > > > "unmap_zeroes_data": 0 > > > }, > > > "dev": "/target/3/09", > > > "name": "3-09", > > > "plugin": "fileio", > > > "size": 6442450944, > > > "write_back": true, > > > "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d" > > > }, > > > > >Maybe this upstream change is not in all downstream 5.11 kernels, or > 5.11.7 > >already includes the fix? > > Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg > Kroah-Hartman. > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8 > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f > > # git describe d44c9780ed40db88626c9354868eab72159c7a7f > v5.10.23-184-gd44c9780ed40 > > # git describe 7e0815797656f029fab2edc309406cddf931b9d8 > v5.11.6-178-g7e0815797656 > So this is practically fixed. I don't think keeping a workaround for this in qemu is a good idea. >Adding Ben, maybe he had more insight on the multipath side. > > > >>If I understand the kernel change correctly, this can happen when there > is > >>a mounted file system on top of the multipath device. I don't think we > have > >>a use case when qemu accesses a multipath device when the device is used > >>by a file system, but maybe I missed something. > >> > >>So that to me implies > >>that we actually should not retry BLKZEROOUT, because the EBUSY will > >>remain, and that condition won’t change while the block device is in use > >>by qemu. > >> > >>On the other hand, in the code, you have decided not to reset > >>has_write_zeroes to false, so the implementation will retry. > >> > >>EBUSY is usually a temporary error, so retrying makes sense. The question > >>is if we really can write zeroes manually in this case? > >> > >>So I don’t quite understand. Should we keep trying BLKZEROOUT or is > >>there no chance of it working after it has at one point failed with > >>EBUSY? (Are there other cases besides what’s described in this commit > >>message where EBUSY might be returned and it is only temporary?) > >> > >>> Fallback to pwritev instead of exit for -EBUSY error. > >>> > >>> The issue was introduced in Linux 5.10: > >>> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > >>> > >>> Fixed in Linux 5.12: > >>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > >>> > >>> Signed-off-by: ChangLimin <chan...@chinatelecom.cn> > >>> --- > >>> block/file-posix.c | 8 ++++++-- > >>> 1 file changed, 6 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/block/file-posix.c b/block/file-posix.c > >>> index 20e14f8e96..d4054ac9cb 100644 > >>> --- a/block/file-posix.c > >>> +++ b/block/file-posix.c > >>> @@ -1624,8 +1624,12 @@ static ssize_t > >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > >>> } while (errno == EINTR); > >>> > >>> ret = translate_err(-errno); > >>> - if (ret == -ENOTSUP) { > >>> - s->has_write_zeroes = false; > >>> + switch (ret) { > >>> + case -ENOTSUP: > >>> + s->has_write_zeroes = false; /* fall through */ > >>> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for > multipath > >>> devices */ > >>> + return -ENOTSUP; > >>> + break; > >> > >>(Not sure why this break is here.) > >> > >>Max > >> > >>> } > >>> } > >>> #endif > >>> -- > >>> 2.27.0 > >>> > >