Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block

Nir Soffer Fri, 26 Mar 2021 13:29:35 -0700

On Fri, Mar 26, 2021 at 3:21 AM ChangLimin <[email protected]> wrote:


> >On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <[email protected]>
> wrote:
> >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <[email protected]> wrote:
> >>On 22.03.21 10:25, ChangLimin wrote:
> >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using
> >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return
> -EBUSY
> >>> permanently.
> >>
> >>So as far as I can track back the discussion, Kevin asked on v1 why we’d
> >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away
> >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then.
> >>You haven’t explicitly replied to that question (as far as I can see),
> >>so it kind of still stands.
> >>
> >>Implicitly, there are two conflicting answers in this patch: On one
> >>hand, the commit message says “permanently”, and this is what you told
> >>Nir as a realistic case where this can occur.
> >
> >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is
> below.
> >For other Linux version, the EBUSY may be temporary.
> >Because  Linux 5.10/5.11 is not used widely, so do not set
> has_write_zeroes to false.
> >
> >>I'm afraid ChangLimin did not answer my question. I'm looking for real
> >>world used case when qemu cannot write zeros to multipath device, when
> >>nobody else is using the device.
> >>
> >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert,
> >>once with a multipath device, and once with logical volume on a vg
> created
> >>on the multipath device, and I could not reproduce this issue.
> >
> >The following is steps to reproduct the issue on Fedora 34.
> >
> ># uname -a
> >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021
> x86_64 x86_64 x86_64 GNU/Linux
> >
> >Is this the most recent kernel? I have 5.11.7 in fedora 32.
> >
> >
> ># qemu-img -V
> >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1)
> >
> >1.  Login in an ISCSI LUN created using targetcli on ubuntu 20.04
> ># iscsiadm -m discovery -t st -p 192.169.1.109
> >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> >
> ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100
> ># iscsiadm -m session
> >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100
> (non-flash)
> >
> >2. start multipathd service
> ># mpathconf --enable
> ># systemctl start multipathd
> >
> >3.  add multipath path
> ># multipath -a `/lib/udev/scsi_id -g /dev/sdb`   # sdb means the ISCSI LUN
> >wwid '36001405b76856e4816b48b99c6a77de3' added
> >
> ># multipathd add path /dev/sdb
> >ok
> >
> ># multipath -ll     # /dev/dm-1 is the multipath device based on /dev/sdb
> >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102
> >size=1.0G features='0' hwhandler='1 alua' wp=rw
> >`-+- policy='service-time 0' prio=50 status=active
> >  `- 5:0:0:0  sdd  8:48   active ready running
> >
> >You are using user_friendly_names which is (sadly) the default.
> >But I don't think it should matter.
> >
> >4. qemu-img return EBUSY both to dm-1 and sdb
> ># wget
> http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
> ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1
> >qemu-img: error while writing at byte 0: Device or resource busy
> >
> ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb
> >qemu-img: error while writing at byte 0: Device or resource busy
> >
> >5. blkdiscard also return EBUSY  both to dm-1 and sdb
> ># blkdiscard -o 0 -l 4096 /dev/dm-1
> >blkdiscard: cannot open /dev/dm-1: Device or resource busy
> >
> ># blkdiscard -o 0 -l 4096 /dev/sdb
> >blkdiscard: cannot open /dev/sdb: No such file or directory
> >
> >6. dd write zero is good, because it does not use blkdiscard
> ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct
> >100+0 records in
> >100+0 records out
> >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s
> >
> >7. The LUN should support blkdiscard feature, otherwise it will not write
> zero
> >with  ioctl(fd, BLKZEROOUT, range)
> >
> >Thanks!
> >
> >I could not reproduce this with kernel 5.10, but now I'm no 5.11:
> ># uname -r
> >5.11.7-100.fc32.x86_64
> >
> ># qemu-img --version
> >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1)
> >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
> >
> ># cat /etc/multipath.conf
> >defaults {
> >user_friendly_names no
> >find_multipaths no
> >}
> >
> >blacklist_exceptions {
> >        property "(SCSI_IDENT_|ID_WWN)"
> >}
> >
> >blacklist {
> >}
> >
> ># multipath -ll 36001405e884ab8ff4b44fdba6901099c
> >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09
> >size=6.0G features='0' hwhandler='1 alua' wp=rw
> >`-+- policy='service-time 0' prio=50 status=active
> >  `- 1:0:0:9 sdk     8:160 active ready running
> >
> >$ lsblk /dev/sdk
> >NAME                                MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> >sdk                                   8:160  0   6G  0 disk
> >└─36001405e884ab8ff4b44fdba6901099c 253:13   0   6G  0 mpath
> >
> >$ virt-builder fedora-32 -o disk.img
> >[   2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz
> >[   3.8] Planning how to build this image
> >[   3.8] Uncompressing
> >[  11.1] Opening the new disk
> >[  16.1] Setting a random seed
> >[  16.1] Setting passwords
> >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF
> >[  17.0] Finishing off
> >                   Output file: disk.img
> >                   Output size: 6.0G
> >                 Output format: raw
> >            Total usable space: 5.4G
> >                    Free space: 4.0G (74%)
> >
> >$ qemu-img info disk.img
> >image: disk.img
> >file format: raw
> >virtual size: 6 GiB (6442450944 bytes)
> >disk size: 1.29 GiB
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Works.
> >
> ># lsblk /dev/sdk
> >NAME                                   MAJ:MIN RM  SIZE RO TYPE
>  MOUNTPOINT
> >sdk                                      8:160  0    6G  0 disk
> >└─36001405e884ab8ff4b44fdba6901099c    253:13   0    6G  0 mpath
> >  ├─36001405e884ab8ff4b44fdba6901099c1 253:14   0    1M  0 part
> >  ├─36001405e884ab8ff4b44fdba6901099c2 253:15   0    1G  0 part
> >  ├─36001405e884ab8ff4b44fdba6901099c3 253:16   0  615M  0 part
> >  └─36001405e884ab8ff4b44fdba6901099c4 253:17   0  4.4G  0 part
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Works, wiping the partitions.
> >
> ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt
> >
> ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4
> >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs
> (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> >
> >This is invalid use, converting to device with mounted file system, but
> let's try:
> >
> ># qemu-img convert -p -f raw -O raw -t none -T none disk.img
> /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >    (100.00/100%)
> >
> >Still works?!
> >
> ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing
> initialization failed: Device or resource busy
> >
> ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c
> >
> >Works.
> >
> >This the configuration of the LUN on the server side:
> >
> >   {
>
> >      "aio": false,
>
> >      "alua_tpgs": [
>
> >        {
>
> >          "alua_access_state": 0,
>
> >          "alua_access_status": 0,
>
> >          "alua_access_type": 3,
>
> >          "alua_support_active_nonoptimized": 1,
>
> >          "alua_support_active_optimized": 1,
>
> >          "alua_support_offline": 1,
>
> >          "alua_support_standby": 1,
>
> >          "alua_support_transitioning": 1,
>
> >          "alua_support_unavailable": 1,
>
> >          "alua_write_metadata": 0,
>
> >          "implicit_trans_secs": 0,
>
> >          "name": "default_tg_pt_gp",
>
> >          "nonop_delay_msecs": 100,
>
> >          "preferred": 0,
>
> >          "tg_pt_gp_id": 0,
>
> >          "trans_delay_msecs": 0
>
> >        }
>
> >      ],
>
> >      "attributes": {
>
> >        "block_size": 512,
>
> >        "emulate_3pc": 1,
>
> >        "emulate_caw": 1,
>
> >        "emulate_dpo": 1,
>
> >        "emulate_fua_read": 1,
>
> >        "emulate_fua_write": 1,
>
> >        "emulate_model_alias": 1,
>
> >        "emulate_pr": 1,
>
> >        "emulate_rest_reord": 0,
>
> >        "emulate_tas": 1,
>
> >        "emulate_tpu": 1,
>
> >        "emulate_tpws": 1,
>
> >        "emulate_ua_intlck_ctrl": 0,
>
> >        "emulate_write_cache": 1,
>
> >        "enforce_pr_isids": 1,
>
> >        "force_pr_aptpl": 0,
>
> >        "is_nonrot": 0,
>
> >        "max_unmap_block_desc_count": 1,
>
> >        "max_unmap_lba_count": 8192,
>
> >        "max_write_same_len": 65335,
>
> >        "optimal_sectors": 16384,
>
> >        "pi_prot_format": 0,
>
> >        "pi_prot_type": 0,
>
> >        "pi_prot_verify": 0,
>
> >        "queue_depth": 128,
>
> >        "unmap_granularity": 1,
>
> >        "unmap_granularity_alignment": 0,
>
> >        "unmap_zeroes_data": 0
>
> >      },
>
> >      "dev": "/target/3/09",
>
> >      "name": "3-09",
>
> >      "plugin": "fileio",
>
> >      "size": 6442450944,
>
> >      "write_back": true,
>
> >      "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d"
>
> >    },
>
> >
> >Maybe this upstream change is not in all downstream 5.11 kernels, or
> 5.11.7
> >already includes the fix?
>
> Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg
> Kroah-Hartman.
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f
>
> # git describe d44c9780ed40db88626c9354868eab72159c7a7f
> v5.10.23-184-gd44c9780ed40
>
> # git describe 7e0815797656f029fab2edc309406cddf931b9d8
> v5.11.6-178-g7e0815797656
>

So this is practically fixed.

I don't think keeping a workaround for this in qemu is a good idea.

>Adding Ben, maybe he had more insight on the multipath side.
> >
> >>If I understand the kernel change correctly, this can happen when there
> is
> >>a mounted file system on top of the multipath device. I don't think we
> have
> >>a use case when qemu accesses a multipath device when the device is used
> >>by a file system, but maybe I missed something.
> >>
> >>So that to me implies
> >>that we actually should not retry BLKZEROOUT, because the EBUSY will
> >>remain, and that condition won’t change while the block device is in use
> >>by qemu.
> >>
> >>On the other hand, in the code, you have decided not to reset
> >>has_write_zeroes to false, so the implementation will retry.
> >>
> >>EBUSY is usually a temporary error, so retrying makes sense. The question
> >>is if we really can write zeroes manually in this case?
> >>
> >>So I don’t quite understand.  Should we keep trying BLKZEROOUT or is
> >>there no chance of it working after it has at one point failed with
> >>EBUSY?  (Are there other cases besides what’s described in this commit
> >>message where EBUSY might be returned and it is only temporary?)
> >>
> >>> Fallback to pwritev instead of exit for -EBUSY error.
> >>>
> >>> The issue was introduced in Linux 5.10:
> >>>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02
> >>>
> >>> Fixed in Linux 5.12:
> >>>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d
> >>>
> >>> Signed-off-by: ChangLimin <[email protected]>
> >>> ---
> >>>   block/file-posix.c | 8 ++++++--
> >>>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/file-posix.c b/block/file-posix.c
> >>> index 20e14f8e96..d4054ac9cb 100644
> >>> --- a/block/file-posix.c
> >>> +++ b/block/file-posix.c
> >>> @@ -1624,8 +1624,12 @@ static ssize_t
> >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> >>>           } while (errno == EINTR);
> >>>
> >>>           ret = translate_err(-errno);
> >>> -        if (ret == -ENOTSUP) {
> >>> -            s->has_write_zeroes = false;
> >>> +        switch (ret) {
> >>> +        case -ENOTSUP:
> >>> +            s->has_write_zeroes = false; /* fall through */
> >>> +        case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for
> multipath
> >>> devices */
> >>> +            return -ENOTSUP;
> >>> +            break;
> >>
> >>(Not sure why this break is here.)
> >>
> >>Max
> >>
> >>>           }
> >>>       }
> >>>   #endif
> >>> --
> >>> 2.27.0
> >>>
>
>

Re: [PATCH V4] file-posix: allow -EBUSY error during ioctl(fd, BLKZEROOUT, range) on block

Reply via email to