Re: [PATCH 09/39] qobject: replace assert(0) with g_assert_not_reached()

2024-09-11 Thread Kevin Wolf
Am 11.09.2024 um 00:15 hat Pierrick Bouvier geschrieben:
> Signed-off-by: Pierrick Bouvier 

Reviewed-by: Kevin Wolf 




Re: [PATCH 31/39] hw/scsi: remove break after g_assert_not_reached()

2024-09-11 Thread Kevin Wolf
Am 11.09.2024 um 00:15 hat Pierrick Bouvier geschrieben:
> Signed-off-by: Pierrick Bouvier 

Reviewed-by: Kevin Wolf 




Re: [PATCH 25/39] block: remove break after g_assert_not_reached()

2024-09-11 Thread Kevin Wolf
Am 11.09.2024 um 00:15 hat Pierrick Bouvier geschrieben:
> Signed-off-by: Pierrick Bouvier 

Reviewed-by: Kevin Wolf 




Re: [PATCH 15/39] block: replace assert(false) with g_assert_not_reached()

2024-09-11 Thread Kevin Wolf
Am 11.09.2024 um 00:15 hat Pierrick Bouvier geschrieben:
> Signed-off-by: Pierrick Bouvier 

Reviewed-by: Kevin Wolf 




Re: [PATCH] block: support locking on change medium

2024-09-09 Thread Kevin Wolf
Am 09.09.2024 um 16:25 hat Joelle van Dyne geschrieben:
> On Mon, Sep 9, 2024 at 2:56 AM Kevin Wolf  wrote:
> >
> > Am 09.09.2024 um 03:58 hat Joelle van Dyne geschrieben:
> > > New optional argument for 'blockdev-change-medium' QAPI command to allow
> > > the caller to specify if they wish to enable file locking.
> > >
> > > Signed-off-by: Joelle van Dyne 
> >
> > I feel once you need to control such details of the backend, you should
> > really use a separate 'blockdev-add' command.
> >
> > If it feels a bit too cumbersome to send explicit commands to open the
> > tray, remove the medium, insert the new medium referencing the node you
> > added with 'blockdev-add' and then close the tray again, I can
> > understand. Maybe what we should do is extend 'blockdev-change-medium'
> > so that it doesn't only accept a filename to specify the new image, but
> > alternatively also a node-name.
> >
> > > +switch (file_locking_mode) {
> > > +case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_AUTO:
> > > +break;
> > > +
> > > +case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_OFF:
> > > +qdict_put_str(options, "file.locking", "off");
> > > +break;
> > > +
> > > +case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_ON:
> > > +qdict_put_str(options, "file.locking", "on");
> > > +break;
> > > +
> > > +default:
> > > +abort();
> > > +}
> >
> > Using "file.locking" makes assumptions about what the passed filename
> > string would result in. There is nothing that guarantees that the block
> > driver even has a "file" child, or that the "file" child is referring
> > to a file-posix driver rather than using a different protocol or being a
> > filter driver above yet another node. It also doesn't consider backing
> > files and other non-primary children of the opened node.
> >
> > So this is not correct, and I don't think there is any realistic way of
> > making it correct with this approach.
> 
> The existence of "filename" already makes the assumption that the
> input is a file child.

No. Try using something like "blkdebug::image.iso" or "nbd://localhost".

In the former case, you get another layer and the "file" child would be
a blkdebug node. To turn off locking on the file-posix block driver
you'd need to set "file.file.locking" in this case.

And the latter doesn't have any file-posix involved, it goes straight to
the NBD block driver.
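
As a rough illustration (hypothetical filenames; real graphs can nest
arbitrarily deep), the option key needed to reach the file-posix
locking option depends entirely on the resulting node graph:

    # Illustrative mapping only, not an exhaustive or authoritative list
    locking_key = {
        "image.iso":           "file.locking",       # format driver over file-posix
        "blkdebug::image.iso": "file.file.locking",  # extra blkdebug layer in between
        "nbd://localhost":     None,                 # no file-posix node at all
    }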

> While I agree with you that there are better ways to solve this
> problem, ultimately "blockdev-change-medium" will have to be
> deprecated when this hypothetical "better" way of referencing a node
> added with blockdev-add is introduced.

This is not a hypothetical better way. It is how you can achieve this
today.

Extending blockdev-change-medium to alternatively accept a node-name
would just introduce a convenience shortcut (as is the whole command)
that doesn't require you to send four QMP commands (blockdev-open-tray,
blockdev-remove-medium, blockdev-insert-medium and blockdev-close-tray).
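
For reference, a minimal sketch of that explicit sequence as driven
from a Python iotests-style QMP session (the device id 'ide0-cd0', the
node name and the filename are made-up examples; check the argument
names against your QEMU version):

    vm.qmp('blockdev-add', driver='raw', node_name='new-medium',
           file={'driver': 'file', 'filename': 'new.iso', 'locking': 'off'})
    vm.qmp('blockdev-open-tray', id='ide0-cd0')
    vm.qmp('blockdev-remove-medium', id='ide0-cd0')
    vm.qmp('blockdev-insert-medium', id='ide0-cd0', node_name='new-medium')
    vm.qmp('blockdev-close-tray', id='ide0-cd0')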

There is no need to deprecate blockdev-change-medium with a filename any
more than there is a need to deprecate -hda on the command line. We
could do it because there are alternatives that provide a strict
superset of functionality, but we don't have to.

Kevin




Re: [PATCH] block: support locking on change medium

2024-09-09 Thread Kevin Wolf
Am 09.09.2024 um 03:58 hat Joelle van Dyne geschrieben:
> New optional argument for 'blockdev-change-medium' QAPI command to allow
> the caller to specify if they wish to enable file locking.
> 
> Signed-off-by: Joelle van Dyne 

I feel once you need to control such details of the backend, you should
really use a separate 'blockdev-add' command.

If it feels a bit too cumbersome to send explicit commands to open the
tray, remove the medium, insert the new medium referencing the node you
added with 'blockdev-add' and then close the tray again, I can
understand. Maybe what we should do is extend 'blockdev-change-medium'
so that it doesn't only accept a filename to specify the new image, but
alternatively also a node-name.

> +switch (file_locking_mode) {
> +case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_AUTO:
> +break;
> +
> +case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_OFF:
> +qdict_put_str(options, "file.locking", "off");
> +break;
> +
> +case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_ON:
> +qdict_put_str(options, "file.locking", "on");
> +break;
> +
> +default:
> +abort();
> +}

Using "file.locking" makes assumptions about what the passed filename
string would result in. There is nothing that guarantees that the block
driver even has a "file" child, or that the "file" child is referring
to a file-posix driver rather than using a different protocol or being a
filter driver above yet another node. It also doesn't consider backing
files and other non-primary children of the opened node.

So this is not correct, and I don't think there is any realistic way of
making it correct with this approach.

Kevin




[PATCH] raw-format: Fix error message for invalid offset/size

2024-08-29 Thread Kevin Wolf
s->offset and s->size are only set at the end of the function and still
contain the old values when formatting the error message. Print the
parameters with the new values that we actually checked instead.

Fixes: 500e2434207d ('raw-format: Split raw_read_options()')
Signed-off-by: Kevin Wolf 
---
 block/raw-format.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/raw-format.c b/block/raw-format.c
index ac7e8495f6..e08526e2ec 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -111,7 +111,7 @@ raw_apply_options(BlockDriverState *bs, BDRVRawState *s, uint64_t offset,
 if (offset > real_size) {
 error_setg(errp, "Offset (%" PRIu64 ") cannot be greater than "
"size of the containing file (%" PRId64 ")",
-   s->offset, real_size);
+   offset, real_size);
 return -EINVAL;
 }
 
@@ -119,7 +119,7 @@ raw_apply_options(BlockDriverState *bs, BDRVRawState *s, uint64_t offset,
 error_setg(errp, "The sum of offset (%" PRIu64 ") and size "
"(%" PRIu64 ") has to be smaller or equal to the "
" actual size of the containing file (%" PRId64 ")",
-   s->offset, s->size, real_size);
+   offset, size, real_size);
 return -EINVAL;
 }
 
-- 
2.46.0




Re: [PATCH v5 0/5] vvfat: Fix write bugs for large files and add iotests

2024-08-13 Thread Kevin Wolf
Am 11.08.2024 um 09:51 hat Michael Tokarev geschrieben:
> 12.06.2024 15:43, Amjad Alsharafi wrote:
> > These patches fix some bugs found when modifying files in vvfat.
> > First, there was a bug when writing to cluster 2 or above of a file:
> > the cluster before it would be copied instead, so when writing to
> > cluster=2, the content of cluster=1 would be written to disk in its
> > place.
> >
> > Another problem was modifying the clusters of a file and adding new
> > clusters; this showed 2 issues:
> > - If the new cluster is not immediately after the last cluster, it
> > will cause issues when reading from this file in the future.
> > - Generally, the usage of info.file.offset was incorrect, and the
> > system would crash on abort() when the file was modified and a new
> > cluster was added.
> >
> > Also added some iotests for vvfat, covering this fix as well as
> > general behavior such as reading, writing, and creating files on the
> > filesystem, including tests for reading/writing the first cluster,
> > which would pass even before this patch.
> ...
> > Amjad Alsharafi (5):
> >vvfat: Fix bug in writing to middle of file
> >vvfat: Fix usage of `info.file.offset`
> >vvfat: Fix wrong checks for cluster mappings invariant
> >vvfat: Fix reading files with non-continuous clusters
> >iotests: Add `vvfat` tests
> 
> Actually, maybe the whole series is a good candidate for -stable, not
> just the first fix.  What do you think?

Yes, if you consider vvfat relevant for stable at all, then I think you
want to take all of the fixes, each one fixes some corruption in
read-write mode. (Though as far as I can tell, read-write support is
still broken even after this series.)

Kevin




[PULL v3 00/13] Block layer patches

2024-08-06 Thread Kevin Wolf
The following changes since commit c659b7b3b4925f8cef486a3ee64e911519495782:

  Merge tag 'pull-riscv-to-apply-20240806-2' of https://github.com/alistair23/qemu into staging (2024-08-06 17:35:51 +1000)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to ca1dcc913888a97073233dd9142ca5241dab1b7c:

  iotests/024: exclude 'backing file format' field from the output (2024-08-06 20:12:40 +0200)


Block layer patches

- scsi-block: Fix error handling with r/werror=stop
- Depend on newer clang for TSA, make WITH_GRAPH_RDLOCK_GUARD() fully
  checked, fix block-copy to add missing lock
- vvfat: Fix write bugs for large files and add iotests
- Clean up blockdev-snapshot-internal-sync doc
- Fix iotests 024 for qed


Amjad Alsharafi (5):
  vvfat: Fix bug in writing to middle of file
  vvfat: Fix usage of `info.file.offset`
  vvfat: Fix wrong checks for cluster mappings invariant
  vvfat: Fix reading files with non-continuous clusters
  iotests: Add `vvfat` tests

Andrey Drobyshev (1):
  iotests/024: exclude 'backing file format' field from the output

Kevin Wolf (6):
  block-copy: Fix missing graph lock
  block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked
  scsi-disk: Use positive return value for status in dma_readv/writev
  scsi-block: Don't skip callback for sgio error status/driver_status
  scsi-disk: Add warning comments that host_status errors take a shortcut
  scsi-disk: Always report RESERVATION_CONFLICT to guest

Markus Armbruster (1):
  qapi-block-core: Clean up blockdev-snapshot-internal-sync doc

 qapi/block-core.json   |   7 +-
 include/block/graph-lock.h |  21 +-
 block/block-copy.c |   4 +-
 block/vvfat.c  |  27 +-
 hw/scsi/scsi-disk.c|  73 ++--
 tests/qemu-iotests/fat16.py| 690 +
 tests/qemu-iotests/testenv.py  |   2 +-
 meson.build|  14 +-
 tests/qemu-iotests/024 |   2 +-
 tests/qemu-iotests/024.out |   1 -
 tests/qemu-iotests/check   |   2 +-
 tests/qemu-iotests/tests/vvfat | 485 ++
 tests/qemu-iotests/tests/vvfat.out |   5 +
 13 files changed, 1280 insertions(+), 53 deletions(-)
 create mode 100644 tests/qemu-iotests/fat16.py
 create mode 100755 tests/qemu-iotests/tests/vvfat
 create mode 100755 tests/qemu-iotests/tests/vvfat.out




[PULL v2 00/13] Block layer patches

2024-08-06 Thread Kevin Wolf
The following changes since commit e7207a9971dd41618b407030902b0b2256deb664:

  Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging (2024-08-06 08:02:34 +1000)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 91f16b6cdccd2942ae09d155cf3f79e6bbb485d5:

  iotests/024: exclude 'backing file format' field from the output (2024-08-06 09:54:24 +0200)


Block layer patches

- scsi-block: Fix error handling with r/werror=stop
- Depend on newer clang for TSA, make WITH_GRAPH_RDLOCK_GUARD() fully
  checked, fix block-copy to add missing lock
- vvfat: Fix write bugs for large files and add iotests
- Clean up blockdev-snapshot-internal-sync doc
- Fix iotests 024 for qed


Amjad Alsharafi (5):
  vvfat: Fix bug in writing to middle of file
  vvfat: Fix usage of `info.file.offset`
  vvfat: Fix wrong checks for cluster mappings invariant
  vvfat: Fix reading files with non-continuous clusters
  iotests: Add `vvfat` tests

Andrey Drobyshev (1):
  iotests/024: exclude 'backing file format' field from the output

Kevin Wolf (6):
  block-copy: Fix missing graph lock
  block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked
  scsi-disk: Use positive return value for status in dma_readv/writev
  scsi-block: Don't skip callback for sgio error status/driver_status
  scsi-disk: Add warning comments that host_status errors take a shortcut
  scsi-disk: Always report RESERVATION_CONFLICT to guest

Markus Armbruster (1):
  qapi-block-core: Clean up blockdev-snapshot-internal-sync doc

 qapi/block-core.json   |   7 +-
 include/block/graph-lock.h |  21 +-
 block/block-copy.c |   4 +-
 block/vvfat.c  |  27 +-
 hw/scsi/scsi-disk.c|  73 ++--
 tests/qemu-iotests/fat16.py| 690 +
 tests/qemu-iotests/testenv.py  |   2 +-
 meson.build|  14 +-
 tests/qemu-iotests/024 |   2 +-
 tests/qemu-iotests/024.out |   1 -
 tests/qemu-iotests/check   |   2 +-
 tests/qemu-iotests/tests/vvfat | 485 ++
 tests/qemu-iotests/tests/vvfat.out |   5 +
 13 files changed, 1280 insertions(+), 53 deletions(-)
 create mode 100644 tests/qemu-iotests/fat16.py
 create mode 100755 tests/qemu-iotests/tests/vvfat
 create mode 100755 tests/qemu-iotests/tests/vvfat.out




[PULL 01/13] qapi-block-core: Clean up blockdev-snapshot-internal-sync doc

2024-08-05 Thread Kevin Wolf
From: Markus Armbruster 

BlockdevSnapshotInternal is the arguments type of command
blockdev-snapshot-internal-sync.  Its doc comment contains this note:

# .. note:: In a transaction, if @name is empty or any snapshot matching
#@name exists, the operation will fail.  Only some image formats
#support it; for example, qcow2, and rbd.

"In a transaction" is misleading, and "if @name is empty or any
snapshot matching @name exists, the operation will fail" is redundant
with the command's Errors documentation.  Drop.

The remainder is fine.  Move it to the command's doc comment, where it
is more prominently visible, with a slight rephrasing for clarity.

Signed-off-by: Markus Armbruster 
Message-ID: <20240718123609.3063055-1-arm...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 qapi/block-core.json | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index f400b334c8..994e384a71 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -6046,10 +6046,6 @@
 #
 # @name: the name of the internal snapshot to be created
 #
-# .. note:: In a transaction, if @name is empty or any snapshot matching
-#@name exists, the operation will fail.  Only some image formats
-#support it; for example, qcow2, and rbd.
-#
 # Since: 1.7
 ##
 { 'struct': 'BlockdevSnapshotInternal',
@@ -6070,6 +6066,9 @@
 # - If the format of the image used does not support it,
 #   GenericError
 #
+# .. note:: Only some image formats such as qcow2 and rbd support
+#internal snapshots.
+#
 # Since: 1.7
 #
 # .. qmp-example::
-- 
2.45.2




[PULL 12/13] iotests: Add `vvfat` tests

2024-08-05 Thread Kevin Wolf
From: Amjad Alsharafi 

Added several tests to verify the implementation of the vvfat driver.

We needed a way to interact with it, so we created a basic `fat16.py`
driver that handles writing correct sectors for us.

Added `vvfat` to the non-generic formats, as it's not a normal image format.

Signed-off-by: Amjad Alsharafi 
Reviewed-by: Kevin Wolf 
Tested-by: Kevin Wolf 
Message-ID: 

[kwolf: Made mypy and pylint happy to unbreak 297]
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/fat16.py| 690 +
 tests/qemu-iotests/testenv.py  |   2 +-
 tests/qemu-iotests/check   |   2 +-
 tests/qemu-iotests/tests/vvfat | 485 
 tests/qemu-iotests/tests/vvfat.out |   5 +
 5 files changed, 1182 insertions(+), 2 deletions(-)
 create mode 100644 tests/qemu-iotests/fat16.py
 create mode 100755 tests/qemu-iotests/tests/vvfat
 create mode 100755 tests/qemu-iotests/tests/vvfat.out

diff --git a/tests/qemu-iotests/fat16.py b/tests/qemu-iotests/fat16.py
new file mode 100644
index 00..2c3c1cbb3d
--- /dev/null
+++ b/tests/qemu-iotests/fat16.py
@@ -0,0 +1,690 @@
+# A simple FAT16 driver that is used to test the `vvfat` driver in QEMU.
+#
+# Copyright (C) 2024 Amjad Alsharafi 
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+from typing import Callable, List, Optional, Protocol, Set
+import string
+
+SECTOR_SIZE = 512
+DIRENTRY_SIZE = 32
+ALLOWED_FILE_CHARS = set(
+"!#$%&'()-@^_`{}~" + string.digits + string.ascii_uppercase
+)
+
+
+class MBR:
+def __init__(self, data: bytes):
+assert len(data) == 512
+self.partition_table = []
+for i in range(4):
+partition = data[446 + i * 16 : 446 + (i + 1) * 16]
+self.partition_table.append(
+{
+"status": partition[0],
+"start_head": partition[1],
+"start_sector": partition[2] & 0x3F,
+"start_cylinder": ((partition[2] & 0xC0) << 2)
+  | partition[3],
+"type": partition[4],
+"end_head": partition[5],
+"end_sector": partition[6] & 0x3F,
+"end_cylinder": ((partition[6] & 0xC0) << 2)
+| partition[7],
+"start_lba": int.from_bytes(partition[8:12], "little"),
+"size": int.from_bytes(partition[12:16], "little"),
+}
+)
+
+def __str__(self):
+return "\n".join(
+[
+f"{i}: {partition}"
+for i, partition in enumerate(self.partition_table)
+]
+)
+
+
+class FatBootSector:
+# pylint: disable=too-many-instance-attributes
+def __init__(self, data: bytes):
+assert len(data) == 512
+self.bytes_per_sector = int.from_bytes(data[11:13], "little")
+self.sectors_per_cluster = data[13]
+self.reserved_sectors = int.from_bytes(data[14:16], "little")
+self.fat_count = data[16]
+self.root_entries = int.from_bytes(data[17:19], "little")
+total_sectors_16 = int.from_bytes(data[19:21], "little")
+self.media_descriptor = data[21]
+self.sectors_per_fat = int.from_bytes(data[22:24], "little")
+self.sectors_per_track = int.from_bytes(data[24:26], "little")
+self.heads = int.from_bytes(data[26:28], "little")
+self.hidden_sectors = int.from_bytes(data[28:32], "little")
+total_sectors_32 = int.from_bytes(data[32:36], "little")
+assert (
+total_sectors_16 == 0 or total_sectors_32 == 0
+), "Both total sectors (16 and 32) fields are non-zero"
+self.total_sectors = total_sectors_16 or total_sectors_32
+self.drive_number = data[36]
+self.volume_id = int.from_bytes(data[39:43], "little")
+self.volume_label = data[43:54].decode("ascii").strip()
+self.fs_type = data[54:62].decode("ascii").strip()
+
+def root_dir_start(self):
+   

[PULL 06/13] scsi-disk: Add warning comments that host_status errors take a shortcut

2024-08-05 Thread Kevin Wolf
scsi_block_sgio_complete() has surprising behaviour in that there are
error cases in which it directly completes the request and never calls
the passed callback. In the current state of the code, this doesn't seem
to result in bugs, but with future code changes, we must be careful to
never rely on the callback doing some cleanup until this code smell is
fixed. For now, just add warnings to make people aware of the trap.

Signed-off-by: Kevin Wolf 
Acked-by: Paolo Bonzini 
Message-ID: <20240731123207.27636-4-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6e1a5c98df..69a195177e 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -68,6 +68,9 @@ struct SCSIDiskClass {
 /*
  * Callbacks receive ret == 0 for success. Errors are represented either as
  * negative errno values, or as positive SAM status codes.
+ *
+ * Beware: For errors returned in host_status, the function may directly
+ * complete the request and never call the callback.
  */
 DMAIOFunc   *dma_readv;
 DMAIOFunc   *dma_writev;
@@ -381,6 +384,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_dma_complete(void *opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -421,6 +425,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_read_complete(void *opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -560,6 +565,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_write_complete(void * opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -2821,6 +2827,7 @@ static void scsi_block_sgio_complete(void *opaque, int ret)
 sg_io_hdr_t *io_hdr = &req->io_header;
 
 if (ret == 0) {
+/* FIXME This skips calling req->cb() and any cleanup in it */
 if (io_hdr->host_status != SCSI_HOST_OK) {
 scsi_req_complete_failed(&r->req, io_hdr->host_status);
 scsi_req_unref(&r->req);
-- 
2.45.2




[PULL 08/13] vvfat: Fix bug in writing to middle of file

2024-08-05 Thread Kevin Wolf
From: Amjad Alsharafi 

Before this commit, when calling `commit_one_file` with, for example,
`offset=0x2000` (the second cluster), we would not fetch the next
cluster from the FAT, and would instead use the first cluster for the
read operation.

This is due to an off-by-one error, where `i=0x2000 !< offset=0x2000`,
so the next cluster is never fetched.
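
A quick way to see the off-by-one (a small Python model of the loop
bounds, using the 0x2000 cluster size from above):

    cluster_size = 0x2000
    # Number of modified_fat_get() steps taken for a write at offset 0x2000:
    len(range(cluster_size, 0x2000, cluster_size))  # old loop: 0 steps, stays on cluster 0
    len(range(0, 0x2000, cluster_size))             # fixed loop: 1 step, reaches cluster 1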

Signed-off-by: Amjad Alsharafi 
Reviewed-by: Kevin Wolf 
Tested-by: Kevin Wolf 
Message-ID: 

Signed-off-by: Kevin Wolf 
---
 block/vvfat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 086fedf474..a67cfa823b 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -2525,8 +2525,9 @@ commit_one_file(BDRVVVFATState* s, int dir_index, uint32_t offset)
 return -1;
 }
 
-for (i = s->cluster_size; i < offset; i += s->cluster_size)
+for (i = 0; i < offset; i += s->cluster_size) {
 c = modified_fat_get(s, c);
+}
 
 fd = qemu_open_old(mapping->path, O_RDWR | O_CREAT | O_BINARY, 0666);
 if (fd < 0) {
-- 
2.45.2




[PULL 10/13] vvfat: Fix wrong checks for cluster mappings invariant

2024-08-05 Thread Kevin Wolf
From: Amjad Alsharafi 

What this `abort` was intended to check:
- whether `mapping->first_mapping_index` differs from
  `first_mapping_index`, which **should** happen only in one case,
  namely when we are handling the first mapping; in that case
  `mapping->first_mapping_index == -1`, while in all other cases the
  mappings after the first satisfy the condition.
- From the above, we know that this is the first mapping, so if the
  offset is not `0`, then abort, since this is an invalid state.

The issue was that `first_mapping_index` is not set if we are checking
from the middle: the variable `first_mapping_index` is only set if we
passed through the `cluster_was_modified` check with the first mapping,
and in the same function call then checked the other mappings.

One approach is to enter the loop even if `cluster_was_modified` is not
true, so that we can set `first_mapping_index` for the first mapping.
But since `first_mapping_index` is only used here, another approach is
to just check manually for `mapping->first_mapping_index != -1`, since
we know that this is the value for the only entry where `offset == 0`
(i.e. the first mapping).
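
As a sketch of the invariant the new assert encodes (made-up mapping
data, offsets counted in clusters):

    # Hypothetical chain of mappings for one file split into three pieces:
    mappings = [
        {"first_mapping_index": -1, "offset": 0},  # the first mapping itself
        {"first_mapping_index": 0,  "offset": 2},  # later pieces of the file
        {"first_mapping_index": 0,  "offset": 5},
    ]
    for m in mappings:
        # only the first mapping may sit at offset 0
        assert m["first_mapping_index"] == -1 or m["offset"] > 0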

Signed-off-by: Amjad Alsharafi 
Reviewed-by: Kevin Wolf 
Message-ID: 

Signed-off-by: Kevin Wolf 
---
 block/vvfat.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index cfde468c2e..e3a83fbc88 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1880,7 +1880,6 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const ch
 
 uint32_t cluster_num = begin_of_direntry(direntry);
 uint32_t offset = 0;
-int first_mapping_index = -1;
 mapping_t* mapping = NULL;
 const char* basename2 = NULL;
 
@@ -1942,14 +1941,9 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const ch
 
 if (strcmp(basename, basename2))
 copy_it = 1;
-first_mapping_index = array_index(&(s->mapping), mapping);
-}
-
-if (mapping->first_mapping_index != first_mapping_index
-&& mapping->info.file.offset > 0) {
-abort();
-copy_it = 1;
 }
+assert(mapping->first_mapping_index == -1
+|| mapping->info.file.offset > 0);
 
 /* need to write out? */
 if (!was_modified && is_file(direntry)) {
-- 
2.45.2




[PULL 11/13] vvfat: Fix reading files with non-continuous clusters

2024-08-05 Thread Kevin Wolf
From: Amjad Alsharafi 

When reading with `read_cluster` we get the `mapping` with
`find_mapping_for_cluster` and then we call `open_file` for this
mapping.
The issue appears when it's the same file, but a second cluster that is
not immediately after the first: imagine clusters `500 -> 503`. This
gives us 2 mappings, one with the range `500..501` and another with
`503..504`, both pointing to the same file, but at different offsets.

When we don't open the file, since the path is the same, we won't
assign `s->current_mapping`, and thus access way out of bounds of the
file.

From our example above, after `open_file` (which didn't open anything)
we will compute the offset into the file as
`s->cluster_size*(cluster_num-s->current_mapping->begin)`, which gives
us `0x2000 * (504-500)`; this is out of bounds for this mapping and
will produce some issues.
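
Numerically, with the cluster numbers from the example (a small sketch,
not vvfat code):

    cluster_size = 0x2000
    cluster_num = 504
    stale_begin = 500  # s->current_mapping not updated by open_file()
    fresh_begin = 503  # the mapping actually found for this cluster
    cluster_size * (cluster_num - stale_begin)  # 0x8000, past the mapping's data
    cluster_size * (cluster_num - fresh_begin)  # 0x2000, as intended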

Signed-off-by: Amjad Alsharafi 
Message-ID: <1f3ea115779abab62ba32c788073cdc99f9ad5dd.1721470238.git.amjadsharaf...@gmail.com>
[kwolf: Simplified the patch based on Amjad's analysis and input]
Signed-off-by: Kevin Wolf 
---
 block/vvfat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index e3a83fbc88..8ffe8b3b9b 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1369,8 +1369,9 @@ static int open_file(BDRVVVFATState* s,mapping_t* mapping)
 return -1;
 vvfat_close_current_file(s);
 s->current_fd = fd;
-s->current_mapping = mapping;
 }
+
+s->current_mapping = mapping;
 return 0;
 }
 
-- 
2.45.2




[PULL 13/13] iotests/024: exclude 'backing file format' field from the output

2024-08-05 Thread Kevin Wolf
From: Andrey Drobyshev 

Apparently 'qemu-img info' doesn't report the backing file format field
for qed (as it does for qcow2):

$ qemu-img create -f qed base.qed 1M && qemu-img create -f qed -b base.qed -F qed top.qed 1M
$ qemu-img create -f qcow2 base.qcow2 1M && qemu-img create -f qcow2 -b base.qcow2 -F qcow2 top.qcow2 1M
$ qemu-img info top.qed | grep 'backing file format'
$ qemu-img info top.qcow2 | grep 'backing file format'
backing file format: qcow2

This leads to the 024 test failure with -qed.  Let's just filter the
field out and exclude it from the output.

This is a fixup for the commit f93e65ee51 ("iotests/{024, 271}: add
testcases for qemu-img rebase").

Reported-by: Thomas Huth 
Signed-off-by: Andrey Drobyshev 
Message-ID: <20240730094701.790624-1-andrey.drobys...@virtuozzo.com>
Reviewed-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/024 | 2 +-
 tests/qemu-iotests/024.out | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 285f17e79f..b29c76e161 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -283,7 +283,7 @@ TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \
 CLUSTER_SIZE=$(( CLUSTER_SIZE * 2 )) TEST_IMG=$OVERLAY \
 _make_test_img -b "$BASE_OLD" -F $IMGFMT $(( CLUSTER_SIZE * 6 ))
 
-TEST_IMG=$OVERLAY _img_info
+TEST_IMG=$OVERLAY _img_info | grep -v '^backing file format:'
 
 echo
 echo "Fill backing files with data"
diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out
index e1e8eea863..3d1e31927a 100644
--- a/tests/qemu-iotests/024.out
+++ b/tests/qemu-iotests/024.out
@@ -214,7 +214,6 @@ file format: IMGFMT
 virtual size: 384 KiB (393216 bytes)
 cluster_size: 131072
 backing file: TEST_DIR/subdir/t.IMGFMT.base_old
-backing file format: IMGFMT
 
 Fill backing files with data
 
-- 
2.45.2




[PULL 07/13] scsi-disk: Always report RESERVATION_CONFLICT to guest

2024-08-05 Thread Kevin Wolf
In the case of scsi-block, RESERVATION_CONFLICT is not a backend error,
but indicates that the guest tried to make a request that it isn't
allowed to execute. Pass the error to the guest so that it can decide
what to do with it.

Without this, if we stop the VM in response to a RESERVATION_CONFLICT
(as is the default policy in management software such as oVirt or
KubeVirt), it can happen that the VM cannot be resumed any more because
every attempt to resume it immediately runs into the same error and
stops the VM again.

One case that expects RESERVATION_CONFLICT errors to be visible in the
guest is running the validation tests in Windows 2019's Failover Cluster
Manager, which intentionally tries to execute invalid requests to see if
they are properly rejected.

Buglink: https://issues.redhat.com/browse/RHEL-5
Signed-off-by: Kevin Wolf 
Acked-by: Paolo Bonzini 
Message-ID: <20240731123207.27636-5-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 35 ++-
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 69a195177e..4d94b2b816 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -224,7 +224,7 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, bool acct_failed)
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 SCSIDiskClass *sdc = (SCSIDiskClass *) object_get_class(OBJECT(s));
 SCSISense sense = SENSE_CODE(NO_SENSE);
-int error = 0;
+int error;
 bool req_has_sense = false;
 BlockErrorAction action;
 int status;
@@ -235,11 +235,35 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, bool acct_failed)
 } else {
 /* A passthrough command has completed with nonzero status.  */
 status = ret;
-if (status == CHECK_CONDITION) {
+switch (status) {
+case CHECK_CONDITION:
 req_has_sense = true;
+error = scsi_sense_buf_to_errno(r->req.sense, sizeof(r->req.sense));
-} else {
+break;
+case RESERVATION_CONFLICT:
+/*
+ * Don't apply the error policy, always report to the guest.
+ *
+ * This is a passthrough code path, so it's not a backend error, but
+ * a response to an invalid guest request.
+ *
+ * Windows Failover Cluster validation intentionally sends invalid
+ * requests to verify that reservations work as intended. It is
+ * crucial that it sees the resulting errors.
+ *
+ * Treating a reservation conflict as a guest-side error is obvious
+ * when a pr-manager is in use. Without one, the situation is less
+ * clear, but there might be nothing that can be fixed on the host
+ * (like in the above example), and we don't want to be stuck in a
+ * loop where resuming the VM and retrying the request immediately
+ * stops it again. So always reporting is still the safer option in
+ * this case, too.
+ */
+error = 0;
+break;
+default:
 error = EINVAL;
+break;
 }
 }
 
@@ -249,8 +273,9 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, bool acct_failed)
  * are usually retried immediately, so do not post them to QMP and
  * do not account them as failed I/O.
  */
-if (req_has_sense &&
-scsi_sense_buf_is_guest_recoverable(r->req.sense, sizeof(r->req.sense))) {
+if (!error || (req_has_sense &&
+   scsi_sense_buf_is_guest_recoverable(r->req.sense,
+   sizeof(r->req.sense)))) {
 action = BLOCK_ERROR_ACTION_REPORT;
 acct_failed = false;
 } else {
-- 
2.45.2




[PULL 09/13] vvfat: Fix usage of `info.file.offset`

2024-08-05 Thread Kevin Wolf
From: Amjad Alsharafi 

The field is marked as "the offset in the file (in clusters)", but it
was being used like `cluster_size * nums + mapping->info.file.offset`,
which mixes a byte count with a cluster count and is incorrect.
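
For example (hypothetical numbers; a sketch of the unit mismatch, with
info.file.offset counted in clusters):

    cluster_size = 0x2000
    file_offset = 2     # info.file.offset, in clusters
    delta = 3           # cluster_num - mapping->begin
    cluster_size * delta + file_offset    # old: 0x6002, bytes plus clusters
    cluster_size * (delta + file_offset)  # fixed: 0xa000, all in bytes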

Signed-off-by: Amjad Alsharafi 
Reviewed-by: Kevin Wolf 
Message-ID: <72f19a7903886dda1aa78bcae0e17702ee939262.1721470238.git.amjadsharaf...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/vvfat.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index a67cfa823b..cfde468c2e 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1408,7 +1408,9 @@ read_cluster_directory:
 
 assert(s->current_fd);
 
-offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
+offset = s->cluster_size *
+((cluster_num - s->current_mapping->begin)
++ s->current_mapping->info.file.offset);
 if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
 return -3;
 s->cluster=s->cluster_buffer;
@@ -1929,8 +1931,9 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const ch
 (mapping->mode & MODE_DIRECTORY) == 0) {
 
 /* was modified in qcow */
-if (offset != mapping->info.file.offset + s->cluster_size
-* (cluster_num - mapping->begin)) {
+if (offset != s->cluster_size
+* ((cluster_num - mapping->begin)
++ mapping->info.file.offset)) {
 /* offset of this cluster in file chain has changed */
 abort();
 copy_it = 1;
@@ -2404,7 +2407,7 @@ static int commit_mappings(BDRVVVFATState* s,
 (mapping->end - mapping->begin);
 } else
 next_mapping->info.file.offset = mapping->info.file.offset +
-mapping->end - mapping->begin;
+(mapping->end - mapping->begin);
 
 mapping = next_mapping;
 }
-- 
2.45.2




[PULL 03/13] block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

2024-08-05 Thread Kevin Wolf
Upstream clang 18 (and backports to clang 17 in Fedora and RHEL)
implemented support for __attribute__((cleanup())) in its Thread Safety
Analysis, so we can now actually have a proper implementation of
WITH_GRAPH_RDLOCK_GUARD() that understands when we acquire and when we
release the lock.

-Wthread-safety is now only enabled if the compiler is new enough to
understand this pattern. In theory, we could have used some #ifdefs to
keep the existing basic checks on old compilers, but as long as someone
runs a newer compiler (and our CI does), we will catch locking problems,
so it's probably not worth keeping multiple implementations for this.

The implementation can't use g_autoptr any more because the glib macros
define wrapper functions that don't have the right TSA attributes, so
the compiler would complain about them. Just use the cleanup attribute
directly instead.

Signed-off-by: Kevin Wolf 
Message-ID: <20240627181245.281403-3-kw...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Manos Pitsidianakis 
Signed-off-by: Kevin Wolf 
---
 include/block/graph-lock.h | 21 ++---
 meson.build| 14 +-
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index d7545e82d0..dc8d949184 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -209,31 +209,38 @@ typedef struct GraphLockable { } GraphLockable;
  * unlocked. TSA_ASSERT_SHARED() makes sure that the following calls know that
  * we hold the lock while unlocking is left unchecked.
  */
-static inline GraphLockable * TSA_ASSERT_SHARED(graph_lock) TSA_NO_TSA coroutine_fn
+static inline GraphLockable * TSA_ACQUIRE_SHARED(graph_lock) coroutine_fn
 graph_lockable_auto_lock(GraphLockable *x)
 {
 bdrv_graph_co_rdlock();
 return x;
 }
 
-static inline void TSA_NO_TSA coroutine_fn
-graph_lockable_auto_unlock(GraphLockable *x)
+static inline void TSA_RELEASE_SHARED(graph_lock) coroutine_fn
+graph_lockable_auto_unlock(GraphLockable **x)
 {
 bdrv_graph_co_rdunlock();
 }
 
-G_DEFINE_AUTOPTR_CLEANUP_FUNC(GraphLockable, graph_lockable_auto_unlock)
+#define GRAPH_AUTO_UNLOCK __attribute__((cleanup(graph_lockable_auto_unlock)))
 
+/*
+ * @var is only used to break the loop after the first iteration.
+ * @unlock_var can't be unlocked and then set to NULL because TSA wants the lock
+ * to be held at the start of every iteration of the loop.
+ */
 #define WITH_GRAPH_RDLOCK_GUARD_(var) \
-for (g_autoptr(GraphLockable) var = graph_lockable_auto_lock(GML_OBJ_()); \
+for (GraphLockable *unlock_var GRAPH_AUTO_UNLOCK =\
+graph_lockable_auto_lock(GML_OBJ_()), \
+*var = unlock_var;\
  var; \
- graph_lockable_auto_unlock(var), var = NULL)
+ var = NULL)
 
 #define WITH_GRAPH_RDLOCK_GUARD() \
 WITH_GRAPH_RDLOCK_GUARD_(glue(graph_lockable_auto, __COUNTER__))
 
 #define GRAPH_RDLOCK_GUARD(x)   \
-g_autoptr(GraphLockable)\
+GraphLockable * GRAPH_AUTO_UNLOCK   \
 glue(graph_lockable_auto, __COUNTER__) G_GNUC_UNUSED =  \
 graph_lockable_auto_lock(GML_OBJ_())
 
diff --git a/meson.build b/meson.build
index 97f63aa86c..c2a050b844 100644
--- a/meson.build
+++ b/meson.build
@@ -649,7 +649,19 @@ warn_flags = [
 ]
 
 if host_os != 'darwin'
-  warn_flags += ['-Wthread-safety']
+  tsa_has_cleanup = cc.compiles('''
+struct __attribute__((capability("mutex"))) mutex {};
+void lock(struct mutex *m) __attribute__((acquire_capability(m)));
+void unlock(struct mutex *m) __attribute__((release_capability(m)));
+
+void test(void) {
+  struct mutex __attribute__((cleanup(unlock))) m;
+  lock(&m);
+}
+  ''', args: ['-Wthread-safety', '-Werror'])
+  if tsa_has_cleanup
+warn_flags += ['-Wthread-safety']
+  endif
 endif
 
 # Set up C++ compiler flags
-- 
2.45.2




[PULL 04/13] scsi-disk: Use positive return value for status in dma_readv/writev

2024-08-05 Thread Kevin Wolf
In some error cases, scsi_block_sgio_complete() never calls the passed
callback, but directly completes the request. This leads to bugs because
its error paths are not exact copies of what the callback would normally
do.

In preparation to fix this, allow passing positive return values to the
callbacks that represent the status code that should be used to complete
the request.

scsi_handle_rw_error() already handles positive values for its ret
parameter because scsi_block_sgio_complete() calls directly into it.

Signed-off-by: Kevin Wolf 
Acked-by: Paolo Bonzini 
Message-ID: <20240731123207.27636-2-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index a67092db6a..3ff6798bde 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -65,6 +65,10 @@ OBJECT_DECLARE_TYPE(SCSIDiskState, SCSIDiskClass, SCSI_DISK_BASE)
 
 struct SCSIDiskClass {
 SCSIDeviceClass parent_class;
+/*
+ * Callbacks receive ret == 0 for success. Errors are represented either as
+ * negative errno values, or as positive SAM status codes.
+ */
 DMAIOFunc   *dma_readv;
 DMAIOFunc   *dma_writev;
 bool(*need_fua_emulation)(SCSICommand *cmd);
@@ -283,7 +287,7 @@ static bool scsi_disk_req_check_error(SCSIDiskReq *r, int ret, bool acct_failed)
 return true;
 }
 
-if (ret < 0) {
+if (ret != 0) {
 return scsi_handle_rw_error(r, ret, acct_failed);
 }
 
@@ -360,7 +364,7 @@ static void scsi_write_do_fua(SCSIDiskReq *r)
 static void scsi_dma_complete_noio(SCSIDiskReq *r, int ret)
 {
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -385,9 +389,10 @@ static void scsi_dma_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
 scsi_dma_complete_noio(r, ret);
@@ -403,7 +408,7 @@ static void scsi_read_complete_noio(SCSIDiskReq *r, int ret)
qemu_get_current_aio_context());
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -424,9 +429,10 @@ static void scsi_read_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 trace_scsi_disk_read_complete(r->req.tag, r->qiov.size);
 }
@@ -534,7 +540,7 @@ static void scsi_write_complete_noio(SCSIDiskReq *r, int ret)
qemu_get_current_aio_context());
 
 assert (r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -562,9 +568,10 @@ static void scsi_write_complete(void * opaque, int ret)
 assert (r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
 scsi_write_complete_noio(r, ret);
-- 
2.45.2




[PULL 05/13] scsi-block: Don't skip callback for sgio error status/driver_status

2024-08-05 Thread Kevin Wolf
Instead of calling into scsi_handle_rw_error() directly from
scsi_block_sgio_complete() and skipping the normal callback, go through
the normal cleanup path by calling the callback with a positive error
value.

The important difference here is not only that the code path is cleaner,
but that the callbacks set r->req.aiocb = NULL. If we skip setting this
and the error action is BLOCK_ERROR_ACTION_STOP, resuming the VM runs
into an assertion failure in scsi_read_data() or scsi_write_data()
because the dangling aiocb pointer is unexpected.

Fixes: a108557bbf ("scsi: inline sg_io_sense_from_errno() into the callers.")
Buglink: https://issues.redhat.com/browse/RHEL-5
Signed-off-by: Kevin Wolf 
Acked-by: Paolo Bonzini 
Message-ID: <20240731123207.27636-3-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 3ff6798bde..6e1a5c98df 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2832,16 +2832,6 @@ static void scsi_block_sgio_complete(void *opaque, int ret)
 } else {
 ret = io_hdr->status;
 }
-
-if (ret > 0) {
-if (scsi_handle_rw_error(r, ret, true)) {
-scsi_req_unref(&r->req);
-return;
-}
-
-/* Ignore error.  */
-ret = 0;
-}
 }
 
 req->cb(req->cb_opaque, ret);
-- 
2.45.2




[PULL 02/13] block-copy: Fix missing graph lock

2024-08-05 Thread Kevin Wolf
The graph lock needs to be held when calling bdrv_co_pdiscard(). Fix
block_copy_task_entry() to take it for the call.

WITH_GRAPH_RDLOCK_GUARD() was implemented in a weak way because of
limitations in clang's Thread Safety Analysis at the time, so that it
only asserts that the lock is held (which allows calling functions that
require the lock), but we never deal with the unlocking (so even after
the scope of the guard, the compiler assumes that the lock is still
held). This is why the compiler didn't catch this locking error.

Signed-off-by: Kevin Wolf 
Message-ID: <20240627181245.281403-2-kw...@redhat.com>
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-copy.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 7e3b378528..cc618e4561 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -595,7 +595,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
 if (s->discard_source && ret == 0) {
 int64_t nbytes =
 MIN(t->req.offset + t->req.bytes, s->len) - t->req.offset;
-bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+WITH_GRAPH_RDLOCK_GUARD() {
+bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+}
 }
 
 return ret;
-- 
2.45.2




[PULL 00/13] Block layer patches

2024-08-05 Thread Kevin Wolf
The following changes since commit f9851d2ffef59b3a7f39513469263ab3b019480f:

  Merge tag 'migration-20240802-pull-request' of https://gitlab.com/farosas/qemu into staging (2024-08-03 07:26:26 +1000)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 833362e55d9acc6b7ccc1d80d8cd84688e7f5761:

  iotests/024: exclude 'backing file format' field from the output (2024-08-05 23:00:42 +0200)


Block layer patches

- scsi-block: Fix error handling with r/werror=stop
- Depend on newer clang for TSA, make WITH_GRAPH_RDLOCK_GUARD() fully
  checked, fix block-copy to add missing lock
- vvfat: Fix write bugs for large files and add iotests
- Clean up blockdev-snapshot-internal-sync doc
- Fix iotests 024 for qed


Amjad Alsharafi (5):
  vvfat: Fix bug in writing to middle of file
  vvfat: Fix usage of `info.file.offset`
  vvfat: Fix wrong checks for cluster mappings invariant
  vvfat: Fix reading files with non-continuous clusters
  iotests: Add `vvfat` tests

Andrey Drobyshev (1):
  iotests/024: exclude 'backing file format' field from the output

Kevin Wolf (6):
  block-copy: Fix missing graph lock
  block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked
  scsi-disk: Use positive return value for status in dma_readv/writev
  scsi-block: Don't skip callback for sgio error status/driver_status
  scsi-disk: Add warning comments that host_status errors take a shortcut
  scsi-disk: Always report RESERVATION_CONFLICT to guest

Markus Armbruster (1):
  qapi-block-core: Clean up blockdev-snapshot-internal-sync doc

 qapi/block-core.json   |   7 +-
 include/block/graph-lock.h |  21 +-
 block/block-copy.c |   4 +-
 block/vvfat.c  |  27 +-
 hw/scsi/scsi-disk.c|  73 ++--
 tests/qemu-iotests/fat16.py| 690 +
 tests/qemu-iotests/testenv.py  |   2 +-
 meson.build|  14 +-
 tests/qemu-iotests/024 |   2 +-
 tests/qemu-iotests/024.out |   1 -
 tests/qemu-iotests/check   |   2 +-
 tests/qemu-iotests/tests/vvfat | 485 ++
 tests/qemu-iotests/tests/vvfat.out |   5 +
 13 files changed, 1280 insertions(+), 53 deletions(-)
 create mode 100644 tests/qemu-iotests/fat16.py
 create mode 100755 tests/qemu-iotests/tests/vvfat
 create mode 100755 tests/qemu-iotests/tests/vvfat.out




Re: [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter

2024-08-05 Thread Kevin Wolf
Am 05.08.2024 um 14:56 hat Andrey Drobyshev geschrieben:
> On 8/5/24 3:04 PM, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> >> The testcase simply creates a 64G image with 1M clusters, generates a list
> >> of 1M aligned offsets and feeds aio_write commands with those offsets to
> >> qemu-io run with '--aio native --nocache'.  Then we check the data
> >> written at each of the offsets.  Before the previous commit this could
> >> result into a race within the preallocation filter which would zeroize
> >> some clusters after actually writing data to them.
> >>
> >> Note: the test doesn't fail in 100% cases as there's a race involved,
> >> but the failures are pretty consistent so it should be good enough for
> >> detecting the problem.
> >>
> >> Signed-off-by: Andrey Drobyshev 
> > 
> > I left it running in a loop for a while, but couldn't reproduce the bug
> > with this test.
> > 
> 
> Hmmm, it seems to have stopped failing on my setup as well, no matter
> how I increase the number of requests.  And it seems to be related to
> the interleaving 'write-zeroes' requests.  My initial attempt was to
> cover the case described by Vladimir here:
> https://lists.nongnu.org/archive/html/qemu-block/2024-07/msg00415.html
> Maybe we just leave it and try reproducing the corruption with just
> regular write requests?  At least with this version it seems to be
> failing pretty stably on my setup:
> 
> > +def test_prealloc_async_writes(self):
> > +requests = 2048 # Number of write/read requests to feed to qemu-io
> > +total_clusters = 64 * 1024 # 64G / 1M
> > +
> > +offsets = random.sample(range(0, total_clusters), requests)
> > +aio_write_cmds = [f'aio_write -P 0xaa  {off}M 1M' for off in offsets]
> > +read_cmds = [f'read -P 0xaa {off}M 1M' for off in offsets]
> > +
> > +proc = iotests.QemuIoInteractive('--aio', 'native', '--nocache',
> > + '--image-opts', drive_opts)
> > +for cmd in aio_write_cmds:
> > +proc.cmd(cmd)
> > +proc.close()
> > +
> > +proc = iotests.QemuIoInteractive('-f', iotests.imgfmt, disk)
> > +for cmd in read_cmds:
> > +out = proc.cmd(cmd)
> > +self.assertFalse('Pattern verification failed' in str(out))
> > +proc.close()
> > +

This doesn't seem to fail for me either. Neither on tmpfs nor in my home
directory (which is XFS on an encrypted LVM volume).

Are you using some more complicated setup than "./check -qcow2 298"?

Do you think we could use blkdebug to deterministically trigger the case
instead of trying to brute force it? If we can suspend the write_zeroes
request on the child node of preallocate, I think that's all we need to
reproduce the bug as described in the commit message of the fix.
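
Something along these lines, maybe (an untested sketch; the blkdebug
event name and the exact option nesting are from memory and need
checking against the code):

    # Suspend the preallocating zero-write on preallocate's child with a
    # blkdebug breakpoint, race a normal write against it, then resume.
    opts = ('driver=' + iotests.imgfmt + ',file.driver=preallocate,'
            'file.file.driver=blkdebug,'
            'file.file.image.filename=' + disk)
    proc = iotests.QemuIoInteractive('--image-opts', opts)
    proc.cmd('break pwritev_zero bp0')   # event name needs verifying
    proc.cmd('aio_write -P 0xaa 0 1M')   # triggers preallocation, suspends
    proc.cmd('wait_break bp0')
    proc.cmd('aio_write -P 0xbb 1M 1M')  # racing write into the same area
    proc.cmd('resume bp0')
    proc.close()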

Kevin




Re: [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping

2024-08-05 Thread Kevin Wolf
Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> The script is basically a wrapper around "filefrag" utility.  This might
> be used to map virtual offsets within the file to the underlying block
> device offsets.  In addition, a chunk size might be specified, in which
> case a list of such mappings will be obtained:
> 
> $ scripts/filev2p.py -s 100M /sparsefile 1768M
> 1853882368..1895825407 (file)  ->  16332619776..16374562815 (/dev/sda4)  ->  84492156928..84534099967 (/dev/sda)
> 1895825408..1958739967 (file)  ->  17213591552..17276506111 (/dev/sda4)  ->  85373128704..85436043263 (/dev/sda)
> 
> This could come in handy when we need to map a certain piece of data
> within a file inside VM to the same data within the image on the host
> (e.g. physical offset on VM's /dev/sda would be the virtual offset
> within QCOW2 image).
> 
> Note: as of now the script only works with the files located on plain
> partitions, i.e. it doesn't work with partitions built on top of LVM.
> Partitions on LVM would require another level of mapping.
> 
> Signed-off-by: Andrey Drobyshev 
> ---
>  scripts/filev2p.py | 311 +
>  1 file changed, 311 insertions(+)
>  create mode 100755 scripts/filev2p.py
> 
> diff --git a/scripts/filev2p.py b/scripts/filev2p.py
> new file mode 100755
> index 00..3bd7d18b5e
> --- /dev/null
> +++ b/scripts/filev2p.py
> @@ -0,0 +1,311 @@
> +#!/usr/bin/env python3
> +#
> +# Map file virtual offset to the offset on the underlying block device.
> +# Works by parsing 'filefrag' output.
> +#
> +# Copyright (c) 2024 Virtuozzo International GmbH.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +#
> +
> +import argparse
> +import os
> +import subprocess
> +import re
> +import sys
> +
> +from bisect import bisect_right
> +from collections import namedtuple
> +from dataclasses import dataclass
> +from shutil import which
> +from stat import S_ISBLK
> +
> +
> +Partition = namedtuple('Partition', ['partpath', 'diskpath', 'part_offt'])
> +
> +
> +@dataclass
> +class Extent:
> +'''Class representing an individual file extent.
> +
> +This is basically a piece of data within the file which is located
> +consecutively (i.e. not sparsely) on the underlying block device.
> +'''

Python docstrings should always be triple double quotes """...""" as per
PEP 257.

Some functions below even use a single single quote because they are on
a single line. They should still use the same convention.

> +
> +log_start:  int
> +log_end:int
> +phys_start: int
> +phys_end:   int
> +length: int
> +partition:  Partition
> +
> +@property
> +def disk_start(self):
> +'Number of the first byte of this extent on the whole disk (/dev/sda)'
> +return self.partition.part_offt + self.phys_start
> +
> +@property
> +def disk_end(self):
> +'Number of the last byte of this extent on the whole disk (/dev/sda)'
> +return self.partition.part_offt + self.phys_end
> +
> +def __str__(self):
> +ischunk = self.log_end > self.log_start
> +maybe_end = lambda s: f'..{s}' if ischunk else ''
> +return '%s%s (file)  ->  %s%s (%s)  ->  %s%s (%s)' % (
> +self.log_start, maybe_end(self.log_end),
> +self.phys_start, maybe_end(self.phys_end), self.partition.partpath,
> +self.disk_start, maybe_end(self.disk_end), self.partition.diskpath
> +)
> +
> +@classmethod
> +def ext_slice(cls, bigger_ext, start, end):
> +'''Constructor for the Extent class from a bigger extent.
> +
> +Return Extent instance which is a slice of @bigger_ext contained
> +within the range [start, end].
> +'''
> +
> +assert start >= bigger_ext.log_start
> +assert end <= bigger_ext.log_end
> +
> +if start == bigger_ext.log_start and end == bigger_ext.log_end:
> +return bigger_ext
> +
> +phys_start = bigger_ext.phys_start + (start - bigger_ext.log_start)
> +phys_end = bigger_ext.phys_end - (bigger_ext.log_end - end)
> +length = end - start + 1
> +
> +return cls(start, end, phys_start, phys_end, length,
> +   bigger_ext.partition)
> +
> +
> +def run_cmd(cmd: str) -> str:
> +'''Wrapper around subprocess.run.
> +
> +Retur

Re: [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter

2024-08-05 Thread Kevin Wolf
Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> The testcase simply creates a 64G image with 1M clusters, generates a list
> of 1M aligned offsets and feeds aio_write commands with those offsets to
> qemu-io run with '--aio native --nocache'.  Then we check the data
> written at each of the offsets.  Before the previous commit this could
> result in a race within the preallocation filter, which would zeroize
> some clusters after actually writing data to them.
> 
> Note: the test doesn't fail in 100% cases as there's a race involved,
> but the failures are pretty consistent so it should be good enough for
> detecting the problem.
> 
> Signed-off-by: Andrey Drobyshev 

I left it running in a loop for a while, but couldn't reproduce the bug
with this test.

>  tests/qemu-iotests/298 | 49 ++
>  tests/qemu-iotests/298.out |  4 ++--
>  2 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
> index 09c9290711..b7126e9e15 100755
> --- a/tests/qemu-iotests/298
> +++ b/tests/qemu-iotests/298
> @@ -20,8 +20,10 @@
>  
>  import os
>  import iotests
> +import random
>  
>  MiB = 1024 * 1024
> +GiB = MiB * 1024
>  disk = os.path.join(iotests.test_dir, 'disk')
>  overlay = os.path.join(iotests.test_dir, 'overlay')
>  refdisk = os.path.join(iotests.test_dir, 'refdisk')
> @@ -176,5 +178,52 @@ class TestTruncate(iotests.QMPTestCase):
>  self.do_test('off', '150M')
>  
>  
> +class TestPreallocAsyncWrites(iotests.QMPTestCase):
> +def setUp(self):
> +# Make sure we get reproducible write patterns on each run
> +random.seed(42)
> +iotests.qemu_img_create('-f', iotests.imgfmt, disk, '-o',
> +f'cluster_size={MiB},lazy_refcounts=on',
> +str(64 * GiB))
> +
> +def tearDown(self):
> +os.remove(disk)
> +
> +def test_prealloc_async_writes(self):
> +def gen_write_pattern():
> +n = 0
> +while True:
> +yield '-P 0xaa' if n else '-z'
> +n = 1 - n

This looks like a complicated way to write the following?

    # Alternate between write_zeroes and writing data
    def gen_write_pattern():
        while True:
            yield '-z'
            yield '-P 0xaa'

> +def gen_read_pattern():
> +n = 0
> +while True:
> +yield '-P 0xaa' if n else '-P 0x00'
> +n = 1 - n

Same here.

Kevin




Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter

2024-08-05 Thread Kevin Wolf
Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
> On 7/18/24 17:51, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> > > From: "Denis V. Lunev" 
> > > 
> > > We have observed that some clusters in the QCOW2 files are zeroed
> > > while preallocation filter is used.
> > > 
> > > We are able to trace down the following sequence when prealloc-filter
> > > is used:
> > >  co=0x55e7cbed7680 qcow2_co_pwritev_task()
> > >  co=0x55e7cbed7680 preallocate_co_pwritev_part()
> > >  co=0x55e7cbed7680 handle_write()
> > >  co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> > >  co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> > >  co=0x7f9edb7fe500 do_fallocate()
> > > 
> > > Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> > > 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> > > time to handle next coroutine, which
> > >  co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> > >  co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> > >  co=0x55e7cbee91b0 handle_write()
> > >  co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> > >  co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> > >  co=0x7f9edb7deb00 do_fallocate()
> > > 
> > > The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> > > file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> > > the same area. This means that if (once fallocate is started inside
> > > 0x7f9edb7deb00) original fallocate could end and the real write will
> > > be executed. In that case write() request is handled at the same time
> > > as fallocate().
> > > 
> > > The patch moves s->file_lock assignment before fallocate and that is
> > s/file_lock/file_end/?
> > 
> > > crucial. The idea is that all subsequent requests into the area
> > > being preallocated will be issued as just writes without fallocate
> > > to this area and they will not proceed thanks to the overlapping
> > > requests mechanics. If preallocation fails, we will just switch
> > > to the normal expand-by-write behavior and that is not a problem
> > > except for performance.
> > > 
> > > Signed-off-by: Denis V. Lunev 
> > > Tested-by: Andrey Drobyshev 
> > > ---
> > >   block/preallocate.c | 8 +++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/preallocate.c b/block/preallocate.c
> > > index d215bc5d6d..ecf0aa4baa 100644
> > > --- a/block/preallocate.c
> > > +++ b/block/preallocate.c
> > > @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, 
> > > int64_t bytes,
> > >   want_merge_zero = want_merge_zero && (prealloc_start <= offset);
> > > +/*
> > > + * Assign file_end before making actual preallocation. This will 
> > > ensure
> > > + * that next request performed while preallocation is in progress 
> > > will
> > > + * be passed without preallocation.
> > > + */
> > > +s->file_end = prealloc_end;
> > > +
> > >   ret = bdrv_co_pwrite_zeroes(
> > >   bs->file, prealloc_start, prealloc_end - prealloc_start,
> > >   BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | 
> > > BDRV_REQ_NO_WAIT);
> > > @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, 
> > > int64_t bytes,
> > >   return false;
> > >   }
> > > -s->file_end = prealloc_end;
> > >   return want_merge_zero;
> > >   }
> > Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> > unchanged. In other words, if anything calls preallocate_co_getlength()
> > in the meantime, don't we run into...
> > 
> >  ret = bdrv_co_getlength(bs->file->bs);
> > 
> >  if (has_prealloc_perms(bs)) {
> >  s->file_end = s->zero_start = s->data_end = ret;
> >  }
> > 
> > ...and reset s->file_end back to the old value, re-enabling the bug
> > you're trying to fix here?
> > 
> > Or do we know that no such code path can be called concurrently for some
> > reason?
> > 
> > Kevin
> > 
> After more detailed thinking I tend to disagree.
> Normally we would not hit the problem. Though
> this was not obvious at the beginning :)
> 
> The point in han

Re: [PATCH] iotests/024: exclude 'backing file format' field from the output

2024-08-05 Thread Kevin Wolf
Am 30.07.2024 um 11:47 hat Andrey Drobyshev geschrieben:
> Apparently 'qemu-img info' doesn't report the backing file format field
> for qed (as it does for qcow2):
> 
> $ qemu-img create -f qed base.qed 1M && qemu-img create -f qed -b base.qed -F 
> qed top.qed 1M
> $ qemu-img create -f qcow2 base.qcow2 1M && qemu-img create -f qcow2 -b 
> base.qcow2 -F qcow2 top.qcow2 1M
> $ qemu-img info top.qed | grep 'backing file format'
> $ qemu-img info top.qcow2 | grep 'backing file format'
> backing file format: qcow2
> 
> This leads to the 024 test failure with -qed.  Let's just filter the
> field out and exclude it from the output.
> 
> This is a fixup for the commit f93e65ee51 ("iotests/{024, 271}: add
> testcases for qemu-img rebase").
> 
> Found-by: Thomas Huth 
> Signed-off-by: Andrey Drobyshev 

Thanks, applied to the block branch.

Kevin




Re: [PATCH v6 0/5] vvfat: Fix write bugs for large files and add iotests

2024-08-05 Thread Kevin Wolf
Am 20.07.2024 um 12:13 hat Amjad Alsharafi geschrieben:
> These patches fix some bugs found when modifying files in vvfat.
> First, there was a bug when writing to cluster 2 or above of a file: it
> would copy the cluster before it instead, so when writing to cluster=2,
> the content of cluster=1 would be written to disk in its place.
> 
> Another issue was modifying the clusters of a file and adding new
> clusters, this showed 2 issues:
> - If the new cluster is not immediately after the last cluster, it will
> cause issues when reading from this file in the future.
> - Generally, the usage of info.file.offset was incorrect, and the
> system would crash on abort() when the file is modified and a new
> cluster was added.
> 
> Also, added some iotests for vvfat, covering this fix and also
> general behavior such as reading, writing, and creating files on the 
> filesystem.
> Including tests for reading/writing the first cluster which
> would pass even before this patch.

Thanks, applied to the block branch (with patch 4 changed as indicated
in my comment there).

Kevin




Re: [PATCH v6 4/5] vvfat: Fix reading files with non-continuous clusters

2024-08-05 Thread Kevin Wolf
Am 20.07.2024 um 12:13 hat Amjad Alsharafi geschrieben:
> When reading with `read_cluster` we get the `mapping` with
> `find_mapping_for_cluster` and then we call `open_file` for this
> mapping.
> The issue appears when it's the same file, but a second cluster that is
> not immediately after it. Imagine clusters `500 -> 503`: this will give
> us two mappings, one with the range `500..501` and another `503..504`, both
> point to the same file, but different offsets.
> 
> When we don't open the file since the path is the same, we won't assign
> `s->current_mapping` and thus access way out of bounds of the file.
> 
> From our example above, after `open_file` (that didn't open anything) we
> will get the offset into the file with
> `s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
> give us `0x2000 * (504-500)`, which is out of bounds for this mapping and
> will produce some issues.
> 
> Signed-off-by: Amjad Alsharafi 
> ---
>  block/vvfat.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/block/vvfat.c b/block/vvfat.c
> index b63ac5d045..d2b879705e 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -1360,15 +1360,17 @@ static int open_file(BDRVVVFATState* s,mapping_t* 
> mapping)
>  {
>  if(!mapping)
>  return -1;
> -if(!s->current_mapping ||
> -strcmp(s->current_mapping->path,mapping->path)) {
> +if (s->current_mapping != mapping) {
>  /* open file */
>  int fd = qemu_open_old(mapping->path,
> -   O_RDONLY | O_BINARY | O_LARGEFILE);
> -if(fd<0)
> +O_RDONLY | O_BINARY | O_LARGEFILE);
> +if (fd < 0) {
>  return -1;
> +}
>  vvfat_close_current_file(s);
> +
>  s->current_fd = fd;
> +assert(s->current_fd);
>  s->current_mapping = mapping;
>  }
>  return 0;

Now we're reopening the file even if the mapping is actually referring
to the same file that was already open. So the result should be correct,
but wasteful in what is probably a common case.

However, this version of the patch simplified things enough that I think
I finally see what we really want:

diff --git a/block/vvfat.c b/block/vvfat.c
index e3a83fbc88..8ffe8b3b9b 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1369,8 +1369,9 @@ static int open_file(BDRVVVFATState* s,mapping_t* mapping)
 return -1;
 vvfat_close_current_file(s);
 s->current_fd = fd;
-s->current_mapping = mapping;
 }
+
+s->current_mapping = mapping;
 return 0;
 }

That should be all that is needed, and it passes your test case.

I don't usually do this, but since time is running out for QEMU 9.1,
I'll just replace the code of this patch with the above and go ahead
with that. If you think it's wrong, let me know and we'll fix it on top
of this series.

Kevin




Re: [PATCH 7/7] block/ssh.c: Don't double-check that characters are hex digits

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> In compare_fingerprint() we effectively check whether the characters
> in the fingerprint are valid hex digits twice: first we do so with
> qemu_isxdigit(), but then the hex2decimal() function also has a code
> path where it effectively detects an invalid digit and returns -1.
> This causes Coverity to complain because it thinks that we might use
> that -1 value in an expression where it would be an integer overflow.

If it's a Coverity issue, I think you want to mention the CID, too.

> Avoid the double-check of hex digit validity by testing the return
> values from hex2decimal() rather than doing separate calls to
> qemu_isxdigit().
> 
> Signed-off-by: Peter Maydell 
> ---
> Could alternatively have put a g_assert_not_reached() in
> hex2decimal(), but this seemed better to me.

I don't like that hex2decimal() returns -1 when its result is unsigned,
which is why you had to write the check as > 0xf rather than < 0. So in
this sense, g_assert_not_reached() would look better to me. But we could
also just change it to return UINT_MAX instead, which should be the
same, just written in a less confusing way.
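
For illustration, the UINT_MAX variant would look something like this
(just a sketch of the existing helper in block/ssh.c, not a tested patch):

    static unsigned hex2decimal(char ch)
    {
        if (ch >= '0' && ch <= '9') {
            return ch - '0';
        } else if (ch >= 'a' && ch <= 'f') {
            return 10 + (ch - 'a');
        } else if (ch >= 'A' && ch <= 'F') {
            return 10 + (ch - 'A');
        }
        return UINT_MAX; /* invalid digit, compares as > 0xf in the caller */
    }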

>  block/ssh.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/block/ssh.c b/block/ssh.c
> index 27d582e0e3d..510dd208aba 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c
> @@ -376,13 +376,15 @@ static int compare_fingerprint(const unsigned char 
> *fingerprint, size_t len,
>  unsigned c;
>  
>  while (len > 0) {
> +unsigned c0, c1;
>  while (*host_key_check == ':')
>  host_key_check++;
> -if (!qemu_isxdigit(host_key_check[0]) ||
> -!qemu_isxdigit(host_key_check[1]))
> +c0 = hex2decimal(host_key_check[0]);
> +c1 = hex2decimal(host_key_check[1]);
> +if (c0 > 0xf || c1 > 0xf) {
>  return 1;
> -c = hex2decimal(host_key_check[0]) * 16 +
> -hex2decimal(host_key_check[1]);
> +}
> +    c = c0 * 16 + c1;
>  if (c - *fingerprint != 0)
>  return c - *fingerprint;
>  fingerprint++;

Reviewed-by: Kevin Wolf 




Re: [PATCH 6/7] hw/ide/pci.c: Remove dead code from bmdma_prepare_buf()

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> Coverity notes that the code at the end of the loop in
> bmdma_prepare_buf() is unreachable.  This is because in commit
> 9fbf0fa81fca8f527 ("ide: remove hardcoded 2GiB transactional limit")
> we removed the only codepath in the loop which could "break" out of
> it, but didn't notice that this meant we should also remove the code
> at the end of the loop.
> 
> Remove the dead code.
> 
> Resolves: Coverity CID 1547772
> Signed-off-by: Peter Maydell 
> ---
>  hw/ide/pci.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/hw/ide/pci.c b/hw/ide/pci.c
> index 4675d079a17..f2cb500a94f 100644
> --- a/hw/ide/pci.c
> +++ b/hw/ide/pci.c
> @@ -266,10 +266,6 @@ static int32_t bmdma_prepare_buf(const IDEDMA *dma, 
> int32_t limit)
>  s->io_buffer_size += l;
>  }
>  }
> -
> -qemu_sglist_destroy(&s->sg);
> -s->io_buffer_size = 0;
> -return -1;
>  }

Should we put a g_assert_not_reached() here instead to make it easier
for the reader to understand how this function works?
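
That is, something like this at the end of the function (a sketch of the
suggestion, not a tested change):

    /* The loop above can only be left through its return statements */
    g_assert_not_reached();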

Either way:
Reviewed-by: Kevin Wolf 




Re: [PATCH 3/7] hw/block/pflash_cfi01: Don't decrement pfl->counter below 0

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> In pflash_write() Coverity points out that we can decrement the
> unsigned pfl->counter below zero, which makes it wrap around.  In
> fact this is harmless, because if pfl->counter is 0 at this point we
> also increment pfl->wcycle to 3, and the wcycle == 3 handling doesn't
> look at counter; the only way back into code which looks at the
> counter value is via wcycle == 1, which will reinitialize the counter.
> But it's arguably a little clearer to break early in the "counter ==
> 0" if(), to avoid the decrement-below-zero.
> 
> Resolves: Coverity CID 1547611
> Signed-off-by: Peter Maydell 

Reviewed-by: Kevin Wolf 
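
For illustration, the shape of the change described above is roughly this
(simplified sketch, not the actual pflash_cfi01.c control flow):

    if (pfl->counter == 0) {
        /* Last word of the block write: go to wcycle 3 and stop here
         * instead of decrementing the unsigned counter below zero. */
        pfl->wcycle++;
        break;
    }
    pfl->counter--;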




Re: [PATCH 2/7] block/gluster: Use g_autofree for string in qemu_gluster_parse_json()

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> In the loop in qemu_gluster_parse_json() we do:
> 
> char *str = NULL;
> for(...) {
> str = g_strdup_printf(...);
> ...
> if (various errors) {
> goto out;
> }
> ...
> g_free(str);
> str = NULL;
> }
> return 0;
> out:
> various cleanups;
> g_free(str);
> ...
> return -errno;
> 
> Coverity correctly complains that the assignment "str = NULL" at the
> end of the loop is unnecessary, because we will either go back to the
> top of the loop and overwrite it, or else we will exit the loop and
> then exit the function without ever reading str again. The assignment
> is there as defensive coding to ensure that str is only non-NULL if
> it's a live allocation, so this is intentional.
> 
> We can make Coverity happier and simplify the code here by using
> g_autofree, since we never need 'str' outside the loop.
> 
> Resolves: Coverity CID 1527385
> Signed-off-by: Peter Maydell 
> ---
>  block/gluster.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/block/gluster.c b/block/gluster.c
> index f8b415f3812..61ded95e660 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -514,7 +514,6 @@ static int qemu_gluster_parse_json(BlockdevOptionsGluster 
> *gconf,
>  SocketAddressList **tail;
>  QDict *backing_options = NULL;
>  Error *local_err = NULL;
> -char *str = NULL;
>  const char *ptr;
>  int i, type, num_servers;
>  
> @@ -547,7 +546,7 @@ static int qemu_gluster_parse_json(BlockdevOptionsGluster 
> *gconf,
>  tail = &gconf->server;
>  
>  for (i = 0; i < num_servers; i++) {
> -str = g_strdup_printf(GLUSTER_OPT_SERVER_PATTERN"%d.", i);
> +g_autofree char *str = 
> g_strdup_printf(GLUSTER_OPT_SERVER_PATTERN"%d.", i);

This line is too long now.

With this fixed:
Reviewed-by: Kevin Wolf 
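
As an aside, the pattern adopted here boils down to the following
(illustration only; parse_option() and the option name are made up):

    for (i = 0; i < num_servers; i++) {
        /* Freed automatically when str goes out of scope at the end of
         * each iteration, including on early returns, so no manual
         * g_free()/NULL reset is needed. */
        g_autofree char *str = g_strdup_printf("server.%d.", i);
        if (parse_option(str) < 0) {
            return -EINVAL;
        }
    }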




Re: [PATCH 1/7] block/vdi.c: Avoid potential overflow when calculating size of write

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> In vdi_co_pwritev() we multiply a sector count by SECTOR_SIZE to
> get the size to write in bytes. Coverity notes that this means that
> we do the multiply as a 32x32->32 multiply before converting to
> 64 bits, which has the potential to overflow.
> 
> This is very unlikely to happen, since the block map has 4 bytes per
> block and the maximum number of blocks in the image must fit into a
> 32-bit integer.  But we can keep Coverity happy by including a cast
> so we do a 64-bit multiply here.
> 
> Resolves: Coverity CID 1508076
> Signed-off-by: Peter Maydell 
> ---
>  block/vdi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/vdi.c b/block/vdi.c
> index 6363da08cee..27c60ba18d0 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -728,7 +728,7 @@ nonallocating_write:
>  logout("will write %u block map sectors starting from entry %u\n",
> n_sectors, bmap_first);
>  ret = bdrv_co_pwrite(bs->file, bmap_offset * SECTOR_SIZE,
> - n_sectors * SECTOR_SIZE, base, 0);
> + n_sectors * (uint64_t)SECTOR_SIZE, base, 0);
>  }

I wonder if we shouldn't just make VDI's SECTOR_SIZE 64 bits like
BDRV_SECTOR_SIZE. It's easy to miss the cast in individual places.
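
Something like this one-liner, presumably (untested, and modulo checking
the existing users for signedness issues):

    /* Match BDRV_SECTOR_SIZE so multiplications are done in 64 bits */
    #define SECTOR_SIZE 512ULL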

Kevin




Re: [PATCH 4/7] hw/ide/atapi: Be explicit that assigning to s->lcyl truncates

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> In ide_atapi_cmd_reply_end() we calculate a 16-bit size, and then
> assign its two halves to s->lcyl and s->hcyl like this:
> 
>s->lcyl = size;
>s->hcyl = size >> 8;
> 
> Coverity warns that the first line here can overflow the
> 8-bit s->lcyl variable. This is true, and in this case we're
> deliberately only after the low 8 bits of the value. The
> code is clearer to both humans and Coverity if we're explicit
> that we only wanted the low 8 bits, though.
> 
> Signed-off-by: Peter Maydell 

Reviewed-by: Kevin Wolf 
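
That is, presumably something along these lines:

    s->lcyl = size & 0xff; /* low byte; the truncation is now explicit */
    s->hcyl = size >> 8;   /* high byte */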




Re: [PATCH 5/7] hw/block/fdc-isa: Assert that isa_fdc_get_drive_max_chs() found something

2024-07-31 Thread Kevin Wolf
Am 31.07.2024 um 16:36 hat Peter Maydell geschrieben:
> Coverity complains about an overflow in isa_fdc_get_drive_max_chs()
> that can happen if the loop over fd_formats never finds a match,
> because we initialize *maxc to 0 and then at the end of the
> function decrement it.
> 
> This can't ever actually happen because fd_formats has at least
> one entry for each FloppyDriveType, so we must at least once
> find a match and update *maxc, *maxh and *maxs. Assert that we
> did find a match, which should keep Coverity happy and will also
> detect possible bugs in the data in fd_formats.
> 
> Resolves: Coverity CID 1547663
> Signed-off-by: Peter Maydell 

Reviewed-by: Kevin Wolf 
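
I.e. presumably, right before the final decrement:

    /* fd_formats has at least one entry per FloppyDriveType, so the
     * loop above must have matched and set *maxc, *maxh and *maxs */
    assert(*maxc != 0);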




[PATCH v2 3/4] scsi-disk: Add warning comments that host_status errors take a shortcut

2024-07-31 Thread Kevin Wolf
scsi_block_sgio_complete() has surprising behaviour in that there are
error cases in which it directly completes the request and never calls
the passed callback. In the current state of the code, this doesn't seem
to result in bugs, but with future code changes, we must be careful to
never rely on the callback doing some cleanup until this code smell is
fixed. For now, just add warnings to make people aware of the trap.

Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6e1a5c98df..69a195177e 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -68,6 +68,9 @@ struct SCSIDiskClass {
 /*
  * Callbacks receive ret == 0 for success. Errors are represented either as
  * negative errno values, or as positive SAM status codes.
+ *
+ * Beware: For errors returned in host_status, the function may directly
+ * complete the request and never call the callback.
  */
 DMAIOFunc   *dma_readv;
 DMAIOFunc   *dma_writev;
@@ -381,6 +384,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_dma_complete(void *opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -421,6 +425,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_read_complete(void *opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -560,6 +565,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_write_complete(void * opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -2821,6 +2827,7 @@ static void scsi_block_sgio_complete(void *opaque, int 
ret)
 sg_io_hdr_t *io_hdr = &req->io_header;
 
 if (ret == 0) {
+/* FIXME This skips calling req->cb() and any cleanup in it */
 if (io_hdr->host_status != SCSI_HOST_OK) {
 scsi_req_complete_failed(&r->req, io_hdr->host_status);
 scsi_req_unref(&r->req);
-- 
2.45.2




[PATCH v2 4/4] scsi-disk: Always report RESERVATION_CONFLICT to guest

2024-07-31 Thread Kevin Wolf
In the case of scsi-block, RESERVATION_CONFLICT is not a backend error,
but indicates that the guest tried to make a request that it isn't
allowed to execute. Pass the error to the guest so that it can decide
what to do with it.

Without this, if we stop the VM in response to a RESERVATION_CONFLICT
(as is the default policy in management software such as oVirt or
KubeVirt), it can happen that the VM cannot be resumed any more because
every attempt to resume it immediately runs into the same error and
stops the VM again.

One case that expects RESERVATION_CONFLICT errors to be visible in the
guest is running the validation tests in Windows 2019's Failover Cluster
Manager, which intentionally tries to execute invalid requests to see if
they are properly rejected.

Buglink: https://issues.redhat.com/browse/RHEL-5
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 35 ++-
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 69a195177e..4d94b2b816 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -224,7 +224,7 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, 
bool acct_failed)
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 SCSIDiskClass *sdc = (SCSIDiskClass *) object_get_class(OBJECT(s));
 SCSISense sense = SENSE_CODE(NO_SENSE);
-int error = 0;
+int error;
 bool req_has_sense = false;
 BlockErrorAction action;
 int status;
@@ -235,11 +235,35 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, 
bool acct_failed)
 } else {
 /* A passthrough command has completed with nonzero status.  */
 status = ret;
-if (status == CHECK_CONDITION) {
+switch (status) {
+case CHECK_CONDITION:
 req_has_sense = true;
 error = scsi_sense_buf_to_errno(r->req.sense, 
sizeof(r->req.sense));
-} else {
+break;
+case RESERVATION_CONFLICT:
+/*
+ * Don't apply the error policy, always report to the guest.
+ *
+ * This is a passthrough code path, so it's not a backend error, but
+ * a response to an invalid guest request.
+ *
+ * Windows Failover Cluster validation intentionally sends invalid
+ * requests to verify that reservations work as intended. It is
+ * crucial that it sees the resulting errors.
+ *
+ * Treating a reservation conflict as a guest-side error is obvious
+ * when a pr-manager is in use. Without one, the situation is less
+ * clear, but there might be nothing that can be fixed on the host
+ * (like in the above example), and we don't want to be stuck in a
+ * loop where resuming the VM and retrying the request immediately
+ * stops it again. So always reporting is still the safer option in
+ * this case, too.
+ */
+error = 0;
+break;
+default:
 error = EINVAL;
+break;
 }
 }
 
@@ -249,8 +273,9 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, 
bool acct_failed)
  * are usually retried immediately, so do not post them to QMP and
  * do not account them as failed I/O.
  */
-if (req_has_sense &&
-scsi_sense_buf_is_guest_recoverable(r->req.sense, 
sizeof(r->req.sense))) {
+if (!error || (req_has_sense &&
+   scsi_sense_buf_is_guest_recoverable(r->req.sense,
+   sizeof(r->req.sense)))) {
 action = BLOCK_ERROR_ACTION_REPORT;
 acct_failed = false;
 } else {
-- 
2.45.2




[PATCH v2 2/4] scsi-block: Don't skip callback for sgio error status/driver_status

2024-07-31 Thread Kevin Wolf
Instead of calling into scsi_handle_rw_error() directly from
scsi_block_sgio_complete() and skipping the normal callback, go through
the normal cleanup path by calling the callback with a positive error
value.

The important difference here is not only that the code path is cleaner,
but that the callbacks set r->req.aiocb = NULL. If we skip setting this
and the error action is BLOCK_ERROR_ACTION_STOP, resuming the VM runs
into an assertion failure in scsi_read_data() or scsi_write_data()
because the dangling aiocb pointer is unexpected.

Fixes: a108557bbf ("scsi: inline sg_io_sense_from_errno() into the callers.")
Buglink: https://issues.redhat.com/browse/RHEL-5
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 3ff6798bde..6e1a5c98df 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2832,16 +2832,6 @@ static void scsi_block_sgio_complete(void *opaque, int 
ret)
 } else {
 ret = io_hdr->status;
 }
-
-if (ret > 0) {
-if (scsi_handle_rw_error(r, ret, true)) {
-scsi_req_unref(&r->req);
-return;
-}
-
-/* Ignore error.  */
-ret = 0;
-}
 }
 
 req->cb(req->cb_opaque, ret);
-- 
2.45.2




[PATCH v2 0/4] scsi-block: Fix error handling with r/werror=stop

2024-07-31 Thread Kevin Wolf
Running validation tests in Windows 2019's Failover Cluster Manager
fails in two different ways when run with rerror/werror=stop:

1. It runs into an assertion failure because the sgio-based I/O path
   takes shortcuts in its error handling that skip necessary cleanup

2. RESERVATION_CONFLICT is treated as a host error and stops the VM,
   which in some cases can't be resumed at all because nothing will make
   the error go away on retry. The error should always go to the guest
   instead, it's an invalid request from the guest.

This series fixes these problems.

v2:
- Patch 4: [Paolo]
  * Set error = 0 explicitly, remove useless variable initialisation
  * Add comment to explain why we consider it a guest error
  * Mention scsi-block specifically in the commit message

Kevin Wolf (4):
  scsi-disk: Use positive return value for status in dma_readv/writev
  scsi-block: Don't skip callback for sgio error status/driver_status
  scsi-disk: Add warning comments that host_status errors take a
shortcut
  scsi-disk: Always report RESERVATION_CONFLICT to guest

 hw/scsi/scsi-disk.c | 73 +++--
 1 file changed, 51 insertions(+), 22 deletions(-)

-- 
2.45.2




[PATCH v2 1/4] scsi-disk: Use positive return value for status in dma_readv/writev

2024-07-31 Thread Kevin Wolf
In some error cases, scsi_block_sgio_complete() never calls the passed
callback, but directly completes the request. This leads to bugs because
its error paths are not exact copies of what the callback would normally
do.

In preparation to fix this, allow passing positive return values to the
callbacks that represent the status code that should be used to complete
the request.

scsi_handle_rw_error() already handles positive values for its ret
parameter because scsi_block_sgio_complete() calls directly into it.

Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index a67092db6a..3ff6798bde 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -65,6 +65,10 @@ OBJECT_DECLARE_TYPE(SCSIDiskState, SCSIDiskClass, 
SCSI_DISK_BASE)
 
 struct SCSIDiskClass {
 SCSIDeviceClass parent_class;
+/*
+ * Callbacks receive ret == 0 for success. Errors are represented either as
+ * negative errno values, or as positive SAM status codes.
+ */
 DMAIOFunc   *dma_readv;
 DMAIOFunc   *dma_writev;
 bool(*need_fua_emulation)(SCSICommand *cmd);
@@ -283,7 +287,7 @@ static bool scsi_disk_req_check_error(SCSIDiskReq *r, int 
ret, bool acct_failed)
 return true;
 }
 
-if (ret < 0) {
+if (ret != 0) {
 return scsi_handle_rw_error(r, ret, acct_failed);
 }
 
@@ -360,7 +364,7 @@ static void scsi_write_do_fua(SCSIDiskReq *r)
 static void scsi_dma_complete_noio(SCSIDiskReq *r, int ret)
 {
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -385,9 +389,10 @@ static void scsi_dma_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
 scsi_dma_complete_noio(r, ret);
@@ -403,7 +408,7 @@ static void scsi_read_complete_noio(SCSIDiskReq *r, int ret)
qemu_get_current_aio_context());
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -424,9 +429,10 @@ static void scsi_read_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 trace_scsi_disk_read_complete(r->req.tag, r->qiov.size);
 }
@@ -534,7 +540,7 @@ static void scsi_write_complete_noio(SCSIDiskReq *r, int 
ret)
qemu_get_current_aio_context());
 
 assert (r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -562,9 +568,10 @@ static void scsi_write_complete(void * opaque, int ret)
 assert (r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
 scsi_write_complete_noio(r, ret);
-- 
2.45.2




Re: [PATCH 01/18] qapi: Smarter camel_to_upper() to reduce need for 'prefix'

2024-07-31 Thread Kevin Wolf
Am 30.07.2024 um 10:10 hat Markus Armbruster geschrieben:
> camel_to_upper() converts its argument from camel case to upper case
> with '_' between words.  Used for generated enumeration constant
> prefixes.
> 
> When some of the words are spelled all caps, where exactly to insert
> '_' is guesswork.  camel_to_upper()'s guesses are bad enough in places
> to make people override them with a 'prefix' in the schema.
> 
> Rewrite it to guess better:
> 
> 1. Insert '_' after a non-upper case character followed by an upper
>case character:
> 
>OneTwo -> ONE_TWO
>One2Three -> ONE2_THREE
> 
> 2. Insert '_' before the last upper case character followed by a
>non-upper case character:
> 
>    ACRONYMWord -> ACRONYM_WORD
> 
>Except at the beginning (as in OneTwo above), or when there is
>already one:
> 
>AbCd -> AB_CD

Maybe it's just me, but the exception "at the beginning" (in the sense
of "after the first character") seems to be exactly where I thought
"that looks strange" while going through your list below. In particular,
I'd expect X_DBG_* instead of XDBG_*. I also thought that the Q_*
spelling made more sense, though this might be less clear. But in case
of doubt, fewer exceptions seem like a good choice.

> +# Copy remainder of ``value`` to ``ret`` with '_' inserted
> +for ch in value[1:]:
> +if ch.isupper() == upc:
> +pass
> +elif upc:
> +# ``ret`` ends in upper case, next char isn't: insert '_'
> +# before the last upper case char unless there is one
> +# already, or it's at the beginning
> +if len(ret) > 2 and ret[-2] != '_':
> +ret = ret[:-1] + '_' + ret[-1]

I think in the code this means I would have expected len(ret) >= 2.

Kevin




Re: [PATCH 4/4] scsi-disk: Always report RESERVATION_CONFLICT to guest

2024-07-29 Thread Kevin Wolf
Am 29.07.2024 um 14:26 hat Paolo Bonzini geschrieben:
> Il lun 29 lug 2024, 14:20 Kevin Wolf  ha scritto:
> 
> > Apparently both oVirt and Kubevirt unconditionally use the stop policy,
> > so I'm afraid in this case we must acknowledge that our expectations
> > don't match reality.
> >
> 
> Yeah, of course.
> 
> If I understand correctly, not having a pr-manager could mean that QEMU
> > itself is sufficiently privileged and then the same logic would apply.
> >
> > But even if it means that we can't change any persistent reservations
> > from the VM, what use would stopping the VM be? You would run into the
> > exact case I'm describing in the commit message: You try to resume the
> > VM and it immediately stops again because the request still doesn't get
> > through. Or do you expect the host admin to take some manual action
> > then?
> >
> 
> Yes, if the PR operation is not allowed then the host admin would probably
> get a notification and release the PR (or perhaps shut down the VM with an
> error) itself.
> 
> And what would you do about the Windows cluster validation case that
> > intentionally sends a request which reservations don't and shouldn't
> > allow? There is nothing on the host side to fix there. The guest is only
> > happy when it gets an error back.
> >
> 
> Yes, in that case (which is the most common one) there is nothing you can
> do, so the patch is a good idea even if the case without a PR manager is a
> bit murky.

Ok, so modifying the commit message and removing the 'error'
initialisation it is. Maybe mention the cluster validation case in the
comment here to explain why we do this even for non-pr-manager cases,
but not as a FIXME or TODO because it's not a problem with the
implementation, but we don't have any other choice. Right?

Should I send a v2 for this or is it okay to do this only while applying
the patch?

Kevin




Re: [PATCH v2 1/3] block/commit: implement final flush

2024-07-29 Thread Kevin Wolf
Am 19.07.2024 um 12:35 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 18.07.24 22:22, Kevin Wolf wrote:
> > Am 26.06.2024 um 16:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > Actually the block job is not complete without the final flush. It's
> > > rather unexpected to have a broken target when the job was successfully
> > > completed long ago and we now fail to flush, or the process just
> > > crashed/was killed.
> > > 
> > > Mirror job already has mirror_flush() for this. So, it's OK.
> > > 
> > > Do this for commit job too.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   block/commit.c | 41 +++--
> > >   1 file changed, 27 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/block/commit.c b/block/commit.c
> > > index 7c3fdcb0ca..81971692a2 100644
> > > --- a/block/commit.c
> > > +++ b/block/commit.c
> > > @@ -134,6 +134,7 @@ static int coroutine_fn commit_run(Job *job, Error 
> > > **errp)
> > >   int64_t n = 0; /* bytes */
> > >   QEMU_AUTO_VFREE void *buf = NULL;
> > >   int64_t len, base_len;
> > > +bool need_final_flush = true;
> > >   len = blk_co_getlength(s->top);
> > >   if (len < 0) {
> > > @@ -155,8 +156,8 @@ static int coroutine_fn commit_run(Job *job, Error 
> > > **errp)
> > >   buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);
> > > -for (offset = 0; offset < len; offset += n) {
> > > -bool copy;
> > > +for (offset = 0; offset < len || need_final_flush; offset += n) {
> > 
> > In general, the control flow would be nicer to read if the final flush
> > weren't integrated into the loop, but just added after it.
> > 
> > But I assume this is pretty much required for pausing the job during
> > error handling in the final flush if you don't want to duplicate a lot
> > of the logic into a second loop?
> 
> Right, that's the reason.

This would probably be the right solution if it affected only commit.
But I've thought a bit more about this and given that the same thing
happens in all of the block jobs, I'm really wondering if introducing a
block job helper function wouldn't be better, so that each block job
could just add something like this after its main loop:

    do {
        ret = blk_co_flush(blk);
    } while (block_job_handle_error(job, ret));

And the helper would call block_job_error_action(), stop the job if
necessary, check if it's cancelled, include a pause point etc.
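
A rough sketch of what such a helper could look like (the name comes from
the suggestion above; the exact signature, with the error policy passed in
explicitly, is a guess, not existing API):

    static bool coroutine_fn block_job_handle_error(BlockJob *job,
                                                    BlockdevOnError on_error,
                                                    int ret)
    {
        BlockErrorAction action;

        if (ret >= 0) {
            return false; /* success, nothing to retry */
        }

        action = block_job_error_action(job, on_error, false, -ret);
        if (action == BLOCK_ERROR_ACTION_REPORT) {
            return false; /* let the caller fail the job with ret */
        }

        /* BLOCK_ERROR_ACTION_STOP pauses the job; wait to be resumed */
        job_pause_point(&job->job);
        return !job_is_cancelled(&job->job); /* retry unless cancelled */
    }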

Kevin




Re: [PATCH 4/4] scsi-disk: Always report RESERVATION_CONFLICT to guest

2024-07-29 Thread Kevin Wolf
Am 29.07.2024 um 13:55 hat Paolo Bonzini geschrieben:
> On Mon, Jul 29, 2024 at 11:47 AM Kevin Wolf  wrote:
> > RESERVATION_CONFLICT is not a backend error, but indicates that the
> > guest tried to make a request that it isn't allowed to execute. Pass the
> > error to the guest so that it can decide what to do with it.
> 
> This is only true of scsi-block (though your patch is okay here -
> scsi-disk would see an EBADE and go down the ret < 0 path).

Right, in the scsi-disk case, we probably do want to consider it a
host-side error because the guest can't see or influence what happens on
the backend.

I can change the commit message accordingly.

> In general, for scsi-block I'd expect people to use report instead of
> stop. I agree that this is the best behavior for the case where you
> have a pr-manager, but it may also be better to stop the VM if a
> pr-manager has not been set up.  That's probably a bit hackish, so I
> guess it's okay to add a FIXME or TODO comment instead?

Apparently both oVirt and Kubevirt unconditionally use the stop policy,
so I'm afraid in this case we must acknowledge that our expectations
don't match reality.

If I understand correctly, not having a pr-manager could mean that QEMU
itself is sufficiently privileged and then the same logic would apply.

But even if it means that we can't change any persistent reservations
from the VM, what use would stopping the VM be? You would run into the
exact case I'm describing in the commit message: You try to resume the
VM and it immediately stops again because the request still doesn't get
through. Or do you expect the host admin to take some manual action
then?

And what would you do about the Windows cluster validation case that
intentionally sends a request which reservations don't and shouldn't
allow? There is nothing on the host side to fix there. The guest is only
happy when it gets an error back.

> > -if (status == CHECK_CONDITION) {
> > +switch (status) {
> > +case CHECK_CONDITION:
> >  req_has_sense = true;
> >  error = scsi_sense_buf_to_errno(r->req.sense, 
> > sizeof(r->req.sense));
> > -} else {
> > +break;
> > +case RESERVATION_CONFLICT:
> > +/* Don't apply the error policy, always report to the guest */
> 
> This is the only case where you get error == 0. Maybe remove it from
> the initializer, and set it here?

Not sure why the initialiser was added in the first place, but yes, I
can do that.

Kevin

> On Mon, Jul 29, 2024 at 11:47 AM Kevin Wolf  wrote:
> >
> > RESERVATION_CONFLICT is not a backend error, but indicates that the
> > guest tried to make a request that it isn't allowed to execute. Pass the
> > error to the guest so that it can decide what to do with it.
> >
> > Without this, if we stop the VM in response to a RESERVATION_CONFLICT,
> > it can happen that the VM cannot be resumed any more because every
> > attempt to resume it immediately runs into the same error and stops the
> > VM again.
> >
> > One case that expects RESERVATION_CONFLICT errors to be visible in the
> > guest is running the validation tests in Windows 2019's Failover Cluster
> > Manager, which intentionally tries to execute invalid requests to see if
> > they are properly rejected.
> >
> > Buglink: https://issues.redhat.com/browse/RHEL-5
> > Signed-off-by: Kevin Wolf 
> > ---
> >  hw/scsi/scsi-disk.c | 15 +++
> >  1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> > index 69a195177e..e173b238de 100644
> > --- a/hw/scsi/scsi-disk.c
> > +++ b/hw/scsi/scsi-disk.c
> > @@ -235,11 +235,17 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int 
> > ret, bool acct_failed)
> >  } else {
> >  /* A passthrough command has completed with nonzero status.  */
> >  status = ret;
> > -if (status == CHECK_CONDITION) {
> > +switch (status) {
> > +case CHECK_CONDITION:
> >  req_has_sense = true;
> >  error = scsi_sense_buf_to_errno(r->req.sense, 
> > sizeof(r->req.sense));
> > -} else {
> > +break;
> > +case RESERVATION_CONFLICT:
> > +/* Don't apply the error policy, always report to the guest */
> > +break;
> > +default:
> >  error = EINVAL;
> > +break;
> >  }
> >  }
> >
> > @@ -249,8 +255,9 @@ static bool scsi_handle_rw_error(SCSIDi

[PATCH 1/4] scsi-disk: Use positive return value for status in dma_readv/writev

2024-07-29 Thread Kevin Wolf
In some error cases, scsi_block_sgio_complete() never calls the passed
callback, but directly completes the request. This leads to bugs because
its error paths are not exact copies of what the callback would normally
do.

In preparation to fix this, allow passing positive return values to the
callbacks that represent the status code that should be used to complete
the request.

scsi_handle_rw_error() already handles positive values for its ret
parameter because scsi_block_sgio_complete() calls directly into it.

Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index a67092db6a..3ff6798bde 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -65,6 +65,10 @@ OBJECT_DECLARE_TYPE(SCSIDiskState, SCSIDiskClass, 
SCSI_DISK_BASE)
 
 struct SCSIDiskClass {
 SCSIDeviceClass parent_class;
+/*
+ * Callbacks receive ret == 0 for success. Errors are represented either as
+ * negative errno values, or as positive SAM status codes.
+ */
 DMAIOFunc   *dma_readv;
 DMAIOFunc   *dma_writev;
 bool(*need_fua_emulation)(SCSICommand *cmd);
@@ -283,7 +287,7 @@ static bool scsi_disk_req_check_error(SCSIDiskReq *r, int 
ret, bool acct_failed)
 return true;
 }
 
-if (ret < 0) {
+if (ret != 0) {
 return scsi_handle_rw_error(r, ret, acct_failed);
 }
 
@@ -360,7 +364,7 @@ static void scsi_write_do_fua(SCSIDiskReq *r)
 static void scsi_dma_complete_noio(SCSIDiskReq *r, int ret)
 {
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -385,9 +389,10 @@ static void scsi_dma_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
 scsi_dma_complete_noio(r, ret);
@@ -403,7 +408,7 @@ static void scsi_read_complete_noio(SCSIDiskReq *r, int ret)
qemu_get_current_aio_context());
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -424,9 +429,10 @@ static void scsi_read_complete(void *opaque, int ret)
 assert(r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 trace_scsi_disk_read_complete(r->req.tag, r->qiov.size);
 }
@@ -534,7 +540,7 @@ static void scsi_write_complete_noio(SCSIDiskReq *r, int 
ret)
qemu_get_current_aio_context());
 
 assert (r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, ret > 0)) {
 goto done;
 }
 
@@ -562,9 +568,10 @@ static void scsi_write_complete(void * opaque, int ret)
 assert (r->req.aiocb != NULL);
 r->req.aiocb = NULL;
 
+/* ret > 0 is accounted for in scsi_disk_req_check_error() */
 if (ret < 0) {
 block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
-} else {
+} else if (ret == 0) {
 block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 }
 scsi_write_complete_noio(r, ret);
-- 
2.45.2




[PATCH 2/4] scsi-block: Don't skip callback for sgio error status/driver_status

2024-07-29 Thread Kevin Wolf
Instead of calling into scsi_handle_rw_error() directly from
scsi_block_sgio_complete() and skipping the normal callback, go through
the normal cleanup path by calling the callback with a positive error
value.

The important difference here is not only that the code path is cleaner,
but that the callbacks set r->req.aiocb = NULL. If we skip setting this
and the error action is BLOCK_ERROR_ACTION_STOP, resuming the VM runs
into an assertion failure in scsi_read_data() or scsi_write_data()
because the dangling aiocb pointer is unexpected.

Fixes: a108557bbf ("scsi: inline sg_io_sense_from_errno() into the callers.")
Buglink: https://issues.redhat.com/browse/RHEL-5
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 3ff6798bde..6e1a5c98df 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2832,16 +2832,6 @@ static void scsi_block_sgio_complete(void *opaque, int 
ret)
 } else {
 ret = io_hdr->status;
 }
-
-if (ret > 0) {
-if (scsi_handle_rw_error(r, ret, true)) {
-scsi_req_unref(&r->req);
-return;
-}
-
-/* Ignore error.  */
-ret = 0;
-}
 }
 
 req->cb(req->cb_opaque, ret);
-- 
2.45.2




[PATCH 4/4] scsi-disk: Always report RESERVATION_CONFLICT to guest

2024-07-29 Thread Kevin Wolf
RESERVATION_CONFLICT is not a backend error, but indicates that the
guest tried to make a request that it isn't allowed to execute. Pass the
error to the guest so that it can decide what to do with it.

Without this, if we stop the VM in response to a RESERVATION_CONFLICT,
it can happen that the VM cannot be resumed any more because every
attempt to resume it immediately runs into the same error and stops the
VM again.

One case that expects RESERVATION_CONFLICT errors to be visible in the
guest is running the validation tests in Windows 2019's Failover Cluster
Manager, which intentionally tries to execute invalid requests to see if
they are properly rejected.

Buglink: https://issues.redhat.com/browse/RHEL-5
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 69a195177e..e173b238de 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -235,11 +235,17 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, 
bool acct_failed)
 } else {
 /* A passthrough command has completed with nonzero status.  */
 status = ret;
-if (status == CHECK_CONDITION) {
+switch (status) {
+case CHECK_CONDITION:
 req_has_sense = true;
 error = scsi_sense_buf_to_errno(r->req.sense, 
sizeof(r->req.sense));
-} else {
+break;
+case RESERVATION_CONFLICT:
+/* Don't apply the error policy, always report to the guest */
+break;
+default:
 error = EINVAL;
+break;
 }
 }
 
@@ -249,8 +255,9 @@ static bool scsi_handle_rw_error(SCSIDiskReq *r, int ret, 
bool acct_failed)
  * are usually retried immediately, so do not post them to QMP and
  * do not account them as failed I/O.
  */
-if (req_has_sense &&
-scsi_sense_buf_is_guest_recoverable(r->req.sense, 
sizeof(r->req.sense))) {
+if (!error || (req_has_sense &&
+   scsi_sense_buf_is_guest_recoverable(r->req.sense,
+   sizeof(r->req.sense)))) {
 action = BLOCK_ERROR_ACTION_REPORT;
 acct_failed = false;
 } else {
-- 
2.45.2




[PATCH 0/4] scsi-block: Fix error handling with r/werror=stop

2024-07-29 Thread Kevin Wolf
Running validation tests in Windows 2019's Failover Cluster Manager
fails in two different ways when run with rerror/werror=stop:

1. It runs into an assertion failure because the sgio-based I/O path
   takes shortcuts in its error handling that skip necessary cleanup

2. RESERVATION_CONFLICT is treated as a host error and stops the VM,
   which in some cases can't be resumed at all because nothing will make
   the error go away on retry. The error should always go to the guest
   instead, it's an invalid request from the guest.

This series fixes these problems.

Kevin Wolf (4):
  scsi-disk: Use positive return value for status in dma_readv/writev
  scsi-block: Don't skip callback for sgio error status/driver_status
  scsi-disk: Add warning comments that host_status errors take a
shortcut
  scsi-disk: Always report RESERVATION_CONFLICT to guest

 hw/scsi/scsi-disk.c | 53 +++--
 1 file changed, 32 insertions(+), 21 deletions(-)

-- 
2.45.2




[PATCH 3/4] scsi-disk: Add warning comments that host_status errors take a shortcut

2024-07-29 Thread Kevin Wolf
scsi_block_sgio_complete() has surprising behaviour in that there are
error cases in which it directly completes the request and never calls
the passed callback. In the current state of the code, this doesn't seem
to result in bugs, but with future code changes, we must be careful to
never rely on the callback doing some cleanup until this code smell is
fixed. For now, just add warnings to make people aware of the trap.

Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6e1a5c98df..69a195177e 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -68,6 +68,9 @@ struct SCSIDiskClass {
 /*
  * Callbacks receive ret == 0 for success. Errors are represented either as
  * negative errno values, or as positive SAM status codes.
+ *
+ * Beware: For errors returned in host_status, the function may directly
+ * complete the request and never call the callback.
  */
 DMAIOFunc   *dma_readv;
 DMAIOFunc   *dma_writev;
@@ -381,6 +384,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_dma_complete(void *opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -421,6 +425,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_read_complete(void *opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -560,6 +565,7 @@ done:
 scsi_req_unref(&r->req);
 }
 
+/* May not be called in all error cases, don't rely on cleanup here */
 static void scsi_write_complete(void * opaque, int ret)
 {
 SCSIDiskReq *r = (SCSIDiskReq *)opaque;
@@ -2821,6 +2827,7 @@ static void scsi_block_sgio_complete(void *opaque, int 
ret)
 sg_io_hdr_t *io_hdr = &req->io_header;
 
 if (ret == 0) {
+/* FIXME This skips calling req->cb() and any cleanup in it */
 if (io_hdr->host_status != SCSI_HOST_OK) {
 scsi_req_complete_failed(&r->req, io_hdr->host_status);
 scsi_req_unref(&r->req);
-- 
2.45.2




Re: [PATCH v2] block-backend: per-device throttling of BLOCK_IO_ERROR reports

2024-07-19 Thread Kevin Wolf
Am 19.07.2024 um 06:54 hat Markus Armbruster geschrieben:
> Kevin Wolf  writes:
> 
> > Am 09.01.2024 um 14:13 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> From: Leonid Kaplan 
> >> 
> BLOCK_IO_ERROR events come from the guest, so we must throttle them.
> We still want per-device throttling, so let's use the device id as the key.
> >> 
> >> Signed-off-by: Leonid Kaplan 
> >> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> >> ---
> >> 
> >> v2: add Note: to QAPI doc
> >> 
> >>  monitor/monitor.c| 10 ++
> >>  qapi/block-core.json |  2 ++
> >>  2 files changed, 12 insertions(+)
> >> 
> >> diff --git a/monitor/monitor.c b/monitor/monitor.c
> >> index 01ede1babd..ad0243e9d7 100644
> >> --- a/monitor/monitor.c
> >> +++ b/monitor/monitor.c
> >> @@ -309,6 +309,7 @@ int error_printf_unless_qmp(const char *fmt, ...)
> >>  static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
> >>  /* Limit guest-triggerable events to 1 per second */
> >>  [QAPI_EVENT_RTC_CHANGE]= { 1000 * SCALE_MS },
> >> +[QAPI_EVENT_BLOCK_IO_ERROR]= { 1000 * SCALE_MS },
> >>  [QAPI_EVENT_WATCHDOG]  = { 1000 * SCALE_MS },
> >>  [QAPI_EVENT_BALLOON_CHANGE]= { 1000 * SCALE_MS },
> >>  [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
> >> @@ -498,6 +499,10 @@ static unsigned int qapi_event_throttle_hash(const 
> >> void *key)
> >>  hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
> >>  }
> >>  
> >> +if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
> >> +hash += g_str_hash(qdict_get_str(evstate->data, "device"));
> >> +}
> >
> > Using "device" only works with -drive, i.e. when the BlockBackend
> > actually has a name. In modern configurations with a -blockdev
> > referenced by -device, the BlockBackend doesn't have a name any more.
> >
> > Maybe we should be using the qdev id (or more generally, QOM path) here,
> > but that's something the event doesn't even contain yet.
> 
> Uh, does the event reliably identify the I/O error's node or not?  If
> not, then that's a serious design defect.
> 
> There's @node-name.  Several commands use "either @device or @node-name"
> to identify a node.  Is that sufficient here?

Possibly. The QAPI event is sent by a device, not by the backend, and
the commit message claims per-device throttling. That's what made me
think that we should base it on the device id.

But it's true that the error does originate in the backend (and it's
unlikely that two devices are attached to the same node anyway), so the
node-name could be good enough if we don't have a BlockBackend name. We
should claim per-block-node rather than per-device throttling then.
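
Something like this, perhaps (a sketch; it assumes the event carries a
"node-name" member whenever "device" is empty):

    if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
        const char *key = qdict_get_try_str(evstate->data, "device");
        if (!key || !*key) {
            key = qdict_get_try_str(evstate->data, "node-name") ?: "";
        }
        hash += g_str_hash(key);
    }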

Kevin




Re: [PATCH v5 4/5] vvfat: Fix reading files with non-continuous clusters

2024-07-19 Thread Kevin Wolf
Am 19.07.2024 um 02:29 hat Amjad Alsharafi geschrieben:
> 
> 
> On Jul 19 2024, at 8:20 am, Amjad Alsharafi  wrote:
> 
> > On Thu, Jul 18, 2024 at 05:20:36PM +0200, Kevin Wolf wrote:
> >> Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> >> > When reading with `read_cluster` we get the `mapping` with
> >> > `find_mapping_for_cluster` and then we call `open_file` for this
> >> > mapping.
> >> > The issue appears when it's the same file, but a second cluster that is
> >> > not immediately after it. Imagine clusters `500 -> 503`: this will give
> >> > us two mappings, one with the range `500..501` and another `503..504`, both
> >> > point to the same file, but different offsets.
> >> > 
> >> > When we don't open the file since the path is the same, we won't assign
> >> > `s->current_mapping` and thus access way out of bounds of the file.
> >> > 
> >> > From our example above, after `open_file` (that didn't open anything) we
> >> > will get the offset into the file with
> >> > `s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
> >> > give us `0x2000 * (504-500)`, which is out of bounds for this mapping and
> >> > will produce some issues.
> >> > 
> >> > Signed-off-by: Amjad Alsharafi 
> >> > ---
> >> >  block/vvfat.c | 23 ---
> >> >  1 file changed, 16 insertions(+), 7 deletions(-)
> >> > 
> >> > diff --git a/block/vvfat.c b/block/vvfat.c
> >> > index b63ac5d045..fc570d0610 100644
> >> > --- a/block/vvfat.c
> >> > +++ b/block/vvfat.c
> >> > @@ -1360,15 +1360,24 @@ static int open_file(BDRVVVFATState*
> >> s,mapping_t* mapping)
> >> >  {
> >> >  if(!mapping)
> >> >  return -1;
> >> > +int new_path = 1;
> >> >  if(!s->current_mapping ||
> >> > -strcmp(s->current_mapping->path,mapping->path)) {
> >> > -/* open file */
> >> > -int fd = qemu_open_old(mapping->path,
> >> > +s->current_mapping->info.file.offset
> >> > +!= mapping->info.file.offset ||
> >> 
> >> I'm wondering if this couldn't just be s->current_mapping != mapping?
> > 
> > Actually, you are totally right. Not sure what made me go for this.
> > 
> > I tried also to test with only checking if the path changed, but it
> > fails on some tests. So the offset is important.
> > For that reason, checking just the mapping ptr is better since we won't
> > have two mappings with the same file and offset.
> > 
> > I'll then use this change. Thanks
> 
> Should I send a new patch, since most commits are reviewed now?

Yes, please do. I think I reviewed the whole series.

Kevin

> > 
> >> 
> >> > +(new_path = strcmp(s->current_mapping->path,
> >> mapping->path))) {
> >> 
> >> If both the path and the offset change, we still want to set
> >> new_path, I
> >> think. And if we didn't already have a mapping, we also need to open the
> >> file.
> >> 
> >> Actually, setting a variable inside the condition makes it kind of hard
> >> to read, so if s->current_mapping != mapping works, we can do the check
> >> only in the conditon below:
> >> 
> >> > +if (new_path) {
> >> 
> >> if (!s->current_mapping ||
> >> strcmp(s->current_mapping->path, mapping->path))
> >> 
> >> > +/* open file */
> >> > +int fd = qemu_open_old(mapping->path,
> >> > O_RDONLY | O_BINARY | O_LARGEFILE);
> >> > -if(fd<0)
> >> > -return -1;
> >> > -vvfat_close_current_file(s);
> >> > -s->current_fd = fd;
> >> > +if (fd < 0) {
> >> > +return -1;
> >> > +}
> >> > +vvfat_close_current_file(s);
> >> > +
> >> > +s->current_fd = fd;
> >> > +}
> >> > +assert(s->current_fd);
> >> >  s->current_mapping = mapping;
> >> >  }
> >> 
> >> Kevin
> >> 
> > 
> 




Re: [PATCH v2 1/3] block/commit: implement final flush

2024-07-18 Thread Kevin Wolf
Am 26.06.2024 um 16:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Actually the block job is not complete without the final flush. It's
> rather unexpected to have a broken target when the job was successfully
> completed long ago and we now fail to flush, or the process just
> crashed/was killed.
> 
> Mirror job already has mirror_flush() for this. So, it's OK.
> 
> Do this for commit job too.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/commit.c | 41 +++--
>  1 file changed, 27 insertions(+), 14 deletions(-)
> 
> diff --git a/block/commit.c b/block/commit.c
> index 7c3fdcb0ca..81971692a2 100644
> --- a/block/commit.c
> +++ b/block/commit.c
> @@ -134,6 +134,7 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
>  int64_t n = 0; /* bytes */
>  QEMU_AUTO_VFREE void *buf = NULL;
>  int64_t len, base_len;
> +bool need_final_flush = true;
>  
>  len = blk_co_getlength(s->top);
>  if (len < 0) {
> @@ -155,8 +156,8 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
>  
>  buf = blk_blockalign(s->top, COMMIT_BUFFER_SIZE);
>  
> -for (offset = 0; offset < len; offset += n) {
> -bool copy;
> +for (offset = 0; offset < len || need_final_flush; offset += n) {

In general, the control flow would be nicer to read if the final flush
weren't integrated into the loop, but just added after it.

But I assume this is pretty much required for pausing the job during
error handling in the final flush if you don't want to duplicate a lot
of the logic into a second loop?
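
Just to illustrate, the duplicated-loop variant I had in mind would look
roughly like this (untested sketch, with the error handling duplicated
from the copy loop):

    for (offset = 0; offset < len; offset += n) {
        /* ... copy loop as before, without the flush branch ... */
    }

    do {
        ret = blk_co_flush(s->base);
        if (ret < 0) {
            BlockErrorAction action =
                block_job_error_action(&s->common, s->on_error,
                                       false, -ret);
            if (action == BLOCK_ERROR_ACTION_REPORT) {
                return ret;
            }
            job_pause_point(&s->common.job);
        }
    } while (ret < 0 && !job_is_cancelled(&s->common.job));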

> +bool copy = false;
>  bool error_in_source = true;
>  
>  /* Note that even when no rate limit is applied we need to yield
> @@ -166,22 +167,34 @@ static int coroutine_fn commit_run(Job *job, Error **errp)
>  if (job_is_cancelled(&s->common.job)) {
>  break;
>  }
> -/* Copy if allocated above the base */
> -ret = blk_co_is_allocated_above(s->top, s->base_overlay, true,
> -offset, COMMIT_BUFFER_SIZE, &n);
> -copy = (ret > 0);
> -trace_commit_one_iteration(s, offset, n, ret);
> -if (copy) {
> -assert(n < SIZE_MAX);
>  
> -ret = blk_co_pread(s->top, offset, n, buf, 0);
> -if (ret >= 0) {
> -ret = blk_co_pwrite(s->base, offset, n, buf, 0);
> -if (ret < 0) {
> -error_in_source = false;
> +if (offset < len) {
> +/* Copy if allocated above the base */
> +ret = blk_co_is_allocated_above(s->top, s->base_overlay, true,
> +offset, COMMIT_BUFFER_SIZE, &n);
> +copy = (ret > 0);
> +trace_commit_one_iteration(s, offset, n, ret);
> +if (copy) {
> +assert(n < SIZE_MAX);
> +
> +ret = blk_co_pread(s->top, offset, n, buf, 0);
> +if (ret >= 0) {
> +ret = blk_co_pwrite(s->base, offset, n, buf, 0);
> +if (ret < 0) {
> +error_in_source = false;
> +}
>  }
>  }
> +} else {
> +assert(need_final_flush);
> +ret = blk_co_flush(s->base);
> +if (ret < 0) {
> +error_in_source = false;
> +} else {
> +need_final_flush = false;
> +}

Should we set n = 0 in this block to avoid counting the last chunk twice
for the progress?
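
I.e. the flush branch would then be (sketch):

    } else {
        assert(need_final_flush);
        n = 0;
        ret = blk_co_flush(s->base);
        if (ret < 0) {
            error_in_source = false;
        } else {
            need_final_flush = false;
        }
    }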

>  }
> +
>  if (ret < 0) {
>  BlockErrorAction action =
>  block_job_error_action(&s->common, s->on_error,

Kevin




Re: [PATCH v2] block-backend: per-device throttling of BLOCK_IO_ERROR reports

2024-07-18 Thread Kevin Wolf
Am 09.01.2024 um 14:13 hat Vladimir Sementsov-Ogievskiy geschrieben:
> From: Leonid Kaplan 
> 
> BLOCK_IO_ERROR events come from the guest, so we must throttle them.
> We still want per-device throttling, so let's use the device id as a key.
> 
> Signed-off-by: Leonid Kaplan 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v2: add Note: to QAPI doc
> 
>  monitor/monitor.c| 10 ++
>  qapi/block-core.json |  2 ++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/monitor/monitor.c b/monitor/monitor.c
> index 01ede1babd..ad0243e9d7 100644
> --- a/monitor/monitor.c
> +++ b/monitor/monitor.c
> @@ -309,6 +309,7 @@ int error_printf_unless_qmp(const char *fmt, ...)
>  static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
>  /* Limit guest-triggerable events to 1 per second */
>  [QAPI_EVENT_RTC_CHANGE]= { 1000 * SCALE_MS },
> +[QAPI_EVENT_BLOCK_IO_ERROR]= { 1000 * SCALE_MS },
>  [QAPI_EVENT_WATCHDOG]  = { 1000 * SCALE_MS },
>  [QAPI_EVENT_BALLOON_CHANGE]= { 1000 * SCALE_MS },
>  [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
> @@ -498,6 +499,10 @@ static unsigned int qapi_event_throttle_hash(const void *key)
>  hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
>  }
>  
> +if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
> +hash += g_str_hash(qdict_get_str(evstate->data, "device"));
> +}

Using "device" only works with -drive, i.e. when the BlockBackend
actually has a name. In modern configurations with a -blockdev
referenced by -device, the BlockBackend doesn't have a name any more.

Maybe we should be using the qdev id (or more generally, QOM path) here,
but that's something the event doesn't even contain yet.
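
If it did, the hash could simply mirror the existing qom-path case in
the context above (hypothetical, since BLOCK_IO_ERROR doesn't carry a
qom-path member yet):

    if (evstate->event == QAPI_EVENT_BLOCK_IO_ERROR) {
        hash += g_str_hash(qdict_get_str(evstate->data, "qom-path"));
    }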

> +
>  return hash;
>  }
>  
> @@ -525,6 +530,11 @@ static gboolean qapi_event_throttle_equal(const void *a, const void *b)
> qdict_get_str(evb->data, "qom-path"));
>  }
>  
> +if (eva->event == QAPI_EVENT_BLOCK_IO_ERROR) {
> +return !strcmp(qdict_get_str(eva->data, "device"),
> +   qdict_get_str(evb->data, "device"));
> +}
> +
>  return TRUE;
>  }
>  
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index ca390c5700..32c2c2f030 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -5559,6 +5559,8 @@
>  # Note: If action is "stop", a STOP event will eventually follow the
>  # BLOCK_IO_ERROR event
>  #
> +# Note: This event is rate-limited.
> +#
>  # Since: 0.13
>  #
>  # Example:

Kevin




Re: [PATCH] block/curl: rewrite http header parsing function

2024-07-18 Thread Kevin Wolf
Am 29.06.2024 um 16:25 hat Michael Tokarev geschrieben:
> Existing code was long, unclear and twisty.
> 
> Signed-off-by: Michael Tokarev 
> ---
>  block/curl.c | 44 ++--
>  1 file changed, 18 insertions(+), 26 deletions(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 419f7c89ef..9802d0319d 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -210,37 +210,29 @@ static size_t curl_header_cb(void *ptr, size_t size, size_t nmemb, void *opaque)
>  {
>  BDRVCURLState *s = opaque;
>  size_t realsize = size * nmemb;
> -const char *header = (char *)ptr;
> -const char *end = header + realsize;
> -const char *accept_ranges = "accept-ranges:";
> -const char *bytes = "bytes";
> +const char *p = ptr;
> +const char *end = p + realsize;
> +const char *t = "accept-ranges : bytes "; /* A lowercase template */

I don't think spaces between the field name and the colon are allowed
in the spec (and in the old code), only before and after the value.

> -if (realsize >= strlen(accept_ranges)
> -&& g_ascii_strncasecmp(header, accept_ranges,
> -   strlen(accept_ranges)) == 0) {
> -
> -char *p = strchr(header, ':') + 1;
> -
> -/* Skip whitespace between the header name and value. */
> -while (p < end && *p && g_ascii_isspace(*p)) {
> -p++;
> -}
> -
> -if (end - p >= strlen(bytes)
> -&& strncmp(p, bytes, strlen(bytes)) == 0) {
> -
> -/* Check that there is nothing but whitespace after the value. */
> -p += strlen(bytes);
> -while (p < end && *p && g_ascii_isspace(*p)) {
> -p++;
> -}
> -
> -if (p == end || !*p) {
> -s->accept_range = true;
> +/* check if header matches the "t" template */
> +for (;;) {
> +if (*t == ' ') { /* space in t matches any amount of isspace in p */
> +if (p < end && g_ascii_isspace(*p)) {
> +++p;
> +} else {
> +++t;
>  }
> +} else if (*t && p < end && *t == g_ascii_tolower(*p)) {
> +++p, ++t;
> +} else {
> +break;
>  }
>  }
>  
> +if (!*t && p == end) { /* if we managed to reach ends of both strings */
> +s->accept_range = true;
> +}

Maybe make the generic comparison with a template a separate function
(maybe even in cutils.c?) so that curl_header_cb() essentially only has
something like this any more:

if (!qemu_memcasecmp_space(ptr, end, "accept-ranges: bytes ")) {
s->accept_range = true;
}

(A better name for the function would be preferable, of course. Maybe
also a bool return value, but if it has a name related to memcmp() or
strcmp(), then 0 must mean it matches.)

Then this would really highlight the curl specific logic rather than the
string parser in curl_header_cb().
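
FWIW, the helper itself could be something like this (untested sketch;
'end' is exclusive, a space in the template matches any run of
whitespace including none, and 0 means match):

    static int qemu_memcasecmp_space(const char *p, const char *end,
                                     const char *t)
    {
        while (*t && p < end) {
            if (*t == ' ') {
                if (g_ascii_isspace(*p)) {
                    p++;
                } else {
                    t++;
                }
            } else if (g_ascii_tolower(*p) == *t) {
                p++;
                t++;
            } else {
                return 1;
            }
        }
        /* allow the template to end in trailing (optional) whitespace */
        while (*t == ' ') {
            t++;
        }
        return !(*t == '\0' && p == end);
    }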

Kevin




Re: [PATCH] qapi-block-core: Clean up blockdev-snapshot-internal-sync doc

2024-07-18 Thread Kevin Wolf
Am 18.07.2024 um 14:36 hat Markus Armbruster geschrieben:
> BlockdevSnapshotInternal is the arguments type of command
> blockdev-snapshot-internal-sync.  Its doc comment contains this note:
> 
> # .. note:: In a transaction, if @name is empty or any snapshot matching
> #@name exists, the operation will fail.  Only some image formats
> #support it; for example, qcow2, and rbd.
> 
> "In a transaction" is misleading, and "if @name is empty or any
> snapshot matching @name exists, the operation will fail" is redundant
> with the command's Errors documentation.  Drop.
> 
> The remainder is fine.  Move it to the command's doc comment, where it
> is more prominently visible, with a slight rephrasing for clarity.
> 
> Signed-off-by: Markus Armbruster 

Thanks, applied to the block branch.

Kevin




Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter

2024-07-18 Thread Kevin Wolf
Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> From: "Denis V. Lunev" 
> 
> We have observed that some clusters in the QCOW2 files are zeroed
> while the preallocation filter is used.
> 
> We are able to trace down the following sequence when prealloc-filter
> is used:
> co=0x55e7cbed7680 qcow2_co_pwritev_task()
> co=0x55e7cbed7680 preallocate_co_pwritev_part()
> co=0x55e7cbed7680 handle_write()
> co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> co=0x7f9edb7fe500 do_fallocate()
> 
> Here coroutine 0x55e7cbed7680 is blocked waiting until coroutine
> 0x7f9edb7fe500 finishes the fallocate of the file area. OK. It is
> time to handle the next coroutine, which is
> co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> co=0x55e7cbee91b0 handle_write()
> co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> co=0x7f9edb7deb00 do_fallocate()
> 
> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> the same area. This means that (once fallocate is started inside
> 0x7f9edb7deb00) the original fallocate could end and the real write
> will be executed. In that case the write() request is handled at the
> same time as fallocate().
> 
> The patch moves s->file_lock assignment before fallocate and that is

s/file_lock/file_end/?

> crucial. The idea is that all subsequent requests into the area
> being preallocated will be issued as just writes without fallocate
> to this area and they will not proceed thanks to the overlapping
> requests mechanics. If preallocation fails, we will just switch
> to the normal expand-by-write behavior and that is not a problem
> except for performance.
> 
> Signed-off-by: Denis V. Lunev 
> Tested-by: Andrey Drobyshev 
> ---
>  block/preallocate.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/preallocate.c b/block/preallocate.c
> index d215bc5d6d..ecf0aa4baa 100644
> --- a/block/preallocate.c
> +++ b/block/preallocate.c
> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>  
>  want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>  
> +/*
> + * Assign file_end before making actual preallocation. This will ensure
> + * that next request performed while preallocation is in progress will
> + * be passed without preallocation.
> + */
> +s->file_end = prealloc_end;
> +
>  ret = bdrv_co_pwrite_zeroes(
>  bs->file, prealloc_start, prealloc_end - prealloc_start,
>  BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>  return false;
>  }
>  
> -s->file_end = prealloc_end;
>  return want_merge_zero;
>  }

Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
unchanged. In other words, if anything calls preallocate_co_getlength()
in the meantime, don't we run into...

ret = bdrv_co_getlength(bs->file->bs);

if (has_prealloc_perms(bs)) {
s->file_end = s->zero_start = s->data_end = ret;
}

...and reset s->file_end back to the old value, re-enabling the bug
you're trying to fix here?

Or do we know that no such code path can be called concurrently for some
reason?

Kevin




Re: [PATCH v5 3/5] vvfat: Fix wrong checks for cluster mappings invariant

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> What this `abort` was intended to check was:
> - if `mapping->first_mapping_index` is not the same as
>   `first_mapping_index`, which **should** happen only in one case:
>   when we are handling the first mapping. In that case
>   `mapping->first_mapping_index == -1`; in all other cases, the
>   mappings after the first should have the condition `true`.
> - From the above, we know that this is the first mapping, so if the offset
>   is not `0`, then abort, since this is an invalid state.
> 
> The issue was that `first_mapping_index` is not set if we are
> checking from the middle: the variable `first_mapping_index` is
> only set if we passed through the `cluster_was_modified` check with the
> first mapping and then, in the same function call, checked the other
> mappings.
> 
> One approach is to go into the loop even if `cluster_was_modified`
> is not true so that we will be able to set `first_mapping_index` for the
> first mapping, but since `first_mapping_index` is only used here,
> another approach is to just check manually for
> `mapping->first_mapping_index != -1`, since we know that this is the
> value for the only entry where `offset == 0` (i.e. the first mapping).
> 
> Signed-off-by: Amjad Alsharafi 

Reviewed-by: Kevin Wolf 




Re: [PATCH v5 5/5] iotests: Add `vvfat` tests

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> Added several tests to verify the implementation of the vvfat driver.
> 
> We needed a way to interact with it, so we created a basic `fat16.py`
> driver that handles writing correct sectors for us.
> 
> Added `vvfat` to the non-generic formats, as it's not a normal image format.
> 
> Signed-off-by: Amjad Alsharafi 

> +def truncate_file(
> +self,
> +entry: FatDirectoryEntry,
> +new_size: int,
> +allocate_non_continuous: bool = False,
> +):
> +"""
> +Truncate the file at the given path to the new size.
> +"""
> +if entry is None:
> +raise Exception("entry is None")
> +if entry.attributes & 0x10 != 0:
> +raise Exception(f"{entry.whole_name()} is a directory")
> +
> +def clusters_from_size(size: int):
> +return (
> +size + self.boot_sector.cluster_bytes() - 1
> +) // self.boot_sector.cluster_bytes()
> +
> +# First, allocate new FATs if we need to
> +required_clusters = clusters_from_size(new_size)
> +current_clusters = clusters_from_size(entry.size_bytes)
> +
> +affected_clusters = set()
> +
> +# Keep at least one cluster, easier to manage this way
> +if required_clusters == 0:
> +required_clusters = 1
> +if current_clusters == 0:
> +current_clusters = 1
> +
> +if required_clusters > current_clusters:
> +# Allocate new clusters
> +cluster = entry.cluster
> +to_add = required_clusters
> +for _ in range(current_clusters - 1):
> +to_add -= 1
> +cluster = self.next_cluster(cluster)
> +assert required_clusters > 0, "No new clusters to allocate"
> +assert cluster is not None, "Cluster is None"
> +assert (
> +self.next_cluster(cluster) is None
> +), "Cluster is not the last cluster"
> +
> +# Allocate new clusters
> +for _ in range(to_add - 1):
> +new_cluster = self.next_free_cluster()
> +if allocate_non_continuous:
> +new_cluster = self.next_free_cluster_non_continuous()

The normal self.next_free_cluster() could be in an else branch. No
reason to search for a free cluster when you immediately overwrite it
anyway.

> +self.write_fat_entry(cluster, new_cluster)
> +self.write_fat_entry(new_cluster, 0x)
> +cluster = new_cluster
> +
> +elif required_clusters < current_clusters:
> +# Truncate the file
> +cluster = entry.cluster
> +for _ in range(required_clusters - 1):
> +cluster = self.next_cluster(cluster)
> +assert cluster is not None, "Cluster is None"
> +
> +next_cluster = self.next_cluster(cluster)
> +# mark last as EOF
> +self.write_fat_entry(cluster, 0x)
> +# free the rest
> +while next_cluster is not None:
> +cluster = next_cluster
> +next_cluster = self.next_cluster(next_cluster)
> +self.write_fat_entry(cluster, 0)
> +
> +self.flush_fats()
> +
> +# verify number of clusters
> +cluster = entry.cluster
> +count = 0
> +while cluster is not None:
> +count += 1
> +affected_clusters.add(cluster)
> +cluster = self.next_cluster(cluster)
> +assert (
> +count == required_clusters
> +), f"Expected {required_clusters} clusters, got {count}"
> +
> +# update the size
> +entry.size_bytes = new_size
> +self.update_direntry(entry)
> +
> +# trigger every affected cluster
> +    for cluster in affected_clusters:
> +first_sector = self.boot_sector.first_sector_of_cluster(cluster)
> +first_sector_data = self.read_sectors(first_sector, 1)
> +self.write_sectors(first_sector, first_sector_data)

Other than this, the patch looks good to me and we seem to test all the
cases that are fixed by the previous patches.

Reviewed-by: Kevin Wolf 
Tested-by: Kevin Wolf 

Kevin




Re: [PATCH v5 2/5] vvfat: Fix usage of `info.file.offset`

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> The field is marked as "the offset in the file (in clusters)", but it
> was being used like this
> `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> 
> Signed-off-by: Amjad Alsharafi 

Reviewed-by: Kevin Wolf 




Re: [PATCH v5 4/5] vvfat: Fix reading files with non-continuous clusters

2024-07-18 Thread Kevin Wolf
Am 12.06.2024 um 14:43 hat Amjad Alsharafi geschrieben:
> When reading with `read_cluster` we get the `mapping` with
> `find_mapping_for_cluster` and then we call `open_file` for this
> mapping.
> The issue appears when it's the same file, but a second cluster that is
> not immediately after it. Imagine clusters `500 -> 503`: this will give
> us 2 mappings, one with the range `500..501` and another `503..504`, both
> pointing to the same file, but at different offsets.
> 
> When we don't open the file (since the path is the same), we won't assign
> `s->current_mapping` and thus access way out of bounds of the file.
> 
> From our example above, after `open_file` (that didn't open anything) we
> will get the offset into the file with
> `s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
> give us `0x2000 * (504-500)`, which is out of bounds for this mapping and
> will produce some issues.
> 
> Signed-off-by: Amjad Alsharafi 
> ---
>  block/vvfat.c | 23 ---
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/block/vvfat.c b/block/vvfat.c
> index b63ac5d045..fc570d0610 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -1360,15 +1360,24 @@ static int open_file(BDRVVVFATState* s,mapping_t* mapping)
>  {
>  if(!mapping)
>  return -1;
> +int new_path = 1;
>  if(!s->current_mapping ||
> -strcmp(s->current_mapping->path,mapping->path)) {
> -/* open file */
> -int fd = qemu_open_old(mapping->path,
> +s->current_mapping->info.file.offset
> +!= mapping->info.file.offset ||

I'm wondering if this couldn't just be s->current_mapping != mapping?

> +(new_path = strcmp(s->current_mapping->path, mapping->path))) {

If both the path and the offset change, we still want to set new_path, I
think. And if we didn't already have a mapping, we also need to open the
file.

Actually, setting a variable inside the condition makes it kind of hard
to read, so if s->current_mapping != mapping works, we can do the check
only in the condition below:

> +if (new_path) {

if (!s->current_mapping ||
strcmp(s->current_mapping->path, mapping->path))

> +/* open file */
> +int fd = qemu_open_old(mapping->path,
> O_RDONLY | O_BINARY | O_LARGEFILE);
> -if(fd<0)
> -return -1;
> -vvfat_close_current_file(s);
> -s->current_fd = fd;
> +if (fd < 0) {
> +return -1;
> +}
> +vvfat_close_current_file(s);
> +
> +s->current_fd = fd;
> +}
> +assert(s->current_fd);
>  s->current_mapping = mapping;
>  }
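
To put both of these together, the whole block could then look roughly
like this (untested):

    if (s->current_mapping != mapping) {
        if (!s->current_mapping ||
            strcmp(s->current_mapping->path, mapping->path)) {
            /* open file */
            int fd = qemu_open_old(mapping->path,
                                   O_RDONLY | O_BINARY | O_LARGEFILE);
            if (fd < 0) {
                return -1;
            }
            vvfat_close_current_file(s);
            s->current_fd = fd;
        }
        assert(s->current_fd);
        s->current_mapping = mapping;
    }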

Kevin




Re: [PULL 4/4] block: Parse filenames only when explicitly requested

2024-07-04 Thread Kevin Wolf
Am 03.07.2024 um 23:16 hat Michael Tokarev geschrieben:
> 02.07.2024 19:39, Kevin Wolf wrote:
> > When handling image filenames from legacy options such as -drive or from
> > tools, these filenames are parsed for protocol prefixes, including for
> > the json:{} pseudo-protocol.
> > 
> > This behaviour is intended for filenames that come directly from the
> > command line and for backing files, which may come from the image file
> > itself. Higher level management tools generally take care to verify that
> > untrusted images don't contain a bad (or any) backing file reference;
> > 'qemu-img info' is a suitable tool for this.
> > 
> > However, for other files that can be referenced in images, such as
> > qcow2 data files or VMDK extents, the string from the image file is
> > usually not verified by management tools - and 'qemu-img info' wouldn't
> > be suitable because in contrast to backing files, it already opens these
> > other referenced files. So here the string should be interpreted as a
> > literal local filename. More complex configurations need to be specified
> > explicitly on the command line or in QMP.
> > 
> > This patch changes bdrv_open_inherit() so that it only parses filenames
> > if a new parameter parse_filename is true. It is set for the top level
> > in bdrv_open(), for the file child and for the backing file child. All
> > other callers pass false and disable filename parsing this way.
> 
> I'm attaching backports of this change to 8.2 and 7.2 series.
> 
> Please check if the resulting backports are correct, or if they're needed
> in the first place.  Note: 7.2 lacks quite a bit of the locking in this area, e.g.
> v8.0.0-2069-g8394c35ee148 "block: Fix AioContext locking in bdrv_open_child()"
> v8.2.0-rc0-59-g6bc0bcc89f84 "block: Fix deadlocks in bdrv_graph_wrunlock()".

Apart from minor style differences, your 7.2 backport is a perfect match
of the RHEL 9.2 backport which I already reviewed downstream. The 8.2
patch is a bit different from the RHEL 9.4 one because we backported the
final bits of the multiqueue work, but the differences make sense for
upstream QEMU 8.2.

So both patches look good to me.

Kevin




[PULL 3/4] iotests/270: Don't store data-file with json: prefix in image

2024-07-02 Thread Kevin Wolf
We want to disable filename parsing for data files because it's too easy
to abuse in malicious image files. Make the test ready for the change by
passing the data file explicitly in command line options.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/270 | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/270 b/tests/qemu-iotests/270
index 74352342db..c37b674aa2 100755
--- a/tests/qemu-iotests/270
+++ b/tests/qemu-iotests/270
@@ -60,8 +60,16 @@ _make_test_img -o cluster_size=2M,data_file="$TEST_IMG.orig" \
 # "write" 2G of data without using any space.
 # (qemu-img create does not like it, though, because null-co does not
 # support image creation.)
-$QEMU_IMG amend -o data_file="json:{'driver':'null-co',,'size':'4294967296'}" \
-"$TEST_IMG"
+test_img_with_null_data="json:{
+'driver': '$IMGFMT',
+'file': {
+'filename': '$TEST_IMG'
+},
+'data-file': {
+'driver': 'null-co',
+'size':'4294967296'
+}
+}"
 
 # This gives us a range of:
 #   2^31 - 512 + 768 - 1 = 2^31 + 255 > 2^31
@@ -74,7 +82,7 @@ $QEMU_IMG amend -o data_file="json:{'driver':'null-co',,'size':'4294967296'}" \
 # on L2 boundaries, we need large L2 tables; hence the cluster size of
 # 2 MB.  (Anything from 256 kB should work, though, because then one L2
 # table covers 8 GB.)
-$QEMU_IO -c "write 768 $((2 ** 31 - 512))" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write 768 $((2 ** 31 - 512))" "$test_img_with_null_data" | _filter_qemu_io
 
 _check_test_img
 
-- 
2.45.2




[PULL 2/4] iotests/244: Don't store data-file with protocol in image

2024-07-02 Thread Kevin Wolf
We want to disable filename parsing for data files because it's too easy
to abuse in malicious image files. Make the test ready for the change by
passing the data file explicitly in command line options.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/244 | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/244 b/tests/qemu-iotests/244
index 3e61fa25bb..bb9cc6512f 100755
--- a/tests/qemu-iotests/244
+++ b/tests/qemu-iotests/244
@@ -215,9 +215,22 @@ $QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n -C "$TEST_IMG.src" "$TEST_IMG"
 $QEMU_IMG compare -f $IMGFMT -F $IMGFMT "$TEST_IMG.src" "$TEST_IMG"
 
 # blkdebug doesn't support copy offloading, so this tests the error path
-$QEMU_IMG amend -f $IMGFMT -o "data_file=blkdebug::$TEST_IMG.data" "$TEST_IMG"
-$QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n -C "$TEST_IMG.src" "$TEST_IMG"
-$QEMU_IMG compare -f $IMGFMT -F $IMGFMT "$TEST_IMG.src" "$TEST_IMG"
+test_img_with_blkdebug="json:{
+'driver': 'qcow2',
+'file': {
+'driver': 'file',
+'filename': '$TEST_IMG'
+},
+'data-file': {
+'driver': 'blkdebug',
+'image': {
+'driver': 'file',
+'filename': '$TEST_IMG.data'
+}
+}
+}"
+$QEMU_IMG convert -f $IMGFMT -O $IMGFMT -n -C "$TEST_IMG.src" "$test_img_with_blkdebug"
+$QEMU_IMG compare -f $IMGFMT -F $IMGFMT "$TEST_IMG.src" "$test_img_with_blkdebug"
 
 echo
 echo "=== Flushing should flush the data file ==="
-- 
2.45.2




[PULL 4/4] block: Parse filenames only when explicitly requested

2024-07-02 Thread Kevin Wolf
When handling image filenames from legacy options such as -drive or from
tools, these filenames are parsed for protocol prefixes, including for
the json:{} pseudo-protocol.

This behaviour is intended for filenames that come directly from the
command line and for backing files, which may come from the image file
itself. Higher level management tools generally take care to verify that
untrusted images don't contain a bad (or any) backing file reference;
'qemu-img info' is a suitable tool for this.

However, for other files that can be referenced in images, such as
qcow2 data files or VMDK extents, the string from the image file is
usually not verified by management tools - and 'qemu-img info' wouldn't
be suitable because in contrast to backing files, it already opens these
other referenced files. So here the string should be interpreted as a
literal local filename. More complex configurations need to be specified
explicitly on the command line or in QMP.

This patch changes bdrv_open_inherit() so that it only parses filenames
if a new parameter parse_filename is true. It is set for the top level
in bdrv_open(), for the file child and for the backing file child. All
other callers pass false and disable filename parsing this way.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 block.c | 90 -
 1 file changed, 57 insertions(+), 33 deletions(-)

diff --git a/block.c b/block.c
index c1cc313d21..c317de9eaa 100644
--- a/block.c
+++ b/block.c
@@ -86,6 +86,7 @@ static BlockDriverState *bdrv_open_inherit(const char *filename,
BlockDriverState *parent,
const BdrvChildClass *child_class,
BdrvChildRole child_role,
+   bool parse_filename,
Error **errp);
 
 static bool bdrv_recurse_has_child(BlockDriverState *bs,
@@ -2055,7 +2056,8 @@ static void parse_json_protocol(QDict *options, const char **pfilename,
  * block driver has been specified explicitly.
  */
 static int bdrv_fill_options(QDict **options, const char *filename,
- int *flags, Error **errp)
+ int *flags, bool allow_parse_filename,
+ Error **errp)
 {
 const char *drvname;
 bool protocol = *flags & BDRV_O_PROTOCOL;
@@ -2097,7 +2099,7 @@ static int bdrv_fill_options(QDict **options, const char *filename,
 if (protocol && filename) {
 if (!qdict_haskey(*options, "filename")) {
 qdict_put_str(*options, "filename", filename);
-parse_filename = true;
+parse_filename = allow_parse_filename;
 } else {
 error_setg(errp, "Can't specify 'file' and 'filename' options at "
  "the same time");
@@ -3660,7 +3662,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
 }
 
 backing_hd = bdrv_open_inherit(backing_filename, reference, options, 0, bs,
-   &child_of_bds, bdrv_backing_role(bs), errp);
+   &child_of_bds, bdrv_backing_role(bs), true,
+   errp);
 if (!backing_hd) {
 bs->open_flags |= BDRV_O_NO_BACKING;
 error_prepend(errp, "Could not open backing file: ");
@@ -3694,7 +3697,8 @@ free_exit:
 static BlockDriverState *
 bdrv_open_child_bs(const char *filename, QDict *options, const char *bdref_key,
BlockDriverState *parent, const BdrvChildClass *child_class,
-   BdrvChildRole child_role, bool allow_none, Error **errp)
+   BdrvChildRole child_role, bool allow_none,
+   bool parse_filename, Error **errp)
 {
 BlockDriverState *bs = NULL;
 QDict *image_options;
@@ -3725,7 +3729,8 @@ bdrv_open_child_bs(const char *filename, QDict *options, const char *bdref_key,
 }
 
 bs = bdrv_open_inherit(filename, reference, image_options, 0,
-   parent, child_class, child_role, errp);
+   parent, child_class, child_role, parse_filename,
+   errp);
 if (!bs) {
 goto done;
 }
@@ -3735,6 +3740,33 @@ done:
 return bs;
 }
 
+static BdrvChild *bdrv_open_child_common(const char *filename,
+ QDict *options, const char *bdref_key,
+ BlockDriverState *parent,
+ const BdrvChildClass *child_class,
+ BdrvChildRole child_role,
+ 

[PULL 1/4] qcow2: Don't open data_file with BDRV_O_NO_IO

2024-07-02 Thread Kevin Wolf
One use case for 'qemu-img info' is verifying that untrusted images
don't reference an unwanted external file, be it as a backing file or an
external data file. To make sure that calling 'qemu-img info' can't
already have undesired side effects with a malicious image, just don't
open the data file at all with BDRV_O_NO_IO. If nothing ever tries to do
I/O, we don't need to have it open.

This changes the output of iotests case 061, which used 'qemu-img info'
to show that opening an image with an invalid data file fails. After
this patch, it succeeds. Replace this part of the test with a qemu-io
call, but keep the final 'qemu-img info' to show that the invalid data
file is correctly displayed in the output.

Fixes: CVE-2024-4467
Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Eric Blake 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hanna Czenczek 
---
 block/qcow2.c  | 17 -
 tests/qemu-iotests/061 |  6 --
 tests/qemu-iotests/061.out |  8 ++--
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 10883a2494..70b19730a3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1636,7 +1636,22 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
 goto fail;
 }
 
-if (open_data_file) {
+if (open_data_file && (flags & BDRV_O_NO_IO)) {
+/*
+ * Don't open the data file for 'qemu-img info' so that it can be used
+ * to verify that an untrusted qcow2 image doesn't refer to external
+ * files.
+ *
+ * Note: This still makes has_data_file() return true.
+ */
+if (s->incompatible_features & QCOW2_INCOMPAT_DATA_FILE) {
+s->data_file = NULL;
+} else {
+s->data_file = bs->file;
+}
+qdict_extract_subqdict(options, NULL, "data-file.");
+qdict_del(options, "data-file");
+} else if (open_data_file) {
 /* Open external data file */
 bdrv_graph_co_rdunlock();
 s->data_file = bdrv_co_open_child(NULL, options, "data-file", bs,
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index 53c7d428e3..b71ac097d1 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -326,12 +326,14 @@ $QEMU_IMG amend -o "data_file=foo" "$TEST_IMG"
 echo
 _make_test_img -o "compat=1.1,data_file=$TEST_IMG.data" 64M
 $QEMU_IMG amend -o "data_file=foo" "$TEST_IMG"
-_img_info --format-specific
+$QEMU_IO -c "read 0 4k" "$TEST_IMG" 2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o 
data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" -c "read 0 4k" | 
_filter_qemu_io
 TEST_IMG="data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" _img_info 
--format-specific --image-opts
 
 echo
$QEMU_IMG amend -o "data_file=" --image-opts "data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG"
-_img_info --format-specific
+$QEMU_IO -c "read 0 4k" "$TEST_IMG" 2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o 
data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" -c "read 0 4k" | 
_filter_qemu_io
 TEST_IMG="data-file.filename=$TEST_IMG.data,file.filename=$TEST_IMG" _img_info 
--format-specific --image-opts
 
 echo
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 139fc68177..24c33add7c 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -545,7 +545,9 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 qemu-img: data-file can only be set for images that use an external data file
 
Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 data_file=TEST_DIR/t.IMGFMT.data
-qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Could not open 'foo': No such file or directory
+qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open 'foo': No such file or directory
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 image: TEST_DIR/t.IMGFMT
 file format: IMGFMT
 virtual size: 64 MiB (67108864 bytes)
@@ -560,7 +562,9 @@ Format specific information:
 corrupt: false
 extended l2: false
 
-qemu-img: Could not open 'TEST_DIR/t.IMGFMT': 'data-file' is required for this image
+qemu-io: can't open device TEST_DIR/t.IMGFMT: 'data-file' is required for this image
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 image: TEST_DIR/t.IMGFMT
 file format: IMGFMT
 virtual size: 64 MiB (67108864 bytes)
-- 
2.45.2




[PULL 0/4] Block layer patches (CVE-2024-4467)

2024-07-02 Thread Kevin Wolf
The following changes since commit c80a339587fe4148292c260716482dd2f86d4476:

  Merge tag 'pull-target-arm-20240701' of https://git.linaro.org/people/pmaydell/qemu-arm into staging (2024-07-01 10:41:45 -0700)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 7ead946998610657d38d1a505d5f25300d4ca613:

  block: Parse filenames only when explicitly requested (2024-07-02 18:12:30 +0200)


Block layer patches (CVE-2024-4467)

- Don't open qcow2 data files in 'qemu-img info'
- Disallow protocol prefixes for qcow2 data files, VMDK extent files and
  other child nodes that are neither 'file' nor 'backing'

----
Kevin Wolf (4):
  qcow2: Don't open data_file with BDRV_O_NO_IO
  iotests/244: Don't store data-file with protocol in image
  iotests/270: Don't store data-file with json: prefix in image
  block: Parse filenames only when explicitly requested

 block.c| 90 +-
 block/qcow2.c  | 17 -
 tests/qemu-iotests/061 |  6 ++--
 tests/qemu-iotests/061.out |  8 +++--
 tests/qemu-iotests/244 | 19 --
 tests/qemu-iotests/270 | 14 ++--
 6 files changed, 110 insertions(+), 44 deletions(-)




Re: [PATCH] scsi: Don't ignore most usb-storage properties

2024-07-02 Thread Kevin Wolf
Am 01.07.2024 um 15:08 hat Fiona Ebner geschrieben:
> Hi,
> 
> we got a user report about bootindex for an 'usb-storage' device not
> working anymore [0] and I reproduced it and bisected it to this patch.
> 
> Am 31.01.24 um 14:06 schrieb Kevin Wolf:
> > @@ -399,11 +397,10 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
> >  object_property_add_child(OBJECT(bus), name, OBJECT(dev));
> >  g_free(name);
> >  
> > +s = SCSI_DEVICE(dev);
> > +s->conf = *conf;
> > +
> >  qdev_prop_set_uint32(dev, "scsi-id", unit);
> > -if (bootindex >= 0) {
> > -object_property_set_int(OBJECT(dev), "bootindex", bootindex,
> > -&error_abort);
> > -}
> 
> The fact that this is not called anymore means that the 'set' method
> for the property is also not called. Here, that method is
> device_set_bootindex() (as configured by scsi_dev_instance_init() ->
> device_add_bootindex_property()). Therefore, the device is never
> registered via add_boot_device_path() meaning that the bootindex
> property does not have the desired effect anymore.

Hmm, yes, seems I missed this side effect.

Bringing back this one object_property_set_int() call would be the
easiest fix, but I wonder if an explicit add_boot_device_path() call
(and allowing that one to fail gracefully instead of directly calling
exit()) wouldn't be better than re-setting a property to its current
value just for the side effect.
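
Roughly something like this, i.e. registering the boot device path
directly instead of going through the property setter (untested; it
assumes the bootindex value now lives in conf->bootindex, and ignores
for the moment that add_boot_device_path() exit()s on conflicts instead
of failing gracefully):

    s = SCSI_DEVICE(dev);
    s->conf = *conf;

    qdev_prop_set_uint32(dev, "scsi-id", unit);
    if (conf->bootindex >= 0) {
        /* what device_set_bootindex() used to do as a side effect */
        add_boot_device_path(conf->bootindex, dev, NULL);
    }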

> Is it necessary to keep the object_property_set_{bool,int} and
> qdev_prop_set_enum calls around for these potential side effects? Would
> it even be necessary to introduce new similar calls for the newly
> supported properties? Or is there an easy alternative to
> s->conf = *conf;
> that does trigger the side effects?

I don't think the other properties whose setter we don't call any more
have side effects. They are processed during .realize, which is what I
probably expected for bootindex, too.

And that's really how all properties should be if we ever want to move
forward with the .instance_config approach for creating QOM objects
because then we won't call any setters during object creation any more,
they would only be for runtime changes.

Kevin




[PATCH 2/2] block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

2024-06-27 Thread Kevin Wolf
Upstream clang 18 (and backports to clang 17 in Fedora and RHEL)
implemented support for __attribute__((cleanup())) in its Thread Safety
Analysis, so we can now actually have a proper implementation of
WITH_GRAPH_RDLOCK_GUARD() that understands when we acquire and when we
release the lock.

-Wthread-safety is now only enabled if the compiler is new enough to
understand this pattern. In theory, we could have used some #ifdefs to
keep the existing basic checks on old compilers, but as long as someone
runs a newer compiler (and our CI does), we will catch locking problems,
so it's probably not worth keeping multiple implementations for this.

The implementation can't use g_autoptr any more because the glib macros
define wrapper functions that don't have the right TSA attributes, so
the compiler would complain about them. Just use the cleanup attribute
directly instead.

Signed-off-by: Kevin Wolf 
---
 include/block/graph-lock.h | 21 ++---
 meson.build| 14 +-
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/block/graph-lock.h b/include/block/graph-lock.h
index d7545e82d0..dc8d949184 100644
--- a/include/block/graph-lock.h
+++ b/include/block/graph-lock.h
@@ -209,31 +209,38 @@ typedef struct GraphLockable { } GraphLockable;
  * unlocked. TSA_ASSERT_SHARED() makes sure that the following calls know that
  * we hold the lock while unlocking is left unchecked.
  */
-static inline GraphLockable * TSA_ASSERT_SHARED(graph_lock) TSA_NO_TSA coroutine_fn
+static inline GraphLockable * TSA_ACQUIRE_SHARED(graph_lock) coroutine_fn
 graph_lockable_auto_lock(GraphLockable *x)
 {
 bdrv_graph_co_rdlock();
 return x;
 }
 
-static inline void TSA_NO_TSA coroutine_fn
-graph_lockable_auto_unlock(GraphLockable *x)
+static inline void TSA_RELEASE_SHARED(graph_lock) coroutine_fn
+graph_lockable_auto_unlock(GraphLockable **x)
 {
 bdrv_graph_co_rdunlock();
 }
 
-G_DEFINE_AUTOPTR_CLEANUP_FUNC(GraphLockable, graph_lockable_auto_unlock)
+#define GRAPH_AUTO_UNLOCK __attribute__((cleanup(graph_lockable_auto_unlock)))
 
+/*
+ * @var is only used to break the loop after the first iteration.
+ * @unlock_var can't be unlocked and then set to NULL because TSA wants the lock
+ * to be held at the start of every iteration of the loop.
+ */
 #define WITH_GRAPH_RDLOCK_GUARD_(var) \
-for (g_autoptr(GraphLockable) var = graph_lockable_auto_lock(GML_OBJ_()); \
+for (GraphLockable *unlock_var GRAPH_AUTO_UNLOCK =\
+graph_lockable_auto_lock(GML_OBJ_()), \
+*var = unlock_var;\
  var; \
- graph_lockable_auto_unlock(var), var = NULL)
+ var = NULL)
 
 #define WITH_GRAPH_RDLOCK_GUARD() \
 WITH_GRAPH_RDLOCK_GUARD_(glue(graph_lockable_auto, __COUNTER__))
 
 #define GRAPH_RDLOCK_GUARD(x)   \
-g_autoptr(GraphLockable)\
+GraphLockable * GRAPH_AUTO_UNLOCK   \
 glue(graph_lockable_auto, __COUNTER__) G_GNUC_UNUSED =  \
 graph_lockable_auto_lock(GML_OBJ_())
 
diff --git a/meson.build b/meson.build
index 97e00d6f59..b1d5ce5f1d 100644
--- a/meson.build
+++ b/meson.build
@@ -624,7 +624,19 @@ warn_flags = [
 ]
 
 if host_os != 'darwin'
-  warn_flags += ['-Wthread-safety']
+  tsa_has_cleanup = cc.compiles('''
+struct __attribute__((capability("mutex"))) mutex {};
+void lock(struct mutex *m) __attribute__((acquire_capability(m)));
+void unlock(struct mutex *m) __attribute__((release_capability(m)));
+
+void test(void) {
+  struct mutex __attribute__((cleanup(unlock))) m;
+  lock(&m);
+}
+  ''', args: ['-Wthread-safety', '-Werror'])
+  if tsa_has_cleanup
+warn_flags += ['-Wthread-safety']
+  endif
 endif
 
 # Set up C++ compiler flags
-- 
2.45.2




[PATCH 0/2] block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

2024-06-27 Thread Kevin Wolf
Newer clang versions allow us to check scoped guards more thoroughly.
Surprisingly, we only seem to have missed one instance with the old
incomplete checks.

Kevin Wolf (2):
  block-copy: Fix missing graph lock
  block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

 include/block/graph-lock.h | 21 ++---
 block/block-copy.c |  4 +++-
 meson.build| 14 +-
 3 files changed, 30 insertions(+), 9 deletions(-)

-- 
2.45.2




[PATCH 1/2] block-copy: Fix missing graph lock

2024-06-27 Thread Kevin Wolf
The graph lock needs to be held when calling bdrv_co_pdiscard(). Fix
block_copy_task_entry() to take it for the call.

WITH_GRAPH_RDLOCK_GUARD() was implemented in a weak way because of
limitations in clang's Thread Safety Analysis at the time, so that it
only asserts that the lock is held (which allows calling functions that
require the lock), but we never deal with the unlocking (so even after
the scope of the guard, the compiler assumes that the lock is still
held). This is why the compiler didn't catch this locking error.

Signed-off-by: Kevin Wolf 
---
 block/block-copy.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 7e3b378528..cc618e4561 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -595,7 +595,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
 if (s->discard_source && ret == 0) {
 int64_t nbytes =
 MIN(t->req.offset + t->req.bytes, s->len) - t->req.offset;
-bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+WITH_GRAPH_RDLOCK_GUARD() {
+bdrv_co_pdiscard(s->source, t->req.offset, nbytes);
+}
 }
 
 return ret;
-- 
2.45.2




Re: [PATCH] block/curl: explicitly assert that strchr returns non-NULL value

2024-06-27 Thread Kevin Wolf
Am 27.06.2024 um 17:30 hat Vladimir Sementsov-Ogievskiy geschrieben:
> strchr may return NULL if the colon is not found. It seems clearer to
> assert explicitly that we don't expect it here than to dereference
> address 1 in the next line.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/curl.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/curl.c b/block/curl.c
> index 419f7c89ef..ccfffd6c12 100644
> --- a/block/curl.c
> +++ b/block/curl.c
> @@ -219,7 +219,9 @@ static size_t curl_header_cb(void *ptr, size_t size, size_t nmemb, void *opaque)
>  && g_ascii_strncasecmp(header, accept_ranges,
> strlen(accept_ranges)) == 0) {
>  
> -char *p = strchr(header, ':') + 1;
> +char *p = strchr(header, ':');
> +assert(p != NULL);
> +p += 1;

I'm not sure if this is actually much clearer because it doesn't say why
we don't expect NULL here. If you don't look at the context, it almost
looks like an assert() where proper error handling is needed. If you do,
then the original line is clear enough.

My first thought was that maybe what we want is a comment, but we
actually already know where the colon is. So how about this instead:

char *p = header + strlen(accept_ranges);

Kevin

>  /* Skip whitespace between the header name and value. */
>  while (p < end && *p && g_ascii_isspace(*p)) {
> -- 
> 2.34.1
> 




Re: [PATCH v2] Consider discard option when writing zeros

2024-06-27 Thread Kevin Wolf
Am 26.06.2024 um 18:27 hat Nir Soffer geschrieben:
> On Wed, Jun 26, 2024 at 12:17 PM Daniel P. Berrangé wrote:
> 
> > On Mon, Jun 24, 2024 at 06:08:26PM +0200, Kevin Wolf wrote:
> > > Am 24.06.2024 um 17:23 hat Stefan Hajnoczi geschrieben:
> > > > On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> > > > > Tested using:
> > > >
> > > > Hi Nir,
> > > > This looks like a good candidate for the qemu-iotests test suite. Adding
> > > > it to the automated tests will protect against future regressions.
> > > >
> > > > Please add the script and the expected output to
> > > > tests/qemu-iotests/test/write-zeroes-unmap and run it using
> > > > `(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.
> > > >
> > > > See the existing test cases in tests/qemu-iotests/ and
> > > > tests/qemu-iotests/tests/ for examples. Some are shell scripts and
> > > > others are Python. I think shell makes sense for this test case. You
> > > > can copy the test framework boilerplate from an existing test case.
> > >
> > > 'du' can't be used like this in qemu-iotests because it makes
> > > assumptions that depend on the filesystem. A test case replicating what
> > > Nir did manually would likely fail on XFS with its preallocation.
> > >
> > > Maybe we could operate on a file exposed by the FUSE export that is
> > > backed by qcow2, and then you can use 'qemu-img map' on that qcow2 image
> > > to verify the allocation status. Somewhat complicated, but I think it
> > > could work.
> >
> > A simpler option would be to use 'du' but with a fuzzy range test,
> > rather than an exact equality test.
> >
> > For the tests which write 1 MB, check the 'du' usage is "at least 1MB",
> > for the tests which expect to unmap blocks, check that the 'du' usage
> > is "less than 256kb". This should be within bounds of xfs speculative
> > allocation.
> 
> This should work, I'll start with this approach.

If we're okay with accepting tests that depend on filesystem behaviour,
then 'qemu-img map -f raw --output=json' should be less risky
than checking 'du'.

Kevin




Re: [PATCH v2] Consider discard option when writing zeros

2024-06-26 Thread Kevin Wolf
Am 24.06.2024 um 23:12 hat Nir Soffer geschrieben:
> On Mon, Jun 24, 2024 at 7:08 PM Kevin Wolf  wrote:
> 
> > Am 24.06.2024 um 17:23 hat Stefan Hajnoczi geschrieben:
> > > On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> > > > Tested using:
> > >
> > > Hi Nir,
> > > This looks like a good candidate for the qemu-iotests test suite. Adding
> > > it to the automated tests will protect against future regressions.
> > >
> > > Please add the script and the expected output to
> > > tests/qemu-iotests/test/write-zeroes-unmap and run it using
> > > `(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.
> > >
> > > See the existing test cases in tests/qemu-iotests/ and
> > > tests/qemu-iotests/tests/ for examples. Some are shell scripts and
> > > others are Python. I think shell makes sense for this test case. You
> > > can copy the test framework boilerplate from an existing test case.
> >
> > 'du' can't be used like this in qemu-iotests because it makes
> > assumptions that depend on the filesystem. A test case replicating what
> > Nir did manually would likely fail on XFS with its preallocation.
> 
> This is why I did not try to add a new qemu-iotest yet.
> 
> > Maybe we could operate on a file exposed by the FUSE export that is
> > backed by qcow2, and then you can use 'qemu-img map' on that qcow2 image
> > to verify the allocation status. Somewhat complicated, but I think it
> > could work.
> 
> Do we have examples of using the FUSE export? It sounds complicated but
> being able to test on any file system is awesome. The complexity can be
> hidden behind simple test helpers.

We seem to have a few tests that use it, and then the fuse protocol
implementation, too. 308 and file-io-error look relevant.

> Another option is to use a specific file system created for the tests,
> for example on a loop device. We used userstorage[1] in ovirt to test
> on specific file systems with known sector size.

Creating loop devices requires root privileges. If I understand
correctly, userstorage solved that by having a setup phase as root
before running the tests as a normal user? We don't really have that in
qemu-iotests.

Some tests require passwordless sudo and are skipped otherwise, but this
means that in practice they are almost always skipped.

> But more important, are you ok with the change?
> 
> I'm not sure about not creating sparse images by default - this is not
> consistent with qemu-img convert and qemu-nbd, which do sparsify by
> default. The old behavior seems better.

Well, your patches make it do what we always claimed it would do, so
that consistency is certainly a good thing. Unmapping on write_zeroes
and ignoring truncate is a weird combination anyway that doesn't really
make any sense to me, so I don't think it's worth preserving. The other
way around could have been more defensible, but that's not how our bug
works.

Now, whether ignoring all discard requests is a good default these days is
a separate question and I'm really not sure. Maybe discard=unmap should
be the default (and apply to both discard and write_zeroes, of course).

Kevin




Re: [PATCH v2] Consider discard option when writing zeros

2024-06-24 Thread Kevin Wolf
Am 24.06.2024 um 17:23 hat Stefan Hajnoczi geschrieben:
> On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> > Tested using:
> 
> Hi Nir,
> This looks like a good candidate for the qemu-iotests test suite. Adding
> it to the automated tests will protect against future regressions.
> 
> Please add the script and the expected output to
> tests/qemu-iotests/test/write-zeroes-unmap and run it using
> `(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.
> 
> See the existing test cases in tests/qemu-iotests/ and
> tests/qemu-iotests/tests/ for examples. Some are shell scripts and
> others are Python. I think shell makes sense for this test case. You
> can copy the test framework boilerplate from an existing test case.

'du' can't be used like this in qemu-iotests because it makes
assumptions that depend on the filesystem. A test case replicating what
Nir did manually would likely fail on XFS with its preallocation.

Maybe we could operate on a file exposed by the FUSE export that is
backed by qcow2, and then you can use 'qemu-img map' on that qcow2 image
to verify the allocation status. Somewhat complicated, but I think it
could work.

Kevin




Re: [PATCH] block/file-posix: Consider discard flag when opening

2024-06-19 Thread Kevin Wolf
Am 18.06.2024 um 23:24 hat Nir Soffer geschrieben:
> Set has_discard only when BDRV_O_UNMAP is set. With this, users that
> want to keep their images fully allocated can disable hole punching
> when writing zeros or discarding using:
> 
>-drive file=thick.img,discard=off
> 
> This change is not entirely correct since it changes the default discard
> behavior.  Previously we always allowed punching holes, but now you
> must use discard=unmap|on to enable it. We probably need to add the
> BDRV_O_UNMAP flag by default.
> 
> make check still works, so maybe we don't have tests for sparsifying
> images, or maybe you need to run special tests that do not run by
> default. We need tests for keeping images non-sparse.
> 
> Signed-off-by: Nir Soffer 

So first of all, I agree with you that this patch is wrong. ;-)

At first, I failed to understand the problem this is trying to solve. I
put a debug message in handle_aiocb_discard() and checked with which
options it triggers. [1] To me, this looked exactly like it should be.
We only try to discard blocks when discard=unmap is given as an option.

That leaves the case of write_zeroes. And while at first sight the
code looked good, we do seem to have a problem there, and it tried to
unmap even with discard=off.

>  block/file-posix.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index be25e35ff6..acac2abadc 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -738,11 +738,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>  ret = -EINVAL;
>  goto fail;
>  }
>  #endif /* !defined(CONFIG_LINUX_IO_URING) */
>  
> -s->has_discard = true;
> +s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP);
>  s->has_write_zeroes = true;
>  
>  if (fstat(s->fd, &st) < 0) {
>  ret = -errno;
>  error_setg_errno(errp, errno, "Could not stat file");

s->has_discard is about what the host supports, not about the semantics
of the QEMU block node. So this doesn't feel right to me.

So for the buggy case, write_zeroes, bdrv_co_pwrite_zeroes() has code
that considers the case and clears the ~BDRV_REQ_MAY_UNMAP flags:

if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
flags &= ~BDRV_REQ_MAY_UNMAP;
}

But it turns out that we don't necessarily even go through this function
for the top node which has discard=off, so it can't take effect:

(gdb) bt
#0  0x74f2f144 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x74ed765e in raise () at /lib64/libc.so.6
#2  0x74ebf902 in abort () at /lib64/libc.so.6
#3  0x5615aff0 in raw_do_pwrite_zeroes (bs=0x57f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP, blkdev=false) at ../block/file-posix.c:3643
#4  0x5615557e in raw_co_pwrite_zeroes (bs=0x57f4bcf0, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/file-posix.c:3655
#5  0x560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x57f4bcf0, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
#6  0x560c72f9 in bdrv_aligned_pwritev (child=0x57f51460, req=0x7fffed5ff800, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
#7  0x560c6b41 in bdrv_co_do_zero_pwritev (child=0x57f51460, offset=0, bytes=1048576, flags=6, req=0x7fffed5ff800) at ../block/io.c:2183
#8  0x560c6647 in bdrv_co_pwritev_part (child=0x57f51460, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
#9  0x560c634f in bdrv_co_pwritev (child=0x57f51460, offset=0, bytes=1048576, qiov=0x0, flags=6) at ../block/io.c:2216
#10 0x560c75b5 in bdrv_co_pwrite_zeroes (child=0x57f51460, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/io.c:2322
#11 0x56117d24 in raw_co_pwrite_zeroes (bs=0x57f44980, offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/raw-format.c:307
#12 0x560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x57f44980, offset=0, bytes=1048576, flags=6) at ../block/io.c:1901
#13 0x560c72f9 in bdrv_aligned_pwritev (child=0x57f513f0, req=0x7fffed5ffd90, offset=0, bytes=1048576, align=1, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2100
#14 0x560c6b41 in bdrv_co_do_zero_pwritev (child=0x57f513f0, offset=0, bytes=1048576, flags=6, req=0x7fffed5ffd90) at ../block/io.c:2183
#15 0x560c6647 in bdrv_co_pwritev_part (child=0x57f513f0, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/io.c:2283
#16 0x560ad741 in blk_co_do_pwritev_part (blk=0x57f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1425
#17 0x560ad5f2 in blk_co_pwritev_part (blk=0x57f51660, offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at ../block/block-backend.c:1440
#18 0x560ad8cf in blk_co_pwritev (blk=

Re: [PATCH v5 5/5] iotests: Add `vvfat` tests

2024-06-13 Thread Kevin Wolf
Am 13.06.2024 um 16:07 hat Amjad Alsharafi geschrieben:
> On Wed, Jun 12, 2024 at 08:43:26PM +0800, Amjad Alsharafi wrote:
> > Added several tests to verify the implementation of the vvfat driver.
> > 
> > We needed a way to interact with it, so we created a basic `fat16.py`
> > driver that handles writing correct sectors for us.
> > 
> > Added `vvfat` to the non-generic formats, as it's not a normal image format.
> > 
> > Signed-off-by: Amjad Alsharafi 
> > ---
> >  tests/qemu-iotests/check   |   2 +-
> >  tests/qemu-iotests/fat16.py| 673 +
> >  tests/qemu-iotests/testenv.py  |   2 +-
> >  tests/qemu-iotests/tests/vvfat | 457 
> >  tests/qemu-iotests/tests/vvfat.out |   5 +
> >  5 files changed, 1137 insertions(+), 2 deletions(-)
> >  create mode 100644 tests/qemu-iotests/fat16.py
> >  create mode 100755 tests/qemu-iotests/tests/vvfat
> >  create mode 100755 tests/qemu-iotests/tests/vvfat.out
> 
> Btw Kevin, I'm not sure if I should add myself to the MAINTAINERS file
> for the new files (fat16.py and vvfat) or not

You can if you really want to, but there is no need.

Basically if you see yourself maintaining these files for a long time
instead of just contributing them now and you want to be contacted
when they are changed in the future, that's when you could consider it.
But the patches will still go through my or Hanna's tree and we already
cover all of qemu-iotests, so we don't have to have a formal
maintainership definition for these files specifically.

Kevin




Re: [PATCH v4 5/5] iotests: add backup-discard-source

2024-06-13 Thread Kevin Wolf
Am 12.06.2024 um 21:21 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11.06.24 20:49, Kevin Wolf wrote:
> > Am 13.03.2024 um 16:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > Add test for a new backup option: discard-source.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > Reviewed-by: Fiona Ebner 
> > > Tested-by: Fiona Ebner 
> > 
> > This test fails for me, and it already does so after this commit that
> > introduced it. I haven't checked what get_actual_size() does, but I'm running
> > on XFS, so its preallocation could be causing this. We generally avoid
> > checking the number of allocated blocks in image files for this reason.
> > 
> 
> Hmm right, I see that relying on allocated size is a bad thing. Better
> is to check block status, to see how many qcow2 clusters are
> allocated.
> 
> Do we have some qmp command to get such information? The simplest way
> I see is to add a dirty bitmap to the temp block-node, and then check its dirty
> count through query-named-block-nodes

Hm, does it have to be QMP in a running QEMU process? I'm not sure what
the best way would be there.

Otherwise, my approach would be 'qemu-img check' or even 'qemu-img map',
depending on what you want to verify with it.

Kevin




Re: [PATCH v4 2/4] vvfat: Fix usage of `info.file.offset`

2024-06-11 Thread Kevin Wolf
Am 11.06.2024 um 18:22 hat Amjad Alsharafi geschrieben:
> On Tue, Jun 11, 2024 at 04:30:53PM +0200, Kevin Wolf wrote:
> > Am 11.06.2024 um 14:31 hat Amjad Alsharafi geschrieben:
> > > On Mon, Jun 10, 2024 at 06:49:43PM +0200, Kevin Wolf wrote:
> > > > Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> > > > > The field is marked as "the offset in the file (in clusters)", but it
> > > > > was being used like this
> > > > > `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> > > > > 
> > > Additionally, removed the `abort` when `first_mapping_index` does not
> > > match, as this matches the case when adding new clusters for files, and
> > > it's inevitable that we reach this condition when doing that if the
> > > clusters are not after one another, so there is no reason to `abort`
> > > here; execution continues and the new clusters are written to disk
> > > correctly.
> > > > > 
> > > > > Signed-off-by: Amjad Alsharafi 
> > > > 
> > > > Can you help me understand how first_mapping_index really works?
> > > > 
> > > > It seems to me that you get a chain of mappings for each file on the FAT
> > > > filesystem, which are just the contiguous areas in it, and
> > > > first_mapping_index refers to the mapping at the start of the file. But
> > > > for much of the time, it actually doesn't seem to be set at all, so you
> > > > have mapping->first_mapping_index == -1. Do you understand the rules
> > > > around when it's set and when it isn't?
> > > 
> > > Yeah. So `first_mapping_index` is the index of the first mapping; each
> > > mapping is a group of clusters that are contiguous in the file.
> > > It's mostly `-1` because the first mapping will have the value set as
> > > `-1` and not its own index; this value will only be set when the file
> > > contains more than one mapping, and this will only happen when you add
> > > clusters to a file that are not contiguous with the existing clusters.
> > 
> > Ah, that makes some sense. Not sure if it's optimal, but it's a rule I
> > can work with. So just to confirm, this is the invariant that we think
> > should always hold true, right?
> > 
> > assert((mapping->mode & MODE_DIRECTORY) ||
> >!mapping->info.file.offset ||
> >mapping->first_mapping_index > 0);
> > 
> 
> Yes.
> 
> We can add this into `get_cluster_count_for_direntry` loop.

Maybe even find_mapping_for_cluster() because we think it should apply
always? It's called by get_cluster_count_for_direntry(), but also by
other functions.

Either way, I think this should be a separate patch.
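
For illustration, here is a minimal self-contained model of that invariant
(struct layout and flag value are simplified stand-ins for block/vvfat.c,
not the real definitions):

#include <assert.h>

#define MODE_DIRECTORY 0x20        /* stand-in flag value */

typedef struct mapping_t {
    int mode;
    int first_mapping_index;       /* -1 for the first mapping of a file */
    struct {
        struct {
            int offset;            /* offset into the file, in clusters */
        } file;
    } info;
} mapping_t;

/* the invariant: any mapping that is not at the start of its file
 * (info.file.offset != 0) must point back to the first mapping */
static void check_mapping(const mapping_t *m)
{
    assert((m->mode & MODE_DIRECTORY) ||
           !m->info.file.offset ||
           m->first_mapping_index > 0);
}

int main(void)
{
    mapping_t first  = { 0, -1, { { 0 } } };  /* first run of a file */
    mapping_t second = { 0,  4, { { 6 } } };  /* later run, 6 clusters in */

    check_mapping(&first);
    check_mapping(&second);
    return 0;
}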

> I'm thinking of also converting those `abort` into `assert`, since
> the line `copy_it = 1;` was confusing me, as it came after the `abort`.

I agree for the abort() that you removed, but I'm not sure about the
other one. I have a feeling the copy_it = 1 might actually be correct
there (if the copying logic is implemented correctly; I didn't check
that).

> > > And actually, thanks to that I noticed another bug not fixed in PATCH 3.
> > > We are doing this check
> > > `s->current_mapping->first_mapping_index != mapping->first_mapping_index`
> > > to know if we should switch to the new mapping or not. 
> > > If we were reading from the first mapping (`first_mapping_index == -1`)
> > > and we jumped to the second mapping (`first_mapping_index == n`), we
> > > will catch this condition and switch to the new mapping.
> > > 
> > > But if the file has more than 2 mappings, and we jumped to the 3rd
> > > mapping, we will not catch this since (`first_mapping_index == n`) for
> > > both of them haha. I think a better check is to check the `mapping`
> > > pointer directly. (I'll add it also in the next series together with a
> > > test for it.)
> > 
> > This comparison is exactly what confused me. I didn't realise that the
> > first mapping in the chain has a different value here, so I thought this
> > must mean that we're looking at a different file now - but of course I
> > couldn't see a reason for that because we're iterating through a single
> > file in this function.
> > 
> > But even now that I know that the condition triggers when switching from
> > the first to the second mapping, it doesn't make se

Re: [PATCH v4 5/5] iotests: add backup-discard-source

2024-06-11 Thread Kevin Wolf
Am 13.03.2024 um 16:28 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Add test for a new backup option: discard-source.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Fiona Ebner 
> Tested-by: Fiona Ebner 

This test fails for me, and it already does so after this commit that
introduced it. I haven't checked what get_actual_size() does, but I'm running
on XFS, so its preallocation could be causing this. We generally avoid
checking the number of allocated blocks in image files for this reason.

Kevin


backup-discard-source   fail   [19:45:49] [19:45:50]   0.8s 
failed, exit status 1
--- /home/kwolf/source/qemu/tests/qemu-iotests/tests/backup-discard-source.out
+++ 
/home/kwolf/source/qemu/build-clang/scratch/qcow2-file-backup-discard-source/backup-discard-source.out.bad
@@ -1,5 +1,14 @@
-..
+F.
+==
+FAIL: test_discard_cbw (__main__.TestBackup.test_discard_cbw)
+1. do backup(discard_source=True), which should inform
+--
+Traceback (most recent call last):
+  File 
"/home/kwolf/source/qemu/tests/qemu-iotests/tests/backup-discard-source", line 
147, in test_discard_cbw
+self.assertLess(get_actual_size(self.vm, 'temp'), 512 * 1024)
+AssertionError: 1249280 not less than 524288
+
 --
 Ran 2 tests

-OK
+FAILED (failures=1)
Failures: backup-discard-source
Failed 1 of 1 iotests




[PULL 5/8] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-06-11 Thread Kevin Wolf
From: Fiona Ebner 

rather than the uint32_t for which the maximum is slightly more than 4
seconds and larger values would overflow. The QAPI interface allows
specifying the number of seconds, so only values 0 to 4 are safe right
now, other values lead to a much lower timeout than a user expects.

The block_copy() call where this is used already takes a uint64_t for
the timeout, so no change required there.
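
The arithmetic, as a standalone check (NANOSECONDS_PER_SECOND is defined
locally here with QEMU's value; this program is not part of the patch):

#include <stdint.h>
#include <stdio.h>

#define NANOSECONDS_PER_SECOND 1000000000LL   /* as in include/qemu/timer.h */

int main(void)
{
    for (uint32_t secs = 3; secs <= 6; secs++) {
        uint32_t ns32 = secs * NANOSECONDS_PER_SECOND; /* old field: truncated */
        uint64_t ns64 = secs * NANOSECONDS_PER_SECOND; /* new field: exact */
        printf("%u s -> uint32_t: %u ns, uint64_t: %llu ns\n",
               secs, ns32, (unsigned long long)ns64);
    }
    /* 5 s wraps to 705032704 ns (~0.7 s) and 6 s to 1705032704 ns (~1.7 s)
     * in the 32-bit field: the "much lower timeout than a user expects" */
    return 0;
}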

Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
Reported-by: Friedrich Weber 
Signed-off-by: Fiona Ebner 
Message-ID: <20240429141934.442154-1-f.eb...@proxmox.com>
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/copy-before-write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index cd65524e26..853e01a1eb 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -43,7 +43,7 @@ typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
 OnCbwError on_cbw_error;
-uint32_t cbw_timeout_ns;
+uint64_t cbw_timeout_ns;
 bool discard_source;
 
 /*
-- 
2.45.2




[PULL 0/8] Block layer patches

2024-06-11 Thread Kevin Wolf
The following changes since commit 80e8f0602168f451a93e71cbb1d59e93d745e62e:

  Merge tag 'bsd-user-misc-2024q2-pull-request' of gitlab.com:bsdimp/qemu into 
staging (2024-06-09 11:21:55 -0700)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 3ab0f063e58ed9224237d69c4211ca83335164c4:

  crypto/block: drop qcrypto_block_open() n_threads argument (2024-06-10 
11:05:43 +0200)


Block layer patches

- crypto: Fix crash when used with multiqueue devices
- linux-aio: add IO_CMD_FDSYNC command support
- copy-before-write: Avoid integer overflows for timeout > 4s
- Fix crash with QMP block_resize and iothreads
- qemu-io: add cvtnum() error handling for zone commands
- Code cleanup


Denis V. Lunev via (1):
  block: drop force_dup parameter of raw_reconfigure_getfd()

Fiona Ebner (1):
  block/copy-before-write: use uint64_t for timeout in nanoseconds

Prasad J Pandit (1):
  linux-aio: add IO_CMD_FDSYNC command support

Stefan Hajnoczi (5):
  Revert "monitor: use aio_co_reschedule_self()"
  aio: warn about iohandler_ctx special casing
  qemu-io: add cvtnum() error handling for zone commands
  block/crypto: create ciphers on demand
  crypto/block: drop qcrypto_block_open() n_threads argument

 crypto/blockpriv.h |  13 +++--
 include/block/aio.h|   6 +++
 include/block/raw-aio.h|   1 +
 include/crypto/block.h |   2 -
 block/copy-before-write.c  |   2 +-
 block/crypto.c |   1 -
 block/file-posix.c |  17 --
 block/linux-aio.c  |  21 +++-
 block/qcow.c   |   2 +-
 block/qcow2.c  |   5 +-
 crypto/block-luks.c|   4 +-
 crypto/block-qcow.c|   8 ++-
 crypto/block.c | 114 -
 qapi/qmp-dispatch.c|   7 ++-
 qemu-io-cmds.c |  48 -
 tests/unit/test-crypto-block.c |   4 --
 16 files changed, 176 insertions(+), 79 deletions(-)




[PULL 4/8] qemu-io: add cvtnum() error handling for zone commands

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

cvtnum() parses positive int64_t values and returns a negative errno on
failure. Print errors and return early when cvtnum() fails.

While we're at it, also reject nr_zones values greater or equal to 2^32
since they cannot be represented.
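
The convention, shown standalone (this cvtnum() is a simplified stand-in
for the real qemu-io helper, kept only to show the negative-errno pattern
the patch applies):

#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-in: parse a non-negative int64_t, return negative errno on error */
static int64_t cvtnum(const char *s)
{
    char *end;
    long long v = strtoll(s, &end, 0);

    if (*s == '\0' || *end != '\0') {
        return -EINVAL;
    }
    if (v < 0) {
        return -ERANGE;
    }
    return v;
}

int main(int argc, char **argv)
{
    int64_t val = cvtnum(argc > 1 ? argv[1] : "not-a-number");

    if (val < 0) {
        fprintf(stderr, "bad number: %s\n", strerror(-val));
        return 1;
    }
    if (val > UINT_MAX) {               /* nr_zones is an unsigned int */
        fprintf(stderr, "number of zones must be less than 2^32\n");
        return 1;
    }
    unsigned int nr_zones = val;
    printf("nr_zones = %u\n", nr_zones);
    return 0;
}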

Reported-by: Peter Maydell 
Cc: Sam Li 
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240507180558.377233-1-stefa...@redhat.com>
Reviewed-by: Sam Li 
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 qemu-io-cmds.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f5d7202a13..e2fab57183 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1739,12 +1739,26 @@ static int zone_report_f(BlockBackend *blk, int argc, 
char **argv)
 {
 int ret;
 int64_t offset;
+int64_t val;
 unsigned int nr_zones;
 
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
-nr_zones = cvtnum(argv[optind]);
+val = cvtnum(argv[optind]);
+if (val < 0) {
+print_cvtnum_err(val, argv[optind]);
+return val;
+}
+if (val > UINT_MAX) {
+printf("Number of zones must be less than 2^32\n");
+return -ERANGE;
+}
+nr_zones = val;
 
 g_autofree BlockZoneDescriptor *zones = NULL;
 zones = g_new(BlockZoneDescriptor, nr_zones);
@@ -1780,8 +1794,16 @@ static int zone_open_f(BlockBackend *blk, int argc, char 
**argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
 if (ret < 0) {
 printf("zone open failed: %s\n", strerror(-ret));
@@ -1805,8 +1827,16 @@ static int zone_close_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
 if (ret < 0) {
 printf("zone close failed: %s\n", strerror(-ret));
@@ -1830,8 +1860,16 @@ static int zone_finish_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
 if (ret < 0) {
 printf("zone finish failed: %s\n", strerror(-ret));
@@ -1855,8 +1893,16 @@ static int zone_reset_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
 if (ret < 0) {
 printf("zone reset failed: %s\n", strerror(-ret));
-- 
2.45.2




[PULL 8/8] crypto/block: drop qcrypto_block_open() n_threads argument

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

The n_threads argument is no longer used since the previous commit.
Remove it.

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240527155851.892885-3-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Acked-by: Daniel P. Berrangé 
Signed-off-by: Kevin Wolf 
---
 crypto/blockpriv.h | 1 -
 include/crypto/block.h | 2 --
 block/crypto.c | 1 -
 block/qcow.c   | 2 +-
 block/qcow2.c  | 5 ++---
 crypto/block-luks.c| 1 -
 crypto/block-qcow.c| 6 ++
 crypto/block.c | 3 +--
 tests/unit/test-crypto-block.c | 4 
 9 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 4bf6043d5d..b8f77cb5eb 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -59,7 +59,6 @@ struct QCryptoBlockDriver {
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 unsigned int flags,
-size_t n_threads,
 Error **errp);
 
 int (*create)(QCryptoBlock *block,
diff --git a/include/crypto/block.h b/include/crypto/block.h
index 92e823c9f2..5b5d039800 100644
--- a/include/crypto/block.h
+++ b/include/crypto/block.h
@@ -76,7 +76,6 @@ typedef enum {
  * @readfunc: callback for reading data from the volume
  * @opaque: data to pass to @readfunc
  * @flags: bitmask of QCryptoBlockOpenFlags values
- * @n_threads: allow concurrent I/O from up to @n_threads threads
  * @errp: pointer to a NULL-initialized error object
  *
  * Create a new block encryption object for an existing
@@ -113,7 +112,6 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
  QCryptoBlockReadFunc readfunc,
  void *opaque,
  unsigned int flags,
- size_t n_threads,
  Error **errp);
 
 typedef enum {
diff --git a/block/crypto.c b/block/crypto.c
index 21eed909c1..4eed3ffa6a 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -363,7 +363,6 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
block_crypto_read_func,
bs,
cflags,
-   1,
errp);
 
 if (!crypto->block) {
diff --git a/block/qcow.c b/block/qcow.c
index ca8e1d5ec8..c2f89db055 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -211,7 +211,7 @@ static int qcow_open(BlockDriverState *bs, QDict *options, 
int flags,
 cflags |= QCRYPTO_BLOCK_OPEN_NO_IO;
 }
 s->crypto = qcrypto_block_open(crypto_opts, "encrypt.",
-   NULL, NULL, cflags, 1, errp);
+   NULL, NULL, cflags, errp);
 if (!s->crypto) {
 ret = -EINVAL;
 goto fail;
diff --git a/block/qcow2.c b/block/qcow2.c
index 956128b409..10883a2494 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -321,7 +321,7 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
start_offset,
 }
 s->crypto = qcrypto_block_open(s->crypto_opts, "encrypt.",
qcow2_crypto_hdr_read_func,
-   bs, cflags, QCOW2_MAX_THREADS, 
errp);
+   bs, cflags, errp);
 if (!s->crypto) {
 return -EINVAL;
 }
@@ -1701,8 +1701,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int 
flags,
 cflags |= QCRYPTO_BLOCK_OPEN_NO_IO;
 }
 s->crypto = qcrypto_block_open(s->crypto_opts, "encrypt.",
-   NULL, NULL, cflags,
-   QCOW2_MAX_THREADS, errp);
+   NULL, NULL, cflags, errp);
 if (!s->crypto) {
 ret = -EINVAL;
 goto fail;
diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 3357852c0a..5b777c15d3 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1189,7 +1189,6 @@ qcrypto_block_luks_open(QCryptoBlock *block,
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 unsigned int flags,
-size_t n_threads,
 Error **errp)
 {
 QCryptoBlockLUKS *luks = NULL;
diff --git a/crypto/block-qcow.c b/crypto/block-qcow.c
index 02305058e3..42e9556e42 100644
--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -44,7 +44,6 @@ qcrypto_block_qcow_has_format(const uint8_t *buf 
G_GNUC_UNUSED,
 static int
 qcrypto_block_qcow_init(QCry

[PULL 6/8] linux-aio: add IO_CMD_FDSYNC command support

2024-06-11 Thread Kevin Wolf
From: Prasad Pandit 

Libaio defines the IO_CMD_FDSYNC command to sync all outstanding
asynchronous I/O operations, by flushing out file data to the
disk storage. Enable linux-aio to submit such aio requests.

When using aio=native without fdsync() support, QEMU creates
pthreads, and destroying these pthreads results in TLB flushes.
In a real-time guest environment, TLB flushes cause a latency
spike. This patch helps to avoid such spikes.
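
The probe can also be exercised as a standalone program (not part of the
patch; build with -laio, return-value checks trimmed for brevity, and any
fd that supports fsync will do):

#include <errno.h>
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open(".", O_RDONLY);          /* a directory fd supports fsync */
    struct iocb cb;
    struct iocb *cbs[] = { &cb, NULL };
    io_context_t ctx = 0;

    io_setup(1, &ctx);
    io_prep_fdsync(&cb, fd);               /* IO_CMD_FDSYNC */
    int ret = io_submit(ctx, 1, cbs);      /* -EINVAL: kernel lacks support */

    io_destroy(ctx);
    close(fd);
    printf("IO_CMD_FDSYNC %ssupported\n", ret == -EINVAL ? "not " : "");
    return 0;
}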

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Prasad Pandit 
Message-ID: <20240425070412.37248-1-ppan...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 include/block/raw-aio.h |  1 +
 block/file-posix.c  |  9 +
 block/linux-aio.c   | 21 -
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 20e000b8ef..626706827f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -60,6 +60,7 @@ void laio_cleanup(LinuxAioState *s);
 int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
 int type, uint64_t dev_max_batch);
 
+bool laio_has_fdsync(int);
 void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context);
 void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
 #endif
diff --git a/block/file-posix.c b/block/file-posix.c
index 5c46938936..be25e35ff6 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -159,6 +159,7 @@ typedef struct BDRVRawState {
 bool has_discard:1;
 bool has_write_zeroes:1;
 bool use_linux_aio:1;
+bool has_laio_fdsync:1;
 bool use_linux_io_uring:1;
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
@@ -718,6 +719,9 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 ret = -EINVAL;
 goto fail;
 }
+if (s->use_linux_aio) {
+s->has_laio_fdsync = laio_has_fdsync(s->fd);
+}
 #else
 if (s->use_linux_aio) {
 error_setg(errp, "aio=native was specified, but is not supported "
@@ -2598,6 +2602,11 @@ static int coroutine_fn 
raw_co_flush_to_disk(BlockDriverState *bs)
 if (raw_check_linux_io_uring(s)) {
 return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
 }
+#endif
+#ifdef CONFIG_LINUX_AIO
+if (s->has_laio_fdsync && raw_check_linux_aio(s)) {
+return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
+}
 #endif
 return raw_thread_pool_submit(handle_aiocb_flush, &acb);
 }
diff --git a/block/linux-aio.c b/block/linux-aio.c
index ec05d946f3..e3b5ec9aba 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -384,6 +384,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
*laiocb, off_t offset,
 case QEMU_AIO_READ:
 io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
 break;
+case QEMU_AIO_FLUSH:
+io_prep_fdsync(iocbs, fd);
+break;
 /* Currently Linux kernel does not support other operations */
 default:
 fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
@@ -412,7 +415,7 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, 
QEMUIOVector *qiov,
 AioContext *ctx = qemu_get_current_aio_context();
 struct qemu_laiocb laiocb = {
 .co = qemu_coroutine_self(),
-.nbytes = qiov->size,
+.nbytes = qiov ? qiov->size : 0,
 .ctx= aio_get_linux_aio(ctx),
 .ret= -EINPROGRESS,
 .is_read= (type == QEMU_AIO_READ),
@@ -486,3 +489,19 @@ void laio_cleanup(LinuxAioState *s)
 }
 g_free(s);
 }
+
+bool laio_has_fdsync(int fd)
+{
+struct iocb cb;
+struct iocb *cbs[] = {&cb, NULL};
+
+io_context_t ctx = 0;
+io_setup(1, &ctx);
+
+/* check if host kernel supports IO_CMD_FDSYNC */
+io_prep_fdsync(&cb, fd);
+int ret = io_submit(ctx, 1, cbs);
+
+io_destroy(ctx);
+return (ret == -EINVAL) ? false : true;
+}
-- 
2.45.2




[PULL 1/8] block: drop force_dup parameter of raw_reconfigure_getfd()

2024-06-11 Thread Kevin Wolf
From: "Denis V. Lunev via" 

Since commit 72373e40fbc, this parameter is always passed as 'false'
from the caller.

Signed-off-by: Denis V. Lunev 
CC: Andrey Zhadchenko 
CC: Kevin Wolf 
CC: Hanna Reitz 
Message-ID: <20240430170213.148558-1-...@openvz.org>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 block/file-posix.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 35684f7e21..5c46938936 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1039,8 +1039,7 @@ static int fcntl_setfl(int fd, int flag)
 }
 
 static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
- int *open_flags, uint64_t perm, bool 
force_dup,
- Error **errp)
+ int *open_flags, uint64_t perm, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 int fd = -1;
@@ -1068,7 +1067,7 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, 
int flags,
 assert((s->open_flags & O_ASYNC) == 0);
 #endif
 
-if (!force_dup && *open_flags == s->open_flags) {
+if (*open_flags == s->open_flags) {
 /* We're lucky, the existing fd is fine */
 return s->fd;
 }
@@ -3748,8 +3747,7 @@ static int raw_check_perm(BlockDriverState *bs, uint64_t 
perm, uint64_t shared,
 int ret;
 
 /* We may need a new fd if auto-read-only switches the mode */
-ret = raw_reconfigure_getfd(bs, input_flags, &open_flags, perm,
-false, errp);
+ret = raw_reconfigure_getfd(bs, input_flags, &open_flags, perm, errp);
 if (ret < 0) {
 return ret;
 } else if (ret != s->fd) {
-- 
2.45.2




[PULL 3/8] aio: warn about iohandler_ctx special casing

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

The main loop has two AioContexts: qemu_aio_context and iohandler_ctx.
The main loop runs them both, but nested aio_poll() calls on
qemu_aio_context exclude iohandler_ctx.

Which one should qemu_get_current_aio_context() return when called from
the main loop? Document that it's always qemu_aio_context.

This has subtle effects on functions that use
qemu_get_current_aio_context(). For example, aio_co_reschedule_self()
does not work when moving from iohandler_ctx to qemu_aio_context because
qemu_get_current_aio_context() does not differentiate these two
AioContexts.

Document this in order to reduce the chance of future bugs.
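
Sketched, the function's core (simplified from util/async.c; not part of
this patch):

void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx)
{
    AioContext *old_ctx = qemu_get_current_aio_context();

    if (old_ctx != new_ctx) {
        aio_co_schedule(new_ctx, qemu_coroutine_self());
        qemu_coroutine_yield();
    }
    /* Running in iohandler_ctx, old_ctx is reported as qemu_aio_context,
     * so a requested move to qemu_aio_context looks like a no-op and the
     * coroutine never migrates - exactly the pitfall documented here. */
}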

Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240506190622.56095-3-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 include/block/aio.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 8378553eb9..4ee81936ed 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -629,6 +629,9 @@ void aio_co_schedule(AioContext *ctx, Coroutine *co);
  *
  * Move the currently running coroutine to new_ctx. If the coroutine is already
  * running in new_ctx, do nothing.
+ *
+ * Note that this function cannot reschedule from iohandler_ctx to
+ * qemu_aio_context.
  */
 void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx);
 
@@ -661,6 +664,9 @@ void aio_co_enter(AioContext *ctx, Coroutine *co);
  * If called from an IOThread this will be the IOThread's AioContext.  If
  * called from the main thread or with the "big QEMU lock" taken it
  * will be the main loop AioContext.
+ *
+ * Note that the return value is never the main loop's iohandler_ctx and the
+ * return value is the main loop AioContext instead.
  */
 AioContext *qemu_get_current_aio_context(void);
 
-- 
2.45.2




[PULL 7/8] block/crypto: create ciphers on demand

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

Ciphers are pre-allocated by qcrypto_block_init_cipher() depending on
the given number of threads. The -device
virtio-blk-pci,iothread-vq-mapping= feature allows users to assign
multiple IOThreads to a virtio-blk device, but the association between
the virtio-blk device and the block driver happens after the block
driver is already open.

When the number of threads given to qcrypto_block_init_cipher() is
smaller than the actual number of threads at runtime, the
block->n_free_ciphers > 0 assertion in qcrypto_block_pop_cipher() can
fail.

Get rid of qcrypto_block_init_cipher() n_thread's argument and allocate
ciphers on demand.
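
The resulting allocation pattern, as a self-contained sketch (type and
function names are stand-ins, g_mutex_init() on setup is elided; the real
code is qcrypto_block_pop_cipher()/push_cipher() in crypto/block.c):

#include <glib.h>

typedef struct Cipher { int dummy; } Cipher;   /* stand-in for QCryptoCipher */
static Cipher *cipher_new(void) { return g_new0(Cipher, 1); }

typedef struct {
    GMutex mutex;
    Cipher **free_ciphers;     /* lazily grown free list */
    size_t max_free_ciphers;   /* list capacity, grown on demand */
    size_t n_free_ciphers;
} CipherPool;

/* pop: reuse a pooled cipher if available, otherwise allocate a new one -
 * no up-front guess about how many threads will submit I/O */
static Cipher *pool_pop(CipherPool *p)
{
    Cipher *c = NULL;

    g_mutex_lock(&p->mutex);
    if (p->n_free_ciphers > 0) {
        c = p->free_ciphers[--p->n_free_ciphers];
    }
    g_mutex_unlock(&p->mutex);

    return c ? c : cipher_new();
}

/* push: return a cipher to the pool, growing the list as needed */
static void pool_push(CipherPool *p, Cipher *c)
{
    g_mutex_lock(&p->mutex);
    if (p->n_free_ciphers == p->max_free_ciphers) {
        p->max_free_ciphers = p->max_free_ciphers ? 2 * p->max_free_ciphers : 1;
        p->free_ciphers = g_renew(Cipher *, p->free_ciphers,
                                  p->max_free_ciphers);
    }
    p->free_ciphers[p->n_free_ciphers++] = c;
    g_mutex_unlock(&p->mutex);
}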

Reported-by: Qing Wang 
Buglink: https://issues.redhat.com/browse/RHEL-36159
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240527155851.892885-2-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Acked-by: Daniel P. Berrangé 
Signed-off-by: Kevin Wolf 
---
 crypto/blockpriv.h  |  12 +++--
 crypto/block-luks.c |   3 +-
 crypto/block-qcow.c |   2 +-
 crypto/block.c  | 111 ++--
 4 files changed, 78 insertions(+), 50 deletions(-)

diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 836f3b4726..4bf6043d5d 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -32,8 +32,14 @@ struct QCryptoBlock {
 const QCryptoBlockDriver *driver;
 void *opaque;
 
-QCryptoCipher **ciphers;
-size_t n_ciphers;
+/* Cipher parameters */
+QCryptoCipherAlgorithm alg;
+QCryptoCipherMode mode;
+uint8_t *key;
+size_t nkey;
+
+QCryptoCipher **free_ciphers;
+size_t max_free_ciphers;
 size_t n_free_ciphers;
 QCryptoIVGen *ivgen;
 QemuMutex mutex;
@@ -130,7 +136,7 @@ int qcrypto_block_init_cipher(QCryptoBlock *block,
   QCryptoCipherAlgorithm alg,
   QCryptoCipherMode mode,
   const uint8_t *key, size_t nkey,
-  size_t n_threads, Error **errp);
+  Error **errp);
 
 void qcrypto_block_free_cipher(QCryptoBlock *block);
 
diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 3ee928fb5a..3357852c0a 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1262,7 +1262,6 @@ qcrypto_block_luks_open(QCryptoBlock *block,
   luks->cipher_mode,
   masterkey,
   luks->header.master_key_len,
-  n_threads,
   errp) < 0) {
 goto fail;
 }
@@ -1456,7 +1455,7 @@ qcrypto_block_luks_create(QCryptoBlock *block,
 /* Setup the block device payload encryption objects */
 if (qcrypto_block_init_cipher(block, luks_opts.cipher_alg,
   luks_opts.cipher_mode, masterkey,
-  luks->header.master_key_len, 1, errp) < 0) {
+  luks->header.master_key_len, errp) < 0) {
 goto error;
 }
 
diff --git a/crypto/block-qcow.c b/crypto/block-qcow.c
index 4d7cf36a8f..02305058e3 100644
--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -75,7 +75,7 @@ qcrypto_block_qcow_init(QCryptoBlock *block,
 ret = qcrypto_block_init_cipher(block, QCRYPTO_CIPHER_ALG_AES_128,
 QCRYPTO_CIPHER_MODE_CBC,
 keybuf, G_N_ELEMENTS(keybuf),
-n_threads, errp);
+errp);
 if (ret < 0) {
 ret = -ENOTSUP;
 goto fail;
diff --git a/crypto/block.c b/crypto/block.c
index 506ea1d1a3..ba6d1cebc7 100644
--- a/crypto/block.c
+++ b/crypto/block.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qemu/lockable.h"
 #include "blockpriv.h"
 #include "block-qcow.h"
 #include "block-luks.h"
@@ -57,6 +58,8 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
 {
 QCryptoBlock *block = g_new0(QCryptoBlock, 1);
 
+qemu_mutex_init(&block->mutex);
+
 block->format = options->format;
 
 if (options->format >= G_N_ELEMENTS(qcrypto_block_drivers) ||
@@ -76,8 +79,6 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
 return NULL;
 }
 
-qemu_mutex_init(&block->mutex);
-
 return block;
 }
 
@@ -92,6 +93,8 @@ QCryptoBlock *qcrypto_block_create(QCryptoBlockCreateOptions 
*options,
 {
 QCryptoBlock *block = g_new0(QCryptoBlock, 1);
 
+qemu_mutex_init(&block->mutex);
+
 block->format = options->format;
 
 if (options->format >= G_N_ELEMENTS(qcrypto_block_drivers) ||
@@ -111,8 +114,6 @@ QCryptoBlock 
*qcrypto_block_create(QCryptoBlockCreateOptions *options,
 return NULL;
 }
 
-qemu_mut

[PULL 2/8] Revert "monitor: use aio_co_reschedule_self()"

2024-06-11 Thread Kevin Wolf
From: Stefan Hajnoczi 

Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code
cleanup that uses aio_co_reschedule_self() instead of open coding
coroutine rescheduling.

Bug RHEL-34618 was reported and Kevin Wolf  identified
the root cause. I missed that aio_co_reschedule_self() ->
qemu_get_current_aio_context() only knows about
qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It
does not function correctly when going back from the iohandler_ctx to
qemu_aio_context.

Go back to open coding the AioContext transitions to avoid this bug.

This reverts commit 1f25c172f83704e350c0829438d832384084a74d.

Cc: qemu-sta...@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-34618
Signed-off-by: Stefan Hajnoczi 
Message-ID: <20240506190622.56095-2-stefa...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 qapi/qmp-dispatch.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index f3488afeef..176b549473 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
  * executing the command handler so that it can make progress if it
  * involves an AIO_WAIT_WHILE().
  */
-aio_co_reschedule_self(qemu_get_aio_context());
+aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
+qemu_coroutine_yield();
 }
 
 monitor_set_cur(qemu_coroutine_self(), cur_mon);
@@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList 
*cmds, QObject *requ
  * Move back to iohandler_ctx so that nested event loops for
  * qemu_aio_context don't start new monitor commands.
  */
-aio_co_reschedule_self(iohandler_get_aio_context());
+aio_co_schedule(iohandler_get_aio_context(),
+qemu_coroutine_self());
+qemu_coroutine_yield();
 }
 } else {
/*
-- 
2.45.2




Re: [PATCH v4 2/4] vvfat: Fix usage of `info.file.offset`

2024-06-11 Thread Kevin Wolf
Am 11.06.2024 um 14:31 hat Amjad Alsharafi geschrieben:
> On Mon, Jun 10, 2024 at 06:49:43PM +0200, Kevin Wolf wrote:
> > Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> > > The field is marked as "the offset in the file (in clusters)", but it
> > > was being used like this
> > > `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> > > 
> > > Additionally, removed the `abort` when `first_mapping_index` does not
> > > match, as this matches the case when adding new clusters for files, and
> > > it's inevitable that we reach this condition when doing that if the
> > > clusters are not after one another, so there is no reason to `abort`
> > > here; execution continues and the new clusters are written to disk
> > > correctly.
> > > 
> > > Signed-off-by: Amjad Alsharafi 
> > 
> > Can you help me understand how first_mapping_index really works?
> > 
> > It seems to me that you get a chain of mappings for each file on the FAT
> > filesystem, which are just the contiguous areas in it, and
> > first_mapping_index refers to the mapping at the start of the file. But
> > for much of the time, it actually doesn't seem to be set at all, so you
> > have mapping->first_mapping_index == -1. Do you understand the rules
> > around when it's set and when it isn't?
> 
> Yeah. So `first_mapping_index` is the index of the first mapping; each
> mapping is a group of clusters that are contiguous in the file.
> It's mostly `-1` because the first mapping will have the value set as
> `-1` and not its own index; this value will only be set when the file
> contains more than one mapping, and this will only happen when you add
> clusters to a file that are not contiguous with the existing clusters.

Ah, that makes some sense. Not sure if it's optimal, but it's a rule I
can work with. So just to confirm, this is the invariant that we think
should always hold true, right?

assert((mapping->mode & MODE_DIRECTORY) ||
   !mapping->info.file.offset ||
   mapping->first_mapping_index > 0);

> And actually, thanks to that I noticed another bug not fixed in PATCH 3.
> We are doing this check
> `s->current_mapping->first_mapping_index != mapping->first_mapping_index`
> to know if we should switch to the new mapping or not. 
> If we were reading from the first mapping (`first_mapping_index == -1`)
> and we jumped to the second mapping (`first_mapping_index == n`), we
> will catch this condition and switch to the new mapping.
> 
> But if the file has more than 2 mappings, and we jumped to the 3rd
> mapping, we will not catch this since (`first_mapping_index == n`) for
> both of them haha. I think a better check is to check the `mapping`
> pointer directly. (I'll add it also in the next series together with a
> test for it.)

This comparison is exactly what confused me. I didn't realise that the
first mapping in the chain has a different value here, so I thought this
must mean that we're looking at a different file now - but of course I
couldn't see a reason for that because we're iterating through a single
file in this function.

But even now that I know that the condition triggers when switching from
the first to the second mapping, it doesn't make sense to me. We don't
have to copy things around just because a file is non-contiguous.

What we want to catch is if the order of mappings has changed compared
to the old state. Do we need a linked list, maybe a prev_mapping_index,
instead of first_mapping_index so that we can compare if it is still the
same as before?

Or actually, I suppose that's the first block with an abort() in the
code, just that it doesn't compare mappings, but their offsets.
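
A worked example of the check being discussed (indices and cluster numbers
made up):

/* A file stored in three non-contiguous runs:
 *
 *   mappings[4]: clusters 10..15, first_mapping_index = -1   (first run)
 *   mappings[7]: clusters 30..33, first_mapping_index = 4
 *   mappings[9]: clusters 50..52, first_mapping_index = 4
 *
 * s->current_mapping->first_mapping_index != mapping->first_mapping_index
 * fires when stepping from mappings[4] (-1) to mappings[7] (4), but not
 * when stepping from mappings[7] to mappings[9] (both 4), so the switch
 * to the third run is missed. Comparing the mapping pointers themselves,
 * as suggested above, does not have this problem. */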

> > 
> > >  block/vvfat.c | 12 +++-
> > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/block/vvfat.c b/block/vvfat.c
> > > index 19da009a5b..f0642ac3e4 100644
> > > --- a/block/vvfat.c
> > > +++ b/block/vvfat.c
> > > @@ -1408,7 +1408,9 @@ read_cluster_directory:
> > >  
> > >  assert(s->current_fd);
> > >  
> > > -
> > > offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
> > > +offset = s->cluster_size *
> > > +((cluster_num - s->current_mapping->begin)
> > > ++ s->current_mapping->info.file.offset);
> > >  if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
> > >  return -3;
> > >  s->cluste

Re: [PATCH v4 4/4] iotests: Add `vvfat` tests

2024-06-10 Thread Kevin Wolf
Am 10.06.2024 um 16:11 hat Amjad Alsharafi geschrieben:
> On Mon, Jun 10, 2024 at 02:01:24PM +0200, Kevin Wolf wrote:
> > With the updated test, I can catch the problems that are fixed by
> > patches 1 and 2, but it still doesn't need patch 3 to pass.
> > 
> > Kevin
> > 
> 
> Thanks for reviewing, those are all mistakes, and I fixed them (included
> a small patch to fix these issues at the end...).
> 
> Regarding the failing test, I forgot to also read the files from the fat
> driver, and instead I was just reading from the host filesystem.
> I'm not sure exactly why reading from the filesystem works, but reading
> from the driver (i.e. guest) gives the weird buggy result. 
> I have updated the test in the patch below to reflect this.
> 
> I would love it if you could test the patch below and let me know if the
> issues are fixed; after that I can send the new series.

Yes, that looks good to me and reproduces a failure without patch 3.

Kevin




Re: [PATCH v4 2/4] vvfat: Fix usage of `info.file.offset`

2024-06-10 Thread Kevin Wolf
Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> The field is marked as "the offset in the file (in clusters)", but it
> was being used like this
> `cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.
> 
> Additionally, removed the `abort` when `first_mapping_index` does not
> match, as this matches the case when adding new clusters for files, and
> it's inevitable that we reach this condition when doing that if the
> clusters are not after one another, so there is no reason to `abort`
> here; execution continues and the new clusters are written to disk
> correctly.
> 
> Signed-off-by: Amjad Alsharafi 

Can you help me understand how first_mapping_index really works?

It seems to me that you get a chain of mappings for each file on the FAT
filesystem, which are just the contiguous areas in it, and
first_mapping_index refers to the mapping at the start of the file. But
for much of the time, it actually doesn't seem to be set at all, so you
have mapping->first_mapping_index == -1. Do you understand the rules
around when it's set and when it isn't?

>  block/vvfat.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/block/vvfat.c b/block/vvfat.c
> index 19da009a5b..f0642ac3e4 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -1408,7 +1408,9 @@ read_cluster_directory:
>  
>  assert(s->current_fd);
>  
> -
> offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
> +offset = s->cluster_size *
> +((cluster_num - s->current_mapping->begin)
> ++ s->current_mapping->info.file.offset);
>  if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
>  return -3;
>  s->cluster=s->cluster_buffer;
> @@ -1929,8 +1931,9 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, 
> direntry_t* direntry, const ch
>  (mapping->mode & MODE_DIRECTORY) == 0) {
>  
>  /* was modified in qcow */
> -if (offset != mapping->info.file.offset + s->cluster_size
> -* (cluster_num - mapping->begin)) {
> +if (offset != s->cluster_size
> +* ((cluster_num - mapping->begin)
> ++ mapping->info.file.offset)) {
>  /* offset of this cluster in file chain has changed 
> */
>  abort();
>  copy_it = 1;
> @@ -1944,7 +1947,6 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, 
> direntry_t* direntry, const ch
>  
>  if (mapping->first_mapping_index != first_mapping_index
>  && mapping->info.file.offset > 0) {
> -abort();
>  copy_it = 1;
>  }

I'm unsure which case this represents. If first_mapping_index refers to
the mapping of the first cluster in the file, does this mean we got a
mapping for a different file here? Or is the comparison between -1 and a
real value?

In any case it doesn't seem to be the case that the comment at the
declaration of copy_it describes.

>  
> @@ -2404,7 +2406,7 @@ static int commit_mappings(BDRVVVFATState* s,
>  (mapping->end - mapping->begin);
>  } else
>  next_mapping->info.file.offset = mapping->info.file.offset +
> -mapping->end - mapping->begin;
> +(mapping->end - mapping->begin);
>  
>  mapping = next_mapping;
>  }

Kevin
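
A worked example of the corrected offset formula (numbers made up): with
cluster_size = 4096, mapping->begin = 100, info.file.offset = 8 (clusters)
and cluster_num = 102:

/* before: 4096 * (102 - 100) + 8   =  8200 bytes (offset misused as bytes)
 * after:  4096 * ((102 - 100) + 8) = 40960 bytes (offset taken in clusters) */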




Re: [PATCH v4 4/4] iotests: Add `vvfat` tests

2024-06-10 Thread Kevin Wolf
Am 05.06.2024 um 02:58 hat Amjad Alsharafi geschrieben:
> Added several tests to verify the implementation of the vvfat driver.
> 
> We needed a way to interact with it, so created a basic `fat16.py` driver 
> that handled writing correct sectors for us.
> 
> Added `vvfat` to the non-generic formats, as it's not a normal image format.
> 
> Signed-off-by: Amjad Alsharafi 
> ---
>  tests/qemu-iotests/check   |   2 +-
>  tests/qemu-iotests/fat16.py| 635 +
>  tests/qemu-iotests/testenv.py  |   2 +-
>  tests/qemu-iotests/tests/vvfat | 440 
>  tests/qemu-iotests/tests/vvfat.out |   5 +
>  5 files changed, 1082 insertions(+), 2 deletions(-)
>  create mode 100644 tests/qemu-iotests/fat16.py
>  create mode 100755 tests/qemu-iotests/tests/vvfat
>  create mode 100755 tests/qemu-iotests/tests/vvfat.out
> 
> diff --git a/tests/qemu-iotests/check b/tests/qemu-iotests/check
> index 56d88ca423..545f9ec7bd 100755
> --- a/tests/qemu-iotests/check
> +++ b/tests/qemu-iotests/check
> @@ -84,7 +84,7 @@ def make_argparser() -> argparse.ArgumentParser:
>  p.set_defaults(imgfmt='raw', imgproto='file')
>  
>  format_list = ['raw', 'bochs', 'cloop', 'parallels', 'qcow', 'qcow2',
> -   'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg']
> +   'qed', 'vdi', 'vpc', 'vhdx', 'vmdk', 'luks', 'dmg', 
> 'vvfat']
>  g_fmt = p.add_argument_group(
>  '  image format options',
>  'The following options set the IMGFMT environment variable. '
> diff --git a/tests/qemu-iotests/fat16.py b/tests/qemu-iotests/fat16.py
> new file mode 100644
> index 00..baf801b4d5
> --- /dev/null
> +++ b/tests/qemu-iotests/fat16.py
> @@ -0,0 +1,635 @@
> +# A simple FAT16 driver that is used to test the `vvfat` driver in QEMU.
> +#
> +# Copyright (C) 2024 Amjad Alsharafi 
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +from typing import List
> +import string
> +
> +SECTOR_SIZE = 512
> +DIRENTRY_SIZE = 32
> +ALLOWED_FILE_CHARS = \
> +set("!#$%&'()-@^_`{}~" + string.digits + string.ascii_uppercase)
> +
> +
> +class MBR:
> +def __init__(self, data: bytes):
> +assert len(data) == 512
> +self.partition_table = []
> +for i in range(4):
> +partition = data[446 + i * 16 : 446 + (i + 1) * 16]
> +self.partition_table.append(
> +{
> +"status": partition[0],
> +"start_head": partition[1],
> +"start_sector": partition[2] & 0x3F,
> +"start_cylinder":
> +((partition[2] & 0xC0) << 2) | partition[3],
> +"type": partition[4],
> +"end_head": partition[5],
> +"end_sector": partition[6] & 0x3F,
> +"end_cylinder":
> +((partition[6] & 0xC0) << 2) | partition[7],
> +"start_lba": int.from_bytes(partition[8:12], "little"),
> +"size": int.from_bytes(partition[12:16], "little"),
> +}
> +)
> +
> +def __str__(self):
> +return "\n".join(
> +[f"{i}: {partition}"
> +for i, partition in enumerate(self.partition_table)]
> +)
> +
> +
> +class FatBootSector:
> +def __init__(self, data: bytes):
> +assert len(data) == 512
> +self.bytes_per_sector = int.from_bytes(data[11:13], "little")
> +self.sectors_per_cluster = data[13]
> +self.reserved_sectors = int.from_bytes(data[14:16], "little")
> +self.fat_count = data[16]
> +self.root_entries = int.from_bytes(data[17:19], "little")
> +self.media_descriptor = data[21]
> +self.fat_size = int.from_bytes(data[22:24], "little")
> +self.sectors_per_fat = int.from_bytes(data[22:24], "little")

Why two different attributes self.fat_size and self.sectors_per_fat that
contain the same value?

> +self.sectors_per_track = int.from_bytes(data[24:26], "little")
> +self.heads = int.from_bytes(data[26:28], "little")
> +self.hidden_sectors = int.from_bytes(data[28:32], "little")
> +self.total_sectors = int.from_bytes(data[32:36], "little")

This value should only be considered if the 16 bit value at byte 19
(which you don't st

[PATCH] scsi-disk: Don't silently truncate serial number

2024-06-04 Thread Kevin Wolf
Before this commit, scsi-disk accepts a string of arbitrary length for
its "serial" property. However, the value visible on the guest is
actually truncated to 36 characters. This limitation doesn't come from
the SCSI specification; it is an arbitrary limit that was initially
picked as 20 and later bumped to 36 by commit 48b62063.

Similarly, device_id was introduced as a copy of the serial number,
limited to 20 characters, but commit 48b62063 forgot to actually bump
it.

As long as we silently truncate the given string, extending the limit is
actually not a harmless change, but breaks the guest ABI. This is the
most important reason why commit 48b62063 was really wrong (and it's
also why we can't change device_id to be in sync with the serial number
again and use 36 characters now, it would be another guest ABI
breakage).

In order to avoid future breakage, don't silently truncate the serial
number string any more, but just error out if it would be truncated.
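
The old silent truncation is easy to reproduce standalone (not part of the
patch; build against glib-2.0, and the 40-character serial is made up):

#include <glib.h>
#include <stdio.h>

int main(void)
{
    const char *serial = "0123456789012345678901234567890123456789"; /* 40 chars */

    /* what scsi_realize() used to do when deriving device_id */
    char *device_id = g_strdup_printf("%.20s", serial);

    printf("%s\n", device_id);   /* prints only "01234567890123456789" */
    g_free(device_id);
    return 0;
}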

Buglink: https://issues.redhat.com/browse/RHEL-3542
Suggested-by: Peter Krempa 
Signed-off-by: Kevin Wolf 
---
 hw/scsi/scsi-disk.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4bd7af9d0c..5f55ae54e4 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -58,6 +58,9 @@
 
 #define TYPE_SCSI_DISK_BASE "scsi-disk-base"
 
+#define MAX_SERIAL_LEN  36
+#define MAX_SERIAL_LEN_FOR_DEVID20
+
 OBJECT_DECLARE_TYPE(SCSIDiskState, SCSIDiskClass, SCSI_DISK_BASE)
 
 struct SCSIDiskClass {
@@ -648,8 +651,8 @@ static int scsi_disk_emulate_vpd_page(SCSIRequest *req, 
uint8_t *outbuf)
 }
 
 l = strlen(s->serial);
-if (l > 36) {
-l = 36;
+if (l > MAX_SERIAL_LEN) {
+l = MAX_SERIAL_LEN;
 }
 
 trace_scsi_disk_emulate_vpd_page_80(req->cmd.xfer);
@@ -2501,9 +2504,20 @@ static void scsi_realize(SCSIDevice *dev, Error **errp)
 if (!s->vendor) {
 s->vendor = g_strdup("QEMU");
 }
+if (s->serial && strlen(s->serial) > MAX_SERIAL_LEN) {
+error_setg(errp, "The serial number can't be longer than %d 
characters",
+   MAX_SERIAL_LEN);
+return;
+}
 if (!s->device_id) {
 if (s->serial) {
-s->device_id = g_strdup_printf("%.20s", s->serial);
+if (strlen(s->serial) > MAX_SERIAL_LEN_FOR_DEVID) {
+error_setg(errp, "The serial number can't be longer than %d "
+   "characters when it is also used as the default for 
"
+   "device_id", MAX_SERIAL_LEN_FOR_DEVID);
+return;
+}
+s->device_id = g_strdup(s->serial);
 } else {
 const char *str = blk_name(s->qdev.conf.blk);
 if (str && *str) {
-- 
2.45.1




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-04 Thread Kevin Wolf
Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> > Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> >> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> >>>
> >>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> >>> with a coroutine wrapper so that the graph lock is held for the whole
> >>> function. Then calling bdrv_co_flush() while iterating the list is safe
> >>> and doesn't allow concurrent graph modifications.
> >>
> >> The second is that iotest 255 ran into an assertion failure upon QMP 
> >> 'quit':
> >>
> >>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> >>> `!qemu_in_coroutine()' failed.
> >>
> >> Looking at the backtrace:
> >>
> >>> #5  0x762a90cc3eb2 in __GI___assert_fail
> >>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> >>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> >>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> >>> at ./assert/assert.c:101
> >>> #6  0x5afb07585311 in bdrv_graph_wrlock () at 
> >>> ../block/graph-lock.c:113
> >>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:901
> >>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:487
> >>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:547
> >>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> >>> ../block/block-backend.c:618
> >>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> >>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
> >>> at block/block-gen.c:1391
> >>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
> >>
> >> So I guess calling bdrv_next() is not safe from a coroutine, because
> >> the function doing the iteration could end up being the last thing to
> >> have a reference for the BB.
> > 
> > Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> > is surprising, because while we hold the graph lock, no reference should
> > be able to go away - you need the writer lock for that and you won't get
> > it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> > reference before the bdrv_next() loop must still have it now. Do you
> > know where it gets dropped?
> > 
> 
> AFAICT, yes, it does hold the graph reader lock. The generated code is:
> 
> > static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> > {
> > BdrvFlushAll *s = opaque;
> > 
> > bdrv_graph_co_rdlock();
> > s->ret = bdrv_co_flush_all();
> > bdrv_graph_co_rdunlock();
> > s->poll_state.in_progress = false;
> > 
> > aio_wait_kick();
> > }
> 
> Apparently when the mirror job is aborted/exits, which can happen during
> the polling for bdrv_co_flush_all_entry(), a reference can go away
> without the write lock (at least my breakpoints didn't trigger) being held:
> 
> > #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> > #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
> > ../block/mirror.c:710
> > #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
> > ../block/mirror.c:823
> > #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> > #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) 
> > at ../job.c:855
> > #5  0x5cdefb223852 in job_completed_txn_abort_locked 
> > (job=0x5cdefeb53000) at ../job.c:958
> > #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
> > ../job.c:1065
> > #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> > #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
> > ../util/async.c:171
> > #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
> > ../util/async.c:218
> > #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at 
> > ../util/aio-posix.c:722
> > #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at 
> > ../block/block-gen.h:43
> > #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> > #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STAT

Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> Hi Kevin,
> 
> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> > Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> >> The old_bs variable in bdrv_next() is currently determined by looking
> >> at the old block backend. However, if the block graph changes before
> >> the next bdrv_next() call, it might be that the associated BDS is not
> >> the same that was referenced previously. In that case, the wrong BDS
> >> is unreferenced, leading to an assertion failure later:
> >>
> >>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
> > 
> > Your change makes sense, but in theory it shouldn't make a difference.
> > The real bug is in the caller, you can't allow graph modifications while
> > iterating the list of nodes. Even if it doesn't crash (like after your
> > patch), you don't have any guarantee that you will have seen every node
> > that exists that the end - and maybe not even that you don't get the
> > same node twice.
> > 
> >> In particular, this can happen in the context of bdrv_flush_all(),
> >> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> >> a graph change (for example with a stream block job [0]).
> > 
> > The whole locking around this case is a bit tricky and would deserve
> > some cleanup.
> > 
> > The basic rule for bdrv_next() callers is that they need to hold the
> > graph reader lock as long as they are iterating the graph, otherwise
> > it's not safe. This implies that the ref/unref pairs in it should never
> > make a difference either - which is important, because at least
> > releasing the last reference is forbidden while holding the graph lock.
> > I intended to remove the ref/unref for bdrv_next(), but I didn't because
> > I realised that the callers need to be audited first that they really
> > obey the rules. You found one that would be problematic.
> > 
> > The thing that bdrv_flush_all() gets wrong is that it promises to follow
> > the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> > something that polls. The compiler can't catch this because bdrv_flush()
> > is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
> > 
> > - If called outside of coroutine context, they are GRAPH_UNLOCKED
> > - If called in coroutine context, they are GRAPH_RDLOCK
> > 
> > We should probably try harder to get rid of the mixed functions, because
> > a synchronous co_wrapper_bdrv_rdlock could actually be marked
> > GRAPH_UNLOCKED in the code and then the compiler could catch this case.
> > 
> > The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> > with a coroutine wrapper so that the graph lock is held for the whole
> > function. Then calling bdrv_co_flush() while iterating the list is safe
> > and doesn't allow concurrent graph modifications.
> 
> I attempted this now, but ran into two issues:
> 
> The first is that I had to add support for a function without arguments
> to the block-coroutine-wrapper.py script. I can send this as an RFC in
> any case if desired.

I think I wouldn't necessarily merge it if we don't have code that makes
use of it, but having it on the list as an RFC can't hurt either way.
Then we can come back to it once we do need it for something.

> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
> 
> > ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> > `!qemu_in_coroutine()' failed.
> 
> Looking at the backtrace:
> 
> > #5  0x762a90cc3eb2 in __GI___assert_fail
> > (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> > "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> > <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> > at ./assert/assert.c:101
> > #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> > #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:901
> > #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:487
> > #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:547
> > #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> > ../block/block-backend.c:618
> > #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> > #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
> > at block/b

Re: [PATCH v3 2/2] iotests: test NBD+TLS+iothread

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 13:45 hat Daniel P. Berrangé geschrieben:
> On Fri, May 31, 2024 at 01:04:59PM -0500, Eric Blake wrote:
> > Prevent regressions when using NBD with TLS in the presence of
> > iothreads, adding coverage the fix to qio channels made in the
> > previous patch.
> > 
> > The shell function pick_unused_port() was copied from
> > nbdkit.git/tests/functions.sh.in, where it had all authors from Red
> > Hat, agreeing to the resulting relicensing from 2-clause BSD to GPLv2.
> > 
> > CC: qemu-sta...@nongnu.org
> > CC: "Richard W.M. Jones" 
> > Signed-off-by: Eric Blake 
> > ---
> >  tests/qemu-iotests/tests/nbd-tls-iothread | 168 ++
> >  tests/qemu-iotests/tests/nbd-tls-iothread.out |  54 ++
> >  2 files changed, 222 insertions(+)
> >  create mode 100755 tests/qemu-iotests/tests/nbd-tls-iothread
> >  create mode 100644 tests/qemu-iotests/tests/nbd-tls-iothread.out
> 
> 
> 
> > +# pick_unused_port
> > +#
> > +# Picks and returns an "unused" port, setting the global variable
> > +# $port.
> > +#
> > +# This is inherently racy, but we need it because qemu does not currently
> > +# permit NBD+TLS over a Unix domain socket
> > +pick_unused_port ()
> > +{
> > +if ! (ss --version) >/dev/null 2>&1; then
> > +_notrun "ss utility required, skipped this test"
> > +fi
> > +
> > +# Start at a random port to make it less likely that two parallel
> > +# tests will conflict.
> > +port=$(( 5 + (RANDOM%15000) ))
> > +while ss -ltn | grep -sqE ":$port\b"; do
> > +((port++))
> > +if [ $port -eq 65000 ]; then port=5; fi
> > +done
> > +echo picked unused port
> > +}
> 
> In retrospect I'd probably have suggested putting this into
> common.qemu as it's conceptually independent of this test.

If we make it generic, should nbd_server_start_tcp_socket() in
common.nbd use it, too, instead of implementing its own loop trying to
find a free port?

Kevin
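
For comparison, the usual way to avoid the race is to let the kernel assign
the port by binding to port 0 - but that only helps when the binding
process is also the one that listens, and here qemu is the listener, hence
the scan. A sketch of the kernel-assigned approach:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        /* kernel picks a free port */

    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    getsockname(s, (struct sockaddr *)&addr, &len);
    printf("%d\n", ntohs(addr.sin_port));     /* racy again once closed */

    close(s);
    return 0;
}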




Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 16:45 hat Fiona Ebner geschrieben:
> Am 28.05.24 um 18:06 schrieb Kevin Wolf:
> > Am 29.04.2024 um 16:19 hat Fiona Ebner geschrieben:
> >> rather than the uint32_t for which the maximum is slightly more than 4
> >> seconds and larger values would overflow. The QAPI interface allows
> >> specifying the number of seconds, so only values 0 to 4 are safe right
> >> now, other values lead to a much lower timeout than a user expects.
> >>
> >> The block_copy() call where this is used already takes a uint64_t for
> >> the timeout, so no change required there.
> >>
> >> Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
> >> Reported-by: Friedrich Weber 
> >> Signed-off-by: Fiona Ebner 
> > 
> > Thanks, applied to the block branch.
> > 
> > But I don't think our job is done yet with this. Increasing the limit is
> > good and useful, but even if it's now unlikely to hit with sane values,
> > we should still catch integer overflows in cbw_open() and return an
> > error on too big values instead of silently wrapping around.
> 
> NANOSECONDS_PER_SECOND is 10^9 and the QAPI type for cbw-timeout is
> uint32_t, so even with the maximum allowed value, there is no overflow.
> Should I still add such a check?

You're right, I missed that cbw_timeout is uint32_t. So uint64_t will
always be enough to hold the result, and the calculation is also done in
64 bits because NANOSECONDS_PER_SECOND is long long. Then we don't need
a check.

Kevin
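
The worked bound:

/* Largest possible product:
 *   UINT32_MAX * NANOSECONDS_PER_SECOND
 * = 4294967295 * 1000000000
 * ~ 4.29e18, comfortably below INT64_MAX (~9.22e18); and since
 * NANOSECONDS_PER_SECOND is long long, the multiplication itself is
 * already performed in 64-bit arithmetic, so no overflow check is needed. */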



