Hi Stefan, Stefan Hajnoczi <stefa...@redhat.com> 于2022年6月20日周一 15:55写道: > > On Mon, Jun 20, 2022 at 11:36:11AM +0800, Sam Li wrote: > > Hi Sam, > Is this version 2 of "[RFC v1] Add support for zoned device"? Please > keep the email subject line the same (except for "v2", "v3", etc) so > that it's clear which patch series this new version replaces. > > > Fix some mistakes before. It can report a range of zones now. > > This looks like the description of what changed compared to v1. Please > put the changelog below "---" in the future. When patch emails are > merged by git-am(1) it keeps the text above "---" and discards the text > below "---". The changelog is usually no longer useful once the patches > are merged, so it should be located below the "---" line. > > The text above the "---" is the commit description (an explanation of > why this commit is necessary). In this case the commit description > should explain that this patch adds .bdrv_co_zone_report() and > .bdrv_co_zone_mgmt() to BlockDriver so that zoned block devices can be > supported. > > > > > Signed-off-by: Sam Li <faithilike...@gmail.com> > > --- > > block/block-backend.c | 22 ++++ > > block/coroutines.h | 5 + > > block/file-posix.c | 182 ++++++++++++++++++++++++++++++ > > block/io.c | 23 ++++ > > include/block/block-common.h | 43 ++++++- > > include/block/block-io.h | 13 +++ > > include/block/block_int-common.h | 20 ++++ > > qemu-io-cmds.c | 118 +++++++++++++++++++ > > tests/qemu-iotests/tests/zoned.sh | 52 +++++++++ > > 9 files changed, 477 insertions(+), 1 deletion(-) > > create mode 100644 tests/qemu-iotests/tests/zoned.sh > > > > diff --git a/block/block-backend.c b/block/block-backend.c > > index e0e1aff4b1..20248e4a35 100644 > > --- a/block/block-backend.c > > +++ b/block/block-backend.c > > @@ -104,6 +104,8 @@ typedef struct BlockBackendAIOCB { > > int ret; > > } BlockBackendAIOCB; > > > > + > > + > > Please avoid whitespace changes in code that is otherwise untouched by > your patch. Code changes can cause merge conflicts and they make it > harder to use git-annotate(1), so only changes that are necessary should > be included in a patch. > > > static const AIOCBInfo block_backend_aiocb_info = { > > .get_aio_context = blk_aiocb_get_aio_context, > > .aiocb_size = sizeof(BlockBackendAIOCB), > > @@ -1810,6 +1812,25 @@ int blk_flush(BlockBackend *blk) > > return ret; > > } > > > > Please add a documentation comment for blk_co_zone_report() that > explains how to use the functions and the purpose of the arguments. For > example, does offset have to be the first byte in a zone or can it be > any byte offset? What are the alignment requirements of offset and len? > Why is nr_zones a pointer? > > > +int blk_co_zone_report(BlockBackend *blk, int64_t offset, int64_t len, > > Functions that run in coroutine context must be labeled with > coroutine_fn: > > int coroutine_fn blk_co_zone_report(...) > > This tells humans and tools that the function can only be called from a > coroutine. There is a blog post about coroutines in QEMU here: > https://blog.vmsplice.net/2014/01/coroutines-in-qemu-basics.html > > > + int64_t *nr_zones, > > + struct BlockZoneDescriptor *zones) > > QEMU coding style uses typedefs when defining structs, so "struct > BlockZoneDescriptor *zones" should be written as "BlockZoneDescriptor > *zones". > > > +{ > > + int ret; > > This function is called from the I/O code path, please mark it with: > > IO_CODE(); > > From include/block/block-io.h: > > * I/O API functions. These functions are thread-safe, and therefore > * can run in any thread as long as the thread has called > * aio_context_acquire/release(). > * > * These functions can only call functions from I/O and Common categories, > * but can be invoked by GS, "I/O or GS" and I/O APIs. > * > * All functions in this category must use the macro > * IO_CODE(); > * to catch when they are accidentally called by the wrong API. > > > + ret = bdrv_co_zone_report(blk->root->bs, offset, len, nr_zones, zones); > > Please add blk_inc_in_flight(blk) and blk_dec_in_flight(blk) around this > function call to ensure that zone report requests finish before I/O is > drained (see bdrv_drained_begin()). This is necessary so that it's > possible to wait for I/O requests, including zone report, to complete. > > Similar to blk_co_do_preadv() we need blk_wait_while_drained(blk), > blk_check_byte_request(), and bdrv_inc_in_flight(bs) before calling > bdrv_co_zone_report(). bdrv_dec_in_flight(bs) needs to be called after > bdrv_co_zone_report() returns. > After adding similar structure to blk_co_do_preadv(), zone operation command will always fail at blk_wait_while_drained(blk) because blk->inflight <= 0. Would it be ok to just remove blk_wait_while_drained?
> > + return ret; > > +} > > + > > +int blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op, > > + int64_t offset, int64_t len) > > +{ > > + int ret; > > + ret = bdrv_co_zone_mgmt(blk->root->bs, op, offset, len); > > + > > + return ret; > > +} > > The same applies to this function. > > > + > > + > > void blk_drain(BlockBackend *blk) > > { > > BlockDriverState *bs = blk_bs(blk); > > @@ -2634,3 +2655,4 @@ int blk_make_empty(BlockBackend *blk, Error **errp) > > > > return bdrv_make_empty(blk->root, errp); > > } > > + > > Unrelated whitespace change. > > > diff --git a/block/coroutines.h b/block/coroutines.h > > index 830ecaa733..247326255f 100644 > > --- a/block/coroutines.h > > +++ b/block/coroutines.h > > @@ -80,6 +80,11 @@ int coroutine_fn > > blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes); > > > > int coroutine_fn blk_co_do_flush(BlockBackend *blk); > > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset, > > + int64_t len, int64_t *nr_zones, > > + struct BlockZoneDescriptor *zones); > > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op, > > + int64_t offset, int64_t len); > > > > > > /* > > diff --git a/block/file-posix.c b/block/file-posix.c > > index 48cd096624..71fd21f8ba 100644 > > --- a/block/file-posix.c > > +++ b/block/file-posix.c > > @@ -178,6 +178,137 @@ typedef struct BDRVRawReopenState { > > bool check_cache_dropped; > > } BDRVRawReopenState; > > > > +/* > > + * parse_zone - Fill a zone descriptor > > + */ > > +static inline void parse_zone(struct BlockZoneDescriptor *zone, > > + struct blk_zone *blkz) { > > + zone->start = blkz->start << BDRV_SECTOR_BITS; > > + zone->length = blkz->len << BDRV_SECTOR_BITS; > > + zone->cap = blkz->capacity << BDRV_SECTOR_BITS; > > + zone->wp = blkz->wp << BDRV_SECTOR_BITS; > > + zone->type = blkz->type; > > + zone->cond = blkz->type; > > Should this be "zone->cond = blkz->cond"? > > > +} > > + > > +/** > > + * zone report - Get a zone block device's information in the > > + * form of an array of zone descriptors. > > + * > > + * @param bs: passing zone block device file descriptor > > + * @param zones: Space to hold zone information on reply > > + * @param offset: the location in the zone block device > > + * @return 0 on success, -1 on failure > > + */ > > +static int raw_co_zone_report(BlockDriverState *bs, int64_t offset, > > int64_t len, > > coroutine_fn > > > + int64_t *nr_zones, > > + struct BlockZoneDescriptor *zones) { > > + BDRVRawState *s = bs->opaque; > > + struct blk_zone_report *rep; > > + struct BlockZoneDescriptor bzd; > > + struct blk_zone *blkz; > > + int64_t zone_size_mask, end, rep_size, nrz; > > + int ret, n = 0, i = 0; > > + > > + printf("%s\n", "start to report zone"); > > This looks like debug output. QEMU has a tracing system that you can > use. See docs/devel/tracing.rst. > > > + nrz = *nr_zones; > > + rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct > > blk_zone); > > + rep = (struct blk_zone_report *)malloc(rep_size); > > Please use g_autofree and g_new(). QEMU avoids direct use of > malloc(3)/free(3). > > > + if (!rep) { > > + return -1; > > + } > > + > > + zone_size_mask = zone_start_sector - 1; > > + /* align up */ > > + end = ((offset + len + zone_size_mask) & (~zone_size_mask)) > > + >> BDRV_SECTOR_BITS; > > More readable: > > end = DIV_ROUND_UP(offset + len, BDRV_SECTOR_SIZE); > > > + > > + blkz = (struct blk_zone *)(rep + 1); > > + while (offset < end) { > > + memset(rep, 0, rep_size); > > + rep->sector = offset; > > + rep->nr_zones = nrz; > > + > > + ret = ioctl(s->fd, BLKREPORTZONE, rep); > > Can this ioctl() block? It seems likely. If yes, then the call needs to > be made from the thread pool to avoid blocking the current thread. See > raw_thread_pool_submit(). > > > + if (ret != 0) { > > + ret = -errno; > > + error_report("%d: ioctl BLKREPORTZONE at %ld failed %d", > > + s->fd, offset, errno); > > + free(rep); > > + return ret; > > + } > > + > > + if (!rep->nr_zones) { > > + break; > > + } > > + > > + for (i = 0; i < rep->nr_zones; i++) { > > + parse_zone(&bzd, &blkz[i]); > > + if (zones) { > > + memcpy(&zones[n], &bzd, sizeof(bzd)); > > n is never incremented so this always overwrites the first element. > > > + } > > + } > > + > > + offset = blkz[i].start + blkz[i].len; > > + } > > + > > Does this function need to update *nr_zones = n before returning? How does > the caller know how many zones were returned? > > > + return ret; > > +} > > + > > +/** > > + * zone management operations - Execute an operation on a zone > > + */ > > +static int raw_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, > > + int64_t offset, int64_t len) { > > + BDRVRawState *s = bs->opaque; > > + struct blk_zone_range range; > > + const char *ioctl_name; > > + uint64_t ioctl_op; > > ioctl()'s second argument is unsigned long request. Please use unsigned > long to keep the types consistent. > > > + int64_t zone_size_mask, end; > > + int ret; > > + > > + zone_size_mask = zone_start_sector - 1; > > + /* align up */ > > + end = ((offset + len + zone_size_mask) & (~zone_size_mask)) > > + >> BDRV_SECTOR_BITS; > > + offset = (offset & (~zone_size_mask)) >> BDRV_SECTOR_BITS; /* align > > down */ > > + > > + switch (op) { > > + case zone_open: > > + ioctl_name = "BLKOPENZONE"; > > + ioctl_op = BLKOPENZONE; > > + break; > > + case zone_close: > > + ioctl_name = "BLKCLOSEZONE"; > > + ioctl_op = BLKCLOSEZONE; > > + break; > > + case zone_finish: > > + ioctl_name = "BLKFINISHZONE"; > > + ioctl_op = BLKFINISHZONE; > > + break; > > + case zone_reset: > > + ioctl_name = "BLKRESETZONE"; > > + ioctl_op = BLKRESETZONE; > > + break; > > + default: > > + error_report("Invalid zone operation 0x%x", op); > > + errno = -EINVAL; > > + return -1; > > + } > > + > > + /* Execute the operation */ > > + range.sector = offset; > > + range.nr_sectors = end - offset; > > + ret = ioctl(s->fd, ioctl_op, &range); > > + if (ret != 0) { > > + error_report("ioctl %s failed %d", > > + ioctl_name, errno); > > + return -1; > > + } > > + > > + return 0; > > +} > > + > > static int fd_open(BlockDriverState *bs) > > { > > BDRVRawState *s = bs->opaque; > > @@ -3324,6 +3455,9 @@ BlockDriver bdrv_file = { > > .bdrv_abort_perm_update = raw_abort_perm_update, > > .create_opts = &raw_create_opts, > > .mutable_opts = mutable_opts, > > + > > + .bdrv_co_zone_report = raw_co_zone_report, > > + .bdrv_co_zone_mgmt = raw_co_zone_mgmt, > > }; > > > > /***********************************************/ > > @@ -3703,6 +3837,53 @@ static BlockDriver bdrv_host_device = { > > #endif > > }; > > > > +static BlockDriver bdrv_zoned_host_device = { > > + .format_name = "zoned_host_device", > > + .protocol_name = "zoned_host_device", > > + .instance_size = sizeof(BDRVRawState), > > + .bdrv_needs_filename = true, > > + .bdrv_probe_device = hdev_probe_device, > > + .bdrv_parse_filename = hdev_parse_filename, > > + .bdrv_file_open = hdev_open, > > + .bdrv_close = raw_close, > > + .bdrv_reopen_prepare = raw_reopen_prepare, > > + .bdrv_reopen_commit = raw_reopen_commit, > > + .bdrv_reopen_abort = raw_reopen_abort, > > + .bdrv_co_create_opts = bdrv_co_create_opts_simple, > > + .create_opts = &bdrv_create_opts_simple, > > + .mutable_opts = mutable_opts, > > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > > + .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes, > > + > > + .bdrv_co_preadv = raw_co_preadv, > > + .bdrv_co_pwritev = raw_co_pwritev, > > + .bdrv_co_flush_to_disk = raw_co_flush_to_disk, > > + .bdrv_co_pdiscard = hdev_co_pdiscard, > > + .bdrv_co_copy_range_from = raw_co_copy_range_from, > > + .bdrv_co_copy_range_to = raw_co_copy_range_to, > > + .bdrv_refresh_limits = raw_refresh_limits, > > + .bdrv_io_plug = raw_aio_plug, > > + .bdrv_io_unplug = raw_aio_unplug, > > + .bdrv_attach_aio_context = raw_aio_attach_aio_context, > > + > > + .bdrv_co_truncate = raw_co_truncate, > > + .bdrv_getlength = raw_getlength, > > + .bdrv_get_info = raw_get_info, > > + .bdrv_get_allocated_file_size > > + = raw_get_allocated_file_size, > > + .bdrv_get_specific_stats = hdev_get_specific_stats, > > + .bdrv_check_perm = raw_check_perm, > > + .bdrv_set_perm = raw_set_perm, > > + .bdrv_abort_perm_update = raw_abort_perm_update, > > + .bdrv_probe_blocksizes = hdev_probe_blocksizes, > > + .bdrv_probe_geometry = hdev_probe_geometry, > > + .bdrv_co_ioctl = hdev_co_ioctl, > > + > > + /* zone management operations */ > > + .bdrv_co_zone_report = raw_co_zone_report, > > + .bdrv_co_zone_mgmt = raw_co_zone_mgmt, > > +}; > > + > > #if defined(__linux__) || defined(__FreeBSD__) || > > defined(__FreeBSD_kernel__) > > static void cdrom_parse_filename(const char *filename, QDict *options, > > Error **errp) > > @@ -3964,6 +4145,7 @@ static void bdrv_file_init(void) > > #if defined(HAVE_HOST_BLOCK_DEVICE) > > bdrv_register(&bdrv_host_device); > > #ifdef __linux__ > > + bdrv_register(&bdrv_zoned_host_device); > > bdrv_register(&bdrv_host_cdrom); > > #endif > > #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) > > diff --git a/block/io.c b/block/io.c > > index 789e6373d5..3e8bb83cc3 100644 > > --- a/block/io.c > > +++ b/block/io.c > > @@ -3258,6 +3258,29 @@ out: > > return co.ret; > > } > > > > +int bdrv_co_zone_report(BlockDriverState *bs, > > + int64_t offset, int64_t len, int64_t *nr_zones, > > + struct BlockZoneDescriptor *zones) > > +{ > > + int ret = 0; > > + if (!bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones)) { > > + return -ENOTSUP; > > + } > > This returns -ENOTSUP any time ->bdrv_co_zone_report() returns 0. Should > this be: > > if (!bs->drv->bdrv_co_zone_report) { > return -ENOTSUP; > } > > return bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones); > > ? > > > + > > + return ret; > > +} > > + > > +int bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, > > + int64_t offset, int64_t len) > > +{ > > + int ret = 0; > > + if (!bs->drv->bdrv_co_zone_mgmt(bs, op, offset, len)) { > > + return -ENOTSUP; > > + } > > + > > + return ret; > > +} > > + > > void *qemu_blockalign(BlockDriverState *bs, size_t size) > > { > > IO_CODE(); > > diff --git a/include/block/block-common.h b/include/block/block-common.h > > index fdb7306e78..24c468d8ad 100644 > > --- a/include/block/block-common.h > > +++ b/include/block/block-common.h > > @@ -23,7 +23,7 @@ > > */ > > #ifndef BLOCK_COMMON_H > > #define BLOCK_COMMON_H > > - > > +#include <linux/blkzoned.h> > > Linux-specific code must be #ifdef __linux__ to avoid compilation errors > on non-Linux hosts. > > > #include "block/aio.h" > > #include "block/aio-wait.h" > > #include "qemu/iov.h" > > @@ -48,6 +48,47 @@ > > typedef struct BlockDriver BlockDriver; > > typedef struct BdrvChild BdrvChild; > > typedef struct BdrvChildClass BdrvChildClass; > > +enum zone_type { > > Please use "typedef enum { ... } BdrvZoneType" so that enums follow > QEMU's coding style. > > > + BLK_ZT_CONV = BLK_ZONE_TYPE_CONVENTIONAL, > > + BLK_ZT_SWR = BLK_ZONE_TYPE_SEQWRITE_REQ, > > + BLK_ZT_SWP = BLK_ZONE_TYPE_SEQWRITE_PREF, > > +}; > > + > > +enum zone_cond { > > + BLK_ZS_NOT_WP = BLK_ZONE_COND_NOT_WP, > > + BLK_ZS_EMPTY = BLK_ZONE_COND_EMPTY, > > + BLK_ZS_IOPEN = BLK_ZONE_COND_IMP_OPEN, > > + BLK_ZS_EOPEN = BLK_ZONE_COND_EXP_OPEN, > > + BLK_ZS_CLOSED = BLK_ZONE_COND_CLOSED, > > + BLK_ZS_RDONLY = BLK_ZONE_COND_READONLY, > > + BLK_ZS_FULL = BLK_ZONE_COND_FULL, > > + BLK_ZS_OFFLINE = BLK_ZONE_COND_OFFLINE, > > +}; > > This 1:1 correspondence with Linux constants could make the code a > little harder to port. > > Maybe QEMU's block layer should define its own numeric constants so the > code doesn't rely on operating system-specific headers. > block/file-posix.c #ifdef __linux__ code can be responsible for > converting Linux-specific constants to QEMU constants (and the 1:1 > mapping can be used there). > Can we define those constants in block-common.h? Because BlockZoneDescriptor requires zone_condition, zone_type defined and BlockZoneDesicriptor are used in header files and qemu-io sub-commands. If we use #ifdef __linux__ in block-common.h, it can be responsible for converting Linux constants instead. Thanks for reviewing! If there is any problem, please let me know. Best regards, Sam > > + > > +enum zone_op { > > + zone_open, > > + zone_close, > > + zone_finish, > > + zone_reset, > > +}; > > + > > +/* > > + * Zone descriptor data structure. > > + * Provide information on a zone with all position and size values in > > bytes. > > + */ > > +typedef struct BlockZoneDescriptor { > > + uint64_t start; > > + uint64_t length; > > + uint64_t cap; > > + uint64_t wp; > > + enum zone_type type; > > + enum zone_cond cond; > > +} BlockZoneDescriptor; > > + > > +enum zone_model { > > + BLK_Z_HM, > > + BLK_Z_HA, > > +}; > > > > typedef struct BlockDriverInfo { > > /* in bytes, 0 if irrelevant */ > > diff --git a/include/block/block-io.h b/include/block/block-io.h > > index 62c84f0519..deb8902684 100644 > > --- a/include/block/block-io.h > > +++ b/include/block/block-io.h > > @@ -80,6 +80,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void > > *buf); > > /* Ensure contents are flushed to disk. */ > > int coroutine_fn bdrv_co_flush(BlockDriverState *bs); > > > > +/* Report zone information of zone block device. */ > > +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset, > > + int64_t len, int64_t *nr_zones, > > + struct BlockZoneDescriptor *zones); > > +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, > > + int64_t offset, int64_t len); > > + > > int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes); > > bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs); > > int bdrv_block_status(BlockDriverState *bs, int64_t offset, > > @@ -290,6 +297,12 @@ bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector > > *qiov, int64_t pos); > > int generated_co_wrapper > > bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); > > > > +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset, > > + int64_t len, int64_t *nr_zones, > > + struct BlockZoneDescriptor > > *zones); > > +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, enum zone_op op, > > + int64_t offset, int64_t len); > > + > > /** > > * bdrv_parent_drained_begin_single: > > * > > diff --git a/include/block/block_int-common.h > > b/include/block/block_int-common.h > > index 8947abab76..4d0180a7da 100644 > > --- a/include/block/block_int-common.h > > +++ b/include/block/block_int-common.h > > @@ -63,6 +63,7 @@ > > #define BLOCK_OPT_EXTL2 "extended_l2" > > > > #define BLOCK_PROBE_BUF_SIZE 512 > > +#define zone_start_sector 512 > > > > enum BdrvTrackedRequestType { > > BDRV_TRACKED_READ, > > @@ -94,6 +95,19 @@ typedef struct BdrvTrackedRequest { > > struct BdrvTrackedRequest *waiting_for; > > } BdrvTrackedRequest; > > > > +/** > > + * Zone device information data structure. > > + * Provide information on a device. > > + */ > > +typedef struct zbd_dev { > > + enum zone_model model; > > + uint32_t block_size; > > How does this relate to QEMU's BlockLimits? > > > + uint32_t write_granularity; > > Same. Maybe this belongs in BlockLimits? > > > + uint32_t nr_zones; > > Should this really be limited to 32-bit? For example, take 256 MB zones, > then the max nr_zones * 256 MB is much smaller than a uint64_t capacity > value. It seems safer to make this 64-bit, but maybe Dmitry, Hannes, or > Damien can tell us what to do here. > > > + struct BlockZoneDescriptor *zones; /* array of zones */ > > When is this used?