23.05.2019 20:27, Max Reitz wrote: > On 17.05.19 16:50, Vladimir Sementsov-Ogievskiy wrote: >> 10.04.2019 23:20, Max Reitz wrote: >>> What bs->file and bs->backing mean depends on the node. For filter >>> nodes, both signify a node that will eventually receive all R/W >>> accesses. For format nodes, bs->file contains metadata and data, and >>> bs->backing will not receive writes -- instead, writes are COWed to >>> bs->file. Usually. >>> >>> In any case, it is not trivial to guess what a child means exactly with >>> our currently limited form of expression. It is better to introduce >>> some functions that actually guarantee a meaning: >>> >>> - bdrv_filtered_cow_child() will return the child that receives requests >>> filtered through COW. That is, reads may or may not be forwarded >>> (depending on the overlay's allocation status), but writes never go to >>> this child. >>> >>> - bdrv_filtered_rw_child() will return the child that receives requests >>> filtered through some very plain process. Reads and writes issued to >>> the parent will go to the child as well (although timing, etc. may be >>> modified). >>> >>> - All drivers but quorum (but quorum is pretty opaque to the general >>> block layer anyway) always only have one of these children: All read >>> requests must be served from the filtered_rw_child (if it exists), so >>> if there was a filtered_cow_child in addition, it would not receive >>> any requests at all. >>> (The closest here is mirror, where all requests are passed on to the >>> source, but with write-blocking, write requests are "COWed" to the >>> target. But that just means that the target is a special child that >>> cannot be introspected by the generic block layer functions, and that >>> source is a filtered_rw_child.) >>> Therefore, we can also add bdrv_filtered_child() which returns that >>> one child (or NULL, if there is no filtered child). >>> >>> Also, many places in the current block layer should be skipping filters >>> (all filters or just the ones added implicitly, it depends) when going >>> through a block node chain. They do not do that currently, but this >>> patch makes them. >>> >>> One example for this is qemu-img map, which should skip filters and only >>> look at the COW elements in the graph. The change to iotest 204's >>> reference output shows how using blkdebug on top of a COW node used to >>> make qemu-img map disregard the rest of the backing chain, but with this >>> patch, the allocation in the base image is reported correctly. >>> >>> Furthermore, a note should be made that sometimes we do want to access >>> bs->backing directly. This is whenever the operation in question is not >>> about accessing the COW child, but the "backing" child, be it COW or >>> not. This is the case in functions such as bdrv_open_backing_file() or >>> whenever we have to deal with the special behavior of @backing as a >>> blockdev option, which is that it does not default to null like all >>> other child references do. >>> >>> Finally, the query functions (query-block and query-named-block-nodes) >>> are modified to return any filtered child under "backing", not just >>> bs->backing or COW children. This is so that filters do not interrupt >>> the reported backing chain. This changes the output of iotest 184, as >>> the throttled node now appears as a backing child. >>> >>> Signed-off-by: Max Reitz <mre...@redhat.com> >>> --- >> >> [..] >> >>> --- a/block/mirror.c >>> +++ b/block/mirror.c > > [...] > >>> @@ -1650,7 +1651,9 @@ static void mirror_start_job(const char *job_id, >>> BlockDriverState *bs, >>> * any jobs in them must be blocked */ >>> if (target_is_backing) { >>> BlockDriverState *iter; >>> - for (iter = backing_bs(bs); iter != target; iter = >>> backing_bs(iter)) { >>> + for (iter = bdrv_filtered_bs(bs); iter != target; >> >> should it be filtered_target too? > > Hmm... The comment says that all nodes that disappear must be blocked. > I don’t even know by heart which nodes I let disappear. :-/ > > I suppose we should start at the first explicit node, filter or not...?
Hm, I thought about where should we stop. But I don't think we want to remove nodes under target, so it should be OK as is.. > >>> + iter = bdrv_filtered_bs(iter)) >>> + { >>> /* XXX BLK_PERM_WRITE needs to be allowed so we don't block >>> * ourselves at s->base (if writes are blocked for a node, >>> they are >>> * also blocked for its backing file). The other options >>> would be a > > [...] > >>> @@ -1707,14 +1710,14 @@ void mirror_start(const char *job_id, >>> BlockDriverState *bs, >>> MirrorCopyMode copy_mode, Error **errp) >>> { >>> bool is_none_mode; >>> - BlockDriverState *base; >>> + BlockDriverState *base = NULL; >> >> dead assignment > > Now I wonder why I even have that. Probably an artifact from some > intermediate point. > >>> >>> if (mode == MIRROR_SYNC_MODE_INCREMENTAL) { >>> error_setg(errp, "Sync mode 'incremental' not supported"); >>> return; >>> } >>> is_none_mode = mode == MIRROR_SYNC_MODE_NONE; >>> - base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL; >>> + base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : >>> NULL; >>> mirror_start_job(job_id, bs, creation_flags, target, replaces, >>> speed, granularity, buf_size, backing_mode, >>> on_source_error, on_target_error, unmap, NULL, NULL, >>> diff --git a/block/qapi.c b/block/qapi.c >>> index 110d05dc57..478c6f5e0d 100644 >>> --- a/block/qapi.c >>> +++ b/block/qapi.c > > [...] > >>> @@ -535,9 +538,10 @@ static BlockStats >>> *bdrv_query_bds_stats(BlockDriverState *bs, >>> s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level); >>> } >>> >>> - if (blk_level && bs->backing) { >>> + cow_bs = bdrv_filtered_cow_bs(bs); >> >> So, if we at blk_level and top bs is explicit filter, you don't want to show >> it's >> child? > > I do. It’s in s->parent. I thought it makes sense to change the > existing bs->file vs. bs->backing to storage vs. COW. Hmm. When I reviewed this I didn't consider the following patch. Actually, in this patch you break showing backing child for filter, and in following - fix it. Not very good, but not a reason to merge these two patches.. Ok for me. > >> Hmm, at least, we can't show it if it is file-child, as qapi filed already >> called >> backing. So, if we can't show for file-child-based filters, it may be better >> to not >> show filter children here at all. >> >>> + if (blk_level && cow_bs) { >>> s->has_backing = true; >>> - s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level); >>> + s->backing = bdrv_query_bds_stats(cow_bs, blk_level); >>> } >>> >>> return s; >>> diff --git a/block/stream.c b/block/stream.c >>> index bfaebb861a..23d5c890e0 100644 >>> --- a/block/stream.c >>> +++ b/block/stream.c >>> @@ -65,6 +65,7 @@ static int stream_prepare(Job *job) >>> StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); >>> BlockJob *bjob = &s->common; >>> BlockDriverState *bs = blk_bs(bjob->blk); >>> + BlockDriverState *unfiltered = bdrv_skip_rw_filters(bs); >> >> Aha, I'd call it filtered, but unfiltered is correct too, it's amazing > > Haha :-) > > I think it’s all rather insane than amazing, but, well, insanity never > ceases to amaze, does it. > >>> BlockDriverState *base = s->base; >>> Error *local_err = NULL; >>> int ret = 0; >>> @@ -72,7 +73,7 @@ static int stream_prepare(Job *job) > > [...] > >>> @@ -121,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error >>> **errp) >>> int64_t n = 0; /* bytes */ >>> void *buf; >>> >>> - if (!bs->backing) { >>> + if (!bdrv_filtered_child(bs)) { >>> goto out; >>> } >> >> this condition checks that there is nothing to stream, so, I thing it's >> better to check >> if (!bdrv_backing_chain_next(bs)) { >> goto out; >> } > > Ah, sure. > >>> @@ -162,7 +163,7 @@ static int coroutine_fn stream_run(Job *job, Error >>> **errp) >>> } else if (ret >= 0) { >>> /* Copy if allocated in the intermediate images. Limit to >>> the >>> * known-unallocated area [offset, >>> offset+n*BDRV_SECTOR_SIZE). */ >>> - ret = bdrv_is_allocated_above(backing_bs(bs), base, >>> + ret = bdrv_is_allocated_above(bdrv_filtered_bs(bs), base, >>> offset, n, &n); >> >> Hmm, if we trying to support bs to be filter, and actually operate on >> first-non-filter, >> as you write in qapi spec, this is wrong. Again it should be >> bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)).. > > Would bdrv_backing_chain_next() fulfill the same purpose? It can’t be > allocated in a filter node, after all. It's OK too. > >> Or, may be better, we at stream start should calculate reald top bs to >> operate on, and >> forget about all filters above.. i.e., do bs = bdrv_skip_rw_filters(bs) at >> the very >> beginning, when creating a job. > > Sounds reasonable. We can ignore all the filters on top of the > (un)filtered top anyway. > >>> /* Finish early if end of backing file has been reached */ >>> @@ -268,7 +269,9 @@ void stream_start(const char *job_id, BlockDriverState >>> *bs, >>> * disappear from the chain after this operation. The streaming job >>> reads >>> * every block only once, assuming that it doesn't change, so block >>> writes >>> * and resizes. */ >>> - for (iter = backing_bs(bs); iter && iter != base; iter = >>> backing_bs(iter)) { >>> + for (iter = bdrv_filtered_bs(bs); iter && iter != base; >>> + iter = bdrv_filtered_bs(iter)) >>> + { >>> block_job_add_bdrv(&s->common, "intermediate node", iter, 0, >>> BLK_PERM_CONSISTENT_READ | >>> BLK_PERM_WRITE_UNCHANGED, >>> &error_abort); >>> diff --git a/blockdev.c b/blockdev.c >>> index 4775a07d93..bb71b8368d 100644 >>> --- a/blockdev.c >>> +++ b/blockdev.c >>> @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict) >>> return; >>> } >>> >>> - bs = blk_bs(blk); >>> + bs = bdrv_skip_implicit_filters(blk_bs(blk)); >>> aio_context = bdrv_get_aio_context(bs); >>> aio_context_acquire(aio_context); >>> >>> @@ -1663,7 +1663,7 @@ static void external_snapshot_prepare(BlkActionState >>> *common, >>> goto out; >>> } >>> >>> - if (state->new_bs->backing != NULL) { >>> + if (bdrv_filtered_cow_child(state->new_bs)) { >> >> Do we allow to create filter snapshot? We should either restrict it >> explicitly or >> check bdrv_filtered_child here.. And we can't allow file-based-filters >> anyway.. > > Hm, yes, we should probably check both (separately to give better error > messages). > > In theory it might be possible to allow filters on top, but there isn’t > really any point. If someone wants to add filters on top of the > snapshot, they should use reopen. > >> [skipped up to the end of blockdev.c, I'm tired o_O] > > I can very much relate. :-) > > Your review definitely is much appreciated. > >>> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c >>> index d1bb863cb6..f99f753fba 100644 >>> --- a/migration/block-dirty-bitmap.c >>> +++ b/migration/block-dirty-bitmap.c >>> @@ -285,9 +285,7 @@ static int init_dirty_bitmap_migration(void) >>> const char *drive_name = bdrv_get_device_or_node_name(bs); >>> >>> /* skip automatically inserted nodes */ >>> - while (bs && bs->drv && bs->implicit) { >>> - bs = backing_bs(bs); >>> - } >>> + bs = bdrv_skip_implicit_filters(bs); >> >> this intersects with Jonh's patch >> [PATCH v2] migration/dirty-bitmaps: change bitmap enumeration method >> https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03340.html > > Well. I’m not really considerate of other patches with this series. > Rebasing is always such a pain that I just write it for the current > master. I won’t incorporate unmerged series because doing so may cause > me to have to rebase more than once. > > And I can’t get this series merged soon enough because it’s just wrong > that I (and you) have to the one(s) thinking about how to treat filters > everywhere. It should be the people that introduce the code. I think it's just impossible to think over all filter use-cases now. Better to stop at some point, and then fix wrong things while covering real use-cases by io-tests. > >>> for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap; >>> bitmap = bdrv_dirty_bitmap_next(bs, bitmap)) >>> diff --git a/nbd/server.c b/nbd/server.c >>> index e21bd501dc..e41ae89dbe 100644 >>> --- a/nbd/server.c >>> +++ b/nbd/server.c >>> @@ -1506,13 +1506,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, >>> uint64_t dev_offset, >>> if (bitmap) { >>> BdrvDirtyBitmap *bm = NULL; >>> >>> - while (true) { >>> + while (bs) { >>> bm = bdrv_find_dirty_bitmap(bs, bitmap); >>> - if (bm != NULL || bs->backing == NULL) { >>> + if (bm != NULL) { >>> break; >>> } >>> >>> - bs = bs->backing->bs; >>> + bs = bdrv_filtered_bs(bs); >>> } >> >> Check in documentation: "@bitmap: Also export the dirty bitmap reachable >> from @device". >> >> "Reachable" is not bad, but we may want to clarify that extended backing >> chain is meant > > Hm... Isn’t that just a problem with the current documentation? > Yes. Anyway, it's not necessary to fix it in theses series. > I think this change in code better fits what I’d guess from “reachable” > than what it currently means. > >>> if (bm == NULL) { >>> diff --git a/qemu-img.c b/qemu-img.c >>> index aa6f81f1ea..bcfbb743fc 100644 >>> --- a/qemu-img.c >>> +++ b/qemu-img.c > > [...] > >>> @@ -2434,7 +2433,8 @@ static int img_convert(int argc, char **argv) >>> * s.target_backing_sectors has to be negative, which it will >>> * be automatically). The backing file length is used only >>> * for optimizations, so such a case is not fatal. */ >>> - s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs); >>> + s.target_backing_sectors = >>> + bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs)); >> >> can't out_bs be filter itself? > > why would you do that > > More serious, well, perhaps, in theory. In practice I really cannot > imagine why it would be. Throttling? But it has file child and will not work with backing anyway. And we don't have public backing-based filters anyway. So, I don't care, it's OK. > >> >>> } else { >>> s.target_backing_sectors = -1; >>> } >>> @@ -2797,6 +2797,7 @@ static int get_block_status(BlockDriverState *bs, >>> int64_t offset, >>> >>> depth = 0; >>> for (;;) { >>> + bs = bdrv_skip_rw_filters(bs); >> >> Why? Filters may have own implementation of block_status, why to skip it? >> >> Or, thay cannot? Really, may be disallow filters have block_status, we may >> solve >> inefficient block_status_above we talked about before. > > As said in the other subthread, I think ignoring filters here is fine. > > Max > >>> ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file); >>> if (ret < 0) { >>> return ret; > -- Best regards, Vladimir