Am 06.01.22 um 17:01 schrieb Ilya Dryomov: > On Thu, Jan 6, 2022 at 4:27 PM Peter Lieven <p...@kamp.de> wrote: >> Am 05.10.21 um 10:36 schrieb Ilya Dryomov: >>> On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven <p...@kamp.de> wrote: >>>> Am 05.10.21 um 09:54 schrieb Ilya Dryomov: >>>>> On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven <p...@kamp.de> wrote: >>>>>> the qemu rbd driver currently lacks support for bdrv_co_block_status. >>>>>> This results mainly in incorrect progress during block operations (e.g. >>>>>> qemu-img convert with an rbd image as source). >>>>>> >>>>>> This patch utilizes the rbd_diff_iterate2 call from librbd to detect >>>>>> allocated and unallocated (all zero areas). >>>>>> >>>>>> To avoid querying the ceph OSDs for the answer this is only done if >>>>>> the image has the fast-diff feature which depends on the object-map and >>>>>> exclusive-lock features. In this case it is guaranteed that the >>>>>> information >>>>>> is present in memory in the librbd client and thus very fast. >>>>>> >>>>>> If fast-diff is not available all areas are reported to be allocated >>>>>> which is the current behaviour if bdrv_co_block_status is not >>>>>> implemented. >>>>>> >>>>>> Signed-off-by: Peter Lieven <p...@kamp.de> >>>>>> --- >>>>>> V2->V3: >>>>>> - check rbd_flags every time (they can change during runtime) [Ilya] >>>>>> - also check for fast-diff invalid flag [Ilya] >>>>>> - *map and *file cant be NULL [Ilya] >>>>>> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an >>>>>> unallocated area [Ilya] >>>>>> - typo: catched -> caught [Ilya] >>>>>> - changed wording about fast-diff, object-map and exclusive lock in >>>>>> commit msg [Ilya] >>>>>> >>>>>> V1->V2: >>>>>> - add commit comment [Stefano] >>>>>> - use failed_post_open [Stefano] >>>>>> - remove redundant assert [Stefano] >>>>>> - add macro+comment for the magic -9000 value [Stefano] >>>>>> - always set *file if its non NULL [Stefano] >>>>>> >>>>>> block/rbd.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> 1 file changed, 126 insertions(+) >>>>>> >>>>>> diff --git a/block/rbd.c b/block/rbd.c >>>>>> index dcf82b15b8..3cb24f9981 100644 >>>>>> --- a/block/rbd.c >>>>>> +++ b/block/rbd.c >>>>>> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific >>>>>> *qemu_rbd_get_specific_info(BlockDriverState *bs, >>>>>> return spec_info; >>>>>> } >>>>>> >>>>>> +typedef struct rbd_diff_req { >>>>>> + uint64_t offs; >>>>>> + uint64_t bytes; >>>>>> + int exists; >>>>> Hi Peter, >>>>> >>>>> Nit: make exists a bool. The one in the callback has to be an int >>>>> because of the callback signature but let's not spread that. >>>>> >>>>>> +} rbd_diff_req; >>>>>> + >>>>>> +/* >>>>>> + * rbd_diff_iterate2 allows to interrupt the exection by returning a >>>>>> negative >>>>>> + * value in the callback routine. Choose a value that does not conflict >>>>>> with >>>>>> + * an existing exitcode and return it if we want to prematurely stop the >>>>>> + * execution because we detected a change in the allocation status. >>>>>> + */ >>>>>> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000 >>>>>> + >>>>>> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len, >>>>>> + int exists, void *opaque) >>>>>> +{ >>>>>> + struct rbd_diff_req *req = opaque; >>>>>> + >>>>>> + assert(req->offs + req->bytes <= offs); >>>>>> + >>>>>> + if (req->exists && offs > req->offs + req->bytes) { >>>>>> + /* >>>>>> + * we started in an allocated area and jumped over an >>>>>> unallocated area, >>>>>> + * req->bytes contains the length of the allocated area before >>>>>> the >>>>>> + * unallocated area. stop further processing. >>>>>> + */ >>>>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2; >>>>>> + } >>>>>> + if (req->exists && !exists) { >>>>>> + /* >>>>>> + * we started in an allocated area and reached a hole. >>>>>> req->bytes >>>>>> + * contains the length of the allocated area before the hole. >>>>>> + * stop further processing. >>>>>> + */ >>>>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2; >>>>> Do you have a test case for when this branch is taken? >>>> That would happen if you diff from a snapshot, the question is if it can >>>> also happen if the image is a clone from a snapshot? >>>> >>>> >>>>>> + } >>>>>> + if (!req->exists && exists && offs > req->offs) { >>>>>> + /* >>>>>> + * we started in an unallocated area and hit the first allocated >>>>>> + * block. req->bytes must be set to the length of the >>>>>> unallocated area >>>>>> + * before the allocated area. stop further processing. >>>>>> + */ >>>>>> + req->bytes = offs - req->offs; >>>>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2; >>>>>> + } >>>>>> + >>>>>> + /* >>>>>> + * assert that we caught all cases above and allocation state has >>>>>> not >>>>>> + * changed during callbacks. >>>>>> + */ >>>>>> + assert(exists == req->exists || !req->bytes); >>>>>> + req->exists = exists; >>>>>> + >>>>>> + /* >>>>>> + * assert that we either return an unallocated block or have got >>>>>> callbacks >>>>>> + * for all allocated blocks present. >>>>>> + */ >>>>>> + assert(!req->exists || offs == req->offs + req->bytes); >>>>>> + req->bytes = offs + len - req->offs; >>>>>> + >>>>>> + return 0; >>>>>> +} >>>>>> + >>>>>> +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, >>>>>> + bool want_zero, >>>>>> int64_t offset, >>>>>> + int64_t bytes, int64_t >>>>>> *pnum, >>>>>> + int64_t *map, >>>>>> + BlockDriverState >>>>>> **file) >>>>>> +{ >>>>>> + BDRVRBDState *s = bs->opaque; >>>>>> + int ret, r; >>>>> Nit: I would rename ret to status or something like that to make >>>>> it clear(er) that it is an actual value and never an error. Or, >>>>> even better, drop it entirely and return one of the two bitmasks >>>>> directly. >>>>> >>>>>> + struct rbd_diff_req req = { .offs = offset }; >>>>>> + uint64_t features, flags; >>>>>> + >>>>>> + assert(offset + bytes <= s->image_size); >>>>>> + >>>>>> + /* default to all sectors allocated */ >>>>>> + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; >>>>>> + *map = offset; >>>>>> + *file = bs; >>>>>> + *pnum = bytes; >>>>>> + >>>>>> + /* check if RBD image supports fast-diff */ >>>>>> + r = rbd_get_features(s->image, &features); >>>>>> + if (r < 0) { >>>>>> + goto out; >>>>>> + } >>>>>> + if (!(features & RBD_FEATURE_FAST_DIFF)) { >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + /* check if RBD fast-diff result is valid */ >>>>>> + r = rbd_get_flags(s->image, &flags); >>>>>> + if (r < 0) { >>>>>> + goto out; >>>>>> + } >>>>>> + if (flags & RBD_FLAG_FAST_DIFF_INVALID) { >>>>>> + goto out; >>>>> Nit: I'm not a fan of labels that cover just the return statement. >>>>> Feel free to ignore this one but I would get rid of it and replace >>>>> these gotos with returns. >>>> That would be return with the bitmask directly coded in if I also >>>> >>>> drop the ret variable. I can change that, no problem. >>>> >>>> >>>>>> + } >>>>>> + >>>>>> + r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true, >>>>>> + qemu_rbd_co_block_status_cb, &req); >>>>>> + if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) { >>>>>> + goto out; >>>>>> + } >>>>>> + assert(req.bytes <= bytes); >>>>>> + if (!req.exists) { >>>>>> + if (r == 0 && !req.bytes) { >>>>>> + /* >>>>>> + * rbd_diff_iterate2 does not invoke callbacks for >>>>>> unallocated areas >>>>>> + * except for the case where an overlay has a hole where >>>>>> the parent >>>>>> + * has not. This here catches the case where no callback was >>>>>> + * invoked at all. >>>>>> + */ >>>>> The above is true in the case of diffing against a snapshot, i.e. when >>>>> the "from" snapshot has some data where the "to" revision (whether HEAD >>>>> or another snapshot) has a hole. I don't think it is true for child vs >>>>> parent (but it has been a while since I looked at the diff code). As >>>>> long as NULL is passed for fromsnapname, I would expect the callback to >>>>> be invoked only for allocated areas. If I'm right, we could simplify >>>>> qemu_rbd_co_block_status_cb() considerably. >>>> See my comment in the callback. Can you look at the diff code or give >>>> me at least a pointer where I can find it. I am not that familiar with >>>> the librbd code yet. >>> I assumed that you added !exists handling because it came up in your >>> testing. If you don't have a test case then let's proceed under the >>> assumption that it doesn't happen for clones. >> >> Hi Ilya, >> >> >> it seems that our assumption was false. I have an image which shows holes >> without diffing against >> >> a snapshot. Do you have an idea why? >> >> >> $ rbd --conf /etc/ceph/ceph-dev01.conf --id lieven info >> dhp-standard/c4ca7ee9-36ce-4fc9-9d3b-ece8a4f8b83e/c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw >> rbd image 'c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw': >> size 20 GiB in 20000 objects >> order 20 (1 MiB objects) >> snapshot_count: 2 >> id: 3d6daa102e4d9f >> block_name_prefix: rbd_data.3d6daa102e4d9f >> format: 2 >> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten >> op_features: >> flags: >> create_timestamp: Tue Sep 21 14:16:56 2021 >> access_timestamp: Thu Jan 6 15:24:46 2022 >> modify_timestamp: Thu Jan 6 15:45:42 2022 >> >> >> $ rbd --conf /etc/ceph/ceph-dev01.conf --id lieven snap ls >> dhp-standard/c4ca7ee9-36ce-4fc9-9d3b-ece8a4f8b83e/c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw >> SNAPID NAME SIZE PROTECTED TIMESTAMP >> 12297 dlp-20210921-144509 20 GiB Tue Sep 21 14:45:09 2021 >> 17745 dlp-20220106-040000 20 GiB Thu Jan 6 04:00:01 2022 >> >> >> $ rbd --conf /etc/ceph/ceph-dev01.conf --id lieven diff --whole-object >> dhp-standard/c4ca7ee9-36ce-4fc9-9d3b-ece8a4f8b83e/c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw >> | grep zero >> 204472320 1048576 zero >> 1114636288 1048576 zero >> 1115684864 1048576 zero >> 1116733440 1048576 zero >> 1117782016 1048576 zero >> 1218445312 1048576 zero >> 1219493888 1048576 zero >> 1220542464 1048576 zero > Hi Peter, > > Yes, I stumbled upon this just yesterday while looking into another > librbd issue surfaced by this patch [1] and was going to email you and > the list after wrapping my head around this behavior. I see where it > is coming from but I'm not sure what the right fix is. I would prefer > to patch librbd but that may turn out to be not feasible. Let me get > back on this next week. > > [1] https://tracker.ceph.com/issues/53784
Can you please verify and check if this would be a valid workaround for #53784 diff --git a/block/rbd.c b/block/rbd.c index def96292e0..19c01bb89f 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -1320,6 +1320,7 @@ static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, int status, r; RBDDiffIterateReq req = { .offs = offset }; uint64_t features, flags; + int64_t head; assert(offset + bytes <= s->image_size); @@ -1347,6 +1348,11 @@ static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, return status; } + head = offset & (s->object_size - 1); + offset -= head; + req.offs -= head; + bytes += head; + r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true, qemu_rbd_diff_iterate_cb, &req); if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) { @@ -1366,7 +1372,8 @@ static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID; } - *pnum = req.bytes; + assert(req.bytes > head); + *pnum = req.bytes - head; return status; } I would then prepare a series with two patches. One for #53784 and a seconds to fix the exists asssertion. Best, Peter