On 31.10.2025 10:29, Hanna Czenczek wrote:
> On 29.10.25 21:23, Kevin Wolf wrote:
> > On 28.10.2025 17:33, Hanna Czenczek wrote:
> > > The cache-cleaner runs as a timer CB in the BDS AioContext. With
> > > multiqueue, it can run concurrently with I/O requests, and because it does
> > > not take any lock, this can break concurrent cache accesses, corrupting
> > > the image. While the chances of this happening are low, it can be
> > > reproduced e.g. by modifying the code to schedule the timer CB every
> > > 5 ms (instead of at most once per second) and modifying the last (inner)
> > > while loop of qcow2_cache_clean_unused() like so:
> > >
> > >     while (i < c->size && can_clean_entry(c, i)) {
> > >         for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
> > >             usleep(100);
> > >         }
> > >         c->entries[i].offset = 0;
> > >         c->entries[i].lru_counter = 0;
> > >         i++;
> > >         to_clean++;
> > >     }
> > >
> > > i.e. making it wait on purpose for the point in time where the cache is
> > > in use by something else.
> > >
> > > The solution chosen in this patch is not the best one, I hope, but I
> > > admittedly can’t come up with anything strictly better.
> > >
> > > We can protect against concurrent cache accesses either by taking the
> > > existing s->lock or by introducing a new (non-coroutine) mutex
> > > specifically for cache accesses. I would prefer to avoid the latter so
> > > as not to introduce additional (very slight) overhead.
> > In theory, the old plan was that eventually qcow2 would use fine-grained
> > locks instead of the single s->lock, and having a separate cache lock
> > would be a step towards it. But if we never actually make use of it to
> > hold s->lock for a shorter time, that's not really a good argument. I'm
> > not sure that's ever going to happen, except maybe as part of a rewrite
> > in Rust or something.
> >
> > I never tried to measure specifically whether lock contention is a
> > problem with high queue depth and random I/O on a huge disk.
> > Intuitively, holding s->lock while doing I/O to load entries into the
> > cache can't really be good.
> >
> > Anyway, I went off on a bit of a tangent there...
> >
> > > Using s->lock, which is a coroutine mutex, however means that we need to
> > > take it in a coroutine, so the timer CB must enter such a coroutine. As
> > > a result, descheduling the timer is no longer a guarantee that the
> > > cache-cleaner will not run, because it may now be yielding in
> > > qemu_co_mutex_lock().
> > I think creating a coroutine in cache_clean_timer_cb() is the wrong
> > approach. Instead, cache_clean_timer_init() could create a coroutine
> > and its implementation could be something like this:
> >
> >     while (!s->cache_clean_timer_stopping) {
> >         qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
> >                                   QEMU_CLOCK_VIRTUAL,
Oh, I wanted to comment on this one, too, but apparently forgot...
I have absolutely no idea why we're using QEMU_CLOCK_VIRTUAL in the
existing code. I don't think the cache cleaning should stop just because
the guest is paused. You can still have block jobs and exports that work
on the image and fill the cache.
> >                                   s->cache_clean_interval *
> >                                   NANOSECONDS_PER_SECOND);
> >
> >         qemu_co_mutex_lock(&s->lock);
> >         qcow2_cache_clean_unused(s->l2_table_cache);
> >         qcow2_cache_clean_unused(s->refcount_block_cache);
> >         qemu_co_mutex_unlock(&s->lock);
> >     }
> >     s->cache_clean_timer_stopping = false;
>
> Ah, that’s nicer. I think we can replace the flag by checking
> s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting
> coroutines.
Ah, yes, stopping on s->cache_clean_interval == 0 is good.
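Something like this, then (untested sketch; cache_clean_waiters would be a
new CoQueue field, and I'm using QEMU_CLOCK_REALTIME per my comment above,
though that clock change is arguably a separate patch):

    while (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
                                  QEMU_CLOCK_REALTIME,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);

        /*
         * Recheck after waking: if we were woken up in order to stop,
         * we must not touch the caches any more
         */
        if (s->cache_clean_interval == 0) {
            break;
        }

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }

    /* Wake anyone waiting for the cleaner to be gone */
    qemu_co_queue_restart_all(&s->cache_clean_waiters);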
> > > (Note that even now, this was only guaranteed for
> > > cache_clean_timer_del() callers that run in the BDS (the timer’s)
> > > AioContext. For callers running in the main context, the problem may
> > > have already existed, though maybe the BQL prevents timers from running
> > > in other contexts; I’m not sure.)
> > >
> > > Polling to wait for the timer to actually settle seems very complicated
> > > for something that’s rather a minor problem, but I can’t come up with
> > > any better solution that doesn’t again just overlook potential problems.
> > >
> > > (One cleaner idea may be to have a generic way to have timers run
> > > coroutines, and to await those when descheduling the timer. But while
> > > cleaner, it would also be more complicated, and I don’t think worth it
> > > at this point.)
> > >
> > > (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> > > I’m not too fond of this solution.)
> > >
> > > Signed-off-by: Hanna Czenczek <[email protected]>
> > > ---
> > >   block/qcow2.h |  1 +
> > >   block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
> > >   2 files changed, 79 insertions(+), 12 deletions(-)
> > > @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
> > >      }
> > >  }
> > > +/*
> > > + * Delete the cache clean timer and wait for any still-running instance.
> > > + * Must be called from the main or BDS AioContext, holding s->lock.
> > > + */
> > > +static void coroutine_fn
> > > +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> > > +{
> > > +    BDRVQcow2State *s = bs->opaque;
> > > +    IO_OR_GS_CODE();
> > > +    cache_clean_timer_del(bs);
> > > +    if (qatomic_read(&s->cache_clean_running)) {
> > > +        qemu_co_mutex_unlock(&s->lock);
> > > +        qatomic_set(&s->cache_clean_polling, true);
> > > +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > Polling in a coroutine_fn is verboten.
> >
> > If we do need this function, I think it would be a yield here and a wake
> > on the other side. I think we might be able to get around it if we move
> > the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> > coroutine). A bit ugly, so your choice.
>
> We can let a CoQueue do the waking, no?
Yes, that's probably nicer. I don't expect that you'll ever have more
than a single waiter, but a CoQueue would be safe in either case.
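With your s->cache_clean_running flag, the waiting side could then be
something like (sketch, using the same hypothetical cache_clean_waiters
field as above):

    /* Tell the cleaner to stop, wake it, and wait until it is gone */
    s->cache_clean_interval = 0;
    qemu_co_sleep_wake(&s->cache_clean_timer_wake);
    while (qatomic_read(&s->cache_clean_running)) {
        qemu_co_queue_wait(&s->cache_clean_waiters, NULL);
    }

with the cleaner clearing cache_clean_running and then calling
qemu_co_queue_restart_all() when it exits.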
> > > +        qemu_co_mutex_lock(&s->lock);
> > > +    }
> > > +}
> > > +
> > > +/*
> > > + * Delete the cache clean timer and wait for any still-running instance.
> > > + * Must be called from the main or BDS AioContext without s->lock held.
> > > + */
> > > +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> > > +{
> > > +    BDRVQcow2State *s = bs->opaque;
> > > +    IO_OR_GS_CODE();
> > > +    cache_clean_timer_del(bs);
> > > +    if (qatomic_read(&s->cache_clean_running)) {
> > > +        qatomic_set(&s->cache_clean_polling, true);
> > > +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > > +    }
> > > +}
> > > +
> > >  static void qcow2_detach_aio_context(BlockDriverState *bs)
> > >  {
> > >      cache_clean_timer_del(bs);
> > > @@ -1214,12 +1273,20 @@ fail:
> > >      return ret;
> > >  }
> > > +/* s_locked specifies whether s->lock is held or not */
> > >  static void qcow2_update_options_commit(BlockDriverState *bs,
> > > -                                        Qcow2ReopenState *r)
> > > +                                        Qcow2ReopenState *r,
> > > +                                        bool s_locked)
> > >  {
> > >      BDRVQcow2State *s = bs->opaque;
> > >      int i;
> > > +    if (s_locked) {
> > > +        cache_clean_timer_locked_co_del_and_wait(bs);
> > > +    } else {
> > > +        cache_clean_timer_del_and_wait(bs);
> > > +    }
> > > +
> > >      if (s->l2_table_cache) {
> > >          qcow2_cache_destroy(s->l2_table_cache);
> > >      }
> > > @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > >      }
> > >      s->l2_table_cache = r->l2_table_cache;
> > >      s->refcount_block_cache = r->refcount_block_cache;
> > > +
> > > +    s->cache_clean_interval = r->cache_clean_interval;
> > > +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > +
> > >      s->l2_slice_size = r->l2_slice_size;
> > >      s->overlap_check = r->overlap_check;
> > > @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > >      s->discard_no_unref = r->discard_no_unref;
> > > -    if (s->cache_clean_interval != r->cache_clean_interval) {
> > > -        cache_clean_timer_del(bs);
> > > -        s->cache_clean_interval = r->cache_clean_interval;
> > > -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > -    }
> > > -
> > I think the del/init pair here won't be necessary any more after
> > switching to the background coroutine. It will just start using the new
> > value of s->cache_clean_interval the next time it sleeps.
>
> One problem is that if we don’t lock s->lock, the coroutine can read
> s->l2_table_cache and s->refcount_block_cache while they’re invalid, which
> is why I moved the deletion above.
I see. This is a preexisting problem, right? The timer runs in the BDS
AioContext, while qcow2_update_options_commit() runs in the main loop
or... essentially anywhere else?
Should it be a separate patch then? A comment in the code wouldn't hurt
either.
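Something like:

    /*
     * The cache clean coroutine reads s->l2_table_cache and
     * s->refcount_block_cache without holding s->lock, so it must be
     * stopped before the caches are destroyed and replaced.
     */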
> We also still need to delete if the interval is set to 0 (or
> special-case that in the coroutine to wait forever).
If the coroutine terminates on 0 as you suggested above, that would
automatically be solved.
> We could run all of this in a coroutine so we can lock s->lock, or we
> have to force-stop the timer/coroutine at the start. Maybe running it
> in a coroutine is better…
So qcow2_reopen_commit() would immediately enter a coroutine and
BDRV_POLL_WHILE() for its completion?
It's not exactly pretty (and I hope polling in reopen callbacks is even
allowed), but maybe it's more localised ugliness than having additional
state (like s->cache_clean_running) to allow BDRV_POLL_WHILE() to wait
for the cache clean coroutine to stop.
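To make sure we're talking about the same thing, roughly this (untested
sketch; Qcow2ReopenCommitCo and qcow2_reopen_commit_co() are made up, and
I'm leaving out everything else qcow2_reopen_commit() does):

    typedef struct Qcow2ReopenCommitCo {
        BDRVReopenState *state;
        bool done;
    } Qcow2ReopenCommitCo;

    static void coroutine_fn qcow2_reopen_commit_co(void *opaque)
    {
        Qcow2ReopenCommitCo *c = opaque;
        BlockDriverState *bs = c->state->bs;
        BDRVQcow2State *s = bs->opaque;

        /* In coroutine context, so we can take the CoMutex */
        qemu_co_mutex_lock(&s->lock);
        qcow2_update_options_commit(bs, c->state->opaque, true);
        qemu_co_mutex_unlock(&s->lock);

        c->done = true;
        aio_wait_kick();
    }

    static void qcow2_reopen_commit(BDRVReopenState *state)
    {
        Qcow2ReopenCommitCo c = { .state = state };
        Coroutine *co = qemu_coroutine_create(qcow2_reopen_commit_co, &c);

        aio_co_enter(bdrv_get_aio_context(state->bs), co);
        BDRV_POLL_WHILE(state->bs, !c.done);
    }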
Kevin