On 28.10.2025 17:33, Hanna Czenczek wrote:
> The cache-cleaner runs as a timer CB in the BDS AioContext.  With
> multiqueue, it can run concurrently to I/O requests, and because it does
> not take any lock, this can break concurrent cache accesses, corrupting
> the image.  While the chances of this happening are low, it can be
> reproduced e.g. by modifying the code to schedule the timer CB every
> 5 ms (instead of at most once per second) and modifying the last (inner)
> while loop of qcow2_cache_clean_unused() like so:
> 
>     while (i < c->size && can_clean_entry(c, i)) {
>         for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
>             usleep(100);
>         }
>         c->entries[i].offset = 0;
>         c->entries[i].lru_counter = 0;
>         i++;
>         to_clean++;
>     }
> 
> i.e. making it wait on purpose for the point in time where the cache is
> in use by something else.
> 
> The solution chosen for this in this patch is not the best solution, I
> hope, but I admittedly can’t come up with anything strictly better.
> 
> We can protect from concurrent cache accesses either by taking the
> existing s->lock, or we introduce a new (non-coroutine) mutex
> specifically for cache accesses.  I would prefer to avoid the latter so
> as not to introduce additional (very slight) overhead.

In theory, the old plan was that eventually qcow2 would use fine-grained
locks instead of the single s->lock, and having a separate cache lock
would be a step towards that. But if we never actually make use of it to
hold s->lock for shorter periods, that's not really a good argument. I'm
not sure that's ever going to happen, except maybe as part of a rewrite
in Rust or something.
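
For the record, such a separate lock would amount to something like the
following (purely illustrative, none of these fields exists today):

    /* in block/qcow2-cache.c */
    struct Qcow2Cache {
        /* ... existing fields ... */
        QemuMutex lock;   /* protects entries[] and lru_counter */
    };

    /* every cache access, including the cleaner, would then do: */
    qemu_mutex_lock(&c->lock);
    /* ... look up, insert or clean entries ... */
    qemu_mutex_unlock(&c->lock);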

I never tried to measure specifically whether lock contention is a
problem with high queue depth and random I/O on a huge disk.
Intuitively, holding s->lock while doing I/O to load entries into the
cache can't really be good.

Anyway, I went off on a bit of a tangent there...

> Using s->lock, which is a coroutine mutex, however means that we need to
> take it in a coroutine, so the timer CB must enter such a coroutine.  As
> a result, descheduling the timer is no longer a guarantee that the
> cache-cleaner will not run, because it may now be yielding in
> qemu_co_mutex_lock().

I think creating a coroutine in cache_clean_timer_cb() is the wrong
approach. Instead, cache_clean_timer_init() could create a long-running
coroutine whose implementation could be something like this:

    while (!s->cache_clean_timer_stopping) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }
    s->cache_clean_timer_stopping = false;
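
The init/del side would then just create the coroutine and set the stop
flag, roughly like this (untested sketch; the entry point name
cache_clean_co_entry and cache_clean_timer_stop() are made up, the
wake/stopping fields are the ones used above):

    static void coroutine_fn cache_clean_co_entry(void *opaque)
    {
        BlockDriverState *bs = opaque;
        /* the loop shown above */
    }

    static void cache_clean_timer_init(BlockDriverState *bs,
                                       AioContext *context)
    {
        BDRVQcow2State *s = bs->opaque;

        if (s->cache_clean_interval > 0) {
            Coroutine *co = qemu_coroutine_create(cache_clean_co_entry, bs);
            aio_co_enter(context, co);
        }
    }

    static void cache_clean_timer_stop(BlockDriverState *bs)
    {
        BDRVQcow2State *s = bs->opaque;

        s->cache_clean_timer_stopping = true;
        /* interrupt the sleep so the loop sees the flag */
        qemu_co_sleep_wake(&s->cache_clean_timer_wake);
    }

Note that stopping alone still doesn't wait for a cleaning pass that is
already running; that is what the waiting discussed below would be for.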

> (Note even now this was only guaranteed for cache_clean_timer_del()
> callers that run in the BDS (the timer’s) AioContext.  For callers
> running in the main context, the problem may have already existed,
> though maybe the BQL prevents timers from running in other contexts, I’m
> not sure.)
> 
> Polling to await the timer to actually settle seems very complicated for
> something that’s rather a minor problem, but I can’t come up with any
> better solution that doesn’t again just overlook potential problems.
> 
> (One cleaner idea may be to have a generic way to have timers run
> coroutines, and to await those when descheduling the timer.  But while
> cleaner, it would also be more complicated, and I don’t think worth it
> at this point.)
> 
> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> I’m not too fond of this solution.)
> 
> Signed-off-by: Hanna Czenczek <[email protected]>
> ---
>  block/qcow2.h |  1 +
>  block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 79 insertions(+), 12 deletions(-)

> @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
>      }
>  }
>  
> +/*
> + * Delete the cache clean timer and await any yet running instance.
> + * Must be called from the main or BDS AioContext, holding s->lock.
> + */
> +static void coroutine_fn
> +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    IO_OR_GS_CODE();
> +    cache_clean_timer_del(bs);
> +    if (qatomic_read(&s->cache_clean_running)) {
> +        qemu_co_mutex_unlock(&s->lock);
> +        qatomic_set(&s->cache_clean_polling, true);
> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));

Polling in a coroutine_fn is verboten.

If we do need this function, I think it would have to be a yield here
and a wake on the other side. We might be able to avoid it entirely if
we move the call from qcow2_do_open() into qcow2_open() (i.e. outside
the coroutine). A bit ugly, so your choice.
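
For the yield/wake variant, the usual CoQueue pattern should work,
roughly (untested; cache_clean_wait_queue would be a new CoQueue field):

    /* in the waiter, with s->lock held */
    while (qatomic_read(&s->cache_clean_running)) {
        qemu_co_queue_wait(&s->cache_clean_wait_queue, &s->lock);
    }

    /* in the cleaner, at the end of a pass, still under s->lock */
    qatomic_set(&s->cache_clean_running, false);
    qemu_co_queue_restart_all(&s->cache_clean_wait_queue);

qemu_co_queue_wait() drops s->lock while waiting and takes it again
before returning, so the explicit unlock/lock pair around the poll in
the patch would go away.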

> +        qemu_co_mutex_lock(&s->lock);
> +    }
> +}
> +
> +/*
> + * Delete the cache clean timer and await any yet running instance.
> + * Must be called from the main or BDS AioContext without s->lock held.
> + */
> +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    IO_OR_GS_CODE();
> +    cache_clean_timer_del(bs);
> +    if (qatomic_read(&s->cache_clean_running)) {
> +        qatomic_set(&s->cache_clean_polling, true);
> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> +    }
> +}
> +
>  static void qcow2_detach_aio_context(BlockDriverState *bs)
>  {
>      cache_clean_timer_del(bs);
> @@ -1214,12 +1273,20 @@ fail:
>      return ret;
>  }
>  
> +/* s_locked specifies whether s->lock is held or not */
>  static void qcow2_update_options_commit(BlockDriverState *bs,
> -                                        Qcow2ReopenState *r)
> +                                        Qcow2ReopenState *r,
> +                                        bool s_locked)
>  {
>      BDRVQcow2State *s = bs->opaque;
>      int i;
>  
> +    if (s_locked) {
> +        cache_clean_timer_locked_co_del_and_wait(bs);
> +    } else {
> +        cache_clean_timer_del_and_wait(bs);
> +    }
> +
>      if (s->l2_table_cache) {
>          qcow2_cache_destroy(s->l2_table_cache);
>      }
> @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>      }
>      s->l2_table_cache = r->l2_table_cache;
>      s->refcount_block_cache = r->refcount_block_cache;
> +
> +    s->cache_clean_interval = r->cache_clean_interval;
> +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> +
>      s->l2_slice_size = r->l2_slice_size;
>  
>      s->overlap_check = r->overlap_check;
> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>  
>      s->discard_no_unref = r->discard_no_unref;
>  
> -    if (s->cache_clean_interval != r->cache_clean_interval) {
> -        cache_clean_timer_del(bs);
> -        s->cache_clean_interval = r->cache_clean_interval;
> -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> -    }
> -

I think the del/init pair here won't be necessary any more after
switching to the background coroutine. It will just start using the new
value of s->cache_clean_interval the next time it sleeps.
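
That is, the commit path could be reduced to something like this
(sketch; the wake field is the one from the loop above):

    s->cache_clean_interval = r->cache_clean_interval;
    /* optional: interrupt the current sleep so that the new interval
     * takes effect immediately instead of after the old one expires */
    qemu_co_sleep_wake(&s->cache_clean_timer_wake);

Waking also triggers an immediate cleaning pass, but that should be
harmless.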

Kevin

