On Mon, Jun 08, 2026 at 06:05:34PM +0200, Paolo Bonzini wrote:
> On 6/4/26 15:34, Peter Xu wrote:
> > The race condition was not reported in any real bug, but I found it while
> > reviewing some relevant chnages from Akihiko [1].  It's not clear if
> > there's any way to reproduce it, so far it's only theoretical.  Hence
> > there's also no Fixes tag attached.  There's also no need to copy stable
> > until we have a solid reproducer.
> > 
> > Currently, mru_block might still points to ramblocks that are removed if
> > below race condition happens:
> > 
> >    Reader A                              Writer
> >    --------                              ------
> >    rcu_read_lock()
> >    walk list, find block X
> >                                          QLIST_REMOVE_RCU(X)
> >                                          qatomic_set(&mru_block, NULL)
> >                                          call_rcu(X, reclaim_ramblock)
> >    qatomic_set(&mru_block, X)  <----------- overwrites NULL
> >    rcu_read_unlock()
> >                                          grace period ends, X freed
> > 
> >    Reader C
> >    --------
> >    rcu_read_lock()
> >    qatomic_rcu_read(&mru_block) -> X
> >    Read X's block->offset ...  <----------- UAF
> > 
> > To fix it, we can introduce a nested RCU free for reset of the mru_block
> > field, and only free the ramblock until the 2nd RCU call.
> 
> There is a version field in RAMList.  Maybe we need mru_version too?
> 
> - reader A reads version before starting the list walk, then writes
> mru_version and mru_block
> 
> - writers update version and do not even have to clear mru_block
> 
> - new readers ignore mru_block if mru_version != version
> 
> This works, but there is a snag:
> 
>     reader A (MRU parts elided, assuming a miss)
>     --------------------------------------------
>     load_acquire version // load version before walking list
>     walk list
>     ...
> 
>             writer B
>             -----------------------
>             // like seqcount_write_begin
>             store version++
>             smp_wmb();
>             // ----
>             QLIST_REMOVE_RCU
>             store release version++  // like seqcount_write_end
> 
>     // these two must be atomic!
>     store release mru_version = version & ~1, mru_block = block
> 
>     reader C (MRU parts shown)
>     --------------------
>     load_acquire mru_version    // like seqcount_read_begin
>     load mru_block
>     // like seqcount_read_retry
>     smp_rmb()
>     load_acquire version
>     // ----
>     if block == NULL || version != mru_version || block mismatch
>       walk list
> 
> In other words, the seqcount retry check compares the current version of the
> list against the one *stored by the previous reader*.
> 
> The snag is that torn writes of the (version, block) pair are a problem, so
> the store of mru_version and mru_block must be atomic and that requires
> CONFIG_ATOMIC128.  Maybe avoid MRU altogether if it's not available, and if
> it is use qatomic_read__nocheck/qatomic_set__nocheck?

Indeed looks like having mru_version is another possible solution, but I
also don't see how we can avoid using 128b atomic..  Besides, the logic
seems to be slightly involved: I tried to write another patch with
QemuSeqLock which seems cleaner, and I've attached it at the end.. only
pesudo code and I left a comment for the atomic 128b update.

NOTE: in that patch I also removed version update for adding new ramblocks
because IIUC we don't need it.  If anyone thinks that's a good idea that
can be done separately.. but it shouldn't matter a lot with current master.

Shall we just stick with the RCU nested free approach?  Personally I think
it's still straightforward and less overhead (no penalty on reader at all).

Thanks,

===8<===

>From a04e925f5b9c82b07a38f801b73095cd3a71e91b Mon Sep 17 00:00:00 2001
From: Peter Xu <[email protected]>
Date: Mon, 8 Jun 2026 14:45:01 -0400
Subject: [PATCH] fix mru_block

Signed-off-by: Peter Xu <[email protected]>
---
 include/system/ramlist.h |  9 +++++-
 migration/ram.c          | 11 ++++---
 system/physmem.c         | 63 +++++++++++++++++++++++++---------------
 3 files changed, 54 insertions(+), 29 deletions(-)

diff --git a/include/system/ramlist.h b/include/system/ramlist.h
index c7f388f487..b2b4403e13 100644
--- a/include/system/ramlist.h
+++ b/include/system/ramlist.h
@@ -5,6 +5,7 @@
 #include "qemu/thread.h"
 #include "qemu/rcu.h"
 #include "qemu/rcu_queue.h"
+#include "qemu/seqlock.h"
 #include "system/ram_addr.h"
 
 typedef struct RAMBlockNotifier RAMBlockNotifier;
@@ -42,12 +43,18 @@ typedef struct {
 
 typedef struct RAMList {
     QemuMutex mutex;
+    /* Version of the ram_list that found mru_block, must be an even number */
+    unsigned mru_version;
     RAMBlock *mru_block;
     /* RCU-enabled, writes protected by the ramlist lock. */
     QLIST_HEAD(, RAMBlock) blocks;
     DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
     unsigned int num_dirty_blocks;
-    uint32_t version;
+    /*
+     * Version of the ram_list.  NOTE: currently we only boost this when
+     * ramblocks are removed.
+     */
+    QemuSeqLock version;
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
 } RAMList;
 extern RAMList ram_list;
diff --git a/migration/ram.c b/migration/ram.c
index 6da24d7258..14bce73f75 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -380,7 +380,7 @@ struct RAMState {
     /* Last dirty target page we have sent */
     ram_addr_t last_page;
     /* last ram version we have seen */
-    uint32_t last_version;
+    unsigned last_version;
     /* How many times we have dirty too many pages */
     int dirty_rate_high_cnt;
     /* these variables are used for bitmap sync */
@@ -2487,6 +2487,7 @@ static void ram_page_hint_reset(PageLocationHint *hint)
 
 static void ram_state_reset(RAMState *rs)
 {
+    unsigned version;
     int i;
 
     for (i = 0; i < RAM_CHANNEL_MAX; i++) {
@@ -2497,10 +2498,12 @@ static void ram_state_reset(RAMState *rs)
     rs->last_page = 0;
 
     /* Read version before ram_list.blocks */
-    rs->last_version = qatomic_load_acquire(&ram_list.version);
+    do {
+        version = seqlock_read_begin(&ram_list.version);
+    } while (seqlock_read_retry(&ram_list.version, version));
 
+    rs->last_version = version;
     rs->xbzrle_started = false;
-
     ram_page_hint_reset(&rs->page_hint);
 }
 
@@ -3273,7 +3276,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
      */
     WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
         WITH_RCU_READ_LOCK_GUARD() {
-            if (qatomic_read(&ram_list.version) != rs->last_version) {
+            if (seqlock_read_retry(&ram_list.version, rs->last_version)) {
                 ram_state_reset(rs);
             }
 
diff --git a/system/physmem.c b/system/physmem.c
index 9e1ac13e82..1960dc64a7 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -821,11 +821,29 @@ AddressSpace *cpu_get_address_space(CPUState *cpu, int 
asidx)
 static RAMBlock *qemu_get_ram_block(ram_addr_t addr)
 {
     RAMBlock *block;
+    unsigned version = ram_list.mru_version;
 
     block = qatomic_rcu_read(&ram_list.mru_block);
-    if (block && addr - block->offset < block->max_length) {
+    /*
+     * Don't use MRU one if ramblock might be removed or being removed.
+     * While if we see the version hasn't incremented it means even if
+     * there's a concurrent writer it needs to wait for this rcu reader to
+     * finish using the ramblock, hence it's still safe.
+     *
+     * Also due to above, we must check seqlock_read_retry() first before
+     * de-referencing block pointer here to make sure it's still not freed.
+     */
+    if (!seqlock_read_retry(&ram_list.version, version) &&
+        block && addr - block->offset < block->max_length) {
         return block;
     }
+
+    /*
+     * Read the current ram_list version. This is only needed for updating
+     * mru_block; it's always safe to walk ram_list under RCU read lock.
+     */
+    version = seqlock_read_begin(&ram_list.version);
+
     RAMBLOCK_FOREACH(block) {
         if (addr - block->offset < block->max_length) {
             goto found;
@@ -836,23 +854,10 @@ static RAMBlock *qemu_get_ram_block(ram_addr_t addr)
     abort();
 
 found:
-    /* It is safe to write mru_block outside the BQL.  This
-     * is what happens:
-     *
-     *     qatomic_set(&mru_block, xxx)
-     *     rcu_read_unlock()
-     *                                        xxx removed from list
-     *                  rcu_read_lock()
-     *                  read mru_block
-     *                                        qatomic_set(&mru_block, NULL);
-     *                                        call_rcu(reclaim_ramblock, xxx);
-     *                  rcu_read_unlock()
-     *
-     * qatomic_rcu_set is not needed here.  The block was already published
-     * when it was placed into the list.  Here we're just making an extra
-     * copy of the pointer.
-     */
-    qatomic_set(&ram_list.mru_block, block);
+    /* Only update mru_block if no ramblock is removed or being removed */
+    if (!seqlock_read_retry(&ram_list.version, version)) {
+        // atomic set both ram_list.mru_version/mru_block...
+    }
     return block;
 }
 
@@ -2260,10 +2265,12 @@ static void ram_block_add(RAMBlock *new_block, Error 
**errp)
     } else { /* list is empty */
         QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
     }
-    qatomic_set(&ram_list.mru_block, NULL);
 
-    /* Write list before version */
-    qatomic_store_release(&ram_list.version, ram_list.version + 1);
+    /*
+     * NOTE: we don't update ram_list.version when adding new ramblocks,
+     * because the version is currently only used to invalidate cached
+     * pointers to removed ramblocks.  No user cares about additions.
+     */
     qemu_mutex_unlock_ramlist();
 
     physical_memory_set_dirty_range(new_block->offset,
@@ -2603,15 +2610,23 @@ void qemu_ram_free(RAMBlock *block)
                                 block->max_length);
     }
 
+    /*
+     * We can use seqlock_write_[un]lock(), however stick with the helpers
+     * when they're available to make it easier to track writers.
+     */
     qemu_mutex_lock_ramlist();
+    seqlock_write_begin(&ram_list.version);
+
     name = cpr_name(block->mr);
     cpr_delete_fd(name, 0);
     QLIST_REMOVE_RCU(block, next);
+    /* while version is odd, nobody will be updating this.. */
     qatomic_set(&ram_list.mru_block, NULL);
-    /* Write list before version */
-    qatomic_store_release(&ram_list.version, ram_list.version + 1);
-    call_rcu(block, reclaim_ramblock, rcu);
+
+    seqlock_write_end(&ram_list.version);
     qemu_mutex_unlock_ramlist();
+
+    call_rcu(block, reclaim_ramblock, rcu);
 }
 
 #ifndef _WIN32
-- 
2.53.0


-- 
Peter Xu


Reply via email to