On 3/27/2025 8:27 PM, Steven Sistare wrote:
On 3/26/2025 5:34 PM, Michael Roth wrote:
On Wed, Mar 26, 2025 at 05:13:50PM -0300, Fabiano Rosas wrote:
Michael Roth <michael.r...@amd.com> writes:

Quoting Tom Lendacky (2025-03-26 14:21:31)
On 3/26/25 13:46, Tom Lendacky wrote:
On 3/7/25 12:15, Fabiano Rosas wrote:
From: Steve Sistare <steven.sist...@oracle.com>

Unlike cpr-reboot mode, cpr-transfer mode cannot save volatile ram blocks in the migration stream file and recreate them later, because the physical memory for the blocks is pinned and registered for vfio.  Add a blocker
for volatile ram blocks.

Also add a blocker for RAM_GUEST_MEMFD.  Preserving guest_memfd may be
sufficient for CPR, but it has not been tested yet.

Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
Reviewed-by: Fabiano Rosas <faro...@suse.de>
Reviewed-by: Peter Xu <pet...@redhat.com>
Reviewed-by: David Hildenbrand <da...@redhat.com>
Message-ID: <1740667681-257312-1-git-send-email-steven.sist...@oracle.com>
Signed-off-by: Fabiano Rosas <faro...@suse.de>
---
  include/exec/memory.h   |  3 ++
  include/exec/ramblock.h |  1 +
  migration/savevm.c      |  2 ++
  system/physmem.c        | 66 +++++++++++++++++++++++++++++++++++++++++++++
  4 files changed, 72 insertions(+)
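
For readers following the thread, here is a rough sketch of the blocker-registration pattern the commit message above describes. It is paraphrased from the error text and code hunks quoted later in this thread, not the exact upstream patch, and the compatibility check shown is an assumption:

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "exec/ramblock.h"
    #include "migration/blocker.h"

    /* Paraphrased sketch, not the exact patch: register a migration blocker
     * for a RAMBlock that cpr-transfer cannot preserve.  The guard is an
     * assumption based on the discussion below: shared, fd-backed blocks can
     * be remapped by the new process, everything else gets blocked.
     */
    void ram_block_add_cpr_blocker(RAMBlock *rb, Error **errp)
    {
        if (qemu_ram_is_shared(rb) && rb->fd >= 0) {
            return;                       /* CPR-compatible, nothing to block */
        }

        error_setg(&rb->cpr_blocker,
                   "Memory region %s is not compatible with CPR. share=on "
                   "is required for memory-backend objects, and "
                   "aux-ram-share=on is required.",
                   memory_region_name(rb->mr));
        migrate_add_blocker_modes(&rb->cpr_blocker, errp,
                                  MIG_MODE_CPR_TRANSFER, -1);
    }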

This patch breaks booting an SNP guest as it triggers the following
assert:

qemu-system-x86_64: ../util/error.c:68: error_setv: Assertion `*errp == NULL' failed.


Usually this means the error has already been set previously, which is
not allowed.

I tracked it to the error_setg() call in ram_block_add_cpr_blocker().
It looks like the error message cannot be printed because
rb->cpr_blocker is NULL.

Adding aux-ram-share=on to the -machine options gets me past the error and
therefore the assertion, but isn't that an incompatible change to how an
SNP guest has to be started?

If I update the error_setg() call to use the errp parameter that is passed
into ram_block_add_cpr_blocker(), I get the following message and then
the guest launch terminates:


The usage in ram_block_add_cpr_blocker() is correct: the cpr_blocker
gets initialized and registered as a migration blocker. The errp only
becomes relevant later, when migration starts and the error condition is
met.

qemu-system-x86_64: Memory region pc.bios is not compatible with CPR.
share=on is required for memory-backend objects, and aux-ram-share=on is
required.

Since errp is &error_fatal, it causes QEMU to exit, so this^ error
message is bogus.
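
(Aside, for anyone not familiar with the error API: &error_fatal is a sentinel errp. Roughly, an error set through it is reported and QEMU exits on the spot instead of returning the error to the caller. A small illustration; the caller name is hypothetical:)

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "exec/memory.h"

    /* Illustration only (hypothetical caller): any error set through
     * &error_fatal is printed and QEMU exits immediately, so the
     * "share=on is required ..." text above is the last output before exit.
     */
    static void hypothetical_caller(RAMBlock *rb)
    {
        ram_block_add_cpr_blocker(rb, &error_fatal);   /* exits here on error */
    }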


The qemu parameters I used prior to this patch that allowed an SNP guest
to launch were:

-machine type=q35,confidential-guest-support=sev0,memory-backend=ram1
-object memory-backend-memfd,id=ram1,size=16G,share=true,prealloc=false

With these parameters after this patch, the launch fails.

I think it might be failing because the caller of
ram_block_add_cpr_blocker() is passing in &error_abort, but if the
error_setg() is called on a properly initialized cpr_blocker value then
SNP is still able to boot for me.
I'm not sure where the best spot is
to initialize cpr_blocker; it probably needs to be done before either
ram_block_add_cpr_blocker() or ram_block_del_cpr_blocker() is callable,
but the following avoids the reported crash at least:

diff --git a/system/physmem.c b/system/physmem.c
index 44dd129662..bff0fdcaac 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -4176,6 +4176,7 @@ void ram_block_add_cpr_blocker(RAMBlock *rb, Error **errp)
          return;
      }

+    rb->cpr_blocker = NULL;

Could it be the cpr_blocker already got set at ram_block_add() in the
RAM_GUEST_MEMFD path?

That seems to be the case: in some cases ram_block_add() sets cpr_blocker
when (new_block->flags & RAM_GUEST_MEMFD) is true, and then, soon after,
ram_block_add_cpr_blocker() is called on the same RAMBlock:

  2025-03-26T21:08:15.092427Z qemu-system-x86_64: warning: ram_block_add: new_block 0x55c247e4c880 new_block->cpr_blocker (nil) name ram1
  2025-03-26T21:08:15.124710Z qemu-system-x86_64: warning: ram_block_add: new_block 0x55c2480fde00 new_block->cpr_blocker (nil) name pc.bios
  2025-03-26T21:08:15.126190Z qemu-system-x86_64: warning: ram_block_add_cpr_blocker: rb 0x55c2480fde00 rb->cpr_blocker 0x55c2480fe050 name pc.bios
  2025-03-26T21:08:15.138582Z qemu-system-x86_64: warning: ram_block_add: new_block 0x55c247e3c1e0 new_block->cpr_blocker (nil) name pc.rom
  2025-03-26T21:08:15.138938Z qemu-system-x86_64: warning: ram_block_add_cpr_blocker: rb 0x55c247e3c1e0 rb->cpr_blocker 0x55c247e3c890 name pc.rom
  2025-03-26T21:08:16.185577Z qemu-system-x86_64: warning: ram_block_add_cpr_blocker: rb 0x55c248db9200 rb->cpr_blocker (nil) name /rom@etc/acpi/tables
  2025-03-26T21:08:16.187140Z qemu-system-x86_64: warning: ram_block_add_cpr_blocker: rb 0x55c248085620 rb->cpr_blocker (nil) name /rom@etc/table-loader
  2025-03-26T21:08:16.188029Z qemu-system-x86_64: warning: ram_block_add_cpr_blocker: rb 0x55c2480ce220 rb->cpr_blocker (nil) name /rom@etc/acpi/rsd
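
For reference, the assertion fires because error_setg() refuses to overwrite an Error that is already set. A minimal standalone illustration (the demo() function is hypothetical):

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    /* Hypothetical demo: setting the same Error pointer twice trips the
     * "*errp == NULL" assertion in error_setv(), which is what happens when
     * ram_block_add() has already set rb->cpr_blocker for a guest_memfd
     * block and ram_block_add_cpr_blocker() then sets it again.
     */
    static void demo(void)
    {
        Error *blocker = NULL;

        error_setg(&blocker, "first blocker reason");   /* ok, was NULL */
        error_setg(&blocker, "second blocker reason");  /* assertion fails */
    }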

Thanks everyone for debugging this.  To summarize, ram_block_add_cpr_blocker already blocks guest_memfd, because rb->fd < 0.  The fix is to delete this redundant code in ram_block_add:

        error_setg(&new_block->cpr_blocker,
                   "Memory region %s uses guest_memfd, "
                   "which is not supported with CPR.",
                   memory_region_name(new_block->mr));
        migrate_add_blocker_modes(&new_block->cpr_blocker, errp,
                                  MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
                                  -1);
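
In other words, the fix amounts to dropping those lines from ram_block_add(). A sketch of what that removal could look like (context and line numbers omitted, so this is illustrative rather than the actual patch):

    --- a/system/physmem.c
    +++ b/system/physmem.c
    @@ ... @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
    -        error_setg(&new_block->cpr_blocker,
    -                   "Memory region %s uses guest_memfd, "
    -                   "which is not supported with CPR.",
    -                   memory_region_name(new_block->mr));
    -        migrate_add_blocker_modes(&new_block->cpr_blocker, errp,
    -                                  MIG_MODE_CPR_TRANSFER, MIG_MODE_CPR_EXEC,
    -                                  -1);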

I just encountered the same issue with a TDX guest, after rebasing the TDX code to 10.0.0-rc0.

Thank you all for reporting it and for the quick solution.

I will submit a fix (unless Tom or Michael would prefer to author it).

- Steve


