Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-06-05 Thread David Hildenbrand

On 05.06.24 17:19, Stefan Hajnoczi wrote:

On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi  wrote:


On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:

On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi  wrote:


On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:

Add SHMEM_MAP/UNMAP requests to vhost-user.

This request allows backends to dynamically map
fds into a shared memory region identified by


Please call this "VIRTIO Shared Memory Region" everywhere (code,
vhost-user spec, commit description, etc) so it's clear that this is not
about vhost-user shared memory tables/regions.


its `shmid`. Then, the fd memory is advertised
to the frontend through a BAR+offset, so it can
be read by the driver while it's valid.


Why is a PCI BAR mentioned here? vhost-user does not know about the
VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
report VIRTIO Shared Memory Regions to the driver.



I will remove PCI BAR, as it is true that it depends on the
transport. I was trying to explain that the driver
will use the shm_base + shm_offset to access
the mapped memory.




Then, the backend can munmap the memory range
in a given shared memory region (again, identified
by its `shmid`), to free it. After this, the
region becomes private and shall not be accessed
by the frontend anymore.


What does "private" mean?

The frontend must mmap PROT_NONE to reserve the virtual memory space
when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
unrelated mmap(NULL, ...) might use that address range and the guest
would have access to the host memory! This is a security issue and needs
to be mentioned explicitly in the spec.



I mentioned private because it changes the mapping from MAP_SHARED
to MAP_PRIVATE. I will highlight PROT_NONE instead.


I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
mean mmap flags or something like the memory range is no longer
accessible to the driver.


One more thing: please check whether kvm.ko memory regions need to be
modified or split to match the SHMEM_MAP mapping's read/write
permissions.

The VIRTIO Shared Memory Area pages can have PROT_READ, PROT_WRITE,
PROT_READ|PROT_WRITE, or PROT_NONE.

kvm.ko memory regions are read/write or read-only. I'm not sure what
happens when the guest accesses a kvm.ko memory region containing
mappings with permissions more restrictive than its kvm.ko memory
region.


IIRC, KVM's R/O memory region requests could be used to further reduce 
permissions (assuming your mmap is R/W, you could map it R/O into the KVM 
MMU), but I might remember things incorrectly.




In other words, the kvm.ko memory region would allow the
access but the Linux virtual memory configuration would cause a page
fault.

For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP
mapping with PROT_READ. The kvm.ko memory region would be read/write
(unless extra steps were taken to tell kvm.ko about the permissions).
When the guest stores to the PROT_READ page, kvm.ko will process a
fault...and I'm not sure what happens next.

A similar scenario occurs when a PROT_NONE mapping exists within a
kvm.ko memory region. I don't remember how kvm.ko behaves when the
guest tries to access the pages.

It's worth figuring this out before going further because it could
become tricky if issues are discovered later. I have CCed David
Hildenbrand in case he knows.



One relevant piece is likely:

"When the KVM_CAP_SYNC_MMU capability is available, changes in the 
backing of the memory region are automatically reflected into the guest. 
 For example, an mmap() that affects the region will be made visible 
immediately. "


We can already effectively get R/O or PROT_NONE PTEs in 
PROT_READ|PROT_WRITE mappings, and the KVM must be able to handle that 
properly -- triggering a page fault to let core-MM resolve that.


If we have a PROT_NONE VMA and the guest writes to it, we'd likely end 
up (to resolve the KVM MMU page fault) in 
hva_to_pfn_slow()->get_user_pages_unlocked(), which would return -EFAULT.


Not sure if we really inject a page fault into the guest or if the KVM 
run would effectively fail with -EFAULT and make user space unhappy. 
Certainly something to play with!


--
Cheers,

David / dhildenb




Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-06-04 Thread David Hildenbrand

On 04.06.24 18:41, Peter Xu wrote:

On Tue, Jun 04, 2024 at 06:14:08PM +0200, David Hildenbrand wrote:

On 04.06.24 17:58, Peter Xu wrote:

On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:

On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:

That property, irrelevant of what it is called (and I doubt whether Dan's
suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
visible fd but it's shared-ram for sure..), is yet another way to specify
guest mem types.

What if the user specified this property but specified something else in
the -object parameters?  E.g. -machine share-ram=on -object
memory-backend-ram,share=off.  What should we do?


The machine property would only apply to memory regions that are
*NOT* being created via -object. The memory-backend objects would
always honour their own share setting.


In that case we may want to rename that to share-ram-by-default=on.
Otherwise it's not clear which one would take effect from a user POV, even
if we can define it like that in the code.

Even with share-ram-by-default=on, it can be still confusing in some form
or another. Consider this cmdline:

-machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1

Then is mem1 shared or not?  From reading the cmdline, if share ram by
default it should be ON if we don't specify it, but it's actually off?
It's because -object has its own default values.


We do have something similar with "merge" and "dump" properties. See
machine_mem_merge() / machine_dump_guest_core().

These correspond to the "mem-merge" and "dump-guest-core" machine
properties.


These look fine so far, as long as -object cmdline doesn't allow to specify
the same thing again.



You can. The mem-merge / dump-guest-core set the default that can be 
modified per memory backend (merge / dump properties).
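As a command-line illustration of that default-plus-override semantic (flag names as used above; the exact guest/backend sizes are made up):

```shell
# machine-wide default off, overridden for one backend:
qemu-system-x86_64 -machine q35,mem-merge=off \
    -object memory-backend-ram,id=mem1,size=1G,merge=on \
    -numa node,memdev=mem1
```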




But ...



IMHO fundamentally it's just controversial to have two ways to configure
guest memory.  If '-object' is the preferred and complete way to configure
it, I prefer sticking with it if possible and see what is missing.


... I agree with that. With vhost-user we also require a reasonable
configuration (using proper fd-based shared memory) for it to work.



I think I raised that as the other major reason too, that I think it's so
far only about the vram that is out of the picture here.  We don't and
shouldn't have complicated RW RAMs floating around that we want this
property to cover.


Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing
working by migrating that memory? CPR will be "slightly less fast".

But the biggest piece -- guest RAM -- will be migrated via the fd directly.


I think it should work but only without VFIO.  When with VFIO there must
have no private pages at all or migrating is racy with concurrent DMAs
(yes, AFAICT CPR can run migration with DMA running..).


Understood. For these we could fail migration. Thanks for the pointer.

--
Cheers,

David / dhildenb




Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-06-04 Thread David Hildenbrand

On 04.06.24 17:58, Peter Xu wrote:

On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote:

On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote:

That property, irrelevant of what it is called (and I doubt whether Dan's
suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have user
visible fd but it's shared-ram for sure..), is yet another way to specify
guest mem types.

What if the user specified this property but specified something else in
the -object parameters?  E.g. -machine share-ram=on -object
memory-backend-ram,share=off.  What should we do?


The machine property would only apply to memory regions that are
*NOT* being created via -object. The memory-backend objects would
always honour their own share setting.


In that case we may want to rename that to share-ram-by-default=on.
Otherwise it's not clear which one would take effect from a user POV, even
if we can define it like that in the code.

Even with share-ram-by-default=on, it can be still confusing in some form
or another. Consider this cmdline:

   -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1

Then is mem1 shared or not?  From reading the cmdline, if share ram by
default it should be ON if we don't specify it, but it's actually off?
It's because -object has its own default values.


We do have something similar with "merge" and "dump" properties. See 
machine_mem_merge() / machine_dump_guest_core().


These correspond to the "mem-merge" and "dump-guest-core" machine 
properties.


But ...



IMHO fundamentally it's just controversial to have two ways to configure
guest memory.  If '-object' is the preferred and complete way to configure
it, I prefer sticking with it if possible and see what is missing.


... I agree with that. With vhost-user we also require a reasonable 
configuration (using proper fd-based shared memory) for it to work.




I think I raised that as the other major reason too, that I think it's so
far only about the vram that is out of the picture here.  We don't and
shouldn't have complicated RW RAMs floating around that we want this
property to cover.


Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing 
working by migrating that memory? CPR will be "slightly less fast".


But the biggest piece -- guest RAM -- will be migrated via the fd directly.

--
Cheers,

David / dhildenb




Re: [PATCH 1/6] migration: remove RDMA live migration temporarily

2024-06-04 Thread David Hildenbrand

On 04.06.24 14:14, Gonglei via wrote:

From: Jialin Wang 

The new RDMA live migration will be introduced in the upcoming
few commits.

Signed-off-by: Jialin Wang 
Signed-off-by: Gonglei 
---


[...]


-
-/* Avoid ram_block_discard_disable(), cannot change during migration. */
-if (ram_block_discard_is_required()) {
-error_setg(errp, "RDMA: cannot disable RAM discard");
-return;
-}


I'm particularly interested in the interaction with 
virtio-balloon/virtio-mem.


Do we still have to disable discarding of RAM, and where would you do 
that in the rewrite?


--
Cheers,

David / dhildenb




Re: [PATCH v2 3/4] backends/hostmem: Report error on qemu_madvise() failures

2024-05-31 Thread David Hildenbrand

On 31.05.24 09:28, Michal Privoznik wrote:

If user sets .merge or .dump attributes qemu_madvise() is called
with corresponding advice. But it is never checked for failure
which may mislead users into thinking the attribute is set
correctly. Report an appropriate error.

Signed-off-by: Michal Privoznik 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 1/4] osdep: Make qemu_madvise() to set errno in all cases

2024-05-31 Thread David Hildenbrand

On 31.05.24 10:12, Philippe Mathieu-Daudé wrote:

On 31/5/24 10:01, David Hildenbrand wrote:

On 31.05.24 09:57, Philippe Mathieu-Daudé wrote:

Hi Michal,

On 31/5/24 09:28, Michal Privoznik wrote:

The unspoken premise of qemu_madvise() is that errno is set on
error. And it is mostly the case except for posix_madvise() which
is documented to return either zero (on success) or a positive
error number.


Watch out, Linux:

     RETURN VALUE

    On success, posix_madvise() returns 0.  On failure,
    it returns a positive error number.

but on Darwin:

     RETURN VALUES

    Upon successful completion, a value of 0 is returned.
    Otherwise, a value of -1 is returned and errno is set
    to indicate the error.

(Haven't checked other POSIX OSes).



... but it's supposed to follow the "posix standard" :D Maybe an issue
in the docs?

FreeBSD seems to follow the standard: "The posix_madvise() interface is
identical, except it returns an error number on error and does not
modify errno, and is provided for standards conformance."

Same with OpenBSD: "The posix_madvise() interface has the same effect,
but returns the error value instead of only setting errno."


On Darwin, MADVISE(2):

 The posix_madvise() behaves same as madvise() except that it uses
 values with POSIX_ prefix for the advice system call argument.

 The posix_madvise function is part of IEEE 1003.1-2001 and was first
 implemented in Mac OS X 10.2.

Per IEEE 1003.1-2001
(https://pubs.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html):

RETURN VALUE

  Upon successful completion, posix_madvise() shall return zero;
  otherwise, an error number shall be returned to indicate the error.

Note the use of "shall" which is described in RFC2119 as:

 This word, or the adjective "RECOMMENDED", mean that there
 may exist valid reasons in particular circumstances to ignore a
 particular item, but the full implications must be understood and
 carefully weighed before choosing a different course.


Agreed, so we have to be careful.

I do wonder if there would be the option for an automatic approach: for 
example, detect if the errno was/was not changed. Hm.


--
Cheers,

David / dhildenb




Re: [PATCH v2 1/4] osdep: Make qemu_madvise() to set errno in all cases

2024-05-31 Thread David Hildenbrand

On 31.05.24 09:57, Philippe Mathieu-Daudé wrote:

Hi Michal,

On 31/5/24 09:28, Michal Privoznik wrote:

The unspoken premise of qemu_madvise() is that errno is set on
error. And it is mostly the case except for posix_madvise() which
is documented to return either zero (on success) or a positive
error number.


Watch out, Linux:

RETURN VALUE

   On success, posix_madvise() returns 0.  On failure,
   it returns a positive error number.

but on Darwin:

RETURN VALUES

   Upon successful completion, a value of 0 is returned.
   Otherwise, a value of -1 is returned and errno is set
   to indicate the error.

(Haven't checked other POSIX OSes).



... but it's supposed to follow the "posix standard" :D Maybe an issue 
in the docs?


FreeBSD seems to follow the standard: "The posix_madvise() interface is 
identical, except it returns an error number on error and does not 
modify errno, and is provided for standards conformance."


Same with OpenBSD: "The posix_madvise() interface has the same effect, 
but returns the error value instead of only setting errno."


--
Cheers,

David / dhildenb




Re: [PATCH v2 2/4] osdep: Make qemu_madvise() return ENOSYS on unsupported OSes

2024-05-31 Thread David Hildenbrand

On 31.05.24 09:28, Michal Privoznik wrote:

Not every OS is capable of madvise() or posix_madvise() even. In
that case, errno should be set to ENOSYS as it reflects the cause
better. This also mimics what madvise()/posix_madvise() would
return if the kernel lacks the corresponding syscall (e.g. due to
configuration).


Yes and no. According to the man page

" EINVAL advice is not a valid value."

if a particular MADV_* call is not implemented, we would get EINVAL, 
which is really unfortunate ... to detect what is actually supported :(


For the patch here ENOSYS makes sense:

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()

2024-05-29 Thread David Hildenbrand

On 29.05.24 08:48, Michal Prívozník wrote:

On 5/28/24 18:47, David Hildenbrand wrote:

Am 28.05.24 um 18:15 schrieb Michal Privoznik:

./build/qemu-system-x86_64 \ -m
size=8389632k,slots=16,maxmem=2560k \ -object
'{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}'
 \ -numa node,nodeid=0,cpus=0,memdev=ram-node0


For DIMMs and friends we now (again) enforce that the size must be
aligned to the page size:

commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
Author: David Hildenbrand 
Date:   Wed Jan 17 14:55:54 2024 +0100

     memory-device: reintroduce memory region size check

     We used to check that the memory region size is multiples of the
overall
     requested address alignment for the device memory address.

     We removed that check, because there are cases (i.e., hv-balloon) where
     devices unconditionally request an address alignment that has a very
large
     alignment (i.e., 32 GiB), but the actual memory device size might
not be
     multiples of that alignment.

     However, this change:

     (a) allows for some practically impossible DIMM sizes, like "1GB+1
byte".
     (b) allows for DIMMs that partially cover hugetlb pages, previously
     reported in [1].
...

Partial hugetlb pages do not particularly make sense; wasting memory. Do
we expect people to actually have working setups that we might break when
disallowing such configurations? Or would we actually help them identify
that something non-sensical is happening?

When using memory-backend-memfd, we already do get a proper error:

  ./build/qemu-system-x86_64 -m 2047m -object
memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off
-numa node,nodeid=0,cpus=0,memdev=ram-node0 -S
qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument

So this only applies to memory-backend-file, where we maybe should fail
in a similar way?



Yeah, let's fail in that case. But non-aligned length is just one of
many reasons madvise()/mbind() can fail. What about the others? Should
we make them report an error or just a warning?


Regarding madvise(), we should report at least a warning.

In qemu_ram_setup_dump() we print an error if QEMU_MADV_DONTDUMP failed.

But we swallow any errors from memory_try_enable_merging() ... in 
general, we likely have to distinguish the "not supported by the OS" 
case from the "actually supported but failed" case.


In the second patch, maybe we should really fail if something unexpected 
happens, instead of fake-changing the properties.


--
Cheers,

David / dhildenb




Re: [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases

2024-05-28 Thread David Hildenbrand

Am 28.05.24 um 18:15 schrieb Michal Privoznik:

The unspoken premise of qemu_madvise() is that errno is set on
error. And it is mostly the case except for posix_madvise() which
is documented to return either zero (on success) or a positive
error number. This means, we must set errno ourselves. And while
at it, make the function return a negative value on error, just
like other error paths do.

Signed-off-by: Michal Privoznik 
---
  util/osdep.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..e42f4e8121 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -57,7 +57,12 @@ int qemu_madvise(void *addr, size_t len, int advice)
  #if defined(CONFIG_MADVISE)
  return madvise(addr, len, advice);
  #elif defined(CONFIG_POSIX_MADVISE)
-return posix_madvise(addr, len, advice);
+int rc = posix_madvise(addr, len, advice);
+if (rc) {
+errno = rc;
+return -1;
+}
+return 0;
  #else
  errno = EINVAL;
  return -1;


Interesting, seems to be correct

Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb




Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()

2024-05-28 Thread David Hildenbrand

Am 28.05.24 um 18:15 schrieb Michal Privoznik:
./build/qemu-system-x86_64 \ -m size=8389632k,slots=16,maxmem=2560k \ 
-object 
'{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0


For DIMMs and friends we now (again) enforce that the size must be aligned to 
the page size:


commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
Author: David Hildenbrand 
Date:   Wed Jan 17 14:55:54 2024 +0100

memory-device: reintroduce memory region size check

We used to check that the memory region size is multiples of the overall
requested address alignment for the device memory address.

We removed that check, because there are cases (i.e., hv-balloon) where
devices unconditionally request an address alignment that has a very large
alignment (i.e., 32 GiB), but the actual memory device size might not be
multiples of that alignment.

However, this change:

(a) allows for some practically impossible DIMM sizes, like "1GB+1 byte".
(b) allows for DIMMs that partially cover hugetlb pages, previously
reported in [1].
...

Partial hugetlb pages do not particularly make sense; wasting memory. Do we 
expect people to actually have working setups that we might break when disallowing 
such configurations? Or would we actually help them identify that something 
non-sensical is happening?


When using memory-backend-memfd, we already do get a proper error:

 ./build/qemu-system-x86_64 -m 2047m -object 
memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off -numa 
node,nodeid=0,cpus=0,memdev=ram-node0 -S

qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument

So this only applies to memory-backend-file, where we maybe should fail in a 
similar way?


--
Thanks,

David / dhildenb




Re: [PATCH v2 1/3] hw/riscv/virt: Add memory hotplugging and virtio-md-pci support

2024-05-27 Thread David Hildenbrand


I wonder if we should forbid users from removing memory that is 'out of place', 
i.e. users
should always remove pc-dimms in LIFO order. Usually this kind of control is 
done by
management, e.g. libvirt, but if we're not sure we'll keep consistency during 
memory

unplugs ...


I really don't think we should do any of that. ;)

--
Thanks,

David / dhildenb




Re: [RFC PATCH 1/3] hw/intc/s390_flic: Migrate pending state

2024-05-26 Thread David Hildenbrand

Am 26.05.24 um 21:44 schrieb Richard Henderson:

On 5/26/24 08:53, David Hildenbrand wrote:

Am 25.05.24 um 15:12 schrieb Nicholas Piggin:

The flic pending state is not migrated, so if the machine is migrated
while an interrupt is pending, it can be lost. This shows up in
qtest migration test, an extint is pending (due to console writes?)
and the CPU waits via s390_cpu_set_psw and expects the interrupt to
wake it. However when the flic pending state is lost, s390_cpu_has_int
returns false, so s390_cpu_exec_interrupt falls through to halting
again.

Fix this by migrating pending. This prevents the qtest from hanging.
Does service_param need to be migrated? Or the IO lists?

Signed-off-by: Nicholas Piggin 
---
  hw/intc/s390_flic.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/intc/s390_flic.c b/hw/intc/s390_flic.c
index 6771645699..b70cf2295a 100644
--- a/hw/intc/s390_flic.c
+++ b/hw/intc/s390_flic.c
@@ -369,6 +369,7 @@ static const VMStateDescription qemu_s390_flic_vmstate = {
  .fields = (const VMStateField[]) {
  VMSTATE_UINT8(simm, QEMUS390FLICState),
  VMSTATE_UINT8(nimm, QEMUS390FLICState),
+    VMSTATE_UINT32(pending, QEMUS390FLICState),
  VMSTATE_END_OF_LIST()
  }
  };


Likely you have to handle this using QEMU compat machines.


Well, since existing migration is broken, I don't think you have to preserve 


Migration is broken only in some cases: "while an interrupt is pending, it can be 
lost".



compatibility.  But you do have to bump the version number.


Looking at it, this is TCG only, so likely we don't care that much about 
migration compatibility. But I have no idea what level of compatibility we want 
to support there.
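If compatibility across machine types does end up mattering, the usual pattern is to bump the version and guard the new field. A hedged sketch of what that could look like with QEMU's vmstate field macros (other members as in the existing description; not compiled here):

```c
static const VMStateDescription qemu_s390_flic_vmstate = {
    /* .name and other members as in the existing description */
    .version_id = 2,            /* bumped: "pending" was added */
    .minimum_version_id = 1,    /* still accept old streams */
    .fields = (const VMStateField[]) {
        VMSTATE_UINT8(simm, QEMUS390FLICState),
        VMSTATE_UINT8(nimm, QEMUS390FLICState),
        /* only present in the stream for version_id >= 2 */
        VMSTATE_UINT32_V(pending, QEMUS390FLICState, 2),
        VMSTATE_END_OF_LIST()
    }
};
```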


--
Thanks,

David / dhildenb




Re: [RFC PATCH 1/3] hw/intc/s390_flic: Migrate pending state

2024-05-26 Thread David Hildenbrand

Am 25.05.24 um 15:12 schrieb Nicholas Piggin:

The flic pending state is not migrated, so if the machine is migrated
while an interrupt is pending, it can be lost. This shows up in
qtest migration test, an extint is pending (due to console writes?)
and the CPU waits via s390_cpu_set_psw and expects the interrupt to
wake it. However when the flic pending state is lost, s390_cpu_has_int
returns false, so s390_cpu_exec_interrupt falls through to halting
again.

Fix this by migrating pending. This prevents the qtest from hanging.
Does service_param need to be migrated? Or the IO lists?

Signed-off-by: Nicholas Piggin 
---
  hw/intc/s390_flic.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/intc/s390_flic.c b/hw/intc/s390_flic.c
index 6771645699..b70cf2295a 100644
--- a/hw/intc/s390_flic.c
+++ b/hw/intc/s390_flic.c
@@ -369,6 +369,7 @@ static const VMStateDescription qemu_s390_flic_vmstate = {
  .fields = (const VMStateField[]) {
  VMSTATE_UINT8(simm, QEMUS390FLICState),
  VMSTATE_UINT8(nimm, QEMUS390FLICState),
+VMSTATE_UINT32(pending, QEMUS390FLICState),
  VMSTATE_END_OF_LIST()
  }
  };


Likely you have to handle this using QEMU compat machines.

--
Thanks,

David / dhildenb




Re: [PATCH v2 1/3] hw/riscv/virt: Add memory hotplugging and virtio-md-pci support

2024-05-24 Thread David Hildenbrand

On 24.05.24 15:14, Daniel Henrique Barboza wrote:



On 5/21/24 07:56, Björn Töpel wrote:

From: Björn Töpel 

Virtio-based memory devices (virtio-mem/virtio-pmem) allows for
dynamic resizing of virtual machine memory, and requires proper
hotplugging (add/remove) support to work.

Add device memory support for RISC-V "virt" machine, and enable
virtio-md-pci with the corresponding missing hotplugging callbacks.

Signed-off-by: Björn Töpel 
---
   hw/riscv/Kconfig   |  2 +
   hw/riscv/virt.c| 83 +-
   hw/virtio/virtio-mem.c |  5 ++-
   3 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
index a2030e3a6ff0..08f82dbb681a 100644
--- a/hw/riscv/Kconfig
+++ b/hw/riscv/Kconfig
@@ -56,6 +56,8 @@ config RISCV_VIRT
   select PLATFORM_BUS
   select ACPI
   select ACPI_PCI
+select VIRTIO_MEM_SUPPORTED
+select VIRTIO_PMEM_SUPPORTED
   
   config SHAKTI_C

   bool
diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index 4fdb66052587..443902f919d2 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -53,6 +53,8 @@
   #include "hw/pci-host/gpex.h"
   #include "hw/display/ramfb.h"
   #include "hw/acpi/aml-build.h"
+#include "hw/mem/memory-device.h"
+#include "hw/virtio/virtio-mem-pci.h"
   #include "qapi/qapi-visit-common.h"
   #include "hw/virtio/virtio-iommu.h"
   
@@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine)

   DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
   int i, base_hartid, hart_count;
   int socket_count = riscv_socket_count(machine);
+hwaddr device_memory_base, device_memory_size;
   
   /* Check socket count limit */

   if (VIRT_SOCKETS_MAX < socket_count) {
@@ -1420,6 +1423,12 @@ static void virt_machine_init(MachineState *machine)
   exit(1);
   }
   
+if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {

+error_report("unsupported amount of memory slots: %"PRIu64,
+ machine->ram_slots);
+exit(EXIT_FAILURE);
+}
+
   /* Initialize sockets */
   mmio_irqchip = virtio_irqchip = pcie_irqchip = NULL;
   for (i = 0; i < socket_count; i++) {
@@ -1553,6 +1562,37 @@ static void virt_machine_init(MachineState *machine)
   memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base,
   mask_rom);
   
+/* device memory */

+device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + 
machine->ram_size,
+  GiB);
+device_memory_size = machine->maxram_size - machine->ram_size;
+if (device_memory_size > 0) {
+/*
+ * Each DIMM is aligned based on the backend's alignment value.
+ * Assume max 1G hugepage alignment per slot.
+ */
+device_memory_size += machine->ram_slots * GiB;


We don't need to align to 1GiB. This calc can use 2MiB instead (or 4MiB if we're
running 32 bits).


+
+if (riscv_is_32bit(&s->soc[0])) {
+hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size,
+  GiB);


Same here - alignment is 2/4 MiB.


+
+if (memtop > UINT32_MAX) {
+error_report("memory exceeds 32-bit limit by %lu bytes",
+ memtop - UINT32_MAX);
+exit(EXIT_FAILURE);
+}
+}
+
+if (device_memory_base + device_memory_size < device_memory_size) {
+error_report("unsupported amount of device memory");
+exit(EXIT_FAILURE);
+}


Took another look and found this a bit strange. These are all unsigned vars, so
if (unsigned a + unsigned b < unsigned b) will always be 'false'. The compiler 
is
probably cropping this out.


No. Unsigned integer overflow is defined behavior and this is a common 
check to detect such overflow. It's consistent with what we do for other 
architectures.




The calc we need to do is to ensure that the extra ram_slots * alignment will 
fit into
the VIRT_DRAM block, i.e. maxram_size + (ram_slots * alignment) < 
memmap[VIRT_DRAM].size.


TBH I'm starting to have second thoughts about letting users hotplug whatever 
they want.
It seems cleaner to just force the 2/4 Mb alignment in pre_plug() and be done 
with it,
no need to allocate ram_slots * alignment and doing all these extra checks.


It's worth noting that if user space decides to specify addresses 
manually, it can mess up everything already. There are other events that 
can result in fragmentation of the memory device area (repeated 
hot(un)plug of differing DIMMs).


Assume you have 1 GiB range and hotplug a 512 MiB DIMM at offset 256 
MiB. You won't be able to hotplug another 512 MiB DIMM even though we 
reserved space.


My take so far is: if the user wants to do such stuff, they should size the 
area (maxmem) much larger or deal with the consequences (not being able 
to hotplug memory).


It 

Re: [PATCH v5 13/13] tests/qtest/vhost-user-test: add a test case for memory-backend-shm

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

`memory-backend-shm` can be used with vhost-user devices, so let's
add a new test case for it.

Acked-by: Thomas Huth 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 12/13] tests/qtest/vhost-user-blk-test: use memory-backend-shm

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

`memory-backend-memfd` is available only on Linux while the new
`memory-backend-shm` can be used on any POSIX-compliant operating
system. Let's use it so we can run the test in multiple environments.

Acked-by: Thomas Huth 
Acked-by: Stefan Hajnoczi 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Signed-off-by: Stefano Garzarella 
---
  tests/qtest/vhost-user-blk-test.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/vhost-user-blk-test.c 
b/tests/qtest/vhost-user-blk-test.c
index 117b9acd10..e945f6abf2 100644
--- a/tests/qtest/vhost-user-blk-test.c
+++ b/tests/qtest/vhost-user-blk-test.c
@@ -906,7 +906,7 @@ static void start_vhost_user_blk(GString *cmd_line, int 
vus_instances,
 vhost_user_blk_bin);
  
  g_string_append_printf(cmd_line,

-" -object memory-backend-memfd,id=mem,size=256M,share=on "
+" -object memory-backend-shm,id=mem,size=256M,share=on "


Can we simplify and drop the share=on?

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 07/13] vhost-user: enable frontends on any POSIX system

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

The vhost-user protocol is not really Linux-specific so let's enable
vhost-user frontends for any POSIX system.

In vhost_net.c we use VHOST_FILE_UNBIND which is defined in a Linux
specific header, let's define it for other systems as well.

Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 06/13] contrib/vhost-user-*: use QEMU bswap helper functions

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

Let's replace the calls to le*toh() and htole*() with qemu/bswap.h
helpers to make the code more portable.

Suggested-by: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 04/13] vhost-user-server: do not set memory fd non-blocking

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

In vhost-user-server we set all fds received from the other peer
in non-blocking mode. For some of them (e.g. memfd, shm_open, etc.)
it's not really needed, because we don't use these fds with blocking
operations, but only to map memory.

In addition, in some systems this operation can fail (e.g. in macOS
setting an fd returned by shm_open() non-blocking fails with errno
= ENOTTY).

So, let's avoid setting fds non-blocking for those messages that we
know carry memory fds (e.g. VHOST_USER_ADD_MEM_REG,
VHOST_USER_SET_MEM_TABLE).

Reviewed-by: Daniel P. Berrangé 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 05/13] contrib/vhost-user-blk: fix bind() using the right size of the address

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

On macOS passing `-s /tmp/vhost.socket` parameter to the vhost-user-blk
application, the bind was done on `/tmp/vhost.socke` pathname,
missing the last character.

This sounds like one of the portability problems described in the
unix(7) manpage:

Pathname sockets
When  binding  a socket to a pathname, a few rules should
be observed for maximum portability and ease of coding:

•  The pathname in sun_path should be null-terminated.

•  The length of the pathname, including the  terminating
   null byte, should not exceed the size of sun_path.

•  The  addrlen  argument  that  describes  the enclosing
   sockaddr_un structure should have a value of at least:

   offsetof(struct sockaddr_un, sun_path) +
   strlen(addr.sun_path)+1

   or,  more  simply,  addrlen  can   be   specified   as
   sizeof(struct sockaddr_un).

So let's follow the last advice and simplify the code as well.

Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 03/13] libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

libvhost-user will panic when receiving VHOST_USER_GET_INFLIGHT_FD
message if MFD_ALLOW_SEALING is not defined, since it's not able
to create a memfd.

VHOST_USER_GET_INFLIGHT_FD is used only if
VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask
that feature if the backend is not able to properly handle these
messages.

Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v5 02/13] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-05-23 Thread David Hildenbrand

On 23.05.24 16:55, Stefano Garzarella wrote:

In vu_message_write() we use sendmsg() to send the message header,
then a write() to send the payload.

If sendmsg() fails we should avoid sending the payload, since we
were unable to send the header.

Discovered before fixing the issue with the previous patch, where
sendmsg() failed on macOS due to wrong parameters, but the frontend
still sent the payload which the backend incorrectly interpreted
as a wrong header.

Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Acked-by: Stefan Hajnoczi 
Signed-off-by: Stefano Garzarella 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH] hw/riscv/virt: Add hotplugging and virtio-md-pci support

2024-05-21 Thread David Hildenbrand

On 19.05.24 11:24, Daniel Henrique Barboza wrote:


On 5/18/24 16:50, David Hildenbrand wrote:


Hi,



diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index 4fdb66052587..16c2bdbfe6b6 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -53,6 +53,8 @@
    #include "hw/pci-host/gpex.h"
    #include "hw/display/ramfb.h"
    #include "hw/acpi/aml-build.h"
+#include "hw/mem/memory-device.h"
+#include "hw/virtio/virtio-mem-pci.h"
    #include "qapi/qapi-visit-common.h"
    #include "hw/virtio/virtio-iommu.h"
@@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine)
    DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
    int i, base_hartid, hart_count;
    int socket_count = riscv_socket_count(machine);
+    hwaddr device_memory_base, device_memory_size;
    /* Check socket count limit */
    if (VIRT_SOCKETS_MAX < socket_count) {
@@ -1553,6 +1556,25 @@ static void virt_machine_init(MachineState *machine)
    memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base,
    mask_rom);
+    device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + machine->ram_size,
+  GiB);
+    device_memory_size = machine->maxram_size - machine->ram_size;
+
+    if (riscv_is_32bit(&s->soc[0])) {
+    hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size, GiB);
+
+    if (memtop > UINT32_MAX) {
+    error_report("Memory exceeds 32-bit limit by %lu bytes",
+ memtop - UINT32_MAX);
+    exit(EXIT_FAILURE);
+    }
+    }
+
+    if (device_memory_size > 0) {
+    machine_memory_devices_init(machine, device_memory_base,
+    device_memory_size);
+    }
+


I think we need a design discussion before proceeding here. You're allocating 
all
available memory as a memory device area, but in theory we might also support
pc-dimm hotplugs (which would be the equivalent of adding physical RAM dimms to
the board.) in the future too. If you're not familiar with this feature you can
check it out the docs in [1].


Note that DIMMs are memory devices as well. You can plug into the memory device 
area both, ACPI-based memory devices (DIMM, NVDIMM) or virtio-based memory 
devices (virtio-mem, virtio-pmem).



As an example, the 'virt' ARM board (hw/arm/virt.c) reserves a space for this
type of hotplug by checking how much 'ram_slots' we're allocating for it:

device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;



Note that we increased the region size to be able to fit most requests even if 
alignment of memory devices is weird. See below.

In sane setups, this is usually not required (adding a single additional GB for
some flexibility might be good enough).


Other boards do the same with ms->ram_slots. We should consider doing it as 
well,
now, even if we're not up to the point of supporting pc-dimm hotplug, to avoid
having to change the memory layout later down the road and breaking existing
setups.

If we want to copy the ARM board, ram_slots is capped to ACPI_MAX_RAM_SLOTS 
(256).
Each RAM slot is considered to be a 1GiB dimm, i.e. we would reserve 256GiB for
them.


This only reserves some *additional* space to fixup weird alignment of memory 
devices. *not* the actual space for these devices.

We don't consider each DIMM to be 1 GiB in size, but add an additional 1 GiB in 
case we have to align DIMMs in physical address space.

I *think* this dates back to old x86 handling where we aligned the address of 
each DIMM to be at a 1 GiB boundary. So if you would have plugged two 128 MiB 
DIMMs, you'd have required more than 256 MiB of space in the area after 
aligning inside the memory device area.



Thanks for the explanation. I missed the part where the ram_slots were being
used just to solve potential alignment issues and pc-dimms could occupy the same
space being allocated via machine_memory_devices_init().

This patch isn't far off then. If we take care to avoid plugging unaligned 
memory
we might not even need this spare area.


Right.





Note: I do not have the visibility of discussions on the memory management 
space,
and I might be missing details such as "we don't care about pc-dimm hotplug
anymore, it's legacy, we're going to support only virtio-md-pci from now on". We
had a situation like that with virtio-balloon and virtio-mem in the past, and 
I'm
not sure if this might fall in the same category.


Not sure if I got your comment right, but virtio-mem was never supposed to be a 
virtio-balloon replacement (especially of the free-page-reporting and memory 
stats part).


I was trying to refer to a situation we faced 3+ years ago in the powerpc 
machines,
where we were trying to add virtio-mem support there given that virtio-mem 
is/was
been seen (as far as I can remember

Re: [PATCH] hw/riscv/virt: Add hotplugging and virtio-md-pci support

2024-05-18 Thread David Hildenbrand



Hi,



diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index 4fdb66052587..16c2bdbfe6b6 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -53,6 +53,8 @@
   #include "hw/pci-host/gpex.h"
   #include "hw/display/ramfb.h"
   #include "hw/acpi/aml-build.h"
+#include "hw/mem/memory-device.h"
+#include "hw/virtio/virtio-mem-pci.h"
   #include "qapi/qapi-visit-common.h"
   #include "hw/virtio/virtio-iommu.h"
   
@@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine)

   DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
   int i, base_hartid, hart_count;
   int socket_count = riscv_socket_count(machine);
+hwaddr device_memory_base, device_memory_size;
   
   /* Check socket count limit */

   if (VIRT_SOCKETS_MAX < socket_count) {
@@ -1553,6 +1556,25 @@ static void virt_machine_init(MachineState *machine)
   memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base,
   mask_rom);
   
+device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + machine->ram_size,

+  GiB);
+device_memory_size = machine->maxram_size - machine->ram_size;
+
+if (riscv_is_32bit(&s->soc[0])) {
+hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size, GiB);
+
+if (memtop > UINT32_MAX) {
+error_report("Memory exceeds 32-bit limit by %lu bytes",
+ memtop - UINT32_MAX);
+exit(EXIT_FAILURE);
+}
+}
+
+if (device_memory_size > 0) {
+machine_memory_devices_init(machine, device_memory_base,
+device_memory_size);
+}
+


I think we need a design discussion before proceeding here. You're allocating 
all
available memory as a memory device area, but in theory we might also support
pc-dimm hotplugs (which would be the equivalent of adding physical RAM dimms to
the board.) in the future too. If you're not familiar with this feature you can
check it out the docs in [1].


Note that DIMMs are memory devices as well. You can plug into the memory 
device area both, ACPI-based memory devices (DIMM, NVDIMM) or 
virtio-based memory devices (virtio-mem, virtio-pmem).




As an example, the 'virt' ARM board (hw/arm/virt.c) reserves a space for this
type of hotplug by checking how much 'ram_slots' we're allocating for it:

device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;



Note that we increased the region size to be able to fit most requests 
even if alignment of memory devices is weird. See below.


In sane setups, this is usually not required (adding a single additional
GB for some flexibility might be good enough).



Other boards do the same with ms->ram_slots. We should consider doing it as 
well,
now, even if we're not up to the point of supporting pc-dimm hotplug, to avoid
having to change the memory layout later down the road and breaking existing
setups.

If we want to copy the ARM board, ram_slots is capped to ACPI_MAX_RAM_SLOTS 
(256).
Each RAM slot is considered to be a 1GiB dimm, i.e. we would reserve 256GiB for
them.


This only reserves some *additional* space to fixup weird alignment of 
memory devices. *not* the actual space for these devices.


We don't consider each DIMM to be 1 GiB in size, but add an additional 1 
GiB in case we have to align DIMMs in physical address space.


I *think* this dates back to old x86 handling where we aligned the 
address of each DIMM to be at a 1 GiB boundary. So if you would have 
plugged two 128 MiB DIMMs, you'd have required more than 256 MiB of 
space in the area after aligning inside the memory device area.




Note: I do not have the visibility of discussions on the memory management 
space,
and I might be missing details such as "we don't care about pc-dimm hotplug
anymore, it's legacy, we're going to support only virtio-md-pci from now on". We
had a situation like that with virtio-balloon and virtio-mem in the past, and 
I'm
not sure if this might fall in the same category.


Not sure if I got your comment right, but virtio-mem was never supposed 
to be a virtio-balloon replacement (especially of the 
free-page-reporting and memory stats part).




I see that David Hildenbrand is also CCed in the patch so he'll let us know if
I'm out of line with what I'm asking.


Supporting PC-DIMMs might be required at some point when dealing with 
OSes that don't support virtio-mem and friends.


--
Cheers,

David / dhildenb




Re: [PATCH v6 6/8] xen: mapcache: Pass the ram_addr offset to xen_map_cache()

2024-05-16 Thread David Hildenbrand

On 16.05.24 17:48, Edgar E. Iglesias wrote:

From: "Edgar E. Iglesias" 

Pass the ram_addr offset to xen_map_cache.
This is in preparation for adding grant mappings that need
to compute the address within the RAMBlock.

No functional changes.

Signed-off-by: Edgar E. Iglesias 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v6 5/8] softmmu: Replace check for RAMBlock offset 0 with xen_mr_is_memory

2024-05-16 Thread David Hildenbrand

On 16.05.24 17:48, Edgar E. Iglesias wrote:

From: "Edgar E. Iglesias" 

For xen, when checking for the first RAM (xen_memory), use
xen_mr_is_memory() rather than checking for a RAMBlock with
offset 0.

All Xen machines create xen_memory first so this has no
functional change for existing machines.

Signed-off-by: Edgar E. Iglesias 
Reviewed-by: Stefano Stabellini 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v6 4/8] softmmu: xen: Always pass offset + addr to xen_map_cache

2024-05-16 Thread David Hildenbrand

On 16.05.24 17:48, Edgar E. Iglesias wrote:

From: "Edgar E. Iglesias" 

Always pass address with offset to xen_map_cache().
This is in preparation for support for grant mappings.

Since this is within a block that checks for offset == 0,
this has no functional changes.

Signed-off-by: Edgar E. Iglesias 
Reviewed-by: Stefano Stabellini 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 11/14] target/s390x: Fix helper_per_ifetch flags

2024-05-07 Thread David Hildenbrand

On 02.05.24 07:44, Richard Henderson wrote:

CPU state is read on the exception path.

Fixes: 83bb161299c ("target-s390x: PER instruction-fetch nullification event support")
Signed-off-by: Richard Henderson 
---
  target/s390x/helper.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index 5611155ba1..31bd193322 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -361,7 +361,7 @@ DEF_HELPER_FLAGS_1(purge, TCG_CALL_NO_RWG, void, env)
  DEF_HELPER_3(lra, i64, env, i64, i64)
  DEF_HELPER_FLAGS_3(per_check_exception, TCG_CALL_NO_WG, void, env, i64, i32)
  DEF_HELPER_FLAGS_3(per_branch, TCG_CALL_NO_WG, void, env, i64, i32)
-DEF_HELPER_FLAGS_2(per_ifetch, TCG_CALL_NO_RWG, void, env, i64)
+DEF_HELPER_FLAGS_2(per_ifetch, TCG_CALL_NO_WG, void, env, i64)
  DEF_HELPER_FLAGS_2(per_store_real, TCG_CALL_NO_WG, noreturn, env, i32)
  DEF_HELPER_FLAGS_1(stfl, TCG_CALL_NO_RWG, void, env)
  


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v4 14/17] xen: Add xen_mr_is_memory()

2024-05-02 Thread David Hildenbrand

On 30.04.24 18:49, Edgar E. Iglesias wrote:

From: "Edgar E. Iglesias" 

Add xen_mr_is_memory() to abstract away tests for the
xen_memory MR.

Signed-off-by: Edgar E. Iglesias 
---


[...]


  #endif
diff --git a/system/physmem.c b/system/physmem.c
index ad7a8c7d95..1a5ffcba2a 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2227,7 +2227,7 @@ static void *qemu_ram_ptr_length(RAMBlock *block, ram_addr_t addr,
   * because we don't want to map the entire memory in QEMU.
   * In that case just map the requested area.
   */
-if (block->offset == 0) {
+if (xen_mr_is_memory(block->mr)) {
  return xen_map_cache(block->mr, addr, len, lock, lock,
   is_write);
  }


I'd have moved that into a separate patch, because this is not a simple 
abstraction here.


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v4 13/17] softmmu: Pass RAM MemoryRegion and is_write xen_map_cache()

2024-05-02 Thread David Hildenbrand

On 30.04.24 18:49, Edgar E. Iglesias wrote:

From: "Edgar E. Iglesias" 

Propagate MR and is_write to xen_map_cache().


I'm pretty sure the patch subject is missing a "to" :)


This is in preparation for adding support for grant mappings.

No functional change.



Reviewed-by: David Hildenbrand 
--
Cheers,

David / dhildenb




Re: [PATCH v4 01/17] softmmu: let qemu_map_ram_ptr() use qemu_ram_ptr_length()

2024-05-01 Thread David Hildenbrand

On 30.04.24 18:49, Edgar E. Iglesias wrote:

From: Juergen Gross 

qemu_map_ram_ptr() and qemu_ram_ptr_length() share quite some code, so
modify qemu_ram_ptr_length() a little bit and use it for
qemu_map_ram_ptr(), too.

Signed-off-by: Juergen Gross 
Signed-off-by: Vikram Garhwal 
Signed-off-by: Edgar E. Iglesias 
Reviewed-by: Stefano Stabellini 
Reviewed-by: Alex Bennée 
Reviewed-by: Edgar E. Iglesias 
---


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2] hw/s390x: Attach the sclpconsole to /machine/sclp/s390-sclp-event-facility

2024-04-30 Thread David Hildenbrand

On 30.04.24 21:08, Thomas Huth wrote:

The sclpconsole currently does not have a proper parent in the QOM
tree, so it shows up under /machine/unattached - which is somewhat
ugly. We should rather attach it to /machine/sclp/s390-sclp-event-facility
where the other devices of type TYPE_SCLP_EVENT already reside.

Signed-off-by: Thomas Huth 
---
  hw/s390x/s390-virtio-ccw.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 5c83d1ea17..41be8bf857 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -242,11 +242,13 @@ static void s390_create_virtio_net(BusState *bus, const char *name)
  
  static void s390_create_sclpconsole(const char *type, Chardev *chardev)

  {
+BusState *ev_fac_bus = sclp_get_event_facility_bus();
  DeviceState *dev;
  
  dev = qdev_new(type);

+object_property_add_child(OBJECT(ev_fac_bus->parent), type, OBJECT(dev));
  qdev_prop_set_chr(dev, "chardev", chardev);
-qdev_realize_and_unref(dev, sclp_get_event_facility_bus(), &error_fatal);
+qdev_realize_and_unref(dev, ev_fac_bus, &error_fatal);
  }
  
  static void ccw_init(MachineState *machine)


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v4 1/2] target/s390x: report deprecated-props in cpu-model-expansion reply

2024-04-30 Thread David Hildenbrand

On 29.04.24 21:10, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. A query-cpu-model-expansion reply will now provide a list of
properties (i.e. features) that are flagged as deprecated. Example:

 {
   "return": {
 "model": {
   "name": "z14.2-base",
   "deprecated-props": [
 "bpb",
 "csske"
   ],
   "props": {
 "pfmfi": false,
 "exrl": true,
 ...a lot more props...
 "skey": false,
 "vxpdeh2": false
   }
 }
   }
 }

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatibility with future hardware.

Signed-off-by: Collin Walling 



Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v4 2/2] target/s390x: flag te and cte as deprecated

2024-04-30 Thread David Hildenbrand

On 29.04.24 21:10, Collin Walling wrote:

Add the CONSTRAINT_TRANSACTIONAL_EXE (cte) and TRANSACTIONAL_EXE (te)
to the list of deprecated features.

Signed-off-by: Collin Walling 
---
  target/s390x/cpu_features.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c
index efafc9711c..cb4e2b8920 100644
--- a/target/s390x/cpu_features.c
+++ b/target/s390x/cpu_features.c
@@ -218,6 +218,9 @@ void s390_get_deprecated_features(S390FeatBitmap features)
   /* CSSKE is deprecated on newer generations */
   S390_FEAT_CONDITIONAL_SSKE,
   S390_FEAT_BPB,
+ /* Deprecated on z16 */
+ S390_FEAT_CONSTRAINT_TRANSACTIONAL_EXE,
+ S390_FEAT_TRANSACTIONAL_EXE
  };
  int i;
  


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH] hw/s390x: Attach the sclpconsole to the /machine/sclp node

2024-04-30 Thread David Hildenbrand

On 30.04.24 10:04, Thomas Huth wrote:

The sclpconsole currently does not have a proper parent in the QOM
tree, so it shows up under /machine/unattached - which is somewhat
ugly. Let's attach it to /machine/sclp instead.




IIRC, this should not affect migration

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v3 1/2] target/s390x: report deprecated-props in cpu-model-expansion reply

2024-04-26 Thread David Hildenbrand

On 26.04.24 19:44, David Hildenbrand wrote:

On 24.04.24 23:56, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. A query-cpu-model-expansion reply will now provide a list of
properties (i.e. features) that are flagged as deprecated. Example:

  {
"return": {
  "model": {
"name": "z14.2-base",
"deprecated-props": [
  "bpb",
  "csske"
],
"props": {
  "pfmfi": false,
  "exrl": true,
  ...a lot more props...
  "skey": false,
  "vxpdeh2": false
}
  }
}
  }

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatibility with future hardware.


Likely you should only report features that are applicable to a model.
that is, if it's part of the full_feat.

Otherwise, the caller might simply want to set all features to "false",
and we'd fail setting a feature that is unknown to a specific CPU
generation.

That is, you would AND the bitmap with the full_feat of the underlying
CPU definition.


Refreshing my memory, I think we can just clear any CPU features. We 
only bail out when setting them!


--
Cheers,

David / dhildenb




Re: [PATCH v3 1/2] target/s390x: report deprecated-props in cpu-model-expansion reply

2024-04-26 Thread David Hildenbrand

On 24.04.24 23:56, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. A query-cpu-model-expansion reply will now provide a list of
properties (i.e. features) that are flagged as deprecated. Example:

 {
   "return": {
 "model": {
   "name": "z14.2-base",
   "deprecated-props": [
 "bpb",
 "csske"
   ],
   "props": {
 "pfmfi": false,
 "exrl": true,
 ...a lot more props...
 "skey": false,
 "vxpdeh2": false
   }
 }
   }
 }

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatibility with future hardware.


Likely you should only report features that are applicable to a model. 
that is, if it's part of the full_feat.


Otherwise, the caller might simply want to set all features to "false", 
and we'd fail setting a feature that is unknown to a specific CPU 
generation.


That is, you would AND the bitmap with the full_feat of the underlying 
CPU definition.


--
Cheers,

David / dhildenb




Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option

2024-04-25 Thread David Hildenbrand

On 24.04.24 20:33, Collin Walling wrote:

On 4/24/24 03:24, David Hildenbrand wrote:

On 23.04.24 23:06, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. When a query-cpu-model-expansion is provided with the
"disable-deprecated-feats" option set, the resulting properties list
will include all deprecated features paired with false. Example:

{ ... "bpb": false, "csske": false, ...}

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatability with future hardware.

(s/compatability/compatibility/)

Signed-off-by: Collin Walling 
---
   target/s390x/cpu_features.c  | 14 ++
   target/s390x/cpu_features.h  |  1 +
   target/s390x/cpu_models_sysemu.c | 20 
   3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c
index d28eb65845..efafc9711c 100644
--- a/target/s390x/cpu_features.c
+++ b/target/s390x/cpu_features.c
@@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque,
   };
   }
   
+void s390_get_deprecated_features(S390FeatBitmap features)

+{
+static const int feats[] = {
+ /* CSSKE is deprecated on newer generations */
+ S390_FEAT_CONDITIONAL_SSKE,
+ S390_FEAT_BPB,
+};
+int i;
+
+for (i = 0; i < ARRAY_SIZE(feats); i++) {
+set_bit(feats[i], features);
+}
+}
+
   #define FEAT_GROUP_INIT(_name, _group, _desc)\
   {\
   .name = _name,   \
diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h
index a9bd68a2e1..661a8cd6db 100644
--- a/target/s390x/cpu_features.h
+++ b/target/s390x/cpu_features.h
@@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, S390FeatType type,
 uint8_t *data);
   void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque,
  void (*fn)(const char *name, void *opaque));
+void s390_get_deprecated_features(S390FeatBitmap features);
   
   /* Definition of a CPU feature group */

   typedef struct {
diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c
index ef9fa80efd..b002819188 100644
--- a/target/s390x/cpu_models_sysemu.c
+++ b/target/s390x/cpu_models_sysemu.c
@@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void *opaque)
   
   /* convert S390CPUDef into a static CpuModelInfo */

   static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model,
-bool delta_changes)
+bool delta_changes,
+bool disable_deprecated_feats)
   {
   QDict *qdict = qdict_new();
   S390FeatBitmap bitmap;
@@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model,
   s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
   }
   
+/* features flagged as deprecated */

+if (disable_deprecated_feats) {
+bitmap_zero(bitmap, S390_FEAT_MAX);
+s390_get_deprecated_features(bitmap);
+s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
+}


Likely, we should remove these features from the actual bitmap, such
that they won't appear twice in the output? I'd expect the
cpu_info_from_model() caller to handle that.

Just adding them to the list as disabled is likely wrong.

For example, if someone were to expand a given model with "... bpb:
true" with disable-deprecated-feats=on, the "bpb=true" should be
removed, and only replaced by "bpb=false" if it would be part of the
CPU model we would be expanding to.

Or am I missing something?



qdict_add_disabled_feat will handle updating the feature if it already
exists. I placed the code to process deprecated features as the last
step of cpu_info_from_model to override any features that have already
been added to the bitmap. Whether it should be the deprecated feats that
take priority, or what the user requests is up in the air, however...


Yes, that's one of my concern. IIRC, if the user specifies the same 
property multiple times, it's unclear which one will win.


"bpb=true,bpb=false" might mean that bpb=true might win.

I think this is something that this interface should sort out, so it 
will be actually usable!




... Daniel's suggestion to modify the QMP response to include a separate
list of "deprecated-props" seems like a much more efficient and readable
way that alleviates both your and Markus' concerns.


Would you only include properties that would apply to the current model 
and would be set to true in the current model?


How would libvirt make use of this interface, and could it run into the 
issue spelled out above?


--
Cheers,

David / dhildenb




Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option

2024-04-24 Thread David Hildenbrand

On 23.04.24 23:06, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. When a query-cpu-model-expansion is provided with the
"disable-deprecated-feats" option set, the resulting properties list
will include all deprecated features paired with false. Example:

{ ... "bpb": false, "csske": false, ...}

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatibility with future hardware.

Signed-off-by: Collin Walling 
---
  target/s390x/cpu_features.c  | 14 ++
  target/s390x/cpu_features.h  |  1 +
  target/s390x/cpu_models_sysemu.c | 20 
  3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c
index d28eb65845..efafc9711c 100644
--- a/target/s390x/cpu_features.c
+++ b/target/s390x/cpu_features.c
@@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque,
  };
  }
  
+void s390_get_deprecated_features(S390FeatBitmap features)

+{
+static const int feats[] = {
+ /* CSSKE is deprecated on newer generations */
+ S390_FEAT_CONDITIONAL_SSKE,
+ S390_FEAT_BPB,
+};
+int i;
+
+for (i = 0; i < ARRAY_SIZE(feats); i++) {
+set_bit(feats[i], features);
+}
+}
+
  #define FEAT_GROUP_INIT(_name, _group, _desc)\
  {\
  .name = _name,   \
diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h
index a9bd68a2e1..661a8cd6db 100644
--- a/target/s390x/cpu_features.h
+++ b/target/s390x/cpu_features.h
@@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, S390FeatType type,
uint8_t *data);
  void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque,
 void (*fn)(const char *name, void *opaque));
+void s390_get_deprecated_features(S390FeatBitmap features);
  
  /* Definition of a CPU feature group */

  typedef struct {
diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c
index ef9fa80efd..b002819188 100644
--- a/target/s390x/cpu_models_sysemu.c
+++ b/target/s390x/cpu_models_sysemu.c
@@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void *opaque)
  
  /* convert S390CPUDef into a static CpuModelInfo */

  static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model,
-bool delta_changes)
+bool delta_changes,
+bool disable_deprecated_feats)
  {
  QDict *qdict = qdict_new();
  S390FeatBitmap bitmap;
@@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model,
  s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
  }
  
+/* features flagged as deprecated */

+if (disable_deprecated_feats) {
+bitmap_zero(bitmap, S390_FEAT_MAX);
+s390_get_deprecated_features(bitmap);
+s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
+}


Likely, we should remove these features from the actual bitmap, such 
that they won't appear twice in the output? I'd expect the 
cpu_info_from_model() caller to handle that.


Just adding them to the list as disabled is likely wrong.

For example, if someone were to expand a given model with "... bpb: 
true" with disable-deprecated-feats=on, the "bpb=true" should be 
removed, and only replaced by "bpb=false" if it would be part of the 
CPU model we would be expanding to.


Or am I missing something?

--
Cheers,

David / dhildenb




Re: [PATCH 0/3] Remove useless architecture prefix from the CPU list

2024-04-20 Thread David Hildenbrand

On 20.04.24 07:46, Thomas Huth wrote:

Printing an architecture prefix in front of each CPU name is not helpful
at all: It is confusing for the users since they don't know whether they
have to specify these letters for the "-cpu" parameter, too, and it also
takes some precious space in the dense output of the CPU entries. Let's
simply remove those now.


Yes, I also never really understood the purpose.

--
Cheers,

David / dhildenb




[PATCH v1] virtio-mem: improve error message when unplug of device fails due to plugged memory

2024-04-16 Thread David Hildenbrand
The error message is actually expressive, considering QEMU only. But
when called from Libvirt, talking about "size" can be confusing, because
in Libvirt "size" translates to the memory backend size in QEMU (maximum
size) and "current" translates to the QEMU "size" property.

Let's simply avoid talking about the "size" property and spell out that
some device memory is still plugged.

Cc: Liang Cong 
Cc: Mario Casquero 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 hw/virtio/virtio-mem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index ffd119ebac..ef64bf1b4a 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1832,8 +1832,8 @@ static void virtio_mem_unplug_request_check(VirtIOMEM *vmem, Error **errp)
 }
 
 if (vmem->size) {
-error_setg(errp, "virtio-mem device cannot get unplugged while"
-   " '" VIRTIO_MEM_SIZE_PROP "' != '0'");
+error_setg(errp, "virtio-mem device cannot get unplugged while some"
+   " of its memory is still plugged");
 return;
 }
 if (vmem->requested_size) {
-- 
2.44.0




Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-04-08 Thread David Hildenbrand

On 08.04.24 09:58, Stefano Garzarella wrote:

On Thu, Apr 04, 2024 at 04:09:34PM +0200, David Hildenbrand wrote:

On 04.04.24 14:23, Stefano Garzarella wrote:

shm_open() creates and opens a new POSIX shared memory object.
A POSIX shared memory object allows creating a memory backend with an
associated file descriptor that can be shared with external processes
(e.g. vhost-user).

The new `memory-backend-shm` can be used as an alternative when
`memory-backend-memfd` is not available (Linux only), since shm_open()
should be provided by any POSIX-compliant operating system.

This backend mimics memfd, allocating memory that is practically
anonymous. In theory shm_open() requires a name, but this is allocated
for a short time interval and shm_unlink() is called right after
shm_open(). After that, only the fd is shared with external processes
(e.g., vhost-user) as if it were associated with anonymous memory.

In the future we may also allow the user to specify the name to be
passed to shm_open(), but for now we keep the backend simple, mimicking
anonymous memory such as memfd.

Signed-off-by: Stefano Garzarella 
---
v3
- enriched commit message and documentation to highlight that we
   want to mimic memfd (David)
---
  docs/system/devices/vhost-user.rst |   5 +-
  qapi/qom.json  |  17 +
  backends/hostmem-shm.c | 118 +
  backends/meson.build   |   1 +
  qemu-options.hx|  11 +++
  5 files changed, 150 insertions(+), 2 deletions(-)
  create mode 100644 backends/hostmem-shm.c

diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst
index 9b2da106ce..35259d8ec7 100644
--- a/docs/system/devices/vhost-user.rst
+++ b/docs/system/devices/vhost-user.rst
@@ -98,8 +98,9 @@ Shared memory object
  In order for the daemon to access the VirtIO queues to process the
  requests it needs access to the guest's address space. This is
-achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
-objects. A reference to a file-descriptor which can access this object
+achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or
+``memory-backend-shm`` objects.
+A reference to a file-descriptor which can access this object
  will be passed via the socket as part of the protocol negotiation.
  Currently the shared memory object needs to match the size of the main
diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..5252ec69e3 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -721,6 +721,19 @@
  '*hugetlbsize': 'size',
  '*seal': 'bool' } }
+##
+# @MemoryBackendShmProperties:
+#
+# Properties for memory-backend-shm objects.
+#
+# The @share boolean option is true by default with shm.
+#
+# Since: 9.1
+##
+{ 'struct': 'MemoryBackendShmProperties',
+  'base': 'MemoryBackendProperties',
+  'data': { } }
+


Acked-by: David Hildenbrand 

One comment: we should maybe just forbid setting share=off. It doesn't
make any sense and it can even result in an unexpected double memory
consumption. We missed doing that for memfd, unfortunately.


Good point!

IIUC the `share` property is defined by the parent `hostmem`, so I
should find a way to override the property here and disable the setter,
or add an option to `hostmem` to make the property non-writable.


Right, or simply fail later when you would find "share=off" in 
shm_backend_memory_alloc().


Whenever supporting named shm_open(), it could make sense for VM 
snapshotting. Right now it doesn't really make any sense.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-04-04 Thread David Hildenbrand

On 04.04.24 14:23, Stefano Garzarella wrote:

shm_open() creates and opens a new POSIX shared memory object.
A POSIX shared memory object allows creating a memory backend with an
associated file descriptor that can be shared with external processes
(e.g. vhost-user).

The new `memory-backend-shm` can be used as an alternative when
`memory-backend-memfd` is not available (Linux only), since shm_open()
should be provided by any POSIX-compliant operating system.

This backend mimics memfd, allocating memory that is practically
anonymous. In theory shm_open() requires a name, but this is allocated
for a short time interval and shm_unlink() is called right after
shm_open(). After that, only the fd is shared with external processes
(e.g., vhost-user) as if it were associated with anonymous memory.

In the future we may also allow the user to specify the name to be
passed to shm_open(), but for now we keep the backend simple, mimicking
anonymous memory such as memfd.

Signed-off-by: Stefano Garzarella 
---
v3
- enriched commit message and documentation to highlight that we
   want to mimic memfd (David)
---
  docs/system/devices/vhost-user.rst |   5 +-
  qapi/qom.json  |  17 +
  backends/hostmem-shm.c | 118 +
  backends/meson.build   |   1 +
  qemu-options.hx|  11 +++
  5 files changed, 150 insertions(+), 2 deletions(-)
  create mode 100644 backends/hostmem-shm.c

diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst
index 9b2da106ce..35259d8ec7 100644
--- a/docs/system/devices/vhost-user.rst
+++ b/docs/system/devices/vhost-user.rst
@@ -98,8 +98,9 @@ Shared memory object
  
  In order for the daemon to access the VirtIO queues to process the

  requests it needs access to the guest's address space. This is
-achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
-objects. A reference to a file-descriptor which can access this object
+achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or
+``memory-backend-shm`` objects.
+A reference to a file-descriptor which can access this object
  will be passed via the socket as part of the protocol negotiation.
  
  Currently the shared memory object needs to match the size of the main

diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..5252ec69e3 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -721,6 +721,19 @@
  '*hugetlbsize': 'size',
  '*seal': 'bool' } }
  
+##

+# @MemoryBackendShmProperties:
+#
+# Properties for memory-backend-shm objects.
+#
+# The @share boolean option is true by default with shm.
+#
+# Since: 9.1
+##
+{ 'struct': 'MemoryBackendShmProperties',
+  'base': 'MemoryBackendProperties',
+  'data': { } }
+


Acked-by: David Hildenbrand 

One comment: we should maybe just forbid setting share=off. It doesn't 
make any sense and it can even result in an unexpected double memory 
consumption. We missed doing that for memfd, unfortunately.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread David Hildenbrand


So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like
a memfd, got it.

Likely we should somehow fail if the "POSIX shared memory object"
already exists. Hmm.


`O_CREAT | O_EXCL` should provide just this behavior.
  From shm_open(3) manpage:

  O_EXCL  If O_CREAT was also specified, and a shared memory object
  with the given name already exists, return an error.  The
  check for the existence of the object, and its creation if
  it does not exist, are performed atomically.
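Put together, the create-with-`O_EXCL`-then-unlink dance under discussion can be sketched as follows. This is a standalone demo, not the backend's actual code; the name format and function name are illustrative only:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Open an "anonymous" POSIX shared memory object: create it with
 * O_EXCL so we fail if the name already exists, then shm_unlink() it
 * immediately so only the fd remains visible to other processes. */
static int open_anonymous_shm(size_t size)
{
    char name[64];
    int fd;

    /* pid-based name to avoid collisions between concurrent processes */
    snprintf(name, sizeof(name), "/demo-%d-shm", (int)getpid());

    fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        return -1;
    }
    /* The name disappears right away; the fd keeps the object alive. */
    shm_unlink(name);

    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The resulting fd behaves much like a memfd: it can be mmap()ed, passed over a Unix socket, and is destroyed once the last user closes it.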



Cool!





That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using
shm_unlink(), then the name in fact shouldn't matter and this would be
fine. Correct?


Right. It's a name that will only "appear" in the system for a very
short time since we call shm_unlink() right after shm_open(). So I just
need the unique name to prevent several QEMUs launched at the same time
from colliding with the name.


Okay, essentially emulating tmpfile(), but for weird shm_open() :)





So I assume if we ever want to have non-anonymous fds here, we'd pass
in a new property "name", and change the logic how we open/unlink.
Makes sense.


Exactly! I can clarify this in the commit description as well.

Thanks for the helpful comments!
If there is anything I need to change for v3, please tell me ;-)


Besides some patch description extension, makes sense to me :)

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread David Hildenbrand

On 27.03.24 11:23, Stefano Garzarella wrote:

On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote:

+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For  portable  use,  a shared  memory  object should be identified
+ *   by a name of the form /somename;"
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was
specified, generate one ourselves?


I thought about it and initially did it that way, but then some problems
came up so I tried to keep it as simple as possible for the user and for
our use case (having an fd associated with memory and sharing it with
other processes).

The problems I had were:

- what mode_t to use if the object does not exist and needs to be
created?
- exclude O_EXCL if the user passes the name since they may have already
created it?


I'd handle both like we handle files. But I understand now that you 
really just want to "simulate" memfd.




- call shm_unlink() only at object deallocation?


Right, we don't really do that for ordinary files, they keep existing. 
We only have the "discard-data" property to free up memory.


For memfd, it just happens "automatically". They cannot be looked up by 
name, and once the last user closes the fd, it gets destroyed.




So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like 
a memfd, got it.


Likely we should somehow fail if the "POSIX shared memory object" 
already exists. Hmm.




That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using 
shm_unlink(), then the name in fact shouldn't matter and this would be 
fine. Correct?



So I assume if we ever want to have non-anonymous fds here, we'd pass in 
a new property "name", and change the logic how we open/unlink. Makes sense.


--
Cheers,

David / dhildenb




Re: [PATCH-for-9.1 v2 13/21] hw/mem/pc-dimm: Remove legacy_align argument from pc_dimm_pre_plug()

2024-03-27 Thread David Hildenbrand

On 27.03.24 10:51, Philippe Mathieu-Daudé wrote:

'legacy_align' is always NULL, remove it.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Thomas Huth 
Message-Id: <20240305134221.30924-11-phi...@linaro.org>
---


I was really confused for a second until I saw that this series is 
dependent on another one.


Please make sure to CC people you CC on patches to also CC on the cover 
letter.


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH-for-9.1 v2 14/21] hw/mem/memory-device: Remove legacy_align from memory_device_pre_plug()

2024-03-27 Thread David Hildenbrand

  void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
-const uint64_t *legacy_align, Error **errp)
+Error **errp)
  {
  const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
  Error *local_err = NULL;
@@ -388,14 +388,10 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
  return;
  }
  
-if (legacy_align) {

-align = *legacy_align;
-} else {
-if (mdc->get_min_alignment) {
-align = mdc->get_min_alignment(md);
-}
-align = MAX(align, memory_region_get_alignment(mr));
+if (mdc->get_min_alignment) {
+align = mdc->get_min_alignment(md);
  }
+align = MAX(align, memory_region_get_alignment(mr));
  addr = mdc->get_addr(md);
  addr = memory_device_get_free_addr(ms, !addr ? NULL : &addr, align,
                                     memory_region_size(mr), &local_err);


Very nice!

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-27 Thread David Hildenbrand


I was wondering if we could see some partial sendmsg()/write
succeeding. Meaning, we transferred some bytes but not all, and we'd
actually need to loop ...


Yep, true, but I would fix it in another patch/series if you agree.


Absolutely.

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-26 Thread David Hildenbrand

+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For  portable  use,  a shared  memory  object should be identified
+ *   by a name of the form /somename;"
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was 
specified, generate one ourselves?


I'm also not quite sure if "host_memory_backend_get_name()" should be 
used for the purpose here.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-26 Thread David Hildenbrand

On 26.03.24 15:34, Eric Blake wrote:

On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote:

In vu_message_write() we use sendmsg() to send the message header,
then a write() to send the payload.

If sendmsg() fails we should avoid sending the payload, since we
were unable to send the header.

Discovered before fixing the issue with the previous patch, where
sendmsg() failed on macOS due to wrong parameters, but the frontend
still sent the payload which the backend incorrectly interpreted
as a wrong header.

Signed-off-by: Stefano Garzarella 
---
  subprojects/libvhost-user/libvhost-user.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 22bea0c775..a11afd1960 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
  rc = sendmsg(conn_fd, &msg, 0);
  } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
  
+if (rc <= 0) {


Is rejecting a 0 return value correct?  Technically, a 0 return means
a successful write of 0 bytes - but then again, it is unwise to
attempt to send an fd or other auxiliary data without at least one
regular byte, so we should not be attempting a write of 0 bytes.  So I
guess this one is okay, although I might have used < instead of <=.


I was wondering if we could see some partial sendmsg()/write succeeding. 
Meaning, we transferred some bytes but not all, and we'd actually need 
to loop ...
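Handling such partial transfers would mean looping until the whole buffer has gone out, roughly like the generic sketch below (this is not libvhost-user's code; it retries on EINTR/EAGAIN in the same spirit as the existing do/while):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Keep calling send() until all `len` bytes have been transferred,
 * retrying on EINTR/EAGAIN and advancing past partial writes.
 * Returns 0 on success, -1 on error. */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t rc = send(fd, p, len, 0);
        if (rc < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                continue;   /* busy-retry, as the existing loop does */
            }
            return -1;
        }
        p += rc;            /* partial write: advance and continue */
        len -= rc;
    }
    return 0;
}
```

A real implementation would likely also want a cmsg-aware variant so the fd array is only attached to the first fragment.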


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty

2024-03-26 Thread David Hildenbrand

On 26.03.24 14:39, Stefano Garzarella wrote:

On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if
the `struct msghdr` has the field `msg_controllen` set to 0, but
`msg_control` is not NULL.

Signed-off-by: Stefano Garzarella 
---
  subprojects/libvhost-user/libvhost-user.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index a879149fef..22bea0c775 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
  memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
  } else {
  msg.msg_controllen = 0;
+msg.msg_control = NULL;
  }
  
  do {
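In isolation, the portability rule the patch establishes — leave msg_control NULL whenever msg_controllen is 0 — looks like this (a standalone sketch, not the libvhost-user function itself):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one byte with no ancillary data. Portable callers must leave
 * msg_control NULL when msg_controllen is 0: on macOS, sendmsg()
 * fails with EINVAL if msg_control is non-NULL but the length is 0. */
static ssize_t send_plain_byte(int fd)
{
    static char byte = 'x';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg;

    /* zero-init: msg_control == NULL and msg_controllen == 0 */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return sendmsg(fd, &msg, 0);
}
```

Zero-initializing the whole msghdr up front, and only filling in the cmsg fields when fds are actually attached, sidesteps the problem entirely.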


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands

2024-03-25 Thread David Hildenbrand

  #
  # Usually, a CPU model is compared against the maximum possible CPU
  # model of a certain configuration (e.g. the "host" model for KVM).
@@ -154,7 +155,14 @@
  # Some architectures may not support comparing CPU models.  s390x
  # supports comparing CPU models.
  #
-# Returns: a CpuModelBaselineInfo
+# @modela: description of the first CPU model to compare, referred to as
+# "model A" in CpuModelCompareResult
+#
+# @modelb: description of the second CPU model to compare, referred to as
+# "model B" in CpuModelCompareResult
+#
+# Returns: a CpuModelCompareInfo, describing how both CPU models


Scratch the comma?


Agreed to both. Do you want to fixup when applying or do you prefer a v2?

Thanks!



Reviewed-by: Markus Armbruster 



--
Cheers,

David / dhildenb




[PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands

2024-03-25 Thread David Hildenbrand
Let's document the parameters of these commands, so we can remove them
from the "documentation-exceptions" list.

While at it, extend the "Returns:" documentation as well, fixing a wrong
use of CpuModelBaselineInfo vs. CpuModelCompareInfo for
query-cpu-model-comparison.

Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Eduardo Habkost 
Cc: Marcel Apfelbaum 
Cc: "Philippe Mathieu-Daudé" 
Cc: Yanan Wang 
Signed-off-by: David Hildenbrand 
---
 qapi/machine-target.json | 46 +++-
 qapi/pragma.json |  3 ---
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/qapi/machine-target.json b/qapi/machine-target.json
index 519adf3220..7883616cce 100644
--- a/qapi/machine-target.json
+++ b/qapi/machine-target.json
@@ -124,11 +124,12 @@
 ##
 # @query-cpu-model-comparison:
 #
-# Compares two CPU models, returning how they compare in a specific
-# configuration.  The results indicates how both models compare
-# regarding runnability.  This result can be used by tooling to make
-# decisions if a certain CPU model will run in a certain configuration
-# or if a compatible CPU model has to be created by baselining.
+# Compares two CPU models, @modela and @modelb, returning how they
+# compare in a specific configuration.  The results indicates how
+# both models compare regarding runnability.  This result can be
+# used by tooling to make decisions if a certain CPU model will
+# run in a certain configuration or if a compatible CPU model has
+# to be created by baselining.
 #
 # Usually, a CPU model is compared against the maximum possible CPU
 # model of a certain configuration (e.g. the "host" model for KVM).
@@ -154,7 +155,14 @@
 # Some architectures may not support comparing CPU models.  s390x
 # supports comparing CPU models.
 #
-# Returns: a CpuModelBaselineInfo
+# @modela: description of the first CPU model to compare, referred to as
+# "model A" in CpuModelCompareResult
+#
+# @modelb: description of the second CPU model to compare, referred to as
+# "model B" in CpuModelCompareResult
+#
+# Returns: a CpuModelCompareInfo, describing how both CPU models
+# compare
 #
 # Errors:
 # - if comparing CPU models is not supported
@@ -175,9 +183,9 @@
 ##
 # @query-cpu-model-baseline:
 #
-# Baseline two CPU models, creating a compatible third model.  The
-# created model will always be a static, migration-safe CPU model (see
-# "static" CPU model expansion for details).
+# Baseline two CPU models, @modela and @modelb, creating a compatible
+# third model.  The created model will always be a static,
+# migration-safe CPU model (see "static" CPU model expansion for details).
 #
 # This interface can be used by tooling to create a compatible CPU
 # model out two CPU models.  The created CPU model will be identical
@@ -204,7 +212,11 @@
 # Some architectures may not support baselining CPU models.  s390x
 # supports baselining CPU models.
 #
-# Returns: a CpuModelBaselineInfo
+# @modela: description of the first CPU model to baseline
+#
+# @modelb: description of the second CPU model to baseline
+#
+# Returns: a CpuModelBaselineInfo, describing the baselined CPU model
 #
 # Errors:
 # - if baselining CPU models is not supported
@@ -243,10 +255,10 @@
 ##
 # @query-cpu-model-expansion:
 #
-# Expands a given CPU model (or a combination of CPU model +
-# additional options) to different granularities, allowing tooling to
-# get an understanding what a specific CPU model looks like in QEMU
-# under a certain configuration.
+# Expands a given CPU model, @model, (or a combination of CPU model +
+# additional options) to different granularities, specified by
+# @type, allowing tooling to get an understanding what a specific
+# CPU model looks like in QEMU under a certain configuration.
 #
 # This interface can be used to query the "host" CPU model.
 #
@@ -269,7 +281,11 @@
 # Some architectures may not support all expansion types.  s390x
 # supports "full" and "static". Arm only supports "full".
 #
-# Returns: a CpuModelExpansionInfo
+# @model: description of the CPU model to expand
+#
+# @type: expansion type, specifying how to expand the CPU model
+#
+# Returns: a CpuModelExpansionInfo, describing the expanded CPU model
 #
 # Errors:
 # - if expanding CPU models is not supported
diff --git a/qapi/pragma.json b/qapi/pragma.json
index 6929ab776e..0d82bc1a03 100644
--- a/qapi/pragma.json
+++ b/qapi/pragma.json
@@ -90,9 +90,6 @@
 'XDbgBlockGraph',
 'YankInstanceType',
 'blockdev-reopen',
-'query-cpu-model-baseline',
-'query-cpu-model-comparison',
-'query-cpu-model-expansion',
 'query-rocker',
 'query-rocker-ports',
 'query-stats-schemas',
-- 
2.43.2




Re: Let's close member documentation gaps

2024-03-25 Thread David Hildenbrand

On 25.03.24 15:13, David Hildenbrand wrote:

On 25.03.24 10:36, Markus Armbruster wrote:

If you're cc'ed, I have a bit of doc work for you.  Search for your
name to find it.

The QAPI generator forces you to document your stuff.  Except for
commands, events, enum and object types listed in pragma
documentation-exceptions, the generator silently defaults missing
documentation to "Not documented".  Right now, we're using this loophole
some 500 times.



What would be the right step to make sure I am resolving these "hidden"
issues, and that the QAPI generator would be happy with my changes?


Ah, I assume simply removing the offender from "qapi/pragma.json" and 
then compiling.


--
Cheers,

David / dhildenb




Re: Let's close member documentation gaps

2024-03-25 Thread David Hildenbrand

On 25.03.24 10:36, Markus Armbruster wrote:

If you're cc'ed, I have a bit of doc work for you.  Search for your
name to find it.

The QAPI generator forces you to document your stuff.  Except for
commands, events, enum and object types listed in pragma
documentation-exceptions, the generator silently defaults missing
documentation to "Not documented".  Right now, we're using this loophole
some 500 times.



What would be the right step to make sure I am resolving these "hidden" 
issues, and that the QAPI generator would be happy with my changes?


--
Cheers,

David / dhildenb




Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread David Hildenbrand

On 20.03.24 18:38, Michael Roth wrote:

On Wed, Mar 20, 2024 at 10:37:14AM +0100, David Hildenbrand wrote:

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When memory page is converted from private to shared, the original
private memory is back'ed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused: changelog
talks about v5 and this is "PATCH v3"

Please make sure to send at least the cover letter along (I might not need
the other 46 patches :D ).


Sorry for the confusion, you got auto-Cc'd by git, which is good, but
not sure there's a good way to make sure everyone gets a copy of the
cover letter. I could see how it would help useful to potential
reviewers though. I'll try to come up with a script for it and take that
approach in the future.


A script shared with me in the past to achieve that in most cases:

$ cat cc-cmd.sh
#!/bin/bash

if [[ $1 == *gitsendemail.msg* || $1 == *cover-letter* ]]; then
grep ': .* <.*@.*>' -h *.patch | sed 's/^.*: //' | sort | uniq
fi


And attach to "git send-email ... *.patch": --cc-cmd=./cc-cmd.sh

--
Cheers,

David / dhildenb




Re: [PATCH 2/4] hw/s390x/virtio-ccw: Always deliver NMI to first CPU

2024-03-20 Thread David Hildenbrand

On 20.02.24 16:08, Philippe Mathieu-Daudé wrote:

We can trigger NMI from HMP or QMP.

QEMU maps the NMI to the s390x per-CPU 'RESTART' interrupt.
Linux guests usually setup this interrupt to trigger kdump
or crash. Such crashdump can be triggered in QEMU by HMP
"nmi" or QMP "inject-nmi" commands.

Using QMP, since we can not select a particular CPU, the first
CPU is used (CPU#0). See the documentation from commit 795dc6e4
("watchdog: Add new Virtual Watchdog action INJECT-NMI"):

   @inject-nmi: a non-maskable interrupt is injected into the
first VCPU (all VCPUS on x86) (since 2.4)

While we can select a particular CPU on HMP, the guest behavior
is expected to be the same if using CPU #N or CPU #0. Since
always using CPU#0 simplifies API maintenance, update s390_nmi()
to deliver NMI to the first CPU.

Signed-off-by: Philippe Mathieu-Daudé 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread David Hildenbrand

On 20.03.24 13:43, Xiaoyao Li wrote:

On 3/20/2024 5:37 PM, David Hildenbrand wrote:

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When memory page is converted from private to shared, the original
private memory is back'ed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it.


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
    next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused:
changelog talks about v5 and this is "PATCH v3"


As above, because the guest_memfd patches in my TDX-QEMU v5[1] were
directly picked for this series, so the change history says v5. They are
needed by SEV-SNP as well.


Thanks, I was missing the context without a cover letter. These comments 
likely should be dropped here.


--
Cheers,

David / dhildenb




Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread David Hildenbrand

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When memory page is converted from private to shared, the original
private memory is back'ed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
   next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused: 
changelog talks about v5 and this is "PATCH v3"


Please make sure to send at least the cover letter along (I might not 
need the other 46 patches :D ).


--
Cheers,

David / dhildenb




Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-19 Thread David Hildenbrand

On 19.03.24 15:23, Peter Maydell wrote:

On Tue, 19 Mar 2024 at 09:24, David Hildenbrand  wrote:

I spotted new pause_all_vcpus() / resume_all_vcpus() calls in 
hw/intc/arm_gicv3_kvm.c and
thought they would be the problematic bit.

Yeah, that's going to be problematic. Further note that a lot of code does not 
expect
that the BQL is suddenly dropped.


Agreed; we already have one nasty set of bugs in the framebuffer
devices because a function drops the BQL briefly:
https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/#u
so let's avoid introducing any more of a similar kind.

Side note, the pause_all_vcpus()/resume_all_vcpus() calls in
hw/i386/vapic.c are probably a bit suspect for similar reasons.


Exactly my thoughts. But there, it was less clear "why" it is even 
required. It's only performed for KVM.


Do we also just want to stop KVM threads from executing instructions, 
so that blocking KVM ioctls might be a reasonable "replacement"? Really not sure.


--
Cheers,

David / dhildenb




Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-19 Thread David Hildenbrand

On 19.03.24 10:24, David Hildenbrand wrote:

On 19.03.24 06:06, zhukeqian wrote:

Hi David,

Thanks for reviewing.

On 17.03.24 09:37, Keqian Zhu via wrote:

Both the main loop thread and vCPU threads are allowed to call
pause_all_vcpus(), and in general resume_all_vcpus() is called after
it. Two issues exist in pause_all_vcpus():


In general, calling pause_all_vcpus() from VCPU threads is quite dangerous.

Do we have reproducers for the cases below?



I reproduced the issues by testing the ARM vCPU hotplug feature:
QEMU changes for vCPU hotplug can be cloned from the site below:
   https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2
Guest Kernel changes (by James Morse, ARM) are available here:
   https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git 
virtual_cpu_hotplug/rfc/v2



Thanks for these infos (would be reasonable to include that in the cover 
letter).

Okay, so likely this is not actually a "fix" for upstream as it is. Understood.


The procedure to produce problems:
1. Startup a Linux VM (e.g., called OS-vcpuhotplug) with 32 possible vCPUs and 
16 current vCPUs.
2. Log in guestOS and run script[1] to continuously online/offline CPU.
3. At host side, run script[2] to continuously hotplug/unhotplug vCPU.
After several minutes, we can hit these problems.

Script[1] to online/offline CPU:
for ((time=1;time<1000;time++));
do
  for ((cpu=16;cpu<32;cpu++));
  do
  echo 1 > /sys/devices/system/cpu/cpu$cpu/online
  done

  for ((cpu=16;cpu<32;cpu++));
  do
  echo 0 > /sys/devices/system/cpu/cpu$cpu/online
  done
done

Script[2] to hotplug/unhotplug vCPU:
for ((time=1;time<100;time++));
do
  echo $time
  for ((cpu=16;cpu<=32;cpu++));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done

  for ((cpu=32;cpu>=16;cpu--));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done

  for ((cpu=16;cpu<=32;cpu+=2));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done

  for ((cpu=32;cpu>=16;cpu-=2));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done
done

The script[1] will call PSCI CPU_ON, which is emulated by QEMU and results in 
calling cpu_reset() on the vCPU thread.


I spotted new pause_all_vcpus() / resume_all_vcpus() calls in 
hw/intc/arm_gicv3_kvm.c and
thought they would be the problematic bit.

Yeah, that's going to be problematic. Further note that a lot of code does not 
expect
that the BQL is suddenly dropped.

We had issues with that in different context where we ended up wanting to use 
pause/resume from VCPU context:

https://lore.kernel.org/all/294a987d-b0ef-1b58-98ac-0d4d43075...@redhat.com/

This sounds like a bad idea. Read below.


For the ARM architecture, it needs to reset GICC registers, which is only 
possible when all vcpus are paused. So script[1] will call pause_all_vcpus() 
in a vCPU thread.
The script[2] also calls cpu_reset() for each newly hotplugged vCPU, which is 
done in the main loop thread.
So this scenario causes the problems I state in the commit message.



1. There is a possibility that while thread T1 waits on qemu_pause_cond
with the bql unlocked, another thread has called
pause_all_vcpus() and resume_all_vcpus(); then thread T1 will get stuck,
because the condition all_vcpus_paused() is always false.


How can this happen?

Two threads calling pause_all_vcpus() is borderline broken, as you note.

IIRC, we should call pause_all_vcpus() only if some other mechanism prevents 
these races. For example, based on runstate changes.



We already have the bql to prevent concurrent calling of pause_all_vcpus() and 
resume_all_vcpus(). But pause_all_vcpus() will
unlock the bql halfway through, which gives other threads a chance to call pause 
and resume. In the past, the code did not consider
this problem; now I add a retry mechanism to fix it.


Note that BQL did not prevent concurrent calling of pause_all_vcpus(). There 
had to be something else. Likely that was runstate transitions.





Just imagine one thread calling pause_all_vcpus() while another one
calls resume_all_vcpus(). It cannot possibly work.


With the bql, we can make sure all vcpus are paused after pause_all_vcpus() 
finishes, and all vcpus are resumed after resume_all_vcpus() finishes.

For example, the following situation may occur:
Thread T1: lock bql  ->pau

Re: 答复: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition

2024-03-19 Thread David Hildenbrand

On 19.03.24 06:11, zhukeqian wrote:

Hi David,

On 17.03.24 09:37, Keqian Zhu via wrote:

For a vCPU being hotplugged, qemu_init_vcpu() is called. In this
function, we set the vcpu state as stopped, and then wait for the vcpu
thread to be created.

As the vcpu state is stopped, it will inform us it has been created
and then wait on halt_cond. After we have realized the vcpu object, we
will resume the vcpu thread.

However, while we wait for the vcpu thread to be created, the bql is
unlocked, and another thread is allowed to call resume_all_vcpus(),
which will resume the un-realized vcpu.

This fixes the issue by filtering out un-realized vcpus during
resume_all_vcpus().


Similar question: is there a reproducer?

How could we currently hotplug a VCPU and, while it is being created, see 
pause_all_vcpus()/resume_all_vcpus() getting called?


I described the reason for this at patch 1.


If I am not getting this wrong, there seems to be some other mechanism missing 
that makes sure that this cannot happen. Dropping the BQL half-way through 
creating a VCPU might be the problem.


When we add a retry mechanism in pause_all_vcpus(), we can solve this problem, 
with the semantics unchanged for the user, which means:
with the bql, we can make sure all vcpus are paused after pause_all_vcpus() 
finishes, and all vcpus are resumed after resume_all_vcpus() finishes.


Okay, got it. As I just replied to #1, please see if you can avoid messing 
with pause_all_vcpus() by inhibiting KVM IOCTLs like KVM does. That 
would be preferable.


--
Cheers,

David / dhildenb




Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-19 Thread David Hildenbrand
to lock bql -> lock bql && all_vcpu_paused -> success and do other work -> 
unlock bql
Thread T2: wait T1 unlock bql to lock bql -> lock bql -> resume_all_vcpus -> 
success and do other work -> unlock bql



Now throw in another thread and it all gets really complicated :)

Finding ways to avoid pause_all_vcpus() on the ARM reset code would be 
preferable.

I guess you simply want to do something similar to what KVM does to avoid 
messing
with pause_all_vcpus(): inhibiting certain IOCTLs.


commit f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39
Author: David Hildenbrand 
Date:   Fri Nov 11 10:47:58 2022 -0500

kvm: Atomic memslot updates

If we update an existing memslot (e.g., resize, split), we temporarily

remove the memslot to re-add it immediately afterwards. These updates
are not atomic, especially not for KVM VCPU threads, such that we can
get spurious faults.

Let's inhibit most KVM ioctls while performing relevant updates, such

that we can perform the update just as if it would happen atomically
without additional kernel support.

We capture the add/del changes and apply them in the notifier commit

stage instead. There, we can check for overlaps and perform the ioctl
inhibiting only if really required (-> overlap).

To keep things simple we don't perform additional checks that wouldn't

actually result in an overlap -- such as !RAM memory regions in some
cases (see kvm_set_phys_mem()).

To minimize cache-line bouncing, use a separate indicator

(in_ioctl_lock) per CPU.  Also, make sure to hold the kvm_slots_lock
while performing both actions (removing+re-adding).

We have to wait until all IOCTLs were exited and block new ones from

getting executed.

This approach cannot result in a deadlock as long as the inhibitor does

not hold any locks that might hinder an IOCTL from getting finished and
exited - something fairly unusual. The inhibitor will always hold the BQL.

AFAIKs, one possible candidate would be userfaultfd. If a page cannot be

placed (e.g., during postcopy), because we're waiting for a lock, or if the
userfaultfd thread cannot process a fault, because it is waiting for a
lock, there could be a deadlock. However, the BQL is not applicable here,
because any other guest memory access while holding the BQL would already
result in a deadlock.

Nothing else in the kernel should block forever and wait for userspace

intervention.

Note: pause_all_vcpus()/resume_all_vcpus() or

start_exclusive()/end_exclusive() cannot be used, as they either drop
the BQL or require to be called without the BQL - something inhibitors
cannot handle. We need a low-level locking mechanism that is
deadlock-free even when not releasing the BQL.

--
Cheers,

David / dhildenb




Re: [PATCH v2 1/2] target/s390x: Use mutable temporary value for op_ts

2024-03-18 Thread David Hildenbrand

On 18.03.24 21:27, Ilya Leoshkevich wrote:

From: Ido Plat 

Otherwise TCG would assume the register that holds t1 is constant
and reuse it whenever it needs the value within it.

Cc: qemu-sta...@nongnu.org
Fixes: f1ea739bd598 ("target/s390x: Use tcg_constant_* in local contexts")
Reviewed-by: Ilya Leoshkevich 
Reviewed-by: Richard Henderson 
[iii: Adjust a newline and capitalization, add tags]
Signed-off-by: Ido Plat 
Signed-off-by: Ilya Leoshkevich 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition

2024-03-18 Thread David Hildenbrand

On 17.03.24 09:37, Keqian Zhu via wrote:

For a vCPU being hotplugged, qemu_init_vcpu() is called. In this
function, we set the vcpu state as stopped, and then wait for the
vcpu thread to be created.

As the vcpu state is stopped, it will inform us it has been created
and then wait on halt_cond. After we have realized the vcpu object,
we will resume the vcpu thread.

However, while we wait for the vcpu thread to be created, the bql is
unlocked, and another thread is allowed to call resume_all_vcpus(),
which will resume the un-realized vcpu.

This fixes the issue by filtering out un-realized vcpus during
resume_all_vcpus().


Similar question: is there a reproducer?

How could we currently hotplug a VCPU and, while it is being created, 
see pause_all_vcpus()/resume_all_vcpus() getting called?


If I am not getting this wrong, there seems to be some other mechanism 
missing that makes sure that this cannot happen. Dropping the BQL 
half-way through creating a VCPU might be the problem.


--
Cheers,

David / dhildenb




Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-18 Thread David Hildenbrand

On 17.03.24 09:37, Keqian Zhu via wrote:

Both the main loop thread and vCPU threads are allowed to call
pause_all_vcpus(), and in general resume_all_vcpus() is called
after it. Two issues exist in pause_all_vcpus():


In general, calling pause_all_vcpus() from VCPU threads is quite dangerous.

Do we have reproducers for the cases below?



1. There is a possibility that while thread T1 waits on
qemu_pause_cond with the bql unlocked, another thread has called
pause_all_vcpus() and resume_all_vcpus(); then thread T1 will
get stuck, because the condition all_vcpus_paused() is always false.


How can this happen?

Two threads calling pause_all_vcpus() is borderline broken, as you note.

IIRC, we should call pause_all_vcpus() only if some other mechanism 
prevents these races. For example, based on runstate changes.



Just imagine one thread calling pause_all_vcpus() while another one 
calls resume_all_vcpus(). It cannot possibly work.




2. After all_vcpus_paused() has been checked as true, we will
unlock the bql to relock replay_mutex. While the bql was unlocked,
the vcpu's state may have been changed by another thread, so we
must retry.

Signed-off-by: Keqian Zhu 
---
  system/cpus.c | 29 -
  1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..4e41abe23e 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -571,12 +571,14 @@ static bool all_vcpus_paused(void)
  return true;
  }
  
-void pause_all_vcpus(void)

+static void request_pause_all_vcpus(void)
  {
  CPUState *cpu;
  
-qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);

  CPU_FOREACH(cpu) {
+if (cpu->stopped) {
+continue;
+}
  if (qemu_cpu_is_self(cpu)) {
  qemu_cpu_stop(cpu, true);
  } else {
@@ -584,6 +586,14 @@ void pause_all_vcpus(void)
  qemu_cpu_kick(cpu);
  }
  }
+}
+
+void pause_all_vcpus(void)
+{
+qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
+
+retry:
+request_pause_all_vcpus();
  
  /* We need to drop the replay_lock so any vCPU threads woken up

   * can finish their replay tasks
@@ -592,14 +602,23 @@ void pause_all_vcpus(void)
  
  while (!all_vcpus_paused()) {

  qemu_cond_wait(&qemu_pause_cond, &bql);
-CPU_FOREACH(cpu) {
-qemu_cpu_kick(cpu);
-}
+/* While we waited on qemu_pause_cond the bql was unlocked, so
+ * the vcpu's state may have been changed by another thread, and
+ * we must request the pause state on all vcpus again.
+ */
+request_pause_all_vcpus();
  }
  
  bql_unlock();

  replay_mutex_lock();
  bql_lock();
+
+/* While the bql was unlocked, the vcpu's state may have been
+ * changed by another thread, so we must retry.
+ */
+if (!all_vcpus_paused()) {
+goto retry;
+}
  }
  
  void cpu_resume(CPUState *cpu)


--
Cheers,

David / dhildenb




Re: [PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code

2024-03-12 Thread David Hildenbrand

On 11.03.24 21:03, Mario Casquero wrote:

This series has been successfully tested by QE. Start the
qemu-storage-daemon in the background with a rhel 9.5 image and
vhost-user-blk. After that, boot up a VM with virtio-mem and
vhost-user-blk-pci. Check with the HMP command 'info mtree' that
virtio-mem is making use of multiple memslots.

Tested-by: Mario Casquero 


Thanks Mario!

--
Cheers,

David / dhildenb




Re: [PATCH v2 0/4] physmem: Fix MemoryRegion for second access to cached MMIO Address Space

2024-03-08 Thread David Hildenbrand

On 08.03.24 08:36, Peter Xu wrote:

On Thu, Mar 07, 2024 at 03:37:06PM +, Jonathan Cameron wrote:

v2: (Thanks to Peter Xu for reviewing!)
- New patch 1 to rename addr1 to mr_addr in the interests of meaningful naming.
- Take advantage of a cached address space allowing only a single MR to 
simplify
   the new code.
- Various cleanups of indentation etc.
- Cover letter and some patch descriptions updated to reflect changes.
- Changes all called out in specific patches.


All look good to me, thanks.  Having the new functions' first argument as
MemTxAttrs is slightly weird to me, but that's not a big deal and we can
clean it up later if wanted.  I guess it's good to fix this in 9.0 first as
it's a real bug even if not trivial to hit.

I queued it in my migration tree (with my "memory API" hat..).

I won't send a pull until next Monday.  Please shoot if there's any objections!



Works for me, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v2 4/4] physmem: Fix wrong address in large address_space_read/write_cached_slow()

2024-03-08 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

If the access is bigger than the MemoryRegion supports,
flatview_read/write_continue() will attempt to update the MemoryRegion,
but the address passed to flatview_translate() is relative to the cache, not
to the FlatView.

On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this
led to the first part of the descriptor being read from the CXL memory and the
second part from PA 0x8 which happens to be a blank region
of a flash chip and all ffs on this particular configuration.
Note this test requires the out of tree ARM support for CXL, but
the problem is more general.

Avoid this by adding new address_space_read_continue_cached()
and address_space_write_continue_cached() which share all the logic
with the flatview versions except for the MemoryRegion lookup which
is unnecessary as the MemoryRegionCache only covers one MemoryRegion.

Signed-off-by: Jonathan Cameron 
---
v2: Review from Peter Xu
- Drop additional lookups of the MemoryRegion via
address_space_translate_cached() as it will always return the same
answer.
- Drop various parameters that are then unused.
- rename addr1 to mr_addr.
- Drop a fuzz_dma_read_cb(). Could put this back but it means
   carrying the address into the inner call and the only in tree
   fuzzer checks if it is normal RAM and if not does nothing anyway.
   We don't hit this path for normal RAM.
---
  system/physmem.c | 63 +++-
  1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 1264eab24b..701bea27dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3381,6 +3381,59 @@ static inline MemoryRegion 
*address_space_translate_cached(
  return section.mr;
  }
  
+/* Called within RCU critical section.  */

+static MemTxResult address_space_write_continue_cached(MemTxAttrs attrs,
+   const void *ptr,
+   hwaddr len,
+   hwaddr mr_addr,
+   hwaddr l,
+   MemoryRegion *mr)


The only thing that is really confusing is

hwaddr len,
...
hwaddr l,

ehm, what?

... but it fits the style of flatview_read_continue(), so at least the 
level of confusion this creates is consistent with the other code.



Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 3/4] physmem: Factor out body of flatview_read/write_continue() loop

2024-03-07 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

This code will be reused for the address_space_cached accessors
shortly.

Also reduce scope of result variable now we aren't directly
calling this in the loop.

Signed-off-by: Jonathan Cameron 
---
v2: Thanks to Peter Xu
- Fix alignment of code.
- Drop unused addr parameter.
- Carry through new mr_addr parameter name.
- RB not picked up as not sure what Peter will think wrt to
   resulting parameter ordering.
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 2/4] physmem: Reduce local variable scope in flatview_read/write_continue()

2024-03-07 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

Precursor to factoring out the inner loops for reuse.

Reviewed-by: Peter Xu 
Signed-off-by: Jonathan Cameron 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 1/4] physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar

2024-03-07 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

The calls to flatview_read/write[_continue]() have parameters addr and
addr1 but the names give no indication of what they are addresses of.
Rename addr1 to mr_addr to reflect that it is the translated address
offset within the MemoryRegion returned by flatview_translate().
Similarly rename the parameter in address_space_read/write_cached_slow()

Suggested-by: Peter Xu 
Signed-off-by: Jonathan Cameron 



Much cleaner indeed!

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH] qapi: Fix format of the memory-backend-file's @rom property doc comment

2024-02-29 Thread David Hildenbrand

On 29.02.24 11:58, Stefano Garzarella wrote:

Reflow paragraph following commit a937b6aa73 ("qapi: Reformat doc
comments to conform to current conventions"): use 4 spaces indentation,
70 columns width, and two spaces to separate sentences.

Suggested-by: Markus Armbruster 
Signed-off-by: Stefano Garzarella 
---
  qapi/qom.json | 27 ++-
  1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 2a6e49365a..db1b0fdea2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -668,19 +668,20 @@
  # @readonly: if true, the backing file is opened read-only; if false,
  # it is opened read-write.  (default: false)
  #
-# @rom: whether to create Read Only Memory (ROM) that cannot be modified
-#   by the VM.  Any write attempts to such ROM will be denied.  Most
-#   use cases want writable RAM instead of ROM.  However, selected use
-#   cases, like R/O NVDIMMs, can benefit from ROM.  If set to 'on',
-#   create ROM; if set to 'off', create writable RAM;  if set to
-#   'auto', the value of the @readonly property is used.  This
-#   property is primarily helpful when we want to have proper RAM in
-#   configurations that would traditionally create ROM before this
-#   property was introduced: VM templating, where we want to open a
-#   file readonly (@readonly set to true) and mark the memory to be
-#   private for QEMU (@share set to false).  For this use case, we need
-#   writable RAM instead of ROM, and want to set this property to 'off'.
-#   (default: auto, since 8.2)
+# @rom: whether to create Read Only Memory (ROM) that cannot be
+# modified by the VM.  Any write attempts to such ROM will be
+# denied.  Most use cases want writable RAM instead of ROM.
+# However, selected use cases, like R/O NVDIMMs, can benefit from
+# ROM.  If set to 'on', create ROM; if set to 'off', create
+# writable RAM; if set to 'auto', the value of the @readonly
+# property is used.  This property is primarily helpful when we
+# want to have proper RAM in configurations that would
+# traditionally create ROM before this property was introduced: VM
+# templating, where we want to open a file readonly (@readonly set
+# to true) and mark the memory to be private for QEMU (@share set
+# to false).  For this use case, we need writable RAM instead of
+# ROM, and want to set this property to 'off'.  (default: auto,
+# since 8.2)
  #
  # Since: 2.1
  ##


Ideally, we'd have a format checker that complains like checkpatch 
usually would.


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 9/9] hostmem-file: support POSIX shm_open()

2024-02-28 Thread David Hildenbrand

On 28.02.24 12:47, Stefano Garzarella wrote:

Add a new `shm` bool option for `-object memory-backend-file`.

When this option is set to true, the POSIX shm_open(3) is used instead
of open(2).

So a file will not be created in the filesystem, but a "POSIX shared
memory object" will be instantiated. In Linux this turns into a file
in /dev/shm, but in other OSes this may not happen (for example in
macOS or FreeBSD nothing is shown in any filesystem).

This new feature is useful when we need to share guest memory with
another process (e.g. vhost-user backend), but we don't have
memfd_create() or any special filesystems (e.g. /dev/shm) available
as in macOS.

Signed-off-by: Stefano Garzarella 
---
I am not sure this is the best way to support shm_open() in QEMU.

Other solutions I had in mind were:

- create a new memory-backend-shm

- extend memory-backend-memfd to use shm_open() on systems where memfd is
not available (problem: shm_open wants a name to assign to the object, but
we can do a workaround by using a random name and do the unlink right away)

Any preference/suggestion?


Both sound like reasonable options, and IMHO better than hostmem-file 
with things that are not necessarily "real" files.


Regarding memory-backend-memfd, we similarly have to pass a name to 
memfd_create(), although for different purpose: "The  name  supplied in 
name is used as a filename and will be displayed as the target of the 
corresponding symbolic link in the directory /proc/self/fd/".


So we simply pass TYPE_MEMORY_BACKEND_MEMFD.

Likely, memory-backend-shm that directly maps to shm_open() and only 
provides properties reasonable for shm_open() is the cleanest approach. 
So that would currently be my preference :)


--
Cheers,

David / dhildenb




Re: [PATCH v4 33/66] i386/tdx: Make memory type private by default

2024-02-20 Thread David Hildenbrand

On 29.01.24 03:18, Xiaoyao Li wrote:

On 1/26/2024 10:58 PM, David Hildenbrand wrote:

On 25.01.24 04:22, Xiaoyao Li wrote:

By default (due to the recent UPM change), the restricted memory attribute is
shared.  Convert the memory region from shared to private at memory
slot creation time.

Add a KVM region registering function to check the flag
and convert the region, and add a memory listener to TDX guest code to set
the flag on the possible memory regions.

Without this patch
- Secure-EPT violation on private area
- KVM_MEMORY_FAULT EXIT (kvm -> qemu)
- qemu converts the 4K page from shared to private
- Resume VCPU execution
- Secure-EPT violation again
- KVM resolves EPT Violation
This also prevents huge pages because page conversion is done at 4K
granularity.  Although it's possible to merge 4K private mappings into a
2M large page, it slows guest boot.

With this patch
- After memory slot creation, convert the region from shared to private
- Secure-EPT violation on private area.
- KVM resolves EPT Violation

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
   include/exec/memory.h |  1 +
   target/i386/kvm/tdx.c | 20 
   2 files changed, 21 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7229fcc0415f..f25959f6d30f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -850,6 +850,7 @@ struct IOMMUMemoryRegion {
   #define MEMORY_LISTENER_PRIORITY_MIN    0
   #define MEMORY_LISTENER_PRIORITY_ACCEL  10
   #define MEMORY_LISTENER_PRIORITY_DEV_BACKEND    10
+#define MEMORY_LISTENER_PRIORITY_ACCEL_HIGH 20
   /**
    * struct MemoryListener: callbacks structure for updates to the
physical memory map
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7b250d80bc1d..f892551821ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -19,6 +19,7 @@
   #include "standard-headers/asm-x86/kvm_para.h"
   #include "sysemu/kvm.h"
   #include "sysemu/sysemu.h"
+#include "exec/address-spaces.h"
   #include "hw/i386/x86.h"
   #include "kvm_i386.h"
@@ -621,6 +622,19 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
   return 0;
   }
+static void tdx_guest_region_add(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+    memory_region_set_default_private(section->mr);
+}


That looks fishy. Why is TDX to decide what happens to other memory
regions it doesn't own?

We should define that behavior when creating these memory region, and
TDX could sanity check that they have been setup properly.

Let me ask differently: For which memory region where we have
RAM_GUEST_MEMFD set would we *not* want to set private as default right
from the start?


All memory regions that have RAM_GUEST_MEMFD set will benefit from being
private as default, for a TDX guest.

I will update the implementation to set RAM_DEFAULT_PRIVATE flag when
guest_memfd is created successfully, like

diff --git a/system/physmem.c b/system/physmem.c
index fc59470191ef..60676689c807 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1850,6 +1850,8 @@ static void ram_block_add(RAMBlock *new_block,
Error **errp)
   qemu_mutex_unlock_ramlist();
   return;
   }
+
+new_block->flags |= RAM_DEFAULT_PRIVATE;
   }

then this patch can be dropped, and the calling of
memory_region_set_default_private(mr) of Patch 45 can be dropped too.

I think this is what you suggested, right?


Yes, if that works, great!

--
Cheers,

David / dhildenb




Re: [PATCH] system/physmem: remove redundant arg reassignment

2024-02-15 Thread David Hildenbrand

On 15.02.24 10:15, Manos Pitsidianakis wrote:

Arguments `ram_block` are reassigned to local declarations `block`
without further use. Remove re-assignment to reduce noise.

Signed-off-by: Manos Pitsidianakis 
---
  system/physmem.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 5e66d9ae36..d4c3bfac65 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2154,10 +2154,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
   *
   * Called within RCU critical section.
   */
-void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr)
+void *qemu_map_ram_ptr(RAMBlock *block, ram_addr_t addr)
  {
-RAMBlock *block = ram_block;
-
  if (block == NULL) {
  block = qemu_get_ram_block(addr);
  addr -= block->offset;
@@ -2182,10 +2180,9 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t 
addr)
   *
   * Called within RCU critical section.
   */
-static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr,
+static void *qemu_ram_ptr_length(RAMBlock *block, ram_addr_t addr,
   hwaddr *size, bool lock)
  {
-RAMBlock *block = ram_block;
  if (*size == 0) {
  return NULL;
  }

base-commit: 5767815218efd3cbfd409505ed824d5f356044ae


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




[PATCH v2 06/14] libvhost-user: No need to check for NULL when unmapping

2024-02-14 Thread David Hildenbrand
We never add a memory region if mmap() failed. Therefore, no need to check
for NULL.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index f43b5096d0..225283f764 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -247,11 +247,8 @@ vu_remove_all_mem_regs(VuDev *dev)
 
 for (i = 0; i < dev->nregions; i++) {
 VuDevRegion *r = &dev->regions[i];
-void *ma = (void *)(uintptr_t)r->mmap_addr;
 
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 }
 dev->nregions = 0;
 }
@@ -888,11 +885,8 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 for (i = 0; i < dev->nregions; i++) {
 if (reg_equal(&dev->regions[i], msg_region)) {
 VuDevRegion *r = &dev->regions[i];
-void *ma = (void *) (uintptr_t) r->mmap_addr;
 
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 
 /*
  * Shift all affected entries by 1 to close the hole at index i and
-- 
2.43.0




[PATCH v2 04/14] libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()

2024-02-14 Thread David Hildenbrand
Let's reduce some code duplication and prepare for further changes.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 119 +++---
 1 file changed, 39 insertions(+), 80 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index e4907dfc26..a7bd7de3cd 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -937,95 +937,23 @@ vu_get_shared_object(VuDev *dev, VhostUserMsg *vmsg)
 }
 
 static bool
-vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
+vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
-unsigned int i;
 VhostUserMemory m = vmsg->payload.memory, *memory = &m;
-dev->nregions = memory->nregions;
-
-DPRINT("Nregions: %u\n", memory->nregions);
-for (i = 0; i < dev->nregions; i++) {
-void *mmap_addr;
-VhostUserMemoryRegion *msg_region = &memory->regions[i];
-VuDevRegion *dev_region = &dev->regions[i];
-
-DPRINT("Region %d\n", i);
-DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
-   msg_region->guest_phys_addr);
-DPRINT("memory_size: 0x%016"PRIx64"\n",
-   msg_region->memory_size);
-DPRINT("userspace_addr   0x%016"PRIx64"\n",
-   msg_region->userspace_addr);
-DPRINT("mmap_offset  0x%016"PRIx64"\n",
-   msg_region->mmap_offset);
-
-dev_region->gpa = msg_region->guest_phys_addr;
-dev_region->size = msg_region->memory_size;
-dev_region->qva = msg_region->userspace_addr;
-dev_region->mmap_offset = msg_region->mmap_offset;
+int prot = PROT_READ | PROT_WRITE;
+unsigned int i;
 
-/* We don't use offset argument of mmap() since the
- * mapped address has to be page aligned, and we use huge
- * pages.
+if (dev->postcopy_listening) {
+/*
  * In postcopy we're using PROT_NONE here to catch anyone
  * accessing it before we userfault
  */
-mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_NONE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);
-
-if (mmap_addr == MAP_FAILED) {
-vu_panic(dev, "region mmap error: %s", strerror(errno));
-} else {
-dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
-DPRINT("mmap_addr:   0x%016"PRIx64"\n",
-   dev_region->mmap_addr);
-}
-
-/* Return the address to QEMU so that it can translate the ufd
- * fault addresses back.
- */
-msg_region->userspace_addr = (uintptr_t)(mmap_addr +
- dev_region->mmap_offset);
-close(vmsg->fds[i]);
-}
-
-/* Send the message back to qemu with the addresses filled in */
-vmsg->fd_num = 0;
-if (!vu_send_reply(dev, dev->sock, vmsg)) {
-vu_panic(dev, "failed to respond to set-mem-table for postcopy");
-return false;
-}
-
-/* Wait for QEMU to confirm that it's registered the handler for the
- * faults.
- */
-if (!dev->read_msg(dev, dev->sock, vmsg) ||
-vmsg->size != sizeof(vmsg->payload.u64) ||
-vmsg->payload.u64 != 0) {
-vu_panic(dev, "failed to receive valid ack for postcopy 
set-mem-table");
-return false;
+prot = PROT_NONE;
 }
 
-/* OK, now we can go and register the memory and generate faults */
-(void)generate_faults(dev);
-
-return false;
-}
-
-static bool
-vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
-{
-unsigned int i;
-VhostUserMemory m = vmsg->payload.memory, *memory = &m;
-
 vu_remove_all_mem_regs(dev);
 dev->nregions = memory->nregions;
 
-if (dev->postcopy_listening) {
-return vu_set_mem_table_exec_postcopy(dev, vmsg);
-}
-
 DPRINT("Nregions: %u\n", memory->nregions);
 for (i = 0; i < dev->nregions; i++) {
 void *mmap_addr;
@@ -1051,8 +979,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
  * mapped address has to be page aligned, and we use huge
  * pages.  */
 mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);
+ prot, MAP_SHARED | MAP_NORESERVE, vmsg->fds[i], 0);
 
 if (mmap_addr == MAP_FAILED) {
 vu_panic(dev, "region mmap error: %s", strerror

[PATCH v2 09/14] libvhost-user: Factor out search for memory region by GPA and simplify

2024-02-14 Thread David Hildenbrand
Memory regions cannot overlap, and if we ever hit that case something
would be really flawed.

For example, when vhost code in QEMU decides to increase the size of memory
regions to cover full huge pages, it makes sure to never create overlaps,
and if there would be overlaps, it would bail out.

QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary
list"), c1ece84e7c93 ("vhost: Huge page align and merge") and
e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that
handling and how overlaps are impossible.

Consequently, each GPA can belong to at most one memory region, and
everything else doesn't make sense. Let's factor out our search to prepare
for further changes.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 79 +--
 1 file changed, 45 insertions(+), 34 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 7f29e01c30..d72f25396d 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -195,30 +195,47 @@ vu_panic(VuDev *dev, const char *msg, ...)
  */
 }
 
+/* Search for a memory region that covers this guest physical address. */
+static VuDevRegion *
+vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
+{
+unsigned int i;
+
+/*
+ * Memory regions cannot overlap in guest physical address space. Each
+ * GPA belongs to exactly one memory region, so there can only be one
+ * match.
+ */
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *cur = &dev->regions[i];
+
+if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
+return cur;
+}
+}
+return NULL;
+}
+
 /* Translate guest physical address to our virtual address.  */
 void *
 vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr)
 {
-unsigned int i;
+VuDevRegion *r;
 
 if (*plen == 0) {
 return NULL;
 }
 
-/* Find matching memory region.  */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-
-if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) {
-if ((guest_addr + *plen) > (r->gpa + r->size)) {
-*plen = r->gpa + r->size - guest_addr;
-}
-return (void *)(uintptr_t)
-guest_addr - r->gpa + r->mmap_addr + r->mmap_offset;
-}
+r = vu_gpa_to_mem_region(dev, guest_addr);
+if (!r) {
+return NULL;
 }
 
-return NULL;
+if ((guest_addr + *plen) > (r->gpa + r->size)) {
+*plen = r->gpa + r->size - guest_addr;
+}
+return (void *)(uintptr_t)guest_addr - r->gpa + r->mmap_addr +
+   r->mmap_offset;
 }
 
 /* Translate qemu virtual address to our virtual address.  */
@@ -854,8 +871,8 @@ static inline bool reg_equal(VuDevRegion *vudev_reg,
 static bool
 vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
-unsigned int i;
-bool found = false;
+unsigned int idx;
+VuDevRegion *r;
 
 if (vmsg->fd_num > 1) {
 vmsg_close_fds(vmsg);
@@ -882,28 +899,22 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 DPRINT("mmap_offset  0x%016"PRIx64"\n",
msg_region->mmap_offset);
 
-for (i = 0; i < dev->nregions; i++) {
-if (reg_equal(&dev->regions[i], msg_region)) {
-VuDevRegion *r = &dev->regions[i];
-
-munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
-
-/* Shift all affected entries by 1 to close the hole at index. */
-memmove(dev->regions + i, dev->regions + i + 1,
-sizeof(VuDevRegion) * (dev->nregions - i - 1));
-DPRINT("Successfully removed a region\n");
-dev->nregions--;
-i--;
-
-found = true;
-break;
-}
-}
-
-if (!found) {
+r = vu_gpa_to_mem_region(dev, msg_region->guest_phys_addr);
+if (!r || !reg_equal(r, msg_region)) {
+vmsg_close_fds(vmsg);
 vu_panic(dev, "Specified region not found\n");
+return false;
 }
 
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
+
+idx = r - dev->regions;
+assert(idx < dev->nregions);
+/* Shift all affected entries by 1 to close the hole. */
+memmove(r, r + 1, sizeof(VuDevRegion) * (dev->nregions - idx - 1));
+DPRINT("Successfully removed a region\n");
+dev->nregions--;
+
 vmsg_close_fds(vmsg);
 
 return false;
-- 
2.43.0




[PATCH v2 07/14] libvhost-user: Don't zero out memory for memory regions

2024-02-14 Thread David Hildenbrand
dev->nregions always covers only valid entries. Stop zeroing out other
array elements that are unused.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 225283f764..2e8b611385 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -888,13 +888,9 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 
 munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 
-/*
- * Shift all affected entries by 1 to close the hole at index i and
- * zero out the last entry.
- */
+/* Shift all affected entries by 1 to close the hole at index. */
 memmove(dev->regions + i, dev->regions + i + 1,
 sizeof(VuDevRegion) * (dev->nregions - i - 1));
-memset(dev->regions + dev->nregions - 1, 0, sizeof(VuDevRegion));
 DPRINT("Successfully removed a region\n");
 dev->nregions--;
 i--;
@@ -2119,7 +2115,6 @@ vu_init(VuDev *dev,
 DPRINT("%s: failed to malloc mem regions\n", __func__);
 return false;
 }
-memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * 
sizeof(dev->regions[0]));
 
 dev->vq = malloc(max_queues * sizeof(dev->vq[0]));
 if (!dev->vq) {
-- 
2.43.0




[PATCH v2 01/14] libvhost-user: Dynamically allocate memory for memory slots

2024-02-14 Thread David Hildenbrand
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically
allocating dev->regions. We don't have any ABI guarantees (not
dynamically linked), so we can simply change the layout of VuDev.

Let's zero out the memory, just as we used to do.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 11 +++
 subprojects/libvhost-user/libvhost-user.h |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a3b158c671..360c5366d6 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -2171,6 +2171,8 @@ vu_deinit(VuDev *dev)
 
 free(dev->vq);
 dev->vq = NULL;
+free(dev->regions);
+dev->regions = NULL;
 }
 
 bool
@@ -2205,9 +2207,18 @@ vu_init(VuDev *dev,
 dev->backend_fd = -1;
 dev->max_queues = max_queues;
 
+dev->regions = malloc(VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0]));
+if (!dev->regions) {
+DPRINT("%s: failed to malloc mem regions\n", __func__);
+return false;
+}
+memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * 
sizeof(dev->regions[0]));
+
 dev->vq = malloc(max_queues * sizeof(dev->vq[0]));
 if (!dev->vq) {
 DPRINT("%s: failed to malloc virtqueues\n", __func__);
+free(dev->regions);
+dev->regions = NULL;
 return false;
 }
 
diff --git a/subprojects/libvhost-user/libvhost-user.h 
b/subprojects/libvhost-user/libvhost-user.h
index c2352904f0..c882b4e3a2 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -398,7 +398,7 @@ typedef struct VuDevInflightInfo {
 struct VuDev {
 int sock;
 uint32_t nregions;
-VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS];
+VuDevRegion *regions;
 VuVirtq *vq;
 VuDevInflightInfo inflight_info;
 int log_call_fd;
-- 
2.43.0




[PATCH v2 10/14] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()

2024-02-14 Thread David Hildenbrand
Let's speed up GPA to memory region / virtual address lookup. Store the
memory regions ordered by guest physical addresses, and use binary
search for address translation, as well as when adding/removing memory
regions.

Most importantly, this will speed up GPA->VA address translation when we
have many memslots.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 49 +--
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index d72f25396d..ef6353d847 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...)
 static VuDevRegion *
 vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
 {
-unsigned int i;
+int low = 0;
+int high = dev->nregions - 1;
 
 /*
  * Memory regions cannot overlap in guest physical address space. Each
  * GPA belongs to exactly one memory region, so there can only be one
  * match.
+ *
+ * We store our memory regions ordered by GPA and can simply perform a
+ * binary search.
  */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *cur = &dev->regions[i];
+while (low <= high) {
+unsigned int mid = low + (high - low) / 2;
+VuDevRegion *cur = &dev->regions[mid];
 
 if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
 return cur;
 }
+if (guest_addr >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (guest_addr < cur->gpa) {
+high = mid - 1;
+}
 }
 return NULL;
 }
@@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev)
 static void
 _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 {
+const uint64_t start_gpa = msg_region->guest_phys_addr;
+const uint64_t end_gpa = start_gpa + msg_region->memory_size;
 int prot = PROT_READ | PROT_WRITE;
 VuDevRegion *r;
 void *mmap_addr;
+int low = 0;
+int high = dev->nregions - 1;
+unsigned int idx;
 
 DPRINT("Adding region %d\n", dev->nregions);
 DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
@@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 prot = PROT_NONE;
 }
 
+/*
+ * We will add memory regions into the array sorted by GPA. Perform a
+ * binary search to locate the insertion point: it will be at the low
+ * index.
+ */
+while (low <= high) {
+unsigned int mid = low + (high - low)  / 2;
+VuDevRegion *cur = &dev->regions[mid];
+
+/* Overlap of GPA addresses. */
+if (start_gpa < cur->gpa + cur->size && cur->gpa < end_gpa) {
+vu_panic(dev, "regions with overlapping guest physical addresses");
+return;
+}
+if (start_gpa >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (start_gpa < cur->gpa) {
+high = mid - 1;
+}
+}
+idx = low;
+
 /*
  * We don't use offset argument of mmap() since the mapped address has
  * to be page aligned, and we use huge pages.
@@ -308,7 +347,9 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 DPRINT("mmap_addr:   0x%016"PRIx64"\n",
(uint64_t)(uintptr_t)mmap_addr);
 
-r = &dev->regions[dev->nregions];
+/* Shift all affected entries by 1 to open a hole at idx. */
+r = &dev->regions[idx];
+memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx));
 r->gpa = msg_region->guest_phys_addr;
 r->size = msg_region->memory_size;
 r->qva = msg_region->userspace_addr;
-- 
2.43.0




[PATCH v2 13/14] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions

2024-02-14 Thread David Hildenbrand
Currently, we try to remap all rings whenever we add a single new memory
region. That doesn't quite make sense, because we already map rings when
setting the ring address, and panic if that goes wrong. Likely, that
handling was simply copied from set_mem_table code, where we actually
have to remap all rings.

Remapping all rings might require us to walk quite a lot of memory
regions to perform the address translations. Ideally, we'd simply remove
that remapping.

However, let's be a bit careful. There might be some weird corner cases
where we temporarily remove a single memory region (e.g., to resize
it), which would have worked so far. Further, a ring might be located on
hotplugged memory, and as the VM reboots, we might unplug that memory and
hotplug memory again before resetting the ring addresses.

So let's unmap affected rings as we remove a memory region, and try
dynamically mapping the ring again when required.

Acked-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 107 --
 1 file changed, 78 insertions(+), 29 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index ed0a978d4f..61fb3050b3 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static bool
+map_ring(VuDev *dev, VuVirtq *vq)
+{
+vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
+vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
+vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
+
+DPRINT("Setting virtq addresses:\n");
+DPRINT("vring_desc  at %p\n", vq->vring.desc);
+DPRINT("vring_used  at %p\n", vq->vring.used);
+DPRINT("vring_avail at %p\n", vq->vring.avail);
+
+return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
 static bool
 vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
 {
-return likely(!dev->broken) && likely(vq->vring.avail);
+if (unlikely(dev->broken)) {
+return false;
+}
+
+if (likely(vq->vring.avail)) {
+return true;
+}
+
+/*
+ * In corner cases, we might temporarily remove a memory region that
+ * mapped a ring. When removing a memory region we make sure to
+ * unmap any rings that would be impacted. Let's try to remap if we
+ * already succeeded mapping this ring once.
+ */
+if (!vq->vra.desc_user_addr || !vq->vra.used_user_addr ||
+!vq->vra.avail_user_addr) {
+return false;
+}
+if (map_ring(dev, vq)) {
+vu_panic(dev, "remapping queue on access");
+return false;
+}
+return true;
+}
+
+static void
+unmap_rings(VuDev *dev, VuDevRegion *r)
+{
+int i;
+
+for (i = 0; i < dev->max_queues; i++) {
+VuVirtq *vq = &dev->vq[i];
+const uintptr_t desc = (uintptr_t)vq->vring.desc;
+const uintptr_t used = (uintptr_t)vq->vring.used;
+const uintptr_t avail = (uintptr_t)vq->vring.avail;
+
+if (desc < r->mmap_addr || desc >= r->mmap_addr + r->size) {
+continue;
+}
+if (used < r->mmap_addr || used >= r->mmap_addr + r->size) {
+continue;
+}
+if (avail < r->mmap_addr || avail >= r->mmap_addr + r->size) {
+continue;
+}
+
+DPRINT("Unmapping rings of queue %d\n", i);
+vq->vring.desc = NULL;
+vq->vring.used = NULL;
+vq->vring.avail = NULL;
+}
 }
 
 static size_t
@@ -786,21 +851,6 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg)
 return false;
 }
 
-static bool
-map_ring(VuDev *dev, VuVirtq *vq)
-{
-vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
-vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
-vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
-
-DPRINT("Setting virtq addresses:\n");
-DPRINT("vring_desc  at %p\n", vq->vring.desc);
-DPRINT("vring_used  at %p\n", vq->vring.used);
-DPRINT("vring_avail at %p\n", vq->vring.avail);
-
-return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
-}
-
 static bool
 generate_faults(VuDev *dev) {
 unsigned int i;
@@ -884,7 +934,6 @@ generate_faults(VuDev *dev) {
 
 static bool
 vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
-int i;
VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
 
 if (vmsg->fd_num != 1) {
@@ -930,19 +979,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 vmsg->fd_num = 0;
 DPRINT("Successf

[PATCH v2 05/14] libvhost-user: Factor out adding a memory region

2024-02-14 Thread David Hildenbrand
Let's factor it out, reducing quite some code duplication and preparing
for further changes.

If we fail to mmap a region and panic, we now simply don't add that
(broken) region.

Note that we now increment dev->nregions only as we successfully
add memory regions, and don't increment dev->nregions if anything goes
wrong.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 168 --
 1 file changed, 60 insertions(+), 108 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a7bd7de3cd..f43b5096d0 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -256,6 +256,61 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static void
+_vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
+{
+int prot = PROT_READ | PROT_WRITE;
+VuDevRegion *r;
+void *mmap_addr;
+
+DPRINT("Adding region %d\n", dev->nregions);
+DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
+   msg_region->guest_phys_addr);
+DPRINT("memory_size: 0x%016"PRIx64"\n",
+   msg_region->memory_size);
+DPRINT("userspace_addr:  0x%016"PRIx64"\n",
+   msg_region->userspace_addr);
+DPRINT("mmap_offset: 0x%016"PRIx64"\n",
+   msg_region->mmap_offset);
+
+if (dev->postcopy_listening) {
+/*
+ * In postcopy we're using PROT_NONE here to catch anyone
+ * accessing it before we userfault
+ */
+prot = PROT_NONE;
+}
+
+/*
+ * We don't use offset argument of mmap() since the mapped address has
+ * to be page aligned, and we use huge pages.
+ */
+mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset,
+ prot, MAP_SHARED | MAP_NORESERVE, fd, 0);
+if (mmap_addr == MAP_FAILED) {
+vu_panic(dev, "region mmap error: %s", strerror(errno));
+return;
+}
+DPRINT("mmap_addr:   0x%016"PRIx64"\n",
+   (uint64_t)(uintptr_t)mmap_addr);
+
+r = &dev->regions[dev->nregions];
+r->gpa = msg_region->guest_phys_addr;
+r->size = msg_region->memory_size;
+r->qva = msg_region->userspace_addr;
+r->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+r->mmap_offset = msg_region->mmap_offset;
+dev->nregions++;
+
+if (dev->postcopy_listening) {
+/*
+ * Return the address to QEMU so that it can translate the ufd
+ * fault addresses back.
+ */
+msg_region->userspace_addr = r->mmap_addr + r->mmap_offset;
+}
+}
+
 static void
 vmsg_close_fds(VhostUserMsg *vmsg)
 {
@@ -727,10 +782,7 @@ generate_faults(VuDev *dev) {
 static bool
 vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 int i;
-bool track_ramblocks = dev->postcopy_listening;
VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
-VuDevRegion *dev_region = &dev->regions[dev->nregions];
-void *mmap_addr;
 
 if (vmsg->fd_num != 1) {
 vmsg_close_fds(vmsg);
@@ -760,69 +812,20 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
  * we know all the postcopy client bases have been received, and we
  * should start generating faults.
  */
-if (track_ramblocks &&
+if (dev->postcopy_listening &&
 vmsg->size == sizeof(vmsg->payload.u64) &&
 vmsg->payload.u64 == 0) {
 (void)generate_faults(dev);
 return false;
 }
 
-DPRINT("Adding region: %u\n", dev->nregions);
-DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
-   msg_region->guest_phys_addr);
-DPRINT("memory_size: 0x%016"PRIx64"\n",
-   msg_region->memory_size);
-DPRINT("userspace_addr   0x%016"PRIx64"\n",
-   msg_region->userspace_addr);
-DPRINT("mmap_offset  0x%016"PRIx64"\n",
-   msg_region->mmap_offset);
-
-dev_region->gpa = msg_region->guest_phys_addr;
-dev_region->size = msg_region->memory_size;
-dev_region->qva = msg_region->userspace_addr;
-dev_region->mmap_offset = msg_region->mmap_offset;
-
-/*
- * We don't use offset argument of mmap() since the
- * mapped address has to be page aligned, and we use huge
- * pages.
- */
-if (track_ramblocks) {
-/*
- * In postcopy we're using PROT_NONE here to catch anyone
- * accessing it before we userfault.
- */
-mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
-  

[PATCH v2 08/14] libvhost-user: Don't search for duplicates when removing memory regions

2024-02-14 Thread David Hildenbrand
We cannot have duplicate memory regions; otherwise something would be deeply
flawed elsewhere. Let's just stop the search once we find an entry.

We'll add more sanity checks when adding memory regions later.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 2e8b611385..7f29e01c30 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -896,8 +896,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 i--;
 
 found = true;
-
-/* Continue the search for eventual duplicates. */
+break;
 }
 }
 
-- 
2.43.0




[PATCH v2 02/14] libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509

2024-02-14 Thread David Hildenbrand
Let's support up to 509 mem slots, just like vhost in the kernel usually
does and the Rust vhost-user implementation recently started doing [1].
This is required to properly support memory hotplug, either using
multiple DIMMs (ACPI supports up to 256) or using virtio-mem.

509 used to be the KVM limit: KVM supported 512 slots, but 3 were
reserved for internal purposes. Currently, KVM supports more than 512, but
it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot
memory), except when other memory devices like PCI devices with BARs are
used. So, 509 seems to work well for vhost in the kernel.

Details can be found in the QEMU change that made virtio-mem consume
up to 256 mem slots across all virtio-mem devices. [2]

509 mem slots implies 509 VMAs/mappings in the worst case (even though,
in practice with virtio-mem we won't be seeing more than ~260 in most
setups).

With max_map_count under Linux defaulting to 64k, 509 mem slots
still correspond to less than 1% of the maximum number of mappings.
There are plenty left for the application to consume.

[1] https://github.com/rust-vmm/vhost/pull/224
[2] https://lore.kernel.org/all/20230926185738.277351-1-da...@redhat.com/

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.h 
b/subprojects/libvhost-user/libvhost-user.h
index c882b4e3a2..deb40e77b3 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -31,10 +31,12 @@
 #define VHOST_MEMORY_BASELINE_NREGIONS 8
 
 /*
- * Set a reasonable maximum number of ram slots, which will be supported by
- * any architecture.
+ * vhost in the kernel usually supports 509 mem slots. 509 used to be the
+ * KVM limit, it supported 512, but 3 were used for internal purposes. This
+ * limit is sufficient to support many DIMMs and virtio-mem in
+ * "dynamic-memslots" mode.
  */
-#define VHOST_USER_MAX_RAM_SLOTS 32
+#define VHOST_USER_MAX_RAM_SLOTS 509
 
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
 
-- 
2.43.0




[PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-14 Thread David Hildenbrand
This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem, which uses up to 256 memslots
across all memory devices in "dynamic-memslots" mode (more details
in patch #2).

With that in place, this series optimizes and extends memory region
handling in libvhost-user:
* Heavily deduplicate and clean up the memory region handling code
* Speed up GPA->VA translation with many memslots using binary search
* Optimize mmap_offset handling to use it as fd_offset for mmap()
* Avoid ring remapping when adding a single memory region
* Avoid dumping all guest memory, possibly allocating memory in sparse
  memory mappings when the process crashes

I'm being very careful not to break some weird corner case that modern
QEMU might no longer trigger, but that an older one could have triggered or
that some other frontend might trigger.

The only thing where I am not careful is to forbid memory regions that
overlap in GPA space: it doesn't make any sense.

With this series, virtio-mem (with dynamic-memslots=on) +
qemu-storage-daemon works flawlessly and as expected in my tests.

v1 -> v2:
* Drop "libvhost-user: Fix msg_region->userspace_addr computation"
 -> Not actually required
* "libvhost-user: Factor out adding a memory region"
 -> Make debug output more consistent (add missing ":")
* "libvhost-user: Use most of mmap_offset as fd_offset"
 -> get_fd_pagesize -> get_fd_hugepagesize; remove getpagesize()
 -> "mmap_offset:" to "old mmap_offset:" in debug message
 -> "adj mmap_offset:" to "new mmap_offset:" in debug message
 -> Use "(unsigned int)fs.f_type"; the man page of fstatfs() calls out
that the type of f_type can vary depending on the architecture.
"unsigned int" is sufficient here.
 -> Updated patch description
* Added RBs+ACKs
* Did a GitLab CI run; it seems to be happy regarding libvhost-user

Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Stefan Hajnoczi 
Cc: Stefano Garzarella 
Cc: Germano Veit Michel 
Cc: Raphael Norwitz 

David Hildenbrand (14):
  libvhost-user: Dynamically allocate memory for memory slots
  libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
  libvhost-user: Factor out removing all mem regions
  libvhost-user: Merge vu_set_mem_table_exec_postcopy() into
vu_set_mem_table_exec()
  libvhost-user: Factor out adding a memory region
  libvhost-user: No need to check for NULL when unmapping
  libvhost-user: Don't zero out memory for memory regions
  libvhost-user: Don't search for duplicates when removing memory
regions
  libvhost-user: Factor out search for memory region by GPA and simplify
  libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
  libvhost-user: Use most of mmap_offset as fd_offset
  libvhost-user: Factor out vq usability check
  libvhost-user: Dynamically remap rings after (temporarily?) removing
memory regions
  libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

 subprojects/libvhost-user/libvhost-user.c | 595 --
 subprojects/libvhost-user/libvhost-user.h |  10 +-
 2 files changed, 334 insertions(+), 271 deletions(-)

-- 
2.43.0




[PATCH v2 12/14] libvhost-user: Factor out vq usability check

2024-02-14 Thread David Hildenbrand
Let's factor it out to prepare for further changes.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 24 +++
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 55aef5fcc6..ed0a978d4f 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,6 +283,12 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static bool
+vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
+{
+return likely(!dev->broken) && likely(vq->vring.avail);
+}
+
 static size_t
 get_fd_hugepagesize(int fd)
 {
@@ -2380,8 +2386,7 @@ vu_queue_get_avail_bytes(VuDev *dev, VuVirtq *vq, 
unsigned int *in_bytes,
 idx = vq->last_avail_idx;
 
 total_bufs = in_total = out_total = 0;
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 goto done;
 }
 
@@ -2496,8 +2501,7 @@ vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned 
int in_bytes,
 bool
 vu_queue_empty(VuDev *dev, VuVirtq *vq)
 {
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return true;
 }
 
@@ -2536,8 +2540,7 @@ vring_notify(VuDev *dev, VuVirtq *vq)
 
 static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync)
 {
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
@@ -2862,8 +2865,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 unsigned int head;
 VuVirtqElement *elem;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return NULL;
 }
 
@@ -3020,8 +3022,7 @@ vu_queue_fill(VuDev *dev, VuVirtq *vq,
 {
 struct vring_used_elem uelem;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
@@ -3050,8 +3051,7 @@ vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int 
count)
 {
 uint16_t old, new;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
-- 
2.43.0




[PATCH v2 14/14] libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

2024-02-14 Thread David Hildenbrand
We already use MAP_NORESERVE to deal with sparse memory regions. Let's
also set madvise(MADV_DONTDUMP), otherwise a crash of the process can
result in us allocating all memory in the mmap'ed region for dumping
purposes.

This change implies that the mmap'ed rings won't be included in a
coredump. If ever required for debugging purposes, we could mark only
the mapped rings MADV_DODUMP.

Ignore errors during madvise() for now.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 61fb3050b3..a879149fef 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -460,6 +460,12 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 DPRINT("mmap_addr:   0x%016"PRIx64"\n",
(uint64_t)(uintptr_t)mmap_addr);
 
+#if defined(__linux__)
+/* Don't include all guest memory in a coredump. */
+madvise(mmap_addr, msg_region->memory_size + mmap_offset,
+MADV_DONTDUMP);
+#endif
+
 /* Shift all affected entries by 1 to open a hole at idx. */
 r = >regions[idx];
 memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx));
-- 
2.43.0




[PATCH v2 11/14] libvhost-user: Use most of mmap_offset as fd_offset

2024-02-14 Thread David Hildenbrand
In the past, QEMU would create memory regions that could partially cover
hugetlb pages, making mmap() fail if we used the mmap_offset as an
fd_offset. For that reason, we never used the mmap_offset as an offset into
the fd and instead always mapped the fd from the very start.

However, that can easily result in us mmap'ing a lot of unnecessary
parts of an fd, possibly repeatedly.

QEMU nowadays does not create memory regions that partially cover huge
pages -- it never really worked with postcopy. QEMU handles merging of
regions that partially cover huge pages (due to holes in boot memory) since
2018 in c1ece84e7c93 ("vhost: Huge page align and merge").

Let's be a bit careful and not unconditionally convert the
mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb
size and pass as much as we can as fd_offset, making sure that we call
mmap() with a properly aligned offset.

With QEMU and a virtio-mem device that is fully plugged (50GiB using 50
memslots) the qemu-storage daemon process consumes in the VA space
1281GiB before this change and 58GiB after this change.

=================   Vhost user message   =================
Request: VHOST_USER_ADD_MEM_REG (37)
Flags:   0x9
Size:40
Fds: 59
Adding region 4
guest_phys_addr: 0x0002
memory_size: 0x4000
userspace_addr:  0x7fb73bffe000
old mmap_offset: 0x8000
fd_offset:   0x8000
new mmap_offset: 0x
mmap_addr:   0x7f02f1bdc000
Successfully added new region
=================   Vhost user message   =================
Request: VHOST_USER_ADD_MEM_REG (37)
Flags:   0x9
Size:40
Fds: 59
Adding region 5
guest_phys_addr: 0x00024000
memory_size: 0x4000
userspace_addr:  0x7fb77bffe000
old mmap_offset: 0xc000
fd_offset:   0xc000
new mmap_offset: 0x
mmap_addr:   0x7f028400
Successfully added new region

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index ef6353d847..55aef5fcc6 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -43,6 +43,8 @@
 #include <fcntl.h>
 #include <sys/ioctl.h>
 #include <linux/vhost.h>
+#include <sys/vfs.h>
+#include <linux/magic.h>
 
 #ifdef __NR_userfaultfd
 #include 
@@ -281,12 +283,32 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static size_t
+get_fd_hugepagesize(int fd)
+{
+#if defined(__linux__)
+struct statfs fs;
+int ret;
+
+do {
+ret = fstatfs(fd, &fs);
+} while (ret != 0 && errno == EINTR);
+
+if (!ret && (unsigned int)fs.f_type == HUGETLBFS_MAGIC) {
+return fs.f_bsize;
+}
+#endif
+return 0;
+}
+
 static void
 _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 {
 const uint64_t start_gpa = msg_region->guest_phys_addr;
 const uint64_t end_gpa = start_gpa + msg_region->memory_size;
 int prot = PROT_READ | PROT_WRITE;
+uint64_t mmap_offset, fd_offset;
+size_t hugepagesize;
 VuDevRegion *r;
 void *mmap_addr;
 int low = 0;
@@ -300,7 +322,7 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
msg_region->memory_size);
 DPRINT("userspace_addr:  0x%016"PRIx64"\n",
msg_region->userspace_addr);
-DPRINT("mmap_offset: 0x%016"PRIx64"\n",
+DPRINT("old mmap_offset: 0x%016"PRIx64"\n",
msg_region->mmap_offset);
 
 if (dev->postcopy_listening) {
@@ -335,11 +357,31 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 idx = low;
 
 /*
- * We don't use offset argument of mmap() since the mapped address has
- * to be page aligned, and we use huge pages.
+ * Convert most of msg_region->mmap_offset to fd_offset. In almost all
+ * cases, this will leave us with mmap_offset == 0, mmap()'ing only
+ * what we really need. Only if a memory region would partially cover
+ * hugetlb pages, we'd get mmap_offset != 0, which usually doesn't happen
+ * anymore (i.e., modern QEMU).
+ *
+ * Note that mmap() with hugetlb would fail if the offset into the file
+ * is not aligned to the huge page size.
  */
-mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset,
- prot, MAP_SHARED | MAP_NORESERVE, fd, 0);
+hugepagesize = get_fd_hugepagesize(fd);
+if (hugepagesize) {
+fd_offset = ALIGN_DOWN(msg_region->mmap_offset, hugepagesize);
+mmap_offset = msg_region->mmap_offset - fd_offset;
+} else {
+

[PATCH v2 03/14] libvhost-user: Factor out removing all mem regions

2024-02-14 Thread David Hildenbrand
Let's factor it out. Note that the check for MAP_FAILED was wrong as
we never set mmap_addr if mmap() failed. We'll remove the NULL check
separately.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 360c5366d6..e4907dfc26 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -240,6 +240,22 @@ qva_to_va(VuDev *dev, uint64_t qemu_addr)
 return NULL;
 }
 
+static void
+vu_remove_all_mem_regs(VuDev *dev)
+{
+unsigned int i;
+
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *r = &dev->regions[i];
+void *ma = (void *)(uintptr_t)r->mmap_addr;
+
+if (ma) {
+munmap(ma, r->size + r->mmap_offset);
+}
+}
+dev->nregions = 0;
+}
+
 static void
 vmsg_close_fds(VhostUserMsg *vmsg)
 {
@@ -1003,14 +1019,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 unsigned int i;
 VhostUserMemory m = vmsg->payload.memory, *memory = &m;
 
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-void *ma = (void *) (uintptr_t) r->mmap_addr;
-
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
-}
+vu_remove_all_mem_regs(dev);
 dev->nregions = memory->nregions;
 
 if (dev->postcopy_listening) {
@@ -2112,14 +2121,7 @@ vu_deinit(VuDev *dev)
 {
 unsigned int i;
 
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-void *m = (void *) (uintptr_t) r->mmap_addr;
-if (m != MAP_FAILED) {
-munmap(m, r->size + r->mmap_offset);
-}
-}
-dev->nregions = 0;
+vu_remove_all_mem_regs(dev);
 
 for (i = 0; i < dev->max_queues; i++) {
 VuVirtq *vq = &dev->vq[i];
-- 
2.43.0




Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-14 Thread David Hildenbrand

On 13.02.24 19:55, Michael S. Tsirkin wrote:

On Tue, Feb 13, 2024 at 07:27:44PM +0100, David Hildenbrand wrote:

On 13.02.24 18:33, Michael S. Tsirkin wrote:

On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote:

This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #3).



Breaks build on some systems. E.g.
https://gitlab.com/mstredhat/qemu/-/jobs/6163591599




./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of
integer expressions of different signedness: ‘long int’ and ‘unsigned int’
[-Werror=sign-compare]
   369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) {
   |   ^~

So easy to fix in v2, thanks!



I think there is another problem around plugins though.


There is a wrong checkpatch error:

https://gitlab.com/mstredhat/qemu/-/jobs/6162397277

d96f29518232719b0c444ab93913e8515a6cb5c6:100: ERROR: use 
qemu_real_host_page_size() instead of getpagesize()

total: 1 errors, 1 warnings, 81 lines checked

qemu_real_host_page_size() is not available in libvhost-user. But I 
could just change that code to not require getpagesize() at all.



Apart from that, I don't spot anything libvhost-user related (some qtest 
timeouts, a "error_setv: Assertion `*errp == NULL' failed."). Did I miss 
something?


--
Cheers,

David / dhildenb




Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-13 Thread David Hildenbrand

On 13.02.24 18:33, Michael S. Tsirkin wrote:

On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote:

This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #3).



Breaks build on some systems. E.g.
https://gitlab.com/mstredhat/qemu/-/jobs/6163591599




./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of 
integer expressions of different signedness: ‘long int’ and ‘unsigned 
int’ [-Werror=sign-compare]

  369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) {
  |   ^~

So easy to fix in v2, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation

2024-02-13 Thread David Hildenbrand

On 13.02.24 18:32, Michael S. Tsirkin wrote:

On Fri, Feb 02, 2024 at 10:53:18PM +0100, David Hildenbrand wrote:

We barely had mmap_offset set in the past. With virtio-mem and
dynamic-memslots that will change.

In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are
performing pointer arithmetics, which is wrong.


Wrong how? I suspect you mean arithmetic on void * pointers is not portable?


You are absolutely right, no idea how I concluded that we might have a 
different pointer size here.


I'll either convert this patch into a cleanup or drop it for v2, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH V3 09/13] migration: notifier error checking

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:54, Steve Sistare wrote:

Check the status returned by migration notifiers and report errors.
If notifiers fail, call the notifiers again so they can clean up.


IIUC, if any of the notifiers will actually start to fail, say, during 
MIG_EVENT_PRECOPY_SETUP, you will call MIG_EVENT_PRECOPY_FAILED on all 
notifiers.


That will include notifiers that have never seen a 
MIG_EVENT_PRECOPY_SETUP call.


Is that what we expect notifiers to be able to deal with? Can we 
document that?


--
Cheers,

David / dhildenb




Re: [PATCH V3 08/13] migration: refactor migrate_fd_connect failures

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:54, Steve Sistare wrote:

Move common code for the error path in migrate_fd_connect to a shared
fail label.  No functional change.

Signed-off-by: Steve Sistare 
Reviewed-by: Peter Xu 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 07/13] migration: per-mode notifiers

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:54, Steve Sistare wrote:

Keep a separate list of migration notifiers for each migration mode.

Suggested-by: Peter Xu 
Signed-off-by: Steve Sistare 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 06/13] migration: MigrationNotifyFunc

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Define MigrationNotifyFunc to improve type safety and simplify migration
notifiers.

Signed-off-by: Steve Sistare 
---

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 04/13] migration: MigrationEvent for notifiers

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Passing MigrationState to notifiers is unsound because they could access
unstable migration state internals or even modify the state.  Instead, pass
the minimal info needed in a new MigrationEvent struct, which could be
extended in the future if needed.

Suggested-by: Peter Xu 
Signed-off-by: Steve Sistare 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



