Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request
On 05.06.24 17:19, Stefan Hajnoczi wrote: On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi wrote: On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote: On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi wrote: On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote: Add SHMEM_MAP/UNMAP requests to vhost-user. This request allows backends to dynamically map fds into a shared memory region identified by Please call this "VIRTIO Shared Memory Region" everywhere (code, vhost-user spec, commit description, etc) so it's clear that this is not about vhost-user shared memory tables/regions. its `shmid`. Then, the fd memory is advertised to the frontend through a BAR+offset, so it can be read by the driver while it's valid. Why is a PCI BAR mentioned here? vhost-user does not know about the VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to report VIRTIO Shared Memory Regions to the driver. I will remove PCI BAR, as it is true that it depends on the transport. I was trying to explain that the driver will use the shm_base + shm_offset to access the mapped memory. Then, the backend can munmap the memory range in a given shared memory region (again, identified by its `shmid`), to free it. After this, the region becomes private and shall not be accessed by the frontend anymore. What does "private" mean? The frontend must mmap PROT_NONE to reserve the virtual memory space when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an unrelated mmap(NULL, ...) might use that address range and the guest would have access to the host memory! This is a security issue and needs to be mentioned explicitly in the spec. I mentioned private because it changes the mapping from MAP_SHARED to MAP_PRIVATE. I will highlight PROT_NONE instead. I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you meant mmap flags or something like the memory range no longer being accessible to the driver.
One more thing: please check whether kvm.ko memory regions need to be modified or split to match the SHMEM_MAP mapping's read/write permissions. The VIRTIO Shared Memory Area pages can have PROT_READ, PROT_WRITE, PROT_READ|PROT_WRITE, or PROT_NONE. kvm.ko memory regions are read/write or read-only. I'm not sure what happens when the guest accesses a kvm.ko memory region containing mappings with permissions more restrictive than its kvm.ko memory region. IIRC, the KVM R/O memory region requests could allow further reducing permissions (assuming your mmap is R/W you could map it R/O into the KVM MMU), but I might remember things incorrectly. In other words, the kvm.ko memory region would allow the access but the Linux virtual memory configuration would cause a page fault. For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP mapping with PROT_READ. The kvm.ko memory region would be read/write (unless extra steps were taken to tell kvm.ko about the permissions). When the guest stores to the PROT_READ page, kvm.ko will process a fault...and I'm not sure what happens next. A similar scenario occurs when a PROT_NONE mapping exists within a kvm.ko memory region. I don't remember how kvm.ko behaves when the guest tries to access the pages. It's worth figuring this out before going further because it could become tricky if issues are discovered later. I have CCed David Hildenbrand in case he knows. One relevant piece is likely: "When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an mmap() that affects the region will be made visible immediately." We can already effectively get R/O or PROT_NONE PTEs in PROT_READ|PROT_WRITE mappings, and KVM must be able to handle that properly -- triggering a page fault to let core-MM resolve that.
If we have a PROT_NONE VMA and the guest writes to it, we'd likely end up (to resolve the KVM MMU page fault) in hva_to_pfn_slow()->get_user_pages_unlocked(), which would return -EFAULT. Not sure if we really inject a page fault into the guest or if the KVM run would effectively fail with -EFAULT and make user space unhappy. Certainly something to play with! -- Cheers, David / dhildenb
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 04.06.24 18:41, Peter Xu wrote: On Tue, Jun 04, 2024 at 06:14:08PM +0200, David Hildenbrand wrote: On 04.06.24 17:58, Peter Xu wrote: On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote: On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote: That property, irrelevant of what it is called (and I doubt whether Dan's suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have a user visible fd but it's shared-ram for sure..), is yet another way to specify guest mem types. What if the user specified this property but specified something else in the -object parameters? E.g. -machine share-ram=on -object memory-backend-ram,share=off. What should we do? The machine property would only apply to memory regions that are *NOT* being created via -object. The memory-backend objects would always honour their own share setting. In that case we may want to rename that to share-ram-by-default=on. Otherwise it's not clear which one would take effect from a user POV, even if we can define it like that in the code. Even with share-ram-by-default=on, it can still be confusing in some form or another. Consider this cmdline: -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1 Then is mem1 shared or not? From reading the cmdline, if share ram by default it should be ON if we don't specify it, but it's actually off? It's because -object has its own default values. We do have something similar with "merge" and "dump" properties. See machine_mem_merge() / machine_dump_guest_core(). These correspond to the "mem-merge" and "dump-guest-core" machine properties. These look fine so far, as long as the -object cmdline doesn't allow specifying the same thing again. You can. The mem-merge / dump-guest-core set the default that can be modified per memory backend (merge / dump properties). But ... IMHO fundamentally it's just controversial to have two ways to configure guest memory.
If '-object' is the preferred and complete way to configure it, I prefer sticking with it if possible and seeing what is missing. ... I agree with that. With vhost-user we also require a reasonable configuration (using proper fd-based shared memory) for it to work. I think I raised that as the other major reason too, that I think it's so far only about the vram that is out of the picture here. We don't and shouldn't have complicated RW RAMs floating around that we want this property to cover. Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing working by migrating that memory? CPR will be "slightly less fast". But the biggest piece -- guest RAM -- will be migrated via the fd directly. I think it should work but only without VFIO. With VFIO there must be no private pages at all or migrating is racy with concurrent DMAs (yes, AFAICT CPR can run migration with DMA running..). Understood. For these we could fail migration. Thanks for the pointer. -- Cheers, David / dhildenb
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 04.06.24 17:58, Peter Xu wrote: On Tue, Jun 04, 2024 at 08:13:26AM +0100, Daniel P. Berrangé wrote: On Mon, Jun 03, 2024 at 05:48:32PM -0400, Peter Xu wrote: That property, irrelevant of what it is called (and I doubt whether Dan's suggestion on "shared-ram" is good, e.g. mmap(MAP_SHARED) doesn't have a user visible fd but it's shared-ram for sure..), is yet another way to specify guest mem types. What if the user specified this property but specified something else in the -object parameters? E.g. -machine share-ram=on -object memory-backend-ram,share=off. What should we do? The machine property would only apply to memory regions that are *NOT* being created via -object. The memory-backend objects would always honour their own share setting. In that case we may want to rename that to share-ram-by-default=on. Otherwise it's not clear which one would take effect from a user POV, even if we can define it like that in the code. Even with share-ram-by-default=on, it can still be confusing in some form or another. Consider this cmdline: -machine q35,share-ram-by-default=on -object memory-backend-ram,id=mem1 Then is mem1 shared or not? From reading the cmdline, if share ram by default it should be ON if we don't specify it, but it's actually off? It's because -object has its own default values. We do have something similar with "merge" and "dump" properties. See machine_mem_merge() / machine_dump_guest_core(). These correspond to the "mem-merge" and "dump-guest-core" machine properties. But ... IMHO fundamentally it's just controversial to have two ways to configure guest memory. If '-object' is the preferred and complete way to configure it, I prefer sticking with it if possible and seeing what is missing. ... I agree with that. With vhost-user we also require a reasonable configuration (using proper fd-based shared memory) for it to work. I think I raised that as the other major reason too, that I think it's so far only about the vram that is out of the picture here.
We don't and shouldn't have complicated RW RAMs floating around that we want this property to cover. Agreed. And maybe we can still keep migration of any MAP_PRIVATE thing working by migrating that memory? CPR will be "slightly less fast". But the biggest piece -- guest RAM -- will be migrated via the fd directly. -- Cheers, David / dhildenb
Re: [PATCH 1/6] migration: remove RDMA live migration temporarily
On 04.06.24 14:14, Gonglei via wrote: From: Jialin Wang The new RDMA live migration will be introduced in the upcoming few commits. Signed-off-by: Jialin Wang Signed-off-by: Gonglei --- [...] - -/* Avoid ram_block_discard_disable(), cannot change during migration. */ -if (ram_block_discard_is_required()) { -error_setg(errp, "RDMA: cannot disable RAM discard"); -return; -} I'm particularly interested in the interaction with virtio-balloon/virtio-mem. Do we still have to disable discarding of RAM, and where would you do that in the rewrite? -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] backends/hostmem: Report error on qemu_madvise() failures
On 31.05.24 09:28, Michal Privoznik wrote: If user sets .merge or .dump attributes qemu_madvise() is called with corresponding advice. But it is never checked for failure which may mislead users into thinking the attribute is set correctly. Report an appropriate error. Signed-off-by: Michal Privoznik --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] osdep: Make qemu_madvise() to set errno in all cases
On 31.05.24 10:12, Philippe Mathieu-Daudé wrote: On 31/5/24 10:01, David Hildenbrand wrote: On 31.05.24 09:57, Philippe Mathieu-Daudé wrote: Hi Michal, On 31/5/24 09:28, Michal Privoznik wrote: The unspoken premise of qemu_madvise() is that errno is set on error. And it is mostly the case except for posix_madvise() which is documented to return either zero (on success) or a positive error number. Watch out, Linux: RETURN VALUE On success, posix_madvise() returns 0. On failure, it returns a positive error number. but on Darwin: RETURN VALUES Upon successful completion, a value of 0 is returned. Otherwise, a value of -1 is returned and errno is set to indicate the error. (Haven't checked other POSIX OSes). ... but it's supposed to follow the "posix standard" :D Maybe an issue in the docs? FreeBSD seems to follow the standard: "The posix_madvise() interface is identical, except it returns an error number on error and does not modify errno, and is provided for standards conformance." Same with OpenBSD: "The posix_madvise() interface has the same effect, but returns the error value instead of only setting errno." On Darwin, MADVISE(2): The posix_madvise() behaves same as madvise() except that it uses values with POSIX_ prefix for the advice system call argument. The posix_madvise function is part of IEEE 1003.1-2001 and was first implemented in Mac OS X 10.2. Per IEEE 1003.1-2001 (https://pubs.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html): RETURN VALUE Upon successful completion, posix_madvise() shall return zero; otherwise, an error number shall be returned to indicate the error. Note the use of "shall" which is described in RFC2119 as: This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course. Agreed, so we have to be careful. 
I do wonder if there would be the option for an automatic approach: for example, detect if the errno was/was not changed. Hm. -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] osdep: Make qemu_madvise() to set errno in all cases
On 31.05.24 09:57, Philippe Mathieu-Daudé wrote: Hi Michal, On 31/5/24 09:28, Michal Privoznik wrote: The unspoken premise of qemu_madvise() is that errno is set on error. And it is mostly the case except for posix_madvise() which is documented to return either zero (on success) or a positive error number. Watch out, Linux: RETURN VALUE On success, posix_madvise() returns 0. On failure, it returns a positive error number. but on Darwin: RETURN VALUES Upon successful completion, a value of 0 is returned. Otherwise, a value of -1 is returned and errno is set to indicate the error. (Haven't checked other POSIX OSes). ... but it's supposed to follow the "posix standard" :D Maybe an issue in the docs? FreeBSD seems to follow the standard: "The posix_madvise() interface is identical, except it returns an error number on error and does not modify errno, and is provided for standards conformance." Same with OpenBSD: "The posix_madvise() interface has the same effect, but returns the error value instead of only setting errno." -- Cheers, David / dhildenb
Re: [PATCH v2 2/4] osdep: Make qemu_madvise() return ENOSYS on unsupported OSes
On 31.05.24 09:28, Michal Privoznik wrote: Not every OS is capable of madvise() or even posix_madvise(). In that case, errno should be set to ENOSYS as it reflects the cause better. This also mimics what madvise()/posix_madvise() would return if the kernel lacks the corresponding syscall (e.g. due to configuration). Yes and no. According to the man page "EINVAL: advice is not a valid value." if a particular MADV_* call is not implemented, we would get EINVAL, which is really unfortunate ... to detect what is actually supported :( For the patch here ENOSYS makes sense: Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
On 29.05.24 08:48, Michal Prívozník wrote: On 5/28/24 18:47, David Hildenbrand wrote: Am 28.05.24 um 18:15 schrieb Michal Privoznik: ./build/qemu-system-x86_64 \ -m size=8389632k,slots=16,maxmem=2560k \ -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0 For DIMMs and friends we now (again) enforce that the size must be aligned to the page size: commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4 Author: David Hildenbrand Date: Wed Jan 17 14:55:54 2024 +0100 memory-device: reintroduce memory region size check We used to check that the memory region size is multiples of the overall requested address alignment for the device memory address. We removed that check, because there are cases (i.e., hv-balloon) where devices unconditionally request an address alignment that has a very large alignment (i.e., 32 GiB), but the actual memory device size might not be multiples of that alignment. However, this change: (a) allows for some practically impossible DIMM sizes, like "1GB+1 byte". (b) allows for DIMMs that partially cover hugetlb pages, previously reported in [1]. ... Partial hugetlb pages do not particularly make sense; wasting memory. Do we expect people to actually have a working setup that we might break when disallowing such configurations? Or would we actually help them identify that something non-sensical is happening? When using memory-backend-memfd, we already do get a proper error: ./build/qemu-system-x86_64 -m 2047m -object memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument So this only applies to memory-backend-file, where we maybe should fail in a similar way? Yeah, let's fail in that case. But non-aligned length is just one of many reasons madvise()/mbind() can fail.
What about the others? Should we make them report an error or just a warning? Regarding madvise(), we should report at least a warning. In qemu_ram_setup_dump() we print an error if QEMU_MADV_DONTDUMP failed. But we swallow any errors from memory_try_enable_merging() ... in general, we likely have to distinguish the "not supported by the OS" case from " actually supported but failed" case. In the second patch, maybe we should really fail if something unexpected happens, instead of fake-changing the properties. -- Cheers, David / dhildenb
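If the rounding route mentioned in the patch subject were taken instead of failing, the helper would presumably be a one-liner along these lines. This is a sketch: `ROUND_UP` mirrors QEMU's macro for power-of-two alignments, and `madvise_len` is a made-up name:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: round a backend length up to the host page size before calling
 * qemu_madvise()/mbind(), since the kernel rejects partial pages with
 * EINVAL. Valid only for power-of-two alignments. */
#define ROUND_UP(n, d) (((n) + (d) - 1) & -(uint64_t)(d))

static uint64_t madvise_len(uint64_t size, uint64_t host_page_size)
{
    return ROUND_UP(size, host_page_size);
}
```

For the 2047m hugetlb example above, a 2047 MiB request against 2 MiB pages would round up to 2048 MiB, which is precisely the "wasting memory" concern raised against silent rounding.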
Re: [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases
Am 28.05.24 um 18:15 schrieb Michal Privoznik: The unspoken premise of qemu_madvise() is that errno is set on error. And it is mostly the case except for posix_madvise() which is documented to return either zero (on success) or a positive error number. This means, we must set errno ourselves. And while at it, make the function return a negative value on error, just like other error paths do. Signed-off-by: Michal Privoznik --- util/osdep.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/util/osdep.c b/util/osdep.c index e996c4744a..e42f4e8121 100644 --- a/util/osdep.c +++ b/util/osdep.c @@ -57,7 +57,12 @@ int qemu_madvise(void *addr, size_t len, int advice) #if defined(CONFIG_MADVISE) return madvise(addr, len, advice); #elif defined(CONFIG_POSIX_MADVISE) -return posix_madvise(addr, len, advice); +int rc = posix_madvise(addr, len, advice); +if (rc) { +errno = rc; +return -1; +} +return 0; #else errno = EINVAL; return -1; Interesting, seems to be correct Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
Am 28.05.24 um 18:15 schrieb Michal Privoznik: ./build/qemu-system-x86_64 \ -m size=8389632k,slots=16,maxmem=2560k \ -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0 For DIMMs and friends we now (again) enforce that the size must be aligned to the page size: commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4 Author: David Hildenbrand Date: Wed Jan 17 14:55:54 2024 +0100 memory-device: reintroduce memory region size check We used to check that the memory region size is multiples of the overall requested address alignment for the device memory address. We removed that check, because there are cases (i.e., hv-balloon) where devices unconditionally request an address alignment that has a very large alignment (i.e., 32 GiB), but the actual memory device size might not be multiples of that alignment. However, this change: (a) allows for some practically impossible DIMM sizes, like "1GB+1 byte". (b) allows for DIMMs that partially cover hugetlb pages, previously reported in [1]. ... Partial hugetlb pages do not particularly make sense; wasting memory. Do we expect people to actually ave working setup that we might break when disallowing such configurations? Or would we actually help them identify that something non-sensical is happening? When using memory-backend-memfd, we already do get a proper error: ./build/qemu-system-x86_64 -m 2047m -object memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument So this only applies to memory-backend-file, where we maybe should fail in a similar way? -- Thanks, David / dhildenb
Re: [PATCH v2 1/3] hw/riscv/virt: Add memory hotplugging and virtio-md-pci support
I wonder if we should forbid users from removing memory that is 'out of place', i.e. users should always remove pc-dimms in LIFO order. Usually this kind of control is done by management, e.g. libvirt, but if we're not sure we'll keep consistency during memory unplugs ... I really don't think we should do any of that. ;) -- Thanks, David / dhildenb
Re: [RFC PATCH 1/3] hw/intc/s390_flic: Migrate pending state
Am 26.05.24 um 21:44 schrieb Richard Henderson: On 5/26/24 08:53, David Hildenbrand wrote: Am 25.05.24 um 15:12 schrieb Nicholas Piggin: The flic pending state is not migrated, so if the machine is migrated while an interrupt is pending, it can be lost. This shows up in qtest migration test, an extint is pending (due to console writes?) and the CPU waits via s390_cpu_set_psw and expects the interrupt to wake it. However when the flic pending state is lost, s390_cpu_has_int returns false, so s390_cpu_exec_interrupt falls through to halting again. Fix this by migrating pending. This prevents the qtest from hanging. Does service_param need to be migrated? Or the IO lists? Signed-off-by: Nicholas Piggin --- hw/intc/s390_flic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/intc/s390_flic.c b/hw/intc/s390_flic.c index 6771645699..b70cf2295a 100644 --- a/hw/intc/s390_flic.c +++ b/hw/intc/s390_flic.c @@ -369,6 +369,7 @@ static const VMStateDescription qemu_s390_flic_vmstate = { .fields = (const VMStateField[]) { VMSTATE_UINT8(simm, QEMUS390FLICState), VMSTATE_UINT8(nimm, QEMUS390FLICState), + VMSTATE_UINT32(pending, QEMUS390FLICState), VMSTATE_END_OF_LIST() } }; Likely you have to handle this using QEMU compat machines. Well, since existing migration is broken, I don't think you have to preserve Migration is broken only in some case "while an interrupt is pending, it can be lost". compatibility. But you do have to bump the version number. Looking at it, this is TCG only, so likely we don't care that much about migration compatibility. But I have no idea what level of compatibility we want to support there. -- Thanks, David / dhildenb
Re: [RFC PATCH 1/3] hw/intc/s390_flic: Migrate pending state
Am 25.05.24 um 15:12 schrieb Nicholas Piggin: The flic pending state is not migrated, so if the machine is migrated while an interrupt is pending, it can be lost. This shows up in qtest migration test, an extint is pending (due to console writes?) and the CPU waits via s390_cpu_set_psw and expects the interrupt to wake it. However when the flic pending state is lost, s390_cpu_has_int returns false, so s390_cpu_exec_interrupt falls through to halting again. Fix this by migrating pending. This prevents the qtest from hanging. Does service_param need to be migrated? Or the IO lists? Signed-off-by: Nicholas Piggin --- hw/intc/s390_flic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/intc/s390_flic.c b/hw/intc/s390_flic.c index 6771645699..b70cf2295a 100644 --- a/hw/intc/s390_flic.c +++ b/hw/intc/s390_flic.c @@ -369,6 +369,7 @@ static const VMStateDescription qemu_s390_flic_vmstate = { .fields = (const VMStateField[]) { VMSTATE_UINT8(simm, QEMUS390FLICState), VMSTATE_UINT8(nimm, QEMUS390FLICState), +VMSTATE_UINT32(pending, QEMUS390FLICState), VMSTATE_END_OF_LIST() } }; Likely you have to handle this using QEMU compat machines. -- Thanks, David / dhildenb
Re: [PATCH v2 1/3] hw/riscv/virt: Add memory hotplugging and virtio-md-pci support
On 24.05.24 15:14, Daniel Henrique Barboza wrote: On 5/21/24 07:56, Björn Töpel wrote: From: Björn Töpel Virtio-based memory devices (virtio-mem/virtio-pmem) allows for dynamic resizing of virtual machine memory, and requires proper hotplugging (add/remove) support to work. Add device memory support for RISC-V "virt" machine, and enable virtio-md-pci with the corresponding missing hotplugging callbacks. Signed-off-by: Björn Töpel --- hw/riscv/Kconfig | 2 + hw/riscv/virt.c| 83 +- hw/virtio/virtio-mem.c | 5 ++- 3 files changed, 87 insertions(+), 3 deletions(-) diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig index a2030e3a6ff0..08f82dbb681a 100644 --- a/hw/riscv/Kconfig +++ b/hw/riscv/Kconfig @@ -56,6 +56,8 @@ config RISCV_VIRT select PLATFORM_BUS select ACPI select ACPI_PCI +select VIRTIO_MEM_SUPPORTED +select VIRTIO_PMEM_SUPPORTED config SHAKTI_C bool diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c index 4fdb66052587..443902f919d2 100644 --- a/hw/riscv/virt.c +++ b/hw/riscv/virt.c @@ -53,6 +53,8 @@ #include "hw/pci-host/gpex.h" #include "hw/display/ramfb.h" #include "hw/acpi/aml-build.h" +#include "hw/mem/memory-device.h" +#include "hw/virtio/virtio-mem-pci.h" #include "qapi/qapi-visit-common.h" #include "hw/virtio/virtio-iommu.h" @@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine) DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip; int i, base_hartid, hart_count; int socket_count = riscv_socket_count(machine); +hwaddr device_memory_base, device_memory_size; /* Check socket count limit */ if (VIRT_SOCKETS_MAX < socket_count) { @@ -1420,6 +1423,12 @@ static void virt_machine_init(MachineState *machine) exit(1); } +if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) { +error_report("unsupported amount of memory slots: %"PRIu64, + machine->ram_slots); +exit(EXIT_FAILURE); +} + /* Initialize sockets */ mmio_irqchip = virtio_irqchip = pcie_irqchip = NULL; for (i = 0; i < socket_count; i++) { @@ -1553,6 +1562,37 @@ static void 
virt_machine_init(MachineState *machine) memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base, mask_rom); +/* device memory */ +device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + machine->ram_size, + GiB); +device_memory_size = machine->maxram_size - machine->ram_size; +if (device_memory_size > 0) { +/* + * Each DIMM is aligned based on the backend's alignment value. + * Assume max 1G hugepage alignment per slot. + */ +device_memory_size += machine->ram_slots * GiB; We don't need to align to 1GiB. This calc can use 2MiB instead (or 4MiB if we're running 32 bits). + +if (riscv_is_32bit(&s->soc[0])) { +hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size, + GiB); Same here - alignment is 2/4 MiB. + +if (memtop > UINT32_MAX) { +error_report("memory exceeds 32-bit limit by %lu bytes", + memtop - UINT32_MAX); +exit(EXIT_FAILURE); +} +} + +if (device_memory_base + device_memory_size < device_memory_size) { +error_report("unsupported amount of device memory"); +exit(EXIT_FAILURE); +} Took another look and found this a bit strange. These are all unsigned vars, so if (unsigned a + unsigned b < unsigned b) will always be 'false'. The compiler is probably cropping this out. No. Unsigned integer overflow is defined behavior and this is a common check to detect such overflow. It's consistent with what we do for other architectures. The calc we need to do is to ensure that the extra ram_slots * alignment will fit into the VIRT_DRAM block, i.e. maxram_size + (ram_slots * alignment) < memmap[VIRT_DRAM].size. TBH I'm starting to have second thoughts about letting users hotplug whatever they want. It seems cleaner to just force the 2/4 MiB alignment in pre_plug() and be done with it, no need to allocate ram_slots * alignment and doing all these extra checks. It's worth noting that if user space decides to specify addresses manually, it can mess up everything already.
There are other events that can result in fragmentation of the memory device area (repeated hot(un)plug of differing DIMMs). Assume you have a 1 GiB range and hotplug a 512 MiB DIMM at offset 256 MiB. You won't be able to hotplug another 512 MiB DIMM even though we reserved space. My take so far is: if the user wants to do such stuff it should size the area (maxmem) much larger or deal with the consequences (not being able to hotplug memory). It
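The overflow idiom defended in the exchange above is easy to sanity-check: for unsigned operands wrap-around is defined as reduction modulo 2^N, so the sum is smaller than an operand exactly when it wrapped (minimal illustration, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* `a + b < b` on unsigned operands is true iff the addition wrapped:
 * without wrap-around, a + b >= b always holds. */
static bool add_overflows(uint64_t a, uint64_t b)
{
    return a + b < b;
}
```

With signed types the same expression would be undefined behavior on overflow and the compiler really could drop it, which is presumably the source of the confusion.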
Re: [PATCH v5 13/13] tests/qtest/vhost-user-test: add a test case for memory-backend-shm
On 23.05.24 16:55, Stefano Garzarella wrote: `memory-backend-shm` can be used with vhost-user devices, so let's add a new test case for it. Acked-by: Thomas Huth Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 12/13] tests/qtest/vhost-user-blk-test: use memory-backend-shm
On 23.05.24 16:55, Stefano Garzarella wrote: `memory-backend-memfd` is available only on Linux while the new `memory-backend-shm` can be used on any POSIX-compliant operating system. Let's use it so we can run the test in multiple environments. Acked-by: Thomas Huth Acked-by: Stefan Hajnoczi Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Signed-off-by: Stefano Garzarella --- tests/qtest/vhost-user-blk-test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c index 117b9acd10..e945f6abf2 100644 --- a/tests/qtest/vhost-user-blk-test.c +++ b/tests/qtest/vhost-user-blk-test.c @@ -906,7 +906,7 @@ static void start_vhost_user_blk(GString *cmd_line, int vus_instances, vhost_user_blk_bin); g_string_append_printf(cmd_line, -" -object memory-backend-memfd,id=mem,size=256M,share=on " +" -object memory-backend-shm,id=mem,size=256M,share=on " Can we simplify and drop the share=on? Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 07/13] vhost-user: enable frontends on any POSIX system
On 23.05.24 16:55, Stefano Garzarella wrote: The vhost-user protocol is not really Linux-specific so let's enable vhost-user frontends for any POSIX system. In vhost_net.c we use VHOST_FILE_UNBIND which is defined in a Linux specific header, let's define it for other systems as well. Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
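The portability shim the commit message describes presumably looks like this sketch; on Linux the constant comes from <linux/vhost.h>, where its value is -1 (assumption to verify against the actual patch):

```c
#include <assert.h>

/* Provide VHOST_FILE_UNBIND on systems without <linux/vhost.h>; the value
 * mirrors the Linux uapi definition, where -1 means "detach the fd". */
#ifndef VHOST_FILE_UNBIND
#define VHOST_FILE_UNBIND -1
#endif
```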
Re: [PATCH v5 06/13] contrib/vhost-user-*: use QEMU bswap helper functions
On 23.05.24 16:55, Stefano Garzarella wrote: Let's replace the calls to le*toh() and htole*() with qemu/bswap.h helpers to make the code more portable. Suggested-by: Philippe Mathieu-Daudé Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 04/13] vhost-user-server: do not set memory fd non-blocking
On 23.05.24 16:55, Stefano Garzarella wrote: In vhost-user-server we set all fd received from the other peer in non-blocking mode. For some of them (e.g. memfd, shm_open, etc.) it's not really needed, because we don't use these fd with blocking operations, but only to map memory. In addition, in some systems this operation can fail (e.g. in macOS setting an fd returned by shm_open() non-blocking fails with errno = ENOTTY). So, let's avoid setting fd non-blocking for those messages that we know carry memory fd (e.g. VHOST_USER_ADD_MEM_REG, VHOST_USER_SET_MEM_TABLE). Reviewed-by: Daniel P. Berrangé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 05/13] contrib/vhost-user-blk: fix bind() using the right size of the address
On 23.05.24 16:55, Stefano Garzarella wrote: On macOS passing `-s /tmp/vhost.socket` parameter to the vhost-user-blk application, the bind was done on `/tmp/vhost.socke` pathname, missing the last character. This sounds like one of the portability problems described in the unix(7) manpage: Pathname sockets When binding a socket to a pathname, a few rules should be observed for maximum portability and ease of coding: • The pathname in sun_path should be null-terminated. • The length of the pathname, including the terminating null byte, should not exceed the size of sun_path. • The addrlen argument that describes the enclosing sockaddr_un structure should have a value of at least: offsetof(struct sockaddr_un, sun_path) + strlen(addr.sun_path)+1 or, more simply, addrlen can be specified as sizeof(struct sockaddr_un). So let's follow the last advice and simplify the code as well. Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 03/13] libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported
On 23.05.24 16:55, Stefano Garzarella wrote: libvhost-user will panic when receiving VHOST_USER_GET_INFLIGHT_FD message if MFD_ALLOW_SEALING is not defined, since it's not able to create a memfd. VHOST_USER_GET_INFLIGHT_FD is used only if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask that feature if the backend is not able to properly handle these messages. Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
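The masking approach can be sketched as follows. The feature bit number here is a placeholder, not taken from the vhost-user specification, and the function is illustrative rather than the actual libvhost-user code:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>   /* may define MFD_ALLOW_SEALING on Linux */

/* Placeholder bit number for the inflight-shmfd protocol feature; the
 * real value is defined by the vhost-user specification. */
#define PROTOCOL_F_INFLIGHT_SHMFD 12

/* If the platform cannot create sealable memfds, never offer the
 * feature: the frontend then won't send VHOST_USER_GET_INFLIGHT_FD,
 * and the backend won't panic trying to service it. */
static uint64_t mask_unsupported_features(uint64_t features)
{
#ifndef MFD_ALLOW_SEALING
    features &= ~(1ULL << PROTOCOL_F_INFLIGHT_SHMFD);
#endif
    return features;
}
```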
Re: [PATCH v5 02/13] libvhost-user: fail vu_message_write() if sendmsg() is failing
On 23.05.24 16:55, Stefano Garzarella wrote: In vu_message_write() we use sendmsg() to send the message header, then a write() to send the payload. If sendmsg() fails we should avoid sending the payload, since we were unable to send the header. Discovered before fixing the issue with the previous patch, where sendmsg() failed on macOS due to wrong parameters, but the frontend still sent the payload which the backend incorrectly interpreted as a wrong header. Reviewed-by: Daniel P. Berrangé Reviewed-by: Philippe Mathieu-Daudé Tested-by: Philippe Mathieu-Daudé Acked-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
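The ordering fix can be sketched like this; the function and parameter names are illustrative, not the actual libvhost-user code:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send the header with sendmsg() and only write the payload if the
 * header actually went out. */
static int send_header_then_payload(int sock, const void *hdr, size_t hdr_len,
                                    const void *payload, size_t payload_len)
{
    struct iovec iov = { .iov_base = (void *)hdr, .iov_len = hdr_len };
    struct msghdr msg = { 0 };
    ssize_t rc;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    do {
        rc = sendmsg(sock, &msg, 0);
    } while (rc < 0 && (errno == EINTR || errno == EAGAIN));

    if (rc <= 0) {
        /* Header not sent: do NOT write the payload, or the peer would
         * misinterpret the payload bytes as the next message header. */
        return -1;
    }

    if (payload_len > 0) {
        do {
            rc = write(sock, payload, payload_len);
        } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
        if (rc <= 0) {
            return -1;
        }
    }
    return 0;
}
```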
Re: [PATCH] hw/riscv/virt: Add hotplugging and virtio-md-pci support
On 19.05.24 11:24, Daniel Henrique Barboza wrote: On 5/18/24 16:50, David Hildenbrand wrote: Hi, diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c index 4fdb66052587..16c2bdbfe6b6 100644 --- a/hw/riscv/virt.c +++ b/hw/riscv/virt.c @@ -53,6 +53,8 @@ #include "hw/pci-host/gpex.h" #include "hw/display/ramfb.h" #include "hw/acpi/aml-build.h" +#include "hw/mem/memory-device.h" +#include "hw/virtio/virtio-mem-pci.h" #include "qapi/qapi-visit-common.h" #include "hw/virtio/virtio-iommu.h" @@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine) DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip; int i, base_hartid, hart_count; int socket_count = riscv_socket_count(machine); + hwaddr device_memory_base, device_memory_size; /* Check socket count limit */ if (VIRT_SOCKETS_MAX < socket_count) { @@ -1553,6 +1556,25 @@ static void virt_machine_init(MachineState *machine) memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base, mask_rom); + device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + machine->ram_size, + GiB); + device_memory_size = machine->maxram_size - machine->ram_size; + + if (riscv_is_32bit(&s->soc[0])) { + hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size, GiB); + + if (memtop > UINT32_MAX) { + error_report("Memory exceeds 32-bit limit by %lu bytes", + memtop - UINT32_MAX); + exit(EXIT_FAILURE); + } + } + + if (device_memory_size > 0) { + machine_memory_devices_init(machine, device_memory_base, + device_memory_size); + } + I think we need a design discussion before proceeding here. You're allocating all available memory as a memory device area, but in theory we might also support pc-dimm hotplugs (which would be the equivalent of adding physical RAM dimms to the board) in the future too. If you're not familiar with this feature you can check out the docs in [1]. Note that DIMMs are memory devices as well.
You can plug into the memory device area both ACPI-based memory devices (DIMM, NVDIMM) and virtio-based memory devices (virtio-mem, virtio-pmem). As an example, the 'virt' ARM board (hw/arm/virt.c) reserves a space for this type of hotplug by checking how much 'ram_slots' we're allocating for it: device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB; Note that we increased the region size to be able to fit most requests even if alignment of memory devices is weird. See below. In sane setups, this is usually not required (adding a single additional GB for some flexibility might be good enough). Other boards do the same with ms->ram_slots. We should consider doing it as well, now, even if we're not up to the point of supporting pc-dimm hotplug, to avoid having to change the memory layout later down the road and breaking existing setups. If we want to copy the ARM board, ram_slots is capped to ACPI_MAX_RAM_SLOTS (256). Each RAM slot is considered to be a 1GiB dimm, i.e. we would reserve 256GiB for them. This only reserves some *additional* space to fix up weird alignment of memory devices, *not* the actual space for these devices. We don't consider each DIMM to be 1 GiB in size, but add an additional 1 GiB in case we have to align DIMMs in physical address space. I *think* this dates back to old x86 handling where we aligned the address of each DIMM to be at a 1 GiB boundary. So if you would have plugged two 128 MiB DIMMs, you'd have required more than 256 MiB of space in the area after aligning inside the memory device area. Thanks for the explanation. I missed the part where the ram_slots were being used just to solve potential alignment issues and pc-dimms could occupy the same space being allocated via machine_memory_devices_init(). This patch isn't far off then. If we take care to avoid plugging unaligned memory we might not even need this spare area. Right.
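The sizing rule discussed here can be expressed as a toy helper: the device memory area must hold maxram_size - ram_size of hotpluggable memory, and ARM-style boards additionally pad it with one GiB per RAM slot purely as alignment headroom, not as per-DIMM capacity. This is a sketch, not actual board code:

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Compute the size of the device memory region. When pad_for_alignment
 * is set, one extra GiB per ram_slot is reserved so that DIMMs which
 * have to be aligned in physical address space still fit. */
static uint64_t device_memory_region_size(uint64_t maxram_size,
                                          uint64_t ram_size,
                                          unsigned int ram_slots,
                                          int pad_for_alignment)
{
    uint64_t size = maxram_size - ram_size;

    if (pad_for_alignment) {
        size += (uint64_t)ram_slots * GiB;
    }
    return size;
}
```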
Note: I do not have the visibility of discussions on the memory management space, and I might be missing details such as "we don't care about pc-dimm hotplug anymore, it's legacy, we're going to support only virtio-md-pci from now on". We had a situation like that with virtio-balloon and virtio-mem in the past, and I'm not sure if this might fall in the same category. Not sure if I got your comment right, but virtio-mem was never supposed to be a virtio-balloon replacement (especially of the free-page-reporting and memory stats part). I was trying to refer to a situation we faced 3+ years ago in the powerpc machines, where we were trying to add virtio-mem support given that virtio-mem is/was seen (as far as I can remember
Re: [PATCH] hw/riscv/virt: Add hotplugging and virtio-md-pci support
Hi, diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c index 4fdb66052587..16c2bdbfe6b6 100644 --- a/hw/riscv/virt.c +++ b/hw/riscv/virt.c @@ -53,6 +53,8 @@ #include "hw/pci-host/gpex.h" #include "hw/display/ramfb.h" #include "hw/acpi/aml-build.h" +#include "hw/mem/memory-device.h" +#include "hw/virtio/virtio-mem-pci.h" #include "qapi/qapi-visit-common.h" #include "hw/virtio/virtio-iommu.h" @@ -1407,6 +1409,7 @@ static void virt_machine_init(MachineState *machine) DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip; int i, base_hartid, hart_count; int socket_count = riscv_socket_count(machine); +hwaddr device_memory_base, device_memory_size; /* Check socket count limit */ if (VIRT_SOCKETS_MAX < socket_count) { @@ -1553,6 +1556,25 @@ static void virt_machine_init(MachineState *machine) memory_region_add_subregion(system_memory, memmap[VIRT_MROM].base, mask_rom); +device_memory_base = ROUND_UP(s->memmap[VIRT_DRAM].base + machine->ram_size, + GiB); +device_memory_size = machine->maxram_size - machine->ram_size; + +if (riscv_is_32bit(&s->soc[0])) { +hwaddr memtop = device_memory_base + ROUND_UP(device_memory_size, GiB); + +if (memtop > UINT32_MAX) { +error_report("Memory exceeds 32-bit limit by %lu bytes", + memtop - UINT32_MAX); +exit(EXIT_FAILURE); +} +} + +if (device_memory_size > 0) { +machine_memory_devices_init(machine, device_memory_base, +device_memory_size); +} + I think we need a design discussion before proceeding here. You're allocating all available memory as a memory device area, but in theory we might also support pc-dimm hotplugs (which would be the equivalent of adding physical RAM dimms to the board) in the future too. If you're not familiar with this feature you can check out the docs in [1]. Note that DIMMs are memory devices as well. You can plug into the memory device area both ACPI-based memory devices (DIMM, NVDIMM) and virtio-based memory devices (virtio-mem, virtio-pmem).
As an example, the 'virt' ARM board (hw/arm/virt.c) reserves a space for this type of hotplug by checking how much 'ram_slots' we're allocating for it: device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB; Note that we increased the region size to be able to fit most requests even if alignment of memory devices is weird. See below. In sane setups, this is usually not required (adding a single additional GB for some flexibility might be good enough). Other boards do the same with ms->ram_slots. We should consider doing it as well, now, even if we're not up to the point of supporting pc-dimm hotplug, to avoid having to change the memory layout later down the road and breaking existing setups. If we want to copy the ARM board, ram_slots is capped to ACPI_MAX_RAM_SLOTS (256). Each RAM slot is considered to be a 1GiB dimm, i.e. we would reserve 256GiB for them. This only reserves some *additional* space to fix up weird alignment of memory devices, *not* the actual space for these devices. We don't consider each DIMM to be 1 GiB in size, but add an additional 1 GiB in case we have to align DIMMs in physical address space. I *think* this dates back to old x86 handling where we aligned the address of each DIMM to be at a 1 GiB boundary. So if you would have plugged two 128 MiB DIMMs, you'd have required more than 256 MiB of space in the area after aligning inside the memory device area. Note: I do not have the visibility of discussions on the memory management space, and I might be missing details such as "we don't care about pc-dimm hotplug anymore, it's legacy, we're going to support only virtio-md-pci from now on". We had a situation like that with virtio-balloon and virtio-mem in the past, and I'm not sure if this might fall in the same category. Not sure if I got your comment right, but virtio-mem was never supposed to be a virtio-balloon replacement (especially of the free-page-reporting and memory stats part).
I see that David Hildenbrand is also CCed in the patch so he'll let us know if I'm out of line with what I'm asking. Supporting PC-DIMMs might be required at some point when dealing with OSes that don't support virtio-mem and friends. -- Cheers, David / dhildenb
Re: [PATCH v6 6/8] xen: mapcache: Pass the ram_addr offset to xen_map_cache()
On 16.05.24 17:48, Edgar E. Iglesias wrote: From: "Edgar E. Iglesias" Pass the ram_addr offset to xen_map_cache. This is in preparation for adding grant mappings that need to compute the address within the RAMBlock. No functional changes. Signed-off-by: Edgar E. Iglesias --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v6 5/8] softmmu: Replace check for RAMBlock offset 0 with xen_mr_is_memory
On 16.05.24 17:48, Edgar E. Iglesias wrote: From: "Edgar E. Iglesias" For xen, when checking for the first RAM (xen_memory), use xen_mr_is_memory() rather than checking for a RAMBlock with offset 0. All Xen machines create xen_memory first so this has no functional change for existing machines. Signed-off-by: Edgar E. Iglesias Reviewed-by: Stefano Stabellini --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v6 4/8] softmmu: xen: Always pass offset + addr to xen_map_cache
On 16.05.24 17:48, Edgar E. Iglesias wrote: From: "Edgar E. Iglesias" Always pass address with offset to xen_map_cache(). This is in preparation for support for grant mappings. Since this is within a block that checks for offset == 0, this has no functional changes. Signed-off-by: Edgar E. Iglesias Reviewed-by: Stefano Stabellini --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 11/14] target/s390x: Fix helper_per_ifetch flags
On 02.05.24 07:44, Richard Henderson wrote: CPU state is read on the exception path. Fixes: 83bb161299c ("target-s390x: PER instruction-fetch nullification event support") Signed-off-by: Richard Henderson --- target/s390x/helper.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/s390x/helper.h b/target/s390x/helper.h index 5611155ba1..31bd193322 100644 --- a/target/s390x/helper.h +++ b/target/s390x/helper.h @@ -361,7 +361,7 @@ DEF_HELPER_FLAGS_1(purge, TCG_CALL_NO_RWG, void, env) DEF_HELPER_3(lra, i64, env, i64, i64) DEF_HELPER_FLAGS_3(per_check_exception, TCG_CALL_NO_WG, void, env, i64, i32) DEF_HELPER_FLAGS_3(per_branch, TCG_CALL_NO_WG, void, env, i64, i32) -DEF_HELPER_FLAGS_2(per_ifetch, TCG_CALL_NO_RWG, void, env, i64) +DEF_HELPER_FLAGS_2(per_ifetch, TCG_CALL_NO_WG, void, env, i64) DEF_HELPER_FLAGS_2(per_store_real, TCG_CALL_NO_WG, noreturn, env, i32) DEF_HELPER_FLAGS_1(stfl, TCG_CALL_NO_RWG, void, env) Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v4 14/17] xen: Add xen_mr_is_memory()
On 30.04.24 18:49, Edgar E. Iglesias wrote: From: "Edgar E. Iglesias" Add xen_mr_is_memory() to abstract away tests for the xen_memory MR. Signed-off-by: Edgar E. Iglesias --- [...] #endif diff --git a/system/physmem.c b/system/physmem.c index ad7a8c7d95..1a5ffcba2a 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -2227,7 +2227,7 @@ static void *qemu_ram_ptr_length(RAMBlock *block, ram_addr_t addr, * because we don't want to map the entire memory in QEMU. * In that case just map the requested area. */ -if (block->offset == 0) { +if (xen_mr_is_memory(block->mr)) { return xen_map_cache(block->mr, addr, len, lock, lock, is_write); } I'd have moved that into a separate patch, because this is not a simple abstraction here. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v4 13/17] softmmu: Pass RAM MemoryRegion and is_write xen_map_cache()
On 30.04.24 18:49, Edgar E. Iglesias wrote: From: "Edgar E. Iglesias" Propagate MR and is_write to xen_map_cache(). I'm pretty sure the patch subject is missing a "to" :) This is in preparation for adding support for grant mappings. No functional change. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v4 01/17] softmmu: let qemu_map_ram_ptr() use qemu_ram_ptr_length()
On 30.04.24 18:49, Edgar E. Iglesias wrote: From: Juergen Gross qemu_map_ram_ptr() and qemu_ram_ptr_length() share quite some code, so modify qemu_ram_ptr_length() a little bit and use it for qemu_map_ram_ptr(), too. Signed-off-by: Juergen Gross Signed-off-by: Vikram Garhwal Signed-off-by: Edgar E. Iglesias Reviewed-by: Stefano Stabellini Reviewed-by: Alex Bennée Reviewed-by: Edgar E. Iglesias --- Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2] hw/s390x: Attach the sclpconsole to /machine/sclp/s390-sclp-event-facility
On 30.04.24 21:08, Thomas Huth wrote: The sclpconsole currently does not have a proper parent in the QOM tree, so it shows up under /machine/unattached - which is somewhat ugly. We should rather attach it to /machine/sclp/s390-sclp-event-facility where the other devices of type TYPE_SCLP_EVENT already reside. Signed-off-by: Thomas Huth --- hw/s390x/s390-virtio-ccw.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c index 5c83d1ea17..41be8bf857 100644 --- a/hw/s390x/s390-virtio-ccw.c +++ b/hw/s390x/s390-virtio-ccw.c @@ -242,11 +242,13 @@ static void s390_create_virtio_net(BusState *bus, const char *name) static void s390_create_sclpconsole(const char *type, Chardev *chardev) { +BusState *ev_fac_bus = sclp_get_event_facility_bus(); DeviceState *dev; dev = qdev_new(type); +object_property_add_child(OBJECT(ev_fac_bus->parent), type, OBJECT(dev)); qdev_prop_set_chr(dev, "chardev", chardev); -qdev_realize_and_unref(dev, sclp_get_event_facility_bus(), &error_fatal); +qdev_realize_and_unref(dev, ev_fac_bus, &error_fatal); } static void ccw_init(MachineState *machine) Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v4 1/2] target/s390x: report deprecated-props in cpu-model-expansion reply
On 29.04.24 21:10, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. A query-cpu-model-expansion reply will now provide a list of properties (i.e. features) that are flagged as deprecated. Example: { "return": { "model": { "name": "z14.2-base", "deprecated-props": [ "bpb", "csske" ], "props": { "pfmfi": false, "exrl": true, ...a lot more props... "skey": false, "vxpdeh2": false } } } } It is recommended that s390 guests operate with these features explicitly disabled to ensure compatibility with future hardware. Signed-off-by: Collin Walling Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v4 2/2] target/s390x: flag te and cte as deprecated
On 29.04.24 21:10, Collin Walling wrote: Add the CONSTRAINT_TRANSACTIONAL_EXE (cte) and TRANSACTIONAL_EXE (te) to the list of deprecated features. Signed-off-by: Collin Walling --- target/s390x/cpu_features.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c index efafc9711c..cb4e2b8920 100644 --- a/target/s390x/cpu_features.c +++ b/target/s390x/cpu_features.c @@ -218,6 +218,9 @@ void s390_get_deprecated_features(S390FeatBitmap features) /* CSSKE is deprecated on newer generations */ S390_FEAT_CONDITIONAL_SSKE, S390_FEAT_BPB, + /* Deprecated on z16 */ + S390_FEAT_CONSTRAINT_TRANSACTIONAL_EXE, + S390_FEAT_TRANSACTIONAL_EXE }; int i; Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH] hw/s390x: Attach the sclpconsole to the /machine/sclp node
On 30.04.24 10:04, Thomas Huth wrote: The sclpconsole currently does not have a proper parent in the QOM tree, so it shows up under /machine/unattached - which is somewhat ugly. Let's attach it to /machine/sclp instead. IIRC, this should not affect migration Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v3 1/2] target/s390x: report deprecated-props in cpu-model-expansion reply
On 26.04.24 19:44, David Hildenbrand wrote: On 24.04.24 23:56, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. A query-cpu-model-expansion reply will now provide a list of properties (i.e. features) that are flagged as deprecated. Example: { "return": { "model": { "name": "z14.2-base", "deprecated-props": [ "bpb", "csske" ], "props": { "pfmfi": false, "exrl": true, ...a lot more props... "skey": false, "vxpdeh2": false } } } } It is recommended that s390 guests operate with these features explicitly disabled to ensure compatibility with future hardware. Likely you should only report features that are applicable to a model. That is, if it's part of the full_feat. Otherwise, the caller might simply want to set all features to "false", and we'd fail setting a feature that is unknown to a specific CPU generation. That is, you would AND the bitmap with the full_feat of the underlying CPU definition. Refreshing my memory, I think we can just clear any CPU features. We only bail out when setting them! -- Cheers, David / dhildenb
Re: [PATCH v3 1/2] target/s390x: report deprecated-props in cpu-model-expansion reply
On 24.04.24 23:56, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. A query-cpu-model-expansion reply will now provide a list of properties (i.e. features) that are flagged as deprecated. Example: { "return": { "model": { "name": "z14.2-base", "deprecated-props": [ "bpb", "csske" ], "props": { "pfmfi": false, "exrl": true, ...a lot more props... "skey": false, "vxpdeh2": false } } } } It is recommended that s390 guests operate with these features explicitly disabled to ensure compatibility with future hardware. Likely you should only report features that are applicable to a model. That is, if it's part of the full_feat. Otherwise, the caller might simply want to set all features to "false", and we'd fail setting a feature that is unknown to a specific CPU generation. That is, you would AND the bitmap with the full_feat of the underlying CPU definition. -- Cheers, David / dhildenb
Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option
On 24.04.24 20:33, Collin Walling wrote: On 4/24/24 03:24, David Hildenbrand wrote: On 23.04.24 23:06, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. When a query-cpu-model-expansion is provided with the "disable-deprecated-feats" option set, the resulting properties list will include all deprecated features paired with false. Example: { ... "bpb": false, "csske": false, ...} It is recommended that s390 guests operate with these features explicitly disabled to ensure compatability with future hardware. Signed-off-by: Collin Walling --- target/s390x/cpu_features.c | 14 ++ target/s390x/cpu_features.h | 1 + target/s390x/cpu_models_sysemu.c | 20 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c index d28eb65845..efafc9711c 100644 --- a/target/s390x/cpu_features.c +++ b/target/s390x/cpu_features.c @@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, }; } +void s390_get_deprecated_features(S390FeatBitmap features) +{ +static const int feats[] = { + /* CSSKE is deprecated on newer generations */ + S390_FEAT_CONDITIONAL_SSKE, + S390_FEAT_BPB, +}; +int i; + +for (i = 0; i < ARRAY_SIZE(feats); i++) { +set_bit(feats[i], features); +} +} + #define FEAT_GROUP_INIT(_name, _group, _desc)\ {\ .name = _name, \ diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h index a9bd68a2e1..661a8cd6db 100644 --- a/target/s390x/cpu_features.h +++ b/target/s390x/cpu_features.h @@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, S390FeatType type, uint8_t *data); void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, void (*fn)(const char *name, void *opaque)); +void s390_get_deprecated_features(S390FeatBitmap features); /* Definition of a CPU feature group */ typedef struct { diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c index 
ef9fa80efd..b002819188 100644 --- a/target/s390x/cpu_models_sysemu.c +++ b/target/s390x/cpu_models_sysemu.c @@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void *opaque) /* convert S390CPUDef into a static CpuModelInfo */ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, -bool delta_changes) +bool delta_changes, +bool disable_deprecated_feats) { QDict *qdict = qdict_new(); S390FeatBitmap bitmap; @@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); } +/* features flagged as deprecated */ +if (disable_deprecated_feats) { +bitmap_zero(bitmap, S390_FEAT_MAX); +s390_get_deprecated_features(bitmap); +s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); +} Likely, we should remove these features from the actual bitmap, such that they won't appear twice in the output? I'd expect the cpu_info_from_model() caller to handle that. Just adding them to the list as disabled is likely wrong. For example, if someone were to expand a given model with "... bpb: true" with disable-deprecated-feats=on, that should be removed from "bpb:true", and only replaced by "bpb=false" if it would be part of the CPU model we would be expanding to. Or am I missing something? qdict_add_disabled_feat will handle updating the feature if it already exists. I placed the code to process deprecated features as the last step of cpu_info_from_model to override any features that have already been added to the bitmap. Whether it should be the deprecated feats that take priority, or what the user requests is up in the air, however... Yes, that's one of my concerns. IIRC, if the user specifies the same property multiple times, it's unclear which one will win. "bpb=true,bpb=false" might mean that bpb=true might win. I think this is something that this interface should sort out, so it will actually be usable! ...
Daniel's suggestion to modify the QMP response to include a separate list of "deprecated-props" seems like a much more efficient and readable way that alleviates both your and Markus' concerns. Would you only include properties that would apply to the current model and would be set to true in the current model? How would libvirt make use of this interface, and could it run into the issue spelled out above? -- Cheers, David / dhildenb
Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option
On 23.04.24 23:06, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. When a query-cpu-model-expansion is provided with the "disable-deprecated-feats" option set, the resulting properties list will include all deprecated features paired with false. Example: { ... "bpb": false, "csske": false, ...} It is recommended that s390 guests operate with these features explicitly disabled to ensure compatability with future hardware. Signed-off-by: Collin Walling --- target/s390x/cpu_features.c | 14 ++ target/s390x/cpu_features.h | 1 + target/s390x/cpu_models_sysemu.c | 20 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c index d28eb65845..efafc9711c 100644 --- a/target/s390x/cpu_features.c +++ b/target/s390x/cpu_features.c @@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, }; } +void s390_get_deprecated_features(S390FeatBitmap features) +{ +static const int feats[] = { + /* CSSKE is deprecated on newer generations */ + S390_FEAT_CONDITIONAL_SSKE, + S390_FEAT_BPB, +}; +int i; + +for (i = 0; i < ARRAY_SIZE(feats); i++) { +set_bit(feats[i], features); +} +} + #define FEAT_GROUP_INIT(_name, _group, _desc)\ {\ .name = _name, \ diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h index a9bd68a2e1..661a8cd6db 100644 --- a/target/s390x/cpu_features.h +++ b/target/s390x/cpu_features.h @@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, S390FeatType type, uint8_t *data); void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, void (*fn)(const char *name, void *opaque)); +void s390_get_deprecated_features(S390FeatBitmap features); /* Definition of a CPU feature group */ typedef struct { diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c index ef9fa80efd..b002819188 100644 --- a/target/s390x/cpu_models_sysemu.c +++ 
b/target/s390x/cpu_models_sysemu.c @@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void *opaque) /* convert S390CPUDef into a static CpuModelInfo */ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, -bool delta_changes) +bool delta_changes, +bool disable_deprecated_feats) { QDict *qdict = qdict_new(); S390FeatBitmap bitmap; @@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); } +/* features flagged as deprecated */ +if (disable_deprecated_feats) { +bitmap_zero(bitmap, S390_FEAT_MAX); +s390_get_deprecated_features(bitmap); +s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); +} Likely, we should remove these features from the actual bitmap, such that they won't appear twice in the output? I'd expect the cpu_info_from_model() caller to handle that. Just adding them to the list as disabled is likely wrong. For example, if someone were to expand a given model with "... bpb: true" with disable-deprecated-feats=on, that should be removed from "bpb:true", and only replaced by "bpb=false" if it would be part of the CPU model we would be expanding to. Or am I missing something? -- Cheers, David / dhildenb
Re: [PATCH 0/3] Remove useless architecture prefix from the CPU list
On 20.04.24 07:46, Thomas Huth wrote: Printing an architecture prefix in front of each CPU name is not helpful at all: It is confusing for the users since they don't know whether they have to specify these letters for the "-cpu" parameter, too, and it also takes some precious space in the dense output of the CPU entries. Let's simply remove those now. Yes, I also never really understood the purpose. -- Cheers, David / dhildenb
[PATCH v1] virtio-mem: improve error message when unplug of device fails due to plugged memory
The error message is actually expressive, considering QEMU only. But when called from Libvirt, talking about "size" can be confusing, because in Libvirt "size" translates to the memory backend size in QEMU (maximum size) and "current" translates to the QEMU "size" property. Let's simply avoid talking about the "size" property and spell out that some device memory is still plugged. Cc: Liang Cong Cc: Mario Casquero Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index ffd119ebac..ef64bf1b4a 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -1832,8 +1832,8 @@ static void virtio_mem_unplug_request_check(VirtIOMEM *vmem, Error **errp) } if (vmem->size) { -error_setg(errp, "virtio-mem device cannot get unplugged while" - " '" VIRTIO_MEM_SIZE_PROP "' != '0'"); +error_setg(errp, "virtio-mem device cannot get unplugged while some" + " of its memory is still plugged"); return; } if (vmem->requested_size) { -- 2.44.0
Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 08.04.24 09:58, Stefano Garzarella wrote: On Thu, Apr 04, 2024 at 04:09:34PM +0200, David Hildenbrand wrote: On 04.04.24 14:23, Stefano Garzarella wrote: shm_open() creates and opens a new POSIX shared memory object. A POSIX shared memory object allows creating memory backend with an associated file descriptor that can be shared with external processes (e.g. vhost-user). The new `memory-backend-shm` can be used as an alternative when `memory-backend-memfd` is not available (Linux only), since shm_open() should be provided by any POSIX-compliant operating system. This backend mimics memfd, allocating memory that is practically anonymous. In theory shm_open() requires a name, but this is allocated for a short time interval and shm_unlink() is called right after shm_open(). After that, only fd is shared with external processes (e.g., vhost-user) as if it were associated with anonymous memory. In the future we may also allow the user to specify the name to be passed to shm_open(), but for now we keep the backend simple, mimicking anonymous memory such as memfd. Signed-off-by: Stefano Garzarella --- v3 - enriched commit message and documentation to highlight that we want to mimic memfd (David) --- docs/system/devices/vhost-user.rst | 5 +- qapi/qom.json | 17 + backends/hostmem-shm.c | 118 + backends/meson.build | 1 + qemu-options.hx| 11 +++ 5 files changed, 150 insertions(+), 2 deletions(-) create mode 100644 backends/hostmem-shm.c diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst index 9b2da106ce..35259d8ec7 100644 --- a/docs/system/devices/vhost-user.rst +++ b/docs/system/devices/vhost-user.rst @@ -98,8 +98,9 @@ Shared memory object In order for the daemon to access the VirtIO queues to process the requests it needs access to the guest's address space. This is -achieved via the ``memory-backend-file`` or ``memory-backend-memfd`` -objects. 
A reference to a file-descriptor which can access this object +achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or +``memory-backend-shm`` objects. +A reference to a file-descriptor which can access this object will be passed via the socket as part of the protocol negotiation. Currently the shared memory object needs to match the size of the main diff --git a/qapi/qom.json b/qapi/qom.json index 85e6b4f84a..5252ec69e3 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -721,6 +721,19 @@ '*hugetlbsize': 'size', '*seal': 'bool' } } +## +# @MemoryBackendShmProperties: +# +# Properties for memory-backend-shm objects. +# +# The @share boolean option is true by default with shm. +# +# Since: 9.1 +## +{ 'struct': 'MemoryBackendShmProperties', + 'base': 'MemoryBackendProperties', + 'data': { } } + Acked-by: David Hildenbrand One comment: we should maybe just forbid setting share=off. it doesn't make any sense and it can even result in an unexpected double memory consumption. We missed doing that for memfd, unfortunately. Good point! IIUC the `share` property is defined by the parent `hostmem`, so I should find a way to override the property here and disable the setter, or add an option to `hostmem` to make the property non-writable. Right, or simply fail later when you would find "share=off" in shm_backend_memory_alloc(). When ever supporting named shmem_open(), it could make sense for VM snapshotting. Right now it doesn't really make any sense. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 04.04.24 14:23, Stefano Garzarella wrote: shm_open() creates and opens a new POSIX shared memory object. A POSIX shared memory object allows creating memory backend with an associated file descriptor that can be shared with external processes (e.g. vhost-user). The new `memory-backend-shm` can be used as an alternative when `memory-backend-memfd` is not available (Linux only), since shm_open() should be provided by any POSIX-compliant operating system. This backend mimics memfd, allocating memory that is practically anonymous. In theory shm_open() requires a name, but this is allocated for a short time interval and shm_unlink() is called right after shm_open(). After that, only fd is shared with external processes (e.g., vhost-user) as if it were associated with anonymous memory. In the future we may also allow the user to specify the name to be passed to shm_open(), but for now we keep the backend simple, mimicking anonymous memory such as memfd. Signed-off-by: Stefano Garzarella --- v3 - enriched commit message and documentation to highlight that we want to mimic memfd (David) --- docs/system/devices/vhost-user.rst | 5 +- qapi/qom.json | 17 + backends/hostmem-shm.c | 118 + backends/meson.build | 1 + qemu-options.hx| 11 +++ 5 files changed, 150 insertions(+), 2 deletions(-) create mode 100644 backends/hostmem-shm.c diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst index 9b2da106ce..35259d8ec7 100644 --- a/docs/system/devices/vhost-user.rst +++ b/docs/system/devices/vhost-user.rst @@ -98,8 +98,9 @@ Shared memory object In order for the daemon to access the VirtIO queues to process the requests it needs access to the guest's address space. This is -achieved via the ``memory-backend-file`` or ``memory-backend-memfd`` -objects. A reference to a file-descriptor which can access this object +achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or +``memory-backend-shm`` objects. 
+A reference to a file-descriptor which can access this object will be passed via the socket as part of the protocol negotiation. Currently the shared memory object needs to match the size of the main diff --git a/qapi/qom.json b/qapi/qom.json index 85e6b4f84a..5252ec69e3 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -721,6 +721,19 @@ '*hugetlbsize': 'size', '*seal': 'bool' } } +## +# @MemoryBackendShmProperties: +# +# Properties for memory-backend-shm objects. +# +# The @share boolean option is true by default with shm. +# +# Since: 9.1 +## +{ 'struct': 'MemoryBackendShmProperties', + 'base': 'MemoryBackendProperties', + 'data': { } } + Acked-by: David Hildenbrand One comment: we should maybe just forbid setting share=off. it doesn't make any sense and it can even result in an unexpected double memory consumption. We missed doing that for memfd, unfortunately. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. `O_CREAT | O_EXCL` should provide just this behavior. From shm_open(3) manpage: O_EXCL If O_CREAT was also specified, and a shared memory object with the given name already exists, return an error. The check for the existence of the object, and its creation if it does not exist, are performed atomically. Cool! That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? Right. It's a name that will only "appear" in the system for a very short time since we call shm_unlink() right after shm_open(). So I just need the unique name to prevent several QEMUs launched at the same time from colliding with the name. Okay, essentially emulating tmpfile(), but for weird shm_open() :) So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. Exactly! I can clarify this in the commit description as well. Thanks for the helpful comments! If there is anything I need to change for v3, please tell me ;-) Besides some patch description extension, makes sense to me :) -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 27.03.24 11:23, Stefano Garzarella wrote: On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote: +mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I thought about it and initially did it that way, but then some problems came up so I tried to keep it as simple as possible for the user and for our use case (having an fd associated with memory and sharing it with other processes). The problems I had were: - what mode_t to use if the object does not exist and needs to be created? > - exclude O_EXCL if the user passes the name since they may have already created it? I'd handle both like we handle files. But I understand now that you really just want to "simulate" memfd. - call shm_unlink() only at object deallocation? Right, we don't really do that for ordinary files, they keep existing. We only have the "discard-data" property to free up memory. For memfd, it just happens "automatically". They cannot be looked up by name, and once the last user closes the fd, it gets destroyed. So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. 
That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. -- Cheers, David / dhildenb
Re: [PATCH-for-9.1 v2 13/21] hw/mem/pc-dimm: Remove legacy_align argument from pc_dimm_pre_plug()
On 27.03.24 10:51, Philippe Mathieu-Daudé wrote: 'legacy_align' is always NULL, remove it. Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Thomas Huth Message-Id: <20240305134221.30924-11-phi...@linaro.org> --- I was really confused for a second until I saw that this series is dependent on another one. Please make sure to CC people you CC on patches to also CC on the cover letter. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH-for-9.1 v2 14/21] hw/mem/memory-device: Remove legacy_align from memory_device_pre_plug()
void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms, -const uint64_t *legacy_align, Error **errp) +Error **errp) { const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md); Error *local_err = NULL; @@ -388,14 +388,10 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms, return; } -if (legacy_align) { -align = *legacy_align; -} else { -if (mdc->get_min_alignment) { -align = mdc->get_min_alignment(md); -} -align = MAX(align, memory_region_get_alignment(mr)); +if (mdc->get_min_alignment) { +align = mdc->get_min_alignment(md); } +align = MAX(align, memory_region_get_alignment(mr)); addr = mdc->get_addr(md); addr = memory_device_get_free_addr(ms, !addr ? NULL : &addr, align, memory_region_size(mr), &local_err); Very nice! Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing
I was wondering if we could see some partial sendmsg()/write succeeding. Meaning, we transferred some bytes but not all, and we'd actually need to loop ... Yep, true, but I would fix it in another patch/series if you agree. Absolutely. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
+mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing
On 26.03.24 15:34, Eric Blake wrote: On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote: In vu_message_write() we use sendmsg() to send the message header, then a write() to send the payload. If sendmsg() fails we should avoid sending the payload, since we were unable to send the header. Discovered before fixing the issue with the previous patch, where sendmsg() failed on macOS due to wrong parameters, but the frontend still sent the payload which the backend incorrectly interpreted as a wrong header. Signed-off-by: Stefano Garzarella --- subprojects/libvhost-user/libvhost-user.c | 5 + 1 file changed, 5 insertions(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 22bea0c775..a11afd1960 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg) rc = sendmsg(conn_fd, &msg, 0); } while (rc < 0 && (errno == EINTR || errno == EAGAIN)); +if (rc <= 0) { Is rejecting a 0 return value correct? Technically, a 0 return means a successful write of 0 bytes - but then again, it is unwise to attempt to send an fd or other auxiliary data without at least one regular byte, so we should not be attempting a write of 0 bytes. So I guess this one is okay, although I might have used < instead of <=. I was wondering if we could see some partial sendmsg()/write succeeding. Meaning, we transferred some bytes but not all, and we'd actually need to loop ... -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty
On 26.03.24 14:39, Stefano Garzarella wrote: On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if the `struct msghdr` has the field `msg_controllen` set to 0, but `msg_control` is not NULL. Signed-off-by: Stefano Garzarella --- subprojects/libvhost-user/libvhost-user.c | 1 + 1 file changed, 1 insertion(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a879149fef..22bea0c775 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg) memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize); } else { msg.msg_controllen = 0; +msg.msg_control = NULL; } do { Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands
# # Usually, a CPU model is compared against the maximum possible CPU # model of a certain configuration (e.g. the "host" model for KVM). @@ -154,7 +155,14 @@ # Some architectures may not support comparing CPU models. s390x # supports comparing CPU models. # -# Returns: a CpuModelBaselineInfo +# @modela: description of the first CPU model to compare, referred to as +# "model A" in CpuModelCompareResult +# +# @modelb: description of the second CPU model to compare, referred to as +# "model B" in CpuModelCompareResult +# +# Returns: a CpuModelCompareInfo, describing how both CPU models Scratch the comma? Agreed to both. Do you want to fixup when applying or do you prefer a v2? Thanks! Reviewed-by: Markus Armbruster -- Cheers, David / dhildenb
[PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands
Let's document the parameters of these commands, so we can remove them from the "documentation-exceptions" list. While at it, extend the "Returns:" documentation as well, fixing a wrong use of CpuModelBaselineInfo vs. CpuModelCompareInfo for query-cpu-model-comparison. Cc: Markus Armbruster Cc: Eric Blake Cc: Eduardo Habkost Cc: Marcel Apfelbaum Cc: "Philippe Mathieu-Daudé" Cc: Yanan Wang Signed-off-by: David Hildenbrand --- qapi/machine-target.json | 46 +++- qapi/pragma.json | 3 --- 2 files changed, 31 insertions(+), 18 deletions(-) diff --git a/qapi/machine-target.json b/qapi/machine-target.json index 519adf3220..7883616cce 100644 --- a/qapi/machine-target.json +++ b/qapi/machine-target.json @@ -124,11 +124,12 @@ ## # @query-cpu-model-comparison: # -# Compares two CPU models, returning how they compare in a specific -# configuration. The results indicates how both models compare -# regarding runnability. This result can be used by tooling to make -# decisions if a certain CPU model will run in a certain configuration -# or if a compatible CPU model has to be created by baselining. +# Compares two CPU models, @modela and @modelb, returning how they +# compare in a specific configuration. The results indicates how +# both models compare regarding runnability. This result can be +# used by tooling to make decisions if a certain CPU model will +# run in a certain configuration or if a compatible CPU model has +# to be created by baselining. # # Usually, a CPU model is compared against the maximum possible CPU # model of a certain configuration (e.g. the "host" model for KVM). @@ -154,7 +155,14 @@ # Some architectures may not support comparing CPU models. s390x # supports comparing CPU models. 
# -# Returns: a CpuModelBaselineInfo +# @modela: description of the first CPU model to compare, referred to as +# "model A" in CpuModelCompareResult +# +# @modelb: description of the second CPU model to compare, referred to as +# "model B" in CpuModelCompareResult +# +# Returns: a CpuModelCompareInfo, describing how both CPU models +# compare # # Errors: # - if comparing CPU models is not supported @@ -175,9 +183,9 @@ ## # @query-cpu-model-baseline: # -# Baseline two CPU models, creating a compatible third model. The -# created model will always be a static, migration-safe CPU model (see -# "static" CPU model expansion for details). +# Baseline two CPU models, @modela and @modelb, creating a compatible +# third model. The created model will always be a static, +# migration-safe CPU model (see "static" CPU model expansion for details). # # This interface can be used by tooling to create a compatible CPU # model out two CPU models. The created CPU model will be identical @@ -204,7 +212,11 @@ # Some architectures may not support baselining CPU models. s390x # supports baselining CPU models. # -# Returns: a CpuModelBaselineInfo +# @modela: description of the first CPU model to baseline +# +# @modelb: description of the second CPU model to baseline +# +# Returns: a CpuModelBaselineInfo, describing the baselined CPU model # # Errors: # - if baselining CPU models is not supported @@ -243,10 +255,10 @@ ## # @query-cpu-model-expansion: # -# Expands a given CPU model (or a combination of CPU model + -# additional options) to different granularities, allowing tooling to -# get an understanding what a specific CPU model looks like in QEMU -# under a certain configuration. +# Expands a given CPU model, @model, (or a combination of CPU model + +# additional options) to different granularities, specified by +# @type, allowing tooling to get an understanding what a specific +# CPU model looks like in QEMU under a certain configuration. 
# # This interface can be used to query the "host" CPU model. # @@ -269,7 +281,11 @@ # Some architectures may not support all expansion types. s390x # supports "full" and "static". Arm only supports "full". # -# Returns: a CpuModelExpansionInfo +# @model: description of the CPU model to expand +# +# @type: expansion type, specifying how to expand the CPU model +# +# Returns: a CpuModelExpansionInfo, describing the expanded CPU model # # Errors: # - if expanding CPU models is not supported diff --git a/qapi/pragma.json b/qapi/pragma.json index 6929ab776e..0d82bc1a03 100644 --- a/qapi/pragma.json +++ b/qapi/pragma.json @@ -90,9 +90,6 @@ 'XDbgBlockGraph', 'YankInstanceType', 'blockdev-reopen', -'query-cpu-model-baseline', -'query-cpu-model-comparison', -'query-cpu-model-expansion', 'query-rocker', 'query-rocker-ports', 'query-stats-schemas', -- 2.43.2
Re: Let's close member documentation gaps
On 25.03.24 15:13, David Hildenbrand wrote: On 25.03.24 10:36, Markus Armbruster wrote: If you're cc'ed, I have a bit of doc work for you. Search for your name to find it. The QAPI generator forces you to document your stuff. Except for commands, events, enum and object types listed in pragma documentation-exceptions, the generator silently defaults missing documentation to "Not documented". Right now, we're using this loophole some 500 times. What would be the right step to make sure I am resolving these "hidden" issues, and that the QAPI generator would be happy with my changes? Ah, I assume simply removing the offender from "qapi/pragma.json" and then compiling. -- Cheers, David / dhildenb
Re: Let's close member documentation gaps
On 25.03.24 10:36, Markus Armbruster wrote: If you're cc'ed, I have a bit of doc work for you. Search for your name to find it. The QAPI generator forces you to document your stuff. Except for commands, events, enum and object types listed in pragma documentation-exceptions, the generator silently defaults missing documentation to "Not documented". Right now, we're using this loophole some 500 times. What would be the right step to make sure I am resolving these "hidden" issues, and that the QAPI generator would be happy with my changes? -- Cheers, David / dhildenb
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 20.03.24 18:38, Michael Roth wrote: On Wed, Mar 20, 2024 at 10:37:14AM +0100, David Hildenbrand wrote: On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in v4: - Drop ram_block_convert_range() and open code its implementation in the next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" Please make sure to send at least the cover letter along (I might not need the other 46 patches :D ). Sorry for the confusion, you got auto-Cc'd by git, which is good, but not sure there's a good way to make sure everyone gets a copy of the cover letter. I could see how it would be useful to potential reviewers though. I'll try to come up with a script for it and take that approach in the future. A script shared with me in the past to achieve that in most cases:

$ cat cc-cmd.sh
#!/bin/bash
if [[ $1 == *gitsendemail.msg* || $1 == *cover-letter* ]]; then
    grep ': .* <.*@.*>' -h *.patch | sed 's/^.*: //' | sort | uniq
fi

And attach to "git send-email ... *.patch": --cc-cmd=./cc-cmd.sh -- Cheers, David / dhildenb
Re: [PATCH 2/4] hw/s390x/virtio-ccw: Always deliver NMI to first CPU
On 20.02.24 16:08, Philippe Mathieu-Daudé wrote: We can trigger NMI from HMP or QMP. QEMU maps the NMI to the s390x per-CPU 'RESTART' interrupt. Linux guests usually set up this interrupt to trigger kdump or crash. Such a crashdump can be triggered in QEMU by the HMP "nmi" or QMP "inject-nmi" commands. Using QMP, since we cannot select a particular CPU, the first CPU is used (CPU#0). See the documentation from commit 795dc6e4 ("watchdog: Add new Virtual Watchdog action INJECT-NMI"): @inject-nmi: a non-maskable interrupt is injected into the first VCPU (all VCPUS on x86) (since 2.4) While we can select a particular CPU on HMP, the guest behavior is expected to be the same if using CPU #N or CPU #0. Since always using CPU#0 simplifies API maintenance, update s390_nmi() to deliver the NMI to the first CPU. Signed-off-by: Philippe Mathieu-Daudé --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 20.03.24 13:43, Xiaoyao Li wrote: On 3/20/2024 5:37 PM, David Hildenbrand wrote: On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it. Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in in v4: - Drop ram_block_convert_range() and open code its implementation in the next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" As above, because the guest_memfd patches in my TDX-QEMU v5[1] were directly picked for this series, so the change history says v5. They are needed by SEV-SNP as well. Thanks, I was missing the context without a cover letter. These comments here likely should be dropped here. -- Cheers, David / dhildenb
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in in v4: - Drop ram_block_convert_range() and open code its implementation in the next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" Please make sure to send at least the cover letter along (I might not need the other 46 patches :D ). -- Cheers, David / dhildenb
Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
On 19.03.24 15:23, Peter Maydell wrote: On Tue, 19 Mar 2024 at 09:24, David Hildenbrand wrote: I spotted new pause_all_vcpus() / resume_all_vcpus() calls in hw/intc/arm_gicv3_kvm.c and thought they would be the problematic bit. Yeah, that's going to be problematic. Further note that a lot of code does not expect that the BQL is suddenly dropped. Agreed; we already have one nasty set of bugs in the framebuffer devices because a function drops the BQL briefly: https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/#u so let's avoid introducing any more of a similar kind. Side note, the pause_all_vcpus()/resume_all_vcpus() calls in hw/i386/vapic.c are probably a bit suspect for similar reasons. Exactly my thoughts. But there, it was less clear "why" it is even required. It's only performed for KVM. Do we also just want to stop KVM threads from executing instructions? If so, blocking KVM ioctls might be a reasonable "replacement". Really not sure. -- Cheers, David / dhildenb
Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
On 19.03.24 10:24, David Hildenbrand wrote: On 19.03.24 06:06, zhukeqian wrote: Hi David, Thanks for reviewing. On 17.03.24 09:37, Keqian Zhu via wrote: Both main loop thread and vCPU thread are allowed to call pause_all_vcpus(), and in general resume_all_vcpus() is called after it. Two issues live in pause_all_vcpus(): In general, calling pause_all_vcpus() from VCPU threads is quite dangerous. Do we have reproducers for the cases below? I produce the issues by testing ARM vCPU hotplug feature: QEMU changes for vCPU hotplug could be cloned from below site, https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2 Guest Kernel changes (by James Morse, ARM) are available here: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git virtual_cpu_hotplug/rfc/v2 Thanks for these infos (would be reasonable to include that in the cover letter). Okay, so likely this is not actually a "fix" for upstream as it is. Understood. The procedure to produce problems: 1. Startup a Linux VM (e.g., called OS-vcpuhotplug) with 32 possible vCPUs and 16 current vCPUs. 2. Log in guestOS and run script[1] to continuously online/offline CPU. 3. At host side, run script[2] to continuously hotplug/unhotplug vCPU. After several minutes, we can hit these problems. 
Script[1] to online/offline CPU: for ((time=1;time<1000;time++)); do for ((cpu=16;cpu<32;cpu++)); do echo 1 > /sys/devices/system/cpu/cpu$cpu/online done for ((cpu=16;cpu<32;cpu++)); do echo 0 > /sys/devices/system/cpu/cpu$cpu/online done done Script[2] to hotplug/unhotplug vCPU: for ((time=1;time<100;time++)); do echo $time for ((cpu=16;cpu<=32;cpu++)); do echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live" virsh setvcpus OS-vcpuhotplug --count $cpu --live sleep 2 done for ((cpu=32;cpu>=16;cpu--)); do echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live" virsh setvcpus OS-vcpuhotplug --count $cpu --live sleep 2 done for ((cpu=16;cpu<=32;cpu+=2)); do echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live" virsh setvcpus OS-vcpuhotplug --count $cpu --live sleep 2 done for ((cpu=32;cpu>=16;cpu-=2)); do echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live" virsh setvcpus OS-vcpuhotplug --count $cpu --live sleep 2 done done The script[1] will call PSCI CPU_ON which emulated by QEMU, which result in calling cpu_reset() on vCPU thread. I spotted new pause_all_vcpus() / resume_all_vcpus() calls in hw/intc/arm_gicv3_kvm.c and thought they would be the problematic bit. Yeah, that's going to be problematic. Further note that a lot of code does not expect that the BQL is suddenly dropped. We had issues with that in different context where we ended up wanting to use pause/resume from VCPU context: https://lore.kernel.org/all/294a987d-b0ef-1b58-98ac-0d4d43075...@redhat.com/ This sounds like a bad idea. Read below. For ARM architecture, it needs to reset GICC registers, which is only possible when all vcpus paused. So script[1] will call pause_all_vcpus() in vCPU thread. The script[2] also calls cpu_reset() for newly hotplugged vCPU, which is done in main loop thread. So this scenario causes problems as I state in commit message. 1. 
There is a possibility that while thread T1 waits on qemu_pause_cond with the bql unlocked, another thread has called pause_all_vcpus() and resume_all_vcpus(); then thread T1 will be stuck, because the condition all_vcpus_paused() is always false. How can this happen? Two threads calling pause_all_vcpus() is borderline broken, as you note. IIRC, we should call pause_all_vcpus() only if some other mechanism prevents these races. For example, based on runstate changes. We already have the bql to prevent concurrent calling of pause_all_vcpus() and resume_all_vcpus(). But pause_all_vcpus() will unlock the bql halfway through, which gives another thread a chance to call pause and resume. In the past, the code did not consider this problem; now I add a retry mechanism to fix it. Note that the BQL did not prevent concurrent calling of pause_all_vcpus(). There had to be something else. Likely that was runstate transitions. Just imagine one thread calling pause_all_vcpus() while another one calls resume_all_vcpus(). It cannot possibly work. With the bql, we can make sure all vcpus are paused after pause_all_vcpus() finishes, and all vcpus are resumed after resume_all_vcpus() finishes. For example, the following situation may occur: Thread T1: lock bql ->pau
Re: 答复: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition
On 19.03.24 06:11, zhukeqian wrote: Hi David, On 17.03.24 09:37, Keqian Zhu via wrote: For a vCPU being hotplugged, qemu_init_vcpu() is called. In this function, we set the vcpu state to stopped, and then wait for the vcpu thread to be created. As the vcpu state is stopped, it will inform us it has been created and then wait on halt_cond. After we have realized the vcpu object, we will resume the vcpu thread. However, while we wait for the vcpu thread to be created, the bql is unlocked, and another thread is allowed to call resume_all_vcpus(), which will resume the un-realized vcpu. This fixes the issue by filtering out un-realized vcpus during resume_all_vcpus(). Similar question: is there a reproducer? How could we currently hotplug a VCPU, and while it is being created, see pause_all_vcpus()/resume_all_vcpus() getting called. I described the reason for this in patch 1. If I am not getting this wrong, there seems to be some other mechanism missing that makes sure that this cannot happen. Dropping the BQL half-way through creating a VCPU might be the problem. When we add a retry mechanism in pause_all_vcpus(), we can solve this problem. With the semantics unchanged for the user, which means: With the bql, we can make sure all vcpus are paused after pause_all_vcpus() finishes, and all vcpus are resumed after resume_all_vcpus() finishes. Okay, got it. As just replied to #1, please see if you can avoid messing with pause_all_vcpus() by inhibiting KVM IOCTLs like KVM does. That would be preferable. -- Cheers, David / dhildenb
Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
to lock bql-> lock bql && all_vcpu_paused -> success and do other work -> unlock bql Thread T2: wait T1 unlock bql to lock bql-> lock bql-> resume_all_vcpus -> success and do other work -> unlock bql Now trow in another thread and it all gets really complicated :) Finding ways to avoid pause_all_vcpus() on the ARM reset code would be preferable. I guess you simply want to do something similar to what KVM does to avoid messing with pause_all_vcpus(): inhibiting certain IOCTLs. commit f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 Author: David Hildenbrand Date: Fri Nov 11 10:47:58 2022 -0500 kvm: Atomic memslot updates If we update an existing memslot (e.g., resize, split), we temporarily remove the memslot to re-add it immediately afterwards. These updates are not atomic, especially not for KVM VCPU threads, such that we can get spurious faults. Let's inhibit most KVM ioctls while performing relevant updates, such that we can perform the update just as if it would happen atomically without additional kernel support. We capture the add/del changes and apply them in the notifier commit stage instead. There, we can check for overlaps and perform the ioctl inhibiting only if really required (-> overlap). To keep things simple we don't perform additional checks that wouldn't actually result in an overlap -- such as !RAM memory regions in some cases (see kvm_set_phys_mem()). To minimize cache-line bouncing, use a separate indicator (in_ioctl_lock) per CPU. Also, make sure to hold the kvm_slots_lock while performing both actions (removing+re-adding). We have to wait until all IOCTLs were exited and block new ones from getting executed. This approach cannot result in a deadlock as long as the inhibitor does not hold any locks that might hinder an IOCTL from getting finished and exited - something fairly unusual. The inhibitor will always hold the BQL. AFAIKs, one possible candidate would be userfaultfd. 
If a page cannot be placed (e.g., during postcopy), because we're waiting for a lock, or if the userfaultfd thread cannot process a fault, because it is waiting for a lock, there could be a deadlock. However, the BQL is not applicable here, because any other guest memory access while holding the BQL would already result in a deadlock. Nothing else in the kernel should block forever and wait for userspace intervention.

Note: pause_all_vcpus()/resume_all_vcpus() or start_exclusive()/end_exclusive() cannot be used, as they either drop the BQL or require to be called without the BQL - something inhibitors cannot handle. We need a low-level locking mechanism that is deadlock-free even when not releasing the BQL.

-- Cheers, David / dhildenb
Re: [PATCH v2 1/2] target/s390x: Use mutable temporary value for op_ts
On 18.03.24 21:27, Ilya Leoshkevich wrote: From: Ido Plat Otherwise TCG would assume the register that holds t1 is constant and reuse it whenever it needs the value within it. Cc: qemu-sta...@nongnu.org Fixes: f1ea739bd598 ("target/s390x: Use tcg_constant_* in local contexts") Reviewed-by: Ilya Leoshkevich Reviewed-by: Richard Henderson [iii: Adjust a newline and capitalization, add tags] Signed-off-by: Ido Plat Signed-off-by: Ilya Leoshkevich --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition
On 17.03.24 09:37, Keqian Zhu via wrote: For a vCPU being hotplugged, qemu_init_vcpu() is called. In this function, we set the vcpu state to stopped, and then wait for the vcpu thread to be created. As the vcpu state is stopped, the thread will inform us it has been created and then wait on halt_cond. After we have realized the vcpu object, we resume the vcpu thread. However, while we wait for the vcpu thread to be created, the bql is unlocked, and other threads are allowed to call resume_all_vcpus(), which will resume the un-realized vcpu. This fixes the issue by filtering out un-realized vcpus in resume_all_vcpus().

Similar question: is there a reproducer? How could we currently hotplug a vCPU, and while it is being created, see pause_all_vcpus()/resume_all_vcpus() getting called? If I am not getting this wrong, there seems to be some other mechanism missing that makes sure that this cannot happen. Dropping the BQL half-way through creating a VCPU might be the problem.

-- Cheers, David / dhildenb
Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
On 17.03.24 09:37, Keqian Zhu via wrote: Both the main loop thread and vCPU threads are allowed to call pause_all_vcpus(), and in general resume_all_vcpus() is called after it. Two issues live in pause_all_vcpus():

In general, calling pause_all_vcpus() from VCPU threads is quite dangerous. Do we have reproducers for the cases below?

1. There is a possibility that while thread T1 waits on qemu_pause_cond with the bql unlocked, another thread has called pause_all_vcpus() and resume_all_vcpus(); then thread T1 will get stuck, because the condition all_vcpus_paused() is always false.

How can this happen? Two threads calling pause_all_vcpus() is borderline broken, as you note. IIRC, we should call pause_all_vcpus() only if some other mechanism prevents these races. For example, based on runstate changes. Just imagine one thread calling pause_all_vcpus() while another one calls resume_all_vcpus(). It cannot possibly work.

2. After all_vcpus_paused() has been checked as true, we will unlock the bql to relock replay_mutex. While the bql was unlocked, the vcpu's state may have been changed by another thread, so we must retry.
Signed-off-by: Keqian Zhu
---
 system/cpus.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..4e41abe23e 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -571,12 +571,14 @@ static bool all_vcpus_paused(void)
     return true;
 }
 
-void pause_all_vcpus(void)
+static void request_pause_all_vcpus(void)
 {
     CPUState *cpu;
 
-    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
     CPU_FOREACH(cpu) {
+        if (cpu->stopped) {
+            continue;
+        }
         if (qemu_cpu_is_self(cpu)) {
             qemu_cpu_stop(cpu, true);
         } else {
@@ -584,6 +586,14 @@ void pause_all_vcpus(void)
             qemu_cpu_kick(cpu);
         }
     }
+}
+
+void pause_all_vcpus(void)
+{
+    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
+
+retry:
+    request_pause_all_vcpus();
 
     /* We need to drop the replay_lock so any vCPU threads woken up
      * can finish their replay tasks
      */
@@ -592,14 +602,23 @@
     while (!all_vcpus_paused()) {
         qemu_cond_wait(&qemu_pause_cond, &bql);
-        CPU_FOREACH(cpu) {
-            qemu_cpu_kick(cpu);
-        }
+        /* While we waited on qemu_pause_cond the bql was unlocked,
+         * the vcpu's state may have been changed by another thread, so
+         * we must request the pause state on all vcpus again.
+         */
+        request_pause_all_vcpus();
     }
 
     bql_unlock();
     replay_mutex_lock();
     bql_lock();
+
+    /* While the bql was unlocked, the vcpu's state may have been
+     * changed by another thread, so we must retry.
+     */
+    if (!all_vcpus_paused()) {
+        goto retry;
+    }
 }
 
 void cpu_resume(CPUState *cpu)

-- Cheers, David / dhildenb
Re: [PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code
On 11.03.24 21:03, Mario Casquero wrote: This series has been successfully tested by QE. Start the qemu-storage-daemon in the background with a rhel 9.5 image and vhost-user-blk. After that, boot up a VM with virtio-mem and vhost-user-blk-pci. Check with the HMP command 'info mtree' that virtio-mem is making use of multiple memslots. Tested-by: Mario Casquero Thanks Mario! -- Cheers, David / dhildenb
Re: [PATCH v2 0/4] physmem: Fix MemoryRegion for second access to cached MMIO Address Space
On 08.03.24 08:36, Peter Xu wrote: On Thu, Mar 07, 2024 at 03:37:06PM +, Jonathan Cameron wrote: v2: (Thanks to Peter Xu for reviewing!) - New patch 1 to rename addr1 to mr_addr in the interests of meaningful naming. - Take advantage of a cached address space only allow for a single MR to simplify the new code. - Various cleanups of indentation etc. - Cover letter and some patch descriptions updated to reflect changes. - Changes all called out in specific patches. All look good to me, thanks. Having the new functions' first argument as MemTxAttrs is slightly weird to me, but that's not a big deal and we can clean it up later if wanted. I guess it's good to fix this in 9.0 first as it's a real bug even if not trivial to hit. I queued it in my migration tree (with my "memory API" hat..). I won't send a pull until next Monday. Please shoot if there's any objections! Works for me, thanks! -- Cheers, David / dhildenb
Re: [PATCH v2 4/4] physmem: Fix wrong address in large address_space_read/write_cached_slow()
On 07.03.24 16:37, Jonathan Cameron wrote: If the access is bigger than the MemoryRegion supports, flatview_read/write_continue() will attempt to update the MemoryRegion, but the address passed to flatview_translate() is relative to the cache, not to the FlatView. On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this led to the first part of the descriptor being read from the CXL memory and the second part from PA 0x8 which happens to be a blank region of a flash chip and all ffs on this particular configuration. Note this test requires the out of tree ARM support for CXL, but the problem is more general.

Avoid this by adding new address_space_read_continue_cached() and address_space_write_continue_cached() which share all the logic with the flatview versions except for the MemoryRegion lookup, which is unnecessary as the MemoryRegionCache only covers one MemoryRegion. Signed-off-by: Jonathan Cameron
---
v2: Review from Peter Xu
- Drop additional lookups of the MemoryRegion via address_space_translate_cached() as it will always return the same answer.
- Drop various parameters that are then unused.
- Rename addr1 to mr_addr.
- Drop a fuzz_dma_read_cb(). Could put this back but it means carrying the address into the inner call, and the only in-tree fuzzer checks if it is normal RAM and if not does nothing anyway. We don't hit this path for normal RAM.
---
 system/physmem.c | 63 +++-
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 1264eab24b..701bea27dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3381,6 +3381,59 @@ static inline MemoryRegion *address_space_translate_cached(
     return section.mr;
 }
 
+/* Called within RCU critical section. */
+static MemTxResult address_space_write_continue_cached(MemTxAttrs attrs,
+                                                       const void *ptr,
+                                                       hwaddr len,
+                                                       hwaddr mr_addr,
+                                                       hwaddr l,
+                                                       MemoryRegion *mr)

The only thing that is really confusing is hwaddr len, ... hwaddr l, ehm, what? ...
but it fits the style of flatview_read_continue(), so at least the level of confusion this creates is consistent with the other code. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] physmem: Factor out body of flatview_read/write_continue() loop
On 07.03.24 16:37, Jonathan Cameron wrote: This code will be reused for the address_space_cached accessors shortly. Also reduce scope of result variable now we aren't directly calling this in the loop. Signed-off-by: Jonathan Cameron --- v2: Thanks to Peter Xu - Fix alignment of code. - Drop unused addr parameter. - Carry through new mr_addr parameter name. - RB not picked up as not sure what Peter will think wrt to resulting parameter ordering. --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/4] physmem: Reduce local variable scope in flatview_read/write_continue()
On 07.03.24 16:37, Jonathan Cameron wrote: Precursor to factoring out the inner loops for reuse. Reviewed-by: Peter Xu Signed-off-by: Jonathan Cameron --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar
On 07.03.24 16:37, Jonathan Cameron wrote: The calls to flatview_read/write[_continue]() have parameters addr and addr1 but the names give no indication of what they are addresses of. Rename addr1 to mr_addr to reflect that it is the translated address offset within the MemoryRegion returned by flatview_translate(). Similarly rename the parameter in address_space_read/write_cached_slow() Suggested-by: Peter Xu Signed-off-by: Jonathan Cameron Much cleaner indeed! Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH] qapi: Fix format of the memory-backend-file's @rom property doc comment
On 29.02.24 11:58, Stefano Garzarella wrote: Reflow paragraph following commit a937b6aa73 ("qapi: Reformat doc comments to conform to current conventions"): use 4 spaces indentation, 70 columns width, and two spaces to separate sentences. Suggested-by: Markus Armbruster Signed-off-by: Stefano Garzarella --- qapi/qom.json | 27 ++- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/qapi/qom.json b/qapi/qom.json index 2a6e49365a..db1b0fdea2 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -668,19 +668,20 @@ # @readonly: if true, the backing file is opened read-only; if false, # it is opened read-write. (default: false) # -# @rom: whether to create Read Only Memory (ROM) that cannot be modified -# by the VM. Any write attempts to such ROM will be denied. Most -# use cases want writable RAM instead of ROM. However, selected use -# cases, like R/O NVDIMMs, can benefit from ROM. If set to 'on', -# create ROM; if set to 'off', create writable RAM; if set to -# 'auto', the value of the @readonly property is used. This -# property is primarily helpful when we want to have proper RAM in -# configurations that would traditionally create ROM before this -# property was introduced: VM templating, where we want to open a -# file readonly (@readonly set to true) and mark the memory to be -# private for QEMU (@share set to false). For this use case, we need -# writable RAM instead of ROM, and want to set this property to 'off'. -# (default: auto, since 8.2) +# @rom: whether to create Read Only Memory (ROM) that cannot be +# modified by the VM. Any write attempts to such ROM will be +# denied. Most use cases want writable RAM instead of ROM. +# However, selected use cases, like R/O NVDIMMs, can benefit from +# ROM. If set to 'on', create ROM; if set to 'off', create +# writable RAM; if set to 'auto', the value of the @readonly +# property is used. 
This property is primarily helpful when we +# want to have proper RAM in configurations that would +# traditionally create ROM before this property was introduced: VM +# templating, where we want to open a file readonly (@readonly set +# to true) and mark the memory to be private for QEMU (@share set +# to false). For this use case, we need writable RAM instead of +# ROM, and want to set this property to 'off'. (default: auto, +# since 8.2) # # Since: 2.1 ## Ideally, we'd have a format checker that complains like checkpatch usually would. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 9/9] hostmem-file: support POSIX shm_open()
On 28.02.24 12:47, Stefano Garzarella wrote: Add a new `shm` bool option for `-object memory-backend-file`. When this option is set to true, the POSIX shm_open(3) is used instead of open(2). So a file will not be created in the filesystem, but a "POSIX shared memory object" will be instantiated. In Linux this turns into a file in /dev/shm, but in other OSes this may not happen (for example in macOS or FreeBSD nothing is shown in any filesystem). This new feature is useful when we need to share guest memory with another process (e.g. vhost-user backend), but we don't have memfd_create() or any special filesystems (e.g. /dev/shm) available, as on macOS. Signed-off-by: Stefano Garzarella --- I am not sure this is the best way to support shm_open() in QEMU. Other solutions I had in mind were: - create a new memory-backend-shm - extend memory-backend-memfd to use shm_open() on systems where memfd is not available (problem: shm_open wants a name to assign to the object, but we can work around that by using a random name and doing the unlink right away) Any preference/suggestion?

Both sound like reasonable options, and IMHO better than hostmem-file with things that are not necessarily "real" files. Regarding memory-backend-memfd, we similarly have to pass a name to memfd_create(), although for a different purpose: "The name supplied in name is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/". So we simply pass TYPE_MEMORY_BACKEND_MEMFD.

Likely, memory-backend-shm that directly maps to shm_open() and only provides properties reasonable for shm_open() is the cleanest approach. So that would currently be my preference :)

-- Cheers, David / dhildenb
Re: [PATCH v4 33/66] i386/tdx: Make memory type private by default
On 29.01.24 03:18, Xiaoyao Li wrote: On 1/26/2024 10:58 PM, David Hildenbrand wrote: On 25.01.24 04:22, Xiaoyao Li wrote: By default (due to the recent UPM change), restricted memory attribute is shared. Convert the memory region from shared to private at the memory slot creation time. add kvm region registering function to check the flag and convert the region, and add memory listener to TDX guest code to set the flag to the possible memory region. Without this patch - Secure-EPT violation on private area - KVM_MEMORY_FAULT EXIT (kvm -> qemu) - qemu converts the 4K page from shared to private - Resume VCPU execution - Secure-EPT violation again - KVM resolves EPT Violation This also prevents huge page because page conversion is done at 4K granularity. Although it's possible to merge 4K private mapping into 2M large page, it slows guest boot. With this patch - After memory slot creation, convert the region from private to shared - Secure-EPT violation on private area. - KVM resolves EPT Violation Originated-from: Isaku Yamahata Signed-off-by: Xiaoyao Li --- include/exec/memory.h | 1 + target/i386/kvm/tdx.c | 20 2 files changed, 21 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 7229fcc0415f..f25959f6d30f 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -850,6 +850,7 @@ struct IOMMUMemoryRegion { #define MEMORY_LISTENER_PRIORITY_MIN 0 #define MEMORY_LISTENER_PRIORITY_ACCEL 10 #define MEMORY_LISTENER_PRIORITY_DEV_BACKEND 10 +#define MEMORY_LISTENER_PRIORITY_ACCEL_HIGH 20 /** * struct MemoryListener: callbacks structure for updates to the physical memory map diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index 7b250d80bc1d..f892551821ce 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -19,6 +19,7 @@ #include "standard-headers/asm-x86/kvm_para.h" #include "sysemu/kvm.h" #include "sysemu/sysemu.h" +#include "exec/address-spaces.h" #include "hw/i386/x86.h" #include "kvm_i386.h" @@ -621,6 +622,19 
@@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp) return 0; } +static void tdx_guest_region_add(MemoryListener *listener, + MemoryRegionSection *section) +{ + memory_region_set_default_private(section->mr); +} That looks fishy. Why is TDX to decide what happens to other memory regions it doesn't own? We should define that behavior when creating these memory region, and TDX could sanity check that they have been setup properly. Let me ask differently: For which memory region where we have RAM_GUEST_MEMFD set would we *not* want to set private as default right from the start? All memory regions have RAM_GUEST_MEMFD set will benefit from being private as default, for TDX guest. I will update the implementation to set RAM_DEFAULT_PRIVATE flag when guest_memfd is created successfully, like diff --git a/system/physmem.c b/system/physmem.c index fc59470191ef..60676689c807 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -1850,6 +1850,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) qemu_mutex_unlock_ramlist(); return; } + +new_block->flags |= RAM_DEFAULT_PRIVATE; } then this patch can be dropped, and the calling of memory_region_set_default_private(mr) of Patch 45 can be dropped too. I think this is what you suggested, right? Yes, if that works, great! -- Cheers, David / dhildenb
Re: [PATCH] system/physmem: remove redundant arg reassignment
On 15.02.24 10:15, Manos Pitsidianakis wrote: Arguments `ram_block` are reassigned to local declarations `block` without further use. Remove re-assignment to reduce noise. Signed-off-by: Manos Pitsidianakis --- system/physmem.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/system/physmem.c b/system/physmem.c index 5e66d9ae36..d4c3bfac65 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -2154,10 +2154,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) * * Called within RCU critical section. */ -void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr) +void *qemu_map_ram_ptr(RAMBlock *block, ram_addr_t addr) { -RAMBlock *block = ram_block; - if (block == NULL) { block = qemu_get_ram_block(addr); addr -= block->offset; @@ -2182,10 +2180,9 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr) * * Called within RCU critical section. */ -static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr, +static void *qemu_ram_ptr_length(RAMBlock *block, ram_addr_t addr, hwaddr *size, bool lock) { -RAMBlock *block = ram_block; if (*size == 0) { return NULL; } base-commit: 5767815218efd3cbfd409505ed824d5f356044ae Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
[PATCH v2 06/14] libvhost-user: No need to check for NULL when unmapping
We never add a memory region if mmap() failed. Therefore, no need to check for NULL. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index f43b5096d0..225283f764 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -247,11 +247,8 @@ vu_remove_all_mem_regs(VuDev *dev) for (i = 0; i < dev->nregions; i++) { VuDevRegion *r = &dev->regions[i]; -void *ma = (void *)(uintptr_t)r->mmap_addr; -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); } dev->nregions = 0; } @@ -888,11 +885,8 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { for (i = 0; i < dev->nregions; i++) { if (reg_equal(&dev->regions[i], msg_region)) { VuDevRegion *r = &dev->regions[i]; -void *ma = (void *) (uintptr_t) r->mmap_addr; -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); /* * Shift all affected entries by 1 to close the hole at index i and -- 2.43.0
[PATCH v2 04/14] libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()
Let's reduce some code duplication and prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 119 +++--- 1 file changed, 39 insertions(+), 80 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index e4907dfc26..a7bd7de3cd 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -937,95 +937,23 @@ vu_get_shared_object(VuDev *dev, VhostUserMsg *vmsg) } static bool -vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg) +vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) { -unsigned int i; VhostUserMemory m = vmsg->payload.memory, *memory = -dev->nregions = memory->nregions; - -DPRINT("Nregions: %u\n", memory->nregions); -for (i = 0; i < dev->nregions; i++) { -void *mmap_addr; -VhostUserMemoryRegion *msg_region = >regions[i]; -VuDevRegion *dev_region = >regions[i]; - -DPRINT("Region %d\n", i); -DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", - msg_region->guest_phys_addr); -DPRINT("memory_size: 0x%016"PRIx64"\n", - msg_region->memory_size); -DPRINT("userspace_addr 0x%016"PRIx64"\n", - msg_region->userspace_addr); -DPRINT("mmap_offset 0x%016"PRIx64"\n", - msg_region->mmap_offset); - -dev_region->gpa = msg_region->guest_phys_addr; -dev_region->size = msg_region->memory_size; -dev_region->qva = msg_region->userspace_addr; -dev_region->mmap_offset = msg_region->mmap_offset; +int prot = PROT_READ | PROT_WRITE; +unsigned int i; -/* We don't use offset argument of mmap() since the - * mapped address has to be page aligned, and we use huge - * pages. 
+if (dev->postcopy_listening) { +/* * In postcopy we're using PROT_NONE here to catch anyone * accessing it before we userfault */ -mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_NONE, MAP_SHARED | MAP_NORESERVE, - vmsg->fds[i], 0); - -if (mmap_addr == MAP_FAILED) { -vu_panic(dev, "region mmap error: %s", strerror(errno)); -} else { -dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr; -DPRINT("mmap_addr: 0x%016"PRIx64"\n", - dev_region->mmap_addr); -} - -/* Return the address to QEMU so that it can translate the ufd - * fault addresses back. - */ -msg_region->userspace_addr = (uintptr_t)(mmap_addr + - dev_region->mmap_offset); -close(vmsg->fds[i]); -} - -/* Send the message back to qemu with the addresses filled in */ -vmsg->fd_num = 0; -if (!vu_send_reply(dev, dev->sock, vmsg)) { -vu_panic(dev, "failed to respond to set-mem-table for postcopy"); -return false; -} - -/* Wait for QEMU to confirm that it's registered the handler for the - * faults. - */ -if (!dev->read_msg(dev, dev->sock, vmsg) || -vmsg->size != sizeof(vmsg->payload.u64) || -vmsg->payload.u64 != 0) { -vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table"); -return false; +prot = PROT_NONE; } -/* OK, now we can go and register the memory and generate faults */ -(void)generate_faults(dev); - -return false; -} - -static bool -vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) -{ -unsigned int i; -VhostUserMemory m = vmsg->payload.memory, *memory = - vu_remove_all_mem_regs(dev); dev->nregions = memory->nregions; -if (dev->postcopy_listening) { -return vu_set_mem_table_exec_postcopy(dev, vmsg); -} - DPRINT("Nregions: %u\n", memory->nregions); for (i = 0; i < dev->nregions; i++) { void *mmap_addr; @@ -1051,8 +979,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) * mapped address has to be page aligned, and we use huge * pages. 
*/ mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, - vmsg->fds[i], 0); + prot, MAP_SHARED | MAP_NORESERVE, vmsg->fds[i], 0); if (mmap_addr == MAP_FAILED) { vu_panic(dev, "region mmap error: %s", strerror
[PATCH v2 09/14] libvhost-user: Factor out search for memory region by GPA and simplify
Memory regions cannot overlap, and if we ever hit that case something would be really flawed. For example, when vhost code in QEMU decides to increase the size of memory regions to cover full huge pages, it makes sure to never create overlaps, and if there would be overlaps, it would bail out. QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary list"), c1ece84e7c93 ("vhost: Huge page align and merge") and e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that handling and how overlaps are impossible. Consequently, each GPA can belong to at most one memory region, and everything else doesn't make sense. Let's factor out our search to prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 79 +-- 1 file changed, 45 insertions(+), 34 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 7f29e01c30..d72f25396d 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -195,30 +195,47 @@ vu_panic(VuDev *dev, const char *msg, ...) */ } +/* Search for a memory region that covers this guest physical address. */ +static VuDevRegion * +vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) +{ +unsigned int i; + +/* + * Memory regions cannot overlap in guest physical address space. Each + * GPA belongs to exactly one memory region, so there can only be one + * match. + */ +for (i = 0; i < dev->nregions; i++) { +VuDevRegion *cur = >regions[i]; + +if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { +return cur; +} +} +return NULL; +} + /* Translate guest physical address to our virtual address. */ void * vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr) { -unsigned int i; +VuDevRegion *r; if (*plen == 0) { return NULL; } -/* Find matching memory region. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; - -if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) { -if ((guest_addr + *plen) > (r->gpa + r->size)) { -*plen = r->gpa + r->size - guest_addr; -} -return (void *)(uintptr_t) -guest_addr - r->gpa + r->mmap_addr + r->mmap_offset; -} +r = vu_gpa_to_mem_region(dev, guest_addr); +if (!r) { +return NULL; } -return NULL; +if ((guest_addr + *plen) > (r->gpa + r->size)) { +*plen = r->gpa + r->size - guest_addr; +} +return (void *)(uintptr_t)guest_addr - r->gpa + r->mmap_addr + + r->mmap_offset; } /* Translate qemu virtual address to our virtual address. */ @@ -854,8 +871,8 @@ static inline bool reg_equal(VuDevRegion *vudev_reg, static bool vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = -unsigned int i; -bool found = false; +unsigned int idx; +VuDevRegion *r; if (vmsg->fd_num > 1) { vmsg_close_fds(vmsg); @@ -882,28 +899,22 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { DPRINT("mmap_offset 0x%016"PRIx64"\n", msg_region->mmap_offset); -for (i = 0; i < dev->nregions; i++) { -if (reg_equal(>regions[i], msg_region)) { -VuDevRegion *r = >regions[i]; - -munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); - -/* Shift all affected entries by 1 to close the hole at index. */ -memmove(dev->regions + i, dev->regions + i + 1, -sizeof(VuDevRegion) * (dev->nregions - i - 1)); -DPRINT("Successfully removed a region\n"); -dev->nregions--; -i--; - -found = true; -break; -} -} - -if (!found) { +r = vu_gpa_to_mem_region(dev, msg_region->guest_phys_addr); +if (!r || !reg_equal(r, msg_region)) { +vmsg_close_fds(vmsg); vu_panic(dev, "Specified region not found\n"); +return false; } +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); + +idx = r - dev->regions; +assert(idx < dev->nregions); +/* Shift all affected entries by 1 to close the hole. 
*/ +memmove(r, r + 1, sizeof(VuDevRegion) * (dev->nregions - idx - 1)); +DPRINT("Successfully removed a region\n"); +dev->nregions--; + vmsg_close_fds(vmsg); return false; -- 2.43.0
[PATCH v2 07/14] libvhost-user: Don't zero out memory for memory regions
dev->nregions always covers only valid entries. Stop zeroing out other array elements that are unused. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 225283f764..2e8b611385 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -888,13 +888,9 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); -/* - * Shift all affected entries by 1 to close the hole at index i and - * zero out the last entry. - */ +/* Shift all affected entries by 1 to close the hole at index. */ memmove(dev->regions + i, dev->regions + i + 1, sizeof(VuDevRegion) * (dev->nregions - i - 1)); -memset(dev->regions + dev->nregions - 1, 0, sizeof(VuDevRegion)); DPRINT("Successfully removed a region\n"); dev->nregions--; i--; @@ -2119,7 +2115,6 @@ vu_init(VuDev *dev, DPRINT("%s: failed to malloc mem regions\n", __func__); return false; } -memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); dev->vq = malloc(max_queues * sizeof(dev->vq[0])); if (!dev->vq) { -- 2.43.0
[PATCH v2 01/14] libvhost-user: Dynamically allocate memory for memory slots
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically allocating dev->regions. We don't have any ABI guarantees (not dynamically linked), so we can simply change the layout of VuDev. Let's zero out the memory, just as we used to do. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 11 +++ subprojects/libvhost-user/libvhost-user.h | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a3b158c671..360c5366d6 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -2171,6 +2171,8 @@ vu_deinit(VuDev *dev) free(dev->vq); dev->vq = NULL; +free(dev->regions); +dev->regions = NULL; } bool @@ -2205,9 +2207,18 @@ vu_init(VuDev *dev, dev->backend_fd = -1; dev->max_queues = max_queues; +dev->regions = malloc(VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); +if (!dev->regions) { +DPRINT("%s: failed to malloc mem regions\n", __func__); +return false; +} +memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); + dev->vq = malloc(max_queues * sizeof(dev->vq[0])); if (!dev->vq) { DPRINT("%s: failed to malloc virtqueues\n", __func__); +free(dev->regions); +dev->regions = NULL; return false; } diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h index c2352904f0..c882b4e3a2 100644 --- a/subprojects/libvhost-user/libvhost-user.h +++ b/subprojects/libvhost-user/libvhost-user.h @@ -398,7 +398,7 @@ typedef struct VuDevInflightInfo { struct VuDev { int sock; uint32_t nregions; -VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS]; +VuDevRegion *regions; VuVirtq *vq; VuDevInflightInfo inflight_info; int log_call_fd; -- 2.43.0
[PATCH v2 10/14] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 49 +-- 1 file changed, 45 insertions(+), 4 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d72f25396d..ef6353d847 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...) static VuDevRegion * vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) { -unsigned int i; +int low = 0; +int high = dev->nregions - 1; /* * Memory regions cannot overlap in guest physical address space. Each * GPA belongs to exactly one memory region, so there can only be one * match. + * + * We store our memory regions ordered by GPA and can simply perform a + * binary search. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *cur = &dev->regions[i]; +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = &dev->regions[mid]; if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { return cur; } +if (guest_addr >= cur->gpa + cur->size) { +low = mid + 1; +} +if (guest_addr < cur->gpa) { +high = mid - 1; +} } return NULL; } @@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev) static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { +const uint64_t start_gpa = msg_region->guest_phys_addr; +const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; VuDevRegion *r; void *mmap_addr; +int low = 0; +int high = dev->nregions - 1; +unsigned int idx; DPRINT("Adding region %d\n", dev->nregions); DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", @@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) prot = PROT_NONE; } +/* + * We will add memory regions into the array sorted by GPA. Perform a + * binary search to locate the insertion point: it will be at the low + * index. + */ +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = &dev->regions[mid]; + +/* Overlap of GPA addresses. */ +if (start_gpa < cur->gpa + cur->size && cur->gpa < end_gpa) { +vu_panic(dev, "regions with overlapping guest physical addresses"); +return; +} +if (start_gpa >= cur->gpa + cur->size) { +low = mid + 1; +} +if (start_gpa < cur->gpa) { +high = mid - 1; +} +} +idx = low; + /* * We don't use offset argument of mmap() since the mapped address has * to be page aligned, and we use huge pages. */ @@ -308,7 +347,9 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) DPRINT("mmap_addr: 0x%016"PRIx64"\n", (uint64_t)(uintptr_t)mmap_addr); -r = &dev->regions[dev->nregions]; +/* Shift all affected entries by 1 to open a hole at idx. 
*/ +r = &dev->regions[idx]; +memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx)); r->gpa = msg_region->guest_phys_addr; r->size = msg_region->memory_size; r->qva = msg_region->userspace_addr; -- 2.43.0
[PATCH v2 13/14] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions
Currently, we try to remap all rings whenever we add a single new memory region. That doesn't quite make sense, because we already map rings when setting the ring address, and panic if that goes wrong. Likely, that handling was simply copied from set_mem_table code, where we actually have to remap all rings. Remapping all rings might require us to walk quite a lot of memory regions to perform the address translations. Ideally, we'd simply remove that remapping. However, let's be a bit careful. There might be some weird corner cases where we might temporarily remove a single memory region (e.g., resize it), that would have worked for now. Further, a ring might be located on hotplugged memory, and as the VM reboots, we might unplug that memory, to hotplug memory before resetting the ring addresses. So let's unmap affected rings as we remove a memory region, and try dynamically mapping the ring again when required. Acked-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 107 -- 1 file changed, 78 insertions(+), 29 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index ed0a978d4f..61fb3050b3 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +map_ring(VuDev *dev, VuVirtq *vq) +{ +vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr); +vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr); +vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr); + +DPRINT("Setting virtq addresses:\n"); +DPRINT("vring_desc at %p\n", vq->vring.desc); +DPRINT("vring_used at %p\n", vq->vring.used); +DPRINT("vring_avail at %p\n", vq->vring.avail); + +return !(vq->vring.desc && vq->vring.used && vq->vring.avail); +} + static bool vu_is_vq_usable(VuDev *dev, VuVirtq *vq) { -return likely(!dev->broken) && 
likely(vq->vring.avail); +if (unlikely(dev->broken)) { +return false; +} + +if (likely(vq->vring.avail)) { +return true; +} + +/* + * In corner cases, we might temporarily remove a memory region that + * mapped a ring. When removing a memory region we make sure to + * unmap any rings that would be impacted. Let's try to remap if we + * already succeeded mapping this ring once. + */ +if (!vq->vra.desc_user_addr || !vq->vra.used_user_addr || +!vq->vra.avail_user_addr) { +return false; +} +if (map_ring(dev, vq)) { +vu_panic(dev, "remapping queue on access"); +return false; +} +return true; +} + +static void +unmap_rings(VuDev *dev, VuDevRegion *r) +{ +int i; + +for (i = 0; i < dev->max_queues; i++) { +VuVirtq *vq = &dev->vq[i]; +const uintptr_t desc = (uintptr_t)vq->vring.desc; +const uintptr_t used = (uintptr_t)vq->vring.used; +const uintptr_t avail = (uintptr_t)vq->vring.avail; + +if (desc < r->mmap_addr || desc >= r->mmap_addr + r->size) { +continue; +} +if (used < r->mmap_addr || used >= r->mmap_addr + r->size) { +continue; +} +if (avail < r->mmap_addr || avail >= r->mmap_addr + r->size) { +continue; +} + +DPRINT("Unmapping rings of queue %d\n", i); +vq->vring.desc = NULL; +vq->vring.used = NULL; +vq->vring.avail = NULL; +} }
-int i; VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m; if (vmsg->fd_num != 1) { @@ -930,19 +979,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { vmsg->fd_num = 0; DPRINT("Successf
[PATCH v2 05/14] libvhost-user: Factor out adding a memory region
Let's factor it out, reducing quite some code duplication and preparing for further changes. If we fail to mmap a region and panic, we now simply don't add that (broken) region. Note that we now increment dev->nregions as we are successfully adding memory regions, and don't increment dev->nregions if anything went wrong. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 168 -- 1 file changed, 60 insertions(+), 108 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a7bd7de3cd..f43b5096d0 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -256,6 +256,61 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static void +_vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) +{ +int prot = PROT_READ | PROT_WRITE; +VuDevRegion *r; +void *mmap_addr; + +DPRINT("Adding region %d\n", dev->nregions); +DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", + msg_region->guest_phys_addr); +DPRINT("memory_size: 0x%016"PRIx64"\n", + msg_region->memory_size); +DPRINT("userspace_addr: 0x%016"PRIx64"\n", + msg_region->userspace_addr); +DPRINT("mmap_offset: 0x%016"PRIx64"\n", + msg_region->mmap_offset); + +if (dev->postcopy_listening) { +/* + * In postcopy we're using PROT_NONE here to catch anyone + * accessing it before we userfault + */ +prot = PROT_NONE; +} + +/* + * We don't use offset argument of mmap() since the mapped address has + * to be page aligned, and we use huge pages.
+ */ +mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset, + prot, MAP_SHARED | MAP_NORESERVE, fd, 0); +if (mmap_addr == MAP_FAILED) { +vu_panic(dev, "region mmap error: %s", strerror(errno)); +return; +} +DPRINT("mmap_addr: 0x%016"PRIx64"\n", + (uint64_t)(uintptr_t)mmap_addr); + +r = &dev->regions[dev->nregions]; +r->gpa = msg_region->guest_phys_addr; +r->size = msg_region->memory_size; +r->qva = msg_region->userspace_addr; +r->mmap_addr = (uint64_t)(uintptr_t)mmap_addr; +r->mmap_offset = msg_region->mmap_offset; +dev->nregions++; + +if (dev->postcopy_listening) { +/* + * Return the address to QEMU so that it can translate the ufd + * fault addresses back. + */ +msg_region->userspace_addr = r->mmap_addr + r->mmap_offset; +} +} + static void vmsg_close_fds(VhostUserMsg *vmsg) { @@ -727,10 +782,7 @@ generate_faults(VuDev *dev) { static bool vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { int i; -bool track_ramblocks = dev->postcopy_listening; VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m; -VuDevRegion *dev_region = &dev->regions[dev->nregions]; -void *mmap_addr; if (vmsg->fd_num != 1) { vmsg_close_fds(vmsg); @@ -760,69 +812,20 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { * we know all the postcopy client bases have been received, and we * should start generating faults.
*/ -if (track_ramblocks && +if (dev->postcopy_listening && vmsg->size == sizeof(vmsg->payload.u64) && vmsg->payload.u64 == 0) { (void)generate_faults(dev); return false; } -DPRINT("Adding region: %u\n", dev->nregions); -DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", - msg_region->guest_phys_addr); -DPRINT("memory_size: 0x%016"PRIx64"\n", - msg_region->memory_size); -DPRINT("userspace_addr 0x%016"PRIx64"\n", - msg_region->userspace_addr); -DPRINT("mmap_offset 0x%016"PRIx64"\n", - msg_region->mmap_offset); - -dev_region->gpa = msg_region->guest_phys_addr; -dev_region->size = msg_region->memory_size; -dev_region->qva = msg_region->userspace_addr; -dev_region->mmap_offset = msg_region->mmap_offset; - -/* - * We don't use offset argument of mmap() since the - * mapped address has to be page aligned, and we use huge - * pages. - */ -if (track_ramblocks) { -/* - * In postcopy we're using PROT_NONE here to catch anyone - * accessing it before we userfault. - */ -mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, -
[PATCH v2 08/14] libvhost-user: Don't search for duplicates when removing memory regions
We cannot have duplicate memory regions; something would be deeply flawed elsewhere. Let's just stop the search once we find an entry. We'll add more sanity checks when adding memory regions later. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 2e8b611385..7f29e01c30 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -896,8 +896,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { i--; found = true; - -/* Continue the search for eventual duplicates. */ +break; } } -- 2.43.0
[PATCH v2 02/14] libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
Let's support up to 509 mem slots, just like vhost in the kernel usually does and the rust vhost-user implementation recently [1] started doing. This is required to properly support memory hotplug, either using multiple DIMMs (ACPI supports up to 256) or using virtio-mem. The 509 used to be the KVM limit: it supported 512, but 3 were reserved for internal purposes. Currently, KVM supports more than 512, but it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot memory), except when other memory devices like PCI devices with BARs are used. So, 509 seems to work well for vhost in the kernel. Details can be found in the QEMU change that made virtio-mem consume up to 256 mem slots across all virtio-mem devices. [2] 509 mem slots imply 509 VMAs/mappings in the worst case (even though, in practice with virtio-mem we won't be seeing more than ~260 in most setups). With max_map_count under Linux defaulting to 64k, 509 mem slots still correspond to less than 1% of the maximum number of mappings. There are plenty left for the application to consume. [1] https://github.com/rust-vmm/vhost/pull/224 [2] https://lore.kernel.org/all/20230926185738.277351-1-da...@redhat.com/ Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h index c882b4e3a2..deb40e77b3 100644 --- a/subprojects/libvhost-user/libvhost-user.h +++ b/subprojects/libvhost-user/libvhost-user.h @@ -31,10 +31,12 @@ #define VHOST_MEMORY_BASELINE_NREGIONS 8 /* - * Set a reasonable maximum number of ram slots, which will be supported by - * any architecture. + * vhost in the kernel usually supports 509 mem slots. 509 used to be the + * KVM limit: it supported 512, but 3 were used for internal purposes. 
This + * limit is sufficient to support many DIMMs and virtio-mem in + * "dynamic-memslots" mode. */ -#define VHOST_USER_MAX_RAM_SLOTS 32 +#define VHOST_USER_MAX_RAM_SLOTS 509 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64) -- 2.43.0
[PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code
This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots across all memory devices in "dynamic-memslot" mode (more details in patch #2). With that in place, this series optimizes and extends memory region handling in libvhost-user: * Heavily deduplicate and clean up the memory region handling code * Speed up GPA->VA translation with many memslots using binary search * Optimize mmap_offset handling to use it as fd_offset for mmap() * Avoid ring remapping when adding a single memory region * Avoid dumping all guest memory, possibly allocating memory in sparse memory mappings when the process crashes I'm being very careful to not break some weird corner case that modern QEMU might no longer trigger, but older ones could have triggered or some other frontend might trigger. The only thing where I am not careful is to forbid memory regions that overlap in GPA space: it doesn't make any sense. With this series, virtio-mem (with dynamic-memslots=on) + qemu-storage-daemon works flawlessly and as expected in my tests. v1 -> v2: * Drop "libvhost-user: Fix msg_region->userspace_addr computation" -> Not actually required * "libvhost-user: Factor out adding a memory region" -> Make debug output more consistent (add missing ":") * "libvhost-user: Use most of mmap_offset as fd_offset" -> get_fd_pagesize -> get_fd_hugepagesize; remove getpagesize() -> "mmap_offset:" to "old mmap_offset:" in debug message -> "adj mmap_offset:" to "new mmap_offset:" in debug message -> Use "(unsigned int)fs.f_type"; the man page of fstatfs() calls out that the type of f_type can vary depending on the architecture. "unsigned int" is sufficient here. -> Updated patch description * Added RBs+ACKs * Did a Gitlab CI run, seems to be happy regarding libvhost-user Cc: Michael S. 
Tsirkin Cc: Jason Wang Cc: Stefan Hajnoczi Cc: Stefano Garzarella Cc: Germano Veit Michel Cc: Raphael Norwitz David Hildenbrand (14): libvhost-user: Dynamically allocate memory for memory slots libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509 libvhost-user: Factor out removing all mem regions libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec() libvhost-user: Factor out adding a memory region libvhost-user: No need to check for NULL when unmapping libvhost-user: Don't zero out memory for memory regions libvhost-user: Don't search for duplicates when removing memory regions libvhost-user: Factor out search for memory region by GPA and simplify libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va() libvhost-user: Use most of mmap_offset as fd_offset libvhost-user: Factor out vq usability check libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP subprojects/libvhost-user/libvhost-user.c | 595 -- subprojects/libvhost-user/libvhost-user.h | 10 +- 2 files changed, 334 insertions(+), 271 deletions(-) -- 2.43.0
[PATCH v2 12/14] libvhost-user: Factor out vq usability check
Let's factor it out to prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 24 +++ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 55aef5fcc6..ed0a978d4f 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,6 +283,12 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +vu_is_vq_usable(VuDev *dev, VuVirtq *vq) +{ +return likely(!dev->broken) && likely(vq->vring.avail); +} + static size_t get_fd_hugepagesize(int fd) { @@ -2380,8 +2386,7 @@ vu_queue_get_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int *in_bytes, idx = vq->last_avail_idx; total_bufs = in_total = out_total = 0; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { goto done; } @@ -2496,8 +2501,7 @@ vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int in_bytes, bool vu_queue_empty(VuDev *dev, VuVirtq *vq) { -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return true; } @@ -2536,8 +2540,7 @@ vring_notify(VuDev *dev, VuVirtq *vq) static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync) { -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } @@ -2862,8 +2865,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz) unsigned int head; VuVirtqElement *elem; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return NULL; } @@ -3020,8 +3022,7 @@ vu_queue_fill(VuDev *dev, VuVirtq *vq, { struct vring_used_elem uelem; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } @@ -3050,8 +3051,7 @@ vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count) { uint16_t old, new; -if 
(unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } -- 2.43.0
[PATCH v2 14/14] libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP
We already use MAP_NORESERVE to deal with sparse memory regions. Let's also set madvise(MADV_DONTDUMP), otherwise a crash of the process can result in us allocating all memory in the mmap'ed region for dumping purposes. This change implies that the mmap'ed rings won't be included in a coredump. If ever required for debugging purposes, we could mark only the mapped rings MADV_DODUMP. Ignore errors during madvise() for now. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 61fb3050b3..a879149fef 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -460,6 +460,12 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) DPRINT("mmap_addr: 0x%016"PRIx64"\n", (uint64_t)(uintptr_t)mmap_addr); +#if defined(__linux__) +/* Don't include all guest memory in a coredump. */ +madvise(mmap_addr, msg_region->memory_size + mmap_offset, +MADV_DONTDUMP); +#endif + /* Shift all affected entries by 1 to open a hole at idx. */ r = &dev->regions[idx]; memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx)); -- 2.43.0
[PATCH v2 11/14] libvhost-user: Use most of mmap_offset as fd_offset
In the past, QEMU would create memory regions that could partially cover hugetlb pages, making mmap() fail if we used the mmap_offset as an fd_offset. For that reason, we never used the mmap_offset as an offset into the fd and instead always mapped the fd from the very start. However, that can easily result in us mmap'ing a lot of unnecessary parts of an fd, possibly repeatedly. QEMU nowadays does not create memory regions that partially cover huge pages -- it never really worked with postcopy. QEMU handles merging of regions that partially cover huge pages (due to holes in boot memory) since 2018 in c1ece84e7c93 ("vhost: Huge page align and merge"). Let's be a bit careful and not unconditionally convert the mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb size and pass as much as we can as fd_offset, making sure that we call mmap() with a properly aligned offset. With QEMU and a virtio-mem device that is fully plugged (50GiB using 50 memslots) the qemu-storage-daemon process consumes 1281GiB of VA space before this change and 58GiB after this change. 
Vhost user message Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size:40 Fds: 59 Adding region 4 guest_phys_addr: 0x0002 memory_size: 0x4000 userspace_addr: 0x7fb73bffe000 old mmap_offset: 0x8000 fd_offset: 0x8000 new mmap_offset: 0x mmap_addr: 0x7f02f1bdc000 Successfully added new region Vhost user message Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size:40 Fds: 59 Adding region 5 guest_phys_addr: 0x00024000 memory_size: 0x4000 userspace_addr: 0x7fb77bffe000 old mmap_offset: 0xc000 fd_offset: 0xc000 new mmap_offset: 0x mmap_addr: 0x7f028400 Successfully added new region Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 54 --- 1 file changed, 48 insertions(+), 6 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index ef6353d847..55aef5fcc6 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -43,6 +43,8 @@ #include #include #include +#include +#include #ifdef __NR_userfaultfd #include @@ -281,12 +283,32 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static size_t +get_fd_hugepagesize(int fd) +{ +#if defined(__linux__) +struct statfs fs; +int ret; + +do { +ret = fstatfs(fd, &fs); +} while (ret != 0 && errno == EINTR); + +if (!ret && (unsigned int)fs.f_type == HUGETLBFS_MAGIC) { +return fs.f_bsize; +} +#endif +return 0; +} + static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { const uint64_t start_gpa = msg_region->guest_phys_addr; const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; +uint64_t mmap_offset, fd_offset; +size_t hugepagesize; VuDevRegion *r; void *mmap_addr; int low = 0; @@ -300,7 +322,7 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) msg_region->memory_size); DPRINT("userspace_addr: 0x%016"PRIx64"\n", msg_region->userspace_addr); 
-DPRINT("mmap_offset: 0x%016"PRIx64"\n", +DPRINT("old mmap_offset: 0x%016"PRIx64"\n", msg_region->mmap_offset); if (dev->postcopy_listening) { @@ -335,11 +357,31 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) idx = low; /* - * We don't use offset argument of mmap() since the mapped address has - * to be page aligned, and we use huge pages. + * Convert most of msg_region->mmap_offset to fd_offset. In almost all + * cases, this will leave us with mmap_offset == 0, mmap()'ing only + * what we really need. Only if a memory region would partially cover + * hugetlb pages, we'd get mmap_offset != 0, which usually doesn't happen + * anymore (i.e., modern QEMU). + * + * Note that mmap() with hugetlb would fail if the offset into the file + * is not aligned to the huge page size. */ -mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset, - prot, MAP_SHARED | MAP_NORESERVE, fd, 0); +hugepagesize = get_fd_hugepagesize(fd); +if (hugepagesize) { +fd_offset = ALIGN_DOWN(msg_region->mmap_offset, hugepagesize); +mmap_offset = msg_region->mmap_offset - fd_offset; +} else { +
[PATCH v2 03/14] libvhost-user: Factor out removing all mem regions
Let's factor it out. Note that the check for MAP_FAILED was wrong as we never set mmap_addr if mmap() failed. We'll remove the NULL check separately. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 34 --- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 360c5366d6..e4907dfc26 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -240,6 +240,22 @@ qva_to_va(VuDev *dev, uint64_t qemu_addr) return NULL; } +static void +vu_remove_all_mem_regs(VuDev *dev) +{ +unsigned int i; + +for (i = 0; i < dev->nregions; i++) { +VuDevRegion *r = &dev->regions[i]; +void *ma = (void *)(uintptr_t)r->mmap_addr; + +if (ma) { +munmap(ma, r->size + r->mmap_offset); +} +} +dev->nregions = 0; +} + static void vmsg_close_fds(VhostUserMsg *vmsg) { @@ -1003,14 +1019,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) unsigned int i; VhostUserMemory m = vmsg->payload.memory, *memory = &m; -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = &dev->regions[i]; -void *ma = (void *) (uintptr_t) r->mmap_addr; - -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} -} +vu_remove_all_mem_regs(dev); dev->nregions = memory->nregions; if (dev->postcopy_listening) { @@ -2112,14 +2121,7 @@ vu_deinit(VuDev *dev) { unsigned int i; -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = &dev->regions[i]; -void *m = (void *) (uintptr_t) r->mmap_addr; -if (m != MAP_FAILED) { -munmap(m, r->size + r->mmap_offset); -} -} -dev->nregions = 0; +vu_remove_all_mem_regs(dev); for (i = 0; i < dev->max_queues; i++) { VuVirtq *vq = &dev->vq[i]; -- 2.43.0
Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code
On 13.02.24 19:55, Michael S. Tsirkin wrote: On Tue, Feb 13, 2024 at 07:27:44PM +0100, David Hildenbrand wrote: On 13.02.24 18:33, Michael S. Tsirkin wrote: On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote: This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots accross all memory devices in "dynamic-memslot" mode (more details in patch #3). Breaks build on some systems. E.g. https://gitlab.com/mstredhat/qemu/-/jobs/6163591599 ./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of integer expressions of different signedness: ‘long int’ and ‘unsigned int’ [-Werror=sign-compare] 369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) { | ^~ So easy to fix in v2, thanks! I think there is another problem around plugins though. There is a wrong checkpatch error: https://gitlab.com/mstredhat/qemu/-/jobs/6162397277 d96f29518232719b0c444ab93913e8515a6cb5c6:100: ERROR: use qemu_real_host_page_size() instead of getpagesize() total: 1 errors, 1 warnings, 81 lines checked qemu_real_host_page_size() is not available in libvhost-user. But I could just change that code to not require getpagesize() at all. Apart from that, I don't spot anything libvhost-user related (some qtest timeouts, a "error_setv: Assertion `*errp == NULL' failed."). Did I miss something? -- Cheers, David / dhildenb
Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code
On 13.02.24 18:33, Michael S. Tsirkin wrote: On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote: This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots accross all memory devices in "dynamic-memslot" mode (more details in patch #3). Breaks build on some systems. E.g. https://gitlab.com/mstredhat/qemu/-/jobs/6163591599 ./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of integer expressions of different signedness: ‘long int’ and ‘unsigned int’ [-Werror=sign-compare] 369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) { | ^~ So easy to fix in v2, thanks! -- Cheers, David / dhildenb
Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation
On 13.02.24 18:32, Michael S. Tsirkin wrote: On Fri, Feb 02, 2024 at 10:53:18PM +0100, David Hildenbrand wrote: We barely had mmap_offset set in the past. With virtio-mem and dynamic-memslots that will change. In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are performing pointer arithmetics, which is wrong. Wrong how? I suspect you mean arithmetic on void * pointers is not portable? You are absolutely right, no idea how I concluded that we might have a different pointer size here. I'll either convert this patch into a cleanup or drop it for v2, thanks! -- Cheers, David / dhildenb
Re: [PATCH V3 09/13] migration: notifier error checking
On 08.02.24 19:54, Steve Sistare wrote: Check the status returned by migration notifiers and report errors. If notifiers fail, call the notifiers again so they can clean up. IIUC, if any of the notifiers will actually start to fail, say, during MIG_EVENT_PRECOPY_SETUP, you will call MIG_EVENT_PRECOPY_FAILED on all notifiers. That will include notifiers that have never seen a MIG_EVENT_PRECOPY_SETUP call. Is that what we expect notifiers to be able to deal with? Can we document that? -- Cheers, David / dhildenb
Re: [PATCH V3 08/13] migration: refactor migrate_fd_connect failures
On 08.02.24 19:54, Steve Sistare wrote: Move common code for the error path in migrate_fd_connect to a shared fail label. No functional change. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 07/13] migration: per-mode notifiers
On 08.02.24 19:54, Steve Sistare wrote: Keep a separate list of migration notifiers for each migration mode. Suggested-by: Peter Xu Signed-off-by: Steve Sistare --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 06/13] migration: MigrationNotifyFunc
On 08.02.24 19:53, Steve Sistare wrote: Define MigrationNotifyFunc to improve type safety and simplify migration notifiers. Signed-off-by: Steve Sistare --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 04/13] migration: MigrationEvent for notifiers
On 08.02.24 19:53, Steve Sistare wrote: Passing MigrationState to notifiers is unsound because they could access unstable migration state internals or even modify the state. Instead, pass the minimal info needed in a new MigrationEvent struct, which could be extended in the future if needed. Suggested-by: Peter Xu Signed-off-by: Steve Sistare --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb