Re: [PATCH v10 0/7] Support message-based DMA in vfio-user server

2024-05-08 Thread Mattias Nissler
On Wed, May 8, 2024 at 11:16 PM Philippe Mathieu-Daudé wrote:
>
> On 7/5/24 16:34, Mattias Nissler wrote:
> > This series adds basic support for message-based DMA in qemu's vfio-user
> > server. This is useful for cases where the client does not provide file
> > descriptors for accessing system memory via memory mappings. My motivating
> > use case is to hook up device models as PCIe endpoints to a hardware design.
> > This works by bridging the PCIe transaction layer to vfio-user, and the
> > endpoint does not access memory directly, but sends memory request TLPs to
> > the hardware design in order to perform DMA.
>
> Patches 1-3 & 7 queued to hw-misc tree, thanks.

Excellent, thanks for picking these up!



Re: [PATCH v9 2/5] softmmu: Support concurrent bounce buffers

2024-05-08 Thread Mattias Nissler
On Tue, May 7, 2024 at 4:46 PM Philippe Mathieu-Daudé  wrote:
>
> On 7/5/24 16:04, Mattias Nissler wrote:
> > On Tue, May 7, 2024 at 2:57 PM Philippe Mathieu-Daudé wrote:
> >>
> >> On 7/5/24 11:42, Mattias Nissler wrote:
> >>> When DMA memory can't be directly accessed, as is the case when
> >>> running the device model in a separate process without shareable DMA
> >>> file descriptors, bounce buffering is used.
> >>>
> >>> It is not uncommon for device models to request mapping of several DMA
> >>> regions at the same time. Examples include:
> >>>* net devices, e.g. when transmitting a packet that is split across
> >>>  several TX descriptors (observed with igb)
> >>>* USB host controllers, when handling a packet with multiple data TRBs
> >>>  (observed with xhci)
> >>>
> >>> Previously, qemu only provided a single bounce buffer per AddressSpace
> >>> and would fail DMA map requests while the buffer was already in use. In
> >>> turn, this would cause DMA failures that ultimately manifest as hardware
> >>> errors from the guest perspective.
> >>>
> >>> This change allocates DMA bounce buffers dynamically instead of
> >>> supporting only a single buffer. Thus, multiple DMA mappings work
> >>> correctly also when RAM can't be mmap()-ed.
> >>>
> >>> The total bounce buffer allocation size is limited individually for each
> >>> AddressSpace. The default limit is 4096 bytes, matching the previous
> >>> maximum buffer size. A new x-max-bounce-buffer-size parameter is
> >>> provided to configure the limit for PCI devices.
> >>>
> >>> Signed-off-by: Mattias Nissler 
> >>> ---
> >>>hw/pci/pci.c|  8 
> >>>include/exec/memory.h   | 14 +++
> >>>include/hw/pci/pci_device.h |  3 ++
> >>>system/memory.c |  5 ++-
> >>>system/physmem.c| 82 ++---
> >>>5 files changed, 76 insertions(+), 36 deletions(-)
>
>
> >>>/**
> >>> * struct AddressSpace: describes a mapping of addresses to 
> >>> #MemoryRegion objects
> >>> @@ -1143,8 +1137,10 @@ struct AddressSpace {
> >>>QTAILQ_HEAD(, MemoryListener) listeners;
> >>>QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> >>>
> >>> -/* Bounce buffer to use for this address space. */
> >>> -BounceBuffer bounce;
> >>> +/* Maximum DMA bounce buffer size used for indirect memory map 
> >>> requests */
> >>> +uint32_t max_bounce_buffer_size;
> >>
> >> Alternatively size_t.
> >
> > While switching things over, I was surprised to find that
> > DEFINE_PROP_SIZE wants a uint64_t field rather than a size_t field.
> > There is a DEFINE_PROP_SIZE32 variant for uint32_t though. Considering
> > my options, assuming that we want to use size_t for everything other
> > than the property:
> >
> > (1) Make PCIDevice::max_bounce_buffer_size size_t and have the
> > preprocessor select DEFINE_PROP_SIZE/DEFINE_PROP_SIZE32. This makes
> > the qdev property type depend on the host. Ugh.
> >
> > (2) Make PCIDevice::max_bounce_buffer_size uint64_t and clamp if
> > needed when used. Weird to allow larger values that are then clamped,
> > although it probably doesn't matter in practice since address space is
> > limited to 4GB anyways.
> >
> > (3) Make PCIDevice::max_bounce_buffer_size uint32_t and accept the
> > limitation that the largest bounce buffer limit is 4GB even on 64-bit
> > hosts.
> >
> > #3 seemed most pragmatic, so I'll go with that.
>
> LGTM, thanks for updating.

No problem, can I ask you to provide a formal R-B on the v10 #4 patch
[1] then, so the series will be ready to go in?

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-05/msg01382.html

>
> >
> >
> >>
> >>> +/* Total size of bounce buffers currently allocated, atomically 
> >>> accessed */
> >>> +uint32_t bounce_buffer_size;
> >>
> >> Ditto.
>



[PATCH v10 6/7] vfio-user: Message-based DMA support

2024-05-07 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.
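A minimal sketch of the non-mmap branch this adds to dma_register() (assuming
it backs the subregion with the vfu_dma_ops introduced in this patch; variable
names as in that function):

    /* No mmap()-able fd: back the region with vfu_dma_ops, so guest DMA to
     * this window is translated into VFIO-user DMA request messages instead
     * of direct loads/stores into mapped RAM. */
    memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
                          name, iov->iov_len);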

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 100 --
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index d9b879e056..a15e291c9a 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger access

[PATCH v10 7/7] vfio-user: Fix config space access byte order

2024-05-07 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.
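To illustrate the failure mode (example values of my own, not from the patch):
for a 2-byte config read that returns val = 0x8086, the client expects the
bytes on the wire in little-endian order:

    uint64_t val = 0x8086;
    uint8_t buf[2];

    memcpy(buf, &val, 2);    /* big-endian host: buf = { 0x00, 0x00 } -- wrong */
    stn_le_p(buf, 2, val);   /* any host:        buf = { 0x86, 0x80 } -- LE as on PCI */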

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index a15e291c9a..0e93d7a7b4 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.43.2




[PATCH v10 4/7] softmmu: Support concurrent bounce buffers

2024-05-07 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
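The limit enforcement boils down to a reservation loop of roughly this shape
(a sketch only, not the literal patch hunk; field names follow the ones
introduced here): a compare-and-swap loop reserves space against the
per-AddressSpace limit so concurrent map requests cannot overshoot it.

    static bool reserve_bounce_space(AddressSpace *as, size_t want)
    {
        size_t expected = qatomic_read(&as->bounce_buffer_size);

        for (;;) {
            if (want > as->max_bounce_buffer_size - expected) {
                return false;   /* over budget: the map request fails */
            }
            size_t actual = qatomic_cmpxchg(&as->bounce_buffer_size,
                                            expected, expected + want);
            if (actual == expected) {
                return true;    /* reservation succeeded */
            }
            expected = actual;  /* lost a race with another mapper; retry */
        }
    }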

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++
 include/hw/pci/pci_device.h |  3 ++
 system/memory.c |  5 ++-
 system/physmem.c| 82 ++---
 5 files changed, 76 insertions(+), 36 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 324c1302d2..d6f4944cbd 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE32("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1204,6 +1206,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
 &pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2633,6 +2637,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index d417d7f363..451879efbd 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1117,13 +1117,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1143,8 +1137,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+size_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+size_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..253b48a688 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint32_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/system/memory.c b/system/memory.c
index 642a449f8c..c288ed354a 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3174,7 +3174,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
  

[PATCH v10 5/7] Update subprojects/libvfio-user

2024-05-07 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.43.2




[PATCH v10 0/7] Support message-based DMA in vfio-user server

2024-05-07 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory request TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks for example. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.
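As a rough illustration of the cost (back-of-the-envelope numbers, not
measurements): a single 4 KiB block transfer, e.g. one nvme page, becomes
4096 B / 8 B per access = 512 separate vfio-user DMA messages, i.e. 512 socket
round trips where one message would suffice.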

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in first bytes of bounce buffer struct as a best effort
  measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref, control flow restored to match
  previous code to simplify review.

* Some cosmetic fixes.

Changes from v5:

* Unregister indirect memory region in libvfio-user dma_unregister callback.

Changes from v6:

* Rebase, resolve straightforward merge conflict in system/dma-helpers.c

Changes from v7:

* Rebase (applied cleanly)

* Restore various Reviewed-by and Tested-by tags that I failed to carry
  forward (I double-checked that the patches haven't changed since the reviewed
  version)

Changes from v8:

* Rebase (clean)

* Change bounce buffer size accounting to use uint32_t so it works also on
  hosts that don't support uint64_t atomics, such as mipsel. As a consequence
  overflows are a real concern now, so switch to a cmpxchg loop for allocating
  bounce buffer space.

Changes from v9:

* Incorporate patch split and QEMU_MUTEX_GUARD change by phi...@linaro.org

* Use size_t instead of uint32_t for bounce buffer size accounting. The qdev
  property remains uint32_t though, so it has a consistent size regardless of
  host.

Mattias Nissler (6):
  system/physmem: Propagate AddressSpace to MapClient helpers
  system/physmem: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

Philippe Mathieu-Daudé (1):
  system/physmem: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 104 -
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 subprojects/libvfio-user.wrap |   2 +-
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   8 ++
 system/physmem.c  | 140 ++
 10 files changed, 225 insertions(+), 89 deletions(-)

-- 
2.43.2




[PATCH v10 2/7] system/physmem: Propagate AddressSpace to MapClient helpers

2024-05-07 Thread Mattias Nissler
Propagate the AddressSpace to the following helpers:
- register_map_client()
- unregister_map_client()
- notify_map_clients[_locked]()

Rename them using an 'address_space_' prefix instead of 'cpu_'.

The AddressSpace argument will be used in the next commit.
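For reference, the caller pattern these helpers support looks roughly like
this (a sketch modeled on system/dma-helpers.c, not code added by this patch):

    static void retry_dma_bh(void *opaque)
    {
        /* re-issue the address_space_map() attempt that failed earlier */
    }

    void start_transfer(AddressSpace *as, hwaddr addr, hwaddr len, bool is_write)
    {
        hwaddr plen = len;
        void *buf = address_space_map(as, addr, &plen, is_write,
                                      MEMTXATTRS_UNSPECIFIED);
        if (!buf) {
            /* Mapping resources exhausted: ask to be notified, then retry. */
            QEMUBH *bh = qemu_bh_new(retry_dma_bh, NULL);
            address_space_register_map_client(as, bh);
            return;
        }
        /* ... perform the transfer using buf[0..plen) ... */
        address_space_unmap(as, buf, plen, is_write, plen);
    }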

Reviewed-by: Peter Xu 
Tested-by: Jonathan Cameron 
Signed-off-by: Mattias Nissler 
Message-ID: <20240507094210.300566-2-mniss...@rivosinc.com>
[PMD: Split patch, part 1/2]
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Mattias Nissler 
---
 include/exec/cpu-common.h |  2 --
 include/exec/memory.h | 26 --
 system/dma-helpers.c  |  4 ++--
 system/physmem.c  | 24 
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 8bc397e251..815342d043 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -147,8 +147,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index dadb5cd65a..e1e0c5a3de 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2946,8 +2946,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2972,6 +2972,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/system/dma-helpers.c b/system/dma-helpers.c
index 9b221cf94e..74013308f5 100644
--- a/system/dma-helpers.c
+++ b/system/dma-helpers.c
@@ -169,7 +169,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 return;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/system/physmem.c b/system/physmem.c
index 5486014cf2..27e754ff57 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3065,24 +3065,24 @@ QemuMutex map_client_list_lock;
 static QLIST_HEAD(, MapClient) map_client_list
 = QLIST_HEAD_INITIALIZER(map_client_list);
 
-static void cpu_unregister_map_client_do(MapClient *client)
+static void address_space_unregister_map_client_do(MapClient *client)
 {
 QLIST_REMOVE(client, link);
 g_free(client);
 }
 
-static void cpu_notify_map_clients_locked(void)
+static void address_space_notify_map_clients_locked(AddressSpace *as)
 {
 MapClient *client;
 
 while (!QLIST_EMPTY(&map_client_list)) {
 client = QLIST_FIRST(&map_client_list);
 qemu_bh_schedule(client->bh);
-cpu_unregister_map_client_do(client);
+address_space_unregister_map_client_do(client);
 }
 }
 
-void cpu_register_map_client(QEMUBH *bh)
+void address_space_

[PATCH v10 1/7] system/physmem: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD

2024-05-07 Thread Mattias Nissler
From: Philippe Mathieu-Daudé 

Simplify cpu_[un]register_map_client() and cpu_notify_map_clients()
by replacing the qemu_mutex_lock()/qemu_mutex_unlock() pairs with
the QEMU_LOCK_GUARD() macro.
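The transformation is mechanical; the shape is (illustrative, not a hunk from
this patch):

    /* before */
    qemu_mutex_lock(&map_client_list_lock);
    /* ... critical section ... */
    qemu_mutex_unlock(&map_client_list_lock);

    /* after: the mutex is released automatically when the enclosing scope
     * ends, including on early return/break paths */
    QEMU_LOCK_GUARD(&map_client_list_lock);
    /* ... critical section ... */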

Signed-off-by: Philippe Mathieu-Daudé 
Signed-off-by: Mattias Nissler 
Reviewed-by: Mattias Nissler 
---
 system/physmem.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index d3a3d8a45c..5486014cf2 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3086,7 +3086,7 @@ void cpu_register_map_client(QEMUBH *bh)
 {
 MapClient *client = g_malloc(sizeof(*client));
 
-qemu_mutex_lock(&map_client_list_lock);
+QEMU_LOCK_GUARD(&map_client_list_lock);
 client->bh = bh;
 QLIST_INSERT_HEAD(&map_client_list, client, link);
 /* Write map_client_list before reading in_use.  */
@@ -3094,7 +3094,6 @@ void cpu_register_map_client(QEMUBH *bh)
 if (!qatomic_read(&bounce.in_use)) {
 cpu_notify_map_clients_locked();
 }
-qemu_mutex_unlock(&map_client_list_lock);
 }
 
 void cpu_exec_init_all(void)
@@ -3117,21 +3116,19 @@ void cpu_unregister_map_client(QEMUBH *bh)
 {
 MapClient *client;
 
-qemu_mutex_lock(&map_client_list_lock);
+QEMU_LOCK_GUARD(&map_client_list_lock);
 QLIST_FOREACH(client, &map_client_list, link) {
 if (client->bh == bh) {
 cpu_unregister_map_client_do(client);
 break;
 }
 }
-qemu_mutex_unlock(&map_client_list_lock);
 }
 
 static void cpu_notify_map_clients(void)
 {
-qemu_mutex_lock(&map_client_list_lock);
+QEMU_LOCK_GUARD(&map_client_list_lock);
 cpu_notify_map_clients_locked();
-qemu_mutex_unlock(&map_client_list_lock);
 }
 
 static bool flatview_access_valid(FlatView *fv, hwaddr addr, hwaddr len,
-- 
2.43.2




[PATCH v10 3/7] system/physmem: Per-AddressSpace bounce buffering

2024-05-07 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.
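For orientation, the indirect-map fallback that moves into the AddressSpace
looks roughly like this (sketch only; the real logic lives in
address_space_map()/address_space_unmap()):

    static void *bounce_map(AddressSpace *as, hwaddr addr, hwaddr *plen,
                            bool is_write, MemTxAttrs attrs)
    {
        if (qatomic_xchg(&as->bounce.in_use, true)) {
            *plen = 0;
            return NULL;   /* busy: caller may register a map client and retry */
        }

        as->bounce.addr = addr;
        as->bounce.len = *plen;
        as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, *plen);
        if (!is_write) {
            /* pre-fill the buffer so the caller sees current guest memory */
            address_space_read(as, addr, attrs, as->bounce.buffer, *plen);
        }
        return as->bounce.buffer;
    }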

Reviewed-by: Peter Xu 
Tested-by: Jonathan Cameron 
Signed-off-by: Mattias Nissler 
Message-ID: <20240507094210.300566-2-mniss...@rivosinc.com>
[PMD: Split patch, part 2/2]
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Mattias Nissler 
---
 include/exec/memory.h | 19 +++
 system/memory.c   |  7 +
 system/physmem.c  | 73 ---
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index e1e0c5a3de..d417d7f363 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1112,6 +1112,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1129,6 +1142,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
diff --git a/system/memory.c b/system/memory.c
index 49f1cb2c38..642a449f8c 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3174,6 +3174,9 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
 QTAILQ_INIT(&as->listeners);
 QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
+as->bounce.in_use = false;
+qemu_mutex_init(&as->map_client_list_lock);
+QLIST_INIT(&as->map_client_list);
 as->name = g_strdup(name ? name : "anonymous");
 address_space_update_topology(as);
 address_space_update_ioeventfds(as);
@@ -3181,6 +3184,10 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 
 static void do_address_space_destroy(AddressSpace *as)
 {
+assert(!qatomic_read(&as->bounce.in_use));
+assert(QLIST_EMPTY(&as->map_client_list));
+qemu_mutex_destroy(&as->map_client_list_lock);
+
 assert(QTAILQ_EMPTY(&as->listeners));
 
 flatview_unref(as->current_map);
diff --git a/system/physmem.c b/system/physmem.c
index 27e754ff57..62758202cf 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3046,26 +3046,8 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
  NULL, len, FLUSH_CACHE);
 }
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
-
-static BounceBuffer bounce;
-
-typedef struct MapClient {
-QEMUBH *bh;
-QLIST_ENTRY(MapClient) link;
-} MapClient;
-
-QemuMutex map_client_list_lock;
-static QLIST_HEAD(, MapClient) map_client_list
-= QLIST_HEAD_INITIALIZER(map_client_list);
-
-static void address_space_unregister_map_client_do(MapClient *client)
+static void
+address_space_unregister_map_client_do(AddressSpaceMapClient *client)
 {
 QLIST_REMOVE(client, link);
 g_free(client);
@@ -3073,10 +3055,10 @@ static void 
address_space_unregister_map_client_do(MapClient *client)
 
 static void address_space_notify_map_clients_locked(AddressSpace *as)
 {
-MapClient *client;
+AddressSpaceMapClient *client;
 
-while (!QLIST_EMPTY(&map_client_list)) {
-client = QLIST_FIRST(&map_client_list);
+while (!QLIST_EMPTY(&as->map_client_list)) {
+client = QLIST_FIRST(&as->map_client_list);
 qemu_bh_schedule(client->bh);
 address_space_unregister_map_client_do(client);
 }
@@ -3084,14 +3066,14 @@ static void 
address_space_notify_map_clients_locked(AddressSpace *as)
 
 void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
 {
-MapClient *client = g_malloc(sizeof(*client));
+AddressSpaceMapClient *client = g_malloc(sizeof(*client));
 
-QEMU_LOCK_GUARD(&map_client_list_lock);
+QEMU_LOCK_GUARD(&as->map_client_list_lock);
 client->bh = bh;
-QLIST_INSERT_HEAD(&map_client_list, client, link);
+QLIST_INSERT_HEAD(&as->map_client_list, client, link);
 /* Write map_client_list before reading in_use.  */
 smp_mb();
-if (!qatomic_read(&bounce.in_use)) {
+if (!qatomic_read(

Re: [PATCH 0/3] system/physmem: Propagate AddressSpace to MapClient helpers

2024-05-07 Thread Mattias Nissler
On Tue, May 7, 2024 at 4:02 PM Philippe Mathieu-Daudé  wrote:
>
> On 7/5/24 14:47, Mattias Nissler wrote:
> > On Tue, May 7, 2024 at 2:30 PM Philippe Mathieu-Daudé wrote:
> >>
> >> Respin of Mattias' patch [1] split to ease review.
> >> Preliminary use QEMU_LOCK_GUARD to simplify.
> >>
> >> I'm OK to include this and the endianness fix [2]
> >> if Mattias agrees, once first patch is reviewed.
> >
> > To be honest, given that this patch series has been lingering for
> > almost a year now, I'm fine with whatever gets us closer to getting
> > this landed. I believe Peter was also considering doing a pull request
> > for the series, so you may want to coordinate with him if you haven't
> > already.
>
> Well I'm sorry, today is the first time I've been looking at it,
> and was trying to help reviewing. I see I was Cc'ed on earlier
> versions but missed them. OK, I'll see with Peter.

It's fine, sorry for being a bit negative.



Re: [PATCH v9 2/5] softmmu: Support concurrent bounce buffers

2024-05-07 Thread Mattias Nissler
On Tue, May 7, 2024 at 2:57 PM Philippe Mathieu-Daudé  wrote:
>
> On 7/5/24 11:42, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> >   * net devices, e.g. when transmitting a packet that is split across
> > several TX descriptors (observed with igb)
> >   * USB host controllers, when handling a packet with multiple data TRBs
> > (observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer per AddressSpace
> > and would fail DMA map requests while the buffer was already in use. In
> > turn, this would cause DMA failures that ultimately manifest as hardware
> > errors from the guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings work
> > correctly also when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited individually for each
> > AddressSpace. The default limit is 4096 bytes, matching the previous
> > maximum buffer size. A new x-max-bounce-buffer-size parameter is
> > provided to configure the limit for PCI devices.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >   hw/pci/pci.c|  8 
> >   include/exec/memory.h   | 14 +++
> >   include/hw/pci/pci_device.h |  3 ++
> >   system/memory.c |  5 ++-
> >   system/physmem.c| 82 ++---
> >   5 files changed, 76 insertions(+), 36 deletions(-)
>
>
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index d417d7f363..2ea1e99da2 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -1117,13 +1117,7 @@ typedef struct AddressSpaceMapClient {
> >   QLIST_ENTRY(AddressSpaceMapClient) link;
> >   } AddressSpaceMapClient;
> >
> > -typedef struct {
> > -MemoryRegion *mr;
> > -void *buffer;
> > -hwaddr addr;
> > -hwaddr len;
> > -bool in_use;
> > -} BounceBuffer;
> > +#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
> >
> >   /**
> >* struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
> > objects
> > @@ -1143,8 +1137,10 @@ struct AddressSpace {
> >   QTAILQ_HEAD(, MemoryListener) listeners;
> >   QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> >
> > -/* Bounce buffer to use for this address space. */
> > -BounceBuffer bounce;
> > +/* Maximum DMA bounce buffer size used for indirect memory map 
> > requests */
> > +uint32_t max_bounce_buffer_size;
>
> Alternatively size_t.

While switching things over, I was surprised to find that
DEFINE_PROP_SIZE wants a uint64_t field rather than a size_t field.
There is a DEFINE_PROP_SIZE32 variant for uint32_t though. Considering
my options, assuming that we want to use size_t for everything other
than the property:

(1) Make PCIDevice::max_bounce_buffer_size size_t and have the
preprocessor select DEFINE_PROP_SIZE/DEFINE_PROP_SIZE32. This makes
the qdev property type depend on the host. Ugh.

(2) Make PCIDevice::max_bounce_buffer_size uint64_t and clamp if
needed when used. Weird to allow larger values that are then clamped,
although it probably doesn't matter in practice since address space is
limited to 4GB anyways.

(3) Make PCIDevice::max_bounce_buffer_size uint32_t and accept the
limitation that the largest bounce buffer limit is 4GB even on 64-bit
hosts.

#3 seemed most pragmatic, so I'll go with that.
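Concretely, a sketch of option (3), assuming the uint32_t field pairs with the
32-bit size property macro:

    /* in struct PCIDevice */
    uint32_t max_bounce_buffer_size;

    /* in the property list */
    DEFINE_PROP_SIZE32("x-max-bounce-buffer-size", PCIDevice,
                       max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),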


>
> > +/* Total size of bounce buffers currently allocated, atomically 
> > accessed */
> > +uint32_t bounce_buffer_size;
>
> Ditto.
>
> >   /* List of callbacks to invoke when buffers free up */
> >   QemuMutex map_client_list_lock;
> >   QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
> > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > index d3dd0f64b2..253b48a688 100644
> > --- a/include/hw/pci/pci_device.h
> > +++ b/include/hw/pci/pci_device.h
> > @@ -160,6 +160,9 @@ struct PCIDevice {
> >   /* ID of standby device in net_failover pair */
> >   char *failover_pair_id;
> >   uint32_t acpi_index;
> > +
> > +/* Maximum DMA bounce buffer size used for indirect memory map 
> > requests */
> > +uint32_t max_bounc

Re: [PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-05-07 Thread Mattias Nissler
On Tue, May 7, 2024 at 2:52 PM Philippe Mathieu-Daudé  wrote:
>
> On 7/5/24 11:43, Mattias Nissler wrote:
> >
> >
> > On Mon, May 6, 2024 at 11:07 PM Mattias Nissler <mniss...@rivosinc.com> wrote:
> >
> >
> >
> > On Mon, May 6, 2024 at 4:44 PM Peter Xu <pet...@redhat.com> wrote:
> >
> > On Thu, Mar 28, 2024 at 08:53:36AM +0100, Mattias Nissler wrote:
> >  > Stefan, to the best of my knowledge this is fully reviewed
> > and ready
> >  > to go in - can you kindly pick it up or advise in case there's
> >  > something I missed? Thanks!
> >
> > Fails cross-compile on mipsel:
> >
> > https://gitlab.com/peterx/qemu/-/jobs/6787790601
> >
> >
> > Ah, bummer, thanks for reporting. 4GB of bounce buffer should be
> > plenty, so switching to 32 bit atomics seems a good idea at first
> > glance. I'll take a closer look tomorrow and send a respin with a fix.
> >
> >
> > To close the loop on this: I have posted v9 with patch #2 adjusted to
> > use uint32_t for size accounting to fix this.
>
> "size accounting" calls for portable size_t type. But if uint32_t
> satisfies our needs, OK.

To clarify, I'm referring to "bounce buffer size accounting", i.e.
keeping track of how much memory we've allocated for bounce buffers. I
don't think that there are practical use cases where anyone would want
to spend more than 4GB on bounce buffers, hence uint32_t seemed fine.
If you prefer size_t (at the expense of using different widths, which
will ultimately be visible in the command line parameter), I'm happy
to switch to that though.



Re: [PATCH 0/3] system/physmem: Propagate AddressSpace to MapClient helpers

2024-05-07 Thread Mattias Nissler
On Tue, May 7, 2024 at 2:30 PM Philippe Mathieu-Daudé  wrote:
>
> Respin of Mattias' patch [1] split to ease review.
> Preliminary use QEMU_LOCK_GUARD to simplify.
>
> I'm OK to include this and the endianness fix [2]
> if Mattias agrees, once first patch is reviewed.

To be honest, given that this patch series has been lingering for
almost a year now, I'm fine with whatever gets us closer to getting
this landed. I believe Peter was also considering doing a pull request
for the series, so you may want to coordinate with him if you haven't
already.



>
>
> Regards,
>
> Phil.
>
> [1] https://lore.kernel.org/qemu-devel/20240507094210.300566-2-mniss...@rivosinc.com/
> [2] https://lore.kernel.org/qemu-devel/20240507094210.300566-6-mniss...@rivosinc.com/
>
> Mattias Nissler (2):
>   system/physmem: Propagate AddressSpace to MapClient helpers
>   system/physmem: Per-AddressSpace bounce buffering
>
> Philippe Mathieu-Daudé (1):
>   system/physmem: Replace qemu_mutex_lock() calls with QEMU_LOCK_GUARD
>
>  include/exec/cpu-common.h |  2 -
>  include/exec/memory.h | 45 +-
>  system/dma-helpers.c  |  4 +-
>  system/memory.c   |  7 +++
>  system/physmem.c  | 98 +++
>  5 files changed, 90 insertions(+), 66 deletions(-)
>
> --
> 2.41.0
>



[PATCH v9 3/5] Update subprojects/libvfio-user

2024-05-07 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.43.2




[PATCH v9 5/5] vfio-user: Fix config space access byte order

2024-05-07 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index a15e291c9a..0e93d7a7b4 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.43.2




Re: [PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-05-07 Thread Mattias Nissler
On Mon, May 6, 2024 at 11:07 PM Mattias Nissler 
wrote:

>
>
> On Mon, May 6, 2024 at 4:44 PM Peter Xu  wrote:
>
>> On Thu, Mar 28, 2024 at 08:53:36AM +0100, Mattias Nissler wrote:
>> > Stefan, to the best of my knowledge this is fully reviewed and ready
>> > to go in - can you kindly pick it up or advise in case there's
>> > something I missed? Thanks!
>>
>> Fails cross-compile on mipsel:
>>
>> https://gitlab.com/peterx/qemu/-/jobs/6787790601
>
>
> Ah, bummer, thanks for reporting. 4GB of bounce buffer should be plenty,
> so switching to 32 bit atomics seems a good idea at first glance. I'll take
> a closer look tomorrow and send a respin with a fix.
>

To close the loop on this: I have posted v9 with patch #2 adjusted to use
uint32_t for size accounting to fix this.


[PATCH v9 1/5] softmmu: Per-AddressSpace bounce buffering

2024-05-07 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.

Reviewed-by: Peter Xu 
Tested-by: Jonathan Cameron 
Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   7 +++
 system/physmem.c  | 101 --
 5 files changed, 93 insertions(+), 66 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 8bc397e251..815342d043 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -147,8 +147,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index dadb5cd65a..d417d7f363 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1112,6 +1112,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1129,6 +1142,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2946,8 +2965,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2972,6 +2991,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/system/dma-helpers.c b/system/dma-helpers.c
index 9b221cf94e..74013308f5 100644
--- a/system/dma-helpers.c
+++ b/system/dma-helpers.c
@@ -169,7 +169,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 return;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/system/memory.c b/system/memor

[PATCH v9 2/5] softmmu: Support concurrent bounce buffers

2024-05-07 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
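As a usage illustration (the device and value are arbitrary examples, not part
of this patch), the limit can be raised for a particular PCI device on the
command line via the new qdev property:

    -device e1000,netdev=net0,x-max-bounce-buffer-size=65536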

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++
 include/hw/pci/pci_device.h |  3 ++
 system/memory.c |  5 ++-
 system/physmem.c| 82 ++---
 5 files changed, 76 insertions(+), 36 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 324c1302d2..69934bfbbf 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_UINT32("x-max-bounce-buffer-size", PCIDevice,
+   max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1204,6 +1206,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
 &pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2633,6 +2637,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index d417d7f363..2ea1e99da2 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1117,13 +1117,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1143,8 +1137,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint32_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint32_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..253b48a688 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint32_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/system/memory.c b/system/memory.c
index 642a449f8c..c288ed354a 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3174,7 +3174,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
  

[PATCH v9 4/5] vfio-user: Message-based DMA support

2024-05-07 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 100 --
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index d9b879e056..a15e291c9a 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger access

[PATCH v9 0/5] Support message-based DMA in vfio-user server

2024-05-07 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks for example. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.
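
As a quick illustration of the cost (hypothetical numbers, not from the series):

    /* One page-sized DMA access vs. the 8-byte per-access limit of the
     * indirect MemoryRegionOps path. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned dma_len = 4096;    /* e.g. one nvme page-sized block */
        const unsigned max_access = 8;    /* current per-access maximum */
        unsigned messages = (dma_len + max_access - 1) / max_access;

        printf("%u-byte DMA -> %u vfio-user DMA request messages\n",
               dma_len, messages);
        return 0;
    }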

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in first bytes of bounce buffer struct as a best effort
  measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref, control flow restored to match
  previous code to simplify review.

* Some cosmetic fixes.

Changes from v5:

* Unregister indirect memory region in libvfio-user dma_unregister callback.

Changes from v6:

* Rebase, resolve straightforward merge conflict in system/dma-helpers.c

Changes from v7:

* Rebase (applied cleanly)

* Restore various Reviewed-by and Tested-by tags that I failed to carry
  forward (I double-checked that the patches haven't changed since the reviewed
  version)

Changes from v8:

* Rebase (clean)

* Change bounce buffer size accounting to use uint32_t so it works also on
  hosts that don't support uint64_t atomics, such as mipsel. As a consequence
  overflows are a real concern now, so switch to a cmpxchg loop for allocating
  bounce buffer space.
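
  For illustration, a minimal sketch of that allocation scheme using plain C11
  atomics (qemu uses its qatomic helpers; the names below are made up):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Reserve 'want' bytes against a shared 32-bit counter without exceeding
     * 'limit', retrying when another thread updates the counter concurrently. */
    static bool reserve_bounce_space(_Atomic uint32_t *size, uint32_t limit,
                                     uint32_t want)
    {
        uint32_t used = atomic_load(size);
        do {
            if (want > limit || used > limit - want) {
                return false;    /* would exceed the per-AddressSpace limit */
            }
        } while (!atomic_compare_exchange_weak(size, &used, used + want));
        return true;
    }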

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 104 +
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 subprojects/libvfio-user.wrap |   2 +-
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   8 ++
 system/physmem.c  | 143 ++
 10 files changed, 228 insertions(+), 89 deletions(-)

-- 
2.43.2




Re: [PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-05-06 Thread Mattias Nissler
On Mon, May 6, 2024 at 4:44 PM Peter Xu  wrote:

> On Thu, Mar 28, 2024 at 08:53:36AM +0100, Mattias Nissler wrote:
> > Stefan, to the best of my knowledge this is fully reviewed and ready
> > to go in - can you kindly pick it up or advise in case there's
> > something I missed? Thanks!
>
> Fails cross-compile on mipsel:
>
> https://gitlab.com/peterx/qemu/-/jobs/6787790601


Ah, bummer, thanks for reporting. 4GB of bounce buffer should be plenty, so
switching to 32 bit atomics seems a good idea at first glance. I'll take a
closer look tomorrow and send a respin with a fix.


Re: [PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-05-06 Thread Mattias Nissler
On Mon, May 6, 2024 at 5:01 PM Stefan Hajnoczi  wrote:

> On Thu, 28 Mar 2024 at 03:54, Mattias Nissler 
> wrote:
> >
> > Stefan, to the best of my knowledge this is fully reviewed and ready
> > to go in - can you kindly pick it up or advise in case there's
> > something I missed? Thanks!
>
> This code is outside the areas that I maintain. I think it would make
> sense for Jag to merge it and send a pull request as vfio-user
> maintainer.


OK, thanks for following up, I'll check with Jag.


Re: [PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-03-28 Thread Mattias Nissler
Stefan, to the best of my knowledge this is fully reviewed and ready
to go in - can you kindly pick it up or advise in case there's
something I missed? Thanks!

On Mon, Mar 4, 2024 at 11:25 AM Peter Xu  wrote:
>
> On Mon, Mar 04, 2024 at 02:05:49AM -0800, Mattias Nissler wrote:
> > This series adds basic support for message-based DMA in qemu's vfio-user
> > server. This is useful for cases where the client does not provide file
> > descriptors for accessing system memory via memory mappings. My motivating 
> > use
> > case is to hook up device models as PCIe endpoints to a hardware design. 
> > This
> > works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> > does not access memory directly, but sends memory requests TLPs to the 
> > hardware
> > design in order to perform DMA.
> >
> > Note that more work is needed to make message-based DMA work well: qemu
> > currently breaks down DMA accesses into chunks of size 8 bytes at maximum, 
> > each
> > of which will be handled in a separate vfio-user DMA request message. This 
> > is
> > quite terrible for large DMA accesses, such as when nvme reads and writes
> > page-sized blocks for example. Thus, I would like to improve qemu to be 
> > able to
> > perform larger accesses, at least for indirect memory regions. I have 
> > something
> > working locally, but since this will likely result in more involved surgery 
> > and
> > discussion, I am leaving this to be addressed in a separate patch.
>
> No objection from my side memory-wise.  It'll be good to get some words
> from Paolo if possible.
>
> Copy Peter Maydell due to the other relevant discussion.
>
> https://lore.kernel.org/qemu-devel/20240228125939.56925-1-heinrich.schucha...@canonical.com/
>
> --
> Peter Xu
>



[PATCH v8 1/5] softmmu: Per-AddressSpace bounce buffering

2024-03-04 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.

Reviewed-by: Peter Xu 
Tested-by: Jonathan Cameron 
Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   7 +++
 system/physmem.c  | 101 --
 5 files changed, 93 insertions(+), 66 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 9ead1be100..bd6999fa35 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -148,8 +148,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 8626a355b3..0658846555 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1106,6 +1106,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1123,6 +1136,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2926,8 +2945,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2952,6 +2971,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/system/dma-helpers.c b/system/dma-helpers.c
index 9b221cf94e..74013308f5 100644
--- a/system/dma-helpers.c
+++ b/system/dma-helpers.c
@@ -169,7 +169,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 return;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/system/memory.c b/system/memor

[PATCH v8 2/5] softmmu: Support concurrent bounce buffers

2024-03-04 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.

Reviewed-by: Peter Xu 
Tested-by: Jonathan Cameron 
Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++
 include/hw/pci/pci_device.h |  3 ++
 system/memory.c |  5 ++-
 system/physmem.c| 80 +
 5 files changed, 74 insertions(+), 36 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 6496d027ca..036b3ff822 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1203,6 +1205,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
                    &pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2632,6 +2636,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0658846555..3fe0e2824c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -,13 +,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1137,8 +1131,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint64_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/system/memory.c b/system/memory.c
index ad0caef1b8..1cf89654a1 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3133,7 +3133,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 

[PATCH v8 4/5] vfio-user: Message-based DMA support

2024-03-04 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 100 --
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index d9b879e056..a15e291c9a 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger access

[PATCH v8 3/5] Update subprojects/libvfio-user

2024-03-04 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.34.1




[PATCH v8 5/5] vfio-user: Fix config space access byte order

2024-03-04 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.
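
For illustration (a hypothetical stand-alone demo, not part of the patch): on a
big-endian host, copying the first bytes of a wider host-endian value yields
its high-order bytes, not the little-endian encoding that config space expects.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint64_t val = 0x1234;
        uint8_t host_order[2], little_endian[2];

        memcpy(host_order, &val, 2);          /* 00 00 on BE hosts, 34 12 on LE */
        little_endian[0] = val & 0xff;        /* what an explicit LE store does, */
        little_endian[1] = (val >> 8) & 0xff; /* independent of host byte order  */

        printf("memcpy: %02x %02x  le: %02x %02x\n",
               host_order[0], host_order[1], little_endian[0], little_endian[1]);
        return 0;
    }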

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Jagannathan Raman 
Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index a15e291c9a..0e93d7a7b4 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v8 0/5] Support message-based DMA in vfio-user server

2024-03-04 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks for example. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in first bytes of bounce buffer struct as a best effort
  measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref, control flow restored to match
  previous code to simplify review.

* Some cosmetic fixes.

Changes from v5:

* Unregister indirect memory region in libvfio-user dma_unregister callback.

Changes from v6:

* Rebase, resolve straightforward merge conflict in system/dma-helpers.c

Changes from v7:

* Rebase (applied cleanly)

* Restore various Reviewed-by and Tested-by tags that I failed to carry
  forward (I double-checked that the patches haven't changed since the reviewed
  version)

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 104 +
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 subprojects/libvfio-user.wrap |   2 +-
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   8 ++
 system/physmem.c  | 141 ++
 10 files changed, 226 insertions(+), 89 deletions(-)

-- 
2.34.1




Re: [PATCH, v2] physmem: avoid bounce buffer too small

2024-02-29 Thread Mattias Nissler
On Thu, Feb 29, 2024 at 1:35 PM Peter Maydell  wrote:
>
> On Thu, 29 Feb 2024 at 11:17, Heinrich Schuchardt
>  wrote:
> > > But yes, I'm not surprised that CXL runs into this. Heinrich,
> > > are you doing CXL testing, or is this some other workload?
> >
> > I am running the UEFI Self-Certification Tests (SCT) on EDK 2 using:
> >
> > qemu-system-riscv64 \
> >-M virt,acpi=off -accel tcg -m 4096 \
> >-serial mon:stdio \
> >-device virtio-gpu-pci \
> >-device qemu-xhci \
> >-device usb-kbd \
> >-drive
> > if=pflash,format=raw,unit=0,file=RISCV_VIRT_CODE.fd,readonly=on \
> >-drive if=pflash,format=raw,unit=1,file=RISCV_VIRT_VARS.fd \
> >-drive file=sct.img,format=raw,if=virtio \
> >-device virtio-net-device,netdev=net0 \
> >-netdev user,id=net0
> >
> > This does not invoke any CXL related stuff.
>
> Hmm, that doesn't seem like it ought to be running into this.
> What underlying memory region is the guest trying to do
> the virtio queue access to?

FWIW, I have seen multiple bounce buffer usage with the generic net TX
path as well as the XHCI controller, so it might be either of these.
Bounce buffering should only take place when the memory region can't
be accessed directly though - I don't see why that's the case for the
given command line.



Re: [PATCH, v2] physmem: avoid bounce buffer too small

2024-02-29 Thread Mattias Nissler
On Thu, Feb 29, 2024 at 12:12 PM Peter Maydell  wrote:
>
> On Thu, 29 Feb 2024 at 10:59, Jonathan Cameron
>  wrote:
> >
> > On Thu, 29 Feb 2024 09:38:29 +
> > Peter Maydell  wrote:
> >
> > > On Wed, 28 Feb 2024 at 19:07, Heinrich Schuchardt
> > >  wrote:
> > > >
> > > > On 28.02.24 19:39, Peter Maydell wrote:
> > > > > The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> > > > > which was the first implementation of this function. I don't think
> > > > > there's a particular reason for that value beyond that it was
> > > > > probably a convenient value that was assumed to be likely "big 
> > > > > enough".
> > > > >
> > > > > I think the idea with this bounce-buffer has always been that this
> > > > > isn't really a code path we expected to end up in very often --
> > > > > it's supposed to be for when devices are doing DMA, which they
> > > > > will typically be doing to memory (backed by host RAM), not
> > > > > devices (backed by MMIO and needing a bounce buffer). So the
> > > > > whole mechanism is a bit "last fallback to stop things breaking
> > > > > entirely".
> > > > >
> > > > > The address_space_map() API says that it's allowed to return
> > > > > a subset of the range you ask for, so if the virtio code doesn't
> > > > > cope with the minimum being set to TARGET_PAGE_SIZE then either
> > > > > we need to fix that virtio code or we need to change the API
> > > > > of this function. (But I think you will also get a reduced
> > > > > range if you try to use it across a boundary between normal
> > > > > host-memory-backed RAM and a device MemoryRegion.)
> > > >
> > > > If we allow a bounce buffer only to be used once (via the in_use flag),
> > > > why do we allow only a single bounce buffer?
> > > >
> > > > Could address_space_map() allocate a new bounce buffer on every call and
> > > > address_space_unmap() deallocate it?
> > > >
> > > > Isn't the design with a single bounce buffer bound to fail with a
> > > > multi-threaded client as collision can be expected?
> > >
> > > Yeah, I don't suppose multi-threaded was particularly expected.
> > > Again, this is really a "handle the case where the guest does
> > > something silly" setup, which is why only one bounce buffer.
> > >
> > > Why is your guest ending up in the bounce-buffer path?
> >
> > Happens for me with emulated CXL memory.
>
> Can we put that in the "something silly" bucket? :-)
> But yes, I'm not surprised that CXL runs into this. Heinrich,
> are you doing CXL testing, or is this some other workload?
>
> > I think the case I saw
> > was split descriptors in virtio via address space caches
> > https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L4043
> >
> > One bounce buffer is in use for the outer loop and another for the 
> > descriptors
> > it is pointing to.
>
> Mmm. The other assumption made in the design of the address_space_map()
> API I think was that it was unlikely that a device would be trying
> to do two DMA operations simultaneously. This is clearly not
> true in practice. We definitely need to fix one end or other of
> this API.
>
> (I'm not sure why the bounce-buffer limit ought to be per-AddressSpace:
> is that just done in Matthias' series so that we can attach an
> x-thingy property to the individual PCI device?)

Yes, that's the result of review feedback to the early iterations of
my series. Specifically, (1) a limit is needed to prevent rogue guests
from hogging unlimited amounts of memory and (2) global parameters are
frowned upon. Setting a suitable limit is much more practical when
targeted at a given device/driver combination.
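
For illustration, the limit can then be raised per device on the command line
(device and values below are arbitrary examples):

    qemu-system-x86_64 ... \
        -device nvme,drive=drv0,serial=deadbeef,x-max-bounce-buffer-size=65536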



Re: [PATCH v7 0/5] Support message-based DMA in vfio-user server

2024-02-29 Thread Mattias Nissler
Hi,

I actually failed to carry forward the Reviewed-by tags from Jag,
Phillipe and Stefan as well when reposting even though I didn't make
any non-trivial changes to the respective patches. I intend to post
another version with the respective tags restored, but I'll give you a
day or two to speak up if you disagree.

Thanks,
Mattias

On Tue, Feb 20, 2024 at 6:06 AM Peter Xu  wrote:
>
> On Mon, Feb 12, 2024 at 12:06:12AM -0800, Mattias Nissler wrote:
> > Changes from v6:
> >
> > * Rebase, resolve straightforward merge conflict in system/dma-helpers.c
>
> Hi, Mattias,
>
> If the change is trivial, feel free to carry over my R-bs in the first two
> patches in the commit message.
>
> Thanks,
>
> --
> Peter Xu
>



Re: [PATCH, v2] physmem: avoid bounce buffer too small

2024-02-29 Thread Mattias Nissler
On Thu, Feb 29, 2024 at 11:22 AM Heinrich Schuchardt
 wrote:
>
> On 29.02.24 02:11, Peter Xu wrote:
> > On Wed, Feb 28, 2024 at 08:07:47PM +0100, Heinrich Schuchardt wrote:
> >> On 28.02.24 19:39, Peter Maydell wrote:
> >>> On Wed, 28 Feb 2024 at 18:28, Heinrich Schuchardt
> >>>  wrote:
> 
>  On 28.02.24 16:06, Philippe Mathieu-Daudé wrote:
> > Hi Heinrich,
> >
> > On 28/2/24 13:59, Heinrich Schuchardt wrote:
> >> virtqueue_map_desc() is called with values of sz that may
> >> exceed
> >> TARGET_PAGE_SIZE. sz = 0x2800 has been observed.
> >
> > Pure (and can also be stupid) question: why virtqueue_map_desc() would map
> > to !direct mem?  Shouldn't those buffers normally allocated from guest RAM?
> >
> >>
> >> We only support a single bounce buffer. We have to avoid
> >> virtqueue_map_desc() calling address_space_map() multiple times.
> >> Otherwise
> >> we see an error
> >>
> >>qemu: virtio: bogus descriptor or out of resources
> >>
> >> Increase the minimum size of the bounce buffer to 0x10000 which matches
> >> the largest value of TARGET_PAGE_SIZE for all architectures.
> >>
> >> Signed-off-by: Heinrich Schuchardt 
> >> ---
> >> v2:
> >>   remove unrelated change
> >> ---
> >> system/physmem.c | 8 ++--
> >> 1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/system/physmem.c b/system/physmem.c
> >> index e3ebc19eef..3c82da1c86 100644
> >> --- a/system/physmem.c
> >> +++ b/system/physmem.c
> >> @@ -3151,8 +3151,12 @@ void *address_space_map(AddressSpace *as,
> >> *plen = 0;
> >> return NULL;
> >> }
> >> -/* Avoid unbounded allocations */
> >> -l = MIN(l, TARGET_PAGE_SIZE);
> >> +/*
> >> + * There is only one bounce buffer. The largest occuring
> >> value of
> >> + * parameter sz of virtqueue_map_desc() must fit into the 
> >> bounce
> >> + * buffer.
> >> + */
> >> +l = MIN(l, 0x10000);
> >
> > Please define this magic value. Maybe ANY_TARGET_PAGE_SIZE or
> > TARGETS_BIGGEST_PAGE_SIZE?
> >
> > Then along:
> >  QEMU_BUILD_BUG_ON(TARGET_PAGE_SIZE <= TARGETS_BIGGEST_PAGE_SIZE);
> 
>  Thank you Philippe for reviewing.
> 
>  TARGETS_BIGGEST_PAGE_SIZE does not fit as the value is not driven by the
>  page size.
>  How about MIN_BOUNCE_BUFFER_SIZE?
>  Is include/exec/memory.h the right include for the constant?
> 
>  I don't think that TARGET_PAGE_SIZE has any relevance for setting the
>  bounce buffer size. I only mentioned it to say that we are not
>  decreasing the value on any existing architecture.
> 
>  I don't know why TARGET_PAGE_SIZE ever got into this piece of code.
>  e3127ae0cdcd ("exec: reorganize address_space_map") does not provide a
>  reason for this choice. Maybe Paolo remembers.
> >>>
> >>> The limitation to a page dates back to commit 6d16c2f88f2a in 2009,
> >>> which was the first implementation of this function. I don't think
> >>> there's a particular reason for that value beyond that it was
> >>> probably a convenient value that was assumed to be likely "big enough".
> >>>
> >>> I think the idea with this bounce-buffer has always been that this
> >>> isn't really a code path we expected to end up in very often --
> >>> it's supposed to be for when devices are doing DMA, which they
> >>> will typically be doing to memory (backed by host RAM), not
> >>> devices (backed by MMIO and needing a bounce buffer). So the
> >>> whole mechanism is a bit "last fallback to stop things breaking
> >>> entirely".
> >>>
> >>> The address_space_map() API says that it's allowed to return
> >>> a subset of the range you ask for, so if the virtio code doesn't
> >>> cope with the minimum being set to TARGET_PAGE_SIZE then either
> >>> we need to fix that virtio code or we need to change the API
> >>> of this function. (But I think you will also get a reduced
> >>> range if you try to use it across a boundary between normal
> >>> host-memory-backed RAM and a device MemoryRegion.)
> >>
> >> If we allow a bounce buffer only to be used once (via the in_use flag), why
> >> do we allow only a single bounce buffer?
> >>
> >> Could address_space_map() allocate a new bounce buffer on every call and
> >> address_space_unmap() deallocate it?
> >>
> >> Isn't the design with a single bounce buffer bound to fail with a
> >> multi-threaded client as collision can be expected?
> >
> > See:
> >
> > https://lore.kernel.org/r/20240212080617.2559498-1-mniss...@rivosinc.com
> >
> > For some reason that series didn't land, but it seems to be helpful in this
> > case too if e.g. there can be multiple of such devices.
> >
> > Thanks,
> >
>
> Hello Peter Xu,
>
> thanks for pointing to your series. What I like 

Re: Crash with CXL + TCG on 8.2: Was Re: qemu cxl memory expander shows numa_node -1

2024-02-18 Thread Mattias Nissler
On Thu, Feb 15, 2024 at 4:29 PM Jonathan Cameron <
jonathan.came...@huawei.com> wrote:

> On Thu, 8 Feb 2024 14:50:59 +
> Jonathan Cameron  wrote:
>
> > On Wed, 7 Feb 2024 17:34:15 +
> > Jonathan Cameron  wrote:
> >
> > > On Fri, 2 Feb 2024 16:56:18 +
> > > Peter Maydell  wrote:
> > >
> > > > On Fri, 2 Feb 2024 at 16:50, Gregory Price <
> gregory.pr...@memverge.com> wrote:
> > > > >
> > > > > On Fri, Feb 02, 2024 at 04:33:20PM +, Peter Maydell wrote:
>
> > > > > > Here we are trying to take an interrupt. This isn't related to
> the
> > > > > > other can_do_io stuff, it's happening because do_ld_mmio_beN
> assumes
> > > > > > it's called with the BQL not held, but in fact there are some
> > > > > > situations where we call into the memory subsystem and we do
> > > > > > already have the BQL.
> > > >
> > > > > It's bugs all the way down as usual!
> > > > > https://xkcd.com/1416/
> > > > >
> > > > > I'll dig in a little next week to see if there's an easy fix. We
> can see
> > > > > the return address is already 0 going into mmu_translate, so it
> does
> > > > > look unrelated to the patch I threw out - but probably still has
> to do
> > > > > with things being on IO.
> > > >
> > > > Yes, the low level memory accessors only need to take the BQL if the
> thing
> > > > being accessed is an MMIO device. Probably what is wanted is for
> those
> > > > functions to do "take the lock if we don't already have it",
> something
> > > > like hw/core/cpu-common.c:cpu_reset_interrupt() does.
> >
> > Got back to x86 testing and indeed not taking the lock in that one path
> > does get things running (with all Gregory's earlier hacks + DMA limits as
> > described below).  Guess it's time to roll some cleaned up patches and
> > see how much everyone screams :)
> >
>
> 3 series sent out:
> (all also on gitlab.com/jic23/qemu cxl-2024-02-15 though I updated patch
> descriptions
> a little after pushing that out)
>
> Main set of fixes (x86 'works' under my light testing after this one)
>
> https://lore.kernel.org/qemu-devel/20240215150133.2088-1-jonathan.came...@huawei.com/
>
> ARM FEAT_HADFS (access and dirty it updating in PTW) workaround for
> missing atomic CAS
>
> https://lore.kernel.org/qemu-devel/20240215151804.2426-1-jonathan.came...@huawei.com/T/#t
>
> DMA / virtio fix:
>
> https://lore.kernel.org/qemu-devel/20240215142817.1904-1-jonathan.came...@huawei.com/
>
> Last thing I need to do is propose a suitable flag to make
> Mattias' bounce buffering size parameter apply to "memory" address space.


For background, I actually had a global bounce buffer size parameter apply
to all address spaces in an earlier version of my series. After discussion
on the list, we settled on an address-space specific parameter so it can be
configured per device. I haven't looked into where the memory accesses in
your context originate from - can they be attributed to a specific entity
to house the parameter?


> Currently
> I'm carrying this: (I've no idea how much is needed but it's somewhere
> between 4k and 1G)
>
> diff --git a/system/physmem.c b/system/physmem.c
> index 43b37942cf..49b961c7a5 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2557,6 +2557,7 @@ static void memory_map_init(void)
>  memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>  address_space_init(&address_space_memory, system_memory, "memory");
>
> +address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024;
>  system_io = g_malloc(sizeof(*system_io));
>  memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
>65536);
>
> Please take a look. These are all in areas of QEMU I'm not particularly
> confident
> about so relying on nice people giving feedback even more than normal!
>
> Thanks to all those who helped with debugging and suggestions.
>
> Thanks,
>
> Jonathan
>
> > Jonathan
> >
> >
> > > >
> > > > -- PMM
> > >
> > > Still a work in progress but I thought I'd give an update on some of
> the fun...
> > >
> > > I have a set of somewhat dubious workarounds that sort of do the job
> (where
> > > the aim is to be able to safely run any workload on top of any valid
> > > emulated CXL device setup).
> > >
> > > To recap, the issue is that for CXL memory interleaving we need to have
> > > fine-grained routing to each device (16k Max Gran).  That was fine
> whilst
> > > pretty much all the testing was DAX based so software wasn't running
> out
> > > of it.  Now the kernel is rather more aggressive in defaulting any
> volatile
> > > CXL memory it finds to being normal memory (in some configs anyway)
> people
> > > started hitting problems. Given one of the most important functions of
> the
> > > emulation is to check data ends up in the right backing stores, I'm not
> > > keen to drop that feature unless we absolutely have to.
> > >
> > > 1) For the simple case of no interleave I have working code that just
> > >shoves the MemoryRegion in directly and all works fine.  That was
> 

[PATCH v7 5/5] vfio-user: Fix config space access byte order

2024-02-12 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index a15e291c9a..0e93d7a7b4 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v7 3/5] Update subprojects/libvfio-user

2024-02-12 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.34.1




Re: [PATCH v6 0/5] Support message-based DMA in vfio-user server

2024-02-12 Thread Mattias Nissler
Hi Jonathan,

To the best of my knowledge, all patches in the series have been thoroughly
reviewed. Admittedly, I got a bit distracted with other things though, so
I've been dragging my feet on follow-through. Sorry about that.

I've just taken another look and found it no longer applies cleanly to
master due to a minor merge conflict. I've just sent a rebased version to
address that.

Stefan, are you OK to pick this up for merging at your next convenience?

Thanks,
Mattias



On Fri, Feb 9, 2024 at 6:39 PM Jonathan Cameron 
wrote:

> On Wed,  1 Nov 2023 06:16:06 -0700
> Mattias Nissler  wrote:
>
> > This series adds basic support for message-based DMA in qemu's vfio-user
> > server. This is useful for cases where the client does not provide file
> > descriptors for accessing system memory via memory mappings. My
> motivating use
> > case is to hook up device models as PCIe endpoints to a hardware design.
> This
> > works by bridging the PCIe transaction layer to vfio-user, and the
> endpoint
> > does not access memory directly, but sends memory requests TLPs to the
> hardware
> > design in order to perform DMA.
> >
> > Note that more work is needed to make message-based DMA work well: qemu
> > currently breaks down DMA accesses into chunks of size 8 bytes at
> maximum, each
> > of which will be handled in a separate vfio-user DMA request message.
> This is
> > quite terrible for large DMA accesses, such as when nvme reads and writes
> > page-sized blocks for example. Thus, I would like to improve qemu to be
> able to
> > perform larger accesses, at least for indirect memory regions. I have
> something
> > working locally, but since this will likely result in more involved
> surgery and
> > discussion, I am leaving this to be addressed in a separate patch.
> >
> Hi Mattias,
>
> I was wondering what the status of this patch set is - seems no
> outstanding issues
> have been raised?
>
> I'd run into a similar problem with multiple DMA mappings using the bounce
> buffer
> when using the emulated CXL memory with virtio-blk-pci accessing it.
>
> In that particular case virtio-blk is using the "memory" address space, but
> otherwise your first 2 patches work for me as well so I'd definitely like
> to see those get merged!
>
> Thanks,
>
> Jonathan
>
> > Changes from v1:
> >
> > * Address Stefan's review comments. In particular, enforce an allocation
> limit
> >   and don't drop the map client callbacks given that map requests can
> fail when
> >   hitting size limits.
> >
> > * libvfio-user version bump now included in the series.
> >
> > * Tested as well on big-endian s390x. This uncovered another byte order
> issue
> >   in vfio-user server code that I've included a fix for.
> >
> > Changes from v2:
> >
> > * Add a preparatory patch to make bounce buffering an
> AddressSpace-specific
> >   concept.
> >
> > * The total buffer size limit parameter is now per AddressSpace and can be
> >   configured for PCIDevice via a property.
> >
> > * Store a magic value in first bytes of bounce buffer struct as a best
> effort
> >   measure to detect invalid pointers in address_space_unmap.
> >
> > Changes from v3:
> >
> > * libvfio-user now supports twin-socket mode which uses separate sockets
> for
> >   client->server and server->client commands, respectively. This
> addresses the
> >   concurrent command bug triggered by server->client DMA access
> commands. See
> >   https://github.com/nutanix/libvfio-user/issues/279 for details.
> >
> > * Add missing teardown code in do_address_space_destroy.
> >
> > * Fix bounce buffer size bookkeeping race condition.
> >
> > * Generate unmap notification callbacks unconditionally.
> >
> > * Some cosmetic fixes.
> >
> > Changes from v4:
> >
> > * Fix accidentally dropped memory_region_unref, control flow restored to
> match
> >   previous code to simplify review.
> >
> > * Some cosmetic fixes.
> >
> > Changes from v5:
> >
> > * Unregister indirect memory region in libvfio-user dma_unregister
> callback.
> >
> > I believe all patches in the series have been reviewed appropriately, so
> my
> > hope is that this will be the final iteration. Stefan, Peter, Jag,
> thanks for
> > your feedback, let me know if there's anything else needed from my side
> before
> > this can get merged.
> >
> > Mattias Nissler (5):
> >   softmmu: Per-AddressSpace bounce buffering
> >   softmmu: Support concurrent bounce buff

[PATCH v7 1/5] softmmu: Per-AddressSpace bounce buffering

2024-02-12 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.

Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   7 +++
 system/physmem.c  | 101 --
 5 files changed, 93 insertions(+), 66 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 9ead1be100..bd6999fa35 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -148,8 +148,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 177be23db7..6995a443d3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1106,6 +1106,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1123,6 +1136,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2926,8 +2945,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2952,6 +2971,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/system/dma-helpers.c b/system/dma-helpers.c
index 9b221cf94e..74013308f5 100644
--- a/system/dma-helpers.c
+++ b/system/dma-helpers.c
@@ -169,7 +169,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 return;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/system/memory.c b/system/memory.c
index a229a79988..ad0caef1b8 100644
--- 

[PATCH v7 2/5] softmmu: Support concurrent bounce buffers

2024-02-12 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++
 include/hw/pci/pci_device.h |  3 ++
 system/memory.c |  5 ++-
 system/physmem.c| 80 +
 5 files changed, 74 insertions(+), 36 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 6496d027ca..036b3ff822 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1203,6 +1205,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
                    &pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2632,6 +2636,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6995a443d3..e7bc4717ea 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1111,13 +1111,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1137,8 +1131,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint64_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/system/memory.c b/system/memory.c
index ad0caef1b8..1cf89654a1 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3133,7 +3133,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
  

[PATCH v7 4/5] vfio-user: Message-based DMA support

2024-02-12 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.
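
To illustrate the effect (sketch only, not part of the patch; the function and
variable names are made up), a device model behind the vfio-user server keeps
using the ordinary DMA helpers, and accesses that hit the indirect region are
turned into vfio-user DMA request messages:

    #include "qemu/osdep.h"
    #include "sysemu/dma.h"
    #include "hw/pci/pci_device.h"

    static void read_descriptor_example(PCIDevice *dev, dma_addr_t desc_addr)
    {
        uint64_t desc;

        /* With message-based DMA this becomes a vfio-user DMA read request
         * rather than a direct access to mmap()-ed client memory. */
        if (pci_dma_read(dev, desc_addr, &desc, sizeof(desc)) != MEMTX_OK) {
            return; /* e.g. address not covered by a registered DMA region */
        }
        /* ... use desc ... */
    }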

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 100 --
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index d9b879e056..a15e291c9a 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ 

[PATCH v7 0/5] Support message-based DMA in vfio-user server

2024-02-12 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks for example. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in the first bytes of the bounce buffer struct as a
  best-effort measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref, control flow restored to match
  previous code to simplify review.

* Some cosmetic fixes.

Changes from v5:

* Unregister indirect memory region in libvfio-user dma_unregister callback.

Changes from v6:

* Rebase, resolve straightforward merge conflict in system/dma-helpers.c

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 104 +
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 subprojects/libvfio-user.wrap |   2 +-
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   8 ++
 system/physmem.c  | 141 ++
 10 files changed, 226 insertions(+), 89 deletions(-)

-- 
2.34.1




[PATCH v6 5/5] vfio-user: Fix config space access byte order

2023-11-01 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.
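
For reference, ldn_le_p()/stn_le_p() (from "qemu/bswap.h") load and store an
n-byte value in little-endian byte order regardless of host endianness. A
small sketch of the effect (illustrative only, not part of the patch):

    #include "qemu/osdep.h"
    #include "qemu/bswap.h"

    static void le_example(void)
    {
        uint8_t buf[2];

        /* Store 0x1234 as 2 bytes: buf becomes { 0x34, 0x12 } on any host,
         * matching the little-endian layout of PCI config space. */
        stn_le_p(buf, 2, 0x1234);

        /* Load it back: val is 0x1234 on both big- and little-endian hosts,
         * whereas a plain memcpy into a uint16_t would differ on s390x. */
        uint64_t val = ldn_le_p(buf, 2);
        (void)val;
    }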

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 9f5e385668..46a2036bd1 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v6 3/5] Update subprojects/libvfio-user

2023-11-01 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.34.1




[PATCH v6 2/5] softmmu: Support concurrent bounce buffers

2023-11-01 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
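
A rough sketch of how such a per-AddressSpace limit can be enforced (this is
not the patch code; it only assumes the bounce_buffer_size /
max_bounce_buffer_size fields added below and the qatomic helpers from
"qemu/atomic.h"):

    /* Try to reserve 'len' bytes of bounce buffer space; roll back and fail
     * if the reservation would exceed the AddressSpace limit. */
    static bool reserve_bounce_space(AddressSpace *as, uint64_t len)
    {
        uint64_t used = qatomic_fetch_add(&as->bounce_buffer_size, len);
        if (used + len > as->max_bounce_buffer_size) {
            qatomic_sub(&as->bounce_buffer_size, len);
            return false; /* caller fails the map; clients may retry later */
        }
        return true;
    }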

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++
 include/hw/pci/pci_device.h |  3 ++
 system/memory.c |  5 ++-
 system/physmem.c| 80 +
 5 files changed, 74 insertions(+), 36 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 7d09e1a39d..206826fcf2 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1201,6 +1203,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
&pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2657,6 +2661,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index fd2cbee65b..4f99bb4e6c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1091,13 +1091,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1117,8 +1111,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint64_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/system/memory.c b/system/memory.c
index a63bbc79a7..0b90ad8b07 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3132,7 +3132,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
  

[PATCH v6 4/5] vfio-user: Message-based DMA support

2023-11-01 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 100 --
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..9f5e385668 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ 

[PATCH v6 1/5] softmmu: Per-AddressSpace bounce buffering

2023-11-01 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.
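
For orientation (sketch only, not part of the patch; callback and variable
names are made up), the per-AddressSpace map-client API introduced below is
used the same way the old cpu_register_map_client() was, e.g. to retry a
failed mapping once bounce buffer space frees up:

    static void retry_cb(void *opaque)
    {
        /* Reschedule the DMA operation that previously failed to map. */
    }

    static void try_map(AddressSpace *as, hwaddr addr, hwaddr len, void *opaque)
    {
        hwaddr plen = len;
        void *buf = address_space_map(as, addr, &plen, false,
                                      MEMTXATTRS_UNSPECIFIED);
        if (!buf) {
            /* Resources exhausted: get called back when a buffer frees up. */
            QEMUBH *bh = aio_bh_new(qemu_get_current_aio_context(),
                                    retry_cb, opaque);
            address_space_register_map_client(as, bh);
            return;
        }
        /* ... use buf[0..plen) and eventually:
         * address_space_unmap(as, buf, plen, false, plen); */
    }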

Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   7 +++
 system/physmem.c  | 101 --
 5 files changed, 93 insertions(+), 66 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 30c376a4de..0d31856898 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -160,8 +160,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9087d02769..fd2cbee65b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1086,6 +1086,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1103,6 +1116,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2874,8 +2893,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2900,6 +2919,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/system/dma-helpers.c b/system/dma-helpers.c
index 36211acc7e..611ea04ffb 100644
--- a/system/dma-helpers.c
+++ b/system/dma-helpers.c
@@ -167,7 +167,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 goto out;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/system/memory.c b/system/memory.c
index 4928f2525d..a63bbc79a7 100644
--- 

[PATCH v6 0/5] Support message-based DMA in vfio-user server

2023-11-01 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks for example. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in the first bytes of the bounce buffer struct as a
  best-effort measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref, control flow restored to match
  previous code to simplify review.

* Some cosmetic fixes.

Changes from v5:

* Unregister indirect memory region in libvfio-user dma_unregister callback.

I believe all patches in the series have been reviewed appropriately, so my
hope is that this will be the final iteration. Stefan, Peter, Jag, thanks for
your feedback, let me know if there's anything else needed from my side before
this can get merged.

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c | 104 +
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 subprojects/libvfio-user.wrap |   2 +-
 system/dma-helpers.c  |   4 +-
 system/memory.c   |   8 ++
 system/physmem.c  | 141 ++
 10 files changed, 226 insertions(+), 89 deletions(-)

-- 
2.34.1




Re: [RFC] Proposal of QEMU PCI Endpoint test environment

2023-10-06 Thread Mattias Nissler
On Fri, Oct 6, 2023 at 1:51 PM Shunsuke Mie  wrote:
>
>
> On 2023/10/05 16:02, Mattias Nissler wrote:
> > On Thu, Oct 5, 2023 at 3:31 AM Shunsuke Mie  wrote:
> >> Hi Jiri, Mattias and all.
> >>
> >> On Wed, Oct 4, 2023 at 16:36 Mattias Nissler :
> >>>> hi shunsuke, all,
> >>>> what about vfio-user + qemu?
> >> Thank you for the suggestion.
> >>
> >>> FWIW, I have had some good success using VFIO-user to bridge software 
> >>> components to hardware designs. For the most part, I have been hooking up 
> >>> software endpoint models to hardware design components speaking the PCIe 
> >>> transaction layer protocol. The central piece you need is a way to 
> >>> translate between the VFIO-user protocol and PCIe transaction layer 
> >>> messages, basically converting ECAM accesses, memory accesses (DMA+MMIO), 
> >>> and interrupts between the two worlds. I have some code which implements 
> >>> the basics of that. It's certainly far from complete (TLP is a massive 
> >>> protocol), but it works well enough for me. I believe we should be able 
> >>> to open-source this if there's interest, let me know.
> >> It is what I want to do, but I'm not familiar with the vfio and vfio-user, 
> >> and I have a question. QEMU has a PCI TLP communication implementation for 
> >> Multi-process QEMU[1]. It is similar to your success.
> > I'm no qemu expert, but my understanding is that the plan is for the
> > existing multi-process QEMU implementation to eventually be
> > superseded/replaced by the VFIO-user based one (qemu folks, please
> > correct me if I'm wrong). From a functional perspective they are more
> > or less equivalent AFAICT.
> >
> The project is promising.
>
> I found a session about adapting vfio to Multi-process QEMU[1] in KVM
> Forum 2021, but I couldn't find any posted patches.
> If anyone knows the status of this project, could you please let me know?

Again, I'm just an interested bystander, so take my words with a grain
of salt. That said, my understanding is that there is an intention to
get the vfio-user client code into qemu in the foreseeable future. The
most recent version of the code that I'm aware of is here:
https://github.com/oracle/qemu/tree/vfio-user-p3.1

>
> [1] https://www.youtube.com/watch?v=NBT8rImx3VE
> >> The multi-process qemu also communicates TLP over UDS. Could you let me 
> >> know your opinion about it?
> > Note that neither multi-process qemu nor VFIO-user actually pass
> > around TLPs, but rather have their own command language to encode
> > ECAM, MMIO, DMA, interrupts etc. However, translation from/to TLP is
> > possible and works well enough in my experience.
> I agree.
> >>> One thing to note is that there are currently some limits to bridging 
> >>> VFIO-user / TLP that I haven't figured out and/or will need further work: 
> >>> Advanced PCIe concepts like PASID, ATS/PRI, SR-IOV etc. may lack 
> >>> equivalents on the VFIO-user side that would have to be filled in. The 
> >>> folk behind libvfio-user[2] have been very approachable and open to 
> >>> improvements in my experience though.
> >>>
> >>> If I understand correctly, the specific goal here is testing PCIe 
> >>> endpoint designs against a Linux host. What you'd need for that is a PCI 
> >>> host controller for the Linux side to talk to and then hooking up 
> >>> endpoints on the transaction layer. QEMU can simulate host controllers 
> >>> that work with existing Linux drivers just fine. Then you can put a 
> >>> vfio-user-pci stub device (I don't think this has landed in qemu yet, but 
> >>> you can find the code at [1]) on the simulated PCI bus which will expose 
> >>> any software interactions with the endpoint as VFIO-user protocol 
> >>> messages over unix domain socket. The piece you need to bring is a 
> >>> VFIO-user server that handles these messages. Its task is basically 
> >>> translating between VFIO-user and TLP and then injecting TLP into your 
> >>> hardware design.
> >> Yes, If the pci host controller you said can be implemented, I can achieve 
> >> my goal.
> > I meant to say that the existing PCIe host controller implementations
> > in qemu can be used as is.
> Sorry, I misunderstood.
> >> To begin with, I'll investigate the vfio and libvfio-user.  Thanks!.
> >>
> >> [1] https://www.qemu.org/docs/master/system/multi-process.html
> >>
> >> Best,
> >> Shunsuke
> >>>
> >>> [1] https://github.com/oracle/qemu/tree/vfio-user-p3.1 - I believe that's 
> >>> the latest version, Jagannathan Raman will know best
> >>> [2] https://github.com/nutanix/libvfio-user
> >>>



Re: [PATCH v5 4/5] vfio-user: Message-based DMA support

2023-10-06 Thread Mattias Nissler
On Wed, Oct 4, 2023 at 4:54 PM Jag Raman  wrote:

>
>
> > On Sep 20, 2023, at 4:06 AM, Mattias Nissler 
> wrote:
> >
> > Wire up support for DMA for the case where the vfio-user client does not
> > provide mmap()-able file descriptors, but DMA requests must be performed
> > via the VFIO-user protocol. This installs an indirect memory region,
> > which already works for pci_dma_{read,write}, and pci_dma_map works
> > thanks to the existing DMA bounce buffering support.
> >
> > Note that while simple scenarios work with this patch, there's a known
> > race condition in libvfio-user that will mess up the communication
> > channel. See https://github.com/nutanix/libvfio-user/issues/279 for
> > details as well as a proposed fix.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> > hw/remote/trace-events|  2 +
> > hw/remote/vfio-user-obj.c | 84 +++
> > 2 files changed, 79 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/remote/trace-events b/hw/remote/trace-events
> > index 0d1b7d56a5..358a68fb34 100644
> > --- a/hw/remote/trace-events
> > +++ b/hw/remote/trace-events
> > @@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg:
> 0x%x -> 0x%x"
> > vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
> > vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA
> 0x%"PRIx64", %zu bytes"
> > vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
> > +vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu
> bytes"
> > +vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64",
> %zu bytes"
> > vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr
> 0x%"PRIx64" size 0x%"PRIx64""
> > vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR
> address 0x%"PRIx64""
> > vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR
> address 0x%"PRIx64""
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index 8b10c32a3c..6a561f7969 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t
> *vfu_ctx, char * const buf,
> > return count;
> > }
> >
> > +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t
> *val,
> > +unsigned size, MemTxAttrs attrs)
> > +{
> > +MemoryRegion *region = opaque;
> > +vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
> > +uint8_t buf[sizeof(uint64_t)];
> > +
> > +trace_vfu_dma_read(region->addr + addr, size);
> > +
> > +g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
> > +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > +if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0
> ||
> > +vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
> > +return MEMTX_ERROR;
> > +}
> > +
> > +*val = ldn_he_p(buf, size);
> > +
> > +return MEMTX_OK;
> > +}
> > +
> > +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t
> val,
> > + unsigned size, MemTxAttrs attrs)
> > +{
> > +MemoryRegion *region = opaque;
> > +vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
> > +uint8_t buf[sizeof(uint64_t)];
> > +
> > +trace_vfu_dma_write(region->addr + addr, size);
> > +
> > +stn_he_p(buf, size, val);
> > +
> > +g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
> > +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > +if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0
> ||
> > +vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
> > +return MEMTX_ERROR;
> > +}
> > +
> > +return MEMTX_OK;
> > +}
> > +
> > +static const MemoryRegionOps vfu_dma_ops = {
> > +.read_with_attrs = vfu_dma_read,
> > +.write_with_attrs = vfu_dma_write,
> > +.endianness = DEVICE_HOST_ENDIAN,
> > +.valid = {
> > +.min_access_size = 1,
> > +.max_access_size = 8,
> > +.unaligned = true,
> > +},
> > +.impl = {
> > +.min_access_siz

Re: [PATCH v5 5/5] vfio-user: Fix config space access byte order

2023-10-06 Thread Mattias Nissler
On Thu, Oct 5, 2023 at 6:30 PM Jag Raman  wrote:

>
>
> > On Sep 20, 2023, at 4:06 AM, Mattias Nissler 
> wrote:
> >
> > PCI config space is little-endian, so on a big-endian host we need to
> > perform byte swaps for values as they are passed to and received from
> > the generic PCI config space access machinery.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> > hw/remote/vfio-user-obj.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index 6a561f7969..6043a91b11 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t
> *vfu_ctx, char * const buf,
> > while (bytes > 0) {
> > len = (bytes > pci_access_width) ? pci_access_width : bytes;
> > if (is_write) {
> > -memcpy(&val, ptr, len);
> > +val = ldn_le_p(ptr, len);
> > pci_host_config_write_common(o->pci_dev, offset,
> >  pci_config_size(o->pci_dev),
> >  val, len);
> > @@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t
> *vfu_ctx, char * const buf,
> > } else {
> > val = pci_host_config_read_common(o->pci_dev, offset,
> >
>  pci_config_size(o->pci_dev), len);
> > -memcpy(ptr, &val, len);
> > +stn_le_p(ptr, len, val);
> > trace_vfu_cfg_read(offset, val);
> > }
> > offset += len;
> > --
> > 2.34.1
> >
>
> Hey,
>
> When you tested on s390x, could you see the correct values for the config
> space in the Kernel? For example, were any known device's vendor and device
> IDs valid?
>

I don't exactly remember whether I checked vendor and device IDs, but I've
done something more comprehensive: I set up a qemu vfio-user server
exposing an nvme device and then connected a qemu client (with the
vfio-user client patches from the oracle qemu github repo). Linux running
in the client probes the nvme device successfully and I've mounted a file
system on it. Both qemu binaries are s390x.


>
> I'm asking because flatview_read_continue() / flatview_write_continue()
> does endianness adjustment. So, I want to confirm that the endianness
> adjustment in your code also makes sense from Kernel's perspective.
>

The conversion in the flatview access path is adjusting from the endianness
of the memory region to what the emulated CPU needs. Since the PCI config
space memory region is little-endian (see pci_host_data_le_ops), we're
doing a swap there. The code I'm changing is backing the memory region, so
incoming/outgoing data for writes/reads must be in little-endian to adhere
to the endianness declared by the memory region.


>
> I'm trying to access a big-endian system, but no luck.
>

Btw. I don't have access to big-endian hardware either, but it was
surprisingly straightforward to make my x86 ubuntu machine run s390x
binaries via multiarch + qemu user mode (qemu turtles all the way down :-D)


>
> #0  0x55b97a30 in vfio_user_io_region_read
> (vbasedev=0x57802c80, index=7 '\a', off=4, size=2, data=0x7fff6cfb945c)
> at ../hw/vfio/user.c:1985
> #1  0x55b8dcfb in vfio_pci_read_config (pdev=0x57802250,
> addr=4, len=2) at ../hw/vfio/pci.c:1202
> #2  0x5599d3f9 in pci_host_config_read_common
> (pci_dev=0x57802250, addr=addr@entry=4, limit=,
> limit@entry=256, len=len@entry=2) at ../hw/pci/pci_host.c:107
> #3  0x5599d74a in pci_data_read (s=,
> addr=2147493892, len=2) at ../hw/pci/pci_host.c:143
> #4  0x5599d84f in pci_host_data_read (opaque=,
> addr=, len=) at ../hw/pci/pci_host.c:188
> #5  0x55bc3c4d in memory_region_read_accessor 
> (mr=mr@entry=0x569de370,
> addr=0, value=value@entry=0x7fff6cfb9710, size=size@entry=2, shift=0,
> mask=mask@entry=65535, attrs=...) at ../softmmu/memory.c:441
> #6  0x55bc3fce in access_with_adjusted_size (addr=addr@entry=0,
> value=value@entry=0x7fff6cfb9710, size=size@entry=2,
> access_size_min=, access_size_max=,
> access_fn=0x55bc3c10 , mr=0x569de370,
> attrs=...) at ../softmmu/memory.c:569
> #7  0x55bc41a1 in memory_region_dispatch_read1 (attrs=..., size=2,
> pval=0x7fff6cfb9710, addr=, mr=) at
> ../softmmu/memory.c:1443
> #8  0x55bc41a1 in memory_region_dispatch_read 
> (mr=mr@entry=0x569de370,
> addr=, pval=pval@entry=0x7fff6cfb9710, op=MO_16,
> attrs=attrs@entry=...) at ../softmmu/memory.c:1476
> #9  0x55bce13b in flatview_read_continue (fv=fv@entry

Re: [RFC] Proposal of QEMU PCI Endpoint test environment

2023-10-05 Thread Mattias Nissler
On Thu, Oct 5, 2023 at 3:31 AM Shunsuke Mie  wrote:
>
> Hi Jiri, Mattias and all.
>
> On Wed, Oct 4, 2023 at 16:36 Mattias Nissler :
>>>
>>> hi shunsuke, all,
>>> what about vfio-user + qemu?
>
> Thank you for the suggestion.
>
>> FWIW, I have had some good success using VFIO-user to bridge software 
>> components to hardware designs. For the most part, I have been hooking up 
>> software endpoint models to hardware design components speaking the PCIe 
>> transaction layer protocol. The central piece you need is a way to translate 
>> between the VFIO-user protocol and PCIe transaction layer messages, 
>> basically converting ECAM accesses, memory accesses (DMA+MMIO), and 
>> interrupts between the two worlds. I have some code which implements the 
>> basics of that. It's certainly far from complete (TLP is a massive 
>> protocol), but it works well enough for me. I believe we should be able to 
>> open-source this if there's interest, let me know.
>
> It is what I want to do, but I'm not familiar with the vfio and vfio-user, 
> and I have a question. QEMU has a PCI TLP communication implementation for 
> Multi-process QEMU[1]. It is similar to your success.

I'm no qemu expert, but my understanding is that the plan is for the
existing multi-process QEMU implementation to eventually be
superseded/replaced by the VFIO-user based one (qemu folks, please
correct me if I'm wrong). From a functional perspective they are more
or less equivalent AFAICT.

> The multi-process qemu also communicates TLP over UDS. Could you let me know 
> your opinion about it?

Note that neither multi-process qemu nor VFIO-user actually pass
around TLPs, but rather have their own command language to encode
ECAM, MMIO, DMA, interrupts etc. However, translation from/to TLP is
possible and works well enough in my experience.

>
>> One thing to note is that there are currently some limits to bridging 
>> VFIO-user / TLP that I haven't figured out and/or will need further work: 
>> Advanced PCIe concepts like PASID, ATS/PRI, SR-IOV etc. may lack equivalents 
>> on the VFIO-user side that would have to be filled in. The folk behind 
>> libvfio-user[2] have been very approachable and open to improvements in my 
>> experience though.
>>
>> If I understand correctly, the specific goal here is testing PCIe endpoint 
>> designs against a Linux host. What you'd need for that is a PCI host 
>> controller for the Linux side to talk to and then hooking up endpoints on 
>> the transaction layer. QEMU can simulate host controllers that work with 
>> existing Linux drivers just fine. Then you can put a vfio-user-pci stub 
>> device (I don't think this has landed in qemu yet, but you can find the code 
>> at [1]) on the simulated PCI bus which will expose any software interactions 
>> with the endpoint as VFIO-user protocol messages over unix domain socket. 
>> The piece you need to bring is a VFIO-user server that handles these 
>> messages. Its task is basically translating between VFIO-user and TLP and 
>> then injecting TLP into your hardware design.
>
> Yes, If the pci host controller you said can be implemented, I can achieve my 
> goal.

I meant to say that the existing PCIe host controller implementations
in qemu can be used as is.

>
> To begin with, I'll investigate the vfio and libvfio-user.  Thanks!.
>
> [1] https://www.qemu.org/docs/master/system/multi-process.html
>
> Best,
> Shunsuke
>>
>>
>> [1] https://github.com/oracle/qemu/tree/vfio-user-p3.1 - I believe that's 
>> the latest version, Jagannathan Raman will know best
>> [2] https://github.com/nutanix/libvfio-user
>>
>



Re: [RFC] Proposal of QEMU PCI Endpoint test environment

2023-10-04 Thread Mattias Nissler
>
> hi shunsuke, all,
> what about vfio-user + qemu?
>

FWIW, I have had some good success using VFIO-user to bridge software
components to hardware designs. For the most part, I have been hooking up
software endpoint models to hardware design components speaking the PCIe
transaction layer protocol. The central piece you need is a way to
translate between the VFIO-user protocol and PCIe transaction layer
messages, basically converting ECAM accesses, memory accesses (DMA+MMIO),
and interrupts between the two worlds. I have some code which implements
the basics of that. It's certainly far from complete (TLP is a massive
protocol), but it works well enough for me. I believe we should be able to
open-source this if there's interest, let me know.

One thing to note is that there are currently some limits to bridging
VFIO-user / TLP that I haven't figured out and/or will need further work:
Advanced PCIe concepts like PASID, ATS/PRI, SR-IOV etc. may lack
equivalents on the VFIO-user side that would have to be filled in. The folk
behind libvfio-user[2] have been very approachable and open to improvements
in my experience though.

If I understand correctly, the specific goal here is testing PCIe endpoint
designs against a Linux host. What you'd need for that is a PCI host
controller for the Linux side to talk to and then hooking up endpoints on
the transaction layer. QEMU can simulate host controllers that work with
existing Linux drivers just fine. Then you can put a vfio-user-pci stub
device (I don't think this has landed in qemu yet, but you can find the
code at [1]) on the simulated PCI bus which will expose any software
interactions with the endpoint as VFIO-user protocol messages over unix
domain socket. The piece you need to bring is a VFIO-user server that
handles these messages. Its task is basically translating between VFIO-user
and TLP and then injecting TLP into your hardware design.

[1] https://github.com/oracle/qemu/tree/vfio-user-p3.1 - I believe that's
the latest version, Jagannathan Raman will know best
[2] https://github.com/nutanix/libvfio-user


[PATCH v5 2/5] softmmu: Support concurrent bounce buffers

2023-09-20 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++
 include/hw/pci/pci_device.h |  3 ++
 softmmu/memory.c|  5 ++-
 softmmu/physmem.c   | 80 +
 5 files changed, 74 insertions(+), 36 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 881d774fb6..d071ac8091 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
&pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7d68936157..67379bd9cc 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1081,13 +1081,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1106,8 +1100,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint64_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/softmmu/memory.c b/softmmu/memory.c
index ffa37fc327..24d90b10b2 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
  

[PATCH v5 5/5] vfio-user: Fix config space access byte order

2023-09-20 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 6a561f7969..6043a91b11 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v5 1/5] softmmu: Per-AddressSpace bounce buffering

2023-09-20 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.

Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 softmmu/dma-helpers.c |   4 +-
 softmmu/memory.c  |   7 +++
 softmmu/physmem.c | 101 --
 5 files changed, 93 insertions(+), 66 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 41788c0bdd..63463c415d 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -138,8 +138,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 68284428f8..7d68936157 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1076,6 +1076,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1092,6 +1105,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2832,8 +2851,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2858,6 +2877,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 2463964805..d9fc26c063 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -167,7 +167,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 goto out;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7d9494ce70..ffa37fc327 1006

[PATCH v5 3/5] Update subprojects/libvfio-user

2023-09-20 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.34.1




[PATCH v5 0/5] Support message-based DMA in vfio-user server

2023-09-20 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory request TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in the first bytes of the bounce buffer struct as a
  best-effort measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Changes from v4:

* Fix accidentally dropped memory_region_unref; control flow restored to match
  the previous code to simplify review.

* Some cosmetic fixes.

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c |  88 ++---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 softmmu/dma-helpers.c |   4 +-
 softmmu/memory.c  |   8 ++
 softmmu/physmem.c | 141 ++
 subprojects/libvfio-user.wrap |   2 +-
 10 files changed, 218 insertions(+), 81 deletions(-)

-- 
2.34.1




[PATCH v5 4/5] vfio-user: Message-based DMA support

2023-09-20 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.
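
For illustration only (not part of this patch): with the indirect region in
place, a device model's regular DMA helpers keep working unchanged and end up
in vfu_dma_read()/vfu_dma_write() below. A minimal sketch of such a caller,
where dev and desc_addr are hypothetical:

    uint32_t desc;
    if (pci_dma_read(dev, desc_addr, &desc, sizeof(desc)) != MEMTX_OK) {
        /* handle the DMA error */
    }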

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|  2 +
 hw/remote/vfio-user-obj.c | 84 +++
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..6a561f7969 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ 

Re: [PATCH v4 2/5] softmmu: Support concurrent bounce buffers

2023-09-19 Thread Mattias Nissler
On Tue, Sep 19, 2023 at 7:14 PM Peter Xu  wrote:
>
> On Tue, Sep 19, 2023 at 09:08:10AM -0700, Mattias Nissler wrote:
> > @@ -3119,31 +3143,35 @@ void *address_space_map(AddressSpace *as,
> >  void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> >   bool is_write, hwaddr access_len)
> >  {
> > -if (buffer != as->bounce.buffer) {
> > -MemoryRegion *mr;
> > -ram_addr_t addr1;
> > +MemoryRegion *mr;
> > +ram_addr_t addr1;
> > +
> > +mr = memory_region_from_host(buffer, &addr1);
> > +if (mr == NULL) {
> > +BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> > +assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
> >
> > -mr = memory_region_from_host(buffer, &addr1);
> > -assert(mr != NULL);
> >  if (is_write) {
> > -invalidate_and_set_dirty(mr, addr1, access_len);
> > -}
> > -if (xen_enabled()) {
> > -xen_invalidate_map_cache_entry(buffer);
> > +address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> > +bounce->buffer, access_len);
> >  }
> > -memory_region_unref(mr);
> > +
> > +memory_region_unref(bounce->mr);
> > +qatomic_sub(&as->bounce_buffer_size, bounce->len);
> > +/* Write bounce_buffer_size before reading map_client_list. */
> > +smp_mb();
> > +address_space_notify_map_clients(as);
> > +bounce->magic = ~BOUNCE_BUFFER_MAGIC;
> > +g_free(bounce);
> >  return;
> >  }
> > +
> > +if (xen_enabled()) {
> > +xen_invalidate_map_cache_entry(buffer);
> > +}
> >  if (is_write) {
> > -address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> > -as->bounce.buffer, access_len);
> > -}
> > -qemu_vfree(as->bounce.buffer);
> > -as->bounce.buffer = NULL;
> > -memory_region_unref(as->bounce.mr);
>
> This line needs to be kept?

Yes, good catch. Thanks!

>
> > -/* Clear in_use before reading map_client_list.  */
> > -qatomic_set_mb(&as->bounce.in_use, false);
> > -address_space_notify_map_clients(as);
> > +invalidate_and_set_dirty(mr, addr1, access_len);
> > +}
> >  }
>
> --
> Peter Xu
>



[PATCH v4 2/5] softmmu: Support concurrent bounce buffers

2023-09-19 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
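
For illustration only (not part of this patch): the per-AddressSpace limit can
be enforced with atomic bookkeeping along the following lines; the field names
follow this patch, but the exact code in softmmu/physmem.c may differ:

    /* In address_space_map(), before allocating a new bounce buffer: */
    uint64_t prev = qatomic_fetch_add(&as->bounce_buffer_size, size);
    if (prev + size > as->max_bounce_buffer_size) {
        /* Over the limit: undo the reservation and fail the map request. */
        qatomic_sub(&as->bounce_buffer_size, size);
        return NULL;
    }
    /* ...otherwise allocate a BounceBuffer of 'size' bytes; the reservation
     * is released again via qatomic_sub() in address_space_unmap(). */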

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 +++---
 include/hw/pci/pci_device.h |  3 ++
 softmmu/memory.c|  5 ++-
 softmmu/physmem.c   | 88 -
 5 files changed, 77 insertions(+), 41 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 881d774fb6..d071ac8091 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
&pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7d68936157..67379bd9cc 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1081,13 +1081,7 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
 
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
@@ -1106,8 +1100,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint64_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/softmmu/memory.c b/softmmu/memory.c
index ffa37fc327..24d90b10b2 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
  

[PATCH v4 3/5] Update subprojects/libvfio-user

2023-09-19 Thread Mattias Nissler
Brings in assorted bug fixes. The following are of particular interest
with respect to message-based DMA support:

* bb308a2 "Fix address calculation for message-based DMA"
  Corrects a bug in DMA address calculation.

* 1569a37 "Pass server->client command over a separate socket pair"
  Adds support for separate sockets for either command direction,
  addressing a bug where libvfio-user gets confused if both client and
  server send commands concurrently.

Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..cdf0a7a375 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = 1569a37a54ecb63bd4008708c76339ccf7d06115
 depth = 1
-- 
2.34.1




[PATCH v4 5/5] vfio-user: Fix config space access byte order

2023-09-19 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.
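
For illustration only (not part of this patch): ldn_le_p()/stn_le_p() treat a
byte buffer as a little-endian value of the given width regardless of host
byte order. With arbitrarily chosen example values:

    uint8_t cfg[2] = { 0x86, 0x80 };   /* bytes as stored in config space */
    uint64_t val = ldn_le_p(cfg, 2);   /* 0x8086 on any host */
    stn_le_p(cfg, 2, 0x1af4);          /* cfg[0] == 0xf4, cfg[1] == 0x1a */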

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 6a561f7969..6043a91b11 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v4 4/5] vfio-user: Message-based DMA support

2023-09-19 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|  2 +
 hw/remote/vfio-user-obj.c | 84 +++
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..6a561f7969 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+vfu_ctx_t *vfu_ctx = VFU_OBJECT(region->owner)->vfu_ctx;
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+g_autofree dma_sg_t *sg = g_malloc0(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ 

[PATCH v4 0/5] Support message-based DMA in vfio-user server

2023-09-19 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory request TLPs to the hardware
design in order to perform DMA.

Note that more work is needed to make message-based DMA work well: qemu
currently breaks down DMA accesses into chunks of size 8 bytes at maximum, each
of which will be handled in a separate vfio-user DMA request message. This is
quite terrible for large DMA accesses, such as when nvme reads and writes
page-sized blocks. Thus, I would like to improve qemu to be able to
perform larger accesses, at least for indirect memory regions. I have something
working locally, but since this will likely result in more involved surgery and
discussion, I am leaving this to be addressed in a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in the first bytes of the bounce buffer struct as a
  best-effort measure to detect invalid pointers in address_space_unmap.

Changes from v3:

* libvfio-user now supports twin-socket mode which uses separate sockets for
  client->server and server->client commands, respectively. This addresses the
  concurrent command bug triggered by server->client DMA access commands. See
  https://github.com/nutanix/libvfio-user/issues/279 for details.

* Add missing teardown code in do_address_space_destroy.

* Fix bounce buffer size bookkeeping race condition.

* Generate unmap notification callbacks unconditionally.

* Some cosmetic fixes.

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c |  88 ++--
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  41 +-
 include/hw/pci/pci_device.h   |   3 +
 softmmu/dma-helpers.c |   4 +-
 softmmu/memory.c  |   8 ++
 softmmu/physmem.c | 149 ++
 subprojects/libvfio-user.wrap |   2 +-
 10 files changed, 221 insertions(+), 86 deletions(-)

-- 
2.34.1




[PATCH v4 1/5] softmmu: Per-AddressSpace bounce buffering

2023-09-19 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.

Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 softmmu/dma-helpers.c |   4 +-
 softmmu/memory.c  |   7 +++
 softmmu/physmem.c | 103 --
 5 files changed, 94 insertions(+), 67 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 41788c0bdd..63463c415d 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -138,8 +138,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 68284428f8..7d68936157 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1076,6 +1076,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1092,6 +1105,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2832,8 +2851,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2858,6 +2877,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 2463964805..d9fc26c063 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -167,7 +167,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 goto out;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7d9494ce70..ffa37fc327 1006

Re: [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering

2023-09-19 Thread Mattias Nissler
On Fri, Sep 15, 2023 at 10:37 AM Mattias Nissler  wrote:
>
> On Wed, Sep 13, 2023 at 8:30 PM Peter Xu  wrote:
> >
> > On Thu, Sep 07, 2023 at 06:04:06AM -0700, Mattias Nissler wrote:
> > > @@ -3105,6 +3105,9 @@ void address_space_init(AddressSpace *as, 
> > > MemoryRegion *root, const char *name)
> > >  as->ioeventfds = NULL;
> > >  QTAILQ_INIT(&as->listeners);
> > >  QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > > +as->bounce.in_use = false;
> > > +qemu_mutex_init(&as->map_client_list_lock);
> > > +QLIST_INIT(&as->map_client_list);
> > >  as->name = g_strdup(name ? name : "anonymous");
> > >  address_space_update_topology(as);
> > >  address_space_update_ioeventfds(as);
> >
> > Missing the counterpart in do_address_space_destroy()?
>
> Of course, thanks for pointing this out.
>
> >
> > Perhaps we should assert on having no one using the buffer, or on the
> > client list too.
>
> I agree it makes sense to put these assertions, but let me dig a bit
> and do some experiments to see whether these hold true in practice.

To close the loop here: I've experimented a bit to see whether I can
get the shutdown path to trigger the assertions by terminating the
qemu process with mappings present. I tried xhci (for usb_packet_map),
e1000e (for net_tx_pkt_add_raw_fragment_pci), and nvme (for
dma-helpers), some of them with hacked Linux kernels in an attempt
to create problematic situations. I found that cleanup of mappings
already works correctly; I wasn't able to trigger the assertions I
added in do_address_space_destroy. That doesn't prove the absence of
a code path that would trigger them, but such a path would just
indicate a bug in device model cleanup code that should be fixed
anyway.

>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >



Re: [PATCH v3 5/5] vfio-user: Fix config space access byte order

2023-09-15 Thread Mattias Nissler
On Thu, Sep 14, 2023 at 10:32 PM Stefan Hajnoczi  wrote:
>
> On Thu, Sep 07, 2023 at 06:04:10AM -0700, Mattias Nissler wrote:
> > PCI config space is little-endian, so on a big-endian host we need to
> > perform byte swaps for values as they are passed to and received from
> > the generic PCI config space access machinery.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  hw/remote/vfio-user-obj.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
>
> After some discussion about PCI Configuration Space endianness on IRC
> with aw, mcayland, and f4bug I am now happy with this patch:
>
> 1. Configuration space can only be accessed in 1-, 2-, or 4-byte
>accesses.
> 2. If it's a 2- or 4-byte access then your patch adds the missing
>little-endian conversion.
> 3. If it's a 1-byte access then there is (effectively) no byteswap in
>the code path and the pci_dev->config[] array is already
>little-endian.

Thanks for checking! This indeed relies on
pci_host_config_{read,write}_common being register-based access paths.
I have also experimentally verified that this works as expected using
an s390x build.



Re: [PATCH v3 4/5] vfio-user: Message-based DMA support

2023-09-15 Thread Mattias Nissler
On Thu, Sep 14, 2023 at 9:04 PM Stefan Hajnoczi  wrote:
>
> On Thu, Sep 07, 2023 at 06:04:09AM -0700, Mattias Nissler wrote:
> > Wire up support for DMA for the case where the vfio-user client does not
> > provide mmap()-able file descriptors, but DMA requests must be performed
> > via the VFIO-user protocol. This installs an indirect memory region,
> > which already works for pci_dma_{read,write}, and pci_dma_map works
> > thanks to the existing DMA bounce buffering support.
> >
> > Note that while simple scenarios work with this patch, there's a known
> > race condition in libvfio-user that will mess up the communication
> > channel. See https://github.com/nutanix/libvfio-user/issues/279 for
> > details as well as a proposed fix.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  hw/remote/trace-events|  2 +
> >  hw/remote/vfio-user-obj.c | 84 +++
> >  2 files changed, 79 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/remote/trace-events b/hw/remote/trace-events
> > index 0d1b7d56a5..358a68fb34 100644
> > --- a/hw/remote/trace-events
> > +++ b/hw/remote/trace-events
> > @@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x 
> > -> 0x%x"
> >  vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
> >  vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 
> > 0x%"PRIx64", %zu bytes"
> >  vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
> > +vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu 
> > bytes"
> > +vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu 
> > bytes"
> >  vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
> > 0x%"PRIx64" size 0x%"PRIx64""
> >  vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
> > address 0x%"PRIx64""
> >  vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
> > address 0x%"PRIx64""
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index 8b10c32a3c..cee5e615a9 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t 
> > *vfu_ctx, char * const buf,
> >  return count;
> >  }
> >
> > +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
> > +unsigned size, MemTxAttrs attrs)
> > +{
> > +MemoryRegion *region = opaque;
> > +VfuObject *o = VFU_OBJECT(region->owner);
> > +uint8_t buf[sizeof(uint64_t)];
> > +
> > +trace_vfu_dma_read(region->addr + addr, size);
> > +
> > +dma_sg_t *sg = alloca(dma_sg_size());
>
> Variable-length arrays have recently been removed from QEMU and
> alloca(3) is a similar case. An example is commit
> b3c8246750b7077add335559341268f2956f6470 ("hw/nvme: Avoid dynamic stack
> allocation").
>
> libvfio-user returns a sane sizeof(struct dma_sg) value so we don't need
> to worry about bogus values, so the risk is low here.
>
> However, its hard to scan for and forbid the dangerous alloca(3) calls
> when exceptions are made for some alloca(3) uses.
>
> I would avoid alloca(3) and instead use:
>
>   g_autofree dma_sg_t *sg = g_new(dma_sg_size(), 1);

OK, changing. I personally dislike alloca anyway; I was just
following libvfio-user's example code. Plus, there's really no valid
performance argument here given that the IPC we're doing will dominate
everything.

>
> > +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > +if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 
> > ||
> > +vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
> > +return MEMTX_ERROR;
> > +}
> > +
> > +*val = ldn_he_p(buf, size);
> > +
> > +return MEMTX_OK;
> > +}
> > +
> > +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
> > + unsigned size, MemTxAttrs attrs)
> > +{
> > +MemoryRegion *region = opaque;
> > +VfuObject *o = VFU_OBJECT(region->owner);
> > +uint8_t buf[sizeof(uint64_t)];
> > +
> > +trace_vfu_dma_write(region->addr + addr, size);
> > +
> > +stn_he_p(buf, size, val);
> > +
&g

Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers

2023-09-15 Thread Mattias Nissler
On Thu, Sep 14, 2023 at 8:49 PM Stefan Hajnoczi  wrote:
>
> On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> >  * net devices, e.g. when transmitting a packet that is split across
> >several TX descriptors (observed with igb)
> >  * USB host controllers, when handling a packet with multiple data TRBs
> >(observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer per AddressSpace
> > and would fail DMA map requests while the buffer was already in use. In
> > turn, this would cause DMA failures that ultimately manifest as hardware
> > errors from the guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings work
> > correctly also when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited individually for each
> > AddressSpace. The default limit is 4096 bytes, matching the previous
> > maximum buffer size. A new x-max-bounce-buffer-size parameter is
> > provided to configure the limit for PCI devices.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  hw/pci/pci.c|  8 
> >  include/exec/memory.h   | 14 ++
> >  include/hw/pci/pci_device.h |  3 ++
> >  softmmu/memory.c|  3 +-
> >  softmmu/physmem.c   | 94 +
> >  5 files changed, 80 insertions(+), 42 deletions(-)
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 881d774fb6..8c4541b394 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -85,6 +85,8 @@ static Property pci_props[] = {
> >  QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> >  DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> >  QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> > +DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> > + max_bounce_buffer_size, 4096),
> >  DEFINE_PROP_END_OF_LIST()
> >  };
> >
> > @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
> > *pci_dev,
> > "bus master container", UINT64_MAX);
> >  address_space_init(_dev->bus_master_as,
> >  address_space_init(&pci_dev->bus_master_as,
> > &pci_dev->bus_master_container_region, 
> > +pci_dev->bus_master_as.max_bounce_buffer_size =
> > +pci_dev->max_bounce_buffer_size;
> >
> >  if (phase_check(PHASE_MACHINE_READY)) {
> >  pci_init_bus_master(pci_dev);
> > @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass 
> > *klass, void *data)
> >  k->unrealize = pci_qdev_unrealize;
> >  k->bus_type = TYPE_PCI_BUS;
> >  device_class_set_props(k, pci_props);
> > +object_class_property_set_description(
> > +klass, "x-max-bounce-buffer-size",
> > +"Maximum buffer size allocated for bounce buffers used for mapped "
> > +"access to indirect DMA memory");
> >  }
> >
> >  static void pci_device_class_base_init(ObjectClass *klass, void *data)
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 7d68936157..5577542b5e 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> >  QLIST_ENTRY(AddressSpaceMapClient) link;
> >  } AddressSpaceMapClient;
> >
> > -typedef struct {
> > -MemoryRegion *mr;
> > -void *buffer;
> > -hwaddr addr;
> > -hwaddr len;
> > -bool in_use;
> > -} BounceBuffer;
> > -
> >  /**
> >   * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
> > objects
> >   */
> > @@ -1106,8 +1098,10 @@ struct AddressSpace {
> >  QTAILQ_HEAD(, MemoryListener) listeners;
> >  QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> >
> > -/* Bounce buffer to use for this address space. */
> > -BounceBuffer bounce;
> > +/* Maximum DMA bounce buffer size used for indirect memory map 
> > requests */
> > +uint64_t max_bounce_buffer_size;
>

Re: [PATCH v3 2/5] softmmu: Support concurrent bounce buffers

2023-09-15 Thread Mattias Nissler
On Wed, Sep 13, 2023 at 9:11 PM Peter Xu  wrote:
>
> On Thu, Sep 07, 2023 at 06:04:07AM -0700, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> >  * net devices, e.g. when transmitting a packet that is split across
> >several TX descriptors (observed with igb)
> >  * USB host controllers, when handling a packet with multiple data TRBs
> >(observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer per AddressSpace
> > and would fail DMA map requests while the buffer was already in use. In
> > turn, this would cause DMA failures that ultimately manifest as hardware
> > errors from the guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings work
> > correctly also when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited individually for each
> > AddressSpace. The default limit is 4096 bytes, matching the previous
> > maximum buffer size. A new x-max-bounce-buffer-size parameter is
> > provided to configure the limit for PCI devices.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  hw/pci/pci.c|  8 
> >  include/exec/memory.h   | 14 ++
> >  include/hw/pci/pci_device.h |  3 ++
> >  softmmu/memory.c|  3 +-
> >  softmmu/physmem.c   | 94 +
> >  5 files changed, 80 insertions(+), 42 deletions(-)
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 881d774fb6..8c4541b394 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -85,6 +85,8 @@ static Property pci_props[] = {
> >  QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> >  DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
> >  QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
> > +DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
> > + max_bounce_buffer_size, 4096),
> >  DEFINE_PROP_END_OF_LIST()
> >  };
> >
> > @@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
> > *pci_dev,
> > "bus master container", UINT64_MAX);
> >  address_space_init(&pci_dev->bus_master_as,
> > &pci_dev->bus_master_container_region, 
> > pci_dev->name);
> > +pci_dev->bus_master_as.max_bounce_buffer_size =
> > +pci_dev->max_bounce_buffer_size;
> >
> >  if (phase_check(PHASE_MACHINE_READY)) {
> >  pci_init_bus_master(pci_dev);
> > @@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass 
> > *klass, void *data)
> >  k->unrealize = pci_qdev_unrealize;
> >  k->bus_type = TYPE_PCI_BUS;
> >  device_class_set_props(k, pci_props);
> > +object_class_property_set_description(
> > +klass, "x-max-bounce-buffer-size",
> > +"Maximum buffer size allocated for bounce buffers used for mapped "
> > +"access to indirect DMA memory");
> >  }
> >
> >  static void pci_device_class_base_init(ObjectClass *klass, void *data)
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 7d68936157..5577542b5e 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
> >  QLIST_ENTRY(AddressSpaceMapClient) link;
> >  } AddressSpaceMapClient;
> >
> > -typedef struct {
> > -MemoryRegion *mr;
> > -void *buffer;
> > -hwaddr addr;
> > -hwaddr len;
> > -bool in_use;
> > -} BounceBuffer;
> > -
> >  /**
> >   * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
> > objects
> >   */
> > @@ -1106,8 +1098,10 @@ struct AddressSpace {
> >  QTAILQ_HEAD(, MemoryListener) listeners;
> >  QTAILQ_ENTRY(AddressSpace) address_spaces_link;
> >
> > -/* Bounce buffer to use for this address space. */
> > -BounceBuffer bounce;
> > +/* Maximum DMA bounce buffer size used for indirect memory map 
> > requests */
> > +uint64_t max_bounce_buffer_size;
> > +/

Re: [PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering

2023-09-15 Thread Mattias Nissler
On Wed, Sep 13, 2023 at 8:30 PM Peter Xu  wrote:
>
> On Thu, Sep 07, 2023 at 06:04:06AM -0700, Mattias Nissler wrote:
> > @@ -3105,6 +3105,9 @@ void address_space_init(AddressSpace *as, 
> > MemoryRegion *root, const char *name)
> >  as->ioeventfds = NULL;
> >  QTAILQ_INIT(&as->listeners);
> >  QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
> > +as->bounce.in_use = false;
> > +qemu_mutex_init(&as->map_client_list_lock);
> > +QLIST_INIT(&as->map_client_list);
> >  as->name = g_strdup(name ? name : "anonymous");
> >  address_space_update_topology(as);
> >  address_space_update_ioeventfds(as);
>
> Missing the counterpart in do_address_space_destroy()?

Of course, thanks for pointing this out.

>
> Perhaps we should assert on having no one using the buffer, or on the
> client list too.

I agree it makes sense to put these assertions, but let me dig a bit
and do some experiments to see whether these hold true in practice.

>
> Thanks,
>
> --
> Peter Xu
>



Re: [PATCH v3 0/5] Support message-based DMA in vfio-user server

2023-09-15 Thread Mattias Nissler
On Thu, Sep 14, 2023 at 4:39 PM Stefan Hajnoczi  wrote:
>
> On Thu, Sep 07, 2023 at 06:04:05AM -0700, Mattias Nissler wrote:
> > This series adds basic support for message-based DMA in qemu's vfio-user
> > server. This is useful for cases where the client does not provide file
> > descriptors for accessing system memory via memory mappings. My motivating 
> > use
> > case is to hook up device models as PCIe endpoints to a hardware design. 
> > This
> > works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> > does not access memory directly, but sends memory requests TLPs to the 
> > hardware
> > design in order to perform DMA.
> >
> > Note that there is some more work required on top of this series to get
> > message-based DMA to really work well:
> >
> > * libvfio-user has a long-standing issue where socket communication gets 
> > messed
> >   up when messages are sent from both ends at the same time. See
> >   https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
> >   been engaging there and a fix is in review.
> >
> > * qemu currently breaks down DMA accesses into chunks of size 8 bytes at
> >   maximum, each of which will be handled in a separate vfio-user DMA request
> >   message. This is quite terrible for large DMA accesses, such as when nvme
> >   reads and writes page-sized blocks for example. Thus, I would like to 
> > improve
> >   qemu to be able to perform larger accesses, at least for indirect memory
> >   regions. I have something working locally, but since this will likely 
> > result
> >   in more involved surgery and discussion, I am leaving this to be 
> > addressed in
> >   a separate patch.
>
> Have you tried setting mr->ops->valid.max_access_size to something like
> 64 KB?

I had tried that early on, but unfortunately it's not that easy. The
memory access path eventually hits flatview_read_continue [1], where
memory_region_dispatch_read gets invoked, which passes data in a single
uint64_t; that is also the unit of data that MemoryRegionOps operates
on. Thus, sizeof(uint64_t) is the current hard limit when accessing an
indirect memory region. I have some proof-of-concept code that extends
MemoryRegionOps with functions to read and write larger blocks and
changes the dispatching code to use these if available. I'm not sure
whether that's the right way to go though; it was just what jumped out
at me as a quick way to get what I need :-) Happy to share this code
if it helps the conversation.

There are certainly various considerations with this:
* It crossed my mind that we could introduce a separate memory region
type (I understand that indirect memory regions were originally
designed for I/O regions, accessed by the CPU, and thus naturally
limited to memop-sized accesses?). But then again perhaps we want
arbitrarily-sized accesses for potentially all memory regions, not
just those of special types?
* If we do decide to add support to MemoryRegionOps for
arbitrarily-sized accesses, that raises the question of whether this
is a third, optional pair of accessors in addition to read/write and
read_with_attrs/write_with_attrs, or whether MemoryRegionOps deserves
a cleanup to expose only a single pair of arbitrarily-sized accessors.
We'd then adapt them somehow to the simpler memop-sized accessors
which existing code implements and which I think make sense to keep
for cases where they are sufficient.
* Performance: we need to keep an eye on what performance implications
these design decisions come with.

[1] https://github.com/qemu/qemu/blob/master/softmmu/physmem.c#L2744
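
To make this concrete, a purely hypothetical sketch of what such block
accessors could look like (these names do not exist in QEMU today, and this is
not the proof-of-concept code mentioned above):

    typedef struct MemoryRegionBlockOps {
        /* Optional: handle an access of arbitrary length in one request. */
        MemTxResult (*read_block)(void *opaque, hwaddr addr, void *buf,
                                  hwaddr len, MemTxAttrs attrs);
        MemTxResult (*write_block)(void *opaque, hwaddr addr, const void *buf,
                                   hwaddr len, MemTxAttrs attrs);
    } MemoryRegionBlockOps;

    /* The dispatch code would try these first and fall back to the existing
     * memop-sized read/write callbacks when they are absent. */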

>
> Paolo: Any suggestions for increasing DMA transaction sizes?
>
> Stefan
>
> >
> > Changes from v1:
> >
> > * Address Stefan's review comments. In particular, enforce an allocation 
> > limit
> >   and don't drop the map client callbacks given that map requests can fail 
> > when
> >   hitting size limits.
> >
> > * libvfio-user version bump now included in the series.
> >
> > * Tested as well on big-endian s390x. This uncovered another byte order 
> > issue
> >   in vfio-user server code that I've included a fix for.
> >
> > Changes from v2:
> >
> > * Add a preparatory patch to make bounce buffering an AddressSpace-specific
> >   concept.
> >
> > * The total buffer size limit parameter is now per AdressSpace and can be
> >   configured for PCIDevice via a property.
> >
> > * Store a magic value in first bytes of bounce buffer struct as a best 
> > effort
> >   measure to detect invalid pointers in address_space_unmap.
> >
> > Mattias Nissler (5):
> >   softmmu: Per-AddressSpace bounce bufferin

[PATCH v3 4/5] vfio-user: Message-based DMA support

2023-09-07 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|  2 +
 hw/remote/vfio-user-obj.c | 84 +++
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..cee5e615a9 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+VfuObject *o = VFU_OBJECT(region->owner);
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+dma_sg_t *sg = alloca(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+VfuObject *o = VFU_OBJECT(region->owner);
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+dma_sg_t *sg = alloca(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0)  {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ * many/most DMA accesses are larger than 8 bytes

[PATCH v3 3/5] Update subprojects/libvfio-user

2023-09-07 Thread Mattias Nissler
Brings in assorted bug fixes. In particular, "Fix address calculation
for message-based DMA" corrects a bug in DMA address calculation which
is necessary to get DMA across VFIO-user messages working.

Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..135667a40d 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = f63ef82ad01821417df488cef7ec1fd94c3883fa
 depth = 1
-- 
2.34.1




[PATCH v3 2/5] softmmu: Support concurrent bounce buffers

2023-09-07 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this would cause DMA failures that ultimately manifest as hardware
errors from the guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
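
As an illustration (a sketch, not code from this patch), a device model that
maps two descriptor fragments at once now gets two independent bounce buffers
when the memory is not directly accessible; "dev" and "frag" are hypothetical:

    /* Both mappings can be live at the same time, each backed by its own
     * dynamically allocated bounce buffer, as long as the per-AddressSpace
     * size limit is not exceeded. */
    dma_addr_t l0 = frag[0].len, l1 = frag[1].len;
    void *p0 = pci_dma_map(dev, frag[0].addr, &l0, DMA_DIRECTION_TO_DEVICE);
    void *p1 = pci_dma_map(dev, frag[1].addr, &l1, DMA_DIRECTION_TO_DEVICE);
    if (p0 && p1) {
        /* ... assemble and transmit the packet ... */
    }
    if (p1) {
        pci_dma_unmap(dev, p1, l1, DMA_DIRECTION_TO_DEVICE, 0);
    }
    if (p0) {
        pci_dma_unmap(dev, p0, l0, DMA_DIRECTION_TO_DEVICE, 0);
    }

Previously the second pci_dma_map() call would have returned NULL whenever the
first mapping was already bounce-buffered.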

Signed-off-by: Mattias Nissler 
---
 hw/pci/pci.c|  8 
 include/exec/memory.h   | 14 ++
 include/hw/pci/pci_device.h |  3 ++
 softmmu/memory.c|  3 +-
 softmmu/physmem.c   | 94 +
 5 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 881d774fb6..8c4541b394 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_SIZE("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, 4096),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -1208,6 +1210,8 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
"bus master container", UINT64_MAX);
 address_space_init(&pci_dev->bus_master_as,
&pci_dev->bus_master_container_region, pci_dev->name);
+pci_dev->bus_master_as.max_bounce_buffer_size =
+pci_dev->max_bounce_buffer_size;
 
 if (phase_check(PHASE_MACHINE_READY)) {
 pci_init_bus_master(pci_dev);
@@ -2664,6 +2668,10 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k->unrealize = pci_qdev_unrealize;
 k->bus_type = TYPE_PCI_BUS;
 device_class_set_props(k, pci_props);
+object_class_property_set_description(
+klass, "x-max-bounce-buffer-size",
+"Maximum buffer size allocated for bounce buffers used for mapped "
+"access to indirect DMA memory");
 }
 
 static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7d68936157..5577542b5e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1081,14 +1081,6 @@ typedef struct AddressSpaceMapClient {
 QLIST_ENTRY(AddressSpaceMapClient) link;
 } AddressSpaceMapClient;
 
-typedef struct {
-MemoryRegion *mr;
-void *buffer;
-hwaddr addr;
-hwaddr len;
-bool in_use;
-} BounceBuffer;
-
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1106,8 +1098,10 @@ struct AddressSpace {
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
 
-/* Bounce buffer to use for this address space. */
-BounceBuffer bounce;
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
+/* Total size of bounce buffers currently allocated, atomically accessed */
+uint64_t bounce_buffer_size;
 /* List of callbacks to invoke when buffers free up */
 QemuMutex map_client_list_lock;
 QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..f4027c5379 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -160,6 +160,9 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+/* Maximum DMA bounce buffer size used for indirect memory map requests */
+uint64_t max_bounce_buffer_size;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 5c9622c3d6..e02799359c 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3105,7 +3105,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root, const char *name)
 as->ioeventfds = NULL;
 QTAILQ_INIT(&as->listeners);
 QTAILQ_INSERT_TAIL(&address_spaces, as, addre

[PATCH v3 0/5] Support message-based DMA in vfio-user server

2023-09-07 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that there is some more work required on top of this series to get
message-based DMA to really work well:

* libvfio-user has a long-standing issue where socket communication gets messed
  up when messages are sent from both ends at the same time. See
  https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
  been engaging there and a fix is in review.

* qemu currently breaks down DMA accesses into chunks of size 8 bytes at
  maximum, each of which will be handled in a separate vfio-user DMA request
  message. This is quite terrible for large DMA accesses, such as when nvme
  reads and writes page-sized blocks for example. Thus, I would like to improve
  qemu to be able to perform larger accesses, at least for indirect memory
  regions. I have something working locally, but since this will likely result
  in more involved surgery and discussion, I am leaving this to be addressed in
  a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Changes from v2:

* Add a preparatory patch to make bounce buffering an AddressSpace-specific
  concept.

* The total buffer size limit parameter is now per AddressSpace and can be
  configured for PCIDevice via a property.

* Store a magic value in first bytes of bounce buffer struct as a best effort
  measure to detect invalid pointers in address_space_unmap.

Mattias Nissler (5):
  softmmu: Per-AddressSpace bounce buffering
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/pci/pci.c  |   8 ++
 hw/remote/trace-events|   2 +
 hw/remote/vfio-user-obj.c |  88 +--
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  39 -
 include/hw/pci/pci_device.h   |   3 +
 softmmu/dma-helpers.c |   4 +-
 softmmu/memory.c  |   4 +
 softmmu/physmem.c | 155 ++
 subprojects/libvfio-user.wrap |   2 +-
 10 files changed, 220 insertions(+), 87 deletions(-)

-- 
2.34.1




[PATCH v3 5/5] vfio-user: Fix config space access byte order

2023-09-07 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.
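
A small illustration of the difference (a sketch, not part of the patch): the
wire buffer holds the value in little-endian byte order, so a raw memcpy into
a host integer only happens to produce the right value on little-endian hosts:

    uint8_t wire[2] = { 0x34, 0x12 };   /* little-endian encoding of 0x1234 */
    uint64_t val = ldn_le_p(wire, 2);   /* 0x1234 on any host */
    stn_le_p(wire, 2, 0xabcd);          /* stores { 0xcd, 0xab } */

ldn_le_p()/stn_le_p() perform the byte swap as needed for the host.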

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index cee5e615a9..d38b4700f3 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v3 1/5] softmmu: Per-AddressSpace bounce buffering

2023-09-07 Thread Mattias Nissler
Instead of using a single global bounce buffer, give each AddressSpace
its own bounce buffer. The MapClient callback mechanism moves to
AddressSpace accordingly.

This is in preparation for generalizing bounce buffer handling further
to allow multiple bounce buffers, with a total allocation limit
configured per AddressSpace.
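
For illustration (a sketch under the same API, not part of this patch), a
caller that previously used cpu_register_map_client() now registers against
the AddressSpace it tried to map from; "retry_cb" and "s" are hypothetical:

    void *p = address_space_map(as, addr, &len, is_write,
                                MEMTXATTRS_UNSPECIFIED);
    if (!p) {
        /* Out of bounce buffer space in this AddressSpace: arrange for a
         * retry once address_space_unmap() frees some up. */
        s->bh = aio_bh_new(qemu_get_aio_context(), retry_cb, s);
        address_space_register_map_client(as, s->bh);
        return;
    }

This mirrors the dma-helpers conversion in the diff below.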

Signed-off-by: Mattias Nissler 
---
 include/exec/cpu-common.h |   2 -
 include/exec/memory.h |  45 -
 softmmu/dma-helpers.c |   4 +-
 softmmu/memory.c  |   3 ++
 softmmu/physmem.c | 103 --
 5 files changed, 90 insertions(+), 67 deletions(-)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 41788c0bdd..63463c415d 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -138,8 +138,6 @@ void *cpu_physical_memory_map(hwaddr addr,
   bool is_write);
 void cpu_physical_memory_unmap(void *buffer, hwaddr len,
bool is_write, hwaddr access_len);
-void cpu_register_map_client(QEMUBH *bh);
-void cpu_unregister_map_client(QEMUBH *bh);
 
 bool cpu_physical_memory_is_io(hwaddr phys_addr);
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 68284428f8..7d68936157 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1076,6 +1076,19 @@ struct MemoryListener {
 QTAILQ_ENTRY(MemoryListener) link_as;
 };
 
+typedef struct AddressSpaceMapClient {
+QEMUBH *bh;
+QLIST_ENTRY(AddressSpaceMapClient) link;
+} AddressSpaceMapClient;
+
+typedef struct {
+MemoryRegion *mr;
+void *buffer;
+hwaddr addr;
+hwaddr len;
+bool in_use;
+} BounceBuffer;
+
 /**
  * struct AddressSpace: describes a mapping of addresses to #MemoryRegion 
objects
  */
@@ -1092,6 +1105,12 @@ struct AddressSpace {
 struct MemoryRegionIoeventfd *ioeventfds;
 QTAILQ_HEAD(, MemoryListener) listeners;
 QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+
+/* Bounce buffer to use for this address space. */
+BounceBuffer bounce;
+/* List of callbacks to invoke when buffers free up */
+QemuMutex map_client_list_lock;
+QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
 };
 
 typedef struct AddressSpaceDispatch AddressSpaceDispatch;
@@ -2832,8 +2851,8 @@ bool address_space_access_valid(AddressSpace *as, hwaddr 
addr, hwaddr len,
  * May return %NULL and set *@plen to zero(0), if resources needed to perform
  * the mapping are exhausted.
  * Use only for reads OR writes - not for read-modify-write operations.
- * Use cpu_register_map_client() to know when retrying the map operation is
- * likely to succeed.
+ * Use address_space_register_map_client() to know when retrying the map
+ * operation is likely to succeed.
  *
  * @as: #AddressSpace to be accessed
  * @addr: address within that address space
@@ -2858,6 +2877,28 @@ void *address_space_map(AddressSpace *as, hwaddr addr,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len);
 
+/*
+ * address_space_register_map_client: Register a callback to invoke when
+ * resources for address_space_map() are available again.
+ *
+ * address_space_map may fail when there are not enough resources available,
+ * such as when bounce buffer memory would exceed the limit. The callback can
+ * be used to retry the address_space_map operation. Note that the callback
+ * gets automatically removed after firing.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to invoke when address_space_map() retry is appropriate
+ */
+void address_space_register_map_client(AddressSpace *as, QEMUBH *bh);
+
+/*
+ * address_space_unregister_map_client: Unregister a callback that has
+ * previously been registered and not fired yet.
+ *
+ * @as: #AddressSpace to be accessed
+ * @bh: callback to unregister
+ */
+void address_space_unregister_map_client(AddressSpace *as, QEMUBH *bh);
 
 /* Internal functions, part of the implementation of address_space_read.  */
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 2463964805..d9fc26c063 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -167,7 +167,7 @@ static void dma_blk_cb(void *opaque, int ret)
 if (dbs->iov.size == 0) {
 trace_dma_map_wait(dbs);
 dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
+address_space_register_map_client(dbs->sg->as, dbs->bh);
 goto out;
 }
 
@@ -197,7 +197,7 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 }
 
 if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
+address_space_unregister_map_client(dbs->sg->as, dbs->bh);
 qemu_bh_delete(dbs->bh);
 dbs->bh = NULL;
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7d9494ce70..5c9622c3d6 100644
--- a/

Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers

2023-09-07 Thread Mattias Nissler
On Tue, Sep 5, 2023 at 3:45 PM Peter Xu  wrote:
>
> On Tue, Sep 05, 2023 at 09:38:39AM +0200, Mattias Nissler wrote:
> > It would be nice to use a property on the device that originates the
> > DMA operation to configure this. However, I don't see how to do this
> > in a reasonable way without bigger changes: A typical call path is
> > pci_dma_map -> dma_memory_map -> address_space_map. While pci_dma_map
> > has a PCIDevice*, address_space_map only receives the AddressSpace*.
> > So, we'd probably have to pass through a new QObject parameter to
> > address_space_map that indicates the originator and pass that through?
> > Or is there a better alternative to supply context information to
> > address_space map? Let me know if any of these approaches sound
> > appropriate and I'll be happy to explore them further.
>
> Should be possible to do. The pci address space is not shared but
> per-device by default (even if there is no vIOMMU intervention).  See
> do_pci_register_device():
>
> address_space_init(&pci_dev->bus_master_as,
>&pci_dev->bus_master_container_region, pci_dev->name);

Ah, thanks for that hint! This works, and it probably even makes more
sense to treat bounce buffering as a concept tied to AddressSpace
rather than a global thing.

I'll send an updated series shortly, with the configuration parameter
attached to the PCI device, so it can be specified as a -device option
on the command line. In that light, I decided to keep the default at
4096 bytes though, since we now have the ability for each device model
to choose its default independently.
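
To make that concrete, with the property in place the limit can be raised per
device instance on the command line, for example (device and size are just an
illustration, not taken from the series):

    -device nvme,drive=drive0,serial=deadbeef,x-max-bounce-buffer-size=1M

Devices that never need bounce buffering are unaffected by the setting.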



Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers

2023-09-05 Thread Mattias Nissler
On Fri, Sep 1, 2023 at 3:41 PM Markus Armbruster  wrote:
>
> Stefan Hajnoczi  writes:
>
> > On Wed, Aug 23, 2023 at 04:54:06PM -0400, Peter Xu wrote:
> >> On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote:
> >> > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu  wrote:
> >> > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote:
> >> > > > diff --git a/softmmu/vl.c b/softmmu/vl.c
> >> > > > index b0b96f67fa..dbe52f5ea1 100644
> >> > > > --- a/softmmu/vl.c
> >> > > > +++ b/softmmu/vl.c
> >> > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv)
> >> > > >  exit(1);
> >> > > >  #endif
> >> > > >  break;
> >> > > > +case QEMU_OPTION_max_bounce_buffer_size:
> >> > > > +if (qemu_strtosz(optarg, NULL, 
> >> > > > &max_bounce_buffer_size) < 0) {
> >> > > > +error_report("invalid -max-bounce-buffer-size 
> >> > > > value");
> >> > > > +exit(1);
> >> > > > +}
> >> > > > +break;
> >> > >
> >> > > PS: I had a vague memory that we do not recommend adding more qemu 
> >> > > cmdline
> >> > > options, but I don't know enough on the plan to say anything real.
> >> >
> >> > I am aware of that, and I'm really not happy with the command line
> >> > option myself. Consider the command line flag a straw man I put in to
> >> > see whether any reviewers have better ideas :)
> >> >
> >> > More seriously, I actually did look around to see whether I can add
> >> > the parameter to one of the existing option groupings somewhere, but
> >> > neither do I have a suitable QOM object that I can attach the
> >> > parameter to, nor did I find any global option groups that fits: this
> >> > is not really memory configuration, and it's not really CPU
> >> > configuration, it's more related to shared device model
> >> > infrastructure... If you have a good idea for a home for this, I'm all
> >> > ears.
> >>
> >> No good & simple suggestion here, sorry.  We can keep the option there
> >> until someone jumps in, then the better alternative could also come along.
> >>
> >> After all I expect if we can choose a sensible enough default value, this
> >> new option shouldn't be used by anyone for real.
> >
> > QEMU commits to stability in its external interfaces. Once the
> > command-line option is added, it needs to be supported in the future or
> > go through the deprecation process. I think we should agree on the
> > command-line option now.
> >
> > Two ways to avoid the issue:
> > 1. Drop the command-line option until someone needs it.
>
> Avoiding unneeded configuration knobs is always good.
>
> > 2. Make it an experimental option (with an "x-" prefix).
>
> Fine if actual experiments are planned.
>
> Also fine if it's a development or debugging aid.

To a certain extent it is: I've been playing with different device
models and bumping the parameter until their DMA requests stopped
failing.

>
> > The closest to a proper solution that I found was adding it as a
> > -machine property. What bothers me is that if QEMU supports
> > multi-machine emulation in a single process some day, then the property
> > doesn't make sense since it's global rather than specific to a machine.
> >
> > CCing Markus Armbruster for ideas.
>
> I'm afraid I'm lacking context.  Glancing at the patch...
>
> ``-max-bounce-buffer-size size``
> Set the limit in bytes for DMA bounce buffer allocations.
>
> DMA bounce buffers are used when device models request memory-mapped 
> access
> to memory regions that can't be directly mapped by the qemu process, 
> so the
> memory must read or written to a temporary local buffer for the device
> model to work with. This is the case e.g. for I/O memory regions, and 
> when
> running in multi-process mode without shared access to memory.
>
> Whether bounce buffering is necessary depends heavily on the device 
> model
> implementation. Some devices use explicit DMA read and write 
> operations
> which do not require bounce buffers. Some devices, notably storage, 
> will
> retry a failed D

Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers

2023-08-24 Thread Mattias Nissler
On Wed, Aug 23, 2023 at 10:54 PM Peter Xu  wrote:
>
> On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote:
> > Peter, thanks for taking a look and providing feedback!
> >
> > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu  wrote:
> > >
> > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote:
> > > > When DMA memory can't be directly accessed, as is the case when
> > > > running the device model in a separate process without shareable DMA
> > > > file descriptors, bounce buffering is used.
> > > >
> > > > It is not uncommon for device models to request mapping of several DMA
> > > > regions at the same time. Examples include:
> > > >  * net devices, e.g. when transmitting a packet that is split across
> > > >several TX descriptors (observed with igb)
> > > >  * USB host controllers, when handling a packet with multiple data TRBs
> > > >(observed with xhci)
> > > >
> > > > Previously, qemu only provided a single bounce buffer and would fail DMA
> > > > map requests while the buffer was already in use. In turn, this would
> > > > cause DMA failures that ultimately manifest as hardware errors from the
> > > > guest perspective.
> > > >
> > > > This change allocates DMA bounce buffers dynamically instead of
> > > > supporting only a single buffer. Thus, multiple DMA mappings work
> > > > correctly also when RAM can't be mmap()-ed.
> > > >
> > > > The total bounce buffer allocation size is limited by a new command line
> > > > parameter. The default is 4096 bytes to match the previous maximum
> > > > buffer size. It is expected that suitable limits will vary quite a bit
> > > > in practice depending on device models and workloads.
> > > >
> > > > Signed-off-by: Mattias Nissler 
> > > > ---
> > > >  include/sysemu/sysemu.h |  2 +
> > > >  qemu-options.hx | 27 +
> > > >  softmmu/globals.c   |  1 +
> > > >  softmmu/physmem.c   | 84 +++--
> > > >  softmmu/vl.c|  6 +++
> > > >  5 files changed, 83 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > > index 25be2a692e..c5dc93cb53 100644
> > > > --- a/include/sysemu/sysemu.h
> > > > +++ b/include/sysemu/sysemu.h
> > > > @@ -61,6 +61,8 @@ extern int nb_option_roms;
> > > >  extern const char *prom_envs[MAX_PROM_ENVS];
> > > >  extern unsigned int nb_prom_envs;
> > > >
> > > > +extern uint64_t max_bounce_buffer_size;
> > > > +
> > > >  /* serial ports */
> > > >
> > > >  /* Return the Chardev for serial port i, or NULL if none */
> > > > diff --git a/qemu-options.hx b/qemu-options.hx
> > > > index 29b98c3d4c..6071794237 100644
> > > > --- a/qemu-options.hx
> > > > +++ b/qemu-options.hx
> > > > @@ -4959,6 +4959,33 @@ SRST
> > > >  ERST
> > > >  #endif
> > > >
> > > > +DEF("max-bounce-buffer-size", HAS_ARG,
> > > > +QEMU_OPTION_max_bounce_buffer_size,
> > > > +"-max-bounce-buffer-size size\n"
> > > > +"DMA bounce buffer size limit in bytes 
> > > > (default=4096)\n",
> > > > +QEMU_ARCH_ALL)
> > > > +SRST
> > > > +``-max-bounce-buffer-size size``
> > > > +Set the limit in bytes for DMA bounce buffer allocations.
> > > > +
> > > > +DMA bounce buffers are used when device models request 
> > > > memory-mapped access
> > > > +to memory regions that can't be directly mapped by the qemu 
> > > > process, so the
> > > > +memory must read or written to a temporary local buffer for the 
> > > > device
> > > > +model to work with. This is the case e.g. for I/O memory regions, 
> > > > and when
> > > > +running in multi-process mode without shared access to memory.
> > > > +
> > > > +Whether bounce buffering is necessary depends heavily on the 
> > > > device model
> > > > +implementation. Some devices use explicit DMA read and write 
> > > > operations
> > > > +which do not require bounce buffers. Some devices, notably 

Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers

2023-08-23 Thread Mattias Nissler
Peter, thanks for taking a look and providing feedback!

On Wed, Aug 23, 2023 at 7:35 PM Peter Xu  wrote:
>
> On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote:
> > When DMA memory can't be directly accessed, as is the case when
> > running the device model in a separate process without shareable DMA
> > file descriptors, bounce buffering is used.
> >
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. Examples include:
> >  * net devices, e.g. when transmitting a packet that is split across
> >several TX descriptors (observed with igb)
> >  * USB host controllers, when handling a packet with multiple data TRBs
> >(observed with xhci)
> >
> > Previously, qemu only provided a single bounce buffer and would fail DMA
> > map requests while the buffer was already in use. In turn, this would
> > cause DMA failures that ultimately manifest as hardware errors from the
> > guest perspective.
> >
> > This change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer. Thus, multiple DMA mappings work
> > correctly also when RAM can't be mmap()-ed.
> >
> > The total bounce buffer allocation size is limited by a new command line
> > parameter. The default is 4096 bytes to match the previous maximum
> > buffer size. It is expected that suitable limits will vary quite a bit
> > in practice depending on device models and workloads.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  include/sysemu/sysemu.h |  2 +
> >  qemu-options.hx | 27 +
> >  softmmu/globals.c   |  1 +
> >  softmmu/physmem.c   | 84 +++--
> >  softmmu/vl.c|  6 +++
> >  5 files changed, 83 insertions(+), 37 deletions(-)
> >
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 25be2a692e..c5dc93cb53 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -61,6 +61,8 @@ extern int nb_option_roms;
> >  extern const char *prom_envs[MAX_PROM_ENVS];
> >  extern unsigned int nb_prom_envs;
> >
> > +extern uint64_t max_bounce_buffer_size;
> > +
> >  /* serial ports */
> >
> >  /* Return the Chardev for serial port i, or NULL if none */
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 29b98c3d4c..6071794237 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -4959,6 +4959,33 @@ SRST
> >  ERST
> >  #endif
> >
> > +DEF("max-bounce-buffer-size", HAS_ARG,
> > +QEMU_OPTION_max_bounce_buffer_size,
> > +"-max-bounce-buffer-size size\n"
> > +"DMA bounce buffer size limit in bytes 
> > (default=4096)\n",
> > +QEMU_ARCH_ALL)
> > +SRST
> > +``-max-bounce-buffer-size size``
> > +Set the limit in bytes for DMA bounce buffer allocations.
> > +
> > +DMA bounce buffers are used when device models request memory-mapped 
> > access
> > +to memory regions that can't be directly mapped by the qemu process, 
> > so the
> > +memory must read or written to a temporary local buffer for the device
> > +model to work with. This is the case e.g. for I/O memory regions, and 
> > when
> > +running in multi-process mode without shared access to memory.
> > +
> > +Whether bounce buffering is necessary depends heavily on the device 
> > model
> > +implementation. Some devices use explicit DMA read and write operations
> > +which do not require bounce buffers. Some devices, notably storage, 
> > will
> > +retry a failed DMA map request after bounce buffer space becomes 
> > available
> > +again. Most other devices will bail when encountering map request 
> > failures,
> > +which will typically appear to the guest as a hardware error.
> > +
> > +Suitable bounce buffer size values depend on the workload and guest
> > +configuration. A few kilobytes up to a few megabytes are common sizes
> > +encountered in practice.
>
> Does it mean that the default 4K size can still easily fail with some
> device setup?

Yes. The thing is that the respective device setup is pretty exotic,
at least the only setup I'm aware of is multi-process with direct RAM
access via shared file descriptors from the device process disabled
(which hurts performance, so few people will run this setup). In
theory, DMA to an I/O region of some sort would also run into the
issue even in single process mode, but I'm not 

[PATCH v2 4/4] vfio-user: Fix config space access byte order

2023-08-23 Thread Mattias Nissler
PCI config space is little-endian, so on a big-endian host we need to
perform byte swaps for values as they are passed to and received from
the generic PCI config space access machinery.

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index cee5e615a9..d38b4700f3 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -281,7 +281,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 while (bytes > 0) {
 len = (bytes > pci_access_width) ? pci_access_width : bytes;
 if (is_write) {
-memcpy(&val, ptr, len);
+val = ldn_le_p(ptr, len);
 pci_host_config_write_common(o->pci_dev, offset,
  pci_config_size(o->pci_dev),
  val, len);
@@ -289,7 +289,7 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 } else {
 val = pci_host_config_read_common(o->pci_dev, offset,
   pci_config_size(o->pci_dev), 
len);
-memcpy(ptr, &val, len);
+stn_le_p(ptr, len, val);
 trace_vfu_cfg_read(offset, val);
 }
 offset += len;
-- 
2.34.1




[PATCH v2 2/4] Update subprojects/libvfio-user

2023-08-23 Thread Mattias Nissler
Brings in assorted bug fixes. In particular, "Fix address calculation
for message-based DMA" corrects a bug in DMA address calculation which
is necessary to get DMA across VFIO-user messages working.

Signed-off-by: Mattias Nissler 
---
 subprojects/libvfio-user.wrap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/subprojects/libvfio-user.wrap b/subprojects/libvfio-user.wrap
index 416955ca45..47aad1ae18 100644
--- a/subprojects/libvfio-user.wrap
+++ b/subprojects/libvfio-user.wrap
@@ -1,4 +1,4 @@
 [wrap-git]
 url = https://gitlab.com/qemu-project/libvfio-user.git
-revision = 0b28d205572c80b568a1003db2c8f37ca333e4d7
+revision = cfb7d908dca025bdea6709801c5790863e902ef8
 depth = 1
-- 
2.34.1




[PATCH v2 0/4] Support message-based DMA in vfio-user server

2023-08-23 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that there is some more work required on top of this series to get
message-based DMA to really work well:

* libvfio-user has a long-standing issue where socket communication gets messed
  up when messages are sent from both ends at the same time. See
  https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
  been engaging there and a fix is in review.

* qemu currently breaks down DMA accesses into chunks of size 8 bytes at
  maximum, each of which will be handled in a separate vfio-user DMA request
  message. This is quite terrible for large DMA accesses, such as when nvme
  reads and writes page-sized blocks for example. Thus, I would like to improve
  qemu to be able to perform larger accesses, at least for indirect memory
  regions. I have something working locally, but since this will likely result
  in more involved surgery and discussion, I am leaving this to be addressed in
  a separate patch.

Changes from v1:

* Address Stefan's review comments. In particular, enforce an allocation limit
  and don't drop the map client callbacks given that map requests can fail when
  hitting size limits.

* libvfio-user version bump now included in the series.

* Tested as well on big-endian s390x. This uncovered another byte order issue
  in vfio-user server code that I've included a fix for.

Mattias Nissler (4):
  softmmu: Support concurrent bounce buffers
  Update subprojects/libvfio-user
  vfio-user: Message-based DMA support
  vfio-user: Fix config space access byte order

 hw/remote/trace-events|  2 +
 hw/remote/vfio-user-obj.c | 88 +++
 include/sysemu/sysemu.h   |  2 +
 qemu-options.hx   | 27 +++
 softmmu/globals.c |  1 +
 softmmu/physmem.c | 84 ++---
 softmmu/vl.c  |  6 +++
 subprojects/libvfio-user.wrap |  2 +-
 8 files changed, 165 insertions(+), 47 deletions(-)

-- 
2.34.1




[PATCH v2 3/4] vfio-user: Message-based DMA support

2023-08-23 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel. See https://github.com/nutanix/libvfio-user/issues/279 for
details as well as a proposed fix.
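
To illustrate what this enables (a sketch, not code from this patch), a device
model running behind the vfio-user server keeps using the ordinary PCI DMA
helpers; when the region backing a guest address is the indirect region
installed below, the access is carried out as VFIO-user DMA read/write
messages instead of a direct memory access ("dev" and "desc_addr" are
hypothetical):

    uint32_t desc[4];
    if (pci_dma_read(dev, desc_addr, desc, sizeof(desc)) != MEMTX_OK) {
        /* surface a DMA error to the guest */
    }

pci_dma_map() on such a region succeeds as well, backed by the bounce
buffering support added earlier in the series.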

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events|  2 +
 hw/remote/vfio-user-obj.c | 84 +++
 2 files changed, 79 insertions(+), 7 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0d1b7d56a5..358a68fb34 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,6 +9,8 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 
0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", 
%zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_dma_read(uint64_t gpa, size_t len) "vfu: DMA read 0x%"PRIx64", %zu bytes"
+vfu_dma_write(uint64_t gpa, size_t len) "vfu: DMA write 0x%"PRIx64", %zu bytes"
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 
0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR 
address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR 
address 0x%"PRIx64""
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..cee5e615a9 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,63 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, 
char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+VfuObject *o = VFU_OBJECT(region->owner);
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_read(region->addr + addr, size);
+
+dma_sg_t *sg = alloca(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
+return MEMTX_ERROR;
+}
+
+*val = ldn_he_p(buf, size);
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+VfuObject *o = VFU_OBJECT(region->owner);
+uint8_t buf[sizeof(uint64_t)];
+
+trace_vfu_dma_write(region->addr + addr, size);
+
+stn_he_p(buf, size, val);
+
+dma_sg_t *sg = alloca(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(o->vfu_ctx, sg, 1, buf) != 0)  {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_HOST_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +365,30 @@ static void dma_register(vfu_ctx_t *vfu_ctx, 
vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+/*
+ * Note that I/O regions' MemoryRegionOps handle accesses of at most 8
+ * bytes at a time, and larger accesses are broken down. However,
+ * many/most DMA accesses are larger than 8 bytes

[PATCH v2 1/4] softmmu: Support concurrent bounce buffers

2023-08-23 Thread Mattias Nissler
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.

It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
 * net devices, e.g. when transmitting a packet that is split across
   several TX descriptors (observed with igb)
 * USB host controllers, when handling a packet with multiple data TRBs
   (observed with xhci)

Previously, qemu only provided a single bounce buffer and would fail DMA
map requests while the buffer was already in use. In turn, this would
cause DMA failures that ultimately manifest as hardware errors from the
guest perspective.

This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly also when RAM can't be mmap()-ed.

The total bounce buffer allocation size is limited by a new command line
parameter. The default is 4096 bytes to match the previous maximum
buffer size. It is expected that suitable limits will vary quite a bit
in practice depending on device models and workloads.

Signed-off-by: Mattias Nissler 
---
 include/sysemu/sysemu.h |  2 +
 qemu-options.hx | 27 +
 softmmu/globals.c   |  1 +
 softmmu/physmem.c   | 84 +++--
 softmmu/vl.c|  6 +++
 5 files changed, 83 insertions(+), 37 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 25be2a692e..c5dc93cb53 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -61,6 +61,8 @@ extern int nb_option_roms;
 extern const char *prom_envs[MAX_PROM_ENVS];
 extern unsigned int nb_prom_envs;
 
+extern uint64_t max_bounce_buffer_size;
+
 /* serial ports */
 
 /* Return the Chardev for serial port i, or NULL if none */
diff --git a/qemu-options.hx b/qemu-options.hx
index 29b98c3d4c..6071794237 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4959,6 +4959,33 @@ SRST
 ERST
 #endif
 
+DEF("max-bounce-buffer-size", HAS_ARG,
+QEMU_OPTION_max_bounce_buffer_size,
+"-max-bounce-buffer-size size\n"
+"DMA bounce buffer size limit in bytes (default=4096)\n",
+QEMU_ARCH_ALL)
+SRST
+``-max-bounce-buffer-size size``
+Set the limit in bytes for DMA bounce buffer allocations.
+
+DMA bounce buffers are used when device models request memory-mapped access
+to memory regions that can't be directly mapped by the qemu process, so the
+memory must be read or written to a temporary local buffer for the device
+model to work with. This is the case e.g. for I/O memory regions, and when
+running in multi-process mode without shared access to memory.
+
+Whether bounce buffering is necessary depends heavily on the device model
+implementation. Some devices use explicit DMA read and write operations
+which do not require bounce buffers. Some devices, notably storage, will
+retry a failed DMA map request after bounce buffer space becomes available
+again. Most other devices will bail when encountering map request failures,
+which will typically appear to the guest as a hardware error.
+
+Suitable bounce buffer size values depend on the workload and guest
+configuration. A few kilobytes up to a few megabytes are common sizes
+encountered in practice.
+ERST
+
 DEFHEADING()
 
 DEFHEADING(Generic object creation:)
diff --git a/softmmu/globals.c b/softmmu/globals.c
index e83b5428d1..d3cc010717 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS];
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
 int icount_align_option;
+uint64_t max_bounce_buffer_size = 4096;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
  * little-endian "wire format" described in the SMBIOS 2.6 specification.
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3df73542e1..9f0fec0c8e 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -50,6 +50,7 @@
 #include "sysemu/dma.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/hw_accel.h"
+#include "sysemu/sysemu.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
 
@@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
 
 typedef struct {
 MemoryRegion *mr;
-void *buffer;
 hwaddr addr;
-hwaddr len;
-bool in_use;
+size_t len;
+uint8_t buffer[];
 } BounceBuffer;
 
-static BounceBuffer bounce;
+static size_t bounce_buffer_size;
 
 typedef struct MapClient {
 QEMUBH *bh;
@@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh)
 qemu_mutex_lock(&map_client_list_lock);
 client->bh = bh;
 QLIST_INSERT_HEAD(&map_client_list, client, link);
-/* Write map

Re: [PATCH 3/3] vfio-user: Message-based DMA support

2023-08-23 Thread Mattias Nissler
On Thu, Jul 20, 2023 at 8:32 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jul 04, 2023 at 01:06:27AM -0700, Mattias Nissler wrote:
> > Wire up support for DMA for the case where the vfio-user client does not
> > provide mmap()-able file descriptors, but DMA requests must be performed
> > via the VFIO-user protocol. This installs an indirect memory region,
> > which already works for pci_dma_{read,write}, and pci_dma_map works
> > thanks to the existing DMA bounce buffering support.
> >
> > Note that while simple scenarios work with this patch, there's a known
> > race condition in libvfio-user that will mess up the communication
> > channel: https://github.com/nutanix/libvfio-user/issues/279 I intend to
> > contribute a fix for this problem, see discussion on the github issue
> > for more details.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  hw/remote/vfio-user-obj.c | 62 ++-
> >  1 file changed, 55 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index 8b10c32a3c..9799580c77 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -300,6 +300,53 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t 
> > *vfu_ctx, char * const buf,
> >  return count;
> >  }
> >
> > +static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
> > +unsigned size, MemTxAttrs attrs)
> > +{
> > +MemoryRegion *region = opaque;
> > +VfuObject *o = VFU_OBJECT(region->owner);
> > +
> > +dma_sg_t *sg = alloca(dma_sg_size());
> > +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > +if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 
> > ||
> > +vfu_sgl_read(o->vfu_ctx, sg, 1, val) != 0) {
>
> Does this work on big-endian host CPUs? It looks like reading 0x12345678
> into uint64_t val would result in *val = 0x12345678 instead of
> 0x12345678.

Ah, good catch, thanks! Confirmed as an issue using a cross-compiled
s390x qemu binary. I will fix this by using ld/st helpers.
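
Roughly the pattern the later revisions adopt (sketch): read into a byte
buffer first, then convert with the host-endian load helper so the value ends
up in the right byte lanes regardless of host byte order:

    uint8_t buf[sizeof(uint64_t)];
    if (vfu_sgl_read(o->vfu_ctx, sg, 1, buf) != 0) {
        return MEMTX_ERROR;
    }
    *val = ldn_he_p(buf, size);

and symmetrically stn_he_p() into a byte buffer before vfu_sgl_write() on the
write path.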




>
> > +return MEMTX_ERROR;
> > +}
> > +
> > +return MEMTX_OK;
> > +}
> > +
> > +static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
> > + unsigned size, MemTxAttrs attrs)
> > +{
> > +MemoryRegion *region = opaque;
> > +VfuObject *o = VFU_OBJECT(region->owner);
> > +
> > +dma_sg_t *sg = alloca(dma_sg_size());
> > +vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
> > +if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 
> > ||
> > +vfu_sgl_write(o->vfu_ctx, sg, 1, &val) != 0)  {
>
> Same potential endianness issue here.
>
> Stefan



Re: [PATCH 1/3] softmmu: Support concurrent bounce buffers

2023-08-23 Thread Mattias Nissler
On Thu, Jul 20, 2023 at 8:10 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jul 04, 2023 at 01:06:25AM -0700, Mattias Nissler wrote:
> > It is not uncommon for device models to request mapping of several DMA
> > regions at the same time. An example is igb (and probably other net
> > devices as well) when a packet is spread across multiple descriptors.
> >
> > In order to support this when indirect DMA is used, as is the case when
> > running the device model in a vfio-server process without mmap()-ed DMA,
> > this change allocates DMA bounce buffers dynamically instead of
> > supporting only a single buffer.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  softmmu/physmem.c | 74 ++-
> >  1 file changed, 35 insertions(+), 39 deletions(-)
>
> Is this a functional change or purely a performance optimization? If
> it's a performance optimization, please include benchmark results to
> justify this change.


It's a functional change in the sense that it fixes qemu to make some
hardware models actually work with bounce-buffered DMA. Right now, the
device models attempt to perform DMA accesses, receive an error due to
bounce buffer contention and then just bail, which the guest will
observe as a timeout and/or hardware error. I ran into this with igb
and xhci.
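
For context, the failing pattern looks roughly like this in the affected
device models (a sketch; "dev", "addr" and "size" are hypothetical):

    dma_addr_t len = size;
    void *p = pci_dma_map(dev, addr, &len, DMA_DIRECTION_FROM_DEVICE);
    if (!p || len < size) {
        /* no retry path: the device flags an error, which the guest sees
         * as a timeout or hardware fault */
    }

With only a single global bounce buffer, a second concurrent mapping always
took this branch when RAM wasn't mmap()-ed.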

>
>
> QEMU memory allocations must be bounded so that an untrusted guest
> cannot cause QEMU to exhaust host memory. There must be a limit to the
> amount of bounce buffer memory.

Ah, makes sense. I will add code to track the total bounce buffer size
and enforce a limit. Since the amount of buffer space depends a lot on
the workload (I have observed xhci + usb-storage + Linux to use 1MB
buffer sizes by default), I'll make the limit configurable.
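
One way such a limit can be enforced (a standalone sketch of the accounting
idea, not the code that will go into the patch): keep a running total and
reserve space with a compare-and-swap before allocating a bounce buffer,
releasing it again on unmap:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    static _Atomic size_t bounce_buffer_total;
    static size_t max_bounce_buffer_size = 4096;

    static bool reserve_bounce_space(size_t len)
    {
        size_t old = atomic_load(&bounce_buffer_total);
        do {
            if (old + len > max_bounce_buffer_size) {
                return false;   /* over budget: fail the map request */
            }
        } while (!atomic_compare_exchange_weak(&bounce_buffer_total, &old,
                                               old + len));
        return true;
    }

    static void release_bounce_space(size_t len)
    {
        atomic_fetch_sub(&bounce_buffer_total, len);
    }

A failed reservation then shows up to callers as address_space_map()
returning NULL.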





>
>
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index bda475a719..56130b5a1d 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -2904,13 +2904,11 @@ void cpu_flush_icache_range(hwaddr start, hwaddr 
> > len)
> >
> >  typedef struct {
> >  MemoryRegion *mr;
> > -void *buffer;
> >  hwaddr addr;
> > -hwaddr len;
> > -bool in_use;
> > +uint8_t buffer[];
> >  } BounceBuffer;
> >
> > -static BounceBuffer bounce;
> > +static size_t bounce_buffers_in_use;
> >
> >  typedef struct MapClient {
> >  QEMUBH *bh;
> > @@ -2947,7 +2945,7 @@ void cpu_register_map_client(QEMUBH *bh)
> >  QLIST_INSERT_HEAD(&map_client_list, client, link);
> >  /* Write map_client_list before reading in_use.  */
> >  smp_mb();
> > -if (!qatomic_read(&bounce.in_use)) {
> > +if (qatomic_read(&bounce_buffers_in_use)) {
> >  cpu_notify_map_clients_locked();
> >  }
> >  qemu_mutex_unlock(&map_client_list_lock);
> > @@ -3076,31 +3074,24 @@ void *address_space_map(AddressSpace *as,
> >  RCU_READ_LOCK_GUARD();
> >  fv = address_space_to_flatview(as);
> >  mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > +memory_region_ref(mr);
> >
> >  if (!memory_access_is_direct(mr, is_write)) {
> > -if (qatomic_xchg(&bounce.in_use, true)) {
> > -*plen = 0;
> > -return NULL;
> > -}
> > -/* Avoid unbounded allocations */
> > -l = MIN(l, TARGET_PAGE_SIZE);
> > -bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> > -bounce.addr = addr;
> > -bounce.len = l;
> > -
> > -memory_region_ref(mr);
> > -bounce.mr = mr;
> > +qatomic_inc_fetch(&bounce_buffers_in_use);
> > +
> > +BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
> > +bounce->addr = addr;
> > +bounce->mr = mr;
> > +
> >  if (!is_write) {
> >  flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> > -   bounce.buffer, l);
> > +  bounce->buffer, l);
> >  }
> >
> >  *plen = l;
> > -return bounce.buffer;
> > +return bounce->buffer;
>
> Bounce buffer allocation always succeeds now. Can the
> cpu_notify_map_clients*() be removed now that no one is waiting for
> bounce buffers anymore?
>
> >  }
> >
> > -
> > -memory_region_ref(mr);
> >  *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> >  l, is_write, attrs);
> >  fuzz_dma_read_cb(addr, *plen, mr);
> > @@ -3114,31 +3105,36 @@ void *address_space_ma

Re: [PATCH 2/3] softmmu: Remove DMA unmap notification callback

2023-08-23 Thread Mattias Nissler
On Thu, Jul 20, 2023 at 8:14 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jul 04, 2023 at 01:06:26AM -0700, Mattias Nissler wrote:
> > According to old commit messages, this was introduced to retry a DMA
> > operation at a later point in case the single bounce buffer is found to
> > be busy. This was never used widely - only the dma-helpers code made use
> > of it, but there are other device models that use multiple DMA mappings
> > (concurrently) and just failed.
> >
> > After the improvement to support multiple concurrent bounce buffers,
> > the condition the notification callback allowed to work around no
> > longer exists, so we can just remove the logic and simplify the code.
> >
> > Signed-off-by: Mattias Nissler 
> > ---
> >  softmmu/dma-helpers.c | 28 -
> >  softmmu/physmem.c | 71 ---
> >  2 files changed, 99 deletions(-)
>
> I'm not sure if it will be possible to remove this once a limit is
> placed on bounce buffer space.

I investigated this in detail and concluded that you're right
unfortunately. In particular, after I found that Linux likes to use
megabyte-sized DMA buffers with xhci-attached USB storage, I don't
think it's realistic to set a reasonable fixed limit that accommodates
most workloads in practice.






>
> >
> > diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
> > index 2463964805..d05d226f11 100644
> > --- a/softmmu/dma-helpers.c
> > +++ b/softmmu/dma-helpers.c
> > @@ -68,23 +68,10 @@ typedef struct {
> >  int sg_cur_index;
> >  dma_addr_t sg_cur_byte;
> >  QEMUIOVector iov;
> > -QEMUBH *bh;
> >  DMAIOFunc *io_func;
> >  void *io_func_opaque;
> >  } DMAAIOCB;
> >
> > -static void dma_blk_cb(void *opaque, int ret);
> > -
> > -static void reschedule_dma(void *opaque)
> > -{
> > -DMAAIOCB *dbs = (DMAAIOCB *)opaque;
> > -
> > -assert(!dbs->acb && dbs->bh);
> > -qemu_bh_delete(dbs->bh);
> > -dbs->bh = NULL;
> > -dma_blk_cb(dbs, 0);
> > -}
> > -
> >  static void dma_blk_unmap(DMAAIOCB *dbs)
> >  {
> >  int i;
> > @@ -101,7 +88,6 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
> >  {
> >  trace_dma_complete(dbs, ret, dbs->common.cb);
> >
> > -assert(!dbs->acb && !dbs->bh);
> >  dma_blk_unmap(dbs);
> >  if (dbs->common.cb) {
> >  dbs->common.cb(dbs->common.opaque, ret);
> > @@ -164,13 +150,6 @@ static void dma_blk_cb(void *opaque, int ret)
> >  }
> >  }
> >
> > -if (dbs->iov.size == 0) {
> > -trace_dma_map_wait(dbs);
> > -dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
> > -cpu_register_map_client(dbs->bh);
> > -goto out;
> > -}
> > -
> >  if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
> >  qemu_iovec_discard_back(&dbs->iov,
> >  QEMU_ALIGN_DOWN(dbs->iov.size, 
> > dbs->align));
> > @@ -189,18 +168,12 @@ static void dma_aio_cancel(BlockAIOCB *acb)
> >
> >  trace_dma_aio_cancel(dbs);
> >
> > -assert(!(dbs->acb && dbs->bh));
> >  if (dbs->acb) {
> >  /* This will invoke dma_blk_cb.  */
> >  blk_aio_cancel_async(dbs->acb);
> >  return;
> >  }
> >
> > -if (dbs->bh) {
> > -cpu_unregister_map_client(dbs->bh);
> > -qemu_bh_delete(dbs->bh);
> > -dbs->bh = NULL;
> > -}
> >  if (dbs->common.cb) {
> >  dbs->common.cb(dbs->common.opaque, -ECANCELED);
> >  }
> > @@ -239,7 +212,6 @@ BlockAIOCB *dma_blk_io(AioContext *ctx,
> >  dbs->dir = dir;
> >  dbs->io_func = io_func;
> >  dbs->io_func_opaque = io_func_opaque;
> > -dbs->bh = NULL;
> >  qemu_iovec_init(&dbs->iov, sg->nsg);
> >  dma_blk_cb(dbs, 0);
> >  return &dbs->common;
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index 56130b5a1d..2b4123c127 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -2908,49 +2908,6 @@ typedef struct {
> >  uint8_t buffer[];
> >  } BounceBuffer;
> >
> > -static size_t bounce_buffers_in_use;
> > -
> > -typedef struct MapClient {
> > -QEMUBH *bh;
> > -QLIST_ENTRY(MapClient) link;
> > -} MapClient;
> > -
> > -QemuMutex map_client_list

Re: [PATCH 0/3] Support message-based DMA in vfio-user server

2023-07-20 Thread Mattias Nissler
Stefan,

I hope you had a great vacation!

Thanks for updating the mirror and your review. Your comments all make
sense, and I will address your input when I find time - just a quick
ack now since I'm travelling next week and will be on vacation the
first half of August, so it might be a while.

Thanks,
Mattias

On Thu, Jul 20, 2023 at 8:41 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jul 04, 2023 at 01:06:24AM -0700, Mattias Nissler wrote:
> > This series adds basic support for message-based DMA in qemu's vfio-user
> > server. This is useful for cases where the client does not provide file
> > descriptors for accessing system memory via memory mappings. My motivating 
> > use
> > case is to hook up device models as PCIe endpoints to a hardware design. 
> > This
> > works by bridging the PCIe transaction layer to vfio-user, and the endpoint
> > does not access memory directly, but sends memory requests TLPs to the 
> > hardware
> > design in order to perform DMA.
> >
> > Note that in addition to the 3 commits included, we also need a
> > subprojects/libvfio-user roll to bring in this bugfix:
> > https://github.com/nutanix/libvfio-user/commit/bb308a2e8ee9486a4c8b53d8d773f7c8faaeba08
> > Stefan, can I ask you to kindly update the
> > https://gitlab.com/qemu-project/libvfio-user mirror? I'll be happy to 
> > include
> > an update to subprojects/libvfio-user.wrap in this series.
>
> Done:
> https://gitlab.com/qemu-project/libvfio-user/-/commits/master
>
> Repository mirroring is automated now, so new upstream commits will
> appear in the QEMU mirror repository from now on.
>
> >
> > Finally, there is some more work required on top of this series to get
> > message-based DMA to really work well:
> >
> > * libvfio-user has a long-standing issue where socket communication gets 
> > messed
> >   up when messages are sent from both ends at the same time. See
> >   https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
> >   been engaging there and plan to contribute a fix.
> >
> > * qemu currently breaks down DMA accesses into chunks of size 8 bytes at
> >   maximum, each of which will be handled in a separate vfio-user DMA request
> >   message. This is quite terrible for large DMA acceses, such as when nvme
> >   reads and writes page-sized blocks for example. Thus, I would like to 
> > improve
> >   qemu to be able to perform larger accesses, at least for indirect memory
> >   regions. I have something working locally, but since this will likely 
> > result
> >   in more involved surgery and discussion, I am leaving this to be 
> > addressed in
> >   a separate patch.
> >
> > Mattias Nissler (3):
> >   softmmu: Support concurrent bounce buffers
> >   softmmu: Remove DMA unmap notification callback
> >   vfio-user: Message-based DMA support
> >
> >  hw/remote/vfio-user-obj.c |  62 --
> >  softmmu/dma-helpers.c |  28 
> >  softmmu/physmem.c | 131 --
> >  3 files changed, 83 insertions(+), 138 deletions(-)
>
> Sorry for the late review. I was on vacation and am catching up on
> emails.
>
> Paolo worked on the QEMU memory API and can give input on how to make
> this efficient for large DMA accesses. There is a chance that memory
> dispatch with larger sizes will be needed for ENQCMD CPU instruction
> emulation too.
>
> Stefan



[PATCH 2/3] softmmu: Remove DMA unmap notification callback

2023-07-04 Thread Mattias Nissler
According to old commit messages, this was introduced to retry a DMA
operation at a later point in case the single bounce buffer is found to
be busy. This was never used widely - only the dma-helpers code made use
of it, but there are other device models that use multiple DMA mappings
(concurrently) and just failed.

After the improvement to support multiple concurrent bounce buffers,
the condition the notification callback allowed to work around no
longer exists, so we can just remove the logic and simplify the code.

Signed-off-by: Mattias Nissler 
---
 softmmu/dma-helpers.c | 28 -
 softmmu/physmem.c | 71 ---
 2 files changed, 99 deletions(-)

diff --git a/softmmu/dma-helpers.c b/softmmu/dma-helpers.c
index 2463964805..d05d226f11 100644
--- a/softmmu/dma-helpers.c
+++ b/softmmu/dma-helpers.c
@@ -68,23 +68,10 @@ typedef struct {
 int sg_cur_index;
 dma_addr_t sg_cur_byte;
 QEMUIOVector iov;
-QEMUBH *bh;
 DMAIOFunc *io_func;
 void *io_func_opaque;
 } DMAAIOCB;
 
-static void dma_blk_cb(void *opaque, int ret);
-
-static void reschedule_dma(void *opaque)
-{
-DMAAIOCB *dbs = (DMAAIOCB *)opaque;
-
-assert(!dbs->acb && dbs->bh);
-qemu_bh_delete(dbs->bh);
-dbs->bh = NULL;
-dma_blk_cb(dbs, 0);
-}
-
 static void dma_blk_unmap(DMAAIOCB *dbs)
 {
 int i;
@@ -101,7 +88,6 @@ static void dma_complete(DMAAIOCB *dbs, int ret)
 {
 trace_dma_complete(dbs, ret, dbs->common.cb);
 
-assert(!dbs->acb && !dbs->bh);
 dma_blk_unmap(dbs);
 if (dbs->common.cb) {
 dbs->common.cb(dbs->common.opaque, ret);
@@ -164,13 +150,6 @@ static void dma_blk_cb(void *opaque, int ret)
 }
 }
 
-if (dbs->iov.size == 0) {
-trace_dma_map_wait(dbs);
-dbs->bh = aio_bh_new(ctx, reschedule_dma, dbs);
-cpu_register_map_client(dbs->bh);
-goto out;
-}
-
 if (!QEMU_IS_ALIGNED(dbs->iov.size, dbs->align)) {
 qemu_iovec_discard_back(&dbs->iov,
 QEMU_ALIGN_DOWN(dbs->iov.size, dbs->align));
@@ -189,18 +168,12 @@ static void dma_aio_cancel(BlockAIOCB *acb)
 
 trace_dma_aio_cancel(dbs);
 
-assert(!(dbs->acb && dbs->bh));
 if (dbs->acb) {
 /* This will invoke dma_blk_cb.  */
 blk_aio_cancel_async(dbs->acb);
 return;
 }
 
-if (dbs->bh) {
-cpu_unregister_map_client(dbs->bh);
-qemu_bh_delete(dbs->bh);
-dbs->bh = NULL;
-}
 if (dbs->common.cb) {
 dbs->common.cb(dbs->common.opaque, -ECANCELED);
 }
@@ -239,7 +212,6 @@ BlockAIOCB *dma_blk_io(AioContext *ctx,
 dbs->dir = dir;
 dbs->io_func = io_func;
 dbs->io_func_opaque = io_func_opaque;
-dbs->bh = NULL;
 qemu_iovec_init(&dbs->iov, sg->nsg);
 dma_blk_cb(dbs, 0);
 return &dbs->common;
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 56130b5a1d..2b4123c127 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2908,49 +2908,6 @@ typedef struct {
 uint8_t buffer[];
 } BounceBuffer;
 
-static size_t bounce_buffers_in_use;
-
-typedef struct MapClient {
-QEMUBH *bh;
-QLIST_ENTRY(MapClient) link;
-} MapClient;
-
-QemuMutex map_client_list_lock;
-static QLIST_HEAD(, MapClient) map_client_list
-= QLIST_HEAD_INITIALIZER(map_client_list);
-
-static void cpu_unregister_map_client_do(MapClient *client)
-{
-QLIST_REMOVE(client, link);
-g_free(client);
-}
-
-static void cpu_notify_map_clients_locked(void)
-{
-MapClient *client;
-
-while (!QLIST_EMPTY(&map_client_list)) {
-client = QLIST_FIRST(&map_client_list);
-qemu_bh_schedule(client->bh);
-cpu_unregister_map_client_do(client);
-}
-}
-
-void cpu_register_map_client(QEMUBH *bh)
-{
-MapClient *client = g_malloc(sizeof(*client));
-
-qemu_mutex_lock(&map_client_list_lock);
-client->bh = bh;
-QLIST_INSERT_HEAD(_client_list, client, link);
-/* Write map_client_list before reading in_use.  */
-smp_mb();
-if (qatomic_read(&bounce_buffers_in_use)) {
-cpu_notify_map_clients_locked();
-}
-qemu_mutex_unlock(&map_client_list_lock);
-}
-
 void cpu_exec_init_all(void)
 {
 qemu_mutex_init(&ram_list.mutex);
@@ -2964,28 +2921,6 @@ void cpu_exec_init_all(void)
 finalize_target_page_bits();
 io_mem_init();
 memory_map_init();
-qemu_mutex_init(&map_client_list_lock);
-}
-
-void cpu_unregister_map_client(QEMUBH *bh)
-{
-MapClient *client;
-
-qemu_mutex_lock(&map_client_list_lock);
-QLIST_FOREACH(client, &map_client_list, link) {
-if (client->bh == bh) {
-cpu_unregister_map_client_do(client);
-break;
-}
-}
-qemu_mutex_unlock(&map_client_list_lock);
-}
-
-static void cpu_notify_map_clients(void)
-{
-qemu_mutex_lock(&map_client_list_lock);
-cpu_notify_map_clients_locked();
-qemu_mutex_unlock(&map_client_list_lock);
-}

[PATCH 0/3] Support message-based DMA in vfio-user server

2023-07-04 Thread Mattias Nissler
This series adds basic support for message-based DMA in qemu's vfio-user
server. This is useful for cases where the client does not provide file
descriptors for accessing system memory via memory mappings. My motivating use
case is to hook up device models as PCIe endpoints to a hardware design. This
works by bridging the PCIe transaction layer to vfio-user, and the endpoint
does not access memory directly, but sends memory requests TLPs to the hardware
design in order to perform DMA.

Note that in addition to the 3 commits included, we also need a
subprojects/libvfio-user roll to bring in this bugfix:
https://github.com/nutanix/libvfio-user/commit/bb308a2e8ee9486a4c8b53d8d773f7c8faaeba08
Stefan, can I ask you to kindly update the
https://gitlab.com/qemu-project/libvfio-user mirror? I'll be happy to include
an update to subprojects/libvfio-user.wrap in this series.

Finally, there is some more work required on top of this series to get
message-based DMA to really work well:

* libvfio-user has a long-standing issue where socket communication gets messed
  up when messages are sent from both ends at the same time. See
  https://github.com/nutanix/libvfio-user/issues/279 for more details. I've
  been engaging there and plan to contribute a fix.

* qemu currently breaks down DMA accesses into chunks of size 8 bytes at
  maximum, each of which will be handled in a separate vfio-user DMA request
  message. This is quite terrible for large DMA accesses, such as when nvme
  reads and writes page-sized blocks for example. Thus, I would like to improve
  qemu to be able to perform larger accesses, at least for indirect memory
  regions. I have something working locally, but since this will likely result
  in more involved surgery and discussion, I am leaving this to be addressed in
  a separate patch.
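
For illustration, the current behavior amounts to something like the
following sketch (not actual qemu code; vfu_dma_request() is a hypothetical
stand-in for whatever ends up sending a single vfio-user DMA message):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper representing one vfio-user DMA request message. */
    void vfu_dma_request(uint64_t addr, uint8_t *buf, size_t len);

    /*
     * Sketch: a read targeting an indirect memory region is split into
     * pieces no larger than the region's maximum access size (8 bytes
     * here), and each piece becomes a separate vfio-user DMA request.
     */
    void indirect_dma_read_sketch(uint64_t addr, uint8_t *buf, size_t len)
    {
        size_t done = 0;

        while (done < len) {
            size_t chunk = len - done < 8 ? len - done : 8;

            vfu_dma_request(addr + done, buf + done, chunk);
            done += chunk;
        }
    }

A page-sized nvme transfer thus turns into hundreds of round trips, which
is exactly what I'd like to avoid for indirect memory regions.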

Mattias Nissler (3):
  softmmu: Support concurrent bounce buffers
  softmmu: Remove DMA unmap notification callback
  vfio-user: Message-based DMA support

 hw/remote/vfio-user-obj.c |  62 --
 softmmu/dma-helpers.c |  28 
 softmmu/physmem.c | 131 --
 3 files changed, 83 insertions(+), 138 deletions(-)

-- 
2.34.1




[PATCH 3/3] vfio-user: Message-based DMA support

2023-07-04 Thread Mattias Nissler
Wire up support for DMA for the case where the vfio-user client does not
provide mmap()-able file descriptors, but DMA requests must be performed
via the VFIO-user protocol. This installs an indirect memory region,
which already works for pci_dma_{read,write}, and pci_dma_map works
thanks to the existing DMA bounce buffering support.

Note that while simple scenarios work with this patch, there's a known
race condition in libvfio-user that will mess up the communication
channel: https://github.com/nutanix/libvfio-user/issues/279 I intend to
contribute a fix for this problem, see discussion on the github issue
for more details.

Signed-off-by: Mattias Nissler 
---
 hw/remote/vfio-user-obj.c | 62 ++-
 1 file changed, 55 insertions(+), 7 deletions(-)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8b10c32a3c..9799580c77 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -300,6 +300,53 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
 return count;
 }
 
+static MemTxResult vfu_dma_read(void *opaque, hwaddr addr, uint64_t *val,
+unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+VfuObject *o = VFU_OBJECT(region->owner);
+
+dma_sg_t *sg = alloca(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_READ) < 0 ||
+vfu_sgl_read(o->vfu_ctx, sg, 1, val) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static MemTxResult vfu_dma_write(void *opaque, hwaddr addr, uint64_t val,
+ unsigned size, MemTxAttrs attrs)
+{
+MemoryRegion *region = opaque;
+VfuObject *o = VFU_OBJECT(region->owner);
+
+dma_sg_t *sg = alloca(dma_sg_size());
+vfu_dma_addr_t vfu_addr = (vfu_dma_addr_t)(region->addr + addr);
+if (vfu_addr_to_sgl(o->vfu_ctx, vfu_addr, size, sg, 1, PROT_WRITE) < 0 ||
+vfu_sgl_write(o->vfu_ctx, sg, 1, &val) != 0) {
+return MEMTX_ERROR;
+}
+
+return MEMTX_OK;
+}
+
+static const MemoryRegionOps vfu_dma_ops = {
+.read_with_attrs = vfu_dma_read,
+.write_with_attrs = vfu_dma_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 8,
+.unaligned = true,
+},
+.impl = {
+.min_access_size = 1,
+.max_access_size = 8,
+},
+};
+
 static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 {
 VfuObject *o = vfu_get_private(vfu_ctx);
@@ -308,17 +355,18 @@ static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
 g_autofree char *name = NULL;
 struct iovec *iov = &info->iova;
 
-if (!info->vaddr) {
-return;
-}
-
 name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
-   (uint64_t)info->vaddr);
+   (uint64_t)iov->iov_base);
 
 subregion = g_new0(MemoryRegion, 1);
 
-memory_region_init_ram_ptr(subregion, NULL, name,
-   iov->iov_len, info->vaddr);
+if (info->vaddr) {
+memory_region_init_ram_ptr(subregion, OBJECT(o), name,
+   iov->iov_len, info->vaddr);
+} else {
+memory_region_init_io(subregion, OBJECT(o), &vfu_dma_ops, subregion,
+  name, iov->iov_len);
+}
 
 dma_as = pci_device_iommu_address_space(o->pci_dev);
 
-- 
2.34.1




[PATCH 1/3] softmmu: Support concurrent bounce buffers

2023-07-04 Thread Mattias Nissler
It is not uncommon for device models to request mapping of several DMA
regions at the same time. An example is igb (and probably other net
devices as well) when a packet is spread across multiple descriptors.

In order to support this when indirect DMA is used, as is the case when
running the device model in a vfio-server process without mmap()-ed DMA,
this change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer.

Signed-off-by: Mattias Nissler 
---
 softmmu/physmem.c | 74 ++-
 1 file changed, 35 insertions(+), 39 deletions(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index bda475a719..56130b5a1d 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2904,13 +2904,11 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
 
 typedef struct {
 MemoryRegion *mr;
-void *buffer;
 hwaddr addr;
-hwaddr len;
-bool in_use;
+uint8_t buffer[];
 } BounceBuffer;
 
-static BounceBuffer bounce;
+static size_t bounce_buffers_in_use;
 
 typedef struct MapClient {
 QEMUBH *bh;
@@ -2947,7 +2945,7 @@ void cpu_register_map_client(QEMUBH *bh)
 QLIST_INSERT_HEAD(_client_list, client, link);
 /* Write map_client_list before reading in_use.  */
 smp_mb();
-if (!qatomic_read(&bounce.in_use)) {
+if (qatomic_read(&bounce_buffers_in_use)) {
 cpu_notify_map_clients_locked();
 }
 qemu_mutex_unlock(_client_list_lock);
@@ -3076,31 +3074,24 @@ void *address_space_map(AddressSpace *as,
 RCU_READ_LOCK_GUARD();
 fv = address_space_to_flatview(as);
 mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
+memory_region_ref(mr);
 
 if (!memory_access_is_direct(mr, is_write)) {
-if (qatomic_xchg(&bounce.in_use, true)) {
-*plen = 0;
-return NULL;
-}
-/* Avoid unbounded allocations */
-l = MIN(l, TARGET_PAGE_SIZE);
-bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
-bounce.addr = addr;
-bounce.len = l;
-
-memory_region_ref(mr);
-bounce.mr = mr;
+qatomic_inc_fetch(&bounce_buffers_in_use);
+
+BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
+bounce->addr = addr;
+bounce->mr = mr;
+
 if (!is_write) {
 flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
-   bounce.buffer, l);
+  bounce->buffer, l);
 }
 
 *plen = l;
-return bounce.buffer;
+return bounce->buffer;
 }
 
-
-memory_region_ref(mr);
 *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
 l, is_write, attrs);
 fuzz_dma_read_cb(addr, *plen, mr);
@@ -3114,31 +3105,36 @@ void *address_space_map(AddressSpace *as,
 void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
  bool is_write, hwaddr access_len)
 {
-if (buffer != bounce.buffer) {
-MemoryRegion *mr;
-ram_addr_t addr1;
+MemoryRegion *mr;
+ram_addr_t addr1;
+
+mr = memory_region_from_host(buffer, &addr1);
+if (mr == NULL) {
+/*
+ * Must be a bounce buffer (unless the caller passed a pointer which
+ * wasn't returned by address_space_map, which is illegal).
+ */
+BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
 
-mr = memory_region_from_host(buffer, &addr1);
-assert(mr != NULL);
 if (is_write) {
-invalidate_and_set_dirty(mr, addr1, access_len);
+address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
+bounce->buffer, access_len);
 }
-if (xen_enabled()) {
-xen_invalidate_map_cache_entry(buffer);
+memory_region_unref(bounce->mr);
+g_free(bounce);
+
+if (qatomic_dec_fetch(&bounce_buffers_in_use) == 1) {
+cpu_notify_map_clients();
 }
-memory_region_unref(mr);
 return;
 }
+
+if (xen_enabled()) {
+xen_invalidate_map_cache_entry(buffer);
+}
 if (is_write) {
-address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
-bounce.buffer, access_len);
-}
-qemu_vfree(bounce.buffer);
-bounce.buffer = NULL;
-memory_region_unref(bounce.mr);
-/* Clear in_use before reading map_client_list.  */
-qatomic_set_mb(&bounce.in_use, false);
-cpu_notify_map_clients();
+invalidate_and_set_dirty(mr, addr1, access_len);
+}
 }
 
 void *cpu_physical_memory_map(hwaddr addr,
-- 
2.34.1




[PATCH] hw/remote: Fix vfu_cfg trace offset format

2023-04-26 Thread Mattias Nissler
The printed offset value is prefixed with 0x, but was actually printed
in decimal. To spare others the confusion, adjust the format specifier
to hexadecimal.
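
For example (standalone illustration, not part of the patch), with an
offset of 16 the old format string prints decimal digits behind a hex
prefix, while the new one prints the actual hex value:

    #include <stdio.h>

    int main(void)
    {
        unsigned int offset = 16;

        printf("0x%u\n", offset);  /* old format: prints "0x16" (decimal) */
        printf("0x%x\n", offset);  /* new format: prints "0x10" (hex) */
        return 0;
    }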

Signed-off-by: Mattias Nissler 
---
 hw/remote/trace-events | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index c167b3c7a5..0d1b7d56a5 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -5,8 +5,8 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to 
receive %d size %d,
 
 # vfio-user-obj.c
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
-vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
-vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
+vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x -> 0x%x"
+vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%x <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
-- 
2.25.1




Re: [PATCH] coresight: etm4x: Add config to exclude kernel mode tracing

2021-01-18 Thread Mattias Nissler
On Fri, Jan 15, 2021 at 6:46 AM Sai Prakash Ranjan
 wrote:
>
> Hello Mathieu, Suzuki
>
> On 2020-10-15 21:32, Mathieu Poirier wrote:
> > On Thu, Oct 15, 2020 at 06:15:22PM +0530, Sai Prakash Ranjan wrote:
> >> On production systems with ETMs enabled, it is preferred to
> >> exclude kernel mode(NS EL1) tracing for security concerns and
> >> support only userspace(NS EL0) tracing. So provide an option
> >> via kconfig to exclude kernel mode tracing if it is required.
> >> This config is disabled by default and would not affect the
> >> current configuration which has both kernel and userspace
> >> tracing enabled by default.
> >>
> >
> > One requires root access (or be part of a special trace group) to be
> > able to use
> > the cs_etm PMU.  With this kind of elevated access restricting tracing
> > at EL1
> > provides little in terms of security.
> >
>
> Apart from the VM usecase discussed, I am told there are other
> security concerns here regarding need to exclude kernel mode tracing
> even for the privileged users/root. One such case being the ability
> to analyze cryptographic code execution since ETMs can record all
> branch instructions including timestamps in the kernel and there may
> be other cases as well which I may not be aware of and hence have
> added Denis and Mattias. Please let us know if you have any questions
> further regarding this not being a security concern.

Well, the idea that root privileges != full control over the kernel
isn't new and, at least since lockdown became part of mainline [1], is
no longer an esoteric edge case. Regarding the use case Sai hints at
(namely protection of secrets in the kernel), Matthew Garrett
actually has some more thoughts about confidentiality mode for
lockdown for secret protection [2]. And thus, unless someone can make
a compelling case that instruction-level tracing will not leak secrets
held by the kernel, I think an option for the kernel to prevent itself
from being traced (even by root) is valuable.

Finally, to sketch a practical use case scenario: Consider a system
where disk contents are encrypted and the encryption key is set up by
the user when mounting the file system. From that point on the
encryption key resides in the kernel. It seems reasonable to expect
that the disk encryption key be protected from exfiltration even if
the system later suffers a root compromise (or even against insiders
that have root access), at least as long as the attacker doesn't
manage to compromise the kernel.

[1] https://lwn.net/Articles/796866/
[2] https://mjg59.dreamwidth.org/55105.html

>
> After this discussion, I would like to post a v2 based on Suzuki's
> feedback earlier. @Suzuki, I have a common config for ETM3 and ETM4
> but couldn't get much idea on how to implement it for Intel PTs, if
> you have any suggestions there, please do share or we can have this
> only for Coresight ETMs.
>
> Thanks,
> Sai
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member
> of Code Aurora Forum, hosted by The Linux Foundation


Re: [PATCH 4.9 30/47] ANDROID: binder: remove waitqueue when thread exits.

2019-10-07 Thread Mattias Nissler
(resend, apologies for accidental HTML reply)

On Sun, Oct 6, 2019 at 11:24 AM Greg Kroah-Hartman
 wrote:
>
> On Sun, Oct 06, 2019 at 10:32:02AM -0700, Eric Biggers wrote:
> > On Sun, Oct 06, 2019 at 07:21:17PM +0200, Greg Kroah-Hartman wrote:
> > > From: Martijn Coenen 
> > >
> > > commit f5cb779ba16334b45ba8946d6bfa6d9834d1527f upstream.
> > >
> > > binder_poll() passes the thread->wait waitqueue that
> > > can be slept on for work. When a thread that uses
> > > epoll explicitly exits using BINDER_THREAD_EXIT,
> > > the waitqueue is freed, but it is never removed
> > > from the corresponding epoll data structure. When
> > > the process subsequently exits, the epoll cleanup
> > > code tries to access the waitlist, which results in
> > > a use-after-free.
> > >
> > > Prevent this by using POLLFREE when the thread exits.
> > >
> > > Signed-off-by: Martijn Coenen 
> > > Reported-by: syzbot 
> > > Cc: stable  # 4.14
> > > [backport BINDER_LOOPER_STATE_POLL logic as well]
> > > Signed-off-by: Mattias Nissler 
> > > Signed-off-by: Greg Kroah-Hartman 
> > > ---
> > >  drivers/android/binder.c |   17 -
> > >  1 file changed, 16 insertions(+), 1 deletion(-)
> > >
> > > --- a/drivers/android/binder.c
> > > +++ b/drivers/android/binder.c
> > > @@ -334,7 +334,8 @@ enum {
> > > BINDER_LOOPER_STATE_EXITED  = 0x04,
> > > BINDER_LOOPER_STATE_INVALID = 0x08,
> > > BINDER_LOOPER_STATE_WAITING = 0x10,
> > > -   BINDER_LOOPER_STATE_NEED_RETURN = 0x20
> > > +   BINDER_LOOPER_STATE_NEED_RETURN = 0x20,
> > > +   BINDER_LOOPER_STATE_POLL= 0x40,
> > >  };
> > >
> > >  struct binder_thread {
> > > @@ -2628,6 +2629,18 @@ static int binder_free_thread(struct bin
> > > } else
> > > BUG();
> > > }
> > > +
> > > +   /*
> > > +* If this thread used poll, make sure we remove the waitqueue
> > > +* from any epoll data structures holding it with POLLFREE.
> > > +* waitqueue_active() is safe to use here because we're holding
> > > +* the inner lock.
> > > +*/
> > > +   if ((thread->looper & BINDER_LOOPER_STATE_POLL) &&
> > > +   waitqueue_active(>wait)) {
> > > +   wake_up_poll(>wait, POLLHUP | POLLFREE);
> > > +   }
> > > +
> > > if (send_reply)
> > > binder_send_failed_reply(send_reply, BR_DEAD_REPLY);
> > > binder_release_work(>todo);
> > > @@ -2651,6 +2664,8 @@ static unsigned int binder_poll(struct f
> > > return POLLERR;
> > > }
> > >
> > > +   thread->looper |= BINDER_LOOPER_STATE_POLL;
> > > +
> > > wait_for_proc_work = thread->transaction_stack == NULL &&
> > > list_empty(>todo) && thread->return_error == BR_OK;
> > >
> >
> > Are you sure this backport is correct, given that in 4.9, binder_poll()
> > sometimes uses proc->wait instead of thread->wait?:

Jann's PoC calls the BINDER_THREAD_EXIT ioctl to free the
binder_thread which will then cause the UAF, and this is cut off by
the patch. IIUC, you are worried about a similar UAF on the proc->wait
access. I am not 100% sure, but I think the binder_proc lifetime
matches the corresponding struct file instance, so it shouldn't be
possible to get the binder_proc deallocated while still being able to
access it via filp->private_data.

> >
> > wait_for_proc_work = thread->transaction_stack == NULL &&
> > list_empty(>todo) && thread->return_error == BR_OK;
> >
> > binder_unlock(__func__);
> >
> > if (wait_for_proc_work) {
> > if (binder_has_proc_work(proc, thread))
> > return POLLIN;
> > poll_wait(filp, >wait, wait);
> > if (binder_has_proc_work(proc, thread))
> > return POLLIN;
> > } else {
> > if (binder_has_thread_work(thread))
> > return POLLIN;
> > poll_wait(filp, >wait, wait);
> > if (binder_has_thread_work(thread))
> > return POLLIN;
> > }
> > return 0;
>
> I _think_ the backport is correct, and I know someone has verified that
> the 4.4.y backport works properly and I don't see much difference here
> from that version.
>
> But I will defer to Todd and Martijn here, as they know this code _WAY_
> better than I do.  The codebase has changed a lot from 4.9.y to 4.14.y
> so it makes it hard to do equal comparisons simply.
>
> Todd and Martijn, thoughts?
>
> thanks,
>
> greg k-h


Re: [PATCH 3/3] brcmfmac: Add check for short event packets

2017-09-11 Thread Mattias Nissler
On Fri, Sep 8, 2017 at 9:13 PM, Kevin Cernekee <cerne...@chromium.org> wrote:
>
> The length of the data in the received skb is currently passed into
> brcmf_fweh_process_event() as packet_len, but this value is not checked.
> event_packet should be followed by DATALEN bytes of additional event
> data.  Ensure that the received packet actually contains at least
> DATALEN bytes of additional data, to avoid copying uninitialized memory
> into event->data.
>
> Suggested-by: Mattias Nissler <mniss...@chromium.org>
> Signed-off-by: Kevin Cernekee <cerne...@chromium.org>
> ---
>  drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
> index 5aabdc9ed7e0..4cad1f0d2a82 100644
> --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
> +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
> @@ -429,7 +429,8 @@ void brcmf_fweh_process_event(struct brcmf_pub *drvr,
> if (code != BRCMF_E_IF && !fweh->evt_handler[code])
> return;
>
> -   if (datalen > BRCMF_DCMD_MAXLEN)
> +   if (datalen > BRCMF_DCMD_MAXLEN ||
> +   datalen + sizeof(*event_packet) < packet_len)

Shouldn't this check be larger-than, i.e. we need the packet to be at
least sizeof(*event_packet) + its payload size?
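
Concretely, the variant I have in mind would look roughly like this
(untested sketch, keeping the existing MAXLEN check):

    if (datalen > BRCMF_DCMD_MAXLEN ||
        datalen + sizeof(*event_packet) > packet_len)
        return;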

> return;
>
> if (in_interrupt())
> --
> 2.14.1.581.gf28d330327-goog
>

