> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> The former seems intended to provide the server with access to the
> entire GPA space, while the latter indicates an IOVA to GPA mapping of
> those regions.  Doesn't this break the basic isolation of a vIOMMU?
> This essentially says to me "here's all the guest memory, but please
> only access these regions for which we're providing DMA mappings".
> That invites abuse.
> 

        The purpose behind separating QEMU into multiple processes is
to provide an additional layer of protection for the infrastructure against
a malign guest, not for the guest against itself, so preventing a server
that has been compromised by a guest from accessing all of guest memory
adds no additional benefit.  We don’t even have an IOMMU in our current
guest model for this reason.

        The implementation was stolen from vhost-user, with the exception
that we push IOTLB translations from client to server like VFIO does, as
opposed to pulling them from server to client like vhost-user does.

        That said, neither the qemu-mp nor MUSER implementation uses an
IOMMU, so if you prefer another IOMMU model, we can consider it.  We
could send the guest memory file descriptors only with IOMMU_MAP_DMA
requests, although that would cost performance, since each request would
require the server to execute an mmap() system call.
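
        To illustrate that cost, here is a minimal sketch of what the
server would have to do for every IOMMU_MAP_DMA request in that model;
the handler and parameter names are hypothetical, not part of the
protocol:

#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical handler: if the guest memory fd arrived only with each
 * IOMMU_MAP_DMA request, every mapping would cost the server an mmap(). */
static void *handle_iommu_map_dma(int mem_fd, uint64_t size,
                                  uint64_t fd_offset)
{
    void *va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    mem_fd, (off_t)fd_offset);

    return va == MAP_FAILED ? NULL : va;   /* caller records IOVA -> va */
}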


> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> "an array of file descriptors will be sent as part of the message
> meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> the table specified as 0?  How does a client learn their Device ID?
> 

        SCM_RIGHTS control messages allow sendmsg() to send an array of
file descriptors over a UNIX domain socket.
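
        For reference, a minimal sketch of how a client might push an
array of file descriptors alongside a message with sendmsg(); the
function name and the cap of 8 descriptors are assumptions of the
sketch, not part of the protocol:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send a message plus an array of file descriptors over a connected
 * UNIX domain socket.  Assumes nfds <= 8 to keep the control buffer
 * fixed-size. */
static ssize_t send_with_fds(int sock, const void *data, size_t len,
                             const int *fds, unsigned int nfds)
{
    union {
        char buf[CMSG_SPACE(8 * sizeof(int))];
        struct cmsghdr align;            /* ensures correct alignment */
    } ctrl;
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = CMSG_SPACE(nfds * sizeof(int)),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
    memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

    return sendmsg(sock, &msg, 0);
}

The receiving side uses recvmsg() with a matching control buffer to pull
the descriptors back out.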

        We’re only supporting one device per socket in this protocol
version, so the device ID will always be 0.  This may change in a future
revision, so we included the field in the header to avoid a major version
change if device multiplexing is added later.


> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> capability chain), the cap_offset and next pointers within the chain
> need to specify what their offset is relative to (ie. the start of the
> packet, the start of the vfio compatible data structure, etc).  I
> assume the latter for client compatibility.
> 

        Yes.  We will attempt to make the language clearer.


> Also on REGION_INFO, offset is specified as "the base offset to be
> given to the mmap() call for regions with the MMAP attribute".  Base
> offset from what?  Is the mmap performed on the socket fd?  Do we not
> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> Why do we specify "MMIO" in those operations versus simply "REGION"?
> Are we arbitrarily excluding support for I/O port regions or device
> specific regions?  If these commands replace direct read and write to
> an fd offset, how is PCI config space handled?
> 

        The base offset is what the sparse mmap areas are relative to:
each sparse area offset is added to the base region offset.  We will try
to make the text clearer here as well.

        MMIO was added to distinguish these operations from DMA operations.
I can see how this can cause confusion when the region is an I/O port range
or a device-specific region, so we can change the name to REGION_READ/WRITE.
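
        As a rough illustration of the offset arithmetic (the helper name,
and the idea that the client already holds a descriptor for the region,
are assumptions of this sketch, not something the spec defines):

#include <stdint.h>
#include <sys/mman.h>

/* Illustrative only: mmap one sparse area of an mmap-capable region.
 * 'region_fd' stands for whatever descriptor the client holds for the
 * region, and the two offsets come from the REGION_INFO reply. */
static void *map_sparse_area(int region_fd, uint64_t region_offset,
                             uint64_t sparse_offset, size_t sparse_size)
{
    /* The sparse area offset is relative to the region's base offset. */
    return mmap(NULL, sparse_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                region_fd, (off_t)(region_offset + sparse_offset));
}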


> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> will include the count specifying the amount of data read.  How does
> the client specify how much data to read?  Via message size?
> 

        This is a bug in the doc.  As you said, the count field in the
request should specify the amount of data to be read.
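
        For illustration only, the corrected exchange would carry roughly
the following; the structure and field names here are ours, not the
normative wire format:

#include <stdint.h>

/* Illustrative, not normative. */
struct region_read_request {
    uint64_t offset;    /* offset into the region */
    uint32_t region;    /* region index */
    uint32_t count;     /* number of bytes to read (no longer zero) */
};

struct region_read_reply {
    uint64_t offset;
    uint32_t region;
    uint32_t count;     /* bytes actually read */
    uint8_t  data[];    /* 'count' bytes of data follow */
};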
        

> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> should only ever have access via IOVA, which implies a DMA mapping
> exists for the device.  Can you provide an example of why we need these
> commands since there seems little point to this interface if a device
> cannot directly interact with VM memory.
> 

        It is a GPA.  The device emulation code would only handle the DMA
addresses the guest programmed it with; the server infrastructure knows
whether an IOMMU exists, and whether the DMA address needs translation to
GPA or not.
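
        A sketch of that split, with hypothetical type and helper names,
just to show where the IOVA-to-GPA translation would sit:

#include <stddef.h>
#include <stdint.h>

/* 'struct server' and the helpers below are hypothetical; the point is
 * only where the translation happens. */
struct server;
int  viommu_enabled(struct server *s);
int  iommu_translate(struct server *s, uint64_t iova, uint64_t *gpa);
int  dma_read_gpa(struct server *s, uint64_t gpa, void *buf, size_t len);

static int device_dma_read(struct server *s, uint64_t dma_addr,
                           void *buf, size_t len)
{
    uint64_t gpa = dma_addr;            /* no vIOMMU: already a GPA */

    if (viommu_enabled(s)) {
        int ret = iommu_translate(s, dma_addr, &gpa);   /* IOVA -> GPA */
        if (ret < 0)
            return ret;
    }
    /* issue VFIO_USER_DMA_READ (or touch the mmap'd memory) using the GPA */
    return dma_read_gpa(s, gpa, buf, len);
}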


> The IOMMU commands should be unnecessary, a vIOMMU should be
> transparent to the server by virtue that the device only knows about
> IOVA mappings accessible to the device.  Requiring the client to expose
> all memory to the server implies that the server must always be trusted.
> 

        The client and server are equally trusted; the guest is the untrusted
entity.


> Interrupt info format, s/type/index/, s/vector/subindex/
> 

        ok


> In addition to the unused ioctls, the entire concept of groups and
> containers are not found in this specification.  To some degree that
> makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
> to group relationship.  However, the container is very much involved in
> the development of migration support, where it's the container that
> provides dirty bitmaps.  Since we're doing map and unmap without that
> container concept here, perhaps we'd equally apply those APIs to this
> same socket.  Thanks,

        Groups and containers are host IOMMU concepts, and we don’t
interact with the host here.  The kernel VFIO driver doesn’t even need
to exist for VFIO over socket.  I think it’s fine to assume a 1-1
correspondence between containers, groups, and a VFIO over socket device.

        Thanks for looking this over.

                                                        Thanos & JJ




