On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote: > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote: > > Elena A: I CCed you in case you want to review the
Sorry, we should have included Elena already. > > +VFIO sparse mmap > > +^^^^^^^^^^^^^^^^ > > + > > ++------------------+----------------------------------+ > > +| Name | Value | > > ++==================+==================================+ > > +| id | VFIO_REGION_INFO_CAP_SPARSE_MMAP | > > ++------------------+----------------------------------+ > > +| version | 0x1 | > > ++------------------+----------------------------------+ > > +| next | <next> | > > ++------------------+----------------------------------+ > > +| sparse mmap info | VFIO region info sparse mmap | > > ++------------------+----------------------------------+ > > + > > +This capability is defined when only a subrange of the region supports > > +direct access by the client via mmap(). The VFIO sparse mmap area is > > defined in > > +``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area``). > > It's a little early to reference struct vfio_region_sparse_mmap_area > here because the parent struct vfio_region_info_cap_sparse_mmap is only > referenced below. I suggest combining the two: > > The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct > vfio_region_info_cap_sparse_mmap`` and ``struct > vfio_region_sparse_mmap_area``). Good idea. > > +Region IO FD info format > > +^^^^^^^^^^^^^^^^^^^^^^^^ > > + > > ++-------------+--------+------+ > > +| Name | Offset | Size | > > ++=============+========+======+ > > +| argsz | 16 | 4 | > > ++-------------+--------+------+ > > +| flags | 20 | 4 | > > ++-------------+--------+------+ > > +| index | 24 | 4 | > > ++-------------+--------+------+ > > +| count | 28 | 4 | > > ++-------------+--------+------+ > > +| sub-regions | 32 | ... | > > ++-------------+--------+------+ > > + > > +* *argsz* is the size of the region IO FD info structure plus the > > + total size of the sub-region array. Thus, each array entry "i" is at > > offset > > + i * ((argsz - 32) / count). Note that currently this is 40 bytes for > > both IO > > + FD types, but this is not to be relied on. > > +* *flags* must be zero > > +* *index* is the index of memory region being queried > > +* *count* is the number of sub-regions in the array > > +* *sub-regions* is the array of Sub-Region IO FD info structures > > + > > +The client must set ``flags`` to zero and specify the region being queried > > in > > +the ``index``. > > + > > +The client sets the ``argsz`` field to indicate the maximum size of the > > response > > +that the server can send, which must be at least the size of the response > > header > > +plus space for the sub-region array. If the full response size exceeds > > ``argsz``, > > +then the server must respond only with the response header and the Region > > IO FD > > +info structure, setting in ``argsz`` the buffer size required to store the > > full > > +response. In this case, no file descriptors are passed back. The client > > then > > +retries the operation with a larger receive buffer. > > + > > +The reply message will additionally include at least one file descriptor > > in the > > +ancillary data. Note that more than one sub-region may share the same file > > +descriptor. > > How does this interact with the maximum number of file descriptors, > max_fds? It is possible that there are more sub-regions than max_fds > allows... I think this would just be a matter of the client advertising a reasonably large enough size for max_msg_fds. Do we need to worry about this? > > +Each sub-region given in the response has one of two possible structures, > > +depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or > > +`VFIO_USER_IO_FD_TYPE_IOREGIONFD`: > > + > > +Sub-Region IO FD info format (ioeventfd) > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > + > > ++-----------+--------+------+ > > +| Name | Offset | Size | > > ++===========+========+======+ > > +| offset | 0 | 8 | > > ++-----------+--------+------+ > > +| size | 8 | 8 | > > ++-----------+--------+------+ > > +| fd_index | 16 | 4 | > > ++-----------+--------+------+ > > +| type | 20 | 4 | > > ++-----------+--------+------+ > > +| flags | 24 | 4 | > > ++-----------+--------+------+ > > +| padding | 28 | 4 | > > ++-----------+--------+------+ > > +| datamatch | 32 | 8 | > > ++-----------+--------+------+ > > + > > +* *offset* is the offset of the start of the sub-region within the region > > + requested ("physical address offset" for the region) > > +* *size* is the length of the sub-region. This may be zero if the access > > size is > > + not relevant, which may allow for optimizations > > +* *fd_index* is the index in the ancillary data of the FD to use for > > ioeventfd > > + notification; it may be shared. > > +* *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` > > +* *flags* is any of: > > + * `KVM_IOEVENTFD_FLAG_DATAMATCH` > > + * `KVM_IOEVENTFD_FLAG_PIO` > > The client must not trust the server, so care must be taken to validate > flags and offsets. Failure to do so would allow the server to hijack I/O > dispatch to addresses outside its regions (e.g. MMIO vs PIO or an offset > beyond the end of the region). > > It would help to mention this explicitly in the spec so that client > implementors take care. I'll add a note. > > + * `KVM_IOREGION_PIO` > > + * `KVM_IOREGION_POSTED_WRITES` > > +* *user_data* is an opaque value passed back to the server via a message > > on the > > + file descriptor > > + > > +For further information on the ioregionfd-specific fields, see: > > +https://lore.kernel.org/kvm/cover.1613828726.git.eafanas...@gmail.com/ > > + > > +(FIXME: update with final API docs.) > > I suggest postponing the ioregionfd part of the spec until the KVM code > lands. In general the approach makes sense to me though, so I think it > will be possible to merge it later with minimal changes. I think it's useful to have it now, still, at least until we hit 1.0 and can't make incompatible changes. We've already had several useful review comments from yourself and others. I agree it shouldn't be in place in any final version, though. (Ideally, we'd have an implementation as well.) > > +VFIO IRQ info format > > +^^^^^^^^^^^^^^^^^^^^ > > + > > ++-------+--------+---------------------------+ > > +| Name | Offset | Size | > > ++=======+========+===========================+ > > +| argsz | 16 | 4 | > > ++-------+--------+---------------------------+ > > +| flags | 20 | 4 | > > ++-------+--------+---------------------------+ > > +| | +-----+--------------------------+ | > > +| | | Bit | Definition | | > > +| | +=====+==========================+ | > > +| | | 0 | VFIO_IRQ_INFO_EVENTFD | | > > +| | +-----+--------------------------+ | > > +| | | 1 | VFIO_IRQ_INFO_MASKABLE | | > > +| | +-----+--------------------------+ | > > +| | | 2 | VFIO_IRQ_INFO_AUTOMASKED | | > > +| | +-----+--------------------------+ | > > +| | | 3 | VFIO_IRQ_INFO_NORESIZE | | > > +| | +-----+--------------------------+ | > > ++-------+--------+---------------------------+ > > +| index | 24 | 4 | > > ++-------+--------+---------------------------+ > > +| count | 28 | 4 | > > ++-------+--------+---------------------------+ > > + > > +* *argsz* is the size of the VFIO IRQ info structure. > > +* *flags* defines IRQ attributes: > > + > > + * *VFIO_IRQ_INFO_EVENTFD* indicates the IRQ type can support server > > eventfd > > + signalling. > > + * *VFIO_IRQ_INFO_MASKABLE* indicates that the IRQ type supports the MASK > > and > > + UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message. > > + * *VFIO_IRQ_INFO_AUTOMASKED* indicates the IRQ type masks itself after > > being > > + triggered, and the client must send an UNMASK action to receive new > > + interrupts. > > + * *VFIO_IRQ_INFO_NORESIZE* indicates VFIO_USER_SET_IRQS operations setup > > + interrupts as a set, and new sub-indexes cannot be enabled without > > disabling > > + the entire type. > > + > > +* index is the index of IRQ type being queried, it is the only field that > > is > > + required to be set in the command message. > > +* count describes the number of interrupts of the queried type. > > Is count an output-only field since the previous sentence says index is > the only field required in the command message? > > I find it confusing that the spec shows the input/output structs without > explicitly documenting that fields are input, output, or input & output. I agree. I'll take care of this. https://github.com/nutanix/libvfio-user/issues/486 > > +VFIO IRQ set format > > +^^^^^^^^^^^^^^^^^^^ > > + > > ++-------+--------+------------------------------+ > > +| Name | Offset | Size | > > ++=======+========+==============================+ > > +| argsz | 16 | 4 | > > ++-------+--------+------------------------------+ > > +| flags | 20 | 4 | > > ++-------+--------+------------------------------+ > > +| | +-----+-----------------------------+ | > > +| | | Bit | Definition | | > > +| | +=====+=============================+ | > > +| | | 0 | VFIO_IRQ_SET_DATA_NONE | | > > +| | +-----+-----------------------------+ | > > +| | | 1 | VFIO_IRQ_SET_DATA_BOOL | | > > +| | +-----+-----------------------------+ | > > +| | | 2 | VFIO_IRQ_SET_DATA_EVENTFD | | > > +| | +-----+-----------------------------+ | > > +| | | 3 | VFIO_IRQ_SET_ACTION_MASK | | > > +| | +-----+-----------------------------+ | > > +| | | 4 | VFIO_IRQ_SET_ACTION_UNMASK | | > > +| | +-----+-----------------------------+ | > > +| | | 5 | VFIO_IRQ_SET_ACTION_TRIGGER | | > > +| | +-----+-----------------------------+ | > > ++-------+--------+------------------------------+ > > +| index | 24 | 4 | > > ++-------+--------+------------------------------+ > > +| start | 28 | 4 | > > ++-------+--------+------------------------------+ > > +| count | 32 | 4 | > > ++-------+--------+------------------------------+ > > +| data | 36 | variable | > > ++-------+--------+------------------------------+ > > + > > +* *argsz* is the size of the VFIO IRQ set structure, including any *data* > > field. > > +* *flags* defines the action performed on the interrupt range. The DATA > > flags > > + describe the data field sent in the message; the ACTION flags describe > > the > > + action to be performed. The flags are mutually exclusive for both sets. > > + > > + * *VFIO_IRQ_SET_DATA_NONE* indicates there is no data field in the > > command. > > + The action is performed unconditionally. > > + * *VFIO_IRQ_SET_DATA_BOOL* indicates the data field is an array of > > boolean > > + bytes. The action is performed if the corresponding boolean is true. > > + * *VFIO_IRQ_SET_DATA_EVENTFD* indicates an array of event file > > descriptors > > + was sent in the message meta-data. These descriptors will be signalled > > when > > signalled...by the client or by the server? Either. > For example, does VFIO_IRQ_SET_ACTION_TRIGGER + > VFIO_IRQ_SET_DATA_EVENTFD provide an eventfd that the server will signal > when the device raises the interrupt? The server can trigger it, but so can the client (for whatever reason) via a combination of VFIO_IRQ_SET_ACTION_TRIGGER+VFIO_IRQ_SET_DATA_BOOL/NONE. > > On the other hand, VFIO_IRQ_SET_ACTION_MASK + VFIO_IRQ_SET_DATA_EVENTFD > seems to be the other way around. The server reads from the eventfd to > respond when the irq is masked. > > > + the action defined by the action flags occurs. In AF_UNIX sockets, the > > + descriptors are sent as SCM_RIGHTS type ancillary data. > > + If no file descriptors are provided, this de-assigns the specified > > + previously configured interrupts. > > + * *VFIO_IRQ_SET_ACTION_MASK* indicates a masking event. It can be used > > with > > + VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, > > or > > + with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest > > masks > > + the interrupt. > > + * *VFIO_IRQ_SET_ACTION_UNMASK* indicates an unmasking event. It can be > > used > > + with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an > > + interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when > > the > > + guest unmasks the interrupt. > > + * *VFIO_IRQ_SET_ACTION_TRIGGER* indicates a triggering event. It can be > > used > > + with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an > > + interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when > > the > > + server triggers the interrupt. I believe (could be wrong) we inherited these semantics (and IMO rather unfortunate naming) from vfio. > Maybe the text can be restructured to make this clear. Yes, I think it probably can, let me take a pass. > > +VFIO_USER_REGION_READ > > +--------------------- > > + > > +Message format > > +^^^^^^^^^^^^^^ > > + > > ++--------------+------------------------+ > > +| Name | Value | > > ++==============+========================+ > > +| Message ID | <ID> | > > ++--------------+------------------------+ > > +| Command | 9 | > > ++--------------+------------------------+ > > +| Message size | 32 + data size | > > ++--------------+------------------------+ > > +| Flags | Reply bit set in reply | > > ++--------------+------------------------+ > > +| Error | 0/errno | > > ++--------------+------------------------+ > > +| Read info | REGION read/write data | > > ++--------------+------------------------+ > > + > > +This command message is sent from the client to the server to read from > > server > > +memory. In the command messages, there is no data, and the count is the > > amount > > +of data to be read. The reply message must include the data read, and its > > count > > +field is the amount of data read. > > There is no data in command messages, but Message size is still 32 + > data size? It is not. Spec was unclear here, I've added some text to this and VFIO_DMA_READ. > > +VFIO_USER_VM_INTERRUPT > > +---------------------- > > + > > +Message format > > +^^^^^^^^^^^^^^ > > + > > ++----------------+------------------------+ > > +| Name | Value | > > ++================+========================+ > > +| Message ID | <ID> | > > ++----------------+------------------------+ > > +| Command | 13 | > > ++----------------+------------------------+ > > +| Message size | 20 | > > ++----------------+------------------------+ > > +| Flags | Reply bit set in reply | > > ++----------------+------------------------+ > > +| Error | 0/errno | > > ++----------------+------------------------+ > > +| Interrupt info | <interrupt> | > > ++----------------+------------------------+ > > + > > +This command message is sent from the server to the client to signal the > > device > > +has raised an interrupt. > > Except if the client set up irq eventfds? Clarified. > > +Interrupt info format > > +^^^^^^^^^^^^^^^^^^^^^ > > + > > ++-----------+--------+------+ > > +| Name | Offset | Size | > > ++===========+========+======+ > > +| Sub-index | 16 | 4 | > > ++-----------+--------+------+ > > + > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used > > in PCI > > + MSI/X type interrupts. > > Hmm...this is weird. The server tells the client to raise an MSI-X > interrupt but does not include the MSI message that resides in the MSI-X > table BAR device region? Or should MSI-X interrupts be delivered to the > client via VFIO_USER_DMA_WRITE instead? > > (Basically it's not clear to me how MSI-X interrupts would work with > vfio-user. Reading how they work in kernel VFIO might let me infer it, > but it's probably worth explaining this clearly in the spec.) It doesn't. We don't have an implementation, and the qemu patches don't get this right either - it treats the sub-index as the IRQ index AKA IRQ type. I'd be inclined to just remove this for now, until we have an implementation. Thoughts? > > +VFIO_USER_DEVICE_RESET > > +---------------------- > > + > > +Message format > > +^^^^^^^^^^^^^^ > > + > > ++--------------+------------------------+ > > +| Name | Value | > > ++==============+========================+ > > +| Message ID | <ID> | > > ++--------------+------------------------+ > > +| Command | 14 | > > ++--------------+------------------------+ > > +| Message size | 16 | > > ++--------------+------------------------+ > > +| Flags | Reply bit set in reply | > > ++--------------+------------------------+ > > +| Error | 0/errno | > > ++--------------+------------------------+ > > + > > +This command message is sent from the client to the server to reset the > > device. > > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete? > In some cases a reset involves the server communicating with other > systems or components and this can take an unbounded amount of time. > Therefore this message could hang. For example, if a vfio-user NVMe > device was accessing data on a hung NFS export and there were I/O > requests in flight that need to be aborted. I'm not sure this is something we could put in the generic spec. Perhaps a caveat? > > +VFIO_USER_DIRTY_PAGES > > +--------------------- > > + > > +Message format > > +^^^^^^^^^^^^^^ > > + > > ++--------------------+------------------------+ > > +| Name | Value | > > ++====================+========================+ > > +| Message ID | <ID> | > > ++--------------------+------------------------+ > > +| Command | 15 | > > ++--------------------+------------------------+ > > +| Message size | 16 | > > ++--------------------+------------------------+ > > +| Flags | Reply bit set in reply | > > ++--------------------+------------------------+ > > +| Error | 0/errno | > > ++--------------------+------------------------+ > > +| VFIO Dirty bitmap | <dirty bitmap> | > > ++--------------------+------------------------+ > > + > > +This command is analogous to VFIO_IOMMU_DIRTY_PAGES. It is sent by the > > client > > +to the server in order to control logging of dirty pages, usually during a > > live > > +migration. The VFIO dirty bitmap structure is defined in ``<linux/vfio.h>`` > > +(``struct vfio_iommu_type1_dirty_bitmap``). > > Do all vfio-user servers need to implement VFIO_USER_DIRTY_PAGES? It's > common for some device implementations to omit migration support because > it increases implementation complexity and is not needed in certain use > cases. Added a note that this is optional. thanks john