Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi wrote: > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > > VDUSE (vDPA Device in Userspace) is a framework to support > > implementing software-emulated vDPA devices in userspace. This > > document is intended to clarify the VDUSE design and usage. > > > > Signed-off-by: Xie Yongji > > --- > > Documentation/userspace-api/index.rst | 1 + > > Documentation/userspace-api/vduse.rst | 222 > > ++ > > 2 files changed, 223 insertions(+) > > create mode 100644 Documentation/userspace-api/vduse.rst > > > > diff --git a/Documentation/userspace-api/index.rst > > b/Documentation/userspace-api/index.rst > > index 0b5eefed027e..c432be070f67 100644 > > --- a/Documentation/userspace-api/index.rst > > +++ b/Documentation/userspace-api/index.rst > > @@ -27,6 +27,7 @@ place where this information is gathered. > > iommu > > media/index > > sysfs-platform_profile > > + vduse > > > > .. only:: subproject and html > > > > diff --git a/Documentation/userspace-api/vduse.rst > > b/Documentation/userspace-api/vduse.rst > > new file mode 100644 > > index ..2f9cd1a4e530 > > --- /dev/null > > +++ b/Documentation/userspace-api/vduse.rst > > @@ -0,0 +1,222 @@ > > +== > > +VDUSE - "vDPA Device in Userspace" > > +== > > + > > +vDPA (virtio data path acceleration) device is a device that uses a > > +datapath which complies with the virtio specifications with vendor > > +specific control path. vDPA devices can be both physically located on > > +the hardware or emulated by software. VDUSE is a framework that makes it > > +possible to implement software-emulated vDPA devices in userspace. And > > +to make it simple, the emulated vDPA device's control path is handled in > > +the kernel and only the data path is implemented in the userspace. 
> > + > > +Note that only virtio block device is supported by VDUSE framework now, > > +which can reduce security risks when the userspace process that implements > > +the data path is run by an unprivileged user. Support for other device > > +types can be added after the security issue is clarified or fixed in the > > future. > > + > > +Start/Stop VDUSE devices > > + > > + > > +VDUSE devices are started as follows: > > + > > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > > + /dev/vduse/control. > > + > > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first > > + messages will arrive while attaching the VDUSE instance to vDPA bus. > > + > > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE > > + instance to vDPA bus. > > + > > +VDUSE devices are stopped as follows: > > + > > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE > > + instance from vDPA bus. > > + > > +2. Close the file descriptor referring to /dev/vduse/$NAME > > + > > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on > > + /dev/vduse/control > > + > > +The netlink messages mentioned above can be sent via the vdpa tool in iproute2 > > +or with the sample code below: > > + > > ..
code-block:: c > > + > > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) > > + { > > + struct nl_sock *nlsock; > > + struct nl_msg *msg; > > + int famid; > > + > > + nlsock = nl_socket_alloc(); > > + if (!nlsock) > > + return -ENOMEM; > > + > > + if (genl_connect(nlsock)) > > + goto free_sock; > > + > > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); > > + if (famid < 0) > > + goto close_sock; > > + > > + msg = nlmsg_alloc(); > > + if (!msg) > > + goto close_sock; > > + > > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, > > cmd, 0)) > > + goto nla_put_failure; > > + > > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); > > + if (cmd == VDPA_CMD_DEV_NEW) > > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, > > "vduse"); > > + > > + if (nl_send_sync(nlsock, msg)) > > + goto close_sock; > > + > > + nl_close(nlsock); > > + nl_socket_free(nlsock); > > + > > + return 0; > > + nla_put_failure: > > + nlmsg_free(msg); > > + close_sock: > > + nl_close(nlsock); > > + free_sock: > > + nl_socket_free(nlsock); > > + return -1; > > + } > > + > > +How VDUSE works > > +--- > > + > > +Since the emuldated vDPA device's control path is handled in the kernel, > > s/emuldated/emulated/ > Will fix it. > > +a message-based communication protocol and few types of control messages > > +are introduced by VDUSE framework
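The message-based protocol sketched here can be modeled from the userspace side as a read/dispatch/write loop over /dev/vduse/$NAME. The structure layouts below are abbreviated from the uapi header proposed in patch 09 of this series; the dispatch helper itself and its per-message behavior are an illustrative sketch, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Abbreviated from the proposed include/uapi/linux/vduse.h (patch 09);
 * this layout is illustrative, not the authoritative ABI. */
enum vduse_req_type {
	VDUSE_GET_VQ_STATE,
	VDUSE_START_DATAPLANE,
	VDUSE_STOP_DATAPLANE,
	VDUSE_UPDATE_IOTLB,
};

#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0)
#define VDUSE_REQ_RESULT_OK     0x00
#define VDUSE_REQ_RESULT_FAILED 0x01

struct vduse_vq_state {
	uint32_t index;     /* virtqueue index */
	uint32_t avail_idx; /* last_avail_idx */
};

struct vduse_dev_request {
	uint32_t type;
	uint32_t request_id;
	uint32_t flags;
	uint32_t reserved;
	union {
		struct vduse_vq_state vq_state;
		uint32_t padding[16];
	};
};

struct vduse_dev_response {
	uint32_t request_id;
	uint32_t result;
	uint32_t reserved[2];
	union {
		struct vduse_vq_state vq_state;
		uint32_t padding[16];
	};
};

/* Dispatch one control message. Returns 1 if a reply must be written
 * back to /dev/vduse/$NAME, 0 for no-reply messages. */
static int handle_request(const struct vduse_dev_request *req,
			  struct vduse_dev_response *resp)
{
	memset(resp, 0, sizeof(*resp));
	resp->request_id = req->request_id;
	resp->result = VDUSE_REQ_RESULT_OK;

	switch (req->type) {
	case VDUSE_GET_VQ_STATE:
		/* Report the device's saved last_avail_idx for this vq. */
		resp->vq_state.index = req->vq_state.index;
		resp->vq_state.avail_idx = 0; /* freshly created device */
		break;
	case VDUSE_START_DATAPLANE:
	case VDUSE_STOP_DATAPLANE:
		/* Start/stop the datapath threads here. */
		break;
	case VDUSE_UPDATE_IOTLB:
		/* Invalidate cached mappings for the given IOVA range. */
		break;
	default:
		resp->result = VDUSE_REQ_RESULT_FAILED;
		break;
	}
	return !(req->flags & VDUSE_REQ_FLAGS_NO_REPLY);
}
```

A real device would wrap this in a loop: read() a request from the /dev/vduse/$NAME fd, call the handler, and write() the response back whenever one is required.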
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On 2021/6/28 6:32 PM, Yongji Xie wrote: The large barrier is bounce-buffer mapping: SPDK requires hugepages for NVMe over PCIe and RDMA, so taking some preallocated hugepages to map as a bounce buffer is necessary. Otherwise it's hard to avoid an extra memcpy from bounce-buffer to hugepage. If you can add an option to map hugepages as bounce-buffer, then SPDK could also be a potential user of vduse. I think we can support registering user space memory for bounce-buffer use like XDP does. But this needs to pin the pages, so I didn't consider it in this initial version. Note that userspace should be unaware of the existence of the bounce buffer. So we need to think carefully of mmap() vs umem registering. Thanks ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On 2021/6/28 1:54 PM, Liu, Xiaodong wrote: Several issues: - VDUSE needs to limit the total size of the bounce buffers (64M if I was not wrong). Does it work for SPDK? Yes, Jason. It is enough and works for SPDK. Since it's a bounce buffer mainly for in-flight IO, a limited size like 64MB is enough. Ok. - VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we need to introduce new flags for supporting this) Same with your worry, I'm also afraid that it is hard for a kernel module to directly preallocate hugepages internally. What I tried is that: 1. A simple agent daemon (represents one device) `preallocates` and maps dozens of 2MB hugepages (like 64MB) for one device. 2. The daemon passes its mapping addr and hugepage fd to the kernel module through a created IOCTL. 3. The kernel module remaps the hugepages inside the kernel. Such a model should work, but the main "issue" is that it introduces overheads in the case of vhost-vDPA. Note that in the case of vhost-vDPA, we don't use a bounce buffer, the userspace pages are shared directly. And since DMA is not done per page, it prevents us from using tricks like vm_insert_page() in those cases. 4. The vhost-user target gets and maps the hugepage fd from the kernel module in a vhost-user msg through Unix Domain Socket cmsg. Then the kernel module and the target map the same hugepage-based bounce buffer for in-flight IO. If there is one option in VDUSE to map userspace preallocated memory, then VDUSE should be able to mandate it even if it is hugepage-based. As above, this requires some kind of re-design since VDUSE depends on the model of mmap(MAP_SHARED) instead of umem registering. Thanks
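The sharing model in steps 1-4 hinges on two parties mapping the same fd MAP_SHARED. A userspace-only sketch of that model follows, using a plain memfd as a stand-in for the hugepage fd (MFD_HUGETLB is what the SPDK case would use, but it requires reserved hugepages, so here it is only noted in a comment):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a shared buffer and map it twice, standing in for the daemon's
 * mapping and the vhost-user target's mapping of the same fd. A real
 * hugepage bounce buffer would pass MFD_HUGETLB (and need preallocated
 * hugepages); a plain memfd keeps the sketch runnable anywhere. */
static void *map_twice(size_t size, void **second)
{
	int fd = memfd_create("bounce", 0); /* 0 -> MFD_HUGETLB for hugepages */
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return NULL;
	}
	void *a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	void *b = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mappings keep the memory alive */
	if (a == MAP_FAILED || b == MAP_FAILED)
		return NULL;
	*second = b;
	return a;
}
```

In the model above, either mapping sees writes made through the other, which is exactly the property the daemon/kernel/target trio relies on for zero-copy access to in-flight IO buffers.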
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/6/29 11:56 AM, Yongji Xie wrote: On Tue, Jun 29, 2021 at 11:29 AM Jason Wang wrote: On 2021/6/29 10:26 AM, Yongji Xie wrote: On Mon, Jun 28, 2021 at 12:40 PM Jason Wang wrote: On 2021/6/25 12:19 PM, Yongji Xie wrote: 2b) for set_status(): simply relay the message to userspace, no reply is needed. Userspace will use a command to update the status when the datapath is stopped. Then the status could be fetched via get_status(). 2b looks more spec compliant. Looks good to me. And I think we can use the reply of the message to update the status instead of introducing a new command. Just notice this part in virtio_finalize_features(): virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); status = dev->config->get_status(dev); if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { So no reply doesn't work for FEATURES_OK. So my understanding is: 1) We must not use noreply for set_status() 2) We can use noreply for get_status(), but it requires a new ioctl to update the status. So it looks to me we need to synchronize for both get_status() and set_status(). We should not send messages to userspace in the FEATURES_OK case. So the synchronization is not necessary. As discussed previously, there could be a device that mandates some features (VIRTIO_F_RING_PACKED). So it can choose to not accept FEATURES_OK if packed virtqueue is not negotiated. In this case we need to relay the message to userspace. OK, I see. If so, I prefer to only use noreply for set_status(). We do not set the status bit if the message fails. In this way, we don't need to change lots of virtio core code to handle the failure of set_status()/get_status(). It should work. Thanks Thanks, Yongji
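The scheme settled on above (relay set_status() to userspace, and only commit the status bit when the relay succeeds) can be modeled in miniature. Everything below is a toy illustration, with the relay callback standing in for the VDUSE control message; it shows why the FEATURES_OK readback quoted from virtio_finalize_features() observes a rejected status write without extra synchronization:

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_FEATURES_OK 8

/* Toy device model: set_status() relays the new status to "userspace";
 * the cached status is only updated when the relay succeeds, so a
 * subsequent get_status() (served from the cache, no message needed)
 * reports the rejection to the virtio core. */
struct toy_dev {
	uint8_t status;
	int (*relay)(uint8_t status); /* 0 on success, <0 on failure */
};

static void toy_set_status(struct toy_dev *dev, uint8_t status)
{
	if (dev->relay(status) == 0)
		dev->status = status;
	/* on failure, dev->status keeps its old value */
}

static uint8_t toy_get_status(struct toy_dev *dev)
{
	return dev->status;
}

/* Two stand-in userspace behaviors: accept everything, or refuse
 * FEATURES_OK (e.g. a device mandating VIRTIO_F_RING_PACKED). */
static int accept_all(uint8_t s) { (void)s; return 0; }
static int reject_features_ok(uint8_t s)
{
	return (s & VIRTIO_CONFIG_S_FEATURES_OK) ? -1 : 0;
}
```

With this shape, the virtio core's existing `get_status()` check after setting FEATURES_OK is the only failure path needed, matching the "do not set the status bit if the message fails" conclusion.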
Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On Tue, Jun 29, 2021 at 11:29 AM Jason Wang wrote: > > > On 2021/6/29 10:26 AM, Yongji Xie wrote: > > On Mon, Jun 28, 2021 at 12:40 PM Jason Wang wrote: > >> > >> On 2021/6/25 12:19 PM, Yongji Xie wrote: > 2b) for set_status(): simply relay the message to userspace, no reply is > needed. Userspace will use a command to update the status when the > datapath is stopped. Then the status could be fetched via get_status(). > > 2b looks more spec compliant. > > >>> Looks good to me. And I think we can use the reply of the message to > >>> update the status instead of introducing a new command. > >>> > >> Just notice this part in virtio_finalize_features(): > >> > >> virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); > >> status = dev->config->get_status(dev); > >> if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { > >> > >> So no reply doesn't work for FEATURES_OK. > >> > >> So my understanding is: > >> > >> 1) We must not use noreply for set_status() > >> 2) We can use noreply for get_status(), but it requires a new ioctl to > >> update the status. > >> > >> So it looks to me we need to synchronize for both get_status() and > >> set_status(). > >> > > We should not send messages to userspace in the FEATURES_OK case. So > > the synchronization is not necessary. > > > As discussed previously, there could be a device that mandates some > features (VIRTIO_F_RING_PACKED). So it can choose to not accept > FEATURES_OK if packed virtqueue is not negotiated. > > In this case we need to relay the message to userspace. > OK, I see. If so, I prefer to only use noreply for set_status(). We do not set the status bit if the message fails. In this way, we don't need to change lots of virtio core code to handle the failure of set_status()/get_status(). Thanks, Yongji
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2021/6/29 10:26 AM, Yongji Xie wrote: On Mon, Jun 28, 2021 at 12:40 PM Jason Wang wrote: On 2021/6/25 12:19 PM, Yongji Xie wrote: 2b) for set_status(): simply relay the message to userspace, no reply is needed. Userspace will use a command to update the status when the datapath is stopped. Then the status could be fetched via get_status(). 2b looks more spec compliant. Looks good to me. And I think we can use the reply of the message to update the status instead of introducing a new command. Just notice this part in virtio_finalize_features(): virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); status = dev->config->get_status(dev); if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { So no reply doesn't work for FEATURES_OK. So my understanding is: 1) We must not use noreply for set_status() 2) We can use noreply for get_status(), but it requires a new ioctl to update the status. So it looks to me we need to synchronize for both get_status() and set_status(). We should not send messages to userspace in the FEATURES_OK case. So the synchronization is not necessary. As discussed previously, there could be a device that mandates some features (VIRTIO_F_RING_PACKED). So it can choose to not accept FEATURES_OK if packed virtqueue is not negotiated. In this case we need to relay the message to userspace. Thanks Thanks, Yongji
Re: Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi wrote: > > On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote: > > This series introduces a framework that makes it possible to implement > > software-emulated vDPA devices in userspace. And to make it simple, the > > emulated vDPA device's control path is handled in the kernel and only the > > data path is implemented in the userspace. > > This looks interesting. Unfortunately I don't have enough time to do a > full review, but I looked at the documentation and uapi header file to > give feedback on the userspace ABI. > OK. Thanks for your comments. It's helpful! Thanks, Yongji
Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi wrote: > > On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote: > > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h > > new file mode 100644 > > index ..f21b2e51b5c8 > > --- /dev/null > > +++ b/include/uapi/linux/vduse.h > > @@ -0,0 +1,143 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +#ifndef _UAPI_VDUSE_H_ > > +#define _UAPI_VDUSE_H_ > > + > > +#include <linux/types.h> > > + > > +#define VDUSE_API_VERSION 0 > > + > > +#define VDUSE_NAME_MAX 256 > > + > > +/* the control messages definition for read/write */ > > + > > +enum vduse_req_type { > > + /* Get the state for virtqueue from userspace */ > > + VDUSE_GET_VQ_STATE, > > + /* Notify userspace to start the dataplane, no reply */ > > + VDUSE_START_DATAPLANE, > > + /* Notify userspace to stop the dataplane, no reply */ > > + VDUSE_STOP_DATAPLANE, > > + /* Notify userspace to update the memory mapping in device IOTLB */ > > + VDUSE_UPDATE_IOTLB, > > +}; > > + > > +struct vduse_vq_state { > > + __u32 index; /* virtqueue index */ > > + __u32 avail_idx; /* virtqueue state (last_avail_idx) */ > > +}; > > + > > +struct vduse_iova_range { > > + __u64 start; /* start of the IOVA range */ > > + __u64 last; /* end of the IOVA range */ > > Please clarify whether this describes a closed range [start, last] or an > open range [start, last). > OK. 
> > +}; > > + > > +struct vduse_dev_request { > > + __u32 type; /* request type */ > > + __u32 request_id; /* request id */ > > +#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0) /* No need to reply */ > > + __u32 flags; /* request flags */ > > + __u32 reserved; /* for future use */ > > + union { > > + struct vduse_vq_state vq_state; /* virtqueue state */ > > + struct vduse_iova_range iova; /* iova range for updating */ > > + __u32 padding[16]; /* padding */ > > + }; > > +}; > > + > > +struct vduse_dev_response { > > + __u32 request_id; /* corresponding request id */ > > +#define VDUSE_REQ_RESULT_OK 0x00 > > +#define VDUSE_REQ_RESULT_FAILED 0x01 > > + __u32 result; /* the result of request */ > > + __u32 reserved[2]; /* for future use */ > > + union { > > + struct vduse_vq_state vq_state; /* virtqueue state */ > > + __u32 padding[16]; /* padding */ > > + }; > > +}; > > + > > +/* ioctls */ > > + > > +struct vduse_dev_config { > > + char name[VDUSE_NAME_MAX]; /* vduse device name */ > > + __u32 vendor_id; /* virtio vendor id */ > > + __u32 device_id; /* virtio device id */ > > + __u64 features; /* device features */ > > + __u64 bounce_size; /* bounce buffer size for iommu */ > > + __u16 vq_size_max; /* the max size of virtqueue */ > > The VIRTIO specification allows per-virtqueue sizes. A device can have > two virtqueues, where the first one allows up to 1024 descriptors and > the second one allows only 128 descriptors, for example. > Good point! But it looks like virtio-vdpa/virtio-pci doesn't support that now. All virtqueues have the same maximum size. > This constant seems to impose the constraint that all virtqueues have > the same maximum size. Is this really necessary? > This will be used by vring_create_virtqueue(). We need to specify the maximum queue size supported by the device. 
> > + __u16 padding; /* padding */ > > + __u32 vq_num; /* the number of virtqueues */ > > + __u32 vq_align; /* the allocation alignment of virtqueue's metadata */ > > I'm not sure what this is? > This will be used by vring_create_virtqueue() too. > > + __u32 config_size; /* the size of the configuration space */ > > + __u32 reserved[15]; /* for future use */ > > + __u8 config[0]; /* the buffer of the configuration space */ > > +}; > > + > > +struct vduse_iotlb_entry { > > + __u64 offset; /* the mmap offset on fd */ > > + __u64 start; /* start of the IOVA range */ > > + __u64 last; /* last of the IOVA range */ > > Same here, please specify whether this is an open range or a closed > range. > Sure. > > +#define VDUSE_ACCESS_RO 0x1 > > +#define VDUSE_ACCESS_WO 0x2 > > +#define VDUSE_ACCESS_RW 0x3 > > + __u8 perm; /* access permission of this range */ > > +}; > > + > > +struct vduse_config_update { > > + __u32 offset; /* offset from the beginning of configuration space */ > > + __u32 length; /* the length to write to configuration space */ > > + __u8 buffer[0]; /* buffer used to write from */ > > +}; > > + > > +struct vduse_vq_info { > > + __u32 index; /* virtqueue index */ > > + __u32 avail_idx; /* virtqueue state (last_avail_idx) */ > > + __u64 desc_addr; /* address of desc area */ > > + __u64 driver_addr; /* address of driver area */ > > + __u64 device_addr; /* address of device area */ > > + __u32 num; /* the size of virtqueue */ > > + __u8 ready; /* ready status of
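On the closed-vs-open range question raised above: a closed interval [start, last] (both bounds inclusive), as vduse_iova_range and vduse_iotlb_entry use, is what lets a pair of __u64 fields describe the entire 64-bit IOVA space; a half-open [start, end) cannot, since end would have to be 2^64. A small self-contained sketch of the trade-off (not code from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Closed interval [start, last], both inclusive, mirroring the uapi's
 * start/last fields. The price of the inclusive convention is that the
 * byte count last - start + 1 can overflow a u64 on the full range, so
 * helpers should work with last - start instead. */
struct iova_range {
	uint64_t start;
	uint64_t last; /* inclusive */
};

static int range_contains(const struct iova_range *r, uint64_t iova)
{
	return iova >= r->start && iova <= r->last;
}

/* Span minus one byte: representable even for the full 64-bit space. */
static uint64_t range_span_minus_one(const struct iova_range *r)
{
	return r->last - r->start;
}
```

With this convention the full space is simply { .start = 0, .last = UINT64_MAX }, which is presumably why the header says "last" rather than "end".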
Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On Mon, Jun 28, 2021 at 12:40 PM Jason Wang wrote: > > > On 2021/6/25 12:19 PM, Yongji Xie wrote: > >> 2b) for set_status(): simply relay the message to userspace, no reply is > >> needed. Userspace will use a command to update the status when the > >> datapath is stopped. Then the status could be fetched via get_status(). > >> > >> 2b looks more spec compliant. > >> > > Looks good to me. And I think we can use the reply of the message to > > update the status instead of introducing a new command. > > > > Just notice this part in virtio_finalize_features(): > > virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK); > status = dev->config->get_status(dev); > if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) { > > So no reply doesn't work for FEATURES_OK. > > So my understanding is: > > 1) We must not use noreply for set_status() > 2) We can use noreply for get_status(), but it requires a new ioctl to > update the status. > > So it looks to me we need to synchronize for both get_status() and > set_status(). > We should not send messages to userspace in the FEATURES_OK case. So the synchronization is not necessary. Thanks, Yongji
RE: Plan for /dev/ioasid RFC v2
> From: Alex Williamson > Sent: Tuesday, June 29, 2021 6:32 AM > > On Mon, 28 Jun 2021 01:09:18 + > "Tian, Kevin" wrote: > > > > From: Jason Gunthorpe > > > Sent: Friday, June 25, 2021 10:36 PM > > > > > > On Fri, Jun 25, 2021 at 10:27:18AM +, Tian, Kevin wrote: > > > > > > > - When receiving the binding call for the 1st device in a group, > iommu_fd > > > > calls iommu_group_set_block_dma(group, dev->driver) which does > > > > several things: > > > > > > The whole problem here is trying to match this new world where we want > > > devices to be in charge of their own IOMMU configuration and the old > > > world where groups are in charge. > > > > > > Inserting the group fd and then calling a device-centric > > > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't > > > necessary. > > > > No, this was not what I meant. There is no group fd required when > > calling this device-centric interface. I was actually talking about: > > > > iommu_group_set_block_dma(dev->group, dev->driver) > > > > just because current iommu layer API is group-centric. Whether this > > should be improved could be a next-level thing. Sorry for not making > > it clear in the first place. > > > > > We can always get the group back from the device at any > > > point in the sequence to do a group wide operation. > > > > yes. > > > > > > > > What I saw as the appeal of the sort of idea was to just completely > > > leave all the difficult multi-device-group scenarios behind on the old > > > group centric API and then we don't have to deal with them at all, or at > > > least not right away. > > > > yes, this is the staged approach that we discussed earlier. and > > the reason why I refined this proposal about multi-devices group > > here is because you want to see some confidence along this > > direction. Thus I expanded your idea and hope to achieve consensus > > with Alex/Joerg who obviously have not been convinced yet. 
> > > > > > > > I'd see some progression where iommu_fd only works with 1:1 groups at > > > the start. Other scenarios continue with the old API. > > > > One uAPI open after completing this new sketch. v1 proposed to > > conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired. > > With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW > > to complete both in one step. I want to get Alex's confirmation whether > > it sounds good to him, since it's better to unify the uAPI between 1:1 > > group and 1:N group even if we don't support 1:N in the start. > > I don't like it. It doesn't make sense to me. You have the > group-centric world, which must continue to exist and cannot change > because we cannot break the vfio uapi. We can make extensions, we can > define a new parallel uapi, we can deprecate the uapi, but in the short > term, it can't change. > > AIUI, the new device-centric model starts with vfio device files that > can be opened directly. So what then is the purpose of a *GROUP* get > device fd? Why is a vfio uapi involved in setting a device cookie for > another subsystem? > > I'd expect that /dev/iommu will be used by multiple subsystems. All > will want to bind devices to address spaces, so shouldn't binding a > device to an iommufd be an ioctl on the iommufd, ie. > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there > and > the iommufd code can figure it out internally. > > You're essentially trying to reduce vfio to the device interface. That > necessarily implies that ioctls on the container, group, or passed > through the container to the iommu no longer exist. From my > perspective, there should ideally be no new vfio ioctls. The user gets > a limited access vfio device fd and enables full access to the device > by registering it to the iommufd subsystem (100% this needs to be > enforced until close() to avoid revoke issues). 
The user interacts > exclusively with vfio via the device fd and performs all DMA address > space related operations through the iommufd. > > > > Then maybe groups where all devices use the same IOASID. > > > > > > Then 1:N groups if the source device is reliably identifiable, this > > > requires iommu subystem work to attach domains to sub-group objects - > > > not sure it is worthwhile. > > > > > > But at least we can talk about each step with well thought out patches > > > > > > The only thing that needs to be done to get the 1:1 step is to broadly > > > define how the other two cases will work so we don't get into trouble > > > and set some way to exclude the problematic cases from even getting to > > > iommu_fd in the first place. > > > > > > For instance if we go ahead and create /dev/vfio/device nodes we could > > > do this only if the group was 1:1, otherwise the group cdev has to be > > > used, along with its API. > > > > I feel for VFIO possibly we don't need significant change to its uAPI > > sequence, since it anyway needs to support existing semantics for > > backward compatibility. With
RE: Plan for /dev/ioasid RFC v2
> From: Alex Williamson > Sent: Tuesday, June 29, 2021 7:09 AM > > > > Ideally vfio would also at least be able to register a type1 IOMMU > > > backend through the existing uapi, backed by this iommu code, ie. we'd > > > create a new "iommufd" (but without the user visible fd), > > > > It would be amazing to be able to achieve this, at least for me there > > are too many details to be able to tell what that would look like > > exactly. I suggested once that putting the container ioctl interface > > in the drivers/iommu code may allow for this without too much trouble.. > > If we can't achieve this, then type1 legacy code is going to need to > live through an extended deprecation period. I'm hoping that between > type1 and a native interface we'll have two paths into iommufd to vet > the design. Thanks, > I prefer keeping the type1 legacy code for some time. After the new device-centric flow works, we can then study how to make type1 a shim layer atop. Thanks Kevin
RE: Plan for /dev/ioasid RFC v2
> From: Jason Gunthorpe > Sent: Tuesday, June 29, 2021 7:13 AM > > On Mon, Jun 28, 2021 at 05:09:02PM -0600, Alex Williamson wrote: > > On Mon, 28 Jun 2021 19:48:18 -0300 > > Jason Gunthorpe wrote: > > > > > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote: > > > > > > > I'd expect that /dev/iommu will be used by multiple subsystems. All > > > > will want to bind devices to address spaces, so shouldn't binding a > > > > device to an iommufd be an ioctl on the iommufd, ie. > > > > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in > there and > > > > the iommufd code can figure it out internally. > > > > > > It wants to be the other way around because iommu_fd is the lower > > > level subsystem. We don't/can't teach iommu_fd how to convert a fd > > > number to a vfio/vdpa/etc/etc, we teach all the things building on > > > iommu_fd how to change a fd number to an iommu - they already > > > necessarily have an inter-module linkage. > > > > These seem like peer subsystems, like vfio and kvm. vfio shouldn't > > have any hard dependencies on the iommufd module, especially so long as > > we have the legacy type1 code. > > It does, the vfio_device implementation has to tell the iommu subsystem > what kind of device behavior it has and possibly interact with the > iommu subsystem with it in cases like PASID. This was outlined in part > of the RFC. Right. PASID is managed by the specific device driver in this RFC and provided as routing information to iommu_fd when the device is attached to an IOASID. Another point is about PASID virtualization (vPASID->pPASID), which is established by having the user register its vPASID when doing the attach call. The vfio device driver needs to use this information in the mediation path. In concept vPASID is not relevant to iommu_fd, which only cares about pPASID. Having vPASID registered via the iommu_fd uAPI and then indirectly communicated to the vfio device driver does not look like a clean way in the first place. 
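The vPASID->pPASID translation described here could look roughly like the table below: the guest-visible vPASID is registered at attach time, and the mediating device driver translates it on each mediated submission. This is purely a hypothetical sketch to illustrate the mediation-path lookup; none of the names or structures come from the RFC:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device translation table. A real implementation
 * would size and index this differently (e.g. an xarray keyed by
 * vPASID); a fixed array keeps the sketch small. */
#define MAX_PASIDS 16
#define PASID_INVALID ((uint32_t)-1)

struct pasid_table {
	uint32_t vpasid[MAX_PASIDS];
	uint32_t ppasid[MAX_PASIDS];
	int n;
};

/* Called when userspace attaches the device to an IOASID and registers
 * the vPASID the guest will use. */
static int pasid_register(struct pasid_table *t, uint32_t v, uint32_t p)
{
	if (t->n >= MAX_PASIDS)
		return -1;
	t->vpasid[t->n] = v;
	t->ppasid[t->n] = p;
	t->n++;
	return 0;
}

/* Called in the mediation path on each guest work submission: replace
 * the guest's vPASID with the physical PASID before hitting hardware. */
static uint32_t pasid_translate(const struct pasid_table *t, uint32_t v)
{
	for (int i = 0; i < t->n; i++)
		if (t->vpasid[i] == v)
			return t->ppasid[i];
	return PASID_INVALID;
}
```

The open question in the thread is where this table should live: registering vPASIDs through the iommu_fd uAPI would put it out of the vfio driver's easy reach, which is the "not a clean way" concern above.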
Thanks Kevin
Re: Plan for /dev/ioasid RFC v2
On Mon, Jun 28, 2021 at 05:09:02PM -0600, Alex Williamson wrote: > On Mon, 28 Jun 2021 19:48:18 -0300 > Jason Gunthorpe wrote: > > > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote: > > > > > I'd expect that /dev/iommu will be used by multiple subsystems. All > > > will want to bind devices to address spaces, so shouldn't binding a > > > device to an iommufd be an ioctl on the iommufd, ie. > > > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and > > > the iommufd code can figure it out internally. > > > > It wants to be the other way around because iommu_fd is the lower > > level subsystem. We don't/can't teach iommu_fd how to convert a fd > > number to a vfio/vdpa/etc/etc, we teach all the things building on > > iommu_fd how to change a fd number to an iommu - they already > > necessarily have an inter-module linkage. > > These seem like peer subsystems, like vfio and kvm. vfio shouldn't > have any hard dependencies on the iommufd module, especially so long as > we have the legacy type1 code. It does, the vfio_device implementation has to tell the iommu subsystem what kind of device behavior it has and possibly interact with the iommu subsystem with it in cases like PASID. This was outlined in part of the RFC. In any event a module dependency from vfio to iommu is not bothersome, while the other way certainly is. > Likewise iommufd shouldn't have any on vfio. As much as you > dislike the symbol_get hack of the kvm-vfio device, it would be > reasonable for iommufd to reach for a vfio symbol when an > IOMMU_BIND_VFIO_DEVICE_FD ioctl is called. We'd have to add a special ioctl to iommu for every new subsystem, it doesn't scale. iommu is a core subsystem, vfio is a driver subsystem. The direction of dependency is clear, I think. Jason
Re: Plan for /dev/ioasid RFC v2
On Mon, 28 Jun 2021 19:48:18 -0300 Jason Gunthorpe wrote: > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote: > > > I'd expect that /dev/iommu will be used by multiple subsystems. All > > will want to bind devices to address spaces, so shouldn't binding a > > device to an iommufd be an ioctl on the iommufd, ie. > > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and > > the iommufd code can figure it out internally. > > It wants to be the other way around because iommu_fd is the lower > level subsystem. We don't/can't teach iommu_fd how to convert a fd > number to a vfio/vdpa/etc/etc, we teach all the things building on > iommu_fd how to change a fd number to an iommu - they already > necessarily have an inter-module linkage. These seem like peer subsystems, like vfio and kvm. vfio shouldn't have any hard dependencies on the iommufd module, especially so long as we have the legacy type1 code. Likewise iommufd shouldn't have any on vfio. As much as you dislike the symbol_get hack of the kvm-vfio device, it would be reasonable for iommufd to reach for a vfio symbol when an IOMMU_BIND_VFIO_DEVICE_FD ioctl is called. > There is a certain niceness to what you are saying but it is not so > practical without doing something bigger.. > > > Ideally vfio would also at least be able to register a type1 IOMMU > > backend through the existing uapi, backed by this iommu code, ie. we'd > > create a new "iommufd" (but without the user visible fd), > > It would be amazing to be able to achieve this, at least for me there > are too many details to be able to tell what that would look like > exactly. I suggested once that putting the container ioctl interface > in the drivers/iommu code may allow for this without too much trouble.. If we can't achieve this, then type1 legacy code is going to need to live through an extended deprecation period. I'm hoping that between type1 and a native interface we'll have two paths into iommufd to vet the design. 
Thanks, Alex
Re: Plan for /dev/ioasid RFC v2
On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote: > I'd expect that /dev/iommu will be used by multiple subsystems. All > will want to bind devices to address spaces, so shouldn't binding a > device to an iommufd be an ioctl on the iommufd, ie. > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and > the iommufd code can figure it out internally. It wants to be the other way around because iommu_fd is the lower level subsystem. We don't/can't teach iommu_fd how to convert a fd number to a vfio/vdpa/etc/etc, we teach all the things building on iommu_fd how to change a fd number to an iommu - they already necessarily have an inter-module linkage. There is a certain niceness to what you are saying but it is not so practical without doing something bigger.. > Ideally vfio would also at least be able to register a type1 IOMMU > backend through the existing uapi, backed by this iommu code, ie. we'd > create a new "iommufd" (but without the user visible fd), It would be amazing to be able to achieve this, at least for me there are too many details to be able to tell what that would look like exactly. I suggested once that putting the container ioctl interface in the drivers/iommu code may allow for this without too much trouble.. Jason
Re: Plan for /dev/ioasid RFC v2
On Mon, 28 Jun 2021 01:09:18 + "Tian, Kevin" wrote: > > From: Jason Gunthorpe > > Sent: Friday, June 25, 2021 10:36 PM > > > > On Fri, Jun 25, 2021 at 10:27:18AM +, Tian, Kevin wrote: > > > > > - When receiving the binding call for the 1st device in a group, > > > iommu_fd > > > calls iommu_group_set_block_dma(group, dev->driver) which does > > > several things: > > > > The whole problem here is trying to match this new world where we want > > devices to be in charge of their own IOMMU configuration and the old > > world where groups are in charge. > > > > Inserting the group fd and then calling a device-centric > > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't > > necessary. > > No, this was not what I meant. There is no group fd required when > calling this device-centric interface. I was actually talking about: > > iommu_group_set_block_dma(dev->group, dev->driver) > > just because current iommu layer API is group-centric. Whether this > should be improved could be next-level thing. Sorry for not making > it clear in the first place. > > > We can always get the group back from the device at any > > point in the sequence do to a group wide operation. > > yes. > > > > > What I saw as the appeal of the sort of idea was to just completely > > leave all the difficult multi-device-group scenarios behind on the old > > group centric API and then we don't have to deal with them at all, or > > least not right away. > > yes, this is the staged approach that we discussed earlier. and > the reason why I refined this proposal about multi-devices group > here is because you want to see some confidence along this > direction. Thus I expanded your idea and hope to achieve consensus > with Alex/Joerg who obviously have not been convinced yet. > > > > > I'd see some progression where iommu_fd only works with 1:1 groups at > > the start. Other scenarios continue with the old API. > > One uAPI open after completing this new sketch. 
v1 proposed to > conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired. > With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW > to complete both in one step. I want to get Alex's confirmation whether > it sounds good to him, since it's better to unify the uAPI between 1:1 > group and 1:N group even if we don't support 1:N in the start. I don't like it. It doesn't make sense to me. You have the group-centric world, which must continue to exist and cannot change because we cannot break the vfio uapi. We can make extensions, we can define a new parallel uapi, we can deprecate the uapi, but in the short term, it can't change. AIUI, the new device-centric model starts with vfio device files that can be opened directly. So what then is the purpose of a *GROUP* get device fd? Why is a vfio uapi involved in setting a device cookie for another subsystem? I'd expect that /dev/iommu will be used by multiple subsystems. All will want to bind devices to address spaces, so shouldn't binding a device to an iommufd be an ioctl on the iommufd, ie. IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and the iommufd code can figure it out internally. You're essentially trying to reduce vfio to the device interface. That necessarily implies that ioctls on the container, group, or passed through the container to the iommu no longer exist. From my perspective, there should ideally be no new vfio ioctls. The user gets a limited access vfio device fd and enables full access to the device by registering it to the iommufd subsystem (100% this needs to be enforced until close() to avoid revoke issues). The user interacts exclusively with vfio via the device fd and performs all DMA address space related operations through the iommufd. > > Then maybe groups where all devices use the same IOASID. 
> > > > Then 1:N groups if the source device is reliably identifiable, this > > requires iommu subystem work to attach domains to sub-group objects - > > not sure it is worthwhile. > > > > But at least we can talk about each step with well thought out patches > > > > The only thing that needs to be done to get the 1:1 step is to broadly > > define how the other two cases will work so we don't get into trouble > > and set some way to exclude the problematic cases from even getting to > > iommu_fd in the first place. > > > > For instance if we go ahead and create /dev/vfio/device nodes we could > > do this only if the group was 1:1, otherwise the group cdev has to be > > used, along with its API. > > I feel for VFIO possibly we don't need significant change to its uAPI > sequence, since it anyway needs to support existing semantics for > backward compatibility. With this sketch we can keep vfio container/ > group by introducing an external iommu type which implies a different > GET_DEVICE_FD semantics. /dev/iommu can report a fd-wide capability > for whether 1:N group is supported to vfio user.
Re: Plan for /dev/ioasid RFC v2
On Mon, Jun 28, 2021 at 06:45:23AM +, Tian, Kevin wrote: > 7) Unbinding detaches the device from the block_dma domain and >re-attach it to the default domain. From now on the user should >be denied from accessing the device. vfio should tear down the >MMIO mapping at this point. I think we should just forbid this: so long as the device_fd is open, the iommu_fd cannot be destroyed and there is no way to detach a device other than closing its fd. Revoke is tricky enough to implement that we should avoid it. > It's still an open whether we want to further allow devices within a group > attached to different IOASIDs in case that the source devices are reliably > identifiable. This is an usage not supported by existing vfio and might be > not worthwhile due to improved isolation over time. The main decision here is to decide if the uAPI should have some way to indicate that a device does not have its own unique IOASID but is sharing with the group. Jason
Re: Plan for /dev/ioasid RFC v2
On Mon, Jun 28, 2021 at 02:03:56AM +, Tian, Kevin wrote: > Combining with the last paragraph above, are you actually suggesting > that 1:1 group (including mdev) should use a new device-centric vfio > uAPI (without group fd) while existing group-centric vfio uAPI is only > kept for 1:N group (with slight semantics change in my sketch to match > device-centric iommu fd API)? Yes, this is one approach. Using a VFIO_GROUP_GET_DEVICE_FD_NEW on the group FD is another option, but locks us into having the group FD. Which is better possibly depends on some details when going through the code transformation, though I prefer not to design assuming the group FD must exist. > (not via an indirect group ioctl). Then it implies that we may have to allow > the user open a device before it is put into a security context, thus the > safe guard may have to be enabled on mmap() for 1:1 group. This is a > different sequence from the existing group-centric model. Yes, but I think this is fairly minor, it would just start with a dummy fops and move to operational fops once things are set up enough. Jason
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote: > This series introduces a framework that makes it possible to implement > software-emulated vDPA devices in userspace. And to make it simple, the > emulated vDPA device's control path is handled in the kernel and only the > data path is implemented in the userspace. This looks interesting. Unfortunately I don't have enough time to do a full review, but I looked at the documentation and uapi header file to give feedback on the userspace ABI. Stefan
Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote: > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h > new file mode 100644 > index ..f21b2e51b5c8 > --- /dev/null > +++ b/include/uapi/linux/vduse.h > @@ -0,0 +1,143 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _UAPI_VDUSE_H_ > +#define _UAPI_VDUSE_H_ > + > +#include > + > +#define VDUSE_API_VERSION0 > + > +#define VDUSE_NAME_MAX 256 > + > +/* the control messages definition for read/write */ > + > +enum vduse_req_type { > + /* Get the state for virtqueue from userspace */ > + VDUSE_GET_VQ_STATE, > + /* Notify userspace to start the dataplane, no reply */ > + VDUSE_START_DATAPLANE, > + /* Notify userspace to stop the dataplane, no reply */ > + VDUSE_STOP_DATAPLANE, > + /* Notify userspace to update the memory mapping in device IOTLB */ > + VDUSE_UPDATE_IOTLB, > +}; > + > +struct vduse_vq_state { > + __u32 index; /* virtqueue index */ > + __u32 avail_idx; /* virtqueue state (last_avail_idx) */ > +}; > + > +struct vduse_iova_range { > + __u64 start; /* start of the IOVA range */ > + __u64 last; /* end of the IOVA range */ Please clarify whether this describes a closed range [start, last] or an open range [start, last). 
> +}; > + > +struct vduse_dev_request { > + __u32 type; /* request type */ > + __u32 request_id; /* request id */ > +#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0) /* No need to reply */ > + __u32 flags; /* request flags */ > + __u32 reserved; /* for future use */ > + union { > + struct vduse_vq_state vq_state; /* virtqueue state */ > + struct vduse_iova_range iova; /* iova range for updating */ > + __u32 padding[16]; /* padding */ > + }; > +}; > + > +struct vduse_dev_response { > + __u32 request_id; /* corresponding request id */ > +#define VDUSE_REQ_RESULT_OK 0x00 > +#define VDUSE_REQ_RESULT_FAILED 0x01 > + __u32 result; /* the result of request */ > + __u32 reserved[2]; /* for future use */ > + union { > + struct vduse_vq_state vq_state; /* virtqueue state */ > + __u32 padding[16]; /* padding */ > + }; > +}; > + > +/* ioctls */ > + > +struct vduse_dev_config { > + char name[VDUSE_NAME_MAX]; /* vduse device name */ > + __u32 vendor_id; /* virtio vendor id */ > + __u32 device_id; /* virtio device id */ > + __u64 features; /* device features */ > + __u64 bounce_size; /* bounce buffer size for iommu */ > + __u16 vq_size_max; /* the max size of virtqueue */ The VIRTIO specification allows per-virtqueue sizes. A device can have two virtqueues, where the first one allows up to 1024 descriptors and the second one allows only 128 descriptors, for example. This constant seems to impose the constraint that all virtqueues have the same maximum size. Is this really necessary? > + __u16 padding; /* padding */ > + __u32 vq_num; /* the number of virtqueues */ > + __u32 vq_align; /* the allocation alignment of virtqueue's metadata */ I'm not sure what this is? 
> + __u32 config_size; /* the size of the configuration space */ > + __u32 reserved[15]; /* for future use */ > + __u8 config[0]; /* the buffer of the configuration space */ > +}; > + > +struct vduse_iotlb_entry { > + __u64 offset; /* the mmap offset on fd */ > + __u64 start; /* start of the IOVA range */ > + __u64 last; /* last of the IOVA range */ Same here, please specify whether this is an open range or a closed range. > +#define VDUSE_ACCESS_RO 0x1 > +#define VDUSE_ACCESS_WO 0x2 > +#define VDUSE_ACCESS_RW 0x3 > + __u8 perm; /* access permission of this range */ > +}; > + > +struct vduse_config_update { > + __u32 offset; /* offset from the beginning of configuration space */ > + __u32 length; /* the length to write to configuration space */ > + __u8 buffer[0]; /* buffer used to write from */ > +}; > + > +struct vduse_vq_info { > + __u32 index; /* virtqueue index */ > + __u32 avail_idx; /* virtqueue state (last_avail_idx) */ > + __u64 desc_addr; /* address of desc area */ > + __u64 driver_addr; /* address of driver area */ > + __u64 device_addr; /* address of device area */ > + __u32 num; /* the size of virtqueue */ > + __u8 ready; /* ready status of virtqueue */ > +}; > + > +struct vduse_vq_eventfd { > + __u32 index; /* virtqueue index */ > +#define VDUSE_EVENTFD_DEASSIGN -1 > + int fd; /* eventfd, -1 means de-assigning the eventfd */ > +}; > + > +#define VDUSE_BASE 0x81 > + > +/* Get the version of VDUSE API. This is used for future extension */ > +#define VDUSE_GET_API_VERSION_IOR(VDUSE_BASE, 0x00, __u64) > + > +/* Set the version of VDUSE API. */ > +#define VDUSE_SET_API_VERSION_IOW(VDUSE_BASE, 0x01, __u64) > + > +/* Create a vduse device which is represented by a char device > (/dev/vduse/) */ > +#define
Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote: > VDUSE (vDPA Device in Userspace) is a framework to support > implementing software-emulated vDPA devices in userspace. This > document is intended to clarify the VDUSE design and usage. > > Signed-off-by: Xie Yongji > --- > Documentation/userspace-api/index.rst | 1 + > Documentation/userspace-api/vduse.rst | 222 > ++ > 2 files changed, 223 insertions(+) > create mode 100644 Documentation/userspace-api/vduse.rst > > diff --git a/Documentation/userspace-api/index.rst > b/Documentation/userspace-api/index.rst > index 0b5eefed027e..c432be070f67 100644 > --- a/Documentation/userspace-api/index.rst > +++ b/Documentation/userspace-api/index.rst > @@ -27,6 +27,7 @@ place where this information is gathered. > iommu > media/index > sysfs-platform_profile > + vduse > > .. only:: subproject and html > > diff --git a/Documentation/userspace-api/vduse.rst > b/Documentation/userspace-api/vduse.rst > new file mode 100644 > index ..2f9cd1a4e530 > --- /dev/null > +++ b/Documentation/userspace-api/vduse.rst > @@ -0,0 +1,222 @@ > +== > +VDUSE - "vDPA Device in Userspace" > +== > + > +vDPA (virtio data path acceleration) device is a device that uses a > +datapath which complies with the virtio specifications with vendor > +specific control path. vDPA devices can be both physically located on > +the hardware or emulated by software. VDUSE is a framework that makes it > +possible to implement software-emulated vDPA devices in userspace. And > +to make it simple, the emulated vDPA device's control path is handled in > +the kernel and only the data path is implemented in the userspace. > + > +Note that only virtio block device is supported by VDUSE framework now, > +which can reduce security risks when the userspace process that implements > +the data path is run by an unprivileged user. The Support for other device > +types can be added after the security issue is clarified or fixed in the > future. 
> + > +Start/Stop VDUSE devices > + > + > +VDUSE devices are started as follows: > + > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > + /dev/vduse/control. > + > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first > + messages will arrive while attaching the VDUSE instance to vDPA bus. > + > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE > + instance to vDPA bus. > + > +VDUSE devices are stopped as follows: > + > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE > + instance from vDPA bus. > + > +2. Close the file descriptor referring to /dev/vduse/$NAME > + > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on > + /dev/vduse/control > + > +The netlink messages metioned above can be sent via vdpa tool in iproute2 > +or use the below sample codes: > + > +.. code-block:: c > + > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd) > + { > + struct nl_sock *nlsock; > + struct nl_msg *msg; > + int famid; > + > + nlsock = nl_socket_alloc(); > + if (!nlsock) > + return -ENOMEM; > + > + if (genl_connect(nlsock)) > + goto free_sock; > + > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME); > + if (famid < 0) > + goto close_sock; > + > + msg = nlmsg_alloc(); > + if (!msg) > + goto close_sock; > + > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, > cmd, 0)) > + goto nla_put_failure; > + > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name); > + if (cmd == VDPA_CMD_DEV_NEW) > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, > "vduse"); > + > + if (nl_send_sync(nlsock, msg)) > + goto close_sock; > + > + nl_close(nlsock); > + nl_socket_free(nlsock); > + > + return 0; > + nla_put_failure: > + nlmsg_free(msg); > + close_sock: > + nl_close(nlsock); > + free_sock: > + nl_socket_free(nlsock); > + return -1; > + } > + > +How VDUSE works > +--- > + > +Since the emuldated vDPA device's control path is handled in the kernel, s/emuldated/emulated/ > +a message-based 
communication protocol and few types of control messages > +are introduced by VDUSE framework to make userspace be aware of the data > +path related changes: > + > +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace > + > +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane > + > +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane > + > +- VDUSE_UPDATE_IOTLB: Notify userspace to update
Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
On Mon, 28 Jun 2021 at 10:55, Liu Xiaodong wrote: > > On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote: > > > > This series introduces a framework that makes it possible to implement > > software-emulated vDPA devices in userspace. And to make it simple, the > > emulated vDPA device's control path is handled in the kernel and only the > > data path is implemented in the userspace. > > > > Since the emuldated vDPA device's control path is handled in the kernel, > > a message mechnism is introduced to make userspace be aware of the data > > path related changes. Userspace can use read()/write() to receive/reply > > the control messages. > > > > In the data path, the core is mapping dma buffer into VDUSE daemon's > > address space, which can be implemented in different ways depending on > > the vdpa bus to which the vDPA device is attached. > > > > In virtio-vdpa case, we implements a MMU-based on-chip IOMMU driver with > > bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma > > buffer is reside in a userspace memory region which can be shared to the > > VDUSE userspace processs via transferring the shmfd. 
> > The details and our use case are shown below: > > [ASCII architecture diagram, mangled by the archive: in userspace, the VDUSE daemon holds the vDPA device emulation and a block driver speaking TCP/IP; in the kernel, a container's virtio-blk/virtio-vdpa stack and a QEMU VM's /dev/vhost-vdpa-x vhost-vdpa stack both sit on the vdpa bus, which connects them to the vduse driver and from there to the daemon; the daemon reaches the Remote Storages through the NIC.] > > We make use of it to implement a block device connecting to > > our distributed storage, which can be used both
Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices
Hi Bjorn, I would like to add more explanation about what problem this patch fixed. I am testing the SVA/vSVA series patches written by @Jean and @Brucker. I test with the following qemu command line, with a hisilicon SEC device attached on a pcie-root-port. $QEMU/qemu-system-aarch64 \ -enable-kvm \ -kernel $LINUX/arch/arm64/boot/Image \ -m 16G \ -smp cores=8,threads=1,sockets=2\ -machine virt,kernel_irqchip=on,gic-version=3,iommu=smmuv3 \ -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,addr=0x1 \ -device vfio-pci,host=75:00.0,bus=pci.1,addr=0x0,id=acc2 \ -net none \ -initrd ./mfs.cpio.gz \ -cpu host \ -nographic \ -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900" \ And I got the guest PCI configuration: 00:00.0 Class 0600: Device 1b36:0008# root bus 00:01.0 Class 0604: Device 1b36:000c# root port Capabilities: [148 v1] Access Control Services ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+ ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- Kernel driver in use: pcieport 01:00.0 Class 1000: Device 19e5:a255 (rev 21) # SEC The PCI configuration shows that the ACS of the pcie root port is not enabled, while it should have. Then when I insmod device driver and init the SVA feature, I got [ 24.342450] hisi_sec2 :01:00.0: cannot attach to incompatible domain (0 SSID bits != 10) [ 24.343731] hisi_sec2 :01:00.0: Failed to add to iommu group 0: -22 [ 24.345243] hisi_sec2 :01:00.0: enabling device ( -> 0002) qemu-system-aarch64: vfio_enable_vectors failed to register S1 MSI binding for vector 0(-2) qemu-system-aarch64: vfio: Error: Failed to setup MSI fds: Interrupted system call qemu-system-aarch64: vfio: Error: Failed to enable MSI I figured out that this error occurs in the arm_smmu_attach_dev when checking ssid_bits for SVA feature, the master->ssid_bits != smmu_domain->s1_cfg.s1cdmax caused this problem. 
This is because the ACS of the pcie-root-port is not enabled, so the pcie-root-port and the SEC device share the same domain. SEC's ssid_bits is 10 while the pcie-root-port's s1cdmax is zero, which causes the problem. As for why ACS is not enabled in the kernel, I explained it as follows: On 2021/6/7 20:58, Xingang Wang wrote: On 2021/6/5 3:04, Bjorn Helgaas wrote: [+cc John, who tested 6bf6c24720d3] On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote: From: Xingang Wang When booting with devicetree, the pci_request_acs() is called after the enumeration and initialization of PCI devices, thus the ACS is not enabled. And ACS should be enabled when IOMMU is detected for the PCI host bridge, so add check for IOMMU before probe of PCI host and call pci_request_acs() to make sure ACS will be enabled when enumerating PCI devices. I'm happy to apply this, but I'm a little puzzled about 6bf6c24720d3 ("iommu/of: Request ACS from the PCI core when configuring IOMMU linkage"). It was tested and fixed a problem, but I don't understand how. 6bf6c24720d3 added the call to pci_request_acs() in of_iommu_configure() so it currently looks like this: of_iommu_configure(dev, ...) { if (dev_is_pci(dev)) pci_request_acs(); pci_request_acs() sets pci_acs_enable, which tells us to enable ACS when enumerating PCI devices in the future. But we only call pci_request_acs() if we already *have* a PCI device. So maybe 6bf6c24720d3 fixed a problem for *some* PCI devices, but not all? E.g., did we call of_iommu_configure() for one PCI device before enumerating the rest?
I tested the kernel on an arm platform with qemu: qemu-system-aarch64 \ -cpu host \ -kernel arch/arm64/boot/Image \ -enable-kvm \ -m 8G \ -smp 2,sockets=2,cores=1,threads=1 \ -machine virt,kernel_irqchip=on,gic-version=3,iommu=smmuv3\ -initrd rootfs.cpio.gz \ -nographic \ -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900 nokaslr" \ -device pcie-root-port,port=0x1,chassis=1,id=pci.1,addr=0x8 \ -netdev user,id=hostnet0 \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=08:13:3a:5a:22:5b,bus=pci.1,addr=0x0 \ I find that of_iommu_configure is called after the enumeration of the pcie-root-port. This might only affect the first device; when enumerating the rest of the devices, pci_acs_enable has already been enabled. But to make sure that pci_acs_enable is always set before any PCI devices are enumerated, it would be better to set it in the initialization of PCI bridges. Thanks Xingang Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when configuring IOMMU linkage") Signed-off-by: Xingang Wang --- drivers/iommu/of_iommu.c | 1 - drivers/pci/of.c | 8 +++- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c index a9d2df001149..54a14da242cc 100644 ---
Re: [PATCH v4 1/3] iommu: io-pgtable: add DART pagetable format
On 27.06.21 16:34, Sven Peter wrote: Apple's DART iommu uses a pagetable format that shares some similarities with the ones already implemented by io-pgtable.c. Add a new format variant to support the required differences so that we don't have to duplicate the pagetable handling code. Signed-off-by: Sven Peter --- drivers/iommu/io-pgtable-arm.c | 62 ++ drivers/iommu/io-pgtable.c | 1 + include/linux/io-pgtable.h | 7 3 files changed, 70 insertions(+) diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 87def58e79b5..1dd5c45b4b5b 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -127,6 +127,9 @@ #define ARM_MALI_LPAE_MEMATTR_IMP_DEF 0x88ULL #define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL +#define APPLE_DART_PTE_PROT_NO_WRITE (1<<7) +#define APPLE_DART_PTE_PROT_NO_READ (1<<8) + /* IOPTE accessors */ #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d)) @@ -381,6 +384,15 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data, { arm_lpae_iopte pte; + if (data->iop.fmt == ARM_APPLE_DART) { + pte = 0; + if (!(prot & IOMMU_WRITE)) + pte |= APPLE_DART_PTE_PROT_NO_WRITE; + if (!(prot & IOMMU_READ)) + pte |= APPLE_DART_PTE_PROT_NO_READ; + return pte; What about the other bits, such as sharability, XN, etc? Do they not exist on DART? Or have they not been reverse engineered and 0s happen to "just work"? + } + if (data->iop.fmt == ARM_64_LPAE_S1 || data->iop.fmt == ARM_32_LPAE_S1) { pte = ARM_LPAE_PTE_nG; @@ -1043,6 +1055,51 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie) return NULL; } +static struct io_pgtable * +apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie) +{ + struct arm_lpae_io_pgtable *data; + int i; + + if (cfg->oas > 36) + return NULL; + + data = arm_lpae_alloc_pgtable(cfg); + if (!data) + return NULL; + + /* +* Apple's DART always requires three levels with the first level being +* stored in four MMIO registers. 
We always concatenate the first and +* second level so that we only have to setup the MMIO registers once. +* This results in an effective two level pagetable. +*/ + if (data->start_level < 1) + return NULL; + if (data->start_level == 1 && data->pgd_bits > 2) + return NULL; + if (data->start_level > 1) + data->pgd_bits = 0; + data->start_level = 2; + cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits; Maybe add a BUG_ON if n_ttbrs > ARRAY_SIZE(ttbr)? Or alternatively, do a normal runtime check and bail out then. Alex
RE: Plan for /dev/ioasid RFC v2
> From: Jason Gunthorpe > Sent: Friday, June 25, 2021 10:36 PM > > On Fri, Jun 25, 2021 at 10:27:18AM +, Tian, Kevin wrote: > > > - When receiving the binding call for the 1st device in a group, iommu_fd > > calls iommu_group_set_block_dma(group, dev->driver) which does > > several things: > > The whole problem here is trying to match this new world where we want > devices to be in charge of their own IOMMU configuration and the old > world where groups are in charge. > > Inserting the group fd and then calling a device-centric > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't > necessary. We can always get the group back from the device at any > point in the sequence do to a group wide operation. > > What I saw as the appeal of the sort of idea was to just completely > leave all the difficult multi-device-group scenarios behind on the old > group centric API and then we don't have to deal with them at all, or > least not right away. > > I'd see some progression where iommu_fd only works with 1:1 groups at > the start. Other scenarios continue with the old API. > > Then maybe groups where all devices use the same IOASID. > > Then 1:N groups if the source device is reliably identifiable, this > requires iommu subystem work to attach domains to sub-group objects - > not sure it is worthwhile. > > But at least we can talk about each step with well thought out patches > > The only thing that needs to be done to get the 1:1 step is to broadly > define how the other two cases will work so we don't get into trouble > and set some way to exclude the problematic cases from even getting to > iommu_fd in the first place. > > For instance if we go ahead and create /dev/vfio/device nodes we could > do this only if the group was 1:1, otherwise the group cdev has to be > used, along with its API. > Thinking more along your direction, here is an updated sketch: [Stage-1] Multi-devices group (1:N) is handled by existing vfio group fd and vfio_iommu_type1 driver. 
Singleton group (1:1) is handled via a new device-centric protocol: 1) /dev/vfio/device nodes are created for devices in a singleton group or devices w/o group (mdev) 2) user gets iommu_fd by open("/dev/iommu"). A default block_dma domain is created per iommu_fd (or globally) with an empty I/O page table. 3) iommu_fd reports that only 1:1 group is supported 4) user gets device_fd by open("/dev/vfio/device"). At this point mmap() should be blocked since a security context hasn't been established for this fd. This could be done by returning an error (EACCES or EAGAIN?), or succeeding w/o actually setting up the mapping. 5) user requests to bind device_fd to iommu_fd which verifies the group is not 1:N (for mdev the check is on the parent device). Successful binding automatically attaches the device to the block_dma domain via iommu_attach_group(). From now on the user is permitted to access the device. If mmap() in 4) is allowed, vfio actually sets up the MMIO mapping at this point. 6) before the device is unbound from iommu_fd, it is always in a security context. Attaching/detaching just switches the security context between the block_dma domain and an ioasid domain. 7) Unbinding detaches the device from the block_dma domain and re-attaches it to the default domain. From now on the user should be denied from accessing the device. vfio should tear down the MMIO mapping at this point. [Stage-2] Both 1:1 and 1:N groups are handled via the new device-centric protocol. Old vfio uAPI is kept for legacy applications. All devices in the same group must share the same I/O address space.
A key difference from stage-1 is the additional check on group viability: 1) vfio creates /dev/vfio/device nodes for all devices 2) Same as stage-1 for getting iommu_fd 3) iommu_fd reports that both 1:1 and 1:N groups are supported 4) Same as stage-1 for getting device_fd 5) when receiving the binding call for the 1st device in a group, iommu_fd does several things: a) Identify the group of this device and check group viability. A group is viable only when all devices in the group are in one of the below states: * driver-less * bound to the same driver which does the binding call (vfio in this case) * bound to an otherwise allowed driver (which indicates that it is safe for iommu_fd usage around probe()) b) Attach all devices in the group to the block_dma domain, via the existing iommu_attach_group(). c) Register a notifier callback to verify group viability on the IOMMU_GROUP_NOTIFY_BOUND_DRIVER event. The BUG_ON() might be eliminated if we can find a way to deny probe of non-iommu-safe drivers.
Re: [PATCH 9/9] dma-debug: Use memory_intersects() directly
On 2021/6/28 14:11, Christoph Hellwig wrote: You've sent me a patch 9 out of 9 without the previous 8 patches. There is no way I can sensibly review this series. The full series is at https://lore.kernel.org/linux-arch/20210626073439.150586-1-wangkefeng.w...@huawei.com/T/#t This patch is not closely related to the above 8 patches; it only uses an existing function to simplify the code.
Re: [PATCH 9/9] dma-debug: Use memory_intersects() directly
You've sent me a patch 9 out of 9 without the previous 8 patches. There is no way I can sensibly review this series.
Re: Bug#989778: Regression in at least 5.10.y and mainline: Firewire audio interface fails to work properly (when booted under Xen)
On Thu, 24 Jun 2021 at 04:50, Salvatore Bonaccorso wrote: > Hi Robin, > > On Mon, Jun 14, 2021 at 02:29:08PM +0100, Robin Murphy wrote: > > On 2021-06-13 07:29, Salvatore Bonaccorso wrote: > > > A user in Debian reported the above issue, which was reproducible with > > > 5.13-rc5 and 5.10.y as packaged in Debian and found that 85a5a6875ca9 > > > ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single") that > > > introduced the issue. > > > > Sounds like it's probably the same thing as being discussed over here: > > > > > https://lore.kernel.org/linux-iommu/2e899de2-4b69-c4b6-33a6-09fb8949d...@nxp.com/ > > Thanks for the pointer, it seems that now it has been fixed upstream > with 5f89468e2f06 ("swiotlb: manipulate orig_addr when tlb_addr has > offset"). > I've checked out upstream v5.13 b7050b242430f3170e0b57f5f55136e44cb8dc66 (which includes the above commit) and confirmed that my issue is now resolved. Now to wait for it to reach Debian :)