Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-28 Thread Yongji Xie
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >  Documentation/userspace-api/index.rst |   1 +
> >  Documentation/userspace-api/vduse.rst | 222 
> > ++
> >  2 files changed, 223 insertions(+)
> >  create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > diff --git a/Documentation/userspace-api/index.rst 
> > b/Documentation/userspace-api/index.rst
> > index 0b5eefed027e..c432be070f67 100644
> > --- a/Documentation/userspace-api/index.rst
> > +++ b/Documentation/userspace-api/index.rst
> > @@ -27,6 +27,7 @@ place where this information is gathered.
> > iommu
> > media/index
> > sysfs-platform_profile
> > +   vduse
> >
> >  .. only::  subproject and html
> >
> > diff --git a/Documentation/userspace-api/vduse.rst 
> > b/Documentation/userspace-api/vduse.rst
> > new file mode 100644
> > index ..2f9cd1a4e530
> > --- /dev/null
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -0,0 +1,222 @@
> > +==================================
> > +VDUSE - "vDPA Device in Userspace"
> > +==================================
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace. And
> > +to make it simple, the emulated vDPA device's control path is handled in
> > +the kernel and only the data path is implemented in the userspace.
> > +
> > +Note that only virtio block device is supported by VDUSE framework now,
> > +which can reduce security risks when the userspace process that implements
> > +the data path is run by an unprivileged user. Support for other device
> > +types can be added after the security issue is clarified or fixed in the 
> > future.
> > +
> > +Start/Stop VDUSE devices
> > +------------------------
> > +
> > +VDUSE devices are started as follows:
> > +
> > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > +   /dev/vduse/control.
> > +
> > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> > +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> > +
> > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> > +   instance to vDPA bus.
> > +
> > +VDUSE devices are stopped as follows:
> > +
> > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> > +   instance from vDPA bus.
> > +
> > +2. Close the file descriptor referring to /dev/vduse/$NAME
> > +
> > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> > +   /dev/vduse/control
> > +
> > +The netlink messages mentioned above can be sent via the vdpa tool in iproute2
> > +or with the below sample code:
> > +
> > +.. code-block:: c
> > +
> > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> > + {
> > + struct nl_sock *nlsock;
> > + struct nl_msg *msg;
> > + int famid;
> > +
> > + nlsock = nl_socket_alloc();
> > + if (!nlsock)
> > + return -ENOMEM;
> > +
> > + if (genl_connect(nlsock))
> > + goto free_sock;
> > +
> > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > + if (famid < 0)
> > + goto close_sock;
> > +
> > + msg = nlmsg_alloc();
> > + if (!msg)
> > + goto close_sock;
> > +
> > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> > cmd, 0))
> > + goto nla_put_failure;
> > +
> > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > + if (cmd == VDPA_CMD_DEV_NEW)
> > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> > "vduse");
> > +
> > + if (nl_send_sync(nlsock, msg))
> > + goto close_sock;
> > +
> > + nl_close(nlsock);
> > + nl_socket_free(nlsock);
> > +
> > + return 0;
> > + nla_put_failure:
> > + nlmsg_free(msg);
> > + close_sock:
> > + nl_close(nlsock);
> > + free_sock:
> > + nl_socket_free(nlsock);
> > + return -1;
> > + }
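
For completeness, the ioctl side of steps 1 and 3 above could look roughly like the sketch below. It assumes, based on the uapi header in patch 09/10, that VDUSE_CREATE_DEV takes a struct vduse_dev_config (with the virtio config space appended) and that VDUSE_DESTROY_DEV takes the device name; treat it as illustrative rather than the final ABI.

    /* Sketch: create/destroy a VDUSE instance via /dev/vduse/control. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/vduse.h>

    static int vduse_create(const char *name, __u32 device_id, __u64 features,
                            const void *config, __u32 config_size)
    {
            struct vduse_dev_config *conf;
            int ctrl_fd, ret;

            ctrl_fd = open("/dev/vduse/control", O_RDWR);
            if (ctrl_fd < 0)
                    return -errno;

            conf = calloc(1, sizeof(*conf) + config_size);
            if (!conf) {
                    close(ctrl_fd);
                    return -ENOMEM;
            }
            strncpy(conf->name, name, VDUSE_NAME_MAX - 1);
            conf->device_id = device_id;    /* e.g. virtio-blk */
            conf->features = features;      /* device feature bits */
            conf->vq_num = 1;               /* one virtqueue */
            conf->vq_size_max = 256;        /* max descriptors per vq */
            conf->vq_align = 4096;          /* vring metadata alignment */
            conf->bounce_size = 64 << 20;   /* 64MB bounce buffer */
            conf->config_size = config_size;
            memcpy(conf->config, config, config_size);

            ret = ioctl(ctrl_fd, VDUSE_CREATE_DEV, conf);
            ret = ret < 0 ? -errno : 0;
            free(conf);
            close(ctrl_fd);
            return ret;
    }

    static int vduse_destroy(const char *name)
    {
            int ret, ctrl_fd = open("/dev/vduse/control", O_RDWR);

            if (ctrl_fd < 0)
                    return -errno;
            ret = ioctl(ctrl_fd, VDUSE_DESTROY_DEV, name);
            ret = ret < 0 ? -errno : 0;
            close(ctrl_fd);
            return ret;
    }

The message handling on /dev/vduse/$NAME (step 2) then runs in its own loop, and netlink_add_vduse() above covers the VDPA_CMD_DEV_NEW/DEL steps.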
> > +
> > +How VDUSE works
> > +---------------
> > +
> > +Since the emuldated vDPA device's control path is handled in the kernel,
>
> s/emuldated/emulated/
>

Will fix it.

> > +a message-based communication protocol and a few types of control messages
> > +are introduced by VDUSE framework 

Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang


On 2021/6/28 6:32 PM, Yongji Xie wrote:

The large barrier is bounce-buffer mapping: SPDK requires hugepages
for NVMe over PCIe and RDMA, so taking some preallocated hugepages to
map as the bounce buffer is necessary. Otherwise it's hard to avoid an
extra memcpy from the bounce buffer to hugepages.
If you can add an option to map hugepages as the bounce buffer,
then SPDK could also be a potential user of vduse.


I think we can support registering user space memory for bounce-buffer
use like XDP does. But this needs to pin the pages, so I didn't
consider it in this initial version.



Note that userspace should be unaware of the existence of the bounce buffer.

So we need to think carefully about mmap() vs. umem registering.

Thanks
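
For readers unfamiliar with the XDP precedent mentioned above, "umem registering" on the AF_XDP side looks roughly like the sketch below. This is the existing AF_XDP API, not a VDUSE interface; VDUSE would need an analogous registration call plus the page pinning Yongji mentions. Error handling is trimmed.

    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <linux/if_xdp.h>

    /* Preallocate hugepage-backed memory and register it as an AF_XDP umem;
     * the kernel pins these pages for the lifetime of the socket. */
    static int register_umem_like_xdp(size_t size)
    {
            int xsk = socket(AF_XDP, SOCK_RAW, 0);
            void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            struct xdp_umem_reg mr = {
                    .addr = (uintptr_t)buf,
                    .len = size,
                    .chunk_size = 2048,     /* frame size */
                    .headroom = 0,
            };

            if (xsk < 0 || buf == MAP_FAILED)
                    return -1;
            return setsockopt(xsk, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
    }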


Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang



On 2021/6/28 1:54 PM, Liu, Xiaodong wrote:

Several issues:

- VDUSE needs to limit the total size of the bounce buffers (64M if I was not
wrong). Does it work for SPDK?

Yes, Jason. It is enough and works for SPDK.
Since it's a kind of bounce buffer mainly for in-flight IO, a limited size like
64MB is enough.



Ok.





- VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we
need to introduce new flags for supporting this)

I share your worry; I'm also afraid that it is hard for a kernel module
to directly preallocate hugepages internally.
What I tried is this:
1. A simple agent daemon (representing one device) preallocates and maps
   dozens of 2MB hugepages (e.g. 64MB) for one device.
2. The daemon passes its mapping address and hugepage fd to the kernel
   module through a newly created IOCTL.
3. The kernel module remaps the hugepages inside the kernel.



Such a model should work, but the main "issue" is that it introduces
overheads in the case of vhost-vDPA.


Note that in the case of vhost-vDPA, we don't use a bounce buffer; the
userspace pages are shared directly.


And since DMA is not done per page, it prevents us from using tricks  
like vm_insert_page() in those cases.




4. The vhost-user target gets and maps the hugepage fd from the kernel module
   in a vhost-user msg through a Unix domain socket cmsg.
Then the kernel module and the target map the same hugepage-based
bounce buffer for in-flight IO.

If there is an option in VDUSE to map userspace-preallocated memory, then
VDUSE should be able to mandate it even if it is hugepage-based.
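
To make steps 1 and 4 above concrete, a rough userspace-side sketch follows. The kernel-registration ioctl from steps 2-3 does not exist in this series, so it is omitted; memfd_create(MFD_HUGETLB) and SCM_RIGHTS are standard APIs, and error handling is trimmed.

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Step 1: preallocate a hugepage-backed region (e.g. 64MB of 2MB pages). */
    static int alloc_hugepage_buffer(size_t size, void **vaddr)
    {
            int fd = memfd_create("vduse-bounce", MFD_HUGETLB);

            if (fd < 0 || ftruncate(fd, size) < 0)
                    return -1;
            *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            return *vaddr == MAP_FAILED ? -1 : fd;
    }

    /* Step 4: hand the same fd to the vhost-user target via SCM_RIGHTS so
     * both sides map the identical hugepage-backed bounce buffer. */
    static int send_buffer_fd(int unix_sock, int buf_fd)
    {
            char data = 0;
            struct iovec iov = { .iov_base = &data, .iov_len = 1 };
            char ctrl[CMSG_SPACE(sizeof(int))];
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &buf_fd, sizeof(int));
            return sendmsg(unix_sock, &msg, 0);
    }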



As above, this requires some kind of re-design since VDUSE depends on  
the model of mmap(MAP_SHARED) instead of umem registering.


Thanks



Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang


On 2021/6/29 11:56 AM, Yongji Xie wrote:

On Tue, Jun 29, 2021 at 11:29 AM Jason Wang  wrote:


On 2021/6/29 10:26 AM, Yongji Xie wrote:

On Mon, Jun 28, 2021 at 12:40 PM Jason Wang  wrote:

On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_stats().

2b looks more spec compliant.


Looks good to me. And I think we can use the reply of the message to
update the status instead of introducing a new command.


Just notice this part in virtio_finalize_features():

   virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
   status = dev->config->get_status(dev);
   if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no reply doesn't work for FEATURES_OK.

So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to
update the status.

So it looks to me that we need to synchronize both get_status() and
set_status().


We should not send messages to userspace in the FEATURES_OK case. So
the synchronization is not necessary.


As discussed previously, there could be a device that mandates some
features (VIRTIO_F_RING_PACKED). So it can choose to not accept
FEATURES_OK if the packed virtqueue is not negotiated.

In this case we need to relay the message to userspace.


OK, I see. If so, I prefer to only use noreply for set_status(). We do
not set the status bit if the message fails. In this way, we don't
need to change lots of virtio core code to handle the failure of
set_status()/get_status().



It should work.

Thanks




Thanks,
Yongji




Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Yongji Xie
On Tue, Jun 29, 2021 at 11:29 AM Jason Wang  wrote:
>
>
> On 2021/6/29 10:26 AM, Yongji Xie wrote:
> > On Mon, Jun 28, 2021 at 12:40 PM Jason Wang  wrote:
> >>
> >> On 2021/6/25 12:19 PM, Yongji Xie wrote:
>  2b) for set_status(): simply relay the message to userspace; no reply is
>  needed. Userspace will use a command to update the status when the
>  datapath is stopped. Then the status could be fetched via get_stats().
> 
>  2b looks more spec compliant.
> 
> >>> Looks good to me. And I think we can use the reply of the message to
> >>> update the status instead of introducing a new command.
> >>>
> >> Just notice this part in virtio_finalize_features():
> >>
> >>   virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
> >>   status = dev->config->get_status(dev);
> >>   if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
> >>
> >> So no reply doesn't work for FEATURES_OK.
> >>
> >> So my understanding is:
> >>
> >> 1) We must not use noreply for set_status()
> >> 2) We can use noreply for get_status(), but it requires a new ioctl to
> >> update the status.
> >>
> >> So it looks to me that we need to synchronize both get_status() and
> >> set_status().
> >>
> > We should not send messages to userspace in the FEATURES_OK case. So
> > the synchronization is not necessary.
>
>
> As discussed previously, there could be a device that mandates some
> features (VIRTIO_F_RING_PACKED). So it can choose to not accept
> FEATURES_OK if the packed virtqueue is not negotiated.
>
> In this case we need to relay the message to userspace.
>

OK, I see. If so, I prefer to only use noreply for set_status(). We do
not set the status bit if the message fails. In this way, we don't
need to change lots of virtio core code to handle the failure of
set_status()/get_status().

Thanks,
Yongji
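
As I read the conclusion above, the vDPA set_status() op would relay the message and only commit the new status on success, so virtio_finalize_features() naturally sees FEATURES_OK cleared when the device refuses the features. A rough sketch; VDUSE_SET_STATUS and the helpers below are illustrative names, not part of this patch set:

    /* Relay set_status() to userspace synchronously and only commit the
     * new status if userspace acks it. */
    static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
    {
            struct vduse_dev *dev = vdpa_to_vduse(vdpa);  /* hypothetical helper */
            struct vduse_dev_msg msg = {
                    .req.type = VDUSE_SET_STATUS,         /* hypothetical message */
                    .req.s.status = status,
            };

            /* Synchronous: do not set VDUSE_REQ_FLAGS_NO_REPLY here. */
            if (vduse_dev_msg_sync(dev, &msg))
                    return;                /* keep the old status on failure */

            dev->status = status;
    }

    static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
    {
            /* get_status() can then simply return the cached value. */
            return vdpa_to_vduse(vdpa)->status;
    }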

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Jason Wang


On 2021/6/29 10:26 AM, Yongji Xie wrote:

On Mon, Jun 28, 2021 at 12:40 PM Jason Wang  wrote:


On 2021/6/25 12:19 PM, Yongji Xie wrote:

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_stats().

2b looks more spec compliant.


Looks good to me. And I think we can use the reply of the message to
update the status instead of introducing a new command.


Just notice this part in virtio_finalize_features():

  virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
  status = dev->config->get_status(dev);
  if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no reply doesn't work for FEATURES_OK.

So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to
update the status.

So it looks to me that we need to synchronize both get_status() and
set_status().


We should not send messages to userspace in the FEATURES_OK case. So
the synchronization is not necessary.



As discussed previously, there could be a device that mandates some 
features (VIRTIO_F_RING_PACKED). So it can choose to not accept 
FEATURES_OK if the packed virtqueue is not negotiated.


In this case we need to relay the message to userspace.

Thanks




Thanks,
Yongji




Re: Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Yongji Xie
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> > This series introduces a framework that makes it possible to implement
> > software-emulated vDPA devices in userspace. And to make it simple, the
> > emulated vDPA device's control path is handled in the kernel and only the
> > data path is implemented in the userspace.
>
> This looks interesting. Unfortunately I don't have enough time to do a
> full review, but I looked at the documentation and uapi header file to
> give feedback on the userspace ABI.
>

OK. Thanks for your comments. It's helpful!

Thanks,
Yongji


Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Yongji Xie
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> > diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> > new file mode 100644
> > index ..f21b2e51b5c8
> > --- /dev/null
> > +++ b/include/uapi/linux/vduse.h
> > @@ -0,0 +1,143 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_VDUSE_H_
> > +#define _UAPI_VDUSE_H_
> > +
> > +#include 
> > +
> > +#define VDUSE_API_VERSION 0
> > +
> > +#define VDUSE_NAME_MAX   256
> > +
> > +/* the control messages definition for read/write */
> > +
> > +enum vduse_req_type {
> > + /* Get the state for virtqueue from userspace */
> > + VDUSE_GET_VQ_STATE,
> > + /* Notify userspace to start the dataplane, no reply */
> > + VDUSE_START_DATAPLANE,
> > + /* Notify userspace to stop the dataplane, no reply */
> > + VDUSE_STOP_DATAPLANE,
> > + /* Notify userspace to update the memory mapping in device IOTLB */
> > + VDUSE_UPDATE_IOTLB,
> > +};
> > +
> > +struct vduse_vq_state {
> > + __u32 index; /* virtqueue index */
> > + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> > +};
> > +
> > +struct vduse_iova_range {
> > + __u64 start; /* start of the IOVA range */
> > + __u64 last; /* end of the IOVA range */
>
> Please clarify whether this describes a closed range [start, last] or an
> open range [start, last).
>

OK.
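
For reference, the practical difference (assuming the intent is a closed range, which the field name "last" suggests) shows up when computing the size and when describing a region that ends at the top of the 64-bit space:

    /* Closed range [start, last]: last = start + size - 1, so a range
     * ending at 0xffffffffffffffff is representable. */
    __u64 size = range.last - range.start + 1;

    /* Half-open range [start, last): size = last - start, but a range
     * ending at the top of the 64-bit address space cannot be expressed. */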

> > +};
> > +
> > +struct vduse_dev_request {
> > + __u32 type; /* request type */
> > + __u32 request_id; /* request id */
> > +#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0) /* No need to reply */
> > + __u32 flags; /* request flags */
> > + __u32 reserved; /* for future use */
> > + union {
> > + struct vduse_vq_state vq_state; /* virtqueue state */
> > + struct vduse_iova_range iova; /* iova range for updating */
> > + __u32 padding[16]; /* padding */
> > + };
> > +};
> > +
> > +struct vduse_dev_response {
> > + __u32 request_id; /* corresponding request id */
> > +#define VDUSE_REQ_RESULT_OK  0x00
> > +#define VDUSE_REQ_RESULT_FAILED  0x01
> > + __u32 result; /* the result of request */
> > + __u32 reserved[2]; /* for future use */
> > + union {
> > + struct vduse_vq_state vq_state; /* virtqueue state */
> > + __u32 padding[16]; /* padding */
> > + };
> > +};
> > +
> > +/* ioctls */
> > +
> > +struct vduse_dev_config {
> > + char name[VDUSE_NAME_MAX]; /* vduse device name */
> > + __u32 vendor_id; /* virtio vendor id */
> > + __u32 device_id; /* virtio device id */
> > + __u64 features; /* device features */
> > + __u64 bounce_size; /* bounce buffer size for iommu */
> > + __u16 vq_size_max; /* the max size of virtqueue */
>
> The VIRTIO specification allows per-virtqueue sizes. A device can have
> two virtqueues, where the first one allows up to 1024 descriptors and
> the second one allows only 128 descriptors, for example.
>

Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
that now. All virtqueues have the same maximum size.

> This constant seems to impose the constraint that all virtqueues have
> the same maximum size. Is this really necessary?
>

This will be used by vring_create_virtqueue(). We need to specify the
maximum queue size supported by the device.

> > + __u16 padding; /* padding */
> > + __u32 vq_num; /* the number of virtqueues */
> > + __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>
> I'm not sure what this is?
>

 This will be used by vring_create_virtqueue() too.
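
For context, a sketch of how these two fields would feed the ring allocation on the virtio-vdpa side; the vring_create_virtqueue() signature is from the kernel around this series' timeframe, and the surrounding mapping is illustrative:

    /* vq_size_max bounds the ring size ("num"), vq_align becomes the
     * vring metadata alignment ("vring_align"). */
    vq = vring_create_virtqueue(index,
                                dev_config->vq_size_max, /* num */
                                dev_config->vq_align,    /* vring_align */
                                vdev,
                                true,   /* weak_barriers */
                                true,   /* may_reduce_num */
                                false,  /* ctx */
                                notify, callback, name);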

> > + __u32 config_size; /* the size of the configuration space */
> > + __u32 reserved[15]; /* for future use */
> > + __u8 config[0]; /* the buffer of the configuration space */
> > +};
> > +
> > +struct vduse_iotlb_entry {
> > + __u64 offset; /* the mmap offset on fd */
> > + __u64 start; /* start of the IOVA range */
> > + __u64 last; /* last of the IOVA range */
>
> Same here, please specify whether this is an open range or a closed
> range.
>

Sure.

> > +#define VDUSE_ACCESS_RO 0x1
> > +#define VDUSE_ACCESS_WO 0x2
> > +#define VDUSE_ACCESS_RW 0x3
> > + __u8 perm; /* access permission of this range */
> > +};
> > +
> > +struct vduse_config_update {
> > + __u32 offset; /* offset from the beginning of configuration space */
> > + __u32 length; /* the length to write to configuration space */
> > + __u8 buffer[0]; /* buffer used to write from */
> > +};
> > +
> > +struct vduse_vq_info {
> > + __u32 index; /* virtqueue index */
> > + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> > + __u64 desc_addr; /* address of desc area */
> > + __u64 driver_addr; /* address of driver area */
> > + __u64 device_addr; /* address of device area */
> > + __u32 num; /* the size of virtqueue */
> > + __u8 ready; /* ready status of 

Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Yongji Xie
On Mon, Jun 28, 2021 at 12:40 PM Jason Wang  wrote:
>
>
> On 2021/6/25 12:19 PM, Yongji Xie wrote:
> >> 2b) for set_status(): simply relay the message to userspace; no reply is
> >> needed. Userspace will use a command to update the status when the
> >> datapath is stopped. Then the status could be fetched via get_stats().
> >>
> >> 2b looks more spec compliant.
> >>
> > Looks good to me. And I think we can use the reply of the message to
> > update the status instead of introducing a new command.
> >
>
> Just notice this part in virtio_finalize_features():
>
>  virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
>  status = dev->config->get_status(dev);
>  if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
>
> So no reply doesn't work for FEATURES_OK.
>
> So my understanding is:
>
> 1) We must not use noreply for set_status()
> 2) We can use noreply for get_status(), but it requires a new ioctl to
> update the status.
>
> So it looks to me that we need to synchronize both get_status() and
> set_status().
>

We should not send messages to userspace in the FEATURES_OK case. So
the synchronization is not necessary.

Thanks,
Yongji

RE: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, June 29, 2021 6:32 AM
> 
> On Mon, 28 Jun 2021 01:09:18 +
> "Tian, Kevin"  wrote:
> 
> > > From: Jason Gunthorpe 
> > > Sent: Friday, June 25, 2021 10:36 PM
> > >
> > > On Fri, Jun 25, 2021 at 10:27:18AM +, Tian, Kevin wrote:
> > >
> > > > -   When receiving the binding call for the 1st device in a group,
> iommu_fd
> > > > calls iommu_group_set_block_dma(group, dev->driver) which does
> > > > several things:
> > >
> > > The whole problem here is trying to match this new world where we want
> > > devices to be in charge of their own IOMMU configuration and the old
> > > world where groups are in charge.
> > >
> > > Inserting the group fd and then calling a device-centric
> > > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> > > necessary.
> >
> > No, this was not what I meant. There is no group fd required when
> > calling this device-centric interface. I was actually talking about:
> >
> > iommu_group_set_block_dma(dev->group, dev->driver)
> >
> > just because current iommu layer API is group-centric. Whether this
> > should be improved could be next-level thing. Sorry for not making
> > it clear in the first place.
> >
> > > We can always get the group back from the device at any
> > > point in the sequence to do a group-wide operation.
> >
> > yes.
> >
> > >
> > > What I saw as the appeal of the sort of idea was to just completely
> > > leave all the difficult multi-device-group scenarios behind on the old
> > > group centric API and then we don't have to deal with them at all, or
> > > least not right away.
> >
> > yes, this is the staged approach that we discussed earlier. and
> > the reason why I refined this proposal about multi-devices group
> > here is because you want to see some confidence along this
> > direction. Thus I expanded your idea and hope to achieve consensus
> > with Alex/Joerg who obviously have not been convinced yet.
> >
> > >
> > > I'd see some progression where iommu_fd only works with 1:1 groups at
> > > the start. Other scenarios continue with the old API.
> >
> > One uAPI open after completing this new sketch. v1 proposed to
> > conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired.
> > With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW
> > to complete both in one step. I want to get Alex's confirmation whether
> > it sounds good to him, since it's better to unify the uAPI between 1:1
> > group and 1:N group even if we don't support 1:N in the start.
> 
> I don't like it.  It doesn't make sense to me.  You have the
> group-centric world, which must continue to exist and cannot change
> because we cannot break the vfio uapi.  We can make extensions, we can
> define a new parallel uapi, we can deprecate the uapi, but in the short
> term, it can't change.
> 
> AIUI, the new device-centric model starts with vfio device files that
> can be opened directly.  So what then is the purpose of a *GROUP* get
> device fd?  Why is a vfio uapi involved in setting a device cookie for
> another subsystem?
> 
> I'd expect that /dev/iommu will be used by multiple subsystems.  All
> will want to bind devices to address spaces, so shouldn't binding a
> device to an iommufd be an ioctl on the iommufd, ie.
> IOMMU_BIND_VFIO_DEVICE_FD.  Maybe we don't even need "VFIO" in there
> and
> the iommufd code can figure it out internally.
> 
> You're essentially trying to reduce vfio to the device interface.  That
> necessarily implies that ioctls on the container, group, or passed
> through the container to the iommu no longer exist.  From my
> perspective, there should ideally be no new vfio ioctls.  The user gets
> a limited access vfio device fd and enables full access to the device
> by registering it to the iommufd subsystem (100% this needs to be
> enforced until close() to avoid revoke issues).  The user interacts
> exclusively with vfio via the device fd and performs all DMA address
> space related operations through the iommufd.
> 
> > > Then maybe groups where all devices use the same IOASID.
> > >
> > > Then 1:N groups if the source device is reliably identifiable, this
> > > requires iommu subsystem work to attach domains to sub-group objects -
> > > not sure it is worthwhile.
> > >
> > > But at least we can talk about each step with well thought out patches
> > >
> > > The only thing that needs to be done to get the 1:1 step is to broadly
> > > define how the other two cases will work so we don't get into trouble
> > > and set some way to exclude the problematic cases from even getting to
> > > iommu_fd in the first place.
> > >
> > > For instance if we go ahead and create /dev/vfio/device nodes we could
> > > do this only if the group was 1:1, otherwise the group cdev has to be
> > > used, along with its API.
> >
> > I feel for VFIO possibly we don't need significant change to its uAPI
> > sequence, since it anyway needs to support existing semantics for
> > backward compatibility. With 

RE: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, June 29, 2021 7:09 AM
> 
> > > Ideally vfio would also at least be able to register a type1 IOMMU
> > > backend through the existing uapi, backed by this iommu code, ie. we'd
> > > create a new "iommufd" (but without the user visible fd),
> >
> > It would be amazing to be able to achieve this, at least for me there
> > are too many details be able to tell what that would look like
> > exactly. I suggested once that putting the container ioctl interface
> > in the drivers/iommu code may allow for this without too much trouble..
> 
> If we can't achieve this, then type1 legacy code is going to need to
> live through an extended deprecation period.  I'm hoping that between
> type1 and a native interface we'll have two paths into iommufd to vet
> the design.  Thanks,
> 

I prefer keeping the type1 legacy code for some time. After the new
device-centric flow works, we can then study how to make type1
a shim layer on top.

Thanks
Kevin


RE: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Tuesday, June 29, 2021 7:13 AM
> 
> On Mon, Jun 28, 2021 at 05:09:02PM -0600, Alex Williamson wrote:
> > On Mon, 28 Jun 2021 19:48:18 -0300
> > Jason Gunthorpe  wrote:
> >
> > > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:
> > >
> > > > I'd expect that /dev/iommu will be used by multiple subsystems.  All
> > > > will want to bind devices to address spaces, so shouldn't binding a
> > > > device to an iommufd be an ioctl on the iommufd, ie.
> > > > IOMMU_BIND_VFIO_DEVICE_FD.  Maybe we don't even need "VFIO" in
> there and
> > > > the iommufd code can figure it out internally.
> > >
> > > It wants to be the other way around because iommu_fd is the lower
> > > level subsystem. We don't/can't teach iommu_fd how to convert a fd
> > > number to a vfio/vdpa/etc/etc, we teach all the things building on
> > > iommu_fd how to change a fd number to an iommu - they already
> > > necessarily have an inter-module linkage.
> >
> > These seem like peer subsystems, like vfio and kvm.  vfio shouldn't
> > have any hard dependencies on the iommufd module, especially so long as
> > we have the legacy type1 code.
> 
> It does, the vfio_device implementation has to tell the iommu subsystem
> what kind of device behavior it has and possibly interact with the
> iommu subsystem with it in cases like PASID. This was outlined in part
> of the RFC.

Right. PASID is managed by the specific device driver in this RFC and provided
as routing information to iommu_fd when the device is attached to an
IOASID. Another point is about PASID virtualization (vPASID->pPASID),
which is established by having the user register its vPASID when doing
the attach call. The vfio device driver needs to use this information in the
mediation path. In concept vPASID is not relevant to iommu_fd, which only
cares about pPASID. Having vPASID registered via the iommu_fd uAPI and
then indirectly communicated to the vfio device driver does not look like
a clean way to do it in the first place.

Thanks
Kevin


Re: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Jason Gunthorpe
On Mon, Jun 28, 2021 at 05:09:02PM -0600, Alex Williamson wrote:
> On Mon, 28 Jun 2021 19:48:18 -0300
> Jason Gunthorpe  wrote:
> 
> > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:
> > 
> > > I'd expect that /dev/iommu will be used by multiple subsystems.  All
> > > will want to bind devices to address spaces, so shouldn't binding a
> > > device to an iommufd be an ioctl on the iommufd, ie.
> > > IOMMU_BIND_VFIO_DEVICE_FD.  Maybe we don't even need "VFIO" in there and
> > > the iommufd code can figure it out internally.  
> > 
> > It wants to be the other way around because iommu_fd is the lower
> > level subsystem. We don't/can't teach iommu_fd how to convert a fd
> > number to a vfio/vdpa/etc/etc, we teach all the things building on
> > iommu_fd how to change a fd number to an iommu - they already
> > necessarily have an inter-module linkage.
> 
> These seem like peer subsystems, like vfio and kvm.  vfio shouldn't
> have any hard dependencies on the iommufd module, especially so long as
> we have the legacy type1 code.

It does, the vfio_device implementation has to tell the iommu subsystem
what kind of device behavior it has and possibly interact with the
iommu subsystem with it in cases like PASID. This was outlined in part
of the RFC.

In any event a module dependency from vfio to iommu is not bothersome,
while the other way certainly is.

> Likewise iommufd shouldn't have any on vfio.  As much as you
> dislike the symbol_get hack of the kvm-vfio device, it would be
> reasonable for iommufd to reach for a vfio symbol when an
> IOMMU_BIND_VFIO_DEVICE_FD ioctl is called.

We'd have to add a special ioctl to iommu for every new subsystem, it
doesn't scale. iommu is a core subsystem, vfio is a driver subsystem.
The direction of dependency is clear, I think.

Jason


Re: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Alex Williamson
On Mon, 28 Jun 2021 19:48:18 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:
> 
> > I'd expect that /dev/iommu will be used by multiple subsystems.  All
> > will want to bind devices to address spaces, so shouldn't binding a
> > device to an iommufd be an ioctl on the iommufd, ie.
> > IOMMU_BIND_VFIO_DEVICE_FD.  Maybe we don't even need "VFIO" in there and
> > the iommufd code can figure it out internally.  
> 
> It wants to be the other way around because iommu_fd is the lower
> level subsystem. We don't/can't teach iommu_fd how to convert a fd
> number to a vfio/vdpa/etc/etc, we teach all the things building on
> iommu_fd how to change a fd number to an iommu - they already
> necessarily have an inter-module linkage.

These seem like peer subsystems, like vfio and kvm.  vfio shouldn't
have any hard dependencies on the iommufd module, especially so long as
we have the legacy type1 code.  Likewise iommufd shouldn't have any on
vfio.  As much as you dislike the symbol_get hack of the kvm-vfio
device, it would be reasonable for iommufd to reach for a vfio symbol
when an IOMMU_BIND_VFIO_DEVICE_FD ioctl is called.

> There is a certain niceness to what you are saying but it is not so
> practical without doing something bigger..
> 
> > Ideally vfio would also at least be able to register a type1 IOMMU
> > backend through the existing uapi, backed by this iommu code, ie. we'd
> > create a new "iommufd" (but without the user visible fd),   
> 
> It would be amazing to be able to achieve this, at least for me there
> are too many details to be able to tell what that would look like
> exactly. I suggested once that putting the container ioctl interface
> in the drivers/iommu code may allow for this without too much trouble..

If we can't achieve this, then type1 legacy code is going to need to
live through an extended deprecation period.  I'm hoping that between
type1 and a native interface we'll have two paths into iommufd to vet
the design.  Thanks,

Alex



Re: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Jason Gunthorpe
On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:

> I'd expect that /dev/iommu will be used by multiple subsystems.  All
> will want to bind devices to address spaces, so shouldn't binding a
> device to an iommufd be an ioctl on the iommufd, ie.
> IOMMU_BIND_VFIO_DEVICE_FD.  Maybe we don't even need "VFIO" in there and
> the iommufd code can figure it out internally.

It wants to be the other way around because iommu_fd is the lower
level subsystem. We don't/can't teach iommu_fd how to convert a fd
number to a vfio/vdpa/etc/etc, we teach all the things building on
iommu_fd how to change a fd number to an iommu - they already
necessarily have an inter-module linkage.

There is a certain niceness to what you are saying but it is not so
practical without doing something bigger..

> Ideally vfio would also at least be able to register a type1 IOMMU
> backend through the existing uapi, backed by this iommu code, ie. we'd
> create a new "iommufd" (but without the user visible fd), 

It would be amazing to be able to achieve this, at least for me there
are too many details to be able to tell what that would look like
exactly. I suggested once that putting the container ioctl interface
in the drivers/iommu code may allow for this without too much trouble..

Jason


Re: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Alex Williamson
On Mon, 28 Jun 2021 01:09:18 +
"Tian, Kevin"  wrote:

> > From: Jason Gunthorpe 
> > Sent: Friday, June 25, 2021 10:36 PM
> > 
> > On Fri, Jun 25, 2021 at 10:27:18AM +, Tian, Kevin wrote:
> >   
> > > -   When receiving the binding call for the 1st device in a group, 
> > > iommu_fd
> > > calls iommu_group_set_block_dma(group, dev->driver) which does
> > > several things:  
> > 
> > The whole problem here is trying to match this new world where we want
> > devices to be in charge of their own IOMMU configuration and the old
> > world where groups are in charge.
> > 
> > Inserting the group fd and then calling a device-centric
> > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> > necessary.   
> 
> No, this was not what I meant. There is no group fd required when
> calling this device-centric interface. I was actually talking about:
> 
>   iommu_group_set_block_dma(dev->group, dev->driver)
> 
> just because current iommu layer API is group-centric. Whether this
> should be improved could be next-level thing. Sorry for not making
> it clear in the first place.
> 
> > We can always get the group back from the device at any
> > point in the sequence to do a group-wide operation.
> 
> yes.
> 
> > 
> > What I saw as the appeal of the sort of idea was to just completely
> > leave all the difficult multi-device-group scenarios behind on the old
> > group centric API and then we don't have to deal with them at all, or
> > least not right away.  
> 
> yes, this is the staged approach that we discussed earlier. and
> the reason why I refined this proposal about multi-devices group 
> here is because you want to see some confidence along this
> direction. Thus I expanded your idea and hope to achieve consensus
> with Alex/Joerg who obviously have not been convinced yet.
> 
> > 
> > I'd see some progression where iommu_fd only works with 1:1 groups at
> > the start. Other scenarios continue with the old API.  
> 
> One uAPI open after completing this new sketch. v1 proposed to
> conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired.
> With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW
> to complete both in one step. I want to get Alex's confirmation whether
> it sounds good to him, since it's better to unify the uAPI between 1:1 
> group and 1:N group even if we don't support 1:N in the start. 

I don't like it.  It doesn't make sense to me.  You have the
group-centric world, which must continue to exist and cannot change
because we cannot break the vfio uapi.  We can make extensions, we can
define a new parallel uapi, we can deprecate the uapi, but in the short
term, it can't change.

AIUI, the new device-centric model starts with vfio device files that
can be opened directly.  So what then is the purpose of a *GROUP* get
device fd?  Why is a vfio uapi involved in setting a device cookie for
another subsystem?

I'd expect that /dev/iommu will be used by multiple subsystems.  All
will want to bind devices to address spaces, so shouldn't binding a
device to an iommufd be an ioctl on the iommufd, ie.
IOMMU_BIND_VFIO_DEVICE_FD.  Maybe we don't even need "VFIO" in there and
the iommufd code can figure it out internally.

You're essentially trying to reduce vfio to the device interface.  That
necessarily implies that ioctls on the container, group, or passed
through the container to the iommu no longer exist.  From my
perspective, there should ideally be no new vfio ioctls.  The user gets
a limited access vfio device fd and enables full access to the device
by registering it to the iommufd subsystem (100% this needs to be
enforced until close() to avoid revoke issues).  The user interacts
exclusively with vfio via the device fd and performs all DMA address
space related operations through the iommufd.

> > Then maybe groups where all devices use the same IOASID.
> > 
> > Then 1:N groups if the source device is reliably identifiable, this
> > requires iommu subsystem work to attach domains to sub-group objects -
> > not sure it is worthwhile.
> > 
> > But at least we can talk about each step with well thought out patches
> > 
> > The only thing that needs to be done to get the 1:1 step is to broadly
> > define how the other two cases will work so we don't get into trouble
> > and set some way to exclude the problematic cases from even getting to
> > iommu_fd in the first place.
> > 
> > For instance if we go ahead and create /dev/vfio/device nodes we could
> > do this only if the group was 1:1, otherwise the group cdev has to be
> > used, along with its API.  
> 
> I feel for VFIO possibly we don't need significant change to its uAPI 
> sequence, since it anyway needs to support existing semantics for 
> backward compatibility. With this sketch we can keep vfio container/
> group by introducing an external iommu type which implies a different
> GET_DEVICE_FD semantics. /dev/iommu can report a fd-wide capability
> for whether 1:N group is supported to vfio user.


Re: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Jason Gunthorpe
On Mon, Jun 28, 2021 at 06:45:23AM +, Tian, Kevin wrote:

> 7)   Unbinding detaches the device from the block_dma domain and
>      re-attaches it to the default domain. From now on the user should
>      be denied access to the device. vfio should tear down the
>      MMIO mapping at this point.

I think we should just forbid this, so long as the device_fd is open
the iommu_fd cannot be destroyed and there is no way to detach a
device other than closing its fd.

revoke is tricky enough to implement we should avoid it.

> It's still an open question whether we want to further allow devices within a
> group to be attached to different IOASIDs in case the source devices are reliably
> identifiable. This is a usage not supported by existing vfio and might not be
> worthwhile due to improved isolation over time.

The main decision here is to decide if the uAPI should have some way to
indicate that a device does not have its own unique IOASID but is
sharing with the group

Jason


Re: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Jason Gunthorpe
On Mon, Jun 28, 2021 at 02:03:56AM +, Tian, Kevin wrote:

> Combining with the last paragraph above, are you actually suggesting 
> that 1:1 group (including mdev) should use a new device-centric vfio 
> uAPI (without group fd) while existing group-centric vfio uAPI is only 
> kept for 1:N group (with slight semantics change in my sketch to match 
> device-centric iommu fd API)?

Yes, this is one approach

Using a VFIO_GROUP_GET_DEVICE_FD_NEW on the group FD is another
option, but locks us into having the group FD.

Which is better possibly depends on some details when going through
the code transformation, though I prefer not to design assuming the
group FD must exist.

> (not via an indirect group ioctl). Then it implies that we may have to allow 
> the user to open a device before it is put into a security context, thus the
> safeguard may have to be enabled on mmap() for a 1:1 group. This is a
> different sequence from the existing group-centric model.

Yes, but I think this is fairly minor, it would just start with a
dummy fops and move to operational fops once things are setup enough.

Jason


Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> This series introduces a framework that makes it possible to implement
> software-emulated vDPA devices in userspace. And to make it simple, the
> emulated vDPA device's control path is handled in the kernel and only the
> data path is implemented in the userspace.

This looks interesting. Unfortunately I don't have enough time to do a
full review, but I looked at the documentation and uapi header file to
give feedback on the userspace ABI.

Stefan



Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index ..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include 
> +
> +#define VDUSE_API_VERSION 0
> +
> +#define VDUSE_NAME_MAX   256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> + /* Get the state for virtqueue from userspace */
> + VDUSE_GET_VQ_STATE,
> + /* Notify userspace to start the dataplane, no reply */
> + VDUSE_START_DATAPLANE,
> + /* Notify userspace to stop the dataplane, no reply */
> + VDUSE_STOP_DATAPLANE,
> + /* Notify userspace to update the memory mapping in device IOTLB */
> + VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> + __u32 index; /* virtqueue index */
> + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};
> +
> +struct vduse_iova_range {
> + __u64 start; /* start of the IOVA range */
> + __u64 last; /* end of the IOVA range */

Please clarify whether this describes a closed range [start, last] or an
open range [start, last).

> +};
> +
> +struct vduse_dev_request {
> + __u32 type; /* request type */
> + __u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0) /* No need to reply */
> + __u32 flags; /* request flags */
> + __u32 reserved; /* for future use */
> + union {
> + struct vduse_vq_state vq_state; /* virtqueue state */
> + struct vduse_iova_range iova; /* iova range for updating */
> + __u32 padding[16]; /* padding */
> + };
> +};
> +
> +struct vduse_dev_response {
> + __u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK  0x00
> +#define VDUSE_REQ_RESULT_FAILED  0x01
> + __u32 result; /* the result of request */
> + __u32 reserved[2]; /* for future use */
> + union {
> + struct vduse_vq_state vq_state; /* virtqueue state */
> + __u32 padding[16]; /* padding */
> + };
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> + char name[VDUSE_NAME_MAX]; /* vduse device name */
> + __u32 vendor_id; /* virtio vendor id */
> + __u32 device_id; /* virtio device id */
> + __u64 features; /* device features */
> + __u64 bounce_size; /* bounce buffer size for iommu */
> + __u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have
two virtqueues, where the first one allows up to 1024 descriptors and
the second one allows only 128 descriptors, for example.

This constant seems to impose the constraint that all virtqueues have
the same maximum size. Is this really necessary?

> + __u16 padding; /* padding */
> + __u32 vq_num; /* the number of virtqueues */
> + __u32 vq_align; /* the allocation alignment of virtqueue's metadata */

I'm not sure what this is?

> + __u32 config_size; /* the size of the configuration space */
> + __u32 reserved[15]; /* for future use */
> + __u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> + __u64 offset; /* the mmap offset on fd */
> + __u64 start; /* start of the IOVA range */
> + __u64 last; /* last of the IOVA range */

Same here, please specify whether this is an open range or a closed
range.

> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> + __u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> + __u32 offset; /* offset from the beginning of configuration space */
> + __u32 length; /* the length to write to configuration space */
> + __u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> + __u32 index; /* virtqueue index */
> + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> + __u64 desc_addr; /* address of desc area */
> + __u64 driver_addr; /* address of driver area */
> + __u64 device_addr; /* address of device area */
> + __u32 num; /* the size of virtqueue */
> + __u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> + __u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> + int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE   0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION _IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION _IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device 
> (/dev/vduse/) */
> +#define 
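
Putting the pieces above together, a minimal sketch of the userspace message loop on /dev/vduse/$NAME: read() a vduse_dev_request, handle it, and write() a vduse_dev_response unless the request carries VDUSE_REQ_FLAGS_NO_REPLY. Real code would plug in its virtqueue bookkeeping and error handling.

    #include <stdbool.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/vduse.h>

    static void vduse_message_loop(int dev_fd, bool *running)
    {
            struct vduse_dev_request req;
            struct vduse_dev_response resp;

            while (read(dev_fd, &req, sizeof(req)) == sizeof(req)) {
                    memset(&resp, 0, sizeof(resp));
                    resp.request_id = req.request_id;
                    resp.result = VDUSE_REQ_RESULT_OK;

                    switch (req.type) {
                    case VDUSE_GET_VQ_STATE:
                            /* Report last_avail_idx for the requested vq. */
                            resp.vq_state.index = req.vq_state.index;
                            resp.vq_state.avail_idx = 0; /* from vq bookkeeping */
                            break;
                    case VDUSE_START_DATAPLANE:
                            *running = true;
                            break;
                    case VDUSE_STOP_DATAPLANE:
                            *running = false;
                            break;
                    case VDUSE_UPDATE_IOTLB:
                            /* Re-map [req.iova.start, req.iova.last] as needed. */
                            break;
                    }

                    if (!(req.flags & VDUSE_REQ_FLAGS_NO_REPLY))
                            write(dev_fd, &resp, sizeof(resp));
            }
    }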

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji 
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 222 
> ++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst
> 
> diff --git a/Documentation/userspace-api/index.rst 
> b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..c432be070f67 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -27,6 +27,7 @@ place where this information is gathered.
> iommu
> media/index
> sysfs-platform_profile
> +   vduse
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index ..2f9cd1a4e530
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,222 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace. And
> +to make it simple, the emulated vDPA device's control path is handled in
> +the kernel and only the data path is implemented in the userspace.
> +
> +Note that only virtio block device is supported by VDUSE framework now,
> +which can reduce security risks when the userspace process that implements
> +the data path is run by an unprivileged user. Support for other device
> +types can be added after the security issue is clarified or fixed in the 
> future.
> +
> +Start/Stop VDUSE devices
> +------------------------
> +
> +VDUSE devices are started as follows:
> +
> +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +   /dev/vduse/control.
> +
> +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> +
> +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> +   instance to vDPA bus.
> +
> +VDUSE devices are stopped as follows:
> +
> +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> +   instance from vDPA bus.
> +
> +2. Close the file descriptor referring to /dev/vduse/$NAME
> +
> +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> +   /dev/vduse/control
> +
> +The netlink messages mentioned above can be sent via the vdpa tool in iproute2
> +or with the below sample code:
> +
> +.. code-block:: c
> +
> + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> + {
> + struct nl_sock *nlsock;
> + struct nl_msg *msg;
> + int famid;
> +
> + nlsock = nl_socket_alloc();
> + if (!nlsock)
> + return -ENOMEM;
> +
> + if (genl_connect(nlsock))
> + goto free_sock;
> +
> + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> + if (famid < 0)
> + goto close_sock;
> +
> + msg = nlmsg_alloc();
> + if (!msg)
> + goto close_sock;
> +
> + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> cmd, 0))
> + goto nla_put_failure;
> +
> + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> + if (cmd == VDPA_CMD_DEV_NEW)
> + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> "vduse");
> +
> + if (nl_send_sync(nlsock, msg))
> + goto close_sock;
> +
> + nl_close(nlsock);
> + nl_socket_free(nlsock);
> +
> + return 0;
> + nla_put_failure:
> + nlmsg_free(msg);
> + close_sock:
> + nl_close(nlsock);
> + free_sock:
> + nl_socket_free(nlsock);
> + return -1;
> + }
> +
> +How VDUSE works
> +---------------
> +
> +Since the emuldated vDPA device's control path is handled in the kernel,

s/emuldated/emulated/

> +a message-based communication protocol and a few types of control messages
> +are introduced by VDUSE framework to make userspace be aware of the data
> +path related changes:
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> +
> +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> +
> +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update 

Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Yongji Xie
On Mon, 28 Jun 2021 at 10:55, Liu Xiaodong  wrote:
>
> On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> >
> > This series introduces a framework that makes it possible to implement
> > software-emulated vDPA devices in userspace. And to make it simple, the
> > emulated vDPA device's control path is handled in the kernel and only the
> > data path is implemented in the userspace.
> >
> > Since the emulated vDPA device's control path is handled in the kernel,
> > a message mechanism is introduced to make userspace aware of the data
> > path related changes. Userspace can use read()/write() to receive/reply to
> > the control messages.
> >
> > In the data path, the core maps the DMA buffer into the VDUSE daemon's
> > address space, which can be implemented in different ways depending on
> > the vdpa bus to which the vDPA device is attached.
> >
> > In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
> > a bounce-buffering mechanism to achieve that. And in the vhost-vdpa case, the
> > DMA buffer resides in a userspace memory region which can be shared with the
> > VDUSE userspace process via transferring the shmfd.
> >
> > The details and our use case are shown below:
> >
> > [The ASCII architecture diagram from the cover letter is garbled in this
> > archive. It showed two consumers of a VDUSE-backed vDPA device: a container
> > whose /dev/vdx goes through the virtio-blk driver, the virtio bus and the
> > virtio-vdpa driver, and a QEMU VM whose /dev/vhost-vdpa-x goes through the
> > vhost-vdpa driver; both reach a vdpa device on the vdpa bus served by the
> > vduse driver, while the userspace VDUSE daemon implements the vDPA device
> > emulation and a block driver that connects over TCP/IP (via the NIC) to
> > remote storages.]
> > We make use of it to implement a block device connecting to
> > our distributed storage, which can be used both 

Re: [PATCH v4] iommu/of: Fix pci_request_acs() before enumerating PCI devices

2021-06-28 Thread Xingang Wang

Hi Bjorn,

I would like to add more explanation about what problem this patch
fixed.

I am testing the SVA/vSVA series patches written by @Jean and @Brucker.
I test with the following qemu command line, with a hisilicon SEC device
attached on a pcie-root-port.

$QEMU/qemu-system-aarch64 \
-enable-kvm \
-kernel $LINUX/arch/arm64/boot/Image \
-m 16G \
-smp cores=8,threads=1,sockets=2\
-machine virt,kernel_irqchip=on,gic-version=3,iommu=smmuv3 \
-device 
pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,addr=0x1 \

-device vfio-pci,host=75:00.0,bus=pci.1,addr=0x0,id=acc2 \
-net none \
-initrd ./mfs.cpio.gz \
-cpu host \
-nographic \
-append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900" \

And I got the guest PCI configuration:
00:00.0 Class 0600: Device 1b36:0008# root bus
00:01.0 Class 0604: Device 1b36:000c# root port
Capabilities: [148 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ 
EgressCtrl- DirectTrans+
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- 
EgressCtrl- DirectTrans-

Kernel driver in use: pcieport
01:00.0 Class 1000: Device 19e5:a255 (rev 21)   # SEC

The PCI configuration shows that the ACS of the pcie root port is
not enabled, while it should have been.

Then when I insmod the device driver and initialize the SVA feature, I got

[   24.342450] hisi_sec2 :01:00.0: cannot attach to incompatible 
domain (0 SSID bits != 10)

[   24.343731] hisi_sec2 :01:00.0: Failed to add to iommu group 0: -22
[   24.345243] hisi_sec2 :01:00.0: enabling device ( -> 0002)
qemu-system-aarch64: vfio_enable_vectors failed to register S1 MSI 
binding for vector 0(-2)
qemu-system-aarch64: vfio: Error: Failed to setup MSI fds: Interrupted 
system call

qemu-system-aarch64: vfio: Error: Failed to enable MSI

I figured out that this error occurs in arm_smmu_attach_dev()
when checking ssid_bits for the SVA feature:
master->ssid_bits != smmu_domain->s1_cfg.s1cdmax caused this
problem. This is because ACS of the pcie-root-port is not enabled, so the
pcie-root-port and the SEC device share the same domain.
And SEC's ssid_bits is 10, while the pcie-root-port's s1cdmax is zero,
which causes the problem.


And as for why ACS is not enabled in the kernel, I have explained it as
follows:


On 2021/6/7 20:58, Xingang Wang wrote:

On 2021/6/5 3:04, Bjorn Helgaas wrote:

[+cc John, who tested 6bf6c24720d3]

On Fri, May 21, 2021 at 03:03:24AM +, Wang Xingang wrote:

From: Xingang Wang 

When booting with devicetree, pci_request_acs() is called after the
enumeration and initialization of PCI devices, thus ACS is not
enabled. ACS should be enabled when an IOMMU is detected for the
PCI host bridge, so add a check for the IOMMU before probing the PCI
host and call pci_request_acs() to make sure ACS will be enabled when
enumerating PCI devices.


I'm happy to apply this, but I'm a little puzzled about 6bf6c24720d3
("iommu/of: Request ACS from the PCI core when configuring IOMMU
linkage").  It was tested and fixed a problem, but I don't understand
how.

6bf6c24720d3 added the call to pci_request_acs() in
of_iommu_configure() so it currently looks like this:

   of_iommu_configure(dev, ...)
   {
 if (dev_is_pci(dev))
   pci_request_acs();

pci_request_acs() sets pci_acs_enable, which tells us to enable ACS
when enumerating PCI devices in the future.  But we only call
pci_request_acs() if we already *have* a PCI device.

So maybe 6bf6c24720d3 fixed a problem for *some* PCI devices, but not
all?  E.g., did we call of_iommu_configure() for one PCI device before
enumerating the rest?


I test the kernel on an arm platform with qemu:

qemu-system-aarch64 \
  -cpu host \
  -kernel arch/arm64/boot/Image \
  -enable-kvm \
  -m 8G \
  -smp 2,sockets=2,cores=1,threads=1 \
  -machine virt,kernel_irqchip=on,gic-version=3,iommu=smmuv3\
  -initrd rootfs.cpio.gz \
  -nographic \
  -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900 nokaslr" \
  -device pcie-root-port,port=0x1,chassis=1,id=pci.1,addr=0x8 \
  -netdev user,id=hostnet0 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=08:13:3a:5a:22:5b,bus=pci.1,addr=0x0 \


And I find that of_iommu_configure() is called after the enumeration
of the pcie-root-port. This might only affect the first device; when
enumerating the rest of the devices, pci_acs_enable has already been set.

But to make sure that pci_acs_enable is always set before any PCI
device is enumerated, it would be better to set it during the
initialization of the PCI host bridge.

Thanks

Xingang


Fixes: 6bf6c24720d33 ("iommu/of: Request ACS from the PCI core when
configuring IOMMU linkage")
Signed-off-by: Xingang Wang 
---
  drivers/iommu/of_iommu.c | 1 -
  drivers/pci/of.c | 8 +++-
  2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index a9d2df001149..54a14da242cc 100644
--- 
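
(The diff above is cut off in the archive.) A rough, hypothetical sketch
of the idea follows; the helper name and hook point are invented for
illustration and are not the actual patch. The point is to request ACS
while the OF host bridge is being set up, before any child device is
enumerated, instead of per device in of_iommu_configure():

	/* Called from the OF PCI host bridge setup path (hypothetical hook). */
	static void of_pci_check_iommu_acs(struct device_node *bridge_np)
	{
		/*
		 * An "iommu-map" property on the host bridge node means child
		 * devices will be translated by an IOMMU, so request ACS now,
		 * before those devices are enumerated.
		 */
		if (of_property_read_bool(bridge_np, "iommu-map"))
			pci_request_acs();
	}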

Re: [PATCH v4 1/3] iommu: io-pgtable: add DART pagetable format

2021-06-28 Thread Alexander Graf via iommu




On 27.06.21 16:34, Sven Peter wrote:


Apple's DART iommu uses a pagetable format that shares some
similarities with the ones already implemented by io-pgtable.c.
Add a new format variant to support the required differences
so that we don't have to duplicate the pagetable handling code.

Signed-off-by: Sven Peter 
---
  drivers/iommu/io-pgtable-arm.c | 62 ++
  drivers/iommu/io-pgtable.c |  1 +
  include/linux/io-pgtable.h |  7 
  3 files changed, 70 insertions(+)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..1dd5c45b4b5b 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -127,6 +127,9 @@
  #define ARM_MALI_LPAE_MEMATTR_IMP_DEF  0x88ULL
  #define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL

+#define APPLE_DART_PTE_PROT_NO_WRITE (1<<7)
+#define APPLE_DART_PTE_PROT_NO_READ (1<<8)
+
  /* IOPTE accessors */
  #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))

@@ -381,6 +384,15 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
  {
 arm_lpae_iopte pte;

+   if (data->iop.fmt == ARM_APPLE_DART) {
+   pte = 0;
+   if (!(prot & IOMMU_WRITE))
+   pte |= APPLE_DART_PTE_PROT_NO_WRITE;
+   if (!(prot & IOMMU_READ))
+   pte |= APPLE_DART_PTE_PROT_NO_READ;
+   return pte;


What about the other bits, such as shareability, XN, etc.? Do they not
exist on DART? Or have they not been reverse engineered and 0s happen to
"just work"?



+   }
+
 if (data->iop.fmt == ARM_64_LPAE_S1 ||
 data->iop.fmt == ARM_32_LPAE_S1) {
 pte = ARM_LPAE_PTE_nG;
@@ -1043,6 +1055,51 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, 
void *cookie)
 return NULL;
  }

+static struct io_pgtable *
+apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
+{
+   struct arm_lpae_io_pgtable *data;
+   int i;
+
+   if (cfg->oas > 36)
+   return NULL;
+
+   data = arm_lpae_alloc_pgtable(cfg);
+   if (!data)
+   return NULL;
+
+   /*
+* Apple's DART always requires three levels with the first level being
+* stored in four MMIO registers. We always concatenate the first and
+* second level so that we only have to setup the MMIO registers once.
+* This results in an effective two level pagetable.
+*/
+   if (data->start_level < 1)
+   return NULL;
+   if (data->start_level == 1 && data->pgd_bits > 2)
+   return NULL;
+   if (data->start_level > 1)
+   data->pgd_bits = 0;
+   data->start_level = 2;
+   cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits;


Maybe add a BUG_ON if n_ttbrs > ARRAY_SIZE(ttbr)? Or alternatively, do a 
normal runtime check and bail out then.
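
A minimal sketch of the suggested runtime check, assuming apple_dart_cfg
carries a fixed-size ttbr[] array alongside n_ttbrs (field names are
taken from the patch context and may differ):

	cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits;
	/* Bail out cleanly rather than BUG_ON() if the pagetable would need
	 * more TTBRs than the configuration can describe. */
	if (cfg->apple_dart_cfg.n_ttbrs > ARRAY_SIZE(cfg->apple_dart_cfg.ttbr)) {
		kfree(data);
		return NULL;
	}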



Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




RE: Plan for /dev/ioasid RFC v2

2021-06-28 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, June 25, 2021 10:36 PM
> 
> On Fri, Jun 25, 2021 at 10:27:18AM +, Tian, Kevin wrote:
> 
> > -   When receiving the binding call for the 1st device in a group, iommu_fd
> > calls iommu_group_set_block_dma(group, dev->driver) which does
> > several things:
> 
> The whole problem here is trying to match this new world where we want
> devices to be in charge of their own IOMMU configuration and the old
> world where groups are in charge.
> 
> Inserting the group fd and then calling a device-centric
> VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> necessary. We can always get the group back from the device at any
> point in the sequence to do a group-wide operation.
> 
> What I saw as the appeal of the sort of idea was to just completely
> leave all the difficult multi-device-group scenarios behind on the old
> group-centric API and then we don't have to deal with them at all, or
> at least not right away.
> 
> I'd see some progression where iommu_fd only works with 1:1 groups at
> the start. Other scenarios continue with the old API.
> 
> Then maybe groups where all devices use the same IOASID.
> 
> Then 1:N groups if the source device is reliably identifiable, this
> requires iommu subsystem work to attach domains to sub-group objects -
> not sure it is worthwhile.
> 
> But at least we can talk about each step with well thought out patches
> 
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
> 
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.
> 

Thinking more along your direction, here is an updated sketch:

[Stage-1]

Multi-device groups (1:N) are handled by the existing vfio group fd and
vfio_iommu_type1 driver.

Singleton groups (1:1) are handled via a new device-centric protocol:

1)   /dev/vfio/device nodes are created for devices in singleton group
   or devices w/o group (mdev)

2)   user gets iommu_fd by open("/dev/iommu"). A default block_dma
   domain is created per iommu_fd (or globally) with an empty I/O 
   page table. 

3)   iommu_fd reports that only 1:1 group is supported

4)   user gets device_fd by open("/dev/vfio/device"). At this point
   mmap() should be blocked since a security context hasn't been 
   established for this fd. This could be done by returning an error 
   (EACCES or EAGAIN?), or succeeding w/o actually setting up the 
   mapping.

5)   user requests to bind device_fd to iommu_fd which verifies the 
   group is not 1:N (for mdev the check is on the parent device). 
   Successful binding automatically attaches the device to the block_
   dma domain via iommu_attach_group(). From now on the user is 
   permitted to access the device. If mmap() in 4) is allowed, vfio 
   actually sets up the MMIO mapping at this point.

6)   before the device is unbound from iommu_fd, it is always in a 
   security context. Attaching/detaching just switches the security
   context between the block_dma domain and an ioasid domain.

7)   Unbinding detaches the device from the block_dma domain and
   re-attaches it to the default domain. From now on the user should
   be denied access to the device. vfio should tear down the MMIO
   mapping at this point. (A hypothetical sketch of the resulting
   userspace flow is below.)
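
A hypothetical sketch of how a userspace driver would walk these Stage-1
steps. None of this uAPI exists yet: the device node name and the
IOMMU_DEVICE_BIND ioctl are placeholders for whatever the proposal ends
up defining.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/ioctl.h>
	#include <unistd.h>

	/* Placeholder ioctl number, NOT a real uAPI. */
	#define IOMMU_DEVICE_BIND	_IOW('!', 0, int)

	int main(void)
	{
		int iommu_fd, device_fd;

		/* Step 2: one iommu_fd per process; a block-DMA domain backs it. */
		iommu_fd = open("/dev/iommu", O_RDWR);
		if (iommu_fd < 0) {
			perror("open /dev/iommu");
			return 1;
		}

		/* Step 4: open the per-device node of a singleton-group device.
		 * mmap() on this fd would still be refused at this point. */
		device_fd = open("/dev/vfio/device0", O_RDWR);
		if (device_fd < 0) {
			perror("open device node");
			close(iommu_fd);
			return 1;
		}

		/* Step 5: bind the device fd to the iommu_fd; the kernel checks
		 * that the group is 1:1 and attaches it to the block-DMA domain.
		 * Only after this succeeds may the process map and drive the
		 * device. */
		if (ioctl(device_fd, IOMMU_DEVICE_BIND, &iommu_fd) < 0)
			perror("bind device to iommu_fd");

		/* Step 7: closing the device fd (unbinding) must move the device
		 * back to its default domain and revoke any MMIO mappings. */
		close(device_fd);
		close(iommu_fd);
		return 0;
	}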

[Stage-2]

Both 1:1 and 1:N groups are handled via the new device-centric protocol. 
Old vfio uAPI is kept for legacy applications. All devices in the same group 
must share the same I/O address space.

A key difference from stage-1 is the additional check on group viability:

1)   vfio creates /dev/vfio/device nodes for all devices

2)   Same as stage-1 for getting iommu_fd

3)   iommu_fd reports that both 1:1 and 1:N groups are supported

4)   Same as stage-1 for getting device_fd 

5)   when receiving the binding call for the 1st device in a group,
   iommu_fd does several things:

a) Identify the group of this device and check group viability. A group
is viable only when all devices in the group are in one of the below
states (see the sketch after this list):

* driver-less
* bound to a driver which is the same as the one which does the
   binding call (vfio in this case)
* bound to an otherwise allowed driver (which indicates that it
   is safe for iommu_fd usage around probe())

b) Attach all devices in the group to the block_dma domain, via the
existing iommu_attach_group().

c) Register a notifier callback to verify group viability on the
IOMMU_GROUP_NOTIFY_BOUND_DRIVER event. BUG_ON() might be eliminated if
we can find a way to deny probe of non-iommu-safe drivers.
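
A conceptual sketch of the viability walk in 5a). This is not existing
code; the "otherwise allowed driver" case is reduced to a plain name
comparison purely for illustration.

	#include <linux/device.h>
	#include <linux/errno.h>
	#include <linux/iommu.h>
	#include <linux/string.h>

	/* Return 0 if this device does not endanger user ownership of the
	 * group: either it has no driver, or it is bound to the driver that
	 * issued the binding call. */
	static int iommu_fd_dev_viable(struct device *dev, void *data)
	{
		const char *binder = data;	/* e.g. "vfio-pci" */

		if (!dev->driver)
			return 0;
		if (!strcmp(dev->driver->name, binder))
			return 0;
		/* Any other bound driver could be affected by the user-owned
		 * I/O address space, so the group is not viable. */
		return -EBUSY;
	}

	static int iommu_fd_group_viable(struct iommu_group *group,
					 const char *binder)
	{
		return iommu_group_for_each_dev(group, (void *)binder,
						iommu_fd_dev_viable);
	}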

Re: [PATCH 9/9] dma-debug: Use memory_intersects() directly

2021-06-28 Thread Kefeng Wang



On 2021/6/28 14:11, Christoph Hellwig wrote:

You've sent me a patch 9 out of 9 without the previous 8 patches.  There
is no way I can sensibly review this series.


The full patch series is at
https://lore.kernel.org/linux-arch/20210626073439.150586-1-wangkefeng.w...@huawei.com/T/#t


This patch is not closely related to the above 8 patches; it only uses an
existing helper to simplify the code.
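
For readers without the series at hand, the change is of this flavor;
this is a sketch rather than the literal hunk, with dev_err() standing
in for dma-debug's own error reporting:

	#include <asm-generic/sections.h>
	#include <linux/device.h>

	/*
	 * memory_intersects(begin, end, virt, size) from <asm-generic/sections.h>
	 * returns true when [virt, virt + size) intersects [begin, end), so the
	 * open-coded overlap() helper in kernel/dma/debug.c can go away.
	 */
	static void check_for_illegal_area(struct device *dev, void *addr,
					   unsigned long len)
	{
		if (memory_intersects(_stext, _etext, addr, len) ||
		    memory_intersects(__start_rodata, __end_rodata, addr, len))
			dev_err(dev, "mapping overlaps kernel text or rodata [addr=%p] [len=%lu]\n",
				addr, len);
	}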




Re: [PATCH 9/9] dma-debug: Use memory_intersects() directly

2021-06-28 Thread Christoph Hellwig
You've sent me a patch 9 out of 9 without the previous 8 patches.  There
is no way I can sensibly review this series.


Re: Bug#989778: Regression in at least 5.10.y and mainline: Firewire audio interface fails to work properly (when booted under Xen)

2021-06-28 Thread ‍小太
On Thu, 24 Jun 2021 at 04:50, Salvatore Bonaccorso wrote:

> Hi Robin,
>
> On Mon, Jun 14, 2021 at 02:29:08PM +0100, Robin Murphy wrote:
> > On 2021-06-13 07:29, Salvatore Bonaccorso wrote:
> > > A user in Debian reported the above issue, which was reproducible with
> > > 5.13-rc5 and 5.10.y as packaged in Debian, and found that 85a5a6875ca9
> > > ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
> > > introduced the issue.
> >
> > Sounds like it's probably the same thing as being discussed over here:
> >
> >
> https://lore.kernel.org/linux-iommu/2e899de2-4b69-c4b6-33a6-09fb8949d...@nxp.com/
>
> Thanks for the pointer, it seems that now it has been fixed upstream
> with 5f89468e2f06 ("swiotlb: manipulate orig_addr when tlb_addr has
> offset").
>

I've checked out upstream v5.13 b7050b242430f3170e0b57f5f55136e44cb8dc66
(which includes the above commit) and confirmed that my issue is now
resolved.
Now to wait for it to reach Debian :)