Re: [PATCH RFC] virtio: wrap config->reset calls

2021-10-18 Thread Stefan Hajnoczi
On Wed, Oct 13, 2021 at 06:55:31AM -0400, Michael S. Tsirkin wrote:
> This will enable cleanups down the road.
> The idea is to disable cbs, then add "flush_queued_cbs" callback
> as a parameter, this way drivers can flush any work
> queued after callbacks have been disabled.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/um/drivers/virt-pci.c | 2 +-
>  drivers/block/virtio_blk.c | 4 ++--
>  drivers/bluetooth/virtio_bt.c  | 2 +-
>  drivers/char/hw_random/virtio-rng.c| 2 +-
>  drivers/char/virtio_console.c  | 4 ++--
>  drivers/crypto/virtio/virtio_crypto_core.c | 8 
>  drivers/firmware/arm_scmi/virtio.c | 2 +-
>  drivers/gpio/gpio-virtio.c | 2 +-
>  drivers/gpu/drm/virtio/virtgpu_kms.c   | 2 +-
>  drivers/i2c/busses/i2c-virtio.c| 2 +-
>  drivers/iommu/virtio-iommu.c   | 2 +-
>  drivers/net/caif/caif_virtio.c | 2 +-
>  drivers/net/virtio_net.c   | 4 ++--
>  drivers/net/wireless/mac80211_hwsim.c  | 2 +-
>  drivers/nvdimm/virtio_pmem.c   | 2 +-
>  drivers/rpmsg/virtio_rpmsg_bus.c   | 2 +-
>  drivers/scsi/virtio_scsi.c | 2 +-
>  drivers/virtio/virtio.c| 5 +
>  drivers/virtio/virtio_balloon.c| 2 +-
>  drivers/virtio/virtio_input.c  | 2 +-
>  drivers/virtio/virtio_mem.c| 2 +-
>  fs/fuse/virtio_fs.c| 4 ++--
>  include/linux/virtio.h | 1 +
>  net/9p/trans_virtio.c  | 2 +-
>  net/vmw_vsock/virtio_transport.c   | 4 ++--
>  sound/virtio/virtio_card.c | 4 ++--
>  26 files changed, 39 insertions(+), 33 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:09:13PM +0800, Yongji Xie wrote:
> On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > ioctl should not be called frequently.
> > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > requirements.
> > > > > > > Here the language (especially the word "generally") is weaker and 
> > > > > > > means
> > > > > > > there may be exceptions.
> > > > > > >
> > > > > > > Another type of access that doesn't work with the 
> > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > approach is reads that have side-effects. For example, imagine a 
> > > > > > > field
> > > > > > > containing an error code if the device encounters a problem 
> > > > > > > unrelated to
> > > > > > > a specific virtqueue request. Reading from this field resets the 
> > > > > > > error
> > > > > > > code to 0, saving the driver an extra configuration space write 
> > > > > > > access
> > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, 
> > > > > > > but it
> > > > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > > > semantics.
> > > > >
> > > > >
> > > > > Note that though you're correct, my understanding is that config 
> > > > > space is
> > > > > not suitable for this kind of error propagating. And it would be very 
> > > > > hard
> > > > > to implement such kind of semantic in some transports.  Virtqueue 
> > > > > should be
> > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > "rarely-changing or intialization-time parameters".
> > > > >
> > > > >
> > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > handle the message failure, I'm going to add a return value to
> > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > be propagated to the virtio device driver. Then the virtio-blk 
> > > > > > device
> > > > > > driver can be modified to handle that.
> > > > > >
> > > > > > Jason and Stefan, what do you think of this way?
> > > >
> > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > >
> > >
> > > We add a timeout and return error in case userspace never replies to
> > > the message.
> > >
> > > > The VIRTIO spec provides no way for the device to report errors from
> > > > config space accesses.
> > > >
> > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > accesses.
> > > >
> > > > VDUSE can take the same approach with
> > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > >
> > >
> > > I noticed that virtio_config_read*() only returns -1 when we access a
> > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > > when we access a valid field. Not sure if it's ok to silently ignore
> > > this kind of error.
> >
> > That's a good point but it's a general VIRTIO issue. Any device
> > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> > VIRTIO specification needs to provide a way for the driver to detect
> > this.
> >
> > If userspace violates the contract then VDUSE needs to mark the device
> > broken. QEMU's device emulation does something similar with the
> > vdev->broken flag.
> >
> > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> > vDPA/VDUSE to indicate that the device is not operational and must be
> > reset.
> >
> 
> It might be a solution. But DEVICE_NEEDS_RESET  is not implemented
> currently. So I'm thinking whether it's ok to add a check of
> DEVICE_NEEDS_RESET status bit in probe function of virtio device
> driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail
> device initailization when configuration space access failed.

Okay.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Stefan Hajnoczi
On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi 
> > > > > > >  wrote:
> > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification 
> > > > > > > > > > > > > > says "Device
> > > > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > The spec uses MUST and other terms to define the 
> > > > > > > > > > > > > precise requirements.
> > > > > > > > > > > > > Here the language (especially the word "generally") 
> > > > > > > > > > > > > is weaker and means
> > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > approach is reads that have side-effects. For 
> > > > > > > > > > > > > example, imagine a field
> > > > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > > > problem unrelated to
> > > > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > > > resets the error
> > > > > > > > > > > > > code to 0, saving the driver an extra configuration 
> > > > > > > > > > > > > space write access
> > > > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > > > implement those
> > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another 
> > > > > > > > > > > > > corner case, but it
> > > > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > > > config space is
> > > > > > > > > > > not suitable for this kind of error propagating. And it 
> > > > > > > > > > > would be very hard
> > > > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > > > Virtqueue should be
> > > > > > > > > > > much better. As Yong Ji quoted, the config space is used 
> > > > > > > > > > > for
> > > > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says 
> > > > > > > > > > > > "Device
> > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > > > > > requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is 
> > > > > > > > > > > weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, 
> > > > > > > > > > > imagine a field
> > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space 
> > > > > > > > > > > write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > implement those
> > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner 
> > > > > > > > > > > case, but it
> > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > config space is
> > > > > > > > > not suitable for this kind of error propagating. And it would 
> > > > > > > > > be very hard
> > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > Virtqueue should be
> > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next 
> > > > > > > > > > version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value 
> > > > > > > > > > to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the 
> > > > > > > > > > error can
> > > > > > > > > > be propagated to the virtio device driver. Then the 
> > > > > > > > > > virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > >

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-07 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{

The name field needs to be NUL terminated?

> + case VDUSE_CREATE_DEV: {
> + struct vduse_dev_config config;
> + unsigned long size = offsetof(struct vduse_dev_config, config);
> + void *buf;
> +
> + ret = -EFAULT;
> + if (copy_from_user(, argp, size))
> + break;
> +
> + ret = -EINVAL;
> + if (vduse_validate_config() == false)
> + break;
> +
> + buf = vmemdup_user(argp + size, config.config_size);
> + if (IS_ERR(buf)) {
> + ret = PTR_ERR(buf);
> + break;
> + }
> + ret = vduse_create_dev(, buf, control->api_version);
> + break;
> + }
> + case VDUSE_DESTROY_DEV: {
> + char name[VDUSE_NAME_MAX];
> +
> + ret = -EFAULT;
> + if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> + break;

Is this missing a NUL terminator?


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-06 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  wrote:
> >
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and 
> > > > > means
> > > > > there may be exceptions.
> > > > >
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated 
> > > > > to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > semantics.
> > >
> > >
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should 
> > > be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > >
> > >
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > >
> > > > Jason and Stefan, what do you think of this way?
> >
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> >
> 
> We add a timeout and return error in case userspace never replies to
> the message.
> 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> >
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> >
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> >
> 
> I noticed that virtio_config_read*() only returns -1 when we access a
> invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> when we access a valid field. Not sure if it's ok to silently ignore
> this kind of error.

That's a good point but it's a general VIRTIO issue. Any device
implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
VIRTIO specification needs to provide a way for the driver to detect
this.

If userspace violates the contract then VDUSE needs to mark the device
broken. QEMU's device emulation does something similar with the
vdev->broken flag.

The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
vDPA/VDUSE to indicate that the device is not operational and must be
reset.

The driver code may still process the -1 value read from the
configuration space. Hopefully this isn't a problem. There is currently
no VIRTIO interface besides DEVICE_NEEDS_RESET to indicate configuration
space access failures. On the other hand, drivers need to handle
malicious devices so they should be able to cope with the -1 value
anyway.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-06 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> 
> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and 
> > > > > means
> > > > > there may be exceptions.
> > > > > 
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated 
> > > > > to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > semantics.
> > > 
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should 
> > > be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > > 
> > > 
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > > 
> > > > Jason and Stefan, what do you think of this way?
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> > 
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> > 
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > 
> > > I'd like to stick to the current assumption thich get_config won't fail.
> > > That is to say,
> > > 
> > > 1) maintain a config in the kernel, make sure the config space read can
> > > always succeed
> > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > 3) we can synchronize with the vduse userspace during set_config
> > > 
> > > Does this work?
> > I noticed that caching is also allowed by the vhost-user protocol
> > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > know whether or not caching is in effect. The interface you outlined
> > above requires caching.
> > 
> > Is there a reason why the host kernel vDPA code needs to cache the
> > configuration space?
> 
> 
> Because:
> 
> 1) Kernel can not wait forever in get_config(), this is the major difference
> with vhost-user.

virtio_cread() can sleep:

  #define virtio_cread(vdev, structname, member, ptr) \
  do {\
  typeof(((structname*)0)->member) virtio_cread_v;\
  \
  might_sleep();  \
  ^^

Which code path cannot sleep?

> 2) Stick to the current assumption that virtio_cread() should always
> succeed.

That can be done by reading -1 (like QEMU does) when the read fails.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-05 Thread Stefan Hajnoczi
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> 
> 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > configuration space is generally used for rarely-changing or
> > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > ioctl should not be called frequently.
> > > The spec uses MUST and other terms to define the precise requirements.
> > > Here the language (especially the word "generally") is weaker and means
> > > there may be exceptions.
> > > 
> > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > approach is reads that have side-effects. For example, imagine a field
> > > containing an error code if the device encounters a problem unrelated to
> > > a specific virtqueue request. Reading from this field resets the error
> > > code to 0, saving the driver an extra configuration space write access
> > > and possibly race conditions. It isn't possible to implement those
> > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > makes me think that the interface does not allow full VIRTIO semantics.
> 
> 
> Note that though you're correct, my understanding is that config space is
> not suitable for this kind of error propagating. And it would be very hard
> to implement such kind of semantic in some transports.  Virtqueue should be
> much better. As Yong Ji quoted, the config space is used for
> "rarely-changing or intialization-time parameters".
> 
> 
> > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > handle the message failure, I'm going to add a return value to
> > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > be propagated to the virtio device driver. Then the virtio-blk device
> > driver can be modified to handle that.
> > 
> > Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

> I'd like to stick to the current assumption thich get_config won't fail.
> That is to say,
> 
> 1) maintain a config in the kernel, make sure the config space read can
> always succeed
> 2) introduce an ioctl for the vduse usersapce to update the config space.
> 3) we can synchronize with the vduse userspace during set_config
> 
> Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Here are the vhost-user protocol messages:

  Virtio device config space
  ^^

  ++--+---+-+
  | offset | size | flags | payload |
  ++--+---+-+

  :offset: a 32-bit offset of virtio device's configuration space

  :size: a 32-bit configuration space access size in bytes

  :flags: a 32-bit value:
- 0: Vhost master messages used for writeable fields
- 1: Vhost master messages used for live migration

  :payload: Size bytes array holding the contents of the virtio
device's configuration space

  ...

  ``VHOST_USER_GET_CONFIG``
:id: 24
:equivalent ioctl: N/A
:master payload: virtio device config space
:slave payload: virtio device config space

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master to fetch the contents of the
virtio device configuration space, vhost-user slave's payload size
MUST match master's request, vhost-user slave uses zero length of
payload to indicate an error to vhost-user master. The vhost-user
master may cache the contents to avoid repeated
``VHOST_USER_GET_CONFIG`` calls.

  ``VHOST_USER_SET_CONFIG``
:id: 25
:equivalent ioctl: N/A
:master payload: virtio device config space
:slave payload: N/A

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master when the Guest changes the virtio
device configuration space and also can be used for live migration
on the destination host. The vhost-user slave must check the flags
field, and slaves MUST NOT accept SET_CONFIG for read-only
configuration space fields unless the live migration bit is set.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-01 Thread Stefan Hajnoczi
On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  
> > > wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t 
> > > > > *len)
> > > > > + {
> > > > > + int fd;
> > > > > + void *addr;
> > > > > + size_t size;
> > > > > + struct vduse_iotlb_entry entry;
> > > > > +
> > > > > + entry.start = iova;
> > > > > + entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the caller the
> > > length of the contiguous iova region from the specified iova. And the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always +1.
> >
> 
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.

I see, thanks for explaining!

> > > > > + return addr + iova - entry.start;
> > > > > + }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > works. There must be a way for userspace to report the device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example, if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > does QEMU or the guest find out about this?
> >
> 
> There is a "features" field in struct vduse_dev_config which is used
> to do feature negotiation.

This approach is more restrictive than required by the VIRTIO
specification:

  "The device SHOULD accept any valid subset of features the driver
  accepts, otherwise it MUST fail to set the FEATURES_OK device status
  bit when the driver writes it."

  
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, if feature B depends on feature A and can only be enabled when
feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.

> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject 
> > > > > a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices doin't need to generate the contents
> > > > on demand, so I think this is okay but I had expected a different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> 

Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-30 Thread Stefan Hajnoczi
On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
> > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > + {
> > > + int fd;
> > > + void *addr;
> > > + size_t size;
> > > + struct vduse_iotlb_entry entry;
> > > +
> > > + entry.start = iova;
> > > + entry.last = iova + 1;
> >
> > Why +1?
> >
> > I expected the request to include *len so that VDUSE can create a bounce
> > buffer for the full iova range, if necessary.
> >
> 
> The function is used to translate iova to va. And the *len is not
> specified by the caller. Instead, it's used to tell the caller the
> length of the contiguous iova region from the specified iova. And the
> ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> overlapped iova region. So using iova + 1 should be enough here.

Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
wonder why userspace needs to assign a value at all if it's always +1.

> 
> > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, );
> > > + if (fd < 0)
> > > + return NULL;
> > > +
> > > + size = entry.last - entry.start + 1;
> > > + *len = entry.last - iova + 1;
> > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > + fd, entry.offset);
> > > + close(fd);
> > > + if (addr == MAP_FAILED)
> > > + return NULL;
> > > +
> > > + /* do something to cache this iova region */
> >
> > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > be called?
> >
> 
> The simple way is using a list to store the iotlb mappings. And we
> should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> or VDUSE_STOP_DATAPLANE message is received.

Thanks for explaining. It would be helpful to have a description of
IOTLB operation in this document.

> > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > will it return just enough pages to cover [start, last)?
> >
> 
> It should return one iotlb mapping that covers [start, last). In
> vhost-vdpa cases, it might be a full chunk of guest RAM. In
> virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> mapping (produced by dma_alloc_coherent()).

Great, thanks. Adding something about this to the documentation would
help others implementing VDUSE devices or libraries.

> > > +
> > > + return addr + iova - entry.start;
> > > + }
> > > +
> > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> >
> > Are these VIRTIO feature bits? Please explain how feature negotiation
> > works. There must be a way for userspace to report the device's
> > supported feature bits to the kernel.
> >
> 
> Yes, these are VIRTIO feature bits. Userspace will specify the
> device's supported feature bits when creating a new VDUSE device with
> ioctl(VDUSE_CREATE_DEV).

Can the VDUSE device influence feature bit negotiation? For example, if
the VDUSE virtio-blk device does not implement discard/write-zeroes, how
does QEMU or the guest find out about this?

> > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a 
> > > config interrupt
> >
> > Does this mean the contents of the configuration space are cached by
> > VDUSE?
> 
> Yes, but the kernel will also store the same contents.
> 
> > The downside is that the userspace code cannot generate the
> > contents on demand. Most devices doin't need to generate the contents
> > on demand, so I think this is okay but I had expected a different
> > interface:
> >
> > kernel->userspace VDUSE_DEV_GET_CONFIG
> > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> 
> The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> will need lots of modification of virtio codes to support that. So to
> make it simple, we choose this way:
> 
> userspace -> kernel VDUSE_DEV_SET_CONFIG
> userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> 
> > I think you can leave it the way it is, but I wanted to mention this in
> > case someone thinks it's important to support generating the contents of
> &g

Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-30 Thread Stefan Hajnoczi
On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> > > +/* ioctls */
> > > +
> > > +struct vduse_dev_config {
> > > + char name[VDUSE_NAME_MAX]; /* vduse device name */
> > > + __u32 vendor_id; /* virtio vendor id */
> > > + __u32 device_id; /* virtio device id */
> > > + __u64 features; /* device features */
> > > + __u64 bounce_size; /* bounce buffer size for iommu */
> > > + __u16 vq_size_max; /* the max size of virtqueue */
> >
> > The VIRTIO specification allows per-virtqueue sizes. A device can have
> > two virtqueues, where the first one allows up to 1024 descriptors and
> > the second one allows only 128 descriptors, for example.
> >
> 
> Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
> that now. All virtqueues have the same maximum size.

I see struct vpda_config_ops only supports a per-device max vq size:
u16 (*get_vq_num_max)(struct vdpa_device *vdev);

virtio-pci supports per-virtqueue sizes because the struct
virtio_pci_common_cfg->queue_size register is per-queue (controlled by
queue_select).

I guess this is a question for Jason: will vdpa will keep this limitation?
If yes, then VDUSE can stick to it too without running into problems in
the future.

> > > + __u16 padding; /* padding */
> > > + __u32 vq_num; /* the number of virtqueues */
> > > + __u32 vq_align; /* the allocation alignment of virtqueue's metadata 
> > > */
> >
> > I'm not sure what this is?
> >
> 
>  This will be used by vring_create_virtqueue() too.

If there is no official definition for the meaning of this value then
"/* same as vring_create_virtqueue()'s vring_align parameter */" would
be clearer. That way the reader knows what to research in order to
understand how this field works.

I don't remember but maybe it was used to support vrings when the
host/guest have non-4KB page sizes. I wonder if anyone has an official
definition for this value?


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> This series introduces a framework that makes it possible to implement
> software-emulated vDPA devices in userspace. And to make it simple, the
> emulated vDPA device's control path is handled in the kernel and only the
> data path is implemented in the userspace.

This looks interesting. Unfortunately I don't have enough time to do a
full review, but I looked at the documentation and uapi header file to
give feedback on the userspace ABI.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index ..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include 
> +
> +#define VDUSE_API_VERSION0
> +
> +#define VDUSE_NAME_MAX   256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> + /* Get the state for virtqueue from userspace */
> + VDUSE_GET_VQ_STATE,
> + /* Notify userspace to start the dataplane, no reply */
> + VDUSE_START_DATAPLANE,
> + /* Notify userspace to stop the dataplane, no reply */
> + VDUSE_STOP_DATAPLANE,
> + /* Notify userspace to update the memory mapping in device IOTLB */
> + VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> + __u32 index; /* virtqueue index */
> + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};
> +
> +struct vduse_iova_range {
> + __u64 start; /* start of the IOVA range */
> + __u64 last; /* end of the IOVA range */

Please clarify whether this describes a closed range [start, last] or an
open range [start, last).

> +};
> +
> +struct vduse_dev_request {
> + __u32 type; /* request type */
> + __u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY (1 << 0) /* No need to reply */
> + __u32 flags; /* request flags */
> + __u32 reserved; /* for future use */
> + union {
> + struct vduse_vq_state vq_state; /* virtqueue state */
> + struct vduse_iova_range iova; /* iova range for updating */
> + __u32 padding[16]; /* padding */
> + };
> +};
> +
> +struct vduse_dev_response {
> + __u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK  0x00
> +#define VDUSE_REQ_RESULT_FAILED  0x01
> + __u32 result; /* the result of request */
> + __u32 reserved[2]; /* for future use */
> + union {
> + struct vduse_vq_state vq_state; /* virtqueue state */
> + __u32 padding[16]; /* padding */
> + };
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> + char name[VDUSE_NAME_MAX]; /* vduse device name */
> + __u32 vendor_id; /* virtio vendor id */
> + __u32 device_id; /* virtio device id */
> + __u64 features; /* device features */
> + __u64 bounce_size; /* bounce buffer size for iommu */
> + __u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have
two virtqueues, where the first one allows up to 1024 descriptors and
the second one allows only 128 descriptors, for example.

This constant seems to impose the constraint that all virtqueues have
the same maximum size. Is this really necessary?

> + __u16 padding; /* padding */
> + __u32 vq_num; /* the number of virtqueues */
> + __u32 vq_align; /* the allocation alignment of virtqueue's metadata */

I'm not sure what this is?

> + __u32 config_size; /* the size of the configuration space */
> + __u32 reserved[15]; /* for future use */
> + __u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> + __u64 offset; /* the mmap offset on fd */
> + __u64 start; /* start of the IOVA range */
> + __u64 last; /* last of the IOVA range */

Same here, please specify whether this is an open range or a closed
range.

> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> + __u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> + __u32 offset; /* offset from the beginning of configuration space */
> + __u32 length; /* the length to write to configuration space */
> + __u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> + __u32 index; /* virtqueue index */
> + __u32 avail_idx; /* virtqueue state (last_avail_idx) */
> + __u64 desc_addr; /* address of desc area */
> + __u64 driver_addr; /* address of driver area */
> + __u64 device_addr; /* address of device area */
> + __u32 num; /* the size of virtqueue */
> + __u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> + __u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> + int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE   0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION_IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION_IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device 
> (/dev/vduse/) */
> +#define 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji 
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 222 
> ++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst
> 
> diff --git a/Documentation/userspace-api/index.rst 
> b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..c432be070f67 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -27,6 +27,7 @@ place where this information is gathered.
> iommu
> media/index
> sysfs-platform_profile
> +   vduse
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index ..2f9cd1a4e530
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,222 @@
> +==
> +VDUSE - "vDPA Device in Userspace"
> +==
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace. And
> +to make it simple, the emulated vDPA device's control path is handled in
> +the kernel and only the data path is implemented in the userspace.
> +
> +Note that only virtio block device is supported by VDUSE framework now,
> +which can reduce security risks when the userspace process that implements
> +the data path is run by an unprivileged user. The Support for other device
> +types can be added after the security issue is clarified or fixed in the 
> future.
> +
> +Start/Stop VDUSE devices
> +
> +
> +VDUSE devices are started as follows:
> +
> +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +   /dev/vduse/control.
> +
> +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> +
> +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> +   instance to vDPA bus.
> +
> +VDUSE devices are stopped as follows:
> +
> +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> +   instance from vDPA bus.
> +
> +2. Close the file descriptor referring to /dev/vduse/$NAME
> +
> +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> +   /dev/vduse/control
> +
> +The netlink messages metioned above can be sent via vdpa tool in iproute2
> +or use the below sample codes:
> +
> +.. code-block:: c
> +
> + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> + {
> + struct nl_sock *nlsock;
> + struct nl_msg *msg;
> + int famid;
> +
> + nlsock = nl_socket_alloc();
> + if (!nlsock)
> + return -ENOMEM;
> +
> + if (genl_connect(nlsock))
> + goto free_sock;
> +
> + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> + if (famid < 0)
> + goto close_sock;
> +
> + msg = nlmsg_alloc();
> + if (!msg)
> + goto close_sock;
> +
> + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> cmd, 0))
> + goto nla_put_failure;
> +
> + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> + if (cmd == VDPA_CMD_DEV_NEW)
> + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> "vduse");
> +
> + if (nl_send_sync(nlsock, msg))
> + goto close_sock;
> +
> + nl_close(nlsock);
> + nl_socket_free(nlsock);
> +
> + return 0;
> + nla_put_failure:
> + nlmsg_free(msg);
> + close_sock:
> + nl_close(nlsock);
> + free_sock:
> + nl_socket_free(nlsock);
> + return -1;
> + }
> +
> +How VDUSE works
> +---
> +
> +Since the emuldated vDPA device's control path is handled in the kernel,

s/emuldated/emulated/

> +a message-based communication protocol and few types of control messages
> +are introduced by VDUSE framework to make userspace be aware of the data
> +path related changes:
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> +
> +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> +
> +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update 

Re: [PATCH v3 02/14] iommu: Report domain nesting info

2020-07-03 Thread Stefan Hajnoczi
On Tue, Jun 30, 2020 at 02:00:49AM +, Tian, Kevin wrote:
> > From: Liu, Yi L 
> > Sent: Monday, June 29, 2020 8:23 PM
> > 
> > Hi Stefan,
> > 
> > > From: Stefan Hajnoczi 
> > > Sent: Monday, June 29, 2020 5:25 PM
> > >
> > > On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> > > > +/*
> > > > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > > > + * user space should check it before using
> > > > + * nesting capability.
> > > > + *
> > > > + * @size:  size of the whole structure
> > > > + * @format:PASID table entry format, the same definition with
> > > > + * @format of struct iommu_gpasid_bind_data.
> > > > + * @features:  supported nesting features.
> > > > + * @flags: currently reserved for future extension.
> > > > + * @data:  vendor specific cap info.
> > > > + *
> > > > + * 
> > > > +---++
> > > > + * | feature   |  Notes
> > > >  |
> > > > + *
> > >
> > +===+===
> > 
> > > =+
> > > > + * | SYSWIDE_PASID |  Kernel manages PASID in system wide, PASIDs
> > used  |
> > > > + * |   |  in the system should be allocated by host kernel 
> > > >  |
> > > > + * 
> > > > +---++
> > > > + * | BIND_PGTBL|  bind page tables to host PASID, the PASID could  
> > > >  |
> > > > + * |   |  either be a host PASID passed in bind request or 
> > > >  |
> > > > + * |   |  default PASIDs (e.g. default PASID of 
> > > > aux-domain) |
> > > > + * 
> > > > +---++
> > > > + * | CACHE_INVLD   |  mandatory feature for nesting capable IOMMU
> > |
> > > > + * 
> > > > +---++
> > >
> > > This feature description is vague about what CACHE_INVLD does and how
> > to
> > > use it. If I understand correctly, the presence of this feature means
> > > that VFIO_IOMMU_NESTING_OP_CACHE_INVLD must be used?
> > >
> > > The same kind of clarification could be done for SYSWIDE_PASID and
> > > BIND_PGTBL too.
> > 
> > For SYSWIDE_PASID and BIND_PGTBL, yes, presence of the feature bit
> > means must use. So the two are requirements to user space if it wants
> > to setup nesting. While for CACHE_INVLD, it's kind of availability
> > here. How about removing CACHE_INVLD as presence of BIND_PGTBL should
> > indicates support of CACHE_INVLD?
> > 
> 
> So far this assumption is correct but it may not be true when thinking 
> forward.
> For example, a vendor might find a way to allow the owner of 1st-level page
> table to directly invalidate cache w/o going through host IOMMU driver. From
> this angle I feel explicitly reporting this capability is more robust.
> 
> Regarding to the description, what about below?
> 
> --
> SYSWIDE_PASID: PASIDs are managed in system-wide, instead of per device.
> When a device is assigned to userspace or VM, proper uAPI (provided by 
> userspace driver framework, e.g. VFIO) must be used to allocate/free PASIDs
> for the assigned device.
> 
> BIND_PGTBL: The owner of the first-level/stage-1 page table must explicitly 
> bind the page table to associated PASID (either the one specified in bind 
> request or the default PASID of the iommu domain), through VFIO_IOMMU
> _NESTING_OP
> 
> CACHE_INVLD: The owner of the first-level/stage-1 page table must
> explicitly invalidate the IOMMU cache through VFIO_IOMMU_NESTING_OP,
> according to vendor-specific requirement when changing the page table.
> --

Mentioning the API to allocate/free PASIDs and VFIO_IOMMU_NESTING_OP has
made this clearer. This lets someone reading the documentation know
where to look for further information on using these features.

Thank you!

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v3 02/14] iommu: Report domain nesting info

2020-06-29 Thread Stefan Hajnoczi
On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> +/*
> + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> + *   user space should check it before using
> + *   nesting capability.
> + *
> + * @size:size of the whole structure
> + * @format:  PASID table entry format, the same definition with
> + *   @format of struct iommu_gpasid_bind_data.
> + * @features:supported nesting features.
> + * @flags:   currently reserved for future extension.
> + * @data:vendor specific cap info.
> + *
> + * +---++
> + * | feature   |  Notes |
> + * +===++
> + * | SYSWIDE_PASID |  Kernel manages PASID in system wide, PASIDs used  |
> + * |   |  in the system should be allocated by host kernel  |
> + * +---++
> + * | BIND_PGTBL|  bind page tables to host PASID, the PASID could   |
> + * |   |  either be a host PASID passed in bind request or  |
> + * |   |  default PASIDs (e.g. default PASID of aux-domain) |
> + * +---++
> + * | CACHE_INVLD   |  mandatory feature for nesting capable IOMMU   |
> + * +---++

This feature description is vague about what CACHE_INVLD does and how to
use it. If I understand correctly, the presence of this feature means
that VFIO_IOMMU_NESTING_OP_CACHE_INVLD must be used?

The same kind of clarification could be done for SYSWIDE_PASID and
BIND_PGTBL too.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v3 13/14] vfio: Document dual stage control

2020-06-29 Thread Stefan Hajnoczi
On Wed, Jun 24, 2020 at 01:55:26AM -0700, Liu Yi L wrote:
> +Details can be found in Documentation/userspace-api/iommu.rst. For Intel
> +VT-d, each stage 1 page table is bound to host by:
> +
> +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> +memcpy(_op->data, _data, sizeof(bind_data));
> +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> +
> +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> +GVA->GPA page tables are available when PASID (Process Address Space ID)
> +is exposed to guest. e.g. guest with PASID-capable devices assigned. For
> +such page table binding, the bind_data should include PASID info, which
> +is allocated by guest itself or by host. This depends on hardware vendor
> +e.g. Intel VT-d requires to allocate PASID from host. This requirement is
> +defined by the Virtual Command Support in VT-d 3.0 spec, guest software
> +running on VT-d should allocate PASID from host kernel. To allocate PASID
> +from host, user space should +check the IOMMU_NESTING_FEAT_SYSWIDE_PASID

s/+check/check/g

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 14/15] vfio: Document dual stage control

2020-06-22 Thread Stefan Hajnoczi
On Wed, Jun 17, 2020 at 06:27:27AM +, Liu, Yi L wrote:
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 15, 2020 5:41 PM
> > On Thu, Jun 11, 2020 at 05:15:33AM -0700, Liu Yi L wrote:
> >
> > > From: Eric Auger 
> > >
> > > The VFIO API was enhanced to support nested stage control: a bunch of
> > > new iotcls and usage guideline.
> > >
> > > Let's document the process to follow to set up nested mode.
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Cc: Joerg Roedel 
> > > Cc: Lu Baolu 
> > > Signed-off-by: Eric Auger 
> > > Signed-off-by: Liu Yi L 
> > > ---
> > > v1 -> v2:
> > > *) new in v2, compared with Eric's original version, pasid table bind
> > >and fault reporting is removed as this series doesn't cover them.
> > >Original version from Eric.
> > >https://lkml.org/lkml/2020/3/20/700
> > >
> > >  Documentation/driver-api/vfio.rst | 64
> > > +++
> > >  1 file changed, 64 insertions(+)
> > >
> > > diff --git a/Documentation/driver-api/vfio.rst
> > > b/Documentation/driver-api/vfio.rst
> > > index f1a4d3c..06224bd 100644
> > > --- a/Documentation/driver-api/vfio.rst
> > > +++ b/Documentation/driver-api/vfio.rst
> > > @@ -239,6 +239,70 @@ group and can access them as follows::
> > >   /* Gratuitous device reset and go... */
> > >   ioctl(device, VFIO_DEVICE_RESET);
> > >
> > > +IOMMU Dual Stage Control
> > > +
> > > +
> > > +Some IOMMUs support 2 stages/levels of translation. Stage corresponds
> > > +to the ARM terminology while level corresponds to Intel's VTD 
> > > terminology.
> > > +In the following text we use either without distinction.
> > > +
> > > +This is useful when the guest is exposed with a virtual IOMMU and
> > > +some devices are assigned to the guest through VFIO. Then the guest
> > > +OS can use stage 1 (GIOVA -> GPA or GVA->GPA), while the hypervisor
> > > +uses stage 2 for VM isolation (GPA -> HPA).
> > > +
> > > +Under dual stage translation, the guest gets ownership of the stage 1
> > > +page tables and also owns stage 1 configuration structures. The
> > > +hypervisor owns the root configuration structure (for security
> > > +reason), including stage 2 configuration. This works as long
> > > +configuration structures and page table
> > 
> > s/as long configuration/as long as configuration/
> 
> got it.
> 
> > 
> > > +format are compatible between the virtual IOMMU and the physical IOMMU.
> > 
> > s/format/formats/
> 
> I see.
> 
> > > +
> > > +Assuming the HW supports it, this nested mode is selected by choosing
> > > +the VFIO_TYPE1_NESTING_IOMMU type through:
> > > +
> > > +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
> > > +
> > > +This forces the hypervisor to use the stage 2, leaving stage 1
> > > +available for guest usage. The guest stage 1 format depends on IOMMU
> > > +vendor, and it is the same with the nesting configuration method.
> > > +User space should check the format and configuration method after
> > > +setting nesting type by
> > > +using:
> > > +
> > > +ioctl(container->fd, VFIO_IOMMU_GET_INFO, _info);
> > > +
> > > +Details can be found in Documentation/userspace-api/iommu.rst. For
> > > +Intel VT-d, each stage 1 page table is bound to host by:
> > > +
> > > +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> > > +memcpy(_op->data, _data, sizeof(bind_data));
> > > +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> > > +
> > > +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> > > +GVA->GPA page tables are available when PASID (Process Address Space
> > > +GVA->ID)
> > > +is exposed to guest. e.g. guest with PASID-capable devices assigned.
> > > +For such page table binding, the bind_data should include PASID info,
> > > +which is allocated by guest itself or by host. This depends on
> > > +hardware vendor e.g. Intel VT-d requires to allocate PASID from host.
> > > +This requirement is available by VFIO_IOMMU_GET_INFO. User space
> > > +could allocate PASID 

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-22 Thread Stefan Hajnoczi
On Tue, Jun 16, 2020 at 12:09:16PM -0400, Peter Xu wrote:
> On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> > Isolation between applications is preserved but there is no isolation
> > between the device and the application itself. The application needs to
> > trust the device.
> > 
> > Examples:
> > 
> > 1. The device can snoop secret data from readable pages in the
> >application's virtual memory space.
> > 
> > 2. The device can gain arbitrary execution on the CPU by overwriting
> >control flow addresses (e.g. function pointers, stack return
> >addresses) in writable pages.
> 
> To me, SVA seems to be that "middle layer" of secure where it's not as safe as
> VFIO_IOMMU_MAP_DMA which has buffer level granularity of control (but of 
> course
> we pay overhead on buffer setups and on-the-fly translations), however it's 
> far
> better than DMA with no IOMMU which can ruin the whole host/guest, because
> after all we do a lot of isolations as process based.
> 
> IMHO it's the same as when we see a VM (or the QEMU process) as a whole along
> with the guest code.  In some cases we don't care if the guest did some bad
> things to mess up with its own QEMU process.  It is still ideal if we can even
> stop the guest from doing so, but when it's not easy to do it the ideal way, 
> we
> just lower the requirement to not spread the influence to the host and other
> VMs.

Makes sense.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-22 Thread Stefan Hajnoczi
On Tue, Jun 16, 2020 at 10:00:16AM -0700, Raj, Ashok wrote:
> On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> > On Tue, Jun 16, 2020 at 02:26:38AM +, Tian, Kevin wrote:
> > > > From: Stefan Hajnoczi 
> > > > Sent: Monday, June 15, 2020 6:02 PM
> > > > 
> > > > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > > > Intel platforms allows address space sharing between device DMA and
> > > > > applications. SVA can reduce programming complexity and enhance
> > > > security.
> > > > >
> > > > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > > > guest application address space with passthru devices. This is called
> > > > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > > > changes. For IOMMU and QEMU changes, they are in separate series 
> > > > > (listed
> > > > > in the "Related series").
> > > > >
> > > > > The high-level architecture for SVA virtualization is as below, the 
> > > > > key
> > > > > design of vSVA support is to utilize the dual-stage IOMMU translation 
> > > > > (
> > > > > also known as IOMMU nesting translation) capability in host IOMMU.
> > > > >
> > > > >
> > > > > .-.  .---.
> > > > > |   vIOMMU|  | Guest process CR3, FL only|
> > > > > | |  '---'
> > > > > ./
> > > > > | PASID Entry |--- PASID cache flush -
> > > > > '-'   |
> > > > > | |   V
> > > > > | |CR3 in GPA
> > > > > '-'
> > > > > Guest
> > > > > --| Shadow |--|
> > > > >   vv  v
> > > > > Host
> > > > > .-.  .--.
> > > > > |   pIOMMU|  | Bind FL for GVA-GPA  |
> > > > > | |  '--'
> > > > > ./  |
> > > > > | PASID Entry | V (Nested xlate)
> > > > > '\.--.
> > > > > | |   |SL for GPA-HPA, default domain|
> > > > > | |   '--'
> > > > > '-'
> > > > > Where:
> > > > >  - FL = First level/stage one page tables
> > > > >  - SL = Second level/stage two page tables
> > > > 
> > > > Hi,
> > > > Looks like an interesting feature!
> > > > 
> > > > To check I understand this feature: can applications now pass virtual
> > > > addresses to devices instead of translating to IOVAs?
> > > > 
> > > > If yes, can guest applications restrict the vSVA address space so the
> > > > device only has access to certain regions?
> > > > 
> > > > On one hand replacing IOVA translation with virtual addresses simplifies
> > > > the application programming model, but does it give up isolation if the
> > > > device can now access all application memory?
> > > > 
> > > 
> > > with SVA each application is allocated with a unique PASID to tag its
> > > virtual address space. The device that claims SVA support must guarantee 
> > > that one application can only program the device to access its own virtual
> > > address space (i.e. all DMAs triggered by this application are tagged with
> > > the application's PASID, and are translated by IOMMU's PASID-granular
> > > page table). So, isolation is not sacrificed in SVA.
> > 
> > Isolation between applications is preserved but there is no isolation
> > between the device and the application itself. The application needs to
> > trust the device.
> 
> Right. With all convenience comes security trust. With SVA there is an
> expectation that the device has the required security boundaries properly
> implemented. FWIW, what is our guarantee today that VF's are secure from
> one another or even its own PF? They can also generate transactions with
>

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Stefan Hajnoczi
On Tue, Jun 16, 2020 at 02:26:38AM +, Tian, Kevin wrote:
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 15, 2020 6:02 PM
> > 
> > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > Intel platforms allows address space sharing between device DMA and
> > > applications. SVA can reduce programming complexity and enhance
> > security.
> > >
> > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > guest application address space with passthru devices. This is called
> > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > > in the "Related series").
> > >
> > > The high-level architecture for SVA virtualization is as below, the key
> > > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > > also known as IOMMU nesting translation) capability in host IOMMU.
> > >
> > >
> > > .-.  .---.
> > > |   vIOMMU|  | Guest process CR3, FL only|
> > > | |  '---'
> > > ./
> > > | PASID Entry |--- PASID cache flush -
> > > '-'   |
> > > | |   V
> > > | |CR3 in GPA
> > > '-'
> > > Guest
> > > --| Shadow |--|
> > >   vv  v
> > > Host
> > > .-.  .--.
> > > |   pIOMMU|  | Bind FL for GVA-GPA  |
> > > | |  '--'
> > > ./  |
> > > | PASID Entry | V (Nested xlate)
> > > '\.--.
> > > | |   |SL for GPA-HPA, default domain|
> > > | |   '--'
> > > '-'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > 
> > Hi,
> > Looks like an interesting feature!
> > 
> > To check I understand this feature: can applications now pass virtual
> > addresses to devices instead of translating to IOVAs?
> > 
> > If yes, can guest applications restrict the vSVA address space so the
> > device only has access to certain regions?
> > 
> > On one hand replacing IOVA translation with virtual addresses simplifies
> > the application programming model, but does it give up isolation if the
> > device can now access all application memory?
> > 
> 
> with SVA each application is allocated with a unique PASID to tag its
> virtual address space. The device that claims SVA support must guarantee 
> that one application can only program the device to access its own virtual
> address space (i.e. all DMAs triggered by this application are tagged with
> the application's PASID, and are translated by IOMMU's PASID-granular
> page table). So, isolation is not sacrificed in SVA.

Isolation between applications is preserved but there is no isolation
between the device and the application itself. The application needs to
trust the device.

Examples:

1. The device can snoop secret data from readable pages in the
   application's virtual memory space.

2. The device can gain arbitrary execution on the CPU by overwriting
   control flow addresses (e.g. function pointers, stack return
   addresses) in writable pages.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Stefan Hajnoczi
On Mon, Jun 15, 2020 at 12:39:40PM +, Liu, Yi L wrote:
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 15, 2020 6:02 PM
> > 
> > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > Intel platforms allows address space sharing between device DMA and
> > > applications. SVA can reduce programming complexity and enhance security.
> > >
> > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > guest application address space with passthru devices. This is called
> > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > > in the "Related series").
> > >
> > > The high-level architecture for SVA virtualization is as below, the key
> > > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > > also known as IOMMU nesting translation) capability in host IOMMU.
> > >
> > >
> > > .-.  .---.
> > > |   vIOMMU|  | Guest process CR3, FL only|
> > > | |  '---'
> > > ./
> > > | PASID Entry |--- PASID cache flush -
> > > '-'   |
> > > | |   V
> > > | |CR3 in GPA
> > > '-'
> > > Guest
> > > --| Shadow |--|
> > >   vv  v
> > > Host
> > > .-.  .--.
> > > |   pIOMMU|  | Bind FL for GVA-GPA  |
> > > | |  '--'
> > > ./  |
> > > | PASID Entry | V (Nested xlate)
> > > '\.--.
> > > | |   |SL for GPA-HPA, default domain|
> > > | |   '--'
> > > '-'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > 
> > Hi,
> > Looks like an interesting feature!
> 
> thanks for the interest. Stefan :-)
> 
> > To check I understand this feature: can applications now pass virtual
> > addresses to devices instead of translating to IOVAs?
> 
> yes, application could pass virtual addresses to device directly. As
> long as the virtual address is mapped in cpu page table, then IOMMU
> would get it translated to physical address.
> 
> > If yes, can guest applications restrict the vSVA address space so the
> > device only has access to certain regions?
> 
> do you mean restrict the access of certain virtual address regions of
> guest application ? or certain guest memory? :-)

Your reply below answered my question. I was wondering if applications
can protect parts of their virtual memory space that should not be
accessed by the device. It makes sense that there is a trade-off to
simplify the programming model and performance might also be better if
the application doesn't need to DMA map/unmap buffers frequently.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 14/15] vfio: Document dual stage control

2020-06-15 Thread Stefan Hajnoczi
On Thu, Jun 11, 2020 at 05:15:33AM -0700, Liu Yi L wrote:
> From: Eric Auger 
> 
> The VFIO API was enhanced to support nested stage control: a bunch of
> new iotcls and usage guideline.
> 
> Let's document the process to follow to set up nested mode.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Eric Auger 
> Signed-off-by: Liu Yi L 
> ---
> v1 -> v2:
> *) new in v2, compared with Eric's original version, pasid table bind
>and fault reporting is removed as this series doesn't cover them.
>Original version from Eric.
>https://lkml.org/lkml/2020/3/20/700
> 
>  Documentation/driver-api/vfio.rst | 64 
> +++
>  1 file changed, 64 insertions(+)
> 
> diff --git a/Documentation/driver-api/vfio.rst 
> b/Documentation/driver-api/vfio.rst
> index f1a4d3c..06224bd 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -239,6 +239,70 @@ group and can access them as follows::
>   /* Gratuitous device reset and go... */
>   ioctl(device, VFIO_DEVICE_RESET);
>  
> +IOMMU Dual Stage Control
> +
> +
> +Some IOMMUs support 2 stages/levels of translation. Stage corresponds to
> +the ARM terminology while level corresponds to Intel's VTD terminology.
> +In the following text we use either without distinction.
> +
> +This is useful when the guest is exposed with a virtual IOMMU and some
> +devices are assigned to the guest through VFIO. Then the guest OS can use
> +stage 1 (GIOVA -> GPA or GVA->GPA), while the hypervisor uses stage 2 for
> +VM isolation (GPA -> HPA).
> +
> +Under dual stage translation, the guest gets ownership of the stage 1 page
> +tables and also owns stage 1 configuration structures. The hypervisor owns
> +the root configuration structure (for security reason), including stage 2
> +configuration. This works as long configuration structures and page table

s/as long configuration/as long as configuration/

> +format are compatible between the virtual IOMMU and the physical IOMMU.

s/format/formats/

> +
> +Assuming the HW supports it, this nested mode is selected by choosing the
> +VFIO_TYPE1_NESTING_IOMMU type through:
> +
> +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
> +
> +This forces the hypervisor to use the stage 2, leaving stage 1 available
> +for guest usage. The guest stage 1 format depends on IOMMU vendor, and
> +it is the same with the nesting configuration method. User space should
> +check the format and configuration method after setting nesting type by
> +using:
> +
> +ioctl(container->fd, VFIO_IOMMU_GET_INFO, _info);
> +
> +Details can be found in Documentation/userspace-api/iommu.rst. For Intel
> +VT-d, each stage 1 page table is bound to host by:
> +
> +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> +memcpy(_op->data, _data, sizeof(bind_data));
> +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> +
> +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> +GVA->GPA page tables are available when PASID (Process Address Space ID)
> +is exposed to guest. e.g. guest with PASID-capable devices assigned. For
> +such page table binding, the bind_data should include PASID info, which
> +is allocated by guest itself or by host. This depends on hardware vendor
> +e.g. Intel VT-d requires to allocate PASID from host. This requirement is
> +available by VFIO_IOMMU_GET_INFO. User space could allocate PASID from
> +host by:
> +
> +req.flags = VFIO_IOMMU_ALLOC_PASID;
> +ioctl(container, VFIO_IOMMU_PASID_REQUEST, );

It is not clear how the userspace application determines whether PASIDs
must be allocated from the host via VFIO_IOMMU_PASID_REQUEST or if the
guest itself can allocate PASIDs. The text mentions VFIO_IOMMU_GET_INFO
but what exactly should the userspace application check?


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-15 Thread Stefan Hajnoczi
On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> 
> This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> guest application address space with passthru devices. This is called
> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> changes. For IOMMU and QEMU changes, they are in separate series (listed
> in the "Related series").
> 
> The high-level architecture for SVA virtualization is as below, the key
> design of vSVA support is to utilize the dual-stage IOMMU translation (
> also known as IOMMU nesting translation) capability in host IOMMU.
> 
> 
> .-.  .---.
> |   vIOMMU|  | Guest process CR3, FL only|
> | |  '---'
> ./
> | PASID Entry |--- PASID cache flush -
> '-'   |
> | |   V
> | |CR3 in GPA
> '-'
> Guest
> --| Shadow |--|
>   vv  v
> Host
> .-.  .--.
> |   pIOMMU|  | Bind FL for GVA-GPA  |
> | |  '--'
> ./  |
> | PASID Entry | V (Nested xlate)
> '\.--.
> | |   |SL for GPA-HPA, default domain|
> | |   '--'
> '-'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables

Hi,
Looks like an interesting feature!

To check I understand this feature: can applications now pass virtual
addresses to devices instead of translating to IOVAs?

If yes, can guest applications restrict the vSVA address space so the
device only has access to certain regions?

On one hand replacing IOVA translation with virtual addresses simplifies
the application programming model, but does it give up isolation if the
device can now access all application memory?

Thanks,
Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu