Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Yongji Xie
On Thu, Jul 8, 2021 at 5:06 PM Stefan Hajnoczi  wrote:
>
> On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> >
> > 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> > > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi 
> > > > > > > >  wrote:
> > > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification 
> > > > > > > > > > > > > > > says "Device
> > > > > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > > The spec uses MUST and other terms to define the 
> > > > > > > > > > > > > > precise requirements.
> > > > > > > > > > > > > > Here the language (especially the word "generally") 
> > > > > > > > > > > > > > is weaker and means
> > > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > approach is reads that have side-effects. For 
> > > > > > > > > > > > > > example, imagine a field
> > > > > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > > > > problem unrelated to
> > > > > > > > > > > > > > a specific virtqueue request. Reading from this 
> > > > > > > > > > > > > > field resets the error
> > > > > > > > > > > > > > code to 0, saving the driver an extra configuration 
> > > > > > > > > > > > > > space write access
> > > > > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > > > > implement those
> > > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another 
> > > > > > > > > > > > > > corner case, but it
> > > > > > > > > > > > > > makes me think that the interface does not allow 
> > > > > > > > > > > > > > full VIRTIO semantics.
> > > > > > > > > > > > Note that though you're correct, my understanding is 
> > > > > > > > > > > > that config space is
> > > > > > > > > > > > not suitable for this kind of error propagating. And it 
> > > > > > > > > > > > would be very hard
> > > > > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > > > > Virtqueue should be
> > > > > > > > > > > > much better. As Yong Ji quoted, the config space is 
> > > > > > > > > > > > used for
> > > > > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next 
> > > > > > > > > > > > > version. And to
> > > > > > > > > > > > > handle the message failure, I'm going to add a return 
> > > > > > > > > > > > > value to
> > > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so 
> > > > > > > > > > > > > that the error can
> > > > > > > > > > > > > be propagated to the virtio device driver. Then the 
> > > > > > > > > > > > > virtio-blk device
> > > > > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error 
> > > > > > > > > > > return value?
> > > > > > > > > > >
> > > > > > > > > > > The VIRTIO spec provides no way for the device to report 
> > > > > > > > > > > errors from
> > > > > > > > > > > config space accesses.
> > > > > > > > > > >
> > > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > > > > virtio_config_read*() and silently discards 
> > > > > > > > > > > virtio_config_write*()
> > > > > > > > > > > accesses.
> > > > > > > > > > >
> > > > > > > > > > > VDUSE can take the same approach with
> > > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > > > >
> > > > > > > > > > > > I'd like to stick to the current assumption thich 
> > > > > > > > > > > > get_config won't fail.
> > > > > > > > > > > > That is to say,
> > > > > > > > > > > >
> > > > > > > > > > > > 1) maintain a config in the kernel, make sure the 
> > > > > > > > > > > > config space read can
> > > > > > > > > > > > always succeed
> > > > > > > > > > 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:09:13PM +0800, Yongji Xie wrote:
> On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  
> > > wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > ioctl should not be called frequently.
> > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > requirements.
> > > > > > > Here the language (especially the word "generally") is weaker and 
> > > > > > > means
> > > > > > > there may be exceptions.
> > > > > > >
> > > > > > > Another type of access that doesn't work with the 
> > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > approach is reads that have side-effects. For example, imagine a 
> > > > > > > field
> > > > > > > containing an error code if the device encounters a problem 
> > > > > > > unrelated to
> > > > > > > a specific virtqueue request. Reading from this field resets the 
> > > > > > > error
> > > > > > > code to 0, saving the driver an extra configuration space write 
> > > > > > > access
> > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, 
> > > > > > > but it
> > > > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > > > semantics.
> > > > >
> > > > >
> > > > > Note that though you're correct, my understanding is that config 
> > > > > space is
> > > > > not suitable for this kind of error propagating. And it would be very 
> > > > > hard
> > > > > to implement such kind of semantic in some transports.  Virtqueue 
> > > > > should be
> > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > "rarely-changing or intialization-time parameters".
> > > > >
> > > > >
> > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > handle the message failure, I'm going to add a return value to
> > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > be propagated to the virtio device driver. Then the virtio-blk 
> > > > > > device
> > > > > > driver can be modified to handle that.
> > > > > >
> > > > > > Jason and Stefan, what do you think of this way?
> > > >
> > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > >
> > >
> > > We add a timeout and return error in case userspace never replies to
> > > the message.
> > >
> > > > The VIRTIO spec provides no way for the device to report errors from
> > > > config space accesses.
> > > >
> > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > accesses.
> > > >
> > > > VDUSE can take the same approach with
> > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > >
> > >
> > > I noticed that virtio_config_read*() only returns -1 when we access a
> > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > > when we access a valid field. Not sure if it's ok to silently ignore
> > > this kind of error.
> >
> > That's a good point but it's a general VIRTIO issue. Any device
> > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> > VIRTIO specification needs to provide a way for the driver to detect
> > this.
> >
> > If userspace violates the contract then VDUSE needs to mark the device
> > broken. QEMU's device emulation does something similar with the
> > vdev->broken flag.
> >
> > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> > vDPA/VDUSE to indicate that the device is not operational and must be
> > reset.
> >
> 
> It might be a solution. But DEVICE_NEEDS_RESET  is not implemented
> currently. So I'm thinking whether it's ok to add a check of
> DEVICE_NEEDS_RESET status bit in probe function of virtio device
> driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail
> device initailization when configuration space access failed.

Okay.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-08 Thread Stefan Hajnoczi
On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi 
> > > > > > >  wrote:
> > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification 
> > > > > > > > > > > > > > says "Device
> > > > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > The spec uses MUST and other terms to define the 
> > > > > > > > > > > > > precise requirements.
> > > > > > > > > > > > > Here the language (especially the word "generally") 
> > > > > > > > > > > > > is weaker and means
> > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > approach is reads that have side-effects. For 
> > > > > > > > > > > > > example, imagine a field
> > > > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > > > problem unrelated to
> > > > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > > > resets the error
> > > > > > > > > > > > > code to 0, saving the driver an extra configuration 
> > > > > > > > > > > > > space write access
> > > > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > > > implement those
> > > > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another 
> > > > > > > > > > > > > corner case, but it
> > > > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > > > config space is
> > > > > > > > > > > not suitable for this kind of error propagating. And it 
> > > > > > > > > > > would be very hard
> > > > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > > > Virtqueue should be
> > > > > > > > > > > much better. As Yong Ji quoted, the config space is used 
> > > > > > > > > > > for
> > > > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next 
> > > > > > > > > > > > version. And to
> > > > > > > > > > > > handle the message failure, I'm going to add a return 
> > > > > > > > > > > > value to
> > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that 
> > > > > > > > > > > > the error can
> > > > > > > > > > > > be propagated to the virtio device driver. Then the 
> > > > > > > > > > > > virtio-blk device
> > > > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > > > 
> > > > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error 
> > > > > > > > > > return value?
> > > > > > > > > > 
> > > > > > > > > > The VIRTIO spec provides no way for the device to report 
> > > > > > > > > > errors from
> > > > > > > > > > config space accesses.
> > > > > > > > > > 
> > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > > > virtio_config_read*() and silently discards 
> > > > > > > > > > virtio_config_write*()
> > > > > > > > > > accesses.
> > > > > > > > > > 
> > > > > > > > > > VDUSE can take the same approach with
> > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > > > 
> > > > > > > > > > > I'd like to stick to the current assumption thich 
> > > > > > > > > > > get_config won't fail.
> > > > > > > > > > > That is to say,
> > > > > > > > > > > 
> > > > > > > > > > > 1) maintain a config in the kernel, make sure the config 
> > > > > > > > > > > space read can
> > > > > > > > > > > always succeed
> > > > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update 
> > > > > > > > > > > the config space.
> > > > > > > > > > > 3) we can synchronize with the vduse userspace during 
> > > > > > > > > > > set_config
> > > > > > > > > > > 
> > > > > > 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Jason Wang


在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:

On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:

在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:

On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:

在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:

On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:

On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  wrote:

On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:

在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

在 2021/7/4 下午5:49, Yongji Xie 写道:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagating. And it would be very hard
to implement such kind of semantic in some transports.  Virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or intialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption thich get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse usersapce to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference
with vhost-user.

virtio_cread() can sleep:

 #define virtio_cread(vdev, structname, member, ptr) \
 do {\
 typeof(((structname*)0)->member) virtio_cread_v;\
 \
 might_sleep();  \
 ^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work, an example is TUN/TAP:
netdevice is abstracted in the kernel and datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.

For config space write, it requires more thought but fortunately it's
not commonly used. So VDUSE can choose to filter out the
device/features that depends on the config write.

This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.

Probably not, userspace device can tell the kernel about the device_features
and mandated_features during 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> 在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > 在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says 
> > > > > > > > > > > > "Device
> > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > > > > > requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is 
> > > > > > > > > > > weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, 
> > > > > > > > > > > imagine a field
> > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space 
> > > > > > > > > > > write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > implement those
> > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner 
> > > > > > > > > > > case, but it
> > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > config space is
> > > > > > > > > not suitable for this kind of error propagating. And it would 
> > > > > > > > > be very hard
> > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > Virtqueue should be
> > > > > > > > > much better. As Yong Ji quoted, the config space is used for
> > > > > > > > > "rarely-changing or intialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next 
> > > > > > > > > > version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value 
> > > > > > > > > > to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the 
> > > > > > > > > > error can
> > > > > > > > > > be propagated to the virtio device driver. Then the 
> > > > > > > > > > virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > 
> > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return 
> > > > > > > > value?
> > > > > > > > 
> > > > > > > > The VIRTIO spec provides no way for the device to report errors 
> > > > > > > > from
> > > > > > > > config space accesses.
> > > > > > > > 
> > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > virtio_config_read*() and silently discards 
> > > > > > > > virtio_config_write*()
> > > > > > > > accesses.
> > > > > > > > 
> > > > > > > > VDUSE can take the same approach with
> > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > 
> > > > > > > > > I'd like to stick to the current assumption thich get_config 
> > > > > > > > > won't fail.
> > > > > > > > > That is to say,
> > > > > > > > > 
> > > > > > > > > 1) maintain a config in the kernel, make sure the config 
> > > > > > > > > space read can
> > > > > > > > > always succeed
> > > > > > > > > 2) introduce an ioctl for the vduse usersapce to update the 
> > > > > > > > > config space.
> > > > > > > > > 3) we can synchronize with the vduse userspace during 
> > > > > > > > > set_config
> > > > > > > > > 
> > > > > > > > > Does this work?
> > > > > > > > I noticed that caching is also allowed by the vhost-user 
> > > > > > > > protocol
> > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device 
> > > > > > > > doesn't
> > > > > > > > know whether or not caching is in effect. The interface you 
> > > > > > > > outlined
> > > > > > > > above requires caching.
> > > > > > > > 
> > > > > > > > Is there a reason why the host kernel vDPA code needs to cache 
> > > > > > > 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Jason Wang


在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:

On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:

在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:

On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:

On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  wrote:

On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:

在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

在 2021/7/4 下午5:49, Yongji Xie 写道:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagating. And it would be very hard
to implement such kind of semantic in some transports.  Virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or intialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption thich get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse usersapce to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference
with vhost-user.

virtio_cread() can sleep:

#define virtio_cread(vdev, structname, member, ptr) \
do {\
typeof(((structname*)0)->member) virtio_cread_v;\
\
might_sleep();  \
^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work, an example is TUN/TAP:
netdevice is abstracted in the kernel and datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.

For config space write, it requires more thought but fortunately it's
not commonly used. So VDUSE can choose to filter out the
device/features that depends on the config write.

This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.


Probably not, userspace device can tell the kernel about the device_features
and mandated_features during creation, and the feature negotiation could be
done purely in the kernel without bothering the userspace.



(For 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Yongji Xie
On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  wrote:
> > >
> > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > >
> > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > ioctl should not be called frequently.
> > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > requirements.
> > > > > > Here the language (especially the word "generally") is weaker and 
> > > > > > means
> > > > > > there may be exceptions.
> > > > > >
> > > > > > Another type of access that doesn't work with the 
> > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > approach is reads that have side-effects. For example, imagine a 
> > > > > > field
> > > > > > containing an error code if the device encounters a problem 
> > > > > > unrelated to
> > > > > > a specific virtqueue request. Reading from this field resets the 
> > > > > > error
> > > > > > code to 0, saving the driver an extra configuration space write 
> > > > > > access
> > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but 
> > > > > > it
> > > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > > semantics.
> > > >
> > > >
> > > > Note that though you're correct, my understanding is that config space 
> > > > is
> > > > not suitable for this kind of error propagating. And it would be very 
> > > > hard
> > > > to implement such kind of semantic in some transports.  Virtqueue 
> > > > should be
> > > > much better. As Yong Ji quoted, the config space is used for
> > > > "rarely-changing or intialization-time parameters".
> > > >
> > > >
> > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > handle the message failure, I'm going to add a return value to
> > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > driver can be modified to handle that.
> > > > >
> > > > > Jason and Stefan, what do you think of this way?
> > >
> > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > >
> >
> > We add a timeout and return error in case userspace never replies to
> > the message.
> >
> > > The VIRTIO spec provides no way for the device to report errors from
> > > config space accesses.
> > >
> > > The QEMU virtio-pci implementation returns -1 from invalid
> > > virtio_config_read*() and silently discards virtio_config_write*()
> > > accesses.
> > >
> > > VDUSE can take the same approach with
> > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > >
> >
> > I noticed that virtio_config_read*() only returns -1 when we access a
> > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > when we access a valid field. Not sure if it's ok to silently ignore
> > this kind of error.
>
> That's a good point but it's a general VIRTIO issue. Any device
> implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> VIRTIO specification needs to provide a way for the driver to detect
> this.
>
> If userspace violates the contract then VDUSE needs to mark the device
> broken. QEMU's device emulation does something similar with the
> vdev->broken flag.
>
> The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> vDPA/VDUSE to indicate that the device is not operational and must be
> reset.
>

It might be a solution. But DEVICE_NEEDS_RESET  is not implemented
currently. So I'm thinking whether it's ok to add a check of
DEVICE_NEEDS_RESET status bit in probe function of virtio device
driver (e.g. virtio-blk driver). Then VDUSE can make use of it to fail
device initailization when configuration space access failed.

Thanks,
Yongji
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-06 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  wrote:
> >
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and 
> > > > > means
> > > > > there may be exceptions.
> > > > >
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated 
> > > > > to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > semantics.
> > >
> > >
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should 
> > > be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > >
> > >
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > >
> > > > Jason and Stefan, what do you think of this way?
> >
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> >
> 
> We add a timeout and return error in case userspace never replies to
> the message.
> 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> >
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> >
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> >
> 
> I noticed that virtio_config_read*() only returns -1 when we access a
> invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> when we access a valid field. Not sure if it's ok to silently ignore
> this kind of error.

That's a good point but it's a general VIRTIO issue. Any device
implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
VIRTIO specification needs to provide a way for the driver to detect
this.

If userspace violates the contract then VDUSE needs to mark the device
broken. QEMU's device emulation does something similar with the
vdev->broken flag.

The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
vDPA/VDUSE to indicate that the device is not operational and must be
reset.

The driver code may still process the -1 value read from the
configuration space. Hopefully this isn't a problem. There is currently
no VIRTIO interface besides DEVICE_NEEDS_RESET to indicate configuration
space access failures. On the other hand, drivers need to handle
malicious devices so they should be able to cope with the -1 value
anyway.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-06 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> 
> 在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and 
> > > > > means
> > > > > there may be exceptions.
> > > > > 
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated 
> > > > > to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > semantics.
> > > 
> > > Note that though you're correct, my understanding is that config space is
> > > not suitable for this kind of error propagating. And it would be very hard
> > > to implement such kind of semantic in some transports.  Virtqueue should 
> > > be
> > > much better. As Yong Ji quoted, the config space is used for
> > > "rarely-changing or intialization-time parameters".
> > > 
> > > 
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > > 
> > > > Jason and Stefan, what do you think of this way?
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> > 
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> > 
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > 
> > > I'd like to stick to the current assumption thich get_config won't fail.
> > > That is to say,
> > > 
> > > 1) maintain a config in the kernel, make sure the config space read can
> > > always succeed
> > > 2) introduce an ioctl for the vduse usersapce to update the config space.
> > > 3) we can synchronize with the vduse userspace during set_config
> > > 
> > > Does this work?
> > I noticed that caching is also allowed by the vhost-user protocol
> > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > know whether or not caching is in effect. The interface you outlined
> > above requires caching.
> > 
> > Is there a reason why the host kernel vDPA code needs to cache the
> > configuration space?
> 
> 
> Because:
> 
> 1) Kernel can not wait forever in get_config(), this is the major difference
> with vhost-user.

virtio_cread() can sleep:

  #define virtio_cread(vdev, structname, member, ptr) \
  do {\
  typeof(((structname*)0)->member) virtio_cread_v;\
  \
  might_sleep();  \
  ^^

Which code path cannot sleep?

> 2) Stick to the current assumption that virtio_cread() should always
> succeed.

That can be done by reading -1 (like QEMU does) when the read fails.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-05 Thread Yongji Xie
On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  wrote:
>
> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> >
> > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > configuration space is generally used for rarely-changing or
> > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > ioctl should not be called frequently.
> > > > The spec uses MUST and other terms to define the precise requirements.
> > > > Here the language (especially the word "generally") is weaker and means
> > > > there may be exceptions.
> > > >
> > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > approach is reads that have side-effects. For example, imagine a field
> > > > containing an error code if the device encounters a problem unrelated to
> > > > a specific virtqueue request. Reading from this field resets the error
> > > > code to 0, saving the driver an extra configuration space write access
> > > > and possibly race conditions. It isn't possible to implement those
> > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > makes me think that the interface does not allow full VIRTIO semantics.
> >
> >
> > Note that though you're correct, my understanding is that config space is
> > not suitable for this kind of error propagating. And it would be very hard
> > to implement such kind of semantic in some transports.  Virtqueue should be
> > much better. As Yong Ji quoted, the config space is used for
> > "rarely-changing or intialization-time parameters".
> >
> >
> > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > handle the message failure, I'm going to add a return value to
> > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > be propagated to the virtio device driver. Then the virtio-blk device
> > > driver can be modified to handle that.
> > >
> > > Jason and Stefan, what do you think of this way?
>
> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>

We add a timeout and return error in case userspace never replies to
the message.

> The VIRTIO spec provides no way for the device to report errors from
> config space accesses.
>
> The QEMU virtio-pci implementation returns -1 from invalid
> virtio_config_read*() and silently discards virtio_config_write*()
> accesses.
>
> VDUSE can take the same approach with
> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>

I noticed that virtio_config_read*() only returns -1 when we access a
invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
when we access a valid field. Not sure if it's ok to silently ignore
this kind of error.

Thanks,
Yongji
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-05 Thread Jason Wang


在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

在 2021/7/4 下午5:49, Yongji Xie 写道:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.


Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagating. And it would be very hard
to implement such kind of semantic in some transports.  Virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or intialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption thich get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse usersapce to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?



Because:

1) Kernel can not wait forever in get_config(), this is the major 
difference with vhost-user.
2) Stick to the current assumption that virtio_cread() should always 
succeed.


Thanks




Here are the vhost-user protocol messages:

   Virtio device config space
   ^^

   ++--+---+-+
   | offset | size | flags | payload |
   ++--+---+-+

   :offset: a 32-bit offset of virtio device's configuration space

   :size: a 32-bit configuration space access size in bytes

   :flags: a 32-bit value:
 - 0: Vhost master messages used for writeable fields
 - 1: Vhost master messages used for live migration

   :payload: Size bytes array holding the contents of the virtio
 device's configuration space

   ...

   ``VHOST_USER_GET_CONFIG``
 :id: 24
 :equivalent ioctl: N/A
 :master payload: virtio device config space
 :slave payload: virtio device config space

 When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
 submitted by the vhost-user master to fetch the contents of the
 virtio device configuration space, vhost-user slave's payload size
 MUST match master's request, vhost-user slave uses zero length of
 payload to indicate an error to vhost-user master. The vhost-user
 master may cache the contents to avoid repeated
 ``VHOST_USER_GET_CONFIG`` calls.

   ``VHOST_USER_SET_CONFIG``
 :id: 25
 :equivalent ioctl: N/A
 :master payload: virtio device config space
 :slave payload: N/A

 When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
 submitted by the vhost-user master when the Guest changes the virtio
 device configuration space and also can be used for live migration
 on the destination host. The vhost-user slave must check the flags
 field, and slaves MUST NOT accept SET_CONFIG for read-only
 configuration space fields unless the live migration bit is set.

Stefan


___
iommu mailing list
iommu@lists.linux-foundation.org

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-05 Thread Stefan Hajnoczi
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> 
> 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > configuration space is generally used for rarely-changing or
> > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > ioctl should not be called frequently.
> > > The spec uses MUST and other terms to define the precise requirements.
> > > Here the language (especially the word "generally") is weaker and means
> > > there may be exceptions.
> > > 
> > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > approach is reads that have side-effects. For example, imagine a field
> > > containing an error code if the device encounters a problem unrelated to
> > > a specific virtqueue request. Reading from this field resets the error
> > > code to 0, saving the driver an extra configuration space write access
> > > and possibly race conditions. It isn't possible to implement those
> > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > makes me think that the interface does not allow full VIRTIO semantics.
> 
> 
> Note that though you're correct, my understanding is that config space is
> not suitable for this kind of error propagating. And it would be very hard
> to implement such kind of semantic in some transports.  Virtqueue should be
> much better. As Yong Ji quoted, the config space is used for
> "rarely-changing or intialization-time parameters".
> 
> 
> > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > handle the message failure, I'm going to add a return value to
> > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > be propagated to the virtio device driver. Then the virtio-blk device
> > driver can be modified to handle that.
> > 
> > Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.

> I'd like to stick to the current assumption thich get_config won't fail.
> That is to say,
> 
> 1) maintain a config in the kernel, make sure the config space read can
> always succeed
> 2) introduce an ioctl for the vduse usersapce to update the config space.
> 3) we can synchronize with the vduse userspace during set_config
> 
> Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Here are the vhost-user protocol messages:

  Virtio device config space
  ^^

  ++--+---+-+
  | offset | size | flags | payload |
  ++--+---+-+

  :offset: a 32-bit offset of virtio device's configuration space

  :size: a 32-bit configuration space access size in bytes

  :flags: a 32-bit value:
- 0: Vhost master messages used for writeable fields
- 1: Vhost master messages used for live migration

  :payload: Size bytes array holding the contents of the virtio
device's configuration space

  ...

  ``VHOST_USER_GET_CONFIG``
:id: 24
:equivalent ioctl: N/A
:master payload: virtio device config space
:slave payload: virtio device config space

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master to fetch the contents of the
virtio device configuration space, vhost-user slave's payload size
MUST match master's request, vhost-user slave uses zero length of
payload to indicate an error to vhost-user master. The vhost-user
master may cache the contents to avoid repeated
``VHOST_USER_GET_CONFIG`` calls.

  ``VHOST_USER_SET_CONFIG``
:id: 25
:equivalent ioctl: N/A
:master payload: virtio device config space
:slave payload: N/A

When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
submitted by the vhost-user master when the Guest changes the virtio
device configuration space and also can be used for live migration
on the destination host. The vhost-user slave must check the flags
field, and slaves MUST NOT accept SET_CONFIG for read-only
configuration space fields unless the live migration bit is set.

Stefan


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-04 Thread Jason Wang


在 2021/7/4 下午5:49, Yongji Xie 写道:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.



Note that though you're correct, my understanding is that config space 
is not suitable for this kind of error propagating. And it would be very 
hard to implement such kind of semantic in some transports.  Virtqueue 
should be much better. As Yong Ji quoted, the config space is used for 
"rarely-changing or intialization-time parameters".




Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?



I'd like to stick to the current assumption thich get_config won't fail. 
That is to say,


1) maintain a config in the kernel, make sure the config space read can 
always succeed

2) introduce an ioctl for the vduse usersapce to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

Thanks






___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-04 Thread Yongji Xie
On Thu, Jul 1, 2021 at 9:15 PM Stefan Hajnoczi  wrote:
>
> On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> > On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi  wrote:
> > >
> > > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  
> > > > wrote:
> > > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t 
> > > > > > *len)
> > > > > > + {
> > > > > > + int fd;
> > > > > > + void *addr;
> > > > > > + size_t size;
> > > > > > + struct vduse_iotlb_entry entry;
> > > > > > +
> > > > > > + entry.start = iova;
> > > > > > + entry.last = iova + 1;
> > > > >
> > > > > Why +1?
> > > > >
> > > > > I expected the request to include *len so that VDUSE can create a 
> > > > > bounce
> > > > > buffer for the full iova range, if necessary.
> > > > >
> > > >
> > > > The function is used to translate iova to va. And the *len is not
> > > > specified by the caller. Instead, it's used to tell the caller the
> > > > length of the contiguous iova region from the specified iova. And the
> > > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > > overlapped iova region. So using iova + 1 should be enough here.
> > >
> > > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > > wonder why userspace needs to assign a value at all if it's always +1.
> > >
> >
> > If we need to get some iova regions in the specified range, we need
> > the entry.last field. For example, we can use [0, ULONG_MAX] to get
> > the first overlapped iova region which might be [4096, 8192]. But in
> > this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> > get the iova region including the specified iova.
>
> I see, thanks for explaining!
>
> > > > > > + return addr + iova - entry.start;
> > > > > > + }
> > > > > > +
> > > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > > >
> > > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > > works. There must be a way for userspace to report the device's
> > > > > supported feature bits to the kernel.
> > > > >
> > > >
> > > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > > device's supported feature bits when creating a new VDUSE device with
> > > > ioctl(VDUSE_CREATE_DEV).
> > >
> > > Can the VDUSE device influence feature bit negotiation? For example, if
> > > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > > does QEMU or the guest find out about this?
> > >
> >
> > There is a "features" field in struct vduse_dev_config which is used
> > to do feature negotiation.
>
> This approach is more restrictive than required by the VIRTIO
> specification:
>
>   "The device SHOULD accept any valid subset of features the driver
>   accepts, otherwise it MUST fail to set the FEATURES_OK device status
>   bit when the driver writes it."
>
>   
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002
>
> The spec allows a device to reject certain subsets of features. For
> example, if feature B depends on feature A and can only be enabled when
> feature A is also enabled.
>
> From your description I think VDUSE would accept feature B without
> feature A since the device implementation has no opportunity to fail
> negotiation with custom logic.
>

Yes, we discussed it [1] before. So I'd like to re-introduce
SET_STATUS messages so that the userspace can fail feature negotiation
during setting FEATURES_OK status bit.

[1]  https://lkml.org/lkml/2021/6/28/1587

> Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
> the device implementation full flexibility in which subsets of features
> to accept.
>
> This is a corner case. Many or maybe even all existing VIRTIO devices
> don't need this flexibility, but I want to point out this limitation in
> the VDUSE interface because it may cause issues in the future.
>
> > > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and 
> > > > > > inject a config interrupt
> > > > >
> > > > > Does this mean the contents of the configuration space are cached by
> > > > > VDUSE?
> > > >
> > > > Yes, but the kernel will also store the same contents.
> > > >
> > > > > The downside is that the userspace code cannot generate the
> > > > > contents on demand. Most devices doin't need to generate the contents
> > > > > on demand, so I think this is okay but I had expected a different
> > > > > interface:
> > > > >
> > > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > > >
> > > >
> > > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > > will need lots of modification of virtio codes to support that. So to
> > > > make it simple, we choose this way:
> 

Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-01 Thread Stefan Hajnoczi
On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  
> > > wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t 
> > > > > *len)
> > > > > + {
> > > > > + int fd;
> > > > > + void *addr;
> > > > > + size_t size;
> > > > > + struct vduse_iotlb_entry entry;
> > > > > +
> > > > > + entry.start = iova;
> > > > > + entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the caller the
> > > length of the contiguous iova region from the specified iova. And the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always +1.
> >
> 
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.

I see, thanks for explaining!

> > > > > + return addr + iova - entry.start;
> > > > > + }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > works. There must be a way for userspace to report the device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example, if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > does QEMU or the guest find out about this?
> >
> 
> There is a "features" field in struct vduse_dev_config which is used
> to do feature negotiation.

This approach is more restrictive than required by the VIRTIO
specification:

  "The device SHOULD accept any valid subset of features the driver
  accepts, otherwise it MUST fail to set the FEATURES_OK device status
  bit when the driver writes it."

  
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, if feature B depends on feature A and can only be enabled when
feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.

> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject 
> > > > > a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices doin't need to generate the contents
> > > > on demand, so I think this is okay but I had expected a different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > >
> > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > will need lots of modification of virtio codes to support that. So to
> > > make it simple, we choose this way:
> > >
> > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> > > > I think you can leave it the way it is, but I wanted to mention this in
> > > > case someone thinks it's important to support generating the contents of
> > > > the configuration space on demand.
> > > >
> > >
> > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> >
> > If the contents of the configuration 

Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-01 Thread Yongji Xie
On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
> > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > + {
> > > > + int fd;
> > > > + void *addr;
> > > > + size_t size;
> > > > + struct vduse_iotlb_entry entry;
> > > > +
> > > > + entry.start = iova;
> > > > + entry.last = iova + 1;
> > >
> > > Why +1?
> > >
> > > I expected the request to include *len so that VDUSE can create a bounce
> > > buffer for the full iova range, if necessary.
> > >
> >
> > The function is used to translate iova to va. And the *len is not
> > specified by the caller. Instead, it's used to tell the caller the
> > length of the contiguous iova region from the specified iova. And the
> > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > overlapped iova region. So using iova + 1 should be enough here.
>
> Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> wonder why userspace needs to assign a value at all if it's always +1.
>

If we need to get some iova regions in the specified range, we need
the entry.last field. For example, we can use [0, ULONG_MAX] to get
the first overlapped iova region which might be [4096, 8192]. But in
this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
get the iova region including the specified iova.

> >
> > > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, );
> > > > + if (fd < 0)
> > > > + return NULL;
> > > > +
> > > > + size = entry.last - entry.start + 1;
> > > > + *len = entry.last - iova + 1;
> > > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > > + fd, entry.offset);
> > > > + close(fd);
> > > > + if (addr == MAP_FAILED)
> > > > + return NULL;
> > > > +
> > > > + /* do something to cache this iova region */
> > >
> > > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > > be called?
> > >
> >
> > The simple way is using a list to store the iotlb mappings. And we
> > should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> > or VDUSE_STOP_DATAPLANE message is received.
>
> Thanks for explaining. It would be helpful to have a description of
> IOTLB operation in this document.
>

Sure.

> > > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > > will it return just enough pages to cover [start, last)?
> > >
> >
> > It should return one iotlb mapping that covers [start, last). In
> > vhost-vdpa cases, it might be a full chunk of guest RAM. In
> > virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> > mapping (produced by dma_alloc_coherent()).
>
> Great, thanks. Adding something about this to the documentation would
> help others implementing VDUSE devices or libraries.
>

OK.

> > > > +
> > > > + return addr + iova - entry.start;
> > > > + }
> > > > +
> > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > >
> > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > works. There must be a way for userspace to report the device's
> > > supported feature bits to the kernel.
> > >
> >
> > Yes, these are VIRTIO feature bits. Userspace will specify the
> > device's supported feature bits when creating a new VDUSE device with
> > ioctl(VDUSE_CREATE_DEV).
>
> Can the VDUSE device influence feature bit negotiation? For example, if
> the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> does QEMU or the guest find out about this?
>

There is a "features" field in struct vduse_dev_config which is used
to do feature negotiation.

> > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a 
> > > > config interrupt
> > >
> > > Does this mean the contents of the configuration space are cached by
> > > VDUSE?
> >
> > Yes, but the kernel will also store the same contents.
> >
> > > The downside is that the userspace code cannot generate the
> > > contents on demand. Most devices doin't need to generate the contents
> > > on demand, so I think this is okay but I had expected a different
> > > interface:
> > >
> > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> >
> > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > will need lots of modification of virtio codes to support that. So to
> > make it simple, we choose this way:
> >
> > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> > > I think you can leave it the way 

Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-30 Thread Stefan Hajnoczi
On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
> > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > + {
> > > + int fd;
> > > + void *addr;
> > > + size_t size;
> > > + struct vduse_iotlb_entry entry;
> > > +
> > > + entry.start = iova;
> > > + entry.last = iova + 1;
> >
> > Why +1?
> >
> > I expected the request to include *len so that VDUSE can create a bounce
> > buffer for the full iova range, if necessary.
> >
> 
> The function is used to translate iova to va. And the *len is not
> specified by the caller. Instead, it's used to tell the caller the
> length of the contiguous iova region from the specified iova. And the
> ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> overlapped iova region. So using iova + 1 should be enough here.

Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
wonder why userspace needs to assign a value at all if it's always +1.

> 
> > > + fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, );
> > > + if (fd < 0)
> > > + return NULL;
> > > +
> > > + size = entry.last - entry.start + 1;
> > > + *len = entry.last - iova + 1;
> > > + addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > + fd, entry.offset);
> > > + close(fd);
> > > + if (addr == MAP_FAILED)
> > > + return NULL;
> > > +
> > > + /* do something to cache this iova region */
> >
> > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > be called?
> >
> 
> The simple way is using a list to store the iotlb mappings. And we
> should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> or VDUSE_STOP_DATAPLANE message is received.

Thanks for explaining. It would be helpful to have a description of
IOTLB operation in this document.

> > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > will it return just enough pages to cover [start, last)?
> >
> 
> It should return one iotlb mapping that covers [start, last). In
> vhost-vdpa cases, it might be a full chunk of guest RAM. In
> virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> mapping (produced by dma_alloc_coherent()).

Great, thanks. Adding something about this to the documentation would
help others implementing VDUSE devices or libraries.

> > > +
> > > + return addr + iova - entry.start;
> > > + }
> > > +
> > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> >
> > Are these VIRTIO feature bits? Please explain how feature negotiation
> > works. There must be a way for userspace to report the device's
> > supported feature bits to the kernel.
> >
> 
> Yes, these are VIRTIO feature bits. Userspace will specify the
> device's supported feature bits when creating a new VDUSE device with
> ioctl(VDUSE_CREATE_DEV).

Can the VDUSE device influence feature bit negotiation? For example, if
the VDUSE virtio-blk device does not implement discard/write-zeroes, how
does QEMU or the guest find out about this?

> > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a 
> > > config interrupt
> >
> > Does this mean the contents of the configuration space are cached by
> > VDUSE?
> 
> Yes, but the kernel will also store the same contents.
> 
> > The downside is that the userspace code cannot generate the
> > contents on demand. Most devices doin't need to generate the contents
> > on demand, so I think this is okay but I had expected a different
> > interface:
> >
> > kernel->userspace VDUSE_DEV_GET_CONFIG
> > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> 
> The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> will need lots of modification of virtio codes to support that. So to
> make it simple, we choose this way:
> 
> userspace -> kernel VDUSE_DEV_SET_CONFIG
> userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> 
> > I think you can leave it the way it is, but I wanted to mention this in
> > case someone thinks it's important to support generating the contents of
> > the configuration space on demand.
> >
> 
> Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?

If the contents of the configuration space change continuously, then the
VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
conditions. For example, imagine a device where the driver can read a
timer from the configuration space. I think the VIRTIO device model
allows that although I'm not aware of any devices that do something like
it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
called 

Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-28 Thread Yongji Xie
On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >  Documentation/userspace-api/index.rst |   1 +
> >  Documentation/userspace-api/vduse.rst | 222 
> > ++
> >  2 files changed, 223 insertions(+)
> >  create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > diff --git a/Documentation/userspace-api/index.rst 
> > b/Documentation/userspace-api/index.rst
> > index 0b5eefed027e..c432be070f67 100644
> > --- a/Documentation/userspace-api/index.rst
> > +++ b/Documentation/userspace-api/index.rst
> > @@ -27,6 +27,7 @@ place where this information is gathered.
> > iommu
> > media/index
> > sysfs-platform_profile
> > +   vduse
> >
> >  .. only::  subproject and html
> >
> > diff --git a/Documentation/userspace-api/vduse.rst 
> > b/Documentation/userspace-api/vduse.rst
> > new file mode 100644
> > index ..2f9cd1a4e530
> > --- /dev/null
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -0,0 +1,222 @@
> > +==
> > +VDUSE - "vDPA Device in Userspace"
> > +==
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace. And
> > +to make it simple, the emulated vDPA device's control path is handled in
> > +the kernel and only the data path is implemented in the userspace.
> > +
> > +Note that only virtio block device is supported by VDUSE framework now,
> > +which can reduce security risks when the userspace process that implements
> > +the data path is run by an unprivileged user. The Support for other device
> > +types can be added after the security issue is clarified or fixed in the 
> > future.
> > +
> > +Start/Stop VDUSE devices
> > +
> > +
> > +VDUSE devices are started as follows:
> > +
> > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > +   /dev/vduse/control.
> > +
> > +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> > +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> > +
> > +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> > +   instance to vDPA bus.
> > +
> > +VDUSE devices are stopped as follows:
> > +
> > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> > +   instance from vDPA bus.
> > +
> > +2. Close the file descriptor referring to /dev/vduse/$NAME
> > +
> > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> > +   /dev/vduse/control
> > +
> > +The netlink messages metioned above can be sent via vdpa tool in iproute2
> > +or use the below sample codes:
> > +
> > +.. code-block:: c
> > +
> > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> > + {
> > + struct nl_sock *nlsock;
> > + struct nl_msg *msg;
> > + int famid;
> > +
> > + nlsock = nl_socket_alloc();
> > + if (!nlsock)
> > + return -ENOMEM;
> > +
> > + if (genl_connect(nlsock))
> > + goto free_sock;
> > +
> > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > + if (famid < 0)
> > + goto close_sock;
> > +
> > + msg = nlmsg_alloc();
> > + if (!msg)
> > + goto close_sock;
> > +
> > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> > cmd, 0))
> > + goto nla_put_failure;
> > +
> > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > + if (cmd == VDPA_CMD_DEV_NEW)
> > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> > "vduse");
> > +
> > + if (nl_send_sync(nlsock, msg))
> > + goto close_sock;
> > +
> > + nl_close(nlsock);
> > + nl_socket_free(nlsock);
> > +
> > + return 0;
> > + nla_put_failure:
> > + nlmsg_free(msg);
> > + close_sock:
> > + nl_close(nlsock);
> > + free_sock:
> > + nl_socket_free(nlsock);
> > + return -1;
> > + }
> > +
> > +How VDUSE works
> > +---
> > +
> > +Since the emuldated vDPA device's control path is handled in the kernel,
>
> s/emuldated/emulated/
>

Will fix it.

> > +a message-based communication protocol and few types of control messages
> > +are introduced by VDUSE framework 

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-28 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji 
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 222 
> ++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst
> 
> diff --git a/Documentation/userspace-api/index.rst 
> b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..c432be070f67 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -27,6 +27,7 @@ place where this information is gathered.
> iommu
> media/index
> sysfs-platform_profile
> +   vduse
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index ..2f9cd1a4e530
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,222 @@
> +==
> +VDUSE - "vDPA Device in Userspace"
> +==
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace. And
> +to make it simple, the emulated vDPA device's control path is handled in
> +the kernel and only the data path is implemented in the userspace.
> +
> +Note that only virtio block device is supported by VDUSE framework now,
> +which can reduce security risks when the userspace process that implements
> +the data path is run by an unprivileged user. The Support for other device
> +types can be added after the security issue is clarified or fixed in the 
> future.
> +
> +Start/Stop VDUSE devices
> +
> +
> +VDUSE devices are started as follows:
> +
> +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +   /dev/vduse/control.
> +
> +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> +
> +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> +   instance to vDPA bus.
> +
> +VDUSE devices are stopped as follows:
> +
> +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> +   instance from vDPA bus.
> +
> +2. Close the file descriptor referring to /dev/vduse/$NAME
> +
> +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> +   /dev/vduse/control
> +
> +The netlink messages metioned above can be sent via vdpa tool in iproute2
> +or use the below sample codes:
> +
> +.. code-block:: c
> +
> + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> + {
> + struct nl_sock *nlsock;
> + struct nl_msg *msg;
> + int famid;
> +
> + nlsock = nl_socket_alloc();
> + if (!nlsock)
> + return -ENOMEM;
> +
> + if (genl_connect(nlsock))
> + goto free_sock;
> +
> + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> + if (famid < 0)
> + goto close_sock;
> +
> + msg = nlmsg_alloc();
> + if (!msg)
> + goto close_sock;
> +
> + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> cmd, 0))
> + goto nla_put_failure;
> +
> + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> + if (cmd == VDPA_CMD_DEV_NEW)
> + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> "vduse");
> +
> + if (nl_send_sync(nlsock, msg))
> + goto close_sock;
> +
> + nl_close(nlsock);
> + nl_socket_free(nlsock);
> +
> + return 0;
> + nla_put_failure:
> + nlmsg_free(msg);
> + close_sock:
> + nl_close(nlsock);
> + free_sock:
> + nl_socket_free(nlsock);
> + return -1;
> + }
> +
> +How VDUSE works
> +---
> +
> +Since the emuldated vDPA device's control path is handled in the kernel,

s/emuldated/emulated/

> +a message-based communication protocol and few types of control messages
> +are introduced by VDUSE framework to make userspace be aware of the data
> +path related changes:
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> +
> +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> +
> +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update 

[PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-06-15 Thread Xie Yongji
VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji 
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 222 ++
 2 files changed, 223 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst 
b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
iommu
media/index
sysfs-platform_profile
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst 
b/Documentation/userspace-api/vduse.rst
new file mode 100644
index ..2f9cd1a4e530
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,222 @@
+==
+VDUSE - "vDPA Device in Userspace"
+==
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace. And
+to make it simple, the emulated vDPA device's control path is handled in
+the kernel and only the data path is implemented in the userspace.
+
+Note that only virtio block device is supported by VDUSE framework now,
+which can reduce security risks when the userspace process that implements
+the data path is run by an unprivileged user. The Support for other device
+types can be added after the security issue is clarified or fixed in the 
future.
+
+Start/Stop VDUSE devices
+
+
+VDUSE devices are started as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to vDPA bus.
+
+3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to vDPA bus.
+
+VDUSE devices are stopped as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control
+
+The netlink messages metioned above can be sent via vdpa tool in iproute2
+or use the below sample codes:
+
+.. code-block:: c
+
+   static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+   {
+   struct nl_sock *nlsock;
+   struct nl_msg *msg;
+   int famid;
+
+   nlsock = nl_socket_alloc();
+   if (!nlsock)
+   return -ENOMEM;
+
+   if (genl_connect(nlsock))
+   goto free_sock;
+
+   famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+   if (famid < 0)
+   goto close_sock;
+
+   msg = nlmsg_alloc();
+   if (!msg)
+   goto close_sock;
+
+   if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
cmd, 0))
+   goto nla_put_failure;
+
+   NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+   if (cmd == VDPA_CMD_DEV_NEW)
+   NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
"vduse");
+
+   if (nl_send_sync(nlsock, msg))
+   goto close_sock;
+
+   nl_close(nlsock);
+   nl_socket_free(nlsock);
+
+   return 0;
+   nla_put_failure:
+   nlmsg_free(msg);
+   close_sock:
+   nl_close(nlsock);
+   free_sock:
+   nl_socket_free(nlsock);
+   return -1;
+   }
+
+How VDUSE works
+---
+
+Since the emuldated vDPA device's control path is handled in the kernel,
+a message-based communication protocol and few types of control messages
+are introduced by VDUSE framework to make userspace be aware of the data
+path related changes:
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
+
+- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
+
+- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device 
IOTLB
+
+Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
+those control messages from/to VDUSE kernel module as follows:
+
+.. code-block:: c
+
+   static int vduse_message_handler(int dev_fd)
+   {
+   int len;
+