Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-05-30 Thread Michael S. Tsirkin
On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> This patch synchronizes operstate with admin state per RFC2863.
> 
> This is done by trying to toggle the carrier upon open/close and
> synchronizing with the config change work. This allows propagating
> status correctly to stacked devices like:
> 
> ip link add link enp0s3 macvlan0 type macvlan
> ip link set link enp0s3 down
> ip link show
> 
> Before this patch:
> 
> 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
> DEFAULT group default qlen 1000
> link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> ..
> 5: macvlan0@enp0s3:  mtu 1500 qdisc 
> noqueue state UP mode DEFAULT group default qlen 1000
> link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> 
> After this patch:
> 
> 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
> DEFAULT group default qlen 1000
> link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> ...
> 5: macvlan0@enp0s3:  mtu 1500 qdisc 
> noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> 
> Cc: Venkat Venkatsubra 
> Cc: Gia-Khanh Nguyen 
> Reviewed-by: Xuan Zhuo 
> Acked-by: Michael S. Tsirkin 
> Signed-off-by: Jason Wang 
> ---
> Changes since V1:
> - rebase
> - add ack/review tags





> ---
>  drivers/net/virtio_net.c | 94 +++-
>  1 file changed, 63 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4a802c0ea2cb..69e4ae353c51 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -433,6 +433,12 @@ struct virtnet_info {
>   /* The lock to synchronize the access to refill_enabled */
>   spinlock_t refill_lock;
>  
> + /* Is config change enabled? */
> + bool config_change_enabled;
> +
> + /* The lock to synchronize the access to config_change_enabled */
> + spinlock_t config_change_lock;
> +
>   /* Work struct for config space updates */
>   struct work_struct config_work;
>  


But we already have dev->config_lock and dev->config_enabled.

And it actually works better - instead of discarding config
change events it defers them until enabled.



> @@ -623,6 +629,20 @@ static void disable_delayed_refill(struct virtnet_info 
> *vi)
>   spin_unlock_bh(&vi->refill_lock);
>  }
>  
> +static void enable_config_change(struct virtnet_info *vi)
> +{
> + spin_lock_irq(&vi->config_change_lock);
> + vi->config_change_enabled = true;
> + spin_unlock_irq(&vi->config_change_lock);
> +}
> +
> +static void disable_config_change(struct virtnet_info *vi)
> +{
> + spin_lock_irq(&vi->config_change_lock);
> + vi->config_change_enabled = false;
> + spin_unlock_irq(&vi->config_change_lock);
> +}
> +
> +
>  static void enable_rx_mode_work(struct virtnet_info *vi)
>  {
>   rtnl_lock();
> @@ -2421,6 +2441,25 @@ static int virtnet_enable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
>   return err;
>  }
>  
> +static void virtnet_update_settings(struct virtnet_info *vi)
> +{
> + u32 speed;
> + u8 duplex;
> +
> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
> + return;
> +
> + virtio_cread_le(vi->vdev, struct virtio_net_config, speed, &speed);
> +
> + if (ethtool_validate_speed(speed))
> + vi->speed = speed;
> +
> + virtio_cread_le(vi->vdev, struct virtio_net_config, duplex, &duplex);
> +
> + if (ethtool_validate_duplex(duplex))
> + vi->duplex = duplex;
> +}
> +
>  static int virtnet_open(struct net_device *dev)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
> @@ -2439,6 +2478,18 @@ static int virtnet_open(struct net_device *dev)
>   goto err_enable_qp;
>   }
>  
> + /* Assume link up if device can't report link status,
> +otherwise get link status from config. */
> + netif_carrier_off(dev);
> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> + enable_config_change(vi);
> + schedule_work(&vi->config_work);
> + } else {
> + vi->status = VIRTIO_NET_S_LINK_UP;
> + virtnet_update_settings(vi);
> + netif_carrier_on(dev);
> + }
> +
>   return 0;
>  
>  err_enable_qp:
> @@ -2875,12 +2926,19 @@ static int virtnet_close(struct net_device *dev)
>   disable_delayed_refill(vi);
>   /* Make sure refill_work doesn't re-enable napi! */
>   cancel_delayed_work_sync(&vi->refill);
> + /* Make sure config notification doesn't schedule config

Re: [PATCH] tools/virtio: pipe assertion in vring_test.c

2024-05-27 Thread Michael S. Tsirkin
On Mon, May 27, 2024 at 04:13:31PM +0900, ysk...@gmail.com wrote:
> From: Yunseong Kim 
> 
> The virtio_device need to fail checking when create the geust/host pipe.

typo

> 
> Signed-off-by: Yunseong Kim 


I guess ... 

> ---
>  tools/virtio/vringh_test.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c
> index 98ff808d6f0c..b1af8807c02a 100644
> --- a/tools/virtio/vringh_test.c
> +++ b/tools/virtio/vringh_test.c
> @@ -161,8 +161,8 @@ static int parallel_test(u64 features,
>   host_map = mmap(NULL, mapsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>   guest_map = mmap(NULL, mapsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 
> 0);
>  
> - pipe(to_guest);
> - pipe(to_host);
> + assert(pipe(to_guest) == 0);
> + assert(pipe(to_host) == 0);


I don't like == 0, prefer !.
Also, calling pipe outside assert is preferable, since in theory
assert can be compiled out.
Not an issue here but people tend to copy/paste text.

>   CPU_ZERO(&cpu_set);
>   find_cpus(&first_cpu, &last_cpu);
> -- 
> 2.34.1




Re: [RFC PATCH 0/5] vsock/virtio: Add support for multi-devices

2024-05-23 Thread Michael S. Tsirkin
On Fri, May 17, 2024 at 10:46:02PM +0800, Xuewei Niu wrote:
>  include/linux/virtio_vsock.h|   2 +-
>  include/net/af_vsock.h  |  25 ++-
>  include/uapi/linux/virtio_vsock.h   |   1 +
>  include/uapi/linux/vm_sockets.h |  14 ++
>  net/vmw_vsock/af_vsock.c| 116 +--
>  net/vmw_vsock/virtio_transport.c| 255 ++--
>  net/vmw_vsock/virtio_transport_common.c |  16 +-
>  net/vmw_vsock/vsock_loopback.c  |   4 +-
>  8 files changed, 352 insertions(+), 81 deletions(-)

As any change to virtio device/driver interface, this has to
go through the virtio TC. Please subscribe at
virtio-comment+subscr...@lists.linux.dev and then
contact the TC at virtio-comm...@lists.linux.dev

You will likely eventually need to write a spec draft document, too.

-- 
MST




[GIT PULL v2] virtio: features, fixes, cleanups

2024-05-23 Thread Michael S. Tsirkin


Things to note here:
- dropped a couple of patches at the last moment. Did a bunch
  of testing in the last day to make sure that's not causing
  any fallout, it's a revert and no other changes in the same area
  so I feel rather safe doing that.
- the new Marvell OCTEON DPU driver is not here: latest v4 keeps causing
  build failures on mips. I kept deferring the pull hoping to get it in
  and I might try to merge a new version post rc1 (supposed to be ok for
  new drivers as they can't cause regressions), but we'll see.
- there are also a couple bugfixes under review, to be merged after rc1
- there is a trivial conflict in the header file. Shouldn't be any
  trouble to resolve, but fyi the resolution by Stephen is here
diff --cc drivers/virtio/virtio_mem.c
index e8355f55a8f7,6d4dfbc53a66..
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@@ -21,7 -21,7 +21,8 @@@
  #include 
  #include 
  #include 
 +#include 
+ #include 
  Also see it here:
  https://lore.kernel.org/all/20240423145947.14217...@canb.auug.org.au/


The following changes since commit 18daea77cca626f590fb140fc11e3a43c5d41354:

  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm 
(2024-04-30 12:40:41 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to c8fae27d141a32a1624d0d0d5419d94252824498:

  virtio-pci: Check if is_avq is NULL (2024-05-22 08:39:41 -0400)


virtio: features, fixes, cleanups

Several new features here:

- virtio-net is finally supported in vduse.

- Virtio (balloon and mem) interaction with suspend is improved

- vhost-scsi now handles signals better/faster.

Fixes, cleanups all over the place.

Signed-off-by: Michael S. Tsirkin 


Christophe JAILLET (1):
  vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

David Hildenbrand (1):
  virtio-mem: support suspend+resume

David Stevens (2):
  virtio_balloon: Give the balloon its own wakeup source
  virtio_balloon: Treat stats requests as wakeup events

Eugenio Pérez (1):
  MAINTAINERS: add Eugenio Pérez as reviewer

Jiri Pirko (1):
  virtio: delete vq in vp_find_vqs_msix() when request_irq() fails

Krzysztof Kozlowski (24):
  virtio: balloon: drop owner assignment
  virtio: input: drop owner assignment
  virtio: mem: drop owner assignment
  um: virt-pci: drop owner assignment
  virtio_blk: drop owner assignment
  bluetooth: virtio: drop owner assignment
  hwrng: virtio: drop owner assignment
  virtio_console: drop owner assignment
  crypto: virtio - drop owner assignment
  firmware: arm_scmi: virtio: drop owner assignment
  gpio: virtio: drop owner assignment
  drm/virtio: drop owner assignment
  iommu: virtio: drop owner assignment
  misc: nsm: drop owner assignment
  net: caif: virtio: drop owner assignment
  net: virtio: drop owner assignment
  net: 9p: virtio: drop owner assignment
  vsock/virtio: drop owner assignment
  wifi: mac80211_hwsim: drop owner assignment
  nvdimm: virtio_pmem: drop owner assignment
  rpmsg: virtio: drop owner assignment
  scsi: virtio: drop owner assignment
  fuse: virtio: drop owner assignment
  sound: virtio: drop owner assignment

Li Zhang (1):
  virtio-pci: Check if is_avq is NULL

Li Zhijian (1):
  vdpa: Convert sprintf/snprintf to sysfs_emit

Maxime Coquelin (3):
  vduse: validate block features only with block devices
  vduse: Temporarily fail if control queue feature requested
  vduse: enable Virtio-net device type

Michael S. Tsirkin (1):
  Merge tag 'stable/vduse-virtio-net' into vhost

Mike Christie (9):
  vhost-scsi: Handle vhost_vq_work_queue failures for events
  vhost-scsi: Handle vhost_vq_work_queue failures for cmds
  vhost-scsi: Use system wq to flush dev for TMFs
  vhost: Remove vhost_vq_flush
  vhost_scsi: Handle vhost_vq_work_queue failures for TMFs
  vhost: Use virtqueue mutex for swapping worker
  vhost: Release worker mutex during flushes
  vhost_task: Handle SIGKILL by flushing work and exiting
  kernel: Remove signal hacks for vhost_tasks

Uwe Kleine-König (1):
  virtio-mmio: Convert to platform remove callback returning void

Yuxue Liu (2):
  vp_vdpa: Fix return value check vp_vdpa_request_irq
  vp_vdpa: don't allocate unused msix vectors

Zhu Lingshan (1):
  MAINTAINERS: apply maintainer role of Intel vDPA driver

 MAINTAINERS   |  10 +-
 arch/um/drivers/virt-pci.c|   1 -
 drivers/block/virtio_blk.c|   1 -
 drivers/bluetooth/virtio_bt.c |   1 -
 drivers/char/hw_random

Re: [GIT PULL] virtio: features, fixes, cleanups

2024-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2024 at 06:03:08AM -0400, Michael S. Tsirkin wrote:
> Things to note here:

Sorry Linus, the author of one of the patchsets I merged wants to drop it now.
I could revert, but it seems cleaner to drop it, re-test and re-post.
I will also drop a duplicate commit while I'm at it.



> - the new Marvell OCTEON DPU driver is not here: latest v4 keeps causing
>   build failures on mips. I deferred the pull hoping to get it in
>   and I might merge a new version post rc1
>   (supposed to be ok for new drivers as they can't cause regressions),
>   but we'll see.
> - there are also a couple bugfixes under review, to be merged after rc1
> - I merged a trivial patch (removing a comment) that also got
>   merged through net.
>   git handles this just fine and it did not seem worth it
>   rebasing to drop it.
> - there is a trivial conflict in the header file. Shouldn't be any
>   trouble to resolve, but fyi the resolution by Stephen is here
>   diff --cc drivers/virtio/virtio_mem.c
>   index e8355f55a8f7,6d4dfbc53a66..
>   --- a/drivers/virtio/virtio_mem.c
>   +++ b/drivers/virtio/virtio_mem.c
>   @@@ -21,7 -21,7 +21,8 @@@
> #include 
> #include 
> #include 
>+#include 
>   + #include 
>   Also see it here:
>   https://lore.kernel.org/all/20240423145947.14217...@canb.auug.org.au/
> 
> 
> 
> The following changes since commit 18daea77cca626f590fb140fc11e3a43c5d41354:
> 
>   Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm 
> (2024-04-30 12:40:41 -0700)
> 
> are available in the Git repository at:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
> 
> for you to fetch changes up to 0b8dbbdcf2e42273fbac9b752919e2e5b2abac21:
> 
>   Merge tag 'for_linus' into vhost (2024-05-12 08:15:28 -0400)
> 
> 
> virtio: features, fixes, cleanups
> 
> Several new features here:
> 
> - virtio-net is finally supported in vduse.
> 
> - Virtio (balloon and mem) interaction with suspend is improved
> 
> - vhost-scsi now handles signals better/faster.
> 
> - virtio-net now supports premapped mode by default,
>   opening the door for all kind of zero copy tricks.
> 
> Fixes, cleanups all over the place.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> 
> Christophe JAILLET (1):
>   vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
> 
> David Hildenbrand (1):
>   virtio-mem: support suspend+resume
> 
> David Stevens (2):
>   virtio_balloon: Give the balloon its own wakeup source
>   virtio_balloon: Treat stats requests as wakeup events
> 
> Eugenio Pérez (2):
>   MAINTAINERS: add Eugenio Pérez as reviewer
>   MAINTAINERS: add Eugenio Pérez as reviewer
> 
> Jiri Pirko (1):
>   virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
> 
> Krzysztof Kozlowski (24):
>   virtio: balloon: drop owner assignment
>   virtio: input: drop owner assignment
>   virtio: mem: drop owner assignment
>   um: virt-pci: drop owner assignment
>   virtio_blk: drop owner assignment
>   bluetooth: virtio: drop owner assignment
>   hwrng: virtio: drop owner assignment
>   virtio_console: drop owner assignment
>   crypto: virtio - drop owner assignment
>   firmware: arm_scmi: virtio: drop owner assignment
>   gpio: virtio: drop owner assignment
>   drm/virtio: drop owner assignment
>   iommu: virtio: drop owner assignment
>   misc: nsm: drop owner assignment
>   net: caif: virtio: drop owner assignment
>   net: virtio: drop owner assignment
>   net: 9p: virtio: drop owner assignment
>   vsock/virtio: drop owner assignment
>   wifi: mac80211_hwsim: drop owner assignment
>   nvdimm: virtio_pmem: drop owner assignment
>   rpmsg: virtio: drop owner assignment
>   scsi: virtio: drop owner assignment
>   fuse: virtio: drop owner assignment
>   sound: virtio: drop owner assignment
> 
> Li Zhijian (1):
>   vdpa: Convert sprintf/snprintf to sysfs_emit
> 
> Maxime Coquelin (6):
>   vduse: validate block features only with block devices
>   vduse: Temporarily fail if control queue feature requested
>   vduse: enable Virtio-net device type
>   vduse: validate block features only with block devices
>   vduse: Temporarily fail if control queue feature requested
>   vduse: enable Virtio-net device type
> 
> Michael S. Tsirkin (2):
>   Merge tag 'stable/vduse-virtio-net' into vhost
>   Merge tag 'for_linus' into vho

Re: [GIT PULL] virtio: features, fixes, cleanups

2024-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2024 at 06:22:45PM +0800, Xuan Zhuo wrote:
> On Wed, 22 May 2024 06:03:01 -0400, "Michael S. Tsirkin"  
> wrote:
> > Things to note here:
> >
> > - the new Marvell OCTEON DPU driver is not here: latest v4 keeps causing
> >   build failures on mips. I deferred the pull hoping to get it in
> >   and I might merge a new version post rc1
> >   (supposed to be ok for new drivers as they can't cause regressions),
> >   but we'll see.
> > - there are also a couple bugfixes under review, to be merged after rc1
> > - I merged a trivial patch (removing a comment) that also got
> >   merged through net.
> >   git handles this just fine and it did not seem worth it
> >   rebasing to drop it.
> > - there is a trivial conflict in the header file. Shouldn't be any
> >   trouble to resolve, but fyi the resolution by Stephen is here
> > diff --cc drivers/virtio/virtio_mem.c
> > index e8355f55a8f7,6d4dfbc53a66..
> > --- a/drivers/virtio/virtio_mem.c
> > +++ b/drivers/virtio/virtio_mem.c
> > @@@ -21,7 -21,7 +21,8 @@@
> >   #include 
> >   #include 
> >   #include 
> >  +#include 
> > + #include 
> >   Also see it here:
> >   https://lore.kernel.org/all/20240423145947.14217...@canb.auug.org.au/
> >
> >
> >
> > The following changes since commit 18daea77cca626f590fb140fc11e3a43c5d41354:
> >
> >   Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm 
> > (2024-04-30 12:40:41 -0700)
> >
> > are available in the Git repository at:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
> > tags/for_linus
> >
> > for you to fetch changes up to 0b8dbbdcf2e42273fbac9b752919e2e5b2abac21:
> >
> >   Merge tag 'for_linus' into vhost (2024-05-12 08:15:28 -0400)
> >
> > 
> > virtio: features, fixes, cleanups
> >
> > Several new features here:
> >
> > - virtio-net is finally supported in vduse.
> >
> > - Virtio (balloon and mem) interaction with suspend is improved
> >
> > - vhost-scsi now handles signals better/faster.
> >
> > - virtio-net now supports premapped mode by default,
> >   opening the door for all kind of zero copy tricks.
> >
> > Fixes, cleanups all over the place.
> >
> > Signed-off-by: Michael S. Tsirkin 
> >
> > 
> > Christophe JAILLET (1):
> >   vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
> >
> > David Hildenbrand (1):
> >   virtio-mem: support suspend+resume
> >
> > David Stevens (2):
> >   virtio_balloon: Give the balloon its own wakeup source
> >   virtio_balloon: Treat stats requests as wakeup events
> >
> > Eugenio Pérez (2):
> >   MAINTAINERS: add Eugenio Pérez as reviewer
> >   MAINTAINERS: add Eugenio Pérez as reviewer
> >
> > Jiri Pirko (1):
> >   virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
> >
> > Krzysztof Kozlowski (24):
> >   virtio: balloon: drop owner assignment
> >   virtio: input: drop owner assignment
> >   virtio: mem: drop owner assignment
> >   um: virt-pci: drop owner assignment
> >   virtio_blk: drop owner assignment
> >   bluetooth: virtio: drop owner assignment
> >   hwrng: virtio: drop owner assignment
> >   virtio_console: drop owner assignment
> >   crypto: virtio - drop owner assignment
> >   firmware: arm_scmi: virtio: drop owner assignment
> >   gpio: virtio: drop owner assignment
> >   drm/virtio: drop owner assignment
> >   iommu: virtio: drop owner assignment
> >   misc: nsm: drop owner assignment
> >   net: caif: virtio: drop owner assignment
> >   net: virtio: drop owner assignment
> >   net: 9p: virtio: drop owner assignment
> >   vsock/virtio: drop owner assignment
> >   wifi: mac80211_hwsim: drop owner assignment
> >   nvdimm: virtio_pmem: drop owner assignment
> >   rpmsg: virtio: drop owner assignment
> >   scsi: virtio: drop owner assignment
> >   fuse: virtio: drop owner assignment
> >   sound: virtio: drop owner assignment
> >
> > Li Zhijian (1):
> >   vdpa: Convert sprintf/snprintf to sysfs_emit
> >
> > Maxime Coquelin (6):
> >   vduse: validate block features only with block devices
> >   vduse: Temporaril

[GIT PULL] virtio: features, fixes, cleanups

2024-05-22 Thread Michael S. Tsirkin
Things to note here:

- the new Marvell OCTEON DPU driver is not here: latest v4 keeps causing
  build failures on mips. I deferred the pull hoping to get it in
  and I might merge a new version post rc1
  (supposed to be ok for new drivers as they can't cause regressions),
  but we'll see.
- there are also a couple bugfixes under review, to be merged after rc1
- I merged a trivial patch (removing a comment) that also got
  merged through net.
  git handles this just fine and it did not seem worth it
  rebasing to drop it.
- there is a trivial conflict in the header file. Shouldn't be any
  trouble to resolve, but fyi the resolution by Stephen is here
diff --cc drivers/virtio/virtio_mem.c
index e8355f55a8f7,6d4dfbc53a66..
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@@ -21,7 -21,7 +21,8 @@@
  #include 
  #include 
  #include 
 +#include 
+ #include 
  Also see it here:
  https://lore.kernel.org/all/20240423145947.14217...@canb.auug.org.au/



The following changes since commit 18daea77cca626f590fb140fc11e3a43c5d41354:

  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm 
(2024-04-30 12:40:41 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 0b8dbbdcf2e42273fbac9b752919e2e5b2abac21:

  Merge tag 'for_linus' into vhost (2024-05-12 08:15:28 -0400)


virtio: features, fixes, cleanups

Several new features here:

- virtio-net is finally supported in vduse.

- Virtio (balloon and mem) interaction with suspend is improved

- vhost-scsi now handles signals better/faster.

- virtio-net now supports premapped mode by default,
  opening the door for all kind of zero copy tricks.

Fixes, cleanups all over the place.

Signed-off-by: Michael S. Tsirkin 


Christophe JAILLET (1):
  vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

David Hildenbrand (1):
  virtio-mem: support suspend+resume

David Stevens (2):
  virtio_balloon: Give the balloon its own wakeup source
  virtio_balloon: Treat stats requests as wakeup events

Eugenio Pérez (2):
  MAINTAINERS: add Eugenio Pérez as reviewer
  MAINTAINERS: add Eugenio Pérez as reviewer

Jiri Pirko (1):
  virtio: delete vq in vp_find_vqs_msix() when request_irq() fails

Krzysztof Kozlowski (24):
  virtio: balloon: drop owner assignment
  virtio: input: drop owner assignment
  virtio: mem: drop owner assignment
  um: virt-pci: drop owner assignment
  virtio_blk: drop owner assignment
  bluetooth: virtio: drop owner assignment
  hwrng: virtio: drop owner assignment
  virtio_console: drop owner assignment
  crypto: virtio - drop owner assignment
  firmware: arm_scmi: virtio: drop owner assignment
  gpio: virtio: drop owner assignment
  drm/virtio: drop owner assignment
  iommu: virtio: drop owner assignment
  misc: nsm: drop owner assignment
  net: caif: virtio: drop owner assignment
  net: virtio: drop owner assignment
  net: 9p: virtio: drop owner assignment
  vsock/virtio: drop owner assignment
  wifi: mac80211_hwsim: drop owner assignment
  nvdimm: virtio_pmem: drop owner assignment
  rpmsg: virtio: drop owner assignment
  scsi: virtio: drop owner assignment
  fuse: virtio: drop owner assignment
  sound: virtio: drop owner assignment

Li Zhijian (1):
  vdpa: Convert sprintf/snprintf to sysfs_emit

Maxime Coquelin (6):
  vduse: validate block features only with block devices
  vduse: Temporarily fail if control queue feature requested
  vduse: enable Virtio-net device type
  vduse: validate block features only with block devices
  vduse: Temporarily fail if control queue feature requested
  vduse: enable Virtio-net device type

Michael S. Tsirkin (2):
  Merge tag 'stable/vduse-virtio-net' into vhost
  Merge tag 'for_linus' into vhost

Mike Christie (9):
  vhost-scsi: Handle vhost_vq_work_queue failures for events
  vhost-scsi: Handle vhost_vq_work_queue failures for cmds
  vhost-scsi: Use system wq to flush dev for TMFs
  vhost: Remove vhost_vq_flush
  vhost_scsi: Handle vhost_vq_work_queue failures for TMFs
  vhost: Use virtqueue mutex for swapping worker
  vhost: Release worker mutex during flushes
  vhost_task: Handle SIGKILL by flushing work and exiting
  kernel: Remove signal hacks for vhost_tasks

Uwe Kleine-König (1):
  virtio-mmio: Convert to platform remove callback returning void

Xuan Zhuo (7):
  virtio_ring: introduce dma map api for page
  virtio_ring: enable premapped mode whatever use_dma_api
  virtio_net: replace private by pp struct inside page
  virtio_net: big mode

Re: [PATCH] vhost: use pr_err for vq_err

2024-05-16 Thread Michael S. Tsirkin
On Thu, May 16, 2024 at 03:46:29PM +0800, Peng Fan (OSS) wrote:
> From: Peng Fan 
> 
> Use pr_err to print out error message without enabling DEBUG. This could
> make people catch error easier.
> 
> Signed-off-by: Peng Fan 

This isn't appropriate: pr_err must not be triggerable
by userspace. If you are debugging userspace, use a debugging
kernel, it's that simple.


> ---
>  drivers/vhost/vhost.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index bb75a292d50c..0bff436d1ce9 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -248,7 +248,7 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
> struct vhost_iotlb_map *map);
>  
>  #define vq_err(vq, fmt, ...) do {  \
> - pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> + pr_err(pr_fmt(fmt), ##__VA_ARGS__);   \
>   if ((vq)->error_ctx)   \
>   eventfd_signal((vq)->error_ctx);\
>   } while (0)
> -- 
> 2.37.1




Re: [PATCH net-next] virtio_net: Fix error code in __virtnet_get_hw_stats()

2024-05-15 Thread Michael S. Tsirkin
On Wed, May 15, 2024 at 04:50:48PM +0200, Dan Carpenter wrote:
> On Sun, May 12, 2024 at 12:01:55PM -0400, Michael S. Tsirkin wrote:
> > On Fri, May 10, 2024 at 03:50:45PM +0300, Dan Carpenter wrote:
> > > The virtnet_send_command_reply() function returns true on success or
> > > false on failure.  The "ok" variable is true/false depending on whether
> > > it succeeds or not.  It's up to the caller to translate the true/false
> > > into -EINVAL on failure or zero for success.
> > > 
> > > The bug is that __virtnet_get_hw_stats() returns false for both
> > > errors and success.  It's not a bug, but it is confusing that the caller
> > > virtnet_get_hw_stats() uses an "ok" variable to store negative error
> > > codes.
> > 
> > The bug is ... It's not a bug 
> > 
> > I think what you are trying to say is that the error isn't
> > really handled anyway, except for printing a warning,
> > so it's not a big deal.
> > 
> > Right?
> > 
> 
> No, I'm sorry, that was confusing.  The change to __virtnet_get_hw_stats()
> is a bugfix but the change to virtnet_get_hw_stats() was not a bugfix.
> I viewed this all as really one thing, because it's cleaning up the
> error codes which happens to fix a bug.  It seems very related.  At the
> same time, I can also see how people would disagree.
> 
> I'm traveling until May 23.  I can resend this.  Probably as two patches
> for simpler review.
> 
> regards,
> dan carpenter
>  

Yea, no rush - bugfixes are fine after 23. And it's ok to combine into
one - we don't want inconsistent code - just please write a clear
commit log message.


-- 
MST




[PATCH] vhost/vsock: always initialize seqpacket_allow

2024-05-15 Thread Michael S. Tsirkin
There are two issues around seqpacket_allow:
1. seqpacket_allow is not initialized when socket is
   created. Thus if features are never set, it will be
   read uninitialized.
2. if VIRTIO_VSOCK_F_SEQPACKET is set and then cleared,
   then seqpacket_allow will not be cleared appropriately
   (existing apps I know about don't usually do this but
it's legal and there's no way to be sure no one relies
on this).

To fix:
- initialize seqpacket_allow after allocation
- set it unconditionally in set_features

Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
Reported-by: Jeongjun Park 
Fixes: ced7b713711f ("vhost/vsock: support SEQPACKET for transport")
Cc: Arseny Krasnov 
Cc: David S. Miller 
Cc: Stefan Hajnoczi 
Signed-off-by: Michael S. Tsirkin 
Acked-by: Arseniy Krasnov 
Tested-by: Arseniy Krasnov 

---


Reposting now it's been tested.

 drivers/vhost/vsock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ec20ecff85c7..bf664ec9341b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -667,6 +667,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct 
file *file)
}
 
vsock->guest_cid = 0; /* no CID assigned yet */
+   vsock->seqpacket_allow = false;
 
	atomic_set(&vsock->queued_replies, 0);
 
@@ -810,8 +811,7 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
goto err;
}
 
-   if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
-   vsock->seqpacket_allow = true;
+   vsock->seqpacket_allow = features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET);
 
for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
		vq = &vsock->vqs[i];
-- 
MST




Re: [PATCH net-next] virtio_net: Fix error code in __virtnet_get_hw_stats()

2024-05-12 Thread Michael S. Tsirkin
On Fri, May 10, 2024 at 03:50:45PM +0300, Dan Carpenter wrote:
> The virtnet_send_command_reply() function returns true on success or
> false on failure.  The "ok" variable is true/false depending on whether
> it succeeds or not.  It's up to the caller to translate the true/false
> into -EINVAL on failure or zero for success.
> 
> The bug is that __virtnet_get_hw_stats() returns false for both
> errors and success.  It's not a bug, but it is confusing that the caller
> virtnet_get_hw_stats() uses an "ok" variable to store negative error
> codes.

The bug is ... It's not a bug 

I think what you are trying to say is that the error isn't
really handled anyway, except for printing a warning,
so it's not a big deal.

Right?

I don't know why get_ethtool_stats can't fail - we should
probably fix that.


> Fix the bug and clean things up so that it's clear that
> __virtnet_get_hw_stats() returns zero on success or negative error codes
> on failure.
> 
> Fixes: 941168f8b40e ("virtio_net: support device stats")
> Signed-off-by: Dan Carpenter 
> ---
>  drivers/net/virtio_net.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 218a446c4c27..4fc0fcdad259 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4016,7 +4016,7 @@ static int __virtnet_get_hw_stats(struct virtnet_info 
> *vi,
>   &sgs_out, &sgs_in);
>  
>   if (!ok)
> - return ok;
> + return -EINVAL;
>  
>   for (p = reply; p - reply < res_size; p += le16_to_cpu(hdr->size)) {
>   hdr = p;
> @@ -4053,7 +4053,7 @@ static int virtnet_get_hw_stats(struct virtnet_info *vi,
>   struct virtio_net_ctrl_queue_stats *req;
>   bool enable_cvq;
>   void *reply;
> - int ok;
> + int err;
>  
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_DEVICE_STATS))
>   return 0;
> @@ -4100,12 +4100,12 @@ static int virtnet_get_hw_stats(struct virtnet_info 
> *vi,
>   if (enable_cvq)
>   virtnet_make_stat_req(vi, ctx, req, vi->max_queue_pairs * 2, 
> &j);
>  
> - ok = __virtnet_get_hw_stats(vi, ctx, req, sizeof(*req) * j, reply, 
> res_size);
> + err = __virtnet_get_hw_stats(vi, ctx, req, sizeof(*req) * j, reply, 
> res_size);
>  
>   kfree(req);
>   kfree(reply);
>  
> - return ok;
> + return err;
>  }
>  
>  static void virtnet_get_strings(struct net_device *dev, u32 stringset, u8 
> *data)




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 10:57:38AM -0500, Mike Christie wrote:
> On 5/1/24 2:50 AM, Hillf Danton wrote:
> > On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> >>
> >> and then it failed testing.
> >>
> > So did my patch [1] but then the reason was spotted [2,3]
> > 
> > [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> > [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> > [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/
> 
> Just to make sure I understand the conclusion.
> 
> Edward's patch that just swaps the order of the calls:
> 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> fixes the UAF. I tested the same in my setup. However, when you guys tested it
> with sysbot, it also triggered a softirq/RCU warning.
> 
> The softirq/RCU part of the issue is fixed with this commit:
> 
> https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/
> 
> commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
> Author: Zqiang 
> Date:   Sat Apr 27 18:28:08 2024 +0800
> 
> softirq: Fix suspicious RCU usage in __do_softirq()
> 
> The problem was that I was testing with -next master which has that patch.
> It looks like you guys were testing against bb7a2467e6be which didn't have
> the patch, and so that's why you guys still hit the softirq/RCU issue. Later
> when you added that patch to your patch, it worked with syzbot.
> 
> So is it safe to assume that the softirq/RCU patch above will be upstream
> when the vhost changes go in or is there a tag I need to add to my patches?

That patch is upstream now. I rebased and asked syzbot to test
https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
on top.

If that passes I will squash.




Re: [syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
#syz test https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
f138e94c1f0dbeae721917694fb2203446a68ea9




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 10:57:38AM -0500, Mike Christie wrote:
> On 5/1/24 2:50 AM, Hillf Danton wrote:
> > On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> >>
> >> and then it failed testing.
> >>
> > So did my patch [1] but then the reason was spotted [2,3]
> > 
> > [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> > [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> > [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/
> 
> Just to make sure I understand the conclusion.
> 
> Edward's patch that just swaps the order of the calls:
> 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> fixes the UAF. I tested the same in my setup. However, when you guys tested it
> with syzbot, it also triggered a softirq/RCU warning.
> 
> The softirq/RCU part of the issue is fixed with this commit:
> 
> https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/
> 
> commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
> Author: Zqiang 
> Date:   Sat Apr 27 18:28:08 2024 +0800
> 
> softirq: Fix suspicious RCU usage in __do_softirq()
> 
> The problem was that I was testing with -next master which has that patch.
> It looks like you guys were testing against bb7a2467e6be which didn't have
> the patch, and so that's why you guys still hit the softirq/RCU issue. Later
> when you added that patch to your patch, it worked with syzbot.
> 
> So is it safe to assume that the softirq/RCU patch above will be upstream
> when the vhost changes go in or is there a tag I need to add to my patches?

Two points:
- I do not want bisect broken. If you depend on this patch either I pick
  it too before your patch, or we defer until 
1dd1eff161bd55968d3d46bc36def62d71fb4785
  is merged. You can also ask for that patch to be merged in this cycle.
- Do not assume - pls push somewhere a hash based on vhost that syzbot can test
  and confirm all is well. Thanks!




Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
> > It took me a while, but I was able to figure out why EEVDF behaves 
> > differently than CFS does. I'm still waiting for some official confirmation
> > of my assumptions but it all seems very plausible to me.
> > 
> > Leaving aside all the specifics of vhost and kworkers, a more general
> > description of the scenario would be as follows:
> > 
> > Assume that we have two tasks taking turns on a single CPU. 
> > Task 1 does something and wakes up Task 2.
> > Task 2 does something and goes to sleep.
> > And we're just repeating that.
> > Task 1 and task 2 only run for very short amounts of time, i.e. much 
> > shorter than a regular time slice (vhost = task1, kworker = task2).
> > 
> > Let's further assume that task 1 runs longer than task 2. 
> > In CFS, this means that the vruntime of task 1 starts to outrun the vruntime
> > of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
> > always gets picked on wake up because it has the smaller vruntime. 
> > In EEVDF, this would translate to a permanent positive lag, which also 
> > causes task 2 to get consistently scheduled on wake up.
> > 
> > Let's now assume that occasionally, task 2 runs a little bit longer than
> > task 1. In CFS, this means that task 2 can close the vruntime gap by a
> > bit, but it can easily remain below the value of task 1. Task 2 would 
> > still get picked on wake up.
> > With EEVDF, in its current form, task 2 will now get a negative lag, which
> > in turn will cause it not to be picked on the next wake up.
> 
> Right, so I've been working on changes where tasks will be able to
> 'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
> runqueue will allow them to burn off negative lag. Once they get picked
> again they are guaranteed to have zero (or more) lag. If by that time
> they've not been woken up again, they get dequeued with 0-lag.
> 
> (placement with 0-lag will ensure eligibility doesn't inhibit the pick,
> but is not sufficient to ensure a pick)
> 
> However, this alone will not be sufficient to get the behaviour you
> want. Notably, even at 0-lag the virtual deadline will still be after
> the virtual deadline of the already running task -- assuming they have
> equal request sizes.
> 
> That is, IIUC, you want your task 2 (kworker) to always preempt task 1
> (vhost), right? So even if task 2 were to have 0-lag, placing it would
> be something like:
> 
> t1  |-<
> t2|-<
> V-|-
> 
> So t1 has started at | with a virtual deadline at <. Then a short
> while later -- V will have advanced a little -- it wakes t2 with 0-lag,
> but as you can observe, its virtual deadline will be later than t1's and
> as such it will never get picked, even though they're both eligible.
> 
> > So, it seems we have a change in how far the two variants look
> > into the past. CFS is willing to take more history into account, whereas
> > EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> > and place_entity not taking the original vruntime into account).
> >
> > All of this can be seen as correct by design, a task consumes more time
> > than the others, so it has to give way to others. The big difference
> > is now, that CFS allowed a task to collect some bonus by constantly using 
> > less CPU time than others and trading that time against occasionally taking
> > more CPU time. EEVDF could do the same thing, by allowing the accumulation
> > of positive lag, which can then be traded against the one time the task
> > would get negative lag. This might clash with other EEVDF assumptions 
> > though.
> 
> Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
> considers both virtual runtime (for eligibility, which ties to fairness)
> but primarily virtual deadline (for timeliness).
> 
> If you want to make EEVDF force pick a task by modifying vruntime you
> have to place it with lag > request (slice) such that the virtual
> deadline of the newly placed task is before the already running task,
> yielding both eligibility and earliest deadline.
> 
> Consistently placing tasks with such large (positive) lag will affect
> fairness though, they're basically always runnable, so barring external
> throttling, they'll starve you.
> 
> > The patch below fixes the degradation, but is not at all aligned with what 
> > EEVDF wants to achieve; it helps as an indicator that my hypothesis is
> > correct.
> > 
> > So, what does this now mean for the vhost regression we were discussing?
> > 
> > 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> > 2. vhost in its current form relies on the way CFS works by assuming 
> >that the kworker always gets scheduled.
> 
> How does it assume this? Also, this is a performance issue, not a
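The pick rule Peter describes above — only eligible tasks (zero or positive lag) are considered, and among them the earliest virtual deadline wins — can be sketched as a toy model. This is illustrative only, not the kernel's fair.c implementation, and the struct and function names are invented:

```c
/* Toy EEVDF pick: a task is eligible when its vruntime is not ahead of
 * the queue's average virtual time V; among eligible tasks the one with
 * the earliest virtual deadline is chosen. */
struct toy_task {
	long long vruntime;	/* virtual runtime, determines eligibility */
	long long deadline;	/* virtual deadline = vruntime + request */
};

static int eligible(const struct toy_task *t, long long V)
{
	return t->vruntime <= V;	/* zero or positive lag */
}

/* Return the index of the eligible task with the earliest virtual
 * deadline, or -1 if no task is eligible. */
static int pick(const struct toy_task *t, int n, long long V)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (!eligible(&t[i], V))
			continue;
		if (best < 0 || t[i].deadline < t[best].deadline)
			best = i;
	}
	return best;
}
```

This reproduces the scenario in the mail: a task placed with 0-lag is eligible, but its virtual deadline still lands after the already-running task's, so it does not get picked.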

Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 08:15:44AM +0800, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> > On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> > >  static int vhost_task_fn(void *data)
> > >  {
> > >   struct vhost_task *vtsk = data;
> > > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > >   schedule();
> > >   }
> > >  
> > > - mutex_lock(&vtsk->exit_mutex);
> > > + mutex_lock(&exit_mutex);
> > >   /*
> > >* If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> > >* When the vhost layer has called vhost_task_stop it's already stopped
> > > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > >   vtsk->handle_sigkill(vtsk->data);
> > >   }
> > >   complete(&vtsk->exited);
> > > - mutex_unlock(&vtsk->exit_mutex);
> > > + mutex_unlock(&exit_mutex);
> > >  
> > 
> > Edward, thanks for the patch. I think though I just needed to swap the
> > order of the calls above.
> > 
> > Instead of:
> > 
> > complete(&vtsk->exited);
> > mutex_unlock(&vtsk->exit_mutex);
> > 
> > it should have been:
> > 
> > mutex_unlock(&vtsk->exit_mutex);
> > complete(&vtsk->exited);
> 
> JFYI Edward did it [1]
> 
> [1] 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

and then it failed testing.

> > 
> > If my analysis is correct, then Michael do you want me to resubmit a
> > patch on top of your vhost branch or resubmit the entire patchset?
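The ordering question discussed above — whether complete() may precede the final unlock — is really a lifetime question: once complete() fires, the waiter may free the structure that also holds the mutex, so the completing thread must not touch the object afterwards. A toy trace model of the two orderings (illustrative only, not the vhost code):

```c
/* Record the order of operations so we can check whether anything
 * touches the object after it may have been freed. In the real code,
 * complete() allows the vhost layer to free the vhost_task, so an
 * unlock of a mutex embedded in it after complete() is a UAF. */
enum op { OP_UNLOCK, OP_COMPLETE, OP_FREE };

static int trace[8], ntrace;

static void record(enum op o)
{
	trace[ntrace++] = o;
}

/* Buggy order: completion first, then the unlock touches freed memory. */
static void exit_path_buggy(void)
{
	record(OP_COMPLETE);	/* waiter wakes ... */
	record(OP_FREE);	/* ... and frees the object ... */
	record(OP_UNLOCK);	/* ... then we unlock inside freed memory */
}

/* Fixed order: drop the lock first, signal completion last. */
static void exit_path_fixed(void)
{
	record(OP_UNLOCK);
	record(OP_COMPLETE);
	record(OP_FREE);
}

/* Returns 1 if any operation on the object happens after the free. */
static int uaf(void)
{
	int freed = 0;

	for (int i = 0; i < ntrace; i++) {
		if (trace[i] == OP_FREE)
			freed = 1;
		else if (freed)
			return 1;
	}
	return 0;
}
```

This only models the single interleaving where the waiter runs immediately after complete(); the syzkaller discussion below is about which real interleavings actually trigger.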




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 08:01:11PM -0500, Mike Christie wrote:
> On 4/30/24 7:15 PM, Hillf Danton wrote:
> > On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> >> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >>>  static int vhost_task_fn(void *data)
> >>>  {
> >>>   struct vhost_task *vtsk = data;
> >>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> >>>   schedule();
> >>>   }
> >>>  
> >>> - mutex_lock(&vtsk->exit_mutex);
> >>> + mutex_lock(&exit_mutex);
> >>>   /*
> >>>* If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >>>* When the vhost layer has called vhost_task_stop it's already stopped
> >>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> >>>   vtsk->handle_sigkill(vtsk->data);
> >>>   }
> >>>   complete(&vtsk->exited);
> >>> - mutex_unlock(&vtsk->exit_mutex);
> >>> + mutex_unlock(&exit_mutex);
> >>>  
> >>
> >> Edward, thanks for the patch. I think though I just needed to swap the
> >> order of the calls above.
> >>
> >> Instead of:
> >>
> >> complete(&vtsk->exited);
> >> mutex_unlock(&vtsk->exit_mutex);
> >>
> >> it should have been:
> >>
> >> mutex_unlock(&vtsk->exit_mutex);
> >> complete(&vtsk->exited);
> > 
> > JFYI Edward did it [1]
> > 
> > [1] 
> > https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> Thanks.
> 
> I tested the code with that change and it no longer triggers the UAF.

Weird, but syzkaller said that yes, it triggers.

Compare
dcc0ca06174e6...@google.com
which tests the order
mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);
that you like and says it triggers

and
97bda90617521...@google.com
which says it does not trigger.

Whatever you do, please send it to syzkaller in the original
thread, and when you post, please include the syzkaller report.

Given this gets confusing I'm fine with just a fixup patch,
and note in the commit log where I should squash it.


> I've fixed up the original patch that had the bug and am going to
> resubmit the patchset like how Michael requested.
> 




Re: [PATCH] virtio_net: Warn if insufficient queue length for transmitting

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 03:35:09PM -0400, Darius Rad wrote:
> The transmit queue is stopped when the number of free queue entries is less
> than 2+MAX_SKB_FRAGS, in start_xmit().  If the queue length (QUEUE_NUM_MAX)
> is less than this, transmission will immediately trigger a netdev
> watchdog timeout.  Report this condition earlier and more directly.
> 
> Signed-off-by: Darius Rad 
> ---
>  drivers/net/virtio_net.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 115c3c5414f2..72ee8473b61c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4917,6 +4917,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>   set_bit(guest_offloads[i], &vi->guest_offloads);
>   vi->guest_offloads_capable = vi->guest_offloads;
>  
> + if (virtqueue_get_vring_size(vi->sq->vq) < 2 + MAX_SKB_FRAGS)
> + netdev_warn_once(dev, "not enough queue entries, expect xmit 
> timeout\n");
> +

How about actually fixing it though? E.g. by linearizing...

It also bothers me that there's practically
/proc/sys/net/core/max_skb_frags
and if that's low then things could actually work.

Finally, while originally it was typically just 17, now it's
configurable. So it's possible that you change the config to make big
TCP work better and the device stops working, while it worked fine
previously.


>   pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>dev->name, max_queue_pairs);
>  
> -- 
> 2.39.2
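The condition the proposed warning tests can be sketched as follows. The value below is the traditional default of MAX_SKB_FRAGS, not the runtime-configurable one raised in the reply, and the helper name is invented:

```c
/* Sketch of the probe-time sanity check: start_xmit() stops the queue
 * when fewer than 2 + MAX_SKB_FRAGS entries are free, so a ring smaller
 * than that can never be restarted and the netdev watchdog fires. */
#define TOY_MAX_SKB_FRAGS 17	/* traditional default, now configurable */

static int ring_too_small(unsigned int vring_size, unsigned int max_frags)
{
	return vring_size < 2 + max_frags;
}
```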




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >  static int vhost_task_fn(void *data)
> >  {
> > struct vhost_task *vtsk = data;
> > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > schedule();
> > }
> >  
> > -   mutex_lock(&vtsk->exit_mutex);
> > +   mutex_lock(&exit_mutex);
> > /*
> >  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >  * When the vhost layer has called vhost_task_stop it's already stopped
> > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > vtsk->handle_sigkill(vtsk->data);
> > }
> > complete(&vtsk->exited);
> > -   mutex_unlock(&vtsk->exit_mutex);
> > +   mutex_unlock(&exit_mutex);
> >  
> 
> Edward, thanks for the patch. I think though I just needed to swap the
> order of the calls above.
> 
> Instead of:
> 
> complete(&vtsk->exited);
> mutex_unlock(&vtsk->exit_mutex);
> 
> it should have been:
> 
> mutex_unlock(&vtsk->exit_mutex);
> complete(&vtsk->exited);
> 
> If my analysis is correct, then Michael do you want me to resubmit a
> patch on top of your vhost branch or resubmit the entire patchset?

Resubmit all please.

-- 
MST




Re: [PATCH v2 0/4] vhost: Cleanup

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:56PM +1000, Gavin Shan wrote:
> This is suggested by Michael S. Tsirkin according to [1] and the goal
> is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
> the caller of the function needn't to worry about memory barriers. Since
> we're here, other cleanups are also applied.
> 
> [1] 
> https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/


Patch 1 makes some sense, gave some comments. Rest I think we should
just drop.

> PATCH[1] improves vhost_get_avail_idx() so that smp_rmb() is applied if
>  needed. Besides, the sanity checks on the retrieved available
>  queue index are also squeezed to vhost_get_avail_idx()
> PATCH[2] drops the local variable @last_avail_idx since it's equivalent
>  to vq->last_avail_idx
> PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
>  for vhost_get_avail_idx(), so that the relevant sanity checks
>  on the head are squeezed to vhost_get_avail_head()
> PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
>  as the terminator for each line
> 
> Gavin Shan (3):
>   vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
>   vhost: Improve vhost_get_avail_head()
>   vhost: Reformat vhost_{get, put}_user()
> 
> Michael S. Tsirkin (1):
>   vhost: Improve vhost_get_avail_idx() with smp_rmb()
> 
>  drivers/vhost/vhost.c | 215 +++---
>  1 file changed, 97 insertions(+), 118 deletions(-)
> 
> Changelog
> =
> v2:
>   * Improve vhost_get_avail_idx() as Michael suggested in [1]
> as above (Michael)
>   * Correct @head's type from 'unsigned int' to 'int'
> (l...@intel.com)
> 
> -- 
> 2.44.0




Re: [PATCH v2 4/4] vhost: Reformat vhost_{get, put}_user()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:14:00PM +1000, Gavin Shan wrote:
> Reformat the macros to use tab as the terminator for each line so
> that it looks clean.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 

Just messes up history for no real gain.

> ---
>  drivers/vhost/vhost.c | 60 +--
>  1 file changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 4ddb9ec2fe46..c1ed5e750521 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1207,21 +1207,22 @@ static inline void __user *__vhost_get_user(struct 
> vhost_virtqueue *vq,
>   return __vhost_get_user_slow(vq, addr, size, type);
>  }
>  
> -#define vhost_put_user(vq, x, ptr)   \
> -({ \
> - int ret; \
> - if (!vq->iotlb) { \
> - ret = __put_user(x, ptr); \
> - } else { \
> - __typeof__(ptr) to = \
> +#define vhost_put_user(vq, x, ptr)   \
> +({   \
> + int ret;\
> + if (!vq->iotlb) {   \
> + ret = __put_user(x, ptr);   \
> + } else {\
> + __typeof__(ptr) to =\
>   (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> -   sizeof(*ptr), VHOST_ADDR_USED); \
> - if (to != NULL) \
> - ret = __put_user(x, to); \
> - else \
> - ret = -EFAULT;  \
> - } \
> - ret; \
> + sizeof(*ptr),   \
> + VHOST_ADDR_USED);   \
> + if (to != NULL) \
> + ret = __put_user(x, to);\
> + else\
> + ret = -EFAULT;  \
> + }   \
> + ret;\
>  })
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> @@ -1252,22 +1253,21 @@ static inline int vhost_put_used_idx(struct 
> vhost_virtqueue *vq)
> &vq->used->idx);
>  }
>  
> -#define vhost_get_user(vq, x, ptr, type) \
> -({ \
> - int ret; \
> - if (!vq->iotlb) { \
> - ret = __get_user(x, ptr); \
> - } else { \
> - __typeof__(ptr) from = \
> - (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> -sizeof(*ptr), \
> -type); \
> - if (from != NULL) \
> - ret = __get_user(x, from); \
> - else \
> - ret = -EFAULT; \
> - } \
> - ret; \
> +#define vhost_get_user(vq, x, ptr, type) \
> +({   \
> + int ret;\
> + if (!vq->iotlb) {   \
> + ret = __get_user(x, ptr);   \
> + } else {\
> + __typeof__(ptr) from =  \
> + (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> + sizeof(*ptr), type);\
> + if (from != NULL)   \
> + ret = __get_user(x, from);  \
> + else\
> + ret = -EFAULT;  \
> + }   \
> + ret;\
>  })
>  
>  #define vhost_get_avail(vq, x, ptr) \
> -- 
> 2.44.0




Re: [PATCH v2 3/4] vhost: Improve vhost_get_avail_head()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:59PM +1000, Gavin Shan wrote:
> Improve vhost_get_avail_head() so that the head or errno is returned.
> With it, the relevant sanity checks are squeezed to vhost_get_avail_head()
> and vhost_get_vq_desc() is further simplified.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 

I don't see what does this moving code around achieve.

> ---
>  drivers/vhost/vhost.c | 50 ++-
>  1 file changed, 26 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b278c0333a66..4ddb9ec2fe46 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1322,11 +1322,27 @@ static inline int vhost_get_avail_idx(struct 
> vhost_virtqueue *vq)
>   return 1;
>  }
>  
> -static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> -__virtio16 *head, int idx)
> +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *head,
> -       &vq->avail->ring[idx & (vq->num - 1)]);
> + __virtio16 head;
> + int r;
> +
> + r = vhost_get_avail(vq, head,
> + &vq->avail->ring[vq->last_avail_idx & (vq->num - 
> 1)]);
> + if (unlikely(r)) {
> + vq_err(vq, "Failed to read head: index %u address %p\n",
> +vq->last_avail_idx,
> +&vq->avail->ring[vq->last_avail_idx & (vq->num - 1)]);
> + return r;
> + }
> +
> + r = vhost16_to_cpu(vq, head);
> + if (unlikely(r >= vq->num)) {
> + vq_err(vq, "Invalid head %d (%u)\n", r, vq->num);
> + return -EINVAL;
> + }
> +
> + return r;
>  }
>  
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> @@ -2523,9 +2539,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> struct vhost_log *log, unsigned int *log_num)
>  {
>   struct vring_desc desc;
> - unsigned int i, head, found = 0;
> - __virtio16 ring_head;
> - int ret, access;
> + unsigned int i, found = 0;
> + int head, ret, access;
>  
>   if (vq->avail_idx == vq->last_avail_idx) {
>   ret = vhost_get_avail_idx(vq);
> @@ -2536,23 +2551,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   return vq->num;
>   }
>  
> - /* Grab the next descriptor number they're advertising, and increment
> -  * the index we've seen. */
> - if (unlikely(vhost_get_avail_head(vq, &ring_head, vq->last_avail_idx))) 
> {
> - vq_err(vq, "Failed to read head: idx %d address %p\n",
> -vq->last_avail_idx,
> -&vq->avail->ring[vq->last_avail_idx % vq->num]);
> - return -EFAULT;
> - }
> -
> - head = vhost16_to_cpu(vq, ring_head);
> -
> - /* If their number is silly, that's an error. */
> - if (unlikely(head >= vq->num)) {
> - vq_err(vq, "Guest says index %u > %u is available",
> -head, vq->num);
> - return -EINVAL;
> - }
> + /* Grab the next descriptor number they're advertising */
> + head = vhost_get_avail_head(vq);
> + if (unlikely(head < 0))
> + return head;
>  
>   /* When we start there are none of either input nor output. */
>   *out_num = *in_num = 0;
> -- 
> 2.44.0




Re: [PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:58PM +1000, Gavin Shan wrote:
> The local variable @last_avail_idx is equivalent to vq->last_avail_idx.
> So the code can be simplified a bit by dropping the local variable
> @last_avail_idx.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 7aa623117aab..b278c0333a66 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  {
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
> - u16 last_avail_idx = vq->last_avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> @@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
> - if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> + if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) 
> {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
> -last_avail_idx,
> -&vq->avail->ring[last_avail_idx % vq->num]);
> +vq->last_avail_idx,
> +&vq->avail->ring[vq->last_avail_idx % vq->num]);
>   return -EFAULT;
>   }

I don't see the big advantage and the line is long now.

>  
> -- 
> 2.44.0




Re: [PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:57PM +1000, Gavin Shan wrote:
> From: "Michael S. Tsirkin" 
> 
> All the callers of vhost_get_avail_idx() are concerned to the memory

*with* the memory barrier

> barrier, imposed by smp_rmb() to ensure the order of the available
> ring entry read and avail_idx read.
> 
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced.

accessed, not advanced. guest advances it.

> With it, the callers needn't to worry
> about the memory barrier.
> 
> No functional change intended.

I'd add:

As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others - non empty.
This patch does not attempt to change the existing behaviour.



> Signed-off-by: Michael S. Tsirkin 
> [gshan: repainted vhost_get_avail_idx()]

?repainted?

> Reviewed-by: Gavin Shan 
> Acked-by: Will Deacon 
> ---
>  drivers/vhost/vhost.c | 106 +-
>  1 file changed, 42 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8995730ce0bf..7aa623117aab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 idx;
> + int r;
> +
> + r = vhost_get_avail(vq, idx, &vq->avail->idx);
> + if (unlikely(r < 0)) {
> + vq_err(vq, "Failed to access available index at %p (%d)\n",
> +       &vq->avail->idx, r);
> + return r;
> + }
> +
> + /* Check it isn't doing very strange thing with available indexes */
> + vq->avail_idx = vhost16_to_cpu(vq, idx);
> + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
> + vq_err(vq, "Invalid available index change from %u to %u",
> +vq->last_avail_idx, vq->avail_idx);
> + return -EINVAL;
> + }
> +
> + /* We're done if there is nothing new */
> + if (vq->avail_idx == vq->last_avail_idx)
> + return 0;
> +
> + /*
> +  * We updated vq->avail_idx so we need a memory barrier between
> +  * the index read above and the caller reading avail ring entries.
> +  */
> + smp_rmb();
> + return 1;
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  {
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
> - u16 last_avail_idx;
> - __virtio16 avail_idx;
> + u16 last_avail_idx = vq->last_avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> - /* Check it isn't doing very strange things with descriptor numbers. */
> - last_avail_idx = vq->last_avail_idx;
> -
>   if (vq->avail_idx == vq->last_avail_idx) {
> - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> - &vq->avail->idx);
> - return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> -
> - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> - vq_err(vq, "Guest moved avail index from %u to %u",
> - last_avail_idx, vq->avail_idx);
> - return -EFAULT;
> - }
> + ret = vhost_get_avail_idx(vq);
> + if (unlikely(ret < 0))
> + return ret;
>  
> - /* If there's nothing new since last we looked, return
> -  * invalid.
> -  */
> - if (vq->avail_idx == last_avail_idx)
> + if (!ret)
>   return vq->num;
> -
> - /* Only get avail ring entries after they have been
> -  * exposed by guest.
> -  */
> - smp_rmb();
>   }
>  
>   /* Grab the next descriptor number they're advertising, and increment
> @@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost
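The ordering the patch above centralizes pairs the guest's avail_idx update with the host's ring read: the index read must be ordered before any subsequent ring-entry reads. A userspace sketch using C11 acquire/release in place of the kernel's smp_rmb()/smp_wmb(), for a single producer and single consumer (illustrative, not the vhost or virtio ring layout):

```c
/* Toy split-ring publish/consume showing the barrier pairing: the
 * release store on avail_idx plays the guest's smp_wmb() role, and the
 * acquire load plays the read + smp_rmb() in vhost_get_avail_idx(). */
#include <stdatomic.h>

#define QSZ 16

struct toy_vring {
	_Atomic unsigned short avail_idx;	/* written by the guest */
	unsigned short ring[QSZ];		/* written before avail_idx */
};

/* Guest side: fill the ring entry, then advance the index with release
 * semantics so the entry write becomes visible first. */
static void publish(struct toy_vring *vr, unsigned short head)
{
	unsigned short idx = atomic_load_explicit(&vr->avail_idx,
						  memory_order_relaxed);

	vr->ring[idx % QSZ] = head;
	atomic_store_explicit(&vr->avail_idx, idx + 1, memory_order_release);
}

/* Host side: an acquire load of avail_idx orders the subsequent ring
 * read after the index read. Returns 1 and fills *head if there is a
 * new entry, 0 if nothing new. */
static int consume(struct toy_vring *vr, unsigned short *last,
		   unsigned short *head)
{
	unsigned short idx = atomic_load_explicit(&vr->avail_idx,
						  memory_order_acquire);

	if (idx == *last)
		return 0;			/* nothing new */
	*head = vr->ring[*last % QSZ];		/* ordered after the index read */
	(*last)++;
	return 1;
}
```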

Re: [PATCH 0/4] vhost: Cleanup

2024-04-29 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 01:24:03PM +1000, Gavin Shan wrote:
> This is suggested by Michael S. Tsirkin according to [1] and the goal
> is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
> the caller of the function needn't worry about memory barriers. Since
> we're here, other cleanups are also applied.


Gavin I suggested another approach.
1. Start with the patch I sent (vhost: order avail ring reads after
   index updates) just do a diff against latest.
   simplify error handling a bit.
2. Do any other cleanups on top.

> [1] 
> https://lore.kernel.org/virtualization/20240327075940-mutt-send-email-...@kernel.org/
> 
> PATCH[1] drops the local variable @last_avail_idx since it's equivalent
>  to vq->last_avail_idx
> PATCH[2] improves vhost_get_avail_idx() so that smp_rmb() is applied if
>  needed. Besides, the sanity checks on the retrieved available
>  queue index are also squeezed to vhost_get_avail_idx()
> PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
>  for vhost_get_avail_idx(), so that the relevant sanity checks
>  on the head are squeezed to vhost_get_avail_head()
> PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
>  as the terminator for each line
> 
> Gavin Shan (4):
>   vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
>   vhost: Improve vhost_get_avail_idx() with smp_rmb()
>   vhost: Improve vhost_get_avail_head()
>   vhost: Reformat vhost_{get, put}_user()
> 
>  drivers/vhost/vhost.c | 199 +++---
>  1 file changed, 88 insertions(+), 111 deletions(-)
> 
> -- 
> 2.44.0
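One detail worth noting about the index sanity check that the series moves into vhost_get_avail_idx(): avail_idx and last_avail_idx are free-running 16-bit counters, so the distance test must be done in u16 arithmetic to stay correct across wraparound. A sketch of just that check:

```c
/* Sketch of the "Invalid available index change" check: the cast to
 * u16 makes the subtraction wrap the same way the free-running
 * ring indices do, so the distance stays meaningful across 0xffff. */
#include <stdint.h>

static int invalid_advance(uint16_t avail_idx, uint16_t last_avail_idx,
			   uint16_t num)
{
	return (uint16_t)(avail_idx - last_avail_idx) > num;
}
```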




Re: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-25 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 08:42:57AM +, Angus Chen wrote:
> Hi mst.
> 
> > -Original Message-
> > From: Michael S. Tsirkin 
> > Sent: Tuesday, April 23, 2024 4:35 PM
> > To: Gavin Liu 
> > Cc: jasow...@redhat.com; Angus Chen ;
> > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com;
> > linux-kernel@vger.kernel.org; Heng Qi 
> > Subject: Re: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> > 
> > On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote:
> > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > > > From: Yuxue Liu 
> > > >
> > > > When there is a ctlq and it doesn't require interrupt callbacks, the
> > > > original method of calculating vectors wastes hardware MSI or MSI-X
> > > > resources as well as system IRQ resources.
> > > >
> > > > When conducting performance testing using testpmd in the guest os, it
> > > > was found that the performance was lower compared to directly using
> > > > vfio-pci to passthrough the device
> > > >
> > > > In scenarios where the virtio device in the guest os does not utilize
> > > > interrupts, the vdpa driver still configures the hardware's msix
> > > > vector. Therefore, the hardware still sends interrupts to the host os.
> > >
> > > >I just have a question on this part. How come hardware sends interrupts?
> > > >Does the guest driver not disable them?
> > >
> > >1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets
> > the call fd to -1
> > >2:On the host side, the vhost_vdpa program will set
> > vp_vdpa->vring[i].cb.callback to invalid
> > >3:Before the modification, the vp_vdpa_request_irq function does not
> > check whether
> > >   vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the
> > hardware's MSIX
> > > interrupts based on the number of queues of the device
> > >
> > 
> > So MSIX is enabled but why would it trigger? virtio PMD in poll mode
> > presumably suppresses interrupts after all.
> Virtio PMD is in the guest, but on the host side MSI-X is enabled, so the
> device will trigger interrupts normally. I analysed this bug before, and I
> think Gavin is right.
> Did I make it clear?

Not really. The guest presumably disables interrupts (it's polling),
so why does the device still send them?


> > 
> > >
> > >
> > > - Original Message -
> > > From: Michael S. Tsirkin m...@redhat.com
> > > Sent: April 22, 2024 20:09
> > > To: Gavin Liu gavin@jaguarmicro.com
> > > Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com;
> > virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com;
> > linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com
> > > Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> > >
> > >
> > >
> > > External Mail: This email originated from OUTSIDE of the organization!
> > > Do not click links, open attachments or provide ANY information unless you
> > recognize the sender and know the content is safe.
> > >
> > >
> > > On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > > > From: Yuxue Liu 
> > > >
> > > > When there is a ctlq and it doesn't require interrupt callbacks, the
> > > > original method of calculating vectors wastes hardware msi or msix
> > > > resources as well as system IRQ resources.
> > > >
> > > > When conducting performance testing using testpmd in the guest os, it
> > > > was found that the performance was lower compared to directly using
> > > > vfio-pci to passthrough the device
> > > >
> > > > In scenarios where the virtio device in the guest os does not utilize
> > > > interrupts, the vdpa driver still configures the hardware's msix
> > > > vector. Therefore, the hardware still sends interrupts to the host os.
> > >
> > > I just have a question on this part. How come hardware sends interrupts?
> > > Doesn't the guest driver disable them?
> > >
> > > > Because of this unnecessary
> > > > action by the hardware, hardware performance decreases, and it also
> > > > affects the performance of the host os.
> > > >
> > > > Before modification:(interrupt mode)
> > > >  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
> > > >

[GIT PULL] virtio: bugfix

2024-04-25 Thread Michael S. Tsirkin
The following changes since commit 0bbac3facb5d6cc0171c45c9873a2dc96bea9680:

  Linux 6.9-rc4 (2024-04-14 13:38:39 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 98a821546b3919a10a58faa12ebe5e9a55cd638e:

  vDPA: code clean for vhost_vdpa uapi (2024-04-22 17:07:13 -0400)


virtio: bugfix

enum renames for vdpa uapi - we better do this now before
the names have been in any releases.

Signed-off-by: Michael S. Tsirkin 


Zhu Lingshan (1):
  vDPA: code clean for vhost_vdpa uapi

 drivers/vdpa/vdpa.c   | 6 +++---
 include/uapi/linux/vdpa.h | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-25 Thread Michael S. Tsirkin
On Thu, Apr 25, 2024 at 09:35:58AM +0800, Jason Wang wrote:
> On Wed, Apr 24, 2024 at 5:51 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Apr 24, 2024 at 08:44:10AM +0800, Jason Wang wrote:
> > > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > > > > and vduse_free_reconnnect_info_mem
> > > > > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > > > > reconnection
> > > > > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > > > > Each VQS will map its own page where the reconnection 
> > > > > > > > > information will be saved
> > > > > > > > >
> > > > > > > > > Signed-off-by: Cindy Lu 
> > > > > > > > > ---
> > > > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > > > > ++
> > > > > > > > >  1 file changed, 40 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > > > > >   int irq_effective_cpu;
> > > > > > > > >   struct cpumask irq_affinity;
> > > > > > > > >   struct kobject kobj;
> > > > > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > >  struct vduse_dev;
> > > > > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > > > > >
> > > > > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > > > > >  }
> > > > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev 
> > > > > > > > > *dev)
> > > > > > > > > +{
> > > > > > > > > + unsigned long vaddr = 0;
> > > > > > > > > + struct vduse_virtqueue *vq;
> > > > > > > > > +
> > > > > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > > > > + /*page 0~ vq_num save the reconnect info for 
> > > > > > > > > vq*/
> > > > > > > > > + vq = dev->vqs[i];
> > > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > > > > >
> > > > > > > >
> > > > > > > > I don't get why you insist on stealing kernel memory for 
> > > > > > > > something
> > > > > > > > that is just used by userspace to store data for its own use.
> > > > > > > > Userspace does not lack ways to persist data, for example,
> > > > > > > > create a regular file anywhere in the filesystem.
> > > > > > >
> > > > > > > Good point. So the motivation here is to:
> > > > > > >
> > > > > > > 1) be self contained, no dependency for high speed persist data
> > > > > > > storage like tmpfs
> > > > > >
> > > > > > No idea what this means.
> > > > >
> > > > > I mean a regular file may slow down the datapath performance, so
> > > > > usually the application will try to use tmpfs or similar, which is a
> > > > > dependency for implementing the reconnection.
> > > >
> > > > Are we worried about systems without tmpfs now?
> > >
> > > Yes.
> >
> > Why? Who ships these?
> 
> Not sure, but it could be disabled or unmounted. I'm not sure making
> VDUSE depend on TMPFS is a good idea.
> 
> Thanks

Don't disable or unmount it then?
The use-case needs to be much clearer if we are adding a way for
userspace to pin kernel memory for unlimited time.

-- 
MST
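For reference, the regular-file approach suggested above needs no kernel support: a userspace VDUSE daemon can persist per-vq reconnect state by mmap()-ing an ordinary file. A minimal sketch, where the record layout and file path are hypothetical (the real format would be whatever the daemon defines for itself):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical per-vq reconnect record; fields are illustrative only. */
struct vq_reconnect_info {
	uint16_t last_avail_idx;
	uint16_t used_idx;
};

/* Map one page of a regular file so the state survives a daemon restart. */
static struct vq_reconnect_info *map_reconnect_file(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, 4096) < 0) {
		close(fd);
		return NULL;
	}
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	close(fd);		/* a MAP_SHARED mapping stays valid after close */
	return p == MAP_FAILED ? NULL : p;
}
```

Updates through the mapping land in the page cache and are visible to the next incarnation of the daemon that maps the same file, which is exactly the reconnection property the kernel pages were meant to provide.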




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-24 Thread Michael S. Tsirkin
On Wed, Apr 24, 2024 at 08:44:10AM +0800, Jason Wang wrote:
> On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  wrote:
> >
> > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > > and vduse_free_reconnnect_info_mem
> > > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > > reconnection
> > > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > > Each VQS will map its own page where the reconnection information 
> > > > > > > will be saved
> > > > > > >
> > > > > > > Signed-off-by: Cindy Lu 
> > > > > > > ---
> > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > > ++
> > > > > > >  1 file changed, 40 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > > >   int irq_effective_cpu;
> > > > > > >   struct cpumask irq_affinity;
> > > > > > >   struct kobject kobj;
> > > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > > >  };
> > > > > > >
> > > > > > >  struct vduse_dev;
> > > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > > >
> > > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > > >  }
> > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > > > +{
> > > > > > > + unsigned long vaddr = 0;
> > > > > > > + struct vduse_virtqueue *vq;
> > > > > > > +
> > > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > > > + vq = dev->vqs[i];
> > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > > >
> > > > > >
> > > > > > I don't get why you insist on stealing kernel memory for something
> > > > > > that is just used by userspace to store data for its own use.
> > > > > > Userspace does not lack ways to persist data, for example,
> > > > > > create a regular file anywhere in the filesystem.
> > > > >
> > > > > Good point. So the motivation here is to:
> > > > >
> > > > > 1) be self contained, no dependency for high speed persist data
> > > > > storage like tmpfs
> > > >
> > > > No idea what this means.
> > >
> > > I mean a regular file may slow down the datapath performance, so
> > > usually the application will try to use tmpfs or similar, which is a
> > > dependency for implementing the reconnection.
> >
> > Are we worried about systems without tmpfs now?
> 
> Yes.

Why? Who ships these?


> >
> >
> > > >
> > > > > 2) standardize the format in uAPI which allows reconnection from
> > > > > arbitrary userspace, unfortunately, such effort was removed in new
> > > > > versions
> > > >
> > > > And I don't see why that has to live in the kernel tree either.
> > >
> > > I can't find a better place, any idea?
> > >
> > > Thanks
> >
> >
> > Well anywhere on github really. w

Re: [PATCH v3 2/4] virtio_balloon: introduce oom-kill invocations

2024-04-23 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 11:41:07AM +0800, zhenwei pi wrote:
> When the guest OS runs under critical memory pressure, the guest
> starts to kill processes. A guest monitor agent may scan 'oom_kill'
> from /proc/vmstat, and report the OOM-KILL event. However, the agent
> may be killed and we will lose this critical event (and the later
> events).
> 
> For now we can also grep for magic words in guest kernel log from host
> side. Rather than this unstable way, virtio balloon reports OOM-KILL
> invocations instead.
> 
> Acked-by: David Hildenbrand 
> Signed-off-by: zhenwei pi 
> ---
>  drivers/virtio/virtio_balloon.c | 1 +
>  include/uapi/linux/virtio_balloon.h | 6 --
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 1710e3098ecd..f7a47eaa0936 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -330,6 +330,7 @@ static inline unsigned int update_balloon_vm_stats(struct 
> virtio_balloon *vb)
>   pages_to_bytes(events[PSWPOUT]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> + update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
>  
>  #ifdef CONFIG_HUGETLB_PAGE
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
> diff --git a/include/uapi/linux/virtio_balloon.h 
> b/include/uapi/linux/virtio_balloon.h
> index ddaa45e723c4..b17bbe033697 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -71,7 +71,8 @@ struct virtio_balloon_config {
>  #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
>  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
>  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation 
> failures */
> -#define VIRTIO_BALLOON_S_NR   10
> +#define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
> +#define VIRTIO_BALLOON_S_NR   11
>  
>  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
>   VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \

Looks like a useful extension. But
any UAPI extension has to go to virtio spec first.

> @@ -83,7 +84,8 @@ struct virtio_balloon_config {
>   VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
>   VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
>   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
> - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
> + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
> + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
>  }
>  
>  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")
> -- 
> 2.34.1
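One invariant the patch illustrates: VIRTIO_BALLOON_S_NR and the names array must stay in lockstep, since the stats virtqueue reports S_NR entries and each id needs a matching name. A toy mirror of that invariant, where the ids and strings are illustrative, not the real UAPI macros:

```c
#include <assert.h>

/* Toy mirror of the stat-id / stat-name pairing in the UAPI header. */
enum {
	TOY_S_SWAP_IN,
	TOY_S_SWAP_OUT,
	TOY_S_OOM_KILL,		/* newly appended id, before the _NR sentinel */
	TOY_S_NR
};

static const char *toy_names[] = {
	"swap-in",
	"swap-out",
	"oom-kills",		/* name appended in the same change */
};

/* Every id below the sentinel must have exactly one name. */
static int names_match_count(void)
{
	return sizeof(toy_names) / sizeof(toy_names[0]) == TOY_S_NR;
}
```

Appending the new id immediately before the `_NR` sentinel, as the patch does, keeps older ids stable for existing consumers.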




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-23 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  wrote:
> >
> > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > and vduse_free_reconnnect_info_mem
> > > > > These functions allow vduse to allocate and free memory for 
> > > > > reconnection
> > > > > information. The amount of memory allocated is vq_num pages.
> > > > > Each VQS will map its own page where the reconnection information 
> > > > > will be saved
> > > > >
> > > > > Signed-off-by: Cindy Lu 
> > > > > ---
> > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > ++
> > > > >  1 file changed, 40 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > >   int irq_effective_cpu;
> > > > >   struct cpumask irq_affinity;
> > > > >   struct kobject kobj;
> > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > >  };
> > > > >
> > > > >  struct vduse_dev;
> > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > >
> > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > >  }
> > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > +{
> > > > > + unsigned long vaddr = 0;
> > > > > + struct vduse_virtqueue *vq;
> > > > > +
> > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > + vq = dev->vqs[i];
> > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > >
> > > >
> > > > I don't get why you insist on stealing kernel memory for something
> > > > that is just used by userspace to store data for its own use.
> > > > Userspace does not lack ways to persist data, for example,
> > > > create a regular file anywhere in the filesystem.
> > >
> > > Good point. So the motivation here is to:
> > >
> > > 1) be self contained, no dependency for high speed persist data
> > > storage like tmpfs
> >
> > No idea what this means.
> 
> I mean a regular file may slow down the datapath performance, so
> usually the application will try to use tmpfs or similar, which is a
> dependency for implementing the reconnection.

Are we worried about systems without tmpfs now?


> >
> > > 2) standardize the format in uAPI which allows reconnection from
> > > arbitrary userspace, unfortunately, such effort was removed in new
> > > versions
> >
> > And I don't see why that has to live in the kernel tree either.
> 
> I can't find a better place, any idea?
> 
> Thanks


Well, anywhere on GitHub really. With libvhost-user, maybe?
It's harmless enough in Documentation
if you like but ties you to the kernel release cycle in a way that
is completely unnecessary.

> >
> > > If the above doesn't make sense, we don't need to offer those pages by 
> > > VDUSE.
> > >
> > > Thanks
> > >
> > >
> > > >
> > > >
> > > >
> > > > > + if (vaddr == 0)
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + vq->vdpa_reconnect_vaddr = vaddr;
> > > > > + }
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > +{
> > > > > + struct vduse_virtqueue *vq;
> > > > > +
> > >

Re: 回复: [PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-23 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 01:39:17AM +, Gavin Liu wrote:
> On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > From: Yuxue Liu 
> >
> > When there is a ctlq and it doesn't require interrupt callbacks, the
> > original method of calculating vectors wastes hardware msi or msix 
> > resources as well as system IRQ resources.
> >
> > When conducting performance testing using testpmd in the guest os, it 
> > was found that the performance was lower compared to directly using 
> > vfio-pci to passthrough the device
> >
> > In scenarios where the virtio device in the guest os does not utilize 
> > interrupts, the vdpa driver still configures the hardware's msix 
> > vector. Therefore, the hardware still sends interrupts to the host os.
> 
> >I just have a question on this part. How come hardware sends interrupts?
> >Doesn't the guest driver disable them?
>
>1:Assuming the guest OS's Virtio device is using PMD mode, QEMU sets the 
> call fd to -1
>2:On the host side, the vhost_vdpa program will set 
> vp_vdpa->vring[i].cb.callback to invalid
>3:Before the modification, the vp_vdpa_request_irq function does not check 
> whether 
>   vp_vdpa->vring[i].cb.callback is valid. Instead, it enables the 
> hardware's MSIX
> interrupts based on the number of queues of the device
> 

So MSIX is enabled but why would it trigger? virtio PMD in poll mode
presumably suppresses interrupts after all.
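The allocation logic under discussion reduces to counting a vector only for vrings whose callback is set, plus one for the config interrupt. A sketch in plain C, where the struct and function names are illustrative stand-ins rather than the actual vp_vdpa code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-vring callback state;
 * callback is NULL when userspace set the call fd to -1. */
struct vring_state {
	void (*callback)(void *data);
};

static void dummy_cb(void *data)	/* placeholder callback for tests */
{
	(void)data;
}

/* Count how many MSI-X vectors are actually needed: one per vring with a
 * live callback, plus optionally one for the config interrupt. */
static int count_needed_vectors(const struct vring_state *vrings, int num,
				int need_config_irq)
{
	int vectors = 0;

	for (int i = 0; i < num; i++)
		if (vrings[i].callback)
			vectors++;
	return vectors + (need_config_irq ? 1 : 0);
}
```

With all callbacks invalid (the PMD case), only the config vector remains, matching the "After modification" /proc/interrupts output quoted below.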

> 
> 
> - Original Message -
> From: Michael S. Tsirkin m...@redhat.com
> Sent: April 22, 2024 20:09
> To: Gavin Liu gavin@jaguarmicro.com
> Cc: jasow...@redhat.com; Angus Chen angus.c...@jaguarmicro.com; 
> virtualizat...@lists.linux.dev; xuanz...@linux.alibaba.com; 
> linux-kernel@vger.kernel.org; Heng Qi hen...@linux.alibaba.com
> Subject: Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors
> 
> 
> 
> External Mail: This email originated from OUTSIDE of the organization!
> Do not click links, open attachments or provide ANY information unless you 
> recognize the sender and know the content is safe.
> 
> 
> On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> > From: Yuxue Liu 
> >
> > When there is a ctlq and it doesn't require interrupt callbacks, the
> > original method of calculating vectors wastes hardware msi or msix 
> > resources as well as system IRQ resources.
> >
> > When conducting performance testing using testpmd in the guest os, it 
> > was found that the performance was lower compared to directly using 
> > vfio-pci to passthrough the device
> >
> > In scenarios where the virtio device in the guest os does not utilize 
> > interrupts, the vdpa driver still configures the hardware's msix 
> > vector. Therefore, the hardware still sends interrupts to the host os.
> 
> I just have a question on this part. How come hardware sends interrupts?
> Doesn't the guest driver disable them?
> 
> > Because of this unnecessary
> > action by the hardware, hardware performance decreases, and it also 
> > affects the performance of the host os.
> >
> > Before modification:(interrupt mode)
> >  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
> >  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
> >  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
> >  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> >
> > After modification:(interrupt mode)
> >  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
> >  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
> >  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> >
> > Before modification:(virtio pmd mode for guest os)
> >  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
> >  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
> >  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
> >  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> >
> > After modification:(virtio pmd mode for guest os)
> >  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> >
> > To verify the use of the virtio PMD mode in the guest operating 
> > system, the following patch needs to be applied to QEMU:
> > https://lore.kernel.org/all/20240408073311.2049-1-yuxue.liu@jaguarmicr
> > o.com
> >
> > Signed-off-by: Yuxue Liu 
> > Acked-by: Jason Wang 
> > Reviewed-by: Heng Qi 
> > ---
> > V5: modify the description of the printout when an exception occurs
> > V4: update the title and 

Re: [PATCH 0/3] Improve memory statistics for virtio balloon

2024-04-22 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 02:25:59PM +0800, zhenwei pi wrote:
> RFC -> v1:
> - several text changes: oom-kill -> oom-kills, SCAN_ASYNC -> ASYN_SCAN.
> - move vm events codes into '#ifdef CONFIG_VM_EVENT_COUNTERS'
> 
> RFC version:
> Link: 
> https://lore.kernel.org/lkml/20240415084113.1203428-1-pizhen...@bytedance.com/T/#m1898963b3c27a989b1123db475135c3ca687ca84


Make sure this builds without introducing new warnings please. 

> zhenwei pi (3):
>   virtio_balloon: introduce oom-kill invocations
>   virtio_balloon: introduce memory allocation stall counter
>   virtio_balloon: introduce memory scan/reclaim info
> 
>  drivers/virtio/virtio_balloon.c | 30 -
>  include/uapi/linux/virtio_balloon.h | 16 +--
>  2 files changed, 43 insertions(+), 3 deletions(-)
> 
> -- 
> 2.34.1




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 02:15:24PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 3/30/24 19:02, Gavin Shan wrote:
> > On 3/28/24 19:31, Michael S. Tsirkin wrote:
> > > On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> > > > All the callers of vhost_get_avail_idx() are concerned with the memory
> > > > barrier, imposed by smp_rmb() to ensure the order of the available
> > > > ring entry read and avail_idx read.
> > > > 
> > > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> > > > the avail_idx is advanced. With it, the callers needn't worry
> > > > about the memory barrier.
> > > > 
> > > > Suggested-by: Michael S. Tsirkin 
> > > > Signed-off-by: Gavin Shan 
> > > 
> > > Previous patches are ok. This one I feel needs more work -
> > > first more code such as sanity checking should go into
> > > this function, second there's actually a difference
> > > between comparing to last_avail_idx and just comparing
> > > to the previous value of avail_idx.
> > > I will pick patches 1-2 and post a cleanup on top so you can
> > > take a look, ok?
> > > 
> > 
> > Thanks, Michael. It's fine to me.
> > 
> 
> A kindly ping.
> 
> If it's ok to you, could you please merge PATCH[1-2]? Our downstream
> 9.4 needs the fixes, especially for NVIDIA's Grace-Hopper and Grace-Grace
> platforms.
> 
> For PATCH[3], I also can help with the improvement if you don't have time
> for it. Please let me know.
> 
> Thanks,
> Gavin

1-2 are upstream go ahead and post the cleanup.

-- 
MST
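In userspace terms, the helper proposed in patch 3 pairs the index load with a read barrier only when the index has actually advanced, so callers no longer need their own smp_rmb() before reading the ring entries. A rough stand-in using C11 atomics, with an acquire fence approximating smp_rmb(); the names are illustrative, not the vhost code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the avail index shared with the other side. */
struct toy_vq {
	_Atomic uint16_t avail_idx;	/* written by the "guest" side */
	uint16_t last_avail_idx;	/* our last processed index */
};

/* Returns true when new entries are available; issues the read barrier
 * (here: an acquire fence) only in that case, ordering the index read
 * before any subsequent ring-entry reads by the caller. */
static bool vq_avail_advanced(struct toy_vq *vq)
{
	uint16_t idx = atomic_load_explicit(&vq->avail_idx,
					    memory_order_relaxed);

	if (idx == vq->last_avail_idx)
		return false;
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

Folding the barrier into the helper is the point of the patch: the not-advanced path skips the fence entirely, and no caller can forget it on the advanced path.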




Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-22 Thread Michael S. Tsirkin
On Tue, Apr 09, 2024 at 09:48:08AM +0800, Hou Tao wrote:
> Hi,
> 
> On 4/8/2024 3:45 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
> >> From: Hou Tao 
> >>
> >> Hi,
> >>
> >> The patch set aims to fix the warning related to an abnormal size
> >> parameter of kmalloc() in virtiofs. The warning occurred when attempting
> >> to insert a 10MB sized kernel module kept in a virtiofs with cache
> >> disabled. As analyzed in patch #1, the root cause is that the length of
> >> the read buffer is no limited, and the read buffer is passed directly to
> >> virtiofs through out_args[0].value. Therefore patch #1 limits the
> >> length of the read buffer passed to virtiofs by using max_pages. However
> >> it is not enough, because now the maximal value of max_pages is 256.
> >> Consequently, when reading a 10MB-sized kernel module, the length of the
> >> bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
> >> try to allocate 2MB from memory subsystem. The request for 2MB of
> >> physically contiguous memory significantly stresses the memory subsystem
> >> and may fail indefinitely on hosts with fragmented memory. To address
> >> this, patch #2~#5 use scattered pages in a bio_vec to replace the
> >> kmalloc-allocated bounce buffer when the length of the bounce buffer for
> >> KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
> >> allocation of the bounce buffer and sg array in virtiofs is that
> >> GFP_ATOMIC is used even when the allocation occurs in a kworker context.
> >> Therefore the last patch uses GFP_NOFS for the allocation of both sg
> >> array and bounce buffer when initiated by the kworker. For more details,
> >> please check the individual patches.
> >>
> >> As usual, comments are always welcome.
> >>
> >> Change Log:
> > Bernd should I just merge the patchset as is?
> > It seems to fix a real problem and no one has the
> > time to work on a better fix  WDYT?
> 
> Sorry for the long delay. I am just start to prepare for v3. In v3, I
> plan to avoid the unnecessary memory copy between fuse args and bio_vec.
> Will post it before next week.

Didn't happen before this week apparently.

> >
> >
> >> v2:
> >>   * limit the length of ITER_KVEC dio by max_pages instead of the
> >> newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC
> >> dio being consistent with other rw operations.
> >>   * replace kmalloc-allocated bounce buffer by using a bounce buffer
> >> backed by scattered pages when the length of the bounce buffer for
> >> KVEC_ITER dio is larger than PAG_SIZE, so even on hosts with
> >> fragmented memory, the KVEC_ITER dio can be handled normally by
> >> virtiofs. (Bernd Schubert)
> >>   * merge the GFP_NOFS patch [1] into this patch-set and use
> >> memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
> >> (Benjamin Coddington)
> >>
> >> v1: 
> >> https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
> >>
> >> [1]: 
> >> https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/
> >>
> >> Hou Tao (6):
> >>   fuse: limit the length of ITER_KVEC dio by max_pages
> >>   virtiofs: move alloc/free of argbuf into separated helpers
> >>   virtiofs: factor out more common methods for argbuf
> >>   virtiofs: support bounce buffer backed by scattered pages
> >>   virtiofs: use scattered bounce buffer for ITER_KVEC dio
> >>   virtiofs: use GFP_NOFS when enqueuing request through kworker
> >>
> >>  fs/fuse/file.c  |  12 +-
> >>  fs/fuse/virtio_fs.c | 336 +---
> >>  2 files changed, 296 insertions(+), 52 deletions(-)
> >>
> >> -- 
> >> 2.29.2
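The core idea of patches #4 and #5, replacing one large physically contiguous bounce buffer with independent page-sized chunks, can be sketched in userspace C. The real code builds a bio_vec over kernel pages; the helpers below are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SZ 4096

/* Allocate len bytes as independent page-sized chunks instead of one
 * contiguous block, so fragmented memory cannot fail a multi-megabyte
 * allocation the way a single 2MB kmalloc() can. */
static void **alloc_scattered(size_t len, size_t *nr_pages)
{
	size_t n = (len + PAGE_SZ - 1) / PAGE_SZ;
	void **pages = calloc(n, sizeof(*pages));

	if (!pages)
		return NULL;
	for (size_t i = 0; i < n; i++) {
		pages[i] = malloc(PAGE_SZ);
		if (!pages[i]) {
			while (i--)		/* unwind on failure */
				free(pages[i]);
			free(pages);
			return NULL;
		}
	}
	*nr_pages = n;
	return pages;
}

static void free_scattered(void **pages, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		free(pages[i]);
	free(pages);
}
```

Each chunk only needs to be page-sized and virtually addressable, which is the property that lets the 10MB read described above succeed on fragmented hosts.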




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-22 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > Add the function vduse_alloc_reconnnect_info_mem
> > > and vduse_free_reconnnect_info_mem
> > > These functions allow vduse to allocate and free memory for reconnection
> > > information. The amount of memory allocated is vq_num pages.
> > > Each VQS will map its own page where the reconnection information will be 
> > > saved
> > >
> > > Signed-off-by: Cindy Lu 
> > > ---
> > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
> > >  1 file changed, 40 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > index ef3c9681941e..2da659d5f4a8 100644
> > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > >   int irq_effective_cpu;
> > >   struct cpumask irq_affinity;
> > >   struct kobject kobj;
> > > + unsigned long vdpa_reconnect_vaddr;
> > >  };
> > >
> > >  struct vduse_dev;
> > > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> > > vduse_virtqueue *vq)
> > >
> > >   vq->irq_effective_cpu = curr_cpu;
> > >  }
> > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > +{
> > > + unsigned long vaddr = 0;
> > > + struct vduse_virtqueue *vq;
> > > +
> > > + for (int i = 0; i < dev->vq_num; i++) {
> > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > + vq = dev->vqs[i];
> > > + vaddr = get_zeroed_page(GFP_KERNEL);
> >
> >
> > I don't get why you insist on stealing kernel memory for something
> > that is just used by userspace to store data for its own use.
> > Userspace does not lack ways to persist data, for example,
> > create a regular file anywhere in the filesystem.
> 
> Good point. So the motivation here is to:
> 
> 1) be self contained, no dependency for high speed persist data
> storage like tmpfs

No idea what this means.

> 2) standardize the format in uAPI which allows reconnection from
> arbitrary userspace, unfortunately, such effort was removed in new
> versions

And I don't see why that has to live in the kernel tree either.

> If the above doesn't make sense, we don't need to offer those pages by VDUSE.
> 
> Thanks
> 
> 
> >
> >
> >
> > > + if (vaddr == 0)
> > > + return -ENOMEM;
> > > +
> > > + vq->vdpa_reconnect_vaddr = vaddr;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > > +{
> > > + struct vduse_virtqueue *vq;
> > > +
> > > + for (int i = 0; i < dev->vq_num; i++) {
> > > + vq = dev->vqs[i];
> > > +
> > > + if (vq->vdpa_reconnect_vaddr)
> > > + free_page(vq->vdpa_reconnect_vaddr);
> > > + vq->vdpa_reconnect_vaddr = 0;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > >
> > >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > >   unsigned long arg)
> > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
> > >   mutex_unlock(&dev->lock);
> > >   return -EBUSY;
> > >   }
> > > + vduse_free_reconnnect_info_mem(dev);
> > > +
> > >   dev->connected = true;
> > >   mutex_unlock(&dev->lock);
> > >
> > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct 
> > > vduse_dev_config *config,
> > >   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
> > >   if (ret)
> > >   goto err_vqs;
> > > + ret = vduse_alloc_reconnnect_info_mem(dev);
> > > + if (ret < 0)
> > > + goto err_mem;
> > >
> > >   __module_get(THIS_MODULE);
> > >
> > >   return 0;
> > >  err_vqs:
> > >   device_destroy(_class, MKDEV(MAJOR(vduse_major), dev->minor));
> > > +err_mem:
> > > + vduse_free_reconnnect_info_mem(dev);
> > >  err_dev:
> > >   idr_remove(&vduse_idr, dev->minor);
> > >  err_idr:
> > > --
> > > 2.43.0
> >




Re: [PATCH virt] virt: fix uninit-value in vhost_vsock_dev_open

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2024 at 09:00:31AM -0400, Stefan Hajnoczi wrote:
> On Sun, Apr 21, 2024 at 12:06:06PM +0900, Jeongjun Park wrote:
> > static bool vhost_transport_seqpacket_allow(u32 remote_cid)
> > {
> > 
> > vsock = vhost_vsock_get(remote_cid);
> > 
> > if (vsock)
> > seqpacket_allow = vsock->seqpacket_allow;
> > 
> > }
> > 
> > I think this is due to reading a previously created uninitialized 
> > vsock->seqpacket_allow inside vhost_transport_seqpacket_allow(), 
> > which is executed by the function pointer present in the if statement.
> 
> CCing Arseny, author of commit ced7b713711f ("vhost/vsock: support
> SEQPACKET for transport").
> 
> Looks like a genuine bug in the commit. vhost_vsock_set_features() sets
> seqpacket_allow to true when the feature is negotiated. The assumption
> is that the field defaults to false.
> 
> The rest of the vhost_vsock.ko code is written to initialize the
> vhost_vsock fields, so you could argue seqpacket_allow should just be
> explicitly initialized to false.
> 
> However, eliminating this class of errors by zeroing seems reasonable in
> this code path. vhost_vsock_dev_open() is not performance-critical.
> 
> Acked-by: Stefan Hajnoczi 



But now that it's explained, the bugfix as proposed is incomplete:
userspace can set features twice, and the second time will leak the
old VIRTIO_VSOCK_F_SEQPACKET bit value.

And I am pretty sure the Fixes tag is wrong.

So I wrote this, but I actually don't have a set for
seqpacket to test this. Arseny could you help test maybe?
Thanks!


commit bcc17a060d93b198d8a17a9b87b593f41337ee28
Author: Michael S. Tsirkin 
Date:   Mon Apr 22 10:03:13 2024 -0400

vhost/vsock: always initialize seqpacket_allow

There are two issues around seqpacket_allow:
1. seqpacket_allow is not initialized when socket is
created. Thus if features are never set, it will be
read uninitialized.
2. if VIRTIO_VSOCK_F_SEQPACKET is set and then cleared,
then seqpacket_allow will not be cleared appropriately
(existing apps I know about don't usually do this but
it's legal and there's no way to be sure no one relies
on this).

To fix:
- initialize seqpacket_allow after allocation
- set it unconditionally in set_features

Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
Reported-by: Jeongjun Park 
Fixes: ced7b713711f ("vhost/vsock: support SEQPACKET for transport").
Cc: Arseny Krasnov 
Cc: David S. Miller 
Cc: Stefan Hajnoczi 
Signed-off-by: Michael S. Tsirkin 

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ec20ecff85c7..bf664ec9341b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -667,6 +667,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct 
file *file)
}
 
vsock->guest_cid = 0; /* no CID assigned yet */
+   vsock->seqpacket_allow = false;
 
atomic_set(&vsock->queued_replies, 0);
 
@@ -810,8 +811,7 @@ static int vhost_vsock_set_features(struct vhost_vsock 
*vsock, u64 features)
goto err;
}
 
-   if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
-   vsock->seqpacket_allow = true;
+   vsock->seqpacket_allow = features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET);
 
for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
vq = &vsock->vqs[i];




Re: [syzbot] [virt?] [net?] KMSAN: uninit-value in vsock_assign_transport (2)

2024-04-22 Thread Michael S. Tsirkin
On Fri, Apr 19, 2024 at 02:39:20AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:8cd26fd90c1a Merge tag 'for-6.9-rc4-tag' of git://git.kern..
> git tree:   upstream
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=102d27cd18
> kernel config:  https://syzkaller.appspot.com/x/.config?x=87a805e655619c64
> dashboard link: https://syzkaller.appspot.com/bug?extid=6c21aeb59d0e82eb2782
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16e38c3b18
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10e62fed18
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/488822aee24a/disk-8cd26fd9.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/ba40e322ba00/vmlinux-8cd26fd9.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/f30af1dfbc30/bzImage-8cd26fd9.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
> 
> =
> BUG: KMSAN: uninit-value in vsock_assign_transport+0xb2a/0xb90 
> net/vmw_vsock/af_vsock.c:500
>  vsock_assign_transport+0xb2a/0xb90 net/vmw_vsock/af_vsock.c:500
>  vsock_connect+0x544/0x1560 net/vmw_vsock/af_vsock.c:1393
>  __sys_connect_file net/socket.c:2048 [inline]
>  __sys_connect+0x606/0x690 net/socket.c:2065
>  __do_sys_connect net/socket.c:2075 [inline]
>  __se_sys_connect net/socket.c:2072 [inline]
>  __x64_sys_connect+0x91/0xe0 net/socket.c:2072
>  x64_sys_call+0x3356/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:43
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Uninit was created at:
>  __kmalloc_large_node+0x231/0x370 mm/slub.c:3921
>  __do_kmalloc_node mm/slub.c:3954 [inline]
>  __kmalloc_node+0xb07/0x1060 mm/slub.c:3973
>  kmalloc_node include/linux/slab.h:648 [inline]
>  kvmalloc_node+0xc0/0x2d0 mm/util.c:634
>  kvmalloc include/linux/slab.h:766 [inline]
>  vhost_vsock_dev_open+0x44/0x510 drivers/vhost/vsock.c:659
>  misc_open+0x66b/0x760 drivers/char/misc.c:165
>  chrdev_open+0xa5f/0xb80 fs/char_dev.c:414
>  do_dentry_open+0x11f1/0x2120 fs/open.c:955
>  vfs_open+0x7e/0xa0 fs/open.c:1089
>  do_open fs/namei.c:3642 [inline]
>  path_openat+0x4a3c/0x5b00 fs/namei.c:3799
>  do_filp_open+0x20e/0x590 fs/namei.c:3826
>  do_sys_openat2+0x1bf/0x2f0 fs/open.c:1406
>  do_sys_open fs/open.c:1421 [inline]
>  __do_sys_openat fs/open.c:1437 [inline]
>  __se_sys_openat fs/open.c:1432 [inline]
>  __x64_sys_openat+0x2a1/0x310 fs/open.c:1432
>  x64_sys_call+0x3a64/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:258
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> CPU: 1 PID: 5021 Comm: syz-executor390 Not tainted 
> 6.9.0-rc4-syzkaller-00038-g8cd26fd90c1a #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 03/27/2024
> =
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> 
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
> 
> If you want syzbot to run the reproducer, reply with:
> #syz test: git://repo/address.git branch-or-commit-hash
> If you attach or paste a git patch, syzbot will apply it before testing.
> 
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
> 
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report
> 
> If you want to undo deduplication, reply with:
> #syz undup


#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
bcc17a060d93b198d8a17a9b87b593f41337ee28






Re: [PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-22 Thread Michael S. Tsirkin
On Wed, Apr 10, 2024 at 11:30:20AM +0800, lyx634449800 wrote:
> From: Yuxue Liu 
> 
> When there is a ctlq and it doesn't require interrupt
> callbacks, the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to pass through the device.
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os.

I just have a question on this part. If the hardware still
sends interrupts, why doesn't the guest driver disable them?

> Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: Yuxue Liu 
> Acked-by: Jason Wang 
> Reviewed-by: Heng Qi 
> ---
> V5: modify the description of the printout when an exception occurs
> V4: update the title and assign values to uninitialized variables
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 22 --
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..8de0224e9ec2 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,7 +160,13 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int vectors = 1;
> + int msix_vec = 0;
> +
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + vectors++;
> + }
>  
>   ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>   if (ret != vectors) {
> @@ -173,9 +179,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   vp_vdpa->vectors = vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,21 +194,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   vp_vdpa->vring[i].irq = irq;
> + msix_vec++;
>   }
>  
>   snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
>pci_name(pdev));
> - irq = pci_irq_vector(pdev, queues);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
>  vp_vdpa->msix_name, vp_vdpa);
>   if (ret) {
>   dev_err(&pdev->dev,
> - "vp_vdpa: fail to request irq for vq %d\n", i);
> + "vp_vdpa: fail to request irq for config: %d\n", ret);
>   goto err;
>   }
> - 

Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 22, 2024 at 03:42:51PM +0800, zhenwei pi wrote:
> All the VM-event-related statistics depend on
> 'CONFIG_VM_EVENT_COUNTERS'; once any stack variable is required by any
> VM event in the future, we would have code like:
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>   unsigned long foo;
>  #endif
>   ...
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>   foo = events[XXX] + events[YYY];
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
>  #endif
> 
> Separate vm events into a single function, also remove
> 'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'.
> 
> Signed-off-by: zhenwei pi 
> ---
>  drivers/virtio/virtio_balloon.c | 44 ++---
>  1 file changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 1f5b3dd31fcf..59fe157e5722 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -316,34 +316,48 @@ static inline void update_stat(struct virtio_balloon 
> *vb, int idx,
>  
>  #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
>  
> -static unsigned int update_balloon_stats(struct virtio_balloon *vb)
> +/* Return the number of entries filled by vm events */
> +static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
> +unsigned int start)
>  {
> +#ifdef CONFIG_VM_EVENT_COUNTERS
>   unsigned long events[NR_VM_EVENT_ITEMS];
> - struct sysinfo i;
> - unsigned int idx = 0;
> - long available;
> - unsigned long caches;
> + unsigned int idx = start;
>  
>   all_vm_events(events);
> - si_meminfo(&i);
> -
> - available = si_mem_available();
> - caches = global_node_page_state(NR_FILE_PAGES);
> -
> -#ifdef CONFIG_VM_EVENT_COUNTERS
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
> - pages_to_bytes(events[PSWPIN]));
> + pages_to_bytes(events[PSWPIN]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
> - pages_to_bytes(events[PSWPOUT]));
> + pages_to_bytes(events[PSWPOUT]));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
>   events[HTLB_BUDDY_PGALLOC]);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
>   events[HTLB_BUDDY_PGALLOC_FAIL]);
> -#endif
> -#endif
> +#endif /* CONFIG_HUGETLB_PAGE */
> +
> + return idx - start;
> +#else /* CONFIG_VM_EVENT_COUNTERS */
> +
> + return 0;
> +#endif /* CONFIG_VM_EVENT_COUNTERS */
> +}
> +

Generally the preferred style is this:

#ifdef .

static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
   unsigned int start)
{

}

#else /* CONFIG_VM_EVENT_COUNTERS */

static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
   unsigned int start)
{
return 0;
}

#endif

however given it was a spaghetti of ifdefs even before that,
the patch's ok I think.


> +static unsigned int update_balloon_stats(struct virtio_balloon *vb)
> +{
> + struct sysinfo i;
> + unsigned int idx = 0;
> + long available;
> + unsigned long caches;
> +
> + idx += update_balloon_vm_stats(vb, idx);
> +
> + si_meminfo(&i);
> + available = si_mem_available();
> + caches = global_node_page_state(NR_FILE_PAGES);
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
>   pages_to_bytes(i.freeram));
>   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
> -- 
> 2.34.1




Re: [PATCH virt] virt: fix uninit-value in vhost_vsock_dev_open

2024-04-20 Thread Michael S. Tsirkin
On Sat, Apr 20, 2024 at 05:57:50PM +0900, Jeongjun Park wrote:
> Change vhost_vsock_dev_open() to use kvzalloc() instead of kvmalloc()
> to avoid uninit state.
> 
> Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
> Fixes: dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by 
> __GFP_RETRY_MAYFAIL with more useful semantic")
> Signed-off-by: Jeongjun Park 

What value exactly is used uninitialized?

> ---
>  drivers/vhost/vsock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index ec20ecff85c7..652ef97a444b 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -656,7 +656,7 @@ static int vhost_vsock_dev_open(struct inode *inode, 
> struct file *file)
>   /* This struct is large and allocation could fail, fall back to vmalloc
>* if there is no other way.
>*/
> - vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> + vsock = kvzalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
>   if (!vsock)
>   return -ENOMEM;
>  
> -- 
> 2.34.1




Re: [PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-18 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 03:14:37PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 4/17/2024 4:54 PM, David Stevens wrote:
> > Add support for the VIRTIO_F_SUSPEND feature. When this feature is
> > negotiated, power management can use it to suspend virtio devices
> > instead of resorting to resetting the devices entirely.
> > 
> > Signed-off-by: David Stevens 
> > ---
> >   drivers/virtio/virtio.c| 32 ++
> >   drivers/virtio/virtio_pci_common.c | 29 +++
> >   drivers/virtio/virtio_pci_modern.c | 19 ++
> >   include/linux/virtio.h |  2 ++
> >   include/uapi/linux/virtio_config.h | 10 +-
> >   5 files changed, 74 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > index f4080692b351..cd11495a5098 100644
> > --- a/drivers/virtio/virtio.c
> > +++ b/drivers/virtio/virtio.c
> > @@ -1,5 +1,6 @@
> >   // SPDX-License-Identifier: GPL-2.0-only
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
> > return ret;
> >   }
> >   EXPORT_SYMBOL_GPL(virtio_device_restore);
> > +
> > +static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
> > enabled)
> > +{
> > +   u8 status, target;
> > +
> > +   status = dev->config->get_status(dev);
> > +   if (enabled)
> > +   target = status | VIRTIO_CONFIG_S_SUSPEND;
> > +   else
> > +   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
> > +   dev->config->set_status(dev, target);
> I think it is better to verify whether the device SUSPEND bit is
> already set or clear, we can just return if status == target.
> 
> Thanks
> Zhu Lingshan
> > +
> > +   while ((status = dev->config->get_status(dev)) != target) {
> > +   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
> > +   return -EIO;
> > +   mdelay(10);

Bad device state (set by surprise removal) should also
be handled here I think.


> > +   }
> > +   return 0;
> > +}
> > +
> > +int virtio_device_suspend(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, true);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_suspend);
> > +
> > +int virtio_device_resume(struct virtio_device *dev)
> > +{
> > +   return virtio_device_set_suspend_bit(dev, false);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_device_resume);
> >   #endif
> >   static int virtio_init(void)
> > diff --git a/drivers/virtio/virtio_pci_common.c 
> > b/drivers/virtio/virtio_pci_common.c
> > index b655fccaf773..4d542de05970 100644
> > --- a/drivers/virtio/virtio_pci_common.c
> > +++ b/drivers/virtio/virtio_pci_common.c
> > @@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
> > 	return virtio_device_restore(&vp_dev->vdev);
> >   }
> > -static bool vp_supports_pm_no_reset(struct device *dev)
> > +static int virtio_pci_suspend(struct device *dev)
> >   {
> > struct pci_dev *pci_dev = to_pci_dev(dev);
> > -   u16 pmcsr;
> > -
> > -   if (!pci_dev->pm_cap)
> > -   return false;
> > -
> > -   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > -   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > -   dev_err(dev, "Unable to query pmcsr");
> > -   return false;
> > -   }
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > -   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
> > -}
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_suspend(&vp_dev->vdev);
> > -static int virtio_pci_suspend(struct device *dev)
> > -{
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
> > +   return virtio_pci_freeze(dev);
> >   }
> >   static int virtio_pci_resume(struct device *dev)
> >   {
> > -   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
> > +   struct pci_dev *pci_dev = to_pci_dev(dev);
> > +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> > +
> > +   if (virtio_has_feature(&vp_dev->vdev, VIRTIO_F_SUSPEND))
> > +   return virtio_device_resume(&vp_dev->vdev);
> > +
> > +   return virtio_pci_restore(dev);
> >   }
> >   static const struct dev_pm_ops virtio_pci_pm_ops = {
> > diff --git a/drivers/virtio/virtio_pci_modern.c 
> > b/drivers/virtio/virtio_pci_modern.c
> > index f62b530aa3b5..ac8734526b8d 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct 
> > virtio_device *vdev)
> > __virtqueue_break(admin_vq->info.vq);
> >   }
> > +static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
> > +{
> > +   u16 pmcsr;
> > +
> > +   if (!pci_dev->pm_cap)
> > +   return false;
> > +
> > +   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > +   if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > +   dev_err(&pci_dev->dev, "Unable to query pmcsr");
> > +   return false;

Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-17 Thread Michael S. Tsirkin
On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> Add the functions vduse_alloc_reconnnect_info_mem
> and vduse_free_reconnnect_info_mem.
> These functions allow vduse to allocate and free memory for reconnection
> information. The amount of memory allocated is vq_num pages.
> Each vq will map its own page where the reconnection information will be
> saved.
> 
> Signed-off-by: Cindy Lu 
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index ef3c9681941e..2da659d5f4a8 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -65,6 +65,7 @@ struct vduse_virtqueue {
>   int irq_effective_cpu;
>   struct cpumask irq_affinity;
>   struct kobject kobj;
> + unsigned long vdpa_reconnect_vaddr;
>  };
>  
>  struct vduse_dev;
> @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> vduse_virtqueue *vq)
>  
>   vq->irq_effective_cpu = curr_cpu;
>  }
> +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> +{
> + unsigned long vaddr = 0;
> + struct vduse_virtqueue *vq;
> +
> + for (int i = 0; i < dev->vq_num; i++) {
> + /* pages 0 ~ vq_num save the reconnect info for each vq */
> + vq = dev->vqs[i];
> + vaddr = get_zeroed_page(GFP_KERNEL);


I don't get why you insist on stealing kernel memory for something
that is just used by userspace to store data for its own use.
Userspace does not lack ways to persist data, for example,
create a regular file anywhere in the filesystem.



> + if (vaddr == 0)
> + return -ENOMEM;
> +
> + vq->vdpa_reconnect_vaddr = vaddr;
> + }
> +
> + return 0;
> +}
> +
> +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> +{
> + struct vduse_virtqueue *vq;
> +
> + for (int i = 0; i < dev->vq_num; i++) {
> + vq = dev->vqs[i];
> +
> + if (vq->vdpa_reconnect_vaddr)
> + free_page(vq->vdpa_reconnect_vaddr);
> + vq->vdpa_reconnect_vaddr = 0;
> + }
> +
> + return 0;
> +}
>  
>  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   unsigned long arg)
> @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
>   mutex_unlock(&dev->lock);
>   return -EBUSY;
>   }
> + vduse_free_reconnnect_info_mem(dev);
> +
>   dev->connected = true;
>   mutex_unlock(&dev->lock);
>  
> @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config 
> *config,
>   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
>   if (ret)
>   goto err_vqs;
> + ret = vduse_alloc_reconnnect_info_mem(dev);
> + if (ret < 0)
> + goto err_mem;
>  
>   __module_get(THIS_MODULE);
>  
>   return 0;
>  err_vqs:
>   device_destroy(&vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +err_mem:
> + vduse_free_reconnnect_info_mem(dev);
>  err_dev:
>   idr_remove(&vduse_idr, dev->minor);
>  err_idr:
> -- 
> 2.43.0




Re: [PATCH] vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

2024-04-14 Thread Michael S. Tsirkin
On Sun, Apr 14, 2024 at 10:59:06AM +0200, Christophe JAILLET wrote:
> Le 14/04/2024 à 10:35, Michael S. Tsirkin a écrit :
> > On Mon, Jan 15, 2024 at 09:35:50PM +0100, Christophe JAILLET wrote:
> > > ida_alloc() and ida_free() should be preferred to the deprecated
> > > ida_simple_get() and ida_simple_remove().
> > > 
> > > Note that the upper limit of ida_simple_get() is exclusive, buInputt the 
> > > one of
> > 
> > What's buInputt? But?
> 
> Yes, sorry. It is "but".
> 
> Let me know if I should send a v2, or if it can be fixed when it is applied.
> 
> CJ

Yes it's easier if you do. Thanks!

> > 
> > > ida_alloc_max() is inclusive. So a -1 has been added when needed.
> > > 
> > > Signed-off-by: Christophe JAILLET 
> > 
> > 
> > Jason, wanna ack?
> > 
> > > ---
> > >   drivers/vhost/vdpa.c | 6 +++---
> > >   1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > index bc4a51e4638b..849b9d2dd51f 100644
> > > --- a/drivers/vhost/vdpa.c
> > > +++ b/drivers/vhost/vdpa.c
> > > @@ -1534,7 +1534,7 @@ static void vhost_vdpa_release_dev(struct device 
> > > *device)
> > >   struct vhost_vdpa *v =
> > >  container_of(device, struct vhost_vdpa, dev);
> > > - ida_simple_remove(&vhost_vdpa_ida, v->minor);
> > > + ida_free(&vhost_vdpa_ida, v->minor);
> > >   kfree(v->vqs);
> > >   kfree(v);
> > >   }
> > > @@ -1557,8 +1557,8 @@ static int vhost_vdpa_probe(struct vdpa_device 
> > > *vdpa)
> > >   if (!v)
> > >   return -ENOMEM;
> > > - minor = ida_simple_get(&vhost_vdpa_ida, 0,
> > > -VHOST_VDPA_DEV_MAX, GFP_KERNEL);
> > > + minor = ida_alloc_max(&vhost_vdpa_ida, VHOST_VDPA_DEV_MAX - 1,
> > > +   GFP_KERNEL);
> > >   if (minor < 0) {
> > >   kfree(v);
> > >   return minor;
> > > -- 
> > > 2.43.0
> > 
> > 
> > 




Re: [PATCH] vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

2024-04-14 Thread Michael S. Tsirkin
On Mon, Jan 15, 2024 at 09:35:50PM +0100, Christophe JAILLET wrote:
> ida_alloc() and ida_free() should be preferred to the deprecated
> ida_simple_get() and ida_simple_remove().
> 
> Note that the upper limit of ida_simple_get() is exclusive, buInputt the one 
> of

What's buInputt? But?

> ida_alloc_max() is inclusive. So a -1 has been added when needed.
> 
> Signed-off-by: Christophe JAILLET 


Jason, wanna ack?

> ---
>  drivers/vhost/vdpa.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index bc4a51e4638b..849b9d2dd51f 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1534,7 +1534,7 @@ static void vhost_vdpa_release_dev(struct device 
> *device)
>   struct vhost_vdpa *v =
>  container_of(device, struct vhost_vdpa, dev);
>  
> - ida_simple_remove(&vhost_vdpa_ida, v->minor);
> + ida_free(&vhost_vdpa_ida, v->minor);
>   kfree(v->vqs);
>   kfree(v);
>  }
> @@ -1557,8 +1557,8 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
>   if (!v)
>   return -ENOMEM;
>  
> - minor = ida_simple_get(&vhost_vdpa_ida, 0,
> -VHOST_VDPA_DEV_MAX, GFP_KERNEL);
> + minor = ida_alloc_max(&vhost_vdpa_ida, VHOST_VDPA_DEV_MAX - 1,
> +   GFP_KERNEL);
>   if (minor < 0) {
>   kfree(v);
>   return minor;
> -- 
> 2.43.0




[GIT PULL] virtio: bugfixes

2024-04-14 Thread Michael S. Tsirkin
The following changes since commit fec50db7033ea478773b159e0e2efb135270e3b7:

  Linux 6.9-rc3 (2024-04-07 13:22:46 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 76f408535aab39c33e0a1dcada9fba5631c65595:

  vhost: correct misleading printing information (2024-04-08 04:11:04 -0400)


virtio: bugfixes

Some small, obvious (in hindsight) bugfixes:

- new ioctl in vhost-vdpa has a wrong # - not too late to fix

- vhost has apparently been lacking an smp_rmb() -
  due to code duplication :( The duplication will be fixed in
  the next merge cycle, this is a minimal fix.

- an error message in vhost talks about guest moving used index -
  which of course never happens, guest only ever moves the
  available index.

- i2c-virtio didn't set the driver owner so it did not get
  refcounted correctly.

Signed-off-by: Michael S. Tsirkin 


Gavin Shan (2):
  vhost: Add smp_rmb() in vhost_vq_avail_empty()
  vhost: Add smp_rmb() in vhost_enable_notify()

Krzysztof Kozlowski (1):
  virtio: store owner from modules with register_virtio_driver()

Michael S. Tsirkin (1):
  vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

Xianting Tian (1):
  vhost: correct misleading printing information

 .../driver-api/virtio/writing_virtio_drivers.rst   |  1 -
 drivers/vhost/vhost.c  | 30 ++
 drivers/virtio/virtio.c|  6 +++--
 include/linux/virtio.h |  7 +++--
 include/uapi/linux/vhost.h | 15 ++-
 5 files changed, 42 insertions(+), 17 deletions(-)




Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread Michael S. Tsirkin
Good and clear subject, I like it.

On Tue, Apr 09, 2024 at 04:58:18PM +0800, lyx634449800 wrote:
> From: Yuxue Liu 
> 
> When there is a ctlq and it doesn't require interrupt
> callbacks, the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to pass through the device.
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os. Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: Yuxue Liu 
> Acked-by: Jason Wang 

Much better, thanks!
A couple of small tweaks to polish it up and it'll be ready.

> ---
> V4: Update the title and assign values to uninitialized variables
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..74bc8adfc7e8 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int vectors = 0;
> + int msix_vec = 0;
> +
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + vectors++;
> + }
> + vectors++;


Actually even easier: int vectors = 1; and then we do not need
this last line.
Sorry I only noticed now.

>  
>   ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>   if (ret != vectors) {
> @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   vp_vdpa->vectors = vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);
> 	ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   vp_vdpa->vring[i].irq = irq;
> + msix_vec++;
>   }
>  
>   snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
>pci_name(pdev));
> - irq = pci_irq_vector(pdev, queues);
> + irq = pci_irq_vector(pdev, msix_vec);
> 	ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
>  vp_vdpa->msix_name, vp_vdpa);
>   if (ret) {
> 		dev_err(&pdev->dev,
> - "vp_vdpa: fail to request irq for vq %d\n", i);
> +			"vp_vdpa: fail to request irq for config, ret %d\n", ret);

As long as we are here 

Re: [PATCH] drivers/virtio: delayed configuration descriptor flags

2024-04-09 Thread Michael S. Tsirkin
On Tue, Apr 09, 2024 at 01:02:52AM +0800, ni.liqiang wrote:
> In our testing of the virtio hardware accelerator, we found that
> configuring the flags of the descriptor after addr and len,
> as implemented in DPDK, seems to be more friendly to the hardware.
> 
> In our Virtio hardware implementation tests, using the default
> open-source code, the hardware's bulk reads ensure performance
> but correctness is compromised. If we refer to the implementation code
> of DPDK, placing the flags configuration of the descriptor
> after addr and len, virtio backend can function properly based on
> our hardware accelerator.
> 
> I am somewhat puzzled by this. From a software process perspective,
> it seems that there should be no difference whether
> the flags configuration of the descriptor is before or after addr and len.
> However, this is not the case according to experimental test results.


You should be aware of the following, from the PCI Express spec.
Note especially the second paragraph, and the last paragraph:

2.4.2. Update Ordering and Granularity Observed by a Read Transaction

If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated, the ordering of multiple 
updates and
granularity of each update reflected in the data returned by the read is 
outside the scope of this
specification. This applies both to updates performed by PCI Express write 
transactions and
updates performed by other mechanisms such as host CPUs updating host memory.
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated by one or more entities 
not on the PCI
Express fabric, the ordering of multiple updates and granularity of each update 
reflected in the data
returned by the read is outside the scope of this specification.




As an example of update ordering, assume that the block of data is in host 
memory, and a host CPU
writes first to location A and then to a different location B. A Requester 
reading that data block
with a single read transaction is not guaranteed to observe those updates in 
order. In other words,
the Requester may observe an updated value in location B and an old value in 
location A, regardless
of the placement of locations A and B within the data block. Unless a Completer 
makes its own
guarantees (outside this specification) with respect to update ordering, a 
Requester that relies on
update ordering must observe the update to location B via one read transaction 
before initiating a
subsequent read to location A to return its updated value.




As an example of update granularity, if a host CPU writes a QWORD to host 
memory, a Requester
reading that QWORD from host memory may observe a portion of the QWORD updated 
and
another portion of it containing the old value.
While not required by this specification, it is strongly recommended that host 
platforms guarantee
that when a host CPU writes aligned DWORDs or aligned QWORDs to host memory, 
the update
granularity observed by a PCI Express read will not be smaller than a DWORD.


IMPLEMENTATION NOTE
No Ordering Required Between Cachelines
A Root Complex serving as a Completer to a single Memory Read that requests 
multiple cachelines
from host memory is permitted to fetch multiple cachelines concurrently, to 
help facilitate multi-
cacheline completions, subject to Max_Payload_Size. No ordering relationship 
between these
cacheline fetches is required.





Now I suspect that what is going on is that your Root complex
reads descriptors out of order, so the second descriptor is invalid
but the 1st one is valid.




> We would like to know if such a change in the configuration order
> is reasonable and acceptable?

We need to understand the root cause and how robust the fix is
before answering this.


> Thanks.
> 
> Signed-off-by: ni.liqiang 
> Reviewed-by: jin.qi 
> Tested-by: jin.qi 
> Cc: ni.liqiang 
> ---
>  drivers/virtio/virtio_ring.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..bea2c2fb084e 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1472,15 +1472,16 @@ static inline int virtqueue_add_packed(struct 
> virtqueue *_vq,
>   flags = cpu_to_le16(vq->packed.avail_used_flags |
>   (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>   (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> - if (i == head)
> - head_flags = flags;
> - else
> - desc[i].flags = flags;
>  
>   desc[i].addr = cpu_to_le64(addr);
>   desc[i].len = cpu_to_le32(sg->length);
>   desc[i].id = cpu_to_le16(id);
>  
> +

Re: [PATCH v3] vp_vdpa: fix the method of calculating vectors

2024-04-08 Thread Michael S. Tsirkin
better subject:

 vp_vdpa: don't allocate unused msix vectors

to make it clear it's not a bugfix.




more comments below, but most importantly this
looks like it adds a bug.

On Tue, Apr 09, 2024 at 09:49:35AM +0800, lyx634449800 wrote:
> When there is a ctlq and it doesn't require interrupt
> callbacks, the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to passthrough the device
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os. Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edge  vp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edge  vp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: lyx634449800 


Bad S.O.B format. Should be

Signed-off-by: Real Name 


> ---
> 
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 35 +++
>  1 file changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..cd3aeb3b8f21 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,22 +160,31 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int msix_vec, allocated_vectors = 0;


I would actually call allocated_vectors -> vectors, make the patch
smaller.

>  
> - ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
> - if (ret != vectors) {
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + allocated_vectors++;
> + }
> + allocated_vectors = allocated_vectors + 1;

better: 
allocated_vectors++; /* extra one for config */

> +
> + ret = pci_alloc_irq_vectors(pdev, allocated_vectors, allocated_vectors,
> + PCI_IRQ_MSIX);
> + if (ret != allocated_vectors) {
> 		dev_err(&pdev->dev,
> 			"vp_vdpa: fail to allocate irq vectors want %d but %d\n",
> - vectors, ret);
> + allocated_vectors, ret);
>   return ret;
>   }
> -
> - vp_vdpa->vectors = vectors;
> + vp_vdpa->vectors = allocated_vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);

using uninitialized msix_vec here?

I would expect compiler to warn about it.


pay attention to compiler warnings pls.


> 		ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,23 +194,23 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   

Re: [PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-08 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 08:39:21PM +0800, Cindy Lu wrote:
> On Mon, Apr 8, 2024 at 3:40 PM Michael S. Tsirkin  wrote:
> >
> > On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote:
> > > Add a document explaining the reconnect process, including what the
> > > Userspace App needs to do and how it works with the kernel.
> > >
> > > Signed-off-by: Cindy Lu 
> > > ---
> > >  Documentation/userspace-api/vduse.rst | 41 +++
> > >  1 file changed, 41 insertions(+)
> > >
> > > diff --git a/Documentation/userspace-api/vduse.rst 
> > > b/Documentation/userspace-api/vduse.rst
> > > index bdb880e01132..7faa83462e78 100644
> > > --- a/Documentation/userspace-api/vduse.rst
> > > +++ b/Documentation/userspace-api/vduse.rst
> > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> > > after the used ring is filled.
> > >
> > >  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> > > +
> > > +HOW VDUSE device reconnection works
> > > +
> > > +1. What is reconnection?
> > > +
> > > +   When the userspace application loads, it should establish a connection
> > > +   to the vduse kernel device. Sometimes, the userspace application exits,
> > > +   and we want to support its restart and connection to the kernel device again.
> > > +
> > > +2. How can I support reconnection in a userspace application?
> > > +
> > > +2.1 During initialization, the userspace application should first verify 
> > > the
> > > +existence of the device "/dev/vduse/vduse_name".
> > > +If it doesn't exist, it means this is the first-time for connection. 
> > > goto step 2.2
> > > +If it exists, it means this is a reconnection, and we should goto 
> > > step 2.3
> > > +
> > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > > +/dev/vduse/control.
> > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > > +the reconnect information. The total memory size is PAGE_SIZE*vq_number.
> >
> > Confused. Where is that allocation, in code?
> >
> > Thanks!
> >
> this should be allocated in function vduse_create_dev(),

I mean, it's not allocated there ATM right? This is just doc patch
to become part of a larger patchset?

> I will rewrite
> this part to make it clearer
> will send a new version soon
> Thanks
> cindy
> 
> > > +2.3 Check if the information is suitable for reconnect
> > > +If this is reconnection :
> > > +Before attempting to reconnect, The userspace application needs to 
> > > use the
> > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> > > VDUSE_DEV_GET_FEATURES...)
> > > +to get the information from kernel.
> > > +Please review the information and confirm if it is suitable to 
> > > reconnect.
> > > +
> > > +2.4 Userspace application needs to mmap the memory to userspace
> > > +The userspace application requires mapping one page for every vq. 
> > > These pages
> > > +should be used to save vq-related information during system running. 
> > > Additionally,
> > > +the application must define its own structure to store information 
> > > for reconnection.
> > > +
> > > +2.5 Completed the initialization and running the application.
> > > +While the application is running, it is important to store relevant 
> > > information
> > > +about reconnections in mapped pages. When calling the ioctl 
> > > VDUSE_VQ_GET_INFO to
> > > +get vq information, it's necessary to check whether it's a 
> > > reconnection. If it is
> > > a reconnection, the vq-related information must be read from the
> > > mapped pages.
> > > +
> > > +2.6 When the Userspace application exits, it is necessary to unmap all 
> > > the
> > > +pages for reconnection
> > > --
> > > 2.43.0
> >




Re: [PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-04-08 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 10:41:20PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> Hi,
> 
> The patch set aims to fix the warning related to an abnormal size
> parameter of kmalloc() in virtiofs. The warning occurred when attempting
> to insert a 10MB sized kernel module kept in a virtiofs with cache
> disabled. As analyzed in patch #1, the root cause is that the length of
> the read buffer is not limited, and the read buffer is passed directly to
> virtiofs through out_args[0].value. Therefore patch #1 limits the
> length of the read buffer passed to virtiofs by using max_pages. However
> it is not enough, because now the maximal value of max_pages is 256.
> Consequently, when reading a 10MB-sized kernel module, the length of the
> bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
> try to allocate 2MB from memory subsystem. The request for 2MB of
> physically contiguous memory significantly stresses the memory subsystem
> and may fail indefinitely on hosts with fragmented memory. To address
> this, patch #2~#5 use scattered pages in a bio_vec to replace the
> kmalloc-allocated bounce buffer when the length of the bounce buffer for
> KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
> allocation of the bounce buffer and sg array in virtiofs is that
> GFP_ATOMIC is used even when the allocation occurs in a kworker context.
> Therefore the last patch uses GFP_NOFS for the allocation of both sg
> array and bounce buffer when initiated by the kworker. For more details,
> please check the individual patches.
> 
> As usual, comments are always welcome.
> 
> Change Log:

Bernd should I just merge the patchset as is?
It seems to fix a real problem and no one has the
time to work on a better fix  WDYT?


> v2:
>   * limit the length of ITER_KVEC dio by max_pages instead of the
> newly-introduced max_nopage_rw. Using max_pages make the ITER_KVEC
> dio being consistent with other rw operations.
>   * replace kmalloc-allocated bounce buffer by using a bounce buffer
> backed by scattered pages when the length of the bounce buffer for
> KVEC_ITER dio is larger than PAGE_SIZE, so even on hosts with
> fragmented memory, the KVEC_ITER dio can be handled normally by
> virtiofs. (Bernd Schubert)
>   * merge the GFP_NOFS patch [1] into this patch-set and use
> memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
> (Benjamin Coddington)
> 
> v1: 
> https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
> 
> [1]: 
> https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/
> 
> Hou Tao (6):
>   fuse: limit the length of ITER_KVEC dio by max_pages
>   virtiofs: move alloc/free of argbuf into separated helpers
>   virtiofs: factor out more common methods for argbuf
>   virtiofs: support bounce buffer backed by scattered pages
>   virtiofs: use scattered bounce buffer for ITER_KVEC dio
>   virtiofs: use GFP_NOFS when enqueuing request through kworker
> 
>  fs/fuse/file.c  |  12 +-
>  fs/fuse/virtio_fs.c | 336 +---
>  2 files changed, 296 insertions(+), 52 deletions(-)
> 
> -- 
> 2.29.2




Re: [PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-08 Thread Michael S. Tsirkin
On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote:
> Add a document explaining the reconnect process, including what the
> Userspace App needs to do and how it works with the kernel.
> 
> Signed-off-by: Cindy Lu 
> ---
>  Documentation/userspace-api/vduse.rst | 41 +++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> index bdb880e01132..7faa83462e78 100644
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> after the used ring is filled.
>  
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> +
> +HOW VDUSE device reconnection works
> +
> +1. What is reconnection?
> +
> +   When the userspace application loads, it should establish a connection
> +   to the vduse kernel device. Sometimes, the userspace application exits,
> +   and we want to support its restart and connection to the kernel device again.
> +
> +2. How can I support reconnection in a userspace application?
> +
> +2.1 During initialization, the userspace application should first verify the
> +existence of the device "/dev/vduse/vduse_name".
> +If it doesn't exist, it means this is the first-time for connection. 
> goto step 2.2
> +If it exists, it means this is a reconnection, and we should goto step 
> 2.3
> +
> +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +/dev/vduse/control.
> +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> +the reconnect information. The total memory size is PAGE_SIZE*vq_number.

Confused. Where is that allocation, in code?

Thanks!

> +2.3 Check if the information is suitable for reconnect
> +If this is reconnection :
> +Before attempting to reconnect, The userspace application needs to use 
> the
> +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> VDUSE_DEV_GET_FEATURES...)
> +to get the information from kernel.
> +Please review the information and confirm if it is suitable to reconnect.
> +
> +2.4 Userspace application needs to mmap the memory to userspace
> +The userspace application requires mapping one page for every vq. These 
> pages
> +should be used to save vq-related information during system running. 
> Additionally,
> +the application must define its own structure to store information for 
> reconnection.
> +
> +2.5 Completed the initialization and running the application.
> +While the application is running, it is important to store relevant 
> information
> +about reconnections in mapped pages. When calling the ioctl 
> VDUSE_VQ_GET_INFO to
> +get vq information, it's necessary to check whether it's a reconnection. 
> If it is
> a reconnection, the vq-related information must be read from the mapped
> pages.
> +
> +2.6 When the Userspace application exits, it is necessary to unmap all the
> +pages for reconnection
> -- 
> 2.43.0




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-08 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 02:15:24PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 3/30/24 19:02, Gavin Shan wrote:
> > On 3/28/24 19:31, Michael S. Tsirkin wrote:
> > > On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> > > > All the callers of vhost_get_avail_idx() are concerned about the memory
> > > > barrier, imposed by smp_rmb() to ensure the order of the available
> > > > ring entry read and avail_idx read.
> > > > 
> > > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> > > > the avail_idx is advanced. With it, the callers needn't worry
> > > > about the memory barrier.
> > > > 
> > > > Suggested-by: Michael S. Tsirkin 
> > > > Signed-off-by: Gavin Shan 
> > > 
> > > Previous patches are ok. This one I feel needs more work -
> > > first more code such as sanity checking should go into
> > > this function, second there's actually a difference
> > > between comparing to last_avail_idx and just comparing
> > > to the previous value of avail_idx.
> > > I will pick patches 1-2 and post a cleanup on top so you can
> > > take a look, ok?
> > > 
> > 
> > Thanks, Michael. It's fine to me.
> > 
> 
> A kindly ping.
> 
> If it's ok to you, could you please merge PATCH[1-2]? Our downstream
> 9.4 needs the fixes, especially for NVidia's grace-hopper and grace-grace
> platforms.
> 
> For PATCH[3], I also can help with the improvement if you don't have time
> for it. Please let me know.
> 
> Thanks,
> Gavin

The thing to do is basically diff with the patch I wrote :)
We can also do a bit more cleanups on top of *that*, like unifying
error handling.

-- 
MST




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-08 Thread Michael S. Tsirkin
On Mon, Apr 08, 2024 at 02:15:24PM +1000, Gavin Shan wrote:
> Hi Michael,
> 
> On 3/30/24 19:02, Gavin Shan wrote:
> > On 3/28/24 19:31, Michael S. Tsirkin wrote:
> > > On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> > > > All the callers of vhost_get_avail_idx() are concerned about the memory
> > > > barrier, imposed by smp_rmb() to ensure the order of the available
> > > > ring entry read and avail_idx read.
> > > > 
> > > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> > > > the avail_idx is advanced. With it, the callers needn't worry
> > > > about the memory barrier.
> > > > 
> > > > Suggested-by: Michael S. Tsirkin 
> > > > Signed-off-by: Gavin Shan 
> > > 
> > > Previous patches are ok. This one I feel needs more work -
> > > first more code such as sanity checking should go into
> > > this function, second there's actually a difference
> > > between comparing to last_avail_idx and just comparing
> > > to the previous value of avail_idx.
> > > I will pick patches 1-2 and post a cleanup on top so you can
> > > take a look, ok?
> > > 
> > 
> > Thanks, Michael. It's fine to me.
> > 
> 
> A kindly ping.
> 
> If it's ok to you, could you please merge PATCH[1-2]? Our downstream
> 9.4 needs the fixes, especially for NVidia's grace-hopper and grace-grace
> platforms.

Yes - in the next rc hopefully.

> For PATCH[3], I also can help with the improvement if you don't have time
> for it. Please let me know.
> 
> Thanks,
> Gavin


That would be great.

-- 
MST




[PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

2024-04-02 Thread Michael S. Tsirkin
VDPA_GET_VRING_SIZE by mistake uses the already occupied
ioctl # 0x80 and we never noticed - it happens to work
because the direction and size are different, but confuses
tools such as perf which like to look at just the number,
and breaks the extra robustness of the ioctl numbering macros.

To fix, sort the entries and renumber the ioctl - not too late
since it wasn't in any released kernels yet.

Cc: Arnaldo Carvalho de Melo 
Reported-by: Namhyung Kim 
Fixes: 1496c47065f9 ("vhost-vdpa: uapi to support reporting per vq size")
Cc: "Zhu Lingshan" 
Signed-off-by: Michael S. Tsirkin 
---

Build tested only - userspace patches using this will have to adjust.
I will merge this in a week or so unless I hear otherwise,
and afterwards perf can update there header.

 include/uapi/linux/vhost.h | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index bea697390613..b95dd84eef2d 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -179,12 +179,6 @@
 /* Get the config size */
 #define VHOST_VDPA_GET_CONFIG_SIZE _IOR(VHOST_VIRTIO, 0x79, __u32)
 
-/* Get the count of all virtqueues */
-#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
-
-/* Get the number of virtqueue groups. */
-#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
-
 /* Get the number of address spaces. */
 #define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
 
@@ -228,10 +222,17 @@
 #define VHOST_VDPA_GET_VRING_DESC_GROUP_IOWR(VHOST_VIRTIO, 0x7F,   
\
  struct vhost_vring_state)
 
+
+/* Get the count of all virtqueues */
+#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
+
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
+
 /* Get the queue size of a specific virtqueue.
  * userspace set the vring index in vhost_vring_state.index
  * kernel set the queue size in vhost_vring_state.num
  */
-#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x80,   \
+#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x82,   \
  struct vhost_vring_state)
 #endif
-- 
MST




Re: [syzbot] [virtualization?] bpf boot error: WARNING: refcount bug in __free_pages_ok

2024-03-31 Thread Michael S. Tsirkin
On Sat, Mar 30, 2024 at 08:37:19AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:6dae957c8eef bpf: fix possible file descriptor leaks in ve..
> git tree:   bpf
> console output: https://syzkaller.appspot.com/x/log.txt?x=14ec025e18
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7b667bc37450fdcd
> dashboard link: https://syzkaller.appspot.com/bug?extid=689655a7402cc18ace0a
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/94b03853b65f/disk-6dae957c.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/7375c1b6b108/vmlinux-6dae957c.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/126013ac11e1/bzImage-6dae957c.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+689655a7402cc18ac...@syzkaller.appspotmail.com
> 
> Key type pkcs7_test registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: button: Power Button [PWRF]
> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> ACPI: button: Sleep Button [SLPF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> ACPI: bus type drm_connector registered
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> Console: switching to colour frame buffer device 128x48
> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> usbcore: registered new interface driver udl
> brd: module loaded
> loop: module loaded
> zram: Added device: zram0
> null_blk: disk nullb0 created
> null_blk: module loaded
> Guest personality initialized and is inactive
> VMCI host device registered (name=vmci, major=10, minor=118)
> Initialized host personality
> usbcore: registered new interface driver rtsx_usb
> usbcore: registered new interface driver viperboard
> usbcore: registered new interface driver dln2
> usbcore: registered new interface driver pn533_usb
> nfcsim 0.2 initialized
> usbcore: registered new interface driver port100
> usbcore: registered new interface driver nfcmrvl
> Loading iSCSI transport class v2.0-870.
> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [ cut here ]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 1 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
> lib/refcount.c:31
> Modules linked in:
> CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
> 6.9.0-rc1-syzkaller-00160-g6dae957c8eef #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 03/27/2024
> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> Code: b2 00 00 00 e8 97 cf e9 fc 5b 5d c3 cc cc cc cc e8 8b cf e9 fc c6 05 8e 
> 73 e8 0a 01 90 48 c7 c7 e0 33 1f 8c e8 c7 6b ac fc 90 <0f> 0b 90 90 eb d9 e8 
> 6b cf e9 fc c6 05 6b 73 e8 0a 01 90 48 c7 c7
> RSP: :c9066e18 EFLAGS: 00010246
> RAX: eee901a1fb7e2300 RBX: 888146687e7c RCX: 8880166d
> RDX:  RSI:  RDI: 
> RBP: 0004 R08: 815800c2 R09: fbfff1c396e0
> R10: dc00 R11: fbfff1c396e0 R12: ea000502edc0
> R13: ea000502edc8 R14: 1d4000a05db9 R15: 
> FS:  () GS:8880b950() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 0e132000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  free_pages_prepare mm/page_alloc.c:1141 [inline]
>  __free_pages_ok+0xc60/0xd90 mm/page_alloc.c:1270
>  make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 

Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-03-31 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote:
> There is a bug when setting the RSS options in virtio_net that can break
> the whole machine, getting the kernel into an infinite loop.
> 
> Running the following command in any QEMU virtual machine with virtionet
> will reproduce this problem:
> 
> # ethtool -X eth0  hfunc toeplitz
> 
> This is how the problem happens:
> 
> 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
> 
> 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
> 
> 3) virtnet_commit_rss_command() populates 4 entries for the rss
> scatter-gather
> 
> 4) Since the command above does not have a key, then the last
> scatter-gather entry will be zeroed, since rss_key_size == 0.
> sg_buf_size = vi->rss_key_size;
> 
> 5) This buffer is passed to qemu, but qemu is not happy with a buffer
> with zero length, and do the following in virtqueue_map_desc() (QEMU
> function):
> 
>   if (!sz) {
>   virtio_error(vdev, "virtio: zero sized buffers are not allowed");
> 
> 6) virtio_error() (also QEMU function) set the device as broken
> 
> vdev->broken = true;
> 
> 7) QEMU bails out, and does not respond to this crazy kernel.
> 
> 8) The kernel is waiting for the response to come back (function
> virtnet_send_command())
> 
> 9) The kernel waits, doing the following:
> 
>   while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>!virtqueue_is_broken(vi->cvq))
> cpu_relax();
> 
> 10) Neither of the conditions above ever becomes true, thus, the kernel
> loops here forever. Keep in mind that virtqueue_is_broken() does
> not look at QEMU's `vdev->broken`, so, it never realizes that the
> virtio device is broken on the QEMU side.
> 
> Fix it by not sending RSS commands if the feature is not available in
> the device.
> 
> Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> Cc: sta...@vger.kernel.org

net has its own stable process, don't CC stable on net patches.


> Cc: qemu-de...@nongnu.org
> Signed-off-by: Breno Leitao 
> ---
> Changelog:
> 
> V2:
>   * Moved from creating a valid packet to rejecting the request
> completely
> V3:
>   * Got some good feedback from Xuan Zhuo and Heng Qi, and reworked
> the rejection path.
> 
> ---
>  drivers/net/virtio_net.c | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..c4a21ec51adf 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   struct netlink_ext_ack *extack)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
> + bool update = false;
>   int i;
>  
>   if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
> @@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   return -EOPNOTSUPP;
>  
>   if (rxfh->indir) {
> + if (!vi->has_rss)
> + return -EOPNOTSUPP;
> +
>   for (i = 0; i < vi->rss_indir_table_size; ++i)
>   vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> + update = true;
>   }
> - if (rxfh->key)
> +
> + if (rxfh->key) {
> + if (!vi->has_rss && !vi->has_rss_hash_report)
> + return -EOPNOTSUPP;


What's the logic here? Is it || or &&? A comment can't hurt.

> +
>   memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
> + update = true;
> + }
>  
> - virtnet_commit_rss_command(vi);
> + if (update)
> + virtnet_commit_rss_command(vi);
>  
>   return 0;
>  }
> @@ -4729,13 +4741,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
>   vi->has_rss_hash_report = true;
>  
> - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
>   vi->has_rss = true;
>  
> - if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_indir_table_size =
>   virtio_cread16(vdev, offsetof(struct virtio_net_config,
>   rss_max_indirection_table_length));
> + }
> +
> + if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_key_size =
>   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> rss_max_key_size));
>  
> -- 
> 2.43.0




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 06:39:33PM +0800, Jason Wang wrote:
> On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Mar 27, 2024 at 05:08:57PM +0800, Jason Wang wrote:
> > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > From: Rong Wang 
> > > > >
> > > > > Once an iommu domain is enabled for a device, the MSI
> > > > > translation tables have to be there for software-managed MSI.
> > > > > Otherwise, a platform with software-managed MSI but without an
> > > > > irq bypass function cannot get a correct memory write event
> > > > > from PCIe and will not get irqs.
> > > > > The solution is to obtain the MSI physical base address from the
> > > > > iommu reserved region and set it as the iommu MSI cookie;
> > > > > the translation tables will then be created when the irq is requested.
> > > > >
> > > > > Change log
> > > > > --
> > > > >
> > > > > v1->v2:
> > > > > - add resv iotlb to avoid overlap mapping.
> > > > > v2->v3:
> > > > > - there is no need to export the iommu symbol anymore.
> > > > >
> > > > > Signed-off-by: Rong Wang 
> > > >
> > > > There's no interest in extending vhost iotlb further -
> > > > we should just switch over to iommufd, which supports
> > > > this already.
> > >
> > > IOMMUFD is good but VFIO supported this before IOMMUFD.
> >
> > You mean VFIO migrated to IOMMUFD but of course they keep supporting
> > their old UAPI?
> 
> I meant VFIO supported software-managed MSI before IOMMUFD.

And then they switched over and stopped adding new IOMMU
related features. And so should vdpa?


> > OK and point being?
> >
> > > This patch
> > > makes vDPA run without a backport of full IOMMUFD in the production
> > > environment. I think it's worthwhile.
> >
> > Where do we stop? saying no to features is the only tool maintainers
> > have to make cleanups happen, otherwise people will just keep piling
> > stuff up.
> 
> I think we should not have more features than VFIO without IOMMUFD.
> 
> Thanks
> 
> >
> > > If you worry about the extension, we can just use the vhost iotlb
> > > existing facility to do this.
> > >
> > > Thanks
> > >
> > > >
> > > > > ---
> > > > >  drivers/vhost/vdpa.c | 59 
> > > > > +---
> > > > >  1 file changed, 56 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > > > index ba52d128aeb7..28b56b10372b 100644
> > > > > --- a/drivers/vhost/vdpa.c
> > > > > +++ b/drivers/vhost/vdpa.c
> > > > > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> > > > >   struct completion completion;
> > > > >   struct vdpa_device *vdpa;
> > > > >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > > > > + struct vhost_iotlb resv_iotlb;
> > > > >   struct device dev;
> > > > >   struct cdev cdev;
> > > > >   atomic_t opened;
> > > > > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa 
> > > > > *v)
> > > > >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> > > > >  {
> > > > >   v->in_batch = 0;
> > > > > + vhost_iotlb_reset(&v->resv_iotlb);
> > > > >   return _compat_vdpa_reset(v);
> > > > >  }
> > > > >
> > > > > @@ -1219,10 +1221,15 @@ static int 
> > > > > vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> > > > >   msg->iova + msg->size - 1 > v->range.last)
> > > > >   return -EINVAL;
> > > > >
> > > > > + if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> > > > > + msg->iova + msg->size - 1))
> > > > > + return -EINVAL;
> > > > > +
> > > > >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> > > > >   msg->iova + msg->size - 1))
> > > > >   return -EEXIST;

Re: [PATCH v2] Documentation: Add reconnect process for VDUSE

2024-03-29 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 05:38:25PM +0800, Cindy Lu wrote:
> Add a document explaining the reconnect process, including what the
> Userspace App needs to do and how it works with the kernel.
> 
> Signed-off-by: Cindy Lu 
> ---
>  Documentation/userspace-api/vduse.rst | 41 +++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> index bdb880e01132..f903aed714d1 100644
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> after the used ring is filled.
>  
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> +
> +HOW VDUSE devices reconnectoin works

typo

> +
> +1. What is reconnection?
> +
> +   When the userspace application loads, it should establish a connection
> +   to the vduse kernel device. Sometimes, the userspace application exits,
> +   and we want to support its restart and connection to the kernel device again.
> +
> +2. How can I support reconnection in a userspace application?
> +
> +2.1 During initialization, the userspace application should first verify the
> +existence of the device "/dev/vduse/vduse_name".
> +If it doesn't exist, it means this is the first connection; go to step
> 2.2.
> +If it exists, it means this is a reconnection, and we should go to step
> 2.3.
> +
> +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +/dev/vduse/control.
> +When ioctl(VDUSE_CREATE_DEV) is called, the kernel allocates memory for
> +the reconnect information. The total memory size is PAGE_SIZE*vq_number.
> +
> +2.3 Check if the information is suitable for reconnection
> +If this is a reconnection:
> +Before attempting to reconnect, the userspace application needs to use 
> the
> +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> VDUSE_DEV_GET_FEATURES...)
> +to get the information from the kernel.
> +Please review the information and confirm whether it is suitable for reconnection.
> +
> +2.4 The userspace application needs to mmap the memory to userspace.
> +The userspace application requires mapping one page for every vq. These 
> pages
> +should be used to save vq-related information while the system is running. 
> Additionally,
> +the application must define its own structure to store information for 
> reconnection.
> +
> +2.5 Complete the initialization and run the application.
> +While the application is running, it is important to store relevant 
> information
> +about reconnections in the mapped pages. When calling the ioctl 
> VDUSE_VQ_GET_INFO to
> +get vq information, it's necessary to check whether it's a reconnection. 
> If it is
> +a reconnection, the vq-related information must be read from the mapped 
> pages.
> +


I don't get it. So this is just a way for the application to allocate
memory? Why do we need this new way to do it?
Why not just mmap a file anywhere at all?


> +2.6 When the userspace application exits, it is necessary to unmap all the
> +pages used for reconnection.
> -- 
> 2.43.0




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Michael S. Tsirkin
On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> >
> > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > From: Rong Wang 
> > > >
> > > > Once an iommu domain is enabled for a device, the MSI
> > > > translation tables have to be there for software-managed MSI.
> > > > Otherwise, a platform with software-managed MSI but without an
> > > > irq bypass function cannot get a correct memory write event
> > > > from PCIe and will not get irqs.
> > > > The solution is to obtain the MSI physical base address from the
> > > > iommu reserved region and set it as the iommu MSI cookie;
> > > > the translation tables will then be created when the irq is requested.
> > > >
> > > > Change log
> > > > --
> > > >
> > > > v1->v2:
> > > > - add resv iotlb to avoid overlap mapping.
> > > > v2->v3:
> > > > - there is no need to export the iommu symbol anymore.
> > > >
> > > > Signed-off-by: Rong Wang 
> > >
> > > There's no interest in extending vhost iotlb further -
> > > we should just switch over to iommufd, which supports
> > > this already.
> >
> > IOMMUFD is good but VFIO supported this before IOMMUFD. This patch
> > makes vDPA run without a backport of full IOMMUFD in the production
> > environment. I think it's worthwhile.
> >
> > If you worry about the extension, we can just use the vhost iotlb
> > existing facility to do this.
> >
> > Thanks
> 
> Btw, Wang Rong,
> 
> It looks that Cindy does have the bandwidth in working for IOMMUFD support.

I think you mean she does not.

> Do you have the will to do that?
> 
> Thanks




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 05:08:57PM +0800, Jason Wang wrote:
> On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > From: Rong Wang 
> > >
> > > Once an iommu domain is enabled for a device, the MSI
> > > translation tables have to be there for software-managed MSI.
> > > Otherwise, a platform with software-managed MSI but without an
> > > irq bypass function cannot get a correct memory write event
> > > from PCIe and will not get irqs.
> > > The solution is to obtain the MSI physical base address from the
> > > iommu reserved region and set it as the iommu MSI cookie;
> > > the translation tables will then be created when the irq is requested.
> > >
> > > Change log
> > > --
> > >
> > > v1->v2:
> > > - add resv iotlb to avoid overlap mapping.
> > > v2->v3:
> > > - there is no need to export the iommu symbol anymore.
> > >
> > > Signed-off-by: Rong Wang 
> >
> > There's no interest in extending vhost iotlb further -
> > we should just switch over to iommufd, which supports
> > this already.
> 
> IOMMUFD is good but VFIO supported this before IOMMUFD.

You mean VFIO migrated to IOMMUFD but of course they keep supporting
their old UAPI? OK and point being?

> This patch
> makes vDPA run without a backport of full IOMMUFD in the production
> environment. I think it's worthwhile.

Where do we stop? saying no to features is the only tool maintainers
have to make cleanups happen, otherwise people will just keep piling
stuff up.

> If you worry about the extension, we can just use the vhost iotlb
> existing facility to do this.
> 
> Thanks
> 
> >
> > > ---
> > >  drivers/vhost/vdpa.c | 59 +---
> > >  1 file changed, 56 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > index ba52d128aeb7..28b56b10372b 100644
> > > --- a/drivers/vhost/vdpa.c
> > > +++ b/drivers/vhost/vdpa.c
> > > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> > >   struct completion completion;
> > >   struct vdpa_device *vdpa;
> > >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > > + struct vhost_iotlb resv_iotlb;
> > >   struct device dev;
> > >   struct cdev cdev;
> > >   atomic_t opened;
> > > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> > >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> > >  {
> > >   v->in_batch = 0;
> > > + vhost_iotlb_reset(&v->resv_iotlb);
> > >   return _compat_vdpa_reset(v);
> > >  }
> > >
> > > @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
> > > vhost_vdpa *v,
> > >   msg->iova + msg->size - 1 > v->range.last)
> > >   return -EINVAL;
> > >
> > > + if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> > > + msg->iova + msg->size - 1))
> > > + return -EINVAL;
> > > +
> > >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> > >   msg->iova + msg->size - 1))
> > >   return -EEXIST;
> > >
> > > +
> > >   if (vdpa->use_va)
> > >   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
> > >msg->uaddr, msg->perm);
> > > @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct 
> > > kiocb *iocb,
> > >   return vhost_chr_write_iter(dev, from);
> > >  }
> > >
> > > +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, 
> > > struct device *dma_dev,
> > > + struct vhost_iotlb *resv_iotlb)
> > > +{
> > > + struct list_head dev_resv_regions;
> > > + phys_addr_t resv_msi_base = 0;
> > > + struct iommu_resv_region *region;
> > > + int ret = 0;
> > > + bool with_sw_msi = false;
> > > + bool with_hw_msi = false;
> > > +
> > > + INIT_LIST_HEAD(&dev_resv_regions);
> > > + iommu_get_resv_regions(dma_dev, &dev_resv_regions);
> > > +
> > > + list_for_each_entry(region, &dev_resv_regions, list) {
> > > + ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> > > +   

Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-03-28 Thread Michael S. Tsirkin
On Thu, Mar 28, 2024 at 10:21:49AM +1000, Gavin Shan wrote:
> All the callers of vhost_get_avail_idx() are concerned with the memory
> barrier, imposed by smp_rmb(), that orders the available ring entry
> read after the avail_idx read.
> 
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced. With it, the callers needn't worry
> about the memory barrier.
> 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Gavin Shan 

Previous patches are ok. This one I feel needs more work -
first more code such as sanity checking should go into
this function, second there's actually a difference
between comparing to last_avail_idx and just comparing
to the previous value of avail_idx.
I will pick patches 1-2 and post a cleanup on top so you can
take a look, ok?


> ---
>  drivers/vhost/vhost.c | 75 +++
>  1 file changed, 26 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 32686c79c41d..e6882f4f6ce2 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,28 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 avail_idx;
> + int r;
> +
> + r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
> + if (unlikely(r)) {
> + vq_err(vq, "Failed to access avail idx at %p\n",
> +&vq->avail->idx);
> + return r;
> + }
> +
> + vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> + if (vq->avail_idx != vq->last_avail_idx) {
> + /* Ensure the available ring entry read happens
> +  * before the avail_idx read when the avail_idx
> +  * is advanced.
> +  */
> + smp_rmb();
> + }
> +
> + return 0;
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> @@ -2499,7 +2517,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
>   u16 last_avail_idx;
> - __virtio16 avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> @@ -2507,12 +2524,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   last_avail_idx = vq->last_avail_idx;
>  
>   if (vq->avail_idx == vq->last_avail_idx) {
> - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> - &vq->avail->idx);
> + if (unlikely(vhost_get_avail_idx(vq)))
>   return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  
>   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
>   vq_err(vq, "Guest moved used index from %u to %u",
> @@ -2525,11 +2538,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>*/
>   if (vq->avail_idx == last_avail_idx)
>   return vq->num;
> -
> - /* Only get avail ring entries after they have been
> -  * exposed by guest.
> -  */
> - smp_rmb();
>   }
>  
>   /* Grab the next descriptor number they're advertising, and increment
> @@ -2790,35 +2798,19 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>  /* return true if we're sure that avaiable ring is empty */
>  bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
> - __virtio16 avail_idx;
> - int r;
> -
>   if (vq->avail_idx != vq->last_avail_idx)
>   return false;
>  
> - r = vhost_get_avail_idx(vq, &avail_idx);
> - if (unlikely(r))
> - return false;
> -
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> - if (vq->avail_idx != vq->last_avail_idx) {
> - /* Since we have updated avail_idx, the following
> -  * call to vhost_get_vq_desc() will read available
> -  * ring entries. Make sure that read happens after
> -  * the avail_idx read.
> -  */
> - smp_rmb();
> + if (unlikely(vhost_get_avail_idx(vq)))
>   return false;
> - }
>  
> - retu

Re: [PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 07:52:02PM +, Will Deacon wrote:
> On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote:
> > vhost_get_vq_desc (correctly) uses smp_rmb to order
> > avail ring reads after index reads.
> > However, over time we added two more places that read the
> > index and do not bother with barriers.
> > Since vhost_get_vq_desc, when it was written, assumed it was the
> > only reader, it does not bother with a barrier either when it sees
> > that a new index value is already cached. As a result,
> > on the nvidia-gracehopper platform (arm64) available ring
> > entry reads have been observed bypassing index reads, causing
> > ring corruption.
> > 
> > To fix, factor out the correct index access code from vhost_get_vq_desc.
> > As a side benefit, we also validate the index on all paths now, which
> > will hopefully help catch future errors earlier.
> > 
> > Note: current code is inconsistent in how it handles errors:
> > some places treat it as an empty ring, others as non-empty.
> > This patch does not attempt to change the existing behaviour.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Gavin Shan 
> > Reported-by: Will Deacon 
> > Suggested-by: Will Deacon 
> > Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
> > Cc: "Jason Wang" 
> > Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> > Cc: "Stefano Garzarella" 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> > 
> > I think it's better to bite the bullet and clean up the code.
> > Note: this is still only built, not tested.
> > Gavin could you help test please?
> > Especially on the arm platform you have?
> > 
> > Will thanks so much for finding this race!
> 
> No problem, and I was also hoping that the smp_rmb() could be
> consolidated into a single helper like you've done here.
> 
> One minor comment below:
> 
> >  drivers/vhost/vhost.c | 80 +++
> >  1 file changed, 42 insertions(+), 38 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..26b70b1fd9ff 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev 
> > *d)
> > mutex_unlock(&d->vqs[i]->mutex);
> >  }
> >  
> > -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > - __virtio16 *idx)
> > +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
> >  {
> > -   return vhost_get_avail(vq, *idx, &vq->avail->idx);
> > +   __virtio16 idx;
> > +   u16 avail_idx;
> > +   int r = vhost_get_avail(vq, idx, &vq->avail->idx);
> > +
> > +   if (unlikely(r < 0)) {
> > +   vq_err(vq, "Failed to access avail idx at %p: %d\n",
> > +  &vq->avail->idx, r);
> > +   return -EFAULT;
> > +   }
> > +
> > +   avail_idx = vhost16_to_cpu(vq, idx);
> > +
> > +   /* Check it isn't doing very strange things with descriptor numbers. */
> > +   if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
> > +   vq_err(vq, "Guest moved used index from %u to %u",
> > +  vq->last_avail_idx, vq->avail_idx);
> > +   return -EFAULT;
> > +   }
> > +
> > +   /* Nothing new? We are done. */
> > +   if (avail_idx == vq->avail_idx)
> > +   return 0;
> > +
> > +   vq->avail_idx = avail_idx;
> > +
> > +   /* We updated vq->avail_idx so we need a memory barrier between
> > +* the index read above and the caller reading avail ring entries.
> > +*/
> > +   smp_rmb();
> 
> I think you could use smp_acquire__after_ctrl_dep() if you're feeling
> brave, but to be honest I'd prefer we went in the opposite direction
> and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across
> the board. It's just a thankless, error-prone task to get there :(

Let's just say that's a separate patch, I tried hard to make this one
a bugfix only, no other functional changes at all.

> So, for the patch as-is:
> 
> Acked-by: Will Deacon 
> 
> (I've not tested it either though, so definitely wait for Gavin on that!)
> 
> Cheers,
> 
> Will




[PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Michael S. Tsirkin
vhost_get_vq_desc (correctly) uses smp_rmb to order
avail ring reads after index reads.
However, over time we added two more places that read the
index and do not bother with barriers.
Since vhost_get_vq_desc, when it was written, assumed it was the
only reader, it does not bother with a barrier either when it sees
that a new index value is already cached. As a result,
on the nvidia-gracehopper platform (arm64) available ring
entry reads have been observed bypassing index reads, causing
ring corruption.

To fix, factor out the correct index access code from vhost_get_vq_desc.
As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others as non-empty.
This patch does not attempt to change the existing behaviour.

Cc: sta...@vger.kernel.org
Reported-by: Gavin Shan 
Reported-by: Will Deacon 
Suggested-by: Will Deacon 
Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
Cc: "Jason Wang" 
Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc: "Stefano Garzarella" 
Signed-off-by: Michael S. Tsirkin 
---

I think it's better to bite the bullet and clean up the code.
Note: this is still only built, not tested.
Gavin could you help test please?
Especially on the arm platform you have?

Will thanks so much for finding this race!


 drivers/vhost/vhost.c | 80 +++
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..26b70b1fd9ff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-   return vhost_get_avail(vq, *idx, &vq->avail->idx);
+   __virtio16 idx;
+   u16 avail_idx;
+   int r = vhost_get_avail(vq, idx, &vq->avail->idx);
+
+   if (unlikely(r < 0)) {
+   vq_err(vq, "Failed to access avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return -EFAULT;
+   }
+
+   avail_idx = vhost16_to_cpu(vq, idx);
+
+   /* Check it isn't doing very strange things with descriptor numbers. */
+   if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
+   vq_err(vq, "Guest moved used index from %u to %u",
+  vq->last_avail_idx, vq->avail_idx);
+   return -EFAULT;
+   }
+
+   /* Nothing new? We are done. */
+   if (avail_idx == vq->avail_idx)
+   return 0;
+
+   vq->avail_idx = avail_idx;
+
+   /* We updated vq->avail_idx so we need a memory barrier between
+* the index read above and the caller reading avail ring entries.
+*/
+   smp_rmb();
+   return 1;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2498,38 +2526,21 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx;
-   __virtio16 avail_idx;
+   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
 
-   /* Check it isn't doing very strange things with descriptor numbers. */
-   last_avail_idx = vq->last_avail_idx;
 
if (vq->avail_idx == vq->last_avail_idx) {
-   if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-   &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved used index from %u to %u",
-   last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   ret = vhost_get_avail_idx(vq);
+   if (unlikely(ret < 0))
+   return ret;
 
/* If there's nothing new since last we looked, return
 * invalid.
 */
-   if (vq->avail_idx == last_avail_idx)
+   if (!ret)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
 
/* Grab the next descriptor number they're advertising, 

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2024 at 09:38:45AM +1000, Gavin Shan wrote:
> An smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon . Otherwise, it's not ensured that the
> available ring entries pushed by the guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
> 
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
> 
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, 
> struct vhost_virtqueue *vq)
>   r = vhost_get_avail_idx(vq, &avail_idx);
>   if (unlikely(r))
>   return false;
> +
>   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> + if (vq->avail_idx != vq->last_avail_idx) {
> + /* Similar to what's done in vhost_get_vq_desc(), we need
> +  * to ensure the available ring entries have been exposed
> +  * by guest.
> +  */

A slightly clearer comment:

/* Since we have updated avail_idx, the following call to
 * vhost_get_vq_desc will read available ring entries.
 * Make sure that read happens after the avail_idx read.
 */

Pls repost with that, and I will apply.

Also add suggested-by for will.


> + smp_rmb();
> + return false;
> + }
>  
> - return vq->avail_idx == vq->last_avail_idx;
> + return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

As a follow-up patch, we should clean up the code duplication that
has accumulated, with 3 places reading avail idx in essentially
the same way; this duplication is what caused the mess in
the first place.






> -- 
> 2.44.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-27 Thread Michael S. Tsirkin
On Tue, Mar 26, 2024 at 03:46:29PM +, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:
> > On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > Secondly, the debugging code is enhanced so that the available head 
> > > > > for
> > > > > (last_avail_idx - 1) is read for twice and recorded. It means the 
> > > > > available
> > > > > head for one specific available index is read for twice. I do see the
> > > > > available heads are different from the consecutive reads. More details
> > > > > are shared as below.
> > > > > 
> > > > > From the guest side
> > > > > ===
> > > > > 
> > > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > > head to be released: 047 062 112
> > > > > 
> > > > > avail_idx:
> > > > > 000  49665
> > > > > 001  49666  <--
> > > > >  :
> > > > > 015  49664
> > > > 
> > > > what are these #s 49665 and so on?
> > > > and how large is the ring?
> > > > I am guessing 49664 is the index, the ring size is 16, and
> > > > 49664 % 16 == 0
> > > 
> > > More than that, 49664 % 256 == 0
> > > 
> > > So again there seems to be an error in the vicinity of roll-over of
> > > the idx low byte, as I observed in the earlier log. Surely this is
> > > more than coincidence?
> > 
> > Yeah, I'd still really like to see the disassembly for both sides of the
> > protocol here. Gavin, is that something you're able to provide? Worst
> > case, the host and guest vmlinux objects would be a starting point.
> > 
> > Personally, I'd be fairly surprised if this was a hardware issue.
> 
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
> 
> Will
> 
> --->8
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
> vhost_virtqueue *vq)
> return false;
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>  
> +   smp_rmb();
> return vq->avail_idx == vq->last_avail_idx;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Oh wow you are right.

We have:

if (vq->avail_idx == vq->last_avail_idx) {
if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
return -EFAULT;
}
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
vq_err(vq, "Guest moved used index from %u to %u",
last_avail_idx, vq->avail_idx);
return -EFAULT;
}

/* If there's nothing new since last we looked, return
 * invalid.
 */
if (vq->avail_idx == last_avail_idx)
return vq->num;

/* Only get avail ring entries after they have been
 * exposed by guest.
 */
smp_rmb();
}


and so the rmb only happens if avail_idx is not advanced.

Actually there is a bunch of code duplication where we assign to
avail_idx, too.

Will thanks a lot for looking into this! I kept looking into
the virtio side for some reason, the fact that it did not
trigger with qemu should have been a big hint!


-- 
MST




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Michael S. Tsirkin
On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> 
> On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 6f7e5010a673..79456706d0bd 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct 
> > > > virtqueue *_vq,
> > > > /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >  * do sync). */
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > -   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > head);
> > > > > +   u16 headwithflag = head | (vq->split.avail_idx_shadow & 
> > > > ~(vq->split.vring.num - 1));
> > > > +   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > headwithflag);
> > > > /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >  * new available array entries. */
> > > > 
> 
> Ok, Michael. I continued with my debugging code. It still looks like a
> hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> involved for the discussion, as suggested by you.

Do you have a support contact at Nvidia to report this?

> Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> Note that I have only one vCPU in my configuration.

Interesting, but is the guest built with CONFIG_SMP set?

> Secondly, the debugging code is enhanced so that the available head for
> (last_avail_idx - 1) is read twice and recorded. It means the available
> head for one specific available index is read twice. I do see that the
> available heads differ between the consecutive reads. More details
> are shared below.
> 
> From the guest side
> ===
> 
> virtio_net virtio0: output.0:id 86 is not a head!
> head to be released: 047 062 112
> 
> avail_idx:
> 000  49665
> 001  49666  <--
>  :
> 015  49664

what are these #s 49665 and so on?
and how large is the ring?
I am guessing 49664 is the index, the ring size is 16, and
49664 % 16 == 0

> avail_head:


is this the avail ring contents?

> 000  062
> 001  047  <--
>  :
> 015  112


What are these arrows pointing at, btw?


> From the host side
> ==
> 
> avail_idx
> 000  49663
> 001  49666  <---
>  :
> 
> avail_head
> 000  062  (062)
> 001  047  (047)  <---
>  :
> 015  086  (112)  // head 086 is returned from the first read,
>  // but head 112 is returned from the second read
> 
> vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 
> 49664
> 
> Thanks,
> Gavin

OK thanks so this proves it is actually the avail ring value.

-- 
MST




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-21 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> From: Rong Wang 
> 
> Once an iommu domain is enabled for a device, the MSI
> translation tables have to be there for software-managed MSI.
> Otherwise, a platform with software-managed MSI but without an
> irq bypass function cannot get a correct memory write event
> from PCIe and will not get irqs.
> The solution is to obtain the MSI physical base address from the
> iommu reserved region and set it as the iommu MSI cookie; the
> translation tables will then be created when the irq is requested.
> 
> Change log
> --
> 
> v1->v2:
> - add resv iotlb to avoid overlap mapping.
> v2->v3:
> - there is no need to export the iommu symbol anymore.
> 
> Signed-off-by: Rong Wang 

There's no interest in extending vhost iotlb further -
we should just switch over to iommufd, which supports
this already.

> ---
>  drivers/vhost/vdpa.c | 59 +---
>  1 file changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ba52d128aeb7..28b56b10372b 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -49,6 +49,7 @@ struct vhost_vdpa {
>   struct completion completion;
>   struct vdpa_device *vdpa;
>   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> + struct vhost_iotlb resv_iotlb;
>   struct device dev;
>   struct cdev cdev;
>   atomic_t opened;
> @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
>  static int vhost_vdpa_reset(struct vhost_vdpa *v)
>  {
>   v->in_batch = 0;
> + vhost_iotlb_reset(&v->resv_iotlb);
>   return _compat_vdpa_reset(v);
>  }
>  
> @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
> vhost_vdpa *v,
>   msg->iova + msg->size - 1 > v->range.last)
>   return -EINVAL;
>  
> + if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> + msg->iova + msg->size - 1))
> + return -EINVAL;
> +
>   if (vhost_iotlb_itree_first(iotlb, msg->iova,
>   msg->iova + msg->size - 1))
>   return -EEXIST;
>  
> +
>   if (vdpa->use_va)
>   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
>msg->uaddr, msg->perm);
> @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb 
> *iocb,
>   return vhost_chr_write_iter(dev, from);
>  }
>  
> +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct 
> device *dma_dev,
> + struct vhost_iotlb *resv_iotlb)
> +{
> + struct list_head dev_resv_regions;
> + phys_addr_t resv_msi_base = 0;
> + struct iommu_resv_region *region;
> + int ret = 0;
> + bool with_sw_msi = false;
> + bool with_hw_msi = false;
> +
> + INIT_LIST_HEAD(&dev_resv_regions);
> + iommu_get_resv_regions(dma_dev, &dev_resv_regions);
> +
> + list_for_each_entry(region, &dev_resv_regions, list) {
> + ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> + region->start + region->length - 1,
> + 0, 0, NULL);
> + if (ret) {
> + vhost_iotlb_reset(resv_iotlb);
> + break;
> + }
> +
> + if (region->type == IOMMU_RESV_MSI)
> + with_hw_msi = true;
> +
> + if (region->type == IOMMU_RESV_SW_MSI) {
> + resv_msi_base = region->start;
> + with_sw_msi = true;
> + }
> + }
> +
> + if (!ret && !with_hw_msi && with_sw_msi)
> + ret = iommu_get_msi_cookie(domain, resv_msi_base);
> +
> + iommu_put_resv_regions(dma_dev, &dev_resv_regions);
> +
> + return ret;
> +}
> +
>  static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
>  {
>   struct vdpa_device *vdpa = v->vdpa;
> @@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa 
> *v)
>  
>   ret = iommu_attach_device(v->domain, dma_dev);
>   if (ret)
> - goto err_attach;
> + goto err_alloc_domain;
>  
> - return 0;
> + ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, &v->resv_iotlb);
> + if (ret)
> + goto err_attach_device;
>  
> -err_attach:
> + return 0;
> +err_attach_device:
> + iommu_detach_device(v->domain, dma_dev);
> +err_alloc_domain:
>   iommu_domain_free(v->domain);
>   v->domain = NULL;
>   return ret;
> @@ -1595,6 +1646,8 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
>   goto err;
>   }
>  
> + vhost_iotlb_init(&v->resv_iotlb, 0, 0);
> +
>   r = dev_set_name(&v->dev, "vhost-vdpa-%u", minor);
>   if (r)
>   goto err;
> -- 
> 2.27.0
> 




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > I think you are wasting your time with these tests. Even if it helps, what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from index
> > into ring itself. Then vhost can check that head is synchronized with
> > index.
> > 
> > Warning: completely untested, not even compiled. But should give you
> > the idea. If this works btw we should consider making this official in
> > the spec.
> > 
> > 
> >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > /* Put entry in available array (but don't update avail->idx until they
> >  * do sync). */
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > -   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > +   u16 headwithflag = head | (vq->split.avail_idx_shadow & 
> > ~(vq->split.vring.num - 1));
> > +   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > headwithflag);
> > /* Descriptors and available array need to be set before we expose the
> >  * new available array entries. */
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct 
> > vhost_virtqueue *vq,
> >   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >__virtio16 *head, int idx)
> >   {
> > -   return vhost_get_avail(vq, *head,
> > +   unsigned i = idx;
> > +   unsigned flag = i & ~(vq->num - 1);
> > +   unsigned val = vhost_get_avail(vq, *head,
> > &vq->avail->ring[idx & (vq->num - 1)]);
> > +   unsigned valflag = val & ~(vq->num - 1);
> > +
> > +   WARN_ON(valflag != flag);
> > +
> > +   return val & (vq->num - 1);
> >   }
> 
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx to the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while
> the high byte is still correct and valid?


I would find this very surprising.

> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] =
> cpu_to_virtio16(_vq->vdev, head | (avail << 8));
> 
> 
> head = vhost16_to_cpu(vq, ring_head);
> WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
> head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go around the ring a full circle.
Thus you want exactly the *high byte* of avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.


The advantage of this debugging patch is that it will detect the issue
immediately, not after the guest has detected the problem in the used ring.
For example, you can add code to re-read the value, or dump the whole
ring.

> One question: Does QEMU has any chance writing data to the available queue 
> when
> vhost is enabled? My previous understanding is no, the queue is totally owned 
> by
> vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record last 16 
> transactions
> to the available and used queue from guest and host side. It did reveal the 
> wrong
> head was fetched from the available queue.

Oh nice that's a very good hint. And is this still reproducible?

> [   11.785745]  virtqueue_get_buf_ctx_split 
> [   11

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 09:56:58AM +1000, Gavin Shan wrote:
> On 3/20/24 04:22, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > On 3/19/24 02:59, Will Deacon wrote:
> > > > >drivers/virtio/virtio_ring.c | 12 +---
> > > > >1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensure the order theoretically. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > >   vq->split.avail_idx_shadow++;
> > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >   
> > > > > vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing different from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smb_mb()).
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has 
> > > issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > works
> > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > > 
> > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >  :
> > >  /* Put entry in available array (but don't update avail->idx 
> > > until they
> > >   * do sync). */
> > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > 
> > >  /* Descriptors and available array need to be set before we 
> > > expose the
> > >   * new available array entries. */
> > >  // Broken: virtio_wmb(vq->weak_barriers);
> > >  // Broken: __dma_mb();
> > >  // Work:   __mb();
> > >  // Work:   __smp_mb();
> > 
> > It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> > confident are you in that result?
> > 
> 
> Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
> that both __dma_mb() and __smp_mb() work for us. I had too many tests 
> yesterday
> and something may have been messed up.
> 
> Instruction Hitting times in 10 tests
> -
> __smp_wmb() 8
> __smp_mb()  0
> __dma_wmb() 7
> __dma_mb()  0
> __mb()  0
> __wmb() 0
> 
> It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
> read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by 
> Michael
> in another reply. Will update the result soon.
> 
> Thanks,
> Gavin


I think you are wasting your time with these tests. Even if it helps, what
does this tell us? Try setting a flag as I suggested elsewhere.
Then check it in vhost.
Or here's another idea - possibly easier. Copy the high bits from index
into ring itself. Then vhost can check that head is synchronized with
index.

Warning: completely untested, not even compiled. But should give you
the idea. If this works btw we should consider making this official in
the spec.


 static inline int vhost_get_avail_flags(struct 

Re: [GIT PULL] virtio: features, fixes

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 11:03:44AM -0700, Linus Torvalds wrote:
> On Tue, 19 Mar 2024 at 00:41, Michael S. Tsirkin  wrote:
> >
> > virtio: features, fixes
> >
> > Per vq sizes in vdpa.
> > Info query for block devices support in vdpa.
> > DMA sync callbacks in vduse.
> >
> > Fixes, cleanups.
> 
> Grr. I thought the merge message was a bit too terse, but I let it slide.
> 
> But only after pushing it out do I notice that not only was the pull
> request message overly terse, you had also rebased this all just
> moments before sending the pull request and didn't even give a hit of
> a reason for that.
> 
> So I missed that, and the merge is out now, but this was NOT OK.
> 
> Yes, rebasing happens. But last-minute rebasing needs to be explained,
> not some kind of nasty surprise after-the-fact.
> 
> And that pull request explanation was really borderline even *without*
> that issue.
> 
> Linus

OK thanks Linus and sorry. I did that rebase for testing then I thought
hey history looks much nicer now why don't I switch to that.  Just goes
to show not to do this thing past midnight, I write better merge
messages at sane hours, too.

-- 
MST




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 01:19:23PM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of 
> > > git://git.kernel.org/p..
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > Debian) 2.40
> > > 
> > > Downloadable assets:
> > > disk image: 
> > > https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> > > vmlinux: 
> > > https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> > > kernel image: 
> > > https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> > > 
> > > Key type pkcs7_test registered
> > > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> > > io scheduler mq-deadline registered
> > > io scheduler kyber registered
> > > io scheduler bfq registered
> > > input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > > ACPI: button: Power Button [PWRF]
> > > input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> > > ACPI: button: Sleep Button [SLPF]
> > > ioatdma: Intel(R) QuickData Technology Driver 5.00
> > > ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> > > ACPI: \_SB_.LNKD: Enabled at IRQ 10
> > > virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> > > ACPI: \_SB_.LNKB: Enabled at IRQ 10
> > > virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> > > virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> > > N_HDLC line discipline registered with maxframe=4096
> > > Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> > > 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> > > 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> > > Non-volatile memory driver v1.3
> > > Linux agpgart interface v0.103
> > > ACPI: bus type drm_connector registered
> > > [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> > > [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> > > Console: switching to colour frame buffer device 128x48
> > > platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> > > usbcore: registered new interface driver udl
> > > brd: module loaded
> > > loop: module loaded
> > > zram: Added device: zram0
> > > null_blk: disk nullb0 created
> > > null_blk: module loaded
> > > Guest personality initialized and is inactive
> > > VMCI host device registered (name=vmci, major=10, minor=118)
> > > Initialized host personality
> > > usbcore: registered new interface driver rtsx_usb
> > > usbcore: registered new interface driver viperboard
> > > usbcore: registered new interface driver dln2
> > > usbcore: registered new interface driver pn533_usb
> > > nfcsim 0.2 initialized
> > > usbcore: registered new interface driver port100
> > > usbcore: registered new interface driver nfcmrvl
> > > Loading iSCSI transport class v2.0-870.
> > > virtio_scsi virtio0: 1/0/0 default/read/poll queues
> > > [ cut here ]
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> > > refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > > Modules linked in:
> > > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > > 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > > Google 02/29/2024
> > > RIP: 0010:refcount_warn_s

Re: [PATCH v7 0/3] vduse: add support for networking devices

2024-03-19 Thread Michael S. Tsirkin
On Thu, Feb 29, 2024 at 11:16:04AM +0100, Maxime Coquelin wrote:
> Hello Michael,
> 
> On 2/1/24 09:40, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 09:34:11AM +0100, Maxime Coquelin wrote:
> > > Hi Jason,
> > > 
> > > It looks like all patches got acked by you.
> > > Any blocker to queue the series for next release?
> > > 
> > > Thanks,
> > > Maxime
> > 
> > I think it's good enough at this point. Will put it in
> > linux-next shortly.
> > 
> 
> I fetched linux-next and it seems the series is not in yet.
> Is there anything to be reworked on my side?
> 
> Thanks,
> Maxime

I am sorry I messed up. It was in a wrong branch and was not
pushed so of course it did not get tested and I kept wondering
why. I pushed it in my tree but it is too late to put it upstream
for this cycle. Assuming Linus merges my tree
with no drama, I will make an effort not to rebase my tree below them
so these patches will keep their hashes, you can use them already.

-- 
MST




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 06:08:27PM +1000, Gavin Shan wrote:
> On 3/19/24 17:09, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -687,9 +687,15 @@ static inline int 
> > > > > > > > > virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > > > (vq->split.vring.num - 1);
> > > > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > > > - /* Descriptors and available array need to be set 
> > > > > > > > > before we expose the
> > > > > > > > > -  * new available array entries. */
> > > > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > > > + /*
> > > > > > > > > +  * Descriptors and available array need to be set 
> > > > > > > > > before we expose
> > > > > > > > > +  * the new available array entries. virtio_wmb() should 
> > > > > > > > > be enough
> > > > > > > > > +  * to ensure the order theoretically. However, a 
> > > > > > > > > stronger barrier
> > > > > > > > > +  * is needed by ARM64. Otherwise, the stale data can be 
> > > > > > > > > observed
> > > > > > > > > +  * by the host (vhost). A stronger barrier should work 
> > > > > > > > > for other
> > > > > > > > > +  * architectures, but performance loss is expected.
> > > > > > > > > +  */
> > > > > > > > > + virtio_mb(false);
> > > > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > > >   
> > > > > > > > > vq->split.avail_idx_shadow);
> > > > > > > > 
> > > > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > > > > solution
> > > > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > > > 
> > > > > > > > In practice, either the larger timing different from the DSB or 
> > > > > > > > the fact
> > > > > > > > that you're going from a Store->Store barrier to a full barrier 
> > > > > > > > is what
> > > > > > > > makes things "work" for you. Have you tried, for example, a DMB 
> > > > > > > > SY
> > > > > > > > (e.g. via __smb_mb()).
> > > > > > > > 
> > > > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > > > explanation of what is going on.
> > > > > > > > 
> > > > > > > 
> > > > > > > Thanks for your comments, Will.
> > > > > > > 
> > > > > > > Yes, DMB should work for us. However, it seems this instruction 
> > > > > > > has issues on
> > > > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and 
> > > > > > > DSB works
> > > > > > > from hardware level. I agree it's not the solution to replace DMB 
> > > > > > > with DSB
> > > > > > > before we fully understand the root cause.
> > > > > > > 
>

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 09:21:06AM +0100, Tobias Huschle wrote:
> On 2024-03-15 11:31, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> > > On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > > >
> > 
> > Could you remind me pls, what is the kworker doing specifically that
> > vhost is relying on?
> 
> The kworker is handling the actual data moving in memory if I'm not
> mistaking.

I think that is the vhost process itself. Maybe you mean the
guest thread versus the vhost thread then?




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > > head);
> > > > > > -   /* Descriptors and available array need to be set before we 
> > > > > > expose the
> > > > > > -* new available array entries. */
> > > > > > -   virtio_wmb(vq->weak_barriers);
> > > > > > +   /*
> > > > > > +* Descriptors and available array need to be set before we 
> > > > > > expose
> > > > > > +* the new available array entries. virtio_wmb() should be 
> > > > > > enough
> > > > > > +* to ensure the order theoretically. However, a stronger 
> > > > > > barrier
> > > > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +* by the host (vhost). A stronger barrier should work for other
> > > > > > +* architectures, but performance loss is expected.
> > > > > > +*/
> > > > > > +   virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > 
> > > > > > vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing different from the DSB or the 
> > > > > fact
> > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smb_mb()).
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > works
> > > > from hardware level. I agree it's not the solution to replace DMB with 
> > > > DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >  :
> > > >  /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >   * do sync). */
> > > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >  vq->split.vring.avail->ring[avail] = 
> > > > cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >  /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >   * new available array entries. */
> > > >  // Broken: virtio_wmb(vq->weak_barriers);
> > > >  // Broken: __dma_mb();
> > > >  // Work:   __mb();
> > >

[GIT PULL] virtio: features, fixes

2024-03-19 Thread Michael S. Tsirkin
The following changes since commit e8f897f4afef0031fe618a8e94127a0934896aba:

  Linux 6.8 (2024-03-10 13:38:09 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 5da7137de79ca6ffae3ace77050588cdf5263d33:

  virtio_net: rename free_old_xmit_skbs to free_old_xmit (2024-03-19 03:19:22 
-0400)


virtio: features, fixes

Per vq sizes in vdpa.
Info query for block devices support in vdpa.
DMA sync callbacks in vduse.

Fixes, cleanups.

Signed-off-by: Michael S. Tsirkin 


Andrew Melnychenko (1):
  vhost: Added pad cleanup if vnet_hdr is not present.

David Hildenbrand (1):
  virtio: reenable config if freezing device failed

Jason Wang (2):
  virtio-net: convert rx mode setting to use workqueue
  virtio-net: add cond_resched() to the command waiting loop

Jonah Palmer (1):
  vdpa/mlx5: Allow CVQ size changes

Maxime Coquelin (1):
  vduse: implement DMA sync callbacks

Ricardo B. Marliere (2):
  vdpa: make vdpa_bus const
  virtio: make virtio_bus const

Shannon Nelson (1):
  vdpa/pds: fixes for VF vdpa flr-aer handling

Steve Sistare (2):
  vdpa_sim: reset must not run
  vdpa: skip suspend/resume ops if not DRIVER_OK

Suzuki K Poulose (1):
  virtio: uapi: Drop __packed attribute in linux/virtio_pci.h

Xuan Zhuo (3):
  virtio: packed: fix unmap leak for indirect desc table
  virtio_net: unify the code for recycling the xmit ptr
  virtio_net: rename free_old_xmit_skbs to free_old_xmit

Zhu Lingshan (20):
  vhost-vdpa: uapi to support reporting per vq size
  vDPA: introduce get_vq_size to vdpa_config_ops
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_size
  vp_vdpa: implement vdpa_config_ops.get_vq_size
  eni_vdpa: implement vdpa_config_ops.get_vq_size
  vdpa_sim: implement vdpa_config_ops.get_vq_size for vDPA simulator
  vduse: implement vdpa_config_ops.get_vq_size for vduse
  virtio_vdpa: create vqs with the actual size
  vDPA/ifcvf: get_max_vq_size to return max size
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min
  vDPA: report virtio-block capacity to user space
  vDPA: report virtio-block max segment size to user space
  vDPA: report virtio-block block-size to user space
  vDPA: report virtio-block max segments in a request to user space
  vDPA: report virtio-block MQ info to user space
  vDPA: report virtio-block topology info to user space
  vDPA: report virtio-block discarding configuration to user space
  vDPA: report virtio-block write zeroes configuration to user space
  vDPA: report virtio-block read-only info to user space
  vDPA: report virtio-blk flush info to user space

 drivers/net/virtio_net.c | 151 +++-
 drivers/vdpa/alibaba/eni_vdpa.c  |   8 ++
 drivers/vdpa/ifcvf/ifcvf_base.c  |  11 +-
 drivers/vdpa/ifcvf/ifcvf_base.h  |   2 +
 drivers/vdpa/ifcvf/ifcvf_main.c  |  15 +++
 drivers/vdpa/mlx5/net/mlx5_vnet.c|  13 ++-
 drivers/vdpa/pds/aux_drv.c   |   2 +-
 drivers/vdpa/pds/vdpa_dev.c  |  20 +++-
 drivers/vdpa/pds/vdpa_dev.h  |   1 +
 drivers/vdpa/vdpa.c  | 214 ++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c |  15 ++-
 drivers/vdpa/vdpa_user/iova_domain.c |  27 -
 drivers/vdpa/vdpa_user/iova_domain.h |   8 ++
 drivers/vdpa/vdpa_user/vduse_dev.c   |  34 ++
 drivers/vdpa/virtio_pci/vp_vdpa.c|   8 ++
 drivers/vhost/net.c  |   3 +
 drivers/vhost/vdpa.c |  14 +++
 drivers/virtio/virtio.c  |   6 +-
 drivers/virtio/virtio_ring.c |   6 +-
 drivers/virtio/virtio_vdpa.c |   5 +-
 include/linux/vdpa.h |   6 +
 include/uapi/linux/vdpa.h|  17 +++
 include/uapi/linux/vhost.h   |   7 ++
 include/uapi/linux/virtio_pci.h  |  10 +-
 24 files changed, 521 insertions(+), 82 deletions(-)




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> 
> Key type pkcs7_test registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: button: Power Button [PWRF]
> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> ACPI: button: Sleep Button [SLPF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> ACPI: bus type drm_connector registered
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> Console: switching to colour frame buffer device 128x48
> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> usbcore: registered new interface driver udl
> brd: module loaded
> loop: module loaded
> zram: Added device: zram0
> null_blk: disk nullb0 created
> null_blk: module loaded
> Guest personality initialized and is inactive
> VMCI host device registered (name=vmci, major=10, minor=118)
> Initialized host personality
> usbcore: registered new interface driver rtsx_usb
> usbcore: registered new interface driver viperboard
> usbcore: registered new interface driver dln2
> usbcore: registered new interface driver pn533_usb
> nfcsim 0.2 initialized
> usbcore: registered new interface driver port100
> usbcore: registered new interface driver nfcmrvl
> Loading iSCSI transport class v2.0-870.
> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [ cut here ]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
> lib/refcount.c:31
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-syzkaller-11567-gb3603fcb79b1 
> #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 02/29/2024
> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 0c 
> f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb d9 e8 
> 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> RSP: :c9066e18 EFLAGS: 00010246
> RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> RDX:  RSI:  RDI: 
> RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> FS:  () GS:8880b940() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 88823000 CR3: 0e132000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  free_pages_prepare mm/page_alloc.c:1141 [inline]
>  __free_pages_ok+0xc54/0xd80 mm/page_alloc.c:1270
>  make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Mon, Mar 18, 2024 at 04:59:24PM +, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > The issue was reported by Yihuang Yu, who has been running 'netperf'
> > tests on NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > client is started in a VM hosted on the grace-hopper machine,
> > while the 'netperf' server runs on the grace-grace machine.
> > 
> > The VM is started with virtio-net, and vhost has been enabled.
> > We observe an error message spew from the VM and then a soft-lockup
> > report. The error message indicates that the data associated with
> > the descriptor (index: 135) has been released, and the queue
> > is marked as broken. This eventually leads to an endless effort
> > to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > and a soft-lockup. The stale index 135 is fetched from the available
> > ring and published to the used ring by vhost, meaning we have
> > disordered writes to the available ring element and the available index.
> > 
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> >   -accel kvm -machine virt,gic-version=host\
> >  : \
> >   -netdev tap,id=vnet0,vhost=on\
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > 
> >   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > 
> > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> > stronger virtio_mb(false), equivalent to replacing the 'dmb'
> > instruction with 'dsb' on ARM64. It should work for other
> > architectures, but a performance loss is expected.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Yihuang Yu 
> > Signed-off-by: Gavin Shan 
> > ---
> >  drivers/virtio/virtio_ring.c | 12 +---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 49299b1f9ec7..7d852811c912 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> >  
> > -   /* Descriptors and available array need to be set before we expose the
> > -* new available array entries. */
> > -   virtio_wmb(vq->weak_barriers);
> > +   /*
> > +* Descriptors and available array need to be set before we expose
> > +* the new available array entries. virtio_wmb() should be enough
> > +* to ensure the order theoretically. However, a stronger barrier
> > +* is needed by ARM64. Otherwise, the stale data can be observed
> > +* by the host (vhost). A stronger barrier should work for other
> > +* architectures, but performance loss is expected.
> > +*/
> > +   virtio_mb(false);
> > vq->split.avail_idx_shadow++;
> > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > vq->split.avail_idx_shadow);
> 
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
> 
> In practice, either the larger timing difference from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smp_mb())?
> 
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
> 
> Will

Just making sure: so on this system, how do
smp_wmb() and wmb() differ? smp_wmb() is normally for synchronizing
with a kernel running on another CPU, and we are doing something
unusual in virtio when we use it to synchronize with the host
as opposed to the guest - e.g. CONFIG_SMP is special-cased
because of this:

#define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)

Note __smp_wmb, not smp_wmb, which would be a NOP on UP.


-- 
MST




Re: [PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-03-19 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 09:24:07PM +0300, Dan Carpenter wrote:
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
> 
> Add an array_index_nospec() as well to prevent speculation issues.
> 
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 

Thanks a lot!
I assume this will be squashed into the relevant patch when that is
re-spun.

> ---
> v2: add array_index_nospec()
> v3: I accidentally corrupted v2.  Try again.
> 
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
>  
> - if (index > dev->vq_num)
> + if (index >= dev->vq_num)
>   return -EINVAL;
>  
> + index = array_index_nospec(index, dev->vq_num);
>   vq = dev->vqs[index];
>   vaddr = vq->vdpa_reconnect_vaddr;
>   if (vaddr == 0)
> -- 
> 2.43.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> 
> On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > > virtqueue *_vq,
> > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > (vq->split.vring.num - 1);
> > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we 
> > > > > > > expose the
> > > > > > > -  * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*
> > > > > > > +  * Descriptors and available array need to be set before we 
> > > > > > > expose
> > > > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > > > enough
> > > > > > > +  * to ensure the order theoretically. However, a stronger 
> > > > > > > barrier
> > > > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > > > +  * architectures, but performance loss is expected.
> > > > > > > +  */
> > > > > > > + virtio_mb(false);
> > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > >   
> > > > > > > vq->split.avail_idx_shadow);
> > > > > > 
> > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > > solution
> > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > 
> > > > > > In practice, either the larger timing difference from the DSB or
> > > > > > the fact that you're going from a Store->Store barrier to a full
> > > > > > barrier is what makes things "work" for you. Have you tried, for
> > > > > > example, a DMB SY (e.g. via __smp_mb())?
> > > > > > 
> > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > explanation of what is going on.
> > > > > > 
> > > > > 
> > > > > Thanks for your comments, Will.
> > > > > 
> > > > > Yes, DMB should work for us. However, it seems this instruction has
> > > > > issues on NVidia's grace-hopper. It's hard for me to understand how
> > > > > DMB and DSB work at the hardware level. I agree that replacing DMB
> > > > > with DSB is not the right solution before we fully understand the
> > > > > root cause.
> > > > > 
> > > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > > issue like
> > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) 
> > > > > doesn't.
> > > > > 
> > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > {
> > > > >   :
> > > > >   /* Put entry in available array (but don't update 
> > > > > avail->idx until they
> > > > >* do sync). */
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 
> > > > > 1);
> > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > cpu_to_virtio16(_vq->

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > > head);
> > > > > > -   /* Descriptors and available array need to be set before we 
> > > > > > expose the
> > > > > > -* new available array entries. */
> > > > > > -   virtio_wmb(vq->weak_barriers);
> > > > > > +   /*
> > > > > > +* Descriptors and available array need to be set before we 
> > > > > > expose
> > > > > > +* the new available array entries. virtio_wmb() should be 
> > > > > > enough
> > > > > > +* to ensure the order theoretically. However, a stronger 
> > > > > > barrier
> > > > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +* by the host (vhost). A stronger barrier should work for other
> > > > > > +* architectures, but performance loss is expected.
> > > > > > +*/
> > > > > > +   virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > 
> > > > > > vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing difference from the DSB or
> > > > > the fact that you're going from a Store->Store barrier to a full
> > > > > barrier is what makes things "work" for you. Have you tried, for
> > > > > example, a DMB SY (e.g. via __smp_mb())?
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has
> > > > issues on NVidia's grace-hopper. It's hard for me to understand how
> > > > DMB and DSB work at the hardware level. I agree that replacing DMB
> > > > with DSB is not the right solution before we fully understand the
> > > > root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >  :
> > > >  /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >   * do sync). */
> > > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >  vq->split.vring.avail->ring[avail] = 
> > > > cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >  /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >   * new available array entries. */
> > > >  // Broken: virtio_wmb(vq->weak_barriers);
> > > >  // Broken: __dma_mb();
> > > >  // Work:   __mb(

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> On 3/19/24 16:09, Michael S. Tsirkin wrote:
> 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensure the order theoretically. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > >   vq->split.avail_idx_shadow++;
> > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >   
> > > > > vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing difference from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smp_mb())?
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has
> > > issues on NVidia's grace-hopper. It's hard for me to understand how
> > > DMB and DSB work at the hardware level. I agree that replacing DMB
> > > with DSB is not the right solution before we fully understand the
> > > root cause.
> > > 
> > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >  :
> > >  /* Put entry in available array (but don't update avail->idx 
> > > until they
> > >   * do sync). */
> > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > 
> > >  /* Descriptors and available array need to be set before we 
> > > expose the
> > >   * new available array entries. */
> > >  // Broken: virtio_wmb(vq->weak_barriers);
> > >  // Broken: __dma_mb();
> > >  // Work:   __mb();
> > >  // Work:   __smp_mb();
> > >  // Work:   __ndelay(100);
> > >  // Work:   __ndelay(10);
> > >  // Broken: __ndelay(9);
> > > 
> > > vq->split.avail_idx_shadow++;
> > >  vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >  
> > > vq->split.avail_idx_shadow);
> > 
> > What if you stick __ndelay here?
> > 
> 
>/* Put entry in available array (but don't update avail->idx until they
>  * d

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue was reported by Yihuang Yu, who has been running 'netperf'
> tests on NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in a VM hosted on the grace-hopper machine,
> while the 'netperf' server runs on the grace-grace machine.
> 
> The VM is started with virtio-net, and vhost has been enabled.
> We observe an error message spew from the VM and then a soft-lockup
> report. The error message indicates that the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. This eventually leads to an endless effort
> to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and a soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered writes to the available ring element and the available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> stronger virtio_mb(false), equivalent to replacing the 'dmb'
> instruction with 'dsb' on ARM64. It should work for other
> architectures, but a performance loss is expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensure the order theoretically. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);
>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);



Something else to try is to disassemble the code and check that the compiler
is not broken.

It also might help to replace the assignment above with WRITE_ONCE -
it has technically always been the right thing to do; it's just a big
change (it has to be done everywhere if done at all), so we never bothered,
and we never hit a compiler that would split or speculate stores ...


> -- 
> 2.44.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > On 3/19/24 02:59, Will Deacon wrote:
> > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > The issue was reported by Yihuang Yu, who has been running 'netperf'
> > > > tests on NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > client is started in a VM hosted on the grace-hopper machine,
> > > > while the 'netperf' server runs on the grace-grace machine.
> > > > 
> > > > The VM is started with virtio-net, and vhost has been enabled.
> > > > We observe an error message spew from the VM and then a soft-lockup
> > > > report. The error message indicates that the data associated with
> > > > the descriptor (index: 135) has been released, and the queue
> > > > is marked as broken. This eventually leads to an endless effort
> > > > to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > and a soft-lockup. The stale index 135 is fetched from the available
> > > > ring and published to the used ring by vhost, meaning we have
> > > > disordered writes to the available ring element and the available index.
> > > > 
> > > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  
> > > > \
> > > >-accel kvm -machine virt,gic-version=host
> > > > \
> > > >   : 
> > > > \
> > > >-netdev tap,id=vnet0,vhost=on
> > > > \
> > > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 
> > > > \
> > > > 
> > > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > 
> > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
> > > > stronger virtio_mb(false), equivalent to replacing the 'dmb'
> > > > instruction with 'dsb' on ARM64. It should work for other
> > > > architectures, but a performance loss is expected.
> > > > 
> > > > Cc: sta...@vger.kernel.org
> > > > Reported-by: Yihuang Yu 
> > > > Signed-off-by: Gavin Shan 
> > > > ---
> > > >   drivers/virtio/virtio_ring.c | 12 +---
> > > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 49299b1f9ec7..7d852811c912 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > virtqueue *_vq,
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > head);
> > > > -   /* Descriptors and available array need to be set before we 
> > > > expose the
> > > > -* new available array entries. */
> > > > -   virtio_wmb(vq->weak_barriers);
> > > > +   /*
> > > > +* Descriptors and available array need to be set before we 
> > > > expose
> > > > +* the new available array entries. virtio_wmb() should be 
> > > > enough
> > > > +* to ensure the order theoretically. However, a stronger 
> > > > barrier
> > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > +* by the host (vhost). A stronger barrier should work for other
> > > > +* architectures, but performance loss is expected.
> > > > +*/
> > > > +   virtio_mb(false);
> > > > vq->split.avail_idx_shadow++;
> > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > 
> > > > vq->split.avail_idx_shadow);
> > > 
> > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > here, especially when ordering accesses to coherent memory.
> > > 
> > > In practice, either the larger timing different from the DSB or the fact
> > > that you're going from a Store->Store barrier to a full barrier is what
> > > makes t

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who ran 'netperf' tests on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe an error message spew from the VM and then a soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordered writes to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensure the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > >   vq->split.avail_idx_shadow++;
> > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   
> > > vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smb_mb()).
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> > 
> 
> Thanks for your comments, Will.
> 
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> at the hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
> 
> I tried the possible replacements below. __smp_mb() can avoid the issue 
> like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> 
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
> :
> /* Put entry in available array (but don't update avail->idx until 
> they
>  * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> 
> /* Descriptors and available array need to be set before we expose the
>  * new available array entries. */
> // Broken: virtio_wmb(vq->weak_barriers);
> // Broken: __dma_mb();
> // Work:   __mb();
> // Work:   __smp_mb();
>

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-18 Thread Michael S. Tsirkin
On Mon, Mar 18, 2024 at 09:41:45AM +1000, Gavin Shan wrote:
> On 3/18/24 02:50, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > > > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper 
> > > > > > > platform. I tried
> > > > > to reproduce it with my own driver where one thread writes to the 
> > > > > shared buffer
> > > > > and another thread reads from the buffer. I don't hit the 
> > > > > out-of-order issue so
> > > > > far.
> > > > 
> > > > Make sure the 2 areas you are accessing are in different cache lines.
> > > > 
> > > 
> > > Yes, I already put those 2 areas to separate cache lines.
> > > 
> > > > 
> > > > > My driver may not be correct somewhere and I will update if I can 
> > > > > reproduce
> > > > > the issue with my driver in the future.
> > > > 
> > > > Then maybe your change is just making virtio slower and masks the bug
> > > > that is actually elsewhere?
> > > > 
> > > > You don't really need a driver. Here's a simple test: without barriers
> > > > assertion will fail. With barriers it will not.
> > > > (Warning: didn't bother testing too much, could be buggy.
> > > > 
> > > > ---
> > > > 
> > > > #include <pthread.h>
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <assert.h>
> > > > 
> > > > #define FIRST values[0]
> > > > #define SECOND values[64]
> > > > 
> > > > volatile int values[100] = {};
> > > > 
> > > > void* writer_thread(void* arg) {
> > > > while (1) {
> > > > FIRST++;
> > > > // NEED smp_wmb here
> > >  __asm__ volatile("dmb ishst" : : : "memory");
> > > > SECOND++;
> > > > }
> > > > }
> > > > 
> > > > void* reader_thread(void* arg) {
> > > >   while (1) {
> > > > int first = FIRST;
> > > > // NEED smp_rmb here
> > >  __asm__ volatile("dmb ishld" : : : "memory");
> > > > int second = SECOND;
> > > > assert(first - second == 1 || first - second == 0);
> > > >   }
> > > > }
> > > > 
> > > > int main() {
> > > >   pthread_t writer, reader;
> > > > 
> > > >   pthread_create(&writer, NULL, writer_thread, NULL);
> > > >   pthread_create(&reader, NULL, reader_thread, NULL);
> > > > 
> > > >   pthread_join(writer, NULL);
> > > >   pthread_join(reader, NULL);
> > > > 
> > > >   return 0;
> > > > }
> > > > 
> > > 
> > > Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> > > the assert on both of them. After replacing 'dmb' with 'dsb', I can
> > > hit assert on both of them too. I need to look at the code closely.
> > > 
> > > [root@virt-mtcollins-02 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - 
> > > second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > [root@nvidia-grace-hopper-05 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - 
> > > second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > Thanks,
> > > Gavin
> > 
> > 
> > Actually this test is broken. No need for ordering it's a simple race.
> > The following works on x86 though (x86 does not need barriers
> > though).
> > 
> > 
> > #include 
> > #include 
> > #include 
> > #include 
> > 
> > #if 0
> > #define x86_rmb()  asm volatile("lfence":::"memory")
> > #define x86_mb()  asm volatile("mfence":::"memory")
> > #define x86_smb()  asm volatile("sfence":::"memory")
> > #else
> > #define x86_rmb()  asm volatile("":::"memory")
> > #define x86_mb()  asm volatile("":::"memory")
> > #define x86_smb()  asm volatile("":::"memory")
> > #endif
> > 

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-17 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> 
> On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper 
> > > > > platform. I tried
> > > to reproduce it with my own driver where one thread writes to the shared 
> > > buffer
> > > and another thread reads from the buffer. I don't hit the out-of-order 
> > > issue so
> > > far.
> > 
> > Make sure the 2 areas you are accessing are in different cache lines.
> > 
> 
> Yes, I already put those 2 areas to separate cache lines.
> 
> > 
> > > My driver may not be correct somewhere and I will update if I can 
> > > reproduce
> > > the issue with my driver in the future.
> > 
> > Then maybe your change is just making virtio slower and masks the bug
> > that is actually elsewhere?
> > 
> > You don't really need a driver. Here's a simple test: without barriers
> > assertion will fail. With barriers it will not.
> > (Warning: didn't bother testing too much, could be buggy.
> > 
> > ---
> > 
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <assert.h>
> > 
> > #define FIRST values[0]
> > #define SECOND values[64]
> > 
> > volatile int values[100] = {};
> > 
> > void* writer_thread(void* arg) {
> > while (1) {
> > FIRST++;
> > // NEED smp_wmb here
> __asm__ volatile("dmb ishst" : : : "memory");
> > SECOND++;
> > }
> > }
> > 
> > void* reader_thread(void* arg) {
> >  while (1) {
> > int first = FIRST;
> > // NEED smp_rmb here
> __asm__ volatile("dmb ishld" : : : "memory");
> > int second = SECOND;
> > assert(first - second == 1 || first - second == 0);
> >  }
> > }
> > 
> > int main() {
> >  pthread_t writer, reader;
> > 
> >  pthread_create(&writer, NULL, writer_thread, NULL);
> >  pthread_create(&reader, NULL, reader_thread, NULL);
> > 
> >  pthread_join(writer, NULL);
> >  pthread_join(reader, NULL);
> > 
> >  return 0;
> > }
> > 
> 
> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> the assert on both of them. After replacing 'dmb' with 'dsb', I can
> hit assert on both of them too. I need to look at the code closely.
> 
> [root@virt-mtcollins-02 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
> 0' failed.
> Aborted (core dumped)
> 
> [root@nvidia-grace-hopper-05 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
> 0' failed.
> Aborted (core dumped)
> 
> Thanks,
> Gavin


Actually this test is broken. No need for ordering it's a simple race.
The following works on x86 though (x86 does not need barriers
though).


#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#if 0
#define x86_rmb()  asm volatile("lfence":::"memory")
#define x86_mb()  asm volatile("mfence":::"memory")
#define x86_smb()  asm volatile("sfence":::"memory")
#else
#define x86_rmb()  asm volatile("":::"memory")
#define x86_mb()  asm volatile("":::"memory")
#define x86_smb()  asm volatile("":::"memory")
#endif

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
while (1) {
/* Now synchronize with reader */
while(FLAG);
FIRST++;
x86_smb();
SECOND++;
x86_smb();
FLAG = 1;
}
}

void* reader_thread(void* arg) {
while (1) {
/* Now synchronize with writer */
while(!FLAG);
x86_rmb();
unsigned first = FIRST;
x86_rmb();
unsigned second = SECOND;
assert(first - second == 1 || first - second == 0);
FLAG = 0;

if (!(first %100))
printf("%d\n", first);
   }
}

int main() {
pthread_t writer, reader;

pthread_create(&writer, NULL, writer_thread, NULL);
pthread_create(&reader, NULL, reader_thread, NULL);

pthread_join(writer, NULL);
pthread_join(reader, NULL);

return 0;
}




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-15 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> 
> + Will, Catalin and Matt from Nvidia
> 
> On 3/14/24 22:59, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> > > On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > > > The issue is reported by Yihuang Yu who ran 'netperf' tests on
> > > > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > > > 
> > > > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > > > We observe an error message spew from the VM and then a soft-lockup
> > > > > > > report. The error message indicates the data associated with
> > > > > > > the descriptor (index: 135) has been released, and the queue
> > > > > > > is marked as broken. It eventually leads to the endless effort
> > > > > > > to fetch free buffer (skb) in 
> > > > > > > drivers/net/virtio_net.c::start_xmit()
> > > > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > > > ring and published to the used ring by vhost, meaning we have
> > > > > > > disordered writes to the available ring element and available index.
> > > > > > > 
> > > > > > >  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  
> > > > > > > \
> > > > > > >  -accel kvm -machine virt,gic-version=host
> > > > > > > \
> > > > > > > : 
> > > > > > > \
> > > > > > >  -netdev tap,id=vnet0,vhost=on
> > > > > > > \
> > > > > > >  -device 
> > > > > > > virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > > > 
> > > > > > >  [   19.993158] virtio_net virtio1: output.0:id 135 is not a 
> > > > > > > head!
> > > > > > > 
> > > > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with 
> > > > > > > stronger
> > > > > > > virtio_mb(false), equivalent to replacing the 'dmb' 
> > > > > > > instruction with 'dsb' on
> > > > > > > ARM64. It should work for other architectures, but performance 
> > > > > > > loss is
> > > > > > > expected.
> > > > > > > 
> > > > > > > Cc: sta...@vger.kernel.org
> > > > > > > Reported-by: Yihuang Yu 
> > > > > > > Signed-off-by: Gavin Shan 
> > > > > > > ---
> > > > > > > drivers/virtio/virtio_ring.c | 12 +---
> > > > > > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > > virtqueue *_vq,
> > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > (vq->split.vring.num - 1);
> > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we 
> > > > > > > expose the
> > > > > > > -  * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-15 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > 
> > Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> > I would like however for some documentation to exist saying that if you
> > do abc then call API xyz. Then I hope we can feel a bit safer that
> > future scheduler changes will not break vhost (though as usual, nothing
> > is for sure).  Right now we are going by the documentation and that says
> > cond_resched so we do that.
> > 
> > -- 
> > MST
> > 
> 
> Here I'd like to add that we have two different problems:
> 
> 1. cond_resched not working as expected
>This appears to me to be a bug in the scheduler where it lets the cgroup, 
>which the vhost is running in, loop endlessly. In EEVDF terms, the cgroup
>is allowed to surpass its own deadline without consequences. One of my RFCs
>mentioned above addresses this issue (not happy yet with the 
> implementation).
>This issue only appears in that specific scenario, so it's not a general 
>issue, rather a corner case.
>But, this fix will still allow the vhost to reach its deadline, which is
>one full time slice. This brings down the max delays from 300+ms to 
> whatever
>the timeslice is. This is not enough to fix the regression.
> 
> 2. vhost relying on kworker being scheduled on wake up
>This is the bigger issue for the regression. There are rare cases, where
>the vhost runs only for a very short amount of time before it wakes up 
>the kworker. Simultaneously, the kworker takes longer than usual to 
>complete its work and takes longer than the vhost did before. We
>are talking 4digit to low 5digit nanosecond values.
>With those two being the only tasks on the CPU, the scheduler now assumes
>that the kworker wants to unfairly consume more than the vhost and denies
>it being scheduled on wakeup.
>In the regular cases, the kworker is faster than the vhost, so the 
>scheduler assumes that the kworker needs help, which benefits the
>scenario we are looking at.
>In the bad case, this means unfortunately, that cond_resched cannot work
>as well as before, for this particular case!
>So, let's assume that problem 1 from above is fixed. It will take one 
>full time slice to get the need_resched flag set by the scheduler
>because vhost surpasses its deadline. Before, the scheduler cannot know
>that the kworker should actually run. The kworker itself is unable
>to communicate that by itself since it's not getting scheduled and there 
>is no external entity that could intervene.
>Hence my argumentation that cond_resched still works as expected. The
>crucial part is that the wake up behavior has changed which is why I'm 
>a bit reluctant to propose a documentation change on cond_resched.
>I could see proposing a doc change, that cond_resched should not be
>used if a task heavily relies on a woken up task being scheduled.

Could you remind me pls, what is the kworker doing specifically that
vhost is relying on?

-- 
MST




Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 12:46:54PM +0100, Tobias Huschle wrote:
> On Tue, Mar 12, 2024 at 09:45:57AM +, Luis Machado wrote:
> > On 3/11/24 17:05, Michael S. Tsirkin wrote:
> > > 
> > > Are we going anywhere with this btw?
> > > 
> > >
> > 
> > I think Tobias had a couple other threads related to this, with other 
> > potential fixes:
> > 
> > https://lore.kernel.org/lkml/20240228161018.14253-1-husc...@linux.ibm.com/
> > 
> > https://lore.kernel.org/lkml/20240228161023.14310-1-husc...@linux.ibm.com/
> > 
> 
> Sorry, Michael, should have provided those threads here as well.
> 
> The more I look into this issue, the more things to ponder upon I find.
> It seems like this issue can (maybe) be fixed on the scheduler side after all.
> 
> The root cause of this regression remains that the mentioned kworker gets
> a negative lag value and is therefore not eligible to run on wake up.
> This negative lag is potentially assigned incorrectly. But I'm not sure yet.
> 
> Anytime I find something that can address the symptom, there is a potential
> root cause on another level, and I would like to avoid to just address a
> symptom to fix the issue, whereas it would be better to find the actual
> root cause.
> 
> I would nevertheless still argue that vhost relies rather heavily on the fact
> that the kworker gets scheduled on wake up every time. But I don't have a 
> proposal at hand that accounts for potential side effects if opting for
> explicitly initiating a schedule.
> Maybe the assumption that said kworker should always be selected on wake 
> up is valid. In that case the explicit schedule would merely be a safety 
> net.
> 
> I will let you know if something comes up on the scheduler side. There are
> some more ideas on my side how this could be approached.

Thanks a lot! To clarify it is not that I am opposed to changing vhost.
I would like however for some documentation to exist saying that if you
do abc then call API xyz. Then I hope we can feel a bit safer that
future scheduler changes will not break vhost (though as usual, nothing
is for sure).  Right now we are going by the documentation and that says
cond_resched so we do that.

-- 
MST




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > The issue is reported by Yihuang Yu who ran 'netperf' tests on
> > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > 
> > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > We observe an error message spew from the VM and then a soft-lockup
> > > > > report. The error message indicates the data associated with
> > > > > the descriptor (index: 135) has been released, and the queue
> > > > > is marked as broken. It eventually leads to the endless effort
> > > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > ring and published to the used ring by vhost, meaning we have
> > > > > disordered writes to the available ring element and available index.
> > > > > 
> > > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64   
> > > > >\
> > > > > -accel kvm -machine virt,gic-version=host 
> > > > >\
> > > > >:  
> > > > >\
> > > > > -netdev tap,id=vnet0,vhost=on 
> > > > >\
> > > > > -device 
> > > > > virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > 
> > > > > [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > > 
> > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > > > ARM64. It should work for other architectures, but performance loss is
> > > > > expected.
> > > > > 
> > > > > Cc: sta...@vger.kernel.org
> > > > > Reported-by: Yihuang Yu 
> > > > > Signed-off-by: Gavin Shan 
> > > > > ---
> > > > >drivers/virtio/virtio_ring.c | 12 +---
> > > > >1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensure the order theoretically. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > 
> > > > 
> > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > enough besides "it does not work"?
> > > > 
> > > 
> > > The change is replacing instructi

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who ran 'netperf' tests on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe an error message spew from the VM and then a soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordered writes to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensure the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > 
> > 
> > I don't get what is going on here. Any explanation why virtio_wmb is not
> > enough besides "it does not work"?
> > 
> 
> The change is replacing the instruction "dmb" with "dsb". "dsb" is a stronger 
> barrier
> than "dmb" because "dsb" ensures that all memory accesses issued before this
> instruction are completed when the 'dsb' instruction completes. However, "dmb"
> doesn't guarantee the order of completion of the memory accesses.
>
> So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, 
> vq->split.avail_idx_shadow)'
> can be completed before 'vq->split.vring.avail->ring[avail] = 
> cpu_to_virtio16(_vq->vdev, head)'.

Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if CPU observes a new value of idx we want it to see
new value in ring.
This is standard use of smp_wmb()
How are these 2 writes different?

What DMB does, it seems, is ensure that the effects
of 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, 
vq->split.avail_idx_shadow)'
are observed after the effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-14 Thread Michael S. Tsirkin
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu who ran 'netperf' tests on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
> 
> The VM is started with virtio-net and vhost has been enabled.
> We observe an error message spew from the VM and then a soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered writes to the available ring element and available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> ARM64. It should work for other architectures, but performance loss is
> expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensure the ordering theoretically. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);


I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?

>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);
> -- 
> 2.44.0




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-11 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

Are we going anywhere with this btw?


> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).
> For patch 2, I would prefer to do some more testing first if this might have
> a negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-01 Thread Michael S. Tsirkin
On Fri, Mar 01, 2024 at 11:45:52AM +, wangyunjian wrote:
> > -Original Message-
> > From: Paolo Abeni [mailto:pab...@redhat.com]
> > Sent: Thursday, February 29, 2024 7:13 PM
> > To: wangyunjian ; m...@redhat.com;
> > willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> > bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> > jonathan.le...@gmail.com; da...@davemloft.net
> > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> > 
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> > 
> > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > >   }
> > >  }
> > >
> > > +static void tun_peek_xsk(struct tun_file *tfile)
> > > +{
> > > + struct xsk_buff_pool *pool;
> > > + u32 i, batch, budget;
> > > + void *frame;
> > > +
> > > + if (!ptr_ring_empty(&tfile->tx_ring))
> > > + return;
> > > +
> > > + spin_lock(&tfile->pool_lock);
> > > + pool = tfile->xsk_pool;
> > > + if (!pool) {
> > > + spin_unlock(&tfile->pool_lock);
> > > + return;
> > > + }
> > > +
> > > + if (tfile->nb_descs) {
> > > + xsk_tx_completed(pool, tfile->nb_descs);
> > > + if (xsk_uses_need_wakeup(pool))
> > > + xsk_set_tx_need_wakeup(pool);
> > > + }
> > > +
> > > + spin_lock(&tfile->tx_ring.producer_lock);
> > > + budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > +
> > > + batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > + if (!batch) {
> > 
> > This branch looks like an unneeded "optimization". The generic loop below
> > should have the same effect with no measurable perf delta - and smaller 
> > code.
> > Just remove this.
> > 
> > > + tfile->nb_descs = 0;
> > > + spin_unlock(&tfile->tx_ring.producer_lock);
> > > + spin_unlock(&tfile->pool_lock);
> > > + return;
> > > + }
> > > +
> > > + tfile->nb_descs = batch;
> > > + for (i = 0; i < batch; i++) {
> > > + /* Encode the XDP DESC flag into the lowest bit for the consumer
> > > +  * to distinguish an XDP desc from an XDP buffer and sk_buff.
> > > +  */
> > > + frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > + /* The budget must be less than or equal to tx_ring.size,
> > > +  * so enqueuing will not fail.
> > > +  */
> > > + __ptr_ring_produce(&tfile->tx_ring, frame);
> > > + }
> > > + spin_unlock(&tfile->tx_ring.producer_lock);
> > > + spin_unlock(&tfile->pool_lock);
> > 
> > More related to the general design: it looks wrong. What if
> > get_rx_bufs() fails (ENOBUF) after a successful peek? With no more
> > incoming packets, a later peek will return 0, and it looks like the
> > half-processed packets will stay in the ring forever?
> > 
> > I think the 'ring produce' part should be moved into tun_do_read().
> 
> Currently, vhost-net obtains a batch of descriptors/sk_buffs from the
> ptr_ring, enqueues the batch to the virtqueue's queue, and then consumes
> the descriptors/sk_buffs from the virtqueue's queue in sequence. As a
> result, TUN does not know whether the batch descriptors have been used
> up, and thus does not know when to return them.
> 
> So, I think it's reasonable that when vhost-net finds the ptr_ring empty,
> it calls peek_len to fetch new xsk descs and return the used descriptors.
> 
> Thanks

What you need to think about is that if you peek, another call
in parallel can get the same value at the same time.


> > 
> > Cheers,
> > 
> > Paolo
> 




Re: [PATCH net-next v2 0/3] tun: AF_XDP Tx zero-copy support

2024-02-28 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 07:04:41PM +0800, Yunjian Wang wrote:
> Hi all:
> 
> Now, some drivers support the zero-copy feature of AF_XDP sockets,
> which can significantly reduce CPU utilization for XDP programs.
> 
> This patch set allows TUN to also support the AF_XDP Tx zero-copy
> feature. It is based on Linux 6.8.0+ (openEuler 23.09) and has
> successfully passed Netperf and Netserver stress testing with
> multiple streams between VM A and VM B, using AF_XDP and OVS.
> 
> The performance testing was performed on an Intel E5-2620 2.40GHz
> machine. Traffic was generated and sent through TUN (testpmd txonly
> with AF_XDP) to a VM (testpmd rxonly in the guest).
> 
> +--+-+-+-+
> | UDP size |   copy  |zero-copy| speedup |
> | (bytes)  |   Mpps  |   Mpps  |    %    |
> +--+-+-+-+
> | 64       |   2.5   |   4.0   |   60%   |
> | 512      |   2.1   |   3.6   |   71%   |
> | 1024     |   1.9   |   3.3   |   73%   |
> +--+-+-+-+
> 
> Yunjian Wang (3):
>   xsk: Remove non-zero 'dma_page' check in xp_assign_dev
>   vhost_net: Call peek_len when using xdp
>   tun: AF_XDP Tx zero-copy support


threading broken pls repost.

vhost bits look ok though:

Acked-by: Michael S. Tsirkin 


>  drivers/net/tun.c   | 177 ++--
>  drivers/vhost/net.c |  21 +++--
>  include/linux/if_tun.h  |  32 
>  net/xdp/xsk_buff_pool.c |   7 --
>  4 files changed, 220 insertions(+), 17 deletions(-)
> 
> -- 
> 2.41.0




Re: [PATCH] vduse: Fix off by one in vduse_dev_mmap()

2024-02-27 Thread Michael S. Tsirkin
On Tue, Feb 27, 2024 at 06:21:46PM +0300, Dan Carpenter wrote:
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
> 
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 


Oh wow and does this not come from userspace? If yes we
need the speculation magic macro when using the index, do we not?

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..9150c8281953 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,7 +1532,7 @@ static int vduse_dev_mmap(struct file *file, struct vm_area_struct *vma)
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
>  
> - if (index > dev->vq_num)
> + if (index >= dev->vq_num)
>   return -EINVAL;
>  
>   vq = dev->vqs[index];
> -- 
> 2.43.0




Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-25 Thread Michael S. Tsirkin
On Fri, Feb 23, 2024 at 10:42:37AM +0100, Miklos Szeredi wrote:
> On Wed, 3 Jan 2024 at 11:58, Hou Tao  wrote:
> >
> > From: Hou Tao 
> >
> > When trying to insert a 10MB kernel module kept in a virtiofs with cache
> > disabled, the following warning was reported:
> >
> >   [ cut here ]
> >   WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
> >   Modules linked in:
> >   CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
> >   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
> >   RIP: 0010:__alloc_pages+0x2c4/0x360
> >   ..
> >   Call Trace:
> >
> >? __warn+0x8f/0x150
> >? __alloc_pages+0x2c4/0x360
> >__kmalloc_large_node+0x86/0x160
> >__kmalloc+0xcd/0x140
> >virtio_fs_enqueue_req+0x240/0x6d0
> >virtio_fs_wake_pending_and_unlock+0x7f/0x190
> >queue_request_and_unlock+0x58/0x70
> >fuse_simple_request+0x18b/0x2e0
> >fuse_direct_io+0x58a/0x850
> >fuse_file_read_iter+0xdb/0x130
> >__kernel_read+0xf3/0x260
> >kernel_read+0x45/0x60
> >kernel_read_file+0x1ad/0x2b0
> >init_module_from_file+0x6a/0xe0
> >idempotent_init_module+0x179/0x230
> >__x64_sys_finit_module+0x5d/0xb0
> >do_syscall_64+0x36/0xb0
> >entry_SYSCALL_64_after_hwframe+0x6e/0x76
> >..
> >
> >   ---[ end trace  ]---
> >
> > The warning happened as follows. In copy_args_to_argbuf(), virtiofs uses
> > kmalloc-ed memory as a bounce buffer for fuse args, but
> 
> So this seems to be the special case in fuse_get_user_pages() when the
> read/write requests get a piece of kernel memory.
> 
> I don't really understand the comment in virtio_fs_enqueue_req():  /*
> Use a bounce buffer since stack args cannot be mapped */
> 
> Stefan, can you explain?  What's special about the arg being on the stack?

virtio core wants DMA'able addresses.

See Documentation/core-api/dma-api-howto.rst :

...


This rule also means that you may use neither kernel image addresses
(items in data/text/bss segments), nor module image addresses, nor
stack addresses for DMA.



> What if the arg is not on the stack (as is probably the case for big
> args like this)?   Do we need the bounce buffer in that case?
> 
> Thanks,
> Miklos




Re: [PATCH -next] VDUSE: fix another doc underline warning

2024-02-23 Thread Michael S. Tsirkin
On Thu, Feb 22, 2024 at 10:23:41PM -0800, Randy Dunlap wrote:
> Extend the underline for a heading to prevent a documentation
> build warning. Also spell "reconnection" correctly.
> 
> Documentation/userspace-api/vduse.rst:236: WARNING: Title underline too short.
> HOW VDUSE devices reconnectoin works
> 
> 
> Fixes: 2b3fd606c662 ("Documentation: Add reconnect process for VDUSE")
> Signed-off-by: Randy Dunlap 
> Cc: Cindy Lu 
> Cc: Michael S. Tsirkin 
> Cc: Jason Wang 
> Cc: Xuan Zhuo 
> Cc: virtualizat...@lists.linux.dev
> Cc: Jonathan Corbet 

Thanks, I fixed this in my tree already.

> ---
>  Documentation/userspace-api/vduse.rst |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff -- a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -232,8 +232,8 @@ able to start the dataplane processing a
>  
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
>  
> -HOW VDUSE devices reconnectoin works
> -
> +HOW VDUSE devices reconnection works
> +
>  0. Userspace APP checks if the device /dev/vduse/vduse_name exists.
> If it does not exist, need to create the instance.goto step 1
> If it does exist, it means this is a reconnect and goto step 3.




  1   2   3   4   5   6   7   8   9   10   >