Re: [PATCH 1/4] virtio_net: notify MAC address change on device initialization

2023-01-23 Thread Laurent Vivier

On 1/24/23 04:31, Jakub Kicinski wrote:

On Sun, 22 Jan 2023 15:47:08 +0200 Eli Cohen wrote:

@@ -3956,6 +3958,18 @@ static int virtnet_probe(struct virtio_device *vdev)
pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
 dev->name, max_queue_pairs);
   
+	/* a random MAC address has been assigned, notify the device */
+   if (dev->addr_assign_type == NET_ADDR_RANDOM &&
Maybe it's better to not count on addr_assign_type and use a local
variable to indicate that virtnet_probe assigned a random MAC.


+1, FWIW

v2 sent, but I rely on virtio_has_feature(vdev, VIRTIO_NET_F_MAC) to know if the MAC 
address is provided by the device or not:


https://lore.kernel.org/lkml/20230123120022.2364889-2-lviv...@redhat.com/T/#me9211516e12771001e0346818255c9fb48a2bf4a

Thanks,
Laurent

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 02/19] drivers/vhost: Convert to use vm_account

2023-01-23 Thread Michael S. Tsirkin
On Tue, Jan 24, 2023 at 04:42:31PM +1100, Alistair Popple wrote:
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ec32f78..a31dd53 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c

...

> @@ -780,6 +780,10 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
>   u32 asid = iotlb_to_asid(iotlb);
>   int r = 0;
>  
> + if (!vdpa->use_va)
> + if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
> + return -ENOMEM;
> +
>   r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
> pa, perm, opaque);
>   if (r)

I suspect some error handling will have to be reworked then, no?

> -- 
> git-series 0.9.1



Re: [RFC PATCH 01/19] mm: Introduce vm_account

2023-01-23 Thread Christoph Hellwig
> +/**
> + * vm_account_init - Initialise a new struct vm_account.
> + * @vm_account: pointer to uninitialised vm_account.
> + * @task: task to charge against.
> + * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
> + * @flags: flags to use when charging to vm_account.
> + *
> + * Initialise a new uninitialised struct vm_account. Takes references
> + * on the task/mm/user/cgroup as required although callers must ensure
> + * any references passed in remain valid for the duration of this
> + * call.
> + */
> +void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
> + struct user_struct *user, enum vm_account_flags flags);


kerneldoc comments are supposed to be next to the implementation, and
not the declaration in the header.



[PATCH v2 0/1] virtio_net: vdpa: update MAC address when it is generated by virtio-net

2023-01-23 Thread Laurent Vivier
When the MAC address is not provided by the vdpa device, the
virtio_net driver assigns a random one without notifying the device.
The consequence, in the case of mlx5_vdpa, is that the internal
routing tables of the device are not updated, which can block
communication between two namespaces.

To fix this problem, use virtnet_send_command(VIRTIO_NET_CTRL_MAC)
to set the address from virtnet_probe() when the MAC address is
not provided by the device.

v2:
  - remove vdpa_sim related fixes
  - check virtio_has_feature(vdev, VIRTIO_NET_F_MAC) rather than
addr_assign_type

Laurent Vivier (1):
  virtio_net: notify MAC address change on device initialization

 drivers/net/virtio_net.c | 14 ++
 1 file changed, 14 insertions(+)

-- 
2.39.0



[PATCH v2 1/1] virtio_net: notify MAC address change on device initialization

2023-01-23 Thread Laurent Vivier
In virtnet_probe(), if the device doesn't provide a MAC address the
driver assigns a random one.
As we modify the MAC address we need to notify the device to allow it
to update all the related information.

The problem can be seen with vDPA and mlx5_vdpa driver as it doesn't
assign a MAC address by default. The virtio_net device uses a random
MAC address (we can see it with "ip link"), but we can't ping a net
namespace from another one using the virtio-vdpa device because the
new MAC address has not been provided to the hardware.

Signed-off-by: Laurent Vivier 
---
 drivers/net/virtio_net.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7723b2a49d8e..4bdc8286678b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3800,6 +3800,8 @@ static int virtnet_probe(struct virtio_device *vdev)
eth_hw_addr_set(dev, addr);
} else {
eth_hw_addr_random(dev);
+   dev_info(&vdev->dev, "Assigned random MAC address %pM\n",
+dev->dev_addr);
}
 
/* Set up our device-specific information */
@@ -3956,6 +3958,18 @@ static int virtnet_probe(struct virtio_device *vdev)
pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
 dev->name, max_queue_pairs);
 
+   /* a random MAC address has been assigned, notify the device */
+   if (!virtio_has_feature(vdev, VIRTIO_NET_F_MAC) &&
+   virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
+   struct scatterlist sg;
+
+   sg_init_one(&sg, dev->dev_addr, dev->addr_len);
+   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
+ VIRTIO_NET_CTRL_MAC_ADDR_SET, &sg)) {
+   dev_warn(&vdev->dev, "Failed to update MAC address.\n");
+   }
+   }
+
return 0;
 
 free_unregister_netdev:
-- 
2.39.0



Re: [PATCH 0/4] virtio_net: vdpa: update MAC address when it is generated by virtio-net

2023-01-23 Thread Laurent Vivier

On 1/22/23 11:23, Michael S. Tsirkin wrote:

On Sun, Jan 22, 2023 at 11:05:22AM +0100, Laurent Vivier wrote:

When the MAC address is not provided by the vdpa device, the
virtio_net driver assigns a random one without notifying the device.
The consequence, in the case of mlx5_vdpa, is that the internal
routing tables of the device are not updated, which can block
communication between two namespaces.

To fix this problem, use virtnet_send_command(VIRTIO_NET_CTRL_MAC)
to set the address from virtnet_probe() when the MAC address is
randomly assigned from virtio_net.

While I was testing this change I found 3 other bugs in vdpa_sim_net:

- vdpa_sim_net sets the VIRTIO_NET_F_MAC even if no MAC address is
   provided. So virtio_net doesn't generate a random MAC address and
   the MAC address appears to be 00:00:00:00:00:00

- vdpa_sim_net never processes the command and virtnet_send_command()
   hangs in an infinite loop. To avoid a kernel crash add a timeout
   in the loop.

- To allow vdpa_sim_net to process the command, replace the cpu_relax()
   in the loop by a schedule(). vdpa_sim_net uses a workqueue to process
   the queue, and if we don't allow the kernel to schedule, the queue
   is not processed and the loop is infinite.


I'd split these things out as opposed to a series unless there's
a dependency I missed.


We needed to fix virtio_net before fixing vdpa_sim_net; otherwise
virtnet_send_command() hangs when we define the vdpa device with "vdpa
dev" but without a MAC address.



All this reminds me of
https://lore.kernel.org/r/20221226074908.8154-5-jasowang%40redhat.com

how is this patch different/better?
Pls also CC people involved in that original discussion.


I was not aware of Jason's series.

It seems to address the problem better, except that it triggers the
ASSERT_RTNL() in virtnet_send_command() when it is called from virtnet_probe().


I will remove patches 2 and 4 from my series.
PATCH 3 can be sent on independently too.

Thanks,
Laurent



Re: [RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

2023-01-23 Thread David Hildenbrand


1. This will be a native userspace daemon running only in the Linux VM,
which will use the virtio-mem driver (which relies on memory hotplug)
to add/remove memory. The VM (aka Secondary VM, SVM) will request
memory from the host, which is the Primary VM (PVM), via the backend
hypervisor that takes care of cross-VM communication.

2. This will be a guest driver. The daemon will use the PSI mechanism to
monitor memory pressure and keep track of memory demands in the system.
It will register for a few memory pressure events and make an educated
guess on when demand for memory in the system is increasing.


Is that running in the primary or the secondary VM?


The userspace PSI daemon will be running on the secondary VM. It will
talk to a kernel driver (running on the secondary VM itself) via ioctl.
That kernel driver will talk to a slightly modified version of the
virtio-mem driver which can call the
virtio_mem_config_changed(virtiomem_device) function to resize the
secondary VM. So it's mainly "guest driven" now.


Okay, thanks.

[...]



This daemon is currently just in beta stage and we have basic
functionality running. We are yet to add more flesh to this scheme to


Good to hear that the basics are running with virtio-mem (I assume :) ).


make sure any potential risks or security concerns are taken care of as
well.


It would be great to draw/explain the architecture in more detail.


We will be looking into solving any potential security concerns, where
the hypervisor would restrict a few of the memory-resizing actions.
Right now, we are experimenting to see if the PSI mechanism itself can
be used to detect memory pressure in the system and add memory to the
secondary VM when memory is needed, taking into account all the
latencies involved in the PSI scheme (i.e. the time from when one does
a malloc call until the extra memory gets added to the SVM). We wanted
upstream's opinion on such a scheme that uses the PSI mechanism for
detecting memory pressure and resizing the SVM accordingly.


One problematic thing is that adding memory to Linux by virtio-mem 
eventually consumes memory itself (e.g., for the memmap), especially 
when having to add a completely new memory block to Linux.


So if you're already under severe memory pressure, the allocations to 
bring up new memory can fail. The question is whether PSI can notify 
"early" enough that this barely happens in practice.


There are some possible ways to mitigate:

1) Always keep spare memory blocks by virtio-mem added to Linux, that
   don't expose any memory yet. Memory from these block can be handed
   over to Linux without additional Linux allocations. Of course, they
   consume metadata, so one might want to limit them.

2) Implement memmap_on_memory support for virtio-mem. This might help in
   some setups, where the device block size is suitable.

Did you run into that scenario already during your experiments, and how 
did you deal with that?


--
Thanks,

David / dhildenb



Re: [PATCH 1/4] virtio_net: notify MAC address change on device initialization

2023-01-23 Thread Laurent Vivier

On 1/22/23 14:47, Eli Cohen wrote:


On 22/01/2023 12:05, Laurent Vivier wrote:

In virtnet_probe(), if the device doesn't provide a MAC address the
driver assigns a random one.
As we modify the MAC address we need to notify the device to allow it
to update all the related information.

The problem can be seen with vDPA and mlx5_vdpa driver as it doesn't
assign a MAC address by default. The virtio_net device uses a random
MAC address (we can see it with "ip link"), but we can't ping a net
namespace from another one using the virtio-vdpa device because the
new MAC address has not been provided to the hardware.

Signed-off-by: Laurent Vivier 
---
  drivers/net/virtio_net.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7723b2a49d8e..25511a86590e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3800,6 +3800,8 @@ static int virtnet_probe(struct virtio_device *vdev)
  eth_hw_addr_set(dev, addr);
  } else {
  eth_hw_addr_random(dev);
+    dev_info(&vdev->dev, "Assigned random MAC address %pM\n",
+ dev->dev_addr);
  }
  /* Set up our device-specific information */
@@ -3956,6 +3958,18 @@ static int virtnet_probe(struct virtio_device *vdev)
  pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
   dev->name, max_queue_pairs);
+    /* a random MAC address has been assigned, notify the device */
+    if (dev->addr_assign_type == NET_ADDR_RANDOM &&

Maybe it's better to not count on addr_assign_type and use a local
variable to indicate that virtnet_probe assigned a random MAC. The
reason is that the hardware driver might have done that as well and
does not need notification.


eth_hw_addr_random() explicitly sets NET_ADDR_RANDOM, while
eth_hw_addr_set() doesn't change addr_assign_type, so it doesn't seem
this value is set by the hardware driver.

So I guess it's the default value (NET_ADDR_PERM) in this case (even if
it's a random address from the point of view of the hardware).


If you prefer, I can replace it with "!virtio_has_feature(vdev, VIRTIO_NET_F_MAC)"?

Thanks,
Laurent


Re: [PATCH 1/8] iommu: Add a gfp parameter to iommu_map()

2023-01-23 Thread Joerg Roedel
On Fri, Jan 20, 2023 at 01:53:40PM -0400, Jason Gunthorpe wrote:
> > Well, having GFP parameters is not a strict kernel convention. There are
> > places doing it differently and have sleeping and atomic variants of
> > APIs. I have to say I like the latter more. But given that this leads to
> > an invasion of API functions here which all do the same under the hood, I
> > agree it is better to go with a GFP parameter here.
> 
> Ok, I think we are done with this series, I'll stick it in linux-next
> for a bit and send you a PR so the trees stay in sync

This series mostly touches parts outside of IOMMUFD, so we should follow
the process here and let this reach linux-next via the IOMMU tree.
Please send me a new version and I will put it into a separate branch
where you can pull it from.

Regards,

Joerg


[linux-next:master] BUILD REGRESSION 691781f561e9868a94c3ed7daf4adad7f8af5d16

2023-01-23 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: 691781f561e9868a94c3ed7daf4adad7f8af5d16  Add linux-next specific files for 20230123

Error/Warning: (recently discovered and may have been fixed)

ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/fsl-edma.ko] 
undefined!
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/idma64.ko] 
undefined!
drivers/gpio/gpio-zevio.c:174:40: error: invalid use of undefined type 'struct 
platform_device'
drivers/gpio/gpio-zevio.c:178:9: error: implicit declaration of function 
'platform_set_drvdata' [-Werror=implicit-function-declaration]
drivers/gpio/gpio-zevio.c:184:28: error: implicit declaration of function 
'devm_platform_ioremap_resource'; did you mean 'devm_ioremap_resource'? 
[-Werror=implicit-function-declaration]
drivers/gpio/gpio-zevio.c:211:15: error: variable 'zevio_gpio_driver' has 
initializer but incomplete type
drivers/gpio/gpio-zevio.c:211:31: error: storage size of 'zevio_gpio_driver' 
isn't known
drivers/gpio/gpio-zevio.c:212:10: error: 'struct platform_driver' has no member 
named 'driver'
drivers/gpio/gpio-zevio.c:212:27: error: extra brace group at end of initializer
drivers/gpio/gpio-zevio.c:217:10: error: 'struct platform_driver' has no member 
named 'probe'
drivers/gpio/gpio-zevio.c:219:1: error: type defaults to 'int' in declaration 
of 'builtin_platform_driver' [-Werror=implicit-int]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_dp_training.c:1585:38: 
warning: variable 'result' set but not used [-Wunused-but-set-variable]

Unverified Error/Warning (likely false positive, please contact us if 
interested):

drivers/block/virtio_blk.c:721:9: sparse:bad type *
drivers/block/virtio_blk.c:721:9: sparse:unsigned int *
drivers/block/virtio_blk.c:721:9: sparse: sparse: incompatible types in 
comparison expression (different base types):
drivers/block/virtio_blk.c:721:9: sparse: sparse: no generic selection for 
'restricted __le32 [addressable] virtio_cread_v'
drivers/block/virtio_blk.c:721:9: sparse: sparse: no generic selection for 
'restricted __le32 virtio_cread_v'
drivers/media/i2c/max9286.c:771 max9286_s_stream() error: buffer overflow 
'priv->fmt' 4 <= 32
drivers/nvmem/imx-ocotp.c:599:21: sparse: sparse: symbol 'imx_ocotp_layout' was 
not declared. Should it be static?
mm/hugetlb.c:3100 alloc_hugetlb_folio() error: uninitialized symbol 'h_cg'.
net/devlink/leftover.c:7160 devlink_fmsg_prepare_skb() error: uninitialized 
symbol 'err'.
sound/ac97/bus.c:465:1: sparse: sparse: symbol 'dev_attr_vendor_id' was not 
declared. Should it be static?

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- arc-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- arc-randconfig-m031-20230123
|   `-- 
drivers-media-i2c-max9286.c-max9286_s_stream()-error:buffer-overflow-priv-fmt
|-- arm-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- arm-buildonly-randconfig-r005-20230123
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- arm-randconfig-r023-20230123
|   |-- drivers-gpio-gpio-zevio.c:error:extra-brace-group-at-end-of-initializer
|   |-- 
drivers-gpio-gpio-zevio.c:error:implicit-declaration-of-function-devm_platform_ioremap_resource
|   |-- 
drivers-gpio-gpio-zevio.c:error:implicit-declaration-of-function-platform_set_drvdata
|   |-- 
drivers-gpio-gpio-zevio.c:error:invalid-use-of-undefined-type-struct-platform_device
|   |-- 
drivers-gpio-gpio-zevio.c:error:storage-size-of-zevio_gpio_driver-isn-t-known
|   |-- 
drivers-gpio-gpio-zevio.c:error:struct-platform_driver-has-no-member-named-driver
|   |-- 
drivers-gpio-gpio-zevio.c:error:struct-platform_driver-has-no-member-named-probe
|   |-- 
drivers-gpio-gpio-zevio.c:error:type-defaults-to-int-in-declaration-of-builtin_platform_driver
|   `-- 
drivers-gpio-gpio-zevio.c:error:variable-zevio_gpio_driver-has-initializer-but-incomplete-type
|-- arm64-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- csky-randconfig-s033-20230123
|   |-- 
drivers-nvmem-imx-ocotp.c:sparse:sparse:symbol-imx_ocotp_layout-was-not-declared.-Should-it-be-static
|   `-- 
sound-ac97-bus.c:sparse:sparse:symbol-dev_attr_vendor_id-was-not-declared.-Should-it-be-static
|-- i386-allyesconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- ia64-allmodconfig
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-link_dp_training.c:warning:variable-result-set-but-not-used
|-- ia64-allyesconfig
|   `-- 
drivers-gpu-dr

[PATCH 1/6] x86: Always inline arch_atomic64

2023-01-23 Thread Peter Zijlstra
As already done for regular arch_atomic, always inline arch_atomic64.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/atomic64_32.h |   44 ++---
 arch/x86/include/asm/atomic64_64.h |   36 +++---
 2 files changed, 40 insertions(+), 40 deletions(-)

--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -71,7 +71,7 @@ ATOMIC64_DECL(add_unless);
  * the old value.
  */
 
-static inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
+static __always_inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
 {
	return arch_cmpxchg64(&v->counter, o, n);
 }
@@ -85,7 +85,7 @@ static inline s64 arch_atomic64_cmpxchg(
  * Atomically xchgs the value of @v to @n and returns
  * the old value.
  */
-static inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
+static __always_inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
 {
s64 o;
unsigned high = (unsigned)(n >> 32);
@@ -104,7 +104,7 @@ static inline s64 arch_atomic64_xchg(ato
  *
  * Atomically sets the value of @v to @n.
  */
-static inline void arch_atomic64_set(atomic64_t *v, s64 i)
+static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
unsigned high = (unsigned)(i >> 32);
unsigned low = (unsigned)i;
@@ -119,7 +119,7 @@ static inline void arch_atomic64_set(ato
  *
  * Atomically reads the value of @v and returns it.
  */
-static inline s64 arch_atomic64_read(const atomic64_t *v)
+static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
 {
s64 r;
	alternative_atomic64(read, "=&A" (r), "c" (v) : "memory");
@@ -133,7 +133,7 @@ static inline s64 arch_atomic64_read(con
  *
  * Atomically adds @i to @v and returns @i + *@v
  */
-static inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 {
alternative_atomic64(add_return,
 ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -145,7 +145,7 @@ static inline s64 arch_atomic64_add_retu
 /*
  * Other variants with different arithmetic operators:
  */
-static inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 {
alternative_atomic64(sub_return,
 ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -154,7 +154,7 @@ static inline s64 arch_atomic64_sub_retu
 }
 #define arch_atomic64_sub_return arch_atomic64_sub_return
 
-static inline s64 arch_atomic64_inc_return(atomic64_t *v)
+static __always_inline s64 arch_atomic64_inc_return(atomic64_t *v)
 {
s64 a;
	alternative_atomic64(inc_return, "=&A" (a),
@@ -163,7 +163,7 @@ static inline s64 arch_atomic64_inc_retu
 }
 #define arch_atomic64_inc_return arch_atomic64_inc_return
 
-static inline s64 arch_atomic64_dec_return(atomic64_t *v)
+static __always_inline s64 arch_atomic64_dec_return(atomic64_t *v)
 {
s64 a;
	alternative_atomic64(dec_return, "=&A" (a),
@@ -179,7 +179,7 @@ static inline s64 arch_atomic64_dec_retu
  *
  * Atomically adds @i to @v.
  */
-static inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
 {
__alternative_atomic64(add, add_return,
   ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -194,7 +194,7 @@ static inline s64 arch_atomic64_add(s64
  *
  * Atomically subtracts @i from @v.
  */
-static inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
 {
__alternative_atomic64(sub, sub_return,
   ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -208,7 +208,7 @@ static inline s64 arch_atomic64_sub(s64
  *
  * Atomically increments @v by 1.
  */
-static inline void arch_atomic64_inc(atomic64_t *v)
+static __always_inline void arch_atomic64_inc(atomic64_t *v)
 {
__alternative_atomic64(inc, inc_return, /* no output */,
   "S" (v) : "memory", "eax", "ecx", "edx");
@@ -221,7 +221,7 @@ static inline void arch_atomic64_inc(ato
  *
  * Atomically decrements @v by 1.
  */
-static inline void arch_atomic64_dec(atomic64_t *v)
+static __always_inline void arch_atomic64_dec(atomic64_t *v)
 {
__alternative_atomic64(dec, dec_return, /* no output */,
   "S" (v) : "memory", "eax", "ecx", "edx");
@@ -237,7 +237,7 @@ static inline void arch_atomic64_dec(ato
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns non-zero if the add was done, zero otherwise.
  */
-static inline int arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
+static __always_inline int arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
 {
unsigned low = (unsigned)u;
unsigned high = (unsigned)(u >> 32);
@@ -248,7 +248,7 @@ static inline int arch_atomic64_add_unle
 }
#define arch_atomic64_add_unless arch_atomic64_add_unless

[PATCH 0/6] A few cpuidle vs rcu fixes

2023-01-23 Thread Peter Zijlstra
0-day robot reported graph-tracing made the cpuidle-vs-rcu rework go splat.

These patches appear to cure this, the ftrace selftest now runs to completion
without spamming scary messages to dmesg.

---
 arch/x86/include/asm/atomic64_32.h | 44 +++---
 arch/x86/include/asm/atomic64_64.h | 36 +++
 arch/x86/include/asm/kvmclock.h|  2 +-
 arch/x86/include/asm/paravirt.h|  2 +-
 arch/x86/include/asm/pvclock.h |  3 ++-
 arch/x86/kernel/cpu/vmware.c   |  2 +-
 arch/x86/kernel/ftrace.c   |  3 +++
 arch/x86/kernel/kvmclock.c |  6 +++---
 arch/x86/kernel/pvclock.c  | 22 +--
 arch/x86/kernel/tsc.c  |  7 +++---
 arch/x86/xen/time.c| 12 +--
 drivers/cpuidle/cpuidle.c  |  2 +-
 drivers/cpuidle/poll_state.c   |  2 --
 include/linux/math64.h |  4 ++--
 include/linux/sched/clock.h|  8 +++
 kernel/sched/clock.c   | 27 +--
 16 files changed, 107 insertions(+), 75 deletions(-)




[PATCH 5/6] sched/clock: Make local_clock() noinstr

2023-01-23 Thread Peter Zijlstra
With sched_clock() noinstr, provide a noinstr implementation of
local_clock().

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched/clock.h |8 +++-
 kernel/sched/clock.c|   27 +--
 2 files changed, 24 insertions(+), 11 deletions(-)

--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -45,7 +45,7 @@ static inline u64 cpu_clock(int cpu)
return sched_clock();
 }
 
-static inline u64 local_clock(void)
+static __always_inline u64 local_clock(void)
 {
return sched_clock();
 }
@@ -79,10 +79,8 @@ static inline u64 cpu_clock(int cpu)
return sched_clock_cpu(cpu);
 }
 
-static inline u64 local_clock(void)
-{
-   return sched_clock_cpu(raw_smp_processor_id());
-}
+extern u64 local_clock(void);
+
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -93,7 +93,7 @@ struct sched_clock_data {
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, 
sched_clock_data);
 
-notrace static inline struct sched_clock_data *this_scd(void)
+static __always_inline struct sched_clock_data *this_scd(void)
 {
	return this_cpu_ptr(&sched_clock_data);
 }
@@ -244,12 +244,12 @@ late_initcall(sched_clock_init_late);
  * min, max except they take wrapping into account
  */
 
-notrace static inline u64 wrap_min(u64 x, u64 y)
+static __always_inline u64 wrap_min(u64 x, u64 y)
 {
return (s64)(x - y) < 0 ? x : y;
 }
 
-notrace static inline u64 wrap_max(u64 x, u64 y)
+static __always_inline u64 wrap_max(u64 x, u64 y)
 {
return (s64)(x - y) > 0 ? x : y;
 }
@@ -260,7 +260,7 @@ notrace static inline u64 wrap_max(u64 x
  *  - filter out backward motion
  *  - use the GTOD tick value to create a window to filter crazy TSC values
  */
-notrace static u64 sched_clock_local(struct sched_clock_data *scd)
+static __always_inline u64 sched_clock_local(struct sched_clock_data *scd)
 {
u64 now, clock, old_clock, min_clock, max_clock, gtod;
s64 delta;
@@ -287,13 +287,28 @@ notrace static u64 sched_clock_local(str
clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
 
-   if (!try_cmpxchg64(&scd->clock, &old_clock, clock))
+   if (!arch_try_cmpxchg64(&scd->clock, &old_clock, clock))
goto again;
 
return clock;
 }
 
-notrace static u64 sched_clock_remote(struct sched_clock_data *scd)
+noinstr u64 local_clock(void)
+{
+   u64 clock;
+
+   if (static_branch_likely(&__sched_clock_stable))
+   return sched_clock() + __sched_clock_offset;
+
+   preempt_disable_notrace();
+   clock = sched_clock_local(this_scd());
+   preempt_enable_notrace();
+
+   return clock;
+}
+EXPORT_SYMBOL_GPL(local_clock);
+
+static notrace u64 sched_clock_remote(struct sched_clock_data *scd)
 {
struct sched_clock_data *my_scd = this_scd();
u64 this_clock, remote_clock;




[PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Peter Zijlstra
All RCU disabled code should be noinstr and hence we should never get
here -- when we do, WARN about it and make sure to not actually do
tracing.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/kernel/ftrace.c |3 +++
 1 file changed, 3 insertions(+)

--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
	if (unlikely(atomic_read(&current->tracing_graph_pause)))
return;
 
+   if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
+   return;
+
bit = ftrace_test_recursion_trylock(ip, *parent);
if (bit < 0)
return;




[PATCH 6/6] cpuidle: Fix poll_idle() noinstr annotation

2023-01-23 Thread Peter Zijlstra
The instrumentation_begin()/end() annotations in poll_idle() were
complete nonsense. Specifically they caused tracing to happen in the
middle of noinstr code, resulting in RCU splats.

Now that local_clock() is noinstr, mark up the rest and let it rip.

Fixes: 00717eb8c955 ("cpuidle: Annotate poll_idle()")
Signed-off-by: Peter Zijlstra (Intel) 
Reported-by: kernel test robot 
Link: https://lore.kernel.org/oe-lkp/202301192148.58ece903-oliver.s...@intel.com
---
 drivers/cpuidle/cpuidle.c|2 +-
 drivers/cpuidle/poll_state.c |2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -426,7 +426,7 @@ void cpuidle_reflect(struct cpuidle_devi
  * @dev:   the cpuidle device
  *
  */
-u64 cpuidle_poll_time(struct cpuidle_driver *drv,
+__cpuidle u64 cpuidle_poll_time(struct cpuidle_driver *drv,
  struct cpuidle_device *dev)
 {
int i;
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -15,7 +15,6 @@ static int __cpuidle poll_idle(struct cp
 {
u64 time_start;
 
-   instrumentation_begin();
time_start = local_clock();
 
dev->poll_time_limit = false;
@@ -42,7 +41,6 @@ static int __cpuidle poll_idle(struct cp
raw_local_irq_disable();
 
current_clr_polling();
-   instrumentation_end();
 
return index;
 }




[PATCH 4/6] x86: Mark sched_clock() noinstr

2023-01-23 Thread Peter Zijlstra
In order to use sched_clock() from noinstr code, mark it and all its
implementations noinstr.

The whole pvclock thing (used by KVM/Xen) is a bit of a pain, since it
calls out to watchdogs; create a pvclock_clocksource_read_nowd()
variant that doesn't do that and can be noinstr.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/kvmclock.h |2 +-
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/include/asm/pvclock.h  |3 ++-
 arch/x86/kernel/cpu/vmware.c|2 +-
 arch/x86/kernel/kvmclock.c  |6 +++---
 arch/x86/kernel/pvclock.c   |   19 +++
 arch/x86/kernel/tsc.c   |7 +++
 arch/x86/xen/time.c |   12 ++--
 include/linux/math64.h  |4 ++--
 9 files changed, 38 insertions(+), 19 deletions(-)

--- a/arch/x86/include/asm/kvmclock.h
+++ b/arch/x86/include/asm/kvmclock.h
@@ -8,7 +8,7 @@ extern struct clocksource kvm_clock;
 
 DECLARE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
 
-static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+static __always_inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
 {
	return &this_cpu_read(hv_clock_per_cpu)->pvti;
 }
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -26,7 +26,7 @@ DECLARE_STATIC_CALL(pv_sched_clock, dumm
 
 void paravirt_set_sched_clock(u64 (*func)(void));
 
-static inline u64 paravirt_sched_clock(void)
+static __always_inline u64 paravirt_sched_clock(void)
 {
return static_call(pv_sched_clock)();
 }
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -7,6 +7,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
@@ -39,7 +40,7 @@ bool pvclock_read_retry(const struct pvc
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
-static inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
+static __always_inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
 {
u64 product;
 #ifdef __i386__
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -143,7 +143,7 @@ static __init int parse_no_stealacc(char
 }
 early_param("no-steal-acc", parse_no_stealacc);
 
-static unsigned long long notrace vmware_sched_clock(void)
+static noinstr u64 vmware_sched_clock(void)
 {
unsigned long long ns;
 
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -71,12 +71,12 @@ static int kvm_set_wallclock(const struc
return -ENODEV;
 }
 
-static u64 kvm_clock_read(void)
+static noinstr u64 kvm_clock_read(void)
 {
u64 ret;
 
preempt_disable_notrace();
-   ret = pvclock_clocksource_read(this_cpu_pvti());
+   ret = pvclock_clocksource_read_nowd(this_cpu_pvti());
preempt_enable_notrace();
return ret;
 }
@@ -86,7 +86,7 @@ static u64 kvm_clock_get_cycles(struct c
return kvm_clock_read();
 }
 
-static u64 kvm_sched_clock_read(void)
+static noinstr u64 kvm_sched_clock_read(void)
 {
return kvm_clock_read() - kvm_sched_clock_offset;
 }
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -64,7 +64,8 @@ u8 pvclock_read_flags(struct pvclock_vcp
return flags & valid_flags;
 }
 
-u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+static __always_inline
+u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
 {
unsigned version;
u64 ret;
@@ -77,7 +78,7 @@ u64 pvclock_clocksource_read(struct pvcl
flags = src->flags;
} while (pvclock_read_retry(src, version));
 
-   if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
+   if (dowd && unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
}
@@ -100,15 +101,25 @@ u64 pvclock_clocksource_read(struct pvcl
 * updating at the same time, and one of them could be slightly behind,
 * making the assumption that last_value always go forward fail to hold.
 */
-   last = atomic64_read(&last_value);
+   last = arch_atomic64_read(&last_value);
do {
if (ret <= last)
return last;
-   } while (!atomic64_try_cmpxchg(&last_value, &last, ret));
+   } while (!arch_atomic64_try_cmpxchg(&last_value, &last, ret));
 
return ret;
 }
 
+u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+{
+   return __pvclock_clocksource_read(src, true);
+}
+
+noinstr u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src)
+{
+   return __pvclock_clocksource_read(src, false);
+}
+
 void 
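For reference, the pvclock_scale_delta() helper made __always_inline above computes delta * (mul_frac / 2^32) after applying a power-of-two shift. That fixed-point step can be modeled in userspace with the GCC/Clang 128-bit integer extension (a sketch with a hypothetical scale_delta() name, not the kernel's open-coded 32x32 multiply sequence):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of pvclock_scale_delta(): scale a 64-bit TSC delta by a
 * 32-bit binary fraction (mul_frac / 2^32), after a power-of-two shift.
 * The kernel open-codes the multiply for 32-bit targets; this sketch
 * leans on the compiler's 128-bit arithmetic instead. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}
```

With mul_frac = 1 << 31 (i.e. 0.5) and shift = 0, a delta of 1000 scales to 500, which is the same arithmetic the guest uses to turn TSC cycles into nanoseconds.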

[PATCH 2/6] x86/pvclock: improve atomic update of last_value in pvclock_clocksource_read

2023-01-23 Thread Peter Zijlstra
From: Uros Bizjak 

Improve atomic update of last_value in pvclock_clocksource_read:

- Atomic update can be skipped if the "last_value" is already
  equal to "ret".

- The detection of atomic update failure is not correct. The value
  returned by atomic64_cmpxchg should be compared to the old value
  from the location to be updated. If these two are the same, then the
  atomic update succeeded and the "last_value" location was updated to
  "ret" atomically. Otherwise, the atomic update failed and it should
  be retried with the value from "last_value" - exactly what
  atomic64_try_cmpxchg does in a correct and more optimal way.

Signed-off-by: Uros Bizjak 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20230118202330.3740-1-ubiz...@gmail.com
---
 arch/x86/kernel/pvclock.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index eda37df016f0..5a2a517dd61b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -102,10 +102,9 @@ u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 */
	last = atomic64_read(&last_value);
do {
-   if (ret < last)
+   if (ret <= last)
return last;
-   last = atomic64_cmpxchg(&last_value, last, ret);
-   } while (unlikely(last != ret));
+   } while (!atomic64_try_cmpxchg(&last_value, &last, ret));
 
return ret;
 }
-- 
2.39.0
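The conversion above is the standard try_cmpxchg idiom: on failure, atomic64_try_cmpxchg() refreshes `last` with the value it actually found, so the loop needs no explicit re-read and no comparison against the wrong variable. The same monotonic-clamp update can be modeled in userspace with C11 atomics (a sketch with hypothetical names, not the kernel implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Monotonic "last_value" update, mirroring pvclock_clocksource_read():
 * return max(last, ret), publishing ret only if it moves the value
 * forward. atomic_compare_exchange_weak() reloads `last` on failure,
 * just like atomic64_try_cmpxchg() in the patch. */
static _Atomic uint64_t last_value;

static uint64_t clamp_monotonic(uint64_t ret)
{
	uint64_t last = atomic_load(&last_value);

	do {
		if (ret <= last)
			return last;	/* another reader is already ahead */
	} while (!atomic_compare_exchange_weak(&last_value, &last, ret));

	return ret;
}
```

A weak compare-exchange may fail spuriously, but the loop absorbs that, which is why the weak variant is the idiomatic choice inside a retry loop.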





Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Steven Rostedt
On Mon, 23 Jan 2023 21:50:12 +0100
Peter Zijlstra  wrote:

> All RCU disabled code should be noinstr and hence we should never get
> here -- when we do, WARN about it and make sure to not actually do
> tracing.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/kernel/ftrace.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
>   if (unlikely(atomic_read(&current->tracing_graph_pause)))
>   return;
>  
> + if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
> + return;
> +

Please add this after the recursion trylock below. Although WARN_ONCE()
should not have recursion issues, as function tracing can do weird
things, I'd rather be safe than sorry and not have the system triple
boot due to some path that might get added in the future.

If rcu_is_watching() is false, it will still get by the below recursion
check and warn. That is, the below check should be done before this
function calls any other function.

>   bit = ftrace_test_recursion_trylock(ip, *parent);
>   if (bit < 0)
>   return;
> 

-- Steve


Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Steven Rostedt
On Mon, 23 Jan 2023 16:53:04 -0500
Steven Rostedt  wrote:

> On Mon, 23 Jan 2023 21:50:12 +0100
> Peter Zijlstra  wrote:
> 
> > All RCU disabled code should be noinstr and hence we should never get
> > here -- when we do, WARN about it and make sure to not actually do
> > tracing.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> >  arch/x86/kernel/ftrace.c |3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
> > if (unlikely(atomic_read(&current->tracing_graph_pause)))
> > return;
> >  
> > +   if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
> > +   return;
> > +  
> 
> Please add this after the recursion trylock below. Although WARN_ONCE()
> should not have recursion issues, as function tracing can do weird
> things, I'd rather be safe than sorry and not have the system triple
> boot due to some path that might get added in the future.
> 
> If rcu_is_watching() is false, it will still get by the below recursion
> check and warn. That is, the below check should be done before this
> function calls any other function.
> 
> > bit = ftrace_test_recursion_trylock(ip, *parent);
> > if (bit < 0)
> > return;
> >   
> 

Actually, perhaps we can just add this, and all you need to do is create
and set CONFIG_NO_RCU_TRACING (or some other name).

This should cover all ftrace locations. (Uncompiled).

-- Steve

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index c303f7a114e9..10ee3fbb9113 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,22 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
+#ifdef CONFIG_NO_RCU_TRACING
+# define trace_warn_on_no_rcu(ip)  \
+   ({  \
+   bool __ret = false; \
+   if (!trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
+   trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
+   __ret = WARN_ONCE(!rcu_is_watching(),   \
+ "RCU not on for: %pS\n", (void *)ip); \
+   trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
+   }   \
+   __ret;  \
+   })
+#else
+# define trace_warn_on_no_rcu(ip)  false
+#endif
+
 /*
  * Preemption is promised to be disabled when return bit >= 0.
  */
@@ -144,6 +160,9 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
unsigned int val = READ_ONCE(current->trace_recursion);
int bit;
 
+   if (trace_warn_on_no_rcu(ip))
+   return -1;
+
bit = trace_get_context_bit() + start;
if (unlikely(val & (1 << bit))) {
/*
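The trace_warn_on_no_rcu() macro above wraps the WARN in a recursion bit so that the warning itself cannot re-enter tracing. Stripped of the ftrace plumbing, the guard reduces to a test-and-set flag around a potentially re-entrant action (a C sketch with hypothetical names; the kernel uses per-task, per-context bits):

```c
#include <assert.h>
#include <stdbool.h>

/* Re-entrancy guard in the style of trace_recursion_test()/set()/clear():
 * take a bit before doing work that might recurse into us, and refuse
 * to run again while the bit is held. */
static unsigned long trace_recursion;	/* per-task in the kernel */
#define RECORD_RECURSION_BIT 0

static int calls;			/* counts the guarded work */

static void recursive_action(void);

static bool guarded_action(void)
{
	if (trace_recursion & (1UL << RECORD_RECURSION_BIT))
		return false;		/* already inside: bail out */
	trace_recursion |= 1UL << RECORD_RECURSION_BIT;
	recursive_action();		/* may call guarded_action() again */
	trace_recursion &= ~(1UL << RECORD_RECURSION_BIT);
	return true;
}

static void recursive_action(void)
{
	calls++;
	guarded_action();		/* the inner call is rejected */
}
```

The outer call runs the action exactly once; the nested call sees the bit set and returns immediately, which is exactly why the WARN is safe to issue from inside the trace path.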


Re: [PATCH V2] vhost-scsi: unbreak any layout for response

2023-01-23 Thread Stefan Hajnoczi
On Thu, Jan 19, 2023 at 03:36:47PM +0800, Jason Wang wrote:
> Al Viro said:
> 
> """
> Since "vhost/scsi: fix reuse of &vq->iov[out] in response"
> we have this:
> cmd->tvc_resp_iov = vq->iov[vc.out];
> cmd->tvc_in_iovs = vc.in;
> combined with
> iov_iter_init(&iov_iter, ITER_DEST, &cmd->tvc_resp_iov,
>   cmd->tvc_in_iovs, sizeof(v_rsp));
> in vhost_scsi_complete_cmd_work().  We used to have ->tvc_resp_iov
> _pointing_ to vq->iov[vc.out]; back then iov_iter_init() asked to
> set an iovec-backed iov_iter over the tail of vq->iov[], with
> length being the amount of iovecs in the tail.
> 
> Now we have a copy of one element of that array.  Fortunately, the members
> following it in the containing structure are two non-NULL kernel pointers,
> so copy_to_iter() will not copy anything beyond the first iovec - kernel
> pointer is not (on the majority of architectures) going to be accepted by
> access_ok() in copyout() and it won't be skipped since the "length" (in
> reality - another non-NULL kernel pointer) won't be zero.
> 
> So it's not going to give a guest-to-qemu escalation, but it's definitely
> a bug.  Frankly, my preference would be to verify that the very first iovec
> is long enough to hold rsp_size.  Due to the above, any users that try to
> give us vq->iov[vc.out].iov_len < sizeof(struct virtio_scsi_cmd_resp)
> would currently get a failure in vhost_scsi_complete_cmd_work()
> anyway.
> """
> 
> However, the spec doesn't say anything about the legacy descriptor
> layout for the response. So this patch tries not to assume that the
> response resides in a single separate descriptor, which is what commit
> 79c14141a487 ("vhost/scsi: Convert completion path to use
> copy_to_iter") tried to achieve on the way towards ANY_LAYOUT.
> 
> This is done by allocating and using a dedicated resp iov in the
> command. To be safe, start with UIO_MAXIOV to be consistent with the
> limitation that we advertise to vhost_get_vq_desc().
> 
> Tested with a hacked virtio-scsi driver that uses 1 descriptor for 1
> byte in the response.
> 
> Reported-by: Al Viro 
> Cc: Benjamin Coddington 
> Cc: Nicholas Bellinger 
> Fixes: a77ec83a5789 ("vhost/scsi: fix reuse of &vq->iov[out] in response")
> Signed-off-by: Jason Wang 
> ---
> Changes since V1:
> - tweak the changelog
> - fix the allocation size for tvc_resp_iov (should be sizeof(struct iovec))
> ---
>  drivers/vhost/scsi.c | 21 +
>  1 file changed, 17 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi 
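Under ANY_LAYOUT the guest may split the 108-byte virtio_scsi response across any number of descriptors, which is why the patch sizes tvc_resp_iov for UIO_MAXIOV entries and lets copy_to_iter() walk them. The scatter copy itself can be modeled in plain C (a hypothetical helper, not the vhost code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Scatter `len` bytes from `src` across an iovec array, the way
 * copy_to_iter() spreads the virtio_scsi response over however many
 * descriptors the guest chose. Returns the number of bytes copied. */
static size_t copy_to_iovec(const void *src, size_t len,
			    const struct iovec *iov, int iovcnt)
{
	size_t done = 0;

	for (int i = 0; i < iovcnt && done < len; i++) {
		size_t n = len - done;

		if (n > iov[i].iov_len)
			n = iov[i].iov_len;
		memcpy(iov[i].iov_base, (const char *)src + done, n);
		done += n;
	}
	return done;
}
```

Copying into a 3-byte, a 5-byte, and a larger buffer in turn shows why assuming "one descriptor holds the whole response" is the bug the patch removes.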

