Re: [PATCH] tcp: Add skb addr and sock addr to arguments of tracepoint tcp_probe.

2024-02-28 Thread Jason Xing
On Thu, Feb 29, 2024 at 1:33 PM fuyuanli  wrote:
>
> It is useful to expose the skb addr and sock addr to users in the
> tracepoint tcp_probe, so that we can get more information while monitoring
> the reception of tcp data, via ebpf or other means.
>
> For example, we need to identify a packet by seq and end_seq when
> calculating transmit latency between layer 2 and layer 4 by ebpf, but they
> are not available in tcp_probe, so we can only get them by hooking
> tcp_rcv_established with a kprobe. But we could use tcp_probe directly if
> the skb addr and sock addr were available, which is more efficient.
>
> Signed-off-by: fuyuanli 

Please target 'net-next' in the title of your v2 patch.

> ---
>  include/trace/events/tcp.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index 7b1ddffa3dfc..096c15f64b92 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -258,6 +258,8 @@ TRACE_EVENT(tcp_probe,
> __field(__u32, srtt)
> __field(__u32, rcv_wnd)
> __field(__u64, sock_cookie)
> +   __field(const void *, skbaddr)
> +   __field(const void *, skaddr)
> ),
>
> TP_fast_assign(
> @@ -285,6 +287,9 @@ TRACE_EVENT(tcp_probe,
> __entry->ssthresh = tcp_current_ssthresh(sk);
> __entry->srtt = tp->srtt_us >> 3;
> __entry->sock_cookie = sock_gen_cookie(sk);
> +
> +   __entry->skbaddr = skb;
> +   __entry->skaddr = sk;
> ),
>
> TP_printk("family=%s src=%pISpc dest=%pISpc mark=%#x data_len=%d 
> snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u 
> sock_cookie=%llx",

If they are useful, you should at least print those two addresses
in TP_printk() like trace_kfree_skb() does.

May I ask how it could be useful if no other function prints
such information in the receive path?

Thanks,
Jason
> --
> 2.17.1
>
>



Re: [PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Cindy Lu
On Thu, Feb 29, 2024 at 3:19 AM Stefan Hajnoczi  wrote:
>
> On Wed, 28 Feb 2024 at 13:24, Dan Carpenter  wrote:
> >
> > The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> > vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> > reading one element beyond the end of the array.
> >
> > Add an array_index_nospec() as well to prevent speculation issues.
> >
> > Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> > Signed-off-by: Dan Carpenter 
> > ---
> > v2: add array_index_nospec()
> > v3: I accidentally corrupted v2.  Try again.
> >
Thanks for fixing this.
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
>
> Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Cindy Lu 
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index b7a1fb88c506..eb914084c650 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> > vm_area_struct *vma)
> > if ((vma->vm_flags & VM_SHARED) == 0)
> > return -EINVAL;
> >
> > -   if (index > dev->vq_num)
> > +   if (index >= dev->vq_num)
> > return -EINVAL;
> >
> > +   index = array_index_nospec(index, dev->vq_num);
> > vq = dev->vqs[index];
> > vaddr = vq->vdpa_reconnect_vaddr;
> > if (vaddr == 0)
> > --
> > 2.43.0
> >
> >
>
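The bug fixed above can be illustrated in miniature. The sketch below is a hypothetical userspace analogue (names like `lookup()` and `index_nospec()` are invented here, not the driver's API): with `vq_num` elements, valid indices are `0 .. vq_num - 1`, so the bound check must reject `index >= vq_num`, and the kernel's `array_index_nospec()` additionally clamps the index branchlessly so a mispredicted check cannot be used to read out of bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's array_index_nospec():
 * clamp the index to 0 when it is out of range, using a branchless mask,
 * so a speculatively mispredicted bounds check cannot leak memory. */
static size_t index_nospec(size_t index, size_t size)
{
    /* mask is all-ones when index < size, zero otherwise */
    size_t mask = (size_t)0 - (index < size);
    return index & mask;
}

/* The vduse bug in miniature: the original check used `index > vq_num`,
 * which lets index == vq_num slip through and read one element past the
 * end of the array. */
static int lookup(const int *vqs, size_t vq_num, size_t index, int *out)
{
    if (index >= vq_num)    /* was: index > vq_num -- off by one */
        return -1;
    index = index_nospec(index, vq_num);
    *out = vqs[index];
    return 0;
}
```

With a four-element array, index 3 is the last valid entry and index 4 must be rejected, which is exactly what the `>=` check guarantees.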




Re: [PATCH 2/3] dt-bindings: clock: ti: remove unstable remark

2024-02-28 Thread Tony Lindgren
* Stephen Boyd  [240228 22:59]:
> +Tony
> 
> Quoting Krzysztof Kozlowski (2024-02-24 01:12:35)
> > Several TI SoC clock bindings were marked as work-in-progress / unstable
> > between 2013-2016, for example in commit f60b1ea5ea7a ("CLK: TI: add
> > support for gate clock").  That has been enough time to consider them
> > stable and to expect the usual ABI rules.
> > 
> > Signed-off-by: Krzysztof Kozlowski 
> > ---
> 
> Acked-by: Stephen Boyd 

Makes sense to me:

Acked-by: Tony Lindgren 



[PATCH] tcp: Add skb addr and sock addr to arguments of tracepoint tcp_probe.

2024-02-28 Thread fuyuanli
It is useful to expose the skb addr and sock addr to users in the
tracepoint tcp_probe, so that we can get more information while monitoring
the reception of tcp data, via ebpf or other means.

For example, we need to identify a packet by seq and end_seq when
calculating transmit latency between layer 2 and layer 4 by ebpf, but they
are not available in tcp_probe, so we can only get them by hooking
tcp_rcv_established with a kprobe. But we could use tcp_probe directly if
the skb addr and sock addr were available, which is more efficient.

Signed-off-by: fuyuanli 
---
 include/trace/events/tcp.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 7b1ddffa3dfc..096c15f64b92 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -258,6 +258,8 @@ TRACE_EVENT(tcp_probe,
__field(__u32, srtt)
__field(__u32, rcv_wnd)
__field(__u64, sock_cookie)
+   __field(const void *, skbaddr)
+   __field(const void *, skaddr)
),
 
TP_fast_assign(
@@ -285,6 +287,9 @@ TRACE_EVENT(tcp_probe,
__entry->ssthresh = tcp_current_ssthresh(sk);
__entry->srtt = tp->srtt_us >> 3;
__entry->sock_cookie = sock_gen_cookie(sk);
+
+   __entry->skbaddr = skb;
+   __entry->skaddr = sk;
),
 
TP_printk("family=%s src=%pISpc dest=%pISpc mark=%#x data_len=%d 
snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u 
sock_cookie=%llx",
-- 
2.17.1
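The latency measurement motivating this patch can be sketched in plain C. This is a hypothetical illustration, not the submitter's eBPF code: every name below (`pkt_seen_l2`, `pkt_seen_l4`, the fixed-size table) is invented. The idea is simply to key a timestamp by (seq, end_seq) at the layer-2 hook and compute the delta when the same packet reaches the layer-4 hook (e.g. tcp_probe).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of matching a packet between two hooks by
 * (seq, end_seq), as an eBPF program attached to tcp_probe could do
 * once the skb/sk information is available.  All names are invented. */

#define MAX_INFLIGHT 64

struct pkt_key { uint32_t seq, end_seq; };
struct pkt_rec { struct pkt_key key; uint64_t ts_ns; int used; };

static struct pkt_rec table[MAX_INFLIGHT];

/* layer-2 hook: remember when the packet was first seen */
void pkt_seen_l2(struct pkt_key key, uint64_t ts_ns)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (!table[i].used) {
            table[i] = (struct pkt_rec){ key, ts_ns, 1 };
            return;
        }
    }
}

/* layer-4 hook (e.g. tcp_probe): look the packet up by (seq, end_seq)
 * and return the latency, or -1 if the packet was never recorded */
int64_t pkt_seen_l4(struct pkt_key key, uint64_t ts_ns)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (table[i].used &&
            table[i].key.seq == key.seq &&
            table[i].key.end_seq == key.end_seq) {
            table[i].used = 0;
            return (int64_t)(ts_ns - table[i].ts_ns);
        }
    }
    return -1;
}
```

In a real eBPF program the table would be a BPF hash map keyed the same way; the linear scan here just keeps the sketch self-contained.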




Re: [PATCH 2/2] riscv: Fix text patching when IPI are used

2024-02-28 Thread Andrea Parri
On Wed, Feb 28, 2024 at 06:51:49PM +0100, Alexandre Ghiti wrote:
> For now, we use stop_machine() to patch the text, and when we use IPIs for
> remote icache flushes (which are emitted in patch_text_nosync()), the system
> hangs.
> 
> So instead, make sure every cpu executes the stop_machine() patching
> function and emit a local icache flush there.
> 
> Co-developed-by: Björn Töpel 
> Signed-off-by: Björn Töpel 
> Signed-off-by: Alexandre Ghiti 

Modulo the removal of the hunk discussed with Samuel,

Reviewed-by: Andrea Parri 

Some nits / amendments to the inline comments below:


> + /*
> +  * Make sure the patching store is effective *before* we
> +  * increment the counter which releases all waiting cpus
> +  * by using the release version of atomic increment.
> +  */

s/cpus/CPUs
s/release version/release variant

The comment could be amended with a description of the matching barrier(s), say,
"The release pairs with the call to local_flush_icache_all() on the waiting 
CPU".

(Same for the comment in patch_text_cb().)

  Andrea
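The release/acquire pairing Andrea describes can be sketched with C11 atomics. This is a userspace analogue, not the riscv patching code: in the kernel the acquire side is effectively the waiting CPU's observation before `local_flush_icache_all()`, but the ordering pattern is the same — the store must be visible *before* the counter increment releases the waiters.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace analogue of the pattern: the patching thread publishes its
 * store with a release increment; the waiting thread pairs it with an
 * acquire load and is then guaranteed to observe the store. */

static int patched_insn;      /* stands in for the patched text */
static atomic_int patch_done; /* stands in for the release counter */

static void *patcher(void *arg)
{
    (void)arg;
    patched_insn = 42;                               /* "patch the text" */
    atomic_fetch_add_explicit(&patch_done, 1,
                              memory_order_release); /* release waiters */
    return NULL;
}

static void *waiter(void *arg)
{
    (void)arg;
    /* spin until released; this acquire pairs with the release above */
    while (atomic_load_explicit(&patch_done, memory_order_acquire) == 0)
        ;
    /* guaranteed to observe the patched store */
    return (void *)(long)patched_insn;
}

int run_demo(void)
{
    pthread_t a, b;
    void *ret;

    patched_insn = 0;
    atomic_store(&patch_done, 0);
    pthread_create(&b, NULL, waiter, NULL);
    pthread_create(&a, NULL, patcher, NULL);
    pthread_join(a, NULL);
    pthread_join(b, &ret);
    return (int)(long)ret;
}
```

Without the release/acquire pair (i.e. with relaxed ordering on both sides), the waiter could in principle leave the loop and still read the pre-patch value.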



Re: [PATCH 1/2] riscv: Remove superfluous smp_mb()

2024-02-28 Thread Andrea Parri
On Wed, Feb 28, 2024 at 06:51:48PM +0100, Alexandre Ghiti wrote:
> This memory barrier is not needed and not documented so simply remove
> it.
> 
> Suggested-by: Andrea Parri 
> Signed-off-by: Alexandre Ghiti 

Reviewed-by: Andrea Parri 

  Andrea



Re: [PATCH 1/3] dt-bindings: clock: keystone: remove unstable remark

2024-02-28 Thread Stephen Boyd
Quoting Krzysztof Kozlowski (2024-02-24 01:12:34)
> Keystone clock controller bindings were marked as work-in-progress /
> unstable in 2013 in commit b9e0d40c0d83 ("clk: keystone: add Keystone
> PLL clock driver") and commit 7affe5685c96 ("clk: keystone: Add gate
> control clock driver") Almost eleven years is enough, so drop the
> "unstable" remark and expect usual ABI rules.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---

Acked-by: Stephen Boyd 



Re: [PATCH 2/3] dt-bindings: clock: ti: remove unstable remark

2024-02-28 Thread Stephen Boyd
+Tony

Quoting Krzysztof Kozlowski (2024-02-24 01:12:35)
> Several TI SoC clock bindings were marked as work-in-progress / unstable
> between 2013-2016, for example in commit f60b1ea5ea7a ("CLK: TI: add
> support for gate clock").  That has been enough time to consider them
> stable and to expect the usual ABI rules.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---

Acked-by: Stephen Boyd 

>  Documentation/devicetree/bindings/clock/ti/adpll.txt| 2 --
>  Documentation/devicetree/bindings/clock/ti/apll.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/autoidle.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/clockdomain.txt  | 2 --
>  Documentation/devicetree/bindings/clock/ti/composite.txt| 2 --
>  Documentation/devicetree/bindings/clock/ti/divider.txt  | 2 --
>  Documentation/devicetree/bindings/clock/ti/dpll.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/fapll.txt| 2 --
>  .../devicetree/bindings/clock/ti/fixed-factor-clock.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/gate.txt | 2 --
>  Documentation/devicetree/bindings/clock/ti/interface.txt| 2 --
>  Documentation/devicetree/bindings/clock/ti/mux.txt  | 2 --
>  12 files changed, 24 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/clock/ti/adpll.txt 
> b/Documentation/devicetree/bindings/clock/ti/adpll.txt
> index 4c8a2ce2cd70..3122360adcf3 100644
> --- a/Documentation/devicetree/bindings/clock/ti/adpll.txt
> +++ b/Documentation/devicetree/bindings/clock/ti/adpll.txt
> @@ -1,7 +1,5 @@
>  Binding for Texas Instruments ADPLL clock.
>  
> -Binding status: Unstable - ABI compatibility may be broken in the future
> -
>  This binding uses the common clock binding[1]. It assumes a
>  register-mapped ADPLL with two to three selectable input clocks
>  and three to four children.
> diff --git a/Documentation/devicetree/bindings/clock/ti/apll.txt 
> b/Documentation/devicetree/bindings/clock/ti/apll.txt
> index ade4dd4c30f0..bbd505c1199d 100644
> --- a/Documentation/devicetree/bindings/clock/ti/apll.txt
> +++ b/Documentation/devicetree/bindings/clock/ti/apll.txt
> @@ -1,7 +1,5 @@
>  Binding for Texas Instruments APLL clock.
>  
> -Binding status: Unstable - ABI compatibility may be broken in the future
> -
>  This binding uses the common clock binding[1].  It assumes a
>  register-mapped APLL with usually two selectable input clocks
>  (reference clock and bypass clock), with analog phase locked
> diff --git a/Documentation/devicetree/bindings/clock/ti/autoidle.txt 
> b/Documentation/devicetree/bindings/clock/ti/autoidle.txt
> index 7c735dde9fe9..05645a10a9e3 100644
> --- a/Documentation/devicetree/bindings/clock/ti/autoidle.txt
> +++ b/Documentation/devicetree/bindings/clock/ti/autoidle.txt
> @@ -1,7 +1,5 @@
>  Binding for Texas Instruments autoidle clock.
>  
> -Binding status: Unstable - ABI compatibility may be broken in the future
> -
>  This binding uses the common clock binding[1]. It assumes a register mapped
>  clock which can be put to idle automatically by hardware based on the usage
>  and a configuration bit setting. Autoidle clock is never an individual
> diff --git a/Documentation/devicetree/bindings/clock/ti/clockdomain.txt 
> b/Documentation/devicetree/bindings/clock/ti/clockdomain.txt
> index 9c6199249ce5..edf0b5d42768 100644
> --- a/Documentation/devicetree/bindings/clock/ti/clockdomain.txt
> +++ b/Documentation/devicetree/bindings/clock/ti/clockdomain.txt
> @@ -1,7 +1,5 @@
>  Binding for Texas Instruments clockdomain.
>  
> -Binding status: Unstable - ABI compatibility may be broken in the future
> -
>  This binding uses the common clock binding[1] in consumer role.
>  Every clock on TI SoC belongs to one clockdomain, but software
>  only needs this information for specific clocks which require
> diff --git a/Documentation/devicetree/bindings/clock/ti/composite.txt 
> b/Documentation/devicetree/bindings/clock/ti/composite.txt
> index 33ac7c9ad053..6f7e1331b546 100644
> --- a/Documentation/devicetree/bindings/clock/ti/composite.txt
> +++ b/Documentation/devicetree/bindings/clock/ti/composite.txt
> @@ -1,7 +1,5 @@
>  Binding for TI composite clock.
>  
> -Binding status: Unstable - ABI compatibility may be broken in the future
> -
>  This binding uses the common clock binding[1]. It assumes a
>  register-mapped composite clock with multiple different sub-types;
>  
> diff --git a/Documentation/devicetree/bindings/clock/ti/divider.txt 
> b/Documentation/devicetree/bindings/clock/ti/divider.txt
> index 9b13b32974f9..4d7c76f0b356 100644
> --- a/Documentation/devicetree/bindings/clock/ti/divider.txt
> +++ b/Documentation/devicetree/bindings/clock/ti/divider.txt
> @@ -1,7 +1,5 @@
>  Binding for TI divider clock
>  
> -Binding status: Unstable - ABI compatibility may be broken in the future
> -
>  This binding uses the common clock binding[1].  It assumes a
>  register-mapped adjustable clock rate 

Re: [PATCH] virtiofs: don't mark virtio_fs_sysfs_exit as __exit

2024-02-28 Thread Stefan Hajnoczi
On Wed, 28 Feb 2024 at 16:47, Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> Calling an __exit function from an __init function is not allowed
> and will result in undefined behavior when the code is built-in:
>
> WARNING: modpost: vmlinux: section mismatch in reference: virtio_fs_init+0x50 
> (section: .init.text) -> virtio_fs_sysfs_exit (section: .exit.text)
>
> Remove the incorrect annotation.
>
> Fixes: a8f62f50b4e4 ("virtiofs: export filesystem tags through sysfs")
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/fuse/virtio_fs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, Arnd. Please see the duplicate patch that Miklos applied:
https://lore.kernel.org/linux-fsdevel/cajfpegsjcz-dnzyft3b5gbgcntmdr6r1n8pm5yclmw9fjy1...@mail.gmail.com/T/#t

Stefan

>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 3a7dd48b534f..36d87dd3cb48 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1595,7 +1595,7 @@ static int __init virtio_fs_sysfs_init(void)
> return 0;
>  }
>
> -static void __exit virtio_fs_sysfs_exit(void)
> +static void virtio_fs_sysfs_exit(void)
>  {
> kset_unregister(virtio_fs_kset);
> virtio_fs_kset = NULL;
> --
> 2.39.2
>
>



[PATCH] virtiofs: don't mark virtio_fs_sysfs_exit as __exit

2024-02-28 Thread Arnd Bergmann
From: Arnd Bergmann 

Calling an __exit function from an __init function is not allowed
and will result in undefined behavior when the code is built-in:

WARNING: modpost: vmlinux: section mismatch in reference: virtio_fs_init+0x50 
(section: .init.text) -> virtio_fs_sysfs_exit (section: .exit.text)

Remove the incorrect annotation.

Fixes: a8f62f50b4e4 ("virtiofs: export filesystem tags through sysfs")
Signed-off-by: Arnd Bergmann 
---
 fs/fuse/virtio_fs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 3a7dd48b534f..36d87dd3cb48 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1595,7 +1595,7 @@ static int __init virtio_fs_sysfs_init(void)
return 0;
 }
 
-static void __exit virtio_fs_sysfs_exit(void)
+static void virtio_fs_sysfs_exit(void)
 {
kset_unregister(virtio_fs_kset);
virtio_fs_kset = NULL;
-- 
2.39.2
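The section-mismatch problem this patch fixes can be modeled in userspace. A minimal sketch, assuming GCC/Clang on an ELF target: the kernel's `__init`/`__exit` annotations place functions in dedicated sections (`.init.text`/`.exit.text`); for built-in code the kernel frees `.init.text` after boot and may discard `.exit.text` entirely, so an `__init` function must not call an `__exit` one — the callee may simply not exist. The macro and function names below are invented for the demo, and custom section names are used so it compiles in userspace.

```c
#include <assert.h>

/* Invented userspace analogues of the kernel annotations: they only
 * place the function in a named section, which is what modpost checks
 * references across. */
#define my_init __attribute__((section("demo_init_text")))
#define my_exit __attribute__((section("demo_exit_text")))

static int registered;

/* The fix: drop the exit annotation from the cleanup helper so it is
 * safe to call from the init error path (it stays in normal .text). */
static void demo_sysfs_exit(void)
{
    registered = 0;
}

static int my_init demo_init(void)
{
    registered = 1;
    if (0 /* pretend a later setup step failed */) {
        demo_sysfs_exit(); /* error-path cleanup from an init function */
        return -1;
    }
    return 0;
}
```

Had `demo_sysfs_exit()` kept the `my_exit` annotation, the call from `demo_init()` would be the userspace shape of the `.init.text -> .exit.text` reference that modpost warns about.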




Re: [PATCH RFC ftrace] Chose RCU Tasks based on TASKS_RCU rather than PREEMPTION

2024-02-28 Thread Paul E. McKenney
On Wed, Feb 28, 2024 at 03:22:36PM -0500, Steven Rostedt wrote:
> On Wed, 28 Feb 2024 11:38:29 -0800
> "Paul E. McKenney"  wrote:
> 
> > The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that
> > even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY
> > might see the occasional preemption, and that this preemption just might
> > happen within a trampoline.
> > 
> > Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks()
> > based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION.
> > 
> > Only build tested.
> > 
> > Signed-off-by: Paul E. McKenney 
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Mark Rutland 
> > Cc: Mathieu Desnoyers 
> > Cc: Ankur Arora 
> > Cc: Thomas Gleixner 
> > Cc: 
> > 
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index 2da4eaa2777d6..c9e6c69cf3446 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -3156,7 +3156,7 @@ int ftrace_shutdown(struct ftrace_ops *ops, int 
> > command)
> >  * synchronize_rcu_tasks() will wait for those tasks to
> >  * execute and either schedule voluntarily or enter user space.
> >  */
> > -   if (IS_ENABLED(CONFIG_PREEMPTION))
> > +   if (IS_ENABLED(CONFIG_TASKS_RCU))
> > synchronize_rcu_tasks();
> 
> What happens if CONFIG_TASKS_RCU is not enabled? Does
> synchronize_rcu_tasks() do anything? Or is it just a synchronize_rcu()?

It is just a synchronize_rcu().

> If that's the case, perhaps just remove the if statement and make it:
> 
>   synchronize_rcu_tasks();
> 
> Not sure an extra synchronize_rcu() will hurt (especially after doing a
> synchronize_rcu_tasks_rude() just beforehand!).

That would work for me.  If there are no objections, I will make this
change.

Thanx, Paul



Re: [PATCH RFC ftrace] Chose RCU Tasks based on TASKS_RCU rather than PREEMPTION

2024-02-28 Thread Steven Rostedt
On Wed, 28 Feb 2024 11:38:29 -0800
"Paul E. McKenney"  wrote:

> The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that
> even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY
> might see the occasional preemption, and that this preemption just might
> happen within a trampoline.
> 
> Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks()
> based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION.
> 
> Only build tested.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Steven Rostedt 
> Cc: Masami Hiramatsu 
> Cc: Mark Rutland 
> Cc: Mathieu Desnoyers 
> Cc: Ankur Arora 
> Cc: Thomas Gleixner 
> Cc: 
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 2da4eaa2777d6..c9e6c69cf3446 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -3156,7 +3156,7 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command)
>* synchronize_rcu_tasks() will wait for those tasks to
>* execute and either schedule voluntarily or enter user space.
>*/
> - if (IS_ENABLED(CONFIG_PREEMPTION))
> + if (IS_ENABLED(CONFIG_TASKS_RCU))
>   synchronize_rcu_tasks();

What happens if CONFIG_TASKS_RCU is not enabled? Does
synchronize_rcu_tasks() do anything? Or is it just a synchronize_rcu()?

If that's the case, perhaps just remove the if statement and make it:

synchronize_rcu_tasks();

Not sure an extra synchronize_rcu() will hurt (especially after doing a
synchronize_rcu_tasks_rude() just beforehand!).

-- Steve
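The `IS_ENABLED()` compile-time test being discussed can be shown with a simplified reimplementation of the kernel's kconfig.h trick. This sketch omits the `=m` (module) handling of the real macro, and the `synchronize_rcu*()` functions here are counting stubs, not the RCU API:

```c
#include <assert.h>

/* Simplified version of the kernel's IS_ENABLED() (include/linux/kconfig.h):
 * evaluates to 1 when the CONFIG_ macro is defined to 1, and to 0 when it
 * is not defined at all.  The real macro also handles CONFIG_FOO_MODULE. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_TASKS_RCU 1
/* CONFIG_PREEMPTION deliberately left undefined */

/* counting stubs, standing in for the RCU grace-period waits */
static int rcu_syncs, tasks_syncs;
static void synchronize_rcu(void)       { rcu_syncs++; }
static void synchronize_rcu_tasks(void) { tasks_syncs++; }

/* shape of the ftrace_shutdown() hunk under discussion */
int ftrace_shutdown_sketch(void)
{
    if (IS_ENABLED(CONFIG_TASKS_RCU))
        synchronize_rcu_tasks();
    else
        synchronize_rcu();
    return tasks_syncs;
}
```

Steve's point is that the `if` may be droppable entirely, since with CONFIG_TASKS_RCU=n the kernel maps `synchronize_rcu_tasks()` to `synchronize_rcu()` anyway.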



Re: [PATCH v11 1/4] remoteproc: zynqmp: fix lockstep mode memory region

2024-02-28 Thread Mathieu Poirier
On Wed, 28 Feb 2024 at 12:24, Tanmay Shah  wrote:
>
>
> On 2/28/24 11:08 AM, Mathieu Poirier wrote:
> > On Mon, Feb 19, 2024 at 09:44:34AM -0800, Tanmay Shah wrote:
> > > In lockstep mode, R5 core0 uses the TCM of R5 core1. The following is the
> > > lockstep-mode memory region as per the hardware reference manual.
> > >
> > > |  *TCM* |   *R5 View* | *Linux view* |
> > > | R5_0 ATCM (128 KB) | 0x_ | 0xFFE0_  |
> > > | R5_0 BTCM (128 KB) | 0x0002_ | 0xFFE2_  |
> > >
> > > However, the driver shouldn't model it as above, because the R5 core0 TCM
> > > and core1 TCM have different power domains mapped to them.
> > > Hence, the TCM address space in lockstep mode should be modeled as 64KB
> > > regions only, where each region has its own power domain, as follows:
> > >
> > > |  *TCM* |   *R5 View* | *Linux view* |
> > > | R5_0 ATCM0 (64 KB) | 0x_ | 0xFFE0_  |
> > > | R5_0 BTCM0 (64 KB) | 0x0002_ | 0xFFE2_  |
> > > | R5_0 ATCM1 (64 KB) | 0x0001_ | 0xFFE1_  |
> > > | R5_0 BTCM1 (64 KB) | 0x0003_ | 0xFFE3_  |
> > >
> > > This makes driver maintenance easy and makes the design robust for future
> > > platforms as well.
> > >
> > > Signed-off-by: Tanmay Shah 
> >
> > Now that I have a clearer picture of where things are going, I am adding 
> > this
> > patch to rproc-next.
> >
> > I'll wait for the DT crew for the rest of this set.
>
> Hi Mathieu,
>
> Is it okay if we wait for the DT crew to clear the new bindings as well
> before taking this one to rproc-next, just in case any further
> modifications are needed?
>

Sure, we can do that too.

>
> Tanmay
>
>
> >
> > Thanks,
> > Mathieu
> >
> > > ---
> > >  drivers/remoteproc/xlnx_r5_remoteproc.c | 145 ++--
> > >  1 file changed, 12 insertions(+), 133 deletions(-)
> > >
> > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > index 4395edea9a64..42b0384d34f2 100644
> > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > @@ -84,12 +84,12 @@ static const struct mem_bank_data 
> > > zynqmp_tcm_banks_split[] = {
> > > {0xffebUL, 0x2, 0x1UL, PD_R5_1_BTCM, "btcm1"},
> > >  };
> > >
> > > -/* In lockstep mode cluster combines each 64KB TCM and makes 128KB TCM */
> > > +/* In lockstep mode cluster uses each 64KB TCM from second core as well 
> > > */
> > >  static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
> > > -   {0xffe0UL, 0x0, 0x2UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 128KB 
> > > each */
> > > -   {0xffe2UL, 0x2, 0x2UL, PD_R5_0_BTCM, "btcm0"},
> > > -   {0, 0, 0, PD_R5_1_ATCM, ""},
> > > -   {0, 0, 0, PD_R5_1_BTCM, ""},
> > > +   {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB 
> > > each */
> > > +   {0xffe2UL, 0x2, 0x1UL, PD_R5_0_BTCM, "btcm0"},
> > > +   {0xffe1UL, 0x1, 0x1UL, PD_R5_1_ATCM, "atcm1"},
> > > +   {0xffe3UL, 0x3, 0x1UL, PD_R5_1_BTCM, "btcm1"},
> > >  };
> > >
> > >  /**
> > > @@ -540,14 +540,14 @@ static int tcm_mem_map(struct rproc *rproc,
> > >  }
> > >
> > >  /*
> > > - * add_tcm_carveout_split_mode()
> > > + * add_tcm_banks()
> > >   * @rproc: single R5 core's corresponding rproc instance
> > >   *
> > > - * allocate and add remoteproc carveout for TCM memory in split mode
> > > + * allocate and add remoteproc carveout for TCM memory
> > >   *
> > >   * return 0 on success, otherwise non-zero value on failure
> > >   */
> > > -static int add_tcm_carveout_split_mode(struct rproc *rproc)
> > > +static int add_tcm_banks(struct rproc *rproc)
> > >  {
> > > struct rproc_mem_entry *rproc_mem;
> > > struct zynqmp_r5_core *r5_core;
> > > @@ -580,10 +580,10 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > > *rproc)
> > >  ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > > if (ret < 0) {
> > > dev_err(dev, "failed to turn on TCM 0x%x", 
> > > pm_domain_id);
> > > -   goto release_tcm_split;
> > > +   goto release_tcm;
> > > }
> > >
> > > -   dev_dbg(dev, "TCM carveout split mode %s addr=%llx, da=0x%x, 
> > > size=0x%lx",
> > > +   dev_dbg(dev, "TCM carveout %s addr=%llx, da=0x%x, size=0x%lx",
> > > bank_name, bank_addr, da, bank_size);
> > >
> > > rproc_mem = rproc_mem_entry_init(dev, NULL, bank_addr,
> > > @@ -593,7 +593,7 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > > *rproc)
> > > if (!rproc_mem) {
> > > ret = -ENOMEM;
> > > zynqmp_pm_release_node(pm_domain_id);
> > > -   goto release_tcm_split;
> > > +   goto release_tcm;
> > > }
> > >
> > > rproc_add_carveout(rproc, rproc_mem);
> > > @@ -601,7 +601,7 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > > *rproc)

[PATCH RFC ftrace] Chose RCU Tasks based on TASKS_RCU rather than PREEMPTION

2024-02-28 Thread Paul E. McKenney
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that
even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY
might see the occasional preemption, and that this preemption just might
happen within a trampoline.

Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks()
based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION.

Only build tested.

Signed-off-by: Paul E. McKenney 
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Ankur Arora 
Cc: Thomas Gleixner 
Cc: 

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 2da4eaa2777d6..c9e6c69cf3446 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3156,7 +3156,7 @@ int ftrace_shutdown(struct ftrace_ops *ops, int command)
 * synchronize_rcu_tasks() will wait for those tasks to
 * execute and either schedule voluntarily or enter user space.
 */
-   if (IS_ENABLED(CONFIG_PREEMPTION))
+   if (IS_ENABLED(CONFIG_TASKS_RCU))
synchronize_rcu_tasks();
 
ftrace_trampoline_free(ops);



Re: [PATCH v11 1/4] remoteproc: zynqmp: fix lockstep mode memory region

2024-02-28 Thread Tanmay Shah


On 2/28/24 11:08 AM, Mathieu Poirier wrote:
> On Mon, Feb 19, 2024 at 09:44:34AM -0800, Tanmay Shah wrote:
> > In lockstep mode, R5 core0 uses the TCM of R5 core1. The following is the
> > lockstep-mode memory region as per the hardware reference manual.
> > 
> > |  *TCM* |   *R5 View* | *Linux view* |
> > | R5_0 ATCM (128 KB) | 0x_ | 0xFFE0_  |
> > | R5_0 BTCM (128 KB) | 0x0002_ | 0xFFE2_  |
> > 
> > However, the driver shouldn't model it as above, because the R5 core0 TCM
> > and core1 TCM have different power domains mapped to them.
> > Hence, the TCM address space in lockstep mode should be modeled as 64KB
> > regions only, where each region has its own power domain, as follows:
> > 
> > |  *TCM* |   *R5 View* | *Linux view* |
> > | R5_0 ATCM0 (64 KB) | 0x_ | 0xFFE0_  |
> > | R5_0 BTCM0 (64 KB) | 0x0002_ | 0xFFE2_  |
> > | R5_0 ATCM1 (64 KB) | 0x0001_ | 0xFFE1_  |
> > | R5_0 BTCM1 (64 KB) | 0x0003_ | 0xFFE3_  |
> > 
> > This makes driver maintenance easy and makes the design robust for future
> > platforms as well.
> > 
> > Signed-off-by: Tanmay Shah 
>
> Now that I have a clearer picture of where things are going, I am adding this
> patch to rproc-next.
>
> I'll wait for the DT crew for the rest of this set.

Hi Mathieu,

Is it okay if we wait for the DT crew to clear the new bindings as well
before taking this one to rproc-next, just in case any further
modifications are needed?


Tanmay


>
> Thanks,
> Mathieu
>
> > ---
> >  drivers/remoteproc/xlnx_r5_remoteproc.c | 145 ++--
> >  1 file changed, 12 insertions(+), 133 deletions(-)
> > 
> > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > index 4395edea9a64..42b0384d34f2 100644
> > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > @@ -84,12 +84,12 @@ static const struct mem_bank_data 
> > zynqmp_tcm_banks_split[] = {
> > {0xffebUL, 0x2, 0x1UL, PD_R5_1_BTCM, "btcm1"},
> >  };
> >  
> > -/* In lockstep mode cluster combines each 64KB TCM and makes 128KB TCM */
> > +/* In lockstep mode cluster uses each 64KB TCM from second core as well */
> >  static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
> > -   {0xffe0UL, 0x0, 0x2UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 128KB 
> > each */
> > -   {0xffe2UL, 0x2, 0x2UL, PD_R5_0_BTCM, "btcm0"},
> > -   {0, 0, 0, PD_R5_1_ATCM, ""},
> > -   {0, 0, 0, PD_R5_1_BTCM, ""},
> > +   {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
> > */
> > +   {0xffe2UL, 0x2, 0x1UL, PD_R5_0_BTCM, "btcm0"},
> > +   {0xffe1UL, 0x1, 0x1UL, PD_R5_1_ATCM, "atcm1"},
> > +   {0xffe3UL, 0x3, 0x1UL, PD_R5_1_BTCM, "btcm1"},
> >  };
> >  
> >  /**
> > @@ -540,14 +540,14 @@ static int tcm_mem_map(struct rproc *rproc,
> >  }
> >  
> >  /*
> > - * add_tcm_carveout_split_mode()
> > + * add_tcm_banks()
> >   * @rproc: single R5 core's corresponding rproc instance
> >   *
> > - * allocate and add remoteproc carveout for TCM memory in split mode
> > + * allocate and add remoteproc carveout for TCM memory
> >   *
> >   * return 0 on success, otherwise non-zero value on failure
> >   */
> > -static int add_tcm_carveout_split_mode(struct rproc *rproc)
> > +static int add_tcm_banks(struct rproc *rproc)
> >  {
> > struct rproc_mem_entry *rproc_mem;
> > struct zynqmp_r5_core *r5_core;
> > @@ -580,10 +580,10 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > *rproc)
> >  ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > if (ret < 0) {
> > dev_err(dev, "failed to turn on TCM 0x%x", 
> > pm_domain_id);
> > -   goto release_tcm_split;
> > +   goto release_tcm;
> > }
> >  
> > -   dev_dbg(dev, "TCM carveout split mode %s addr=%llx, da=0x%x, 
> > size=0x%lx",
> > +   dev_dbg(dev, "TCM carveout %s addr=%llx, da=0x%x, size=0x%lx",
> > bank_name, bank_addr, da, bank_size);
> >  
> > rproc_mem = rproc_mem_entry_init(dev, NULL, bank_addr,
> > @@ -593,7 +593,7 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > *rproc)
> > if (!rproc_mem) {
> > ret = -ENOMEM;
> > zynqmp_pm_release_node(pm_domain_id);
> > -   goto release_tcm_split;
> > +   goto release_tcm;
> > }
> >  
> > rproc_add_carveout(rproc, rproc_mem);
> > @@ -601,7 +601,7 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > *rproc)
> >  
> > return 0;
> >  
> > -release_tcm_split:
> > +release_tcm:
> > /* If failed, Turn off all TCM banks turned on before */
> > for (i--; i >= 0; i--) {
> > pm_domain_id = r5_core->tcm_banks[i]->pm_domain_id;
> > @@ -610,127 +610,6 @@ static int add_tcm_carveout_split_mode(struct 

Re: [PATCH v11 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-02-28 Thread Tanmay Shah
Hi Krzysztof,

Ping for reviews. I also have a question.

I am referring to this patch from the patchwork link: 
https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240219174437.3722620-8-tanmay.s...@amd.com/

Patchwork shows that the dtbs-check failed, but I am sending the dts changes
in the next patch, so a dtbs-check failure with only the bindings patch is
expected.

How should this be resolved? Should I send the dtb changes together with the
bindings, or can we ignore the dtbs-check errors for now?

Thanks,

Tanmay

On 2/19/24 11:44 AM, Tanmay Shah wrote:
> From: Radhey Shyam Pandey 
>
> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
> UltraScale+ platform. It will help in defining TCM in device-tree
> and make its access platform-agnostic and data-driven.
>
> Tightly-coupled memories (TCMs) are low-latency memories that provide
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
>
> The TCM resources (reg, reg-names and power-domain) are documented for
> each TCM in the R5 node. The reg and reg-names are made as required
> properties, as we don't want to hardcode TCM addresses for future
> platforms; for the zu+ legacy implementation, this will ensure that
> old dts files w/o reg/reg-names still work and a stable ABI is maintained.
>
> It also extends the examples for TCM split and lockstep modes.
>
> Signed-off-by: Radhey Shyam Pandey 
> Signed-off-by: Tanmay Shah 
> ---
>
> Changes in v11:
>   - Fix yamllint warning and reduce indentation as needed
>
>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 192 --
>  1 file changed, 170 insertions(+), 22 deletions(-)
>
> diff --git 
> a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
> b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> index 78aac69f1060..77030edf41fa 100644
> --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> @@ -20,9 +20,21 @@ properties:
>compatible:
>  const: xlnx,zynqmp-r5fss
>  
> +  "#address-cells":
> +const: 2
> +
> +  "#size-cells":
> +const: 2
> +
> +  ranges:
> +description: |
> +  Standard ranges definition providing address translations for
> +  local R5F TCM address spaces to bus addresses.
> +
>xlnx,cluster-mode:
>  $ref: /schemas/types.yaml#/definitions/uint32
>  enum: [0, 1, 2]
> +default: 1
>  description: |
>The RPU MPCore can operate in split mode (Dual-processor performance), 
> Safety
>lock-step mode(Both RPU cores execute the same code in lock-step,
> @@ -37,7 +49,7 @@ properties:
>2: single cpu mode
>  
>  patternProperties:
> -  "^r5f-[a-f0-9]+$":
> +  "^r5f@[0-9a-f]+$":
>  type: object
>  description: |
>The RPU is located in the Low Power Domain of the Processor Subsystem.
> @@ -54,9 +66,6 @@ patternProperties:
>compatible:
>  const: xlnx,zynqmp-r5f
>  
> -  power-domains:
> -maxItems: 1
> -
>mboxes:
>  minItems: 1
>  items:
> @@ -101,35 +110,174 @@ patternProperties:
>  
>  required:
>- compatible
> -  - power-domains
>  
> -unevaluatedProperties: false
> +allOf:
> +  - if:
> +  properties:
> +xlnx,cluster-mode:
> +  enum:
> +- 1
> +then:
> +  patternProperties:
> +"^r5f@[0-9a-f]+$":
> +  type: object
> +
> +  properties:
> +reg:
> +  minItems: 1
> +  items:
> +- description: ATCM internal memory
> +- description: BTCM internal memory
> +- description: extra ATCM memory in lockstep mode
> +- description: extra BTCM memory in lockstep mode
> +
> +reg-names:
> +  minItems: 1
> +  items:
> +- const: atcm0
> +- const: btcm0
> +- const: atcm1
> +- const: btcm1
> +
> +power-domains:
> +  minItems: 2
> +  maxItems: 5
> +
> +  required:
> +- reg
> +- reg-names
> +- power-domains
> +
> +else:
> +  patternProperties:
> +"^r5f@[0-9a-f]+$":
> +  type: object
> +
> +  properties:
> +reg:
> +  minItems: 1
> +  items:
> +- description: ATCM internal memory
> +- description: BTCM internal memory
> +
> +reg-names:
> +  minItems: 1
> +  items:
> +- const: atcm0
> +- const: btcm0
> +
> +power-domains:
> +  minItems: 2
> +  maxItems: 3
> +
> +  required:
> +- reg
> +- reg-names
> +- power-domains
>  
>  required:
> 

Re: [PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Stefan Hajnoczi
On Wed, 28 Feb 2024 at 13:24, Dan Carpenter  wrote:
>
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
>
> Add an array_index_nospec() as well to prevent speculation issues.
>
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 
> ---
> v2: add array_index_nospec()
> v3: I accidentally corrupted v2.  Try again.
>
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 

> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
> if ((vma->vm_flags & VM_SHARED) == 0)
> return -EINVAL;
>
> -   if (index > dev->vq_num)
> +   if (index >= dev->vq_num)
> return -EINVAL;
>
> +   index = array_index_nospec(index, dev->vq_num);
> vq = dev->vqs[index];
> vaddr = vq->vdpa_reconnect_vaddr;
> if (vaddr == 0)
> --
> 2.43.0
>
>



Re: [PATCH 2/2] riscv: Fix text patching when IPI are used

2024-02-28 Thread Alexandre Ghiti
On Wed, Feb 28, 2024 at 7:21 PM Samuel Holland
 wrote:
>
> Hi Alex,
>
> On 2024-02-28 11:51 AM, Alexandre Ghiti wrote:
> > For now, we use stop_machine() to patch the text and when we use IPIs for
> > remote icache flushes (which is emitted in patch_text_nosync()), the system
> > hangs.
> >
> > So instead, make sure every cpu executes the stop_machine() patching
> > function and emit a local icache flush there.
> >
> > Co-developed-by: Björn Töpel 
> > Signed-off-by: Björn Töpel 
> > Signed-off-by: Alexandre Ghiti 
> > ---
> >  arch/riscv/include/asm/patch.h |  1 +
> >  arch/riscv/kernel/ftrace.c | 42 ++
> >  arch/riscv/kernel/patch.c  | 18 +--
> >  3 files changed, 50 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
> > index e88b52d39eac..9f5d6e14c405 100644
> > --- a/arch/riscv/include/asm/patch.h
> > +++ b/arch/riscv/include/asm/patch.h
> > @@ -6,6 +6,7 @@
> >  #ifndef _ASM_RISCV_PATCH_H
> >  #define _ASM_RISCV_PATCH_H
> >
> > +int patch_insn_write(void *addr, const void *insn, size_t len);
> >  int patch_text_nosync(void *addr, const void *insns, size_t len);
> >  int patch_text_set_nosync(void *addr, u8 c, size_t len);
> >  int patch_text(void *addr, u32 *insns, int ninsns);
> > diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
> > index f5aa24d9e1c1..5654966c4e7d 100644
> > --- a/arch/riscv/kernel/ftrace.c
> > +++ b/arch/riscv/kernel/ftrace.c
> > @@ -8,6 +8,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
> > unsigned long target,
> >   make_call_t0(hook_pos, target, call);
> >
> >   /* Replace the auipc-jalr pair at once. Return -EPERM on write error. 
> > */
> > - if (patch_text_nosync
> > - ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
> > + if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
> > MCOUNT_INSN_SIZE))
> >   return -EPERM;
> >
> >   return 0;
> > @@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned 
> > long addr)
> >
> >   make_call_t0(rec->ip, addr, call);
> >
> > - if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
> > + if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
> >   return -EPERM;
> >
> >   return 0;
> > @@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
> > *rec,
> >  {
> >   unsigned int nops[2] = {NOP4, NOP4};
> >
> > - if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
> > + if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
> >   return -EPERM;
> >
> >   return 0;
> > @@ -134,6 +134,40 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
> >
> >   return ret;
> >  }
> > +
> > +struct ftrace_modify_param {
> > + int command;
> > + atomic_t cpu_count;
> > +};
> > +
> > +static int __ftrace_modify_code(void *data)
> > +{
> > + struct ftrace_modify_param *param = data;
> > +
> > + if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> > + ftrace_modify_all_code(param->command);
> > + /*
> > +  * Make sure the patching store is effective *before* we
> > +  * increment the counter which releases all waiting cpus
> > +  * by using the release version of atomic increment.
> > +  */
> > + atomic_inc_return_release(&param->cpu_count);
> > + } else {
> > + while (atomic_read(&param->cpu_count) <= num_online_cpus())
> > + cpu_relax();
> > + }
> > +
> > + local_flush_icache_all();
> > +
> > + return 0;
> > +}
> > +
> > +void arch_ftrace_update_code(int command)
> > +{
> > + struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
> > +
> > + stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
> > +}
> >  #endif
> >
> >  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> > diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
> > index 0b5c16dfe3f4..82d8508c765b 100644
> > --- a/arch/riscv/kernel/patch.c
> > +++ b/arch/riscv/kernel/patch.c
> > @@ -188,7 +188,7 @@ int patch_text_set_nosync(void *addr, u8 c, size_t len)
> >  }
> >  NOKPROBE_SYMBOL(patch_text_set_nosync);
> >
> > -static int patch_insn_write(void *addr, const void *insn, size_t len)
> > +int patch_insn_write(void *addr, const void *insn, size_t len)
> >  {
> >   size_t patched = 0;
> >   size_t size;
> > @@ -211,11 +211,9 @@ NOKPROBE_SYMBOL(patch_insn_write);
> >
> >  int patch_text_nosync(void *addr, const void *insns, size_t len)
> >  {
> > - u32 *tp = addr;
> >   int ret;
> >
> > - ret = patch_insn_write(tp, insns, len);
> > -
> > + ret = patch_insn_write(addr, insns, len);
> >   if (!ret)
> >   flush_icache_range((uintptr_t) tp, 

Re: [PATCH] tracepoints: Use WARN() and not WARN_ON() for warnings

2024-02-28 Thread Paul E. McKenney
On Wed, Feb 28, 2024 at 01:31:12PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> There are two WARN_ON*() warnings in tracepoint.h that deal with RCU
> usage. But when they trigger, especially from using a TRACE_EVENT()
> macro, the information is not very helpful and is confusing:
> 
>  [ cut here ]
>  WARNING: CPU: 0 PID: 0 at include/trace/events/lock.h:24 
> lock_acquire+0x2b2/0x2d0
> 
> Where the above warning takes you to:
> 
>  TRACE_EVENT(lock_acquire,  <<<--- line 24 in lock.h
> 
>   TP_PROTO(struct lockdep_map *lock, unsigned int subclass,
>   int trylock, int read, int check,
>   struct lockdep_map *next_lock, unsigned long ip),
>   [..]
> 
> Change the WARN_ON_ONCE() to WARN_ONCE() and add a string that allows
> someone to search for exactly where the bug happened.
> 
> Reported-by: Borislav Petkov 
> Signed-off-by: Steven Rostedt (Google) 

Reviewed-by: Paul E. McKenney 

> ---
>  include/linux/tracepoint.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 88c0ba623ee6..689b6d71590e 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -199,7 +199,8 @@ static inline struct tracepoint 
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>   if (!(cond))\
>   return; \
>   \
> - if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
> + if (WARN_ONCE(RCUIDLE_COND(rcuidle),\
> +   "Bad RCU usage for tracepoint"))  \
>   return; \
>   \
>   /* keep srcu and sched-rcu usage consistent */  \
> @@ -259,7 +260,8 @@ static inline struct tracepoint 
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>   TP_ARGS(args),  \
>   TP_CONDITION(cond), 0); \
>   if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
> - WARN_ON_ONCE(!rcu_is_watching());   \
> + WARN_ONCE(!rcu_is_watching(),   \
> +   "RCU not watching for tracepoint");   \
>   }   \
>   }   \
>   __DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),  \
> -- 
> 2.43.0
> 



Re: [PATCH] tracepoints: Use WARN() and not WARN_ON() for warnings

2024-02-28 Thread Borislav Petkov
On Wed, Feb 28, 2024 at 01:31:12PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> There are two WARN_ON*() warnings in tracepoint.h that deal with RCU
> usage. But when they trigger, especially from using a TRACE_EVENT()
> macro, the information is not very helpful and is confusing:
> 
>  [ cut here ]
>  WARNING: CPU: 0 PID: 0 at include/trace/events/lock.h:24 
> lock_acquire+0x2b2/0x2d0
> 
> Where the above warning takes you to:
> 
>  TRACE_EVENT(lock_acquire,  <<<--- line 24 in lock.h
> 
>   TP_PROTO(struct lockdep_map *lock, unsigned int subclass,
>   int trylock, int read, int check,
>   struct lockdep_map *next_lock, unsigned long ip),
>   [..]
> 
> Change the WARN_ON_ONCE() to WARN_ONCE() and add a string that allows
> someone to search for exactly where the bug happened.
> 
> Reported-by: Borislav Petkov 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  include/linux/tracepoint.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)

Tested-by: Borislav Petkov (AMD) 

meaning: tested that it really fires:

[1.196008] Running RCU Tasks wait API self tests
[1.200227] [ cut here ]
[1.203899] RCU not watching for tracepoint
[1.203899] WARNING: CPU: 0 PID: 0 at include/trace/events/lock.h:24 
lock_acquire+0x2d3/0x300
...

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH net-next v2 0/3] tun: AF_XDP Tx zero-copy support

2024-02-28 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 07:04:41PM +0800, Yunjian Wang wrote:
> Hi all:
> 
> Now, some drivers support the zero-copy feature of AF_XDP sockets,
> which can significantly reduce CPU utilization for XDP programs.
> 
> This patch set allows TUN to also support the AF_XDP Tx zero-copy
> feature. It is based on Linux 6.8.0+(openEuler 23.09) and has
> successfully passed Netperf and Netserver stress testing with
> multiple streams between VM A and VM B, using AF_XDP and OVS.
> 
> The performance testing was performed on a Intel E5-2620 2.40GHz
> machine. Traffic were generated/send through TUN(testpmd txonly
> with AF_XDP) to VM (testpmd rxonly in guest).
> 
> +--+-+-+-+
> |  |   copy  |zero-copy| speedup |
> +--+-+-+-+
> | UDP  |   Mpps  |   Mpps  |%|
> | 64   |   2.5   |   4.0   |   60%   |
> | 512  |   2.1   |   3.6   |   71%   |
> | 1024 |   1.9   |   3.3   |   73%   |
> +--+-+-+-+
> 
> Yunjian Wang (3):
>   xsk: Remove non-zero 'dma_page' check in xp_assign_dev
>   vhost_net: Call peek_len when using xdp
>   tun: AF_XDP Tx zero-copy support


threading broken pls repost.

vhost bits look ok though:

Acked-by: Michael S. Tsirkin 


>  drivers/net/tun.c   | 177 ++--
>  drivers/vhost/net.c |  21 +++--
>  include/linux/if_tun.h  |  32 
>  net/xdp/xsk_buff_pool.c |   7 --
>  4 files changed, 220 insertions(+), 17 deletions(-)
> 
> -- 
> 2.41.0




[PATCH] tracepoints: Use WARN() and not WARN_ON() for warnings

2024-02-28 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There are two WARN_ON*() warnings in tracepoint.h that deal with RCU
usage. But when they trigger, especially from using a TRACE_EVENT()
macro, the information is not very helpful and is confusing:

 [ cut here ]
 WARNING: CPU: 0 PID: 0 at include/trace/events/lock.h:24 
lock_acquire+0x2b2/0x2d0

Where the above warning takes you to:

 TRACE_EVENT(lock_acquire,  <<<--- line 24 in lock.h

TP_PROTO(struct lockdep_map *lock, unsigned int subclass,
int trylock, int read, int check,
struct lockdep_map *next_lock, unsigned long ip),
[..]

Change the WARN_ON_ONCE() to WARN_ONCE() and add a string that allows
someone to search for exactly where the bug happened.

Reported-by: Borislav Petkov 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/tracepoint.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 88c0ba623ee6..689b6d71590e 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -199,7 +199,8 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
if (!(cond))\
return; \
\
-   if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
+   if (WARN_ONCE(RCUIDLE_COND(rcuidle),\
+ "Bad RCU usage for tracepoint"))  \
return; \
\
/* keep srcu and sched-rcu usage consistent */  \
@@ -259,7 +260,8 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
TP_ARGS(args),  \
TP_CONDITION(cond), 0); \
if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
-   WARN_ON_ONCE(!rcu_is_watching());   \
+   WARN_ONCE(!rcu_is_watching(),   \
+ "RCU not watching for tracepoint");   \
}   \
}   \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),  \
-- 
2.43.0




[PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Dan Carpenter
The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
reading one element beyond the end of the array.

Add an array_index_nospec() as well to prevent speculation issues.

Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
Signed-off-by: Dan Carpenter 
---
v2: add array_index_nospec()
v3: I accidentally corrupted v2.  Try again.

 drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index b7a1fb88c506..eb914084c650 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
vm_area_struct *vma)
if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;
 
-   if (index > dev->vq_num)
+   if (index >= dev->vq_num)
return -EINVAL;
 
+   index = array_index_nospec(index, dev->vq_num);
vq = dev->vqs[index];
vaddr = vq->vdpa_reconnect_vaddr;
if (vaddr == 0)
-- 
2.43.0




Re: [PATCH 2/2] riscv: Fix text patching when IPI are used

2024-02-28 Thread Samuel Holland
Hi Alex,

On 2024-02-28 11:51 AM, Alexandre Ghiti wrote:
> For now, we use stop_machine() to patch the text and when we use IPIs for
> remote icache flushes (which is emitted in patch_text_nosync()), the system
> hangs.
> 
> So instead, make sure every cpu executes the stop_machine() patching
> function and emit a local icache flush there.
> 
> Co-developed-by: Björn Töpel 
> Signed-off-by: Björn Töpel 
> Signed-off-by: Alexandre Ghiti 
> ---
>  arch/riscv/include/asm/patch.h |  1 +
>  arch/riscv/kernel/ftrace.c | 42 ++
>  arch/riscv/kernel/patch.c  | 18 +--
>  3 files changed, 50 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
> index e88b52d39eac..9f5d6e14c405 100644
> --- a/arch/riscv/include/asm/patch.h
> +++ b/arch/riscv/include/asm/patch.h
> @@ -6,6 +6,7 @@
>  #ifndef _ASM_RISCV_PATCH_H
>  #define _ASM_RISCV_PATCH_H
>  
> +int patch_insn_write(void *addr, const void *insn, size_t len);
>  int patch_text_nosync(void *addr, const void *insns, size_t len);
>  int patch_text_set_nosync(void *addr, u8 c, size_t len);
>  int patch_text(void *addr, u32 *insns, int ninsns);
> diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
> index f5aa24d9e1c1..5654966c4e7d 100644
> --- a/arch/riscv/kernel/ftrace.c
> +++ b/arch/riscv/kernel/ftrace.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
> unsigned long target,
>   make_call_t0(hook_pos, target, call);
>  
>   /* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
> - if (patch_text_nosync
> - ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
> + if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
> MCOUNT_INSN_SIZE))
>   return -EPERM;
>  
>   return 0;
> @@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
> addr)
>  
>   make_call_t0(rec->ip, addr, call);
>  
> - if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
> + if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
>   return -EPERM;
>  
>   return 0;
> @@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
> *rec,
>  {
>   unsigned int nops[2] = {NOP4, NOP4};
>  
> - if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
> + if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
>   return -EPERM;
>  
>   return 0;
> @@ -134,6 +134,40 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
>  
>   return ret;
>  }
> +
> +struct ftrace_modify_param {
> + int command;
> + atomic_t cpu_count;
> +};
> +
> +static int __ftrace_modify_code(void *data)
> +{
> + struct ftrace_modify_param *param = data;
> +
> > +   if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> + ftrace_modify_all_code(param->command);
> + /*
> +  * Make sure the patching store is effective *before* we
> +  * increment the counter which releases all waiting cpus
> +  * by using the release version of atomic increment.
> +  */
> > +   atomic_inc_return_release(&param->cpu_count);
> + } else {
> > +   while (atomic_read(&param->cpu_count) <= num_online_cpus())
> + cpu_relax();
> + }
> +
> + local_flush_icache_all();
> +
> + return 0;
> +}
> +
> +void arch_ftrace_update_code(int command)
> +{
> + struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
> +
> > +   stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
> +}
>  #endif
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
> index 0b5c16dfe3f4..82d8508c765b 100644
> --- a/arch/riscv/kernel/patch.c
> +++ b/arch/riscv/kernel/patch.c
> @@ -188,7 +188,7 @@ int patch_text_set_nosync(void *addr, u8 c, size_t len)
>  }
>  NOKPROBE_SYMBOL(patch_text_set_nosync);
>  
> -static int patch_insn_write(void *addr, const void *insn, size_t len)
> +int patch_insn_write(void *addr, const void *insn, size_t len)
>  {
>   size_t patched = 0;
>   size_t size;
> @@ -211,11 +211,9 @@ NOKPROBE_SYMBOL(patch_insn_write);
>  
>  int patch_text_nosync(void *addr, const void *insns, size_t len)
>  {
> - u32 *tp = addr;
>   int ret;
>  
> - ret = patch_insn_write(tp, insns, len);
> -
> + ret = patch_insn_write(addr, insns, len);
>   if (!ret)
>   flush_icache_range((uintptr_t) tp, (uintptr_t) tp + len);

This only happens to compile because flush_icache_range() is a macro that
ignores its parameters. You could replace tp with addr in this line as well, but
that seems like more of a cosmetic change and should be a separate patch (like
in [1] which covers both related functions) if you 

Re: [PATCH v2] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Dan Carpenter
On Wed, Feb 28, 2024 at 12:53:28PM -0500, Stefan Hajnoczi wrote:
> On Wed, 28 Feb 2024 at 12:44, Dan Carpenter  wrote:
> >
> > The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> > vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> > reading one element beyond the end of the array.
> >
> > Add an array_index_nospec() as well to prevent speculation issues.
> >
> > Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> > Signed-off-by: Dan Carpenter 
> > ---
> > v2: add array_index_nospec().
> 
> Did you forget to update the patch, I don't see array_index_nospec()?
> 
> >
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
  ^
I updated the patch but the thing about vim is that every time you
press a button it does something unexpected.  Vim ate my homework.

regards,
dan carpenter




Re: [PATCH v2] mm: add alloc_contig_migrate_range allocation statistics

2024-02-28 Thread Minchan Kim
On Wed, Feb 28, 2024 at 05:11:17AM +, Richard Chang wrote:
> alloc_contig_migrate_range has every information to be able to
> understand big contiguous allocation latency. For example, how many
> pages are migrated, how many times they were needed to unmap from
> page tables.
> 
> This patch adds the trace event to collect the allocation statistics.
> In the field, it was quite useful to understand CMA allocation
> latency.
> 
> Signed-off-by: Richard Chang 

Acked-by: Minchan Kim 



[PATCH 2/2] riscv: Fix text patching when IPI are used

2024-02-28 Thread Alexandre Ghiti
For now, we use stop_machine() to patch the text and when we use IPIs for
remote icache flushes (which is emitted in patch_text_nosync()), the system
hangs.

So instead, make sure every cpu executes the stop_machine() patching
function and emit a local icache flush there.

Co-developed-by: Björn Töpel 
Signed-off-by: Björn Töpel 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 42 ++
 arch/riscv/kernel/patch.c  | 18 +--
 3 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
index e88b52d39eac..9f5d6e14c405 100644
--- a/arch/riscv/include/asm/patch.h
+++ b/arch/riscv/include/asm/patch.h
@@ -6,6 +6,7 @@
 #ifndef _ASM_RISCV_PATCH_H
 #define _ASM_RISCV_PATCH_H
 
+int patch_insn_write(void *addr, const void *insn, size_t len);
 int patch_text_nosync(void *addr, const void *insns, size_t len);
 int patch_text_set_nosync(void *addr, u8 c, size_t len);
 int patch_text(void *addr, u32 *insns, int ninsns);
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index f5aa24d9e1c1..5654966c4e7d 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
unsigned long target,
make_call_t0(hook_pos, target, call);
 
/* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
-   if (patch_text_nosync
-   ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
 
make_call_t0(rec->ip, addr, call);
 
-   if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
*rec,
 {
unsigned int nops[2] = {NOP4, NOP4};
 
-   if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -134,6 +134,40 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 
return ret;
 }
+
+struct ftrace_modify_param {
+   int command;
+   atomic_t cpu_count;
+};
+
+static int __ftrace_modify_code(void *data)
+{
+   struct ftrace_modify_param *param = data;
+
+   if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
+   ftrace_modify_all_code(param->command);
+   /*
+* Make sure the patching store is effective *before* we
+* increment the counter which releases all waiting cpus
+* by using the release version of atomic increment.
+*/
+   atomic_inc_return_release(&param->cpu_count);
+   } else {
+   while (atomic_read(&param->cpu_count) <= num_online_cpus())
+   cpu_relax();
+   }
+
+   local_flush_icache_all();
+
+   return 0;
+}
+
+void arch_ftrace_update_code(int command)
+{
+   struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
+
+   stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
+}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 0b5c16dfe3f4..82d8508c765b 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -188,7 +188,7 @@ int patch_text_set_nosync(void *addr, u8 c, size_t len)
 }
 NOKPROBE_SYMBOL(patch_text_set_nosync);
 
-static int patch_insn_write(void *addr, const void *insn, size_t len)
+int patch_insn_write(void *addr, const void *insn, size_t len)
 {
size_t patched = 0;
size_t size;
@@ -211,11 +211,9 @@ NOKPROBE_SYMBOL(patch_insn_write);
 
 int patch_text_nosync(void *addr, const void *insns, size_t len)
 {
-   u32 *tp = addr;
int ret;
 
-   ret = patch_insn_write(tp, insns, len);
-
+   ret = patch_insn_write(addr, insns, len);
if (!ret)
flush_icache_range((uintptr_t) tp, (uintptr_t) tp + len);
 
@@ -232,15 +230,21 @@ static int patch_text_cb(void *data)
if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
for (i = 0; ret == 0 && i < patch->ninsns; i++) {
len = GET_INSN_LENGTH(patch->insns[i]);
-   ret = patch_text_nosync(patch->addr + i * len,
-   &patch->insns[i], len);
+   ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
}
-   

Re: [PATCH v2] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Stefan Hajnoczi
On Wed, 28 Feb 2024 at 12:44, Dan Carpenter  wrote:
>
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
>
> Add an array_index_nospec() as well to prevent speculation issues.
>
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 
> ---
> v2: add array_index_nospec().

Did you forget to update the patch, I don't see array_index_nospec()?

>
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
> if ((vma->vm_flags & VM_SHARED) == 0)
> return -EINVAL;
>
> -   if (index > dev->vq_num)
> +   if (index >= dev->vq_num)
> return -EINVAL;
>
> vq = dev->vqs[index];
> vaddr = vq->vdpa_reconnect_vaddr;
> if (vaddr == 0)
> --
> 2.43.0
>
>



[PATCH 1/2] riscv: Remove superfluous smp_mb()

2024-02-28 Thread Alexandre Ghiti
This memory barrier is not needed and not documented so simply remove
it.

Suggested-by: Andrea Parri 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/kernel/patch.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 37e87fdcf6a0..0b5c16dfe3f4 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -239,7 +239,6 @@ static int patch_text_cb(void *data)
} else {
while (atomic_read(&patch->cpu_count) <= num_online_cpus())
cpu_relax();
-   smp_mb();
}
 
return ret;
-- 
2.39.2




[PATCH 0/2] riscv: fix patching with IPI

2024-02-28 Thread Alexandre Ghiti
patch 1 removes a useless memory barrier and patch 2 actually fixes the
issue with IPI in the patching code.

Changes in v2:
- Add patch 1 and then remove the memory barrier from patch 2 as
  suggested by Andrea
- Convert atomic_inc into an atomic_inc with release semantics as
  suggested by Andrea

Alexandre Ghiti (2):
  riscv: Remove superfluous smp_mb()
  riscv: Fix text patching when IPI are used

 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 42 ++
 arch/riscv/kernel/patch.c  | 19 ---
 3 files changed, 50 insertions(+), 12 deletions(-)

-- 
2.39.2




[PATCH v2] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Dan Carpenter
The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
reading one element beyond the end of the array.

Add an array_index_nospec() as well to prevent speculation issues.

Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
Signed-off-by: Dan Carpenter 
---
v2: add array_index_nospec().

 drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index b7a1fb88c506..eb914084c650 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
vm_area_struct *vma)
if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;
 
-   if (index > dev->vq_num)
+   if (index >= dev->vq_num)
return -EINVAL;

vq = dev->vqs[index];
vaddr = vq->vdpa_reconnect_vaddr;
if (vaddr == 0)
-- 
2.43.0




Re: [PATCH] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Dan Carpenter
On Tue, Feb 27, 2024 at 10:48:49AM -0500, Michael S. Tsirkin wrote:
> On Tue, Feb 27, 2024 at 06:21:46PM +0300, Dan Carpenter wrote:
> > The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> > vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> > reading one element beyond the end of the array.
> > 
> > Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> > Signed-off-by: Dan Carpenter 
> 
> 
> Oh wow and does this not come from userspace? If yes we
> need the speculation magic macro when using the index, do we not?
> 

Yes, it does come from userspace.

To be honest, I'm not sure about speculation.  The similar places in
this file protect against speculation so, probably?  I'll resend the
patch.

regards,
dan carpenter




Re: [PATCH v11 4/4] remoteproc: zynqmp: parse TCM from device tree

2024-02-28 Thread Tanmay Shah


On 2/28/24 11:06 AM, Mathieu Poirier wrote:
> On Mon, Feb 19, 2024 at 09:44:37AM -0800, Tanmay Shah wrote:
> > ZynqMP TCM information was fixed in driver. Now ZynqMP TCM information
> > is available in device-tree. Parse TCM information in driver
> > as per new bindings.
> > 
> > Signed-off-by: Tanmay Shah 
> > ---
> > 
> > Changes in v11:
> >   - Remove redundant initialization of the variable
> >   - return correct error code if memory allocation failed
>
> Where is that?  I looked really hard but couldn't find it.

Hi Mathieu,

Thanks for reviews. Please find my comment below.

>
> > 
> >  drivers/remoteproc/xlnx_r5_remoteproc.c | 112 ++--
> >  1 file changed, 107 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > index 42b0384d34f2..d4a22caebaad 100644
> > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > @@ -74,8 +74,8 @@ struct mbox_info {
> >  };
> >  
> >  /*
> > - * Hardcoded TCM bank values. This will be removed once TCM bindings are
> > - * accepted for system-dt specifications and upstreamed in linux kernel
> > + * Hardcoded TCM bank values. This will stay in driver to maintain backward
> > + * compatibility with device-tree that does not have TCM information.
> >   */
> >  static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
> > {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
> > */
> > @@ -757,6 +757,103 @@ static struct zynqmp_r5_core 
> > *zynqmp_r5_add_rproc_core(struct device *cdev)
> > return ERR_PTR(ret);
> >  }
> >  
> > +static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster 
> > *cluster)
> > +{
> > +   int i, j, tcm_bank_count, ret, tcm_pd_idx, pd_count;
> > +   struct of_phandle_args out_args;
> > +   struct zynqmp_r5_core *r5_core;
> > +   struct platform_device *cpdev;
> > +   struct mem_bank_data *tcm;
> > +   struct device_node *np;
> > +   struct resource *res;
> > +   u64 abs_addr, size;
> > +   struct device *dev;
> > +
> > +   for (i = 0; i < cluster->core_count; i++) {
> > +   r5_core = cluster->r5_cores[i];
> > +   dev = r5_core->dev;
> > +   np = r5_core->np;
> > +
> > +   pd_count = of_count_phandle_with_args(np, "power-domains",
> > + "#power-domain-cells");
> > +
> > +   if (pd_count <= 0) {
> > +   dev_err(dev, "invalid power-domains property, %d\n", 
> > pd_count);
> > +   return -EINVAL;
> > +   }
> > +
> > +   /* First entry in power-domains list is for r5 core, rest for 
> > TCM. */
> > +   tcm_bank_count = pd_count - 1;
> > +
> > +   if (tcm_bank_count <= 0) {
> > +   dev_err(dev, "invalid TCM count %d\n", tcm_bank_count);
> > +   return -EINVAL;
> > +   }
> > +
> > +   r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count,
> > + sizeof(struct mem_bank_data 
> > *),
> > + GFP_KERNEL);
> > +   if (!r5_core->tcm_banks)
> > +   return -ENOMEM;

Hi Mathieu,

Here: in v10 it was the following:

+   if (!r5_core->tcm_banks)
+   ret = -ENOMEM;

Which is not correct. Somehow I missed returning the error code; instead it
was just storing the error in a variable.
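As an aside, the power-domains layout this series parses could look roughly like the fragment below. This is a hypothetical example; node names and addresses are illustrative only, not taken from the posted bindings. The first power-domains entry is the R5 core itself, the remaining entries are TCM banks, each matched by a reg entry:

```dts
r5f@ffe00000 {
	/* First entry: the R5 core's own power domain; the rest are TCM banks. */
	power-domains = <&zynqmp_firmware PD_RPU_0>,
			<&zynqmp_firmware PD_R5_0_ATCM>,
			<&zynqmp_firmware PD_R5_0_BTCM>;
	/* One reg entry per TCM bank, in the same order as the TCM
	 * power domains above. */
	reg = <0x0 0xffe00000 0x0 0x10000>,
	      <0x0 0xffe20000 0x0 0x10000>;
};
```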


> > +
> > +   r5_core->tcm_bank_count = tcm_bank_count;
> > +   for (j = 0, tcm_pd_idx = 1; j < tcm_bank_count; j++, 
> > tcm_pd_idx++) {
> > +   tcm = devm_kzalloc(dev, sizeof(struct mem_bank_data),
> > +  GFP_KERNEL);
> > +   if (!tcm)
> > +   return -ENOMEM;
> > +
> > +   r5_core->tcm_banks[j] = tcm;
> > +
> > +   /* Get power-domains id of TCM. */
> > +   ret = of_parse_phandle_with_args(np, "power-domains",
> > +"#power-domain-cells",
> > +tcm_pd_idx, _args);
> > +   if (ret) {
> > +   dev_err(r5_core->dev,
> > +   "failed to get tcm %d pm domain, ret 
> > %d\n",
> > +   tcm_pd_idx, ret);
> > +   return ret;
> > +   }
> > +   tcm->pm_domain_id = out_args.args[0];
> > +   of_node_put(out_args.np);
> > +
> > +   /* Get TCM address without translation. */
> > +   ret = of_property_read_reg(np, j, _addr, );
> > +   if (ret) {
> > +   dev_err(dev, "failed to get reg property\n");
> > +   return ret;
> > +   }
> > +
> > +   

Re: [PATCH v11 1/4] remoteproc: zynqmp: fix lockstep mode memory region

2024-02-28 Thread Mathieu Poirier
On Mon, Feb 19, 2024 at 09:44:34AM -0800, Tanmay Shah wrote:
> In lockstep mode, r5 core0 uses TCM of R5 core1. Following is lockstep
> mode memory region as per hardware reference manual.
> 
> |  *TCM* |   *R5 View* | *Linux view* |
> | R5_0 ATCM (128 KB) | 0x_ | 0xFFE0_  |
> | R5_0 BTCM (128 KB) | 0x0002_ | 0xFFE2_  |
> 
> However, driver shouldn't model it as above because R5 core0 TCM and core1
> TCM has different power-domains mapped to it.
> Hence, TCM address space in lockstep mode should be modeled as 64KB
> regions only where each region has its own power-domain as following:
> 
> |  *TCM* |   *R5 View* | *Linux view* |
> | R5_0 ATCM0 (64 KB) | 0x_ | 0xFFE0_  |
> | R5_0 BTCM0 (64 KB) | 0x0002_ | 0xFFE2_  |
> | R5_0 ATCM1 (64 KB) | 0x0001_ | 0xFFE1_  |
> | R5_0 BTCM1 (64 KB) | 0x0003_ | 0xFFE3_  |
> 
> This makes driver maintenance easy and makes design robust for future
> platforms as well.
> 
> Signed-off-by: Tanmay Shah 

Now that I have a clearer picture of where things are going, I am adding this
patch to rproc-next.

I'll wait for the DT crew for the rest of this set.

Thanks,
Mathieu

> ---
>  drivers/remoteproc/xlnx_r5_remoteproc.c | 145 ++--
>  1 file changed, 12 insertions(+), 133 deletions(-)
> 
> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> b/drivers/remoteproc/xlnx_r5_remoteproc.c
> index 4395edea9a64..42b0384d34f2 100644
> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> @@ -84,12 +84,12 @@ static const struct mem_bank_data 
> zynqmp_tcm_banks_split[] = {
>   {0xffebUL, 0x2, 0x1UL, PD_R5_1_BTCM, "btcm1"},
>  };
>  
> -/* In lockstep mode cluster combines each 64KB TCM and makes 128KB TCM */
> +/* In lockstep mode cluster uses each 64KB TCM from second core as well */
>  static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
> - {0xffe0UL, 0x0, 0x2UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 128KB 
> each */
> - {0xffe2UL, 0x2, 0x2UL, PD_R5_0_BTCM, "btcm0"},
> - {0, 0, 0, PD_R5_1_ATCM, ""},
> - {0, 0, 0, PD_R5_1_BTCM, ""},
> + {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
> */
> + {0xffe2UL, 0x2, 0x1UL, PD_R5_0_BTCM, "btcm0"},
> + {0xffe1UL, 0x1, 0x1UL, PD_R5_1_ATCM, "atcm1"},
> + {0xffe3UL, 0x3, 0x1UL, PD_R5_1_BTCM, "btcm1"},
>  };
>  
>  /**
> @@ -540,14 +540,14 @@ static int tcm_mem_map(struct rproc *rproc,
>  }
>  
>  /*
> - * add_tcm_carveout_split_mode()
> + * add_tcm_banks()
>   * @rproc: single R5 core's corresponding rproc instance
>   *
> - * allocate and add remoteproc carveout for TCM memory in split mode
> + * allocate and add remoteproc carveout for TCM memory
>   *
>   * return 0 on success, otherwise non-zero value on failure
>   */
> -static int add_tcm_carveout_split_mode(struct rproc *rproc)
> +static int add_tcm_banks(struct rproc *rproc)
>  {
>   struct rproc_mem_entry *rproc_mem;
>   struct zynqmp_r5_core *r5_core;
> @@ -580,10 +580,10 @@ static int add_tcm_carveout_split_mode(struct rproc 
> *rproc)
>ZYNQMP_PM_REQUEST_ACK_BLOCKING);
>   if (ret < 0) {
>   dev_err(dev, "failed to turn on TCM 0x%x", 
> pm_domain_id);
> - goto release_tcm_split;
> + goto release_tcm;
>   }
>  
> - dev_dbg(dev, "TCM carveout split mode %s addr=%llx, da=0x%x, 
> size=0x%lx",
> + dev_dbg(dev, "TCM carveout %s addr=%llx, da=0x%x, size=0x%lx",
>   bank_name, bank_addr, da, bank_size);
>  
>   rproc_mem = rproc_mem_entry_init(dev, NULL, bank_addr,
> @@ -593,7 +593,7 @@ static int add_tcm_carveout_split_mode(struct rproc 
> *rproc)
>   if (!rproc_mem) {
>   ret = -ENOMEM;
>   zynqmp_pm_release_node(pm_domain_id);
> - goto release_tcm_split;
> + goto release_tcm;
>   }
>  
>   rproc_add_carveout(rproc, rproc_mem);
> @@ -601,7 +601,7 @@ static int add_tcm_carveout_split_mode(struct rproc 
> *rproc)
>  
>   return 0;
>  
> -release_tcm_split:
> +release_tcm:
>   /* If failed, Turn off all TCM banks turned on before */
>   for (i--; i >= 0; i--) {
>   pm_domain_id = r5_core->tcm_banks[i]->pm_domain_id;
> @@ -610,127 +610,6 @@ static int add_tcm_carveout_split_mode(struct rproc 
> *rproc)
>   return ret;
>  }
>  
> -/*
> - * add_tcm_carveout_lockstep_mode()
> - * @rproc: single R5 core's corresponding rproc instance
> - *
> - * allocate and add remoteproc carveout for TCM memory in lockstep mode
> - *
> - * return 0 on success, otherwise non-zero value on failure
> - */
> -static int add_tcm_carveout_lockstep_mode(struct rproc *rproc)
> -{
> - 

Re: [PATCH v11 4/4] remoteproc: zynqmp: parse TCM from device tree

2024-02-28 Thread Mathieu Poirier
On Mon, Feb 19, 2024 at 09:44:37AM -0800, Tanmay Shah wrote:
> ZynqMP TCM information was fixed in driver. Now ZynqMP TCM information
> is available in device-tree. Parse TCM information in driver
> as per new bindings.
> 
> Signed-off-by: Tanmay Shah 
> ---
> 
> Changes in v11:
>   - Remove redundant initialization of the variable
>   - return correct error code if memory allocation failed

Where is that?  I looked really hard but couldn't find it.

> 
>  drivers/remoteproc/xlnx_r5_remoteproc.c | 112 ++--
>  1 file changed, 107 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> b/drivers/remoteproc/xlnx_r5_remoteproc.c
> index 42b0384d34f2..d4a22caebaad 100644
> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> @@ -74,8 +74,8 @@ struct mbox_info {
>  };
>  
>  /*
> - * Hardcoded TCM bank values. This will be removed once TCM bindings are
> - * accepted for system-dt specifications and upstreamed in linux kernel
> + * Hardcoded TCM bank values. This will stay in driver to maintain backward
> + * compatibility with device-tree that does not have TCM information.
>   */
>  static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
>   {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
> */
> @@ -757,6 +757,103 @@ static struct zynqmp_r5_core 
> *zynqmp_r5_add_rproc_core(struct device *cdev)
>   return ERR_PTR(ret);
>  }
>  
> +static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster *cluster)
> +{
> + int i, j, tcm_bank_count, ret, tcm_pd_idx, pd_count;
> + struct of_phandle_args out_args;
> + struct zynqmp_r5_core *r5_core;
> + struct platform_device *cpdev;
> + struct mem_bank_data *tcm;
> + struct device_node *np;
> + struct resource *res;
> + u64 abs_addr, size;
> + struct device *dev;
> +
> + for (i = 0; i < cluster->core_count; i++) {
> + r5_core = cluster->r5_cores[i];
> + dev = r5_core->dev;
> + np = r5_core->np;
> +
> + pd_count = of_count_phandle_with_args(np, "power-domains",
> +   "#power-domain-cells");
> +
> + if (pd_count <= 0) {
> + dev_err(dev, "invalid power-domains property, %d\n", 
> pd_count);
> + return -EINVAL;
> + }
> +
> + /* First entry in power-domains list is for r5 core, rest for 
> TCM. */
> + tcm_bank_count = pd_count - 1;
> +
> + if (tcm_bank_count <= 0) {
> + dev_err(dev, "invalid TCM count %d\n", tcm_bank_count);
> + return -EINVAL;
> + }
> +
> + r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count,
> +   sizeof(struct mem_bank_data 
> *),
> +   GFP_KERNEL);
> + if (!r5_core->tcm_banks)
> + return -ENOMEM;
> +
> + r5_core->tcm_bank_count = tcm_bank_count;
> + for (j = 0, tcm_pd_idx = 1; j < tcm_bank_count; j++, 
> tcm_pd_idx++) {
> + tcm = devm_kzalloc(dev, sizeof(struct mem_bank_data),
> +GFP_KERNEL);
> + if (!tcm)
> + return -ENOMEM;
> +
> + r5_core->tcm_banks[j] = tcm;
> +
> + /* Get power-domains id of TCM. */
> + ret = of_parse_phandle_with_args(np, "power-domains",
> +  "#power-domain-cells",
> +  tcm_pd_idx, _args);
> + if (ret) {
> + dev_err(r5_core->dev,
> + "failed to get tcm %d pm domain, ret 
> %d\n",
> + tcm_pd_idx, ret);
> + return ret;
> + }
> + tcm->pm_domain_id = out_args.args[0];
> + of_node_put(out_args.np);
> +
> + /* Get TCM address without translation. */
> + ret = of_property_read_reg(np, j, _addr, );
> + if (ret) {
> + dev_err(dev, "failed to get reg property\n");
> + return ret;
> + }
> +
> + /*
> +  * Remote processor can address only 32 bits
> +  * so convert 64-bits into 32-bits. This will discard
> +  * any unwanted upper 32-bits.
> +  */
> + tcm->da = (u32)abs_addr;
> + tcm->size = (u32)size;
> +
> + cpdev = to_platform_device(dev);
> + res = platform_get_resource(cpdev, 

Re: tprobe event tracing error

2024-02-28 Thread Steven Rostedt
On Wed, 28 Feb 2024 10:52:52 -0500
Steven Rostedt  wrote:

> The prototype of fchmodat() is:
> 
>   int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);
> 
> Where pathname is the third parameter, not the first, and mode is the third.

I meant pathname is the second parameter.

-- Steve



Re: tprobe event tracing error

2024-02-28 Thread Steven Rostedt
On Wed, 28 Feb 2024 17:25:40 +0300
Максим Морсков  wrote:

> Dear colleagues,
> One last question — is it a bug or a feature that tprobe event tracing cannot
> correctly dereference string pointers from pt_regs?
> For example:
> echo 't:tmy_chmod sys_enter id=$arg2 filename=+8($arg1):string 
> mode=+16($arg1)'  | sudo tee ‘/sys/kernel/tracing/dynamic_events’

So the tprobe attaches to the tracepoint, which is this:

 trace_sys_enter(regs, syscall);

Where arg1 is pt_regs, which on x86_64 (I'm assuming that's what you are
using) has:

struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
unsigned long r15;
unsigned long r14;
unsigned long r13;
unsigned long r12;
unsigned long rbp;
unsigned long rbx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
unsigned long r11;
unsigned long r10;
unsigned long r9;
unsigned long r8;
unsigned long rax;
unsigned long rcx;
unsigned long rdx;
unsigned long rsi;
unsigned long rdi;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
unsigned long orig_rax;
/* Return frame for iretq */
unsigned long rip;
unsigned long cs;
unsigned long eflags;
unsigned long rsp;
unsigned long ss;
/* top of stack page */
};

Where regs+8 is register r14, and regs+16 is r13. Is that what you really want?

No, it's not.

Also, I noticed that you are not tracing chmod, but you are tracing id = 268,
which is fchmodat() (I noticed via strace that this is what "chmod" uses).

The prototype of fchmodat() is:

  int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);

Where pathname is the third parameter, not the first, and mode is the third.

The calling convention for x86_64 is:  rdi rsi rdx rcx r8-9 

That is, arg1 is in register rdi, arg2 is rsi, arg3 is rdx.

We want arguments 2 and 3, which are:

  regs: $arg1
  regs->rsi:+104($arg1)
  regs->rdx:+96($arg1)

And since the file name is a string, we need to do one more dereference to
get to it:

  pathname: +0(+104($arg1)):ustring

(notice I used "ustring" as we now differentiate between kernel and user space)

With the above information I can do:

 # cd /sys/kernel/tracing
 # echo 't:tmy_chmod sys_enter id=$arg2 filename=+0(+104($arg1)):ustring 
mode=+96($arg1):x16' > dynamic_events
 # echo 'id == 268' > events/tracepoints/tmy_chmod/filter
 # echo 1 > events/tracepoints/tmy_chmod/enable
 # mkdir /tmp/x
 # chmod 100 /tmp/x
 # cat trace

# tracer: nop
#
# entries-in-buffer/entries-written: 2/2   #P:8
#
#_-=> irqs-off/BH-disabled
#   / _=> need-resched
#  | / _---=> hardirq/softirq
#  || / _--=> preempt-depth
#  ||| / _-=> migrate-disable
#   / delay
#   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
#  | | |   | | |
   chmod-1035[004] ...1.  1744.492490: tmy_chmod: 
(__probestub_sys_enter+0x4/0x10) id=0x10c filename="/tmp/x" mode=0x40

  TADA!!!

-- Steve


> echo 'id == 268' | sudo tee 
> ‘/sys/kernel/tracing/events/tracepoints/tmy_chmod/filter’
> echo '1' | sudo tee ‘/sys/kernel/tracing/events/tracepoints/tmy_chmod/enable’
> echo ‘1’ | sudo tee ‘/sys/kernel/tracing/tracing_on’
> 
> cat ‘/sys/kernel/tracing/trace’
> #   TASK-PID CPU#  |  TIMESTAMP  FUNCTION 
> #  | | |   | | |
>   chmod-10522   [010] ...1.  8533.321703: tmy_chmod: 
> (__probestub_sys_enter+0x0/0x10) id=0x10c fd=0x81ed filename="" mode=0x1ed
>  
> The pointer is correct (it corresponds to kprobe event args), but the
> dereference never happens
>   



Re: [PATCH v2] mm: add alloc_contig_migrate_range allocation statistics

2024-02-28 Thread Steven Rostedt
On Wed, 28 Feb 2024 05:11:17 +
Richard Chang  wrote:

> alloc_contig_migrate_range has every information to be able to
> understand big contiguous allocation latency. For example, how many
> pages are migrated, how many times they were needed to unmap from
> page tables.
> 
> This patch adds the trace event to collect the allocation statistics.
> In the field, it was quite useful to understand CMA allocation
> latency.
> 
> Signed-off-by: Richard Chang 
> ---
> * from v1 - 
> https://lore.kernel.org/linux-mm/20240226100045.2083962-1-richard...@google.com/
>   * Move the trace event int field to the end of the longs - rostedt
>   * Do the calculation only when tracing is enabled - rostedt
> 
>  include/trace/events/kmem.h | 38 +
>  mm/internal.h   |  3 ++-
>  mm/page_alloc.c | 32 ++-
>  mm/page_isolation.c |  2 +-
>  4 files changed, 68 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 58688768ef0f..6e62cc64cd92 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -304,6 +304,44 @@ TRACE_EVENT(mm_page_alloc_extfrag,
>   __entry->change_ownership)
>  );
>  
> +TRACE_EVENT(mm_alloc_contig_migrate_range_info,
> +
> + TP_PROTO(unsigned long start,
> +  unsigned long end,
> +  unsigned long nr_migrated,
> +  unsigned long nr_reclaimed,
> +  unsigned long nr_mapped,
> +  int migratetype),

Well, you didn't need to change the order of the parameters.

Anyway, from a tracing point of view:

Reviewed-by: Steven Rostedt (Google) 

> +
> + TP_ARGS(start, end, nr_migrated, nr_reclaimed, nr_mapped, migratetype),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, start)
> + __field(unsigned long, end)
> + __field(unsigned long, nr_migrated)
> + __field(unsigned long, nr_reclaimed)
> + __field(unsigned long, nr_mapped)
> + __field(int, migratetype)
> + ),
> +



[PATCH v2 5/6] virtiofs: use scattered bounce buffer for ITER_KVEC dio

2024-02-28 Thread Hou Tao
From: Hou Tao 

To prevent unnecessary requests for large contiguous physical memory
chunks, use a bounce buffer backed by scattered pages for ITER_KVEC
direct-io reads and writes when the total size of the args is greater
than PAGE_SIZE.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 78 ++---
 1 file changed, 59 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index ffea684bd100d..34b9370beba6d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -458,20 +458,15 @@ static void virtio_fs_argbuf_free(struct virtio_fs_argbuf 
*argbuf)
kfree(argbuf);
 }
 
-static struct virtio_fs_argbuf *virtio_fs_argbuf_new(struct fuse_args *args,
+static struct virtio_fs_argbuf *virtio_fs_argbuf_new(unsigned int in_len,
+unsigned int out_len,
 gfp_t gfp, bool is_flat)
 {
struct virtio_fs_argbuf *argbuf;
-   unsigned int numargs;
-   unsigned int in_len, out_len, len;
+   unsigned int len;
unsigned int i, nr;
 
-   numargs = args->in_numargs - args->in_pages;
-   in_len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
-   numargs = args->out_numargs - args->out_pages;
-   out_len = fuse_len_args(numargs, args->out_args);
len = virtio_fs_argbuf_len(in_len, out_len, is_flat);
-
if (is_flat) {
argbuf = kmalloc(struct_size(argbuf, f.buf, len), gfp);
if (argbuf)
@@ -1222,14 +1217,17 @@ static unsigned int sg_count_fuse_pages(struct 
fuse_page_desc *page_descs,
 }
 
 /* Return the number of scatter-gather list elements required */
-static unsigned int sg_count_fuse_req(struct fuse_req *req)
+static unsigned int sg_count_fuse_req(struct fuse_req *req,
+ unsigned int in_args_len,
+ unsigned int out_args_len,
+ bool flat_argbuf)
 {
struct fuse_args *args = req->args;
struct fuse_args_pages *ap = container_of(args, typeof(*ap), args);
unsigned int size, total_sgs = 1 /* fuse_in_header */;
+   unsigned int num_in, num_out;
 
-   if (args->in_numargs - args->in_pages)
-   total_sgs += 1;
+   num_in = args->in_numargs - args->in_pages;
 
if (args->in_pages) {
size = args->in_args[args->in_numargs - 1].size;
@@ -1237,20 +1235,25 @@ static unsigned int sg_count_fuse_req(struct fuse_req 
*req)
 size);
}
 
-   if (!test_bit(FR_ISREPLY, >flags))
-   return total_sgs;
+   if (!test_bit(FR_ISREPLY, >flags)) {
+   num_out = 0;
+   goto done;
+   }
 
total_sgs += 1 /* fuse_out_header */;
-
-   if (args->out_numargs - args->out_pages)
-   total_sgs += 1;
+   num_out = args->out_numargs - args->out_pages;
 
if (args->out_pages) {
size = args->out_args[args->out_numargs - 1].size;
total_sgs += sg_count_fuse_pages(ap->descs, ap->num_pages,
 size);
}
-
+done:
+   if (flat_argbuf)
+   total_sgs += !!num_in + !!num_out;
+   else
+   total_sgs += virtio_fs_argbuf_len(in_args_len, out_args_len,
+ false) >> PAGE_SHIFT;
return total_sgs;
 }
 
@@ -1302,6 +1305,31 @@ static unsigned int sg_init_fuse_args(struct scatterlist 
*sg,
return total_sgs;
 }
 
+static bool use_scattered_argbuf(struct fuse_req *req)
+{
+   struct fuse_args *args = req->args;
+
+   /*
+* To prevent unnecessary request for contiguous physical memory chunk,
+* use argbuf backed by scattered pages for ITER_KVEC direct-io
+* read/write when the total size of its args is greater than PAGE_SIZE.
+*/
+   if ((req->in.h.opcode == FUSE_WRITE && !args->in_pages) ||
+   (req->in.h.opcode == FUSE_READ && !args->out_pages)) {
+   unsigned int numargs;
+   unsigned int len;
+
+   numargs = args->in_numargs - args->in_pages;
+   len = fuse_len_args(numargs, (struct fuse_arg *)args->in_args);
+   numargs = args->out_numargs - args->out_pages;
+   len += fuse_len_args(numargs, args->out_args);
+   if (len > PAGE_SIZE)
+   return true;
+   }
+
+   return false;
+}
+
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight)
@@ -1317,13 +1345,24 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
unsigned int out_sgs = 0;
unsigned int in_sgs = 0;
unsigned int total_sgs;
+   unsigned int 

[PATCH v2 6/6] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-02-28 Thread Hou Tao
From: Hou Tao 

When invoking virtio_fs_enqueue_req() through kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering the size of the sg array may be greater than PAGE_SIZE, use
GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
allocation failure and to avoid unnecessarily depleting the atomic
reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly,
use GFP_KERNEL and memalloc_nofs_{save|restore} helpers instead.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 34b9370beba6d..9ee71051c89f2 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -108,7 +108,8 @@ struct virtio_fs_argbuf {
 };
 
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight);
+struct fuse_req *req, bool in_flight,
+gfp_t gfp);
 
 static const struct constant_table dax_param_enums[] = {
{"always",  FUSE_DAX_ALWAYS },
@@ -394,6 +395,8 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
 
/* Dispatch pending requests */
while (1) {
+   unsigned int flags;
+
spin_lock(>lock);
req = list_first_entry_or_null(>queued_reqs,
   struct fuse_req, list);
@@ -404,7 +407,9 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
list_del_init(>list);
spin_unlock(>lock);
 
-   ret = virtio_fs_enqueue_req(fsvq, req, true);
+   flags = memalloc_nofs_save();
+   ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);
+   memalloc_nofs_restore(flags);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(>lock);
@@ -1332,7 +1337,8 @@ static bool use_scattered_argbuf(struct fuse_req *req)
 
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight)
+struct fuse_req *req, bool in_flight,
+gfp_t gfp)
 {
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1364,8 +1370,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
total_sgs = sg_count_fuse_req(req, in_args_len, out_args_len,
  flat_argbuf);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
-   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
-   sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+   sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1373,8 +1379,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   req->argbuf = virtio_fs_argbuf_new(in_args_len, out_args_len,
-  GFP_ATOMIC, flat_argbuf);
+   req->argbuf = virtio_fs_argbuf_new(in_args_len, out_args_len, gfp,
+  flat_argbuf);
if (!req->argbuf) {
ret = -ENOMEM;
goto out;
@@ -1473,7 +1479,7 @@ __releases(fiq->lock)
 fuse_len_args(req->args->out_numargs, req->args->out_args));
 
fsvq = >vqs[queue_id];
-   ret = virtio_fs_enqueue_req(fsvq, req, false);
+   ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
/*
-- 
2.29.2




[PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

2024-02-28 Thread Hou Tao
From: Hou Tao 

When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
module), if the cache of virtiofs is disabled, the read buffer will be
passed to virtiofs through out_args[0].value instead of pages. Because
virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
will create a bounce buffer for the read buffer by using kmalloc() and
copy the read buffer into bounce buffer. If the read buffer is large
(e.g., 1MB), the allocation will incur significant stress on the memory
subsystem.

So instead of allocating bounce buffer by using kmalloc(), allocate a
bounce buffer which is backed by scattered pages. The original idea was
to use vmap(), but vmap() cannot be used under GFP_ATOMIC. To
simplify the copy operations in the bounce buffer, use a bio_vec flex
array to represent the argbuf. Also add an is_flat field in struct
virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
buffer.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 163 
 1 file changed, 149 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index f10fff7f23a0f..ffea684bd100d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -86,10 +86,27 @@ struct virtio_fs_req_work {
struct work_struct done_work;
 };
 
-struct virtio_fs_argbuf {
+struct virtio_fs_flat_argbuf {
DECLARE_FLEX_ARRAY(u8, buf);
 };
 
+struct virtio_fs_scattered_argbuf {
+   unsigned int size;
+   unsigned int nr;
+   DECLARE_FLEX_ARRAY(struct bio_vec, bvec);
+};
+
+struct virtio_fs_argbuf {
+   bool is_flat;
+   /* There is flexible array in the end of these two struct
+* definitions, so they must be the last field.
+*/
+   union {
+   struct virtio_fs_flat_argbuf f;
+   struct virtio_fs_scattered_argbuf s;
+   };
+};
+
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
@@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
}
 }
 
+static unsigned int virtio_fs_argbuf_len(unsigned int in_args_len,
+unsigned int out_args_len,
+bool is_flat)
+{
+   if (is_flat)
+   return in_args_len + out_args_len;
+
+   /*
+* Align in_args_len with PAGE_SIZE to reduce the total number of
+* sg entries when the value of out_args_len (e.g., the length of
+* read buffer) is page-aligned.
+*/
+   return round_up(in_args_len, PAGE_SIZE) +
+  round_up(out_args_len, PAGE_SIZE);
+}
+
 static void virtio_fs_argbuf_free(struct virtio_fs_argbuf *argbuf)
 {
+   unsigned int i;
+
+   if (!argbuf)
+   return;
+
+   if (argbuf->is_flat)
+   goto free_argbuf;
+
+   for (i = 0; i < argbuf->s.nr; i++)
+   __free_page(argbuf->s.bvec[i].bv_page);
+
+free_argbuf:
kfree(argbuf);
 }
 
 static struct virtio_fs_argbuf *virtio_fs_argbuf_new(struct fuse_args *args,
-gfp_t gfp)
+gfp_t gfp, bool is_flat)
 {
struct virtio_fs_argbuf *argbuf;
unsigned int numargs;
-   unsigned int len;
+   unsigned int in_len, out_len, len;
+   unsigned int i, nr;
 
numargs = args->in_numargs - args->in_pages;
-   len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
+   in_len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
numargs = args->out_numargs - args->out_pages;
-   len += fuse_len_args(numargs, args->out_args);
+   out_len = fuse_len_args(numargs, args->out_args);
+   len = virtio_fs_argbuf_len(in_len, out_len, is_flat);
+
+   if (is_flat) {
+   argbuf = kmalloc(struct_size(argbuf, f.buf, len), gfp);
+   if (argbuf)
+   argbuf->is_flat = true;
+
+   return argbuf;
+   }
+
+   nr = len >> PAGE_SHIFT;
+   argbuf = kmalloc(struct_size(argbuf, s.bvec, nr), gfp);
+   if (!argbuf)
+   return NULL;
+
+   argbuf->is_flat = false;
+   argbuf->s.size = len;
+   argbuf->s.nr = 0;
+   for (i = 0; i < nr; i++) {
+   struct page *page;
+
+   page = alloc_page(gfp);
+   if (!page) {
+   virtio_fs_argbuf_free(argbuf);
+   return NULL;
+   }
+   bvec_set_page(>s.bvec[i], page, PAGE_SIZE, 0);
+   argbuf->s.nr++;
+   }
+
+   /* Zero the unused space for in_args */
+   if (in_len & ~PAGE_MASK) {
+   struct iov_iter iter;
+   unsigned int to_zero;
+
+   iov_iter_bvec(, ITER_DEST, argbuf->s.bvec, argbuf->s.nr,
+ 

[PATCH v2 3/6] virtiofs: factor out more common methods for argbuf

2024-02-28 Thread Hou Tao
From: Hou Tao 

Factor out more common methods for bounce buffer of fuse args:

1) virtio_fs_argbuf_setup_sg: set-up sgs for bounce buffer
2) virtio_fs_argbuf_copy_from_in_arg: copy each in-arg to bounce buffer
3) virtio_fs_argbuf_out_args_offset: calc the start offset of out-arg
4) virtio_fs_argbuf_copy_to_out_arg: copy bounce buffer to each out-arg

These methods will be used to implement a bounce buffer backed by
scattered pages which are allocated separately.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 77 +++--
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index cd1330506daba..f10fff7f23a0f 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -86,6 +86,10 @@ struct virtio_fs_req_work {
struct work_struct done_work;
 };
 
+struct virtio_fs_argbuf {
+   DECLARE_FLEX_ARRAY(u8, buf);
+};
+
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
@@ -404,13 +408,15 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
}
 }
 
-static void virtio_fs_argbuf_free(void *argbuf)
+static void virtio_fs_argbuf_free(struct virtio_fs_argbuf *argbuf)
 {
kfree(argbuf);
 }
 
-static void *virtio_fs_argbuf_new(struct fuse_args *args, gfp_t gfp)
+static struct virtio_fs_argbuf *virtio_fs_argbuf_new(struct fuse_args *args,
+gfp_t gfp)
 {
+   struct virtio_fs_argbuf *argbuf;
unsigned int numargs;
unsigned int len;
 
@@ -419,7 +425,41 @@ static void *virtio_fs_argbuf_new(struct fuse_args *args, 
gfp_t gfp)
numargs = args->out_numargs - args->out_pages;
len += fuse_len_args(numargs, args->out_args);
 
-   return kmalloc(len, gfp);
+   argbuf = kmalloc(struct_size(argbuf, buf, len), gfp);
+
+   return argbuf;
+}
+
+static unsigned int virtio_fs_argbuf_setup_sg(struct virtio_fs_argbuf *argbuf,
+ unsigned int offset,
+ unsigned int len,
+ struct scatterlist *sg)
+{
+   sg_init_one(sg, argbuf->buf + offset, len);
+   return 1;
+}
+
+static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf *argbuf,
+ unsigned int offset,
+ const void *src, unsigned int len)
+{
+   memcpy(argbuf->buf + offset, src, len);
+}
+
+static unsigned int
+virtio_fs_argbuf_out_args_offset(struct virtio_fs_argbuf *argbuf,
+const struct fuse_args *args)
+{
+   unsigned int num_in = args->in_numargs - args->in_pages;
+
+   return fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
+}
+
+static void virtio_fs_argbuf_copy_to_out_arg(struct virtio_fs_argbuf *argbuf,
+unsigned int offset, void *dst,
+unsigned int len)
+{
+   memcpy(dst, argbuf->buf + offset, len);
 }
 
 /*
@@ -515,9 +555,9 @@ static void copy_args_to_argbuf(struct fuse_req *req)
 
num_in = args->in_numargs - args->in_pages;
for (i = 0; i < num_in; i++) {
-   memcpy(req->argbuf + offset,
-  args->in_args[i].value,
-  args->in_args[i].size);
+   virtio_fs_argbuf_copy_from_in_arg(req->argbuf, offset,
+ args->in_args[i].value,
+ args->in_args[i].size);
offset += args->in_args[i].size;
}
 }
@@ -525,17 +565,19 @@ static void copy_args_to_argbuf(struct fuse_req *req)
 /* Copy args out of req->argbuf */
 static void copy_args_from_argbuf(struct fuse_args *args, struct fuse_req *req)
 {
+   struct virtio_fs_argbuf *argbuf;
unsigned int remaining;
unsigned int offset;
-   unsigned int num_in;
unsigned int num_out;
unsigned int i;
 
remaining = req->out.h.len - sizeof(req->out.h);
-   num_in = args->in_numargs - args->in_pages;
num_out = args->out_numargs - args->out_pages;
-   offset = fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
+   if (!num_out)
+   goto out;
 
+   argbuf = req->argbuf;
+   offset = virtio_fs_argbuf_out_args_offset(argbuf, args);
for (i = 0; i < num_out; i++) {
unsigned int argsize = args->out_args[i].size;
 
@@ -545,13 +587,16 @@ static void copy_args_from_argbuf(struct fuse_args *args, 
struct fuse_req *req)
argsize = remaining;
}
 
-   memcpy(args->out_args[i].value, req->argbuf + offset, argsize);
+   virtio_fs_argbuf_copy_to_out_arg(argbuf, offset,
+   

[PATCH v2 2/6] virtiofs: move alloc/free of argbuf into separated helpers

2024-02-28 Thread Hou Tao
From: Hou Tao 

The bounce buffer for fuse args in virtiofs will be extended to support
scattered pages later. Therefore, move the allocation and freeing of
argbuf out of the copy procedures and factor them into
virtio_fs_argbuf_{new|free}() helpers.

Signed-off-by: Hou Tao 
---
 fs/fuse/virtio_fs.c | 52 +++--
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce9..cd1330506daba 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -404,6 +404,24 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
}
 }
 
+static void virtio_fs_argbuf_free(void *argbuf)
+{
+   kfree(argbuf);
+}
+
+static void *virtio_fs_argbuf_new(struct fuse_args *args, gfp_t gfp)
+{
+   unsigned int numargs;
+   unsigned int len;
+
+   numargs = args->in_numargs - args->in_pages;
+   len = fuse_len_args(numargs, (struct fuse_arg *) args->in_args);
+   numargs = args->out_numargs - args->out_pages;
+   len += fuse_len_args(numargs, args->out_args);
+
+   return kmalloc(len, gfp);
+}
+
 /*
  * Returns 1 if queue is full and sender should wait a bit before sending
  * next request, 0 otherwise.
@@ -487,36 +505,24 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
}
 }
 
-/* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+/* Copy args into req->argbuf */
+static void copy_args_to_argbuf(struct fuse_req *req)
 {
struct fuse_args *args = req->args;
unsigned int offset = 0;
unsigned int num_in;
-   unsigned int num_out;
-   unsigned int len;
unsigned int i;
 
num_in = args->in_numargs - args->in_pages;
-   num_out = args->out_numargs - args->out_pages;
-   len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
- fuse_len_args(num_out, args->out_args);
-
-   req->argbuf = kmalloc(len, GFP_ATOMIC);
-   if (!req->argbuf)
-   return -ENOMEM;
-
for (i = 0; i < num_in; i++) {
memcpy(req->argbuf + offset,
   args->in_args[i].value,
   args->in_args[i].size);
offset += args->in_args[i].size;
}
-
-   return 0;
 }
 
-/* Copy args out of and free req->argbuf */
+/* Copy args out of req->argbuf */
 static void copy_args_from_argbuf(struct fuse_args *args, struct fuse_req *req)
 {
unsigned int remaining;
@@ -549,9 +555,6 @@ static void copy_args_from_argbuf(struct fuse_args *args, 
struct fuse_req *req)
/* Store the actual size of the variable-length arg */
if (args->out_argvar)
args->out_args[args->out_numargs - 1].size = remaining;
-
-   kfree(req->argbuf);
-   req->argbuf = NULL;
 }
 
 /* Work function for request completion */
@@ -571,6 +574,9 @@ static void virtio_fs_request_complete(struct fuse_req *req,
args = req->args;
copy_args_from_argbuf(args, req);
 
+   virtio_fs_argbuf_free(req->argbuf);
+   req->argbuf = NULL;
+
if (args->out_pages && args->page_zeroing) {
len = args->out_args[args->out_numargs - 1].size;
ap = container_of(args, typeof(*ap), args);
@@ -1149,9 +1155,13 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   ret = copy_args_to_argbuf(req);
-   if (ret < 0)
+   req->argbuf = virtio_fs_argbuf_new(args, GFP_ATOMIC);
+   if (!req->argbuf) {
+   ret = -ENOMEM;
goto out;
+   }
+
+   copy_args_to_argbuf(req);
 
/* Request elements */
	sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h));
@@ -1210,7 +1220,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
*fsvq,
 
 out:
if (ret < 0 && req->argbuf) {
-   kfree(req->argbuf);
+   virtio_fs_argbuf_free(req->argbuf);
req->argbuf = NULL;
}
if (sgs != stack_sgs) {
-- 
2.29.2




[PATCH v2 0/6] virtiofs: fix the warning for ITER_KVEC dio

2024-02-28 Thread Hou Tao
From: Hou Tao 

Hi,

The patch set aims to fix the warning related to an abnormal size
parameter of kmalloc() in virtiofs. The warning occurred when attempting
to insert a 10MB sized kernel module kept in a virtiofs with cache
disabled. As analyzed in patch #1, the root cause is that the length of
the read buffer is not limited, and the read buffer is passed directly to
virtiofs through out_args[0].value. Therefore patch #1 limits the
length of the read buffer passed to virtiofs by using max_pages. However
it is not enough, because now the maximal value of max_pages is 256.
Consequently, when reading a 10MB-sized kernel module, the length of the
bounce buffer in virtiofs will be 40 + (256 * 4096), and kmalloc will
try to allocate 2MB from the memory subsystem. The request for 2MB of
physically contiguous memory significantly stresses the memory subsystem
and may fail indefinitely on hosts with fragmented memory. To address
this, patch #2~#5 use scattered pages in a bio_vec to replace the
kmalloc-allocated bounce buffer when the length of the bounce buffer for
KVEC_ITER dio is larger than PAGE_SIZE. The final issue with the
allocation of the bounce buffer and sg array in virtiofs is that
GFP_ATOMIC is used even when the allocation occurs in a kworker context.
Therefore the last patch uses GFP_NOFS for the allocation of both sg
array and bounce buffer when initiated by the kworker. For more details,
please check the individual patches.

As usual, comments are always welcome.

Change Log:

v2:
  * limit the length of ITER_KVEC dio by max_pages instead of the
newly-introduced max_nopage_rw. Using max_pages makes the ITER_KVEC
dio consistent with other rw operations.
  * replace the kmalloc-allocated bounce buffer with a bounce buffer
backed by scattered pages when the length of the bounce buffer for
KVEC_ITER dio is larger than PAGE_SIZE, so even on hosts with
fragmented memory, the KVEC_ITER dio can be handled normally by
virtiofs. (Bernd Schubert)
  * merge the GFP_NOFS patch [1] into this patch-set and use
memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
(Benjamin Coddington)

v1: 
https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/

[1]: 
https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-hou...@huaweicloud.com/

Hou Tao (6):
  fuse: limit the length of ITER_KVEC dio by max_pages
  virtiofs: move alloc/free of argbuf into separated helpers
  virtiofs: factor out more common methods for argbuf
  virtiofs: support bounce buffer backed by scattered pages
  virtiofs: use scattered bounce buffer for ITER_KVEC dio
  virtiofs: use GFP_NOFS when enqueuing request through kworker

 fs/fuse/file.c  |  12 +-
 fs/fuse/virtio_fs.c | 336 +---
 2 files changed, 296 insertions(+), 52 deletions(-)

-- 
2.29.2




[PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages

2024-02-28 Thread Hou Tao
From: Hou Tao 

When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:

  [ cut here ]
  WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
  Modules linked in:
  CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
  RIP: 0010:__alloc_pages+0x2c4/0x360
  ..
  Call Trace:
   
   ? __warn+0x8f/0x150
   ? __alloc_pages+0x2c4/0x360
   __kmalloc_large_node+0x86/0x160
   __kmalloc+0xcd/0x140
   virtio_fs_enqueue_req+0x240/0x6d0
   virtio_fs_wake_pending_and_unlock+0x7f/0x190
   queue_request_and_unlock+0x58/0x70
   fuse_simple_request+0x18b/0x2e0
   fuse_direct_io+0x58a/0x850
   fuse_file_read_iter+0xdb/0x130
   __kernel_read+0xf3/0x260
   kernel_read+0x45/0x60
   kernel_read_file+0x1ad/0x2b0
   init_module_from_file+0x6a/0xe0
   idempotent_init_module+0x179/0x230
   __x64_sys_finit_module+0x5d/0xb0
   do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   ..
   
  ---[ end trace  ]---

The warning is triggered when:

1) inserting a 10MB-sized kernel module kept in a virtiofs.
The finit_module() syscall handles the module insertion and invokes
kernel_read_file() to read the content of the module first.

2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
passes it to kernel_read(). kernel_read() constructs a kvec iter by
using iov_iter_kvec() and passes it to fuse_file_read_iter().

3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
fuse_direct_io(). For now, the maximal read size for a kvec iter is
only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
the size of the 10MB-sized buffer in out_args[0] of a fuse request and
passes the fuse request to virtio_fs_wake_pending_and_unlock().

4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
queue the request. Because the arguments in the fuse request may be kept
on the stack, virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce
buffer for all fuse args, copies these args into the bounce buffer and
passes the physical address of the bounce buffer to virtiofsd. The total
length of these fuse args for the passed fuse request is about 10MB, so
copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter
and it triggers the warning in __alloc_pages():

if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
return NULL;

5) virtio_fs_enqueue_req() will retry the memory allocation in a
kworker, but it won't help, because kmalloc() will always return NULL
due to the abnormal size and finit_module() will hang forever.

A feasible solution is to limit the value of max_read for virtio-fs, so
the length passed to kmalloc() will be limited. However, it would affect
the maximal read size for normal fuse reads. Writes to virtio-fs
initiated from the kernel have a similar problem, and there is currently
no way to limit fc->max_write in the kernel.

So instead of limiting both max_read and max_write in the kernel, cap
the maximal length of kvec iter IO by using max_pages in
fuse_direct_io(), just like it does for ubuf/iovec iter IO. Now the max
value for max_pages is 256, so on host with 4KB page size, the maximal
size passed to kmalloc() in copy_args_to_argbuf() is about 1MB+40B. The
allocation of 2MB of physically contiguous memory will still incur
significant stress on the memory subsystem, but the warning is fixed.
Additionally, the requirement for huge physically contiguous memory will
be removed in the following patch.

Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao 
---
 fs/fuse/file.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 148a71b8b4d0e..f90ea25e366f0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1423,6 +1423,16 @@ static int fuse_get_user_pages(struct fuse_args_pages 
*ap, struct iov_iter *ii,
return ret < 0 ? ret : 0;
 }
 
+static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc,
+  const struct iov_iter *iter, int write)
+{
+   unsigned int nmax = write ? fc->max_write : fc->max_read;
+
+   if (iov_iter_is_kvec(iter))
+   nmax = min(nmax, fc->max_pages << PAGE_SHIFT);
+   return nmax;
+}
+
 ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
   loff_t *ppos, int flags)
 {
@@ -1433,7 +1443,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
iov_iter *iter,
struct inode *inode = mapping->host;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fm->fc;
-   size_t nmax = write ? fc->max_write : fc->max_read;
+   size_t nmax = fuse_max_dio_rw_size(fc, iter, write);
loff_t pos = *ppos;
size_t count = 

Re: [PATCH net-next v2 0/3] tun: AF_XDP Tx zero-copy support

2024-02-28 Thread Jiri Pirko
Wed, Feb 28, 2024 at 12:04:41PM CET, wangyunj...@huawei.com wrote:
>Hi all:
>
>Now, some drivers support the zero-copy feature of AF_XDP sockets,
>which can significantly reduce CPU utilization for XDP programs.
>
>This patch set allows TUN to also support the AF_XDP Tx zero-copy
>feature. It is based on Linux 6.8.0+(openEuler 23.09) and has
>successfully passed Netperf and Netserver stress testing with
>multiple streams between VM A and VM B, using AF_XDP and OVS.
>
>The performance testing was performed on a Intel E5-2620 2.40GHz
>machine. Traffic were generated/send through TUN(testpmd txonly
>with AF_XDP) to VM (testpmd rxonly in guest).
>
>+--+-+-+-+
>|  |   copy  |zero-copy| speedup |
>+--+-+-+-+
>| UDP  |   Mpps  |   Mpps  |%|
>| 64   |   2.5   |   4.0   |   60%   |
>| 512  |   2.1   |   3.6   |   71%   |
>| 1024 |   1.9   |   3.3   |   73%   |
>+--+-+-+-+
>
>Yunjian Wang (3):
>  xsk: Remove non-zero 'dma_page' check in xp_assign_dev
>  vhost_net: Call peek_len when using xdp
>  tun: AF_XDP Tx zero-copy support

Threading of the patchset seems to be broken. Did you by any chance send
this with "--nothread" git-send-email option?
pw seems to cope fine with this though.


>
> drivers/net/tun.c   | 177 ++--
> drivers/vhost/net.c |  21 +++--
> include/linux/if_tun.h  |  32 
> net/xdp/xsk_buff_pool.c |   7 --
> 4 files changed, 220 insertions(+), 17 deletions(-)
>
>-- 
>2.41.0
>
>



[PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-02-28 Thread Yunjian Wang
This patch set allows TUN to support the AF_XDP Tx zero-copy feature,
which can significantly reduce CPU utilization for XDP programs.

Since commit fc72d1d54dd9 ("tuntap: XDP transmission"), the pointer
ring has been utilized to queue different types of pointers by encoding
the type into the lower bits. Therefore, we introduce a new flag,
TUN_XDP_DESC_FLAG(0x2UL), which allows us to enqueue XDP descriptors
and differentiate them from XDP buffers and sk_buffs. Additionally, a
spin lock is added for enabling and disabling operations on the xsk pool.

The performance testing was performed on an Intel E5-2620 2.40GHz machine.
Traffic was generated/sent through TUN (testpmd txonly with AF_XDP)
to a VM (testpmd rxonly in guest).

+--+-+-+-+
|  |   copy  |zero-copy| speedup |
+--+-+-+-+
| UDP  |   Mpps  |   Mpps  |%|
| 64   |   2.5   |   4.0   |   60%   |
| 512  |   2.1   |   3.6   |   71%   |
| 1024 |   1.9   |   3.3   |   73%   |
+--+-+-+-+

Signed-off-by: Yunjian Wang 
---
 drivers/net/tun.c  | 177 +++--
 drivers/vhost/net.c|   4 +
 include/linux/if_tun.h |  32 
 3 files changed, 208 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bc80fc1d576e..7f4ff50b532c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -63,6 +63,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -86,6 +87,7 @@ static void tun_default_link_ksettings(struct net_device *dev,
   struct ethtool_link_ksettings *cmd);
 
 #define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
+#define TUN_XDP_BATCH 64
 
 /* TUN device flags */
 
@@ -146,6 +148,9 @@ struct tun_file {
struct tun_struct *detached;
struct ptr_ring tx_ring;
struct xdp_rxq_info xdp_rxq;
+   struct xsk_buff_pool *xsk_pool;
+   spinlock_t pool_lock;   /* Protects xsk pool enable/disable */
+   u32 nb_descs;
 };
 
 struct tun_page {
@@ -614,6 +619,8 @@ void tun_ptr_free(void *ptr)
struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
xdp_return_frame(xdpf);
+   } else if (tun_is_xdp_desc_frame(ptr)) {
+   return;
} else {
__skb_array_destroy_skb(ptr);
}
@@ -631,6 +638,37 @@ static void tun_queue_purge(struct tun_file *tfile)
	skb_queue_purge(&tfile->sk.sk_error_queue);
 }
 
+static void tun_set_xsk_pool(struct tun_file *tfile, struct xsk_buff_pool 
*pool)
+{
+   if (!pool)
+   return;
+
+   spin_lock(&tfile->pool_lock);
+   xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
+   tfile->xsk_pool = pool;
+   spin_unlock(&tfile->pool_lock);
+}
+
+static void tun_clean_xsk_pool(struct tun_file *tfile)
+{
+   spin_lock(&tfile->pool_lock);
+   if (tfile->xsk_pool) {
+   void *ptr;
+
+   while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
+   tun_ptr_free(ptr);
+
+   if (tfile->nb_descs) {
+   xsk_tx_completed(tfile->xsk_pool, tfile->nb_descs);
+   if (xsk_uses_need_wakeup(tfile->xsk_pool))
+   xsk_set_tx_need_wakeup(tfile->xsk_pool);
+   tfile->nb_descs = 0;
+   }
+   tfile->xsk_pool = NULL;
+   }
+   spin_unlock(&tfile->pool_lock);
+}
+
 static void __tun_detach(struct tun_file *tfile, bool clean)
 {
struct tun_file *ntfile;
@@ -648,6 +686,11 @@ static void __tun_detach(struct tun_file *tfile, bool 
clean)
u16 index = tfile->queue_index;
BUG_ON(index >= tun->numqueues);
 
+   ntfile = rtnl_dereference(tun->tfiles[tun->numqueues - 1]);
+   /* Stop xsk zc xmit */
+   tun_clean_xsk_pool(tfile);
+   tun_clean_xsk_pool(ntfile);
+
rcu_assign_pointer(tun->tfiles[index],
   tun->tfiles[tun->numqueues - 1]);
ntfile = rtnl_dereference(tun->tfiles[index]);
@@ -668,6 +711,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun_flow_delete_by_queue(tun, tun->numqueues + 1);
/* Drop read queue */
tun_queue_purge(tfile);
+   tun_set_xsk_pool(ntfile, xsk_get_pool_from_qid(tun->dev, 
index));
tun_set_real_num_queues(tun);
} else if (tfile->detached && clean) {
tun = tun_enable_queue(tfile);
@@ -801,6 +845,7 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file,
 
if (tfile->xdp_rxq.queue_index != tfile->queue_index)
tfile->xdp_rxq.queue_index = tfile->queue_index;
+   tun_set_xsk_pool(tfile, xsk_get_pool_from_qid(dev, 
tfile->queue_index));
} else {
/* Setup XDP RX-queue info, for new tfile getting attached */

[PATCH net-next v2 2/3] vhost_net: Call peek_len when using xdp

2024-02-28 Thread Yunjian Wang
If TUN supports AF_XDP TX zero-copy, the XDP program will enqueue
packets to the XDP ring and wake up the vhost worker. This requires
the vhost worker to call peek_len(), which can be used to consume
XDP descriptors.

Signed-off-by: Yunjian Wang 
---
 drivers/vhost/net.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..077e74421558 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -207,6 +207,11 @@ static int vhost_net_buf_peek_len(void *ptr)
return __skb_array_len_with_tag(ptr);
 }
 
+static bool vhost_sock_xdp(struct socket *sock)
+{
+   return sock_flag(sock->sk, SOCK_XDP);
+}
+
 static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
 {
	struct vhost_net_buf *rxq = &nvq->rxq;
@@ -214,6 +219,13 @@ static int vhost_net_buf_peek(struct vhost_net_virtqueue 
*nvq)
if (!vhost_net_buf_is_empty(rxq))
goto out;
 
+   if (ptr_ring_empty(nvq->rx_ring)) {
+   struct socket *sock = vhost_vq_get_backend(&nvq->vq);
+   /* Call peek_len to consume XSK descriptors, when using xdp */
+   if (vhost_sock_xdp(sock) && sock->ops->peek_len)
+   sock->ops->peek_len(sock);
+   }
+
if (!vhost_net_buf_produce(nvq))
return 0;
 
@@ -346,11 +358,6 @@ static bool vhost_sock_zcopy(struct socket *sock)
sock_flag(sock->sk, SOCK_ZEROCOPY);
 }
 
-static bool vhost_sock_xdp(struct socket *sock)
-{
-   return sock_flag(sock->sk, SOCK_XDP);
-}
-
 /* In case of DMA done not in order in lower device driver for some reason.
  * upend_idx is used to track end of used idx, done_idx is used to track head
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
-- 
2.41.0




[PATCH net-next v2 1/3] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-28 Thread Yunjian Wang
Now DMA mappings are used by physical NICs. However, a vNIC
may not need them. So remove the non-zero 'dma_page' check in
xp_assign_dev().

Signed-off-by: Yunjian Wang 
---
 net/xdp/xsk_buff_pool.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index ce60ecd48a4d..a5af75b1f43c 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
if (err)
goto err_unreg_pool;
 
-   if (!pool->dma_pages) {
-   WARN(1, "Driver did not DMA map zero-copy buffers");
-   err = -EINVAL;
-   goto err_unreg_xsk;
-   }
pool->umem->zc = true;
return 0;
 
-err_unreg_xsk:
-   xp_disable_drv_zc(pool);
 err_unreg_pool:
if (!force_zc)
err = 0; /* fallback to copy mode */
-- 
2.41.0




[PATCH net-next v2 0/3] tun: AF_XDP Tx zero-copy support

2024-02-28 Thread Yunjian Wang
Hi all:

Now, some drivers support the zero-copy feature of AF_XDP sockets,
which can significantly reduce CPU utilization for XDP programs.

This patch set allows TUN to also support the AF_XDP Tx zero-copy
feature. It is based on Linux 6.8.0+ (openEuler 23.09) and has
successfully passed Netperf and Netserver stress testing with
multiple streams between VM A and VM B, using AF_XDP and OVS.

The performance testing was performed on an Intel E5-2620 2.40GHz
machine. Traffic was generated/sent through TUN (testpmd txonly
with AF_XDP) to a VM (testpmd rxonly in guest).

+--+-+-+-+
|  |   copy  |zero-copy| speedup |
+--+-+-+-+
| UDP  |   Mpps  |   Mpps  |%|
| 64   |   2.5   |   4.0   |   60%   |
| 512  |   2.1   |   3.6   |   71%   |
| 1024 |   1.9   |   3.3   |   73%   |
+--+-+-+-+

Yunjian Wang (3):
  xsk: Remove non-zero 'dma_page' check in xp_assign_dev
  vhost_net: Call peek_len when using xdp
  tun: AF_XDP Tx zero-copy support

 drivers/net/tun.c   | 177 ++--
 drivers/vhost/net.c |  21 +++--
 include/linux/if_tun.h  |  32 
 net/xdp/xsk_buff_pool.c |   7 --
 4 files changed, 220 insertions(+), 17 deletions(-)

-- 
2.41.0




[PATCH net-next v6 4/5] vhost/net: remove vhost_net_page_frag_refill()

2024-02-28 Thread Yunsheng Lin
The page frag in vhost_net_page_frag_refill() uses the
'struct page_frag' from skb_page_frag_refill(), but its
implementation is similar to page_frag_alloc_align() now.

This patch removes vhost_net_page_frag_refill() by using
'struct page_frag_cache' instead of 'struct page_frag',
and allocating frag using page_frag_alloc_align().

The added benefit is that it not only unifies the page frag
implementation a little, but also yields about a 0.5% performance
boost when testing with the vhost_net_test introduced in the
last patch.

Signed-off-by: Yunsheng Lin 
Acked-by: Jason Wang 
---
 drivers/vhost/net.c | 91 ++---
 1 file changed, 27 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e574e21cc0ca..4b2fcb228a0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -141,10 +141,8 @@ struct vhost_net {
unsigned tx_zcopy_err;
/* Flush in progress. Protected by tx vq lock. */
bool tx_flush;
-   /* Private page frag */
-   struct page_frag page_frag;
-   /* Refcount bias of page frag */
-   int refcnt_bias;
+   /* Private page frag cache */
+   struct page_frag_cache pf_cache;
 };
 
 static unsigned vhost_net_zcopy_mask __read_mostly;
@@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, 
size_t total_len)
   !vhost_vq_avail_empty(vq->dev, vq);
 }
 
-static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
-  struct page_frag *pfrag, gfp_t gfp)
-{
-   if (pfrag->page) {
-   if (pfrag->offset + sz <= pfrag->size)
-   return true;
-   __page_frag_cache_drain(pfrag->page, net->refcnt_bias);
-   }
-
-   pfrag->offset = 0;
-   net->refcnt_bias = 0;
-   if (SKB_FRAG_PAGE_ORDER) {
-   /* Avoid direct reclaim but allow kswapd to wake */
-   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
- __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY | __GFP_NOMEMALLOC,
- SKB_FRAG_PAGE_ORDER);
-   if (likely(pfrag->page)) {
-   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
-   goto done;
-   }
-   }
-   pfrag->page = alloc_page(gfp);
-   if (likely(pfrag->page)) {
-   pfrag->size = PAGE_SIZE;
-   goto done;
-   }
-   return false;
-
-done:
-   net->refcnt_bias = USHRT_MAX;
-   page_ref_add(pfrag->page, USHRT_MAX - 1);
-   return true;
-}
-
 #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
 
 static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
@@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
struct vhost_net *net = container_of(vq->dev, struct vhost_net,
 dev);
struct socket *sock = vhost_vq_get_backend(vq);
-   struct page_frag *alloc_frag = &net->page_frag;
struct virtio_net_hdr *gso;
	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
struct tun_xdp_hdr *hdr;
@@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
int sock_hlen = nvq->sock_hlen;
void *buf;
int copied;
+   int ret;
 
if (unlikely(len < nvq->sock_hlen))
return -EFAULT;
@@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
return -ENOSPC;
 
buflen += SKB_DATA_ALIGN(len + pad);
-   alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
-   if (unlikely(!vhost_net_page_frag_refill(net, buflen,
-alloc_frag, GFP_KERNEL)))
+   buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
+   SMP_CACHE_BYTES);
+   if (unlikely(!buf))
return -ENOMEM;
 
-   buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-   copied = copy_page_from_iter(alloc_frag->page,
-alloc_frag->offset +
-offsetof(struct tun_xdp_hdr, gso),
-sock_hlen, from);
-   if (copied != sock_hlen)
-   return -EFAULT;
+   copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso),
+   sock_hlen, from);
+   if (copied != sock_hlen) {
+   ret = -EFAULT;
+   goto err;
+   }
 
hdr = buf;
	gso = &hdr->gso;
@@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
*nvq,
   vhost16_to_cpu(vq, gso->csum_start) +
   vhost16_to_cpu(vq, gso->csum_offset) + 2);
 
-   if 

[PATCH net-next v6 5/5] tools: virtio: introduce vhost_net_test

2024-02-28 Thread Yunsheng Lin
Introduce vhost_net_test for both vhost_net tx and rx, based on
virtio_test, to test vhost_net changes in the kernel.

Steps for vhost_net tx testing:
1. Prepare an out buf.
2. Kick the vhost_net to do tx processing.
3. Do the receiving in the tun side.
4. Verify that the data received by tun is correct.

Steps for vhost_net rx testing:
1. Prepare an in buf.
2. Do the sending in the tun side.
3. Kick the vhost_net to do rx processing.
4. Verify that the data received by vhost_net is correct.

Signed-off-by: Yunsheng Lin 
---
 tools/virtio/.gitignore|   1 +
 tools/virtio/Makefile  |   8 +-
 tools/virtio/linux/virtio_config.h |   4 +
 tools/virtio/vhost_net_test.c  | 532 +
 4 files changed, 542 insertions(+), 3 deletions(-)
 create mode 100644 tools/virtio/vhost_net_test.c

diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
index 9934d48d9a55..7e47b281c442 100644
--- a/tools/virtio/.gitignore
+++ b/tools/virtio/.gitignore
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 *.d
 virtio_test
+vhost_net_test
 vringh_test
 virtio-trace/trace-agent
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index d128925980e0..e25e99c1c3b7 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vhost_net_test
 virtio_test: virtio_ring.o virtio_test.o
 vringh_test: vringh_test.o vringh.o virtio_ring.o
+vhost_net_test: virtio_ring.o vhost_net_test.o
 
 try-run = $(shell set -e;  \
if ($(1)) >/dev/null 2>&1;  \
@@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
 
 .PHONY: all test mod clean vhost oot oot-clean oot-build
 clean:
-   ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
-  vhost_test/Module.symvers vhost_test/modules.order *.d
+   ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
+  vhost_test/.*.cmd vhost_test/Module.symvers \
+  vhost_test/modules.order *.d
 -include *.d
diff --git a/tools/virtio/linux/virtio_config.h 
b/tools/virtio/linux/virtio_config.h
index 2a8a70e2a950..42a564f22f2d 100644
--- a/tools/virtio/linux/virtio_config.h
+++ b/tools/virtio/linux/virtio_config.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_VIRTIO_CONFIG_H
+#define LINUX_VIRTIO_CONFIG_H
 #include 
 #include 
 #include 
@@ -95,3 +97,5 @@ static inline __virtio64 cpu_to_virtio64(struct virtio_device 
*vdev, u64 val)
 {
return __cpu_to_virtio64(virtio_is_little_endian(vdev), val);
 }
+
+#endif
diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
new file mode 100644
index ..389d99a6d7c7
--- /dev/null
+++ b/tools/virtio/vhost_net_test.c
@@ -0,0 +1,532 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define HDR_LEN sizeof(struct virtio_net_hdr_mrg_rxbuf)
+#define TEST_BUF_LEN   256
+#define TEST_PTYPE ETH_P_LOOPBACK
+#define DESC_NUM   256
+
+/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+struct vq_info {
+   int kick;
+   int call;
+   int idx;
+   long started;
+   long completed;
+   struct pollfd fds;
+   void *ring;
+   /* copy used for control */
+   struct vring vring;
+   struct virtqueue *vq;
+};
+
+struct vdev_info {
+   struct virtio_device vdev;
+   int control;
+   struct vq_info vqs[2];
+   int nvqs;
+   void *buf;
+   size_t buf_size;
+   char *test_buf;
+   char *res_buf;
+   struct vhost_memory *mem;
+   int sock;
+   int ifindex;
+   unsigned char mac[ETHER_ADDR_LEN];
+};
+
+static int tun_alloc(struct vdev_info *dev, char *tun_name)
+{
+   struct ifreq ifr;
+   int len = HDR_LEN;
+   int fd, e;
+
+   fd = open("/dev/net/tun", O_RDWR);
+   if (fd < 0) {
+   perror("Cannot open /dev/net/tun");
+   return fd;
+   }
+
+   memset(&ifr, 0, sizeof(ifr));
+
+   ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+   strncpy(ifr.ifr_name, tun_name, IFNAMSIZ);
+
+   e = ioctl(fd, TUNSETIFF, &ifr);
+   if (e < 0) {
+   perror("ioctl[TUNSETIFF]");
+   close(fd);
+   return e;
+   }
+
+   e = ioctl(fd, TUNSETVNETHDRSZ, &len);
+   if (e < 0) {
+   perror("ioctl[TUNSETVNETHDRSZ]");
+   close(fd);
+   return e;
+   }
+
+   e = ioctl(fd, SIOCGIFHWADDR, &ifr);
+   if (e < 0) {
+   perror("ioctl[SIOCGIFHWADDR]");
+   close(fd);
+   return e;
+   }
+
+

[PATCH net-next v6 2/5] page_frag: unify gfp bits for order 3 page allocation

2024-02-28 Thread Yunsheng Lin
Currently there seem to be three page frag implementations,
all of which try to allocate an order 3 page; if that fails,
they fall back to allocating an order 0 page. Each of them
allows the order 3 page allocation to fail under certain
conditions by using specific gfp bits.

The gfp bits for order 3 page allocation differ between
implementations: __GFP_NOMEMALLOC is or'd to forbid access
to emergency reserve memory in __page_frag_cache_refill(),
but it is not or'd in the other implementations;
__GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim
in vhost_net_page_frag_refill(), but it is not masked off
in __page_frag_cache_refill().

This patch unifies the gfp bits used across the
implementations by or'ing __GFP_NOMEMALLOC and masking off
__GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
possible pressure on mm.

Leave the gfp unifying for page frag implementation in sock.c
for now as suggested by Paolo Abeni.

Signed-off-by: Yunsheng Lin 
Reviewed-by: Alexander Duyck 
CC: Alexander Duyck 
---
 drivers/vhost/net.c | 2 +-
 mm/page_alloc.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..e574e21cc0ca 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
*net, unsigned int sz,
/* Avoid direct reclaim but allow kswapd to wake */
pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
  __GFP_COMP | __GFP_NOWARN |
- __GFP_NORETRY,
+ __GFP_NORETRY | __GFP_NOMEMALLOC,
  SKB_FRAG_PAGE_ORDER);
if (likely(pfrag->page)) {
pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0f7e67c4250..636145c29f70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct 
page_frag_cache *nc,
gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-   gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-   __GFP_NOMEMALLOC;
+   gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
+  __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
PAGE_FRAG_CACHE_MAX_ORDER);
nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-- 
2.33.0




Re: [PATCH v3 1/7] remoteproc: Add TEE support

2024-02-28 Thread Arnaud POULIQUEN
Hello Mathieu,


On 2/23/24 19:27, Mathieu Poirier wrote:
> On Wed, Feb 14, 2024 at 06:21:21PM +0100, Arnaud Pouliquen wrote:
>> From: Arnaud Pouliquen 
>>
>> Add a remoteproc TEE (Trusted Execution Environment) driver
>> that will be probed by the TEE bus. If the associated trusted
>> application is supported on the secure side, this device offers
>> a client interface to load firmware into the secure side. That
>> firmware can then be authenticated and decrypted by the trusted
>> application.
>>
>> Signed-off-by: Arnaud Pouliquen 
>> ---
>> update from V2
>> - Use 'tee_rproc' prefix for all functions
>> - rename tee_rproc_get_loaded_rsc_table to tee_rproc_find_loaded_rsc_table
>> - redefine functions to better match the rproc_ops structure format
>>  - replace 'struct tee_rproc' parameter by 'struct rproc' parameter
>>  - rename 'rproc_tee_get_rsc_table()' to tee_rproc_get_loaded_rsc_table()
>>and rework it to remove the cached_table management.
>>  - introduce tee_rproc_get_context() to get the tee_rproc struct from the
>>   rproc struct
>>  - rename tee_rproc_get_loaded_rsc_table() to 
>> tee_rproc_find_loaded_rsc_table()
>> - remove useless check on tee_rproc_ctx structure in tee_rproc_register()
>>   and tee_rproc_unregister()
>> - fix test on the return of tee_rproc_ctx = devm_kzalloc()
>> - remove useless includes and unused tee_rproc_mem structure.
>> ---
>>  drivers/remoteproc/Kconfig  |   9 +
>>  drivers/remoteproc/Makefile |   1 +
>>  drivers/remoteproc/tee_remoteproc.c | 397 
>>  include/linux/tee_remoteproc.h  | 102 +++
>>  4 files changed, 509 insertions(+)
>>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>>  create mode 100644 include/linux/tee_remoteproc.h
>>
>> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
>> index 48845dc8fa85..85299606806c 100644
>> --- a/drivers/remoteproc/Kconfig
>> +++ b/drivers/remoteproc/Kconfig
>> @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC
>>  
>>It's safe to say N if not interested in using RPU r5f cores.
>>  
>> +
>> +config TEE_REMOTEPROC
>> +tristate "trusted firmware support by a TEE application"
>> +depends on OPTEE
>> +help
>> +  Support for trusted remote processor firmware. The firmware
>> +  authentication and/or decryption are managed by a trusted application.
>> +  This can be either built-in or a loadable module.
>> +
>>  endif # REMOTEPROC
>>  
>>  endmenu
>> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
>> index 91314a9b43ce..fa8daebce277 100644
>> --- a/drivers/remoteproc/Makefile
>> +++ b/drivers/remoteproc/Makefile
>> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)  += rcar_rproc.o
>>  obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o
>>  obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o
>>  obj-$(CONFIG_STM32_RPROC)   += stm32_rproc.o
>> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o
>>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)  += ti_k3_dsp_remoteproc.o
>>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)   += ti_k3_r5_remoteproc.o
>>  obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o
>> diff --git a/drivers/remoteproc/tee_remoteproc.c 
>> b/drivers/remoteproc/tee_remoteproc.c
>> new file mode 100644
>> index ..ac727e062d00
>> --- /dev/null
>> +++ b/drivers/remoteproc/tee_remoteproc.c
>> @@ -0,0 +1,397 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved
>> + * Author: Arnaud Pouliquen 
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "remoteproc_internal.h"
>> +
>> +#define MAX_TEE_PARAM_ARRY_MEMBER   4
>> +
>> +/*
>> + * Authentication of the firmware and load in the remote processor memory
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + * [in]  params[1].memref:  buffer containing the firmware image
>> + */
>> +#define TA_RPROC_FW_CMD_LOAD_FW 1
>> +
>> +/*
>> + * Start the remote processor
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + */
>> +#define TA_RPROC_FW_CMD_START_FW2
>> +
>> +/*
>> + * Stop the remote processor
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + */
>> +#define TA_RPROC_FW_CMD_STOP_FW 3
>> +
>> +/*
>> + * Return the address of the resource table, or 0 if not found
>> + * No check is done to verify that the address returned is accessible by
>> + * the non secure context. If the resource table is loaded in a protected
>> + * memory the access by the non secure context will lead to a data abort.
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + * [out]  params[1].value.a:32bit LSB resource table memory address
>> + * [out] 

Re: [PATCH v2 2/3] media: usb: pvrusb2: Fix Wcast-function-type-strict warnings

2024-02-28 Thread Hans Verkuil
On 26/02/2024 18:32, Ricardo Ribalda wrote:
> Building with LLVM=1 throws the following warnings:
> drivers/media/usb/pvrusb2/pvrusb2-context.c:110:6: warning: cast from 'void 
> (*)(struct pvr2_context *)' to 'void (*)(void *)' converts to incompatible 
> function type [-Wcast-function-type-strict]
> drivers/media/usb/pvrusb2/pvrusb2-v4l2.c:1070:30: warning: cast from 'void 
> (*)(struct pvr2_v4l2_fh *)' to 'pvr2_stream_callback' (aka 'void (*)(void 
> *)') converts to incompatible function type [-Wcast-function-type-strict] 
> drivers/media/usb/pvrusb2/pvrusb2-dvb.c:152:6: warning: cast from 'void 
> (*)(struct pvr2_dvb_adapter *)' to 'pvr2_stream_callback' (aka 'void (*)(void 
> *)') converts to incompatible function type [-Wcast-function-type-strict]

This patch from Arnd has already been merged, superseding your patch:

https://lore.kernel.org/linux-media/20240213100439.457403-1-a...@kernel.org/

Marking your patch as Obsolete in patchwork.

Regards,

Hans

> 
> Reviewed-by: Nathan Chancellor 
> Signed-off-by: Ricardo Ribalda 
> ---
>  drivers/media/usb/pvrusb2/pvrusb2-context.c | 5 +++--
>  drivers/media/usb/pvrusb2/pvrusb2-dvb.c | 7 ---
>  drivers/media/usb/pvrusb2/pvrusb2-v4l2.c| 7 ---
>  3 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/media/usb/pvrusb2/pvrusb2-context.c 
> b/drivers/media/usb/pvrusb2/pvrusb2-context.c
> index 1764674de98bc..16edabda191c4 100644
> --- a/drivers/media/usb/pvrusb2/pvrusb2-context.c
> +++ b/drivers/media/usb/pvrusb2/pvrusb2-context.c
> @@ -90,8 +90,9 @@ static void pvr2_context_destroy(struct pvr2_context *mp)
>  }
>  
>  
> -static void pvr2_context_notify(struct pvr2_context *mp)
> +static void pvr2_context_notify(void *data)
>  {
> + struct pvr2_context *mp = data;
>   pvr2_context_set_notify(mp,!0);
>  }
>  
> @@ -107,7 +108,7 @@ static void pvr2_context_check(struct pvr2_context *mp)
>  "pvr2_context %p (initialize)", mp);
>   /* Finish hardware initialization */
>   if (pvr2_hdw_initialize(mp->hdw,
> - (void (*)(void *))pvr2_context_notify,
> + pvr2_context_notify,
>   mp)) {
>   mp->video_stream.stream =
>   pvr2_hdw_get_video_stream(mp->hdw);
> diff --git a/drivers/media/usb/pvrusb2/pvrusb2-dvb.c 
> b/drivers/media/usb/pvrusb2/pvrusb2-dvb.c
> index 26811efe0fb58..8b9f1a09bd53d 100644
> --- a/drivers/media/usb/pvrusb2/pvrusb2-dvb.c
> +++ b/drivers/media/usb/pvrusb2/pvrusb2-dvb.c
> @@ -88,8 +88,9 @@ static int pvr2_dvb_feed_thread(void *data)
>   return stat;
>  }
>  
> -static void pvr2_dvb_notify(struct pvr2_dvb_adapter *adap)
> +static void pvr2_dvb_notify(void *data)
>  {
> + struct pvr2_dvb_adapter *adap = data;
>   wake_up(>buffer_wait_data);
>  }
>  
> @@ -148,8 +149,8 @@ static int pvr2_dvb_stream_do_start(struct 
> pvr2_dvb_adapter *adap)
>   if (!(adap->buffer_storage[idx])) return -ENOMEM;
>   }
>  
> - pvr2_stream_set_callback(pvr->video_stream.stream,
> -  (pvr2_stream_callback) pvr2_dvb_notify, adap);
> + pvr2_stream_set_callback(pvr->video_stream.stream, pvr2_dvb_notify,
> +  adap);
>  
>   ret = pvr2_stream_set_buffer_count(stream, PVR2_DVB_BUFFER_COUNT);
>   if (ret < 0) return ret;
> diff --git a/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c 
> b/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c
> index c04ab7258d645..590f80949bbfc 100644
> --- a/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c
> +++ b/drivers/media/usb/pvrusb2/pvrusb2-v4l2.c
> @@ -1032,9 +1032,10 @@ static int pvr2_v4l2_open(struct file *file)
>   return 0;
>  }
>  
> -
> -static void pvr2_v4l2_notify(struct pvr2_v4l2_fh *fhp)
> +static void pvr2_v4l2_notify(void *data)
>  {
> + struct pvr2_v4l2_fh *fhp = data;
> +
>   wake_up(>wait_data);
>  }
>  
> @@ -1067,7 +1068,7 @@ static int pvr2_v4l2_iosetup(struct pvr2_v4l2_fh *fh)
>  
>   hdw = fh->channel.mc_head->hdw;
>   sp = fh->pdi->stream->stream;
> - pvr2_stream_set_callback(sp,(pvr2_stream_callback)pvr2_v4l2_notify,fh);
> + pvr2_stream_set_callback(sp, pvr2_v4l2_notify, fh);
>   pvr2_hdw_set_stream_type(hdw,fh->pdi->config);
>   if ((ret = pvr2_hdw_set_streaming(hdw,!0)) < 0) return ret;
>   return pvr2_ioread_set_enabled(fh->rhp,!0);
>