date:20240320

Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread Jakub Kicinski

On Thu, 21 Mar 2024 11:09:18 +0800 (CST) xu.xi...@zte.com.cn wrote:
> +/* This part must be outside protection */
> +#include 
> \ No newline at end of file

In addition to Jason's comments please make sure there is a new line at
the end of the file.

And please post v4 on Monday, net-next is currently closed.

While you wait have a read of:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html

:)

Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread Jason Xing

On Thu, Mar 21, 2024 at 11:09 AM  wrote:
>
> From: he peilin 
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
>
> 1. Giving an usecase example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Changelog
> -
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> 1. Change the tracking directory to/sys/kernel/tracking.
> 2. Adjust the layout of the TP-STRUCT_entry parameter structure.
>
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
> 1. adjust the trace_icmp_send() to more protocols than UDP.
> 2. move the calling of trace_icmp_send after sanity checks
> in __icmp_send().
>
> Signed-off-by: Peilin He
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 

I think it would be better to target net-next tree since it's not a
fix or something else important.

> ---
>  include/trace/events/icmp.h | 64 +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 68 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..2098d4b1b12e
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   ),
> +
> +   TP_fast_assign(
> +   struct iphdr *iph = ip_hdr(skb);
> +   int proto_4 = iph->protocol;
> +   __be32 *p32;
> +
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   if (proto_4 == IPPROTO_UDP) {
> +   struct udphdr *uh = udp_hdr(skb);
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);
> +   } else {
> +   __entry->sport = 0;
> +   __entry->dport = 0;
> +   __entry->ulen = 0;
> +   }

What about using the TP_STORE_ADDR_PORTS_SKB macro to record the sport
and dport like the patch[1] did through extending the use of header
for TCP and UDP?

And, I wonder what the use of tracing ulen of that skb?

[1]: 
https://lore.kernel.org/all/1c7156a3f164eb33ef3a25b8432e359f0bb60a8e.1710866188.git.balazs.scheid...@axoflow.com/

Thanks,
Jason

> +
> +   p32 = (__be32 *) __entry->saddr;
> +   *p32 = iph->saddr;
> +
> +   p32 = (__be32 *) __entry->daddr;
> +   *p32 = iph->daddr;
> +   ),
> +
> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to 
> %pI4:%u ulen=%d skbaddr=%p",
> +   __entry->type, __entry->code,
> +

[PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread xu.xin16

From: he peilin 

Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.

1. Giving an usecase example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:
 icmp_send: icmp_send: type=3, code=3.
 From 127.0.0.1:41895 to 127.0.0.1: ulen=23
 skbaddr=589b167a

Changelog
-
v2->v3:
Some fixes according to
https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracking directory to/sys/kernel/tracking.
2. Adjust the layout of the TP-STRUCT_entry parameter structure.

v1->v2:
Some fixes according to
https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
1. adjust the trace_icmp_send() to more protocols than UDP.
2. move the calling of trace_icmp_send after sanity checks
in __icmp_send().

Signed-off-by: Peilin He
Reviewed-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 64 +
 net/ipv4/icmp.c |  4 +++
 2 files changed, 68 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index ..2098d4b1b12e
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(unsigned short, ulen)
+   ),
+
+   TP_fast_assign(
+   struct iphdr *iph = ip_hdr(skb);
+   int proto_4 = iph->protocol;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   if (proto_4 == IPPROTO_UDP) {
+   struct udphdr *uh = udp_hdr(skb);
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   } else {
+   __entry->sport = 0;
+   __entry->dport = 0;
+   __entry->ulen = 0;
+   }
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = iph->saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 = iph->daddr;
+   ),
+
+   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
ulen=%d skbaddr=%p",
+   __entry->type, __entry->code,
+   __entry->saddr, __entry->sport, __entry->daddr,
+   __entry->dport, __entry->ulen, __entry->skbaddr)
+);
+
+#endif /* _TRACE_ICMP_H */
+
+/* This part must be outside protection */
+#include 
\ No newline at end of file
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..21fb41257fe9 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -92,6 +92,8 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 

 /*
  * Build xmit assembly blocks
@@ -672,6 +674,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
code, __be32 info,
}
}

+   trace_icmp_send(skb_in, type, code);
+
/* Needed by both

Re: [PATCH v3] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread Jason Xing

On Thu, Mar 21, 2024 at 10:12 AM  wrote:
>
> From: he peilin 
>
>
> Introduce a tracepoint for icmp_send, which can help users to get more
>
> detail information conveniently when icmp abnormal events happen.
>
>
> 1. Giving an usecase example:
>
> =
>
> When an application experiences packet loss due to an unreachable UDP
>
> destination port, the kernel will send an exception message through the
>
> icmp_send function. By adding a trace point for icmp_send, developers or
>
> system administrators can obtain detailed information about the UDP
>
> packet loss, including the type, code, source address, destination address,
>
> source port, and destination port. This facilitates the trouble-shooting
>
> of UDP packet loss issues especially for those network-service
>
> applications.
>
>
> 2. Operation Instructions:
>
> ==
>
> Switch to the tracing directory.
>
> cd /sys/kernel/tracing
>
> Filter for destination port unreachable.
>
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
>
> Enable trace event.
>
> echo 1 > events/icmp/icmp_send/enable
>
>
> 3. Result View:
>
> 
>
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>
>  icmp_send: icmp_send: type=3, code=3.
>
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>
>  skbaddr=589b167a
>
>
> Changelog
>
> -
>
> v2->v3:
>
> Some fixes according to
>
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
>
> 1. Change the tracking directory to/sys/kernel/tracking.
>
> 2. Adjust the layout of the TP-STRUCT_entry parameter structure.
>
>
> v1->v2:
>
> Some fixes according to
>
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
>
> 1. adjust the trace_icmp_send() to more protocols than UDP.
>
> 2. move the calling of trace_icmp_send after sanity checks
>
> in __icmp_send().
>
>
> Signed-off-by: Peilin He
>
> Reviewed-by: xu xin 
>
> Reviewed-by: Yunkai Zhang 
>
> Cc: Yang Yang 
>
> Cc: Liu Chun 
>
> Cc: Xuexin Jiang 

The format of the whole patch looks strange... Did you send this patch
by using 'git send-email' instead of pasting the text and sending?

>
> ---
>
>  include/trace/events/icmp.h | 64 +
>
>  net/ipv4/icmp.c |  4 +++
>
>  2 files changed, 68 insertions(+)
>
>  create mode 100644 include/trace/events/icmp.h
>
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
>
> new file mode 100644
>
> index ..2098d4b1b12e
>
> --- /dev/null
>
> +++ b/include/trace/events/icmp.h
>
> @@ -0,0 +1,64 @@
>
> +/* SPDX-License-Identifier: GPL-2.0 */
>
> +#undef TRACE_SYSTEM
>
> +#define TRACE_SYSTEM icmp
>
> +
>
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
>
> +#define _TRACE_ICMP_H
>
> +
>
> +#include 
>
> +#include 
>
> +
>
> +TRACE_EVENT(icmp_send,
>
> +
>
> + TP_PROTO(const struct sk_buff *skb, int type, int code),
>
> +
>
> + TP_ARGS(skb, type, code),
>
> +
>
> + TP_STRUCT__entry(
>
> + __field(const void *, skbaddr)
>
> + __field(int, type)
>
> + __field(int, code)
>
> + __array(__u8, saddr, 4)
>
> + __array(__u8, daddr, 4)
>
> + __field(__u16, sport)
>
> + __field(__u16, dport)
>
> + __field(unsigned short, ulen)
>
> + ),
>
> +
>
> + TP_fast_assign(
>
> + struct iphdr *iph = ip_hdr(skb);
>
> + int proto_4 = iph->protocol;
>
> + __be32 *p32;
>
> +
>
> + __entry->skbaddr = skb;
>
> + __entry->type = type;
>
> + __entry->code = code;
>
> +
>
> + if (proto_4 == IPPROTO_UDP) {
>
> + struct udphdr *uh = udp_hdr(skb);
>
> + __entry->sport = ntohs(uh->source);
>
> + __entry->dport = ntohs(uh->dest);
>
> + __entry->ulen = ntohs(uh->len);
>
> + } else {
>
> + __entry->sport = 0;
>
> + __entry->dport = 0;
>
> + __entry->ulen = 0;
>
> + }
>
> +
>
> + p32 = (__be32 *) __entry->saddr;
>
> + *p32 = iph->saddr;
>
> +
>
> + p32 = (__be32 *) __entry->daddr;
>
> + *p32 = iph->daddr;
>
> + ),
>
> +
>
> + TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d 
> skbaddr=%p",
>
> + __entry->type, __entry->code,
>
> + __entry->saddr, __entry->sport, __entry->daddr,
>
> + __entry->dport, __entry->ulen, __entry->skbaddr)
>
> +);
>
> +
>
> +#endif /* _TRACE_ICMP_H */
>
> +
>
> +/* This part must be outside protection */
>
> +#include 
>
> \ No newline at end of file
>
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>
> index e63a3bf99617..21fb41257fe9 100644
>
> --- a/net/ipv4/icmp.c
>
> +++ b/net/ipv4/icmp.c
>
> @@ -92,6 +92,8 @@
>
>  #include 
>
>  #include 
>
>  #include 
>
> +#define CREATE_TRACE_POINTS
>
> +#include 
>
>
>
>  /*
>
>   * Build xmit assembly blocks
>
> @@ -672,6 +674,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
> code, __be32 info,
>
>   }
>
>   }
>
>
>
> + trace_icmp_send(skb_in, type, code);
>
> +
>
>   /* Needed by both icmp_global_allow and icmp_xmit_lock */
>
>   local_bh_disable();
>
>
>
> --
>
> 2.44.0
>
>
>
>
>

[PATCH v3] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread xu.xin16

From: he peilin 

Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.

1. Giving an usecase example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:
 icmp_send: icmp_send: type=3, code=3.
 From 127.0.0.1:41895 to 127.0.0.1: ulen=23
 skbaddr=589b167a

Changelog
-
v2->v3:
Some fixes according to
https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracking directory to/sys/kernel/tracking.
2. Adjust the layout of the TP-STRUCT_entry parameter structure.

v1->v2:
Some fixes according to
https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
1. adjust the trace_icmp_send() to more protocols than UDP.
2. move the calling of trace_icmp_send after sanity checks
in __icmp_send().

Signed-off-by: Peilin He
Reviewed-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 64 +
 net/ipv4/icmp.c |  4 +++
 2 files changed, 68 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index ..2098d4b1b12e
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(unsigned short, ulen)
+   ),
+
+   TP_fast_assign(
+   struct iphdr *iph = ip_hdr(skb);
+   int proto_4 = iph->protocol;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   if (proto_4 == IPPROTO_UDP) {
+   struct udphdr *uh = udp_hdr(skb);
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   } else {
+   __entry->sport = 0;
+   __entry->dport = 0;
+   __entry->ulen = 0;
+   }
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = iph->saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 = iph->daddr;
+   ),
+
+   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
ulen=%d skbaddr=%p",
+   __entry->type, __entry->code,
+   __entry->saddr, __entry->sport, __entry->daddr,
+   __entry->dport, __entry->ulen, __entry->skbaddr)
+);
+
+#endif /* _TRACE_ICMP_H */
+
+/* This part must be outside protection */
+#include 
\ No newline at end of file
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..21fb41257fe9 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -92,6 +92,8 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 
 
 /*
  * Build xmit assembly blocks
@@ -672,6 +674,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
code, __be32 info,
}
}
 
+   trace_icmp_send(skb_in, type, code);
+
/* Needed by both

[PATCH v2 0/2] Improvements to virtio_balloon pm

2024-03-20 Thread David Stevens

From: David Stevens 

The virtio_balloon driver uses wakeup sources to allow the guest to
enter system power management sleep states (e.g. s2idle) without running
the risk of becoming unresponsive to cooperative memory management
requests from the host. This series fixes an issue where wakeup sources
for inflate/deflate were improperly shared between drivers. It also
closes a race where stats requests that come in immediately before a
sleep state transition could fail to be handled in a timely manner.

v1: https://lore.kernel.org/lkml/20240318091034.535573-1-steve...@google.com/

v1 -> v2:
 - Add comment about virtio-balloon's wakeup source
 - Rename virtio-balloon wakeup event macros

David Stevens (2):
  virtio_balloon: Give the balloon its own wakeup source
  virtio_balloon: Treat stats requests as wakeup events

 drivers/virtio/virtio_balloon.c | 84 +
 1 file changed, 55 insertions(+), 29 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.291.gc1ea87d7ee-goog

[PATCH v2 2/2] virtio_balloon: Treat stats requests as wakeup events

2024-03-20 Thread David Stevens

From: David Stevens 

Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens 
Acked-by: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 75 -
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 89bc8da80519..b09e8e3c62e5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -121,11 +121,14 @@ struct virtio_balloon {
struct page_reporting_dev_info pr_dev_info;
 
/* State for keeping the wakeup_source active while adjusting the 
balloon */
-   spinlock_t adjustment_lock;
-   bool adjustment_signal_pending;
-   bool adjustment_in_progress;
+   spinlock_t wakeup_lock;
+   bool processing_wakeup_event;
+   u32 wakeup_signal_mask;
 };
 
+#define VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST (1 << 0)
+#define VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS (1 << 1)
+
 static const struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BALLOON, VIRTIO_DEV_ANY_ID },
{ 0 },
@@ -140,6 +143,36 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
+static void start_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(>wakeup_lock, flags);
+   vb->wakeup_signal_mask |= mask;
+   if (!vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = true;
+   pm_stay_awake(>vdev->dev);
+   }
+   spin_unlock_irqrestore(>wakeup_lock, flags);
+}
+
+static void process_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   spin_lock_irq(>wakeup_lock);
+   vb->wakeup_signal_mask &= ~mask;
+   spin_unlock_irq(>wakeup_lock);
+}
+
+static void finish_wakeup_event(struct virtio_balloon *vb)
+{
+   spin_lock_irq(>wakeup_lock);
+   if (!vb->wakeup_signal_mask && vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = false;
+   pm_relax(>vdev->dev);
+   }
+   spin_unlock_irq(>wakeup_lock);
+}
+
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -370,8 +403,10 @@ static void stats_request(struct virtqueue *vq)
struct virtio_balloon *vb = vq->vdev->priv;
 
spin_lock(>stop_update_lock);
-   if (!vb->stop_update)
+   if (!vb->stop_update) {
+   start_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS);
queue_work(system_freezable_wq, >update_balloon_stats_work);
+   }
spin_unlock(>stop_update_lock);
 }
 
@@ -444,29 +479,10 @@ static void virtio_balloon_queue_free_page_work(struct 
virtio_balloon *vb)
 
 static void start_update_balloon_size(struct virtio_balloon *vb)
 {
-   unsigned long flags;
-
-   spin_lock_irqsave(>adjustment_lock, flags);
-   vb->adjustment_signal_pending = true;
-   if (!vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = true;
-   pm_stay_awake(>vdev->dev);
-   }
-   spin_unlock_irqrestore(>adjustment_lock, flags);
-
+   start_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST);
queue_work(system_freezable_wq, >update_balloon_size_work);
 }
 
-static void end_update_balloon_size(struct virtio_balloon *vb)
-{
-   spin_lock_irq(>adjustment_lock);
-   if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = false;
-   pm_relax(>vdev->dev);
-   }
-   spin_unlock_irq(>adjustment_lock);
-}
-
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
@@ -495,7 +511,10 @@ static void update_balloon_stats_func(struct work_struct 
*work)
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_stats_work);
+
+   process_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS);
stats_handle_request(vb);
+   finish_wakeup_event(vb);
 }
 
 static void update_balloon_size_func(struct work_struct *work)
@@ -506,9 +525,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
 
-   spin_lock_irq(>adjustment_lock);
-   vb->adjustment_signal_pending = false;
-   spin_unlock_irq(>adjustment_lock);
+   process_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST);
 
diff = towards_target(vb);
 
@@ -523,7 +540,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
if (diff)
queue_work(system_freezable_wq, work);
else
-   end_update_balloon_size(vb);
+   finish_wakeup_event(vb);
 }
 
 static int init_vqs(struct virtio_balloon *vb)
@@ -1027,7 +1044,7 @@ static int

[PATCH v2 1/2] virtio_balloon: Give the balloon its own wakeup source

2024-03-20 Thread David Stevens

From: David Stevens 

Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens 
Acked-by: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..89bc8da80519 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon 
*vb)
vb->adjustment_signal_pending = true;
if (!vb->adjustment_in_progress) {
vb->adjustment_in_progress = true;
-   pm_stay_awake(vb->vdev->dev.parent);
+   pm_stay_awake(>vdev->dev);
}
spin_unlock_irqrestore(>adjustment_lock, flags);
 
@@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon 
*vb)
spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
-   pm_relax(vb->vdev->dev.parent);
+   pm_relax(>vdev->dev);
}
spin_unlock_irq(>adjustment_lock);
 }
@@ -1029,6 +1029,15 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
spin_lock_init(>adjustment_lock);
 
+   /*
+* The virtio balloon itself can't wake up the device, but it is
+* responsible for processing wakeup events passed up from the transport
+* layer. Wakeup sources don't support nesting/chaining calls, so we use
+* our own wakeup source to ensure wakeup events are properly handled
+* without trampling on the transport layer's wakeup source.
+*/
+   device_set_wakeup_capable(>vdev->dev, true);
+
virtio_device_ready(vdev);
 
if (towards_target(vb))
-- 
2.44.0.291.gc1ea87d7ee-goog

Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Steven Rostedt

On Wed, 20 Mar 2024 20:46:11 -0400
Waiman Long  wrote:

> I have no objection to that. However, there are now 2 function call 
> overhead in each iteration if either CONFIG_IRQSOFF_TRACER or 
> CONFIG_PREEMPT_TRACER is on. Is it possible to do it with just one 
> function call? IOW, make restart_critical_timings() a real function.

Yeah, I could do that.

-- Steve

Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Waiman Long


On 3/20/24 12:20, Steven Rostedt wrote:

From: Steven Rostedt (Google) 

I'm debugging some latency issues on a Chromebook and the preemptirqsoff
tracer hit this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
# 
# latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#-
#| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
#-
#  => started at: rwsem_optimistic_spin
#  => ended at:   rwsem_optimistic_spin
#
#
#_--=> CPU#
#   / _-=> irqs-off
#  | / _=> need-resched
#  || / _---=> hardirq/softirq
#  ||| / _--=> preempt-depth
#   / _-=> migrate-disable
#  | / delay
#  cmd pid || time  |   caller
# \   /||  \|/
<...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 
<-rwsem_optimistic_spin+0x20/0x194
<...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 
<-rwsem_optimistic_spin+0x140/0x194
<...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a 
<-rwsem_optimistic_spin+0x140/0x194
<...>-1 1.N.1. 7824us : 
  => rwsem_optimistic_spin+0x140/0x194
  => rwsem_down_write_slowpath+0xc9/0x3fe
  => copy_process+0xd08/0x1b8a
  => kernel_clone+0x94/0x256
  => __x64_sys_clone+0x7a/0x9a
  => do_syscall_64+0x51/0xa1
  => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that
spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules.

Hence, it's hiding real issues, because it can spin for a very long time
and this is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Signed-off-by: Steven Rostedt (Google) 


I have no objection to that. However, there are now 2 function call 
overhead in each iteration if either CONFIG_IRQSOFF_TRACER or 
CONFIG_PREEMPT_TRACER is on. Is it possible to do it with just one 
function call? IOW, make restart_critical_timings() a real function.


Cheers,
Longman



diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do {   \
  # define start_critical_timings() do { } while (0)
  #endif
  
+/* Used in spins that check need_resched() with preemption disabled */

+static inline void restart_critical_timings(void)
+{
+   stop_critical_timings();
+   start_critical_timings();
+}
+
  #ifdef CONFIG_DEBUG_IRQFLAGS
  extern void warn_bogus_irq_restore(void);
  #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #ifndef CONFIG_PREEMPT_RT

@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
  
+		restart_critical_timings();

if (need_resched() || !owner_on_cpu(owner)) {
state = OWNER_NONSPINNABLE;
break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 * a writer, need_resched() check needs to be done here.
 */
if (owner_state != OWNER_WRITER) {
+   restart_critical_timings();
if (need_resched())
break;
if (rt_task(current) &&

Re: [External] Re: [PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-20 Thread Ho-Ren (Jack) Chuang

On Wed, Mar 20, 2024 at 12:15 AM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > The current implementation treats emulated memory devices, such as
> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> > (E820_TYPE_RAM). However, these emulated devices have different
> > characteristics than traditional DRAM, making it important to
> > distinguish them. Thus, we modify the tiered memory initialization process
> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > ensures that the memory tier initialization for these nodes is deferred
> > until HMAT information is obtained during the boot process. Finally,
> > demotion tables are recalculated at the end.
> >
> > More details:
>
> You have done several stuff in one patch.  So you need "more details".
> You may separate them into multiple patches.  One for echo "*" below.
> But I have no strong opinion on that.
>
> > * late_initcall(memory_tier_late_init);
> > Some device drivers may have initialized memory tiers between
> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > online memory nodes and configuring memory tiers. They should be excluded
> > in the late init.
> >
> > * Abstract common functions into `mt_find_alloc_memory_type()`
> > Since different memory devices require finding or allocating a memory type,
> > these common steps are abstracted into a single function,
> > `mt_find_alloc_memory_type()`, enhancing code scalability and conciseness.
> >
> > * Handle cases where there is no HMAT when creating memory tiers
> > There is a scenario where a CPUless node does not provide HMAT information.
> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >
> > * Change adist calculation code to use another new lock, `mt_perf_lock`.
> > In the current implementation, iterating through CPUlist nodes requires
> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > trying to acquire the same lock, leading to a potential deadlock.
> > Therefore, we propose introducing a standalone `mt_perf_lock` to protect
> > `default_dram_perf`. This approach not only avoids deadlock but also
> > prevents holding a large lock simultaneously.
> >
> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >   default DRAM, late CPUless, and hot-plugged initializations.
> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > handle cases where memtype is not initialized and where HMAT information is
> > available.
> >
> > * Introduce `default_memory_types` for those memory types that are not
> >   initialized by device drivers.
> > Because late initialized memory and default DRAM memory need to be managed,
> > a default memory type is created for storing all memory types that are
> > not initialized by device drivers and as a fallback.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Signed-off-by: Hao Xiang 
> > ---
> >  drivers/dax/kmem.c   | 13 +
> >  include/linux/memory-tiers.h |  7 +++
> >  mm/memory-tiers.c| 94 +---
> >  3 files changed, 95 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > index 42ee360cf4e3..de1333aa7b3e 100644
> > --- a/drivers/dax/kmem.c
> > +++ b/drivers/dax/kmem.c
> > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
> >
> >  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
> >  {
> > - bool found = false;
> >   struct memory_dev_type *mtype;
> >
> >   mutex_lock(_memory_type_lock);
> > - list_for_each_entry(mtype, _memory_types, list) {
> > - if (mtype->adistance == adist) {
> > - found = true;
> > - break;
> > - }
> > - }
> > - if (!found) {
> > - mtype = alloc_memory_type(adist);
> > - if (!IS_ERR(mtype))
> > - list_add(>list, _memory_types);
> > - }
> > + mtype = mt_find_alloc_memory_type(adist, _memory_types);
> >   mutex_unlock(_memory_type_lock);
> >
> >   return mtype;
>
> It seems that there's some miscommunication about my previous comments
> about this.  What I suggested is to create one separate patch, which
> moves mt_find_alloc_memory_type() and mt_put_memory_types() into
> memory-tiers.c.  And make this patch the first one of the series.
>

I will make mt_find_alloc/mt_put_memory_type changes as
a separate patch and the first of my patch series. Thanks.


> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index 69e781900082..b2135334ac18 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -48,6 +48,8 @@ int mt_calc_adistance(int node, int *adist);
> >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
> >

Re: [PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Google

On Wed, 20 Mar 2024 09:27:49 -0400
Steven Rostedt  wrote:

> On Wed, 20 Mar 2024 17:10:38 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > Fix to initialize 'val' local variable with zero.
> > Dan reported that Smatch static code checker reports an error that a local
> > 'val' variable needs to be initialized. Actually, the 'val' is expected to
> > be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
> > initialize it with zero.
> 
> BTW, that loop should really have a comment stating that FETCH_OP_ARG is
> expected to happen before FETCH_OP_ST_EDATA.

Indeed, OK, let me add it.

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google)

[PATCH v2] tracing: Improve performance by using do_div()

2024-03-20 Thread Thorsten Blum

Partially revert commit d6cb38e10810 ("tracing: Use div64_u64() instead
of do_div()") and use do_div() again to utilize its faster 64-by-32
division compared to the 64-by-64 division done by div64_u64().

Explicitly cast the divisor bm_cnt to u32 to prevent a Coccinelle
warning reported by do_div.cocci. The warning was removed with commit
d6cb38e10810 ("tracing: Use div64_u64() instead of do_div()").

Using the faster 64-by-32 division and casting bm_cnt to u32 is safe
because we return early from trace_do_benchmark() if bm_cnt > UINT_MAX.
This approach is already used twice in trace_do_benchmark() when
calculating the standard deviation:

do_div(stddev, (u32)bm_cnt);
do_div(stddev, (u32)bm_cnt - 1);

Signed-off-by: Thorsten Blum 
---
Changes in v2:
- Update patch with latest changes from master
- Update patch title and description
---
 kernel/trace/trace_benchmark.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_benchmark.c b/kernel/trace/trace_benchmark.c
index 811b08439406..e19c32f2a938 100644
--- a/kernel/trace/trace_benchmark.c
+++ b/kernel/trace/trace_benchmark.c
@@ -104,7 +104,7 @@ static void trace_do_benchmark(void)
stddev = 0;
 
delta = bm_total;
-   delta = div64_u64(delta, bm_cnt);
+   do_div(delta, (u32)bm_cnt);
avg = delta;
 
if (stddev > 0) {
-- 
2.44.0

Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-20 Thread syzbot

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any 
issue:

Reported-and-tested-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com

Tested on:

commit: 4bedfb31 mm,page_owner: maintain own list of stack_rec..
git tree:   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel config:  https://syzkaller.appspot.com/x/.config?x=527195e149aa3091
dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.

[PATCH 02/11] LICENSES: Add 0BSD license text

2024-03-20 Thread Lasse Collin

The license text was copied from:

https://spdx.org/licenses/0BSD.html

Reviewed-by: Jia Tan 
Signed-off-by: Lasse Collin 
---

Notes:
0BSD is the ISC license without the requirements to preserve copyright
and license notices.

 LICENSES/deprecated/0BSD | 23 +++
 1 file changed, 23 insertions(+)
 create mode 100644 LICENSES/deprecated/0BSD

diff --git a/LICENSES/deprecated/0BSD b/LICENSES/deprecated/0BSD
new file mode 100644
index ..e4b95b749966
--- /dev/null
+++ b/LICENSES/deprecated/0BSD
@@ -0,0 +1,23 @@
+Valid-License-Identifier: 0BSD
+SPDX-URL: https://spdx.org/licenses/0BSD.html
+Usage-Guide:
+  To use the BSD Zero Clause License put the following SPDX tag/value
+  pair into a comment according to the placement guidelines in the
+  licensing rules documentation:
+SPDX-License-Identifier: 0BSD
+License-Text:
+
+BSD Zero Clause License
+
+Copyright (c)  
+
+Permission to use, copy, modify, and/or distribute this software for any
+purpose with or without fee is hereby granted.
+
+THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
+SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
+OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
+CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
-- 
2.44.0

[PATCH 03/11] xz: Switch from public domain to BSD Zero Clause License (0BSD)

2024-03-20 Thread Lasse Collin

Remove the public domain notices and add SPDX license identifiers.

Change MODULE_LICENSE from "GPL" to "Dual BSD/GPL" because 0BSD should
count as a BSD license variant here.

The switch to 0BSD was done in the upstream XZ Embedded project because
public domain has (real or perceived) legal issues in some jurisdictions.

Reviewed-by: Jia Tan 
Signed-off-by: Lasse Collin 
---
 include/linux/decompress/unxz.h |  5 ++---
 include/linux/xz.h  |  5 ++---
 lib/decompress_unxz.c   |  5 ++---
 lib/xz/xz_crc32.c   |  5 ++---
 lib/xz/xz_dec_bcj.c |  5 ++---
 lib/xz/xz_dec_lzma2.c   |  5 ++---
 lib/xz/xz_dec_stream.c  |  5 ++---
 lib/xz/xz_dec_syms.c| 12 +++-
 lib/xz/xz_dec_test.c| 12 +++-
 lib/xz/xz_lzma2.h   |  5 ++---
 lib/xz/xz_private.h |  5 ++---
 lib/xz/xz_stream.h  |  5 ++---
 scripts/xz_wrap.sh  |  5 +
 13 files changed, 27 insertions(+), 52 deletions(-)

diff --git a/include/linux/decompress/unxz.h b/include/linux/decompress/unxz.h
index f764e2a7201e..3dd2658a9dab 100644
--- a/include/linux/decompress/unxz.h
+++ b/include/linux/decompress/unxz.h
@@ -1,10 +1,9 @@
+/* SPDX-License-Identifier: 0BSD */
+
 /*
  * Wrapper for decompressing XZ-compressed kernel, initramfs, and initrd
  *
  * Author: Lasse Collin 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 #ifndef DECOMPRESS_UNXZ_H
diff --git a/include/linux/xz.h b/include/linux/xz.h
index 7285ca5d56e9..5728d57aecc0 100644
--- a/include/linux/xz.h
+++ b/include/linux/xz.h
@@ -1,11 +1,10 @@
+/* SPDX-License-Identifier: 0BSD */
+
 /*
  * XZ decompressor
  *
  * Authors: Lasse Collin 
  *  Igor Pavlov 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 #ifndef XZ_H
diff --git a/lib/decompress_unxz.c b/lib/decompress_unxz.c
index 842894158944..34bb7efc0412 100644
--- a/lib/decompress_unxz.c
+++ b/lib/decompress_unxz.c
@@ -1,10 +1,9 @@
+// SPDX-License-Identifier: 0BSD
+
 /*
  * Wrapper for decompressing XZ-compressed kernel, initramfs, and initrd
  *
  * Author: Lasse Collin 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 /*
diff --git a/lib/xz/xz_crc32.c b/lib/xz/xz_crc32.c
index 88a2c35e1b59..30b8a27110b1 100644
--- a/lib/xz/xz_crc32.c
+++ b/lib/xz/xz_crc32.c
@@ -1,11 +1,10 @@
+// SPDX-License-Identifier: 0BSD
+
 /*
  * CRC32 using the polynomial from IEEE-802.3
  *
  * Authors: Lasse Collin 
  *  Igor Pavlov 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 /*
diff --git a/lib/xz/xz_dec_bcj.c b/lib/xz/xz_dec_bcj.c
index ef449e97d1a1..ab9237ed6db8 100644
--- a/lib/xz/xz_dec_bcj.c
+++ b/lib/xz/xz_dec_bcj.c
@@ -1,11 +1,10 @@
+// SPDX-License-Identifier: 0BSD
+
 /*
  * Branch/Call/Jump (BCJ) filter decoders
  *
  * Authors: Lasse Collin 
  *  Igor Pavlov 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 #include "xz_private.h"
diff --git a/lib/xz/xz_dec_lzma2.c b/lib/xz/xz_dec_lzma2.c
index 27ce34520e78..613939f5dd6c 100644
--- a/lib/xz/xz_dec_lzma2.c
+++ b/lib/xz/xz_dec_lzma2.c
@@ -1,11 +1,10 @@
+// SPDX-License-Identifier: 0BSD
+
 /*
  * LZMA2 decoder
  *
  * Authors: Lasse Collin 
  *  Igor Pavlov 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 #include "xz_private.h"
diff --git a/lib/xz/xz_dec_stream.c b/lib/xz/xz_dec_stream.c
index 683570b93a8c..0058406ccd17 100644
--- a/lib/xz/xz_dec_stream.c
+++ b/lib/xz/xz_dec_stream.c
@@ -1,10 +1,9 @@
+// SPDX-License-Identifier: 0BSD
+
 /*
  * .xz Stream decoder
  *
  * Author: Lasse Collin 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 #include "xz_private.h"
diff --git a/lib/xz/xz_dec_syms.c b/lib/xz/xz_dec_syms.c
index 61098c67a413..495d2cc2e6e8 100644
--- a/lib/xz/xz_dec_syms.c
+++ b/lib/xz/xz_dec_syms.c
@@ -1,10 +1,9 @@
+// SPDX-License-Identifier: 0BSD
+
 /*
  * XZ decoder module information
  *
  * Author: Lasse Collin 
- *
- * This file has been put into the public domain.
- * You can do whatever you want with this file.
  */
 
 #include 
@@ -25,9 +24,4 @@ EXPORT_SYMBOL(xz_dec_microlzma_end);
 MODULE_DESCRIPTION("XZ decompressor");
 MODULE_VERSION("1.1");
 MODULE_AUTHOR("Lasse Collin  and Igor Pavlov");
-
-/*
- * This code is in the public domain, but in Linux it's simplest to just
- * say it's GPL and consider the authors as the copyright holders.
- */
-MODULE_LICENSE("GPL");
+MODULE_LICENSE("Dual BSD/GPL");
diff --git a/lib/xz/xz_dec_test.c b/lib/xz/xz_dec_test.c
index da28a19d6c98..53d3600f2ddb 100644
---

Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Mathieu Desnoyers


On 2024-03-20 13:58, Steven Rostedt wrote:

On Wed, 20 Mar 2024 13:15:39 -0400
Mathieu Desnoyers  wrote:


I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.


Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?


Because need_resched() itself does not mean it's going to schedule
immediately. I looked at a few locations that need_resched() is called.
Most are in idle code where the critical timings are already handled.

I'm not sure I'd add it for places like mm/memory.c or 
drivers/md/bcache/btree.c.

A lot of places look to use it more for PREEMPT_NONE situations as a open
coded cond_resched().

The main reason this one is particularly an issue, is that it spins as long
as the owner is still running. Which may be some time, as here it was 7ms.


What I think we should be discussing here is how calling need_resched()
should interact with the latency tracked by critical timings.

AFAIU, when code explicitly calls need_resched() in a loop, there are
two cases:

- need_resched() returns false: This means the loop can continue without
  causing long latency on the system. Technically we could restart the
  critical timings at this point.

- need_resched() returns true: This means the loop should exit quickly
  and call the scheduler. I would not reset the critical timings there,
  as whatever code is executed between need_resched() returning true
  and calling the scheduler is adding to latency.

Having stop/start critical timings around idle loops seems to just be
an optimization over that.

As for mm and driver/md code, what is wrong with doing a critical
timings reset when need_resched() returns false ? It would prevent
a whole class of false-positives rather than playing whack-a-mole with
those that pop up.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-20 Thread Mark Rutland

On Thu, Mar 14, 2024 at 04:07:33PM +0100, Bj"orn T"opel wrote:
> After reading Mark's reply, and discussing with OpenJDK folks (who does
> the most crazy text patching on all platforms), having to patch multiple
> instructions (where the address materialization is split over multiple
> instructions) is a no-go. It's just a too big can of worms. So, if we
> can only patch one insn, it's CALL_OPS.
> 
> A couple of options (in addition to Andy's), and all require a
> per-function landing address ala CALL_OPS) tweaking what Mark is doing
> on Arm (given the poor branch range).
> 
> ..and maybe we'll get RISC-V rainbows/unicorns in the future getting
> better reach (full 64b! ;-)).
> 
> A) Use auipc/jalr, only patch jalr to take us to a common
>dispatcher/trampoline
>   
>  |  # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   aupic
>  |   nop <=> jalr # Text patch point -> common_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | common_dispatch:
>  |   load  based on ra
>  |   jalr
>  |   ...
> 
> The auipc is never touched, and will be overhead. Also, we need a mv to
> store ra in a scratch register as well -- like Arm. We'll have two insn
> per-caller overhead for a disabled caller.

Is the AUIPC a significant overhead? IIUC that's similar to Arm's ADRP, and I'd
have expected that to be pretty cheap.

IIUC your JALR can choose which destination register to store the return
address in, and if so, you could leave the original ra untouched (and recover
that in the common trampoline). Have I misunderstood that?

Maybe that doesn't play nicely with something else?

> B) Use jal, which can only take us +/-1M, and requires multiple
>dispatchers (and tracking which one to use, and properly distribute
>them. Ick.)
> 
>  |  # probably on a data cache-line != func .text 
> to avoid ping-pong
>  | ...
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
>  |   ACTUAL_FUNC
>  | 
>  | within_1M_to_func_dispatch:
>  |   load  based on ra
>  |   jalr
> 
> C) Use jal, which can only take us +/-1M, and use a per-function
>trampoline requires multiple dispatchers (and tracking which one to
>use). Blows up text size A LOT.
> 
>  |  # somewhere, but probably on a different 
> cacheline than the .text to avoid ping-ongs
>  | ...
>  | per_func_dispatch
>  |   load  based on ra
>  |   jalr
>  | func:
>  |   ...make sure ra isn't messed up...
>  |   nop <=> jal # Text patch point -> per_func_dispatch
>  |   ACTUAL_FUNC

Beware that with option (C) you'll need to handle that in your unwinder for
RELIABLE_STACKTRACE. If you don't have a symbol for per_func_dispatch (or
func_trace_target_data_8B), PC values within per_func_dispatch would be
symbolized as the prior function/data.

> It's a bit sad that we'll always have to have a dispatcher/trampoline,
> but it's still better than stop_machine(). (And we'll need a fencei IPI
> as well, but only one. ;-))
> 
> Today, I'm leaning towards A (which is what Mark suggested, and also
> Robbin).. Any other options?

Assuming my understanding of JALR above is correct, I reckon A is the nicest
option out of A/B/C.

Mark.

Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Steven Rostedt

On Wed, 20 Mar 2024 13:15:39 -0400
Mathieu Desnoyers  wrote:

> > I would like to introduce restart_critical_timings() and place it in
> > locations that have this behavior.  
> 
> Is there any way you could move this to need_resched() rather than
> sprinkle those everywhere ?

Because need_resched() itself does not mean it's going to schedule
immediately. I looked at a few locations that need_resched() is called.
Most are in idle code where the critical timings are already handled.

I'm not sure I'd add it for places like mm/memory.c or 
drivers/md/bcache/btree.c.

A lot of places look to use it more for PREEMPT_NONE situations as a open
coded cond_resched().

The main reason this one is particularly an issue, is that it spins as long
as the owner is still running. Which may be some time, as here it was 7ms.

-- Steve

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-20 Thread David Woodhouse

On Tue, 2024-03-19 at 14:47 +0100, Peter Hilber wrote:
> While the virtio-comment list is not available, now also CC'ing Parav,
> which may be interested in this virtio-rtc spec related discussion thread.
> 
> On 14.03.24 15:19, David Woodhouse wrote:
> > On 14 March 2024 11:13:37 CET, Peter Hilber  
> > wrote:
> > > > To a certain extent, as long as the virtio-rtc device is designed to 
> > > > expose time precisely and unambiguously, it's less important if the 
> > > > Linux kernel *today* can use that. Although of course we should strive 
> > > > for that. Let's be...well, *unambiguous*, I suppose... that we've 
> > > > changed topics to discuss that though.
> > > > 
> > > 
> > > As Virtio is extensible (unlike hardware), my approach is to mostly 
> > > specify
> > > only what also has a PoC user and a use case.
> > 
> > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case 
> > on day one. Otherwise, as I said in my first response, I can go do that as 
> > a separate device and decide that virtio_rtc doesn't meet our needs 
> > (especially for maintaining accuracy over LM).
> 
> We plan to add 
> 
> - leap second indication,
> 
> - UTC-to-TAI offset,
> 
> - clock smearing indication (including the noon-to-noon linear smearing
>   variant which seems to be somewhat popular), and
>
> - clock accuracy indication
> 
> to the initial spec and to the PoC implementation.

Sounds good, thanks! I look forward to seeing the new revision. I'm
hoping Julien can give feedback on the clock accuracy parts.

> However, due to resource restrictions, we cannot ourselves add the
> memory-mapped clock to the initial spec.
>
> Everyone is very welcome to contribute the memory-mapped clock to the spec,
> and I think it might then still make it to the initial version.

Makes sense. That is my primary target, so I'm *hoping* we can converge
and get that into your initial spec, otherwise for expediency I'm going
to have to define an ACPI or DT or PCI device of our own and expose the
memory region through that instead.

(Even if I have to do that in the short term to stop the bleeding with
customers' clocks and live migration, I'd still aspire to migrate to a
virtio_rtc version of it in future)



smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Mathieu Desnoyers


On 2024-03-20 12:20, Steven Rostedt wrote:

From: Steven Rostedt (Google) 

I'm debugging some latency issues on a Chromebook and the preemptirqsoff
tracer hit this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
# 
# latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#-
#| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
#-
#  => started at: rwsem_optimistic_spin
#  => ended at:   rwsem_optimistic_spin
#
#
#_--=> CPU#
#   / _-=> irqs-off
#  | / _=> need-resched
#  || / _---=> hardirq/softirq
#  ||| / _--=> preempt-depth
#   / _-=> migrate-disable
#  | / delay
#  cmd pid || time  |   caller
# \   /||  \|/
<...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 
<-rwsem_optimistic_spin+0x20/0x194
<...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 
<-rwsem_optimistic_spin+0x140/0x194
<...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a 
<-rwsem_optimistic_spin+0x140/0x194
<...>-1 1.N.1. 7824us : 
  => rwsem_optimistic_spin+0x140/0x194
  => rwsem_down_write_slowpath+0xc9/0x3fe
  => copy_process+0xd08/0x1b8a
  => kernel_clone+0x94/0x256
  => __x64_sys_clone+0x7a/0x9a
  => do_syscall_64+0x51/0xa1
  => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that
spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules.

Hence, it's hiding real issues, because it can spin for a very long time
and this is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.


Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?

Thanks,

Mathieu



Signed-off-by: Steven Rostedt (Google) 

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do {   \
  # define start_critical_timings() do { } while (0)
  #endif
  
+/* Used in spins that check need_resched() with preemption disabled */

+static inline void restart_critical_timings(void)
+{
+   stop_critical_timings();
+   start_critical_timings();
+}
+
  #ifdef CONFIG_DEBUG_IRQFLAGS
  extern void warn_bogus_irq_restore(void);
  #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #ifndef CONFIG_PREEMPT_RT

@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
  
+		restart_critical_timings();

if (need_resched() || !owner_on_cpu(owner)) {
state = OWNER_NONSPINNABLE;
break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 * a writer, need_resched() check needs to be done here.
 */
if (owner_state != OWNER_WRITER) {
+   restart_critical_timings();
if (need_resched())
break;
if (rt_task(current) &&


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-20 Thread Keir Fraser

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> 
> Before this patch was posted, I had debugging code to record last 16 
> transactions
> to the available and used queue from guest and host side. It did reveal the 
> wrong
> head was fetched from the available queue.
> 
> [   11.785745]  virtqueue_get_buf_ctx_split 
> [   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [   11.786655] head to be released: 036 077
> [   11.786952]
> [   11.786952] avail_idx:
> [   11.787234] 000  63985  <--
> [   11.787237] 001  63986
> [   11.787444] 002  63987
> [   11.787632] 003  63988
> [   11.787821] 004  63989
> [   11.788006] 005  63990
> [   11.788194] 006  63991
> [   11.788381] 007  63992
> [   11.788567] 008  63993
> [   11.788772] 009  63994
> [   11.788957] 010  63995
> [   11.789141] 011  63996
> [   11.789327] 012  63997
> [   11.789515] 013  63998
> [   11.789701] 014  63999
> [   11.789886] 015  64000

Does the error always occur at such a round idx value?

Here, 64000 == 0xFA00. Maybe coincidence but it's improbable enough to be 
interesting.

This debug code seems rather useful!

 -- Keir



> [   11.790068]
> [   11.790068] avail_head:
> [   11.790529] 000  075  <--
> [   11.790718] 001  036
> [   11.790890] 002  077
> [   11.791061] 003  129
> [   11.791231] 004  072
> [   11.791400] 005  130
> [   11.791574] 006  015
> [   11.791748] 007  074
> [   11.791918] 008  130
> [   11.792094] 009  130
> [   11.792263] 010  074
> [   11.792437] 011  015
> [   11.792617] 012  072
> [   11.792788] 013  129
> [   11.792961] 014  077// The last two heads from guest to host: 077, 036
> [   11.793134] 015  036
> 
> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
> 
> avail_idx
> 000  63998
> 001  64000
> 002  63954  <---
> 003  63955
> 004  63956
> 005  63974
> 006  63981
> 007  63984
> 008  63986
> 009  63987
> 010  63988
> 011  63989
> 012  63992
> 013  63993
> 014  63995
> 015  63997
> 
> avail_head
> 000  074
> 001  015
> 002  072
> 003  129
> 004  074// The last two heads seen by vhost is: 074, 036
> 005  036
> 006  075  <---
> 007  036
> 008  077
> 009  129
> 010  072
> 011  130
> 012  015
> 013  074
> 014  130
> 015  130
> 
> used_idx
> 000  64000
> 001  63882  <---
> 002  63889
> 003  63891
> 004  63898
> 005  63936
> 006  63942
> 007  63946
> 008  63949
> 009  63953
> 010  63957
> 011  63981
> 012  63990
> 013  63992
> 014  63993
> 015  63999
> 
> used_head
> 000  072
> 001  129
> 002  074  // The last two heads published to guest is: 074, 036
> 003  036
> 004  075  <---
> 005  036
> 006  077
> 007  129
> 008  072
> 009  130
> 010  015
> 011  074
> 012  130
> 013  130
> 014  074
> 015  015
> 
> Thanks,
> Gavin
> 
> 
> 
>

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-20 Thread SeongJae Park

Hi Honggyu,

On Wed, 20 Mar 2024 16:07:48 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park  wrote:
> > On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:
> > 
> > > Hi SeongJae,
> > > 
> > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> > > > Hi Honggyu,
> > > > 
> > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  
> > > > wrote:
> > > > 
> > > > > Hi SeongJae,
> > > > > 
> > > > > Thanks for the confirmation.  I have a few comments on young filter so
> > > > > please read the inline comments again.
> > > > > 
> > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  
> > > > > wrote:
> > > > > > Hi Honggyu,
> > > > > > 
> > > > > > > > -Original Message-
> > > > > > > > From: SeongJae Park 
> > > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > > > To: Honggyu Kim 
> > > > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > > > 
> > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > > > management for CXL memory
> > > > > > > >
> > > > > > > > Hi Honggyu,
> > > > > > > >
> > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > > > >  wrote:
> > > > > > > >
> > > > > > > > > Hi SeongJae,
> > > > > > > > >
> > > > > > > > > I've tested it again and found that "young" filter has to be 
> > > > > > > > > set
> > > > > > > > > differently as follows.
> > > > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > > > - promote action: set "young" filter with "matching" false
> > > 
> > > Thinking it again, I feel like "matching" true or false looks quite
> > > vague to me as a general user.
> > > 
> > > Instead, I would like to have more meaningful names for "matching" as
> > > follows.
> > > 
> > > - matching "true" can be either (filter) "out" or "skip".
> > > - matching "false" can be either (filter) "in" or "apply".
> > 
> > I agree the naming could be done much better.  And thank you for the nice
> > suggestions.  I have a few concerns, though.
> 
> I don't think my suggestion is best.  I just would like to have more
> discussion about it.

I also understand my naming sense is far from good :)  I'm grateful to have
this constructive discussion!

> 
> > Firstly, increasing the number of behavioral concepts.  DAMOS filter feature
> > has only single behavior: excluding some types of memory from DAMOS action
> > target.  The "matching" is to provide a flexible way for further specifying 
> > the
> > target to exclude in a bit detail.  Without it, we would need non-variant 
> > for
> > each filter type.  Comapred to the current terms, the new terms feel like
> > implying there are two types of behaviors.  I think one behavior is easier 
> > to
> > understand than two behaviors, and better match what internal code is doing.
> > 
> > Secondly, ambiguity in "in" and "apply".  To me, the terms sound like 
> > _adding_
> > something more than _excluding_.
> 
> I understood that young filter "matching" "false" means apply action
> only to young pages.  Do I misunderstood something here?  If not,

Technically speaking, having a DAMOS filter with 'matching' parameter as
'false' for 'young pages' type means you want DAMOS to "exclude pages that not
young from the scheme's action target".  That's the only thing it truly does,
and what it tries to guarantee.  Whether the action will be applied to young
pages or not depends on more factors including additional filters and DAMOS
parameter.  IOW, that's not what the simple setting promises.

Of course, I know you are assuming there is only the single filter.  Hence,
effectively you're correct.  And the sentence may be a better wording for end
users.  However, it tooke me a bit time to understand your assumption and
concluding whether your sentence is correct or not, since I had to think about
the assumptions.

I'd say this also reminds me the first concern that I raised on the previous
mail.  IOW, I feel this sentence is introducing one more behavior and making it
bit taking longer time to digest, for developers who should judge it based on
the source code.  I'd suggest use only one behavioral term, "exclude", since it
is what the code really does, unless it is wording for end users.

> "apply" means _adding_ or _including_ only the matched pages for action.
> It looks like you thought about _excluding_ non matched pages here.

Yes.  I'd prefer using only single term, _excluding_.  It fits with the code,
and require one word less that "adding" or "including", since "adding" or
"including" require one more word, "only".

Also, even with "only", the fact that there could be more filters makes me
unsure what is the consequence of having it.  That is, if we have a filter that
includes only pages of type A, but if there could be yet another filter that
includes only pages of type B, would the consequence is the action being
applied to pages of type A and B?  Or, type A or type B?

In my opinion,

Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-20 Thread Andy Chiu

Hi Björn,

On Fri, Mar 15, 2024 at 4:50 AM Björn Töpel  wrote:
>
> Björn Töpel  writes:
>
> > Puranjay Mohan  writes:
> >
> >> Björn Töpel  writes:
> >>
> >>>
> >>> Hmm, depending on RISC-V's CMODX path, the pro/cons CALL_OPS vs dynamic
> >>> trampolines changes quite a bit.
> >>>
> >>> The more I look at the pains of patching two instruction ("split
> >>> immediates"), the better "patch data" + one insn patching look.
> >>
> >> I was looking at how dynamic trampolines would be implemented for RISC-V.
> >>
> >> With CALL-OPS we need to patch the auipc+jalr at function entry only, the
> >> ops pointer above the function can be patched atomically.
> >>
> >> With a dynamic trampoline we need a auipc+jalr pair at function entry to 
> >> jump
> >> to the trampoline and then another auipc+jalr pair to jump from trampoline 
> >> to
> >> ops->func. When the ops->func is modified, we would need to update the
> >> auipc+jalr at in the trampoline.
> >>
> >> So, I am not sure how to move forward here, CALL-OPS or Dynamic 
> >> trampolines?
> >
> > Yeah. Honestly, we need to figure out the patching story prior
> > choosing the path, so let's start there.
> >
> > After reading Mark's reply, and discussing with OpenJDK folks (who does
> > the most crazy text patching on all platforms), having to patch multiple
> > instructions (where the address materialization is split over multiple
> > instructions) is a no-go. It's just a too big can of worms. So, if we
> > can only patch one insn, it's CALL_OPS.
> >
> > A couple of options (in addition to Andy's), and all require a
> > per-function landing address ala CALL_OPS) tweaking what Mark is doing
> > on Arm (given the poor branch range).
> >
> > ...and maybe we'll get RISC-V rainbows/unicorns in the future getting
> > better reach (full 64b! ;-)).
> >
> > A) Use auipc/jalr, only patch jalr to take us to a common
> >dispatcher/trampoline
> >
> >  |  # probably on a data cache-line != func 
> > .text to avoid ping-pong
> >  | ...
> >  | func:
> >  |   ...make sure ra isn't messed up...
> >  |   aupic
> >  |   nop <=> jalr # Text patch point -> common_dispatch
> >  |   ACTUAL_FUNC
> >  |
> >  | common_dispatch:
> >  |   load  based on ra
> >  |   jalr
> >  |   ...
> >
> > The auipc is never touched, and will be overhead. Also, we need a mv to
> > store ra in a scratch register as well -- like Arm. We'll have two insn
> > per-caller overhead for a disabled caller.
> >
> > B) Use jal, which can only take us +/-1M, and requires multiple
> >dispatchers (and tracking which one to use, and properly distribute
> >them. Ick.)
> >
> >  |  # probably on a data cache-line != func 
> > .text to avoid ping-pong
> >  | ...
> >  | func:
> >  |   ...make sure ra isn't messed up...
> >  |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
> >  |   ACTUAL_FUNC
> >  |
> >  | within_1M_to_func_dispatch:
> >  |   load  based on ra
> >  |   jalr
> >
> > C) Use jal, which can only take us +/-1M, and use a per-function
> >trampoline requires multiple dispatchers (and tracking which one to
> >use). Blows up text size A LOT.
> >
> >  |  # somewhere, but probably on a different 
> > cacheline than the .text to avoid ping-ongs
> >  | ...
> >  | per_func_dispatch
> >  |   load  based on ra
> >  |   jalr
> >  | func:
> >  |   ...make sure ra isn't messed up...
> >  |   nop <=> jal # Text patch point -> per_func_dispatch
> >  |   ACTUAL_FUNC
>
> Brendan proposed yet another option, "in-function dispatch":
>
> D)
>   |  # idk somewhere
>   | ...
>   | func:
>   |mv tmp1, ra
>   |aupic tmp2, 
>   |mv tmp3, zero <=> ld tmp3, 
> tmp2
>   |nop <=> jalr ra, tmp3
>   |ACTUAL_FUNC

My patch series takes a similar "in-function dispatch" approach. A
difference is that the  is
embedded within each function entry. I'd like to have it moved to a
run-time allocated array to reduce total text size.

Another difference is that my series changes the first instruction to
"j ACTUAL_FUNC" for the "ftrace disable" case. As long as the
architecture guarantees the atomicity of the first instruction, then
we are safe. For example, we are safe if the first instruction could
only be "mv tmp, ra" or "j ACTUAL_FUNC". And since the loaded address is
always valid, we can fix "mv + jalr" down so we don't have to
play with the exception handler trick. The guarantee from arch would
require ziccif (in RVA22) though, but I think it is the same for us
(unless with stop_machine). For ziccif, I would rather call that out
during boot than blindly assume.

However, one thing I am not very sure is: do we need a destination
address in a "per-function" manner? It seems like most of the time the
destination address can only be ftrace_call, or ftrace_regs_call. If
the number of destination addresses is very few, then we could
potentially reduce the size of
.

>
> There are 4 CMODX possiblities:
>mv, nop:  fully disabled, no problems
>mv, jalr: We will jump to zero. We would need to have the

Re: [PATCH] tracing: Explicitly cast divisor to fix Coccinelle warning

2024-03-20 Thread Thorsten Blum

On 20. Mar 2024, at 16:01, Thorsten Blum  wrote:
> 
> Coccinelle also finds this one, but please ignore this patch as I just 
> realized
> this was already fixed in another patch of mine from February.
> 
> Sorry for the inconvenience.
> 
> Link: 
> https://lore.kernel.org/linux-kernel/20240225164507.232942-2-thorsten.b...@toblux.com/

Actually, I will submit a new patch to revert

delta = div64_u64(delta, bm_cnt);

back to

do_div(delta, bm_cnt);

but this time include an explicit cast to u32

do_div(delta, (u32)bm_cnt);

to remove the Coccinelle warning reported by do_div.cocci and to improve
performance.

The do_div() macro does a 64-by-32 division which is faster than the 64-by-64 
division done by div64_u64(). Casting the divisor bm_cnt to u32 is safe since 
we return early from trace_do_benchmark() if bm_cnt > UINT_MAX (something I 
missed in d6cb38e10810).

This approach is already used when calculating the standard deviation:

do_div(stddev, (u32)bm_cnt);
do_div(stddev, (u32)bm_cnt - 1);

Thanks,
Thorsten

[RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Steven Rostedt

From: Steven Rostedt (Google) 

I'm debugging some latency issues on a Chromebook and the preemptirqsoff
tracer hit this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
# 
# latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#-
#| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
#-
#  => started at: rwsem_optimistic_spin
#  => ended at:   rwsem_optimistic_spin
#
#
#_--=> CPU#
#   / _-=> irqs-off
#  | / _=> need-resched
#  || / _---=> hardirq/softirq 
#  ||| / _--=> preempt-depth   
#   / _-=> migrate-disable 
#  | / delay   
#  cmd pid || time  |   caller 
# \   /||  \|/   
   <...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 
<-rwsem_optimistic_spin+0x20/0x194
   <...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 
<-rwsem_optimistic_spin+0x140/0x194
   <...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a 
<-rwsem_optimistic_spin+0x140/0x194
   <...>-1 1.N.1. 7824us : 
 => rwsem_optimistic_spin+0x140/0x194
 => rwsem_down_write_slowpath+0xc9/0x3fe
 => copy_process+0xd08/0x1b8a
 => kernel_clone+0x94/0x256
 => __x64_sys_clone+0x7a/0x9a
 => do_syscall_64+0x51/0xa1
 => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that
spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules.

Hence, it's hiding real issues, because it can spin for a very long time
and this is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Signed-off-by: Steven Rostedt (Google) 

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do {   \
 # define start_critical_timings() do { } while (0)
 #endif
 
+/* Used in spins that check need_resched() with preemption disabled */
+static inline void restart_critical_timings(void)
+{
+   stop_critical_timings();
+   start_critical_timings();
+}
+
 #ifdef CONFIG_DEBUG_IRQFLAGS
 extern void warn_bogus_irq_restore(void);
 #define raw_check_bogus_irq_restore()  \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #ifndef CONFIG_PREEMPT_RT
@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
 
+   restart_critical_timings();
if (need_resched() || !owner_on_cpu(owner)) {
state = OWNER_NONSPINNABLE;
break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 * a writer, need_resched() check needs to be done here.
 */
if (owner_state != OWNER_WRITER) {
+   restart_critical_timings();
if (need_resched())
break;
if (rt_task(current) &&

Re: [PATCH 2/3] dt-bindings: remoteproc: add Versal-NET platform

2024-03-20 Thread Tanmay Shah




On 3/20/24 2:40 AM, Krzysztof Kozlowski wrote:
> On 19/03/2024 15:42, Tanmay Shah wrote:
>> 
>> 
>> On 3/19/24 12:30 AM, Krzysztof Kozlowski wrote:
>>> On 19/03/2024 01:51, Tanmay Shah wrote:
 Hello Krzysztof,

 Thanks for reviews. Please find my comments below.

 On 3/17/24 1:53 PM, Krzysztof Kozlowski wrote:
> On 15/03/2024 22:15, Tanmay Shah wrote:
>> AMD-Xilinx Versal-NET platform is successor of Versal platform. It
>> contains multiple clusters of cortex-R52 real-time processing units.
>> Each cluster contains two cores of cortex-R52 processors. Each cluster
>> can be configured in lockstep mode or split mode.
>>
>> Each R52 core is assigned 128KB of TCM memory. ATCM memory is 64KB, BTCM
>> and CTCM memoreis are 32KB each. Each TCM memory has its own dedicated
>> power-domain that needs to be requested before using it.
>>
>> Signed-off-by: Tanmay Shah 
>> ---
>>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 220 +++---
>>  1 file changed, 184 insertions(+), 36 deletions(-)
>>
>> diff --git 
>> a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
>> b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>> index 711da0272250..55654ee02eef 100644
>> --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>> +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>> @@ -18,7 +18,9 @@ description: |
>>  
>>  properties:
>>compatible:
>> -const: xlnx,zynqmp-r5fss
>> +enum:
>> +  - xlnx,zynqmp-r5fss
>> +  - xlnx,versal-net-r52fss
>>  
>>"#address-cells":
>>  const: 2
>> @@ -64,7 +66,9 @@ patternProperties:
>>  
>>  properties:
>>compatible:
>> -const: xlnx,zynqmp-r5f
>> +enum:
>> +  - xlnx,zynqmp-r5f
>> +  - xlnx,versal-net-r52f
>>  
>>reg:
>>  minItems: 1
>> @@ -135,9 +139,11 @@ required:
>>  allOf:
>>- if:
>>properties:
>> -xlnx,cluster-mode:
>> -  enum:
>> -- 1
>> +compatible:
>> +  contains:
>> +enum:
>> +  - xlnx,versal-net-r52fss
>
> Why do you touch these lines?
>
>> +
>>  then:
>>patternProperties:
>>  "^r5f@[0-9a-f]+$":
>> @@ -149,16 +155,14 @@ allOf:
>>items:
>>  - description: ATCM internal memory
>>  - description: BTCM internal memory
>> -- description: extra ATCM memory in lockstep mode
>> -- description: extra BTCM memory in lockstep mode
>> +- description: CTCM internal memory
>>  
>>  reg-names:
>>minItems: 1
>>items:
>> -- const: atcm0
>> -- const: btcm0
>> -- const: atcm1
>> -- const: btcm1
>> +- const: atcm
>> +- const: btcm
>> +- const: ctcm
>>  
>>  power-domains:
>>minItems: 2
>> @@ -166,33 +170,70 @@ allOf:
>>  - description: RPU core power domain
>>  - description: ATCM power domain
>>  - description: BTCM power domain
>> -- description: second ATCM power domain
>> -- description: second BTCM power domain
>> +- description: CTCM power domain
>>  
>>  else:
>> -  patternProperties:
>> -"^r5f@[0-9a-f]+$":
>> -  type: object
>> -
>> -  properties:
>> -reg:
>> -  minItems: 1
>> -  items:
>> -- description: ATCM internal memory
>> -- description: BTCM internal memory
>> -
>> -reg-names:
>> -  minItems: 1
>> -  items:
>> -- const: atcm0
>> -- const: btcm0
>> -
>> -power-domains:
>> -  minItems: 2
>> -  items:
>> -- description: RPU core power domain
>> -- description: ATCM power domain
>> -- description: BTCM power domain
>> +  allOf:
>> +- if:
>> +properties:
>> +  xlnx,cluster-mode:
>> +enum:
>> +  - 1
>
> Whatever you did here, is not really readable. You have now multiple
> if:then:if:then embedded.

 For ZynqMP platform, TCM can be configured differently in lockstep mode
 and split mode.

 For Versal-NET no

Re: [PATCH] tracing: Explicitly cast divisor to fix Coccinelle warning

2024-03-20 Thread Thorsten Blum

On 20. Mar 2024, at 11:27, Masami Hiramatsu (Google)  
wrote:
> 
> Hmm, strange, trace_do_benchmark() has another do_div(u64, u64). 
> 
>do {
>last_seed = seed;
>seed = stddev;
>if (!last_seed)
>break;
>do_div(seed, last_seed);
>seed += last_seed;
>do_div(seed, 2);
>} while (i++ < 10 && last_seed != seed);
> 
> Didn't Coccinelle find that?

Coccinelle also finds this one, but please ignore this patch as I just realized
this was already fixed in another patch of mine from February.

Sorry for the inconvenience.

Link: 
https://lore.kernel.org/linux-kernel/20240225164507.232942-2-thorsten.b...@toblux.com/

[GIT PULL V2] tracing/tools: Updates for 6.9

2024-03-20 Thread Daniel Bristot de Oliveira

Steven,

[
   I rebased my queue on top of the v6.8 tag.
]

Tracing:
- Update makefiles for latency-collector and RTLA,
  using tools/build/ makefiles like perf does, inheriting
  its benefits. For example, having a proper way to
  handle dependencies.

- The timerlat tracer has an interface for any tool to use.
  rtla timerlat tool uses this interface dispatching its
  own threads as workload. But, rtla timerlat could also be
  used for any other process. So, add 'rtla timerlat -U'
  option, allowing the timerlat tool to measure the latency of
  any task using the timerlat tracer interface.

Verification:
- Update makefiles for verification/rv, using tools/build/
  makefiles like perf does, inheriting its benefits.
  For example, having a proper way to handle dependencies.


Please pull the latest trace-tools-v6.9-2 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
trace-tools-v6.9-2

Tag SHA1: e8d5e0f38601c3718874d95db2a0020ab1c454df
Head SHA1: a23c05fd76cf4ad27e0c74f7a93e7b089e94a55c


Daniel Bristot de Oliveira (4):
  tools/tracing: Use tools/build makefiles on latency-collector
  tools/rtla: Use tools/build makefiles to build rtla
  tools/verification: Use tools/build makefiles on rv
  tools/rtla: Add -U/--user-load option to timerlat


 .../tools/rtla/common_timerlat_options.rst |   6 +
 tools/tracing/latency/.gitignore   |   5 +-
 tools/tracing/latency/Build|   1 +
 tools/tracing/latency/Makefile | 105 --
 tools/tracing/latency/Makefile.config  |  30 +++
 tools/tracing/rtla/.gitignore  |   7 +-
 tools/tracing/rtla/Build   |   1 +
 tools/tracing/rtla/Makefile| 217 +++--
 tools/tracing/rtla/Makefile.config |  47 +
 tools/tracing/rtla/Makefile.rtla   |  80 
 tools/tracing/rtla/Makefile.standalone |  26 +++
 tools/tracing/rtla/sample/timerlat_load.py |  74 +++
 tools/tracing/rtla/src/Build   |  11 ++
 tools/tracing/rtla/src/timerlat_hist.c |  16 +-
 tools/tracing/rtla/src/timerlat_top.c  |  14 +-
 tools/verification/rv/.gitignore   |   6 +
 tools/verification/rv/Build|   1 +
 tools/verification/rv/Makefile | 207 +++-
 tools/verification/rv/Makefile.config  |  47 +
 tools/verification/rv/Makefile.rv  |  51 +
 tools/verification/rv/src/Build|   4 +
 21 files changed, 650 insertions(+), 306 deletions(-)
 create mode 100644 tools/tracing/latency/Build
 create mode 100644 tools/tracing/latency/Makefile.config
 create mode 100644 tools/tracing/rtla/Build
 create mode 100644 tools/tracing/rtla/Makefile.config
 create mode 100644 tools/tracing/rtla/Makefile.rtla
 create mode 100644 tools/tracing/rtla/Makefile.standalone
 create mode 100644 tools/tracing/rtla/sample/timerlat_load.py
 create mode 100644 tools/tracing/rtla/src/Build
 create mode 100644 tools/verification/rv/.gitignore
 create mode 100644 tools/verification/rv/Build
 create mode 100644 tools/verification/rv/Makefile.config
 create mode 100644 tools/verification/rv/Makefile.rv
 create mode 100644 tools/verification/rv/src/Build
---

[PATCH v7 2/5] tracing/probes: support '%pD' type for print struct file's name

2024-03-20 Thread Ye Bin

Similar to '%pD' for printk, use '%pD' for print struct file's name.

Signed-off-by: Ye Bin 
---
 kernel/trace/trace.c   |  2 +-
 kernel/trace/trace_probe.c | 57 +++---
 2 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ac26b8446337..831dfd0773a4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5510,7 +5510,7 @@ static const char readme_msg[] =
"\t   +|-[u](), \\imm-value, 
\\\"imm-string\"\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, 
symbol,\n"
"\t   b@/, ustring,\n"
-   "\t   symstr, %pd, \\[\\]\n"
+   "\t   symstr, %pd/%pD, \\[\\]\n"
 #ifdef CONFIG_HIST_TRIGGERS
"\tfield:  ;\n"
"\tstype: u8/u16/u32/u64, s8/s16/s32/s64, pid_t,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index a27567e16c36..7bfc6c0d5436 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -12,6 +12,7 @@
 #define pr_fmt(fmt)"trace_probe: " fmt
 
 #include 
+#include 
 #include "trace_btf.h"
 
 #include "trace_probe.h"
@@ -1581,35 +1582,47 @@ int traceprobe_expand_dentry_args(int argc, const char 
*argv[], char **buf)
 
used = 0;
for (i = 0; i < argc; i++) {
-   if (glob_match("*:%pd", argv[i])) {
-   char *tmp;
-   char *equal;
-
-   if (!tmpbuf) {
-   tmpbuf = kmalloc(bufsize, GFP_KERNEL);
-   if (!tmpbuf)
-   return -ENOMEM;
-   }
+   char *tmp;
+   char *equal;
+   size_t arg_len;
 
-   tmp = kstrdup(argv[i], GFP_KERNEL);
-   if (!tmp)
-   goto nomem;
+   if (!glob_match("*:%p[dD]", argv[i]))
+   continue;
 
-   equal = strchr(tmp, '=');
-   if (equal)
-   *equal = '\0';
-   tmp[strlen(argv[i]) - 4] = '\0';
+   if (!tmpbuf) {
+   tmpbuf = kmalloc(bufsize, GFP_KERNEL);
+   if (!tmpbuf)
+   return -ENOMEM;
+   }
+
+   tmp = kstrdup(argv[i], GFP_KERNEL);
+   if (!tmp)
+   goto nomem;
+
+   equal = strchr(tmp, '=');
+   if (equal)
+   *equal = '\0';
+   arg_len = strlen(argv[i]);
+   tmp[arg_len - 4] = '\0';
+   if (argv[i][arg_len - 1] == 'd')
ret = snprintf(tmpbuf + used, bufsize - used,
   "%s%s+0x0(+0x%zx(%s)):string",
   equal ? tmp : "", equal ? "=" : "",
   offsetof(struct dentry, d_name.name),
   equal ? equal + 1 : tmp);
-   kfree(tmp);
-   if (ret >= bufsize - used)
-   goto nomem;
-   argv[i] = tmpbuf + used;
-   used += ret + 1;
-   }
+   else
+   ret = snprintf(tmpbuf + used, bufsize - used,
+  "%s%s+0x0(+0x%zx(+0x%zx(%s))):string",
+  equal ? tmp : "", equal ? "=" : "",
+  offsetof(struct dentry, d_name.name),
+  offsetof(struct file, f_path.dentry),
+  equal ? equal + 1 : tmp);
+
+   kfree(tmp);
+   if (ret >= bufsize - used)
+   goto nomem;
+   argv[i] = tmpbuf + used;
+   used += ret + 1;
}
 
*buf = tmpbuf;
-- 
2.31.1

[PATCH v7 4/5] selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"

2024-03-20 Thread Ye Bin

This patch adds test cases for new print format type "%pd/%pD".The test cases
test the following items:
1. Test README if add "%pd/%pD" type;
2. Test "%pd" type for dput();
3. Test "%pD" type for vfs_read();

This test case require enable CONFIG_HAVE_FUNCTION_ARG_ACCESS_API configuration.

Signed-off-by: Ye Bin 
---
 .../ftrace/test.d/kprobe/kprobe_args_vfs.tc   | 40 +++
 1 file changed, 40 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc

diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc 
b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
new file mode 100644
index ..21a54be6894c
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
@@ -0,0 +1,40 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Kprobe event VFS type argument
+# requires: kprobe_events "%pd/%pD":README
+
+: "Test argument %pd with name"
+echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events
+echo 1 > events/kprobes/testprobe/enable
+grep -q "1" events/kprobes/testprobe/enable
+echo 0 > events/kprobes/testprobe/enable
+grep "dput" trace | grep -q "enable"
+echo "" > kprobe_events
+echo "" > trace
+
+: "Test argument %pd without name"
+echo 'p:testprobe dput $arg1:%pd' > kprobe_events
+echo 1 > events/kprobes/testprobe/enable
+grep -q "1" events/kprobes/testprobe/enable
+echo 0 > events/kprobes/testprobe/enable
+grep "dput" trace | grep -q "enable"
+echo "" > kprobe_events
+echo "" > trace
+
+: "Test argument %pD with name"
+echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_events
+echo 1 > events/kprobes/testprobe/enable
+grep -q "1" events/kprobes/testprobe/enable
+echo 0 > events/kprobes/testprobe/enable
+grep "vfs_read" trace | grep -q "enable"
+echo "" > kprobe_events
+echo "" > trace
+
+: "Test argument %pD without name"
+echo 'p:testprobe vfs_read $arg1:%pD' > kprobe_events
+echo 1 > events/kprobes/testprobe/enable
+grep -q "1"  events/kprobes/testprobe/enable
+echo 0 > events/kprobes/testprobe/enable
+grep "vfs_read" trace | grep -q "enable"
+echo "" > kprobe_events
+echo "" > trace
-- 
2.31.1

[PATCH v7 5/5] selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"

2024-03-20 Thread Ye Bin

This patch adds fprobe test cases for new print format type "%pd/%pD".The
test cases test the following items:
1. Test "%pd" type for dput();
2. Test "%pD" type for vfs_read();

This test case require enable CONFIG_HAVE_FUNCTION_ARG_ACCESS_API configuration.

Signed-off-by: Ye Bin 
---
 .../ftrace/test.d/dynevent/fprobe_args_vfs.tc | 40 +++
 1 file changed, 40 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/dynevent/fprobe_args_vfs.tc

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_args_vfs.tc 
b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_args_vfs.tc
new file mode 100644
index ..49a833bf334c
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_args_vfs.tc
@@ -0,0 +1,40 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Fprobe event VFS type argument
+# requires: kprobe_events "%pd/%pD":README
+
+: "Test argument %pd with name for fprobe"
+echo 'f:testprobe dput name=$arg1:%pd' > dynamic_events
+echo 1 > events/fprobes/testprobe/enable
+grep -q "1" events/fprobes/testprobe/enable
+echo 0 > events/fprobes/testprobe/enable
+grep "dput" trace | grep -q "enable"
+echo "" > dynamic_events
+echo "" > trace
+
+: "Test argument %pd without name for fprobe"
+echo 'f:testprobe dput $arg1:%pd' > dynamic_events
+echo 1 > events/fprobes/testprobe/enable
+grep -q "1" events/fprobes/testprobe/enable
+echo 0 > events/fprobes/testprobe/enable
+grep "dput" trace | grep -q "enable"
+echo "" > dynamic_events
+echo "" > trace
+
+: "Test argument %pD with name for fprobe"
+echo 'f:testprobe vfs_read name=$arg1:%pD' > dynamic_events
+echo 1 > events/fprobes/testprobe/enable
+grep -q "1" events/fprobes/testprobe/enable
+echo 0 > events/fprobes/testprobe/enable
+grep "vfs_read" trace | grep -q "enable"
+echo "" > dynamic_events
+echo "" > trace
+
+: "Test argument %pD without name for fprobe"
+echo 'f:testprobe vfs_read $arg1:%pD' > dynamic_events
+echo 1 > events/fprobes/testprobe/enable
+grep -q "1"  events/fprobes/testprobe/enable
+echo 0 > events/fprobes/testprobe/enable
+grep "vfs_read" trace | grep -q "enable"
+echo "" > dynamic_events
+echo "" > trace
-- 
2.31.1

[PATCH v7 3/5] Documentation: tracing: add new type '%pd' and '%pD' for kprobe

2024-03-20 Thread Ye Bin

Similar to printk() '%pd' is for fetch dentry's name from struct dentry's
pointer, and '%pD' is for fetch file's name from struct file's pointer.

Signed-off-by: Ye Bin 
---
 Documentation/trace/kprobetrace.rst | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/trace/kprobetrace.rst 
b/Documentation/trace/kprobetrace.rst
index bf9cecb69fc9..f13f0fc11251 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -58,8 +58,9 @@ Synopsis of kprobe_events
   NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
   FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
- (x8/x16/x32/x64), "char", "string", "ustring", "symbol", 
"symstr"
-  and bitfield are supported.
+ (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+  "string", "ustring", "symbol", "symstr" and bitfield are
+  supported.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument 
access
 is best effort, because depending on the argument type, it may be 
passed on
@@ -113,6 +114,9 @@ With 'symstr' type, you can filter the event with wildcard 
pattern of the
 symbols, and you don't need to solve symbol name by yourself.
 For $comm, the default type is "string"; any other type is invalid.
 
+VFS layer common type(%pd/%pD) is a special type, which fetches dentry's or
+file's name from struct dentry's address or struct file's address.
+
 .. _user_mem_access:
 
 User Memory Access
-- 
2.31.1

[PATCH v7 0/5] support '%pd' and '%pD' for print file name

2024-03-20 Thread Ye Bin

During fault locating, the file name needs to be printed based on the
dentry/file address. The offset needs to be calculated each time, which
is troublesome. Similar to printk, kprobe supports printing file names
for dentry/file addresses.

Diff v7 vs v6:
1. Squash [1/8] to [3/8] patches into 1 patch;
2. Split readme_msg[] into each patch;

Diff v6 vs v5:
1. Add const for 'bufsize' in PATCH [1];
2. Move PATCH 'tracing/probes: support '%pd/%pD' type for fprobe' after
PATCH "tracing/probes: support '%pd' type for print struct dentry's name".
3. Add requires '"%pd/%pD":README' for testcase.

Diff v5 vs v4:
1. Use glob_match() instead of str_has_suffix(), so remove the first patch;
2. Allocate buffer from heap for expand dentry;
3. Support "%pd/%pD" print type for fprobe;
4. Use $arg1 instead of origin register in test case;
5. Add test case for fprobe;

Diff v4 vs v3:
1. Use "argv[i][idx + 3] == 'd'" instead of "argv[i][strlen(argv[i]) - 1] == 
'd'"
to judge print format in PATCH[4/7];

Diff v3 vs v2:
1. Return the index of where the suffix was found in str_has_suffix();

Diff v2 vs v1:
1. Use "%pd/%pD" print format instead of "pd/pD" print format;
2. Add "%pd/%pD" in README;
3. Expand "%pd/%pD" argument before parameter parsing;
4. Add more detail information in ftrace documentation;
5. Add test cases for new print format in selftests/ftrace;


Ye Bin (5):
  tracing/probes: support '%pd' type for print struct dentry's name
  tracing/probes: support '%pD' type for print struct file's name
  Documentation: tracing: add new type '%pd' and '%pD' for kprobe
  selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
  selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"

 Documentation/trace/kprobetrace.rst   |  8 ++-
 kernel/trace/trace.c  |  2 +-
 kernel/trace/trace_fprobe.c   |  6 ++
 kernel/trace/trace_kprobe.c   |  6 ++
 kernel/trace/trace_probe.c| 63 +++
 kernel/trace/trace_probe.h|  2 +
 .../ftrace/test.d/dynevent/fprobe_args_vfs.tc | 40 
 .../ftrace/test.d/kprobe/kprobe_args_vfs.tc   | 40 
 8 files changed, 164 insertions(+), 3 deletions(-)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/dynevent/fprobe_args_vfs.tc
 create mode 100644 
tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc

-- 
2.31.1

[PATCH v7 1/5] tracing/probes: support '%pd' type for print struct dentry's name

2024-03-20 Thread Ye Bin

Support print type '%pd' for print dentry's  name.

Signed-off-by: Ye Bin 
---
 kernel/trace/trace.c|  2 +-
 kernel/trace/trace_fprobe.c |  6 +
 kernel/trace/trace_kprobe.c |  6 +
 kernel/trace/trace_probe.c  | 50 +
 kernel/trace/trace_probe.h  |  2 ++
 5 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b12f8384a36a..ac26b8446337 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5510,7 +5510,7 @@ static const char readme_msg[] =
"\t   +|-[u](), \\imm-value, 
\\\"imm-string\"\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, 
symbol,\n"
"\t   b@/, ustring,\n"
-   "\t   symstr, \\[\\]\n"
+   "\t   symstr, %pd, \\[\\]\n"
 #ifdef CONFIG_HIST_TRIGGERS
"\tfield:  ;\n"
"\tstype: u8/u16/u32/u64, s8/s16/s32/s64, pid_t,\n"
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 7d2ddbcfa377..988d68e906ad 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -976,6 +976,7 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
char gbuf[MAX_EVENT_NAME_LEN];
char sbuf[KSYM_NAME_LEN];
char abuf[MAX_BTF_ARGS_LEN];
+   char *dbuf = NULL;
bool is_tracepoint = false;
struct tracepoint *tpoint = NULL;
struct traceprobe_parse_context ctx = {
@@ -1086,6 +1087,10 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
argv = new_argv;
}
 
+   ret = traceprobe_expand_dentry_args(argc, argv, );
+   if (ret)
+   goto out;
+
/* setup a probe */
tf = alloc_trace_fprobe(group, event, symbol, tpoint, maxactive,
argc, is_return);
@@ -1131,6 +1136,7 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
trace_probe_log_clear();
kfree(new_argv);
kfree(symbol);
+   kfree(dbuf);
return ret;
 
 parse_error:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c4c6e0e0068b..7cbb43740b4f 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -779,6 +779,7 @@ static int __trace_kprobe_create(int argc, const char 
*argv[])
char buf[MAX_EVENT_NAME_LEN];
char gbuf[MAX_EVENT_NAME_LEN];
char abuf[MAX_BTF_ARGS_LEN];
+   char *dbuf = NULL;
struct traceprobe_parse_context ctx = { .flags = TPARG_FL_KERNEL };
 
switch (argv[0][0]) {
@@ -930,6 +931,10 @@ static int __trace_kprobe_create(int argc, const char 
*argv[])
argv = new_argv;
}
 
+   ret = traceprobe_expand_dentry_args(argc, argv, );
+   if (ret)
+   goto out;
+
/* setup a probe */
tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive,
argc, is_return);
@@ -971,6 +976,7 @@ static int __trace_kprobe_create(int argc, const char 
*argv[])
trace_probe_log_clear();
kfree(new_argv);
kfree(symbol);
+   kfree(dbuf);
return ret;
 
 parse_error:
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 34289f9c6707..a27567e16c36 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1569,6 +1569,56 @@ const char **traceprobe_expand_meta_args(int argc, const 
char *argv[],
return ERR_PTR(ret);
 }
 
+/* @buf: *buf must be equal to NULL. Caller must to free *buf */
+int traceprobe_expand_dentry_args(int argc, const char *argv[], char **buf)
+{
+   int i, used, ret;
+   const int bufsize = MAX_DENTRY_ARGS_LEN;
+   char *tmpbuf = NULL;
+
+   if (*buf)
+   return -EINVAL;
+
+   used = 0;
+   for (i = 0; i < argc; i++) {
+   if (glob_match("*:%pd", argv[i])) {
+   char *tmp;
+   char *equal;
+
+   if (!tmpbuf) {
+   tmpbuf = kmalloc(bufsize, GFP_KERNEL);
+   if (!tmpbuf)
+   return -ENOMEM;
+   }
+
+   tmp = kstrdup(argv[i], GFP_KERNEL);
+   if (!tmp)
+   goto nomem;
+
+   equal = strchr(tmp, '=');
+   if (equal)
+   *equal = '\0';
+   tmp[strlen(argv[i]) - 4] = '\0';
+   ret = snprintf(tmpbuf + used, bufsize - used,
+  "%s%s+0x0(+0x%zx(%s)):string",
+  equal ? tmp : "", equal ? "=" : "",
+  offsetof(struct dentry, d_name.name),
+  equal ? equal + 1 : tmp);
+   kfree(tmp);
+   if

Re: [GIT PULL] tracing/tools: Updates for 6.9

2024-03-20 Thread Steven Rostedt

On Wed, 20 Mar 2024 13:41:12 +0100
Daniel Bristot de Oliveira  wrote:

> On 3/20/24 00:02, Steven Rostedt wrote:
> > On Mon, 18 Mar 2024 18:41:13 +0100
> > Daniel Bristot de Oliveira  wrote:
> >   
> >> Steven,
> >>
> >> Tracing tooling updates for 6.9
> >>
> >> Tracing:
> >> - Update makefiles for latency-collector and RTLA,
> >>   using tools/build/ makefiles like perf does, inheriting
> >>   its benefits. For example, having a proper way to
> >>   handle dependencies.
> >>
> >> - The timerlat tracer has an interface for any tool to use.
> >>   rtla timerlat tool uses this interface dispatching its
> >>   own threads as workload. But, rtla timerlat could also be
> >>   used for any other process. So, add 'rtla timerlat -U'
> >>   option, allowing the timerlat tool to measure the latency of
> >>   any task using the timerlat tracer interface.
> >>
> >> Verification:
> >> - Update makefiles for verification/rv, using tools/build/
> >>   makefiles like perf does, inheriting its benefits.
> >>   For example, having a proper way to handle dependencies.
> >>
> >>
> >> Please pull the latest trace-tools-v6.9 tree, which can be found at:
> >>
> >>
> >>   git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
> >> trace-tools-v6.9  
> > 
> > Looks like you just built on top of a random commit from Linus's tree:  
> 
> yeah :-/
> 
> > commit f6cef5f8c37f58a3bc95b3754c3ae98e086631ca
> > Merge: 906a93befec8 8f06fb458539
> > Author: Linus Torvalds 
> > Date:   Sun Mar 17 16:59:33 2024 -0700
> > 
> > Merge tag 'i3c/for-6.9' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
> > 
> > Linus prefers basing off of real tags or previous pulls from us.  
> 
> Ack, took note. I will do on top v6.8 tag.
> 
> > Can you rebase your changes on v6.8 and resend?
> > 
> >   $ git checkout v6.8
> >   $ git cherry-pick 
> > f6cef5f8c37f58a3bc95b3754c3ae98e086631ca..trace-tools-v6.9
> > 
> > Appears to work fine.  
> 
> questions: when something go wrong in a pull request
> 
>   - Should I keep the old tag, and then do another one with -2
> (it seems you do like this), or delete the old tag and send it again
> with the same name?

Just create a new tag.

> 
>   - Should I resend the PULL request with something in the log or
> at the Subject saying it is a v2 of the pull request?

Yes please.

> 
> I could ask via chat, but I think it is good for the community to
> have access to these info.

+1

Thanks,


-- Steve

Re: [PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Steven Rostedt

On Wed, 20 Mar 2024 17:10:38 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Fix to initialize 'val' local variable with zero.
> Dan reported that Smatch static code checker reports an error that a local
> 'val' variable needs to be initialized. Actually, the 'val' is expected to
> be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
> initialize it with zero.

BTW, that loop should really have a comment stating that FETCH_OP_ARG is
expected to happen before FETCH_OP_ST_EDATA.

-- Steve

Re: [PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Steven Rostedt

On Wed, 20 Mar 2024 17:10:38 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Fix to initialize 'val' local variable with zero.
> Dan reported that Smatch static code checker reports an error that a local
> 'val' variable needs to be initialized. Actually, the 'val' is expected to
> be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
> initialize it with zero.
> 
> Reported-by: Dan Carpenter 
> Closes: 
> https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
> Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe 
> and fprobe)")
> Signed-off-by: Masami Hiramatsu (Google) 
> ---

Reviewed-by: Steven Rostedt (Google) 

-- Steve

>  kernel/trace/trace_probe.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 217169de0920..dfe3ee6035ec 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -839,7 +839,7 @@ int traceprobe_get_entry_data_size(struct trace_probe *tp)
>  void store_trace_entry_data(void *edata, struct trace_probe *tp, struct 
> pt_regs *regs)
>  {
>   struct probe_entry_arg *earg = tp->entry_arg;
> - unsigned long val;
> + unsigned long val = 0;
>   int i;
>  
>   if (!earg)

Re: [bug report] tracing/probes: Support $argN in return probe (kprobe and fprobe)

2024-03-20 Thread Steven Rostedt

On Wed, 20 Mar 2024 12:44:23 +0900
Masami Hiramatsu (Google)  wrote:

> > > kernel/trace/trace_probe.c
> > > 846 return;
> > > 847 
> > > 848 for (i = 0; i < earg->size; i++) {
> > > 849 struct fetch_insn *code = >code[i];
> > > 850 
> > > 851 switch (code->op) {
> > > 852 case FETCH_OP_ARG:
> > > 853 val = regs_get_kernel_argument(regs, 
> > > code->param);
> > > 854 break;
> > > 855 case FETCH_OP_ST_EDATA:  
> > > --> 856 *(unsigned long *)((unsigned long)edata + 
> > > code->offset) = val;
> > > 
> > > Probably the earg->code[i] always has FETCH_OP_ARG before
> > > FETCH_OP_ST_EDATA but Smatch isn't smart enough to figure that out...  
> > 
> > Looks that way:
> > 
> > case FETCH_OP_END:
> > earg->code[i].op = FETCH_OP_ARG;
> > earg->code[i].param = argnum;
> > earg->code[i + 1].op = FETCH_OP_ST_EDATA;
> > earg->code[i + 1].offset = offset;
> > return offset;
> > 
> > But probably should still initialize val to zero or have a WARN_ON() if
> > that doesn't happen.  
> 
> OK, let's val = 0 in the store_trace_entry_data(), but WARN_ON() in this loop
> is a bit strange. I think we should have a verifiler.

Initializing to zero is fine.

-- Steve

Re: [PATCH v6 3/8] tracing/probes: support '%pd' type for fprobe

2024-03-20 Thread Google

Hi Ye,

BTW, if you have a chance, please squash [1/8] to [3/8] patches
into 1 patch. There is no reason to split these since these are
a local(= the same subsystem) function implementation and callers.

Thank you,

On Fri, 15 Mar 2024 14:55:35 +0800
Ye Bin  wrote:

> Support print type '%pd' for print dentry's or file's name.
> 
> Signed-off-by: Ye Bin 
> ---
>  kernel/trace/trace_fprobe.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> index 7d2ddbcfa377..988d68e906ad 100644
> --- a/kernel/trace/trace_fprobe.c
> +++ b/kernel/trace/trace_fprobe.c
> @@ -976,6 +976,7 @@ static int __trace_fprobe_create(int argc, const char 
> *argv[])
>   char gbuf[MAX_EVENT_NAME_LEN];
>   char sbuf[KSYM_NAME_LEN];
>   char abuf[MAX_BTF_ARGS_LEN];
> + char *dbuf = NULL;
>   bool is_tracepoint = false;
>   struct tracepoint *tpoint = NULL;
>   struct traceprobe_parse_context ctx = {
> @@ -1086,6 +1087,10 @@ static int __trace_fprobe_create(int argc, const char 
> *argv[])
>   argv = new_argv;
>   }
>  
> + ret = traceprobe_expand_dentry_args(argc, argv, );
> + if (ret)
> + goto out;
> +
>   /* setup a probe */
>   tf = alloc_trace_fprobe(group, event, symbol, tpoint, maxactive,
>   argc, is_return);
> @@ -1131,6 +1136,7 @@ static int __trace_fprobe_create(int argc, const char 
> *argv[])
>   trace_probe_log_clear();
>   kfree(new_argv);
>   kfree(symbol);
> + kfree(dbuf);
>   return ret;
>  
>  parse_error:
> -- 
> 2.31.1
> 


-- 
Masami Hiramatsu (Google)

Re: [GIT PULL] tracing/tools: Updates for 6.9

2024-03-20 Thread Daniel Bristot de Oliveira

On 3/20/24 00:02, Steven Rostedt wrote:
> On Mon, 18 Mar 2024 18:41:13 +0100
> Daniel Bristot de Oliveira  wrote:
> 
>> Steven,
>>
>> Tracing tooling updates for 6.9
>>
>> Tracing:
>> - Update makefiles for latency-collector and RTLA,
>>   using tools/build/ makefiles like perf does, inheriting
>>   its benefits. For example, having a proper way to
>>   handle dependencies.
>>
>> - The timerlat tracer has an interface for any tool to use.
>>   rtla timerlat tool uses this interface dispatching its
>>   own threads as workload. But, rtla timerlat could also be
>>   used for any other process. So, add 'rtla timerlat -U'
>>   option, allowing the timerlat tool to measure the latency of
>>   any task using the timerlat tracer interface.
>>
>> Verification:
>> - Update makefiles for verification/rv, using tools/build/
>>   makefiles like perf does, inheriting its benefits.
>>   For example, having a proper way to handle dependencies.
>>
>>
>> Please pull the latest trace-tools-v6.9 tree, which can be found at:
>>
>>
>>   git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
>> trace-tools-v6.9
> 
> Looks like you just built on top of a random commit from Linus's tree:

yeah :-/

> commit f6cef5f8c37f58a3bc95b3754c3ae98e086631ca
> Merge: 906a93befec8 8f06fb458539
> Author: Linus Torvalds 
> Date:   Sun Mar 17 16:59:33 2024 -0700
> 
> Merge tag 'i3c/for-6.9' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
> 
> Linus prefers basing off of real tags or previous pulls from us.

Ack, took note. I will do on top v6.8 tag.

> Can you rebase your changes on v6.8 and resend?
> 
>   $ git checkout v6.8
>   $ git cherry-pick f6cef5f8c37f58a3bc95b3754c3ae98e086631ca..trace-tools-v6.9
> 
> Appears to work fine.

questions: when something go wrong in a pull request

- Should I keep the old tag, and then do another one with -2
  (it seems you do like this), or delete the old tag and send it again
  with the same name?

- Should I resend the PULL request with something in the log or
  at the Subject saying it is a v2 of the pull request?

I could ask via chat, but I think it is good for the community to
have access to these info.

Thanks!
-- Daniel

> Thanks!
> 
> -- Steve

Re: Re: [PATCH v2] net/ipv4: add tracepoint for icmp_send

2024-03-20 Thread He Peilin

> > From: Peilin He
> > 
> > Introduce a tracepoint for icmp_send, which can help users to get more
> > detail information conveniently when icmp abnormal events happen.
> > 
> > 1. Giving an usecase example:
> > =
> > When an application experiences packet loss due to an unreachable UDP
> > destination port, the kernel will send an exception message through the
> > icmp_send function. By adding a trace point for icmp_send, developers or
> > system administrators can obtain detailed information about the UDP
> > packet loss, including the type, code, source address, destination address,
> > source port, and destination port. This facilitates the trouble-shooting
> > of UDP packet loss issues especially for those network-service
> > applications.
> > 
> > 2. Operation Instructions:
> > ==
> > Switch to the tracing directory.
> > cd /sys/kernel/debug/tracing
> 
> FYI, that directory is obsolete. Please always reference /sys/kernel/tracing.
OK. 
> > Filter for destination port unreachable.
> > echo "type==3 && code==3" > events/icmp/icmp_send/filter
> > Enable trace event.
> > echo 1 > events/icmp/icmp_send/enable
> > 
> > 3. Result View:
> > 
> >  udp_client_erro-11370   [002] ...s.12   124.728002:
> >  icmp_send: icmp_send: type=3, code=3.
> >  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
> >  skbaddr=589b167a
> > 
> > v1->v2:
> > Some fixes according to
> > https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
> > 1. adjust the trace_icmp_send() to more protocols than UDP.
> > 2. move the calling of trace_icmp_send after sanity checks
> >in __icmp_send().
> > 
> > Signed-off-by: Peilin He
> > Reviewed-by: xu xin 
> > Reviewed-by: Yunkai Zhang 
> > Cc: Yang Yang 
> > Cc: Liu Chun 
> > Cc: Xuexin Jiang 
> > ---
> >  include/trace/events/icmp.h | 64 
> > +
> >  net/ipv4/icmp.c |  4 +++
> >  2 files changed, 68 insertions(+)
> >  create mode 100644 include/trace/events/icmp.h
> > 
> > diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> > new file mode 100644
> > index ..c3dc337be7bc
> > --- /dev/null
> > +++ b/include/trace/events/icmp.h
> > @@ -0,0 +1,64 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM icmp
> > +
> > +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_ICMP_H
> > +
> > +#include 
> > +#include 
> > +
> > +TRACE_EVENT(icmp_send,
> > +
> > +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> > +
> > +   TP_ARGS(skb, type, code),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u16, sport)
> > +   __field(__u16, dport)
> > +   __field(int, type)
> > +   __field(int, code)
> > +   __array(__u8, saddr, 4)
> > +   __array(__u8, daddr, 4)
> > +   __field(const void *, skbaddr)
> > +   __field(unsigned short, ulen)
> 
> Note, to prevent holes, I usually suggest pointers and longs go first,
> followed by ints, and then end with char.
> 
>   __field(const void *, skbaddr)
>   __field(int, type)
>   __field(int, code)
>   __array(__u8, saddr, 4)
>   __array(__u8, daddr, 4)
>   __field(__u16, sport)
>   __field(__u16, dport)
>   __field(unsigned short, ulen)
> 
> -- Steve
Thank you very much for your suggestion. We will rearrange the parameters in 
TP_STRUCT_entry, prioritizing pointers and longs, followed by ints, and ending 
with char. This will be reflected in Patch v3.
> 
> 
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   struct iphdr *iph = ip_hdr(skb);
> > +   int proto_4 = iph->protocol;
> > +   __be32 *p32;
> > +
> > +   __entry->skbaddr = skb;
> > +   __entry->type = type;
> > +   __entry->code = code;
> > +
> > +   if (proto_4 == IPPROTO_UDP) {
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   __entry->sport = ntohs(uh->source);
> > +   __entry->dport = ntohs(uh->dest);
> > +   __entry->ulen = ntohs(uh->len);
> > +   } else {
> > +   __entry->sport = 0;
> > +   __entry->dport = 0;
> > +   __entry->ulen = 0;
> > +   }
> > +
> > +   p32 = (__be32 *) __entry->saddr;
> > +   *p32 = iph->saddr;
> > +
> > +   p32 = (__be32 *) __entry->daddr;
> > +   *p32 = iph->daddr;
> > +

Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-20 Thread Stefan Hajnoczi

On Tue, Mar 19, 2024 at 03:51:18PM -0500, Mike Christie wrote:
> On 3/19/24 12:19 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> >> On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> >>> Hello,
> >>>
> >>> syzbot found the following issue on:
> >>>
> >>> HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of 
> >>> git://git.kernel.org/p..
> >>> git tree:   upstream
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> >>> kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> >>> dashboard link: 
> >>> https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> >>> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> >>> Debian) 2.40
> >>>
> >>> Downloadable assets:
> >>> disk image: 
> >>> https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> >>> vmlinux: 
> >>> https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> >>> kernel image: 
> >>> https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> >>>
> >>> IMPORTANT: if you fix the issue, please add the following tag to the 
> >>> commit:
> >>> Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> >>>
> >>> Key type pkcs7_test registered
> >>> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> >>> io scheduler mq-deadline registered
> >>> io scheduler kyber registered
> >>> io scheduler bfq registered
> >>> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> >>> ACPI: button: Power Button [PWRF]
> >>> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> >>> ACPI: button: Sleep Button [SLPF]
> >>> ioatdma: Intel(R) QuickData Technology Driver 5.00
> >>> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> >>> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> >>> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> >>> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> >>> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> >>> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> >>> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> >>> N_HDLC line discipline registered with maxframe=4096
> >>> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> >>> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> >>> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> >>> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> >>> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> >>> Non-volatile memory driver v1.3
> >>> Linux agpgart interface v0.103
> >>> ACPI: bus type drm_connector registered
> >>> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> >>> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> >>> Console: switching to colour frame buffer device 128x48
> >>> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> >>> usbcore: registered new interface driver udl
> >>> brd: module loaded
> >>> loop: module loaded
> >>> zram: Added device: zram0
> >>> null_blk: disk nullb0 created
> >>> null_blk: module loaded
> >>> Guest personality initialized and is inactive
> >>> VMCI host device registered (name=vmci, major=10, minor=118)
> >>> Initialized host personality
> >>> usbcore: registered new interface driver rtsx_usb
> >>> usbcore: registered new interface driver viperboard
> >>> usbcore: registered new interface driver dln2
> >>> usbcore: registered new interface driver pn533_usb
> >>> nfcsim 0.2 initialized
> >>> usbcore: registered new interface driver port100
> >>> usbcore: registered new interface driver nfcmrvl
> >>> Loading iSCSI transport class v2.0-870.
> >>> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> >>> [ cut here ]
> >>> refcount_t: decrement hit 0; leaking memory.
> >>> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> >>> refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> >>> Modules linked in:
> >>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> >>> 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
> >>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> >>> Google 02/29/2024
> >>> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> >>> Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 
> >>> 05 0c f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 
> >>> eb d9 e8 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> >>> RSP: :c9066e18 EFLAGS: 00010246
> >>> RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> >>> RDX:  RSI:  RDI: 
> >>> RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> >>> R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> >>> R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> >>> FS:  ()

Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-20 Thread Stefano Garzarella


On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:

From: Rong Wang 

Once enable iommu domain for one device, the MSI
translation tables have to be there for software-managed MSI.
Otherwise, platform with software-managed MSI without an
irq bypass function, can not get a correct memory write event
from pcie, will not get irqs.
The solution is to obtain the MSI phy base address from
iommu reserved region, and set it to iommu MSI cookie,
then translation tables will be created while request irq.

Change log
--

v1->v2:
- add resv iotlb to avoid overlap mapping.
v2->v3:
- there is no need to export the iommu symbol anymore.

Signed-off-by: Rong Wang 
---
drivers/vhost/vdpa.c | 59 +---
1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ba52d128aeb7..28b56b10372b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -49,6 +49,7 @@ struct vhost_vdpa {
struct completion completion;
struct vdpa_device *vdpa;
struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
+   struct vhost_iotlb resv_iotlb;
struct device dev;
struct cdev cdev;
atomic_t opened;
@@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
static int vhost_vdpa_reset(struct vhost_vdpa *v)
{
v->in_batch = 0;
+   vhost_iotlb_reset(>resv_iotlb);
return _compat_vdpa_reset(v);
}

@@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
msg->iova + msg->size - 1 > v->range.last)
return -EINVAL;

+   if (vhost_iotlb_itree_first(>resv_iotlb, msg->iova,
+   msg->iova + msg->size - 1))
+   return -EINVAL;
+
if (vhost_iotlb_itree_first(iotlb, msg->iova,
msg->iova + msg->size - 1))
return -EEXIST;

+


Unnecessary new line here.


if (vdpa->use_va)
return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
 msg->uaddr, msg->perm);
@@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb 
*iocb,
return vhost_chr_write_iter(dev, from);
}

+static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct 
device *dma_dev,
+   struct vhost_iotlb *resv_iotlb)
+{
+   struct list_head dev_resv_regions;
+   phys_addr_t resv_msi_base = 0;
+   struct iommu_resv_region *region;
+   int ret = 0;
+   bool with_sw_msi = false;
+   bool with_hw_msi = false;
+
+   INIT_LIST_HEAD(_resv_regions);
+   iommu_get_resv_regions(dma_dev, _resv_regions);
+
+   list_for_each_entry(region, _resv_regions, list) {
+   ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
+   region->start + region->length - 1,
+   0, 0, NULL);
+   if (ret) {
+   vhost_iotlb_reset(resv_iotlb);
+   break;
+   }
+
+   if (region->type == IOMMU_RESV_MSI)
+   with_hw_msi = true;
+
+   if (region->type == IOMMU_RESV_SW_MSI) {
+   resv_msi_base = region->start;


Can it happen that there are multiple regions of the IOMMU_RESV_SW_MSI 
type?


In this case, is it correct to overwrite `resv_msi_base`?


+   with_sw_msi = true;
+   }
+   }
+
+   if (!ret && !with_hw_msi && with_sw_msi)
+   ret = iommu_get_msi_cookie(domain, resv_msi_base);


If `iommu_get_msi_cookie()` fails:
 - Should we avoid calling iommu_put_resv_regions()?
 - Should we also call `vhost_iotlb_reset(resv_iotlb)` like for the
   vhost_iotlb_add_range_ctx() failure ?

If it is the case, maybe it's better to add an error label where do the 
cleanup.



+
+   iommu_put_resv_regions(dma_dev, _resv_regions);
+
+   return ret;
+}
+
static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
{
struct vdpa_device *vdpa = v->vdpa;
@@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)

ret = iommu_attach_device(v->domain, dma_dev);
if (ret)
-   goto err_attach;
+   goto err_alloc_domain;

-   return 0;
+   ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, >resv_iotlb);
+   if (ret)
+   goto err_attach_device;

-err_attach:
+   return 0;


I suggest to add a new line here to separate the error path for the 
success path.



+err_attach_device:
+   iommu_detach_device(v->domain, dma_dev);
+err_alloc_domain:
iommu_domain_free(v->domain);
v->domain = NULL;
return ret;
@@ -1595,6 +1646,8 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
goto err;
}

+   vhost_iotlb_init(>resv_iotlb, 0, 0);
+


IIUC the lifetime of

Re: [PATCH] tracing: Explicitly cast divisor to fix Coccinelle warning

2024-03-20 Thread Google

On Mon, 18 Mar 2024 11:52:44 +0100
Thorsten Blum  wrote:

> Explicitly cast the divisor to u32 to fix a Coccinelle/coccicheck warning
> reported by do_div.cocci.
> 

Hmm, strange, trace_do_benchmark() has another do_div(u64, u64). 

do {
last_seed = seed;
seed = stddev;
if (!last_seed)
break;
do_div(seed, last_seed);
seed += last_seed;
do_div(seed, 2);
} while (i++ < 10 && last_seed != seed);

Didn't Coccinelle find that?

Thank you,

> Signed-off-by: Thorsten Blum 
> ---
>  kernel/trace/trace_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_benchmark.c b/kernel/trace/trace_benchmark.c
> index 54d5fa35c90a..218b40050629 100644
> --- a/kernel/trace/trace_benchmark.c
> +++ b/kernel/trace/trace_benchmark.c
> @@ -105,7 +105,7 @@ static void trace_do_benchmark(void)
>   stddev = 0;
>  
>   delta = bm_total;
> - do_div(delta, bm_cnt);
> + do_div(delta, (u32)bm_cnt);
>   avg = delta;
>  
>   if (stddev > 0) {
> -- 
> 2.44.0
> 


-- 
Masami Hiramatsu (Google)

Re: [PATCH v6 0/8] support '%pd' and '%pD' for print file name

2024-03-20 Thread Google

Hi Ye,

On Fri, 15 Mar 2024 14:55:32 +0800
Ye Bin  wrote:

> Sorry for taking so long to post the V6 version. I thought my email was
> sent successfully. I checked the patchwork and it was not sent successfully.
> 
> During fault locating, the file name needs to be printed based on the
> dentry/file address. The offset needs to be calculated each time, which
> is troublesome. Similar to printk, kprobe supports printing file names
> for dentry/file addresses.

Thanks for updating the series.

There are some nits, but basically this is good to me.
Could you update this series?

Thank you,

> 
> Diff v6 vs v5:
> 1. Add const for 'bufsize' in PATCH [1];
> 2. Move PATCH 'tracing/probes: support '%pd/%pD' type for fprobe' after
> PATCH "tracing/probes: support '%pd' type for print struct dentry's name".
> 3. Add requires '"%pd/%pD":README' for testcase.
> 
> Diff v5 vs v4:
> 1. Use glob_match() instead of str_has_suffix(), so remove the first patch;
> 2. Allocate buffer from heap for expand dentry;
> 3. Support "%pd/%pD" print type for fprobe;
> 4. Use $arg1 instead of origin register in test case;
> 5. Add test case for fprobe;
> 
> Diff v4 vs v3:
> 1. Use "argv[i][idx + 3] == 'd'" instead of "argv[i][strlen(argv[i]) - 1] == 
> 'd'"
> to judge print format in PATCH[4/7];
> 
> Diff v3 vs v2:
> 1. Return the index of where the suffix was found in str_has_suffix();
> 
> Diff v2 vs v1:
> 1. Use "%pd/%pD" print format instead of "pd/pD" print format;
> 2. Add "%pd/%pD" in README;
> 3. Expand "%pd/%pD" argument before parameter parsing;
> 4. Add more detail information in ftrace documentation;
> 5. Add test cases for new print format in selftests/ftrace;
> 
> Ye Bin (8):
>   tracing/probes: add traceprobe_expand_dentry_args() helper
>   tracing/probes: support '%pd' type for print struct dentry's name
>   tracing/probes: support '%pd' type for fprobe
>   tracing/probes: support '%pD' type for print struct file's name
>   tracing: add new type "%pd/%pD" in readme_msg[]
>   Documentation: tracing: add new type '%pd' and '%pD' for kprobe
>   selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"
>   selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD"
> 
>  Documentation/trace/kprobetrace.rst   |  8 ++-
>  kernel/trace/trace.c  |  2 +-
>  kernel/trace/trace_fprobe.c   |  6 ++
>  kernel/trace/trace_kprobe.c   |  6 ++
>  kernel/trace/trace_probe.c| 63 +++
>  kernel/trace/trace_probe.h|  2 +
>  .../ftrace/test.d/dynevent/fprobe_args_vfs.tc | 40 
>  .../ftrace/test.d/kprobe/kprobe_args_vfs.tc   | 43 +
>  8 files changed, 167 insertions(+), 3 deletions(-)
>  create mode 100644 
> tools/testing/selftests/ftrace/test.d/dynevent/fprobe_args_vfs.tc
>  create mode 100644 
> tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> 
> -- 
> 2.31.1
> 


-- 
Masami Hiramatsu (Google)

[PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-20 Thread Wang Rong

From: Rong Wang 

Once enable iommu domain for one device, the MSI
translation tables have to be there for software-managed MSI.
Otherwise, platform with software-managed MSI without an
irq bypass function, can not get a correct memory write event
from pcie, will not get irqs.
The solution is to obtain the MSI phy base address from
iommu reserved region, and set it to iommu MSI cookie,
then translation tables will be created while request irq.

Change log
--

v1->v2:
- add resv iotlb to avoid overlap mapping.
v2->v3:
- there is no need to export the iommu symbol anymore.

Signed-off-by: Rong Wang 
---
 drivers/vhost/vdpa.c | 59 +---
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ba52d128aeb7..28b56b10372b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -49,6 +49,7 @@ struct vhost_vdpa {
struct completion completion;
struct vdpa_device *vdpa;
struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
+   struct vhost_iotlb resv_iotlb;
struct device dev;
struct cdev cdev;
atomic_t opened;
@@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
 static int vhost_vdpa_reset(struct vhost_vdpa *v)
 {
v->in_batch = 0;
+   vhost_iotlb_reset(>resv_iotlb);
return _compat_vdpa_reset(v);
 }
 
@@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
msg->iova + msg->size - 1 > v->range.last)
return -EINVAL;
 
+   if (vhost_iotlb_itree_first(>resv_iotlb, msg->iova,
+   msg->iova + msg->size - 1))
+   return -EINVAL;
+
if (vhost_iotlb_itree_first(iotlb, msg->iova,
msg->iova + msg->size - 1))
return -EEXIST;
 
+
if (vdpa->use_va)
return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
 msg->uaddr, msg->perm);
@@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb 
*iocb,
return vhost_chr_write_iter(dev, from);
 }
 
+static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct 
device *dma_dev,
+   struct vhost_iotlb *resv_iotlb)
+{
+   struct list_head dev_resv_regions;
+   phys_addr_t resv_msi_base = 0;
+   struct iommu_resv_region *region;
+   int ret = 0;
+   bool with_sw_msi = false;
+   bool with_hw_msi = false;
+
+   INIT_LIST_HEAD(_resv_regions);
+   iommu_get_resv_regions(dma_dev, _resv_regions);
+
+   list_for_each_entry(region, _resv_regions, list) {
+   ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
+   region->start + region->length - 1,
+   0, 0, NULL);
+   if (ret) {
+   vhost_iotlb_reset(resv_iotlb);
+   break;
+   }
+
+   if (region->type == IOMMU_RESV_MSI)
+   with_hw_msi = true;
+
+   if (region->type == IOMMU_RESV_SW_MSI) {
+   resv_msi_base = region->start;
+   with_sw_msi = true;
+   }
+   }
+
+   if (!ret && !with_hw_msi && with_sw_msi)
+   ret = iommu_get_msi_cookie(domain, resv_msi_base);
+
+   iommu_put_resv_regions(dma_dev, _resv_regions);
+
+   return ret;
+}
+
 static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
 {
struct vdpa_device *vdpa = v->vdpa;
@@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
 
ret = iommu_attach_device(v->domain, dma_dev);
if (ret)
-   goto err_attach;
+   goto err_alloc_domain;
 
-   return 0;
+   ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, >resv_iotlb);
+   if (ret)
+   goto err_attach_device;
 
-err_attach:
+   return 0;
+err_attach_device:
+   iommu_detach_device(v->domain, dma_dev);
+err_alloc_domain:
iommu_domain_free(v->domain);
v->domain = NULL;
return ret;
@@ -1595,6 +1646,8 @@ static int vhost_vdpa_probe(struct vdpa_device *vdpa)
goto err;
}
 
+   vhost_iotlb_init(>resv_iotlb, 0, 0);
+
r = dev_set_name(>dev, "vhost-vdpa-%u", minor);
if (r)
goto err;
-- 
2.27.0

Re: [PATCH v6 5/8] tracing: add new type "%pd/%pD" in readme_msg[]

2024-03-20 Thread Google

Hi Ye,

Can you split this into 

(1)
> - "\t   symstr, \\[\\]\n"
> + "\t   symstr, %pd, \\[\\]\n"

and 

(2)
> - "\t   symstr, %pd, \\[\\]\n"
> + "\t   symstr, %pd/%pD, \\[\\]\n"

And merge (1) into [PATCH 3/8], (2) into [PATCH 5/8], so that
the feature patch itself updates the readme text?

This is because even if one of the patch drops, others can
show its feature on readme.

Thank you,

On Fri, 15 Mar 2024 14:55:37 +0800
Ye Bin  wrote:

> Signed-off-by: Ye Bin 
> ---
>  kernel/trace/trace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index b12f8384a36a..831dfd0773a4 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -5510,7 +5510,7 @@ static const char readme_msg[] =
>   "\t   +|-[u](), \\imm-value, 
> \\\"imm-string\"\n"
>   "\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, 
> symbol,\n"
>   "\t   b@/, ustring,\n"
> - "\t   symstr, \\[\\]\n"
> + "\t   symstr, %pd/%pD, \\[\\]\n"
>  #ifdef CONFIG_HIST_TRIGGERS
>   "\tfield:  ;\n"
>   "\tstype: u8/u16/u32/u64, s8/s16/s32/s64, pid_t,\n"
> -- 
> 2.31.1
> 


-- 
Masami Hiramatsu (Google)

Re: [PATCH v2] kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address

2024-03-20 Thread Google

On Sun, 17 Mar 2024 10:53:59 -0500
Jinghao Jia  wrote:

> 
> 
> On 3/16/24 08:46, Masami Hiramatsu (Google) wrote:
> > On Thu, 14 Mar 2024 18:56:35 -0500
> > Jinghao Jia  wrote:
> > 
> >> On 3/14/24 10:17, Masami Hiramatsu (Google) wrote:
> >>> From: Masami Hiramatsu (Google) 
> >>>
> >>> Read from an unsafe address with copy_from_kernel_nofault() in
> >>> arch_adjust_kprobe_addr() because this function is used before checking
> >>> the address is in text or not. Syzcaller bot found a bug and reported
> >>> the case if user specifies inaccessible data area,
> >>> arch_adjust_kprobe_addr() will cause a kernel panic.
> >>
> >> IMHO there is a check on the address in kallsyms_lookup_size_offset to see
> >> if it is a kernel text address before arch_adjust_kprobe_addr is invoked.
> > 
> > Yeah, kallsyms does not ensure the page (especially data) exists.
> > 
> >>
> >> The call chain is:
> >>
> >> register_kprobe()
> >>   _kprobe_addr()
> >> kallsyms_lookup_size_offset() <- check on addr is here
> >> arch_adjust_kprobe_addr()
> >>
> >> I wonder why this check was not able to capture the problem in this bug
> >> report (I cannot reproduce it locally).
> > 
> > I could reproduce it locally, it tried to access 'Y' data.
> > (I attached my .config) And I ensured that this fixed the problem.
> > 
> > The reproduce test actually tried to access initdata area
> > 
> > 82fb5450 d __alt_reloc_selftest_addr
> > 82fb5460 d int3_exception_nb.1
> > 82fb5478 d tsc_early_khz
> > 82fb547c d io_delay_override
> > 82fb5480 d fxregs.0
> > 82fb5680 d y<--- access this
> > 82fb5688 d x
> > 82fb56a0 d xsave_cpuid_features
> > 82fb56c8 d l1d_flush_mitigation
> > 
> > `y` is too generic, so check `io_delay_override` which is on the
> > same page.
> > 
> > $ git grep io_delay_override
> > arch/x86/kernel/io_delay.c:static int __initdata io_delay_override;
> > 
> > As you can see, it is marked as `__initdata`, and the initdata has been
> > freed before starting /init.
> > 
> > 
> > [2.679161] Freeing unused kernel image (initmem) memory: 2888K
> > [2.688731] Write protecting the kernel read-only data: 24576k
> > [2.691802] Freeing unused kernel image (rodata/data gap) memory: 1436K
> > [2.746994] x86/mm: Checked W+X mappings: passed, no W+X pages found.
> > [2.748022] x86/mm: Checking user space page tables
> > [2.789520] x86/mm: Checked W+X mappings: passed, no W+X pages found.
> > [2.790527] Run /init as init process
> > 
> > 
> > So this has been caused because accessing freed initdata.
> 
> Thanks a lot for the explanation! I have confirmed the bug and tested the
> patch with CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y (which explicitly marks
> the init pages as not-present after boot).
> 
> Tested-by: Jinghao Jia 
> 

Thank you for testing!

Regards,
-- 
Masami Hiramatsu (Google)

[PATCH] tracing: probes: Fix to zero initialize a local variable

2024-03-20 Thread Masami Hiramatsu (Google)

From: Masami Hiramatsu (Google) 

Fix to initialize 'val' local variable with zero.
Dan reported that Smatch static code checker reports an error that a local
'val' variable needs to be initialized. Actually, the 'val' is expected to
be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
initialize it with zero.

Reported-by: Dan Carpenter 
Closes: 
https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and 
fprobe)")
Signed-off-by: Masami Hiramatsu (Google) 
---
 kernel/trace/trace_probe.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 217169de0920..dfe3ee6035ec 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -839,7 +839,7 @@ int traceprobe_get_entry_data_size(struct trace_probe *tp)
 void store_trace_entry_data(void *edata, struct trace_probe *tp, struct 
pt_regs *regs)
 {
struct probe_entry_arg *earg = tp->entry_arg;
-   unsigned long val;
+   unsigned long val = 0;
int i;
 
if (!earg)

Re: [PATCH 2/3] dt-bindings: remoteproc: add Versal-NET platform

2024-03-20 Thread Krzysztof Kozlowski

On 19/03/2024 15:42, Tanmay Shah wrote:
> 
> 
> On 3/19/24 12:30 AM, Krzysztof Kozlowski wrote:
>> On 19/03/2024 01:51, Tanmay Shah wrote:
>>> Hello Krzysztof,
>>>
>>> Thanks for reviews. Please find my comments below.
>>>
>>> On 3/17/24 1:53 PM, Krzysztof Kozlowski wrote:
 On 15/03/2024 22:15, Tanmay Shah wrote:
> AMD-Xilinx Versal-NET platform is successor of Versal platform. It
> contains multiple clusters of cortex-R52 real-time processing units.
> Each cluster contains two cores of cortex-R52 processors. Each cluster
> can be configured in lockstep mode or split mode.
>
> Each R52 core is assigned 128KB of TCM memory. ATCM memory is 64KB, BTCM
> and CTCM memoreis are 32KB each. Each TCM memory has its own dedicated
> power-domain that needs to be requested before using it.
>
> Signed-off-by: Tanmay Shah 
> ---
>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 220 +++---
>  1 file changed, 184 insertions(+), 36 deletions(-)
>
> diff --git 
> a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
> b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> index 711da0272250..55654ee02eef 100644
> --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> @@ -18,7 +18,9 @@ description: |
>  
>  properties:
>compatible:
> -const: xlnx,zynqmp-r5fss
> +enum:
> +  - xlnx,zynqmp-r5fss
> +  - xlnx,versal-net-r52fss
>  
>"#address-cells":
>  const: 2
> @@ -64,7 +66,9 @@ patternProperties:
>  
>  properties:
>compatible:
> -const: xlnx,zynqmp-r5f
> +enum:
> +  - xlnx,zynqmp-r5f
> +  - xlnx,versal-net-r52f
>  
>reg:
>  minItems: 1
> @@ -135,9 +139,11 @@ required:
>  allOf:
>- if:
>properties:
> -xlnx,cluster-mode:
> -  enum:
> -- 1
> +compatible:
> +  contains:
> +enum:
> +  - xlnx,versal-net-r52fss

 Why do you touch these lines?

> +
>  then:
>patternProperties:
>  "^r5f@[0-9a-f]+$":
> @@ -149,16 +155,14 @@ allOf:
>items:
>  - description: ATCM internal memory
>  - description: BTCM internal memory
> -- description: extra ATCM memory in lockstep mode
> -- description: extra BTCM memory in lockstep mode
> +- description: CTCM internal memory
>  
>  reg-names:
>minItems: 1
>items:
> -- const: atcm0
> -- const: btcm0
> -- const: atcm1
> -- const: btcm1
> +- const: atcm
> +- const: btcm
> +- const: ctcm
>  
>  power-domains:
>minItems: 2
> @@ -166,33 +170,70 @@ allOf:
>  - description: RPU core power domain
>  - description: ATCM power domain
>  - description: BTCM power domain
> -- description: second ATCM power domain
> -- description: second BTCM power domain
> +- description: CTCM power domain
>  
>  else:
> -  patternProperties:
> -"^r5f@[0-9a-f]+$":
> -  type: object
> -
> -  properties:
> -reg:
> -  minItems: 1
> -  items:
> -- description: ATCM internal memory
> -- description: BTCM internal memory
> -
> -reg-names:
> -  minItems: 1
> -  items:
> -- const: atcm0
> -- const: btcm0
> -
> -power-domains:
> -  minItems: 2
> -  items:
> -- description: RPU core power domain
> -- description: ATCM power domain
> -- description: BTCM power domain
> +  allOf:
> +- if:
> +properties:
> +  xlnx,cluster-mode:
> +enum:
> +  - 1

 Whatever you did here, is not really readable. You have now multiple
 if:then:if:then embedded.
>>>
>>> For ZynqMP platform, TCM can be configured differently in lockstep mode
>>> and split mode.
>>>
>>> For Versal-NET no such configuration is available, but new CTCM memory
>>> is added.
>>>
>>> So, I am trying to achieve following representation of TCM for both:
>>>
>>> if: versal-net compatible
>>> then:
>>>

Re: [PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-20 Thread Huang, Ying

"Ho-Ren (Jack) Chuang"  writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> More details:

You have done several stuff in one patch.  So you need "more details".
You may separate them into multiple patches.  One for echo "*" below.
But I have no strong opinion on that.

> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Abstract common functions into `mt_find_alloc_memory_type()`
> Since different memory devices require finding or allocating a memory type,
> these common steps are abstracted into a single function,
> `mt_find_alloc_memory_type()`, enhancing code scalability and conciseness.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Change adist calculation code to use another new lock, `mt_perf_lock`.
> In the current implementation, iterating through CPUlist nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `mt_perf_lock` to protect
> `default_dram_perf`. This approach not only avoids deadlock but also
> prevents holding a large lock simultaneously.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
>
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> ---
>  drivers/dax/kmem.c   | 13 +
>  include/linux/memory-tiers.h |  7 +++
>  mm/memory-tiers.c| 94 +---
>  3 files changed, 95 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 42ee360cf4e3..de1333aa7b3e 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
>  
>  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>  {
> - bool found = false;
>   struct memory_dev_type *mtype;
>  
>   mutex_lock(_memory_type_lock);
> - list_for_each_entry(mtype, _memory_types, list) {
> - if (mtype->adistance == adist) {
> - found = true;
> - break;
> - }
> - }
> - if (!found) {
> - mtype = alloc_memory_type(adist);
> - if (!IS_ERR(mtype))
> - list_add(>list, _memory_types);
> - }
> + mtype = mt_find_alloc_memory_type(adist, _memory_types);
>   mutex_unlock(_memory_type_lock);
>  
>   return mtype;

It seems that there's some miscommunication about my previous comments
about this.  What I suggested is to create one separate patch, which
moves mt_find_alloc_memory_type() and mt_put_memory_types() into
memory-tiers.c.  And make this patch the first one of the series.

> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 69e781900082..b2135334ac18 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -48,6 +48,8 @@ int mt_calc_adistance(int node, int *adist);
>  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>const char *source);
>  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> + struct list_head 
> *memory_types);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-20 Thread Michael S. Tsirkin

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > I think you are wasting the time with these tests. Even if it helps what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from index
> > into ring itself. Then vhost can check that head is synchronized with
> > index.
> > 
> > Warning: completely untested, not even compiled. But should give you
> > the idea. If this works btw we should consider making this official in
> > the spec.
> > 
> > 
> >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > /* Put entry in available array (but don't update avail->idx until they
> >  * do sync). */
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > -   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > +   u16 headwithflag = head | (q->split.avail_idx_shadow & 
> > ~(vq->split.vring.num - 1));
> > +   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > headwithflag);
> > /* Descriptors and available array need to be set before we expose the
> >  * new available array entries. */
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct 
> > vhost_virtqueue *vq,
> >   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >__virtio16 *head, int idx)
> >   {
> > -   return vhost_get_avail(vq, *head,
> > +   unsigned i = idx;
> > +   unsigned flag = i & ~(vq->num - 1);
> > +   unsigned val = vhost_get_avail(vq, *head,
> >>avail->ring[idx & (vq->num - 1)]);
> > +   unsigned valflag = val & ~(vq->num - 1);
> > +
> > +   WARN_ON(valflag != flag);
> > +
> > +   return val & (vq->num - 1);
> >   }
> 
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx to the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the 
> high
> byte is still correct and valid?


I would find this very surprising.

> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] =
> cpu_to_virtio16(_vq->vdev, head | (avail << 8));
> 
> 
> head = vhost16_to_cpu(vq, ring_head);
> WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
> head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected/
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go around the ring a full circle.
Thus you want exactly the *high byte* of avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.


The advantage of this debugging patch is that it will detect the issue 
immediately
not after guest detected the problem in the used ring.
For example, you can add code to re-read the value, or dump the whole
ring.

> One question: Does QEMU has any chance writing data to the available queue 
> when
> vhost is enabled? My previous understanding is no, the queue is totally owned 
> by
> vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record last 16 
> transactions
> to the available and used queue from guest and host side. It did reveal the 
> wrong
> head was fetched from the available queue.

Oh nice that's a very good hint. And is this still reproducible?

> [   11.785745]  virtqueue_get_buf_ctx_split 
> [   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [   11.786655] head to be released: 036 077
> [   11.786952]
> [   11.786952] avail_idx:
> [   11.787234] 000  63985  <--
> [   11.787237] 001  63986
> [   11.787444] 002  63987
> [   11.787632] 003  63988
> [   11.787821] 004  63989
> [   11.788006] 005  63990
> [   11.788194] 006  63991
> [   11.788381] 007  63992
> [   11.788567] 008  63993
> [   11.788772] 009  63994
> [   11.788957] 010  63995
> [   11.789141] 011  63996
> [   11.789327]

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-20 Thread Honggyu Kim

Hi SeongJae,

On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park  wrote:
> On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> > > Hi Honggyu,
> > > 
> > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> > > 
> > > > Hi SeongJae,
> > > > 
> > > > Thanks for the confirmation.  I have a few comments on young filter so
> > > > please read the inline comments again.
> > > > 
> > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  
> > > > wrote:
> > > > > Hi Honggyu,
> > > > > 
> > > > > > > -Original Message-
> > > > > > > From: SeongJae Park 
> > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > > To: Honggyu Kim 
> > > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > > 
> > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > > management for CXL memory
> > > > > > >
> > > > > > > Hi Honggyu,
> > > > > > >
> > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > Hi SeongJae,
> > > > > > > >
> > > > > > > > I've tested it again and found that "young" filter has to be set
> > > > > > > > differently as follows.
> > > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > > - promote action: set "young" filter with "matching" false
> > 
> > Thinking it again, I feel like "matching" true or false looks quite
> > vague to me as a general user.
> > 
> > Instead, I would like to have more meaningful names for "matching" as
> > follows.
> > 
> > - matching "true" can be either (filter) "out" or "skip".
> > - matching "false" can be either (filter) "in" or "apply".
> 
> I agree the naming could be done much better.  And thank you for the nice
> suggestions.  I have a few concerns, though.

I don't think my suggestion is best.  I just would like to have more
discussion about it.

> Firstly, increasing the number of behavioral concepts.  DAMOS filter feature
> has only single behavior: excluding some types of memory from DAMOS action
> target.  The "matching" is to provide a flexible way for further specifying 
> the
> target to exclude in a bit detail.  Without it, we would need non-variant for
> each filter type.  Comapred to the current terms, the new terms feel like
> implying there are two types of behaviors.  I think one behavior is easier to
> understand than two behaviors, and better match what internal code is doing.
> 
> Secondly, ambiguity in "in" and "apply".  To me, the terms sound like _adding_
> something more than _excluding_.

I understood that young filter "matching" "false" means apply action
only to young pages.  Do I misunderstood something here?  If not,
"apply" means _adding_ or _including_ only the matched pages for action.
It looks like you thought about _excluding_ non matched pages here.

> I think that might confuse people in some
> cases.  Actually, I have used the term "filter-out" and "filter-in" on
> this  and several threads.  When saying about "filter-in" scenario, I had to
> emphasize the fact that it is not adding something but excluding others.

Excluding others also means including matched pages.  I think we better
focus on what to do only for the matched pages.

> I now think that was not a good approach.
> 
> Finally, "apply" sounds a bit deterministic.  I think it could be a bit
> confusing in some cases such as when using multiple filters in a combined way.
> For example, if we have two filters for 1) "apply" a memcg and 2) skip anon
> pages, the given DAMOS action will not be applied to anon pages of the memcg.
> I think this might be a bit confusing.

No objection on this.  If so, I think "in" sounds better than "apply".

> > 
> > Internally, the type of "matching" can be boolean, but it'd be better
> > for general users have another ways to set it such as "out"/"in" or
> > "skip"/"apply" via sysfs interface.  I prefer "skip" and "apply" looks
> > more intuitive, but I don't have strong objection on "out" and "in" as
> > well.
> 
> Unfortunately, DAMON sysfs interface is an ABI that we want to keep stable.  
> Of
> course we could make some changes on it if really required.  But I'm unsure if
> the problem of current naming and benefit of the sugegsted change are big
> enough to outweighs the stability risk and additional efforts.

I don't ask to change the interface, but just provide another way for
the setting.  For example, the current "matching" accepts either 1,
true, or Y but internally keeps as "true" as a boolean type.

  $ cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/filters/0

  $ echo 1 | tee matching && cat matching
  1
  Y

  $ echo true | tee matching && cat matching
  true
  Y

  $ echo Y | tee matching && cat matching
  Y
  Y

I'm asking if it's okay making "matching" receive "out" or "skip" as
follows.

  $ echo out | tee matching && cat matching
  out
  Y

  $ echo skip | tee matching && cat

Re: [RFC PATCH] rpmsg: glink: Add bounds check on tx path

2024-03-20 Thread Deepak Kumar Singh





On 1/29/2024 10:03 PM, Michal Koutný wrote:

On Mon, Jan 29, 2024 at 04:18:36PM +0530, Deepak Kumar Singh 
 wrote:

There is already a patch posted for similar problem -
https://lore.kernel.org/all/20231201110631.669085-1-quic_dee...@quicinc.com/


I was not aware, thanks for the pointer.

Do you plan to update your patch to "just" bail-out/zero instead of
using slightly random values (as pointed out by Bjorn)?

Michal

Hi Michal,
Yes, i will be fixing those comments and re post patch.

Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-20 Thread syzbot

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any 
issue:

Reported-and-tested-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com

Tested on:

commit: 52998cdd Merge branch '6.8/scsi-staging' into 6.8/scsi..
git tree:   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel config:  https://syzkaller.appspot.com/x/.config?x=7b1f286a7e950707
dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.

[PATCH v3 2/2] memory tier: dax/kmem: abstract memory types put

2024-03-20 Thread Ho-Ren (Jack) Chuang

Abstract `kmem_put_memory_types()` into `mt_put_memory_types()` to
accommodate various memory types and enhance flexibility,
similar to `mt_find_alloc_memory_type()`.

Signed-off-by: Ho-Ren (Jack) Chuang 
---
 drivers/dax/kmem.c   |  7 +--
 include/linux/memory-tiers.h |  6 ++
 mm/memory-tiers.c| 12 
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index de1333aa7b3e..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,13 +66,8 @@ static struct memory_dev_type 
*kmem_find_alloc_memory_type(int adist)
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
mutex_lock(_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, _memory_types, list) {
-   list_del(>list);
-   put_memory_type(mtype);
-   }
+   mt_put_memory_types(_memory_types);
mutex_unlock(_memory_type_lock);
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index b2135334ac18..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -50,6 +50,7 @@ int mt_set_default_dram_perf(int nid, struct 
access_coordinate *perf,
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
 struct memory_dev_type *mt_find_alloc_memory_type(int adist,
struct list_head 
*memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -143,5 +144,10 @@ struct memory_dev_type *mt_find_alloc_memory_type(int 
adist, struct list_head *m
 {
return NULL;
 }
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index d9b96b21b65a..6246c28546ba 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -662,6 +662,18 @@ struct memory_dev_type *mt_find_alloc_memory_type(int 
adist, struct list_head *m
 }
 EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
 
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(>list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 /*
  * This is invoked via late_initcall() to create
  * CPUless memory tiers after HMAT info is ready or
-- 
Ho-Ren (Jack) Chuang

[PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-20 Thread Ho-Ren (Jack) Chuang

The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

More details:
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init.

* Abstract common functions into `mt_find_alloc_memory_type()`
Since different memory devices require finding or allocating a memory type,
these common steps are abstracted into a single function,
`mt_find_alloc_memory_type()`, enhancing code scalability and conciseness.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Change adist calculation code to use another new lock, `mt_perf_lock`.
In the current implementation, iterating through CPUlist nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `mt_perf_lock` to protect
`default_dram_perf`. This approach not only avoids deadlock but also
prevents holding a large lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed,
a default memory type is created for storing all memory types that are
not initialized by device drivers and as a fallback.

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
---
 drivers/dax/kmem.c   | 13 +
 include/linux/memory-tiers.h |  7 +++
 mm/memory-tiers.c| 94 +---
 3 files changed, 95 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..de1333aa7b3e 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
struct memory_dev_type *mtype;
 
mutex_lock(_memory_type_lock);
-   list_for_each_entry(mtype, _memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(>list, _memory_types);
-   }
+   mtype = mt_find_alloc_memory_type(adist, _memory_types);
mutex_unlock(_memory_type_lock);
 
return mtype;
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..b2135334ac18 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,8 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head 
*memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +138,10 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   return NULL;
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..d9b96b21b65a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -36,6 +37,11 @@ struct node_memory_type_map {
 
 static

[PATCH v3 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-20 Thread Ho-Ren (Jack) Chuang

When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable
from normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes
to the same DRAM tier. This results in normal memory devices with
different attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the initialization
of memory tiers for CPUless NUMA nodes until they obtain HMAT information
and after all devices are initialized at boot time, eliminating the need
for user intervention. If no HMAT is specified, it falls back to
using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance attributes
through QEMU, and the guest now sees memory nodes with performance attributes
in HMAT. With this change, we enable the guest kernel to construct
the correct memory tiering for the memory nodes.

-v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
-v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * 
https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
 * 
https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: create CPUless memory tiers after obtaining
HMAT info
  memory tier: dax/kmem: abstract memory types put

 drivers/dax/kmem.c   |  20 +--
 include/linux/memory-tiers.h |  13 +
 mm/memory-tiers.c| 106 ---
 3 files changed, 114 insertions(+), 25 deletions(-)

-- 
Ho-Ren (Jack) Chuang

59 matches

Mail list logo