[stable 4.4, 4.9, 4.14, 4.19 LTS] Missing fix "memcg: fix a crash in wb_workfn when a device disappears"

2021-02-10 Thread Mathieu Desnoyers
Hi,

While reconciling the lttng-modules writeback instrumentation with its counterpart
within the upstream Linux kernel, I noticed that the following commit, introduced in
5.6, is present in stable branches 5.4 and 5.5, but is missing from the LTS stable
branches for 4.4, 4.9, 4.14, and 4.19:

commit 68f23b89067fdf187763e75a56087550624fdbee
("memcg: fix a crash in wb_workfn when a device disappears")

Considering that this fix was CC'd to the stable mailing list, is there any
reason why it has not been integrated into those LTS branches ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] Community input: Feedback on use of enumeration type within LTTng-UST

2021-02-09 Thread Mathieu Desnoyers via lttng-dev
Hi Jesper,

We currently don't have anything in place to allow sharing a given enumeration
across multiple tracepoint providers. However, you are not the first one to
express the need for it.

One way I figure we could do this would be to introduce the notion of
"probe provider dependency" in lttng-ust. The basic idea is to have the
enumeration within a probe provider which would be referred to as a
"dependency" by other tracepoint probes. We could also do some sharing
of tracepoint event class declarations as well if needed in a similar
way.

The upgrade scenario problem you highlight is not specific to enumerations.
It also applies to events being changed when upgrading a tracepoint probe
provider.

For events, the way we match them within the session daemon (ht_match_event)
is by comparing their entire description for an exact match: they need to
have all the same fields. If two events happen to have the same name but
different fields (due to an upgrade in some applications but not others),
they will be represented by different event IDs within the metadata. This
allows us to make sure each event has a unique event ID within a trace if
they have different content, even if they have the same name (e.g. upgrade
scenario).

When dealing with enumeration types in the ust registry within lttng-sessiond,
we already do something similar (ht_match_enum): we check that all labels match.

So I think that the only part we are missing is really at the UST probe provider
level: adding the ability to express a dependency from provider A on an enumeration
defined by provider B. The rest of the logic is already there in the session daemon
to allow references to an enumeration from various event fields within a single
probe provider.
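
For illustration, here is roughly how an enumeration is tied to a single provider
today (the provider, event and file names below are hypothetical, with the usual
tracepoint header boilerplate included so the sketch stands on its own):

#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER provider_a

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./tp_provider_a.h"

#if !defined(_TP_PROVIDER_A_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _TP_PROVIDER_A_H

#include <lttng/tracepoint.h>

/* The enumeration is declared within (and scoped to) provider_a. */
TRACEPOINT_ENUM(provider_a, state_enum,
        TP_ENUM_VALUES(
                ctf_enum_value("STARTED", 0)
                ctf_enum_value("STOPPED", 1)
                ctf_enum_range("ERROR", 2, 15)
        )
)

TRACEPOINT_EVENT(provider_a, state_change,
        TP_ARGS(int, state),
        TP_FIELDS(
                /* A ctf_enum() field can only refer to provider_a's enums. */
                ctf_enum(provider_a, state_enum, int, state_field, state)
        )
)

#endif /* _TP_PROVIDER_A_H */

#include <lttng/tracepoint-event.h>

A second provider that wants the same labels currently has to redeclare an identical
TRACEPOINT_ENUM under its own provider name, which is exactly the duplication a
"dependency" mechanism would avoid.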

Thanks,

Mathieu

- On Jan 29, 2021, at 10:07 AM, jderehag jdere...@hotmail.com wrote:

> Hi,
> 
> I don't have any data regarding enum usage, unfortunately.
> 
> But is it possible to share metadata between multiple tp-providers?
> If so, would you not be required anyway to re-generate the enumeration labels
> given that any trace producer (sharing the same metadata) may have different
> enumeration definitions?
> For example if you would, let's say, trace different versions of the same app but
> where the enum definition has changed?
> Cases like those I guess would force you to at least version enum labels or
> some such?
> 
> /Jesper
> 
> 
> From: lttng-dev  on behalf of Mathieu Desnoyers via
> lttng-dev 
> Sent: January 26, 2021 16:19
> To: lttng-dev
> Subject: [lttng-dev] Community input: Feedback on use of enumeration type within
> LTTng-UST
> 
> Hi,
> 
> We are currently working on the CTF2 specification [1] at EfficiOS, and there is
> a user metric we would need to help us with a design decision with respect
> to enumerations. This is where we would need community input.
> 
> The usage metrics in question are with respect to LTTng-UST enumeration types
> (TRACEPOINT_ENUM) [2] used by tracepoint events within instrumented
> applications.
> 
> A tracepoint enumeration can be used by many events. What we are looking for is
> to get an idea of the common use, and of extreme cases as well.
> 
> Ideally, what we would need is:
> 
> - For each enumeration within your application instrumentation (TRACEPOINT_ENUM),
>   how many events refer to each enumeration ? (average, mode, and maximum)
> - For each TRACEPOINT_ENUM, how many labels does it possess ? (sum of
>   ctf_enum_value/ctf_enum_range/ctf_enum_auto entries) (average, mode, and maximum)
> - For each TRACEPOINT_ENUM, for each label, what is the string length ?
>   (average, mode, and maximum)
> 
> Based on this information, we can estimate the data overhead generated by repeating
> enumeration labels in the LTTng-UST metadata. We need to decide whether we allow
> references to a single enumeration description when it is used by many events, or
> if we require repeatedly serializing the entire enumeration description for each
> event field using the enumeration.
> 
> Thanks,
> 
> Mathieu
> 
> [1] https://lists.lttng.org/pipermail/lttng-dev/2020-November/029777.html
> [2] https://lttng.org/man/3/lttng-ust/v2.12/#doc-tracepoint-enum
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> ___
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [PATCH 0/2 v3] tracepoints: Stop punishing non-static call users

2021-02-09 Thread Mathieu Desnoyers
- On Feb 8, 2021, at 3:09 PM, rostedt rost...@goodmis.org wrote:

> Broke this up into two patches now. See the second patch for the
> description of what this series is doing.

For both patches:

Reviewed-by: Mathieu Desnoyers 

> 
> Changes since v2:
> 
> Added a patch to remove "data_args", as it was causing issues with
> handling of "__data", especially when it wasn't needed.
> 
> 
> Steven Rostedt (VMware) (2):
>  tracepoints: Remove unnecessary "data_args" macro parameter
>  tracepoints: Do not punish non static call users
> 
> 
> include/linux/tracepoint.h | 52 ------
>  1 file changed, 23 insertions(+), 29 deletions(-)

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] LTTng Support for Network Namespace

2021-02-08 Thread Mathieu Desnoyers via lttng-dev
- On Feb 8, 2021, at 1:04 PM, lttng-dev lttng-dev@lists.lttng.org wrote:

> Dear LTTng developers,
> I was wondering if LTTng currently (or in the future) supports network
> namespaces. For instance, an ambiguity in the logs would make problems for
> future system analysis.

Did you try the "net_ns" context ? 
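
For example (kernel domain):

  lttng add-context --kernel --type=net_ns

If I remember correctly, the namespace contexts (net_ns, pid_ns, mnt_ns, ...) were
introduced in LTTng 2.11; "lttng add-context --list" will show which context types
your installed version supports.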

> Taking the "bind" system call, that would help us distinguish between 
> different
> local IP addresses when two or more container instances are running on the
> host, given that the "umyaddr" field is an address inside the memory, how 
> would
> we recover the local addresses after the trace has been collected?

> syscall_entry_bind: { cpu_id = 1 }, { mnt_ns = 4026532553, pid = 532, tid = 532,
> vtid = 532, vpid = 532 }, { fd = 14, umyaddr = 93827759199256, addrlen = 12 }

In lttng-modules, we'd need to implement a system call override for the bind
system call which fetches the umyaddr content from user-space. You can look at
the x86-64 override code for the "connect" system call, which fetches the
uservaddr from user-space, for inspiration.
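
As a rough sketch of the core idea only (this is not the actual lttng-modules
override macro plumbing, just the bounded user-space copy such an override would
perform so the event payload carries the address bytes rather than the raw pointer):

#include <linux/socket.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static void record_bind_addr(const void __user *umyaddr, int addrlen)
{
        struct sockaddr_storage addr_copy;

        memset(&addr_copy, 0, sizeof(addr_copy));
        if (addrlen < 0 || (size_t)addrlen > sizeof(addr_copy))
                addrlen = 0;            /* out-of-range length: record nothing */
        if (addrlen && copy_from_user(&addr_copy, umyaddr, addrlen))
                memset(&addr_copy, 0, sizeof(addr_copy));  /* fault: zeroed payload */
        /* ...hand addr_copy (and addrlen) to the event field serialization... */
}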

Thanks,

Mathieu


> Thanks, we appreciate the help beforehand!

> Mohammad

> ___
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [RFC] security: replace indirect calls with static calls

2021-02-05 Thread Mathieu Desnoyers
- On Feb 5, 2021, at 10:40 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Fri, Feb 05, 2021 at 10:09:26AM -0500, Mathieu Desnoyers wrote:
>> Then we should be able to generate the following using static keys as a
>> jump table and N static calls:
>> 
>>   jump 
>> label_N:
>>   stack setup
>>   call
>> label_N-1:
>>   stack setup
>>   call
>> label_N-2:
>>   stack setup
>>   call
>>   ...
>> label_0:
>>   jump end
>> label_fallback:
>>   
>> end:
>> 
>> So the static keys would be used to jump to the appropriate label (using
>> a static branch, which has pretty much 0 overhead). Static calls would
>> be used to implement each of the calls.
>> 
>> Thoughts ?
> 
> At some point I tried to extend the static_branch infra to do multiple
> targets and while the low level plumbing is trivial, I ran into trouble
> trying to get a sane C level API for it.

Did you try doing an API for a variable number of targets, or was it for
a specific number of targets ? It might be easier to just duplicate some
of the API code for number of targets between 2 and 12, and let the
users code choose the maximum number of targets they want to accelerate.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC] security: replace indirect calls with static calls

2021-02-05 Thread Mathieu Desnoyers
On 20-Aug-2020 06:47:53 PM, Brendan Jackman wrote:
> From: Paul Renauld 
> 
> LSMs have high overhead due to indirect function calls through
> retpolines. This RFC proposes to replace these with static calls [1]
> instead.
> 
> This overhead is especially significant for the "bpf" LSM which supports
> the implementation of LSM hooks with eBPF programs (security/bpf)[2]. In
> order to facilitate this, the "bpf" LSM provides a default nop callback for
> all LSM hooks. When enabled, the "bpf" LSM incurs an unnecessary /
> avoidable indirect call to this nop callback.
> 
> The performance impact on a simple syscall eventfd_write (which triggers
> the file_permission hook) was measured with and without "bpf" LSM
> enabled. Activating the LSM resulted in an overhead of 4% [3].
> 
> This overhead prevents the adoption of bpf LSM on performance critical
> systems, and also, in general, slows down all LSMs.
> 
> Currently, the LSM hook callbacks are stored in a linked list and
> dispatched as indirect calls. Using static calls can remove this overhead
> by replacing all indirect calls with direct calls.
> 
> During the discussion of the "bpf" LSM patch-set it was proposed to special
> case BPF LSM to avoid the overhead by using static keys. This was however
> not accepted and it was decided to [4]:
> 
> - Not special-case the "bpf" LSM.
> - Implement a general solution benefitting the whole LSM framework.
> 
> This is based on the static call branch [5].

Hi!

So I reviewed this quickly, and hopefully my understanding is correct.
AFAIU, your approach is limited to scenarios where the callbacks are
known at compile-time. It also appears to add the overhead of a
switch/case for every function call on the fast-path.

I am the original author of the tracepoint infrastructure in the Linux
kernel, which also needs to iterate on an array of callbacks. Recently,
Steven Rostedt pushed a change which accelerates the single-callback
case using static calls to reduce retpoline mitigation overhead, but I
would prefer if we could accelerate the multiple-callback case as well.
Note that for tracepoints, the callbacks are not known at compile-time.

This is where I think we could come up with a generic solution that
would fit both LSM and tracepoint use-cases.

Here is what I have in mind. Let's say we generate code to accelerate up
to N calls, and after that we have a fallback using indirect calls.

Then we should be able to generate the following using static keys as a
jump table and N static calls:

  jump 
label_N:
  stack setup
  call
label_N-1:
  stack setup
  call
label_N-2:
  stack setup
  call
  ...
label_0:
  jump end
label_fallback:
  
end:

So the static keys would be used to jump to the appropriate label (using
a static branch, which has pretty much 0 overhead). Static calls would
be used to implement each of the calls.
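
A minimal C-level sketch of that idea for N = 2 (hypothetical names, not an
existing kernel API; the fall-through label layout above would ultimately need
arch-level support, so nested static branches stand in for the jump table here):

#include <linux/jump_label.h>
#include <linux/static_call.h>

static void hook_stub(void *data) { }

DEFINE_STATIC_CALL(hook_slot_0, hook_stub);
DEFINE_STATIC_CALL(hook_slot_1, hook_stub);
DEFINE_STATIC_KEY_FALSE(hook_has_1);    /* at least one callback registered */
DEFINE_STATIC_KEY_FALSE(hook_has_2);    /* at least two callbacks registered */

static __always_inline void call_hooks(void *data)
{
        if (static_branch_unlikely(&hook_has_2)) {
                static_call(hook_slot_1)(data);
                static_call(hook_slot_0)(data);
        } else if (static_branch_unlikely(&hook_has_1)) {
                static_call(hook_slot_0)(data);
        }
        /* A last branch would fall back to the indirect-call iterator. */
}

Registration would then pair static_call_update() on a free slot with
static_branch_enable() on the matching key.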

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoints: Do not punish non static call users

2021-02-04 Thread Mathieu Desnoyers
- On Feb 4, 2021, at 2:17 PM, rostedt rost...@goodmis.org wrote:

> With static calls, a tracepoint can call the callback directly if there is
> only one callback registered to that tracepoint. When there is more than
> one, the static call will call the tracepoint's "iterator" function, which
> needs to reload the tracepoint's "funcs" array again, as it could have
> changed since the first time it was loaded.
> 
> But an arch without static calls is punished by having to load the
> tracepoint's "funcs" array twice. Once in the DO_TRACE macro, and once
> again in the iterator macro.

In addition to loading it, it needs to test it against NULL twice.

> 
> For archs without static calls, there's no reason to load the array macro
> in the first place, since the iterator function will do it anyway.
> 
> Change the __DO_TRACE_CALL() macro to do the double call only for

Do you mean "double call" or "double load and NULL check" ?

> architectures with static calls, and just call the iterator function
> directly for architectures without static calls.
> 
> [ Tested only on architectures with static calls, will test on those
>  without later ]
> 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index dc1d4c612cc3..966bfa6a861c 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -152,9 +152,18 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> #ifdef TRACEPOINTS_ENABLED
> 
> #ifdef CONFIG_HAVE_STATIC_CALL
> -#define __DO_TRACE_CALL(name)static_call(tp_func_##name)
> +#define __DO_TRACE_CALL(name, args)  \
> + do {\
> + struct tracepoint_func *it_func_ptr;\
> + it_func_ptr =   \
> + rcu_dereference_raw((&__tracepoint_##name)->funcs); \
> + if (it_func_ptr) {  \
> + __data = (it_func_ptr)->data;   \
> + static_call(tp_func_##name)(args);  \
> + }   \
> + } while (0)
> #else
> -#define __DO_TRACE_CALL(name)__traceiter_##name
> +#define __DO_TRACE_CALL(name, args)  __traceiter_##name(args)

Also, we may want to comment or annotate the "void *data" argument of
__traceiter_##_name() to state that it is unused.

Thanks,

Mathieu

> #endif /* CONFIG_HAVE_STATIC_CALL */
> 
> /*
> @@ -168,7 +177,6 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  */
> #define __DO_TRACE(name, proto, args, cond, rcuidle)  \
>   do {\
> - struct tracepoint_func *it_func_ptr;\
>   int __maybe_unused __idx = 0;   \
>   void *__data;   \
>   \
> @@ -190,12 +198,7 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>   rcu_irq_enter_irqson(); \
>   }   \
>   \
> - it_func_ptr =   \
> - rcu_dereference_raw((&__tracepoint_##name)->funcs); \
> - if (it_func_ptr) {  \
> - __data = (it_func_ptr)->data;   \
> - __DO_TRACE_CALL(name)(args);\
> - }   \
> + __DO_TRACE_CALL(name, TP_ARGS(args));   \
>   \
>   if (rcuidle) {  \
>   rcu_irq_exit_irqson();  \

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoints: Code clean up

2021-02-04 Thread Mathieu Desnoyers
- On Feb 4, 2021, at 1:27 PM, rostedt rost...@goodmis.org wrote:

> From: "Steven Rostedt (VMware)" 
> 
> Restructure the code a bit to make it simpler, fix some formatting problems
> and add READ_ONCE/WRITE_ONCE to make sure there's no compiler tear downs on

compiler tear downs -> compiler load/store tearing.

> changes to variables that can be accessed across CPUs.
> 
> Started with Mathieu Desnoyers's patch:
> 
>  Link:
>  
> https://lore.kernel.org/lkml/20210203175741.20665-1-mathieu.desnoy...@efficios.com/
> 
> And will keep his signature, but I will take the responsibility of this
> being correct, and keep the authorship.
> 
> Signed-off-by: Mathieu Desnoyers 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
> include/linux/tracepoint.h |  2 +-
> kernel/tracepoint.c| 92 +++---
> 2 files changed, 37 insertions(+), 57 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 966ed8980327..dc1d4c612cc3 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -309,7 +309,7 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>   rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
>   if (it_func_ptr) {  \
>   do {\
> - it_func = (it_func_ptr)->func;  \
> + it_func = READ_ONCE((it_func_ptr)->func); \
>   __data = (it_func_ptr)->data;   \
>   ((void(*)(void *, proto))(it_func))(__data, 
> args); \
>   } while ((++it_func_ptr)->func);\
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index e8f20ae29c18..4b9de79bb927 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -136,9 +136,9 @@ func_add(struct tracepoint_func **funcs, struct
> tracepoint_func *tp_func,
>int prio)
> {
>   struct tracepoint_func *old, *new;
> - int nr_probes = 0;
> - int stub_funcs = 0;
> - int pos = -1;
> + int iter_probes;/* Iterate over old probe array. */
> + int nr_probes = 0;  /* Counter for probes */
> + int pos = -1;   /* New position */

New position -> Insertion position into new array

> 
>   if (WARN_ON(!tp_func->func))
>   return ERR_PTR(-EINVAL);
> @@ -147,54 +147,39 @@ func_add(struct tracepoint_func **funcs, struct
> tracepoint_func *tp_func,
>   old = *funcs;
>   if (old) {
>   /* (N -> N+1), (N != 0, 1) probes */
> - for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> - /* Insert before probes of lower priority */
> - if (pos < 0 && old[nr_probes].prio < prio)
> - pos = nr_probes;
> - if (old[nr_probes].func == tp_func->func &&
> - old[nr_probes].data == tp_func->data)
> + for (iter_probes = 0; old[iter_probes].func; iter_probes++) {
> + if (old[iter_probes].func == tp_stub_func)
> + continue;   /* Skip stub functions. */
> + if (old[iter_probes].func == tp_func->func &&
> + old[iter_probes].data == tp_func->data)
>   return ERR_PTR(-EEXIST);
> - if (old[nr_probes].func == tp_stub_func)
> - stub_funcs++;
> + nr_probes++;
>   }
>   }
> - /* + 2 : one for new probe, one for NULL func - stub functions */
> - new = allocate_probes(nr_probes + 2 - stub_funcs);
> + /* + 2 : one for new probe, one for NULL func */
> + new = allocate_probes(nr_probes + 2);
>   if (new == NULL)
>   return ERR_PTR(-ENOMEM);
>   if (old) {
> - if (stub_funcs) {
> - /* Need to copy one at a time to remove stubs */
> - int probes = 0;
> -
> - pos = -1;
> - for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> - if (old[nr_probes].func == tp_stub_func)
> - continue;
> - if (pos < 0 && old[nr_probes].prio < prio)
> - pos = probes++;
> - new[probes++] = old[nr_probes];
> - }

[PATCH 1/1] tracepoint: cleanup: do not fail unregistration

2021-02-03 Thread Mathieu Desnoyers
This patch is only compile-tested.

Signed-off-by: Mathieu Desnoyers 
---
 include/linux/tracepoint.h |  2 +-
 kernel/tracepoint.c| 80 +-
 2 files changed, 28 insertions(+), 54 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 0f21617f1a66..fae8d6d9588c 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -308,7 +308,7 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
it_func_ptr =   \
rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
do {\
-   it_func = (it_func_ptr)->func;  \
+   it_func = READ_ONCE((it_func_ptr)->func);   \
__data = (it_func_ptr)->data;   \
((void(*)(void *, proto))(it_func))(__data, args); \
} while ((++it_func_ptr)->func);\
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 3e261482296c..b1bec710f68a 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -136,9 +136,10 @@ func_add(struct tracepoint_func **funcs, struct 
tracepoint_func *tp_func,
 int prio)
 {
struct tracepoint_func *old, *new;
-   int nr_probes = 0;
-   int stub_funcs = 0;
-   int pos = -1;
+   int iter_probes = 0;/* Iterate over old probe array. */
+   int nr_old_probes = 0;  /* Count non-stub functions in old. */
+   int nr_new_probes = 0;  /* Count functions in new. */
+   int pos = -1;   /* Insert position in new. */
 
if (WARN_ON(!tp_func->func))
return ERR_PTR(-EINVAL);
@@ -147,54 +148,34 @@ func_add(struct tracepoint_func **funcs, struct 
tracepoint_func *tp_func,
old = *funcs;
if (old) {
/* (N -> N+1), (N != 0, 1) probes */
-   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
-   /* Insert before probes of lower priority */
-   if (pos < 0 && old[nr_probes].prio < prio)
-   pos = nr_probes;
-   if (old[nr_probes].func == tp_func->func &&
-   old[nr_probes].data == tp_func->data)
+   for (iter_probes = 0; old[iter_probes].func; iter_probes++) {
+   if (old[iter_probes].func == tp_stub_func)
+   continue;   /* Skip stub functions. */
+   if (old[iter_probes].func == tp_func->func &&
+   old[iter_probes].data == tp_func->data)
return ERR_PTR(-EEXIST);
-   if (old[nr_probes].func == tp_stub_func)
-   stub_funcs++;
+   /* Insert before probes of lower priority */
+   if (pos < 0 && old[iter_probes].prio < prio)
+   pos = nr_old_probes;
+   nr_old_probes++;
}
}
-   /* + 2 : one for new probe, one for NULL func - stub functions */
-   new = allocate_probes(nr_probes + 2 - stub_funcs);
+   /* + 2 : one for new probe, one for NULL func */
+   new = allocate_probes(nr_old_probes + 2);
if (new == NULL)
return ERR_PTR(-ENOMEM);
if (old) {
-   if (stub_funcs) {
-   /* Need to copy one at a time to remove stubs */
-   int probes = 0;
-
-   pos = -1;
-   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
-   if (old[nr_probes].func == tp_stub_func)
-   continue;
-   if (pos < 0 && old[nr_probes].prio < prio)
-   pos = probes++;
-   new[probes++] = old[nr_probes];
-   }
-   nr_probes = probes;
-   if (pos < 0)
-   pos = probes;
-   else
-   nr_probes--; /* Account for insertion */
-
-   } else if (pos < 0) {
-   pos = nr_probes;
-   memcpy(new, old, nr_probes * sizeof(struct 
tracepoint_func));
-   } else {
-   /* Copy higher priority probes ahead of the new probe */
-   memcpy(new, old, pos * sizeof(struct tracepoint_func));
-   /* Copy the rest after it. */
-   memcpy(new + pos + 1, old + pos,
-  (nr_probes - pos) * sizeo

Re: [for-next][PATCH 14/15] tracepoint: Do not fail unregistering a probe due to memory failure

2021-02-03 Thread Mathieu Desnoyers



- On Feb 3, 2021, at 11:05 AM, rostedt rost...@goodmis.org wrote:

> From: "Steven Rostedt (VMware)" 
> 
> The list of tracepoint callbacks is managed by an array that is protected
> by RCU. To update this array, a new array is allocated, the updates are
> copied over to the new array, and then the list of functions for the
> tracepoint is switched over to the new array. After a completion of an RCU
> grace period, the old array is freed.
> 
> This process happens for both adding a callback as well as removing one.
> But on removing a callback, if the new array fails to be allocated, the
> callback is not removed, and may be used after it is freed by the clients
> of the tracepoint.
> 
> There's really no reason to fail if the allocation for a new array fails
> when removing a function. Instead, the function can simply be replaced by a
> stub function that could be cleaned up on the next modification of the
> array. That is, instead of calling the function registered to the
> tracepoint, it would call a stub function in its place.
> 
> Link: https://lore.kernel.org/r/20201115055256.65625-1-mmull...@mmlx.us
> Link: https://lore.kernel.org/r/20201116175107.02db3...@gandalf.local.home
> Link: https://lore.kernel.org/r/20201117211836.54aca...@oasis.local.home
> Link: https://lkml.kernel.org/r/20201118093405.7a6d2...@gandalf.local.home
> 
> [ Note, this version does use undefined compiler behavior (assuming that
>  a stub function with no parameters or return, can be called by a location
>  that thinks it has parameters but still no return value. Static calls
>  do the same thing, so this trick is not without precedent.
> 
>  There's another solution that uses RCU tricks and is more complex, but
>  can be an alternative if this solution becomes an issue.
> 
>  Link: 
> https://lore.kernel.org/lkml/20210127170721.58bce...@gandalf.local.home/
> ]
> 
> Cc: Peter Zijlstra 
> Cc: Josh Poimboeuf 
> Cc: Mathieu Desnoyers 
> Cc: Ingo Molnar 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Dmitry Vyukov 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> Cc: Yonghong Song 
> Cc: Andrii Nakryiko 
> Cc: John Fastabend 
> Cc: KP Singh 
> Cc: netdev 
> Cc: bpf 
> Cc: Kees Cook 
> Cc: Florian Weimer 
> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints")
> Reported-by: syzbot+83aa762ef23b6f0d1...@syzkaller.appspotmail.com
> Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
> Reported-by: Matt Mullins 
> Signed-off-by: Steven Rostedt (VMware) 
> Tested-by: Matt Mullins 
> ---
> kernel/tracepoint.c | 80 -
> 1 file changed, 64 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 7261fa0f5e3c..e8f20ae29c18 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -53,6 +53,12 @@ struct tp_probes {
>   struct tracepoint_func probes[];
> };
> 
> +/* Called in removal of a func but failed to allocate a new tp_funcs */
> +static void tp_stub_func(void)
> +{
> + return;
> +}
> +
> static inline void *allocate_probes(int count)
> {
>   struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
> @@ -131,6 +137,7 @@ func_add(struct tracepoint_func **funcs, struct
> tracepoint_func *tp_func,
> {
>   struct tracepoint_func *old, *new;
>   int nr_probes = 0;
> + int stub_funcs = 0;
>   int pos = -1;
> 
>   if (WARN_ON(!tp_func->func))
> @@ -147,14 +154,34 @@ func_add(struct tracepoint_func **funcs, struct
> tracepoint_func *tp_func,
>   if (old[nr_probes].func == tp_func->func &&
>   old[nr_probes].data == tp_func->data)
>   return ERR_PTR(-EEXIST);
> + if (old[nr_probes].func == tp_stub_func)
> + stub_funcs++;
>   }
>   }
> - /* + 2 : one for new probe, one for NULL func */
> - new = allocate_probes(nr_probes + 2);
> + /* + 2 : one for new probe, one for NULL func - stub functions */
> + new = allocate_probes(nr_probes + 2 - stub_funcs);
>   if (new == NULL)
>   return ERR_PTR(-ENOMEM);
>   if (old) {
> - if (pos < 0) {
> + if (stub_funcs) {

Considering that we end up implementing a case where we carefully copy over
each item, I recommend we replace the two "memcpy" branches by a single
item-wise implementation. It's a slow-path anyway, and reducing the overall
complexity is a benefit for slow paths. Fewer bugs, less code to review, and
it's easier to reach a decent testing state-space coverage.

> +


Re: [PATCH v4] tracepoint: Do not fail unregistering a probe due to memory failure

2021-01-27 Thread Mathieu Desnoyers
- On Jan 27, 2021, at 2:16 PM, rostedt rost...@goodmis.org wrote:

> On Wed, 27 Jan 2021 13:13:22 -0500 (EST)
> Mathieu Desnoyers  wrote:
> 
>> > Thanks for bringing that up.
>> 
>> Requiring an RCU synchronize on element removal is quite intrusive, and can
>> be problematic if tracepoint removal is called from e.g. preempt-off context.
> 
> But how often do you remove more than one callback from the same
> tracepoint? Or should I say, from a lot of tracepoints?
> 
> This will only synchronize for the following case:
> 
> Add three callbacks to a single tracepoint.
> Remove the first one.
>
> Remove the second one
>   updating>
> Remove the third one.
>   
> 
> And we may be able to make this work in batch too.
> 
> More to come, but I really like this approach over the others because it
> does not increase the size of the kernel for a failure that should never
> happen in practice.

My concern with introducing a scheme based on synchronize_rcu to handle
out-of-memory scenarios is not about how frequently synchronize_rcu will be
called, but about the complexity it adds to tracepoint insertion/removal.
This has chances to explode in unlikely scenarios, which will become harder
to test for. It will also introduce constraints on which kernel contexts can
perform tracepoint removal.

I notice that the error report which led to this v4 is against v2 of the
patch [1], which has known issues. I wonder whether there are any such issues
with v3, which is using a function stub ? [2]

The main concern I had about v3 of the patch was that the prototype of the
stub (void argument list) does not match the prototype of the called function.
As discussed in [2], there are other scenarios where the kernel expects this
to work, based on the expectation that all architectures supported by the
Linux kernel have a calling convention where the caller performs the clean-up.

So I would prefer the approach taken in v3 rather than adding the kind of
complexity introduced in v4. Ideally we should document, near the stub
function, that this expects the calling convention to have the caller perform
the clean-up (cdecl family), and that the returned type (void) of the function
always matches. It's only the arguments which may differ.

There is still one thing that I'm not 100% sure about. It's related to this
comment from Kees Cook [3], hinting that in a CFI build the function prototype
of the call site and the function need to match. So do we need extra
annotation of the stub function to make this work in a CFI build ?

Thanks,

Mathieu

[1] https://lore.kernel.org/bpf/20201117211836.54aca...@oasis.local.home/
[2] https://lore.kernel.org/bpf/20201118093405.7a6d2...@gandalf.local.home/
[3] https://lore.kernel.org/bpf/202011171330.94C6BA7E93@keescook/
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v4] tracepoint: Do not fail unregistering a probe due to memory failure

2021-01-27 Thread Mathieu Desnoyers
- On Jan 27, 2021, at 1:07 PM, rostedt rost...@goodmis.org wrote:

> On Wed, 27 Jan 2021 13:00:46 -0500 (EST)
> Mathieu Desnoyers  wrote:
> 
>> > Instead of allocating a new array for removing a tracepoint, allocate twice
>> > the needed size when adding tracepoints to the array. On removing, use the
>> > second half of the allocated array. This removes the need to allocate memory
>> > for removing a tracepoint, as the allocation for removals will already have
>> > been done.
>> 
>> I don't see how this can work reliably. AFAIU, with RCU, approaches
>> requiring a pre-allocation of twice the size and swapping to the alternate
>> memory area on removal falls apart whenever you remove 2 or more elements
>> back-to-back without waiting for a grace period.
> 
> Good point ;-)
> 
>> 
>> How is this handled by your scheme ?
> 
> I believe we can detect this case using the "prio" part of extra element,
> and force a rcu sync if there's back to back removals on the same
> tracepoint. That case does not happen often, so I'm hoping nobody will
> notice the slowdown with these syncs. I'll take a look at this.
> 
> Thanks for bringing that up.

Requiring an RCU synchronize on element removal is quite intrusive, and can
be problematic if tracepoint removal is called from e.g. preempt-off context.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v4] tracepoint: Do not fail unregistering a probe due to memory failure

2021-01-27 Thread Mathieu Desnoyers
* If this is the second half of the array,
> +  * make it point back to the first half.
> +  */
> + if (old->func < old)
> + old = old->func;
>   } else
>   pos = 0;
>   new[pos] = *tp_func;
> @@ -202,14 +228,18 @@ static void *func_remove(struct tracepoint_func **funcs,
>   /* N -> 0, (N > 1) */
>   *funcs = NULL;
>   debug_print_probes(*funcs);
> + /* Set old back to what it was allocated to */
> + old--;
> + if (old->func < old)
> + old = old->func;
>   return old;
>   } else {
>   int j = 0;
> - /* N -> M, (N > 1, M > 0) */
> - /* + 1 for NULL */
> - new = allocate_probes(nr_probes - nr_del + 1);
> - if (new == NULL)
> - return ERR_PTR(-ENOMEM);
> +
> + /* Use the other half of the array for the new probes */
> + new = old - 1;
> + new = new->func;
> + new++;
>   for (i = 0; old[i].func; i++)
>   if (old[i].func != tp_func->func
>   || old[i].data != tp_func->data)
> @@ -218,7 +248,7 @@ static void *func_remove(struct tracepoint_func **funcs,
>   *funcs = new;
>   }
>   debug_print_probes(*funcs);
> - return old;
> + return NULL;
> }
> 
> static void tracepoint_update_call(struct tracepoint *tp, struct 
> tracepoint_func
> *tp_funcs, bool sync)
> @@ -309,8 +339,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
>   rcu_assign_pointer(tp->funcs, tp_funcs);
>   } else {
>   rcu_assign_pointer(tp->funcs, tp_funcs);
> - tracepoint_update_call(tp, tp_funcs,
> -tp_funcs[0].func != old[0].func);
> + /* Need to sync, if going back to a single caller */
> + tracepoint_update_call(tp, tp_funcs, tp_funcs[1].func == NULL);
>   }
>   release_probes(old);
>   return 0;
> --
> 2.25.4

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v3] fs/proc: Expose RSEQ configuration

2021-01-26 Thread Mathieu Desnoyers
- On Jan 26, 2021, at 1:54 PM, Piotr Figiel fig...@google.com wrote:
[...]
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index a4f86a9d6937..6aea67878065 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -322,8 +322,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
>   ret = rseq_reset_rseq_cpu_id(current);
>   if (ret)
>   return ret;
> + task_lock(current);
>   current->rseq = NULL;
>   current->rseq_sig = 0;
> + task_unlock(current);
>   return 0;
>   }
> 
> @@ -353,8 +355,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32,
> rseq_len,
>   return -EINVAL;
>   if (!access_ok(rseq, rseq_len))
>   return -EFAULT;
> + task_lock(current);
>   current->rseq = rseq;
>   current->rseq_sig = sig;
> + task_unlock(current);

So AFAIU, the locks are there to make sure that whenever a user-space thread
reads that state through that new /proc file ABI, it observes coherent "rseq"
vs "rseq_sig" values. However, I'm not convinced this is the right approach to
consistency here.

Because if you add locking as done here, you ensure that the /proc file reader
sees coherent values, but between the point where those values are read from
kernel-space, copied to user-space, and then acted upon by user-space, those
can very well have become outdated if the observed process runs concurrently.

So my understanding here is that the only non-racy way to effectively use those
values is to either read them from /proc/self/* (from the thread owning the
task struct), or to ensure that the thread is stopped/frozen while the read is
done.

Maybe we should consider validating that the proc file is used from the right
context (from self or when the target thread is stopped/frozen) rather than add
dubious locking ?
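
For the self-read case, a minimal user-space sketch (assuming the
/proc/<pid>/task/<tid>/rseq file and the "%tx %08x" output format proposed in
this patch; this is not a merged kernel interface):

#include <inttypes.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        char path[64];
        uintptr_t rseq_addr = 0;
        unsigned int rseq_sig = 0;
        FILE *f;

        /* Reading our own thread's entry: the values cannot race with us. */
        snprintf(path, sizeof(path), "/proc/self/task/%ld/rseq",
                 (long)syscall(SYS_gettid));
        f = fopen(path, "r");
        if (!f)
                return 1;
        if (fscanf(f, "%" SCNxPTR " %x", &rseq_addr, &rseq_sig) != 2)
                rseq_addr = 0;
        fclose(f);
        printf("rseq=0x%" PRIxPTR " sig=0x%x\n", rseq_addr, rseq_sig);
        return 0;
}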

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[lttng-dev] Community input: Feedback on use of enumeration type within LTTng-UST

2021-01-26 Thread Mathieu Desnoyers via lttng-dev
Hi,

We are currently working on the CTF2 specification [1] at EfficiOS, and there is
a user metric we would need to help us with a design decision with respect
to enumerations. This is where we would need community input.

The usage metrics in question are with respect to LTTng-UST enumeration types
(TRACEPOINT_ENUM) [2] used by tracepoint events within instrumented applications.

A tracepoint enumeration can be used by many events. What we are looking for is
to get an idea of the common use, and of extreme cases as well.

Ideally, what we would need is:

- For each enumeration within your application instrumentation (TRACEPOINT_ENUM),
  how many events refer to each enumeration ? (average, mode, and maximum)
- For each TRACEPOINT_ENUM, how many labels does it possess ? (sum of
  ctf_enum_value/ctf_enum_range/ctf_enum_auto entries) (average, mode, and maximum)
- For each TRACEPOINT_ENUM, for each label, what is the string length ?
  (average, mode, and maximum)

Based on this information, we can estimate the data overhead generated by repeating
enumeration labels in the LTTng-UST metadata. We need to decide whether we allow
references to a single enumeration description when it is used by many events, or
if we require repeatedly serializing the entire enumeration description for each
event field using the enumeration.
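
To give a purely hypothetical order of magnitude of the estimate we are after:
an enumeration with 100 labels averaging 20 characters is roughly 2 kB of label
text. If 50 events reference it and the full description is repeated for each
event field, that single enumeration contributes on the order of 100 kB of
metadata, versus roughly 2 kB plus 50 small references if a shared description
is allowed.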

Thanks,

Mathieu

[1] https://lists.lttng.org/pipermail/lttng-dev/2020-November/029777.html
[2] https://lttng.org/man/3/lttng-ust/v2.12/#doc-tracepoint-enum

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] Re: Re: Re: help modpost: "__bad_cmpxchg"

2021-01-19 Thread Mathieu Desnoyers via lttng-dev


- On Jan 17, 2021, at 9:04 PM, lttng-dev lttng-dev@lists.lttng.org wrote:

> 
> From: Jonathan Rajotte-Julien 
> Sent: January 16, 2021 3:18
> To: Zhang, Qiang
> Cc: lttng-dev
> Subject: Re: Re: Re: help modpost: "__bad_cmpxchg"
> 
> [Please note this e-mail is from an EXTERNAL e-mail address]
> 
>>Hi Qiang,
>>
>>Please test the attached patch.
> 
> Hello Jonathan
> 
> Thank you for your patch, I can compile successfully after testing.

Patch pushed into lttng-modules master, thanks for testing!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [PATCH v2] fs/proc: Expose RSEQ configuration

2021-01-15 Thread Mathieu Desnoyers
- On Jan 14, 2021, at 1:54 PM, Piotr Figiel fig...@google.com wrote:

Added PeterZ, Paul and Boqun to CC. They are also listed as maintainers of rseq.
Please CC them in your next round of patches.

> For userspace checkpoint and restore (C/R) some way of getting process
> state containing RSEQ configuration is needed.
> 
> There are two ways this information is going to be used:
> - to re-enable RSEQ for threads which had it enabled before C/R
> - to detect if a thread was in a critical section during C/R
> 
> Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
> using the address registered before C/R.

Indeed, if the process goes through a checkpoint/restore while within a
rseq c.s., that critical section should abort. Given that it's only the
restored process which resumes user-space execution, there should be some
way to ensure that the rseq tls pointer is restored before that thread goes
back to user-space, or some way to ensure the rseq TLS is registered
before that thread returns to the saved instruction pointer.

How do you plan to re-register the rseq TLS for each thread upon restore ?

I suspect you move the return IP to the abort either at checkpoint or restore
if you detect that the thread is running in a rseq critical section.

> 
> Detection whether the thread is in a critical section during C/R is
> needed to enforce behavior of RSEQ abort during C/R. Attaching with
> ptrace() before registers are dumped itself doesn't cause RSEQ abort.

Right, because the RSEQ abort is only done when going back to user-space,
and AFAIU the checkpointed process will cease to exist, and won't go back
to user-space, therefore bypassing any RSEQ abort.

> Restoring the instruction pointer within the critical section is
> problematic because rseq_cs may get cleared before the control is
> passed to the migrated application code leading to RSEQ invariants not
> being preserved.

The commit message should state that both the per-thread rseq TLS area address
and the signature are dumped within this new proc file.

> 
> Signed-off-by: Piotr Figiel 
> 
> ---
> 
> v2:
> - fixed string formatting for 32-bit architectures
> 
> v1:
> - https://lkml.kernel.org/r/20210113174127.2500051-1-fig...@google.com
> 
> ---
> fs/proc/base.c | 21 +
> 1 file changed, 21 insertions(+)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index b3422cda2a91..7cc36a224b8b 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -662,6 +662,21 @@ static int proc_pid_syscall(struct seq_file *m, struct
> pid_namespace *ns,
> 
>   return 0;
> }
> +
> +#ifdef CONFIG_RSEQ
> +static int proc_pid_rseq(struct seq_file *m, struct pid_namespace *ns,
> + struct pid *pid, struct task_struct *task)
> +{
> + int res = lock_trace(task);

AFAIU lock_trace prevents concurrent exec() from modifying the task's content.
What prevents a concurrent rseq register/unregister to be executed concurrently
with proc_pid_rseq ?

> +
> + if (res)
> + return res;
> + seq_printf(m, "%tx %08x\n", (ptrdiff_t)((uintptr_t)task->rseq),

I wonder if all those parentheses are needed. Wouldn't it be enough to have:

  (ptrdiff_t)(uintptr_t)task->rseq

?

Thanks,

Mathieu

> +task->rseq_sig);
> + unlock_trace(task);
> + return 0;
> +}
> +#endif /* CONFIG_RSEQ */
> #endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
> 
> //
> @@ -3182,6 +3197,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>   REG("comm",  S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
> #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
>   ONE("syscall",S_IRUSR, proc_pid_syscall),
> +#ifdef CONFIG_RSEQ
> + ONE("rseq",   S_IRUSR, proc_pid_rseq),
> +#endif
> #endif
>   REG("cmdline",S_IRUGO, proc_pid_cmdline_ops),
>   ONE("stat",   S_IRUGO, proc_tgid_stat),
> @@ -3522,6 +3540,9 @@ static const struct pid_entry tid_base_stuff[] = {
>&proc_pid_set_comm_operations, {}),
> #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
>   ONE("syscall",   S_IRUSR, proc_pid_syscall),
> +#ifdef CONFIG_RSEQ
> + ONE("rseq",  S_IRUSR, proc_pid_rseq),
> +#endif
> #endif
>   REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
>   ONE("stat",  S_IRUGO, proc_tid_stat),
> --
> 2.30.0.284.gd98b1dd5eaa7-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[RELEASE] LTTng-modules 2.11.7 and 2.12.4 (Linux kernel tracer)

2021-01-11 Thread Mathieu Desnoyers
Hi,

I just released versions 2.11.7 and 2.12.4 of the lttng-modules stable branches.
These add support for the 5.10 Linux kernel.

Please try them out, and, as usual, feedback is welcome!

Thanks,

Mathieu

Project website: https://lttng.org
Documentation: https://lttng.org/docs
Download link: https://lttng.org/download

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[lttng-dev] [RELEASE] LTTng-modules 2.11.7 and 2.12.4 (Linux kernel tracer)

2021-01-11 Thread Mathieu Desnoyers via lttng-dev
Hi,

I just released versions 2.11.7 and 2.12.4 of the lttng-modules stable branches.
These add support for the 5.10 Linux kernel.

Please try them out, and, as usual, feedback is welcome!

Thanks,

Mathieu

Project website: https://lttng.org
Documentation: https://lttng.org/docs
Download link: https://lttng.org/download

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] Enamebling namespace contexts

2020-12-28 Thread Mathieu Desnoyers via lttng-dev
- On Dec 23, 2020, at 9:55 PM, lttng-dev  wrote: 

> Hi,

> I found that lttng is working on container awareness in these slides:
> https://archive.fosdem.org/2019/schedule/event/containers_lttng/attachments/slides/3419/export/events/attachments/containers_lttng/slides/3419/lttng_containers_fosdem19.pdf

> On page #13, there is a command: lttng add-context -k -t procname -t pid -t vpid
> -t tid -t vtid -t pid_ns, where pid_ns and other namespace identifiers are very
> useful for tracing containers. However, it seems that the current version of
> lttng doesn't support adding the pid_ns context (Error: Unknown context type
> pid_ns). Do you know how to enable these features?

What version of LTTng do you use (output of "lttng --version"), and what is the 
output of "lttng add-context --list" ? 

Thanks, 

Mathieu 

> Thanks a lot.
> Btw, have a nice holiday!

> Serica

> ___
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
[...]thread siblings have executed a core serializing instruction. This
>  guarantee is provided only for threads in the same process as the calling
>  thread.
> 
>  The "expedited" commands complete faster than the  non-expedited
>  ones,  they  never block, but have the downside of causing extra
>  overhead.
> 
>  A process must register its intent to use the private  expedited
>  sync core command prior to using it.
> 
> This just says that the siblings have executed a serialising
> instruction, in other words a barrier. It makes no claims concerning
> cache coherency - and without some form of cache maintenance, there
> can be no expectation that the I and D streams to be coherent with
> each other.

Right, membarrier is not doing anything wrt I/D caches. On architectures
without coherent caches, users should use other system calls or instructions
provided by the architecture to synchronize the appropriate address ranges. 

> This description is also weird in another respect. "guarantee that
> all its running thread siblings have executed a core serializing
> instruction" ... "The expedited commands ... never block".
> 
> So, the core executing this call is not allowed to block, but the
> other part indicates that the other CPUs _have_ executed a serialising
> instruction before this call returns... one wonders how that happens
> without blocking. Maybe the CPU spins waiting for completion instead?

Membarrier expedited sync-core issues IPIs to all CPUs running sibling
threads. AFAIR the IPI mechanism uses the "csd lock" which is basically
busy waiting. So it does not "block", it busy-waits.

For completeness of the explanation, other (non-running) threads acting
on the same mm will eventually issue the context synchronizing instruction
before returning to user-space whenever they are scheduled back.

Thanks,

Mathieu


> 
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
[...]membarrier does not do anything wrt cache coherency. We should perhaps
explicitly point it out though.

>>
>> So, either Andy has a misunderstanding, or the man page is wrong, or
>> my rudimentary understanding of what membarrier is supposed to be
>> doing is wrong...
> 
> Look at the latest man page:
> 
> https://man7.org/linux/man-pages/man2/membarrier.2.html
> 
> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be
> all that enlightening.

As described in the membarrier(2) man page and in include/uapi/linux/membarrier.h,
membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing
instructions in addition to the memory ordering guaranteed by
MEMBARRIER_CMD_PRIVATE_EXPEDITED.

It does not guarantee anything wrt i/d cache coherency. I initially considered
adding such guarantees when we introduced membarrier sync-core, but decided
against it because it would basically replicate what architectures already
expose to user-space, e.g. flush_icache_user_range on arm32.

So between code modification and allowing other threads to jump to that code,
it should be expected that architectures without coherent i/d cache will need
to flush caches to ensure coherency *and* to issue membarrier to make sure
core serializing instructions will be issued by every thread acting on the
same mm either immediately by means of the IPI, or before they return to
user-space if they do not happen to be currently running when the membarrier
system call is invoked.
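
A user-space sketch of that two-step sequence for a JIT (illustrative only; the
process must have registered with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE
beforehand, and the cache-maintenance step is a no-op on architectures with coherent
i/d caches):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static void commit_new_code(char *start, size_t len)
{
        /* Step 1: architecture cache maintenance for the modified range. */
        __builtin___clear_cache(start, start + len);

        /*
         * Step 2: make sure every thread of this mm issues a context
         * synchronizing instruction before it can run the new code.
         */
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
}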

Hoping this clarifies things. I suspect we will need to clarify documentation
about what membarrier *does not* guarantee, given that you mistakenly expected
membarrier to take care of cache flushing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
- On Dec 28, 2020, at 4:06 PM, Andy Lutomirski l...@kernel.org wrote:

> On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
>  wrote:
>>
>> - On Dec 28, 2020, at 2:44 PM, Andy Lutomirski l...@kernel.org wrote:
>>
>> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> >  wrote:
>> >>
>> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> > After chatting with rmk about this (but without claiming that any of
>> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> > to synchronize code changes, userspace should first do the code
>> >> > change, then flush icache as appropriate for the architecture, and
>> >> > then do the membarrier() to ensure that the old code is unused?
>>
>> ^ exactly, yes.
>>
>> >> >
>> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> > 32-bit arm?
>>
>> This was the original thinking, yes. The cacheflush IPI will flush specific
>> regions of code, and the membarrier IPI issues context synchronizing
>> instructions.
>>
>> Architectures with coherent i/d caches don't need the cacheflush step.
> 
> There are different levels of coherency -- VIVT architectures may have
> differing requirements compared to PIPT, etc.
> 
> In any case, I feel like the approach taken by the documentation is
> fundamentally confusing.  Architectures don't all speak the same
> language

Agreed.

> How about something like:

I dislike the wording "barrier" and the association between "write" and
"instruction fetch" done in the descriptions below. It leads to think that
this behaves like a memory barrier, when in fact my understanding of
a context synchronizing instruction is that it simply flushes internal
CPU state, which would cause coherency issues if the CPU observes both the
old and then the new code without having this state flushed.

[ Sorry if I take more time to reply and if my replies are a bit more
  concise than usual. I'm currently on parental leave, so I have
  non-maskable interrupts to attend to. ;-) ]

Thanks,

Mathieu

> 
> The SYNC_CORE operation causes all threads in the caller's address
> space (including the caller) to execute an architecture-defined
> barrier operation.  membarrier() will ensure that this barrier is
> executed at a time such that all data writes done by the calling
> thread before membarrier() are made visible by the barrier.
> Additional architecture-dependent cache management operations may be
> required to use this for JIT code.
> 
> x86: SYNC_CORE executes a barrier that will cause subsequent
> instruction fetches to observe prior writes.  Currently this will be a
> "serializing" instruction, but, if future improved CPU documentation
> becomes available and relaxes this requirement, the barrier may
> change.  The kernel guarantees that writing new or modified
> instructions to normal memory (and issuing SFENCE if the writes were
> non-temporal) then doing a membarrier SYNC_CORE operation is
> sufficient to cause all threads in the caller's address space to
> execute the new or modified instructions.  This is true regardless of
> whether or not those instructions are written at the same virtual
> address from which they are subsequently executed.  No additional
> cache management is required on x86.
> 
> arm: Something about the cache management syscalls.
> 
> arm64: Ditto
> 
> powerpc: I have no idea.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski l...@kernel.org wrote:

[...]

>> You seem to have noticed odd cases on arm64 where this guarantee does not
>> match reality. Where exactly can we find this in the code, and which part
>> of the architecture manual can you point us to which supports your concern ?
>>
>> Based on the notes I have, use of `eret` on aarch64 guarantees a context synchronizing
>> instruction when returning to user-space.
> 
> Based on my reading of the manual, ERET on ARM doesn't synchronize
> anything at all.  I can't find any evidence that it synchronizes data
> or instructions, and I've seen reports that the CPU will happily
> speculate right past it.

Reading [1], there appear to be 3 kinds of context synchronization events:

- Taking an exception,
- Returning from an exception,
- ISB.

This other source [2] adds (search for Context synchronization operation):

- Exit from Debug state
- Executing a DCPS instruction
- Executing a DRPS instruction

"ERET" falls into the second kind of events, and AFAIU should be context
synchronizing. That was confirmed to me by Will Deacon when membarrier
sync-core was implemented for aarch64. If the architecture reference manuals
are wrong, is there an erratum ?

As for the algorithm to use on ARMv8 to update instructions, see [2]
B2.3.4  Implication of caches for the application programmer
"Synchronization and coherency issues between data and instruction accesses"

Membarrier only takes care of making sure the "ISB" part of the algorithm can be
done easily and efficiently on multiprocessor systems.
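
To make the "ISB part" concrete, the per-cache-line portion of that update
algorithm looks roughly like the sketch below (the helper name, the use of GCC
inline asm and the omission of the loop over the affected address range are
simplifications of mine, not the manual's wording):

static inline void armv8_sync_one_line(void *addr)
{
	/* Clean the D-cache line holding the new code to the point of
	 * unification. */
	asm volatile("dc cvau, %0" : : "r" (addr) : "memory");
	asm volatile("dsb ish" : : : "memory");
	/* Invalidate the corresponding I-cache line. */
	asm volatile("ic ivau, %0" : : "r" (addr) : "memory");
	asm volatile("dsb ish" : : : "memory");
	/*
	 * Each CPU that will execute the new code still needs an ISB;
	 * that is the step membarrier sync-core provides for the other
	 * threads of the process.
	 */
}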

Thanks,

Mathieu

[1] 
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail
[2] 
https://montcs.bloomu.edu/Information/ARMv8/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
hread siblings have executed a core  serializing  instruc‐
>  tion.   This  guarantee is provided only for threads in the same
>  process as the calling thread.
> 
>  The "expedited" commands complete faster than the  non-expedited
>  ones,  they  never block, but have the downside of causing extra
>  overhead.
> 
>  A process must register its intent to use the private  expedited
>  sync core command prior to using it.
> 
> This just says that the siblings have executed a serialising
> instruction, in other words a barrier. It makes no claims concerning
> cache coherency - and without some form of cache maintenance, there
> can be no expectation that the I and D streams will be coherent with
> each other.

Right, membarrier is not doing anything wrt I/D caches. On architectures
without coherent caches, users should use other system calls or instructions
provided by the architecture to synchronize the appropriate address ranges. 

> This description is also weird in another respect. "guarantee that
> all its running thread siblings have executed a core serializing
> instruction" ... "The expedited commands ... never block".
> 
> So, the core executing this call is not allowed to block, but the
> other part indicates that the other CPUs _have_ executed a serialising
> instruction before this call returns... one wonders how that happens
> without blocking. Maybe the CPU spins waiting for completion instead?

Membarrier expedited sync-core issues IPIs to all CPUs running sibling
threads. AFAIR the IPI mechanism uses the "csd lock" which is basically
busy waiting. So it does not "block", it busy-waits.

For completeness of the explanation, other (non-running) threads acting
on the same mm will eventually issue the context synchronizing instruction
before returning to user-space whenever they are scheduled back.

Thanks,

Mathieu


> 
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
er does not
do anything wrt cache coherency. We should perhaps explicitly point
it out though.

>>
>> So, either Andy has a misunderstanding, or the man page is wrong, or
>> my rudimentary understanding of what membarrier is supposed to be
>> doing is wrong...
> 
> Look at the latest man page:
> 
> https://man7.org/linux/man-pages/man2/membarrier.2.html
> 
> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be
> all that enlightening.

As described in the membarrier(2) man page and in include/uapi/linux/membarrier.h,
membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing
instructions in addition to the memory ordering guaranteed by
MEMBARRIER_CMD_PRIVATE_EXPEDITED.

It does not guarantee anything wrt i/d cache coherency. I initially considered
adding such guarantees when we introduced membarrier sync-core, but decided
against it because it would basically replicate what architectures already
expose to user-space, e.g. flush_icache_user_range on arm32.

So between code modification and allowing other threads to jump to that code,
it should be expected that architectures without coherent i/d cache will need
to flush caches to ensure coherency *and* to issue membarrier to make sure
core serializing instructions will be issued by every thread acting on the
same mm either immediately by means of the IPI, or before they return to
user-space if they do not happen to be currently running when the membarrier
system call is invoked.
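
To make that sequence concrete, here is a rough sketch of the userspace side
(no error handling; the process is assumed to have registered with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE at startup, and the
function names publish_code()/membarrier() are illustrative):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Copy new instructions into an executable mapping and publish them. */
static void publish_code(char *dst, const char *src, size_t len)
{
	/* 1) Code modification. */
	memcpy(dst, src, len);

	/* 2) i/d cache synchronization, using whatever the architecture
	 *    exposes (e.g. the cacheflush() syscall on arm32); the compiler
	 *    builtin is a portable way to get it, and compiles to nothing
	 *    where i/d caches are coherent. */
	__builtin___clear_cache(dst, dst + len);

	/* 3) Ensure every thread of the process executes a context
	 *    synchronizing instruction before it can run the new code. */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}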

Hoping this clarifies things. I suspect we will need to clarify documentation
about what membarrier *does not* guarantee, given that you mistakenly expected
membarrier to take care of cache flushing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-27 Thread Mathieu Desnoyers
- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org wrote:

> The old sync_core_before_usermode() comments said that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.  Based on my general understanding of how CPUs work and based on
> my attempt to read the ARM manual, this is not true at all.  In fact, x86
> seems to be a bit of an anomaly in the other direction: x86's IRET is
> unusually heavyweight for a return-to-usermode instruction.
> 
> So let's drop any pretense that we can have a generic implementation
> behind membarrier's SYNC_CORE flush and require all architectures that opt
> in to supply their own.

Removing the generic implementation is OK with me, as this will really require
architecture maintainers to think hard about it when porting this feature.

> This means x86, arm64, and powerpc for now.  Let's
> also rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.

Work for me too.

> 
> I admit that I'm rather surprised that the code worked at all on arm64,
> and I'm suspicious that it has never been very well tested.  My apologies
> for not reviewing this more carefully in the first place.

Please refer to 
Documentation/features/sched/membarrier-sync-core/arch-support.txt

It clearly states that only arm, arm64, powerpc and x86 support the membarrier
sync core feature as of now:


# Architecture requirements
#
# * arm/arm64/powerpc
#
# Rely on implicit context synchronization as a result of exception return
# when returning from IPI handler, and when returning to user-space.
#
# * x86
#
# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
# instruction is core serializing, but not SYSEXIT.
#
# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
# However, it can return to user-space through either SYSRETL (compat code),
# SYSRETQ, or IRET.
#
# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
# instead on write_cr3() performed by switch_mm() to provide core serialization
# after changing the current mm, and deal with the special case of kthread ->
# uthread (temporarily keeping current mm into active_mm) by issuing a
# sync_core_before_usermode() in that specific case.

This is based on direct feedback from the architecture maintainers.

You seem to have noticed odd cases on arm64 where this guarantee does not
match reality. Where exactly can we find this in the code, and which part
of the architecture manual can you point us to which supports your concern ?

Based on the notes I have, use of `eret` on aarch64 guarantees a context synchronizing
instruction when returning to user-space.

Thanks,

Mathieu

> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Nicholas Piggin 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Mathieu Desnoyers 
> Cc: x...@kernel.org
> Cc: sta...@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command,
> *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski 
> ---
> 
> Hi arm64 and powerpc people-
> 
> This is part of a series here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes
> 
> Before I send out the whole series, I'm hoping that some arm64 and powerpc
> people can help me verify that I did this patch right.  Once I get
> some feedback on this patch, I'll send out the whole pile.  And once
> *that's* done, I'll start giving the mm lazy stuff some serious thought.
> 
> The x86 part is already fixed in Linus' tree.
> 
> Thanks,
> Andy
> 
> arch/arm64/include/asm/sync_core.h   | 21 +
> arch/powerpc/include/asm/sync_core.h | 20 
> arch/x86/Kconfig |  1 -
> arch/x86/include/asm/sync_core.h |  7 +++
> include/linux/sched/mm.h |  1 -
> include/linux/sync_core.h| 21 -
> init/Kconfig |  3 ---
> kernel/sched/membarrier.c| 15 +++
> 8 files changed, 55 insertions(+), 34 deletions(-)
> create mode 100644 arch/arm64/include/asm/sync_core.h
> create mode 100644 arch/powerpc/include/asm/sync_core.h
> delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/arch/arm64/include/asm/sync_core.h
> 

Re: [lttng-dev] Possibilities to customize lttng tracepoints in kernel space

2020-12-17 Thread Mathieu Desnoyers via lttng-dev
- On Dec 16, 2020, at 4:19 AM, lttng-dev  wrote: 

> Hi,

> I am sending this email to ask whether it is possible to customize LTTng
> tracepoints in kernel space. I have learnt that LTTng leverages Linux
> tracepoints to collect audit logs such as system calls. Also, I have found
> that users can define their own customized tracepoints in user space by using
> lttng-ust so that they can trace their user applications.

> Is it possible for LTTng users to customize the existing tracepoints in kernel
> space? For example, after a system call such as sys_clone or read is called
> and collected by LTTng, I want to process some data (e.g., the return value
> of the syscall) and place the result in a new field in the audit log (or,
> using another approach, emit a new type of event in the audit log), so that
> when the trace is parsed by babeltrace we can see the newly-added field or
> event in the parsed result.

> Looking forward to your reply.

Hi, 

You will want to start by having a look at this section of the LTTng 
documentation: https://lttng.org/docs/v2.12/#doc-instrumenting-linux-kernel 

You can indeed modify lttng-modules to change the fields gathered by the system 
call tracing facility (see include/instrumentation/syscalls/README section 
(3)). 
Those changes will be reflected in the resulting trace data. 
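
If it helps as a starting point, event definitions in the lttng-modules 
instrumentation headers follow the LTTNG_TRACEPOINT_EVENT() style shown in the 
documentation linked above. The fragment below is a made-up illustration of 
recording a derived value as an extra field (the usual TRACE_SYSTEM/include 
boilerplate around it is omitted): 

LTTNG_TRACEPOINT_EVENT(my_subsys_my_event,

	TP_PROTO(long ret),

	TP_ARGS(ret),

	TP_FIELDS(
		/* Raw value passed to the probe. */
		ctf_integer(long, ret, ret)
		/* Derived value computed at trace time. */
		ctf_integer(int, ret_is_error, ret < 0)
	)
)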

Thanks, 

Mathieu 

> Best wishes,

> Serica

> ___
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH urcu 4/4] Don't force a target and optimization level on ARMv7

2020-12-17 Thread Mathieu Desnoyers via lttng-dev
Merged in liburcu master, thanks!

Mathieu

- On Dec 15, 2020, at 11:28 AM, Michael Jeanson mjean...@efficios.com wrote:

> We shouldn't force a specific target cpu for the compiler on ARMv7 but
> let the system or the user choose it. If some of our code depends on
> specific target CPU features, it should be compile tested.
> 
> Also remove the default optimisation level of -O1; it's potentially a
> workaround for an early ARMv7 compiler performance problem, and anyway
> most builds will have an optimisation level flag set in the CFLAGS which
> will override this one.
> 
> Signed-off-by: Michael Jeanson 
> Cc: Paul E. McKenney 
> Change-Id: I1d1bb5cc0fa0be8f8b1d6a9ad7bf063809be1aef
> ---
> configure.ac | 4 
> 1 file changed, 4 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index daa967a..f477425 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -119,10 +119,6 @@ AS_CASE([$host],[*-cygwin*],
>   [AM_CONDITIONAL(USE_CYGWIN, false)]
> )
> 
> -AS_IF([test "$host_cpu" = "armv7l"],[
> - AM_CFLAGS="$AM_CFLAGS -mcpu=cortex-a9 -mtune=cortex-a9 -O1"
> -])
> -
> # Search for clock_gettime
> AC_SEARCH_LIBS([clock_gettime], [rt], [
>   AC_DEFINE([CONFIG_RCU_HAVE_CLOCK_GETTIME], [1])
> --
> 2.29.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH urcu 3/4] Use DMB only on ARMv7

2020-12-17 Thread Mathieu Desnoyers via lttng-dev
Merged in liburcu master, thanks!

Mathieu

- On Dec 15, 2020, at 11:28 AM, Michael Jeanson mjean...@efficios.com wrote:

> Remove the configure time CONFIG_RCU_ARM_HAVE_DMB option and replace it
> by compile time detection based on the ARM ISA version. This makes sure
> we unconditionally use the DMB instruction only on ARMv7, where it's part
> of the baseline ISA.
> 
> This will change the behavior on ARMv6 platforms that possibly have this
> instruction, but it was probably already broken since we use the 'ISH'
> option, which doesn't seem to be valid on that ISA.
> 
> This will also allow sharing headers in a multi-arch environment and
> reduce the build system complexity.
> 
> Signed-off-by: Michael Jeanson 
> Cc: Jason Wessel 
> Change-Id: I8e56ada55148d8e0f198c3d2e741ea414de5fef2
> ---
> configure.ac | 19 ---
> include/urcu/arch.h  |  7 +++
> include/urcu/arch/arm.h  | 13 +++--
> include/urcu/config.h.in |  3 ---
> 4 files changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index d1d43e6..daa967a 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -21,7 +21,6 @@ m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])])
> AC_REQUIRE_AUX_FILE([tap-driver.sh])
> 
> AH_TEMPLATE([CONFIG_RCU_SMP], [Enable SMP support. With SMP support enabled,
> uniprocessors are also supported. With SMP support disabled, UP systems work
> fine, but the behavior of SMP systems is undefined.])
> -AH_TEMPLATE([CONFIG_RCU_ARM_HAVE_DMB], [Use the dmb instruction if available
> for use on ARM.])
> AH_TEMPLATE([CONFIG_RCU_TLS], [TLS provided by the compiler.])
> AH_TEMPLATE([CONFIG_RCU_HAVE_CLOCK_GETTIME], [clock_gettime() is detected.])
> AH_TEMPLATE([CONFIG_RCU_FORCE_SYS_MEMBARRIER], [Require the operating system 
> to
> support the membarrier system call for default and bulletproof flavors.])
> @@ -124,24 +123,6 @@ AS_IF([test "$host_cpu" = "armv7l"],[
>   AM_CFLAGS="$AM_CFLAGS -mcpu=cortex-a9 -mtune=cortex-a9 -O1"
> ])
> 
> -# ARM-specific checks
> -AS_CASE([$host_cpu], [arm*], [
> - AC_MSG_CHECKING([for dmb instruction])
> - AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
> - int main()
> - {
> - asm volatile("dmb":::"memory");
> - return 0;
> - }
> - ]])
> - ],[
> - AC_MSG_RESULT([yes])
> - AC_DEFINE([CONFIG_RCU_ARM_HAVE_DMB], [1])
> - ],[
> - AC_MSG_RESULT([no])
> - ])
> -])
> -
> # Search for clock_gettime
> AC_SEARCH_LIBS([clock_gettime], [rt], [
>   AC_DEFINE([CONFIG_RCU_HAVE_CLOCK_GETTIME], [1])
> diff --git a/include/urcu/arch.h b/include/urcu/arch.h
> index c4b8bc2..620743c 100644
> --- a/include/urcu/arch.h
> +++ b/include/urcu/arch.h
> @@ -41,6 +41,7 @@
>  * URCU_ARCH_ALPHA : All DEC Alpha variants
>  * URCU_ARCH_IA64 : All Intel Itanium variants
>  * URCU_ARCH_ARM : All ARM 32 bits variants
> + *   URCU_ARCH_ARMV7 : All ARMv7 ISA variants
>  * URCU_ARCH_AARCH64 : All ARM 64 bits variants
>  * URCU_ARCH_MIPS : All MIPS variants
>  * URCU_ARCH_NIOS2 : All Intel / Altera NIOS II variants
> @@ -105,6 +106,12 @@
> #define URCU_ARCH_IA64 1
> #include 
> 
> +#elif (defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7__))
> +
> +#define URCU_ARCH_ARMV7 1
> +#define URCU_ARCH_ARM 1
> +#include 
> +
> #elif (defined(__arm__) || defined(__arm))
> 
> #define URCU_ARCH_ARM 1
> diff --git a/include/urcu/arch/arm.h b/include/urcu/arch/arm.h
> index e904b06..54ca4fa 100644
> --- a/include/urcu/arch/arm.h
> +++ b/include/urcu/arch/arm.h
> @@ -30,7 +30,15 @@
> extern "C" {
> #endif
> 
> -#ifdef CONFIG_RCU_ARM_HAVE_DMB
> +/*
> + * Using DMB is faster than the builtin __sync_synchronize and this 
> instruction
> is
> + * part of the baseline ARMv7 ISA.
> + */
> +#ifdef URCU_ARCH_ARMV7
> +
> +/* For backwards compat. */
> +#define CONFIG_RCU_ARM_HAVE_DMB 1
> +
> /*
>  * Issues full system DMB operation.
>  */
> @@ -44,7 +52,8 @@ extern "C" {
> #define cmm_smp_mb()  __asm__ __volatile__ ("dmb ish":::"memory")
> #define cmm_smp_rmb() __asm__ __volatile__ ("dmb ish":::"memory")
> #define cmm_smp_wmb() __asm__ __volatile__ ("dmb ish":::"memory")
> -#endif /* CONFIG_RCU_ARM_HAVE_DMB */
> +
> +#endif /* URCU_ARCH_ARMV7 */
> 
> #include 
> #include 
> diff --git a/include/urcu/config.h.in b/include/urcu/config.h.in
> index f

Re: [lttng-dev] [PATCH urcu 2/4] Blacklist GCC 4.4.0, 4.4.1 and 4.4.2 on ARM

2020-12-17 Thread Mathieu Desnoyers via lttng-dev
Merged in liburcu master, thanks!

Mathieu

- On Dec 15, 2020, at 11:28 AM, Michael Jeanson mjean...@efficios.com wrote:

> GCC added __sync_synchronize() in 4.4.0, but it was broken on ARM until
> 4.4.3; see the GCC bug report for more details:
> 
>  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42263
> 
> Signed-off-by: Michael Jeanson 
> Change-Id: I629e3c8b4baaccb34b2311e220f30d0ad8b69a19
> ---
> include/urcu/arch/arm.h | 9 +
> 1 file changed, 9 insertions(+)
> 
> diff --git a/include/urcu/arch/arm.h b/include/urcu/arch/arm.h
> index 5d1c608..e904b06 100644
> --- a/include/urcu/arch/arm.h
> +++ b/include/urcu/arch/arm.h
> @@ -70,6 +70,15 @@ extern "C" {
> # endif
> #endif
> 
> +/*
> + * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42263
> + */
> +#ifdef URCU_GCC_VERSION
> +# if URCU_GCC_VERSION >= 40400 && URCU_GCC_VERSION <= 40402
> +#  error Your gcc version has a non-functional __sync_synchronize()
> +# endif
> +#endif
> +
> #ifdef __cplusplus
> }
> #endif
> --
> 2.29.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH urcu 1/4] Cleanup: Move ARM specific code to urcu/arch/arm.h

2020-12-17 Thread Mathieu Desnoyers via lttng-dev
Merged in liburcu master, thanks!

Mathieu

- On Dec 15, 2020, at 11:28 AM, Michael Jeanson mjean...@efficios.com wrote:

> Change-Id: I3e17308c5ae985789a2ac8361e9c9e958ff7d656
> Signed-off-by: Michael Jeanson 
> ---
> include/urcu/arch/arm.h | 13 +
> include/urcu/compiler.h | 13 -
> 2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/include/urcu/arch/arm.h b/include/urcu/arch/arm.h
> index cb8f28d..5d1c608 100644
> --- a/include/urcu/arch/arm.h
> +++ b/include/urcu/arch/arm.h
> @@ -57,6 +57,19 @@ extern "C" {
> #define __NR_membarrier   389
> #endif
> 
> +/*
> + * Error out for compilers with known bugs.
> + */
> +
> +/*
> + * http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
> + */
> +#ifdef URCU_GCC_VERSION
> +# if URCU_GCC_VERSION >= 40800 && URCU_GCC_VERSION <= 40802
> +#  error Your gcc version produces clobbered frame accesses
> +# endif
> +#endif
> +
> #ifdef __cplusplus
> }
> #endif
> diff --git a/include/urcu/compiler.h b/include/urcu/compiler.h
> index 511dbdf..4806ee3 100644
> --- a/include/urcu/compiler.h
> +++ b/include/urcu/compiler.h
> @@ -108,23 +108,10 @@
> 
> #define CAA_ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> 
> -/*
> - * Don't allow compiling with buggy compiler.
> - */
> -
> #ifdef __GNUC__
> # define URCU_GCC_VERSION (__GNUC__ * 1 \
>   + __GNUC_MINOR__ * 100 \
>   + __GNUC_PATCHLEVEL__)
> -
> -/*
> - * http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854
> - */
> -# ifdef __ARMEL__
> -#  if URCU_GCC_VERSION >= 40800 && URCU_GCC_VERSION <= 40802
> -#   error Your gcc version produces clobbered frame accesses
> -#  endif
> -# endif
> #endif
> 
> #endif /* _URCU_COMPILER_H */
> --
> 2.29.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH urcu] fix: bump tests thread limit to 256

2020-12-11 Thread Mathieu Desnoyers via lttng-dev
Hi Michael,

After discussion with Paul, we can go for bumping the max nr cpu limit
of this test to 4096.

1536 would cover all the current hardware Paul E. McKenney is aware of.
4096 would cover all hardware he has ever received a bug report on.

Can you try it out and submit an updated patch ?
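
Just to illustrate the dynamic-allocation direction (this is not the actual
patch, and the sizing heuristic is arbitrary):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_t *__thread_id_map;
static long __thread_id_map_size;

static void init_thread_id_map(void)
{
	/* Size the map from the number of configured CPUs instead of a
	 * compile-time NR_THREADS constant. */
	__thread_id_map_size = 2 * sysconf(_SC_NPROCESSORS_CONF);
	__thread_id_map = calloc(__thread_id_map_size,
				 sizeof(*__thread_id_map));
	if (!__thread_id_map)
		abort();
}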

Thanks,

Mathieu

- On Dec 9, 2020, at 1:29 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> Hi Paul,
> 
> Should I merge this temporary fix for liburcu tests, or should we go
> for dynamic allocation of the array right away instead ?
> 
> Thanks,
> 
> Mathieu
> 
> - On Dec 9, 2020, at 1:15 PM, Michael Jeanson mjean...@efficios.com wrote:
> 
>> Machines with more than 128 CPUs are becoming more common. The proper
>> fix here would be to dynamically allocate the array, which we will do,
>> but in the meantime bump the limit to 256 to fix the problem on a
>> 160-CPU ppc64el system where this was reported.
>> 
>> Signed-off-by: Michael Jeanson 
>> Cc: Paul E. McKenney 
>> Change-Id: Ib3cb5d8cb4515e6f626be33c2685fa38cb081782
>> ---
>> tests/common/api.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/tests/common/api.h b/tests/common/api.h
>> index 2b72ec5..b15e588 100644
>> --- a/tests/common/api.h
>> +++ b/tests/common/api.h
>> @@ -108,7 +108,7 @@ static void spin_unlock(spinlock_t *sp)
>> 
>> typedef pthread_t thread_id_t;
>> 
>> -#define NR_THREADS 128
>> +#define NR_THREADS 256
>> 
>> #define __THREAD_ID_MAP_EMPTY ((thread_id_t) 0)
>> #define __THREAD_ID_MAP_WAITING ((thread_id_t) 1)
>> --
>> 2.29.2
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH urcu] fix: bump tests thread limit to 256

2020-12-09 Thread Mathieu Desnoyers via lttng-dev
Hi Paul,

Should I merge this temporary fix for liburcu tests, or should we go
for dynamic allocation of the array right away instead ?

Thanks,

Mathieu

- On Dec 9, 2020, at 1:15 PM, Michael Jeanson mjean...@efficios.com wrote:

> Machines with more than 128 CPUs are becoming more common. The proper
> fix here would be to dynamically allocate the array, which we will do,
> but in the meantime bump the limit to 256 to fix the problem on a
> 160-CPU ppc64el system where this was reported.
> 
> Signed-off-by: Michael Jeanson 
> Cc: Paul E. McKenney 
> Change-Id: Ib3cb5d8cb4515e6f626be33c2685fa38cb081782
> ---
> tests/common/api.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/common/api.h b/tests/common/api.h
> index 2b72ec5..b15e588 100644
> --- a/tests/common/api.h
> +++ b/tests/common/api.h
> @@ -108,7 +108,7 @@ static void spin_unlock(spinlock_t *sp)
> 
> typedef pthread_t thread_id_t;
> 
> -#define NR_THREADS 128
> +#define NR_THREADS 256
> 
> #define __THREAD_ID_MAP_EMPTY ((thread_id_t) 0)
> #define __THREAD_ID_MAP_WAITING ((thread_id_t) 1)
> --
> 2.29.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-04 Thread Mathieu Desnoyers
- On Dec 4, 2020, at 3:17 AM, Nadav Amit nadav.a...@gmail.com wrote:

> I am not very familiar with membarrier, but here are my 2 cents while trying
> to answer your questions.
> 
>> On Dec 3, 2020, at 9:26 PM, Andy Lutomirski  wrote:
>> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> mm_struct *next,
>>   * from one thread in a process to another thread in the same
>>   * process. No TLB flush required.
>>   */
>> +
>> +// XXX: why is this okay wrt membarrier?
>>  if (!was_lazy)
>>  return;
> 
> I am confused.
> 
> On one hand, it seems that membarrier_private_expedited() would issue an IPI
> to that core, as it would find that this core’s cpu_rq(cpu)->curr->mm is the
> same as the one that the membarrier applies to.

If the scheduler switches from one thread to another which both have the same mm,
it means cpu_rq(cpu)->curr->mm is invariant, even though ->curr changes. So there
is no need to issue a memory barrier or sync core for membarrier in this case,
because there is no way the IPI can be missed.

> But… (see below)
> 
> 
>> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> mm_struct *next,
>>  smp_mb();
>>  next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>>  if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
>> -next_tlb_gen)
>> +next_tlb_gen) {
>> +/*
>> + * We're reactivating an mm, and membarrier might
>> + * need to serialize.  Tell membarrier.
>> + */
>> +
>> +// XXX: I can't understand the logic in
>> +// membarrier_mm_sync_core_before_usermode().  What's
>> +// the mm check for?
>> +membarrier_mm_sync_core_before_usermode(next);
> 
> On the other hand the reason for this mm check that you mention contradicts
> my previous understanding as the git log says:
> 
> commit 2840cf02fae627860156737e83326df354ee4ec6
> Author: Mathieu Desnoyers 
> Date:   Thu Sep 19 13:37:01 2019 -0400
> 
>sched/membarrier: Call sync_core only before usermode for same mm
>
>When the prev and next task's mm change, switch_mm() provides the core
>serializing guarantees before returning to usermode. The only case
>where an explicit core serialization is needed is when the scheduler
>keeps the same mm for prev and next.

Hrm, so your point here is that if the scheduler keeps the same mm for
prev and next, it means membarrier will have observed the same rq->curr->mm,
and therefore the IPI won't be missed. I wonder if that
membarrier_mm_sync_core_before_usermode is needed at all then or if we
have just been too careful and did not consider that all the scenarios which
need to be core-sync'd are indeed taken care of ?

I see here that my prior commit message indeed discusses prev and next task's
mm, but in reality, we are comparing current->mm with rq->prev_mm. So from
a lazy TLB perspective, this probably matters, and we may still need a core sync
in some lazy TLB scenarios.

> 
>>  /*
>>   * When switching through a kernel thread, the loop in
>>   * membarrier_{private,global}_expedited() may have observed that
>>   * kernel thread and not issued an IPI. It is therefore possible to
>>   * schedule between user->kernel->user threads without passing though
>>   * switch_mm(). Membarrier requires a barrier after storing to
>> - * rq->curr, before returning to userspace, so provide them here:
>> + * rq->curr, before returning to userspace, and mmdrop() provides
>> + * this barrier.
>>   *
>> - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
>> - *   provided by mmdrop(),
>> - * - a sync_core for SYNC_CORE.
>> + * XXX: I don't think mmdrop() actually does this.  There's no
>> + * smp_mb__before/after_atomic() in there.
> 
> I presume that since x86 is the only one that needs
> membarrier_mm_sync_core_before_usermode(), nobody noticed the missing
> smp_mb__before/after_atomic(). These are anyhow a compiler barrier in x86,
> and such a barrier would take place before the return to userspace.

mmdrop already provides the memory barriers for membarrier, as I documented
within the function.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-04 Thread Mathieu Desnoyers
I. It is therefore possible to
>* schedule between user->kernel->user threads without passing though
>* switch_mm(). Membarrier requires a barrier after storing to
> -  * rq->curr, before returning to userspace, so provide them here:
> +  * rq->curr, before returning to userspace, and mmdrop() provides
> +  * this barrier.
>*
> -  * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -  *   provided by mmdrop(),
> -  * - a sync_core for SYNC_CORE.
> +  * XXX: I don't think mmdrop() actually does this.  There's no
> +  * smp_mb__before/after_atomic() in there.

I recall mmdrop providing a memory barrier. It looks like I even went through
the trouble of documenting it myself. ;-)

static inline void mmdrop(struct mm_struct *mm)
{
/*
 * The implicit full barrier implied by atomic_dec_and_test() is
 * required by the membarrier system call before returning to
 * user-space, after storing to rq->curr.
 */
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}


>*/
> - if (mm) {
> - membarrier_mm_sync_core_before_usermode(mm);

OK so here is the meat. The current code is using the (possibly incomplete)
lazy TLB state known by the scheduler to sync core, and it appears it may be
a bit heavier than what is strictly needed.

Your change instead relies on the internal knowledge of lazy TLB within x86
switch_mm_irqs_off to achieve this, which overall seems like an improvement.

I agree with Nick's comment that it should go on top of his exit_lazy_mm
patches.

Thanks,

Mathieu


> + if (mm)
>   mmdrop(mm);
> - }
> +
>   if (unlikely(prev_state == TASK_DEAD)) {
>   if (prev->sched_class->task_dead)
>   prev->sched_class->task_dead(prev);
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v3 4/4] membarrier: Execute SYNC_CORE on the calling thread

2020-12-04 Thread Mathieu Desnoyers
- On Dec 4, 2020, at 12:07 AM, Andy Lutomirski l...@kernel.org wrote:

> membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented
> as syncing the core on all sibling threads but not necessarily the
> calling thread.  This behavior is fundamentally buggy and cannot be used
> safely.  Suppose a user program has two threads.  Thread A is on CPU 0
> and thread B is on CPU 1.  Thread A modifies some text and calls
> membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).  Then thread B
> executes the modified code.  If, at any point after membarrier() decides
> which CPUs to target, thread A could be preempted and replaced by thread
> B on CPU 0.  This could even happen on exit from the membarrier()
> syscall.  If this happens, thread B will end up running on CPU 0 without
> having synced.
> 
> In principle, this could be fixed by arranging for the scheduler to
> sync_core_before_usermode() whenever switching between two threads in
> the same mm if there is any possibility of a concurrent membarrier()
> call, but this would have considerable overhead.  Instead, make
> membarrier() sync the calling CPU as well.
> 
> As an optimization, this avoids an extra smp_mb() in the default
> barrier-only mode.

^ we could also add to the commit message that it avoids doing rseq preempt
on the caller as well.

Other than that:

Reviewed-by: Mathieu Desnoyers 

Thanks!

Mathieu

> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Andy Lutomirski 
> ---
> kernel/sched/membarrier.c | 51 +--
> 1 file changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 01538b31f27e..57266ab32ef9 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -333,7 +333,8 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
>   return -EPERM;
>   }
> 
> - if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)
> + if (flags != MEMBARRIER_FLAG_SYNC_CORE &&
> + (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1))
>   return 0;
> 
>   /*
> @@ -352,8 +353,6 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
> 
>   if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
>   goto out;
> - if (cpu_id == raw_smp_processor_id())
> - goto out;
>   rcu_read_lock();
>   p = rcu_dereference(cpu_rq(cpu_id)->curr);
>   if (!p || p->mm != mm) {
> @@ -368,16 +367,6 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
>   for_each_online_cpu(cpu) {
>   struct task_struct *p;
> 
> - /*
> -  * Skipping the current CPU is OK even through we can be
> -  * migrated at any point. The current CPU, at the point
> -  * where we read raw_smp_processor_id(), is ensured to
> -  * be in program order with respect to the caller
> -  * thread. Therefore, we can skip this CPU from the
> -  * iteration.
> -  */
> - if (cpu == raw_smp_processor_id())
> - continue;
>   p = rcu_dereference(cpu_rq(cpu)->curr);
>   if (p && p->mm == mm)
>   __cpumask_set_cpu(cpu, tmpmask);
> @@ -385,12 +374,38 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
>   rcu_read_unlock();
>   }
> 
> - preempt_disable();
> - if (cpu_id >= 0)
> + if (cpu_id >= 0) {
> + /*
> +  * smp_call_function_single() will call ipi_func() if cpu_id
> +  * is the calling CPU.
> +  */
>   smp_call_function_single(cpu_id, ipi_func, NULL, 1);
> - else
> - smp_call_function_many(tmpmask, ipi_func, NULL, 1);
> - preempt_enable();
> + } else {
> + /*
> +  * For regular membarrier, we can save a few cycles by
> +  * skipping the current cpu -- we're about to do smp_mb()
> +  * below, and if we migrate to a different cpu, this cpu
> +  * and the new cpu will execute a full barrier in the
> +  * scheduler.
> +  *
> +  * For CORE_SYNC, we do need a barrier on the current cpu --
> +  * otherwise, if we are migrated and replaced by a different
> +  * task in the same mm just b

Re: [PATCH v2 3/4] membarrier: Explicitly sync remote cores when SYNC_CORE is requested

2020-12-02 Thread Mathieu Desnoyers
- On Dec 2, 2020, at 10:35 AM, Andy Lutomirski l...@kernel.org wrote:

> membarrier() does not explicitly sync_core() remote CPUs; instead, it
> relies on the assumption that an IPI will result in a core sync.  On
> x86, I think this may be true in practice, but it's not architecturally
> reliable.  In particular, the SDM and APM do not appear to guarantee
> that interrupt delivery is serializing.  While IRET does serialize, IPI
> return can schedule, thereby switching to another task in the same mm
> that was sleeping in a syscall.  The new task could then SYSRET back to
> usermode without ever executing IRET.
> 
> Make this more robust by explicitly calling sync_core_before_usermode()
> on remote cores.  (This also helps people who search the kernel tree for
> instances of sync_core() and sync_core_before_usermode() -- one might be
> surprised that the core membarrier code doesn't currently show up in
> such a search.)
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Andy Lutomirski 

Reviewed-by: Mathieu Desnoyers 

> ---
> kernel/sched/membarrier.c | 18 ++
> 1 file changed, 18 insertions(+)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 6251d3d12abe..01538b31f27e 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -166,6 +166,23 @@ static void ipi_mb(void *info)
>   smp_mb();   /* IPIs should be serializing but paranoid. */
> }
> 
> +static void ipi_sync_core(void *info)
> +{
> + /*
> +  * The smp_mb() in membarrier after all the IPIs is supposed to
> +  * ensure that memory on remote CPUs that occur before the IPI
> +  * become visible to membarrier()'s caller -- see scenario B in
> +  * the big comment at the top of this file.
> +  *
> +  * A sync_core() would provide this guarantee, but
> +  * sync_core_before_usermode() might end up being deferred until
> +  * after membarrier()'s smp_mb().
> +  */
> + smp_mb();   /* IPIs should be serializing but paranoid. */
> +
> + sync_core_before_usermode();
> +}
> +
> static void ipi_rseq(void *info)
> {
>   /*
> @@ -301,6 +318,7 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
>   if (!(atomic_read(&mm->membarrier_state) &
> MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>   return -EPERM;
> +     ipi_func = ipi_sync_core;
>   } else if (flags == MEMBARRIER_FLAG_RSEQ) {
>   if (!IS_ENABLED(CONFIG_RSEQ))
>   return -EINVAL;
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v2 2/4] membarrier: Add an actual barrier before rseq_preempt()

2020-12-02 Thread Mathieu Desnoyers
- On Dec 2, 2020, at 10:35 AM, Andy Lutomirski l...@kernel.org wrote:

> It seems to me that most RSEQ membarrier users will expect any
> stores done before the membarrier() syscall to be visible to the
> target task(s).  While this is extremely likely to be true in
> practice, nothing actually guarantees it by a strict reading of the
> x86 manuals.  Rather than providing this guarantee by accident and
> potentially causing a problem down the road, just add an explicit
> barrier.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Andy Lutomirski 

Reviewed-by: Mathieu Desnoyers 

> ---
> kernel/sched/membarrier.c | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 5a40b3828ff2..6251d3d12abe 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -168,6 +168,14 @@ static void ipi_mb(void *info)
> 
> static void ipi_rseq(void *info)
> {
> + /*
> +  * Ensure that all stores done by the calling thread are visible
> +  * to the current task before the current task resumes.  We could
> +  * probably optimize this away on most architectures, but by the
> +  * time we've already sent an IPI, the cost of the extra smp_mb()
> +  * is negligible.
> +  */
> + smp_mb();
>   rseq_preempt(current);
> }
> 
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v2 4/4] membarrier: Execute SYNC_CORE on the calling thread

2020-12-02 Thread Mathieu Desnoyers
   *
> +  * For CORE_SYNC, we do need a barrier on the current cpu --
> +  * otherwise, if we are migrated and replaced by a different
> +  * task in the same mm just before, during, or after
> +  * membarrier, we will end up with some thread in the mm
> +  * running without a core sync.
> +  *
> +  * For RSEQ, it seems polite to target the calling thread
> +  * as well, although it's not clear it makes much difference
> +  * either way.  Users aren't supposed to run syscalls in an
> +  * rseq critical section.

Considering that we want a consistent behavior between single and multi-threaded
programs (as I pointed out above wrt the optimization change), I think it would
be better to skip self for the rseq ipi, in the same way we'd want to return
early for a membarrier-rseq-private on a single-threaded mm. Users are _really_
not supposed to run system calls in rseq critical sections. The kernel even kills
the offending applications when run on kernels with CONFIG_DEBUG_RSEQ=y. So it seems
rather pointless to waste cycles doing a rseq fence on self considering this.

Thanks,

Mathieu

> +  */
> + if (ipi_func == ipi_mb) {
> + preempt_disable();
> + smp_call_function_many(tmpmask, ipi_func, NULL, true);
> + preempt_enable();
> + } else {
> + on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
> + }
> + }
> 
> out:
>   if (cpu_id < 0)
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

2020-12-01 Thread Mathieu Desnoyers
- On Dec 1, 2020, at 1:48 PM, Andy Lutomirski l...@kernel.org wrote:

> On Tue, Dec 1, 2020 at 10:29 AM Mathieu Desnoyers
>  wrote:
>>
>> - On Dec 1, 2020, at 1:12 PM, Andy Lutomirski l...@kernel.org wrote:
>>
>> > On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
>> >  wrote:
>> >>
>> >> - On Dec 1, 2020, at 5:16 AM, Peter Zijlstra pet...@infradead.org 
>> >> wrote:
>> >>
>> >> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> >> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> >> >> other CPUs, but there are two issues.
>> >> >>
>> >> >>  - membarrier() does not sync_core() or rseq_preempt() the calling
>> >> >>CPU.  Aside from the logic being mind-bending, this also means
>> >> >>that it may not be safe to modify user code through an alias,
>> >> >>call membarrier(), and then jump to a different executable alias
>> >> >>of the same code.
>> >> >
>> >> > I always understood this to be on purpose. The calling CPU can fix up
>> >> > itself just fine. The pain point is fixing up the other CPUs, and that's
>> >> > where membarrier() helps.
>> >>
>> >> Indeed, as documented in the man page:
>> >>
>> >>MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>> >>   In  addition  to  providing  the  memory ordering 
>> >> guarantees de‐
>> >>   scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  
>> >> from
>> >>   system call the calling thread has a guarantee that all its 
>> >> run‐
>> >>   ning thread siblings have executed a core  serializing  
>> >> instruc‐
>> >>   tion.   This  guarantee is provided only for threads in the 
>> >> same
>> >>   process as the calling thread.
>> >>
>> >> membarrier sync core guarantees a core serializing instruction on the 
>> >> siblings,
>> >> not on the caller thread. This has been done on purpose given that the 
>> >> caller
>> >> thread can always issue its core serializing instruction from user-space 
>> >> on
>> >> its own.
>> >>
>> >> >
>> >> > That said, I don't mind including self, these aren't fast calls by any
>> >> > means.
>> >>
>> >> I don't mind including self either, but this would require documentation
>> >> updates, including man pages, to state that starting from kernel Y this
>> >> is the guaranteed behavior. It's then tricky for user-space to query what
>> >> the behavior is unless we introduce a new membarrier command for it. So 
>> >> this
>> >> could introduce issues if software written for the newer kernels runs on 
>> >> older
>> >> kernels.
>> >
>> > For rseq at least, if we do this now we don't have this issue -- I
>> > don't think any released kernel has the rseq mode.
>>
>> But for rseq, there is no core-sync. And considering that it is invalid
>> to issue a system call within an rseq critical section (including 
>> membarrier),
>> I don't see what we gain by doing a rseq barrier on self ?
>>
>> The only case where it really changes the semantic is for core-sync I think.
>> And in this case, it would be adding an additional core-sync on self. I
>> am OK with doing that considering that it will simplify use of the system
>> call. I'm just wondering how we should document this change in the man page.
>>
>> >
>> >>
>> >> >
>> >> >>  - membarrier() does not explicitly sync_core() remote CPUs either;
>> >> >>instead, it relies on the assumption that an IPI will result in a
>> >> >>core sync.  On x86, I think this may be true in practice, but
>> >> >>it's not architecturally reliable.  In particular, the SDM and
>> >> >>APM do not appear to guarantee that interrupt delivery is
>> >> >>serializing.
>> >> >
>> >> > Right, I don't think we rely on that, we do rely on interrupt delivery
>> >> > providing order though -- as per the previous email.
>> >> >
>> >> >>On

Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

2020-12-01 Thread Mathieu Desnoyers
- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski l...@kernel.org wrote:

> On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
>  wrote:
>>
>> - On Dec 1, 2020, at 5:16 AM, Peter Zijlstra pet...@infradead.org wrote:
>>
>> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> >> other CPUs, but there are two issues.
>> >>
>> >>  - membarrier() does not sync_core() or rseq_preempt() the calling
>> >>CPU.  Aside from the logic being mind-bending, this also means
>> >>that it may not be safe to modify user code through an alias,
>> >>call membarrier(), and then jump to a different executable alias
>> >>of the same code.
>> >
>> > I always understood this to be on purpose. The calling CPU can fix up
>> > itself just fine. The pain point is fixing up the other CPUs, and that's
>> > where membarrier() helps.
>>
>> Indeed, as documented in the man page:
>>
>>MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>>   In  addition  to  providing  the  memory ordering guarantees 
>> de‐
>>   scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  
>> from
>>   system call the calling thread has a guarantee that all its 
>> run‐
>>   ning thread siblings have executed a core  serializing  
>> instruc‐
>>   tion.   This  guarantee is provided only for threads in the 
>> same
>>   process as the calling thread.
>>
>> membarrier sync core guarantees a core serializing instruction on the 
>> siblings,
>> not on the caller thread. This has been done on purpose given that the caller
>> thread can always issue its core serializing instruction from user-space on
>> its own.
>>
>> >
>> > That said, I don't mind including self, these aren't fast calls by any
>> > means.
>>
>> I don't mind including self either, but this would require documentation
>> updates, including man pages, to state that starting from kernel Y this
>> is the guaranteed behavior. It's then tricky for user-space to query what
>> the behavior is unless we introduce a new membarrier command for it. So this
>> could introduce issues if software written for the newer kernels runs on 
>> older
>> kernels.
> 
> For rseq at least, if we do this now we don't have this issue -- I
> don't think any released kernel has the rseq mode.

But for rseq, there is no core-sync. And considering that it is invalid
to issue a system call within an rseq critical section (including membarrier),
I don't see what we gain by doing a rseq barrier on self ?

The only case where it really changes the semantic is for core-sync I think.
And in this case, it would be adding an additional core-sync on self. I
am OK with doing that considering that it will simplify use of the system
call. I'm just wondering how we should document this change in the man page.

> 
>>
>> >
>> >>  - membarrier() does not explicitly sync_core() remote CPUs either;
>> >>instead, it relies on the assumption that an IPI will result in a
>> >>core sync.  On x86, I think this may be true in practice, but
>> >>it's not architecturally reliable.  In particular, the SDM and
>> >>APM do not appear to guarantee that interrupt delivery is
>> >>serializing.
>> >
>> > Right, I don't think we rely on that, we do rely on interrupt delivery
>> > providing order though -- as per the previous email.
>> >
>> >>On a preemptible kernel, IPI return can schedule,
>> >>thereby switching to another task in the same mm that was
>> >>sleeping in a syscall.  The new task could then SYSRET back to
>> >>usermode without ever executing IRET.
>> >
>> > This; I think we all overlooked this scenario.
>>
>> Indeed, this is an issue which needs to be fixed.
>>
>> >
>> >> This patch simplifies the code to treat the calling CPU just like
>> >> all other CPUs, and explicitly sync_core() on all target CPUs.  This
>> >> eliminates the need for the smp_mb() at the end of the function
>> >> except in the special case of a targeted remote membarrier().  This
>> >> patch updates that code and the comments accordingly.
>>
>> I am not confident that removing the smp_mb at the end of membarrier is
>> an 

Re: [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization

2020-12-01 Thread Mathieu Desnoyers
- On Nov 30, 2020, at 12:50 PM, Andy Lutomirski l...@kernel.org wrote:
[...]
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 11666ba19b62..dabe683ab076 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -474,8 +474,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
>   /*
>* The membarrier system call requires a full memory barrier and
>* core serialization before returning to user-space, after
> -  * storing to rq->curr. Writing to CR3 provides that full
> -  * memory barrier and core serializing instruction.
> +  * storing to rq->curr, when changing mm.  This is because membarrier()
> +  * sends IPIs to all CPUs that are in the target mm, but another
> +  * CPU switch to the target mm in the mean time.

The sentence "This is because membarrier() sends IPIs to all CPUs that are in
the target mm, but another CPU switch to the target mm in the mean time." seems
rather unclear. Could be clarified with e.g.:

"This is because membarrier() sends IPIs to all CPUs that are in the target mm
to make them issue memory barriers. However, if another CPU switches to/from the
target mm concurrently with membarrier(), it can cause that CPU not to receive
the IPI when it really should issue a memory barrier."
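
To make the scenario concrete, here is a rough sketch of the interleaving the
comment is describing (an assumed illustration, not kernel source):

/*
 *   CPU0: membarrier() caller           CPU1: scheduler switching to the mm
 *   -------------------------           -----------------------------------
 *   store to X
 *   smp_mb()                            rq->curr = next  (next->mm == mm)
 *   read cpu_rq(1)->curr
 *     -> may still observe the old
 *        task, so no IPI is sent
 *                                       switch_mm_irqs_off(): write CR3
 *                                         -> full barrier + core serialization
 *                                       return to userspace in `next`
 *
 * Since CPU1 can be missed by the IPI pass, the mm-switch path itself must
 * provide the barrier/serialization after the rq->curr store and before
 * returning to userspace; on x86 the CR3 write is what provides it.
 */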

For the rest of this patch:

Reviewed-by: Mathieu Desnoyers 

Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt()

2020-12-01 Thread Mathieu Desnoyers
- On Dec 1, 2020, at 5:06 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Mon, Nov 30, 2020 at 09:50:34AM -0800, Andy Lutomirski wrote:
>> It seems to be that most RSEQ membarrier users will expect any
>> stores done before the membarrier() syscall to be visible to the
>> target task(s).  While this is extremely likely to be true in
>> practice, nothing actually guarantees it by a strict reading of the
>> x86 manuals.  Rather than providing this guarantee by accident and
>> potentially causing a problem down the road, just add an explicit
>> barrier.
> 
> A very long time ago; when Jens introduced smp_call_function(), we had
> this discussion. At the time Linus said that receiving an interrupt had
> better be ordering, and if it is not, then it's up to the architecture
> to handle that before it gets into the common code.
> 
>  
> https://lkml.kernel.org/r/alpine.LFD.2.00.0902180744520.21686@localhost.localdomain
> 
> Maybe we want to revisit this now, but there might be a fair amount of
> code relying on all this by now.
> 
> Documenting it better might help.

Considering that we already have this in membarrier ipi_mb :

static void ipi_mb(void *info)
{
smp_mb();   /* IPIs should be serializing but paranoid. */
}

I think it makes sense to add this same smp_mb() in the ipi_rseq if the expected
behavior is to order memory accesses as well, and have the same level of
paranoia as the ipi_mb.

Thanks,

Mathieu


> 
>> Signed-off-by: Andy Lutomirski 
>> ---
>>  kernel/sched/membarrier.c | 8 
>>  1 file changed, 8 insertions(+)
>> 
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index e23e74d52db5..7d98ef5d3bcd 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -40,6 +40,14 @@ static void ipi_mb(void *info)
>>  
>>  static void ipi_rseq(void *info)
>>  {
>> +/*
>> + * Ensure that all stores done by the calling thread are visible
>> + * to the current task before the current task resumes.  We could
>> + * probably optimize this away on most architectures, but by the
>> + * time we've already sent an IPI, the cost of the extra smp_mb()
>> + * is negligible.
>> + */
>> +smp_mb();
>>  rseq_preempt(current);
>>  }
> 
> So I think this really isn't right.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

2020-12-01 Thread Mathieu Desnoyers
- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> other CPUs, but there are two issues.
>> 
>>  - membarrier() does not sync_core() or rseq_preempt() the calling
>>CPU.  Aside from the logic being mind-bending, this also means
>>that it may not be safe to modify user code through an alias,
>>call membarrier(), and then jump to a different executable alias
>>of the same code.
> 
> I always understood this to be on purpose. The calling CPU can fix up
> itself just fine. The pain point is fixing up the other CPUs, and that's
> where membarrier() helps.

Indeed, as documented in the man page:

   MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
  In  addition  to  providing  the  memory ordering guarantees de‐
  scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED,  upon  return  from
  system call the calling thread has a guarantee that all its run‐
  ning thread siblings have executed a core  serializing  instruc‐
  tion.   This  guarantee is provided only for threads in the same
  process as the calling thread.

membarrier sync core guarantees a core serializing instruction on the siblings,
not on the caller thread. This has been done on purpose given that the caller
thread can always issue its core serializing instruction from user-space on
its own.

> 
> That said, I don't mind including self, these aren't fast calls by any
> means.

I don't mind including self either, but this would require documentation
updates, including man pages, to state that starting from kernel Y this
is the guaranteed behavior. It's then tricky for user-space to query what
the behavior is unless we introduce a new membarrier command for it. So this
could introduce issues if software written for the newer kernels runs on older
kernels.

> 
>>  - membarrier() does not explicitly sync_core() remote CPUs either;
>>instead, it relies on the assumption that an IPI will result in a
>>core sync.  On x86, I think this may be true in practice, but
>>it's not architecturally reliable.  In particular, the SDM and
>>APM do not appear to guarantee that interrupt delivery is
>>serializing.
> 
> Right, I don't think we rely on that, we do rely on interrupt delivery
> providing order though -- as per the previous email.
> 
>>On a preemptible kernel, IPI return can schedule,
>>thereby switching to another task in the same mm that was
>>sleeping in a syscall.  The new task could then SYSRET back to
>>usermode without ever executing IRET.
> 
> This; I think we all overlooked this scenario.

Indeed, this is an issue which needs to be fixed.

> 
>> This patch simplifies the code to treat the calling CPU just like
>> all other CPUs, and explicitly sync_core() on all target CPUs.  This
>> eliminates the need for the smp_mb() at the end of the function
>> except in the special case of a targeted remote membarrier().  This
>> patch updates that code and the comments accordingly.

I am not confident that removing the smp_mb at the end of membarrier is
an appropriate change, nor that it simplifies the model.

This changes things from a model where we have a barrier at the beginning
and end of the membarrier system call, which nicely orders things happening
before/after the system call with respect to anything that is observed within
the system call (including the scheduler activity updating the runqueue's
current task), to a model where the memory barrier for the current thread
will be conditionally executed after we have sent the IPIs, and unconditionally
when issuing smp_call_function* on self.
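
To illustrate the model being defended, a sketch (not the kernel source) of the
"barrier at the beginning and end" shape:

static int membarrier_private_expedited_model(int flags, int cpu_id)
{
        smp_mb();       /* order caller's prior accesses vs. the rq->curr reads below */

        /*
         * ... inspect cpu_rq(cpu)->curr for each cpu, IPI the CPUs currently
         * running this mm; each IPI handler issues its own smp_mb() ...
         */

        smp_mb();       /* order IPI completion vs. the caller's accesses after return */
        return 0;
}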

About the documentation of the membarrier scenario, I think it is redundant
with a documentation patch I already have sitting in -tip (scenario A):

https://git.kernel.org/tip/25595eb6aaa9fbb31330f1e0b400642694bc6574

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-11-30 Thread Mathieu Desnoyers
- On Nov 28, 2020, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote:

> And get rid of the generic sync_core_before_usermode facility. This is
> functionally a no-op in the core scheduler code, but it also catches

This sentence is incomplete.

> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.

Ideally yes this complexity should sit within the x86 architecture code
if only that architecture requires it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] Some confusion about cpu usage of the lttng-consumerd process

2020-11-30 Thread Mathieu Desnoyers via lttng-dev
- On Nov 28, 2020, at 1:49 AM, 熊毓华 xiongyu...@zju.edu.cn wrote:

> Hi,

> I put my test scripts in the attachment.

> You can just run the script directly, create the trace session with the 
> "--live"
> option on the 8core server,

> then you will find the cpu usage of the lttng-consumerd process reached 10% or
> more.

> About the streaming mode of lttng,I did the test before, it worked well.

> When I create the trace session with "lttng create my-session
> --output=/tmp/my-kernel-trace", or with "lttng create my-session
> --set-url=net://ip",

> the number of CPU seems not affect the cpu usage with lttng-consumerd.

> It seems that only live-mode will be affected.

The overhead of consumer daemon in live mode will increase with the number
of cpus on the system. This is expected.

There is one knob you can try to configure to adapt the amount of overhead
caused by the live timer in the consumer daemon: lttng create --live=NNN
where NNN is the live timer period in microseconds. The default is 1000000us
(1 second). Try changing this value to something larger and you will
probably notice that it lessens the consumer daemon overhead.

Shipping trace data with relatively low latency so it can be immediately
read adds overhead, and it increases with the number of cpus in the system
because there are then more per-cpu buffers to flush.

Note that your instrumentation also targets unrelated kernel and user-space
execution, so whatever is executed in the background also gets traced. And
this generate trace data traffic on each cpu. Having even just a bit of
trace data to send out in live mode can very much explain the overhead you
observe with the lttng consumer daemon.

If you care that much about overhead of the consumer daemon, don't use live
mode, and use the alternatives we discussed earlier instead, such as the
session rotation mode, if you need to consume the trace data while it is
produced.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] Some confusion about cpu usage of the lttng-consumerd process

2020-11-27 Thread Mathieu Desnoyers via lttng-dev
- On Nov 27, 2020, at 11:04 AM, lttng-dev lttng-dev@lists.lttng.org wrote:

> Well we also want to know why! You will understand that albeit we develop 
> lttng
> we do not always have a quick and easy answer to all problems. Performance
> related problem are always tricky.
> And we also have to keep in mind that we do not necessarily optimize for 
> low-cpu
> usage on the lttng-consumerd side.

That being said, we did optimize for low-cpu usage of lttng-consumerd for
use-cases streaming to disk or to the network. However, the "live" mode was
originally created for use-cases where only a few events per second would be
emitted, and no such requirements were placed on performance. We can see today
that its use has grown much beyond a few events per second, but in those
use-cases the live mode may not be the appropriate tool for the job. We have
introduced the "session rotation" feature as a more efficient alternative to
the live mode.

> We have to take a look at what "work" scale with the number of CPU on the
> lttng-consumerd side. One such thing is the live timer which is fired on an
> interval (default is 1s (1000000us)).

> You could test this hypothesis by streaming the trace instead of using the 
> live
> feature.

> lttng create --set-url 

Yes, I agree with Jonathan's recommendation: you should compare this cpu usage
with that of the streaming mode of lttng by *not* using the "--live" option when
creating the trace session. It will at least help identify whether consumerd
also exhibits this cpu usage increase with number of cores in streaming mode,
or if it is an expected additional overhead of periodically flushing more cpu
buffers (because there are more cores) caused by the live timer.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] rseq/selftests: Fix MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ build error under other arch.

2020-11-25 Thread Mathieu Desnoyers
- On Nov 24, 2020, at 11:04 PM, Xingxing Su suxingx...@loongson.cn wrote:

> Except arch x86, the function rseq_offset_deref_addv is not defined.
> The function test_membarrier_manager_thread call rseq_offset_deref_addv
> produces a build error.
> 
> The RSEQ_ARCH_HAS_OFFSET_DEREF_ADD should contain all the code
> for the MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
> If the other Arch implements this feature,
> defined RSEQ_ARCH_HAS_OFFSET_DEREF_ADD in the header file
> to ensure that this feature is available.
> 
> Following build errors:
> 
> param_test.c: In function ‘test_membarrier_worker_thread’:
> param_test.c:1164:10: warning: implicit declaration of function
> ‘rseq_offset_deref_addv’
>ret = rseq_offset_deref_addv(&args->percpu_list_ptr,
>  ^~
> /tmp/ccMj9yHJ.o: In function `test_membarrier_worker_thread':
> param_test.c:1164: undefined reference to `rseq_offset_deref_addv'
> param_test.c:1164: undefined reference to `rseq_offset_deref_addv'
> collect2: error: ld returned 1 exit status
> make: *** [/selftests/rseq/param_test_benchmark] Error 1
> 
> Signed-off-by: Xingxing Su 

Acked-by: Mathieu Desnoyers 

Shuah, can you pick up this fix please ?

Thanks,

Mathieu

> ---
> tools/testing/selftests/rseq/param_test.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/rseq/param_test.c
> b/tools/testing/selftests/rseq/param_test.c
> index 3845890..699ad5f 100644
> --- a/tools/testing/selftests/rseq/param_test.c
> +++ b/tools/testing/selftests/rseq/param_test.c
> @@ -1133,6 +1133,8 @@ static int set_signal_handler(void)
>   return ret;
> }
> 
> +/* Test MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU membarrier command. */
> +#ifdef RSEQ_ARCH_HAS_OFFSET_DEREF_ADDV
> struct test_membarrier_thread_args {
>   int stop;
>   intptr_t percpu_list_ptr;
> @@ -1286,8 +1288,6 @@ void *test_membarrier_manager_thread(void *arg)
>   return NULL;
> }
> 
> -/* Test MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU membarrier command. */
> -#ifdef RSEQ_ARCH_HAS_OFFSET_DEREF_ADDV
> void test_membarrier(void)
> {
>   const int num_threads = opt_threads;
> --
> 1.8.3.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] regmap: remove duplicate `type` field from `regcache_sync` trace event

2020-11-23 Thread Mathieu Desnoyers
Hi Philippe,

Some additional feedback:

The patch title could be changed to:

"tracing: Remove duplicate `type` field from regmap `regcache_sync` trace event"

to clarify that this belongs to tracing.

- On Nov 23, 2020, at 11:15 AM, Philippe Duplessis-Guindon 
pduples...@efficios.com wrote:

> I had an error saying that `regcache_sync` had 2 fields named `type` while 
> using
> libtraceevent. This was the format of this event:

Please use the present rather than past when writing a commit message.

> 
>   $ sudo cat /sys/kernel/debug/tracing/events/regmap/regcache_sync/format
>   name: regcache_sync
>   ID: 1216
>   format:
>   field:unsigned short common_type;   offset:0;   size:2; 
> signed:0;
>   field:unsigned char common_flags;   offset:2;   size:1; 
> signed:0;
>   field:unsigned char common_preempt_count;   offset:3;   
> size:1; signed:0;
>   field:int common_pid;   offset:4;   size:4; signed:1;
> 
>   field:__data_loc char[] name;   offset:8;   size:4; 
> signed:1;
>   field:__data_loc char[] status; offset:12;  size:4; 
> signed:1;
>   field:__data_loc char[] type;   offset:16;  size:4; 
> signed:1;
>   field:int type;offset:20;size:4;signed:1;
> 
>   print fmt: "%s type=%s status=%s", __get_str(name), __get_str(type),
>   __get_str(status)
> 
> Erase the `int field` type, who was not assigned. This field was introduce by

who -> which

was introduce -> was introduced

Ideally you should use git blame to identify which commit introduced this issue.
e.g.:

git blame drivers/base/regmap/trace.h
[...]
593600890110c include/trace/events/regmap.h (Dimitris Papastamos 2011-09-19 
14:34:04 +0100 134) __assign_str(status, status);
593600890110c include/trace/events/regmap.h (Dimitris Papastamos 2011-09-19 
14:34:04 +0100 135) __assign_str(type, type);

commit 593600890110c02eb471cf844649dee213870416
Author: Dimitris Papastamos 
Date:   Mon Sep 19 14:34:04 2011 +0100

regmap: Add the regcache_sync trace event

This information can be added to the commit message as:

Fixes commit 593600890110c ("regmap: Add the regcache_sync trace event")

Thanks,

Mathieu

> mistake and this commit removes it.
> 
> This is the format of this event after the fix:
> 
>   $ sudo cat /sys/kernel/debug/tracing/events/regmap/regcache_sync/format
>   name: regcache_sync
>   ID: 1266
>   format:
>   field:unsigned short common_type;   offset:0;   size:2; 
> signed:0;
>   field:unsigned char common_flags;   offset:2;   size:1; 
> signed:0;
>   field:unsigned char common_preempt_count;   offset:3;   
> size:1; signed:0;
>   field:int common_pid;   offset:4;   size:4; signed:1;
> 
>   field:__data_loc char[] name;   offset:8;   size:4; 
> signed:1;
>   field:__data_loc char[] status; offset:12;  size:4; 
> signed:1;
>   field:__data_loc char[] type;   offset:16;  size:4; 
> signed:1;
> 
>   print fmt: "%s type=%s status=%s", __get_str(name), __get_str(type),
>   __get_str(status)
> 
> Signed-off-by: Philippe Duplessis-Guindon 
> ---
> drivers/base/regmap/trace.h | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/base/regmap/trace.h b/drivers/base/regmap/trace.h
> index d4066fa079ab..9abee14df9ee 100644
> --- a/drivers/base/regmap/trace.h
> +++ b/drivers/base/regmap/trace.h
> @@ -126,7 +126,6 @@ TRACE_EVENT(regcache_sync,
>   __string(   name,   regmap_name(map))
>   __string(   status,     status  )
>   __string(   type,   type)
> - __field(int,type)
>   ),
> 
>   TP_fast_assign(
> --
> 2.25.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: violating function pointer signature

2020-11-18 Thread Mathieu Desnoyers
- On Nov 18, 2020, at 9:02 AM, rostedt rost...@goodmis.org wrote:

> On Wed, 18 Nov 2020 14:21:36 +0100
> Peter Zijlstra  wrote:
> 
>> I think that as long as the function is completely empty (it never
>> touches any of the arguments) this should work in practise.
>> 
>> That is:
>> 
>>   void tp_nop_func(void) { }
> 
> My original version (the OP of this thread) had this:
> 
> +static void tp_stub_func(void)
> +{
> + return;
> +}
> 
>> 
>> can be used as an argument to any function pointer that has a void
>> return. In fact, I already do that, grep for __static_call_nop().
>> 
>> I'm not sure what the LLVM-CFI crud makes of it, but that's their
>> problem.
> 
> If it is already done elsewhere in the kernel, then I will call this
> precedence, and keep the original version.

It works for me. Bonus points if you can document in a comment that this
trick depends on the cdecl calling convention.
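
Something along these lines, as a sketch of the comment being asked for (using
the tp_stub_func name from earlier in this thread):

/*
 * Stub installed in place of a removed probe when allocating a new callback
 * array fails.  It is called through tracepoint-specific function-pointer
 * casts that pass more arguments than it declares; this relies on a
 * caller-cleans-stack (cdecl-style) calling convention, where the callee is
 * free to ignore whatever was passed to it.
 */
static void tp_stub_func(void)
{
        return;
}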

Thanks,

Mathieu

> 
> This way Alexei can't complain about adding a check in the fast path of
> more than one callback attached.
> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 5:19 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 13:33:42 -0800
> Kees Cook  wrote:
> 
>> As I think got discussed in the thread, what you had here wouldn't work
>> in a CFI build if the function prototype of the call site and the
>> function don't match. (Though I can't tell if .func() is ever called?)
>> 
>> i.e. .func's prototype must match tp_stub_func()'s.
>> 
> 
> 
> Hmm, I wonder how you handle tracepoints? This is called here:
> 
> include/linux/tracepoint.h:
> 
> 
> #define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args) \
>   static const char __tpstrtab_##_name[]  \
>   __section("__tracepoints_strings") = #_name;\
>   extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
>   int __traceiter_##_name(void *__data, proto);   \
>   struct tracepoint __tracepoint_##_name  __used  \
>   __section("__tracepoints") = {  \
>   .name = __tpstrtab_##_name, \
>   .key = STATIC_KEY_INIT_FALSE,   \
>   .static_call_key = &STATIC_CALL_KEY(tp_func_##_name),   \
>   .static_call_tramp = STATIC_CALL_TRAMP_ADDR(tp_func_##_name), \
>   .iterator = &__traceiter_##_name,   \
>   .regfunc = _reg,\
>   .unregfunc = _unreg,\
>   .funcs = NULL };\
>   __TRACEPOINT_ENTRY(_name);  \
>   int __traceiter_##_name(void *__data, proto)\
>   {   \
>   struct tracepoint_func *it_func_ptr;\
>   void *it_func;  \
>   \
>   it_func_ptr =   \
>   rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
>   do {\
>   it_func = (it_func_ptr)->func;  \
>   __data = (it_func_ptr)->data;   \
> 
>   ((void(*)(void *, proto))(it_func))(__data, args); \
> 
>    called above 
> 
> Where args is unique for every tracepoint, but func is simply a void
> pointer.

That being said, the called functions have a prototype which matches the
caller prototype exactly. So within the tracepoint internal data structures,
this function pointer is indeed a void pointer, but it is cast to a prototype
matching the callees to perform the calls. I suspect that as long as CFI checks
that caller and callee prototypes are compatible at runtime when the actual
calls happen, this all works fine.
Thanks,

Mathieu

> 
> -- Steve
> 
> 
>   } while ((++it_func_ptr)->func);\
>   return 0;   \
>   }   \

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 5:16 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 16:22:23 -0500 (EST)
> Mathieu Desnoyers  wrote:
> 
>> If we don't call the stub, then there is no point in having the stub at
>> all, and we should just compare to a constant value, e.g. 0x1UL. As far
>> as I can recall, comparing with a small immediate constant is more efficient
>> than comparing with a loaded value on many architectures.
> 
> Why 0x1UL, and not just set it to NULL.
> 
>   do {\
>   it_func = (it_func_ptr)->func;  \
>   __data = (it_func_ptr)->data;   \
>   if (likely(it_func))\
>   ((void(*)(void *, proto))(it_func))(__data, 
> args); \
>   } while ((++it_func_ptr)->func);

Because of this end-of-loop condition ^
which is also testing for a NULL func. So if we reach a stub, we end up stopping
iteration and not firing the following tracepoint probes.
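
A minimal sketch of the resulting iteration, assuming the tp_stub_func approach
discussed in this thread: a removed entry still holds a callable non-NULL
pointer, so the termination test keeps walking the array.

        do {
                it_func = (it_func_ptr)->func;  /* a real probe or tp_stub_func */
                __data = (it_func_ptr)->data;
                ((void(*)(void *, proto))(it_func))(__data, args);
        } while ((++it_func_ptr)->func);        /* only the terminating NULL stops here */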

Thanks,

Mathieu

> 
> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-17 Thread Mathieu Desnoyers
- On Nov 16, 2020, at 5:10 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
> Mathieu Desnoyers  wrote:

[...]

>> I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
>> with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
>> func pointer can be updated and loaded concurrently.
> 
> I thought about this a little, and then only thing we really should worry
> about is synchronizing with those that unregister. Because when we make
> this update, there are now two states. the __DO_TRACE either reads the
> original func or the stub. And either should be OK to call.
> 
> Only the func gets updated and not the data. So what exactly are we worried
> about here?

Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.

However, if we want to compare the function pointer to some other value and
conditionally do (or skip) the call, I think you'll need the READ_ONCE/WRITE_ONCE
to make sure the pointer is not re-fetched between comparison and call.
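
A small sketch of both sides, assuming the tp_stub_func sentinel and a
conditional call (not the final patch):

        /* unregister path: mark the entry as removed */
        WRITE_ONCE(old[i].func, tp_stub_func);

        /* __DO_TRACE iteration: load the pointer exactly once */
        it_func = READ_ONCE((it_func_ptr)->func);
        __data = (it_func_ptr)->data;
        if (it_func != tp_stub_func)
                ((void(*)(void *, proto))(it_func))(__data, args);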

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 3:58 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 15:34:51 -0500
> Steven Rostedt  wrote:
[...]

> If it comes down to not trusting calling a stub, I'll still keep the stub
> logic in, and just add the following:

If we don't call the stub, then there is no point in having the stub at
all, and we should just compare to a constant value, e.g. 0x1UL. As far
as I can recall, comparing with a small immediate constant is more efficient
than comparing with a loaded value on many architectures.
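
For illustration only, with an invented sentinel name, the comparison in the
iteration loop would then be against a small immediate constant:

        #define TP_REMOVED_PROBE        ((void *)0x1UL) /* hypothetical name */

        do {
                it_func = (it_func_ptr)->func;
                __data = (it_func_ptr)->data;
                if (likely(it_func != TP_REMOVED_PROBE))
                        ((void(*)(void *, proto))(it_func))(__data, args);
        } while ((++it_func_ptr)->func);        /* NULL still terminates the array */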

Thanks,

Mathieu
 
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 3:34 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 14:47:20 -0500 (EST)
> Mathieu Desnoyers  wrote:
> 
>> There seems to be more effect on the data size: adding the "stub_func" field
>> in struct tracepoint adds 8320 bytes of data to my vmlinux. But considering
>> the layout of struct tracepoint:
>> 
>> struct tracepoint {
>> const char *name;   /* Tracepoint name */
>> struct static_key key;
>> struct static_call_key *static_call_key;
>> void *static_call_tramp;
>> void *iterator;
>> int (*regfunc)(void);
>> void (*unregfunc)(void);
>> struct tracepoint_func __rcu *funcs;
>> void *stub_func;
>> };
>> 
>> I would argue that we have many other things to optimize there if we want to
>> shrink the bloat, starting with static keys and system call reg/unregfunc
>> pointers.
> 
> This is the part that I want to decrease, and yes there's other fish to fry
> in that code, but I really don't want to be adding more.

I agree on the goal of not bloating the code and data size of tracepoints, but
I also don't want to introduce subtle hard-to-debug undefined behaviors.

> 
>> 
>> > 
>> > Since all tracepoints callbacks have at least one parameter (__data), we
>> > could declare tp_stub_func as:
>> > 
>> > static void tp_stub_func(void *data, ...)
>> > {
>> >return;
>> > }
>> > 
>> > And now C knows that tp_stub_func() can be called with one or more
>> > parameters, and had better be able to deal with it!
>> 
>> AFAIU this won't work.
>> 
>> C99 6.5.2.2 Function calls
>> 
>> "If the function is defined with a type that is not compatible with the type 
>> (of
>> the
>> expression) pointed to by the expression that denotes the called function, 
>> the
>> behavior is
>> undefined."
> 
> But is it really a problem in practice. I'm sure we could create an objtool
> function to check to make sure we don't break anything at build time.

There are also tools like UBSAN which will trigger whenever an undefined behavior
is executed. Having tools which can validate that the generated assembly happens
to work does not make it OK to generate code with undefined behavior.

> 
>> 
>> and
>> 
>> 6.7.5.3 Function declarators (including prototypes), item 15:
>> 
>> "For two function types to be compatible, both shall specify compatible 
>> return
>> types.
> 
> But all tracepoint callbacks have void return types, which means they are
> compatible.

Yes, my concern is about what follows just after:

> 
>> 
>> Moreover, the parameter type lists, if both are present, shall agree in the
>> number of
>> parameters and in use of the ellipsis terminator; corresponding parameters 
>> shall
>> have
>> compatible types. [...]"
> 
> Which is why I gave the stub function's first parameter the same type that
> all tracepoint callbacks have a prototype that starts with "void *data"
> 
> and my solution is to define:
> 
>   void tp_stub_func(void *data, ...) { return; }
> 
> Which is in line with: "corresponding parameters shall have compatible
> types". The corresponding parameter is simply "void *data".

No, you forgot about the part "[...] shall agree [...] in use of the ellipsis
terminator"

That last part about agreeing about use of the ellipsis terminator is what
makes your last solution run into undefined behavior territory. The caller
and callee don't agree on the use of ellipsis terminator: the caller does not
use it, but the callee expects it.
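
A standalone illustration of that mismatch (hypothetical names, not kernel
code):

/* callee: defined with an ellipsis after the first parameter */
static void tp_stub_func(void *data, ...)
{
        (void)data;
}

static void caller_example(void)
{
        /*
         * The caller goes through a fixed-arity prototype with no ellipsis.
         * Per C99 6.7.5.3, the two function types are not compatible, so the
         * call below is undefined behavior even if it happens to work under
         * caller-cleans-stack conventions.
         */
        void (*fp)(void *, int) = (void (*)(void *, int))tp_stub_func;

        fp((void *)0, 0);
}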

> 
>> 
>> What you suggest here is to use the ellipsis in the stub definition, but the
>> caller
>> prototype does not use the ellipsis, which brings us into undefined behavior
>> territory
>> again.
> 
> And I believe the "undefined behavior" is that you can't trust what is in
> the parameters if the callee chooses to look at them, and that is not the
> case here.

I am aware of no such definition of "undefined behavior". So you would be
very much dependent on the compiler choosing a not-so-hurtful way to deal
with this behavior then.

> But since the called function doesn't care, I highly doubt it
> will ever be an issue. I mean, the only way this can break is if the caller
> places something in the stack that it expects the callee to fix.

AFAIU an undefined behavior is something we really try to avoid in C. As I said
earlier, it seems to work in practice because the cdecl calling convention leaves
the stack cleanup to the caller. But I'm worried about subtle portability issues
that may arise due to this.

> With all the functions in assembly we have, I'm pretty confident that if a 
> compiler
> does something like this, it would break all over the place.

Fair point. Then maybe we should write the stub in assembly ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] [PATCH urcu] cleanup: Improve wording of CONFIG_RCU_DEBUG description

2020-11-17 Thread Mathieu Desnoyers via lttng-dev
Merged into master, stable-0.11, stable-0.12, thanks!

Mathieu

- On Nov 17, 2020, at 3:22 PM, Michael Jeanson mjean...@efficios.com wrote:

> Signed-off-by: Michael Jeanson 
> ---
> configure.ac | 4 ++--
> include/urcu/config.h.in | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index b62f587..73ab19f 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -28,7 +28,7 @@ AH_TEMPLATE([CONFIG_RCU_ARM_HAVE_DMB], [Use the dmb
> instruction if available for
> AH_TEMPLATE([CONFIG_RCU_TLS], [TLS provided by the compiler.])
> AH_TEMPLATE([CONFIG_RCU_HAVE_CLOCK_GETTIME], [clock_gettime() is detected.])
> AH_TEMPLATE([CONFIG_RCU_FORCE_SYS_MEMBARRIER], [Require the operating system 
> to
> support the membarrier system call for default and bulletproof flavors.])
> -AH_TEMPLATE([CONFIG_RCU_DEBUG], [Enable internal debugging self-checks.
> Introduce performance penalty.])
> +AH_TEMPLATE([CONFIG_RCU_DEBUG], [Enable internal debugging self-checks.
> Introduces a performance penalty.])
> AH_TEMPLATE([CONFIG_CDS_LFHT_ITER_DEBUG], [Enable extra debugging checks for
> lock-free hash table iterator traversal. Alters the rculfhash ABI. Make sure 
> to
> compile both library and application with matching configuration.])
> 
> # Allow requiring the operating system to support the membarrier system
> @@ -257,7 +257,7 @@ AS_IF([test "x$def_smp_support" = "xyes"],
> [AC_DEFINE([CONFIG_RCU_SMP], [1])])
> # RCU debugging option
> AC_ARG_ENABLE([rcu-debug],
>   AS_HELP_STRING([--enable-rcu-debug], [Enable internal debugging
> -   self-checks. Introduce performance penalty.]))
> +   self-checks. Introduces a performance penalty.]))
> AS_IF([test "x$enable_rcu_debug" = "xyes"], [
>AC_DEFINE([CONFIG_RCU_DEBUG], [1])
> ])
> diff --git a/include/urcu/config.h.in b/include/urcu/config.h.in
> index 8c3e4da..7350631 100644
> --- a/include/urcu/config.h.in
> +++ b/include/urcu/config.h.in
> @@ -28,7 +28,7 @@
> #undef CONFIG_RCU_FORCE_SYS_MEMBARRIER
> 
> /* Enable internal debugging self-checks.
> -   Introduce performance penalty. */
> +   Introduces a performance penalty. */
> #undef CONFIG_RCU_DEBUG
> 
> /* Expose multi-flavor support */
> --
> 2.29.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] [PATCH urcu] fix: explicitly include urcu/config.h in files using CONFIG_RCU_ defines

2020-11-17 Thread Mathieu Desnoyers via lttng-dev
Merged into master, stable-0.11, stable-0.12, thanks!

Mathieu

- On Nov 17, 2020, at 3:22 PM, Michael Jeanson mjean...@efficios.com wrote:

> Signed-off-by: Michael Jeanson 
> ---
> include/urcu/debug.h   | 2 ++
> include/urcu/uatomic/x86.h | 1 +
> include/urcu/urcu-qsbr.h   | 2 ++
> src/urcu-bp.c  | 1 +
> src/urcu-utils.h   | 2 ++
> src/urcu.c | 1 +
> 6 files changed, 9 insertions(+)
> 
> diff --git a/include/urcu/debug.h b/include/urcu/debug.h
> index 14b50b6..4a7eac9 100644
> --- a/include/urcu/debug.h
> +++ b/include/urcu/debug.h
> @@ -21,6 +21,8 @@
> 
> #include 
> 
> +#include <urcu/config.h>
> +
> #if defined(DEBUG_RCU) || defined(CONFIG_RCU_DEBUG)
> #define urcu_assert(...)  assert(__VA_ARGS__)
> #else
> diff --git a/include/urcu/uatomic/x86.h b/include/urcu/uatomic/x86.h
> index 129a2f5..2a4ea1c 100644
> --- a/include/urcu/uatomic/x86.h
> +++ b/include/urcu/uatomic/x86.h
> @@ -20,6 +20,7 @@
>  * Boehm-Demers-Weiser conservative garbage collector.
>  */
> 
> +#include <urcu/config.h>
> #include 
> #include 
> 
> diff --git a/include/urcu/urcu-qsbr.h b/include/urcu/urcu-qsbr.h
> index 28e6065..fd6cbda 100644
> --- a/include/urcu/urcu-qsbr.h
> +++ b/include/urcu/urcu-qsbr.h
> @@ -31,6 +31,8 @@
> #include 
> #include 
> 
> +#include <urcu/config.h>
> +
> /*
>  * See urcu/pointer.h and urcu/static/pointer.h for pointer
>  * publication headers.
> diff --git a/src/urcu-bp.c b/src/urcu-bp.c
> index 05efd97..c2ac792 100644
> --- a/src/urcu-bp.c
> +++ b/src/urcu-bp.c
> @@ -37,6 +37,7 @@
> #include 
> #include 
> 
> +#include <urcu/config.h>
> #include 
> #include 
> #include 
> diff --git a/src/urcu-utils.h b/src/urcu-utils.h
> index 407f42e..69e37bd 100644
> --- a/src/urcu-utils.h
> +++ b/src/urcu-utils.h
> @@ -23,6 +23,8 @@
>  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
>  */
> 
> +#include <urcu/config.h>
> +
> #define urcu_stringify(a) _urcu_stringify(a)
> #define _urcu_stringify(a) #a
> 
> diff --git a/src/urcu.c b/src/urcu.c
> index f6ca5f8..2ebe993 100644
> --- a/src/urcu.c
> +++ b/src/urcu.c
> @@ -38,6 +38,7 @@
> #include 
> #include 
> 
> +#include <urcu/config.h>
> #include 
> #include 
> #include 
> --
> 2.29.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 17, 2020, at 2:21 PM, rostedt rost...@goodmis.org wrote:

> On Tue, 17 Nov 2020 14:15:10 -0500 (EST)
> Mathieu Desnoyers  wrote:
> 
> 
>> diff --git a/include/linux/tracepoint-defs.h 
>> b/include/linux/tracepoint-defs.h
>> index e7c2276be33e..e0351bb0b140 100644
>> --- a/include/linux/tracepoint-defs.h
>> +++ b/include/linux/tracepoint-defs.h
>> @@ -38,6 +38,7 @@ struct tracepoint {
>> int (*regfunc)(void);
>> void (*unregfunc)(void);
>> struct tracepoint_func __rcu *funcs;
>> +   void *stub_func;
>>  };
>>  
>>  #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
>> index 0f21617f1a66..b0b805de3779 100644
>> --- a/include/linux/tracepoint.h
>> +++ b/include/linux/tracepoint.h
>> @@ -287,6 +287,9 @@ static inline struct tracepoint
>> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>>  #define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args)  \
>> static const char __tpstrtab_##_name[]  \
>> __section("__tracepoints_strings") = #_name;\
>> +   static void __cold __tracepoint_stub_func_##_name(void *__data, 
>> proto) \
>> +   {   \
>> +   }   \
> 
> The thing is, tracepoints are already bloated. I do not want to add
> something like this that will unnecessarily add more text.

I've measured the impact of adding these stubs to kernel/sched/core.o, and
it's entirely lost in the code alignment (it effectively adds 0 byte of text
to my build) when using the "cold" attribute.

Over an entire vmlinux, it adds 256 bytes of text overall, for a 0.008% code size increase.

Can you measure any significant code size increase with this approach ?

There seems to be more effect on the data size: adding the "stub_func" field
in struct tracepoint adds 8320 bytes of data to my vmlinux. But considering
the layout of struct tracepoint:

struct tracepoint {
const char *name;   /* Tracepoint name */
struct static_key key;
struct static_call_key *static_call_key;
void *static_call_tramp;
void *iterator;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
void *stub_func;
};

I would argue that we have many other things to optimize there if we want to
shrink the bloat, starting with static keys and system call reg/unregfunc 
pointers.

> 
> Since all tracepoints callbacks have at least one parameter (__data), we
> could declare tp_stub_func as:
> 
> static void tp_stub_func(void *data, ...)
> {
>   return;
> }
> 
> And now C knows that tp_stub_func() can be called with one or more
> parameters, and had better be able to deal with it!

AFAIU this won't work.

C99 6.5.2.2 Function calls

"If the function is defined with a type that is not compatible with the type 
(of the
expression) pointed to by the expression that denotes the called function, the 
behavior is
undefined."

and

6.7.5.3 Function declarators (including prototypes), item 15:

"For two function types to be compatible, both shall specify compatible return 
types.

Moreover, the parameter type lists, if both are present, shall agree in the 
number of
parameters and in use of the ellipsis terminator; corresponding parameters 
shall have
compatible types. [...]"

What you suggest here is to use the ellipsis in the stub definition, but the 
caller
prototype does not use the ellipsis, which brings us into undefined behavior 
territory
again.
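
For illustration only (a standalone sketch with hypothetical names, not part of
the patch), here is the prototype-compatibility point above reduced to a
minimal example:

/* Sketch: calling two stub shapes through a typed callback pointer. */
typedef void (*probe_fn_t)(void *data, int arg);

static void stub_variadic(void *data, ...) { }	/* type: void (void *, ...) */
static void stub_exact(void *data, int arg) { }	/* type: void (void *, int) */

static void call_probe(probe_fn_t fn, void *data, int arg)
{
	fn(data, arg);	/* call expression type: void (void *, int) */
}

/*
 * call_probe((probe_fn_t)stub_variadic, 0, 0): undefined behavior, the
 * called function's type uses the ellipsis while the call expression's
 * type does not (C99 6.5.2.2, 6.7.5.3 item 15).
 * call_probe(stub_exact, 0, 0): well defined, the stub's type matches
 * the callback prototype exactly -- which is what the per-tracepoint
 * stubs in the quoted diff achieve.
 */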

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-17 Thread Mathieu Desnoyers
- On Nov 16, 2020, at 5:51 PM, rostedt rost...@goodmis.org wrote:

> [ Kees, I added you because you tend to know about these things.
>  Is it OK to assign a void func(void) that doesn't do anything and returns
>  nothing to a function pointer that could be call with parameters? We need
>  to add stubs for tracepoints when we fail to allocate a new array on
>  removal of a callback, but the callbacks do have arguments, but the stub
>  called does not have arguments.

[...]

> +/* Called in removal of a func but failed to allocate a new tp_funcs */
> +static void tp_stub_func(void)
> +{
> + return;
> +}
> +

In C, the "void" unnamed function parameter specifies that the function has no
parameters. C99 section 6.7.5.3 Function declarators (including prototypes):

"The special case of an unnamed parameter of type void as the only item in the 
list
specifies that the function has no parameters."

The C99 standard section "6.5.2.2 Function calls" states:

"If the function is defined with a type that is not compatible with the type 
(of the
expression) pointed to by the expression that denotes the called function, the 
behavior is
undefined."

"J.2 Undefined behavior" states:

"For a call to a function without a function prototype in scope, the number of
arguments does not equal the number of parameters (6.5.2.2)."

I suspect that calling a function expecting no parameters from a call site with
parameters might work for the cdecl calling convention because the caller
is responsible for stack cleanup, but it seems to rely on a behavior which is
very much tied to the calling convention, and within "undefined behavior" 
territory
as far as the C standard is concerned.

I checked whether going for "void tp_stub_func()" instead would work better,
but it seems to also mean "no parameter" when applied to the function 
definition.
It's only when applied on the function declarator that is not part of the 
definition
that it means no information about the number or type of parameters is supplied.
(see C99 "6.7.5.3 Function declarators (including prototypes)" items 14 and 15)
And the kernel builds with "-Werror=strict-prototypes" which would not allow
it anyway.
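
As a minimal standalone illustration of that declarator rule (sketch only,
not kernel code):

/* "()" vs "(void)" in declarations and definitions. */
void g();	/* declaration only: no information about the number or
		   types of parameters (C99 6.7.5.3 item 14) */
void h(void);	/* declaration: the function takes no parameters */

void g() { }	/* in a definition, the empty list again means "no
		   parameters" -- and -Wstrict-prototypes rejects the
		   non-prototype form anyway */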

One thing which would work with respect to the C standard is to define one stub
function per tracepoint. This adds 16 bytes of code per TRACE_DEFINE() on x86-64,
but by specifying that those are cache-cold with "__cold", I notice that it adds
no extra code size to my build of kernel/sched/core.o which contains 30 tracepoint
definitions.

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index e7c2276be33e..e0351bb0b140 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -38,6 +38,7 @@ struct tracepoint {
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
+   void *stub_func;
 };
 
 #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 0f21617f1a66..b0b805de3779 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -287,6 +287,9 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
 #define DEFINE_TRACE_FN(_name, _reg, _unreg, proto, args)  \
static const char __tpstrtab_##_name[]  \
__section("__tracepoints_strings") = #_name;\
+   static void __cold __tracepoint_stub_func_##_name(void *__data, proto) \
+   {   \
+   }   \
extern struct static_call_key STATIC_CALL_KEY(tp_func_##_name); \
int __traceiter_##_name(void *__data, proto);   \
struct tracepoint __tracepoint_##_name  __used  \
@@ -298,7 +301,8 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
.iterator = &__traceiter_##_name,   \
.regfunc = _reg,\
.unregfunc = _unreg,\
-   .funcs = NULL };\
+   .funcs = NULL,  \
+   .stub_func = __tracepoint_stub_func_##_name, }; \
__TRACEPOINT_ENTRY(_name);      \
int __traceiter_##_name(void *__data, proto)\
{   \

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-16 Thread Mathieu Desnoyers
- On Nov 16, 2020, at 4:02 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 16 Nov 2020 15:44:37 -0500
> Steven Rostedt  wrote:
> 
>> If you use a stub function, it shouldn't affect anything. And the worse
>> that would happen is that you have a slight overhead of calling the stub
>> until you can properly remove the callback.
> 
> Something like this:
> 
> (haven't compiled it yet, I'm about to though).
> 
> -- Steve
> 
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 3f659f855074..8eab40f9d388 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -53,10 +53,16 @@ struct tp_probes {
>   struct tracepoint_func probes[];
> };
> 
> -static inline void *allocate_probes(int count)
> +/* Called in removal of a func but failed to allocate a new tp_funcs */
> +static void tp_stub_func(void)

I'm still not sure whether it's OK to call a (void) function with arguments.

> +{
> + return;
> +}
> +
> +static inline void *allocate_probes(int count, gfp_t extra_flags)
> {
>   struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
> -GFP_KERNEL);
> +GFP_KERNEL | extra_flags);
>   return p == NULL ? NULL : p->probes;
> }
> 
> @@ -150,7 +156,7 @@ func_add(struct tracepoint_func **funcs, struct
> tracepoint_func *tp_func,
>   }
>   }
>   /* + 2 : one for new probe, one for NULL func */
> - new = allocate_probes(nr_probes + 2);
> + new = allocate_probes(nr_probes + 2, 0);
>   if (new == NULL)
>   return ERR_PTR(-ENOMEM);
>   if (old) {
> @@ -188,8 +194,9 @@ static void *func_remove(struct tracepoint_func **funcs,
>   /* (N -> M), (N > 1, M >= 0) probes */
>   if (tp_func->func) {
>   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> - if (old[nr_probes].func == tp_func->func &&
> -  old[nr_probes].data == tp_func->data)
> + if ((old[nr_probes].func == tp_func->func &&
> +  old[nr_probes].data == tp_func->data) ||
> + old[nr_probes].func == tp_stub_func)
>   nr_del++;
>   }
>   }
> @@ -207,15 +214,20 @@ static void *func_remove(struct tracepoint_func **funcs,
>   int j = 0;
>   /* N -> M, (N > 1, M > 0) */
>   /* + 1 for NULL */
> - new = allocate_probes(nr_probes - nr_del + 1);
> - if (new == NULL)
> - return ERR_PTR(-ENOMEM);
> - for (i = 0; old[i].func; i++)
> - if (old[i].func != tp_func->func
> - || old[i].data != tp_func->data)
> - new[j++] = old[i];
> - new[nr_probes - nr_del].func = NULL;
> - *funcs = new;
> + new = allocate_probes(nr_probes - nr_del + 1, __GFP_NOFAIL);
> + if (new) {
> + for (i = 0; old[i].func; i++)
> + if (old[i].func != tp_func->func
> + || old[i].data != tp_func->data)

as you point out in your reply, skip tp_stub_func here too.

> + new[j++] = old[i];
> + new[nr_probes - nr_del].func = NULL;
> + } else {
> + for (i = 0; old[i].func; i++)
> + if (old[i].func == tp_func->func &&
> + old[i].data == tp_func->data)
> + old[i].func = tp_stub_func;

I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
func pointer can be updated and loaded concurrently.

> + }
> + *funcs = old;

The line above seems wrong for the successful allocate_probe case. You will likely
want *funcs = new on successful allocation, and *funcs = old for the failure case.
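
Pulling the three comments above together, the tail of func_remove() could
look roughly like this (an illustrative sketch, not the final patch):

	new = allocate_probes(nr_probes - nr_del + 1, __GFP_NOFAIL);
	if (new) {
		for (i = 0; old[i].func; i++)
			if ((old[i].func != tp_func->func ||
			     old[i].data != tp_func->data) &&
			    old[i].func != tp_stub_func)	/* drop stale stubs too */
				new[j++] = old[i];
		new[j].func = NULL;
		*funcs = new;		/* publish the new array on success */
	} else {
		/*
		 * Allocation failed: degrade in place. Readers may load
		 * ->func concurrently, hence WRITE_ONCE() here paired with
		 * a READ_ONCE() in __DO_TRACE.
		 */
		for (i = 0; old[i].func; i++)
			if (old[i].func == tp_func->func &&
			    old[i].data == tp_func->data)
				WRITE_ONCE(old[i].func, tp_stub_func);
		*funcs = old;		/* keep the old array */
	}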

Thanks,

Mathieu

>   }
>   debug_print_probes(*funcs);
>   return old;
> @@ -300,6 +312,10 @@ static int tracepoint_remove_func(struct tracepoint *tp,
>   return PTR_ERR(old);
>   }
> 
> + if (tp_funcs == old)
> + /* Failed allocating new tp_funcs, replaced func with stub */
> + return 0;
> +
>   if (!tp_funcs) {
>   /* Removed last function */
>   if (tp->unregfunc && static_key_enabled(&tp->key))

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-16 Thread Mathieu Desnoyers
- On Nov 16, 2020, at 3:44 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 16 Nov 2020 15:37:27 -0500 (EST)
> Mathieu Desnoyers  wrote:
>> > 
>> > Mathieu,
>> > 
>> > Can't we do something that would still allow to unregister a probe even if
>> > a new probe array fails to allocate? We could kick off a irq work to try to
>> > clean up the probe at a later time, but still, the unregister itself should
>> > not fail due to memory failure.
>> 
>> Currently, the fast path iteration looks like:
>> 
>> struct tracepoint_func *it_func_ptr;
>> void *it_func;
>> 
>> it_func_ptr =   \
>> rcu_dereference_raw((&__tracepoint_##_name)->funcs); 
>> \
>> do {\
>> it_func = (it_func_ptr)->func;  \
>> __data = (it_func_ptr)->data;   \
>> ((void(*)(void *, proto))(it_func))(__data, args); \
>> } while ((++it_func_ptr)->func);
>> 
>> So we RCU dereference the array, and iterate on the array until we find a 
>> NULL
>> func. So you could not use NULL to skip items, but you could perhaps reserve
>> a (void *)0x1UL tombstone for this.
> 
> Actually, you could just set it to a stub callback that does nothing. then
> you don't even need to touch the above macro. Not sure why I didn't
> recommend this to begin with, because that's exactly what the function
> tracer does with ftrace_stub.

I like the stub idea.

What prototype should the stub function have ? Is it acceptable to pass
arguments to a function expecting (void) ? If not, then we may need to
create stub functions for each tracepoint.

> 
> 
>> 
>> It should ideally be an unlikely branch, and it would be good to benchmark 
>> the
>> change when multiple tracing probes are attached to figure out whether the
>> overhead is significant when tracing is enabled.
> 
> If you use a stub function, it shouldn't affect anything. And the worse
> that would happen is that you have a slight overhead of calling the stub
> until you can properly remove the callback.
> 
>> 
>> I wonder whether we really mind that much about using slightly more memory
>> than required after a failed reallocation due to ENOMEM. Perhaps the irq work
>> is not even needed. Chances are that the irq work would fail again and again 
>> if
>> it's in low memory conditions. So maybe it's better to just keep the 
>> tombstone
>> in place until the next successful callback array reallocation.
>> 
> 
> True. If we just replace the function with a stub on memory failure (always
> using __GFP_NOFAIL, and if it fails to reallocate a new array, just replace
> the callback with the stub and be done with it. It may require some more
> accounting to make sure the tracepoint.c code can handle these stubs, and
> remove them on new additions to the code.

Yes.

> Heck, if a stub exists, you could just swap it with a new item.

Not without proper synchronization, otherwise you could end up with a
mismatch between the function callback and its private data. The only transition
valid without waiting for an rcu grace period is to change the function
pointer to a stub function. Anything else (e.g. replacing the stub by
some other callback function) would require to wait for a grace period
beforehand.
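
To make that constraint concrete, a sketch of the two update paths
(illustrative only; copy_probes_replacing() is a hypothetical helper):

	/*
	 * Allowed without waiting: a single-pointer downgrade to the stub.
	 * A reader sees either the old callback or the stub, and ->data
	 * still matches whichever ->func it loaded.
	 */
	WRITE_ONCE(old[i].func, tp_stub_func);

	/*
	 * Not allowed in place: installing a different func/data pair,
	 * since a reader could pair the new func with the old data (or
	 * vice versa). Such a change must publish a new array and wait
	 * out existing readers:
	 */
	new = copy_probes_replacing(old, tp_func, new_func);	/* hypothetical */
	rcu_assign_pointer(tp->funcs, new);
	synchronize_rcu();
	kfree(old);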

> But on any new changes to the list, the stubs should be purged.

Yes,

Thanks,

Mathieu

> 
> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-16 Thread Mathieu Desnoyers
- On Nov 16, 2020, at 12:19 PM, rostedt rost...@goodmis.org wrote:

> On Sat, 14 Nov 2020 21:52:55 -0800
> Matt Mullins  wrote:
> 
>> bpf_link_free is always called in process context, including from a
>> workqueue and from __fput.  Neither of these have the ability to
>> propagate an -ENOMEM to the caller.
>> 
> 
> Hmm, I think the real fix is to not have unregistering a tracepoint probe
> fail because of allocation. We are removing a probe, perhaps we could just
> inject NULL pointer that gets checked via the DO_TRACE loop?
> 
> I bet failing an unregister because of an allocation failure causes
> problems elsewhere than just BPF.
> 
> Mathieu,
> 
> Can't we do something that would still allow to unregister a probe even if
> a new probe array fails to allocate? We could kick off a irq work to try to
> clean up the probe at a later time, but still, the unregister itself should
> not fail due to memory failure.

Currently, the fast path iteration looks like:

struct tracepoint_func *it_func_ptr;
void *it_func;

it_func_ptr =   \
rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
do {\
it_func = (it_func_ptr)->func;  \
__data = (it_func_ptr)->data;   \
((void(*)(void *, proto))(it_func))(__data, args); \
} while ((++it_func_ptr)->func); 

So we RCU dereference the array, and iterate on the array until we find a NULL
func. So you could not use NULL to skip items, but you could perhaps reserve
a (void *)0x1UL tombstone for this.
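
For illustration, a rough sketch of what that tombstone check could look like
in the iteration (hypothetical; later messages in the thread settle on a stub
function instead):

	#define TP_TOMBSTONE	((void *)0x1UL)

	it_func_ptr = rcu_dereference_raw((&__tracepoint_##_name)->funcs);
	do {
		it_func = READ_ONCE((it_func_ptr)->func);
		__data = (it_func_ptr)->data;
		if (unlikely(it_func == TP_TOMBSTONE))
			continue;	/* removed probe, skip it */
		((void(*)(void *, proto))(it_func))(__data, args);
	} while ((++it_func_ptr)->func);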

It should ideally be an unlikely branch, and it would be good to benchmark the
change when multiple tracing probes are attached to figure out whether the
overhead is significant when tracing is enabled.

I wonder whether we really mind that much about using slightly more memory
than required after a failed reallocation due to ENOMEM. Perhaps the irq work
is not even needed. Chances are that the irq work would fail again and again if
it's in low memory conditions. So maybe it's better to just keep the tombstone
in place until the next successful callback array reallocation.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] [PATCH] README: fix typo in README.md

2020-11-16 Thread Mathieu Desnoyers via lttng-dev
- On Nov 16, 2020, at 7:24 AM, wangshuo wangshu...@huawei.com wrote:

> Dear Mathieu,
> 
> This mail may work.

Merged into liburcu master, stable-0.12, stable-0.11, thanks!

Mathieu

> 
> Thank you,
> Shuo
> 
> Signed-off-by: Shuo Wang 
> ---
> README.md | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/README.md b/README.md
> index 2d6dcaa..dd7ebd1 100644
> --- a/README.md
> +++ b/README.md
> @@ -388,7 +388,7 @@ For always-on debugging self-checks:
>   ./configure --enable-rcu-debug
> 
> For fine grained enabling of debugging self-checks, build
> -urserspace-rcu with DEBUG_RCU defined and compile dependent
> +userspace-rcu with DEBUG_RCU defined and compile dependent
> applications with DEBUG_RCU defined when necessary.
> 
> Warning: Enabling this feature result in a performance penalty.
> --
> 2.23.0
> 
>> Hi,
>> 
>> This fix does not apply with "git am", probably due to the fact that the 
>> email
>> is not
>> formatted as plain text without alteration by your email client. I recommend
>> using
>> git send-email to send patches. Can you try sending it again in the proper
>> format
>> expected by git am ?
>> 
>> You will likely find useful information about why your email client changed 
>> the
>> patch formatting here:
>> 
>> [ https://www.kernel.org/doc/html/v5.9/process/submitting-patches.html |
>> https://www.kernel.org/doc/html/v5.9/process/submitting-patches.html ]
>> [ 
>> https://www.kernel.org/doc/html/v5.9/process/email-clients.html#email-clients
>> | 
>> https://www.kernel.org/doc/html/v5.9/process/email-clients.html#email-clients
>> ]
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> - On Nov 13, 2020, at 4:28 AM, wangshuo (AF) 
>> wrote:
>> 
>> > README: fix typo in README.md
>> 
>> > Signed-off-by: Shuo Wang < [ mailto:wangshuo47 at huawei.com |
>> > wangshuo47 at huawei.com ] >
>> 
>> > ---
>> 
>> > README.md | 2 +-
>> 
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> > diff --git a/README.md b/README.md
>> 
>> > index 2d6dcaa..dd7ebd1 100644
>> 
>> > --- a/README.md
>> 
>> > +++ b/README.md
>> 
>> > @@ -388,7 +388,7 @@ For always-on debugging self-checks:
>> 
>> > ./configure --enable-rcu-debug
>> 
>> > For fine grained enabling of debugging self-checks, build
>> 
>> > -urserspace-rcu with DEBUG_RCU defined and compile dependent
>> 
>> > +userspace-rcu with DEBUG_RCU defined and compile dependent
>> 
>> > applications with DEBUG_RCU defined when necessary.
>> 
>> > Warning: Enabling this feature result in a performance penalty.
>> 
>> > --
>> 
> > > 2.23.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] [PATCH] README: fix typo in README.md

2020-11-13 Thread Mathieu Desnoyers via lttng-dev
Hi, 

This fix does not apply with "git am", probably due to the fact that the email 
is not 
formatted as plain text without alteration by your email client. I recommend 
using 
git send-email to send patches. Can you try sending it again in the proper 
format 
expected by git am ? 

You will likely find useful information about why your email client changed the 
patch formatting here: 

[ https://www.kernel.org/doc/html/v5.9/process/submitting-patches.html | 
https://www.kernel.org/doc/html/v5.9/process/submitting-patches.html ] 
[ https://www.kernel.org/doc/html/v5.9/process/email-clients.html#email-clients 
| https://www.kernel.org/doc/html/v5.9/process/email-clients.html#email-clients 
] 

Thanks, 

Mathieu 

- On Nov 13, 2020, at 4:28 AM, wangshuo (AF)  wrote: 

> README: fix typo in README.md

> Signed-off-by: Shuo Wang < [ mailto:wangshu...@huawei.com |
> wangshu...@huawei.com ] >

> ---

> README.md | 2 +-

> 1 file changed, 1 insertion(+), 1 deletion(-)

> diff --git a/README.md b/README.md

> index 2d6dcaa..dd7ebd1 100644

> --- a/README.md

> +++ b/README.md

> @@ -388,7 +388,7 @@ For always-on debugging self-checks:

> ./configure --enable-rcu-debug

> For fine grained enabling of debugging self-checks, build

> -urserspace-rcu with DEBUG_RCU defined and compile dependent

> +userspace-rcu with DEBUG_RCU defined and compile dependent

> applications with DEBUG_RCU defined when necessary.

> Warning: Enabling this feature result in a performance penalty.

> --

> 2.23.0

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 


Re: [lttng-dev] [PATCH urcu] fix: add -lurcu-common to pkg-config libs for each flavor

2020-11-02 Thread Mathieu Desnoyers via lttng-dev
Merged into liburcu master, 0.11, 0.12, thanks!

Mathieu

- On Oct 30, 2020, at 3:39 PM, Michael Jeanson mjean...@efficios.com wrote:

> The urcu-common library contains common code like the write-free queue
> and compat code, each urcu flavor library is dynamicly linked with it.
> 
> Most but not all toolchains will automatically link an executable with a
> transitive depency of an explicitly linked library if said binary uses a
> symbol from the transitive dependency.
> 
> Since this behavior is not present in all toolchains, add
> '-lurcu-common' to the 'Libs' field of each flavors pkg-config file so
> that executables using symbols from urcu-common can be reliably linked
> using pkg-config.
> 
> Signed-off-by: Michael Jeanson 
> ---
> src/liburcu-bp.pc.in | 2 +-
> src/liburcu-cds.pc.in| 2 +-
> src/liburcu-mb.pc.in | 2 +-
> src/liburcu-qsbr.pc.in   | 2 +-
> src/liburcu-signal.pc.in | 2 +-
> src/liburcu.pc.in| 2 +-
> 6 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/src/liburcu-bp.pc.in b/src/liburcu-bp.pc.in
> index c5f2355..7cba58a 100644
> --- a/src/liburcu-bp.pc.in
> +++ b/src/liburcu-bp.pc.in
> @@ -7,5 +7,5 @@ Name: Userspace RCU BulletProof
> Description: A userspace RCU (read-copy-update) library, bulletproof version
> Version: @PACKAGE_VERSION@
> Requires:
> -Libs: -L${libdir} -lurcu-bp
> +Libs: -L${libdir} -lurcu-common -lurcu-bp
> Cflags: -I${includedir}
> diff --git a/src/liburcu-cds.pc.in b/src/liburcu-cds.pc.in
> index e3d13af..1cb19b6 100644
> --- a/src/liburcu-cds.pc.in
> +++ b/src/liburcu-cds.pc.in
> @@ -7,5 +7,5 @@ Name: Userspace RCU Concurrent Data Structures
> Description: Data structures leveraging RCU and atomic operations to provide
> efficient concurrency-aware storage
> Version: @PACKAGE_VERSION@
> Requires:
> -Libs: -L${libdir} -lurcu-cds
> +Libs: -L${libdir} -lurcu-common -lurcu-cds
> Cflags: -I${includedir}
> diff --git a/src/liburcu-mb.pc.in b/src/liburcu-mb.pc.in
> index cd669ef..1684701 100644
> --- a/src/liburcu-mb.pc.in
> +++ b/src/liburcu-mb.pc.in
> @@ -7,5 +7,5 @@ Name: Userspace RCU Memory barriers
> Description: A userspace RCU (read-copy-update) library, memory barriers 
> version
> Version: @PACKAGE_VERSION@
> Requires:
> -Libs: -L${libdir} -lurcu-mb
> +Libs: -L${libdir} -lurcu-common -lurcu-mb
> Cflags: -I${includedir}
> diff --git a/src/liburcu-qsbr.pc.in b/src/liburcu-qsbr.pc.in
> index 0732602..d123a10 100644
> --- a/src/liburcu-qsbr.pc.in
> +++ b/src/liburcu-qsbr.pc.in
> @@ -7,5 +7,5 @@ Name: Userspace RCU QSBR
> Description: A userspace RCU (read-copy-update) library, quiescent state 
> version
> Version: @PACKAGE_VERSION@
> Requires:
> -Libs: -L${libdir} -lurcu-qsbr
> +Libs: -L${libdir} -lurcu-common -lurcu-qsbr
> Cflags: -I${includedir}
> diff --git a/src/liburcu-signal.pc.in b/src/liburcu-signal.pc.in
> index f9bc3a3..844c449 100644
> --- a/src/liburcu-signal.pc.in
> +++ b/src/liburcu-signal.pc.in
> @@ -7,5 +7,5 @@ Name: Userspace RCU signal
> Description: A userspace RCU (read-copy-update) library, signal version
> Version: @PACKAGE_VERSION@
> Requires:
> -Libs: -L${libdir} -lurcu-signal
> +Libs: -L${libdir} -lurcu-common -lurcu-signal
> Cflags: -I${includedir}
> diff --git a/src/liburcu.pc.in b/src/liburcu.pc.in
> index 22bf2c8..b9f812b 100644
> --- a/src/liburcu.pc.in
> +++ b/src/liburcu.pc.in
> @@ -7,5 +7,5 @@ Name: Userspace RCU
> Description: A userspace RCU (read-copy-update) library, standard version
> Version: @PACKAGE_VERSION@
> Requires:
> -Libs: -L${libdir} -lurcu
> +Libs: -L${libdir} -lurcu-common -lurcu
> Cflags: -I${includedir}
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] membarrier.2: Update prototype

2020-11-02 Thread Mathieu Desnoyers
- On Nov 1, 2020, at 3:04 PM, Alejandro Colomar colomar.6@gmail.com 
wrote:

> The Linux kernel now uses 'flags' and added a new argument: 'cpu_id'.
> These changes were introduced to the kernel
> in commit 2a36ab717e8fe678d98f81c14a0b124712719840.

I doubt the proposed patch with a FIXME and a TODO is appropriate for the man 
pages project.
It does point out the fact that the membarrier man page needs updating following
Peter's commit though.

Peter (Oskolkov), can you contribute a patch detailing the new membarrier flags 
and cpu_id
arguments to the man pages project ?
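
For reference, a quick sketch of how the updated three-argument form is
invoked (through syscall(2), since glibc provides no wrapper); the command
and flag names are the ones added by the kernel commit referenced above:

	#define _GNU_SOURCE
	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int membarrier(int cmd, unsigned int flags, int cpu_id)
	{
		return syscall(__NR_membarrier, cmd, flags, cpu_id);
	}

	/*
	 * Example (kernel >= 5.10, after registering with
	 * MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ): issue an rseq
	 * fence targeting CPU 3 only:
	 *
	 *	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
	 *		   MEMBARRIER_CMD_FLAG_CPU, 3);
	 */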

Thanks,

Mathieu

> 
> Signed-off-by: Alejandro Colomar 
> ---
> man2/membarrier.2 | 9 -
> 1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/man2/membarrier.2 b/man2/membarrier.2
> index 24a24ba86..42b7e2acc 100644
> --- a/man2/membarrier.2
> +++ b/man2/membarrier.2
> @@ -23,6 +23,13 @@
> .\" %%%LICENSE_END
> .\"
> .TH MEMBARRIER 2 2020-06-09 "Linux" "Linux Programmer's Manual"
> +.\" FIXME:
> +.\" The Linux kernel now uses 'flags' and added a new argument: 'cpu_id'.
> +.\" These changes were introduced to the kernel
> +.\" in commit 2a36ab717e8fe678d98f81c14a0b124712719840.
> +.\" The prototype has been updated,
> +.\" but the new features have not yet been documented.
> +.\" TODO: Document those new features.
> .SH NAME
> membarrier \- issue memory barriers on a set of threads
> .SH SYNOPSIS
> @@ -30,7 +37,7 @@ membarrier \- issue memory barriers on a set of threads
> .PP
> .B #include 
> .PP
> -.BI "int membarrier(int " cmd ", int " flags ");"
> +.BI "int membarrier(int " cmd ", unsigned int " flags ", int " cpu_id );
> .fi
> .PP
> .IR Note :
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[tip: sched/core] sched: fix exit_mm vs membarrier (v4)

2020-10-29 Thread tip-bot2 for Mathieu Desnoyers
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5bc78502322a5e4eef3f1b2a2813751dc6434143
Gitweb:
https://git.kernel.org/tip/5bc78502322a5e4eef3f1b2a2813751dc6434143
Author:Mathieu Desnoyers 
AuthorDate:Tue, 20 Oct 2020 09:47:13 -04:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 29 Oct 2020 11:00:30 +01:00

sched: fix exit_mm vs membarrier (v4)

exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

The AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0   CPU 1
Thread AThread B
(in thread group A) (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
r1 = load y
membarrier()
  skips CPU 0 (no IPI) because its current mm is NULL
r2 = load x
BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoy...@efficios.com
---
 include/linux/sched/mm.h  |  5 +
 kernel/exit.c | 16 +++-
 kernel/sched/membarrier.c | 12 
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index d5ece7a..a91fb3a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -347,6 +347,8 @@ static inline void 
membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 
 extern void membarrier_exec_mmap(struct mm_struct *mm);
 
+extern void membarrier_update_current_mm(struct mm_struct *next_mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
@@ -361,6 +363,9 @@ static inline void membarrier_exec_mmap(struct mm_struct 
*mm)
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct 
*mm)
 {
 }
+static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+}
 #endif
 
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 87a2d51..a3dd6b3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,10 +475,24 @@ static void exit_mm(void)
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
+   /*
+* When a thread stops operating on an address space, the loop
+* in membarrier_private_expedited() may not observe that
+* tsk->mm, and the loop in membarrier_global_expedited() may
+* not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
+* rq->membarrier_state, so those would not issue an IPI.
+* Membarrier requires a memory barrier after accessing
+* user-space memory, before clearing tsk->mm or the
+* rq->membarrier_state.
+*/
+   smp_mb__after_spinlock();
+   local_irq_disable();
current->mm = NULL;
-   mmap_read_unlock(mm);
+   membarrier_update_current_mm(NULL);
enter_lazy_tlb(mm, current);
+   local_irq_enable();
task_unlock(current);
+   mmap_read_unlock(mm);
mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index e23e74d..aac3292 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -76,6 +76,18 @@ void membarrier_exec_mmap(struct mm_struct *mm)
this_cpu_write(runqueues.membarrier_state, 0);
 }
 
+void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+   struct rq *rq = this_rq();
+   int membarrier_state = 0;
+
+   if (next_mm)
+   membarrier_state = atomic_read(&next_mm->membarrier_state);
+   if (READ_ONCE(rq->membarrier_state) == membarrier_state)
+   return;
+   WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+
 static int membarrier_global_expedited(void)
 {
int cpu;


[tip: sched/core] sched: membarrier: cover kthread_use_mm (v4)

2020-10-29 Thread tip-bot2 for Mathieu Desnoyers
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 618758ed3a4f7d790414d020b362111748ebbf9f
Gitweb:
https://git.kernel.org/tip/618758ed3a4f7d790414d020b362111748ebbf9f
Author:Mathieu Desnoyers 
AuthorDate:Tue, 20 Oct 2020 09:47:14 -04:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 29 Oct 2020 11:00:31 +01:00

sched: membarrier: cover kthread_use_mm (v4)

Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread use this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.

Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoy...@efficios.com
---
 kernel/kthread.c  | 21 +
 kernel/sched/idle.c   |  1 +
 kernel/sched/membarrier.c |  7 +++
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index e29773c..481428f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1248,6 +1248,7 @@ void kthread_use_mm(struct mm_struct *mm)
tsk->active_mm = mm;
}
tsk->mm = mm;
+   membarrier_update_current_mm(mm);
switch_mm_irqs_off(active_mm, mm, tsk);
local_irq_enable();
task_unlock(tsk);
@@ -1255,8 +1256,19 @@ void kthread_use_mm(struct mm_struct *mm)
finish_arch_post_lock_switch();
 #endif
 
+   /*
+* When a kthread starts operating on an address space, the loop
+* in membarrier_{private,global}_expedited() may not observe
+* that tsk->mm, and not issue an IPI. Membarrier requires a
+* memory barrier after storing to tsk->mm, before accessing
+* user-space memory. A full memory barrier for membarrier
+* {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+* mmdrop(), or explicitly with smp_mb().
+*/
if (active_mm != mm)
mmdrop(active_mm);
+   else
+   smp_mb();
 
to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1276,9 +1288,18 @@ void kthread_unuse_mm(struct mm_struct *mm)
force_uaccess_end(to_kthread(tsk)->oldfs);
 
task_lock(tsk);
+   /*
+* When a kthread stops operating on an address space, the loop
+* in membarrier_{private,global}_expedited() may not observe
+* that tsk->mm, and not issue an IPI. Membarrier requires a
+* memory barrier after accessing user-space memory, before
+* clearing tsk->mm.
+*/
+   smp_mb__after_spinlock();
sync_mm_rss(mm);
local_irq_disable();
tsk->mm = NULL;
+   membarrier_update_current_mm(NULL);
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
local_irq_enable();
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 24d0ee2..846743e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -338,6 +338,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
WARN_ON_ONCE(!duration_ns);
+   WARN_ON_ONCE(current->mm);
 
rcu_sleep_check();
preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aac3292..f223f35 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -126,12 +126,11 @@ static int membarrier_global_expedited(void)
continue;
 
/*
-* Skip the CPU if it runs a kernel thread. The scheduler
-* leaves the prior task mm in place as an optimization when
-* scheduling a kthread.
+* Skip the CPU if it runs a kernel thread which is not using
+* a task mm.
 */
p = rcu_dereference(cpu_rq(cpu)->curr);
-   if (p->flags & PF_KTHREAD)
+   if (!p->mm)
continue;
 
__cpumask_set_cpu(cpu, tmpmask);


[tip: sched/core] sched: membarrier: document memory ordering scenarios

2020-10-29 Thread tip-bot2 for Mathieu Desnoyers
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 25595eb6aaa9fbb31330f1e0b400642694bc6574
Gitweb:
https://git.kernel.org/tip/25595eb6aaa9fbb31330f1e0b400642694bc6574
Author:Mathieu Desnoyers 
AuthorDate:Tue, 20 Oct 2020 09:47:15 -04:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 29 Oct 2020 11:00:31 +01:00

sched: membarrier: document memory ordering scenarios

Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoy...@efficios.com
---
 kernel/sched/membarrier.c | 128 +-
 1 file changed, 128 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index f223f35..5a40b38 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
 #include "sched.h"
 
 /*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ * CPU0  CPU1
+ *
+ * x = 1
+ * membarrier():
+ *   a: smp_mb()
+ *   b: send IPI   IPI-induced mb
+ *   c: smp_mb()
+ * r2 = y
+ *   y = 1
+ *   barrier()
+ *   r1 = x
+ *
+ * BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b).  If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0.  This violates the guarantee that membarrier() is
+ * supposed by provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ * CPU0  CPU1
+ *
+ *   x = 1
+ *   barrier()
+ *   y = 1
+ * r2 = y
+ * membarrier():
+ *   a: smp_mb()
+ *   b: send IPI   IPI-induced mb
+ *   c: smp_mb()
+ * r1 = x
+ * BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b).  If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0".  This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *   CPU0CPU1
+ *
+ *   membarrier():
+ *   a: smp_mb()
+ *   d: switch to kthread (includes mb)
+ *   b: read rq->curr->mm == NULL
+ *   e: switch to user (includes mb)
+ *   c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B.  Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B).  T

Re: [PATCH] call_rcu: Fix race between rcu_barrier() and call_rcu_data_free()

2020-10-27 Thread Mathieu Desnoyers
- On Oct 26, 2020, at 4:53 PM, paulmck paul...@kernel.org wrote:

> On Mon, Oct 26, 2020 at 03:58:11PM -0400, Mathieu Desnoyers wrote:
>> - On Oct 22, 2020, at 6:30 PM, paulmck paul...@kernel.org wrote:
>> 
>> > The current code can lose RCU callbacks at shutdown time, which can
>> > result in hangs.  This lossage can happen as follows:
>> > 
>> > o   A thread invokes call_rcu_data_free(), which executes up through
>> >the wake_call_rcu_thread().  At this point, the call_rcu_data
>> >structure has been drained of callbacks, but is still on the
>> >call_rcu_data_list.  Note that this thread does not hold the
>> >call_rcu_mutex.
>> > 
>> > o   Another thread invokes rcu_barrier(), which traverses the
>> >call_rcu_data_list under the protection of call_rcu_mutex,
>> >a list which still includes the above newly drained structure.
>> >This thread therefore adds a callback to the newly drained
>> >call_rcu_data structure.  It then releases call_rcu_mutex and
>> >enters a mystifying loop that does futex stuff.
>> > 
>> > o   The first thread finishes executing call_rcu_data_free(),
>> >which acquires call_rcu_mutex just long enough to remove the
>> >newly drained call_rcu_data structure from call_rcu_data_list.
>> >Which causes one of the rcu_barrier() invocation's callbacks to
>> >be leaked.
>> > 
>> > o   The second thread's rcu_barrier() invocation never returns
>> >resulting in a hang.
>> > 
>> > This commit therefore changes call_rcu_data_free() to acquire
>> > call_rcu_mutex before checking the call_rcu_data structure for callbacks.
>> > In the case where there are no callbacks, call_rcu_mutex is held across
>> > both the check and the removal from call_rcu_data_list, thus preventing
>> > rcu_barrier() from adding a callback in the meantime.  In the case where
>> > there are callbacks, call_rcu_mutex must be momentarily dropped across
>> > the call to get_default_call_rcu_data(), which can itself acquire
>> > call_rcu_mutex.  This momentary drop is not a problem because any
>> > callbacks that rcu_barrier() might queue during that period of time will
>> > be moved to the default call_rcu_data structure, and the lock will be
>> > held across the full time including moving those callbacks and removing
>> > the call_rcu_data structure that was passed into call_rcu_data_free()
>> > from call_rcu_data_list.
>> > 
>> > With this fix, a several-hundred-CPU test successfully completes more
>> > than 5,000 executions.  Without this fix, it fails within a few tens
>> > of executions.  Although the failures happen more quickly on larger
>> > systems, in theory this could happen on a single-CPU system, courtesy
>> > of preemption.
>> 
>> I agree with this fix, will merge in liburcu master, stable-0.12, and
>> stable-2.11.
>> Out of curiosity, which test is hanging ?  Is it a test which is part of the
>> liburcu
>> tree or some out-of-tree test ? I wonder why we did not catch it in our CI 
>> [1].
> 
> The hung test was from perfbook [1] in the CodeSamples/datastruct/hash
> directory.  A repeat-by is as follows:
> 
> # Have userspace RCU preinstalled as you wish.
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> cd CodeSamples
> make pthreads
> cd datastruct/hash
> make
> time for ((i = 0; i < 2000; i++)); do echo $i; ./hash_bkt_rcu --schroedinger
> --nreaders 444 --nupdaters 4 --duration 1000 --updatewait 1 --nbuckets 262144
> --elems/writer 65536; done
> 
> This normally hangs within a few tens of iterations.  With this patch,
> the test passes more than 6,000 iterations.
> 
> I have smaller tests that produce this same hang on my 12-CPU laptop,
> but with much lower probability.  Here is one example that did hang on
> my laptop, and which could be placed into a similar bash loop as above:
> 
> hash_bkt_rcu --schroedinger --nreaders 10 --nupdaters 2 --duration 1000
> --updatewait 1 --nbuckets 8192 --elems/writer 4096
> 
> But I don't have a good estimate of the hang probability, except a
> suspicion that it is lower than would be convenient for a CI test.
> Attaching to the hung process using gdb did confirm the type of hang,
> however.
> 
> It might be possible to create a focused test that races rcu_barrier()
> against thread exit, where threads are created and exit repeatedly,
> and make a per
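A focused reproducer along those lines might look roughly like the sketch
below, using liburcu's per-thread call_rcu_data API. The flavor selection
(legacy RCU_MEMBARRIER mapping through urcu.h), the iteration counts and the
lack of error handling are assumptions made for illustration only.

/* Sketch: race rcu_barrier() against per-thread call_rcu_data teardown. */
#define RCU_MEMBARRIER
#include <pthread.h>
#include <stdlib.h>
#include <urcu.h>		/* assumed to pull in the call_rcu API for this flavor */
#include <urcu/compiler.h>	/* caa_container_of() */

struct elem {
	struct rcu_head head;
};

static void free_elem(struct rcu_head *head)
{
	free(caa_container_of(head, struct elem, head));
}

static void *exiting_thread(void *arg)
{
	struct call_rcu_data *crdp;
	struct elem *e = malloc(sizeof(*e));

	rcu_register_thread();
	/* Give this thread its own call_rcu worker and queue one callback... */
	crdp = create_call_rcu_data(0, -1);
	set_thread_call_rcu_data(crdp);
	call_rcu(&e->head, free_elem);
	/* ...then tear the worker down, racing with rcu_barrier() below. */
	set_thread_call_rcu_data(NULL);
	call_rcu_data_free(crdp);
	rcu_unregister_thread();
	return NULL;
}

static void *barrier_thread(void *arg)
{
	rcu_register_thread();
	for (int i = 0; i < 10000; i++)
		rcu_barrier();	/* hangs if a queued callback is lost */
	rcu_unregister_thread();
	return NULL;
}

int main(void)
{
	pthread_t b, t;

	pthread_create(&b, NULL, barrier_thread, NULL);
	for (int i = 0; i < 10000; i++) {
		pthread_create(&t, NULL, exiting_thread, NULL);
		pthread_join(t, NULL);
	}
	pthread_join(b, NULL);
	return 0;
}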

Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

2020-10-27 Thread Mathieu Desnoyers
- On Oct 26, 2020, at 4:44 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 26 Oct 2020 10:28:07 -0400 (EDT)
> Mathieu Desnoyers  wrote:
> 
>> I agree with Peter. Removing the trace_.*_rcuidle weirdness from the 
>> tracepoint
>> API and fixing all callers to ensure they trace from a context where RCU is
>> watching would simplify instrumentation of the Linux kernel, thus making it
>> harder
>> for subtle bugs to hide and be unearthed only when tracing is enabled. This 
>> is
> 
> Note, the lockdep RCU checking of a tracepoint is outside of it being
> enabled or disable. So if a non rcuidle() tracepoint is in a location that
> RCU is not watching, it will complain loudly, even if you don't enable that
> tracepoint.
> 
>> AFAIU the general approach Thomas Gleixner has been aiming for recently, and 
>> I
>> think it is a good thing.
>> 
>> So if we consider this our target, and that the current state of things is 
>> that
>> we need to have RCU watching around callback invocation, then removing the
>> dependency on SRCU seems like an overall simplification which does not 
>> regress
>> feature-wise nor speed-wise compared with what we have upstream today. The 
>> next
>> steps would then be to audit all rcuidle tracepoints and make sure the 
>> context
>> where they are placed has RCU watching already, so we can remove the 
>> tracepoint
> 
> Just remove the _rcuidle() from them, and lockdep will complain if they are
> being called without RCU watching.

That's the easy part. The patch removing rcuidle is attached to this email,
and here are the splats I get just when booting the machine (see below). The
part which takes more time is fixing those splats and figuring out how to
restructure the code. For instance, what should we do about the rcuidle
tracepoint within printk() ?

Also, how do we test rcuidle tracepoints triggered only when printk is called
from printk warnings ? We'd also need to test on arm32 boards, because some arm
architecture code uses rcuidle tracepoints as well.

As this is beyond the scope of this patch set, I can either drop this patch
entirely (it's not required for sleepable tracepoints), or keep it as an
intermediate cleanup step. Removing rcuidle tracepoints entirely is beyond
the effort Michael and I can undertake now.

=
WARNING: suspicious RCU usage

5.9.1+ #72 Not tainted
-
=
./include/trace/events/preemptirq.h:42 suspicious rcu_dereference_check() usage!
WARNING: suspicious RCU usage

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
5.9.1+ #72 Not tainted
RCU used illegally from extended quiescent state!
no locks held by swapper/0/0.
-
./include/trace/events/preemptirq.h:38 suspicious rcu_dereference_check() usage!

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
RCU used illegally from extended quiescent state!
no locks held by swapper/1/0.
 dump_stack+0x8d/0xbb

stack backtrace:
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.1+ #72

=
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-
./include/trace/events/lock.h:37 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/0/0.

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1+ #72
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x8d/0xbb
 lock_acquire+0x346/0x3b0
 ? __lock_acquire+0x2e7/0x1da0
 _raw_spin_lock+0x2c/0x40
 ? vprintk_emit+0x9f/0x410
 vprintk_emit+0x9f/0x410
 printk+0x52/0x6e
 lockdep_rcu_suspicious+0x1b/0xe0
 ? default_idle+0xa/0x20
 trace_hardirqs_on+0xed/0xf0
 default_idle+0xa/0x20
 default_idle_call+0x4f/0x1e0
 do_idle+0x1ef/0x2c0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x578/0x59d
 secondary_startup_64+0xa4/0xb0

=
WARNING: suspicious RCU usage
5.9.1+ #72 Not tainted
-
./include/trace/events/lock.h:63 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
1 lock held by swapper/0/0:
 #0: 97a80158 (logbuf_l

Re: [RFC PATCH 1/6] tracing: introduce sleepable tracepoints

2020-10-27 Thread Mathieu Desnoyers


- On Oct 26, 2020, at 6:43 PM, Alexei Starovoitov 
alexei.starovoi...@gmail.com wrote:

> On Fri, Oct 23, 2020 at 03:53:47PM -0400, Michael Jeanson wrote:
>> -#define __DO_TRACE(tp, proto, args, cond, rcuidle)  \
>> +#define __DO_TRACE(tp, proto, args, cond, rcuidle, tp_flags)
>> \
>>  do {\
>>  struct tracepoint_func *it_func_ptr;\
>>  void *it_func;  \
>>  void *__data;   \
>>  int __maybe_unused __idx = 0;   \
>> +bool maysleep = (tp_flags) & TRACEPOINT_MAYSLEEP;   \
>>  \
>>  if (!(cond))\
>>  return; \
>> @@ -170,8 +178,13 @@ static inline struct tracepoint
>> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>>  /* srcu can't be used from NMI */   \
>>  WARN_ON_ONCE(rcuidle && in_nmi());  \
>>  \
>> -/* keep srcu and sched-rcu usage consistent */  \
>> -preempt_disable_notrace();  \
>> +if (maysleep) { \
>> +might_sleep();  \
> 
> The main purpose of the patch set is to access user memory in tracepoints,
> right?

Yes, exactly.

> In such case I suggest to use stronger might_fault() here.
> We used might_sleep() in sleepable bpf and it wasn't enough to catch
> a combination where sleepable hook was invoked while mm->mmap_lock was
> taken which may cause a deadlock.

Good point! We will do that for the next round.
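Concretely, the maysleep branch of __DO_TRACE() would then read something like
this (sketch only, following the hunk quoted above, with might_sleep() replaced
by the stronger might_fault() check):

		if (maysleep) {						\
			/* Also catches sleepable use with mmap_lock held. */	\
			might_fault();					\
			rcu_read_lock_trace();				\
		} else {						\
			/* keep srcu and sched-rcu usage consistent */	\
			preempt_disable_notrace();			\
		}							\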

By the way, we named this "sleepable" tracepoint (with flag TRACEPOINT_MAYSLEEP),
but we are open to a better name. Would TRACEPOINT_MAYFAULT be more descriptive ?
(a "faultable" tracepoint sounds weird though)

Thanks,

Mathieu

> 
>> +rcu_read_lock_trace();  \
>> +} else {\
>> +        /* keep srcu and sched-rcu usage consistent */  \
>> +preempt_disable_notrace();  \
>> +   }   \

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] [PATCH] call_rcu: Fix race between rcu_barrier() and call_rcu_data_free()

2020-10-26 Thread Mathieu Desnoyers via lttng-dev
- On Oct 22, 2020, at 6:30 PM, paulmck paul...@kernel.org wrote:

> The current code can lose RCU callbacks at shutdown time, which can
> result in hangs.  This lossage can happen as follows:
> 
> o   A thread invokes call_rcu_data_free(), which executes up through
>the wake_call_rcu_thread().  At this point, the call_rcu_data
>structure has been drained of callbacks, but is still on the
>call_rcu_data_list.  Note that this thread does not hold the
>call_rcu_mutex.
> 
> o   Another thread invokes rcu_barrier(), which traverses the
>call_rcu_data_list under the protection of call_rcu_mutex,
>a list which still includes the above newly drained structure.
>This thread therefore adds a callback to the newly drained
>call_rcu_data structure.  It then releases call_rcu_mutex and
>enters a mystifying loop that does futex stuff.
> 
> o   The first thread finishes executing call_rcu_data_free(),
>which acquires call_rcu_mutex just long enough to remove the
>newly drained call_rcu_data structure from call_rcu_data_list.
>Which causes one of the rcu_barrier() invocation's callbacks to
>be leaked.
> 
> o   The second thread's rcu_barrier() invocation never returns
>resulting in a hang.
> 
> This commit therefore changes call_rcu_data_free() to acquire
> call_rcu_mutex before checking the call_rcu_data structure for callbacks.
> In the case where there are no callbacks, call_rcu_mutex is held across
> both the check and the removal from call_rcu_data_list, thus preventing
> rcu_barrier() from adding a callback in the meantime.  In the case where
> there are callbacks, call_rcu_mutex must be momentarily dropped across
> the call to get_default_call_rcu_data(), which can itself acquire
> call_rcu_mutex.  This momentary drop is not a problem because any
> callbacks that rcu_barrier() might queue during that period of time will
> be moved to the default call_rcu_data structure, and the lock will be
> held across the full time including moving those callbacks and removing
> the call_rcu_data structure that was passed into call_rcu_data_free()
> from call_rcu_data_list.
> 
> With this fix, a several-hundred-CPU test successfully completes more
> than 5,000 executions.  Without this fix, it fails within a few tens
> of executions.  Although the failures happen more quickly on larger
> systems, in theory this could happen on a single-CPU system, courtesy
> of preemption.

I agree with this fix, will merge in liburcu master, stable-0.12, and stable-2.11.
Out of curiosity, which test is hanging ? Is it a test which is part of the liburcu
tree or some out-of-tree test ? I wonder why we did not catch it in our CI [1].

Thanks,

Mathieu

[1] https://ci.lttng.org/view/Liburcu/

> 
> Signed-off-by: Paul E. McKenney 
> Cc: Mathieu Desnoyers 
> Cc: Stephen Hemminger 
> Cc: Alan Stern 
> Cc: Lai Jiangshan 
> Cc: 
> Cc: 
> 
> ---
> 
> urcu-call-rcu-impl.h |7 +--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/src/urcu-call-rcu-impl.h b/src/urcu-call-rcu-impl.h
> index b6ec6ba..18fd65a 100644
> --- a/src/urcu-call-rcu-impl.h
> +++ b/src/urcu-call-rcu-impl.h
> @@ -772,9 +772,13 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
>   while ((uatomic_read(&crdp->flags) & URCU_CALL_RCU_STOPPED) == 
> 0)
>   (void) poll(NULL, 0, 1);
>   }
> + call_rcu_lock(&call_rcu_mutex);
>   if (!cds_wfcq_empty(&crdp->cbs_head, &crdp->cbs_tail)) {
> - /* Create default call rcu data if need be */
> + call_rcu_unlock(&call_rcu_mutex);
> + /* Create default call rcu data if need be. */
> + /* CBs queued here will be handed to the default list. */
>   (void) get_default_call_rcu_data();
> + call_rcu_lock(&call_rcu_mutex);
>   __cds_wfcq_splice_blocking(&default_call_rcu_data->cbs_head,
>   &default_call_rcu_data->cbs_tail,
>   &crdp->cbs_head, &crdp->cbs_tail);
> @@ -783,7 +787,6 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
>   wake_call_rcu_thread(default_call_rcu_data);
>   }
> 
> - call_rcu_lock(&call_rcu_mutex);
>   cds_list_del(&crdp->list);
>   call_rcu_unlock(&call_rcu_mutex);

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [RFC PATCH 0/6] Sleepable tracepoints

2020-10-26 Thread Mathieu Desnoyers
- On Oct 26, 2020, at 8:05 AM, peter enderborg peter.enderb...@sony.com 
wrote:

> On 10/23/20 9:53 PM, Michael Jeanson wrote:
>> When invoked from system call enter/exit instrumentation, accessing
>> user-space data is a common use-case for tracers. However, tracepoints
>> currently disable preemption around iteration on the registered
>> tracepoint probes and invocation of the probe callbacks, which prevents
>> tracers from handling page faults.
>>
>> Extend the tracepoint and trace event APIs to allow specific tracer
>> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
>> called from sleepable context, and convert the system call enter/exit
>> instrumentation to sleepable tracepoints.
> 
> Will this not be a problem for analyse of the trace? It get two
> relevant times, one it when it is called and one when it returns.

It will depend on what the tracer chooses to do. If we call the side-effect
of what is being traced a "transaction" (e.g. actually opening a file
descriptor and adding it to a process'file descriptor table as the result
of an open(2) system call), we have to consider that already today the
timestamp which we get is either slightly before or after the actual
side-effect of the transaction in the kernel. That is true even without
being preemptable.

Sometimes it's not relevant to have a tracepoint before and after the
transaction, e.g. when all we care about is to know that the transaction
has successfully happened or not.

In the case of system calls, we have sys_enter and sys_exit to mark the
beginning and end of the "transaction". Whatever side-effects are done by
the system call happens in between.

I think the question here is whether it is relevant to know whether page
faults triggered by accessing system call input parameters need to
happen after we trace a "system call entry" event. If the tracers care,
then it would be up to them to first trace that "system call entry" event,
and have a separate event for the argument payload. But there are other
ways to identify whether page faults happen within the system call or
from user-space, for instance by using the instruction pointer associated
with the page fault. So when observing page faults happening before sys
enter, but associated with a kernel instruction pointer, a trace analysis
tool could theoretically figure out who is to blame for that page fault,
*if* it cares.

> 
> It makes things harder to correlate in what order things happen.

The alternative is to have partial payloads like LTTng does today for
system call arguments. If reading a string from userspace (e.g. open(2)
file name) requires to take a page fault, LTTng truncates the string. This
is pretty bad for automated analysis as well.

> 
> And handling of tracing of contexts that already are not preamptable?

The sleepable tracepoints are only meant to be used in contexts which can sleep.
For tracepoints placed in non-preemptible contexts, those should never take
a page fault to begin with.

> 
> Eg the same tracepoint are used in different places and contexts.

As far as considering that a given tracepoint "name" could be registered to
by both a sleepable and non-sleepable tracer probes, I would like to see an
actual use-case for this. I don't have any.

I can envision that some tracer code will want to be allowed to work in
both sleepable and non-sleepable context, e.g. take page faults in
sleepable context (and called from a sleepable tracepoint), but have a
truncation behavior when called from non-sleepable context. This can actually
be done by looking at the new "TRACEPOINT_MAYSLEEP" tp flag. Passing that
tp_flags to code shared between sleepable and non-sleepable probes would allow
the callee to know whether it can take a page fault or not.
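As a rough sketch of that pattern (the helper and its boolean parameter are
made up for illustration and are not part of the proposed API; both copy
routines come from linux/uaccess.h):

/* Shared tracer helper: full copy when faults are allowed, best effort otherwise. */
static long record_user_string(char *dst, const char __user *src,
			       long count, bool maysleep)
{
	if (maysleep) {
		might_fault();
		/* May take a page fault and sleep to bring the page in. */
		return strncpy_from_user(dst, src, count);
	}
	/* Non-sleepable caller: never faults, may truncate instead. */
	return strncpy_from_user_nofault(dst, src, count);
}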

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 6/6] tracing: use sched-RCU instead of SRCU for rcuidle tracepoints

2020-10-26 Thread Mathieu Desnoyers
- On Oct 26, 2020, at 4:20 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Fri, Oct 23, 2020 at 05:13:59PM -0400, Joel Fernandes wrote:
>> On Fri, Oct 23, 2020 at 03:53:52PM -0400, Michael Jeanson wrote:
>> > From: Mathieu Desnoyers 
>> > 
>> > Considering that tracer callbacks expect RCU to be watching (for
>> > instance, perf uses rcu_read_lock), we need rcuidle tracepoints to issue
>> > rcu_irq_{enter,exit}_irqson around calls to the callbacks. So there is
>> > no point in using SRCU anymore given that rcuidle tracepoints need to
>> > ensure RCU is watching. Therefore, simply use sched-RCU like normal
>> > tracepoints for rcuidle tracepoints.
>> 
>> High level question:
>> 
>> IIRC, doing this increases overhead for general tracing that does not use
>> perf, for 'rcuidle' tracepoints such as the preempt/irq enable/disable
>> tracepoints. I remember adding SRCU because of this reason.
>> 
>> Can the 'rcuidle' information not be pushed down further, such that perf does
>> it because it requires RCU to be watching, so that it does not effect, say,
>> trace events?
> 
> There's very few trace_.*_rcuidle() users left. We should eradicate them
> and remove the option. It's bugs to begin with.

I agree with Peter. Removing the trace_.*_rcuidle weirdness from the tracepoint
API and fixing all callers to ensure they trace from a context where RCU is
watching would simplify instrumentation of the Linux kernel, thus making it harder
for subtle bugs to hide and be unearthed only when tracing is enabled. This is
AFAIU the general approach Thomas Gleixner has been aiming for recently, and I
think it is a good thing.

So if we consider this our target, and that the current state of things is that
we need to have RCU watching around callback invocation, then removing the
dependency on SRCU seems like an overall simplification which does not regress
feature-wise nor speed-wise compared with what we have upstream today. The next
steps would then be to audit all rcuidle tracepoints and make sure the context
where they are placed has RCU watching already, so we can remove the tracepoint
rcuidle API. That would effectively remove the calls to rcu_irq_{enter,exit}_irqson
from the tracepoint code.

This is however beyond the scope of the proposed patch set.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression

2020-10-23 Thread Mathieu Desnoyers
- On Oct 23, 2020, at 1:37 AM, Xing Zhengjun zhengjun.x...@linux.intel.com 
wrote:

> On 10/22/2020 9:19 PM, Mathieu Desnoyers wrote:
>> - On Oct 21, 2020, at 9:54 PM, Xing Zhengjun 
>> zhengjun.x...@linux.intel.com
>> wrote:
>> [...]
>>> In fact, 0-day just copy the will-it-scale benchmark from the GitHub, if
>>> you think the will-it-scale benchmark has some issues, you can
>>> contribute your idea and help to improve it, later we will update the
>>> will-it-scale benchmark to the new version.
>> 
>> This is why I CC'd the maintainer of the will-it-scale github project, Anton
>> Blanchard.
>> My main intent is to report this issue to him, but I have not heard back from
>> him yet.
>> Is this project maintained ? Let me try to add his ozlabs.org address in CC.
>> 
>>> For this test case, if we bind the workload to a specific CPU, then it
>>> will hide the scheduler balance issue. In the real world, we seldom bind
>>> the CPU...
>> 
>> When you say that you bind the workload to a specific CPU, is that done
>> outside of the will-it-scale testsuite, thus limiting the entire testsuite
>> to a single CPU, or you expect that internally the will-it-scale 
>> context-switch1
>> test gets affined to a single specific CPU/core/hardware thread through use 
>> of
>> hwloc ?
>> 
> The later one.

Yeah, that's not currently true due to the affinity issue I pointed out. Both threads
used in that test-case are free to be scheduled either on the same HW thread, or on
two distinct HW threads on the same core.

Depending on the choice made by the scheduler for each individual test run,
this can lead to very large variation in the benchmark results.
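For reference, pinning each of the two threads explicitly removes that source
of variation. A minimal sketch follows; the CPU numbers are arbitrary and would
come from hwloc or sysfs topology information in practice.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single hardware thread. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/*
 * In each of the two context-switch1 threads, e.g.:
 *   pin_self_to_cpu(0);                        both on the same HW thread, or
 *   pin_self_to_cpu(0) and pin_self_to_cpu(1); two siblings of the same core,
 * so that every run measures the same placement.
 */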

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression

2020-10-22 Thread Mathieu Desnoyers
- On Oct 21, 2020, at 9:54 PM, Xing Zhengjun zhengjun.x...@linux.intel.com 
wrote:
[...]
> In fact, 0-day just copy the will-it-scale benchmark from the GitHub, if
> you think the will-it-scale benchmark has some issues, you can
> contribute your idea and help to improve it, later we will update the
> will-it-scale benchmark to the new version.

This is why I CC'd the maintainer of the will-it-scale github project, Anton Blanchard.
My main intent is to report this issue to him, but I have not heard back from him yet.
Is this project maintained ? Let me try to add his ozlabs.org address in CC.

> For this test case, if we bind the workload to a specific CPU, then it
> will hide the scheduler balance issue. In the real world, we seldom bind
> the CPU...

When you say that you bind the workload to a specific CPU, is that done
outside of the will-it-scale testsuite, thus limiting the entire testsuite
to a single CPU, or you expect that internally the will-it-scale context-switch1
test gets affined to a single specific CPU/core/hardware thread through use of
hwloc ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 1/2] rseq: Implement KTLS prototype for x86-64

2020-10-20 Thread Mathieu Desnoyers
- On Sep 29, 2020, at 4:13 AM, Florian Weimer fwei...@redhat.com wrote:

> * Mathieu Desnoyers:
> 
>>> So we have a bootstrap issue here that needs to be solved, I think.
>>
>> The one thing I'm not sure about is whether the vDSO interface is indeed
>> superior to KTLS, or if it is just the model we are used to.
>>
>> AFAIU, the current use-cases for vDSO is that an application calls into
>> glibc, which then calls the vDSO function exposed by the kernel. I wonder
>> whether the vDSO indirection is really needed if we typically have a glibc
>> function used as indirection ? For an end user, what is the benefit of vDSO
>> over accessing KTLS data directly from glibc ?
> 
> I think the kernel can only reasonably maintain a single userspace data
> structure.  It's not reasonable to update several versions of the data
> structure in parallel.

I disagree with your statement. Considering that the kernel needs to keep
ABI compatibility for whatever it exposes to user-space, claiming that it
should never update several versions of data structures exposed to user-space
in parallel means that once a data structure is exposed to user-space as ABI
in a certain way, it can never ever change in the future, even if we find
a better way to do things.

It makes more sense to allow multiple data structures to be updated
in parallel until older ones become deprecated/unused/irrelevant, at
which point those can be configured out at build time and eventually
phased out after years of deprecation. Having the ability to update multiple
data structures in user-space with replicated information is IMHO necessary
to allow creation of new/better accelerated ABIs.

> 
> This means that glibc would have to support multiple kernel data
> structures, and users might lose userspace acceleration after a kernel
> update, until they update glibc as well.  The glibc update should be
> ABI-compatible, but someone would still have to backport it, apply it to
> container images, etc.

No. If the kernel ever exposes a data structure to user-space as ABI,
then it needs to stay there, and not break userspace. Hence the need to
duplicate information provided to user-space if need be, so we can move
on to better ABIs without breaking the old ones.

> 
> What's worse, the glibc code would be quite hard to test because we
> would have to keep around multiple kernel versions to exercise all the
> different data structure variants.
> 
> In contrast, the vDSO code always matches the userspace data structures,
> is always updated at the same time, and tested together.  That looks
> like a clear win to me.

For cases where the overhead of vDSO is not an issue, I agree that it
makes things tidier than directly accessing a data structure. The
documentation of the ABI becomes much simpler as well.

> 
>> If we decide that using KTLS from a vDSO function is indeed a requirement,
>> then, as you point out, the thread_pointer is available as ABI, but we miss
>> the KTLS offset.
>>
>> Some ideas on how we could solve this: we could either make the KTLS
>> offset part of the ABI (fixed offset), or save the offset near the
>> thread pointer at a location that would become ABI. It would have to
>> be already populated with something which can help detect the case
>> where a vDSO is called from a thread which does not populate KTLS
>> though. Is that even remotely doable ?
> 
> I don't know.
> 
> We could decide that these accelerated system calls must only be called
> with a valid TCB.  That's unavoidable if the vDSO sets errno directly,
> so it's perhaps not a big loss.  It's also backwards-compatible because
> existing TCB-less code won't know about those new vDSO entrypoints.
> Calling into glibc from a TCB-less thread has always been undefined.
> TCB-less code would have to make direct, non-vDSO system calls, as today.
> 
> For discovering the KTLS offset, a per-process page at a fixed offset
> from the vDSO code (i.e., what real shared objects already do for global
> data) could store this offset.  This way, we could entirely avoid an ABI
> dependency.

Or as Andy mentioned, we would simply pass the ktls offset as argument to
the vDSO ? It seems simple enough. Would it fit all our use-cases including
errno ?

> 
> We'll see what will break once we have the correct TID after vfork. 8->
> glibc currently supports malloc-after-vfork as an extension, and
> a lot of software depends on it (OpenJDK, for example).

I am not sure I see how that is related to ktls ?

> 
>>> With the latter, we could
>>> directly expose the vDSO implementation to applications, assuming that
>>> we agree that the vDSO will not fail with ENOSYS to request fallback to
>

Re: [PATCH 1/3] sched: fix exit_mm vs membarrier (v4)

2020-10-20 Thread Mathieu Desnoyers
- On Oct 20, 2020, at 10:36 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote:
>> +void membarrier_update_current_mm(struct mm_struct *next_mm)
>> +{
>> +struct rq *rq = this_rq();
>> +int membarrier_state = 0;
>> +
>> +if (next_mm)
>> +membarrier_state = atomic_read(&next_mm->membarrier_state);
>> +if (READ_ONCE(rq->membarrier_state) == membarrier_state)
>> +return;
>> +WRITE_ONCE(rq->membarrier_state, membarrier_state);
>> +}
> 
> This is suspisioucly similar to membarrier_switch_mm().
> 
> Would something like so make sense?

Very much yes. Do you want me to re-send the series, or you
want to fold this in as you merge it ?

Thanks,

Mathieu

> 
> ---
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru
> 
> void membarrier_update_current_mm(struct mm_struct *next_mm)
> {
> - struct rq *rq = this_rq();
> - int membarrier_state = 0;
> -
> - if (next_mm)
> - membarrier_state = atomic_read(&next_mm->membarrier_state);
> - if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> - return;
> - WRITE_ONCE(rq->membarrier_state, membarrier_state);
> + membarrier_switch_mm(this_rq(), NULL, next_mm);
> }
> 
> static int membarrier_global_expedited(void)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d2621155393c..3d589c2ffd28 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq *rq,
>   struct mm_struct *prev_mm,
>   struct mm_struct *next_mm)
> {
> - int membarrier_state;
> + int membarrier_state = 0;
> 
>   if (prev_mm == next_mm)
>   return;
> 
> - membarrier_state = atomic_read(&next_mm->membarrier_state);
> + if (next_mm)
> + membarrier_state = atomic_read(&next_mm->membarrier_state);
> +
>   if (READ_ONCE(rq->membarrier_state) == membarrier_state)
>   return;

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

