On Thu, Jul 17, 2025 at 09:14:41AM -0400, Mathieu Desnoyers wrote:
> On 2025-07-16 18:54, Paul E. McKenney wrote:
> > On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
> > > > On Fri, 11 Jul 2025 10:05:26 -0700
> > > > "Paul E. McKenney" <paul...@kernel.org> wrote:
> > > > 
> > > > > This trace point will invoke rcu_read_unlock{,_notrace}(), which will
> > > > > note that preemption is disabled.  If rcutree.use_softirq is set and
> > > > > this task is blocking an expedited RCU grace period, it will directly
> > > > > invoke the non-notrace function raise_softirq_irqoff().  Otherwise,
> > > > > it will directly invoke the non-notrace function irq_work_queue_on().
> > > > 
> > > > Just to clarify some things: a function annotated with "notrace" simply
> > > > will not have the ftrace hook to that function, but that function may
> > > > very well have tracing triggered inside of it.
> > > > 
> > > > Functions with "_notrace" in their names (like preempt_disable_notrace())
> > > > should not have any tracing instrumentation inside of them (as Mathieu
> > > > stated), so that they can be used in the tracing infrastructure.
> > > > 
> > > > raise_softirq_irqoff() has a tracepoint inside of it.  If we have the
> > > > tracing infrastructure call that, and we happen to enable that
> > > > tracepoint, we will have:
> > > > 
> > > >   raise_softirq_irqoff()
> > > >     trace_softirq_raise()
> > > >       [..]
> > > >       raise_softirq_irqoff()
> > > >         trace_softirq_raise()
> > > >           [..]
> > > > 
> > > > Ad infinitum!
> > > > 
> > > > I'm not sure if that's what is being proposed or not, but I just wanted
> > > > to make sure everyone is aware of the above.
> > > 
> > > OK, I *think* I might actually understand the problem.  Maybe.
> > > 
> > > I am sure that the usual suspects will not be shy about correcting
> > > any misapprehensions in the following.  ;-)
> > > 
> > > My guess is that some users of real-time Linux would like to use BPF
> > > programs while still getting decent latencies out of their systems.
> > > (Not something I would have predicted, but then again, I was surprised
> > > some years back to see people with a 4096-CPU system complaining about
> > > 200-microsecond latency blows from RCU.)  And the BPF guys (now CCed)
> > > made some changes some years back to support this, perhaps most notably
> > > replacing some uses of preempt_disable() with migrate_disable().
> > > 
> > > Except that the current __DECLARE_TRACE() macro defeats this work
> > > for tracepoints by disabling preemption across the tracepoint call,
> > > which might well be a BPF program.  So we need to do something to
> > > __DECLARE_TRACE() to get the right sort of protection while still
> > > leaving preemption enabled.
> > > 
> > > One way of attacking this problem is to use preemptible RCU.
> > > The problem with this is that although one could construct a
> > > trace-safe version of rcu_read_unlock(), it would negate some
> > > optimizations that Lai Jiangshan worked so hard to put in place.
> > > Plus those optimizations also simplified the code quite a bit.
> > > Which is why I was pushing back so hard, especially given that
> > > I did not realize that real-time systems would be running BPF
> > > programs concurrently with real-time applications.  This meant
> > > that I was looking for a functional problem with the current
> > > disabling of preemption, and not finding it.
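> > > 
> > > For concreteness, the preemption disabling at issue lives in the
> > > __DO_TRACE() macro that __DECLARE_TRACE() expands into, which is
> > > shaped roughly like this (a heavily simplified excerpt, so please
> > > see include/linux/tracepoint.h for the real thing):
> > > 
> > > 	preempt_disable_notrace();
> > > 	__DO_TRACE_CALL(name, TP_ARGS(args));  /* Might be a BPF program. */
> > > 	preempt_enable_notrace();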
> > > 
> > > So another way of dealing with this is to use SRCU-fast, which is
> > > like SRCU, but dispenses with the smp_mb() calls and the redundant
> > > read-side array indexing.  Plus it is easy to make _notrace variants
> > > srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
> > > along with the requisite guards.
> > > 
> > > Re-introducing SRCU requires reverting most of e53244e2c893
> > > ("tracepoint: Remove SRCU protection"), and I have hacked together
> > > this and the prerequisites mentioned in the previous paragraph.
> > > 
> > > These are passing ridiculously light testing, but probably have at
> > > least their share of bugs.
> > > 
> > > But first, do I actually finally understand the problem?
> > 
> > OK, they pass somewhat less ridiculously moderate testing, though I
> > have not yet hit them over the head with the ftrace selftests.
> > 
> > So might as well post them.
> > 
> > Thoughts?
> 
> Your explanation of the problem context fits my understanding.
> 
> Note that I've mostly been pulled into this by Sebastian, who wanted
> to better understand how we could make the tracepoint instrumentation
> work with BPF probes that need to sleep due to locking.  Hence my
> original somewhat high-level desiderata.
> 
> I'm glad this seems to be converging towards a concrete solution.
> 
> There are two things I'm wondering:
> 
> 1) Would we want to always use srcu-fast (for both preempt and
> non-preempt kernels?), or is there any downside compared to
> preempt-off rcu?  (e.g., overhead?)
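Just so that we are discussing the same read side, I would expect the
tracepoint entry to end up looking something like this (a sketch only,
not the actual patch, assuming that the _notrace variants take and
return the same cookie as srcu_read_lock_fast(), and reusing the old
tracepoint_srcu srcu_struct from before e53244e2c893):

	struct srcu_ctr __percpu *scp;

	scp = srcu_read_lock_fast_notrace(&tracepoint_srcu);
	__DO_TRACE_CALL(name, TP_ARGS(args));	/* Preemptible probe. */
	srcu_read_unlock_fast_notrace(&tracepoint_srcu, scp);

Or the equivalent guard()-based form.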
For kernels built with CONFIG_PREEMPT_DYNAMIC=n and either
CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, non-preemptible
RCU would be faster.  I did consider this, but decided to keep the
initial patch simple.

> If the overhead is similar when actually used by tracers
> (I'm talking about an actual workload benchmark and not a
> microbenchmark), I would tend to err towards simplicity and
> to minimize the number of configurations to test, and use
> srcu-fast everywhere.

To this point, I was wondering whether it is still necessary to do
the call_rcu() stage, but left it in because that is the safe mistake
to make.  I am testing a fifth patch that removes the early-boot
deferral of call_srcu(), because call_srcu() now does exactly this
deferral internally.

> 2) I think I'm late to the party in reviewing srcu-fast, I'll
> go have a look :)

Looking forward to seeing what you come up with!

I deferred one further optimization, namely statically classifying
srcu_struct structures as intended for vanilla, _nmisafe(), or _fast()
use, or at least doing so at initialization time.  This would get
rid of the call to srcu_check_read_flavor_force() in
srcu_read_lock_fast(), srcu_read_unlock_fast(), and friends, or at
least tuck it under CONFIG_PROVE_RCU.  On my laptop, this saves an
additional 25%, though that 25% amounts to a big half of a nanosecond.

Thoughts?

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
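PS: In case it helps review, the initialization-time classification
could be as simple as a wrapper along these lines (hypothetical, in
no tree yet, and assuming srcu_check_read_flavor_force() keeps its
current role of force-setting the expected reader flavor):

	/*
	 * Hypothetical: mark an srcu_struct as _fast-only at init time,
	 * allowing readers to skip the runtime flavor check, or allowing
	 * that check to be confined to CONFIG_PROVE_RCU=y kernels.
	 */
	static inline int init_srcu_struct_fast(struct srcu_struct *ssp)
	{
		int ret = init_srcu_struct(ssp);

		if (!ret)
			srcu_check_read_flavor_force(ssp, SRCU_READ_FLAVOR_FAST);
		return ret;
	}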