On 2025-07-16 18:54, Paul E. McKenney wrote:
On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
On Fri, 11 Jul 2025 10:05:26 -0700
"Paul E. McKenney" <paul...@kernel.org> wrote:

This trace point will invoke rcu_read_unlock{,_notrace}(), which will
note that preemption is disabled.  If rcutree.use_softirq is set and
this task is blocking an expedited RCU grace period, it will directly
invoke the non-notrace function raise_softirq_irqoff().  Otherwise,
it will directly invoke the non-notrace function irq_work_queue_on().

Just to clarify some things: a function annotated with "notrace" simply
will not have an ftrace hook attached to it, but tracing may very well
be triggered inside of it.

Functions with "_notrace" in their names (like preempt_disable_notrace())
should not have any tracing instrumentation inside them (as Mathieu
stated), so that they can be used in the tracing infrastructure.

raise_softirq_irqoff() has a tracepoint inside of it. If we have the
tracing infrastructure call that, and we happen to enable that
tracepoint, we will have:

   raise_softirq_irqoff()
      trace_softirq_raise()
        [..]
          raise_softirq_irqoff()
             trace_softirq_raise()
                [..]
                  Ad infinitum!

I'm not sure if that's what is being proposed or not, but I just wanted
to make sure everyone is aware of the above.

OK, I *think* I might actually understand the problem.  Maybe.

I am sure that the usual suspects will not be shy about correcting any
misapprehensions in the following.  ;-)

My guess is that some users of real-time Linux would like to use BPF
programs while still getting decent latencies out of their systems.
(Not something I would have predicted, but then again, I was surprised
some years back to see people with a 4096-CPU system complaining about
200-microsecond latency blips from RCU.)  And the BPF guys (now CCed)
made some changes some years back to support this, perhaps most notably
replacing some uses of preempt_disable() with migrate_disable().

Except that the current __DECLARE_TRACE() macro defeats this work
for tracepoints by disabling preemption across the tracepoint call,
which might well be a BPF program.  So we need to do something to
__DECLARE_TRACE() to get the right sort of protection while still leaving
preemption enabled.

One way of attacking this problem is to use preemptible RCU.  The problem
with this is that although one could construct trace-safe versions
of rcu_read_unlock(), they would negate some optimizations that Lai
Jiangshan worked so hard to put in place.  Plus those optimizations
also simplified the code quite a bit.  Which is why I was pushing back
so hard, especially given that I did not realize that real-time systems
would be running BPF programs concurrently with real-time applications.
This meant that I was looking for a functional problem with the current
disabling of preemption, and not finding it.

So another way of dealing with this is to use SRCU-fast, which is
like SRCU, but dispenses with the smp_mb() calls and the redundant
read-side array indexing.  Plus it is easy to make _notrace variants
srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
along with the requisite guards.

Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
Remove SRCU protection"), and I have hacked together this and the
prerequisites mentioned in the previous paragraph.

These are passing ridiculously light testing, but probably have at
least their share of bugs.

But first, do I actually finally understand the problem?

OK, they pass somewhat less ridiculously moderate testing, though I have
not yet hit them over the head with the ftrace selftests.

So might as well post them.

Thoughts?

Your explanation of the problem context fits my understanding.

Note that I've mostly been pulled into this by Sebastian, who wanted
to understand better how we could make the tracepoint
instrumentation work with BPF probes that need to sleep due to
locking.  Hence my original, somewhat high-level desiderata.

I'm glad this seems to be converging towards a concrete solution.

There are two things I'm wondering:

1) Would we want to always use SRCU-fast (for both preempt and
   non-preempt kernels?), or is there any downside compared to
   preempt-off RCU (e.g., overhead)?

   If the overhead is similar when actually used by tracers
   (I'm talking about actual workload benchmark and not a
   microbenchmark), I would tend to err towards simplicity
   and to minimize the number of configurations to test, and
   use SRCU-fast everywhere.

2) I think I'm late to the party in reviewing SRCU-fast; I'll
   go have a look :)

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
