On 2025-07-11 13:05, Paul E. McKenney wrote:
On Fri, Jul 11, 2025 at 09:46:25AM -0400, Mathieu Desnoyers wrote:
On 2025-07-09 14:33, Paul E. McKenney wrote:
On Wed, Jul 09, 2025 at 10:31:14AM -0400, Mathieu Desnoyers wrote:
[...]

Joel's patch, which is currently slated for the upcoming merge window,
should take care of the endless-IRQ-work recursion problem.  So the
main remaining issue is how rcu_read_unlock_special() should go about
safely invoking raise_softirq_irqoff() and irq_work_queue_on() when in
notrace mode.

Are those two functions only needed from the outermost rcu_read_unlock()
within a thread? If so, then keeping track of the nesting level and
preventing those calls in nested contexts (per thread) should work.

You lost me on this one.

Yes, these are invoked only by the outermost rcu_read_unlock{,_notrace}().
And we already have a nesting counter, current->rcu_read_lock_nesting.

But AFAIU those are invoked after decrementing the nesting counter,
right? So any instrumentation call done within those functions may
end up entering a read-side critical section again.
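
To illustrate, here is a loose paraphrase of the current exit path (not
the exact kernel code, just the ordering I have in mind):

/*
 * Loose paraphrase of the current outermost-unlock path, for
 * discussion only.
 */
void rcu_read_unlock_paraphrase(void)
{
        struct task_struct *t = current;

        if (--t->rcu_read_lock_nesting == 0) {
                /*
                 * The nesting count is already back to 0 here, so any
                 * tracing done from within rcu_read_unlock_special()
                 * that itself takes and releases rcu_read_lock() looks
                 * like a fresh outermost critical section and can land
                 * back here recursively.
                 */
                if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
                        rcu_read_unlock_special(t);
        }
}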


However...

If the tracing invokes an outermost rcu_read_unlock{,_notrace}(), then in
some contexts we absolutely need to invoke the raise_softirq_irqoff()
and irq_work_queue_on() functions, both of which are notrace functions.

I guess you mean "both of which are non-notrace functions", otherwise
we would not be having this discussion.


Or are you telling me that it is OK for a rcu_read_unlock_notrace()
to directly call these non-notrace functions?

What I am getting at is that it may be OK for the outermost nesting
level of rcu_read_unlock_notrace() to call those non-notrace functions,
but only if we manage to keep track of that nesting level while those
non-notrace functions are called.

So AFAIU one issue here is that when the non-notrace functions are
called, the nesting level is back to 0 already.


- Keep some nesting count in the task struct to prevent calling the
     instrumentation when nested in notrace,

OK, for this one, is the idea to invoke some TBD RCU API when the tracing
exits the notrace region?  I could see that working.  But there would
need to be a guarantee that if the notrace API was invoked, a call to
this TBD RCU API would follow in short order.  And I suspect that
preemption (and probably also interrupts) would need to be disabled
across this region.

Not quite.

What I have in mind is to try to find the most elegant way to prevent
endless recursion of the irq work issued immediately on
rcu_read_unlock_notrace without slowing down most fast paths, and
ideally without too much code duplication.

I'm not entirely sure what would be the best approach though.

Joel's patch adjusts use of the rcu_data structure's ->defer_qs_iw_pending
flag, so that it is cleared not in the IRQ-work handler, but
instead in rcu_preempt_deferred_qs_irqrestore().  That prevents
rcu_read_unlock_special() from requeueing the IRQ-work handler until
after the previous request for a quiescent state has been satisfied.
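
Roughly, the resulting gating amounts to something like this (a sketch of
the intent only, not the actual patch):

/* In rcu_read_unlock_special(), when deferring the quiescent state: */
if (!rdp->defer_qs_iw_pending) {
        rdp->defer_qs_iw_pending = true;
        irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
}

/*
 * In rcu_preempt_deferred_qs_irqrestore(), once the deferred
 * quiescent state has actually been reported, allow requeueing:
 */
rdp->defer_qs_iw_pending = false;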

So my main concern is again safely invoking raise_softirq_irqoff()
and irq_work_queue_on() when in notrace mode.

Would the nesting counter (per thread) approach suffice for your use
case?

Over and above the t->rcu_read_lock_nesting that we already use?
As in only the outermost rcu_read_unlock{,_notrace}() will invoke
rcu_read_unlock_special().

OK, let's look at a couple of scenarios.

First, suppose that we apply Joel's patch above, and someone sets a trace
point in task context outside of any RCU read-side critical section.
Suppose further that this task is preempted in the tracepoint's RCU
read-side critical section, and that RCU priority boosting is applied.

This trace point will invoke rcu_read_unlock{,_notrace}(), which
will in turn invoke rcu_read_unlock_special(), which will in turn
note that preemption, interrupts, and softirqs are all enabled.
It will therefore directly invoke rcu_preempt_deferred_qs_irqrestore(),
a non-notrace function, which can in turn invoke all sorts of interesting
functions involving locking, the scheduler, ...
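
Paraphrased as code (not the exact implementation), the branch in
question looks roughly like this:

/* Paraphrase of the fully-enabled branch described above: */
static void rcu_read_unlock_special_paraphrase(struct task_struct *t)
{
        unsigned long flags;

        local_irq_save(flags);
        if (!preempt_count() && !irqs_disabled_flags(flags)) {
                /*
                 * Preemption, interrupts, and softirqs are all enabled,
                 * so report the deferred quiescent state immediately.
                 * This can take locks, deboost, and call into the
                 * scheduler, none of which is notrace.
                 */
                rcu_preempt_deferred_qs_irqrestore(t, flags);
                return;
        }
        /* Otherwise defer via softirq or IRQ work (second scenario below). */
        local_irq_restore(flags);
}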

Is this OK, or should I set some sort of tracepoint recursion flag?

Or somehow modify the semantics of t->rcu_read_lock_nesting, if at all
possible. Rather than decrementing it first and then, if it reaches 0,
invoking rcu_read_unlock_special(), it could perhaps invoke
rcu_read_unlock_special() when the nesting counter is _about to be
decremented from 1 to 0_, and only then decrement it to 0. This would
hopefully prevent recursion.
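
Something like the following sketch, purely illustrative:

void rcu_read_unlock_sketch(void)
{
        struct task_struct *t = current;

        if (t->rcu_read_lock_nesting == 1 &&
            unlikely(READ_ONCE(t->rcu_read_unlock_special.s))) {
                /*
                 * Still formally inside the outermost critical section,
                 * so a read-side critical section entered by whatever
                 * instrumentation runs below will see a nesting count
                 * greater than 1 and will not reach this branch again.
                 */
                rcu_read_unlock_special(t);
        }
        t->rcu_read_lock_nesting--;
}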

But I may be entirely misunderstanding the whole problem. If so,
please let me know!

And if for some reason it really needs to be decremented before
calling rcu_read_unlock_special, then we can have the following:
when exiting the outermost critical section, it could be decremented
from 2 to 1, then call rcu_read_unlock_special, after which it's
decremented to 0. The outermost read lock increment would have to
be adapted accordingly. But this would add overhead on the fast-paths,
which may be frowned upon.

The idea here is to keep tracking the fact that we are within the
execution of rcu_read_unlock_special(), so that it is not called again
recursively, even though we are technically not nested within a
read-side critical section anymore.


Second, suppose that we apply Joel's patch above, and someone sets a trace
point in task context outside of an RCU read-side critical section, but in
a preemption-disabled region of code.  Suppose further that this code is
delayed, perhaps due to a flurry of interrupts, so that a scheduling-clock
interrupt sets t->rcu_read_unlock_special.b.need_qs to true.

This trace point will invoke rcu_read_unlock{,_notrace}(), which will
note that preemption is disabled.  If rcutree.use_softirq is set and
this task is blocking an expedited RCU grace period, it will directly
invoke the non-notrace function raise_softirq_irqoff().  Otherwise,
it will directly invoke the non-notrace function irq_work_queue_on().

Is this OK, or should I set some sort of tracepoint recursion flag?

Invoking instrumentation from the implementation of instrumentation
is a good recipe for endless recursion, so we'd need to check for
recursion somehow there as well AFAIU.
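
For illustration only, with made-up names (nothing like
rcu_unlock_defer_recursion or use_softirq_path exists today), the shape
of such a check could be:

/* Hypothetical recursion guard; flag and condition names are made up. */
static void rcu_defer_qs_guarded(struct task_struct *t, struct rcu_data *rdp,
                                 bool use_softirq_path)
{
        if (READ_ONCE(t->rcu_unlock_defer_recursion))
                return;         /* Already in this slow path, do not recurse. */
        WRITE_ONCE(t->rcu_unlock_defer_recursion, true);
        if (use_softirq_path)   /* rcutree.use_softirq && blocking an expedited GP */
                raise_softirq_irqoff(RCU_SOFTIRQ);
        else
                irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
        WRITE_ONCE(t->rcu_unlock_defer_recursion, false);
}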


There are other scenarios, but they require interrupts to be disabled
across the rcu_read_unlock{,_notrace}(), but to have been enabled somewhere
in the just-ended RCU read-side critical section.  It does not look to
me like tracing does this.  But I might be missing something.  If so,
we have more scenarios to think through.  ;-)

I don't see a good use-case for that kind of scenario though. But I may
simply be lacking imagination.


There are probably other possible approaches I am missing, each with
their respective trade-offs.

I am pretty sure that we also have some ways to go before we have the
requirements fully laid out, for that matter.  ;-)

Could you please tell me where in the current tracing code these
rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?

AFAIU here:

include/linux/tracepoint.h:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto) [...]

          static inline void __do_trace_##name(proto)                     \
          {                                                               \
                  if (cond) {                                             \
                          guard(preempt_notrace)();                       \
                          __DO_TRACE_CALL(name, TP_ARGS(args));           \
                  }                                                       \
          }                                                               \
          static inline void trace_##name(proto)                          \
          {                                                               \
                  if (static_branch_unlikely(&__tracepoint_##name.key))   \
                          __do_trace_##name(args);                        \
                  if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {             \
                          WARN_ONCE(!rcu_is_watching(),                   \
                                    "RCU not watching for tracepoint");   \
                  }                                                       \
          }

and

#define __DECLARE_TRACE_SYSCALL(name, proto, args, data_proto) [...]

          static inline void __do_trace_##name(proto)                     \
          {                                                               \
                  guard(rcu_tasks_trace)();                               \
                  __DO_TRACE_CALL(name, TP_ARGS(args));                   \
          }                                                               \
          static inline void trace_##name(proto)                          \
          {                                                               \
                  might_fault();                                          \
                  if (static_branch_unlikely(&__tracepoint_##name.key))   \
                          __do_trace_##name(args);                        \
                  if (IS_ENABLED(CONFIG_LOCKDEP)) {                       \
                          WARN_ONCE(!rcu_is_watching(),                   \
                                    "RCU not watching for tracepoint");   \
                  }                                                       \
          }

I am not seeing a guard(rcu)() in here, only guard(preempt_notrace)()
and guard(rcu_tasks_trace)().  Or is the idea to move the first to
guard(rcu_notrace)() in order to improve PREEMPT_RT latency?

AFAIU the goal here is to turn the guard(preempt_notrace)() into a
guard(rcu_notrace)() because the preempt-off critical sections don't
agree with BPF.
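
Concretely, something along these lines; neither the rcu_notrace guard
nor the notrace read-side API exists yet, so this is only the shape of
the change:

/* Hypothetical guard built on the proposed notrace read-side API: */
DEFINE_LOCK_GUARD_0(rcu_notrace,
                    rcu_read_lock_notrace(),
                    rcu_read_unlock_notrace())

/* ... which would let the tracepoint fast path become: */
static inline void __do_trace_##name(proto)
{
        if (cond) {
                guard(rcu_notrace)();
                __DO_TRACE_CALL(name, TP_ARGS(args));
        }
}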

OK, got it, thank you!

The combination of BPF and CONFIG_PREEMPT_RT certainly has provided at
least its share of entertainment, that is for sure.  ;-)

There is indeed no shortage of entertainment when combining those rather
distinct sets of requirements. :)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
