On 11/06/26 13:55, Tomas Glozar wrote:
> On Thu, 11 Jun 2026 at 12:31, Valentin Schneider
> <[email protected]> wrote:
>> >
>> > Isn't that precisely what the ipi tracepoints used by this
>> > implementation (ipi:ipi_send_cpu) are for?
>> >
>>
>> Well, these catch the emission of the IPI, which is great for investigation
>> - slap a stacktrace trigger and you (most of the time) get the source of
>> your interference.
>>
>> However Crystal's point is that on x86 (and I assume other archs) receiving
>> & handling these IPIs is "special" and doesn't go through the generic irq
>> subsystem and thus has to be tracked separately, which is why osnoise has
>> this fairly lengthy osnoise_arch_register() thing.
>>
>
> Ah, right. This is not IPI specific, though, IIUC - Intel also has
> other IRQs that have to be traced using Intel-specific trace points,
> like irq_vectors:local_timer, which is also handled in
> osnoise_arch_register(). On ARM from what I recall, most (all?) IRQs
> are traced with irq:* tracepoints.
>
> So there are two parts to this:
>
> - Detecting interference from IPIs firing as osnoise:irq_noise (to be
> analyzed by timerlat auto analysis, and also will appear by default in
> trace output if enabled, regardless of the tool, as all osnoise:*
> tracepoints are enabled there). This is done locally using the already
> existing path (no race hazard), but requires arch-specific detection.
>
> - Counting IPIs when they are being sent. This is the new feature, and
> the count is being recorded in osnoise_sample.
>
> I guess that means that if there were a generic IPI interface, it
> would be easier to use that for IPI counting, as the event would be
> CPU-local? As you say, for tracing of the IPI source, the sending
> tracepoints are better, and you can already dump their stack trace
> with --event/--trigger. timerlat auto-analysis could be extended to
> connect the specific IPI to the IRQ noise and display its stack trace
> automatically, instead of manually analyzing the trace output.
>

Right, at least for the smp_call stuff (which includes irq_work) we can
leverage:

  csd_queue_cpu (on the sending CPU)
  csd_function_entry (on the receiving CPU)

by indexing on the @csd address; once upon a time [1] I had this:

  $ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs:src=common_cpu' >\
       /sys/kernel/tracing/events/csd/csd_queue_cpu/trigger
  $ echo 'csd_latency unsigned int src_cpu; '\
       'unsigned int dst_cpu; '\
       'unsigned long csd; u64 time' >\
       /sys/kernel/tracing/synthetic_events

  $ echo 'hist:keys=common_cpu,csd.hex:time=common_timestamp.usecs-$ts:onmatch(csd.csd_queue_cpu).trace(csd_latency,$src,common_cpu,csd,$time)' >\
       /sys/kernel/tracing/events/csd/csd_function_entry/trigger

  $ trace-cmd record -e 'synthetic:csd_latency' hackbench
  $ trace-cmd report
  <idle>-0     [001]   115.236810: csd_latency:          src_cpu=7, dst_cpu=1, csd=18446612682588476192, time=134
  <idle>-0     [000]   115.240676: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=103
  <idle>-0     [009]   115.241320: csd_latency:          src_cpu=7, dst_cpu=9, csd=18446612682143963384, time=83
  <idle>-0     [007]   115.242817: csd_latency:          src_cpu=8, dst_cpu=7, csd=18446612682150759032, time=93
  <idle>-0     [005]   115.247802: csd_latency:          src_cpu=7, dst_cpu=5, csd=18446612682144441144, time=114
  <idle>-0     [005]   115.271775: csd_latency:          src_cpu=7, dst_cpu=5, csd=18446612682144441144, time=151
  <idle>-0     [000]   115.279620: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=87
  <idle>-0     [000]   115.281727: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=101

[1]: https://lore.kernel.org/lkml/[email protected]/
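For illustration only, the matching those two triggers perform can be
sketched in userspace as below. This is not how the histogram code is
implemented; the Event and Latency types and the csd_latencies() helper
are hypothetical names, and the events are assumed to have already been
parsed out of a trace. A queue event is remembered under
(destination CPU, csd address) and consumed by the first
csd_function_entry seen on that CPU for the same csd.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical, already-parsed trace event (not a kernel structure)."""
    name: str    # "csd_queue_cpu" or "csd_function_entry"
    ts_us: int   # timestamp in microseconds
    cpu: int     # CPU that logged the event (common_cpu)
    target: int  # destination CPU of a queue event; ignored for entry events
    csd: int     # address of the call_single_data being queued or run


class Latency(NamedTuple):
    src_cpu: int
    dst_cpu: int
    csd: int
    time_us: int


def csd_latencies(events: Iterable[Event]) -> List[Latency]:
    """Pair each queue event with the matching function entry."""
    # Key: (destination CPU, csd address) -> (queue timestamp, sending CPU)
    pending: Dict[Tuple[int, int], Tuple[int, int]] = {}
    out: List[Latency] = []
    for ev in sorted(events, key=lambda e: e.ts_us):
        if ev.name == "csd_queue_cpu":
            pending[(ev.target, ev.csd)] = (ev.ts_us, ev.cpu)
        elif ev.name == "csd_function_entry":
            match = pending.pop((ev.cpu, ev.csd), None)
            if match is None:
                # Entry without a recorded queue event (e.g. queued before
                # tracing started): nothing to measure.
                continue
            queued_ts, src_cpu = match
            out.append(Latency(src_cpu, ev.cpu, ev.csd, ev.ts_us - queued_ts))
    return out
```

Keying on the destination CPU as well as the csd address matters because
the same csd can be reused for later calls, as the repeated addresses in
the output above suggest.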

I believe you're right that leveraging this would be useful for
timerlat-aa; I'll add it to my todolist :-)

>> >> Isn't this racy to do from a different CPU?  Both in terms of the
>> >> counter, and the timing of the increment relative to when the IPI is
>> >> actually received.  Not necessarily a huge deal if you only care about
>> >> zero versus bignum, but still.  At least worth a comment, if we go with
>> >> this approach.
>> >>
>> >
>> > I also think it's a bit confusing, especially as the other accesses to
>> > osn_var are cpu-local, but here, "cpu" is the *target* CPU, not the
>> > current CPU. Not sure how expensive it would be to do atomic_add for
>> > that, at least it's something to consider.
>> >
>>
>> I suppose that could be an argument for doing that stat aggregation in
>> userspace osnoise - event handlers are run after the fact via
>> tracefs_iterate_raw_events(), it's all inherently slower since it's just
>> increments of one (one per handled event) but it's also all done in
>> userspace on a control thread and doesn't bog down the kernelspace.
>>
>
> You can also do per-cpu counters in-kernel and sum them in the end,
> but that would take cpus^2 space (indexed by [current_cpu,
> target_cpu]). The question is whether there could be enough samples to
> overload sample collection (like it happens for timerlat, which
> collects data in-kernel using BPF instead).
>
> In-kernel counting can be tested with "--event ipi:ipi_send_cpu
> --trigger hist:key=cpu" - IIRC, tracefs histograms use atomic
> operations (via tracing_map) to protect the entries from races in
> multi-threaded access. Of course, that is inferior to what the patchset
> implements, as it doesn't record which osnoise cycle the IPI was sent
> in, nor can it record cpumask IPIs.
>

I suppose I'll need to go do some benchmarking, but I'm starting to lean
towards the side of atomic incs for IPI counts being okay considering the
sort of latencies we track.
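
For comparison, the per-cpu alternative discussed above (counters
indexed by [current_cpu, target_cpu] and summed at the end) could look
roughly like the sketch below. It is a userspace model only: CPUs are
stood in for by Python threads, and IpiCounters, record_send(),
received_by() and run_senders() are hypothetical names, not anything
from the patchset. Each sender only ever writes its own row, so no
atomic update is needed, at the cost of cpus^2 storage.

```python
import threading
from typing import List


class IpiCounters:
    """cpus x cpus matrix of send counts, indexed [sender][target]."""

    def __init__(self, nr_cpus: int) -> None:
        self.nr_cpus = nr_cpus
        self.sent: List[List[int]] = [[0] * nr_cpus for _ in range(nr_cpus)]

    def record_send(self, sender: int, target: int) -> None:
        # Only the thread modelling `sender` touches this row, so a plain
        # increment cannot lose updates.
        self.sent[sender][target] += 1

    def received_by(self, target: int) -> int:
        # Reader side: sum one column across all senders.
        return sum(row[target] for row in self.sent)


def run_senders(counters: IpiCounters, sends_per_cpu: int) -> None:
    """Have every modelled CPU send IPIs round-robin to the other CPUs."""

    def sender_loop(sender: int) -> None:
        others = counters.nr_cpus - 1
        for i in range(sends_per_cpu):
            # Offsets 1..others, so a CPU never targets itself.
            target = (sender + 1 + i % others) % counters.nr_cpus
            counters.record_send(sender, target)

    threads = [
        threading.Thread(target=sender_loop, args=(cpu,))
        for cpu in range(counters.nr_cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A single shared per-target counter avoids the cpus^2 space but then
needs the atomic increment, which is the trade-off the benchmarking
would have to settle.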

>
> Tomas

