Hi Steve,

On Thu, May 08, 2025 at 12:03:21PM -0400, Steven Rostedt wrote:
> On Thu, 24 Apr 2025 12:25:42 -0400
> Steven Rostedt <rost...@goodmis.org> wrote:
> 
> > +static void perf_event_callchain_deferred(struct callback_head *work)
> > +{
> > +	struct perf_event *event = container_of(work, struct perf_event,
> > pending_unwind_work);
> > +	struct perf_callchain_deferred_event deferred_event;
> > +	u64 callchain_context = PERF_CONTEXT_USER;
> > +	struct unwind_stacktrace trace;
> > +	struct perf_output_handle handle;
> > +	struct perf_sample_data data;
> > +	u64 nr;
> > +
> > +	if (!event->pending_unwind_callback)
> > +		return;
> > +
> > +	if (unwind_deferred_trace(&trace) < 0)
> > +		goto out;
> > +
> > +	/*
> > +	 * All accesses to the event must belong to the same implicit RCU
> > +	 * read-side critical section as the ->pending_unwind_callback reset.
> > +	 * See comment in perf_pending_unwind_sync().
> > +	 */
> > +	guard(rcu)();
> > +
> > +	if (!current->mm)
> > +		goto out;
> > +
> > +	nr = trace.nr + 1; /* '+1' == callchain_context */
> 
> Hi Namhyung,
> 
> Talking with Beau about how Microsoft does their own deferred tracing, I
> wonder if the timestamp approach would be useful.
> 
> This is where a timestamp is taken at the first request for a deferred
> trace, and this is recorded in the trace when it happens. It basically
> states that "this trace is good up until the given timestamp".
> 
> The rationale for this is for lost events. Let's say you have:
> 
> <task enters kernel>
> Request deferred trace
> 
> <buffer fills up and events start to get lost>
> 
> Deferred trace happens (but is dropped due to buffer being full)
> 
> <task exits kernel>
> 
> <task enters kernel again>
> Request deferred trace (Still dropped due to buffer being full)
> 
> <Reader catches up and buffer is free again>
> 
> Deferred trace happens (this time it is recorded)
> <task exits kernel>
> 
> How would user space know that the deferred trace that was recorded doesn't
> go with the request (and kernel stack trace) that was done initially?
Right, this is a problem.

> 
> If we add a timestamp, then it would look like:
> 
> <task enters kernel>
> Request deferred trace
> [Record timestamp]
> 
> <buffer fills up and events start to get lost>
> 
> Deferred trace happens with timestamp (but is dropped due to buffer being
> full)
> 
> <task exits kernel>
> 
> <task enters kernel again>
> Request deferred trace (Still dropped due to buffer being full)
> [Record timestamp]
> 
> <Reader catches up and buffer is free again>
> 
> Deferred trace happens with timestamp (this time it is recorded)
> <task exits kernel>
> 
> Then user space will look at the timestamp that was recorded and know that
> it's not for the initial request, because the timestamp of the kernel
> stack trace was before the timestamp of the user space stacktrace and
> therefore is not valid for the kernel stacktrace.

IIUC the deferred stacktrace will have the timestamp of the first
request, right?

> 
> The timestamp would become zero when exiting to user space. The first
> request will add it but would need a cmpxchg to do so, and if the cmpxchg
> fails, it then needs to check if the one recorded is before the current
> one, and if it isn't it still needs to update the timestamp (this is to
> handle races with NMIs).

Yep, it needs to maintain an accurate first timestamp.

> 
> Basically, the timestamp would replace the cookie method.
> 
> Thoughts?

Sounds good to me. You'll need to add it to the
PERF_RECORD_DEFERRED_CALLCHAIN record. Probably it should check if
sample_type has PERF_SAMPLE_TIME. It'd work along with PERF_SAMPLE_TID
(which will be added by the perf tools anyway).

Thanks,
Namhyung

> 
> > +
> > +	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> > +	deferred_event.header.misc = PERF_RECORD_MISC_USER;
> > +	deferred_event.header.size = sizeof(deferred_event) + (nr *
> > sizeof(u64));
> > +
> > +	deferred_event.nr = nr;
> > +
> > +	perf_event_header__init_id(&deferred_event.header, &data, event);
> > +
> > +	if (perf_output_begin(&handle, &data, event,
> > deferred_event.header.size))
> > +		goto out;
> > +
> > +	perf_output_put(&handle, deferred_event);
> > +	perf_output_put(&handle, callchain_context);
> > +	perf_output_copy(&handle, trace.entries, trace.nr * sizeof(u64));
> > +	perf_event__output_id_sample(event, &handle, &data);
> > +
> > +	perf_output_end(&handle);
> > +
> > +out:
> > +	event->pending_unwind_callback = 0;
> > +	local_dec(&event->ctx->nr_no_switch_fast);
> > +	rcuwait_wake_up(&event->pending_unwind_wait);
> > +}
> > +
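
PS. For the cmpxchg part, something like the sketch below is roughly what
I'd imagine (completely untested; the function name, the per-task u64 field
passed in as 'ts', and local_clock() as the time source are my assumptions,
not something from your patch). It keeps the earliest timestamp since the
task entered the kernel, even if an NMI races in between reading the clock
and the cmpxchg:

#include <linux/atomic.h>	/* cmpxchg64() */
#include <linux/compiler.h>	/* READ_ONCE() */
#include <linux/sched/clock.h>	/* local_clock() */

/*
 * Sketch only: 'ts' is a hypothetical per-task field that is zeroed when
 * the task returns to user space.  Every deferred unwind request stamps
 * it, and the earliest timestamp since kernel entry wins.
 */
static u64 deferred_request_timestamp(u64 *ts)
{
	u64 now = local_clock();
	u64 old, cur;

	cur = READ_ONCE(*ts);
	for (;;) {
		/* An earlier (or equal) request already stamped it. */
		if (cur && cur <= now)
			return cur;

		/* Unset, or an NMI stored a later value: install ours. */
		old = cur;
		cur = cmpxchg64(ts, old, now);
		if (cur == old)
			return now;
	}
}

The value returned here would go into the deferred callchain record, and
user space would only attach that callchain to samples whose timestamp is
not older than it, so a request whose deferred trace got lost can't be
matched with a later one.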