On Thu Feb 5, 2026 at 6:47 AM PST, Steven Rostedt wrote: > On Sun, 1 Feb 2026 18:58:38 -0800 > Colin Lord <[email protected]> wrote: > > I was about to send this as part of my fixes, but taking a closer look at > it, I have more questions. > >> The get_sample() function in the hwlat tracer assumes the caller holds >> hwlat_data.lock, but this is not actually happening. The result is >> unprotected data access to hwlat_data, and in per-cpu mode can result in >> false sharing. The false sharing can cause false positive latency > > BTW, what exactly do you mean by "false sharing"? >
>> events, since the sample_width member is involved and gets read as part >> of the main latency detection loop. > > I'm trying to figure out why the change in sample_width would cause any > issue. > You described it below, but yes, by false sharing I am referring to cache latencies that occur when reading sample_width. In this case, it's primarily due to hwlat_data.count getting modified. Since count and sample_width are just 8B apart from each other, they'll often share a cache line, so one thread incrementing count due to a latency event causes all other running threads to fetch the updated cache line when reading sample_width. On most systems I tested, this doesn't cause high enough latency to show up, but I guess the system mentioned in the commit message has a slow interconnect between the two sockets. Combined with large numbers of threads, a single latency event would cause a cascade of cache related latencies. Another test I did to verify this is what was happening on that system was setting tracing_cpumask to only contain CPUs on the same socket. If I ran 32 threads on the same socket, there were far fewer latency events compared to running 16 on one socket + 16 on the other (same number of total threads just different distribution). >> >> Convert hwlat_data.count to atomic64_t so it can be safely accessed >> without locking, and prevent false sharing by pulling sample_width into >> a local variable. > > The above still makes sense, but this only effects the seqnum, which may > show either duplicate or skipped numbers. But shouldn't affect any of the > other data. > Yes, the modifications made in this commit to hwlat_data.count do not affect the latency. The change is related to the rest of the commit in the sense that it is addressing the incorrect assumption that hwlat_data.lock is held during get_sample(), but it would also make sense to put the hwlat_data.count changes in a separate commit. I could split it up if you'd prefer that. For what it's worth, I did also see many cases of duplicate/skipped sequence numbers on the machine that was affected by this latency issue, but I wasn't sure how detailed I should make my commit message. >> >> One system this was tested on was a dual socket server with 32 CPUs on >> each numa node. With settings of 1us threshold, 1000us width, and >> 2000us window, this change reduced the number of latency events from >> 500 per second down to approximately 1 event per minute. Some machines >> tested did not exhibit measurable latency from the false sharing. > > Is this because the read of hwlat_data.sample_width is a global variable > and could possibly be causing a cache hit latency that shows up on large > machines? > > Is that what you mean by "false sharing"? > > Oh, and subjects for the tracing subsystem should start with a capital, and > should be something like: > > tracing: Fix get_sample() in hwlat from ... Got it, I can make that adjustment and resend. Would you also like me to modify the commit message with any of the above clarifications? thanks, Colin > -- Steve > > >> >> Signed-off-by: Colin Lord <[email protected]> >> --- >> Changes in v2: >> - convert hwlat_data.count to atomic64_t >> - leave irqs_disabled block where it originally was, outside of >> get_sample() >> >> Thanks for the v1 review Steve, have updated and retested. >> >> cheers, >> Colin >> >> kernel/trace/trace_hwlat.c | 15 +++++++-------- >> 1 file changed, 7 insertions(+), 8 deletions(-) >> >> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c >> index 2f7b94e98317..3fe274b84f1c 100644 >> --- a/kernel/trace/trace_hwlat.c >> +++ b/kernel/trace/trace_hwlat.c >> @@ -102,9 +102,9 @@ struct hwlat_sample { >> /* keep the global state somewhere. */ >> static struct hwlat_data { >> >> - struct mutex lock; /* protect changes */ >> + struct mutex lock; /* protect changes */ >> >> - u64 count; /* total since reset */ >> + atomic64_t count; /* total since reset */ >> >> u64 sample_window; /* total sampling window (on+off) */ >> u64 sample_width; /* active sampling portion of window */ >> @@ -193,8 +193,7 @@ void trace_hwlat_callback(bool enter) >> * get_sample - sample the CPU TSC and look for likely hardware latencies >> * >> * Used to repeatedly capture the CPU TSC (or similar), looking for >> potential >> - * hardware-induced latency. Called with interrupts disabled and with >> - * hwlat_data.lock held. >> + * hardware-induced latency. Called with interrupts disabled. >> */ >> static int get_sample(void) >> { >> @@ -204,6 +203,7 @@ static int get_sample(void) >> time_type start, t1, t2, last_t2; >> s64 diff, outer_diff, total, last_total = 0; >> u64 sample = 0; >> + u64 sample_width = READ_ONCE(hwlat_data.sample_width); >> u64 thresh = tracing_thresh; >> u64 outer_sample = 0; >> int ret = -1; >> @@ -267,7 +267,7 @@ static int get_sample(void) >> if (diff > sample) >> sample = diff; /* only want highest value */ >> >> - } while (total <= hwlat_data.sample_width); >> + } while (total <= sample_width); >> >> barrier(); /* finish the above in the view for NMIs */ >> trace_hwlat_callback_enabled = false; >> @@ -285,8 +285,7 @@ static int get_sample(void) >> if (kdata->nmi_total_ts) >> do_div(kdata->nmi_total_ts, NSEC_PER_USEC); >> >> - hwlat_data.count++; >> - s.seqnum = hwlat_data.count; >> + s.seqnum = atomic64_inc_return(&hwlat_data.count); >> s.duration = sample; >> s.outer_duration = outer_sample; >> s.nmi_total_ts = kdata->nmi_total_ts; >> @@ -832,7 +831,7 @@ static int hwlat_tracer_init(struct trace_array *tr) >> >> hwlat_trace = tr; >> >> - hwlat_data.count = 0; >> + atomic64_set(&hwlat_data.count, 0); >> tr->max_latency = 0; >> save_tracing_thresh = tracing_thresh; >> >> >> base-commit: 24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7
