From: Colin Lord <[email protected]>

[ Upstream commit f743435f988cb0cf1f521035aee857851b25e06d ]
The get_sample() function in the hwlat tracer assumes the caller holds
hwlat_data.lock, but this is not actually happening. The result is
unprotected data access to hwlat_data, which in per-cpu mode can result
in false sharing that may show up as false positive latency events.

The specific case of false sharing observed was primarily between
hwlat_data.sample_width and hwlat_data.count. These are separated by
just 8B and are therefore likely to share a cache line. When one thread
modifies count, the cache line is in a modified state, so when other
threads read sample_width in the main latency detection loop, they
fetch the modified cache line. On some systems, the fetch itself may be
slow enough to count as a latency event, which could set up a
self-reinforcing cycle of latency events: each event increments count,
which then causes more latency events, continuing the cycle.

The other result of the unprotected data access is that
hwlat_data.count can end up with duplicate or missed values, which was
observed on some systems in testing.

Convert hwlat_data.count to atomic64_t so it can be safely modified
without locking, and prevent false sharing by pulling sample_width into
a local variable.

One system this was tested on was a dual-socket server with 32 CPUs on
each NUMA node. With settings of 1us threshold, 1000us width, and
2000us window, this change reduced the number of latency events from
500 per second down to approximately 1 event per minute. Some machines
tested did not exhibit measurable latency from the false sharing.

Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Colin Lord <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

## Analysis of the Commit

### 1. Commit Message Analysis

The commit message is very detailed and clearly describes:

- **The bug**: `get_sample()` assumes `hwlat_data.lock` is held, but it is not actually held. This leads to unprotected data access and false sharing in per-cpu mode.
- **The specific false sharing**: `hwlat_data.sample_width` and `hwlat_data.count` are 8 bytes apart and likely share a cache line. When one thread modifies `count`, other threads reading `sample_width` in the latency detection loop fetch the modified cache line, causing measurable latency.
- **The self-reinforcing cycle**: The latency from the cache line fetch triggers a latency event, which increments `count`, which causes more cache line invalidation, which causes more latency events: a vicious cycle.
- **The data race**: `hwlat_data.count` can end up with duplicate or missed values due to unprotected concurrent access.
- **Concrete test results**: On a dual-socket server with 32 CPUs per NUMA node, latency events dropped from 500 per second to roughly 1 per minute.

Keywords: "false sharing", "false positive latency events", "unprotected data access", "duplicate or missed values".
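The false-sharing mechanism described above can be shown with a minimal userspace sketch. This is an illustration under assumed names and layout, not the kernel code: two 8-byte fields that land on the same 64-byte cache line, one thread continually dirtying the line through `count` while another re-reads `sample_width` in a hot loop, mirroring the pre-fix sampling loop.

```c
/*
 * Minimal false-sharing sketch; illustrative only, not the kernel code.
 * "count" and "sample_width" are 8 bytes apart (sample_window sits between
 * them), so with a typical 64-byte cache line they share a line: every
 * increment of count forces readers of sample_width to refetch the line.
 *
 * Build: gcc -O2 -pthread false_sharing.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct shared_data {
	uint64_t count;         /* bumped on every "latency event"       */
	uint64_t sample_window; /* the 8 bytes separating the two fields */
	uint64_t sample_width;  /* re-read in the sampling hot loop      */
};

static struct shared_data data = { .sample_width = 1000 };
static volatile int stop;

static void *writer(void *arg)
{
	(void)arg;
	while (!stop)   /* keeps dirtying the shared cache line */
		__atomic_add_fetch(&data.count, 1, __ATOMIC_RELAXED);
	return NULL;
}

static void *reader(void *arg)
{
	uint64_t total = 0;

	(void)arg;
	while (!stop)   /* fresh load each pass, as in the pre-fix loop */
		total += __atomic_load_n(&data.sample_width, __ATOMIC_RELAXED);
	return (void *)(uintptr_t)total;   /* keep the loads observable */
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	sleep(1);
	stop = 1;
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	printf("count reached %llu\n", (unsigned long long)data.count);
	return 0;
}
```

The fix corresponds to hoisting the `sample_width` load out of the loop into a local variable, so the sampling loop no longer touches the cache line that `count` keeps dirtying.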
### 2. Code Change Analysis

The changes are minimal and surgical:

1. **`hwlat_data.count` converted from `u64` to `atomic64_t`**: This fixes a real data race where multiple threads could concurrently increment `count` without synchronization, leading to lost updates (duplicate/missed values). The `atomic64_inc_return()` replaces `hwlat_data.count++; s.seqnum = hwlat_data.count;` with a single atomic operation (a simplified sketch of this pattern appears after the stable-rules assessment below).
2. **`sample_width` pulled into a local variable with `READ_ONCE()`**: `u64 sample_width = READ_ONCE(hwlat_data.sample_width);` is used instead of reading `hwlat_data.sample_width` in the hot loop (`do { ... } while (total <= sample_width)`). This:
   - Prevents the CPU from repeatedly fetching a cache line that may be bouncing between cores
   - Eliminates the false sharing between `sample_width` and `count`
   - Uses `READ_ONCE()` for proper load semantics
3. **Init path updated**: `hwlat_data.count = 0` → `atomic64_set(&hwlat_data.count, 0)`, consistent with the type change.
4. **Comment fix**: Removes the incorrect claim that `get_sample()` is "called with hwlat_data.lock held."

### 3. Bug Classification

This commit fixes **multiple real bugs**:

1. **Data race on `hwlat_data.count`**: Concurrent unsynchronized access to a shared counter. This is a real correctness bug; sequence numbers can be duplicated or skipped.
2. **False sharing causing false positive latency detection**: The hwlat tracer produces bogus results (500 events per second instead of roughly 1 per minute). This is a **functional correctness bug**: the tracer reports phantom hardware latency that does not exist, misleading anyone relying on its results.
3. **Self-reinforcing feedback loop**: The false sharing creates a pathological cycle that makes the tracer practically unusable on some multi-socket systems.

### 4. Scope and Risk Assessment

- **Lines changed**: ~15 lines of actual code changes across 6 hunks in a single file
- **Files touched**: 1 (`kernel/trace/trace_hwlat.c`)
- **Risk**: Very low. The changes are:
  - Converting a counter to atomic (well-understood primitive)
  - Caching a value in a local variable (safe; the value does not need to be re-read)
  - Using `READ_ONCE()` (standard kernel pattern)
- **Subsystem**: Tracing, which is self-contained and well maintained by Steven Rostedt
- **Could it break something?**: Extremely unlikely. The atomic64 operations are well-tested primitives, and caching sample_width is semantically equivalent (the width does not change during a sample).

### 5. User Impact

- **Who is affected**: Anyone using the hwlat tracer in per-cpu mode on multi-socket systems
- **Severity**: The tracer produces wildly incorrect results (hundreds of false positive events per second) on affected systems
- **Real-world impact**: Users trying to validate hardware latency characteristics for real-time workloads would get completely misleading data
- **The data race on count**: Could cause sequence number issues that affect trace analysis tooling

### 6. Stability Indicators

- **Signed-off by Steven Rostedt** (tracing subsystem maintainer), which gives high confidence
- **Concrete test data** provided, showing dramatic improvement
- **Link to mailing list** discussion provided
- The fix uses standard kernel primitives (`atomic64_t`, `READ_ONCE()`), which are well-understood patterns

### 7. Dependency Check

- No dependencies on other commits
- `atomic64_t` and `READ_ONCE()` are available in all stable kernel versions
- The hwlat tracer exists in all recent stable trees (introduced in v4.9)

### Stable Kernel Rules Assessment

1. **Obviously correct and tested**: Yes, tested on real hardware with measurable improvement
2. **Fixes a real bug**: Yes, a data race plus false sharing causing incorrect tracer output
3. **Important issue**: Yes, the tracer produced hundreds of false positive latency events per second, plus a data race on the counter
4. **Small and contained**: Yes, ~15 lines in one file
5. **No new features**: Correct, this is a pure bug fix
6. **Applies cleanly**: Should apply cleanly to recent stable trees
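As referenced in the code change analysis above, here is a simplified userspace sketch of the before/after counter pattern and of hoisting the width out of the hot loop. The names, the C11 atomic, and the volatile cast are illustrative stand-ins; the kernel itself uses `atomic64_t`, `atomic64_inc_return()`, and `READ_ONCE()`.

```c
/*
 * Simplified sketch of the two fixes described in the code change analysis;
 * names are illustrative, and the C11 atomic / volatile cast stand in for
 * the kernel's atomic64_t, atomic64_inc_return() and READ_ONCE().
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t event_count;    /* was a plain u64 with no locking      */
static uint64_t sample_width_cfg = 5;   /* stand-in for hwlat_data.sample_width */

/*
 * Before the fix, two racy steps could duplicate or skip sequence numbers:
 *
 *     event_count++;
 *     seqnum = event_count;
 *
 * After the fix, one atomic read-modify-write does both, as
 * atomic64_inc_return() does in the kernel:
 */
static uint64_t next_seqnum(void)
{
	return atomic_fetch_add_explicit(&event_count, 1,
					 memory_order_relaxed) + 1;
}

/*
 * The loop condition uses a local copy taken once up front instead of
 * re-reading the shared structure every iteration; READ_ONCE() is
 * approximated here by a volatile load.
 */
static uint64_t run_sample(void)
{
	uint64_t width = *(volatile uint64_t *)&sample_width_cfg;
	uint64_t total = 0;

	do {
		total++;        /* stands in for the TSC-delta measurements */
	} while (total <= width);

	return next_seqnum();
}

int main(void)
{
	printf("sample 1 -> seqnum %llu\n", (unsigned long long)run_sample());
	printf("sample 2 -> seqnum %llu\n", (unsigned long long)run_sample());
	return 0;
}
```

Run single-threaded it simply prints seqnum 1 and 2; the point is that `next_seqnum()` stays correct when called from many CPUs at once, whereas the commented two-step version can hand out duplicate or skipped sequence numbers.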
### Risk vs Benefit

- **Risk**: Near zero; an atomic counter and local variable caching are trivially safe changes
- **Benefit**: High; fixes a data race and makes the hwlat tracer produce correct results on multi-socket systems

**YES**

 kernel/trace/trace_hwlat.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 2f7b94e98317c..3fe274b84f1c2 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -102,9 +102,9 @@ struct hwlat_sample {
 /* keep the global state somewhere. */
 static struct hwlat_data {
 
-	struct mutex lock;		/* protect changes */
+	struct mutex lock;			/* protect changes */
 
-	u64	count;			/* total since reset */
+	atomic64_t	count;			/* total since reset */
 
 	u64	sample_window;		/* total sampling window (on+off) */
 	u64	sample_width;		/* active sampling portion of window */
@@ -193,8 +193,7 @@ void trace_hwlat_callback(bool enter)
  * get_sample - sample the CPU TSC and look for likely hardware latencies
  *
  * Used to repeatedly capture the CPU TSC (or similar), looking for potential
- * hardware-induced latency. Called with interrupts disabled and with
- * hwlat_data.lock held.
+ * hardware-induced latency. Called with interrupts disabled.
  */
 static int get_sample(void)
 {
@@ -204,6 +203,7 @@ static int get_sample(void)
 	time_type start, t1, t2, last_t2;
 	s64 diff, outer_diff, total, last_total = 0;
 	u64 sample = 0;
+	u64 sample_width = READ_ONCE(hwlat_data.sample_width);
 	u64 thresh = tracing_thresh;
 	u64 outer_sample = 0;
 	int ret = -1;
@@ -267,7 +267,7 @@ static int get_sample(void)
 		if (diff > sample)
 			sample = diff; /* only want highest value */
 
-	} while (total <= hwlat_data.sample_width);
+	} while (total <= sample_width);
 
 	barrier(); /* finish the above in the view for NMIs */
 	trace_hwlat_callback_enabled = false;
@@ -285,8 +285,7 @@ static int get_sample(void)
 		if (kdata->nmi_total_ts)
 			do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
 
-		hwlat_data.count++;
-		s.seqnum = hwlat_data.count;
+		s.seqnum = atomic64_inc_return(&hwlat_data.count);
 		s.duration = sample;
 		s.outer_duration = outer_sample;
 		s.nmi_total_ts = kdata->nmi_total_ts;
@@ -832,7 +831,7 @@ static int hwlat_tracer_init(struct trace_array *tr)
 
 	hwlat_trace = tr;
 
-	hwlat_data.count = 0;
+	atomic64_set(&hwlat_data.count, 0);
 	tr->max_latency = 0;
 	save_tracing_thresh = tracing_thresh;
-- 
2.51.0
