From: Pengfei Li <[email protected]>

Hi Masami, Steven, all,

This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.

[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte stack_id
instead of a full kernel stack trace, while the full stacks are
exported via tracefs.

Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.

Changes since v2
================

Patch 1 (lock-free stackmap):

  - Hot-path counters changed from atomic64_t to per-CPU local_t. This
    avoids the raw_spinlock_t fallback that atomic64_t uses on 32-bit
    GENERIC_ATOMIC64, which would deadlock from NMI context.

  - reset() now serializes against tracefs readers via an rw_semaphore
    (held for write during the clearing memset, held for read by
    seq_file iteration and bin snapshot construction).
    synchronize_rcu() alone was insufficient because seq_file/bin
    readers are in process context, not preempt-disabled.

  - get_id() uses atomic_read_acquire() on smap->resetting so
    subsequent loads of entry->key/val are properly ordered after the
    check (LKMM control dependencies only order stores).

  - All plain reads of entry->key now use READ_ONCE() to avoid LKMM
    data races with the cmpxchg writer.

  - val->nr in the hot path now uses READ_ONCE() to keep style
    consistent with the seq_show / bin_open readers.

  - stackmap_seq_next() now updates *pos past map_size on EOF so
    seq_read() terminates instead of looping on the last element.

  - Added a comment in the cmpxchg-claim path documenting that two
    CPUs racing with the same key_hash may produce a small number of
    duplicate entries; this is an accepted trade-off for keeping the
    hot path lock-free.

  - Removed BUG_ON in create path (the constraint is satisfied by
    construction; no runtime check needed).

Patch 2 (integration):

  - 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
    ZEROED_TRACE_FLAGS so the option is only exposed under the
    top-level trace instance, matching the convention used for other
    global-only options such as 'printk' and 'record-cmd'. Secondary
    instances under tracing/instances/*/ no longer see the option at
    all, instead of seeing it as a silent no-op.

  - TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c so
    ftrace startup selftests don't reject the entry type.

  - Corrected a comment about how global_trace.stackmap is
    zero-initialized (BSS, not kzalloc).

Patch 3 (docs / selftest / tooling):

  - Selftest now reads trace contents BEFORE switching back to the nop
    tracer (tracer_init() calls tracing_reset_online_cpus() which
    would have emptied the ring buffer).

  - Added 'function:tracer' to the selftest '# requires:' line so
    ftracetest skips when CONFIG_FUNCTION_TRACER is disabled instead
    of failing spuriously.

  - Selftest grep tightened to '<stack_id' to avoid future
    false-positives if any other tracepoint name contains "stack_id".

  - New stackmap-instance-gate.tc selftest asserts the option and
    stack_map* nodes are present on the global instance and absent on
    a freshly-created secondary instance, locking in the
    TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.

  - Documentation Performance section made vendor-neutral ("aarch64
    SMP system" instead of a specific device name) and the term "Hit
    rate" replaced with "Dedup rate" to match the actual stat field
    name (success_rate).

  - Documentation Design section now states that deduplication is
    best-effort under heavy contention (cmpxchg races may produce a
    small number of duplicate entries for the same stack), so users
    observing entries > unique-stacks have a documented explanation.

Test results
============

Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger

Functional tests (all PASS):

  - tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
  - options/stackmap writable, trace shows <stack_id N>
  - stack_map text export with correct symbols
  - reset clears entries when tracing stopped
  - reset rejected (-EBUSY) while tracing active
  - per-event trigger: only specified events get stacks

Performance (sched_switch, 5s):

  entries:      466 / 16384
  successes:    9159
  drops:        0
  success_rate: 100%
  dedup rate:   95.2% (466 unique stacks / 9625 total events)

Performance (kmem_cache_alloc, 5s):

  entries:      1177 / 16384
  successes:    60078
  drops:        0
  success_rate: 100%
  dedup rate:   98.1% (1177 unique stacks / 61255 total events)

Ring buffer space savings:

  Event             Full stack           Stackmap                      Saving
  ----------------  -------------------  ----------------------------  ------
  sched_switch      9625 × 88B = 847KB   12B×9625 + 88B×466 = 156KB    82%
  kmem_cache_alloc  61255 × 88B = 5.4MB  12B×61255 + 88B×1177 = 839KB  84%

QEMU validation (v3 base: v7.1-rc5)
===================================

The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:

  - tracefs nodes appear with correct file modes
  - stack_id events emitted, kernel symbols resolve correctly
    (e.g. __schedule+0x7cc/0x1138)
  - reset rejected with -EBUSY while tracing is active
  - reset clears the map when tracing is stopped
  - per-CPU local_t counters aggregate correctly across CPUs
  - stack_map_bin magic correct (0x464D5342 'FSMB')
  - 'stackmap' option visible on the global instance, hidden on
    secondary instances under tracing/instances/*/

Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to recording
full stack traces; later events are deduplicated. No events are
dropped due to the transition.

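For reference, the ring buffer arithmetic in the table above can be
reproduced with a short Python sketch. It assumes, as the table does,
88 bytes per full stack record and 12 bytes per stack_id event. The
function and constant names are illustrative only and are not part of
the series.

```python
# Sizes assumed from the savings table: one full stack record is taken
# to be 88 bytes, one stack_id event in the ring buffer 12 bytes.
FULL_STACK_BYTES = 88
STACK_ID_EVENT_BYTES = 12


def stackmap_savings(total_events, unique_stacks):
    """Return (full_bytes, stackmap_bytes, saving, dedup_rate).

    saving and dedup_rate are fractions in the range 0..1.
    """
    # Without the stackmap, every event carries a full stack.
    full_bytes = total_events * FULL_STACK_BYTES
    # With the stackmap, every event carries a small stack_id record
    # and each unique stack is stored exactly once.
    stackmap_bytes = (total_events * STACK_ID_EVENT_BYTES
                      + unique_stacks * FULL_STACK_BYTES)
    saving = 1.0 - stackmap_bytes / full_bytes
    dedup_rate = 1.0 - unique_stacks / total_events
    return full_bytes, stackmap_bytes, saving, dedup_rate


if __name__ == "__main__":
    runs = (("sched_switch", 9625, 466),
            ("kmem_cache_alloc", 61255, 1177))
    for name, events, unique in runs:
        full, smap, saving, dedup = stackmap_savings(events, unique)
        print(f"{name}: {full} B -> {smap} B, "
              f"saving {saving:.1%}, dedup rate {dedup:.1%}")
```

The saving depends only on the ratio of unique stacks to total events
and on the two record sizes, so the same helper can be used to estimate
the benefit for other events once their counts are known.
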
Known limitations
=================

  - Per-instance stackmap support is not included in this series.
    Following the convention used for other global-only options
    (PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
    top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is not
    exposed under tracing/instances/*/options/. Per-instance maps
    would be a follow-up.

  - The element pool is allocated eagerly at fs_initcall when
    CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
    ever enable the option. At the default bits=14 this is roughly
    8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
    allocation keeps the hot path entirely allocation-free and avoids
    any allocation-failure path under tracing pressure. Lazy
    allocation on first 'echo 1 > options/stackmap' is a reasonable
    follow-up if maintainers prefer that trade-off.

  - Deduplication is best-effort, not strict: under heavy concurrent
    contention two CPUs racing in the insert path with the same stack
    hash may each succeed in claiming a different slot, producing a
    small number of duplicate entries for the same stack. ref_count
    is then split across the duplicates. This is intentional: it
    keeps the hot path lock-free and bounds memory by the element
    pool size.

  - The stackmap currently covers kernel stacks only.

  - stack_map_bin is a best-effort snapshot, not a fully atomic
    export.

  - trace-cmd / libtraceevent integration is left for follow-up once
    the binary format settles.

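To illustrate the best-effort deduplication limitation, here is a toy
single-threaded Python model of a claim-then-publish insert. It is not
the code from patch 1: the class, the method names, the linear probing
and the assumption that a claimed-but-unpublished slot is skipped are
all made up for illustration. It only shows how two racing inserters of
the same stack can end up in different slots, with ref_count split
between them.

```python
class ToyStackmap:
    """Hypothetical open-addressing map with claim-then-publish inserts."""

    def __init__(self, nr_slots):
        self.keys = [None] * nr_slots     # claimed key_hash (None = free)
        self.stacks = [None] * nr_slots   # published stack (None = pending)
        self.ref_count = [0] * nr_slots

    def claim(self, key_hash, stack):
        """Return (slot, needs_publish), or (None, False) if the map is full."""
        n = len(self.keys)
        for i in range(n):
            slot = (key_hash + i) % n
            if self.keys[slot] is None:
                # Models a successful cmpxchg(free -> key_hash).
                self.keys[slot] = key_hash
                return slot, True
            if self.keys[slot] == key_hash and self.stacks[slot] == stack:
                return slot, False        # existing entry: deduplicated
            # Other key, or same key not yet published: keep probing.
        return None, False

    def publish(self, slot, stack):
        self.stacks[slot] = stack

    def hit(self, slot):
        self.ref_count[slot] += 1


def insert(smap, key_hash, stack):
    """Uncontended insert: claim, publish if new, then count the hit."""
    slot, needs_publish = smap.claim(key_hash, stack)
    if slot is None:
        return None
    if needs_publish:
        smap.publish(slot, stack)
    smap.hit(slot)
    return slot


if __name__ == "__main__":
    stack = ("__schedule", "schedule")

    # Sequential case: the second insert finds the published entry.
    seq = ToyStackmap(8)
    print(insert(seq, 3, stack), insert(seq, 3, stack), seq.ref_count[3])

    # Racy interleaving: both "CPUs" claim before either publishes.
    racy = ToyStackmap(8)
    slot0, _ = racy.claim(3, stack)
    slot1, _ = racy.claim(3, stack)
    for slot in (slot0, slot1):
        racy.publish(slot, stack)
        racy.hit(slot)
    print(slot0, slot1, racy.ref_count[slot0], racy.ref_count[slot1])
```

In this model the duplicate appears only when the second claim happens
inside the window between the first claim and its publish, which is why
the number of duplicates stays small and is bounded by the pool size.
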
Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst     | 162 ++++
 Documentation/trace/index.rst               |   1 +
 kernel/trace/Kconfig                        |  22 +
 kernel/trace/Makefile                       |   1 +
 kernel/trace/trace.c                        |  78 +-
 kernel/trace/trace.h                        |  16 +
 kernel/trace/trace_entries.h                |  15 +
 kernel/trace/trace_output.c                 |  23 +
 kernel/trace/trace_selftest.c               |   1 +
 kernel/trace/trace_stackmap.c               | 780 ++++++++++++++++++
 kernel/trace/trace_stackmap.h               |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc  | 103 +++
 .../test.d/ftrace/stackmap-instance-gate.tc |  42 +
 tools/tracing/stackmap_dump.py              | 150 ++++
 14 files changed, 1449 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1
