From: Pengfei Li <[email protected]>

Hi Steven, all,

This is v2 of the ftrace stackmap series. It addresses the Sashiko
review at [1] and incorporates the kernel test robot's toctree fix.

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte stack_id
instead of a full kernel stack trace, while the full stacks are
exported via tracefs.

Problem
=======

With stacktrace enabled, each trace event stores a full kernel stack
(typically 10-20 frames x 8 bytes = 80-160 bytes). On production
devices with 4-8 MB trace buffers, this fills the buffer in seconds,
limiting the usefulness of boot-time tracing and always-on performance
monitoring.

Design
======

The implementation is a lock-free hash map modeled after tracing_map.c,
as suggested by Steven [2]:

- lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- pre-allocated element pool, so there is no allocation on the hot path
- linear probing with a 2x over-provisioned table
- bounded probe length to keep worst-case lookup/insert cost bounded
- currently implemented for the global trace instance

The ring buffer stores only stack_id. Full stacks are exported via:

  /sys/kernel/debug/tracing/stack_map
  /sys/kernel/debug/tracing/stack_map_stat
  /sys/kernel/debug/tracing/stack_map_bin

Reset semantics
===============

Reset is treated as a control-path operation and is only supported when
tracing is stopped on the owning trace_array. Online reset is
intentionally not supported.

The reset path:

- atomically claims reset rights via cmpxchg
- rejects reset with -EBUSY if tracing is active
- blocks new get_id() callers via the resetting flag
- waits for in-flight ftrace callback paths with synchronize_rcu()
- clears the map and releases resetting with release semantics

Why not reuse tracing_map.c
===========================

This series follows the same overall lock-free approach, but uses a
purpose-built structure.
tracing_map.c is designed for histogram-style aggregation with
fixed-size keys and value fields, while this use case needs
variable-length stack storage plus reference counting.

Why not reuse BPF stackmap
==========================

BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a
BPF program and the BPF runtime. This series keeps the functionality
inside ftrace and available without CONFIG_BPF.

Unlike BPF stackmap, which may replace entries on collision, this
design keeps stack_id stable once assigned, which is important because
ring buffer events may reference that stack_id long after insertion.

Test results
============

Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14,
tracing sched_switch + kmem_cache_alloc with stacktrace trigger,
5-second capture, default ring buffer.

Per-event payload (measured from tracing stats):

  Event                  Full stack   Stackmap    Reduction
  ---------------------  -----------  ----------  ---------
  sched_switch           102 B/entry  48 B/entry  -53%
  kmem_cache_alloc       111 B/entry  44 B/entry  -60%

In the same 5-second capture window, the smaller per-event footprint
translated to many more retained events before wraparound. For
sched_switch:

- without stackmap: 43,950 retained entries
- with stackmap: 1,710,044 retained entries

During the same runs, the stackmap observed a few thousand unique
stacks and no drops.

Boot-time activation is also supported via:

  trace_options=stackmap,stacktrace

Events that occur before stackmap initialization fall back to full
stack traces; later events are deduplicated. This transition does not
itself drop events, but early boot stacks recorded before
initialization are not deduplicated.

QEMU validation
===============

The series also runs cleanly in QEMU on aarch64 (mainline,
qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd).
A post-init smoke test verified:

- stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset is rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- stack_map_bin magic is correct

Changes since RFC v1
====================

- tightened reset semantics: reset now requires tracing to be stopped
  and returns -EBUSY if tracing is active or another reset is in
  progress
- fixed publication/consumption ordering with smp_store_release() /
  smp_load_acquire()
- bounded probe length and added pool-exhaustion fast-path handling
- moved hash_seed into struct ftrace_stackmap
- switched the element pool to a single flat vmalloc allocation
- bounded bits range to [10, 18] to limit worst-case memory usage
- fixed TRACE_ITER(STACKMAP) handling
- tightened stack_map reset input parsing
- renamed stat counters to "successes" / "success_rate" so the meaning
  is unambiguous (counts events served, including first-time inserts)
- added documentation, selftest coverage, and userspace dump tooling

Known limitations
=================

- Per-instance stackmap support is not included in this series.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles.
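
For readers less familiar with the tracing_map.c pattern, the insert
path listed under "Design" can be sketched in userspace C11 roughly as
below. This is an illustration only and not the code from this series:
every identifier (sm_get_id, struct sm_slot, SM_MAX_PROBE, ...), the
FNV-1a hash, the toy table size, and the way an unpublished slot is
handled are assumptions made for the example. C11 atomics stand in for
the kernel's cmpxchg() and smp_store_release() / smp_load_acquire().

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* All names and sizes below are hypothetical, chosen for illustration. */
#define SM_BITS      4                   /* toy size; the series bounds bits to [10, 18] */
#define SM_MAX_ELTS  (1u << SM_BITS)     /* capacity of the pre-allocated pool */
#define SM_SLOTS     (2u * SM_MAX_ELTS)  /* 2x over-provisioned table */
#define SM_MAX_PROBE 8                   /* upper bound on probes per call */
#define SM_MAX_DEPTH 16                  /* frames kept per stack */

struct sm_elt {
	uint32_t nr;
	unsigned long ips[SM_MAX_DEPTH];
};

struct sm_slot {
	_Atomic uint32_t key;          /* 0 = free, otherwise hash of the stack */
	_Atomic(struct sm_elt *) val;  /* published with release semantics */
};

static struct sm_slot sm_table[SM_SLOTS];
static struct sm_elt sm_pool[SM_MAX_ELTS];
static atomic_uint sm_next_elt;
static atomic_ulong sm_drops;

/* FNV-1a over the frame addresses; 0 is reserved to mean "free slot". */
static uint32_t sm_hash(const unsigned long *ips, uint32_t nr)
{
	uint32_t h = 2166136261u;

	for (uint32_t i = 0; i < nr; i++) {
		unsigned long v = ips[i];

		for (size_t b = 0; b < sizeof(v); b++) {
			h ^= (uint32_t)(v & 0xff);
			h *= 16777619u;
			v >>= 8;
		}
	}
	return h ? h : 1;
}

/* Returns a stable id (pool index) for the stack, or -1 if it is dropped. */
int sm_get_id(const unsigned long *ips, uint32_t nr)
{
	if (nr > SM_MAX_DEPTH)
		nr = SM_MAX_DEPTH;

	uint32_t hash = sm_hash(ips, nr);
	uint32_t idx = hash & (SM_SLOTS - 1);

	for (int probe = 0; probe < SM_MAX_PROBE; probe++) {
		struct sm_slot *slot = &sm_table[idx];
		uint32_t key = atomic_load_explicit(&slot->key, memory_order_acquire);

		if (key == 0) {
			uint32_t expected = 0;

			/* Fast path: do not claim a slot once the pool is used up. */
			if (atomic_load_explicit(&sm_next_elt, memory_order_relaxed) >= SM_MAX_ELTS)
				break;

			if (atomic_compare_exchange_strong(&slot->key, &expected, hash)) {
				unsigned int n = atomic_fetch_add(&sm_next_elt, 1);

				if (n >= SM_MAX_ELTS)
					break;  /* lost the race for the last element */

				struct sm_elt *e = &sm_pool[n];

				e->nr = nr;
				memcpy(e->ips, ips, nr * sizeof(*ips));
				/* Publish only after the element is fully written. */
				atomic_store_explicit(&slot->val, e, memory_order_release);
				return (int)n;
			}
			key = expected;  /* another context claimed the slot first */
		}

		if (key == hash) {
			struct sm_elt *e = atomic_load_explicit(&slot->val, memory_order_acquire);

			if (e && e->nr == nr && !memcmp(e->ips, ips, nr * sizeof(*ips)))
				return (int)(e - sm_pool);
			/*
			 * Hash collision, or the owner has not published yet.
			 * This sketch simply moves on, which can at worst
			 * store a duplicate of an in-flight stack.
			 */
		}
		idx = (idx + 1) & (SM_SLOTS - 1);
	}

	atomic_fetch_add(&sm_drops, 1);
	return -1;
}
```

In this sketch, calling sm_get_id() twice with the same frames returns
the same id, and an id is never reassigned once handed out, which is
the property the ring buffer relies on. Calls that exceed the probe
bound or find the pool exhausted are counted in sm_drops instead of
evicting an existing entry.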

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

[1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
[2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 145 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  21 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  66 ++
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_stackmap.c                 | 643 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  56 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 100 +++
 tools/tracing/stackmap_dump.py                | 150 ++++
 12 files changed, 1237 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1
