From: Pengfei Li <[email protected]>

Hi Masami, Steven, all,

This is v4 of the ftrace stackmap series. It is sent as a new thread.

v3: https://lore.kernel.org/all/[email protected]/

The series adds stack trace deduplication to ftrace. When the 'stackmap'
option is enabled alongside 'stacktrace', the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, and the full stacks are
exported once via tracefs (stack_map / stack_map_bin).

Rebased onto v7.1-rc5 (e8c2f9fdadee).

Motivation
==========

The target use case is long-duration, from-boot kernel tracing where the
same stacks recur enormously often and the bottleneck is ring buffer
space, not CPU. Concretely: tracing the slab allocator from boot for
hours to study memory aging and to catch the allocation backtraces
behind a usage peak.

With a stacktrace trigger on the slab tracepoints, every event today
carries a full kernel stack (~80-160 bytes). On a fixed-size ring buffer
that bounds how far back in time the trace reaches: the buffer wraps in
seconds to minutes and the early-boot history -- the part we care about
-- is overwritten before it can be consumed. In this workload the set of
distinct stacks is small and highly repetitive, so storing a 4-byte
stack_id per event and the full stack only once dramatically increases
the time span a given buffer covers.

The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis. This is
complementary to, not a replacement for, the existing full stack
recording: deep stacks and the early pre-init window still fall back to
full stacks (see below).
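
As a rough illustration of the space argument, here is a toy user-space
model of the idea. It is a sketch, not the kernel implementation: the
ToyStackMap class, the 8-byte frame size and the sample addresses are
assumptions made for the example, and only the stack payload is counted
(event headers are ignored).

```python
# Toy model of stack deduplication (illustration only, not kernel code).
# Assumptions: 8 bytes per frame address, 4 bytes per stored stack_id.

FRAME_BYTES = 8      # assumed size of one return address on a 64-bit kernel
STACK_ID_BYTES = 4   # size of the stack_id stored per event


class ToyStackMap:
    """Interns each distinct stack once and hands out a small integer id."""

    def __init__(self):
        self._ids = {}    # tuple of addresses -> stack_id
        self.stacks = []  # stack_id -> tuple of addresses

    def get_id(self, frames):
        key = tuple(frames)
        stack_id = self._ids.get(key)
        if stack_id is None:
            # First time this stack is seen: store it once.
            stack_id = len(self.stacks)
            self._ids[key] = stack_id
            self.stacks.append(key)
        return stack_id


def payload_bytes(events):
    """Return (full_bytes, dedup_bytes) for a list of per-event stacks."""
    smap = ToyStackMap()
    full = sum(len(stack) * FRAME_BYTES for stack in events)
    ids = [smap.get_id(stack) for stack in events]
    # One id per event, plus each distinct stack stored exactly once.
    dedup = len(ids) * STACK_ID_BYTES + sum(
        len(stack) * FRAME_BYTES for stack in smap.stacks)
    return full, dedup


if __name__ == "__main__":
    # Hypothetical workload: two distinct stacks, heavily repeated.
    hot = [0xffff800080001000 + 0x40 * i for i in range(12)]
    cold = [0xffff800080002000 + 0x40 * i for i in range(16)]
    full, dedup = payload_bytes([hot] * 9000 + [cold] * 1000)
    print(f"full stacks: {full} bytes, deduplicated: {dedup} bytes")
```

The more often a stack repeats, the closer the per-event payload gets to
the 4-byte id, which is the effect the series relies on.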

Effect on retention
====================

Same fixed per-CPU buffer, slab allocation workload with a shallow kernel
stack (kmem_cache_alloc), stackmap OFF vs ON:

                  retained events   bytes/event   time span
  stackmap OFF            645,068        ~104 B      15.0 s
  stackmap ON           1,397,741         ~48 B      27.7 s
  ratio                     2.17x         2.17x       1.85x

The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.

Changes since v3
================

Correctness:

 - Deep stacks are never silently truncated or merged. A stack deeper
   than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
   trace; ftrace_stackmap_get_id() returns -E2BIG rather than
   truncating, so two distinct stacks sharing their first 64 frames can
   no longer collapse to one stack_id.

 - reset is now genuinely destructive and coherent: under the reader_sem
   write lock it clears the owning trace_array's ring buffer (and
   snapshot) BEFORE clearing the map, and uses tracing_reset_all_cpus()
   rather than _online_cpus() so a TRACE_STACK_ID written by a
   now-offline CPU cannot survive and dangle against a cleared map.

 - __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer slot
   before inserting into the map, so stack_map_stat counters and
   ref_count stay consistent with what the ring buffer actually
   references (failed reservation -> full stack, map untouched).

 - ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
   instead of wrapping on multi-hour, billions-of-hits traces.

Global-instance gating:

 - Enabling 'stackmap' on a secondary instance via the aggregate
   trace_options file is now rejected, not just hidden in the
   per-instance options/ directory.

 - tracefs init is failure-atomic: the required stack_map file is
   created before the map pointer is published; if it cannot be created
   the map is destroyed and never published.
   An init-state (PENDING/DONE/FAILED) lets boot-time
   trace_options=stackmap set the flag before the map exists (hot path
   falls back until it is published) while still rejecting enables after
   a permanent init failure, so options/stackmap never reports an
   enabled no-op.

ABI / tooling:

 - Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
   upstream ABI). Documentation, tool and selftest updated to match.

 - Text and binary exports now follow the same trampoline-marker and
   trace_adjust_address() handling as the normal stack print path.

 - stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
   trampoline marker only in the resolved 'symbols'.

Selftests:

 - New stackmap-reset.tc: verifies reset clears stale <stack_id N> from
   the trace buffer and checks the stack_map_bin magic/version.

 - stackmap-instance-gate.tc extended to verify the trace_options write
   path is rejected on a secondary instance.

 - stackmap-basic.tc no longer treats a nonzero drops count as a failure
   (drops is a by-design fallback); only zero successes with nonzero
   drops is fatal.

Open questions for maintainers
==============================

Two design points where I would value direction before polishing
further:

 1. Eager vs lazy allocation. The element pool is allocated at
    fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
    userspace ever enables the option (~8 MB at the default bits=14, up
    to ~135 MB at bits=18). This keeps the hot path allocation-free with
    no allocation-failure path under tracing pressure. Is eager
    allocation acceptable, or would you prefer lazy allocation on the
    first 'echo 1 > options/stackmap'?

 2. Binary ABI now or later. stack_map_bin is a new tracefs binary
    interface (magic 0x46534D42, version 1). Is it acceptable to
    introduce it now, or would you prefer the first version ship with
    the text stack_map interface only and add the binary export once
    trace-cmd / libtraceevent integration is designed?
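
For reviewers weighing question 2, the sketch below shows the kind of
header check a consumer of stack_map_bin would perform. Only the magic
value and the version number come from this series; the header layout
used here (a u32 magic followed by a u32 version, in one byte order) is
an assumption for illustration and is not taken from the patches, and
the helper names are hypothetical.

```python
import struct

STACKMAP_MAGIC = 0x46534D42  # from this cover letter
STACKMAP_VERSION = 1         # from this cover letter


def magic_as_ascii(magic):
    """Spell a 32-bit magic as four ASCII characters, high byte first."""
    return struct.pack(">I", magic).decode("ascii")


def check_header(buf):
    """Validate an ASSUMED header of u32 magic followed by u32 version.

    Returns the struct byte-order prefix ('<' or '>') under which the
    magic matched, so the caller can decode the rest consistently.
    """
    for order in ("<", ">"):
        magic, version = struct.unpack_from(order + "II", buf, 0)
        if magic == STACKMAP_MAGIC:
            if version != STACKMAP_VERSION:
                raise ValueError(f"unsupported version {version}")
            return order
    raise ValueError("bad magic, not a stack_map_bin export")


if __name__ == "__main__":
    print(magic_as_ascii(STACKMAP_MAGIC))
    print(check_header(struct.pack("<II", STACKMAP_MAGIC, STACKMAP_VERSION)))
```

Any external parser hard-codes checks like these, which is why the
format is hard to change once it ships.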

Test results
============

QEMU (aarch64 virt, v7.1-rc5 + this series), boot to init smoke test:

 - stackmap functional suite: 16/16 PASS, including reset clearing the
   trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
   magic/version, global-vs-secondary instance gating, and the
   trace_options rejection on a secondary instance.

 - boot-time activation (trace_options=stackmap,stacktrace on the kernel
   cmdline): 3/3 PASS -- the option survives the pre-initialization
   window and the map is live once published.

 - ftrace startup self-tests pass with the new TRACE_STACK_ID entry.

Device retention numbers above were collected on a Xiaomi SM8850 (ARM64)
running an Android workload, comparing the same buffer with the option
off and on.

Known limitations
=================

 - Per-instance stackmap support is not included; the option is gated to
   the global instance (in the tracefs layout and at the
   set_tracer_flag() write path). Per-instance maps are a follow-up.

 - Deduplication is best-effort, not strict: under heavy concurrent
   contention two CPUs racing with the same stack hash may each claim a
   different slot, producing a few duplicate entries; ref_count is then
   split across them. This keeps the hot path lock-free.

 - The stackmap covers kernel stacks only.

 - stack_map_bin is a best-effort snapshot, serialized against reset but
   not a fully atomic export.

 - trace-cmd / libtraceevent integration is left for follow-up once the
   binary format settles (see open question 2).
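
Because deduplication is best-effort, an offline analysis script may
want to fold duplicate entries back together before ranking stacks by
hit count. A minimal sketch follows, assuming the entries have already
been parsed into (stack_id, ref_count, ips) tuples; that tuple shape and
the merge_duplicates() helper are hypothetical and not the output format
of stackmap_dump.py.

```python
from collections import defaultdict


def merge_duplicates(entries):
    """Fold entries that carry the same stack into one record.

    entries: iterable of (stack_id, ref_count, ips), where ips is a
    sequence of addresses (hypothetical shape, for illustration).
    Returns {ips_tuple: (sorted_stack_ids, total_ref_count)}.
    """
    merged = defaultdict(lambda: [[], 0])
    for stack_id, ref_count, ips in entries:
        slot = merged[tuple(ips)]
        slot[0].append(stack_id)  # remember every id that maps here
        slot[1] += ref_count      # re-join the split ref_count
    return {ips: (sorted(ids), total)
            for ips, (ids, total) in merged.items()}


if __name__ == "__main__":
    sample = [
        (3, 10, [0x1000, 0x2000]),
        (7, 5, [0x1000, 0x2000]),  # duplicate of id 3 from a racing CPU
        (4, 1, [0x9000]),
    ]
    for ips, (ids, total) in merge_duplicates(sample).items():
        print([hex(ip) for ip in ips], ids, total)
```

Since ref_count saturates rather than wraps, a merged total is only a
lower bound when any contributing entry has reached the cap.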

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst     | 177 ++++
 Documentation/trace/index.rst               |   1 +
 kernel/trace/Kconfig                        |  22 +
 kernel/trace/Makefile                       |   1 +
 kernel/trace/trace.c                        | 216 ++++-
 kernel/trace/trace.h                        |  17 +
 kernel/trace/trace_entries.h                |  15 +
 kernel/trace/trace_output.c                 |  23 +
 kernel/trace/trace_selftest.c               |   1 +
 kernel/trace/trace_stackmap.c               | 889 ++++++++++++++++++
 kernel/trace/trace_stackmap.h               |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc  | 111 +++
 .../test.d/ftrace/stackmap-instance-gate.tc |  54 ++
 .../ftrace/test.d/ftrace/stackmap-reset.tc  |  76 ++
 tools/tracing/stackmap_dump.py              | 164 ++++
 15 files changed, 1821 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1
