From: Pengfei Li <[email protected]>

Hi Masami, Steven, all,

This is v4 of the ftrace stackmap series. It is sent as a new thread.

v3: https://lore.kernel.org/all/[email protected]/

The series adds stack trace deduplication to ftrace. When the
'stackmap' option is enabled alongside 'stacktrace', the ring buffer
stores a 4-byte stack_id instead of a full kernel stack trace, and the
full stacks are exported once via tracefs (stack_map / stack_map_bin).

Rebased onto v7.1-rc5 (e8c2f9fdadee).

Motivation
==========

The target use case is long-duration, from-boot kernel tracing where
the same stacks recur enormously often and the bottleneck is ring
buffer space, not CPU.

Concretely: tracing the slab allocator from boot for hours to study
memory aging and to catch the allocation backtraces behind a usage
peak. With a stacktrace trigger on the slab tracepoints, every event
today carries a full kernel stack (~80-160 bytes). On a fixed-size
ring buffer that bounds how far back in time the trace reaches: the
buffer wraps in seconds to minutes and the early-boot history -- the
part we care about -- is overwritten before it can be consumed.

In this workload the set of distinct stacks is small and highly
repetitive, so storing a 4-byte stack_id per event and the full stack
only once dramatically increases the time span a given buffer covers.
The intended operating model is exactly the low-overhead one ftrace is
good at: let the trace run for a long time producing a comparatively
small, dense log, then resolve stack_ids offline (cat stack_map, or
parse stack_map_bin with the included tool) during analysis.
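
As a rough illustration of the offline resolution step, here is a
minimal Python sketch. It assumes the trace text carries "<stack_id N>"
markers (the form the selftest description uses) and that the stack_map
contents have already been parsed into a dict; the function name and the
dict layout are hypothetical and are not taken from stackmap_dump.py.

```python
import re

# The "<stack_id N>" marker form is taken from this letter; the helper name
# and the {id: [frames]} layout are assumptions made for illustration.
STACK_ID_RE = re.compile(r"<stack_id (\d+)>")


def resolve_stack_ids(trace_text, stack_map):
    """Replace each <stack_id N> marker with the frames stored for id N."""

    def expand(match):
        frames = stack_map.get(int(match.group(1)))
        if frames is None:
            # Unknown id (for example, the map was reset): keep the marker.
            return match.group(0)
        return "\n".join(" => " + frame for frame in frames)

    return STACK_ID_RE.sub(expand, trace_text)
```

Ids that are missing from the map are left untouched, so a stale or
truncated map degrades to unresolved markers instead of wrong stacks.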

This is complementary to, not a replacement for, the existing full
stack recording: deep stacks and the early pre-init window still fall
back to full stacks (see below).

Effect on retention
===================

Same fixed per-CPU buffer, slab allocation workload with a shallow
kernel stack (kmem_cache_alloc), stackmap OFF vs ON:

                  retained events   bytes/event   time span
  stackmap OFF        645,068          ~104 B        15.0 s
  stackmap ON       1,397,741           ~48 B        27.7 s
  gain               2.17x             2.17x          1.85x

The buffer holds ~2.17x more events and reaches ~1.85x further back in
time for the same memory. The win grows with stack depth and with how
repetitive the stacks are; for deep, highly-repeated stacks the
per-event size approaches the 4-byte stack_id plus event header.
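
The ratios in the table can be re-derived from the raw figures. The
sketch below only repeats that arithmetic; the dictionary keys are
illustrative names, and the bytes/event ratio is computed as OFF/ON
because a smaller event is the improvement.

```python
# Raw figures copied from the table above; the key names are illustrative.
off = {"events": 645_068, "bytes_per_event": 104, "span_s": 15.0}
on = {"events": 1_397_741, "bytes_per_event": 48, "span_s": 27.7}


def retention_ratios(off, on):
    """Return the improvement factors of stackmap ON relative to OFF."""
    return {
        "events": on["events"] / off["events"],
        # Smaller events are better, so this ratio is OFF over ON.
        "bytes_per_event": off["bytes_per_event"] / on["bytes_per_event"],
        "span": on["span_s"] / off["span_s"],
    }


ratios = retention_ratios(off, on)
print({name: round(value, 2) for name, value in ratios.items()})
```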

Changes since v3
================

Correctness:
  - Deep stacks are never silently truncated or merged. A stack deeper
    than FTRACE_STACKMAP_MAX_DEPTH (64) now falls back to a full stack
    trace; ftrace_stackmap_get_id() returns -E2BIG rather than
    truncating, so two distinct stacks sharing their first 64 frames
    can no longer collapse to one stack_id.
  - reset is now genuinely destructive and coherent: under the
    reader_sem write lock it clears the owning trace_array's ring
    buffer (and snapshot) BEFORE clearing the map, and uses
    tracing_reset_all_cpus() rather than _online_cpus() so a
    TRACE_STACK_ID written by a now-offline CPU cannot survive and
    dangle against a cleared map.
  - __ftrace_trace_stack() reserves the TRACE_STACK_ID ring-buffer
    slot before inserting into the map, so stack_map_stat counters and
    ref_count stay consistent with what the ring buffer actually
    references (failed reservation -> full stack, map untouched).
  - ref_count / successes / drops now saturate (INT_MAX / LONG_MAX)
    instead of wrapping on multi-hour, billions-of-hits traces.
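
To make the depth-limit and saturation behaviour concrete, here is a
user-space toy model in Python. It is not the kernel's lock-free
implementation: the class, its dict-based storage and the id numbering
are assumptions for illustration. Only the depth limit of 64, the
-E2BIG return and the INT_MAX ceiling come from the description above.

```python
import errno

MAX_DEPTH = 64         # FTRACE_STACKMAP_MAX_DEPTH, as stated in this letter
INT_MAX = 2**31 - 1    # saturation ceiling for ref_count


class ToyStackMap:
    """Single-threaded toy model; the real map is lock-free kernel code."""

    def __init__(self):
        self.ids = {}        # stack (as a tuple of addresses) -> stack_id
        self.ref_count = []  # indexed by stack_id

    def get_id(self, stack):
        if len(stack) > MAX_DEPTH:
            # Never truncate: the caller records a full stack trace instead.
            return -errno.E2BIG
        key = tuple(stack)
        stack_id = self.ids.get(key)
        if stack_id is None:
            stack_id = len(self.ref_count)
            self.ids[key] = stack_id
            self.ref_count.append(0)
        # Saturate instead of wrapping on very long traces.
        self.ref_count[stack_id] = min(self.ref_count[stack_id] + 1, INT_MAX)
        return stack_id
```

Two stacks of 65 frames that share their first 64 frames both get
-E2BIG in this model, so neither is inserted and they cannot collapse
to a single stack_id.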

Global-instance gating:
  - Enabling 'stackmap' on a secondary instance via the aggregate
    trace_options file is now rejected, not just hidden in the
    per-instance options/ directory.
  - tracefs init is failure-atomic: the required stack_map file is
    created before the map pointer is published; if it cannot be
    created the map is destroyed and never published. An init-state
    (PENDING/DONE/FAILED) lets boot-time trace_options=stackmap set
    the flag before the map exists (hot path falls back until it is
    published) while still rejecting enables after a permanent init
    failure, so options/stackmap never reports an enabled no-op.
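
The init-state rules above can be summarized as two small predicates.
This is a Python sketch of the decision logic only; the enum and
function names are hypothetical, and the kernel code is of course C and
structured differently.

```python
from enum import Enum


class InitState(Enum):
    PENDING = 0  # tracefs init has not run yet (e.g. boot-time cmdline)
    DONE = 1     # stack_map file created and map pointer published
    FAILED = 2   # permanent init failure, map destroyed


def may_enable(state):
    """Setting the 'stackmap' flag is rejected only after a failed init."""
    return state is not InitState.FAILED


def hot_path_uses_map(flag_set, state):
    """The map is consulted only once published; otherwise full stacks."""
    return flag_set and state is InitState.DONE
```

With these rules a boot-time enable in PENDING is accepted but has no
effect until the state reaches DONE, and an enable after FAILED is
refused, which is what keeps options/stackmap from reporting an enabled
no-op.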

ABI / tooling:
  - Binary magic corrected to 0x46534D42 ('FSMB'); version is 1 (first
    upstream ABI). Documentation, tool and selftest updated to match.
  - Text and binary exports now follow the same trampoline-marker and
    trace_adjust_address() handling as the normal stack print path.
  - stackmap_dump.py emits hex addresses in 'ips' and shows the ftrace
    trampoline marker only in the resolved 'symbols'.
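
A quick way to sanity-check the magic value: read most significant byte
first, 0x46534D42 spells 'FSMB'. The Python sketch below shows that and
a minimal check of a buffer's first four bytes. The helper names are
hypothetical, and the byte order of the file is an assumption here (it
is a parameter, not something this letter specifies).

```python
import struct

STACKMAP_MAGIC = 0x46534D42  # value quoted in this letter


def magic_as_ascii(magic):
    """Return the four bytes of the magic, most significant byte first."""
    return struct.pack(">I", magic).decode("ascii")


def starts_with_magic(buf, byteorder="little"):
    """Check the first 4 bytes of buf; the byte order is an assumption."""
    if len(buf) < 4:
        return False
    return int.from_bytes(buf[:4], byteorder) == STACKMAP_MAGIC
```

If the value is written in little-endian order, a hex dump of the file
starts with the bytes 42 4D 53 46, which read as "BMSF" in ASCII.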

Selftests:
  - New stackmap-reset.tc: verifies reset clears stale <stack_id N>
    from the trace buffer and checks the stack_map_bin magic/version.
  - stackmap-instance-gate.tc extended to verify the trace_options
    write path is rejected on a secondary instance.
  - stackmap-basic.tc no longer treats a nonzero drops count as a
    failure (drops is a by-design fallback); only zero successes with
    nonzero drops is fatal.

Open questions for maintainers
==============================

Two design points where I would value direction before polishing
further:

1. Eager vs lazy allocation. The element pool is allocated at
   fs_initcall when CONFIG_FTRACE_STACKMAP=y, regardless of whether
   userspace ever enables the option (~8 MB at the default bits=14,
   up to ~135 MB at bits=18). This keeps the hot path allocation-free
   with no allocation-failure path under tracing pressure. Is eager
   allocation acceptable, or would you prefer lazy allocation on the
   first 'echo 1 > options/stackmap'?

2. Binary ABI now or later. stack_map_bin is a new tracefs binary
   interface (magic 0x46534D42, version 1). Is it acceptable to
   introduce it now, or would you prefer the first version ship with
   the text stack_map interface only and add the binary export once
   trace-cmd / libtraceevent integration is designed?

Test results
============

QEMU (aarch64 virt, v7.1-rc5 + this series), boot-to-init smoke test:
  - stackmap functional suite: 16/16 PASS, including reset clearing the
    trace buffer (stale <stack_id> count 48 -> 0), stack_map_bin
    magic/version, global-vs-secondary instance gating, and the
    trace_options rejection on a secondary instance.
  - boot-time activation (trace_options=stackmap,stacktrace on the
    kernel cmdline): 3/3 PASS -- the option survives the
    pre-initialization window and the map is live once published.
  - ftrace startup self-tests pass with the new TRACE_STACK_ID entry.

The retention numbers under "Effect on retention" were collected on a
Xiaomi SM8850 (ARM64) device running an Android workload, comparing the
same buffer with the option off and on.

Known limitations
=================

- Per-instance stackmap support is not included; the option is gated
  to the global instance (in the tracefs layout and at the
  set_tracer_flag() write path). Per-instance maps are a follow-up.
- Deduplication is best-effort, not strict: under heavy concurrent
  contention two CPUs racing with the same stack hash may each claim a
  different slot, producing a few duplicate entries; ref_count is then
  split across them. This keeps the hot path lock-free.
- The stackmap covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, serialized against reset
  but not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles (see open question 2).
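
Because deduplication is best-effort, an analysis script may want to
fold duplicate entries back together by summing their ref_counts. A
minimal Python sketch follows; the (stack_id, ref_count, ips) tuple
layout is an assumption for illustration and is not the format emitted
by stack_map or stackmap_dump.py.

```python
from collections import defaultdict


def merge_duplicates(entries):
    """Group entries with identical frames and sum their ref_counts.

    entries: iterable of (stack_id, ref_count, ips) tuples, a hypothetical
    layout. Returns {ips_tuple: (sorted stack_ids, total_ref_count)}.
    """
    ids = defaultdict(list)
    totals = defaultdict(int)
    for stack_id, ref_count, ips in entries:
        key = tuple(ips)
        ids[key].append(stack_id)
        totals[key] += ref_count
    return {key: (sorted(ids[key]), totals[key]) for key in ids}
```

Each merged entry keeps every stack_id that mapped to the same frames,
so events in the trace can still be resolved through any of them.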

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace
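
For scripted setups, the same two writes can be done from Python. This
is a small sketch: the function name is hypothetical, the default
directory is the one used in the commands above, and it needs the same
privileges as the shell version.

```python
import os


def enable_stackmap(tracing_dir="/sys/kernel/debug/tracing"):
    """Write 1 to options/stackmap and then options/stacktrace."""
    for option in ("stackmap", "stacktrace"):
        path = os.path.join(tracing_dir, "options", option)
        with open(path, "w") as f:
            f.write("1\n")
```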

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 177 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  22 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          | 216 ++++-
 kernel/trace/trace.h                          |  17 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_selftest.c                 |   1 +
 kernel/trace/trace_stackmap.c                 | 889 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 111 +++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  54 ++
 .../ftrace/test.d/ftrace/stackmap-reset.tc    |  76 ++
 tools/tracing/stackmap_dump.py                | 164 ++++
 15 files changed, 1821 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-reset.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1

