From: Pengfei Li <[email protected]>

Hi Masami, Steven, all,

This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.

[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
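
As a rough illustration of the idea only (the kernel side is a lock-free
hash table in C, not this), the Python sketch below stores each distinct
stack once and hands back a small integer id. The StackInterner class,
its method names and the sample addresses are hypothetical.

```python
class StackInterner:
    """Toy model: store each distinct stack once, return a small id."""

    def __init__(self):
        self._ids = {}      # stack (tuple of addresses) -> id
        self._stacks = []   # id -> stack, i.e. the "export" side

    def get_id(self, stack):
        key = tuple(stack)
        sid = self._ids.get(key)
        if sid is None:
            # First time this stack is seen: allocate the next id.
            sid = len(self._stacks)
            self._ids[key] = sid
            self._stacks.append(key)
        return sid

    def lookup(self, sid):
        return self._stacks[sid]


interner = StackInterner()
events = [
    [0xFFFF0001, 0xFFFF0002],
    [0xFFFF0001, 0xFFFF0003],
    [0xFFFF0001, 0xFFFF0002],  # same stack as the first event
]
ids = [interner.get_id(s) for s in events]
print(ids)  # the first and third events share one id
```

Each event then only needs to carry the id; the full stack is looked up
once at read time.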

Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.

Changes since v2
================

Patch 1 (lock-free stackmap):
  - Hot-path counters changed from atomic64_t to per-CPU local_t.
    This avoids the raw_spinlock_t fallback that atomic64_t uses on
    32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
  - reset() now serializes against tracefs readers via an
    rw_semaphore (held for write during the clearing memset, held
    for read by seq_file iteration and bin snapshot construction).
    synchronize_rcu() alone was insufficient because seq_file/bin
    readers are in process context, not preempt-disabled.
  - get_id() uses atomic_read_acquire() on smap->resetting so
    subsequent loads of entry->key/val are properly ordered after
    the check (an LKMM control dependency orders only later stores,
    not later loads).
  - All plain reads of entry->key now use READ_ONCE() to avoid
    LKMM data races with the cmpxchg writer.
  - val->nr in the hot path now uses READ_ONCE() to keep style
    consistent with the seq_show / bin_open readers.
  - stackmap_seq_next() now updates *pos past map_size on EOF so
    seq_read() terminates instead of looping on the last element.
  - Added a comment in the cmpxchg-claim path documenting that
    two CPUs racing with the same key_hash may produce a small
    number of duplicate entries; this is an accepted trade-off
    for keeping the hot path lock-free.
  - Removed BUG_ON in create path (the constraint is satisfied by
    construction; no runtime check needed).
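
To illustrate the stackmap_seq_next() item above, here is a toy Python
model of a reader that restarts from the saved position on every read
call and emits one record per call. It is a simplification for
illustration, not the kernel's seq_read(); read_all() and its
parameters are hypothetical names.

```python
def read_all(items, bump_pos_on_eof, max_reads=10):
    """Return (records, terminated) for a toy position-based reader."""
    pos = 0
    out = []
    for _ in range(max_reads):
        # start(): resolve the saved position to an element, or EOF.
        if pos >= len(items):
            return out, True
        out.append(items[pos])
        # next(): advance to the following element.
        if pos + 1 < len(items):
            pos += 1
        elif bump_pos_on_eof:
            pos = len(items)  # step past the end so start() sees EOF
        # else: pos stays on the last element, which is the bug
    return out, False


print(read_all(list("abc"), bump_pos_on_eof=True))
print(read_all(list("abc"), bump_pos_on_eof=False))
```

In this model the reader only reaches EOF when the position is moved
past the last element; otherwise it keeps re-emitting the final record
until the read cap is hit.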

Patch 2 (integration):
  - 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
    ZEROED_TRACE_FLAGS so the option is only exposed under the
    top-level trace instance, matching the convention used for
    other global-only options such as 'printk' and 'record-cmd'.
    Secondary instances under tracing/instances/*/ no longer see
    the option at all, instead of seeing it as a silent no-op.
  - TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
    so ftrace startup selftests don't reject the entry type.
  - Corrected a comment about how global_trace.stackmap is
    zero-initialized (BSS, not kzalloc).

Patch 3 (docs / selftest / tooling):
  - Selftest now reads trace contents BEFORE switching back to the
    nop tracer (tracer_init() calls tracing_reset_online_cpus()
    which would have emptied the ring buffer).
  - Added 'function:tracer' to the selftest '# requires:' line so
    ftracetest skips when CONFIG_FUNCTION_TRACER is disabled
    instead of failing spuriously.
  - Selftest grep tightened to '<stack_id' to avoid future
    false-positives if any other tracepoint name contains
    "stack_id".
  - New stackmap-instance-gate.tc selftest asserts the option and
    stack_map* nodes are present on the global instance and absent
    on a freshly-created secondary instance, locking in the
    TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
  - Documentation Performance section made vendor-neutral
    ("aarch64 SMP system" instead of a specific device name) and
    the term "Hit rate" replaced with "Dedup rate" to match the
    actual stat field name (success_rate).
  - Documentation Design section now states that deduplication is
    best-effort under heavy contention (cmpxchg races may produce
    a small number of duplicate entries for the same stack), so
    users observing entries > unique-stacks have a documented
    explanation.

Test results
============

Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger

Functional tests (all PASS):
  - tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
  - options/stackmap writable, trace shows <stack_id N>
  - stack_map text export with correct symbols
  - reset clears entries when tracing stopped
  - reset rejected (-EBUSY) while tracing active
  - per-event trigger: only specified events get stacks
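
For reference, a check along the lines of the '<stack_id N>' test above
could be scripted as below. This is a sketch: the regular expression
assumes the marker is printed literally as '<stack_id N>', and the
sample trace lines are made up rather than copied from a real trace.

```python
import re
from collections import Counter

# Assumed marker format, taken from the "<stack_id N>" notation above.
STACK_ID_RE = re.compile(r"<stack_id (\d+)>")


def count_stack_ids(trace_text):
    """Count how many times each stack_id appears in trace output."""
    return Counter(int(m.group(1)) for m in STACK_ID_RE.finditer(trace_text))


# Hypothetical trace excerpt, for illustration only.
sample = """\
 task-123 [000] 1.000: sched_switch: prev_comm=a next_comm=b
 task-123 [000] 1.000: <stack_id 7>
 task-456 [001] 1.002: <stack_id 7>
 task-456 [001] 1.003: <stack_id 12>
"""

counts = count_stack_ids(sample)
for stack_id, hits in sorted(counts.items()):
    print(stack_id, hits)
```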

Performance (sched_switch, 5s):
  entries:       466 / 16384
  successes:     9159
  drops:         0
  success_rate:  100%
  dedup rate:    95.2% (466 unique stacks / 9625 total events)

Performance (kmem_cache_alloc, 5s):
  entries:       1177 / 16384
  successes:     60078
  drops:         0
  success_rate:  100%
  dedup rate:    98.1% (1177 unique stacks / 61255 total events)
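
The dedup rate figures can be re-derived from the stat fields with a few
lines of Python. This assumes, as the numbers above suggest, that total
events = entries + successes (each new entry being the first event for
its stack); dedup_rate() is a hypothetical helper name.

```python
def dedup_rate(entries, successes):
    """Percentage of events that reused an already-stored stack."""
    total = entries + successes  # assumption: first hits are not in successes
    return 100.0 * (1 - entries / total)


sched_switch = dedup_rate(entries=466, successes=9159)
kmem_cache_alloc = dedup_rate(entries=1177, successes=60078)
print(f"sched_switch:     {sched_switch:.1f}%")
print(f"kmem_cache_alloc: {kmem_cache_alloc:.1f}%")
```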

Ring buffer space savings:
  Event               Full stack           Stackmap                        Saving
  ----------------    -----------------    ----------------------------    ------
  sched_switch        9625×88B = 847KB     12B×9625 + 88B×466 = 156KB      82%
  kmem_cache_alloc    61255×88B = 5.4MB    12B×61255 + 88B×1177 = 839KB    84%
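
The table can be recomputed from the per-event sizes it uses (88 bytes
for a full stack record, 12 bytes for a stack_id record, plus one
88-byte copy per unique stack). The sketch below does only that
arithmetic; ring_buffer_saving() is a hypothetical helper name.

```python
FULL_STACK_BYTES = 88  # per-event size of a full stack record (from the table)
STACK_ID_BYTES = 12    # per-event size of a stack_id record (from the table)


def ring_buffer_saving(events, unique_stacks):
    """Return (full_bytes, stackmap_bytes, saving_percent)."""
    full = events * FULL_STACK_BYTES
    dedup = events * STACK_ID_BYTES + unique_stacks * FULL_STACK_BYTES
    return full, dedup, 100.0 * (1 - dedup / full)


for name, events, unique in [
    ("sched_switch", 9625, 466),
    ("kmem_cache_alloc", 61255, 1177),
]:
    full, dedup, pct = ring_buffer_saving(events, unique)
    print(f"{name}: {full} B -> {dedup} B ({pct:.1f}% saved)")
```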

QEMU validation (v3 base: v7.1-rc5)
===================================

The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
  (e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
  secondary instances under tracing/instances/*/

Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.

Known limitations
=================

- Per-instance stackmap support is not included in this series.
  Following the convention used for other global-only options
  (PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
  top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is
  not exposed under tracing/instances/*/options/. Per-instance
  maps would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
  CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
  ever enable the option. At the default bits=14 this is roughly
  8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
  allocation keeps the hot path entirely allocation-free and
  avoids any allocation-failure path under tracing pressure.
  Lazy allocation on first 'echo 1 > options/stackmap' is a
  reasonable follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy
  concurrent contention two CPUs racing in the insert path with
  the same stack hash may each succeed in claiming a different
  slot, producing a small number of duplicate entries for the
  same stack. ref_count is then split across the duplicates.
  This is intentional: it keeps the hot path lock-free and
  bounds memory by the element pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
  the binary format settles.
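
Regarding the best-effort deduplication noted above: a consumer that
wants exact per-stack counts can fold duplicate entries back together
in post-processing by summing ref_count over entries with identical
frames. The sketch below assumes each exported entry has already been
parsed into a (stack_id, ref_count, frames) tuple; that tuple layout
and merge_duplicates() are hypothetical, not the stack_map format.

```python
from collections import defaultdict


def merge_duplicates(entries):
    """Group entries by identical frames and sum their ref_counts."""
    merged = defaultdict(lambda: {"ids": [], "ref_count": 0})
    for stack_id, ref_count, frames in entries:
        slot = merged[tuple(frames)]
        slot["ids"].append(stack_id)
        slot["ref_count"] += ref_count
    return dict(merged)


# Hypothetical parsed entries: ids 3 and 9 hold the same stack.
entries = [
    (3, 40, ["__schedule", "schedule", "do_idle"]),
    (9, 2, ["__schedule", "schedule", "do_idle"]),
    (5, 7, ["kmem_cache_alloc", "alloc_inode"]),
]
merged = merge_duplicates(entries)
for frames, info in merged.items():
    print(info["ids"], info["ref_count"], " <- ".join(frames))
```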

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace


Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 162 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  22 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  78 +-
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_selftest.c                 |   1 +
 kernel/trace/trace_stackmap.c                 | 780 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  57 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 103 +++
 .../test.d/ftrace/stackmap-instance-gate.tc   |  42 +
 tools/tracing/stackmap_dump.py                | 150 ++++
 14 files changed, 1449 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1

