Background
==========

One of our latency-sensitive services reported random CPU %sys spikes.
After a thorough investigation, we identified the root cause. The key
kernel stacks are as follows:

- Task A

2026-02-14-16:53:40.938243: [CPU198] 2156302(bpftrace) cgrp:4019437 pod:4019253

        find_kallsyms_symbol+142
        module_address_lookup+104
        kallsyms_lookup_buildid+203
        kallsyms_lookup+20
        print_rec+64
        t_show+67
        seq_read_iter+709
        seq_read+165
        vfs_read+165
        ksys_read+103
        __x64_sys_read+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

This task (2156302, bpftrace) is reading
/sys/kernel/debug/tracing/available_filter_functions to check whether a
function is traceable:

  https://github.com/bpftrace/bpftrace/blob/master/src/tracefs/tracefs.h#L21

Reading the available_filter_functions file is time-consuming, as it
contains tens of thousands of functions:

  $ cat /sys/kernel/debug/tracing/available_filter_functions | wc -l
  61081

  $ time cat /sys/kernel/debug/tracing/available_filter_functions > /dev/null

  real    0m0.452s
  user    0m0.000s
  sys     0m0.452s

Consequently, the ftrace_lock is held by this task for an extended period.
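
The reason the lock is held for so long is the seq_file iteration pattern
used by this interface: t_start() takes ftrace_lock and t_stop() releases
it, so the lock is held across the formatting of every buffer chunk of a
~61k-entry listing. A simplified sketch of that pattern (abridged from
kernel/trace/ftrace.c; error handling and iterator details omitted):

  /* Each seq_read() chunk brackets its work with ftrace_lock. */
  static void *t_start(struct seq_file *m, loff_t *pos)
  {
          struct ftrace_iterator *iter = m->private;

          mutex_lock(&ftrace_lock);
          /* ... position the iterator at *pos and return the record ... */
          return iter;
  }

  static void t_stop(struct seq_file *m, void *p)
  {
          mutex_unlock(&ftrace_lock);
  }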

- Other Tasks

2026-02-14-16:53:41.437094: [CPU79] 2156308(bpftrace) cgrp:4019437 pod:4019253

        mutex_spin_on_owner+108
        __mutex_lock.constprop.0+1132
        __mutex_lock_slowpath+19
        mutex_lock+56
        t_start+51
        seq_read_iter+250
        seq_read+165
        vfs_read+165
        ksys_read+103
        __x64_sys_read+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

Since ftrace_lock is held by Task-A and Task-A is actively running on a
CPU, all other tasks waiting for the same lock will spin on their
respective CPUs. This leads to increased CPU pressure.

Reproduction
============

This issue can be reproduced simply by running
`cat available_filter_functions`.

- Single process reading available_filter_functions:

  $ time cat /sys/kernel/tracing/available_filter_functions > /dev/null

  real    0m0.452s
  user    0m0.001s
  sys     0m0.451s

- Six processes reading available_filter_functions simultaneously:

  for i in `seq 0 5`; do
    time cat /sys/kernel/tracing/available_filter_functions > /dev/null &
  done

  The results are as follows:

  real    0m1.801s
  user    0m0.000s
  sys     0m1.779s

  real    0m1.804s
  user    0m0.001s
  sys     0m1.791s

  real    0m1.805s
  user    0m0.000s
  sys     0m1.792s

  real    0m1.806s
  user    0m0.001s
  sys     0m1.796s

  As more processes are added, the system time increases correspondingly.

Solution
========

One approach is to optimize the reading of available_filter_functions to
make it as fast as possible. However, even with a faster read, the
contention caused by optimistic spinning on the mutex remains a risk.

Therefore, we need to consider an alternative solution that avoids
optimistic spinning for heavy mutexes that may be held for long durations.
Note that we do not want to disable CONFIG_MUTEX_SPIN_ON_OWNER entirely, as
that could lead to unexpected performance regressions.

In this patch, a new wrapper mutex_lock_nospin() is used for ftrace_lock to
selectively disable optimistic spinning.
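
As a rough illustration only (the actual implementation is in the patches
below; the helper name __mutex_lock_slowpath_nospin here is hypothetical),
the wrapper would take the fast path when the mutex is uncontended and
otherwise go straight to the sleeping slow path, skipping the spin:

  /* Hypothetical sketch: acquire @lock without optimistic spinning.
   * Contended waiters sleep immediately instead of burning %sys
   * while the owner runs. */
  void mutex_lock_nospin(struct mutex *lock)
  {
          might_sleep();

          if (!__mutex_trylock_fast(lock))
                  __mutex_lock_slowpath_nospin(lock); /* hypothetical */
  }

Callers with known long hold times, such as ftrace_lock users, would then
opt in by replacing mutex_lock() with mutex_lock_nospin().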

Yafang Shao (2):
  locking: add mutex_lock_nospin()
  ftrace: disable optimistic spinning for ftrace_lock

 include/linux/mutex.h  |  3 +++
 kernel/locking/mutex.c | 39 +++++++++++++++++++++++++------
 kernel/trace/ftrace.c  | 52 +++++++++++++++++++++---------------------
 3 files changed, 61 insertions(+), 33 deletions(-)

-- 
2.47.3

