This series mitigates a severe contention issue in the membarrier system
call by replacing the global membarrier_ipi_mutex with per-CPU mutexes
for targeted expedited commands.

Problem:
Currently, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ relies on a single
global mutex to serialize IPIs. On large systems with heavy concurrent
usage, this creates significant contention. The contention escalates to
hard lockups when combined with CFS bandwidth throttling, during which
target CPUs may run with interrupts disabled for extended periods.
If a membarrier caller is waiting for such a CPU to handle its IPI, it
keeps holding the global mutex, stalling all other membarrier callers
system-wide.

Solution:
Patch 1 introduces membarrier_cpu_mutexes to serialize expedited commands
specifically when a target CPU is provided, isolating the lock contention
to the targeted CPU. Broadcast commands continue to use the global mutex.
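
As a rough userspace illustration of the locking split described for
Patch 1 (not the kernel code itself), the lock choice can be sketched with
pthread mutexes. Every name here (`NR_CPUS_SKETCH`, `cpu_mutexes`,
`global_ipi_mutex`, `pick_ipi_mutex`, `expedited_sketch`) is hypothetical
and chosen only for this example:

```c
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count, illustration only */

/* Stand-in for the global mutex still used by broadcast commands. */
static pthread_mutex_t global_ipi_mutex = PTHREAD_MUTEX_INITIALIZER;
/* Stand-in for the per-CPU mutexes used by targeted commands. */
static pthread_mutex_t cpu_mutexes[NR_CPUS_SKETCH];

static void init_cpu_mutexes(void)
{
	for (int i = 0; i < NR_CPUS_SKETCH; i++)
		pthread_mutex_init(&cpu_mutexes[i], NULL);
}

/*
 * Targeted command (valid cpu_id): take only that CPU's mutex.
 * Broadcast command (cpu_id < 0): take the global mutex.
 * Out-of-range ids fall back to the global mutex; that is a choice
 * made for this sketch, not a statement about the kernel's behavior.
 */
static pthread_mutex_t *pick_ipi_mutex(int cpu_id)
{
	if (cpu_id >= 0 && cpu_id < NR_CPUS_SKETCH)
		return &cpu_mutexes[cpu_id];
	return &global_ipi_mutex;
}

static int expedited_sketch(int cpu_id)
{
	pthread_mutex_t *m = pick_ipi_mutex(cpu_id);

	pthread_mutex_lock(m);
	/* The real code would send the IPI(s) and wait for completion here. */
	pthread_mutex_unlock(m);
	return 0;
}
```

With this split, a caller stuck waiting on one slow CPU blocks only other
callers targeting that same CPU, not callers targeting other CPUs.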

Patch 2 modernizes `membarrier_global_expedited` by using scoped cleanup
guards (`guard` and `__free`) to simplify error paths and resource
management.
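
The kernel's `guard()` and `__free()` helpers (include/linux/cleanup.h)
build on the compiler's cleanup attribute. The userspace sketch below
shows the same idea with plain GCC/Clang `__attribute__((cleanup))`; the
macros and functions in it (`SCOPED_LOCK`, `AUTO_FREE`, `scoped_sketch`)
are hypothetical stand-ins, not the kernel API:

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t sketch_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handlers receive a pointer to the annotated variable. */
static void unlock_cleanup(pthread_mutex_t **m)
{
	if (*m)
		pthread_mutex_unlock(*m);
}

static void free_cleanup(void *p)
{
	free(*(void **)p);
}

/* Lock now, unlock automatically when the enclosing scope ends. */
#define SCOPED_LOCK(m)							\
	pthread_mutex_t *scoped_lock_					\
		__attribute__((cleanup(unlock_cleanup))) =		\
		(pthread_mutex_lock(m), (m))

/* Free the annotated pointer automatically at scope exit. */
#define AUTO_FREE __attribute__((cleanup(free_cleanup)))

static int scoped_sketch(size_t n)
{
	if (n == 0)
		return -1;

	AUTO_FREE int *buf = malloc(n * sizeof(*buf));
	if (!buf)
		return -1;

	SCOPED_LOCK(&sketch_mutex);
	if (n > 1024)
		return -2;	/* early return: mutex released, buf freed */

	buf[0] = 42;
	return 0;		/* same cleanup runs on the normal path */
}
```

Every return path releases the lock and the allocation without explicit
`goto out` labels, which is the kind of error-path simplification the
patch description refers to.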

Patch 3 adds a dedicated kselftest reproducer (membarrier_rseq_stress)
so that this interaction stays covered by regression testing. It hammers
targeted membarrier commands from within a deep, heavily throttled
cgroup hierarchy to reproduce the lockup scenario and validate the fix.

Results:
As measured by the stress test introduced in Patch 3, testing on an
AMD Turin machine with 384 CPUs (2 NUMA nodes, SMT=2) shows:

Throughput: ~200x increase in successful membarrier calls.

---
Changes in v3:
v2: https://lore.kernel.org/lkml/[email protected]/
- Fixed the code path when `cpu_id < 0` in membarrier_private_expedited
  as reported by Peter.
- Added Patch 2 to use cleanup guards in `membarrier_global_expedited`.

Aniket Gattani (3):
  sched/membarrier: Use per-CPU mutexes for targeted commands
  sched/membarrier: Modernize membarrier_global_expedited with cleanup
    guards
  selftests/membarrier: Add rseq stress test for CFS throttle
    interactions

 kernel/sched/membarrier.c                     |  98 +-
 tools/testing/selftests/membarrier/Makefile   |   5 +-
 .../membarrier/membarrier_rseq_stress.c       | 951 ++++++++++++++++++
 3 files changed, 1002 insertions(+), 52 deletions(-)
 create mode 100644 tools/testing/selftests/membarrier/membarrier_rseq_stress.c

-- 
2.54.0.545.g6539524ca2-goog