On 1/30/26 9:53 PM, Chen Ridong wrote:

On 2026/1/30 23:42, Waiman Long wrote:
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held, as the
workqueue flushing code may acquire cpus_read_lock() itself or acquire
locks that have a locking dependency on cpus_read_lock() further down
the chain. Below is an example of such a circular locking problem.

   ======================================================
   WARNING: possible circular locking dependency detected
   6.18.0-test+ #2 Tainted: G S
   ------------------------------------------------------
   test_cpuset_prs/10971 is trying to acquire lock:
   ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

   but task is already holding lock:
   ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:
   -> #4 (cpuset_mutex){+.+.}-{4:4}:
   -> #3 (cpu_hotplug_lock){++++}-{0:0}:
   -> #2 (rtnl_mutex){+.+.}-{4:4}:
   -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
   -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

   Chain exists of:
     (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

   5 locks held by test_cpuset_prs/10971:
    #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
    #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
    #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
    #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
    #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

   Call Trace:
    <TASK>
      :
    touch_wq_lockdep_map+0x93/0x180
    __flush_workqueue+0x111/0x10b0
    housekeeping_update+0x12d/0x2d0
    update_parent_effective_cpumask+0x595/0x2440
    update_prstate+0x89d/0xce0
    cpuset_partition_write+0xc5/0x130
    cgroup_file_write+0x1a5/0x680
    kernfs_fop_write_iter+0x3df/0x5f0
    vfs_write+0x525/0xfd0
    ksys_write+0xf9/0x1d0
    do_syscall_64+0x95/0x520
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
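
In code form, the nesting that produces the above report is roughly the
following. This is only a condensed sketch: the mutex, workqueue and
function names below are stand-ins for the real kernel symbols, and
everything between taking the locks and flushing the workqueue is elided.

#include <linux/cpu.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

/* Stand-ins for the real cpuset_mutex and sync_wq. */
static DEFINE_MUTEX(cpuset_mutex);
static struct workqueue_struct *sync_wq;

/* Condensed view of cpuset_partition_write() down to housekeeping_update() */
static void partition_write_sketch(void)
{
	cpus_read_lock();		/* cpu_hotplug_lock, lock #3 above */
	mutex_lock(&cpuset_mutex);	/* lock #4 above */

	/*
	 * update_prstate()/update_parent_effective_cpumask() eventually
	 * reach housekeeping_update(), which flushes workqueues while
	 * both of the locks above are still held:
	 */
	flush_workqueue(sync_wq);	/* waits on (wq_completion)sync_wq */

	mutex_unlock(&cpuset_mutex);
	cpus_read_unlock();
}

If any work item running on sync_wq (or a lock taken by one) has a
dependency on cpu_hotplug_lock, this wait completes the cycle shown in
the chain above.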

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding cpus_read_lock() and
cpuset_mutex. The current set of workqueues flushed by
housekeeping_update() may not have work functions that call
cpus_read_lock() directly, but we are likely to extend the list of
flushed workqueues in the future. Moreover, the current set of work
functions may acquire locks that have cpu_hotplug_lock further down
their dependency chains.

One way to do that is to introduce a new top-level cpuset_top_mutex
which will be acquired first. This new cpuset_top_mutex will provide
the needed mutual exclusion without the need to hold cpus_read_lock().
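
A minimal sketch of the intended lock ordering is shown below. Again,
this is an illustration of the idea rather than the actual patch: the
hk_update_needed flag and do_housekeeping_update() helper are made up
for the example, with the latter standing in for the real
housekeeping_update() call and its workqueue flushes.

#include <linux/cpu.h>
#include <linux/mutex.h>

/* New top-level mutex, always taken before cpu_hotplug_lock and cpuset_mutex. */
static DEFINE_MUTEX(cpuset_top_mutex);
static DEFINE_MUTEX(cpuset_mutex);	/* stand-in for the real one */

static void do_housekeeping_update(void)
{
	/* stand-in for housekeeping_update() and its workqueue flushes */
}

static void partition_update_sketch(void)
{
	bool hk_update_needed;

	mutex_lock(&cpuset_top_mutex);	/* serializes cpuset updates */

	cpus_read_lock();
	mutex_lock(&cpuset_mutex);
	/* ... recompute partition state, note whether isolated_cpus changed ... */
	hk_update_needed = true;	/* placeholder for the real check */
	mutex_unlock(&cpuset_mutex);
	cpus_read_unlock();

	/*
	 * The workqueue flushes inside the housekeeping update now run with
	 * only cpuset_top_mutex held, so they can no longer close a cycle
	 * through cpu_hotplug_lock or cpuset_mutex.
	 */
	if (hk_update_needed)
		do_housekeeping_update();

	mutex_unlock(&cpuset_top_mutex);
}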

Introducing a new global lock warrants careful consideration. I wonder if we
could make all updates to isolated_cpus asynchronous. If that is feasible, we
could avoid adding a global lock altogether. If not, we need to clarify which
updates must remain synchronous and which ones can be handled asynchronously.

Almost all of the cpuset code runs with cpuset_mutex held together with
either cpus_read_lock() or cpus_write_lock(), so there is no concurrent
access to or update of any of the cpuset internal data. The new
cpuset_top_mutex is added to resolve the possible deadlock scenarios with
the new housekeeping_update() call without breaking this model. Allowing
parallel concurrent access/update to cpuset data would greatly complicate
the code, and we would likely miss some corner cases that we would then
have to fix later. We would only do that if cpuset were in a
performance-critical path, but it is not. And it is not just isolated_cpus
that we are protecting; all the other cpuset data may be at risk if we
don't have another top-level mutex to protect them.

Cheers,
Longman
