On 1/29/26 3:20 AM, Chen Ridong wrote:

On 2026/1/29 16:01, Chen Ridong wrote:

On 2026/1/28 12:42, Waiman Long wrote:
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held, as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
that have a locking dependency with cpus_read_lock() further down the
chain. Below is an example of such a circular locking problem.

   ======================================================
   WARNING: possible circular locking dependency detected
   6.18.0-test+ #2 Tainted: G S
   ------------------------------------------------------
   test_cpuset_prs/10971 is trying to acquire lock:
   ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

   but task is already holding lock:
   ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:
   -> #4 (cpuset_mutex){+.+.}-{4:4}:
   -> #3 (cpu_hotplug_lock){++++}-{0:0}:
   -> #2 (rtnl_mutex){+.+.}-{4:4}:
   -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
   -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

   Chain exists of:
     (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

   5 locks held by test_cpuset_prs/10971:
    #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
    #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
    #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
    #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
    #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

   Call Trace:
    <TASK>
      :
    touch_wq_lockdep_map+0x93/0x180
    __flush_workqueue+0x111/0x10b0
    housekeeping_update+0x12d/0x2d0
    update_parent_effective_cpumask+0x595/0x2440
    update_prstate+0x89d/0xce0
    cpuset_partition_write+0xc5/0x130
    cgroup_file_write+0x1a5/0x680
    kernfs_fop_write_iter+0x3df/0x5f0
    vfs_write+0x525/0xfd0
    ksys_write+0xf9/0x1d0
    do_syscall_64+0x95/0x520
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to call
housekeeping_update() without holding cpus_read_lock() and
cpuset_mutex. One way to do that is to introduce a new top-level
isolcpus_update_mutex which will be acquired first whenever the set of
isolated CPUs may have to be updated. This new isolcpus_update_mutex
will provide the needed mutual exclusion without having to hold
cpus_read_lock().
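
As a rough sketch of the intended lock ordering (illustrative only;
the helper name and the housekeeping_update() arguments here are
assumptions, not the actual patch code):

    static DEFINE_MUTEX(isolcpus_update_mutex);

    /* Hypothetical helper for illustration, not the actual patch */
    static void isolated_cpus_update(struct cpumask *new_isolated)
    {
            mutex_lock(&isolcpus_update_mutex);

            cpus_read_lock();
            mutex_lock(&cpuset_mutex);
            /* rebuild sched domains, compute the new isolated cpumask */
            mutex_unlock(&cpuset_mutex);
            cpus_read_unlock();

            /*
             * The flush-heavy housekeeping update now runs with neither
             * cpus_read_lock() nor cpuset_mutex held, breaking the
             * sync_wq -> cpu_hotplug_lock -> cpuset_mutex chain shown
             * above, while isolcpus_update_mutex still serializes
             * concurrent isolated cpumask updates.
             */
            housekeeping_update(new_isolated, HK_TYPE_DOMAIN);

            mutex_unlock(&isolcpus_update_mutex);
    }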

When I reviewed Frederic's patches, I was concerned about this issue.
However, I was not certain whether any flushed worker would need to
acquire cpu_hotplug_lock or cpuset_mutex.

Despite this warning, I do not understand how the wq_completion would
ever need to acquire cpu_hotplug_lock or cpuset_mutex.

The reason I want to understand how the wq_completion acquires
cpu_hotplug_lock or cpuset_mutex is to determine whether
isolcpus_update_mutex is truly necessary. As I mentioned in my previous
email, I am concerned about a potential use-after-free (UAF) issue,
which might imply that isolcpus_update_mutex is required in most places
that currently acquire cpuset_mutex, with the possible exception of the
hotplug path. Is that right?

A circular lock dependency can involve more than two tasks/parties. In
this case, the task that holds the wq_completion (i.e. the one flushing
the workqueue) does not itself need to acquire cpu_hotplug_lock.
Suppose the flush has to wait for a worker whose work function acquires
cpu_hotplug_lock (cpus_read_lock()), and another task tries to acquire
cpus_write_lock() in the interim. The worker will then block behind the
pending write lock, which cannot be granted and released until the
original task that called flush_workqueue() drops its own read lock.
In essence, it is a deadlock.
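
To make the cycle concrete, below is a simplified sketch of the three
parties (illustrative kernel-style pseudocode; sync_wq stands in for
the flushed workqueue, and these are not the exact call paths from the
trace above):

    /* Task A: the cpuset writer */
    cpus_read_lock();                /* cpu_hotplug_lock, read side */
    mutex_lock(&cpuset_mutex);
    flush_workqueue(sync_wq);        /* waits for worker W to finish */

    /* Task B: a concurrent CPU hotplug operation */
    cpus_write_lock();               /* waits for A to drop its read lock */

    /* Worker W: work function queued on sync_wq */
    cpus_read_lock();                /* blocks behind pending writer B */

    /*
     * A waits for W, W waits for B, B waits for A: a three-party
     * deadlock even though A never takes the write lock itself.
     */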

Cheers,
Longman

