Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention

Sebastian Andrzej Siewior Thu, 13 Nov 2025 07:54:29 -0800

On 2025-11-13 10:24:45 [-0500], Steven Rostedt wrote:
> On Thu, 13 Nov 2025 16:17:29 +0100
> Sebastian Andrzej Siewior <[email protected]> wrote:
> 
> > On 2025-11-13 10:05:24 [-0500], Steven Rostedt wrote:
> > > This means that the chunks are not being freed and we can't be doing
> > > synchronize_rcu() in every exit.  
> > 
> > You don't have to, you can do call_rcu().
> 
> But the chunk isn't being freed. They may be used right away.


Not if you avoid using it until after the rcu callback.

> > > > So I *think* the RCU approach should be doable and cover this.  
> > > 
> > > Where would you put the synchronize_rcu()? In do_exit()?  
> > 
> > simply call_rcu() and let it move to the freelist.
> 
> A couple of issues. One, the chunks are fully used. There's no place to put
> a "rcu_head" in them. Well, we may be able to make use of them.

This could be the first (16?) bytes of the memory chunk.
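
A rough userspace sketch of that idea, under stated assumptions: this is not
kernel code. struct cb_head, defer() and grace_period_elapsed() are made-up
stand-ins for struct rcu_head, call_rcu() and the end of a grace period, and
CHUNK_BYTES is an arbitrary size. It only shows that a fully used chunk can
lend its first bytes to the callback head once its data is dead, and that the
chunk stays off the freelist until the callback has run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES 64	/* hypothetical size, not the kernel's */

/* Stand-in for struct rcu_head: two pointers (16 bytes on 64-bit). */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/*
 * While a chunk is in use every byte is bitmap data. Once it has been
 * released the data is dead, so the leading bytes can hold the head.
 */
union chunk {
	unsigned char data[CHUNK_BYTES];
	struct cb_head head;
};

static struct cb_head *free_list;	/* chunks that may be handed out */
static struct cb_head *pending;		/* chunks waiting for a grace period */

/* Model of call_rcu(): queue the callback, do not run it yet. */
static void defer(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Model of a grace period ending: run every queued callback. */
static void grace_period_elapsed(void)
{
	struct cb_head *head = pending;

	pending = NULL;
	while (head) {
		struct cb_head *next = head->next;

		head->func(head);
		head = next;
	}
}

/* Callback: only now does the chunk become allocatable again. */
static void chunk_to_free_list(struct cb_head *head)
{
	head->next = free_list;
	free_list = head;
}

/* Release a chunk: overlay the head on its data and defer the reuse. */
static void chunk_release(union chunk *c)
{
	defer(&c->head, chunk_to_free_list);
}

/* Hand out a zeroed chunk from the freelist, or NULL if none is ready. */
static union chunk *chunk_alloc(void)
{
	struct cb_head *head = free_list;
	union chunk *c;

	if (!head)
		return NULL;
	free_list = head->next;
	c = (union chunk *)head;
	memset(c->data, 0, sizeof(c->data));
	return c;
}
```

With this, a chunk_release() followed immediately by chunk_alloc() yields
NULL; the chunk only comes back after grace_period_elapsed() has run.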

> Second, if there's a lot of tasks exiting and forking, we can easily run
> out of chunks that are waiting to be "freed" via call_rcu().

But this is a general RCU problem and not new here. The task_struct and
everything around it (including the stack) is RCU freed.

> > 
> > > Also understanding what this is used for helps in understanding the scope
> > > of protection needed.
> > > 
> > > The pid_list is created when you add anything into one of the pid files in
> > > tracefs. Let's use /sys/kernel/tracing/set_ftrace_pid:
> > > 
> > >   # cd /sys/kernel/tracing
> > >   # echo $$ > set_ftrace_pid
> > >   # echo 1 > options/function-fork
> > >   # cat set_ftrace_pid
> > >   2716
> > >   2936
> > >   # cat set_ftrace_pid
> > >   2716
> > >   2945
> > > 
> > > What the above did was to create a pid_list for the function tracer. I
> > > added the bash process pid using $$ (2716). Then when I cat the file, it
> > > showed the pid for the bash process as well as the pid for the cat process,
> > > as the cat process is a child of the bash process. The function-fork option
> > > means to add any child process to the set_ftrace_pid if the parent is
> > > already in the list. It also means to remove the pid if a process in the
> > > list exits.
> > 
> > This adding/ add-on-fork, removing and remove-on-exit is the only write
> > side?
> 
> That and manual writes to the set_ftrace_pid file.

This looks minimal. I misunderstood, then: I thought a context switch could
also contribute to it.

> > > What we are protecting against is when one chunk is freed, but then
> > > allocated again for a different set of PIDs. Where the reader has the chunk,
> > > it was freed and re-allocated and the bit that is about to be checked
> > > doesn't represent the bit it is checking for.
> > 
> > This I assumed.
> > And the kfree() at the end can not happen while there is still a reader?
> 
> Correct. That's done by the pid_list user:
> 
> In clear_ftrace_pids():
> 
>       /* Wait till all users are no longer using pid filtering */
>       synchronize_rcu();
> 
>       if ((type & TRACE_PIDS) && pid_list)
>               trace_pid_list_free(pid_list);
> 
>       if ((type & TRACE_NO_PIDS) && no_pid_list)
>               trace_pid_list_free(no_pid_list);

And the callers of trace_pid_list_is_set() are then always inside an RCU
read-side critical section? I assume so, since it wouldn't make sense otherwise.

> -- Steve

Sebastian

Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention
