Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()

Tengda Wu Tue, 09 Jun 2026 04:12:55 -0700

On 2026/6/9 17:43, Petr Mladek wrote:
> Added live-patching mailing list.
> 
> On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
>> The current check in rethook_find_ret_addr() prevents obtaining a return
>> address when the target task is marked as running. However, this condition
>> is both insufficient for correctness and unnecessary for its intended
>> purpose.
>>
>> The check is inherently racy: a task can begin running on another CPU
>> immediately after task_is_running() returns false, potentially leading to
>> concurrent modification of rethook data structures while the iteration is
>> in progress.
>>
>> Rather than trying to fix this unreliable check deep in the unwinding
>> path, simply remove it. The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
>> even if the iteration goes off the rails and returns invalid information,
>> it will not crash. Callers that require consistency must provide a safe
>> context themselves.
>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>> Signed-off-by: Tengda Wu <[email protected]>
>> ---
>> v3: Improve commit message: clarify safety semantics and document that RCU 
>> guarantees no crash.
>> v2: 
>> https://lore.kernel.org/all/[email protected]/
>> v1: 
>> https://lore.kernel.org/all/[email protected]/
>>
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct 
>> *tsk, unsigned long frame
>>      if (WARN_ON_ONCE(!cur))
>>              return 0;
>>  
>> -    if (tsk != current && task_is_running(tsk))
>> -            return 0;
>> -
> 
> The description of the function should be updated as well. It still
> mentions:
> 
>  * The @tsk must be 'current' or a task which is not running.
> 
> Instead it should explain that it safe to call the function even
> on another running tasks but the returned address is not reliable
> then.
> 

Oh, I forgot that. Thanks for pointing it out.

>>      do {
>>              ret = __rethook_find_ret_addr(tsk, cur);
>>              if (!ret)
> 
> I am still a bit concerned about the motivation.
> 
> Tengda mentioned at
> https://lore.kernel.org/all/[email protected]/
> that they tried to verify livepatching:
> 
> <paste>
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
> 
> We construct a function foo() that calls bar():
> 
> void bar(void)
> {
>     for (;;) {
>         schedule();
>     }
> }
> 
> void foo(void)
> {
>     bar();
> }
> 
> A kretprobe is attached to bar():
> 
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
> 
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
> 
> After that, the live patch is activated to attempt replacing the 
> implementation
> of foo(). The expectation is that this should succeed.
> 
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside 
> rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
> 
> **Not sure if this is correct:**
> 
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
> </paste>
> 
> More background:
> ----------------
> 
> The test is artificial because it keeps the RUNNING state before
> calling schedule, see
> https://lore.kernel.org/all/[email protected]/
> 
> 
> 
> My questions:
> 
> Does this patch allows to livepatch the above mentioned test code?

At least it will no longer be blocked by the rethook check.

> Is the livepatching safe?
> Does it help in another scenarios?
> 
> 
> My opinion:
> 
> The livepatching might be safe only when the process is migrating
> itself. I mean that it might be safe even when it is RUNNING as long
> at it is _current_.
> 
> I agree that we do not need to enforce this in rethook_find_ret_addr()
> if the function is used also in other scenarios, for example, by
> ftrace/BTF for taking snapshots of other processes.
> 
> But we need to make sure that the backtrace is reliable when
> livepatching (migrating) the task.
> 
> Best Regards,
> Petr

Peter actually touched on this point earlier:

<paste>
Things like live-patch use task_call_func(), which ensures the callback
function is done while holding sufficient locks for the task to not
change state.
</paste>

>From my understanding, removing this check should not introduce
stack reliability issues for livepatch.

Thanks,
Tengda
Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()

Reply via email to