On 2026/6/9 17:43, Petr Mladek wrote:
> Added live-patching mailing list.
>
> On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
>> The current check in rethook_find_ret_addr() prevents obtaining a return
>> address when the target task is marked as running. However, this condition
>> is both insufficient for correctness and unnecessary for its intended
>> purpose.
>>
>> The check is inherently racy: a task can begin running on another CPU
>> immediately after task_is_running() returns false, potentially leading to
>> concurrent modification of rethook data structures while the iteration is
>> in progress.
>>
>> Rather than trying to fix this unreliable check deep in the unwinding
>> path, simply remove it. The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
>> even if the iteration goes off the rails and returns invalid information,
>> it will not crash. Callers that require consistency must provide a safe
>> context themselves.
>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>> Signed-off-by: Tengda Wu <[email protected]>
>> ---
>> v3: Improve commit message: clarify safety semantics and document that RCU
>> guarantees no crash.
>> v2:
>> https://lore.kernel.org/all/[email protected]/
>> v1:
>> https://lore.kernel.org/all/[email protected]/
>>
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct
>> *tsk, unsigned long frame
>> if (WARN_ON_ONCE(!cur))
>> return 0;
>>
>> - if (tsk != current && task_is_running(tsk))
>> - return 0;
>> -
>
> The description of the function should be updated as well. It still
> mentions:
>
> * The @tsk must be 'current' or a task which is not running.
>
> Instead it should explain that it safe to call the function even
> on another running tasks but the returned address is not reliable
> then.
>
Oh, I forgot that. Thanks for pointing it out.
>> do {
>> ret = __rethook_find_ret_addr(tsk, cur);
>> if (!ret)
>
> I am still a bit concerned about the motivation.
>
> Tengda mentioned at
> https://lore.kernel.org/all/[email protected]/
> that they tried to verify livepatching:
>
> <paste>
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
>
> We construct a function foo() that calls bar():
>
> void bar(void)
> {
> for (;;) {
> schedule();
> }
> }
>
> void foo(void)
> {
> bar();
> }
>
> A kretprobe is attached to bar():
>
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
>
> After that, the live patch is activated to attempt replacing the
> implementation
> of foo(). The expectation is that this should succeed.
>
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside
> rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
>
> **Not sure if this is correct:**
>
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
> </paste>
>
> More background:
> ----------------
>
> The test is artificial because it keeps the RUNNING state before
> calling schedule, see
> https://lore.kernel.org/all/[email protected]/
>
>
>
> My questions:
>
> Does this patch allows to livepatch the above mentioned test code?
At least it will no longer be blocked by the rethook check.
> Is the livepatching safe?
> Does it help in another scenarios?
>
>
> My opinion:
>
> The livepatching might be safe only when the process is migrating
> itself. I mean that it might be safe even when it is RUNNING as long
> at it is _current_.
>
> I agree that we do not need to enforce this in rethook_find_ret_addr()
> if the function is used also in other scenarios, for example, by
> ftrace/BTF for taking snapshots of other processes.
>
> But we need to make sure that the backtrace is reliable when
> livepatching (migrating) the task.
>
> Best Regards,
> Petr
Peter actually touched on this point earlier:
<paste>
Things like live-patch use task_call_func(), which ensures the callback
function is done while holding sufficient locks for the task to not
change state.
</paste>
>From my understanding, removing this check should not introduce
stack reliability issues for livepatch.
Thanks,
Tengda