On 11/20/2025 10:07 AM, Leon Hwang wrote:
On 19/11/25 20:36, Xu Kuohai wrote:
On 11/19/2025 10:55 AM, Leon Hwang wrote:
On 19/11/25 10:47, Menglong Dong wrote:
On 2025/11/19 08:28, Alexei Starovoitov wrote:
On Tue, Nov 18, 2025 at 4:36 AM Menglong Dong
<[email protected]> wrote:
As we can see above, the performance of fexit increases from 80.544M/s to
136.540M/s, and that of "fmodret" increases from 78.301M/s to 159.248M/s.
Nice! Now we're talking.
I think arm64 CPUs have a similar RSB-like return address predictor.
Do we need to do something similar there?
The question is not targeted at you, Menglong,
just wondering.
I did some research before, and I found that most architectures
have such an RSB-like mechanism. I'll have a look at loongarch
later (maybe after the LPC, as I'm focusing on English practice),
and Leon is following up on arm64.
Yep, happy to take this on.
I'm reviewing the arm64 JIT code now and will experiment with possible
approaches to handle this as well.
Unfortunately, the arm64 trampoline uses a tricky approach to bypass BTI:
it uses a ret instruction to invoke the patched function. This conflicts
with the current approach, and there seems to be no straightforward solution.
Hi Kuohai,
Thanks for the explanation.
Do you recall the original reason for using a ret instruction to bypass
BTI in the arm64 trampoline? I'm trying to understand whether that
constraint is fundamental or historical.
arm64 direct jump instructions (b and bl) support only a ±128 MB range.
But the distance between the trampoline and the patched function may
exceed this range. So an indirect jump is required.
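As a rough illustration of the range limit described above, the check below tests whether a direct b/bl could reach its target. It is only a sketch: the helper name `in_direct_branch_range` and the addresses are hypothetical, and it relies on b/bl encoding a signed 26-bit word offset.

```python
# Sketch: can an arm64 B/BL placed at `src` reach `dst` directly?
# B/BL encode a signed 26-bit word offset, so the byte offset must be a
# multiple of 4 in the range [-2**27, 2**27 - 4], i.e. roughly +/-128 MB.

SZ_128M = 1 << 27  # 128 MiB


def in_direct_branch_range(src: int, dst: int) -> bool:
    """Return True if a B/BL at `src` can target `dst` (hypothetical helper)."""
    offset = dst - src
    if offset % 4 != 0:  # A64 instructions are 4-byte aligned
        return False
    return -SZ_128M <= offset <= SZ_128M - 4


# Hypothetical addresses, chosen only for illustration.
patched_func = 0xFFFF_8000_1000_0000
near_tramp = patched_func + 0x0100_0000  # trampoline 16 MiB away
far_tramp = patched_func + 0x1000_0000   # trampoline 256 MiB away

print(in_direct_branch_range(near_tramp, patched_func))
print(in_direct_branch_range(far_tramp, patched_func))
```

When the check fails, the jump has to go through a register (br, blr, or ret), which is where the BTI constraint comes in.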
With BTI enabled, indirect jump instructions (br and blr) require a landing
pad at the jump target. The target is the instruction immediately after
the call site in the patched function. It may be any instruction, including
non-landing-pad instructions. If it is not a landing pad, a BTI exception
occurs when the trampoline jumps back using BR/BLR.
Since the RET instruction does not require a landing pad, it is chosen to
return to the patched function.
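The landing-pad rule above can be pictured with a toy model. This is a simplified sketch, not the architectural definition: it assumes the target page is guarded, that br uses a register other than x16/x17 from a guarded page, and it ignores implicit landing pads such as paciasp. The function name and the example target instruction are hypothetical.

```python
# Toy model of the BTI check on a guarded page (simplified assumptions):
#   - blr needs "bti c" or "bti jc" at the target
#   - br (via a register other than x16/x17) needs "bti j" or "bti jc"
#   - ret needs no landing pad at all
ACCEPTED_PADS = {
    "blr": {"bti c", "bti jc"},
    "br": {"bti j", "bti jc"},
    "ret": None,  # None means: no landing pad required
}


def branch_faults(branch: str, target_insn: str) -> bool:
    """Return True if this model says the branch raises a BTI exception."""
    pads = ACCEPTED_PADS[branch]
    if pads is None:
        return False
    return target_insn not in pads


# Hypothetical instruction that follows the call site in the patched function.
insn_after_call_site = "stp x29, x30, [sp, #-16]!"

for branch in ("br", "blr", "ret"):
    print(branch, branch_faults(branch, insn_after_call_site))
```

In this model only ret can land on an arbitrary instruction, which matches the reason given above for choosing it.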
See [1] for reference.
[1] https://lore.kernel.org/bpf/[email protected]/
I'm wondering if we could structure the control flow like this:
foo "bl" bar -> bar:
bar "br" trampoline -> trampoline:
trampoline "bl" -> bar func body:
As mentioned above, the problem is that the bl may be out of range.
If a blr instruction is used instead, the target instruction must be a landing
pad when BTI is enabled. One approach is to reserve an extra nop at the call
site and patch it into a bti instruction at runtime when needed.
bar func body "ret" -> trampoline
trampoline "ret" -> foo
This would introduce two "bl"s and two "ret"s, keeping the RAS balanced
in a way similar to the x86 approach.
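To make the "balanced RAS" argument concrete, here is a toy return-address-stack model comparing the two flows. It is a sketch under stated assumptions: real predictors differ, the labels are hypothetical, and the "current" sequence assumes the patch site in bar uses bl to enter the trampoline and that the trampoline points lr at itself before using ret to enter bar's body.

```python
# Toy RAS model: "bl" pushes its return address, "ret" pops a prediction and
# compares it with the real target, "br" leaves the stack untouched.

def count_mispredicts(events):
    """Count ret instructions whose popped prediction misses the real target."""
    ras = []
    misses = 0
    for op, addr in events:
        if op == "bl":       # addr is the return address pushed by the call
            ras.append(addr)
        elif op == "ret":    # addr is the address actually returned to
            predicted = ras.pop() if ras else None
            if predicted != addr:
                misses += 1
    return misses


# Assumed current arm64 flow: two "bl"s, three "ret"s.
current_flow = [
    ("bl", "foo_ret"),     # foo calls bar
    ("bl", "bar_body"),    # patch site in bar calls the trampoline
    ("ret", "bar_body"),   # trampoline uses ret to enter bar's body
    ("ret", "tramp_ret"),  # bar's body returns into the trampoline
    ("ret", "foo_ret"),    # trampoline returns to foo
]

# Flow proposed above: two "bl"s, two "ret"s.
proposed_flow = [
    ("bl", "foo_ret"),     # foo "bl" bar
    ("br", None),          # bar "br" trampoline
    ("bl", "tramp_ret"),   # trampoline "bl" bar func body
    ("ret", "tramp_ret"),  # bar func body "ret" trampoline
    ("ret", "foo_ret"),    # trampoline "ret" foo
]

print("current: ", count_mispredicts(current_flow))
print("proposed:", count_mispredicts(proposed_flow))
```

In this model every ret in the proposed flow pops the address pushed by its matching bl, whereas the assumed current flow has one more ret than bl and so cannot stay balanced.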
With this structure, we could also shrink the frame layout:
* SP + retaddr_off [ self ip ]
* [ FP ]
And then store the "self" return address elsewhere on the stack.
Do you think something along these lines could work?
Thanks,
Leon