On April 1, 2026 7:36:48 AM PDT, Xin Li <[email protected]> wrote:
>
>Thanks!
>Xin
>
>> On Mar 31, 2026, at 8:15 PM, H. Peter Anvin <[email protected]> wrote:
>> 
>> On March 31, 2026 6:59:06 PM PDT, Xin Li <[email protected]> wrote:
>>> 
>>> 
>>>>> On Mar 30, 2026, at 11:03 PM, Xin Li <[email protected]> wrote:
>>>> 
>>>> 
>>>>>>>> The existing 'sysret_rip' selftest asserts that 'regs->r11 ==
>>>>>>>> regs->flags'. This check relies on the behavior of the SYSCALL
>>>>>>>> instruction on legacy x86_64, which saves 'RFLAGS' into 'R11'.
>>>>>>>> 
>>>>>>>> However, on systems with FRED (Flexible Return and Event Delivery)
>>>>>>>> enabled, instead of using registers, all state is saved onto the stack.
>>>>>>>> Consequently, 'R11' retains its userspace value, causing the assertion
>>>>>>>> to fail.
>>>>>>>> 
>>>>>>>> Fix this by detecting if FRED is enabled and skipping the register
>>>>>>>> assertion in that case. The detection is done by checking if the RPL
>>>>>>>> bits of the GS selector are preserved after a hardware exception.
>>>>>>>> IDT (via IRET) clears the RPL bits of NULL selectors, while FRED (via
>>>>>>>> ERETU) preserves them.
>>>>>>>> 
>>>>>>> 
>>>>>>> I don't really like this.  I think we have two credible choices:
>>>>>>> 
>>>>>>> 1. Define the Linux ABI to be that, on FRED systems, SYSCALL preserves
>>>>>>> R11 and RCX on entry and exit.  And update the test to actually test
>>>>>>> this.
>>>>>>> 
>>>>>>> 2. Define the Linux ABI to be what it has been for quite a few years:
>>>>>>> SYSCALL entry copies RFLAGS to R11 and RIP to RCX and SYSCALL exit
>>>>>>> preserves all registers.
>>>>>>> 
>>>>>>> I'm in favor of #2.  People love making new programming languages and
>>>>>>> runtimes and inline asm and, these days, vibe coded crap.  And it's
>>>>>>> *easier* to emit a SYSCALL and forget to tell the compiler / code
>>>>>>> generator that RCX and R11 are clobbered than it is to remember that
>>>>>>> they're clobbered.  And it's easy to test on FRED (well, not really,
>>>>>>> but it hopefully will be some day) and it's easy to publish one's
>>>>>>> code, and then everyone is a bit screwed when the resulting program
>>>>>>> crashes sometimes on non-FRED systems.  And it will be miserable to
>>>>>>> debug.
>>>>>>> 
>>>>>>> (It's *really* *really* easy to screw this up in a way that sort of
>>>>>>> works even on non-FRED: RCX and R11 are usually clobbered across
>>>>>>> function calls, so one can get into a situation in which one's
>>>>>>> generated code usually doesn't require that SYSCALL preserve one of
>>>>>>> these registers until an inlining decision changes or some code gets
>>>>>>> reordered, and then it will start failing.  And making the failure
>>>>>>> depend on hardware details is just nasty.)
>>>>>>> 
>>>>>>> So I think we should add the ~2 lines of code to fix the SYSCALL entry
>>>>>>> on FRED to match non-FRED.
>>>>>> 
>>>>>> Yes; I'm afraid I have to concur. Preserving the clobber on entry for
>>>>>> FRED systems is by far the safest choice.
>>>>>> 
>>>>>> Aside from this selftest, fancy debuggers and anything that can transfer
>>>>>> userspace state between machines might be 'surprised'.
>>>>> 
>>>>> Thanks Andy and Peter.
>>>>> 
>>>>> Indeed, making the selftest branch on FRED vs. non-FRED behavior
>>>>> is not a good practice. The selftest should validate ABI consistency.
>>>>> 
>>>>> I agree with Andy's option #2, so this should be fixed in the FRED
>>>>> syscall entry implementation.
>>>>> 
>>>>> Li Xin, does this direction look right to you? I can assist with
>>>>> validation and keep the selftest aligned with the agreed ABI.
>>>>> 
>>>> 
>>>> Yes, consistency should take precedence over hardware-specific variations.
>>>> 
>>>> I would like to hear from Andrew Cooper and hpa before we do it.
>>> 
>>> Per Andy’s suggestion, the change would be:
>>> 
>>> diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
>>> index 88c757ac8ccd..a19898747a2c 100644
>>> --- a/arch/x86/entry/entry_fred.c
>>> +++ b/arch/x86/entry/entry_fred.c
>>> @@ -79,6 +79,9 @@ static __always_inline void fred_other(struct pt_regs *regs)
>>> {
>>>    /* The compiler can fold these conditions into a single test */
>>>    if (likely(regs->fred_ss.vector == FRED_SYSCALL && regs->fred_ss.l)) {
>>> +        regs->cx = regs->ip;
>>> +        regs->r11 = regs->flags;
>>> +
>>>        regs->orig_ax = regs->ax;
>>>        regs->ax = -ENOSYS;
>>>        do_syscall_64(regs, regs->orig_ax);
>>> 
>>> It adds 4 extra MOVs on this hot path, but I don’t see it as a problem here.
>> 
>> We discussed this over a year ago, and at that point agreed that preserving 
>> the registers was the desired behavior. Why has this changed now?
>
>Yes, that is technically cleaner.
>
>The question is, is the RCX/R11 clobbering behavior an established 
>architectural contract, or is it an implementation detail that software 
>ignores?
>
>I think Andy and Peter want to be on the safer side, which kind of assumes 
>that this is established.
>

Clobbering is never an architectural contract; clobbering is always an option. 
However, I understand the concern about a developer who writes software on a 
FRED system that then breaks on a legacy system.
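
For illustration only, here is a minimal sketch of the user-space side of that
concern: a raw SYSCALL wrapper that declares RCX and R11 as clobbered, which is
what stays correct whether or not the kernel preserves them. The helper name
raw_syscall0() is hypothetical and does not come from the kernel tree or the
selftest; the non-x86_64 fallback is only there so the sketch builds elsewhere.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical helper for illustration; not taken from the kernel or selftests. */
static long raw_syscall0(long nr)
{
#if defined(__x86_64__) && defined(__linux__)
	long ret;

	/*
	 * On non-FRED hardware the SYSCALL instruction overwrites RCX (with
	 * the return RIP) and R11 (with RFLAGS), so both are listed as
	 * clobbers. Leaving them out may appear to work on a system that
	 * preserves them and then break on one that does not.
	 */
	__asm__ volatile("syscall"
			 : "=a"(ret)
			 : "a"(nr)
			 : "rcx", "r11", "memory");
	return ret;
#else
	/* Fallback so the sketch still builds on other architectures. */
	return syscall(nr);
#endif
}
```

Calling raw_syscall0(SYS_getpid) should return the same value as getpid(); the
point is only that the clobber list, not the hardware, is what makes the
wrapper portable.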

Last time this came up, the policy we decided on was that a system that 
clobbers must do so in all cases (in order to not leak internal kernel state) 
but a system that can preserve (FRED or IDT-without-SYSCALL) may always do so.

I would prefer if we could defer this policy reversal for a bit. Since there is 
production hardware out now, I have been working on actually tuning the FRED 
code paths, and because the Linux kernel is so efficient, details matter in 
surprising ways. 

I *particularly* dislike clobbering registers on the way *into* the kernel, 
though. That needlessly makes them unavailable to a debugger, and one of the 
benefits of FRED is improving debug visibility in some specific cases.
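
To make the debug-visibility point concrete, below is a toy user-space model
(an assumption-laden sketch, not kernel code) of the two assignments in the
quoted diff. struct toy_regs and both function names are hypothetical
stand-ins for the relevant pt_regs fields; the model only shows that once the
entry path copies ip and flags into cx and r11, the values user space had in
those registers are no longer present in the saved frame.

```c
#include <stdint.h>

/* Toy stand-in for the four pt_regs fields involved; not the kernel's struct. */
struct toy_regs {
	uint64_t ip;
	uint64_t flags;
	uint64_t cx;
	uint64_t r11;
};

/* Mirrors the two assignments proposed in the quoted diff. */
static void toy_syscall_entry_clobber(struct toy_regs *regs)
{
	regs->cx = regs->ip;
	regs->r11 = regs->flags;
}

/*
 * Returns nonzero if the RCX/R11 values user space had before the system
 * call are still what an observer of the saved frame would see.
 */
static int toy_user_regs_visible(const struct toy_regs *before,
				 const struct toy_regs *after)
{
	return before->cx == after->cx && before->r11 == after->r11;
}
```

With the clobber applied, toy_user_regs_visible() reports that the original
RCX/R11 contents are gone from the frame; without it, they remain inspectable.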
