On April 1, 2026 7:36:48 AM PDT, Xin Li <[email protected]> wrote:
>
>Thanks!
>Xin
>
>> On Mar 31, 2026, at 8:15 PM, H. Peter Anvin <[email protected]> wrote:
>>
>> On March 31, 2026 6:59:06 PM PDT, Xin Li <[email protected]> wrote:
>>>
>>>
>>>>> On Mar 30, 2026, at 11:03 PM, Xin Li <[email protected]> wrote:
>>>>
>>>>
>>>>>>>> The existing 'sysret_rip' selftest asserts that 'regs->r11 ==
>>>>>>>> regs->flags'. This check relies on the behavior of the SYSCALL
>>>>>>>> instruction on legacy x86_64, which saves 'RFLAGS' into 'R11'.
>>>>>>>>
>>>>>>>> However, on systems with FRED (Flexible Return and Event Delivery)
>>>>>>>> enabled, instead of using registers, all state is saved onto the stack.
>>>>>>>> Consequently, 'R11' retains its userspace value, causing the assertion
>>>>>>>> to fail.
>>>>>>>>
>>>>>>>> Fix this by detecting if FRED is enabled and skipping the register
>>>>>>>> assertion in that case. The detection is done by checking if the RPL
>>>>>>>> bits of the GS selector are preserved after a hardware exception.
>>>>>>>> IDT (via IRET) clears the RPL bits of NULL selectors, while FRED (via
>>>>>>>> ERETU) preserves them.
>>>>>>>>
>>>>>>>
>>>>>>> I don't really like this. I think we have two credible choices:
>>>>>>>
>>>>>>> 1. Define the Linux ABI to be that, on FRED systems, SYSCALL preserves
>>>>>>> R11 and RCX on entry and exit. And update the test to actually test
>>>>>>> this.
>>>>>>>
>>>>>>> 2. Define the Linux ABI to be what it has been for quite a few years:
>>>>>>> SYSCALL entry copies RFLAGS to R11 and RIP to RCX and SYSCALL exit
>>>>>>> preserves all registers.
>>>>>>>
>>>>>>> I'm in favor of #2. People love making new programming languages and
>>>>>>> runtimes and inline asm and, these days, vibe coded crap. And it's
>>>>>>> *easier* to emit a SYSCALL and forget to tell the compiler / code
>>>>>>> generator that RCX and R11 are clobbered than it is to remember that
>>>>>>> they're clobbered.
>>>>>>> And it's easy to test on FRED (well, not really,
>>>>>>> but it hopefully will be some day) and it's easy to publish one's
>>>>>>> code, and then everyone is a bit screwed when the resulting program
>>>>>>> crashes sometimes on non-FRED systems. And it will be miserable to
>>>>>>> debug.
>>>>>>>
>>>>>>> (It's *really* *really* easy to screw this up in a way that sort of
>>>>>>> works even on non-FRED: RCX and R11 are usually clobbered across
>>>>>>> function calls, so one can get into a situation in which one's
>>>>>>> generated code usually doesn't require that SYSCALL preserve one of
>>>>>>> these registers until an inlining decision changes or some code gets
>>>>>>> reordered, and then it will start failing. And making the failure
>>>>>>> depend on hardware details is just nasty.
>>>>>>>
>>>>>>> So I think we should add the ~2 lines of code to fix the SYSCALL entry
>>>>>>> on FRED to match non-FRED.
>>>>>>
>>>>>> Yes; I'm afraid I have to concur. Preserving the clobber on entry for
>>>>>> FRED systems is by far the safest choice.
>>>>>>
>>>>>> Aside from this selftest, fancy debuggers and anything that can transfer
>>>>>> userspace state between machines might be 'surprised'.
>>>>>
>>>>> Thanks Andy and Peter.
>>>>>
>>>>> Indeed, making the selftest branch on FRED vs. non-FRED behavior
>>>>> is not a good practice. The selftest should validate ABI consistency.
>>>>>
>>>>> I agree with Andy's option #2, so this should be fixed in the FRED
>>>>> syscall entry implementation.
>>>>>
>>>>> Li Xin, does this direction look right to you? I can assit with
>>>>> validation and keep the selftest aligned with the agreed ABI.
>>>>>
>>>>
>>>> Yes, consistency should take precedence over hardware-specific variations.
>>>>
>>>> I would like to hear from Andrew Cooper and hpa before we do it.
>>>
>>> Per Andy’s suggestion, the change would be:
>>>
>>> diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
>>> index 88c757ac8ccd..a19898747a2c 100644
>>> --- a/arch/x86/entry/entry_fred.c
>>> +++ b/arch/x86/entry/entry_fred.c
>>> @@ -79,6 +79,9 @@ static __always_inline void fred_other(struct pt_regs *regs)
>>>  {
>>>  	/* The compiler can fold these conditions into a single test */
>>>  	if (likely(regs->fred_ss.vector == FRED_SYSCALL && regs->fred_ss.l)) {
>>> +		regs->cx = regs->ip;
>>> +		regs->r11 = regs->flags;
>>> +
>>>  		regs->orig_ax = regs->ax;
>>>  		regs->ax = -ENOSYS;
>>>  		do_syscall_64(regs, regs->orig_ax);
>>>
>>> It adds 4 extra MOVs on this hot path, but I don’t see it's a problem here.
>>
>> We discussed this over a year ago, and at that point agreed that reserving
>> the register was the desired behavior. Why has this changed now?
>
>Yes, that is technically cleaner.
>
>The question is, is the RCX/R11 clobbering behavior an established
>architectural contract, or is it an implementation detail that software
>ignores?
>
>I think Andy and Peter want to be on the safer side, which kind of assumes
>that this is established.
>
Clobbering is never an architectural contract; clobbering is always an
option. However, I understand the concern about a developer writing
software on a FRED system that then breaks on a legacy system.

Last time this came up, the policy we decided on was that a system that
clobbers must do so in all cases (in order to not leak internal kernel
state), but a system that can preserve (FRED or IDT-without-SYSCALL) may
always do so.

I would prefer if we could defer this policy reversal for a bit. Since
there is production hardware out now, I have been working on actually
tuning the FRED code paths, and because the Linux kernel is so efficient,
details matter in surprising ways.

I *particularly* dislike clobbering registers on the way *into* the
kernel, though. That needlessly makes them unavailable to a debugger, and
one of the benefits of FRED is improving debug visibility in some specific
cases.
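
For readers following the thread, here is a minimal userspace sketch of the failure mode being discussed. It is illustrative only and not part of any proposed patch. It assumes x86-64 Linux and GCC/Clang extended asm. The helper names raw_syscall0(), probe_syscall_regs() and selfcheck() are hypothetical and are not kernel or libc APIs. The first helper shows the clobber list that userspace needs if SYSCALL overwrites RCX and R11. The second loads a sentinel into both registers and reports what they hold afterwards, which is roughly the property the 'sysret_rip' selftest cares about. On other targets the code falls back to libc's syscall() so that it still builds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a zero-argument system call directly. */
static inline long raw_syscall0(long nr)
{
#if defined(__x86_64__) && defined(__linux__)
	long ret;

	/*
	 * If SYSCALL overwrites RCX (return RIP) and R11 (RFLAGS), both
	 * must be declared as clobbers.  Leaving them out can appear to
	 * work until the compiler keeps a live value in one of them.
	 */
	__asm__ volatile("syscall"
			 : "=a"(ret)
			 : "0"(nr)
			 : "rcx", "r11", "memory");
	return ret;
#else
	/* Fallback so the sketch builds on other targets. */
	return syscall(nr);
#endif
}

/*
 * Hypothetical probe: put 'sentinel' in RCX and R11, call getpid via
 * SYSCALL, and report what the two registers contain afterwards.  If a
 * reported value still equals the sentinel, that register was preserved.
 */
static inline long probe_syscall_regs(uint64_t sentinel,
				      uint64_t *rcx_after,
				      uint64_t *r11_after)
{
#if defined(__x86_64__) && defined(__linux__)
	register uint64_t r11 __asm__("r11") = sentinel;
	uint64_t rcx = sentinel;
	long ret;

	/* RCX and R11 are read-write operands here, not clobbers. */
	__asm__ volatile("syscall"
			 : "=a"(ret), "+c"(rcx), "+r"(r11)
			 : "0"((long)SYS_getpid)
			 : "memory");
	*rcx_after = rcx;
	*r11_after = r11;
	return ret;
#else
	/* No register-level information is available in the fallback. */
	*rcx_after = sentinel;
	*r11_after = sentinel;
	return syscall(SYS_getpid);
#endif
}

/* Small self-check: the raw wrapper must agree with libc's getpid(). */
static inline void selfcheck(void)
{
	assert(raw_syscall0(SYS_getpid) == (long)getpid());
}
```

A caller would invoke probe_syscall_regs() with an arbitrary sentinel and compare the two reported values against it. Whether they match depends on the hardware and on whichever kernel policy ends up being chosen, which is exactly why code that omits the clobbers in raw_syscall0() could pass testing on one machine and fail on another.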

