On Sun, May 24, 2026 at 4:44 AM Paulo Duarte <[email protected]> wrote: > > Two latent bugs in the aarch64 FPU state plumbing, both surfaced > the first time an aarch64 thread_get_state / thread_set_state > caller exercised AARCH64_FLOAT_STATE. > > 1. Field-order bug in _fpu_save_state / _fpu_load_state > ============================================================ > > struct aarch64_float_state in <mach/aarch64/thread_status.h> lays > out fpsr at the first offset after v[32], with fpcr second: > > struct aarch64_float_state { > __int128 v[32]; > uint64_t fpsr; /* offset +512 */ > uint64_t fpcr; /* offset +520 */ > uint64_t fpmr; > uint64_t fp_reserved; > }; > > The save/load asm in aarch64/aarch64/locore.S read and wrote them in > the opposite order — store FPCR into the fpsr slot and FPSR into the > fpcr slot — which silently corrupted both fields across every > thread_get_state / thread_set_state cycle. > > A caller that loads a known FPCR via `msr fpcr, ...` and reads it > back through thread_get_state(AARCH64_FLOAT_STATE) sees 0 instead, > because the value was stored at the fpsr offset. > > 2. Alignment-check bug in thread_setstatus > ============================================================ > > thread_setstatus() in aarch64/aarch64/pcb.c rejected > AARCH64_FLOAT_STATE writes whose tstate pointer wasn't aligned to > alignof(struct aarch64_float_state). The struct's first member is > __int128 v[32], giving the type a 16-byte alignment requirement. > > The MIG-generated _Xthread_set_state stub places `new_state[]` at > offset 40 from the request message header, which is only 8-byte > aligned. Even when the user-side buffer is perfectly aligned, the > in-kernel buffer the stub hands to thread_setstatus is at the > fixed offset 40 — so the alignment check rejected every legitimate > AARCH64_FLOAT_STATE write with KERN_INVALID_ARGUMENT. > > Copy the state into a stack-local that has the natural type > alignment before validating and storing. This is also what would > have to happen if we wanted to read __int128 from the buffer > directly on a strict-alignment-trapping config; the byte-by-byte > copy handles the under-aligned source without faulting. > --- > aarch64/aarch64/locore.S | 16 ++++++++++++---- > aarch64/aarch64/pcb.c | 24 +++++++++++++++++++----- > 2 files changed, 31 insertions(+), 9 deletions(-) > > diff --git a/aarch64/aarch64/locore.S b/aarch64/aarch64/locore.S > index 008e39cc..8581a3b6 100644 > --- a/aarch64/aarch64/locore.S > +++ b/aarch64/aarch64/locore.S > @@ -406,8 +406,12 @@ ENTRY(_fpu_save_state) > stp q26, q27, [x0], #32 > stp q28, q29, [x0], #32 > stp q30, q31, [x0], #32 > - mrs x1, fpcr > - mrs x2, fpsr > + /* > + * struct aarch64_float_state lays out fpsr at offset 0 from > + * here and fpcr at offset 8, so save in that order. > + */ > + mrs x1, fpsr > + mrs x2, fpcr > stp x1, x2, [x0], #32 > stp xzr, xzr, [x0] /* FPMR ignored for now */ > ret > @@ -434,9 +438,13 @@ ENTRY(_fpu_load_state) > ldp q28, q29, [x0], #32 > ldp q30, q31, [x0], #32 > > + /* > + * struct aarch64_float_state lays out fpsr first at offset 0, > + * then fpcr at offset 8 — mirror the save order above. > + */ > ldp x1, x2, [x0] > - msr fpcr, x1 > - msr fpsr, x2 > + msr fpsr, x1 > + msr fpcr, x2
No need to leave these comments in the source code, just fix the order.
