https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582
Bug ID: 90582
Summary: AArch64 stack-protector wastes an instruction on address-generation
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

void protect_me() {
    volatile int buf[2];
    buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard   ### this instruction
        mov     x29, sp        # frame pointer, even though -fomit-frame-pointer is
                               # part of -O3.  Goes away with explicit -fomit-frame-pointer
        ldr     x1, [x0]       # copy the cookie
        str     x1, [sp, 24]
        mov     x1, 0          # and destroy the reg
        mov     w1, 3          # right before it's already destroyed
        str     w1, [sp, 20]   # buf[1] = 3
        ldr     x1, [sp, 24]   # canary
        ldr     x0, [x0]       # key; destroys the key pointer
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32   # FP and LR save/restore (for some reason?)
        ret
.L5:    # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into the load instruction. AArch64 instructions are fixed-width, so there's no reason (AFAIK) not to do this:

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret

I'm not an AArch64 performance expert; it's plausible that zero displacements are worth spending an extra instruction on for addresses that are used twice, but unlikely. So we should be doing

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]   # in prologue, to copy the cookie
        ...
        ldr     x0, [x0, #:lo12:__stack_chk_guard]   # in epilogue, to check the cookie

This also avoids leaving a pointer straight to __stack_chk_guard in a register, in case a vulnerable callee or code in the function body can be tricked into dereferencing it and leaking the cookie.
(In non-leaf functions, we generate the pointer in a call-preserved register like x19, so yes, it will be floating around in a register for callees to see.) I'd hate to suggest destroying the pointer when copying it to the stack, because that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of __stack_chk_guard's address) is a lot less likely than finding an ldr from [x0]. Of course this will introduce a lot of LDR instructions with a #:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful gadgets, because they lead to writing the stack or to EOR/CBNZ to __stack_chk_fail.

----

I don't see a way to optimize canary ^ key == 0 any further, unlike x86-64 PR 90568. I assume EOR / CBNZ is at least as efficient as SUBS / BNE on all or most AArch64 microarchitectures, but someone should check.

----

-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions protected with -fstack-protector-strong still get a frame pointer in x29 (costing a MOV x29, sp instruction, and save/restore with STP/LDP along with x30). However, explicitly using -fomit-frame-pointer stops that from happening. Is that a separate bug, or am I missing something?

----

Without stack-protector, the function is vastly simpler:

protect_me:
        sub     sp, sp, #16
        mov     w0, 3
        str     w0, [sp, 12]
        add     sp, sp, 16
        ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)? Bouncing the return address through memory seems inefficient, even though branch prediction does hide that latency. Is that just so __stack_chk_fail can backtrace? Can we move the store of the link register into the __stack_chk_fail branch, off the fast path?

Or if we do unconditionally store x30 (the link register), at least don't bother reloading it in a leaf function if register allocation didn't need to clobber it. Unlike x86-64, the return address can't be attacked with buffer overflows if it stays safely in a register for the whole function.
Obviously my test-case, with a volatile array and no inputs at all, makes -fstack-protector-strong look dumb by protecting a perfectly safe function. IDK how common it is to have leaf functions with arrays or structs that just use them for some computation on function args or globals and then return, maybe after copying the array back somewhere else. A sort function might use a tmp array.

With -fstack-protector (not -strong), maybe instead of optimizing the store of the link register into the block that calls __stack_chk_fail, we should use that as a sign that a leaf function which doesn't spill LR is not *itself* vulnerable to ROP attacks, and not stack-protect it.

-strong might still want to stack-protect, to defend against overwriting the caller's stack frame, which might contain function pointers or classes with vtable pointers. Or saved registers that are about to be restored. Or to detect that something bad happened, even if it won't immediately result in a bad control transfer. Or maybe not, and maybe -strong would also want to drop protection from leaf functions that don't (need to) spill LR.

----

The mov x1, 0 in the prologue is redundant: mov w1, 3 is about to destroy the value in x1 anyway, and it can't fault. (Reporting this bug separately; it's a distinct issue and not AArch64-specific.)