https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582

            Bug ID: 90582
           Summary: AArch64 stack-protector wastes an instruction on
                    address-generation
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

void protect_me() {
    volatile int buf[2];
    buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard   ### this instruction
        mov     x29, sp                 # frame pointer, even though
                                        # -fomit-frame-pointer is part of -O3.
                                        # Goes away with an explicit
                                        # -fomit-frame-pointer

        ldr     x1, [x0]                # copy the cookie
        str     x1, [sp, 24]
        mov     x1, 0                   # and destroy the reg

        mov     w1, 3                   # right before it's destroyed anyway
        str     w1, [sp, 20]            # buf[1] = 3

        ldr     x1, [sp, 24]            # canary
        ldr     x0, [x0]                # key; destroys the key pointer
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32      # FP and LR restore (for some reason?)
        ret
.L5:
        # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into
the load instruction.  AArch64 instructions are fixed-width, so there's no
reason (AFAIK) not to do this.
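For reference, the kind of C source that produces the assembly below; `foo` as a global int is my assumption, matching the 32-bit ldr into w0:

```c
/* Hypothetical source for the "f:" example below.  foo is assumed to
   be a global int, matching the 32-bit load into w0. */
int foo;

int f(void) {
    return foo;   /* adrp x0, foo  +  ldr w0, [x0, #:lo12:foo] */
}
```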

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret

I'm not an AArch64 performance expert; it's plausible that zero displacements
are worth spending an extra instruction on for addresses that are used twice,
but it seems unlikely.

So we should be doing

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]  # in prologue, to copy the cookie
        ...
        ldr     x0, [x0, #:lo12:__stack_chk_guard]  # in epilogue, to check the cookie

This also avoids leaving a pointer straight to __stack_chk_guard in a
register, in case a vulnerable callee or code in the function body can be
tricked into dereferencing it and leaking the cookie.  (In non-leaf functions,
we generate the pointer in a call-preserved register like x19, so yes, it will
be floating around in a register for callees.)

I'd hate to suggest destroying the pointer when copying to the stack, because
that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of
__stack_chk_guard's address) is a lot less likely than finding an ldr from
[x0].  Of course this will introduce a lot of LDR instructions with a
#:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful
gadgets, because they lead to writing the stack, or to EOR/CBNZ to
__stack_chk_fail.

----

I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR
90568.  I assume EOR / CBNZ is at least as efficient as SUBS / BNE on
all/most AArch64 microarchitectures, but someone should check.
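For the record, the identity the EOR / CBNZ sequence relies on, as a C sketch (not GCC's actual code, just the semantics of the check):

```c
#include <stdint.h>

/* Sketch of the canary check: x ^ y == 0 exactly when x == y, so
   EOR + CBNZ tests equality with a single register that is zero iff
   the cookie is intact, without touching the flags. */
static int canary_intact(uint64_t canary, uint64_t key) {
    return (canary ^ key) == 0;
}
```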

----

-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions
protected with -fstack-protector-strong still get a frame pointer in x29
(costing a MOV x29, sp instruction, and save/restore with STP/LDP along with
x30.)

However, explicitly using -fomit-frame-pointer stops that from happening.  Is
that a separate bug, or am I missing something?

----

Without stack-protector, the function is vastly simpler

protect_me:
        sub     sp, sp, #16
        mov     w0, 3
        str     w0, [sp, 12]
        add     sp, sp, 16
        ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)?  Bouncing
the return address through memory seems inefficient, even though branch
prediction does hide that latency.

Is that just so __stack_chk_fail can backtrace?  Can we move the store of the
link register into the __stack_chk_fail branch, off the fast path?

Or if we do unconditionally store x30 (the link register), at least don't
bother reloading it in a leaf function if register allocation didn't need to
clobber it.  Unlike x86-64, the return address can't be attacked with buffer
overflows if it stays safe in a register the whole function.

Obviously my test-case with a volatile array and no inputs at all is making
-fstack-protector-strong look dumb by protecting a perfectly safe function. 
IDK how common it is to have leaf functions with arrays or structs that just
use them for some computation on function args or globals and then return,
maybe after copying the array back to somewhere else.  A sort function might
use a tmp array.
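A hypothetical example of such a function (names and sizes invented for illustration): a leaf that sorts through a small tmp array and never lets a pointer to it escape, yet -fstack-protector-strong would still add a canary because of the local array.

```c
/* Hypothetical leaf function: insertion sort through a local tmp
   array.  No pointer to tmp escapes, so the canary can't protect
   anything here, but -fstack-protector-strong protects it anyway
   because of the local array. */
void sort4(int *out, const int *in) {
    int tmp[4];
    for (int i = 0; i < 4; i++) {
        int v = in[i], j = i;
        while (j > 0 && tmp[j - 1] > v) {
            tmp[j] = tmp[j - 1];   /* shift larger elements up */
            j--;
        }
        tmp[j] = v;
    }
    for (int i = 0; i < 4; i++)
        out[i] = tmp[i];
}
```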

With -fstack-protector (not -strong), maybe instead of optimizing the store of
the link register into the block that calls  __stack_chk_fail, we should use
that as a sign that a leaf function which doesn't spill LR is not *itself*
vulnerable to ROP attacks, and not stack-protect it.

-strong might still want to stack-protect to defend against overwriting the
caller's stack frame, which might contain function pointers or classes with
vtable pointers.  Or saved registers that are about to be restored.  Or to
detect that something bad happened to it, even if it won't immediately result
in a bad control transfer.

Or maybe not, and maybe -strong would also want to drop protection from leaf
functions that don't (need to) spill LR.


----

The  mov x1, 0  in the prologue is redundant: mov w1, 3 is about to
destroy the value in x1 anyway, and it can't fault.  (Reporting that bug
separately; it's not AArch64-specific.)
