Re: [Patch v2 3/3] aarch64: Mitigate SLS for BLR instruction

2020-06-23 Thread Richard Sandiford
Matthew Malcomson  writes:
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index f996472d6990b7709602ae93f7a2cb7daa0e84b0..9795c929b8733f89722d3660456f5e7d6405d902 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -643,6 +643,16 @@ extern unsigned aarch64_architecture_version;
>  #define GP_REGNUM_P(REGNO)   \
>(((unsigned) (REGNO - R0_REGNUM)) <= (R30_REGNUM - R0_REGNUM))
>  
> +/* Registers known to be preserved over a BL instruction.  This consists of the
> +   GENERAL_REGS without x16, x17, and x30.  The x30 register is changed by the BL

Long line.

> +   instruction itself, while the x16 and x17 registers may be used by veneers
> +   which can be inserted by the linker.  */
> +#define STUB_REGNUM_P(REGNO) \
> +  (GP_REGNUM_P (REGNO) \
> +   && ((unsigned) (REGNO - R0_REGNUM)) != (R16_REGNUM - R0_REGNUM) \
> +   && ((unsigned) (REGNO - R0_REGNUM)) != (R17_REGNUM - R0_REGNUM) \
> +   && ((unsigned) (REGNO - R0_REGNUM)) != (R30_REGNUM - R0_REGNUM))

Sorry, I should have noticed this before, but we can just compare
(REGNO) directly with R16_REGNUM etc., without subtracting R0_REGNUM
from both sides.  The R0_REGNUM stuff is only needed for range
comparisons, where the idea is to avoid reevaluating REGNO.

> […]
> @@ -10869,7 +10872,7 @@ aarch64_asm_trampoline_template (FILE *f)
>   specific attributes to choose between hardening against straight line
>   speculation or not, but such function specific attributes are likely to
>   happen in the future.  */
> -  output_asm_insn ("dsb\tsy\n\tisb", NULL);
> +  asm_fprintf (f, "\tdsb\tsy\n\tisb\n");

Looks like this should be part of 2/3.

> […]
> +rtx
> +aarch64_sls_create_blr_label (int regnum)
> +{
> +  gcc_assert (regnum < 30 && regnum != 16 && regnum != 17);

Can just use STUB_REGNUM_P here.
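
I.e.:

  gcc_assert (STUB_REGNUM_P (regnum));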

> […]
> +/* Emit shared BLR stubs for the current compilation unit.
> +   Over the course of compiling this unit we may have converted some BLR
> +   instructions to a BL to a shared stub function.  This is where we emit those
> +   stub functions.
> +   This function is for the stubs shared between different functions in this
> +   compilation unit.  We share when optimising for size instead of speed.

optimizing (alas).

> […]
> +/* { dg-final { scan-assembler "\tbr\tx\[0-9\]\[0-9\]?" } } */

Probably easier to read with {…} quoting rather than "…" quoting,
so that no backslashes are needed for [ and ].
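
I.e. something like:

/* { dg-final { scan-assembler {\tbr\tx[0-9][0-9]?} } } */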

OK with those changes, thanks.

Richard


[Patch v2 3/3] aarch64: Mitigate SLS for BLR instruction

2020-06-23 Thread Matthew Malcomson
This patch introduces the mitigation for Straight Line Speculation past
the BLR instruction.

This mitigation replaces BLR instructions with a BL to a stub which uses
a BR to jump to the original value.  These function stubs are then
appended with a speculation barrier to ensure no straight line
speculation happens after these jumps.

When optimising for speed we use a set of stubs for each function since
this should help the branch predictor make more accurate predictions
about where a stub should branch.

When optimising for size we use one set of stubs for all functions.
This set of stubs can have human readable names, and we are using
`__call_indirect_x` for register x.

When BTI branch protection is enabled the BLR instruction can jump to a
`BTI c` instruction using any register, while the BR instruction can
only jump to a `BTI c` instruction using the x16 or x17 registers.
Hence, in order to ensure this transformation is safe we mov the value
of the original register into x16 and use x16 for the BR.

As an example, when optimising for size a
    BLR x0
instruction would get transformed to something like
    BL __call_indirect_x0
where __call_indirect_x0 labels a thunk that contains
__call_indirect_x0:
    MOV X16, X0
    BR X16



The first version of this patch used local symbols specific to a
compilation unit to try to avoid relocations.
This was mistaken since functions coming from the same compilation unit
can still be in different sections, and the assembler will insert
relocations at jumps between sections.

On any relocation the linker is permitted to emit a veneer to handle
jumps between symbols that are very far apart.  The registers x16 and
x17 may be clobbered by these veneers.
Hence the function stubs cannot rely on the values of x16 and x17 being
the same as just before the function stub is called.

The same applies to the hot/cold partitioning of single functions,
so function-local stubs have the same restriction.

This updated version of the patch never emits function stubs for x16 and
x17, and instead forces other registers to be used.


Given the above, there is now no benefit to local symbols (since they
are not enough to avoid dealing with linker intricacies).  This patch
now uses global symbols with hidden visibility, each stored in its own
COMDAT section.  This means stubs can be shared between compilation
units while still avoiding the PLT indirection.


This patch also removes the `__call_indirect_x30` stub (and
function-local equivalent) which would simply jump back to the original
location.


The function-local stubs are emitted to the assembly output file in one
chunk, which means we need not add the speculation barrier directly
after each one.
This is because we know for certain that the instructions directly after
the BR in all but the last function stub will be from another one of
these stubs and hence will not contain a speculation gadget.
Instead we add a speculation barrier at the end of the sequence of
stubs.
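
For illustration, the emitted block might look roughly like this (the
label names here are made up for the example; the patch's actual labels
differ):

.Lsls_call_x0:                  // hypothetical local label
        mov     x16, x0
        br      x16             // next insns belong to the following stub
.Lsls_call_x1:                  // hypothetical local label
        mov     x16, x1
        br      x16
        dsb     sy              // one barrier after the final stub
        isb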

The global stubs are emitted in COMDAT/.linkonce sections by
themselves so that the linker can remove duplicates from multiple object
files.  This means they are not emitted in one chunk, and each one must
include the speculation barrier.

Another difference is that since the global stubs are shared across
compilation units we do not know that all functions will be targeting an
architecture supporting the SB instruction.
Rather than provide multiple stubs for each architecture, we provide a
stub that will work for all architectures -- using the DSB+ISB barrier.
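
A shared stub for x0 might therefore look roughly as follows (the
section and visibility directives here are illustrative rather than the
patch's literal output):

        .section .text.__call_indirect_x0,"axG",%progbits,__call_indirect_x0,comdat
        .hidden __call_indirect_x0
__call_indirect_x0:
        mov     x16, x0
        br      x16
        dsb     sy              // DSB+ISB works on all architecture variants
        isb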


This mitigation does not apply for BLR instructions in the following
places:
- Some accesses to thread-local variables use a code sequence with a BLR
  instruction.  This code sequence is part of the binary interface between
  compiler and linker.  If this BLR instruction needs to be mitigated, it'd
  probably be best to do so in the linker.  It seems that the code sequence
  for thread-local variable access is unlikely to lead to a Spectre
  Revelation Gadget.
- PLT stubs are produced by the linker and each contain a BLR instruction.
  It seems that a Spectre Revelation Gadget could appear, at most, only
  after the last PLT stub.

Testing:
  Bootstrap and regtest on AArch64
(with BOOT_CFLAGS="-mharden-sls=retbr,blr")
  Used a temporary hack(1) in gcc-dg.exp to use these options on every
  test in the testsuite, a slight modification to emit the speculation
  barrier after every function stub, and a script to check that the
  output never emitted a BLR, or unmitigated BR or RET instruction.
  Similar on an aarch64-none-elf cross-compiler.

1) Temporary hack emitted a speculation barrier at the end of every stub
function, and used a script to ensure that:
  a) Every RET or BR is immediately followed by a speculation barrier.
  b) No BLR instruction is emitted by the compiler.


gcc/ChangeLog:

2020-06-23  Matthew Malcomson  

* config/aarch64/aarch64-protos.h (aarch64_indirect_call_asm):