On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
> 
> The patch below demonstrates the principle, it forcibly enables dynamic 
> ftrace 
> patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a 
> RET:
> 
>   ffffffff81a01a40 <__fentry__>:
>   ffffffff81a01a40:       c3                      retq   
> 
> This would have to be extended with (very simple) call stack depth tracking 
> (just 
> 3 more instructions would do in the fast path I believe) and a suitable 
> SkyLake 
> workaround (and also has to play nice with the ftrace callbacks).
> 
> On non-SkyLake the overhead would be 0 cycles.

The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero
cycles? That seems a little optimistic. ;)

I'll grant you if it goes straight to a 'ret' it isn't *that* high
though.

> On SkyLake this would add an overhead of maybe 2-3 cycles per function call 
> and 
> obviously all this code and data would be very cache hot. Given that the 
> average 
> number of function calls per system call is around a dozen, this would be 
> _much_ 
> faster than any microcode/MSR based approach.

That's kind of neat, except you don't want it at the top of the
function; you want it at the bottom.

If you could hijack the *return* site, then you could check for
underflow and stuff the RSB right there. But in __fentry__ there's not
a lot you can do other than complain that something bad is going to
happen in the future. You know that a string of 16+ rets is going to
happen, but you've got no gadget in *there* to deal with it when it
does.

HJ did have patches to turn 'ret' into a form of retpoline, which I
don't think ever even got performance-tested. They'd have forced a
mispredict on *every* ret. A cheaper option might be to turn ret into a
'jmp skylake_ret_hack'. Which on pre-SKL will be a bare ret, and SKL+
can do the counting (in conjunction with a 'per_cpu(call_depth)++' in
__fentry__) and stuff the RSB before actually returning, when
appropriate.

By the time you've made it work properly, I suspect we're approaching
the barf-factor of IBRS, for a less complete solution.

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could 
> run? 

Andi's been experimenting at 
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/deep-chain-3

> Is there a description of the exact speculative execution vulnerability that 
> has 
> to be addressed to begin with?

"It takes predictions from the generic branch target buffer when the
RSB underflows".

IBRS filters what can come from the BTB, and resolves the problem that
way. Retpoline avoids the indirect branches that on *earlier* CPUs were
the only things that would use the offending predictions. But on SKL,
now 'ret' is one of the problematic instructions too. Fun! :)

> If this approach is workable I'd much prefer it to any MSR writes in the 
> syscall 
> entry path not just because it's fast enough in practice to not be turned off 
> by 
> everyone, but also because everyone would agree that per function call 
> overhead 
> needs to go away on new CPUs. Both deployment and backporting is also _much_ 
> more 
> flexible, simpler, faster and more complete than microcode/firmware or 
> compiler 
> based solutions.
> 
> Assuming the vulnerability can be addressed via this route that is, which is 
> a big 
> assumption!

I think it's close. There are some other cases which empty the RSB,
like sleeping and loading microcode, which can happily be special-
cased. Andi's rounded up many of the remaining details already at 
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/skl-rsb-3

And there's SMI, which is a pain but I think Linus is right we can
possibly just stick our fingers in our ears and pretend we didn't hear
about that one as it's likely to be hard to trigger (famous last
words).

On the whole though, I think you can see why we're keeping IBRS around
for now, sent out purely as an RFC and rebased on top of the stuff
we're *actually* sending to Linus for inclusion.

When we have a clear idea of what we're doing for Skylake, it'll be
useful to have a proper comparison of the security, the performance and
the "ick" factor of whatever we come up with, vs. IBRS.

Right now the plan is just "screw Skylake"; we'll just forget it's a
special snowflake and treat it like everything else, except for a bit
of extra RSB-stuffing on context switch (since we had to add that for
!SMEP anyway). And that's not *entirely* unreasonable but as I said I'd
*really* like to have a decent analysis of the implications of that,
not just some hand-wavy "nah, it'll be fine".

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Reply via email to