> > With 5 bits there's a ~96.9% chance of crashing the system in an attempt,
> > the exploit cannot be used for a range of attacks, including spear
> > attacks and fast-spreading worms, right? A crashed and inaccessible
> > system also increases the odds of leaving around unfinished attack code
> > and leaking a zero-day attack.
> 
> Yup, which is why I'd like to have _something_ here without us getting
> lost in the "perfect entropy" weeds. :)

I am really starting to believe that we cannot make good randomness sources
fast enough for per-syscall usage if our target is 1-2% overhead under the
worst possible (and potentially unrealistic) scenario: a stress test of some
simple syscall like getpid().
The only thing that would fit within that margin is indeed rdtsc().
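
To make the rdtsc() idea concrete, here is a rough userspace sketch of taking the low 5 bits of a cycle counter. The names read_cycle_counter(), offset_from_counter() and OFFSET_BITS are made up for illustration and are not from any patch; the clock() fallback is only there so the sketch builds on non-x86 machines. With 5 bits there are 32 possible values, so a blind guess is right 1/32 of the time and wrong 31/32 = ~96.9% of the time, which is the figure quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>          /* __rdtsc() */
#endif

#define OFFSET_BITS 5u          /* 5 bits -> 32 possible offsets */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)

/* Hypothetical helper: read a fast, free-running counter.
 * Uses the TSC on x86 and falls back to clock() elsewhere. */
static uint64_t read_cycle_counter(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Hypothetical helper: keep only the low OFFSET_BITS bits of the counter. */
static unsigned int offset_from_counter(uint64_t counter)
{
    unsigned int off = (unsigned int)(counter & OFFSET_MASK);

    assert(off < (1u << OFFSET_BITS));
    return off;
}
```

Calling offset_from_counter(read_cycle_counter()) yields a value in 0..31. How that value would be turned into an actual stack offset is outside the scope of this sketch.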

I profiled the path that uses get_random_bytes(), and the results look like
this (arch_get_random_long is not inlined here, for measurement purposes):

> >
> >                |          |          |           --9.44%--random_get_byte
> >                |          |          |                     |
> >                |          |          |                      --8.08%--get_random_bytes
> >                |          |          |                                |
> >                |          |          |                                 --7.80%--_extract_crng.constprop.45
> >                |          |          |                                           |
> >                |          |          |                                           |--4.95%--arch_get_random_long
> >                |          |          |                                           |
> >                |          |          |                                            --2.39%--chacha_block


And here is the evidence that, under such usage, _extract_crng bottlenecks on
rdrand:

PerfTop:    5877 irqs/sec  kernel:78.6%  exact: 100.0% [4000Hz cycles:ppp],  (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------------------------
Showing cycles:ppp for _extract_crng.constprop.46
  Events  Pcnt (>=5%)
 Percent | Source code & Disassembly of kcore for cycles:ppp (2104 samples, percent: local period)
-------------------------------------------------------------------------------------------------------
    0.00 :   ffffffff9abd1a62:       mov    $0xa,%edx
   97.94 :   ffffffff9abd1a67:       rdrand %rax

And then of course there is the chacha permutation itself. So I think Andy's
proposal to rewrite get_random_bytes() for speed is not so easy to implement.

So, given that all we want is to raise the bar for attackers trying to predict
the stack location on a subsequent syscall, is it really worth trying to come
up with more complex solutions than just using the lower bits of rdtsc() by
default?

One idea that was suggested to me last week is to create a pool of good
randomness and then, during a syscall, select a random number from the pool
using something like rdtsc() % POOL_SIZE. The pool would need to be refilled
periodically, outside of the syscall path, to maintain diversity.
I can try this approach if people believe that it would address the security
concerns around rdtsc(). My personal feeling is that one can still mount a
timing attack on this if we assume that rdtsc can be attacked, and the
complexity of the whole thing increases considerably.
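
For discussion, the pool idea could look roughly like the userspace sketch below. Everything in it is hypothetical: struct offset_pool, pool_refill(), pool_pick() and the POOL_SIZE of 64 are names and values chosen for illustration, and demo_fill() is a deterministic stand-in for a real source such as get_random_bytes(), used only so the sketch is testable. Locking, per-CPU placement and the refill schedule are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 64u           /* hypothetical pool size */

struct offset_pool {
    uint8_t slots[POOL_SIZE];   /* pre-generated random bytes */
};

/* Stand-in for a good randomness source; NOT random, illustration only. */
static void demo_fill(void *buf, size_t len)
{
    uint8_t *p = buf;

    for (size_t i = 0; i < len; i++)
        p[i] = (uint8_t)(i * 37u + 11u);
}

/* Slow path: refill every slot. Meant to run outside the syscall path. */
static void pool_refill(struct offset_pool *pool,
                        void (*fill)(void *buf, size_t len))
{
    assert(pool != NULL && fill != NULL);
    fill(pool->slots, sizeof(pool->slots));
}

/* Fast path: index the pool with a counter value, keep the low 5 bits. */
static unsigned int pool_pick(const struct offset_pool *pool, uint64_t counter)
{
    return pool->slots[counter % POOL_SIZE] & 0x1fu;
}
```

Note that the slot index is still derived from the counter, so if the counter is predictable then so is which slot gets picked; only the slot contents come from the better source.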

If we decide that this is too much trouble for just the 5 bits of randomness
we need per syscall, I would still propose that we reconsider the original
rdtsc() approach, since it is still better than nothing.
We can have the whole thing on three levels:

1. CONFIG_RANDOMIZE_KSTACK_OFFSET off: no randomization, like now.
2. CONFIG_RANDOMIZE_KSTACK_OFFSET on with rdtsc(): fast, better than nothing,
   but prone to timing attacks.
3. CONFIG_RANDOMIZE_KSTACK_OFFSET based on get_random_bytes(): better security
   guarantees.

Performance numbers for these options will look approximately like this:

No randomization:                       Simple syscall: 0.0534 microseconds
With rdtsc():                           Simple syscall: 0.0539 microseconds
With get_random_bytes(4096 buffer):     Simple syscall: 0.0597 microseconds

The pure rdrand option, calling rdrand_long every 10th syscall, is
considerably slower:

With rdrand (every 10th syscall):       Simple syscall: 0.0719 microseconds
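
To relate these numbers to the 1-2% target, the relative overhead against the "No randomization" baseline can be computed as (measured - baseline) / baseline. A tiny helper for that arithmetic (overhead_percent() is just an illustrative name):

```c
#include <assert.h>

/* Relative overhead of `measured` over `baseline`, in percent. */
static double overhead_percent(double baseline, double measured)
{
    assert(baseline > 0.0);
    return (measured - baseline) / baseline * 100.0;
}
```

With the figures quoted above this gives roughly 0.9% for rdtsc() (0.0539 vs 0.0534), roughly 11.8% for get_random_bytes() (0.0597) and roughly 34.6% for rdrand every 10th syscall (0.0719).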

And I guess we should once again remember that these are *not* the numbers
that real users will see in practice, since I doubt we have real loads issuing
millions of *very lightweight* syscalls in a loop. So these are really more
"theoretical, worst case ever" numbers.
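
For reference, the kind of worst-case loop meant here can be sketched as below. This is not the benchmark that produced the numbers above, just a minimal Linux userspace illustration; avg_getpid_ns() is a made-up name, and syscall(SYS_getpid) is used so that every iteration really enters the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average cost, in nanoseconds, of one getpid() syscall over `iters` calls. */
static double avg_getpid_ns(unsigned long iters)
{
    struct timespec start, end;
    double ns;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iters; i++)
        (void)syscall(SYS_getpid);      /* raw syscall, nothing else */
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
         (double)(end.tv_nsec - start.tv_nsec);
    return ns / (double)iters;
}
```

A loop like this does almost nothing besides entering and leaving the kernel, so any fixed per-syscall cost shows up at its largest possible relative size.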

If someone could propose a reasonable *practical* workload to measure with,
then we can see what the overhead is on that for both rdtsc() and
get_random_bytes().

Best Regards,
Elena.

