> On 21 Aug 2018, at 12:57, David Woodhouse <dw...@infradead.org> wrote:
> 
> Another alternative... I'm told POWER8 does an interesting thing with
> hyperthreading and gang scheduling for KVM. The host kernel doesn't
> actually *see* the hyperthreads at all, and KVM just launches the full
> set of siblings when it enters a guest, and gathers them again when any
> of them exits. That's definitely worth investigating as an option for
> x86, too.

I actually think that such a scheduling mechanism, which prevents leaking cache 
entries to sibling hyperthreads, should coexist with the KVM address space 
isolation to fully mitigate L1TF and other similar vulnerabilities. The 
address space isolation should prevent VMExit handler code gadgets from 
loading arbitrary host memory into the cache. Once the VMExit code path switches 
to the full host address space, we should also make sure that no sibling 
hyperthread is running in the guest.

Focusing on the scheduling mechanism, we must make sure that when a logical 
processor runs guest code, all sibling logical processors run code which 
does not populate the L1D cache with information unrelated to this VM. This 
includes forbidding one logical processor to run guest code while a sibling is 
running a host task such as a NIC interrupt handler.
Thus, when a vCPU thread exits the guest into the host and the VMExit handler 
reaches a code flow which could populate the L1D cache with such information, we 
should force the sibling logical processors to exit the guest, such that they 
are allowed to resume only on a core whose L1D cache is guaranteed to be free 
of information unrelated to this VM.
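
To illustrate, here is a minimal sketch of forcing the siblings out of the 
guest. It relies on the fact that an IPI delivered to a logical processor in 
VMX non-root mode causes a VM exit. The function names and the hook point are 
my own invention for illustration, not existing KVM code; the cpumask and IPI 
primitives are the real kernel ones:

#include <linux/smp.h>
#include <linux/topology.h>

static void l1d_kick_noop(void *info)
{
	/* Nothing to do: delivering the IPI already forced a VM exit. */
}

/* Hypothetical hook, called before a VMExit handler reaches code that
 * could populate the L1D cache with host data. */
static void kick_sibling_hyperthreads(void)
{
	int cpu = smp_processor_id();
	int sibling;

	for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
		if (sibling == cpu)
			continue;
		/* wait=1: do not proceed until the sibling has exited. */
		smp_call_function_single(sibling, l1d_kick_noop, NULL, 1);
	}
}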

At first, I created a patch series which attempts to implement such a 
mechanism in KVM. However, it became clear to me that this may need to be 
implemented in the scheduler itself. This is because:
1. It is difficult to handle all the new scheduling constraints only in KVM.
2. This mechanism should be relevant to any Type-2 hypervisor which runs 
inside Linux besides KVM (such as VMware Workstation or VirtualBox).
3. This mechanism could also be used to prevent future “core-cache-leaking” 
vulnerabilities from being exploited between processes of different security 
domains which run as siblings on the same core.

The main idea is a mechanism very similar to Microsoft's "core scheduler", 
which they implemented to mitigate this vulnerability. The mechanism should 
work as follows:
1. Each CPU core will now be tagged with a "security domain id".
2. The scheduler will provide a mechanism to tag a task with a security domain 
id.
3. Tasks will inherit their security domain id from their parent task.
    3.1. The first task in the system will have security domain id 0. Thus, if 
nothing special is done, all tasks will be assigned security domain id 0.
4. Tasks will be able to allocate a new security domain id from the scheduler 
and assign it to another task dynamically.
5. The Linux scheduler will prevent scheduling tasks on a core with a different 
security domain id:
    5.0. A CPU core's security domain id will be set to the security domain id 
of the tasks which currently run on it.
    5.1. The scheduler will first attempt to schedule a task on a core with the 
required security domain id, if such a core exists.
    5.2. Otherwise, it will need to decide whether to kick all tasks running on 
some core in order to run the task with a different security domain id on that 
core.
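
To make the placement rule concrete, here is a minimal sketch of the 
core-level admission check in plain C. All of the structures and names are 
hypothetical illustrations of steps 1-5 above, not actual scheduler code:

#include <stdbool.h>
#include <stdint.h>

/* Step 3: each task carries a domain id, inherited from its parent
 * (the first task in the system starts with id 0). */
struct task {
	uint64_t security_domain_id;
};

/* Steps 1 and 5.0: each core is tagged with the domain id of the
 * tasks currently running on its logical processors. */
struct core {
	uint64_t security_domain_id;
	unsigned int nr_running;	/* tasks on this core's siblings */
};

/* Step 5: a task may be placed on a core only if the core is empty
 * (it then adopts the task's domain, per 5.0) or already runs tasks
 * of the same security domain. */
static bool task_allowed_on_core(const struct task *t, const struct core *c)
{
	return c->nr_running == 0 ||
	       c->security_domain_id == t->security_domain_id;
}

The interesting policy question is all in step 5.2: whether it is worth 
preempting a whole core to make room for a task from another domain, or better 
to keep that task waiting for a compatible core.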

The above mechanism can be used to mitigate the L1TF HT variant by simply 
assigning vCPU tasks a security domain id which is unique per VM and 
different from the security domain id of the host, which is 0.
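
For example, assuming the scheduler exposed the interfaces from steps 2 and 4 
(the names below are invented for illustration), KVM or userspace could tag 
all vCPU threads of a VM at VM creation time:

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical scheduler API corresponding to steps 2 and 4. */
uint64_t sched_alloc_security_domain(void);
int sched_set_task_security_domain(pid_t tid, uint64_t domain_id);

/* On VM creation: allocate one fresh domain per VM (never 0, which is
 * the host's domain) and tag every vCPU thread with it. */
static int tag_vm_vcpus(const pid_t *vcpu_tids, unsigned int nr_vcpus)
{
	uint64_t domain = sched_alloc_security_domain();
	unsigned int i;
	int err;

	for (i = 0; i < nr_vcpus; i++) {
		err = sched_set_task_security_domain(vcpu_tids[i], domain);
		if (err)
			return err;
	}
	return 0;
}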

I would be glad to hear feedback on the above suggestion.
If this would be better discussed in a separate email thread, please say so 
and I will open a new thread.

Thanks,
-Liran

