I'm sure people here have heard enough about the Meltdown vulnerability and 
the rush of Linux fixes that have to do with addressing it. So I won't get 
into how the vulnerability works here (my one word reaction to the simple 
code snippets showing "remote sensing" of protected data values was "Wow").

However, in both examining the various fixes rolled out in actual Linux 
distros over the past few days and doing some very informal surveying of 
environments I have access to, I discovered that the PCID processor 
feature, which used to be a virtual no-op, is now a performance AND 
security critical item. In the spirit of [mechanically] sympathizing with 
the many systems that now use PCID for a new purpose, as well as with the 
gap between the haves/have-nots in the PCID world, let me explain why:

The PCID (Process-Context Identifier) feature on x86-64 works much like the 
more generic ASID (Address Space ID) support available on many hardware 
platforms for 
decades. Simplistically, it allows TLB-cached page table contents to be 
tagged with a context identifier, and limits the lookups in the TLB to only 
match within the currently allowed context. TLB-cached entries with a 
different PCID will be ignored. Without this feature, a context switch that 
would involve switching to a different page table (e.g. a 
process-to-process context switch) would require a flush of the entire TLB. 
With the feature, it only requires a change to the context id designated as 
"currently allowed". The benefit of this comes up when a back-and-forth set 
of context switches (e.g. from process 1 to process 2 and back to process 
1) occurs "quickly enough" that TLB entries of the newly-switched-into 
context still reside in the TLB cache. With modern x86 CPUs holding >1K 
entries in their L2 TLB caches (sometimes referred to as STLB), and each 
entry mapping 2MB or 4KB virtual regions to physical pages, the possibility 
of such reuse becomes interesting on heavily loaded systems that do a lot 
of process-to-process context switching. It's important to note that in 
virtually all modern operating systems, thread-to-thread context switches 
do not require TLB flushing, and remain within the same PCID because they 
do not require switching the page table. In addition, UNTIL NOW, most 
modern operating systems implemented user-to-kernel and kernel-to-user 
context switching without switching page tables, so no TLB flushing or 
ASID/PCID switching was required in system calls or interrupts.
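To make the tagging concrete, here is a small sketch of how a CR3 value 
with a PCID might be composed. It assumes the x86-64 layout: with 
CR4.PCIDE enabled, bits 11:0 of CR3 carry the PCID, the higher bits carry 
the physical base of the top-level page table, and bit 63 on a write asks 
the CPU not to flush TLB entries tagged with that PCID. The function name 
is mine, not from any kernel:

```python
# Sketch of composing a CR3 value when CR4.PCIDE is enabled.
# Bits 11:0 hold the PCID, bits above that the 4K-aligned physical
# address of the top-level page table, and bit 63 (on a MOV to CR3)
# means "do not flush TLB entries tagged with this PCID".
PCID_MASK   = (1 << 12) - 1
NOFLUSH_BIT = 1 << 63

def make_cr3(pgd_phys, pcid, noflush=True):
    assert pgd_phys & PCID_MASK == 0, "page table base must be 4K-aligned"
    assert 0 <= pcid <= PCID_MASK, "PCID is a 12-bit value"
    cr3 = pgd_phys | pcid
    if noflush:
        cr3 |= NOFLUSH_BIT  # keep this PCID's TLB entries alive
    return cr3
```

With this in place, a process-to-process context switch is just a MOV to 
CR3 with a different PCID and the no-flush bit set, rather than a full 
TLB flush.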

The PCID feature has been a "cool, interesting, but not critical" feature 
to know about in most Linux/x86 environments for these main reasons:

1. Linux kernels did not make use of PCID until 4.14. So even though it's 
been around and available in hardware, it didn't make any difference.

2. It's been around and supported in hardware "forever", since 2010 
(apparently added with Westmere), so it's not new or exciting.

3. The benefits of PCID-based retention of TLB entries in the TLB cache, 
once supported by the OS, would only show up when process-to-process 
context switching is rapid enough to matter. While heavily loaded systems 
with lots of active processes (not threads) that rapidly switch would 
benefit, systems with a reasonable number of [potentially heavily] 
multi-threaded processes wouldn't really be affected or see a benefit.

This all changed with Meltdown. 

The basic mechanisms used by Meltdown fixes in the various distros, under 
name variants like "pti", "KPTI", "kaiser" and "KAISER", all have one key 
thing in common: They use completely separate page tables for user mode 
execution and for kernel mode execution, in order to make sure that kernel 
mappings would not be available [to the processor] as the basis for any 
speculative operations. Where previously a user process had a single page 
table with entries for both user-space and kernel-space mappings in it 
(with the kernel mapping having access enforced by protection rules), it 
now has two page tables: A "user-only" table containing only the 
user-accessible mappings (this table is referred to as "user" in some 
variants and "shadow" in other variants), and another table containing both 
the kernel and the user mappings (referred to as "kernel" in the variants 
I've seen so far). When running user-mode code, the user-only table is the 
currently active table that the processor would walk on a TLB miss, and 
when running kernel code, the "kernel" table is. System calls switch from 
using the user-only table to using the kernel table, perform their 
kernel-code work, and then switch back to the user-only table before 
returning to user code.
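A toy model of this dual-table arrangement (pure illustration, not kernel 
code; all names and "physical frames" are made up) might look like:

```python
# Toy model of KPTI's two page tables per process: the "user" table
# maps only user pages, while the "kernel" table maps both, and a
# system call switches tables for its duration.
user_table   = {"user_code": "phys_1", "user_data": "phys_2"}
kernel_table = {**user_table, "kernel_text": "phys_9"}  # user + kernel

class ToyCPU:
    def __init__(self):
        # User-mode code runs against the user-only table.
        self.active = user_table

    def syscall(self, page):
        self.active = kernel_table   # entry: switch to the kernel table
        frame = self.active[page]    # kernel work may touch kernel mappings
        self.active = user_table     # exit: switch back before returning
        return frame
```

The point of the model is only that while user code is running, the 
kernel mappings simply do not exist in the active table, so there is 
nothing for speculation to resolve against.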

When a processor has the PCID feature, this back-and-forth switching 
between page tables is achieved by using separate PCIDs for the two tables 
associated with the process. For kernels that did not previously have PCID 
support (which is all kernels prior to 4.14, so the vast majority of 
kernels in use at the time of this writing), the Meltdown fix variants seem 
to use constant PCID values for this purpose (e.g. 0 for kernel and 128 for 
user). For later kernels, where the PCID-to-process relationship is 
maintained on each CPU, the PCID space is split in half (e.g. uPCID = 
kPCID + 2048). 
Either way, the switch back and forth between the user-only table and the 
kernel table does involve telling the CPU that the page table root and the 
PCID have changed, but does not require or force a TLB flush.
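The split can be sketched as simple bit arithmetic (an illustration of 
the uPCID = kPCID + 2048 scheme mentioned above; names are mine):

```python
# The 12-bit PCID space is halved: bit 11 distinguishes the user-only
# context from the kernel context of the same process.
USER_PCID_BIT = 1 << 11  # 2048

def user_pcid(kernel_pcid):
    assert 0 <= kernel_pcid < USER_PCID_BIT
    return kernel_pcid | USER_PCID_BIT  # same low bits, bit 11 set
```

Because the two contexts keep distinct PCIDs, each side's TLB entries 
survive the transition and can be reused on the way back.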

When a processor does NOT have the PCID feature, things get ugly. Each 
system call and each user-to-kernel-to-user transition (like an interrupt) 
would be required to flush the TLB twice (once after each switch), which 
means two terrible things happen:

1. System calls [which are generally fairly short] are pretty much 
guaranteed to incur TLB misses on the first access to any data and code 
within the call, with each miss taking 1-7 steps to walk through the page 
tables in memory. This has an obvious impact on workloads that involve 
frequent system calls, as each system call will now take longer.

2. Each system call and each user-to-kernel-to-user transition flushes the 
entire cache of user-space TLB entries, which means that *after* the 
system call/transition, 100s or 1000s of additional TLB misses will be 
incurred, and the page table walks for many of them can end up missing in 
L2/L3. This will affect applications and systems that do not necessarily 
have a "very high" rate of system calls. The more the TLB has been 
helping your performance, the more this impact will be felt, and TLBs 
have been silently helping you for decades. Only a few hundred or a few 
thousand user-to-kernel-to-user transitions per second are enough for 
this impact to be sorely felt. And guess what: in most normal 
configurations, interrupts (timer, TLB-invalidate, etc.) all cause such 
transitions on a regular and frequent basis.
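If you want to eyeball the per-syscall part of this yourself, a crude 
(and very hedged) approach is to time a tight loop of cheap system calls 
on the same machine before and after the fixes land; absolute numbers 
vary wildly by hardware, kernel, and libc, so treat this strictly as a 
relative measurement:

```python
# Crude syscall-rate microbenchmark: each os.getpid() call is a cheap
# kernel round trip, so the loop rate roughly tracks user/kernel
# transition cost. Compare the rate with and without pti on the same box.
import os
import time

def syscalls_per_second(n=200_000):
    start = time.perf_counter()
    for _ in range(n):
        os.getpid()  # one user-to-kernel-to-user transition per iteration
    return n / (time.perf_counter() - start)
```

Note that some libc versions have cached getpid() in user space in the 
past, which would make the loop measure nothing; if in doubt, substitute 
another trivially cheap syscall.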

The performance impact of needing to fully flush the TLB on each transition 
is apparently high enough that at least some of the Meltdown-fixing 
variants I've read through (e.g. the KAISER variant in RHEL7/RHEL6 and 
their CentOS brethren) are not willing to take it. Instead, some of those 
variants appear to implicitly turn off the dual-page-table-per-process 
security measure if the processor they are running on does not have PCID 
capability. 

The bottom line so far is: you REALLY want PCID in your processor. Without 
it, you may be running insecurely (Meltdown fixes turned off by default), 
or you may run so slow you'll be wishing for a security intrusion to put 
you out of your misery.

Ok. So far, you'd think this whole thing boils down to "once I update my 
Linux distro with the latest fixes, I just want to make sure I'm not 
running on ancient hardware". And since virtually all x86 hardware made 
this decade has PCID support, everything is fine. Right? That was my first 
thought too. Then I went and checked a bunch of systems. Most of the Linux 
instances I looked at had no pcid feature, and all of them were running on 
modern hardware. Oh Shit.

The quickest way to check whether or not you have PCID is to grep for 
"pcid" in /proc/cpuinfo. If it's there, you're good. You can stop reading 
and go on to worrying about the other performance and security impacts 
being discussed everywhere else. But if it's not there, you are in trouble. 
You now have a choice between running insecurely (turn pti off) and having 
performance so bad that some of the security fixes out there will refuse to 
secure you. Or you can act (which often means "go scream at someone") and 
get that PCID feature you now really really need.
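The same check can be scripted, e.g. in Python (a trivial parser of the 
cpuinfo flags line; the function name is mine):

```python
# Check for the "pcid" CPU flag the way the post suggests: look for it
# as a whole token in the flags line of /proc/cpuinfo text.
def has_pcid(cpuinfo_text):
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Split into tokens so e.g. "invpcid" does not count as "pcid".
            return "pcid" in line.split(":", 1)[1].split()
    return False

# On a live Linux system:
#   print(has_pcid(open("/proc/cpuinfo").read()))
```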

So, why would you not have PCID?

It turns out that because PCID was so boring and non-exciting, and Linux 
didn't even use it until a couple of months ago, it's been withheld from 
many guest-OS instances when running on modern hardware and modern 
hypervisors. In my quick and informal polling I so far found that:

- Most of the KVM guests I personally looked at did NOT have pcid
- All the VMware guests I personally looked at had pcid
- About half the AWS instances I personally looked at did NOT have pcid, 
and the other half did.

[I encourage others to add their experiences, and e.g. enrich this with a 
table of PCID-capability on known instance types on cloud platforms]

The actual Bottom Line:

- On any system that does not currently show "pcid" in the flags line of 
/proc/cpuinfo, Meltdown is a bigger issue than "install latest updates".

- PCID is now a critical feature for both security and performance.

- Many existing Linux guest instances don't have PCID. Including many Cloud 
instances.

Go get your PCID!





-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.