Thank you very much for sharing! :bow:

On Sun, Jan 7, 2018 at 8:54 PM Gil Tene <[email protected]> wrote:
I'm sure people here have heard enough about the Meltdown vulnerability and the rush of Linux fixes that have to do with addressing it. So I won't get into how the vulnerability works here (my one-word reaction to the simple code snippets showing "remote sensing" of protected data values was "Wow").

However, in examining both the various fixes rolled out in actual Linux distros over the past few days and doing some very informal surveying of environments I have access to, I discovered that the PCID processor feature, which used to be a virtual no-op, is now a performance AND security critical item. In the spirit of [mechanically] sympathizing with the many systems that now use PCID for a new purpose, as well as with the gap between the haves/have-nots in the PCID world, let me explain why:

The PCID (Processor-Context ID) feature on x86-64 works much like the more generic ASID (Address Space ID) feature available on many hardware platforms for decades. Simplistically, it allows TLB-cached page table contents to be tagged with a context identifier, and limits lookups in the TLB to only match within the currently allowed context. TLB-cached entries with a different PCID will be ignored. Without this feature, a context switch that involves switching to a different page table (e.g. a process-to-process context switch) requires a flush of the entire TLB. With the feature, it only requires a change to the context id designated as "currently allowed". The benefit of this comes up when a back-and-forth set of context switches (e.g. from process 1 to process 2 and back to process 1) occurs "quickly enough" that TLB entries of the newly-switched-into context still reside in the TLB cache.
With modern x86 CPUs holding >1K entries in their L2 TLB caches (sometimes referred to as the STLB), and each entry mapping a 2MB or 4KB virtual region to physical pages, the possibility of such reuse becomes interesting on heavily loaded systems that do a lot of process-to-process context switching. It's important to note that in virtually all modern operating systems, thread-to-thread context switches do not require TLB flushing, and remain within the same PCID, because they do not require switching the page table. In addition, UNTIL NOW, most modern operating systems implemented user-to-kernel and kernel-to-user context switching without switching page tables, so no TLB flushing or ASID/PCID switching was required in system calls or interrupts.

The PCID feature has been a "cool, interesting, but not critical" feature to know about in most Linux/x86 environments for these main reasons:

1. Linux kernels did not make use of PCID until 4.14. So even though it's been around and available in hardware, it didn't make any difference.

2. It's been around and supported in hardware "forever", since 2010 (apparently added with Westmere), so it's not new or exciting.

3. The benefits of PCID-based retention of TLB entries in the TLB cache, once supported by the OS, would only show up when process-to-process context switching is rapid enough to matter. While heavily loaded systems with lots of active processes (not threads) that rapidly switch would benefit, systems with a reasonable number of [potentially heavily] multi-threaded processes wouldn't really be affected or see a benefit.

This all changed with Meltdown.
The basic mechanism used by the Meltdown fixes in the various distros, under term variants like "pti", "KPTI", "kaiser" and "KAISER", all have one key thing in common: they use completely separate page tables for user-mode execution and for kernel-mode execution, in order to make sure that kernel mappings are not available [to the processor] as the basis for any speculative operations. Where previously a user process had a single page table with entries for both user-space and kernel-space mappings in it (with the kernel mappings having access enforced by protection rules), it now has two page tables: a "user-only" table containing only the user-accessible mappings (this table is referred to as "user" in some variants and "shadow" in others), and another table containing both the kernel and the user mappings (referred to as "kernel" in the variants I've seen so far). When running user-mode code, the user-only table is the currently active table that the processor walks on a TLB miss, and when running kernel code, the "kernel" table is. System calls switch from using the user-only table to using the kernel table, perform their kernel-code work, and then switch back to the user-only table before returning to user code.

When a processor has the PCID feature, this back-and-forth switching between page tables is achieved by using separate PCIDs for the two tables associated with the process. For kernels that did not previously have PCID support (which is all kernels prior to 4.14, so the vast majority of kernels in use at the time of this writing), the Meltdown fix variants seem to use constant PCID values for this purpose (e.g. 0 for kernel and 128 for user). For later kernels, where a PCID-to-process relationship is maintained on each CPU, the PCID space is split in half (e.g. uPCID = kPCID + 2048).
Either way, the switch back and forth between the user-only table and the kernel table does involve telling the CPU that the page table root and the PCID have changed, but does not require or force a TLB flush.

When a processor does NOT have the PCID feature, things get ugly. Each system call and each user-to-kernel-to-user transition (like an interrupt) is required to flush the TLB twice (once after each switch), which means two terrible things happen:

1. System calls [which are generally fairly short] are pretty much guaranteed to incur TLB misses on first access to any data and code within the call, with each miss taking 1-7 steps to walk through the page tables in memory. This has an obvious impact on workloads that involve frequent system calls, as the length of each system call will now be longer.

2. Each system call and each user-to-kernel-to-user transition flushes the entire cache of user-space TLB entries, which means that *after* the system call/transition, 100s or 1000s of additional TLB misses will be incurred, the walks for many of which can end up missing in L2/L3. This will affect applications and systems that do not necessarily have a "very high" rate of system calls. The more TLBs have been helping your performance, the more this impact will be felt, and TLBs have been silently helping you for decades. It is enough for only a few hundred or a few thousand user-to-kernel-to-user transitions per second to be happening for this impact to be sorely felt. And guess what: in most normal configurations, interrupts (timer, TLB-invalidate, etc.) all cause such transitions on a regular and frequent basis.

The performance impact of needing to fully flush the TLB on each transition is apparently high enough that at least some of the Meltdown-fixing variants I've read through (e.g. the KAISER variant in RHEL7/RHEL6 and their CentOS brethren) are not willing to take it.
Instead, some of those variants appear to implicitly turn off the dual-page-table-per-process security measure if the processor they are running on does not have PCID capability.

The bottom line so far is: you REALLY want PCID in your processor. Without it, you may be running insecurely (Meltdown fixes turned off by default), or you may run so slow you'll be wishing for a security intrusion to put you out of your misery.

Ok. So far, you'd think this whole thing boils down to "once I update my Linux distro with the latest fixes, I just want to make sure I'm not running on ancient hardware". And since virtually all x86 hardware made this decade has PCID support, everything is fine. Right? That was my first thought too. Then I went and checked a bunch of systems. Most of the Linux instances I looked at had no pcid feature, and all of them were running on modern hardware. Oh Shit.

The quickest way to check whether or not you have PCID is to grep for "pcid" in /proc/cpuinfo. If it's there, you're good. You can stop reading and go on to worrying about the other performance and security impacts being discussed everywhere else. But if it's not there, you are in trouble. You now have a choice between running insecurely (turning pti off) and having performance so bad that some of the security fixes out there will refuse to secure you. Or you can act (which often means "go scream at someone") and get that PCID feature you now really, really need.

So, why would you not have PCID?

It turns out that because PCID was so boring and non-exciting, and Linux didn't even use it until a couple of months ago, it's been withheld from many guest-OS instances running on modern hardware and modern hypervisors.
In my quick and informal polling I have so far found that:

- Most of the KVM guests I personally looked at did NOT have pcid
- All the VMware guests I personally looked at had pcid
- About half the AWS instances I personally looked at did NOT have pcid, and the other half did.

[I encourage others to add their experiences, and e.g. enrich this with a table of PCID capability on known instance types on cloud platforms]

The actual bottom line:

- On any system that does not currently show "pcid" in the flags line of /proc/cpuinfo, Meltdown is a bigger issue than "install latest updates".

- PCID is now a critical feature for both security and performance.

- Many existing Linux guest instances don't have PCID. Including many Cloud instances.

Go get your PCID!

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
