Thank you very much for sharing!
:bow:
On Sun, Jan 7, 2018 at 8:54 PM Gil Tene <[email protected]> wrote:

> I'm sure people here have heard enough about the Meltdown vulnerability
> and the rush of Linux fixes that have to do with addressing it. So I won't
> get into how the vulnerability works here (my one-word reaction to the
> simple code snippets showing "remote sensing" of protected data values was
> "Wow").
>
> However, in examining both the various fixes rolled out in actual Linux
> distros over the past few days and doing some very informal surveying of
> environments I have access to, I discovered that the PCID processor
> feature, which used to be a virtual no-op, is now a performance AND
> security critical item. In the spirit of [mechanically] sympathizing with
> the many systems that now use PCID for a new purpose, as well as with the
> gap between the haves/have-nots in the PCID world, let me explain why:
>
> The PCID (Process-Context Identifier) feature on x86-64 works much like the
> more generic ASID (Address Space ID) mechanisms available on many hardware platforms for
> decades. Simplistically, it allows TLB-cached page table contents to be
> tagged with a context identifier, and limits the lookups in the TLB to only
> match within the currently allowed context. TLB-cached entries with a
> different PCID will be ignored. Without this feature, a context switch that
> would involve switching to a different page table (e.g. a
> process-to-process context switch) would require a flush of the entire TLB.
> With the feature, it only requires a change to the context id designated as
> "currently allowed". The benefit of this comes up when a back-and-forth set
> of context switches (e.g. from process 1 to process 2 and back to process
> 1) occurs "quickly enough" that TLB entries of the newly-switched-into
> context still reside in the TLB cache. With modern x86 CPUs holding >1K
> entries in their L2 TLB caches (sometimes referred to as STLB), and each
> entry mapping 2MB or 4KB virtual regions to physical pages, the possibility
> of such reuse becomes interesting on heavily loaded systems that do a lot
> of process-to-process context switching. It's important to note that in
> virtually all modern operating systems, thread-to-thread context switches
> do not require TLB flushing, and remain within the same PCID because they
> do not require switching the page table. In addition, UNTIL NOW, most
> modern operating systems implemented user-to-kernel and kernel-to-user
> context switching without switching page tables, so no TLB flushing or
> ASID/PCID switching was required in system calls or interrupts.
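>
> To make the "change to the context id" step concrete, here is a minimal
> sketch of the bit arithmetic involved (per the Intel SDM layout: the PCID
> lives in bits 0-11 of CR3, and bit 63 of the value written to CR3 acts as
> a "don't flush" hint). The make_cr3 name is mine, not a kernel identifier,
> and the actual CR3 write is a privileged instruction not shown here:
>
>     #include <stdint.h>
>     #include <stdio.h>
>
>     #define CR3_PCID_MASK 0xFFFull     /* bits 0-11: the context ID       */
>     #define CR3_NOFLUSH   (1ull << 63) /* bit 63: keep TLB entries tagged */
>                                        /* with the new PCID               */
>
>     /* page_table_root is the physical address of the top-level page
>      * table; it is 4 KiB aligned, so its low 12 bits are free to carry
>      * the PCID. */
>     static uint64_t make_cr3(uint64_t page_table_root, uint16_t pcid,
>                              int no_flush)
>     {
>         uint64_t cr3 = page_table_root | (pcid & CR3_PCID_MASK);
>         if (no_flush)
>             cr3 |= CR3_NOFLUSH; /* switch context without flushing TLB */
>         return cr3;
>     }
>
>     int main(void)
>     {
>         /* Switch to a (hypothetical) process's tables under PCID 2,
>          * keeping any of its still-cached TLB entries alive: */
>         printf("cr3 = %#llx\n",
>                (unsigned long long)make_cr3(0x7f3a000ull, 2, 1));
>         return 0;
>     }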
>
> The PCID feature has been a "cool, interesting, but not critical" feature
> to know about in most Linux/x86 environments for these main reasons:
>
> 1. Linux kernels did not make use of PCID until 4.14. So even though it's
> been around and available in hardware, it didn't make any difference.
>
> 2. It's been around and supported in hardware "forever", since 2010
> (apparently added with Westmere), so it's not new or exciting.
>
> 3. The benefits of PCID-based retention of TLB entries in the TLB cache,
> once supported by the OS, would only show up when process-to-process
> context switching is rapid enough to matter. While heavily loaded systems
> with lots of active processes (not threads) that rapidly switch would
> benefit, systems with a reasonable number of [potentially heavily]
> multi-threaded processes wouldn't really be affected or see a benefit.
>
> This all changed with Meltdown.
>
> The Meltdown fixes in the various distros, which go by name variants like
> "pti", "KPTI", "kaiser", and "KAISER", all have one key
> thing in common: They use completely separate page tables for user mode
> execution and for kernel mode execution, in order to make sure that kernel
> mapping would not be available [to the processor] as the basis for any
> speculative operations. Where previously a user process had a single page
> table with entries for both user-space and kernel-space mappings in it
> (with the kernel mapping having access enforced by protection rules), it
> now has two page tables: A "user-only" table containing only the
> user-accessible mappings (this table is referred to as "user" in some
> variants and "shadow" in other variants), and another table containing both
> the kernel and the user mappings (referred to as "kernel" in the variants
> I've seen so far). When running user-mode code, the user-only table is the
> currently active table that the processor would walk on a TLB miss, and
> when running kernel code, the "kernel" table is. System calls switch from
> using the user-only table to using the kernel table, perform their
> kernel-code work, and then switch back to the user-only table before
> returning to user code.
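>
> In other words, each process now carries two page-table roots instead of
> one. A conceptual sketch, with hypothetical names (the actual Linux
> structures and entry/exit stubs differ):
>
>     /* Two top-level tables per process under a KPTI/KAISER-style fix.
>      * Names here are illustrative only. */
>     typedef struct pgd pgd_t;
>
>     struct process_address_space {
>         pgd_t *kernel_pgd; /* user + kernel mappings; active while this
>                             * process is executing kernel code          */
>         pgd_t *user_pgd;   /* user mappings only (the "user"/"shadow"
>                             * table); active while running user code    */
>     };
>
>     /* Rough control flow of a system call under this scheme:
>      *
>      *   user code --syscall--> entry stub: switch to kernel_pgd
>      *                          ... kernel work ...
>      *                          exit stub: switch back to user_pgd
>      *   <-- return to user code
>      */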
>
> When a processor has the PCID feature, this back-and-forth switching
> between page tables is achieved by using separate PCIDs for the two tables
> associated with the process. For kernels that did not previously have PCID
> support (which is all kernels prior to 4.14, so the vast majority of
> kernels in use at the time of this writing), the Meltdown fix variants seem
> to use constant PCID values for this purpose (e.g. 0 for kernel and 128 for
> user). For later kernels, where a PCID-to-process relationship is maintained
> on each CPU, the PCID space is split in half (e.g. uPCID = kPCID + 2048).
> Either way, the switch back and forth between the user-only table and the
> kernel table does involve telling the CPU that the page table root and the
> PCID have changed, but does not require or force a TLB flush.
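>
> Continuing the earlier sketch (reusing its make_cr3() helper and includes),
> and assuming the uPCID = kPCID + 2048 split (i.e. bit 11 of the 12-bit PCID
> marks "user"), the two CR3 values such a switch toggles between might be
> composed like this (the addresses and kPCID value are made up; the real CR3
> writes are privileged):
>
>     #define PCID_USER_BIT (1u << 11) /* 2048 */
>
>     void syscall_entry_and_exit_cr3_values(void)
>     {
>         uint64_t kernel_pgd_pa = 0x7f3a000ull; /* hypothetical */
>         uint64_t user_pgd_pa   = 0x7f3b000ull; /* hypothetical */
>         uint16_t kpcid         = 6;            /* hypothetical */
>
>         /* syscall entry: point the CPU at the full table, no TLB flush */
>         uint64_t kernel_cr3 = make_cr3(kernel_pgd_pa, kpcid, 1);
>
>         /* syscall exit: back to the user-only table, again no flush */
>         uint64_t user_cr3 =
>             make_cr3(user_pgd_pa, kpcid | PCID_USER_BIT, 1);
>
>         (void)kernel_cr3; (void)user_cr3;
>     }
>
> Because both writes pass the no-flush hint, neither direction of the
> user<->kernel transition has to throw away cached TLB entries.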
>
> When a processor does NOT have the PCID feature, things get ugly. Each
> system call and each user-to-kernel-to-user transition (like an interrupt)
> would be required to flush the TLB twice (once after each switch), which
> means two terrible things happen:
>
> 1. System calls [which are generally fairly short] are pretty much
> guaranteed to incur TLB misses on the first access to any data and code within
> the call, with each miss taking 1-7 steps to walk through the page tables
> in memory. This has an obvious impact on workloads that involve frequent
> system calls, as each system call will now take longer.
>
> 2. Each system call and each user-to-kernel-to-user transition flushes the
> entire cache of user-space TLB entries, which means that *after* the
> system call/transition, 100s or 1000s of additional TLB misses will be
> incurred, and many of the resulting page walks can end up missing in L2/L3.
> This will affect applications and systems that do not necessarily have a
> "very high" rate of system calls. The more the TLB has been helping your
> performance, the more this impact will be felt, and TLBs have been silently
> helping you for decades. A few hundred or a few thousand
> user-to-kernel-to-user transitions per second are enough for this impact
> to be sorely felt. And guess what: in most normal configurations,
> interrupts (timer, TLB-invalidate, etc.) all cause such transitions on a
> regular and frequent basis.
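>
> A crude way to feel point 1 on your own system: time a tight loop of cheap
> system calls and compare the per-call cost with pti enabled vs. disabled
> (e.g. via the pti=off kernel command line). This sketch only measures raw
> syscall round-trip cost; it says nothing about the post-transition
> TLB-miss storm of point 2, which has to be measured at the application
> level:
>
>     #define _GNU_SOURCE
>     #include <stdio.h>
>     #include <time.h>
>     #include <unistd.h>
>     #include <sys/syscall.h>
>
>     int main(void)
>     {
>         enum { N = 1000000 };
>         struct timespec t0, t1;
>
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         for (int i = 0; i < N; i++)
>             syscall(SYS_getpid); /* raw syscall: bypasses any caching */
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>
>         double ns = (t1.tv_sec - t0.tv_sec) * 1e9
>                   + (t1.tv_nsec - t0.tv_nsec);
>         printf("%.1f ns per getpid() syscall\n", ns / N);
>         return 0;
>     }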
>
> The performance impact of needing to fully flush the TLB on each
> transition is apparently high enough that at least some of the
> Meltdown-fixing variants I've read through (e.g. the KAISER variant in
> RHEL7/RHEL6 and their CentOS brethren) are not willing to take it. Instead,
> some of those variants appear to implicitly turn off the
> dual-page-table-per-process security measure if the processor they are
> running on does not have PCID capability.
>
> The bottom line so far is: you REALLY want PCID in your processor. Without
> it, you may be running insecurely (Meltdown fixes turned off by default),
> or you may run so slow you'll be wishing for a security intrusion to put
> you out of your misery.
>
> Ok. So far, you'd think this whole thing boils down to "once I update my
> Linux distro with the latest fixes, I just want to make sure I'm not
> running on ancient hardware". And since virtually all x86 hardware made
> this decade has PCID support, everything is fine. Right? That was my first
> thought too. Then I went and checked a bunch of systems. Most of the Linux
> instances I looked at had no pcid feature, and all of them were running on
> modern hardware. Oh Shit.
>
> The quickest way to check whether or not you have PCID is to grep for
> "pcid" in /proc/cpuinfo. If it's there, you're good. You can stop reading
> and go on to worrying about the other performance and security impacts
> being discussed everywhere else. But if it's not there, you are in trouble.
> You now have a choice between running insecurely (turn pti off) and having
> performance so bad that some of the security fixes out there will refuse to
> secure you. Or you can act (which often means "go scream at someone") and
> get that PCID feature you now really really need.
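>
> An alternative to grepping /proc/cpuinfo is to ask CPUID directly (leaf 1,
> ECX bit 17 is the PCID feature bit). In a guest, this reports what the
> hypervisor chose to expose, which is exactly the question here. A minimal
> sketch using GCC/Clang's <cpuid.h>:
>
>     #include <stdio.h>
>     #include <cpuid.h>
>
>     int main(void)
>     {
>         unsigned int eax, ebx, ecx, edx;
>         if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
>             fprintf(stderr, "CPUID leaf 1 not available\n");
>             return 2;
>         }
>         int has_pcid = (ecx >> 17) & 1; /* CPUID.01H:ECX.PCID[bit 17] */
>         printf("pcid: %s\n", has_pcid ? "present" : "MISSING");
>         return has_pcid ? 0 : 1;
>     }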
>
> So, why would you not have PCID?
>
> It turns out that because PCID was so boring and non-exciting, and Linux
> didn't even use it until a couple of months ago, it's been withheld from
> many guest-OS instances when running on modern hardware and modern
> hypervisors. In my quick and informal polling, I have so far found that:
>
> - Most of the KVM guests I personally looked at did NOT have pcid
> - All the VMware guests I personally looked at had pcid
> - About half the AWS instances I personally looked at did NOT have pcid,
> and the other half did.
>
> [I encourage others to add their experiences, and e.g. enrich this with a
> table of PCID-capability on known instance types on cloud platforms]
>
> The actual Bottom Line:
>
> - On any system that does not currently show "pcid" in the flags line of
> /proc/cpuinfo, Meltdown is a bigger issue than "install latest updates".
>
> - PCID is now a critical feature for both security and performance.
>
> - Many existing Linux guest instances don't have PCID, including many
> cloud instances.
>
> Go get your PCID!
>
