Cold boot attacks are a major risk for the protection that Full-Disk-Encryption solutions provide. Any powered-on computer is vulnerable to this attack and until now there has been no general-purpose solution to this problem.

This entry details my solution for mitigating cold boot attacks. Future posts will delve into additional details and describe what other ideas (and problems) were considered on the path to finding a general-purpose solution.

Why is this blog called "Frozen Cache"? Because we're proposing the use of the CPU cache for thwarting cold boot attacks (at least on most X86 systems). In contrast to most blogs, the entries of this blog will appear in normal chronological order on the website (for a more paper-like reading). Feeds however, will be sorted as expected (reverse chronological order).

The concept is easy: by switching the cache into a special mode one can force that data remains in the cache and is not written to the backing RAM locations. Thus, the encryption key can't be extracted from RAM. This technique is actually not new: LinuxBIOS/CoreBoot calls this Cache-as-RAM. They use it to allow "RAM access", even before the memory controller is initialized.

The following (simplified and technically not 100% correct/complete) steps load and maintain a 256 bit encryption key in the CPU cache. The demo assembly code assumes the encryption key is stored at linear address X in RAM on a page boundary for simplicity:

Load the encryption key from RAM into some CPU registers (like the SSE registers)

movq [X], %xmm0
movq [X+8], %xmm1
movq [X+16], %xmm2
movq [X+24], %xmm3
Overwrite the encryption key in RAM with a zero-value

movq 0, [X]
movq 0, [X+8]
movq 0, [X+16]
movq 0, [X+24]
Flush the cache (thus truly overwriting the encryption key in RAM)

wbinvd
Add the desired RAM region to the CPU's MTRR (the 4K segment containing the key)

movl (X | MEMORY_TYPE_WRITEBACK), %eax
xorl %edx, %edx
movl 0x200, %ecx
wrmsr
movl ( ~(1) | MTRR_VALID ), %eax
movl 0x1, %edx
movl 0x201, %ecx
wrmsr
Disable/freeze the CPU's cache (CR0.CD=1)

movl %cr0, %eax
orl 0x40000000, %eax
movl %eax, %cr0
Write the encryption key from the CPU registers to RAM (data remains in the cache, doesn't get written to memory)

movq %xmm0, [X]
movq %xmm1, [X+8]
movq %xmm2, [X+16]
movq %xmm3, [X+24]

"Disabling/freezing" the CPU's cache severely degrades the performance. However, this seems acceptable if one considers that this special mode only needs to be set whenever the screen is locked (all efforts are pretty much worthless if an unlocked laptop is stolen). A very first proof-of-concept test on Linux shows that there's quite a bit of performance optimization necessary to make even just the act of unlocking the GUI an acceptable experience (from a performance/usability perspective).

Please note that this post only described the very basic concept, there are many aspects that haven't been covered. Upcoming posts will address things like multi-CPU/Core issues, performance considerations/optimizations and lots of other stuff.

Performance aspects

As previously mentioned, performance is a major concern when the cache is switched into no-fill mode ("frozen"). The broad range of CPU architectures (multi-CPU, multi-core, multi-threads) and respective cache configurations makes the matter even more complex.

First off: only a single CPU's cache needs to be "frozen" in order to effectively protect the encryption key; other CPUs are allowed to operate in normal cache mode. This is true for as long as each (logical/virtual) CPU uses its own cache exclusively: CPUs that employ threading technology (like Intel's HyperThreading) appear as two (or potentially more) logical CPUs, but these two CPUs share the same cache and because of this must both change into no-fill cache mode. The situation may be different for multi-core CPUs, if the cores all have their own (L1 and L2) caches.

The encryption key resides only in a single CPU's cache. Only this CPU must therefore execute the encryption and decryption routines. The most prevalent architecture among Full-Disk-Encryption solutions is to employ a kernel module, which spawns a designated kernel thread for the encryption and decryption logic. Kernel threads are schedulable entities and are therefore bindable onto the CPU, which holds the encryption key in its cache.

Back to more "traditional" performance aspects. What can be done to minimize the impact of freezing the CPU cache? Loading the most frequently used memory areas into the cache (before freezing it) is a great start. Among the highest potential candidates are: the system call entry point, the timer interrupt routine and its "helper" functions and the encryption/decryption functions executed by the kernel thread. Current L2 caches are usually large enough to hold all this code, but one also needs to consider the cache's associativity in order to not shoot one into the foot. Another good idea is to schedule all other processes onto any of the other available CPUs (which don't use the frozen cache): this allows for them to be executed at "normal" speed. There's another why this is important, but we'll get to this some other time.

It should be obvious by now, that an implementation will have to identify the specific CPU/cache components at runtime and "manage" them accordingly. My proof-of-concept implementation for Linux will be purely for single-CPU systems (for simplicity), but I'll explain the technical details in a future post (like so many other things).

Lack of cache control

Management of the cache contents isn't over once the cache has been "frozen": it is also important, that the data in the cache (the encryption key) isn't written back to memory. Unfortunately, the Intel architecture allows only very minimalistic cache control:

enable / disable cache (systemwide: CR0.CD, per memory-region: MTRR, per page: PAT)
flush cache (wbinvd)

That's it. There are no processor instructions for querying the status of the cache (the currently held RAM locations in the individual cache lines) or any other "advanced" cache management functions. Therefore, it is nearly impossible to verify that the encryption key is really only present in the CPU cache. With the frozen cache setup, it's pretty much guaranteed that the key will be present in the cache - but that doesn't say anything about whether that data hasn't been sync'ed to RAM. This happens (in the frozen cache setup) whenever the wbinvd instruction is executed; this instruction can be executed by any code running in ring 0 (kernel). Therefore, it is important to minimize the (kernel) code that runs on the CPU which holds the encryption key in its cache. This is why binding the other schedulable entities (at least all other kernel threads) onto other CPUs (if present) is important, too.

One way to minimize the impact of "unintentional" cache flushes (unintentional from our point of view) is to repeat the cache freezing procedure periodically in order to reverse the effects of "unintentional" cache flushes (wbinvd). Fortunately, there's at least a (theoretically) better solution for Linux: modify the function/macro that executes/wraps the invd/wbinvd instructions in the kernel to trigger the re-execution of the cache freezing (independent of how realistic the chances of integration of such a patch seem).

Protecting the encryption key

The cryptographic key is not the only data item that needs to be kept in the CPU cache in order to keep it from prying eyes/spraycans.

Key scheduling is an established paradigm in modern cryptographic ciphers: the encryption and decryption routines don't use the encryption key directly, but rather employ "round keys". These are derived from the encryption key (by a cipher-specific algorithm) and are than used in the encryption/decryption in the various rounds of the algorithm.

The AES standard defines that a 128 bit AES key is used to generate/derive 10 round keys of 128 bit each (for 192/256 bit AES keys 12/14 round keys are calculated respectively). One would expect that these round keys are calculated from the encryption key by some non-reversible hash-like-function. Unfortunately, this isn't true for AES: the encryption key is easily re-calculatable from any of the 10/12/14 round keys. Thus: these round keys need to be kept inside the CPU cache as well (at least for AES, which is used quite often).

Sounds easy in theory, but especially for Linux it's quite a challenge to find a nicely structured approach instead of just hacking something together.

Locking the screen

One important aspect of my proposition is that the performance impact is only in effect whenever the screen is locked (only than are the keys stored "safely" in the CPU cache). However, there are very likely situations in which one would like to lock the screen but not suffer the performance impact (such as compiling software over lunch).

I foresee two strategies for maintaining native system performance:

configuration options that determine whenever the cache shouldn't be frozen
a time-window after the screen is locked which allows for the some user interaction to prevent cache freezing

The configuration options could be items like "don't freeze the cache if the system is in a docking station" or "... is AC powered" or "a process named gcc is running" and so on.

The time-window approach could be something like a count-down which starts right when the screen is locked (and the user might be still in front of the computer). Clicking on the "don't freeze the cache" button during the countdown would prevent the key protection - while ignoring it would lead to the desired protection (thus addressing the case the computer auto-locks the screen, it would only add a small window of additional exposure for the encryption key).

Protecting the encryption key: it is not just the encryption key

I've followed the coverage on Slashdot, Hack-a-day and other sites: it's great that my effort is being exposed to such a broad audience, but it seems that there is a misunderstanding about the details of my research.

I supposedly suggested to protect only the encryption key by "removing" it from RAM and keeping in the CPU cache. However, this is not the case, as I've previously stated in the entry "Protecting the encryption key":

Thus: these round keys need to be kept inside the CPU cache as well (at least for AES, which is used quite often).

Maybe the misunderstanding about what all should be "protected" arose because of the demo code in my first blog entry; it only shows how to "move" 256 bits to the CPU cache and how to than "freeze" the CPU cache. That assembly code was only meant to demonstrate the core concept, not more.

Anyhow, I would like to take this opportunity and to explain a bit more thoroughly what all needs to be kept in the CPU cache in order to achieve "perfect" protection against cold-boot-attacks.

Obviously, the "encryption key" needs to be protected (duh). Secondly, the "key schedule" (I previously called these "round keys") need to be protected; the "key schedule" is derived directly from the "encryption key" and could be seen as being an "expanded" version of the "encryption key". Thirdly, one should aim to protect the "Initialization Vector" (IV). Whether the IV in fact needs to be really kept secret depends on how the IV is determined/generated. The "Encrypted Salt-Sector Initialization Vector" (ESSIV) is one example of an IV that should definitively be protected: the ESSIV is a hash of the "encryption key" and is the default IV used by dm-crypt on Linux. Forthly, any buffers containing the contents of decrypted sectors should be protected in order to prevent known plain-text attacks for ciphers that are vulnerable to this attack (however, protecting these memory buffers is especially tricky). Lastly, any data values calculated during the encryption/decryption should be "securely" stored in the CPU cache in order to prevent key analysis.

I hinted in the "Protecting the encryption key" entry that designing an elegant implementation is somewhat troublesome on Linux. I would like to explain why: some key relevant data is kept in data structures maintained by dm-crypt(.c), but other data items are calculated and managed by the crypto API. The proof-of-concept implementation therefore requires changes in multiple parts of the Linux kernel, this makes it a bit more challenging - unless one is willing to resort to ugly hacks (which I am not).

Controlling the uncontrollable cache

I think I found a solution for the last significant challenge, which I've described in the blog entry "Lack of cache control": it's the uncertainty about whether any data in the CPU cache has been flushed out into RAM. This flushing could be initiated by CPU instructions like invd, wbinvd and clflush (Thanks, haxwell) or even external events like signals on a CPU pin (although this is just my personal speculation).

I've previously suggested this approach to minimize this risk:

One way to minimize the impact of "unintentional" cache flushes (unintentional from our point of view) is to repeat the cache freezing procedure periodically in order to reverse the effects of "unintentional" cache flushes (wbinvd).

My new idea would eliminate this risk all together. However, I haven't actually verified yet, whether this idea can actually be implemented. Keep this in mind while reading the next paragraphs. It is also important to understand the difference between physical/linear and virtual memory addresses; if you don't know what they are then you should read this before you read on.

The idea is actually quite simple: keep the data in the cache on physical/linear addresses which aren't backed by RAM on the system. This would guarantee that the data won't leave the CPU cache, even if a cache flush is triggered (a GP would be raised).

What I haven't verified, is whether it is actually possible to set up this scenario. The setup procedure might look something like this:

load the data into CPU registers
overwrite the data in RAM
change the virtual-to-linear mapping for that virtual address to a non-existant physical/linear address (by modifying the appropriate page-table entry)
switch the CPU cache into frozen mode
move the data from the CPU registers into the CPU cache

What I don't know yet is: whether CPUs will throw faults in step 3 (or maybe even step 4) and render this idea impossible to implement/execute. After all, this isn't quite conforming with a CPU's understanding of "correct" memory management. It'll be a few days before I'll have time to actually implement and verify this idea - I'll let you know.

One last note: obviously, if one has 4GB of RAM then there would be no invalid linear addresses - PAE might be a possibility, but that's a problem for much later.

[linuxkernelnewbies] Frozen Cache

Frozen Cache

The concept