Re: Optimisation Considered Harmful
Here's an amusing example of optimization: on the PowerPC 7450 (G4e), integer multiplication is faster by one cycle if the second operand is between -131072 and 131071. Ever use multiplication in cryptography?

Jerrold Leichter writes:
| There are only a couple of roads forward:
|
| - Develop algorithms that offer reasonable performance even if
|   implemented in unoptimized ways.

We already have some secret-key ciphers that naturally run in constant time on current CPUs. One example is my own Salsa20, which is a quite conservative design but still faster than AES. Some other examples are Phelix and good old TEA.

Furthermore, most CPUs have constant-time floating-point multiplication. Floating-point multiplication usually turns out to be the fastest way to do integer arithmetic in RSA etc., although it takes some effort to use.

The quick summary is that we _can_ have high-speed constant-time cryptography. The algorithms are already there---although they need to be distinguished from the impostors such as Rijndael. The implementation techniques are already there---although they need to be more widely understood and used.

| I can't recall ever seeing a benchmark that reported the variance of
| timing across instances of the loop.

For years now I've been reporting multiple individual timings rather than averages. See, e.g., http://cr.yp.to/mac/poly1305-20041101.pdf, Appendix B; those are the AES timings that prompted me to start investigating cache-timing attacks. (Subsequent versions of the poly1305 paper report even more timing information but, for space reasons, have to compress the information into small graphs. Big tables are on the web.)

---D. J. Bernstein, Associate Professor, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago

- The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: Optimisation Considered Harmful
| Suppose you have something that is inadvertently an
| oracle - it encrypts stuff from many different users
| preparatory to sending it out over the internet, and
| makes no effort to strongly authenticate a user.
|
| Have it encrypt stuff into a buffer, and on a timer
| event, send out the buffer.

| The problem is with edges:
|
| Suppose the timer goes off every 10ms.
|
| You have an operation that takes either 5ms or 15ms, depending on
| whether a chosen bit of the key is 1 or 0.
|
| Whether or not a given time slot is occupied with results will emit
| whether the bit was 1 or 0.
|
| Now, suppose your timer goes off every 200ms. No problem, right?
|
| At time=190ms, you force an encryption. If it's done by the time=200ms
| deadline, you know.

Thus bringing us back to an observation that goes back maybe 30 years: timing channels are usually impractical to eliminate, but you can lower the bit rate (or decrease the S/N ratio, which comes down to the same thing). For the case at hand, we've lowered the bit rate by a factor of 20.

If you're willing to change the protocol a bit, you can do better: never send a block whose encryption crossed the boundary. Instead, send a special "ignore me" block. (Obviously, the "ignore me" marker is *inside* the encrypted envelope.) Depending on the nature of the protocol - is the size of a response invariant, or is the number of exchanges invariant? - you then continue either by sending the real block, or by the sender returning an "ignore me" block which you reply to with the encrypted block that crossed the boundary.

This still doesn't completely remove the timing channel - there are differences in the data stream depending on whether you cross the boundary - but they are much subtler, so again you've decreased the bit rate. (It's not clear how to quantify in this case, though.)

-- Jerry
Re: Optimisation Considered Harmful
On Thu, 23 Jun 2005, Jerrold Leichter wrote:
| Consider what it means to create an optimizing compiler (or some kind
| of optimization in hardware - the issues are the same, but I'll talk in
| terms of a compiler for definiteness.) The input is source code; the
| output is a bunch of instructions. A compiler's transformation is
| correct if it's semantics-preserving:

Yep. As a sometime compiler weenie, I'll give literal answers to your next few questions.

| The big issue in compiler optimization - and even more so in some
| hardware optimization - is defining exactly what semantics has to be
| preserved. For example, must computations be done even if the compiler
| can determine that they cannot affect the output?

No. There's no language higher than assembly that requires that. A lot of compilers will do the redundant computations anyway, because the authors of those compilers considered that it would be too much work to prove what can be eliminated, and in imperative languages like C there's often little benefit in doing so. But there are transformations and techniques you can apply (autoconverting to the laziest semantics allowed by the language spec) to determine exactly what can be eliminated, and emit code that *only* evaluates that which does affect the output. And increasingly, compilers are doing them.

| Can the rules of algebra be used to rearrange expressions (possibly
| breaking carefully crafted error management strategies, since floating
| point arithmetic doesn't actually obey the rules of algebra)?

If it changes the output, it's considered an error for C compilers to do this. That said, it's considered a minor error, and since a lot of people are in a great hurry to get wrong answers, GCC will knowingly and willingly commit this error for you (as an extension to the language) if you give it a -O option that's too high. And several other compilers (including Borland's and Microsoft's) have committed this error in the past without giving anyone the ability to turn it off, or by default with a --fpstrict option to turn it off.

That said, in languages other than C you have to look to the language spec to see whether this is allowed or not. If the spec is silent on the topic, then rearranging in a way that changes output is still usually considered a bug, but it becomes open to debate whether it is an actual bug or not. Some language specs (including Common Lisp, IIRC) and several single-implementation languages (including Mathematica, IIRC) allow arbitrary rearrangement of floating-point operations, provided that the compiler can prove that *at least* as much precision is preserved as would be gained by doing it in the order it appears in the source. This is explicitly allowed to change output, provided the compiler can prove that the changed output is a closer approximation to the algebraic solution. Even in languages where this is allowed, many implementors consider it tasteless and try to avoid doing it, conforming their compilers' behavior to the principle of least surprise.

| Do writes to two variables written in order in the source code have to
| occur in that order in the object code?

In C, the answer depends on whether the variables are declared volatile. If they are not declared volatile, then the writes need not appear in any particular order, and if the compiler can prove the values are not used, need not appear at all. In languages without volatile variables, the answer is usually no.

| If two writes are issued in order to the hardware, do they have to hit
| memory in that order?

Uh, I think the short answer is no, but you'll never be able to tell the difference in normal circumstances. It's common for a superscalar chip to buffer memory writes in its L1 cache, and write a cache line to memory at a go (which may reorder the writes to memory that's off the chip). But no software running on the same chip will be able to tell by reading those addresses, because it will snag the written values (even though they haven't hit physical memory yet) out of the L1 cache. Where this can break down - with the possibility that you'd be able to tell it's breaking down - is volatile variables in multiprocessor machines. But these are normally protected by mutexes, the implementation of which must trigger explicit L1 cache flushes on multiprocessor machines.

| An understanding of what semantics has to be preserved, and what
| semantics is side-effect and can be tossed to gain performance, has
| gradually emerged in the hardware and software communities. There have
| been, and continue to be, missteps along the way. Some of the weaker
| memory consistency models might have helped the hardware guys, but were
| just too hard for the software guys to deal with. Newbies to
| multi-threaded programming are caught by the not-so-obvious memory
| access semantics present even at the language level in all common
| programming languages. (Java tried to pin this down exactly and got it
| completely wrong for several tries.)
Re: Optimisation Considered Harmful
-- James A. Donald:
| Suppose you have something that is inadvertently an oracle - it
| encrypts stuff from many different users preparatory to sending it out
| over the internet, and makes no effort to strongly authenticate a user.
|
| Have it encrypt stuff into a buffer, and on a timer event, send out the
| buffer. Your code is now of course multithreaded - very easy to get
| multithreading bugs that never show up during testing, but
| non-deterministically show up in actual use.

On 24 Jun 2005 at 12:25, Dan Kaminsky wrote:
| The problem is with edges:
|
| Now, suppose your timer goes off every 200ms. No problem, right?
|
| At time=190ms, you force an encryption. If it's done by the time=200ms
| deadline, you know.

It should have been needless to say that at the end of each time frame, the oracle only starts sending out stuff encrypted in response to data received at least n time frames previously, where n is a small positive number, possibly one. A time frame is longer than the difference between the quickest and slowest encryption of a block; n time frames is longer than the slowest encryption of a block.

--digsig
James A. Donald
6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG
JdXC3IuQNnYvM2SrAOIY2iLJyhKf21IR191yeebK
4FIl5EvQ0dseZCj2m2/NsQANv7tID98AAQ+pJMARn
Re: Optimisation Considered Harmful
On Thu, Jun 23, 2005 at 07:36:38AM -0400, Jerrold Leichter wrote:
| - Develop algorithms that offer reasonable performance even if
|   implemented in unoptimized ways. This will be difficult to maintain
|   in the face of ever-increasing hardware optimizations that you can't
|   just turn off by not using -O.
|
| - Live with less performance and hope that raw hardware speeds will
|   catch up.
|
| - Use specialized hardware, designed not to leak side-channel
|   information.
|
| - ?

- Find reasonably efficient masking strategies, that assume that side-channel attacks are here to stay, and randomly choose one of many isomorphic ways to perform the computation. The masking would have to eliminate key/data correlation from all observables other than the final output.

-- 
Victor Duchovni
IT Security, Morgan Stanley
Re: Optimisation Considered Harmful
Victor Duchovni wrote:
| On Thu, Jun 23, 2005 at 07:36:38AM -0400, Jerrold Leichter wrote:
| | - Develop algorithms that offer reasonable performance even if
| |   implemented in unoptimized ways. This will be difficult to maintain
| |   in the face of ever-increasing hardware optimizations that you
| |   can't just turn off by not using -O.
| |
| | - Live with less performance and hope that raw hardware speeds will
| |   catch up.
| |
| | - Use specialized hardware, designed not to leak side-channel
| |   information.
| |
| | - ?
|
| - Find reasonably efficient masking strategies, that assume that
|   side-channel attacks are here to stay, and randomly choose one of
|   many isomorphic ways to perform the computation. The masking would
|   have to eliminate key/data correlation from all observables other
|   than the final output.

If it does that, why do you want to choose one of many? Surely a single one will do?

-- 
ApacheCon Europe http://www.apachecon.com/
http://www.apache-ssl.org/ben.html http://www.thebunker.net/

"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff
Re: Optimisation Considered Harmful
On Fri, Jun 24, 2005 at 10:00:55AM +0100, Ben Laurie wrote:
| | - Find reasonably efficient masking strategies, that assume that
| |   side-channel attacks are here to stay, and randomly choose one of
| |   many isomorphic ways to perform the computation. The masking would
| |   have to eliminate key/data correlation from all observables other
| |   than the final output.
|
| If it does that, why do you want to choose one of many? Surely a
| single one will do?

The idea is that each choice leaks side-channel information about its algorithm, but the attacker does not know which one was chosen. And repeated observations do not, on average (over all algorithms), show correlation between the key or data and side-channel information (other than the final output).

Is this possible? There is a paper that claims no correlation with any single intermediate result; is that strong enough?

-- 
Victor Duchovni
IT Security, Morgan Stanley
Re: Optimisation Considered Harmful
On 23 Jun 2005 at 0:50, Ben Laurie wrote:
| A brief altercation this evening with CERT over the recent hyperthread
| caching issues has brought something that's been simmering at the back
| of my brain to the forefront.
|
| The recent hyperthread/cache key recovery trick, followed by DJB's
| related (IMO) symmetric key recovery, and preceded by the RSA timing
| attacks (Boneh et al?) are all examples of attacks on the same thing:
| optimisation.
|
| The problem is that if different paths through your code expose
| different availability of optimisation, then there's a timing attack
| available. I predict, for example, that someone will find a way to
| attack something using pipeline breaks on Pentium processors[1].
|
| How do we fix this? Well, it's easy to say: we make everything equally
| crap - we make sure that all operations are as slow as the slowest
| possible variant can be. However, on a modern processor, this is
| _very_ hard to do.

Suppose you have something that is inadvertently an oracle - it encrypts stuff from many different users preparatory to sending it out over the internet, and makes no effort to strongly authenticate a user.

Have it encrypt stuff into a buffer, and on a timer event, send out the buffer. Your code is now of course multithreaded - very easy to get multithreading bugs that never show up during testing, but non-deterministically show up in actual use.

--digsig
James A. Donald
6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG
fWkmIPqr+sQN9GW27vahB3Bc9ulLdzbGrPKEjXFL
4nPDlKsQgDKH6LEnS3M7ECcBByW0lH5o7CUzo2UYB
Re: Optimisation Considered Harmful
| Suppose you have something that is inadvertently an oracle - it
| encrypts stuff from many different users preparatory to sending it out
| over the internet, and makes no effort to strongly authenticate a user.
|
| Have it encrypt stuff into a buffer, and on a timer event, send out the
| buffer. Your code is now of course multithreaded - very easy to get
| multithreading bugs that never show up during testing, but
| non-deterministically show up in actual use.

The problem is with edges:

Suppose the timer goes off every 10ms.

You have an operation that takes either 5ms or 15ms, depending on whether a chosen bit of the key is 1 or 0.

Whether or not a given time slot is occupied with results will emit whether the bit was 1 or 0.

Now, suppose your timer goes off every 200ms. No problem, right?

At time=190ms, you force an encryption. If it's done by the time=200ms deadline, you know.

Things get trickier when there's random noise in the timer, and it matters whether the distribution of 1's and 0's is equal or not. But this is fundamentally a difficult problem to handle.

--Dan
Re: Optimisation Considered Harmful
| A brief altercation this evening with CERT over the recent hyperthread
| caching issues has brought something that's been simmering at the back
| of my brain to the forefront.
|
| The recent hyperthread/cache key recovery trick, followed by DJB's
| related (IMO) symmetric key recovery, and preceded by the RSA timing
| attacks (Boneh et al?) are all examples of attacks on the same thing:
| optimisation.
|
| The problem is that if different paths through your code expose
| different availability of optimisation, then there's a timing attack
| available. I predict, for example, that someone will find a way to
| attack something using pipeline breaks on Pentium processors[1].
|
| How do we fix this? Well, it's easy to say: we make everything equally
| crap - we make sure that all operations are as slow as the slowest
| possible variant can be. However, on a modern processor, this is
| _very_ hard to do.
|
| Could it be that for safe crypto we should be using Z80s?

This is an excellent point, as it reveals a deep and significant parallel between cryptography and compiler/hardware optimization.

Consider what it means to create an optimizing compiler (or some kind of optimization in hardware - the issues are the same, but I'll talk in terms of a compiler for definiteness). The input is source code; the output is a bunch of instructions. A compiler's transformation is correct if it's semantics-preserving: the source has some meaning, and the object code correctly represents that meaning. There are an infinite number of possible object codes that preserve the input semantics. Some are better than others with respect to some objective function, say size or speed. An optimizing compiler simply chooses a better object code than some reference choice.

The big issue in compiler optimization - and even more so in some hardware optimization - is defining exactly what semantics has to be preserved. For example, must computations be done even if the compiler can determine that they cannot affect the output? Can the rules of algebra be used to rearrange expressions (possibly breaking carefully crafted error management strategies, since floating point arithmetic doesn't actually obey the rules of algebra)? Do writes to two variables written in order in the source code have to occur in that order in the object code? If two writes are issued in order to the hardware, do they have to hit memory in that order?

An understanding of what semantics has to be preserved, and what semantics is side-effect and can be tossed to gain performance, has gradually emerged in the hardware and software communities. There have been, and continue to be, missteps along the way. Some of the weaker memory consistency models might have helped the hardware guys, but were just too hard for the software guys to deal with. Newbies to multi-threaded programming are caught by the not-so-obvious memory access semantics present even at the language level in all common programming languages. (Java tried to pin this down exactly and got it completely wrong for several tries.)

Enter cryptographic algorithms. On their face, these are simple mathematical transformations. You have to really abuse the implementations (e.g., having multiple side-effect-producing operations in one statement) to get into an area where a programmer or hardware developer's warning bell would sound "watch the semantics". And, in fact, from the point of view of input/output transformations, there really are no semantic issues. The problem is that these side-channel attacks broaden the meaning of the program to something that has never been considered in previous work that I know of. (The closest you are likely to find is in complaints by real-time programmers that modern machines give you no way to determine how long an instruction sequence will really take: you might take a page fault, or a cache miss, or who knows what along the way, and in some real-time code, you have to *know* that that won't happen. Such code really is sometimes run only on machines like Z80's!)

What can be done? Well, the first thing that's clearly needed is a more complete specification of the semantics of cryptographic algorithms. Just the input/output transformation - which is all we write down in most analyses - is insufficient. We sometimes state things informally, almost in passing - as in the comments on AES that table accesses take constant time. We certainly assume things like "the time to add two numbers is independent of the number of carries" - which is probably true on machines today, but may actually have been false at one time. Without a way to write down what matters, we have no way to judge whether a particular compilation/hardware approach is safe. There's some work on abstract program interpretation that might help here (though it's mainly aimed in other directions).

Ultimately, performance is likely to suffer. Software and hardware optimizations are
Optimisation Considered Harmful
A brief altercation this evening with CERT over the recent hyperthread caching issues has brought something that's been simmering at the back of my brain to the forefront.

The recent hyperthread/cache key recovery trick, followed by DJB's related (IMO) symmetric key recovery, and preceded by the RSA timing attacks (Boneh et al?) are all examples of attacks on the same thing: optimisation.

The problem is that if different paths through your code expose different availability of optimisation, then there's a timing attack available. I predict, for example, that someone will find a way to attack something using pipeline breaks on Pentium processors[1].

How do we fix this? Well, it's easy to say: we make everything equally crap - we make sure that all operations are as slow as the slowest possible variant can be. However, on a modern processor, this is _very_ hard to do.

Could it be that for safe crypto we should be using Z80s?

Cheers,

Ben.

[1] For those who don't know, even old Pentia have several different processing units internally, which run in parallel. They can even tell when instructions in one's pipeline conflict with another's (e.g. one uses a register as input that another is going to change, but hasn't yet), and delay processing appropriately. However, sometimes they can't work it out (this is, of course, another optimisation, this time for transistor count) and so they just throw their hands up, stop all the pipelines and start again. This causes a _substantial_ delay, easily measurable - I know of loops of tens of instructions that go twice as fast if apparently redundant - but pipeline-intelligence-informing - instructions are inserted. Of course, the PowerPCs with their speculative execution are probably even more fun in this respect (do they still do that?).

-- 
ApacheCon Europe http://www.apachecon.com/
http://www.apache-ssl.org/ben.html http://www.thebunker.net/

"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff