http://www.yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html

Consistency: how to defeat the purpose of IEEE floating point

October 31st, 2008 | hardware, numerical, software

I don’t know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call “intellectual effort” went into it. I don’t even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don’t trust my own summary of the requirements very much.

That said, here’s the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I’m not going to claim this requirement is misguided, because I don’t feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I’ve ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we’ll discuss. The camps are:
The majority of people who use and actually should be using floating point are thus in camp 3. Those people don’t care about precision anywhere near as much as camp 2 does, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It’s important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can’t reproduce problems.

Side note: I don’t believe in build modes; I usually debug production code in release mode. It’s not just floating point that’s inconsistent across modes - it’s code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it’s nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refactorings, and you can check their sanity by making sure they didn’t change the results of the previous version. As we’ll see, IEEE FP won’t give you even that, regardless of platforms and build modes.

The bottom line is that if you’re in camp 3, you want consistency, while the “only” things you can expect from IEEE FP are precision and speed. Sure, “only” should be put in quotes because it’s a lot to get, it’s just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners.
To them I say: look, if IEEE FP wasn’t the huge success that it is in the precision and speed departments, you wouldn’t be hearing from us because we’d be busy struggling with those problems. What we’re saying is the exact opposite of “IEEE FP sucks”. It’s “IEEE FP is so damn precise and fast that I’m happy with ALL of its many answers - the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one.” I think it’s more flattering than insulting.

I’ve accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I’m after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it rears its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:
The good news is that most pain comes from item 3 which can be more or less solved automatically. For the purpose of decision making (“should we invest energy into FP consistency or is it futile?”), I’d say that it’s not futile and if you can cite actual benefits you’d get from consistency, then it’s worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right - which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there - I’m generally a weak tech lawyer and can be wrong. I can only give you the “worked on my 4 families of machines” kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field - you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn’t help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.

This could be extremely intimidating because you can’t trust any FP expression with more than 2 inputs. However, I think that programming languages in general don’t allow optimizers to make these assumptions, and in particular, the C standard doesn’t (C99 §5.1.2.3 #13, didn’t read it in the document but saw it cited in two sources).
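To make the regrouping problem concrete, here is a minimal sketch in C. The function names and the input values are mine, picked for illustration: a and b cancel exactly, and c is far too small to survive being added to b. It assumes a build without value-changing flags such as -ffast-math, which would let the compiler regroup the additions itself.

```c
#include <stdio.h>

/* Grouping as written in the source: (a + b) + c. */
float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

/* The regrouping an "algebraic" optimizer might pick: a + (b + c). */
float sum_right(float a, float b, float c)
{
    return a + (b + c);
}

/* Prints both groupings for inputs where a and b cancel exactly
   and c is much smaller than the spacing between floats near b. */
void show_grouping_difference(void)
{
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("(a+b)+c = %g\n", sum_left(a, b, c));
    printf("a+(b+c) = %g\n", sum_right(a, b, c));
}
```

With these inputs the first grouping keeps c and the second one absorbs it into b before the cancellation happens, so an optimizer that regroups the additions silently turns one answer into the other.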
So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a - the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can’t be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can’t be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

“Complex” instructions

Your target hardware can have instructions computing “non-trivial” expressions beyond a*b or a+b, such as a+=b*c or sin(x).
The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it’s great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don’t think it’s so great, because it happens in some build modes but not others, and across platforms the availability of these instructions varies, as does their precision.

AFAIK the “contraction” of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn’t addressed by either standard - to the same effect of making the optimization perfectly legitimate. So you can’t solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer’s wishes doesn’t help much. AFAIK what can help is one of these two things:
The trouble with the shotgun approach, aside from its ugliness, is that you can’t afford to shoot at the performance-critical parts of your code that way. Let us hope that you’ll never really have to choose between FP consistency and performance, as I’ve never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy - intermediate results should be as precise as possible; more on that in a moment. The “convoluted” part is consistent with the general insanity of the x86 instruction set. Specifically, the “old” (a.k.a. “x87”) floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer’s dream target, but so is the rest of x86. The “new” floating point instructions in SSE don’t have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don’t care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The “old” FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, ’cause what else can it be. Precision of intermediate results in the “registers” (or the “FP stack” or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it’s 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn’t itself an “intermediate” result). Indeed, what’s “intermediate” in the sense of not being written to memory and what isn’t? That depends on:
Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn’t help with the other compiler and (2) for some reason, compilers apparently don’t offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) - that spelling forces consistency (although the C++ standard doesn’t assign it a special meaning not included in the plain and sane a*b+c*d). I’d guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven’t heard about someone caring both about consistency and that level of precision, but I did hear that gcc’s consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name - AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn’t it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can’t get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme. And the thing is, you never have to deal with it outside of x86 - all floating point units I’ve come across, including the ones specified by Intel’s SSE and SSE2, simply compute 32b results from 32b inputs. 
People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there’s absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here’s what you can do:
Side note: I sure as hell don’t believe in “very special” “testing” build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it’s completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We’d be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don’t have SSE, the only option is (2) - set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

    #include <fpu_control.h>
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing everyone to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It’s a trade-off.

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with “real” 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it’s bad enough already, and camp 3 people don’t really care about the extra horror; if you want those Halloween stories, you can find them here.
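As a hedged sketch, the control-word snippet above can be wrapped in a function that also reads the setting back, so that a platform where it silently does nothing is at least noticed. The function name, the HAVE_X87_CONTROL_WORD macro and the return-code convention are my own inventions for illustration; only fpu_control.h and its _FPU_* macros come from glibc, and the code assumes glibc on x86 (elsewhere it does nothing and says so).

```c
#include <stdlib.h>  /* on glibc, any libc header also defines __GLIBC__ */

#if defined(__GLIBC__) && (defined(__i386__) || defined(__x86_64__))
#include <fpu_control.h>
#define HAVE_X87_CONTROL_WORD 1
#else
#define HAVE_X87_CONTROL_WORD 0
#endif

/* Hypothetical helper: ask the x87 FPU to round intermediate mantissas
   to double precision instead of extended precision.
   Returns 1 if the control word reads back as requested, 0 if it does not,
   and -1 where the glibc x86 macros are unavailable. */
int set_x87_double_precision(void)
{
#if HAVE_X87_CONTROL_WORD
    fpu_control_t cw;
    _FPU_GETCW(cw);
    /* _FPU_EXTENDED has both precision-control bits set, so it works as a mask. */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);

    _FPU_GETCW(cw);  /* read back to confirm the setting took effect */
    return (cw & _FPU_EXTENDED) == _FPU_DOUBLE;
#else
    return -1;
#endif
}
```

Calling set_x87_double_precision() first thing in main() and refusing to continue on a 0 result is one way to use it. Note that this only touches the x87 unit; code compiled to SSE instructions is unaffected by the control word either way.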
The good news is that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:
Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel’s compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn’t conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn’t work even on them.

Summary: use SSE2 or SSE, and if you can’t, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. - and if you need that, you’re not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I’ll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!