So we've been talking about the micro-architecture of multipliers and floating-point units, but it's got me thinking about a macro-problem with parallel computations.
Let's suppose (in a GPU case) you want to compute the average light intensity of a bunch of pixels, and they are stored as floating-point numbers (or maybe you have a bunch of intensity numbers from a deep-space image with six orders of magnitude of dynamic range). If you have a serial program and add the numbers in order, then divide by the total count, you'll always get the same thing, all nice and deterministic like we expect. If you distribute the problem to multiple CPUs (or GPUs) and add in parallel, the round-off error (and therefore the answer) changes depending on how many compute elements you have, or even worse, if you try to get the answer fast, the round-off changes depending on which element got done first and passed its contribution up to a parent.

Is there any reasonable way to figure out the worst-case round-off for a parallel global sum (or product, or whatever else)?

In my undergrad days, I ran across someone (I think it was John Gustafson) talking about a micro-architecture with a new floating-point format that carries both the absolute worst upper-bound rounding error and the worst lower-bound error. The simple way is just to drag around another float and do the computation twice, once rounding always up and once rounding always down. Has anyone else heard of a good solution to this problem, or can you reference some published work on it? The idea being that if you wanted to know how big your round-off error was, it would be a trivial matter of subtracting the lower bound from the upper.
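Just to make the "do it twice with directed rounding" idea concrete, here is a rough sketch on a plain CPU (my own illustration, not anything from Gustafson's papers) using C99's <fenv.h>. The function name bounded_sum and the sample values are made up, and the compiler has to be told to actually honor the rounding mode (no fast-math style reordering) for the bounds to mean anything:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

/* Sum the array once rounding every add toward -inf and once toward +inf.
 * The exact mathematical sum of the stored values is guaranteed to lie
 * between the two results. */
static void bounded_sum(const float *x, int n, float *lo, float *hi)
{
    int old_mode = fegetround();
    float s;
    int i;

    fesetround(FE_DOWNWARD);             /* every add rounds down */
    for (s = 0.0f, i = 0; i < n; i++)
        s += x[i];
    *lo = s;

    fesetround(FE_UPWARD);               /* every add rounds up */
    for (s = 0.0f, i = 0; i < n; i++)
        s += x[i];
    *hi = s;

    fesetround(old_mode);                /* restore the caller's mode */
}

int main(void)
{
    float intensity[4] = { 1.0e6f, 3.14159f, 2.0e-3f, 7.5f };
    float lo, hi;

    bounded_sum(intensity, 4, &lo, &hi);
    /* hi - lo is a guaranteed bound on the accumulated round-off */
    printf("sum in [%.9g, %.9g], width %g\n", lo, hi, (double)(hi - lo));
    return 0;
}

The useful property is that as long as every add (including the adds that combine partial sums from different compute elements) uses the same directed rounding, the [lo, hi] bracket stays valid no matter what order the parallel reduction happens to use, so hi - lo is exactly the kind of worst-case round-off number I'm asking about, paid for with two passes (or one pass that carries two floats).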
