So we've been talking about the micro-architecture of multipliers and
floating point, but it's got me thinking about a macro-problem with
parallel computations.

Let's suppose (in the GPU case) you want to compute the average light
intensity of a bunch of pixels, and they are stored as floating-point
numbers (or maybe you have a bunch of intensity numbers from a deep-space
image with 6 orders of magnitude of dynamic range).

If you have a serial program, and add the numbers, in order, then 
divide by the total count, you'll always get the same thing, all nice
and deterministic like we expect.

If you distribute the problem to multiple CPUs (or GPUs) and add in
parallel, the round-off error (and hence the answer) changes depending on
how many compute elements you have. Even worse, if you try to get the
answer fast, the round-offs change depending on which element finished
first and passed its contribution up to a parent.
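
Here's a contrived little C sketch of what I mean (the values are cooked
up to exaggerate the effect, the "parallel" reduction is just simulated
with two partial sums, and it assumes ordinary IEEE-754 floats compiled
without -ffast-math):

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        /* Contrived data: each 1.0 is below half an ulp of 1e8, so it is
           kept or lost depending on where it lands in the ordering. */
        float v[N] = { 1.0f, 1.0e8f, 1.0f, 1.0f, -1.0e8f, 1.0f, 1.0f, 1.0f };

        /* Serial, in-order sum: deterministic. */
        float serial = 0.0f;
        for (int i = 0; i < N; i++)
            serial += v[i];

        /* Simulated two-worker "parallel" reduction: each worker sums its
           half, then a parent combines the partials.  Same data, different
           grouping, different rounding. */
        float p0 = 0.0f, p1 = 0.0f;
        for (int i = 0; i < N / 2; i++)
            p0 += v[i];
        for (int i = N / 2; i < N; i++)
            p1 += v[i];
        float parallel = p0 + p1;

        printf("serial   mean = %g\n", serial / N);   /* 0.375 */
        printf("parallel mean = %g\n", parallel / N); /* 0     */
        return 0;
    }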

Is there any reasonable way to figure out the worst-case round-off
for a parallel global sum (or product, or whatever else)?

In my undergrad days, I ran across someone (I think it was John
Gustafson) talking about a micro-architecture in which we'd have a new
floating point format that carries the absolute worst upper-bound
rounding error, as well as the worst lower-bound error. The simple way
is to just drag around another float and do the computation twice: once
always rounding up, once always rounding down.
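
In software, I imagine that boils down to something like this C99 sketch
using the <fenv.h> directed rounding modes (single-threaded here, just to
show the idea; a real version would carry the pair through the parallel
reduction tree):

    #include <fenv.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    /* Sum an array under a given IEEE-754 rounding mode (FE_DOWNWARD or
       FE_UPWARD), restoring the previous mode afterwards. */
    static double sum_rounded(const double *v, int n, int mode)
    {
        int old = fegetround();
        fesetround(mode);
        volatile double s = 0.0;   /* volatile keeps the compiler from
                                      folding the loop at compile time */
        for (int i = 0; i < n; i++)
            s += v[i];
        fesetround(old);
        return s;
    }

    int main(void)
    {
        double v[] = { 1.0e6, 3.0e-3, 7.0, 0.25, -2.5e5, 1.5 };
        int n = sizeof v / sizeof v[0];

        double lo = sum_rounded(v, n, FE_DOWNWARD); /* <= exact sum */
        double hi = sum_rounded(v, n, FE_UPWARD);   /* >= exact sum */

        printf("lower bound = %.17g\n", lo);
        printf("upper bound = %.17g\n", hi);
        return 0;
    }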

Has anyone else heard of a good solution to this problem, or can you
reference some published work on this?

The idea being that if you wanted to know how big your round-off error
was, it would be a trivial matter of subtracting the lower bound from
the upper.
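
Continuing the sketch above, that's just

    double err_bound = hi - lo;   /* worst-case accumulated round-off */

and since the exact sum always lies inside [lo, hi], the enclosure still
holds after a parallel reduction as long as each partial sum is carried
as a (lower, upper) pair.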