> I leave it as a challenge to the reader to come up with a compromise; > figure out minor ALU changes and add a particular form predication, > and you can squeeze the divide into a fixed sequence of 32 > instructions (or fewer if the divisor is a constant). Figure out how > that predication could easily save us cycles under other circumstances > (remember that we have no need for branch prediction, so branches are > cheap), and you're the bee's knees.
A lot of this depends whether you're using simple binary division or an iterative method, and how much you care about early termination for "simple" divisors. By my reading[1] even fairly simple iterative implementation should give a single precision result in about 4 iterations (2 modified multiply-add per iteration, probably 9 or 10 instructions total) to get a single precision precision result. At this point it's not worth doing early termination. One nice property of the iterative methods is that extra iterations are harmless, so you can predict the worst case and just unroll the whole thing. A more accurate initial estimate (e.g. via a lookup table) can probably save an iteration (2 instructions). Division by a constant is fairly straightforward to implement in the compiler as multiplication by the inverse. I believe a fused multiply-add helps for floating point, and gives a bit-accurate result in no more than 3 instructions. FWIW I'd expect integer division to much rarer than floating point division. Paul [1] http://en.wikipedia.org/wiki/Division_(digital)#Newton.E2.80.93Raphson_division I also have experience with real hardware that implements this (IA64 and ARM). _______________________________________________ Open-graphics mailing list [email protected] http://lists.duskglow.com/mailman/listinfo/open-graphics List service provided by Duskglow Consulting, LLC (www.duskglow.com)
