> I leave it as a challenge to the reader to come up with a compromise;
> figure out minor ALU changes and add a particular form predication,
> and you can squeeze the divide into a fixed sequence of 32
> instructions (or fewer if the divisor is a constant).  Figure out how
> that predication could easily save us cycles under other circumstances
> (remember that we have no need for branch prediction, so branches are
> cheap), and you're the bee's knees.

A lot of this depends whether you're using simple binary division or an 
iterative method, and how much you care about early termination for "simple" 
divisors. 

By my reading[1] even fairly simple iterative implementation should give a 
single precision result in about 4 iterations (2 modified multiply-add per 
iteration, probably 9 or 10 instructions total) to get a single precision 
precision result.  At this point it's not worth doing early termination. One 
nice property of the iterative methods is that extra iterations are harmless, 
so you can predict the worst case and just unroll the whole thing.  A more 
accurate initial estimate (e.g. via a lookup table) can probably save an 
iteration (2 instructions).

Division by a constant is fairly straightforward to implement in the compiler 
as multiplication by the inverse. I believe a fused multiply-add helps for 
floating point, and gives a bit-accurate result in no more than 3 
instructions.

FWIW I'd expect integer division to much rarer than floating point division.

Paul

[1] 
http://en.wikipedia.org/wiki/Division_(digital)#Newton.E2.80.93Raphson_division
I also have experience with real hardware that implements this (IA64 and ARM).
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)

Reply via email to