On Wed, Sep 23, 2009 at 1:48 PM, Paul Brook <[email protected]> wrote:
>> > You should only implement 1-cycle operations. If you really need div,
>> > pipeline (1/x) with MUL, with enough guard bits to have the required
>> > precision. There are lots of 1-cycle operations for complex functions
>> > (1/x, 1/sqrt(x)).
>>
>> Any division operation, even if it's supposedly 1 cycle, in reality means
>> that 1 operation is issued per cycle, but the latency will be between
>> 25 and 64 cycles. It depends on the operation requested and the data type.
>> Doing so will require ~64 subtractors if we support fractional-result
>> divide for integers, 32 subtractors if we support divide and modulo only.
>
> For floating point at least it's fairly common to have a low-precision
> reciprocal estimate (LUT + a bit of exponent twiddling), then do explicit N-R
> iterations [x = 1/d => x = x(2 - dx)] to get the desired precision. Feeding the
> N-R iterations through the regular ALU (possibly with ISA cooperation) may
> give better overall throughput (through increased ALU space) than a dedicated
> divider.
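
For reference, here is a minimal software sketch of the scheme quoted above: a small reciprocal lookup table indexed by the top mantissa bits, an exponent negation, then Newton-Raphson steps x = x(2 - dx). The table size (LUT_BITS = 4), the iteration count, and the names recip_estimate, newton_reciprocal and divide are assumptions made for illustration, not anything proposed in the thread. It models the arithmetic only, not the hardware datapath.

```python
import math

LUT_BITS = 4  # assumed table size: 16 entries (illustrative only)

# Reciprocal of the midpoint of each mantissa bucket in [0.5, 1).
RECIP_LUT = [1.0 / (0.5 + (i + 0.5) / (2 << LUT_BITS))
             for i in range(1 << LUT_BITS)]


def recip_estimate(d):
    """Low-precision 1/d: table lookup on the mantissa, negate the exponent."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(d))            # abs(d) = m * 2**e, with m in [0.5, 1)
    i = int((m - 0.5) * (2 << LUT_BITS))  # top mantissa bits select the bucket
    est = math.ldexp(RECIP_LUT[i], -e)    # (1/m) * 2**-e
    return math.copysign(est, d)


def newton_reciprocal(d, iterations=3):
    """Refine the estimate with x = x * (2 - d * x).

    Each step uses only multiplies and a subtract, so it could be issued
    to an ordinary ALU. The relative error is roughly squared per step.
    """
    x = recip_estimate(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x


def divide(a, b):
    """a / b computed as a * (1/b)."""
    return a * newton_reciprocal(b)
```

With a 16-entry table the initial estimate is good to about 5 bits, so each iteration roughly doubles the number of correct bits; a larger table trades area for fewer iterations.
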
Ah. It sounds like you're suggesting that we unroll the divider in decode and
send the subops down to the ALU. I like this.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
