Given the discussion so far, it seems we should stick with Gabe's implementation, but for x86 we should change the integer division latency to a single cycle. The default latency of 20 cycles is not right for x86, since the microcode loop itself already provides the multi-cycle behavior.

--
Nilay


On Mon, 20 Apr 2015, Steve Reinhardt wrote:

Thanks for speaking up, Gabe... I agree on both counts. I should have said
"probably not realistic any more". Also, a single-cycle divide is arguably
at least as unrealistic in the other direction.

Looking at table 17 in section B.6 on p. 349 of the AMD SW optimization
guide (http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf), integer
divide latencies are data-dependent, and a 64-bit divide can take anywhere
from 9 to 72 cycles.  If I'm understanding Gabe's old algorithm correctly,
it looks like it takes a fixed number of cycles, though assuming the branch
overhead can be overlapped, that number is probably pretty close to the
upper bound of the actual value, at least for recent AMD processors.  (I
haven't looked for equivalent official Intel docs, though if
https://gmplib.org/~tege/x86-timing.pdf is correct, the latency can be up
to 95 cycles on Haswell.)

Is that right, Gabe?  Or is there a data dependency in that microcode loop
that's not obvious?

The most flexible thing to do from a timing perspective would be to code
the division in C and then program the latency separately. However, since
the computation really is microcoded (see p. 248), that would not give
realistic results if you care about the modeling of microcode fetch etc.
(which would impact power models if nothing else).
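To make that concrete, here is a rough sketch (illustrative only, not gem5 code; the names below are made up) of what "division in C with the latency programmed separately" could look like: the functional result is computed directly, and the timing side charges a tunable cycle count on its own.

    #include <stdint.h>

    /* Hypothetical, tunable timing parameter, charged by the timing model
     * independently of how the result below is computed. */
    #define INT_DIV_LATENCY_CYCLES 40

    /* Functional 64-bit divide: the result is computed directly in C, with
     * no per-bit microop loop. A real implementation would raise a divide
     * exception (#DE) for divisor == 0; that handling is omitted here. */
    static void do_div64(uint64_t dividend, uint64_t divisor,
                         uint64_t *quot, uint64_t *rem)
    {
        *quot = dividend / divisor;
        *rem  = dividend % divisor;
    }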

Steve


On Mon, Apr 20, 2015 at 2:56 AM, Gabe Black <[email protected]> wrote:

The original was implemented based on the K6 microops. It might not be
realistic any more (although I don't think single cycle division is
either?), but it wasn't entirely made up.

Gabe

On Sun, Apr 19, 2015 at 12:33 PM, Steve Reinhardt <[email protected]>
wrote:

On Sun, Apr 19, 2015 at 9:25 AM, Nilay Vaish <[email protected]> wrote:

On Sun, 19 Apr 2015, Steve Reinhardt wrote:


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2743/#review6052
-----------------------------------------------------------


I like the restructuring... I agree the micro-op loop is probably not
realistic.  Is there a reason to code a loop in C though, as opposed to
just using '/' and '%'?



The dividend is represented as rdx:rax, which means up to 128 bits of data.
So we would not be able to carry out the division just using '/' and '%'
with only 64-bit integers.  GCC and LLVM both support 128-bit integers on
x86-64 platforms.  We may want to use those, but I don't know if that would
cause any compatibility problems.
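For example (just a sketch, assuming we are willing to rely on the compiler extension), the 128-by-64 unsigned divide could be written with unsigned __int128 roughly like this:

    #include <stdint.h>

    /* Illustrative only: divide the 128-bit dividend hi:lo (i.e. rdx:rax)
     * by a 64-bit divisor using GCC/Clang's unsigned __int128. A real
     * implementation would raise #DE when divisor == 0 or when the
     * quotient doesn't fit in 64 bits; that handling is omitted here. */
    static void udiv128by64(uint64_t hi, uint64_t lo, uint64_t divisor,
                            uint64_t *quot, uint64_t *rem)
    {
        unsigned __int128 dividend = ((unsigned __int128)hi << 64) | lo;

        *quot = (uint64_t)(dividend / divisor);
        *rem  = (uint64_t)(dividend % divisor);
    }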

--
Nilay



Ah, thanks... I didn't look closely enough to see that it was a 128-bit
operation.  I'd be fine with using GCC/LLVM 128-bit support if others are.
If not, there are ways to build a 128-bit operation out of 64-bit
operations that would still be simpler than the bitwise loop.  For example,
I found this:



http://codereview.stackexchange.com/questions/67962/mostly-portable-128-by-64-bit-division

and if I read the StackExchange terms correctly, we could just use that
code with an appropriate attribution and a link in a comment back to the
question (look under Subscriber Content):
http://stackexchange.com/legal/terms-of-service
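For reference, the general shift-subtract idea built only from 64-bit operations looks roughly like the sketch below (this is not the code from the link above, just a minimal illustration; it assumes divisor != 0 and hi < divisor, i.e. the non-faulting case where the quotient fits in 64 bits):

    #include <stdint.h>

    /* Restoring (shift-subtract) division of hi:lo by a 64-bit divisor,
     * using only 64-bit arithmetic. Assumes divisor != 0 and hi < divisor. */
    static uint64_t udiv128by64_loop(uint64_t hi, uint64_t lo,
                                     uint64_t divisor, uint64_t *rem)
    {
        uint64_t quot = 0;
        uint64_t r = hi;              /* running remainder, always < divisor */

        for (int i = 63; i >= 0; i--) {
            uint64_t carry = r >> 63; /* bit shifted out of the remainder */
            r = (r << 1) | ((lo >> i) & 1);

            /* If the 65-bit value carry:r is >= divisor, subtract it and
             * set the quotient bit; the 64-bit subtraction wraps to the
             * correct result when carry is set. */
            if (carry || r >= divisor) {
                r -= divisor;
                quot |= 1ULL << i;
            }
        }

        *rem = r;
        return quot;
    }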

Steve
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev