I see.  The confusion all makes sense now.

Do the x86 divide micro-ops currently use the divide unit latencies?  If
not, what latencies do they use?

My gut reaction is that we should have a "divide step" functional unit that
the x86 micro-ops should use, independent of the full divider that the
other ISAs use. That way we eliminate (or at least reduce) the confusion
but can keep the more realistic x86 implementation.  It's not clear how
different that is from the status quo, though... certainly you'll still
have the confusion that changing the "divide" unit parameters won't impact
x86 performance.
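
To make that last point concrete, here is a minimal C++ sketch under stated assumptions. The op-class names, the one-cycle step latency, and the step count passed in by the caller are all hypothetical and are not gem5's actual classes or parameters. Only the 20-cycle default divide latency comes from this thread.

```cpp
#include <map>

// Hypothetical op classes. IntDivStep is the proposed "divide step" unit,
// not an existing class.
enum class OpClass { IntAlu, IntDiv, IntDivStep };

// Per-op-class latencies in cycles. 20 is the default divide latency quoted
// in this thread; the 1-cycle step latency is an assumption.
struct FuLatencies {
    std::map<OpClass, unsigned> lat{
        {OpClass::IntAlu, 1},
        {OpClass::IntDiv, 20},
        {OpClass::IntDivStep, 1},
    };
};

// Non-x86 ISAs: one divide op issued to the full divider.
inline unsigned fullDivideCycles(const FuLatencies &fu)
{
    return fu.lat.at(OpClass::IntDiv);
}

// x86: a microcoded loop of `steps` divide-step micro-ops, assumed to be
// fully serialized. The full divider's latency never enters the sum.
inline unsigned microcodedDivideCycles(const FuLatencies &fu, unsigned steps)
{
    return steps * fu.lat.at(OpClass::IntDivStep);
}
```

In this model, doubling the IntDiv latency doubles the result of fullDivideCycles() but leaves microcodedDivideCycles() unchanged, which is exactly the remaining source of confusion.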

Steve

On Mon, Apr 20, 2015 at 7:39 AM, Nilay Vaish <[email protected]> wrote:

> Given the discussion we had so far, it seems that we should stick with
> Gabe's implementation, but for x86 we should change the integer division
> latency to a single cycle.  The default latency is 20 cycles, which is not
> right for x86.
>
> --
> Nilay
>
>
>
> On Mon, 20 Apr 2015, Steve Reinhardt wrote:
>
>> Thanks for speaking up, Gabe... I agree on both counts. I should have said
>> "probably not realistic any more". Also, a single-cycle divide is arguably
>> at least as unrealistic in the other direction.
>>
>> Looking at table 17 in section B.6 on p. 349 of the AMD SW optimization
>> guide (http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf),
>> integer divide latencies are data-dependent, and a 64-bit divide can take
>> anywhere from 9 to 72 cycles.  If I'm understanding Gabe's old algorithm
>> correctly, it looks like it takes a fixed number of cycles, though assuming
>> the branch overhead can be overlapped, that number is probably pretty close
>> to the upper bound of the actual value, at least for recent AMD
>> processors.  (I haven't looked for equivalent official Intel docs, though
>> if https://gmplib.org/~tege/x86-timing.pdf is correct, the latency can be
>> up to 95 cycles on Haswell.)
>>
>> Is that right, Gabe?  Or is there a data dependency in that microcode loop
>> that's not obvious?
>>
>> The most flexible thing to do from a timing perspective would be to code
>> the division in C and then program the latency separately. However, since
>> the computation really is microcoded (see p. 248), that would not give
>> realistic results if you care about the modeling of microcode fetch etc.
>> (which would impact power models if nothing else).
>>
>> Steve
>>
>>
>> On Mon, Apr 20, 2015 at 2:56 AM, Gabe Black <[email protected]> wrote:
>>
>>> The original was implemented based on the K6 microops. It might not be
>>> realistic any more (although I don't think single cycle division is
>>> either?), but it wasn't entirely made up.
>>>
>>> Gabe
>>>
>>> On Sun, Apr 19, 2015 at 12:33 PM, Steve Reinhardt <[email protected]>
>>> wrote:
>>>
>>>> On Sun, Apr 19, 2015 at 9:25 AM, Nilay Vaish <[email protected]> wrote:
>>>>
>>>>> On Sun, 19 Apr 2015, Steve Reinhardt wrote:
>>>>>
>>>>>> -----------------------------------------------------------
>>>>>> This is an automatically generated e-mail. To reply, visit:
>>>>>> http://reviews.gem5.org/r/2743/#review6052
>>>>>> -----------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> I like the restructuring... I agree the micro-op loop is probably not
>>>>>> realistic.  Is there a reason to code a loop in C though, as opposed
>>>>>> to just using '/' and '%'?
>>>>>>
>>>>> The dividend is represented as rdx:rax, which means up to 128 bits of
>>>>> data.  So we would not be able to carry out the division by just using
>>>>> '/' and '%' when only using 64-bit integers.  GCC and LLVM both support
>>>>> 128-bit integers on x86-64 platforms.  We may want to use those, but I
>>>>> don't know if that would cause any compatibility problems.
>>>>>
>>>>> --
>>>>> Nilay
>>>>>
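
For illustration, the compiler-extension route mentioned above might look roughly like the following C++ sketch. It assumes a compiler that provides `unsigned __int128` (GCC and Clang do on x86-64). The `DivResult` type and the `div128by64` name are hypothetical and are not taken from the patch under review.

```cpp
#include <cstdint>

// Result of an unsigned 128-by-64-bit divide, mirroring x86 DIV r/m64:
// the quotient would land in rax and the remainder in rdx.
struct DivResult {
    bool ok;             // false where the hardware would raise #DE
    uint64_t quotient;
    uint64_t remainder;
};

// Divide the 128-bit value rdx:rax by a 64-bit divisor.
// Relies on the unsigned __int128 compiler extension.
inline DivResult div128by64(uint64_t rdx, uint64_t rax, uint64_t divisor)
{
    // The quotient fits in 64 bits only if the high half is smaller than
    // the divisor; otherwise (or on divide by zero) x86 raises #DE.
    if (divisor == 0 || rdx >= divisor)
        return {false, 0, 0};

    const unsigned __int128 dividend =
        (static_cast<unsigned __int128>(rdx) << 64) | rax;

    return {true,
            static_cast<uint64_t>(dividend / divisor),
            static_cast<uint64_t>(dividend % divisor)};
}
```

The explicit overflow check keeps the narrowing casts safe, since a quotient wider than 64 bits is rejected before the division is performed.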
>>>>
>>>>
>>>>
>>>> Ah, thanks... I didn't look closely enough to see that it was a 128-bit
>>>> operation.  I'd be fine with using gcc/llvm 128-bit support if others
>>>> are.  If not, there are ways to build a 128-bit operation out of the
>>>> 64-bit operations that would still be simpler than the bitwise loop.
>>>> For example, I found this:
>>>>
>>>> http://codereview.stackexchange.com/questions/67962/mostly-portable-128-by-64-bit-division
>>>>
>>>> and if I read the StackExchange terms correctly, we could just use that
>>>> code with an appropriate attribution and a link in a comment back to the
>>>> question (look under Subscriber Content):
>>>> http://stackexchange.com/legal/terms-of-service
>>>>
>>>> Steve
>>>> _______________________________________________
>>>> gem5-dev mailing list
>>>> [email protected]
>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
