Yes, I believe that the divide micro-ops currently use the divide unit
latency, which I think is the cause of the large discrepancy between the
x86 and ARM performance.

Jason

On Mon, Apr 20, 2015 at 10:16 AM Steve Reinhardt <[email protected]> wrote:

> I see.  The confusion all makes sense now.
>
> Do the x86 divide micro-ops currently use the divide unit latencies?  If
> not, what latencies do they use?
>
> My gut reaction is that we should have a "divide step" functional unit that
> the x86 micro-ops should use, independent of the full divider that the
> other ISAs use. That way we eliminate (or at least reduce) the confusion
> but can keep the more realistic x86 implementation.  It's not clear how
> different that is from the status quo, though... certainly you'll still
> have the confusion that changing the "divide" unit parameters won't impact
> x86 performance.
>
> Steve
>
> On Mon, Apr 20, 2015 at 7:39 AM, Nilay Vaish <[email protected]> wrote:
>
> > Given the discussion we had so far, it seems that we should stick with
> > Gabe's implementation, but for x86 we should change the integer division
> > latency to a single cycle.  The default latency is 20 cycles, which is
> > not right for x86.
> >
> > --
> > Nilay
> >
> >
> >
> > On Mon, 20 Apr 2015, Steve Reinhardt wrote:
> >
> >> Thanks for speaking up, Gabe... I agree on both counts. I should have
> >> said "probably not realistic any more". Also, a single-cycle divide is
> >> arguably at least as unrealistic in the other direction.
> >>
> >> Looking at table 17 in section B.6 on p. 349 of the AMD SW optimization
> >> guide (http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf),
> >> integer divide latencies are data-dependent, and a 64-bit divide can
> >> take anywhere from 9 to 72 cycles.  If I'm understanding Gabe's old
> >> algorithm correctly, it looks like it takes a fixed number of cycles,
> >> though assuming the branch overhead can be overlapped, that number is
> >> probably pretty close to the upper bound of the actual value, at least
> >> for recent AMD processors.  (I haven't looked for equivalent official
> >> Intel docs, though if https://gmplib.org/~tege/x86-timing.pdf is
> >> correct, the latency can be up to 95 cycles on Haswell.)
> >>
> >> Is that right, Gabe?  Or is there a data dependency in that microcode
> >> loop that's not obvious?
> >>
> >> The most flexible thing to do from a timing perspective would be to
> >> code the division in C and then program the latency separately.
> >> However, since the computation really is microcoded (see p. 248), that
> >> would not give realistic results if you care about the modeling of
> >> microcode fetch etc. (which would impact power models if nothing else).
> >>
> >> Steve
> >>
> >>
> >> On Mon, Apr 20, 2015 at 2:56 AM, Gabe Black <[email protected]> wrote:
> >>
> >>> The original was implemented based on the K6 microops. It might not be
> >>> realistic any more (although I don't think single-cycle division is
> >>> either?), but it wasn't entirely made up.
> >>>
> >>> Gabe
> >>>
> >>> On Sun, Apr 19, 2015 at 12:33 PM, Steve Reinhardt <[email protected]>
> >>> wrote:
> >>>
> >>>> On Sun, Apr 19, 2015 at 9:25 AM, Nilay Vaish <[email protected]> wrote:
> >>>>
> >>>>> On Sun, 19 Apr 2015, Steve Reinhardt wrote:
> >>>>>
> >>>>>
> >>>>>> -----------------------------------------------------------
> >>>>>> This is an automatically generated e-mail. To reply, visit:
> >>>>>> http://reviews.gem5.org/r/2743/#review6052
> >>>>>> -----------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>> I like the restructuring... I agree the micro-op loop is probably
> >>>>>> not realistic.  Is there a reason to code a loop in C though, as
> >>>>>> opposed to just using '/' and '%'?
> >>>>>>
> >>>>> The dividend is represented as rdx:rax, which means up to 128 bits
> >>>>> of data.  So we would not be able to carry out division by just
> >>>>> using '/' and '%' when only using 64-bit integers.  GCC and LLVM
> >>>>> both support 128-bit integers on x86-64 platforms.  We may want to
> >>>>> use those, but I don't know if that would cause any compatibility
> >>>>> problems.
> >>>>>
> >>>>> --
> >>>>> Nilay
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> Ah, thanks... I didn't look closely enough to see that it was a
> >>>> 128-bit operation.  I'd be fine with using gcc/llvm 128-bit support
> >>>> if others are.
> >>>
> >>>> If not, there are ways to build a 128-bit operation out of the 64-bit
> >>>> operations that would still be simpler than the bitwise loop.  For
> >>>> example, I found this:
> >>>>
> >>>> http://codereview.stackexchange.com/questions/67962/mostly-portable-128-by-64-bit-division
> >>>>
> >>>>
> >>>> and if I read the StackExchange terms correctly, we could just use
> >>>> that code with an appropriate attribution and a link in a comment
> >>>> back to the question (look under Subscriber Content):
> >>>> http://stackexchange.com/legal/terms-of-service
> >>>>
> >>>> Steve
> >>>> _______________________________________________
> >>>> gem5-dev mailing list
> >>>> [email protected]
> >>>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>>>
