Hi all,

I think Nilay posted this patch after I found some strange performance
numbers while looking at divide. I wrote a homework assignment for one of
our graduate architecture courses where the idea was to look at the effect
of pipelining on the performance of an out-of-order processor. As an
example of a high-latency instruction, I chose divide (even though it is
totally unrealistic to pipeline a divide unit). I had the students modify
the "opLat" and the "issueLat" of the division unit. Unfortunately,
modifying the "issueLat" of the division unit did not change the
performance at all.

Below is some data that I got running a *very* simple loop with a divide
that the O3 CPU should be able to fully unroll. All results are relative
to the x86 not-pipelined configuration (Config 1). The homework can be
found here:
http://pages.cs.wisc.edu/~david/courses/cs752/Spring2015/wiki/index.php?n=Main.Homework4

                              opLat | issueLat | x86 perf | ARM perf (relative to x86)
Config 1 (not pipelined)  :    10   |    10    |   1.0x   |   8.0x
Config 2 (fully pipelined):    10   |    1     |   1.0x   |   9.6x (1.2x over ARM)

IMO there are two problems here.
First, because of the microcode implementation, setting the "issueLat" and
"opLat" parameters does not behave the way I expect.
Second, because of differences between the ARM and x86 implementations of
the instruction, gem5 (incorrectly) shows huge performance differences
just from changing the ISA.
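For reference, this is roughly where those parameters live in a gem5
Python config. The class names and fields below follow the O3 functional
unit configuration as I understand it from that era of the tree; treat
the exact names as an assumption and check your own config files:

```python
# Sketch of a gem5 functional-unit description (assumed to match the
# FUDesc/OpDesc classes in src/cpu/FuncUnit.py at the time of this
# thread; verify against your checkout).
from m5.objects import FUDesc, OpDesc

class IntDiv(FUDesc):
    # opLat: cycles from issue until the result is ready.
    # issueLat: cycles before the unit can accept another op;
    # issueLat == 1 would mean a fully pipelined unit.
    opList = [OpDesc(opClass='IntDiv', opLat=10, issueLat=10)]
    count = 1
```

The problem described above is that, because x86 expresses divide as a
microcode loop, these per-op latencies apply to each divide *step* micro-op
rather than to the whole macro-op divide.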

I'm not sure what the "right" way to model divide is. As a user, what I
really would like to see is either for the "divide unit" to be renamed the
"divide step unit" (or something more descriptive), or for the parameters
of the divide unit to actually reflect the performance of the unit. Nilay's
patch does the latter, as far as I can tell.

Additionally, I think that the ARM and x86 implementations of an
instruction should be somewhat comparable. At the least, there shouldn't be
an 8x performance difference for a single instruction.

Overall, this seems like more of a configuration problem than anything
else. We need to decide if we want to model the micro-ops or the macro-op,
and whichever we choose, we should be sure that it is reflected clearly in
the parameters and in the documentation.

Thanks,
Jason


On Mon, Apr 20, 2015 at 9:08 AM Steve Reinhardt <[email protected]> wrote:

> Thanks for speaking up Gabe... I agree on both counts. I should have said
> "probably not realistic any more". Also, a single-cycle divide is arguably
> at least as unrealistic in the other direction.
>
> Looking at table 17 in section B.6 on p. 349 of the AMD SW optimization
> guide (http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf),
> integer
> divide latencies are data-dependent, and a 64-bit divide can take anywhere
> from 9 to 72 cycles.  If I'm understanding Gabe's old algorithm correctly,
> it looks like it takes a fixed number of cycles, though assuming the branch
> overhead can be overlapped, that number is probably pretty close to the
> upper bound of the actual value, at least for recent AMD processors.  (I
> haven't looked for equivalent official Intel docs, though if
> https://gmplib.org/~tege/x86-timing.pdf is correct, the latency can be up
> to 95 cycles on Haswell.)
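One purely illustrative way to express such a data-dependent latency,
assuming it scales with the number of significant quotient bits (the
actual hardware formula is not documented, so both the function and the
scaling here are guesses for sketching):

```c
#include <stdint.h>

/* Purely illustrative latency model: assume latency grows with the
 * number of significant bits in the quotient, clamped to the 9..72
 * cycle range from the AMD 15h optimization guide. The real hardware
 * behavior is not public; this is only a sketch. */
static int div_latency(uint64_t dividend, uint64_t divisor)
{
    uint64_t q = dividend / divisor;
    int bits = 0;
    while (q) {          /* count significant quotient bits */
        bits++;
        q >>= 1;
    }
    int lat = 9 + bits;  /* hypothetical scaling */
    return lat > 72 ? 72 : lat;
}
```

Something of this shape could be plugged into a C implementation of the
divide while keeping the latency programmable, as suggested below.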
>
> Is that right, Gabe?  Or is there a data dependency in that microcode loop
> that's not obvious?
>
> The most flexible thing to do from a timing perspective would be to code
> the division in C and then program the latency separately. However, since
> the computation really is microcoded (see p. 248), that would not give
> realistic results if you care about the modeling of microcode fetch etc.
> (which would impact power models if nothing else).
>
> Steve
>
>
> On Mon, Apr 20, 2015 at 2:56 AM, Gabe Black <[email protected]> wrote:
>
> > The original was implemented based on the K6 microops. It might not be
> > realistic any more (although I don't think single cycle division is
> > either?), but it wasn't entirely made up.
> >
> > Gabe
> >
> > On Sun, Apr 19, 2015 at 12:33 PM, Steve Reinhardt <[email protected]>
> > wrote:
> >
> > > On Sun, Apr 19, 2015 at 9:25 AM, Nilay Vaish <[email protected]> wrote:
> > >
> > > > On Sun, 19 Apr 2015, Steve Reinhardt wrote:
> > > >
> > > >
> > > >> -----------------------------------------------------------
> > > >> This is an automatically generated e-mail. To reply, visit:
> > > >> http://reviews.gem5.org/r/2743/#review6052
> > > >> -----------------------------------------------------------
> > > >>
> > > >>
> > > >> I like the restructuring... I agree the micro-op loop is probably
> > > >> not realistic.  Is there a reason to code a loop in C though, as
> > > >> opposed to just using '/' and '%'?
> > > >>
> > > >>
> > > >
> > > > The dividend is represented as rdx:rax, which means up to 128 bits
> > > > of data. So we would not be able to carry out the division just
> > > > using '/' and '%' with only 64-bit integers.  GCC and LLVM both
> > > > support 128-bit integers on x86-64 platforms.  We may want to use
> > > > those, but I don't know if that would cause any compatibility
> > > > problems.
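A sketch of that approach, using the GCC/Clang `unsigned __int128`
extension (the function name is hypothetical, and the x86 #DE cases,
divide-by-zero and quotient overflow, are deliberately left unhandled):

```c
#include <stdint.h>

/* Sketch: divide the 128-bit dividend rdx:rax by a 64-bit divisor
 * using the GCC/Clang unsigned __int128 extension. Real x86 DIV
 * raises #DE on divide-by-zero or when the quotient does not fit in
 * 64 bits; those cases are not handled here. */
static void div128(uint64_t rdx, uint64_t rax, uint64_t divisor,
                   uint64_t *quot, uint64_t *rem)
{
    unsigned __int128 dividend =
        ((unsigned __int128)rdx << 64) | rax;
    *quot = (uint64_t)(dividend / divisor);
    *rem  = (uint64_t)(dividend % divisor);
}
```

This keeps the arithmetic in one expression, at the cost of requiring a
compiler with 128-bit integer support (the compatibility concern raised
above).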
> > > >
> > > > --
> > > > Nilay
> > >
> > >
> > >
> > > Ah, thanks... I didn't look closely enough to see that it was a
> > > 128-bit operation.  I'd be fine with using gcc/llvm 128-bit support
> > > if others are.  If not, there are ways to build a 128-bit operation
> > > out of the 64-bit operations that would still be simpler than the
> > > bitwise loop.  For example, I found this:
> > >
> > > http://codereview.stackexchange.com/questions/67962/mostly-portable-128-by-64-bit-division
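A 128-by-64 divide built from 64-bit operations boils down to a
shift-subtract (restoring) division loop. A self-contained sketch (the
names are mine, and it assumes hi < d so the quotient fits in 64 bits,
the same precondition x86 DIV imposes to avoid a #DE fault):

```c
#include <stdint.h>

/* Sketch of a 128-by-64-bit unsigned divide built from 64-bit ops
 * via restoring (shift-subtract) division, one quotient bit per
 * iteration. Precondition: hi < d, so the quotient fits in 64 bits
 * (the condition x86 DIV requires to avoid faulting). */
static void udiv128by64(uint64_t hi, uint64_t lo, uint64_t d,
                        uint64_t *q, uint64_t *r)
{
    uint64_t rem = hi;   /* valid initial remainder since hi < d */
    uint64_t quot = 0;
    for (int i = 63; i >= 0; i--) {
        /* 'carry' is the 65th remainder bit lost by the 64-bit shift */
        int carry = (int)(rem >> 63);
        rem = (rem << 1) | ((lo >> i) & 1);
        quot <<= 1;
        if (carry || rem >= d) {   /* 65-bit compare via the carry */
            rem -= d;              /* wraps to the right value when
                                    * carry is set */
            quot |= 1;
        }
    }
    *q = quot;
    *r = rem;
}
```

This is essentially the same bit-serial algorithm as the x86 microcode
loop discussed earlier, just expressed in portable C.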
> > >
> > > and if I read the StackExchange terms correctly, we could just use that
> > > code with an appropriate attribution and a link in a comment back to
> the
> > > question (look under Subscriber Content):
> > > http://stackexchange.com/legal/terms-of-service
> > >
> > > Steve
> > > _______________________________________________
> > > gem5-dev mailing list
> > > [email protected]
> > > http://m5sim.org/mailman/listinfo/gem5-dev
> > >