Yes, I believe that the divide micro-ops currently use the divide unit
latency, which I think is the cause of the large discrepancy between the
x86 and ARM performance.
Jason

On Mon, Apr 20, 2015 at 10:16 AM Steve Reinhardt <[email protected]> wrote:

> I see. The confusion all makes sense now.
>
> Do the x86 divide micro-ops currently use the divide unit latencies? If
> not, what latencies do they use?
>
> My gut reaction is that we should have a "divide step" functional unit
> that the x86 micro-ops should use, independent of the full divider that
> the other ISAs use. That way we eliminate (or at least reduce) the
> confusion but can keep the more realistic x86 implementation. It's not
> clear how different that is from the status quo, though... certainly
> you'll still have the confusion that changing the "divide" unit
> parameters won't impact x86 performance.
>
> Steve
>
> On Mon, Apr 20, 2015 at 7:39 AM, Nilay Vaish <[email protected]> wrote:
>
>> Given the discussion we have had so far, it seems that we should stick
>> with Gabe's implementation, but for x86 we should change the integer
>> division latency to a single cycle. The default latency is 20 cycles,
>> which is not right for x86.
>>
>> --
>> Nilay
>>
>> On Mon, 20 Apr 2015, Steve Reinhardt wrote:
>>
>>> Thanks for speaking up, Gabe... I agree on both counts. I should have
>>> said "probably not realistic any more". Also, a single-cycle divide
>>> is arguably at least as unrealistic in the other direction.
>>>
>>> Looking at table 17 in section B.6 on p. 349 of the AMD SW
>>> optimization guide
>>> (http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf), integer
>>> divide latencies are data-dependent, and a 64-bit divide can take
>>> anywhere from 9 to 72 cycles. If I'm understanding Gabe's old
>>> algorithm correctly, it looks like it takes a fixed number of cycles,
>>> though assuming the branch overhead can be overlapped, that number is
>>> probably pretty close to the upper bound of the actual value, at
>>> least for recent AMD processors. (I haven't looked for equivalent
>>> official Intel docs, though if
>>> https://gmplib.org/~tege/x86-timing.pdf is correct, the latency can
>>> be up to 95 cycles on Haswell.)
>>>
>>> Is that right, Gabe? Or is there a data dependency in that microcode
>>> loop that's not obvious?
>>>
>>> The most flexible thing to do from a timing perspective would be to
>>> code the division in C and then program the latency separately.
>>> However, since the computation really is microcoded (see p. 248),
>>> that would not give realistic results if you care about the modeling
>>> of microcode fetch etc. (which would impact power models if nothing
>>> else).
>>>
>>> Steve
>>>
>>> On Mon, Apr 20, 2015 at 2:56 AM, Gabe Black <[email protected]> wrote:
>>>
>>>> The original was implemented based on the K6 microops. It might not
>>>> be realistic any more (although I don't think single-cycle division
>>>> is either?), but it wasn't entirely made up.
>>>>
>>>> Gabe
>>>>
>>>> On Sun, Apr 19, 2015 at 12:33 PM, Steve Reinhardt
>>>> <[email protected]> wrote:
>>>>
>>>>> On Sun, Apr 19, 2015 at 9:25 AM, Nilay Vaish <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On Sun, 19 Apr 2015, Steve Reinhardt wrote:
>>>>>>
>>>>>>> -----------------------------------------------------------
>>>>>>> This is an automatically generated e-mail. To reply, visit:
>>>>>>> http://reviews.gem5.org/r/2743/#review6052
>>>>>>> -----------------------------------------------------------
>>>>>>>
>>>>>>> I like the restructuring... I agree the micro-op loop is probably
>>>>>>> not realistic. Is there a reason to code a loop in C, though, as
>>>>>>> opposed to just using '/' and '%'?
>>>>>>
>>>>>> The dividend is represented as rdx:rax, which means up to 128 bits
>>>>>> of data. So we would not be able to carry out the division by just
>>>>>> using '/' and '%' when only using 64-bit integers. GCC and LLVM
>>>>>> both support 128-bit integers on x86-64 platforms. We may want to
>>>>>> use those, but I don't know if that would cause any compatibility
>>>>>> problems.
>>>>>>
>>>>>> --
>>>>>> Nilay
>>>>>
>>>>> Ah, thanks... I didn't look closely enough to see that it was a
>>>>> 128-bit operation. I'd be fine with using gcc/llvm 128-bit support
>>>>> if others are.
>>>>>
>>>>> If not, there are ways to build a 128-bit operation out of the
>>>>> 64-bit operations that would still be simpler than the bitwise
>>>>> loop. For example, I found this:
>>>>>
>>>>> http://codereview.stackexchange.com/questions/67962/mostly-portable-128-by-64-bit-division
>>>>>
>>>>> and if I read the StackExchange terms correctly, we could just use
>>>>> that code with an appropriate attribution and a link in a comment
>>>>> back to the question (look under "Subscriber Content"):
>>>>> http://stackexchange.com/legal/terms-of-service
>>>>>
>>>>> Steve

_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
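[Editor's note: the 128-by-64-bit division discussed above, using the
GCC/LLVM `unsigned __int128` extension that Nilay mentions, could be
sketched as below. This is an illustrative sketch only, not actual gem5
code; the names `DivResult` and `div128by64` are hypothetical.]

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: divide the 128-bit dividend rdx:rax by a 64-bit
// divisor using the compiler's __int128 support (x86-64 GCC/LLVM only).
// On real hardware, DIV puts the quotient in rax and the remainder in
// rdx, and raises #DE if the quotient overflows 64 bits; a full
// implementation would check rdx < divisor before dividing.
struct DivResult { uint64_t quotient; uint64_t remainder; };

DivResult
div128by64(uint64_t rdx, uint64_t rax, uint64_t divisor)
{
    assert(divisor != 0);
    unsigned __int128 dividend =
        ((unsigned __int128)rdx << 64) | rax;
    DivResult r;
    r.quotient = (uint64_t)(dividend / divisor);
    r.remainder = (uint64_t)(dividend % divisor);
    return r;
}
```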
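[Editor's note: the alternative Steve raises, building the 128-by-64-bit
division out of 64-bit operations, can be done with a restoring
shift-subtract loop that produces one quotient bit per step, much like
the microcode loop. Again a sketch, not gem5 or StackExchange code;
`divStep128` is a hypothetical name, and the sketch assumes hi < divisor
so that the quotient fits in 64 bits (the case where real hardware would
not raise #DE).]

```cpp
#include <cassert>
#include <cstdint>

// Restoring shift-subtract division of the 128-bit dividend hi:lo by a
// 64-bit divisor, using only 64-bit arithmetic. Precondition:
// divisor != 0 and hi < divisor, so the quotient fits in 64 bits.
uint64_t
divStep128(uint64_t hi, uint64_t lo, uint64_t divisor, uint64_t *rem)
{
    assert(divisor != 0 && hi < divisor);
    uint64_t quotient = 0;
    uint64_t remainder = hi;  // invariant: remainder < divisor
    for (int i = 63; i >= 0; --i) {
        // Shift the next dividend bit into the remainder; 'carry'
        // records the bit shifted out of the remainder's top, since
        // the intermediate value is really 65 bits wide.
        uint64_t carry = remainder >> 63;
        remainder = (remainder << 1) | ((lo >> i) & 1);
        // Subtract when the 65-bit remainder is >= divisor, and set
        // the corresponding quotient bit.
        if (carry || remainder >= divisor) {
            remainder -= divisor;
            quotient |= (uint64_t)1 << i;
        }
    }
    *rem = remainder;
    return quotient;
}
```

One iteration per quotient bit is also roughly what makes the microcoded
version's cycle count nearly fixed, which is the point Steve makes about
it sitting near the upper bound of the data-dependent hardware latency.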
