Thanks for speaking up, Gabe... I agree on both counts. I should have
said "probably not realistic any more". Also, a single-cycle divide is
arguably at least as unrealistic in the other direction.

Looking at table 17 in section B.6 on p. 349 of the AMD SW optimization
guide (http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf),
integer divide latencies are data-dependent, and a 64-bit divide can
take anywhere from 9 to 72 cycles. If I'm understanding Gabe's old
algorithm correctly, it takes a fixed number of cycles, though assuming
the branch overhead can be overlapped, that number is probably pretty
close to the upper bound of the actual latency, at least for recent AMD
processors. (I haven't looked for equivalent official Intel docs, though
if https://gmplib.org/~tege/x86-timing.pdf is correct, the latency can
be up to 95 cycles on Haswell.)

Is that right, Gabe? Or is there a data dependency in that microcode
loop that's not obvious?

The most flexible thing to do from a timing perspective would be to code
the division in C and then program the latency separately. However,
since the computation really is microcoded (see p. 248), that would not
give realistic results if you care about the modeling of microcode fetch
etc. (which would impact power models if nothing else).

Steve

On Mon, Apr 20, 2015 at 2:56 AM, Gabe Black <[email protected]> wrote:

> The original was implemented based on the K6 microops. It might not be
> realistic any more (although I don't think single cycle division is
> either?), but it wasn't entirely made up.
>
> Gabe
>
> On Sun, Apr 19, 2015 at 12:33 PM, Steve Reinhardt <[email protected]>
> wrote:
>
> > On Sun, Apr 19, 2015 at 9:25 AM, Nilay Vaish <[email protected]> wrote:
> >
> > > On Sun, 19 Apr 2015, Steve Reinhardt wrote:
> > >
> > >> -----------------------------------------------------------
> > >> This is an automatically generated e-mail. To reply, visit:
> > >> http://reviews.gem5.org/r/2743/#review6052
> > >> -----------------------------------------------------------
> > >>
> > >> I like the restructuring... I agree the micro-op loop is probably
> > >> not realistic. Is there a reason to code a loop in C though, as
> > >> opposed to just using '/' and '%'?
> > >
> > > The dividend is represented as rdx:rax, which means up to 128 bits
> > > of data. So we would not be able to carry out division by just
> > > using '/' and '%' when only using 64-bit integers. GCC and LLVM
> > > both support 128-bit integers on x86-64 platforms. We may want to
> > > use those, but I don't know if that would cause any compatibility
> > > problems.
> > >
> > > --
> > > Nilay
> >
> > Ah, thanks... I didn't look closely enough to see that it was a
> > 128-bit operation. I'd be fine with using gcc/llvm 128-bit support
> > if others are. If not, there are ways to build a 128-bit operation
> > out of the 64-bit operations that would still be simpler than the
> > bitwise loop. For example, I found this:
> >
> > http://codereview.stackexchange.com/questions/67962/mostly-portable-128-by-64-bit-division
> >
> > and if I read the StackExchange terms correctly, we could just use
> > that code with an appropriate attribution and a link in a comment
> > back to the question (look under Subscriber Content):
> > http://stackexchange.com/legal/terms-of-service
> >
> > Steve
> > _______________________________________________
> > gem5-dev mailing list
> > [email protected]
> > http://m5sim.org/mailman/listinfo/gem5-dev
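
P.S. For concreteness, here is a minimal sketch of what the gcc/llvm
128-bit route discussed in this thread might look like for the unsigned
case. The function name div128by64 and its signature are made up for
illustration and are not taken from the patch under review; it assumes
a compiler that provides the unsigned __int128 extension, and it does
not model latency at all (that would still have to be programmed
separately).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not gem5 code: computes what an unsigned x86
 * DIV r/m64 produces when dividing the 128-bit value rdx:rax by a
 * 64-bit divisor. Relies on the gcc/clang unsigned __int128 extension. */
static bool
div128by64(uint64_t rdx, uint64_t rax, uint64_t divisor,
           uint64_t *quotient, uint64_t *remainder)
{
    /* The hardware raises a divide error on divide-by-zero or when the
     * quotient does not fit in 64 bits, which happens exactly when
     * rdx >= divisor. Report both cases as failure to the caller. */
    if (divisor == 0 || rdx >= divisor)
        return false;

    /* Assemble the 128-bit dividend from its two 64-bit halves. */
    unsigned __int128 dividend = ((unsigned __int128)rdx << 64) | rax;

    *quotient = (uint64_t)(dividend / divisor);
    *remainder = (uint64_t)(dividend % divisor);
    return true;
}
```

A caller would pass the current rdx and rax values plus the divisor, and
treat a false return as the fault case. The signed variant (IDIV) would
need separate sign and overflow handling, which this sketch leaves out.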
