On Mon, Feb 27, 2023 at 04:03:56PM -0600, Pat Haugen wrote:
> On 2/27/23 2:53 PM, Segher Boessenkool wrote:
> >"Slightly".  It takes 12 cycles for the two in parallel (64-bit, p9),
> >but 17 cycles for the "cheaper" sequence (divd+mulld+subf, 12+5+2).  It
> >is all worse if the units are busy of course, or if there are other
> >problems.
> >
> >>but if you throw in another
> >>independent div or mod in the insn stream then doing the peephole should
> >>be a clear win since that 3rd insn can execute in parallel with the
> >>initial divide as opposed to waiting for one of the first div/mod to
> >>clear the exclusive stage of the pipe.
> >
> >That is the SMT4 case, the one we do not optimise for.  SMT2 and ST can
> >do four in parallel.  This means you can start a div or mod every 2nd
> >cycle on average, so it is very unlikely you will ever be limited by
> >this on real code.
> 
> Power9/Power10 only have 2 fixed-point divide units, and are able to 
> issue 2 divides every 9/11 cycles (they aren't fully pipelined), with 
> latencies of 12-24/12-25. Not saying that changes the "best case" 
> scenario, just pointing out a lot of variables in play.

The p9 UM says in no uncertain terms there are four integer dividers
(four fixed-point execution pipelines, all four capable of divides).
Is that wrong then?

Let's do actual tests on actual hardware :-)
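
A minimal sketch of such a test, under stated assumptions: the function
names (divmod_pair, div_mul_sub, time_variant) are hypothetical, the
divisor pattern is arbitrary, and the loop is a single dependent chain, so
it measures latency rather than throughput.  Whether the first variant
really becomes a div+mod pair and the second divd+mulld+subf depends on the
compiler version and -mcpu setting, so the disassembly needs checking
before any numbers are trusted.

```c
#include <stdint.h>
#include <time.h>

typedef int64_t (*divmod_fn)(int64_t, int64_t, int64_t *);

/* Variant A: quotient and remainder written as separate / and %.
   The compiler may or may not emit a div+mod pair for this. */
__attribute__((noinline))
int64_t divmod_pair(int64_t a, int64_t b, int64_t *rem)
{
    *rem = a % b;
    return a / b;
}

/* Variant B: remainder derived from the quotient (div, mul, sub). */
__attribute__((noinline))
int64_t div_mul_sub(int64_t a, int64_t b, int64_t *rem)
{
    int64_t q = a / b;
    *rem = a - q * b;
    return q;
}

/* Time `iters` dependent calls of `fn`; returns CPU seconds.
   Each result feeds the next dividend, so calls cannot overlap. */
double time_variant(divmod_fn fn, long iters)
{
    volatile int64_t sink;              /* keeps the chain live */
    int64_t a = INT64_C(0x0123456789abcdef);
    int64_t rem = 0;
    clock_t t0 = clock();

    for (long i = 0; i < iters; i++) {
        int64_t b = 7 + (i & 3);        /* small positive divisors 7..10 */
        int64_t q = fn(a, b, &rem);
        /* q + rem stays below 2^61, so this never overflows, and the
           OR keeps the next dividend large and positive. */
        a = (q + rem) | (INT64_C(1) << 60);
    }

    sink = a;
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_variant(divmod_pair, n) and time_variant(div_mul_sub, n) with
the same n, on an otherwise idle core, gives two figures to compare.  To
probe the issue-rate question, the loop would need several independent
chains instead of one.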


Segher
