On Mon, Feb 27, 2023 at 04:03:56PM -0600, Pat Haugen wrote:
> On 2/27/23 2:53 PM, Segher Boessenkool wrote:
> >"Slightly". It takes 12 cycles for the two in parallel (64-bit, p9),
> >but 17 cycles for the "cheaper" sequence (divd+mulld+subf, 12+5+2). It
> >is all worse if the units are busy of course, or if there are other
> >problems.
> >
> >>but if you throw in another
> >>independent div or mod in the insn stream then doing the peephole should
> >>be a clear win since that 3rd insn can execute in parallel with the
> >>initial divide as opposed to waiting for the one of the first div/mod to
> >>clear the exclusive stage of the pipe.
> >
> >That is the SMT4 case, the one we do not optimise for. SMT2 and ST can
> >do four in parallel. This means you can start a div or mod every 2nd
> >cycle on average, so it is very unlikely you will ever be limited by
> >this on real code.
>
> Power9/Power10 only have 2 fixed-point divide units, and are able to
> issue 2 divides every 9/11 cycles (they aren't fully pipelined), with
> latencies of 12-24/12-25. Not saying that changes the "best case"
> scenario, just pointing out a lot of variables in play.

The p9 UM says in no uncertain terms there are four integer dividers
(four fixed-point execution pipelines, all four capable of divides).
Is that wrong then?

Let's do actual tests on actual hardware :-)


Segher