Hi Mohit, I wonder whether the number of physical register file entries is becoming a bottleneck in the configuration you are using. Normally, I would expect the 'ProdLo' and 'ProdHi' registers to be renamed to distinct physical registers, so they should not create a dependency between two independent multiply operations.
-Ayaz

On Tue, Jul 20, 2021 at 5:27 PM Mohit Gambhir via gem5-users <[email protected]> wrote:
> Hi all,
>
> I am running a DerivO3CPU-based SE mode simulation with the x86 ISA. The
> microbenchmark that I am running contains a loop with independent multiply
> instructions. An excerpt from the disassembly of the benchmark loop looks
> something like this:
>
>   400c07: 48 0f af d2    imul %rdx,%rdx
>   400c0b: 48 0f af db    imul %rbx,%rbx
>   …
>
> When I look at the O3PipeView output, I see that all the independent
> multiply instructions are issued sequentially, even though there are 2
> multiply functional units and each of them is pipelined:
>
> [................f....dn.pi..c.r.................................................]-( 16664000.0) 0x00400c07.0 IMUL_R_R [ 34983]
> [................f....dn.p...ic.r................................................]-( 16664000.0) 0x00400c07.1 IMUL_R_R [ 34984]
> [................f....dn.p...ic.r................................................]-( 16664000.0) 0x00400c07.2 IMUL_R_R [ 34985]
> [................f....dn.p...i..c.r..............................................]-( 16664000.0) 0x00400c0b.0 IMUL_R_R [ 34986]
> [................f....dn.p......ic.r.............................................]-( 16664000.0) 0x00400c0b.1 IMUL_R_R [ 34987]
> [................f....dn.p......ic.r.............................................]-( 16664000.0) 0x00400c0b.2 IMUL_R_R [ 34988]
> …
>
> Digging into it further, I found that each of the IMUL_R_R instructions
> has Implicit Register 0 and 1 (ProdHi and ProdLow) added as a source and
> destination in the generated code. The following is an excerpt from
> decoder-ns-cc.inc:
>
>   Mul1sFlags::Mul1sFlags(…)
>   {
>       …
>       ….
>       setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_FOLDED(src1, foldOBit)));
>       setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_FOLDED(src2, foldOBit)));
>       setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_IMPLICIT(0)));
>       setDestRegIdx(_numDestRegs++, RegId(IntRegClass, INTREG_IMPLICIT(0)));
>       _numIntDestRegs++;
>       setSrcRegIdx(_numSrcRegs++, RegId(IntRegClass, INTREG_IMPLICIT(1)));
>       setDestRegIdx(_numDestRegs++, RegId(IntRegClass, INTREG_IMPLICIT(1)));
>       …
>   }
>
> This causes all the independent multiply instructions to execute
> sequentially, and the multiply throughput is 1/3.
>
> If we have multiple functional units, then should these implicit registers
> (ProdHi and ProdLo) be replicated for each of them, and if so, why add them
> as source and destination at all?
>
> Any clarifications or workaround for this?
>
> Thanks,
> Mohit
>
> _______________________________________________
> gem5-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
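To make the question above concrete, here is a toy sketch of why listing an implicit register as both a source and a destination could serialize otherwise independent multiplies even with renaming. It is not gem5 code: the function name issue_cycles, the 3-cycle latency, and the assumption of unlimited functional units and physical registers are all hypothetical choices for illustration. Renaming is modeled by tracking only the newest producer of each architectural register, so write-after-write and write-after-read hazards vanish and only true read-after-write dependences remain.

```python
# Toy model of register renaming (illustrative only, not gem5 code).
# Each micro-op is (source_regs, dest_regs) using architectural names.

def issue_cycles(uops, latency=3):
    """Return the earliest issue cycle of each micro-op.

    Assumes unlimited functional units and physical registers, and a
    fixed (hypothetical) result latency. Only read-after-write
    dependences delay a micro-op, which is what renaming leaves behind.
    """
    ready = {}   # architectural reg -> cycle its newest value is available
    cycles = []
    for srcs, dests in uops:
        # A micro-op can issue once every source value is available.
        start = max((ready.get(r, 0) for r in srcs), default=0)
        cycles.append(start)
        # Each destination gets a fresh producer (i.e. a new physical reg).
        for r in dests:
            ready[r] = start + latency
    return cycles

# Implicit registers listed as sources AND destinations, as in the excerpt.
with_implicit_src = [
    (("rdx", "ProdLow", "ProdHi"), ("ProdLow", "ProdHi")),
    (("rbx", "ProdLow", "ProdHi"), ("ProdLow", "ProdHi")),
]

# Hypothetical variant: implicit registers are destinations only.
dest_only = [
    (("rdx",), ("ProdLow", "ProdHi")),
    (("rbx",), ("ProdLow", "ProdHi")),
]

print(issue_cycles(with_implicit_src))  # second multiply waits for the first
print(issue_cycles(dest_only))          # both multiplies can issue together
```

In this model the first variant forms a chain through ProdLow/ProdHi, while the destination-only variant does not. Whether that is what actually happens in the O3 model here would need to be confirmed against the rename and issue stages.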
