David Zhao Akeley has uploaded this change for review. ( https://gem5-review.googlesource.com/c/public/gem5/+/27767 )

Change subject: cpu-o3: Support multiple FU Pools and partial bypassing in O3CPU
......................................................................

cpu-o3: Support multiple FU Pools and partial bypassing in O3CPU

The main user-visible change is the partial replacement of the fuPool
parameter with the fuPools vector parameter. The old fuPool parameter
is still supported as a fallback.

The main internal change is the addition of a new Impl template
parameter, FUPoolsStrategy. This class is tasked with assigning
instruction executions to FU pools, modeling the bypass network of
the CPU, and maintaining information in each dyn inst (through the
FUPoolsStrategy::InstructionRecord struct) needed to accurately
simulate bypassing.

The default implementation, implemented in fu_pools_strategy.hh,
simulates bypassing only within FU pools (with a 1 cycle delay before
results produced in one pool are visible in others), and employs a
"greedy" strategy for assigning instructions to pools for
execution. Alternate strategies are also implemented, including
CompleteBypassFUPoolsStrategy, which most closely matches the
bypassing model (and simulation cost) before this change.

Relatively minor changes are also made to the BaseO3CPU, IEW, and
InstructionQueue classes: this is mainly to handle the new fuPools
parameter, and propagate the needed information (including the cycle
count) to the FUPoolsStrategy.

Finally, there are new statistics added: insts_with_bypassing,
insts_without_bypassing, capability_bypass_fails,
confluence_bypass_fails, and congestion_bypass_fails, and two new
debug flags, IQFU and SIMDFU. These are intended to help users
understand the effects of the bypass network simulated.

Change-Id: Ibf39378c5af5ac4352a5d8ba1087417e2279234f
---
M src/cpu/o3/FUPool.py
M src/cpu/o3/O3CPU.py
M src/cpu/o3/SConscript
M src/cpu/o3/cpu.cc
M src/cpu/o3/cpu.hh
M src/cpu/o3/cpu_policy.hh
M src/cpu/o3/dyn_inst.hh
M src/cpu/o3/fu_pool.cc
M src/cpu/o3/fu_pool.hh
A src/cpu/o3/fu_pools_strategy.hh
M src/cpu/o3/iew.hh
M src/cpu/o3/iew_impl.hh
M src/cpu/o3/impl.hh
M src/cpu/o3/inst_queue.hh
M src/cpu/o3/inst_queue_impl.hh
15 files changed, 892 insertions(+), 51 deletions(-)



diff --git a/src/cpu/o3/FUPool.py b/src/cpu/o3/FUPool.py
index 58f33fc..74ee438 100644
--- a/src/cpu/o3/FUPool.py
+++ b/src/cpu/o3/FUPool.py
@@ -35,6 +35,8 @@
 # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Authors: Kevin Lim, David Zhao Akeley

 from m5.SimObject import SimObject
 from m5.params import *
@@ -49,3 +51,29 @@
 class DefaultFUPool(FUPool):
     FUList = [ IntALU(), IntMultDiv(), FP_ALU(), FP_MultDiv(), ReadPort(),
                SIMD_Unit(), PredALU(), WritePort(), RdWrPort(), IprPort() ]
+
+# Make a subclass of FUPool that contains the given number of each
+# type of functional unit. For example,
+#
+# make_fu_pool_class(int_alu=4, int_mult_div=1)()
+#
+# Creates an FU Pool type consisting of 4 int ALUs and 1 int mult/div
+# unit, and then instantiates one instance of said FU Pool.
+def make_fu_pool_class(int_alu=0, int_mult_div=0, fp_alu=0,
+                       fp_mult_div=0, read_port=0, simd_unit=0,
+                       pred_alu=0, write_port=0, rdwr_port=0, ipr_port=0):
+    unit_types = [ IntALU, IntMultDiv, FP_ALU, FP_MultDiv, ReadPort,
+                   SIMD_Unit, PredALU, WritePort, RdWrPort, IprPort ]
+    args = (int_alu, int_mult_div, fp_alu, fp_mult_div, read_port,
+            simd_unit, pred_alu, write_port, rdwr_port, ipr_port)
+    outer_fu_list = []
+    for i, count in enumerate(args):
+        if count <= 0: continue
+        unit = unit_types[i]()
+        unit.count = count
+        outer_fu_list.append(unit)
+
+    class GeneratedFUPool(FUPool):
+        FUList = outer_fu_list
+
+    return GeneratedFUPool
diff --git a/src/cpu/o3/O3CPU.py b/src/cpu/o3/O3CPU.py
index 51d9121..11267f1 100644
--- a/src/cpu/o3/O3CPU.py
+++ b/src/cpu/o3/O3CPU.py
@@ -111,8 +111,10 @@
     dispatchWidth = Param.Unsigned(8, "Dispatch width")
     issueWidth = Param.Unsigned(8, "Issue width")
     wbWidth = Param.Unsigned(8, "Writeback width")
-    fuPool = Param.FUPool(DefaultFUPool(), "Functional Unit pool")
-
+    fuPool = Param.FUPool(DefaultFUPool(), "Functional Unit pool. "
+                        "(Use fuPools to provide multiple FU Pools).")
+    fuPools = VectorParam.FUPool([], "List of functional unit pools. "
+                                     "Overrides fuPool if non-empty.")
     iewToCommitDelay = Param.Cycles(1, "Issue/Execute/Writeback to commit "
                "delay")
     renameToROBDelay = Param.Cycles(1, "Rename to reorder buffer delay")
diff --git a/src/cpu/o3/SConscript b/src/cpu/o3/SConscript
index 3966f97..8e08270 100755
--- a/src/cpu/o3/SConscript
+++ b/src/cpu/o3/SConscript
@@ -60,6 +60,8 @@
     DebugFlag('CommitRate')
     DebugFlag('IEW')
     DebugFlag('IQ')
+    DebugFlag('IQFU')
+    DebugFlag('SIMDFU')
     DebugFlag('LSQ')
     DebugFlag('LSQUnit')
     DebugFlag('MemDepUnit')
diff --git a/src/cpu/o3/cpu.cc b/src/cpu/o3/cpu.cc
index 5f0a98b..4d8930b 100644
--- a/src/cpu/o3/cpu.cc
+++ b/src/cpu/o3/cpu.cc
@@ -80,6 +80,20 @@
     BaseCPU::regStats();
 }

+// Backwards-compatibility hack: this CPU supports multiple fu pools
+// via the `fuPools` parameter, but still supports the older single
+// `fuPool` parameter as a fallback if the `fuPools` parameter is
+// in its default empty state.
+namespace {
+std::vector<FUPool*> getFuPoolsVector(DerivO3CPUParams *params)
+{
+    std::vector<FUPool*> pools = params->fuPools;
+    if (pools.size() == 0) pools.push_back(params->fuPool);
+    for (FUPool* pool : pools) assert(pool != nullptr);
+    return pools;
+}
+}
+
 template <class Impl>
 FullO3CPU<Impl>::FullO3CPU(DerivO3CPUParams *params)
     : BaseO3CPU(params),
@@ -93,6 +107,7 @@
       instcount(0),
 #endif
       removeInstsThisCycle(false),
+      fuPools(getFuPoolsVector(params)),
       fetch(this, params),
       decode(this, params),
       rename(this, params),
@@ -522,6 +537,7 @@
     assert(!switchedOut());
     assert(drainState() != DrainState::Drained);

+    cycleCounter += Cycles(1);
     ++numCycles;
     updateCycleCounters(BaseCPU::CPU_STATE_ON);

diff --git a/src/cpu/o3/cpu.hh b/src/cpu/o3/cpu.hh
index c3d911b..2549785 100644
--- a/src/cpu/o3/cpu.hh
+++ b/src/cpu/o3/cpu.hh
@@ -127,8 +127,13 @@
     /** Overall CPU status. */
     Status _status;

-  private:
+    /** Increases by one for each simulated cycle. Used for internal
+     * purposes; may not be quite the same as numCycles due to sleep,
+     * etc.
+     */
+    Cycles cycleCounter = Cycles(0);

+  private:
     /** The tick event used for scheduling CPU ticks. */
     EventFunctionWrapper tickEvent;

@@ -557,6 +562,11 @@
      */
     bool removeInstsThisCycle;

+    /** The function unit pool(s) available to the CPU. Must be
+     *  declared before IEW stage.
+     */
+    const std::vector<FUPool*> fuPools;
+
   protected:
     /** The fetch stage. */
     typename CPUPolicy::Fetch fetch;
diff --git a/src/cpu/o3/cpu_policy.hh b/src/cpu/o3/cpu_policy.hh
index 82dcd09..c9c3a10 100644
--- a/src/cpu/o3/cpu_policy.hh
+++ b/src/cpu/o3/cpu_policy.hh
@@ -35,6 +35,7 @@
 #include "cpu/o3/decode.hh"
 #include "cpu/o3/fetch.hh"
 #include "cpu/o3/free_list.hh"
+#include "cpu/o3/fu_pools_strategy.hh"
 #include "cpu/o3/iew.hh"
 #include "cpu/o3/inst_queue.hh"
 #include "cpu/o3/lsq.hh"
diff --git a/src/cpu/o3/dyn_inst.hh b/src/cpu/o3/dyn_inst.hh
index c326058..0c424e7 100644
--- a/src/cpu/o3/dyn_inst.hh
+++ b/src/cpu/o3/dyn_inst.hh
@@ -46,10 +46,11 @@

 #include "arch/isa_traits.hh"
 #include "config/the_isa.hh"
-#include "cpu/o3/cpu.hh"
-#include "cpu/o3/isa_specific.hh"
 #include "cpu/base_dyn_inst.hh"
 #include "cpu/inst_seq.hh"
+#include "cpu/o3/cpu.hh"
+#include "cpu/o3/fu_pools_strategy.hh"
+#include "cpu/o3/isa_specific.hh"
 #include "cpu/reg_class.hh"

 class Packet;
@@ -63,6 +64,14 @@

     /** Binary machine instruction type. */
     typedef TheISA::MachInst MachInst;
+
+    /** Strategy used for dispatching instructions to FU pools */
+    typedef typename Impl::FUPoolsStrategy FUPoolsStrategy;
+
+    /** Additional data that the FU pools strategy needs to record in
+     * every DynInst. */
+    typedef typename FUPoolsStrategy::InstructionRecord FUPoolsRecord;
+
     /** Register types. */
     using VecRegContainer = TheISA::VecRegContainer;
     using VecElem = TheISA::VecElem;
@@ -117,8 +126,63 @@
     /** Number of destination misc. registers. */
     uint8_t _numDestMiscRegs;

+    /** Number of the FU Pool that this instruction was executed
+     * on. Used to see if within-FU-pool bypassing is possible for
+     * dependent instructions. -1 when not yet set; < -1 when there is
+     * (for some reason) no FU generating this instruction's
+     * result. */
+    int fuPoolIndex = -1;

   public:
+    FUPoolsRecord fuPoolsRecord;
+
+    /** Set the Issued flag for this instruction, and record the FU
+     * Pool that this instruction is executed on. This pool should not
+     * later be changed.
+     */
+    void
+    setIssuedFuPool(int _fuPoolIndex)
+    {
+        // Can only be set once (excepting -2, for squashed insts hack).
+        assert(!this->isIssued() || _fuPoolIndex < -1);
+        assert(_fuPoolIndex != -1);
+        this->setIssued();
+        fuPoolIndex = _fuPoolIndex;
+    }
+
+    /** True iff the instruction has been scheduled for execution, but
+     * for some reason is not generated by any FU Pool. */
+    bool
+    fuPoolNotUsed() const
+    {
+        assert((fuPoolIndex == -1) ^ this->isIssued());
+        return fuPoolIndex < -1;
+    }
+
+    /** Return the index of the FU Pool used to execute this
+     * instruction.  Can only be called for instructions already
+     * issued and assigned to a real FU Pool (!fuPoolNotUsed()).
+     */
+    int
+    getExecuteFuPoolIndex() const
+    {
+        assert(this->isIssued());
+        assert(fuPoolIndex >= 0); // debugger hack
+        return fuPoolIndex;
+    }
+
+    /** Records that a new input is ready, and is produced by the
+     * specified FU Pool. The current cycle number must also be
+     * provided. The FUPoolStrategy makes use of this information.
+     */
+    void
+    markSrcRegReadyFuPool(int fuPoolIndex, Cycles currentCycle)
+    {
+        assert(fuPoolIndex >= 0);
+        this->markSrcRegReady();
+        fuPoolsRecord.resultOnFuPool(fuPoolIndex, currentCycle);
+    }
+
 #if TRACING_ON
     /** Tick records used for the pipeline activity viewer. */
     Tick fetchTick;      // instruction fetch is completed.
@@ -427,4 +491,3 @@
 };

 #endif // __CPU_O3_ALPHA_DYN_INST_HH__
-
diff --git a/src/cpu/o3/fu_pool.cc b/src/cpu/o3/fu_pool.cc
index 5a26d80..43e3aac 100644
--- a/src/cpu/o3/fu_pool.cc
+++ b/src/cpu/o3/fu_pool.cc
@@ -158,7 +158,7 @@
 FUPool::getUnit(OpClass capability)
 {
     //  If this pool doesn't have the specified capability,
-    //  return this information to the caller
+    //  return this information to the caller.
     if (!capabilityList[capability])
         return -2;

@@ -182,6 +182,29 @@
     return fu_idx;
 }

+int
+FUPool::getFreeUnitCount(OpClass capability)
+{
+    //  If this pool doesn't have the specified capability,
+    //  return this information to the caller.
+    if (!capabilityList[capability])
+        return -2;
+
+    int count = 0;
+    int fu_idx = fuPerCapList[capability].getFU();
+    int start_idx = fu_idx;
+
+ // Iterate through the circular queue if needed, stopping if we've reached
+    // the first element again.
+    do {
+        count += unitBusy[fu_idx] ? 0 : 1;
+        fu_idx = fuPerCapList[capability].getFU();
+    } while (fu_idx != start_idx);
+
+    assert(fu_idx < numFU);
+    return count;
+}
+
 void
 FUPool::freeUnitNextCycle(int fu_idx)
 {
@@ -205,6 +228,16 @@
 void
 FUPool::dump()
 {
+    auto dumpUnit = [] (const FuncUnit* unit_ptr)
+    {
+        FuncUnit& unit = *const_cast<FuncUnit*>(unit_ptr);
+        cout << unit.name;
+        for (int c = 0; c < Num_OpClasses; ++c) {
+            if (unit.provides(OpClass(c))) {
+                cout << " " << Enums::OpClassStrings[c];
+            }
+        }
+    };
     cout << "Function Unit Pool (" << name() << ")\n";
     cout << "======================================\n";
     cout << "Free List:\n";
@@ -215,9 +248,7 @@
         }

         cout << "  [" << i << "] : ";
-
-        cout << funcUnits[i]->name << " ";
-
+        dumpUnit(funcUnits[i]);
         cout << "\n";
     }

@@ -229,9 +260,7 @@
         }

         cout << "  [" << i << "] : ";
-
-        cout << funcUnits[i]->name << " ";
-
+        dumpUnit(funcUnits[i]);
         cout << "\n";
     }
 }
diff --git a/src/cpu/o3/fu_pool.hh b/src/cpu/o3/fu_pool.hh
index 81c4a6f..31c1ab4 100644
--- a/src/cpu/o3/fu_pool.hh
+++ b/src/cpu/o3/fu_pool.hh
@@ -146,6 +146,15 @@
      */
     int getUnit(OpClass capability);

+    /**
+     * Get the number of FUs currently available for executing
+     * instructions of the given OpClass. Does NOT mark any unit as
+     * busy. Returns 0 if there is no free FU, but still returns
+     * NoCapableFU, like getUnit, if this FU Pool lacks the needed
+     * capability.
+     */
+    int getFreeUnitCount(OpClass capability);
+
     /** Frees a FU at the end of this cycle. */
     void freeUnitNextCycle(int fu_idx);

diff --git a/src/cpu/o3/fu_pools_strategy.hh b/src/cpu/o3/fu_pools_strategy.hh
new file mode 100644
index 0000000..19f2e26
--- /dev/null
+++ b/src/cpu/o3/fu_pools_strategy.hh
@@ -0,0 +1,543 @@
+/*
+ * Copyright (c) 2020 David Zhao Akeley
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met: redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer;
+ * redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution;
+ * neither the name of the copyright holders nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Authors: David Zhao Akeley
+ */
+
+#ifndef __CPU_O3_FU_POOLS_STRATEGY_HH__
+#define __CPU_O3_FU_POOLS_STRATEGY_HH__
+
+#include "base/random.hh"
+#include "base/types.hh"
+#include "cpu/o3/fu_pool.hh"
+#include "debug/IQ.hh"
+#include "debug/IQFU.hh"
+#include "debug/SIMDFU.hh"
+#include "enums/OpClass.hh"
+#include "params/DerivO3CPU.hh"
+
+/** Status code used by the FU Pool selection function, to
+ * indicate the reason for not scheduling an instruction on an FU
+ * Pool where bypassing the needed operands is possible.
+ */
+enum class BypassStatus
+{
+    // Successfully scheduled inst on an FU Pool with bypassed values.
+    Bypassed = 0,
+    // Could not bypass because the FU Pool producing the needed
+    // operand contains the needed FU type to execute this
+    // instruction, but all of those FUs are busy.
+    Congestion = 1,
+    // Could not bypass because the FU Pool producing the needed
+    // operand does not have an FU capable of executing this
+    // instruction.
+    Capability = 2,
+    // Could not bypass because multiple different FU Pools
+    // produced the needed operands for this instruction.
+    Confluence = 3,
+    // No bypassing was needed -- all operands are old enough.
+    NotNeeded = 4
+};
+
+/** Return type of FU pool selection function.  */
+struct ReserveResult
+{
+    // Reason for failure to bypass, if any.
+    BypassStatus bypassStatus;
+    // Index of the FU within the FU Pool that's chosen to execute
+    // this instruction. NoFreeFU if we failed to choose an FU and
+    // should try again later; NoCapableFU if we will execute the
+    // instruction without an FU.
+    int fuIdx;
+    // Index/number of the FU Pool that will be chosen to execute
+    // this instruction. Meaningless if fuIdx < 0.
+    int poolIdx;
+    // Latency of this instruction's execution on the chosen FU.
+    Cycles opLatency;
+};
+
+inline bool isSimdOpClass(OpClass op_class)
+{
+    return (Enums::SimdAdd <= op_class)
+        && (op_class <= Enums::SimdPredAlu);
+}
+
+/** Functional units are split into different pools. The FU Pools
+ *  Strategy determines
+ *
+ *  1. How instructions requiring functional units to execute are
+ *  assigned to different pools.
+ *
+ *  2. Whether and how values are bypassed between functional units.
+ *
+ *  3. What additional data is stored in each dynamic instruction in
+ *     order to correctly implement the bypassing model.
+ *
+ *  This base strategy models multiple functional unit pools, with
+ *  complete bypassing between FUs within a single pool, and no
+ *  bypassing between different pools. If a value is ready on a
+ *  certain FU pool on clock cycle C, then, for that clock cycle,
+ *  ready instructions dependent on that value may only begin
+ *  execution on said FU pool. This restriction is lifted on cycle C +
+ *  1, at which point the produced value is considered visible to all
+ *  FUs. This matches physical implementations that spend one cycle
+ *  writing a result back to a renamed physical register file visibled
+ *  to all functional units.
+ *
+ *  The InstructionRecord type, and reserveFU and
+ *  InstructionRecord::resultOnFuPool functions should be replaced to
+ *  customize the FUPoolStrategy. Note that this base strategy cannot
+ *  be used on its own since reserveFU is not implemented.
+ */
+template <class Impl>
+class BaseFUPoolsStrategy
+{
+    typedef typename Impl::O3CPU O3CPU;
+    typedef typename Impl::DynInstPtr DynInstPtr;
+
+  protected:
+    /** Functional unit pools available to choose from. */
+    std::vector<FUPool*> fuPools;
+
+    /** Back pointer to CPU. */
+    O3CPU *cpu;
+
+  public:
+    BaseFUPoolsStrategy(O3CPU *cpu_ptr, DerivO3CPUParams *)
+    {
+        cpu = cpu_ptr;
+        fuPools = cpu_ptr->fuPools;
+        assert(fuPools.size() != 0);
+    }
+
+    /** Information that is kept in each dynamic instruction record,
+     * which is used to keep track of which FU pools produce the
+     * instruction's input values, and whether the instruction can be
+     * executed with bypassing.
+     */
+    struct InstructionRecord
+    {
+        /** The cpu cycle number of the last call to resultOnFuPool */
+        Cycles latestSrcRegCycle = Cycles(0);
+
+        /** The FU pool index of the last call to resultOnFuPool,
+         * except if there were multiple such calls within one CPU
+         * cycle, and the FU pools aren't the same, this value is
+         * -1. (This indicates that bypassing is not possible this
+         * cycle as no one FU Pool has all the data needed for the
+         * inst to execute.)
+         */
+        int srcRegFuPoolIndex = -1;
+
+        /** Called whenever an FU pool produces a result needed by
+         * this instruction. Requires the index of the FU Pool (in the
+         * fuPools list) and the clock cycle that the result was
+         * generated on.
+         */
+        void resultOnFuPool(int fu_pool_idx, Cycles current_cycle)
+        {
+            assert(fu_pool_idx >= 0);
+            // New cycle? Older registers are now visible on all FUs, so
+            // just record the FU producing the new result for this cycle.
+            if (latestSrcRegCycle != current_cycle) {
+                latestSrcRegCycle = current_cycle;
+                srcRegFuPoolIndex = fu_pool_idx;
+            }
+            // Otherwise, if multiple results on different FU Pools are
+            // available this cycle, record via srcRegFuPoolIndex that
+            // bypassing is not possible this cycle.
+            else if (srcRegFuPoolIndex != fu_pool_idx) {
+                srcRegFuPoolIndex = -1;
+            }
+        }
+
+        /** Given that the instruction is ready to issue, return
+         * whether bypassing is needed to execute the instruction this
+         * cycle (due to src reg results being too new). If so, write
+         * through (*outFuPoolIndex) the number of the FU Pool that
+         * this instruction may execute on (or -1, if no FU Pool can
+         * execute the instruction).  If not, the instruction's needed
+         * inputs are available at all FU Pools.
+        */
+        bool
+ getBypassFuPoolIndex(int* out_fu_pool_idx, Cycles current_cycle) const
+        {
+            if (current_cycle != latestSrcRegCycle) return false;
+            *out_fu_pool_idx = srcRegFuPoolIndex;
+            return true;
+        }
+    };
+
+  protected:
+    /** Given that this instruction has the given OpClass and is ready
+     * to execute, select a suitable FU from an FU pool and mark it as
+     * busy (unless no FU is needed).
+     *
+     * The given SelectFU lambda selects an available functional unit
+     * for instruction execution (when we have a choice, i.e. when
+     * bypassing is not needed).
+     */
+    template <class SelectFU>
+    ReserveResult
+    baseReserveFU(OpClass op_class, DynInstPtr issuing_inst,
+                  SelectFU&& select_fu)
+    {
+        ReserveResult result;
+        result.poolIdx = -2;
+        result.fuIdx = FUPool::NoCapableFU;
+        result.opLatency = Cycles(1);
+
+        bool debug_print = DTRACE(IQFU) ||
+            (DTRACE(SIMDFU) && isSimdOpClass(op_class));
+
+        // Change this if it turns out bypassing would be needed to
+        // execute this instruction this cycle.
+        result.bypassStatus = BypassStatus::NotNeeded;
+
+ bool need_bypassing = issuing_inst->fuPoolsRecord.getBypassFuPoolIndex(
+            &result.poolIdx, cpu->cycleCounter);
+
+        // Did we find a FU pool able to execute this instruction?
+        bool success = false;
+        assert(fuPools.size() > 0);
+
+        // No FU needed case indicated by NoCapableFU (for some reason).
+        if (op_class == No_OpClass) {
+            // Nothing to do in this case.
+        }
+
+        // If bypassing would be needed, try to schedule this instruction
+        // on the FU pool producing the bypassed value.
+        else if (need_bypassing) {
+            // Needed bypassed values on different FU pools -- fail.
+            if (result.poolIdx < 0) {
+                result.bypassStatus = BypassStatus::Confluence;
+                result.fuIdx = FUPool::NoFreeFU;
+            }
+            // See if the FU pool producing the bypassed value is
+            // ready to execute this instruction. If not for any
+            // reason, fail, because maybe next cycle we'll find
+            // another FU pool that is capable.
+            else {
+                tryFU(&result, op_class, result.poolIdx);
+                if (result.fuIdx == FUPool::NoFreeFU) {
+                    result.bypassStatus = BypassStatus::Congestion;
+                }
+                else if (result.fuIdx == FUPool::NoCapableFU) {
+                    result.bypassStatus = BypassStatus::Capability;
+                    result.fuIdx = FUPool::NoFreeFU;
+                }
+                else {
+                    success = true;
+                    result.bypassStatus = BypassStatus::Bypassed;
+                }
+            }
+            assert(result.bypassStatus != BypassStatus::NotNeeded);
+        }
+
+        // Otherwise, bypassing is not needed, so all FU pools see all
+        // the needed operands; we are free to choose any FU pool to
+        // execute this instruction, using the template's strategy.
+        else {
+            success = select_fu(&result, op_class);
+        }
+
+        // Finally time to return the result (after doing a bunch of
+        // debug stuff).
+        if (success) {
+            assert(result.poolIdx >= 0);
+            assert(result.fuIdx >= 0);
+            assert(result.poolIdx < int(fuPools.size()));
+        }
+        else {
+            assert(
+               result.fuIdx == FUPool::NoFreeFU ||
+               result.fuIdx == FUPool::NoCapableFU);
+        }
+
+        if (debug_print) {
+            std::stringstream ss;
+            ss << "cycleCounter " << long(cpu->cycleCounter) << ":";
+            const int src_reg_count = issuing_inst->numSrcRegs();
+            for (int i = 0; i < src_reg_count; ++i) {
+ ss << " r" << int(issuing_inst->renamedSrcRegIdx(i)->index());
+            }
+            ss << " ->";
+            const int dest_reg_count = issuing_inst->numDestRegs();
+            for (int i = 0; i < dest_reg_count; ++i) {
+ ss << " r" << int(issuing_inst->renamedDestRegIdx(i)->index());
+            }
+            if (result.fuIdx == FUPool::NoCapableFU) {
+                ss << (op_class == No_OpClass
+                    ? "\n\t no FU needed" : "\n\t no capable FU");
+            }
+            else if (result.fuIdx == FUPool::NoFreeFU) {
+                switch (result.bypassStatus) {
+                    case BypassStatus::Congestion:
+                        ss << "\n\t congestion bypass fail on FU pool ";
+                        ss << result.poolIdx;
+                        break;
+                    case BypassStatus::Capability:
+                        ss << "\n\t capability bypass fail on FU pool ";
+                        ss << result.poolIdx;
+                    break;
+                case BypassStatus::Confluence:
+                    ss << "\n\t confluence bypass fail";
+                    break;
+                case BypassStatus::NotNeeded:
+                    ss << "\n\t no free FU";
+                    break;
+                case BypassStatus::Bypassed:
+                    assert(0);
+                default:
+                    assert(0);
+                }
+            }
+            else {
+                assert(result.fuIdx >= 0);
+                ss << "\n\t executing on FU pool ";
+                ss << result.poolIdx;
+                ss << (result.bypassStatus == BypassStatus::Bypassed
+                    ? " (bypassed)" : " (no bypassing)");
+            }
+            DPRINTF_UNCONDITIONAL(IQ, "%s.\n", ss.str().c_str());
+        }
+
+        return result;
+    }
+
+    // Helper function that returns true iff the FU Pool with the
+    // given index (pool_idx) is available for executing an
+    // instruction of the given OpClass. If so, reserve the FU and
+    // fill in the ReserveResult variables with the correct
+    // values. Otherwise, set fuIdx to NoFreeFU only if that's the
+    // reason (this has the effect of making fuIdx == NoCapableFU only
+    // when ALL FU pools are not capable of executing this op class).
+    bool tryFU(ReserveResult *result, OpClass op_class, int pool_idx)
+    {
+        FUPool* pool = fuPools[pool_idx];
+        const int idx = pool->getUnit(op_class);
+        if (idx >= 0) {
+            result->fuIdx = idx;
+            result->poolIdx = pool_idx;
+            result->opLatency = pool->getOpLatency(op_class);
+            return true;
+        } else if (idx == FUPool::NoFreeFU) {
+            result->fuIdx = idx;
+        }
+        return false;
+    };
+};
+
+
+/** The default FU pools strategy is to model a within-fu-pools bypass
+ *  network (as implemented by BaseFUPoolsStrategy), and assign an FU
+ *  to the first FU Pool available for executing the instruction (when
+ *  we have a choice, i.e., no bypassing is needed).
+ */
+template <class Impl>
+class DefaultFUPoolsStrategy : public BaseFUPoolsStrategy<Impl>
+{
+  public:
+    typedef typename Impl::O3CPU O3CPU;
+    typedef typename Impl::DynInstPtr DynInstPtr;
+
+    DefaultFUPoolsStrategy(O3CPU *cpu_ptr, DerivO3CPUParams *params)
+    : BaseFUPoolsStrategy<Impl>(cpu_ptr, params) { }
+
+    ReserveResult reserveFU(OpClass op_class, DynInstPtr issuing_inst)
+    {
+        // Greedy scheme. Try FU Pools in reverse order until we find
+        // one that can execute this instruction. (I don't remember
+        // why I did this in reverse order but I don't want to change
+        // it now and possibly mess up my previous sim results).
+        auto greedy = [&] (ReserveResult* result, OpClass op_class) -> bool
+        {
+            bool success = false;
+            const int start = int(this->fuPools.size()) - 1;
+            for (int i = start; !success && i >= 0; --i) {
+                success = this->tryFU(result, op_class, i);
+            }
+            return success;
+        };
+        return this->baseReserveFU(op_class, issuing_inst, greedy);
+    }
+};
+
+
+/** Similar to the default (greedy) strategy, except that the starting
+ *  index for the search is randomized. FU pools are queried for a
+ *  free capable FU in a circular order, starting from the random
+ *  start index.
+ */
+template <class Impl>
+class RandomFUPoolsStrategy : public BaseFUPoolsStrategy<Impl>
+{
+  public:
+    typedef typename Impl::O3CPU O3CPU;
+    typedef typename Impl::DynInstPtr DynInstPtr;
+
+    RandomFUPoolsStrategy(O3CPU *cpu_ptr, DerivO3CPUParams *params)
+    : BaseFUPoolsStrategy<Impl>(cpu_ptr, params) { }
+
+    ReserveResult reserveFU(OpClass op_class, DynInstPtr issuing_inst)
+    {
+        auto select = [&] (ReserveResult* result, OpClass op_class) -> bool
+        {
+            bool success = false;
+            auto fu_pool_count = int(this->fuPools.size());
+            int random_idx = random_mt.random<int>(0, fu_pool_count-1);
+            for (int i = random_idx; !success && i >= 0; --i) {
+                success = this->tryFU(result, op_class, i);
+            }
+ for (int i = fu_pool_count-1; !success && i > random_idx; --i) {
+                success = this->tryFU(result, op_class, i);
+            }
+            return success;
+        };
+        return this->baseReserveFU(op_class, issuing_inst, select);
+    }
+};
+
+
+/** All FU pools are queried for the number of free FUs capable of
+ *  executing the instruction. The one reporting the highest number is
+ *  assigned the instruction (in case of a tie, the earliest one
+ *  queried is chosen).
+ */
+template <class Impl>
+class LoadBalanceFUPoolsStrategy : public BaseFUPoolsStrategy<Impl>
+{
+  public:
+    typedef typename Impl::O3CPU O3CPU;
+    typedef typename Impl::DynInstPtr DynInstPtr;
+
+    LoadBalanceFUPoolsStrategy(O3CPU *cpu_ptr, DerivO3CPUParams *params)
+    : BaseFUPoolsStrategy<Impl>(cpu_ptr, params) { }
+
+    ReserveResult reserveFU(OpClass op_class, DynInstPtr issuing_inst)
+    {
+        // Load balancing scheme. Select the FU pool that has the
+        // highest number of FUs ready to execute this class of
+        // instruction. (This is definitely a lazy 5 AM design; peek
+        // behind the function call boundaries and you'll see this is
+        // unreasonably expensive in simulation time).
+        auto select = [&] (ReserveResult* result, OpClass op_class) -> bool
+        {
+            bool success = false;
+            auto& pools = this->fuPools;
+            auto fu_pool_count = int(pools.size());
+            int maxFreeUnitCount = FUPool::NoCapableFU;
+            for (int i = 0; i < fu_pool_count; ++i) {
+                int free_count = pools[i]->getFreeUnitCount(op_class);
+                success |= (free_count >= 1);
+                if (free_count > maxFreeUnitCount) {
+                    maxFreeUnitCount = free_count;
+                    result->poolIdx = i;
+                }
+            }
+            if (success) {
+                FUPool* pool = pools[result->poolIdx];
+                result->fuIdx = pool->getUnit(op_class);
+                assert(result->fuIdx >= 0);
+                result->opLatency = pool->getOpLatency(op_class);
+            }
+            else {
+                result->fuIdx = maxFreeUnitCount == FUPool::NoCapableFU
+                    ? FUPool::NoCapableFU
+                    : FUPool::NoFreeFU;
+            }
+            return success;
+        };
+        return this->baseReserveFU(op_class, issuing_inst, select);
+    }
+};
+
+
+/** Strategy that simulates complete bypassing between ALL functional
+ * units. Any result is visible to any FU as soon as it is
+ * produced. This should be cheaper for simulations that don't care
+ * about more realistic bypass networks.
+ */
+template <class Impl>
+class CompleteBypassFUPoolsStrategy
+{
+    typedef typename Impl::O3CPU O3CPU;
+    typedef typename Impl::DynInstPtr DynInstPtr;
+
+  protected:
+    /** Functional unit pools available to choose from. */
+    std::vector<FUPool*> fuPools;
+
+    /** Back pointer to CPU. */
+    O3CPU *cpu;
+
+  public:
+    CompleteBypassFUPoolsStrategy(O3CPU *cpu_ptr, DerivO3CPUParams *)
+    {
+        cpu = cpu_ptr;
+        fuPools = cpu->fuPools;
+        assert(fuPools.size() != 0);
+    }
+
+    struct InstructionRecord
+    {
+        void resultOnFuPool(int, Cycles) { }
+    };
+
+    ReserveResult reserveFU(OpClass op_class, DynInstPtr issuing_inst)
+    {
+        ReserveResult result;
+        result.poolIdx = -2;
+        result.fuIdx = FUPool::NoCapableFU;
+        result.opLatency = Cycles(1);
+        result.bypassStatus = BypassStatus::NotNeeded;
+
+        bool success = false;
+        const int start = int(this->fuPools.size()) - 1;
+        for (int i = start; !success && i >= 0; --i) {
+            success = this->tryFU(&result, op_class, i);
+        }
+        return result;
+    }
+
+  protected:
+    bool tryFU(ReserveResult *result, OpClass op_class, int pool_idx)
+    {
+        FUPool* pool = fuPools[pool_idx];
+        const int idx = pool->getUnit(op_class);
+        if (idx >= 0) {
+            result->fuIdx = idx;
+            result->poolIdx = pool_idx;
+            result->opLatency = pool->getOpLatency(op_class);
+            return true;
+        } else if (idx == FUPool::NoFreeFU) {
+            result->fuIdx = idx;
+        }
+        return false;
+    };
+};
+#endif /* !__CPU_O3_FU_POOLS_STRATEGY_HH__ */
diff --git a/src/cpu/o3/iew.hh b/src/cpu/o3/iew.hh
index 7f3409e..5d886e9 100644
--- a/src/cpu/o3/iew.hh
+++ b/src/cpu/o3/iew.hh
@@ -359,8 +359,9 @@
     /** Load / store queue. */
     LSQ ldstQueue;

-    /** Pointer to the functional unit pool. */
-    FUPool *fuPool;
+    /** Function unit pools. */
+    std::vector<FUPool*> fuPools;
+
     /** Records if the LSQ needs to be updated on the next cycle, so that
      * IEW knows if there will be activity on the next cycle.
      */
diff --git a/src/cpu/o3/iew_impl.hh b/src/cpu/o3/iew_impl.hh
index 99dfd19..3e16ab7 100644
--- a/src/cpu/o3/iew_impl.hh
+++ b/src/cpu/o3/iew_impl.hh
@@ -68,7 +68,7 @@
       cpu(_cpu),
       instQueue(_cpu, this, params),
       ldstQueue(_cpu, this, params),
-      fuPool(params->fuPool),
+      fuPools(_cpu->fuPools),
       commitToIEWDelay(params->commitToIEWDelay),
       renameToIEWDelay(params->renameToIEWDelay),
       issueToExecuteDelay(params->issueToExecuteDelay),
@@ -79,6 +79,8 @@
       wbWidth(params->wbWidth),
       numThreads(params->numThreads)
 {
+    assert(fuPools.size() != 0);
+
     if (dispatchWidth > Impl::MaxWidth)
         fatal("dispatchWidth (%d) is larger than compiled limit (%d),\n"
              "\tincrease MaxWidth in src/cpu/o3/impl.hh\n",
@@ -407,11 +409,13 @@
         drained = drained && dispatchStatus[tid] == Running;
     }

-    // Also check the FU pool as instructions are "stored" in FU
+    // Also check the FU pools as instructions are "stored" in FU
     // completion events until they are done and not accounted for
     // above
-    if (drained && !fuPool->isDrained()) {
-        DPRINTF(Drain, "FU pool still busy.\n");
+    bool fu_pools_drained = true;
+    for (FUPool* pool : fuPools) fu_pools_drained &= pool->isDrained();
+    if (drained && !fu_pools_drained) {
+        DPRINTF(Drain, "Some FU pools still busy.\n");
         drained = false;
     }

@@ -439,7 +443,7 @@

     instQueue.takeOverFrom();
     ldstQueue.takeOverFrom();
-    fuPool->takeOverFrom();
+    for (FUPool* fuPool : fuPools) fuPool->takeOverFrom();

     startupStage();
     cpu->activityThisCycle();
@@ -1512,7 +1516,7 @@
     sortInsts();

     // Free function units marked as being freed this cycle.
-    fuPool->processFreeUnits();
+    for (FUPool* fuPool : fuPools) fuPool->processFreeUnits();

     list<ThreadID>::iterator threads = activeThreads->begin();
     list<ThreadID>::iterator end = activeThreads->end();
diff --git a/src/cpu/o3/impl.hh b/src/cpu/o3/impl.hh
index b7f43c5..0a176b7 100644
--- a/src/cpu/o3/impl.hh
+++ b/src/cpu/o3/impl.hh
@@ -68,6 +68,9 @@
     /** The O3CPU type to be used. */
     typedef FullO3CPU<O3CPUImpl> O3CPU;

+    /** Model for bypassing between functional units. */
+    typedef DefaultFUPoolsStrategy<O3CPUImpl> FUPoolsStrategy;
+
     /** Same typedef, but for CPUType.  BaseDynInst may not always use
      * an O3 CPU, so it's clearer to call it CPUType instead in that
      * case.
diff --git a/src/cpu/o3/inst_queue.hh b/src/cpu/o3/inst_queue.hh
index f9e2966..76bbaf5 100644
--- a/src/cpu/o3/inst_queue.hh
+++ b/src/cpu/o3/inst_queue.hh
@@ -59,6 +59,7 @@
 struct DerivO3CPUParams;
 class FUPool;
 class MemInterface;
+struct ReserveResult;

 /**
  * A standard instruction queue class.  It holds ready instructions, in
@@ -75,7 +76,6 @@
  * requiring IEW to be able to peek into the IQ. At the end of the execution
  * latency, the instruction is put into the queue to execute, where it will
  * have the execute() function called on it.
- * @todo: Make IQ able to handle multiple FU pools.
  */
 template <class Impl>
 class InstructionQueue
@@ -84,6 +84,7 @@
     //Typedefs from the Impl.
     typedef typename Impl::O3CPU O3CPU;
     typedef typename Impl::DynInstPtr DynInstPtr;
+    typedef typename Impl::FUPoolsStrategy FUPoolsStrategy;

     typedef typename Impl::CPUPol::IEW IEW;
     typedef typename Impl::CPUPol::MemDepUnit MemDepUnit;
@@ -271,6 +272,11 @@
     void printInsts();

   private:
+    /** Given that this instruction has the given OpClass and is ready
+     * to execute, select a suitable FU from an FU pool and mark it as
+     * busy (unless no FU is needed). */
+    ReserveResult reserveFU(OpClass opClass, DynInstPtr inst);
+
     /** Does the actual squashing. */
     void doSquash(ThreadID tid);

@@ -303,8 +309,11 @@
     /** Wire to read information from timebuffer. */
     typename TimeBuffer<TimeStruct>::wire fromCommit;

-    /** Function unit pool. */
-    FUPool *fuPool;
+    /** Function unit pools. */
+    std::vector<FUPool*> fuPools;
+
+    /** FU pool selection strategy / bypass network. */
+    FUPoolsStrategy fuPoolsStrategy;

     //////////////////////////////////////
     // Instruction lists, ready queues, and ordering
@@ -543,6 +552,13 @@
     Stats::Scalar intAluAccesses;
     Stats::Scalar fpAluAccesses;
     Stats::Scalar vecAluAccesses;
+
+    // Register bypassing stats.
+    Stats::Scalar instsWithBypassing;
+    Stats::Scalar instsWithoutBypassing;
+    Stats::Scalar congestionBypassFails;
+    Stats::Scalar capabilityBypassFails;
+    Stats::Scalar confluenceBypassFails;
 };

 #endif //__CPU_O3_INST_QUEUE_HH__
diff --git a/src/cpu/o3/inst_queue_impl.hh b/src/cpu/o3/inst_queue_impl.hh
index 67b1108..0e37f4f 100644
--- a/src/cpu/o3/inst_queue_impl.hh
+++ b/src/cpu/o3/inst_queue_impl.hh
@@ -43,12 +43,17 @@
 #define __CPU_O3_INST_QUEUE_IMPL_HH__

 #include <limits>
+#include <sstream>
+#include <string>
 #include <vector>

 #include "base/logging.hh"
 #include "cpu/o3/fu_pool.hh"
+#include "cpu/o3/fu_pools_strategy.hh"
 #include "cpu/o3/inst_queue.hh"
 #include "debug/IQ.hh"
+#include "debug/IQFU.hh"
+#include "debug/SIMDFU.hh"
 #include "enums/OpClass.hh"
 #include "params/DerivO3CPU.hh"
 #include "sim/core.hh"
@@ -86,13 +91,18 @@
                                          DerivO3CPUParams *params)
     : cpu(cpu_ptr),
       iewStage(iew_ptr),
-      fuPool(params->fuPool),
+      fuPools(cpu_ptr->fuPools),
+      fuPoolsStrategy(cpu_ptr, params),
       iqPolicy(params->smtIQPolicy),
       numEntries(params->numIQEntries),
       totalWidth(params->issueWidth),
       commitToIEWDelay(params->commitToIEWDelay)
 {
-    assert(fuPool);
+    assert(fuPools.size() != 0);
+    for (FUPool* pool : fuPools) {
+        assert(pool != nullptr);
+    }

     numThreads = params->numThreads;

@@ -387,6 +397,30 @@
         .desc("Number of vector alu accesses")
         .flags(total);

+    instsWithBypassing
+        .name(name() + ".insts_with_bypassing")
+        .desc("instructions executed with bypassed operands");
+
+    instsWithoutBypassing
+        .name(name() + ".insts_without_bypassing")
+        .desc("instructions executed without any bypassed operands");
+
+    congestionBypassFails
+        .name(name() + ".congestion_bypass_fails")
+        .desc("instructions executed without bypassing because "
+              "the FU pool producing the last-arriving operand was busy.");
+
+    capabilityBypassFails
+        .name(name() + ".capability_bypass_fails")
+        .desc("instructions executed without bypassing because "
+              "the FU pool producing the last-arriving operand was not "
+              "capable of executing the instruction.");
+
+    confluenceBypassFails
+        .name(name() + ".confluence_bypass_fails")
+        .desc("instructions executed without bypassing because "
+              "there were multiple last-arriving operands produced "
+              "on different FU pools.");
 }

 template <class Impl>
@@ -755,8 +789,9 @@
     --wbOutstanding;
     iewStage->wakeCPU();

-    if (fu_idx > -1)
-        fuPool->freeUnitNextCycle(fu_idx);
+    if (fu_idx > -1) {
+        fuPools[inst->getExecuteFuPoolIndex()]->freeUnitNextCycle(fu_idx);
+    }

     // @todo: Ensure that these FU Completions happen at the beginning
     // of a cycle, otherwise they could add too many instructions to
@@ -833,12 +868,7 @@
             continue;
         }

-        int idx = FUPool::NoCapableFU;
-        Cycles op_latency = Cycles(1);
-        ThreadID tid = issuing_inst->threadNumber;
-
         if (op_class != No_OpClass) {
-            idx = fuPool->getUnit(op_class);
             if (issuing_inst->isFloating()) {
                 fpAluAccesses++;
             } else if (issuing_inst->isVector()) {
@@ -846,28 +876,76 @@
             } else {
                 intAluAccesses++;
             }
-            if (idx > FUPool::NoFreeFU) {
-                op_latency = fuPool->getOpLatency(op_class);
-            }
+        }
+        ThreadID tid = issuing_inst->threadNumber;
+
+        // Try to reserve an FU to execute this instruction.
+        ReserveResult reserveResult =
+            fuPoolsStrategy.reserveFU(op_class, issuing_inst);
+        // Index/number of the FU Pool that will be chosen to execute
+        // this instruction.
+        int pool_idx = reserveResult.poolIdx;
+        // Index of the FU within said FU Pool that's chosen to execute
+        // this instruction.
+        int fu_idx = reserveResult.fuIdx;
+        // Latency of this instruction's execution on the chosen FU.
+        Cycles op_latency = reserveResult.opLatency;
+
+        // Collect statistics on bypassing or failure thereof.
+        switch (reserveResult.bypassStatus) {
+            case BypassStatus::Bypassed:
+                instsWithBypassing++;
+                break;
+            case BypassStatus::Congestion:
+                congestionBypassFails++;
+                break;
+            case BypassStatus::Capability:
+                capabilityBypassFails++;
+                break;
+            case BypassStatus::Confluence:
+                confluenceBypassFails++;
+                break;
+            case BypassStatus::NotNeeded:
+                instsWithoutBypassing++;
+                break;
+            default:
+                assert(0);
+                break;
         }

-        // If we have an instruction that doesn't require a FU, or a
-        // valid FU, then schedule for execution.
-        if (idx != FUPool::NoFreeFU) {
+        // Schedule if a needed FU is found. If we have an instruction
+        // that doesn't require a FU, or no FU pool has the needed
+        // capability for this instruction, then also schedule for
+        // execution (indicated by fu_idx == FUPool::NoCapableFU).
+        if (fu_idx != FUPool::NoFreeFU) {
+            FUPool* fuPool = nullptr;
+            if (fu_idx != FUPool::NoCapableFU) {
+                assert(pool_idx >= 0);
+                fuPool = fuPools[pool_idx];
+            } else {
+                // fu_idx == NoCapableFU here means either that no
+                // pool can execute this op class, or that the
+                // instruction needs no FU at all (op_class ==
+                // No_OpClass). In both cases, arbitrarily fall back
+                // to pool 0, matching the single-pool behaviour
+                // before this change.
+                pool_idx = -2;
+                fuPool = fuPools[0];
+            }
+
             if (op_latency == Cycles(1)) {
                 i2e_info->size++;
                 instsToExecute.push_back(issuing_inst);

                 // Add the FU onto the list of FU's to be freed next
                 // cycle if we used one.
-                if (idx >= 0)
-                    fuPool->freeUnitNextCycle(idx);
+                if (fu_idx >= 0)
+                    fuPool->freeUnitNextCycle(fu_idx);
             } else {
                 bool pipelined = fuPool->isPipelined(op_class);
                 // Generate completion event for the FU
                 ++wbOutstanding;
                 FUCompletion *execution = new FUCompletion(issuing_inst,
-                                                           idx, this);
+                                                           fu_idx, this);

                 cpu->schedule(execution,
                               cpu->clockEdge(Cycles(op_latency - 1)));
@@ -878,14 +956,18 @@
                     execution->setFreeFU();
                 } else {
                     // Add the FU onto the list of FU's to be freed next cycle.
-                    fuPool->freeUnitNextCycle(idx);
+                    fuPool->freeUnitNextCycle(fu_idx);
                 }
             }

             DPRINTF(IQ, "Thread %i: Issuing instruction PC %s "
-                    "[sn:%llu]\n",
+                    "[sn:%llu] bypassing %s FU Pool %i\n",
                     tid, issuing_inst->pcState(),
-                    issuing_inst->seqNum);
+                    issuing_inst->seqNum,
+                    (reserveResult.bypassStatus != BypassStatus::NotNeeded
+                        ? "yes" : "no"),
+                    pool_idx
+                    );

             readyInsts[op_class].pop();

@@ -896,7 +978,8 @@
                 queueOnList[op_class] = false;
             }

-            issuing_inst->setIssued();
+            // Mark the instruction as issued & record the FU Pool used.
+            issuing_inst->setIssuedFuPool(pool_idx);
             ++total_issued;

 #if TRACING_ON
@@ -916,6 +999,10 @@
             listOrder.erase(order_it++);
             statIssuedInstType[tid][op_class]++;
         } else {
+            // TODO: Now we might have an "FU Busy" stat increase
+            // because a needed source reg is produced on another FU
+            // and could not be bypassed to an available FU this
+            // cycle. Collect statistics on this?
             statFuBusy[op_class]++;
             fuBusy[tid]++;
             ++order_it;
@@ -926,9 +1013,9 @@
     iqInstsIssued+= total_issued;

     // If we issued any instructions, tell the CPU we had activity.
-    // @todo If the way deferred memory instructions are handeled due to
-    // translation changes then the deferredMemInsts condition should be removed
-    // from the code below.
+    // @todo If the way deferred memory instructions are handled due
+    // to translation changes then the deferredMemInsts condition
+    // should be removed from the code below.
     if (total_issued || !retryMemInsts.empty() || !deferredMemInsts.empty()) {
         cpu->activityThisCycle();
     } else {
@@ -1048,6 +1135,28 @@
         //Go through the dependency chain, marking the registers as
         //ready within the waiting instructions.
         DynInstPtr dep_inst = dependGraph.pop(dest_reg->flatIndex());
+        const int pool_idx = completed_inst->fuPoolNotUsed()
+            ? -1
+            : completed_inst->getExecuteFuPoolIndex();
+
+        // Tracing code; skip me.
+        if (DTRACE(IQFU) ||
+            (DTRACE(SIMDFU) && isSimdOpClass(completed_inst->opClass()))) {
+            using namespace Debug;
+            if (pool_idx < 0) {
+                DPRINTF_UNCONDITIONAL(IQ,
+                    "cycleCounter %llu: r%d produced without FU.\n",
+                    (unsigned long long)(cpu->cycleCounter),
+                    int(dest_reg->index()));
+            }
+            else {
+                DPRINTF_UNCONDITIONAL(IQ,
+                    "cycleCounter %llu: r%d produced on FU pool %d.\n",
+                    (unsigned long long)(cpu->cycleCounter),
+                    int(dest_reg->index()),
+                    int(pool_idx));
+            }
+        }

         while (dep_inst) {
             DPRINTF(IQ, "Waking up a dependent instruction, [sn:%llu] "
@@ -1057,7 +1166,12 @@
             // so that it knows which of its source registers is
             // ready.  However that would mean that the dependency
             // graph entries would need to hold the src_reg_idx.
-            dep_inst->markSrcRegReady();
+            if (pool_idx < 0) {
+                dep_inst->markSrcRegReady();
+            }
+            else {
+                dep_inst->markSrcRegReadyFuPool(pool_idx, cpu->cycleCounter);
+            }

             addIfReady(dep_inst);

@@ -1326,7 +1440,7 @@

             // @todo: Remove this hack where several statuses are set so the
             // inst will flow through the rest of the pipeline.
-            squashed_inst->setIssued();
+            squashed_inst->setIssuedFuPool(-2);
             squashed_inst->setCanCommit();
             squashed_inst->clearInIQ();


--
To view, visit https://gem5-review.googlesource.com/c/public/gem5/+/27767
To unsubscribe, or for help writing mail filters, visit https://gem5-review.googlesource.com/settings

Gerrit-Project: public/gem5
Gerrit-Branch: develop
Gerrit-Change-Id: Ibf39378c5af5ac4352a5d8ba1087417e2279234f
Gerrit-Change-Number: 27767
Gerrit-PatchSet: 1
Gerrit-Owner: David Zhao Akeley <dza...@gmail.com>
Gerrit-MessageType: newchange
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev

Reply via email to