-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2120/
-----------------------------------------------------------
(Updated Dec. 16, 2013, 10:28 a.m.)


Review request for Default.


Changes
-------

Additional testing.


Repository: gem5


Description
-------

sim: Fix a bug when scaling a clock domain's frequency

A SrcClockDomain (src/sim/clock_domain.hh) can change its clock period
through the clockPeriod(Tick) function. However, when a SrcClockDomain
changes its clock period, the ClockedObjects in that domain may be
inactive. When these ClockedObjects wake up, they update their internal
tick incorrectly, leading to incorrect latency calculations throughout
the system. For example, a 2-cycle L1 cache hit latency can turn into
3 cycles, causing significant slowdowns in simulated runtime after the
frequency change (4% to 20% slowdowns in my experiments).

This patch fixes the bug by adding a member list to the ClockDomain:
ClockedObjects register themselves with their clock domain at
construction time and are added to the member list. Using this list, a
clock domain can update each member's tick to curTick() before it
modifies its clock period.

Diffed from Changeset 9993
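In outline, the fix looks like the following. This is a minimal,
self-contained sketch of the approach, not the actual diff (which lives
in the files listed under Diffs below); names such as "members" and
"registerObject" are illustrative::

    #include <cstdint>
    #include <vector>

    using Tick = std::uint64_t;
    using Cycles = std::uint64_t;

    Tick _now = 0;
    Tick curTick() { return _now; }   // stand-in for gem5's curTick()

    struct ClockedObject;

    struct ClockDomain
    {
        Tick period;
        std::vector<ClockedObject *> members;   // new: domain member list

        explicit ClockDomain(Tick p) : period(p) {}
        void registerObject(ClockedObject *obj) { members.push_back(obj); }
    };

    struct ClockedObject
    {
        ClockDomain &domain;
        Tick tick = 0;      // time of the last clock edge this object saw
        Cycles cycle = 0;

        explicit ClockedObject(ClockDomain &d) : domain(d)
        {
            d.registerObject(this);   // new: register at construction time
        }

        // Same recalculation as update() in clocked_object.hh; it is only
        // correct while the period is unchanged since 'tick' was last
        // brought up to date.
        void update()
        {
            if (curTick() <= tick)
                return;
            Cycles elapsed =
                (curTick() - tick + domain.period - 1) / domain.period;
            cycle += elapsed;
            tick += elapsed * domain.period;
        }
    };

    struct SrcClockDomain : ClockDomain
    {
        using ClockDomain::ClockDomain;

        void clockPeriod(Tick new_period)
        {
            // The fix: bring every member up to the current tick while
            // the old period is still in effect, so a sleeping object
            // never recomputes elapsed cycles across the change with
            // the wrong period.
            for (auto *m : members)
                m->update();
            period = new_period;
        }
    };

Updating the members before the period changes means that update()'s
divCeil() arithmetic never spans a frequency change.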
Diffs
-----

  src/sim/clocked_object.hh bdd606534bdc
  src/sim/clock_domain.cc bdd606534bdc
  src/sim/clock_domain.hh bdd606534bdc

Diff: http://reviews.gem5.org/r/2120/diff/


Testing (updated)
-------

Purely to demonstrate this bug, I've added the following code, using
the inorder CPU.

In src/sim/stat_control.cc (changed precision from 6 to 12 for more
detail)::

    simSeconds
        .name("sim_seconds")
        .desc("Number of seconds simulated")
        .precision(12)
        ;

In src/sim/clocked_object.hh (added a function for a clocked object to
get a pointer to its clock domain)::

    inline ClockDomain* getClockDomain() const
    {
        ClockDomain *c = &clockDomain;
        return c;
    }

In src/cpu/inorder/InOrderCPU.py (changed stageWidth to 1 to make
dumps easier to read)::

    stageWidth = Param.Unsigned(1, "Stage width")

In src/cpu/inorder/cpu.hh (created a pointer to a SrcClockDomain to
control the CPU's clock period from within the CPU)::

    SrcClockDomain *clockDomain;

In src/cpu/inorder/cpu.cc (let the CPU constructor set the clock
domain pointer)::

    InOrderCPU::InOrderCPU(Params *params)
        : (...)
    {
        (...)
        clockDomain = (SrcClockDomain*) getClockDomain();
    }

In src/cpu/inorder/cpu.cc (after 5000 cycles, change the clock period
to 1999 ps)::

    void InOrderCPU::tick()
    {
        (...)
        if (numCycles.value() > 5000) {
            clockDomain->clockPeriod(1999);
        }
    }

After compiling these testing changes, I run MIPS hello world at
500 MHz (2000 ps)::

    $ ./build/MIPS/gem5.opt configs/example/se.py \
        -c tests/test-progs/hello/bin/mips/linux/hello \
        --cpu-type=inorder --caches --cpu-clock=500MHz \
        --l1d_size=16kB --l1i_size=16kB --l1d_assoc=1 --l1i_assoc=1 \
        --cacheline_size=32 --num-cpus=1

Results:

- Without the patch, simSeconds is 0.000040804594.
- With the patch, simSeconds is 0.000038657668.

The patch speeds up simulated runtime by 5.6% here. You can check the
Exec debug-flag dump to see that memory instructions complete with
different latencies with and without the patch.

For two instructions in the middle of the simulation, without the
patch, the time difference is 5997 ps (3 cycles)::

    10073967: system.cpu T0 : @_int_malloc+992 : srl r9, r21 : IntAlu : D=0x0000000000000000
    10079964: system.cpu T0 : @_int_malloc+996 : lw r25, -32740(r28) : MemRead : D=0x0000000010000000 A=0x10004fac

For the same instructions with the patch, the time difference is
3998 ps (2 cycles)::

    10071968: system.cpu T0 : @_int_malloc+992 : srl r9, r21 : IntAlu : D=0x0000000000000000
    10075966: system.cpu T0 : @_int_malloc+996 : lw r25, -32740(r28) : MemRead : D=0x0000000010000000 A=0x10004fac

The correct L1 hit latency is 2 cycles.

For more detail, look at update() in src/sim/clocked_object.hh to see
how the tick is incorrectly updated after a clocked object wakes up.
Here's the relevant section::

    void update() const
    {
        (...)

        // if not, we have to recalculate the cycle and tick, we
        // perform the calculations in terms of relative cycles to
        // allow changes to the clock period in the future
        Cycles elapsedCycles(divCeil(curTick() - tick, clockPeriod()));
        cycle += elapsedCycles;
        tick += elapsedCycles * clockPeriod();
    }

This part of update() only executes when a clocked object has been
inactive and has not updated its tick for a while. The code advances
the tick using the current clockPeriod() -- in other words, it assumes
the clock period has not changed while the object was asleep.

In the hello world example, the dcache falls out of sync with the rest
of the system. You can see this by adding a printout in
src/sim/clocked_object.hh::

    inline Tick clockEdge(Cycles cycles = Cycles(0)) const
    {
        (update ...)
        std::cout << name() << " tick is ";
        std::cout << std::dec << tick << std::endl;
        (return ...)
    }

Recompile, re-simulate, and go to the same two instructions we looked
at earlier.

Without the patch::

    10073967: system.cpu T0 : @_int_malloc+992 : srl r9, r21 : IntAlu : D=0x0000000000000000
    system.cpu.dcache tick is 10075945
    system.cpu.dcache tick is 10075945
    system.cpu.dcache tick is 10075945
    system.cpu tick is 10073967
    system.cpu tick is 10075966
    system.cpu tick is 10077965
    10079964: system.cpu T0 : @_int_malloc+996 : lw r25, -32740(r28) : MemRead : D=0x0000000010000000 A=0x10004fac

The CPU's lw graduates at tick 10079964, while the dcache's tick is
10075945. The difference is 4019 ps, which is not a multiple of the
1999 ps clock period. This misalignment causes the L1 hit latency to
increase from 2 cycles to 3 cycles.

With the patch::

    10071968: system.cpu T0 : @_int_malloc+992 : srl r9, r21 : IntAlu : D=0x0000000000000000
    system.cpu.dcache tick is 10071968
    system.cpu.dcache tick is 10071968
    system.cpu.dcache tick is 10071968
    system.cpu tick is 10071968
    system.cpu tick is 10073967
    10075966: system.cpu T0 : @_int_malloc+996 : lw r25, -32740(r28) : MemRead : D=0x0000000010000000 A=0x10004fac

The CPU's lw graduates at tick 10075966, while the dcache's tick is
10071968. The difference is 3998 ps, which is exactly 2 cycles. Since
the cache and the CPU stay aligned, the L1 hit latency remains 2
cycles even after frequency scaling.
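To make the rounding concrete, here is a quick standalone check that
mirrors the divCeil() arithmetic in update(); the tick constants are
copied from the traces above, and nothing else here is gem5 code::

    #include <cstdint>
    #include <iostream>

    using Tick = std::uint64_t;

    // Same rounding as gem5's divCeil().
    static Tick divCeil(Tick a, Tick b) { return (a + b - 1) / b; }

    int main()
    {
        const Tick period = 1999;   // ps, after the frequency change

        // Without the patch: lw graduation tick vs. dcache tick.
        std::cout << divCeil(10079964 - 10075945, period) << '\n';   // 3

        // With the patch: the gap is an exact multiple of the period.
        std::cout << divCeil(10075966 - 10071968, period) << '\n';   // 2
    }

The misaligned 4019 ps gap rounds up to 3 cycles, while the aligned
3998 ps gap maps to exactly 2.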
Thanks,

Christopher Torng