Re: [gem5-users] A Patch for DRAMsim2 Integration

Gabriel Michael Black Fri, 04 May 2012 05:53:54 -0700

I haven't had a chance to study what's going on here, but could the
problem be that we don't have bandwidth limits/back pressure
implemented for the TLB and delayed translation? It could be that the
CPU is pumping instructions into translation which eventually drain
out/are squashed, and if too many accumulate they trip that assert.

That may not actually make any sense as far as what the code is
actually doing, but it occurred to me as a possibility and I thought
I'd throw it out there.
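
To make the back-pressure idea concrete, here is a toy Python model, not gem5 code. The issue rate, latency and limit are made-up illustration parameters, and the function name is hypothetical. It only shows that without a cap the number of translations in flight grows to roughly issue rate times latency, while a cap bounds it.

```python
from collections import deque

def peak_outstanding(issue_per_cycle, latency, cycles, limit=None):
    """Return the peak number of translations in flight.

    All parameters are illustrative assumptions, not gem5 options.
    """
    in_flight = deque()  # completion cycle of each outstanding translation
    peak = 0
    for now in range(cycles):
        # Retire translations whose latency has elapsed.
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()
        for _ in range(issue_per_cycle):
            # Back pressure: refuse new requests once the limit is reached.
            if limit is not None and len(in_flight) >= limit:
                break
            in_flight.append(now + latency)
        peak = max(peak, len(in_flight))
    return peak

no_limit = peak_outstanding(issue_per_cycle=4, latency=500, cycles=2000)
with_limit = peak_outstanding(issue_per_cycle=4, latency=500, cycles=2000, limit=64)
```

With a long enough latency the uncapped peak passes 1500, which is the kind of pile-up I'm speculating about; with a short latency it never would.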


Gabe

Quoting Andrew Cebulski <af...@drexel.edu>:

I double-checked by looking at the config.ini file.  It turns out I did
actually create the checkpoint with an Atomic CPU without caches.  Sorry
for the confusion.

-Andrew

On Wed, May 2, 2012 at 10:12 PM, Andrew Cebulski <af...@drexel.edu> wrote:
I started hitting this assertion (that the number of insts in flight was >
1500) before I started using a checkpoint.  I created the checkpoint
afterwards to decrease the time needed to run simulations to debug this
problem.  I'll create a new checkpoint, then send the new trace output.

-Andrew


On Wed, May 2, 2012 at 9:53 PM, Ali Saidi <sa...@umich.edu> wrote:

It's likely the cause for all of your problems. Dirty data in the caches
doesn't get restored either.  You should always create checkpoints with an
atomic cpu and without caches.



Ali



On 02.05.2012 21:23, Andrew Cebulski wrote:

Sorry, I created the checkpoint I referred to with an O3 CPU with caches.
 From what I recall reading, caches don't get restored from checkpoints.
 Since the checkpoint wasn't during the benchmark run, I assumed that was
okay.
-Andrew

On Wed, May 2, 2012 at 9:07 PM, Ali Saidi <sa...@umich.edu> wrote:
 You haven't answered the question about if you created the checkpoints
with an atomic cpu without caches.

Ali





On 02.05.2012 19:58, Andrew Cebulski wrote:

I have not run with the checker CPU recently.  Here's the stderr output
from a run I did awhile back:
http://dl.dropbox.com/u/2953302/gem5/err.0
Note that the instruction match error is before my benchmark actually
starts running.  The start of my boot script checks to see if my files
image is mounted (which it is), then continues on to run the benchmark.  I
booted the system, mounted my files image, then took a checkpoint.  I've
been running all my tests from that checkpoint. I found where my benchmark
started based on the ASID (from ExecAsid debug flag).
I delayed the start of gathering trace data until the second-to-last
linear increase in dynamic instructions in-flight. I'm running a new trace
now.
-Andrew


On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <sa...@umich.edu> wrote:
 Something is wrong well before this point. There is no reason that
address 0x0 or 0x4 should be translated.

Did you happen to create a checkpoint when caches were in the system?

Have you tried to run with the checker cpu and see if it detects any
errors?



Ali





On 02.05.2012 17:22, Andrew Cebulski wrote:

They are data TLB misses that occur as the in-flight instruction count
rises (at 0x0 and 0x4). The last TLB miss before the in-flight instruction
count finally linearly decreases is to 0x200.  Also, at the start of the
rising slope, I see a miss to 0x8 and 0x2508c.
 Here's a trace file:
http://dl.dropbox.com/u/2953302/gem5/tlb.out
To reduce size, I just have lines that have either TLB or walker in
them.
I do see only a handful of instruction TLB misses.
 -Andrew

On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <sa...@umich.edu> wrote:
 Hi Andrew,



Thanks for digging into this. I think there is an issue somewhere, but
I'm still not sure where.

Ali

On 01.05.2012 23:34, Andrew Cebulski wrote:

Okay, I'm positive now that the issue lies with delayed translations
that are squashed before finishing.

 On the data or instruction side? You seem to allude to data in the
paragraph below, but then instructions in the latter text.

 It seems to me like speculative load/stores are being executed,
rather than waiting for the instructions to commit. Once the instructions
begin getting (speculatively) executed in the TLB, a reference is left
there, which seems hard to root out and dereference after the instruction
ends up being squashed. At least, I have not been able to find that out in
the source code as of yet.  Can anyone clarify on this?



 There should only be one translation outstanding from each
instruction and data side walker. Any nested transactions should be queued
in the walker. Until one finishes, I'm not sure how multiple would ever be
outstanding.

Recall the following image that shows how the number of dynamic
instruction (DynInst) objects in-flight increases linearly for varying
periods of time:
http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
After enabling the TLB debug flag, I see that the linear increase in
instructions in flight is proportional to the number of TLB misses. These
TLB misses have a much larger delay (resulting in translation delays) due
to the fact that DramSim2 models the memory system more accurately.  It
seems that with the classic memory system, TLB misses often do not have
translation delays.  For whatever reason, it would also seem that every
instruction that has a TLB miss also is eventually squashed...

 From a data side perspective this is reasonable. While a miss is
outstanding at some point instructions will stop committing and thus the
instructions in flight will begin to rise until the miss is satisfied.

 Here's a summary of outputs from my trace.  These two DPRINTF
messages appear on the rising slopes (repeated up until the peak):
TLB Miss: Starting hardware table walker for 0(656)
TLB Miss: Starting hardware table walker for 0x4(656)

 This is interesting/odd. I don't know a good reason why (1) a miss
would be outstanding to both address 0 and address 4 at the same time. In
almost all cases these pages are marked as no-access to detect segfaults.
Perhaps there is an issue where the cpu is getting into a loop faulting on
a bad access and then faulting again on the fault handler. I could imagine
this would happen if there was some corruption in the memory system (for
example the timings in dramsim exposing a bug in the cache models or
something).


At the peak, the following message appears (from fetch) almost every
tick for (what I believe to be) every single one of the table walkers that
were squashed.
Fetch is waiting ITLB walk to finish!

 There must be another walk in flight? The instruction side will only
have one fault outstanding at once. Successive branch mispredicts will
re-direct fetch but there is code that catches the fact that a different
walk completed than expected and "does the right thing."

 The problem is that these ITLB table walks are for instructions that
were squashed as much as 0.3 billion cycles earlier, and have since been
removed from the CPU's instruction list.

 I'm not following here.

 Any help will be greatly appreciated in solving this problem.  I've
hit a roadblock with getting Ruby working with ARM, most likely due to the
fact that ARM has disjoint memory (x86 and Alpha do not). There's the 256
MB for physical memory, then the 64 MB for the boot loader. I brought this
up in my last email about trying to get Ruby working.  Therefore, I'm
trying to get this DramSim2 integration fixed so I can start modeling FS
with DRAM memory.

 Brad/Steve/Nilay anyone have a suggestion on how to make this work?


Note that these problems also occur in Soplex from the SPEC CPU2006
benchmark suite (it also hits the 1500 in-flight instructions assertion).
Due to time constraints, I haven't tested on other benchmarks.
Thanks,
Andrew
On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <af...@drexel.edu> wrote:
 Hey Gabe,
    Thanks for this...very helpful.  I just recently got back into
debugging this problem.  I made a small change in src/base/refcnt.hh to
allow me to return the current count of references to a DynInst object.
    I then modified existing DPRINTFs to also print out reference
counts, then added some of my own when I needed extra visibility.
    I've found one memory store instruction that seems to be getting
lost. What's happening is that it progresses as far as getting executed in
the IEW once, but a delayed translation occurs, deferring the store. By
the time it reenters the IEW, the IQ has marked the instruction as
squashed.  Everything progresses as usual from here on out, with one
exception. When the instruction is removed from the CPU's instruction list,
there is one reference count hanging.
    I've added in some additional debugging for my traces to help
narrow down where this reference is coming from.  As far as I can tell,
it's because of a call to initiateAcc() within the executeStore function in
the lsq unit. Please see the following two traces. The first trace shows
what I just discussed.  The second trace is another memory store
instruction that got squashed, however, it was squashed upon its first
entry into the IEW, therefore it never started execution.
http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
    Let me know if you have any ideas based on these two instruction
traces.  I do not understand how the initiateAcc function results in
another reference, but maybe someone else does.... Since I don't see how
it makes a reference, it's hard to find out how to make sure it gets
dereferenced...
    Unfortunately, I haven't been able to add a DPRINTF in
src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
references/dereferences occur).  Let me know if you have any advice on
this...if it's possible. I can't seem to get the right include files, and
likely the right SConscript compile order...
Thanks,
 Andrew
On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <gbl...@eecs.umich.edu> wrote:
 Without digging into things too deeply, it looks like you may be
leaking references to dynamic instructions. The CPU may think it's done
with one, but until that final reference is removed, the object will hang
around forever. I think I've had problems before where the reference
count ended up off by one somehow and instructions would start piling up.
It's also possible that a clog develops in O3's pipeline and some internal
structure stops letting instructions through and starts accumulating them.
Either of these problems will be annoying to track down, but with enough
digging I've been able to fix these sorts of things.

This may have more to do with O3 not handling the benchmark you're
running well rather than a problem with your new DRAM model. There may be
some interaction between the two, though, where the new memory makes the
timing line up to cause O3 to behave poorly. What you can do is instrument
dynamic instruction creation and destruction and reference counting (try
printing "this" for both the reference counting wrapper and the dyn inst
itself) and turn it on as close as you can to where things go bad tick
wise. Then look for an instruction which gets lost, and look for where its
reference count is incremented and decremented. It should be relatively
easy to pair up where references are created and destroyed, and you should
be able to identify the reference which never goes away. Then you need to
figure out where that reference is being created. After that, you should
have enough information to identify why the reference counting isn't being
done correctly. It's arduous, but that's the only way.
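
As a rough sketch of the bookkeeping described above (Python, not gem5 code): assume the instrumentation output has already been parsed into (tick, instruction id, delta, site) tuples. That tuple layout and the site labels are hypothetical; gem5 does not emit them in this form. The helper reports every instruction whose increments and decrements do not cancel, along with its history for manual pairing.

```python
from collections import defaultdict

def unbalanced(events):
    """Return {inst_id: history} for instructions whose refcount events
    do not sum to zero.

    events: iterable of (tick, inst_id, delta, site), where delta is +1
    for a reference taken and -1 for a reference dropped (assumed format).
    """
    balance = defaultdict(int)
    history = defaultdict(list)
    for tick, inst, delta, site in events:
        balance[inst] += delta
        history[inst].append((tick, delta, site))
    return {inst: history[inst] for inst, n in balance.items() if n != 0}

# Made-up events: instruction "a" keeps one reference taken at site "lsq".
example_events = [
    (10, "a", +1, "fetch"),
    (11, "b", +1, "fetch"),
    (12, "a", +1, "lsq"),
    (20, "a", -1, "fetch"),
    (30, "b", -1, "commit"),
]
leaks = unbalanced(example_events)
```

A positive remainder points at a leaked reference; a negative one would point at the premature-decrement case.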

It's important to also make sure reference counts aren't decremented
to zero prematurely. I had a problem once where that happened and the
memory behind the object was updated by something that didn't know it was
dead. The memory had since been reallocated to another object of the same
type, so that other object reflected what happened to the phantom one. If I
remember that manifested as something weird like an add causing a page
fault or something.

Gabe


On 04/07/12 18:21, Andrew Cebulski wrote:

 Hi all,
I've looked into this problem some more, and have put together a
couple traces.  I've been becoming more familiar with how gem5 handles
dynamic instructions, in particular how it destroys them.  I have two
traces to compare, one with the physical memory, and the other with the
integrated dramsim2 dram memory. I also have two plots showing instruction
counts over time (sim ticks). All of these are linked at the end of the
email.
First, I'm going to go into what I've been able to interpret
regarding how instructions are destroyed. In particular, comparing when
DynInst's vs. DynInstPtr's are deconstructed/removed from the cpu.  I
separate these because I've seen a difference, as I discuss later. These
explanations are fairly non-existent on the wiki.  There is a section
header waiting to be filled...
From what I have been able to gather from the code, there is a list
of all the instructions in flight in cpu/o3/cpu.cc called instList, with
the type DynInstPtr.  There are three conditions to instructions being
cleaned from this list:
1.)  The ROB retires its head instruction
2.)  Fetch receives a rob squashing signal from the commit,
resulting in removing any instruction not in the ROB
3.)  Decode detects an incorrect branch prediction, resulting in
removal of all instructions back to the bad seq num.
Once all five stages have completed, the CPU cleans up all the
removed in-flight instructions.  This line in particular
in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
instList.erase(removeList.front());
When I turn on the debug flag O3CPU, I see the message "Removing
instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
after all 5 cpu stages have completed, and one of the conditions above is
met.  I also see what tick it occurs on.
When I turn on the DynInst debug flag, I see when instructions are
created and destroyed (cpu/base_dyn_inst_impl.hh) and what tick.  From
analyzing the trace files, I've gathered that this takes into account that
instructions have different execution lengths. So if a memory instruction
is removed from the instList (DynInstPtr) at one tick, the destruction of
the DynInst for that memory instruction will occur much later (e.g. 1M
ticks later). I have yet to determine how this is implemented.
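
A toy Python sketch of why removing a counting pointer from a list need not destroy the object it points to. The `Inst` and `Ptr` classes are hypothetical stand-ins written for this illustration, not the gem5 DynInst or the wrapper in src/base/refcnt.hh; the only assumption is that the object is destroyed when its last reference is released.

```python
class Inst:
    """Stand-in for a dynamic instruction with a manual reference count."""
    def __init__(self, seq):
        self.seq = seq
        self.refs = 0
        self.destroyed = False

class Ptr:
    """Stand-in for a counting pointer: holds one reference until released."""
    def __init__(self, inst):
        self.inst = inst
        inst.refs += 1

    def release(self):
        if self.inst is not None:
            self.inst.refs -= 1
            if self.inst.refs == 0:
                # The last reference is gone, so the object goes away now.
                self.inst.destroyed = True
            self.inst = None

inst = Inst(seq=1)
list_ref = Ptr(inst)    # reference held by the in-flight list
other_ref = Ptr(inst)   # reference held by some other structure

list_ref.release()      # removed from the list
after_list_removal = inst.destroyed   # still alive: one reference remains

other_ref.release()     # the other holder lets go later
after_last_release = inst.destroyed
```

If the second holder never releases its reference, the object in this model is never destroyed, which would look like a destroyed-count that lags the list size.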
Now for the problem.
What I'm seeing when I run dramsim2 dram memory is a significant
difference between the size of the instList vector (of DynInstPtr objects),
and the size of dynamic instruction count (of DynInst objects).  The
benchmark I'm running is libquantum from SPEC 2006. For the first roughly
130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
shadows the instList size in o3/cpu.cc (figure linked below) very closely.
 Around tick 130B after libquantum started, it starts hitting what I'm
assuming are loops (therefore branch prediction), resulting in some
behavior that seems to imply improper instruction handling (i.e. more
instructions in flight than allowed by ROB).
I wasn't able to sync-up the physical and dramsim2 traces exactly by
trace, but they should represent roughly the same area of execution. They
don't execute the same due to the dramsim2 modeling the memory differently
(i.e. latency and other delays).
I've shared both traces on my public Dropbox here --

http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz

http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
Here are a couple plots of tick versus instruction count, with
respect to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size()
in cpu/o3/cpu.cc.  --

http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png

http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
Note that I added the printout of the instList size to an existing
O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
Here are the commands I ran to parse the traces into data files to
analyze in MATLAB and create the plots:
zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz \
  | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz \
  | awk '{print $1,$11}' > instlistsize.out
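
For anyone who would rather do the same extraction in Python, here is a minimal sketch of what the two pipelines above do: keep lines containing the given substrings and print whitespace-separated fields 1 and 11. The function names are my own, the sample line is made up, and plain substring matching stands in for grep's regular expressions. Unlike awk, it skips lines with fewer than 11 fields instead of printing an empty column.

```python
import gzip

def extract_columns(lines, *needles, cols=(1, 11)):
    """Mimic `grep needle ... | awk '{print $1,$11}'` on an iterable of lines."""
    out = []
    for line in lines:
        if all(needle in line for needle in needles):
            fields = line.split()
            # awk fields are 1-indexed; skip lines that are too short.
            if len(fields) >= max(cols):
                out.append(tuple(fields[c - 1] for c in cols))
    return out

def extract_from_gz(path, *needles, cols=(1, 11)):
    """Same as extract_columns, reading a gzip-compressed trace file."""
    with gzip.open(path, "rt") as fh:
        return extract_columns(fh, *needles, cols=cols)

# Made-up line with 11 fields, only to show the column selection.
sample = ["100: system.cpu: DynInst: x x x x x x destroyed 42"]
rows = extract_columns(sample, "DynInst", "destroyed")
```

Calling `extract_from_gz` with the trace file name and the strings "DynInst" and "destroyed" would correspond to the first command.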
It seems to me like the problem might lie in gem5, but has just been
exposed by integrating this more detailed memory model, dramsim2, into
gem5. Either that, or there are some timing errors in how dramsim2 was
integrated. I doubt this, however, since those first 190B ticks executed
used the dramsim2 memory. I believe the problem is a combination of memory
instructions + complex loops (branch prediction), resulting in improper
destroying of instructions.
I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
There are 192 ROB entries, which is why the instList size generally has a
max of about 192 instructions. The dynamic instruction counts (seen in the
dramsim2 plot) seem to also imply that instructions are incorrectly being
removed from the ROB, and then from the cpu's instruction list in cpu.cc,
which allows more and more instructions to be added to the system (possibly
from a bad branch).
I appreciate any help in debugging this and further figuring out the
root problem, just let me know if you need anything else from me. I don't
have much more time at the moment to debug, but I can take any advice for
quick changes and/or additional traces, then send the results back to the
list for discussion.
Thanks,
Andrew
P.S. Paul - I did try decreasing the size of the dramsim2
transaction (and even command) queue from 512 to 32. The same instructions
problem occurred.  It basically just decreased the execution time.

On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <sa...@umich.edu> wrote:
 The error is that there are more than 1500 instructions currently
in flight in the system. It could mean several things:

1. The value is somewhat arbitrarily defined and maybe there are
more than 1500 in your system at one time?

2. Instructions aren't being destroyed correctly

You could try to run a debug binary so you'll get a list of
instructions when it happens or increase the number which may
be appropriate for certain situations (but 1500 is quite a few inflight
instructions).
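
To illustrate the second possibility, here is a toy Python model, not the gem5 implementation. It keeps a counter that goes up when an instruction is created and down when it is destroyed, and raises once the counter passes the 1500 limit mentioned above. The class, the function and the `leak_every` parameter are all hypothetical.

```python
MAX_IN_FLIGHT = 1500  # limit quoted above; everything else is illustrative

class InFlightCounter:
    def __init__(self, limit=MAX_IN_FLIGHT):
        self.limit = limit
        self.count = 0

    def create(self):
        self.count += 1
        if self.count > self.limit:
            raise RuntimeError("too many instructions in flight")

    def destroy(self):
        self.count -= 1

def run(n_insts, leak_every=None):
    """Create and destroy n_insts instructions one at a time.

    If leak_every is set, every leak_every-th instruction is never
    destroyed. Returns the final in-flight count.
    """
    counter = InFlightCounter()
    for i in range(1, n_insts + 1):
        counter.create()
        if leak_every is None or i % leak_every != 0:
            counter.destroy()
    return counter.count
```

Even a leak of one instruction in a hundred crosses the limit eventually in this model, which is why a slow pile-up can take a long time to show up as a failure.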

Ali

On 13.03.2012 10:56, Andrew Cebulski wrote:

 Hi Xiangyu,
    I just started looking into this some more.  So at first I
thought it was due to updating to a more recent revision, but then I went
back to revision 8643, added your patch, built and ran....and now get the
error with it too (when running ARM_FS/gem5.opt). I'm testing now to see
if an update to SWIG might have resulted in this error, maybe someone on
the mailing list would know if that's possible. The difference is 1.3.40
vs. 2.0.3, both of which are supported according to the dependencies wiki
page.
Just for completeness, here's the error from revision 8643:
 build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
   I have not tried running with gem5.debug, so I will be doing
that today.  Maybe this is an assertion that is occurring due to an
optimization. That would mean it wouldn't be triggered in gem5.debug since
it runs without optimizations. Have you tested all debug, opt and fast
with your tests?
Thanks,
 Andrew

 On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
riosher...@gmail.com> wrote:
  Hi Andrew,



I didn't see this error in my simulations. May I ask which gem5
version you are using? I find some of the latest code updates do not comply
with my changes. I am still using the DRAMsim2 patch on Gem5 repo 8643, and
have run all the runnable benchmarks in SPEC2006, SPEC2000, EEMBC2, and
PARSEC2 on ARM_SE.



Thank you!



Best,

Xiangyu



*From:* Andrew Cebulski [mailto:af...@drexel.edu]
*Sent:* Thursday, March 08, 2012 6:52 PM

*To:* gem5 users mailing list
*Cc:* riosher...@gmail.com; sa...@umich.edu

*Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration

Xiangyu,

   I've been having an issue recently with the number of
instructions I've been seeing committed to the CPU (I have a separate
thread on this). It turns out the issue seems to be coming from this patch
you created to integrate DramSim2 with Gem5. Unfortunately, I've been
running with gem5.fast, not gem5.opt. So up until now, I haven't been
seeing assertions. I thought I'd run it with gem5.opt or debug back in
December, but I must not have. My runs on the Arm O3 cpu fail with this
assertion:

build/ARM/cpu/base_dyn_inst_impl.hh:149: void
BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
-Andrew

Date: Sun, 18 Dec 2011 01:48:58 -0800
From: "Dong, Xiangyu" <riosher...@gmail.com>
To: "gem5 users mailing list" <gem5-users@gem5.org>
Subject: [gem5-users] A Patch for DRAMsim2 Integration
Message-ID: gmail.com>

Content-Type: text/plain; charset="us-ascii"

Hi all,



I have a Gem5+DRAMsim2 patch.  I've tested it under both SE and FS
modes.
I'm willing to share it here.



For those who have such needs, please go to my website
www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
download the patch and test it.  To enable DRAMSim2, use the
se_dramsim2.py script instead of se.py (for FS, you can create one
yourself).  The basic idea to enable the DRAMsim2 module is to use the
derived DRAMMemory class instead of the PhysicalMemory class.



Please let me know if there are bugs.



Thank you!



Best,

Xiangyu Dong

_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users