Re: [gem5-users] A Patch for DRAMsim2 Integration

Andrew Cebulski Wed, 02 May 2012 16:59:19 -0700

I have not run with the checker CPU recently.  Here's the stderr output
from a run I did a while back:


http://dl.dropbox.com/u/2953302/gem5/err.0

Note that the instruction match error occurs before my benchmark actually
starts running.  The start of my boot script checks to see if my files
image is mounted (which it is), then continues on to run the benchmark.  I
booted the system, mounted my files image, then took a checkpoint.  I've
been running all my tests from that checkpoint.  I found where my benchmark
started based on the ASID (from ExecAsid debug flag).

I delayed the start of gathering trace data until the second-to-last linear
increase in dynamic instructions in-flight.  I'm running a new trace now.

-Andrew



On Wed, May 2, 2012 at 5:28 PM, Ali Saidi <sa...@umich.edu> wrote:

> Something is wrong well before this point. There is no reason that address
> 0x0 or 0x4 should be translated.
>
> Did you happen to create a checkpoint when caches were in the system?
>
> Have you tried to run with the checker cpu and see if it detects any
> errors?
>
>
>
> Ali
>
>
>
>
>
> On 02.05.2012 17:22, Andrew Cebulski wrote:
>
> They are data TLB misses that occur as the in-flight instruction count
> rises (at 0x0 and 0x4).  The last TLB miss before the in-flight instruction
> count finally linearly decreases is to 0x200.  Also, at the start of the
> rising slope, I see a miss to 0x8 and 0x2508c.
>  Here's a trace file:
> http://dl.dropbox.com/u/2953302/gem5/tlb.out
> To reduce its size, I kept only the lines that contain either "TLB" or "walker".
> I do see only a handful of instruction TLB misses.
>  -Andrew
>
> On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <sa...@umich.edu> wrote:
>
>>  Hi Andrew,
>>
>>
>>
>> Thanks for digging into this. I think there is an issue somewhere, but
>> I'm still not sure where.
>>
>> Ali
>>
>> On 01.05.2012 23:34, Andrew Cebulski wrote:
>>
>> Okay, I'm positive now that the issue lies with delayed translations that
>> are squashed before finishing.
>>
>>  On the data or instruction side? You seem to allude to data in the
>> paragraph below, but then to instructions in the later text.
>>
>>  It seems to me like speculative load/stores are being executed, rather
>> than waiting for the instructions to commit.  Once the instructions begin
>> getting (speculatively) executed in the TLB, a reference is left there,
>> which seems hard to root out and dereference after the instruction ends up
>> being squashed.  At least, I have not been able to find where that happens
>> in the source code yet.  Can anyone clarify this?
>>
>>
>>
>>  There should only be one translation outstanding from each instruction
>> and data side walker. Any nested transactions should be queued in the
>> walker. Until one finishes, I'm not sure how multiple would ever be
>> outstanding.
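As a rough illustration of the behavior Ali describes, here is a minimal C++ sketch of a walker that allows one outstanding walk and queues the rest. `SimpleWalker`, `request()` and `complete()` are hypothetical names; gem5's real table walker is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical single-ported walker: at most one walk is in flight;
// any further requests wait in a FIFO queue until it completes.
class SimpleWalker {
  public:
    // Returns true if the walk started immediately, false if it was queued.
    bool request(uint64_t vaddr) {
        if (!inFlight) {
            inFlight = vaddr;
            return true;
        }
        pending.push_back(vaddr);
        return false;
    }

    // Called when the in-flight walk finishes; starts the next queued one.
    // Returns the address whose walk just completed.
    uint64_t complete() {
        assert(inFlight.has_value());
        uint64_t done = *inFlight;
        inFlight.reset();
        if (!pending.empty()) {
            inFlight = pending.front();
            pending.pop_front();
        }
        return done;
    }

    bool busy() const { return inFlight.has_value(); }
    std::size_t queued() const { return pending.size(); }

  private:
    std::optional<uint64_t> inFlight;  // the single outstanding walk, if any
    std::deque<uint64_t> pending;      // walks waiting for the walker
};
```

In this model, requests for 0x0 and 0x4 issued back to back leave one walk in flight and one waiting in the queue, never two concurrent walks.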
>>
>> Recall the following image that shows how the number of dynamic
>> instruction (DynInst) objects in-flight increases linearly for varying
>> periods of time:
>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>> After enabling the TLB debug flag, I see that the linear increase in
>> instructions in flight is proportional to the number of TLB misses.  These
>> TLB misses have a much larger delay (resulting in translation delays) due
>> to the fact that DramSim2 models the memory system more accurately.  It
>> seems that with the classic memory system, TLB misses often do not have
>> translation delays.  For whatever reason, it would also seem that every
>> instruction that has a TLB miss also is eventually squashed...
>>
>>  From a data side perspective this is reasonable. While a miss is
>> outstanding, at some point instructions will stop committing, and thus the
>> number of instructions in flight will begin to rise until the miss is satisfied.
>>
>>  Here's a summary of outputs from my trace.  These two DPRINTF messages
>> appear on the rising slopes (repeated up until the peak):
>> TLB Miss: Starting hardware table walker for 0(656)
>> TLB Miss: Starting hardware table walker for 0x4(656)
>>
>>  This is interesting/odd. I don't know a good reason why (1) a miss
>> would be outstanding to both address 0 and address 4 at the same time. In
>> almost all cases these pages are marked as no-access to detect segfaults.
>> Perhaps there is an issue where the cpu is getting into a loop faulting on
>> a bad access and then faulting again on the fault handler. I could imagine
>> this would happen if there was some corruption in the memory system (for
>> example the timings in dramsim exposing a bug in the cache models or
>> something).
>>
>>
>> At the peak, the following message appears (from fetch) almost every tick
>> for (what I believe to be) every single one of the table walkers that were
>> squashed.
>> Fetch is waiting ITLB walk to finish!
>>
>>  There must be another walk in flight? The instruction side will only
>> have one fault outstanding at once. Successive branch mispredicts will
>> re-direct fetch, but there is code that catches the fact that a different
>> walk completed than expected and "does the right thing."
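The "does the right thing" check can be pictured as fetch remembering which translation it is waiting on and ignoring completions for any other. Below is a minimal sketch that assumes each translation request carries an ID. `FetchStage`, `startTranslation()`, `squash()` and `finishTranslation()` are hypothetical names, not gem5's fetch code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical fetch-side bookkeeping: only the most recent ITLB request
// is "expected"; a redirect such as a branch mispredict supersedes it.
class FetchStage {
  public:
    // Issue a new ITLB translation and remember its ID as the one we wait on.
    uint64_t startTranslation() {
        expected = ++nextId;
        return *expected;
    }

    // A squash/redirect means fetch no longer waits on any older request.
    void squash() { expected.reset(); }

    // Returns true if the completion was consumed, false if it was stale.
    bool finishTranslation(uint64_t id) {
        if (!expected || *expected != id)
            return false;  // a different walk completed than expected: drop it
        expected.reset();
        return true;
    }

    bool waitingOnItlb() const { return expected.has_value(); }

  private:
    uint64_t nextId = 0;
    std::optional<uint64_t> expected;
};
```

If a stale completion were consumed instead of dropped, or if the expected ID were never cleared, fetch could keep waiting on a walk that belongs to a long-squashed instruction.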
>>
>>  The problem is that these ITLB table walks are for instructions that
>> were squashed as much as 0.3 billion cycles earlier and have since been
>> removed from the CPU's instruction list.
>>
>>  I'm not following here.
>>
>>  Any help will be greatly appreciated in solving this problem.  I've hit
>> a roadblock with getting Ruby working with ARM, most likely due to the fact
>> that ARM has disjoint memory (x86 and Alpha do not).  There's the 256 MB
>> for physical memory, then the 64 MB for the boot loader.  I brought this up
>> in my last email about trying to get Ruby working.  Therefore, I'm trying
>> to get this DramSim2 integration fixed so I can start modeling FS with DRAM
>> memory.
>>
>>  Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>>
>>
>> Note that these problems also occur in Soplex from the SPEC CPU2006
>> benchmark suite (it also hits the 1500 in-flight instructions assertion).
>> Due to time constraints, I haven't tested other benchmarks.
>> Thanks,
>> Andrew
>> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <af...@drexel.edu> wrote:
>>
>>>  Hey Gabe,
>>>     Thanks for this...very helpful.  I just recently got back into
>>> debugging this problem.  I made a small change in src/base/refcnt.hh to
>>> allow me to return the current count of references to a DynInst object.
>>>     I then modified existing DPRINTFs to also print out reference
>>> counts, then added some of my own when I needed extra visibility.
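A change like the one Andrew describes for src/base/refcnt.hh might resemble the standalone sketch below. It is an intrusive counter with a read-only accessor for debug output, plus an assertion against decrementing below zero, which is the premature-free case Gabe warns about. The names `Counted`, `CountedPtr` and `getCount()` are assumptions and may not match gem5's actual reference-counting classes.

```cpp
#include <cassert>
#include <utility>

// Hypothetical intrusive reference count with a getCount() accessor
// added purely as a debugging aid.
class Counted {
  public:
    virtual ~Counted() = default;
    void incref() const { ++count; }
    // Returns true when the last reference was dropped.
    bool decref() const {
        assert(count > 0 && "decref on an object with no references");
        return --count == 0;
    }
    int getCount() const { return count; }  // debugging aid only

  private:
    mutable int count = 0;
};

// Minimal smart pointer that bumps the intrusive count on copy
// and deletes the object when the last reference goes away.
template <class T>
class CountedPtr {
  public:
    CountedPtr() = default;
    explicit CountedPtr(T *p) : ptr(p) { if (ptr) ptr->incref(); }
    CountedPtr(const CountedPtr &o) : ptr(o.ptr) { if (ptr) ptr->incref(); }
    // Copy-and-swap: the by-value parameter releases the old pointee.
    CountedPtr &operator=(CountedPtr o) { std::swap(ptr, o.ptr); return *this; }
    ~CountedPtr() { if (ptr && ptr->decref()) delete ptr; }
    T *operator->() const { return ptr; }
    T *get() const { return ptr; }

  private:
    T *ptr = nullptr;
};
```

Printing `getCount()` alongside the object address at each pipeline stage makes it possible to see which stage still holds a reference after the CPU believes it is done.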
>>>     I've found one memory store instruction that seems to be getting
>>> lost.  What's happening is that it progresses as far as getting executed in
>>> the IEW once, but a delayed translation occurs, deferring the store.  By
>>> the time it reenters the IEW, the IQ has marked the instruction as
>>> squashed.  Everything progresses as usual from here on out, with one
>>> exception: when the instruction is removed from the CPU's instruction list,
>>> one reference count is left hanging.
>>>     I've added some additional debugging to my traces to help narrow
>>> down where this reference is coming from.  As far as I can tell, it's
>>> because of a call to initiateAcc() within the executeStore function in the
>>> LSQ unit.  Please see the following two traces.  The first trace shows what
>>> I just discussed.  The second trace is another memory store instruction
>>> that got squashed; however, it was squashed upon its first entry into the
>>> IEW, so it never started execution.
>>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>>     Let me know if you have any ideas based on these two instruction
>>> traces.  I do not understand how the initiateAcc function results in
>>> another reference, but maybe someone else does....  Since I don't see how
>>> it makes a reference, it's hard to find out how to make sure it gets
>>> dereferenced...
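One plausible way a squashed store can keep an extra reference is that whatever tracks its deferred translation holds its own pointer to the instruction. This fits the trace described above but has not been confirmed against gem5's LSQ code. Under that assumption, the instruction survives removal from the CPU's list until the tracking entry is dropped. The sketch uses `std::shared_ptr` as a stand-in for gem5's intrusive reference counting. `DynInst`, `DeferredTranslations` and `dropSquashed()` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>

// Hypothetical stand-in for a dynamic instruction.
struct DynInst {
    long seqNum;
    bool squashed = false;
};
using DynInstPtr = std::shared_ptr<DynInst>;

// Hypothetical table of deferred translations (e.g. after a TLB miss).
// Each entry holds a reference, so a listed instruction cannot be freed.
class DeferredTranslations {
  public:
    void defer(const DynInstPtr &inst) { pending.push_back(inst); }

    // Drop entries whose instruction was squashed. If a squash path forgets
    // to do this, the entry's reference outlives the instruction's removal
    // from the CPU's instruction list.
    void dropSquashed() {
        pending.remove_if([](const DynInstPtr &i) { return i->squashed; });
    }

    std::size_t size() const { return pending.size(); }

  private:
    std::list<DynInstPtr> pending;
};
```

In this model, a squashed store stays alive after leaving the CPU's list until `dropSquashed()` runs. A missing cleanup call of this kind would show up as exactly one leftover reference.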
>>>     Unfortunately, I haven't been able to add a DPRINTF in
>>> src/base/refcnt.hh ...this would make things clearer (i.e. show exactly when
>>> references/dereferences occur).  Let me know if you have any advice on
>>> this, if it's possible.  I can't seem to get the right include files, and
>>> likely the right SConscript compile order...
>>> Thanks,
>>>  Andrew
>>>
>>>
>>>  On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <gbl...@eecs.umich.edu> wrote:
>>>
>>>>  Without digging into things too deeply, it looks like you may be
>>>> leaking references to dynamic instructions. The CPU may think it's done
>>>> with one, but until that final reference is removed, the object will hang
>>>> around forever. I think I've had problems before where the reference
>>>> count ended up off by one somehow and instructions would start piling up.
>>>> It's also possible that a clog develops in O3's pipeline and some internal
>>>> structure stops letting instructions through and starts accumulating them.
>>>> Either of these problems will be annoying to track down, but with enough
>>>> digging I've been able to fix these sorts of things.
>>>>
>>>> This may have more to do with O3 not handling the benchmark you're
>>>> running well rather than a problem with your new DRAM model. There may be
>>>> some interaction between the two, though, where the new memory makes the
>>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>>> dynamic instruction creation and destruction and reference counting (try
>>>> printing "this" for both the reference counting wrapper and the dyn inst
>>>> itself) and turn it on as close as you can to where things go bad tick
>>>> wise. Then look for an instruction which gets lost, and look for where its
>>>> reference count is incremented and decremented. It should be relatively
>>>> easy to pair up where references are created and destroyed, and you should
>>>> be able to identify the reference which never goes away. Then you need to
>>>> figure out where that reference is being created. After that, you should
>>>> have enough information to identify why the reference counting isn't being
>>>> done correctly. It's arduous, but that's the only way.
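Gabe's pairing procedure can be partly automated once the trace prints one event per reference change. The sketch below assumes a hypothetical log format of `<tick> <inc|dec> <object-address>` per line. gem5 does not emit this format by default, so the DPRINTFs producing it would have to be added by hand. The code replays the events and reports two things: objects whose count never returns to zero, and objects decremented below zero (the premature-free case). `RefReport` and `findUnbalanced()` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct RefReport {
    std::vector<std::string> leaked;     // count still > 0 at end of log
    std::vector<std::string> underflow;  // decremented below zero
};

// Replay "<tick> <inc|dec> <addr>" lines (a hypothetical, hand-added
// trace format) and find objects whose reference changes do not balance.
RefReport findUnbalanced(const std::string &log) {
    std::map<std::string, long> counts;
    RefReport report;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        unsigned long long tick;
        std::string op, addr;
        if (!(fields >> tick >> op >> addr))
            continue;  // skip lines that do not match the format
        long &c = counts[addr];
        if (op == "inc") {
            ++c;
        } else if (op == "dec") {
            if (--c < 0)
                report.underflow.push_back(addr);
        }
    }
    for (const auto &[addr, c] : counts)
        if (c > 0)
            report.leaked.push_back(addr);
    return report;
}
```

Once a leaked address is found, grepping the full trace for that address shows each stage that took a reference, which narrows down the one that never released it.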
>>>>
>>>> It's important to also make sure reference counts aren't decremented to
>>>> zero prematurely. I had a problem once where that happened and the memory
>>>> behind the object was updated by something that didn't know it was dead.
>>>> The memory had since been reallocated to another object of the same type,
>>>> so that other object reflected what happened to the phantom one. If I
>>>> remember correctly, that manifested as something weird like an add causing
>>>> a page fault.
>>>>
>>>> Gabe
>>>>
>>>>
>>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>>
>>>>  Hi all,
>>>> I've looked into this problem some more, and have put together a couple of
>>>> traces.  I've been getting more familiar with how gem5 handles dynamic
>>>> instructions, in particular how it destroys them.  I have two traces to
>>>> compare, one with the physical memory, and the other with the integrated
>>>> dramsim2 dram memory.  I also have two plots showing instruction counts
>>>> over time (sim ticks).  All of these are linked at the end of the email.
>>>> First, I'm going to go into what I've been able to interpret regarding
>>>> how instructions are destroyed, in particular comparing when DynInsts
>>>> vs. DynInstPtrs are destroyed/removed from the cpu.  I separate these
>>>> because I've seen a difference, as I discuss later.  These explanations are
>>>> fairly non-existent on the wiki.  There is a section header waiting to be
>>>> filled...
>>>> From what I have been able to gather from the code, there is a list of
>>>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>>>> type DynInstPtr.  There are three conditions under which instructions are
>>>> cleaned from this list:
>>>> 1.)  The ROB retires its head instruction
>>>> 2.)  Fetch receives a ROB squash signal from commit, resulting
>>>> in removal of any instruction not in the ROB
>>>> 3.)  Decode detects an incorrect branch prediction, resulting in
>>>> removal of all instructions back to the bad seq num.
>>>> Once all five stages have completed, the CPU cleans up all the removed
>>>> in-flight instructions.  This line in particular
>>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc destroys a DynInstPtr:
>>>> instList.erase(removeList.front());
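The bookkeeping described above can be sketched as three parts: a list of in-flight instructions, a queue of list iterators marked for removal, and a cleanup pass after all stages have run. This is a simplified model based only on the description in this email, not a copy of cpu/o3/cpu.cc. `MiniCPU`, `addInst()`, `markRemoved()` and `squashAfter()` are hypothetical names, and `std::shared_ptr` stands in for gem5's reference-counted pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <queue>

struct DynInst {
    uint64_t seqNum;
    bool removed = false;  // already queued for removal
};
using DynInstPtr = std::shared_ptr<DynInst>;  // stand-in for gem5's ref-counted ptr

class MiniCPU {
  public:
    using ListIt = std::list<DynInstPtr>::iterator;

    ListIt addInst(uint64_t seq) {
        return instList.insert(instList.end(),
                               std::make_shared<DynInst>(DynInst{seq}));
    }

    // Mark for removal; erasure is deferred to cleanUpRemovedInsts().
    void markRemoved(ListIt it) {
        if ((*it)->removed)
            return;  // already queued; erasing twice would be undefined behavior
        (*it)->removed = true;
        removeList.push(it);
    }

    // Squash: mark every instruction younger than 'seq'.
    void squashAfter(uint64_t seq) {
        for (auto it = instList.begin(); it != instList.end(); ++it)
            if ((*it)->seqNum > seq)
                markRemoved(it);
    }

    // Run once per cycle after all pipeline stages have finished.
    void cleanUpRemovedInsts() {
        while (!removeList.empty()) {
            instList.erase(removeList.front());  // drops only the list's reference
            removeList.pop();
        }
    }

    std::size_t inFlight() const { return instList.size(); }
    const std::list<DynInstPtr> &insts() const { return instList; }

  private:
    std::list<DynInstPtr> instList;
    std::queue<ListIt> removeList;
};
```

Deferring erasure to one pass keeps the list iterators that other stages hold valid for the rest of the cycle. Erasing an entry only releases the list's own reference to the instruction.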
>>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>>> after all 5 cpu stages have completed, and one of the conditions above is
>>>> met.  I also see what tick it occurs on.
>>>> When I turn on the DynInst debug flag, I see when instructions are
>>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and at what tick.  From
>>>> analyzing the trace files, I've gathered that this takes into account that
>>>> instructions have different execution lengths.  So if the DynInstPtr for a
>>>> memory instruction is removed from the instList on one tick, the DynInst
>>>> for that memory instruction is destroyed much later (e.g. 1M ticks later).
>>>> I have yet to determine how this is implemented.
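The gap between list removal and destruction is what reference counting would produce: removing an instruction from a list drops one reference, and the object is destroyed only when the last holder lets go. This reading is consistent with Gabe's earlier message but was not confirmed in the code for this thread. The sketch below makes the gap visible with a hypothetical live-object counter similar in spirit to cpu->instcount. `liveInsts` and `liveAfterListRemoval()` are hypothetical, and `std::shared_ptr` stands in for gem5's pointer type.

```cpp
#include <cassert>
#include <list>
#include <memory>

// Hypothetical live-object counter: bumped when an instruction object is
// created, dropped only when it is actually destroyed.
static int liveInsts = 0;

struct DynInst {
    explicit DynInst(long s) : seqNum(s) { ++liveInsts; }
    ~DynInst() { --liveInsts; }
    DynInst(const DynInst &) = delete;
    DynInst &operator=(const DynInst &) = delete;
    long seqNum;
};
using DynInstPtr = std::shared_ptr<DynInst>;

// Returns the live count right after the list entry is erased while another
// holder (e.g. an outstanding memory access) still keeps a reference.
int liveAfterListRemoval() {
    std::list<DynInstPtr> instList;
    instList.push_back(std::make_shared<DynInst>(1));
    DynInstPtr memRef = instList.front();  // e.g. LSQ entry for a slow load
    instList.clear();                      // gone from the CPU's list...
    return liveInsts;                      // ...but the object still exists
}
```

In this sketch, the DynInst is destroyed only when `memRef` goes out of scope. A slow DRAM access in the real pipeline would delay the destruction in the same way.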
>>>> Now for the problem.
>>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>>> difference between the size of the instList (of DynInstPtr objects)
>>>> and the dynamic instruction count (of DynInst objects).  The
>>>> benchmark I'm running is libquantum from SPEC 2006.  For roughly the first
>>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>>>  Around tick 130B after libquantum started, it starts hitting what I'm
>>>> assuming are loops (therefore branch prediction), resulting in some
>>>> behavior that seems to imply improper instruction handling (i.e. more
>>>> instructions in flight than allowed by the ROB).
>>>> I wasn't able to sync up the physical and dramsim2 traces exactly, but
>>>> they should represent roughly the same area of execution.  They don't
>>>> execute identically because dramsim2 models the memory differently
>>>> (i.e. latency and other delays).
>>>> I've shared both traces on my public Dropbox here --
>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>>
>>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>>> Here are a couple of plots of tick versus instruction count, with respect
>>>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>>>> cpu/o3/cpu.cc:
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>>> Note that I added the printout of the instList size to an existing
>>>> O3CPU DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>>> Here are the commands I ran to parse the traces into data files to
>>>> analyze in MATLAB and create the plots:
>>>>
>>>>   zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>>>   zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
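For a quick numeric check alongside the plots, the two files produced by the commands above can be merged by tick. The sketch below makes three assumptions: each line holds a tick, possibly followed by a trailing colon, and then a count; both files are in tick order; and each series keeps its last value between samples (0 before its first sample). `readSeries()` and `maxGap()` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Series = std::vector<std::pair<uint64_t, long>>;  // (tick, count)

// Parse "tick count" lines as written by awk '{print $1,$11}'.
Series readSeries(std::istream &in) {
    Series s;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string tickTok, valueTok;
        if (!(fields >> tickTok >> valueTok))
            continue;
        try {
            // stoull/stol stop at the first non-digit, so "1234:" -> 1234.
            s.emplace_back(std::stoull(tickTok), std::stol(valueTok));
        } catch (const std::exception &) {
            continue;  // not a numeric line; skip it
        }
    }
    return s;
}

// Walk both series in tick order and return the largest
// (DynInst count - instList size) difference seen.
long maxGap(const Series &dynInst, const Series &instList) {
    std::size_t i = 0, j = 0;
    long a = 0, b = 0, best = 0;
    while (i < dynInst.size() || j < instList.size()) {
        uint64_t t = std::numeric_limits<uint64_t>::max();
        if (i < dynInst.size()) t = dynInst[i].first;
        if (j < instList.size()) t = std::min(t, instList[j].first);
        // Apply every sample at tick t before comparing the two series.
        while (i < dynInst.size() && dynInst[i].first == t) a = dynInst[i++].second;
        while (j < instList.size() && instList[j].first == t) b = instList[j++].second;
        best = std::max(best, a - b);
    }
    return best;
}
```

A gap that stays well above the 192-entry ROB size would point to the same problem the dramsim2 plot shows, without needing to eyeball the curves.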
>>>> It seems to me like the problem might lie in gem5, but has just been
>>>> exposed by integrating this more detailed memory model, dramsim2, into
>>>> gem5.  Either that, or there are some timing errors in how dramsim2 was
>>>> integrated.  I doubt this, however, since those first 190B ticks executed
>>>> with the dramsim2 memory.  I believe the problem is a combination of memory
>>>> instructions + complex loops (branch prediction), resulting in improper
>>>> destruction of instructions.
>>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>>>  There are 192 ROB entries, which is why the instList size generally has a
>>>> max of about 192 instructions.  The dynamic instruction counts (seen in the
>>>> dramsim2 plot) also seem to imply that instructions are incorrectly being
>>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>>> which allows more and more instructions to be added to the system (possibly
>>>> from a bad branch).
>>>> I appreciate any help in debugging this and further figuring out the
>>>> root problem, just let me know if you need anything else from me.  I don't
>>>> have much more time at the moment to debug, but I can take any advice for
>>>> quick changes and/or additional traces, then send the results back to the
>>>> list for discussion.
>>>> Thanks,
>>>> Andrew
>>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>>> (and even command) queue from 512 to 32.  The same instructions problem
>>>> occurred.  It basically just decreased the execution time.
>>>>
>>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>>
>>>>>  The error is that there are more than 1500 instructions currently in
>>>>> flight in the system. It could mean several things:
>>>>>
>>>>> 1. The value is somewhat arbitrarily defined, and maybe there really are
>>>>> more than 1500 in your system at one time?
>>>>>
>>>>> 2. Instructions aren't being destroyed correctly
>>>>>
>>>>> You could try to run a debug binary so you'll get a list of
>>>>> instructions when it happens, or increase the number, which may
>>>>> be appropriate for certain situations (but 1500 is quite a few in-flight
>>>>> instructions).
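The kind of check Ali describes can be sketched as a counter that is incremented on creation and decremented on destruction, with an assertion on the total. The sketch also shows why such a failure stays silent in some builds: gem5.fast is compiled with `NDEBUG` defined, so `assert` and any `#ifndef NDEBUG` block disappear, while gem5.opt and gem5.debug keep them. `InstCounter` and `kMaxInFlight` are hypothetical names. The exact condition in base_dyn_inst_impl.hh is truncated in the error messages quoted in this thread and is not reproduced here.

```cpp
#include <cassert>
#include <cstdio>

// The 1500 limit is the value discussed in this thread; the rest of this
// counter is an illustrative assumption, not gem5's code.
constexpr int kMaxInFlight = 1500;

struct InstCounter {
    int instcount = 0;

    void onCreate() {
        ++instcount;
#ifndef NDEBUG
        // Only present in builds without NDEBUG (gem5.opt/gem5.debug style).
        // With NDEBUG defined, the count can grow without any complaint.
        if (instcount > kMaxInFlight)
            std::fprintf(stderr, "too many in-flight instructions: %d\n",
                         instcount);
        assert(instcount <= kMaxInFlight);
#endif
    }

    void onDestroy() { --instcount; }
};
```

A leak like the one discussed here would therefore only trip the check in builds that keep assertions. In a build with NDEBUG, it would show up at most as a slowdown or changed statistics.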
>>>>>
>>>>> Ali
>>>>>
>>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>>
>>>>>  Hi Xiangyu,
>>>>>     I just started looking into this some more.  At first I thought
>>>>> it was due to updating to a more recent revision, but then I went back to
>>>>> revision 8643, added your patch, built and ran....and now get the error
>>>>> with it too (when running ARM_FS/gem5.opt).  I'm testing now to see if an
>>>>> update to SWIG might have resulted in this error; maybe someone on the
>>>>> mailing list would know if that's possible.  The difference is 1.3.40 vs.
>>>>> 2.0.3, both of which are supported according to the dependencies wiki page.
>>>>> Just for completeness, here's the error from revision 8643:
>>>>>  build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>    I have not tried running with gem5.debug, so I will be doing that
>>>>> today.  Maybe this is an assertion that is occurring due to an
>>>>> optimization.  That would mean it wouldn't be triggered in gem5.debug since
>>>>> it runs without optimizations.  Have you tested all debug, opt and fast
>>>>> with your tests?
>>>>> Thanks,
>>>>>  Andrew
>>>>>
>>>>>  On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>>> riosher...@gmail.com> wrote:
>>>>>
>>>>>>   Hi Andrew,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>>> version you are using? I find that some of the latest code updates are not
>>>>>> compatible with my changes. I am still using the DRAMsim2 patch on gem5
>>>>>> revision 8643, and have run all the runnable benchmarks in SPEC2006,
>>>>>> SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Xiangyu
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Andrew Cebulski [mailto:af...@drexel.edu]
>>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>>
>>>>>> *To:* gem5 users mailing list
>>>>>> *Cc:*riosher...@gmail.com; sa...@umich.edu
>>>>>>
>>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>>
>>>>>> Xiangyu,
>>>>>>
>>>>>>    I've been having an issue recently with the number of instructions
>>>>>> I've been seeing committed to the CPU (I have a separate thread on this).
>>>>>>  It turns out the issue seems to be coming from this patch you created to
>>>>>> integrate DramSim2 with Gem5.  Unfortunately, I've been running with
>>>>>> gem5.fast, not gem5.opt.  So up until now, I haven't been seeing
>>>>>> assertions.  I thought I'd run it with gem5.opt or debug back in December,
>>>>>> but I must not have.  My runs on the ARM O3 CPU fail with this assertion:
>>>>>>
>>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>>
>>>>>> -Andrew
>>>>>>
>>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>>> From: "Dong, Xiangyu" <riosher...@gmail.com>
>>>>>> To: "gem5 users mailing list" <gem5-users@gem5.org>
>>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>> Message-ID: gmail.com>
>>>>>>
>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have a Gem5+DRAMsim2 patch.  I've tested it under both SE and FS modes.
>>>>>> I'm willing to share it here.
>>>>>>
>>>>>>
>>>>>>
>>>>>> For those who have such needs, please go to my website
>>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>>> download the patch and test it.  To enable DRAMSim2, use the
>>>>>> se_dramsim2.py script instead of se.py (for FS, you can create your own).
>>>>>> The basic idea to enable the DRAMsim2 module is to use the derived
>>>>>> DRAMMemory class instead of the PhysicalMemory class.
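The "derived class" approach can be illustrated in general terms. The DRAM-backed memory overrides whatever hook the base memory uses to compute access latency, and the rest of the memory interface stays the same. The classes below, `FlatMemory`, `DramModelMemory` and `accessLatency()`, are hypothetical stand-ins. They do not reflect the actual PhysicalMemory or DRAMMemory code in the patch, and the toy open-row model is not DRAMsim2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical base memory: fixed latency for every access.
class FlatMemory {
  public:
    explicit FlatMemory(uint64_t latency) : latency(latency) {}
    virtual ~FlatMemory() = default;
    // Ticks until the access at 'addr' completes.
    virtual uint64_t accessLatency(uint64_t addr) { (void)addr; return latency; }

  protected:
    uint64_t latency;
};

// Hypothetical derived memory: a toy open-row model standing in for a
// detailed DRAM simulator. A hit in the open row is cheap; a miss costs more.
class DramModelMemory : public FlatMemory {
  public:
    DramModelMemory(uint64_t hitLat, uint64_t missLat, uint64_t rowBytes)
        : FlatMemory(hitLat), missLat(missLat), rowBytes(rowBytes) {}

    uint64_t accessLatency(uint64_t addr) override {
        uint64_t row = addr / rowBytes;
        bool hit = hasOpenRow && row == openRow;
        openRow = row;
        hasOpenRow = true;
        return hit ? latency : missLat;
    }

  private:
    uint64_t missLat, rowBytes;
    uint64_t openRow = 0;
    bool hasOpenRow = false;
};
```

Because callers only see the base-class interface, swapping in the derived class changes timing without touching the CPU. This is also why, as Gabe notes later in the thread, a new memory model can expose timing-dependent bugs elsewhere.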
>>>>>>
>>>>>>
>>>>>>
>>>>>> Please let me know if there are bugs.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Xiangyu Dong
>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> gem5-users mailing list
>>>>> gem5-users@gem5.org
>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
>
