Re: [gem5-users] A Patch for DRAMsim2 Integration

Andrew Cebulski Wed, 02 May 2012 14:22:54 -0700

They are data TLB misses that occur as the in-flight instruction count
rises (at 0x0 and 0x4).  The last TLB miss before the in-flight instruction
count finally starts to decrease linearly is to 0x200.  Also, at the start
of the rising slope, I see misses to 0x8 and 0x2508c.
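
In case it helps anyone repeat this, here is a rough Python sketch of how
the per-address miss counts can be tallied from the TLB debug output.  It
assumes the miss lines look like the DPRINTF output quoted further down in
this thread ("TLB Miss: Starting hardware table walker for 0x4(656)") and
that the address is printed in hex.  The function name
count_miss_addresses is made up for illustration and is not part of gem5.

```python
import re
from collections import Counter

# Assumed message format, taken from the DPRINTF lines quoted in this thread.
# The meaning of the number in parentheses is not assumed here; it is ignored.
MISS_RE = re.compile(
    r"TLB Miss: Starting hardware table walker for (\w+)\((\d+)\)"
)


def count_miss_addresses(lines):
    """Return a Counter mapping each missing address to its miss count."""
    counts = Counter()
    for line in lines:
        match = MISS_RE.search(line)
        if match:
            # Addresses appear as "0" or "0x4"; both parse as base 16.
            counts[int(match.group(1), 16)] += 1
    return counts
```

Calling count_miss_addresses(open("tlb.out")) and printing the result with
hex() on the keys should give a quick view of which addresses dominate.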


Here's a trace file:

http://dl.dropbox.com/u/2953302/gem5/tlb.out

To reduce size, I just have lines that have either TLB or walker in them.
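
The filtering itself amounts to something like the following Python
sketch.  The function name keep_tlb_lines and the case-sensitive substring
match are assumptions for illustration; a plain grep would do the same job.

```python
def keep_tlb_lines(lines, keywords=("TLB", "walker")):
    """Yield only the trace lines that mention one of the keywords."""
    for line in lines:
        # Keep a line if any keyword appears anywhere in it.
        if any(word in line for word in keywords):
            yield line
```

Passing the lines of the full trace through keep_tlb_lines and writing the
result to a file should give something comparable to tlb.out.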

I do see only a handful of instruction TLB misses.

-Andrew

On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <sa...@umich.edu> wrote:

> Hi Andrew,
>
>
>
> Thanks for digging into this. I think there is an issue somewhere, but I'm
> still not sure where.
>
> Ali
>
> On 01.05.2012 23:34, Andrew Cebulski wrote:
>
> Okay, I'm positive now that the issue lies with delayed translations that
> are squashed before finishing.
>
> On the data or instruction side? You seem to allude to data in the
> paragraph below, but then instructions in the latter text.
>
>  It seems to me like speculative loads/stores are being executed, rather
> than waiting for the instructions to commit.  Once the instructions begin
> getting (speculatively) executed in the TLB, a reference is left there,
> which seems hard to root out and dereference after the instruction ends up
> being squashed.  At least, I have not been able to find that out in the
> source code as of yet.  Can anyone clarify this?
>
>
>
> There should only be one translation outstanding from each instruction and
> data side walker. Any nested transactions should be queued in the walker.
> Until one finishes, I'm not sure how multiple would ever be outstanding.
>
> Recall the following image that shows how the number of dynamic
> instruction (DynInst) objects in-flight increases linearly for varying
> periods of time:
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> After enabling the TLB debug flag, I see that the linear increase in
> instructions in flight is proportional to the number of TLB misses.  These
> TLB misses have a much larger delay (resulting in translation delays)
> because DramSim2 models the memory system more accurately.  It
> seems that with the classic memory system, TLB misses often do not have
> translation delays.  For whatever reason, it would also seem that every
> instruction that has a TLB miss also is eventually squashed...
>
>  From a data side perspective this is reasonable. While a miss is
> outstanding at some point instructions will stop committing and thus the
> instructions in flight will begin to rise until the miss is satisfied.
>
>  Here's a summary of outputs from my trace.  These two DPRINTF messages
> appear on the rising slopes (repeated up until the peak):
> TLB Miss: Starting hardware table walker for 0(656)
> TLB Miss: Starting hardware table walker for 0x4(656)
>
>  This is interesting/odd. I don't know a good reason why (1) a miss would
> be outstanding to both address 0 and address 4 at the same time. In almost
> all cases these pages are marked as no-access to detect segfaults. Perhaps
> there is an issue where the cpu is getting into a loop faulting on a bad
> access and then faulting again on the fault handler. I could imagine this
> would happen if there was some corruption in the memory system (for example
> the timings in dramsim exposing a bug in the cache models or something).
>
>
> At the peak, the following message appears (from fetch) almost every tick
> for (what I believe to be) every single one of the table walkers that were
> squashed.
> Fetch is waiting ITLB walk to finish!
>
>  There must be another walk in flight? The instruction side will only
> have one fault outstanding at once. Successive branch mispredicts will
> re-direct fetch but there is code that catches the fact that a different
> walk completed than expected and "does the right thing."
>
>  The problem is that these ITLB table walks are for instructions that
> were squashed as much as 0.3 billion cycles earlier, and have since been
> removed from the CPU's instruction list.
>
>  I'm not following here.
>
>  Any help will be greatly appreciated in solving this problem.  I've hit
> a roadblock with getting Ruby working with ARM, most likely due to the fact
> that ARM has disjoint memory (x86 and Alpha do not).  There's the 256 MB
> for physical memory, then the 64 MB for the boot loader.  I brought this up
> in my last email about trying to get Ruby working.  Therefore, I'm trying
> to get this DramSim2 integration fixed so I can start modeling FS with DRAM
> memory.
>
>  Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>
>
> Note that these problems also occur in Soplex from the SPEC CPU2006
> benchmark suite (it also hits the 1500 in-flight instructions assertion).
> Due to time constraints, I haven't tested on other benchmarks.
> Thanks,
> Andrew
>  On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <af...@drexel.edu> wrote:
>
>> Hey Gabe,
>>     Thanks for this...very helpful.  I just recently got back into
>> debugging this problem.  I made a small change in src/base/refcnt.hh to
>> allow me to return the current count of references to a DynInst object.
>>     I then modified existing DPRINTFs to also print out reference counts,
>> then added some of my own when I needed extra visibility.
>>     I've found one memory store instruction that seems to be getting
>> lost.  What's happening is that it progresses as far as getting executed in
>> the IEW once, but a delayed translation occurs, deferring the store.  By
>> the time it reenters the IEW, the IQ has marked the instruction as
>> squashed.  Everything progresses as usual from here on out, with one
>> exception.  When the instruction is removed from the CPU's instruction list,
>> there is one reference count hanging.
>>     I've added in some additional debugging for my traces to help narrow
>> down where this reference is coming from.  As far as I can tell, it's
>> because of a call to initiateAcc() within the executeStore function in the
>> lsq unit.  Please see the following two traces.  The first trace shows what
>> I just discussed.  The second trace is another memory store instruction
>> that got squashed, however, it was squashed upon its first entry into the
>> IEW, therefore it never started execution.
>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>>     Let me know if you have any ideas based on these two instruction
>> traces.  I do not understand how the initiateAcc function results in
>> another reference, but maybe someone else does....  Since I don't see how
>> it makes a reference, it's hard to find out how to make sure it gets
>> dereferenced...
>>     Unfortunately, I haven't been able to add a DPRINTF in
>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>> references/dereferences occur).  Let me know if you have any advice on
>> this...if it's possible.  I can't seem to get the right include files, and
>> likely the right SConscript compile order...
>> Thanks,
>> Andrew
>>
>>
>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <gbl...@eecs.umich.edu> wrote:
>>
>>> Without digging into things too deeply, it looks like you may be leaking
>>> references to dynamic instructions. The CPU may think it's done with one,
>>> but until that final reference is removed, the object will hang around
>>> forever. I think I've had problems before where the reference count ended
>>> up off by one somehow and instructions would start piling up. It's also
>>> possible that a clog develops in O3's pipeline and some internal structure
>>> stops letting instructions through and starts accumulating them. Either of
>>> these problems will be annoying to track down, but with enough digging I've
>>> been able to fix these sorts of things.
>>>
>>> This may have more to do with O3 not handling the benchmark you're
>>> running well rather than a problem with your new DRAM model. There may be
>>> some interaction between the two, though, where the new memory makes the
>>> timing line up to cause O3 to behave poorly. What you can do is instrument
>>> dynamic instruction creation and destruction and reference counting (try
>>> printing "this" for both the reference counting wrapper and the dyn inst
>>> itself) and turn it on as close as you can to where things go bad tick
>>> wise. Then look for an instruction which gets lost, and look for where its
>>> reference count is incremented and decremented. It should be relatively
>>> easy to pair up where references are created and destroyed, and you should
>>> be able to identify the reference which never goes away. Then you need to
>>> figure out where that reference is being created. After that, you should
>>> have enough information to identify why the reference counting isn't being
>>> done correctly. It's arduous, but that's the only way.
>>>
>>> It's important to also make sure reference counts aren't decremented to
>>> zero prematurely. I had a problem once where that happened and the memory
>>> behind the object was updated by something that didn't know it was dead.
>>> The memory had since been reallocated to another object of the same type,
>>> so that other object reflected what happened to the phantom one. If I
>>> remember that manifested as something weird like an add causing a page
>>> fault or something.
>>>
>>> Gabe
>>>
>>>
>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>
>>> Hi all,
>>> I've looked into this problem some more, and have put together a couple
>>> traces.  I've been becoming more familiar with how gem5 handles dynamic
>>> instructions, in particular how it destroys them.  I have two traces to
>>> compare, one with the physical memory, and the other with the integrated
>>> dramsim2 dram memory.  I also have two plots showing instruction counts
>>> over time (sim ticks).  All of these are linked at the end of the email.
>>> First, I'm going to go into what I've been able to interpret regarding
>>> how instructions are destroyed.  In particular, comparing when DynInst's
>>> vs. DynInstPtr's are deconstructed/removed from the cpu.  I separate these
>>> because I've seen a difference, as I discuss later.  These explanations are
>>> fairly non-existent on the wiki.  There is a section header waiting to be
>>> filled...
>>> From what I have been able to gather from the code, there is a list of
>>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>>> type DynInstPtr.  There are three conditions to instructions being cleaned
>>> from this list:
>>> 1.)  The ROB retires its head instruction
>>> 2.)  Fetch receives a rob squashing signal from the commit, resulting in
>>> removing any instruction not in the ROB
>>> 3.)  Decode detects an incorrect branch prediction, resulting in removal
>>> of all instructions back to the bad seq num.
>>> Once all five stages have completed, the CPU cleans up all the removed
>>> in-flight instructions.  This line in particular
>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>> instList.erase(removeList.front());
>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>> after all 5 cpu stages have completed, and one of the conditions above is
>>> met.  I also see what tick it occurs on.
>>> When I turn on the DynInst debug flag, I see when instructions are
>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and at what tick.  From
>>> analyzing the trace files, I've gathered that this takes into account that
>>> instructions have different execution lengths.  So if a memory instruction
>>> is removed from the instList (DynInstPtr) on one tick, the DynInst for that
>>> memory instruction will be destroyed much later (i.e. 1M ticks later).  I
>>> have yet to determine how this is implemented.
>>> Now for the problem.
>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>> difference between the size of instList (a list of DynInstPtr objects)
>>> and the dynamic instruction count (of DynInst objects).  The
>>> benchmark I'm running is libquantum from SPEC 2006.  For the first roughly
>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>> assuming are loops (therefore branch prediction), resulting in some
>>> behavior that seems to imply improper instruction handling (i.e. more
>>> instructions in flight than allowed by the ROB).
>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>> trace, but they should represent roughly the same area of execution.  They
>>> don't execute the same due to the dramsim2 modeling the memory differently
>>> (i.e. latency and other delays).
>>> I've shared both traces on my public Dropbox here --
>>>
>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>> Here are a couple of plots of tick versus instruction count, with respect
>>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>>> cpu/o3/cpu.cc.  --
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>> Note that I added the printout of the instList size to an existing O3CPU
>>> DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>> Here are the commands I ran to parse the traces into data files to
>>> analyze in MATLAB and create the plots:
>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>> It seems to me like the problem might lie in gem5, but has just been
>>> exposed by integrating this more detailed memory model, dramsim2, into
>>> gem5.  Either that, or there are some timing errors in how dramsim2 was
>>> integrated.  I doubt this, however, since those first 190B ticks executed
>>> used the dramsim2 memory.  I believe the problem is a combination of memory
>>> instructions + complex loops (branch prediction), resulting in improper
>>> destroying of instructions.
>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>> There are 192 ROB entries, which is why the instList size generally has a
>>> max of about 192 instructions.  The dynamic instruction counts (seen in the
>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>> which allows more and more instructions to be added to the system (possibly
>>> from a bad branch).
>>> I appreciate any help in debugging this and further figuring out the
>>> root problem, just let me know if you need anything else from me.  I don't
>>> have much more time at the moment to debug, but I can take any advice for
>>> quick changes and/or additional traces, then send the results back to the
>>> list for discussion.
>>> Thanks,
>>> Andrew
>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>> (and even command) queue from 512 to 32.  The same instructions problem
>>> occurred.  It basically just decreased the execution time.
>>>
>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>
>>>>  The error is that there are more than 1500 instructions currently in
>>>> flight in the system. It could mean several things:
>>>>
>>>> 1. The value is somewhat arbitrarily defined and maybe there are more
>>>> than 1500 in your system at one time?
>>>>
>>>> 2. Instructions aren't being destroyed correctly
>>>>
>>>> You could try to run a debug binary so you'll get a list of
>>>> instructions when it happens or increase the number which may
>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>> instructions).
>>>>
>>>> Ali
>>>>
>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>
>>>>  Hi Xiangyu,
>>>>     I just started looking into this some more.  So at first I thought
>>>> it was due to updating to a more recent revision, but then I went back to
>>>> revision 8643, added your patch, built and ran....and now get the error
>>>> with it too (when running ARM_FS/gem5.opt).  I'm testing now to see if an
>>>> update to SWIG might have resulted in this error, maybe someone on the
>>>> mailing list would know if that's possible.  The difference is 1.3.40 vs.
>>>> 2.0.3, both of which are supported according to the dependencies wiki page.
>>>> Just for completeness, here's the error from revision 8643:
>>>>  build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>    I have not tried running with gem5.debug, so I will be doing that
>>>> today.  Maybe this is an assertion that is occurring due to an
>>>> optimization.  That would mean it wouldn't be triggered in gem5.debug since
>>>> it runs without optimizations.  Have you tested all debug, opt and fast
>>>> with your tests?
>>>> Thanks,
>>>>  Andrew
>>>>
>>>>  On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <
>>>> riosher...@gmail.com> wrote:
>>>>
>>>>>   Hi Andrew,
>>>>>
>>>>>
>>>>>
>>>>> I didn’t see this error in my simulations. May I ask which gem5
>>>>> version you are using? I find some of the latest code updates do not
>>>>> comply with my changes. I am still using the DRAMsim2 patch on Gem5
>>>>> repo8643, and have run all the runnable benchmarks in SPEC2006, SPEC2000,
>>>>> EEMBC2, and PARSEC2 on ARM_SE.
>>>>>
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Xiangyu
>>>>>
>>>>>
>>>>>
>>>>> *From:* Andrew Cebulski [mailto:af...@drexel.edu]
>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>>
>>>>> *To:* gem5 users mailing list
>>>>> *Cc:*riosher...@gmail.com; sa...@umich.edu
>>>>>
>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>
>>>>> Xiangyu,
>>>>>
>>>>>    I've been having an issue recently with the number of instructions
>>>>> I've been seeing committed to the CPU (I have a separate thread on this).
>>>>>  It turns out the issue seems to be coming from this patch you created to
>>>>> integrate DramSim2 with Gem5.  Unfortunately, I've been running with
>>>>> gem5.fast, not gem5.opt.  So up until now, I haven't been seeing
>>>>> assertions.  I thought I'd run it with gem5.opt or debug back in December,
>>>>> but I must not have.  My runs on the Arm O3 cpu fail with this assertion:
>>>>>
>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>
>>>>> -Andrew
>>>>>
>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>> From: "Dong, Xiangyu" <riosher...@gmail.com>
>>>>> To: "gem5 users mailing list" <gem5-users@gem5.org>
>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>> Message-ID: gmail.com>
>>>>>
>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>>
>>>>> I have a Gem5+DRAMsim2 patch.  I've tested it under both SE and FS
>>>>> modes.  I'm willing to share it here.
>>>>>
>>>>>
>>>>>
>>>>> For those who have such needs, please go to my website
>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>> download the patch and test it.  To enable DRAMSim2, use the
>>>>> se_dramsim2.py script instead of se.py (for FS, you can create one
>>>>> yourself).  The basic idea to enable the DRAMsim2 module is to use the
>>>>> derived DRAMMemory class instead of the PhysicalMemory class.
>>>>>
>>>>>
>>>>>
>>>>> Please let me know if there are bugs.
>>>>>
>>>>>
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Xiangyu Dong
>>>>>
>>>>>
>>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list
>>>> gem5-users@gem5.org
>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>

