They are data TLB misses (to 0x0 and 0x4) that occur as the in-flight instruction count rises. The last TLB miss before the in-flight instruction count finally begins to decrease linearly is to 0x200. Also, at the start of the rising slope, I see misses to 0x8 and 0x2508c.
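For anyone who wants to repeat this tally on their own trace, here is a minimal Python sketch that counts table-walk starts per address. It assumes the walker message has the form quoted later in this thread ("TLB Miss: Starting hardware table walker for 0x4(656)") and that addresses are printed in hex. The count_tlb_misses helper, the sample lines, and their tick/object prefixes are made up for illustration and are not taken from the trace file.

```python
import re
from collections import Counter

# Matches the walker message quoted in this thread, e.g.
# "TLB Miss: Starting hardware table walker for 0x4(656)".
# Assumption: the address is hexadecimal, with or without a "0x" prefix.
MISS_RE = re.compile(
    r"Starting hardware table walker for ((?:0x)?[0-9a-fA-F]+)\("
)


def count_tlb_misses(lines):
    """Return a Counter mapping miss address (int) to walks started."""
    counts = Counter()
    for line in lines:
        match = MISS_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


# Hypothetical trace lines; the "tick: object:" prefix is an assumption.
sample = [
    "1000: system.cpu.dtb: TLB Miss: Starting hardware table walker for 0(656)",
    "1500: system.cpu.dtb: TLB Miss: Starting hardware table walker for 0x4(656)",
    "2000: system.cpu.dtb: TLB Miss: Starting hardware table walker for 0(656)",
    "2500: system.cpu.dtb.walker: some unrelated walker message",
]

if __name__ == "__main__":
    for addr, n in count_tlb_misses(sample).most_common():
        print(f"{addr:#x}: {n}")
```

Running the same function over the lines of a real trace (for example, an open file object) would show which addresses dominate the rising slope, provided the message format matches the assumption above.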
Here's a trace file:
http://dl.dropbox.com/u/2953302/gem5/tlb.out
To reduce its size, I kept only the lines that contain either TLB or walker. I see only a handful of instruction TLB misses.

-Andrew

On Wed, May 2, 2012 at 11:10 AM, Ali Saidi <sa...@umich.edu> wrote:

> Hi Andrew,
>
> Thanks for digging into this. I think there is an issue somewhere, but I'm
> still not sure where.
>
> Ali
>
> On 01.05.2012 23:34, Andrew Cebulski wrote:
>
> Okay, I'm positive now that the issue lies with delayed translations that
> are squashed before finishing.
>
> On the data or instruction side? You seem to allude to data in the
> paragraph below, but then instructions in the latter text.
>
> It seems to me like speculative load/stores are being executed, rather
> than waiting for the instructions to commit. Once the instructions begin
> getting (speculatively) executed in the TLB, a reference is left there,
> which seems hard to root out and dereference after the instruction ends up
> being squashed. At least, I have not been able to find that out in the
> source code as of yet. Can anyone clarify on this?
>
> There should only be one translation outstanding from each instruction and
> data side walker. Any nested transactions should be queued in the walker.
> Until one finishes, I'm not sure how multiple would ever be outstanding.
>
> Recall the following image that shows how the number of dynamic
> instruction (DynInst) objects in-flight increases linearly for varying
> periods of time:
> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
> After enabling the TLB debug flag, I see that the linear increase in
> instructions in flight is proportional to the number of TLB misses. These
> TLB misses have a much larger delay (resulting in translation delays) due
> to the fact that DramSim2 models the memory system more accurately. It
> seems that with the classic memory system, TLB misses often do not have
> translation delays.
> For whatever reason, it would also seem that every
> instruction that has a TLB miss also is eventually squashed...
>
> From a data side perspective this is reasonable. While a miss is
> outstanding at some point instructions will stop committing and thus the
> instructions in flight will begin to rise until the miss is satisfied.
>
> Here's a summary of outputs from my trace. These two DPRINTF messages
> appear on the rising slopes (repeated up until the peak):
> TLB Miss: Starting hardware table walker for 0(656)
> TLB Miss: Starting hardware table walker for 0x4(656)
>
> This is interesting/odd. I don't know a good reason why (1) a miss would
> be outstanding to both address 0 and address 4 at the same time. In almost
> all cases these pages are marked as no-access to detect segfaults. Perhaps
> there is an issue where the cpu is getting into a loop faulting on a bad
> access and then faulting again on the fault handler. I could imagine this
> would happen if there was some corruption in the memory system (for example
> the timings in dramsim exposing a bug in the cache models or something).
>
> At the peak, the following message appears (from fetch) almost every tick
> for (what I believe to be) every single one of the table walkers that were
> squashed.
> Fetch is waiting ITLB walk to finish!
>
> There must be another walk in flight? The instruction side will only
> have one fault outstanding at once. Successive branch mispredicts will
> re-direct fetch but there is code that catches the fact that a different
> walk completed than expected and "does the right thing."
>
> The problem is that these ITLB table walks are for instructions that
> were squashed as much as 0.3 billion cycles earlier, and have since been
> removed from the CPU's instruction list.
>
> I'm not following here.
>
> Any help will be greatly appreciated in solving this problem.
> I've hit a roadblock with getting Ruby working with ARM, most likely due
> to the fact that ARM has disjoint memory (x86 and Alpha do not). There's
> the 256 MB for physical memory, then the 64 MB for the boot loader. I
> brought this up in my last email about trying to get Ruby working.
> Therefore, I'm trying to get this DramSim2 integration fixed so I can
> start modeling FS with DRAM memory.
>
> Brad/Steve/Nilay anyone have a suggestion on how to make this work?
>
> Note that these problems also occur in Soplex from the SPEC CPU2006
> benchmark suite (also hits 1500 in-flight instructions assertion). Due to
> time constraints, I haven't tested on other benchmarks.
> Thanks,
> Andrew
>
> On Tue, May 1, 2012 at 4:27 AM, Andrew Cebulski <af...@drexel.edu> wrote:
>
>> Hey Gabe,
>> Thanks for this...very helpful. I just recently got back into
>> debugging this problem. I made a small change in src/base/refcnt.hh to
>> allow me to return the current count of references to a DynInst object.
>> I then modified existing DPRINTFs to also print out reference counts,
>> then added some of my own when I needed extra visibility.
>> I've found one memory store instruction that seems to be getting
>> lost. What's happening is that it progresses as far as getting executed in
>> the IEW once, but a delayed translation occurs, deferring the store. By
>> the time it reenters the IEW, the IQ has marked the instruction as
>> squashed. Everything progresses as usual from here on out, with one
>> exception. When the instruction is removed from the CPU's instruction list,
>> there is one reference count hanging.
>> I've added in some additional debugging for my traces to help narrow
>> down where this reference is coming from. As far as I can tell, it's
>> because of a call to initiateAcc() within the executeStore function in the
>> lsq unit. Please see the following two traces. The first trace shows what
>> I just discussed.
>> The second trace is another memory store instruction
>> that got squashed; however, it was squashed upon its first entry into the
>> IEW, so it never started execution.
>> http://dl.dropbox.com/u/2953302/gem5/lostinstruction.out
>> http://dl.dropbox.com/u/2953302/gem5/similarinstruction.out
>> Let me know if you have any ideas based on these two instruction
>> traces. I do not understand how the initiateAcc function results in
>> another reference, but maybe someone else does.... Since I don't see how
>> it makes a reference, it's hard to find out how to make sure it gets
>> dereferenced...
>> Unfortunately, I haven't been able to add a DPRINTF in
>> src/base/refcnt.hh ...this would make things more clear (i.e. exactly when
>> references/dereferences occur). Let me know if you have any advice on
>> this...if it's possible. I can't seem to get the right include files, and
>> likely right SConscript compile order...
>> Thanks,
>> Andrew
>>
>> On Sat, Apr 7, 2012 at 9:48 PM, Gabe Black <gbl...@eecs.umich.edu> wrote:
>>
>>> Without digging into things too deeply, it looks like you may be leaking
>>> references to dynamic instructions. The CPU may think it's done with one,
>>> but until that final reference is removed, the object will hang around
>>> forever. I think I've had problems before where the reference count ended
>>> up off by one somehow and instructions would start piling up. It's also
>>> possible that a clog develops in O3's pipeline and some internal structure
>>> stops letting instructions through and starts accumulating them. Either of
>>> these problems will be annoying to track down, but with enough digging I've
>>> been able to fix these sorts of things.
>>>
>>> This may have more to do with O3 not handling the benchmark you're
>>> running well rather than a problem with your new DRAM model. There may be
>>> some interaction between the two, though, where the new memory makes the
>>> timing line up to cause O3 to behave poorly.
>>> What you can do is instrument
>>> dynamic instruction creation and destruction and reference counting (try
>>> printing "this" for both the reference counting wrapper and the dyn inst
>>> itself) and turn it on as close as you can to where things go bad tick
>>> wise. Then look for an instruction which gets lost, and look for where its
>>> reference count is incremented and decremented. It should be relatively
>>> easy to pair up where references are created and destroyed, and you should
>>> be able to identify the reference which never goes away. Then you need to
>>> figure out where that reference is being created. After that, you should
>>> have enough information to identify why the reference counting isn't being
>>> done correctly. It's arduous, but that's the only way.
>>>
>>> It's important to also make sure reference counts aren't decremented to
>>> zero prematurely. I had a problem once where that happened and the memory
>>> behind the object was updated by something that didn't know it was dead.
>>> The memory had since been reallocated to another object of the same type,
>>> so that other object reflected what happened to the phantom one. If I
>>> remember, that manifested as something weird like an add causing a page
>>> fault or something.
>>>
>>> Gabe
>>>
>>> On 04/07/12 18:21, Andrew Cebulski wrote:
>>>
>>> Hi all,
>>> I've looked into this problem some more, and have put together a couple
>>> traces. I've been becoming more familiar with how gem5 handles dynamic
>>> instructions, in particular how it destroys them. I have two traces to
>>> compare, one with the physical memory, and the other with the integrated
>>> dramsim2 dram memory. I also have two plots showing instruction counts
>>> over time (sim ticks). All of these are linked at the end of the email.
>>> First, I'm going to go into what I've been able to interpret regarding
>>> how instructions are destroyed. In particular, comparing when DynInst's
>>> vs.
>>> DynInstPtr's are deconstructed/removed from the cpu. I separate these
>>> because I've seen a difference, as I discuss later. These explanations are
>>> fairly non-existent on the wiki. There is a section header waiting to be
>>> filled...
>>> From what I have been able to gather from the code, there is a list of
>>> all the instructions in flight in cpu/o3/cpu.cc called instList, with the
>>> type DynInstPtr. There are three conditions for instructions being cleaned
>>> from this list:
>>> 1.) The ROB retires its head instruction
>>> 2.) Fetch receives a ROB squashing signal from commit, resulting in
>>> removing any instruction not in the ROB
>>> 3.) Decode detects an incorrect branch prediction, resulting in removal
>>> of all instructions back to the bad seq num.
>>> Once all five stages have completed, the CPU cleans up all the removed
>>> in-flight instructions. This line in particular
>>> in cleanUpRemovedInsts() in cpu/o3/cpu.cc deconstructs a DynInstPtr:
>>> instList.erase(removeList.front());
>>> When I turn on the debug flag O3CPU, I see the message "Removing
>>> instruction, ..." (from o3/cpu.cc) with the threadNum, seqNum and pcState
>>> after all 5 cpu stages have completed, and one of the conditions above is
>>> met. I also see what tick it occurs on.
>>> When I turn on the DynInst debug flag, I see when instructions are
>>> created and destroyed (cpu/base_dyn_inst_impl.hh) and on what tick. From
>>> analyzing the trace files, I've gathered that this takes into account that
>>> instructions have different execution lengths. So if a memory instruction
>>> in the instList (DynInstPtr) is removed on one tick, the destruction of
>>> the DynInst for that memory instruction will occur much later (i.e. 1M
>>> ticks later). I have yet to determine how this is implemented.
>>> Now for the problem.
>>> What I'm seeing when I run dramsim2 dram memory is a significant
>>> difference between the size of the instList vector (of DynInstPtr objects),
>>> and the dynamic instruction count (of DynInst objects). The
>>> benchmark I'm running is libquantum from SPEC 2006. For the first roughly
>>> 130B ticks, the dynamic instruction count kept in cpu/base_dyn_inst_impl.hh
>>> shadows the instList size in o3/cpu.cc (figure linked below) very closely.
>>> Around tick 130B after libquantum started, it starts hitting what I'm
>>> assuming are loops (therefore branch prediction), resulting in some
>>> behavior that seems to imply improper instruction handling (i.e. more
>>> instructions in flight than allowed by ROB).
>>> I wasn't able to sync-up the physical and dramsim2 traces exactly by
>>> trace, but they should represent roughly the same area of execution. They
>>> don't execute the same due to the dramsim2 modeling the memory differently
>>> (i.e. latency and other delays).
>>> I've shared both traces on my public Dropbox here --
>>>
>>> http://dl.dropbox.com/u/2953302/gem5/physical-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU.out.gz
>>>
>>> http://dl.dropbox.com/u/2953302/gem5/dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz
>>> Here are a couple plots of tick versus instruction count, with respect
>>> to cpu->instcount in cpu/base_dyn_inst_impl.hh and instList.size() in
>>> cpu/o3/cpu.cc. --
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_physical.png
>>> http://dl.dropbox.com/u/2953302/gem5/dyninst_vs_dyninstptr_dramsim2.png
>>> Note that I added the printout of the instList size to an existing O3CPU
>>> DPRINTF in cleanUpRemovedInsts() in cpu/o3/cpu.cc.
>>> Here are the commands I ran to parse the traces into data files to
>>> analyze in MATLAB and create the plots:
>>> zgrep DynInst dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | grep destroyed | awk '{print $1,$11}' > cpuinstcount.out
>>> zgrep instList dramsim2-fs-040612-ROB-Commit-DynInst-Fetch-O3CPU-2.out.gz | awk '{print $1,$11}' > instlistsize.out
>>> It seems to me like the problem might lie in gem5, but has just been
>>> exposed by integrating this more detailed memory model, dramsim2, into
>>> gem5. Either that, or there are some timing errors in how dramsim2 was
>>> integrated. I doubt this, however, since those first 190B ticks executed
>>> used the dramsim2 memory. I believe the problem is a combination of memory
>>> instructions + complex loops (branch prediction), resulting in improper
>>> destroying of instructions.
>>> I've included the ROB, Commit, Fetch, DynInst and O3CPU debug flags.
>>> There are 192 ROB entries, which is why the instList size generally has a
>>> max of about 192 instructions. The dynamic instruction counts (seen in the
>>> dramsim2 plot) seem to also imply that instructions are incorrectly being
>>> removed from the ROB, and then from the cpu's instruction list in cpu.cc,
>>> which allows more and more instructions to be added to the system (possibly
>>> from a bad branch).
>>> I appreciate any help in debugging this and further figuring out the
>>> root problem, just let me know if you need anything else from me. I don't
>>> have much more time at the moment to debug, but I can take any advice for
>>> quick changes and/or additional traces, then send the results back to the
>>> list for discussion.
>>> Thanks,
>>> Andrew
>>> P.S. Paul - I did try decreasing the size of the dramsim2 transaction
>>> (and even command) queue from 512 to 32. The same instructions problem
>>> occurred. It basically just decreased the execution time.
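The two data files produced by the zgrep/awk commands above can also be compared without MATLAB. The following is only a rough Python sketch: it assumes each line of cpuinstcount.out and instlistsize.out holds a tick (possibly with a trailing colon) followed by an integer count, which depends on the exact DPRINTF layout. The function names, the max_gap threshold, and the sample values are hypothetical and not taken from the traces.

```python
import bisect


def parse_series(lines):
    """Parse 'tick value' lines into a sorted list of (tick, value) pairs."""
    series = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not have exactly two fields
        try:
            tick = int(parts[0].rstrip(":"))
            value = int(parts[1])
        except ValueError:
            continue  # skip fields that are not plain integers
        series.append((tick, value))
    series.sort()
    return series


def first_divergence(inst_count, inst_list, max_gap):
    """Find the first instList sample that the DynInst count outruns.

    For each (tick, instList size) sample, look up the most recent DynInst
    count at or before that tick and return (tick, count, size) for the
    first sample where count - size exceeds max_gap, or None if none does.
    """
    ticks = [t for t, _ in inst_count]
    for tick, list_size in inst_list:
        i = bisect.bisect_right(ticks, tick) - 1
        if i < 0:
            continue  # no DynInst sample yet at this tick
        live = inst_count[i][1]
        if live - list_size > max_gap:
            return tick, live, list_size
    return None


# Made-up sample data in the assumed "tick value" layout.
inst_count = parse_series(["100: 150", "200: 190", "300: 400"])
inst_list = parse_series(["150: 148", "250: 188", "350: 192"])

if __name__ == "__main__":
    print(first_divergence(inst_count, inst_list, max_gap=64))
```

With real data, the returned tick would be a reasonable place to start enabling the more verbose debug flags, since it marks roughly where the live DynInst count stops shadowing the instList size.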
>>>
>>> On Wed, Mar 14, 2012 at 2:10 PM, Ali Saidi <sa...@umich.edu> wrote:
>>>
>>>> The error is that there are more than 1500 instructions currently in
>>>> flight in the system. It could mean several things:
>>>>
>>>> 1. The value is somewhat arbitrarily defined and maybe there are more
>>>> than 1500 in your system at one time?
>>>>
>>>> 2. Instructions aren't being destroyed correctly
>>>>
>>>> You could try to run a debug binary so you'll get a list of
>>>> instructions when it happens or increase the number which may
>>>> be appropriate for certain situations (but 1500 is quite a few inflight
>>>> instructions).
>>>>
>>>> Ali
>>>>
>>>> On 13.03.2012 10:56, Andrew Cebulski wrote:
>>>>
>>>> Hi Xiangyu,
>>>> I just started looking into this some more. So at first I thought
>>>> it was due to updating to a more recent revision, but then I went back to
>>>> revision 8643, added your patch, built and ran....and now get the error
>>>> with it too (when running ARM_FS/gem5.opt). I'm testing now to see if an
>>>> update to SWIG might have resulted in this error, maybe someone on the
>>>> mailing list would know if that's possible. The difference is 1.3.40 vs.
>>>> 2.0.3, both of which are supported according to the dependencies wiki page.
>>>> Just for completeness, here's the error from revision 8643:
>>>> build/ARM_FS/cpu/base_dyn_inst_impl.hh:149: void
>>>> BaseDynInst::initVars() [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>> I have not tried running with gem5.debug, so I will be doing that
>>>> today. Maybe this is an assertion that is occurring due to an
>>>> optimization. That would mean it wouldn't be triggered in gem5.debug since
>>>> it runs without optimizations. Have you tested all debug, opt and fast
>>>> with your tests?
>>>> Thanks,
>>>> Andrew
>>>>
>>>> On Tue, Mar 13, 2012 at 1:37 PM, Rio Xiangyu Dong <riosher...@gmail.com> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> I didn't see this error in my simulations.
>>>>> May I ask which gem5
>>>>> version you are using? I find some of the latest code updates do not
>>>>> comply with my changes. I am still using the DRAMsim2 patch on Gem5
>>>>> repo8643, and have run all the runnable benchmarks in SPEC2006,
>>>>> SPEC2000, EEMBC2, and PARSEC2 on ARM_SE.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best,
>>>>>
>>>>> Xiangyu
>>>>>
>>>>> *From:* Andrew Cebulski [mailto:af...@drexel.edu]
>>>>> *Sent:* Thursday, March 08, 2012 6:52 PM
>>>>> *To:* gem5 users mailing list
>>>>> *Cc:* riosher...@gmail.com; sa...@umich.edu
>>>>> *Subject:* Re: [gem5-users] A Patch for DRAMsim2 Integration
>>>>>
>>>>> Xiangyu,
>>>>>
>>>>> I've been having an issue recently with the number of instructions
>>>>> I've been seeing committed to the CPU (I have a separate thread on this).
>>>>> It turns out the issue seems to be coming from this patch you created to
>>>>> integrate DramSim2 with Gem5. Unfortunately, I've been running with
>>>>> gem5.fast, not gem5.opt. So up until now, I haven't been seeing
>>>>> assertions. I thought I'd run it with gem5.opt or debug back in December,
>>>>> but I must not have. My runs on the Arm O3 cpu fail with this assertion:
>>>>>
>>>>> build/ARM/cpu/base_dyn_inst_impl.hh:149: void BaseDynInst::initVars()
>>>>> [with Impl = O3CPUImpl]: Assertion `cpu->instcount
>>>>>
>>>>> -Andrew
>>>>>
>>>>> Date: Sun, 18 Dec 2011 01:48:58 -0800
>>>>> From: "Dong, Xiangyu" <riosher...@gmail.com>
>>>>> To: "gem5 users mailing list" <gem5-users@gem5.org>
>>>>> Subject: [gem5-users] A Patch for DRAMsim2 Integration
>>>>> Message-ID: gmail.com>
>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a Gem5+DRAMsim2 patch. I've tested it under both SE and FS
>>>>> modes. I'm willing to share it here.
>>>>>
>>>>> For those who have such needs, please go to my website
>>>>> www.cse.psu.edu/~xydong <http://www.cse.psu.edu/%7Exydong> to
>>>>> download the patch and test it. To enable DRAMSim2, use the
>>>>> se_dramsim2.py script instead of se.py (for FS, you can create one
>>>>> yourself). The basic idea to enable the DRAMsim2 module is to use the
>>>>> derived DRAMMemory class instead of the PhysicalMemory class.
>>>>>
>>>>> Please let me know if there are bugs.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best,
>>>>>
>>>>> Xiangyu Dong
>>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list
>>>> gem5-users@gem5.org
>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users