Oooooooooooh. I see what's broken. This is a result of my changes to
allow delaying translation. What happens is that Stl_c goes into
initiateAcc. That function calls write on the CPU, which calls into the
TLB, which calls the translation callback, which recognizes a failed
store conditional, completes the instruction's execution with
completeAcc, and cleans up. The call stack then collapses back to
initiateAcc, which is still waiting to finish and which then tries to
call a member function on traceData, which was deleted during the
cleanup. The problem here is not fundamentally complicated, but the
mechanisms involved are. One solution would be to record the fact that
we're still in initiateAcc and, if we are, wait for the call stack to
collapse back down to initiateAcc's caller before calling into
completeAcc. I think that better matches the semantics an instruction
would expect, where the initiateAcc/completeAcc pair are called
sequentially.
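
To make the idea concrete, here is a rough sketch of the "remember that
we're still in initiateAcc and defer completeAcc" approach. This is not
the actual simple timing CPU code. ToyCpu, TraceData, inInitiateAcc,
completionPending and executeFailedStoreCond are all made-up names, and
the TLB and translation machinery are collapsed into a single callback.
It only illustrates the ordering, under those assumptions.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the trace record that cleanup deletes.
struct TraceData {
    std::vector<std::string> &log;
    explicit TraceData(std::vector<std::string> &l) : log(l) {}
    void note(const std::string &msg) { log.push_back(msg); }
};

// Hypothetical, heavily simplified CPU. Not the real simple timing CPU.
class ToyCpu {
  public:
    std::vector<std::string> log;

    // Executes one store conditional whose failure is detected while
    // initiateAcc is still on the call stack.
    void executeFailedStoreCond() {
        traceData = std::make_unique<TraceData>(log);

        inInitiateAcc = true;
        initiateAcc();
        inInitiateAcc = false;

        // The stack has collapsed back to initiateAcc's caller, so it is
        // now safe to run a completion that was requested in the meantime.
        if (completionPending) {
            completionPending = false;
            completeAcc();
        }
    }

    bool traceDataAlive() const { return traceData != nullptr; }

  private:
    std::unique_ptr<TraceData> traceData;
    bool inInitiateAcc = false;
    bool completionPending = false;

    void initiateAcc() {
        write();
        // Without the deferral below, traceData would already be gone here.
        traceData->note("initiateAcc done");
    }

    // Stands in for CPU write -> TLB -> translation callback.
    void write() { translationCallback(); }

    void translationCallback() {
        // A failed store conditional wants to complete immediately.
        if (inInitiateAcc)
            completionPending = true;  // defer until initiateAcc returns
        else
            completeAcc();
    }

    void completeAcc() {
        traceData->note("completeAcc");
        traceData.reset();  // cleanup deletes the trace record
    }
};
```

Calling executeFailedStoreCond() on a ToyCpu should leave the log with
initiateAcc's entry first and completeAcc's entry second, with the trace
record freed only after both have run. If the flag check in
translationCallback were dropped, completeAcc would free traceData while
initiateAcc still needs it, which is the failure described above.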

One other concern this raises is that the code in the simple timing CPU
is not very simple. One thing that would help would be to try to
relocate some of the special cases, like failed store conditionals or
memory mapped registers, into different bodies of code or at least out
of the midst of everything else going on. I haven't thought about this
in any depth, but I'll try to put together a flow chart sort of thing to
explain what happens to memory instructions as they execute. That would
be good for the sake of documentation and also so we have something
concrete to talk about.

Gabe

Gabe Black wrote:
> The segfault for me happens in malloc, called by the new operator in
> exetrace.hh on line 84. That says to me that the most likely culprit is
> heap corruption, which will be very obnoxious to track down. I've started
> up a run of valgrind just in case it can catch something bad happening
> sometime in the next n hours.
>
> Gabe
>
> Gabe Black wrote:
>> Oh wow. It did happen eventually. I'll see if I can figure out what's
>> going on.
>>
>> Gabe
>>
>> Gabe Black wrote:
>>> I tried that command line and I haven't seen any segfault yet. I'll let
>>> it run and see if anything happens. What version of the code are you using?
>>>
>>> Gabe
>>>
>>> Geoffrey Blake wrote:
>>>> I’ve added a couple of edits, but nothing major, i.e. statistics for
>>>> the bus model and some extra latency randomization on cache misses to
>>>> get better averages of parallel code runs.  None of this is tied to
>>>> the trace-flags mechanism, as far as I can determine.
>>>>
>>>>  
>>>>
>>>> I did run the code through valgrind, but ridiculously enough, the
>>>> segfault disappears. I’ll keep digging in my spare time. 
>>>>
>>>>  
>>>>
>>>> The “Exec” trace flags work fine (billions of instructions, no
>>>> problems) with an old version of m5 that is somewhere between beta4
>>>> and beta5 of the stable releases. Now I can trace maybe a few thousand
>>>> instructions before M5 seg faults.
>>>>
>>>>  
>>>>
>>>> Here is a stripped command line that does expose the bug with the
>>>> least number of variables to consider in case someone out there wants
>>>> to try and duplicate the segfaults I’m seeing (it could be a product
>>>> of my build setup, so I’d appreciate it if someone could verify
>>>> independently):
>>>>
>>>> % m5.opt --trace-flags="ExecEnable" fs.py -b MutexTest -t -n 1 > /dev/null
>>>>
>>>>  
>>>>
>>>> Geoff
>>>>
>>>>  
>>>>
>>>> *From:* m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org] *On
>>>> Behalf Of *Korey Sewell
>>>> *Sent:* Friday, April 03, 2009 9:56 AM
>>>> *To:* M5 Developer List
>>>> *Subject:* Re: [m5-dev] Memory corruption in m5 dev repository when
>>>> using --trace-flags="ExecEnable"
>>>>
>>>>  
>>>>
>>>> I would echo Gabe's sentiments. I've been suspicious of the trace-flags
>>>> causing memory corruption for a while now, but every time I dig into it
>>>> there's some small error that I'm propagating through that finally
>>>> surfaces.
>>>>
>>>> In the big picture, I suspect that the trace-flags just exacerbate any
>>>> kind of memory-corruption issues since you are accessing things at
>>>> such a heavy rate.
>>>>
>>>> In terms of debugging, is there any code that you edited that is
>>>> tagged when you use "ExecEnable" rather than just "Exec"?
>>>>
>>>> Also, if you can turn valgrind on for maybe the 1st thousand/million
>>>> cycles with ExecEnable you'll probably find something.
>>>>
>>>> On Thu, Apr 2, 2009 at 7:28 PM, Gabriel Michael Black
>>>> <gbl...@eecs.umich.edu <mailto:gbl...@eecs.umich.edu>> wrote:
>>>>
>>>> Does this happen when you start tracing sooner? I'd suggest valgrind,
>>>> especially if you can make the segfault happen quickly. If you wait
>>>> for your simulation to get to 1400000000000 ticks in valgrind, you may
>>>> die before you see the result. There's a suppression file in util
>>>> which should cut down on the noise.
>>>>
>>>> Gabe
>>>>
>>>>
>>>> Quoting Geoffrey Blake <bla...@umich.edu <mailto:bla...@umich.edu>>:
>>>>
>>>>> I stumbled upon what appears to be a memory corruption bug in the
>>>>> current M5 repository.  If on the command line I enter:
>>>>>
>>>>> % ./build/ALPHA_FS/m5.opt --trace-flags="ExecEnable"
>>>>> --trace-start=1400000000000 fs.py -b <benchmark> -t -n <cpus> <more
>>>>> parameters>
>>>>>
>>>>> the simulator will error with a segmentation fault or occasionally
>>>>> an assert not long after starting to trace instructions.
>>>>>
>>>>> I have run this through gdb with m5.debug and see the same errors.
>>>>> The problem is that the stack trace showing the cause of the seg
>>>>> fault or assert changes depending on the inputs to the simulator, so
>>>>> I have not been able to pinpoint this bug, which appears to be a
>>>>> subtle memory corruption somewhere in the code. This error does not
>>>>> happen for other trace flags such as the "Cache" trace flag. It
>>>>> appears linked solely to the instruction tracing mechanism.  Has
>>>>> anybody else seen this bug?
>>>>>
>>>>> I'm using an up-to-date repository I pulled from m5sim.org
>>>>> <http://m5sim.org> this morning.
>>>>>
>>>>> Thanks,
>>>>> Geoff
>>>>>
>>>> _______________________________________________
>>>> m5-dev mailing list
>>>> m5-dev@m5sim.org <mailto:m5-dev@m5sim.org>
>>>> http://m5sim.org/mailman/listinfo/m5-dev
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> ----------
>>>> Korey L Sewell
>>>> Graduate Student - PhD Candidate
>>>> Computer Science & Engineering
>>>> University of Michigan
>>>>
>>>>
