I couldn't deal with it right away and then forgot about it. It's still broken to the best of my knowledge.
Gabe

nathan binkert wrote:
> Whatever happened with this? I just lost track.
>
> Nate
>
>> It's broader than tracing and not caused by the tracing mechanism
>> itself, but I think it will only show up with tracing. The pointer to
>> the trace data will be NULL otherwise and the instruction won't
>> attempt to use it. Nothing else that currently exists is, to my
>> knowledge, in a position to be affected by this.
>>
>> Gabe
>>
>> Quoting nathan binkert <[email protected]>:
>>
>>> Does this problem really have anything to do with tracing, or is it
>>> just more apparent with it?
>>>
>>> On Sat, Apr 4, 2009 at 1:49 PM, Gabe Black <[email protected]> wrote:
>>>
>>>> Oooooooooooh. I see what's broken. This is a result of my changes to
>>>> allow delaying translation. What happens is that Stl_c goes into
>>>> initiateAcc. That function calls write on the CPU, which calls into
>>>> the TLB, which calls the translation callback, which recognizes a
>>>> failed store conditional and completes the instruction's execution
>>>> with completeAcc and cleans up. The call stack then collapses back to
>>>> initiateAcc, which is still waiting to finish and which then tries to
>>>> call a member function on traceData, which was deleted during the
>>>> cleanup. The problem here is not fundamentally complicated, but the
>>>> mechanisms involved are.
>>>>
>>>> One solution would be to record the fact that we're still in
>>>> initiateAcc, and if we are, wait for the call stack to collapse back
>>>> down to initiateAcc's caller before calling into completeAcc. That
>>>> more closely matches the semantics an instruction would expect, I
>>>> think, where the initiateAcc/completeAcc pair are called sequentially.
>>>>
>>>> One other concern this raises is that the code in the simple timing
>>>> CPU is not very simple.
>>>> One thing that would help would be to relocate some of the special
>>>> cases, like failed store conditionals or memory-mapped registers,
>>>> into different bodies of code, or at least out of the midst of
>>>> everything else going on. I haven't thought about this in any depth,
>>>> but I'll try to put together a flow chart sort of thing to explain
>>>> what happens to memory instructions as they execute. That would be
>>>> good for the sake of documentation, and also so we have something
>>>> concrete to talk about.
>>>>
>>>> Gabe
>>>>
>>>> Gabe Black wrote:
>>>>
>>>>> The segfault for me happens in malloc called by the new operator in
>>>>> exetrace.hh on line 84. That says to me that the most likely culprit
>>>>> is heap corruption, which will be very obnoxious to track down. I've
>>>>> started up a run of valgrind just in case it can catch something bad
>>>>> happening sometime in the next n hours.
>>>>>
>>>>> Gabe
>>>>>
>>>>> Gabe Black wrote:
>>>>>
>>>>>> Oh wow. It did happen eventually. I'll see if I can figure out
>>>>>> what's going on.
>>>>>>
>>>>>> Gabe
>>>>>>
>>>>>> Gabe Black wrote:
>>>>>>
>>>>>>> I tried that command line and I haven't seen any segfault yet.
>>>>>>> I'll let it run and see if anything happens. What version of the
>>>>>>> code are you using?
>>>>>>>
>>>>>>> Gabe
>>>>>>>
>>>>>>> Geoffrey Blake wrote:
>>>>>>>
>>>>>>>> I've added a couple of edits, but nothing major, i.e., added
>>>>>>>> statistics to the bus model, and some extra latency randomization
>>>>>>>> to cache misses to get better averages of parallel code runs.
>>>>>>>> None of this is tied to the trace-flags mechanism that I can
>>>>>>>> determine.
>>>>>>>>
>>>>>>>> I did run the code through valgrind, but ridiculously enough, the
>>>>>>>> segfault disappears. I'll keep digging in my spare time.
>>>>>>>> The "Exec" trace flags work fine (billions of instructions, no
>>>>>>>> problems) with an old version of m5 that is somewhere between
>>>>>>>> beta4 and beta5 of the stable releases. Now I can trace maybe a
>>>>>>>> few thousand instructions before M5 segfaults.
>>>>>>>>
>>>>>>>> Here is a stripped-down command line that exposes the bug with
>>>>>>>> the fewest variables to consider, in case someone out there wants
>>>>>>>> to try to duplicate the segfaults I'm seeing (it could be a
>>>>>>>> product of my build setup, so I'd appreciate it if someone could
>>>>>>>> verify independently):
>>>>>>>>
>>>>>>>> % m5.opt --trace-flags="ExecEnable" fs.py -b MutexTest -t -n 1 > /dev/null
>>>>>>>>
>>>>>>>> Geoff
>>>>>>>>
>>>>>>>> *From:* [email protected] [mailto:[email protected]]
>>>>>>>> *On Behalf Of* Korey Sewell
>>>>>>>> *Sent:* Friday, April 03, 2009 9:56 AM
>>>>>>>> *To:* M5 Developer List
>>>>>>>> *Subject:* Re: [m5-dev] Memory corruption in m5 dev repository
>>>>>>>> when using --trace-flags="ExecEnable"
>>>>>>>>
>>>>>>>> I would echo Gabe's sentiments. I've been suspicious of the
>>>>>>>> trace-flags causing memory corruption for a while now, but every
>>>>>>>> time I dig into it there's some small error of my own propagating
>>>>>>>> through that finally surfaces.
>>>>>>>>
>>>>>>>> In the big picture, I suspect that the trace-flags just
>>>>>>>> exacerbate any kind of memory-corruption issue, since you are
>>>>>>>> accessing things at such a heavy rate.
>>>>>>>>
>>>>>>>> In terms of debugging, is there any code you edited that is
>>>>>>>> triggered when you use "ExecEnable" rather than just "Exec"?
>>>>>>>>
>>>>>>>> Also, if you can turn valgrind on for maybe the first
>>>>>>>> thousand/million cycles with ExecEnable, you'll probably find
>>>>>>>> something.
>>>>>>>> On Thu, Apr 2, 2009 at 7:28 PM, Gabriel Michael Black
>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>
>>>>>>>> Does this happen when you start tracing sooner? I'd suggest
>>>>>>>> valgrind, especially if you can make the segfault happen quickly.
>>>>>>>> If you wait for your simulation to get to 1400000000000 ticks in
>>>>>>>> valgrind, you may die before you see the result. There's a
>>>>>>>> suppression file in util which should cut down on the noise.
>>>>>>>>
>>>>>>>> Gabe
>>>>>>>>
>>>>>>>> Quoting Geoffrey Blake <[email protected] <mailto:[email protected]>>:
>>>>>>>>
>>>>>>>>> I stumbled upon what appears to be a memory corruption bug in
>>>>>>>>> the current M5 repository. If on the command line I enter:
>>>>>>>>>
>>>>>>>>> % ./build/ALPHA_FS/m5.opt -trace-flags="ExecEnable"
>>>>>>>>> -trace-start=1400000000000 fs.py -b <benchmark> -t -n <cpus>
>>>>>>>>> <more parameters>
>>>>>>>>>
>>>>>>>>> the simulator will error with a segmentation fault, or
>>>>>>>>> occasionally an assert, not long after starting to trace
>>>>>>>>> instructions.
>>>>>>>>>
>>>>>>>>> I have run this through gdb with m5.debug and see the same
>>>>>>>>> errors; the problem is that the stack trace showing the cause of
>>>>>>>>> the segfault or assert changes depending on the inputs to the
>>>>>>>>> simulator. So I have not been able to pinpoint this bug, which
>>>>>>>>> appears to be a subtle memory corruption somewhere in the code.
>>>>>>>>> This error does not happen for other trace flags, such
>>>>>>>>> as the "Cache" trace flag.
>>>>>>>>> It appears linked solely to the instruction tracing mechanism.
>>>>>>>>> Has anybody else seen this bug?
>>>>>>>>>
>>>>>>>>> I'm using an up-to-date repository I pulled from m5sim.org
>>>>>>>>> <http://m5sim.org> this morning.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Geoff
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> m5-dev mailing list
>>>>>>>> [email protected] <mailto:[email protected]>
>>>>>>>> http://m5sim.org/mailman/listinfo/m5-dev
>>>>>>>>
>>>>>>>> --
>>>>>>>> ----------
>>>>>>>> Korey L Sewell
>>>>>>>> Graduate Student - PhD Candidate
>>>>>>>> Computer Science & Engineering
>>>>>>>> University of Michigan
