Whatever happened with this?  I just lost track.

  Nate

> It's broader than tracing and not caused by the tracing mechanism
> itself, but I think it will only show up with tracing. The pointer to
> the trace data will be NULL otherwise, and the instruction won't
> attempt to use it. Nothing else that currently exists is, to my
> knowledge, in a position to be affected by this.
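>
> (A tiny standalone illustration of that guard; the names are made up
> for the example, not the actual m5 types:
>
>     #include <cstdio>
>
>     struct TraceRecord {
>         void setData(int v) { std::printf("data=%d\n", v); }
>     };
>
>     int main() {
>         TraceRecord *traceData = nullptr;  // stays NULL unless tracing is on
>         if (traceData)                     // guard: skipped when not tracing
>             traceData->setData(42);
>     }
>
> Without tracing, the pointer is never dereferenced.)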
>
> Gabe
>
> Quoting nathan binkert <[email protected]>:
>
>> Does this problem really have anything to do with tracing, or is it
>> just more apparent with it?
>>
>> On Sat, Apr 4, 2009 at 1:49 PM, Gabe Black <[email protected]> wrote:
>>> Oooooooooooh. I see what's broken. This is a result of my changes to
>>> allow delaying translation. What happens is that Stl_c goes into
>>> initiateAcc. That function calls write on the CPU, which calls into
>>> the TLB, which calls the translation callback. The callback recognizes
>>> a failed store conditional, completes the instruction's execution with
>>> completeAcc, and cleans up. The call stack then collapses back to
>>> initiateAcc, which is still waiting to finish and which then tries to
>>> call a member function on traceData, which was deleted during the
>>> cleanup. The problem here is not fundamentally complicated, but the
>>> mechanisms involved are. One solution would be to record the fact that
>>> we're still in initiateAcc and, if so, wait for the call stack to
>>> collapse back down to initiateAcc's caller before calling into
>>> completeAcc. I think that better matches the semantics an instruction
>>> would expect, where the initiateAcc/completeAcc pair are called
>>> sequentially.
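>>>
>>> Roughly, the idea might look like this (a standalone sketch with
>>> invented names, not the real CPU code):
>>>
>>>     #include <cstdio>
>>>
>>>     struct Inst {
>>>         bool inInitiateAcc = false;      // still inside initiateAcc?
>>>         bool completionPending = false;  // did the callback fire early?
>>>
>>>         void initiateAcc() {
>>>             inInitiateAcc = true;
>>>             startTranslation();          // may call finishTranslation() inline
>>>             inInitiateAcc = false;
>>>             if (completionPending) {     // deferred from the callback
>>>                 completionPending = false;
>>>                 completeAcc();           // safe: the stack has collapsed back
>>>             }
>>>         }
>>>
>>>         // Translation callback; with delayed translation this can run
>>>         // either inline under initiateAcc or later from an event.
>>>         void finishTranslation(bool failedStCond) {
>>>             if (failedStCond && inInitiateAcc) {
>>>                 completionPending = true;  // don't clean up under initiateAcc
>>>                 return;
>>>             }
>>>             completeAcc();
>>>         }
>>>
>>>         void startTranslation() { finishTranslation(true); } // inline failure
>>>         void completeAcc() { std::printf("completeAcc: cleanup here\n"); }
>>>     };
>>>
>>>     int main() { Inst i; i.initiateAcc(); }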
>>>
>>> One other concern this raises is that the code in the simple timing CPU
>>> is not very simple. One thing that would help would be to relocate some
>>> of the special cases, like failed store conditionals or memory-mapped
>>> registers, into separate bodies of code, or at least out of the midst
>>> of everything else going on. I haven't thought about this
>>> in any depth, but I'll try to put together a flow chart sort of thing to
>>> explain what happens to memory instructions as they execute. That would
>>> be good for the sake of documentation and also so we have something
>>> concrete to talk about.
>>>
>>> Gabe
>>>
>>> Gabe Black wrote:
>>>> The segfault for me happens in malloc called by the new operator in
>>>> exetrace.hh on line 84. That says to me that the most likely culprit is
>>>> heap corruption, which will be very obnoxious to track down. I've started
>>>> up a run of valgrind just in case it can catch something bad happening
>>>> sometime in the next n hours.
>>>>
>>>> Gabe
>>>>
>>>> Gabe Black wrote:
>>>>
>>>>> Oh wow. It did happen eventually. I'll see if I can figure out what's
>>>>> going on.
>>>>>
>>>>> Gabe
>>>>>
>>>>> Gabe Black wrote:
>>>>>
>>>>>
>>>>>> I tried that command line and I haven't seen any segfault yet. I'll let
>>>>>> it run and see if anything happens. What version of the code are
>>>>>> you using?
>>>>>>
>>>>>> Gabe
>>>>>>
>>>>>> Geoffrey Blake wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I've added a couple of edits, but nothing major, i.e., added
>>>>>>> statistics to the bus model and some extra latency randomization to
>>>>>>> cache misses to get better averages across parallel code runs. None
>>>>>>> of this is tied to the trace-flags mechanism as far as I can tell.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I did run the code through valgrind, but ridiculously enough, the
>>>>>>> segfault disappears. I’ll keep digging in my spare time.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The "Exec" trace flag works fine (billions of instructions, no
>>>>>>> problems) with an old version of m5 somewhere between the beta4 and
>>>>>>> beta5 stable releases. Now I can trace maybe a few thousand
>>>>>>> instructions before M5 segfaults.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here is a stripped command line that exposes the bug with the fewest
>>>>>>> variables to consider, in case someone out there wants to try to
>>>>>>> duplicate the segfaults I'm seeing (it could be a product of my build
>>>>>>> setup, so I'd appreciate it if someone could verify independently):
>>>>>>>
>>>>>>> % m5.opt --trace-flags="ExecEnable" fs.py -b MutexTest -t -n 1 >
>>>>>>> /dev/null
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Geoff
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: [email protected] [mailto:[email protected]] On Behalf
>>>>>>> Of Korey Sewell
>>>>>>> Sent: Friday, April 03, 2009 9:56 AM
>>>>>>> To: M5 Developer List
>>>>>>> Subject: Re: [m5-dev] Memory corruption in m5 dev repository when
>>>>>>> using --trace-flags="ExecEnable"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I would echo Gabe's sentiments. I've been suspicious of the
>>>>>>> trace-flags causing memory corruption for a while now, but every time
>>>>>>> I dig into it, it turns out to be some small error I've been
>>>>>>> propagating that finally surfaces.
>>>>>>>
>>>>>>> In the big picture, I suspect that the trace-flags just exacerbate
>>>>>>> any kind of memory-corruption issue, since you are accessing things
>>>>>>> at such a heavy rate.
>>>>>>>
>>>>>>> In terms of debugging, is there any code that you edited that is
>>>>>>> tagged when you use "ExecEnable" rather than just "Exec"?
>>>>>>>
>>>>>>> Also, if you can turn valgrind on for maybe the first
>>>>>>> thousand/million cycles with ExecEnable, you'll probably find
>>>>>>> something.
>>>>>>>
>>>>>>> On Thu, Apr 2, 2009 at 7:28 PM, Gabriel Michael Black
>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>
>>>>>>> Does this happen when you start tracing sooner? I'd suggest valgrind,
>>>>>>> especially if you can make the segfault happen quickly. If you wait
>>>>>>> for your simulation to get to 1400000000000 ticks in valgrind, you may
>>>>>>> die before you see the result. There's a suppression file in util
>>>>>>> which should cut down on the noise.
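>>>>>>>
>>>>>>> For example, something along these lines (the suppression file name
>>>>>>> here is a guess; check what's actually in util/):
>>>>>>>
>>>>>>> % valgrind --suppressions=util/valgrind-suppressions \
>>>>>>>     ./build/ALPHA_FS/m5.debug --trace-flags="ExecEnable" \
>>>>>>>     fs.py -b <benchmark> -t -n <cpus>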
>>>>>>>
>>>>>>> Gabe
>>>>>>>
>>>>>>>
>>>>>>> Quoting Geoffrey Blake <[email protected]>:
>>>>>>>
>>>>>>>> I stumbled upon what appears to be a memory corruption bug in the
>>>>>>>> current M5 repository. If on the command line I enter:
>>>>>>>>
>>>>>>>> % ./build/ALPHA_FS/m5.opt --trace-flags="ExecEnable"
>>>>>>>> --trace-start=1400000000000 fs.py -b <benchmark> -t -n <cpus> <more
>>>>>>>> parameters>
>>>>>>>>
>>>>>>>> the simulator will error with a segmentation fault, or occasionally
>>>>>>>> an assert, not long after starting to trace instructions.
>>>>>>>>
>>>>>>>> I have run this through gdb with m5.debug and see the same errors;
>>>>>>>> the problem is that the stack trace showing the cause of the
>>>>>>>> segfault or assert changes depending on the inputs to the
>>>>>>>> simulator. So I have not been able to pinpoint this bug, which
>>>>>>>> appears to be a subtle memory corruption somewhere in the code.
>>>>>>>> This error does not happen for other trace flags such as "Cache";
>>>>>>>> it appears linked solely to the instruction tracing mechanism. Has
>>>>>>>> anybody else seen this bug?
>>>>>>>>
>>>>>>>> I'm using an up-to-date repository I pulled from m5sim.org this
>>>>>>>> morning.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Geoff
>>>>>>>
>>>>>>> --
>>>>>>> ----------
>>>>>>> Korey L Sewell
>>>>>>> Graduate Student - PhD Candidate
>>>>>>> Computer Science & Engineering
>>>>>>> University of Michigan
>>>>>>>
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
