Re: [m5-users] Cannot resume checkpoint

Richard Strong Wed, 16 Feb 2011 21:48:31 -0800

I took a close look at this problem because the same thing happens to me. It
only occurs when I resume from a checkpoint with the O3CPU model. What I
find is that config.ini has "orphan" for the FUList parameter of the O3CPU
model. Further, none of the function units are adopted by fuPool. I think
the problem lies in SimObject.py::add_child(self, name, child) and
SimObject.py::adoptOrphanParams(self): there appears to be no recursion to
adopt the children of params. I tried a simple change at the end of
add_child that calls adoptOrphanParams() on the child (change shown below).
This lets the setup code get further, but now it dies with:

"AttributeError: 'AnyProxy' object has no attribute 'getValue'"

Does anyone know what is going wrong? Did a recent change forget to go down
enough recursive levels when adopting child nodes?

Best,
-Rick

    def add_child(self, name, child):
        print "\t in add_child name=%s child=%s"%(name, child)
        child = coerceSimObjectOrVector(child)
        if child.get_parent():
            raise RuntimeError, \
                  "add_child('%s'): child '%s' already has parent '%s'" % \
                  (name, child._name, child._parent)
        if self._children.has_key(name):
            # This code path had an undiscovered bug that would make it fail
            # at runtime. It had been here for a long time and was only
            # exposed by a buggy script. Changes here will probably not be
            # exercised without specialized testing.
            self.clear_child(name)
        child.set_parent(self, name)
        self._children[name] = child
        # Experimental change: also adopt the orphan params of the new
        # child (of each element, if the child is a vector).
        if isSimObjectVector(child):
            for obj in child:
                obj.adoptOrphanParams()
        elif isSimObjectOrVector(child):
            child.adoptOrphanParams()
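
To make the suspected failure mode concrete, here is a minimal standalone
sketch. It is not M5's SimObject code: the Node class, adopt_orphan_params,
find_orphans, and the object names are all hypothetical stand-ins. It only
illustrates why adopting params one level deep can leave the function units
under fuPool as orphans, while a recursive adoption does not.

```python
class Node:
    """Toy stand-in for a SimObject (hypothetical, not M5's class)."""

    def __init__(self, name, **params):
        self.name = name
        self.parent = None
        self.children = {}
        # Param values may be a Node, a list of Nodes, or plain data.
        self.params = params

    def add_child(self, name, child):
        if child.parent is not None:
            raise RuntimeError("child %r already has parent %r"
                               % (child.name, child.parent.name))
        child.parent = self
        self.children[name] = child

    def adopt_orphan_params(self, recursive):
        """Make parentless Nodes found in params children of this Node."""
        for key, value in self.params.items():
            is_list = isinstance(value, list)
            items = value if is_list else [value]
            for i, item in enumerate(items):
                if not isinstance(item, Node):
                    continue
                if item.parent is None:
                    child_name = "%s%d" % (key, i) if is_list else key
                    self.add_child(child_name, item)
                # Descend only into Nodes this Node owns.
                if recursive and item.parent is self:
                    item.adopt_orphan_params(recursive=True)


def find_orphans(node):
    """Names of Nodes reachable through params that have no parent."""
    found = []
    for value in node.params.values():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, Node):
                if item.parent is None:
                    found.append(item.name)
                found.extend(find_orphans(item))
    return found


def build():
    # Illustrative hierarchy only: cpu -> fuPool -> FUList of units.
    units = [Node("IntALU"), Node("FPALU")]
    pool = Node("fuPool", FUList=units)
    return Node("cpu", fuPool=pool)


shallow = build()
shallow.adopt_orphan_params(recursive=False)
shallow_orphans = find_orphans(shallow)  # units under fuPool stay orphaned

deep = build()
deep.adopt_orphan_params(recursive=True)
deep_orphans = find_orphans(deep)        # nothing left without a parent

print(shallow_orphans)  # ['IntALU', 'FPALU']
print(deep_orphans)     # []
```

In this toy model the one-level pass adopts fuPool but never looks at its
FUList, which matches the symptom described above. Whether the real
SimObject.py behaves the same way is an assumption.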

>
>
> On Fri, Feb 11, 2011 at 11:05 PM, Joel Hestness <[email protected]> wrote:
>
>> Hi Sheng,
>>   I've dug back through some of my simulations, and I haven't been able to
>> find a case where I used 4GB of simulated memory, so I don't know if I have
>> a baseline to show that the checkpoint restore works with that much memory.
>>  On the other hand, I have simulated with 512MB and 1GB of simulated memory,
>> and it has worked fine.  For full-system simulations, we often mount a swap
>> disk in the simulated system in order to avoid the small virtual memory
>> constraints imposed by the operating system.  I'd have to defer to others on
>> the list for knowledge about whether that would work with SE mode.
>>   I can attempt to address your other questions as well:
>>    1) The way that you described the O3 parameters is how I have set them
>> in the past, so that should work.
>>    2) I've seen this problem before... It has had to do with the way that
>> certain SimObjects are instantiated as children of other SimObjects at the
>> beginning of the simulation, and with checkpoint restore, this isn't the
>> cleanest process.  When I ran into this problem, I was working on getting
>> x86 timing mode working with Ruby, and Brad Beckmann was able to help me
>> debug.  He might be able to suggest first steps for figuring out what's
>> wrong here.
>>   Hope this helps,
>>   Joel
>>
>>
>> On Wed, Feb 9, 2011 at 3:14 PM, Sheng Li <[email protected]> wrote:
>>
>>> And two other questions:
>>>
>>> 1. What should I do to change the O3 parameters such as issueWidth,
>>> commitWidth, etc.? I added a few lines to se.py, as below. It runs fine if
>>> I just run the benchmarks, but if I resume from a checkpoint (created
>>> without the -d option), it complains that the CPU class has no such
>>> parameters. I think these parameters can only be set after M5 performs the
>>> CPU mode switch. How can I set them so that M5 uses them after switching
>>> CPU mode?
>>>
>>>  if options.detailed:
>>>     CPUClass.commitWidth    = 4
>>>     CPUClass.decodeWidth    = 4
>>>     CPUClass.dispatchWidth  = 4
>>>     CPUClass.fetchWidth     = 4
>>>     CPUClass.issueWidth     = 4
>>>     CPUClass.renameWidth    = 4
>>>     CPUClass.squashWidth    = 4
>>>     CPUClass.wbWidth        = 4
>>>     CPUClass.numROBEntries  = 128
>>>     CPUClass.numIQEntries   = 36
>>>     CPUClass.LQEntries      = 48
>>>
>>> 2. When I resume a checkpoint with the -d --caches options, I get
>>> "RuntimeError: Attempt to instantiate orphan node". I am trying to figure
>>> out what the orphan node is. What should I do to find it? I tried
>>> "print self.name" in File
>>> "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py",
>>> line 822, in getCCObject, but got nothing.
>>>
>>>
>>> command line: ./build/ALPHA_SE/m5.opt configs/example/se.py --bench bzip2
>>> --checkpoint-restore=0 --simpoint -d --caches --l2cache
>>> 2200
>>> m5out/cpt.bzip2.2200
>>>
>>> Global frequency set at 1000000000000 ticks per second
>>> Traceback (most recent call last):
>>>   File "<string>", line 1, in ?
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/main.py", line 359, in main
>>>     exec filecode in scope
>>>   File "configs/example/se.py", line 179, in ?
>>>     Simulation.run(options, root, system, FutureClass)
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-work-stable/configs/common/Simulation.py", line 236, in run
>>>     m5.instantiate(checkpoint_dir)
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-work-stable/src/python/m5/simulate.py", line 77, in instantiate
>>>     for obj in root.descendants(): obj.createCCObject()
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 841, in createCCObject
>>>     def createCCObject(self):
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 796, in getCCParams
>>>     value = value.getValue()
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 845, in getValue
>>>     def getValue(self):
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 826, in getCCObject
>>>     self._ccObject = -1
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 796, in getCCParams
>>>     value = value.getValue()
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/params.py", line 183, in getValue
>>>     return [ v.getValue() for v in self ]
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 845, in getValue
>>>     def getValue(self):
>>>   File "/afs/crc.nd.edu/user/s/sli2/m5-stable/src/python/m5/SimObject.py", line 822, in getCCObject
>>>     #print self.name
>>> RuntimeError: Attempt to instantiate orphan node
>>>
>>> Thanks a lot!
>>> -Sheng
>>>
>>>
>>>
>>> On Wed, Feb 9, 2011 at 4:03 PM, Sheng Li <[email protected]> wrote:
>>>
>>>> Thanks Joel!
>>>>
>>>> Yes, I did. The checkpoint created with 4096MB has problems, as a lot of
>>>> information is missing. Is it possible that checkpointing does not
>>>> support larger memory (i.e., 4096MB) in M5?
>>>>
>>>> Thanks
>>>> -Sheng
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 9, 2011 at 3:31 PM, Joel Hestness <[email protected]> wrote:
>>>>
>>>>> Hi Sheng,
>>>>>   Did you collect the checkpoints from a simulated system with 512MB of
>>>>> memory?  The checkpoints encode the current state of memory in the
>>>>> simulated system, including its capacity, so you'll need to make sure
>>>>> that both runs (the one that collects the checkpoint and the one that
>>>>> restores from it) use the same amount of simulated memory.
>>>>>   More generally, an M5 checkpoint is specific to the ISA/architecture,
>>>>> number of cores, and memory capacity of the simulated system that you
>>>>> collect the checkpoint from.
>>>>>   Hope this helps,
>>>>>   Joel
>>>>>
>>>>>
>>>>> On Wed, Feb 9, 2011 at 12:41 PM, Sheng Li <[email protected]> wrote:
>>>>>
>>>>>> After spending several hours guessing what was wrong, here are my
>>>>>> findings:
>>>>>>
>>>>>> It seems that if I set PhysicalMemory to 512MB, checkpointing
>>>>>> works. However, if I set it to 4096MB (I did this because SPECCPU2006
>>>>>> requires at least 2GB of free memory), checkpointing does not work.
>>>>>> The place I changed this is configs/example/se.py:
>>>>>>
>>>>>> system = System(cpu = [CPUClass(cpu_id=i) for i in xrange(np)],
>>>>>>                 physmem = PhysicalMemory(range=AddrRange("4096MB")),
>>>>>>                 membus = Bus(), mem_mode = test_mem_mode)
>>>>>>
>>>>>> Could anyone give some suggestions?
>>>>>>
>>>>>> Thanks!
>>>>>> -Sheng
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 9, 2011 at 12:05 AM, Sheng Li <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> I tried to use checkpoints in M5 but could not get them to work. I used
>>>>>>> ALPHA_SE.
>>>>>>>
>>>>>>> The commands I use to create/resume checkpoints and the M5 outputs are:
>>>>>>>
>>>>>>> Creating checkpoint:
>>>>>>> ______________________
>>>>>>> [sli2@newcell ~/m5-work-stable]$ ./build/ALPHA_SE/m5.opt
>>>>>>> configs/example/se.py --bench bzip2 --take-checkpoint=2200 
>>>>>>> --at-instruction
>>>>>>> ...
>>>>>>> command line: ./build/ALPHA_SE/m5.opt configs/example/se.py --bench
>>>>>>> bzip2 --take-checkpoint=2200 --at-instruction
>>>>>>> 2200000000
>>>>>>> Global frequency set at 1000000000000 ticks per second
>>>>>>> 0: system.remote_gdb.listener: listening for remote gdb #0 on port
>>>>>>> 7000
>>>>>>> Creating checkpoint at inst:2200
>>>>>>> info: Entering event queue @ 0.  Starting simulation...
>>>>>>> info: Increasing stack size by one page.
>>>>>>> hack: be nice to actually delete the event here
>>>>>>> exit cause = a thread reached the max instruction count
>>>>>>> Writing checkpoint
>>>>>>> Checkpoint written.
>>>>>>> Exiting @ cycle 1111000 because a thread reached the max instruction
>>>>>>> count
>>>>>>>
>>>>>>> Resume checkpoint:
>>>>>>> _________________________
>>>>>>> command line: ./build/ALPHA_SE/m5.opt configs/example/se.py --bench
>>>>>>> bzip2 --checkpoint-restore=2200 --at-instruction
>>>>>>> 2200000000
>>>>>>> Global frequency set at 1000000000000 ticks per second
>>>>>>> 0: system.remote_gdb.listener: listening for remote gdb #0 on port
>>>>>>> 7000
>>>>>>> warn: optional parameter system.cpu.workload:M5_pid not present
>>>>>>> For more information see: http://www.m5sim.org/warn/aa78cda1
>>>>>>> **** REAL SIMULATION ****
>>>>>>> info: Entering event queue @ 1111000.  Starting simulation...
>>>>>>> hack: be nice to actually delete the event here
>>>>>>> Exiting @ cycle 1111500 because halt instruction encountered <--Here
>>>>>>> is the problem.
>>>>>>>
>>>>>>> Any help would be highly appreciated!
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Sheng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> m5-users mailing list
>>>>>> [email protected]
>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>   Joel Hestness
>>>>>   PhD Student, Computer Architecture
>>>>>   Dept. of Computer Science, University of Texas - Austin
>>>>>   http://www.cs.utexas.edu/~hestness
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>>   Joel Hestness
>>   PhD Student, Computer Architecture
>>   Dept. of Computer Science, University of Texas - Austin
>>   http://www.cs.utexas.edu/~hestness
>>
>
>
