Re: [m5-users] Creating ruby checkpoints

Steve Reinhardt Fri, 13 May 2011 07:48:56 -0700

Hi Tim,

Brad's on vacation, so I'll try and answer...


If you're completely sure that it's the same PC value in both traces and the
error is just in the symbol table, then it makes sense that you'd be running
into trouble with trying to skip functions that aren't really there.  AFAIK,
calibrate_delay() is only called during boot, so it is pretty suspicious
that you'd be executing it after your checkpoint.
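
To illustrate why a wrong table alone would produce the traces you see, here
is a minimal Python sketch (not M5 code) of nearest-symbol lookup, which I'm
assuming is roughly how the "@symbol+offset" annotation gets resolved.  The
addresses and both tables are made up; only the function names come from your
trace.

```python
import bisect

def nearest_symbol(symtab, pc):
    """Return (name, offset) for the closest symbol at or below pc, else None.

    symtab maps start address -> symbol name.
    """
    addrs = sorted(symtab)
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None  # pc lies below every known symbol
    base = addrs[i]
    return symtab[base], pc - base

# Hypothetical addresses, chosen only to illustrate the effect.
good = {0x1000: "calibrate_delay", 0x2000: "alpha_switch_to"}
bad = {0x1fb0: "calibrate_delay", 0x3000: "alpha_switch_to"}

pc = 0x2008
print(nearest_symbol(good, pc))  # ('alpha_switch_to', 8)
print(nearest_symbol(bad, pc))   # ('calibrate_delay', 88)
```

The same PC gets a different name purely because the table changed, and any
PC-based event registered at a symbol's address from the bad table would then
fire in unrelated code.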

I have no idea what could be going wrong though.  As just came up on another
thread, the symbols are stored in the checkpoint file and not re-read from
the kernel image; one thing to check is whether the calibrate_delay PC in
the m5.cpt file matches what's stored in the kernel image.
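
A rough Python sketch of that check follows.  It is a sketch under
assumptions, not a description of the actual checkpoint format: I'm assuming
the symbols appear in m5.cpt as paired keys of the form
"<prefix>.symbol_<i>=<name>" and "<prefix>.addr_<i>=<address>", and the
"kernel_symtab" prefix in the sample text is hypothetical.  Look at your own
m5.cpt and adjust the pattern.  The kernel side assumes the usual three-column
output of nm (hex address, type, name).

```python
import re

def cpt_symbol_addrs(cpt_text, name):
    """Collect the addresses recorded for `name` in checkpoint text.

    ASSUMPTION: keys look like <prefix>.symbol_<i> and <prefix>.addr_<i>.
    """
    symbols, addrs = {}, {}
    pattern = re.compile(r"\s*(\S+)\.(symbol|addr)_(\d+)\s*=\s*(\S+)")
    for line in cpt_text.splitlines():
        m = pattern.match(line)
        if not m:
            continue
        prefix, kind, idx, value = m.groups()
        key = (prefix, int(idx))
        if kind == "symbol":
            symbols[key] = value
        else:
            addrs[key] = int(value, 0)  # accepts decimal or 0x-prefixed hex
    return sorted(addrs[k] for k, v in symbols.items()
                  if v == name and k in addrs)

def nm_symbol_addrs(nm_text, name):
    """Collect the addresses for `name` from nm output on the kernel image."""
    found = []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == name:
            found.append(int(parts[0], 16))
    return sorted(found)

# Made-up sample data; real input would be the m5.cpt contents and nm output.
cpt = """[system]
kernel_symtab.size=2
kernel_symtab.addr_0=4096
kernel_symtab.symbol_0=calibrate_delay
kernel_symtab.addr_1=8192
kernel_symtab.symbol_1=alpha_switch_to
"""
nm = ("0000000000001000 T calibrate_delay\n"
      "0000000000002000 T alpha_switch_to\n")

same = (cpt_symbol_addrs(cpt, "calibrate_delay")
        == nm_symbol_addrs(nm, "calibrate_delay"))
print(same)  # True
```

If the two lists differ for your real files, the checkpoint was probably
created against a different kernel image than the one being loaded now.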

Steve

On Fri, May 13, 2011 at 5:49 AM, Timothy M Jones <[email protected]> wrote:

> Hi Brad,
>
> Thanks for spending the time on this.  I have been digging a bit to try to
> find out the problem too.  I found that restoring the checkpoint seems to be
> assigning the wrong symbols to certain addresses.  If I trace the output
> from both runs, I get this.  First the correct version:
>
> 2259304210000: system.cpu1: Decode: Decoded lda instruction: 0x211f3fff
> 2259304210000: global: Reading int reg 31 (31) as 0.
> 2259304210000: global: Setting int reg 8 (8) to 0x3fff.
> 2259304210000: system.cpu1 + T0 : @alpha_switch_to+8    : lda r8,16383(r31)
>   : IntAlu :  D=0x0000000000003fff
>
> Now the version using ruby_fs.py:
>
> 2259492768000: system.cpu1: Decode: Decoded lda instruction: 0x211f3fff
> 2259492768000: global: Reading int reg 31 (31) as 0.
> 2259492768000: global: Setting int reg 8 (8) to 0x3fff.
> 2259492768000: system.cpu1 + T0 : @calibrate_delay+88    : lda
> r8,16383(r31)   : IntAlu :  D=0x0000000000003fff
>
> You can see that in the first version the address is from
> alpha_switch_to+8, but in the second it thinks it is calibrate_delay+88.
>  Later on the ruby version misses out a load of instructions with this line
> of output:
>
> 2259493454000: global: PC based event serviced at 0xfffffc0000311148:
> calibrate_delay
> 2259493454000: global: Reading int reg 26 (26) as 0xfffffc00006b83bc.
> 2259493454000: calibrate_delay: skipping calibrate_delay: pc =
> (0xfffffc0000311148=>0xfffffc000031114c), newpc =
> (0xfffffc00006b83bc=>0xfffffc00006b83c0)
>
> I'm not sure what's going on, but looking in the src/arch/alpha/linux
> directory, calibrate_delay is a function that should be skipped.  So, my
> guess is that the simulator thinks it is in this function and tries to skip
> it, but in actual fact it isn't in this function at all and so execution goes
> haywire.  Would that make sense?  Where do I look to fix this problem of
> the symbols being wrong?
>
> Cheers
>
> Tim
>
> Beckmann, Brad wrote:
>
>> Hi Tim,
>>
>> I spent a little time trying to reproduce your error, but so far I have
>> not.  I'm using a slightly different Linux kernel than the default, but I'm
>> not ready to declare that this is the reason for the error.  Unfortunately
>> I'm going to be out of town for the next week and a half, but I'll try to
>> look further at your problem when I return.  One minor question that I've
>> been meaning to ask you: roughly how long did it take for you to encounter
>> this error?
>>
>> Brad
>>
>>
>>  -----Original Message-----
>>> From: [email protected] [mailto:m5-users-
>>> [email protected]] On Behalf Of Timothy M Jones
>>> Sent: Friday, May 06, 2011 2:00 AM
>>> To: M5 users mailing list
>>> Subject: Re: [m5-users] Creating ruby checkpoints
>>>
>>> Hi Brad,
>>>
>>> Thanks for the reply and the explanation about Ruby.
>>>
>>> I've attached the runscript.rcS file that I was using.  I'm using the
>>> kernel from the M5 website and disk image from UTexas
>>> (http://www.cs.utexas.edu/~parsec_m5/linux-parsec-2-1-m5-with-test-inputs.img.bz2)
>>>
>>>
>>> I looked at the config file and checkpoint files.  The CPUs do have the
>>> same names and I don't get any unserialization warnings at all when
>>> running from the checkpoint.  I did notice that the CPU types were
>>> different (since I was creating checkpoints with AtomicSimpleCPU) but
>>> adding the '-t' switch to the creation command didn't make the error go
>>> away.
>>>
>>> I also tried using the ruby_fs.py script to create checkpoints, by adding
>>> support for '--script' within it using the attached patch.  This created a
>>> checkpoint without problems.  Loading from it caused a segmentation fault
>>> in the simulated program though.  These were the commands I used for that:
>>>
>>> ./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0
>>> ./configs/example/ruby_fs.py -n 2 --script=../scripts/runscript.rcS
>>> --max-checkpoints=1
>>>
>>> ./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0
>>> ./configs/example/ruby_fs.py -n 2 -r 0
>>>
>>> Thanks again
>>> Tim
>>>
>>> Beckmann, Brad wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> Before I try to help you with your specific problem, I want to point out
>>>> that Ruby's current support for checkpointing is a little confusing, and
>>>> that is one area that we are actively improving.  In particular, Ruby
>>>> currently uses physmem as a functional memory image, and thus messages
>>>> within Ruby only impact the timing of memory accesses.  Thus, loading a
>>>> checkpoint with Ruby is nothing more than loading Ruby's backing image of
>>>> physmem with the checkpointed memory image.  Also, there is no current
>>>> support for cache warmup.  We are in the process of changing that, but
>>>> that code is not yet ready.
>>>
>>>> Having said that, I suspect that your problem is something different.  In
>>>> general, your sequence of commands should work, and I can't reproduce
>>>> your specific error since I don't have your particular rcS script.  I'd be
>>>> curious to know if you see any unserialization warnings complaining that
>>>> certain SimObjects aren't in the loaded checkpoint.  In particular, do the
>>>> CPUs have the exact same names in the config.ini file with Ruby and the
>>>> m5.cpt file in your checkpoint?
>>>
>>>> Sorry I can't be more help, but if you send me your rcS script, I'd be
>>>> happy to investigate further.
>>>
>>>> Brad
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected] [mailto:m5-users-[email protected]]
>>>>> On Behalf Of Timothy M Jones
>>>>> Sent: Wednesday, May 04, 2011 5:49 AM
>>>>> To: M5 users mailing list
>>>>> Subject: [m5-users] Creating ruby checkpoints
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm trying to create checkpoints for use with ruby using ALPHA_FS.
>>>>> It takes ages to boot linux with ruby enabled, and since I want
>>>>> several checkpoints for different numbers of cores, I was hoping I'd
>>>>> be able to create checkpoints without ruby, then run from the
>>>>> checkpoints with it.  This doesn't appear to work.  If I create a
>>>>> checkpoint with this command:
>>>>>
>>>>> ./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0
>>>>> ./configs/example/fs.py -n 2 --max-checkpoints=1
>>>>> --script=../scripts/runscript.rcS
>>>>>
>>>>> Then I can run it fine with this command:
>>>>>
>>>>> ./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0
>>>>> ./configs/example/fs.py -n 2 -r 0
>>>>>
>>>>> But switching to ruby causes errors:
>>>>>
>>>>> ./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0
>>>>> ./configs/example/ruby_fs.py -n 2 -r 0
>>>>>
>>>>> In the system.terminal file I get this error output:
>>>>>
>>>>> script(759): unhandled unaligned exception pc = [<fffffc00006b83c0>]
>>>>> ra = [<fffffc00006b83bc>]  ps = 0007
>>>>> r0 = 000000001f6c8000  r1 = fffffc00003111a0  r2 = fffffc0000018000
>>>>> r3 = 000000000000002b  r4 = 0000000000000720  r5 = fffffc000085ecb8
>>>>> r6 = 0000000000000059  r7 = 0000000000000040  r8 = 0000000000003fff
>>>>> r9 = fffffc001f5c5580  r10= fffffc001f3eec00  r11= fffffc0000d09b80
>>>>> r12= fffffc001f6b0740  r13= 0000000000000001  r14= 0000000000000008
>>>>> r15= fffffc001f657e48
>>>>> r16= 000000001f654000  r17= fffffc001f3eec00  r18= fffffc001f6b0740
>>>>> r19= 0000000000000001  r20= 0000000000000000  r21= fffffc0000860640
>>>>> r22= 0000000000000000  r23= 000000200618a0cf  r24= 4000000000000000
>>>>> r25= 00000000000003ff  r27= fffffc0000311190  r28= fffffc001f5c5580
>>>>>
>>>>> This seems to happen no matter which protocol I compile into the
>>>>> binary, although this was with MESI_CMP_directory.  Does anyone have
>>>>> any suggestions as to how I can go about creating some checkpoints to
>>>>> use like this or what I'm doing wrong?
>>>>>
>>>>> Thanks
>>>>> Tim
>>>>>
>>>>> --
>>>>> Timothy M. Jones
>>>>> http://www.cl.cam.ac.uk/~tmj32
>>>>> _______________________________________________
>>>>> m5-users mailing list
>>>>> [email protected]
>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
>>>>>