Thanks very much for the pointers, Steve.
I'm feeling very foolish now. I've worked out the problem and it was my
mistake. I was using a different kernel to the standard one and I'd
updated the relevant line in FSConfig.py in makeLinuxAlphaSystem() but
not in makeLinuxAlphaRubySystem(). Setting them to the same thing
solves the problem :-)
I'm not sure why this was causing problems: the symbol table is cleared
before the checkpoint is loaded, so that shouldn't have been an issue.
There must have been something else that was conflicting. However,
since it is working now, I'm not going to dig any further!
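For anyone hitting the same thing, here is a minimal sketch of the idea behind the fix: both system builders take the kernel name from one shared constant, so they cannot drift apart. This is not the real FSConfig.py. The constant KERNEL_IMAGE, the file name "vmlinux-custom", the dictionaries standing in for system objects and the helper kernels_match() are all hypothetical and only there for illustration.

```python
# Hypothetical sketch, NOT the real configs/common/FSConfig.py.
# KERNEL_IMAGE and "vmlinux-custom" are made-up names for illustration.
KERNEL_IMAGE = "vmlinux-custom"


def make_linux_alpha_system():
    """Stand-in for makeLinuxAlphaSystem(): returns a plain dict."""
    return {"builder": "makeLinuxAlphaSystem", "kernel": KERNEL_IMAGE}


def make_linux_alpha_ruby_system():
    """Stand-in for makeLinuxAlphaRubySystem(): same kernel constant."""
    return {"builder": "makeLinuxAlphaRubySystem", "kernel": KERNEL_IMAGE}


def kernels_match(*systems):
    """Return True when every system description names the same kernel."""
    return len({s["kernel"] for s in systems}) == 1


if __name__ == "__main__":
    plain = make_linux_alpha_system()
    ruby = make_linux_alpha_ruby_system()
    print("kernels match:", kernels_match(plain, ruby))
```

With a single constant, changing the kernel in one place changes it for both the checkpoint-creating run and the Ruby run that restores from it.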
Tim
Steve Reinhardt wrote:
Hi Tim,
Brad's on vacation, so I'll try and answer...
If you're completely sure that it's the same PC value in both traces and
the error is just in the symbol table, then it makes sense that you'd be
running into trouble with trying to skip functions that aren't really
there. AFAIK, calibrate_delay() is only called during boot, so it is
pretty suspicious that you'd be executing it after your checkpoint.
I have no idea what could be going wrong though. As just came up on
another thread, the symbols are stored in the checkpoint file and not
re-read from the kernel image; one thing to check is whether the
calibrate_delay PC in the m5.cpt file matches what's stored in the
kernel image.
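The check suggested above could be sketched roughly as follows. It assumes you have run the standard `nm` tool on the kernel image and saved its text output, and that you have read the calibrate_delay address out of m5.cpt by hand. The function names parse_nm() and symbol_matches() are hypothetical, and the sample text is illustrative rather than real `nm` output for any particular kernel.

```python
def parse_nm(nm_text):
    """Parse `nm`-style lines ("<hex addr> <type> <name>") into a dict."""
    symbols = {}
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols have no address column
        addr, _kind, name = parts
        try:
            symbols[name] = int(addr, 16)
        except ValueError:
            continue  # not an address line, skip it
    return symbols


def symbol_matches(nm_text, name, checkpoint_addr):
    """True if `name` in the kernel image sits at `checkpoint_addr`."""
    return parse_nm(nm_text).get(name) == checkpoint_addr


if __name__ == "__main__":
    # Illustrative text only; substitute the saved output of nm.
    sample = "fffffc0000311148 T calibrate_delay\n"
    print(symbol_matches(sample, "calibrate_delay", 0xfffffc0000311148))
```

A False result would mean the checkpoint and the kernel image disagree about where the function lives.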
Steve
On Fri, May 13, 2011 at 5:49 AM, Timothy M Jones <[email protected]> wrote:
Hi Brad,
Thanks for spending the time on this. I have been digging a bit to
try to find out the problem too. I found that restoring the
checkpoint seems to be assigning the wrong symbols to certain
addresses. If I trace the output from both runs, I get this. First
the correct version:
2259304210000: system.cpu1: Decode: Decoded lda instruction: 0x211f3fff
2259304210000: global: Reading int reg 31 (31) as 0.
2259304210000: global: Setting int reg 8 (8) to 0x3fff.
2259304210000: system.cpu1 + T0 : @alpha_switch_to+8 : lda r8,16383(r31) : IntAlu : D=0x0000000000003fff
Now the version using ruby_fs.py:
2259492768000: system.cpu1: Decode: Decoded lda instruction: 0x211f3fff
2259492768000: global: Reading int reg 31 (31) as 0.
2259492768000: global: Setting int reg 8 (8) to 0x3fff.
2259492768000: system.cpu1 + T0 : @calibrate_delay+88 : lda r8,16383(r31) : IntAlu : D=0x0000000000003fff
You can see that in the first version the address is from
alpha_switch_to+8, but in the second it thinks it is
calibrate_delay+88. Later on the ruby version misses out a load of
instructions with this line of output:
2259493454000: global: PC based event serviced at 0xfffffc0000311148: calibrate_delay
2259493454000: global: Reading int reg 26 (26) as 0xfffffc00006b83bc.
2259493454000: calibrate_delay: skipping calibrate_delay: pc = (0xfffffc0000311148=>0xfffffc000031114c), newpc = (0xfffffc00006b83bc=>0xfffffc00006b83c0)
I'm not sure what's going on, but looking in the
src/arch/alpha/linux directory, calibrate_delay is a function that
should be skipped. So, my guess is that the simulator thinks it is
in this function and tries to skip it but in actual fact it isn't in
this function at all and so execution goes haywire. Would that make
sense? Where do I look to fix this problem of the symbols being wrong?
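To illustrate how the same PC can be reported under two different names, here is a small sketch of a nearest-symbol lookup. It is not the simulator's code. The function nearest_symbol(), both tables and all addresses are made up for illustration, and the offset is printed in decimal as an assumption.

```python
import bisect


def nearest_symbol(symtab, pc):
    """Return 'name+offset' for the closest symbol at or below pc.

    symtab maps symbol name -> start address. Returns None when pc
    lies below every symbol in the table.
    """
    entries = sorted((addr, name) for name, addr in symtab.items())
    addrs = [addr for addr, _ in entries]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None
    addr, name = entries[i]
    offset = pc - addr
    return name if offset == 0 else f"{name}+{offset}"


if __name__ == "__main__":
    pc = 0x1008  # hypothetical PC, identical in both runs
    kernel_a = {"alpha_switch_to": 0x1000}  # hypothetical table
    kernel_b = {"calibrate_delay": 0x0FB0}  # hypothetical table
    print(nearest_symbol(kernel_a, pc))
    print(nearest_symbol(kernel_b, pc))
```

If a PC-based skip event is keyed on a symbol's address from the wrong table, it fires at an address that holds unrelated code, which would be consistent with the behaviour described above.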
Cheers
Tim
Beckmann, Brad wrote:
Hi Tim,
I spent a little time trying to reproduce your error, but so far
I have not. I'm using a slightly different Linux kernel than
the default, but I'm not ready to declare that is the reason for
the error. Unfortunately I'm going to be out-of-town for the
next week and a half, but I'll try to look further at your problem
when I return. One minor question that I've been meaning to ask
you is roughly how long did it take for you to encounter this error?
Brad
-----Original Message-----
From: [email protected] [mailto:m5-users-[email protected]] On Behalf Of
Timothy M Jones
Sent: Friday, May 06, 2011 2:00 AM
To: M5 users mailing list
Subject: Re: [m5-users] Creating ruby checkpoints
Hi Brad,
Thanks for the reply and the explanation about Ruby.
I've attached the runscript.rcS file that I was using. I'm using the
kernel from the M5 website and disk image from UTexas
(http://www.cs.utexas.edu/~parsec_m5/linux-parsec-2-1-m5-with-test-inputs.img.bz2)
I looked at the config file and checkpoint files. The CPUs do have the
same names and I don't get any unserialization warnings at all when
running from the checkpoint. I did notice that the CPU types were
different (since I was creating checkpoints with AtomicSimpleCPU) but
adding the '-t' switch to the creation command didn't make the error go
away.
I also tried using the ruby_fs.py script to create checkpoints too by
adding support for '--script' within it using the attached patch. This
created a checkpoint without problems. Loading from it caused a
segmentation fault in the simulated program though. These were the
commands I used for that:
./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0 \
    ./configs/example/ruby_fs.py -n 2 \
    --script=../scripts/runscript.rcS --max-checkpoints=1
./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0 \
    ./configs/example/ruby_fs.py -n 2 -r 0
Thanks again
Tim
Beckmann, Brad wrote:
Hi Tim,
Before I try to help you with your specific problem, I want to point
out that Ruby's current support for checkpointing is a little confusing
and that is one area that we are actively improving. In particular,
Ruby currently uses physmem as a functional memory image and thus
messages within Ruby only impact the timing of memory accesses. Thus,
loading a checkpoint with Ruby is nothing more than loading Ruby's
backing image of physmem with the checkpointed memory image. Also there
is no current support for cache warmup. We are in the process of
changing that, but that code is not yet ready.
Having said that, I suspect that your problem is something different.
In general, your sequence of commands should work and I can't reproduce
your specific error since I don't have your particular rcS script. I'd
be curious to know if you see any unserialization warnings complaining
that certain simobjects aren't in the loaded checkpoint. In particular,
do the cpus have the exact same name between the config.ini file with
ruby and the m5.cpt file in your checkpoint?
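One rough way to make that comparison is to pull the bracketed section names out of both files and diff the ones that mention a CPU. The sketch below assumes both config.ini and m5.cpt use INI-style "[section.name]" headers. The helper names and the "cpu" substring filter are hypothetical choices for illustration, not part of the simulator.

```python
import re

# Matches an INI-style section header such as "[system.cpu0]".
SECTION = re.compile(r"^\[([^\]]+)\]\s*$")


def section_names(text):
    """Collect every "[name]" section header found in the text."""
    names = set()
    for line in text.splitlines():
        match = SECTION.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


def cpu_name_mismatch(config_text, checkpoint_text, marker="cpu"):
    """Return (only_in_config, only_in_checkpoint) for CPU-like sections."""
    cfg = {n for n in section_names(config_text) if marker in n}
    cpt = {n for n in section_names(checkpoint_text) if marker in n}
    return sorted(cfg - cpt), sorted(cpt - cfg)


if __name__ == "__main__":
    # Paths are placeholders; point them at your own output directory.
    with open("config.ini") as cfg, open("m5.cpt") as cpt:
        print(cpu_name_mismatch(cfg.read(), cpt.read()))
```

Two empty lists would suggest the CPU names line up. Note that config.ini may contain sections for child objects that are never checkpointed, so a non-empty first list is not necessarily an error.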
Sorry I can't be more help, but if you send me your rcS script, I'd be
happy to investigate further.
Brad
-----Original Message-----
From: [email protected] [mailto:m5-users-[email protected]] On Behalf Of
Timothy M Jones
Sent: Wednesday, May 04, 2011 5:49 AM
To: M5 users mailing list
Subject: [m5-users] Creating ruby checkpoints
Hello,
I'm trying to create checkpoints for use with ruby using ALPHA_FS. It
takes ages to boot linux with ruby enabled, and since I want several
checkpoints for different numbers of cores, I was hoping I'd be able to
create checkpoints without ruby, then run from the checkpoints with
ruby enabled.
This doesn't appear to work. If I create a checkpoint with this command:
./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0 \
    ./configs/example/fs.py -n 2 --max-checkpoints=1 \
    --script=../scripts/runscript.rcS
Then I can run it fine with this command:
./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0 \
    ./configs/example/fs.py -n 2 -r 0
But switching to ruby causes errors:
./build/ALPHA_FS/m5.fast -d ../outputs --remote-gdb-port 0 \
    ./configs/example/ruby_fs.py -n 2 -r 0
In the system.terminal file I get this error output:
script(759): unhandled unaligned exception
pc = [<fffffc00006b83c0>]  ra = [<fffffc00006b83bc>]  ps = 0007
r0 = 000000001f6c8000  r1 = fffffc00003111a0  r2 = fffffc0000018000
r3 = 000000000000002b  r4 = 0000000000000720  r5 = fffffc000085ecb8
r6 = 0000000000000059  r7 = 0000000000000040  r8 = 0000000000003fff
r9 = fffffc001f5c5580  r10= fffffc001f3eec00  r11= fffffc0000d09b80
r12= fffffc001f6b0740  r13= 0000000000000001  r14= 0000000000000008
r15= fffffc001f657e48  r16= 000000001f654000  r17= fffffc001f3eec00
r18= fffffc001f6b0740  r19= 0000000000000001  r20= 0000000000000000
r21= fffffc0000860640  r22= 0000000000000000  r23= 000000200618a0cf
r24= 4000000000000000  r25= 00000000000003ff  r27= fffffc0000311190
r28= fffffc001f5c5580
This seems to happen no matter which protocol I compile into the
binary, although this was with MESI_CMP_directory. Does anyone have any
suggestions as to how I can go about creating some checkpoints to use
like this or what I'm doing wrong?
Thanks
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32
_______________________________________________
m5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users