Interesting... there's nothing conclusive here, but the symbols on
the instructions at tick 172000 show that this address is probably
TLS-related too. So the good news is that this could be the same
bug or a related one. I think the key thing is to figure out what
the Linux TLS structure is supposed to look like.
One thing that's puzzling me is why all this is coming up now when
we've run almost all of spec 2000 without any problems.
Anyone else have any ideas?
Steve
On 9/9/07, Elliott Cooper-Balis <[EMAIL PROTECTED]> wrote:
r0 gets set in the instruction right before the load into r20 :
2174500: system.cpu0 T0 : @__strtol_internal+24 : addq
r0,r1,r0 : IntAlu : D=0x00000001200944f0
2175000: system.cpu0 T0 : @__strtol_internal+28 : ldq r20,0
(r0) : MemRead : D=0x0000000000000000 A=0x1200944f0
and it doesnt look like address 0x1200944f0 gets used as an actual
address anywhere else but here are all other references to it :
172000: system.cpu0 T0 : @__libc_setup_tls+304 : addq
r10,r13,r16 : IntAlu : D=0x00000001200944f0
172500: system.cpu0 T0 : @__libc_setup_tls+308 : stq r16,16
(r9) : MemWrite : D=0x00000001200944f0 A=0x120092050
180000: system.cpu0 T0 : @memcpy+32 : bis
r31,r16,r12 : IntAlu : D=0x00000001200944f0
181000: system.cpu0 T0 : @memcpy+40 : bis
r31,r16,r9 : IntAlu : D=0x00000001200944f0
184000: system.cpu0 T0 : @memcpy+256 : bis
r31,r12,r0 : IntAlu : D=0x00000001200944f0
2174500: system.cpu0 T0 : @__strtol_internal+24 : addq
r0,r1,r0 : IntAlu : D=0x00000001200944f0
2175000: system.cpu0 T0 : @__strtol_internal+28 : ldq r20,0
(r0) : MemRead : D=0x0000000000000000 A=0x1200944f0
thanks again for all the help and sorry for being such pain in the
ass.
Steve Reinhardt <[EMAIL PROTECTED]> wrote:
The instruction at tick 2175000 loads r20 from memory location
0x1200944f0 so the earlier refs are irrelevant. The next
questions are where does r0 get set immediately prior to 2175000
(i.e. does 0x1200944f0 make sense as an address) and where else
does 0x1200944f0 get accessed...
Steve
On 9/9/07, Elliott Cooper-Balis < [EMAIL PROTECTED]> wrote:
here are all the instances of r20 in the specrand benchmark. i'm
sorry i can't be of more help in debugging this issue :
4500: system.cpu0 T0 : @_start+36 : ldq r20,-32440
(r29) : MemRead : D=0x0000000120000eb8 A=0x1200907a0
15000: system.cpu0 T0 : @__libc_start_main+60 : bis
r31,r20,r15 : IntAlu : D=0x0000000120000eb8
293000: system.cpu0 T0 : @__geteuid+20 : bis
r31,r20,r0 : IntAlu : D=0x0000000000000064
305500: system.cpu0 T0 : @__getegid+20 : bis
r31,r20,r0 : IntAlu : D=0x0000000000000064
2175000: system.cpu0 T0 : @__strtol_internal+28 : ldq r20,0
(r0) : MemRead : D=0x0000000000000000 A=0x1200944f0
2183500: system.cpu0 T0 : @____strtoll_l_internal+56 : bis
r31,r20,r11 : IntAlu : D=0x0000000000000000
2184000: system.cpu0 T0 : @____strtoll_l_internal+60 : ldq
r3,8(r20) : MemRead : A=0x8
the last of which being the instruction causing the page fault.
elliott
Steve Reinhardt < [EMAIL PROTECTED]> wrote:
Interesting... my guess with perl then is that the Linux kernel is
supposed to be initializing some value in the thread-local storage
that we're not initializing. Unfortunately the only way to track
that down is usually to go reading the kernel source... though if
you find a spot where they define a base TLS struct then that
should give it to you. Anyone else out there on the list have any
experience with this?
As far as specrand it's impossible to say what the problem is
without going backward further in the trace to see where r20 is
coming from. If r20 also comes from reading something out of the
TLS area then it could well be the same bug.
Steve
On 9/9/07, Elliott Cooper-Balis < [EMAIL PROTECTED]> wrote:
hey steve,
i tried both of your suggestions, and the latter of which i
think might give a good clue as the memory address which causes
the fault is not referenced at any other point in the program.
here is the result of grep'ing for the address in the execution
trace :
>grep 12022e50 exec.out
5278458500: system.cpu0 T0 : @__printf_fp+128 : addq
r0,r1,r0 : IntAlu : D=0x000000012022e508
5278459000: system.cpu0 T0 : @__printf_fp+132 : ldq r1,0
(r0) : MemRead : D=0x0000000000000000 A=0x12022e508
which are the 2 instructions right before the fault and the only 2
instances of it being referenced.
i tried digging around a little more to see if this address in
particular was causing the problems. unfortunately, that doesn't
appear to be the case. the benchmark we have been discussing is
the Perl benchmark in SPEC06. i ran the random number generator
benchmark as well ( 999.specrand) and here is the execution output
just before its page fault :
[EMAIL PROTECTED]:~/Development/M5/m5-2.0b3/build/ALPHA_SE$ ./
m5.debug --trace-flags=Exec,Syscall,SyscallVerbose --trace-
start=2000000 ../../configs/example/se.py -c benchmarks/
999.specrand/exe/specrand_base.amd64-m64-gcc41-nn -o "4 3943"
....
2183000: system.cpu0 T0 : @____strtoll_l_internal+52 : bis
r31,r18,r10 : IntAlu : D=0x000000000000000a
2183500: system.cpu0 T0 : @____strtoll_l_internal+56 : bis
r31,r20,r11 : IntAlu : D=0x0000000000000000
2184000: system.cpu0 T0 : @____strtoll_l_internal+60 : ldq
r3,8(r20) : MemRead : A=0x8
panic: Page table fault when accessing virtual address 0x8
@ cycle 2184000
[invoke:build/ALPHA_SE/sim/faults.cc, line 65]
Program aborted at cycle 2184000
Aborted (core dumped)
unfortunately, there doesn't appear to be (at least to me) any
similarities between the two benchmark's output.
elliott
Steve Reinhardt < [EMAIL PROTECTED]> wrote:
It's not obvious, but it does give some clues...
The null pointer is being read from memory address 0x12022e508, so
either that's a bogus address or the memory location doesn't have
the right value (not getting initialized or getting clobbered at
some point).
The pointer address is computed by adding the uniq register (put
into R0 by "call_pal rduniq") and some value (0x28) read from
-29160(r29)... I think that's the global constant pool. The uniq
reg is used as a pointer to thread-local storage. So basically
it's reading the null value out of thread-local storage. It could
be that that's a value that the OS is supposed to provide but
we're not initializing it properly.
I'd do two more things to try and get some more clues:
- run with just --trace-flags=Syscall (and no --trace-start) to
get a complete syscall trace, then look at whatever the last few
syscalls are, and see what they are and how closely they precede
the crash
- run with just --trace-flags=Exec (and no --trace-start) and then
pipe the trace through "egrep -i '12022e50[0-7]' " to look at all
the other references to that memory location... is it ever
written, if it's read before is it always zero, etc. This will
take a while...
Steve
On 9/7/07, Elliott Cooper-Balis < [EMAIL PROTECTED]> wrote:
here is the output. is there anything obvious that might be broken?
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
Yahoo! oneSearch: Finally, mobile search that gives answers, not
web links.
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
Shape Yahoo! in your own image. Join our Network Research Panel
today!
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
Moody friends. Drama queens. Your life? Nope! - their life, your
story.
Play Sims Stories at Yahoo! Games.
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users