On Mar 16, 2007, at 9:45 AM, Charles Krinke wrote:

> Is this a system you are just bringing up or one that's been running
> for a while? It really seems like memory corruption of some form.
> I'd suggest checking memory controller settings.
>
> Also, what happens if you disassemble the kernel image and look at
> the addresses pointed to by NIP:
> C00DEE18 & C002CE68.
>
> - k
> Dear Kumar:
>
> We have two systems. One based on an 8241, and one based on an
> 8541. The 8241 has been running for some time with Linux 2.4 and
> the 8541 is coming up. Both are using the 2.6.17.11 kernel from
> kernel.org with modifications for our hardware.
>
> In the case of the 8241, I started out with the 2.4 modifications,
> which were originally based on the 8260 and ported them to 2.6. In
> the case of the 8541, I started out with the embedded planet 8555EP
> 2.6 kernel source and added that to the 2.6.
>
> I don't see this exception in the 8541, although extensive testing
> has not yet been completed. The 8241 exhibits this exception on
> three different 8241 boards, so I don't suspect the hardware.
>
> We are using the Montavista toolchain and their root filesystem
> including 'tar' and 'cp' which are the programs that currently
> exhibit the fault.
>
> Yesterday, when I saw an NIP at 0x900, I was ready to jump on the
> interrupts not being set up correctly, but after a few hours of
> going through that, I am now convinced the interrupts are set up
> correctly, so it is something more subtle.
>
> Certainly, memory corruption is the next thing to be concerned with.
>
> One thing that has concerned me a bit is that we have no swap space
> available at all. This is an embedded system with 64 MByte of RAM
> and JFFS2 NAND flash with no swap partitions.
>
> I suspect auditing the MMU setup differences between the original
> 2.4 kernel and the new 2.6 kernel for the 8241 board is the next step.
>
> The three exceptions I saw yesterday were 1) 0x900 in the
> timer_interrupt, 2) C00DEE18 (inside the tar program) and 3)
> C002CE68 (in one of the kernel routines).

#2 is inside the kernel as well. Look at the System.map or
objdump -d vmlinux to see what exactly is at those instructions.

> I suspect the actual addresses are red herrings and this exception
> can occur at any address. This certainly would tend to indicate
> some sort of memory setup issue.

I think it's useful to know if the instructions at the two offsets
C00DEE18 & C002CE68 are similar in some way before jumping to that
conclusion.

> Changing the Oops logic to print out the NextInstruction as well as
> the NIP might be helpful so I could discern the difference between
> what the program is trying to do and what it is really doing.
>
> Are there any other thoughts you might have on diagnosis techniques
> at this point?

Try turning on KALLSYMS; this should provide more info on the oops as
well.

- k
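
For anyone following the System.map suggestion, here is a minimal sketch of the lookup: find the last symbol at or below a given NIP and report the offset into it. It assumes the usual three-column System.map layout (hex address, type letter, name). The function names `parse_system_map` and `lookup` are hypothetical helpers, and the entries in `SAMPLE_MAP` are made up for illustration; they are not the real symbols at C00DEE18 or C002CE68.

```python
import bisect

def parse_system_map(text):
    """Parse System.map-style text into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _sym_type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols

def lookup(symbols, address):
    """Return (name, offset) for the last symbol at or below address, or None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address is below every symbol in the map
    base, name = symbols[i]
    return name, address - base

# Hypothetical sample data, NOT taken from the real kernel build.
SAMPLE_MAP = """\
c002cd00 T hypothetical_func_a
c002cf00 T hypothetical_func_b
c00dee00 T hypothetical_func_c
"""

if __name__ == "__main__":
    syms = parse_system_map(SAMPLE_MAP)
    for nip in (0xC00DEE18, 0xC002CE68):
        print(hex(nip), lookup(syms, nip))
```

To use it against a real build, read the System.map produced alongside vmlinux and pass its contents to `parse_system_map`. Once the enclosing function is known, the objdump -d vmlinux output for that function shows the instruction at the reported offset.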