Re: SA110/SA1100 possible bug or kernel bug? (long) ...

Nicolas Pitre Tue, 15 Jun 1999 13:54:08 -0700
On Sat, 12 Jun 1999, Russell King - ARM Linux Admin wrote:

> Nicolas Pitre writes:
> > lpn: memory violation at pc=0x020140c0, lr=0x020282f0 
> >     (bad address=0x020140c0, code 1)
> > pc : [<020140c0>]    lr : [<020282f0>]
> > r7 : 01fa582c  r6 : 020f4090  r5 : 020adf88  r4 : 00001800
> >  20140c0:       e5864000        str     r4, [r6]
> 
> When your hack is triggered, could you try calling
> show_pte(current->mm, regs->ARM_pc) please?  This should tell
> you what's in the page tables for the task.

*pgd = c012d001, *pmd = c012d001, *pte = c035a01f, *ppte = c035aaae
*pgd = c012d001, *pmd = c012d001, *pte = c036e01f, *ppte = c036eaae
etc...
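
As a rough aid to reading the values above, here is a minimal C sketch that splits a *ppte value into fields. It assumes *ppte is the hardware ARMv4 second-level "small page" descriptor; that is an assumption, not something the dump itself states. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an ARMv4 second-level "small page" (4 KB) descriptor.
 * Assumption: the *ppte value printed by show_pte() is this hardware
 * descriptor.  All names below are illustrative only. */
struct small_page_desc {
    uint32_t base;       /* physical page base, bits [31:12] */
    unsigned ap[4];      /* AP0..AP3, one 2-bit field per 1 KB subpage */
    unsigned cacheable;  /* C bit (bit 3) */
    unsigned bufferable; /* B bit (bit 2) */
    unsigned type;       /* bits [1:0]; 2 means "small page" */
};

static struct small_page_desc decode_small_page(uint32_t d)
{
    struct small_page_desc s;
    int i;

    s.base = d & 0xfffff000u;
    for (i = 0; i < 4; i++)
        s.ap[i] = (d >> (4 + 2 * i)) & 3u;  /* AP0 starts at bit 4 */
    s.cacheable = (d >> 3) & 1u;
    s.bufferable = (d >> 2) & 1u;
    s.type = d & 3u;
    return s;
}

/* Meaning of a 2-bit AP field on ARMv4 (privileged access shown as "kernel"). */
static const char *ap_meaning(unsigned ap)
{
    static const char *const names[4] = {
        "controlled by the S/R bits",
        "kernel read/write, user no access",
        "kernel read/write, user read-only",
        "kernel read/write, user read/write",
    };

    assert(ap < 4);
    return names[ap];
}
```

Under that assumption, decode_small_page(0xc035aaae) gives base 0xc035a000, type 2, C and B set, and AP = 2 for all four subpages, i.e. read-only from user mode. That would be consistent with a text page on which a user-mode store faults.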

> Also, could you try to find out what the value is in memory
> there when the handler is called?

I modified my hack so that it now just ignores the fault and returns
immediately.  And... it seems to do the job: execution resumes happily
for about 30 seconds to one minute before the hack is triggered again.  So I
must assume the memory content is sane.

> > What is really weird about it is the fact that the faulty address is equal
> > to the pc.  However, r6 which is used to do the str contains actually a
> > good address and is quite different from the pc.
> 
> Is there any chance you could try modifying the instruction to
> be pre-indexed instead of post-indexed?

The fault happens at several different places, but not just anywhere.  Here
are some dumps.  The last instruction of each sequence is where the fault
occurred:

 2014098:       eb005082        bl      20282a8 <__read>
 201409c:       e2504000        subs    r4, r0, #0
 20140a0:       aa000005        bge     20140bc <GetBuffer+0xac>
 20140a4:       e59f000c        ldr     r0, 20140b8 <GetBuffer+0xa8>
 20140a8:       eb008de2        bl      2037838 <perror>
 20140ac:       e1a00006        mov     r0, r6
 20140b0:       eb009f83        bl      203bec4 <__libc_free>
 20140b4:       eaffffec        b       201406c <GetBuffer+0x5c>
 20140b8:       0207a778        andeq   sl, r7, #31457280
 20140bc:       0a000005        beq     20140d8 <GetBuffer+0xc8>
 20140c0:       e5864000        str     r4, [r6]

 203b748:       e0840009        add     r0, r4, r9
 203b74c:       e586000c        str     r0, [r6, #12]
 203b750:       e5880010        str     r0, [r8, #16]
 203b754:       e580600c        str     r6, [r0, #12]

 203c13c:       e5870004        str     r0, [r7, #4]
 203c140:       e3853001        orr     r3, r5, #1
 203c144:       e5843004        str     r3, [r4, #4]

... and here I don't have the full dumps anymore, but the instructions where
the fault occurred are:

 201ede4:       e7893102        str     r3, [r9, r2, lsl #2]

 203ba4c:       e5823004        str     r3, [r2, #4]

Always a single store, but with all sorts of indexing.
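
To make the addressing modes easier to compare, here is a minimal C sketch that pulls the bit fields out of an ARM single data transfer (LDR/STR) encoding. The field layout is the standard ARM one; the struct and function names are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Bit fields of an ARM single data transfer (LDR/STR) instruction. */
struct sdt_fields {
    unsigned cond;     /* bits [31:28], condition code */
    unsigned i;        /* bit 25: 1 = register offset, 0 = immediate */
    unsigned p;        /* bit 24: 1 = pre-indexed/offset, 0 = post-indexed */
    unsigned u;        /* bit 23: 1 = add offset, 0 = subtract */
    unsigned b;        /* bit 22: 1 = byte, 0 = word */
    unsigned w;        /* bit 21: 1 = write base back */
    unsigned l;        /* bit 20: 1 = load, 0 = store */
    unsigned rn;       /* bits [19:16], base register */
    unsigned rd;       /* bits [15:12], source/destination register */
    unsigned offset12; /* bits [11:0]: immediate, or shift/Rm when i = 1 */
};

/* Bits [27:26] == 01 select the single data transfer class.
 * (With i = 1 and bit 4 set the encoding is not a valid LDR/STR;
 * this sketch does not check for that.) */
static int is_single_data_transfer(uint32_t insn)
{
    return ((insn >> 26) & 3u) == 1u;
}

static struct sdt_fields decode_sdt(uint32_t insn)
{
    struct sdt_fields f;

    assert(is_single_data_transfer(insn));
    f.cond = insn >> 28;
    f.i = (insn >> 25) & 1u;
    f.p = (insn >> 24) & 1u;
    f.u = (insn >> 23) & 1u;
    f.b = (insn >> 22) & 1u;
    f.w = (insn >> 21) & 1u;
    f.l = (insn >> 20) & 1u;
    f.rn = (insn >> 16) & 15u;
    f.rd = (insn >> 12) & 15u;
    f.offset12 = insn & 0xfffu;
    return f;
}
```

Run over the five faulting encodings above (e5864000, e580600c, e5843004, e7893102, e5823004), every one decodes as a word store with P = 1 and W = 0, i.e. offset addressing without writeback; none of them is post-indexed.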

> > 1) In some situations, the CPU generates a data abort exception
> > instead of a prefetch abort exception as it should be.  This
> > would explain why the faulty address is equal to the pc.  And
> > since this happens in the middle of a page and there is no way
> > to jump exactly there from another page, this should happen right
> > after a context switch.  However the data abort handler gets
> > the offending memory address from the FAR register but the
> > documentation says that it is used only for data abort exceptions.
> > So is the FAR updated for prefetch abort exception too?  If not,
> > this might not be a wrongly identified prefetch exception but
> > really a data abort exception.  And since the data abort handler
> > subtracts 8 from the pc instead of 4, the pc and faulting address
> > shouldn't match.
> 
> A way of checking this would be to introduce a new field in the task
> structure which contains the PC that the context switch switched to.
> This can be found on the kernel stack, at stack_base+4084.  Then, when
> the problem occurs, you can find out where the context switch returned
> control to.

The fault never seems to happen where control is returned to user space.
User space may regain control after a swi or anywhere else (interrupt,
normal page fault) but it seems not to be near the faulting instruction.

> > 2) In some situations, maybe when the process is restarted after
> > a context switch or similar, the str opcode takes the pc register
> > instead of the r6 register in this case to dereference the address
> > to use for storing.  This would fault since the text segment is
> > mapped read-only.  But here if the pc register was actually used
> > it would have been 8 bytes ahead from the instruction's address,
> > which isn't the case.
> 
> It indeed would fault, and the conditions that the register dump
> are indicating are in fact indicating a user mode store to the
> current PC location.

Most probably.  And it happens at only a few spots.  The first
sequence included above is the most frequent, by a factor of 5 over all
the other occurrences.
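
The address arithmetic behind the two hypotheses can be sketched in C as below. It relies on the usual ARM-state offsets: pc reads as the instruction address + 8 when used as an operand, lr_abt is instruction + 8 on a data abort, and lr_abt is instruction + 4 on a prefetch abort. The function and enum names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* In ARM state, an instruction that reads pc sees its own address + 8. */
static uint32_t pc_as_operand(uint32_t insn_addr)
{
    assert((insn_addr & 3u) == 0);  /* ARM instructions are word aligned */
    return insn_addr + 8u;
}

/* Data abort: lr_abt = aborted instruction + 8. */
static uint32_t insn_from_data_abort_lr(uint32_t lr_abt)
{
    return lr_abt - 8u;
}

/* Prefetch abort: lr_abt = aborted instruction + 4. */
static uint32_t insn_from_prefetch_abort_lr(uint32_t lr_abt)
{
    return lr_abt - 4u;
}

enum fault_pattern {
    FAULT_AT_INSN,        /* fault address == address of the instruction */
    FAULT_AT_PC_OPERAND,  /* fault address == pc as read by the instruction */
    FAULT_ELSEWHERE
};

/* Compare a reported fault address with the reported instruction address. */
static enum fault_pattern classify_fault(uint32_t insn_addr, uint32_t fault_addr)
{
    if (fault_addr == insn_addr)
        return FAULT_AT_INSN;
    if (fault_addr == pc_as_operand(insn_addr))
        return FAULT_AT_PC_OPERAND;
    return FAULT_ELSEWHERE;
}
```

For the first dump, classify_fault(0x020140c0, 0x020140c0) gives FAULT_AT_INSN, whereas a store that really used pc as its base would have targeted 0x020140c8. Likewise, a prefetch abort at 0x020140c0 leaves lr_abt = 0x020140c4, so a handler subtracting 8 would report pc = 0x020140bc rather than 0x020140c0. Neither hypothesis matches the observed values exactly.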

> 
> My `bug' on just one NetWinder (but not another) seemed to be
> an apparent random pipeline error.  I never did get this resolved
> by CCC/HCC/whoever it is, and it's still sitting around here.
> Unfortunately, when I sent it back to them, they just tested it
> with their stuff, and didn't find anything wrong.  Yet, the same
> code running on two supposedly identical NetWinders caused one to
> crash but not the other.  I'm not certain what I can do about this
> NetWinder now - I now use it solely for testing kernels on, but
> nothing else since it can't be trusted.

Here I can reliably reproduce the problem on about 30 different SA1100s.
I don't know how to pinpoint the exact problem, though.

Any other ideas?



Nicolas Pitre, B. ing.
[EMAIL PROTECTED]

