On Sat, 12 Jun 1999, Russell King - ARM Linux Admin wrote:
> Nicolas Pitre writes:
> > lpn: memory violation at pc=0x020140c0, lr=0x020282f0
> > (bad address=0x020140c0, code 1)
> > pc : [<020140c0>] lr : [<020282f0>]
> > r7 : 01fa582c r6 : 020f4090 r5 : 020adf88 r4 : 00001800
> > 20140c0: e5864000 str r4, [r6]
>
> When your hack is triggered, could you try calling
> show_pte(current->mm, regs->ARM_pc) please? This should tell
> you what's in the page tables for the task.
*pgd = c012d001, *pmd = c012d001, *pte = c035a01f, *ppte = c035aaae
*pgd = c012d001, *pmd = c012d001, *pte = c036e01f, *ppte = c036eaae
etc...
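
For reference, a minimal sketch of how the *ppte values above can be
read. It assumes *ppte is the hardware second-level descriptor and that
it uses the ARMv4 "small page" layout (type in bits [1:0], B in bit 2,
C in bit 3, four 2-bit AP fields in bits [11:4], page base in bits
[31:12]). The struct and function names are made up for illustration and
are not kernel code.

```c
#include <stdint.h>

/* Fields of an ARMv4 second-level "small page" descriptor (assumed layout). */
struct small_page {
    uint32_t base;       /* physical page base, bits [31:12] */
    unsigned type;       /* bits [1:0]; 2 (binary 10) means small page */
    unsigned bufferable; /* B, bit 2 */
    unsigned cacheable;  /* C, bit 3 */
    unsigned ap[4];      /* AP0..AP3, one 2-bit field per 1 KB subpage */
};

/* Split a raw descriptor word into its fields. */
struct small_page decode_small_page(uint32_t desc)
{
    struct small_page p;
    int i;

    p.base = desc & 0xfffff000u;
    p.type = desc & 3u;
    p.bufferable = (desc >> 2) & 1u;
    p.cacheable = (desc >> 3) & 1u;
    for (i = 0; i < 4; i++)
        p.ap[i] = (desc >> (4 + 2 * i)) & 3u;
    return p;
}

/* AP == 2 (binary 10) means privileged read/write, user read-only.
 * Returns 1 only if this is a small page and all four subpages have AP == 2. */
int user_read_only(const struct small_page *p)
{
    int i;

    if (p->type != 2u)
        return 0;
    for (i = 0; i < 4; i++)
        if (p->ap[i] != 2u)
            return 0;
    return 1;
}
```

Under those assumptions, decode_small_page(0xc035aaae) gives a cacheable,
bufferable small page at 0xc035a000 with AP == 2 in all four subpages,
i.e. read-only from user mode, which is what one would expect for a text
page.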
> Also, could you try to find out what the value is in memory
> there when the handler is called?
I modified my hack so that it now just ignores the fault and returns
immediately. And... it seems to do the job: execution resumes happily
for about 30 seconds to one minute before the hack is triggered again.
So I must assume the memory content is sane.
> > What is really weird about it is the fact that the faulty address is equal
> > to the pc. However, r6 which is used to do the str contains actually a
> > good address and is quite different from the pc.
>
> Is there any chance you could try modifying the instruction to
> be pre-indexed instead of post-indexed?
The fault seems to happen in many places, but not just anywhere. Here are
some dumps. The last instruction of each sequence is where the fault occurred:
2014098: eb005082 bl 20282a8 <__read>
201409c: e2504000 subs r4, r0, #0
20140a0: aa000005 bge 20140bc <GetBuffer+0xac>
20140a4: e59f000c ldr r0, 20140b8 <GetBuffer+0xa8>
20140a8: eb008de2 bl 2037838 <perror>
20140ac: e1a00006 mov r0, r6
20140b0: eb009f83 bl 203bec4 <__libc_free>
20140b4: eaffffec b 201406c <GetBuffer+0x5c>
20140b8: 0207a778 andeq sl, r7, #31457280
20140bc: 0a000005 beq 20140d8 <GetBuffer+0xc8>
20140c0: e5864000 str r4, [r6]
203b748: e0840009 add r0, r4, r9
203b74c: e586000c str r0, [r6, #12]
203b750: e5880010 str r0, [r8, #16]
203b754: e580600c str r6, [r0, #12]
203c13c: e5870004 str r0, [r7, #4]
203c140: e3853001 orr r3, r5, #1
203c144: e5843004 str r3, [r4, #4]
... and here I don't have the full dump anymore, but the instructions where
the fault occurred are:
201ede4: e7893102 str r3, [r9, r2, lsl #2]
203ba4c: e5823004 str r3, [r2, #4]
It is always a single store, but with almost every sort of indexing.
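
On the pre- versus post-indexed question, the addressing mode can be read
straight from the opcodes listed above. The following is only a sketch of
the ARM single data transfer encoding (I = bit 25, P = bit 24, U = bit 23,
B = bit 22, W = bit 21, L = bit 20, Rn = bits [19:16], Rd = bits [15:12]).
The struct and helper names are hypothetical.

```c
#include <stdint.h>

/* Decoded control bits of an ARM single data transfer (LDR/STR) opcode. */
struct sdt {
    unsigned is_sdt;     /* bits [27:26] == 1 (binary 01) */
    unsigned reg_offset; /* I: 1 = (shifted) register offset, 0 = immediate */
    unsigned pre;        /* P: 1 = offset applied before the transfer */
    unsigned up;         /* U: 1 = offset added, 0 = subtracted */
    unsigned byte;       /* B: 1 = byte transfer, 0 = word transfer */
    unsigned writeback;  /* W: 1 = base register updated */
    unsigned load;       /* L: 1 = load, 0 = store */
    unsigned rn;         /* base register number */
    unsigned rd;         /* source/destination register number */
};

struct sdt decode_sdt(uint32_t insn)
{
    struct sdt d;

    d.is_sdt     = ((insn >> 26) & 3u) == 1u;
    d.reg_offset = (insn >> 25) & 1u;
    d.pre        = (insn >> 24) & 1u;
    d.up         = (insn >> 23) & 1u;
    d.byte       = (insn >> 22) & 1u;
    d.writeback  = (insn >> 21) & 1u;
    d.load       = (insn >> 20) & 1u;
    d.rn         = (insn >> 16) & 0xfu;
    d.rd         = (insn >> 12) & 0xfu;
    return d;
}

/* Word store using plain offset addressing: P = 1, W = 0, L = 0, B = 0. */
int is_offset_word_store(uint32_t insn)
{
    struct sdt d = decode_sdt(insn);

    return d.is_sdt && d.pre && !d.writeback && !d.load && !d.byte;
}
```

All five faulting opcodes (e5864000, e580600c, e5843004, e7893102,
e5823004) decode as word stores with P = 1 and W = 0, so none of them is
post-indexed and none writes back to the base register.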
> > 1) In some situations, the CPU generates a data abort exception
> > instead of a prefetch abort exception as it should be. This
> > would explain why the faulty address is equal to the pc. And
> > since this happens in the middle of a page and there is no way
> > to jump exactly there from another page, this should happen right
> > after a context switch. However the data abort handler gets
> > the offending memory address from the FAR register but the
> > documentation says that it is used only for data abort exceptions.
> > So is the FAR updated for prefetch abort exception too? If not,
> > this might not be a wrongly identified prefetch exception but
> > really a data abort exception. And since the data abort handler
> > subtracts 8 from the pc instead of 4, the pc and faulting address
> > shouldn't match.
>
> A way of checking this would be to introduce a new field in the task
> structure which contains the PC that the context switch switched to.
> This can be found on the kernel stack, at stack_base+4084. Then, when
> the problem occurs, you can find out where the context switch returned
> control to.
The fault never seems to happen where control is returned to user space.
User space may regain control after a swi or anywhere else (interrupt,
normal page fault) but it seems not to be near the faulting instruction.
> > 2) In some situations, maybe when the process is restarted after
> > a context switch or similar, the str opcode takes the pc register
> > instead of the r6 register in this case to dereference the address
> > to use for storing. This would fault since the text segment is
> > mapped read-only. But here if the pc register was actually used
> > it would have been 8 bytes ahead from the instruction's address,
> > which isn't the case.
>
> It indeed would fault, and the conditions that the register dump
> are indicating are in fact indicating a user mode store to the
> current PC location.
That seems the most probable. And it happens at only a few spots. The
first sequence included above is the most frequent by a factor of 5 over
all other occurrences.
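
The offset arithmetic behind both hypotheses can be sketched as follows.
It relies on the usual ARM conventions: on a data abort the link register
holds the aborted instruction's address + 8, on a prefetch abort it holds
the address + 4, and pc read as an operand is the instruction's address
+ 8. The function names are hypothetical.

```c
#include <stdint.h>

enum {
    DATA_ABORT_LR_OFFSET     = 8, /* lr = aborted instruction + 8 */
    PREFETCH_ABORT_LR_OFFSET = 4, /* lr = aborted instruction + 4 */
    PC_READ_OFFSET           = 8  /* pc read as an operand = instruction + 8 */
};

/* pc value a handler reports when it subtracts handler_adjust from lr. */
uint32_t reported_pc(uint32_t insn_addr, uint32_t lr_offset,
                     uint32_t handler_adjust)
{
    return insn_addr + lr_offset - handler_adjust;
}

/* Address a store would use if pc were really taken as the base register. */
uint32_t pc_as_base(uint32_t insn_addr)
{
    return insn_addr + PC_READ_OFFSET;
}
```

For the instruction at 0x020140c0: a genuine data abort reports
pc = 0x020140c0; a prefetch abort run through the data abort handler
(which subtracts 8) would report 0x020140bc; and a store that really used
pc as its base would target 0x020140c8. Only the first case matches the
dump, where pc and the bad address are both 0x020140c0.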
>
> My `bug' on just one NetWinder (but not another) seemed to be
> an apparent random pipeline error. I never did get this resolved
> by CCC/HCC/whoever it is, and it's still sitting around here.
> Unfortunately, when I sent it back to them, they just tested it
> with their stuff, and didn't find anything wrong. Yet, the same
> code running on two supposedly identical NetWinders caused one to
> crash but not the other. I'm not certain what I can do about this
> NetWinder now - I now use it solely for testing kernels on, but
> nothing else since it can't be trusted.
Here I can reproduce the problem on about 30 different SA1100's reliably.
I don't know how to pinpoint the exact problem though.
Any other ideas?
Nicolas Pitre, B. ing.