Hi All.
Here is a (slightly edited) copy of the mail I sent to ARM/Intel on Monday.
Enjoy - it could explain a lot of problems.
Further to this, I have been able to confirm that switching off the data cache
clears the following problems:
1. Random corruption of r5 in schedule(), which causes the kernel to panic
in the idle task.
2. Failure of the tulip driver to work on the EBSA-285 architecture.
3. Random segmentation violations with Corel's binaries.
I have not personally verified that 2) and 3) are related to this exact
problem, but turning off the cache certainly seems to fix them, and since 2)
gives similar behaviour, I think it is highly likely that it is. 3) could
instead be something to do with the ELF stuff, although, from the tests I have
conducted, I don't think so.
I am releasing this only as PRELIMINARY, i.e. I reserve the right to be
totally wrong, but I think that people using the chip ought to know about a
possible problem.
* * * * * PRELIMINARY * * * * *
After the weekend's work on the Linux kernel, I am convinced that I have found
a bug in the SA110 revision S processor. My reasons are these:
1. Dave Gilbert has been experiencing random crashes with 2.1.126 on his EBSA285
with the tulip netcard driver, which only occur with the data cache on. Any
attempt to add debugging causes the nature of the problem to change.
If I supply Dave a kernel configured for his machine compiled using my
GCC 2.7.2.2 ELF tools (which I trust), he sees the same problem. However,
if I compile up an EBSA285 kernel, well - it's been running for 28 days
without one single problem now.
2. In 2.1.129 recently, I have been getting data aborts in schedule().
On close investigation, it appears that the following code sequence,
aligned as shown, causes the SA110S to misbehave:
    r4 = c00ff6c0   r5 = c00ffc40   r6 = 00000000

    schedule:
    c0018ba4:   mov     ip, sp
    c0018ba8:   stmfd   sp!, {r4, r5, r6, fp, ip, lr, pc}
    c0018bac:   sub     fp, ip, #4          ; sp = c00fdf4c
    c0018bb0:   bic     r5, sp, #0x1f00
    c0018bb4:   bic     r5, r5, #0xff

While this code executes as expected 99% of the time, the SA110S appears,
under some bizarre circumstances, to miss the instruction at c0018bb0.
As a result, r5 ends up containing c00ffc00 (the old r5, c00ffc40, with only
its low byte cleared), NOT c00fc000.
The problem appears to be dependent on the alignment of this routine,
the number of instructions between the stmfd and the first bic, and
the registers used. As a result, I believe that it is an interaction
between the Icache and the stmfd instruction.
There has already been one instance of Icache and stm interaction causing
problems - the stmia ..., {r8-pc}^ instruction (which stores both non-user
mode (r8 - r12) and user mode (sp, lr) registers), which ARM Linux already
avoids.
I have not noticed this problem until now because I normally use my
special GCC 2.7.2.2 ELF for ARM, which does not optimise to the same
level (and inserts an extra instruction between the sub and the bic).
3. Running vdir /usr -R > /dev/null on the Netwinder under the above-mentioned
2.1.129 kernel (but with the assembler modified to prevent this bug) appears
to cause a lot of segmentation violations. Turning off the data cache (only)
cures this problem.
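
For anyone who wants to check the arithmetic in point 2, here is a small
sketch (Python, purely illustrative, not kernel code) that models the two bic
instructions on 32-bit values. It assumes the sp value from the comment in the
listing and the r5 value from the register dump; the function names are made
up for this example.

```python
MASK32 = 0xFFFFFFFF  # ARM registers are 32 bits wide


def bic(value, imm):
    """Model ARM BIC: clear the bits of `value` that are set in `imm`."""
    return value & ~imm & MASK32


def r5_expected(sp):
    """Both instructions execute (hypothetical helper name)."""
    r5 = bic(sp, 0x1F00)   # c0018bb0: bic r5, sp, #0x1f00
    return bic(r5, 0xFF)   # c0018bb4: bic r5, r5, #0xff


def r5_if_first_bic_missed(old_r5):
    """Only the second bic takes effect, so it works on the stale r5."""
    return bic(old_r5, 0xFF)


SP = 0xC00FDF4C      # from the "; sp = c00fdf4c" comment in the listing
OLD_R5 = 0xC00FFC40  # r5 on entry, from the register dump

print(hex(r5_expected(SP)))                # 0xc00fc000
print(hex(r5_if_first_bic_missed(OLD_R5)))  # 0xc00ffc00
```

The first result is the value r5 should hold. The second is what is left if
the bic at c0018bb0 has no effect and only the low byte of the old r5 is
cleared, which matches the corrupted value reported above.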
* * * * * PRELIMINARY * * * * *
Disclaimer: Testing is still in progress. I reserve the right to be wrong.
Hence, this may not be a processor problem at all.
--
Russell King [EMAIL PROTECTED]
http://www.arm.linux.org.uk/~rmk/aboutme.html
THE developer of ARM Linux