I've been seeing an oops on this machine ever since I first started it
up. I've been going over this with the people on linux-kernel. Because
of the nature of the oopses, they weren't being logged to disk, so I didn't
have a good decode. Today I set up a serial console and finally got to
see where the bugger was oopsing.
I don't know how many people here are also on l-k, but I'll give a brief
rundown of my system configuration and history of the problem for the
benefit of those who aren't.
I have a motherboard with dual SYM53C896 controllers. I'm only using
one. It is connected to an external RAID module. This RAID controller
has 3 SYM53C895s on it. Obviously one channel is connected to the
controller on the motherboard, and another goes to the drive array. But the
third channel is connected to a second motherboard. Eventually (when I
can get a stock kernel stable) I will be running a clustered file system
to be shared between the two motherboards. Right now the second board is
only mounting the file system read-only, so it isn't messing with anything.
This is an SMP machine with a ServerWorks chipset and 1 GB of RAM.
Please do not hesitate to ask me for any additional information. This
machine has no valuable data on it so I'm not afraid to try anything.
First I booted the machine from a boot/root floppy set that I made using
the 2.4.0-test5 kernel I had configured for this machine. I NFS mounted
my work machine to copy the directory structure to the new machine's local
drives. Partway through the copy the machine oopsed. I re-mkfs'ed the
drives and tried again, same thing. So next time I did it in smaller
chunks and eventually got everything copied without problems. I set up the
fstab and lilo.conf and rebooted the machine. It came up fine.
The next step in my usual installation is to recompile everything I'm
going to be using on the machine. So I started with the kernel, to make
sure I had all the options set exactly to my liking. That went fine. So
on to glibc (2.1.3). Partway through the build sed complained about not
being able to find a file. I looked and the file was there. So I typed
make again, and it got past that point only to stop with another missing file
error. Make again, and something in the build process segfaulted this
time. Make one more time, and I got an oops.
I tried glibc a few more times, totally deleting the directory and
starting over, never with any luck. So I figured I'd try rebuilding my
build tools: gcc, binutils, and make. I got all of them to build just
fine. Even gcc's "make bootstrap" went without a hitch. I tried glibc
again, still no good.
Somewhere in here I thought I might be seeing a hardware problem. So I
tested everything as well as I could, running memtest86, burnP6, and
burnBX for hours on end. Not one lock-up or error anywhere. Next I moved
to bonnie++. One bonnie++ running by itself is no problem. But 4 of
them started at the same time would oops the machine when it got to the
"intelligently writing" part.
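For anyone who wants to reproduce the load, here is a minimal sketch of starting several copies of a command at the same time and waiting for all of them. The helper name run_in_parallel is made up for illustration, and this is not the exact way the runs above were started.

```python
import subprocess


def run_in_parallel(command, copies):
    """Start `copies` instances of `command` at once and wait for all.

    `command` is an argument list as accepted by subprocess.Popen.
    Returns the exit codes in launch order.
    """
    # Launch every copy before waiting on any of them, so they
    # really do compete for the disks at the same time.
    procs = [subprocess.Popen(command) for _ in range(copies)]
    return [proc.wait() for proc in procs]
```

Under these assumptions, run_in_parallel(["bonnie++", "-d", "/mnt/raid/bonnie"], 4) would correspond to the 4-way test, where /mnt/raid/bonnie is a hypothetical scratch directory on the array. When run as root, bonnie++ also expects a user to be named with -u.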
I started all this on Monday. It was now Friday. I had been running the
RAID controller on two 16 MB non-parity SIMMs I found lying around. I
really wanted parity RAM, so I had ordered two 64 MB 60ns EDO parity
SIMMs. They arrived on Friday and I installed them. The machine seemed a
little more stable. So I ran bonnie++ again. I was so happy to see it
pass the "intelligently writing" part that I went dancing around the office. When
I came back I found that it had oopsed during the part where it creates
the thousands of files. So I restarted the machine and watched the
whole bonnie++ process. It did the creating files thing for a really long
time (I'm still using the 4 bonnie++s at once), so it was probably just
about done when the oops came.
Just to make myself feel a little better I went back to glibc. This time
with the 128MB of cache on the RAID controller it compiled all the way
through without problem. But 4 bonnie++s can always kill it without fail.
This oops is from 2.4.0-test5 (I have also tried -test6-pre6 and got
just about the same oops) dying during the "create sequential files" part of 4
bonnie++s running at the same time:
Unable to handle kernel NULL pointer dereference at virtual address
00000000
c0119024
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0119024>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010087
eax: c026dcd8 ebx: f7e8d6c0 ecx: 00000023 edx: 00000000
esi: f7e8d720 edi: 00000000 ebp: c1e19e14 esp: c1e19d84
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c1e19000)
Stack: f7e8d6c0 f7e8d720 f7d65d0c f7eb3e10 c01a19fe f7eb6000 dd3b1800
dd3b1800
f7d64e62 dd3b1c72 f7d64e00 00000022 f7eb6000 f7eb3e00 00000002
00000046
f7d64e00 f7eb6000 00000202 f7d65c00 f7edd980 00000286 00000082
f7edd980
Call Trace: [<c01a19fe>] [<c0168e22>] [<c01addf8>] [<c01ad5bb>]
[<c01a9092>] [<c01ad790>] [<c01ad9dd>]
[<c019f707>] [<c01afc3f>] [<c01b0196>] [<c01a76b8>] [<c010c8b1>]
[<c010ca96>] [<c0109390>] [<c0109390>]
[<c010b1e0>] [<c0109390>] [<c0109390>] [<c0100018>] [<c01093bd>]
[<c0109422>] [<c010cad4>] [<c0206ce6>]
Code: 8b 17 89 d0 24 df 85 45 f8 0f 84 54 03 00 00 8b 4d f8 89 4d
>>EIP; c0119024 <__wake_up+84/738> <=====
Trace; c01a19fe <ncr_start_next_ccb+56/88>
Trace; c0168e22 <blkdev_release_request+3a/3c>
Trace; c01addf8 <scsi_request_fn+20c/314>
Trace; c01ad5bb <scsi_queue_next_request+47/110>
Trace; c01a9092 <scsi_release_command+116/120>
Trace; c01ad790 <__scsi_end_request+10c/118>
Trace; c01ad9dd <scsi_io_completion+189/358>
Trace; c019f707 <rw_intr+1cb/1d8>
Trace; c01afc3f <scsi_old_done+43/5b4>
Trace; c01b0196 <scsi_old_done+59a/5b4>
Trace; c01a76b8 <sym53c8xx_intr+80/94>
Trace; c010c8b1 <handle_IRQ_event+4d/78>
Trace; c010ca96 <do_IRQ+a6/f4>
Trace; c0109390 <default_idle+0/34>
Trace; c0109390 <default_idle+0/34>
Trace; c010b1e0 <ret_from_intr+0/20>
Trace; c0109390 <default_idle+0/34>
Trace; c0109390 <default_idle+0/34>
Trace; c0100018 <startup_32+18/cc>
Trace; c01093bd <default_idle+2d/34>
Trace; c0109422 <cpu_idle+3e/54>
Trace; c010cad4 <do_IRQ+e4/f4>
Trace; c0206ce6 <vsprintf+33e/36c>
Code; c0119024 <__wake_up+84/738>
00000000 <_EIP>:
Code; c0119024 <__wake_up+84/738> <=====
0: 8b 17 mov (%edi),%edx <=====
Code; c0119026 <__wake_up+86/738>
2: 89 d0 mov %edx,%eax
Code; c0119028 <__wake_up+88/738>
4: 24 df and $0xdf,%al
Code; c011902a <__wake_up+8a/738>
6: 85 45 f8 test %eax,0xfffffff8(%ebp)
Code; c011902d <__wake_up+8d/738>
9: 0f 84 54 03 00 00 je 363 <_EIP+0x363> c0119387
<__wake_up+3e7/738>
Code; c0119033 <__wake_up+93/738>
f: 8b 4d f8 mov 0xfffffff8(%ebp),%ecx
Code; c0119036 <__wake_up+96/738>
12: 89 4d 00 mov %ecx,0x0(%ebp)
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
NMI Watchdog detected LOCKUP on CPU0, registers:
CPU: 0
EIP: 0010:[<c020c7a4>]
EFLAGS: 00000086
eax: f7d65a00 ebx: f7eb6000 ecx: f7eb6054 edx: f7eb6000
esi: 00000286 edi: 00000000 ebp: c02b4c40 esp: c0277f1c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0277000)
Stack: f7eb6000 c01a76cc c0121b15 f7eb6000 00000000 00000000 00000000
c02b4c40
c017c546 f7edd680 00000086 c011e351 c02cc120 00000000 c011e257
00000000
00000001 c02bc1a0 00000000 0000000e c011e0fc c02bc1a0 c02cc484
c02b2800
Call Trace: [<c01a76cc>] [<c0121b15>] [<c017c546>] [<c011e351>]
[<c011e257>] [<c011e0fc>] [<c010cad4>]
[<c0109390>] [<c0109390>] [<c010b1e0>] [<c0109390>] [<c0109390>]
[<c0100018>] [<c01093bd>] [<c0109422>]
[<c0105000>] [<c01001d0>]
Code: 80 3d 64 42 26 c0 00 f3 90 7e f5 e9 4b af f9 ff 80 7b 44 00
>>EIP; c020c7a4 <stext_lock+4084/9b20> <=====
Trace; c01a76cc <sym53c8xx_timeout+0/68>
Trace; c0121b15 <timer_bh+259/2b0>
Trace; c017c546 <rs_interrupt_single+72/88>
Trace; c011e351 <bh_action+4d/b4>
Trace; c011e257 <tasklet_hi_action+4f/7c>
Trace; c011e0fc <do_softirq+5c/8c>
Trace; c010cad4 <do_IRQ+e4/f4>
Trace; c0109390 <default_idle+0/34>
Trace; c0109390 <default_idle+0/34>
Trace; c010b1e0 <ret_from_intr+0/20>
Trace; c0109390 <default_idle+0/34>
Trace; c0109390 <default_idle+0/34>
Trace; c0100018 <startup_32+18/cc>
Trace; c01093bd <default_idle+2d/34>
Trace; c0109422 <cpu_idle+3e/54>
Trace; c0105000 <empty_bad_page+0/1000>
Trace; c01001d0 <L6+0/2>
Code; c020c7a4 <stext_lock+4084/9b20>
00000000 <_EIP>:
Code; c020c7a4 <stext_lock+4084/9b20> <=====
0: 80 3d 64 42 26 c0 00 cmpb $0x0,0xc0264264 <=====
Code; c020c7ab <stext_lock+408b/9b20>
7: f3 90 repz nop
Code; c020c7ad <stext_lock+408d/9b20>
9: 7e f5 jle 0 <_EIP>
Code; c020c7af <stext_lock+408f/9b20>
b: e9 4b af f9 ff jmp fff9af5b <_EIP+0xfff9af5b>
c01a76ff <sym53c8xx_timeout+33/68>
Code; c020c7b4 <stext_lock+4094/9b20>
10: 80 7b 44 00 cmpb $0x0,0x44(%ebx)
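As a reading aid for the first oops above: the faulting instruction is "mov (%edi),%edx" and edi holds 00000000, which matches the reported NULL address. Below is a small sketch that pulls the 32-bit registers out of a register dump and lists the ones equal to a given address. The function names are made up for illustration, and the regular expression assumes the "eax: c026dcd8" layout shown in this oops.

```python
import re

# Matches pairs such as "eax: c026dcd8". Segment registers
# ("ds: 0018") are skipped because their values are only 4 digits.
REG_RE = re.compile(r"\b(e[a-z]{2}):\s*([0-9a-f]{8})\b")


def parse_registers(oops_text):
    """Return a dict of register name -> integer value."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


def registers_holding(oops_text, address):
    """Return the sorted names of registers whose value equals `address`."""
    regs = parse_registers(oops_text)
    return sorted(name for name, value in regs.items() if value == address)


SAMPLE = """\
eax: c026dcd8 ebx: f7e8d6c0 ecx: 00000023 edx: 00000000
esi: f7e8d720 edi: 00000000 ebp: c1e19e14 esp: c1e19d84
ds: 0018 es: 0018 ss: 0018
"""

print(registers_holding(SAMPLE, 0x00000000))  # ['edi', 'edx']
```

Both edi and edx are zero here, but only edi is used as the memory operand of the faulting instruction. edx is just the destination.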
--
Two penguins were walking on an iceberg. The first one said to the
second, "you look like you are wearing a tuxedo." The second one said,
"I might be..."
--David Lynch, Twin Peaks