Chris,
I think Gerard is on vacation. But, before he left, he made available some
code that you should probably try with your problem...
ftp://ftp.tux.org/pub/roudier/drivers/linux/stable/
sym-1.7.1-ncr-3.4.1.tar.gz
<>< Lance.
Chris Meadors wrote:
> I've been seeing an oops on this machine ever since I first started it
> up. I've been going over this with the people on linux-kernel. Because
> of the nature of the oopses, they weren't being logged to disk, so I didn't
> have a good decode. Today I set up a serial console and finally got to
> see where the bugger was oopsing.
>
> I don't know how many people here are also on l-k, but I'll give a brief
> run down of my system configuration and history of the problem for the
> benefit of those who aren't.
>
> I have a motherboard with dual SYM53C896 controllers. I'm only using
> one. It is connected to an external RAID module. This RAID controller
> has 3 SYM53C895s on it. Obviously one channel is connected to the
> controller on the motherboard, and another is to the drive array. But the
> third channel is connected to a second motherboard. Eventually (when I
> can get a stock kernel stable) I will be running a clustered file system
> to be shared between the two motherboards. Right now the second board is
> only mounting the file system read only so it isn't messing with anything.
>
> This is a SMP machine with a ServerWorks chipset and 1 GB of RAM.
>
> Please do not hesitate to ask me for any additional information. This
> machine has no valuable data on it so I'm not afraid to try anything.
>
> First I booted the machine from a boot/root floppy set that I made using
> the 2.4.0-test5 kernel I had configured for this machine. I NFS mounted
> my work machine to copy the directory structure to the new machine's local
> drives. Part way through the copy the machine oopsed. I re-mkfs'ed the
> drives and tried again, same thing. So next time I did it in smaller
> chunks and eventually got everything copied without problems. I set up the
> fstab and lilo.conf and rebooted the machine. It came up fine.
>
> The next step in my usual installation is to recompile everything I'm
> going to be using on the machine. So I started with the kernel, to make
> sure I had all the options set exactly to my liking. That went fine. So
> on to glibc (2.1.3). Part way through the build sed complained about not
> being able to find a file. I looked and the file was there. So I typed
> make again, it got past that point only to stop with another missing file
> error. Make again, something in the build process segfaulted this
> time. Make one more time, and I got an oops.
>
> I tried glibc a few more times, totally deleting the directory and
> starting over, never with any luck. So I figured I'd try rebuilding my
> build programs: gcc, binutils, and make. I got all of them to build just
> fine. Even gcc's "make bootstrap" went without a hitch. I tried glibc
> again, still no good.
>
> Somewhere in here I thought I might be seeing a hardware problem. So I
> tested everything as well as I could, running memtest86, burnP6, and
> burnBX for hours on end. Not one lock-up or error anywhere. Next I moved
> to bonnie++. One bonnie++ running by itself is no problem. But 4 of
> them started at the same time would oops the machine when it got to the
> "intelligently writing" part.
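
For anyone trying to reproduce the "4 at once" case, here is a minimal sketch of a launcher. It is not what Chris actually ran; the scratch directory and user name are placeholders, and it assumes bonnie++ is on the PATH and accepts its usual -d (directory) and -u (user) options. DEMO_CMD is a harmless stand-in for a dry run.

```python
import subprocess
import sys

# Harmless stand-in command for a dry run; it just starts Python and exits.
DEMO_CMD = [sys.executable, "-c", "pass"]


def bonnie_command(directory, user):
    """Build a bonnie++ command line (hypothetical directory and user)."""
    return ["bonnie++", "-d", directory, "-u", user]


def run_concurrently(cmd, copies=4):
    """Start `copies` instances of cmd at the same time; return exit codes."""
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    # wait() only returns if the machine survives the run; after an oops
    # like the one reported here, the box would hang or panic first.
    return [p.wait() for p in procs]


# Dry run with the stand-in command:
#   run_concurrently(DEMO_CMD, copies=4)
# Real run (placeholder path and user, adjust for your system):
#   run_concurrently(bonnie_command("/mnt/raid/scratch", "nobody"), copies=4)
```

Starting all the processes before waiting on any of them is the point: a single bonnie++ was fine, so the load only becomes interesting when the instances overlap.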
>
> I started all this on Monday. It was now Friday. I had been running the
> RAID controller on two 16 MB non-parity SIMMs I found lying around. I
> really wanted parity RAM, so I had ordered two 64 MB 60ns EDO parity
> SIMMs. They arrived on Friday and I installed them. The machine seemed a
> little more stable. So I ran bonnie++ again. I was so happy to see it
> pass the "intelligently writing" part that I went dancing around the
> office. When I came back I found that it had oopsed during the part where
> it creates the thousands of files. So I restarted the machine and watched
> the whole bonnie++ process. It did the creating files thing for a really
> long time (I'm still using the 4 bonnie++s at once). So it is probably just
> about done when the oops comes.
>
> Just to make myself feel a little better I went back to glibc. This time
> with the 128MB of cache on the RAID controller it compiled all the way
> through without problem. But 4 bonnie++s can always kill it without fail.
>
> This oops is from 2.4.0-test5 (I have also tried -test6-pre6 and got
> just about the same oops), dying during the create sequential files part
> of 4 bonnie++s running at the same time:
>
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000
> c0119024
> *pde = 00000000
> Oops: 0000
> CPU: 1
> EIP: 0010:[<c0119024>]
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010087
> eax: c026dcd8 ebx: f7e8d6c0 ecx: 00000023 edx: 00000000
> esi: f7e8d720 edi: 00000000 ebp: c1e19e14 esp: c1e19d84
> ds: 0018 es: 0018 ss: 0018
> Process swapper (pid: 0, stackpage=c1e19000)
> Stack: f7e8d6c0 f7e8d720 f7d65d0c f7eb3e10 c01a19fe f7eb6000 dd3b1800
> dd3b1800
> f7d64e62 dd3b1c72 f7d64e00 00000022 f7eb6000 f7eb3e00 00000002
> 00000046
> f7d64e00 f7eb6000 00000202 f7d65c00 f7edd980 00000286 00000082
> f7edd980
> Call Trace: [<c01a19fe>] [<c0168e22>] [<c01addf8>] [<c01ad5bb>]
> [<c01a9092>] [<c01ad790>] [<c01ad9dd>]
> [<c019f707>] [<c01afc3f>] [<c01b0196>] [<c01a76b8>] [<c010c8b1>]
> [<c010ca96>] [<c0109390>] [<c0109390>]
> [<c010b1e0>] [<c0109390>] [<c0109390>] [<c0100018>] [<c01093bd>]
> [<c0109422>] [<c010cad4>] [<c0206ce6>]
> Code: 8b 17 89 d0 24 df 85 45 f8 0f 84 54 03 00 00 8b 4d f8 89 4d
>
> >>EIP; c0119024 <__wake_up+84/738> <=====
> Trace; c01a19fe <ncr_start_next_ccb+56/88>
> Trace; c0168e22 <blkdev_release_request+3a/3c>
> Trace; c01addf8 <scsi_request_fn+20c/314>
> Trace; c01ad5bb <scsi_queue_next_request+47/110>
> Trace; c01a9092 <scsi_release_command+116/120>
> Trace; c01ad790 <__scsi_end_request+10c/118>
> Trace; c01ad9dd <scsi_io_completion+189/358>
> Trace; c019f707 <rw_intr+1cb/1d8>
> Trace; c01afc3f <scsi_old_done+43/5b4>
> Trace; c01b0196 <scsi_old_done+59a/5b4>
> Trace; c01a76b8 <sym53c8xx_intr+80/94>
> Trace; c010c8b1 <handle_IRQ_event+4d/78>
> Trace; c010ca96 <do_IRQ+a6/f4>
> Trace; c0109390 <default_idle+0/34>
> Trace; c0109390 <default_idle+0/34>
> Trace; c010b1e0 <ret_from_intr+0/20>
> Trace; c0109390 <default_idle+0/34>
> Trace; c0109390 <default_idle+0/34>
> Trace; c0100018 <startup_32+18/cc>
> Trace; c01093bd <default_idle+2d/34>
> Trace; c0109422 <cpu_idle+3e/54>
> Trace; c010cad4 <do_IRQ+e4/f4>
> Trace; c0206ce6 <vsprintf+33e/36c>
> Code; c0119024 <__wake_up+84/738>
> 00000000 <_EIP>:
> Code; c0119024 <__wake_up+84/738> <=====
> 0: 8b 17 mov (%edi),%edx <=====
> Code; c0119026 <__wake_up+86/738>
> 2: 89 d0 mov %edx,%eax
> Code; c0119028 <__wake_up+88/738>
> 4: 24 df and $0xdf,%al
> Code; c011902a <__wake_up+8a/738>
> 6: 85 45 f8 test %eax,0xfffffff8(%ebp)
> Code; c011902d <__wake_up+8d/738>
> 9: 0f 84 54 03 00 00 je 363 <_EIP+0x363> c0119387
> <__wake_up+3e7/738>
> Code; c0119033 <__wake_up+93/738>
> f: 8b 4d f8 mov 0xfffffff8(%ebp),%ecx
> Code; c0119036 <__wake_up+96/738>
> 12: 89 4d 00 mov %ecx,0x0(%ebp)
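
As a reading aid for the dump above, here is a small sketch that lines up the faulting instruction with the register dump. The register values are copied from the first oops; the helper names are made up for illustration. It only shows that the base register of `mov (%edi),%edx` holds the reported fault address. It does not tell you which caller handed __wake_up the NULL pointer.

```python
import re

# Register dump copied from the first oops in this message.
REG_DUMP = """\
eax: c026dcd8 ebx: f7e8d6c0 ecx: 00000023 edx: 00000000
esi: f7e8d720 edi: 00000000 ebp: c1e19e14 esp: c1e19d84
"""

# Matches the eight 32-bit general registers, e.g. "edi: 00000000".
REG_RE = re.compile(r"\b(e[a-z]{2}):\s*([0-9a-f]{8})\b")


def parse_registers(text):
    """Return {register name: integer value} from an i386 oops register dump."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def base_register(operand):
    """Extract the base register from an AT&T operand like '(%edi)' or '0x44(%ebx)'."""
    match = re.search(r"\(%(e[a-z]{2})", operand)
    return match.group(1) if match else None


regs = parse_registers(REG_DUMP)
base = base_register("(%edi)")  # source operand of: mov (%edi),%edx
print(f"{base} = {regs[base]:08x}")
```

With the values above this reports edi as 00000000, which matches the "NULL pointer dereference at virtual address 00000000" line at the top of the oops.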
>
> Aiee, killing interrupt handler
> Kernel panic: Attempted to kill the idle task!
> NMI Watchdog detected LOCKUP on CPU0, registers:
> CPU: 0
> EIP: 0010:[<c020c7a4>]
> EFLAGS: 00000086
> eax: f7d65a00 ebx: f7eb6000 ecx: f7eb6054 edx: f7eb6000
> esi: 00000286 edi: 00000000 ebp: c02b4c40 esp: c0277f1c
> ds: 0018 es: 0018 ss: 0018
> Process swapper (pid: 0, stackpage=c0277000)
> Stack: f7eb6000 c01a76cc c0121b15 f7eb6000 00000000 00000000 00000000
> c02b4c40
> c017c546 f7edd680 00000086 c011e351 c02cc120 00000000 c011e257
> 00000000
> 00000001 c02bc1a0 00000000 0000000e c011e0fc c02bc1a0 c02cc484
> c02b2800
> Call Trace: [<c01a76cc>] [<c0121b15>] [<c017c546>] [<c011e351>]
> [<c011e257>] [<c011e0fc>] [<c010cad4>]
> [<c0109390>] [<c0109390>] [<c010b1e0>] [<c0109390>] [<c0109390>]
> [<c0100018>] [<c01093bd>] [<c0109422>]
> [<c0105000>] [<c01001d0>]
> Code: 80 3d 64 42 26 c0 00 f3 90 7e f5 e9 4b af f9 ff 80 7b 44 00
>
> >>EIP; c020c7a4 <stext_lock+4084/9b20> <=====
> Trace; c01a76cc <sym53c8xx_timeout+0/68>
> Trace; c0121b15 <timer_bh+259/2b0>
> Trace; c017c546 <rs_interrupt_single+72/88>
> Trace; c011e351 <bh_action+4d/b4>
> Trace; c011e257 <tasklet_hi_action+4f/7c>
> Trace; c011e0fc <do_softirq+5c/8c>
> Trace; c010cad4 <do_IRQ+e4/f4>
> Trace; c0109390 <default_idle+0/34>
> Trace; c0109390 <default_idle+0/34>
> Trace; c010b1e0 <ret_from_intr+0/20>
> Trace; c0109390 <default_idle+0/34>
> Trace; c0109390 <default_idle+0/34>
> Trace; c0100018 <startup_32+18/cc>
> Trace; c01093bd <default_idle+2d/34>
> Trace; c0109422 <cpu_idle+3e/54>
> Trace; c0105000 <empty_bad_page+0/1000>
> Trace; c01001d0 <L6+0/2>
> Code; c020c7a4 <stext_lock+4084/9b20>
> 00000000 <_EIP>:
> Code; c020c7a4 <stext_lock+4084/9b20> <=====
> 0: 80 3d 64 42 26 c0 00 cmpb $0x0,0xc0264264 <=====
> Code; c020c7ab <stext_lock+408b/9b20>
> 7: f3 90 repz nop
> Code; c020c7ad <stext_lock+408d/9b20>
> 9: 7e f5 jle 0 <_EIP>
> Code; c020c7af <stext_lock+408f/9b20>
> b: e9 4b af f9 ff jmp fff9af5b <_EIP+0xfff9af5b>
> c01a76ff <sym53c8xx_timeout+33/68>
> Code; c020c7b4 <stext_lock+4094/9b20>
> 10: 80 7b 44 00 cmpb $0x0,0x44(%ebx)
>
> --
> Two penguins were walking on an iceberg. The first one said to the
> second, "you look like you are wearing a tuxedo." The second one said,
> "I might be..."
> --David Lynch, Twin Peaks
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [EMAIL PROTECTED]