When running an SMP kernel, I get SCSI timeouts and resets under load (it doesn't take much: copying files or compiling will do the trick). I enabled as many verbose flags as possible in aic7xxx, and it looks as though SCSI commands are being dropped, including some that had already completed (see the messages included below).
If I boot a non-SMP kernel, I CANNOT reproduce the errors (though perhaps one CPU simply cannot generate as much disk traffic as two). Hence, I suspect SMP, the IO-APIC, and/or the aic7xxx driver.
I have tried kernels 2.0.36, 2.2.1, 2.2.3, and 2.2.5 with various compile options (e.g. PCI bridging, MTRR, and anything else I found that might be related to the problem).
Hardware:
HP Netserver LH Pro
128 Meg RAM (2 - 64 Meg DIMM)
2 - Pentium Pro 200's
2 - aic7880 on-board (PCI): They share interrupt 11 and cannot be changed to have unique interrupts (the EISA config utility promptly sets both adapters to the same interrupt when either is changed).
NOTE: I noticed that the 1st CPU has 512K cache while the 2nd CPU only has 256K cache.
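One way to confirm the interrupt sharing from the running system is to look at /proc/interrupts. The following is a minimal Python sketch, not part of the original report: the function name `shared_irqs` is hypothetical, and the parsing assumes the usual layout of an IRQ number, one counter per CPU, a controller type, and a comma-separated list of handler names.

```python
def shared_irqs(text):
    """Return {irq_number: [handler names]} for IRQ lines with more than one handler.

    `text` is expected to be the contents of /proc/interrupts. The layout is
    an assumption: "<irq>: <count per CPU>... <controller type> <name>, <name>".
    """
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        # Skip the CPU header line and non-numeric rows such as NMI or ERR.
        if not sep or not label.isdigit():
            continue
        fields = rest.split()
        # Drop the per-CPU interrupt counters at the start of the row.
        while fields and fields[0].isdigit():
            fields.pop(0)
        if len(fields) < 2:
            continue
        # Split the remainder on commas; each piece ends with a handler name,
        # and the first piece is prefixed by the controller type.
        pieces = " ".join(fields).split(",")
        names = [p.split()[-1] for p in pieces if p.strip()]
        if len(names) > 1:
            shared[int(label)] = names
    return shared
```

On the machine itself this could be called as `shared_irqs(open("/proc/interrupts").read())`; if both adapters really sit on interrupt 11, that entry should list the driver name twice.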
Software:
Kernel 2.2.5
Raid 0.90
aic7xxx v5.13
Except for the cache difference between the two CPUs, I believe I have eliminated hardware problems through multiple tests with and without SMP, diagnostic utilities, removing and swapping DIMMs, etc.
The following are a limited set of debug messages from the aic7xxx driver:
Apr 10 18:40:11 lachesis kernel: scsi : aborting command due to timeout : pid 4274, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 48 00 e8 00 00 08 00
Apr 10 18:40:11 lachesis kernel: (scsi0:0:1:0) Abort called for already completed command.
Apr 10 18:40:11 lachesis kernel: scsi : aborting command due to timeout : pid 4275, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 49 04 90 00 00 08 00
Apr 10 18:40:11 lachesis kernel: (scsi0:0:1:0) Aborting scb 10, flags 0x4
Apr 10 18:40:11 lachesis kernel: (scsi0:0:1:0) SCB is currently active. Waiting on completion.
Apr 10 18:40:11 lachesis kernel: scsi : aborting command due to timeout : pid 4277, scsi1, channel 0, id 4, lun 0 Write (10) 00 00 49 04 90 00 00 08 00
Apr 10 18:40:11 lachesis kernel: (scsi1:0:4:0) Aborting scb 10, flags 0x6
Apr 10 18:40:11 lachesis kernel: (scsi1:0:4:0) SCB found on waiting list and aborted.
Apr 10 18:40:11 lachesis kernel: (scsi1:0:4:0) Aborting scb 10
Apr 10 18:40:11 lachesis kernel: (scsi1:-1:-1:-1) 1 commands found and queued for completion.
Apr 11 14:43:56 lachesis kernel: scsi : aborting command due to timeout : pid 13827, scsi1, channel 0, id 4, lun 0 Write (10) 00 00 30 00 10 00 00 08 00
Apr 11 14:43:56 lachesis kernel: (scsi1:0:4:0) Aborting scb 11, flags 0x4
Apr 11 14:43:56 lachesis kernel: (scsi1:0:4:0) SCB disconnected. Queueing Abort SCB.
Apr 11 14:43:56 lachesis kernel: (scsi1:0:4:0) Abort message mailed.
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:0) SCB 13 abort delivered.
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:-1) Reset device, active_scb 2
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:-1) Cleaning up status information and delayed_scbs.
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:0:tag12) matches search criteria (scsi0:0:1:-1:tag255)
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:0:tag9) matches search criteria (scsi0:0:1:-1:tag255)
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:-1) Cleaning QINFIFO.
Apr 11 14:43:56 lachesis kernel: (scsi0:0:1:-1) Cleaning waiting_scbs.
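For anyone wanting to tally the timeouts in a longer log, here is a minimal Python sketch, not part of the original report. All function names are hypothetical. It assumes the bytes printed after "Write (10)" are the nine CDB bytes that follow the opcode, so that bytes 2-5 of the CDB are the big-endian logical block address and bytes 7-8 the transfer length in blocks (the standard WRITE(10) layout). It also rejoins lines that a mail client has wrapped, treating any line without a leading syslog timestamp as a continuation.

```python
import re
from collections import Counter

# One record copied from the messages above, wrapped as it arrived by mail.
SAMPLE = """\
Apr 10 18:40:11 lachesis kernel: scsi : aborting command due to
timeout : pid 4274, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 48 00
e8 00 00 08 00
Apr 10 18:40:11 lachesis kernel: (scsi0:0:1:0) Abort called for
already completed command.
"""

STAMP_RE = re.compile(r"[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : pid (\d+), scsi(\d+), channel (\d+), "
    r"id (\d+), lun (\d+) Write \(10\) ((?:[0-9a-f]{2}\s*){9})"
)


def unwrap(text):
    """Rejoin wrapped lines: a line without a timestamp continues the previous one."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if STAMP_RE.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def decode_write10(hex_bytes):
    """Decode the nine bytes after the opcode into (lba, blocks).

    Assumes the standard WRITE(10) CDB layout described in the lead-in.
    """
    raw = bytes.fromhex("".join(hex_bytes.split()))
    if len(raw) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    lba = int.from_bytes(raw[1:5], "big")     # CDB bytes 2-5
    blocks = int.from_bytes(raw[6:8], "big")  # CDB bytes 7-8
    return lba, blocks


def timeouts(text):
    """Return one dict per 'aborting command due to timeout' record."""
    found = []
    for record in unwrap(text):
        m = TIMEOUT_RE.search(record)
        if m:
            pid, host, _chan, target, _lun = (int(g) for g in m.groups()[:5])
            lba, blocks = decode_write10(m.group(6))
            found.append({"pid": pid, "host": host, "id": target,
                          "lba": lba, "blocks": blocks})
    return found


def per_target(text):
    """Count timeouts per (host, target id) pair."""
    return Counter((t["host"], t["id"]) for t in timeouts(text))
```

Under that layout assumption, every timed-out write shown above is 8 blocks long, and pids 4275 (scsi0, id 1) and 4277 (scsi1, id 4) carry the same block address, 0x00490490.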
I have also received these two kernel-killing messages:
end_scsi_request: buffer-list destroyed
.
.
.
Kernel panic: Inactive in scsi_request_queueable
