Hello all,
We've come across a rather annoying behavior of systems hanging due to SCSI
activity on BusLogic and Adaptec cards. I originally thought it was excessive
heat, but now I'm not so sure. The systems are dual 400 MHz PII on the Intel
NightShade board running the 2.0.36 kernel. Has anyone else seen this problem
and come up with a solution? See the forwarded email below for more info.
I've also run into the problem of not being able to disable devices via the
SSU (Intel's System Setup Utility), nor can I effectively change the IRQs of
devices with the same utility. Has anyone had success with this thing?
Thanks for the info,
Dan
___________________________________________________________________________
Dan Yocum | Phone: (630) 840-8525
Linux/Unix System Administrator | Fax: (630) 840-6345
Computing Division OSS/FSS | email: [EMAIL PROTECTED] .~. L
Fermi National Accelerator Lab | WWW: www-oss.fnal.gov/~yocum/ /V\ I
P.O. Box 500 | // \\ N
Batavia, IL 60510 | "TANSTAAFL" /( )\ U
________________________________|_________________________________ ^`~'^__X_
------- Forwarded Message
Date: Wed, 07 Apr 1999 10:23:52 -0500 (CDT)
From: Don Holmgren <[EMAIL PROTECTED]>
Subject: Re: Dual CPU Comark systems hanging in e831...
In-reply-to: <[EMAIL PROTECTED]>
To: "Harry W. K. Cheung" <[EMAIL PROTECTED]>
Cc: Dan Yocum <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Hi -
We've been using systems based on the Nightshade motherboard fairly
heavily to read/write tape drives on the EMASS robot and have run into
some hangs as well. I originally suspected the Buslogic BT958D cards we
were using, but now we're seeing the same hangs with Adaptec 2944 cards.
We have also used external disks on an Adaptec 2940U2W card, but have
seen no hangs. I'll exercise these heavily and see if I can provoke a
problem.
Before seeing your mails I was fairly certain the problem with our tapes
was related to the various SCSI layers in the kernel. I suppose there
could be hardware problems with these systems, and note that our common
experience is problems on external scsi buses.
After the next hang, could you do the following:
'cat /proc/scsi/BusLogic/2'
(that "2" might be another number) and send me the results? What I see
on our systems in this section:
                     DATA TRANSFER STATISTICS
Target  Tagged Queuing  Queue Depth  Active  Attempted  Completed
======  ==============  ===========  ======  =========  =========
   3    Not Supported        3          0      42181      42181
   4    Not Supported        3          0      27513      27513
   5    Not Supported        3          0      21025      21025
is that after a hang the "Attempted" numbers are 1 greater than the
"Completed" for one or more targets - the interface, the kernel, or an
external device is hanging on a command. Again, since I get similar
behaviour on Adaptec cards (without the nice reports in /proc/scsi), I
suspect a bug in the kernel SCSI layers above the driver.
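The check described above can be scripted. What follows is only a minimal
sketch, assuming the statistics rows keep the six-column layout shown in the
table; the names stuck_targets, ROW and SAMPLE are made up for illustration,
and the last row of SAMPLE is invented data (not from a real hang) so that
the mismatch case is exercised.

```python
import re

# Hypothetical sample text. The first two rows match the table above; the
# last row is invented to show a target with one command still outstanding.
SAMPLE = """\
Target  Tagged Queuing  Queue Depth  Active  Attempted  Completed
======  ==============  ===========  ======  =========  =========
   3    Not Supported        3          0      42181      42181
   4    Not Supported        3          0      27513      27513
   5    Not Supported        3          1      21026      21025
"""

# A data row: target id, free-text tagged-queuing column, then four integers
# (queue depth, active, attempted, completed). This layout is an assumption.
ROW = re.compile(r"^\s*(\d+)\s+(.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def stuck_targets(text):
    """Return (target, attempted, completed) for rows where the counts differ."""
    stuck = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator, or unrelated line
        target, _tagged, _depth, _active, attempted, completed = m.groups()
        if int(attempted) != int(completed):
            stuck.append((int(target), int(attempted), int(completed)))
    return stuck


if __name__ == "__main__":
    for target, attempted, completed in stuck_targets(SAMPLE):
        print("target %d: attempted=%d completed=%d" % (target, attempted, completed))
```

To use it on a live system, the text read from the /proc/scsi file would be
passed to stuck_targets in place of SAMPLE.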
You should also check the temperatures of your systems. Fetch the
following file:
ftp://linux-rep.fnal.gov/pub/ipmi/sdrread
and run it as root (chmod +x first). It prints out a bunch of
information, but at the bottom will be a section like:
sdr 0: -12V: -12.21 C0
sdr 1: Proc1-VID: 4.83
sdr 2: Proc2-VID: 4.83
sdr 3: BB Temp1: 41.00 C0
sdr 4: BB Temp2: 38.00 C0
sdr 5: CPU1 Temp: 41.00 C0
sdr 6: CPU2 Temp: 39.00 C0
sdr 7: CPU Fan1: 4650.00 C0
sdr 8: CPU Fan2: 4800.00 C0
sdr 9: 5V: 5.24 C0
sdr 10: -5V: -5.05 C0
sdr 11: 12V: 11.90 C0
sdr 12: 3.3V: 3.37 C0
sdr 13: CPU1 Voltage: 2.02 C0
sdr 14: CPU2 Voltage: 2.03 C0
sdr 15: 2.5V: 2.53 C0
sdr 16: 1.5V: 1.51 C0
sdr 17: SCWA Term1: 2.88 C0
sdr 18: SCWA Term2: 2.87 C0
sdr 19: SCWA Term3: 2.85 C0
sdr 20: SCNA Term1: 2.85 C0
sdr 21: 5V_stndby: 5.00 C0
sdr 22: Proc1 Stat: 80 80
sdr 23: BMC-FP-NMI: 00 80
sdr 24: BMC-Watchdog: 00 80
sdr 25: Proc2 Stat: 80 80
sdr 26: DIMM1 Pres: C1
sdr 27: DIMM2 Pres: C0
sdr 28: DIMM3 Pres: C0
sdr 29: DIMM4 Pres: C0
sdr 30: Post Error: C0
sdr 31: Chasis Intruid: C1
sdr 32: not a sensor
The BB and CPU temperatures should be reasonable - not over 70 certainly
(I'll look up the accepted range for Intel). We've had some systems
under load with higher temperatures - reseating the CPU fans fixed these.
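Picking the temperature lines out of that listing can also be scripted. This
is only a rough sketch, assuming the "sdr N: name: value" line format shown
above; hot_sensors, LINE and SAMPLE are illustrative names, the limit of 70
is just the figure mentioned above rather than an Intel specification, and
treating any sensor with "Temp" in its name as a temperature is a guess.

```python
import re

# A few lines copied from the listing above, used as sample input.
SAMPLE = """\
sdr 3: BB Temp1: 41.00 C0
sdr 4: BB Temp2: 38.00 C0
sdr 5: CPU1 Temp: 41.00 C0
sdr 6: CPU2 Temp: 39.00 C0
sdr 7: CPU Fan1: 4650.00 C0
sdr 26: DIMM1 Pres: C1
"""

# "sdr <index>: <sensor name>: <numeric reading>"; lines without a numeric
# reading (for example the DIMM presence lines) simply do not match.
LINE = re.compile(r"^sdr\s+\d+:\s+(.+?):\s+(-?\d+(?:\.\d+)?)\b")


def hot_sensors(text, limit=70.0):
    """Return (name, reading) for temperature sensors reading above limit."""
    hot = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        name, reading = m.group(1), float(m.group(2))
        # Assumption: temperature sensors are the ones with "Temp" in the name.
        if "Temp" in name and reading > limit:
            hot.append((name, reading))
    return hot


if __name__ == "__main__":
    for name, reading in hot_sensors(SAMPLE):
        print("%s is at %.2f" % (name, reading))
```

With the sample readings nothing is over the limit, so the script prints
nothing; lowering the limit is an easy way to confirm the parsing works.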
Don Holmgren
On Tue, 6 Apr 1999, Harry W. K. Cheung wrote:
...
>
>
> The problem is that the PC will hang with an external SCSI disk's
> activity light on. Or, in the one system that has an internal SCSI 8mm
> Eliant 820 tape drive, it has also hung with the SCSI activity light lit
> on the tape drive. In this case I can get into a console and log on as
> root to the IDE drive. However, I cannot shut down the system, as it
> tries to do a SCSI bus reset, fails, aborts, and keeps going like that.
> In other hangs with the external drive the PC does not respond at all.
>
> All three systems ran okay for a long time with just Monte Carlo jobs that
> do not access the disk much. Currently we are running data analysis jobs
> that access the disk more, across the network (NFS) and locally. Sometimes
> we spool data off tape using the tape drive across the network. It doesn't
> have to have many jobs running nor is the disk access that heavy when the
> systems hang. It's random in that I cannot predictably cause them to hang.
> Two of the systems each hung once today (so far!), so it does happen
> often.
>
> I tried changing SCSI cables to much shorter ones in two of the systems.
> The third system has only one external SCSI disk on a 2 foot cable. All
> systems still hung as before. In case it helps the systems are given below:
>
> all three systems have:
>
> BXN440BX Night shade Motherboard with dual 400 MHz PII
> supposedly including heatsink and active fan though I haven't looked inside
> 80mm DC fan
> st38641A internal Seagate IDE drive
> qm309100td-s internal Quantum SCSI drive
> xm6201b-s internal SCSI CDROM drive
> bt-958 Buslogic wide SCSI controller
> 3c900-com-in 3COM 3c900 combo ethernet card
>
> one of the systems has an internal 8mm SCSI drive plus 3 external SCSI disks
> the second system has an external SCSI disk and a SCSI (narrow) scanner
> and the third system has an external SCSI disk only.
------- End of Forwarded Message
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]