Transient disk failures not retried?

Max TenEyck Woodbury Mon, 15 Jan 2001 10:07:29 -0800
I have an APX system that I am using as a file server.
The data drive is a RAID5 array of three 36 GB 10K rpm
drives plus a 36 GB 7200 rpm spare drive. The drives are
in a separate enclosure with extra fans, and each drive
is in a swap tray with a temperature monitor and extra
fan. (I think the disk failure rate in this building
is a bit higher than it should be and one of the causes
may have been drive overheating.) The machine has its
own UPS. (We installed a UPS on every machine in the
building because the disk failure rate was definitely
too high at that point. Power company changes also
improved building power considerably about two years
ago.)

From time to time (about 3 times a week) one of the
drives (always the same drive) gets kicked out of the
array. Originally, this brought in a spare drive. The
reconstruction used to cause a massive flood of bus
errors and hang the system. As a result, I took the
spare off-line.

The SCSI bus is LVD with active termination. One terminator
is provided by the Symbios 53c895 card and the other is
on the port where the SCSI bus exits the drive enclosure.
Internal reflections have been minimized by wrapping
the ribbon cable in the enclosure in bubble wrap and
placing another layer of bubble wrap on any flat 
metallic surfaces where the cable could make contact.
(This was done when the enclosure was first installed.)

A problem with the swap trays was found. The 10K
drives are barely small enough to fit in the trays.
On another system, the contact between the bottom of
the swap tray and the circuit board would cause bus
errors, sometimes on the drive with the problem and
sometimes on other drives. A single layer of tape on
the inside of the bottom of the swap tray solved that
problem. I was finally able to take the server down
and apply the same treatment to it at the end of last
week. This eliminated the avalanche of bus errors, but
the spare still gets kicked out before the 
reconstruction can finish.

System software is Red Hat Linux 6.2 for Alpha.
Host interface is Symbios 53c895. (There is also an
Adaptec AHA-294x/AIC-7871 host for the system and
CD drives.) Primary drives are 3 Quantum ATLAS 10K
36WLS. Spare drive is a Quantum ATLAS V 36WLS.

My impression is that no error recovery is being tried
for transient failures. Could someone confirm this?
If this is correct, what changes need to be made
to correct the problem?
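
To make the question concrete, the kind of recovery being asked
about could be sketched as below. This is only an illustration
of a bounded retry before a drive is declared failed. It is not
a description of what the Linux RAID or SCSI code actually does,
and the names TransientIOError, read_with_retry and FlakyDisk are
hypothetical.

```python
import time


class TransientIOError(Exception):
    """Hypothetical: an I/O error that may succeed if retried."""


def read_with_retry(read_block, block, retries=3, delay=0.0):
    """Call read_block(block), retrying on TransientIOError.

    Returns the data on success. Re-raises the last error once
    `retries` retries have been used up, which is the point at
    which a RAID layer would mark the drive as failed.
    """
    attempt = 0
    while True:
        try:
            return read_block(block)
        except TransientIOError:
            attempt += 1
            if attempt > retries:
                raise
            time.sleep(delay)  # optional pause before the next try


class FlakyDisk:
    """Hypothetical test double: the first `failures` reads fail."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read(self, block):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientIOError("bus error on block %d" % block)
        return b"data-%d" % block
```

With this sketch, a drive that fails twice and then succeeds
stays in service, while a drive that keeps failing is still
reported after the retry budget is spent.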

(No, a hardware RAID solution is NOT an option. I
have been told that all the drives in a hardware
RAID system have to be functionally equivalent
right down to the same revision of the firmware.
Since it was difficult to get authorization for
the system in the first place, I could NOT get 
more than three drives to start with. The fact
that soft-RAID could handle a mixture of different
drives was the factor that allowed the project
to be implemented at all.)
