Dan Jones wrote:
>
> Max TenEyck Woodbury wrote:
>> ...
>>
>> My impression is that no error recovery is being tried
>> for transient failures. Could someone confirm this?
>> If this is correct, what changes are need to be made
>> to correct the problem?
>>
>
> Well, as I imagine you already know, your basic SCSI
> setup is marginal.
OK. I said there was a problem myself, but I also said
that it seems to have been fixed or at least improved
recently...
> I base this on your recitation of problems i.e. both
> the spare and drives being taken offline and the fact you are
> using drive trays. In fact, the trays are probably causing
> most (all?) of your problems. Do you really need them?
Sorry, but the system would not even be running without the drive
trays. Without the extra fans (and temperature monitors), the
drives run HOT (> 45 C). With the trays, the monitors show the
drives running within temperature specs. (29, 35, 36.5 and 37
at this moment; all less than specs limit of 40.) I could just
use extra fans, but how would I know if they were effective
without the monitors and air plenum.
> If you really need them, it is possible that reducing the maximum
> SCSI transfer rate might help in your case.
Might be OK, but it doesn't seem to be the real problem. How?
> But, this is all speculation. What are the SCSI errors that cause
> the drive(s) to be taken offline? Can't really tell what is going
> on without the SCSI status.
Jan 16 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0,
CDB: Read (10) 00 04 42 00 90 00 00 08 00
Jan 16 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21:
sense key Medium Error
Jan 16 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Jan 16 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Which tells me that this is a drive problem, not a bus problem.
Ten years of experience with hard drives and SCSI and the behavior
of this particular drive tells me that this is a marginal drive that
needs to be watched like a hawk, but it has not yet turned the corner
and become unusable and may not do so for weeks or years.
The RAID driver then kicks the drive out of the array rather than
instituting normal read recovery procedures, which is what I complained
about.
Jan 16 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device.
Operation continuing on 2 devices
Jan 16 04:03:37 oscar kernel: md: updating md0 RAID superblock on device
Jan 16 04:03:37 oscar kernel: sdf1 [events: 0000006e](write)
sdf1's sb offset: 35839872
Jan 16 04:03:37 oscar kernel: md: recovery thread got woken up ...
Jan 16 04:03:37 oscar kernel: md0: resyncing spare disk sdf1 to
replace failed disk
Jan 16 04:03:37 oscar kernel: RAID5 conf printout:
...
After that there seems to be a sym53c895 driver problem with a
lot of messages about aborted commands and register dumps that
might be a bus problem but looks more like a SNAFU in the low and
mid-level driver error handling...
Jan 16 04:03:37 oscar kernel: raid5: restarting stripe 71434352
Jan 16 04:03:37 oscar kernel: sym53c895-0:3: ERROR (0:18) (0-20-0) (1f/9f)
@ (script bc0:18000a00).
Jan 16 04:03:37 oscar kernel: sym53c895-0: script cmd = 18000000
Jan 16 04:03:37 oscar kernel: sym53c895-0:
regdump: da 10 80 9f 47 1f 03 0b 56 00 83 20 80 01 08 01.
Jan 16 04:03:37 oscar kernel: sym53c895-0: resetting, command processing
suspended for 2 seconds
Jan 16 04:03:37 oscar kernel: sym53c895-0: restart (scsi reset).
Jan 16 04:03:37 oscar kernel: sym53c895-0: enabling clock multiplier
Jan 16 04:03:37 oscar kernel: sym53c895-0: Downloading SCSI SCRIPTS.
Jan 16 04:03:40 oscar kernel: sym53c895-0: command processing resumed
Jan 16 04:03:40 oscar kernel: sym53c895-0-<3,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:40 oscar kernel: sym53c895-0-<2,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:40 oscar kernel: sym53c895-0-<1,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:40 oscar kernel: sym53c895-0-<2,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:40 oscar kernel: sym53c895-0-<1,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:40 oscar kernel: sym53c895-0-<3,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:40 oscar kernel: sym53c895-0:3: ERROR (0:48) (0-20-0)
(1f/9f) @ (script be0:18000600).
Jan 16 04:03:40 oscar kernel: sym53c895-0: script cmd = 18000000
Jan 16 04:03:40 oscar kernel: sym53c895-0:
regdump: da 10 80 9f 47 1f 03 0b 35 00 81 20 80 01 08 09.
Jan 16 04:03:40 oscar kernel: sym53c895-0: resetting, command processing
suspended for 2 seconds
Jan 16 04:03:40 oscar kernel: sym53c895-0: restart (scsi reset).
Jan 16 04:03:40 oscar kernel: sym53c895-0: enabling clock multiplier
Jan 16 04:03:40 oscar kernel: sym53c895-0: Downloading SCSI SCRIPTS.
Jan 16 04:03:43 oscar kernel: sym53c895-0: command processing resumed
Jan 16 04:03:43 oscar kernel: sym53c895-0-<3,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:43 oscar kernel: sym53c895-0-<2,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:43 oscar kernel: sym53c895-0-<1,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:43 oscar kernel: sym53c895-0-<3,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:43 oscar kernel: sym53c895-0-<2,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:43 oscar kernel: sym53c895-0-<1,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:47 oscar kernel: scsi1 channel 0 :
resetting for second half of retries.
Jan 16 04:03:47 oscar kernel: SCSI bus is being reset for host 1 channel 0.
Jan 16 04:03:47 oscar kernel: sym53c8xx_reset: pid=7654664 reset_flags=1
serial_number=0 serial_number_at_timeout=0
Jan 16 04:03:47 oscar kernel: sym53c895-0: resetting, command processing
suspended for 2 seconds
Jan 16 04:03:47 oscar kernel: scsi1: device driver called scsi_done()
for a syncronous reset.
Jan 16 04:03:47 oscar kernel: sym53c895-0: restart (scsi reset).
Jan 16 04:03:47 oscar kernel: sym53c895-0: enabling clock multiplier
Jan 16 04:03:47 oscar kernel: sym53c895-0: Downloading SCSI SCRIPTS.
Jan 16 04:03:47 oscar kernel: sym53c895-0: command processing resumed
Jan 16 04:03:47 oscar kernel: sym53c895-0-<3,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:47 oscar kernel: sym53c895-0-<3,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:47 oscar kernel: sym53c895-0-<1,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:47 oscar kernel: sym53c895-0-<2,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:03:47 oscar kernel: sym53c895-0-<1,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:47 oscar kernel: sym53c895-0-<2,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:03:47 oscar kernel: SCSI disk error : host 1 channel 0 id 3 lun 0
return code = 18000002
Jan 16 04:03:47 oscar kernel: Info fld=0x60, Current sd08:51: sense key
Aborted Command
Jan 16 04:03:47 oscar kernel: Additional sense indicates Scsi parity error
Jan 16 04:03:47 oscar kernel: scsidisk I/O error: dev 08:51, sector 64
Jan 16 04:03:47 oscar kernel: interrupting MD-thread pid 6
Jan 16 04:03:47 oscar kernel: raid5: Disk failure on spare sdf1
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654761, scsi1, channel 0, id 3, lun 0 Write (6) 00 0a a0 80 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654761 serial_number=7654872
serial_number_at_timeout=7654872
Jan 16 04:04:17 oscar kernel: sym53c895-0: abort ccb=fffffc0003aaa800 (cancel)
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654760, scsi1, channel 0, id 3, lun 0 Write (6) 00 0a 20 80 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654760 serial_number=7654871
serial_number_at_timeout=7654871
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654759, scsi1, channel 0, id 3, lun 0 Write (6) 00 09 a0 80 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654759 serial_number=7654870
serial_number_at_timeout=7654870
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654758, scsi1, channel 0, id 3, lun 0 Write (6) 00 09 20 80 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654758 serial_number=7654869
serial_number_at_timeout=7654869
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654757, scsi1, channel 0, id 3, lun 0 Write (6) 00 08 a0 80 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654757 serial_number=7654868
serial_number_at_timeout=7654868
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654767, scsi1, channel 0, id 3, lun 0 Write (6) 00 0b 20 30 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654767 serial_number=7654878
serial_number_at_timeout=7654878
Jan 16 04:04:17 oscar kernel: sym53c895-0: abort ccb=fffffc00002eb000 (cancel)
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654756, scsi1, channel 0, id 3, lun 0 Write (6) 00 08 20 80 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654756 serial_number=7654895
serial_number_at_timeout=7654895
Jan 16 04:04:17 oscar kernel: sym53c895-0: abort ccb=fffffc0003aa4800 (cancel)
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654785, scsi1, channel 0, id 2, lun 0 Read (10) 00 03 5c 0e 10 00 00 08 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654785 serial_number=7654897
serial_number_at_timeout=7654897
Jan 16 04:04:17 oscar kernel: sym53c895-0: abort ccb=fffffc0002e37800 (cancel)
Jan 16 04:04:17 oscar kernel: scsi : aborting command due to timeout :
pid 7654784, scsi1, channel 0, id 1, lun 0 Read (10) 00 03 5c 0e 10 00 00 08 00
Jan 16 04:04:17 oscar kernel: sym53c8xx_abort: pid=7654784 serial_number=7654896
serial_number_at_timeout=7654896
Jan 16 04:04:17 oscar kernel: sym53c895-0: abort ccb=fffffc0002226000 (cancel)
Jan 16 04:04:20 oscar kernel: SCSI host 1 abort (pid 7654761) timed out - resetting
Jan 16 04:04:20 oscar kernel: SCSI bus is being reset for host 1 channel 0.
Jan 16 04:04:20 oscar kernel: sym53c8xx_reset: pid=7654761 reset_flags=2
serial_number=7654872 serial_number_at_timeout=7654872
Jan 16 04:04:20 oscar kernel: sym53c895-0: resetting, command processing
suspended for 2 seconds
Jan 16 04:04:20 oscar kernel: sym53c895-0: restart (scsi reset).
Jan 16 04:04:20 oscar kernel: sym53c895-0: enabling clock multiplier
Jan 16 04:04:20 oscar kernel: sym53c895-0: Downloading SCSI SCRIPTS.
Jan 16 04:04:20 oscar kernel: sym53c895-0: command processing resumed
Jan 16 04:04:20 oscar kernel: sym53c895-0-<3,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:04:20 oscar kernel: sym53c895-0-<1,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:04:20 oscar kernel: sym53c895-0-<2,*>: WIDE SCSI (16 bit) enabled.
Jan 16 04:04:20 oscar kernel: sym53c895-0-<3,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:04:20 oscar kernel: sym53c895-0-<1,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:04:20 oscar kernel: sym53c895-0-<2,*>: FAST-40 WIDE SCSI 80.0 MB/s
(25 ns, offset 31)
Jan 16 04:04:20 oscar kernel: SCSI disk error : host 1 channel 0 id 3 lun 0
return code = 18000002
Jan 16 04:04:20 oscar kernel: Info fld=0xa20, Current sd08:51: sense key
Aborted Command
Jan 16 04:04:20 oscar kernel: Additional sense indicates Scsi parity error
Jan 16 04:04:20 oscar kernel: scsidisk I/O error: dev 08:51, sector 2560
Followed by a bug report from RAID...
Jan 16 04:04:20 oscar kernel: interrupting MD-thread pid 6
Jan 16 04:04:20 oscar kernel: md: bug in file raid5.c, line 659
Jan 16 04:04:20 oscar kernel:
Jan 16 04:04:20 oscar kernel: **********************************
Jan 16 04:04:20 oscar kernel: * <COMPLETE RAID STATE PRINTOUT> *
Jan 16 04:04:20 oscar kernel: **********************************
Jan 16 04:04:20 oscar kernel: md0: <sdf1><sdc1><sde1><sdd1> array superblock:
Jan 16 04:04:20 oscar kernel: SB: (V:0.90.0) ID:<a0a3d082.d9b7f185.6121a0f1.0a06a2d3>
CT:3a39677e
Jan 16 04:04:20 oscar kernel: L5 S35839872 ND:5 RD:3 md0 LO:2 CS:131072
Jan 16 04:04:20 oscar kernel: UT:3a640e69 ST:0 AD:2 WD:2 FD:3 SD:0 CSUM:00007310
E:0000006e
Jan 16 04:04:20 oscar kernel: D 0: DISK<N:0,sdc1(8,33),R:0,S:1>
Jan 16 04:04:20 oscar kernel: D 1: DISK<N:1,sdd1(8,49),R:1,S:6>
Jan 16 04:04:20 oscar kernel: D 2: DISK<N:2,sde1(8,65),R:2,S:6>
Jan 16 04:04:20 oscar kernel: D 3: DISK<N:3,[dev 00:00](0,0),R:3,S:9>
Jan 16 04:04:20 oscar kernel: D 4: DISK<N:4,sdf1(8,81),R:4,S:1>
Jan 16 04:04:20 oscar kernel: D 5: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:20 oscar kernel: D 6: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:20 oscar kernel: D 7: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:20 oscar kernel: D 8: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:20 oscar kernel: D 9: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 10: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 11: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: THIS: DISK<N:2,sde1(8,65),R:2,S:6>
Jan 16 04:04:22 oscar kernel: rdev sdf1: O:sdf1, SZ:00000000 F:1 DN:4 no rdev
superblock!
Jan 16 04:04:22 oscar kernel: rdev sdc1: O:sdc1, SZ:00000000 F:1 DN:0 no rdev
superblock!
Jan 16 04:04:22 oscar kernel: rdev sde1: O:sde1, SZ:35839872 F:0 DN:2 rdev
superblock:
Jan 16 04:04:22 oscar kernel: SB: (V:0.90.0) ID:<a0a3d082.d9b7f185.6121a0f1.0a06a2d3>
CT:3a39677e
Jan 16 04:04:22 oscar kernel: L5 S35839872 ND:5 RD:3 md0 LO:2 CS:131072
Jan 16 04:04:22 oscar kernel: UT:3a640e69 ST:0 AD:2 WD:3 FD:2 SD:1 CSUM:0000b202
E:0000006e
Jan 16 04:04:22 oscar kernel: D 0: DISK<N:0,sdc1(8,33),R:0,S:1>
Jan 16 04:04:22 oscar kernel: D 1: DISK<N:1,sdd1(8,49),R:1,S:6>
Jan 16 04:04:22 oscar kernel: D 2: DISK<N:2,sde1(8,65),R:2,S:6>
Jan 16 04:04:22 oscar kernel: D 3: DISK<N:3,[dev 00:00](0,0),R:3,S:9>
Jan 16 04:04:22 oscar kernel: D 4: DISK<N:4,sdf1(8,81),R:4,S:0>
Jan 16 04:04:22 oscar kernel: D 5: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 6: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 7: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 8: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 9: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 10: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: D 11: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:22 oscar kernel: THIS: DISK<N:2,sde1(8,65),R:2,S:6>
Jan 16 04:04:22 oscar kernel: rdev sdd1: O:sdd1, SZ:35839872 F:0 DN:1 rdev
superblock:
Jan 16 04:04:22 oscar kernel: SB: (V:0.90.0) ID:<a0a3d082.d9b7f185.6121a0f1.0a06a2d3>
CT:3a39677e
Jan 16 04:04:24 oscar kernel: L5 S35839872 ND:5 RD:3 md0 LO:2 CS:131072
Jan 16 04:04:24 oscar kernel: UT:3a640e69 ST:0 AD:2 WD:3 FD:2 SD:1 CSUM:0000b1f0
E:0000006e
Jan 16 04:04:24 oscar kernel: D 0: DISK<N:0,sdc1(8,33),R:0,S:1>
Jan 16 04:04:24 oscar kernel: D 1: DISK<N:1,sdd1(8,49),R:1,S:6>
Jan 16 04:04:24 oscar kernel: D 2: DISK<N:2,sde1(8,65),R:2,S:6>
Jan 16 04:04:24 oscar kernel: D 3: DISK<N:3,[dev 00:00](0,0),R:3,S:9>
Jan 16 04:04:24 oscar kernel: D 4: DISK<N:4,sdf1(8,81),R:4,S:0>
Jan 16 04:04:24 oscar kernel: D 5: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: D 6: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: D 7: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: D 8: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: D 9: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: D 10: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: D 11: DISK<N:0,[dev 00:00](0,0),R:0,S:0>
Jan 16 04:04:24 oscar kernel: THIS: DISK<N:1,sdd1(8,49),R:1,S:6>
Jan 16 04:04:24 oscar kernel: **********************************
Jan 16 04:04:24 oscar kernel:
Jan 16 04:04:24 oscar kernel: <SPARE FAILED!>
Jan 16 04:04:24 oscar kernel: <6>md0: spare disk sdf1 failed, skipping to next spare.
...
> ...
>
> I don't think that a hardware RAID would help in this case,
> although in a marginal system any change can have surprising
> effects. I take the same line regarding mixing drives as the
> RAID vendors, but it is more of a configuration matrix problem
> than one of functionality. I just can't test every possible
> combination, so I can't make a blanket recommendation except to
> recommend against. Having said that, I have never experienced
> any problems mixing drives on Mylex RAID controllers.
At present, the three drives that make up the array are
identical with the spare being just slightly smaller and
slower. The problem is that I may need to replace a drive
a year or two into the arrays lifetime and identical drives
may no longer be available. The local rep for the hardware
RAID vendor told me that all the drives had to be the same
or the warantee was void. He may have been talking through
his hat, but the lack of support was enough to throw the
discussion to the software RAID solution.
However, these are all side issues to the question about
attempting the normal retry process before kicking the drive
out of the array.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]