Re: Repeated errors in RAID5 set.

Gérard Roudier Thu, 08 Mar 2001 12:57:49 -0800


On Thu, 8 Mar 2001, Max TenEyck Woodbury wrote:

> I brought this up on the raid list some time ago and got a less
> than completely helpful response. I concluded that more information
> was needed before I asked the question again.
> 
> Problem:
> 
> I have an Alpha running Red Hat Linux 6.2 (Kernel 2.2-14) with
> two SCSI adapters, an AHA-294X and a sym53c895. The trouble is
> associated with the sym53c895. On its LVD bus, there are 4 disks:
> 
> Host: scsi1 Channel: 00 Id: 00 Lun: 00
>   Vendor: QUANTUM  Model: ATLAS 10K 36WLS  Rev: UCP0
>   Type:   Direct-Access                    ANSI SCSI revision: 03
> Host: scsi1 Channel: 00 Id: 01 Lun: 00
>   Vendor: QUANTUM  Model: ATLAS 10K 36WLS  Rev: UCP0
>   Type:   Direct-Access                    ANSI SCSI revision: 03
> Host: scsi1 Channel: 00 Id: 02 Lun: 00
>   Vendor: QUANTUM  Model: ATLAS 10K 36WLS  Rev: UCP0
>   Type:   Direct-Access                    ANSI SCSI revision: 03
> Host: scsi1 Channel: 00 Id: 03 Lun: 00
>   Vendor: SEAGATE  Model: ST336704LW       Rev: 0004
>   Type:   Direct-Access                    ANSI SCSI revision: 03
> 
> Each has one 36 GB partition accessible as /dev/sdc1-/dev/sdf1.
> The first three have been configured with RAID5 into a 72 GB device
> /dev/md0 and initialized with ext2 into a file system. At odd 
> intervals, but always shortly after 04:03:35 in the morning an 
> error occurs on sector 71434352 of the disk /dev/sdc1. (See log 
> extracts later in this text.) /dev/sdc1 is then kicked out of the 
> RAID5 set until I come in and raidhotremove/raidhotadd it back in. 
> The reinsertion always succeeds without error.
> 
> This brings up two questions. The more important one is:
> 
> Why is the device being kicked out of the RAID set (other than
> the obvious answer that that is the way the code is written)
> without any real attempt at error recovery? At the least, the 
> read should be retried once, and that does not seem to be happening. 
> Further, since this is a RAID5 set, the sector can be recovered 
> from the other members of the set and rewritten on the original 
> disk. (This happens as part of the normal recovery process and 
> the indications are that it always succeeds.) This is NOT happening 
> as a part of the normal recovery process. (There was another 
> message in the RAID list some time ago that indicated that 
> writes were not retried either and that they should be.) I
> can see that some kinds of error require that a member be removed
> immediately from the RAID set, but this is not that kind of error
> in my opinion.
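
To make the recovery argument above concrete, here is a minimal sketch of
how a chunk on one member of a three-disk RAID5 stripe can be rebuilt from
the other two by XOR. The chunk contents and names are made up for
illustration; this shows only the parity arithmetic, not the md driver's
actual code path.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 8-byte chunks standing in for one stripe of a three-disk set.
data_a = b"ABCDEFGH"                 # chunk on the first member
data_b = b"12345678"                 # chunk on the second member
parity = xor_blocks(data_a, data_b)  # parity chunk on the third member

# Pretend the first member returned a medium error: rebuild its chunk
# from the two surviving members.
rebuilt_a = xor_blocks(data_b, parity)
print(rebuilt_a == data_a)  # True
```

Whether the rebuilt chunk is then written back to the failing member is a
policy decision of the RAID code, which is the point of the question.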
> 
> The less important question is:
> 
> Why is this particular pattern of errors occurring? It is odd in
> at least two respects: It happens at the same clock time and is
> always the same block. Real disk errors do not usually happen on
> such a regular schedule and tend to include more and more different
> blocks over time. Also, as mentioned above, the block in question 
> is being rewritten regularly as part of the RAID set reconstruction. 
> If it were a real error, the drive would have reassigned the block 
> and the error would either not recur, or would move around. Since 
> it is not being reassigned, the drive must not see it as a real 
> error. So, does anybody have a suggestion about what is really 
> going on?

As for the error always happening at the same time of day, I would suggest
you have a look at the crontab of your system. A job scheduled just before
04:03 is presumably what reads that block.
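
As a rough illustration of that check, the sketch below scans a system
crontab for jobs with a fixed start time shortly before the error. The
sample text is an assumption modelled on a typical Red Hat style
/etc/crontab, not taken from the poster's machine, and the function name
is made up.

```python
# Assumed sample; on a real system read /etc/crontab instead.
SAMPLE_CRONTAB = """\
SHELL=/bin/bash
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly
"""

def jobs_starting_before(crontab_text, hour, minute, window=5):
    """List (start time, command) for jobs starting at most `window`
    minutes before hour:minute. Only entries with a plain numeric minute
    and hour are considered; wildcard entries run every hour and cannot
    by themselves explain a once-a-day error. The day-of-month, month and
    day-of-week fields are ignored."""
    target = hour * 60 + minute
    hits = []
    for line in crontab_text.splitlines():
        # System crontab: minute hour dom month dow user command
        fields = line.split(None, 6)
        if len(fields) < 7 or line.lstrip().startswith("#"):
            continue
        m, h = fields[0], fields[1]
        if not (m.isdigit() and h.isdigit()):
            continue
        start = int(h) * 60 + int(m)
        if 0 <= target - start <= window:
            hits.append((f"{int(h):02d}:{int(m):02d}", fields[6]))
    return hits

print(jobs_starting_before(SAMPLE_CRONTAB, 4, 3))
# [('04:02', 'run-parts /etc/cron.daily')]
```

With a crontab like the sample, the daily jobs would be the first place to
look for something that walks the whole file system.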

If there is no real drive error here, then we may all just be dreaming,
in my opinion. :)

I suggest you check how the 'Read-Write Error Recovery' mode page is set
up on your disk. You must ensure that at least the ARRE (Automatic Read
Reallocation Enabled) bit is set. But this will not magically make your
disk reassign the faulty block. In fact, the disk will reassign the block
only if it has been able to recover from the error, and the behaviour is
governed by the other settings in the page. In no case will the drive
decide by itself to reassign the block and copy corrupted data to the new
block.
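
For reference, the recovery flags live in byte 2 of that mode page (page
code 01h). The sketch below only decodes that one byte; how you fetch the
page from the drive is left open, and the value 0xC0 is a hypothetical
example, not a reading from the poster's disk.

```python
# Bit layout of byte 2 of the Read-Write Error Recovery mode page,
# listed from bit 0 up to bit 7.
RECOVERY_BITS = ["DCR", "DTE", "PER", "EER", "RC", "TB", "ARRE", "AWRE"]

def decode_recovery_flags(byte2):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool(byte2 >> bit & 1) for bit, name in enumerate(RECOVERY_BITS)}

flags = decode_recovery_flags(0xC0)  # hypothetical value: AWRE and ARRE set
print(flags["ARRE"], flags["AWRE"], flags["DCR"])  # True True False
```

ARRE covers reallocation on reads and AWRE the same on writes; the other
bits control how recovery is attempted and reported.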

If you have time to play, you can try, for example, to read the block
using 'dd', expect to get the error, and then rewrite it immediately. This
has been reported to do the trick.
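
A sketch of what such a 'dd' pair might look like follows. It only builds
and prints the command lines, it does not run them. Assumptions: 512-byte
sectors, the sector number from the kernel log is relative to /dev/sdc1,
and zeros are acceptable as the data to rewrite (the text above does not
say what to write). The write destroys the contents of those sectors, so
on a RAID member it would only make sense while the disk is out of the set
and about to be resynchronized.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors

def dd_commands(device, first_sector, count):
    """Return (read test, zero rewrite) command lines for a sector range."""
    read_cmd = (f"dd if={device} of=/dev/null "
                f"bs={SECTOR_SIZE} skip={first_sector} count={count}")
    # DESTRUCTIVE: overwrites the given sectors with zeros.
    write_cmd = (f"dd if=/dev/zero of={device} "
                 f"bs={SECTOR_SIZE} seek={first_sector} count={count}")
    return read_cmd, write_cmd

# Sector and length taken from the logged request: 8 sectors at 71434352.
read_cmd, write_cmd = dd_commands("/dev/sdc1", 71434352, 8)
print(read_cmd)
print(write_cmd)
```

Note that 'skip' positions the input and 'seek' the output, so the two
commands address the same sectors.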

If you do not have time to play with all that stuff (btw, I do not have it
either), just try to reformat the drive. Just be aware that if the drive
is switched off during this operation, it may become unusable forever.

  Gérard.

> Feb  9 04:03:39 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb  9 04:03:39 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb  9 04:03:39 oscar kernel: Additional sense indicates Unrecovered read error
> Feb  9 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb  9 04:03:39 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Feb 15 04:03:42 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb 15 04:03:42 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb 15 04:03:42 oscar kernel: Additional sense indicates Unrecovered read error
> Feb 15 04:03:42 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb 15 04:03:42 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Feb 16 04:03:39 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb 16 04:03:39 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb 16 04:03:39 oscar kernel: Additional sense indicates Unrecovered read error
> Feb 16 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb 16 04:03:39 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Feb 18 04:03:40 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb 18 04:03:40 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb 18 04:03:40 oscar kernel: Additional sense indicates Unrecovered read error
> Feb 18 04:03:40 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb 18 04:03:40 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Feb 20 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb 20 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb 20 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
> Feb 20 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb 20 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Feb 22 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb 22 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb 22 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
> Feb 22 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb 22 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Feb 23 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Feb 23 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Feb 23 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
> Feb 23 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Feb 23 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Mar  1 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Mar  1 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Mar  1 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
> Mar  1 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Mar  1 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Mar  3 04:03:36 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Mar  3 04:03:36 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Mar  3 04:03:36 oscar kernel: Additional sense indicates Unrecovered read error
> Mar  3 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Mar  3 04:03:36 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Mar  5 04:03:36 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Mar  5 04:03:36 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Mar  5 04:03:36 oscar kernel: Additional sense indicates Unrecovered read error
> Mar  5 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Mar  5 04:03:36 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Mar  6 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Mar  6 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Mar  6 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
> Mar  6 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Mar  6 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Mar  7 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Mar  7 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Mar  7 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
> Mar  7 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Mar  7 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
> --
> Mar  8 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
> Mar  8 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
> Mar  8 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
> Mar  8 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
> Mar  8 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
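
For anyone reading the log extracts, the numbers in them can be
cross-checked with a little arithmetic. The sketch below decodes the nine
bytes printed after 'Read (10)' using the standard READ(10) layout (flags,
4-byte big-endian LBA, one reserved/group byte, 2-byte transfer length,
control) and compares the result with the Info field and the logged sector.
The function name is made up, and the reading of the 32-sector difference
as the start offset of sdc1 is an inference, not something the log states.

```python
def decode_read10(cdb_tail_hex):
    """Decode the nine bytes the kernel prints after 'Read (10)'.
    Returns (starting LBA, number of blocks)."""
    b = bytes.fromhex(cdb_tail_hex.replace(" ", ""))
    if len(b) != 9:
        raise ValueError("expected the 9 bytes that follow the opcode")
    lba = int.from_bytes(b[1:5], "big")     # bytes 2-5 of the full CDB
    blocks = int.from_bytes(b[6:8], "big")  # bytes 7-8 of the full CDB
    return lba, blocks

lba, blocks = decode_read10("00 04 42 00 90 00 00 08 00")
info_field = 0x4420097      # "Info fld" from the sense data
logged_sector = 71434352    # "sector" from the scsidisk message

print(lba, blocks)          # 71434384 8
print(info_field - lba)     # 7
print(lba - logged_sector)  # 32
```

So the request covers eight blocks and the Info field points at the last
of them, while the logged sector is 32 less than the LBA in the CDB, which
would fit a partition that starts 32 sectors into the disk.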

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]