Re: Rebalancing RAID1

Chris Murphy Wed, 13 Feb 2013 00:10:26 -0800

On Feb 12, 2013, at 11:18 PM, Fredrik Tolf <fred...@dolda2000.com> wrote:
> 
> 
>> smartctl -l scterc /dev/sdX
> 
> "Warning: device does not support SCT Error Recovery Control command"
> 
> Doesn't seem that way to me; partly because of the SMART data, and partly 
> because of the errors that were logged as the drive failed:
> 
> Feb 12 16:36:49 nerv kernel: [36769.546522] ata6.00: Ata error. fis:0x21
> Feb 12 16:36:49 nerv kernel: [36769.550454] ata6: SError: { Handshk }
> Feb 12 16:36:51 nerv kernel: [36769.554129] ata6.00: failed command: WRITE 
> FPDMA QUEUED
> Feb 12 16:36:51 nerv kernel: [36769.559375] ata6.00: cmd 
> 61/00:00:00:ec:2e/04:00:cd:00:00/40 tag 0 ncq 524288 out
> Feb 12 16:36:51 nerv kernel: [36769.559375]          res 
> 41/84:d0:00:98:2e/84:00:cd:00:00/40 Emask 0x10 (ATA bus error)
> Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
> Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }
> 
> That's not typical for actual media problems, in my experience. :)

Quite typical, because these drives don't support SCTERC, which almost certainly
means their internal error recovery timeout is well above the Linux SCSI layer's
command timer, which is 30 seconds. Their timeouts are likely around 2 minutes.
So in fact they never report back a URE, because the command timer expires first
and the kernel resets the drive.
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html

For your use case, I'd reject these drives and get WDC Red; reportedly even the
Hitachi Deskstars still have a settable SCTERC. Then set it to something like
70 deciseconds. If the drive's ECC hasn't recovered the sector within 7 seconds,
the drive will give up and report a read error with the problem LBA. Then btrfs
(or md) can recover the data from the other drive and cause the read error to
be fixed on the drive that reported it.
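
A rough sketch of that setting, for what it's worth. The helper name
erc_commands is made up for illustration, sdX is a placeholder, and the helper
only prints the commands instead of running them. smartctl takes the SCT ERC
read and write limits in deciseconds. The second command is an assumed fallback
for drives without SCTERC: it raises the kernel command timer instead, and the
180 second value is a guess, not a tested number.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands for one drive.
# $1 = device name such as sdX, $2 = ERC limit in seconds (default 7).
erc_commands() {
    dev=$1
    secs=${2:-7}
    ds=$((secs * 10))   # smartctl expects deciseconds
    # Drives with settable SCT ERC: read and write limits.
    echo "smartctl -l scterc,$ds,$ds /dev/$dev"
    # Assumed fallback for drives without SCT ERC: raise the kernel
    # command timer above the drive's own recovery time (180 s is a guess).
    echo "echo 180 > /sys/block/$dev/device/timeout"
}

erc_commands sdX 7
```

On many drives the SCT ERC setting does not survive a power cycle, so it
would normally be reapplied at boot.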

However, in your case, with both the kernel message ICRC ABRT, and the
following SMART entry, this is your cable problem. The ICRC and UDMA_CRC errors
are the same problem reported by the actors at each end of the cable.

/dev/hdi
Serial Number: WD-WMC1T1679668
199 UDMA_CRC_Error_Count 0x0032 200 192 000 Old_age Always - 91

So the question is whether the cable problem has actually been fixed, and if
you're still getting ICRC errors from the kernel. As this is hdi, I'm wondering
how many drives are connected, and if this could be power induced rather than
just cable induced. Once that's solved, you should do a scrub, rather than a
rebalance.
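
One way to check this, as a sketch only: note the raw UDMA_CRC_Error_Count
value, look at it again after some I/O, and scrub only once it has stopped
rising. The helper name crc_raw is made up, sdX and /mnt are placeholders, and
taking field 10 as the raw value assumes the usual smartctl -A column layout.

```shell
#!/bin/sh
# Read smartctl -A output on stdin and print the raw value of
# attribute 199 (UDMA_CRC_Error_Count). Assumes the raw value is field 10.
crc_raw() {
    awk '$1 == 199 { print $10 }'
}

# Self-contained check using the attribute line quoted above:
echo "199 UDMA_CRC_Error_Count 0x0032 200 192 000 Old_age Always - 91" | crc_raw

# On the real system (device and mount point are placeholders):
#   smartctl -A /dev/sdX | crc_raw   # note the value, compare again later
#   dmesg | grep -c ICRC             # count ICRC errors in the kernel log
#   btrfs scrub start /mnt           # once the count has stopped rising
#   btrfs scrub status /mnt
```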

Chris Murphy