Re: [CentOS] mdadm raid-check

2020-11-16 Thread Valeri Galtsev


> On Nov 16, 2020, at 2:48 AM, hw  wrote:
> 
> On Sat, 2020-11-14 at 21:55 -0600, Valeri Galtsev wrote:
>>> On Nov 14, 2020, at 8:20 PM, hw  wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> is it required to run /usr/sbin/raid-check once per week?  Centos 7 does
>>> this.  Maybe it's sufficient to run it monthly?  IIRC Debian did it monthly.
>> 
>> On hardware RAIDs I run RAID verification once a week. Once a month
>> is not often enough in my book. RAID verification effectively reads
>> all stripes of all drives (and verifies that the content of the
>> redundant drives is consistent), thus defusing a “time bomb”: a
>> drive left alone for too long, ready to fail in an area that is
>> never accessed, which then fails when some other drive is replaced
>> and the RAID rebuild has to go over all stripes of all drives. Such
>> “multiple failures” are due to poor sysadmin work: RAID
>> verification that is not done often enough.
> 
> You mean there can be failures which can be detected during a
> raid-check and can still be repaired using the other disk, but they
> can be impossible to repair when a disk has failed?

No, what I meant to say is: the errors could have been detected, and the drive 
would have been kicked out of the RAID (not the errors repaired) and replaced 
with a good drive long ago. But if the RAID is not checked often, there is a 
chance that more drives than the redundancy can cover have failed (in different 
areas) and are waiting to be kicked out, and when that happens the failure 
becomes fatal.

>> If software raid-check does the same, then it makes a lot of sense,
>> and I side more with Red Hat's weekly cron job than with Debian’s
>> monthly one.
> 
> How often do partial failures occur during normal operation?

I do not know what you mean by “partial failures”. I can imagine:

1. The checksum does not match, and there is no reason to suspect any 
particular drive of supplying the wrong information. If it is RAID-6, then on 
the assumption that only one drive provided wrong information, the wrong drive 
can be pinpointed and the stripe on it overwritten; the event is over without 
data being messed up. If it is RAID-5, there is no way to pinpoint the wrong 
drive. If your setting in the RAID firmware (I am speaking only about hardware 
RAIDs here) is to overwrite “parity”, there is a fair chance that the stripe on 
a drive that gave correct information is overwritten, and the content of the 
RAID device is damaged.

2. The checksum does not match and one of the drives responded with a 
significant delay. If there is no other way to pinpoint which drive the wrong 
information came from, the drive with the delay is a fair suspect (it had to 
take time to read the “bad block” multiple times and maybe reallocate it). With 
fair certainty (but not 100%), the RAID will handle the situation without data 
corruption.

3. One of the drives timed out or reported an I/O error. The drive will be 
kicked out of the RAID, and it is up to the operator to decide whether to 
replace it or to attempt to rebuild the RAID onto the same drive.
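
The RAID-5 versus RAID-6 difference in case 1 can be illustrated with a toy 
model. This is only a sketch, assuming textbook P (plain XOR) and Q 
(Reed-Solomon over GF(2^8), polynomial 0x11d, generator 2) parity and one byte 
string per data drive. The function names are made up for illustration, and 
this does not describe what any particular controller firmware does.

```python
from functools import reduce

# Log/antilog tables for GF(2^8), polynomial 0x11d, generator 2 (assumed here).
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def p_parity(blocks):
    """P parity: byte-wise XOR of all data blocks (what RAID-5 stores)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def q_parity(blocks):
    """Q parity: XOR of g^i * D_i, where i is the data drive index."""
    return bytes(
        reduce(lambda a, b: a ^ b,
               (gf_mul(EXP[i], byte) for i, byte in enumerate(col)))
        for col in zip(*blocks)
    )


def raid5_consistent(blocks, p):
    """RAID-5 can only say whether the stripe is consistent, not who is wrong."""
    return p_parity(blocks) == p


def raid6_locate(blocks, p, q):
    """Locate a single wrong block, assuming at most one block is wrong.

    Returns None (consistent), "P", "Q", or the index of the bad data block.
    Raises ValueError if the mismatch does not look like a single-block error.
    """
    sp = bytes(a ^ b for a, b in zip(p_parity(blocks), p))  # P syndrome
    sq = bytes(a ^ b for a, b in zip(q_parity(blocks), q))  # Q syndrome
    if not any(sp) and not any(sq):
        return None
    if not any(sq):
        return "P"
    if not any(sp):
        return "Q"
    candidates = set()
    for a, b in zip(sp, sq):
        if a == 0 and b == 0:
            continue  # this byte position was not damaged
        if a == 0 or b == 0:
            raise ValueError("not a single-block error")
        # For a bad data block z: sq = g^z * sp, so z = log(sq) - log(sp).
        candidates.add((LOG[b] - LOG[a]) % 255)
    if len(candidates) != 1 or next(iter(candidates)) >= len(blocks):
        raise ValueError("not a single-block error")
    return candidates.pop()
```

With only P, a mismatch says that something in the stripe is wrong; with P and 
Q together, the ratio of the two syndromes gives the index of the bad data 
drive, whose content can then be corrected by XORing the P syndrome into it.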

>  In case
> there was a power failure, it's probably a good idea to do a check
> anyway.

If you care about the data on your RAID, you will use a battery backup unit, 
which keeps the content of the volatile RAM cache from being lost, so that when 
power returns, the cache can be flushed to the drives. (Without cache, hardware 
RAID devices are noticeably slower than with cache enabled.) [Non-volatile 
caches and supercapacitors are used as well.]

However, the drives themselves have volatile memory as cache, which evaporates 
when power suddenly disappears. To make things worse, drives are designed to 
lie about “transaction complete” (so manufacturers can declare better specs 
than their competitors): “transaction complete” is reported while the data is 
still in the drive's volatile cache, not on the platters. As far as I know, 
there is no way to query a drive and get an honest answer about whether the 
data is already on the platters. Therefore, a hardware RAID card may think some 
transactions are complete when they will never be completed in case of power 
loss.

So, when power suddenly goes… it is potentially a mess on an I/O-intensive box. 
Even with a RAID battery backup for the cache (or with the RAID cache 
disabled), it is a good idea to have the machine behind a UPS and to start a 
clean shutdown when the UPS battery has less than [3 minutes in my case, yours 
may be different] of juice left.


I hope this helps.

Valeri

>> Valeri
>> 
>>> I just checked on Fedora 32.  It does not run raid-check at all, at least not
>>> via a cron entry.  /usr/sbin/raid-check is available, though.  Is that an
>>> oversight?  (I started it manually now and will check if it's run once I update
>>> to 33.)

___
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] mdadm raid-check

2020-11-16 Thread hw
On Sat, 2020-11-14 at 21:55 -0600, Valeri Galtsev wrote:
> > On Nov 14, 2020, at 8:20 PM, hw  wrote:
> > 
> > 
> > Hi,
> > 
> > is it required to run /usr/sbin/raid-check once per week?  Centos 7 does
> > this.  Maybe it's sufficient to run it monthly?  IIRC Debian did it monthly.
> 
> On hardware RAIDs I run RAID verification once a week. Once a month
> is not often enough in my book. RAID verification effectively reads
> all stripes of all drives (and verifies that the content of the
> redundant drives is consistent), thus defusing a “time bomb”: a
> drive left alone for too long, ready to fail in an area that is
> never accessed, which then fails when some other drive is replaced
> and the RAID rebuild has to go over all stripes of all drives. Such
> “multiple failures” are due to poor sysadmin work: RAID
> verification that is not done often enough.

You mean there can be failures which can be detected during a
raid-check and can still be repaired using the other disk, but they
can be impossible to repair when a disk has failed?

> If software raid-check does the same, then it makes a lot of sense,
> and I side more with Red Hat's weekly cron job than with Debian’s
> monthly one.

How often do partial failures occur during normal operation?  In case
there was a power failure, it's probably a good idea to do a check
anyway.

> Valeri
> 
> > I just checked on Fedora 32.  It does not run raid-check at all, at least not
> > via a cron entry.  /usr/sbin/raid-check is available, though.  Is that an
> > oversight?  (I started it manually now and will check if it's run once I update
> > to 33.)



Re: [CentOS] mdadm raid-check

2020-11-14 Thread John Pierce
FWIW, on a 4 x 8TB ZFS RaidZ1, I run a ZFS scrub every night at 2:30am.
If I had more disks in a raidz2 (equiv to raid6) then I might do it weekly
rather than daily, but since a raidz1 is only singly redundant, I figure
daily scrubs increase the chance I'll catch a failing drive sooner rather
than later.


Re: [CentOS] mdadm raid-check

2020-11-14 Thread Valeri Galtsev


> On Nov 14, 2020, at 8:20 PM, hw  wrote:
> 
> 
> Hi,
> 
> is it required to run /usr/sbin/raid-check once per week?  Centos 7 does
> this.  Maybe it's sufficient to run it monthly?  IIRC Debian did it monthly.

On hardware RAIDs I run RAID verification once a week. Once a month is not 
often enough in my book. RAID verification effectively reads all stripes of all 
drives (and verifies that the content of the redundant drives is consistent), 
thus defusing a “time bomb”: a drive left alone for too long, ready to fail in 
an area that is never accessed, which then fails when some other drive is 
replaced and the RAID rebuild has to go over all stripes of all drives. Such 
“multiple failures” are due to poor sysadmin work: RAID verification that is 
not done often enough.

If software raid-check does the same, then it makes a lot of sense, and I side 
more with Red Hat's weekly cron job than with Debian’s monthly one.
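
For Linux md software RAID, the kernel exposes this kind of verification 
through sysfs: writing "check" to /sys/block/mdX/md/sync_action requests a 
read-and-compare pass over the array, and mismatch_cnt in the same directory 
holds the mismatch count from the pass. Below is a minimal sketch, assuming an 
array named md0 and root privileges. The helper names and the sysfs_root 
parameter (there only so the functions can be tried against a fake directory 
tree) are hypothetical, and this is not the raid-check script itself.

```python
from pathlib import Path


def md_dir(array, sysfs_root="/sys"):
    """Directory holding the md attributes for an array such as 'md0'."""
    return Path(sysfs_root) / "block" / array / "md"


def sync_action(array, sysfs_root="/sys"):
    """Current activity of the array, e.g. 'idle', 'check' or 'resync'."""
    return (md_dir(array, sysfs_root) / "sync_action").read_text().strip()


def mismatch_count(array, sysfs_root="/sys"):
    """Mismatch counter as reported by the kernel for the array."""
    return int((md_dir(array, sysfs_root) / "mismatch_cnt").read_text().strip())


def start_check(array, sysfs_root="/sys"):
    """Request a check pass if the array is idle.

    Returns True if the request was written, False if the array was busy
    (for example already checking or resyncing).
    """
    if sync_action(array, sysfs_root) != "idle":
        return False
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")
    return True
```

Calling start_check("md0") from a weekly or monthly job, and looking at 
mismatch_count("md0") once sync_action("md0") is back to "idle", would be one 
way to schedule the check at whatever interval you settle on.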

Valeri

> I just checked on Fedora 32.  It does not run raid-check at all, at least not
> via a cron entry.  /usr/sbin/raid-check is available, though.  Is that an
> oversight?  (I started it manually now and will check if it's run once I update
> to 33.)



[CentOS] mdadm raid-check

2020-11-14 Thread hw


Hi,

is it required to run /usr/sbin/raid-check once per week?  Centos 7 does
this.  Maybe it's sufficient to run it monthly?  IIRC Debian did it monthly.

I just checked on Fedora 32.  It does not run raid-check at all, at least not
via a cron entry.  /usr/sbin/raid-check is available, though.  Is that an
oversight?  (I started it manually now and will check if it's run once I update
to 33.)

