The disk hospital can sometimes be a bit too conservative, or, as in this 
case, not conservative enough.
What is the SMART status of this drive?
Does mmhealth report anything different from the mmvdisk or mmlspdisk commands?
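
For reference, a quick way to gather that information might look like the 
following (the device path is a placeholder for the block device behind the 
pdisk, so adjust it and the node selection to your setup):

    # SMART overall health verdict and full attribute/error-log dump
    # (SAS drives behind a SAS HBA may need "-d scsi")
    smartctl -H /dev/sdX
    smartctl -x /dev/sdX

    # Compare what the different GPFS/GNR views report for that drive
    mmvdisk pdisk list --recovery-group all --not-ok
    mmlspdisk all --not-ok
    mmhealth node show -N all --unhealthy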

In DSS-G, the health monitor (dssghealthmon) and diskIOHang (dssgdiskIOHang) 
systems may help detect such a problem: the former polls multiple metrics at 
regular intervals (for drives, mostly around mmlspdisk output), while the 
latter power-cycles a hung drive (as signaled by the GNR diskIOHang callback, 
which in this case might not get triggered, though). In recent DSS-G releases, 
the drive checks are complemented by the SMART overall health status, which 
does help detect faulty drives.
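
To illustrate the idea (this is only a sketch, not the actual dssghealthmon 
code), such a poller boils down to something like:

    # Check the pdisk error counters every 10 minutes and warn about any
    # drive whose IOErrors counter exceeds an arbitrary threshold.
    THRESHOLD=100
    while sleep 600; do
        mmlspdisk all | awk -v t="$THRESHOLD" '
            $1 == "name"     { name = $3 }
            $1 == "IOErrors" { if ($3 + 0 > t + 0)
                printf "WARNING: pdisk %s IOErrors=%s\n", name, $3 }'
    done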

Regarding the error from mmvdisk when preparing the drive for replacement: 
provided the drive is clearly at fault and possibly already unreachable, I 
would definitely force-remove it. That cannot be worse than keeping a bad 
drive that is still erroneously considered healthy.
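
For what it's worth, the forced removal would look roughly like this (the 
recovery-group and pdisk names are placeholders, and the exact options should 
be double-checked against the mmvdisk man page of your DSS-G release):

    # Prepare the pdisk for replacement, forcing it since GNR does not
    # (yet) consider the drive replaceable:
    mmvdisk pdisk replace --prepare --recovery-group rg1 --pdisk e1d1s01 --force

    # Once the drive has been physically swapped, complete the replacement:
    mmvdisk pdisk replace --recovery-group rg1 --pdisk e1d1s01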

HTH

--
Nicolas Calimet, PhD | HPC System Architect | Lenovo ISG | Meitnerstrasse 9, 
D-70563 Stuttgart, Germany | +49 71165690146 | https://www.lenovo.com/dssg

-----Original Message-----
From: gpfsug-discuss <[email protected]> On Behalf Of Jonathan 
Buzzard
Sent: Friday, June 21, 2024 00:35
To: [email protected]
Subject: [External] Re: [gpfsug-discuss] Bad disk but not failed in DSS-G

On 20/06/2024 22:02, Fred Stock wrote:
>
> I think you are seeing two different errors.  The backup is failing due
> to a stale file handle error which usually means the file system was
> unmounted while the file handle was open.  The write error on the
> physical disk may have contributed to the stale file handle, but I doubt
> that is the case.  As I understand a single IO error on a physical disk
> in an ESS (DSS) system will not cause the disk to be considered bad.
> This is likely why the system considers the disk to be ok.  I suggest
> you track down the source of the stale file handle and correct that
> issue to see if your backups will then again be successful.
>

There is a *lot* more than a single IO error on the physical disk; the
output of mmvdisk pdisk list for the disk shows

       IOErrors = 444
       IOTimeouts = 8958
       mediaErrors = 15

And the output of dmesg shows loads of errors. I have not attempted to
count them, but it is again a *lot* more than a single IO error. That
disk should have been kicked out of the file system, and the fact that it
has not is a bug IMHO. Anyone who thinks that is "normal" and not
"failed" is as high as a kite.

Also, mmbackup has now failed for three days in a row with different
stale file handles while building the change lists, making this an
ongoing issue.

So can I safely use the --force to get this dodgy disk out of the file
system? It is the *only* disk in the system showing IO errors, so it is
almost certainly the cause of the problems, unless you are aware of some
Linux kernel bug that causes otherwise healthy disks in an enclosure to
start having problems. I guess there is an outside chance there could be
an issue with the enclosure, but really you start with the disk.


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
