Just curious, what is the output of 'mmlspdisk all --not-ok'?
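
For reference, the way I would run it (as root on a node with GPFS admin
authority; the --not-ok flag just limits the listing to pdisks that are not
in an "ok" state):

       # List every pdisk whose state is not "ok"
       mmlspdisk all --not-ok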

Regards,
Larry Henson
MD Anderson Cancer Center
IT Engineering Storage Team
Cell (713) 702-4896

-----Original Message-----
From: gpfsug-discuss <[email protected]> On Behalf Of Jonathan Buzzard
Sent: Thursday, June 20, 2024 5:35 PM
To: [email protected]
Subject: [EXTERNAL] Re: [gpfsug-discuss] Bad disk but not failed in DSS-G

On 20/06/2024 22:02, Fred Stock wrote:
>
> I think you are seeing two different errors.  The backup is failing
> due to a stale file handle error, which usually means the file system
> was unmounted while the file handle was open.  The write error on the
> physical disk may have contributed to the stale file handle, but I
> doubt that is the case.  As I understand it, a single IO error on a
> physical disk in an ESS (DSS) system will not cause the disk to be
> considered bad.  This is likely why the system considers the disk to
> be ok.  I suggest you track down the source of the stale file handle
> and correct that issue, and then see whether your backups are
> successful again.
>

There is a *lot* more than a single IO error on the physical disk; the
output of mmvdisk pdisk list for the disk shows:

       IOErrors = 444
       IOTimeouts = 8958
       mediaErrors = 15
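
(For the archive, that came from the long pdisk listing; if memory serves
the invocation is along these lines, where the recovery group and pdisk
names are placeholders for the real ones:)

       # Long listing for a single pdisk, including the error counters;
       # rg1 and e1d1s05 are placeholder names, substitute your own
       mmvdisk pdisk list --recovery-group rg1 --pdisk e1d1s05 -L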

And the output of dmesg shows loads of errors. I have not attempted to
count them, but it is again a *lot* more than a single IO error. That disk
should have been kicked out of the file system, and the fact that it has
not is a bug IMHO. Anyone who thinks that is "normal" and not "failed" is
as high as a kite.

Also, mmbackup has now failed three days in a row with different stale file
handles while building the change lists, making this an ongoing issue.
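
A quick way to check whether a given path still returns ESTALE outside of
mmbackup (the path here is a placeholder for wherever the scan trips up):

       # "Stale file handle" comes straight back from stat(1) if the
       # handle is still bad; the path is a placeholder
       stat /gpfs/fs1/some/suspect/path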

So can I safely use the --force to get this dodgy disk out of the file
system? It is the *only* disk in the system showing IO errors, so it is
almost certainly the cause of the problems. That is, unless you are aware
of some Linux kernel bug that causes otherwise healthy disks in an
enclosure to start having problems. I guess there is an outside chance
there could be an issue with the enclosure, but really you start with the
disk.
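
Alternatively, if I understand the docs correctly, I could mark the pdisk
as failing and let GNR drain the data off it first, something like the
following (placeholder names, and I would double-check the mmvdisk man
page on our release before running it):

       # Mark the pdisk as failing so GNR migrates data off it before
       # removal; rg1 and e1d1s05 are placeholder names
       mmvdisk pdisk change --recovery-group rg1 --pdisk e1d1s05 --simulate-failing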


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
