On 20/06/2024 22:02, Fred Stock wrote:
> I think you are seeing two different errors. The backup is failing due
> to a stale file handle error, which usually means the file system was
> unmounted while the file handle was open. The write error on the
> physical disk may have contributed to the stale file handle, but I doubt
> that is the case. As I understand it, a single IO error on a physical disk
> in an ESS (DSS) system will not cause the disk to be considered bad.
> This is likely why the system considers the disk to be ok. I suggest
> you track down the source of the stale file handle and correct that
> issue to see whether your backups are then successful again.
There is a *lot* more than a single IO error on the physical disk. The
output of mmvdisk pdisk list for the disk shows:
IOErrors = 444
IOTimeouts = 8958
mediaErrors = 15
And the output of dmesg shows loads of errors. I have not attempted to
count them, but it is again a *lot* more than a single IO error. That
disk should have been kicked out of the file system, and the fact that it
has not is a bug IMHO. Anyone who thinks that is "normal" and not
"failed" is as high as a kite.
Also, mmbackup has now failed for three days in a row with different
stale file handles while building the change lists, making this an
ongoing issue.
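
For what it is worth, this is roughly how I have been looking for the
source of the stale file handles (the log path is the standard Spectrum
Scale one; the rest is generic):

  # Kernel messages with readable timestamps
  dmesg -T | grep -i 'stale'

  # The GPFS daemon log on the node running mmbackup
  grep -i 'stale file handle' /var/adm/ras/mmfs.log.latest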
So can I safely use the --force to get this dodgy disk out of the file
system? It is the *only* disk in the system showing IO errors, so it is
almost certainly the cause of the problems, unless you are aware of some
Linux kernel bug that causes otherwise healthy disks in an enclosure to
start having problems. I guess there is an outside chance there could be
an issue with the enclosure, but really you start with the disk.
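
For the record, my reading of the documentation is that the sequence
would be something like the following (names are placeholders, and I
would obviously want confirmation before running it, hence the question):

  # Prepare the suspect pdisk for replacement even though the system
  # has not marked it replaceable; this is where --force comes in.
  mmvdisk pdisk replace --prepare --recovery-group rg1 --pdisk e1d1s01 --force

  # After physically swapping the drive, complete the replacement.
  mmvdisk pdisk replace --recovery-group rg1 --pdisk e1d1s01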
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG