This is not common in my experience; I wouldn't worry about it that much. We currently have five of these things, between GSS and DSS-G, running everything from GPFS 4.1.0-8 to 5.1.8-2, and have only seen a similar situation once. Ours fail disks all the time before we even notice anything is wrong.
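For what it's worth, a rough sketch of the kind of out-of-band check that catches this early is below. It assumes standard GNR commands (mmlspdisk and mmlsrecoverygroupevents do exist, but the exact flags and output formats vary by release, and "rgL" is just a placeholder recovery group name), so treat it as a starting point rather than a drop-in script:

  #!/bin/bash
  # Sketch only: surface pdisks the disk hospital has not (yet) failed,
  # plus recent pdisk-related recovery group events.

  # Any pdisk not in the "ok" state (draining, failing, missing, ...).
  /usr/lpp/mmfs/bin/mmlspdisk all --not-ok

  # The recovery group event log tends to show pdisk I/O errors well before
  # a pdisk is actually failed. "rgL" is a placeholder; list your recovery
  # groups with mmlsrecoverygroup.
  /usr/lpp/mmfs/bin/mmlsrecoverygroupevents rgL --days 1 | grep -iE 'pdisk|error'

Anything non-empty out of either command is worth a look, whatever mmhealth says.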
But do report it as a bug. What's the hardware on this thing?

Sent from my iPhone

> On Jun 24, 2024, at 08:53, Jonathan Buzzard <[email protected]> wrote:
>
> On 24/06/2024 13:16, Achim Rehor wrote:
>> CAUTION: This email originated outside the University. Check before
>> clicking links or attachments.
>>
>> well ... not necessarily 😄
>> but on the disk ... just as I expected ... taking it out helps a lot.
>> Now, taking it out automatically when it raises too many errors was a
>> discussion I had several times with GNR development.
>> The issue really is: I/O errors on disks (as seen in the
>> mmlsrecoverygroupevents log) can be due to several things (the disk
>> itself, the expander, the IOM, the adapter, the cable ...).
>> If a more general part is serving, say, 5 or more pdisks, taking them
>> out automatically would risk the fault tolerance.
>> Thus ... we don't do that.
>
> When smartctl for the disk says
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/      errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:          0    33839        32         0          0     137434.705          32
> write:         0       36         0         0          0     178408.893           0
>
> Non-medium error count:        0
>
> A disk with 32 uncorrected read errors in smartctl is fubar, no ifs, no
> buts. Wherever the right balance in ejecting bad disks lies, IMHO it is
> currently in the wrong place, because it failed to eject an actual bad
> disk.
>
> At an absolute bare minimum, mmhealth should not be saying everything is
> fine and dandy, because clearly it was not. That's the bigger issue. I can
> live with disks not being taken out automatically; it is unacceptable that
> mmhealth was giving false and inaccurate information about the state of
> the filesystem. Had it even just changed something to a "degraded" state,
> the problems could have been picked up much, much sooner.
>
> Presumably the DISK category was still good because the vdisks were
> theoretically good. I suggest renaming that to VDISK to more accurately
> reflect what it is about, and adding a PDISK category. Then when a pdisk
> starts showing I/O errors you can increment the number of disks in a
> degraded state, and it can be picked up without end users having to roll
> their own monitoring.
>
>> The idea is to improve the disk hospital more and more, so that the
>> decision to switch a disk back to OK becomes more accurate over time.
>> Until then ... it might always be a good idea to scan the event log for
>> pdisk errors ...
>
> That is my conclusion: mmhealth is as useful as a chocolate teapot,
> because you can't rely on it to provide correct information, and I need
> to do my own health monitoring of the system.
>
>
> JAB.
>
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
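For anyone who does end up rolling their own monitoring along the lines Jonathan describes, a minimal sketch of sweeping the same smartctl "Error counter log" is below. The device discovery via smartctl --scan and the assumption that the last column of the read:/write: rows is "Total uncorrected errors" both need checking against your own drives and smartmontools version:

  #!/bin/bash
  # Sketch only: report any device with non-zero "Total uncorrected errors"
  # in the SCSI error counter log (SAS drives; ATA drives print a different
  # log and simply sum to zero here).
  for dev in $(smartctl --scan | awk '/\/dev\// {print $1}'); do
      bad=$(smartctl -l error "$dev" | awk '/^(read|write):/ {sum += $NF} END {print sum+0}')
      if [ "$bad" -gt 0 ]; then
          echo "WARNING: $dev reports $bad total uncorrected errors"
      fi
  done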
