This is not common in my experience; I wouldn't worry about it that much. We 
have five of these things currently, between GSS and DSS-G, running GPFS 
4.1.0-8 through 5.1.8-2, and have only seen a similar situation once. Ours 
fail disks all the time before we even notice anything is wrong.

But do report it as a bug.

What’s the hardware on this thing?


> On Jun 24, 2024, at 08:53, Jonathan Buzzard <[email protected]> 
> wrote:
> 
> On 24/06/2024 13:16, Achim Rehor wrote:
>> Well ... not necessarily 😄
>> But on the disk ... just as I expected ... taking it out helps a lot.
>> Taking a disk out automatically when it raises too many errors is a 
>> discussion I have had several times with GNR development.
>> The issue really is that I/O errors on disks (as seen in the 
>> mmlsrecoverygroupevents logs) can be due to several components: the disk 
>> itself, the expander, the IOM, the adapter, the cable ...
>> If the faulty part were something more general serving five or more pdisks, 
>> taking them all out automatically would risk the fault tolerance (FT).
>> Thus ... we don't do that.
> 
> When smartctl for the disk says
> 
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:          0    33839        32         0          0     137434.705          32
> write:         0       36         0         0          0     178408.893           0
> 
> Non-medium error count:        0
> 
> 
> A disk with 32 uncorrected read errors in smartctl is fubar, no ifs, no 
> buts. Wherever the right balance for ejecting bad disks lies, IMHO it is 
> currently in the wrong place, because it failed to eject an actual bad disk.
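> 
> As a rough illustration, a minimal sketch of the sort of check I mean, 
> assuming the SCSI "Error counter log" layout shown above and a hypothetical 
> list of device paths (you would enumerate your actual pdisk devices):
> 
>     #!/usr/bin/env python3
>     # Sketch: flag disks whose SCSI error counter log reports uncorrected
>     # read errors. Assumes the table layout printed by smartctl for
>     # SAS/SCSI drives, as in the output above.
>     import subprocess
> 
>     DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical; list your pdisks
> 
>     for dev in DEVICES:
>         out = subprocess.run(["smartctl", "-a", dev],
>                              capture_output=True, text=True).stdout
>         for line in out.splitlines():
>             # The "read:" row ends with the total uncorrected error count.
>             if line.startswith("read:"):
>                 uncorrected = int(line.split()[-1])
>                 if uncorrected > 0:
>                     print(f"{dev}: {uncorrected} uncorrected read errors")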
> 
> At an absolute bare minimum mmhealth should not be saying everything is 
> fine and dandy, because clearly it was not. That's the bigger issue. I can 
> live with bad disks not being taken out automatically; it is unacceptable 
> that mmhealth was giving false and inaccurate information about the state 
> of the filesystem. Had it even just changed something to a "degraded" 
> state, the problem could have been picked up much, much sooner.
> 
> Presumably the DISK category was still good because the vdisks were 
> theoretically good. I suggest renaming that category to VDISK, to more 
> accurately reflect what it is about, and adding a PDISK category. Then when 
> a pdisk starts showing I/O errors you can increment the number of disks in 
> a degraded state, and it can be picked up without end users having to roll 
> their own monitoring (a sketch of such a check follows below).
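> 
> A minimal sketch of what such a PDISK check might look like, assuming 
> mmlspdisk's long-form output contains per-pdisk counter lines such as 
> 'IOErrors = 32' (the exact field names may vary between releases):
> 
>     #!/usr/bin/env python3
>     # Sketch: count pdisks reporting errors, roughly what a PDISK
>     # category in mmhealth could do. Counter field names are assumptions.
>     import re, subprocess
> 
>     out = subprocess.run(["mmlspdisk", "all"],
>                          capture_output=True, text=True).stdout
> 
>     showing_errors = set()
>     name = None
>     for line in out.splitlines():
>         m = re.match(r'\s*name\s*=\s*"?([^"\s]+)', line)
>         if m:
>             name = m.group(1)
>         m = re.search(r'(IOErrors|mediaErrors|checksumErrors)\s*=\s*(\d+)',
>                       line)
>         if m and int(m.group(2)) > 0:
>             print(f"pdisk {name}: {m.group(1)} = {m.group(2)}")
>             showing_errors.add(name)
> 
>     print(f"{len(showing_errors)} pdisk(s) showing errors")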
> 
>> The idea is to improve the disk hospital more and more, so that the 
>> decision to switch a disk back to OK becomes more accurate over time.
>> Until then ... it might always be a good idea to scan the event log for 
>> pdisk errors ...
> 
> That is my conclusion: mmhealth is as useful as a chocolate teapot, because 
> you can't rely on it to provide correct information, and I need to do my 
> own health monitoring of the system.
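> 
> Following Achim's suggestion above, a first cut might just scan the 
> recovery group event logs. A minimal sketch, with hypothetical recovery 
> group names and a deliberately loose match, since the exact wording of the 
> event messages varies:
> 
>     #!/usr/bin/env python3
>     # Sketch: scan recovery group event logs for pdisk errors.
>     import subprocess
> 
>     RECOVERY_GROUPS = ["rgL", "rgR"]  # hypothetical; use your own names
> 
>     for rg in RECOVERY_GROUPS:
>         out = subprocess.run(["mmlsrecoverygroupevents", rg, "--days", "1"],
>                              capture_output=True, text=True).stdout
>         for line in out.splitlines():
>             if "pdisk" in line.lower() and "error" in line.lower():
>                 print(f"{rg}: {line.strip()}")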
> 
> 
> JAB.
> 
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
> 
> 
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
