Hi Valdis,

Is this what you’re looking for (from an IBMer in response to another question 
a few weeks back)?

Assuming a 4.2.3 code level, this can be done by deleting and recreating the
rule with the changed settings:

# mmhealth thresholds list
### Threshold Rules ###
rule_name             metric                error  warn    direction  filterBy  groupBy                                             sensitivity
------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode         90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
MetaDataCapUtil_Rule  MetaDataPool_capUtil  90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
DataCapUtil_Rule      DataPool_capUtil      90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule          mem_memfree           50000  100000  low                  node                                                300

# mmhealth thresholds delete MetaDataCapUtil_Rule
The rule(s) was(were) deleted successfully


# mmhealth thresholds add MetaDataPool_capUtil --errorlevel 95.0 --warnlevel 85.0 \
    --direction high --sensitivity 300 --name MetaDataCapUtil_Rule \
    --groupby gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name


# mmhealth thresholds list
### Threshold Rules ###
rule_name             metric                error  warn    direction  filterBy  groupBy                                             sensitivity
------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode         90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
MemFree_Rule          mem_memfree           50000  100000  low                  node                                                300
DataCapUtil_Rule      DataPool_capUtil      90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MetaDataCapUtil_Rule  MetaDataPool_capUtil  95.0   85.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
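
If you have several of these capacity rules to adjust, the same
delete-and-recreate dance is easy to wrap in a small script. A minimal sketch
using the values from the example above (substitute your own rule name,
metric, and levels):

# Placeholders taken from the example above - adjust to taste.
RULE=MetaDataCapUtil_Rule
METRIC=MetaDataPool_capUtil
mmhealth thresholds delete "$RULE"
mmhealth thresholds add "$METRIC" --errorlevel 95.0 --warnlevel 85.0 \
    --direction high --sensitivity 300 --name "$RULE" \
    --groupby gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name
# Confirm the rule came back with the new levels.
mmhealth thresholds list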

Kevin

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected]<mailto:[email protected]> - 
(615)875-9633



On Jul 19, 2018, at 4:25 PM, [email protected] wrote:

So I'm trying to tidy up things like 'mmhealth', etc. I've got most of it
fixed, but I'm stuck on one thing...

Note: I already did a 'mmhealth node eventlog --clear -N all' yesterday, which
cleaned out a bunch of other long-past events that were "stuck" as failed /
degraded even though they were corrected days/weeks ago - keep this in mind as
you read on....
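
(In case anyone wants to repeat that cleanup on a single node rather than
cluster-wide, the same command without the -N flag should work, if I have the
syntax right:

mmhealth node eventlog --clear

run locally on the node in question.)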

# mmhealth cluster show

Component     Total  Failed  Degraded  Healthy  Other
------------------------------------------------------
NODE             10       0         0       10      0
GPFS             10       0         0       10      0
NETWORK          10       0         0       10      0
FILESYSTEM        1       0         1        0      0
DISK            102       0         0      102      0
CES               4       0         0        4      0
GUI               1       0         0        1      0
PERFMON          10       0         0       10      0
THRESHOLD        10       0         0       10      0

Great.  One hit for 'degraded' filesystem.

# mmhealth node show --unhealthy -N all
(skipping all the nodes that show healthy)

Node name:      arnsd3-vtc.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons
-----------------------------------------------------------------------------------
FILESYSTEM     FAILED        24 days ago       pool-data_high_error(archive/system)
(...)
Node name:      arproto2-isb.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons
----------------------------------------------------------------------------------
FILESYSTEM     DEGRADED      6 days ago        pool-data_high_warn(archive/system)
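
(To drill into just that component, mmhealth node show also takes a component
name, so something like this - same -N syntax as above - should narrow it
down:

mmhealth node show FILESYSTEM -N arproto2-isb.nis.internal
)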

mmdf tells me:
nsd_isb_01          13103005696    1  No   Yes    1747905536 ( 13%)   111667200 ( 1%)
nsd_isb_02          13103005696    1  No   Yes    1748245504 ( 13%)   111724384 ( 1%)
(94 more LUNs all within 0.2% of these for usage - data is striped out pretty well)

There are also 6 SSD LUNs for metadata:
nsd_isb_flash_01     2956984320    1  Yes  No     2116091904 ( 72%)    26996992 ( 1%)
(again, evenly striped)
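
(If it helps anyone reproduce this, the per-pool view can be had with mmdf's
-P option - assuming the filesystem is named 'archive', as the
'archive/system' in the event text suggests:

mmdf archive -P system
)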

So who is remembering that status, and how do I clear it?

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
