Hi, that message is still in memory. "mmhealth node eventlog --clear" deletes all old events from the log, but events which are currently active are not affected.
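A minimal sketch of the distinction, using the commands from your session (behavior per my reading of the monitor; exact output varies by release):

mmhealth node eventlog            # historical log entries - this is what --clear removes
mmhealth node show FILESYSTEM     # active in-memory component state - survives a --clear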
I think this is related to multiple Collector Nodes; I will dig deeper into that code to find out whether some issue lurks there. As a stop-gap measure one could execute "mmsysmoncontrol restart" on the affected node(s); this stops and restarts the monitoring process, which clears the event held in memory (a quick sketch follows below the quoted message). The data used for the event comes from mmlspool, which should be close or identical to mmdf.

Mit freundlichen Grüßen / Kind regards

Norbert Schuld

From: [email protected]
To: [email protected]
Date: 20/07/2018 00:15
Subject: [gpfsug-discuss] mmhealth - where is the info hiding?
Sent by: [email protected]

So I'm trying to tidy up things like 'mmhealth' etc. Got most of it fixed, but stuck on one thing..

Note: I already did a 'mmhealth node eventlog --clear -N all' yesterday, which cleaned out a bunch of other long-past events that were "stuck" as failed/degraded even though they were corrected days/weeks ago - keep this in mind as you read on....

# mmhealth cluster show

Component      Total   Failed  Degraded  Healthy  Other
-------------------------------------------------------
NODE              10        0         0       10      0
GPFS              10        0         0       10      0
NETWORK           10        0         0       10      0
FILESYSTEM         1        0         1        0      0
DISK             102        0         0      102      0
CES                4        0         0        4      0
GUI                1        0         0        1      0
PERFMON           10        0         0       10      0
THRESHOLD         10        0         0       10      0

Great. One hit for 'degraded' filesystem.

# mmhealth node show --unhealthy -N all
(skipping all the nodes that show healthy)

Node name:      arnsd3-vtc.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component    Status    Status Change  Reasons
--------------------------------------------------------------------------
FILESYSTEM   FAILED    24 days ago    pool-data_high_error (archive/system)

(...)

Node name:      arproto2-isb.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component    Status     Status Change  Reasons
--------------------------------------------------------------------------
FILESYSTEM   DEGRADED   6 days ago     pool-data_high_warn (archive/system)

mmdf tells me:

nsd_isb_01       13103005696        1 No   Yes   1747905536 ( 13%)   111667200 ( 1%)
nsd_isb_02       13103005696        1 No   Yes   1748245504 ( 13%)   111724384 ( 1%)
(94 more LUNs all within 0.2% of these for usage - data is striped out pretty well)

There's also 6 SSD LUNs for metadata:

nsd_isb_flash_01   2956984320       1 Yes  No    2116091904 ( 72%)    26996992 ( 1%)
(again, evenly striped)

So who is remembering that status, and how to clear it?
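P.S. For completeness, a rough sketch of the stop-gap on an affected node (assuming a brief restart of the monitoring process is acceptable there):

mmsysmoncontrol restart           # stops and restarts the monitoring process, dropping the stale in-memory event
mmhealth node show FILESYSTEM     # verify pool-data_high_warn / pool-data_high_error is gone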
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
