[gpfsug-discuss] mmhealth alerts too quickly

Peter Childs Fri, 13 Sep 2024 01:36:30 -0700

We have a Nagios alert that watches the output of mmhealth and alerts us if 
Scale is unhappy on a node. It's fairly simple and straightforward, and is very 
good at alerting us quickly to simple issues.
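
For readers who want to build something similar, here is a minimal sketch of 
such a check in Python. It is not the check described in this post. It assumes 
the colon-delimited output of "mmhealth node show -Y", and the SAMPLE text, the 
function names and the status-to-exit-code mapping are illustrative 
assumptions. Column positions are looked up from the HEADER record rather than 
hard-coded, because the exact layout may differ between releases.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Illustrative sample only; real "mmhealth node show -Y" output may differ.
SAMPLE = """\
mmhealth:State:HEADER:version:reserved:reserved:node:component:entityname:entitytype:status:laststatuschange:
mmhealth:State:0:1:::node01:NODE:node01:NODE:DEGRADED:2024-09-13 08%3A10%3A00:
mmhealth:State:0:1:::node01:GPFS:node01:NODE:HEALTHY:2024-09-13 08%3A00%3A00:
mmhealth:State:0:1:::node01:NETWORK:node01:NODE:DEGRADED:2024-09-13 08%3A10%3A00:
"""

# Assumed mapping from health status to Nagios code; adjust to local policy.
STATUS_TO_CODE = {"HEALTHY": OK, "TIPS": OK, "DEGRADED": WARNING, "FAILED": CRITICAL}

# Severity order used to pick the worst result (UNKNOWN ranks below WARNING).
RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def parse_states(text):
    """Return a list of (component, entityname, status) from State records."""
    header = None
    states = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 3 or fields[1] != "State":
            continue  # skip Event records and anything unrecognised
        if fields[2] == "HEADER":
            header = fields  # remember column names for the rows that follow
            continue
        if header is None:
            continue  # data row without a preceding header; cannot interpret
        row = dict(zip(header, fields))
        if "component" in row and "status" in row:
            states.append((row["component"], row.get("entityname", ""), row["status"]))
    return states


def evaluate(states):
    """Reduce parsed states to a single (exit_code, message) pair."""
    if not states:
        return UNKNOWN, "no State records found"
    worst = OK
    problems = []
    for component, entity, status in states:
        code = STATUS_TO_CODE.get(status, UNKNOWN)
        if code != OK:
            problems.append(f"{component}/{entity}={status}")
        if RANK[code] > RANK[worst]:
            worst = code
    message = ", ".join(problems) if problems else "all components healthy"
    return worst, message


code, message = evaluate(parse_states(SAMPLE))
print(code, message)
```

In a real plugin the text would come from running the command on the node 
(for example via subprocess) instead of SAMPLE, and the script would exit with 
the returned code so that Nagios can pick it up.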


Since we upgraded to 5.1.9-5 we're getting random nodes moaning about lost 
connections whenever another machine is rebooted or stops working. This is 
great, however there does not seem to be any good way to acknowledge the 
alerts, or to close the connections gracefully if the machine is turned off 
rather than actually failing.

I'm aware of the "mmhealth --refresh" method, but I've not actually seen this 
achieve anything, and I normally end up running "mmsysmoncontrol restart" to 
get the message to reset. The problem is we don't exactly want to lose the 
alerts; they are useful if there is a problem, but it would be nice if they 
were a little more helpful and could be acknowledged. Maybe mmshutdown just 
needs to close all the cluster connections gracefully so that other nodes 
don't moan. I've always found it a little abrupt.

The other main issue we have is the old memory leak in our ESS5000 
(https://www.ibm.com/support/pages/node/7027786). We have been working with 
IBM over the last 18 months on this issue, but still no resolution is in 
sight and I'm not sure the workaround is relevant any more.

Also, we found that going straight from 5.1.2 to 5.2.0 (to 5.2.1) is not a 
stable upgrade path, and it's best to pass through 5.1.9 first. I'm not sure 
if there is an issue here and something needs adding to the release notes, 
but that is certainly what we discovered.

I hope our findings help others.

Peter Childs



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
