Good news on that matter: 5.2.2.0 will have an option to ignore those events when they come from a client cluster.
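Until then, clearing a backlog of these events can be scripted around `mmhealth event resolve`. A minimal sketch, assuming a colon-delimited `-Y`-style output line; the sample line and field positions below are made up for illustration and must be checked against the actual output of your release:

```shell
# Sketch only: this sample line is an assumed example of colon-delimited
# "mmhealth node show -Y" output; real field positions vary by release.
sample='mmhealth:Event:0:1:::node1:NETWORK:cluster_connections_down:192.168.1.17:'

# Pull the event identifier (assumed to be field 10) for
# cluster_connections_down events.
ip=$(printf '%s\n' "$sample" | awk -F: '$9 == "cluster_connections_down" {print $10}')
echo "$ip"

# The event could then be cleared with:
#   mmhealth event resolve cluster_connections_down "$ip"
```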
Mit freundlichen Grüßen / Kind regards

Norbert Schuld
Software Engineer, Release Architect IBM Storage Scale
IBM Systems / 00E636
Brüsseler Straße 1-3, 60327 Frankfurt
Phone: +49-160-7070335
E-Mail: [email protected]

IBM Data Privacy Statement <https://www.ibm.com/privacy/us/en/>
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Wolfgang Wendt / Management: David Faller
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, HRB 243294

From: gpfsug-discuss <[email protected]> On Behalf Of Ryan Novosielski
Sent: Friday, September 13, 2024 4:34 PM
To: gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] Re: [gpfsug-discuss] mmhealth alerts too quickly

On Sep 13, 2024, at 08:15, Dietrich, Stefan <[email protected]> wrote:

> Hello Peter,
>
>> Since we upgraded to 5.1.9-5 we're getting random nodes moaning about lost
>> connections whenever another machine is rebooted or stops working.
>>
>> This is great, however there does not seem to be any great way to acknowledge
>> the alerts, or close the connections gracefully if the machine is turned off
>> rather than actually failing.
>
> It's possible to resolve events in mmhealth:
>
>   # mmhealth event resolve
>   Missing arguments.
>   Usage: mmhealth event resolve {EventName} [Identifier]
>
> -> `mmhealth event resolve cluster_connections_down AFFECTED_IP` should do
> the trick.
>
> In our clusters, a regular reboot doesn't seem to trigger this event. All our
> nodes are running Scale >= 5.2.0.

Our clusters (5.1.9-3 on the client side, and either 5.1.5-1 or 5.1.9-2 on the storage side) also show downed connections, but I wish this were somehow tunable.
A single downed client that's not even part of the same cluster is not a reason to alert us on our storage cluster. We monitor MMHEALTH via Nagios, and so we're occasionally getting messages about a single client.

--
#BlackLivesMatter
Ryan Novosielski - [email protected]
Sr. Technologist - 973/972.0922 (2x0922), RBHS Campus
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, the State University of New Jersey
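Where the Nagios check itself cannot be changed, one interim workaround is to filter the event out in a plugin wrapper before it pages anyone. A rough sketch, assuming the check parses unhealthy-event lines from mmhealth; the sample lines below are invented for illustration and do not reflect real mmhealth output format:

```shell
# Assumed sample of unhealthy-event lines; real mmhealth output differs
# and the parsing here would need adjusting accordingly.
sample='NETWORK   DEGRADED   cluster_connections_down(10.0.0.5)
GPFS      FAILED     gpfs_down'

# Drop cluster_connections_down lines so a single downed remote client
# does not raise an alert; everything else still counts as unhealthy.
filtered=$(printf '%s\n' "$sample" | grep -v cluster_connections_down)

# Hypothetical Nagios-style verdict based on what remains.
if [ -n "$filtered" ]; then
    echo "CRITICAL: $filtered"
else
    echo "OK"
fi
```

The same idea could be applied inside an existing check_mmhealth-style plugin rather than a wrapper; the essential point is only that the filtering happens before the Nagios exit code is decided.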
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
