Re: [gpfsug-discuss] mmhealth alerts too quickly

Ryan Novosielski Fri, 13 Sep 2024 07:36:24 -0700

On Sep 13, 2024, at 08:15, Dietrich, Stefan <[email protected]> wrote:


Hello Peter,

Since we upgraded to 5.1.9-5 we're getting random nodes moaning about lost
connections whenever another machine is rebooted or stops working. This is
great, however there does not seem to be any great way to acknowledge the
alerts, or close the connections gracefully if the machine is turned off rather
than actually failing.

It's possible to resolve an event in mmhealth:

# mmhealth event resolve
Missing arguments.
Usage:
 mmhealth event resolve {EventName} [Identifier]

-> `mmhealth event resolve cluster_connections_down AFFECTED_IP` should do the 
trick.
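
If several addresses are affected at once, the same command can be looped. What follows is only a minimal sketch: the helper names `build_resolve_commands` and `resolve_all` and the placeholder addresses are made up for illustration, and it assumes `mmhealth` is on the PATH of the node where it runs. The only mmhealth syntax it relies on is the `event resolve {EventName} [Identifier]` usage shown above. It defaults to a dry run that just prints the commands.

```python
import subprocess

# Event name taken from this thread.
EVENT = "cluster_connections_down"


def build_resolve_commands(ips, event=EVENT):
    """Return one `mmhealth event resolve` argument list per affected IP."""
    return [["mmhealth", "event", "resolve", event, ip] for ip in ips]


def resolve_all(ips, dry_run=True):
    """Print (dry run) or execute the resolve command for each IP.

    Returns a list of exit codes, one per IP (0 for every dry-run entry).
    """
    results = []
    for argv in build_resolve_commands(ips):
        if dry_run:
            print(" ".join(argv))
            results.append(0)
        else:
            # Assumes mmhealth is installed and on PATH on this node.
            results.append(subprocess.run(argv, check=False).returncode)
    return results


if __name__ == "__main__":
    # Placeholder documentation addresses, not real cluster nodes.
    resolve_all(["192.0.2.10", "192.0.2.11"])
```

Calling `resolve_all(ips, dry_run=False)` would actually run the commands, so it is worth checking the printed output first.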

In our clusters, a regular reboot doesn't seem to trigger this event. All our 
nodes are running Scale >= 5.2.0

Our clusters (5.1.9-3 on the client side, and either 5.1.5-1 or 5.1.9-2 on the 
storage side) also show downed connections, but I wish this were somehow 
tunable. A single downed client that’s not even part of the same cluster is not 
a reason to alert us on our storage cluster. We monitor mmhealth via Nagios, 
and so we’re occasionally getting messages about a single client.
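
Lacking a tunable, one workaround is to filter on the Nagios side. The sketch below is an assumption-laden illustration, not an mmhealth feature: it presumes the check script has already parsed mmhealth output into (event name, identifier) pairs (the parsing is not shown), and `IGNORED_NETWORKS`, `is_ignored` and `nagios_status` are hypothetical names. It drops `cluster_connections_down` events whose identifier is an IP inside a listed client network and reports anything else as a warning, using the standard Nagios plugin exit codes.

```python
import ipaddress

# Hypothetical: networks of clients from other clusters that should not alert.
IGNORED_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]

# Standard Nagios plugin exit codes.
OK, WARNING = 0, 1


def is_ignored(identifier, networks=IGNORED_NETWORKS):
    """True if the identifier is an IP address inside an ignored network."""
    try:
        addr = ipaddress.ip_address(identifier)
    except ValueError:
        return False  # not an IP address: never suppress
    return any(addr in net for net in networks)


def nagios_status(events, networks=IGNORED_NETWORKS):
    """Map (event_name, identifier) pairs to a (exit_code, message) tuple.

    The pairs are assumed to come from mmhealth output parsed elsewhere.
    """
    remaining = [
        (name, ident)
        for name, ident in events
        if not (name == "cluster_connections_down" and is_ignored(ident, networks))
    ]
    if remaining:
        detail = ", ".join(f"{name} {ident}" for name, ident in remaining)
        return WARNING, "WARNING: " + detail
    return OK, "OK: no unsuppressed mmhealth events"
```

A wrapper would print the message and exit with the returned code. Only the one event type is suppressed, so other events involving the same addresses still alert.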

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - [email protected]
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
