Aleksey Plekhanov created IGNITE-26209:
------------------------------------------
Summary: Add metrics to improve node network unavailability
detection
Key: IGNITE-26209
URL: https://issues.apache.org/jira/browse/IGNITE-26209
Project: Ignite
Issue Type: Improvement
Reporter: Aleksey Plekhanov
Assignee: Aleksey Plekhanov
Any metrics collection system samples metrics at discrete intervals. Between
two collections, a metric may exceed its critical threshold and return to a
normal level before the next collection. For example, a short-term
network unavailability of a node can lead to a significant increase in
operation latency. The problem with this particular node
could be detected by observing an increase in the size of the outgoing message
queue for the TCP Communication SPI on that node. However, if the collection
interval is longer than the node's downtime, such spikes
might go unnoticed. We need metrics that capture bursts in the message queues
accumulated by the Discovery/Communication SPIs over a certain
period of time.
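One possible shape for such a metric is a sliding-window maximum: the queue size is recorded on every change, and the collector reads the largest value seen during the last N seconds rather than the instantaneous value. The sketch below only illustrates the idea and is not Ignite code; the class name WindowedMaxGauge, its constructor parameters, and the injected clock are all hypothetical.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch (not an Ignite API): keeps the maximum value observed
 * within a sliding time window that is split into fixed-size buckets.
 * Assumes non-negative values (queue sizes) and a non-negative clock.
 */
class WindowedMaxGauge {
    private final long bucketMillis;   // time span covered by one bucket
    private final long[] maxima;       // maximum recorded in each bucket
    private final long[] epochs;       // which time slice each bucket currently holds
    private final LongSupplier clock;  // millisecond clock, injectable for tests

    WindowedMaxGauge(long windowMillis, int buckets, LongSupplier clock) {
        if (buckets <= 0 || windowMillis < buckets)
            throw new IllegalArgumentException("need at least 1 ms per bucket");
        this.bucketMillis = windowMillis / buckets;
        this.maxima = new long[buckets];
        this.epochs = new long[buckets];
        Arrays.fill(epochs, Long.MIN_VALUE); // marks every bucket as empty
        this.clock = clock;
    }

    /** Records a new observation, e.g. the current outgoing queue size. */
    synchronized void record(long value) {
        long epoch = clock.getAsLong() / bucketMillis;
        int slot = (int) (epoch % maxima.length);
        if (epochs[slot] != epoch) {
            // The bucket holds data from an older time slice: overwrite it.
            epochs[slot] = epoch;
            maxima[slot] = value;
        } else {
            maxima[slot] = Math.max(maxima[slot], value);
        }
    }

    /** Returns the largest value recorded within the window, or 0 if none. */
    synchronized long max() {
        long epoch = clock.getAsLong() / bucketMillis;
        long oldest = epoch - maxima.length + 1;
        long result = 0;
        for (int i = 0; i < maxima.length; i++) {
            // Skip buckets that are empty or have fallen out of the window.
            if (epochs[i] >= oldest && epochs[i] <= epoch)
                result = Math.max(result, maxima[i]);
        }
        return result;
    }
}
```

With a window at least as long as the collection interval, a queue spike that starts and ends between two collections still shows up in the next collected value. The synchronized methods keep the sketch simple; a real implementation on a hot messaging path would likely need a cheaper concurrency scheme.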
--
This message was sent by Atlassian Jira
(v8.20.10#820010)