Aleksey Plekhanov created IGNITE-26209:
------------------------------------------

             Summary: Add metrics to improve node network unavailability detection
                 Key: IGNITE-26209
                 URL: https://issues.apache.org/jira/browse/IGNITE-26209
             Project: Ignite
          Issue Type: Improvement
            Reporter: Aleksey Plekhanov
            Assignee: Aleksey Plekhanov


Any metrics collection system samples metrics at discrete intervals. Between 
two collections, a metric may exceed its critical threshold and return to 
normal before the next collection. For example, a short-term network 
unavailability of a node can lead to a significant increase in operation 
latency. The affected node could be identified by an increase in the size of 
the outgoing message queue for the TCP Communication SPI on that node. However, 
if the collection interval is longer than the node's downtime, such spikes 
may go unnoticed. We need metrics that record bursts in the accumulated 
message queues of the Discovery/Communication SPIs over a certain period of 
time.
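
One possible shape for such a metric is a "peak since last collection" gauge. 
The sketch below is only an illustration of the idea, not Ignite code: the 
class PeakQueueSizeTracker and its methods are hypothetical names, and it 
assumes the SPI can report the queue size whenever it changes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of the Ignite API): keeps the peak queue size between collections. */
final class PeakQueueSizeTracker {
    /** Largest queue size observed since the last collection. */
    private final AtomicLong peak = new AtomicLong();

    /** Assumed to be called by the SPI each time the queue size changes. */
    void record(long queueSize) {
        peak.accumulateAndGet(queueSize, Math::max);
    }

    /**
     * Assumed to be called by the metrics collector. Returns the peak for the
     * finished interval and seeds the next interval with the current size, so
     * a queue that stays long without changing is still reported.
     */
    long collectAndReset(long currentQueueSize) {
        long previousPeak = peak.getAndSet(currentQueueSize);
        return Math.max(previousPeak, currentQueueSize);
    }

    public static void main(String[] args) {
        PeakQueueSizeTracker tracker = new PeakQueueSizeTracker();

        // A short burst: the queue grows to 120 and drains before collection.
        tracker.record(3);
        tracker.record(120);
        tracker.record(2);

        // A plain gauge would report 2 here; the peak gauge reports 120.
        System.out.println(tracker.collectAndReset(2));
    }
}
```

With a gauge like this, a burst shorter than the collection interval still 
shows up in the next collected value. A time-windowed maximum would be an 
alternative if several independent collectors read the same metric.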



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
