Stanislav Lukyanov created IGNITE-10808:
-------------------------------------------
Summary: Discovery message queue may build up with
TcpDiscoveryMetricsUpdateMessage
Key: IGNITE-10808
URL: https://issues.apache.org/jira/browse/IGNITE-10808
Project: Ignite
Issue Type: Bug
Reporter: Stanislav Lukyanov
Attachments: IgniteMetricsOverflowTest.java
A node receives a new metrics update message every `metricsUpdateFrequency`
milliseconds, and the message will be put at the top of the queue (because it
is a high priority message).
If processing one message takes longer than `metricsUpdateFrequency`, multiple
`TcpDiscoveryMetricsUpdateMessage` instances accumulate in the queue. A long
enough delay (e.g. caused by a network glitch or a GC pause) may lead to the
queue building up tens of metrics update messages, which are essentially
useless to process.
Finally, if processing a message on average takes a little more than
`metricsUpdateFrequency` (even for a relatively short period of time, say, for
a minute due to network issues) then the message worker will end up processing
only the metrics updates and the cluster will essentially hang.
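The arithmetic behind the build-up and the slow recovery can be sketched as follows. This is a simplified model, not Ignite code: it assumes only metrics updates are in play, one arrives every `freqMs` milliseconds, each takes a fixed `processMs` to handle, and nothing is processed during the stall. The class name, method names and the sample values are hypothetical and chosen only for illustration.

```java
// Simplified model of the metrics update backlog (not Ignite code).
final class MetricsBacklogSketch {
    /** Number of metrics updates queued during a stall in which nothing is processed. */
    static long backlogAfterStall(long stallMs, long freqMs) {
        return stallMs / freqMs;
    }

    /**
     * Approximate time needed to drain the backlog once processing resumes,
     * or -1 if the backlog never drains.
     */
    static long drainTimeMs(long backlog, long processMs, long freqMs) {
        // If handling one message takes at least as long as the arrival
        // interval, new updates arrive as fast as old ones are consumed.
        if (processMs >= freqMs)
            return -1;

        // The queue shrinks by (1/processMs - 1/freqMs) messages per millisecond,
        // so drain time = backlog * processMs * freqMs / (freqMs - processMs).
        return backlog * processMs * freqMs / (freqMs - processMs);
    }

    public static void main(String[] args) {
        long freqMs = 2_000;   // illustrative metricsUpdateFrequency
        long backlog = backlogAfterStall(60_000, freqMs);

        System.out.println("backlog after 60 s stall: " + backlog);
        System.out.println("drain time at 1900 ms/msg: " + drainTimeMs(backlog, 1_900, freqMs) + " ms");
        System.out.println("drain time at 2100 ms/msg: " + drainTimeMs(backlog, 2_100, freqMs));
    }
}
```

Under these assumptions a one-minute stall queues 30 updates, and with processing only slightly faster than the arrival rate the backlog takes about 19 minutes to clear. Once processing is slower than the arrival rate, it never clears.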
A reproducer is attached. In the test, the queue first builds up and is then
drained very slowly, causing "Failed to wait for PME" messages.
We need to change ServerImpl's SocketReader so that it does not put another
metrics update message at the top of the queue if one is already there (or so
that it replaces the one at the top with the new one).
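A minimal sketch of the "replace the one at the top" variant is shown below. It is not the actual ServerImpl code: `CoalescingDiscoveryQueue`, `Msg` and the method names are hypothetical stand-ins for the real worker queue and message types, and the sketch only checks the message currently at the head of the queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the discovery message worker queue (not Ignite code).
final class CoalescingDiscoveryQueue {
    /** Hypothetical message type; the flag marks a metrics update. */
    static final class Msg {
        final String id;
        final boolean metricsUpdate;

        Msg(String id, boolean metricsUpdate) {
            this.id = id;
            this.metricsUpdate = metricsUpdate;
        }
    }

    private final Deque<Msg> queue = new ArrayDeque<>();

    /** Regular messages go to the tail of the queue. */
    synchronized void addRegular(Msg msg) {
        queue.addLast(msg);
    }

    /**
     * High priority messages go to the head. If the new message is a metrics
     * update and the head already holds one, the older one is dropped first,
     * so at most one metrics update sits at the top of the queue.
     */
    synchronized void addHighPriority(Msg msg) {
        Msg head = queue.peekFirst();

        if (msg.metricsUpdate && head != null && head.metricsUpdate)
            queue.pollFirst(); // Drop the older metrics update.

        queue.addFirst(msg);
    }

    synchronized Msg poll() {
        return queue.pollFirst();
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With this approach, consecutive metrics updates that arrive while the worker is busy collapse into a single entry, so the worker handles the most recent one and then moves on to the regular messages behind it.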