[ 
https://issues.apache.org/jira/browse/IGNITE-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730647#comment-16730647
 ] 

ASF GitHub Bot commented on IGNITE-10808:
-----------------------------------------

GitHub user dmekhanikov opened a pull request:

    https://github.com/apache/ignite/pull/5771

    IGNITE-10808 Ensure RingMessageWorker's progress.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-10808

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/5771.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5771
    
----
commit 8ccbec84568dc4058ea27dbfbbab0fa2e34477d9
Author: Denis Mekhanikov <dmekhanikov@...>
Date:   2018-12-26T15:37:45Z

    IGNITE-10808 Add a metrics overflow test.

commit b4ee4478f5e9819957ca52b962af42ab3d17452a
Author: Denis Mekhanikov <dmekhanikov@...>
Date:   2018-12-28T16:03:20Z

    IGNITE-10808 Drop stale TcpDiscoveryMetricsUpdateMessage upon arrival of fresh ones.

commit 1faec4bc05834dbe1ed7eaf2b5a3265ad503c385
Author: Denis Mekhanikov <dmekhanikov@...>
Date:   2018-12-29T10:39:05Z

    IGNITE-10808 Modify test.

commit 632dad49954247e1e16f6781a0be88b7e075c399
Author: Denis Mekhanikov <dmekhanikov@...>
Date:   2018-12-29T11:26:52Z

    IGNITE-10808 Track RingMessageWorker's progress.

----


> Discovery message queue may build up with TcpDiscoveryMetricsUpdateMessage
> --------------------------------------------------------------------------
>
>                 Key: IGNITE-10808
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10808
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.7
>            Reporter: Stanislav Lukyanov
>            Assignee: Denis Mekhanikov
>            Priority: Major
>              Labels: discovery
>             Fix For: 2.8
>
>         Attachments: IgniteMetricsOverflowTest.java
>
>
> A node receives a new metrics update message every `metricsUpdateFrequency`
> milliseconds, and the message is put at the top of the queue (because it
> is a high-priority message).
> If processing one message takes longer than `metricsUpdateFrequency`,
> multiple `TcpDiscoveryMetricsUpdateMessage` instances will be in the queue.
> A long enough delay (e.g. caused by a network glitch or GC) may lead to the
> queue building up tens of metrics update messages that are essentially
> useless to process. Finally, if processing a message takes, on average, a
> little more than `metricsUpdateFrequency` (even for a relatively short
> period of time, say, a minute due to network issues), the message worker
> will end up processing only the metrics updates and the cluster will
> essentially hang.
> A reproducer is attached. In the test, the queue first builds up and is then
> torn down very slowly, causing "Failed to wait for PME" messages.
> ServerImpl's SocketReader needs to be changed so that it does not put
> another metrics update message at the top of the queue if it already has one
> (or so that it replaces the one at the top with the new one).
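
The "replace the one at the top" variant proposed in the description can be sketched roughly as follows. This is a minimal illustration only, not the code from the pull request or from Ignite's ServerImpl: the class MetricsCoalescingQueue, the Msg stand-in, and the "METRICS"/"CUSTOM" type strings are all hypothetical names introduced for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a worker queue that coalesces metrics updates at its head. */
final class MetricsCoalescingQueue {
    /** Minimal stand-in for a discovery message (not an Ignite class). */
    static final class Msg {
        public final String type; // "METRICS" or any other type, e.g. "CUSTOM"
        public final long id;

        Msg(String type, long id) {
            this.type = type;
            this.id = id;
        }

        boolean isMetricsUpdate() {
            return "METRICS".equals(type);
        }
    }

    private final Deque<Msg> queue = new ArrayDeque<>();

    /** Regular messages are appended to the tail. */
    synchronized void addRegular(Msg msg) {
        queue.addLast(msg);
    }

    /**
     * High-priority messages go to the head. If the new message is a metrics
     * update and the current head is also one, the stale head is dropped first,
     * so consecutive metrics updates never pile up at the top of the queue.
     */
    synchronized void addHighPriority(Msg msg) {
        Msg head = queue.peekFirst();

        if (msg.isMetricsUpdate() && head != null && head.isMetricsUpdate())
            queue.pollFirst(); // Drop the stale metrics update.

        queue.addFirst(msg);
    }

    /** Returns and removes the head of the queue, or null if it is empty. */
    synchronized Msg poll() {
        return queue.pollFirst();
    }

    synchronized int size() {
        return queue.size();
    }
}
```

Note that this sketch only inspects the head of the queue, so a different high-priority message arriving between two metrics updates would still leave both updates queued; a real fix may need to handle that case.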



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
