[jira] [Comment Edited] (IGNITE-13012) Fix failure detection timeout. Simplify node ping routine.

Vladimir Steshin (Jira) Tue, 16 Jun 2020 10:02:18 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-13012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136824#comment-17136824
 ]


Vladimir Steshin edited comment on IGNITE-13012 at 6/16/20, 5:01 PM:
---------------------------------------------------------------------

[~avinogradov], I've attached the patch. It adds:

* JmhNodeFailureDetection. This is not an ordinary JMH benchmark, I believe. Because we have to 
start/wait/fail a node, the detection time is only a small piece of each run, so the 
fixed and not-fixed results are close. I made my own runs and collected the timings to 
prepare the output. 

Example output with the fix:
{code:java}
Detection delay: 294. Failure detection timeout: 300
Total detection delay: 5477

# Run complete. Total time: 00:01:23
Benchmark                                          Mode  Cnt   Score   Error    Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt       10,954          ops/min
{code}

vs. without the fix:

{code:java}
Detection delay: 539. Failure detection timeout: 300
Total detection delay: 11370

# Run complete. Total time: 00:01:41

Benchmark                                          Mode  Cnt  Score   Error    Units
JmhNodeFailureDetection.measureTotalForTheOutput  thrpt       5,276          ops/min
{code}

* A new test:
{code:java}TcpDiscoveryNetworkIssuesTest.testNodeFailureDetectedWithinConfiguredTimeout(){code}
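The two sample outputs above can be compared with a little arithmetic. The sketch below is only an illustration: the class and method names are hypothetical and not part of the patch, and the inputs are the numbers printed in the sample outputs (detection delay 294 vs 539, total 5477 vs 11370, timeout 300).

```java
// Hypothetical helper, not part of the patch: compares the sample numbers quoted above.
final class DetectionDelayComparison {
    /** How far a measured delay exceeds the configured timeout; negative means it stayed within it. */
    static long overshootMs(long detectionDelayMs, long failureDetectionTimeoutMs) {
        return detectionDelayMs - failureDetectionTimeoutMs;
    }

    /** Ratio of the not-fixed delay to the fixed delay. */
    static double ratio(long notFixedMs, long fixedMs) {
        return (double) notFixedMs / fixedMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 300; // "Failure detection timeout: 300" in both outputs

        System.out.println("Fixed overshoot, ms:     " + overshootMs(294, timeoutMs));
        System.out.println("Not-fixed overshoot, ms: " + overshootMs(539, timeoutMs));
        System.out.printf("Single delay ratio: %.2f%n", ratio(539, 294));
        System.out.printf("Total delay ratio:  %.2f%n", ratio(11370, 5477));
    }
}
```

In these samples the fixed run detects the failure within the configured timeout, while the not-fixed run exceeds it, and the total detection delay is roughly halved.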


> Fix failure detection timeout. Simplify node ping routine.
> ----------------------------------------------------------
>
>                 Key: IGNITE-13012
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13012
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.8.1
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: iep-45
>         Attachments: IGNITE-13012-patch.patch
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> A connection failure may not be detected within 
> IgniteConfiguration.failureDetectionTimeout. The actual worst-case delay is 
> ServerImpl.CON_CHECK_INTERVAL + IgniteConfiguration.failureDetectionTimeout. 
> The node ping routine is also duplicated.
> We should fix:
> 1. The failure detection timeout should take into account the last sent message. 
> The current ping is bound to its own timestamp:
> {code:java}ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent{code}
> This is odd because any discovery message checks the connection. 
> 2. Make the connection check interval depend on the failure detection timeout (FDT). 
> The current value is a constant:
> {code:java}static int ServerImpl.CON_CHECK_INTERVAL = 500{code}
> 3. Remove the additional, accelerated connection checking. Once we do fix 1, this 
> will become even more useless.
> Although TCP discovery has a connection check period, it may send a ping 
> before this period elapses. This premature node ping relies on the time of 
> any sent or even any received message. 
> 4. Do not worry the user with “Node seems disconnected” when everything is OK. 
> Once we do fixes 1 and 3, this will become even more useless. 
> A node may log at INFO: “Local node seems to be disconnected from topology …” 
> whereas it is not actually disconnected at all.
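The timing described in points 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the class, the method names, and the `divisor` parameter are hypothetical. Only the constant 500 and the worst-case formula come from the description above.

```java
// Hypothetical sketch of the timing logic discussed in the issue; not the actual ServerImpl code.
final class ConnectionCheckSketch {
    /** Current behavior per the description: a fixed interval, independent of the timeout. */
    static final long CON_CHECK_INTERVAL_MS = 500;

    /** Worst-case detection delay as stated in the issue: CON_CHECK_INTERVAL + failureDetectionTimeout. */
    static long currentWorstCaseDelayMs(long failureDetectionTimeoutMs) {
        return CON_CHECK_INTERVAL_MS + failureDetectionTimeoutMs;
    }

    /**
     * Point 2: derive the check interval from the failure detection timeout.
     * The divisor is an assumption made for illustration; the issue does not state one.
     */
    static long derivedCheckIntervalMs(long failureDetectionTimeoutMs, int divisor) {
        return failureDetectionTimeoutMs / divisor;
    }

    /**
     * Point 1: a connection check is due only if no message of any kind was sent
     * during the last check interval, since any discovery message checks the connection.
     */
    static boolean pingDue(long nowMs, long lastSentMsgTimeMs, long checkIntervalMs) {
        return nowMs - lastSentMsgTimeMs >= checkIntervalMs;
    }

    public static void main(String[] args) {
        long fdtMs = 300; // the timeout used in the benchmark runs

        System.out.println("Current worst case, ms: " + currentWorstCaseDelayMs(fdtMs)); // 500 + 300 = 800

        long intervalMs = derivedCheckIntervalMs(fdtMs, 4); // divisor 4 is an arbitrary example
        System.out.println("Derived interval, ms:   " + intervalMs);
        System.out.println("Ping due 50 ms after last message: " + pingDue(1050, 1000, intervalMs));
        System.out.println("Ping due 75 ms after last message: " + pingDue(1075, 1000, intervalMs));
    }
}
```

Tying the check to the last sent message means that regular discovery traffic postpones the dedicated ping, and deriving the interval from the timeout keeps the check period from exceeding the timeout it is supposed to enforce.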



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
