[ https://issues.apache.org/jira/browse/IGNITE-8985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542723#comment-16542723 ]

Dmitry Karachentsev commented on IGNITE-8985:
---------------------------------------------

Here are a few things that caused this behavior.
1. One node was killed.
2. The previous node in the ring could not connect to it and tried to move 
on to the node after the killed one.
3. With a failure detection timeout of 60 secs, the connection check 
frequency is 60 / 3 = 20 secs. So the previous node is treated as failed 
only if there was no message during 20 secs. On the other hand, the 
connection recovery timeout is just 10 secs, so recovery gives up before 
the check would ever declare the dead node failed (see the configuration 
sketch after this list).
4. Another problem is that each node has two loopback addresses, and one of 
them, 172.17.0.1:47500, is not recognized as localhost and was probed. In 
other words, the node checked a connection to itself.
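
To make the mismatch concrete, here is a minimal configuration sketch, 
assuming Ignite's public API; the 60 / 3 divisor mirrors the description 
above rather than a documented constant.

{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class DiscoveryTimeouts {
    public static void main(String[] args) {
        long failureDetectionTimeout = 60_000L; // 60 secs, as in the reported setup

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();

        // 10 secs is the value from the issue; the warning in the log below
        // notes that setting this to 0 disables the segmentation behavior.
        discoSpi.setConnectionRecoveryTimeout(10_000L);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureDetectionTimeout(failureDetectionTimeout)
            .setDiscoverySpi(discoSpi);

        // Per point 3 above: the ring connection check runs at
        // failureDetectionTimeout / 3, so a silent previous node is declared
        // failed only after 20 secs, while recovery gives up after 10 secs.
        long connCheckFreq = failureDetectionTimeout / 3; // 20_000 ms
        System.out.println("connCheckFreq = " + connCheckFreq + " ms");
    }
}
{code}

With these numbers the recovering node exhausts its 10-sec recovery window 
and segments itself before the 20-sec check ever marks its dead neighbor 
as failed.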

To fix this, the loopback check from the IGNITE-8683 ticket should be 
applied, and IGNITE-8944 should be added so that a node is marked as failed 
faster.
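
For illustration only, here is a minimal sketch of the kind of loopback 
check IGNITE-8683 is about: before probing an address as a remote peer, 
verify it is not bound to the local node. The helper isLocalAddress is 
hypothetical, not Ignite's actual API.

{code:java}
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;

public class LoopbackCheck {
    // Hypothetical helper: true if addr refers to this host and so should
    // be skipped when checking connectivity to other ring nodes.
    static boolean isLocalAddress(InetAddress addr) throws SocketException {
        // Loopback and wildcard addresses always refer to the local host.
        if (addr.isLoopbackAddress() || addr.isAnyLocalAddress())
            return true;

        // Otherwise see whether the address is bound to a local interface,
        // e.g. 172.17.0.1 on the docker0 bridge as in this issue.
        return NetworkInterface.getByInetAddress(addr) != null;
    }

    public static void main(String[] args) throws Exception {
        InetAddress addr = InetAddress.getByName("172.17.0.1");
        System.out.println(addr + " is local: " + isLocalAddress(addr));
    }
}
{code}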

> Node segmented itself after connRecoveryTimeout
> -----------------------------------------------
>
>                 Key: IGNITE-8985
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8985
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Dmitry Karachentsev
>            Priority: Major
>         Attachments: Archive.zip
>
>
> I can see the following message in logs:
> [2018-07-10 16:27:13,111][WARN ][tcp-disco-msg-worker-#2] Unable to connect 
> to next nodes in a ring, it seems local node is experiencing connectivity 
> issues. Segmenting local node to avoid case when one node fails a big part of 
> cluster. To disable that behavior set 
> TcpDiscoverySpi.setConnectionRecoveryTimeout() to 0. 
> [connRecoveryTimeout=10000, effectiveConnRecoveryTimeout=10000]
> [2018-07-10 16:27:13,112][WARN ][disco-event-worker-#61] Local node 
> SEGMENTED: TcpDiscoveryNode [id=e1a19d8e-2253-458c-9757-e3372de3bef9, 
> addrs=[127.0.0.1, 172.17.0.1, 172.25.1.17], sockAddrs=[/172.17.0.1:47500, 
> lab17.gridgain.local/172.25.1.17:47500, /127.0.0.1:47500], discPort=47500, 
> order=2, intOrder=2, lastExchangeTime=1531229233103, loc=true, 
> ver=2.4.7#20180710-sha1:a48ae923, isClient=false]
> I have a failure detection timeout of 60_000 ms, and during the test GC 
> pauses were < 25 secs, so I don't expect the node to be segmented.
>  
> Logs are attached.
>  


