[ https://issues.apache.org/jira/browse/IGNITE-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Magda updated IGNITE-2656:
--------------------------------
    Priority: Minor  (was: Major)

> Documentation on debugging and fixing the reasons of node disconnection from
> the cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-2656
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2656
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Magda
>            Assignee: Denis Magda
>            Priority: Minor
>             Fix For: 1.8
>
>
> Sometimes a node can be abruptly kicked off from the cluster for some reason.
> The documentation must contain information on how to get to the root of the
> issue by looking at log files. Usually the log of the node that was kicked off
> contains a "Local node segmented" message, and the log of the node that failed
> its next neighbor contains a more detailed message, "Failed to send message to
> next node".
> Next, the article must list possible reasons for the disconnection:
> - long GC pauses. Give recommendations on how to check;
> - high node utilization, so that the node responds with a delay;
> - network configuration parameters set too low for the environment.
> There should be a section about
> {{IgniteConfiguration.failureDetectionTimeout}} describing its behavior and
> showing all its pros and cons.
> The article must say when it makes sense to 'disable' this timeout by
> switching to explicit configuration of TcpDiscoverySpi.socketTimeout,
> TcpDiscoverySpi.ackTimeout, TcpDiscoverySpi.maxAckTimeout,
> TcpDiscoverySpi.reconnectCount. Pros and cons of manual configuration have to
> be mentioned as well.
>
> I would also list the usage of TcpDiscoverySpi.joinTimeout and
> TcpDiscoverySpi.networkTimeout (used on client reconnect, server wait for
> join result, node stop, socket reader first message) there.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
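
The log check described in the issue could be sketched roughly as follows. This is only an illustration, not part of Ignite: the class name SegmentationLogScan and the method findMarkers are hypothetical, and the only things taken from the issue are the two message substrings. It assumes plain-text, UTF-8 log files and does a simple substring match, so the real log line layout does not matter.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an Ignite class: scans log files for the two
// messages quoted in the issue description.
public class SegmentationLogScan {
    // Logged on the node that was kicked off (per the issue text).
    static final String SEGMENTED = "Local node segmented";
    // Logged on the node that failed its next neighbor (per the issue text).
    static final String SEND_FAILED = "Failed to send message to next node";

    /** Returns "lineNumber: text" for every line containing either message. */
    static List<String> findMarkers(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            String line = logLines.get(i);
            if (line.contains(SEGMENTED) || line.contains(SEND_FAILED)) {
                hits.add((i + 1) + ": " + line); // 1-based line numbers
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java SegmentationLogScan <log file>...");
            return;
        }
        for (String path : args) {
            // Assumption: the file is UTF-8 text small enough to read at once.
            List<String> lines = Files.readAllLines(Paths.get(path));
            for (String hit : findMarkers(lines)) {
                System.out.println(path + ":" + hit);
            }
        }
    }
}
```

Running it over the logs of all nodes and comparing the timestamps of the matching lines should show which node was segmented and which neighbor reported the failed send.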