[
https://issues.apache.org/jira/browse/IGNITE-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denis Magda updated IGNITE-2656:
--------------------------------
Fix Version/s: 2.1 (was: 2.0)
> Documentation on debugging and fixing the reasons of node disconnection from
> the cluster
> ----------------------------------------------------------------------------------------
>
> Key: IGNITE-2656
> URL: https://issues.apache.org/jira/browse/IGNITE-2656
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Magda
> Assignee: Denis Magda
> Priority: Minor
> Fix For: 2.1
>
>
> Sometimes a node is abruptly kicked out of the cluster for some reason.
> The documentation must explain how to get to the root of the issue by
> looking at the log files. Usually the log of the node that was kicked out
> contains a "Local node segmented" message, and the node that marked its next
> neighbor as failed logs a more detailed message, "Failed to send message to
> next node".
> Next, the article must list possible reasons for the disconnection:
> - long GC pauses (give recommendations on how to check for them);
> - high node utilization, so that the node responds with a delay;
> - network configuration parameters set too low for the environment.
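
As a starting point for the log-based diagnosis described above, here is a minimal sketch that scans a log file for the two message fragments quoted in the description. The class name `DisconnectLogScan` and the method `findDisconnectHints` are hypothetical, and the sketch assumes nothing about the log line format beyond those two substrings. It is not part of Ignite.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an Ignite class.
class DisconnectLogScan {
    // Message fragments taken verbatim from the issue description.
    static final String SEGMENTED = "Local node segmented";
    static final String SEND_FAILED = "Failed to send message to next node";

    /** Returns matching lines, each prefixed with its 1-based line number. */
    static List<String> findDisconnectHints(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            String line = logLines.get(i);
            if (line.contains(SEGMENTED) || line.contains(SEND_FAILED)) {
                hits.add((i + 1) + ": " + line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DisconnectLogScan <log-file>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for huge logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        findDisconnectHints(lines).forEach(System.out::println);
    }
}
```

Running it against the logs of both the segmented node and its previous neighbor gives the timestamps to compare with GC logs and load metrics for the same period.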
> There should be a section about
> {{IgniteConfiguration.failureDetectionTimeout}} describing its behavior and
> showing all its pros and cons.
> The article must say when it makes sense to 'disable' this timeout by
> switching to explicit configuration of TcpDiscoverySpi.socketTimeout,
> TcpDiscoverySpi.ackTimeout, TcpDiscoverySpi.maxAckTimeout,
> TcpDiscoverySpi.reconnectCount. The pros and cons of manual configuration
> have to be mentioned as well.
>
> I would also describe the usage of TcpDiscoverySpi.joinTimeout and
> TcpDiscoverySpi.networkTimeout (used on client reconnect, when a server waits
> for the join result, on node stop, and for the socket reader's first message)
> there.
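
To make the two configuration styles concrete, here is a minimal Java sketch that sets either the single failure detection timeout or the explicit discovery timeouts named in the description. All numeric values are placeholders chosen for illustration, not recommendations, and the class name `DiscoveryTimeoutsExample` is hypothetical. As far as I understand the TcpDiscoverySpi Javadoc, setting any of socketTimeout, ackTimeout, maxAckTimeout or reconnectCount explicitly causes failureDetectionTimeout to be ignored; this should be verified against the Ignite version in use.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

// Hypothetical example class; requires the ignite-core dependency on the classpath.
public class DiscoveryTimeoutsExample {
    /** Style A: a single knob, everything else is derived by Ignite. */
    static IgniteConfiguration withFailureDetectionTimeout() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureDetectionTimeout(30_000L); // placeholder value, in ms
        return cfg;
    }

    /** Style B: explicit discovery timeouts (assumed to override style A). */
    static IgniteConfiguration withExplicitTimeouts() {
        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setSocketTimeout(5_000L);   // placeholder value, in ms
        disco.setAckTimeout(5_000L);      // placeholder value, in ms
        disco.setMaxAckTimeout(60_000L);  // placeholder value, in ms
        disco.setReconnectCount(10);      // placeholder value

        // Timeouts that are configured separately from failure detection.
        disco.setJoinTimeout(60_000L);    // placeholder value, in ms
        disco.setNetworkTimeout(10_000L); // placeholder value, in ms

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(disco);
        return cfg;
    }

    public static void main(String[] args) {
        // Building the configurations does not start a node; pass one of them
        // to Ignition.start(cfg) to actually join a cluster.
        System.out.println(withFailureDetectionTimeout().getFailureDetectionTimeout());
        System.out.println(withExplicitTimeouts().getDiscoverySpi());
    }
}
```

Keeping the two styles in separate factory methods makes it harder to mix them by accident, which is the situation where the precedence between them matters.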
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)