[jira] [Updated] (IGNITE-2656) Documentation on debugging and fixing the reasons of node disconnection from the cluster

Denis Magda (JIRA) Wed, 05 Apr 2017 19:13:13 -0700

     [ 
https://issues.apache.org/jira/browse/IGNITE-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Denis Magda updated IGNITE-2656:
--------------------------------
    Fix Version/s:     (was: 2.0)
                   2.1

> Documentation on debugging and fixing the reasons of node disconnection from 
> the cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-2656
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2656
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Magda
>            Assignee: Denis Magda
>            Priority: Minor
>             Fix For: 2.1
>
>
> Sometimes a node can be abruptly kicked out of the cluster for some reason.
> The documentation must explain how to get to the root of the issue by 
> looking at the log files. Usually the log of the node that was kicked out 
> contains a "Local node segmented" message, and the log of the node that 
> declared its next neighbor failed contains a more detailed message, 
> "Failed to send message to next node".
> Next, the article must list possible reasons for the disconnection:
> - long GC pauses (give recommendations on how to check for them);
> - high node utilization, so that the node responds with a delay;
> - network configuration parameters set too low for the environment.
> There should be a section about 
> {{IgniteConfiguration.failureDetectionTimeout}} describing its behavior and 
> showing all its pros and cons.
> The article must say when it makes sense to 'disable' this timeout by 
> switching to explicit configuration of TcpDiscoverySpi.socketTimeout, 
> TcpDiscoverySpi.ackTimeout, TcpDiscoverySpi.maxAckTimeout, and 
> TcpDiscoverySpi.reconnectCount. The pros and cons of manual configuration 
> have to be mentioned as well.
>   
> I would also list the usage of TcpDiscoverySpi.joinTimeout and 
> TcpDiscoverySpi.networkTimeout (used on client reconnect, when servers wait 
> for the join result, on node stop, and for the socket reader's first 
> message) there.
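
The log check described above could be sketched roughly as follows. This is only an illustration, not part of Ignite: the function name, the node names, and the sample log lines are made up, and the only details taken from the issue are the two message fragments. Real log lines will carry timestamps, logger names, and more detail.

```python
# Message fragments quoted in the issue description.
SEGMENTED = "Local node segmented"
SEND_FAILED = "Failed to send message to next node"


def find_disconnect_clues(lines_by_node):
    """Return {node: [(line_number, marker), ...]} for nodes whose log
    contains at least one of the two markers. Line numbers are 1-based."""
    clues = {}
    for node, lines in lines_by_node.items():
        hits = []
        for number, line in enumerate(lines, start=1):
            for marker in (SEGMENTED, SEND_FAILED):
                if marker in line:
                    hits.append((number, marker))
        if hits:
            clues[node] = hits
    return clues


if __name__ == "__main__":
    # Hypothetical log excerpts, invented for this sketch.
    sample = {
        "node-a": ["INFO  started", "WARN  Local node segmented"],
        "node-b": ["ERROR Failed to send message to next node [details]"],
        "node-c": ["INFO  nothing unusual"],
    }
    for node, hits in find_disconnect_clues(sample).items():
        for number, marker in hits:
            print(f"{node}: line {number}: {marker}")
```

A node that reports the first marker is the one that was dropped; a node that reports the second is the neighbor whose log is worth reading for the details.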



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
