[jira] [Commented] (IGNITE-752) Speed up failure detection

Denis Magda (JIRA) Thu, 23 Jul 2015 00:49:01 -0700

    [ 
https://issues.apache.org/jira/browse/IGNITE-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638407#comment-14638407
 ]


Denis Magda commented on IGNITE-752:
------------------------------------

So, here is what has been done so far:
- introduced a failure detection threshold. It is used to detect node failures 
quickly, so there is no need to set up many different timeouts and other 
parameters. It is supported by both TcpDiscoverySpi and TcpCommunicationSpi;
- optimized TcpDiscoverySpi performance for server nodes: removed the 
heartbeat and status check senders (two threads);
- introduced a connection check message that is sent to the next node in the 
ring to check that the connection is alive. The connection check frequency is 
calculated automatically from the failure detection threshold;
- the connection check message is Externalizable, which allows sending only 
the minimal data required across the ring.
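
To illustrate the last two points, here is a minimal, self-contained sketch. It is 
not the code from the patch: the class names, the checkFrequencyMs helper and the 
"N checks per threshold window" rule are all assumptions made for illustration. 
Only java.io.Externalizable and the standard object streams are real APIs here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class ConnectionCheckSketch {
    /**
     * Hypothetical connection check message (not the class from the patch).
     * It has no fields, so writeExternal/readExternal transfer no payload.
     */
    public static class ConnectionCheckMessage implements Externalizable {
        /** Externalizable requires a public no-arg constructor. */
        public ConnectionCheckMessage() {
        }

        @Override public void writeExternal(ObjectOutput out) throws IOException {
            // Nothing to write: the message type itself is the signal.
        }

        @Override public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            // Nothing to read.
        }
    }

    /**
     * Assumed rule: spread checksPerThreshold checks evenly over one
     * failure detection threshold window. The real formula may differ.
     */
    public static long checkFrequencyMs(long failureDetectionThresholdMs, int checksPerThreshold) {
        if (failureDetectionThresholdMs <= 0 || checksPerThreshold <= 0)
            throw new IllegalArgumentException("Threshold and check count must be positive.");

        return Math.max(1, failureDetectionThresholdMs / checksPerThreshold);
    }

    /** Serializes an object with standard Java serialization. */
    public static byte[] serialize(Object msg) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(msg);
        }

        return bytes.toByteArray();
    }

    /** Reads back an object written by serialize(). */
    public static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // 10 s threshold, 4 checks per window: one check every 2500 ms.
        System.out.println("check every " + checkFrequencyMs(10_000, 4) + " ms");

        byte[] wire = serialize(new ConnectionCheckMessage());

        System.out.println("serialized size: " + wire.length + " bytes");
        System.out.println("round trip ok: " + (deserialize(wire) instanceof ConnectionCheckMessage));
    }
}
```

The point of the sketch is that an Externalizable class fully controls what it 
writes, so a message with no state adds no field data to the stream.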


Yakov, the new patch is available, please review.

> Speed up failure detection
> --------------------------
>
>                 Key: IGNITE-752
>                 URL: https://issues.apache.org/jira/browse/IGNITE-752
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Yakov Zhdanov
>            Assignee: Denis Magda
>            Priority: Blocker
>             Fix For: sprint-7
>
>         Attachments: 882.patch
>
>
> I think we can (1) make grid configuration significantly easier and (2) speed 
> up failure detection.
> Here are the discovery SPI configuration properties that are responsible for 
> failure detection:
> # reconnectCount,
> # sockTimeout,
> # networkTimeout,
> # ackTimeout,
> # maxAckTimeout,
> # heartbeatFrequency,
> # maxMissedHeartbeats
> Same for communication SPI
> # reconnectCount, 
> # maxConnTimeout, 
> # connTimeout
> So, we have 10 or even more properties.
> We did this to address the half-open sockets problem (which is pretty common 
> in cloud environments) and GC pauses that may happen on cluster nodes: we can 
> increase ack timeouts to prevent such nodes from being kicked out of the 
> topology.
> By setting values for these properties I effectively set the timeout for 
> failure detection. Why do we need such a large number of parameters instead 
> of having one on IgniteConfiguration: nodeResponseThreshold (or 
> failureDetectionThreshold; can anyone propose a better name?)
> All other parameters would be calculated automatically (I think the user 
> could still set some of them for full control over the situation; we need to 
> decide whether this is needed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
