[ https://issues.apache.org/jira/browse/IGNITE-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638407#comment-14638407 ]

Denis Magda commented on IGNITE-752:
------------------------------------

So, what has been done by this point:
- introduced a failure detection threshold. It is used to detect node failures quickly, so there is no need to set up tons of different timeouts and other parameters. Supported for both TcpDiscoverySpi and TcpCommunicationSpi;
- performance optimizations of TcpDiscoverySpi for server nodes: removed the heartbeat sender and the status check sender (two threads);
- introduced a connection check message that is sent to the next node to check connection aliveness. The connection check frequency is calculated from the failure detection threshold automatically;
- the connection check message is Externalizable, which allows sending only the minimal data required across the ring.

Yakov, the new patch is available, please review.

> Speed up failure detection
> --------------------------
>
>                 Key: IGNITE-752
>                 URL: https://issues.apache.org/jira/browse/IGNITE-752
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Yakov Zhdanov
>            Assignee: Denis Magda
>            Priority: Blocker
>             Fix For: sprint-7
>
>         Attachments: 882.patch
>
>
> I think we can (1) make grid configuration significantly easier and (2) speed
> up failure detection.
> Here are the discovery SPI configuration properties which are responsible for
> failure detection:
> # reconnectCount,
> # sockTimeout,
> # networkTimeout,
> # ackTimeout,
> # maxAckTimeout,
> # heartbeatFrequency
> # maxMissedHeartbeats
> Same for communication SPI
> # reconnectCount,
> # maxConnTimeout,
> # connTimeout
> So, we have 10 or even more properties.
> We did it to address the half-opened sockets problem (which is pretty common
> in cloud environments) and GC pauses which may happen on cluster nodes - we
> can increase ack timeouts to prevent nodes from being kicked out of the
> topology.
> By setting values for these properties I effectively set the timeout for
> failure detection. Why do we need such a great number of parameters instead
> of having a single one on IgniteConfiguration - nodeResponseThreshold (or
> failureDetectionThreshold - can anyone propose a better name?).
> All other parameters will be calculated automatically (I think the user can
> still set some of them for full control over the situation - need to decide
> if this is needed.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
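
For readers following the thread, the single-threshold idea can be sketched in plain Java as below. This is only an illustration under stated assumptions, not the code from the attached patch: the class FailureDetectionSketch, its method names, and the ratio of four connection checks per threshold are all hypothetical, since the comment does not say how the check frequency is derived.

```java
/**
 * Hypothetical illustration of deriving failure detection behaviour from a
 * single threshold. Not Ignite code; all names here are assumptions.
 */
public final class FailureDetectionSketch {
    /** The only user-facing setting: how long a node may stay silent, in ms. */
    private final long thresholdMs;

    /** Assumed number of connection checks sent within one threshold window. */
    private final int checksPerThreshold;

    public FailureDetectionSketch(long thresholdMs, int checksPerThreshold) {
        if (thresholdMs <= 0 || checksPerThreshold <= 0)
            throw new IllegalArgumentException("Both arguments must be positive");

        this.thresholdMs = thresholdMs;
        this.checksPerThreshold = checksPerThreshold;
    }

    /** Interval between connection check messages, derived from the threshold. */
    public long connectionCheckIntervalMs() {
        // Never return 0, even for very small thresholds.
        return Math.max(1, thresholdMs / checksPerThreshold);
    }

    /** Returns true if the node has been silent for longer than the threshold. */
    public boolean isFailed(long lastResponseMs, long nowMs) {
        return nowMs - lastResponseMs > thresholdMs;
    }

    public static void main(String[] args) {
        FailureDetectionSketch sketch = new FailureDetectionSketch(10_000, 4);

        System.out.println("check interval, ms: " + sketch.connectionCheckIntervalMs());
        System.out.println("failed after 10001 ms of silence: " + sketch.isFailed(0, 10_001));
    }
}
```

The point of the sketch is that the user supplies one number and everything else (here, only the check interval) is computed from it, which is the simplification the issue asks for.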