[ https://issues.apache.org/jira/browse/IGNITE-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645728#comment-14645728 ]
Denis Magda commented on IGNITE-752:
------------------------------------

Reopened the ticket because the following changes were applied to the code after benchmarking the 'node left' case on Amazon EC2:
- the connection check message is now sent even if the failure detection timeout is not used;
- removed the unnecessary {{pingNode}} call when a remote node fails in {{sendMessageAcrossRing}};
- added info on the lowest failure detection timeout for lower-latency networks: 120 ms, as determined on AWS.

Attached:
- a new patch;
- a performance comparison plot that shows how quickly a topology recovers when a node leaves it. No drop compared to the previous results.

Yakov, please review one more time.

> Speed up failure detection
> --------------------------
>
>                 Key: IGNITE-752
>                 URL: https://issues.apache.org/jira/browse/IGNITE-752
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Yakov Zhdanov
>            Assignee: Denis Magda
>            Priority: Blocker
>             Fix For: sprint-7
>
>         Attachments: 475-2.patch, 882.patch, failure_detection_timeout_node_left.zip, ignite-752.patch
>
>
> I think we can (1) make grid configuration significantly easier and (2) speed up failure detection.
> Here are the discovery SPI configuration properties that are responsible for failure detection:
> # reconnectCount,
> # sockTimeout,
> # networkTimeout,
> # ackTimeout,
> # maxAckTimeout,
> # heartbeatFrequency,
> # maxMissedHeartbeats
> Same for the communication SPI:
> # reconnectCount,
> # maxConnTimeout,
> # connTimeout
> So, we have 10 or even more properties.
> We did it to address the half-open sockets problem (which is pretty common in cloud environments) and GC pauses that may happen on cluster nodes: we can increase ack timeouts to prevent such nodes from being kicked out of the topology.
> By setting values for these properties, I set the timeout for failure detection.
> Why do we need such a great number of parameters instead of having one on IgniteConfiguration: nodeResponseThreshold (or failureDetectionThreshold; can anyone propose a better name)?
> All other parameters would be calculated automatically (I think the user can still set some of them for full control over the situation; we need to decide whether this is needed).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
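
The proposal in the quoted description (one threshold, with the other timeouts calculated from it) could be sketched roughly as follows. This is only an illustration: the class name, the method, and the way the budget is split are hypothetical and do not reflect the attached patch or Ignite's actual calculation. Only the property names used as map keys come from the ticket.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FailureDetectionSketch {
    /**
     * Hypothetical derivation: splits one failure detection budget evenly
     * across reconnect attempts. Not Ignite's real formula.
     */
    public static Map<String, Long> deriveTimeouts(long failureDetectionTimeoutMs, int reconnectCount) {
        if (failureDetectionTimeoutMs <= 0 || reconnectCount <= 0)
            throw new IllegalArgumentException("Timeout and reconnect count must be positive.");

        // Each attempt gets an equal share of the total budget (assumption).
        long perAttemptMs = failureDetectionTimeoutMs / reconnectCount;

        Map<String, Long> derived = new LinkedHashMap<>();
        derived.put("sockTimeout", perAttemptMs);
        derived.put("ackTimeout", perAttemptMs);
        // The ack timeout is never allowed to grow past the whole budget (assumption).
        derived.put("maxAckTimeout", failureDetectionTimeoutMs);
        return derived;
    }

    public static void main(String[] args) {
        // Example: a 10 second budget shared by 2 reconnect attempts.
        System.out.println(deriveTimeouts(10_000L, 2));
    }
}
```

With a scheme like this the user tunes a single number, and an explicitly set property could still take precedence over the derived value if full control is needed.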