[jira] [Commented] (IGNITE-752) Speed up failure detection

Denis Magda (JIRA) Wed, 29 Jul 2015 01:54:16 -0700

    [ 
https://issues.apache.org/jira/browse/IGNITE-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645728#comment-14645728
 ]


Denis Magda commented on IGNITE-752:
------------------------------------

Reopened the ticket because the following changes were applied to the code 
after benchmarking the 'node left' case on Amazon EC2:
- a connection check message is now sent even if the failure detection timeout 
is not used;
- removed the unnecessary {{pingNode}} call made when a remote node fails in 
{{sendMessageAcrossRing}};
- added info on the lowest failure detection timeout for low-latency networks: 
120 ms, as determined on AWS.

Attached:
- new patch;
- performance comparison plot that shows how quickly a topology recovers when 
a node leaves it. No degradation compared to the previous results.

Yakov, please review one more time.

> Speed up failure detection
> --------------------------
>
>                 Key: IGNITE-752
>                 URL: https://issues.apache.org/jira/browse/IGNITE-752
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Yakov Zhdanov
>            Assignee: Denis Magda
>            Priority: Blocker
>             Fix For: sprint-7
>
>         Attachments: 475-2.patch, 882.patch, 
> failure_detection_timeout_node_left.zip, ignite-752.patch
>
>
> I think we can (1) make grid configuration significantly easier and (2) speed 
> up failure detection.
> Here are the discovery SPI configuration properties that are responsible for 
> failure detection:
> # reconnectCount,
> # sockTimeout,
> # networkTimeout,
> # ackTimeout,
> # maxAckTimeout,
> # heartbeatFrequency,
> # maxMissedHeartbeats.
> The same goes for the communication SPI:
> # reconnectCount,
> # maxConnTimeout,
> # connTimeout.
> So, we have 10 or even more properties.
> We did it to address the half-open sockets problem (which is pretty common in 
> cloud environments) and GC pauses which may happen on cluster nodes - we can 
> increase ack timeouts to prevent such nodes from being kicked out of the 
> topology.
> By setting values for these properties I effectively set the timeout for 
> failure detection. Why do we need such a large number of parameters instead 
> of having a single one on IgniteConfiguration - nodeResponseThreshold (or 
> failureDetectionThreshold - can anyone propose a better name?)?
> All other parameters will be calculated automatically (I think the user can 
> still set some of them for full control over the situation - need to decide 
> if this is needed.)
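
To make the proposal above concrete, here is a minimal sketch of how a single 
threshold could drive the other values. Everything in it is an assumption made 
for illustration: the class name {{DerivedTimeouts}}, the method 
{{fromThreshold}} and the split formulas are hypothetical and are not taken 
from the Ignite code base or from the attached patches.

```java
// Hypothetical sketch: names and formulas are illustrative assumptions,
// not Ignite's actual implementation.
final class DerivedTimeouts {
    final long sockTimeout;         // ms allowed for a socket write
    final long ackTimeout;          // ms allowed to wait for an acknowledgement
    final long heartbeatFrequency;  // ms between heartbeats
    final int maxMissedHeartbeats;  // heartbeats that may be lost before failure

    private DerivedTimeouts(long sockTimeout, long ackTimeout,
                            long heartbeatFrequency, int maxMissedHeartbeats) {
        this.sockTimeout = sockTimeout;
        this.ackTimeout = ackTimeout;
        this.heartbeatFrequency = heartbeatFrequency;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    /** Derives per-property values from one user-facing threshold, in milliseconds. */
    static DerivedTimeouts fromThreshold(long failureDetectionThreshold) {
        if (failureDetectionThreshold <= 0)
            throw new IllegalArgumentException(
                "Threshold must be positive: " + failureDetectionThreshold);

        // Assumption: tolerate two missed heartbeats within the threshold,
        // so three heartbeat intervals fit into it.
        int maxMissed = 2;
        long heartbeatFreq = Math.max(1, failureDetectionThreshold / (maxMissed + 1));

        // Assumption: the socket write and the wait for its acknowledgement
        // share the threshold roughly equally; each gets at least 1 ms.
        long sockTimeout = Math.max(1, failureDetectionThreshold / 2);
        long ackTimeout = Math.max(1, failureDetectionThreshold - sockTimeout);

        return new DerivedTimeouts(sockTimeout, ackTimeout, heartbeatFreq, maxMissed);
    }
}
```

With such a helper the user would set only the threshold, for example 
{{DerivedTimeouts.fromThreshold(120)}}, and the remaining values would follow 
from it. Whether individual values should stay overridable is the open 
question raised in the description.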



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
