[jira] [Commented] (IGNITE-752) Speed up failure detection

Dmitriy Setrakyan (JIRA) Fri, 24 Jul 2015 02:57:49 -0700

    [ 
https://issues.apache.org/jira/browse/IGNITE-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640227#comment-14640227
 ]


Dmitriy Setrakyan commented on IGNITE-752:
------------------------------------------

I looked at the code and have some questions:

# I am not sure {{failureDetectionThreshold}} is the right name. Wouldn't 
{{failureDetectionTimeout}} make more sense?
# I tried to read the javadoc on {{IgniteConfiguration}}, but I think it is 
trying to say too much. How about just briefly explaining what it does, 
without confusing users with an explanation of how the implementation 
works? For example,
{code}
Failure detection timeout is used to determine how long the communication or 
discovery SPIs should wait before considering a remote connection failed.
{code}
# Then in the SPI javadocs for communication and discovery, you can say:
{code}
{{failureDetectionTimeout}} automatically controls the following parameters: a, 
b, c, d. If any of those parameters is set explicitly, then the 
{{failureDetectionTimeout}} setting will be ignored.
{code}
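
To make the proposed override rule concrete, here is a minimal sketch of how an SPI could resolve its effective timeouts. It is not Ignite's actual code: the class name, the setter names, and the fallback default values are hypothetical and chosen only for illustration; {{sockTimeout}} and {{ackTimeout}} stand in for the "a, b, c, d" parameters.

```java
/**
 * Hypothetical sketch of the override rule; not Ignite's actual SPI code.
 * All values are in milliseconds.
 */
final class FailureDetectionSketch {
    /** Assumed fallback defaults, chosen only for illustration. */
    static final long DFLT_SOCK_TIMEOUT = 2_000;
    static final long DFLT_ACK_TIMEOUT = 5_000;

    private final long failureDetectionTimeout;

    /** null means the parameter was not set explicitly. */
    private Long sockTimeout;
    private Long ackTimeout;

    FailureDetectionSketch(long failureDetectionTimeout) {
        this.failureDetectionTimeout = failureDetectionTimeout;
    }

    FailureDetectionSketch setSockTimeout(long ms) {
        sockTimeout = ms;
        return this;
    }

    FailureDetectionSketch setAckTimeout(long ms) {
        ackTimeout = ms;
        return this;
    }

    /** The single timeout applies only while no controlled parameter is set explicitly. */
    boolean failureDetectionTimeoutEnabled() {
        return sockTimeout == null && ackTimeout == null;
    }

    long effectiveSockTimeout() {
        if (failureDetectionTimeoutEnabled())
            return failureDetectionTimeout;
        return sockTimeout != null ? sockTimeout : DFLT_SOCK_TIMEOUT;
    }

    long effectiveAckTimeout() {
        if (failureDetectionTimeoutEnabled())
            return failureDetectionTimeout;
        return ackTimeout != null ? ackTimeout : DFLT_ACK_TIMEOUT;
    }

    public static void main(String[] args) {
        // Nothing set explicitly: the single timeout drives everything.
        FailureDetectionSketch auto = new FailureDetectionSketch(10_000);
        System.out.println(auto.effectiveSockTimeout());

        // One explicit parameter: the single timeout is ignored for all of them.
        FailureDetectionSketch manual = new FailureDetectionSketch(10_000).setAckTimeout(7_000);
        System.out.println(manual.failureDetectionTimeoutEnabled());
        System.out.println(manual.effectiveAckTimeout());
        System.out.println(manual.effectiveSockTimeout());
    }
}
```

With this rule, setting any one of the controlled parameters switches the SPI back to fully manual configuration, which is the behavior the SPI javadoc would need to spell out.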

> Speed up failure detection
> --------------------------
>
>                 Key: IGNITE-752
>                 URL: https://issues.apache.org/jira/browse/IGNITE-752
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Yakov Zhdanov
>            Assignee: Denis Magda
>            Priority: Blocker
>             Fix For: sprint-7
>
>         Attachments: 882.patch, ignite-752.patch
>
>
> I think we can (1) make grid configuration significantly easier and (2) speed 
> up failure detection.
> Here are the discovery SPI configuration properties which are responsible 
> for failure detection:
> # reconnectCount,
> # sockTimeout,
> # networkTimeout,
> # ackTimeout,
> # maxAckTimeout,
> # heartbeatFrequency,
> # maxMissedHeartbeats
> Same for communication SPI:
> # reconnectCount,
> # maxConnTimeout,
> # connTimeout
> So, we have 10 or even more properties.
> We did this to address the half-open socket problem (which is pretty common 
> in cloud environments) and GC pauses which may happen on cluster nodes: we 
> can increase ack timeouts to prevent such nodes from being kicked out of the 
> topology.
> By setting values for these properties I set the timeout for failure 
> detection. Why do we need such a great number of parameters instead of 
> having one on IgniteConfiguration: nodeResponseThreshold (or 
> failureDetectionThreshold; can anyone propose a better name)?
> All other parameters will be calculated automatically (I think the user can 
> still set some of them for full control over the situation; we need to 
> decide if this is needed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
