Yakov Zhdanov created IGNITE-752:
------------------------------------

             Summary: Speed up failure detection
                 Key: IGNITE-752
                 URL: https://issues.apache.org/jira/browse/IGNITE-752
             Project: Ignite
          Issue Type: Bug
            Reporter: Yakov Zhdanov
            Priority: Critical
             Fix For: sprint-4


I think we can (1) make grid configuration significantly easier and (2) speed 
up failure detection.

Here are the discovery (disco) SPI configuration properties that are 
responsible for failure detection:
# reconnectCount,
# sockTimeout,
# networkTimeout,
# ackTimeout,
# maxAckTimeout,
# heartbeatFrequency,
# maxMissedHeartbeats

The same goes for the communication SPI:
# reconnectCount,
# maxConnTimeout,
# connTimeout

So we have at least 10 properties.

We introduced them to address the half-open sockets problem (which is pretty 
common in cloud environments) and GC pauses, which may happen on cluster 
nodes: we can increase the ack timeouts to prevent paused nodes from being 
kicked out of the topology.

By setting values for these properties I am effectively setting one timeout 
for failure detection. Why do we need so many parameters instead of a single 
one on IgniteConfiguration, e.g. nodeResponseThreshold (or 
failureDetectionThreshold; can anyone propose a better name)?

All other parameters would be calculated automatically. (I think the user 
could still set some of them for full control over the situation; we need to 
decide whether this is needed.)
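
To make the idea concrete, here is a minimal sketch of how the low-level 
values might be derived from one threshold. Everything in it is an assumption 
made for illustration: the class FailureDetectionSketch, the method derive(), 
the two fixed counts and the way the time budget is split are hypothetical 
and are not existing Ignite code or an agreed design. The only rule it 
encodes is that neither the heartbeat check nor the send retries may take 
longer than the threshold.

```java
/**
 * Hypothetical sketch: derive low-level failure detection settings from a
 * single threshold. Names and ratios are illustrative assumptions, not Ignite API.
 */
public final class FailureDetectionSketch {
    /** Derived values; timeouts are in milliseconds. */
    public static final class Derived {
        public final long heartbeatFrequency;
        public final int maxMissedHeartbeats;
        public final int reconnectCount;
        public final long sockTimeout;
        public final long ackTimeout;

        Derived(long heartbeatFrequency, int maxMissedHeartbeats,
                int reconnectCount, long sockTimeout, long ackTimeout) {
            this.heartbeatFrequency = heartbeatFrequency;
            this.maxMissedHeartbeats = maxMissedHeartbeats;
            this.reconnectCount = reconnectCount;
            this.sockTimeout = sockTimeout;
            this.ackTimeout = ackTimeout;
        }
    }

    // Assumed constants, picked only to keep the example small.
    private static final int MAX_MISSED_HEARTBEATS = 4;
    private static final int RECONNECT_COUNT = 2;

    /** Splits the threshold so that each detection path fits within it. */
    public static Derived derive(long thresholdMs) {
        if (thresholdMs < 100)
            throw new IllegalArgumentException("Threshold too small: " + thresholdMs);

        // Heartbeat path: maxMissedHeartbeats * heartbeatFrequency <= threshold.
        long heartbeatFrequency = thresholdMs / MAX_MISSED_HEARTBEATS;

        // Send path: reconnectCount * (sockTimeout + ackTimeout) <= threshold.
        long perAttempt = thresholdMs / RECONNECT_COUNT;
        long sockTimeout = perAttempt / 2;
        long ackTimeout = perAttempt - sockTimeout;

        return new Derived(heartbeatFrequency, MAX_MISSED_HEARTBEATS,
            RECONNECT_COUNT, sockTimeout, ackTimeout);
    }

    public static void main(String[] args) {
        Derived d = derive(10_000L);

        System.out.println("heartbeatFrequency=" + d.heartbeatFrequency
            + ", maxMissedHeartbeats=" + d.maxMissedHeartbeats
            + ", reconnectCount=" + d.reconnectCount
            + ", sockTimeout=" + d.sockTimeout
            + ", ackTimeout=" + d.ackTimeout);
    }
}
```

If users are still allowed to override individual properties, a check of the 
same two inequalities could be used to warn when the explicit values no 
longer fit within the threshold.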




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
