Yakov Zhdanov created IGNITE-752:
------------------------------------

             Summary: Speed up failure detection
                 Key: IGNITE-752
                 URL: https://issues.apache.org/jira/browse/IGNITE-752
             Project: Ignite
          Issue Type: Bug
            Reporter: Yakov Zhdanov
            Priority: Critical
             Fix For: sprint-4

I think we can (1) make grid configuration significantly easier and (2) speed up failure detection.

Here are the discovery SPI configuration properties that are responsible for failure detection:
# reconnectCount
# sockTimeout
# networkTimeout
# ackTimeout
# maxAckTimeout
# heartbeatFrequency
# maxMissedHeartbeats

The same applies to the communication SPI:
# reconnectCount
# maxConnTimeout
# connTimeout

So we have 10 or even more properties. We introduced them to address the half-open socket problem (which is pretty common in cloud environments) and GC pauses that may happen on cluster nodes: we can increase the ack timeouts to prevent paused nodes from being kicked out of the topology.

By setting values for these properties, I am effectively setting a timeout for failure detection. Why do we need such a large number of parameters instead of having a single one on IgniteConfiguration: nodeResponseThreshold (or failureDetectionThreshold; can anyone propose a better name?). All other parameters would be calculated automatically. (I think the user could still set some of them for full control over the situation; we need to decide whether this is needed.)
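
To make the proposal more concrete, here is a minimal sketch of how the remaining values might be calculated from a single threshold. Everything in it is an assumption for illustration: the class name DerivedTimeouts, the method fromThreshold, and all of the ratios are hypothetical and are not taken from Ignite. Only the property names come from the lists above.

```java
// Hypothetical sketch: derives several discovery timeouts from one threshold.
// The ratios below are illustrative assumptions, not Ignite's actual formulas.
final class DerivedTimeouts {
    final long heartbeatFrequency;   // ms between heartbeats
    final int maxMissedHeartbeats;   // missed heartbeats tolerated before failure
    final long sockTimeout;          // ms, socket operation timeout
    final long ackTimeout;           // ms, initial ack timeout
    final long maxAckTimeout;        // ms, upper bound for the ack timeout

    private DerivedTimeouts(long heartbeatFrequency, int maxMissedHeartbeats,
                            long sockTimeout, long ackTimeout, long maxAckTimeout) {
        this.heartbeatFrequency = heartbeatFrequency;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
        this.sockTimeout = sockTimeout;
        this.ackTimeout = ackTimeout;
        this.maxAckTimeout = maxAckTimeout;
    }

    /** Computes all values from a single failure detection threshold in ms. */
    static DerivedTimeouts fromThreshold(long thresholdMs) {
        if (thresholdMs < 100) {
            throw new IllegalArgumentException("threshold too small: " + thresholdMs);
        }
        int missed = 4;                      // assumed fixed count
        long heartbeat = thresholdMs / missed; // missed * heartbeat <= threshold
        long ack = thresholdMs / 10;         // assumed: start with a short ack wait
        long maxAck = thresholdMs / 2;       // assumed: cap well below the threshold
        long sock = thresholdMs / 10;        // assumed: same budget as the first ack
        return new DerivedTimeouts(heartbeat, missed, sock, ack, maxAck);
    }
}
```

With this sketch, a user would pick only the threshold, for example DerivedTimeouts.fromThreshold(10000), and the heartbeat settings would be chosen so that heartbeatFrequency times maxMissedHeartbeats never exceeds it. Whether explicit per-property overrides should still win over the derived values is the open question raised above.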