Eric Pai created IOTDB-1564:
-------------------------------

             Summary: Make leader failure detection and election faster
                 Key: IOTDB-1564
                 URL: https://issues.apache.org/jira/browse/IOTDB-1564
             Project: Apache IoTDB
          Issue Type: Improvement
          Components: Cluster
            Reporter: Eric Pai
            Assignee: Eric Pai
             Fix For: master branch


The cluster configuration _connection_timeout_ms_ is now used in different 
layers:

1. The connection and socket timeout of underlying TSocket of the Thrift 
framework.
2. The CatchUpTask.
*3. The heartbeat expired time of RaftMember.*
*4. The sleep time between adjacent FOLLOWER heartbeat timeout validations.*
*5. The election timeout.*

However, there is no reason these timeouts must all be the same. A longer 
heartbeat expiration time means slower detection of a leader failure. Thus we 
should split it out as a separate configuration parameter for DBAs.

Apart from network latency, +4+ and +5+ are the major contributors to the time 
between a leader failure and the end of a successful election. We can optimize 
both. Here are my proposals:

a) Add new cluster configurations: _heartbeat_timeout_ms_ and 
_election_timeout_ms_, and leave _connection_timeout_ms_ as the TSocket timeout 
only (which also matches its literal meaning).
 * _heartbeat_timeout_ms_: The maximum time allowed to elapse since 
lastHeartbeatReceivedTime. If the current time exceeds 
lastHeartbeatReceivedTime + heartbeat_timeout_ms, a new election starts.
 * _election_timeout_ms_: The maximum time to wait for the vote responses.

b) We can also make the +4+ process smarter.

Because we already know _lastHeartbeatReceivedTime_, the expiration time of 
this heartbeat is _lastHeartbeatReceivedTime_ + _heartbeat_timeout_ms_.

Thus we can sleep for _lastHeartbeatReceivedTime_ + _heartbeat_timeout_ms_ - 
_now()_ before the next check. If the heartbeat has timed out by then, 
RaftMember starts an election immediately.
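
A minimal sketch of the sleep computation in b), assuming all times are epoch 
milliseconds. The class and method names (HeartbeatCheckSleep, sleepMillis, 
isExpired) are hypothetical and only illustrate the arithmetic; they are not 
existing IoTDB code.

```java
/** Hypothetical helper for illustration only; not part of IoTDB. */
final class HeartbeatCheckSleep {
  private HeartbeatCheckSleep() {}

  /**
   * Time to sleep before the next heartbeat check:
   * lastHeartbeatReceivedTime + heartbeat_timeout_ms - now().
   * Clamped to 0 so an already expired heartbeat is checked at once.
   */
  static long sleepMillis(long lastHeartbeatReceivedTime, long heartbeatTimeoutMs, long now) {
    long expireAt = lastHeartbeatReceivedTime + heartbeatTimeoutMs;
    return Math.max(0L, expireAt - now);
  }

  /**
   * Assumption: the boundary instant counts as expired, so a follower that
   * wakes up exactly at the expiration time does not need a second sleep.
   */
  static boolean isExpired(long lastHeartbeatReceivedTime, long heartbeatTimeoutMs, long now) {
    return now >= lastHeartbeatReceivedTime + heartbeatTimeoutMs;
  }
}
```

For example, with a last heartbeat at 1000, a timeout of 5000 and now() = 4000, 
the follower sleeps 2000 ms instead of a fixed interval. If a new heartbeat 
arrives during the sleep, the next sleep is computed from the updated 
lastHeartbeatReceivedTime.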

c) If multiple RaftMembers start elections at the same time, all elections may 
fail because no candidate receives enough votes, and every member then waits a 
random time before the next election, which lengthens the whole election 
duration. To improve the election success rate, we can add a random delay to 
+b+:
- Sleep: _lastHeartbeatReceivedTime + heartbeat_timeout_ms - now() + 
randomTime()_

The smaller the value a member gets from _randomTime()_, the more likely it is 
to become the new leader.
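
The randomized sleep in c) could be sketched as follows. The upper bound 
maxRandomMs is an assumed tuning knob that the proposal does not define, and 
RandomizedElectionDelay is a hypothetical name, not existing IoTDB code.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration only; not part of IoTDB. */
final class RandomizedElectionDelay {
  private RandomizedElectionDelay() {}

  /** Uniform random delay in [0, maxRandomMs]; maxRandomMs is an assumed parameter. */
  static long randomTime(long maxRandomMs) {
    return ThreadLocalRandom.current().nextLong(maxRandomMs + 1);
  }

  /**
   * Sleep: lastHeartbeatReceivedTime + heartbeat_timeout_ms - now() + randomTime(),
   * clamped to 0 so the result is always a valid sleep duration.
   */
  static long sleepMillis(
      long lastHeartbeatReceivedTime, long heartbeatTimeoutMs, long now, long randomMs) {
    return Math.max(0L, lastHeartbeatReceivedTime + heartbeatTimeoutMs - now + randomMs);
  }
}
```

Suppose two followers share the same last heartbeat at 1000, a timeout of 5000 
and now() = 4000. If one draws 30 and the other 120, they sleep 2030 ms and 
2120 ms, so the first starts its election 90 ms earlier and can collect votes 
before the second becomes a candidate.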

This design has another benefit. We know that the leader node has a heavier 
workload than the followers. In the future we can make the idlest node the 
leader during an election, by calculating the resource usage and returning a 
lower value from the randomTime() method on the idle node.
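
One possible shape for such a load-aware randomTime(), purely as a sketch. It 
assumes the node can report its resource usage as a load value in [0, 1]; how 
that value is measured, and the parameters maxRandomMs and jitterMs, are 
assumptions that the proposal leaves open.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration only; not part of IoTDB. */
final class LoadAwareElectionDelay {
  private LoadAwareElectionDelay() {}

  /**
   * Delay that grows with load, so idle nodes tend to start elections first.
   *
   * @param load assumed resource usage in [0, 1]; values outside are clamped
   * @param maxRandomMs assumed delay assigned to a fully loaded node
   * @param jitterMs assumed extra random range [0, jitterMs) that separates
   *     nodes reporting the same load; 0 disables it
   */
  static long randomTime(double load, long maxRandomMs, long jitterMs) {
    double clamped = Math.min(1.0, Math.max(0.0, load));
    long loadPart = Math.round(clamped * maxRandomMs);
    long jitter = jitterMs > 0 ? ThreadLocalRandom.current().nextLong(jitterMs) : 0L;
    return loadPart + jitter;
  }
}
```

With maxRandomMs = 200 and no jitter, a node at 25% load waits 50 ms after the 
heartbeat expires, while a fully loaded node waits 200 ms.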

Any suggestions or discussion are welcome :)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
