Eric Pai created IOTDB-1564:
-------------------------------
Summary: Make leader failure detection and election faster
Key: IOTDB-1564
URL: https://issues.apache.org/jira/browse/IOTDB-1564
Project: Apache IoTDB
Issue Type: Improvement
Components: Cluster
Reporter: Eric Pai
Assignee: Eric Pai
Fix For: master branch
The cluster configuration _connection_timeout_ms_ is currently used in several
different places:
1. The connection and socket timeout of underlying TSocket of the Thrift
framework.
2. The CatchUpTask.
*3. The heartbeat expired time of RaftMember.*
*4. The sleep time between adjacent FOLLOWER heartbeat timeout validations.*
*5. The election timeout.*
However, there is no reason these timeouts must all share one value. A longer
heartbeat expiry time means the leader failure is detected later. Thus we
should split it out as a new configuration parameter for DBAs.
Apart from network latency, +4+ and +5+ contribute most to the time between a
leader failure and the end of a successful election. We can optimize both.
Here are my proposals:
a) Add new cluster configurations: _heartbeat_timeout_ms_ and
_election_timeout_ms_, and leave _connection_timeout_ms_ as the TSocket timeout
only (this also matches its literal meaning).
* _heartbeat_timeout_ms_: The maximum time allowed to pass since
lastHeartbeatReceivedTime. If the current time exceeds heartbeat_timeout_ms +
lastHeartbeatReceivedTime, a new election starts.
* _election_timeout_ms_: The maximum time to wait for vote responses.
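A minimal sketch of the expiry rule described for _heartbeat_timeout_ms_. The
class and method names here are hypothetical and only illustrate the
comparison; they are not taken from the existing RaftMember code.

```java
/** Hypothetical helper illustrating the proposed heartbeat expiry rule. */
final class HeartbeatExpiry {
  private HeartbeatExpiry() {}

  /**
   * Returns true once the current time has passed
   * lastHeartbeatReceivedTime + heartbeat_timeout_ms.
   * Written as a subtraction so the comparison does not overflow
   * when the timestamps are large.
   */
  static boolean isExpired(
      long nowMs, long lastHeartbeatReceivedTimeMs, long heartbeatTimeoutMs) {
    return nowMs - lastHeartbeatReceivedTimeMs > heartbeatTimeoutMs;
  }
}
```

With a last heartbeat at 1000 ms and a 500 ms timeout, the heartbeat is still
valid at 1500 ms and expired from 1501 ms onward.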
b) We can also make the +4+ process smarter.
Because we already know _lastHeartbeatReceivedTime_, the expiry time of this
heartbeat is _lastHeartbeatReceivedTime_ + _heartbeat_timeout_ms_.
Thus we can sleep for _lastHeartbeatReceivedTime_ + _heartbeat_timeout_ms_ -
_now()_ until the next check. If the heartbeat has timed out by then,
RaftMember starts an election immediately.
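The sleep computation in b) could look roughly like the sketch below. The
names are hypothetical, and clamping the result at zero is my own assumption:
it covers the case where the deadline has already passed, so the caller can
start the election without sleeping.

```java
/** Hypothetical helper: how long a follower sleeps before the next check. */
final class HeartbeatCheckDelay {
  private HeartbeatCheckDelay() {}

  /**
   * Computes lastHeartbeatReceivedTime + heartbeat_timeout_ms - now().
   * A negative result means the heartbeat has already expired, so it is
   * clamped to 0 (assumption: the caller then starts an election at once).
   */
  static long nextCheckDelayMs(
      long nowMs, long lastHeartbeatReceivedTimeMs, long heartbeatTimeoutMs) {
    long deadlineMs = lastHeartbeatReceivedTimeMs + heartbeatTimeoutMs;
    return Math.max(0L, deadlineMs - nowMs);
  }
}
```

For example, with a last heartbeat at 1000 ms, a 500 ms timeout and a current
time of 1200 ms, the follower sleeps 300 ms instead of a full fixed interval.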
c) If multiple RaftMembers start elections at the same time, all elections may
fail because each receives insufficient votes, and every member then waits a
random time before the next election, which lengthens the overall election. To
improve the election success rate, we can add a random delay to +b+:
- Sleep: _lastHeartbeatReceivedTime + heartbeat_timeout_ms - now() +
randomTime()_
The smaller the value a member gets from _randomTime()_, the more likely it is
to become the new leader.
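A sketch of the randomized sleep in c). The upper bound maxRandomMs is an
assumed extra parameter that this issue does not define, and the class name is
hypothetical. The Random instance is passed in so the behaviour can be tested
with a fixed seed.

```java
import java.util.Random;

/** Hypothetical helper: deadline-based sleep plus a random delay. */
final class RandomizedElectionDelay {
  private RandomizedElectionDelay() {}

  /**
   * Computes lastHeartbeatReceivedTime + heartbeat_timeout_ms - now()
   * + randomTime(), where randomTime() is uniform in [0, maxRandomMs).
   * maxRandomMs is an assumed parameter, not an existing configuration.
   */
  static long sleepMs(
      long nowMs,
      long lastHeartbeatReceivedTimeMs,
      long heartbeatTimeoutMs,
      int maxRandomMs,
      Random random) {
    long untilDeadlineMs =
        Math.max(0L, lastHeartbeatReceivedTimeMs + heartbeatTimeoutMs - nowMs);
    // Random.nextInt(bound) requires bound > 0, so skip the draw otherwise.
    long randomTimeMs = maxRandomMs > 0 ? random.nextInt(maxRandomMs) : 0L;
    return untilDeadlineMs + randomTimeMs;
  }
}
```

Members that share the same deadline now wake at different moments, so the
first one to wake can usually collect votes before the others become
candidates.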
This design has another benefit. We know that the leader node has a heavier
workload than followers. In the future we can make the idlest node the leader
during an election, by calculating resource usage and returning a lower value
from the randomTime() method on an idle node.
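One possible shape for such a load-aware randomTime(), purely as an
illustration. The load value in [0, 1], the linear scaling and all names are
assumptions; how resource usage would actually be measured is left open here.

```java
import java.util.Random;

/** Hypothetical load-aware randomTime(): idle nodes draw smaller delays. */
final class LoadAwareRandomTime {
  private LoadAwareRandomTime() {}

  /**
   * Draws a delay uniformly from [0, load * maxRandomMs], so a node with
   * lower resource usage tends to get a lower value.
   * load is an assumed usage metric and is clamped to [0, 1].
   */
  static long randomTimeMs(double load, int maxRandomMs, Random random) {
    double clampedLoad = Math.min(1.0, Math.max(0.0, load));
    int upperBoundMs = (int) Math.round(clampedLoad * maxRandomMs);
    // nextInt(upperBoundMs + 1) is inclusive of upperBoundMs.
    return upperBoundMs <= 0 ? 0L : random.nextInt(upperBoundMs + 1);
  }
}
```

One caveat with this simple scaling: two completely idle nodes would both get
0 and collide again, so a real implementation would probably keep a small
minimum random range.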
Any suggestions or discussions are welcome :)