Tian Jiang created IOTDB-953:
--------------------------------

             Summary: [Distributed] Improve handling when a node cannot be 
reached
                 Key: IOTDB-953
                 URL: https://issues.apache.org/jira/browse/IOTDB-953
             Project: Apache IoTDB
          Issue Type: Improvement
          Components: Core/Cluster
            Reporter: Tian Jiang
            Assignee: Houliang Qi


When a node fails to send a request to another node, it records one failure 
in its ClientPool. Once the failure count reaches 3, the pool refuses to hand 
out clients for that node for 60s. 
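
For illustration, the behaviour described above can be sketched roughly as 
follows. This is a minimal sketch, not the actual ClientPool code: the class 
name `FailureCountingPool`, its methods, and the use of a node name string as 
the key are all assumptions made for the example. Only the threshold of 3 and 
the 60s window come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the described behaviour; not the real IoTDB ClientPool.
class FailureCountingPool {
  static final int MAX_FAILURES = 3;         // threshold from the description
  static final long BLOCK_MILLIS = 60_000L;  // 60s block window from the description

  private final Map<String, Integer> failures = new ConcurrentHashMap<>();
  private final Map<String, Long> blockedSince = new ConcurrentHashMap<>();

  // Called whenever a client's onError() fires, whatever the cause of the error.
  void recordFailure(String node, long nowMillis) {
    int count = failures.merge(node, 1, Integer::sum);
    if (count >= MAX_FAILURES) {
      blockedSince.put(node, nowMillis);
    }
  }

  // Returns false while the node is blocked, i.e. no client is handed out.
  boolean canGetClient(String node, long nowMillis) {
    Long since = blockedSince.get(node);
    if (since == null) {
      return true;
    }
    if (nowMillis - since >= BLOCK_MILLIS) {
      // The block window has elapsed: forget the failures and allow clients again.
      blockedSince.remove(node);
      failures.remove(node);
      return true;
    }
    return false;
  }
}
```

In this sketch every caller of `canGetClient()` is treated the same way, so a 
heartbeat would be refused just like any other request while the node is 
blocked.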

This implementation has three main drawbacks:
1. It does not distinguish network connection errors from other errors. 
Whenever `onError()` of a client is called, the failure count increases, even 
if the call was not caused by a network failure.
2. Heartbeats should not be subject to this mechanism. One purpose of 
heartbeats is to detect whether a node is still alive, and they need clients 
to do so. If heartbeats are blocked as well, we lose the chance to reconnect 
to the node early and must wait the full 60s even if the node has already 
recovered.
3. Heartbeat successes do not unblock other requests. Because heartbeats use a 
separate pool, a successful heartbeat to a node only unblocks other heartbeats 
to that node; other requests remain blocked for 60s because they take their 
clients from a different pool.
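
One possible direction that addresses all three drawbacks is to move the 
failure bookkeeping out of the individual pools into a single object shared by 
the heartbeat pool and the request pool. The sketch below is only an 
illustration under stated assumptions: `NodeAvailability` and its methods are 
hypothetical names, and treating an `IOException` in the cause chain as "a 
network failure" is an assumption; the real check would depend on the 
exception types the RPC layer actually throws.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; one instance would be shared by both client pools.
class NodeAvailability {
  static final int MAX_FAILURES = 3;
  static final long BLOCK_MILLIS = 60_000L;

  private final Map<String, Integer> failures = new ConcurrentHashMap<>();
  private final Map<String, Long> blockedSince = new ConcurrentHashMap<>();

  // Drawback 1: only errors that look like network failures are counted.
  void onError(String node, Throwable error, long nowMillis) {
    if (!isNetworkError(error)) {
      return;
    }
    int count = failures.merge(node, 1, Integer::sum);
    if (count >= MAX_FAILURES) {
      blockedSince.put(node, nowMillis);
    }
  }

  // Assumption: a network failure shows up as an IOException in the cause chain.
  static boolean isNetworkError(Throwable error) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (t instanceof IOException) {
        return true;
      }
    }
    return false;
  }

  // Drawback 2: heartbeats are never refused a client.
  boolean canGetClient(String node, boolean isHeartbeat, long nowMillis) {
    if (isHeartbeat) {
      return true;
    }
    Long since = blockedSince.get(node);
    if (since == null) {
      return true;
    }
    if (nowMillis - since >= BLOCK_MILLIS) {
      reset(node);
      return true;
    }
    return false;
  }

  // Drawback 3: a heartbeat success clears the block for every request type.
  void onHeartbeatSuccess(String node) {
    reset(node);
  }

  private void reset(String node) {
    failures.remove(node);
    blockedSince.remove(node);
  }
}
```

With this shape, a node that recovers after 5s becomes usable again as soon as 
the next heartbeat to it succeeds, instead of after the full 60s.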



--
This message was sent by Atlassian Jira
(v8.3.4#803005)