failure detection in a distributed system by HTM

Takenori Sato Fri, 23 Oct 2015 00:14:41 -0700

Hello,

In a distributed system, it is very important to know which node is
healthier than the others when routing a request and, of course, to
determine when a node should be treated as dead.


For example, Cassandra relies on the phi accrual failure detector[1] to
detect node failures. Every second, a node gossips with 3 other nodes, and
they exchange information with each other. The response time is used as an
input for the failure detection.
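
To make the idea concrete, here is a minimal sketch of a phi accrual
style detector. It is not Cassandra's code. The class name, the window
size, and the choice of an exponential model for heartbeat inter-arrival
times are all assumptions made for illustration. The only part taken
from [1] is the definition phi = -log10(P_later(elapsed)), where P_later
is the probability that the next heartbeat arrives later than the time
already elapsed since the last one.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy phi accrual detector for one monitored node (illustrative only)."""

    def __init__(self, window_size=1000):
        # Sliding window of recent heartbeat inter-arrival times, in seconds.
        # The window size is an arbitrary, assumed value.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely down."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # Assumed exponential model: P_later(elapsed) = exp(-elapsed / mean),
        # so -log10(P_later) simplifies to elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in range(10):  # heartbeats at t = 0, 1, ..., 9 seconds
    detector.heartbeat(float(t))

phi_soon = detector.phi(9.5)   # half a second after the last heartbeat
phi_late = detector.phi(29.0)  # twenty seconds after the last heartbeat
```

A caller would compare phi against a threshold of its own choosing and
treat the node as down once phi exceeds it. The output is a continuous
suspicion level rather than a yes/no answer, which is the main point of
the accrual approach.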

A badness score is also computed from this information and is used
to choose a healthier node among the replica nodes.

However, I have seen many situations where this didn't work as expected,
especially when choosing a healthier node.

On the other hand, I know that most service providers make some kind of
health check request to detect whether a service is available. It may be
just a simple ping or a HEAD request.

So I wondered: would failure detection based on such simple health check
requests be a good use case for HTM?

For example, the input would look like this:

time, node, avg response time(ms)
10:00:00, node1, 10
10:00:00, node2, 9
...
10:00:30, node1, 15
10:00:30, node2, 10
...
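
As a rough sketch of how records in this shape could be prepared, the
snippet below splits them into one time-ordered stream per node, so that
each node's response times could be fed to its own model. The function
name `load_streams` and the one-stream-per-node layout are my
assumptions, the elided "..." rows are simply left out of the sample,
and no HTM library calls are shown.

```python
import csv
from collections import defaultdict
from datetime import datetime
from io import StringIO

# Sample rows in the format shown above.
SAMPLE = """\
time, node, avg response time(ms)
10:00:00, node1, 10
10:00:00, node2, 9
10:00:30, node1, 15
10:00:30, node2, 10
"""


def load_streams(text):
    """Group health check records into one list of samples per node."""
    streams = defaultdict(list)
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        time_str, node, latency_ms = row
        timestamp = datetime.strptime(time_str, "%H:%M:%S").time()
        streams[node].append((timestamp, float(latency_ms)))
    return dict(streams)


streams = load_streams(SAMPLE)
for node, samples in sorted(streams.items()):
    print(node, [latency for _, latency in samples])
```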


[1] http://www.jaist.ac.jp/~defago/files/pdf/IS_RR_2004_010.pdf

Thanks,
Takenori
