Yu Li created HBASE-20158:
Summary: Enhance regionserver self health check to avoid stale
Issue Type: New Feature
Reporter: Yu Li
Assignee: Yu Li
Currently we have many good metrics for monitoring cluster status, such as
totalCallTime/processCallTime/queueCallTime. But these metrics do not help
when the server becomes unresponsive and client calls time out, for example
during a full GC on the RS or when a bad disk on HDFS causes read IO to hang.
We also have a periodic health check chore, introduced by HBASE-7351, which
allows us to launch an external script periodically to perform self-detection.
However, this won't work if the server's system resources have run out, for
example when no new native thread can be created or no new network connection
can be set up. Note that although no new thread can be launched, running
threads are not affected, so the zookeeper session stays alive and the RS is
still regarded as alive, yet clients cannot access it since no new connection
can be set up.
Here we propose a new HealthChecker called DirectHealthChecker. This new
checker does not launch any external script. Instead, it picks some regions on
the RS and sends RPC requests to itself, regards the server as unhealthy if
the call failure ratio exceeds some limit, and sends the metrics out to our
monitoring system. For more details, please refer to the coming patch.
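
The failure-ratio decision described above could look roughly like the
following. This is only an illustrative sketch, not the coming patch: the
class name DirectHealthCheckSketch, the methods failureRatio and isHealthy,
and the use of Callable as a stand-in for an RPC sent to a local region are
all hypothetical, and no HBase API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch of a failure-ratio check; not the HBASE-20158 patch. */
final class DirectHealthCheckSketch {

    /**
     * Runs every probe once and returns the fraction that failed.
     * A probe stands in for one RPC to a region hosted on this server;
     * it counts as failed if it returns false or throws an exception.
     */
    static double failureRatio(List<Callable<Boolean>> probes) {
        if (probes.isEmpty()) {
            return 0.0; // nothing to probe, so nothing failed
        }
        int failures = 0;
        for (Callable<Boolean> probe : probes) {
            try {
                if (!probe.call()) {
                    failures++;
                }
            } catch (Exception e) {
                // e.g. a timeout or a refused connection
                failures++;
            }
        }
        return (double) failures / probes.size();
    }

    /** The server is unhealthy only when the ratio exceeds the limit. */
    static boolean isHealthy(List<Callable<Boolean>> probes, double maxFailureRatio) {
        return failureRatio(probes) <= maxFailureRatio;
    }

    public static void main(String[] args) {
        List<Callable<Boolean>> probes = new ArrayList<>();
        probes.add(() -> true);                       // successful call
        probes.add(() -> true);                       // successful call
        probes.add(() -> false);                      // call reported failure
        probes.add(() -> { throw new java.io.IOException("call timed out"); });

        System.out.println("failure ratio = " + failureRatio(probes));
        System.out.println("healthy at limit 0.25: " + isHealthy(probes, 0.25));
    }
}
```

In a real checker the probes would need their own timeout, since a hung call
is exactly the failure mode this issue is about; the sketch leaves that out.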