[ https://issues.apache.org/jira/browse/HBASE-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514790#comment-17514790 ]
LiangJun He commented on HBASE-20158:
-------------------------------------
[~liyu], could this issue be assigned to me so that I can move it forward? Thanks.
> Enhance regionserver self health check to avoid stale server
> ------------------------------------------------------------
>
> Key: HBASE-20158
> URL: https://issues.apache.org/jira/browse/HBASE-20158
> Project: HBase
> Issue Type: New Feature
> Reporter: Yu Li
> Assignee: Yu Li
> Priority: Major
>
> Currently we have many good metrics for monitoring cluster status, such as
> totalCallTime/processCallTime/queueCallTime. But these metrics don't help
> when the server becomes stale and client calls time out, for example during
> an RS full GC, or when a bad disk on HDFS causes read IO to get stuck.
> We also have a periodic health check chore, introduced by HBASE-7351, which
> allows us to launch an external script periodically to perform self
> detection. However, this won't work if the server's system resources have run
> out, for example when no new native thread can be created or no new network
> connection can be set up. Note that although no new thread can be
> launched, running threads are not affected, so the ZooKeeper session stays
> alive and the RS is still regarded as alive, but clients cannot access it
> since no new connection can be set up.
> Here we propose a new HealthChecker called DirectHealthChecker. This new
> checker does not launch any external script. Instead it picks some regions on
> the RS and sends RPC requests to itself, regards the server as unhealthy if
> the call failure ratio exceeds some limit, and sends the metrics out to our
> monitoring system. For more details, please refer to the upcoming patch.
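
The failure-ratio check described in the proposal could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class name DirectHealthCheckSketch, the methods failureRatio and isHealthy, and the probe callback (standing in for the RPC the server sends to itself) are all hypothetical, and none of them are HBase APIs.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of a failure-ratio health check; not the HBase API. */
public class DirectHealthCheckSketch {

    /**
     * Probes each sampled region once and returns the fraction of failed probes.
     * The probe stands in for an RPC the server sends to itself (assumption).
     */
    public static <R> double failureRatio(List<R> regions, Predicate<R> probe) {
        if (regions.isEmpty()) {
            return 0.0; // nothing to probe, so nothing has failed
        }
        int failures = 0;
        for (R region : regions) {
            boolean ok;
            try {
                ok = probe.test(region);
            } catch (RuntimeException e) {
                ok = false; // a probe that throws counts as a failed call
            }
            if (!ok) {
                failures++;
            }
        }
        return (double) failures / regions.size();
    }

    /** The server is regarded as unhealthy once the ratio exceeds the limit. */
    public static <R> boolean isHealthy(List<R> regions, Predicate<R> probe,
                                        double maxFailureRatio) {
        return failureRatio(regions, probe) <= maxFailureRatio;
    }
}
```

A real checker would also need a timeout on each probe, since a stuck call is exactly the case the issue wants to detect, and it would publish the ratio as a metric rather than only return a boolean.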
--
This message was sent by Atlassian Jira
(v8.20.1#820001)