[ 
https://issues.apache.org/jira/browse/HBASE-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514790#comment-17514790
 ] 

LiangJun He commented on HBASE-20158:
-------------------------------------

[~liyu], could this issue be assigned to me so that I can move it forward? Thanks.

> Enhance regionserver self health check to avoid stale server
> ------------------------------------------------------------
>
>                 Key: HBASE-20158
>                 URL: https://issues.apache.org/jira/browse/HBASE-20158
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Major
>
> Currently we have many good metrics for monitoring cluster status, such as
> totalCallTime/processCallTime/queueCallTime. But these metrics won't work
> if the server becomes stale and client calls time out, for example during
> an RS full GC, or when there are bad disks on HDFS and read IO gets stuck.
> We also have a periodic health check chore, introduced by HBASE-7351, which
> allows us to launch an external script periodically to perform some
> self-detection. However, this won't work if the server's system resources
> have run out, for example when no new native thread can be created or no
> new network connection can be set up. Note that although no new thread can
> be launched, running threads are not affected, so the ZooKeeper session
> stays alive and the RS is still regarded as alive, but clients cannot
> access it since no new connection can be set up.
> Here we propose a new HealthChecker called DirectHealthChecker. This new
> checker does not launch any external script. Instead, it picks some regions
> on the RS and sends RPC requests to itself, regards the server as unhealthy
> if the call failure ratio exceeds some limit, and sends the metrics out to
> our monitoring system. For more details, please refer to the coming patch.
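
For illustration only, the failure-ratio decision described in the proposal could be sketched roughly as below. This is not the patch attached to the issue and not HBase API: the class name DirectHealthCheckSketch, the methods failureRatio and isHealthy, and the Predicate standing in for "send an RPC request to a region on this server" are all hypothetical names chosen for the sketch.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the decision logic only; the real probe would be an
// RPC issued by the region server against one of its own regions.
final class DirectHealthCheckSketch {

    /** Returns the fraction of sampled regions whose probe failed (0.0 for an empty sample). */
    public static <R> double failureRatio(List<R> sampledRegions, Predicate<R> probe) {
        if (sampledRegions.isEmpty()) {
            return 0.0;
        }
        int failures = 0;
        for (R region : sampledRegions) {
            boolean ok;
            try {
                ok = probe.test(region);   // stand-in for the self-directed RPC
            } catch (RuntimeException e) {
                ok = false;                // an exception is counted as a failed call
            }
            if (!ok) {
                failures++;
            }
        }
        return (double) failures / sampledRegions.size();
    }

    /** The server is regarded as unhealthy once the failure ratio exceeds the limit. */
    public static <R> boolean isHealthy(List<R> sampledRegions, Predicate<R> probe,
                                        double maxFailureRatio) {
        return failureRatio(sampledRegions, probe) <= maxFailureRatio;
    }
}
```

Because the probe here runs on the caller's thread and opens a connection the same way a client would, resource exhaustion (no new threads, no new connections) would surface as probe failures, which is the case the external-script chore cannot detect. How regions are sampled, what timeout applies, and how the metrics are exported are left open in the description above and are not covered by this sketch.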



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
