[ 
https://issues.apache.org/jira/browse/HBASE-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216672#comment-13216672
 ] 

zhiyuan.dai commented on HBASE-5075:
------------------------------------

@Jesse
Thanks for your reply.
I think HBase is an online DB, so how long an HBase failover takes is very 
important. Although kill -9 or a network partition is a major event, the 
supervisor can determine within milliseconds that its regionserver has 
crashed, and the hmaster can then move the regions that were open on the 
crashed regionserver to other live regionservers. The failover time is 
therefore reduced to an acceptable level.
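
To make the supervisor idea concrete, here is a minimal sketch of a pid 
watcher. It is not the code from the attached patch. The class name 
PidSupervisor, the methods isAlive and watch, and the onDeath callback are all 
hypothetical, and the callback only stands in for the real work (deleting the 
znode and force-closing the hlog). It assumes Java 9+ for ProcessHandle.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a local supervisor; not taken from the HBASE-5075 patch.
final class PidSupervisor {

    /** Returns true if a process with this pid currently exists and is alive. */
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /**
     * Polls the pid every periodMs milliseconds and runs onDeath once when the
     * process is gone. onDeath is a placeholder for "delete the znode and
     * force-close the hlog"; the real actions are not shown here.
     */
    static ScheduledExecutorService watch(long pid, long periodMs, Runnable onDeath) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean fired = new AtomicBoolean(false);
        timer.scheduleAtFixedRate(() -> {
            if (!isAlive(pid) && fired.compareAndSet(false, true)) {
                try {
                    onDeath.run();
                } finally {
                    timer.shutdown(); // stop polling after the first detection
                }
            }
        }, 0, periodMs, TimeUnit.MILLISECONDS);
        return timer;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: PidSupervisor <regionserver-pid>");
            return;
        }
        long pid = Long.parseLong(args[0]);
        // 100 ms matches the polling period suggested in the issue description.
        watch(pid, 100, () -> System.out.println("pid " + pid + " is gone"));
    }
}
```

A watcher like this only sees processes on its own host, so it can detect a 
kill -9 quickly but cannot by itself tell a dead host from a network partition.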

As stack and Lars said, the shutdown hook is called only when the regionserver 
process is alive and its program logic is not interrupted. A kill -9 does not 
trigger the shutdown hook, so the method deleteMyEphemeralNode is not 
executed, in which case we'd need to rely on the ZK timeout.
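
The shutdown hook behaviour can be illustrated with a small standalone sketch. 
ShutdownHookDemo and registerCleanup are hypothetical names, and the printed 
message only stands in for deleteMyEphemeralNode. Runtime.halt is used here as 
an in-JVM approximation of kill -9, since both end the process without running 
shutdown hooks.

```java
// Hypothetical demo; the cleanup action stands in for deleteMyEphemeralNode.
final class ShutdownHookDemo {

    /** Registers cleanup to run on orderly JVM shutdown and returns the hook thread. */
    static Thread registerCleanup(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerCleanup(() ->
            System.out.println("hook ran: the ephemeral node would be deleted here"));

        if (args.length > 0 && args[0].equals("halt")) {
            // halt() terminates the JVM immediately and skips shutdown hooks,
            // so the cleanup message is never printed on this path.
            Runtime.getRuntime().halt(1);
        }
        // Normal return from main: the JVM shuts down in order and the hook runs.
    }
}
```

Running it with no argument prints the hook message; running it with the 
argument halt prints nothing, which is the case where only the ZK session 
timeout removes the ephemeral node.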

My patch is intended to reduce the failover time, which improves the 
availability of HBase. We have some big online HBase clusters that all serve 
core applications, and the acceptable failover time for those applications is 
about 10s~20s, which includes splitting the hlog, recovering the hlog lease, 
and the 'zk timeout'.
                
> regionserver crashed and failover
> ---------------------------------
>
>                 Key: HBASE-5075
>                 URL: https://issues.apache.org/jira/browse/HBASE-5075
>             Project: HBase
>          Issue Type: Improvement
>          Components: monitoring, regionserver, replication, zookeeper
>    Affects Versions: 0.92.1
>            Reporter: zhiyuan.dai
>             Fix For: 0.90.5
>
>         Attachments: Degion of Failure Detection.pdf, HBase-5075-shell.patch, 
> HBase-5075-src.patch
>
>
> When a regionserver crashes, it takes too long to notify the hmaster. Once the 
> hmaster knows the regionserver is down, it takes a long time to recover the 
> hlog's lease.
> HBase is an online DB, so availability is very important.
> I have an idea to improve availability: a monitor node checks the 
> regionserver's pid. If the pid does not exist, I consider the RS down, delete 
> its znode, and force-close the hlog file.
> The detection period could then be about 100ms.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
