[
https://issues.apache.org/jira/browse/HBASE-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216672#comment-13216672
]
zhiyuan.dai commented on HBASE-5075:
------------------------------------
@Jesse
Thanks for your reply.
HBase is an online DB, so how long an HBase failover takes is very important.
Although a kill -9 or a network partition is a major event, the supervisor
can determine within milliseconds that its regionserver has crashed, and the
hmaster can then move the regions that were open on the crashed regionserver
to other live regionservers. The failover time is therefore reduced to an
acceptable level.
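To illustrate the supervisor idea, here is a minimal sketch in Java of a pid
check that polls every 100ms. It is not the attached patch: the class name
PidWatcher, the checkOnce helper, and the onDead callback are hypothetical, and
the callback is only a placeholder for deleting the regionserver's znode and
force closing its hlog. It assumes Java 9+ for ProcessHandle and that the
supervisor runs on the same host as the regionserver.

```java
public class PidWatcher {

    /** Returns true if a process with the given pid exists and is alive. */
    public static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /**
     * Checks the pid once. If the process is gone, runs onDead and returns true.
     * onDead is a hypothetical hook: in a real supervisor it would delete the
     * regionserver's ephemeral znode and force close the hlog.
     */
    public static boolean checkOnce(long pid, Runnable onDead) {
        if (isAlive(pid)) {
            return false;
        }
        onDead.run();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length == 0) {
            System.err.println("usage: PidWatcher <regionserver-pid>");
            return;
        }
        final long pid = Long.parseLong(args[0]);
        // Poll every 100ms (the period suggested in the issue description).
        while (!checkOnce(pid, () ->
                System.out.println("pid " + pid + " is gone; clean up znode and hlog here"))) {
            Thread.sleep(100);
        }
    }
}
```

A local check like this notices a kill -9 without waiting for the ZK session
timeout, but it cannot see a network partition, since the process is still
alive on its own host.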
As stack and Lars said, the shutdown hook is called only when the regionserver
process is alive and its program logic isn't interrupted. A kill -9 can't
trigger the shutdown hook, so the method deleteMyEphemeralNode would not be
executed, in which case we'd need to rely on the ZK timeout.
My patch is meant to reduce the failover time, which improves the availability
of HBase. We have some big online HBase clusters that all serve core
applications, and the acceptable failover time for those applications is about
10s~20s, which includes splitting the hlog, recovering the hlog lease, and the
'zk timeout'.
> regionserver crashed and failover
> ---------------------------------
>
> Key: HBASE-5075
> URL: https://issues.apache.org/jira/browse/HBASE-5075
> Project: HBase
> Issue Type: Improvement
> Components: monitoring, regionserver, replication, zookeeper
> Affects Versions: 0.92.1
> Reporter: zhiyuan.dai
> Fix For: 0.90.5
>
> Attachments: Degion of Failure Detection.pdf, HBase-5075-shell.patch,
> HBase-5075-src.patch
>
>
> When a regionserver crashes, it takes too long to notify the hmaster. Once the
> hmaster knows the regionserver is down, it takes a long time to recover the
> hlog's lease.
> HBase is an online DB, so availability is very important.
> I have an idea to improve availability: a monitor node checks the
> regionserver's pid. If the pid does not exist, I consider the rs down, delete
> the znode, and force close the hlog file.
> The detection period could then be about 100ms.