[
https://issues.apache.org/jira/browse/HBASE-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211691#comment-13211691
]
zhiyuan.dai commented on HBASE-5075:
------------------------------------
@stack @Lars Hofhansl
First, the RPC method getRSPidAndRsZknode fetches the PID and the znode, which
includes the domain and service port. This approach is reliable; if we relied
on the process list instead, there could be misjudgments.
Second, there is a supervisor called RegionServerFailureDetection, which is a
watchdog for the RegionServer. We start the regionserver first and then start
RegionServerFailureDetection.
The supervisor (RegionServerFailureDetection) then fetches the regionserver's
PID and znode by calling getRSPidAndRsZknode.
RegionServerFailureDetection has nothing to do with long GC pauses.
RegionServerFailureDetection first checks whether the PID is alive and then
checks whether the service port is alive.
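
The two checks described above (PID alive, then service port alive) could be
sketched roughly as follows. This is only an illustration, not code from
HBase-5075-src.patch: the class name RsLivenessCheck and its method names are
hypothetical, the PID check assumes the watchdog runs on the same host as the
regionserver, and ProcessHandle assumes Java 9 or later, which is newer than
the HBase versions discussed here.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not taken from HBase-5075-src.patch.
class RsLivenessCheck {

    // True if a process with this PID currently exists on the local host.
    static boolean isPidAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // True if a TCP connection to host:port succeeds within the timeout.
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    // PID check first, then the service port check, as described in the comment.
    static boolean isRegionServerAlive(long pid, String host, int port, int timeoutMillis) {
        return isPidAlive(pid) && isPortOpen(host, port, timeoutMillis);
    }

    public static void main(String[] args) {
        long pid = Long.parseLong(args[0]);   // PID as returned by getRSPidAndRsZknode
        String host = args[1];                // domain from the znode
        int port = Integer.parseInt(args[2]); // service port from the znode
        System.out.println(isRegionServerAlive(pid, host, port, 100));
    }
}
```

In this sketch a false result would be the point at which the watchdog deletes
the znode; that step is left out because it depends on the ZooKeeper setup.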
> regionserver crashed and failover
> ---------------------------------
>
> Key: HBASE-5075
> URL: https://issues.apache.org/jira/browse/HBASE-5075
> Project: HBase
> Issue Type: Improvement
> Components: monitoring, regionserver, replication, zookeeper
> Affects Versions: 0.92.1
> Reporter: zhiyuan.dai
> Fix For: 0.90.5
>
> Attachments: Degion of Failure Detection.pdf, HBase-5075-src.patch
>
>
> When a regionserver crashes, it takes too long to notify the hmaster. Once the
> hmaster knows the regionserver is down, it takes a long time to fetch the
> hlog's lease.
> HBase is an online db, so availability is very important.
> I have an idea to improve availability: a monitor node checks the
> regionserver's pid. If this pid does not exist, I consider the rs down,
> delete the znode, and force close the hlog file.
> The check period could be about 100 ms.