[ 
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562253#comment-13562253
 ] 

Tianying Chang commented on HBASE-5843:
---------------------------------------

@nkeywal  Thanks! 

I have two more questions regarding your "Improving failure detection" section. 

1. Hardware failure detection depends on the ZooKeeper session timeout.
You mentioned it is 30 sec, so the mean time to detect is 15 sec, which is
reasonably short. But I think the default value nowadays is 180 sec (I am
referring to zookeeper.session.timeout). I can see in our cluster that the
master does not detect the crashed machine until 3 minutes later, when the
ephemeral znode times out. So it takes 3 minutes to detect a hardware failure
in the default case. That seems pretty long and should be improved.

2. For a software bug leading to a dirty stop, you mentioned launching the
region server with a script. I think we can use daemontool to start the region
server, and in daemontool's callback module we can clean up the znode based on
the exit error code. daemontool can also be configured to restart the RS
immediately if needed. That seems like it could simplify the process. What do
you think?
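
To make the numbers in point 1 concrete, here is a small sketch of the
arithmetic. It assumes the rule of thumb used above (mean detection delay is
half the session timeout, worst case is the full timeout); the function name
is made up for illustration and is not part of HBase or ZooKeeper.

```python
def detection_delay_seconds(session_timeout_ms):
    """Estimate failure detection delay from zookeeper.session.timeout (ms).

    Assumption: mean delay is half the timeout, worst case is the full
    timeout. This mirrors the reasoning in the comment, not ZooKeeper internals.
    """
    timeout_s = session_timeout_ms / 1000.0
    return {"mean": timeout_s / 2, "worst": timeout_s}


for timeout_ms in (30_000, 180_000):
    delay = detection_delay_seconds(timeout_ms)
    print(
        f"zookeeper.session.timeout={timeout_ms} ms: "
        f"mean {delay['mean']:.0f} s, worst case {delay['worst']:.0f} s"
    )
```

Under this assumption, the 180 sec setting gives a 90 sec mean and a 3 minute
worst case, which matches what we see on our cluster.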
                
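The idea in point 2 could look roughly like the sketch below: a wrapper starts
the process, inspects the exit code, and runs a cleanup callback on a dirty
stop. All names here are hypothetical, the znode deletion is only a
placeholder print, and this is not how daemontool or the HBase scripts
actually work.

```python
import subprocess
import sys


def run_and_clean_up(command, cleanup, clean_exit_codes=(0,)):
    """Run `command`; call `cleanup(exit_code)` if it did not stop cleanly.

    Returns the child's exit code (negative if it was killed by a signal).
    """
    exit_code = subprocess.call(command)
    if exit_code not in clean_exit_codes:
        cleanup(exit_code)
    return exit_code


def delete_ephemeral_znode(exit_code):
    # Hypothetical placeholder: a real callback would delete the region
    # server's znode in ZooKeeper so the master does not wait for the
    # session timeout.
    print(f"region server exited with code {exit_code}; deleting its znode")


if __name__ == "__main__":
    # Stand-in for the real region server launch command.
    fake_region_server = [sys.executable, "-c", "import sys; sys.exit(3)"]
    run_and_clean_up(fake_region_server, delete_ephemeral_znode)
```

A supervisor that restarts the process would simply call the wrapper again
after the cleanup has run.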
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a 
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
