[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

nkeywal (JIRA) Wed, 23 Jan 2013 00:22:22 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560472#comment-13560472
 ]


nkeywal commented on HBASE-5843:
--------------------------------

bq. What is the application bug (AB) mentioned in your design doc? Do you mean 
an hbase bug, or an hbase client application code bug? 
Mainly HBase, but it could also be a coprocessor issue. HBase can be 
configured to stop the regionserver if a coprocessor throws unexpected 
exceptions, but it's quite easy to write buggy code, such as a coprocessor 
that takes resources without freeing them. In that case you may need to stop 
the region server.
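
To illustrate the kind of coprocessor bug meant here, below is a minimal, 
self-contained sketch. It does not use the HBase coprocessor API: the class 
name, the hook methods, and the Semaphore standing in for a limited 
regionserver resource are all hypothetical. It only shows how a hook that 
fails between acquire and release leaks the resource without any exception 
reaching the host, while a try/finally version does not.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for code running inside a regionserver process.
// None of these names come from HBase.
public class CoprocessorLeakSketch {
    // Assumed size of the shared resource pool (illustrative only).
    static final int PERMITS = 2;

    // Buggy hook: if the work fails, the permit is never given back.
    static void leakyHook(Semaphore pool, boolean fail) {
        pool.acquireUninterruptibly();
        if (fail) {
            throw new IllegalStateException("simulated failure in hook");
        }
        pool.release();
    }

    // Safer hook: the permit is released even when the work fails.
    static void safeHook(Semaphore pool, boolean fail) {
        pool.acquireUninterruptibly();
        try {
            if (fail) {
                throw new IllegalStateException("simulated failure in hook");
            }
        } finally {
            pool.release();
        }
    }

    // Runs `failures` failing calls and reports how many permits remain.
    // Keep failures <= PERMITS: beyond that the leaky hook blocks forever,
    // which is the situation where only a restart would help.
    static int permitsLeftAfterFailures(boolean safe, int failures) {
        Semaphore pool = new Semaphore(PERMITS);
        for (int i = 0; i < failures; i++) {
            try {
                if (safe) {
                    safeHook(pool, true);
                } else {
                    leakyHook(pool, true);
                }
            } catch (IllegalStateException expected) {
                // The caller sees an ordinary exception; the leak is silent.
            }
        }
        return pool.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + permitsLeftAfterFailures(false, 2));
        System.out.println("safe:  " + permitsLeftAfterFailures(true, 2));
    }
}
```

In this sketch the leaky variant ends with 0 permits and the safe one with 2. 
Once the pool is exhausted, nothing in the process can recover the permits, 
which is why a restart of the hosting process may be the only fix.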


bq. If it is an hbase client application code bug, does fixing the issue 
require a stop/start of the region server? 
For a pure client (i.e. a user of the hbase.client package), it would be an 
HBase bug imho: HBase/a regionserver should be resistant to any client behavior.
For a coprocessor, it's client code executed within the regionserver process. 
Thanks to Java, many coprocessor bugs will have a limited effect, but as said 
above there are some cases that cannot be handled simply.

bq. If it is an hbase code bug, do you mean an hbase bug that causes the region 
server to enter some bad state like a deadlock, and so on? I think that could 
benefit from restarting the region server to fix the problem. 
Yes.
                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - a failure impacts client applications only by an added delay to execute a 
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stop/start of a cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
