[ 
https://issues.apache.org/jira/browse/HBASE-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640246#action_12640246
 ] 

stack commented on HBASE-932:
-----------------------------

Yeah, we had a babysitter on our cluster.  His name was 'god'.  He got fired, 
though, because he was forever doing restarts when they weren't wanted and was 
generally interfering and causing trouble.

There's for sure a place for babysitters.  This issue is based on the postulate 
that sometimes the regionserver knows more about its own state, or about how it 
might fix itself, than it could ever reveal to an external generic daemon 
babysitter.  For a few hdfs error types, a pause and complete restart, perhaps 
attempted N times at most, could set a regionserver aright again.  The danger 
would be replicating 'god' behavior: restarts when they aren't wanted.
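
A rough sketch of the "pause and complete restart, N times at most" idea, in 
Python for brevity rather than as HBase code.  The names restart_with_budget, 
is_recoverable and GiveUp are made up for illustration and are not HBase API; 
the default attempt count and pause are arbitrary assumptions.

```python
import time


class GiveUp(Exception):
    """Raised when the restart budget is exhausted (hypothetical name)."""


def restart_with_budget(restart, is_recoverable, max_attempts=3,
                        pause_seconds=5.0, sleep=time.sleep):
    """Pause, then attempt a complete restart, at most max_attempts times.

    restart:        callable doing the full restart; raises on failure.
    is_recoverable: callable(exc) -> bool; True only for the few error
                    types believed to be fixable by a restart.
    Returns the number of attempts used when a restart succeeds.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        sleep(pause_seconds)  # give hdfs a moment before each attempt
        try:
            restart()
            return attempt
        except Exception as exc:
            if not is_recoverable(exc):
                raise  # unknown error type: do not retry, fail as before
            last_error = exc
    # Budget spent: stop here instead of restarting forever.
    raise GiveUp(f"restart failed {max_attempts} times") from last_error
```

The hard cap and the is_recoverable check are what would keep this from 
turning into 'god': it only retries error types it claims to understand, and 
it gives up and shuts down once the budget is spent.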

> Regionserver restart
> --------------------
>
>                 Key: HBASE-932
>                 URL: https://issues.apache.org/jira/browse/HBASE-932
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>
> If we drop a flush or we fail to close a write-ahead log, we currently shut 
> down the regionserver (we usually fail because of hdfs).  Rather than shut 
> themselves down, how about they restart?  The restart, at least in the 
> HBASE-930 case, might fix the issue by shaking DFSClient back to its senses.  
> Even if HDFS is bad, it'll come around eventually.  The HRS restarting itself 
> plus the HBASE-926 fix will make for fast recovery.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
