regionserver should do basic health check before reporting alls-well to the 
master
----------------------------------------------------------------------------------

                 Key: HBASE-611
                 URL: https://issues.apache.org/jira/browse/HBASE-611
             Project: Hadoop HBase
          Issue Type: Improvement
    Affects Versions: 0.1.2
            Reporter: stack
            Priority: Minor
             Fix For: 0.2.0


On IRC this afternoon, a user killed a regionserver, and the kill did something
in HDFS.  Another regionserver, the one carrying the catalog tables, then
started to get exceptions out of HDFS.  The last thing it logged was:

{code}
[15:55] <jgray> 2008-05-01 15:49:51,710 FATAL org.apache.hadoop.hbase.HRegionServer: Replay of hlog required. Forcing server restart
[15:55] <jgray> org.apache.hadoop.hbase.DroppedSnapshotException: Could not get block locations. Aborting...
{code}

That's fine.

Only it didn't go down.  It was left in a state where it continued to send the
master pings as though nothing was wrong, so its lease never timed out and the
master was hosed because it couldn't get to the catalog tables.

Regionservers should do a basic check that all is healthy before they ping the
master.  If critical threads have exited, or a flag saying HDFS has been found
bad has been set, then the regionserver should stop reporting to the master so
the master can deploy its load elsewhere.
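
A rough sketch of what such a check might look like follows.  It is only an
illustration of the idea: the class and method names (HealthCheckSketch,
markFilesystemBad, isHealthy, reportIfHealthy) are made up for this example
and are not taken from the HBase code, and the actual report to the master is
stood in for by a plain Runnable.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names come from HBase itself. */
class HealthCheckSketch {
    // Cleared by any code path that decides HDFS has gone bad.
    private final AtomicBoolean fsOk = new AtomicBoolean(true);
    // Threads the server cannot do useful work without (assumed list).
    private final List<Thread> criticalThreads;

    HealthCheckSketch(List<Thread> criticalThreads) {
        this.criticalThreads = criticalThreads;
    }

    /** Record that the filesystem has been found bad. */
    void markFilesystemBad() {
        fsOk.set(false);
    }

    /** True only if the filesystem flag is good and every critical thread is alive. */
    boolean isHealthy() {
        if (!fsOk.get()) {
            return false;
        }
        for (Thread t : criticalThreads) {
            if (!t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    /**
     * Send the report only when healthy. Returns false when the report was
     * skipped, so that the master-side lease is allowed to time out.
     */
    boolean reportIfHealthy(Runnable sendReport) {
        if (!isHealthy()) {
            return false;
        }
        sendReport.run();
        return true;
    }
}
```

The point of skipping the report rather than sending an error is that the
existing lease timeout on the master then does the rest: once pings stop, the
lease expires and the master can reassign the regions.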

