regionserver should do basic health check before reporting alls-well to the master
----------------------------------------------------------------------------------

                 Key: HBASE-611
                 URL: https://issues.apache.org/jira/browse/HBASE-611
             Project: Hadoop HBase
          Issue Type: Improvement
    Affects Versions: 0.1.2
            Reporter: stack
            Priority: Minor
             Fix For: 0.2.0

On IRC this afternoon, a user killed a regionserver. It did something in HDFS. Another regionserver, the one carrying the catalog tables, started to get exceptions out of HDFS. The last thing out of it was:

{code}
[15:55] <jgray> 2008-05-01 15:49:51,710 FATAL org.apache.hadoop.hbase.HRegionServer: Replay of hlog required. Forcing server restart
[15:55] <jgray> org.apache.hadoop.hbase.DroppedSnapshotException: Could not get block locations. Aborting...
{code}

That's fine, only it didn't go down. It was left in a state where it continued to send the master pings as though nothing was wrong, so its lease never timed out and the master was hosed because it couldn't get to the catalog tables.

Regionservers should do a basic check that all is healthy before they ping the master. If critical threads have exited, or a flag saying HDFS has been found bad has been set, then the regionserver should stop reporting to the master so the master can deploy its load elsewhere.
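
The proposed check could look roughly like the sketch below. This is only an illustration under stated assumptions, not the HBase implementation: the class `HealthGate` and its methods `markFilesystemBad`, `isHealthy`, and `reportIfHealthy` are hypothetical names invented here, and the actual report to the master is stood in for by a plain `Runnable`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch of a pre-report health gate. None of these names
 * come from HBase; they only illustrate the idea in this issue.
 */
class HealthGate {
    // Threads the server cannot function without (assumed to be already started).
    private final List<Thread> criticalThreads;
    // Set once some component has decided the filesystem (HDFS) is unusable.
    private final AtomicBoolean filesystemBad = new AtomicBoolean(false);

    HealthGate(List<Thread> criticalThreads) {
        this.criticalThreads = new ArrayList<>(criticalThreads);
    }

    /** Called by whichever code path detects a filesystem failure. */
    void markFilesystemBad() {
        filesystemBad.set(true);
    }

    /** Healthy only if the filesystem flag is clear and every critical thread is alive. */
    boolean isHealthy() {
        if (filesystemBad.get()) {
            return false;
        }
        for (Thread t : criticalThreads) {
            // isAlive() is false once a thread has exited (and also before it is started).
            if (!t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    /**
     * Runs sendReport only when the server is healthy. Returns whether the
     * report was sent, so the caller can stop pinging and let its lease expire.
     */
    boolean reportIfHealthy(Runnable sendReport) {
        if (!isHealthy()) {
            return false;
        }
        sendReport.run();
        return true;
    }
}
```

With a gate like this, an unhealthy server simply goes quiet: once the pings stop, the lease times out on the master side and the regions can be reassigned, which is the behaviour asked for above.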