[ https://issues.apache.org/jira/browse/HBASE-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jim Kellerman resolved HBASE-611.
---------------------------------

    Resolution: Fixed

Added method isHealthy to HRegionServer. Reviewed by Stack. Committed.

> regionserver should do basic health check before reporting alls-well to the master
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-611
>                 URL: https://issues.apache.org/jira/browse/HBASE-611
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.1.2
>            Reporter: stack
>            Priority: Minor
>             Fix For: 0.2.0
>
>
> On IRC this afternoon, a user killed a regionserver. It did something in HDFS. Another regionserver, one carrying the catalog tables, started to get exceptions out of HDFS. The last thing out of it was:
> {code}
> [15:55] <jgray> 2008-05-01 15:49:51,710 FATAL org.apache.hadoop.hbase.HRegionServer: Replay of hlog required. Forcing server restart
> [15:55] <jgray> org.apache.hadoop.hbase.DroppedSnapshotException: Could not get block locations. Aborting...
> {code}
> That's fine.
> Only it didn't go down... it was in a state where it continued to send the master pings as though nothing was wrong, so its lease never timed out and the master was hosed because it couldn't get to the catalog tables.
> Regionservers should do a basic check that alls-healthy before they ping the master. If critical threads have exited, or a flag saying HDFS has been found bad has been set, then the regionserver should stop reporting to the master so the master can deploy its load elsewhere.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
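
For readers following the issue, the check described above (critical threads still alive, no "filesystem bad" flag set, otherwise stop reporting) might look roughly like the sketch below. This is not the committed patch: the class RegionServerHealth, the method names markFileSystemBad and maybeReport, and the Runnable standing in for the report to the master are all hypothetical names chosen for illustration. Only the method name isHealthy comes from the resolution comment.

```java
// Illustrative sketch only. Apart from the name isHealthy, every class,
// field, and method here is a hypothetical stand-in, not HBase code.
class RegionServerHealth {
    // Cleared when a filesystem (HDFS) error has been observed.
    private volatile boolean fsOk = true;
    // Threads the server cannot operate without.
    private final Thread[] criticalThreads;

    RegionServerHealth(Thread... criticalThreads) {
        this.criticalThreads = criticalThreads.clone();
    }

    // Record that the filesystem has been found bad.
    void markFileSystemBad() {
        fsOk = false;
    }

    // Healthy only if the filesystem flag is still set and every
    // critical thread is still running.
    boolean isHealthy() {
        if (!fsOk) {
            return false;
        }
        for (Thread t : criticalThreads) {
            if (!t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    // Called once per reporting interval. Returns true if a report was
    // sent. When unhealthy it stays silent, so the master's lease on
    // this server can time out and its regions can be reassigned.
    boolean maybeReport(Runnable sendReportToMaster) {
        if (!isHealthy()) {
            return false;
        }
        sendReportToMaster.run();
        return true;
    }
}
```

The point of gating the report itself, rather than only logging the failure, is that the master's only signal in this scenario is the absence of pings.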