[ https://issues.apache.org/jira/browse/HBASE-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman resolved HBASE-611.
---------------------------------

    Resolution: Fixed

Added method isHealthy to HRegionServer. Reviewed by Stack. Committed.

> regionserver should do basic health check before reporting alls-well to the master
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-611
>                 URL: https://issues.apache.org/jira/browse/HBASE-611
>             Project: Hadoop HBase
>          Issue Type: Improvement
>    Affects Versions: 0.1.2
>            Reporter: stack
>            Priority: Minor
>             Fix For: 0.2.0
>
>
> On IRC this afternoon, a user killed a regionserver, which did something in
> HDFS. Another regionserver, the one carrying the catalog tables, then started
> getting exceptions out of HDFS. The last thing it logged was:
> {code}
> [15:55] <jgray> 2008-05-01 15:49:51,710 FATAL org.apache.hadoop.hbase.HRegionServer: Replay of hlog required. Forcing server restart
> [15:55] <jgray> org.apache.hadoop.hbase.DroppedSnapshotException: Could not get block locations. Aborting...
> {code}
> That's fine.
> Only it didn't go down. It stayed in a state where it kept sending the master
> pings as though nothing were wrong, so its lease never timed out, and the
> master was hosed because it couldn't reach the catalog tables.
> Regionservers should do a basic check that all is healthy before they ping the
> master. If critical threads have exited, or a flag indicating that HDFS has
> been found bad has been set, then the regionserver should stop reporting to
> the master so the master can deploy its load elsewhere.
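
A minimal Java sketch of the check proposed in the description, assuming a simple timestamp-based lease on the master side. The names `ServerLease`, `HealthCheckedRegionServer`, `hdfsBad`, `registerCriticalThread` and `reportToMaster` are hypothetical illustrations, not the committed `HRegionServer.isHealthy` code. The sketch only shows why gating each ping on a health check lets the lease lapse so the master can move the load.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

/** Master side (hypothetical): a lease that lapses if no ping arrives within leasePeriodMs. */
class ServerLease {
    private final long leasePeriodMs;
    private long lastPingMs;

    ServerLease(long leasePeriodMs, long startMs) {
        this.leasePeriodMs = leasePeriodMs;
        this.lastPingMs = startMs;
    }

    /** Called whenever the regionserver reports in; renews the lease. */
    void renew(long nowMs) {
        lastPingMs = nowMs;
    }

    /** Once expired, the master is free to deploy this server's load elsewhere. */
    boolean isExpired(long nowMs) {
        return nowMs - lastPingMs > leasePeriodMs;
    }
}

/** Regionserver side (hypothetical): reports to the master only while healthy. */
class HealthCheckedRegionServer {
    // Hypothetical flag, set once HDFS has been found bad (e.g. after a fatal
    // error like the one logged in this issue).
    final AtomicBoolean hdfsBad = new AtomicBoolean(false);
    // Threads whose exit should make this server stop reporting; a
    // copy-on-write list so threads can be registered concurrently.
    private final List<Thread> criticalThreads = new CopyOnWriteArrayList<>();

    void registerCriticalThread(Thread t) {
        criticalThreads.add(t);
    }

    /** Basic health check: HDFS flag not set and every critical thread still alive. */
    boolean isHealthy() {
        if (hdfsBad.get()) {
            return false;
        }
        for (Thread t : criticalThreads) {
            if (!t.isAlive()) {
                return false;
            }
        }
        return true;
    }

    /** One iteration of the report loop; returns true only if a ping was sent. */
    boolean reportToMaster(ServerLease lease, long nowMs) {
        if (!isHealthy()) {
            return false; // stay silent so the lease times out
        }
        lease.renew(nowMs);
        return true;
    }
}
```

In the incident above, the equivalent of `hdfsBad` would have been set when the server hit the fatal HDFS error. From then on its pings would stop, and its lease would expire one lease period after the last successful report instead of being renewed indefinitely.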

-- 
This message is automatically generated by JIRA.