[jira] Commented: (HBASE-2940) Improve behavior under partial failure of region servers

Ted Yu (JIRA) Sun, 29 Aug 2010 17:56:34 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904058#action_12904058
 ]


Ted Yu commented on HBASE-2940:
-------------------------------

Since hbase.rootdir points to the Hadoop NameNode, the HBase Master can poll 
Hadoop for the list of live data nodes. If a data node stays down for longer 
than a specified duration and an RS runs on the same server, the Master can 
blacklist that RS (assuming that RS's heartbeat also shows a problem during 
the same time period).
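
A rough sketch of that rule in Python, just to make the condition concrete. 
The function name, the two dicts of last-seen timestamps, and both thresholds 
are placeholders made up for illustration, not existing HBase or Hadoop APIs. 
How the Master would actually collect data node liveness from the NameNode is 
left out, and "problem with heartbeat" is read here as "the last RS heartbeat 
is older than some allowed gap", which is only one possible interpretation.

```python
def servers_to_blacklist(now, dn_last_seen, rs_last_heartbeat,
                         dn_down_threshold_s, rs_heartbeat_gap_s):
    """Return hosts whose region server the Master would blacklist.

    All arguments are hypothetical inputs:
      now                 -- current time in seconds
      dn_last_seen        -- {host: time the data node was last reported live}
      rs_last_heartbeat   -- {host: time of the last region server heartbeat}
      dn_down_threshold_s -- how long a data node may be down before it counts
      rs_heartbeat_gap_s  -- heartbeat age that counts as a heartbeat problem
    """
    blacklisted = []
    for host, rs_seen in rs_last_heartbeat.items():
        dn_seen = dn_last_seen.get(host)
        if dn_seen is None:
            # No data node known on this host, so the rule does not apply.
            continue
        dn_down = (now - dn_seen) > dn_down_threshold_s
        rs_late = (now - rs_seen) > rs_heartbeat_gap_s
        # Both signals must agree before the RS is blacklisted.
        if dn_down and rs_late:
            blacklisted.append(host)
    return sorted(blacklisted)
```

Requiring both signals keeps a healthy RS from being blacklisted just because 
the co-located data node was stopped on purpose.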

> Improve behavior under partial failure of region servers
> --------------------------------------------------------
>
>                 Key: HBASE-2940
>                 URL: https://issues.apache.org/jira/browse/HBASE-2940
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver
>            Reporter: Todd Lipcon
>
> On larger clusters, we often see failure cases where a server is "up" (i.e., 
> heartbeating) but unable to actually service requests properly (or at a 
> reasonable speed). This can happen for any number of reasons including:
> - failing disks or disk controllers respond, but do so very slowly
> - the machine is swapping, so everything is still running but much more 
> slowly than expected
> - HBase or the DN on the machine has been misconfigured (e.g., missing lzo libs) 
> so it fails to correctly open regions, perform flushes, etc.
> Here are a few proposed features that are worth considering:
> 1) Add a "blacklist" or "remote shutdown" functionality to the master. This 
> is useful if the region server is up but for some reason the admin can't ssh 
> in to shut it down (e.g., the root disk has failed). This feature would allow 
> the admin to issue a command that will shut down any given RS.
> 2) Periodically run a "health check" script on the region server node. If the 
> script returns an error code, the RS could shut itself down gracefully and 
> report an error message on the master console.
> 3) Allow clients to report back RS-specific errors to the master. This would 
> be useful for monitoring, and we could add heuristics to automatically shut 
> down region servers if they have an elevated error count over some period of 
> time.
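
For proposal 3 in the description above, the "elevated error count over some 
period of time" heuristic could look something like the sketch below. The 
class name, the window length, and the error limit are all assumptions made 
for illustration; nothing here is an existing HBase interface. It also assumes 
that reports for a given server arrive in non-decreasing time order.

```python
from collections import defaultdict, deque


class ErrorTracker:
    """Hypothetical sliding-window counter of client-reported RS errors."""

    def __init__(self, window_s, max_errors):
        self.window_s = window_s      # length of the sliding window, seconds
        self.max_errors = max_errors  # more errors than this flags the server
        self._events = defaultdict(deque)

    def report(self, server, ts):
        # Record one client-reported error against a region server.
        self._events[server].append(ts)

    def flagged(self, now):
        # Return servers with more than max_errors inside the window.
        result = []
        cutoff = now - self.window_s
        for server, events in self._events.items():
            # Drop reports that have aged out of the window.
            while events and events[0] <= cutoff:
                events.popleft()
            if len(events) > self.max_errors:
                result.append(server)
        return sorted(result)
```

The Master could call flagged() periodically and treat the result as 
candidates for shutdown or for closer monitoring, rather than acting on it 
blindly.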

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
