[ 
https://issues.apache.org/jira/browse/HBASE-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904269#action_12904269
 ] 

Jonathan Gray commented on HBASE-2940:
--------------------------------------

I like this direction.  We're starting to use HBCK fairly heavily as a periodic 
health-check script, and there have been some ideas about adding a basic 
read/write verification test on each RS as part of it.

@Todd, #1 sounds good.  There has been talk from the ops guys here about having a 
separate file with a list of blacklisted RSs (I believe there's something like 
this in Hadoop), so you can add nodes that are under maintenance or that are 
blacklisted for a reason related to this jira.  #2, see above.  An RS sanity 
check would definitely be nice (can you append to the log, can you do a basic 
read/write, etc.).  #3 is interesting; I need to think on that more.
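
As a rough illustration of #2, here is a minimal sketch of running an external 
health-check script and treating a non-zero exit code or a timeout as 
unhealthy.  Nothing here is existing HBase code: the function name, the result 
type, the 30 second default, and the "exit code 0 means healthy" convention are 
all assumptions made for the example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class HealthResult:
    healthy: bool
    detail: str


def run_health_check(command, timeout_s=30.0):
    """Run an external check command (hypothetical helper, not an HBase API).

    Assumed convention: exit code 0 means healthy; anything else, a timeout,
    or a failure to launch the command means unhealthy.
    """
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung check is itself a symptom (slow disks, swapping, ...).
        return HealthResult(False, f"check timed out after {timeout_s}s")
    except OSError as exc:
        return HealthResult(False, f"could not run check: {exc}")

    if proc.returncode != 0:
        message = proc.stderr.strip() or proc.stdout.strip()
        return HealthResult(False, f"exit code {proc.returncode}: {message}")
    return HealthResult(True, "ok")


if __name__ == "__main__":
    # Stand-in for a real script: a child Python process that exits with 0.
    print(run_health_check([sys.executable, "-c", "pass"]))
```

The detail string is what an RS could report to the master before shutting 
itself down; how that report is delivered is left open here.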

@Ryan, that seems like a secondary mechanism for shutting down an RS because it 
will always require log replay.  Though the reasons above might require a 
forceful external abort, if the RS is responsive, we should do a controlled 
shutdown so regions can be flushed.  If it takes too long or the RS is 
unresponsive, then using the HLog sounds like a good strategy.  We need to be 
sure that whatever properties of HDFS appends we're using will not change 
between the 0.20 implementation and the 0.21 and later one.
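
The ordering I have in mind (controlled shutdown first, forceful abort only as 
a fallback) can be sketched roughly as below.  This is only an illustration: 
the three callbacks, the function name, and the 60 second grace period are 
hypothetical, and the forceful path (for example, one based on the HLog) is 
deliberately left abstract.

```python
import time


def stop_region_server(request_graceful_stop, is_stopped, force_abort,
                       grace_period_s=60.0, poll_interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Try a controlled shutdown first; escalate to a forceful abort.

    All three callbacks are hypothetical hooks supplied by the caller.
    Returns "graceful" or "forced" so the caller knows whether log replay
    should be expected.
    """
    try:
        request_graceful_stop()
    except Exception:
        # The RS did not even accept the request: treat it as unresponsive.
        force_abort()
        return "forced"

    deadline = clock() + grace_period_s
    while clock() < deadline:
        if is_stopped():
            # Regions were flushed and closed by the RS itself.
            return "graceful"
        sleep(poll_interval_s)

    # The controlled shutdown took too long; fall back to the forceful path.
    force_abort()
    return "forced"
```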

@Ted, HBase is far better at determining live/dead nodes than the NN (ZK vs 
3 minute timeout heartbeats), so I wouldn't expect this to be a big win.  It's 
also an open question whether you would always want an RS with a dead DN on the 
same machine to also go down.  Maybe there is a situation where this would be 
useful information to have in HBase, but I need to think on it more.  In most 
instances, if there is a problem with the node and we want HBase to proactively 
kill an RS, we would know in HBase-land.

> Improve behavior under partial failure of region servers
> --------------------------------------------------------
>
>                 Key: HBASE-2940
>                 URL: https://issues.apache.org/jira/browse/HBASE-2940
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver
>            Reporter: Todd Lipcon
>
> On larger clusters, we often see failure cases where a server is "up" (ie 
> heartbeating) but unable to actually service requests properly (or at a 
> reasonable speed). This can happen for any number of reasons including:
>
> - failing disks or disk controllers respond, but do so very slowly
> - the machine is swapping, so everything is still running but much more 
> slowly than expected
> - HBase or the DN on the machine has been misconfigured (eg missing lzo libs) 
> so it fails to correctly open regions, perform flushes, etc.
>
> Here are a few proposed features that are worth considering:
>
> 1) Add a "blacklist" or "remote shutdown" functionality to the master. This 
> is useful if the region server is up but for some reason the admin can't ssh 
> in to shut it down (eg the root disk has failed). This feature would allow 
> the admin to issue a command that will shut down any given RS.
> 2) Periodically run a "health check" script on the region server node. If the 
> script returns an error code, the RS could shut itself down gracefully and 
> report an error message on the master console.
> 3) Allow clients to report back RS-specific errors to the master. This would 
> be useful for monitoring, and we could add heuristics to automatically shut 
> down region servers if they have an elevated error count over some period of 
> time.
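
For the heuristic in proposal 3 of the description above, one simple option is 
a per-server sliding window over client error reports.  The sketch below is an 
assumption-laden illustration, not a design from the issue: the class name, the 
5 minute window, and the threshold of 50 are invented, and it assumes reports 
for a given server arrive in non-decreasing time order.

```python
from collections import defaultdict, deque


class ErrorWindow:
    """Count client-reported errors per region server over a sliding window.

    Hypothetical helper; window_s and threshold are illustrative defaults.
    """

    def __init__(self, window_s=300.0, threshold=50):
        self.window_s = window_s
        self.threshold = threshold
        self._events = defaultdict(deque)  # server name -> report timestamps

    def report(self, server, now):
        """Record one client-reported error against `server` at time `now`."""
        self._events[server].append(now)

    def count(self, server, now):
        """Number of errors for `server` within the last window_s seconds."""
        events = self._events[server]
        cutoff = now - self.window_s
        # Drop reports that have aged out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)

    def suspects(self, now):
        """Servers whose recent error count has reached the threshold."""
        return sorted(
            server for server in list(self._events)
            if self.count(server, now) >= self.threshold
        )
```

Whether a suspect server is only flagged for monitoring or actually shut down 
is a separate policy decision.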

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
