[ https://issues.apache.org/jira/browse/HBASE-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904269#action_12904269 ]
Jonathan Gray commented on HBASE-2940:
--------------------------------------
I like this direction. We're starting to use HBCK fairly heavily as a periodic
health-check script, and there have been some ideas about adding a basic
read/write verification test against each RS as part of it.
@Todd, #1 sounds good. There has been talk from the ops guys here about keeping
a separate file with a list of blacklisted RSs (Hadoop has something like this,
I believe), so nodes could be marked as under maintenance or blacklisted via
something related to this JIRA. #2, see above; an RS sanity check would
definitely be nice (can you append to the log, can you do a basic read/write,
etc.). #3 is interesting; I need to think on that more.
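To make #2 a bit more concrete, here's a minimal sketch of the kind of RS sanity
check I mean: do a trivial write and read against a table and exit non-zero so a
health-check wrapper can act on it. The table name, column family, and the
0.90-style client API usage are placeholders/assumptions, not a worked-out design.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RsSanityCheck {
  public static void main(String[] args) {
    byte[] row = Bytes.toBytes("sanity-" + System.currentTimeMillis());
    byte[] family = Bytes.toBytes("f");          // placeholder column family
    byte[] qualifier = Bytes.toBytes("probe");
    byte[] value = Bytes.toBytes("ok");
    try {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "sanity_check");  // placeholder table
      // Basic write-path check.
      Put put = new Put(row);
      put.add(family, qualifier, value);
      table.put(put);
      // Basic read-path check: make sure the value comes back.
      Result result = table.get(new Get(row));
      if (!Bytes.equals(value, result.getValue(family, qualifier))) {
        System.err.println("read-back mismatch");
        System.exit(1);
      }
      table.close();
    } catch (Exception e) {
      e.printStackTrace();
      System.exit(1);  // non-zero exit lets a health-check wrapper act on it
    }
    System.exit(0);
  }
}
{code}
A script run periodically on each RS node (as in #2) could invoke something like
this and trigger a graceful shutdown on a non-zero exit.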
@Ryan, that seems like a secondary mechanism for shutting down an RS, because it
will always require log replay. The reasons above might require a forceful
external abort, but if the RS is responsive we should do a controlled shutdown
so regions can be flushed. If that takes too long, or the RS is unresponsive,
then falling back to the HLog sounds like a good strategy. We need to be sure
that whatever properties of HDFS appends we rely on will not change between the
0.20 implementation and the 0.21-and-later one.
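To spell out that ordering, a rough sketch; the handle interface, its methods,
and the timeout value are hypothetical, not existing HBase API:
{code}
public class ShutdownPolicy {

  /** Hypothetical handle to a remote region server. */
  interface RegionServerHandle {
    void requestControlledShutdown();    // hypothetical: flush and close regions
    boolean awaitShutdown(long millis);  // hypothetical: true if it finished in time
    void forceAbort();                   // hypothetical: kill; master splits HLogs
  }

  private static final long GRACEFUL_TIMEOUT_MS = 60 * 1000;  // made-up value

  public static void shutDown(RegionServerHandle rs) {
    // Prefer the controlled path: regions flushed, no log replay needed.
    rs.requestControlledShutdown();
    if (!rs.awaitShutdown(GRACEFUL_TIMEOUT_MS)) {
      // RS unresponsive or too slow: forceful abort, recover edits from the HLog.
      rs.forceAbort();
    }
  }
}
{code}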
@Ted, HBase is far better at determining live/dead nodes than the NN (ZooKeeper
vs. 3-minute heartbeat timeouts), so I wouldn't expect this to be a big win.
It's also an open question whether you would always want an RS to go down just
because the DN on the same machine died. Maybe there is a situation where this
would be useful information to have in HBase, but I need to think on it more.
In most cases, if there is a problem with the node and we want HBase to
proactively kill an RS, we would already know about it in HBase-land.
> Improve behavior under partial failure of region servers
> --------------------------------------------------------
>
> Key: HBASE-2940
> URL: https://issues.apache.org/jira/browse/HBASE-2940
> Project: HBase
> Issue Type: New Feature
> Components: master, regionserver
> Reporter: Todd Lipcon
>
> On larger clusters, we often see failure cases where a server is "up" (i.e.
> heartbeating) but unable to actually service requests properly (or at a
> reasonable speed). This can happen for any number of reasons, including:
> - failing disks or disk controllers that respond, but do so very slowly
> - the machine is swapping, so everything is still running but much more
> slowly than expected
> - HBase or the DN on the machine has been misconfigured (e.g. missing LZO
> libs), so it fails to correctly open regions, perform flushes, etc.
> Here are a few proposed features that are worth considering:
> 1) Add a "blacklist" or "remote shutdown" feature to the master. This is
> useful if the region server is up but for some reason the admin can't ssh
> in to shut it down (e.g. the root disk has failed). This feature would allow
> the admin to issue a command that shuts down any given RS.
> 2) Periodically run a "health check" script on the region server node. If the
> script returns an error code, the RS could shut itself down gracefully and
> report an error message on the master console.
> 3) Allow clients to report back RS-specific errors to the master. This would
> be useful for monitoring, and we could add heuristics to automatically shut
> down region servers if they have an elevated error count over some period of
> time.
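A rough illustration of the error-count heuristic #3 describes, just to show the
shape of it (class name, window, and threshold are invented, not proposed API):
the master keeps a sliding window of client-reported errors per RS and flags the
server once the count in the window crosses a threshold.
{code}
import java.util.ArrayDeque;
import java.util.Deque;

public class RsErrorTracker {
  private final Deque<Long> errorTimestamps = new ArrayDeque<Long>();
  private final long windowMillis;
  private final int threshold;

  public RsErrorTracker(long windowMillis, int threshold) {
    this.windowMillis = windowMillis;
    this.threshold = threshold;
  }

  /** Record one client-reported error; returns true if the RS should be flagged. */
  public synchronized boolean recordError(long nowMillis) {
    errorTimestamps.addLast(nowMillis);
    // Drop reports that have aged out of the window.
    while (!errorTimestamps.isEmpty()
        && nowMillis - errorTimestamps.peekFirst() > windowMillis) {
      errorTimestamps.removeFirst();
    }
    return errorTimestamps.size() >= threshold;
  }
}
{code}
The master could keep one such tracker per RS and, when recordError() returns
true, either raise an alert or feed into the shutdown mechanisms from #1/#2.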