[
https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800155#action_12800155
]
Lars George commented on HBASE-2117:
------------------------------------
My 2c is to use Nagios et al. Add the number of regionservers (max/current) to
the hmaster metrics and use a check to verify that they are the same. If they
are not, raise an alarm with the typical escalation. I assume the Hadoop team
could adopt the same method for datanodes and jobtrackers.
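
To make the suggestion above concrete, here is a minimal sketch of such a check
in Python. It assumes the two counts (max and current) have already been read
from the master metrics; how they are fetched is not shown, and the names
`check_regionservers` and `main` are hypothetical. The only thing taken as
given is the usual Nagios plugin exit-code convention (0 OK, 1 WARNING,
2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_regionservers(expected, current):
    """Compare the expected (max) and current regionserver counts.

    Returns an (exit_code, message) pair in the Nagios plugin style.
    Both arguments are hypothetical inputs; reading them from the
    master metrics is outside the scope of this sketch.
    """
    if expected is None or current is None:
        return UNKNOWN, "UNKNOWN: regionserver counts unavailable"
    if current == expected:
        return OK, "OK: %d of %d regionservers running" % (current, expected)
    return CRITICAL, "CRITICAL: %d of %d regionservers running" % (current, expected)


def main(argv):
    """Hypothetical entry point: argv holds [expected, current] as strings."""
    if len(argv) != 2:
        print("UNKNOWN: usage: check_regionservers <expected> <current>")
        return UNKNOWN
    code, message = check_regionservers(int(argv[0]), int(argv[1]))
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv[1:]))` so that Nagios sees
the exit code and can apply its normal escalation rules.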
> Simple check on the master overview page if the number of currently running
> regionservers is unchanged.
> -------------------------------------------------------------------------------------------------------
>
> Key: HBASE-2117
> URL: https://issues.apache.org/jira/browse/HBASE-2117
> Project: Hadoop HBase
> Issue Type: New Feature
> Components: master, regionserver
> Affects Versions: 0.20.2
> Reporter: Ferdy
> Attachments: HBASE-2117-v2.patch, HBASE-2117.patch
>
>
> Occasionally, some of our regionservers just stop working.
> The regionserver logs show some sort of termination and the affected
> regionserver is just removed from the master page. Besides the actual problem
> of the termination, what I was missing was some sort of warning (from either
> running client code or the master page) that some regionservers are having
> trouble.
> It seems like the Master is ok with the fact that a regionserver suddenly
> decides to stop. The result is that the clients depending on the data in
> HBase will be presented with an incomplete data set, at least as long as the
> failing regions have not yet been reassigned. In order to have this monitored, I
> decided to create a patch that exposes an extra piece of information on the
> master page. An 'OK:' is presented if the current number of regionservers is
> unchanged since the start of the processes. An 'ERROR:' is shown whenever the
> current number is not the same. The master page reads the 'regionservers'
> file once and remembers the number of slaves so that it can be used in the
> check. (So later changes to this file are not supported.)
> Perhaps this is not the right way of doing things. Please let me know if
> there are any existing solutions for these issues.
> I will attach a patch right away.
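
The check described in the quoted issue can be sketched roughly as follows.
This is not the code from the attached patch, only an illustration in Python
of the idea: read the 'regionservers' file once, remember the count, and
compare it with the live count later. The class name, the message wording
after the 'OK:' and 'ERROR:' prefixes, and the handling of '#' comments are
all assumptions.

```python
class RegionServerCountCheck:
    """Remembers the expected regionserver count from a single read."""

    def __init__(self, regionservers_text):
        # One host per non-blank line. Treating '#' as a comment marker
        # is an assumption, not something the issue states.
        hosts = [
            line.split("#", 1)[0].strip()
            for line in regionservers_text.splitlines()
        ]
        # The count is fixed here, so later edits to the file are ignored.
        self.expected = len([h for h in hosts if h])

    def status(self, current):
        """Return the status line for the given live regionserver count."""
        if current == self.expected:
            return "OK: %d regionservers running" % current
        return "ERROR: %d regionservers running, expected %d" % (
            current, self.expected)
```

Because the expected count is captured once in the constructor, this sketch
shares the limitation noted in the issue: changes to the file after startup
are not reflected in the check.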