[jira] Commented: (HBASE-2117) Simple check on the master overview page if the number of currently running regionservers is unchanged.

Lars George (JIRA) Thu, 14 Jan 2010 02:00:21 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800155#action_12800155
 ]


Lars George commented on HBASE-2117:
------------------------------------

My 2c is to use Nagios et al. Add the number of regionservers (max/current) to 
the HMaster metrics and use a check to verify that they are the same. If not, 
raise an alarm with the usual escalation. I assume the same approach could be 
adopted by the Hadoop team for datanodes and jobtrackers. 
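
A rough sketch of what such a check could look like, in Python. It relies only 
on the standard Nagios plugin convention of exit codes 0/1/2/3 for 
OK/WARNING/CRITICAL/UNKNOWN. The function name and message wording are 
hypothetical, and how the max/current values would be read from the master 
metrics is left open, since those metrics do not exist yet.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_regionservers(expected, current):
    """Compare the expected (max) and current regionserver counts.

    Both values are assumed to come from the master metrics; fetching
    them is deliberately not part of this sketch. Returns a tuple of
    (exit_code, status_line) for a Nagios-style wrapper to use.
    """
    if expected is None or current is None:
        return UNKNOWN, "UNKNOWN: regionserver counts unavailable"
    if current == expected:
        return OK, f"OK: {current}/{expected} regionservers running"
    return CRITICAL, f"CRITICAL: {current}/{expected} regionservers running"
```

A wrapper script would print the returned status line and exit with the 
returned code, so that Nagios can apply the usual escalation rules.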

> Simple check on the master overview page if the number of currently running 
> regionservers is unchanged.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2117
>                 URL: https://issues.apache.org/jira/browse/HBASE-2117
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: master, regionserver
>    Affects Versions: 0.20.2
>            Reporter: Ferdy
>         Attachments: HBASE-2117-v2.patch, HBASE-2117.patch
>
>
> Occasionally, some of our regionservers just stop working. 
> The regionserver logs show some sort of termination, and the affected 
> regionserver is simply removed from the master page. Apart from the actual 
> problem of the termination, what I was missing was some sort of warning (from 
> either running client code or the master page) that some regionservers are 
> having trouble.
> It seems like the Master is ok with the fact that a regionserver suddenly 
> decides to stop. The result is that clients depending on the data in 
> HBase will be presented with an incomplete data set, at least until the 
> failing regions are re-assigned. In order to have this monitored, I 
> decided to create a patch that exposes an extra piece of information on the 
> master page. An 'OK:' is presented if the current number of regionservers is 
> unchanged since the start of the processes. An 'ERROR:' is shown whenever the 
> current number is not the same. The master page reads the 'regionservers' 
> file once and remembers the number of slaves so that it can be used in the 
> check. (So later changes to this file are not supported.)
> Perhaps this is not the right way of doing things. Please let me know if 
> there are any existing solutions for these issues.
> I will attach a patch right away.
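
For reference, the check described in the quoted description could be sketched 
roughly as below. This is not the attached patch, only an illustration in 
Python. The function names and the exact message text are hypothetical, and 
the sketch assumes the 'regionservers' file lists one host per line, with 
blank lines and lines starting with '#' ignored.

```python
def count_expected_regionservers(regionservers_text):
    """Count the hosts listed in the contents of the 'regionservers' file.

    Assumption: one hostname per line; blank lines and '#' comment
    lines are skipped.
    """
    hosts = []
    for line in regionservers_text.splitlines():
        host = line.strip()
        if host and not host.startswith("#"):
            hosts.append(host)
    return len(hosts)


def overview_status(expected, current):
    """Build the 'OK:' / 'ERROR:' line for the master overview page."""
    if current == expected:
        return f"OK: {current} of {expected} regionservers running"
    return f"ERROR: {current} of {expected} regionservers running"
```

The expected count would be computed once at startup and kept, which is why 
later edits to the file would not be picked up.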

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
