[ https://issues.apache.org/jira/browse/HBASE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802255#action_12802255 ]
Ferdy commented on HBASE-2117:
------------------------------

I can't tell for sure whether all non-active regionservers are tracked in the 'deadServers' field. A regionserver can always shut down in a 'proper' way, or at least in such a way that the Master will not put it in its deadServers set.

Also, please note the example of starting a single regionserver by hand. A check against the configuration serves as a reminder to add this newly started server to your configuration (as mentioned above by stack).

> Simple check on the master overview page if the number of currently running regionservers is unchanged.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2117
>                 URL: https://issues.apache.org/jira/browse/HBASE-2117
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: master, regionserver
>    Affects Versions: 0.20.2
>            Reporter: Ferdy
>         Attachments: HBASE-2117-v2.patch, HBASE-2117.patch
>
>
> Occasionally, some of our regionservers just stop working. The regionserver
> logs show some sort of termination, and the affected regionserver is simply
> removed from the master page. Besides the actual problem of the termination,
> what I was missing was some sort of warning (from either running client code
> or the master page) that some regionservers are having trouble.
> It seems like the Master is ok with the fact that a regionserver suddenly
> decides to stop. The result is that clients depending on the data in HBase
> will be presented with an incomplete data set, at least as long as the
> failing regions have not been re-assigned. In order to have this monitored,
> I decided to create a patch that exposes an extra piece of information on
> the master page. An 'OK:' is presented if the current number of
> regionservers is unchanged since the start of the processes. An 'ERROR:' is
> shown whenever the current number is not the same.
> The master page reads the 'regionservers' file once and remembers the number
> of slaves so that it can be used in the check. (So later changes to this
> file are not supported.)
> Perhaps this is not the right way of doing things. Please let me know if
> there are any existing solutions for these issues.
> I will attach a patch right away.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
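
For readers without the patch at hand, the check described in the issue can be sketched roughly as follows. This is only an illustration of the idea, not the code from HBASE-2117.patch: the class name `RegionServerCountCheck`, its methods, and the wording after the 'OK:' / 'ERROR:' prefix are all hypothetical. It also assumes the 'regionservers' file lists one host per line and that blank lines do not count.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper illustrating the check; not the class from the attached patch. */
public class RegionServerCountCheck {
    // Number of regionservers configured when the master started.
    private final int expected;

    public RegionServerCountCheck(int expected) {
        this.expected = expected;
    }

    /** Reads the 'regionservers' file once; later edits to the file are not picked up. */
    public static RegionServerCountCheck fromFile(Path regionserversFile) throws IOException {
        return new RegionServerCountCheck(countHosts(Files.readAllLines(regionserversFile)));
    }

    /** Counts non-blank lines, assuming one host per line. */
    public static int countHosts(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Returns the line to show on the master page for the given number of live regionservers. */
    public String status(int current) {
        String prefix = (current == expected) ? "OK: " : "ERROR: ";
        return prefix + current + " of " + expected + " configured regionservers running";
    }
}
```

Because the expected count is fixed at construction time, a regionserver started by hand later would also be reported as 'ERROR:' until it is added to the file and the master is restarted, which matches the reminder behaviour mentioned in the comment above.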