[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630771#comment-14630771 ]
Esteban Gutierrez commented on HBASE-14059:
-------------------------------------------
In the issue I ran into, a bad region caused the RS to be blocked for a
long time. From the master's point of view the RS was doing fine, since it
was still getting region load info and the RS ephemeral znode was present.
However, since the call queue length was maxed out, some operations like
closing or opening a region were not successful, hence this cluster
ended up with regions in transition very frequently due to assignment issues
on this RS.
> We should add a RS to the dead servers list if admin calls fail more than a
> threshold
> -------------------------------------------------------------------------------------
>
> Key: HBASE-14059
> URL: https://issues.apache.org/jira/browse/HBASE-14059
> Project: HBase
> Issue Type: Bug
> Components: master, regionserver, rpc
> Affects Versions: 0.98.13
> Reporter: Esteban Gutierrez
> Assignee: Esteban Gutierrez
> Priority: Critical
>
> I ran into this problem twice this week: calls from the HBase master to a RS
> can time out when the RS call queue has been maxed out. However, since
> the RS is not dead (ephemeral znode still present), the master keeps
> attempting to perform admin tasks like opening or closing a region, and
> those operations eventually fail after we run out of retries or the
> assignment manager attempts to re-assign to other RSs. As side effects
> of this I've noticed master operations being fully blocked, or RITs, since
> we cannot close the region and open it in a new location while the RS is
> not dead.
> A potential solution for this is to add the RS to the list of dead RSs after
> a certain number of calls from the master to the RS fail.
> I've only noticed the problem in 0.98.x but it should be present in all
> versions.
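
To make the proposal above concrete, here is a minimal sketch of the threshold idea. It is not HBase code: the class AdminCallFailureTracker and its methods are hypothetical names invented for illustration, and it assumes that "fail more than a threshold" means consecutive failures, with any successful admin call resetting the count. How the master would actually expire the server is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of HBase): counts consecutive failed admin
 * calls per region server and marks a server dead once a threshold is hit.
 */
final class AdminCallFailureTracker {
  private final int threshold;
  // Server name -> number of admin calls that failed in a row.
  private final Map<String, Integer> consecutiveFailures = new ConcurrentHashMap<>();
  // Servers that reached the threshold; stands in for the dead servers list.
  private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

  AdminCallFailureTracker(int threshold) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.threshold = threshold;
  }

  /** Records a failed admin call; returns true if the server is now marked dead. */
  boolean recordFailure(String serverName) {
    int failures = consecutiveFailures.merge(serverName, 1, Integer::sum);
    if (failures >= threshold) {
      deadServers.add(serverName);
    }
    return deadServers.contains(serverName);
  }

  /** Assumption: a successful admin call resets the consecutive failure count. */
  void recordSuccess(String serverName) {
    consecutiveFailures.remove(serverName);
  }

  boolean isDead(String serverName) {
    return deadServers.contains(serverName);
  }
}
```

With a threshold of 3, two failed calls followed by a success leave the server alive, and it then takes three more consecutive failures to mark it dead. Counting total rather than consecutive failures would be an equally plausible reading of the proposal; the choice here only keeps a briefly overloaded RS from being expired.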