[
https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630745#comment-14630745
]
Heng Chen commented on HBASE-14059:
-----------------------------------
Why is the region server's call queue full? Which operation is blocking?
> We should add a RS to the dead servers list if admin calls fail more than a
> threshold
> -------------------------------------------------------------------------------------
>
> Key: HBASE-14059
> URL: https://issues.apache.org/jira/browse/HBASE-14059
> Project: HBase
> Issue Type: Bug
> Components: master, regionserver, rpc
> Affects Versions: 0.98.13
> Reporter: Esteban Gutierrez
> Assignee: Esteban Gutierrez
> Priority: Critical
>
> I ran into this problem twice this week: calls from the HBase master to a RS
> can time out because the RS call queue has been maxed out. However, since
> the RS is not dead (its ephemeral znode is still present), the master keeps
> attempting admin tasks such as opening or closing a region, but
> those operations eventually fail after we run out of retries or the
> assignment manager attempts to re-assign to other RSs. As side effects
> of this I've noticed master operations being fully blocked, or RITs, since we
> cannot close the region and open it in a new location while the RS is not
> dead.
> A potential solution for this is to add the RS to the list of dead RSs after
> a certain number of calls from the master to the RS fail.
> I've only noticed the problem in 0.98.x, but it should be present in all
> versions.
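
To make the proposal above concrete, here is a minimal sketch of the counting logic only, assuming the master tracks consecutive failed admin calls per RS. The class AdminCallFailureTracker and its methods are hypothetical names used for illustration; they are not existing HBase APIs, and the actual hook into the dead servers list is left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of HBase: counts consecutive failed
// admin calls per region server and reports when a threshold is reached.
final class AdminCallFailureTracker {
    private final int threshold;
    private final ConcurrentHashMap<String, AtomicInteger> failures =
        new ConcurrentHashMap<>();

    AdminCallFailureTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    // Records one failed call. Returns true once the server has reached
    // the threshold, i.e. the caller could now treat it as dead.
    boolean recordFailure(String serverName) {
        int count = failures
            .computeIfAbsent(serverName, k -> new AtomicInteger())
            .incrementAndGet();
        return count >= threshold;
    }

    // A successful call clears the failure count for that server.
    void recordSuccess(String serverName) {
        failures.remove(serverName);
    }

    // Current number of consecutive failures recorded for the server.
    int failureCount(String serverName) {
        AtomicInteger c = failures.get(serverName);
        return c == null ? 0 : c.get();
    }
}
```

With a threshold of 3, the third consecutive recordFailure("rs1") would return true, and the caller could then decide to expire that RS. Resetting on success is an assumption made here so that a briefly overloaded RS is not expired because of old failures.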
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)