Esteban Gutierrez created HBASE-14059:
-----------------------------------------
Summary: We should add a RS to the dead servers list if admin
calls fail more than a threshold
Key: HBASE-14059
URL: https://issues.apache.org/jira/browse/HBASE-14059
Project: HBase
Issue Type: Bug
Components: master, regionserver, rpc
Affects Versions: 0.98.13
Reporter: Esteban Gutierrez
Assignee: Esteban Gutierrez
Priority: Critical
I ran into this problem twice this week: calls from the HBase master to a RS
can time out because the RS call queue has been maxed out. However, since the
RS is not dead (its ephemeral znode is still present), the master keeps
attempting admin tasks such as opening or closing a region. Those operations
eventually fail once we run out of retries, or the assignment manager attempts
to reassign the regions to other RSs. As side effects, I've seen master
operations become fully blocked, or regions stuck in transition (RITs), because
we cannot close the region and open it in a new location while the RS is not
considered dead.
A potential solution for this is to add the RS to the list of dead RSs after
a certain number of calls from the master to the RS fail.
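
As a minimal sketch of the idea above (not actual HBase code), the master
could keep a per-RS failure counter and flag the RS once a threshold is hit.
The class name AdminCallFailureTracker, its methods, and the choice to count
consecutive failures (resetting on success) are all assumptions made for
illustration; the threshold would presumably come from configuration.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not an HBase class: counts consecutive failed admin
// calls per region server and flags a server once a threshold is reached.
class AdminCallFailureTracker {
    // Assumed to be a configurable threshold; no such setting exists in HBase.
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();
    private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    AdminCallFailureTracker(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Records one failed admin call to the given server.
    // Returns true if the server is now considered dead.
    boolean recordFailure(String serverName) {
        int count = failures.merge(serverName, 1, Integer::sum);
        if (count >= maxConsecutiveFailures) {
            deadServers.add(serverName);
        }
        return deadServers.contains(serverName);
    }

    // A successful call resets the consecutive-failure count for the server.
    void recordSuccess(String serverName) {
        failures.remove(serverName);
    }

    boolean isDead(String serverName) {
        return deadServers.contains(serverName);
    }
}
```

In this sketch, a true return from recordFailure is the point at which the
master would stop retrying and treat the RS as dead, so that its regions can
be reassigned elsewhere.
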
I've only noticed the problem in 0.98.x, but it should be present in all
versions.