[ https://issues.apache.org/jira/browse/HBASE-17704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892976#comment-15892976 ]
Andrew Purtell commented on HBASE-17704:
----------------------------------------
I agree. I didn't know about HBASE-16209. With an exponential backoff policy
and a cap on max wait time (I see that patch has it) there's no reason not to
keep retrying indefinitely. Even before that patch, the old default of 10
attempts is too small to ride over some transient issues. At some point
operator intervention is necessary anyway, but we can get paged by a
region-in-transition-too-long alert to deal with it and there's no harm in
having the AM retry until we tell it not to with unassign_region or similar.
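
The retry policy described above (exponential backoff, a cap on the maximum
wait, and no limit on the number of attempts) can be sketched roughly as
follows. This is only an illustration, not the AssignmentManager code or the
HBASE-16209 patch: the function names, the should_stop hook standing in for
operator intervention, and the base/cap values are all assumptions.

```python
import time


def backoff_delays(base_s=1.0, cap_s=60.0, factor=2.0):
    """Yield an endless sequence of wait times that grow by `factor`
    and never exceed `cap_s` (hypothetical defaults, not HBase's)."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        # Stop growing once the cap is reached so the value stays bounded.
        if delay < cap_s:
            delay *= factor


def retry_until_stopped(attempt, should_stop, sleep=time.sleep, **backoff):
    """Call attempt() until it succeeds or should_stop() says to give up.

    There is no maximum attempt count: only success or an explicit stop
    request (e.g. an operator unassigning the region) ends the loop.
    Returns (attempts_made, succeeded).
    """
    attempts = 0
    for delay in backoff_delays(**backoff):
        if should_stop():
            return attempts, False
        attempts += 1
        if attempt():
            return attempts, True
        sleep(delay)
```

Because the wait is capped, an indefinitely retrying opener costs at most one
attempt per cap interval, which is why an unbounded retry count is cheap here.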
> Regions stuck in FAILED_OPEN when HDFS blocks are missing
> ---------------------------------------------------------
>
> Key: HBASE-17704
> URL: https://issues.apache.org/jira/browse/HBASE-17704
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 1.1.8
> Reporter: Mathias Herberts
>
> We recently experienced the loss of a whole rack (6 DNs + RS) in a 120 node
> cluster. This led to the regions hosted on the 6 unavailable RSs being
> reassigned to live RSs. When attempting to open some of the reassigned
> regions, some RSs encountered missing blocks and issued "No live nodes
> contain current block Block locations", putting the regions in state
> FAILED_OPEN.
> Once the disappeared DNs went back online, the regions were left in
> FAILED_OPEN, needing a restart of all the affected RSs to solve the problem.