[ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234564#comment-16234564 ]
Andrew Purtell commented on HBASE-19144:
----------------------------------------
It's debatable whether kicking reassignment of regions in FAILED_OPEN state
is always the right thing to do when servers join the cluster. However, with
RSGroups I think it makes sense, so I put the logic to do this into the
RSGroups master extension.
We could make this a general facility in the master. Without RSGroups,
regions would most likely be in FAILED_OPEN because of corruption or a runtime
problem such as a persistent failure to load a compression codec. Retrying
assignments neither helps nor hurts in those cases, really. On the other hand,
if a transient condition resulted in regions in FAILED_OPEN state (I've seen
that with Phoenix), then it would help to do this generally, not only when
servers join the cluster but periodically as well. Alternatively, on other
issues we've contemplated adding this as a feature to hbck.
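
To make the idea concrete, here is a minimal sketch of the retry logic
described above. It is not the attached patch and does not use HBase classes:
FailedOpenRetrySketch, its State enum, and all of its methods are hypothetical
stand-ins for the master's region state tracking. It only shows the shape of
the behavior: when a server joins a group, regions of that group that are in
FAILED_OPEN are put back to OFFLINE so that assignment can be reattempted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these types are real HBase classes. */
class FailedOpenRetrySketch {
    enum State { OPEN, OFFLINE, FAILED_OPEN }

    // Region name -> current state (stand-in for the master's region states).
    private final Map<String, State> regionStates = new HashMap<>();
    // Region name -> name of the RSGroup the region belongs to.
    private final Map<String, String> regionGroup = new HashMap<>();
    // RSGroup name -> servers currently online in that group.
    private final Map<String, Set<String>> onlineServers = new HashMap<>();

    void addRegion(String region, String group, State state) {
        regionStates.put(region, state);
        regionGroup.put(region, group);
    }

    State stateOf(String region) {
        return regionStates.get(region);
    }

    /** Records that a server joined a group, then kicks that group's FAILED_OPEN regions. */
    List<String> serverJoined(String group, String server) {
        onlineServers.computeIfAbsent(group, g -> new HashSet<>()).add(server);
        return retryFailedOpen(group);
    }

    /**
     * Moves FAILED_OPEN regions of the group back to OFFLINE and returns their names.
     * Does nothing if the group has no online server, since a retry could not succeed.
     */
    List<String> retryFailedOpen(String group) {
        List<String> kicked = new ArrayList<>();
        Set<String> servers = onlineServers.get(group);
        if (servers == null || servers.isEmpty()) {
            return kicked;
        }
        for (Map.Entry<String, State> e : regionStates.entrySet()) {
            boolean inGroup = group.equals(regionGroup.get(e.getKey()));
            if (inGroup && e.getValue() == State.FAILED_OPEN) {
                e.setValue(State.OFFLINE); // a real master would queue an assignment here
                kicked.add(e.getKey());
            }
        }
        return kicked;
    }
}
```

In this sketch the same retryFailedOpen call could also be driven by a
periodic task, which is the more general variant mentioned above.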
> [RSgroups] Regions assigned to a RSGroup all go to FAILED_OPEN state when all
> servers in the group are unavailable
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-19144
> URL: https://issues.apache.org/jira/browse/HBASE-19144
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Priority: Major
> Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
> Attachments: HBASE-19144-branch-1.patch
>
>
> After all servers in the RSGroup go down, the regions cannot be opened
> anywhere and transition rapidly into FAILED_OPEN state.
>
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates:
> Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448,
> server=node-5.cluster,16020,1509482700768} to
> {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449,
> server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates:
> Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on
> node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>
> Any region in FAILED_OPEN state has to be manually reassigned, or the master
> can be restarted, which also causes assignment to be reattempted for any
> regions in FAILED_OPEN state. This is not unexpected but is an operational
> headache. It would be better if the RSGroupInfoManager could automatically
> kick reassignments of regions in FAILED_OPEN state when servers rejoin the
> cluster.