[
https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227713#comment-16227713
]
Andrew Purtell edited comment on HBASE-19144 at 10/31/17 10:40 PM:
-------------------------------------------------------------------
bq. It would be better if the RSGroupInfoManager could automatically kick
reassignments of regions in FAILED_OPEN state when servers rejoin the cluster.
I threw together a hack (imagine this done with another registered
ServerListener) which addresses the problem as reported: regions in
FAILED_OPEN state due to constraint failures when a whole RSGroup is down are
all reassigned. I'm not sure this is the best approach, though:
{code}
diff --git a/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java b/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
index 80eaefb036..d6a7ec120d 100644
--- a/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
+++ b/hbase-rsgroup/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupInfoManagerImpl.java
@@ -531,6 +535,19 @@ public class RSGroupInfoManagerImpl implements RSGroupInfoManager, ServerListene
       prevDefaultServers = servers;
       LOG.info("Updated with servers: "+servers.size());
     }
+
+    // Kick assignments that may be in FAILED_OPEN state
+    List<HRegionInfo> failedAssignments = Lists.newArrayList();
+    for (RegionState state:
+        mgr.master.getAssignmentManager().getRegionStates().getRegionsInTransition()) {
+      if (state.isFailedOpen()) {
+        failedAssignments.add(state.getRegion());
+      }
+    }
+    for (HRegionInfo region: failedAssignments) {
+      mgr.master.getAssignmentManager().unassign(region);
+    }
+
     try {
       synchronized (this) {
         if(!hasChanged) {
{code}
Testing was with branch-1.4 / branch-1. I still need to check how branch-2
behaves.
If we do this for real, we should wait briefly in case more servers join, and
then do the work in one batch.
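To illustrate the "wait briefly and batch" idea, here is a minimal, hypothetical sketch in plain Java. {{Region}} and {{AssignmentService}} are stand-ins I made up for {{HRegionInfo}} and the {{AssignmentManager}} (this is not HBase API); the point is just the debounce: every server-join event resets a short quiet period, and only when the joins stop does a single batch of unassigns fire.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for HRegionInfo / AssignmentManager (not HBase API).
interface Region { String name(); }

interface AssignmentService {
  List<Region> failedOpenRegions();   // regions currently in FAILED_OPEN
  void unassign(Region region);       // kicks a fresh assignment attempt
}

// Debounced batch kick: each serverAdded() call resets a short timer, so a
// burst of servers rejoining produces one batch of unassigns, not one per join.
class FailedOpenKicker {
  private final AssignmentService assignment;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long quietMillis;
  private ScheduledFuture<?> pending;

  FailedOpenKicker(AssignmentService assignment, long quietMillis) {
    this.assignment = assignment;
    this.quietMillis = quietMillis;
  }

  synchronized void serverAdded() {
    if (pending != null) {
      pending.cancel(false);  // another server joined; restart the quiet period
    }
    pending = scheduler.schedule(this::kickBatch, quietMillis, TimeUnit.MILLISECONDS);
  }

  private void kickBatch() {
    // Snapshot first, then unassign, mirroring the two-pass loop in the patch.
    List<Region> failed = new ArrayList<>(assignment.failedOpenRegions());
    for (Region region : failed) {
      assignment.unassign(region);
    }
  }

  void shutdown() {
    scheduler.shutdown();
  }
}
```

In the real thing {{serverAdded()}} would be driven by the ServerListener callback, and {{kickBatch()}} could additionally filter to regions belonging to the rejoining group.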
> [RSgroups] Regions assigned to a RSGroup all go to FAILED_OPEN state when all
> servers in the group are unavailable
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-19144
> URL: https://issues.apache.org/jira/browse/HBASE-19144
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 3.0.0, 1.4.0, 1.5.0
> Reporter: Andrew Purtell
>
> After all servers in the RSgroup are down, the regions cannot be opened
> anywhere and transition rapidly into FAILED_OPEN state.
>
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates:
> Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448,
> server=node-5.cluster,16020,1509482700768} to
> {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449,
> server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates:
> Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on
> node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>
> Any region in FAILED_OPEN state has to be manually reassigned, or the master
> can be restarted, which also retries assignment of any regions in
> FAILED_OPEN state. This is not unexpected but is an operational
> headache. It would be better if the RSGroupInfoManager could automatically
> kick reassignments of regions in FAILED_OPEN state when servers rejoin the
> cluster.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)