[
https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236362#comment-16236362
]
churro morales edited comment on HBASE-19144 at 11/2/17 6:41 PM:
-----------------------------------------------------------------
lgtm,
Should we first check that at least one server is up? The master patch looks
good because we have a serverAdded. In the branch-1.4 patch it looks like we
only have serverChanged.
So in that case, could we kick this off before any servers from our
group are up yet?
We could just wait for at least one server to come up, but I don't know if
that's overkill.
Something like
{code}
while (Collections.disjoint(
    masterServices.getServerManager().getOnlineServersList(),
    RSGroupInfoManagerImpl.this.getDefaultServers())) {
  // wait
}
{code}
It might be overkill, and it looks like it is not necessary in the master
branch, only in 1.4.
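To make the idea concrete, here is a standalone sketch of the same wait with a timeout and a poll interval so it cannot spin forever. It is not taken from either patch: the class name, method name, and the use of plain strings for server addresses are all assumptions for illustration, and the real code would compare whatever types getOnlineServersList() and getDefaultServers() actually return.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of HBase: waits until at least one server
// from the group shows up in the online list, or the timeout expires.
public class GroupServerWait {

  static boolean waitForGroupServer(Supplier<? extends Collection<String>> onlineServers,
                                    Collection<String> groupServers,
                                    long timeoutMs,
                                    long pollMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    // disjoint() is true while the two collections share no element,
    // i.e. while no server from the group is online.
    while (Collections.disjoint(onlineServers.get(), groupServers)) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out, no group server came up
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

The supplier is re-evaluated on every pass so a server that joins during the wait is noticed. Returning false on timeout leaves it to the caller to decide whether to skip the retry or try again later.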
> [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the
> cluster
> -----------------------------------------------------------------------------------
>
> Key: HBASE-19144
> URL: https://issues.apache.org/jira/browse/HBASE-19144
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Priority: Major
> Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
> Attachments: HBASE-19144-branch-1.patch, HBASE-19144.patch
>
>
> After all servers in the RSgroup are down the regions cannot be opened
> anywhere and transition rapidly into FAILED_OPEN state.
>
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates:
> Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448,
> server=node-5.cluster,16020,1509482700768} to
> {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449,
> server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates:
> Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on
> node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>
> Any region in FAILED_OPEN state has to be manually reassigned, or the master
> can be restarted and this will also cause reattempt of assignment of any
> regions in FAILED_OPEN state. This is not unexpected but is an operational
> headache. It would be better if the RSGroupInfoManager could automatically
> kick reassignments of regions in FAILED_OPEN state when servers rejoin the
> cluster.
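
The selection step described in the issue could look roughly like the sketch below: when a server joins, pick the regions that are in FAILED_OPEN and whose group contains that server. This is an assumption-laden illustration, not the patch. The class, the method, and the plain string and map representations of regions, states, and groups are all hypothetical stand-ins for the real master data structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not HBase code: decides which regions to retry
// when a server (re)joins the cluster.
public class FailedOpenRetry {

  static List<String> regionsToRetry(String joinedServer,
                                     Map<String, String> regionStates,
                                     Map<String, String> regionToGroup,
                                     Map<String, Set<String>> groupServers) {
    List<String> retry = new ArrayList<>();
    for (Map.Entry<String, String> entry : regionStates.entrySet()) {
      if (!"FAILED_OPEN".equals(entry.getValue())) {
        continue; // only regions stuck in FAILED_OPEN are candidates
      }
      String group = regionToGroup.get(entry.getKey());
      if (group == null) {
        continue; // region has no known group in this sketch
      }
      Set<String> servers = groupServers.get(group);
      // Retry only if the server that joined can actually host this region.
      if (servers != null && servers.contains(joinedServer)) {
        retry.add(entry.getKey());
      }
    }
    return retry;
  }
}
```

The caller would then hand the returned regions to whatever assignment call the branch in question provides.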