[jira] [Commented] (HBASE-20368) Fix RIT stuck when a rsgroup has no online servers but AM's pendingAssginQueue is cleared

Xiaolin Ha (JIRA) Thu, 04 Jul 2019 23:34:45 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878999#comment-16878999
 ]


Xiaolin Ha commented on HBASE-20368:
------------------------------------

See this log for the stuck region:
WARN [ProcExecTimeout] assignment.AssignmentManager(1328): STUCK Region-In-Transition rit=OPEN, location=localhost,32843,1562307050191, table=Group_testKillAllRSInGroupAndThenAddNew, region=a763499801435d2f78ab42876c6cb3ec

I think the region state 'OPEN' here may be wrong, and it is confusing. When 
SCP starts and creates TRSPs, should these new TRSPs also call serverCrashed() 
to set the region state to 'ABNORMALLY_CLOSED'? Are there any concerns if 
region assignment begins from state 'ABNORMALLY_CLOSED'? 
[~zghaobac], [~Apache9]

Relevant code in SCP:
{quote}private void assignRegions(MasterProcedureEnv env, List<RegionInfo> regions)
    throws IOException {
  AssignmentManager am = env.getMasterServices().getAssignmentManager();
  for (RegionInfo region : regions) {
    RegionStateNode regionNode = am.getRegionStates().getOrCreateRegionStateNode(region);
    regionNode.lock();
    try {
      if (regionNode.getProcedure() != null) {
        LOG.info("{} found RIT {}; {}", this, regionNode.getProcedure(), regionNode);
        regionNode.getProcedure().serverCrashed(env, regionNode, getServerName());
      } else {
        if (env.getMasterServices().getTableStateManager().isTableState(
            regionNode.getTable(), TableState.State.DISABLING, TableState.State.DISABLED)) {
          continue;
        }
        TransitRegionStateProcedure proc =
            TransitRegionStateProcedure.assign(env, region, null);
        regionNode.setProcedure(proc);
        addChildProcedure(proc);
      }
    } finally {
      regionNode.unlock();
    }
  }
}{quote}
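
To make the question concrete, here is a toy model (not HBase code) of the two behaviors. The class RitStateSketch, its Node type, and the attachAssign method are hypothetical names invented for illustration. The only things taken from the discussion are the state names OPEN and ABNORMALLY_CLOSED and the observation that the branch creating a new TRSP leaves the region state untouched. Clearing the location when marking the region is an assumption of this sketch.

```java
// Toy model only: none of these types are the real HBase classes.
final class RitStateSketch {

  /** Subset of region states named in the discussion. */
  enum State { OPEN, ABNORMALLY_CLOSED }

  /** Hypothetical stand-in for a region state node. */
  static final class Node {
    State state = State.OPEN;
    String location;       // server the region was last open on
    boolean hasProcedure;  // true once an assign procedure is attached

    Node(String location) {
      this.location = location;
    }

    /** Mimics the shape of the STUCK Region-In-Transition report. */
    String describe() {
      return "rit=" + state + ", location=" + location;
    }
  }

  /**
   * Attaches an assign procedure to a region that had no procedure when its
   * server crashed. When markCrashed is true (the proposal in the comment),
   * the region is first flagged as abnormally closed if it was hosted on the
   * crashed server. Clearing the location is an assumption of this sketch.
   */
  static void attachAssign(Node node, String crashedServer, boolean markCrashed) {
    if (markCrashed && crashedServer.equals(node.location)) {
      node.state = State.ABNORMALLY_CLOSED;
      node.location = null;
    }
    node.hasProcedure = true;
  }

  public static void main(String[] args) {
    String crashed = "localhost,32843,1562307050191";

    Node current = new Node(crashed);
    attachAssign(current, crashed, false);
    System.out.println("without marking: " + current.describe());

    Node proposed = new Node(crashed);
    attachAssign(proposed, crashed, true);
    System.out.println("with marking:    " + proposed.describe());
  }
}
```

In the first case a stuck-region report would still show rit=OPEN, as in the log above. In the second case the state shows that the region went down with its server.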

> Fix RIT stuck when a rsgroup has no online servers but AM's 
> pendingAssginQueue is cleared
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-20368
>                 URL: https://issues.apache.org/jira/browse/HBASE-20368
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>    Affects Versions: 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.0.6, 2.1.6
>
>         Attachments: HBASE-20368.branch-2.001.patch, 
> HBASE-20368.branch-2.002.patch, HBASE-20368.branch-2.003.patch, 
> HBASE-20368.branch-2.003.patch, HBASE-20368.branch-2.003.patch, 
> HBASE-20368.branch-2.1.001.patch
>
>
> This error can be reproduced by shutting down all servers in an rsgroup and 
> starting them again soon afterwards. 
> The regions in this rsgroup will be reassigned, but there are no available 
> servers in this rsgroup.
> They will be added to AM's pendingAssginQueue, which AM will clear regardless 
> of the result of the assignment in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
