[jira] [Comment Edited] (HBASE-20368) Fix RIT stuck when a rsgroup has no online servers but AM's pendingAssginQueue is cleared

Xiaolin Ha (JIRA) Tue, 10 Apr 2018 19:52:12 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-20368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433324#comment-16433324
 ]


Xiaolin Ha edited comment on HBASE-20368 at 4/11/18 2:50 AM:
-------------------------------------------------------------

[~stack] The case is that all the regionservers in the rsgroup are stopped 
(it is better to wait a while after stopping them), and then we start one or 
more of them, or add new servers to this rsgroup. 
This case differs somewhat from restarting all the regionservers in the 
rsgroup.
 '...regions on this rsgroup will be reassigned, but there are no available 
servers of this rsgroup...' means that while all the regionservers in the 
rsgroup are offline, assignment of the regions fails. The problem is that 
when some servers in the rsgroup come online again, the assignment is not 
retried, because AM's pendingAssignQueue was cleared after the last assignment 
attempt even though no available servers were found.
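
To make the queue behaviour above concrete, here is a toy Java model. It is 
only a sketch and not the AssignmentManager code: the class 
PendingAssignSketch and its methods are hypothetical names, regions and 
servers are plain strings, and the second method is just one possible way to 
keep unassigned regions, not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model only; names are hypothetical and this is not HBase's API. */
final class PendingAssignSketch {
    private final List<String> pendingAssignQueue = new ArrayList<>();

    void queue(String region) {
        pendingAssignQueue.add(region);
    }

    int pending() {
        return pendingAssignQueue.size();
    }

    /** Behaviour described in the comment: the queue is emptied even if nothing was placed. */
    List<String> processAndAlwaysClear(Set<String> onlineServers) {
        List<String> assigned = new ArrayList<>();
        if (!onlineServers.isEmpty()) {
            assigned.addAll(pendingAssignQueue);
        }
        // Cleared regardless of the result, so regions are forgotten when no server was online.
        pendingAssignQueue.clear();
        return assigned;
    }

    /** One possible alternative (an assumption): drop only regions that found a server. */
    List<String> processAndKeepUnassigned(Set<String> onlineServers) {
        List<String> assigned = new ArrayList<>();
        if (!onlineServers.isEmpty()) {
            assigned.addAll(pendingAssignQueue);
            pendingAssignQueue.clear();
        }
        return assigned;
    }

    public static void main(String[] args) {
        PendingAssignSketch clearing = new PendingAssignSketch();
        clearing.queue("region-a");
        clearing.processAndAlwaysClear(Collections.<String>emptySet());
        System.out.println("always-clear, pending after failed round: " + clearing.pending());

        PendingAssignSketch keeping = new PendingAssignSketch();
        keeping.queue("region-a");
        keeping.processAndKeepUnassigned(Collections.<String>emptySet());
        System.out.println("keep-unassigned, pending after failed round: " + keeping.pending());
        System.out.println("assigned once a server is back: "
            + keeping.processAndKeepUnassigned(Collections.singleton("rs1")));
    }
}
```

In the first variant nothing is left to retry when a server comes back, which 
matches the stuck RIT described here; in the second the region stays queued 
until a server is available.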

The stuck RITs are visible in the master logs and the UI, and DML against 
tables in this rsgroup also shows the problem.
Logs are:
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t3, region=c8890704468083ceae6a6c3b5e24b968
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=hh:t3, region=97591999e282ac4dc54300693bba4263
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t1, region=62bdc6fb8e9af1c21a323c9191313613
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t2, region=9bdf2635e2d76c0d0388a5708ce21e3c
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=hh:t1, region=47ba2a3d6968ad09a79e05bdd6db5694
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t2, region=ddef57619b45a023e076c3d5bcf30a04
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=t2, region=c0fb941b4c27fa04211e119494cf34d1
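
For anyone checking their own master log for the same symptom, a small Java 
helper along these lines could count the STUCK entries per table. It is a 
sketch only: the class StuckRitSummary is a hypothetical name, the regular 
expression is written against the fields visible in the entries above, and it 
assumes each log entry sits on a single line as the master writes it. Other 
HBase versions may format the message differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of HBase. */
final class StuckRitSummary {
    // Groups: 1 = RIT state, 2 = server host, 3 = table, 4 = encoded region name.
    private static final Pattern STUCK = Pattern.compile(
        "STUCK Region-In-Transition rit=(\\w+), location=([^,]+),\\d+,\\d+, "
            + "table=([^,]+), region=(\\w+)");

    /** Counts STUCK Region-In-Transition entries per table. */
    static Map<String, Integer> countByTable(Iterable<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = STUCK.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two entries copied from the log excerpt in this comment.
        List<String> lines = Arrays.asList(
            "2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment."
                + "AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, "
                + "location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t2, "
                + "region=9bdf2635e2d76c0d0388a5708ce21e3c",
            "2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment."
                + "AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, "
                + "location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=hh:t1, "
                + "region=47ba2a3d6968ad09a79e05bdd6db5694");
        System.out.println(countByTable(lines));
    }
}
```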







> Fix RIT stuck when a rsgroup has no online servers but AM's 
> pendingAssginQueue is cleared
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-20368
>                 URL: https://issues.apache.org/jira/browse/HBASE-20368
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>    Affects Versions: 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: HBASE-20368.branch-2.0.001.patch
>
>
> This error can be reproduced by shutting down all servers in an rsgroup and 
> starting them again soon afterwards. 
> The regions on this rsgroup will be reassigned, but there are no available 
> servers of this rsgroup.
> They will be added to AM's pendingAssignQueue, which AM clears regardless 
> of the result of the assignment in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
