[ https://issues.apache.org/jira/browse/HBASE-20368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433324#comment-16433324 ]
Xiaolin Ha edited comment on HBASE-20368 at 4/11/18 2:50 AM:
-------------------------------------------------------------

[~stack] The case is that all the regionservers in the rsgroup are stopped (it's better to wait for a while), and then we start one or more of them, or add new servers to this rsgroup. This differs from restarting all the regionservers in the rsgroup.

'...regions on this rsgroup will be reassigned, but there is no available servers of this rsgroup....'

This means that when all the regionservers in the rsgroup are offline, the assignment of the regions fails. The problem is that when some servers in the rsgroup come online again, the assignment of those regions is not retried, because AM's pendingAssginQueue was cleared after the last assignment attempt even though no available servers were found.

The stuck RITs can be seen in the logs or the UI, and DML on tables in this rsgroup will also show it.

Logs are:

2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t3, region=c8890704468083ceae6a6c3b5e24b968
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=hh:t3, region=97591999e282ac4dc54300693bba4263
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t1, region=62bdc6fb8e9af1c21a323c9191313613
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t2, region=9bdf2635e2d76c0d0388a5708ce21e3c
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=hh:t1, region=47ba2a3d6968ad09a79e05bdd6db5694
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st23.bj,40100,1523172960034, table=t2, region=ddef57619b45a023e076c3d5bcf30a04
2018-04-09,11:48:39,421 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OFFLINE, location=c3-hadoop-tst-st26.bj,40100,1523172965147, table=t2, region=c0fb941b4c27fa04211e119494cf34d1
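
To make the queue behaviour described in the comment easier to follow, here is a minimal, self-contained sketch. It is not HBase code and not the attached patch: the class `PendingQueueSketch`, both methods, and the region/server names are hypothetical, and a plain `List<String>` stands in for AM's pending queue. The first method clears the queue whether or not any server was available, which is the behaviour the comment describes; the second shows one possible remedy (an assumption, not necessarily what the patch does), which keeps unplaced regions queued so a later pass can retry them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PendingQueueSketch {

    // Behaviour described in the comment: the queue is cleared after every
    // pass, even when no server in the group was available.
    public static List<String> assignAndClear(List<String> pending, List<String> onlineServers) {
        List<String> assigned = new ArrayList<>();
        if (!onlineServers.isEmpty()) {
            assigned.addAll(pending); // pretend every region found a server
        }
        pending.clear(); // unplaced regions are forgotten here
        return assigned;
    }

    // Hypothetical remedy: only drop regions from the queue once they have
    // actually been placed, so a later pass can retry the rest.
    public static List<String> assignAndKeepFailed(List<String> pending, List<String> onlineServers) {
        List<String> assigned = new ArrayList<>();
        if (!onlineServers.isEmpty()) {
            assigned.addAll(pending);
            pending.clear();
        }
        return assigned;
    }

    public static void main(String[] args) {
        List<String> noServers = new ArrayList<>();
        List<String> oneServer = Arrays.asList("rs1");

        // All servers offline, then one server comes back.
        List<String> q1 = new ArrayList<>(Arrays.asList("regionA", "regionB"));
        assignAndClear(q1, noServers);
        System.out.println(assignAndClear(q1, oneServer)); // prints []

        List<String> q2 = new ArrayList<>(Arrays.asList("regionA", "regionB"));
        assignAndKeepFailed(q2, noServers);
        System.out.println(assignAndKeepFailed(q2, oneServer)); // prints [regionA, regionB]
    }
}
```

In the first run nothing is left to assign once a server returns, which corresponds to the regions staying in RIT; in the second run the regions are still queued and get placed on the next pass.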

> Fix RIT stuck when a rsgroup has no online servers but AM's pendingAssginQueue is cleared
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-20368
>                 URL: https://issues.apache.org/jira/browse/HBASE-20368
>             Project: HBase
>          Issue Type: Bug
>          Components: rsgroup
>    Affects Versions: 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: HBASE-20368.branch-2.0.001.patch
>
>
> This error can be reproduced by shutting down all servers in an rsgroup and
> starting them soon afterwards.
> The regions on this rsgroup will be reassigned, but there are no available
> servers in this rsgroup.
> They will be added to AM's pendingAssginQueue, which AM will clear regardless
> of the result of the assignment in this case.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)