chenyang created HBASE-20919:
--------------------------------
Summary: meta region can`t reonline when restart when open rsgroup
Key: HBASE-20919
URL: https://issues.apache.org/jira/browse/HBASE-20919
Project: HBase
Issue Type: Bug
Components: Balancer, master, rsgroup
Affects Versions: 2.0.1
Reporter: chenyang
Attachments: bug2.png
If rsgroup is enabled, hbase-site.xml contains the following configuration:
{code:java}
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
</property>
<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
</property>
{code}
Shut down the whole HBase cluster in this order:
# Shut down the region servers one by one.
# Shut down the master.
Then restart the whole cluster in this order:
# Start the master.
# Start the region servers.
The hbase:meta region cannot come back online, and the rsgroup cannot be
initialized successfully.
Master logs:
{code:java}
2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}
The logs show that the hbase:meta region is not online and that rsgroup keeps
retrying its initialization.
But why is the hbase:meta region not online?
The info-level logs and the jstack output did not contain enough information,
so I added some debug logging to a test build of the source. After checking the
master's and the region server's logs, I found that the meta region assign
procedure, which holds the meta region lock, never completed and never released
the lock, so the recoverMetaProcedure could not be executed.
Why did the first procedure never complete or release the meta region lock?
In the test logs, I found that when the assignmentManager assigns the region, it
calls the rsgroup balancer, which has not been fully initialized yet, so an NPE
is thrown. As a result, the procedure never completes and never releases the
lock.
{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
{code}
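The failure mode in this trace, an assignment path dereferencing balancer state
that startup has not populated yet, can be illustrated with a small
self-contained sketch. Nothing here is HBase code: GroupBalancerSketch,
pickServer, pickServerUnguarded and NotReadyException are hypothetical names,
and the guard shown is only one possible mitigation, not necessarily the fix
applied to this issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a group-aware balancer; not the HBase implementation.
final class GroupBalancerSketch {

    // Unchecked "try again later" signal the caller can catch and retry on.
    static final class NotReadyException extends RuntimeException {
        NotReadyException(String message) {
            super(message);
        }
    }

    // Stays null until initialize() runs, like state that startup has not set up yet.
    private volatile Map<String, List<String>> serversByGroup;

    void initialize(Map<String, List<String>> groups) {
        Map<String, List<String>> copy = new HashMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            copy.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        this.serversByGroup = copy;
    }

    // Unguarded: throws NullPointerException when called before initialize().
    String pickServerUnguarded(String group, int regionIndex) {
        List<String> servers = serversByGroup.get(group);
        return servers.get(regionIndex % servers.size());
    }

    // Guarded: reports "not ready" explicitly instead of failing with an NPE.
    String pickServer(String group, int regionIndex) {
        Map<String, List<String>> snapshot = serversByGroup;
        if (snapshot == null) {
            throw new NotReadyException("balancer not initialized yet");
        }
        List<String> servers = snapshot.get(group);
        if (servers == null || servers.isEmpty()) {
            throw new NotReadyException("no servers for group " + group);
        }
        return servers.get(regionIndex % servers.size());
    }

    public static void main(String[] args) {
        GroupBalancerSketch balancer = new GroupBalancerSketch();
        try {
            balancer.pickServer("default", 0);
        } catch (NotReadyException e) {
            System.out.println("retry later: " + e.getMessage());
        }
        Map<String, List<String>> groups = new HashMap<>();
        List<String> servers = new ArrayList<>();
        servers.add("rs-1");
        servers.add("rs-2");
        groups.put("default", servers);
        balancer.initialize(groups);
        System.out.println("assigned to " + balancer.pickServer("default", 1));
    }
}
```

With an explicit "not ready" signal, the caller can release whatever it holds
and retry, rather than dying on an unexpected NullPointerException.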
As shown in the attached figure bug2.png, when we shut down the last region
server, the master submits a ServerCrashProcedure. This procedure tries to
reassign the hbase:meta region, but at that moment there is no online region
server, so the procedure cannot complete. When we then shut down the master,
the ServerCrashProcedure and its subprocedures are persisted in the
procedureStore.
When we restart the master, it first blocks waiting to become the active
master. After becoming the active master, it starts the procedureExecutor. The
procedureExecutor reads the procedures back from the procedureStore, and the
previous ServerCrashProcedure submits an assign-region task to the
assignmentManager's queue. The processQueue thread and the active-master thread
both block waiting for online region servers. When we start a region server,
the active-master thread performs some operations and initializes the rsgroup
balancer. At the same time, the processQueue thread starts to call the
balancer. If the processQueue thread runs ahead of the active-master thread, it
throws an NPE. As a result, the procedure never completes and never releases
the hbase:meta region lock.
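The ordering problem described above can be modelled with two plain Java
threads. This is a minimal sketch under stated assumptions, not HBase code:
StartupOrderingSketch, initBalancer and assignMeta are hypothetical names, and
the CountDownLatch is just one way to make the assigning thread wait until
initialization has finished instead of racing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the startup ordering; thread names mirror the description only.
final class StartupOrderingSketch {
    private final CountDownLatch balancerReady = new CountDownLatch(1);
    // Stays null until the "active-master" thread initializes the balancer.
    private volatile List<String> onlineServers;

    // Called by the "active-master" thread once a region server has reported in.
    void initBalancer(List<String> servers) {
        onlineServers = new ArrayList<>(servers);
        balancerReady.countDown();
    }

    // Called by the "processQueue" thread. Returns null when the balancer is not
    // ready within the timeout, so the caller can retry instead of hitting an NPE.
    String assignMeta(long timeout, TimeUnit unit) {
        try {
            if (!balancerReady.await(timeout, unit)) {
                return null;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        List<String> servers = onlineServers;
        return servers.isEmpty() ? null : servers.get(0);
    }

    public static void main(String[] args) throws InterruptedException {
        StartupOrderingSketch sketch = new StartupOrderingSketch();
        AtomicReference<String> target = new AtomicReference<>();
        Thread processQueue = new Thread(
            () -> target.set(sketch.assignMeta(5, TimeUnit.SECONDS)), "processQueue");
        processQueue.start();   // deliberately starts before the balancer is initialized
        Thread.sleep(100);      // the "active-master" thread is slower
        sketch.initBalancer(Arrays.asList("rs-1"));
        processQueue.join();
        System.out.println("hbase:meta assigned to " + target.get());
    }
}
```

In this model the processQueue thread may start first, but it either waits for
the balancer or gives up cleanly and can be retried.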