[ 
https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenyang updated HBASE-20919:
-----------------------------
    Description: 
If you enable rsgroup, hbase-site.xml contains the configuration below.
{code:java}
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
</property>
<property>
  <name>hbase.master.loadbalancer.class</name>
  <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
</property>
{code}
Then you shut down the whole HBase cluster in this order:
 # shut down the region servers one by one
 # shut down the master

Then you restart the whole cluster in this order:
 # start the master
 # start the region servers

The hbase:meta region cannot come back online and rsgroup cannot be 
initialized successfully.
Master logs:
{code:java}
2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}
The logs show that the hbase:meta region is not online and that rsgroup keeps 
retrying its initialization.
  
But why is the hbase:meta region not online?
The info-level logs and the jstack output did not contain enough information, 
so I added some debug logging to a test build of the source. I then checked 
the master's and the region server's logs and found that the meta region 
assign procedure, which holds the meta region lock, never completed and never 
released the lock, so the recoverMetaProcedure could not be executed.
  
Why did the first procedure never complete or release the meta region lock?
In the test logs, I found that when the assignmentManager assigns the region, 
it has to call the rsgroup balancer, which has not been fully initialized yet, 
so it throws an NPE. As a result, the procedure never completes and never 
releases the lock.
{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
{code}
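The stuck-lock effect described above can be sketched in a few lines of plain Java. This is not the HBase code: LateInitBalancer, StuckLockSketch, and the single-permit Semaphore standing in for the hbase:meta region lock are all hypothetical names chosen for illustration. The sketch only assumes that the assign step takes the lock, calls a balancer that is not initialized yet, and releases the lock solely on the success path.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Semaphore;

/** Hypothetical stand-in for a balancer whose group info is filled in late. */
class LateInitBalancer {
    // Stays null until initialize() runs.
    private Map<String, List<String>> serversByGroup;

    void initialize(Map<String, List<String>> groups) {
        this.serversByGroup = groups;
    }

    String pickServer(String group) {
        // Throws NullPointerException if initialize() has not run yet.
        return serversByGroup.get(group).get(0);
    }
}

class StuckLockSketch {
    /**
     * Models an assign step that takes the lock, calls the balancer, and
     * releases the lock only on the success path (no try/finally).
     * Returns true if the step finished normally.
     */
    static boolean assignMeta(Semaphore metaLock, LateInitBalancer balancer) {
        if (!metaLock.tryAcquire()) {
            return false; // another procedure still holds the lock
        }
        try {
            balancer.pickServer("default");
        } catch (NullPointerException e) {
            // The step dies here; the release below is never reached.
            return false;
        }
        metaLock.release();
        return true;
    }

    public static void main(String[] args) {
        Semaphore metaLock = new Semaphore(1); // one permit = exclusive lock
        LateInitBalancer balancer = new LateInitBalancer();

        boolean first = assignMeta(metaLock, balancer); // balancer not ready
        balancer.initialize(Collections.singletonMap(
            "default", Collections.singletonList("rs1")));
        boolean second = assignMeta(metaLock, balancer); // lock still held

        System.out.println("first=" + first + ", second=" + second
            + ", permits=" + metaLock.availablePermits());
    }
}
```

In this sketch the second attempt fails even though the balancer is ready by then, because the first attempt never gave the permit back. Releasing in a finally block would avoid that in the sketch; whether the same applies to the real procedure framework is not established here.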
As shown in the attached figure bug2.png, when we shut down the last region 
server, the master submits a ServerCrashProcedure. The procedure tries to 
reassign the hbase:meta region, but at that moment there is no online region 
server, so the procedure cannot run to completion. When we then shut down the 
master, the ServerCrashProcedure and its subprocedures are stored in the 
procedureStore.
  
When we restart the master, it first blocks waiting to become the active 
master. After becoming the active master, it starts the procedureExecutor. The 
procedureExecutor starts reading procedures from the procedureStore, and the 
previously stored ServerCrashProcedure submits an assign-region task to the 
assignmentManager's queue. The processQueue thread and the active-master thread 
both block waiting for online region servers. When we start a region server, 
the active-master thread performs some operations and initializes the rsgroup 
balancer. At the same time, the processQueue thread starts to call the 
balancer. If the processQueue thread runs ahead of the active-master thread, 
the processQueue thread throws an NPE. As a result, the procedure never 
completes and never releases the hbase:meta region lock.
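
One possible way to close such a race, shown only as a sketch and not as the fix adopted for this issue, is to make the consumer wait on a readiness latch until initialization has finished. GuardedBalancer and RaceGuardSketch are hypothetical names; the thread named processQueue merely mimics the role described above, and the sleep stands in for the active-master thread still being busy.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical balancer that refuses to answer before it is initialized. */
class GuardedBalancer {
    private final CountDownLatch ready = new CountDownLatch(1);
    private volatile String[] servers; // null until initialize() runs

    void initialize(String... onlineServers) {
        this.servers = onlineServers;
        ready.countDown(); // publish readiness after the state is set
    }

    /** Blocks until initialize() has run; returns null on timeout. */
    String pickServer(long timeout, TimeUnit unit) throws InterruptedException {
        if (!ready.await(timeout, unit)) {
            return null; // caller can requeue the assignment instead of failing
        }
        return servers[0];
    }
}

class RaceGuardSketch {
    /** Starts the consumer first, initializes later, returns the picked server. */
    static String run() {
        GuardedBalancer balancer = new GuardedBalancer();
        String[] picked = new String[1];

        Thread processQueue = new Thread(() -> {
            try {
                picked[0] = balancer.pickServer(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "processQueue");

        try {
            processQueue.start();       // the consumer runs ahead
            Thread.sleep(100);          // initialization is still pending
            balancer.initialize("rs1"); // balancer becomes ready
            processQueue.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        return picked[0];
    }

    public static void main(String[] args) {
        System.out.println("picked " + run());
    }
}
```

Here the consumer blocks in pickServer instead of dereferencing a null field, so the order in which the two threads wake up no longer matters. The timeout keeps the consumer from waiting forever if initialization never happens.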
  
  



> meta region can`t be re-online when restart if opening rsgroup
> --------------------------------------------------------------
>
>                 Key: HBASE-20919
>                 URL: https://issues.apache.org/jira/browse/HBASE-20919
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master, rsgroup
>    Affects Versions: 2.0.1
>            Reporter: chenyang
>            Priority: Major
>         Attachments: bug2.png, hbase-hbase-master-bjpg-rs4730.yz02.log.test



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
