[
https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555265#comment-16555265
]
chenyang commented on HBASE-20919:
----------------------------------
Hi [~elserj], thanks for your suggestions.
Q:"What about failing fast here, and having the caller decide how to handle the
retry logic? AssignmentManager should already have logic to do this."
A: Failing fast is the better solution. AssignmentManager catches
HBaseIOException and re-adds the regions to the pending assignment queue (via
addToPendingAssignment). The code below is from the processAssignmentPlans()
method:
{code:java}
try {
  acceptPlan(regions, balancer.retainAssignment(retainMap, servers));
} catch (HBaseIOException e) {
  LOG.warn("unable to retain assignment", e);
  addToPendingAssignment(regions, retainMap.keySet());
}
// or
try {
  acceptPlan(regions, balancer.roundRobinAssignment(hris, servers));
} catch (HBaseIOException e) {
  LOG.warn("unable to round-robin assignment", e);
  addToPendingAssignment(regions, hris);
}
{code}
I will submit a new patch that implements failing fast.
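For reference, here is a minimal sketch of the fail-fast idea. It is not the actual patch: every class and method name in it is a hypothetical stand-in, and NotInitializedException only plays the role that HBaseIOException plays in the real code. The balancer throws while it is not yet online, and the caller puts the regions back on a pending queue so they can be retried later.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class FailFastSketch {
    /** Hypothetical stand-in for HBaseIOException. */
    static class NotInitializedException extends IOException {
        NotInitializedException(String message) {
            super(message);
        }
    }

    /** Hypothetical balancer that fails fast until it is marked online. */
    static class SketchBalancer {
        private volatile boolean online = false;

        void markOnline() {
            online = true;
        }

        List<String> roundRobinAssignment(List<String> regions)
                throws NotInitializedException {
            if (!online) {
                // Fail fast instead of hitting a NullPointerException later.
                throw new NotInitializedException("balancer is not initialized yet");
            }
            return regions; // a real balancer would build an assignment plan here
        }
    }

    /** Hypothetical caller that owns the retry decision. */
    static class SketchAssigner {
        final Queue<String> pending = new ArrayDeque<>();
        private final SketchBalancer balancer;

        SketchAssigner(SketchBalancer balancer) {
            this.balancer = balancer;
        }

        /** Returns how many regions were planned in this round. */
        int processOnce(List<String> regions) {
            try {
                return balancer.roundRobinAssignment(regions).size();
            } catch (NotInitializedException e) {
                // Re-queue the regions so that a later round retries them.
                pending.addAll(regions);
                return 0;
            }
        }
    }
}
```

With this shape the balancer never blocks and never throws an unchecked exception; the caller decides when to retry.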
Q: "RSGroupLoadBalancer doesn't get initialized until after hbase:meta gets
assigned, but hbase:meta can't be assigned until the RSGroupLoadBalancer is
initialized so we soft-lock. "
A: I debugged the rsgroup initialization and tested some cases. The
initialization runs in an independent thread. So far I have not found a
soft-lock, but I think it is still a risk.
Q: "This is hard because, while I don't disagree with Stack's comment about
StochasticLB to RSGroupLB, the Master using the LoadBalancer before it was
initialized is bad"
A: According to my tests, it works to initialize the balancers before calling
startServiceThreads, which starts the ProcedureExecutor, in HMaster's
finishActiveMasterInitialization method. But I cannot be sure it is OK in
other cases; it probably needs more testing. So I think modifying
RSGroupBasedLoadBalancer carries lower risk. I will re-submit the patch that
initializes the balancers before calling startServiceThreads, for reference only.
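To illustrate why the ordering matters, here is a small self-contained model. It is only a sketch: SketchMaster, initializeBalancer and balancerReadyWhenUsed are hypothetical names, and the single-threaded executor merely stands in for the service threads that call the balancer. When initialization happens first, the background task always sees a ready balancer. When the service thread starts first, the sketch waits for the task before initializing, which models the case where the queue thread wins the race.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StartupOrderSketch {
    /** Hypothetical, heavily simplified model of the master startup. */
    static class SketchMaster {
        private volatile Object balancerState; // null until initialized
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        void initializeBalancer() {
            balancerState = new Object();
        }

        /** Starts a background task that uses the balancer; reports whether it was ready. */
        Future<Boolean> startServiceThreads() {
            return executor.submit(() -> balancerState != null);
        }

        void stop() {
            executor.shutdown();
        }
    }

    /** Returns true if the background task saw an initialized balancer. */
    static boolean balancerReadyWhenUsed(boolean initializeFirst) {
        SketchMaster master = new SketchMaster();
        try {
            if (initializeFirst) {
                master.initializeBalancer();
                return master.startServiceThreads().get();
            }
            // Model the bad interleaving: the task runs before initialization.
            boolean ready = master.startServiceThreads().get();
            master.initializeBalancer();
            return ready;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            master.stop();
        }
    }
}
```

In the real master the second interleaving only happens sometimes, which is why the problem is hard to reproduce; the sketch forces it so that the effect is visible.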
Q: "Do you have more logs you can share? "
A: I will provide the full logs and the steps along with the new patch. Because
the case requires starting, stopping, and restarting the whole master, I don't
know how to write a unit test for it. Do you or anyone else have suggestions?
> meta region can't be re-onlined when restarting cluster if opening rsgroup
> --------------------------------------------------------------------------
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Priority: Major
> Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png,
> hbase-hbase-master-bjpg-rs4730.yz02.log.test
>
>
> If rsgroup is enabled, hbase-site.xml contains the configuration below.
> {code:xml}
> <property>
>   <name>hbase.coprocessor.master.classes</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
> </property>
> <property>
>   <name>hbase.master.loadbalancer.class</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
> </property>
> {code}
> And you shut down the whole HBase cluster as follows:
> # first shut down the region servers one by one
> # shut down the master
> Then you restart the whole cluster as follows:
> # start the master
> # start the region servers
> The hbase:meta region cannot be brought back online and the rsgroup cannot be
> initialized successfully.
> Master logs:
> {code:java}
> 2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
> 2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}
> The logs show that the hbase:meta region is not online and rsgroup keeps
> retrying to initialize.
>
> But why is the hbase:meta region not online?
> The info-level logs and jstack did not have enough information, so I added
> some debug logging to the source code for testing. Then I checked the master's
> and the region server's logs and found that the meta region assign procedure,
> which holds the meta region lock, never completed and never released the lock,
> so the recoverMetaProcedure could not be executed.
>
> Why did the first procedure not complete and not release the meta region lock?
> In the test logs, I found that when assignmentManager assigned the region, it
> needed to call the rsgroup balancer, which had not been completely
> initialized, so it threw an NPE. As a result, the procedure never completed
> and never released the lock.
> {code:java}
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
>     at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
> {code}
> !bug2.png!
> As shown in the attached figure bug2.png, when we shut down the last region
> server, the master submits a ServerCrashProcedure. The procedure tries to
> reassign the hbase:meta region, but at that moment there is no online region
> server, so the procedure cannot complete. Then we shut down the master, and
> the ServerCrashProcedure and its subprocedures are stored in the
> procedureStore.
>
> When we restart the master, it first blocks waiting to become the active
> master. After becoming the active master, it starts the procedureExecutor.
> The procedureExecutor starts reading procedures from the procedureStore, and
> the earlier ServerCrashProcedure submits an assign-region task to the
> assignmentManager's queue. The processQueue thread and the active-master
> thread block waiting for online region servers. When we start a region server,
> the active-master thread performs some operations and initializes the rsgroup
> balancer. At the same time, the processQueue thread starts calling the
> balancer. If the processQueue thread runs faster than the active-master
> thread, the processQueue thread throws an NPE. As a result, the procedure
> never completes and never releases the hbase:meta region lock.
>
> For now, my solution is to initialize the balancer before calling
> startServiceThreads in finishActiveMasterInitialization() of HMaster. But this
> may have some side effects on the master.
> Based on stack's suggestion, I re-submitted a new patch that waits for the
> rsgroup balancer to finish initializing before calling the balance methods.
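The "wait for initialization" approach mentioned at the end of the description can be sketched roughly as follows. This is an illustration only, not the attached patch: SketchBalancer, finishInitialization and assign are hypothetical names, and the returned string is a placeholder for a real assignment plan. A CountDownLatch blocks balance calls until initialization finishes, with a timeout so that a caller cannot wait forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class WaitForInitSketch {
    /** Hypothetical balancer whose balance methods wait for initialization. */
    static class SketchBalancer {
        private final CountDownLatch initialized = new CountDownLatch(1);

        /** Called by the initializing thread once setup is complete. */
        void finishInitialization() {
            initialized.countDown();
        }

        /** Waits up to timeoutMs for initialization, then returns a placeholder plan. */
        String assign(String region, long timeoutMs) {
            try {
                if (!initialized.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                    throw new IllegalStateException(
                        "balancer not initialized after " + timeoutMs + " ms");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new IllegalStateException("interrupted while waiting for balancer", e);
            }
            return region + " -> server-1"; // placeholder for a real plan
        }
    }
}
```

The timeout is the important design choice here: without it, a balancer that never finishes initializing would hang the calling thread in the same way the NPE left the procedure stuck.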