[
https://issues.apache.org/jira/browse/HBASE-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790289#comment-16790289
]
lujie commented on HBASE-22017:
-------------------------------
After a long debugging session, I have found the cause of this bug. It is a
*data race*.
When the RegionServer that holds the meta table shuts down, it closes the
leases:
{code:java}
public void close() {
  this.stopRequested = true;
  leases.clear();
  LOG.info("Closed leases");
}
{code}
Meanwhile, when the HMaster uses RSRpcServices#scan to scan the table, the scan
removes the scanner's lease:
{code:java}
// see line 3345
try {
  // Remove lease while it's being processed in server; protects against case
  // where processing of request takes > lease expiration time.
  lease = regionServer.leases.removeLease(scannerName);
} catch (LeaseException e) {
  throw new ServiceException(e);
}
{code}
removeLease does the following:
{code:java}
Lease removeLease(final String leaseName) throws LeaseException {
  Lease lease = leases.remove(leaseName);
  if (lease == null) {
    throw new LeaseException("lease '" + leaseName + "' does not exist");
  }
  return lease;
}
{code}
Because close() has already cleared the leases, leases.remove returns null, so
lease == null and removeLease throws LeaseException.
So this is a data race, and the shared state is
{code:java}
leases{code}
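To make the interleaving concrete, here is a minimal single-threaded sketch that replays the two steps in the failing order (close first, then removeLease). It is not HBase code: LeaseRaceSketch, its nested Leases stand-in, the addLease helper and the lease name scanner-1 are all hypothetical, and a plain Object stands in for a lease.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in classes; only the shape of close() and removeLease()
// is taken from the snippets quoted in this comment.
class LeaseRaceSketch {
    static class LeaseException extends Exception {
        LeaseException(String msg) { super(msg); }
    }

    static class Leases {
        private final ConcurrentHashMap<String, Object> leases = new ConcurrentHashMap<>();
        private volatile boolean stopRequested = false;

        void addLease(String name) { leases.put(name, new Object()); }

        // Same shape as the quoted close(): set the flag, then clear the map.
        void close() {
            this.stopRequested = true;
            leases.clear();
        }

        // Same shape as the quoted removeLease(): no stopRequested check.
        Object removeLease(String name) throws LeaseException {
            Object lease = leases.remove(name);
            if (lease == null) {
                throw new LeaseException("lease '" + name + "' does not exist");
            }
            return lease;
        }
    }

    // Replays the failing order: the lease exists, the server closes,
    // and only then does the in-flight scan try to remove the lease.
    static String run() {
        Leases leases = new Leases();
        leases.addLease("scanner-1");
        leases.close();
        try {
            leases.removeLease("scanner-1");
            return "removed";
        } catch (LeaseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In the real server the two steps run on different threads, so the failure only shows up when the shutdown happens to win the race.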
I have checked the other places that access *leases* and found that they have a
safety check, for example:
{code:java}
public void renewLease(final String leaseName) throws LeaseException {
  if (this.stopRequested) { // safety check: do nothing once close() was called
    return;
  }
  Lease lease = leases.get(leaseName);
  if (lease == null) {
    throw new LeaseException("lease '" + leaseName +
        "' does not exist or has already expired");
  }
  lease.resetExpirationTime();
}
{code}
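For illustration only, a similar check could be applied to removeLease. The sketch below is not the patch for this issue: GuardedLeasesSketch and its helpers are hypothetical, a plain Object stands in for a lease, stopRequested is assumed to be volatile, and returning null after shutdown is an assumption that callers would have to tolerate. It tests the flag after the removal rather than before it, because close() sets stopRequested before clearing the map, so a removal that lost the race is guaranteed to see the flag.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the actual HBase fix.
class GuardedLeasesSketch {
    static class LeaseException extends Exception {
        LeaseException(String msg) { super(msg); }
    }

    private final ConcurrentHashMap<String, Object> leases = new ConcurrentHashMap<>();
    // Assumed volatile so the write in close() is visible to RPC handler threads.
    private volatile boolean stopRequested = false;

    void addLease(String name) { leases.put(name, new Object()); }

    void close() {
        this.stopRequested = true; // set before clear(), as in the quoted close()
        leases.clear();
    }

    Object removeLease(String name) throws LeaseException {
        Object lease = leases.remove(name);
        if (lease == null) {
            if (this.stopRequested) {
                // Lease vanished because the server is closing: not an error here.
                return null;
            }
            throw new LeaseException("lease '" + name + "' does not exist");
        }
        return lease;
    }

    // True if removing a lease after close() returns null instead of throwing.
    static boolean closedRemoveReturnsNull() {
        GuardedLeasesSketch l = new GuardedLeasesSketch();
        l.addLease("scanner-1");
        l.close();
        try {
            return l.removeLease("scanner-1") == null;
        } catch (LeaseException e) {
            return false;
        }
    }

    // True if a genuinely unknown lease still raises LeaseException.
    static boolean missingLeaseStillThrows() {
        GuardedLeasesSketch l = new GuardedLeasesSketch();
        try {
            l.removeLease("no-such-lease");
            return false;
        } catch (LeaseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(closedRemoveReturnsNull());
        System.out.println(missingLeaseStillThrows());
    }
}
```

Whether the caller in RSRpcServices#scan should treat a null lease as a normal shutdown signal or surface a different exception is a separate design decision that this sketch does not settle.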
I will post a patch soon.
> Failed to become active master due to lease 'XXX' does not exist
> ----------------------------------------------------------------
>
> Key: HBASE-22017
> URL: https://issues.apache.org/jira/browse/HBASE-22017
> Project: HBase
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Critical
> Attachments: logs.zip
>
>
> Test cluster: hadoop11 (master), hadoop14 (slave), hadoop15 (slave).
> If hadoop15 shuts down just before execution reaches
> org.apache.hadoop.hbase.regionserver.HStore#getScanner (line 2027),
> master startup fails:
> {code:java}
> 2019-03-06 01:36:17,040 ERROR [master/hadoop11:16000:becomeActiveMaster]
> master.HMaster: ***** ABORTING master hadoop11,16000,1551807353275: Unhandled
> exception. Starting shutdown. *****
> org.apache.hadoop.hbase.regionserver.LeaseException:
> org.apache.hadoop.hbase.regionserver.LeaseException: lease
> '3449673378019934209' does not exist
> at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:224)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3434)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42002)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:100)
> at
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:90)
> at
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
> at
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:344)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:242)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
> at
> org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:387)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:361)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}