zhangduo created HBASE-13172:
--------------------------------

             Summary: TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
                 Key: HBASE-13172
                 URL: https://issues.apache.org/jira/browse/HBASE-13172
             Project: HBase
          Issue Type: Bug
          Components: test
    Affects Versions: 1.1.0
            Reporter: zhangduo


The direct cause is that we get stuck in ServerManager.isServerReachable.

https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/

{noformat}
2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
{noformat}
The interval between the first and last retry log is about 1 minute, and the 
test only waits 1 minute, so the test times out.
I still do not know why this happens.

And at the end of the log there are lots of entries like this:
{noformat}
2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
        at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
        at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
        at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
        at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{noformat}
I think the problem is here:
{code:title=ServerManager.java}
    while (retryCounter.shouldRetry()) {
        ...
        try {
          retryCounter.sleepUntilNextRetry();
        } catch(InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
        ...
    }
{code}
We need to break out of the while loop when we get an InterruptedException, 
not just mark the current thread as interrupted.
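
A minimal, self-contained sketch of the proposed behaviour follows. It is not 
the actual ServerManager code: the class InterruptibleRetry, the Probe 
interface and the retryUntilReachable method are hypothetical names, and 
Thread.sleep stands in for retryCounter.sleepUntilNextRetry(). It only 
illustrates restoring the interrupt flag and then leaving the retry loop.

```java
final class InterruptibleRetry {

    /** Hypothetical stand-in for the reachability check made on each attempt. */
    interface Probe {
        boolean check() throws Exception;
    }

    /**
     * Calls the probe up to maxAttempts times, sleeping between attempts.
     * Stops early if the thread is interrupted while sleeping.
     *
     * @return true if the probe succeeded, false if retries ran out
     *         or the thread was interrupted
     */
    static boolean retryUntilReachable(Probe probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (probe.check()) {
                    return true;
                }
            } catch (Exception e) {
                // Treat any failure as "not reachable yet" and retry.
            }
            try {
                // Stand-in for retryCounter.sleepUntilNextRetry() (assumption).
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Restore the flag so callers can still see the interrupt...
                Thread.currentThread().interrupt();
                // ...and stop retrying instead of looping again.
                break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulate an interrupt arriving before the first sleep.
        Thread.currentThread().interrupt();
        int[] calls = {0};
        boolean reachable = retryUntilReachable(() -> { calls[0]++; return false; }, 10, 1000L);
        System.out.println("reachable=" + reachable + ", probe calls=" + calls[0]);
    }
}
```

With the break in place, an interrupted worker gives up after the current 
attempt instead of going through the remaining retries.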



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
