[ 
https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352559#comment-14352559
 ] 

Jeffrey Zhong commented on HBASE-13172:
---------------------------------------

I just skimmed through the thread. It seems the test was stuck in 
isServerReachable(). 

[~Apache9] To make the test case stable, you can set the config 
"hbase.master.maximum.ping.server.attempts" to 3 (the default is 10). For the 
isServerReachable() call, inside the IOException catch block, we should check the 
following conditions and return false immediately when either of them is true:
1) the current server has already been put in deadServers
2) the current IOException is a RegionServerStoppedException or a 
ServerNotRunningYetException
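
A minimal, self-contained sketch of the two early-exit checks listed above. It does not use the real HBase classes: the two exception types are local stand-ins with the same simple names, and Pinger, ReachabilitySketch and the deadServers set are hypothetical names for illustration only. The real method would also sleep between attempts.

```java
import java.io.IOException;
import java.util.Set;

// Local stand-ins for the HBase exception types, so the sketch compiles on its own.
class RegionServerStoppedException extends IOException {}
class ServerNotRunningYetException extends IOException {}

// Hypothetical stand-in for the RPC that pings a region server.
interface Pinger {
  void ping(String server) throws IOException;
}

class ReachabilitySketch {
  // Returns true if the server answered within maxAttempts, false otherwise.
  static boolean isServerReachable(String server, Set<String> deadServers,
      Pinger pinger, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        pinger.ping(server);
        return true;
      } catch (IOException ioe) {
        // 1) The server is already known to be dead: retrying is pointless.
        if (deadServers.contains(server)) {
          return false;
        }
        // 2) The server told us it is stopped or not running yet.
        if (ioe instanceof RegionServerStoppedException
            || ioe instanceof ServerNotRunningYetException) {
          return false;
        }
        // Any other IOException: fall through and retry (sleep omitted here).
      }
    }
    return false;
  }
}
```

With either check in place, the call returns after the first failed attempt instead of going through all the retries.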

[~jxiang] The following code inside RegionStates seems unnecessary; it should 
just return false (because the isServerReachable call may still return a 
false positive after retries). In addition, should we expire the server 
instead of directly putting it in deadServers? Thanks.

{code}
        if (serverManager.isServerReachable(server)) {
          return false;
        }
        // The size of deadServers won't grow unbounded.
        deadServers.put(hostAndPort, Long.valueOf(startCode));
{code}

> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-13172
>                 URL: https://issues.apache.org/jira/browse/HBASE-13172
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: zhangduo
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry log is about 1 minute, and we only 
> wait 1 minute, so the test times out.
> I still do not know why this happens.
> And at the end there are lots of these:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>       at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
>       at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>       at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
>       at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
>       at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
>       at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
>       at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
>       at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
>       at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
>     while (retryCounter.shouldRetry()) {
>         ...
>         try {
>           retryCounter.sleepUntilNextRetry();
>         } catch(InterruptedException ie) {
>           Thread.currentThread().interrupt();
>         }
>         ...
>     }
> {code}
> We need to break out of the while loop when we get an InterruptedException, not 
> just mark the current thread as interrupted.
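
For reference, the loop change suggested in the quoted description could look roughly like the sketch below. It is a standalone illustration, not the ServerManager code: RetryLoopSketch and pingWithRetries are hypothetical names, the ping is assumed to always fail, and a plain Thread.sleep stands in for retryCounter.sleepUntilNextRetry().

```java
class RetryLoopSketch {
  // Returns how many attempts were made before the loop ended
  // (by exhausting the retries or by being interrupted).
  static int pingWithRetries(int maxAttempts, long sleepMillis) {
    int attempts = 0;
    while (attempts < maxAttempts) {
      attempts++;
      // ... the ping would happen here; this sketch assumes it fails ...
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException ie) {
        // Restore the interrupt flag for callers, then stop retrying.
        Thread.currentThread().interrupt();
        break;
      }
    }
    return attempts;
  }
}
```

Without the break, and assuming the sleep behaves like Thread.sleep, every later sleep would throw immediately because the interrupt flag is still set, so the remaining retries would run back to back with no wait.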


