[ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353452#comment-14353452 ]
Jimmy Xiang commented on HBASE-13172:
-------------------------------------
+1. Looks good to me. As for the issue [~jeffreyz] pointed out, that part is
needed. It is preferred that an RS dies naturally (that is, per ZK) instead of
being marked dead by AM. The isServerReachable call should not return false
information after retries, since we check the start code, even if the retries
take longer than the ZK session timeout.
> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
> Key: HBASE-13172
> URL: https://issues.apache.org/jira/browse/HBASE-13172
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 1.1.0
> Reporter: zhangduo
> Assignee: zhangduo
> Attachments: HBASE-13172-branch-1.patch
>
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry logs is about 1 minute, and the
> test only waits 1 minute, so it times out.
> We still do not know why this happens.
> At the end of the log there are many entries like this:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
>     at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
>     at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
>     at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
>     at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
> while (retryCounter.shouldRetry()) {
>   ...
>   try {
>     retryCounter.sleepUntilNextRetry();
>   } catch(InterruptedException ie) {
>     Thread.currentThread().interrupt();
>   }
>   ...
> }
> {code}
> We need to break out of the while loop when we get an InterruptedException,
> not just mark the current thread as interrupted.
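
The fix proposed in the description above can be sketched roughly as follows. This is a minimal, standalone illustration, not the actual HBASE-13172 patch: the class `InterruptibleRetry`, the method `retryUntilReachable`, and its parameters are hypothetical names, and a plain `Thread.sleep` stands in for `retryCounter.sleepUntilNextRetry()`. The point it shows is that the loop restores the interrupt flag and then stops retrying, instead of continuing to spin through the remaining attempts.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not part of HBase.
final class InterruptibleRetry {

    private InterruptibleRetry() {}

    /**
     * Calls the probe up to maxAttempts times, sleeping between attempts.
     * Returns true as soon as the probe succeeds; returns false when the
     * attempts are exhausted or the calling thread is interrupted.
     */
    static boolean retryUntilReachable(BooleanSupplier probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            try {
                // Stand-in (assumption) for retryCounter.sleepUntilNextRetry().
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Restore the interrupt flag so callers can still observe it ...
                Thread.currentThread().interrupt();
                // ... and leave the loop instead of retrying again.
                break;
            }
        }
        return false;
    }
}
```

Without the `break`, every later sleep would throw immediately because the interrupt flag is still set, so the loop would burn through its remaining attempts without ever waiting.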