[ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352339#comment-14352339 ]
Jimmy Xiang commented on HBASE-13172:
-------------------------------------
Some tests on branch-1 are flakier than on master because we may kill the RS
holding meta, which takes longer to recover. On master there is no such issue,
since meta is on the master all the time. This also means that a flaky
assignment-related test on master usually indicates a bug. For branch-1, it is
a little more complicated.
You are right that this test is not meant to test region assignment. If we can
ensure that the 3 RSs killed don't hold meta, the test may not be that flaky.
We can add another test for meta handling if there is no such test case already.
> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
> Key: HBASE-13172
> URL: https://issues.apache.org/jira/browse/HBASE-13172
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 1.1.0
> Reporter: zhangduo
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry log is about 1 minute, and we
> only wait 1 minute, so the test times out.
> I still do not know why this happens.
> And at the end there are lots of these:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
>     at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
>     at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
>     at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
>     at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
> while (retryCounter.shouldRetry()) {
>   ...
>   try {
>     retryCounter.sleepUntilNextRetry();
>   } catch(InterruptedException ie) {
>     Thread.currentThread().interrupt();
>   }
>   ...
> }
> {code}
> We need to break out of the while loop when we get an InterruptedException,
> not just mark the current thread as interrupted.
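
The fix suggested above can be sketched in isolation as follows. This is a minimal, self-contained illustration and not the actual ServerManager code: the class `InterruptibleRetry`, the method `retryUntilReachable`, and its parameters are hypothetical names, and a plain `Thread.sleep` stands in for `retryCounter.sleepUntilNextRetry()`. The point is only that the catch block both restores the interrupt flag and leaves the loop.

```java
import java.util.function.BooleanSupplier;

public class InterruptibleRetry {

    /**
     * Calls the probe until it succeeds or maxAttempts is used up.
     * If the thread is interrupted while waiting between attempts, the
     * interrupt flag is restored and the loop exits immediately.
     * (Hypothetical helper, not part of HBase.)
     */
    public static boolean retryUntilReachable(BooleanSupplier probe,
                                              int maxAttempts,
                                              long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            try {
                // Stand-in for retryCounter.sleepUntilNextRetry().
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Preserve the interrupt for callers further up the stack...
                Thread.currentThread().interrupt();
                // ...and stop retrying instead of looping until attempts run out.
                break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulate an interrupt that arrives before the first sleep.
        Thread.currentThread().interrupt();
        boolean reachable = retryUntilReachable(() -> {
            calls[0]++;
            return false; // the server never becomes reachable
        }, 10, 60_000L);
        System.out.println("reachable=" + reachable + ", probes=" + calls[0]);
        // Clear the flag again so the JVM shuts down with a clean thread state.
        Thread.interrupted();
    }
}
```

Without the `break`, an interrupted thread would go on probing: once the flag is set, each later sleep throws again right away, so the loop only ends when the retry budget is exhausted.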