[jira] [Commented] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1

Jimmy Xiang (JIRA) Sun, 08 Mar 2015 16:29:22 -0700

    [ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352339#comment-14352339 ]


Jimmy Xiang commented on HBASE-13172:
-------------------------------------

Some tests on branch-1 are flakier than on master because we may kill the RS 
holding meta, which takes longer to recover. On master there is no such issue, 
since meta is always hosted on the master server. This also means that a flaky 
assignment-related test on master usually indicates a bug. For branch-1, it is 
a little more complicated.

You are right that this test is not meant to test region assignment. If we can 
ensure that the 3 RSs killed don't hold meta, the test may not be that flaky. 
We can add another test for meta handling if there is no such test case already.
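
As a rough illustration of the idea above, the helper below picks servers to abort while skipping any server that holds meta. It is a minimal, self-contained sketch and not code from the HBase test: the class name, the method name, and the `hostsMeta` predicate are hypothetical, and a real test would have to supply its own way of asking the mini cluster which RS carries meta.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class AbortCandidates {
    /**
     * Returns the first {@code count} servers that do not host meta.
     * The hostsMeta predicate is a stand-in (hypothetical) for whatever
     * the test uses to find the server carrying hbase:meta.
     */
    public static <S> List<S> pickServersToAbort(
            List<S> liveServers, Predicate<S> hostsMeta, int count) {
        List<S> picked = new ArrayList<>();
        for (S server : liveServers) {
            if (picked.size() == count) {
                break;
            }
            if (!hostsMeta.test(server)) {
                picked.add(server);
            }
        }
        if (picked.size() < count) {
            // Fail loudly instead of silently killing fewer servers.
            throw new IllegalStateException(
                "Only " + picked.size() + " non-meta servers available, need " + count);
        }
        return picked;
    }
}
```

Failing when too few non-meta servers exist keeps the test from quietly changing what it exercises.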

> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-13172
>                 URL: https://issues.apache.org/jira/browse/HBASE-13172
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: zhangduo
>
> The direct cause is that we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry log entries is about 1 minute, 
> and we only wait 1 minute, so the test times out.
> I still do not know why this happens.
> In the end there are lots of entries like this:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>       at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
>       at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>       at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>       at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
>       at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
>       at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
>       at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
>       at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
>       at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
>       at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
>     while (retryCounter.shouldRetry()) {
>         ...
>         try {
>           retryCounter.sleepUntilNextRetry();
>         } catch(InterruptedException ie) {
>           Thread.currentThread().interrupt();
>         }
>         ...
>     }
> {code}
> We need to break out of the while loop when we get an InterruptedException, 
> not just mark the current thread as interrupted.
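
The fix proposed in the description can be sketched with a generic retry loop. This is only an illustration under stated assumptions, not the actual ServerManager code: `InterruptibleRetry`, `retryUntilReachable`, and the `probe` callable are hypothetical names, and a plain `Thread.sleep` stands in for `retryCounter.sleepUntilNextRetry()`. The point is that the loop restores the interrupt flag and then stops retrying.

```java
import java.util.concurrent.Callable;

public class InterruptibleRetry {
    /**
     * Calls probe up to maxAttempts times, sleeping between attempts.
     * Returns true on the first successful probe; returns false when the
     * attempts are exhausted or the calling thread is interrupted.
     */
    public static boolean retryUntilReachable(
            Callable<Boolean> probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(probe.call())) {
                    return true;
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            } catch (Exception e) {
                // Treat any other failure as "not reachable yet" and retry.
            }
            if (attempt == maxAttempts - 1) {
                break; // no need to sleep after the final attempt
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Restore the flag for callers, then leave the loop
                // instead of continuing to retry.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

Without the early return, every later `Thread.sleep` would throw immediately because the interrupt flag is already set, so the loop would keep issuing probes with no pause until the attempts run out.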



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
