[ 
https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351858#comment-14351858
 ] 

zhangduo commented on HBASE-13172:
----------------------------------

The logic seems right. We failed to assign a region (possibly 
f3a9bc396ddd2bb1fcd1bdbc436eac36) to asf906.gq1.ygridcore.net,59366, so we try 
to reassign it. The region is in PENDING_OPEN state, so we enter 
isServerReachable. 

{noformat}
2015-03-06 04:06:19,414 INFO  [MASTER_SERVER_OPERATIONS-asf906:36657-0] 
master.GeneralBulkAssigner(194): Failed assigning 1 regions to server 
asf906.gq1.ygridcore.net,59366,1425614770146, reassigning them
2015-03-06 04:06:19,417 DEBUG [AM.-pool300-t1] master.AssignmentManager(1935): 
Force region state offline {f3a9bc396ddd2bb1fcd1bdbc436eac36 
state=PENDING_OPEN, ts=1425614779402, 
server=asf906.gq1.ygridcore.net,59366,1425614770146}
2015-03-06 04:06:19,421 DEBUG [AM.-pool300-t1] master.AssignmentManager(1858): 
Offline table,l\x9B\xC4/\xEA,1425614773116.f3a9bc396ddd2bb1fcd1bdbc436eac36., 
it's not any more on asf906.gq1.ygridcore.net,59366,1425614770146
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
asf906.gq1.ygridcore.net,59366,1425614770146 not running, aborting
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:902)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.closeRegion(RSRpcServices.java:988)
        at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:21082)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2032)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:744)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:314)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.closeRegion(ProtobufUtil.java:1729)
        at 
org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:771)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1834)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1951)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
        at 
org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.regionserver.RegionServerStoppedException):
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
asf906.gq1.ygridcore.net,59366,1425614770146 not running, aborting
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:902)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.closeRegion(RSRpcServices.java:988)
        at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:21082)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2032)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:744)

        at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1190)
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.closeRegion(AdminProtos.java:21935)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.closeRegion(ProtobufUtil.java:1726)
        ... 9 more
2015-03-06 04:06:19,425 INFO  [AM.-pool300-t1] master.RegionStates(1112): 
Transition {f3a9bc396ddd2bb1fcd1bdbc436eac36 state=PENDING_OPEN, 
ts=1425614779402, server=asf906.gq1.ygridcore.net,59366,1425614770146} to 
{f3a9bc396ddd2bb1fcd1bdbc436eac36 state=OFFLINE, ts=1425614779425, 
server=asf906.gq1.ygridcore.net,59366,1425614770146}
2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): 
Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
asf906.gq1.ygridcore.net,59366,1425614770146 not running, aborting
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:902)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1181)
        at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:21098)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2032)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:744)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:314)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1800)
        at 
org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
        at 
org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
        at 
org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.regionserver.RegionServerStoppedException):
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
asf906.gq1.ygridcore.net,59366,1425614770146 not running, aborting
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:902)
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1181)
        at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:21098)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2032)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:744)

        at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1190)
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
        at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
        at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
        ... 9 more
{noformat}

Here is the code on branch-1.
{code:title=AssignmentManager.java}
  private RegionState forceRegionStateToOffline(
      final HRegionInfo region, final boolean forceNewPlan) {
    ...
    case OFFLINE:
      // This region could have been open on this server
      // for a while. If the server is dead and not processed
      // yet, we can move on only if the meta shows the
      // region is not on this server actually, or on a server
      // not dead, or dead and processed already.
      // In case not using ZK, we don't need this check because
      // we have the latest info in memory, and the caller
      // will do another round checking any way.
      if (useZKForAssignment
          && regionStates.isServerDeadAndNotProcessed(sn)
          && wasRegionOnDeadServerByMeta(region, sn)) {
        if (!regionStates.isRegionInTransition(region)) {
          LOG.info("Updating the state to " + State.OFFLINE
              + " to allow to be reassigned by SSH");
          regionStates.updateRegionState(region, State.OFFLINE);
        }
        LOG.info("Skip assigning " + region.getRegionNameAsString()
            + ", it is on a dead but not processed yet server: " + sn);
        return null;
      }
    ...
  }
{code}
This logic has been removed entirely on master, so master does not have the 
same problem.
So I think this is only a test case issue? Is it enough to just make 
isServerReachable return quickly? I assume we do not test region assignment 
here.

Thanks. [~jxiang] [~jeffreyz]

> TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-13172
>                 URL: https://issues.apache.org/jira/browse/HBASE-13172
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.1.0
>            Reporter: zhangduo
>
> The direct reason is we are stuck in ServerManager.isServerReachable.
> https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
> {noformat}
> 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): 
> Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
> 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): 
> Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> {noformat}
> The interval between the first and last retry logs is about 1 minute, and we 
> only wait 1 minute, so the test times out.
> Still do not know why this happens.
> And at the end there are lots of these:
> {noformat}
> 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): 
> Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
>       at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
>       at 
> org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
>       at 
> org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
>       at 
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> {noformat}
> I think the problem is here
> {code:title=ServerManager.java}
>     while (retryCounter.shouldRetry()) {
>         ...
>         try {
>           retryCounter.sleepUntilNextRetry();
>         } catch(InterruptedException ie) {
>           Thread.currentThread().interrupt();
>         }
>         ...
>     }
> {code}
> We need to break out of the while loop when we get an InterruptedException, 
> not just mark the current thread as interrupted.
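
The change proposed in the description can be sketched as a standalone retry 
loop. This is only an illustration under stated assumptions: the class 
InterruptibleRetry, the method retryUntilReachable and its parameters are 
hypothetical names, not HBase code, and a plain Thread.sleep stands in for 
retryCounter.sleepUntilNextRetry(). The point it shows is that once the 
interrupt flag has been restored, any later sleep throws immediately, so the 
loop should return instead of working through the remaining attempts.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the ping loop in ServerManager.isServerReachable.
// Names and signature are illustrative only, not HBase APIs.
final class InterruptibleRetry {

  /**
   * Calls probe up to maxAttempts times, sleeping sleepMillis between attempts.
   * Returns true on the first successful probe, and false when the attempts
   * run out or when the calling thread is interrupted.
   */
  static boolean retryUntilReachable(
      Callable<Boolean> probe, int maxAttempts, long sleepMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        if (Boolean.TRUE.equals(probe.call())) {
          return true;
        }
      } catch (InterruptedException ie) {
        // Restore the flag for callers and stop retrying.
        Thread.currentThread().interrupt();
        return false;
      } catch (Exception e) {
        // Treat any other failure as "not reachable yet" and retry.
      }
      if (attempt + 1 < maxAttempts) {
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          // Restore the flag AND leave the loop, rather than only
          // re-marking the thread and continuing to retry.
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return false;
  }
}
```

With this shape, a thread that is interrupted while waiting between attempts 
gives up after the current attempt and leaves its interrupt status set for the 
caller to inspect.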



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
