[jira] [Commented] (HBASE-12844) ServerManager.isServerReacable() should sleep between retries

Hadoop QA (JIRA) Mon, 12 Jan 2015 21:07:45 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-12844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274717#comment-14274717
 ]


Hadoop QA commented on HBASE-12844:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12691834/hbase-12844_v1.patch
  against master branch at commit c32a2c0b16b1d7e41fd0ad4a2737b7f0f2806c82.
  ATTACHMENT ID: 12691834

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
                        Please justify why no new tests are needed for this 
patch.
                        Also please list what manual steps were performed to 
verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

    {color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       
org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12430//console

This message is automatically generated.

> ServerManager.isServerReacable() should sleep between retries
> -------------------------------------------------------------
>
>                 Key: HBASE-12844
>                 URL: https://issues.apache.org/jira/browse/HBASE-12844
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 1.0.0, 2.0.0, 1.1.0
>
>         Attachments: hbase-12844_v1.patch
>
>
> There is a fundamental problem with the way the assignment manager and 
> cluster membership work. The root cause of most of the complexity, and of 
> many bugs, is that we have multiple "cluster membership" sources. This 
> causes problems when they diverge from each other. 
> The master's in-memory ServerManager class keeps track of which servers are 
> online and which are considered dead. We have online and dead server lists 
> in ServerManager, and a separate dead server list in RegionStates. 
> There are at least 3 ways that a server can end up in a dead list. The first 
> is the zookeeper session. If a server loses its zk session, the master gets 
> a notification and expires the server. This is the regular way. 
> The second is calls to ServerManager.expireServer(). On the master this 
> mostly happens when the master rejoins the cluster. The master waits for 
> some time for RSs to heartbeat, then expires all the others and processes 
> them as dead servers. This method has the potential to hijack the regions 
> of a region server without the region server knowing about it (and thus can 
> cause multi-homing of regions for reads etc). 
> The third is RegionStates calling ServerManager.isServerReachable() and, if 
> the server is not reachable, adding it to its own dead list, but not to the 
> dead list of ServerManager. 
> Obviously, here as in the region assignment case, we should fix the "state 
> is kept in multiple places" syndrome, but not in this issue (we already have 
> HBASE-5487, etc for that). 
> In this issue we should at least solve the following case: 
> When a region server is starting up, it throws exceptions when we try to 
> ping it:
> {code}
> 2015-01-10 00:23:10,369 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't 
> reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=0 of 10
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
>         at 
> org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
>         at 
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>  org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
>         ... 9 more
> ....
> 2015-01-10 00:23:10,399 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't 
> reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=9 of 10
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
>         at 
> org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
>         at 
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>  org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
>         ... 9 more
> {code}
> After 10 attempts within tens of milliseconds (as opposed to sleeping 
> between retries), the server is put in the dead servers list in RegionStates 
> (but not in ServerManager's dead servers list). As a result, the region 
> server never receives YouAreDeadException, and the ServerManager thinks 
> that the server is alive and well, while RegionStates thinks that the RS 
> is dead and does not assign regions to it: 
> {code}
> 2015-01-10 00:23:13,163 INFO  
> [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] 
> master.AssignmentManager: Assigning 2 region(s) to 
> os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> 2015-01-10 00:23:13,170 WARN  
> [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] 
> master.RegionStates: Couldn't reach online server 
> os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> {code}
> This also prevents unassign etc., leaving the regions in transition state 
> forever (until the admin kills the RS manually). 
> {code}
> 2015-01-10 00:23:13,188 INFO  
> [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] 
> master.AssignmentManager: Skip assigning 
> loadtest_d1,cccccccc,1420849388510.15a752a6ad4b3a21c0d471483a225144., it is 
> on a dead but not processed yet server: 
> os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> {code}
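
For illustration only, the behavior the issue title asks for (sleeping between retries instead of burning all 10 attempts in a few tens of milliseconds) could look roughly like the sketch below. This is not the attached patch and not HBase code: the class RetryWithSleep, the method isReachable, the probe callback, and the sleep interval are all hypothetical names and values chosen for the example.

```java
import java.util.concurrent.Callable;

public class RetryWithSleep {

    /**
     * Hypothetical helper: probes a server up to maxAttempts times and
     * sleeps sleepMillis between failed attempts, so that a server which
     * is still starting up gets time to finish before we give up.
     */
    static boolean isReachable(Callable<Boolean> probe, int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (probe.call()) {
                    return true;
                }
            } catch (Exception e) {
                // For example "Server is not running yet"; fall through and retry.
            }
            // Do not sleep after the final attempt; there is nothing left to wait for.
            if (attempt < maxAttempts - 1) {
                Thread.sleep(sleepMillis);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] calls = {0};
        // Simulated region server that only answers on the third probe.
        boolean ok = isReachable(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Server is not running yet");
            }
            return true;
        }, 10, 10L);
        System.out.println("reachable=" + ok + " after " + calls[0] + " probes");
    }
}
```

With a sleep between attempts, the total retry window is roughly (maxAttempts - 1) * sleepMillis rather than however long 10 back-to-back RPCs take, which is what gives a starting region server a chance to come up. The actual interval and whether it should be configurable are decisions for the patch, not something this sketch settles.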



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
