[ 
https://issues.apache.org/jira/browse/HBASE-12844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-12844:
-----------------------------------
    Attachment: HBASE-12844-0.98.patch

The change is almost identical now. Only the hunks for the imports and the 
constructor were rejected, and they were trivial to fix up.

> ServerManager.isServerReachable() should sleep between retries
> --------------------------------------------------------------
>
>                 Key: HBASE-12844
>                 URL: https://issues.apache.org/jira/browse/HBASE-12844
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 1.0.0, 2.0.0, 1.1.0
>
>         Attachments: HBASE-12844-0.98.patch, HBASE-12844-0.98.patch, 
> hbase-12844_v1.patch
>
>
> There is a fundamental problem with the way the assignment manager and 
> cluster membership work. The root cause of most of the complexity, and of 
> many bugs, is that we have multiple "cluster membership" sources. This causes 
> problems when they diverge from each other. 
> The master's in-memory ServerManager class keeps track of which servers are 
> online and which are considered dead. We have online and dead servers lists 
> in ServerManager, and a separate dead servers list in RegionStates. 
> There are at least 3 ways that a server can end up on a dead list. The first 
> is the zookeeper session. If a server loses its zk session, the master gets 
> a notification and expires the server. This is the regular way. 
> The second is calls to ServerManager.expireServer(). On the master this 
> mostly happens when the master rejoins the cluster. The master waits for some 
> time for RS's to heartbeat, then expires all others and processes them as 
> dead servers. This method has the potential to hijack the regions on a region 
> server without the region server knowing about it (and thus can cause 
> multi-homing of regions for reads etc.). 
> The third is RegionStates calling ServerManager.isServerReachable() and, if 
> the server is not reachable, adding it to its own dead list, but not to the 
> dead list of ServerManager. 
> Obviously, in this case as in the region assignment case, we should fix 
> the "state is kept in multiple places" syndrome, but not in this issue (we 
> already have HBASE-5487, etc. for that). 
> In this issue we should at least solve the following case: 
> When a region server is starting up, it will throw exceptions when we try to 
> ping it:
> {code}
> 2015-01-10 00:23:10,369 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't 
> reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=0 of 10
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
>         at 
> org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
>         at 
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>  org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
>         ... 9 more
> ....
> 2015-01-10 00:23:10,399 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't 
> reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=9 of 10
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
>         at 
> org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
>         at 
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>  org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not 
> running yet
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
>         at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
>         ... 9 more
> {code}
> After 10 attempts happening within tens of milliseconds (as opposed to 
> sleeping between retries), the server is put in the dead servers list in 
> RegionStates (but not in ServerManager's dead servers list). This results in 
> the region server never receiving YouAreDeadException, and the ServerManager 
> thinking that the server is alive and well, while the RegionStates thinks 
> that the RS is dead and does not assign regions: 
> {code}
> 2015-01-10 00:23:13,163 INFO  
> [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] 
> master.AssignmentManager: Assigning 2 region(s) to 
> os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> 2015-01-10 00:23:13,170 WARN  
> [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] 
> master.RegionStates: Couldn't reach online server 
> os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> {code}
> This also prevents unassign etc., leaving the regions in transition state 
> forever (until the admin kills the RS manually). 
> {code}
> 2015-01-10 00:23:13,188 INFO  
> [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] 
> master.AssignmentManager: Skip assigning 
> loadtest_d1,cccccccc,1420849388510.15a752a6ad4b3a21c0d471483a225144., it is 
> on a dead but not processed yet server: 
> os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> {code}
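
To illustrate the behaviour the issue title asks for, here is a minimal sketch of a retry loop that sleeps between failed pings, so that the attempts span the region server's startup window instead of being exhausted within tens of milliseconds. This is not the attached patch: the class RetryWithSleep, the Probe interface, and the parameters maxAttempts and sleepMillis are hypothetical names chosen for illustration, and the ping is simulated rather than a real HBase RPC.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class RetryWithSleep {

    /** Hypothetical stand-in for the RPC that pings a region server. */
    @FunctionalInterface
    interface Probe {
        void ping() throws IOException;
    }

    /**
     * Returns true as soon as one ping succeeds. Sleeps sleepMillis between
     * failed attempts, and returns false once maxAttempts pings have failed
     * or the calling thread is interrupted.
     */
    static boolean isReachable(Probe probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                probe.ping();
                return true;
            } catch (IOException e) {
                // The server may still be starting up. Wait before the next
                // try, but do not sleep after the final attempt.
                if (attempt < maxAttempts - 1) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and give up.
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated server that rejects the first three pings, then answers.
        Probe startingServer = () -> {
            if (calls.incrementAndGet() <= 3) {
                throw new IOException("Server is not running yet");
            }
        };
        System.out.println(isReachable(startingServer, 10, 100L));
    }
}
```

With 10 attempts and a 100 ms sleep, the loop in this sketch tolerates roughly 900 ms of startup time before declaring the server unreachable. The actual retry count and sleep interval in HBase would be a configuration decision, which this sketch does not try to reproduce.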



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
