[
https://issues.apache.org/jira/browse/HBASE-12844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enis Soztutar updated HBASE-12844:
----------------------------------
Status: Patch Available (was: Open)
> ServerManager.isServerReachable() should sleep between retries
> --------------------------------------------------------------
>
> Key: HBASE-12844
> URL: https://issues.apache.org/jira/browse/HBASE-12844
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 1.0.0, 2.0.0, 1.1.0
>
> Attachments: hbase-12844_v1.patch
>
>
> There is a fundamental problem with the way the assignment manager and
> cluster membership work. The root cause of most of the complexity, and of
> many bugs, is that we have multiple sources of "cluster membership". This
> causes problems when they diverge from each other.
> The master's in-memory ServerManager class keeps track of which servers are
> online and which servers are considered dead. We have online and dead server
> lists in ServerManager, and a separate dead servers list in RegionStates.
> There are at least 3 ways that a server can end up on a dead list. First is
> the zookeeper session. If a server loses its zk session, the master gets a
> notification and expires the server. This is the regular way.
> Second is calls to ServerManager.expireServer(). On the master this mostly
> happens when the master rejoins the cluster: the master waits some time for
> RSs to heartbeat, then expires all the others and processes them as dead
> servers. This method has the potential to hijack the regions on a region
> server without the region server knowing about it (and thus can cause
> multi-homing of regions for reads, etc.).
> Third is RegionStates calling ServerManager.isServerReachable() and, if the
> server is not reachable, adding it to its own dead list, but not to the dead
> list of ServerManager.
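> Roughly, that third path looks like the sketch below (simplified and
> illustrative only; the class and field names are stand-ins, not the real
> HBase code):
> {code}
> // Simplified, illustrative sketch of the third path above -- not the real
> // HBase classes. It only shows the shape of the interaction: RegionStates
> // records the server as dead on its own, and ServerManager is never told.
> import java.util.HashSet;
> import java.util.Set;
>
> interface ServerManagerView {                  // stand-in for ServerManager
>   boolean isServerReachable(String serverName);
> }
>
> class RegionStatesSketch {                     // stand-in for RegionStates
>   private final ServerManagerView serverManager;
>   private final Set<String> deadNotProcessed = new HashSet<>();  // RegionStates' own dead list
>
>   RegionStatesSketch(ServerManagerView serverManager) {
>     this.serverManager = serverManager;
>   }
>
>   synchronized boolean isServerDeadAndNotProcessed(String serverName) {
>     if (!serverManager.isServerReachable(serverName)) {
>       // The server is marked dead here only. ServerManager's dead list is
>       // untouched, so the two membership views diverge and the RS never
>       // gets a YouAreDeadException.
>       deadNotProcessed.add(serverName);
>       return true;
>     }
>     return false;
>   }
> }
> {code}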
> Obviously, here as in the region assignment case, we should fix the "state
> is kept in multiple places" syndrome, but not in this issue (we already have
> HBASE-5487, etc. for that).
> In this issue we should at least solve the following case:
> When a region server is starting up, it throws exceptions when we try to
> ping it:
> {code}
> 2015-01-10 00:23:10,369 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=0 of 10
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
>     at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
>     at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
>     at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
>     at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException): org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
>     ... 9 more
> ....
> 2015-01-10 00:23:10,399 DEBUG [AM.-pool1-t5] master.ServerManager: Couldn't reach os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091, try=9 of 10
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:309)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1794)
>     at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:810)
>     at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:756)
>     at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1952)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1559)
>     at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException): org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:886)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getServerInfo(RSRpcServices.java:1155)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20886)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2028)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:21819)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1791)
>     ... 9 more
> {code}
> After 10 attempts that all happen within tens of milliseconds (instead of
> sleeping between retries), the server is put in the dead servers list in
> RegionStates (but not in ServerManager's dead servers list). As a result,
> the region server never receives YouAreDeadException: ServerManager thinks
> the server is alive and well, while RegionStates thinks the RS is dead and
> does not assign regions to it:
> {code}
> 2015-01-10 00:23:13,163 INFO  [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] master.AssignmentManager: Assigning 2 region(s) to os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> 2015-01-10 00:23:13,170 WARN  [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] master.RegionStates: Couldn't reach online server os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> {code}
> This also prevents unassigns, etc., leaving the regions stuck in transition
> forever (until the admin kills the RS manually):
> {code}
> 2015-01-10 00:23:13,188 INFO  [os-enis-hbase-1.0-test-2.hw.com,16020,1420848162613-GeneralBulkAssigner-0] master.AssignmentManager: Skip assigning loadtest_d1,cccccccc,1420849388510.15a752a6ad4b3a21c0d471483a225144., it is on a dead but not processed yet server: os-enis-hbase-1.0-test-1.hw.com,16020,1420849386091
> {code}
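> The proposed fix is to sleep between the isServerReachable() retries, so
> that a region server that is merely still starting up has a chance to come
> online before the master gives up on it. A minimal sketch of the idea is
> below (illustrative only, assuming a plain bounded retry loop with a fixed
> sleep; not necessarily how the attached patch implements it):
> {code}
> // Illustrative sketch only, not the attached patch: retry the "ping" a
> // bounded number of times and sleep between attempts, instead of burning
> // all retries within a few milliseconds while the RS is still starting up.
> // pingServer stands in for the getServerInfo RPC shown failing in the log.
> import java.util.concurrent.Callable;
>
> final class ReachabilityCheckSketch {
>   static boolean isServerReachable(Callable<Boolean> pingServer, int maxRetries, long sleepMs) {
>     for (int attempt = 0; attempt < maxRetries; attempt++) {
>       try {
>         if (Boolean.TRUE.equals(pingServer.call())) {
>           return true;                      // the server answered; it is reachable
>         }
>       } catch (Exception e) {
>         // e.g. ServerNotRunningYetException while the RS is still starting up
>       }
>       try {
>         Thread.sleep(sleepMs);              // the missing piece: wait before retrying
>       } catch (InterruptedException ie) {
>         Thread.currentThread().interrupt(); // give up promptly if interrupted
>         return false;
>       }
>     }
>     return false;
>   }
> }
> {code}
> With, say, 10 retries and a 1 second sleep, this gives the starting region
> server roughly 10 seconds to come up, instead of the ~30 ms window visible
> in the log above (try=0 at 00:23:10,369, try=9 at 00:23:10,399).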
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)