[
https://issues.apache.org/jira/browse/HBASE-13567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520853#comment-14520853
]
stack commented on HBASE-13567:
-------------------------------
A server dies. We mark all of its hosted regions as 'recovering' up in zk. We then
queue the splitting of all of its logs. When the splitting is done, the split log
coordinator removes the 'recovering' state from zk and the regions go back to taking reads.
What happened here is that we got through the above process, but the master died
just as the last log was queued. When the master came back up, that last log had
already been split... so it thought all was ok.
There is no provision for running the removal of the 'recovering' state when a
master joins a cluster with DLR enabled.
Let me try to add one, plus a test.
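Roughly what I have in mind, as a sketch only: the znode layout and class name below
are assumptions for illustration, and the real change would plug into the master
startup path via SplitLogManagerCoordination, firing only when DLR is on and no
split-log tasks are outstanding.
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch only: clear leftover 'recovering' markers when a master joins a
 * cluster with DLR enabled and no split-log work remains. The znode layout
 * ("/hbase/recovering-regions/<encoded-region>/<server>") and this class name
 * are assumptions for illustration; the real fix would go through the
 * SplitLogManagerCoordination rather than talk to zk directly.
 */
public class StaleRecoveringRegionCleanup {
  private final ZooKeeper zk;
  private final String recoveringZNode; // e.g. "/hbase/recovering-regions" (assumed)

  public StaleRecoveringRegionCleanup(ZooKeeper zk, String recoveringZNode) {
    this.zk = zk;
    this.recoveringZNode = recoveringZNode;
  }

  /** Run after master init has confirmed there is no outstanding log splitting. */
  public void clearStaleRecoveringRegions() throws KeeperException, InterruptedException {
    if (zk.exists(recoveringZNode, false) == null) {
      return; // nothing is marked recovering; nothing to do
    }
    for (String region : zk.getChildren(recoveringZNode, false)) {
      String regionPath = recoveringZNode + "/" + region;
      // Delete any per-server children first, then the region marker itself
      // so the region can come out of 'recovering' and take reads again.
      for (String server : zk.getChildren(regionPath, false)) {
        zk.delete(regionPath + "/" + server, -1);
      }
      zk.delete(regionPath, -1);
    }
  }
}
{code}
A check against the split log manager's outstanding tasks would guard the delete so
we never clear markers for logs that are still being replayed.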
> [DLR] Region stuck in recovering mode
> -------------------------------------
>
> Key: HBASE-13567
> URL: https://issues.apache.org/jira/browse/HBASE-13567
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 1.1.0
> Reporter: stack
> Assignee: stack
>
> On the hosting server, it says:
> {code}
> 2015-04-24 22:47:27,736 DEBUG [PriorityRpcServer.handler=2,queue=0,port=16020] ipc.RpcServer: PriorityRpcServer.handler=2,queue=0,port=16020: callId: 84 service: ClientService methodName: Get size: 99 connection: 10.20.84.26:52860
> org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: hbase:namespace,,1425361558502.57b3dc571e5753b306509b5740cd25b9. is recovering; cannot take reads
>   at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:7530)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2427)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2422)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6451)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6430)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1898)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32201)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2112)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> whenever someone tries to read the region. In the above case, it is the
> master trying to initialize after being killed by a monkey. It is trying to
> set up the TableNamespaceManager. Eventually it fails after 350 attempts:
> {code}
> 2015-04-25 00:35:58,750 WARN [c2020:16000.activeMasterManager] master.TableNamespaceManager: Caught exception in initializing namespace table manager
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=350, exceptions:
> Fri Apr 24 22:40:57 PDT 2015, RpcRetryingCaller{globalStartTime=1429940457781, pause=100, retries=350}, org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: hbase:namespace,,1425361558502.57b3dc571e5753b306509b5740cd25b9. is recovering; cannot take reads
>   at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:7530)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2427)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2422)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6451)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6430)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1898)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32201)
> {code}
> The master is supposed to have 'processed' the region -- the hbase:namespace
> in this case -- but the regionserver did not get the notification:
> {code}
> 2015-04-24 22:46:31,650 DEBUG [M_LOG_REPLAY_OPS-c2020:16000-8] coordination.SplitLogManagerCoordination: Processing recovering [3fbee1781e0c2ded3cc30b701a03797d, 3f4ea5ea14653cee6006f13c7d06d10b, e52b81449d08921c49625620cfc7ace7, 07459e75bef40ec82b6d4267c9de9971, b26731667a5e0f15162ad4fa3408b99c, 57b3dc571e5753b306509b5740cd25b9, 349eadd360d57083a88e0d84bcb29351, 99e3eb2bcf44bccded24103e351c96b6] and servers [c2021.halxg.cloudera.com,16020,1429940280208], isMetaRecovery=false
> {code}
> Do we need to keep sending the notification until it is acknowledged by the
> regionserver, as we do with split?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)