[
https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158455#comment-14158455
]
stack commented on HBASE-12166:
-------------------------------
The problem is that the meta region has no edits in it, so when we list the
filesystem to find crashed servers, we see this:
2014-10-03 10:21:42,470 INFO [ActiveMasterManager] master.ServerManager(918): Finished waiting for region servers count to settle; checked in 6, slept for 1012 ms, expecting minimum of 5, maximum of 6, master is running
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61565,1412356895892 belongs to an existing region server
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(249): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952 doesn't belong to a known region server, splitting
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61576,1412356896091 belongs to an existing region server
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61579,1412356896131 belongs to an existing region server
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61582,1412356896169 belongs to an existing region server
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61586,1412356896205 belongs to an existing region server
2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61591,1412356896245 belongs to an existing region server
That is, all servers are accounted for except the dead master, which in this case is
localhost,61562,1412356895859.
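In other words, crashed-server detection is just a diff of the WAL folder names against the servers that have checked in with the master. A minimal sketch of that comparison (names and shape are illustrative only, not the actual MasterFileSystem code):
{code}
// Illustrative sketch (not real HBase code): decide which WAL folders under
// /hbase/WALs belong to servers that are no longer alive and so need splitting.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WalFolderScan {
  static Set<String> findServersToSplit(List<String> walFolderNames, Set<String> onlineServers) {
    Set<String> toSplit = new HashSet<>();
    for (String folder : walFolderNames) {
      // WAL folders are named host,port,startcode, with a "-splitting" suffix
      // once splitting has begun.
      String serverName = folder.replaceAll("-splitting$", "");
      if (!onlineServers.contains(serverName)) {
        toSplit.add(serverName); // "doesn't belong to a known region server, splitting"
      }
    }
    return toSplit;
  }
}
{code}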
When we go to online hbase:meta, it complains that there is no WAL directory for that server:
2014-10-03 10:21:42,478 INFO [ActiveMasterManager] master.MasterFileSystem(325): Log dir for server localhost,61562,1412356895859 does not exist
2014-10-03 10:21:42,479 DEBUG [ActiveMasterManager] master.MasterFileSystem(323): Renamed region directory: hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952-splitting
2014-10-03 10:21:42,479 INFO [ActiveMasterManager] master.SplitLogManager(536): dead splitlog workers [localhost,61562,1412356895859, localhost,61572,1412356895952]
2014-10-03 10:21:42,480 INFO [ActiveMasterManager] master.SplitLogManager(172): hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952-splitting is empty dir, no logs to split
2014-10-03 10:21:42,480 DEBUG [ActiveMasterManager] master.SplitLogManager(235): Scheduling batch of logs to split
2014-10-03 10:21:42,480 INFO [ActiveMasterManager] master.SplitLogManager(237): started splitting 0 logs in [hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952-splitting]
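So zero logs get split: the dead master's log dir does not exist at all (the meta region it carried had no edits), and the one -splitting dir that does get scheduled turns out to be empty. A minimal sketch of that corner case, assuming a plain Hadoop FileSystem check (hypothetical helper, not SplitLogManager's real code):
{code}
// Illustrative sketch (not real HBase code): if a dead server's -splitting dir
// is missing or empty there is nothing to split, yet the regions it carried
// still have to be marked recovered afterwards -- which is where this test trips up.
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmptyWalDirCheck {
  static int countLogsToSplit(FileSystem fs, Path splittingDir) throws IOException {
    if (!fs.exists(splittingDir)) {
      return 0; // "Log dir for server ... does not exist"
    }
    FileStatus[] logs = fs.listStatus(splittingDir);
    return logs == null ? 0 : logs.length; // 0 => "is empty dir, no logs to split"
  }
}
{code}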
> TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
> ---------------------------------------------------------------
>
> Key: HBASE-12166
> URL: https://issues.apache.org/jira/browse/HBASE-12166
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0, 0.99.1
>
> Attachments: 12166.txt, log.txt
>
>
> See
> https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/
> The namespace region gets stuck. It is never 'recovered' even though we have
> finished log splitting. Here is the main exception:
> {code}
> 2014-10-03 02:00:36,862 DEBUG [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: ClientService methodName: Get size: 99 connection: 67.195.81.144:44526
> org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. is recovering
>   at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086)
>   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014)
>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
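> For context, the exception above comes from the recovery gate in startRegionOperation: while distributed log replay marks a region as recovering, reads are rejected until the flag is cleared. Roughly, as a simplified sketch (not the real HRegion code), the gate amounts to:
> {code}
> // Simplified sketch (not real HBase code): reads fail while the recovering
> // flag is set -- and for hbase:namespace it is never cleared in this test run.
> import java.util.concurrent.atomic.AtomicBoolean;
>
> class RecoveringGate {
>   private final AtomicBoolean recovering = new AtomicBoolean(true);
>   private final String regionName;
>
>   RecoveringGate(String regionName) { this.regionName = regionName; }
>
>   void startRegionOperation() {
>     if (recovering.get()) {
>       // stands in for RegionInRecoveryException
>       throw new IllegalStateException(regionName + " is recovering");
>     }
>   }
>
>   // flipped to false when the region's recovering-regions znode is deleted
>   void setRecovering(boolean newState) { recovering.set(newState); }
> }
> {code}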
> See how we had finished log splitting a long time previously:
> {code}
> 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] master.SplitLogManager(294): finished splitting (more than or equal to) 197337 bytes in 1 log files in [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] in 379ms
> {code}
> If I grep for the deletion of the znodes on recovery, which is when we set the
> recovering flag to false, I see a bunch of regions but not my namespace one (a sketch
> of this mechanism follows the log lines below):
> 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 znode deleted. Region: 1588230740 completes recovery.
> 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery.
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. Region: 41d438848305831b61d708a406d5ecde completes recovery.
> 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery.
> 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery.
> 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery.
> 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. Region: a4337ad2874ee7e599ca2344fce21583 completes recovery.
> 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery.
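> As a rough sketch of the mechanism grepped for above, assuming a plain ZooKeeper watcher (names are illustrative, not RecoveringRegionWatcher's actual code): deletion of the region's recovering-regions znode is what lets reads through again.
> {code}
> // Illustrative sketch (not the actual RecoveringRegionWatcher): when the
> // /hbase/recovering-regions/<encodedRegionName> znode is deleted, the region
> // server clears the region's recovering flag so reads are allowed again.
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
>
> class RecoveringRegionWatcherSketch implements Watcher {
>   @Override
>   public void process(WatchedEvent event) {
>     if (event.getType() == Event.EventType.NodeDeleted
>         && event.getPath() != null
>         && event.getPath().startsWith("/hbase/recovering-regions/")) {
>       String encodedRegionName = event.getPath().substring("/hbase/recovering-regions/".length());
>       // The real server would look up the region and call setRecovering(false);
>       // for hbase:namespace that deletion never arrives in this test.
>       System.out.println("Region: " + encodedRegionName + " completes recovery.");
>     }
>   }
> }
> {code}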
> This would seem to indicate that we successfully wrote to zk that we are
> recovering:
> {code}
> 2014-10-03 01:57:47,672 DEBUG [MASTER_SERVER_OPERATIONS-asf900:37113-0] coordination.ZKSplitLogManagerCoordination(647): Mark region eba5d23de65f2718715eeb22edf7edc2 recovering from failed region server asf900.gq1.ygridcore.net,42071,1412301461790
> {code}
> As part of the open of the namespace region, we updated our last flushed sequence id successfully:
> {code}
> 2014-10-03 01:57:47,698 DEBUG [Thread-9216-EventThread] zookeeper.ZooKeeperWatcher(304): master:37113-0x148d3b7d84b020f, quorum=localhost:50702, baseZNode=/hbase Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/recovering-regions/eba5d23de65f2718715eeb22edf7edc2
> 2014-10-03 01:57:47,699 DEBUG [PostOpenDeployTasks:eba5d23de65f2718715eeb22edf7edc2] regionserver.HRegionServer(2947): Update last flushed sequence id of region eba5d23de65f2718715eeb22edf7edc2 for asf900.gq1.ygridcore.net,42071,1412301461790
> {code}
> Not sure why this znode is still up in zk.
> Let me dig again later.