wchevreuil commented on a change in pull request #6: HBASE-22680 [HBCK2]
OfflineMetaRepair for hbase2/hbck2
URL: https://github.com/apache/hbase-operator-tools/pull/6#discussion_r303860447
##########
File path: hbase-hbck2/README.md
##########
@@ -337,11 +334,53 @@ The Master is unable to continue startup because there
is no Procedure to assign
_hbase:meta_ (or _hbase:namespace_). To inject one, use the _HBCK2_ tool:
```
-HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase
org.apache.hbase.HBCK2 assigns 1588230740
+HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase
org.apache.hbase.HBCK2 -skip assigns 1588230740
```
-...where 1588230740 is the encoded name of the _hbase:meta_ Region.
+...where 1588230740 is the encoded name of the _hbase:meta_ Region. Pass the
'-skip' option to
+stop HBCK2 doing a version check against the remote master. If the remote
master is not up,
+the version check will prompt a 'Master is initializing' response or
'PleaseHoldException'
+and drop the assign attempt. The '-skip' option punts on the version check and
will land the
+scheduled assign.
The same may happen to the _hbase:namespace_ system table. Look for the
encoded Region name of the _hbase:namespace_ Region and do similar to
-what we did for _hbase:meta_.
+what we did for _hbase:meta_. In this latter case, the Master actually
+prints out a helpful message that looks like the following:
+
+```2019-07-09 22:08:38,966 WARN [master/localhost:16000:becomeActiveMaster]
master.HMaster:
hbase:namespace,,1562733904278.9559cf72b8e81e1291c626a8e781a6ae. is NOT online;
state={9559cf72b8e81e1291c626a8e781a6ae state=CLOSED, ts=1562735318897,
server=null}; ServerCrashProcedures=true. Master startup cannot progress, in
holding-pattern until region onlined.```
+
+To schedule an assign for the hbase:namespace table noted in the above log
line, you would do:
+```HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase
org.apache.hbase.HBCK2 -skip assigns 9559cf72b8e81e1291c626a8e781a6ae```
+... passing the encoded name for the namespace region (the encoded name will
differ per deploy).
+
+### hbase:meta region/table restore/rebuild
+
+Should a cluster suffer a catastrophic loss of the `hbase:meta` region, a
rough rebuild is possible following the recipe below. In outline: stop the
cluster; run the _OfflineMetaRepair_ tool which reads directories and metadata
dropped into the filesystem making a best effort at reconstructing a viable
_hbase:meta_ table; restart your cluster; inject an assign to bring the system
namespace table online; and then finally, re-assign userspace tables you'd like
enabled (the rebuilt _hbase:meta_ creates a table with all tables offline and
no regions assigned).
+
+#### Detailed rebuild recipe
+Stop the cluster.
+
+Run the rebuild _hbase:meta_ command from _HBCK2_. This will move aside the
original _hbase:meta_ and put in place a newly rebuilt one. Below is an example
of how to run the tool. It adds the `-details` flag so the tool dumps info on
the regions it has found in HDFS:
+```$
HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
./bin/hbase org.apache.hbase.hbck1.OfflineMetaRepair -details```
+
+Start the cluster up. It won’t come up fully. It will be stuck because the
_namespace_ table is not online and there is no assign procedure in the
procedure store for this contingency. The hbase master log will show this
state. Here is an example of what it will log:
+```2019-07-10 18:30:51,090 WARN [master/localhost:16000:becomeActiveMaster]
master.HMaster:
hbase:namespace,,1562808216225.725a0fe6c2c869d3d0a9ed82bfa80fa3. is NOT online;
state={725a0fe6c2c869d3d0a9ed82bfa80fa3 state=CLOSED, ts=1562808619952,
server=null}; ServerCrashProcedures=false. Master startup cannot progress, in
holding-pattern until region onlined.```
+
+To assign the namespace table region, you cannot use the shell. If you use the
shell, it will fail with a `PleaseHoldException` because the master is not yet
up (it is waiting for the namespace table to come online before it declares
itself ‘up’). You have to use the `HBCK2` _assigns_ command. To assign, you
will need the namespace encoded name. It shows in the log quoted above: i.e.
_725a0fe6c2c869d3d0a9ed82bfa80fa3_ in this case. You will also have to pass the
`-skip` option to ‘skip’ the master version check (without it, your `HBCK2`
invocation will also elicit the above `PleaseHoldException` because the master
is not yet up). Here is an example adding an assign of the namespace table:
+```$
HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
./bin/hbase org.apache.hbase.HBCK2 -skip assigns
725a0fe6c2c869d3d0a9ed82bfa80fa3```
+
+If the invocation comes back with ‘Connection refused’, is the Master up? The
Master will shut down after a while if it can’t initialize itself. Just restart
the cluster/master and rerun the above assigns command.
+
+When the assigns runs successfully, you’ll see it emit the likes of the
following. The ‘48’ on the end is the pid of the scheduled assign procedure. If
the pid returned is ‘-1’, then either the master startup has not progressed
sufficiently (retry) or the encoded region name is incorrect (check it).
+```$
HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
./bin/hbase org.apache.hbase.HBCK2 -skip assigns
725a0fe6c2c869d3d0a9ed82bfa80fa3
+18:40:43.817 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to
load native-hadoop library for your platform... using builtin-java classes
where applicable
+18:40:44.315 [main] INFO org.apache.hbase.HBCK2 - hbck support check skipped
+[48]```
+
+Check the master logs. The master should have come up. You’ll see successful
completion of pid=48. Look for a line like this to verify successful master
launch:
+```master.HMaster: Master has completed initialization 132.515sec``` It might
take a while to appear.
+
+The rebuild of _hbase:meta_ adds the user tables in _DISABLED_ state and the
regions in _CLOSED_ state. Re-enable tables via the shell to bring all table
regions back online.
+
Review comment:
Worth mentioning a handy _enable_all_ command example that brings all table
regions online at once?
`enable_all '.*'`
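
On a related note, the recipe asks the operator to copy the encoded region name out of the Master's "is NOT online" log line by hand. A minimal sketch of pulling it out with `sed` follows. The helper name `extract_encoded_region` is hypothetical and not part of HBCK2, and it assumes the log line keeps the format quoted in the diff, with a 32-character hex encoded name (so it does not apply to _hbase:meta_, whose encoded name is 1588230740).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of HBCK2: read Master log lines on stdin and
# print the encoded region name from any "is NOT online" line.
extract_encoded_region() {
  # Assumed line shape: <table>,<startkey>,<timestamp>.<encoded>. is NOT online; ...
  # The encoded name is the 32 hex characters between the last two dots
  # that precede " is NOT online".
  sed -n 's/.*\.\([0-9a-f]\{32\}\)\. is NOT online.*/\1/p'
}

# Sample line taken from the log quoted in the diff above.
line='2019-07-10 18:30:51,090 WARN [master/localhost:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1562808216225.725a0fe6c2c869d3d0a9ed82bfa80fa3. is NOT online; state={725a0fe6c2c869d3d0a9ed82bfa80fa3 state=CLOSED, ts=1562808619952, server=null}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.'

printf '%s\n' "$line" | extract_encoded_region
```

The printed name is what gets passed to the `-skip assigns` invocation shown in the recipe. Lines that do not match the assumed format produce no output, so an empty result means the name still has to be located manually.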