saintstack commented on a change in pull request #6: HBASE-22680 [HBCK2] 
OfflineMetaRepair for hbase2/hbck2
URL: https://github.com/apache/hbase-operator-tools/pull/6#discussion_r303969819
 
 

 ##########
 File path: hbase-hbck2/README.md
 ##########
 @@ -337,11 +334,53 @@ The Master is unable to continue startup because there 
is no Procedure to assign
 _hbase:meta_ (or _hbase:namespace_). To inject one, use the _HBCK2_ tool:
 
 ```
-HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase 
org.apache.hbase.HBCK2 assigns 1588230740
+HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase 
org.apache.hbase.HBCK2 -skip assigns 1588230740
 ```
 
-...where 1588230740 is the encoded name of the _hbase:meta_ Region.
+...where 1588230740 is the encoded name of the _hbase:meta_ Region. Pass the 
'-skip' option to
+stop HBCK2 from doing a version check against the remote master. If the remote 
master is not up,
+the version check will prompt a 'Master is initializing' response or a 
'PleaseHoldException'
+and drop the assign attempt. The '-skip' option punts on the version check and 
will land the
+scheduled assign.
 
 The same may happen to the _hbase:namespace_ system table. Look for the
 encoded Region name of the _hbase:namespace_ Region and do similar to
-what we did for _hbase:meta_.
+what we did for _hbase:meta_. In this latter case, the Master actually
+prints out a helpful message that looks like the following:
+
+```2019-07-09 22:08:38,966 WARN  [master/localhost:16000:becomeActiveMaster] 
master.HMaster: 
hbase:namespace,,1562733904278.9559cf72b8e81e1291c626a8e781a6ae. is NOT online; 
state={9559cf72b8e81e1291c626a8e781a6ae state=CLOSED, ts=1562735318897, 
server=null}; ServerCrashProcedures=true. Master startup cannot progress, in 
holding-pattern until region onlined.```
+
+To schedule an assign for the _hbase:namespace_ region noted in the above log 
line, you would do:
+```HBASE_CLASSPATH_PREFIX=./hbase-hbck2-1.0.0-SNAPSHOT.jar hbase 
org.apache.hbase.HBCK2 -skip assigns 9559cf72b8e81e1291c626a8e781a6ae```
+... passing the encoded name for the namespace region (the encoded name will 
differ per deploy).
+
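The encoded region name can be copied by hand from the log line above, but it is easy to mistype. As a minimal sketch, assuming the Master log line keeps the `is NOT online; state={<encoded name> state=...` shape quoted above, the name could be pulled out with `sed`. The function name `encoded_region_from_log` is hypothetical and is not part of HBCK2.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HBCK2): read Master log lines on stdin and
# print the 32-character encoded region name from any "is NOT online" line.
encoded_region_from_log() {
  # Capture the hex string that follows "state={" and precedes " state=".
  sed -n 's/.*is NOT online; state={\([0-9a-f]\{32\}\) state=.*/\1/p'
}

# Sample line copied from the README text above (joined onto one line).
line='2019-07-09 22:08:38,966 WARN  [master/localhost:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1562733904278.9559cf72b8e81e1291c626a8e781a6ae. is NOT online; state={9559cf72b8e81e1291c626a8e781a6ae state=CLOSED, ts=1562735318897, server=null}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.'

encoded=$(printf '%s\n' "$line" | encoded_region_from_log)
echo "$encoded"
```

The printed value is what would be passed as the last argument to `HBCK2 -skip assigns`. Lines that do not match the pattern produce no output, so an empty result means the log line was not found.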
+### hbase:meta region/table restore/rebuild
+
+Should a cluster suffer a catastrophic loss of the `hbase:meta` region, a 
rough rebuild is possible following the recipe below. In outline: stop the 
cluster; run the _OfflineMetaRepair_ tool, which reads directories and metadata 
dropped into the filesystem and makes a best effort at reconstructing a viable 
_hbase:meta_ table; restart your cluster; inject an assign to bring the system 
namespace table online; and then finally, re-assign userspace tables you'd like 
enabled (the rebuild creates an _hbase:meta_ table with all tables offline and 
no regions assigned).
+
+#### Detailed rebuild recipe
+Stop the cluster.
+
+Run the rebuild _hbase:meta_ command from _HBCK2_. This will move aside the 
original _hbase:meta_ and put in place a newly rebuilt one. Below is an example 
of how to run the tool. It adds the `-details` flag so the tool dumps info on 
the regions it has found in HDFS:
+```$ 
HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
 ./bin/hbase org.apache.hbase.hbck1.OfflineMetaRepair -details```
+
+Start the cluster up. It won’t come up fully. It will be stuck because the 
_namespace_ table is not online and there is no assign procedure in the 
procedure store for this contingency. The hbase master log will show this 
state. Here is an example of what it will log:
+```2019-07-10 18:30:51,090 WARN  [master/localhost:16000:becomeActiveMaster] 
master.HMaster: 
hbase:namespace,,1562808216225.725a0fe6c2c869d3d0a9ed82bfa80fa3. is NOT online; 
state={725a0fe6c2c869d3d0a9ed82bfa80fa3 state=CLOSED, ts=1562808619952, 
server=null}; ServerCrashProcedures=false. Master startup cannot progress, in 
holding-pattern until region onlined.```
+
+To assign the namespace table region, you cannot use the shell. If you use the 
shell, it will fail with a `PleaseHoldException` because the master is not yet 
up (it is waiting for the namespace table to come online before it declares 
itself ‘up’). You have to use the `HBCK2` _assigns_ command. To assign, you 
will need the namespace region's encoded name. It shows in the log quoted above: 
i.e. _725a0fe6c2c869d3d0a9ed82bfa80fa3_ in this case. You will also have to pass 
the `-skip` option to ‘skip’ the master version check (without it, your `HBCK2` 
invocation will also elicit the above `PleaseHoldException` because the master 
is not yet up). Here is an example adding an assign of the namespace table:
+```$ 
HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
 ./bin/hbase org.apache.hbase.HBCK2 -skip assigns 
725a0fe6c2c869d3d0a9ed82bfa80fa3```
+
+If the invocation comes back with ‘Connection refused’, is the Master up? The 
Master will shut down after a while if it can’t initialize itself. Just restart 
the cluster/master and rerun the above assigns command.
+
+When the assigns command runs successfully, you’ll see it emit the likes of the 
following. The ‘48’ on the end is the pid of the scheduled assign procedure. If 
the pid returned is ‘-1’, then either the master startup has not progressed 
sufficiently (retry) or the encoded region name is incorrect (check it).
+```
+$ 
HBASE_CLASSPATH_PREFIX=~/checkouts/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
 ./bin/hbase org.apache.hbase.HBCK2 -skip assigns 
725a0fe6c2c869d3d0a9ed82bfa80fa3
+18:40:43.817 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
+18:40:44.315 [main] INFO  org.apache.hbase.HBCK2 - hbck support check skipped
+[48]
+```
+
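The pid check described above can be scripted when the assign is run from automation. The following is a rough sketch, assuming the pids arrive as the last bracketed line of the output, as in the `[48]` sample; the comma-separated form for several pids (`[48, 49]`) is an assumption, and the helper names `assign_pids` and `assign_ok` are hypothetical, not part of HBCK2.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of HBCK2) for checking 'assigns' output.

# Read assigns output on stdin; print one pid per line, taken from the last
# line that consists solely of a bracketed list such as "[48]" or "[48, 49]".
assign_pids() {
  grep -E '^\[-?[0-9]+(, *-?[0-9]+)*\]$' | tail -n 1 | sed 's/[][ ]//g' | tr ',' '\n'
}

# Succeed only if at least one pid was printed and none of them is -1.
assign_ok() {
  local pids
  pids=$(assign_pids)
  [ -n "$pids" ] && ! printf '%s\n' "$pids" | grep -qx -- '-1'
}

# Sample output copied from the README text above.
sample='18:40:44.315 [main] INFO  org.apache.hbase.HBCK2 - hbck support check skipped
[48]'

if printf '%s\n' "$sample" | assign_ok; then
  echo "assign scheduled"
else
  echo "assign not scheduled; retry or check the encoded region name"
fi
```

In practice the stdin would come from the real `HBCK2 -skip assigns` invocation instead of the `sample` variable; a failing check corresponds to the ‘-1’ case described above.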
+Check the master logs. The master should have come up. You’ll see successful 
completion of pid=48. Look for a line like this to verify successful master 
launch:
+```master.HMaster: Master has completed initialization 132.515sec``` It might 
take a while to appear.
+
+The rebuild of _hbase:meta_ adds the user tables in _DISABLED_ state and the 
regions in _CLOSED_ state. Re-enable tables via the shell to bring all table 
regions back online.
+
 
 Review comment:
   Done.
