This is an automated email from the ASF dual-hosted git repository.

wchevreuil pushed a commit to branch branch-1
in repository https://gitbox.apache.org/repos/asf/hbase.git


The following commit(s) were added to refs/heads/branch-1 by this push:
     new 080cf92  HBASE-22556 [DOCS] Backport HBASE-15557 to branch-1 and 
branch-2
080cf92 is described below

commit 080cf92a82e32217f89ad86d4e4b12342b5b4343
Author: Wellington Chevreuil <[email protected]>
AuthorDate: Fri Jun 14 14:11:57 2019 +0100

    HBASE-22556 [DOCS] Backport HBASE-15557 to branch-1 and branch-2
    
    Signed-off-by: Andrew Purtell <[email protected]>
---
 src/main/asciidoc/_chapters/ops_mgt.adoc | 125 ++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 3 deletions(-)

diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc 
b/src/main/asciidoc/_chapters/ops_mgt.adoc
index 722357e..1bb8946 100644
--- a/src/main/asciidoc/_chapters/ops_mgt.adoc
+++ b/src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -436,6 +436,125 @@ By default, CopyTable utility only copies the latest 
version of row cells unless
 See Jonathan Hsieh's 
link:http://www.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online
           HBase Backups with CopyTable] blog post for more on `CopyTable`.
 
+[[hashtable.synctable]]
+=== HashTable/SyncTable
+
+HashTable/SyncTable is a two-step tool for synchronizing table data, where each step is implemented as a MapReduce job.
+Like CopyTable, it can be used for partial or entire table data syncing, within the same cluster or between remote clusters.
+However, it performs the sync in a more efficient way than CopyTable. Instead of copying all cells
+in the specified row key/time period range, HashTable (the first step) creates hashed indexes for batches of cells on the source table and outputs those as results.
+In the next stage, SyncTable scans the target table and calculates hash indexes for its cells,
+compares these hashes with the outputs of HashTable, then scans (and compares) cells only for the diverging hashes, updating just the
+mismatching cells. This results in less network traffic/data transfer, which can matter when syncing large tables on remote clusters.
+
+==== Step 1, HashTable
+
+First, run HashTable on the cluster hosting the source table (the table whose state will be copied to its counterpart).
+
+Usage:
+
+----
+$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --help
+Usage: HashTable [options] <tablename> <outputpath>
+
+Options:
+ batchsize     the target amount of bytes to hash in each batch
+               rows are added to the batch until this size is reached
+               (defaults to 8000 bytes)
+ numhashfiles  the number of hash files to create
+               if set to fewer than number of regions then
+               the job will create this number of reducers
+               (defaults to 1/100 of regions -- at least 1)
+ startrow      the start row
+ stoprow       the stop row
+ starttime     beginning of the time range (unixtime in millis)
+               without endtime means from starttime to forever
+ endtime       end of the time range.  Ignored if no starttime specified.
+ scanbatch     scanner batch size to support intra row scans
+ versions      number of cell versions to include
+ families      comma-separated list of families to include
+
+Args:
+ tablename     Name of the table to hash
+ outputpath    Filesystem path to put the output data
+
+Examples:
+ To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
+ $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 
--numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 
--families=cf2,cf3 TestTable /hashes/testTable
+----
+
+The *batchsize* property defines how much cell data for a given region is hashed together into a single hash value.
+Sizing it properly has a direct impact on sync efficiency, as it may lead to fewer scans executed by the mapper tasks
+of SyncTable (the next step in the process). The rule of thumb is: the smaller the number of cells out of sync
+(the lower the probability of finding a diff), the larger the batch size that can be used.
+
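+For example, if the tables are expected to be mostly in sync, a batch size larger than the 8000-byte default could be tried (the table name and output path below are illustrative only):
+
+----
+$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=64000 TestTable /hashes/testTable
+----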
+==== Step 2, SyncTable
+
+Once HashTable has completed on the source cluster, SyncTable can be run on the target cluster.
+Just like replication and other synchronization jobs, it requires that all
+RegionServers/DataNodes on the source cluster be accessible by NodeManagers on the target cluster (where the SyncTable job tasks will be running).
+
+Usage:
+
+----
+$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help
+Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
+
+Options:
+ sourcezkcluster  ZK cluster key of the source table
+                  (defaults to cluster in classpath's config)
+ targetzkcluster  ZK cluster key of the target table
+                  (defaults to cluster in classpath's config)
+ dryrun           if true, output counters but no writes
+                  (defaults to false)
+ doDeletes        if false, does not perform deletes
+                  (defaults to true)
+ doPuts           if false, does not perform puts
+                  (defaults to true)
+
+Args:
+ sourcehashdir    path to HashTable output dir for source table
+                  (see org.apache.hadoop.hbase.mapreduce.HashTable)
+ sourcetable      Name of the source table to sync from
+ targettable      Name of the target table to sync to
+
+Examples:
+ For a dry run SyncTable of tableA from a remote source cluster
+ to a local target cluster:
+ $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true 
--sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase 
hdfs://nn:9000/hashes/tableA tableA tableA
+----
+
+The *dryrun* option is useful when a read-only diff report is wanted: it produces
+only COUNTERS indicating the differences, but performs no actual changes. It can be used as an alternative to the VerifyReplication tool.
+
+By default, SyncTable makes the target table an exact copy of the source table (at least, for the specified startrow/stoprow and/or starttime/endtime).
+
+Setting doDeletes to false changes the default behaviour so that target cells that are missing on the source are not deleted.
+Similarly, setting doPuts to false prevents cells missing on the target from being added. Setting both doDeletes
+and doPuts to false has the same effect as setting dryrun to true.
+
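+For example, a sync that adds missing cells on the target but never deletes target cells (the ZooKeeper quorum, paths, and table names below are illustrative only) could look like:
+
+----
+$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false
+--sourcezkcluster=zk1.example.com:2181:/hbase
+hdfs://nn:9000/hashes/tableA tableA tableA
+----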
+.Set doDeletes to false on Two-Way Replication scenarios
+[NOTE]
+====
+In two-way replication, or in other scenarios where both source and target clusters can have data ingested, it's advisable to always set the doDeletes option
+to false, as any cell inserted on the SyncTable target cluster that has not yet been
+replicated to the source would otherwise be deleted, and potentially lost permanently.
+====
+
+.Set sourcezkcluster to the actual source cluster ZK quorum
+[NOTE]
+====
+Although not required, if sourcezkcluster is not set, SyncTable will connect
+to the local HBase cluster for both source and target,
+which does not produce any meaningful result.
+====
+
+.Remote Clusters on different Kerberos Realms
+[NOTE]
+====
+Currently, SyncTable can't be run against remote clusters on different Kerberos
+realms.
+There is work in progress to resolve this; see
+link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586].
+====
+
+[[export]]
 === Export
 
 Export is a utility that will dump the contents of table to HDFS in a sequence 
file.
@@ -1344,13 +1463,13 @@ list_peers:: list all replication relationships known 
by this cluster
 enable_peer <ID>::
   Enable a previously-disabled replication relationship
 disable_peer <ID>::
-  Disable a replication relationship. HBase will no longer send edits to that 
peer cluster, but it still keeps track of all the new WALs that it will need to 
replicate if and when it is re-enabled. 
+  Disable a replication relationship. HBase will no longer send edits to that 
peer cluster, but it still keeps track of all the new WALs that it will need to 
replicate if and when it is re-enabled.
 remove_peer <ID>::
   Disable and remove a replication relationship. HBase will no longer send 
edits to that peer cluster or keep track of WALs.
 enable_table_replication <TABLE_NAME>::
-  Enable the table replication switch for all it's column families. If the 
table is not found in the destination cluster then it will create one with the 
same name and column families. 
+  Enable the table replication switch for all its column families. If the
+table is not found in the destination cluster then one will be created with the
+same name and column families.
 disable_table_replication <TABLE_NAME>::
-  Disable the table replication switch for all it's column families. 
+  Disable the table replication switch for all its column families.
 
 === Verifying Replicated Data
 
