This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 091a41569bca643db12ed0b6fd202d1dda6eff2f
Author: timrobertson100 <[email protected]>
AuthorDate: Tue May 22 16:33:54 2018 +0200

    [BEAM-4361] Document usage of HBase TableSnapshotInputFormat
---
 src/documentation/io/built-in-hadoop.md | 67 +++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/src/documentation/io/built-in-hadoop.md b/src/documentation/io/built-in-hadoop.md
index 82fc47f..bcfa267 100644
--- a/src/documentation/io/built-in-hadoop.md
+++ b/src/documentation/io/built-in-hadoop.md
@@ -269,4 +269,71 @@ PCollection<Text, DynamoDBItemWritable> dynamoDBData =

```py
# The Beam SDK for Python does not support Hadoop InputFormat IO.
```

### Apache HBase - TableSnapshotInputFormat

To read data from an HBase table snapshot, use `org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat`.
Reading from a table snapshot bypasses the HBase region servers and instead reads HBase data files directly from the filesystem.
This is useful for cases such as reading historical data or offloading work from the HBase cluster.
In some scenarios this can be faster than accessing content through the region servers with `HBaseIO`.

A table snapshot can be taken using the HBase shell or programmatically:

```java
try (
    Connection connection = ConnectionFactory.createConnection(hbaseConf);
    Admin admin = connection.getAdmin()) {
  admin.snapshot(
      "my_snapshot",
      TableName.valueOf("my_table"),
      HBaseProtos.SnapshotDescription.Type.FLUSH);
}
```

```py
# The Beam SDK for Python does not support Hadoop InputFormat IO.
```

A `TableSnapshotInputFormat` is configured as follows:

```java
// Construct a typical HBase scan
Scan scan = new Scan();
scan.setCaching(1000);
scan.setBatch(1000);
scan.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("col_1"));
scan.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("col_2"));

Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "zk1:2181");
hbaseConf.set("hbase.rootdir", "/hbase");
hbaseConf.setClass(
    "mapreduce.job.inputformat.class", TableSnapshotInputFormat.class, InputFormat.class);
hbaseConf.setClass("key.class", ImmutableBytesWritable.class, Writable.class);
hbaseConf.setClass("value.class", Result.class, Writable.class);
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
hbaseConf.set(TableInputFormat.SCAN, Base64.encodeBytes(proto.toByteArray()));

// Make use of existing utility methods
Job job = Job.getInstance(hbaseConf); // creates an internal clone of hbaseConf
TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot_restore"));
hbaseConf = job.getConfiguration(); // extract the modified clone
```

```py
# The Beam SDK for Python does not support Hadoop InputFormat IO.
```

Call the Read transform as follows:

```java
PCollection<KV<ImmutableBytesWritable, Result>> hbaseSnapshotData =
    p.apply("read",
        HadoopInputFormatIO.<ImmutableBytesWritable, Result>read()
            .withConfiguration(hbaseConf));
```

```py
# The Beam SDK for Python does not support Hadoop InputFormat IO.
```
\ No newline at end of file

--
To stop receiving notification emails like this one, please contact [email protected].
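Editorial note: the documentation above mentions that a snapshot can be taken "using the HBase shell or programmatically" but only shows the programmatic route. For reference, the HBase shell equivalent (reusing the same illustrative table and snapshot names as the Java example) would be a session along these lines:

```
hbase> snapshot 'my_table', 'my_snapshot'
```

Existing snapshots can then be inspected with the `list_snapshots` shell command and removed with `delete_snapshot 'my_snapshot'`.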
