This is an automated email from the ASF dual-hosted git repository.

mmiller pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/accumulo-website.git


The following commit(s) were added to refs/heads/master by this push:
     new b9b8aea  Blog post to configure Accumulo with Azure Data Lake Gen2 Storage (#198)
b9b8aea is described below

commit b9b8aea71bf9b4ddbb697310a50be4768eb1d3bf
Author: Karthick Narendran <karthick.narend...@gmail.com>
AuthorDate: Thu Oct 17 15:55:00 2019 +0100

    Blog post to configure Accumulo with Azure Data Lake Gen2 Storage (#198)
---
 _posts/blog/2019-10-15-accumulo-adlsgen2-notes.md | 133 ++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/_posts/blog/2019-10-15-accumulo-adlsgen2-notes.md b/_posts/blog/2019-10-15-accumulo-adlsgen2-notes.md
new file mode 100644
index 0000000..03288aa
--- /dev/null
+++ b/_posts/blog/2019-10-15-accumulo-adlsgen2-notes.md
@@ -0,0 +1,133 @@
+---
+title: "Using Azure Data Lake Gen2 storage as a data store for Accumulo"
+author: Karthick Narendran
+---
+
+Accumulo can store its files in [Azure Data Lake Storage Gen2](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
+using the [ABFS (Azure Blob File System)](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver) driver.
+Similar to the [S3 blog post](https://accumulo.apache.org/blog/2019/09/10/accumulo-S3-notes.html),
+the write-ahead logs and Accumulo metadata can be stored in HDFS and everything else on Gen2 storage
+using the volume chooser feature introduced in Accumulo 2.0. The configurations referenced in this post
+are specific to Accumulo 2.0 and Hadoop 3.2.0.
+
+## Hadoop setup
+
+For the ABFS client to talk to Gen2 storage, it requires one of the authentication mechanisms listed
+[here](https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html#Authentication).
+This post covers [Azure Managed Identity](https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview),
+formerly known as Managed Service Identity or MSI. This feature provides Azure services with an
+automatically managed identity in [Azure AD](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis)
+and avoids the need for credentials or other sensitive information to be stored in code
+or configs/JCEKS. Plus, it comes free with Azure AD.
+
+At least the following should be added to Hadoop's `core-site.xml` on each node.
+
+```xml
+<property>
+  <name>fs.azure.account.auth.type</name>
+  <value>OAuth</value>
+</property>
+<property>
+  <name>fs.azure.account.oauth.provider.type</name>
+  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.msi.tenant</name>
+  <value>TenantID</value>
+</property>
+<property>
+  <name>fs.azure.account.oauth2.client.id</name>
+  <value>ClientID</value>
+</property>
+```
+ 
+See [ABFS doc](https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html)
+for more information on Hadoop Azure support.
+
+To get the `hadoop` command to work with ADLS Gen2, set the
+following entries in `hadoop-env.sh`. As Gen2 storage is TLS enabled by default,
+it is important to use the native OpenSSL implementation of TLS.
+
+```bash
+export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
+export HADOOP_OPTS="-Dorg.wildfly.openssl.path=<path/to/OpenSSL/libraries> ${HADOOP_OPTS}"
+```
+
+To verify the location of the OpenSSL libraries, run the `whereis libssl` command on the host.
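+
+The exact location varies by distribution, so the output below is only an
+illustrative example:
+
+```bash
+# Example only -- the actual path depends on your OS and OpenSSL package.
+$ whereis libssl
+libssl: /usr/lib/x86_64-linux-gnu/libssl.so.1.1
+```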
+
+## Accumulo setup
+
+For each node in the cluster, modify `accumulo-env.sh` to add the Azure storage jars to the
+classpath.  Your versions may differ depending on your Hadoop version; the
+following versions were included with Hadoop 3.2.0.
+
+```bash
+CLASSPATH="${conf}:${lib}/*:${HADOOP_CONF_DIR}:${ZOOKEEPER_HOME}/*:${HADOOP_HOME}/share/hadoop/client/*"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/azure-data-lake-store-sdk-2.2.9.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/azure-keyvault-core-1.0.0.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-azure-3.2.0.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/wildfly-openssl-1.0.4.Final.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jaxb-api-2.2.11.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/commons-lang3-3.7.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/httpclient-4.5.2.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar"
+CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar"
+export CLASSPATH
+```
+
+Adding `-Dorg.wildfly.openssl.path` to `JAVA_OPTS` in `accumulo-env.sh` was also tried, but it
+did not appear to work; this needs further investigation.
+
+Set the following in `accumulo.properties` and then run `accumulo init`, but don't start Accumulo.
+
+```ini
+instance.volumes=hdfs://<namenode>/accumulo
+```
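+
+With that single HDFS volume configured, the first initialization pass is just
+the plain init command (no extra flags at this stage):
+
+```bash
+# First init pass: Accumulo metadata lands on HDFS only.
+accumulo init
+```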
+
+After running `accumulo init`, we need to configure Accumulo to store its write-ahead logs in
+HDFS.  Set the following in `accumulo.properties`.
+
+```ini
+instance.volumes=hdfs://<namenode>/accumulo,abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo
+general.volume.chooser=org.apache.accumulo.server.fs.PreferredVolumeChooser
+general.custom.volume.preferred.default=abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo
+general.custom.volume.preferred.logger=hdfs://<namenode>/accumulo
+```
+
+Run `accumulo init --add-volumes` to initialize the Azure DLS Gen2 volume.  Doing this
+in two steps avoids putting any Accumulo metadata files in Gen2 during init.
+Copy `accumulo.properties` to all nodes and start Accumulo.
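+
+As a quick sanity check (using the same placeholders as above), the new volume
+should now contain the directory layout created by `--add-volumes`:
+
+```bash
+# Verify the Gen2 volume was initialized before starting Accumulo.
+hadoop fs -ls abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo
+```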
+
+Individual tables can be configured to store their files in HDFS by setting the
+table property `table.custom.volume.preferred`.  This should be set for the
+metadata table, in case it splits, using the following Accumulo shell command.
+
+```
+config -t accumulo.metadata -s table.custom.volume.preferred=hdfs://<namenode>/accumulo
+```
+
+## Accumulo example
+
+The following Accumulo shell session shows an example of writing data to Gen2 and
+reading it back.  It also shows scanning the metadata table to verify the data
+is stored in Gen2.
+
+```
+root@muchos> createtable gen2test
+root@muchos gen2test> insert r1 f1 q1 v1
+root@muchos gen2test> insert r1 f1 q2 v2
+root@muchos gen2test> flush -w
+2019-10-16 08:01:00,564 [shell.Shell] INFO : Flush of table gen2test completed.
+root@muchos gen2test> scan
+r1 f1:q1 []    v1
+r1 f1:q2 []    v2
+root@muchos gen2test> scan -t accumulo.metadata -c file
+4< file:abfss://<file_system>@<storage_account_name>.dfs.core.windows.net/accumulo/tables/4/default_tablet/F00000gj.rf []    234,2
+```
+
+These instructions will help configure Accumulo to use Azure Data Lake Gen2 Storage along with HDFS.
+With this setup, we were able to successfully run the continuous ingest test. Going forward,
+we'll experiment more in this space with ADLS Gen2 and add or update posts as we learn more.
+
+
