IMPALA-7788: [DOCS] Impala supports ADLS Gen 2 (ABFS)

Change-Id: Ic06d9ac92ed78b9092369e211de8a81db1d7ce90
Reviewed-on: http://gerrit.cloudera.org:8080/11853
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Joe McDonnell <[email protected]>
Reviewed-by: Jim Apple <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/030f0ac3
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/030f0ac3
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/030f0ac3

Branch: refs/heads/branch-3.1.0
Commit: 030f0ac303f044ad1661cc3601ca0cedc675aba0
Parents: 0f63b2c
Author: Alex Rodoni <[email protected]>
Authored: Thu Nov 1 16:55:27 2018 -0700
Committer: Zoltan Borok-Nagy <[email protected]>
Committed: Tue Nov 13 12:51:39 2018 +0100

----------------------------------------------------------------------
 docs/shared/impala_common.xml | 29 +++---
 docs/topics/impala_adls.xml | 179 ++++++++++++++++++++----------------------
 docs/topics/impala_insert.xml | 3 +-
 docs/topics/impala_load_data.xml | 13 ++-
 4 files changed, 127 insertions(+), 97 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/impala/blob/030f0ac3/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index 8b79596..f4aaedd 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -1297,17 +1297,21 @@ drop database temp;
 See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.
 </p>
- <p rev="2.9.0 IMPALA-5333" id="adls_dml">
- In <keyword keyref="impala29_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>,
- and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the
- Azure Data Lake Store (ADLS).
- The syntax of the DML statements is the same as for any other tables, because the ADLS location for tables and
- partitions is specified by an <codeph>adl://</codeph> prefix in the
- <codeph>LOCATION</codeph> attribute of
- <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements.
- If you bring data into ADLS using the normal ADLS transfer mechanisms instead of Impala DML statements,
- issue a <codeph>REFRESH</codeph> statement for the table before using Impala to query the ADLS data.
- </p>
+ <p rev="2.9.0 IMPALA-5333" id="adls_dml"> In <keyword
+ keyref="impala29_full"/> and higher, the Impala DML statements
+ (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and
+ <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table
+ or partition that resides in the Azure Data Lake Store (ADLS). ADLS Gen2
+ is supported in <keyword keyref="impala31"/> and higher.</p>
+ <p rev="2.9.0 IMPALA-5333">In the<codeph>CREATE TABLE</codeph> or
+ <codeph>ALTER TABLE</codeph> statements, specify the ADLS location for
+ tables and partitions with the <codeph>adl://</codeph> prefix for ADLS
+ Gen1 and <codeph>abfs://</codeph> or <codeph>abfss://</codeph> for ADLS
+ Gen2 in the <codeph>LOCATION</codeph> attribute.</p>
+ <p rev="2.9.0 IMPALA-5333" id="adls_dml_end">If you bring data into ADLS
+ using the normal ADLS transfer mechanisms instead of Impala DML
+ statements, issue a <codeph>REFRESH</codeph> statement for the table
+ before using Impala to query the ADLS data. </p>
 <p rev="2.6.0 IMPALA-1878" id="s3_dml">
 In <keyword keyref="impala26_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>,
@@ -1321,9 +1325,6 @@ drop database temp;
 issue a <codeph>REFRESH</codeph> statement for the table before using Impala to query the S3 data.
 </p>
- <!-- Formerly part of s3_dml element. Moved out to avoid a circular link in the S3 topic itelf.
-->
- <!-- See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. -->
-
 <p rev="2.2.0" id="s3_metadata">
 Impala caches metadata for tables where the data resides in the Amazon Simple Storage Service (S3), and the <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph>

http://git-wip-us.apache.org/repos/asf/impala/blob/030f0ac3/docs/topics/impala_adls.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_adls.xml b/docs/topics/impala_adls.xml
index 5d790c5..f5103f4 100644
--- a/docs/topics/impala_adls.xml
+++ b/docs/topics/impala_adls.xml
@@ -35,14 +35,12 @@ under the License.
 <conbody>
- <p>
- <indexterm audience="hidden">ADLS with Impala</indexterm>
- You can use Impala to query data residing on the Azure Data Lake Store (ADLS) filesystem.
- This capability allows convenient access to a storage system that is remotely managed,
- accessible from anywhere, and integrated with various cloud-based services. Impala can
- query files in any supported file format from ADLS. The ADLS storage location
- can be for an entire table, or individual partitions in a partitioned table.
- </p>
+ <p> You can use Impala to query data residing on the Azure Data Lake Store
+ (ADLS) filesystem. This capability allows convenient access to a storage
+ system that is remotely managed, accessible from anywhere, and integrated
+ with various cloud-based services. Impala can query files in any supported
+ file format from ADLS. The ADLS storage location can be for an entire
+ table, or individual partitions in a partitioned table. </p>
 <p>
 The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using
@@ -51,6 +49,8 @@ under the License.
 HDFS. In a partitioned table, you can set the <codeph>LOCATION</codeph> attribute for individual partitions to put some partitions on HDFS and others on ADLS, typically depending on the age of the data.
 </p>
+ <p>Starting in <keyword keyref="impala31"/>, Impala supports ADLS Gen2
+ filesystem, Azure Blob File System (ABFS).</p>
 <p outputclass="toc inpage"/>
@@ -70,6 +70,9 @@ under the License.
 <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal" scope="external" format="html">Get started with Azure Data Lake Store using the Azure Portal</xref>
 </p>
 </li>
+ <li><xref
+ href="https://docs.microsoft.com/en-us/azure/storage/data-lake-storage/quickstart-create-account"
+ format="html" scope="external">Azure Data Lake Storage Gen2</xref></li>
 <li>
 <p>
 <xref href="https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html" scope="external" format="html">Hadoop Azure Data Lake Support</xref>
@@ -82,27 +85,22 @@ under the License.
 <concept id="sql">
 <title>How Impala SQL Statements Work with ADLS</title>
 <conbody>
- <p>
- Impala SQL statements work with data on ADLS as follows:
- </p>
+ <p> Impala SQL statements work with data on ADLS as follows. </p>
 <ul>
- <li>
- <p>
- The <xref href="impala_create_table.xml#create_table"/>
- or <xref href="impala_alter_table.xml#alter_table"/> statements
- can specify that a table resides on the ADLS filesystem by
- encoding an <codeph>adl://</codeph> prefix for the <codeph>LOCATION</codeph>
- property. <codeph>ALTER TABLE</codeph> can also set the <codeph>LOCATION</codeph>
- property for an individual partition, so that some data in a table resides on
- ADLS and other data in the same table resides on HDFS.
- </p>
- <p>
- The full format of the location URI is typically:
-<codeblock>
-adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_directory_path</varname>
-</codeblock>
- </p>
- </li>
+ <li><p> The <xref href="impala_create_table.xml#create_table"/> or <xref
+ href="impala_alter_table.xml#alter_table"/> statements can specify
+ that a table resides on the ADLS filesystem by specifying an ADLS
+ prefix for the <codeph>LOCATION</codeph> property.<ul>
+ <li><codeph>adl://</codeph> for ADLS Gen1</li>
+ <li><codeph>abfs://</codeph> for ADLS Gen2</li>
+ <li><codeph>abfss://</codeph> for ADLS Gen2 with a secure socket
+ layer connection</li>
+ </ul>
+ <codeph>ALTER TABLE</codeph> can also set the
+ <codeph>LOCATION</codeph> property for an individual partition, so
+ that some data in a table resides on ADLS and other data in the same
+ table resides on HDFS. </p> See <xref href="impala_adls.xml#ddl"/>
+ for usage information.</li>
 <li>
 <p>
 Once a table or partition is designated as residing on ADLS, the <xref href="impala_select.xml#select"/>
@@ -135,10 +133,8 @@ adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_di
 </p>
 </li>
 </ul>
- <p>
- For usage information about Impala SQL statements with ADLS tables, see <xref href="impala_adls.xml#ddl"/>
- and <xref href="impala_adls.xml#dml"/>.
- </p>
+ <p> For usage information about Impala SQL statements with ADLS tables,
+ see <xref href="impala_adls.xml#dml"/>.
</p>
 </conbody>
 </concept>
@@ -148,30 +144,54 @@ adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_di
 <conbody>
- <p>
- To allow Impala to access data in ADLS, specify values for the following configuration settings in your
- <filepath>core-site.xml</filepath> file:
- </p>
+ <p> To allow Impala to access data in ADLS, specify values for the
+ following configuration settings in your
+ <filepath>core-site.xml</filepath> file.</p>
+ <p>For ADLS Gen1:</p>
+
+<codeblock><property>
+ <name>dfs.adls.oauth2.access.token.provider.type</name>
+ <value>ClientCredential</value>
+</property>
+<property>
+ <name>dfs.adls.oauth2.client.id</name>
+ <value><varname>your_client_id</varname></value>
+</property>
+<property>
+ <name>dfs.adls.oauth2.credential</name>
+ <value><varname>your_client_secret</varname></value>
+</property>
+<property>
+ <name>dfs.adls.oauth2.refresh.url</name>
+ <value>https://login.windows.net/<varname>your_azure_tenant_id</varname>/oauth2/token</value>
+</property>
-<codeblock><![CDATA[
-<property>
- <name>dfs.adls.oauth2.access.token.provider.type</name>
- <value>ClientCredential</value>
-</property>
-<property>
- <name>dfs.adls.oauth2.client.id</name>
- <value><varname>your_client_id</varname></value>
-</property>
-<property>
- <name>dfs.adls.oauth2.credential</name>
- <value><varname>your_client_secret</varname></value>
-</property>
-<property>
- <name>dfs.adls.oauth2.refresh.url</name>
- <value><varname>refresh_URL</varname></value>
-</property>
-]]>
 </codeblock>
+ <p>For ADLS Gen2:</p>
+ <codeblock> <property>
+ <name>fs.azure.account.auth.type</name>
+ <value>OAuth</value>
+ </property>
+
+ <property>
+ <name>fs.azure.account.oauth.provider.type</name>
+ <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
+ </property>
+
+ <property>
+ <name>fs.azure.account.oauth2.client.id</name>
+ <value><varname>your_client_id</varname></value>
+ </property>
+
+ <property>
+
<name>fs.azure.account.oauth2.client.secret</name>
+ <value><varname>your_client_secret</varname></value>
+ </property>
+
+ <property>
+ <name>fs.azure.account.oauth2.client.endpoint</name>
+ <value>https://login.microsoftonline.com/<varname>your_azure_tenant_id</varname>/oauth2/token</value>
+ </property></codeblock>
 <note>
 <p>
@@ -180,11 +200,10 @@ adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_di
 </p>
 </note>
- <p>
- After specifying the credentials, restart both the Impala and
- Hive services. (Restarting Hive is required because Impala queries, CREATE TABLE statements, and so on go
- through the Hive metastore.)
- </p>
+ <p> After specifying the credentials, restart both the Impala and Hive
+ services. Restarting Hive is required because certain Impala queries,
+ such as <codeph>CREATE TABLE</codeph> statements, go through the Hive
+ metastore.</p>
 </conbody>
@@ -213,7 +232,8 @@ adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_di
 <concept id="dml">
 <title>Using Impala DML Statements for ADLS Data</title>
 <conbody>
- <p conref="../shared/impala_common.xml#common/adls_dml"/>
+ <p conref="../shared/impala_common.xml#common/adls_dml"
+ conrefend="../shared/impala_common.xml#common/adls_dml_end"/>
 </conbody>
 </concept>
@@ -249,12 +269,24 @@ adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_di
 <conbody>
- <p>
- Impala reads data for a table or partition from ADLS based on the <codeph>LOCATION</codeph> attribute for the
- table or partition. Specify the ADLS details in the <codeph>LOCATION</codeph> clause of a <codeph>CREATE
- TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement. The notation for the <codeph>LOCATION</codeph>
- clause is <codeph>adl://<varname>store</varname>/<varname>path/to/file</varname></codeph>.
- </p>
+ <p> Impala reads data for a table or partition from ADLS based on the
+ <codeph>LOCATION</codeph> attribute for the table or partition.
+ Specify the ADLS details in the <codeph>LOCATION</codeph> clause of a
+ <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph>
+ statement. The syntax for the <codeph>LOCATION</codeph> clause is:<ul>
+ <li>For ADLS Gen1,
+ <codeph>adl://<varname>account</varname>.azuredatalakestore.net/<varname>path/file</varname></codeph>
+ </li>
+ <li>For ADLS Gen2,
+ <codeph>abfs://<varname>container</varname>@<varname>account</varname>.dfs.core.windows.net/<varname>path</varname>/<varname>file</varname></codeph></li>
+ <li>For ADLS Gen2 with a secure socket layer connection,
+ <codeph>abfss://<varname>container</varname>@<varname>account</varname>.dfs.core.windows.net/<varname>path</varname>/<varname>file</varname></codeph></li>
+ </ul></p>
+ <p><codeph><varname>container</varname></codeph> denotes the parent
+ location that holds the files and folders, which is the Containers in
+ the Azure Storage Blobs service.</p>
+ <p><codeph><varname>account</varname></codeph> is the name given for your
+ storage account.</p>
 <p>
 For a partitioned table, either specify a separate <codeph>LOCATION</codeph> clause for each new partition,
@@ -288,15 +320,12 @@ adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_di
 > location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1';
 </codeblock>
- <p>
- For convenience when working with multiple tables with data files stored in ADLS, you can create a database
- with a <codeph>LOCATION</codeph> attribute pointing to an ADLS path.
- Specify a URL of the form <codeph>adl://<varname>store</varname>/<varname>root/path/for/database</varname></codeph>
- for the <codeph>LOCATION</codeph> attribute of the database.
- Any tables created inside that database
- automatically create directories underneath the one specified by the database
- <codeph>LOCATION</codeph> attribute.
- </p>
+ <p> For convenience when working with multiple tables with data files
+ stored in ADLS, you can create a database with a
+ <codeph>LOCATION</codeph> attribute pointing to an ADLS path. Specify
+ a URL of the form as shown above. Any tables created inside that
+ database automatically create directories underneath the one specified
+ by the database <codeph>LOCATION</codeph> attribute. </p>
 <p>
 The following session creates a database and two partitioned tables residing entirely on ADLS, one

http://git-wip-us.apache.org/repos/asf/impala/blob/030f0ac3/docs/topics/impala_insert.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_insert.xml b/docs/topics/impala_insert.xml
index 7e6ce63..58b5169 100644
--- a/docs/topics/impala_insert.xml
+++ b/docs/topics/impala_insert.xml
@@ -629,7 +629,8 @@ Inserted 2 rows in 0.16s
 <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p>
 <p conref="../shared/impala_common.xml#common/adls_blurb"/>
- <p conref="../shared/impala_common.xml#common/adls_dml"/>
+ <p conref="../shared/impala_common.xml#common/adls_dml"
+ conrefend="../shared/impala_common.xml#common/adls_dml_end"/>
 <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p>
 <p conref="../shared/impala_common.xml#common/security_blurb"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/030f0ac3/docs/topics/impala_load_data.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_load_data.xml b/docs/topics/impala_load_data.xml
index 96305a5..f947534 100644
--- a/docs/topics/impala_load_data.xml
+++ b/docs/topics/impala_load_data.xml
@@ -39,12 +39,10 @@ under the License.
 <conbody>
- <p>
- <indexterm audience="hidden">LOAD DATA statement</indexterm>
- The <codeph>LOAD DATA</codeph> statement streamlines the ETL process for an internal Impala table by moving a
- data file or all the data files in a directory from an HDFS location into the Impala data directory for that
- table.
- </p>
+ <p> The <codeph>LOAD DATA</codeph> statement streamlines the ETL process for
+ an internal Impala table by moving a data file or all the data files in a
+ directory from an HDFS location into the Impala data directory for that
+ table. </p>
 <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
@@ -240,7 +238,8 @@ Returned 1 row(s) in 0.62s</codeblock>
 <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p>
 <p conref="../shared/impala_common.xml#common/adls_blurb"/>
- <p conref="../shared/impala_common.xml#common/adls_dml"/>
+ <p conref="../shared/impala_common.xml#common/adls_dml"
+ conrefend="../shared/impala_common.xml#common/adls_dml_end"/>
 <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p>
 <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
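
Reviewer note (not part of the commit): the three LOCATION forms documented in the impala_adls.xml hunk can be sketched as a small Python helper. This is only an illustration under stated assumptions. The names `adls_location` and `create_table_sql`, and the sample values `myaccount` and `mycontainer`, are hypothetical and do not come from Impala or from this change. Only the URI shapes are taken from the diff.

```python
def adls_location(account, path, container=None, secure=False):
    """Build a LOCATION URI in one of the forms listed in impala_adls.xml.

    Hypothetical helper for illustration only; not part of Impala.
    Assumption: no container means ADLS Gen1, a container means ADLS Gen2.
    """
    path = path.lstrip("/")
    if container is None:
        if secure:
            # abfss:// is a Gen2-only scheme in the documented syntax.
            raise ValueError("secure=True applies only to ADLS Gen2 (abfss://)")
        # ADLS Gen1: adl://account.azuredatalakestore.net/path/file
        return f"adl://{account}.azuredatalakestore.net/{path}"
    # ADLS Gen2: abfs://container@account.dfs.core.windows.net/path/file
    # (abfss:// when a secure socket layer connection is wanted).
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path}"


def create_table_sql(table, columns, location):
    """Render a CREATE TABLE statement whose LOCATION points at ADLS."""
    return f"CREATE TABLE {table} ({columns}) LOCATION '{location}';"


if __name__ == "__main__":
    # Gen1 example reusing the account name that appears in the diff context.
    print(create_table_sql(
        "t1", "x INT",
        adls_location("impalademo", "dir1/dir2/dir3/t1")))
    # Gen2 example with hypothetical container and account names.
    print(create_table_sql(
        "t2", "x INT",
        adls_location("myaccount", "dir1/t2",
                      container="mycontainer", secure=True)))
```

The helper only formats strings. Whether a given location is reachable still depends on the `core-site.xml` credentials shown in the diff.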
