Repository: incubator-impala Updated Branches: refs/heads/master 07d3cea1f -> 717dd73d7
IMPALA-5333: [DOCS] Document Impala ADLS support Change-Id: Id5a98217741e5d540d9874e9b30e36f01644ef14 Reviewed-on: http://gerrit.cloudera.org:8080/7175 Reviewed-by: Sailesh Mukil <[email protected]> Reviewed-by: Laurel Hale <[email protected]> Reviewed-by: John Russell <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/717dd73d Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/717dd73d Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/717dd73d Branch: refs/heads/master Commit: 717dd73d78c52ff372a0faf1af1b8c40b51101ad Parents: 07d3cea Author: John Russell <[email protected]> Authored: Tue Jun 13 13:39:09 2017 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Mon Jul 10 17:21:39 2017 +0000 ---------------------------------------------------------------------- docs/impala.ditamap | 1 + docs/shared/impala_common.xml | 38 ++ docs/topics/impala_adls.xml | 669 ++++++++++++++++++++++++++ docs/topics/impala_insert.xml | 4 + docs/topics/impala_load_data.xml | 4 + docs/topics/impala_parquet_file_size.xml | 2 + 6 files changed, 718 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/impala.ditamap ---------------------------------------------------------------------- diff --git a/docs/impala.ditamap b/docs/impala.ditamap index 3985dcf..574602a 100644 --- a/docs/impala.ditamap +++ b/docs/impala.ditamap @@ -288,6 +288,7 @@ under the License. <topicref href="topics/impala_kudu.xml"/> <topicref href="topics/impala_hbase.xml"/> <topicref href="topics/impala_s3.xml"/> + <topicref rev="2.9.0" href="topics/impala_adls.xml"/> <topicref href="topics/impala_isilon.xml"/> <topicref href="topics/impala_logging.xml"/> <topicref href="topics/impala_troubleshooting.xml"> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/shared/impala_common.xml ---------------------------------------------------------------------- diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml index 8a10c9f..6e65c40 100644 --- a/docs/shared/impala_common.xml +++ b/docs/shared/impala_common.xml @@ -1069,6 +1069,13 @@ drop database temp; <codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive. </p> + <p rev="2.9.0 IMPALA-5333" id="adls_dml_performance"> + <draft-comment> + Currently nothing to say on this subject. Leaving this placeholder + in case there are DML performance implications to discuss in future. + </draft-comment> + </p> + <p rev="2.6.0 IMPALA-1878" id="s3_dml_performance"> Because of differences between S3 and traditional filesystems, DML operations for S3 tables can take longer than for tables on HDFS. For example, both the @@ -1085,6 +1092,14 @@ drop database temp; See <xref href="../topics/impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details. </p> + <p id="adls_block_splitting" rev="IMPALA-5383"> + Because ADLS does not expose the block sizes of data files the way HDFS does, + Impala <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph> statements + use the <codeph>PARQUET_FILE_SIZE</codeph> query option setting to define the size of + Parquet data files. (Using a large block size is more important for Parquet tables than + for tables that use other file formats.)
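+ For example, a session along the following lines (the table names here are
+ illustrative, not from a tested example) sets a 256 MB target file size before
+ copying data into a Parquet table on ADLS:
+<codeblock>-- adls_parquet_table is a hypothetical table whose LOCATION uses an adl:// URL.
+SET PARQUET_FILE_SIZE=256m;
+INSERT OVERWRITE adls_parquet_table SELECT * FROM text_table;</codeblock>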
+ </p> + <p rev="2.6.0 IMPALA-3453" id="s3_block_splitting"> In <keyword keyref="impala26_full"/> and higher, Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, RCFile, SequenceFile, @@ -1100,6 +1115,13 @@ drop database temp; to 268435456 (256 MB) to match the row group size produced by Impala. </p> + <note rev="2.9.0 IMPALA-5333" id="adls_production" type="important"> + <p> + Currently, the ADLS support in Impala is preliminary and not + fully tested. Do not use Impala with ADLS in a production environment. + </p> + </note> + <note rev="2.6.0 IMPALA-1878" id="s3_production" type="important"> <p> In <keyword keyref="impala26_full"/> and higher, Impala supports both queries (<codeph>SELECT</codeph>) @@ -1126,6 +1148,18 @@ drop database temp; See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. </p> + <p rev="2.9.0 IMPALA-5333" id="adls_dml"> + In <keyword keyref="impala29_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, + and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the + Azure Data Lake Store (ADLS). + The syntax of the DML statements is the same as for any other tables, because the ADLS location for tables and + partitions is specified by an <codeph>adl://</codeph> prefix in the + <codeph>LOCATION</codeph> attribute of + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements. + If you bring data into ADLS using the normal ADLS transfer mechanisms instead of Impala DML statements, + issue a <codeph>REFRESH</codeph> statement for the table before using Impala to query the ADLS data. + </p> + <p rev="2.6.0 IMPALA-1878" id="s3_dml"> In <keyword keyref="impala26_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the @@ -2392,6 +2426,10 @@ flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301 <b>Amazon S3 considerations:</b> </p> + <p id="adls_blurb" rev="2.9.0"> + <b>ADLS considerations:</b> + </p> + <p id="isilon_blurb" rev="2.2.3"> <b>Isilon considerations:</b> </p> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_adls.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_adls.xml b/docs/topics/impala_adls.xml new file mode 100644 index 0000000..8723e23 --- /dev/null +++ b/docs/topics/impala_adls.xml @@ -0,0 +1,669 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="adls" rev="2.9.0"> + + <title>Using Impala with the Azure Data Lake Store (ADLS)</title> + <titlealts audience="PDF"><navtitle>ADLS Tables</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="ADLS"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Preview Features"/> + </metadata> + </prolog> + + <conbody> + + <note conref="../shared/impala_common.xml#common/adls_production"/> + + <p> + <indexterm audience="hidden">ADLS with Impala</indexterm> + You can use Impala to query data residing on the Azure Data Lake Store (ADLS) filesystem. + This capability allows convenient access to a storage system that is remotely managed, + accessible from anywhere, and integrated with various cloud-based services. Impala can + query files in any supported file format from ADLS. The ADLS storage location + can be for an entire table, or individual partitions in a partitioned table. + </p> + + <p> + By default, Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using + full-table scans. In contrast, queries against ADLS data are less performant, making ADLS suitable for holding + <q>cold</q> data that is only queried occasionally, while more frequently accessed <q>hot</q> data resides in + HDFS. In a partitioned table, you can set the <codeph>LOCATION</codeph> attribute for individual partitions + to put some partitions on HDFS and others on ADLS, typically depending on the age of the data. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="prereqs"> + <title>Prerequisites</title> + <conbody> + <p> + These procedures presume that you have already set up an Azure account, + configured an ADLS store, and configured your Hadoop cluster with appropriate + credentials to be able to access ADLS. See the following resources for information: + </p> + <ul> + <li> + <p> + <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal" scope="external" format="html">Get started with Azure Data Lake Store using the Azure Portal</xref> + </p> + </li> + <li> + <p> + <xref href="https://hadoop.apache.org/docs/current2/hadoop-azure-datalake/index.html" scope="external" format="html">Hadoop Azure Data Lake Support</xref> + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="sql"> + <title>How Impala SQL Statements Work with ADLS</title> + <conbody> + <p> + Impala SQL statements work with data on ADLS as follows: + </p> + <ul> + <li> + <p> + The <xref href="impala_create_table.xml#create_table"/> + or <xref href="impala_alter_table.xml#alter_table"/> statements + can specify that a table resides on the ADLS filesystem by + encoding an <codeph>adl://</codeph> prefix for the <codeph>LOCATION</codeph> + property. <codeph>ALTER TABLE</codeph> can also set the <codeph>LOCATION</codeph> + property for an individual partition, so that some data in a table resides on + ADLS and other data in the same table resides on HDFS.
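+ For example, an <codeph>ALTER TABLE</codeph> statement along these lines
+ (the table, partition, and store names are placeholders) points one partition at ADLS:
+<codeblock>-- sales and yourstore are hypothetical names, not from a tested example.
+ALTER TABLE sales PARTITION (year=2017)
+  SET LOCATION 'adl://yourstore.azuredatalakestore.net/sales/year=2017';</codeblock>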
+ </p> + <p> + The full format of the location URI is typically: +<codeblock> +adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_directory_path</varname> +</codeblock> + </p> + </li> + <li> + <p> + Once a table or partition is designated as residing on ADLS, the <xref href="impala_select.xml#select"/> + statement transparently accesses the data files from the appropriate storage layer. + </p> + </li> + <li> + <p> + If the ADLS table is an internal table, the <xref href="impala_drop_table.xml#drop_table"/> statement + removes the corresponding data files from ADLS when the table is dropped. + </p> + </li> + <li> + <p> + The <xref href="impala_truncate_table.xml#truncate_table"/> statement always removes the corresponding + data files from ADLS when the table is truncated. + </p> + </li> + <li> + <p> + The <xref href="impala_load_data.xml#load_data"/> statement can move data files residing in HDFS into + an ADLS table. + </p> + </li> + <li> + <p> + The <xref href="impala_insert.xml#insert"/> statement, or the <codeph>CREATE TABLE AS SELECT</codeph> + form of the <codeph>CREATE TABLE</codeph> statement, can copy data from an HDFS table or another ADLS + table into an ADLS table. + </p> + </li> + </ul> + <p> + For usage information about Impala SQL statements with ADLS tables, see <xref href="impala_adls.xml#ddl"/> + and <xref href="impala_adls.xml#dml"/>. + </p> + </conbody> + </concept> + + <concept id="creds"> + + <title>Specifying Impala Credentials to Access Data in ADLS</title> + + <conbody> + + <p> + To allow Impala to access data in ADLS, specify values for the following configuration settings in your + <filepath>core-site.xml</filepath> file: + </p> + +<codeblock><![CDATA[ +<property> + <name>dfs.adls.oauth2.access.token.provider.type</name> + <value>ClientCredential</value> +</property> +<property> + <name>dfs.adls.oauth2.client.id</name> + <value><varname>your_client_id</varname></value> +</property> +<property> + <name>dfs.adls.oauth2.credential</name> + <value><varname>your_client_secret</varname></value> +</property> +<property> + <name>dfs.adls.oauth2.refresh.url</name> + <value><varname>refresh_URL</varname></value> +</property> +]]> +</codeblock> + + <note> + <p> + Check if your Hadoop distribution or cluster management tool includes support for + filling in and distributing credentials across the cluster in an automated way. + </p> + </note> + + <p> + After specifying the credentials, restart both the Impala and + Hive services. (Restarting Hive is required because Impala queries, <codeph>CREATE TABLE</codeph> statements, and so on go + through the Hive metastore.) + </p> + + </conbody> + + </concept> + + <concept id="etl"> + + <title>Loading Data into ADLS for Impala Queries</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + If your ETL pipeline involves moving data into ADLS and then querying through Impala, + you can either use Impala DML statements to create, move, or copy the data, or + use the same data loading techniques as you would for non-Impala data.
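+ For example, a hypothetical pipeline might stage data files in an HDFS directory,
+ then move them into an ADLS-backed table with a single statement (the path and
+ table name are placeholders):
+<codeblock>-- /staging/sales and sales_adls are hypothetical names.
+LOAD DATA INPATH '/staging/sales' INTO TABLE sales_adls;</codeblock>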
+ </p> + + </conbody> + + <concept id="dml"> + <title>Using Impala DML Statements for ADLS Data</title> + <conbody> + <p conref="../shared/impala_common.xml#common/adls_dml"/> + </conbody> + </concept> + + <concept id="manual_etl"> + <title>Manually Loading Data into Impala Tables on ADLS</title> + <conbody> + <p> + As an alternative, you can use the Microsoft-provided methods to bring data files + into ADLS for querying through Impala. See + <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob" scope="external" format="html">the Microsoft ADLS documentation</xref> + for details. + </p> + + <p> + After you upload data files to a location already mapped to an Impala table or partition, or if you delete + files in ADLS from such a location, issue the <codeph>REFRESH <varname>table_name</varname></codeph> + statement to make Impala aware of the new set of data files. + </p> + + </conbody> + </concept> + + </concept> + + <concept id="ddl"> + + <title>Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</title> + <prolog> + <metadata> + <data name="Category" value="Databases"/> + </metadata> + </prolog> + + <conbody> + + <p> + Impala reads data for a table or partition from ADLS based on the <codeph>LOCATION</codeph> attribute for the + table or partition. Specify the ADLS details in the <codeph>LOCATION</codeph> clause of a <codeph>CREATE + TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement. The notation for the <codeph>LOCATION</codeph> + clause is <codeph>adl://<varname>store</varname>/<varname>path/to/file</varname></codeph>. + </p> + + <p> + For a partitioned table, either specify a separate <codeph>LOCATION</codeph> clause for each new partition, + or specify a base <codeph>LOCATION</codeph> for the table and set up a directory structure in ADLS to mirror + the way Impala partitioned tables are structured in HDFS. Because ADLS provides a hierarchical + filesystem namespace, Impala treats ADLS pathnames that include <codeph>/</codeph> characters the same as HDFS + pathnames that include directories. + </p> + + <p> + To point a nonpartitioned table or an individual partition at ADLS, specify a single directory + path in ADLS, which could be any arbitrary directory. To replicate the structure of an entire Impala + partitioned table or database in ADLS requires more care, with directories and subdirectories nested and + named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if + necessary in HDFS, and recording the complete directory structure so that you can replicate it in ADLS. + </p> + + <p> + For example, the following session creates a partitioned table where only a single partition resides on ADLS. + The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a + <codeph>LOCATION</codeph> attribute with an <codeph>adl://</codeph> URL, and so refers to data residing on + ADLS, under a specific path underneath the store <codeph>impalademo</codeph>.
+ </p> + +<codeblock>[localhost:21000] > create database db_on_hdfs; +[localhost:21000] > use db_on_hdfs; +[localhost:21000] > create table mostly_on_hdfs (x int) partitioned by (year int); +[localhost:21000] > alter table mostly_on_hdfs add partition (year=2013); +[localhost:21000] > alter table mostly_on_hdfs add partition (year=2014); +[localhost:21000] > alter table mostly_on_hdfs add partition (year=2015) + > location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1'; +</codeblock> + + <p> + For convenience when working with multiple tables with data files stored in ADLS, you can create a database + with a <codeph>LOCATION</codeph> attribute pointing to an ADLS path. + Specify a URL of the form <codeph>adl://<varname>store</varname>/<varname>root/path/for/database</varname></codeph> + for the <codeph>LOCATION</codeph> attribute of the database. + Any tables created inside that database + automatically create directories underneath the one specified by the database + <codeph>LOCATION</codeph> attribute. + </p> + + <p> + The following session creates a database and two partitioned tables residing entirely on ADLS, one + partitioned by a single column and the other partitioned by multiple columns. Because a + <codeph>LOCATION</codeph> attribute with an <codeph>adl://</codeph> URL is specified for the database, the + tables inside that database are automatically created on ADLS underneath the database directory. To see the + names of the associated subdirectories, including the partition key values, we use an ADLS client tool to + examine how the directory structure is organized on ADLS. For example, Impala partition directories such as + <codeph>month=1</codeph> do not include leading zeroes, which sometimes appear in partition directories created + through Hive. + </p> + +<codeblock>[localhost:21000] > create database db_on_adls location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3'; +[localhost:21000] > use db_on_adls; + +[localhost:21000] > create table partitioned_on_adls (x int) partitioned by (year int); +[localhost:21000] > alter table partitioned_on_adls add partition (year=2013); +[localhost:21000] > alter table partitioned_on_adls add partition (year=2014); +[localhost:21000] > alter table partitioned_on_adls add partition (year=2015); + +[localhost:21000] > ! hadoop fs -ls -R adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3; +2015-03-17 13:56:34 0 dir1/dir2/dir3/ +2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/ +2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/ +2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/ +2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/ + +[localhost:21000] > create table partitioned_multiple_keys (x int) + > partitioned by (year smallint, month tinyint, day tinyint); +[localhost:21000] > alter table partitioned_multiple_keys + > add partition (year=2015,month=1,day=1); +[localhost:21000] > alter table partitioned_multiple_keys + > add partition (year=2015,month=1,day=31); +[localhost:21000] > alter table partitioned_multiple_keys + > add partition (year=2015,month=2,day=28); + +[localhost:21000] > ! hadoop fs -ls -R adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3; +2015-03-17 13:56:34 0 dir1/dir2/dir3/ +2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/ +2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/ +2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/ +2015-03-17 16:47:57 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=2/day=28/ +2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/ +2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/ +2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/ +2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/ +</codeblock> + + <p> + The <codeph>CREATE DATABASE</codeph> and <codeph>CREATE TABLE</codeph> statements create the associated + directory paths if they do not already exist. You can specify multiple levels of directories, and the + <codeph>CREATE</codeph> statement creates all appropriate levels, similar to using <codeph>mkdir + -p</codeph>. + </p> + + <p> + Use the standard ADLS file upload methods to actually put the data files into the right locations. You can + also put the directory paths and data files in place before creating the associated Impala databases or + tables, and Impala automatically uses the data from the appropriate location after the associated databases + and tables are created. + </p> + + <p> + You can switch whether an existing table or partition points to data in HDFS or ADLS. For example, if you + have an Impala table or partition pointing to data files in HDFS or ADLS, and you later transfer those data + files to the other filesystem, use an <codeph>ALTER TABLE</codeph> statement to adjust the + <codeph>LOCATION</codeph> attribute of the corresponding table or partition to reflect that change. Because + Impala does not have an <codeph>ALTER DATABASE</codeph> statement, this location-switching technique is not + practical for entire databases that have a custom <codeph>LOCATION</codeph> attribute. + </p> + + </conbody> + + </concept> + + <concept id="internal_external"> + + <title>Internal and External Tables Located on ADLS</title> + + <conbody> + + <p> + Just as with tables located on HDFS storage, you can designate ADLS-based tables as either internal (managed + by Impala) or external, by using the syntax <codeph>CREATE TABLE</codeph> or <codeph>CREATE EXTERNAL + TABLE</codeph> respectively. When you drop an internal table, the files associated with the table are + removed, even if they are on ADLS storage. When you drop an external table, the files associated with the + table are left alone, and are still available for access by other tools or components. See + <xref href="impala_tables.xml#tables"/> for details. + </p> + + <p> + If the data on ADLS is intended to be long-lived and accessed by other tools in addition to Impala, create + any associated ADLS tables with the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, so that the files are not + deleted from ADLS when the table is dropped. + </p> + + <p> + If the data on ADLS is only needed for querying by Impala and can be safely discarded once the Impala + workflow is complete, create the associated ADLS tables using the <codeph>CREATE TABLE</codeph> syntax, so + that dropping the table also deletes the corresponding data files on ADLS.
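+ For reference, a minimal sketch of the external variant (the column definitions,
+ table name, and store name are placeholders, not a tested example):
+<codeblock>-- logs_adls and yourstore are hypothetical names.
+CREATE EXTERNAL TABLE logs_adls (ts TIMESTAMP, msg STRING)
+  STORED AS PARQUET
+  LOCATION 'adl://yourstore.azuredatalakestore.net/logs';</codeblock>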
+ </p> + + <p> + For example, this session creates a table in ADLS with the same column layout as a table in HDFS, then + examines the ADLS table and queries some data from it. The table in ADLS works the same as a table in HDFS as + far as the expected file format of the data, table and column statistics, and other table properties. The + only indication that it is not an HDFS table is the <codeph>adl://</codeph> URL in the + <codeph>LOCATION</codeph> property. Many data files can reside in the ADLS directory, and their combined + contents form the table data. Because the data in this example is uploaded after the table is created, a + <codeph>REFRESH</codeph> statement prompts Impala to update its cached information about the data files. + </p> + +<codeblock>[localhost:21000] > create table usa_cities_adls like usa_cities location 'adl://impalademo.azuredatalakestore.net/usa_cities'; +[localhost:21000] > desc usa_cities_adls; ++-------+----------+---------+ +| name | type | comment | ++-------+----------+---------+ +| id | smallint | | +| city | string | | +| state | string | | ++-------+----------+---------+ + +-- Now from a web browser, upload the same data file(s) to ADLS as in the HDFS table, +-- under the relevant store and path. If you already have the data in ADLS, you would +-- point the table LOCATION at an existing path. + +[localhost:21000] > refresh usa_cities_adls; +[localhost:21000] > select count(*) from usa_cities_adls; ++----------+ +| count(*) | ++----------+ +| 289 | ++----------+ +[localhost:21000] > select distinct state from usa_cities_adls limit 5; ++----------------------+ +| state | ++----------------------+ +| Louisiana | +| Minnesota | +| Georgia | +| Alaska | +| Ohio | ++----------------------+ +[localhost:21000] > desc formatted usa_cities_adls; ++------------------------------+----------------------------------------------------+---------+ +| name | type | comment | ++------------------------------+----------------------------------------------------+---------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| id | smallint | NULL | +| city | string | NULL | +| state | string | NULL | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | adls_testing | NULL | +| Owner: | jrussell | NULL | +| CreateTime: | Mon Mar 16 11:36:25 PDT 2015 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | adl://impalademo.azuredatalakestore.net/usa_cities | NULL | +| Table Type: | MANAGED_TABLE | NULL | +... ++------------------------------+----------------------------------------------------+---------+ +</codeblock> + + <p> + In this case, we have already uploaded a Parquet file with a million rows of data to the + <codeph>sample_data</codeph> directory underneath the <codeph>impalademo</codeph> store on ADLS. This + session creates a table with matching column settings pointing to the corresponding location in ADLS, then + queries the table. Because the data is already in place on ADLS when the table is created, no + <codeph>REFRESH</codeph> statement is required.
+ </p> + +<codeblock>[localhost:21000] > create table sample_data_adls + > (id bigint, val int, zerofill string, + > name string, assertion boolean, city string, state string) + > stored as parquet location 'adl://impalademo.azuredatalakestore.net/sample_data'; +[localhost:21000] > select count(*) from sample_data_adls; ++----------+ +| count(*) | ++----------+ +| 1000000 | ++----------+ +[localhost:21000] > select count(*) howmany, assertion from sample_data_adls group by assertion; ++---------+-----------+ +| howmany | assertion | ++---------+-----------+ +| 667149 | true | +| 332851 | false | ++---------+-----------+ +</codeblock> + + </conbody> + + </concept> + + <concept id="queries"> + + <title>Running and Tuning Impala Queries for Data Stored on ADLS</title> + + <conbody> + + <p> + Once the appropriate <codeph>LOCATION</codeph> attributes are set up at the table or partition level, you + query data stored in ADLS exactly the same as data stored on HDFS or in HBase: + </p> + + <ul> + <li> + Queries against ADLS data support all the same file formats as for HDFS data. + </li> + + <li> + Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in ADLS + corresponding to the HDFS directories representing partition key values, or use <codeph>ALTER TABLE ... + ADD PARTITION</codeph> to set up the appropriate paths in ADLS. + </li> + + <li> + HDFS, Kudu, and HBase tables can be joined to ADLS tables, or ADLS tables can be joined with each other. + </li> + + <li> + Authorization using the Sentry framework to control access to databases, tables, or columns works the + same whether the data is in HDFS or in ADLS. + </li> + + <li> + The <cmdname>catalogd</cmdname> daemon caches metadata for both HDFS and ADLS tables. Use + <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> for ADLS tables in the same situations + where you would issue those statements for HDFS tables. + </li> + + <li> + Queries against ADLS tables are subject to the same kinds of admission control and resource management as + HDFS tables. + </li> + + <li> + Metadata about ADLS tables is stored in the same metastore database as for HDFS tables. + </li> + + <li> + You can set up views referring to ADLS tables, the same as for HDFS tables. + </li> + + <li> + The <codeph>COMPUTE STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN + STATS</codeph> statements work for ADLS tables also. + </li> + </ul> + + </conbody> + + <concept id="performance"> + + <title>Understanding and Tuning Impala Query Performance for ADLS Data</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + Although Impala queries for data stored in ADLS might be less performant than queries against the + equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to + interpret explain plans and profiles for queries against ADLS data, and tips to achieve the best + performance possible for such queries. + </p> + + <p> + All else being equal, performance is expected to be lower for queries running against data on ADLS rather + than HDFS. The actual mechanics of the <codeph>SELECT</codeph> statement are somewhat different when the + data is in ADLS. Although the work is still distributed across the datanodes of the cluster, Impala might + parallelize the work for a distributed query differently for data on HDFS and ADLS.
ADLS does not have the + same block notion as HDFS, so Impala uses heuristics to determine how to split up large ADLS files for + processing in parallel. Because all hosts can access any ADLS data file with equal efficiency, the + distribution of work might be different than for HDFS data, where the data blocks are physically read + using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to + read the ADLS data might be spread evenly across the hosts of the cluster, the fact that all data is + initially retrieved across the network means that the overall query performance is likely to be lower for + ADLS data than for HDFS data. + </p> + + <p conref="../shared/impala_common.xml#common/adls_block_splitting"/> + + <p> + When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and + ADLS the same way. Therefore, follow all the same tuning recommendations for ADLS tables as for HDFS ones, + such as using the <codeph>COMPUTE STATS</codeph> statement to help Impala construct accurate estimates of + row counts and cardinality. See <xref href="impala_performance.xml#performance"/> for details. + </p> + + <p> + In query profile reports, the numbers for <codeph>BytesReadLocal</codeph>, + <codeph>BytesReadShortCircuit</codeph>, <codeph>BytesReadDataNodeCached</codeph>, and + <codeph>BytesReadRemoteUnexpected</codeph> are blank because those metrics come from HDFS. + If you do see any indications that a query against an ADLS table performed <q>remote read</q> + operations, do not be alarmed. That is expected because, by definition, all the I/O for ADLS tables involves + remote reads. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="restrictions"> + + <title>Restrictions on Impala Support for ADLS</title> + + <conbody> + + <p> + Impala requires that the default filesystem for the cluster be HDFS. You cannot use ADLS as the only + filesystem in the cluster. + </p> + + <p> + Although ADLS is often used to store JSON-formatted data, the current Impala support for ADLS does not include + directly querying JSON data. For Impala queries, use data files in one of the file formats listed in + <xref href="impala_file_formats.xml#file_formats"/>. If you have data in JSON format, you can prepare a + flattened version of that data for querying by Impala as part of your ETL cycle. + </p> + + <p> + You cannot use the <codeph>ALTER TABLE ... SET CACHED</codeph> statement for tables or partitions that are + located in ADLS. + </p> + + </conbody> + + </concept> + + <concept id="best_practices"> + <title>Best Practices for Using Impala with ADLS</title> + <prolog> + <metadata> + <data name="Category" value="Guidelines"/> + <data name="Category" value="Best Practices"/> + </metadata> + </prolog> + <conbody> + <p> + The following guidelines represent best practices derived from testing and real-world experience with Impala on ADLS: + </p> + <ul> + <li> + <p> + Any reference to an ADLS location must be fully qualified. (This rule applies when + ADLS is not designated as the default filesystem.) + </p> + </li> + <li> + <p> + Set any appropriate configuration settings for <cmdname>impalad</cmdname>.
+ </p> + </li> + </ul> + + </conbody> + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_insert.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_insert.xml b/docs/topics/impala_insert.xml index 5a8e9a5..a83692d 100644 --- a/docs/topics/impala_insert.xml +++ b/docs/topics/impala_insert.xml @@ -708,6 +708,10 @@ Inserted 2 rows in 0.16s <p conref="../shared/impala_common.xml#common/s3_dml_performance"/> <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p> + <p conref="../shared/impala_common.xml#common/adls_blurb"/> + <p conref="../shared/impala_common.xml#common/adls_dml"/> + <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p> + <p conref="../shared/impala_common.xml#common/security_blurb"/> <p conref="../shared/impala_common.xml#common/redaction_yes"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_load_data.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_load_data.xml b/docs/topics/impala_load_data.xml index a092276..96305a5 100644 --- a/docs/topics/impala_load_data.xml +++ b/docs/topics/impala_load_data.xml @@ -239,6 +239,10 @@ Returned 1 row(s) in 0.62s</codeblock> <p conref="../shared/impala_common.xml#common/s3_dml_performance"/> <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p> + <p conref="../shared/impala_common.xml#common/adls_blurb"/> + <p conref="../shared/impala_common.xml#common/adls_dml"/> + <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p> + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> <p conref="../shared/impala_common.xml#common/permissions_blurb"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_parquet_file_size.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_file_size.xml b/docs/topics/impala_parquet_file_size.xml index 2471feb..05e6c36 100644 --- a/docs/topics/impala_parquet_file_size.xml +++ b/docs/topics/impala_parquet_file_size.xml @@ -88,6 +88,8 @@ INSERT OVERWRITE parquet_table SELECT * FROM text_table; <b>Default:</b> 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) </p> + <p conref="../shared/impala_common.xml#common/adls_block_splitting"/> + <p conref="../shared/impala_common.xml#common/isilon_blurb"/> <p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/>