Repository: incubator-impala Updated Branches: refs/heads/master 07d3cea1f -> 717dd73d7
IMPALA-5333: [DOCS] Document Impala ADLS support Change-Id: Id5a98217741e5d540d9874e9b30e36f01644ef14 Reviewed-on: http://gerrit.cloudera.org:8080/7175 Reviewed-by: Sailesh Mukil <[email protected]> Reviewed-by: Laurel Hale <[email protected]> Reviewed-by: John Russell <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/717dd73d Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/717dd73d Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/717dd73d Branch: refs/heads/master Commit: 717dd73d78c52ff372a0faf1af1b8c40b51101ad Parents: 07d3cea Author: John Russell <[email protected]> Authored: Tue Jun 13 13:39:09 2017 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Mon Jul 10 17:21:39 2017 +0000 ---------------------------------------------------------------------- docs/impala.ditamap | 1 + docs/shared/impala_common.xml | 38 ++ docs/topics/impala_adls.xml | 669 ++++++++++++++++++++++++++ docs/topics/impala_insert.xml | 4 + docs/topics/impala_load_data.xml | 4 + docs/topics/impala_parquet_file_size.xml | 2 + 6 files changed, 718 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/impala.ditamap ---------------------------------------------------------------------- diff --git a/docs/impala.ditamap b/docs/impala.ditamap index 3985dcf..574602a 100644 --- a/docs/impala.ditamap +++ b/docs/impala.ditamap @@ -288,6 +288,7 @@ under the License. <topicref href="topics/impala_kudu.xml"/> <topicref href="topics/impala_hbase.xml"/> <topicref href="topics/impala_s3.xml"/> + <topicref rev="2.9.0" href="topics/impala_adls.xml"/> <topicref href="topics/impala_isilon.xml"/> <topicref href="topics/impala_logging.xml"/> <topicref href="topics/impala_troubleshooting.xml"> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/shared/impala_common.xml ---------------------------------------------------------------------- diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml index 8a10c9f..6e65c40 100644 --- a/docs/shared/impala_common.xml +++ b/docs/shared/impala_common.xml @@ -1069,6 +1069,13 @@ drop database temp; <codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive. </p> + <p rev="2.9.0 IMPALA-5333" id="adls_dml_performance"> + <draft-comment> + Currently nothing to say on this subject. Leaving this placeholder + in case there are DML performance implications to discuss in future. + </draft-comment> + </p> + <p rev="2.6.0 IMPALA-1878" id="s3_dml_performance"> Because of differences between S3 and traditional filesystems, DML operations for S3 tables can take longer than for tables on HDFS. For example, both the @@ -1085,6 +1092,14 @@ drop database temp; See <xref href="../topics/impala_s3_skip_insert_staging.xml#s3_skip_insert_staging"/> for details. </p> + <p id="adls_block_splitting" rev="IMPALA-5383"> + Because ADLS does not expose the block sizes of data files the way HDFS does, + Impala <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph> statements + use the <codeph>PARQUET_FILE_SIZE</codeph> query option setting to define the size of + Parquet data files. (Using a large block size is more important for Parquet tables than + for tables that use other file formats.)
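+ For example, a session along the following lines (the table names here are
+ illustrative, not from a tested example) sets a 256 MB target file size before
+ copying data into a Parquet table on ADLS:
+<codeblock>-- adls_parquet_table is a hypothetical table whose LOCATION uses an adl:// URL.
+SET PARQUET_FILE_SIZE=256m;
+INSERT OVERWRITE adls_parquet_table SELECT * FROM text_table;</codeblock>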
+ </p> + <p rev="2.6.0 IMPALA-3453" id="s3_block_splitting"> In <keyword keyref="impala26_full"/> and higher, Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, RCFile, SequenceFile, @@ -1100,6 +1115,13 @@ drop database temp; to 268435456 (256 MB) to match the row group size produced by Impala. </p> + <note rev="2.9.0 IMPALA-5333" id="adls_production" type="important"> + <p> + Currently, the ADLS support in Impala is preliminary and not + fully tested. Do not use Impala with ADLS in a production environment. + </p> + </note> + <note rev="2.6.0 IMPALA-1878" id="s3_production" type="important"> <p> In <keyword keyref="impala26_full"/> and higher, Impala supports both queries (<codeph>SELECT</codeph>) @@ -1126,6 +1148,18 @@ drop database temp; See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala. </p> + <p rev="2.9.0 IMPALA-5333" id="adls_dml"> + In <keyword keyref="impala29_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, + and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the + Azure Data Lake Store (ADLS). + The syntax of the DML statements is the same as for any other tables, because the ADLS location for tables and + partitions is specified by an <codeph>adl://</codeph> prefix in the + <codeph>LOCATION</codeph> attribute of + <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements. + If you bring data into ADLS using the normal ADLS transfer mechanisms instead of Impala DML statements, + issue a <codeph>REFRESH</codeph> statement for the table before using Impala to query the ADLS data. + </p> + <p rev="2.6.0 IMPALA-1878" id="s3_dml"> In <keyword keyref="impala26_full"/> and higher, the Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and <codeph>CREATE TABLE AS SELECT</codeph>) can write data into a table or partition that resides in the @@ -2392,6 +2426,10 @@ flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301 <b>Amazon S3 considerations:</b> </p> + <p id="adls_blurb" rev="2.9.0"> + <b>ADLS considerations:</b> + </p> + <p id="isilon_blurb" rev="2.2.3"> <b>Isilon considerations:</b> </p> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_adls.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_adls.xml b/docs/topics/impala_adls.xml new file mode 100644 index 0000000..8723e23 --- /dev/null +++ b/docs/topics/impala_adls.xml @@ -0,0 +1,669 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="adls" rev="2.9.0"> + + <title>Using Impala with the Azure Data Lake Store (ADLS)</title> + <titlealts audience="PDF"><navtitle>ADLS Tables</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="ADLS"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Preview Features"/> + </metadata> + </prolog> + + <conbody> + + <note conref="../shared/impala_common.xml#common/adls_production"/> + + <p> + <indexterm audience="hidden">ADLS with Impala</indexterm> + You can use Impala to query data residing on the Azure Data Lake Store (ADLS) filesystem. + This capability allows convenient access to a storage system that is remotely managed, + accessible from anywhere, and integrated with various cloud-based services. Impala can + query files in any supported file format from ADLS. The ADLS storage location + can be for an entire table, or individual partitions in a partitioned table. + </p> + + <p> + By default, Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using + full-table scans. In contrast, queries against ADLS data are less performant, making ADLS suitable for holding + <q>cold</q> data that is only queried occasionally, while more frequently accessed <q>hot</q> data resides in + HDFS. In a partitioned table, you can set the <codeph>LOCATION</codeph> attribute for individual partitions + to put some partitions on HDFS and others on ADLS, typically depending on the age of the data. + </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="prereqs"> + <title>Prerequisites</title> + <conbody> + <p> + These procedures presume that you have already set up an Azure account, + configured an ADLS store, and configured your Hadoop cluster with appropriate + credentials to be able to access ADLS. See the following resources for information: + </p> + <ul> + <li> + <p> + <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal" scope="external" format="html">Get started with Azure Data Lake Store using the Azure Portal</xref> + </p> + </li> + <li> + <p> + <xref href="https://hadoop.apache.org/docs/current2/hadoop-azure-datalake/index.html" scope="external" format="html">Hadoop Azure Data Lake Support</xref> + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="sql"> + <title>How Impala SQL Statements Work with ADLS</title> + <conbody> + <p> + Impala SQL statements work with data on ADLS as follows: + </p> + <ul> + <li> + <p> + The <xref href="impala_create_table.xml#create_table"/> + or <xref href="impala_alter_table.xml#alter_table"/> statements + can specify that a table resides on the ADLS filesystem by + encoding an <codeph>adl://</codeph> prefix for the <codeph>LOCATION</codeph> + property. <codeph>ALTER TABLE</codeph> can also set the <codeph>LOCATION</codeph> + property for an individual partition, so that some data in a table resides on + ADLS and other data in the same table resides on HDFS.
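+ For example, an <codeph>ALTER TABLE</codeph> statement along these lines
+ (the table, partition, and store names are placeholders) points one partition at ADLS:
+<codeblock>-- sales and yourstore are hypothetical names, not from a tested example.
+ALTER TABLE sales PARTITION (year=2017)
+  SET LOCATION 'adl://yourstore.azuredatalakestore.net/sales/year=2017';</codeblock>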
+ </p> + <p> + The full format of the location URI is typically: +<codeblock> +adl://<varname>your_account</varname>.azuredatalakestore.net/<varname>rest_of_directory_path</varname> +</codeblock> + </p> + </li> + <li> + <p> + Once a table or partition is designated as residing on ADLS, the <xref href="impala_select.xml#select"/> + statement transparently accesses the data files from the appropriate storage layer. + </p> + </li> + <li> + <p> + If the ADLS table is an internal table, the <xref href="impala_drop_table.xml#drop_table"/> statement + removes the corresponding data files from ADLS when the table is dropped. + </p> + </li> + <li> + <p> + The <xref href="impala_truncate_table.xml#truncate_table"/> statement always removes the corresponding + data files from ADLS when the table is truncated. + </p> + </li> + <li> + <p> + The <xref href="impala_load_data.xml#load_data"/> statement can move data files residing in HDFS into + an ADLS table. + </p> + </li> + <li> + <p> + The <xref href="impala_insert.xml#insert"/> statement, or the <codeph>CREATE TABLE AS SELECT</codeph> + form of the <codeph>CREATE TABLE</codeph> statement, can copy data from an HDFS table or another ADLS + table into an ADLS table. + </p> + </li> + </ul> + <p> + For usage information about Impala SQL statements with ADLS tables, see <xref href="impala_adls.xml#ddl"/> + and <xref href="impala_adls.xml#dml"/>. + </p> + </conbody> + </concept> + + <concept id="creds"> + + <title>Specifying Impala Credentials to Access Data in ADLS</title> + + <conbody> + + <p> + To allow Impala to access data in ADLS, specify values for the following configuration settings in your + <filepath>core-site.xml</filepath> file: + </p> + +<codeblock><![CDATA[ +<property> + <name>dfs.adls.oauth2.access.token.provider.type</name> + <value>ClientCredential</value> +</property> +<property> + <name>dfs.adls.oauth2.client.id</name> + <value><varname>your_client_id</varname></value> +</property> +<property> + <name>dfs.adls.oauth2.credential</name> + <value><varname>your_client_secret</varname></value> +</property> +<property> + <name>dfs.adls.oauth2.refresh.url</name> + <value><varname>refresh_URL</varname></value> +</property> +]]> +</codeblock> + + <note> + <p> + Check if your Hadoop distribution or cluster management tool includes support for + filling in and distributing credentials across the cluster in an automated way. + </p> + </note> + + <p> + After specifying the credentials, restart both the Impala and + Hive services. (Restarting Hive is required because Impala queries, <codeph>CREATE TABLE</codeph> statements, and so on go + through the Hive metastore.) + </p> + + </conbody> + + </concept> + + <concept id="etl"> + + <title>Loading Data into ADLS for Impala Queries</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + If your ETL pipeline involves moving data into ADLS and then querying through Impala, + you can either use Impala DML statements to create, move, or copy the data, or + use the same data loading techniques as you would for non-Impala data.
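+ For example, a hypothetical pipeline might stage data files in an HDFS directory,
+ then move them into an ADLS-backed table with a single statement (the path and
+ table name are placeholders):
+<codeblock>-- /staging/sales and sales_adls are hypothetical names.
+LOAD DATA INPATH '/staging/sales' INTO TABLE sales_adls;</codeblock>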
+ </p> + + </conbody> + + <concept id="dml"> + <title>Using Impala DML Statements for ADLS Data</title> + <conbody> + <p conref="../shared/impala_common.xml#common/adls_dml"/> + </conbody> + </concept> + + <concept id="manual_etl"> + <title>Manually Loading Data into Impala Tables on ADLS</title> + <conbody> + <p> + As an alternative, you can use the Microsoft-provided methods to bring data files + into ADLS for querying through Impala. See + <xref href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob" scope="external" format="html">the Microsoft ADLS documentation</xref> + for details. + </p> + + <p> + After you upload data files to a location already mapped to an Impala table or partition, or if you delete + files in ADLS from such a location, issue the <codeph>REFRESH <varname>table_name</varname></codeph> + statement to make Impala aware of the new set of data files. + </p> + + </conbody> + </concept> + + </concept> + + <concept id="ddl"> + + <title>Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</title> + <prolog> + <metadata> + <data name="Category" value="Databases"/> + </metadata> + </prolog> + + <conbody> + + <p> + Impala reads data for a table or partition from ADLS based on the <codeph>LOCATION</codeph> attribute for the + table or partition. Specify the ADLS details in the <codeph>LOCATION</codeph> clause of a <codeph>CREATE + TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement. The notation for the <codeph>LOCATION</codeph> + clause is <codeph>adl://<varname>store</varname>/<varname>path/to/file</varname></codeph>. + </p> + + <p> + For a partitioned table, either specify a separate <codeph>LOCATION</codeph> clause for each new partition, + or specify a base <codeph>LOCATION</codeph> for the table and set up a directory structure in ADLS to mirror + the way Impala partitioned tables are structured in HDFS. Because ADLS provides a hierarchical + filesystem namespace, Impala treats ADLS pathnames that include <codeph>/</codeph> characters the same as HDFS + pathnames that include directories. + </p> + + <p> + To point a nonpartitioned table or an individual partition at ADLS, specify a single directory + path in ADLS, which could be any arbitrary directory. To replicate the structure of an entire Impala + partitioned table or database in ADLS requires more care, with directories and subdirectories nested and + named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if + necessary in HDFS, and recording the complete directory structure so that you can replicate it in ADLS. + </p> + + <p> + For example, the following session creates a partitioned table where only a single partition resides on ADLS. + The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a + <codeph>LOCATION</codeph> attribute with an <codeph>adl://</codeph> URL, and so refers to data residing on + ADLS, under a specific path underneath the store <codeph>impalademo</codeph>.
+ </p> + +<codeblock>[localhost:21000] > create database db_on_hdfs; +[localhost:21000] > use db_on_hdfs; +[localhost:21000] > create table mostly_on_hdfs (x int) partitioned by (year int); +[localhost:21000] > alter table mostly_on_hdfs add partition (year=2013); +[localhost:21000] > alter table mostly_on_hdfs add partition (year=2014); +[localhost:21000] > alter table mostly_on_hdfs add partition (year=2015) + > location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1'; +</codeblock> + + <p> + For convenience when working with multiple tables with data files stored in ADLS, you can create a database + with a <codeph>LOCATION</codeph> attribute pointing to an ADLS path. + Specify a URL of the form <codeph>adl://<varname>store</varname>/<varname>root/path/for/database</varname></codeph> + for the <codeph>LOCATION</codeph> attribute of the database. + Any tables created inside that database + automatically create directories underneath the one specified by the database + <codeph>LOCATION</codeph> attribute. + </p> + + <p> + The following session creates a database and two partitioned tables residing entirely on ADLS, one + partitioned by a single column and the other partitioned by multiple columns. Because a + <codeph>LOCATION</codeph> attribute with an <codeph>adl://</codeph> URL is specified for the database, the + tables inside that database are automatically created on ADLS underneath the database directory. To see the + names of the associated subdirectories, including the partition key values, we use an ADLS client tool to + examine how the directory structure is organized on ADLS. For example, Impala partition directories such as + <codeph>month=1</codeph> do not include leading zeroes, which sometimes appear in partition directories created + through Hive. + </p> + +<codeblock>[localhost:21000] > create database db_on_adls location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3'; +[localhost:21000] > use db_on_adls; + +[localhost:21000] > create table partitioned_on_adls (x int) partitioned by (year int); +[localhost:21000] > alter table partitioned_on_adls add partition (year=2013); +[localhost:21000] > alter table partitioned_on_adls add partition (year=2014); +[localhost:21000] > alter table partitioned_on_adls add partition (year=2015); + +[localhost:21000] > ! hadoop fs -ls -R adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3; +2015-03-17 13:56:34 0 dir1/dir2/dir3/ +2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/ +2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/ +2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/ +2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/ + +[localhost:21000] > create table partitioned_multiple_keys (x int) + > partitioned by (year smallint, month tinyint, day tinyint); +[localhost:21000] > alter table partitioned_multiple_keys + > add partition (year=2015,month=1,day=1); +[localhost:21000] > alter table partitioned_multiple_keys + > add partition (year=2015,month=1,day=31); +[localhost:21000] > alter table partitioned_multiple_keys + > add partition (year=2015,month=2,day=28); + +[localhost:21000] > ! hadoop fs -ls -R adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3; +2015-03-17 13:56:34 0 dir1/dir2/dir3/ +2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/ +2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/ +2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/ +2015-03-17 16:47:57 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=2/day=28/ +2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/ +2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/ +2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/ +2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/ +</codeblock> + + <p> + The <codeph>CREATE DATABASE</codeph> and <codeph>CREATE TABLE</codeph> statements create the associated + directory paths if they do not already exist. You can specify multiple levels of directories, and the + <codeph>CREATE</codeph> statement creates all appropriate levels, similar to using <codeph>mkdir + -p</codeph>. + </p> + + <p> + Use the standard ADLS file upload methods to actually put the data files into the right locations. You can + also put the directory paths and data files in place before creating the associated Impala databases or + tables, and Impala automatically uses the data from the appropriate location after the associated databases + and tables are created. + </p> + + <p> + You can switch whether an existing table or partition points to data in HDFS or ADLS. For example, if you + have an Impala table or partition pointing to data files in HDFS or ADLS, and you later transfer those data + files to the other filesystem, use an <codeph>ALTER TABLE</codeph> statement to adjust the + <codeph>LOCATION</codeph> attribute of the corresponding table or partition to reflect that change. Because + Impala does not have an <codeph>ALTER DATABASE</codeph> statement, this location-switching technique is not + practical for entire databases that have a custom <codeph>LOCATION</codeph> attribute. + </p> + + </conbody> + + </concept> + + <concept id="internal_external"> + + <title>Internal and External Tables Located on ADLS</title> + + <conbody> + + <p> + Just as with tables located on HDFS storage, you can designate ADLS-based tables as either internal (managed + by Impala) or external, by using the syntax <codeph>CREATE TABLE</codeph> or <codeph>CREATE EXTERNAL + TABLE</codeph> respectively. When you drop an internal table, the files associated with the table are + removed, even if they are on ADLS storage. When you drop an external table, the files associated with the + table are left alone, and are still available for access by other tools or components. See + <xref href="impala_tables.xml#tables"/> for details. + </p> + + <p> + If the data on ADLS is intended to be long-lived and accessed by other tools in addition to Impala, create + any associated ADLS tables with the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, so that the files are not + deleted from ADLS when the table is dropped. + </p> + + <p> + If the data on ADLS is only needed for querying by Impala and can be safely discarded once the Impala + workflow is complete, create the associated ADLS tables using the <codeph>CREATE TABLE</codeph> syntax, so + that dropping the table also deletes the corresponding data files on ADLS.
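+ For reference, a minimal sketch of the external variant (the column definitions,
+ table name, and store name are placeholders, not a tested example):
+<codeblock>-- logs_adls and yourstore are hypothetical names.
+CREATE EXTERNAL TABLE logs_adls (ts TIMESTAMP, msg STRING)
+  STORED AS PARQUET
+  LOCATION 'adl://yourstore.azuredatalakestore.net/logs';</codeblock>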
+ </p> + + <p> + For example, this session creates a table in ADLS with the same column layout as a table in HDFS, then + examines the ADLS table and queries some data from it. The table in ADLS works the same as a table in HDFS as + far as the expected file format of the data, table and column statistics, and other table properties. The + only indication that it is not an HDFS table is the <codeph>adl://</codeph> URL in the + <codeph>LOCATION</codeph> property. Many data files can reside in the ADLS directory, and their combined + contents form the table data. Because the data in this example is uploaded after the table is created, a + <codeph>REFRESH</codeph> statement prompts Impala to update its cached information about the data files. + </p> + +<codeblock>[localhost:21000] > create table usa_cities_adls like usa_cities location 'adl://impalademo.azuredatalakestore.net/usa_cities'; +[localhost:21000] > desc usa_cities_adls; ++-------+----------+---------+ +| name | type | comment | ++-------+----------+---------+ +| id | smallint | | +| city | string | | +| state | string | | ++-------+----------+---------+ + +-- Now from a web browser, upload the same data file(s) to ADLS as in the HDFS table, +-- under the relevant store and path. If you already have the data in ADLS, you would +-- point the table LOCATION at an existing path. + +[localhost:21000] > refresh usa_cities_adls; +[localhost:21000] > select count(*) from usa_cities_adls; ++----------+ +| count(*) | ++----------+ +| 289 | ++----------+ +[localhost:21000] > select distinct state from usa_cities_adls limit 5; ++----------------------+ +| state | ++----------------------+ +| Louisiana | +| Minnesota | +| Georgia | +| Alaska | +| Ohio | ++----------------------+ +[localhost:21000] > desc formatted usa_cities_adls; ++------------------------------+----------------------------------------------------+---------+ +| name | type | comment | ++------------------------------+----------------------------------------------------+---------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| id | smallint | NULL | +| city | string | NULL | +| state | string | NULL | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | adls_testing | NULL | +| Owner: | jrussell | NULL | +| CreateTime: | Mon Mar 16 11:36:25 PDT 2015 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | adl://impalademo.azuredatalakestore.net/usa_cities | NULL | +| Table Type: | MANAGED_TABLE | NULL | +... ++------------------------------+----------------------------------------------------+---------+ +</codeblock> + + <p> + In this case, we have already uploaded a Parquet file with a million rows of data to the + <codeph>sample_data</codeph> directory underneath the <codeph>impalademo</codeph> store on ADLS. This + session creates a table with matching column settings pointing to the corresponding location in ADLS, then + queries the table. Because the data is already in place on ADLS when the table is created, no + <codeph>REFRESH</codeph> statement is required.
+ </p> + +<codeblock>[localhost:21000] > create table sample_data_adls + > (id bigint, val int, zerofill string, + > name string, assertion boolean, city string, state string) + > stored as parquet location 'adl://impalademo.azuredatalakestore.net/sample_data'; +[localhost:21000] > select count(*) from sample_data_adls; ++----------+ +| count(*) | ++----------+ +| 1000000 | ++----------+ +[localhost:21000] > select count(*) howmany, assertion from sample_data_adls group by assertion; ++---------+-----------+ +| howmany | assertion | ++---------+-----------+ +| 667149 | true | +| 332851 | false | ++---------+-----------+ +</codeblock> + + </conbody> + + </concept> + + <concept id="queries"> + + <title>Running and Tuning Impala Queries for Data Stored on ADLS</title> + + <conbody> + + <p> + Once the appropriate <codeph>LOCATION</codeph> attributes are set up at the table or partition level, you + query data stored in ADLS exactly the same as data stored on HDFS or in HBase: + </p> + + <ul> + <li> + Queries against ADLS data support all the same file formats as for HDFS data. + </li> + + <li> + Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in ADLS + corresponding to the HDFS directories representing partition key values, or use <codeph>ALTER TABLE ... + ADD PARTITION</codeph> to set up the appropriate paths in ADLS. + </li> + + <li> + HDFS, Kudu, and HBase tables can be joined to ADLS tables, or ADLS tables can be joined with each other. + </li> + + <li> + Authorization using the Sentry framework to control access to databases, tables, or columns works the + same whether the data is in HDFS or in ADLS. + </li> + + <li> + The <cmdname>catalogd</cmdname> daemon caches metadata for both HDFS and ADLS tables. Use + <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> for ADLS tables in the same situations + where you would issue those statements for HDFS tables. + </li> + + <li> + Queries against ADLS tables are subject to the same kinds of admission control and resource management as + HDFS tables. + </li> + + <li> + Metadata about ADLS tables is stored in the same metastore database as for HDFS tables. + </li> + + <li> + You can set up views referring to ADLS tables, the same as for HDFS tables. + </li> + + <li> + The <codeph>COMPUTE STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN + STATS</codeph> statements work for ADLS tables also. + </li> + </ul> + + </conbody> + + <concept id="performance"> + + <title>Understanding and Tuning Impala Query Performance for ADLS Data</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + Although Impala queries for data stored in ADLS might be less performant than queries against the + equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to + interpret explain plans and profiles for queries against ADLS data, and tips to achieve the best + performance possible for such queries. + </p> + + <p> + All else being equal, performance is expected to be lower for queries running against data on ADLS rather + than HDFS. The actual mechanics of the <codeph>SELECT</codeph> statement are somewhat different when the + data is in ADLS. Although the work is still distributed across the datanodes of the cluster, Impala might + parallelize the work for a distributed query differently for data on HDFS and ADLS.
ADLS does not have the + same block notion as HDFS, so Impala uses heuristics to determine how to split up large ADLS files for + processing in parallel. Because all hosts can access any ADLS data file with equal efficiency, the + distribution of work might be different than for HDFS data, where the data blocks are physically read + using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to + read the ADLS data might be spread evenly across the hosts of the cluster, the fact that all data is + initially retrieved across the network means that the overall query performance is likely to be lower for + ADLS data than for HDFS data. + </p> + + <p conref="../shared/impala_common.xml#common/adls_block_splitting"/> + + <p> + When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and + ADLS the same way. Therefore, follow all the same tuning recommendations for ADLS tables as for HDFS ones, + such as using the <codeph>COMPUTE STATS</codeph> statement to help Impala construct accurate estimates of + row counts and cardinality. See <xref href="impala_performance.xml#performance"/> for details. + </p> + + <p> + In query profile reports, the numbers for <codeph>BytesReadLocal</codeph>, + <codeph>BytesReadShortCircuit</codeph>, <codeph>BytesReadDataNodeCached</codeph>, and + <codeph>BytesReadRemoteUnexpected</codeph> are blank because those metrics come from HDFS. + If you do see any indications that a query against an ADLS table performed <q>remote read</q> + operations, do not be alarmed. That is expected because, by definition, all the I/O for ADLS tables involves + remote reads. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="restrictions"> + + <title>Restrictions on Impala Support for ADLS</title> + + <conbody> + + <p> + Impala requires that the default filesystem for the cluster be HDFS. You cannot use ADLS as the only + filesystem in the cluster. + </p> + + <p> + Although ADLS is often used to store JSON-formatted data, the current Impala support for ADLS does not include + directly querying JSON data. For Impala queries, use data files in one of the file formats listed in + <xref href="impala_file_formats.xml#file_formats"/>. If you have data in JSON format, you can prepare a + flattened version of that data for querying by Impala as part of your ETL cycle. + </p> + + <p> + You cannot use the <codeph>ALTER TABLE ... SET CACHED</codeph> statement for tables or partitions that are + located in ADLS. + </p> + + </conbody> + + </concept> + + <concept id="best_practices"> + <title>Best Practices for Using Impala with ADLS</title> + <prolog> + <metadata> + <data name="Category" value="Guidelines"/> + <data name="Category" value="Best Practices"/> + </metadata> + </prolog> + <conbody> + <p> + The following guidelines represent best practices derived from testing and real-world experience with Impala on ADLS: + </p> + <ul> + <li> + <p> + Any reference to an ADLS location must be fully qualified. (This rule applies when + ADLS is not designated as the default filesystem.) + </p> + </li> + <li> + <p> + Set any appropriate configuration settings for <cmdname>impalad</cmdname>.
+ </p> + </li> + </ul> + + </conbody> + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_insert.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_insert.xml b/docs/topics/impala_insert.xml index 5a8e9a5..a83692d 100644 --- a/docs/topics/impala_insert.xml +++ b/docs/topics/impala_insert.xml @@ -708,6 +708,10 @@ Inserted 2 rows in 0.16s <p conref="../shared/impala_common.xml#common/s3_dml_performance"/> <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p> + <p conref="../shared/impala_common.xml#common/adls_blurb"/> + <p conref="../shared/impala_common.xml#common/adls_dml"/> + <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p> + <p conref="../shared/impala_common.xml#common/security_blurb"/> <p conref="../shared/impala_common.xml#common/redaction_yes"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_load_data.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_load_data.xml b/docs/topics/impala_load_data.xml index a092276..96305a5 100644 --- a/docs/topics/impala_load_data.xml +++ b/docs/topics/impala_load_data.xml @@ -239,6 +239,10 @@ Returned 1 row(s) in 0.62s</codeblock> <p conref="../shared/impala_common.xml#common/s3_dml_performance"/> <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p> + <p conref="../shared/impala_common.xml#common/adls_blurb"/> + <p conref="../shared/impala_common.xml#common/adls_dml"/> + <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p> + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> <p conref="../shared/impala_common.xml#common/permissions_blurb"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/717dd73d/docs/topics/impala_parquet_file_size.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_file_size.xml b/docs/topics/impala_parquet_file_size.xml index 2471feb..05e6c36 100644 --- a/docs/topics/impala_parquet_file_size.xml +++ b/docs/topics/impala_parquet_file_size.xml @@ -88,6 +88,8 @@ INSERT OVERWRITE parquet_table SELECT * FROM text_table; <b>Default:</b> 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) </p> + <p conref="../shared/impala_common.xml#common/adls_block_splitting"/> + <p conref="../shared/impala_common.xml#common/isilon_blurb"/> <p conref="../shared/impala_common.xml#common/isilon_block_size_caveat"/>