[28/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

jbapple Wed, 12 Apr 2017 11:26:07 -0700

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_isilon.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_isilon.html 
b/docs/build/html/topics/impala_isilon.html
new file mode 100644
index 0000000..304e8ec
--- /dev/null
+++ b/docs/build/html/topics/impala_isilon.html
@@ -0,0 +1,89 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta 
name="prodname" content="Impala"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" 
content="XHTML"><meta name="DC.Identifier" content="impala_isilon"><link 
rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala 
with Isilon Storage</title></head><body id="impala_isilon"><main 
role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Using Impala with Isilon 
Storage</h1>
+  
+
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      You can use Impala to query data files that reside on EMC Isilon storage 
devices, rather than in HDFS.
+      This capability allows convenient query access to a storage system where 
you might already be
+      managing large volumes of data. The combination of the Impala query 
engine and Isilon storage is
+      certified on <span class="keyword">Impala 2.2.4</span> or higher.
+    </p>
+
+    <div class="p">
+        Because the EMC Isilon storage devices use a global value for the 
block size
+        rather than a configurable value for each file, the <code class="ph 
codeph">PARQUET_FILE_SIZE</code>
+        query option has no effect when Impala inserts data into a table or 
partition
+        residing on Isilon storage. Use the <code class="ph codeph">isi</code> 
command to set the
+        default block size globally on the Isilon device. For example, to set 
the
+        Isilon default block size to 256 MB, the recommended size for Parquet
+        data files for Impala, issue the following command:
+<pre class="pre codeblock"><code>isi hdfs settings modify 
--default-block-size=256MB</code></pre>
+      </div>
+
+    <p class="p">
+      The typical use case for Impala and Isilon together is to use Isilon for 
the
+      default filesystem, replacing HDFS entirely. In this configuration,
+      when you create a database, table, or partition, the data always resides 
on
+      Isilon storage and you do not need to specify any special <code 
class="ph codeph">LOCATION</code>
+      attribute. If you do specify a <code class="ph codeph">LOCATION</code> 
attribute, its value refers
+      to a path within the Isilon filesystem.
+      For example:
+    </p>
+<pre class="pre codeblock"><code>-- If the default filesystem is Isilon, all 
Impala data resides there
+-- and all Impala databases and tables are located there.
+CREATE TABLE t1 (x INT, s STRING);
+
+-- You can specify LOCATION for database, table, or partition,
+-- using values from the Isilon filesystem.
+CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
+CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
+</code></pre>
+
+    <p class="p">
+      Impala can write to, delete, and rename data files and database, table,
+      and partition directories on Isilon storage. Therefore, Impala 
statements such
+      as
+      <code class="ph codeph">CREATE TABLE</code>, <code class="ph 
codeph">DROP TABLE</code>,
+      <code class="ph codeph">CREATE DATABASE</code>, <code class="ph 
codeph">DROP DATABASE</code>,
+      <code class="ph codeph">ALTER TABLE</code>,
+      and
+      <code class="ph codeph">INSERT</code> work the same with Isilon storage 
as with HDFS.
+    </p>
+
+    <p class="p">
+      When the Impala spill-to-disk feature is activated by a query that 
approaches
+      the memory limit, Impala writes all the temporary data to a local (not 
Isilon)
+      storage device. Because the I/O bandwidth for the temporary data depends 
on
+      the number of local disks, and clusters using Isilon storage might not 
have
+      as many local disks attached, pay special attention on Isilon-enabled 
clusters
+      to any queries that use the spill-to-disk feature. Where practical, tune 
the
+      queries or allocate extra memory for Impala to avoid spilling.
+      Although you can specify an Isilon storage device as the destination for
+      the temporary data for the spill-to-disk feature, that configuration is
+      not recommended due to the need to transfer the data both ways using 
remote I/O.
+    </p>
+
+    <p class="p">
+      When tuning Impala queries on HDFS, you typically try to avoid any 
remote reads.
+      When the data resides on Isilon storage, all the I/O consists of remote 
reads.
+      Do not be alarmed when you see non-zero numbers for remote read 
measurements
+      in query profile output. The benefit of the Impala and Isilon 
integration is
+      primarily convenience of not having to move or copy large volumes of 
data to HDFS,
+      rather than raw query performance. You can increase the performance of 
Impala
+      I/O for Isilon systems by increasing the value for the
+      <code class="ph codeph">--num_remote_hdfs_io_threads</code> startup 
option for the
+      <span class="keyword cmdname">impalad</span> daemon.
+    </p>
+
+    
+  </div>
+
+</article></main></body></html>
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_jdbc.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_jdbc.html 
b/docs/build/html/topics/impala_jdbc.html
new file mode 100644
index 0000000..cdb031b
--- /dev/null
+++ b/docs/build/html/topics/impala_jdbc.html
@@ -0,0 +1,326 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_config.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="impala_jdbc"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>Configuring Impala to Work with 
JDBC</title></head><body id="impala_jdbc"><main role="main"><article 
role="article" aria-labelledby="impala_jdbc__jdbc">
+
+  <h1 class="title topictitle1" id="impala_jdbc__jdbc">Configuring Impala to 
Work with JDBC</h1>
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      Impala supports the standard JDBC interface, allowing access from 
commercial Business Intelligence tools and
+      custom software written in Java or other programming languages. The JDBC 
driver allows you to access Impala
+      from a Java program that you write, or a Business Intelligence or 
similar tool that uses JDBC to communicate
+      with various database products.
+    </p>
+
+    <p class="p">
+      Setting up a JDBC connection to Impala involves the following steps:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        Verifying the communication port where the Impala daemons in your 
cluster are listening for incoming JDBC
+        requests.
+      </li>
+
+      <li class="li">
+        Installing the JDBC driver on every system that runs the JDBC-enabled 
application.
+      </li>
+
+      <li class="li">
+        Specifying a connection string for the JDBC application to access one 
of the servers running the
+        <span class="keyword cmdname">impalad</span> daemon, with the 
appropriate security settings.
+      </li>
+    </ul>
+
+    <p class="p toc inpage"></p>
+  </div>
+
+  <nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_config.html">Managing 
Impala</a></div></div></nav><article class="topic concept nested1" 
aria-labelledby="ariaid-title2" id="impala_jdbc__jdbc_port">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Configuring the JDBC 
Port</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The default port used by JDBC 2.0 and later (as well as ODBC 2.x) is 
21050. Impala server accepts JDBC
+        connections through this same port 21050 by default. Make sure this 
port is available for communication
+        with other hosts on your network, for example, that it is not blocked 
by firewall software. If your JDBC
+        client software connects to a different port, specify that alternative 
port number with the
+        <code class="ph codeph">--hs2_port</code> option when starting <code 
class="ph codeph">impalad</code>. See
+        <a class="xref" href="impala_processes.html#processes">Starting 
Impala</a> for details about Impala startup options. See
+        <a class="xref" href="impala_ports.html#ports">Ports Used by 
Impala</a> for information about all ports used for communication between Impala
+        and clients or between Impala components.
+      </p>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" 
id="impala_jdbc__jdbc_driver_choice">
+
+    <h2 class="title topictitle2" id="ariaid-title3">Choosing the JDBC 
Driver</h2>
+  
+
+    <div class="body conbody">
+
+      <p class="p">
+        In Impala 2.0 and later, you can use the Hive 0.13 JDBC driver.  If 
you are
+        already using JDBC applications with an earlier Impala release, you 
should update
+        your JDBC driver, because the Hive 0.12 driver that was formerly the 
only choice
+        is not compatible with Impala 2.0 and later.
+      </p>
+
+      <p class="p">
+        The Hive JDBC driver provides a substantial speed increase for JDBC
+        applications with Impala 2.0 and higher, for queries that return large 
result sets.
+      </p>
+
+      <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+
+      <p class="p">
+        The Impala complex types (<code class="ph codeph">STRUCT</code>, <code 
class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>)
+        are available in <span class="keyword">Impala 2.3</span> and higher.
+        To use these types with JDBC requires version 2.5.28 or higher of the 
JDBC Connector for Impala.
+        To use these types with ODBC requires version 2.5.30 or higher of the 
ODBC Connector for Impala.
+        Consider upgrading all JDBC and ODBC drivers at the same time you 
upgrade from <span class="keyword">Impala 2.3</span> or higher.
+      </p>
+      <p class="p">
+        Although the result sets from queries involving complex types consist 
of all scalar values,
+        the queries involve join notation and column references that might not 
be understood by
+        a particular JDBC or ODBC connector. Consider defining a view that 
represents the
+        flattened version of a table containing complex type columns, and 
pointing the JDBC
+        or ODBC application at the view.
+        See <a class="xref" 
href="../shared/../topics/impala_complex_types.html#complex_types">Complex 
Types (Impala 2.3 or higher only)</a> for details.
+      </p>
+
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title4" 
id="impala_jdbc__jdbc_setup">
+
+    <h2 class="title topictitle2" id="ariaid-title4">Enabling Impala JDBC 
Support on Client Systems</h2>
+  
+
+    <div class="body conbody">
+
+      <section class="section" id="jdbc_setup__install_hive_driver"><h3 
class="title sectiontitle">Using the Hive JDBC Driver</h3>
+        
+        <p class="p">
+          You install the Hive JDBC driver (<code class="ph 
codeph">hive-jdbc</code> package) through the Linux package manager, on
+          hosts within the cluster. The driver consists of several Java JAR 
files. The same driver can be used by Impala and Hive.
+        </p>
+
+        <p class="p">
+          To get the JAR files, install the Hive JDBC driver on each host in 
the cluster that will run
+          JDBC applications. 
+        </p>
+
+        <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+          The latest JDBC driver, corresponding to Hive 0.13, provides 
substantial performance improvements for
+          Impala queries that return large result sets. Impala 2.0 and later 
are compatible with the Hive 0.13
+          driver. If you already have an older JDBC driver installed, and are 
running Impala 2.0 or higher, consider
+          upgrading to the latest Hive JDBC driver for best performance with 
JDBC applications.
+        </div>
+
+        <p class="p">
+          If you are using JDBC-enabled applications on hosts outside the 
cluster, you cannot use the the same install
+          procedure on the hosts. Install the JDBC driver on at least one 
cluster host using the preceding
+          procedure. Then download the JAR files to each client machine that 
will use JDBC with Impala:
+        </p>
+
+  <pre class="pre codeblock"><code>commons-logging-X.X.X.jar
+  hadoop-common.jar
+  hive-common-X.XX.X.jar
+  hive-jdbc-X.XX.X.jar
+  hive-metastore-X.XX.X.jar
+  hive-service-X.XX.X.jar
+  httpclient-X.X.X.jar
+  httpcore-X.X.X.jar
+  libfb303-X.X.X.jar
+  libthrift-X.X.X.jar
+  log4j-X.X.XX.jar
+  slf4j-api-X.X.X.jar
+  slf4j-logXjXX-X.X.X.jar
+  </code></pre>
+
+        <p class="p">
+          <strong class="ph b">To enable JDBC support for Impala on the system 
where you run the JDBC application:</strong>
+        </p>
+
+        <ol class="ol">
+          <li class="li">
+            Download the JAR files listed above to each client machine.
+            <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+              For Maven users, see
+              <a class="xref" 
href="https://github.com/onefoursix/Cloudera-Impala-JDBC-Example"; 
target="_blank">this sample github page</a> for an example of the
+              dependencies you could add to a <code class="ph 
codeph">pom</code> file instead of downloading the individual JARs.
+            </div>
+          </li>
+
+          <li class="li">
+            Store the JAR files in a location of your choosing, ideally a 
directory already referenced in your
+            <code class="ph codeph">CLASSPATH</code> setting. For example:
+            <ul class="ul">
+              <li class="li">
+                On Linux, you might use a location such as <code class="ph 
codeph">/opt/jars/</code>.
+              </li>
+
+              <li class="li">
+                On Windows, you might use a subdirectory underneath <span 
class="ph filepath">C:\Program Files</span>.
+              </li>
+            </ul>
+          </li>
+
+          <li class="li">
+            To successfully load the Impala JDBC driver, client programs must 
be able to locate the associated JAR
+            files. This often means setting the <code class="ph 
codeph">CLASSPATH</code> for the client process to include the
+            JARs. Consult the documentation for your JDBC client for more 
details on how to install new JDBC drivers,
+            but some examples of how to set <code class="ph 
codeph">CLASSPATH</code> variables include:
+            <ul class="ul">
+              <li class="li">
+                On Linux, if you extracted the JARs to <code class="ph 
codeph">/opt/jars/</code>, you might issue the following
+                command to prepend the JAR files path to an existing classpath:
+  <pre class="pre codeblock"><code>export 
CLASSPATH=/opt/jars/*.jar:$CLASSPATH</code></pre>
+              </li>
+
+              <li class="li">
+                On Windows, use the <strong class="ph b">System 
Properties</strong> control panel item to modify the <strong class="ph 
b">Environment
+                Variables</strong> for your system. Modify the environment 
variables to include the path to which you
+                extracted the files.
+                <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+                  If the existing <code class="ph codeph">CLASSPATH</code> on 
your client machine refers to some older version of
+                  the Hive JARs, ensure that the new JARs are the first ones 
listed. Either put the new JAR files
+                  earlier in the listings, or delete the other references to 
Hive JAR files.
+                </div>
+              </li>
+            </ul>
+          </li>
+        </ol>
+      </section>
+
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title5" 
id="impala_jdbc__jdbc_connect">
+
+    <h2 class="title topictitle2" id="ariaid-title5">Establishing JDBC 
Connections</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The JDBC driver class depends on which driver you select.
+      </p>
+
+      <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+        If your JDBC or ODBC application connects to Impala through a load 
balancer such as
+        <code class="ph codeph">haproxy</code>, be cautious about reusing the 
connections. If the load balancer has set up
+        connection timeout values, either check the connection frequently so 
that it never sits idle longer than
+        the load balancer timeout value, or check the connection validity 
before using it and create a new one if
+        the connection has been closed.
+      </div>
+
+      <section class="section" id="jdbc_connect__class_hive_driver"><h3 
class="title sectiontitle">Using the Hive JDBC Driver</h3>
+      
+
+      <p class="p">
+        For example, with the Hive JDBC driver, the class name is <code 
class="ph codeph">org.apache.hive.jdbc.HiveDriver</code>.
+        Once you have configured Impala to work with JDBC, you can establish 
connections between the two.
+        To do so for a cluster that does not use
+        Kerberos authentication, use a connection string of the form
+        <code class="ph codeph">jdbc:hive2://<var class="keyword 
varname">host</var>:<var class="keyword varname">port</var>/;auth=noSasl</code>.
+
+        For example, you might use:
+      </p>
+
+<pre class="pre 
codeblock"><code>jdbc:hive2://myhost.example.com:21050/;auth=noSasl</code></pre>
+
+      <p class="p">
+        To connect to an instance of Impala that requires Kerberos 
authentication, use a connection string of the
+        form
+        <code class="ph codeph">jdbc:hive2://<var class="keyword 
varname">host</var>:<var class="keyword varname">port</var>/;principal=<var 
class="keyword varname">principal_name</var></code>.
+        The principal must be the same user principal you used when starting 
Impala. For example, you might use:
+      </p>
+
+<pre class="pre 
codeblock"><code>jdbc:hive2://myhost.example.com:21050/;principal=impala/[email protected]</code></pre>
+
+      <p class="p">
+        To connect to an instance of Impala that requires LDAP authentication, 
use a connection string of the form
+        <code class="ph codeph">jdbc:hive2://<var class="keyword 
varname">host</var>:<var class="keyword varname">port</var>/<var class="keyword 
varname">db_name</var>;user=<var class="keyword 
varname">ldap_userid</var>;password=<var class="keyword 
varname">ldap_password</var></code>.
+        For example, you might use:
+      </p>
+
+<pre class="pre 
codeblock"><code>jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123</code></pre>
+
+      <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+          <p class="p">
+        Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver 
did not support connections that use both Kerberos authentication
+        and SSL encryption. If your cluster is running an older release that 
has this restriction,
+        use an alternative JDBC driver that supports
+        both of these security features.
+      </p>
+      </div>
+
+      </section>
+
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title6" 
id="impala_jdbc__jdbc_odbc_notes">
+    <h2 class="title topictitle2" id="ariaid-title6">Notes about JDBC and ODBC 
Interaction with Impala SQL Features</h2>
+    <div class="body conbody">
+      <p class="p">
+        Most Impala SQL features work equivalently through the <span 
class="keyword cmdname">impala-shell</span> interpreter
+        of the JDBC or ODBC APIs. The following are some exceptions to keep in 
mind when switching between
+        the interactive shell and applications using the APIs:
+      </p>
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+          <ul class="ul">
+          <li class="li">
+          <p class="p">
+            Queries involving the complex types (<code class="ph 
codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code 
class="ph codeph">MAP</code>)
+            require notation that might not be available in all levels of JDBC 
and ODBC drivers.
+            If you have trouble querying such a table due to the driver level 
or
+            inability to edit the queries used by the application, you can 
create a view that exposes
+            a <span class="q">"flattened"</span> version of the complex 
columns and point the application at the view.
+            See <a class="xref" 
href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or 
higher only)</a> for details.
+          </p>
+        </li>
+        <li class="li">
+          <p class="p">
+            The complex types available in <span class="keyword">Impala 
2.3</span> and higher are supported by the
+            JDBC <code class="ph codeph">getColumns()</code> API.
+            Both <code class="ph codeph">MAP</code> and <code class="ph 
codeph">ARRAY</code> are reported as the JDBC SQL Type <code class="ph 
codeph">ARRAY</code>,
+            because this is the closest matching Java SQL type. This behavior 
is consistent with Hive.
+            <code class="ph codeph">STRUCT</code> types are reported as the 
JDBC SQL Type <code class="ph codeph">STRUCT</code>.
+          </p>
+          <div class="p">
+            To be consistent with Hive's behavior, the TYPE_NAME field is 
populated
+            with the primitive type name for scalar types, and with the full 
<code class="ph codeph">toSql()</code>
+            for complex types. The resulting type names are somewhat 
inconsistent,
+            because nested types are printed differently than top-level types. 
For example,
+            the following list shows how <code class="ph 
codeph">toSQL()</code> for Impala types are
+            translated to <code class="ph codeph">TYPE_NAME</code> values:
+<pre class="pre codeblock"><code>DECIMAL(10,10)         becomes  DECIMAL
+CHAR(10)               becomes  CHAR
+VARCHAR(10)            becomes  VARCHAR
+ARRAY&lt;DECIMAL(10,10)&gt;  becomes  ARRAY&lt;DECIMAL(10,10)&gt;
+ARRAY&lt;CHAR(10)&gt;        becomes  ARRAY&lt;CHAR(10)&gt;
+ARRAY&lt;VARCHAR(10)&gt;     becomes  ARRAY&lt;VARCHAR(10)&gt;
+
+</code></pre>
+          </div>
+          </li>
+        </ul>
+        </li>
+      </ul>
+    </div>
+  </article>
+
+</article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_joins.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_joins.html 
b/docs/build/html/topics/impala_joins.html
new file mode 100644
index 0000000..436f9f5
--- /dev/null
+++ b/docs/build/html/topics/impala_joins.html
@@ -0,0 +1,531 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_select.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="joins"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>Joins in Impala SELECT 
Statements</title></head><body id="joins"><main role="main"><article 
role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Joins in Impala SELECT 
Statements</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      A join query is a <code class="ph codeph">SELECT</code> statement that 
combines data from two or more tables,
+      and returns a result set containing items from some or all of those 
tables. It is a way to
+      cross-reference and correlate related data that is organized into 
multiple tables, typically
+      using identifiers that are repeated in each of the joined tables.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Syntax:</strong>
+      </p>
+
+    <p class="p">
+        Impala supports a wide variety of <code class="ph codeph">JOIN</code> 
clauses. Left, right, semi, full, and outer joins
+        are supported in all Impala versions. The <code class="ph 
codeph">CROSS JOIN</code> operator is available in Impala 1.2.2
+        and higher. During performance tuning, you can override the reordering 
of join clauses that Impala does
+        internally by including the keyword <code class="ph 
codeph">STRAIGHT_JOIN</code> immediately after the
+        <code class="ph codeph">SELECT</code> keyword
+      </p>
+
+<pre class="pre codeblock"><code>SELECT <var class="keyword 
varname">select_list</var> FROM
+  <var class="keyword varname">table_or_subquery1</var> [INNER] JOIN <var 
class="keyword varname">table_or_subquery2</var> |
+  <var class="keyword varname">table_or_subquery1</var> {LEFT [OUTER] | RIGHT 
[OUTER] | FULL [OUTER]} JOIN <var class="keyword 
varname">table_or_subquery2</var> |
+  <var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} SEMI 
JOIN <var class="keyword varname">table_or_subquery2</var> |
+  <span class="ph"><var class="keyword varname">table_or_subquery1</var> {LEFT 
| RIGHT} ANTI JOIN <var class="keyword varname">table_or_subquery2</var> 
|</span>
+    [ ON <var class="keyword varname">col1</var> = <var class="keyword 
varname">col2</var> [AND <var class="keyword varname">col3</var> = <var 
class="keyword varname">col4</var> ...] |
+      USING (<var class="keyword varname">col1</var> [, <var class="keyword 
varname">col2</var> ...]) ]
+  [<var class="keyword varname">other_join_clause</var> ...]
+[ WHERE <var class="keyword varname">where_clauses</var> ]
+
+SELECT <var class="keyword varname">select_list</var> FROM
+  <var class="keyword varname">table_or_subquery1</var>, <var class="keyword 
varname">table_or_subquery2</var> [, <var class="keyword 
varname">table_or_subquery3</var> ...]
+  [<var class="keyword varname">other_join_clause</var> ...]
+WHERE
+    <var class="keyword varname">col1</var> = <var class="keyword 
varname">col2</var> [AND <var class="keyword varname">col3</var> = <var 
class="keyword varname">col4</var> ...]
+
+SELECT <var class="keyword varname">select_list</var> FROM
+  <var class="keyword varname">table_or_subquery1</var> CROSS JOIN <var 
class="keyword varname">table_or_subquery2</var>
+  [<var class="keyword varname">other_join_clause</var> ...]
+[ WHERE <var class="keyword varname">where_clauses</var> ]</code></pre>
+
+    <p class="p">
+      <strong class="ph b">SQL-92 and SQL-89 Joins:</strong>
+    </p>
+
+    <p class="p">
+      Queries with the explicit <code class="ph codeph">JOIN</code> keywords 
are known as SQL-92 style joins, referring to the
+      level of the SQL standard where they were introduced. The corresponding 
<code class="ph codeph">ON</code> or
+      <code class="ph codeph">USING</code> clauses clearly show which columns 
are used as the join keys in each case:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph 
b">t1 JOIN t2</strong>
+  <strong class="ph b">ON t1.id = t2.id and t1.type_flag = 
t2.type_flag</strong>
+  WHERE t1.c1 &gt; 100;
+
+SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong>
+  <strong class="ph b">USING (id, type_flag)</strong>
+  WHERE t1.c1 &gt; 100;</code></pre>
+
+    <p class="p">
+      The <code class="ph codeph">ON</code> clause is a general way to compare 
columns across the two tables, even if the column
+      names are different. The <code class="ph codeph">USING</code> clause is 
a shorthand notation for specifying the join
+      columns, when the column names are the same in both tables. You can code 
equivalent <code class="ph codeph">WHERE</code>
+      clauses that compare the columns, instead of <code class="ph 
codeph">ON</code> or <code class="ph codeph">USING</code> clauses, but that
+      practice is not recommended because mixing the join comparisons with 
other filtering clauses is typically
+      less readable and harder to maintain.
+    </p>
+
+    <p class="p">
+      Queries with a comma-separated list of tables and subqueries are known 
as SQL-89 style joins. In these
+      queries, the equality comparisons between columns of the joined tables 
go in the <code class="ph codeph">WHERE</code>
+      clause alongside other kinds of comparisons. This syntax is easy to 
learn, but it is also easy to
+      accidentally remove a <code class="ph codeph">WHERE</code> clause needed 
for the join to work correctly.
+    </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph 
b">t1, t2</strong>
+  WHERE
+  <strong class="ph b">t1.id = t2.id AND t1.type_flag = t2.type_flag</strong>
+  AND t1.c1 &gt; 100;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Self-joins:</strong>
+    </p>
+
+    <p class="p">
+      Impala can do self-joins, for example to join on two different columns 
in the same table to represent
+      parent-child relationships or other tree-structured data. There is no 
explicit syntax for this; just use the
+      same table name for both the left-hand and right-hand table, and assign 
different table aliases to use when
+      referring to the fully qualified column names:
+    </p>
+
+<pre class="pre codeblock"><code>-- Combine fields from both parent and child 
rows.
+SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs 
WHERE lhs.id = rhs.parent;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Cartesian joins:</strong>
+    </p>
+
+    <div class="p">
+      To avoid producing huge result sets by mistake, Impala does not allow 
Cartesian joins of the form:
+<pre class="pre codeblock"><code>SELECT ... FROM t1 JOIN t2;
+SELECT ... FROM t1, t2;</code></pre>
+      If you intend to join the tables based on common values, add <code 
class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code>
+      clauses to compare columns across the tables. If you truly intend to do 
a Cartesian join, use the
+      <code class="ph codeph">CROSS JOIN</code> keyword as the join operator. 
The <code class="ph codeph">CROSS JOIN</code> form does not use
+      any <code class="ph codeph">ON</code> clause, because it produces a 
result set with all combinations of rows from the
+      left-hand and right-hand tables. The result set can still be filtered by 
subsequent <code class="ph codeph">WHERE</code>
+      clauses. For example:
+    </div>
+
+<pre class="pre codeblock"><code>SELECT ... FROM t1 CROSS JOIN t2;
+SELECT ... FROM t1 CROSS JOIN t2 WHERE <var class="keyword 
varname">tests_on_non_join_columns</var>;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Inner and outer joins:</strong>
+    </p>
+
+    <p class="p">
+      An inner join is the most common and familiar type: rows in the result 
set contain the requested columns from
+      the appropriate tables, for all combinations of rows where the join 
columns of the tables have identical
+      values. If a column with the same name occurs in both tables, use a 
fully qualified name or a column alias to
+      refer to the column in the select list or other clauses. Impala performs 
inner joins by default for both
+      SQL-89 and SQL-92 join syntax:
+    </p>
+
+<pre class="pre codeblock"><code>-- The following 3 forms are all equivalent.
+SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</code></pre>
+
+    <p class="p">
+      An outer join retrieves all rows from the left-hand table, or the 
right-hand table, or both; wherever there
+      is no matching data in the table on the other side of the join, the 
corresponding columns in the result set
+      are set to <code class="ph codeph">NULL</code>. To perform an outer 
join, include the <code class="ph codeph">OUTER</code> keyword in the
+      join operator, along with either <code class="ph codeph">LEFT</code>, 
<code class="ph codeph">RIGHT</code>, or <code class="ph codeph">FULL</code>:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id 
= t2.id;
+SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</code></pre>
+
+    <p class="p">
+      For outer joins, Impala requires SQL-92 syntax; that is, the <code 
class="ph codeph">JOIN</code> keyword instead of
+      comma-separated table names. Impala does not support vendor extensions 
such as <code class="ph codeph">(+)</code> or
+      <code class="ph codeph">*=</code> notation for doing outer joins with 
SQL-89 query syntax.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Equijoins and Non-Equijoins:</strong>
+    </p>
+
+    <p class="p">
+      By default, Impala requires an equality comparison between the left-hand 
and right-hand tables, either
+      through <code class="ph codeph">ON</code>, <code class="ph 
codeph">USING</code>, or <code class="ph codeph">WHERE</code> clauses. These 
types of
+      queries are classified broadly as equijoins. Inner, outer, full, and 
semi joins can all be equijoins based on
+      the presence of equality tests between columns in the left-hand and 
right-hand tables.
+    </p>
+
+    <p class="p">
+      In Impala 1.2.2 and higher, non-equijoin queries are also possible, with 
comparisons such as
+      <code class="ph codeph">!=</code> or <code class="ph codeph">&lt;</code> 
between the join columns. These kinds of queries require care to
+      avoid producing huge result sets that could exceed resource limits. Once 
you have planned a non-equijoin
+      query that produces a result set of acceptable size, you can code the 
query using the <code class="ph codeph">CROSS
+      JOIN</code> operator, and add the extra comparisons in the <code 
class="ph codeph">WHERE</code> clause:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total 
&gt; t2.maximum_price;</code></pre>
+
+    <p class="p">
+      In <span class="keyword">Impala 2.3</span> and higher, additional 
non-equijoin queries are possible due to the addition
+      of nested loop joins. These queries typically involve <code class="ph 
codeph">SEMI JOIN</code>,
+      <code class="ph codeph">ANTI JOIN</code>, or <code class="ph 
codeph">FULL OUTER JOIN</code> clauses.
+      Impala sometimes also uses nested loop joins internally when evaluating 
<code class="ph codeph">OUTER JOIN</code>
+      queries involving complex type columns.
+      Query phases involving nested loop joins do not use the spill-to-disk 
mechanism if they
+      exceed the memory limit. Impala decides internally when to use each join 
mechanism; you cannot
+      specify any query hint to choose between the nested loop join or the 
original hash join algorithm.
+    </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON 
t1.int_col &lt; t2.int_col;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Semi-joins:</strong>
+    </p>
+
+    <p class="p">
+      Semi-joins are a relatively rarely used variation. With the left 
semi-join, only data from the left-hand
+      table is returned, for rows where there is matching data in the 
right-hand table, based on comparisons
+      between join columns in <code class="ph codeph">ON</code> or <code 
class="ph codeph">WHERE</code> clauses. Only one instance of each row
+      from the left-hand table is returned, regardless of how many matching 
rows exist in the right-hand table.
+      <span class="ph">A right semi-join (available in Impala 2.0 and higher) 
reverses the comparison and returns
+      data from the right-hand table.</span>
+    </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI 
JOIN t2 ON t1.id = t2.id;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Natural joins (not supported):</strong>
+    </p>
+
+    <p class="p">
+      Impala does not support the <code class="ph codeph">NATURAL JOIN</code> 
operator, again to avoid inconsistent or huge
+      result sets. Natural joins do away with the <code class="ph 
codeph">ON</code> and <code class="ph codeph">USING</code> clauses, and
+      instead automatically join on all columns with the same names in the 
left-hand and right-hand tables. This
+      kind of query is not recommended for rapidly evolving data structures 
such as are typically used in Hadoop.
+      Thus, Impala does not support the <code class="ph codeph">NATURAL 
JOIN</code> syntax, which can produce different query
+      results as columns are added to or removed from tables.
+    </p>
+
+    <p class="p">
+      If you do have any queries that use <code class="ph codeph">NATURAL 
JOIN</code>, make sure to rewrite them with explicit
+      <code class="ph codeph">USING</code> clauses, because Impala could 
interpret the <code class="ph codeph">NATURAL</code> keyword as a
+      table alias:
+    </p>
+
+<pre class="pre codeblock"><code>-- 'NATURAL' is interpreted as an alias for 
't1' and Impala attempts an inner join,
+-- resulting in an error because inner joins require explicit comparisons 
between columns.
+SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
+ERROR: NotImplementedException: Join with 't2' requires at least one 
conjunctive equality predicate.
+  To perform a Cartesian product between two tables, use a CROSS JOIN.
+
+-- If you expect the tables to have identically named columns with matching 
values,
+-- list the corresponding column names in a USING clause.
+SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, 
address);</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Anti-joins (<span class="keyword">Impala 2.0</span> 
and higher only):</strong>
+    </p>
+
+    <p class="p">
+      Impala supports the <code class="ph codeph">LEFT ANTI JOIN</code> and 
<code class="ph codeph">RIGHT ANTI JOIN</code> clauses in
+      <span class="keyword">Impala 2.0</span> and higher. The <code class="ph 
codeph">LEFT</code> or <code class="ph codeph">RIGHT</code>
+      keyword is required for this kind of join. For <code class="ph 
codeph">LEFT ANTI JOIN</code>, this clause returns those
+      values from the left-hand table that have no matching value in the 
right-hand table. <code class="ph codeph">RIGHT ANTI
+      JOIN</code> reverses the comparison and returns values from the 
right-hand table. You can express this
+      negative relationship either through the <code class="ph codeph">ANTI 
JOIN</code> clause or through a <code class="ph codeph">NOT
+      EXISTS</code> operator with a subquery.
+    </p>
+
+
+
+    <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+
+
+
+    <p class="p">
+      When referring to a column with a complex type (<code class="ph 
codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph 
codeph">MAP</code>)
+      in a query, you use join notation to <span class="q">"unpack"</span> the 
scalar fields of the struct, the elements of the array, or
+      the key-value pairs of the map. (The join notation is not required for 
aggregation operations, such as
+      <code class="ph codeph">COUNT()</code> or <code class="ph 
codeph">SUM()</code> for array elements.) Because Impala recognizes which 
complex type elements are associated with which row
+      of the result set, you use the same syntax as for a cross or cartesian 
join, without an explicit join condition.
+      See <a class="xref" 
href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or 
higher only)</a> for details about Impala support for complex types.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      You typically use join queries in situations like these:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        When related data arrives from different sources, with each data set 
physically residing in a separate
+        table. For example, you might have address data from business records 
that you cross-check against phone
+        listings or census data.
+        <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+          Impala can join tables of different file formats, including 
Impala-managed tables and HBase tables. For
+          example, you might keep small dimension tables in HBase, for 
convenience of single-row lookups and
+          updates, and for the larger fact tables use Parquet or other binary 
file format optimized for scan
+          operations. Then, you can issue a join query to cross-reference the 
fact tables with the dimension
+          tables.
+        </div>
+      </li>
+
+      <li class="li">
+        When data is normalized, a technique for reducing data duplication by 
dividing it across multiple tables.
+        This kind of organization is often found in data that comes from 
traditional relational database systems.
+        For example, instead of repeating some long string such as a customer 
name in multiple tables, each table
+        might contain a numeric customer ID. Queries that need to display the 
customer name could <span class="q">"join"</span> the
+        table that specifies which customer ID corresponds to which name.
+      </li>
+
+      <li class="li">
+        When certain columns are rarely needed for queries, so they are moved 
into separate tables to reduce
+        overhead for common queries. For example, a <code class="ph 
codeph">biography</code> field might be rarely needed in
+        queries on employee data. Putting that field in a separate table 
reduces the amount of I/O for common
+        queries on employee addresses or phone numbers. Queries that do need 
the <code class="ph codeph">biography</code> column
+        can retrieve it by performing a join with that separate table.
+      </li>
+
+      <li class="li">
+        In <span class="keyword">Impala 2.3</span> or higher, when referring 
to complex type columns in queries.
+        See <a class="xref" 
href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or 
higher only)</a> for details.
+      </li>
+    </ul>
+
+    <p class="p">
+      When comparing columns with the same names in <code class="ph 
codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses, use the
+      fully qualified names such as <code class="ph codeph"><var 
class="keyword varname">db_name</var>.<var class="keyword 
varname">table_name</var></code>, or
+      assign table aliases, column aliases, or both to make the code more 
compact and understandable:
+    </p>
+
+<pre class="pre codeblock"><code>select t1.c1 as first_id, t2.c2 as second_id 
from
+  t1 join t2 on first_id = second_id;
+
+select fact.custno, dimension.custno from
+  customer_data as fact join customer_address as dimension
+  using (custno)</code></pre>
+
+    <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+      <p class="p">
+        Performance for join queries is a crucial aspect for Impala, because 
complex join queries are
+        resource-intensive operations. An efficient join query produces much 
less network traffic and CPU overhead
+        than an inefficient one. For best results:
+      </p>
+      <ul class="ul">
+        <li class="li">
+          Make sure that both <a class="xref" 
href="impala_perf_stats.html#perf_stats">table and column statistics</a> are
+          available for all the tables involved in a join query, and 
especially for the columns referenced in any
+          join conditions. Impala uses the statistics to automatically deduce 
an efficient join order.
+          Use <a class="xref" href="impala_show.html#show"><code class="ph 
codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> 
and
+          <code class="ph codeph">SHOW COLUMN STATS <var class="keyword 
varname">table_name</var></code></a> to check if statistics are
+          already present. Issue the <code class="ph codeph">COMPUTE STATS 
<var class="keyword varname">table_name</var></code> for a nonpartitioned table,
+          or (in Impala 2.1.0 and higher) <code class="ph codeph">COMPUTE 
INCREMENTAL STATS <var class="keyword varname">table_name</var></code>
+          for a partitioned table, to collect the initial statistics at both 
the table and column levels, and to keep the
+          statistics up to date after any substantial <code class="ph 
codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> operations.
+        </li>
+
+        <li class="li">
+          If table or column statistics are not available, join the largest 
table first. You can check the
+          existence of statistics with the <code class="ph codeph">SHOW TABLE 
STATS <var class="keyword varname">table_name</var></code> and
+          <code class="ph codeph">SHOW COLUMN STATS <var class="keyword 
varname">table_name</var></code> statements.
+        </li>
+
+        <li class="li">
+          If table or column statistics are not available, join subsequent 
tables according to which table has the
+          most selective filter, based on overall size and <code class="ph 
codeph">WHERE</code> clauses. Joining the table with
+          the most selective filter results in the fewest number of rows being 
returned.
+        </li>
+      </ul>
+      <p class="p">
+        For more information and examples of performance for join queries, see
+        <a class="xref" href="impala_perf_joins.html#perf_joins">Performance 
Considerations for Join Queries</a>.
+      </p>
+    </div>
+
+    <p class="p">
+      To control the result set from a join query, include the names of 
corresponding column names in both tables
+      in an <code class="ph codeph">ON</code> or <code class="ph 
codeph">USING</code> clause, or by coding equality comparisons for those
+      columns in the <code class="ph codeph">WHERE</code> clause.
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select c_last_name, 
ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
++-------------+-----------------+
+| c_last_name | ca_city         |
++-------------+-----------------+
+| Lewis       | Fairfield       |
+| Moses       | Fairview        |
+| Hamilton    | Pleasant Valley |
+| White       | Oak Ridge       |
+| Moran       | Glendale        |
+...
+| Richards    | Lakewood         |
+| Day         | Lebanon          |
+| Painter     | Oak Hill         |
+| Bentley     | Greenfield       |
+| Jones       | Stringtown       |
++-------------+------------------+
+Returned 50000 row(s) in 9.82s</code></pre>
+
+    <p class="p">
+      One potential downside of joins is the possibility of excess resource 
usage in poorly constructed queries.
+      Impala imposes restrictions on join queries to guard against such 
issues. To minimize the chance of runaway
+      queries on large data sets, Impala requires every join query to contain 
at least one equality predicate
+      between the columns of the various tables. For example, if <code 
class="ph codeph">T1</code> contains 1000 rows and
+      <code class="ph codeph">T2</code> contains 1,000,000 rows, a query <code 
class="ph codeph">SELECT <var class="keyword varname">columns</var> FROM t1 JOIN
+      t2</code> could return up to 1 billion rows (1000 * 1,000,000); Impala 
requires that the query include a
+      clause such as <code class="ph codeph">ON t1.c1 = t2.c2</code> or <code 
class="ph codeph">WHERE t1.c1 = t2.c2</code>.
+    </p>
+
+    <p class="p">
+      Because even with equality clauses, the result set can still be large, 
as we saw in the previous example, you
+      might use a <code class="ph codeph">LIMIT</code> clause to return a 
subset of the results:
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select c_last_name, 
ca_city from customer, customer_address where c_customer_sk = ca_address_sk 
limit 10;
++-------------+-----------------+
+| c_last_name | ca_city         |
++-------------+-----------------+
+| Lewis       | Fairfield       |
+| Moses       | Fairview        |
+| Hamilton    | Pleasant Valley |
+| White       | Oak Ridge       |
+| Moran       | Glendale        |
+| Sharp       | Lakeview        |
+| Wiles       | Farmington      |
+| Shipman     | Union           |
+| Gilbert     | New Hope        |
+| Brunson     | Martinsville    |
++-------------+-----------------+
+Returned 10 row(s) in 0.63s</code></pre>
+
+    <p class="p">
+      Or you might use additional comparison operators or aggregation 
functions to condense a large result set into
+      a smaller set of values:
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; -- Find the names of 
customers who live in one particular town.
+[localhost:21000] &gt; select distinct c_last_name from customer, 
customer_address where
+  c_customer_sk = ca_address_sk
+  and ca_city = "Green Acres";
++---------------+
+| c_last_name   |
++---------------+
+| Hensley       |
+| Pearson       |
+| Mayer         |
+| Montgomery    |
+| Ricks         |
+...
+| Barrett       |
+| Price         |
+| Hill          |
+| Hansen        |
+| Meeks         |
++---------------+
+Returned 332 row(s) in 0.97s
+
+[localhost:21000] &gt; -- See how many different customers in this town have 
names starting with "A".
+[localhost:21000] &gt; select count(distinct c_last_name) from customer, 
customer_address where
+  c_customer_sk = ca_address_sk
+  and ca_city = "Green Acres"
+  and substr(c_last_name,1,1) = "A";
++-----------------------------+
+| count(distinct c_last_name) |
++-----------------------------+
+| 12                          |
++-----------------------------+
+Returned 1 row(s) in 1.00s</code></pre>
+
+    <p class="p">
+      Because a join query can involve reading large amounts of data from 
disk, sending large amounts of data
+      across the network, and loading large amounts of data into memory to do 
the comparisons and filtering, you
+      might do benchmarking, performance analysis, and query tuning to find 
the most efficient join queries for
+      your data set, hardware capacity, network configuration, and cluster 
workload.
+    </p>
+
+    <p class="p">
+      The two categories of joins in Impala are known as <strong class="ph 
b">partitioned joins</strong> and <strong class="ph b">broadcast 
joins</strong>. If
+      inaccurate table or column statistics, or some quirk of the data 
distribution, causes Impala to choose the
+      wrong mechanism for a particular join, consider using query hints as a 
temporary workaround. For details, see
+      <a class="xref" href="impala_hints.html#hints">Query Hints in Impala 
SELECT Statements</a>.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Handling NULLs in Join Columns:</strong>
+    </p>
+
+    <p class="p">
+      By default, join key columns do not match if either one contains a <code 
class="ph codeph">NULL</code> value.
+      To treat such columns as equal if both contain <code class="ph 
codeph">NULL</code>, you can use an expression
+      such as <code class="ph codeph">A = B OR (A IS NULL AND B IS 
NULL)</code>.
+      In <span class="keyword">Impala 2.5</span> and higher, the <code 
class="ph codeph">&lt;=&gt;</code> operator (shorthand for
+      <code class="ph codeph">IS NOT DISTINCT FROM</code>) performs the same 
comparison in a concise and efficient form.
+      The <code class="ph codeph">&lt;=&gt;</code> operator is more efficient 
in for comparing join keys in a <code class="ph codeph">NULL</code>-safe
+      manner, because the operator can use a hash join while the <code 
class="ph codeph">OR</code> expression cannot.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <div class="p">
+      The following examples refer to these simple tables containing small 
sets of integers:
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table t1 (x 
int);
+[localhost:21000] &gt; insert into t1 values (1), (2), (3), (4), (5), (6);
+
+[localhost:21000] &gt; create table t2 (y int);
+[localhost:21000] &gt; insert into t2 values (2), (4), (6);
+
+[localhost:21000] &gt; create table t3 (z int);
+[localhost:21000] &gt; insert into t3 values (1), (3), (5);
+</code></pre>
+    </div>
+
+
+
+    <p class="p">
+      The following example demonstrates an anti-join, returning the values 
from <code class="ph codeph">T1</code> that do not
+      exist in <code class="ph codeph">T2</code> (in this case, the odd 
numbers 1, 3, and 5):
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select x from t1 left 
anti join t2 on (t1.x = t2.y);
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 5 |
++---+
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Related information:</strong>
+      </p>
+
+    <p class="p">
+      See these tutorials for examples of different kinds of joins:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        <a class="xref" href="impala_tutorial.html#tut_cross_join">Cross Joins 
and Cartesian Products with the CROSS JOIN Operator</a>
+      </li>
+    </ul>
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_select.html">SELECT 
Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_kerberos.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_kerberos.html 
b/docs/build/html/topics/impala_kerberos.html
new file mode 100644
index 0000000..858ff87
--- /dev/null
+++ b/docs/build/html/topics/impala_kerberos.html
@@ -0,0 +1,342 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_authentication.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" 
content="Impala"><meta name="version" content="Impala 2.8.x"><meta 
name="version" content="Impala 2.8.x"><meta name="version" content="Impala 
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="version" content="Impala 
 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" 
content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta 
name="DC.Identifier" content="kerberos"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>Enabling Kerberos Authentication for 
Impala</title></head><body id="kerberos"><main role="main"><article 
role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Enabling Kerberos 
Authentication for Impala</h1>
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      Impala supports an enterprise-grade authentication system called 
Kerberos. Kerberos provides strong security benefits including
+      capabilities that render intercepted authentication packets unusable by 
an attacker. It virtually eliminates the threat of
+      impersonation by never sending a user's credentials in cleartext over 
the network. For more information on Kerberos, visit
+      the <a class="xref" href="https://web.mit.edu/kerberos/"; 
target="_blank">MIT Kerberos website</a>.
+    </p>
+
+    <p class="p">
+      The rest of this topic assumes you have a working <a class="xref" 
href="https://web.mit.edu/kerberos/krb5-latest/doc/admin/install_kdc.html"; 
target="_blank">Kerberos Key Distribution Center (KDC)</a>
+      set up. To enable Kerberos, you first create a Kerberos principal for 
each host running
+      <span class="keyword cmdname">impalad</span> or <span class="keyword 
cmdname">statestored</span>.
+    </p>
+
+    <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+      Regardless of the authentication mechanism used, Impala always creates 
HDFS directories and data files
+      owned by the same user (typically <code class="ph 
codeph">impala</code>). To implement user-level access to different
+      databases, tables, columns, partitions, and so on, use the Sentry 
authorization feature, as explained in
+      <a class="xref" 
href="../shared/../topics/impala_authorization.html#authorization">Enabling 
Sentry Authorization for Impala</a>.
+    </div>
+
+    <p class="p">
+      An alternative form of authentication you can use is LDAP, described in 
<a class="xref" href="impala_ldap.html#ldap">Enabling LDAP Authentication for 
Impala</a>.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+  </div>
+
+  <nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_authentication.html">Impala 
Authentication</a></div></div></nav><article class="topic concept nested1" 
aria-labelledby="ariaid-title2" id="kerberos__kerberos_prereqs">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Requirements for Using 
Impala with Kerberos</h2>
+  
+
+    <div class="body conbody">
+
+      <div class="p">
+        On version 5 of Red Hat Enterprise Linux and comparable distributions, 
some additional setup is needed for
+        the <span class="keyword cmdname">impala-shell</span> interpreter to 
connect to a Kerberos-enabled Impala cluster:
+<pre class="pre codeblock"><code>sudo yum install python-devel openssl-devel 
python-pip
+sudo pip-python install ssl</code></pre>
+      </div>
+
+      <div class="note important note_important"><span class="note__title 
importanttitle">Important:</span> 
+        <p class="p">
+          If you plan to use Impala in your cluster, you must configure your 
KDC to allow tickets to be renewed,
+          and you must configure <span class="ph filepath">krb5.conf</span> to 
request renewable tickets. Typically, you can do
+          this by adding the <code class="ph codeph">max_renewable_life</code> 
setting to your realm in
+          <span class="ph filepath">kdc.conf</span>, and by adding the <span 
class="ph filepath">renew_lifetime</span> parameter to the
+          <span class="ph filepath">libdefaults</span> section of <span 
class="ph filepath">krb5.conf</span>. For more information about
+          renewable tickets, see the
+          <a class="xref" href="http://web.mit.edu/Kerberos/krb5-1.8/"; 
target="_blank"> Kerberos
+          documentation</a>.
+        </p>
+        <p class="p">
+          Currently, you cannot use the resource management feature on a 
cluster that has Kerberos
+          authentication enabled.
+        </p>
+      </div>
+
+      <p class="p">
+        Start all <span class="keyword cmdname">impalad</span> and <span 
class="keyword cmdname">statestored</span> daemons with the
+        <code class="ph codeph">--principal</code> and <code class="ph 
codeph">--keytab-file</code> flags set to the principal and full path
+        name of the <code class="ph codeph">keytab</code> file containing the 
credentials for the principal.
+      </p>
+
+      <p class="p">
+        To enable Kerberos in the Impala shell, start the <span class="keyword 
cmdname">impala-shell</span> command using the
+        <code class="ph codeph">-k</code> flag.
+      </p>
+
+      <p class="p">
+        To enable Impala to work with Kerberos security on your Hadoop 
cluster, make sure you perform the
+        installation and configuration steps in
+        <a class="xref" 
href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html#Authentication";
 target="_blank">Authentication in Hadoop</a>.
+        Note that when Kerberos security is enabled in Impala, a web browser 
that
+        supports Kerberos HTTP SPNEGO is required to access the Impala web 
console (for example, Firefox, Internet
+        Explorer, or Chrome).
+      </p>
+
+      <p class="p">
+        If the NameNode, Secondary NameNode, DataNode, JobTracker, 
TaskTrackers, ResourceManager, NodeManagers,
+        HttpFS, Oozie, Impala, or Impala statestore services are configured to 
use Kerberos HTTP SPNEGO
+        authentication, and two or more of these services are running on the 
same host, then all of the running
+        services must use the same HTTP principal and keytab file used for 
their HTTP endpoints.
+      </p>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" 
id="kerberos__kerberos_config">
+
+    <h2 class="title topictitle2" id="ariaid-title3">Configuring Impala to 
Support Kerberos Security</h2>
+  
+
+    <div class="body conbody">
+
+      <p class="p">
+        Enabling Kerberos authentication for Impala involves steps that can be 
summarized as follows:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          Creating service principals for Impala and the HTTP service. 
Principal names take the form:
+          <code class="ph codeph"><var class="keyword 
varname">serviceName</var>/<var class="keyword 
varname">fully.qualified.domain.name</var>@<var class="keyword 
varname">KERBEROS.REALM</var></code>.
+          <p class="p">
+        In Impala 2.0 and later, <code class="ph codeph">user()</code> returns 
the full Kerberos principal string, such as
+        <code class="ph codeph">[email protected]</code>, in a Kerberized 
environment.
+      </p>
+        </li>
+
+        <li class="li">
+          Creating, merging, and distributing key tab files for these 
principals.
+        </li>
+
+        <li class="li">
+          Editing <code class="ph codeph">/etc/default/impala</code>
+          to accommodate Kerberos authentication.
+        </li>
+      </ul>
+    </div>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title4" 
id="kerberos_config__kerberos_setup">
+
+      <h3 class="title topictitle3" id="ariaid-title4">Enabling Kerberos for 
Impala</h3>
+
+      <div class="body conbody">
+
+
+
+        <ol class="ol">
+          <li class="li">
+            Create an Impala service principal, specifying the name of the OS 
user that the Impala daemons run
+            under, the fully qualified domain name of each node running <span 
class="keyword cmdname">impalad</span>, and the realm
+            name. For example:
+<pre class="pre codeblock"><code>$ kadmin
+kadmin: addprinc -requires_preauth -randkey 
impala/[email protected]</code></pre>
+          </li>
+
+          <li class="li">
+            Create an HTTP service principal. For example:
+<pre class="pre codeblock"><code>kadmin: addprinc -randkey 
HTTP/[email protected]</code></pre>
+            <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+              The <code class="ph codeph">HTTP</code> component of the service 
principal must be uppercase as shown in the
+              preceding example.
+            </div>
+          </li>
+
+          <li class="li">
+            Create <code class="ph codeph">keytab</code> files with both 
principals. For example:
+<pre class="pre codeblock"><code>kadmin: xst -k impala.keytab 
impala/impala_host.example.com
+kadmin: xst -k http.keytab HTTP/impala_host.example.com
+kadmin: quit</code></pre>
+          </li>
+
+          <li class="li">
+            Use <code class="ph codeph">ktutil</code> to read the contents of 
the two keytab files and then write those contents
+            to a new file. For example:
+<pre class="pre codeblock"><code>$ ktutil
+ktutil: rkt impala.keytab
+ktutil: rkt http.keytab
+ktutil: wkt impala-http.keytab
+ktutil: quit</code></pre>
+          </li>
+
+          <li class="li">
+            (Optional) Test that credentials in the merged keytab file are 
valid, and that the <span class="q">"renew until"</span>
+            date is in the future. For example:
+<pre class="pre codeblock"><code>$ klist -e -k -t 
impala-http.keytab</code></pre>
+          </li>
+
+          <li class="li">
+            Copy the <span class="ph filepath">impala-http.keytab</span> file 
to the Impala configuration directory. Change the
+            permissions to be only read for the file owner and change the file 
owner to the <code class="ph codeph">impala</code>
+            user. By default, the Impala user and group are both named <code 
class="ph codeph">impala</code>. For example:
+<pre class="pre codeblock"><code>$ cp impala-http.keytab /etc/impala/conf
+$ cd /etc/impala/conf
+$ chmod 400 impala-http.keytab
+$ chown impala:impala impala-http.keytab</code></pre>
+          </li>
+
+          <li class="li">
+            Add Kerberos options to the Impala defaults file, <span class="ph 
filepath">/etc/default/impala</span>. Add the
+            options for both the <span class="keyword cmdname">impalad</span> 
and <span class="keyword cmdname">statestored</span> daemons, using the
+            <code class="ph codeph">IMPALA_SERVER_ARGS</code> and <code 
class="ph codeph">IMPALA_STATE_STORE_ARGS</code> variables. For
+            example, you might add:
+
+<pre class="pre codeblock"><code>-kerberos_reinit_interval=60
+-principal=impala_1/[email protected]
+-keytab_file=<var class="keyword 
varname">/path/to/impala.keytab</var></code></pre>
+            <p class="p">
+              For more information on changing the Impala defaults specified in
+              <span class="ph filepath">/etc/default/impala</span>, see
+              <a class="xref" 
href="impala_config_options.html#config_options">Modifying Impala Startup
+              Options</a>.
+            </p>
+          </li>
+        </ol>
+
+        <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+          Restart <span class="keyword cmdname">impalad</span> and <span 
class="keyword cmdname">statestored</span> for these configuration changes to
+          take effect.
+        </div>
+      </div>
+    </article>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title5" 
id="kerberos__kerberos_proxy">
+
+    <h2 class="title topictitle2" id="ariaid-title5">Enabling Kerberos for 
Impala with a Proxy Server</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        A common configuration for Impala with High Availability is to use a 
proxy server to submit requests to the
+        actual <span class="keyword cmdname">impalad</span> daemons on 
different hosts in the cluster. This configuration avoids
+        connection problems in case of machine failure, because the proxy 
server can route new requests through one
+        of the remaining hosts in the cluster. This configuration also helps 
with load balancing, because the
+        additional overhead of being the <span class="q">"coordinator 
node"</span> for each query is spread across multiple hosts.
+      </p>
+
+      <p class="p">
+        Although you can set up a proxy server with or without Kerberos 
authentication, typically users set up a
+        secure Kerberized configuration. For information about setting up a 
proxy server for Impala, including
+        Kerberos-specific steps, see <a class="xref" 
href="impala_proxy.html#proxy">Using Impala through a Proxy for High 
Availability</a>.
+      </p>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title6" 
id="kerberos__spnego">
+
+    <h2 class="title topictitle2" id="ariaid-title6">Using a Web Browser to 
Access a URL Protected by Kerberos HTTP SPNEGO</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Your web browser must support Kerberos HTTP SPNEGO. For example, 
Chrome, Firefox, or Internet Explorer.
+      </p>
+
+      <p class="p">
+        <strong class="ph b">To configure Firefox to access a URL protected by 
Kerberos HTTP SPNEGO:</strong>
+      </p>
+
+      <ol class="ol">
+        <li class="li">
+          Open the advanced settings Firefox configuration page by loading the 
<code class="ph codeph">about:config</code> page.
+        </li>
+
+        <li class="li">
+          Use the <strong class="ph b">Filter</strong> text box to find <code 
class="ph codeph">network.negotiate-auth.trusted-uris</code>.
+        </li>
+
+        <li class="li">
+          Double-click the <code class="ph 
codeph">network.negotiate-auth.trusted-uris</code> preference and enter the 
hostname
+          or the domain of the web server that is protected by Kerberos HTTP 
SPNEGO. Separate multiple domains and
+          hostnames with a comma.
+        </li>
+
+        <li class="li">
+          Click <strong class="ph b">OK</strong>.
+        </li>
+      </ol>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title7" 
id="kerberos__kerberos_delegation">
+    <h2 class="title topictitle2" id="ariaid-title7">Enabling Impala 
Delegation for Kerberos Users</h2>
+    <div class="body conbody">
+      <p class="p">
+        See <a class="xref" 
href="impala_delegation.html#delegation">Configuring Impala Delegation for Hue 
and BI Tools</a> for details about the delegation feature
+        that lets certain users submit queries using the credentials of other 
users.
+      </p>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title8" 
id="kerberos__ssl_jdbc_odbc">
+    <h2 class="title topictitle2" id="ariaid-title8">Using TLS/SSL with 
Business Intelligence Tools</h2>
+    <div class="body conbody">
+      <p class="p">
+        You can use Kerberos authentication, TLS/SSL encryption, or both to 
secure
+        connections from JDBC and ODBC applications to Impala.
+        See <a class="xref" href="impala_jdbc.html#impala_jdbc">Configuring 
Impala to Work with JDBC</a> and <a class="xref" 
href="impala_odbc.html#impala_odbc">Configuring Impala to Work with ODBC</a>
+        for details.
+      </p>
+
+      <p class="p">
+        Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver 
did not support connections that use both Kerberos authentication
+        and SSL encryption. If your cluster is running an older release that 
has this restriction,
+        use an alternative JDBC driver that supports
+        both of these security features.
+      </p>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title9" 
id="kerberos__whitelisting_internal_apis">
+  <h2 class="title topictitle2" id="ariaid-title9">Enabling Access to Internal 
Impala APIs for Kerberos Users</h2>
+    <div class="body conbody">
+    
+      <p class="p">
+        For applications that need direct access
+        to Impala APIs, without going through the HiveServer2 or Beeswax 
interfaces, you can
+        specify a list of Kerberos users who are allowed to call those APIs. 
By default, the
+        <code class="ph codeph">impala</code> and <code class="ph 
codeph">hdfs</code> users are the only ones authorized
+        for this kind of access.
+        Any users not explicitly authorized through the <code class="ph 
codeph">internal_principals_whitelist</code>
+        configuration setting are blocked from accessing the APIs. This 
setting applies to all the
+        Impala-related daemons, although currently it is primarily used for 
HDFS to control the
+        behavior of the catalog server.
+      </p>
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title10" 
id="kerberos__auth_to_local">
+    <h2 class="title topictitle2" id="ariaid-title10">Mapping Kerberos 
Principals to Short Names for Impala</h2>
+    <div class="body conbody">
+      <div class="p">
+      In <span class="keyword">Impala 2.6</span> and higher, Impala recognizes 
the <code class="ph codeph">auth_to_local</code> setting,
+      specified through the HDFS configuration setting
+      <code class="ph codeph">hadoop.security.auth_to_local</code>.
+      This feature is disabled by default, to avoid an unexpected change in 
security-related behavior.
+      To enable it:
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+            Specify <code class="ph 
codeph">--load_auth_to_local_rules=true</code>
+            in the <span class="keyword cmdname">impalad</span> and <span 
class="keyword cmdname">catalogd</span> configuration settings.
+          </p>
+        </li>
+      </ul>
+    </div>
+    </div>
+  </article>
+
+</article></main></body></html>
\ No newline at end of file

[28/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

Reply via email to