http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_isilon.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_isilon.html b/docs/build/html/topics/impala_isilon.html new file mode 100644 index 0000000..304e8ec --- /dev/null +++ b/docs/build/html/topics/impala_isilon.html @@ -0,0 +1,89 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_isilon"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala with Isilon Storage</title></head><body id="impala_isilon"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Using Impala with Isilon Storage</h1> + + + + + <div class="body conbody"> + + <p class="p"> + + You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS. + This capability allows convenient query access to a storage system where you might already be + managing large volumes of data. The combination of the Impala query engine and Isilon storage is + certified on <span class="keyword">Impala 2.2.4</span> or higher. + </p> + + <div class="p"> + Because the EMC Isilon storage devices use a global value for the block size + rather than a configurable value for each file, the <code class="ph codeph">PARQUET_FILE_SIZE</code> + query option has no effect when Impala inserts data into a table or partition + residing on Isilon storage. Use the <code class="ph codeph">isi</code> command to set the + default block size globally on the Isilon device. For example, to set the + Isilon default block size to 256 MB, the recommended size for Parquet + data files for Impala, issue the following command: +<pre class="pre codeblock"><code>isi hdfs settings modify --default-block-size=256MB</code></pre> + </div> + + <p class="p"> + The typical use case for Impala and Isilon together is to use Isilon for the + default filesystem, replacing HDFS entirely. In this configuration, + when you create a database, table, or partition, the data always resides on + Isilon storage and you do not need to specify any special <code class="ph codeph">LOCATION</code> + attribute. If you do specify a <code class="ph codeph">LOCATION</code> attribute, its value refers + to a path within the Isilon filesystem. + For example: + </p> +<pre class="pre codeblock"><code>-- If the default filesystem is Isilon, all Impala data resides there +-- and all Impala databases and tables are located there. +CREATE TABLE t1 (x INT, s STRING); + +-- You can specify LOCATION for database, table, or partition, +-- using values from the Isilon filesystem. +CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db'; +CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN); +</code></pre> + + <p class="p"> + Impala can write to, delete, and rename data files and database, table, + and partition directories on Isilon storage. Therefore, Impala statements such + as + <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">DROP TABLE</code>, + <code class="ph codeph">CREATE DATABASE</code>, <code class="ph codeph">DROP DATABASE</code>, + <code class="ph codeph">ALTER TABLE</code>, + and + <code class="ph codeph">INSERT</code> work the same with Isilon storage as with HDFS. + </p> + + <p class="p"> + When the Impala spill-to-disk feature is activated by a query that approaches + the memory limit, Impala writes all the temporary data to a local (not Isilon) + storage device. Because the I/O bandwidth for the temporary data depends on + the number of local disks, and clusters using Isilon storage might not have + as many local disks attached, pay special attention on Isilon-enabled clusters + to any queries that use the spill-to-disk feature. Where practical, tune the + queries or allocate extra memory for Impala to avoid spilling. + Although you can specify an Isilon storage device as the destination for + the temporary data for the spill-to-disk feature, that configuration is + not recommended due to the need to transfer the data both ways using remote I/O. + </p> + + <p class="p"> + When tuning Impala queries on HDFS, you typically try to avoid any remote reads. + When the data resides on Isilon storage, all the I/O consists of remote reads. + Do not be alarmed when you see non-zero numbers for remote read measurements + in query profile output. The benefit of the Impala and Isilon integration is + primarily convenience of not having to move or copy large volumes of data to HDFS, + rather than raw query performance. You can increase the performance of Impala + I/O for Isilon systems by increasing the value for the + <code class="ph codeph">--num_remote_hdfs_io_threads</code> startup option for the + <span class="keyword cmdname">impalad</span> daemon. + </p> + + + </div> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_jdbc.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_jdbc.html b/docs/build/html/topics/impala_jdbc.html new file mode 100644 index 0000000..cdb031b --- /dev/null +++ b/docs/build/html/topics/impala_jdbc.html @@ -0,0 +1,326 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_config.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_jdbc"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Configuring Impala to Work with JDBC</title></head><body id="impala_jdbc"><main role="main"><article role="article" aria-labelledby="impala_jdbc__jdbc"> + + <h1 class="title topictitle1" id="impala_jdbc__jdbc">Configuring Impala to Work with JDBC</h1> + + + <div class="body conbody"> + + <p class="p"> + + Impala supports the standard JDBC interface, allowing access from commercial Business Intelligence tools and + custom software written in Java or other programming languages. The JDBC driver allows you to access Impala + from a Java program that you write, or a Business Intelligence or similar tool that uses JDBC to communicate + with various database products. + </p> + + <p class="p"> + Setting up a JDBC connection to Impala involves the following steps: + </p> + + <ul class="ul"> + <li class="li"> + Verifying the communication port where the Impala daemons in your cluster are listening for incoming JDBC + requests. + </li> + + <li class="li"> + Installing the JDBC driver on every system that runs the JDBC-enabled application. + </li> + + <li class="li"> + Specifying a connection string for the JDBC application to access one of the servers running the + <span class="keyword cmdname">impalad</span> daemon, with the appropriate security settings. + </li> + </ul> + + <p class="p toc inpage"></p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_config.html">Managing Impala</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="impala_jdbc__jdbc_port"> + + <h2 class="title topictitle2" id="ariaid-title2">Configuring the JDBC Port</h2> + + <div class="body conbody"> + + <p class="p"> + The default port used by JDBC 2.0 and later (as well as ODBC 2.x) is 21050. Impala server accepts JDBC + connections through this same port 21050 by default. Make sure this port is available for communication + with other hosts on your network, for example, that it is not blocked by firewall software. If your JDBC + client software connects to a different port, specify that alternative port number with the + <code class="ph codeph">--hs2_port</code> option when starting <code class="ph codeph">impalad</code>. See + <a class="xref" href="impala_processes.html#processes">Starting Impala</a> for details about Impala startup options. See + <a class="xref" href="impala_ports.html#ports">Ports Used by Impala</a> for information about all ports used for communication between Impala + and clients or between Impala components. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="impala_jdbc__jdbc_driver_choice"> + + <h2 class="title topictitle2" id="ariaid-title3">Choosing the JDBC Driver</h2> + + + <div class="body conbody"> + + <p class="p"> + In Impala 2.0 and later, you can use the Hive 0.13 JDBC driver. If you are + already using JDBC applications with an earlier Impala release, you should update + your JDBC driver, because the Hive 0.12 driver that was formerly the only choice + is not compatible with Impala 2.0 and later. + </p> + + <p class="p"> + The Hive JDBC driver provides a substantial speed increase for JDBC + applications with Impala 2.0 and higher, for queries that return large result sets. + </p> + + <p class="p"> + <strong class="ph b">Complex type considerations:</strong> + </p> + + <p class="p"> + The Impala complex types (<code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>) + are available in <span class="keyword">Impala 2.3</span> and higher. + To use these types with JDBC requires version 2.5.28 or higher of the JDBC Connector for Impala. + To use these types with ODBC requires version 2.5.30 or higher of the ODBC Connector for Impala. + Consider upgrading all JDBC and ODBC drivers at the same time you upgrade from <span class="keyword">Impala 2.3</span> or higher. + </p> + <p class="p"> + Although the result sets from queries involving complex types consist of all scalar values, + the queries involve join notation and column references that might not be understood by + a particular JDBC or ODBC connector. Consider defining a view that represents the + flattened version of a table containing complex type columns, and pointing the JDBC + or ODBC application at the view. + See <a class="xref" href="../shared/../topics/impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details. + </p> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="impala_jdbc__jdbc_setup"> + + <h2 class="title topictitle2" id="ariaid-title4">Enabling Impala JDBC Support on Client Systems</h2> + + + <div class="body conbody"> + + <section class="section" id="jdbc_setup__install_hive_driver"><h3 class="title sectiontitle">Using the Hive JDBC Driver</h3> + + <p class="p"> + You install the Hive JDBC driver (<code class="ph codeph">hive-jdbc</code> package) through the Linux package manager, on + hosts within the cluster. The driver consists of several Java JAR files. The same driver can be used by Impala and Hive. + </p> + + <p class="p"> + To get the JAR files, install the Hive JDBC driver on each host in the cluster that will run + JDBC applications. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance improvements for + Impala queries that return large result sets. Impala 2.0 and later are compatible with the Hive 0.13 + driver. If you already have an older JDBC driver installed, and are running Impala 2.0 or higher, consider + upgrading to the latest Hive JDBC driver for best performance with JDBC applications. + </div> + + <p class="p"> + If you are using JDBC-enabled applications on hosts outside the cluster, you cannot use the the same install + procedure on the hosts. Install the JDBC driver on at least one cluster host using the preceding + procedure. Then download the JAR files to each client machine that will use JDBC with Impala: + </p> + + <pre class="pre codeblock"><code>commons-logging-X.X.X.jar + hadoop-common.jar + hive-common-X.XX.X.jar + hive-jdbc-X.XX.X.jar + hive-metastore-X.XX.X.jar + hive-service-X.XX.X.jar + httpclient-X.X.X.jar + httpcore-X.X.X.jar + libfb303-X.X.X.jar + libthrift-X.X.X.jar + log4j-X.X.XX.jar + slf4j-api-X.X.X.jar + slf4j-logXjXX-X.X.X.jar + </code></pre> + + <p class="p"> + <strong class="ph b">To enable JDBC support for Impala on the system where you run the JDBC application:</strong> + </p> + + <ol class="ol"> + <li class="li"> + Download the JAR files listed above to each client machine. + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + For Maven users, see + <a class="xref" href="https://github.com/onefoursix/Cloudera-Impala-JDBC-Example" target="_blank">this sample github page</a> for an example of the + dependencies you could add to a <code class="ph codeph">pom</code> file instead of downloading the individual JARs. + </div> + </li> + + <li class="li"> + Store the JAR files in a location of your choosing, ideally a directory already referenced in your + <code class="ph codeph">CLASSPATH</code> setting. For example: + <ul class="ul"> + <li class="li"> + On Linux, you might use a location such as <code class="ph codeph">/opt/jars/</code>. + </li> + + <li class="li"> + On Windows, you might use a subdirectory underneath <span class="ph filepath">C:\Program Files</span>. + </li> + </ul> + </li> + + <li class="li"> + To successfully load the Impala JDBC driver, client programs must be able to locate the associated JAR + files. This often means setting the <code class="ph codeph">CLASSPATH</code> for the client process to include the + JARs. Consult the documentation for your JDBC client for more details on how to install new JDBC drivers, + but some examples of how to set <code class="ph codeph">CLASSPATH</code> variables include: + <ul class="ul"> + <li class="li"> + On Linux, if you extracted the JARs to <code class="ph codeph">/opt/jars/</code>, you might issue the following + command to prepend the JAR files path to an existing classpath: + <pre class="pre codeblock"><code>export CLASSPATH=/opt/jars/*.jar:$CLASSPATH</code></pre> + </li> + + <li class="li"> + On Windows, use the <strong class="ph b">System Properties</strong> control panel item to modify the <strong class="ph b">Environment + Variables</strong> for your system. Modify the environment variables to include the path to which you + extracted the files. + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + If the existing <code class="ph codeph">CLASSPATH</code> on your client machine refers to some older version of + the Hive JARs, ensure that the new JARs are the first ones listed. Either put the new JAR files + earlier in the listings, or delete the other references to Hive JAR files. + </div> + </li> + </ul> + </li> + </ol> + </section> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="impala_jdbc__jdbc_connect"> + + <h2 class="title topictitle2" id="ariaid-title5">Establishing JDBC Connections</h2> + + <div class="body conbody"> + + <p class="p"> + The JDBC driver class depends on which driver you select. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + If your JDBC or ODBC application connects to Impala through a load balancer such as + <code class="ph codeph">haproxy</code>, be cautious about reusing the connections. If the load balancer has set up + connection timeout values, either check the connection frequently so that it never sits idle longer than + the load balancer timeout value, or check the connection validity before using it and create a new one if + the connection has been closed. + </div> + + <section class="section" id="jdbc_connect__class_hive_driver"><h3 class="title sectiontitle">Using the Hive JDBC Driver</h3> + + + <p class="p"> + For example, with the Hive JDBC driver, the class name is <code class="ph codeph">org.apache.hive.jdbc.HiveDriver</code>. + Once you have configured Impala to work with JDBC, you can establish connections between the two. + To do so for a cluster that does not use + Kerberos authentication, use a connection string of the form + <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/;auth=noSasl</code>. + + For example, you might use: + </p> + +<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/;auth=noSasl</code></pre> + + <p class="p"> + To connect to an instance of Impala that requires Kerberos authentication, use a connection string of the + form + <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/;principal=<var class="keyword varname">principal_name</var></code>. + The principal must be the same user principal you used when starting Impala. For example, you might use: + </p> + +<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/;principal=impala/[email protected]</code></pre> + + <p class="p"> + To connect to an instance of Impala that requires LDAP authentication, use a connection string of the form + <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/<var class="keyword varname">db_name</var>;user=<var class="keyword varname">ldap_userid</var>;password=<var class="keyword varname">ldap_password</var></code>. + For example, you might use: + </p> + +<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver did not support connections that use both Kerberos authentication + and SSL encryption. If your cluster is running an older release that has this restriction, + use an alternative JDBC driver that supports + both of these security features. + </p> + </div> + + </section> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="impala_jdbc__jdbc_odbc_notes"> + <h2 class="title topictitle2" id="ariaid-title6">Notes about JDBC and ODBC Interaction with Impala SQL Features</h2> + <div class="body conbody"> + <p class="p"> + Most Impala SQL features work equivalently through the <span class="keyword cmdname">impala-shell</span> interpreter + of the JDBC or ODBC APIs. The following are some exceptions to keep in mind when switching between + the interactive shell and applications using the APIs: + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + <strong class="ph b">Complex type considerations:</strong> + </p> + <ul class="ul"> + <li class="li"> + <p class="p"> + Queries involving the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>) + require notation that might not be available in all levels of JDBC and ODBC drivers. + If you have trouble querying such a table due to the driver level or + inability to edit the queries used by the application, you can create a view that exposes + a <span class="q">"flattened"</span> version of the complex columns and point the application at the view. + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details. + </p> + </li> + <li class="li"> + <p class="p"> + The complex types available in <span class="keyword">Impala 2.3</span> and higher are supported by the + JDBC <code class="ph codeph">getColumns()</code> API. + Both <code class="ph codeph">MAP</code> and <code class="ph codeph">ARRAY</code> are reported as the JDBC SQL Type <code class="ph codeph">ARRAY</code>, + because this is the closest matching Java SQL type. This behavior is consistent with Hive. + <code class="ph codeph">STRUCT</code> types are reported as the JDBC SQL Type <code class="ph codeph">STRUCT</code>. + </p> + <div class="p"> + To be consistent with Hive's behavior, the TYPE_NAME field is populated + with the primitive type name for scalar types, and with the full <code class="ph codeph">toSql()</code> + for complex types. The resulting type names are somewhat inconsistent, + because nested types are printed differently than top-level types. For example, + the following list shows how <code class="ph codeph">toSQL()</code> for Impala types are + translated to <code class="ph codeph">TYPE_NAME</code> values: +<pre class="pre codeblock"><code>DECIMAL(10,10) becomes DECIMAL +CHAR(10) becomes CHAR +VARCHAR(10) becomes VARCHAR +ARRAY<DECIMAL(10,10)> becomes ARRAY<DECIMAL(10,10)> +ARRAY<CHAR(10)> becomes ARRAY<CHAR(10)> +ARRAY<VARCHAR(10)> becomes ARRAY<VARCHAR(10)> + +</code></pre> + </div> + </li> + </ul> + </li> + </ul> + </div> + </article> + +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_joins.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_joins.html b/docs/build/html/topics/impala_joins.html new file mode 100644 index 0000000..436f9f5 --- /dev/null +++ b/docs/build/html/topics/impala_joins.html @@ -0,0 +1,531 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="joins"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Joins in Impala SELECT Statements</title></head><body id="joins"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Joins in Impala SELECT Statements</h1> + + + + <div class="body conbody"> + + <p class="p"> + + A join query is a <code class="ph codeph">SELECT</code> statement that combines data from two or more tables, + and returns a result set containing items from some or all of those tables. It is a way to + cross-reference and correlate related data that is organized into multiple tables, typically + using identifiers that are repeated in each of the joined tables. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + Impala supports a wide variety of <code class="ph codeph">JOIN</code> clauses. Left, right, semi, full, and outer joins + are supported in all Impala versions. The <code class="ph codeph">CROSS JOIN</code> operator is available in Impala 1.2.2 + and higher. During performance tuning, you can override the reordering of join clauses that Impala does + internally by including the keyword <code class="ph codeph">STRAIGHT_JOIN</code> immediately after the + <code class="ph codeph">SELECT</code> keyword + </p> + +<pre class="pre codeblock"><code>SELECT <var class="keyword varname">select_list</var> FROM + <var class="keyword varname">table_or_subquery1</var> [INNER] JOIN <var class="keyword varname">table_or_subquery2</var> | + <var class="keyword varname">table_or_subquery1</var> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <var class="keyword varname">table_or_subquery2</var> | + <var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} SEMI JOIN <var class="keyword varname">table_or_subquery2</var> | + <span class="ph"><var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} ANTI JOIN <var class="keyword varname">table_or_subquery2</var> |</span> + [ ON <var class="keyword varname">col1</var> = <var class="keyword varname">col2</var> [AND <var class="keyword varname">col3</var> = <var class="keyword varname">col4</var> ...] | + USING (<var class="keyword varname">col1</var> [, <var class="keyword varname">col2</var> ...]) ] + [<var class="keyword varname">other_join_clause</var> ...] +[ WHERE <var class="keyword varname">where_clauses</var> ] + +SELECT <var class="keyword varname">select_list</var> FROM + <var class="keyword varname">table_or_subquery1</var>, <var class="keyword varname">table_or_subquery2</var> [, <var class="keyword varname">table_or_subquery3</var> ...] + [<var class="keyword varname">other_join_clause</var> ...] +WHERE + <var class="keyword varname">col1</var> = <var class="keyword varname">col2</var> [AND <var class="keyword varname">col3</var> = <var class="keyword varname">col4</var> ...] + +SELECT <var class="keyword varname">select_list</var> FROM + <var class="keyword varname">table_or_subquery1</var> CROSS JOIN <var class="keyword varname">table_or_subquery2</var> + [<var class="keyword varname">other_join_clause</var> ...] +[ WHERE <var class="keyword varname">where_clauses</var> ]</code></pre> + + <p class="p"> + <strong class="ph b">SQL-92 and SQL-89 Joins:</strong> + </p> + + <p class="p"> + Queries with the explicit <code class="ph codeph">JOIN</code> keywords are known as SQL-92 style joins, referring to the + level of the SQL standard where they were introduced. The corresponding <code class="ph codeph">ON</code> or + <code class="ph codeph">USING</code> clauses clearly show which columns are used as the join keys in each case: + </p> + +<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong> + <strong class="ph b">ON t1.id = t2.id and t1.type_flag = t2.type_flag</strong> + WHERE t1.c1 > 100; + +SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong> + <strong class="ph b">USING (id, type_flag)</strong> + WHERE t1.c1 > 100;</code></pre> + + <p class="p"> + The <code class="ph codeph">ON</code> clause is a general way to compare columns across the two tables, even if the column + names are different. The <code class="ph codeph">USING</code> clause is a shorthand notation for specifying the join + columns, when the column names are the same in both tables. You can code equivalent <code class="ph codeph">WHERE</code> + clauses that compare the columns, instead of <code class="ph codeph">ON</code> or <code class="ph codeph">USING</code> clauses, but that + practice is not recommended because mixing the join comparisons with other filtering clauses is typically + less readable and harder to maintain. + </p> + + <p class="p"> + Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these + queries, the equality comparisons between columns of the joined tables go in the <code class="ph codeph">WHERE</code> + clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to + accidentally remove a <code class="ph codeph">WHERE</code> clause needed for the join to work correctly. + </p> + +<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1, t2</strong> + WHERE + <strong class="ph b">t1.id = t2.id AND t1.type_flag = t2.type_flag</strong> + AND t1.c1 > 100;</code></pre> + + <p class="p"> + <strong class="ph b">Self-joins:</strong> + </p> + + <p class="p"> + Impala can do self-joins, for example to join on two different columns in the same table to represent + parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the + same table name for both the left-hand and right-hand table, and assign different table aliases to use when + referring to the fully qualified column names: + </p> + +<pre class="pre codeblock"><code>-- Combine fields from both parent and child rows. +SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</code></pre> + + <p class="p"> + <strong class="ph b">Cartesian joins:</strong> + </p> + + <div class="p"> + To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form: +<pre class="pre codeblock"><code>SELECT ... FROM t1 JOIN t2; +SELECT ... FROM t1, t2;</code></pre> + If you intend to join the tables based on common values, add <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> + clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the + <code class="ph codeph">CROSS JOIN</code> keyword as the join operator. The <code class="ph codeph">CROSS JOIN</code> form does not use + any <code class="ph codeph">ON</code> clause, because it produces a result set with all combinations of rows from the + left-hand and right-hand tables. The result set can still be filtered by subsequent <code class="ph codeph">WHERE</code> + clauses. For example: + </div> + +<pre class="pre codeblock"><code>SELECT ... FROM t1 CROSS JOIN t2; +SELECT ... FROM t1 CROSS JOIN t2 WHERE <var class="keyword varname">tests_on_non_join_columns</var>;</code></pre> + + <p class="p"> + <strong class="ph b">Inner and outer joins:</strong> + </p> + + <p class="p"> + An inner join is the most common and familiar type: rows in the result set contain the requested columns from + the appropriate tables, for all combinations of rows where the join columns of the tables have identical + values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to + refer to the column in the select list or other clauses. Impala performs inner joins by default for both + SQL-89 and SQL-92 join syntax: + </p> + +<pre class="pre codeblock"><code>-- The following 3 forms are all equivalent. +SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id; +SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id; +SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</code></pre> + + <p class="p"> + An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there + is no matching data in the table on the other side of the join, the corresponding columns in the result set + are set to <code class="ph codeph">NULL</code>. To perform an outer join, include the <code class="ph codeph">OUTER</code> keyword in the + join operator, along with either <code class="ph codeph">LEFT</code>, <code class="ph codeph">RIGHT</code>, or <code class="ph codeph">FULL</code>: + </p> + +<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id; +SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id; +SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</code></pre> + + <p class="p"> + For outer joins, Impala requires SQL-92 syntax; that is, the <code class="ph codeph">JOIN</code> keyword instead of + comma-separated table names. Impala does not support vendor extensions such as <code class="ph codeph">(+)</code> or + <code class="ph codeph">*=</code> notation for doing outer joins with SQL-89 query syntax. + </p> + + <p class="p"> + <strong class="ph b">Equijoins and Non-Equijoins:</strong> + </p> + + <p class="p"> + By default, Impala requires an equality comparison between the left-hand and right-hand tables, either + through <code class="ph codeph">ON</code>, <code class="ph codeph">USING</code>, or <code class="ph codeph">WHERE</code> clauses. These types of + queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on + the presence of equality tests between columns in the left-hand and right-hand tables. + </p> + + <p class="p"> + In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as + <code class="ph codeph">!=</code> or <code class="ph codeph"><</code> between the join columns. These kinds of queries require care to + avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin + query that produces a result set of acceptable size, you can code the query using the <code class="ph codeph">CROSS + JOIN</code> operator, and add the extra comparisons in the <code class="ph codeph">WHERE</code> clause: + </p> + +<pre class="pre codeblock"><code>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;</code></pre> + + <p class="p"> + In <span class="keyword">Impala 2.3</span> and higher, additional non-equijoin queries are possible due to the addition + of nested loop joins. These queries typically involve <code class="ph codeph">SEMI JOIN</code>, + <code class="ph codeph">ANTI JOIN</code>, or <code class="ph codeph">FULL OUTER JOIN</code> clauses. + Impala sometimes also uses nested loop joins internally when evaluating <code class="ph codeph">OUTER JOIN</code> + queries involving complex type columns. + Query phases involving nested loop joins do not use the spill-to-disk mechanism if they + exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot + specify any query hint to choose between the nested loop join or the original hash join algorithm. + </p> + +<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;</code></pre> + + <p class="p"> + <strong class="ph b">Semi-joins:</strong> + </p> + + <p class="p"> + Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand + table is returned, for rows where there is matching data in the right-hand table, based on comparisons + between join columns in <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses. Only one instance of each row + from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table. + <span class="ph">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns + data from the right-hand table.</span> + </p> + +<pre class="pre codeblock"><code>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</code></pre> + + <p class="p"> + <strong class="ph b">Natural joins (not supported):</strong> + </p> + + <p class="p"> + Impala does not support the <code class="ph codeph">NATURAL JOIN</code> operator, again to avoid inconsistent or huge + result sets. Natural joins do away with the <code class="ph codeph">ON</code> and <code class="ph codeph">USING</code> clauses, and + instead automatically join on all columns with the same names in the left-hand and right-hand tables. This + kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop. + Thus, Impala does not support the <code class="ph codeph">NATURAL JOIN</code> syntax, which can produce different query + results as columns are added to or removed from tables. + </p> + + <p class="p"> + If you do have any queries that use <code class="ph codeph">NATURAL JOIN</code>, make sure to rewrite them with explicit + <code class="ph codeph">USING</code> clauses, because Impala could interpret the <code class="ph codeph">NATURAL</code> keyword as a + table alias: + </p> + +<pre class="pre codeblock"><code>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join, +-- resulting in an error because inner joins require explicit comparisons between columns. +SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2; +ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate. + To perform a Cartesian product between two tables, use a CROSS JOIN. + +-- If you expect the tables to have identically named columns with matching values, +-- list the corresponding column names in a USING clause. +SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</code></pre> + + <p class="p"> + <strong class="ph b">Anti-joins (<span class="keyword">Impala 2.0</span> and higher only):</strong> + </p> + + <p class="p"> + Impala supports the <code class="ph codeph">LEFT ANTI JOIN</code> and <code class="ph codeph">RIGHT ANTI JOIN</code> clauses in + <span class="keyword">Impala 2.0</span> and higher. The <code class="ph codeph">LEFT</code> or <code class="ph codeph">RIGHT</code> + keyword is required for this kind of join. For <code class="ph codeph">LEFT ANTI JOIN</code>, this clause returns those + values from the left-hand table that have no matching value in the right-hand table. <code class="ph codeph">RIGHT ANTI + JOIN</code> reverses the comparison and returns values from the right-hand table. You can express this + negative relationship either through the <code class="ph codeph">ANTI JOIN</code> clause or through a <code class="ph codeph">NOT + EXISTS</code> operator with a subquery. + </p> + + + + <p class="p"> + <strong class="ph b">Complex type considerations:</strong> + </p> + + + + <p class="p"> + When referring to a column with a complex type (<code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>) + in a query, you use join notation to <span class="q">"unpack"</span> the scalar fields of the struct, the elements of the array, or + the key-value pairs of the map. (The join notation is not required for aggregation operations, such as + <code class="ph codeph">COUNT()</code> or <code class="ph codeph">SUM()</code> for array elements.) Because Impala recognizes which complex type elements are associated with which row + of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition. + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details about Impala support for complex types. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + You typically use join queries in situations like these: + </p> + + <ul class="ul"> + <li class="li"> + When related data arrives from different sources, with each data set physically residing in a separate + table. For example, you might have address data from business records that you cross-check against phone + listings or census data. + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For + example, you might keep small dimension tables in HBase, for convenience of single-row lookups and + updates, and for the larger fact tables use Parquet or other binary file format optimized for scan + operations. Then, you can issue a join query to cross-reference the fact tables with the dimension + tables. + </div> + </li> + + <li class="li"> + When data is normalized, a technique for reducing data duplication by dividing it across multiple tables. + This kind of organization is often found in data that comes from traditional relational database systems. + For example, instead of repeating some long string such as a customer name in multiple tables, each table + might contain a numeric customer ID. Queries that need to display the customer name could <span class="q">"join"</span> the + table that specifies which customer ID corresponds to which name. + </li> + + <li class="li"> + When certain columns are rarely needed for queries, so they are moved into separate tables to reduce + overhead for common queries. For example, a <code class="ph codeph">biography</code> field might be rarely needed in + queries on employee data. Putting that field in a separate table reduces the amount of I/O for common + queries on employee addresses or phone numbers. Queries that do need the <code class="ph codeph">biography</code> column + can retrieve it by performing a join with that separate table. + </li> + + <li class="li"> + In <span class="keyword">Impala 2.3</span> or higher, when referring to complex type columns in queries. + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details. + </li> + </ul> + + <p class="p"> + When comparing columns with the same names in <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses, use the + fully qualified names such as <code class="ph codeph"><var class="keyword varname">db_name</var>.<var class="keyword varname">table_name</var></code>, or + assign table aliases, column aliases, or both to make the code more compact and understandable: + </p> + +<pre class="pre codeblock"><code>select t1.c1 as first_id, t2.c2 as second_id from + t1 join t2 on first_id = second_id; + +select fact.custno, dimension.custno from + customer_data as fact join customer_address as dimension + using (custno)</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Performance for join queries is a crucial aspect for Impala, because complex join queries are + resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead + than an inefficient one. For best results: + </p> + <ul class="ul"> + <li class="li"> + Make sure that both <a class="xref" href="impala_perf_stats.html#perf_stats">table and column statistics</a> are + available for all the tables involved in a join query, and especially for the columns referenced in any + join conditions. Impala uses the statistics to automatically deduce an efficient join order. + Use <a class="xref" href="impala_show.html#show"><code class="ph codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> and + <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code></a> to check if statistics are + already present. Issue the <code class="ph codeph">COMPUTE STATS <var class="keyword varname">table_name</var></code> for a nonpartitioned table, + or (in Impala 2.1.0 and higher) <code class="ph codeph">COMPUTE INCREMENTAL STATS <var class="keyword varname">table_name</var></code> + for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the + statistics up to date after any substantial <code class="ph codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> operations. + </li> + + <li class="li"> + If table or column statistics are not available, join the largest table first. You can check the + existence of statistics with the <code class="ph codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> and + <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code> statements. + </li> + + <li class="li"> + If table or column statistics are not available, join subsequent tables according to which table has the + most selective filter, based on overall size and <code class="ph codeph">WHERE</code> clauses. Joining the table with + the most selective filter results in the fewest number of rows being returned. + </li> + </ul> + <p class="p"> + For more information and examples of performance for join queries, see + <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>. + </p> + </div> + + <p class="p"> + To control the result set from a join query, include the names of corresponding column names in both tables + in an <code class="ph codeph">ON</code> or <code class="ph codeph">USING</code> clause, or by coding equality comparisons for those + columns in the <code class="ph codeph">WHERE</code> clause. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk; ++-------------+-----------------+ +| c_last_name | ca_city | ++-------------+-----------------+ +| Lewis | Fairfield | +| Moses | Fairview | +| Hamilton | Pleasant Valley | +| White | Oak Ridge | +| Moran | Glendale | +... +| Richards | Lakewood | +| Day | Lebanon | +| Painter | Oak Hill | +| Bentley | Greenfield | +| Jones | Stringtown | ++-------------+------------------+ +Returned 50000 row(s) in 9.82s</code></pre> + + <p class="p"> + One potential downside of joins is the possibility of excess resource usage in poorly constructed queries. + Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway + queries on large data sets, Impala requires every join query to contain at least one equality predicate + between the columns of the various tables. For example, if <code class="ph codeph">T1</code> contains 1000 rows and + <code class="ph codeph">T2</code> contains 1,000,000 rows, a query <code class="ph codeph">SELECT <var class="keyword varname">columns</var> FROM t1 JOIN + t2</code> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a + clause such as <code class="ph codeph">ON t1.c1 = t2.c2</code> or <code class="ph codeph">WHERE t1.c1 = t2.c2</code>. + </p> + + <p class="p"> + Because even with equality clauses, the result set can still be large, as we saw in the previous example, you + might use a <code class="ph codeph">LIMIT</code> clause to return a subset of the results: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10; ++-------------+-----------------+ +| c_last_name | ca_city | ++-------------+-----------------+ +| Lewis | Fairfield | +| Moses | Fairview | +| Hamilton | Pleasant Valley | +| White | Oak Ridge | +| Moran | Glendale | +| Sharp | Lakeview | +| Wiles | Farmington | +| Shipman | Union | +| Gilbert | New Hope | +| Brunson | Martinsville | ++-------------+-----------------+ +Returned 10 row(s) in 0.63s</code></pre> + + <p class="p"> + Or you might use additional comparison operators or aggregation functions to condense a large result set into + a smaller set of values: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > -- Find the names of customers who live in one particular town. +[localhost:21000] > select distinct c_last_name from customer, customer_address where + c_customer_sk = ca_address_sk + and ca_city = "Green Acres"; ++---------------+ +| c_last_name | ++---------------+ +| Hensley | +| Pearson | +| Mayer | +| Montgomery | +| Ricks | +... +| Barrett | +| Price | +| Hill | +| Hansen | +| Meeks | ++---------------+ +Returned 332 row(s) in 0.97s + +[localhost:21000] > -- See how many different customers in this town have names starting with "A". +[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where + c_customer_sk = ca_address_sk + and ca_city = "Green Acres" + and substr(c_last_name,1,1) = "A"; ++-----------------------------+ +| count(distinct c_last_name) | ++-----------------------------+ +| 12 | ++-----------------------------+ +Returned 1 row(s) in 1.00s</code></pre> + + <p class="p"> + Because a join query can involve reading large amounts of data from disk, sending large amounts of data + across the network, and loading large amounts of data into memory to do the comparisons and filtering, you + might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for + your data set, hardware capacity, network configuration, and cluster workload. + </p> + + <p class="p"> + The two categories of joins in Impala are known as <strong class="ph b">partitioned joins</strong> and <strong class="ph b">broadcast joins</strong>. If + inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the + wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see + <a class="xref" href="impala_hints.html#hints">Query Hints in Impala SELECT Statements</a>. + </p> + + <p class="p"> + <strong class="ph b">Handling NULLs in Join Columns:</strong> + </p> + + <p class="p"> + By default, join key columns do not match if either one contains a <code class="ph codeph">NULL</code> value. + To treat such columns as equal if both contain <code class="ph codeph">NULL</code>, you can use an expression + such as <code class="ph codeph">A = B OR (A IS NULL AND B IS NULL)</code>. + In <span class="keyword">Impala 2.5</span> and higher, the <code class="ph codeph"><=></code> operator (shorthand for + <code class="ph codeph">IS NOT DISTINCT FROM</code>) performs the same comparison in a concise and efficient form. + The <code class="ph codeph"><=></code> operator is more efficient in for comparing join keys in a <code class="ph codeph">NULL</code>-safe + manner, because the operator can use a hash join while the <code class="ph codeph">OR</code> expression cannot. + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <div class="p"> + The following examples refer to these simple tables containing small sets of integers: +<pre class="pre codeblock"><code>[localhost:21000] > create table t1 (x int); +[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6); + +[localhost:21000] > create table t2 (y int); +[localhost:21000] > insert into t2 values (2), (4), (6); + +[localhost:21000] > create table t3 (z int); +[localhost:21000] > insert into t3 values (1), (3), (5); +</code></pre> + </div> + + + + <p class="p"> + The following example demonstrates an anti-join, returning the values from <code class="ph codeph">T1</code> that do not + exist in <code class="ph codeph">T2</code> (in this case, the odd numbers 1, 3, and 5): + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y); ++---+ +| x | ++---+ +| 1 | +| 3 | +| 5 | ++---+ +</code></pre> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + See these tutorials for examples of different kinds of joins: + </p> + + <ul class="ul"> + <li class="li"> + <a class="xref" href="impala_tutorial.html#tut_cross_join">Cross Joins and Cartesian Products with the CROSS JOIN Operator</a> + </li> + </ul> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_kerberos.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_kerberos.html b/docs/build/html/topics/impala_kerberos.html new file mode 100644 index 0000000..858ff87 --- /dev/null +++ b/docs/build/html/topics/impala_kerberos.html @@ -0,0 +1,342 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_authentication.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="kerberos"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Enabling Kerberos Authentication for Impala</title></head><body id="kerberos"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Enabling Kerberos Authentication for Impala</h1> + + + <div class="body conbody"> + + <p class="p"> + Impala supports an enterprise-grade authentication system called Kerberos. Kerberos provides strong security benefits including + capabilities that render intercepted authentication packets unusable by an attacker. It virtually eliminates the threat of + impersonation by never sending a user's credentials in cleartext over the network. For more information on Kerberos, visit + the <a class="xref" href="https://web.mit.edu/kerberos/" target="_blank">MIT Kerberos website</a>. + </p> + + <p class="p"> + The rest of this topic assumes you have a working <a class="xref" href="https://web.mit.edu/kerberos/krb5-latest/doc/admin/install_kdc.html" target="_blank">Kerberos Key Distribution Center (KDC)</a> + set up. To enable Kerberos, you first create a Kerberos principal for each host running + <span class="keyword cmdname">impalad</span> or <span class="keyword cmdname">statestored</span>. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Regardless of the authentication mechanism used, Impala always creates HDFS directories and data files + owned by the same user (typically <code class="ph codeph">impala</code>). To implement user-level access to different + databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained in + <a class="xref" href="../shared/../topics/impala_authorization.html#authorization">Enabling Sentry Authorization for Impala</a>. + </div> + + <p class="p"> + An alternative form of authentication you can use is LDAP, described in <a class="xref" href="impala_ldap.html#ldap">Enabling LDAP Authentication for Impala</a>. + </p> + + <p class="p toc inpage"></p> + + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_authentication.html">Impala Authentication</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="kerberos__kerberos_prereqs"> + + <h2 class="title topictitle2" id="ariaid-title2">Requirements for Using Impala with Kerberos</h2> + + + <div class="body conbody"> + + <div class="p"> + On version 5 of Red Hat Enterprise Linux and comparable distributions, some additional setup is needed for + the <span class="keyword cmdname">impala-shell</span> interpreter to connect to a Kerberos-enabled Impala cluster: +<pre class="pre codeblock"><code>sudo yum install python-devel openssl-devel python-pip +sudo pip-python install ssl</code></pre> + </div> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + <p class="p"> + If you plan to use Impala in your cluster, you must configure your KDC to allow tickets to be renewed, + and you must configure <span class="ph filepath">krb5.conf</span> to request renewable tickets. Typically, you can do + this by adding the <code class="ph codeph">max_renewable_life</code> setting to your realm in + <span class="ph filepath">kdc.conf</span>, and by adding the <span class="ph filepath">renew_lifetime</span> parameter to the + <span class="ph filepath">libdefaults</span> section of <span class="ph filepath">krb5.conf</span>. For more information about + renewable tickets, see the + <a class="xref" href="http://web.mit.edu/Kerberos/krb5-1.8/" target="_blank"> Kerberos + documentation</a>. + </p> + <p class="p"> + Currently, you cannot use the resource management feature on a cluster that has Kerberos + authentication enabled. + </p> + </div> + + <p class="p"> + Start all <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">statestored</span> daemons with the + <code class="ph codeph">--principal</code> and <code class="ph codeph">--keytab-file</code> flags set to the principal and full path + name of the <code class="ph codeph">keytab</code> file containing the credentials for the principal. + </p> + + <p class="p"> + To enable Kerberos in the Impala shell, start the <span class="keyword cmdname">impala-shell</span> command using the + <code class="ph codeph">-k</code> flag. + </p> + + <p class="p"> + To enable Impala to work with Kerberos security on your Hadoop cluster, make sure you perform the + installation and configuration steps in + <a class="xref" href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html#Authentication" target="_blank">Authentication in Hadoop</a>. + Note that when Kerberos security is enabled in Impala, a web browser that + supports Kerberos HTTP SPNEGO is required to access the Impala web console (for example, Firefox, Internet + Explorer, or Chrome). + </p> + + <p class="p"> + If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager, NodeManagers, + HttpFS, Oozie, Impala, or Impala statestore services are configured to use Kerberos HTTP SPNEGO + authentication, and two or more of these services are running on the same host, then all of the running + services must use the same HTTP principal and keytab file used for their HTTP endpoints. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="kerberos__kerberos_config"> + + <h2 class="title topictitle2" id="ariaid-title3">Configuring Impala to Support Kerberos Security</h2> + + + <div class="body conbody"> + + <p class="p"> + Enabling Kerberos authentication for Impala involves steps that can be summarized as follows: + </p> + + <ul class="ul"> + <li class="li"> + Creating service principals for Impala and the HTTP service. Principal names take the form: + <code class="ph codeph"><var class="keyword varname">serviceName</var>/<var class="keyword varname">fully.qualified.domain.name</var>@<var class="keyword varname">KERBEROS.REALM</var></code>. + <p class="p"> + In Impala 2.0 and later, <code class="ph codeph">user()</code> returns the full Kerberos principal string, such as + <code class="ph codeph">[email protected]</code>, in a Kerberized environment. + </p> + </li> + + <li class="li"> + Creating, merging, and distributing key tab files for these principals. + </li> + + <li class="li"> + Editing <code class="ph codeph">/etc/default/impala</code> + to accommodate Kerberos authentication. + </li> + </ul> + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title4" id="kerberos_config__kerberos_setup"> + + <h3 class="title topictitle3" id="ariaid-title4">Enabling Kerberos for Impala</h3> + + <div class="body conbody"> + + + + <ol class="ol"> + <li class="li"> + Create an Impala service principal, specifying the name of the OS user that the Impala daemons run + under, the fully qualified domain name of each node running <span class="keyword cmdname">impalad</span>, and the realm + name. For example: +<pre class="pre codeblock"><code>$ kadmin +kadmin: addprinc -requires_preauth -randkey impala/[email protected]</code></pre> + </li> + + <li class="li"> + Create an HTTP service principal. For example: +<pre class="pre codeblock"><code>kadmin: addprinc -randkey HTTP/[email protected]</code></pre> + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + The <code class="ph codeph">HTTP</code> component of the service principal must be uppercase as shown in the + preceding example. + </div> + </li> + + <li class="li"> + Create <code class="ph codeph">keytab</code> files with both principals. For example: +<pre class="pre codeblock"><code>kadmin: xst -k impala.keytab impala/impala_host.example.com +kadmin: xst -k http.keytab HTTP/impala_host.example.com +kadmin: quit</code></pre> + </li> + + <li class="li"> + Use <code class="ph codeph">ktutil</code> to read the contents of the two keytab files and then write those contents + to a new file. For example: +<pre class="pre codeblock"><code>$ ktutil +ktutil: rkt impala.keytab +ktutil: rkt http.keytab +ktutil: wkt impala-http.keytab +ktutil: quit</code></pre> + </li> + + <li class="li"> + (Optional) Test that credentials in the merged keytab file are valid, and that the <span class="q">"renew until"</span> + date is in the future. For example: +<pre class="pre codeblock"><code>$ klist -e -k -t impala-http.keytab</code></pre> + </li> + + <li class="li"> + Copy the <span class="ph filepath">impala-http.keytab</span> file to the Impala configuration directory. Change the + permissions to be only read for the file owner and change the file owner to the <code class="ph codeph">impala</code> + user. By default, the Impala user and group are both named <code class="ph codeph">impala</code>. For example: +<pre class="pre codeblock"><code>$ cp impala-http.keytab /etc/impala/conf +$ cd /etc/impala/conf +$ chmod 400 impala-http.keytab +$ chown impala:impala impala-http.keytab</code></pre> + </li> + + <li class="li"> + Add Kerberos options to the Impala defaults file, <span class="ph filepath">/etc/default/impala</span>. Add the + options for both the <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">statestored</span> daemons, using the + <code class="ph codeph">IMPALA_SERVER_ARGS</code> and <code class="ph codeph">IMPALA_STATE_STORE_ARGS</code> variables. For + example, you might add: + +<pre class="pre codeblock"><code>-kerberos_reinit_interval=60 +-principal=impala_1/[email protected] +-keytab_file=<var class="keyword varname">/path/to/impala.keytab</var></code></pre> + <p class="p"> + For more information on changing the Impala defaults specified in + <span class="ph filepath">/etc/default/impala</span>, see + <a class="xref" href="impala_config_options.html#config_options">Modifying Impala Startup + Options</a>. + </p> + </li> + </ol> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Restart <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">statestored</span> for these configuration changes to + take effect. + </div> + </div> + </article> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="kerberos__kerberos_proxy"> + + <h2 class="title topictitle2" id="ariaid-title5">Enabling Kerberos for Impala with a Proxy Server</h2> + + <div class="body conbody"> + + <p class="p"> + A common configuration for Impala with High Availability is to use a proxy server to submit requests to the + actual <span class="keyword cmdname">impalad</span> daemons on different hosts in the cluster. This configuration avoids + connection problems in case of machine failure, because the proxy server can route new requests through one + of the remaining hosts in the cluster. This configuration also helps with load balancing, because the + additional overhead of being the <span class="q">"coordinator node"</span> for each query is spread across multiple hosts. + </p> + + <p class="p"> + Although you can set up a proxy server with or without Kerberos authentication, typically users set up a + secure Kerberized configuration. For information about setting up a proxy server for Impala, including + Kerberos-specific steps, see <a class="xref" href="impala_proxy.html#proxy">Using Impala through a Proxy for High Availability</a>. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="kerberos__spnego"> + + <h2 class="title topictitle2" id="ariaid-title6">Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO</h2> + + <div class="body conbody"> + + <p class="p"> + Your web browser must support Kerberos HTTP SPNEGO. For example, Chrome, Firefox, or Internet Explorer. + </p> + + <p class="p"> + <strong class="ph b">To configure Firefox to access a URL protected by Kerberos HTTP SPNEGO:</strong> + </p> + + <ol class="ol"> + <li class="li"> + Open the advanced settings Firefox configuration page by loading the <code class="ph codeph">about:config</code> page. + </li> + + <li class="li"> + Use the <strong class="ph b">Filter</strong> text box to find <code class="ph codeph">network.negotiate-auth.trusted-uris</code>. + </li> + + <li class="li"> + Double-click the <code class="ph codeph">network.negotiate-auth.trusted-uris</code> preference and enter the hostname + or the domain of the web server that is protected by Kerberos HTTP SPNEGO. Separate multiple domains and + hostnames with a comma. + </li> + + <li class="li"> + Click <strong class="ph b">OK</strong>. + </li> + </ol> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="kerberos__kerberos_delegation"> + <h2 class="title topictitle2" id="ariaid-title7">Enabling Impala Delegation for Kerberos Users</h2> + <div class="body conbody"> + <p class="p"> + See <a class="xref" href="impala_delegation.html#delegation">Configuring Impala Delegation for Hue and BI Tools</a> for details about the delegation feature + that lets certain users submit queries using the credentials of other users. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="kerberos__ssl_jdbc_odbc"> + <h2 class="title topictitle2" id="ariaid-title8">Using TLS/SSL with Business Intelligence Tools</h2> + <div class="body conbody"> + <p class="p"> + You can use Kerberos authentication, TLS/SSL encryption, or both to secure + connections from JDBC and ODBC applications to Impala. + See <a class="xref" href="impala_jdbc.html#impala_jdbc">Configuring Impala to Work with JDBC</a> and <a class="xref" href="impala_odbc.html#impala_odbc">Configuring Impala to Work with ODBC</a> + for details. + </p> + + <p class="p"> + Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver did not support connections that use both Kerberos authentication + and SSL encryption. If your cluster is running an older release that has this restriction, + use an alternative JDBC driver that supports + both of these security features. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="kerberos__whitelisting_internal_apis"> + <h2 class="title topictitle2" id="ariaid-title9">Enabling Access to Internal Impala APIs for Kerberos Users</h2> + <div class="body conbody"> + + <p class="p"> + For applications that need direct access + to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can + specify a list of Kerberos users who are allowed to call those APIs. By default, the + <code class="ph codeph">impala</code> and <code class="ph codeph">hdfs</code> users are the only ones authorized + for this kind of access. + Any users not explicitly authorized through the <code class="ph codeph">internal_principals_whitelist</code> + configuration setting are blocked from accessing the APIs. This setting applies to all the + Impala-related daemons, although currently it is primarily used for HDFS to control the + behavior of the catalog server. + </p> + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title10" id="kerberos__auth_to_local"> + <h2 class="title topictitle2" id="ariaid-title10">Mapping Kerberos Principals to Short Names for Impala</h2> + <div class="body conbody"> + <div class="p"> + In <span class="keyword">Impala 2.6</span> and higher, Impala recognizes the <code class="ph codeph">auth_to_local</code> setting, + specified through the HDFS configuration setting + <code class="ph codeph">hadoop.security.auth_to_local</code>. + This feature is disabled by default, to avoid an unexpected change in security-related behavior. + To enable it: + <ul class="ul"> + <li class="li"> + <p class="p"> + Specify <code class="ph codeph">--load_auth_to_local_rules=true</code> + in the <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">catalogd</span> configuration settings. + </p> + </li> + </ul> + </div> + </div> + </article> + +</article></main></body></html> \ No newline at end of file
