[23/51] [partial] incubator-impala git commit: IMPALA-3398: Add docs to main Impala branch.

jbapple Thu, 17 Nov 2016 15:12:23 -0800

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_joins.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_joins.xml b/docs/topics/impala_joins.xml
new file mode 100644
index 0000000..ddd2f2b
--- /dev/null
+++ b/docs/topics/impala_joins.xml
@@ -0,0 +1,534 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="joins">
+
+  <title>Joins in Impala SELECT Statements</title>
+  <titlealts audience="PDF"><navtitle>Joins</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Querying"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">joins</indexterm>
+      A join query is a <codeph>SELECT</codeph> statement that combines data 
from two or more tables,
+      and returns a result set containing items from some or all of those 
tables. It is a way to
+      cross-reference and correlate related data that is organized into 
multiple tables, typically
+      using identifiers that are repeated in each of the joined tables.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/join_types"/>
+
+<codeblock>SELECT <varname>select_list</varname> FROM
+  <varname>table_or_subquery1</varname> [INNER] JOIN 
<varname>table_or_subquery2</varname> |
+  <varname>table_or_subquery1</varname> {LEFT [OUTER] | RIGHT [OUTER] | FULL 
[OUTER]} JOIN <varname>table_or_subquery2</varname> |
+  <varname>table_or_subquery1</varname> {LEFT | RIGHT} SEMI JOIN 
<varname>table_or_subquery2</varname> |
+  <ph rev="2.0.0"><varname>table_or_subquery1</varname> {LEFT | RIGHT} ANTI 
JOIN <varname>table_or_subquery2</varname> |</ph>
+    [ ON <varname>col1</varname> = <varname>col2</varname> [AND 
<varname>col3</varname> = <varname>col4</varname> ...] |
+      USING (<varname>col1</varname> [, <varname>col2</varname> ...]) ]
+  [<varname>other_join_clause</varname> ...]
+[ WHERE <varname>where_clauses</varname> ]
+
+SELECT <varname>select_list</varname> FROM
+  <varname>table_or_subquery1</varname>, <varname>table_or_subquery2</varname> 
[, <varname>table_or_subquery3</varname> ...]
+  [<varname>other_join_clause</varname> ...]
+WHERE
+    <varname>col1</varname> = <varname>col2</varname> [AND 
<varname>col3</varname> = <varname>col4</varname> ...]
+
+SELECT <varname>select_list</varname> FROM
+  <varname>table_or_subquery1</varname> CROSS JOIN 
<varname>table_or_subquery2</varname>
+  [<varname>other_join_clause</varname> ...]
+[ WHERE <varname>where_clauses</varname> ]</codeblock>
+
+    <p>
+      <b>SQL-92 and SQL-89 Joins:</b>
+    </p>
+
+    <p>
+      Queries with the explicit <codeph>JOIN</codeph> keywords are known as 
SQL-92 style joins, referring to the
+      level of the SQL standard where they were introduced. The corresponding 
<codeph>ON</codeph> or
+      <codeph>USING</codeph> clauses clearly show which columns are used as 
the join keys in each case:
+    </p>
+
+<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b>
+  <b>ON t1.id = t2.id and t1.type_flag = t2.type_flag</b>
+  WHERE t1.c1 &gt; 100;
+
+SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b>
+  <b>USING (id, type_flag)</b>
+  WHERE t1.c1 &gt; 100;</codeblock>
+
+    <p>
+      The <codeph>ON</codeph> clause is a general way to compare columns 
across the two tables, even if the column
+      names are different. The <codeph>USING</codeph> clause is a shorthand 
notation for specifying the join
+      columns, when the column names are the same in both tables. You can code 
equivalent <codeph>WHERE</codeph>
+      clauses that compare the columns, instead of <codeph>ON</codeph> or 
<codeph>USING</codeph> clauses, but that
+      practice is not recommended because mixing the join comparisons with 
other filtering clauses is typically
+      less readable and harder to maintain.
+    </p>
+
+    <p>
+      Queries with a comma-separated list of tables and subqueries are known 
as SQL-89 style joins. In these
+      queries, the equality comparisons between columns of the joined tables 
go in the <codeph>WHERE</codeph>
+      clause alongside other kinds of comparisons. This syntax is easy to 
learn, but it is also easy to
+      accidentally remove a <codeph>WHERE</codeph> clause needed for the join 
to work correctly.
+    </p>
+
+<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1, t2</b>
+  WHERE
+  <b>t1.id = t2.id AND t1.type_flag = t2.type_flag</b>
+  AND t1.c1 &gt; 100;</codeblock>
+
+    <p>
+      <b>Self-joins:</b>
+    </p>
+
+    <p>
+      Impala can do self-joins, for example to join on two different columns 
in the same table to represent
+      parent-child relationships or other tree-structured data. There is no 
explicit syntax for this; just use the
+      same table name for both the left-hand and right-hand table, and assign 
different table aliases to use when
+      referring to the fully qualified column names:
+    </p>
+
+<codeblock>-- Combine fields from both parent and child rows.
+SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs 
WHERE lhs.id = rhs.parent;</codeblock>
+
+    <p>
+      <b>Cartesian joins:</b>
+    </p>
+
+    <p>
+      To avoid producing huge result sets by mistake, Impala does not allow 
Cartesian joins of the form:
+<codeblock>SELECT ... FROM t1 JOIN t2;
+SELECT ... FROM t1, t2;</codeblock>
+      If you intend to join the tables based on common values, add 
<codeph>ON</codeph> or <codeph>WHERE</codeph>
+      clauses to compare columns across the tables. If you truly intend to do 
a Cartesian join, use the
+      <codeph>CROSS JOIN</codeph> keyword as the join operator. The 
<codeph>CROSS JOIN</codeph> form does not use
+      any <codeph>ON</codeph> clause, because it produces a result set with 
all combinations of rows from the
+      left-hand and right-hand tables. The result set can still be filtered by 
subsequent <codeph>WHERE</codeph>
+      clauses. For example:
+    </p>
+
+<codeblock>SELECT ... FROM t1 CROSS JOIN t2;
+SELECT ... FROM t1 CROSS JOIN t2 WHERE 
<varname>tests_on_non_join_columns</varname>;</codeblock>
+
+    <p>
+      <b>Inner and outer joins:</b>
+    </p>
+
+    <p>
+      An inner join is the most common and familiar type: rows in the result 
set contain the requested columns from
+      the appropriate tables, for all combinations of rows where the join 
columns of the tables have identical
+      values. If a column with the same name occurs in both tables, use a 
fully qualified name or a column alias to
+      refer to the column in the select list or other clauses. Impala performs 
inner joins by default for both
+      SQL-89 and SQL-92 join syntax:
+    </p>
+
+<codeblock>-- The following 3 forms are all equivalent.
+SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</codeblock>
+
+    <p>
+      An outer join retrieves all rows from the left-hand table, or the 
right-hand table, or both; wherever there
+      is no matching data in the table on the other side of the join, the 
corresponding columns in the result set
+      are set to <codeph>NULL</codeph>. To perform an outer join, include the 
<codeph>OUTER</codeph> keyword in the
+      join operator, along with either <codeph>LEFT</codeph>, 
<codeph>RIGHT</codeph>, or <codeph>FULL</codeph>:
+    </p>
+
+<codeblock>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</codeblock>
+
+    <p>
+      For outer joins, Impala requires SQL-92 syntax; that is, the 
<codeph>JOIN</codeph> keyword instead of
+      comma-separated table names. Impala does not support vendor extensions 
such as <codeph>(+)</codeph> or
+      <codeph>*=</codeph> notation for doing outer joins with SQL-89 query 
syntax.
+    </p>
+
+    <p>
+      <b>Equijoins and Non-Equijoins:</b>
+    </p>
+
+    <p>
+      By default, Impala requires an equality comparison between the left-hand 
and right-hand tables, either
+      through <codeph>ON</codeph>, <codeph>USING</codeph>, or 
<codeph>WHERE</codeph> clauses. These types of
+      queries are classified broadly as equijoins. Inner, outer, full, and 
semi joins can all be equijoins based on
+      the presence of equality tests between columns in the left-hand and 
right-hand tables.
+    </p>
+
+    <p>
+      In Impala 1.2.2 and higher, non-equijoin queries are also possible, with 
comparisons such as
+      <codeph>!=</codeph> or <codeph>&lt;</codeph> between the join columns. 
These kinds of queries require care to
+      avoid producing huge result sets that could exceed resource limits. Once 
you have planned a non-equijoin
+      query that produces a result set of acceptable size, you can code the 
query using the <codeph>CROSS
+      JOIN</codeph> operator, and add the extra comparisons in the 
<codeph>WHERE</codeph> clause:
+    </p>
+
+<codeblock>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total &gt; 
t2.maximum_price;</codeblock>
+
+    <p rev="2.3.0">
+      In <keyword keyref="impala23_full"/> and higher, additional non-equijoin 
queries are possible due to the addition
+      of nested loop joins. These queries typically involve <codeph>SEMI 
JOIN</codeph>,
+      <codeph>ANTI JOIN</codeph>, or <codeph>FULL OUTER JOIN</codeph> clauses.
+      Impala sometimes also uses nested loop joins internally when evaluating 
<codeph>OUTER JOIN</codeph>
+      queries involving complex type columns.
+      Query phases involving nested loop joins do not use the spill-to-disk 
mechanism if they
+      exceed the memory limit. Impala decides internally when to use each join 
mechanism; you cannot
+      specify any query hint to choose between the nested loop join or the 
original hash join algorithm.
+    </p>
+
+<codeblock rev="2.3.0">SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col &lt; 
t2.int_col;</codeblock>
+
+    <p>
+      <b>Semi-joins:</b>
+    </p>
+
+    <p>
+      Semi-joins are a relatively rarely used variation. With the left 
semi-join, only data from the left-hand
+      table is returned, for rows where there is matching data in the 
right-hand table, based on comparisons
+      between join columns in <codeph>ON</codeph> or <codeph>WHERE</codeph> 
clauses. Only one instance of each row
+      from the left-hand table is returned, regardless of how many matching 
rows exist in the right-hand table.
+      <ph rev="2.0.0">A right semi-join (available in Impala 2.0 and higher) 
reverses the comparison and returns
+      data from the right-hand table.</ph>
+    </p>
+
+<codeblock>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = 
t2.id;</codeblock>
+
+    <p>
+      <b>Natural joins (not supported):</b>
+    </p>
+
+    <p>
+      Impala does not support the <codeph>NATURAL JOIN</codeph> operator, 
again to avoid inconsistent or huge
+      result sets. Natural joins do away with the <codeph>ON</codeph> and 
<codeph>USING</codeph> clauses, and
+      instead automatically join on all columns with the same names in the 
left-hand and right-hand tables. This
+      kind of query is not recommended for rapidly evolving data structures 
such as are typically used in Hadoop.
+      Thus, Impala does not support the <codeph>NATURAL JOIN</codeph> syntax, 
which can produce different query
+      results as columns are added to or removed from tables.
+    </p>
+
+    <p>
+      If you do have any queries that use <codeph>NATURAL JOIN</codeph>, make 
sure to rewrite them with explicit
+      <codeph>USING</codeph> clauses, because Impala could interpret the 
<codeph>NATURAL</codeph> keyword as a
+      table alias:
+    </p>
+
+<codeblock>-- 'NATURAL' is interpreted as an alias for 't1' and Impala 
attempts an inner join,
+-- resulting in an error because inner joins require explicit comparisons 
between columns.
+SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
+ERROR: NotImplementedException: Join with 't2' requires at least one 
conjunctive equality predicate.
+  To perform a Cartesian product between two tables, use a CROSS JOIN.
+
+-- If you expect the tables to have identically named columns with matching 
values,
+-- list the corresponding column names in a USING clause.
+SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, 
address);</codeblock>
+
+    <p rev="2.0.0">
+      <b>Anti-joins (<keyword keyref="impala20_full"/> and higher only):</b>
+    </p>
+
+    <p rev="2.0.0">
+      Impala supports the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT 
ANTI JOIN</codeph> clauses in
+      <keyword keyref="impala20"/> and higher. The <codeph>LEFT</codeph> or 
<codeph>RIGHT</codeph>
+      keyword is required for this kind of join. For <codeph>LEFT ANTI 
JOIN</codeph>, this clause returns those
+      values from the left-hand table that have no matching value in the 
right-hand table. <codeph>RIGHT ANTI
+      JOIN</codeph> reverses the comparison and returns values from the 
right-hand table. You can express this
+      negative relationship either through the <codeph>ANTI JOIN</codeph> 
clause or through a <codeph>NOT
+      EXISTS</codeph> operator with a subquery.
+    </p>
+
+<!-- Restriction lifted in Impala 2.0.
+<p>
+Impala does not support <codeph>WHERE</codeph> clauses
+such as <codeph>IN (<varname>subquery</varname>)</codeph>,
+<codeph>NOT IN (<varname>subquery</varname>)</codeph>,
+<codeph>EXISTS (<varname>subquery</varname>)</codeph>,
+and <codeph>NOT EXISTS (<varname>subquery</varname>)</codeph>.
+Therefore from a practical standpoint, you cannot
+express an anti-join condition, where values from one table
+are returned only if no matching values are present in another table.
+</p>
+-->
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+<!-- To do: reuse some complex types examples with joins here or under 
Examples farther down. -->
+
+    <p rev="2.3.0">
+      When referring to a column with a complex type (<codeph>STRUCT</codeph>, 
<codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
+      in a query, you use join notation to <q>unpack</q> the scalar fields of 
the struct, the elements of the array, or
+      the key-value pairs of the map. (The join notation is not required for 
aggregation operations, such as
+      <codeph>COUNT()</codeph> or <codeph>SUM()</codeph> for array elements.) 
Because Impala recognizes which complex type elements are associated with which 
row
+      of the result set, you use the same syntax as for a cross or cartesian 
join, without an explicit join condition.
+      See <xref href="impala_complex_types.xml#complex_types"/> for details 
about Impala support for complex types.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p>
+      You typically use join queries in situations like these:
+    </p>
+
+    <ul>
+      <li>
+        When related data arrives from different sources, with each data set 
physically residing in a separate
+        table. For example, you might have address data from business records 
that you cross-check against phone
+        listings or census data.
+        <note>
+          Impala can join tables of different file formats, including 
Impala-managed tables and HBase tables. For
+          example, you might keep small dimension tables in HBase, for 
convenience of single-row lookups and
+          updates, and for the larger fact tables use Parquet or other binary 
file format optimized for scan
+          operations. Then, you can issue a join query to cross-reference the 
fact tables with the dimension
+          tables.
+        </note>
+      </li>
+
+      <li>
+        When data is normalized, a technique for reducing data duplication by 
dividing it across multiple tables.
+        This kind of organization is often found in data that comes from 
traditional relational database systems.
+        For example, instead of repeating some long string such as a customer 
name in multiple tables, each table
+        might contain a numeric customer ID. Queries that need to display the 
customer name could <q>join</q> the
+        table that specifies which customer ID corresponds to which name.
+      </li>
+
+      <li>
+        When certain columns are rarely needed for queries, so they are moved 
into separate tables to reduce
+        overhead for common queries. For example, a <codeph>biography</codeph> 
field might be rarely needed in
+        queries on employee data. Putting that field in a separate table 
reduces the amount of I/O for common
+        queries on employee addresses or phone numbers. Queries that do need 
the <codeph>biography</codeph> column
+        can retrieve it by performing a join with that separate table.
+      </li>
+
+      <li>
+        In <keyword keyref="impala23_full"/> or higher, when referring to 
complex type columns in queries.
+        See <xref href="impala_complex_types.xml#complex_types"/> for details.
+      </li>
+    </ul>
+
+    <p>
+      When comparing columns with the same names in <codeph>ON</codeph> or 
<codeph>WHERE</codeph> clauses, use the
+      fully qualified names such as 
<codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>, or
+      assign table aliases, column aliases, or both to make the code more 
compact and understandable:
+    </p>
+
+<codeblock>select t1.c1 as first_id, t2.c2 as second_id from
+  t1 join t2 on first_id = second_id;
+
+select fact.custno, dimension.custno from
+  customer_data as fact join customer_address as dimension
+  using (custno)</codeblock>
+
+    <note>
+      <p>
+        Performance for join queries is a crucial aspect for Impala, because 
complex join queries are
+        resource-intensive operations. An efficient join query produces much 
less network traffic and CPU overhead
+        than an inefficient one. For best results:
+      </p>
+      <ul>
+        <li rev="1.2">
+          Make sure that both <xref 
href="impala_perf_stats.xml#perf_stats">table and column statistics</xref> are
+          available for all the tables involved in a join query, and 
especially for the columns referenced in any
+          join conditions. Impala uses the statistics to automatically deduce 
an efficient join order.
+          Use <xref href="impala_show.xml#show"><codeph>SHOW TABLE STATS 
<varname>table_name</varname></codeph> and
+          <codeph>SHOW COLUMN STATS 
<varname>table_name</varname></codeph></xref> to check if statistics are
+          already present. Issue the <codeph>COMPUTE STATS 
<varname>table_name</varname></codeph> for a nonpartitioned table,
+          or (in Impala 2.1.0 and higher) <codeph>COMPUTE INCREMENTAL STATS 
<varname>table_name</varname></codeph>
+          for a partitioned table, to collect the initial statistics at both 
the table and column levels, and to keep the
+          statistics up to date after any substantial <codeph>INSERT</codeph> 
or <codeph>LOAD DATA</codeph> operations.
+        </li>
+
+        <li rev="1.2">
+          If table or column statistics are not available, join the largest 
table first. You can check the
+          existence of statistics with the <codeph>SHOW TABLE STATS 
<varname>table_name</varname></codeph> and
+          <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> 
statements.
+        </li>
+
+        <li rev="1.2.2">
+          If table or column statistics are not available, join subsequent 
tables according to which table has the
+          most selective filter, based on overall size and 
<codeph>WHERE</codeph> clauses. Joining the table with
+          the most selective filter results in the fewest number of rows being 
returned.
+        </li>
+      </ul>
+      <p>
+        For more information and examples of performance for join queries, see
+        <xref href="impala_perf_joins.xml#perf_joins"/>.
+      </p>
+    </note>
+
+    <p>
+      To control the result set from a join query, include the names of 
corresponding column names in both tables
+      in an <codeph>ON</codeph> or <codeph>USING</codeph> clause, or by coding 
equality comparisons for those
+      columns in the <codeph>WHERE</codeph> clause.
+    </p>
+
+<codeblock>[localhost:21000] &gt; select c_last_name, ca_city from customer 
join customer_address where c_customer_sk = ca_address_sk;
++-------------+-----------------+
+| c_last_name | ca_city         |
++-------------+-----------------+
+| Lewis       | Fairfield       |
+| Moses       | Fairview        |
+| Hamilton    | Pleasant Valley |
+| White       | Oak Ridge       |
+| Moran       | Glendale        |
+...
+| Richards    | Lakewood         |
+| Day         | Lebanon          |
+| Painter     | Oak Hill         |
+| Bentley     | Greenfield       |
+| Jones       | Stringtown       |
++-------------+------------------+
+Returned 50000 row(s) in 9.82s</codeblock>
+
+    <p>
+      One potential downside of joins is the possibility of excess resource 
usage in poorly constructed queries.
+      Impala imposes restrictions on join queries to guard against such 
issues. To minimize the chance of runaway
+      queries on large data sets, Impala requires every join query to contain 
at least one equality predicate
+      between the columns of the various tables. For example, if 
<codeph>T1</codeph> contains 1000 rows and
+      <codeph>T2</codeph> contains 1,000,000 rows, a query <codeph>SELECT 
<varname>columns</varname> FROM t1 JOIN
+      t2</codeph> could return up to 1 billion rows (1000 * 1,000,000); Impala 
requires that the query include a
+      clause such as <codeph>ON t1.c1 = t2.c2</codeph> or <codeph>WHERE t1.c1 
= t2.c2</codeph>.
+    </p>
+
+    <p>
+      Because even with equality clauses, the result set can still be large, 
as we saw in the previous example, you
+      might use a <codeph>LIMIT</codeph> clause to return a subset of the 
results:
+    </p>
+
+<codeblock>[localhost:21000] &gt; select c_last_name, ca_city from customer, 
customer_address where c_customer_sk = ca_address_sk limit 10;
++-------------+-----------------+
+| c_last_name | ca_city         |
++-------------+-----------------+
+| Lewis       | Fairfield       |
+| Moses       | Fairview        |
+| Hamilton    | Pleasant Valley |
+| White       | Oak Ridge       |
+| Moran       | Glendale        |
+| Sharp       | Lakeview        |
+| Wiles       | Farmington      |
+| Shipman     | Union           |
+| Gilbert     | New Hope        |
+| Brunson     | Martinsville    |
++-------------+-----------------+
+Returned 10 row(s) in 0.63s</codeblock>
+
+    <p>
+      Or you might use additional comparison operators or aggregation 
functions to condense a large result set into
+      a smaller set of values:
+    </p>
+
+<codeblock>[localhost:21000] &gt; -- Find the names of customers who live in 
one particular town.
+[localhost:21000] &gt; select distinct c_last_name from customer, 
customer_address where
+  c_customer_sk = ca_address_sk
+  and ca_city = "Green Acres";
++---------------+
+| c_last_name   |
++---------------+
+| Hensley       |
+| Pearson       |
+| Mayer         |
+| Montgomery    |
+| Ricks         |
+...
+| Barrett       |
+| Price         |
+| Hill          |
+| Hansen        |
+| Meeks         |
++---------------+
+Returned 332 row(s) in 0.97s
+
+[localhost:21000] &gt; -- See how many different customers in this town have 
names starting with "A".
+[localhost:21000] &gt; select count(distinct c_last_name) from customer, 
customer_address where
+  c_customer_sk = ca_address_sk
+  and ca_city = "Green Acres"
+  and substr(c_last_name,1,1) = "A";
++-----------------------------+
+| count(distinct c_last_name) |
++-----------------------------+
+| 12                          |
++-----------------------------+
+Returned 1 row(s) in 1.00s</codeblock>
+
+    <p>
+      Because a join query can involve reading large amounts of data from 
disk, sending large amounts of data
+      across the network, and loading large amounts of data into memory to do 
the comparisons and filtering, you
+      might do benchmarking, performance analysis, and query tuning to find 
the most efficient join queries for
+      your data set, hardware capacity, network configuration, and cluster 
workload.
+    </p>
+
+    <p>
+      The two categories of joins in Impala are known as <b>partitioned 
joins</b> and <b>broadcast joins</b>. If
+      inaccurate table or column statistics, or some quirk of the data 
distribution, causes Impala to choose the
+      wrong mechanism for a particular join, consider using query hints as a 
temporary workaround. For details, see
+      <xref href="impala_hints.xml#hints"/>.
+    </p>
+
+    <p rev="2.5.0">
+      <b>Handling NULLs in Join Columns:</b>
+    </p>
+
+    <p rev="2.5.0">
+      By default, join key columns do not match if either one contains a 
<codeph>NULL</codeph> value.
+      To treat such columns as equal if both contain <codeph>NULL</codeph>, 
you can use an expression
+      such as <codeph>A = B OR (A IS NULL AND B IS NULL)</codeph>.
+      In <keyword keyref="impala25_full"/> and higher, the 
<codeph>&lt;=&gt;</codeph> operator (shorthand for
+      <codeph>IS NOT DISTINCT FROM</codeph>) performs the same comparison in a 
concise and efficient form.
+      The <codeph>&lt;=&gt;</codeph> operator is more efficient in for 
comparing join keys in a <codeph>NULL</codeph>-safe
+      manner, because the operator can use a hash join while the 
<codeph>OR</codeph> expression cannot.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following examples refer to these simple tables containing small 
sets of integers:
+<codeblock>[localhost:21000] &gt; create table t1 (x int);
+[localhost:21000] &gt; insert into t1 values (1), (2), (3), (4), (5), (6);
+
+[localhost:21000] &gt; create table t2 (y int);
+[localhost:21000] &gt; insert into t2 values (2), (4), (6);
+
+[localhost:21000] &gt; create table t3 (z int);
+[localhost:21000] &gt; insert into t3 values (1), (3), (5);
+</codeblock>
+    </p>
+
+<!-- To do: fill in examples for other join types. -->
+
+    <p>
+      The following example demonstrates an anti-join, returning the values 
from <codeph>T1</codeph> that do not
+      exist in <codeph>T2</codeph> (in this case, the odd numbers 1, 3, and 5):
+    </p>
+
+<codeblock>[localhost:21000] &gt; select x from t1 left anti join t2 on (t1.x 
= t2.y);
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 5 |
++---+
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      See these tutorials for examples of different kinds of joins:
+    </p>
+
+    <ul>
+      <li>
+        <xref href="impala_tutorial.xml#tut_cross_join"/>
+      </li>
+    </ul>
+  </conbody>
+</concept>


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_kerberos.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_kerberos.xml b/docs/topics/impala_kerberos.xml
new file mode 100644
index 0000000..6393580
--- /dev/null
+++ b/docs/topics/impala_kerberos.xml
@@ -0,0 +1,363 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="kerberos">
+
+  <title>Enabling Kerberos Authentication for Impala</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Security"/>
+      <data name="Category" value="Kerberos"/>
+      <data name="Category" value="Authentication"/>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Configuring"/>
+      <data name="Category" value="Starting and Stopping"/>
+      <data name="Category" value="Administrators"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      Impala supports Kerberos authentication. For more information on 
enabling Kerberos authentication, see the
+      topic on Configuring Hadoop Security in the
+      <xref 
href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_cdh5_hadoop_security.html";
 scope="external" format="html">CDH 5 Security Guide</xref>.
+    </p>
+
+    <p>
+      When using Impala in a managed environment, Cloudera Manager 
automatically completes Kerberos configuration.
+      In an unmanaged environment, create a Kerberos principal for each host 
running <cmdname>impalad</cmdname> or
+      <cmdname>statestored</cmdname>. <ph rev="upstream">Cloudera</ph> 
recommends using a consistent format, such as
+      <codeph>impala/_HOST@Your-Realm</codeph>, but you can use any three-part 
Kerberos server principal.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/user_kerberized"/>
+
+    <note 
conref="../shared/impala_common.xml#common/authentication_vs_authorization"/>
+
+    <p outputclass="toc inpage"/>
+
+    <p>
+      An alternative form of authentication you can use is LDAP, described in 
<xref href="impala_ldap.xml#ldap"/>.
+    </p>
+  </conbody>
+
+  <concept id="kerberos_prereqs">
+
+    <title>Requirements for Using Impala with Kerberos</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Requirements"/>
+      <data name="Category" value="Planning"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+      <p conref="../shared/impala_common.xml#common/rhel5_kerberos"/>
+
+<!-- This note adapted from the one at 
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Security-Guide/cdh4sg_topic_3_4.html.
+   Ideally should be conref'ed in both places. -->
+
+      <note type="important">
+        <p>
+          If you plan to use Impala in your cluster, you must configure your 
KDC to allow tickets to be renewed,
+          and you must configure <filepath>krb5.conf</filepath> to request 
renewable tickets. Typically, you can do
+          this by adding the <codeph>max_renewable_life</codeph> setting to 
your realm in
+          <filepath>kdc.conf</filepath>, and by adding the 
<filepath>renew_lifetime</filepath> parameter to the
+          <filepath>libdefaults</filepath> section of 
<filepath>krb5.conf</filepath>. For more information about
+          renewable tickets, see the
+          <xref href="http://web.mit.edu/Kerberos/krb5-1.8/"; scope="external" 
format="html"> Kerberos
+          documentation</xref>.
+        </p>
+        <p rev="1.2">
+          Currently, you cannot use the resource management feature in CDH 5 
on a cluster that has Kerberos
+          authentication enabled.
+        </p>
+      </note>
+
+      <p>
+        Start all <cmdname>impalad</cmdname> and 
<cmdname>statestored</cmdname> daemons with the
+        <codeph>--principal</codeph> and <codeph>--keytab-file</codeph> flags 
set to the principal and full path
+        name of the <codeph>keytab</codeph> file containing the credentials 
for the principal.
+      </p>
+
+      <p>
+        Impala supports the Cloudera ODBC driver and the Kerberos interface 
provided. To use Kerberos through the
+        ODBC driver, the host type must be set depending on the level of the 
ODBD driver:
+      </p>
+
+      <ul>
+        <li>
+          <codeph>SecImpala</codeph> for the ODBC 1.0 driver.
+        </li>
+
+        <li>
+          <codeph>SecBeeswax</codeph> for the ODBC 1.2 driver.
+        </li>
+
+        <li>
+          Blank for the ODBC 2.0 driver or higher, when connecting to a secure 
cluster.
+        </li>
+
+        <li>
+          <codeph>HS2NoSasl</codeph> for the ODBC 2.0 driver or higher, when 
connecting to a non-secure cluster.
+        </li>
+      </ul>
+
+      <p>
+        To enable Kerberos in the Impala shell, start the 
<cmdname>impala-shell</cmdname> command using the
+        <codeph>-k</codeph> flag.
+      </p>
+
+      <p>
+        To enable Impala to work with Kerberos security on your Hadoop 
cluster, make sure you perform the
+        installation and configuration steps in
+        <xref 
href="http://www.cloudera.com/documentation/enterprise/latest/topics/sg_authentication.html";
 scope="external" format="html">Authentication in the CDH 5 Security 
Guide</xref>.
+        Also note that when Kerberos security is enabled in Impala, a web 
browser that
+        supports Kerberos HTTP SPNEGO is required to access the Impala web 
console (for example, Firefox, Internet
+        Explorer, or Chrome).
+      </p>
+
+      <p>
+        If the NameNode, Secondary NameNode, DataNode, JobTracker, 
TaskTrackers, ResourceManager, NodeManagers,
+        HttpFS, Oozie, Impala, or Impala statestore services are configured to 
use Kerberos HTTP SPNEGO
+        authentication, and two or more of these services are running on the 
same host, then all of the running
+        services must use the same HTTP principal and keytab file used for 
their HTTP endpoints.
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="kerberos_config">
+
+    <title>Configuring Impala to Support Kerberos Security</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Configuring"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+      <p>
+        Enabling Kerberos authentication for Impala involves steps that can be 
summarized as follows:
+      </p>
+
+      <ul>
+        <li>
+          Creating service principals for Impala and the HTTP service. 
Principal names take the form:
+          
<codeph><varname>serviceName</varname>/<varname>fully.qualified.domain.name</varname>@<varname>KERBEROS.REALM</varname></codeph>
+        </li>
+
+        <li>
+          Creating, merging, and distributing key tab files for these 
principals.
+        </li>
+
+        <li>
+          Editing <codeph>/etc/default/impala</codeph> (in cluster not managed 
by Cloudera Manager), or editing the
+          <uicontrol>Security</uicontrol> settings in the Cloudera Manager 
interface, to accommodate Kerberos
+          authentication.
+        </li>
+      </ul>
+    </conbody>
+
+    <concept id="kerberos_setup">
+
+      <title>Enabling Kerberos for Impala</title>
+
+      <conbody>
+
+<!--
+      <p>
+        <b>To enable Kerberos for Impala:</b>
+      </p>
+-->
+
+        <ol>
+          <li>
+            Create an Impala service principal, specifying the name of the OS 
user that the Impala daemons run
+            under, the fully qualified domain name of each node running 
<cmdname>impalad</cmdname>, and the realm
+            name. For example:
+<codeblock>$ kadmin
+kadmin: addprinc -requires_preauth -randkey 
impala/[email protected]</codeblock>
+          </li>
+
+          <li>
+            Create an HTTP service principal. For example:
+<codeblock>kadmin: addprinc -randkey 
HTTP/[email protected]</codeblock>
+            <note>
+              The <codeph>HTTP</codeph> component of the service principal 
must be uppercase as shown in the
+              preceding example.
+            </note>
+          </li>
+
+          <li>
+            Create <codeph>keytab</codeph> files with both principals. For 
example:
+<codeblock>kadmin: xst -k impala.keytab impala/impala_host.example.com
+kadmin: xst -k http.keytab HTTP/impala_host.example.com
+kadmin: quit</codeblock>
+          </li>
+
+          <li>
+            Use <codeph>ktutil</codeph> to read the contents of the two keytab 
files and then write those contents
+            to a new file. For example:
+<codeblock>$ ktutil
+ktutil: rkt impala.keytab
+ktutil: rkt http.keytab
+ktutil: wkt impala-http.keytab
+ktutil: quit</codeblock>
+          </li>
+
+          <li>
+            (Optional) Test that credentials in the merged keytab file are 
valid, and that the <q>renew until</q>
+            date is in the future. For example:
+<codeblock>$ klist -e -k -t impala-http.keytab</codeblock>
+          </li>
+
+          <li>
+            Copy the <filepath>impala-http.keytab</filepath> file to the 
Impala configuration directory. Change the
+            permissions to be only read for the file owner and change the file 
owner to the <codeph>impala</codeph>
+            user. By default, the Impala user and group are both named 
<codeph>impala</codeph>. For example:
+<codeblock>$ cp impala-http.keytab /etc/impala/conf
+$ cd /etc/impala/conf
+$ chmod 400 impala-http.keytab
+$ chown impala:impala impala-http.keytab</codeblock>
+          </li>
+
+          <li>
+            Add Kerberos options to the Impala defaults file, 
<filepath>/etc/default/impala</filepath>. Add the
+            options for both the <cmdname>impalad</cmdname> and 
<cmdname>statestored</cmdname> daemons, using the
+            <codeph>IMPALA_SERVER_ARGS</codeph> and 
<codeph>IMPALA_STATE_STORE_ARGS</codeph> variables. For
+            example, you might add:
+<!-- Found these in a discussion post somewhere but not applicable as Impala 
startup options.
+-kerberos_ticket_life=36000
+-maxrenewlife 7days
+-->
+<codeblock>-kerberos_reinit_interval=60
+-principal=impala_1/[email protected]
+-keytab_file=/var/run/cloudera-scm-agent/process/3212-impala-IMPALAD/impala.keytab</codeblock>
+            <p>
+              For more information on changing the Impala defaults specified in
+              <filepath>/etc/default/impala</filepath>, see
+              <xref href="impala_config_options.xml#config_options">Modifying 
Impala Startup
+              Options</xref>.
+            </p>
+          </li>
+        </ol>
+
+        <note>
+          Restart <cmdname>impalad</cmdname> and 
<cmdname>statestored</cmdname> for these configuration changes to
+          take effect.
+        </note>
+      </conbody>
+    </concept>
+  </concept>
+
+  <concept id="kerberos_proxy">
+
+    <title>Enabling Kerberos for Impala with a Proxy Server</title>
+
+    <conbody>
+
+      <p>
+        A common configuration for Impala with High Availability is to use a 
proxy server to submit requests to the
+        actual <cmdname>impalad</cmdname> daemons on different hosts in the 
cluster. This configuration avoids
+        connection problems in case of machine failure, because the proxy 
server can route new requests through one
+        of the remaining hosts in the cluster. This configuration also helps 
with load balancing, because the
+        additional overhead of being the <q>coordinator node</q> for each 
query is spread across multiple hosts.
+      </p>
+
+      <p>
+        Although you can set up a proxy server with or without Kerberos 
authentication, typically users set up a
+        secure Kerberized configuration. For information about setting up a 
proxy server for Impala, including
+        Kerberos-specific steps, see <xref href="impala_proxy.xml#proxy"/>.
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="spnego">
+
+    <title>Using a Web Browser to Access a URL Protected by Kerberos HTTP 
SPNEGO</title>
+
+    <conbody>
+
+      <p>
+        Your web browser must support Kerberos HTTP SPNEGO. For example, 
Chrome, Firefox, or Internet Explorer.
+      </p>
+
+      <p>
+        <b>To configure Firefox to access a URL protected by Kerberos HTTP 
SPNEGO:</b>
+      </p>
+
+      <ol>
+        <li>
+          Open the advanced settings Firefox configuration page by loading the 
<codeph>about:config</codeph> page.
+        </li>
+
+        <li>
+          Use the <b>Filter</b> text box to find 
<codeph>network.negotiate-auth.trusted-uris</codeph>.
+        </li>
+
+        <li>
+          Double-click the 
<codeph>network.negotiate-auth.trusted-uris</codeph> preference and enter the 
hostname
+          or the domain of the web server that is protected by Kerberos HTTP 
SPNEGO. Separate multiple domains and
+          hostnames with a comma.
+        </li>
+
+        <li>
+          Click <b>OK</b>.
+        </li>
+      </ol>
+    </conbody>
+  </concept>
+
+  <concept id="kerberos_delegation">
+    <title>Enabling Impala Delegation for Kerberos Users</title>
+    <conbody>
+      <p>
+        See <xref href="impala_delegation.xml#delegation"/> for details about 
the delegation feature
+        that lets certain users submit queries using the credentials of other 
users.
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="ssl_jdbc_odbc">
+    <title>Using TLS/SSL with Business Intelligence Tools</title>
+    <conbody>
+      <p>
+        You can use Kerberos authentication, TLS/SSL encryption, or both to 
secure
+        connections from JDBC and ODBC applications to Impala.
+        See <xref href="impala_jdbc.xml#impala_jdbc"/> and <xref 
href="impala_odbc.xml#impala_odbc"/>
+        for details.
+      </p>
+
+      <p 
conref="../shared/impala_common.xml#common/hive_jdbc_ssl_kerberos_caveat"/>
+    </conbody>
+  </concept>
+
+  <concept id="whitelisting_internal_apis">
+  <title>Enabling Access to Internal Impala APIs for Kerberos Users</title>
+    <conbody>
+    <!-- Reusing (most of) the text from the New Features bullet here. Turn 
into a conref in both places. -->
+      <p rev="IMPALA-3095">
+        For applications that need direct access
+        to Impala APIs, without going through the HiveServer2 or Beeswax 
interfaces, you can
+        specify a list of Kerberos users who are allowed to call those APIs. 
By default, the
+        <codeph>impala</codeph> and <codeph>hdfs</codeph> users are the only 
ones authorized
+        for this kind of access.
+        Any users not explicitly authorized through the 
<codeph>internal_principals_whitelist</codeph>
+        configuration setting are blocked from accessing the APIs. This 
setting applies to all the
+        Impala-related daemons, although currently it is primarily used for 
HDFS to control the
+        behavior of the catalog server.
+      </p>
+    </conbody>
+
+  </concept>
+
+  <concept id="auth_to_local" rev="IMPALA-2660 CDH-40241">
+    <title>Mapping Kerberos Principals to Short Names for Impala</title>
+    <conbody>
+      <p 
conref="../shared/impala_common.xml#common/auth_to_local_instructions"/>
+    </conbody>
+  </concept>
+
+</concept>

[23/51] [partial] incubator-impala git commit: IMPALA-3398: Add docs to main Impala branch.

Reply via email to