http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_joins.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_joins.xml b/docs/topics/impala_joins.xml new file mode 100644 index 0000000..ddd2f2b --- /dev/null +++ b/docs/topics/impala_joins.xml @@ -0,0 +1,534 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="joins"> + + <title>Joins in Impala SELECT Statements</title> + <titlealts audience="PDF"><navtitle>Joins</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">joins</indexterm> + A join query is a <codeph>SELECT</codeph> statement that combines data from two or more tables, + and returns a result set containing items from some or all of those tables. It is a way to + cross-reference and correlate related data that is organized into multiple tables, typically + using identifiers that are repeated in each of the joined tables. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p conref="../shared/impala_common.xml#common/join_types"/> + +<codeblock>SELECT <varname>select_list</varname> FROM + <varname>table_or_subquery1</varname> [INNER] JOIN <varname>table_or_subquery2</varname> | + <varname>table_or_subquery1</varname> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <varname>table_or_subquery2</varname> | + <varname>table_or_subquery1</varname> {LEFT | RIGHT} SEMI JOIN <varname>table_or_subquery2</varname> | + <ph rev="2.0.0"><varname>table_or_subquery1</varname> {LEFT | RIGHT} ANTI JOIN <varname>table_or_subquery2</varname> |</ph> + [ ON <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...] | + USING (<varname>col1</varname> [, <varname>col2</varname> ...]) ] + [<varname>other_join_clause</varname> ...] +[ WHERE <varname>where_clauses</varname> ] + +SELECT <varname>select_list</varname> FROM + <varname>table_or_subquery1</varname>, <varname>table_or_subquery2</varname> [, <varname>table_or_subquery3</varname> ...] + [<varname>other_join_clause</varname> ...] +WHERE + <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...] + +SELECT <varname>select_list</varname> FROM + <varname>table_or_subquery1</varname> CROSS JOIN <varname>table_or_subquery2</varname> + [<varname>other_join_clause</varname> ...] +[ WHERE <varname>where_clauses</varname> ]</codeblock> + + <p> + <b>SQL-92 and SQL-89 Joins:</b> + </p> + + <p> + Queries with the explicit <codeph>JOIN</codeph> keywords are known as SQL-92 style joins, referring to the + level of the SQL standard where they were introduced. The corresponding <codeph>ON</codeph> or + <codeph>USING</codeph> clauses clearly show which columns are used as the join keys in each case: + </p> + +<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b> + <b>ON t1.id = t2.id and t1.type_flag = t2.type_flag</b> + WHERE t1.c1 > 100; + +SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b> + <b>USING (id, type_flag)</b> + WHERE t1.c1 > 100;</codeblock> + + <p> + The <codeph>ON</codeph> clause is a general way to compare columns across the two tables, even if the column + names are different. The <codeph>USING</codeph> clause is a shorthand notation for specifying the join + columns, when the column names are the same in both tables. You can code equivalent <codeph>WHERE</codeph> + clauses that compare the columns, instead of <codeph>ON</codeph> or <codeph>USING</codeph> clauses, but that + practice is not recommended because mixing the join comparisons with other filtering clauses is typically + less readable and harder to maintain. + </p> + + <p> + Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these + queries, the equality comparisons between columns of the joined tables go in the <codeph>WHERE</codeph> + clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to + accidentally remove a <codeph>WHERE</codeph> clause needed for the join to work correctly. + </p> + +<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1, t2</b> + WHERE + <b>t1.id = t2.id AND t1.type_flag = t2.type_flag</b> + AND t1.c1 > 100;</codeblock> + + <p> + <b>Self-joins:</b> + </p> + + <p> + Impala can do self-joins, for example to join on two different columns in the same table to represent + parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the + same table name for both the left-hand and right-hand table, and assign different table aliases to use when + referring to the fully qualified column names: + </p> + +<codeblock>-- Combine fields from both parent and child rows. +SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</codeblock> + + <p> + <b>Cartesian joins:</b> + </p> + + <p> + To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form: +<codeblock>SELECT ... FROM t1 JOIN t2; +SELECT ... FROM t1, t2;</codeblock> + If you intend to join the tables based on common values, add <codeph>ON</codeph> or <codeph>WHERE</codeph> + clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the + <codeph>CROSS JOIN</codeph> keyword as the join operator. The <codeph>CROSS JOIN</codeph> form does not use + any <codeph>ON</codeph> clause, because it produces a result set with all combinations of rows from the + left-hand and right-hand tables. The result set can still be filtered by subsequent <codeph>WHERE</codeph> + clauses. For example: + </p> + +<codeblock>SELECT ... FROM t1 CROSS JOIN t2; +SELECT ... FROM t1 CROSS JOIN t2 WHERE <varname>tests_on_non_join_columns</varname>;</codeblock> + + <p> + <b>Inner and outer joins:</b> + </p> + + <p> + An inner join is the most common and familiar type: rows in the result set contain the requested columns from + the appropriate tables, for all combinations of rows where the join columns of the tables have identical + values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to + refer to the column in the select list or other clauses. Impala performs inner joins by default for both + SQL-89 and SQL-92 join syntax: + </p> + +<codeblock>-- The following 3 forms are all equivalent. +SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id; +SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id; +SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</codeblock> + + <p> + An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there + is no matching data in the table on the other side of the join, the corresponding columns in the result set + are set to <codeph>NULL</codeph>. To perform an outer join, include the <codeph>OUTER</codeph> keyword in the + join operator, along with either <codeph>LEFT</codeph>, <codeph>RIGHT</codeph>, or <codeph>FULL</codeph>: + </p> + +<codeblock>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id; +SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id; +SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</codeblock> + + <p> + For outer joins, Impala requires SQL-92 syntax; that is, the <codeph>JOIN</codeph> keyword instead of + comma-separated table names. Impala does not support vendor extensions such as <codeph>(+)</codeph> or + <codeph>*=</codeph> notation for doing outer joins with SQL-89 query syntax. + </p> + + <p> + <b>Equijoins and Non-Equijoins:</b> + </p> + + <p> + By default, Impala requires an equality comparison between the left-hand and right-hand tables, either + through <codeph>ON</codeph>, <codeph>USING</codeph>, or <codeph>WHERE</codeph> clauses. These types of + queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on + the presence of equality tests between columns in the left-hand and right-hand tables. + </p> + + <p> + In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as + <codeph>!=</codeph> or <codeph><</codeph> between the join columns. These kinds of queries require care to + avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin + query that produces a result set of acceptable size, you can code the query using the <codeph>CROSS + JOIN</codeph> operator, and add the extra comparisons in the <codeph>WHERE</codeph> clause: + </p> + +<codeblock>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;</codeblock> + + <p rev="2.3.0"> + In <keyword keyref="impala23_full"/> and higher, additional non-equijoin queries are possible due to the addition + of nested loop joins. These queries typically involve <codeph>SEMI JOIN</codeph>, + <codeph>ANTI JOIN</codeph>, or <codeph>FULL OUTER JOIN</codeph> clauses. + Impala sometimes also uses nested loop joins internally when evaluating <codeph>OUTER JOIN</codeph> + queries involving complex type columns. + Query phases involving nested loop joins do not use the spill-to-disk mechanism if they + exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot + specify any query hint to choose between the nested loop join or the original hash join algorithm. + </p> + +<codeblock rev="2.3.0">SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;</codeblock> + + <p> + <b>Semi-joins:</b> + </p> + + <p> + Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand + table is returned, for rows where there is matching data in the right-hand table, based on comparisons + between join columns in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses. Only one instance of each row + from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table. + <ph rev="2.0.0">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns + data from the right-hand table.</ph> + </p> + +<codeblock>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</codeblock> + + <p> + <b>Natural joins (not supported):</b> + </p> + + <p> + Impala does not support the <codeph>NATURAL JOIN</codeph> operator, again to avoid inconsistent or huge + result sets. Natural joins do away with the <codeph>ON</codeph> and <codeph>USING</codeph> clauses, and + instead automatically join on all columns with the same names in the left-hand and right-hand tables. This + kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop. + Thus, Impala does not support the <codeph>NATURAL JOIN</codeph> syntax, which can produce different query + results as columns are added to or removed from tables. + </p> + + <p> + If you do have any queries that use <codeph>NATURAL JOIN</codeph>, make sure to rewrite them with explicit + <codeph>USING</codeph> clauses, because Impala could interpret the <codeph>NATURAL</codeph> keyword as a + table alias: + </p> + +<codeblock>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join, +-- resulting in an error because inner joins require explicit comparisons between columns. +SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2; +ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate. + To perform a Cartesian product between two tables, use a CROSS JOIN. + +-- If you expect the tables to have identically named columns with matching values, +-- list the corresponding column names in a USING clause. +SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</codeblock> + + <p rev="2.0.0"> + <b>Anti-joins (<keyword keyref="impala20_full"/> and higher only):</b> + </p> + + <p rev="2.0.0"> + Impala supports the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT ANTI JOIN</codeph> clauses in + <keyword keyref="impala20"/> and higher. The <codeph>LEFT</codeph> or <codeph>RIGHT</codeph> + keyword is required for this kind of join. For <codeph>LEFT ANTI JOIN</codeph>, this clause returns those + values from the left-hand table that have no matching value in the right-hand table. <codeph>RIGHT ANTI + JOIN</codeph> reverses the comparison and returns values from the right-hand table. You can express this + negative relationship either through the <codeph>ANTI JOIN</codeph> clause or through a <codeph>NOT + EXISTS</codeph> operator with a subquery. + </p> + +<!-- Restriction lifted in Impala 2.0. +<p> +Impala does not support <codeph>WHERE</codeph> clauses +such as <codeph>IN (<varname>subquery</varname>)</codeph>, +<codeph>NOT IN (<varname>subquery</varname>)</codeph>, +<codeph>EXISTS (<varname>subquery</varname>)</codeph>, +and <codeph>NOT EXISTS (<varname>subquery</varname>)</codeph>. +Therefore from a practical standpoint, you cannot +express an anti-join condition, where values from one table +are returned only if no matching values are present in another table. +</p> +--> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + +<!-- To do: reuse some complex types examples with joins here or under Examples farther down. --> + + <p rev="2.3.0"> + When referring to a column with a complex type (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>) + in a query, you use join notation to <q>unpack</q> the scalar fields of the struct, the elements of the array, or + the key-value pairs of the map. (The join notation is not required for aggregation operations, such as + <codeph>COUNT()</codeph> or <codeph>SUM()</codeph> for array elements.) Because Impala recognizes which complex type elements are associated with which row + of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition. + See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + You typically use join queries in situations like these: + </p> + + <ul> + <li> + When related data arrives from different sources, with each data set physically residing in a separate + table. For example, you might have address data from business records that you cross-check against phone + listings or census data. + <note> + Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For + example, you might keep small dimension tables in HBase, for convenience of single-row lookups and + updates, and for the larger fact tables use Parquet or other binary file format optimized for scan + operations. Then, you can issue a join query to cross-reference the fact tables with the dimension + tables. + </note> + </li> + + <li> + When data is normalized, a technique for reducing data duplication by dividing it across multiple tables. + This kind of organization is often found in data that comes from traditional relational database systems. + For example, instead of repeating some long string such as a customer name in multiple tables, each table + might contain a numeric customer ID. Queries that need to display the customer name could <q>join</q> the + table that specifies which customer ID corresponds to which name. + </li> + + <li> + When certain columns are rarely needed for queries, so they are moved into separate tables to reduce + overhead for common queries. For example, a <codeph>biography</codeph> field might be rarely needed in + queries on employee data. Putting that field in a separate table reduces the amount of I/O for common + queries on employee addresses or phone numbers. Queries that do need the <codeph>biography</codeph> column + can retrieve it by performing a join with that separate table. + </li> + + <li> + In <keyword keyref="impala23_full"/> or higher, when referring to complex type columns in queries. + See <xref href="impala_complex_types.xml#complex_types"/> for details. + </li> + </ul> + + <p> + When comparing columns with the same names in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses, use the + fully qualified names such as <codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>, or + assign table aliases, column aliases, or both to make the code more compact and understandable: + </p> + +<codeblock>select t1.c1 as first_id, t2.c2 as second_id from + t1 join t2 on first_id = second_id; + +select fact.custno, dimension.custno from + customer_data as fact join customer_address as dimension + using (custno)</codeblock> + + <note> + <p> + Performance for join queries is a crucial aspect for Impala, because complex join queries are + resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead + than an inefficient one. For best results: + </p> + <ul> + <li rev="1.2"> + Make sure that both <xref href="impala_perf_stats.xml#perf_stats">table and column statistics</xref> are + available for all the tables involved in a join query, and especially for the columns referenced in any + join conditions. Impala uses the statistics to automatically deduce an efficient join order. + Use <xref href="impala_show.xml#show"><codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and + <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph></xref> to check if statistics are + already present. Issue the <codeph>COMPUTE STATS <varname>table_name</varname></codeph> for a nonpartitioned table, + or (in Impala 2.1.0 and higher) <codeph>COMPUTE INCREMENTAL STATS <varname>table_name</varname></codeph> + for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the + statistics up to date after any substantial <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> operations. + </li> + + <li rev="1.2"> + If table or column statistics are not available, join the largest table first. You can check the + existence of statistics with the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and + <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statements. + </li> + + <li rev="1.2.2"> + If table or column statistics are not available, join subsequent tables according to which table has the + most selective filter, based on overall size and <codeph>WHERE</codeph> clauses. Joining the table with + the most selective filter results in the fewest number of rows being returned. + </li> + </ul> + <p> + For more information and examples of performance for join queries, see + <xref href="impala_perf_joins.xml#perf_joins"/>. + </p> + </note> + + <p> + To control the result set from a join query, include the names of corresponding column names in both tables + in an <codeph>ON</codeph> or <codeph>USING</codeph> clause, or by coding equality comparisons for those + columns in the <codeph>WHERE</codeph> clause. + </p> + +<codeblock>[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk; ++-------------+-----------------+ +| c_last_name | ca_city | ++-------------+-----------------+ +| Lewis | Fairfield | +| Moses | Fairview | +| Hamilton | Pleasant Valley | +| White | Oak Ridge | +| Moran | Glendale | +... +| Richards | Lakewood | +| Day | Lebanon | +| Painter | Oak Hill | +| Bentley | Greenfield | +| Jones | Stringtown | ++-------------+------------------+ +Returned 50000 row(s) in 9.82s</codeblock> + + <p> + One potential downside of joins is the possibility of excess resource usage in poorly constructed queries. + Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway + queries on large data sets, Impala requires every join query to contain at least one equality predicate + between the columns of the various tables. For example, if <codeph>T1</codeph> contains 1000 rows and + <codeph>T2</codeph> contains 1,000,000 rows, a query <codeph>SELECT <varname>columns</varname> FROM t1 JOIN + t2</codeph> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a + clause such as <codeph>ON t1.c1 = t2.c2</codeph> or <codeph>WHERE t1.c1 = t2.c2</codeph>. + </p> + + <p> + Because even with equality clauses, the result set can still be large, as we saw in the previous example, you + might use a <codeph>LIMIT</codeph> clause to return a subset of the results: + </p> + +<codeblock>[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10; ++-------------+-----------------+ +| c_last_name | ca_city | ++-------------+-----------------+ +| Lewis | Fairfield | +| Moses | Fairview | +| Hamilton | Pleasant Valley | +| White | Oak Ridge | +| Moran | Glendale | +| Sharp | Lakeview | +| Wiles | Farmington | +| Shipman | Union | +| Gilbert | New Hope | +| Brunson | Martinsville | ++-------------+-----------------+ +Returned 10 row(s) in 0.63s</codeblock> + + <p> + Or you might use additional comparison operators or aggregation functions to condense a large result set into + a smaller set of values: + </p> + +<codeblock>[localhost:21000] > -- Find the names of customers who live in one particular town. +[localhost:21000] > select distinct c_last_name from customer, customer_address where + c_customer_sk = ca_address_sk + and ca_city = "Green Acres"; ++---------------+ +| c_last_name | ++---------------+ +| Hensley | +| Pearson | +| Mayer | +| Montgomery | +| Ricks | +... +| Barrett | +| Price | +| Hill | +| Hansen | +| Meeks | ++---------------+ +Returned 332 row(s) in 0.97s + +[localhost:21000] > -- See how many different customers in this town have names starting with "A". +[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where + c_customer_sk = ca_address_sk + and ca_city = "Green Acres" + and substr(c_last_name,1,1) = "A"; ++-----------------------------+ +| count(distinct c_last_name) | ++-----------------------------+ +| 12 | ++-----------------------------+ +Returned 1 row(s) in 1.00s</codeblock> + + <p> + Because a join query can involve reading large amounts of data from disk, sending large amounts of data + across the network, and loading large amounts of data into memory to do the comparisons and filtering, you + might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for + your data set, hardware capacity, network configuration, and cluster workload. + </p> + + <p> + The two categories of joins in Impala are known as <b>partitioned joins</b> and <b>broadcast joins</b>. If + inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the + wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see + <xref href="impala_hints.xml#hints"/>. + </p> + + <p rev="2.5.0"> + <b>Handling NULLs in Join Columns:</b> + </p> + + <p rev="2.5.0"> + By default, join key columns do not match if either one contains a <codeph>NULL</codeph> value. + To treat such columns as equal if both contain <codeph>NULL</codeph>, you can use an expression + such as <codeph>A = B OR (A IS NULL AND B IS NULL)</codeph>. + In <keyword keyref="impala25_full"/> and higher, the <codeph><=></codeph> operator (shorthand for + <codeph>IS NOT DISTINCT FROM</codeph>) performs the same comparison in a concise and efficient form. + The <codeph><=></codeph> operator is more efficient in for comparing join keys in a <codeph>NULL</codeph>-safe + manner, because the operator can use a hash join while the <codeph>OR</codeph> expression cannot. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples refer to these simple tables containing small sets of integers: +<codeblock>[localhost:21000] > create table t1 (x int); +[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6); + +[localhost:21000] > create table t2 (y int); +[localhost:21000] > insert into t2 values (2), (4), (6); + +[localhost:21000] > create table t3 (z int); +[localhost:21000] > insert into t3 values (1), (3), (5); +</codeblock> + </p> + +<!-- To do: fill in examples for other join types. --> + + <p> + The following example demonstrates an anti-join, returning the values from <codeph>T1</codeph> that do not + exist in <codeph>T2</codeph> (in this case, the odd numbers 1, 3, and 5): + </p> + +<codeblock>[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y); ++---+ +| x | ++---+ +| 1 | +| 3 | +| 5 | ++---+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + See these tutorials for examples of different kinds of joins: + </p> + + <ul> + <li> + <xref href="impala_tutorial.xml#tut_cross_join"/> + </li> + </ul> + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_kerberos.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_kerberos.xml b/docs/topics/impala_kerberos.xml new file mode 100644 index 0000000..6393580 --- /dev/null +++ b/docs/topics/impala_kerberos.xml @@ -0,0 +1,363 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="kerberos"> + + <title>Enabling Kerberos Authentication for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Security"/> + <data name="Category" value="Kerberos"/> + <data name="Category" value="Authentication"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Starting and Stopping"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + + <conbody> + + <p> + Impala supports Kerberos authentication. For more information on enabling Kerberos authentication, see the + topic on Configuring Hadoop Security in the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_cdh5_hadoop_security.html" scope="external" format="html">CDH 5 Security Guide</xref>. + </p> + + <p> + When using Impala in a managed environment, Cloudera Manager automatically completes Kerberos configuration. + In an unmanaged environment, create a Kerberos principal for each host running <cmdname>impalad</cmdname> or + <cmdname>statestored</cmdname>. <ph rev="upstream">Cloudera</ph> recommends using a consistent format, such as + <codeph>impala/_HOST@Your-Realm</codeph>, but you can use any three-part Kerberos server principal. + </p> + + <p conref="../shared/impala_common.xml#common/user_kerberized"/> + + <note conref="../shared/impala_common.xml#common/authentication_vs_authorization"/> + + <p outputclass="toc inpage"/> + + <p> + An alternative form of authentication you can use is LDAP, described in <xref href="impala_ldap.xml#ldap"/>. + </p> + </conbody> + + <concept id="kerberos_prereqs"> + + <title>Requirements for Using Impala with Kerberos</title> + <prolog> + <metadata> + <data name="Category" value="Requirements"/> + <data name="Category" value="Planning"/> + </metadata> + </prolog> + + <conbody> + + <p conref="../shared/impala_common.xml#common/rhel5_kerberos"/> + +<!-- This note adapted from the one at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Security-Guide/cdh4sg_topic_3_4.html. + Ideally should be conref'ed in both places. --> + + <note type="important"> + <p> + If you plan to use Impala in your cluster, you must configure your KDC to allow tickets to be renewed, + and you must configure <filepath>krb5.conf</filepath> to request renewable tickets. Typically, you can do + this by adding the <codeph>max_renewable_life</codeph> setting to your realm in + <filepath>kdc.conf</filepath>, and by adding the <filepath>renew_lifetime</filepath> parameter to the + <filepath>libdefaults</filepath> section of <filepath>krb5.conf</filepath>. For more information about + renewable tickets, see the + <xref href="http://web.mit.edu/Kerberos/krb5-1.8/" scope="external" format="html"> Kerberos + documentation</xref>. + </p> + <p rev="1.2"> + Currently, you cannot use the resource management feature in CDH 5 on a cluster that has Kerberos + authentication enabled. + </p> + </note> + + <p> + Start all <cmdname>impalad</cmdname> and <cmdname>statestored</cmdname> daemons with the + <codeph>--principal</codeph> and <codeph>--keytab-file</codeph> flags set to the principal and full path + name of the <codeph>keytab</codeph> file containing the credentials for the principal. + </p> + + <p> + Impala supports the Cloudera ODBC driver and the Kerberos interface provided. To use Kerberos through the + ODBC driver, the host type must be set depending on the level of the ODBD driver: + </p> + + <ul> + <li> + <codeph>SecImpala</codeph> for the ODBC 1.0 driver. + </li> + + <li> + <codeph>SecBeeswax</codeph> for the ODBC 1.2 driver. + </li> + + <li> + Blank for the ODBC 2.0 driver or higher, when connecting to a secure cluster. + </li> + + <li> + <codeph>HS2NoSasl</codeph> for the ODBC 2.0 driver or higher, when connecting to a non-secure cluster. + </li> + </ul> + + <p> + To enable Kerberos in the Impala shell, start the <cmdname>impala-shell</cmdname> command using the + <codeph>-k</codeph> flag. + </p> + + <p> + To enable Impala to work with Kerberos security on your Hadoop cluster, make sure you perform the + installation and configuration steps in + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/sg_authentication.html" scope="external" format="html">Authentication in the CDH 5 Security Guide</xref>. + Also note that when Kerberos security is enabled in Impala, a web browser that + supports Kerberos HTTP SPNEGO is required to access the Impala web console (for example, Firefox, Internet + Explorer, or Chrome). + </p> + + <p> + If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager, NodeManagers, + HttpFS, Oozie, Impala, or Impala statestore services are configured to use Kerberos HTTP SPNEGO + authentication, and two or more of these services are running on the same host, then all of the running + services must use the same HTTP principal and keytab file used for their HTTP endpoints. + </p> + </conbody> + </concept> + + <concept id="kerberos_config"> + + <title>Configuring Impala to Support Kerberos Security</title> + <prolog> + <metadata> + <data name="Category" value="Configuring"/> + </metadata> + </prolog> + + <conbody> + + <p> + Enabling Kerberos authentication for Impala involves steps that can be summarized as follows: + </p> + + <ul> + <li> + Creating service principals for Impala and the HTTP service. Principal names take the form: + <codeph><varname>serviceName</varname>/<varname>fully.qualified.domain.name</varname>@<varname>KERBEROS.REALM</varname></codeph> + </li> + + <li> + Creating, merging, and distributing key tab files for these principals. + </li> + + <li> + Editing <codeph>/etc/default/impala</codeph> (in cluster not managed by Cloudera Manager), or editing the + <uicontrol>Security</uicontrol> settings in the Cloudera Manager interface, to accommodate Kerberos + authentication. + </li> + </ul> + </conbody> + + <concept id="kerberos_setup"> + + <title>Enabling Kerberos for Impala</title> + + <conbody> + +<!-- + <p> + <b>To enable Kerberos for Impala:</b> + </p> +--> + + <ol> + <li> + Create an Impala service principal, specifying the name of the OS user that the Impala daemons run + under, the fully qualified domain name of each node running <cmdname>impalad</cmdname>, and the realm + name. For example: +<codeblock>$ kadmin +kadmin: addprinc -requires_preauth -randkey impala/[email protected]</codeblock> + </li> + + <li> + Create an HTTP service principal. For example: +<codeblock>kadmin: addprinc -randkey HTTP/[email protected]</codeblock> + <note> + The <codeph>HTTP</codeph> component of the service principal must be uppercase as shown in the + preceding example. + </note> + </li> + + <li> + Create <codeph>keytab</codeph> files with both principals. For example: +<codeblock>kadmin: xst -k impala.keytab impala/impala_host.example.com +kadmin: xst -k http.keytab HTTP/impala_host.example.com +kadmin: quit</codeblock> + </li> + + <li> + Use <codeph>ktutil</codeph> to read the contents of the two keytab files and then write those contents + to a new file. For example: +<codeblock>$ ktutil +ktutil: rkt impala.keytab +ktutil: rkt http.keytab +ktutil: wkt impala-http.keytab +ktutil: quit</codeblock> + </li> + + <li> + (Optional) Test that credentials in the merged keytab file are valid, and that the <q>renew until</q> + date is in the future. For example: +<codeblock>$ klist -e -k -t impala-http.keytab</codeblock> + </li> + + <li> + Copy the <filepath>impala-http.keytab</filepath> file to the Impala configuration directory. Change the + permissions to be only read for the file owner and change the file owner to the <codeph>impala</codeph> + user. By default, the Impala user and group are both named <codeph>impala</codeph>. For example: +<codeblock>$ cp impala-http.keytab /etc/impala/conf +$ cd /etc/impala/conf +$ chmod 400 impala-http.keytab +$ chown impala:impala impala-http.keytab</codeblock> + </li> + + <li> + Add Kerberos options to the Impala defaults file, <filepath>/etc/default/impala</filepath>. Add the + options for both the <cmdname>impalad</cmdname> and <cmdname>statestored</cmdname> daemons, using the + <codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph> variables. For + example, you might add: +<!-- Found these in a discussion post somewhere but not applicable as Impala startup options. +-kerberos_ticket_life=36000 +-maxrenewlife 7days +--> +<codeblock>-kerberos_reinit_interval=60 +-principal=impala_1/[email protected] +-keytab_file=/var/run/cloudera-scm-agent/process/3212-impala-IMPALAD/impala.keytab</codeblock> + <p> + For more information on changing the Impala defaults specified in + <filepath>/etc/default/impala</filepath>, see + <xref href="impala_config_options.xml#config_options">Modifying Impala Startup + Options</xref>. + </p> + </li> + </ol> + + <note> + Restart <cmdname>impalad</cmdname> and <cmdname>statestored</cmdname> for these configuration changes to + take effect. + </note> + </conbody> + </concept> + </concept> + + <concept id="kerberos_proxy"> + + <title>Enabling Kerberos for Impala with a Proxy Server</title> + + <conbody> + + <p> + A common configuration for Impala with High Availability is to use a proxy server to submit requests to the + actual <cmdname>impalad</cmdname> daemons on different hosts in the cluster. This configuration avoids + connection problems in case of machine failure, because the proxy server can route new requests through one + of the remaining hosts in the cluster. This configuration also helps with load balancing, because the + additional overhead of being the <q>coordinator node</q> for each query is spread across multiple hosts. + </p> + + <p> + Although you can set up a proxy server with or without Kerberos authentication, typically users set up a + secure Kerberized configuration. For information about setting up a proxy server for Impala, including + Kerberos-specific steps, see <xref href="impala_proxy.xml#proxy"/>. + </p> + </conbody> + </concept> + + <concept id="spnego"> + + <title>Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO</title> + + <conbody> + + <p> + Your web browser must support Kerberos HTTP SPNEGO. For example, Chrome, Firefox, or Internet Explorer. + </p> + + <p> + <b>To configure Firefox to access a URL protected by Kerberos HTTP SPNEGO:</b> + </p> + + <ol> + <li> + Open the advanced settings Firefox configuration page by loading the <codeph>about:config</codeph> page. + </li> + + <li> + Use the <b>Filter</b> text box to find <codeph>network.negotiate-auth.trusted-uris</codeph>. + </li> + + <li> + Double-click the <codeph>network.negotiate-auth.trusted-uris</codeph> preference and enter the hostname + or the domain of the web server that is protected by Kerberos HTTP SPNEGO. Separate multiple domains and + hostnames with a comma. + </li> + + <li> + Click <b>OK</b>. + </li> + </ol> + </conbody> + </concept> + + <concept id="kerberos_delegation"> + <title>Enabling Impala Delegation for Kerberos Users</title> + <conbody> + <p> + See <xref href="impala_delegation.xml#delegation"/> for details about the delegation feature + that lets certain users submit queries using the credentials of other users. + </p> + </conbody> + </concept> + + <concept id="ssl_jdbc_odbc"> + <title>Using TLS/SSL with Business Intelligence Tools</title> + <conbody> + <p> + You can use Kerberos authentication, TLS/SSL encryption, or both to secure + connections from JDBC and ODBC applications to Impala. + See <xref href="impala_jdbc.xml#impala_jdbc"/> and <xref href="impala_odbc.xml#impala_odbc"/> + for details. + </p> + + <p conref="../shared/impala_common.xml#common/hive_jdbc_ssl_kerberos_caveat"/> + </conbody> + </concept> + + <concept id="whitelisting_internal_apis"> + <title>Enabling Access to Internal Impala APIs for Kerberos Users</title> + <conbody> + <!-- Reusing (most of) the text from the New Features bullet here. Turn into a conref in both places. --> + <p rev="IMPALA-3095"> + For applications that need direct access + to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can + specify a list of Kerberos users who are allowed to call those APIs. By default, the + <codeph>impala</codeph> and <codeph>hdfs</codeph> users are the only ones authorized + for this kind of access. + Any users not explicitly authorized through the <codeph>internal_principals_whitelist</codeph> + configuration setting are blocked from accessing the APIs. This setting applies to all the + Impala-related daemons, although currently it is primarily used for HDFS to control the + behavior of the catalog server. + </p> + </conbody> + + </concept> + + <concept id="auth_to_local" rev="IMPALA-2660 CDH-40241"> + <title>Mapping Kerberos Principals to Short Names for Impala</title> + <conbody> + <p conref="../shared/impala_common.xml#common/auth_to_local_instructions"/> + </conbody> + </concept> + +</concept>
