http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_invalidate_metadata.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_invalidate_metadata.xml b/docs/topics/impala_invalidate_metadata.xml new file mode 100644 index 0000000..96fca7d --- /dev/null +++ b/docs/topics/impala_invalidate_metadata.xml @@ -0,0 +1,236 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.1" id="invalidate_metadata"> + + <title>INVALIDATE METADATA Statement</title> + <titlealts><navtitle>INVALIDATE METADATA</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="Metastore"/> + <data name="Category" value="Schemas"/> + <data name="Category" value="Tables"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">INVALIDATE METADATA statement</indexterm> + Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell, + before the table is available for Impala queries. The next time the current Impala node performs a query + against a table whose metadata is invalidated, Impala reloads the associated metadata before the query + proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the + <codeph>REFRESH</codeph> statement, so in the common scenario of adding new data files to an existing table, + prefer <codeph>REFRESH</codeph> rather than <codeph>INVALIDATE METADATA</codeph>. If you are not familiar + with the way Impala uses metadata and how it shares the same metastore database as Hive, see + <xref href="impala_hadoop.xml#intro_metastore"/> for background information. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>INVALIDATE METADATA [[<varname>db_name</varname>.]<varname>table_name</varname>]</codeblock> + + <p> + By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for + that one table is flushed. Even for a single table, <codeph>INVALIDATE METADATA</codeph> is more expensive + than <codeph>REFRESH</codeph>, so prefer <codeph>REFRESH</codeph> in the common case where you add new data + files for an existing table. + </p> + + <p conref="../shared/impala_common.xml#common/internals_blurb"/> + + <p> + To accurately respond to queries, Impala must have current metadata about those databases and tables that + clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore + that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean + that all metadata updates require an Impala update. + </p> + + <note> + <p conref="../shared/impala_common.xml#common/catalog_server_124"/> + <p rev="1.2"> + In Impala 1.2 and higher, a dedicated daemon (<cmdname>catalogd</cmdname>) broadcasts DDL changes made + through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one + Impala node, you needed to issue an <codeph>INVALIDATE METADATA</codeph> statement on another Impala node + before accessing the new database or table from the other node. Now, newly created or altered objects are + picked up automatically by all Impala nodes. You must still use the <codeph>INVALIDATE METADATA</codeph> + technique after creating or altering objects through Hive. See + <xref href="impala_components.xml#intro_catalogd"/> for more information on the catalog service. + </p> + <p> + The <codeph>INVALIDATE METADATA</codeph> statement is new in Impala 1.1 and higher, and takes over some of + the use cases of the Impala 1.0 <codeph>REFRESH</codeph> statement. Because <codeph>REFRESH</codeph> now + requires a table name parameter, to flush the metadata for all tables at once, use the <codeph>INVALIDATE + METADATA</codeph> statement. + </p> + <draft-comment translate="no"> Almost-identical wording here, under INVALIDATE METADATA, and in Release Notes :: New Features. Makes sense to conref. </draft-comment> + <p> + Because <codeph>REFRESH <varname>table_name</varname></codeph> only works for tables that the current + Impala node is already aware of, when you create a new table in the Hive shell, you must enter + <codeph>INVALIDATE METADATA</codeph> with no table parameter before you can see the new table in + <cmdname>impala-shell</cmdname>. Once the table is known by the Impala node, you can issue <codeph>REFRESH + <varname>table_name</varname></codeph> after you add data files for that table. + </p> + </note> + + <p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if: + </p> + + <ul> + <li> + A metadata change occurs. + </li> + + <li> + <b>and</b> the change is made from another <codeph>impalad</codeph> instance in your cluster, or through + Hive. + </li> + + <li> + <b>and</b> the change is made to a database to which clients such as the Impala shell or ODBC directly + connect. + </li> + </ul> + + <p> + A metadata update for an Impala node is <b>not</b> required when you issue queries from the same Impala node + where you ran <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, or other table-modifying statement. + </p> + + <p> + Database and table metadata is typically modified by: + </p> + + <ul> + <li> + Hive - via <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or + <codeph>INSERT</codeph> operations. + </li> + + <li> + Impalad - via <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph> + operations. + </li> + </ul> + + <p> + <codeph>INVALIDATE METADATA</codeph> causes the metadata for that table to be marked as stale, and reloaded + the next time the table is referenced. For a huge table, that process could take a noticeable amount of time; + thus you might prefer to use <codeph>REFRESH</codeph> where practical, to avoid an unpredictable delay later, + for example if the next reference to the table is during a benchmark test. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how you might use the <codeph>INVALIDATE METADATA</codeph> statement after + creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the + <codeph>INVALIDATE METADATA</codeph> statement was issued, Impala would give a <q>table not found</q> error + if you tried to refer to those table names. The <codeph>DESCRIBE</codeph> statements cause the latest + metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried. + </p> + +<codeblock>[impalad-host:21000] > invalidate metadata; +[impalad-host:21000] > describe t1; +... +[impalad-host:21000] > describe t2; +... </codeblock> + + <p> + For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a + combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>. + </p> + + <p> + If you need to ensure that the metadata is up-to-date when you start an <cmdname>impala-shell</cmdname> + session, run <cmdname>impala-shell</cmdname> with the <codeph>-r</codeph> or + <codeph>--refresh_after_connect</codeph> command-line option. Because this operation adds a delay to the next + query against each table, potentially expensive for large tables with many partitions, try to avoid using + this option for day-to-day operations in a production environment. + </p> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have execute + permissions for all the relevant directories holding table data. + (A table could have data spread across multiple directories, + or in unexpected paths, if it uses partitioning or + specifies a <codeph>LOCATION</codeph> attribute for + individual partitions or the entire table.) + Issues with permissions might not cause an immediate error for this statement, + but subsequent statements such as <codeph>SELECT</codeph> + or <codeph>SHOW TABLE STATS</codeph> could fail. + </p> + + <p conref="../shared/impala_common.xml#common/hdfs_blurb"/> + + <p> + By default, the <codeph>INVALIDATE METADATA</codeph> command checks HDFS permissions of the underlying data + files and directories, caching this information so that a statement can be cancelled immediately if for + example the <codeph>impala</codeph> user does not have permission to write to the data directory for the + table. (This checking does not apply if you have set the <cmdname>catalogd</cmdname> configuration option + <codeph>--load_catalog_in_background=false</codeph>.) Impala reports any lack of write permissions as an + <codeph>INFO</codeph> message in the log file, in case that represents an oversight. If you change HDFS + permissions to make data readable or writeable by the Impala user, issue another <codeph>INVALIDATE + METADATA</codeph> to make Impala aware of the change. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p rev="1.2.4"> + This example illustrates creating a new database and new table in Hive, then doing an <codeph>INVALIDATE + METADATA</codeph> statement in Impala using the fully qualified table name, after which both the new table + and the new database are visible to Impala. The ability to specify <codeph>INVALIDATE METADATA + <varname>table_name</varname></codeph> for a table created in Hive is a new capability in Impala 1.2.4. In + earlier releases, that statement would have returned an error indicating an unknown table, requiring you to + do <codeph>INVALIDATE METADATA</codeph> with no table name, a more expensive operation that reloaded metadata + for all tables and databases. + </p> + +<codeblock rev="1.2.4">$ hive +hive> create database new_db_from_hive; +OK +Time taken: 4.118 seconds +hive> create table new_db_from_hive.new_table_from_hive (x int); +OK +Time taken: 0.618 seconds +hive> quit; +$ impala-shell +[localhost:21000] > show databases like 'new*'; +[localhost:21000] > refresh new_db_from_hive.new_table_from_hive; +ERROR: AnalysisException: Database does not exist: new_db_from_hive +[localhost:21000] > invalidate metadata new_db_from_hive.new_table_from_hive; +[localhost:21000] > show databases like 'new*'; ++--------------------+ +| name | ++--------------------+ +| new_db_from_hive | ++--------------------+ +[localhost:21000] > show tables in new_db_from_hive; ++---------------------+ +| name | ++---------------------+ +| new_table_from_hive | ++---------------------+</codeblock> + + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_metadata"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_hadoop.xml#intro_metastore"/>, + <xref href="impala_refresh.xml#refresh"/> + </p> + + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_joins.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_joins.xml b/docs/topics/impala_joins.xml new file mode 100644 index 0000000..011a488 --- /dev/null +++ b/docs/topics/impala_joins.xml @@ -0,0 +1,520 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="joins"> + + <title>Joins in Impala SELECT Statements</title> + <titlealts><navtitle>Joins</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">joins</indexterm> + A join query is a <codeph>SELECT</codeph> statement that combines data from two or more tables, + and returns a result set containing items from some or all of those tables. It is a way to + cross-reference and correlate related data that is organized into multiple tables, typically + using identifiers that are repeated in each of the joined tables. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p conref="../shared/impala_common.xml#common/join_types"/> + +<codeblock>SELECT <varname>select_list</varname> FROM + <varname>table_or_subquery1</varname> [INNER] JOIN <varname>table_or_subquery2</varname> | + <varname>table_or_subquery1</varname> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <varname>table_or_subquery2</varname> | + <varname>table_or_subquery1</varname> {LEFT | RIGHT} SEMI JOIN <varname>table_or_subquery2</varname> | + <ph rev="2.0.0"><varname>table_or_subquery1</varname> {LEFT | RIGHT} ANTI JOIN <varname>table_or_subquery2</varname> |</ph> + [ ON <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...] | + USING (<varname>col1</varname> [, <varname>col2</varname> ...]) ] + [<varname>other_join_clause</varname> ...] +[ WHERE <varname>where_clauses</varname> ] + +SELECT <varname>select_list</varname> FROM + <varname>table_or_subquery1</varname>, <varname>table_or_subquery2</varname> [, <varname>table_or_subquery3</varname> ...] + [<varname>other_join_clause</varname> ...] +WHERE + <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...] + +SELECT <varname>select_list</varname> FROM + <varname>table_or_subquery1</varname> CROSS JOIN <varname>table_or_subquery2</varname> + [<varname>other_join_clause</varname> ...] +[ WHERE <varname>where_clauses</varname> ]</codeblock> + + <p> + <b>SQL-92 and SQL-89 Joins:</b> + </p> + + <p> + Queries with the explicit <codeph>JOIN</codeph> keywords are known as SQL-92 style joins, referring to the + level of the SQL standard where they were introduced. The corresponding <codeph>ON</codeph> or + <codeph>USING</codeph> clauses clearly show which columns are used as the join keys in each case: + </p> + +<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b> + <b>ON t1.id = t2.id and t1.type_flag = t2.type_flag</b> + WHERE t1.c1 > 100; + +SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b> + <b>USING (id, type_flag)</b> + WHERE t1.c1 > 100;</codeblock> + + <p> + The <codeph>ON</codeph> clause is a general way to compare columns across the two tables, even if the column + names are different. The <codeph>USING</codeph> clause is a shorthand notation for specifying the join + columns, when the column names are the same in both tables. You can code equivalent <codeph>WHERE</codeph> + clauses that compare the columns, instead of <codeph>ON</codeph> or <codeph>USING</codeph> clauses, but that + practice is not recommended because mixing the join comparisons with other filtering clauses is typically + less readable and harder to maintain. + </p> + + <p> + Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these + queries, the equality comparisons between columns of the joined tables go in the <codeph>WHERE</codeph> + clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to + accidentally remove a <codeph>WHERE</codeph> clause needed for the join to work correctly. + </p> + +<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1, t2</b> + WHERE + <b>t1.id = t2.id AND t1.type_flag = t2.type_flag</b> + AND t1.c1 > 100;</codeblock> + + <p> + <b>Self-joins:</b> + </p> + + <p> + Impala can do self-joins, for example to join on two different columns in the same table to represent + parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the + same table name for both the left-hand and right-hand table, and assign different table aliases to use when + referring to the fully qualified column names: + </p> + +<codeblock>-- Combine fields from both parent and child rows. +SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</codeblock> + + <p> + <b>Cartesian joins:</b> + </p> + + <p> + To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form: +<codeblock>SELECT ... FROM t1 JOIN t2; +SELECT ... FROM t1, t2;</codeblock> + If you intend to join the tables based on common values, add <codeph>ON</codeph> or <codeph>WHERE</codeph> + clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the + <codeph>CROSS JOIN</codeph> keyword as the join operator. The <codeph>CROSS JOIN</codeph> form does not use + any <codeph>ON</codeph> clause, because it produces a result set with all combinations of rows from the + left-hand and right-hand tables. The result set can still be filtered by subsequent <codeph>WHERE</codeph> + clauses. For example: + </p> + +<codeblock>SELECT ... FROM t1 CROSS JOIN t2; +SELECT ... FROM t1 CROSS JOIN t2 WHERE <varname>tests_on_non_join_columns</varname>;</codeblock> + + <p> + <b>Inner and outer joins:</b> + </p> + + <p> + An inner join is the most common and familiar type: rows in the result set contain the requested columns from + the appropriate tables, for all combinations of rows where the join columns of the tables have identical + values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to + refer to the column in the select list or other clauses. Impala performs inner joins by default for both + SQL-89 and SQL-92 join syntax: + </p> + +<codeblock>-- The following 3 forms are all equivalent. +SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id; +SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id; +SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</codeblock> + + <p> + An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there + is no matching data in the table on the other side of the join, the corresponding columns in the result set + are set to <codeph>NULL</codeph>. To perform an outer join, include the <codeph>OUTER</codeph> keyword in the + join operator, along with either <codeph>LEFT</codeph>, <codeph>RIGHT</codeph>, or <codeph>FULL</codeph>: + </p> + +<codeblock>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id; +SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id; +SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</codeblock> + + <p> + For outer joins, Impala requires SQL-92 syntax; that is, the <codeph>JOIN</codeph> keyword instead of + comma-separated table names. Impala does not support vendor extensions such as <codeph>(+)</codeph> or + <codeph>*=</codeph> notation for doing outer joins with SQL-89 query syntax. + </p> + + <p> + <b>Equijoins and Non-Equijoins:</b> + </p> + + <p> + By default, Impala requires an equality comparison between the left-hand and right-hand tables, either + through <codeph>ON</codeph>, <codeph>USING</codeph>, or <codeph>WHERE</codeph> clauses. These types of + queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on + the presence of equality tests between columns in the left-hand and right-hand tables. + </p> + + <p> + In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as + <codeph>!=</codeph> or <codeph><</codeph> between the join columns. These kinds of queries require care to + avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin + query that produces a result set of acceptable size, you can code the query using the <codeph>CROSS + JOIN</codeph> operator, and add the extra comparisons in the <codeph>WHERE</codeph> clause: + </p> + +<codeblock>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;</codeblock> + + <p rev="2.3.0"> + In CDH 5.5 / Impala 2.3 and higher, additional non-equijoin queries are possible due to the addition + of nested loop joins. These queries typically involve <codeph>SEMI JOIN</codeph>, + <codeph>ANTI JOIN</codeph>, or <codeph>FULL OUTER JOIN</codeph> clauses. + Impala sometimes also uses nested loop joins internally when evaluating <codeph>OUTER JOIN</codeph> + queries involving complex type columns. + Query phases involving nested loop joins do not use the spill-to-disk mechanism if they + exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot + specify any query hint to choose between the nested loop join or the original hash join algorithm. + </p> + +<codeblock rev="2.3.0">SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;</codeblock> + + <p> + <b>Semi-joins:</b> + </p> + + <p> + Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand + table is returned, for rows where there is matching data in the right-hand table, based on comparisons + between join columns in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses. Only one instance of each row + from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table. + <ph rev="2.0.0">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns + data from the right-hand table.</ph> + </p> + +<codeblock>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</codeblock> + + <p> + <b>Natural joins (not supported):</b> + </p> + + <p> + Impala does not support the <codeph>NATURAL JOIN</codeph> operator, again to avoid inconsistent or huge + result sets. Natural joins do away with the <codeph>ON</codeph> and <codeph>USING</codeph> clauses, and + instead automatically join on all columns with the same names in the left-hand and right-hand tables. This + kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop. + Thus, Impala does not support the <codeph>NATURAL JOIN</codeph> syntax, which can produce different query + results as columns are added to or removed from tables. + </p> + + <p> + If you do have any queries that use <codeph>NATURAL JOIN</codeph>, make sure to rewrite them with explicit + <codeph>USING</codeph> clauses, because Impala could interpret the <codeph>NATURAL</codeph> keyword as a + table alias: + </p> + +<codeblock>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join, +-- resulting in an error because inner joins require explicit comparisons between columns. +SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2; +ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate. + To perform a Cartesian product between two tables, use a CROSS JOIN. + +-- If you expect the tables to have identically named columns with matching values, +-- list the corresponding column names in a USING clause. +SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</codeblock> + + <p rev="2.0.0"> + <b>Anti-joins (Impala 2.0 / CDH 5.2 and higher only):</b> + </p> + + <p rev="2.0.0"> + Impala supports the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT ANTI JOIN</codeph> clauses in Impala + 2.0 and higher on CDH 4, or CDH 5.2 and higher on CDH 5. The <codeph>LEFT</codeph> or <codeph>RIGHT</codeph> + keyword is required for this kind of join. For <codeph>LEFT ANTI JOIN</codeph>, this clause returns those + values from the left-hand table that have no matching value in the right-hand table. <codeph>RIGHT ANTI + JOIN</codeph> reverses the comparison and returns values from the right-hand table. You can express this + negative relationship either through the <codeph>ANTI JOIN</codeph> clause or through a <codeph>NOT + EXISTS</codeph> operator with a subquery. + </p> + +<!-- Restriction lifted in Impala 2.0. +<p> +Impala does not support <codeph>WHERE</codeph> clauses +such as <codeph>IN (<varname>subquery</varname>)</codeph>, +<codeph>NOT IN (<varname>subquery</varname>)</codeph>, +<codeph>EXISTS (<varname>subquery</varname>)</codeph>, +and <codeph>NOT EXISTS (<varname>subquery</varname>)</codeph>. +Therefore from a practical standpoint, you cannot +express an anti-join condition, where values from one table +are returned only if no matching values are present in another table. +</p> +--> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + +<!-- To do: reuse some complex types examples with joins here or under Examples farther down. --> + + <p rev="2.3.0"> + When referring to a column with a complex type (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>) + in a query, you use join notation to <q>unpack</q> the scalar fields of the struct, the elements of the array, or + the key-value pairs of the map. (The join notation is not required for aggregation operations, such as + <codeph>COUNT()</codeph> or <codeph>SUM()</codeph> for array elements.) Because Impala recognizes which complex type elements are associated with which row + of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition. + See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + You typically use join queries in situations like these: + </p> + + <ul> + <li> + When related data arrives from different sources, with each data set physically residing in a separate + table. For example, you might have address data from business records that you cross-check against phone + listings or census data. + <note> + Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For + example, you might keep small dimension tables in HBase, for convenience of single-row lookups and + updates, and for the larger fact tables use Parquet or other binary file format optimized for scan + operations. Then, you can issue a join query to cross-reference the fact tables with the dimension + tables. + </note> + </li> + + <li> + When data is normalized, a technique for reducing data duplication by dividing it across multiple tables. + This kind of organization is often found in data that comes from traditional relational database systems. + For example, instead of repeating some long string such as a customer name in multiple tables, each table + might contain a numeric customer ID. Queries that need to display the customer name could <q>join</q> the + table that specifies which customer ID corresponds to which name. + </li> + + <li> + When certain columns are rarely needed for queries, so they are moved into separate tables to reduce + overhead for common queries. For example, a <codeph>biography</codeph> field might be rarely needed in + queries on employee data. Putting that field in a separate table reduces the amount of I/O for common + queries on employee addresses or phone numbers. Queries that do need the <codeph>biography</codeph> column + can retrieve it by performing a join with that separate table. + </li> + + <li> + In CDH 5.5 / Impala 2.3 or higher, when referring to complex type columns in queries. + See <xref href="impala_complex_types.xml#complex_types"/> for details. + </li> + </ul> + + <p> + When comparing columns with the same names in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses, use the + fully qualified names such as <codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>, or + assign table aliases, column aliases, or both to make the code more compact and understandable: + </p> + +<codeblock>select t1.c1 as first_id, t2.c2 as second_id from + t1 join t2 on first_id = second_id; + +select fact.custno, dimension.custno from + customer_data as fact join customer_address as dimension + using (custno)</codeblock> + + <note> + <p> + Performance for join queries is a crucial aspect for Impala, because complex join queries are + resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead + than an inefficient one. For best results: + </p> + <ul> + <li rev="1.2"> + Make sure that both <xref href="impala_perf_stats.xml#perf_stats">table and column statistics</xref> are + available for all the tables involved in a join query, and especially for the columns referenced in any + join conditions. Impala uses the statistics to automatically deduce an efficient join order. + Use <xref href="impala_show.xml#show"><codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and + <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph></xref> to check if statistics are + already present. Issue the <codeph>COMPUTE STATS <varname>table_name</varname></codeph> for a nonpartitioned table, + or (in Impala 2.1.0 and higher) <codeph>COMPUTE INCREMENTAL STATS <varname>table_name</varname></codeph> + for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the + statistics up to date after any substantial <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> operations. + </li> + + <li rev="1.2"> + If table or column statistics are not available, join the largest table first. You can check the + existence of statistics with the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and + <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statements. + </li> + + <li rev="1.2.2"> + If table or column statistics are not available, join subsequent tables according to which table has the + most selective filter, based on overall size and <codeph>WHERE</codeph> clauses. Joining the table with + the most selective filter results in the fewest number of rows being returned. + </li> + </ul> + <p> + For more information and examples of performance for join queries, see + <xref href="impala_perf_joins.xml#perf_joins"/>. + </p> + </note> + + <p> + To control the result set from a join query, include the names of corresponding column names in both tables + in an <codeph>ON</codeph> or <codeph>USING</codeph> clause, or by coding equality comparisons for those + columns in the <codeph>WHERE</codeph> clause. + </p> + +<codeblock>[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk; ++-------------+-----------------+ +| c_last_name | ca_city | ++-------------+-----------------+ +| Lewis | Fairfield | +| Moses | Fairview | +| Hamilton | Pleasant Valley | +| White | Oak Ridge | +| Moran | Glendale | +... +| Richards | Lakewood | +| Day | Lebanon | +| Painter | Oak Hill | +| Bentley | Greenfield | +| Jones | Stringtown | ++-------------+------------------+ +Returned 50000 row(s) in 9.82s</codeblock> + + <p> + One potential downside of joins is the possibility of excess resource usage in poorly constructed queries. + Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway + queries on large data sets, Impala requires every join query to contain at least one equality predicate + between the columns of the various tables. For example, if <codeph>T1</codeph> contains 1000 rows and + <codeph>T2</codeph> contains 1,000,000 rows, a query <codeph>SELECT <varname>columns</varname> FROM t1 JOIN + t2</codeph> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a + clause such as <codeph>ON t1.c1 = t2.c2</codeph> or <codeph>WHERE t1.c1 = t2.c2</codeph>. + </p> + + <p> + Because even with equality clauses, the result set can still be large, as we saw in the previous example, you + might use a <codeph>LIMIT</codeph> clause to return a subset of the results: + </p> + +<codeblock>[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10; ++-------------+-----------------+ +| c_last_name | ca_city | ++-------------+-----------------+ +| Lewis | Fairfield | +| Moses | Fairview | +| Hamilton | Pleasant Valley | +| White | Oak Ridge | +| Moran | Glendale | +| Sharp | Lakeview | +| Wiles | Farmington | +| Shipman | Union | +| Gilbert | New Hope | +| Brunson | Martinsville | ++-------------+-----------------+ +Returned 10 row(s) in 0.63s</codeblock> + + <p> + Or you might use additional comparison operators or aggregation functions to condense a large result set into + a smaller set of values: + </p> + +<codeblock>[localhost:21000] > -- Find the names of customers who live in one particular town. +[localhost:21000] > select distinct c_last_name from customer, customer_address where + c_customer_sk = ca_address_sk + and ca_city = "Green Acres"; ++---------------+ +| c_last_name | ++---------------+ +| Hensley | +| Pearson | +| Mayer | +| Montgomery | +| Ricks | +... +| Barrett | +| Price | +| Hill | +| Hansen | +| Meeks | ++---------------+ +Returned 332 row(s) in 0.97s + +[localhost:21000] > -- See how many different customers in this town have names starting with "A". +[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where + c_customer_sk = ca_address_sk + and ca_city = "Green Acres" + and substr(c_last_name,1,1) = "A"; ++-----------------------------+ +| count(distinct c_last_name) | ++-----------------------------+ +| 12 | ++-----------------------------+ +Returned 1 row(s) in 1.00s</codeblock> + + <p> + Because a join query can involve reading large amounts of data from disk, sending large amounts of data + across the network, and loading large amounts of data into memory to do the comparisons and filtering, you + might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for + your data set, hardware capacity, network configuration, and cluster workload. + </p> + + <p> + The two categories of joins in Impala are known as <b>partitioned joins</b> and <b>broadcast joins</b>. If + inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the + wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see + <xref href="impala_hints.xml#hints"/>. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following examples refer to these simple tables containing small sets of integers: +<codeblock>[localhost:21000] > create table t1 (x int); +[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6); + +[localhost:21000] > create table t2 (y int); +[localhost:21000] > insert into t2 values (2), (4), (6); + +[localhost:21000] > create table t3 (z int); +[localhost:21000] > insert into t3 values (1), (3), (5); +</codeblock> + </p> + +<!-- To do: fill in examples for other join types. --> + + <p> + The following example demonstrates an anti-join, returning the values from <codeph>T1</codeph> that do not + exist in <codeph>T2</codeph> (in this case, the odd numbers 1, 3, and 5): + </p> + +<codeblock>[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y); ++---+ +| x | ++---+ +| 1 | +| 3 | +| 5 | ++---+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + See these tutorials for examples of different kinds of joins: + </p> + + <ul> + <li> + <xref href="impala_tutorial.xml#tut_cross_join"/> + </li> + </ul> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_langref.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_langref.xml b/docs/topics/impala_langref.xml new file mode 100644 index 0000000..aaa76aa --- /dev/null +++ b/docs/topics/impala_langref.xml @@ -0,0 +1,179 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="langref"> + + <title><ph audience="PDF">Impala SQL Language Reference</ph><ph audience="HTML">Overview of Impala SQL</ph></title> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="impala-shell"/> + </metadata> + </prolog> + + <conbody> + + <p> + Impala uses SQL as its query language. Impala interprets SQL statements and performs the + full end-to-end processing for each statement. (As opposed to acting as a translation + layer for some other Hadoop subsystem.) + </p> + + <p> + Impala implements many familiar statements, such as <codeph>CREATE TABLE</codeph>, + <codeph>INSERT</codeph>, and <codeph>SELECT</codeph>. Currently, the DML statements + <codeph>UPDATE</codeph> and <codeph>DELETE</codeph> are not available in the production + level of Impala, because big data analytics with Hadoop and HDFS typically involves + unchanging data. <codeph>UPDATE</codeph> and <codeph>DELETE</codeph> <i>are</i> available + in beta form in the version of Impala used with the Kudu storage layer. For full details + about Impala SQL syntax and semantics, see + <xref href="impala_langref_sql.xml#langref_sql"/>. + </p> + + <p> + Queries include clauses such as <codeph>WHERE</codeph>, <codeph>GROUP BY</codeph>, + <codeph>ORDER BY</codeph>, and <codeph>JOIN</codeph>. For information about query syntax, + see <xref href="impala_select.xml#select"/>. + </p> + + <p> + Queries can also include function calls, to scalar functions such as + <codeph>sin()</codeph> and <codeph>substr()</codeph>, aggregate functions such as + <codeph>count()</codeph> and <codeph>avg()</codeph>, and analytic functions such as + <codeph>lag()</codeph> and <codeph>rank()</codeph>. For a list of the built-in functions + available in Impala queries, see <xref href="impala_functions.xml#builtins"/>. + </p> + + <p outputclass="toc"/> + + </conbody> + + <concept id="langref_performance"> + + <title>Performance Features</title> + + <conbody> + + <p> + The main performance-related SQL features for Impala are: + </p> + + <ul> + <li> + <p> + The <codeph>COMPUTE STATS</codeph> statement, and the underlying table statistics + and column statistics used in query planning. The statistics are used to estimate + the number of rows and size of the result set for queries, subqueries, and the + different <q>sides</q> of a join query. + </p> + </li> + + <li> + <p> + The output of the <codeph>EXPLAIN</codeph> statement. It outlines the ways in which + the query is parallelized, and how much I/O, memory, and so on the query expects to + use. You can control the level of detail in the output through a query option. + </p> + </li> + + <li> + <p> + Partitioning for tables. By organizing the data for efficient access along one or + more dimensions, this technique lets queries read only the relevant data. + </p> + </li> + + <li> + <p> + Query hints, especially for join queries. Impala selects from different join + algorithms based on the relative sizes of the result sets for each side of the join. + In cases where you know the most effective technique for a particular query, you can + override the estimates that Impala uses to make that choice, and select the join + technique directly. + </p> + </li> + + <li> + <p> + Query options. These options control settings that can influence the performance of + individual queries when you know the special considerations based on your workload, + hardware configuration, or data distribution. + </p> + </li> + </ul> + + <p> + Because analytic queries against high volumes of data tend to require full scans against + large portions of data from each table, Impala does not include index-related SQL + statements such as <codeph>CREATE INDEX</codeph>. The <codeph>COMPUTE STATS</codeph> + serves the purpose of analyzing the distribution of data within each column and the + overall table. Partitioning optimizes the physical layout of the data for queries that + filter on one or more crucial columns. + </p> + + </conbody> + + </concept> + + <concept id="hive_interoperability"> + + <title>Sharing Tables, Data, and Queries Between Impala and Hive</title> + + <conbody> + + <p> + To protect user investment in skills development and query design, Impala provides a + high degree of compatibility with the Hive Query Language (HiveQL): + </p> + + <ul> + <li> + Because Impala uses the same metadata store as Hive to record information about table + structure and properties, Impala can access tables defined through the native Impala + <codeph>CREATE TABLE</codeph> command, or tables created using the Hive data + definition language (DDL). + </li> + + <li> + Impala supports data manipulation (DML) statements similar to the DML component of + HiveQL. + </li> + + <li> + Impala provides many <xref href="impala_functions.xml#builtins">built-in + functions</xref> with the same names and parameter types as their HiveQL equivalents. + </li> + </ul> + + <p> + Impala supports most of the same + <xref href="impala_langref_sql.xml#langref_sql">statements and clauses</xref> as HiveQL, + including, but not limited to <codeph>JOIN</codeph>, <codeph>AGGREGATE</codeph>, + <codeph>DISTINCT</codeph>, <codeph>UNION ALL</codeph>, <codeph>ORDER BY</codeph>, + <codeph>LIMIT</codeph> and (uncorrelated) subquery in the <codeph>FROM</codeph> clause. + Impala also supports <codeph>INSERT INTO</codeph> and <codeph>INSERT OVERWRITE</codeph>. + </p> + + <p> + Impala supports data types with the same names and semantics as the equivalent Hive data + types: <codeph>STRING</codeph>, <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, + <codeph>INT</codeph>, <codeph>BIGINT</codeph>, <codeph>FLOAT</codeph>, + <codeph>DOUBLE</codeph>, <codeph>BOOLEAN</codeph>, <codeph>STRING</codeph>, + <codeph>TIMESTAMP</codeph>. CDH 5.5 / Impala 2.3 and higher also include the complex + types <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>. + </p> + + <p> + Most HiveQL <codeph>SELECT</codeph> and <codeph>INSERT</codeph> statements run + unmodified with Impala. For information about Hive syntax not available in Impala, see + <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/>. + </p> + + </conbody> + + </concept> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_langref_sql.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_langref_sql.xml b/docs/topics/impala_langref_sql.xml new file mode 100644 index 0000000..d759e76 --- /dev/null +++ b/docs/topics/impala_langref_sql.xml @@ -0,0 +1,35 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="langref_sql"> + + <title>Impala SQL Statements</title> + <titlealts><navtitle>SQL Statements</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The Impala SQL dialect supports a range of standard elements, plus some extensions for Big Data use cases + related to data loading and data warehousing. + </p> + + <note> + <p> + In the <cmdname>impala-shell</cmdname> interpreter, a semicolon at the end of each statement is required. + Since the semicolon is not actually part of the SQL syntax, we do not include it in the syntax definition + of each statement, but we do show it in examples intended to be run in <cmdname>impala-shell</cmdname>. + </p> + </note> + + <p audience="PDF" outputclass="toc all"> + The following sections show the major SQL statements that you work with in Impala: + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_langref_unsupported.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_langref_unsupported.xml b/docs/topics/impala_langref_unsupported.xml new file mode 100644 index 0000000..f2b0560 --- /dev/null +++ b/docs/topics/impala_langref_unsupported.xml @@ -0,0 +1,296 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="langref_hiveql_delta"> + + <title>SQL Differences Between Impala and Hive</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Hive"/> + <data name="Category" value="Porting"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">Hive</indexterm> + <indexterm audience="Cloudera">HiveQL</indexterm> + Impala's SQL syntax follows the SQL-92 standard, and includes many industry extensions in areas such as + built-in functions. See <xref href="impala_porting.xml#porting"/> for a general discussion of adapting SQL + code from a variety of database systems to Impala. + </p> + + <p> + Because Impala and Hive share the same metastore database and their tables are often used interchangeably, + the following section covers differences between Impala and Hive in detail. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="langref_hiveql_unsupported"> + + <title>HiveQL Features not Available in Impala</title> + + <conbody> + + <p> + The current release of Impala does not support the following SQL features that you might be familiar with + from HiveQL: + </p> + + <draft-comment translate="no"> +Yeesh, too many separate lists of unsupported Hive syntax. +Here, the FAQ, and in some of the intro topics. +Some discussion in IMP-1061 about how best to reorg. +Lots of opportunities for conrefs. +</draft-comment> + + <ul> +<!-- Now supported in CDH 5.5 / Impala 2.3 and higher. Find places on this page (like already done under lateral views) to note the new data type support. + <li> + Non-scalar data types such as maps, arrays, structs. + </li> +--> + + <li rev="1.2"> + Extensibility mechanisms such as <codeph>TRANSFORM</codeph>, custom file formats, or custom SerDes. + </li> + + <li> + XML and JSON functions. + </li> + + <li> + Certain aggregate functions from HiveQL: <codeph>covar_pop</codeph>, <codeph>covar_samp</codeph>, + <codeph>corr</codeph>, <codeph>percentile</codeph>, <codeph>percentile_approx</codeph>, + <codeph>histogram_numeric</codeph>, <codeph>collect_set</codeph>; Impala supports the set of aggregate + functions listed in <xref href="impala_aggregate_functions.xml#aggregate_functions"/> and analytic + functions listed in <xref href="impala_analytic_functions.xml#analytic_functions"/>. + </li> + + <li> + Sampling. + </li> + + <li> + Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types + (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>), using join notation + rather than the <codeph>EXPLODE()</codeph> keyword. + See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types. + </li> + + <li> + Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this + limitation. + <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/> + </li> + </ul> + + <p> + User-defined functions (UDFs) are supported starting in Impala 1.2. See <xref href="impala_udf.xml#udfs"/> + for full details on Impala UDFs. + <ul> + <li> + Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs. + </li> + + <li> + Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently + support user-defined table generating functions (UDTFs). + </li> + + <li> + Only Impala-supported column types are supported in Java-based UDFs. + </li> + </ul> + </p> + + <p> + Impala does not currently support these HiveQL statements: + </p> + + <ul> + <li> + <codeph>ANALYZE TABLE</codeph> (the Impala equivalent is <codeph>COMPUTE STATS</codeph>) + </li> + + <li> + <codeph>DESCRIBE COLUMN</codeph> + </li> + + <li> + <codeph>DESCRIBE DATABASE</codeph> + </li> + + <li> + <codeph>EXPORT TABLE</codeph> + </li> + + <li> + <codeph>IMPORT TABLE</codeph> + </li> + + <li> + <codeph>SHOW TABLE EXTENDED</codeph> + </li> + + <li> + <codeph>SHOW INDEXES</codeph> + </li> + + <li> + <codeph>SHOW COLUMNS</codeph> + </li> + </ul> + </conbody> + </concept> + + <concept id="langref_hiveql_semantics"> + + <title>Semantic Differences Between Impala and HiveQL Features</title> + + <conbody> + + <p> + This section covers instances where Impala and Hive have similar functionality, sometimes including the + same syntax, but there are differences in the runtime semantics of those features. + </p> + + <p> + <b>Security:</b> + </p> + + <p> + Impala utilizes the <xref href="http://sentry.incubator.apache.org/" scope="external" format="html">Apache + Sentry (incubating)</xref> authorization framework, which provides fine-grained role-based access control + to protect data against unauthorized access or tampering. + </p> + + <p> + The Hive component included in CDH 5.1 and higher now includes Sentry-enabled <codeph>GRANT</codeph>, + <codeph>REVOKE</codeph>, and <codeph>CREATE/DROP ROLE</codeph> statements. Earlier Hive releases had a + privilege system with <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements that were primarily + intended to prevent accidental deletion of data, rather than a security mechanism to protect against + malicious users. + </p> + + <p> + Impala can make use of privileges set up through Hive <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements. + Impala has its own <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala 2.0 and higher. + See <xref href="impala_authorization.xml#authorization"/> for the details of authorization in Impala, including + how to switch from the original policy file-based privilege model to the Sentry service using privileges + stored in the metastore database. + </p> + + <p> + <b>SQL statements and clauses:</b> + </p> + + <p> + The semantics of Impala SQL statements varies from HiveQL in some cases where they use similar SQL + statement and clause names: + </p> + + <ul> + <li> + Impala uses different syntax and names for query hints, <codeph>[SHUFFLE]</codeph> and + <codeph>[NOSHUFFLE]</codeph> rather than <codeph>MapJoin</codeph> or <codeph>StreamJoin</codeph>. See + <xref href="impala_joins.xml#joins"/> for the Impala details. + </li> + + <li> + Impala does not expose MapReduce specific features of <codeph>SORT BY</codeph>, <codeph>DISTRIBUTE + BY</codeph>, or <codeph>CLUSTER BY</codeph>. + </li> + + <li> + Impala does not require queries to include a <codeph>FROM</codeph> clause. + </li> + </ul> + + <p> + <b>Data types:</b> + </p> + + <ul> + <li> + Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected + casting behavior. + <ul> + <li> + Impala does not implicitly cast between string and numeric or Boolean types. Always use + <codeph>CAST()</codeph> for these conversions. + </li> + + <li> + Impala does perform implicit casts among the numeric types, when going from a smaller or less precise + type to a larger or more precise one. For example, Impala will implicitly convert a + <codeph>SMALLINT</codeph> to a <codeph>BIGINT</codeph> or <codeph>FLOAT</codeph>, but to convert from + <codeph>DOUBLE</codeph> to <codeph>FLOAT</codeph> or <codeph>INT</codeph> to <codeph>TINYINT</codeph> + requires a call to <codeph>CAST()</codeph> in the query. + </li> + + <li> + Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal + formats for the <codeph>TIMESTAMP</codeph> data type and the <codeph>from_unixtime()</codeph> format + string; see <xref href="impala_timestamp.xml#timestamp"/> for details. + </li> + </ul> + <p> + See <xref href="impala_datatypes.xml#datatypes"/> for full details on implicit and explicit casting for + all types, and <xref href="impala_conversion_functions.xml#conversion_functions"/> for details about + the <codeph>CAST()</codeph> function. + </p> + </li> + + <li> + Impala does not store or interpret timestamps using the local timezone, to avoid undesired results from + unexpected time zone issues. Timestamps are stored and interpreted relative to UTC. This difference can + produce different results for some calls to similarly named date/time functions between Impala and Hive. + See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details about the Impala + functions. See <xref href="impala_timestamp.xml#timestamp"/> for a discussion of how Impala handles + time zones, and configuration options you can use to make Impala match the Hive behavior more closely + when dealing with Parquet-encoded <codeph>TIMESTAMP</codeph> data or when converting between + the local time zone and UTC. + </li> + + <li> + The Impala <codeph>TIMESTAMP</codeph> type can represent dates ranging from 1400-01-01 to 9999-12-31. + This is different from the Hive date range, which is 0000-01-01 to 9999-12-31. + </li> + + <li> + Impala does not return column overflows as <codeph>NULL</codeph>, so that customers can distinguish + between <codeph>NULL</codeph> data and overflow conditions similar to how they do so with traditional + database systems. Impala returns the largest or smallest value in the range for the type. For example, + valid values for a <codeph>tinyint</codeph> range from -128 to 127. In Impala, a <codeph>tinyint</codeph> + with a value of -200 returns -128 rather than <codeph>NULL</codeph>. A <codeph>tinyint</codeph> with a + value of 200 returns 127. + </li> + </ul> + + <p> + <b>Miscellaneous features:</b> + </p> + + <ul> + <li> + Impala does not provide virtual columns. + </li> + + <li> + Impala does not expose locking. + </li> + + <li> + Impala does not expose some configuration properties. + </li> + </ul> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_limit.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_limit.xml b/docs/topics/impala_limit.xml new file mode 100644 index 0000000..c186cd4 --- /dev/null +++ b/docs/topics/impala_limit.xml @@ -0,0 +1,149 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="limit"> + + <title>LIMIT Clause</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Reports"/> + </metadata> + </prolog> + + <conbody> + + <p> + The <codeph>LIMIT</codeph> clause in a <codeph>SELECT</codeph> query sets a maximum number of rows for the + result set. Pre-selecting the maximum size of the result set helps Impala to optimize memory usage while + processing a distributed query. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>LIMIT <varname>constant_integer_expression</varname></codeblock> + + <p> + The argument to the <codeph>LIMIT</codeph> clause must evaluate to a constant value. It can be a numeric + literal, or another kind of numeric expression involving operators, casts, and function return values. You + cannot refer to a column or use a subquery. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + This clause is useful in contexts such as: + </p> + + <ul> + <li> + To return exactly N items from a top-N query, such as the 10 highest-rated items in a shopping category or + the 50 hostnames that refer the most traffic to a web site. + </li> + + <li> + To demonstrate some sample values from a table or a particular query. (To display some arbitrary items, use + a query with no <codeph>ORDER BY</codeph> clause. An <codeph>ORDER BY</codeph> clause causes additional + memory and/or disk usage during the query.) + </li> + + <li> + To keep queries from returning huge result sets by accident if a table is larger than expected, or a + <codeph>WHERE</codeph> clause matches more rows than expected. + </li> + </ul> + + <p rev="1.2.1"> + Originally, the value for the <codeph>LIMIT</codeph> clause had to be a numeric literal. In Impala 1.2.1 and + higher, it can be a numeric expression. + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_limit"/> + + <p> + See <xref href="impala_order_by.xml#order_by"/> for details. + </p> + + <p conref="../shared/impala_common.xml#common/limit_and_offset"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/subquery_no_limit"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how the <codeph>LIMIT</codeph> clause caps the size of the result set, with the + limit being applied after any other clauses such as <codeph>WHERE</codeph>. + </p> + +<codeblock>[localhost:21000] > create database limits; +[localhost:21000] > use limits; +[localhost:21000] > create table numbers (x int); +[localhost:21000] > insert into numbers values (1), (3), (4), (5), (2); +Inserted 5 rows in 1.34s +[localhost:21000] > select x from numbers limit 100; ++---+ +| x | ++---+ +| 1 | +| 3 | +| 4 | +| 5 | +| 2 | ++---+ +Returned 5 row(s) in 0.26s +[localhost:21000] > select x from numbers limit 3; ++---+ +| x | ++---+ +| 1 | +| 3 | +| 4 | ++---+ +Returned 3 row(s) in 0.27s +[localhost:21000] > select x from numbers where x > 2 limit 2; ++---+ +| x | ++---+ +| 3 | +| 4 | ++---+ +Returned 2 row(s) in 0.27s</codeblock> + + <p> + For top-N and bottom-N queries, you use the <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses + together: + </p> + +<codeblock rev="obwl">[localhost:21000] > select x as "Top 3" from numbers order by x desc limit 3; ++-------+ +| top 3 | ++-------+ +| 5 | +| 4 | +| 3 | ++-------+ +[localhost:21000] > select x as "Bottom 3" from numbers order by x limit 3; ++----------+ +| bottom 3 | ++----------+ +| 1 | +| 2 | +| 3 | ++----------+ +</codeblock> + + <p> + You can use constant values besides integer literals as the <codeph>LIMIT</codeph> argument: + </p> + +<codeblock>-- Other expressions that yield constant integer values work too. +SELECT x FROM t1 LIMIT 1e6; -- Limit is one million. +SELECT x FROM t1 LIMIT length('hello world'); -- Limit is 11. +SELECT x FROM t1 LIMIT 2+2; -- Limit is 4. +SELECT x FROM t1 LIMIT cast(truncate(9.9) AS INT); -- Limit is 9. +</codeblock> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_literals.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_literals.xml b/docs/topics/impala_literals.xml new file mode 100644 index 0000000..3c53796 --- /dev/null +++ b/docs/topics/impala_literals.xml @@ -0,0 +1,384 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="literals"> + + <title>Literals</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">literals</indexterm> + Each of the Impala data types has corresponding notation for literal values of that type. You specify literal + values in SQL statements, such as in the <codeph>SELECT</codeph> list or <codeph>WHERE</codeph> clause of a + query, or as an argument to a function call. See <xref href="impala_datatypes.xml#datatypes"/> for a complete + list of types, ranges, and conversion rules. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="numeric_literals"> + + <title>Numeric Literals</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">numeric literals</indexterm> + To write literals for the integer types (<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, + <codeph>INT</codeph>, and <codeph>BIGINT</codeph>), use a sequence of digits with optional leading zeros. + </p> + + <p rev="1.4.0"> + To write literals for the floating-point types (<codeph rev="1.4.0">DECIMAL</codeph>, + <codeph>FLOAT</codeph>, and <codeph>DOUBLE</codeph>), use a sequence of digits with an optional decimal + point (<codeph>.</codeph> character). To preserve accuracy during arithmetic expressions, Impala interprets + floating-point literals as the <codeph>DECIMAL</codeph> type with the smallest appropriate precision and + scale, until required by the context to convert the result to <codeph>FLOAT</codeph> or + <codeph>DOUBLE</codeph>. + </p> + + <p> + Integer values are promoted to floating-point when necessary, based on the context. + </p> + + <p> + You can also use exponential notation by including an <codeph>e</codeph> character. For example, + <codeph>1e6</codeph> is 1 times 10 to the power of 6 (1 million). A number in exponential notation is + always interpreted as floating-point. + </p> + + <p rev="tk"> + When Impala encounters a numeric literal, it considers the type to be the <q>smallest</q> that can + accurately represent the value. The type is promoted to larger or more accurate types if necessary, based + on subsequent parts of an expression. + </p> + <p> + For example, you can see by the types Impala defines for the following table columns + how it interprets the corresponding numeric literals: + </p> +<codeblock>[localhost:21000] > create table ten as select 10 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc ten; ++------+---------+---------+ +| name | type | comment | ++------+---------+---------+ +| x | tinyint | | ++------+---------+---------+ + +[localhost:21000] > create table four_k as select 4096 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc four_k; ++------+----------+---------+ +| name | type | comment | ++------+----------+---------+ +| x | smallint | | ++------+----------+---------+ + +[localhost:21000] > create table one_point_five as select 1.5 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc one_point_five; ++------+--------------+---------+ +| name | type | comment | ++------+--------------+---------+ +| x | decimal(2,1) | | ++------+--------------+---------+ + +[localhost:21000] > create table one_point_three_three_three as select 1.333 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc one_point_three_three_three; ++------+--------------+---------+ +| name | type | comment | ++------+--------------+---------+ +| x | decimal(4,3) | | ++------+--------------+---------+ +</codeblock> + </conbody> + </concept> + + <concept id="string_literals"> + + <title>String Literals</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">string literals</indexterm> + String literals are quoted using either single or double quotation marks. You can use either kind of quotes + for string literals, even both kinds for different literals within the same statement. + </p> + + <p rev="2.0.0"> + Quoted literals are considered to be of type <codeph>STRING</codeph>. To use quoted literals in contexts + requiring a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> value, <codeph>CAST()</codeph> the literal to + a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> of the appropriate length. + </p> + + <p> + <b>Escaping special characters:</b> + </p> + + <p> + To encode special characters within a string literal, precede them with the backslash (<codeph>\</codeph>) + escape character: + </p> + + <ul> + <li> + <codeph>\t</codeph> represents a tab. + </li> + + <li> + <codeph>\n</codeph> represents a newline or linefeed. This might cause extra line breaks in + <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\r</codeph> represents a carriage return. This might cause unusual formatting (making it appear + that some content is overwritten) in <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\b</codeph> represents a backspace. This might cause unusual formatting (making it appear that + some content is overwritten) in <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\0</codeph> represents an ASCII <codeph>nul</codeph> character (not the same as a SQL + <codeph>NULL</codeph>). This might not be visible in <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\Z</codeph> represents a DOS end-of-file character. This might not be visible in + <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\%</codeph> and <codeph>\_</codeph> can be used to escape wildcard characters within the string + passed to the <codeph>LIKE</codeph> operator. + </li> + + <li> + <codeph>\</codeph> followed by 3 octal digits represents the ASCII code of a single character; for + example, <codeph>\101</codeph> is ASCII 65, the character <codeph>A</codeph>. + </li> + + <li> + Use two consecutive backslashes (<codeph>\\</codeph>) to prevent the backslash from being interpreted as + an escape character. + </li> + + <li> + Use the backslash to escape single or double quotation mark characters within a string literal, if the + literal is enclosed by the same type of quotation mark. + </li> + + <li> + If the character following the <codeph>\</codeph> does not represent the start of a recognized escape + sequence, the character is passed through unchanged. + </li> + </ul> + + <p> + <b>Quotes within quotes:</b> + </p> + + <p> + To include a single quotation character within a string value, enclose the literal with either single or + double quotation marks, and optionally escape the single quote as a <codeph>\'</codeph> sequence. Earlier + releases required escaping a single quote inside double quotes. Continue using escape sequences in this + case if you also need to run your SQL code on older versions of Impala. + </p> + + <p> + To include a double quotation character within a string value, enclose the literal with single quotation + marks, no escaping is necessary in this case. Or, enclose the literal with double quotation marks and + escape the double quote as a <codeph>\"</codeph> sequence. + </p> + +<codeblock>[localhost:21000] > select "What\'s happening?" as single_within_double, + > 'I\'m not sure.' as single_within_single, + > "Homer wrote \"The Iliad\"." as double_within_double, + > 'Homer also wrote "The Odyssey".' as double_within_single; ++----------------------+----------------------+--------------------------+---------------------------------+ +| single_within_double | single_within_single | double_within_double | double_within_single | ++----------------------+----------------------+--------------------------+---------------------------------+ +| What's happening? | I'm not sure. | Homer wrote "The Iliad". | Homer also wrote "The Odyssey". | ++----------------------+----------------------+--------------------------+---------------------------------+ +</codeblock> + + <p> + <b>Field terminator character in CREATE TABLE:</b> + </p> + + <note conref="../shared/impala_common.xml#common/thorn"/> + + <p> + <b>impala-shell considerations:</b> + </p> + + <p> + When dealing with output that includes non-ASCII or non-printable characters such as linefeeds and + backspaces, use the <cmdname>impala-shell</cmdname> options to save to a file, turn off pretty printing, or + both rather than relying on how the output appears visually. See + <xref href="impala_shell_options.xml#shell_options"/> for a list of <cmdname>impala-shell</cmdname> + options. + </p> + </conbody> + </concept> + + <concept id="boolean_literals"> + + <title>Boolean Literals</title> + + <conbody> + + <p> + For <codeph>BOOLEAN</codeph> values, the literals are <codeph>TRUE</codeph> and <codeph>FALSE</codeph>, + with no quotation marks and case-insensitive. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>select true; +select * from t1 where assertion = false; +select case bool_col when true then 'yes' when false 'no' else 'null' end from t1;</codeblock> + </conbody> + </concept> + + <concept id="timestamp_literals"> + + <title>Timestamp Literals</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/timestamp_conversions"/> + + <p> + You can also use <codeph>INTERVAL</codeph> expressions to add or subtract from timestamp literal values, + such as <codeph>'1966-07-30' + INTERVAL 5 YEARS + INTERVAL 3 DAYS</codeph>. See + <xref href="impala_timestamp.xml#timestamp"/> for details. + </p> + + <p> + Depending on your data pipeline, you might receive date and time data as text, in notation that does not + exactly match the format for Impala <codeph>TIMESTAMP</codeph> literals. + See <xref href="impala_datetime_functions.xml#datetime_functions"/> for functions that can convert + between a variety of string literals (including different field order, separators, and timezone notation) + and equivalent <codeph>TIMESTAMP</codeph> or numeric values. + </p> + </conbody> + </concept> + + <concept id="null"> + + <title>NULL</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">NULL</indexterm> + The notion of <codeph>NULL</codeph> values is familiar from all kinds of database systems, but each SQL + dialect can have its own behavior and restrictions on <codeph>NULL</codeph> values. For Big Data + processing, the precise semantics of <codeph>NULL</codeph> values are significant: any misunderstanding + could lead to inaccurate results or misformatted data, that could be time-consuming to correct for large + data sets. + </p> + + <ul> + <li> + <codeph>NULL</codeph> is a different value than an empty string. The empty string is represented by a + string literal with nothing inside, <codeph>""</codeph> or <codeph>''</codeph>. + </li> + + <li> + In a delimited text file, the <codeph>NULL</codeph> value is represented by the special token + <codeph>\N</codeph>. + </li> + + <li> + When Impala inserts data into a partitioned table, and the value of one of the partitioning columns is + <codeph>NULL</codeph> or the empty string, the data is placed in a special partition that holds only + these two kinds of values. When these values are returned in a query, the result is <codeph>NULL</codeph> + whether the value was originally <codeph>NULL</codeph> or an empty string. This behavior is compatible + with the way Hive treats <codeph>NULL</codeph> values in partitioned tables. Hive does not allow empty + strings as partition keys, and it returns a string value such as + <codeph>__HIVE_DEFAULT_PARTITION__</codeph> instead of <codeph>NULL</codeph> when such values are + returned from a query. For example: +<codeblock>create table t1 (i int) partitioned by (x int, y string); +-- Select an INT column from another table, with all rows going into a special HDFS subdirectory +-- named __HIVE_DEFAULT_PARTITION__. Depending on whether one or both of the partitioning keys +-- are null, this special directory name occurs at different levels of the physical data directory +-- for the table. +insert into t1 partition(x=NULL, y=NULL) select c1 from some_other_table; +insert into t1 partition(x, y=NULL) select c1, c2 from some_other_table; +insert into t1 partition(x=NULL, y) select c1, c3 from some_other_table;</codeblock> + </li> + + <li> + There is no <codeph>NOT NULL</codeph> clause when defining a column to prevent <codeph>NULL</codeph> + values in that column. + </li> + + <li> + There is no <codeph>DEFAULT</codeph> clause to specify a non-<codeph>NULL</codeph> default value. + </li> + + <li> + If an <codeph>INSERT</codeph> operation mentions some columns but not others, the unmentioned columns + contain <codeph>NULL</codeph> for all inserted rows. + </li> + + <li rev="1.2.1"> + <p conref="../shared/impala_common.xml#common/null_sorting_change"/> + <note> + <draft-comment translate="no"> Probably a bunch of similar view-related restrictions like this that should be collected, reused, or cross-referenced under the Views topic. </draft-comment> + Because the <codeph>NULLS FIRST</codeph> and <codeph>NULLS LAST</codeph> keywords are not currently + available in Hive queries, any views you create using those keywords will not be available through + Hive. + </note> + </li> + + <li> + In all other contexts besides sorting with <codeph>ORDER BY</codeph>, comparing a <codeph>NULL</codeph> + to anything else returns <codeph>NULL</codeph>, making the comparison meaningless. For example, + <codeph>10 > NULL</codeph> produces <codeph>NULL</codeph>, <codeph>10 < NULL</codeph> also produces + <codeph>NULL</codeph>, <codeph>5 BETWEEN 1 AND NULL</codeph> produces <codeph>NULL</codeph>, and so on. + </li> + </ul> + + <p> + Several built-in functions serve as shorthand for evaluating expressions and returning + <codeph>NULL</codeph>, 0, or some other substitution value depending on the expression result: + <codeph>ifnull()</codeph>, <codeph>isnull()</codeph>, <codeph>nvl()</codeph>, <codeph>nullif()</codeph>, + <codeph>nullifzero()</codeph>, and <codeph>zeroifnull()</codeph>. See + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_live_progress.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_live_progress.xml b/docs/topics/impala_live_progress.xml new file mode 100644 index 0000000..f58cdcb --- /dev/null +++ b/docs/topics/impala_live_progress.xml @@ -0,0 +1,81 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.3.0" id="live_progress"> + + <title>LIVE_PROGRESS Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Reports"/> + <data name="Category" value="impala-shell"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">LIVE_PROGRESS query option</indexterm> + For queries submitted through the <cmdname>impala-shell</cmdname> command, + displays an interactive progress bar showing roughly what percentage of + processing has been completed. When the query finishes, the progress bar is erased + from the <cmdname>impala-shell</cmdname> console output. + </p> + + <p> + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/command_line_blurb"/> + <p> + You can enable this query option within <cmdname>impala-shell</cmdname> + by starting the shell with the <codeph>--live_progress</codeph> + command-line option. + You can still turn this setting off and on again within the shell through the + <codeph>SET</codeph> command. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p conref="../shared/impala_common.xml#common/live_reporting_details"/> + <p> + For a more detailed way of tracking the progress of an interactive query through + all phases of processing, see <xref href="impala_live_summary.xml#live_summary"/>. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + <p> + Because the percentage complete figure is calculated using the number of + issued and completed <q>scan ranges</q>, which occur while reading the table + data, the progress bar might reach 100% before the query is entirely finished. + For example, the query might do work to perform aggregations after all the + table data has been read. If many of your queries fall into this category, + consider using the <codeph>LIVE_SUMMARY</codeph> option instead for + more granular progress reporting. + </p> + <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_compute_stats_caveat"/> + <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_shell_only_caveat"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> +<codeblock><![CDATA[[localhost:21000] > set live_progress=true; +LIVE_PROGRESS set to true +[localhost:21000] > select count(*) from customer; ++----------+ +| count(*) | ++----------+ +| 150000 | ++----------+ +[localhost:21000] > select count(*) from customer t1 cross join customer t2; +[################################################## ] 50% +[####################################################################################################] 100% + +]]> +</codeblock> + + <p conref="../shared/impala_common.xml#common/live_progress_live_summary_asciinema"/> + + </conbody> +</concept>
