http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_perf_stats.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml index 7ed878f..5e12881 100644 --- a/docs/topics/impala_perf_stats.xml +++ b/docs/topics/impala_perf_stats.xml @@ -3,7 +3,16 @@ <concept id="perf_stats"> <title>Table and Column Statistics</title> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Concepts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> <conbody> @@ -16,7 +25,1007 @@ with, and how to produce them and keep them up to date. </p> - + <note> + <p rev="1.2.2"> + Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive <codeph>ANALYZE + TABLE</codeph> statement which initiates a MapReduce job. For better user-friendliness and reliability, + Impala implements its own <codeph>COMPUTE STATS</codeph> statement in Impala 1.2.2 and higher, along with the + <codeph>DROP STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph> + statements. + </p> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="perf_table_stats"> + + <title id="table_stats">Overview of Table Statistics</title> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + +<!-- Hive background info: https://cwiki.apache.org/Hive/statsdev.html --> + + <p> + The Impala query planner can make use of statistics about entire tables and partitions. + This information includes physical characteristics such as the number of rows, number of data files, + the total size of the data files, and the file format. For partitioned tables, the numbers + are calculated per partition, and as totals for the whole table. + This metadata is stored in the metastore database, and can be updated by either Impala or Hive. + If a number is not available, the value -1 is used as a placeholder. + Some numbers, such as number and total sizes of data files, are always kept up to date because + they can be calculated cheaply, as part of gathering HDFS block metadata. + </p> + + <p> + The following example shows table stats for an unpartitioned Parquet table. + The values for the number and sizes of files are always available. + Initially, the number of rows is not known, because it requires a potentially expensive + scan through the entire table, and so that value is displayed as -1. + The <codeph>COMPUTE STATS</codeph> statement fills in any unknown table stats values. + </p> + +<codeblock> +show table stats parquet_snappy; ++-------+--------+---------+--------------+-------------------+---------+-------------------+... +| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |... ++-------+--------+---------+--------------+-------------------+---------+-------------------+... +| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |... ++-------+--------+---------+--------------+-------------------+---------+-------------------+... + +compute stats parquet_snappy; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 6 column(s). 
| ++-----------------------------------------+ + + +show table stats parquet_snappy; ++------------+--------+---------+--------------+-------------------+---------+-------------------+... +| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |... ++------------+--------+---------+--------------+-------------------+---------+-------------------+... +| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |... ++------------+--------+---------+--------------+-------------------+---------+-------------------+... +</codeblock> + + <p> + Impala performs some optimizations using this metadata on its own, and other optimizations by + using a combination of table and column statistics. + </p> + + <p rev="1.2.1"> + To check that table statistics are available for a table, and see the details of those statistics, use the + statement <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph>. See + <xref href="impala_show.xml#show"/> for details. + </p> + + <p> + If you use the Hive-based methods of gathering statistics, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the + Hive wiki</xref> for information about the required configuration on the Hive side. Cloudera recommends + using the Impala <codeph>COMPUTE STATS</codeph> statement to avoid potential configuration and scalability + issues with the statistics-gathering process. + </p> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> + + <concept id="perf_column_stats"> + + <title id="column_stats">Overview of Column Statistics</title> + + <conbody> + +<!-- Cloudera+Hive background information: http://blog.cloudera.com/blog/2012/08/column-statistics-in-hive/ --> + + <p> + The Impala query planner can make use of statistics about individual columns when that metadata is + available in the metastore database. This technique is most valuable for columns compared across tables in + <xref href="impala_perf_joins.xml#perf_joins">join queries</xref>, to help estimate how many rows the query + will retrieve from each table. <ph rev="2.0.0"> These statistics are also important for correlated + subqueries using the <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed + internally the same way as join queries.</ph> + </p> + + <p> + The following example shows column stats for an unpartitioned Parquet table. + The values for the maximum and average sizes of some types are always available, + because those figures are constant for numeric and other fixed-size types. + Initially, the number of distinct values is not known, because it requires a potentially expensive + scan through the entire table, and so that value is displayed as -1. + The same applies to maximum and average sizes of variable-sized types, such as <codeph>STRING</codeph>. + The <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. + (It does not record the number of <codeph>NULL</codeph> values, because currently Impala + does not use that figure for query optimization.) 
+ </p> + +<codeblock> +show column stats parquet_snappy; ++-------------+----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-------------+----------+------------------+--------+----------+----------+ +| id | BIGINT | -1 | -1 | 8 | 8 | +| val | INT | -1 | -1 | 4 | 4 | +| zerofill | STRING | -1 | -1 | -1 | -1 | +| name | STRING | -1 | -1 | -1 | -1 | +| assertion | BOOLEAN | -1 | -1 | 1 | 1 | +| location_id | SMALLINT | -1 | -1 | 2 | 2 | ++-------------+----------+------------------+--------+----------+----------+ + +compute stats parquet_snappy; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 6 column(s). | ++-----------------------------------------+ + +show column stats parquet_snappy; ++-------------+----------+------------------+--------+----------+-------------------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-------------+----------+------------------+--------+----------+-------------------+ +| id | BIGINT | 183861280 | -1 | 8 | 8 | +| val | INT | 139017 | -1 | 4 | 4 | +| zerofill | STRING | 101761 | -1 | 6 | 6 | +| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 | +| assertion | BOOLEAN | 2 | -1 | 1 | 1 | +| location_id | SMALLINT | 339 | -1 | 2 | 2 | ++-------------+----------+------------------+--------+----------+-------------------+ +</codeblock> + + <note> + <p> + For column statistics to be effective in Impala, you also need to have table statistics for the + applicable tables, as described in <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use + the Impala <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are automatically + gathered at the same time, for all columns in the table. + </p> + <p conref="../shared/impala_common.xml#common/decimal_no_stats"/> + </note> + + <note conref="../shared/impala_common.xml#common/compute_stats_nulls"/> + +<!-- Hive-based instructions are considered obsolete since the introduction of the Impala COMPUTE STATS statement. + <p> + Add settings like the following to the <filepath>hive-site.xml</filepath> + configuration file, in the Hive configuration directory, on every node where you run + <codeph>ANALYZE TABLE</codeph> statements through the + <codeph>hive</codeph> shell. The + <codeph>hive.stats.ndv.error</codeph> setting represents the standard error when + estimating the number of distinct values for a column. The value of 5.0 is recommended as a tradeoff between the + accuracy of the gathered statistics and the resource usage of the stats-gathering process. + </p> + +<codeblock><![CDATA[<property> + <name>hive.stats.ndv.error</name> + <value>5.0</value> +</property>]]></codeblock> + + <p> + 5.0 is a relatively low value that devotes substantial computational resources to the statistics-gathering + process. To reduce the resource usage, you could increase this value; to make the statistics even more precise, + you could lower it. + </p> + + <p> + The syntax for gathering column statistics uses the <codeph>ANALYZE TABLE ... + COMPUTE STATISTICS</codeph> clause, with an additional <codeph>FOR + COLUMNS</codeph> clause. For partitioned tables, you can gather statistics for specific partitions by including + a clause <codeph>PARTITION + (<varname>col1=val1</varname>,<varname>col2=val2</varname>, + ...)</codeph>; but you cannot include the partitioning columns in the + <codeph>FOR COLUMNS</codeph> clause. 
Also, you cannot use fully qualified table + names, so issue a <codeph>USE</codeph> command first to switch to the + appropriate database. For example: + </p> + +<codeblock>USE <varname>database_name</varname>; +ANALYZE TABLE <varname>table_name</varname> COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>; +ANALYZE TABLE <varname>table_name</varname> PARTITION (<varname>partition_specs</varname>) COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;</codeblock> +--> + + <p rev="1.2.1"> + To check whether column statistics are available for a particular set of columns, use the <codeph>SHOW + COLUMN STATS <varname>table_name</varname></codeph> statement, or check the extended + <codeph>EXPLAIN</codeph> output for a query against that table that refers to those columns. See + <xref href="impala_show.xml#show"/> and <xref href="impala_explain.xml#explain"/> for details. + </p> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> + + <concept id="perf_stats_partitions"> + <title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title> + <conbody> + + <p> + When you use Impala for <q>big data</q>, you are highly likely to use partitioning + for your biggest tables, the ones representing data that can be logically divided + based on dates, geographic regions, or similar criteria. The table and column statistics + are especially useful for optimizing queries on such tables. For example, a query involving + one year might involve substantially more or less data than a query involving a different year, + or a range of several years. Each query might be optimized differently as a result. + </p> + + <p> + The following examples show how table and column stats work with a partitioned table. + The table for this example is partitioned by year, month, and day. + For simplicity, the sample data consists of 5 partitions, all from the same year and month. + Table stats are collected independently for each partition. (In fact, the + <codeph>SHOW PARTITIONS</codeph> statement displays exactly the same information as + <codeph>SHOW TABLE STATS</codeph> for a partitioned table.) Column stats apply to + the entire table, not to individual partitions. Because the partition key column values + are represented as HDFS directories, their characteristics are typically known in advance, + even when the values for non-key columns are shown as -1. + </p> + +<codeblock> +show partitions year_month_day; ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... +| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... +| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| Total | | | -1 | 5 | 12.58MB | 0B | | |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... + +show table stats year_month_day; ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... 
+| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... +| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| Total | | | -1 | 5 | 12.58MB | 0B | | |... ++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... + +show column stats year_month_day; ++-----------+---------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-----------+---------+------------------+--------+----------+----------+ +| id | INT | -1 | -1 | 4 | 4 | +| val | INT | -1 | -1 | 4 | 4 | +| zfill | STRING | -1 | -1 | -1 | -1 | +| name | STRING | -1 | -1 | -1 | -1 | +| assertion | BOOLEAN | -1 | -1 | 1 | 1 | +| year | INT | 1 | 0 | 4 | 4 | +| month | INT | 1 | 0 | 4 | 4 | +| day | INT | 5 | 0 | 4 | 4 | ++-----------+---------+------------------+--------+----------+----------+ + +compute stats year_month_day; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 5 partition(s) and 5 column(s). | ++-----------------------------------------+ + +show table stats year_month_day; ++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... +| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... +| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... +| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... +| Total | | | 469290 | 5 | 12.58MB | 0B | | |... ++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... + +show column stats year_month_day; ++-----------+---------+------------------+--------+----------+-------------------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++-----------+---------+------------------+--------+----------+-------------------+ +| id | INT | 511129 | -1 | 4 | 4 | +| val | INT | 364853 | -1 | 4 | 4 | +| zfill | STRING | 311430 | -1 | 6 | 6 | +| name | STRING | 471975 | -1 | 22 | 13.00160026550293 | +| assertion | BOOLEAN | 2 | -1 | 1 | 1 | +| year | INT | 1 | 0 | 4 | 4 | +| month | INT | 1 | 0 | 4 | 4 | +| day | INT | 5 | 0 | 4 | 4 | ++-----------+---------+------------------+--------+----------+-------------------+ +</codeblock> + + <note> + Partitioned tables can grow so large that scanning the entire table, as the <codeph>COMPUTE STATS</codeph> + statement does, is impractical just to update the statistics for a new partition. The standard + <codeph>COMPUTE STATS</codeph> statement might take hours, or even days. That situation is where you switch + to using incremental statistics, a feature available in CDH 5.3 / Impala 2.1 and higher. 
+ See <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for details about this feature + and the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax. + </note> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> + + <concept rev="2.1.0" id="perf_stats_incremental"> + + <title id="incremental_stats">Overview of Incremental Statistics</title> + + <conbody> + + <p> + In Impala 2.1.0 and higher, you can use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph> and + <codeph>DROP INCREMENTAL STATS</codeph>. The <codeph>INCREMENTAL</codeph> clauses work with incremental + statistics, a specialized feature for partitioned tables that are large or frequently updated with new + partitions. + </p> + + <p> + When you compute incremental statistics for a partitioned table, by default Impala only processes those + partitions that do not yet have incremental statistics. By processing only newly added partitions, you can + keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the + entire table each time. + </p> + + <p> + You can also compute or drop statistics for a single partition by including a <codeph>PARTITION</codeph> + clause in the <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph> + statement. + </p> + + <p> + The metadata for incremental statistics is handled differently from the original style of statistics: + </p> + + <ul> + <li> + <p> + If you have an existing partitioned table for which you have already computed statistics, issuing + <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala to rescan the + entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL + STATS</codeph> statements only scan any new partitions and any partitions where you performed + <codeph>DROP INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> statements now include an + additional column showing whether incremental statistics are available for each column. A partition + could already be covered by the original type of statistics based on a prior <codeph>COMPUTE + STATS</codeph> statement, as indicated by a value other than <codeph>-1</codeph> under the + <codeph>#Rows</codeph> column. Impala query planning uses either kind of statistics when available. + </p> + </li> + + <li> + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE STATS</codeph> for the + same volume of data. Therefore it is most suitable for tables with large data volume where new + partitions are added frequently, making it impractical to run a full <codeph>COMPUTE STATS</codeph> + operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once + and not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> syntax. + </p> + </li> + + <li> + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the <cmdname>catalogd</cmdname> process, + proportional to the number of partitions and number of columns in the applicable table. The memory + overhead is approximately 400 bytes for each column in each partition. This memory is reserved in the + <cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and in each instance of + the <cmdname>impalad</cmdname> daemon. 
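+ </p>
+ <p>
+ As a rough sizing sketch, using the approximate 400-byte figure above and
+ hypothetical table dimensions:
+ </p>
+<codeblock>
+-- Back-of-the-envelope estimate, not an exact formula:
+-- 20,000 partitions x 100 columns x 400 bytes = 800,000,000 bytes (~800 MB)
+-- of metadata held in catalogd, statestored, and each impalad instance.
+</codeblock>
+ <p>
+ Keep this overhead in mind before enabling incremental stats on very wide,
+ heavily partitioned tables.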
+ </p> + </li> + + <li> + <p> + In cases where new files are added to an existing partition, issue a <codeph>REFRESH</codeph> statement + for the table, followed by a <codeph>DROP INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL + STATS</codeph> sequence for the changed partition. + </p> + </li> + + <li> + <p> + The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single partition at a time. To + remove statistics (whether incremental or not) from all partitions of a table, issue a <codeph>DROP + STATS</codeph> statement with no <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses. + </p> + </li> + </ul> + + <p> + The following considerations apply to incremental statistics when the structure of an existing table is + changed (known as <term>schema evolution</term>): + </p> + + <ul> + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing statistics remain + valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any partitions. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans all partitions and + fills in the appropriate column-level values the next time you run <codeph>COMPUTE INCREMENTAL + STATS</codeph>. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a column, Impala + rescans all partitions and fills in the appropriate column-level values the next time you run + <codeph>COMPUTE INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a table, the existing + statistics remain valid and a subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any + partitions. + </p> + </li> + </ul> + + <p> + See <xref href="impala_compute_stats.xml#compute_stats"/> and + <xref href="impala_drop_stats.xml#drop_stats"/> for syntax details. + </p> + </conbody> + </concept> + + <concept id="perf_stats_computing"> + <title>Generating Table and Column Statistics (COMPUTE STATS Statement)</title> + <conbody> + + <p> + To gather table statistics after loading data into a table or partition, you typically use the + <codeph>COMPUTE STATS</codeph> statement. This statement is available in Impala 1.2.2 and higher. + It gathers both table statistics and column statistics for all columns in a single operation. + For large partitioned tables, where you frequently need to update statistics and it is impractical + to scan the entire table each time, use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph>, + which is available in CDH 5.3 / Impala 2.1 and higher. + </p> + + <p> + If you use Hive as part of your ETL workflow, you can also use Hive to generate table and + column statistics. You might need to do extra configuration within Hive itself, the metastore, + or even set up a separate database to hold Hive-generated statistics. You might need to run + multiple statements to generate all the necessary statistics. Therefore, prefer the + Impala <codeph>COMPUTE STATS</codeph> statement where that technique is practical. + For details about collecting statistics through Hive, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the Hive wiki</xref>. + </p> + + <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + +<!-- Commenting out over-detailed Hive instructions as part of stats reorg. 
+ <li> + Issue an <codeph>ANALYZE TABLE</codeph> statement in Hive, for the entire table or a specific partition. +<codeblock>ANALYZE TABLE <varname>tablename</varname> [PARTITION(<varname>partcol1</varname>[=<varname>val1</varname>], <varname>partcol2</varname>[=<varname>val2</varname>], ...)] COMPUTE STATISTICS [NOSCAN];</codeblock> + For example, to gather statistics for a non-partitioned table: +<codeblock>ANALYZE TABLE customer COMPUTE STATISTICS;</codeblock> + To gather statistics for a <codeph>store</codeph> table partitioned by state and city, and both of its + partitions: +<codeblock>ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;</codeblock> + To gather statistics for the <codeph>store</codeph> table and only the partitions for California: +<codeblock>ANALYZE TABLE store PARTITION(s_state='CA', s_county) COMPUTE STATISTICS;</codeblock> + </li> + + <li> + Load the data through the <codeph>INSERT OVERWRITE</codeph> statement in Hive, while the Hive setting + <b>hive.stats.autogather</b> is enabled. + </li> + + </ul> +--> + + <p rev="2.0.1"> +<!-- Additional info as a result of IMPALA-1420 --> +<!-- Keep checking if https://issues.apache.org/jira/browse/HIVE-8648 ever gets fixed and when that fix makes it into a CDH release. --> + For your very largest tables, you might find that <codeph>COMPUTE STATS</codeph> or even <codeph>COMPUTE INCREMENTAL STATS</codeph> + take so long to scan the data that it is impractical to use them regularly. In such a case, after adding a partition or inserting new data, + you can update just the number of rows property through an <codeph>ALTER TABLE</codeph> statement. + See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details. + Because the column statistics might be left in a stale state, do not use this technique as a replacement + for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a + low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. + </p> + + </conbody> + </concept> + + <concept rev="2.1.0" id="perf_stats_checking"> + + <title>Detecting Missing Statistics</title> + + <conbody> + + <p> + You can check whether a specific table has statistics using the <codeph>SHOW TABLE STATS</codeph> statement + (for any table) or the <codeph>SHOW PARTITIONS</codeph> statement (for a partitioned table). Both + statements display the same information. If a table or a partition does not have any statistics, the + <codeph>#Rows</codeph> field contains <codeph>-1</codeph>. Once you compute statistics for the table or + partition, the <codeph>#Rows</codeph> field changes to an accurate value. + </p> + + <p> + The following example shows a table that initially does not have any statistics. The <codeph>SHOW TABLE + STATS</codeph> statement displays different values for <codeph>#Rows</codeph> before and after the + <codeph>COMPUTE STATS</codeph> operation. 
+ </p> + +<codeblock>[localhost:21000] > create table no_stats (x int); +[localhost:21000] > show table stats no_stats; ++-------+--------+------+--------------+--------+-------------------+ +| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+--------+------+--------------+--------+-------------------+ +| -1 | 0 | 0B | NOT CACHED | TEXT | false | ++-------+--------+------+--------------+--------+-------------------+ +[localhost:21000] > compute stats no_stats; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 1 column(s). | ++-----------------------------------------+ +[localhost:21000] > show table stats no_stats; ++-------+--------+------+--------------+--------+-------------------+ +| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+--------+------+--------------+--------+-------------------+ +| 0 | 0 | 0B | NOT CACHED | TEXT | false | ++-------+--------+------+--------------+--------+-------------------+ +</codeblock> + + <p> + The following example shows a similar progression with a partitioned table. Initially, + <codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> operation, + <codeph>#Rows</codeph> changes to an accurate value. Any newly added partition starts with no statistics, + meaning that you must collect statistics after adding a new partition. + </p> + +<codeblock>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint); +[localhost:21000] > show table stats no_stats_partitioned; ++-------+-------+--------+------+--------------+--------+-------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+-------+--------+------+--------------+--------+-------------------+ +| Total | -1 | 0 | 0B | 0B | | | ++-------+-------+--------+------+--------------+--------+-------------------+ +[localhost:21000] > show partitions no_stats_partitioned; ++-------+-------+--------+------+--------------+--------+-------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+-------+--------+------+--------------+--------+-------------------+ +| Total | -1 | 0 | 0B | 0B | | | ++-------+-------+--------+------+--------------+--------+-------------------+ +[localhost:21000] > alter table no_stats_partitioned add partition (year=2013); +[localhost:21000] > compute stats no_stats_partitioned; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 1 column(s). 
| ++-----------------------------------------+ +[localhost:21000] > alter table no_stats_partitioned add partition (year=2014); +[localhost:21000] > show partitions no_stats_partitioned; ++-------+-------+--------+------+--------------+--------+-------------------+ +| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats | ++-------+-------+--------+------+--------------+--------+-------------------+ +| 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false | +| 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false | +| Total | 0 | 0 | 0B | 0B | | | ++-------+-------+--------+------+--------------+--------+-------------------+ +</codeblock> + + <note> + Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates statistics for all + partitions in a table, if you expect to frequently add new partitions, use the <codeph>COMPUTE INCREMENTAL + STATS</codeph> syntax instead, which lets you compute stats for a single specified partition, or only for + those partitions that do not already have incremental stats. + </note> + + <p> + If checking each individual table is impractical, due to a large number of tables or views that hide the + underlying base tables, you can also check for missing statistics for a particular query. Use the + <codeph>EXPLAIN</codeph> statement to preview query efficiency before actually running the query. Use the + query profile output available through the <codeph>PROFILE</codeph> command in + <cmdname>impala-shell</cmdname> or the web UI to verify query execution and timing after running the query. + Both the <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning if any + tables or partitions involved in the query do not have statistics. + </p> + +<codeblock>[localhost:21000] > create table no_stats (x int); +[localhost:21000] > explain select count(*) from no_stats; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 | +| WARNING: The following tables are missing relevant table and/or column statistics. | +| incremental_stats.no_stats | +| | +| 03:AGGREGATE [FINALIZE] | +| | output: count:merge(*) | +| | | +| 02:EXCHANGE [UNPARTITIONED] | +| | | +| 01:AGGREGATE | +| | output: count(*) | +| | | +| 00:SCAN HDFS [incremental_stats.no_stats] | +| partitions=1/1 files=0 size=0B | ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + Because Impala uses the <term>partition pruning</term> technique when possible to only evaluate certain + partitions, if you have a partitioned table with statistics for some partitions and not others, whether or + not the <codeph>EXPLAIN</codeph> statement shows the warning depends on the actual partitions used by the + query. For example, you might see warnings or not for different queries against the same table: + </p> + +<codeblock>-- No warning because all the partitions for the year 2012 have stats. +EXPLAIN SELECT ... FROM t1 WHERE year = 2012; + +-- Missing stats warning because one or more partitions in this range +-- do not have stats. +EXPLAIN SELECT ... 
FROM t1 WHERE year BETWEEN 2006 AND 2009; +</codeblock> + + <p> + To confirm if any partitions at all in the table are missing statistics, you might explain a query that + scans the entire table, such as <codeph>SELECT COUNT(*) FROM <varname>table_name</varname></codeph>. + </p> + </conbody> + </concept> + + <concept rev="2.1.0" id="perf_stats_collecting"> + + <title>Keeping Statistics Up to Date</title> + + <conbody> + + <p> + When the contents of a table or partition change significantly, recompute the stats for the relevant table + or partition. The degree of change that qualifies as <q>significant</q> varies, depending on the absolute + and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is + worthwhile to recompute stats, because the differences in number of rows and number of distinct values + might cause Impala to choose a different join order when that table is used in join queries. This guideline + is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a + greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger + table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of + outdated statistics. + </p> + + <p> + If you reload a complete new set of data for a table, but the number of rows and number of distinct values + for each column is relatively unchanged from before, you do not need to recompute stats for the table. + </p> + + <p> + If the statistics for a table are out of date, and the table's large size makes it impractical to recompute + new stats immediately, you can use the <codeph>DROP STATS</codeph> statement to remove the obsolete + statistics, making it easier to identify tables that need a new <codeph>COMPUTE STATS</codeph> operation. + </p> + + <p> + For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and + higher, as explained in <xref href="impala_perf_stats.xml#perf_stats_incremental"/>. If you add a new + partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the + data for that one new partition. + </p> + </conbody> + </concept> + +<!-- Might deserve its own conceptual topic at some point. --> + + <concept audience="Cloudera" rev="1.2.2" id="perf_stats_joins"> + + <title>How Statistics Are Used in Join Queries</title> + + <conbody> + + <p></p> </conbody> </concept> +<!-- Might deserve its own conceptual topic at some point. --> + + <concept audience="Cloudera" rev="1.2.2" id="perf_stats_inserts"> + + <title>How Statistics Are Used in INSERT Operations</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/insert_hints"/> + </conbody> + </concept> + + <concept rev="1.2.2" id="perf_table_stats_manual"> + + <title>Setting the NUMROWS Value Manually through ALTER TABLE</title> + + <conbody> + + <p> + The most crucial piece of data in all the statistics is the number of rows in the table (for an + unpartitioned or partitioned table) and for each partition (for a partitioned table). The <codeph>COMPUTE STATS</codeph> + statement always gathers statistics about all columns, as well as overall table statistics. 
If it is not + practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> + operation after adding a partition or inserting data, or if you can see that Impala would produce a more + efficient plan if the number of rows was different, you can manually set the number of rows through an + <codeph>ALTER TABLE</codeph> statement: + </p> + +<codeblock> +-- Set total number of rows. Applies to both unpartitioned and partitioned tables. +alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); + +-- Set total number of rows for a specific partition. Applies to partitioned tables only. +-- You must specify all the partition key columns in the PARTITION clause. +alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); +</codeblock> + + <p> + This statement avoids re-scanning any data files. (The requirement to include the <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a + result of the issue <xref href="https://issues.apache.org/jira/browse/HIVE-8648" scope="external" format="html">HIVE-8648</xref> + for the Hive metastore.) + </p> + +<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/> + + <p> + For a partitioned table, update both the per-partition number of rows and the number of rows for the whole + table: + </p> + +<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/> + + <p> + In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE INCREMENTAL STATS</codeph> + for a partitioned table, should be fast and convenient enough that this technique is only useful for the very + largest partitioned tables. + <!-- + It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher + or lower to produce the ideal join order. + --> + <!-- Following wording is duplicated from earlier. Consider conref'ing. --> + Because the column statistics might be left in a stale state, do not use this technique as a replacement + for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a + low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. + </p> + </conbody> + </concept> + + <concept id="perf_column_stats_manual" rev="2.6.0 IMPALA-3369"> + <title>Setting Column Stats Manually through ALTER TABLE</title> + <conbody> + <p> + In CDH 5.8 / Impala 2.6 and higher, you can also use the <codeph>SET COLUMN STATS</codeph> + clause of <codeph>ALTER TABLE</codeph> to manually set or change column statistics. + Only use this technique in cases where it is impractical to run + <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> + frequently enough to keep up with data changes for a huge table. 
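+ </p>
+
+ <p>
+ The following is a minimal sketch, using a hypothetical table and column name.
+ The statistics are specified as a list of key-value string literals; the
+ supported keys include <codeph>numDVs</codeph>, <codeph>numNulls</codeph>,
+ <codeph>avgSize</codeph>, and <codeph>maxSize</codeph>:
+ </p>
+
+<codeblock>
+-- Hypothetical example: claim roughly 300,000 distinct values and a
+-- maximum size of 13 bytes for one STRING column.
+alter table sales_data set column stats customer_name ('numDVs'='300000','maxSize'='13');
+</codeblock>
+
+ <p>
+ As when setting <codeph>numRows</codeph> manually, treat such values as
+ provisional until it is practical to run a full <codeph>COMPUTE STATS</codeph>.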
+ </p> + <p conref="../shared/impala_common.xml#common/set_column_stats_example"/> + </conbody> + </concept> + + <concept rev="1.2.2" id="perf_stats_examples"> + + <title>Examples of Using Table and Column Statistics with Impala</title> + + <conbody> + + <p> + The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, <codeph>SHOW COLUMN + STATS</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>SELECT</codeph> and <codeph>INSERT</codeph> + statements to illustrate various aspects of how Impala uses statistics to help optimize queries. + </p> + + <p> + This example shows table and column statistics for the <codeph>STORE</codeph> column used in the + <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS benchmarks for decision + support</xref> systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are + gathered by a <codeph>COMPUTE STATS</codeph> statement, most of the numeric fields show placeholder values + of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily + countable or deducible at the physical level, such as the number of files, total data size of the files, + and the maximum and average sizes for data types that have a constant size such as <codeph>INT</codeph>, + <codeph>FLOAT</codeph>, and <codeph>TIMESTAMP</codeph>. + </p> + +<codeblock>[localhost:21000] > show table stats store; ++-------+--------+--------+--------+ +| #Rows | #Files | Size | Format | ++-------+--------+--------+--------+ +| -1 | 1 | 3.08KB | TEXT | ++-------+--------+--------+--------+ +Returned 1 row(s) in 0.03s +[localhost:21000] > show column stats store; ++--------------------+-----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------------------+-----------+------------------+--------+----------+----------+ +| s_store_sk | INT | -1 | -1 | 4 | 4 | +| s_store_id | STRING | -1 | -1 | -1 | -1 | +| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 | +| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 | +| s_closed_date_sk | INT | -1 | -1 | 4 | 4 | +| s_store_name | STRING | -1 | -1 | -1 | -1 | +| s_number_employees | INT | -1 | -1 | 4 | 4 | +| s_floor_space | INT | -1 | -1 | 4 | 4 | +| s_hours | STRING | -1 | -1 | -1 | -1 | +| s_manager | STRING | -1 | -1 | -1 | -1 | +| s_market_id | INT | -1 | -1 | 4 | 4 | +| s_geography_class | STRING | -1 | -1 | -1 | -1 | +| s_market_desc | STRING | -1 | -1 | -1 | -1 | +| s_market_manager | STRING | -1 | -1 | -1 | -1 | +| s_division_id | INT | -1 | -1 | 4 | 4 | +| s_division_name | STRING | -1 | -1 | -1 | -1 | +| s_company_id | INT | -1 | -1 | 4 | 4 | +| s_company_name | STRING | -1 | -1 | -1 | -1 | +| s_street_number | STRING | -1 | -1 | -1 | -1 | +| s_street_name | STRING | -1 | -1 | -1 | -1 | +| s_street_type | STRING | -1 | -1 | -1 | -1 | +| s_suite_number | STRING | -1 | -1 | -1 | -1 | +| s_city | STRING | -1 | -1 | -1 | -1 | +| s_county | STRING | -1 | -1 | -1 | -1 | +| s_state | STRING | -1 | -1 | -1 | -1 | +| s_zip | STRING | -1 | -1 | -1 | -1 | +| s_country | STRING | -1 | -1 | -1 | -1 | +| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 | +| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 | ++--------------------+-----------+------------------+--------+----------+----------+ +Returned 29 row(s) in 0.04s</codeblock> + + <p> + With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to specify each + column for which to gather 
statistics. The Impala <codeph>COMPUTE STATS</codeph> statement automatically + gathers statistics for all columns, because it reads through the entire table relatively quickly and can + efficiently compute the values for all the columns. This example shows how after running the + <codeph>COMPUTE STATS</codeph> statement, statistics are filled in for both the table and all its columns: + </p> + +<codeblock>[localhost:21000] > compute stats store; ++------------------------------------------+ +| summary | ++------------------------------------------+ +| Updated 1 partition(s) and 29 column(s). | ++------------------------------------------+ +Returned 1 row(s) in 1.88s +[localhost:21000] > show table stats store; ++-------+--------+--------+--------+ +| #Rows | #Files | Size | Format | ++-------+--------+--------+--------+ +| 12 | 1 | 3.08KB | TEXT | ++-------+--------+--------+--------+ +Returned 1 row(s) in 0.02s +[localhost:21000] > show column stats store; ++--------------------+-----------+------------------+--------+----------+-------------------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------------------+-----------+------------------+--------+----------+-------------------+ +| s_store_sk | INT | 12 | -1 | 4 | 4 | +| s_store_id | STRING | 6 | -1 | 16 | 16 | +| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 | +| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 | +| s_closed_date_sk | INT | 3 | -1 | 4 | 4 | +| s_store_name | STRING | 8 | -1 | 5 | 4.25 | +| s_number_employees | INT | 9 | -1 | 4 | 4 | +| s_floor_space | INT | 10 | -1 | 4 | 4 | +| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 | +| s_manager | STRING | 7 | -1 | 15 | 12 | +| s_market_id | INT | 7 | -1 | 4 | 4 | +| s_geography_class | STRING | 1 | -1 | 7 | 7 | +| s_market_desc | STRING | 10 | -1 | 94 | 55.5 | +| s_market_manager | STRING | 7 | -1 | 16 | 14 | +| s_division_id | INT | 1 | -1 | 4 | 4 | +| s_division_name | STRING | 1 | -1 | 7 | 7 | +| s_company_id | INT | 1 | -1 | 4 | 4 | +| s_company_name | STRING | 1 | -1 | 7 | 7 | +| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 | +| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 | +| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 | +| s_suite_number | STRING | 11 | -1 | 9 | 8.25 | +| s_city | STRING | 2 | -1 | 8 | 6.5 | +| s_county | STRING | 1 | -1 | 17 | 17 | +| s_state | STRING | 1 | -1 | 2 | 2 | +| s_zip | STRING | 2 | -1 | 5 | 5 | +| s_country | STRING | 1 | -1 | 13 | 13 | +| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 | +| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 | ++--------------------+-----------+------------------+--------+----------+-------------------+ +Returned 29 row(s) in 0.04s</codeblock> + + <p> + The following example shows how statistics are represented for a partitioned table. In this case, we have + set up a table to hold the world's most trivial census data, a single <codeph>STRING</codeph> field, + partitioned by a <codeph>YEAR</codeph> column. The table statistics include a separate entry for each + partition, plus final totals for the numeric fields. The column statistics include some easily deducible + facts for the partitioning column, such as the number of distinct values (the number of partition + subdirectories). +<!-- and the number of <codeph>NULL</codeph> values (none in this case). 
--> + </p> + +<codeblock>localhost:21000] > describe census; ++------+----------+---------+ +| name | type | comment | ++------+----------+---------+ +| name | string | | +| year | smallint | | ++------+----------+---------+ +Returned 2 row(s) in 0.02s +[localhost:21000] > show table stats census; ++-------+-------+--------+------+---------+ +| year | #Rows | #Files | Size | Format | ++-------+-------+--------+------+---------+ +| 2000 | -1 | 0 | 0B | TEXT | +| 2004 | -1 | 0 | 0B | TEXT | +| 2008 | -1 | 0 | 0B | TEXT | +| 2010 | -1 | 0 | 0B | TEXT | +| 2011 | 0 | 1 | 22B | TEXT | +| 2012 | -1 | 1 | 22B | TEXT | +| 2013 | -1 | 1 | 231B | PARQUET | +| Total | 0 | 3 | 275B | | ++-------+-------+--------+------+---------+ +Returned 8 row(s) in 0.02s +[localhost:21000] > show column stats census; ++--------+----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+----------+------------------+--------+----------+----------+ +| name | STRING | -1 | -1 | -1 | -1 | +| year | SMALLINT | 7 | -1 | 2 | 2 | ++--------+----------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.02s</codeblock> + + <p> + The following example shows how the statistics are filled in by a <codeph>COMPUTE STATS</codeph> statement + in Impala. + </p> + +<codeblock>[localhost:21000] > compute stats census; ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 3 partition(s) and 1 column(s). | ++-----------------------------------------+ +Returned 1 row(s) in 2.16s +[localhost:21000] > show table stats census; ++-------+-------+--------+------+---------+ +| year | #Rows | #Files | Size | Format | ++-------+-------+--------+------+---------+ +| 2000 | -1 | 0 | 0B | TEXT | +| 2004 | -1 | 0 | 0B | TEXT | +| 2008 | -1 | 0 | 0B | TEXT | +| 2010 | -1 | 0 | 0B | TEXT | +| 2011 | 4 | 1 | 22B | TEXT | +| 2012 | 4 | 1 | 22B | TEXT | +| 2013 | 1 | 1 | 231B | PARQUET | +| Total | 9 | 3 | 275B | | ++-------+-------+--------+------+---------+ +Returned 8 row(s) in 0.02s +[localhost:21000] > show column stats census; ++--------+----------+------------------+--------+----------+----------+ +| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | ++--------+----------+------------------+--------+----------+----------+ +| name | STRING | 4 | -1 | 5 | 4.5 | +| year | SMALLINT | 7 | -1 | 2 | 2 | ++--------+----------+------------------+--------+----------+----------+ +Returned 2 row(s) in 0.02s</codeblock> + + <p rev="1.4.0"> + For examples showing how some queries work differently when statistics are available, see + <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala executes a query + differently in each case by observing the <codeph>EXPLAIN</codeph> output before and after collecting + statistics. Measure the before and after query times, and examine the throughput numbers in before and + after <codeph>SUMMARY</codeph> or <codeph>PROFILE</codeph> output, to verify how much the improved plan + speeds up performance. + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_performance.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_performance.xml b/docs/topics/impala_performance.xml index 3d454ca..e58270e 100644 --- a/docs/topics/impala_performance.xml +++ b/docs/topics/impala_performance.xml @@ -4,7 +4,18 @@ <title>Tuning Impala for Performance</title> <titlealts audience="PDF"><navtitle>Performance Tuning</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Databases"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedded of intro text from those other pages. --> + <data name="Category" value="Stub Pages"/> + </metadata> + </prolog> <conbody> @@ -13,8 +24,168 @@ tuning, monitoring, and benchmarking Impala queries and other SQL operations. </p> - + <p> + This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: + it means that performance remains high as the system workload increases. For example, reducing the disk I/O + performed by a query can speed up an individual query, and at the same time improve scalability by making it + practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more + than performance. For example, reducing memory usage for a query might not change the query performance much, + but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time + without running out of memory. + </p> + + <note> + <p> + Before starting any performance tuning or benchmarking, make sure your system is configured with all the + recommended minimum hardware requirements from <xref href="impala_prereqs.xml#prereqs_hardware"/> and + software settings from <xref href="impala_config_performance.xml#config_performance"/>. + </p> + </note> + + <ul> + <li> + <xref href="impala_partitioning.xml#partitioning"/>. This technique physically divides the data based on + the different values in frequently queried columns, allowing queries to skip reading a large percentage of + the data in a table. + </li> + + <li> + <xref href="impala_perf_joins.xml#perf_joins"/>. Joins are the main class of queries that you can tune at + the SQL level, as opposed to changing physical factors such as the file format or the hardware + configuration. The related topics <xref href="impala_perf_stats.xml#perf_column_stats"/> and + <xref href="impala_perf_stats.xml#perf_table_stats"/> are also important primarily for join performance. + </li> + + <li> + <xref href="impala_perf_stats.xml#perf_table_stats"/> and + <xref href="impala_perf_stats.xml#perf_column_stats"/>. Gathering table and column statistics, using the + <codeph>COMPUTE STATS</codeph> statement, helps Impala automatically optimize the performance for join + queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala + 1.2.2 and higher, because the <codeph>COMPUTE STATS</codeph> statement gathers both kinds of statistics in + one operation, and does not require any setup and configuration as was previously necessary for the + <codeph>ANALYZE TABLE</codeph> statement in Hive.) 
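+ For example, a single statement (shown here with a hypothetical table name)
+ collects the table statistics plus statistics for every column:
+<codeblock>compute stats store_sales;</codeblock>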
+ </li> + + <li> + <xref href="impala_perf_testing.xml#performance_testing"/>. Do some post-setup testing to ensure Impala is + using optimal settings for performance, before conducting any benchmark tests. + </li> + + <li> + <xref href="impala_perf_benchmarking.xml#perf_benchmarks"/>. The configuration and sample data that you use + for initial experiments with Impala is often not appropriate for doing performance tests. + </li> + + <li> + <xref href="impala_perf_resources.xml#mem_limits"/>. The more memory Impala can utilize, the better query + performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs + to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that + Impala can use. + </li> + + <li rev="1.2" audience="Cloudera"> + <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>. Impala can use the HDFS caching feature to pin + frequently accessed data in memory, reducing disk I/O. + </li> + + <li rev="2.2.0"> + <xref href="impala_s3.xml#s3"/>. Queries against data stored in the Amazon Simple Storage Service (S3) + have different performance characteristics than when the data is stored in HDFS. + </li> + </ul> + + <p outputclass="toc"/> + + <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> + + </conbody> + +<!-- Empty/hidden stub sections that might be worth expanding later. --> + + <concept id="perf_network" audience="Cloudera"> + + <title>Network Traffic</title> + + <conbody/> + </concept> + + <concept id="perf_partition_schema" audience="Cloudera"> + + <title>Designing Partitioned Tables</title> + + <conbody/> + </concept> + + <concept id="perf_partition_query" audience="Cloudera"> + + <title>Queries on Partitioned Tables</title> + + <conbody/> + </concept> + + <concept id="perf_monitoring" audience="Cloudera"> - </conbody> + <title>Monitoring Performance through the Impala Web Interface</title> + <prolog> + <metadata> + <data name="Category" value="Monitoring"/> + </metadata> + </prolog> + + <conbody/> + </concept> + + <concept id="perf_query_coord" audience="Cloudera"> + + <title>Query Coordination</title> + + <conbody/> + </concept> + + <concept id="perf_bottlenecks" audience="Cloudera"> + + <title>Performance Bottlenecks</title> + + <conbody/> + </concept> + + <concept id="perf_long_queries" audience="Cloudera"> + + <title>Managing Long-Running Queries</title> + + <conbody/> + </concept> + + <concept id="perf_load" audience="Cloudera"> + + <title>Performance Considerations for Loading Data</title> + + <conbody/> + </concept> + + <concept id="perf_file_formats" audience="Cloudera"> + + <title>Performance Considerations for File Formats</title> + + <conbody/> </concept> + <concept id="perf_compression" audience="Cloudera"> + + <title>Performance Considerations for Compression</title> + <prolog> + <metadata> + <data name="Category" value="Compression"/> + </metadata> + </prolog> + + <conbody/> + </concept> + + <concept id="perf_codegen" audience="Cloudera"> + + <title>Native Code Generation</title> + + <conbody/> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_porting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_porting.xml b/docs/topics/impala_porting.xml index c9c8e52..3800713 100644 --- a/docs/topics/impala_porting.xml +++ b/docs/topics/impala_porting.xml @@ -3,6 +3,7 @@ <concept id="porting"> <title>Porting SQL from Other Database Systems to 
Impala</title> + <titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_prereqs.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_prereqs.xml b/docs/topics/impala_prereqs.xml index bb10263..0587c6f 100644 --- a/docs/topics/impala_prereqs.xml +++ b/docs/topics/impala_prereqs.xml @@ -5,7 +5,22 @@ <title>Impala Requirements</title> <titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Requirements"/> + <data name="Category" value="Planning"/> + <data name="Category" value="Installing"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. --> + <data name="Category" value="Duplicate Topics"/> + <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. --> + <data name="Category" value="Multimap"/> + </metadata> + </prolog> <conbody> @@ -16,7 +31,330 @@ described in the following sections. </p> - + <p outputclass="toc inpage"/> + </conbody> + + <concept id="product_compatibility_matrix"> + + <title>Product Compatibility Matrix</title> + + <conbody> + + <p> The ultimate source of truth about compatibility between various + versions of CDH, Cloudera Manager, and various CDH components is the <ph + audience="integrated"><xref + href="rn_consolidated_pcm.xml" + >Product Compatibility Matrix for CDH and Cloudera + Manager</xref></ph><ph audience="standalone">online <xref + href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html" + format="html" scope="external">Product Compatibility + Matrix</xref></ph>. </p> + + <p> + For Impala, see the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala + compatibility matrix page</xref>. + </p> + </conbody> + </concept> + + <concept id="prereqs_os"> + + <title>Supported Operating Systems</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">software requirements</indexterm> + <indexterm audience="Cloudera">Red Hat Enterprise Linux</indexterm> + <indexterm audience="Cloudera">RHEL</indexterm> + <indexterm audience="Cloudera">CentOS</indexterm> + <indexterm audience="Cloudera">SLES</indexterm> + <indexterm audience="Cloudera">Ubuntu</indexterm> + <indexterm audience="Cloudera">SUSE</indexterm> + <indexterm audience="Cloudera">Debian</indexterm> The relevant supported operating systems + and versions for Impala are the same as for the corresponding CDH 4 and CDH 5 platforms. 
+        For details, see the <cite>Supported Operating Systems</cite> page for <xref
+          href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_1.html"
+          scope="external" format="html">CDH 4</xref> or <!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/cdh_rsv_os.html -->
+        <!-- Expected post-reorg URL, doesn't work: <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_rsv_os.html" scope="external" format="html">CDH 5</xref>. -->
+        <ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH
+          5</xref></ph><ph audience="standalone"><xref
+          href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os"
+          scope="external" format="html">CDH 5</xref></ph>. </p>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_hive">
+
+    <title>Hive Metastore and Related Configuration</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Metastore"/>
+        <data name="Category" value="Hive"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">Hive</indexterm>
+        <indexterm audience="Cloudera">MySQL</indexterm>
+        <indexterm audience="Cloudera">PostgreSQL</indexterm>
+        Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking
+        metadata about schema objects such as tables and columns. The following components are prerequisites for
+        Impala:
+      </p>
+
+      <ul>
+        <li>
+          MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
+          <note>
+            <p>
+              Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without
+              the metastore database. For the process of installing and configuring the metastore, see
+              <xref href="impala_install.xml#install"/>.
+            </p>
+            <p>
+              Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore
+              database. The Hive metastore service is required to interoperate between possibly different levels of
+              metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the
+              metastore database. The Hive metastore service is set up for you by default if you install through
+              Cloudera Manager 4.5 or higher.
+            </p>
+            <p>
+              A summary of the metastore installation process is as follows:
+            </p>
+            <ul>
+              <li>
+                Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
+              </li>
+
+              <li>
+                Download the
+                <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL
+                connector</xref> or the
+                <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL
+                connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
+              </li>
+
+              <li>
+                Use the appropriate command-line tool for your database to create the metastore database,
+                as shown in the example following this list.
+              </li>
+
+              <li>
+                Use the appropriate command-line tool for your database to grant privileges on the metastore
+                database to the <codeph>hive</codeph> user.
+              </li>
+
+              <li>
+                Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its
+                URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala
+                Configuration Directory later in the Impala installation process.
+              </li>
+            </ul>
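+            <p>
+              For example, with a MySQL metastore, the database creation and grant steps might look like the
+              following sketch. The database name, user, host, and password shown here are placeholders, not
+              required values; substitute your own.
+            </p>
+
+<codeblock>$ mysql -u root -p
+mysql> CREATE DATABASE metastore;
+mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
+mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'metastorehost';
+mysql> FLUSH PRIVILEGES;</codeblock>
+          </note>
+        </li>
+
+        <li>
+          <b>Optional:</b> Hive.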
+          Although only the Hive metastore database is required for Impala to function, you
+          might install Hive on some client machines to create and load data into tables that use certain file
+          formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be
+          installed on the same DataNodes as Impala; it just needs access to the same metastore database.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_java">
+
+    <title>Java Dependencies</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Java"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">Java</indexterm>
+        <indexterm audience="Cloudera">impala-dependencies.jar</indexterm>
+        Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop
+        components:
+      </p>
+
+      <ul>
+        <li>
+          The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically
+          resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on
+          certain levels of Ubuntu systems can cause <cmdname>impalad</cmdname> to fail to start.
+          <!-- To do:
+          Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before,
+          don't know the impact from the user perspective so not calling it out at the moment.
+          -->
+        </li>
+
+        <li>
+          Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment
+          variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service runs in
+          an environment where this variable is set correctly.
+        </li>
+
+        <li>
+          All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located
+          at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under
+          <codeph>fe/target/dependency</codeph>.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_network">
+
+    <title>Networking Configuration Requirements</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Network"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">network configuration</indexterm>
+        For best performance, Impala attempts to complete tasks on local data rather than using network
+        connections to work with remote data. To support this goal, Impala matches
+        the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by
+        resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface
+        for the DataNode and the Impala daemon on each machine, and ensure that the Impala daemon's hostname flag
+        resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for
+        multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala
+        tries to detect the correct hostname at startup, and prints the derived hostname at the start of the log
+        in a message of the form:
+      </p>
+
+<codeblock>Using hostname: impala-daemon-1.cloudera.com</codeblock>
+
+      <p>
+        In the majority of cases, this automatic detection works correctly. If you need to explicitly set the
+        hostname, do so by setting the <codeph>--hostname</codeph> flag.
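+        For example, if you start the daemon directly rather than through the service scripts, you might pass
+        the flag on the command line. (A sketch only; substitute the hostname of the interface you want Impala
+        to use.)
+<codeblock>$ impalad --hostname=impala-daemon-1.cloudera.com</codeblock>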
+      </p>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_hardware">
+
+    <title>Hardware Requirements</title>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">hardware requirements</indexterm>
+        <indexterm audience="Cloudera">capacity</indexterm>
+        <indexterm audience="Cloudera">RAM</indexterm>
+        <indexterm audience="Cloudera">memory</indexterm>
+        <indexterm audience="Cloudera">CPU</indexterm>
+        <indexterm audience="Cloudera">processor</indexterm>
+        <indexterm audience="Cloudera">Intel</indexterm>
+        <indexterm audience="Cloudera">AMD</indexterm>
+        During join operations, portions of data from each joined table are loaded into memory. Data sets can be
+        very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate
+        performing.
+      </p>
+
+      <p>
+        While requirements vary according to data set size, the following is generally recommended:
+      </p>
+
+      <ul>
+        <li rev="2.0.0">
+          CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer
+          processors. You can verify SSSE3 support as shown in the example following this list.
+          <note>
+            This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases
+            had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
+          </note>
+<!--
+          For best performance use:
+          <ul>
+            <li>
+              Intel - Nehalem (released 2008) or later processors.
+            </li>
+
+            <li>
+              AMD - Bulldozer (released 2011) or later processors.
+            </li>
+          </ul>
+-->
+        </li>
+
+        <li rev="1.2">
+          Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
+          processing on a particular node exceed the amount of memory available to Impala on that node, the query
+          writes temporary work data to disk, which can lead to long query times. Note that because the work is
+          parallelized, and intermediate results for aggregate queries are typically smaller than the original
+          data, Impala can query and join tables that are much larger than the memory available on an individual
+          node.
+        </li>
+
+        <li>
+          Storage - DataNodes with 12 or more disks each. Disk I/O speed is often the limiting factor for overall
+          Impala performance. Ensure that you have sufficient disk space to store the data Impala will be
+          querying.
+        </li>
+      </ul>
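+
+      <p>
+        For example, on Linux you can check whether a processor supports SSSE3 by looking for the
+        <codeph>ssse3</codeph> flag in <filepath>/proc/cpuinfo</filepath>. (A quick sketch, not an official
+        diagnostic; a nonzero count means the instruction set is available.)
+      </p>
+
+<codeblock>$ grep -c ssse3 /proc/cpuinfo</codeblock>
+    </conbody>
+  </concept>
+
+  <concept id="prereqs_account">
+
+    <title>User Account Requirements</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Users"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        <indexterm audience="Cloudera">impala user</indexterm>
+        <indexterm audience="Cloudera">impala group</indexterm>
+        <indexterm audience="Cloudera">root user</indexterm>
+        Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group
+        and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the
+        functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in
+        a white-list, add these accounts to the list of permitted accounts.
+      </p>
+
+<!-- Taking out because no longer applicable in CDH 5.5 and up. -->
+      <p id="impala_hdfs_group" rev="1.2" audience="Cloudera">
+        For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components),
+        the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is
+        performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala
+        1.2.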
+        If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the
+        <codeph>impala</codeph> user to the <codeph>hdfs</codeph> group.
+      </p>
+
+      <p>
+        For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files
+        to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>,
+        writable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files
+        might remain behind after a <codeph>DROP TABLE</codeph> statement.
+      </p>
+
+      <p>
+        Impala should not run as root. Impala achieves best performance using direct reads, which the root user
+        is not permitted to use, so running Impala as root negatively affects performance.
+      </p>
+
+      <p>
+        By default, any user can connect to Impala and access all the associated databases and tables. You can
+        enable authorization and authentication based on the Linux OS user who connects to the Impala server, and
+        the associated groups for that user. See <xref href="impala_security.xml#security"/> for details. These
+        security features do not change the underlying file permission requirements; the <codeph>impala</codeph>
+        user still needs to be able to access the data files.
+      </p>
+    </conbody>
+  </concept>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_processes.xml
----------------------------------------------------------------------

diff --git a/docs/topics/impala_processes.xml b/docs/topics/impala_processes.xml
index be8a9b8..05f2274 100644
--- a/docs/topics/impala_processes.xml
+++ b/docs/topics/impala_processes.xml
@@ -3,16 +3,132 @@
 <concept id="processes">
 
   <title>Starting Impala</title>
-
+  <prolog>
+    <metadata>
+      <data name="Category" value="Starting and Stopping"/>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Administrators"/>
+      <data name="Category" value="Operators"/>
+    </metadata>
+  </prolog>
 
   <conbody>
 
-    <p>
-
+    <p rev="1.2">
+      <indexterm audience="Cloudera">state store</indexterm>
+      <indexterm audience="Cloudera">starting services</indexterm>
+      <indexterm audience="Cloudera">services</indexterm>
       To activate Impala if it is installed but not yet started:
     </p>
-
+    <ol>
+      <li>
+        Set any necessary configuration options for the Impala services. See
+        <xref href="impala_config_options.xml#config_options"/> for details.
+      </li>
+
+      <li>
+        Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently,
+        and to continue running in the event of availability problems for other Impala nodes. If the statestore
+        becomes unavailable, Impala continues to function.
+      </li>
+
+      <li>
+        Start one instance of the Impala catalog service.
+      </li>
+
+      <li>
+        Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local
+        processing and avoid network traffic due to remote reads.
+      </li>
+    </ol>
+
+    <p>
+      Once Impala is running, you can conduct interactive experiments using the instructions in
+      <xref href="impala_tutorial.xml#tutorial"/> and try <xref href="impala_impala_shell.xml#impala_shell"/>.
+    </p>
+
+    <p outputclass="toc inpage"/>
+  </conbody>
+
+  <concept id="starting_via_cm">
+
+    <title>Starting Impala through Cloudera Manager</title>
+
+    <conbody>
+
+      <p>
+        If you installed Impala with Cloudera Manager, use Cloudera Manager to start and stop services.
+        The Cloudera Manager GUI is a convenient way to check that all services are running, to set configuration
+        options using form fields in a browser, and to spot potential issues such as low disk space before they
+        become serious. Cloudera Manager automatically starts all the Impala-related services as a group, in the
+        correct order. See
+        <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_start_stop_service.html" scope="external" format="html">the
+        Cloudera Manager Documentation</xref> for details.
+      </p>
+
+      <note>
+        <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
+      </note>
     </conbody>
   </concept>
 
+  <concept id="starting_via_cmdline">
+
+    <title>Starting Impala from the Command Line</title>
+
+    <conbody>
+
+      <p>
+        To start the Impala statestore and Impala from the command line or a script, you can either use the
+        <cmdname>service</cmdname> command or you can start the daemons directly through the
+        <cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and <cmdname>catalogd</cmdname> executables,
+        as shown at the end of this section.
+      </p>
+
+      <p>
+        Start the Impala statestore and then start <cmdname>impalad</cmdname> instances. You can modify the values
+        the service initialization scripts use when starting the statestore and Impala by editing
+        <codeph>/etc/default/impala</codeph>.
+      </p>
+
+      <p>
+        Start the statestore service using a command similar to the following:
+      </p>
+
+      <p>
+<codeblock>$ sudo service impala-state-store start</codeblock>
+      </p>
+
+      <p rev="1.2">
+        Start the catalog service using a command similar to the following:
+      </p>
+
+<codeblock rev="1.2">$ sudo service impala-catalog start</codeblock>
+
+      <p>
+        Start the Impala service on each DataNode using a command similar to the following:
+      </p>
+
+      <p>
+<codeblock>$ sudo service impala-server start</codeblock>
+      </p>
+
+      <note>
+        <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
+      </note>
+
+      <p>
+        If any of the services fail to start, review:
+        <ul>
+          <li>
+            <xref href="impala_logging.xml#logs_debug"/>
+          </li>
+
+          <li>
+            <xref href="impala_troubleshooting.xml#troubleshooting"/>
+          </li>
+        </ul>
+      </p>
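+
+      <p>
+        As an alternative to the <cmdname>service</cmdname> commands shown above, you can invoke the daemons
+        directly. The following is only a sketch: it assumes the executables are on your <codeph>PATH</codeph>
+        and that default flags are acceptable. A production setup would pass the same flags that
+        <codeph>/etc/default/impala</codeph> supplies, and would run the daemons under a process supervisor
+        rather than as background jobs.
+      </p>
+
+<codeblock>$ sudo -u impala statestored &amp;
+$ sudo -u impala catalogd &amp;
+$ sudo -u impala impalad &amp;</codeblock>
+    </conbody>
+  </concept>
+</concept>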
