IMPALA-6459: [DOCS] Part 1 Stats doc reorg Change-Id: I3b061c39bb847690760311fd7280121eda2db626 Reviewed-on: http://gerrit.cloudera.org:8080/10067 Reviewed-by: Alex Rodoni <[email protected]> Tested-by: Impala Public Jenkins <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/39f986ec Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/39f986ec Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/39f986ec Branch: refs/heads/master Commit: 39f986ecf4fca0733c3778d7756dd6b12175dd35 Parents: 62d3388 Author: Alex Rodoni <[email protected]> Authored: Fri Apr 13 14:41:27 2018 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Wed Apr 18 21:17:02 2018 +0000 ---------------------------------------------------------------------- docs/impala_keydefs.ditamap | 1 - docs/shared/impala_common.xml | 10 +- docs/topics/impala_perf_stats.xml | 893 ++++++++++++++++----------------- 3 files changed, 448 insertions(+), 456 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/impala/blob/39f986ec/docs/impala_keydefs.ditamap ---------------------------------------------------------------------- diff --git a/docs/impala_keydefs.ditamap b/docs/impala_keydefs.ditamap index e94fe5d..88bdc71 100644 --- a/docs/impala_keydefs.ditamap +++ b/docs/impala_keydefs.ditamap @@ -10936,7 +10936,6 @@ under the License. <keydef href="topics/impala_perf_joins.xml#straight_join" keys="straight_join"/> <keydef href="topics/impala_perf_joins.xml#perf_joins_examples" keys="perf_joins_examples"/> <keydef href="topics/impala_perf_stats.xml" keys="perf_stats"/> - <keydef href="topics/impala_perf_stats.xml#perf_column_stats_manual" keys="perf_column_stats_manual"/> <keydef href="topics/impala_perf_benchmarking.xml" keys="perf_benchmarks"/> <keydef href="topics/impala_perf_resources.xml" keys="mem_limits"/> http://git-wip-us.apache.org/repos/asf/impala/blob/39f986ec/docs/shared/impala_common.xml ---------------------------------------------------------------------- diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml index df58a53..f8ded41 100644 --- a/docs/shared/impala_common.xml +++ b/docs/shared/impala_common.xml @@ -1426,11 +1426,11 @@ drop database temp; <p id="cs_or_cis"> For a particular table, use either <codeph>COMPUTE STATS</codeph> or - <codeph>COMPUTE INCREMENTAL STATS</codeph>, but never combine the two or alternate - between them. If you switch from <codeph>COMPUTE STATS</codeph> to - <codeph>COMPUTE INCREMENTAL STATS</codeph> during the lifetime of a table, or vice - versa, drop all statistics (by running both <codeph>DROP STATS</codeph> and - <codeph>DROP INCREMENTAL STATS</codeph>) before making the switch. + <codeph>COMPUTE INCREMENTAL STATS</codeph>, but never combine the two or + alternate between them. If you switch from <codeph>COMPUTE STATS</codeph> to + <codeph>COMPUTE INCREMENTAL STATS</codeph> during the lifetime of a table, or + vice versa, drop all statistics by running <codeph>DROP STATS</codeph> before + making the switch. </p> <p id="incremental_stats_after_full"> http://git-wip-us.apache.org/repos/asf/impala/blob/39f986ec/docs/topics/impala_perf_stats.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml index ac771be..f503a68 100644 --- a/docs/topics/impala_perf_stats.xml +++ b/docs/topics/impala_perf_stats.xml @@ -21,6 +21,7 @@ under the License. <concept id="perf_stats"> <title>Table and Column Statistics</title> + <prolog> <metadata> <data name="Category" value="Impala"/> @@ -35,57 +36,48 @@ under the License. <conbody> <p> - Impala can do better optimization for complex or multi-table queries when it has access to statistics about - the volume of data and how the values are distributed. Impala uses this information to help parallelize and - distribute the work for a query. For example, optimizing join queries requires a way of determining if one - table is <q>bigger</q> than another, which is a function of the number of rows and the average row size - for each table. The following sections describe the categories of statistics Impala can work - with, and how to produce them and keep them up to date. + Impala can do better optimization for complex or multi-table queries when it has access to + statistics about the volume of data and how the values are distributed. Impala uses this + information to help parallelize and distribute the work for a query. For example, + optimizing join queries requires a way of determining if one table is <q>bigger</q> than + another, which is a function of the number of rows and the average row size for each + table. The following sections describe the categories of statistics Impala can work with, + and how to produce them and keep them up to date. </p> - <note> - <p rev="1.2.2"> - Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive <codeph>ANALYZE - TABLE</codeph> statement which initiates a MapReduce job. For better user-friendliness and reliability, - Impala implements its own <codeph>COMPUTE STATS</codeph> statement in Impala 1.2.2 and higher, along with the - <codeph>DROP STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph> - statements. - </p> - </note> + <p outputclass="toc inpage all"/> - <p outputclass="toc inpage"/> </conbody> <concept id="perf_table_stats"> <title id="table_stats">Overview of Table Statistics</title> - <prolog> - <metadata> - <data name="Category" value="Concepts"/> - </metadata> - </prolog> - <conbody> + <prolog> + <metadata> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> -<!-- Hive background info: https://cwiki.apache.org/Hive/statsdev.html --> + <conbody> <p> The Impala query planner can make use of statistics about entire tables and partitions. - This information includes physical characteristics such as the number of rows, number of data files, - the total size of the data files, and the file format. For partitioned tables, the numbers - are calculated per partition, and as totals for the whole table. - This metadata is stored in the metastore database, and can be updated by either Impala or Hive. - If a number is not available, the value -1 is used as a placeholder. - Some numbers, such as number and total sizes of data files, are always kept up to date because - they can be calculated cheaply, as part of gathering HDFS block metadata. + This information includes physical characteristics such as the number of rows, number of + data files, the total size of the data files, and the file format. For partitioned + tables, the numbers are calculated per partition, and as totals for the whole table. + This metadata is stored in the metastore database, and can be updated by either Impala + or Hive. If a number is not available, the value -1 is used as a placeholder. Some + numbers, such as number and total sizes of data files, are always kept up to date + because they can be calculated cheaply, as part of gathering HDFS block metadata. </p> <p> - The following example shows table stats for an unpartitioned Parquet table. - The values for the number and sizes of files are always available. - Initially, the number of rows is not known, because it requires a potentially expensive - scan through the entire table, and so that value is displayed as -1. - The <codeph>COMPUTE STATS</codeph> statement fills in any unknown table stats values. + The following example shows table stats for an unpartitioned Parquet table. The values + for the number and sizes of files are always available. Initially, the number of rows is + not known, because it requires a potentially expensive scan through the entire table, + and so that value is displayed as -1. The <codeph>COMPUTE STATS</codeph> statement fills + in any unknown table stats values. </p> <codeblock> @@ -113,26 +105,29 @@ show table stats parquet_snappy; </codeblock> <p> - Impala performs some optimizations using this metadata on its own, and other optimizations by - using a combination of table and column statistics. + Impala performs some optimizations using this metadata on its own, and other + optimizations by using a combination of table and column statistics. </p> <p rev="1.2.1"> - To check that table statistics are available for a table, and see the details of those statistics, use the - statement <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph>. See - <xref href="impala_show.xml#show"/> for details. + To check that table statistics are available for a table, and see the details of those + statistics, use the statement <codeph>SHOW TABLE STATS + <varname>table_name</varname></codeph>. See <xref href="impala_show.xml#show"/> for + details. </p> <p> If you use the Hive-based methods of gathering statistics, see <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the - Hive wiki</xref> for information about the required configuration on the Hive side. Where practical, - use the Impala <codeph>COMPUTE STATS</codeph> statement to avoid potential configuration and scalability - issues with the statistics-gathering process. + Hive wiki</xref> for information about the required configuration on the Hive side. + Where practical, use the Impala <codeph>COMPUTE STATS</codeph> statement to avoid + potential configuration and scalability issues with the statistics-gathering process. </p> <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> <concept id="perf_column_stats"> @@ -142,23 +137,24 @@ show table stats parquet_snappy; <conbody> <p> - The Impala query planner can make use of statistics about individual columns when that metadata is - available in the metastore database. This technique is most valuable for columns compared across tables in - <xref href="impala_perf_joins.xml#perf_joins">join queries</xref>, to help estimate how many rows the query - will retrieve from each table. <ph rev="2.0.0"> These statistics are also important for correlated - subqueries using the <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed + The Impala query planner can make use of statistics about individual columns when that + metadata is available in the metastore database. This technique is most valuable for + columns compared across tables in <xref href="impala_perf_joins.xml#perf_joins">join + queries</xref>, to help estimate how many rows the query will retrieve from each table. + <ph rev="2.0.0"> These statistics are also important for correlated subqueries using the + <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed internally the same way as join queries.</ph> </p> <p> - The following example shows column stats for an unpartitioned Parquet table. - The values for the maximum and average sizes of some types are always available, - because those figures are constant for numeric and other fixed-size types. - Initially, the number of distinct values is not known, because it requires a potentially expensive - scan through the entire table, and so that value is displayed as -1. - The same applies to maximum and average sizes of variable-sized types, such as <codeph>STRING</codeph>. - The <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. - (It does not record the number of <codeph>NULL</codeph> values, because currently Impala + The following example shows column stats for an unpartitioned Parquet table. The values + for the maximum and average sizes of some types are always available, because those + figures are constant for numeric and other fixed-size types. Initially, the number of + distinct values is not known, because it requires a potentially expensive scan through + the entire table, and so that value is displayed as -1. The same applies to maximum and + average sizes of variable-sized types, such as <codeph>STRING</codeph>. The + <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. (It + does not record the number of <codeph>NULL</codeph> values, because currently Impala does not use that figure for query optimization.) </p> @@ -197,88 +193,56 @@ show column stats parquet_snappy; <note> <p> - For column statistics to be effective in Impala, you also need to have table statistics for the - applicable tables, as described in <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use - the Impala <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are automatically - gathered at the same time, for all columns in the table. + For column statistics to be effective in Impala, you also need to have table + statistics for the applicable tables, as described in + <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use the Impala + <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are + automatically gathered at the same time, for all columns in the table. </p> </note> <note conref="../shared/impala_common.xml#common/compute_stats_nulls"/> -<!-- Hive-based instructions are considered obsolete since the introduction of the Impala COMPUTE STATS statement. - <p> - Add settings like the following to the <filepath>hive-site.xml</filepath> - configuration file, in the Hive configuration directory, on every node where you run - <codeph>ANALYZE TABLE</codeph> statements through the - <codeph>hive</codeph> shell. The - <codeph>hive.stats.ndv.error</codeph> setting represents the standard error when - estimating the number of distinct values for a column. The value of 5.0 is recommended as a tradeoff between the - accuracy of the gathered statistics and the resource usage of the stats-gathering process. - </p> - -<codeblock><![CDATA[<property> - <name>hive.stats.ndv.error</name> - <value>5.0</value> -</property>]]></codeblock> - - <p> - 5.0 is a relatively low value that devotes substantial computational resources to the statistics-gathering - process. To reduce the resource usage, you could increase this value; to make the statistics even more precise, - you could lower it. - </p> - - <p> - The syntax for gathering column statistics uses the <codeph>ANALYZE TABLE ... - COMPUTE STATISTICS</codeph> clause, with an additional <codeph>FOR - COLUMNS</codeph> clause. For partitioned tables, you can gather statistics for specific partitions by including - a clause <codeph>PARTITION - (<varname>col1=val1</varname>,<varname>col2=val2</varname>, - ...)</codeph>; but you cannot include the partitioning columns in the - <codeph>FOR COLUMNS</codeph> clause. Also, you cannot use fully qualified table - names, so issue a <codeph>USE</codeph> command first to switch to the - appropriate database. For example: - </p> - -<codeblock>USE <varname>database_name</varname>; -ANALYZE TABLE <varname>table_name</varname> COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>; -ANALYZE TABLE <varname>table_name</varname> PARTITION (<varname>partition_specs</varname>) COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;</codeblock> ---> - <p rev="1.2.1"> - To check whether column statistics are available for a particular set of columns, use the <codeph>SHOW - COLUMN STATS <varname>table_name</varname></codeph> statement, or check the extended - <codeph>EXPLAIN</codeph> output for a query against that table that refers to those columns. See - <xref href="impala_show.xml#show"/> and <xref href="impala_explain.xml#explain"/> for details. + To check whether column statistics are available for a particular set of columns, use + the <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statement, or check + the extended <codeph>EXPLAIN</codeph> output for a query against that table that refers + to those columns. See <xref href="impala_show.xml#show"/> and + <xref href="impala_explain.xml#explain"/> for details. </p> <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> <concept id="perf_stats_partitions"> + <title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title> + <conbody> <p> - When you use Impala for <q>big data</q>, you are highly likely to use partitioning - for your biggest tables, the ones representing data that can be logically divided - based on dates, geographic regions, or similar criteria. The table and column statistics - are especially useful for optimizing queries on such tables. For example, a query involving - one year might involve substantially more or less data than a query involving a different year, - or a range of several years. Each query might be optimized differently as a result. + When you use Impala for <q>big data</q>, you are highly likely to use partitioning for + your biggest tables, the ones representing data that can be logically divided based on + dates, geographic regions, or similar criteria. The table and column statistics are + especially useful for optimizing queries on such tables. For example, a query involving + one year might involve substantially more or less data than a query involving a + different year, or a range of several years. Each query might be optimized differently + as a result. </p> <p> The following examples show how table and column stats work with a partitioned table. - The table for this example is partitioned by year, month, and day. - For simplicity, the sample data consists of 5 partitions, all from the same year and month. - Table stats are collected independently for each partition. (In fact, the - <codeph>SHOW PARTITIONS</codeph> statement displays exactly the same information as - <codeph>SHOW TABLE STATS</codeph> for a partitioned table.) Column stats apply to - the entire table, not to individual partitions. Because the partition key column values - are represented as HDFS directories, their characteristics are typically known in advance, - even when the values for non-key columns are shown as -1. + The table for this example is partitioned by year, month, and day. For simplicity, the + sample data consists of 5 partitions, all from the same year and month. Table stats are + collected independently for each partition. (In fact, the <codeph>SHOW + PARTITIONS</codeph> statement displays exactly the same information as <codeph>SHOW + TABLE STATS</codeph> for a partitioned table.) Column stats apply to the entire table, + not to individual partitions. Because the partition key column values are represented as + HDFS directories, their characteristics are typically known in advance, even when the + values for non-key columns are shown as -1. </p> <codeblock> @@ -355,207 +319,257 @@ show column stats year_month_day; </codeblock> <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </conbody> + </concept> - <concept rev="2.1.0" id="perf_stats_incremental"> + <concept id="perf_generating_stats"> - <title id="incremental_stats">Overview of Incremental Statistics</title> + <title>Generating Table and Column Statistics</title> <conbody> <p> - In Impala 2.1.0 and higher, you can use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph> and - <codeph>DROP INCREMENTAL STATS</codeph>. The <codeph>INCREMENTAL</codeph> clauses work with incremental - statistics, a specialized feature for partitioned tables that are large or frequently updated with new - partitions. - </p> - - <p> - When you compute incremental statistics for a partitioned table, by default Impala only processes those - partitions that do not yet have incremental statistics. By processing only newly added partitions, you can - keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the - entire table each time. + Use the <codeph>COMPUTE STATS</codeph> family of commands to collect table and + column statistics. The <codeph>COMPUTE STATS</codeph> variants offer + different tradeoffs between computation cost, staleness, and maintenance + workflows which are explained below. </p> <note type="important"> <p conref="../shared/impala_common.xml#common/cs_or_cis"/> - <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/> - <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/> </note> + <!-- TODO: Commented out because it is inaccurate and confusing. Leaving this + material for future refactoring into a Hive-compatibility section. <p> - You can also compute or drop statistics for a single partition by including a <codeph>PARTITION</codeph> - clause in the <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph> - statement. + If you use Hive as part of your ETL workflow, you can also use Hive to generate table + and column statistics. You might need to do extra configuration within Hive itself, the + metastore, or even set up a separate database to hold Hive-generated statistics. You + might need to run multiple statements to generate all the necessary statistics. + Therefore, prefer the Impala <codeph>COMPUTE STATS</codeph> statement where that + technique is practical. For details about collecting statistics through Hive, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the + Hive wiki</xref>. </p> + --> - <p> - The metadata for incremental statistics is handled differently from the original style of statistics: - </p> + </conbody> - <ul> - <li> - <p> - If you have a partitioned table for which you have already run a regular <codeph>COMPUTE STATS</codeph> - statement, issuing <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala - to rescan the entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL - STATS</codeph> statements only scan any new partitions and any partitions where you performed - <codeph>DROP INCREMENTAL STATS</codeph>. - </p> - </li> - - <li> - <p> - The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> statements now include an - additional column showing whether incremental statistics are available for each column. A partition - could already be covered by the original type of statistics based on a prior <codeph>COMPUTE - STATS</codeph> statement, as indicated by a value other than <codeph>-1</codeph> under the - <codeph>#Rows</codeph> column. Impala query planning uses either kind of statistics when available. - </p> - </li> - - <li> - <p> - <codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE STATS</codeph> for the - same volume of data. Therefore it is most suitable for tables with large data volume where new - partitions are added frequently, making it impractical to run a full <codeph>COMPUTE STATS</codeph> - operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once - and not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> syntax. - </p> - </li> - - <li> - <p> - <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the <cmdname>catalogd</cmdname> process, - proportional to the number of partitions and number of columns in the applicable table. The memory - overhead is approximately 400 bytes for each column in each partition. This memory is reserved in the - <cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and in each instance of - the <cmdname>impalad</cmdname> daemon. - </p> - </li> - - <li> - <p> - In cases where new files are added to an existing partition, issue a <codeph>REFRESH</codeph> statement - for the table, followed by a <codeph>DROP INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL - STATS</codeph> sequence for the changed partition. - </p> - </li> - - <li> - <p> - The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single partition at a time. To - remove statistics (whether incremental or not) from all partitions of a table, issue a <codeph>DROP - STATS</codeph> statement with no <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses. - </p> - </li> - </ul> + <concept id="concept_y2f_nfl_mdb"> - <p> - The following considerations apply to incremental statistics when the structure of an existing table is - changed (known as <term>schema evolution</term>): - </p> + <title>COMPUTE STATS</title> - <ul> - <li> - <p> - If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing statistics remain - valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any partitions. - </p> - </li> - - <li> - <p> - If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans all partitions and - fills in the appropriate column-level values the next time you run <codeph>COMPUTE INCREMENTAL - STATS</codeph>. - </p> - </li> - - <li> - <p> - If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a column, Impala - rescans all partitions and fills in the appropriate column-level values the next time you run - <codeph>COMPUTE INCREMENTAL STATS</codeph>. - </p> - </li> - - <li> - <p> - If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a table, the existing - statistics remain valid and a subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any - partitions. - </p> - </li> - </ul> + <conbody> - <p> - See <xref href="impala_compute_stats.xml#compute_stats"/> and - <xref href="impala_drop_stats.xml#drop_stats"/> for syntax details. - </p> - </conbody> - </concept> + <p> + The <codeph>COMPUTE STATS</codeph> command collects and sets the table-level + and partition-level row counts as well as all column statistics for a given + table. The collection process is CPU-intensive and can take a long time to + complete for very large tables. + </p> + <p> + To speed up <codeph>COMPUTE STATS</codeph> consider the following options + which can be combined. + <ul> + <li> + <p> + Limit the number of columns for which statistics are collected to increase + the efficiency of COMPUTE STATS. Queries benefit from statistics for those + columns involved in filters, join conditions, group by or partition by + clauses. Other columns are good candidates to exclude from COMPUTE STATS. + This feature is available since Impala 2.12. + </p> + </li> + <li> + <p> + Set the MT_DOP query option to use more threads within each participating + impalad to compute the statistics faster - but not more efficiently. Note + that computing stats on a large table with a high MT_DOP value can + negatively affect other queries running at the same time if the + COMPUTE STATS claims most CPU cycles. + This feature is available since Impala 2.8. + </p> + </li> + </ul> + </p> - <concept id="perf_stats_computing"> - <title>Generating Table and Column Statistics (COMPUTE STATS Statement)</title> - <conbody> + <p> + <codeph>COMPUTE STATS</codeph> is intended to be run periodically, + e.g. weekly, or on-demand when the contents of a table have changed + significantly. Due to the high resource utilization and long repsonse + time of t<codeph>COMPUTE STATS</codeph>, it is most practical to run it + in a scheduled maintnance window where the Impala cluster is idle + enough to accommodate the expensive operation. The degree of change that + qualifies as <q>significant</q> depends on the query workload, but typically, + if 30% of the rows have changed then it is recommended to recompute + statistics. + </p> - <p> - To gather table statistics after loading data into a table or partition, you typically use the - <codeph>COMPUTE STATS</codeph> statement. This statement is available in Impala 1.2.2 and higher. - It gathers both table statistics and column statistics for all columns in a single operation. - For large partitioned tables, where you frequently need to update statistics and it is impractical - to scan the entire table each time, use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph>, - which is available in <keyword keyref="impala21_full"/> and higher. - </p> + <p> + If you reload a complete new set of data for a table, but the number of rows and + number of distinct values for each column is relatively unchanged from before, you + do not need to recompute stats for the table. + </p> - <p> - If you use Hive as part of your ETL workflow, you can also use Hive to generate table and - column statistics. You might need to do extra configuration within Hive itself, the metastore, - or even set up a separate database to hold Hive-generated statistics. You might need to run - multiple statements to generate all the necessary statistics. Therefore, prefer the - Impala <codeph>COMPUTE STATS</codeph> statement where that technique is practical. - For details about collecting statistics through Hive, see - <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the Hive wiki</xref>. - </p> + </conbody> - <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/> + </concept> -<!-- Commenting out over-detailed Hive instructions as part of stats reorg. - <li> - Issue an <codeph>ANALYZE TABLE</codeph> statement in Hive, for the entire table or a specific partition. -<codeblock>ANALYZE TABLE <varname>tablename</varname> [PARTITION(<varname>partcol1</varname>[=<varname>val1</varname>], <varname>partcol2</varname>[=<varname>val2</varname>], ...)] COMPUTE STATISTICS [NOSCAN];</codeblock> - For example, to gather statistics for a non-partitioned table: -<codeblock>ANALYZE TABLE customer COMPUTE STATISTICS;</codeblock> - To gather statistics for a <codeph>store</codeph> table partitioned by state and city, and both of its - partitions: -<codeblock>ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;</codeblock> - To gather statistics for the <codeph>store</codeph> table and only the partitions for California: -<codeblock>ANALYZE TABLE store PARTITION(s_state='CA', s_county) COMPUTE STATISTICS;</codeblock> - </li> - - <li> - Load the data through the <codeph>INSERT OVERWRITE</codeph> statement in Hive, while the Hive setting - <b>hive.stats.autogather</b> is enabled. - </li> - - </ul> ---> + <concept id="concept_bmk_pfl_mdb"> - <p rev="2.0.1"> -<!-- Additional info as a result of IMPALA-1420 --> -<!-- Keep checking if https://issues.apache.org/jira/browse/HIVE-8648 ever gets fixed and when that fix makes it into an Impala release. --> - For your very largest tables, you might find that <codeph>COMPUTE STATS</codeph> or even <codeph>COMPUTE INCREMENTAL STATS</codeph> - take so long to scan the data that it is impractical to use them regularly. In such a case, after adding a partition or inserting new data, - you can update just the number of rows property through an <codeph>ALTER TABLE</codeph> statement. - See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details. - Because the column statistics might be left in a stale state, do not use this technique as a replacement - for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a - low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. - </p> + <title>COMPUTE INCREMENTAL STATS</title> + + <conbody> + + <p> + In Impala 2.1.0 and higher, you can use the + <codeph>COMPUTE INCREMENTAL STATS</codeph> and + <codeph>DROP INCREMENTAL STATS</codeph> commands. + The <codeph>INCREMENTAL</codeph> clauses work with incremental statistics, + a specialized feature for partitioned tables. + </p> + + <p> + When you compute incremental statistics for a partitioned table, by default Impala only + processes those partitions that do not yet have incremental statistics. By processing + only newly added partitions, you can keep statistics up to date without incurring the + overhead of reprocessing the entire table each time. + </p> + + <p> + You can also compute or drop statistics for a specified subset of partitions by + including a <codeph>PARTITION</codeph> clause in the + <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph> + statement. + </p> + + <note type="important"> + <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/> + <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/> + </note> + + <p> + The metadata for incremental statistics is handled differently from the original style + of statistics: + </p> + + <ul> + <li> + <p> + Issuing a <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition + clause causes Impala to compute incremental stats for all partitions that + do not already have incremental stats. This might be the entire table when + running the command for the first time, but subsequent runs should only + update new partitions. You can force updating a partition that already has + incremental stats by issuing a <codeph>DROP INCREMENTAL STATS</codeph> + before running <codeph>COMPUTE INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> + statements now include an additional column showing whether incremental statistics + are available for each column. A partition could already be covered by the original + type of statistics based on a prior <codeph>COMPUTE STATS</codeph> statement, as + indicated by a value other than <codeph>-1</codeph> under the <codeph>#Rows</codeph> + column. Impala query planning uses either kind of statistics when available. + </p> + </li> + + <li> + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE + STATS</codeph> for the same volume of data. Therefore it is most suitable for tables + with large data volume where new partitions are added frequently, making it + impractical to run a full <codeph>COMPUTE STATS</codeph> operation for each new + partition. For unpartitioned tables, or partitioned tables that are loaded once and + not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> + syntax. + </p> + </li> + + <li> + <p> + <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the + <cmdname>catalogd</cmdname> process, proportional to the number of partitions and + number of columns in the applicable table. The memory overhead is approximately 400 + bytes for each column in each partition. This memory is reserved in the + <cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and + in each instance of the <cmdname>impalad</cmdname> daemon. + </p> + </li> + + <li> + <p> + In cases where new files are added to an existing partition, issue a + <codeph>REFRESH</codeph> statement for the table, followed by a <codeph>DROP + INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL STATS</codeph> sequence + for the changed partition. + </p> + </li> + + <li> + <p> + The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single + partition at a time. To remove statistics (whether incremental or not) from all + partitions of a table, issue a <codeph>DROP STATS</codeph> statement with no + <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses. + </p> + </li> + </ul> + + <p> + The following considerations apply to incremental statistics when the structure of an + existing table is changed (known as <term>schema evolution</term>): + </p> + + <ul> + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing + statistics remain valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not + rescan any partitions. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans + all partitions and fills in the appropriate column-level values the next time you + run <codeph>COMPUTE INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a + column, Impala rescans all partitions and fills in the appropriate column-level + values the next time you run <codeph>COMPUTE INCREMENTAL STATS</codeph>. + </p> + </li> + + <li> + <p> + If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a + table, the existing statistics remain valid and a subsequent <codeph>COMPUTE + INCREMENTAL STATS</codeph> does not rescan any partitions. + </p> + </li> + </ul> + + <p> + See <xref href="impala_compute_stats.xml#compute_stats"/> and + <xref + href="impala_drop_stats.xml#drop_stats"/> for syntax details. + </p> + + </conbody> + + </concept> - </conbody> </concept> <concept rev="2.1.0" id="perf_stats_checking"> @@ -565,17 +579,18 @@ show column stats year_month_day; <conbody> <p> - You can check whether a specific table has statistics using the <codeph>SHOW TABLE STATS</codeph> statement - (for any table) or the <codeph>SHOW PARTITIONS</codeph> statement (for a partitioned table). Both - statements display the same information. If a table or a partition does not have any statistics, the - <codeph>#Rows</codeph> field contains <codeph>-1</codeph>. Once you compute statistics for the table or - partition, the <codeph>#Rows</codeph> field changes to an accurate value. + You can check whether a specific table has statistics using the <codeph>SHOW TABLE + STATS</codeph> statement (for any table) or the <codeph>SHOW PARTITIONS</codeph> + statement (for a partitioned table). Both statements display the same information. If a + table or a partition does not have any statistics, the <codeph>#Rows</codeph> field + contains <codeph>-1</codeph>. Once you compute statistics for the table or partition, + the <codeph>#Rows</codeph> field changes to an accurate value. </p> <p> - The following example shows a table that initially does not have any statistics. The <codeph>SHOW TABLE - STATS</codeph> statement displays different values for <codeph>#Rows</codeph> before and after the - <codeph>COMPUTE STATS</codeph> operation. + The following example shows a table that initially does not have any statistics. The + <codeph>SHOW TABLE STATS</codeph> statement displays different values for + <codeph>#Rows</codeph> before and after the <codeph>COMPUTE STATS</codeph> operation. </p> <codeblock>[localhost:21000] > create table no_stats (x int); @@ -601,9 +616,10 @@ show column stats year_month_day; <p> The following example shows a similar progression with a partitioned table. Initially, - <codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> operation, - <codeph>#Rows</codeph> changes to an accurate value. Any newly added partition starts with no statistics, - meaning that you must collect statistics after adding a new partition. + <codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> + operation, <codeph>#Rows</codeph> changes to an accurate value. Any newly added + partition starts with no statistics, meaning that you must collect statistics after + adding a new partition. </p> <codeblock>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint); @@ -638,20 +654,22 @@ show column stats year_month_day; </codeblock> <note> - Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates statistics for all - partitions in a table, if you expect to frequently add new partitions, use the <codeph>COMPUTE INCREMENTAL - STATS</codeph> syntax instead, which lets you compute stats for a single specified partition, or only for - those partitions that do not already have incremental stats. + Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates + statistics for all partitions in a table, if you expect to frequently add new + partitions, use the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax instead, which + lets you compute stats for a single specified partition, or only for those partitions + that do not already have incremental stats. </note> <p> - If checking each individual table is impractical, due to a large number of tables or views that hide the - underlying base tables, you can also check for missing statistics for a particular query. Use the - <codeph>EXPLAIN</codeph> statement to preview query efficiency before actually running the query. Use the - query profile output available through the <codeph>PROFILE</codeph> command in - <cmdname>impala-shell</cmdname> or the web UI to verify query execution and timing after running the query. - Both the <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning if any - tables or partitions involved in the query do not have statistics. + If checking each individual table is impractical, due to a large number of tables or + views that hide the underlying base tables, you can also check for missing statistics + for a particular query. Use the <codeph>EXPLAIN</codeph> statement to preview query + efficiency before actually running the query. Use the query profile output available + through the <codeph>PROFILE</codeph> command in <cmdname>impala-shell</cmdname> or the + web UI to verify query execution and timing after running the query. Both the + <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning + if any tables or partitions involved in the query do not have statistics. </p> <codeblock>[localhost:21000] > create table no_stats (x int); @@ -677,10 +695,11 @@ show column stats year_month_day; </codeblock> <p> - Because Impala uses the <term>partition pruning</term> technique when possible to only evaluate certain - partitions, if you have a partitioned table with statistics for some partitions and not others, whether or - not the <codeph>EXPLAIN</codeph> statement shows the warning depends on the actual partitions used by the - query. For example, you might see warnings or not for different queries against the same table: + Because Impala uses the <term>partition pruning</term> technique when possible to only + evaluate certain partitions, if you have a partitioned table with statistics for some + partitions and not others, whether or not the <codeph>EXPLAIN</codeph> statement shows + the warning depends on the actual partitions used by the query. For example, you might + see warnings or not for different queries against the same table: </p> <codeblock>-- No warning because all the partitions for the year 2012 have stats. @@ -692,89 +711,36 @@ EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009; </codeblock> <p> - To confirm if any partitions at all in the table are missing statistics, you might explain a query that - scans the entire table, such as <codeph>SELECT COUNT(*) FROM <varname>table_name</varname></codeph>. - </p> - </conbody> - </concept> - - <concept rev="2.1.0" id="perf_stats_collecting"> - - <title>Keeping Statistics Up to Date</title> - - <conbody> - - <p> - When the contents of a table or partition change significantly, recompute the stats for the relevant table - or partition. The degree of change that qualifies as <q>significant</q> varies, depending on the absolute - and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is - worthwhile to recompute stats, because the differences in number of rows and number of distinct values - might cause Impala to choose a different join order when that table is used in join queries. This guideline - is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a - greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger - table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of - outdated statistics. - </p> - - <p> - If you reload a complete new set of data for a table, but the number of rows and number of distinct values - for each column is relatively unchanged from before, you do not need to recompute stats for the table. + To confirm if any partitions at all in the table are missing statistics, you might + explain a query that scans the entire table, such as <codeph>SELECT COUNT(*) FROM + <varname>table_name</varname></codeph>. </p> - <p> - If the statistics for a table are out of date, and the table's large size makes it impractical to recompute - new stats immediately, you can use the <codeph>DROP STATS</codeph> statement to remove the obsolete - statistics, making it easier to identify tables that need a new <codeph>COMPUTE STATS</codeph> operation. - </p> - - <p> - For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and - higher, as explained in <xref href="impala_perf_stats.xml#perf_stats_incremental"/>. If you add a new - partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the - data for that one new partition. - </p> </conbody> - </concept> - -<!-- Might deserve its own conceptual topic at some point. --> - - <concept audience="hidden" rev="1.2.2" id="perf_stats_joins"> - <title>How Statistics Are Used in Join Queries</title> - - <conbody> - - <p></p> - </conbody> </concept> -<!-- Might deserve its own conceptual topic at some point. --> + <concept id="concept_s3c_4gl_mdb"> - <concept audience="hidden" rev="1.2.2" id="perf_stats_inserts"> + <title>Manually Setting Table and Column Statistics with ALTER TABLE</title> - <title>How Statistics Are Used in INSERT Operations</title> + <concept id="concept_wpt_pgl_mdb"> - <conbody> + <title>Setting Table Statistics</title> - <p conref="../shared/impala_common.xml#common/insert_hints"/> - </conbody> - </concept> - - <concept rev="1.2.2" id="perf_table_stats_manual"> - - <title>Setting the NUMROWS Value Manually through ALTER TABLE</title> - - <conbody> + <conbody> - <p> - The most crucial piece of data in all the statistics is the number of rows in the table (for an - unpartitioned or partitioned table) and for each partition (for a partitioned table). The <codeph>COMPUTE STATS</codeph> - statement always gathers statistics about all columns, as well as overall table statistics. If it is not - practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> - operation after adding a partition or inserting data, or if you can see that Impala would produce a more - efficient plan if the number of rows was different, you can manually set the number of rows through an - <codeph>ALTER TABLE</codeph> statement: - </p> + <p> + The most crucial piece of data in all the statistics is the number of rows in the + table (for an unpartitioned or partitioned table) and for each partition (for a + partitioned table). The <codeph>COMPUTE STATS</codeph> statement always gathers + statistics about all columns, as well as overall table statistics. If it is not + practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL + STATS</codeph> operation after adding a partition or inserting data, or if you can see + that Impala would produce a more efficient plan if the number of rows was different, + you can manually set the number of rows through an <codeph>ALTER TABLE</codeph> + statement: + </p> <codeblock> -- Set total number of rows. Applies to both unpartitioned and partitioned tables. @@ -785,49 +751,66 @@ alter table <varname>table_name</varname> set tblproperties('numRows'='<varname> alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true'); </codeblock> - <p> - This statement avoids re-scanning any data files. (The requirement to include the <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a - result of the issue <xref href="https://issues.apache.org/jira/browse/HIVE-8648" scope="external" format="html">HIVE-8648</xref> - for the Hive metastore.) - </p> + <p> + This statement avoids re-scanning any data files. (The requirement to include the + <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a + result of the issue + <xref + href="https://issues.apache.org/jira/browse/HIVE-8648" + scope="external" format="html">HIVE-8648</xref> + for the Hive metastore.) + </p> <codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/> - <p> - For a partitioned table, update both the per-partition number of rows and the number of rows for the whole - table: - </p> + <p> + For a partitioned table, update both the per-partition number of rows and the number + of rows for the whole table: + </p> <codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/> - <p> - In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE INCREMENTAL STATS</codeph> - for a partitioned table, should be fast and convenient enough that this technique is only useful for the very - largest partitioned tables. - <!-- + <p> + In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE + INCREMENTAL STATS</codeph> for a partitioned table, should be fast and convenient + enough that this technique is only useful for the very largest partitioned tables. +<!-- It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher or lower to produce the ideal join order. --> - <!-- Following wording is duplicated from earlier. Consider conref'ing. --> - Because the column statistics might be left in a stale state, do not use this technique as a replacement - for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a - low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. - </p> - </conbody> - </concept> +<!-- Following wording is duplicated from earlier. Consider conref'ing. --> + Because the column statistics might be left in a stale state, do not use this + technique as a replacement for <codeph>COMPUTE STATS</codeph>. Only use this technique + if all other means of collecting statistics are impractical, or as a low-overhead + operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or + <codeph>COMPUTE INCREMENTAL STATS</codeph> operations. + </p> + + </conbody> + + </concept> + + <concept id="concept_asb_vgl_mdb"> + + <title>Setting Column Statistics</title> + + <conbody> + + <p> + In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET + COLUMN STATS</codeph> clause of <codeph>ALTER TABLE</codeph> to manually set or change + column statistics. Only use this technique in cases where it is impractical to run + <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> + frequently enough to keep up with data changes for a huge table. + </p> + + <p conref="../shared/impala_common.xml#common/set_column_stats_example" + /> + + </conbody> + + </concept> - <concept id="perf_column_stats_manual" rev="2.6.0 IMPALA-3369"> - <title>Setting Column Stats Manually through ALTER TABLE</title> - <conbody> - <p> - In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET COLUMN STATS</codeph> - clause of <codeph>ALTER TABLE</codeph> to manually set or change column statistics. - Only use this technique in cases where it is impractical to run - <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> - frequently enough to keep up with data changes for a huge table. - </p> - <p conref="../shared/impala_common.xml#common/set_column_stats_example"/> - </conbody> </concept> <concept rev="1.2.2" id="perf_stats_examples"> @@ -837,20 +820,23 @@ alter table <varname>table_name</varname> partition (<varname>keycol1</varname>= <conbody> <p> - The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, <codeph>SHOW COLUMN - STATS</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>SELECT</codeph> and <codeph>INSERT</codeph> - statements to illustrate various aspects of how Impala uses statistics to help optimize queries. + The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, + <codeph>SHOW COLUMN STATS</codeph>, <codeph>ALTER TABLE</codeph>, and + <codeph>SELECT</codeph> and <codeph>INSERT</codeph> statements to illustrate various + aspects of how Impala uses statistics to help optimize queries. </p> <p> - This example shows table and column statistics for the <codeph>STORE</codeph> column used in the - <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS benchmarks for decision - support</xref> systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are - gathered by a <codeph>COMPUTE STATS</codeph> statement, most of the numeric fields show placeholder values - of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily - countable or deducible at the physical level, such as the number of files, total data size of the files, - and the maximum and average sizes for data types that have a constant size such as <codeph>INT</codeph>, - <codeph>FLOAT</codeph>, and <codeph>TIMESTAMP</codeph>. + This example shows table and column statistics for the <codeph>STORE</codeph> column + used in the <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS + benchmarks for decision support</xref> systems. It is a tiny table holding data for 12 + stores. Initially, before any statistics are gathered by a <codeph>COMPUTE + STATS</codeph> statement, most of the numeric fields show placeholder values of -1, + indicating that the figures are unknown. The figures that are filled in are values that + are easily countable or deducible at the physical level, such as the number of files, + total data size of the files, and the maximum and average sizes for data types that have + a constant size such as <codeph>INT</codeph>, <codeph>FLOAT</codeph>, and + <codeph>TIMESTAMP</codeph>. </p> <codeblock>[localhost:21000] > show table stats store; @@ -897,11 +883,13 @@ Returned 1 row(s) in 0.03s Returned 29 row(s) in 0.04s</codeblock> <p> - With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to specify each - column for which to gather statistics. The Impala <codeph>COMPUTE STATS</codeph> statement automatically - gathers statistics for all columns, because it reads through the entire table relatively quickly and can - efficiently compute the values for all the columns. This example shows how after running the - <codeph>COMPUTE STATS</codeph> statement, statistics are filled in for both the table and all its columns: + With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to + specify each column for which to gather statistics. The Impala <codeph>COMPUTE + STATS</codeph> statement automatically gathers statistics for all columns, because it + reads through the entire table relatively quickly and can efficiently compute the values + for all the columns. This example shows how after running the <codeph>COMPUTE + STATS</codeph> statement, statistics are filled in for both the table and all its + columns: </p> <codeblock>[localhost:21000] > compute stats store; @@ -955,11 +943,12 @@ Returned 1 row(s) in 0.02s Returned 29 row(s) in 0.04s</codeblock> <p> - The following example shows how statistics are represented for a partitioned table. In this case, we have - set up a table to hold the world's most trivial census data, a single <codeph>STRING</codeph> field, - partitioned by a <codeph>YEAR</codeph> column. The table statistics include a separate entry for each - partition, plus final totals for the numeric fields. The column statistics include some easily deducible - facts for the partitioning column, such as the number of distinct values (the number of partition + The following example shows how statistics are represented for a partitioned table. In + this case, we have set up a table to hold the world's most trivial census data, a single + <codeph>STRING</codeph> field, partitioned by a <codeph>YEAR</codeph> column. The table + statistics include a separate entry for each partition, plus final totals for the + numeric fields. The column statistics include some easily deducible facts for the + partitioning column, such as the number of distinct values (the number of partition subdirectories). <!-- and the number of <codeph>NULL</codeph> values (none in this case). --> </p> @@ -996,8 +985,8 @@ Returned 8 row(s) in 0.02s Returned 2 row(s) in 0.02s</codeblock> <p> - The following example shows how the statistics are filled in by a <codeph>COMPUTE STATS</codeph> statement - in Impala. + The following example shows how the statistics are filled in by a <codeph>COMPUTE + STATS</codeph> statement in Impala. </p> <codeblock>[localhost:21000] > compute stats census; @@ -1031,13 +1020,17 @@ Returned 8 row(s) in 0.02s Returned 2 row(s) in 0.02s</codeblock> <p rev="1.4.0"> - For examples showing how some queries work differently when statistics are available, see - <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala executes a query - differently in each case by observing the <codeph>EXPLAIN</codeph> output before and after collecting - statistics. Measure the before and after query times, and examine the throughput numbers in before and - after <codeph>SUMMARY</codeph> or <codeph>PROFILE</codeph> output, to verify how much the improved plan - speeds up performance. + For examples showing how some queries work differently when statistics are available, + see <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala + executes a query differently in each case by observing the <codeph>EXPLAIN</codeph> + output before and after collecting statistics. Measure the before and after query times, + and examine the throughput numbers in before and after <codeph>SUMMARY</codeph> or + <codeph>PROFILE</codeph> output, to verify how much the improved plan speeds up + performance. </p> + </conbody> + </concept> + </concept>
