Repository: incubator-impala
Updated Branches:
  refs/heads/master 36cd610d6 -> e278ed228


[DOCS] Tighten up advice about first COMPUTE INCREMENTAL STATS

Explain how doing COMPUTE INCREMENTAL STATS for the first time
starts over and discards any previous stats from COMPUTE STATS.
As a consequence, moved some wording and examples into
impala_common.xml so that content could be used in multiple
places. Also made a new subtopic on the "Partitioning" page
because I saw COMPUTE INCREMENTAL STATS wasn't mentioned there.

Change-Id: Ia53a6518ce5541e5c9a2cd896856ce042a599b03
Reviewed-on: http://gerrit.cloudera.org:8080/7999
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e278ed22
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e278ed22
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e278ed22

Branch: refs/heads/master
Commit: e278ed228b9e15bcf2ba89dab6b002eb8d71f892
Parents: 36cd610
Author: John Russell <[email protected]>
Authored: Fri Sep 1 15:15:30 2017 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Fri Oct 6 23:33:15 2017 +0000

----------------------------------------------------------------------
 docs/shared/impala_common.xml        | 127 ++++++++++++++++++++++++++++++
 docs/topics/impala_compute_stats.xml | 105 ++----------------------
 docs/topics/impala_partitioning.xml  |  29 +++++++
 docs/topics/impala_perf_stats.xml    |  21 +++--
 4 files changed, 172 insertions(+), 110 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e278ed22/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index f31bdf0..18a93de 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -1337,6 +1337,31 @@ drop database temp;
         other administrative contexts. See <xref keyref="sg_redaction"/> for details.
       </p>
 
+      <p id="cs_or_cis">
+        For a particular table, use either <codeph>COMPUTE STATS</codeph> or
+        <codeph>COMPUTE INCREMENTAL STATS</codeph>, but never combine the two or alternate
+        between them. If you switch from <codeph>COMPUTE STATS</codeph> to
+        <codeph>COMPUTE INCREMENTAL STATS</codeph> during the lifetime of a table, or vice
+        versa, drop all statistics (by running both <codeph>DROP STATS</codeph> and
+        <codeph>DROP INCREMENTAL STATS</codeph>) before making the switch.
+      </p>
+
+      <p id="incremental_stats_after_full">
+        When you run <codeph>COMPUTE INCREMENTAL STATS</codeph> on a table for the first time,
+        the statistics are computed again from scratch regardless of whether the table already
+        has statistics. Therefore, expect a one-time resource-intensive operation
+        for scanning the entire table when running <codeph>COMPUTE INCREMENTAL STATS</codeph>
+        for the first time on a given table.
+      </p>
+
+      <p id="incremental_stats_caveats">
+        For a table with a huge number of partitions and many columns, the approximately 400 bytes
+        of metadata per column per partition can add up to significant memory overhead, as it must
+        be cached on the <cmdname>catalogd</cmdname> host and on every <cmdname>impalad</cmdname> host
+        that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB,
+        you might experience service downtime.
+      </p>
+
       <p id="incremental_partition_spec">
         The <codeph>PARTITION</codeph> clause is only allowed in combination with the
         <codeph>INCREMENTAL</codeph> clause. It is optional for <codeph>COMPUTE INCREMENTAL STATS</codeph>, and required for <codeph>DROP
@@ -1346,6 +1371,108 @@ drop database temp;
         specification, and specify constant values for all the partition key columns.
       </p>
 
+<codeblock id="compute_stats_walkthrough">-- Initially the table has no incremental stats, as indicated
+-- 'false' under Incremental stats.
+show table stats item_partitioned;
++-------------+-------+--------+----------+--------------+---------+------------------
+| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
++-------------+-------+--------+----------+--------------+---------+------------------
+| Books       | -1    | 1      | 223.74KB | NOT CACHED   | PARQUET | false
+| Children    | -1    | 1      | 230.05KB | NOT CACHED   | PARQUET | false
+| Electronics | -1    | 1      | 232.67KB | NOT CACHED   | PARQUET | false
+| Home        | -1    | 1      | 232.56KB | NOT CACHED   | PARQUET | false
+| Jewelry     | -1    | 1      | 223.72KB | NOT CACHED   | PARQUET | false
+| Men         | -1    | 1      | 231.25KB | NOT CACHED   | PARQUET | false
+| Music       | -1    | 1      | 237.90KB | NOT CACHED   | PARQUET | false
+| Shoes       | -1    | 1      | 234.90KB | NOT CACHED   | PARQUET | false
+| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
+| Women       | -1    | 1      | 226.27KB | NOT CACHED   | PARQUET | false
+| Total       | -1    | 10     | 2.25MB   | 0B           |         |
++-------------+-------+--------+----------+--------------+---------+------------------
+
+-- After the first COMPUTE INCREMENTAL STATS,
+-- all partitions have stats. The first
+-- COMPUTE INCREMENTAL STATS scans the whole
+-- table, discarding any previous stats from
+-- a traditional COMPUTE STATS statement.
+compute incremental stats item_partitioned;
++-------------------------------------------+
+| summary                                   |
++-------------------------------------------+
+| Updated 10 partition(s) and 21 column(s). |
++-------------------------------------------+
+show table stats item_partitioned;
++-------------+-------+--------+----------+--------------+---------+------------------
+| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
++-------------+-------+--------+----------+--------------+---------+------------------
+| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
+| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
+| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
+| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
+| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
+| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
+| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
+| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
+| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
+| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
+| Total       | 17957 | 10     | 2.25MB   | 0B           |         |
++-------------+-------+--------+----------+--------------+---------+------------------
+
+-- Add a new partition...
+alter table item_partitioned add partition (i_category='Camping');
+-- Add or replace files in HDFS outside of Impala,
+-- rendering the stats for a partition obsolete.
+!import_data_into_sports_partition.sh
+refresh item_partitioned;
+drop incremental stats item_partitioned partition (i_category='Sports');
+-- Now some partitions have incremental stats
+-- and some do not.
+show table stats item_partitioned;
++-------------+-------+--------+----------+--------------+---------+------------------
+| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
++-------------+-------+--------+----------+--------------+---------+------------------
+| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
+| Camping     | -1    | 1      | 408.02KB | NOT CACHED   | PARQUET | false
+| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
+| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
+| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
+| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
+| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
+| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
+| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
+| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
+| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
+| Total       | 17957 | 11     | 2.65MB   | 0B           |         |
++-------------+-------+--------+----------+--------------+---------+------------------
+
+-- After another COMPUTE INCREMENTAL STATS,
+-- all partitions have incremental stats, and only the 2
+-- partitions without incremental stats were scanned.
+compute incremental stats item_partitioned;
++------------------------------------------+
+| summary                                  |
++------------------------------------------+
+| Updated 2 partition(s) and 21 column(s). |
++------------------------------------------+
+show table stats item_partitioned;
++-------------+-------+--------+----------+--------------+---------+------------------
+| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
++-------------+-------+--------+----------+--------------+---------+------------------
+| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
+| Camping     | 5328  | 1      | 408.02KB | NOT CACHED   | PARQUET | true
+| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
+| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
+| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
+| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
+| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
+| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
+| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
+| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
+| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
+| Total       | 17957 | 11     | 2.65MB   | 0B           |         |
++-------------+-------+--------+----------+--------------+---------+------------------
+</codeblock>
+
       <p id="udf_persistence_restriction" rev="2.5.0 IMPALA-1748">
         In <keyword keyref="impala25_full"/> and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database.
         Java UDFs are also persisted, if they were created with the new <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs,

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e278ed22/docs/topics/impala_compute_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_compute_stats.xml b/docs/topics/impala_compute_stats.xml
index b7489c5..98694f8 100644
--- a/docs/topics/impala_compute_stats.xml
+++ b/docs/topics/impala_compute_stats.xml
@@ -80,6 +80,12 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn
       for full usage details.
     </p>
 
+    <note type="important">
+      <p conref="../shared/impala_common.xml#common/cs_or_cis"/>
+      <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
+      <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
+    </note>
+
     <p>
       <codeph>COMPUTE INCREMENTAL STATS</codeph> only applies to partitioned tables. If you use the
       <codeph>INCREMENTAL</codeph> clause for an unpartitioned table, Impala automatically uses the original
@@ -340,104 +346,7 @@ Returned 2 row(s) in 0.01s</codeblock>
       changed partitions, without rescanning the entire table.
     </p>
 
-<codeblock>-- Initially the table has no incremental stats, as indicated
--- by -1 under #Rows and false under Incremental stats.
-show table stats item_partitioned;
-+-------------+-------+--------+----------+--------------+---------+------------------
-| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
-+-------------+-------+--------+----------+--------------+---------+------------------
-| Books       | -1    | 1      | 223.74KB | NOT CACHED   | PARQUET | false
-| Children    | -1    | 1      | 230.05KB | NOT CACHED   | PARQUET | false
-| Electronics | -1    | 1      | 232.67KB | NOT CACHED   | PARQUET | false
-| Home        | -1    | 1      | 232.56KB | NOT CACHED   | PARQUET | false
-| Jewelry     | -1    | 1      | 223.72KB | NOT CACHED   | PARQUET | false
-| Men         | -1    | 1      | 231.25KB | NOT CACHED   | PARQUET | false
-| Music       | -1    | 1      | 237.90KB | NOT CACHED   | PARQUET | false
-| Shoes       | -1    | 1      | 234.90KB | NOT CACHED   | PARQUET | false
-| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
-| Women       | -1    | 1      | 226.27KB | NOT CACHED   | PARQUET | false
-| Total       | -1    | 10     | 2.25MB   | 0B           |         |
-+-------------+-------+--------+----------+--------------+---------+------------------
-
--- After the first COMPUTE INCREMENTAL STATS,
--- all partitions have stats.
-compute incremental stats item_partitioned;
-+-------------------------------------------+
-| summary                                   |
-+-------------------------------------------+
-| Updated 10 partition(s) and 21 column(s). |
-+-------------------------------------------+
-show table stats item_partitioned;
-+-------------+-------+--------+----------+--------------+---------+------------------
-| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
-+-------------+-------+--------+----------+--------------+---------+------------------
-| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
-| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
-| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
-| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
-| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
-| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
-| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
-| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
-| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
-| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
-| Total       | 17957 | 10     | 2.25MB   | 0B           |         |
-+-------------+-------+--------+----------+--------------+---------+------------------
-
--- Add a new partition...
-alter table item_partitioned add partition (i_category='Camping');
--- Add or replace files in HDFS outside of Impala,
--- rendering the stats for a partition obsolete.
-!import_data_into_sports_partition.sh
-refresh item_partitioned;
-drop incremental stats item_partitioned partition (i_category='Sports');
--- Now some partitions have incremental stats
--- and some do not.
-show table stats item_partitioned;
-+-------------+-------+--------+----------+--------------+---------+------------------
-| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
-+-------------+-------+--------+----------+--------------+---------+------------------
-| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
-| Camping     | -1    | 1      | 408.02KB | NOT CACHED   | PARQUET | false
-| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
-| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
-| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
-| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
-| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
-| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
-| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
-| Sports      | -1    | 1      | 227.97KB | NOT CACHED   | PARQUET | false
-| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
-| Total       | 17957 | 11     | 2.65MB   | 0B           |         |
-+-------------+-------+--------+----------+--------------+---------+------------------
-
--- After another COMPUTE INCREMENTAL STATS,
--- all partitions have incremental stats, and only the 2
--- partitions without incremental stats were scanned.
-compute incremental stats item_partitioned;
-+------------------------------------------+
-| summary                                  |
-+------------------------------------------+
-| Updated 2 partition(s) and 21 column(s). |
-+------------------------------------------+
-show table stats item_partitioned;
-+-------------+-------+--------+----------+--------------+---------+------------------
-| i_category  | #Rows | #Files | Size     | Bytes Cached | Format  | Incremental stats
-+-------------+-------+--------+----------+--------------+---------+------------------
-| Books       | 1733  | 1      | 223.74KB | NOT CACHED   | PARQUET | true
-| Camping     | 5328  | 1      | 408.02KB | NOT CACHED   | PARQUET | true
-| Children    | 1786  | 1      | 230.05KB | NOT CACHED   | PARQUET | true
-| Electronics | 1812  | 1      | 232.67KB | NOT CACHED   | PARQUET | true
-| Home        | 1807  | 1      | 232.56KB | NOT CACHED   | PARQUET | true
-| Jewelry     | 1740  | 1      | 223.72KB | NOT CACHED   | PARQUET | true
-| Men         | 1811  | 1      | 231.25KB | NOT CACHED   | PARQUET | true
-| Music       | 1860  | 1      | 237.90KB | NOT CACHED   | PARQUET | true
-| Shoes       | 1835  | 1      | 234.90KB | NOT CACHED   | PARQUET | true
-| Sports      | 1783  | 1      | 227.97KB | NOT CACHED   | PARQUET | true
-| Women       | 1790  | 1      | 226.27KB | NOT CACHED   | PARQUET | true
-| Total       | 17957 | 11     | 2.65MB   | 0B           |         |
-+-------------+-------+--------+----------+--------------+---------+------------------
-</codeblock>
+<codeblock conref="../shared/impala_common.xml#common/compute_stats_walkthrough"/>
 
     <p conref="../shared/impala_common.xml#common/file_format_blurb"/>
 

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e278ed22/docs/topics/impala_partitioning.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_partitioning.xml b/docs/topics/impala_partitioning.xml
index 1729530..c2e36ed 100644
--- a/docs/topics/impala_partitioning.xml
+++ b/docs/topics/impala_partitioning.xml
@@ -603,4 +603,33 @@ SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
 
   </concept>
 
+  <concept id="partition_stats">
+    <title>Keeping Statistics Up to Date for Partitioned Tables</title>
+    <conbody>
+
+      <p>
+        Because the <codeph>COMPUTE STATS</codeph> statement can be resource-intensive to run on a partitioned table
+        as new partitions are added, Impala includes a variation of this statement that allows computing statistics
+        on a per-partition basis such that stats can be incrementally updated when new partitions are added.
+      </p>
+
+      <note type="important">
+        <p conref="../shared/impala_common.xml#common/cs_or_cis"/>
+        <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
+        <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
+      </note>
+
+      <p rev="2.1.0">
+        The <codeph>COMPUTE INCREMENTAL STATS</codeph> variation computes statistics only for partitions that were
+        added or changed since the last <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, rather than the entire
+        table. It is typically used for tables where a full <codeph>COMPUTE STATS</codeph>
+        operation takes too long to be practical each time a partition is added or dropped. See
+        <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for full usage details.
+      </p>
+
+<codeblock conref="../shared/impala_common.xml#common/compute_stats_walkthrough"/>
+
+    </conbody>
+  </concept>
+
 </concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e278ed22/docs/topics/impala_perf_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml
index 86800f7..ac771be 100644
--- a/docs/topics/impala_perf_stats.xml
+++ b/docs/topics/impala_perf_stats.xml
@@ -354,15 +354,6 @@ show column stats year_month_day;
 +-----------+---------+------------------+--------+----------+-------------------+
 </codeblock>
 
-      <note>
-        Partitioned tables can grow so large that scanning the entire table, as the <codeph>COMPUTE STATS</codeph>
-        statement does, is impractical just to update the statistics for a new partition. The standard
-        <codeph>COMPUTE STATS</codeph> statement might take hours, or even days. That situation is where you switch
-        to using incremental statistics, a feature available in <keyword keyref="impala21_full"/> and higher.
-        See <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for details about this feature
-        and the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax.
-      </note>
-
       <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
     </conbody>
   </concept>
@@ -387,6 +378,12 @@ show column stats year_month_day;
         entire table each time.
       </p>
 
+      <note type="important">
+        <p conref="../shared/impala_common.xml#common/cs_or_cis"/>
+        <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
+        <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
+      </note>
+
       <p>
         You can also compute or drop statistics for a single partition by including a <codeph>PARTITION</codeph>
         clause in the <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph>
@@ -400,9 +397,9 @@ show column stats year_month_day;
       <ul>
         <li>
           <p>
-            If you have an existing partitioned table for which you have already computed statistics, issuing
-            <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala to rescan the
-            entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL
+            If you have a partitioned table for which you have already run a regular <codeph>COMPUTE STATS</codeph>
+            statement, issuing <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala
+            to rescan the entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL
             STATS</codeph> statements only scan any new partitions and any partitions where you performed
             <codeph>DROP INCREMENTAL STATS</codeph>.
           </p>
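
Reader's note (not part of the commit): the incremental_stats_caveats paragraph above quotes roughly 400 bytes of metadata per column per partition and a combined 2 GB threshold. A minimal sketch of that arithmetic follows. The function names are hypothetical, the 400-byte figure is the approximation from the paragraph, and reading "2 GB" as 2 * 1024**3 bytes is an assumption; this is a back-of-the-envelope estimate, not how Impala itself accounts for the memory.

```python
# Back-of-the-envelope estimate of incremental-stats metadata size.
# All names are illustrative; nothing here calls an Impala API.

BYTES_PER_COLUMN_PER_PARTITION = 400  # approximate figure quoted in the docs text
LIMIT_BYTES = 2 * 1024**3             # assumption: "2 GB" interpreted as 2 GiB


def incremental_stats_bytes(num_partitions: int, num_columns: int) -> int:
    """Estimated incremental-stats metadata for one table, in bytes."""
    return num_partitions * num_columns * BYTES_PER_COLUMN_PER_PARTITION


def total_bytes(tables) -> int:
    """Sum the estimate over (num_partitions, num_columns) pairs."""
    return sum(incremental_stats_bytes(p, c) for p, c in tables)


def exceeds_limit(tables) -> bool:
    """True if the combined estimate for all tables is above the threshold."""
    return total_bytes(tables) > LIMIT_BYTES


if __name__ == "__main__":
    # The walkthrough table: 11 partitions, 21 columns reported by COMPUTE INCREMENTAL STATS.
    print(incremental_stats_bytes(11, 21))
    # A hypothetical wide, heavily partitioned table.
    print(exceeds_limit([(50_000, 200)]))
```

Under these assumptions the walkthrough table needs well under 100 KB, while a hypothetical table with 50,000 partitions and 200 columns would by itself land above the threshold.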
